Parser Combinators in Elixir: Taming Semi-Structured Text

davidsulc

David Sulc

Posted on October 25, 2022

Parser Combinators in Elixir: Taming Semi-Structured Text

The need to manipulate strings comes up quite often, whether it's to validate user-provided values or transform text into structured data that can be used programmatically.

Most often, we'll reach for regular expressions to accomplish this task, but sometimes there's a better solution to the problem: parser combinators. In this two-part article, we'll explore how they work.

Before moving on, let's define what 'parsing' is:

A parser is a software component that takes input data (frequently text) and builds a data structure.

Source: Wikipedia.

In other words, when talking about transforming text into structured data, we essentially mean parsing.

Let's get into it!

Parser Combinators: The What, How, and Why

So what are parser combinators? Well, as their name implies, they are parsers that can be combined — to make bigger, better parsers.

In effect, parser combinators enable the writing of complex parsers from simpler ones. You're less likely to make mistakes, and the parsers are a lot easier to maintain due to better readability.

In contrast, regular expressions are often a poor choice for non-trivial parsing. Writing them is error-prone, they're challenging to maintain, and their format doesn't lend itself well to documentation/explanation.

To illustrate working with parser combinators, we'll work on transforming a string representation of a Dutch phone number into a canonical representation. This can prove useful, as a web form can capture a user's phone number to later be used programmatically, for example.

The "Parser" and "Combinator" Bits: NimbleParsec

We'll be using NimbleParsec as the resulting parsers are more efficient, but other libraries such as Combine also exist and work well.

As an aside, note that it's been pointed out that NimbleParsec is more akin to a parser generator than a parser combinator: that's true, but the difference won't matter when learning how to make use of parser generators.

Parser Examples in Elixir

As mentioned before, parsing is the act of taking unstructured data (typically strings) and transforming them into structured data.

We've got some examples right in Elixir.

We can turn a string containing a URI into a data structure via URI.new/1:

iex> URI.new("https://elixir-lang.org/")
{:ok, %URI{
  fragment: nil,
  host: "elixir-lang.org",
  path: "/",
  port: 443,
  query: nil,
  scheme: "https",
  userinfo: nil
}}

iex> URI.new("/invalid_greater_than_in_path/>")
{:error, ">"}
Enter fullscreen mode Exit fullscreen mode

Similarly, we can convert a string representation of an integer into an actual integer by using Integer.parse/2:

iex> Integer.parse("34")
{34, ""}

iex> Integer.parse("34.5")
{34, ".5"}

iex> Integer.parse("three")
:error
Enter fullscreen mode Exit fullscreen mode

Each of these examples will accept a string and either return a parsed result (possibly with "leftover" string content, as in the Integer.parse/2 example) or an error indication.

Combinator Examples in Elixir

What if we had lists indicating how many visitors came from URIs (formatted as "URI & count" pairs)? It would be nice to have a single function to do the heavy lifting for us:

iex> parse_visit_count("https://elixir-lang.org/ 123")
{:ok, %{
  referrer: %URI{
    fragment: nil,
    host: "elixir-lang.org",
    path: "/",
    port: 443,
    query: nil,
    scheme: "https",
    userinfo: nil
  },
  visits: 123
}}
Enter fullscreen mode Exit fullscreen mode

Easy, I hear you say — just do something like:

def parse_visit_count(string) do
  with [uri_string, count_string | []] <- string |> String.trim() |> String.split(),
        {:ok, uri} <- URI.new(uri_string),
        {count, ""} <- Integer.parse(count_string) do
    {:ok, %{referrer: uri, visits: count}}
  else
    _ -> :error
  end
end
Enter fullscreen mode Exit fullscreen mode

That's indeed pretty straightforward. But what if there are several URI/visits pairs per line? Then the logic is already much more involved. And what if these URI/visits pairs are in reverse order (visits first, then URI)?

Using a parser combinator approach, we look at the problem differently. We've got a parser for URIs, and another one for integers. Let's write a parser for white space, then combine those three parsers into a single one for our specific use case. In pseudo-code, it would look like this:

def parse_visit_count(...) do
   ...
   |> uri_parser()
   |> whitespace_parser()
   |> integer_parser()
end
Enter fullscreen mode Exit fullscreen mode

And since we're now in combinator-land, we could then use parse_visit_count and combine it with other parsers to step up the complexity. Neat, right?

Code Walkthrough: Our Parser Combinator Example

Let's work through a concrete example of using a parser combinator: we'll parse a phone number into a data structure.

The phone number we'll use for our examples is one from the Netherlands: +31 20 42 84 105.

According to Wikipedia, this number can be formatted in various ways (note we'll only be covering a subset of the possibilities within this article):

The area code ('A') is commonly separated with a dash ('-') and sometimes a space from the subscriber's number ('B').
The length of the area code for landlines is either 2 or 3 digits, depending on the population density of the area. This leaves 7 or 6 digits for the subscriber's number, resulting in a format of either 0AA-BBBBBBB or 0AAA-BBBBBB. [...]
The trunk prefix '0' is dropped when prefixed by the country code: +31 AA BBBBBBBB, [...], etcetera. [...]

In other words, this phone number can also be formatted as (among others):

  • +31 20 4284105
  • +31 20-42 84 105
  • 020-42 84 105
  • 020-4284105
  • 020 42 84 105

Deconstructing the Problem

From a quick look at the formats above, we can tell that the main "chunks" we need to extract are:

  • Country code - the "+31" prefix, which may be absent
  • Area code - the "020", "(0)20", or "20" value
  • Subscriber number - the remaining seven digits, which may be separated by spaces

Getting Started

We'll use NimbleParsec, so let's go ahead and create a new project with mix new demo and add the dependency in mix.exs:

# in mix.exs

defp deps do
  [
    {:nimble_parsec, "~> 1.2"}
  ]
end
Enter fullscreen mode Exit fullscreen mode

Run mix deps.get to ensure everything's ready for us. With that in place, let's start writing our phone number parser, which we'll do within its own module:

# lib/demo/phone_number/parser.ex

defmodule Demo.PhoneNumber.Parser do
  @moduledoc false

  import NimbleParsec

  dutch_phone_number = string("+31 20 42 84 105")

  defparsec(:parse, dutch_phone_number)
end
Enter fullscreen mode Exit fullscreen mode

Within this PhoneNumber.Parser module, we need to import NimbleParsec so we can call on its functions (such as string) more conveniently. Then, we define our parser. Right now, that consists of only parsing the exact phone number string via NimbleParsec's string/2 function (hence why the phone number value is hardcoded). And last but not least, we call defparsec/3 to "finalize" the parser and make it executable.

Let's give it a spin! Open an IEx session by typing iex -S mix in the project's root folder from within a terminal. We can then try our parser:

iex(1)> Demo.PhoneNumber.Parser.parse("+31 20 42 84 105")
{:ok, ["+31 20 42 84 105"], "", %{}, {1, 0}, 16}

iex(2)> Demo.PhoneNumber.Parser.parse("foobar")
{:error, "expected string \"+31 20 42 84 105\"", "foobar", %{}, {1, 0}, 0}
Enter fullscreen mode Exit fullscreen mode

The result is a tuple containing six terms. The first is the :ok or :error tag indicating whether the provided string could be parsed.

The second value is (in the :ok case) a list of the parsed information, or (in the :error case) the error reason.

The remaining values can be ignored as we won't get into them in this article.

Parsing Local Numbers

Now that we've created a parser for an exact string, let's dig deeper into NimbleParsec's functionality by expanding the variety of phone numbers our parser will be able to process.

Local phone numbers look like this:

  • 020-42 84 105
  • 020-4284105
  • 020 42 84 105

There's an area code, then a separator which can be a space or a dash. This is followed by the subscriber number, which can contain spaces to make it easier to read.

Note also that the "area code" portion consists of the trunk prefix (which is "0") followed by two or three numbers indicating the area itself.

So conceptually, we're looking at writing a parser that looks like area_code <> separator <> subscriber_number.

Let's start with the area code:

# lib/demo/phone_number/parser.ex

defmodule Demo.PhoneNumber.Parser do
  @moduledoc false

  import NimbleParsec

  trunk_prefix = string("0")

  area_code =
    trunk_prefix
    |> string("20")

  dutch_phone_number = area_code

  defparsec(:parse, dutch_phone_number)
end
Enter fullscreen mode Exit fullscreen mode

Let's give it a whirl in iex -S mix:

iex(1)> Demo.PhoneNumber.Parser.parse("020-42 84 105")
{:ok, ["0", "20"], "-42 84 105", %{}, {1, 0}, 3}
Enter fullscreen mode Exit fullscreen mode

We can see our parser is properly extracting out the area code: it's being returned as ["0", "20"] because the area_code combinator is composed of trunk_prefix (yielding "0"), followed by string("20") (logically yielding "20"). The remainder of the input string is left unparsed (as "-42 84 105"). Success!

So what's going on? We create a trunk_prefix parser that will match the specific "0" string by using string/2.

Then, we define an area_code parser that matches this trunk_prefix followed by the "20" string. And finally, we define the parse function as the parser specified by the area_code combinator by calling defparsec/3.

We're on a roll, so let's add to our current parser so that it also captures the separator. This separator can be either a dash or a space. It sounds like a perfect match for choice/3, which will apply the first parser it finds that can parse the input.

# lib/demo/phone_number/parser.ex

defmodule Demo.PhoneNumber.Parser do
  @moduledoc false

  import NimbleParsec

  # ...

  separator = choice([string("-"), string(" ")])

  dutch_phone_number =
    area_code
    |> concat(separator)

  defparsec(:parse, dutch_phone_number)
end
Enter fullscreen mode Exit fullscreen mode

In choice([string("-"), string(" ")]), the choice([...]) combinator will attempt to parse a "-" string. If it's not able to, it will attempt parsing a " " string. Only if both of these options fail, does the entire choice combinator fail. Conversely, if the first string("-") option succeeds, none of the subsequent parsers provided to choice will be tested.

You'll notice that we weren't able to simply pipe the two area_code and separator combinators into each other: separator doesn't accept an argument, so we can't pass in area_code. NimbleParsec does, however, make it easy to concatenate two combinators with concat/2, so that's exactly how we plug these two functions into one another. Let's check things still work as expected:

# in `iex -S mix`

iex(1)> Demo.PhoneNumber.Parser.parse("020-42 84 105")
{:ok, ["0", "20", "-"], "42 84 105", %{}, {1, 0}, 4}

iex(2)> Demo.PhoneNumber.Parser.parse("020 42 84 105")
{:ok, ["0", "20", " "], "42 84 105", %{}, {1, 0}, 4}

iex(3)> Demo.PhoneNumber.Parser.parse("020/42 84 105")
{:error, "expected string \"-\" or string \" \"", "/42 84 105", %{}, {1, 0}, 3}
Enter fullscreen mode Exit fullscreen mode

Great! Both the dash and space separators are getting picked up, while the invalid slash yields an error.

Let's now write an overly accepting parser that we can come back to and tighten down: we want to capture all digit and space characters until "end of string". To do so, we'll use utf8_string/3 to parse digits, times/3 to get us repetition, and finally, eos/1 to represent "end of string". It looks like this:

# lib/demo/phone_number/parser.ex

defmodule Demo.PhoneNumber.Parser do
  @moduledoc false

  import NimbleParsec

  # ...

  digit = utf8_string([?0..?9], 1)

  area_code =
    trunk_prefix
    |> times(digit, 2)

  # ...

  subscriber_number =
    choice([digit, string(" ")])
    |> times(min: 1)

  dutch_phone_number =
    area_code
    |> concat(separator)
    |> concat(subscriber_number)
    |> eos()

  defparsec(:parse, dutch_phone_number)
end
Enter fullscreen mode Exit fullscreen mode

We define our digit combinator as a UTF8 string that contains values in the 0-9 range and is of length 1: in other words, it's a string of a single digit.

Note we've also improved the area_code combinator: instead of matching only the specific "20" area code, it will match any two digits. The side effect is that the parse results will be slightly different: ["0", "2", "0", "-", ...] instead of ["0", "20", "-", ...].

Finally, since eos() has an optional combinator argument, there's no need to wrap it with a concat() call within the pipeline.

Time for a Short Intermission

We've covered a decent amount of ground so far. Let's take a break and let all of that sink in.

Our parser is starting to take shape, but it's still not very good. We'll improve it in the next installment where we'll investigate writing custom parsers, among other topics.

Until next time, happy coding!

P.S. If you'd like to read Elixir Alchemy posts as soon as they get off the press, subscribe to our Elixir Alchemy newsletter and never miss a single post!

💖 💪 🙅 🚩
davidsulc
David Sulc

Posted on October 25, 2022

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related