Parser Combinators in Elixir: A Deeper Dive
David Sulc
Posted on November 22, 2022
In our last post, we wrote a basic parser for phone numbers using Elixir. It was a bit simplistic since it didn't really respect the format phone numbers are expected to have, but it was a great start.
We'll now improve the parser to ensure we only accept phone numbers that fit the spec and make our return type an instance of structured data.
Let's dive straight in!
Formatting Parse Results in Elixir
At this point, we've got rudimentary parsers processing the main parts of a local phone number. Let's now make the parse results easier to put into a data structure.
Right now, our parser output looks like this:
iex(1)> Demo.PhoneNumber.Parser.parse("020-42 84 105")
{:ok, ["0", "2", "0", "-", "4", "2", " ", "8", "4", " ", "1", "0", "5"], "", %{},
{1, 0}, 13}
In passing, you might note that the "remaining unparsed" value (third element in the tuple) is the ""
empty string, meaning our parser successfully consumes the entire input string.
For starters, let's remove noise from the parse result: we're not interested in the trunk value (it's always "0"
and therefore provides no information) or the separator. Nor, for that matter, are we interested in the spaces in the subscriber number. Let's wrap each of those in an ignore/2
so they get dropped from the output:
# lib/demo/phone_number/parser.ex
defmodule Demo.PhoneNumber.Parser do
@moduledoc false
import NimbleParsec
# ...
area_code =
ignore(trunk_prefix)
|> times(digit, 2)
separator = choice([string("-"), string(" ")])
subscriber_number =
choice([digit, ignore(string(" "))])
|> times(min: 1)
dutch_phone_number =
area_code
|> concat(ignore(separator))
|> concat(subscriber_number)
|> eos()
defparsec(:parse, dutch_phone_number)
end
This indeed removes the trunk prefix and separator from the parse result (given they're now both ignore
d), but it isn't exactly straightforward to understand since all the parts are just dumped into an array:
iex(1)> Demo.PhoneNumber.Parser.parse("020-42 84 105")
{:ok, ["2", "0", "4", "2", "8", "4", "1", "0", "5"], "", %{}, {1, 0}, 13}
We should tag the results to identify which ones come from particular parsers. That sounds like a perfect fit for tag/2
:
# lib/demo/phone_number/parser.ex
defmodule Demo.PhoneNumber.Parser do
@moduledoc false
import NimbleParsec
# ...
area_code =
ignore(trunk_prefix)
|> times(digit, 2)
|> tag(:area_code)
# ...
subscriber_number =
choice([digit, ignore(string(" "))])
|> times(min: 1)
|> tag(:subscriber_number)
# ...
defparsec(:parse, dutch_phone_number)
end
We're now closer to our goal:
iex(1)> Demo.PhoneNumber.Parser.parse("020-42 84 105")
{:ok,
[area_code: ["2", "0"], subscriber_number: ["4", "2", "8", "4", "1", "0", "5"]],
"", %{}, {1, 0}, 13}
These values are still a bit messy though: we want a single string value instead of a collection of digit strings. Essentially, we want to Enum.join(digits_strings, "")
, which is where reduce/3
enters the picture:
# lib/demo/phone_number/parser.ex
defmodule Demo.PhoneNumber.Parser do
@moduledoc false
import NimbleParsec
# ...
area_code =
ignore(trunk_prefix)
|> times(digit, 2)
|> reduce({Enum, :join, [""]})
|> tag(:area_code)
# ...
subscriber_number =
choice([digit, ignore(string(" "))])
|> times(min: 1)
|> reduce({Enum, :join, [""]})
|> tag(:subscriber_number)
# ...
defparsec(:parse, dutch_phone_number)
end
We're steadily moving towards our goal of having parse results that are nicely formatted and easy to convert into a data structure:
iex(1)> Demo.PhoneNumber.Parser.parse("020-42 84 105")
{:ok, [area_code: ["20"], subscriber_number: ["4284105"]], "", %{}, {1, 0}, 13}
We still need a little tweak: instead of the tagged values containing an array of a single element, we want to just have the element directly. That's easy. We'll just replace our use of tag/3
with unwrap_and_tag/3
:
# lib/demo/phone_number/parser.ex
defmodule Demo.PhoneNumber.Parser do
@moduledoc false
import NimbleParsec
# ...
area_code =
ignore(trunk_prefix)
|> times(digit, 2)
|> reduce({Enum, :join, [""]})
|> unwrap_and_tag(:area_code)
# ...
subscriber_number =
choice([digit, ignore(string(" "))])
|> times(min: 1)
|> reduce({Enum, :join, [""]})
|> unwrap_and_tag(:subscriber_number)
# ...
defparsec(:parse, dutch_phone_number)
end
Check it out:
iex(1)> Demo.PhoneNumber.Parser.parse("020-42 84 105")
{:ok, [area_code: "20", subscriber_number: "4284105"], "", %{}, {1, 0}, 13}
Creating a Struct
Let's create a PhoneNumber
struct to hold our parsed data:
# lib/demo/phone_number.ex
defmodule Demo.PhoneNumber do
alias Demo.PhoneNumber.Parser
defstruct [:country_code, :area_code, :subscriber_number]
def new(string) when is_binary(string) do
case Parser.parse(string) do
{:ok, results, "", _, _, _} -> {:ok, struct!(__MODULE__, results)}
{:error, reason, _rest, _, _, _} -> {:error, reason}
end
end
end
What do we have here? After defining a struct with defstruct/1
, we define a new/1
function that will accept a string input. Then it will run our fancy parser: if the parser is unable to process the string, it will return {:error, reason}
.
If, on the other hand, our parser is able to process the string, we know that results
will be a keyword list of the parsed components found in the phone number (such as [area_code: "20", subscriber_number: "4284105"]
as shown above). And we know this will indeed be a keyword list, because we've done so much work getting our parser to format and tag the output in this manner.
All of this hard work is finally paying off: since our keyword list keys are the same as the struct's, we can create a new struct instance by simply calling struct!/2
with our parse results!
We now have the public API to parse phone numbers into a data structure. Time for a quick test in iex -S mix
:
iex(1)> Demo.PhoneNumber.new("020-42 84 105")
{:ok,
%Demo.PhoneNumber{
area_code: "20",
country_code: nil,
subscriber_number: "4284105"
}}
iex(2)> Demo.PhoneNumber.new("foobar")
{:error, "expected ..."}
Writing Custom Combinators Using Elixir
We've made great strides so far: our parser now returns a data structure! This will be extremely helpful down the line when the data it contains needs to be validated or otherwise processed.
There's still some work ahead of us, though: aside from the fact that it doesn't yet parse international numbers, it also doesn't check for number lengths. Area codes may have 2 or 3 digits, while subscriber numbers may have 7 or 6 digits respectively.
To assist us in achieving this goal, we'll be writing more advanced and powerful combinators. As you may recall from the previous article, combinators are parsers that can be combined into bigger and better parsers. We'll be using NimbleParsec to define a complex parser from smaller and simpler ones.
Since we want the parser behavior to vary according to provided arguments, we need to define some functions. But since NimbleParsec's defparsec/3
is executed during compilation, we can't invoke a function defined in the same module.
To work around this, we'll define a Helpers
sub-module and our configurable parsers there:
# lib/demo/phone_number/parser/helpers.ex
defmodule Demo.PhoneNumber.Parser.Helpers do
@moduledoc false
import NimbleParsec
@doc """
Parses a single digit.
"""
def digit(combinator \\ empty()) do
combinator
|> utf8_string([?0..?9], 1)
end
@doc """
Parses `count` digits.
For example, `digits(3)` would parse 3 digits.
"""
def digits(combinator \\ empty(), count) do
combinator
|> duplicate(digit(), count)
|> reduce({Enum, :join, [""]})
end
@doc """
Parses a phone number area code.
Since phone number area codes can only be 2 or 3 digits in length,
only `2` and `3` are valid values for the `length` parameter.
"""
def area_code(combinator \\ empty(), length) when length in [2, 3] do
combinator
|> digits(length)
end
end
As you can see, we use combinator
as the first function argument, defaulting to NimbleParsec's empty()
combinator if none is provided. This enables us to compose combinators where appropriate, without having to use concat/2
as a crutch.
Updating the Parser with Functions
Let's update our parser to use these new functions:
# lib/demo/phone_number/parser.ex
defmodule Demo.PhoneNumber.Parser do
@moduledoc false
import NimbleParsec
import Demo.PhoneNumber.Parser.Helpers
# ...
area_code_with_trunk_prefix =
ignore(trunk_prefix)
|> area_code(2)
|> unwrap_and_tag(:area_code)
# ...
subscriber_number =
choice([digit(), ignore(string(" "))])
|> times(min: 1)
|> reduce({Enum, :join, [""]})
|> unwrap_and_tag(:subscriber_number)
dutch_phone_number =
area_code_with_trunk_prefix
|> concat(ignore(separator))
|> concat(subscriber_number)
|> eos()
defparsec(:parse, dutch_phone_number)
end
We've renamed area_code
to area_code_with_trunk_prefix
so it won't get confused with the area_code
imported from the new Helpers
module. Also take note that digit
has become digit()
, as it's now a function rather than a combinator.
Our code still works, but we've yet to fix the subscriber number parser: if the area code is 2 digits, then the subscriber number should be 7 digits. Conversely, if the area code is 3 digits long, the subscriber should contain only 6 digits. So let's now refactor to have a variable-length subscriber_number
combinator along with a configurable local_number
combinator:
# lib/demo/phone_number/parser/helpers.ex
defmodule Demo.PhoneNumber.Parser.Helpers do
@moduledoc false
import NimbleParsec
# ... various combinator definitions ...
def subscriber_number(combinator \\ empty(), length) when length in [6, 7] do
combinator
|> digit()
|> times(ignore(optional(string(" "))) |> digit(), length - 1)
|> ignore(optional(string(" ")))
|> reduce({Enum, :join, [""]})
end
def local_number(combinator \\ empty(), area_code_length, subscriber_number_length) do
separator = choice([string("-"), string(" ")])
combinator
|> area_code(area_code_length)
|> unwrap_and_tag(:area_code)
|> concat(ignore(separator))
|> concat(subscriber_number(subscriber_number_length) |> unwrap_and_tag(:subscriber_number))
end
end
There's quite a bit going on here, so let's take a look.
Side note: if you ever get stuck when writing sub-parsers, you can create entry points to only test that portion of your code. For example, by adding defparsec(:subscriber_number, subscriber_number(7))
at the bottom of Demo.PhoneNumber.Parser
, you can run iex -S mix
and test the parsing of subscriber numbers with Demo.PhoneNumber.Parser.subscriber_number("65 23 125")
.
The subscriber number is defined as (mandatorily) starting with a digit, followed by an optional space and then another digit, with the latter grouping repeating one time less than the desired length provided as an argument. That means that we'll end up with as many actual digits as requested via length
(along with a variable number of spaces mixed in). We ignore the whitespace so it won't show up in the results, and then combine all parsed digits into a single string.
We then define a local_number
combinator that will accept two lengths: one for the area code, and the other for the subscriber portion. We combine these with an ignored separator
, and tag each segment. Since we know each area_code
and subscriber_number
combinator will yield a single result, we can use unwrap_and_tag/3
to extract the parsed result from the returned array and tag it with the provided atom.
As you may have seen, our subscriber_number
combinator accepts a previous combinator as the first argument: we could have chained it directly to the previous combinator.
Why, then, are we using concat/2
? Because it will, in effect, "separate" the combinator we're interested in, so that unwrap_and_tag/3
will apply only to the subscriber_number
sub-parser rather than to the combinator result of subscriber_number
and its prior combinator.
In addition, the Enum.join
step we have in subscriber_number
would run into problems without concat/2
. It attempts to join
the result of the previous combinator (which will include area_code: "20"
), so it will crash (as Enum.join
won't know how to process a tuple). Keep this in mind if you run into unexpected failures when writing your own combinators.
Rewriting the Main Parser in Elixir
With these new custom combinators available, let's go back and rewrite our main parser:
# lib/demo/phone_number/parser.ex
defmodule Demo.PhoneNumber.Parser do
@moduledoc false
import NimbleParsec
import Demo.PhoneNumber.Parser.Helpers
trunk_prefix = string("0")
local_portion =
choice([
local_number(2, 7),
local_number(3, 6)
])
dutch_phone_number =
ignore(trunk_prefix)
|> concat(local_portion)
|> eos()
defparsec(:parse, dutch_phone_number)
end
Nice and readable: we can easily tell that the local portion of the number is always expected to be 9 digits long, with a longer area code resulting in a short subscriber number.
We're almost done! We just need to handle the international prefix:
# lib/demo/phone_number/parser.ex
defmodule Demo.PhoneNumber.Parser do
@moduledoc false
import NimbleParsec
import Demo.PhoneNumber.Parser.Helpers
# ...
international_prefix =
ignore(string("+"))
|> digits(2)
|> ignore(string(" "))
|> map({String, :to_integer, []})
|> unwrap_and_tag(:country_code)
dutch_phone_number =
choice([international_prefix, ignore(trunk_prefix)])
|> concat(local_portion)
|> eos()
defparsec(:parse, dutch_phone_number)
end
We simply define the international_prefix
combinator using functions that are now familiar to us, and then use map/3
to ensure the result is an integer: we may want to compare it to valid ISO country codes later on.
With that in place, it's simply a matter of telling our parser that we're either:
- In the international case (where our parser does not expect the trunk prefix to be present)
- In a national context, where the trunk prefix should indeed be part of the phone number (but removed from the output)
Voilà, we've got our phone number parser defined in a way that's understandable and easy to maintain. Take a look:
iex(1)> Demo.PhoneNumber.new("+31 20-42 84 105")
{:ok,
%Demo.PhoneNumber{
area_code: "20",
country_code: 31,
subscriber_number: "4284105"
}}
Next Steps: Play Around with Parser Combinators
The best way to get a good grasp on how parser combinators work is to experiment with them.
Want to give it a spin but can't think of an example? Try extending our code to parse other cases.
Write a parser for ISO 8601 datetimes such as 2019-11-14T00:55:31.820Z
. And if you want to get really crazy, you can extend it to handle durations and time intervals.
You can also try your hand at writing a recursive parser: there's an example in NimbleParsec's parsec/2
documentation to get you started.
As touched upon in the introduction to this article, NimbleParsec is more of a parser generator: take a look at Combine and try to convert the above code to make use of Combine, or come up with your own example.
The way Combine works is that you combine parser functions, and obtain a new parser function. Parser functions can then be applied via Combine.parse/3
. Here's a toy example:
# the public parser
def username_and_id(string) do
Combine.parse(string, username_and_id_parser())
end
# these are internal parsers for the combinator
# a "username and id" is simply a sequence of
# username, space, id
defp username_and_id_parser() do
sequence([
username(),
ignore(string(" ")),
id()
])
end
# a username consists of word characters
defp username(), do: word()
# an id is an integer
defp id(), do: integer()
Calling this function will return the (non-ignored) parse result:
iex> username_and_id("foobar 123")
[["foobar", 123]]
Ready for the big leagues? Write your own parser combinator from scratch to fully understand how everything works under the hood. Here's an example parsing a subset of SQL, written by Saša Jurić.
Wrapping Up
In the first part of this series, we briefly defined parser combinators before building a simple one in Elixir.
In this second and final part, we dived deeper into the parser combinator that we built last time and improved it. We also gave you some ideas for ways to further explore parser combinators.
Happy coding!
P.S. If you'd like to read Elixir Alchemy posts as soon as they get off the press, subscribe to our Elixir Alchemy newsletter and never miss a single post!
Posted on November 22, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.