Advent of Code 2020: Day 04 using PEG grammars in Python
Yuan Gao
Posted on December 5, 2020
Continuing my pursuit for unusual solutions to Advent of Code, for this one I treat the data as if it were a programming language, and write a whole PEG-based parser to "execute it".
Things mentioned in this post: Parsing Expression Grammars (PEG), parsing, string splitting, more regex
The Challenge Part 1
Link to challenge on Advent of Code 2020 website
Presented is another parsing/validation challenge. We're given some "passport" data that looks like this:
ecl:gry pid:860033327 eyr:2020 hcl:#fffffd
byr:1937 iyr:2017 cid:147 hgt:183cm
Which are key-value pairs separated by blank-lines, and a task to validate to ensure each passport has the right set of values in them.
The rules, as stated is simply that all fields except cid
must be present
Reading the file
Unlike before, the data comes in as a set of entries, that each span more than one line, separated by blank lines. So we can't use readlines()
like in the Day 2 solution. Instead, we can read all the data in at once, and then split by a double-newline:
data = open("input.txt").read().split("\n\n")
Output
['eyr:2029 byr:1931 hcl:z cid:128\necl:amb hgt:150cm iyr:2015 pid:148714704',
'byr:2013 hgt:70cm pid:76982670 ecl:#4f9a1c\nhcl:9e724b eyr:1981 iyr:2027',
'pid:261384974 iyr:2015\nhgt:172cm eyr:2020\nbyr:2001 hcl:#59c2d9 ecl:amb cid:163',
...
Now we have an array that contains one passport entry to validate per index. Throughout this post, if you see me use entry
, it's just one of these entries that I've pulled out for testing (e.g. entry = data[0]
)
Parsing using split()
We're really just dealing with a bunch of space (or newline)-separated strings here, so we can just go ahead and split()
them. split()
automatically splits strings by whitespace, so handles the newlines as well:
entry.split()
Output:
['eyr:2024',
'hcl:#b6652a',
'cid:340',
'byr:1929',
'ecl:oth',
'iyr:2014',
'pid:186640193',
'hgt:193in']
Now we can break each of these down into key-value pairs by splitting on the :
as well, and using list comprehension to apply it to all elements:
[item.split(":") for item in entry.split()]
Output
[['eyr', '2024'],
['hcl', '#b6652a'],
['cid', '340'],
['byr', '1929'],
['ecl', 'oth'],
['iyr', '2014'],
['pid', '186640193'],
['hgt', '193in']]
Python's dict()
function conveniently takes a list of lists like this and turns it into a dictionary for us:
parsed = dict(item.split(":") for item in entry.split())
Output
{'eyr': '2024',
'hcl': '#b6652a',
'cid': '340',
'byr': '1929',
'ecl': 'oth',
'iyr': '2014',
'pid': '186640193',
'hgt': '193in'}
The useful thing with having a dictionary like this is we can easily retrieve just the keys in this dictionary:
parsed.keys()
Output
dict_keys(['eyr', 'hcl', 'cid', 'byr', 'ecl', 'iyr', 'pid', 'hgt'])
Validation
Now that we have a way to parse the entries, our validation rule for this challenge is simply all fields other than cid
must be present.
Recall the set theory solution from Day 01. Any time we want to detect if one list has some relation to another list (ignoring order of the list), we can use set theory, and Python's powerful set features.
So we can create our set of required fields:
required = {'eyr', 'hcl', 'byr', 'ecl', 'iyr', 'pid', 'hgt'}
And then we can use set difference()
to figure out what fields in our passport entry are missing:
missing = required.difference(parsed.keys())
If this results in an empty set, it means we have all the keys we need for this passport entry, and it is therefore valid. If this set has anything in it, it means it is invalid.
So we can set up a simple list comprehension to process these differences for each entry, and count them up. So the whole code looks like this:
data = open("input.txt").read().split("\n\n")
required = {'eyr', 'hcl', 'byr', 'ecl', 'iyr', 'pid', 'hgt'}
valids = 0
for entry in data:
items = dict(item.split(":") for item in entry.split())
valid = not required.difference(items.keys())
if valid:
valids += 1
print("Total valid", valids)
But of course, this can be code-golfed some down to a single line of code, if you want to be wildly unreadable. It's not useful to do this, but it's cool that python can achieve this compact a line of code.
print("Total valid", sum(not {'eyr', 'hcl', 'byr', 'ecl', 'iyr', 'pid', 'hgt'}.difference([item.split(":")[0] for item in entry.split()]) for entry in open("input.txt").read().split("\n\n")))
The Challenge Part 2
Part 2 of the challenge extends this to add a whole new set of validation rules for each of the parts. Now this is where the fun begins
Since we have a parser that is able to collect each key:value pair, we can simply write new rules to check that each value has the correct function. But where would the fun be in that? Let's take a completely different direction.
We looked at using State Machines and Regex for parsing duties in Day 02, where we used state machines or regex for consuming specific patterns of data. That's fun and all, but what we're doing here is a little different - reading different tokens out of the input data, and checking whether those tokens are valid or not based on a larger set of "syntax" rules. Sounds a lot like programming right?
So why don't we approach this as if those passport fields were a programming language. Instead of splitting strings and processing each part of the string like we're handling data, what if we treat those fields like keywords and values in a programming language.
Guess what, parsing programming languages is a "solved problem". Enter PEG, Parsing Expression Grammar, which are used to describe (and parse) programming languages! I will be using the Python library parsimonious, a very lightweight and fast PEG-based parser. I have used parsimonious
before to make a tiny scripting language for a game dialogue engine, and enjoyed using it.
PEG
If you like regex, you'll love PEG. PEG uses pattern matching, but in addition to that, it can do recursive nested pattern matching. It's a bit like having regex inside regex inside regex.
Let's look at this format/validation rule:
hgt (Height) - a number followed by either cm or in:
- If cm, the number must be at least 150 and at most 193.
- If in, the number must be at least 59 and at most 76.
So we expect the value to look something like this: hgt:190cm
or hgt:70in
. What does this look like in PEG?
from parsimonious.grammar import Grammar
grammar = Grammar(r"""
HGT = "hgt:" (HGTCM / HGTIN)
HGTCM = NUMB "cm"
HGTIN = NUMB "in"
NUMB = ~r"[0-9]{2,4}"
""")
Here it is. What we've done here is define the HGT
node as being the characters hgt:
followed by either a HGTCM
node or a HGTIN
node. Then we've defined the HGTCM
node as an NUMB
followed by "cm" and we've defined NUMB
with a bit of regex saying it must be 2 to 4 digits (the reason it's 2-4 digits is to be able to reuse NUMB for years later)
In actual use, we get something like this:
print(grammar.parse("hgt:123cm"))
Output
<Node called "HGT" matching "hgt:123cm">
<Node matching "hgt:">
<Node matching "123cm">
<Node called "HGTCM" matching "123cm">
<RegexNode called "NUMB" matching "123">
<Node matching "cm">
Tadaa, we've parsed the height value like it was a programming language. The generated result is a syntax tree, that contains all the components that we've pulled out, including the important values. However, this isn't the whole story, while regex and PEG allows us to do things like ensure all the right number of characters are there, i.e. the right syntax, what it can't do is check that a value is between a certain range. We are going to have to delegate that to some later python code.
If we gave it a bad value that didn't match our grammar (this is missing the in
or cm
part
print(grammar.parse("hgt:123"))
Output
ParseError: Rule <Literal "cm"> didn't match at '' (line 1, column 8).
Ok, Here's all of the syntax rules for the different keys and values (this is not a finished PEG):
grammar = Grammar(r"""
BYR = "byr:" NUMB
IYR = "iyr:" NUMB
EYR = "eyr:" NUMB
HGT = "hgt:" (HGTCM / HGTIN)
HCL = "hcl:" ~r"#[0-9a-f]{6}"
ECL = "ecl:" ("amb" / "blu" / "brn" / "gry" / "grn" / "hzl" / "oth")
PID = "pid:" ~r"[0-9]{9}"
CID = "cid:" ~r"[0-9a-zA-Z]*"
HGTCM = NUMB "cm"
HGTIN = NUMB "in"
NUMB = ~r"[0-9]{2,4}"
""")
As can be seen, any Year is defined as 4 digits, any validation of their values will be done later; while ecl
has a list of 7 possible values. We actually have two options here: either have the 7 possible values as part of PEG, in which case a parse error will be raised when there is an incorrect value; or have them as part of validation later, in which case the PEG will parse fine, and the code will raise the error later. In a real use-case, which one to go for depends on where you want the error to happen.
However, the PEG isn't finished. A PEG must match the WHOLE string. so far we've only defined the different key:values. But we expect multiple key:values to appear in the entry, so we need an overall expression node.
grammar = Grammar(r"""
EXPR = ITEM+
ITEM = (BYR / IYR / EYR / HGT / HCL / ECL / PID / CID) WHITE?
WHITE = ~r"[\s]+"
BYR = "byr:" NUMB
IYR = "iyr:" NUMB
EYR = "eyr:" NUMB
HGT = "hgt:" (HGTCM / HGTIN)
HCL = "hcl:" ~r"#[0-9a-f]{6}"
ECL = "ecl:" ("amb" / "blu" / "brn" / "gry" / "grn" / "hzl" / "oth")
PID = "pid:" ~r"[0-9]{9}"
CID = "cid:" ~r"[0-9a-zA-Z]*"
HGTCM = NUMB "cm"
HGTIN = NUMB "in"
NUMB = ~r"[0-9]{2,4}"
""")
The extra parent EXPR added at the top states that this parsed entry must consist of at least one lots of ITEM; and each ITEM is one of the seven entries followed by optional whitespace. That's it! We now have a PEG to parse our passport entries. At this stage we can already throw out any passport entries that have invalid syntax, but we need a few more steps to validate the actual data such as the year numbers, so we will need to turn to traversing the generated syntax tree.
A successful syntax tree looks like this:
<Node called "EXPR" matching "iyr:2010 hgt:158cm hcl:#b6652a ecl:blu byr:1944 eyr:2021 pid:093154719">
<Node called "ITEM" matching "iyr:2010 ">
<Node matching "iyr:2010">
<Node called "IYR" matching "iyr:2010">
<Node matching "iyr:">
<RegexNode called "NUMB" matching "2010">
<Node matching " ">
<RegexNode called "WHITE" matching " ">
<Node called "ITEM" matching "hgt:158cm ">
<Node matching "hgt:158cm">
<Node called "HGT" matching "hgt:158cm">
<Node matching "hgt:">
<Node matching "158cm">
<Node called "HGTCM" matching "158cm">
<RegexNode called "NUMB" matching "158">
<Node matching "cm">
<Node matching " ">
<RegexNode called "WHITE" matching " ">
<Node called "ITEM" matching "hcl:#b6652a ">
<Node matching "hcl:#b6652a">
<Node called "HCL" matching "hcl:#b6652a">
<Node matching "hcl:">
<RegexNode matching "#b6652a">
<Node matching " ">
<RegexNode called "WHITE" matching " ">
<Node called "ITEM" matching "ecl:blu ">
<Node matching "ecl:blu">
<Node called "ECL" matching "ecl:blu">
<Node matching "ecl:">
<Node matching "blu">
<Node matching "blu">
<Node matching " ">
<RegexNode called "WHITE" matching " ">
<Node called "ITEM" matching "byr:1944 ">
<Node matching "byr:1944">
<Node called "BYR" matching "byr:1944">
<Node matching "byr:">
<RegexNode called "NUMB" matching "1944">
<Node matching " ">
<RegexNode called "WHITE" matching " ">
<Node called "ITEM" matching "eyr:2021 ">
<Node matching "eyr:2021">
<Node called "EYR" matching "eyr:2021">
<Node matching "eyr:">
<RegexNode called "NUMB" matching "2021">
<Node matching " ">
<RegexNode called "WHITE" matching " ">
<Node called "ITEM" matching "pid:093154719">
<Node matching "pid:093154719">
<Node called "PID" matching "pid:093154719">
<Node matching "pid:">
<RegexNode matching "093154719">
<Node matching "">
It's a lot of nodes there, but embedded within this is the formal structure of our "programming language" that are the passport entries, and we can simply loop over these nodes to find the values we need to check, and we're guaranteed that these nodes appear in exactly the order that we defined in our PEG.
Validation
To validate our newly-created syntax tree, we need to recursively travel down the tree to visit each node, and then do something with its value. Fortunately, Parsimonious already comes with a tree traversal class called NodeVisitor
which goes depth-first, diving deep into the node and then evaluating each node using a function you provide.
As an example, here's a node visitor that ONLY checks that the Birth Year is between 1920 and 2002:
class PassportVisitor(NodeVisitor):
def visit_BYR(self, node, visited_children):
assert 1920 <= visited_children[1] <= 2002
def visit_YEAR(self, node, visited_children):
return int(node.text)
def generic_visit(self, node, visited_children):
return visited_children or node
A few things going on here. the generic_visit()
method is a requirement, as this is the function that parsimonious runs when it doesn't find a matching method specific to the node. I've given it a visit_YEAR()
method that will automatically get called on YEAR nodes that we defined earlier. The job of this method is to convert the 4-digit year from a string into an integer so that upstream nodes don't have to do that conversion. We are absolutely sure that the string will contain 4-digits because if it didn't, the PEG would have failed to parse. This is how the node visitor works alongside the PEG to make sure both the format and values are correct.
Further up, visit_BYR()
now handles the full byr:
key and value. At this point, the children nodes have already been processed, including the YEAR
node which has returned a number. We can therefore access the result of the YEAR
node inside visited_children[1]
since the BYR
block has two children (inside child 0 is the literal "byr:" as defined in PEG.
Since our node visitor's only job is to validate errors, we don't even need to return any values out of this BYR node since we don't need to use it. We only need to raise exceptions when validation fails. We can make use of Python's assert
keyword to do this (but raise Exception
would have also been fine).
Now, the full node visitor with all the other rules implemented looks like:
class PassportVisitor(NodeVisitor):
def visit_BYR(self, node, visited_children):
assert 1920 <= visited_children[1] <= 2002
def visit_IYR(self, node, visited_children):
assert 2010 <= visited_children[1] <= 2020
def visit_EYR(self, node, visited_children):
assert 2020 <= visited_children[1] <= 2030
def visit_HGTCM(self, node, visited_children):
assert 150 <= visited_children[0] <= 193
def visit_HGTIN(self, node, visited_children):
assert 59 <= visited_children[0] <= 76
def visit_NUMB(self, node, visited_children):
return int(node.text)
def generic_visit(self, node, visited_children):
return visited_children or node
We're almost there! This node visitor applies the remaining numerical checks that aren't covered by the PEG. However the one thing it doesn't do is tell us whether we have seen all of the needed keys, since it only checks each key individually. We can add that by using the same set
comprehension done earlier, we just need to have the ITEM node visitor return what its nested key is, so that we can have the EXPR node visitor check all our keys for us.
The final code looks like this:
from parsimonious.grammar import Grammar, NodeVisitor
from parsimonious.exceptions import ParseError, IncompleteParseError, VisitationError
grammar = Grammar(r"""
EXPR = ITEM+
ITEM = (BYR / IYR / EYR / HGT / HCL / ECL / PID / CID) WHITE?
WHITE = ~r"[\s]+"
BYR = "byr:" NUMB
IYR = "iyr:" NUMB
EYR = "eyr:" NUMB
HGT = "hgt:" (HGTCM / HGTIN)
HCL = "hcl:" ~r"#[0-9a-f]{6}"
ECL = "ecl:" ("amb" / "blu" / "brn" / "gry" / "grn" / "hzl" / "oth")
PID = "pid:" ~r"[0-9]{9}"
CID = "cid:" ~r"[0-9a-zA-Z]*"
HGTCM = NUMB "cm"
HGTIN = NUMB "in"
NUMB = ~r"[0-9]{2,4}"
""")
class PassportVisitor(NodeVisitor):
def visit_EXPR(self, node, visited_children):
assert not {"BYR", "IYR", "EYR", "HGT", "HCL", "ECL", "PID"}.difference(visited_children)
def visit_ITEM(self, node, visited_children):
return node.children[0].children[0].expr_name
def visit_BYR(self, node, visited_children):
assert 1920 <= visited_children[1] <= 2002
def visit_IYR(self, node, visited_children):
assert 2010 <= visited_children[1] <= 2020
def visit_EYR(self, node, visited_children):
assert 2020 <= visited_children[1] <= 2030
def visit_HGTCM(self, node, visited_children):
assert 150 <= visited_children[0] <= 193
def visit_HGTIN(self, node, visited_children):
assert 59 <= visited_children[0] <= 76
def visit_NUMB(self, node, visited_children):
return int(node.text)
def generic_visit(self, node, visited_children):
return visited_children or node
pv = PassportVisitor()
pv.grammar = grammar
data = open("input.txt").read().split("\n\n")
valid = 0
for entry in data:
try:
pv.parse(entry)
except (ParseError, VisitationError, IncompleteParseError):
continue
else:
valid += 1
print("Valid:", valid)
At the end here we are looping over the entries, and catching for one of several different types of error the parser can throw when there's an exception (including our own assert
)
At 64 lines of code, it's not the most compact of parsers possible, but hopefully this demonstrates some ideas about how you can approach writing parsers for tasks like this without having to re-invent syntax parsing.
Onwards!
Posted on December 5, 2020
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.