Advent of Code 2020: Day 01 using numpy and vectorized calculations
Yuan Gao
Posted on December 4, 2020
I'm going to be doing Advent of Code 2020 in Python, but for every challenge I will try to explore a different way of doing it, and maybe learn a few new libraries in the process.
Things mentioned in this post: context managers, generators, list comprehension, itertools, set theory, numpy, cupy
The Challenge Part 1
Link to challenge on Advent of Code 2020 website
Boiling the challenge down: you are given a list of numbers containing two entries that sum to 2020, and the challenge is simply to find those two numbers (and then multiply them together to submit the result).
The example given is:
1721
979
366
299
675
1456
Here the correct answer is 1721 and 299. The actual list given to you is much longer than this, but still has exactly two numbers that add up to 2020.
Advent of Code gives you a file to download, which you can save as a text file. I've saved it as input.txt in my workspace.
Reading the file
The data comes as a file containing one number per row. Unfortunately, if we read this file directly, we get a bunch of strings, so we have to convert them to numbers before we can do math with them. Here's a fairly typical approach in Python.
data = []
with open("input.txt", "r") as fp:
    for value_str in fp:
        data.append(int(value_str))
It's basic and won't win you any awards. But there's actually something interesting going on here worth talking about.
Context managers
Why do we use with like this? In Python, with is used as part of a "context manager". Context managers are smart little things that automatically run some code when entering and when exiting the context. This is very good for file operations, database operations, and anywhere else you want your file/connection to be closed automatically, even if your program crashes due to an exception.
In our production web backend code, we use a database that has a transaction feature. A transaction means that every database operation completed up to that point can be rolled back (undone) if needed. This is very useful when some actions require multiple things to happen: should there be a failure, we don't leave half-completed database operations behind.
File operations are the same, though it's totally possible to write this code without a context manager (see below). The only difference is that should an exception occur somewhere between the open() and the close(), the file handle stays open. In this case that's not a problem, because nobody else is using the file and this program finishes quickly. But in production, maybe you're opening files that other processes are waiting to write to. It's best practice to use context managers for file operations so that things are closed neatly.
data = []
fp = open("input.txt")
for value_str in fp:
    data.append(int(value_str))
fp.close()
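To see what the context manager protocol actually looks like, here's a minimal hand-written sketch. The class name ManagedFile, the sample file, and its contents are my own inventions for illustration; real code should just use open() directly, which already implements this protocol:

```python
# create a small sample file so this snippet is self-contained
with open("sample.txt", "w") as out:
    out.write("1721\n979\n366\n")

class ManagedFile:
    """A toy context manager: opens a file on entry and guarantees
    it is closed on exit, even if an exception occurs inside the with block."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def __enter__(self):
        self.fp = open(self.path)
        return self.fp  # becomes the "as" target

    def __exit__(self, exc_type, exc_value, traceback):
        self.fp.close()
        return False  # don't swallow any exception that occurred

with ManagedFile("sample.txt") as fp:
    data = [int(value_str) for value_str in fp]

print(data)        # [1721, 979, 366]
print(fp.closed)   # True -- __exit__ ran automatically
```

The __enter__/__exit__ pair is exactly what with calls behind the scenes, which is why the file ends up closed without an explicit fp.close().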
Generators
The other thing going on here is that we're directly saying for value_str in fp. fp is our open file, so how is it that we can use it in a for..in loop like this? Python's file objects actually have a bunch of extra features that let them be used in several different ways; one of them is as a generator, allowing you to pull one row at a time out of the file simply by using it in a for..in loop. If you're interested in the technical implementation: under the hood, the returned file object fp has a __next__() method that just does self.readline().
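You can poke at this protocol directly: the built-in next() hands back one line at a time from a file object, which is exactly what the for..in loop does behind the scenes. (The sample file and its contents below are made up so the snippet runs standalone.)

```python
# write a small sample file so the snippet is self-contained
with open("sample.txt", "w") as out:
    out.write("1721\n979\n366\n")

fp = open("sample.txt")
first = next(fp)    # equivalent to fp.readline() under the hood
second = next(fp)   # each call advances through the file
fp.close()

print(repr(first))   # '1721\n'
print(repr(second))  # '979\n'
```

Note that the lines come back with their trailing newline still attached; int() happens to tolerate that whitespace, which is why int(value_str) works without an explicit .strip().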
If you didn't want to use the generator, then perhaps it would look something like this:
data = []
with open("input.txt") as fp:
    while True:
        value_str = fp.readline()
        if not value_str:
            break
        data.append(int(value_str))
That's no fun. We have to manually read each line in a loop, and exit the loop if that line was empty.
Although... if you're here to have fun, Python 3.8 introduced the walrus operator :=, which actually injects some fun into this poor code:
data = []
with open("input.txt") as fp:
    while value_str := fp.readline():
        data.append(int(value_str))
Look at that! It's fun again.
And in case you were wondering, yes, it's possible to get more compact using list comprehension and the handy .readlines() method, which returns a list of the lines in a file:
data = [int(value_str) for value_str in open("input.txt").readlines()]
There's even better ways to do it than this, I'll show you in a later code chunk.
The naïve solution
There's no shame in starting with the obvious but inefficient solution. In the real world, this is often enough!
So we have a list of numbers. The "brute-force" solution here is simply to take each number, add it to each other number, and see if the total is 2020. Let's go ahead and do that and see what we learn.
count = len(data)

# outer loop
for i in range(count):
    # inner loop
    for j in range(i, count):
        # check numbers
        left_number = data[i]
        right_number = data[j]
        if left_number + right_number == 2020:
            print("Found it!", left_number, right_number)
Output:
Found it! 201 1819
So the outer loop runs i from 0 up to however long the data is. Then, for each of those, the inner loop runs j from i up to however long the data is, thereby covering every combination. We simply check whether a pair adds up to 2020 and print it out.
There's quite a few things wrong with this code. It's not great code. We can do better.
A better Naïve solution
Firstly, we should make better use of Python's for..in. We don't need the i and j counter variables; Python isn't that kind of language, and using counters like this is pretty rare. Instead, we can iterate over the data directly:
# outer loop
for left_number in data:
    # inner loop
    for right_number in data:
        if left_number + right_number == 2020:
            print("Found it!", left_number, right_number)
Output:
Found it! 201 1819
Found it! 1819 201
What the! We have two solutions. Yes, unfortunately using two loops like this means we lose out on the little optimization we did earlier, where the inner loop starts at the outer loop's index; and since we don't exit the loop once a solution is found, the code went ahead and found the reciprocal pair too. Let's fix that.
# outer loop
for idx, left_number in enumerate(data):
    # inner loop
    for right_number in data[idx:]:
        if left_number + right_number == 2020:
            print("Found it!", left_number, right_number)
The built-in enumerate() function now causes the for..in to return both the original value and its index (basically the equivalent of i from before). We can then use that index in the inner loop to "slice" data and start from part-way up. Both of these things are what some people call "pythonic".
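A quick demonstration of both tricks on a toy list (the values here are arbitrary, just for illustration):

```python
data = [10, 20, 30]

# enumerate() yields (index, value) pairs
pairs = list(enumerate(data))
print(pairs)  # [(0, 10), (1, 20), (2, 30)]

# slicing with data[idx:] starts the list part-way through
tail = data[1:]
print(tail)  # [20, 30]
```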
Letting libraries deal with it
One of the important ideas in programming is that certain things are "solved problems": things so well-trodden that there is almost no benefit in re-inventing them. Countless people have written this code before, so there is bound to be a library out there that is well-maintained and well-documented enough to solve the problem, and we try not to write our own code for solved problems. Iterating over all the combinations of a set of things is a solved problem. So there should be a library we can use, right?
Yep, it's such a common problem that Python ships with a module called itertools, which contains various tools for iteration. In our case we want the combinations() function, which, given our data, will simply spit out every single combination of two (or more, if we wanted) values from it:
from itertools import combinations

for left_number, right_number in combinations(data, 2):
    if left_number + right_number == 2020:
        print("Found it!", left_number, right_number)
        break
Much cleaner. No more nested loops; we just ask combinations() to give us one unique combination of 2 elements from data on each iteration of the for loop, leaving us to check whether they sum to 2020.
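To make that concrete, here's what combinations() yields on a short list (a three-element subset of the example input, chosen just to keep the output small):

```python
from itertools import combinations

# every unordered pair, each appearing exactly once
pairs = list(combinations([1721, 979, 366], 2))
print(pairs)  # [(1721, 979), (1721, 366), (979, 366)]
```

Note there's no (979, 1721): combinations() treats (a, b) and (b, a) as the same pair, which is exactly the de-duplication our sliced inner loop was doing by hand.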
Even though we let the itertools library do the heavy-lifting, this is still a brute-force method, where every combination is checked. Is there another way to do it?
Getting mathematical
With a little maths, yes we can. Since we're looking for two numbers that add up to 2020, for any given number in the list we know exactly what its matching pair looks like: it's just 2020 minus that number! So for any given number on the list, we can calculate its match and then go see if it exists elsewhere in the list!
for left_number in data:
    # calculate match
    right_number = 2020 - left_number
    # does it exist?
    if right_number in data:
        print("Found it!", left_number, right_number)
        break
Here's another pythonic thing: if..in can very easily check whether a value is in a list.
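One caveat worth knowing: in on a list scans every element, so the loop above is still quadratic overall. Converting the data to a set first makes each membership test effectively constant-time. A sketch of that variant, run on the example input (the seen variable name is my own):

```python
data = [1721, 979, 366, 299, 675, 1456]  # the example input

seen = set(data)  # set membership checks are O(1) on average
for left_number in data:
    right_number = 2020 - left_number
    if right_number in seen:
        print("Found it!", left_number, right_number)  # Found it! 1721 299
        break
```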
Getting upSET about things
What we're doing here is going through each number in data, calculating the complementary value, and then seeing whether that complementary value is inside data. What if I said we could do this all in one go? What if, instead of going through each number in turn, we created the whole list of complementary values up front, and then checked whether any value in that list appears in the original data list?
This is set theory! We want to find the intersection between data and the complement values; therein lies our answer. And Python has powerful set features.
First we have to create our complement list:
complement = []
for number in data:
    complement.append(2020 - number)
Not so fast, this is not pythonic enough. We must switch to list comprehension:
complement = [2020-number for number in data]
Much better. Python's list comprehension is very powerful and allows writing compact code. The two code snippets above are pretty much identical; however, there's a specific reason I want to use list comprehension, which will become clear in a sec.
Now that we have our complement list, we can go ahead and create python Set objects from them:
complement = [2020-number for number in data]
c_set = set(complement)
d_set = set(data)
print("Found it!", c_set.intersection(d_set))
Output:
Found it! {201, 1819}
As you can see, with python's Set features, and a little bit of math, it's very easy to find the answer.
Now... why did I switch to list comprehension? Because Python loooves comprehensions: in addition to list comprehension, there are also dict and set comprehensions. We can generate the complement set directly:
complement = {2020-number for number in data}
Simply by switching to curly braces, complement is now a set rather than a list. What's more, the intersection() method will automatically convert lists given as arguments, meaning our whole code, using the short-form data input line from earlier, now looks like this:
data = [int(value_str) for value_str in open("input.txt").readlines()]
complement = {2020-number for number in data}
print("Found it!", complement.intersection(data))
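Run against the example list from the top of the post (inlined here so the snippet is self-contained), this gives:

```python
data = [1721, 979, 366, 299, 675, 1456]  # the example input
complement = {2020 - number for number in data}
answer = complement.intersection(data)
print(answer)  # a set containing 1721 and 299

# the product is what Advent of Code actually wants submitted
a, b = answer
print(a * b)  # 514579
```

Note that because a set is unordered, which of 1721 and 299 lands in a versus b is arbitrary, but the product is the same either way.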
SHORTER, FASTER
So, I mentioned "solved problems". It turns out reading numbers from files is also a solved problem, as is subtracting a number from a big list of numbers. These are all common tasks in scientific computing, and Python has one very important library for that: numpy. Numpy is a numerical computing/multi-dimensional array computing library jam-packed with features to help data scientists and mathematicians with their data. I'd go as far as to say that if you learn Python, you should know numpy; it comes up a lot.
Numpy conveniently has functions to do all of this, including subtracting numbers from everything, and doing intersections. So the code just looks like this:
import numpy as np
data = np.loadtxt("input.txt", dtype=int)
complement = 2020 - data
print("Found it!", np.intersect1d(data, complement))
Output
Found it! [ 201 1819]
That's it! Four lines, no loops. Load data, calculate complement, run intersection between these two lists. Done.
Although the line 2020 - data looks like a regular subtraction, because data is now a numpy array, numpy is smart enough to know that this means doing the subtraction for every element of data, creating a new array complement with as many elements as data had. A lot of numpy code looks like regular mathematical operations but in fact applies to big arrays/matrices all at once.
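Here's that element-wise behavior on the example numbers (requires numpy to be installed):

```python
import numpy as np

data = np.array([1721, 979, 366, 299, 675, 1456])
complement = 2020 - data  # one expression, six subtractions
print(complement.tolist())  # [299, 1041, 1654, 1721, 1345, 564]

pair = np.intersect1d(data, complement)  # sorted common values
print(pair.tolist())  # [299, 1721]
print(pair.prod())    # 514579
```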
As an aside, since the Advent of Code challenge wants you to multiply the two numbers together, you can get that directly with the .prod() method too:
np.intersect1d(data, complement).prod()
I SAID FASTER
Python is not the fastest language out there; it's actually well known for being a little slow. Numpy gets around this by being written in C, which means that when you use numpy, you are using blazing-fast, optimized number-crunching code. This is great for data scientists: though Python is slow, the heavy lifting is done by numpy.
A lot of people consider how easy it is to write Python to be a major advantage that outweighs its disadvantages. For a lot of tasks, like data science, where code might be run a handful of times to process some data and then put aside, this speed of writing has real benefits. After all, it's faster and probably cheaper to just write some code and leave it number-crunching overnight than to spend several days optimizing the code in C/C++ just so it can complete in a few minutes.
So Python has dominated the data science scene, and since it is inherently not a fast language, rather than force data scientists to learn C/C++ and refactor their code, the Python ecosystem has a different approach to improving performance: parallelism.
Python's approach is, yes, basically "throw more computing power at it", while still being easy to write. Python has a plethora of libraries and tools (many are built-in) that unlock multi-processing, or processing across multiple machines, or even moving processing to the GPU.
To illustrate this kind of approach, the above numpy code can easily be converted to run on the GPU, simply by swapping numpy for a compatible library, cupy. Where numpy's implementation under the hood is in C, cupy's implementation uses CUDA for nVidia GPUs. It maintains the same functions as numpy, which means in most cases it is a simple drop-in replacement.
import cupy as np
data = np.loadtxt("input.txt")
complement = 2020 - data
print("Found it!", np.intersect1d(data, complement))
See what I did there? Python lets you rename imports as you import them. Since I did import numpy as np earlier, I can now simply import cupy as np and have cupy masquerading as numpy. Since it has the same functions, everything still works, and I've moved my processing onto the GPU with a two-letter change to my code.
Ultimately this particular problem was probably too small to see any benefits in speed, but hopefully this illustrates something about how we use Python for numerical problems.
The Challenge Part 2
The second part of the question is a small extension of the earlier problem to 3 numbers. In this case, unfortunately, it's no longer straightforward to use the set or numpy intersections with 3 numbers. But fortunately, the earlier itertools-based solution extends to 3 numbers very easily, so it's a simple case of:
from itertools import combinations

for a, b, c in combinations(data, 3):
    if a + b + c == 2020:
        print("Found it!", a, b, c)
        break
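Though the one-shot intersection trick doesn't carry over, the complement idea from earlier still helps: fix two numbers, and the third must be 2020 minus their sum, which a set lookup can check instantly. A sketch of that variant on the example input (this combinations-plus-set structure is my own, not part of the challenge):

```python
from itertools import combinations

data = [1721, 979, 366, 299, 675, 1456]  # the example input
seen = set(data)

for a, b in combinations(data, 2):
    c = 2020 - a - b  # what the third number must be
    # note: this sketch doesn't guard against c being a or b itself
    if c in seen:
        print("Found it!", a, b, c)  # Found it! 979 366 675
        break
```

This checks O(n²) pairs instead of O(n³) triples, which matters more as the input grows.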
Onwards!