Leonhard Euler Meets Genome Sequencing
Alexander Kulikov
Posted on April 10, 2024
What is the largest jigsaw puzzle ever faced by humankind?? It is the human genome assembly puzzle solved in 2022 after computer scientists used millions of short overlapping DNA fragments to sequence the complete human genome without any gaps. Intriguingly, the challenge of genome sequencing is closely related to some algorithmic problems one may face during a coding interview. Fear not, we will teach you how to ace them!
Genome Assembly
Picture a genome as a long string composed of four nucleotides: A, C, G, and T. As the first approximation, you can think of your genome as a 3-billion nucleotide long string. Biologists still have not figured out how to read the genome from the first to the last letter like we read a book. Instead, they shatter the genome into millions of small pieces and try to assemble them into contiguous chromosomes. In the Genome Assembly Problem, the goal is to assemble a genome from a collection of its short overlapping fragments called reads, not unlike assembling jigsaw puzzles.
To illustrate, here are five identical copies of a “mini-genome” and a set of reads extracted from each copy. Note that some fragments from each copy are lost during genome shattering, making the assembly problem even more intricate. And just like the random order in which jigsaw pieces are handed to us, the reads are also given in a random order. Our goal is to reconstruct a genome from these reads.
This is how a solution of the genome assembly for this toy example may look like.
Three Puzzles: One Stroke, Snow Plow, and Universal String
What is in common between the following three puzzles?
In the One Stroke Puzzle, the goal is to trace each segment in the given figure exactly once without lifting your pencil. It turns out that the One Stroke Puzzle can be solved for the left figure, but cannot be solved for the right figure. Can you figure out what is the key difference between these two puzzles? And how would you solve the One Stroke Puzzle for a figure with millions of segments?
In the Snow Plow Puzzle, you need to find the shortest route for a snow truck to clean every city block and return to its original location. Note that in this problem, the truck may need to visit the same block more than once. You need to find a route visiting the minimum number of blocks. Try our interactive puzzle! How would you solve the Snow Plow Puzzle for New York City with a whooping 120,000 city blocks?
In the Universal String Puzzle, the goal is to find a string that contains each binary string of length k as a substring exactly once. For example, the string 0011101000 solves the Universal String Puzzle for k=3 because it contains each binary string of length 3 exactly once. Can you solve the Universal String Puzzle for k=10?
You are probably wondering why we used the “DNA logo” for the Universal String Puzzle. It turned out that the algorithms for this problem are quite similar to the algorithms that biologists use for genome assembly. In this post, we will be designing algorithms for solving these three seemingly different puzzles using the same algorithmic idea!
Leonhard Euler and Seven Bridges of Königsberg
Meet Leonard Euler, one of the world's greatest mathematicians who lived three centuries ago. He knew nothing about our three puzzles, let alone genome assembly. Instead, he was interested in the “Seven Bridges of Königsberg” problem and solved it. Euler's idea is the key to both assembling genomes and solving our three puzzles!
In the 18th century, Königsberg consisted of four parts connected by seven bridges. The Seven Bridges of Königsberg puzzle asks you to find a walk through Königsberg that visits each bridge exactly once.
This is how such a walk could look like. Starting at the northern part, we visit bridges one by one. By the end of the day, we end up in the eastern part after visiting six out of seven bridges.
We missed one bridge, and we cannot traverse it without visiting any other bridge again, thus failing to solve the “Seven Bridges of Königsberg” puzzle. But our failure does not imply that the puzzle has no solution! Indeed, maybe there is a better strategy that results in a walk visiting each bridge exactly once. Try our interactive puzzle!
To show you the solution, we will transform Königsberg with four parts and seven bridges into a graph with four nodes and seven edges. Each part corresponds to a node in the graph, and each bridge corresponds to an edge connecting two nodes. Afterwards, we remove the map of the city and the bridges. What is left is a simple One Stroke Puzzle represented as a graph! Euler laid the foundations of graph theory by finding an approach that solves this puzzle for arbitrarily large graphs.
What is a Graph?
A graph is a collection of nodes and edges between them. The graph below has seven nodes and twelve edges between them.
The degree of a node is defined as the number of edges incident to this node. Below, we show the degrees of all nodes. Note that all nodes in this graph have even degrees — we classify such nodes as “even” nodes.
A cycle in a graph is called Eulerian if it visits every edge exactly once. Recall that a cycle must start and end at the same node. A graph is called Eulerian if it has an Eulerian cycle.
This toy example reveals that all nodes of an Eulerian graph are even. Indeed, for each node, the number of times an Eulerian cycle enters it is equal to the number of times it leaves it. We say that a graph is balanced if all its nodes are even.
We just proved that each Eulerian graph is balanced. But is it true that each connected balanced graph is Eulerian? Euler proved that this is indeed the case: a connected graph is Eulerian if and only if it is balanced.
Eulerian Cycle Theorem. A connected graph is Eulerian if and only if it is balanced.
We will now recruit an ant to prove that a connected balanced graph has an Eulerian cycle.
Let an ant randomly walk through the graph without visiting the same edge twice! If ant was a genius, it would just construct an Eulerian cycle in a single walk.
A less intelligent ant would randomly choose a node and start walking. The ant can only get stuck at the starting node because, since all nodes are even, there is a way out of each other node that the ant enters.
The ant has completed a cycle but has not proved Euler’s theorem yet... The constructed cycle is not Eulerian. Can you instruct the ant to traverse the graph differently to enlarge this initial cycle?
Let’s start at a node with still unexplored edges and traverse the constructed cycle starting from this new node.
Then, after traversing this cycle, the ant will end up in a node with still unexplored edges and will be able to continue the walk. We will thus enlarge the original cycle!
After completing the cycle, start random exploration of still untraversed edges in the graph.
No Eulerian cycle yet… Can we again enlarge the green-blue cycle by traversing it starting from a new node? Which one?
Again, the ant should choose a node with yet unused edges. For example, this one:
From Eulerian Cycle to Eulerian Path
We now know how to find an Eulerian cycle, but how can we find an Eulerian path that does not require the ant to return to the starting node? The One Stroke Puzzle is equivalent to the Eulerian Path rather than the Eulerian Cycle Problem. It turns out that a graph has an Eulerian path if and only if it is either balanced or has two odd nodes.
Eulerian Path Theorem. A connected graph has an Eulerian path if and only if it is either balanced or has two odd nodes, i.e., nodes of odd degree.
Assume that there are two odd nodes. By joining them by a virtual edge, we get an Eulerian graph. Let's start traversing an Eulerian cycle starting from the virtual edge. Then, we traverse the rest of the cycle. By removing the virtual edge, we get an Eulerian path between two nodes of odd degree that starts at the second edge and ends at the last edge of the constructed Eulerian cycle.
Eulerian Cycle Theorem for Directed Graphs
Our ant proved the Euler theorem for undirected graphs. Now, teach the same ant to prove the Euler theorem for directed graphs. It turned out to be straightforward. A directed graph is balanced if the indegree of each node in this graph equals its outdegree.
Eulerian Cycle Theorem for directed graphs. A strongly connected directed graph is Eulerian if and only if it is balanced.
Seven Bridges of Königsberg: Solution
Now, back to the graph for Seven Bridges of Königsberg puzzle. Since every node in this graph has an odd degree, there is no walk through Königsberg visiting every bridge exactly once!
Eulerian Cycles in Python
In Python, to find an Eulerian cycle of a graph, one can first construct a graph by passing the list of edges and then call a built-in method.
import networkx as nx
graph = nx.Graph([
(1, 2), (1, 3), (1, 4), (2, 3),
(2, 4), (3, 4), (3, 5), (4, 5)
])
print(nx.eulerian_path(graph))
One Stroke Puzzle: Solution
Now, let's get back to the One Stroke Puzzle. We represent each figure as a graph and mark odd-degree nodes in red. A figure can be drawn without lifting a pencil if it has zero or two odd nodes. Thus, to check whether a One Stroke Puzzle can be solved, you need to count the number of nodes of odd degree!
Snow Plow Puzzle: Solution
Now, we get back to our Snow Plow Puzzle. Recall that the goal in this problem is to visit every city block at least once while minimizing the number of visited blocks.
Consider the grid-graph for the Snow Plow Puzzle. There are eight nodes of odd degree in this non-Eulerian graph. Since the snow plow has to return to its original position, it traverses an Eulerian cycle in a graph where each edge that is traversed
N times are substituted by N parallel edges. Thus, our goal is to turn the grid graph into Eulerian by adding as few such edges as possible.
For the 3-by-3 grid, the best way is to add four edges. Each added edge indicates that the snow plow will traverse a city block more than once. The resulting graph is Eulerian. The number of edges in the resulting graph is 28, and this is exactly the length of the optimal snow plow route!
Interestingly, there is also a built-in Python method that allows one to turn a non-Eulerian graph into an Eulerian by adding the minimum number of additional parallel edges to the existing edges of the graph. As in our example, it first identifies nodes of odd degree in a graph and then joins them by paths of minimal length to make the graph Eulerian. This leads to 4-line Python code for solving the Snow Plow Puzzle.
import networkx as nx
grid = nx.MultiGraph(nx.grid_2d_graph(m=4, n=4))
grid = nx.eulerize(grid)
cycle = nx.eulerian_circuit(grid, source=(0, 0))
print('→'.join(str(edge[0]) for edge in cycle))
For the four-by-four grid, it constructs the following solution.
Universal String Puzzle
We saw how to solve the One Stroke Puzzle using Euler’s theorem. But how would you solve the Universal String Puzzle?
For simplicity, we will focus on the circular version of the Universal String Puzzle — constructing a circular string that contains each binary k-mer as a substring exactly once. Below, we show the answer for k equal to 2 and 3.
We now show how to find the shortest circular string containing all binary 4-mers as substrings. To this end, we create a graph whose nodes are all binary 3-mers. Then, each 4-mer is represented as an edge from its prefix to its suffix in this graph. It turns out that this graph is Eulerian and its Eulerian cycle spells a universal string. We leave it to you as an exercise.
Conclusion
You have now learned how to solve three different algorithmic problems, all based on the same “Seven Bridges of Königsberg” puzzle. This puzzle is not as old as the Josephus Problem that we discussed in the “Computer Science Meets Ancient History” edition of our “Ace Your Next Coding Interview by Learning Algorithms” series. However, there is something about this puzzle that makes it unique. Indeed, puzzles, as fun as they can be, are not necessarily useful for solving scientific problems. And Euler, of course, had no clue that the “Seven Bridges of Königsberg” Puzzle will turn into a workhorse of the multi-billion dollar genome sequencing industry that has transformed modern biomedicine. And where can you learn about other algorithmic challenges that often pop up at the coding interviews? Just stay tuned for the next edition of our “Ace Your Next Coding Interview by Learning Algorithms” series!
Posted on April 10, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.