Building a Mobile Document Scanner with Zero Dependencies: The Hough Transform
101arrowz
Posted on January 3, 2022
After the Sobel operator provides the gradient of the image, we're most of the way to finding the edges of the document. If you don't know what the Sobel operator is, I strongly recommend reading the previous article in the series first.
However, a visual representation of the edges alone isn't enough; we need a mathematical representation of every edge in the image in order to find their intersections (the corners of the document). For that, we can use the Hough transform.
The Hough transform allows us to find imperfect matches for arbitrary visual patterns using a bucketed voting system. There are two ways to understand this algorithm: mathematical and intuitive. Let's go through both before discussing how we can implement it.
In Mathematical Terms
Since the Hough transform can technically find circles, ellipses, triangles, or any other arbitrary pattern, we would need an individual mathematical analysis for each type of pattern we want to detect.
For the purposes of this project, we're searching for lines, which are what the Hough transform was originally designed for and are therefore the easiest type of pattern to detect. (If you're wondering why we don't just look for rectangles to find the document, we'll get to that near the end.)
First, let's decide how we want to represent our lines mathematically. A natural choice might be the famous slope-intercept form:

y = mx + b
This form allows us to represent any line that could possibly exist in 2-D space by modifying the parameters m (the slope of the line) and b (the y-intercept). If we want a line with a 30 degree incline that's 1200 pixels from the bottom of the image, we can use:

y = tan(30°) · x + 1200 ≈ 0.577x + 1200
This appears visually accurate when plotted too:
The one issue with this representation is what happens when we try to create a vertical line. Vertical lines do not move horizontally: their run is always zero, while their rise is arbitrary. Technically, we could use positive or negative infinity to represent the slope, but then we would have no way to know where on the x-axis the line is located, since this equation only specifies the y-intercept.
Although it's possible to work around this problem, it's also important to consider the fact that we want to be able to differentiate between visually different lines, but this form makes it difficult to do so. Consider these four lines:
The black line has slope 0.1 (i.e. m = 0.1), the purple line slope 2, the blue line slope 10, and the red line slope 30.
Although the red and blue lines are visually very similar, their slopes differ by 20; and although the purple and black lines appear quite different, their slopes differ by only 1.9. If we want to use slope, we would need to find some way to emphasize small differences in slope at lower values.
Instead of dealing with all of these problems, we can represent the lines more accurately using polar coordinates.
"Normal" coordinates are also known as Cartesian coordinates: they are represented as (x, y)
, where x is the location on the horizontal axis and y is the location on the vertical axis. Polar coordinates are instead represented as (r, Θ)
, where r is the distance from the origin and theta is the angle counterclockwise from what would be the positive x-axis in Cartesian coordinates. Here are a few examples:
Polar coordinates and Cartesian coordinates always satisfy the following equations:

x = r cos Θ
y = r sin Θ
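In code, these conversions are one-liners. Here's a quick TypeScript sketch (the function names are my own, and Θ is in radians):

```ts
// Convert polar (r, theta) to Cartesian (x, y); theta in radians.
function polarToCartesian(r: number, theta: number): [number, number] {
  return [r * Math.cos(theta), r * Math.sin(theta)];
}

// Convert Cartesian (x, y) back to polar (r, theta).
function cartesianToPolar(x: number, y: number): [number, number] {
  return [Math.hypot(x, y), Math.atan2(y, x)];
}
```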
Although we could convert our original form y = mx + b into polar coordinates, we'd end up with the same issues around visual similarity and vertical lines. Instead, we can use Hesse normal form, which can represent lines using a single polar coordinate.
Most online explanations make Hesse normal form more complicated than necessary for our purposes, so here's an intuitive explanation. Imagine you have an arbitrary polar coordinate. Draw a segment from the origin to this coordinate. Now, draw a line through the coordinate perpendicular to that segment. This line is uniquely identified by the polar coordinate.
Here's a graph of what that looks like:
The green line segment connects the origin to the point, so the perpendicular purple line is the line we can describe using the point (5, 30°).
This gives us an easy way to differentiate between lines: if the points are far from one another, the lines are visually different. There are no more cases where a small change in a variable causes a major visual change for the line because r and theta each have a "linear" visual effect. For example, a change in theta of 10° will always cause a similar visual difference for the line, no matter what the exact value of theta is.
More importantly, Hesse normal form makes it easy to find the lines that any coordinate in Cartesian space lies on. If we know the angle Θ in Hesse normal form and have a Cartesian coordinate (x, y) that the line passes through, we can solve for r:

r = x cos Θ + y sin Θ
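As a tiny TypeScript helper, that solve might look like this (a sketch; hesseR is a name of my own choosing, and Θ is in radians):

```ts
// Distance r (from the origin) of the Hesse-normal-form line with normal
// angle theta that passes through the Cartesian point (x, y).
function hesseR(x: number, y: number, theta: number): number {
  return x * Math.cos(theta) + y * Math.sin(theta);
}
```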
Any two points that lie on the same line of angle Θ will produce the same value of r in this equation. We'll discuss why this quality is so important soon. For now, I'll provide an intuitive explanation of the voting process in the Hough transform.
Buckets of Paint
Imagine you've been tasked with finding the most common color of paint out of one million buckets.
One solution could be to go through each bucket and keep a tally of how many buckets you've seen with each color. However, that approach offers very limited precision: you can't give an exact color but rather something general like "green" or "yellow." Additionally, this solution does not take into account variations in the amount of paint per bucket.
A better solution would be to create a large grid of empty paint tanks, where going up the grid yields brighter colors and moving to either side yields a different hue. In other words, we could find where in the following plot each color lies:
I know this plot disregards saturation, but for the purposes of this example we can assume every color of paint is fully saturated.
Imagine that there are grid lines along every degree of hue and every 0.01 increase in value in the above plot. We can estimate the hue and value of each bucket of paint, then dump the contents of the bucket into the tank in the grid corresponding with that hue and value.
For instance, if we come across a bucket with dark red paint, we would dump it into one of the tanks in the bottom left corner of the grid (since the bottom region has darker colors and the left region has red colors).
At the end, we could find the tanks with the most paint to determine the most common color in the paint.
This approach solves two of the problems with our original tallying approach. Since we're pouring out the buckets into a grid, we accurately account for any differences in the amount of paint per bucket. More importantly, our final result is an exact color, and in theory the maximum error relative to the true most common color is the size of one tank (one degree in hue and 0.01 in value).
It's important to note that this approach would be a poor choice if we did not have so many buckets of paint as data points. For instance, if there were only a few thousand buckets, the majority of the 18,000 tanks would be completely empty after we finished pouring all the paint out, and tiny errors in our approximation of the color would cause incorrect results.
For example, if we found ten buckets with almost the exact same shade of yellow with slightly different brightness, we might place them in ten separate tanks, whereas two bright red paint buckets that we estimated to have the exact same hue and brightness would go in the same tank. At the end, we would find two buckets worth of paint in the bright red tank, and only one bucket worth in each of the yellow tanks, so our algorithm would decide that red was the most common color even though yellow was clearly more prevalent.
Where's this analogy going?
If you recall from earlier, we discussed how Hesse normal form lets us represent any line with a point in polar coordinates, and how visually similar lines can be represented by coordinates that are mathematically near each other. Let's discuss how we can actually use it to find lines in our gradient image.
For every pixel in the image, we can find all of the lines going through the image that the pixel could possibly lie on. For now, we'll assume that a line in every direction is possible. We can loop from Θ = 0° to Θ = 179° in one-degree increments and solve for r using the equation from earlier to find 180 potential lines in Hesse normal form (r, Θ) per pixel. (Note we don't go to 359° because lines extend infinitely in two opposite directions, so any angle of 180° or above yields a line identical to one with an angle below 180°.)
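As a sketch, enumerating those 180 candidate lines for a single pixel might look like this in TypeScript (candidateLines is a hypothetical helper, not the project's actual code):

```ts
// Enumerate the 180 candidate Hesse-normal-form lines through pixel (x, y),
// one per integer angle from 0 to 179 degrees.
function candidateLines(x: number, y: number): { r: number; theta: number }[] {
  const lines: { r: number; theta: number }[] = [];
  for (let deg = 0; deg < 180; ++deg) {
    const theta = (deg * Math.PI) / 180; // degrees to radians
    lines.push({ r: x * Math.cos(theta) + y * Math.sin(theta), theta });
  }
  return lines;
}
```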
So now we have 180 mathematical lines per pixel in the image. What can we actually do with that?
Remember that we're trying to find the lines that correspond with edges in the image; in other words, lines that go through many pixels with a high gradient magnitude. If we consider the 180 lines through each pixel with high gradient magnitude, we can search for the lines that appear in many of those pixels and confidently claim that those are the edges in the image.
However, it's almost impossible to find the exact same (r, Θ) in two separate pixels because we're not restricted to integers for r. Therefore, we need to find the lines that most nearly go through pixels with high gradient magnitude.
The paint-bucket problem and the problem we have to solve here are actually quite similar. In the paint-bucket problem, we were searching for an approximate paint color that was most common in terms of hue and value. Here, we need to find an approximate line that is most common, in terms of r and Θ, among all the lines passing through pixels with high gradient magnitude.
We can apply the same solution we used for the paint-bucket problem here! We create a grid of numbers ranging from Θ = 0° to Θ = 179° as you move vertically, and from r = -d to r = d as you move horizontally, where d is the length of the image's diagonal (the hypotenuse of its width and height). For every pixel in the image, we find every line that passes through that pixel and add the value of the gradient magnitude to each position in the grid that corresponds to one of those lines.
This process is known as voting in the Hough transform because each line we calculate "votes" for the position in the grid most similar to itself, and the positions with the most votes are the edges we are looking for.
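Here's a minimal TypeScript sketch of the whole voting stage, assuming the gradient magnitudes arrive as a flat, row-major Float32Array (the layout and names are my assumptions, not the project's actual code):

```ts
// Accumulate Hough votes for a width x height gradient magnitude image.
function houghVote(magnitude: Float32Array, width: number, height: number): Float32Array {
  const d = Math.ceil(Math.hypot(width, height)); // max possible |r|
  const cols = 2 * d + 1; // r ranges from -d to d
  const acc = new Float32Array(180 * cols); // rows: theta in degrees, cols: r
  for (let y = 0; y < height; ++y) {
    for (let x = 0; x < width; ++x) {
      const mag = magnitude[y * width + x];
      if (mag === 0) continue; // nothing to vote with
      for (let deg = 0; deg < 180; ++deg) {
        const theta = (deg * Math.PI) / 180;
        const r = Math.round(x * Math.cos(theta) + y * Math.sin(theta));
        acc[deg * cols + (r + d)] += mag; // shift r by d to keep indices >= 0
      }
    }
  }
  return acc;
}
```

In a real implementation you'd precompute the 180 sine and cosine values once instead of recomputing them for every pixel.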
At the end, the locations with the greatest values must have an (r, Θ) line that passes through many points with high gradient magnitude. Therefore, these locations are the edges of the image in Hesse normal form.
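Reading the strongest candidates back out of that accumulator could look something like this sketch (names and layout carried over from the voting sketch above):

```ts
// Pull out the n accumulator cells with the most votes, recovering the
// (r, theta-in-degrees) line that each cell represents.
function strongestLines(acc: Float32Array, d: number, n: number) {
  const cols = 2 * d + 1;
  const cells: { votes: number; deg: number; r: number }[] = [];
  for (let i = 0; i < acc.length; ++i) {
    cells.push({ votes: acc[i], deg: Math.floor(i / cols), r: (i % cols) - d });
  }
  cells.sort((a, b) => b.votes - a.votes); // most votes first
  return cells.slice(0, n);
}
```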
Once that's done, we can trace the detected edges onto the image. We actually have some promising results!
As you can see above, we detected the edges of the document in red. Since they're lines and not segments, we didn't stop at the corners of the document, but we can easily find the intersections of these lines to find the corners of the document, which is one of the last steps for our document scanner!
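Since each edge is a line in Hesse normal form, finding a corner is just solving a two-equation linear system. Here's a sketch using Cramer's rule (the function name is mine; angles are in radians):

```ts
// Intersection of two Hesse-normal-form lines (r1, t1) and (r2, t2),
// or null if the lines are (nearly) parallel.
function intersect(r1: number, t1: number, r2: number, t2: number): [number, number] | null {
  // Determinant of the 2x2 system: cos(t1)sin(t2) - sin(t1)cos(t2) = sin(t2 - t1).
  const det = Math.sin(t2 - t1);
  if (Math.abs(det) < 1e-9) return null; // parallel lines never meet
  const x = (r1 * Math.sin(t2) - r2 * Math.sin(t1)) / det;
  const y = (r2 * Math.cos(t1) - r1 * Math.cos(t2)) / det;
  return [x, y];
}
```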
Finishing up
There are two optimizations we can make to this algorithm. Let's recap. After finding the gradient magnitude of the image, we iterate through each pixel and find the lines at each angle from 0° to 179° that go through that pixel in terms of (r, Θ) (Hesse normal form). For each of these 180 lines, we use the value of Θ as-is and round the value of r to an integer to calculate a row and column in a grid of numbers. We then add the gradient magnitude at the original pixel to that entry in the grid. At the end, the positions in the grid with the greatest values correspond to the lines (r, Θ) that are most likely to be edges.
At the moment, we assume that every angle from 0° to 179° is equally likely for a line going through any given point. However, if you remember from the previous article, we actually have the gradient magnitude AND the gradient direction from the Sobel operator. We know that the gradient direction is the direction of steepest ascent for the intensity of the image, so it should actually be nearly perpendicular to the edge at every pixel.
To envision this fact, imagine that you're standing on the edge of a cliff, and think about your distance from the center of the Earth as a function of your lateral position. You would come much closer to the center of the Earth if you stepped forwards, whereas moving in any other direction wouldn't change your vertical position as much, so the direction of the gradient is forwards. (I don't recommend verifying this experimentally.)
If you stepped backwards, you would move away from the edge of the cliff. The direction of the actual edge of the cliff is to your left and right, i.e. perpendicular to the gradient direction.
Using the knowledge that edges are nearly perpendicular to the gradient, we can stop assuming that every angle is equally likely. Instead of checking every angle, we will only allow the lines nearly perpendicular to the gradient at each pixel to vote.
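This optimization only changes the innermost loop of the voting sketch from earlier. Here's roughly what it could look like; the ±10° tolerance is an arbitrary choice of mine, not the project's actual value:

```ts
// Vote only for angles near the gradient direction at pixel (x, y). In Hesse
// normal form, theta is the angle of the line's normal, and the gradient is
// normal to the edge, so theta should be close to the gradient's angle.
function voteNearGradient(
  acc: Float32Array, d: number,
  x: number, y: number, gx: number, gy: number, mag: number
): void {
  const tolerance = 10; // +/- degrees allowed around the gradient direction
  // Gradient angle in degrees, wrapped into [0, 180).
  const gradDeg = ((Math.round((Math.atan2(gy, gx) * 180) / Math.PI) % 180) + 180) % 180;
  for (let offset = -tolerance; offset <= tolerance; ++offset) {
    const deg = (gradDeg + offset + 180) % 180;
    const theta = (deg * Math.PI) / 180;
    const r = Math.round(x * Math.cos(theta) + y * Math.sin(theta));
    acc[deg * (2 * d + 1) + (r + d)] += mag;
  }
}
```

This cuts the work per pixel from 180 votes to 21, on top of making the votes themselves more reliable.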
The other optimization is adjusting the sizes of each bin in the grid. I found empirically that one degree of difference in the angle was actually a pretty substantial visual difference. I decided to use an integer from 0 to 255 to represent the angle instead, not only because it made the size of each box 0.7° instead of 1° but also because values from 0 to 255 fit in a single byte, which was nice to deal with for practical reasons.
However, the grid portion of the Hough transform was already taking a lot of memory, and with this change the amount was more than I was satisfied with. Therefore, I increased the size of the bins for r from 1 to 2. This halved the amount of memory necessary but only increased the maximum error for the edges detected from one pixel to two pixels, which is nearly unnoticeable.
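Here's roughly how the adjusted bins change the indexing math (a sketch with names of my own choosing):

```ts
// 256 angle bins (~0.7 degrees each, and an angle fits in one byte).
const ANGLE_BINS = 256;

// r lies in [-d, d]; bucketing it in steps of 2 leaves d + 1 buckets,
// so the accumulator needs ANGLE_BINS * (d + 1) cells.
function accumulatorSize(d: number): number {
  return ANGLE_BINS * (d + 1);
}

// Cell index for a vote, with 2-pixel-wide r buckets.
function binIndex(angleBin: number, r: number, d: number): number {
  const rBin = Math.round((r + d) / 2);
  return angleBin * (d + 1) + rBin;
}
```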
Conclusions
In short, we've found mathematical representations of the edges in the image by applying the Hough transform to the output of the Sobel operator. This is possible because each edge-like pixel votes for all the lines it could lie on, and we take the lines with the most votes at the end to be the actual edges in the image.
At the end of this process, we've basically found a bunch of (r, Θ) lines that could potentially represent the edges of the document we're trying to find... or they could just be the edges of a desk, folder, or tablet that happened to be in the background of the image. Remember that image I showed you earlier with just the edges of the document being detected? That was after a LOT of beautification. Here's the actual output.
We still got the edges of the document, but there are a ton of duplicates due to imperfections in our algorithms, most of which are only approximations. We also have a few false positives: the pen, the small notebook, and the keyboard in the background all looked like edges to our algorithm.
We need a way to filter out the false positives and duplicates while retaining the actual edges of the document. Then, we need to find the four edges most likely to be our document and use its corners to finish the document detection code. So in the next article, we'll discuss non-max suppression and how I designed a heuristic quadrilateral scoring function.