Visualizing House Price Distributions
Anthony Agnone
Posted on July 25, 2019
With Zillow and Python's Folium, it's easier than ever
Wait, but Why?
I’m in the process of closing on my first home in Atlanta, GA, and have been heavily using various real estate websites like Zillow, Redfin, and Trulia. I’ve also been toying with Zillow’s API, which is somewhat spotty in both functionality and documentation. Despite its shortcomings, I was inspired to build something once I read the post by Lukas Frei on using the folium library to create geography-based visualizations. A few days and some quick fun later, I’ve combined Zillow and Folium to make some cool visualizations of housing prices both within Atlanta and across the U.S.
Topics
- API integration
- Graph traversal
- Visualization
A Small Working Example
Let’s start simple by using some pre-aggregated data I downloaded from the Zillow website. This data set shows the median price per square foot for every state in the U.S. for each month from April 1996 to May 2019. Naturally, one could build a rich visualization of how these prices progress over time; however, let’s stick with the most recent prices for now, which are in the last column of the file.
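As a rough illustration of pulling the most recent prices out of the last column, here is a minimal sketch using only the standard library. The column names and the sample values are made up for the example; the real Zillow file may use different headers.

```python
import csv
import io

# Hypothetical sample in the wide layout described above: one row per state,
# one column per month, most recent month last. The values are invented.
SAMPLE_CSV = """RegionName,2019-03,2019-04,2019-05
Georgia,110,111,112
Hawaii,480,482,
California,300,302,305
"""


def latest_prices(csv_file):
    """Map each region to the value in the last (most recent) column."""
    reader = csv.reader(csv_file)
    header = next(reader)
    latest_month = header[-1]
    prices = {}
    for row in reader:
        if not row:
            continue
        value = row[-1].strip()
        if value:  # skip regions with no value for the latest month
            prices[row[0]] = float(value)
    return latest_month, prices


month, prices = latest_prices(io.StringIO(SAMPLE_CSV))
# Sort from most to least expensive to get a "top states" view.
top = sorted(prices.items(), key=lambda item: item[1], reverse=True)
```

To run this on a downloaded file, pass an open file handle instead of the `io.StringIO` wrapper.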
Looking at the top 10 states, there aren’t many surprises in hindsight. I was initially caught off guard by the ordering of some of these, notably D.C. and Hawaii topping the chart. However, recall that the metric is normalized per square foot. By that token, I’m now more surprised that California still comes in at #3, given its size.
Anyway, on with the show! Since this is a visualization article, I’ll avoid throwing too many lines of code in your face and instead link to all of it at the end of the article. In short, I downloaded a GeoJSON file of the U.S. states from the folium repo. This was a great find, because it immediately gave me the schema of the data that I needed to give to folium for a seamless process; the only information I needed to add was the pricing data (to generate coloring in the final map). After providing that, a mere 5 lines of code got me the following plot:
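For readers who want a feel for what those few lines might look like, here is a hedged sketch rather than the project's actual code. It assumes each GeoJSON feature carries a top-level `id` (such as a state abbreviation) that matches the keys of the price dictionary, and that `folium` and `pandas` are installed. The function names and the output file name are hypothetical.

```python
def match_prices(geojson, prices_by_id):
    """Pair each GeoJSON feature id with a price; report ids without one."""
    rows, missing = [], []
    for feature in geojson["features"]:
        state_id = feature["id"]
        if state_id in prices_by_id:
            rows.append((state_id, prices_by_id[state_id]))
        else:
            missing.append(state_id)
    return rows, missing


def build_choropleth(geojson, rows, out_path="prices.html"):
    """Color each state by price and save an interactive HTML map."""
    # Imported here so the matching step above works without these installed.
    import folium
    import pandas as pd

    frame = pd.DataFrame(rows, columns=["state", "price"])
    state_map = folium.Map(location=[39.8, -98.6], zoom_start=4)
    folium.Choropleth(
        geo_data=geojson,
        data=frame,
        columns=["state", "price"],
        key_on="feature.id",  # assumption: ids live at the feature's top level
        fill_color="YlOrRd",
        legend_name="Median price per square foot (USD)",
    ).add_to(state_map)
    state_map.save(out_path)
    return state_map


# Tiny stand-in for the states GeoJSON. Geometry is omitted here, so pass a
# real file's parsed contents to build_choropleth.
demo_geojson = {
    "type": "FeatureCollection",
    "features": [{"type": "Feature", "id": "GA"}, {"type": "Feature", "id": "HI"}],
}
rows, missing = match_prices(demo_geojson, {"GA": 112.0})
```

Checking for unmatched ids first is a cheap way to catch naming mismatches (for example, full state names in the price data versus abbreviations in the GeoJSON) before they show up as blank regions on the map.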
One Step Further
Now that I’d dipped my toes into the waters of Zillow and Folium, I was ready to be immersed. I decided to create a heat map of Metro Atlanta housing prices. One of the drawbacks of the Zillow API is that it’s rather limited in search functionality — I couldn’t find any way to perform a search based on lat/long coordinates, which would have been quite convenient for creating a granular heat map. Nevertheless, I took it as an opportunity to brush up on some crawler-style code; I used the results of an initial search by a city’s name as seeds for future calls to get the comps (via the GetComps endpoint) of those homes.
It’s worth noting that Zillow does have plenty of URL-based search filters (example) that one could use to, e.g., search by lat/long (see below). Obtaining the homes from the web page then becomes a scraping job, though, and you are subject to any sudden changes in Zillow’s web page structure. That being said, scraping projects can be a lot of fun; if you’d like to build this into what I made, let me know!
Returning to the chosen path, I mentioned that I used initial results as entry points into the web of homes in a given city. With those entry points, I kept recursing into calls for each home’s comps. An important assumption here is that Zillow’s definition of similarity between houses includes location proximity in addition to other factors. Without location proximity, the comp-based traversal would jump erratically from one part of the map to another instead of spreading smoothly outward.
So, what algorithms are at our disposal for traversing through a network of nodes in different ways? Of course, breadth-first search (BFS) and depth-first search (DFS) quickly come to mind. For the curious, have a look at the basic logic flow below. Besides a set membership guard, new homes are only added to the collection when they satisfy the constraints asserted in the meets_criteria function. For now, I do a simple L2 distance check between a pre-defined root lat/long location and the current home’s location. This criterion encourages the search to stay local to the root, for the purposes of a well-connected and granular heat map. The implementation below uses DFS by popping off the end of the list (line 5) and adding to the end of the list (line 14), but BFS can be quickly achieved by changing either line (but not both) to instead use the front of the list.
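The following is a minimal sketch of the traversal just described, not the project's original listing, so the line numbers mentioned above do not map onto it. The `get_comps` callable is a hypothetical stand-in for whatever fetches a home's comps, and the home dictionaries with `id`, `lat`, and `lon` keys are an assumed shape. A `deque` is used so that switching between DFS and BFS is a single flag.

```python
import math
from collections import deque


def l2_distance(a, b):
    """Euclidean distance between two (lat, lon) pairs, in degrees."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def meets_criteria(home, root, max_distance):
    """Keep only homes within max_distance (in degrees) of the root."""
    return l2_distance((home["lat"], home["lon"]), root) <= max_distance


def crawl(seeds, get_comps, root, max_distance, max_iterations, depth_first=True):
    """Traverse homes through their comps, starting from the seed homes."""
    frontier = deque(seeds)
    seen = {home["id"] for home in seeds}  # set membership guard
    collected = []
    for _ in range(max_iterations):
        if not frontier:
            break
        # DFS takes the newest entry; BFS takes the oldest.
        home = frontier.pop() if depth_first else frontier.popleft()
        collected.append(home)
        for comp in get_comps(home):
            if comp["id"] not in seen and meets_criteria(comp, root, max_distance):
                seen.add(comp["id"])
                frontier.append(comp)
    return collected


# Made-up homes and comp links, standing in for API responses.
HOMES = {
    1: {"id": 1, "lat": 33.75, "lon": -84.39},
    2: {"id": 2, "lat": 33.76, "lon": -84.40},
    3: {"id": 3, "lat": 33.77, "lon": -84.38},
    4: {"id": 4, "lat": 42.36, "lon": -71.06},  # far from the root, filtered out
    5: {"id": 5, "lat": 33.74, "lon": -84.41},
}
COMPS = {1: [2, 3], 2: [5, 4], 3: [1], 4: [], 5: []}


def fake_get_comps(home):
    return [HOMES[i] for i in COMPS[home["id"]]]


root = (33.75, -84.39)
dfs_ids = [h["id"] for h in crawl([HOMES[1]], fake_get_comps, root, 0.1, 100)]
bfs_ids = [h["id"] for h in crawl([HOMES[1]], fake_get_comps, root, 0.1, 100,
                                  depth_first=False)]
```

Note that an L2 distance on raw lat/long is measured in degrees, which is only a rough proxy for ground distance; it is fine for keeping a search local to one city, but a haversine distance would be more accurate over larger areas.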
Letting this algorithm run for 10,000 iterations on Atlanta homes produces the following map in just a few minutes! What’s more, the generated web page by folium is interactive, allowing common map navigation tools like zooming and panning. To prove out its modularity, I generated some smaller-scale maps of prices for Boston, MA and Seattle, WA as well.
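One possible way to turn a list of collected homes into an interactive heat map is folium's `HeatMap` plugin. This is an assumption about the approach, not necessarily what the project does, and the `price`, `lat`, and `lon` keys are a hypothetical record shape. Weights are scaled relative to the most expensive home so that pricier areas glow hotter.

```python
def heat_points(homes):
    """Turn homes into [lat, lon, weight] rows, weight = price / max price."""
    if not homes:
        return []
    highest = max(home["price"] for home in homes)
    return [[h["lat"], h["lon"], h["price"] / highest] for h in homes]


def build_heat_map(points, out_path="heatmap.html"):
    """Render the weighted points on a map centered on their mean location."""
    # Imported here so heat_points can be used without folium installed.
    import folium
    from folium.plugins import HeatMap

    center = [
        sum(p[0] for p in points) / len(points),
        sum(p[1] for p in points) / len(points),
    ]
    city_map = folium.Map(location=center, zoom_start=11)
    HeatMap(points, radius=15).add_to(city_map)
    city_map.save(out_path)
    return city_map


# Made-up homes for illustration only.
sample_homes = [
    {"lat": 33.75, "lon": -84.39, "price": 200000},
    {"lat": 33.76, "lon": -84.40, "price": 300000},
    {"lat": 33.77, "lon": -84.38, "price": 400000},
]
points = heat_points(sample_homes)
```

Calling `build_heat_map(points)` writes a standalone HTML page that supports the usual zooming and panning.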
The Code
As promised, here’s the project. It has a Make+Docker setup for ease of use and reproducibility. If you’d like to get an intro to how these two tools come together nicely for reproducible data science, keep reading here. Either way, the README will get you up and running in no time, either via script or Jupyter notebook. Happy viz!
What Now?
There are numerous directions in which we could take this logic next. I’ve detailed a few below as food for thought, but I’d prefer to move in the direction that has the most support, impact, and collaboration. What do you think?