Making Use of Zip Codes
Magali
Posted on July 28, 2022
Overview
I am continuing to learn Python coding and data science as a student in Flatiron School’s Online Data Science Bootcamp. My motivation is to explore how these skills can be applied to make better, more informed, and timely operating and policy decisions in finance and economics. I am already finding value in the application of these skills to real-world questions and challenges.
Business Problem
My second project involved working with King County, Washington’s house price dataset. This is a large dataset, with over 21,000 entries. Each entry includes a house price and location-related indicators such as latitude, longitude, and zip code; more than 70 King County zip codes are represented. Since location can affect home prices, I was interested in exploring how house prices vary across the county. The challenge is that the dataset includes neither addresses nor city names, which makes the zip code data difficult to interpret if one does not know what area a zip code represents. Moreover, because zip codes were created to help postal workers deliver mail, they do not necessarily delineate an area that one could easily reference on a map. Indeed, many zip codes have odd boundaries. Could the zip code data be made useful and interpretable in Python for this analysis?
Data Exploration
The first step is to confirm that price indeed varies by location. A simple scatterplot of longitude and latitude, with markers weighted by price, indicates that it does.
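A plot of this kind can be sketched as follows. This is a minimal illustration rather than the code behind the original figure: the column names long, lat, and price are assumed, and the four-row DataFrame is made-up stand-in data so the snippet runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in rows; the real analysis would load the King County CSV instead.
# The column names 'long', 'lat', and 'price' are assumptions.
sample = pd.DataFrame({
    "long": [-122.33, -122.20, -122.03, -122.23],
    "lat": [47.61, 47.61, 47.62, 47.31],
    "price": [650000, 900000, 720000, 310000],
})

fig, ax = plt.subplots(figsize=(6, 5))
# Scale marker area and color by price so that expensive areas stand out.
marker_sizes = sample["price"] / sample["price"].max() * 200
points = ax.scatter(sample["long"], sample["lat"], s=marker_sizes,
                    c=sample["price"], cmap="viridis", alpha=0.6)
fig.colorbar(points, ax=ax, label="Price ($)")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Home prices by location (illustrative data)")
```

In a script, calling plt.show() afterwards displays the figure; a notebook renders it automatically.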
Next, I explored the zip code data, which does not have a normal distribution. I then plotted the average price by zip code, which further supports the view that location matters: homes in some zip codes have higher mean prices than homes in others. The zip code numbers themselves, however, carry no ordering with respect to price; a higher zip code does not imply a higher or lower mean price.
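The average price by zip code can be computed with a groupby. The snippet below is a small sketch on made-up rows, assuming columns named zipcode and price; it is not the original notebook code.

```python
import pandas as pd

# Made-up stand-in rows; the column names 'zipcode' and 'price' are assumptions.
sample = pd.DataFrame({
    "zipcode": [98001, 98001, 98004, 98004, 98003],
    "price": [300000, 320000, 1200000, 1400000, 280000],
})

# Average price per zip code, sorted from most to least expensive.
mean_price_by_zip = (
    sample.groupby("zipcode")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_price_by_zip)
```

Sorting by mean price rather than by zip code makes it easier to see that the numeric order of the codes says nothing about price.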
Bringing in City Data
Since I, like most people, do not know what area each of these 70+ zip codes represents, I brought in city data obtained from King County GIS Open Data to match with the zip codes. I have matched data like this before, but never with Python, so this was a test of my new skills. I approached the challenge in three steps: creating a dictionary with zip codes as keys and cities as values; passing that dictionary to the pandas DataFrame.replace method to fill a new city column; and, finally, creating a visualization of home prices by city. Check it out.
import pandas as pd
#Load primary dataframe that underlies the King County home price analysis.
data = pd.read_csv('data/kingcounty.csv')
#Load second dataframe with zip code and city data, obtained from King County GIS Open Data.
df_zips = pd.read_csv('data/zipcodes.csv')
#Using this second dataframe, build a dictionary that maps each zip code (key) to its city (value).
zips_dictionary = dict(zip(df_zips.zipcode, df_zips.city))
zips_dictionary
{98001: 'Auburn',
98002: 'Auburn',
98003: 'Federal_Way',
98004: 'Bellevue',
98005: 'Bellevue',
98006: 'Bellevue',
98007: 'Bellevue',
98008: 'Bellevue',...}
#In primary dataframe named data, create new column named 'city' that contains zip codes as placeholder values.
data['city'] = data.zipcode
#Call on dictionary to replace zip code values in this new column with city names.
data.replace({'city': zips_dictionary}, inplace=True)
#Examine results
data.city.value_counts()
Seattle 8777
Renton 1584
Bellevue 1263
Kent 1197
Redmond 960
Kirkland 941
Auburn 907
Sammamish 778
Federal_Way 777
Issaquah 717
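One caveat with this approach: replace leaves any zip code that is missing from the dictionary unchanged, so an unmatched code would silently stay in the city column as a number. A quick check along the following lines could catch that. It is a sketch on made-up data, not part of the original analysis, and the zip code 98999 is a hypothetical value chosen to be absent from the lookup.

```python
import pandas as pd

# Made-up stand-ins for the two dataframes used in the post.
data = pd.DataFrame({"zipcode": [98001, 98004, 98003, 98999]})
zips_dictionary = {98001: "Auburn", 98003: "Federal_Way", 98004: "Bellevue"}

# Series.map returns NaN for keys missing from the dictionary,
# which makes unmatched zip codes easy to find.
data["city"] = data["zipcode"].map(zips_dictionary)
unmatched = sorted(data.loc[data["city"].isna(), "zipcode"].unique().tolist())
print("Zip codes without a city:", unmatched)
```

An empty list means every zip code in the dataset was matched to a city.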
Creating Visualization of Mean Price by City
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
fig, ax = plt.subplots(figsize=(8, 4))
#Plot mean price by city, sorted from lowest to highest
mean_price_by_city = data.groupby('city')['price'].mean()
mean_price_by_city.sort_values(ascending=True).plot(kind='bar', color='powderblue', label='Mean Price by City', ax=ax)
#Plot mean price for King County as a horizontal reference line
ax.axhline(data.price.mean(), color='mediumblue', label='Mean Price of King County: $504,333')
#Format y-axis; a formatter keeps the labels in sync with the tick positions
ax.set_ylim([0, 1500000])
ax.set_ylabel('Mean Price ($)', size=12)
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
#Format x-axis
ax.set_xlabel('City', size=12)
plt.xticks(rotation=90)
#Add legend, title
ax.legend(loc='upper left', borderaxespad=0.2, edgecolor='white', fontsize=11)
fig.suptitle('Home Prices: County and City', fontsize=15)
fig.subplots_adjust(top=0.94)
#Remove chart borders
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Conclusion
Matching each zip code to a city makes the variation in prices across the county easier to interpret. The number of bins was reduced from over seventy to twenty or so, which reduces granularity but results in a grouping that is well known and understandable to the public. This is useful because subsequent analysis, for example of price per square foot and structural home features, can examine differences within the county through the lens of cities.
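As an illustration of that kind of follow-up, the sketch below computes the median price per square foot by city. It uses made-up rows and assumes a living-area column named sqft_living; neither comes from the analysis above.

```python
import pandas as pd

# Made-up stand-in rows; 'sqft_living' is an assumed column name.
sample = pd.DataFrame({
    "city": ["Seattle", "Seattle", "Bellevue", "Auburn"],
    "price": [600000, 900000, 1200000, 300000],
    "sqft_living": [1500, 2000, 3000, 1500],
})

# Price per square foot for each home, then the median within each city.
sample["price_per_sqft"] = sample["price"] / sample["sqft_living"]
median_ppsf_by_city = sample.groupby("city")["price_per_sqft"].median()
print(median_ppsf_by_city.sort_values(ascending=False))
```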