CincyBC
Posted on May 30, 2022
Cover Image from Census Website
If you caught my post last week on how to easily get started with Postgres/Postgis, this is a followup on how to create your own local geocoder with Census/TIGER data. It's so easy, the folks behind Postgis have even created scripts to automate scraping and loading Census data for you. If you followed the tutorial, you now have a database called Geocoder
. Inside your Geocode
database with Postgis extension, you can just follow the tutorial in the Postgis docs, but I hope this tutorial provides a bit more context and few extra steps that make it easier for beginners.
The first step is to run these queries in your query tool.
CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION postgis_tiger_geocoder;
--this one is optional if you want to use the rules based standardizer (pagc_normalize_address)
CREATE EXTENSION address_standardizer;
These queries will install a tool that will try to match a string with a degree of certainty. If you've ever used the Python tool Fuzzy Wuzzy, you'll know what fuzzystrmatch does.
It also creates schemas in your database. Since you create schemas, it's a good idea to grant access to them to your other users:
GRANT USAGE ON SCHEMA tiger to PUBLIC;
GRANT USAGE ON SCHEMA tiger_data TO PUBLIC;
GRANT SELECT, REFERENCES, TRIGGER ON ALL TABLES IN SCHEMA tiger TO PUBLIC;
GRANT SELECT, REFERENCES, TRIGGER ON ALL TABLES IN SCEHMA tiger_data TO PUBLIC;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA tiger TO PUBLIC;
ALTER DEFAULT PRIVILEGES IN SCHEMA tiger_data
GRANT SELECT, REFERENCES ON TABLES TO PUBLIC;
The last query will set default to future proof access to these tables. (These queries came from the Manning book Postgis in Action)
If you check the tables in the schema, you'll find they're mostly empty. The next step then is to fill them with the data you need to geocode!
This next query will set up variables for a bash/shell script, so the query changes at the end depending on your os
. The docs also call this profile 'debbie'
. You can call it geocoder
or whatever you want. Just remember it for the following query.
INSERT INTO tiger.loader_platform(os, declare_sect, pgbin, wget, unzip_command, psql, path_sep,
loader, environ_set_command, county_process_command)
SELECT 'debbie', declare_sect, pgbin, wget, unzip_command, psql, path_sep,
loader, environ_set_command, county_process_command
FROM tiger.loader_platform
WHERE os = 'sh';
Now that you've create this record, go into the table tiger.loader_platform
and set the paths to everything. If you don't have wget
, this is how the bash scripts will pull data from the census, so you'll need to install it and set the path to the library (unless it is set in your Path) to where you can call it. If you have a mac, you can easily get set with wget with Homebrew: brew install wget
.
After you've set those variables (don't worry about setting pgbin if you have psql
in your Path as we did if you followed the tutorial in my last post), you need to create a folder called gisdata/
. The tutorial recommends you put it in your root, but I put it on my Desktop/
so I could easily get rid of it after finishing grabbing the data. If you set it to anything except for your root, make sure you update the staging_fold
in tiger.loader_platform
.
This next query will generate a script to scrape data from the Census (it beats downloading, unzipping, running a loading command in shell, and then deleting all of the dozens of files you'll need to download for the geocoder).
psql -c "SELECT Loader_Generate_Nation_Script('debbie')" -d geocoder -tA > /gisdata/nation_script_load.sh
Again, change 'debbie' to the profile we created above. Also, the last bit is the path to your /gisdata/
directory, so if you put it on your deskstop like I did, you would need to update that to something like ~/Desktop/gisdata/nation_script_load.sh
. If you're on a PC, you'll need to create a .bat
file instead of .sh
.
Boom, just like that, you've got a script to automate pulling data from the census and loading it into your database!
Go ahead and look in the script to make sure it came out OK. Does it have your database name and password? Will it be able to run psql
with how it is set? If psql
is in your Path, then you don't need to specify where pgbin
is. I completely removed that variable and references to it from the script when I went through this tutorial.
The next task is to run the script. It shouldn't take too long depending on your internet speed.
sh nation_script_load.sh
You've just loaded basic data for all states and counties in the US. To confirm, you can run the following query:
SELECT count(*) FROM tiger_data.county_all;
It'll return 3233
. Doing the same FROM tiger_data.state_all;
will return all 50 states + DC + territories for a count of 56.
Now, it's time to make a couple of decisions. Will you be doing any census data work with the geocoder? Will you want to know which census tract the address is in? If not, skip the next query. If you do, run this query:
UPDATE tiger.loader_lookuptables SET load = true WHERE load = false AND lookup_name IN('tract', 'bg', 'tabblock');
This will allow you to download the necessary files to find the census tract, block, and block group for addresses you geocode.
The next decision is for which address you'll be running the geocoder for. If you live in New York, there is a possibility you'll care about geocoding addresses in NY, but not really care about geocoding address in WY, so why waste the time/disk space?
In the same fashion we created the automated script to load the basic nation/counties data above (a necessary requirement no matter if you only want NY), it's time to pull the data for the state(s) you want with the following query:
psql -c "SELECT Loader_Generate_Script(ARRAY['NY'], 'debbie')" -d geocoder -tA > /gisdata/ny_load.sh
Again, fill in which states you want in the ARRAY by their abbreviation; change 'debbie' to whatever you called your profile to store variables ('geocode'
); and make sure you have the correct destination for your file (e.g. ~/Desktop/gisdata/ny_load.sh
). Run your script (this can take a lot of time depending on how fast your internet/computer are): sh ny_load.sh
.
That's it, you now have a geocoder on your local machine!
The docs recommend you clean up your tables. If you created these new, and didn't have issues during setup, these queries shouldn't actually do much, if anything, but it's always good to clean up zombie rows for performance.
SELECT install_missing_indexes();
vacuum (analyze, verbose) tiger.addr;
vacuum (analyze, verbose) tiger.edges;
vacuum (analyze, verbose) tiger.faces;
vacuum (analyze, verbose) tiger.featnames;
vacuum (analyze, verbose) tiger.place;
vacuum (analyze, verbose) tiger.cousub;
vacuum (analyze, verbose) tiger.county;
vacuum (analyze, verbose) tiger.state;
vacuum (analyze, verbose) tiger.zip_lookup_base;
vacuum (analyze, verbose) tiger.zip_state;
vacuum (analyze, verbose) tiger.zip_state_loc;
Give it a test run! If you chose NY like I did in the tutorial, run this query to get the geo-coordinates for Rockefeller Center Plaza:
SELECT g.rating, ST_X(g.geomout) As lon, ST_Y(g.geomout) As lat,
(addy).address As stno, (addy).streetname As street,
(addy).streettypeabbrev As styp, (addy).location As city, (addy).stateabbrev As state,(addy).zip
FROM geocode('45 Rockefeller Plaza, New York, NY 10111', 1) As g;
You'll notice something called a rating when you run this query. If you get a rating of 0, it's certain it got the address correct. The higher the rating, the less the fuzzystrmatch is sure it got it right. You can play around with that by spelling Rockefeller
something else like Rockfella
and see what it comes up with. You can also run this without the zip code and it'll come up with the correct coordinates (albeit, slightly slower query).
You can also run batch queries if you want to get multiple geo-coordinates in a single query.
Happy geocoding!
Posted on May 30, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.