In Part 1 I covered the installation and configuration of the PostGIS, Open Addresses, and Nominatim geocoders. In this article we’ll download and geocode some reference data for evaluation. In Part 3 I’ll compare the results.
First, we’ll download, load, and postprocess the OpenAddresses data we’re using as a reference. Then, we’ll pull out a sample of 50,000 records and geocode them with each of our geocoders.
OpenAddresses Data
OpenAddresses is an effort to aggregate and standardize a global collection of address data. We’ll be using the Florida extract of the OpenAddresses data as our benchmark. Download it to get started.
wget http://s3.amazonaws.com/data.openaddresses.io/openaddr-collected-us_south.zip
unzip openaddr-collected-us_south.zip
cd us/fl
Create our table to hold the data:
psql << EOF
CREATE TABLE addresses (
lon float,
lat float,
house_number text,
street text,
unit text,
city text,
district text,
region text,
postcode text,
id text,
hash text);
EOF
Now load all the CSV files. (There is some overlapping coverage in these files that we’ll deal with later.)
for FILE in *.csv
do
echo "Loading $FILE"
psql -c "\copy addresses FROM $FILE CSV HEADER"
done
Pop open psql
and let’s take a look!
SELECT count(*) FROM addresses;
count
----------
20558789
20 million addresses, ripe for geocoding! Unfortunately, the OpenAddresses data is based on aggregated data from public sources which is sometimes incomplete. Let’s check the coverage of the address fields.
SELECT
count(*) AS total,
count(house_number) AS h_number,
count(street) AS street,
count(city) AS city,
count(postcode) AS postcode,
count(COALESCE(city, postcode)) AS city_or_post,
count(house_number || street || city || postcode) AS all
FROM
addresses;
-[ RECORD 1 ]+---------
total | 20558789
h_number | 18432928
street | 20538609
city | 17411392
postcode | 10488002
city_or_post | 20223850
all | 7610472
It looks like only half (10 million) of all addresses have a zip code, 17.4 million have a city, and 7.6 million have all address components. Instead of dropping those without all address components, we’ll classify each address based on the completeness of the components to see how the geocoders stand up to missing data.
ALTER TABLE addresses ADD COLUMN components TEXT;
-- Consider "unincorporated" to be a missing city component
UPDATE addresses SET city = NULL WHERE city = 'Unincorporated';
UPDATE addresses
SET components =
CASE
-- We won't even try to geocode these
WHEN house_number || street IS NULL THEN 'bad'
WHEN city || postcode IS NOT NULL THEN 'all'
WHEN city IS NOT NULL then 'city only'
WHEN postcode IS NOT NULL THEN 'postcode only'
ELSE 'street only'
END;
SELECT components, COUNT(*) FROM addresses GROUP BY 1;
components | count
---------------+---------
bad | 2131815
all | 7019191
postcode only | 3390495
street only | 327029
city only | 7690259
Let’s create a stratified random sample of these addresses:
- 35,000 (70%) with all address components
- 7,500 (15%) with street + postcode only
- 7,500 (15%) with street + city only
SELECT setseed(0.5);
CREATE TABLE sampled_addy AS
(
SELECT *
FROM addresses
WHERE components = 'all'
ORDER BY random()
LIMIT 35000
)
UNION ALL
(
SELECT *
FROM addresses
WHERE components = 'postcode only'
ORDER BY random()
LIMIT 7500
)
UNION ALL
(
SELECT *
FROM addresses
WHERE components = 'city only'
ORDER BY random()
LIMIT 7500
);
ALTER TABLE sampled_addy ADD COLUMN addy_id SERIAL PRIMARY KEY;
Now that we have a more manageable test set, let’s do a little additional hygiene:
UPDATE sampled_addy
SET
street = upper(street),
unit = COALESCE(upper(unit), ''),
-- I noticed some city names have embedded hyphens/underscores
city = COALESCE(upper(regexp_replace(city, '_|-', ' ', 'g')), ''),
-- Should only be Florida
region = 'FL',
postcode = COALESCE(substr(postcode, 1, 5), '');
Let’s create a geospatial point column representing the coordinates.
ALTER TABLE sampled_addy ADD COLUMN geom GEOMETRY('POINT', 4326);
UPDATE sampled_addy
SET geom = ST_SetSrid(ST_MakePoint(lon, lat), 4326);
CREATE INDEX ON sampled_addy USING gist(geom);
Geocoding: PostGIS Tiger Geocoder
We’ll create a table to hold the results from each geocoder. First, the Tiger geocoder.
CREATE TABLE geocoded (
addy_id integer,
lat float,
lon float,
geom geometry('POINT', 4326),
precision float,
method text,
UNIQUE(addy_id, method));
We have a few options on the granularity of the address components we submit.
One option is to concatenate all address components into a single freeform
string and let the geocoder’s address parser handle it. However, since we
already have some address components broken out, we can also try specifying the
city, state, and zip code components individually. The street number and name
components still need to be parsed since the unit numbers are often embedded in
the street
field and predirections are not broken out. We’ll try both.
First we’ll use the freeform addresses. The geocode
function will accept a
freeform address string, parse the address into the geocoder’s norm_addy
type, and return the normalized address, the geocoded geometry, and a rating
representing the estimated quality of the geocode.
INSERT INTO geocoded
SELECT DISTINCT ON (addy_id)
addy_id,
ST_Y((g.geo).geomout)::numeric,
ST_X((g.geo).geomout)::numeric,
ST_Transform((g.geo).geomout, 4326),
(g.geo).rating,
'postgis-freeform'
FROM (
SELECT
a.addy_id,
geocode(house_number || ' ' || street || ', ' || city || ' ' ||
region || ' ' || postcode) as geo
FROM sampled_addy a
LEFT JOIN geocoded c
ON c.method = 'postgis-freeform' AND a.addy_id = c.addy_id
WHERE c.addy_id IS NULL
) g
ORDER BY addy_id, (g.geo).rating;
Let’s try one more time, manually setting the city, state, and zipcode where available. We’ll still need the geocoder to parse the address so we can extract the street number, predirection, street name, postdirection, and unit number.
INSERT INTO tiger_geocoded
SELECT DISTINCT ON (addy_id)
addy_id,
ST_Y((g.geo).geomout)::numeric,
ST_X((g.geo).geomout)::numeric,
ST_Transform((g.geo).geomout, 4326),
(g.geo).rating,
'postgis-parsed'
FROM (
SELECT
addy_id,
geocode(
(
(norm).address,
(norm).predirabbrev,
(norm).streetname,
(norm).streettypeabbrev,
(norm).postdirabbrev,
(norm).internal,
city,
'FL',
substr(postcode, 1, 5),
true
)::norm_addy) as geo
FROM (
SELECT
a.*,
normalize_address(house_number || ' ' ||
street || ', ' ||
city || ' ' ||
region || ' ' || postcode) as norm
FROM sampled_addy a
LEFT JOIN geocoded c ON c.method = 'postgis-parsed' AND a.addy_id = c.addy_id
WHERE c.addy_id IS NULL
) n
) g
ORDER BY addy_id, (g.geo).rating;
Geocoding: The Geocommons Geocoder::US
Here’s a quick ruby script to geocode our benchmark data with the Geocommons geocoder (note that I’ve made no effort to make this efficient):
require 'pg'
require 'geocoder/us'
# Path to the SQLite3 database we created during setup
db = Geocoder::US::Database.new('/home/ubuntu/geocoder/database/geocoder.db')
# Connection to our PostgreSQL database housing the benchmark data
conn = PG.connect()
geocoded = Array.new
sql = %q(
SELECT
a.addy_id,
house_number || ' ' ||
street || ', ' ||
city || ' ' ||
region || ' ' ||
postcode AS freeform
FROM sampled_addy a
-- Only those we haven't geocoded
LEFT JOIN geocoded c ON c.method = 'geocommons' AND c.addy_id = a.addy_id
WHERE c.addy_id IS NULL;
)
result = conn.exec(sql)
result.each do |row|
g = db.geocode(row.values_at('freeform')[0])[0]
if g != nil
# skip if we don't have a match
g[:addy_id] = row.values_at('addy_id')[0]
geocoded << g
end
end
conn.prepare(
'insert',
"INSERT INTO geocoded (addy_id, lat, lon, precision, method) "\
"VALUES ($1, $2, $3, $4, 'geocommons') ")
geocoded.each do |coded|
conn.exec_prepared('insert', [coded[:addy_id],
coded[:lat],
coded[:lon],
coded[:score]])
end
conn.close()
Geocoding: Nominatim
And finally, a python script to pull the freeform addresses from the database,
throw them at our Nominatim endpoint, and insert the results into our
geocoded
table:
import requests
import psycopg2 as pg
con = pg.connect(dbname='geocoder')
cur = con.cursor()
# The nominatime endpoint from our local installation
url = 'http://localhost/nominatim/search.php'
sql = '''
SELECT
a.addy_id,
house_number || ' ' || street || ', ' || city || ' ' || region ||
' ' || postcode as freeform
FROM sampled_addy a
LEFT JOIN geocoded c ON c.method = 'nominatim' AND c.addy_id = a.addy_id
WHERE c.addy_id IS NULL;
'''
geocoded = []
cur.execute(sql)
for result in cur:
# Make a get request to our nominatime endpoint with our address
g = requests.get(url, params={'q': result[1], 'format': 'json'}).json()
if len(g) > 0:
coded = g[0]
coded['addy_id'] = result[0]
geocoded.append(coded)
else:
pass
cur.executemany('''
INSERT INTO geocoded(addy_id, lat, lon, method)
VALUES(%(addy_id)s, %(lat)s, %(lon)s, 'nominatim')''', geocoded)
con.commit()
con.close()
In Part 3 we’ll analyze the results.