Geocoder Showdown Part 3
Sep 24, 2016
9 minutes read

This is the final part in a three-part series comparing some common solutions for offline range-based geocoding.

In this post, we’ll examine the results among all three geocoders using R.

Let’s start by pulling in our geocoded address data along with the actual latitude and longitude.

library(MASS)
library(GGally)
library(RPostgreSQL)
library(dplyr)
library(ggplot2)
library(tidyr)
library(scales)
library(party)

con <- dbConnect(PostgreSQL(), dbname='geocoder')

results <- dbGetQuery(con, "
      SELECT
        a.addy_id,
        a.house_number,
        a.street,
        a.city,
        a.region,
        a.postcode,
        a.components,
        a.lon as actual_lon,
        a.lat as actual_lat,
        c.lon as coded_lon,
        c.lat as coded_lat,
        c.precision,
        c.method,
        -- Calculate the distance (in meters) between the actual location and 
        -- our geocoded location
        ST_DistanceSphere(a.geom, c.geom) as error
    FROM
        -- Our reference set of 50,000 addresses
        sampled_addy a
        -- The results of our geocoding
        LEFT JOIN geocoded c USING(addy_id);
  
")

dbDisconnect(con)

# Just the geocoded addresses. We'll come back to the rest later
coded <- subset(results, !is.na(method))

The result set contains one row for each combination of address and geocoder.

head(results)
addy_id | house_number | street | city | region | postcode | components | actual_lon | actual_lat | coded_lon | coded_lat | precision | method | error
1 | 24 | MONROE AVENUE | LEHIGH ACRES | FL | 33936 | all | -81.57660 | 26.60327 | -81.57685 | 26.60103 | 0 | postgis-freeform | 250.06258
2 | 219 | EAST HARTFORD STREET | HERNANDO | FL | 34442 | all | -82.43075 | 28.88657 | -82.43103 | 28.88607 | 11 | postgis-freeform | 62.48374
3 | 4720 | AUTUMN ROAD | MALONE | FL | 32445 | all | -85.20044 | 30.96571 | -85.19940 | 30.96540 | 0 | postgis-freeform | 105.06017
4 | 1428 | MACK SESSIONS ROAD | PERRY | FL | 32348 | all | -83.59483 | 30.14966 | -83.59225 | 30.14924 | 7 | postgis-freeform | 252.60864
5 | 3801 | CROWN POINT ROAD | JACKSONVILLE | FL | 32257 | all | -81.61398 | 30.19662 | -81.61604 | 30.19463 | 0 | postgis-freeform | 297.21101
6 | 1815 | SOUTHEAST 6TH STREET | CAPE CORAL | FL | 33990 | all | -81.93620 | 26.64175 | -81.93620 | 26.64160 | 0 | postgis-freeform | 16.51182

Let’s see how many were successfully geocoded by each:

table(coded$method)
      geocommons        nominatim postgis-freeform   postgis-parsed 
           49521            26260            49788            49777 

While the PostGIS and Geocommons geocoders assigned a geocode for nearly all 50,000 addresses, Nominatim only returned a result for 26,260 records.
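
Expressed as match rates (a quick sketch; the 50,000 denominator is hardcoded because unmatched addresses simply have no row for that method):

# Match rate per geocoder. The 50000 is the size of the reference set,
# hardcoded since misses don't appear in the coded data frame.
coded %>%
    count(method) %>%
    mutate(match_rate = percent(n / 50000))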

Let’s do a quick check of the distribution of the geocoding errors:

coded %>% 
    group_by(method) %>% 
    do(data.frame(t(quantile(.$error, c(0, .05, .10, .25, .5, .75, .9, .95), 
                             na.rm=TRUE))))
method | 0% | 5% | 10% | 25% | 50% | 75% | 90% | 95%
geocommons | 0.36986388 | 17.96087 | 21.97979 | 34.27920 | 70.32634 | 198.3388 | 3258.5688 | 12943.918
nominatim | 0.15634822 | 14.65268 | 18.42186 | 29.18190 | 62.39547 | 148.3324 | 384.8684 | 1158.814
postgis-freeform | 0.09711241 | 15.56505 | 19.27518 | 31.55322 | 67.73683 | 193.3140 | 3362.8243 | 18565.128
postgis-parsed | 0.09711241 | 15.53364 | 19.23007 | 31.32689 | 66.67424 | 184.9740 | 2783.7127 | 21272.894

For all geocoders, the median error (the 50% column) was less than 70 meters. Nominatim appears to outperform the others, especially at the right tail of the distribution (95% of Nominatim geocodes were within 1,159 meters, while 95% of the Geocommons geocodes were within 12,944 meters). This comparison is misleading, though, since Nominatim only returned a geocode for about half of all addresses.
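
One way to put the geocoders on a more equal footing is to count a missing geocode as a miss and measure coverage against all 50,000 addresses. A minimal sketch, with an arbitrary 1-kilometer threshold:

# Share of ALL 50,000 reference addresses geocoded to within 1 km,
# counting unmatched addresses as misses. The threshold is arbitrary.
coded %>%
    group_by(method) %>%
    summarize(within_1km = percent(sum(error < 1000, na.rm = TRUE) / 50000))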

The errors are heavily right-skewed. For instance, among the PostGIS freeform results, the best 10% of geocodes were within 20 meters and half were within 68 meters, yet the 90th percentile was roughly 3,363 meters and the 95th over 18,500 meters.
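
Since MASS is already loaded, we can probe that shape a bit further. A rough sketch (not a formal goodness-of-fit test) fitting a log-normal to one geocoder's errors:

# Fit a log-normal to the PostGIS freeform errors as a rough shape check;
# this is illustrative, not a formal test of the distribution.
pf_errors <- subset(coded, method == "postgis-freeform" & error > 0)$error
fitdistr(pf_errors, "lognormal")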

# Plot distribution of errors
ggplot(data=coded, aes(x=error)) +
    geom_histogram(binwidth=500) +
    facet_grid(method ~ .) +
    xlim(0, 20000) +
    ylim(0, 4000) +
    ggtitle("Error Distribution Among Geocoders")

[Figure: Error Distribution Among Geocoders]

That’s a very long tail. While most results are within a football field of the actual address, the worst results can be miles away.
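
To put a rough number on "within a football field" (taking that as about 91 meters, or 100 yards):

# Share of geocoded results within ~91 meters (about 100 yards).
coded %>%
    group_by(method) %>%
    summarize(within_field = percent(mean(error < 91.44, na.rm = TRUE)))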

# Errors, log transformed
(errors_log <- ggplot(data=coded, aes(x=error)) +
    geom_histogram() +
    facet_grid(method ~ .) +
    scale_x_log10("Error (meters)", label=comma))

[Figure: log-transformed error distributions, faceted by geocoder]

The error distributions follow similar shapes, with the exception of Nominatim, which has both fewer severe errors (very few over 10 kilometers) and fewer matches within 100 meters.

Since many of our addresses were incomplete, we classified each based on the available components (a sketch of that labeling follows the list):

  • all - Street address, city, and zipcode all present
  • city only - Street address and city present, but no postcode (zipcode)
  • postcode only - Street address and postcode present, but no city name
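
A hypothetical sketch of how such a labeling might be derived, assuming missing parts are stored as NA (the actual labels were assigned upstream when the sample was built):

# Hypothetical sketch of the components labeling, assuming missing parts
# are NA. The real labels were assigned when the address sample was built.
components_check <- results %>%
    mutate(components_derived = ifelse(!is.na(city) & !is.na(postcode), "all",
                                ifelse(!is.na(postcode), "postcode only",
                                       "city only")))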

Let’s see how each geocoder holds up to missing data.

errors_log + facet_grid(method ~ components) 

[Figure: log-transformed error distributions, faceted by geocoder and address components]

As expected, all geocoders perform worse when the zip code is missing.

Let’s quantify this a little better.

coded %>%
    group_by(method) %>%
    summarize(within_25 = sum(error < 25),
              within_50 = sum(error < 50),
              within_100 = sum(error < 100),
              within_200 = sum(error < 200))
method | within_25 | within_50 | within_100 | within_200
geocommons | 6916 | 18967 | 30014 | 37217
nominatim | 5335 | 11210 | 16951 | 21363
postgis-freeform | 8775 | 19946 | 30583 | 37610
postgis-parsed | 8836 | 20129 | 30894 | 37997

While Nominatim had a better error distribution among the addresses it did geocode, it produced the fewest results within 200 meters overall.

error_ranges <- coded %>%
    group_by(components, method) %>%
    summarize(within_25 = sum(error < 25),
              within_50 = sum(error < 50),
              within_100 = sum(error < 100),
              within_200 = sum(error < 200))

# Hardcode the total record counts in the percentage calculations
# since each method is missing some results
error_ranges <- transform(error_ranges,
                          within_25  = ifelse(components == 'all',  
                                              within_25 / 35000,  
                                              within_25 / 7500),
                          within_50  = ifelse(components == 'all',  
                                              within_50 / 35000,  
                                              within_50 / 7500),
                          within_100 = ifelse(components == 'all', 
                                              within_100 / 35000, 
                                              within_100 / 7500),
                          within_200 = ifelse(components == 'all', 
                                              within_200 / 35000, 
                                              within_200 / 7500))
error_ranges[, 3:6] <- lapply(error_ranges[3:6], percent)
error_ranges
components | method | within_25 | within_50 | within_100 | within_200
all | geocommons | 13.6% | 38.5% | 61.7% | 76.9%
all | nominatim | 9.6% | 21.0% | 32.5% | 41.2%
all | postgis-freeform | 16.9% | 39.7% | 61.6% | 76.1%
all | postgis-parsed | 17.1% | 40.3% | 62.5% | 77.2%
city only | geocommons | 10.5% | 27.4% | 43.5% | 54.5%
city only | nominatim | 6.3% | 13.2% | 19.6% | 24.8%
city only | postgis-freeform | 16.5% | 36.0% | 55.5% | 68.4%
city only | postgis-parsed | 16.1% | 35.2% | 54.3% | 67.0%
postcode only | geocommons | 18.4% | 45.8% | 68.5% | 83.0%
postcode only | nominatim | 20.0% | 38.4% | 54.9% | 67.9%
postcode only | postgis-freeform | 21.5% | 44.6% | 64.9% | 78.1%
postcode only | postgis-parsed | 21.8% | 45.3% | 66.0% | 79.3%

Interestingly, all geocoders do better when no city is present than on fully formed addresses. This may be an artifact of the reference data (OpenAddresses), which sometimes contains errors in the address components.

The PostGIS and Geocommons geocoders both provide a numerical rating for each geocoding result. Higher ratings are better for Geocommons while lower ratings are better for PostGIS.
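
Since the two scales run in opposite directions, it can help to map them onto a common higher-is-better scale before comparing them. A minimal sketch; the 1 / (1 + precision) transform is an arbitrary choice of mine, not part of either geocoder's API:

# Put both rating scales on a common "higher is more confident" footing.
# The 1 / (1 + precision) inversion for PostGIS is an arbitrary choice.
# (Nominatim rows carry no rating in this data, so they stay NA.)
coded_conf <- coded %>%
    mutate(confidence = ifelse(method == "geocommons",
                               precision,             # already higher-is-better
                               1 / (1 + precision)))  # invert PostGIS ratings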

Let’s explore the relationship between the rating and the quality of the geocode.

ggplot(subset(coded, !is.na(precision)), aes(x=precision, y=error)) +
    geom_point(alpha=.2, size=.1) +
    geom_smooth() +
    ggtitle('Geocoding Error vs. Precision Score, by Geocoder') +
    facet_wrap(~ method, scales="free", nrow=2) +
    scale_y_log10()

[Figure: Geocoding Error vs. Precision Score, by Geocoder]

ggplot(subset(coded, !is.na(precision)), aes(x=precision, y=error)) +
    geom_point(alpha=.2, size=.1) +
    geom_smooth() +
    ggtitle('Geocoding Error vs. Precision Score, by Geocoder and Components') +
    facet_grid(components ~ method, scales="free_x") +
    scale_y_log10()

[Figure: Geocoding Error vs. Precision Score, by Geocoder and Components]

For the Geocommons geocoder, the errors level off once the rating exceeds about 0.8. For PostGIS, the errors keep improving as the rating approaches zero.
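
Those observations suggest a simple quality filter. A sketch with eyeballed thresholds (0.8 for Geocommons, 20 for PostGIS), read off the plots rather than tuned:

# Keep only geocodes whose rating suggests a reliable result.
# Thresholds are eyeballed from the plots above, not tuned values.
high_conf <- subset(coded,
                    (method == "geocommons" & precision >= 0.8) |
                    (grepl("^postgis", method) & precision <= 20))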

Next, let’s look at the relationship between the errors of various geocoders for each address. If one geocoder has a very small error, do the other two perform similarly (and vice versa)? Or is it common for one geocoder to perform poorly on an address while another does well?

# Reshape so that each row is an address and each column is the error for each geocoder
casted <- results %>%
    mutate(error_log = log10(error)) %>%  # Easier to log transform the error here
    select(-coded_lon, -coded_lat, -precision, -error) %>%
    sample_frac() %>%  # Shuffle rows so one geocoder isn't plotted on top of another
    spread(method, error_log)

names(casted) <- gsub("-", "_", names(casted))  # Make valid R column names
options(warn = -1)
options(repr.plot.width=12, repr.plot.height=8)
ggpairs(
    casted,
    columns=c("geocommons", "nominatim", "postgis_freeform", "postgis_parsed"),
    lower = list(continuous = wrap("points", alpha=0.3, size=.05)),
    mapping = ggplot2::aes(color = components))

[Figure: scatterplot matrix of log-transformed errors across geocoders, colored by components]

Interestingly, many addresses are coded with poor accuracy by one geocoder but very high accuracy by another (keep in mind that the scales in the above plot are log-transformed). It's not unusual for an address to be miscoded by over a thousand yards with one geocoder but have a very small error with another.
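
We can also quantify those pairwise relationships. A quick check of the correlations between the log-transformed errors, using pairwise complete observations since each geocoder misses a different set of addresses:

# Pairwise correlations of the log10 errors across geocoders.
cor(casted[, c("geocommons", "nominatim", "postgis_freeform", "postgis_parsed")],
    use = "pairwise.complete.obs")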

Given the difference in performance among the geocoders for a given address and the fact that both PostGIS and Geocommons return an indicator of accuracy, I wondered whether we could predict which geocoder gave the most accurate result based on those scores.

training <- coded %>%
  group_by(addy_id) %>%
  # For apples-to-apples, only use the freeform methods
  filter(method != 'postgis-parsed') %>%
  # Add a column indicating which geocoder gave the "best" result
  mutate(best = method[which.min(error)],
         method = gsub('-', '_', method)) %>%
  # We won't use these method-level variables
  select(-coded_lon, -coded_lat) %>%
  # We'll need both the precision and error for each method, so some 
  # reshaping is needed
  gather(key = 'measure', value='value', precision, error) %>%
  unite(col='measure', method, measure) %>%
  spread(key=measure, value=value)
head(training)
addy_id | house_number | street | city | region | postcode | components | actual_lon | actual_lat | best | geocommons_error | geocommons_precision | nominatim_error | nominatim_precision | postgis_freeform_error | postgis_freeform_precision
1 | 24 | MONROE AVENUE | LEHIGH ACRES | FL | 33936 | all | -81.57660 | 26.60327 | postgis-freeform | 250.29422 | 1.000 | NA | NA | 250.06258 | 0
2 | 219 | EAST HARTFORD STREET | HERNANDO | FL | 34442 | all | -82.43075 | 28.88657 | postgis-freeform | 65.61230 | 1.000 | NA | NA | 62.48374 | 11
3 | 4720 | AUTUMN ROAD | MALONE | FL | 32445 | all | -85.20044 | 30.96571 | postgis-freeform | 105.32220 | 1.000 | NA | NA | 105.06017 | 0
4 | 1428 | MACK SESSIONS ROAD | PERRY | FL | 32348 | all | -83.59483 | 30.14966 | postgis-freeform | 257.03190 | 0.805 | NA | NA | 252.60864 | 7
5 | 3801 | CROWN POINT ROAD | JACKSONVILLE | FL | 32257 | all | -81.61398 | 30.19662 | nominatim | 301.04544 | 0.937 | 284.86059 | NA | 297.21101 | 0
6 | 1815 | SOUTHEAST 6TH STREET | CAPE CORAL | FL | 33990 | all | -81.93620 | 26.64175 | postgis-freeform | 18.63594 | 0.935 | 18.98551 | NA | 16.51182 | 0

options(repr.plot.width=18, repr.plot.height=10)
# For now, let's fit a simple, shallow classification tree
plotTreeDepth <- function(depth){
    tree.fit <- ctree(factor(best) ~ factor(components) + geocommons_precision +
                    postgis_freeform_precision,
                  data = training,
                  controls = ctree_control(maxdepth=depth))
    plot(tree.fit)
}
plotTreeDepth(1)

[Figure: classification tree of depth 1]

The first split indicates that, if PostGIS returns a precision of <= 20, it’s usually the most accurate of the three. If the score is greater than 20, the Geocommons geocoder is most likely to be closest.

plotTreeDepth(2)

[Figure: classification tree of depth 2]

Splitting each node once more, we see that a PostGIS score of 0 makes both PostGIS and Nominatim more likely to have a low error. Where the PostGIS precision is over 20, Geocommons is most likely to do best, especially among addresses with no zip code.

plotTreeDepth(3)

[Figure: classification tree of depth 3]

Now we have a node where Nominatim does best: addresses where the PostGIS precision is between 0 and 20 and the address has a postcode but no city.

We could continue to grow our classification tree, or better yet apply a more robust model and determine whether we can use it to predict which geocoder is most accurate. I may tackle that in a future project.
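
As a sketch of what that next step might look like, here's an untuned conditional random forest using the same party package loaded above; the ntree value and the out-of-bag accuracy check are illustrative assumptions, not a validated result:

# Illustrative only: an untuned conditional random forest from the party
# package, with the same predictors as the tree above.
rf.fit <- cforest(factor(best) ~ factor(components) + geocommons_precision +
                      postgis_freeform_precision,
                  data = training,
                  controls = cforest_unbiased(ntree = 100))
# Rough out-of-bag classification accuracy:
mean(predict(rf.fit, OOB = TRUE) == factor(training$best))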

Final Thoughts

Range-based geocoding remains one of the more difficult problems in GIS, but these three tools can get us most of the way there (especially if you can tolerate some error). The results of this analysis are limited by the reference dataset I used (OpenAddresses), which contains some erroneous and possibly invalid addresses. I'd love to hear suggestions for more appropriate reference data.

There are solutions for free offline geocoding beyond strictly range-based methods. Pelias is one that looks particularly interesting: it incorporates both OpenStreetMap and OpenAddresses data (as well as others) into the default datasets.

If you’re interested in the results data with addresses and errors for each geocoder, grab it here.

