The right name in the wrong state
Our resolver scored 93.7% on the metric we'd been quoting for months. On the same addresses, its median answer was 326 kilometers from the truth.
Both numbers are correct. That's the uncomfortable part.
Articles about the Who's On First gazetteer, concordance scoring, and resolver candidate ranking.
There is a particular kind of engineering misery where you fix a bug three times and it never gets better, because the bug is in your ruler. This is that story.
Our neural parser handles German two ways. Native order — Hauptstraße 5, 10115 Berlin — is the layout real German feeds and real German people use. International order — 5 Hauptstraße, Berlin, 10115 — is the Americanized layout our evaluation set happens to ship. For months, international-order German "collapsed": locality accuracy sat around 44% while native cleared 80%. We had a story for it. The postcode anchor — a side-channel that feeds the model a country hint derived from the postcode — sits at the trailing postcode, which in international order lands on the far side of the locality from where it's needed. Plausible. So we retrained.
Ask a geocoder for "Berlin" and it has to make a choice. There's the one in Germany, obviously. There's also Berlin, New Hampshire (population nine thousand and change), Berlin, Wisconsin, Berlin, Connecticut, and a dozen more scattered across the United States like the name was on sale. The parser hands you the word Berlin tagged as a locality; something downstream has to decide which dot on the map that is. How would you even know if it picked right?
For a long time our answer was a scorecard that checked the name. Did the resolved place's name equal the expected name? Tick. Move on. It is a completely reasonable thing to measure, and it was lying to us for months.
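To make the gap concrete, here is a minimal sketch of the two rulers side by side: the name check and a distance check. It is not our actual scorecard. The `score` function, the field names, and the single test case are hypothetical, and the coordinates are rough values for Berlin, Germany and Berlin, New Hampshire.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def score(cases):
    """Return (name accuracy, median distance error in km) for resolved cases."""
    name_hits = sum(
        c["resolved_name"].casefold() == c["expected_name"].casefold()
        for c in cases
    )
    errors = [haversine_km(c["resolved_at"], c["expected_at"]) for c in cases]
    return name_hits / len(cases), median(errors)

# Hypothetical case: we wanted Berlin, Germany and got Berlin, New Hampshire.
cases = [
    {"expected_name": "Berlin", "expected_at": (52.52, 13.40),
     "resolved_name": "Berlin", "resolved_at": (44.47, -71.18)},
]
name_accuracy, median_km = score(cases)
print(f"name accuracy: {name_accuracy:.0%}, median error: {median_km:.0f} km")
```

The name check gives this case a perfect score, because the names really do match. Only the distance column notices that the answer is on the wrong continent.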
We spent a good month teaching our resolver exactly one trick. Take a postcode, drop its centroid into the city polygon that happens to contain it, read off the city. It's a genuinely good trick. It got the Netherlands to 95% and Germany to 93%, and for a while it felt like the whole problem was going to fall to it. Then we pointed it at Japan, and Japan calmly informed us that it has no city polygons to drop anything into.
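The trick itself fits in a few lines. What follows is a sketch under loud assumptions, not our resolver: a plain ray-casting point-in-polygon test, a hypothetical `city_for_postcode` helper, and a toy rectangle standing in for a real city boundary. It ignores holes and multipolygons, which real boundaries have.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the ring of (lon, lat) vertices?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Only edges that straddle the point's latitude can be crossed.
        if (yi > lat) != (yj > lat):
            x_cross = (xj - xi) * (lat - yi) / (yj - yi) + xi
            if lon < x_cross:
                inside = not inside
        j = i
    return inside

def city_for_postcode(postcode, centroids, city_polygons):
    """Drop the postcode centroid into the polygons and read off the city."""
    centroid = centroids.get(postcode)
    if centroid is None:
        return None
    lon, lat = centroid
    for name, ring in city_polygons.items():
        if point_in_polygon(lon, lat, ring):
            return name
    return None  # no polygon contains the centroid

# Toy data: an approximate centroid and a rectangle, not a real boundary.
centroids = {"10115": (13.38, 52.53)}
city_polygons = {"Berlin": [(13.0, 52.3), (13.8, 52.3), (13.8, 52.7), (13.0, 52.7)]}

print(city_for_postcode("10115", centroids, city_polygons))  # Berlin
print(city_for_postcode("10115", centroids, {}))             # None
```

The second call is the interesting one. With no polygons to test against, the function has nothing to read off, however good the centroid is.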
What follows is a two-country story about what a geocoder can still do when the map underneath it goes thin, and where it finally can't. Japan we resolved anyway, 94% of the way, by putting the polygon down and asking a different question. Korea handed the same problem back to us turned inside-out: it let us pin the coordinate perfectly, every time, and then stopped us cold at the one thing we were really after, which is the name of the place you've landed in.
Three questions sit under all of it, so let me put them on the table before we start:
We set out to fix a small wart in our address parser and came away with a number that told us to put the screwdriver down.
Here is the wart. When our postcode extractor sees a five-digit run and wants to know whether it's a real postcode or just a house number that happens to look like one, it peeks at the words sitting next to it and checks them against every country's street vocabulary we know — American, German, French, all at once. That "all at once" is fine at three countries. At twenty it gets loud, and a German street suffix starts shadowing an English word by sheer coincidence. So we went looking for the clean way to tell the extractor which country's words to bother with.
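A toy version of the wart, to show how the shadowing happens. Everything here is hypothetical: the tiny `STREET_VOCAB` table, the `street_word_hits` function, and the `countries` parameter, which is one possible shape for the restriction and not necessarily the one we landed on. It also matches whole tokens only, where real German suffixes attach to the street name.

```python
import re

# Hypothetical, heavily truncated street vocabularies (casefolded).
STREET_VOCAB = {
    "US": {"street", "st", "avenue", "ave", "road", "rd"},
    "DE": {"strasse", "weg", "platz", "ring", "allee"},
    "FR": {"rue", "avenue", "boulevard", "chemin"},
}

def street_word_hits(text, countries=None):
    """For each five-digit run, list the countries whose street vocabulary
    matches a neighbouring token. countries=None means check them all."""
    tokens = re.findall(r"\w+", text.casefold())
    if countries is None:
        active = STREET_VOCAB
    else:
        active = {c: STREET_VOCAB[c] for c in countries}
    hits = []
    for i, tok in enumerate(tokens):
        if not re.fullmatch(r"\d{5}", tok):
            continue
        # One token on each side of the digit run.
        neighbors = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
        matched = sorted(c for c, vocab in active.items()
                         if vocab.intersection(neighbors))
        hits.append((tok, matched))
    return hits

# English text, but the German street word "ring" lights up anyway.
print(street_word_hits("ship 10115 ring binders"))
# Restricted to US vocabulary, the coincidence disappears.
print(street_word_hits("ship 10115 ring binders", countries=["US"]))
```

With three vocabularies the collision is a curiosity. Every country added widens the set of ordinary words that can trip the check.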
That question has a much bigger sibling, and chasing the sibling is where the story actually is.
Our neural address parser passes 20.7% of our test suite. The rule-based parser it's meant to replace passes 93.7%. By that scoreboard, we should delete the neural model and go home.
We shipped the neural model instead. Here's why both numbers are true — and why the one that matters says the opposite.
Mailwoman is an open-source address parser and geocoder that uses Who's On First as its gazetteer. This post is a practical reference on WOF's gotchas and the tooling we built to work around them. Try the demo or see what ships today.
Who's On First is the best open gazetteer we have. It's also one of the strangest datasets you'll encounter as a developer. This post is about what makes it hard to use, what makes it worth the effort, and the tooling we built inside Mailwoman to tame it.
If you've ever tried to answer "what city is this address in?" programmatically, using open data without paying a geocoding API, you've probably already run into WOF. And you probably had some questions.