
7 posts tagged with "Resolver / WOF"

Articles about the Who's On First gazetteer, concordance scoring, and resolver candidate ranking.


We spent three retrains fixing a German bug that didn't exist

· 5 min read
Teffen Ellis
Sister Software

There is a particular kind of engineering misery where you fix a bug three times and it never gets better, because the bug is in your ruler. This is that story.

Our neural parser handles German two ways. Native order — Hauptstraße 5, 10115 Berlin — is the layout real German feeds and real German people use. International order — 5 Hauptstraße, Berlin, 10115 — is the Americanized layout our evaluation set happens to ship. For months, international-order German "collapsed": locality accuracy sat around 44% while native cleared 80%. We had a story for it. The postcode anchor — a side-channel that feeds the model a country hint derived from the postcode — sits at the trailing postcode, which in international order lands on the far side of the locality from where it's needed. Plausible. So we retrained.

Which Berlin? When your metric grades the wrong thing

· 4 min read
Teffen Ellis
Sister Software

Ask a geocoder for "Berlin" and it has to make a choice. There's the one in Germany, obviously. There's also Berlin, New Hampshire (population nine thousand and change), Berlin, Wisconsin, Berlin, Connecticut, and a dozen more scattered across the United States like the name was on sale. The parser hands you the word Berlin tagged as a locality; something downstream has to decide which dot on the map that is. How would you even know if it picked right?

For a long time our answer was a scorecard that checked the name. Did the resolved place's name equal the expected name? Tick. Move on. It is a completely reasonable thing to measure, and it was lying to us for months.
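To make the blind spot concrete, here is a minimal sketch of a name-only check next to one possible stricter check. Everything in it is an assumption for illustration: the `Place` record, the `place_id` values, and the `identity_match` function are hypothetical, not Mailwoman's actual scorecard or real gazetteer IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    place_id: int  # hypothetical identifier, not a real gazetteer ID
    name: str
    country: str


def name_match(resolved: Place, expected: Place) -> bool:
    """Name-only check: passes whenever the two names agree."""
    return resolved.name.casefold() == expected.name.casefold()


def identity_match(resolved: Place, expected: Place) -> bool:
    """Stricter check: passes only if the same record was chosen."""
    return resolved.place_id == expected.place_id


# Made-up records for illustration only.
berlin_de = Place(place_id=1, name="Berlin", country="DE")
berlin_nh = Place(place_id=2, name="Berlin", country="US")

# The resolver picked New Hampshire; the test expected Germany.
print(name_match(berlin_nh, berlin_de))      # True
print(identity_match(berlin_nh, berlin_de))  # False
```

The name check gives a tick to the wrong Berlin because both records carry the same string. Any check that compares something specific to the record, such as an ID or a parent country, would catch it.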

The map runs out before the country does

· 11 min read
Teffen Ellis
Sister Software

We spent a good month teaching our resolver exactly one trick. Take a postcode, drop its centroid into the city polygon that happens to contain it, read off the city. It's a genuinely good trick. It got the Netherlands to 95% and Germany to 93%, and for a while it felt like the whole problem was going to fall to it. Then we pointed it at Japan, and Japan calmly informed us that it has no city polygons to drop anything into.
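For readers who want the trick spelled out, here is a rough sketch under stated assumptions: polygons are simple rings of (x, y) pairs with no holes, the containment test is plain ray casting, and the names `point_in_polygon` and `city_for_centroid` are made up for this example. It is not the resolver's actual code.

```python
from typing import Optional

Point = tuple[float, float]  # (x, y), for example (longitude, latitude)


def point_in_polygon(point: Point, polygon: list[Point]) -> bool:
    """Ray casting: count edge crossings of a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def city_for_centroid(centroid: Point, cities: dict[str, list[Point]]) -> Optional[str]:
    """Return the first city whose polygon contains the postcode centroid."""
    for name, polygon in cities.items():
        if point_in_polygon(centroid, polygon):
            return name
    return None


# Toy shapes, not real city boundaries.
cities = {
    "Squareville": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "Triangleton": [(10, 0), (14, 0), (12, 3)],
}
print(city_for_centroid((2, 2), cities))   # Squareville
print(city_for_centroid((12, 1), cities))  # Triangleton
print(city_for_centroid((2, 2), {}))       # None
```

The last call is the Japan case in miniature: with no polygons to test against, the lookup has nothing to return, however good the centroid is.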

What follows is a two-country story about what a geocoder can still do when the map underneath it goes thin, and where it finally can't. Japan we resolved anyway, 94% of the way, by putting the polygon down and asking a different question. Korea handed the same problem back to us turned inside-out: it let us pin the coordinate perfectly, every time, and then stopped us cold at the one thing we were really after, which is the name of the place you've landed in.

Three questions sit under all of it, so let me put them on the table before we start:

  • What do you do when the gazetteer gives you points where you expected shapes?
  • Does the move that rescues Japan actually generalize, or did we get lucky once and dress it up as a method?
  • And the question with no comfortable answer: what happens when the map is simply missing the part of a country you most need to see?

Does a postcode know what country it's in?

· 8 min read
Teffen Ellis
Sister Software

We set out to fix a small wart in our address parser and came away with a number that told us to put the screwdriver down.

Here is the wart. When our postcode extractor sees a five-digit run and wants to know whether it's a real postcode or just a house number that happens to look like one, it peeks at the words sitting next to it and checks them against every country's street vocabulary we know — American, German, French, all at once. That "all at once" is fine at three countries. At twenty it gets loud, and a German street suffix starts shadowing an English word by sheer coincidence. So we went looking for the clean way to tell the extractor which country's words to bother with.
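A small sketch of the wart, under loud assumptions: the vocabularies below are tiny made-up samples, and `looks_like_house_number` and its `country` parameter are hypothetical names for illustration, not the extractor's real interface.

```python
from typing import Optional

# Tiny, made-up vocabularies; real lists would be far larger.
STREET_WORDS = {
    "US": {"street", "avenue", "road", "drive"},
    "DE": {"straße", "weg", "allee", "ring"},
    "FR": {"rue", "avenue", "boulevard", "chemin"},
}


def looks_like_house_number(
    tokens: list[str], index: int, country: Optional[str] = None
) -> bool:
    """True if a token next to tokens[index] is a known street word.

    With country=None, every country's vocabulary is checked at once.
    """
    if country is None:
        vocab = set().union(*STREET_WORDS.values())
    else:
        vocab = STREET_WORDS.get(country, set())
    neighbors = tokens[max(0, index - 1):index] + tokens[index + 1:index + 2]
    return any(t.casefold().strip(".,") in vocab for t in neighbors)


# A business name followed by a five-digit run.
tokens = ["Boxing", "Ring", "10001"]
print(looks_like_house_number(tokens, 2))                # True
print(looks_like_house_number(tokens, 2, country="US"))  # False
```

Checked against everything at once, the English word "Ring" collides with the German street type of the same spelling and the five-digit run gets treated like a house number. Restricting the check to one country's words removes the collision, which is why the extractor needs to know which country to ask about.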

That question has a much bigger sibling, and chasing the sibling is where the story actually is.

Our parser fails 80% of our own tests. We shipped it anyway.

· 4 min read
Teffen Ellis
Sister Software

Our neural address parser passes 20.7% of our test suite. The rule-based parser it's meant to replace passes 93.7%. By that scoreboard, we should delete the neural model and go home.

We shipped the neural model instead. Here's why both numbers are true — and why the one that matters says the opposite.

Taming Who's On First — making sense of the world's open place data

· 10 min read
Teffen Ellis
Sister Software
If you found this via search

Mailwoman is an open-source address parser and geocoder that uses Who's On First as its gazetteer. This post is a practical reference on WOF's gotchas and the tooling we built to work around them. Try the demo or see what ships today.

Who's On First is the best open gazetteer we have. It's also one of the strangest datasets you'll encounter as a developer. This post is about what makes it hard to use, what makes it worth the effort, and the tooling we built inside Mailwoman to tame it.

If you've ever tried to answer "what city is this address in?" programmatically, using open data without paying for a geocoding API, you've probably already run into WOF. And you probably had some questions.