Gazetteer-first geocoding
The most radical simplification: don't parse at all. Treat the input as an information retrieval problem. Tokenize the string, try every token and n-gram against a gazetteer index, return the best placetype match. The parser is a search engine. The resolver is the search result.
The approach
This is the architecture behind Airmail, the Rust tantivy-based geocoder that sidesteps parsing entirely.
- Tokenize the input. Split on whitespace and punctuation. Produce every n-gram up to some reasonable limit (trigrams are usually sufficient for place names).
- Query a gazetteer index. The index contains every place name from Who's On First, OpenStreetMap, GeoNames, or another gazetteer. Each entry has a name, placetype, parent chain, and coordinate. The index supports prefix search and fuzzy matching.
- Rank results. Score each hit by: name match quality (exact > prefix > fuzzy), placetype priority (locality > region > country, or configurable), population (larger places score higher, all else equal), and token coverage (an n-gram that covers more of the input scores higher).
- Return the best placetype match. The result is a place with a placetype, coordinate, and parent chain. The system does not distinguish between "this token is a street" and "this token is a locality"; it only knows "this n-gram matches a place in the index."
This is genuinely elegant. It avoids every parsing problem by not parsing. The gazetteer knows what places exist. The search engine finds which place names appear in the input. The ranker picks the best one.
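The four steps above can be sketched in a few dozen lines. This is a minimal illustration, not Airmail's implementation: the `Place` record, the in-memory `GAZETTEER` list, the population figures, and the lexicographic score tuple are all assumptions made for the example. A real system would query a prebuilt index rather than scan a list, and fuzzy matching and de-duplication of repeated hits are left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    placetype: str
    parents: tuple   # parent chain, nearest ancestor first
    population: int  # illustrative round numbers, not real data


# Toy stand-in for a real gazetteer index (hypothetical contents).
GAZETTEER = [
    Place("Paris", "locality", ("France",), 2_000_000),
    Place("Paris", "locality", ("Texas", "United States"), 25_000),
    Place("France", "country", (), 68_000_000),
    Place("Texas", "region", ("United States",), 30_000_000),
]

MATCH_QUALITY = {"exact": 2, "prefix": 1}  # exact > prefix
PLACETYPE_PRIORITY = {"locality": 3, "region": 2, "country": 1}


def tokenize(text):
    """Lowercase and split on whitespace and punctuation."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, max_n=3):
    """Every n-gram up to trigrams, paired with its length in tokens."""
    return [
        (" ".join(tokens[i:i + n]), n)
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def search(text):
    """Return matching places, best first."""
    tokens = tokenize(text)
    hits = []
    for gram, n in ngrams(tokens):
        for place in GAZETTEER:
            name = place.name.lower()
            if name == gram:
                quality = "exact"
            elif name.startswith(gram):
                quality = "prefix"
            else:
                continue
            # Assumed ordering: match quality, then token coverage,
            # then placetype priority, then population.
            score = (
                MATCH_QUALITY[quality],
                n / len(tokens),
                PLACETYPE_PRIORITY.get(place.placetype, 0),
                place.population,
            )
            hits.append((score, place))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [place for _, place in hits]
```

With this toy data, `search("Paris, France")[0]` is the Paris whose parent chain starts with France. Nothing in the function ever labels a token as a locality or a country; it only records which n-grams matched an entry.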
When it works
- You only need administrative places. Country, region, locality, neighbourhood: these are in the gazetteer. Streets, building numbers, and venue names are not (or are in a different index). If your use case is "which city is this in?", gazetteer-first works.
- You need global coverage. The gazetteer covers every country. The search engine handles every script (the index stores UTF-8). No per-country regexes, no per-locale rules, no language-specific heuristics. Add a new country by adding its gazetteer data.
- You have ambiguous input. "Springfield" returns 34 candidates, ranked by population. The user or downstream system picks. The search engine doesn't need to resolve ambiguity; it surfaces it.
- You want to avoid the parsing problem entirely. No token-type classification, no BIO labels, no CRF, no solver, no policy registry. Just a search index and a ranking function. The architecture is a few hundred lines of code plus a pre-built index.
- Your addresses are short and place-name-heavy. "Paris, France" is two tokens, both in the index. "West 26th Street, New York, NY 10010" is many tokens, some in the index (New York, NY), some not (West, 26th, Street, 10010). The more of the input that is place names, the better this works.
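One rough way to check whether an input is "place-name-heavy" is to measure what fraction of its tokens fall inside an n-gram that the gazetteer knows. The sketch below assumes a hypothetical `known_names` set of lowercased place names standing in for the real index; the function name and the trigram limit are illustrative choices.

```python
import re


def place_name_coverage(text, known_names, max_n=3):
    """Fraction of input tokens covered by at least one known place name."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    covered = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            if " ".join(tokens[i:i + n]) in known_names:
                covered.update(range(i, i + n))  # mark every token in the n-gram
    return len(covered) / len(tokens)


# Hypothetical slice of a gazetteer's name list.
known_names = {"paris", "france", "new york", "ny"}

short = place_name_coverage("Paris, France", known_names)
long = place_name_coverage("West 26th Street, New York, NY 10010", known_names)
```

Here `short` is 1.0 (both tokens are place names) and `long` is 3/7 (only "new", "york", and "ny" are covered). A low value suggests that most of the input is street, number, or postcode material that a gazetteer-first lookup will ignore.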
What you lose
- Streets and building numbers. For "123 Main St, Springfield, IL", the gazetteer finds "Springfield, IL." "123 Main St" is not a place name, so it's ignored. The geocoder returns the city centroid. If you need street-level or building-level accuracy, this architecture does not provide it.
- Venue names. For "Empire State Building, NYC", the gazetteer finds "NYC." "Empire State Building" is not a place name in most gazetteers (WOF has some landmarks, but coverage is sparse). The geocoder returns the city centroid, not the building.
- Component-level output. The result is "Springfield, IL": a place, not a parse. The system does not tell you which tokens were the locality, which were the region, and which were noise. If you need structured output ({locality: Springfield, region: IL, street: Main St}), you need a parser.
- Token coverage as a signal. "Paris, France" and "Paris, Texas" both match a locality. The gazetteer returns both. The ranker picks Paris, France (higher population). If the input also contains "Texas", the ranker should boost Paris, Texas, but only if the ranker knows that "Texas" is a region that contains Paris, Texas. Gazetteer-first architectures can add this as a post-ranking filter, but it's not part of the core approach.
- Addresses that are not place names. For "50 miles West of Socorro, New Mexico", the gazetteer finds "Socorro, New Mexico." "50 miles West of" is not a place name and is dropped. The geocoder returns Socorro, which is 50 miles wrong. Directional descriptions are invisible to the gazetteer.
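The Paris, Texas case can be handled with a post-ranking step that boosts candidates whose parent chain is also mentioned in the input. The following is a sketch under assumptions: the `Candidate` record, the population figures, and the "count matching parents, then fall back to population" rule are illustrative, not part of any particular geocoder.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    parents: tuple   # parent chain, nearest ancestor first
    population: int  # illustrative round numbers, not real data


def normalize(text):
    """Lowercase, strip punctuation, and pad with spaces for whole-word matching."""
    return " " + " ".join(re.findall(r"\w+", text.lower())) + " "


def rerank_with_parents(text, candidates):
    """Prefer candidates whose ancestors also appear in the input text."""
    haystack = normalize(text)

    def parent_hits(candidate):
        return sum(1 for parent in candidate.parents if normalize(parent) in haystack)

    # More matching ancestors wins; population only breaks ties.
    return sorted(
        candidates,
        key=lambda c: (parent_hits(c), c.population),
        reverse=True,
    )


candidates = [
    Candidate("Paris", ("France",), 2_000_000),
    Candidate("Paris", ("Texas", "United States"), 25_000),
]

best_plain = rerank_with_parents("Paris", candidates)[0]
best_texas = rerank_with_parents("Paris, Texas", candidates)[0]
```

With no parent in the input, population decides and `best_plain` is the Paris in France. Once "Texas" appears, `best_texas` is the Paris in Texas despite its smaller population. The step depends on the parent chain stored with each gazetteer entry, so it still needs no parser.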
Where Mailwoman fits
Mailwoman and gazetteer-first architectures are investigating the same problem from opposite directions. Gazetteer-first says "skip parsing, search the index." Mailwoman says "parse first, then search the index with structured queries." Both converge on the same resolver: a WOF SQLite (or tantivy) index queried with place names.
Mailwoman's parser produces structured components (locality=Springfield, region=IL) that the resolver uses for constrained lookup. The gazetteer-first approach produces unstructured n-gram matches that the ranker evaluates. The structured approach is more precise when the parser is correct; the unstructured approach is more robust when the parser is wrong.
A hybrid is possible: use gazetteer-first as a fallback when the parser's confidence is low. If the parser returns locality=??? confidence=0.3, fall back to gazetteer-first on the raw input and return the best placetype match. The parser handles the well-formed addresses; the gazetteer handles the messy ones.
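The fallback logic might look like the sketch below. Every name here is hypothetical: `Parse`, `geocode`, and the three callables stand in for a parser, a structured resolver, and a gazetteer-first search, and do not reflect Mailwoman's actual API. The 0.5 threshold is an arbitrary assumption that would need tuning against real data.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Parse:
    locality: Optional[str]
    confidence: float


def geocode(
    text: str,
    parse: Callable[[str], Parse],
    structured_lookup: Callable[[Parse], str],
    gazetteer_search: Callable[[str], str],
    threshold: float = 0.5,  # assumed cut-off, not a recommended value
):
    """Use the parser when it is confident; otherwise search the raw input."""
    result = parse(text)
    if result.locality is not None and result.confidence >= threshold:
        return "parser", structured_lookup(result)
    return "gazetteer", gazetteer_search(text)


# Stub components, for illustration only.
def confident_parse(text):
    return Parse(locality="Springfield", confidence=0.9)


def unsure_parse(text):
    return Parse(locality=None, confidence=0.3)


def lookup(parsed):
    return f"resolved:{parsed.locality}"


def raw_search(text):
    return f"searched:{text}"


clean = geocode("123 Main St, Springfield, IL", confident_parse, lookup, raw_search)
messy = geocode("sprngfld ill", unsure_parse, lookup, raw_search)
```

Returning the path taken alongside the result lets downstream code tell a structured, parser-backed match from a best-effort gazetteer match.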
See also
- Normalize to match: the string-matching approach
- Locality-only geocoding: the next step up in resolution
- Resolver and Who's On First: the gazetteer Mailwoman uses
- Airmail: the reference implementation