Locality-only geocoding
Parse only the locality: the city, town, or village. Everything else is supplementary. Geocode to city-level accuracy. This is sufficient for statistical aggregation, market analysis, regional routing, and most applications that don't need to find a specific building.
The approach
- Find the locality. Tokenize the input. Try each token and bigram against a gazetteer of locality names. "Springfield" matches. "New York" matches as a bigram. "West 26th" does not match and is ignored. The approach is a targeted version of gazetteer-first geocoding: skip everything except locality names.
- Disambiguate if possible. If the input also contains a state abbreviation or postcode, use it to filter candidates. "Springfield, IL" → Springfield, IL. "Springfield" alone → 34 candidates, ranked by population.
- Return the locality centroid. The geographic center of the city. Off by up to several miles for large cities (Los Angeles is 500 square miles), within a few hundred meters for small towns.
- Ignore the rest. The street, building number, apartment, and venue are not needed for city-level geocoding. They are preserved in the raw input for display purposes.
This is the minimum viable geocoder: city-level accuracy from a gazetteer lookup. It handles every country with a gazetteer. It handles ambiguous names (Springfield) by surfacing the ambiguity rather than picking wrong. It handles every script the gazetteer supports.
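The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `Locality` record, the `GAZETTEER` rows, and the `find_localities` function are hypothetical names, the gazetteer is a four-row toy with approximate populations and coordinates, and treating any token that equals a known region code as a state abbreviation is deliberately naive.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Locality:
    name: str
    region: str      # state or province abbreviation
    population: int  # approximate, for ranking only
    lat: float       # centroid latitude (approximate)
    lon: float       # centroid longitude (approximate)


# Toy gazetteer (illustrative rows, not real gazetteer records).
GAZETTEER = [
    Locality("Springfield", "MO", 169_000, 37.21, -93.29),
    Locality("Springfield", "MA", 155_000, 42.10, -72.59),
    Locality("Springfield", "IL", 114_000, 39.80, -89.65),
    Locality("New York", "NY", 8_800_000, 40.71, -74.01),
]

# Index by lowercased name so lookups are case-insensitive.
INDEX: dict[str, list[Locality]] = {}
for loc in GAZETTEER:
    INDEX.setdefault(loc.name.lower(), []).append(loc)
REGIONS = {loc.region for loc in GAZETTEER}


def find_localities(text: str) -> list[Locality]:
    """Return candidate localities for a raw address string, largest first."""
    tokens = re.findall(r"\w+", text)
    lowered = [t.lower() for t in tokens]

    matches: list[Locality] = []
    used: set[int] = set()

    # Try bigrams first so "New York" wins over a bare "York".
    for i in range(len(lowered) - 1):
        phrase = f"{lowered[i]} {lowered[i + 1]}"
        if phrase in INDEX:
            matches.extend(INDEX[phrase])
            used.update((i, i + 1))

    # Then single tokens that were not consumed by a bigram match.
    for i, tok in enumerate(lowered):
        if i not in used and tok in INDEX:
            matches.extend(INDEX[tok])

    # Disambiguate: keep only candidates in a region named in the input.
    regions = {t.upper() for t in tokens if t.upper() in REGIONS}
    if regions:
        filtered = [m for m in matches if m.region in regions]
        if filtered:
            matches = filtered

    # Everything else (street, number, unit) is ignored.
    return sorted(matches, key=lambda m: m.population, reverse=True)


if __name__ == "__main__":
    for query in ("Springfield, IL", "Springfield", "West 26th St, New York, NY"):
        candidates = find_localities(query)
        print(query, "->", [(c.name, c.region, c.lat, c.lon) for c in candidates])
```

Returning the full ranked list rather than a single point keeps the ambiguity visible: a caller can accept the top candidate's centroid, or flag the record for review when more than one candidate comes back.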
When it works
- Statistical aggregation. Customer distribution by city. Sales by metropolitan area. Disease incidence by municipality. City-level accuracy is the unit of analysis for most public health, marketing, and demographic applications.
- Regional routing. "Route this delivery to the Los Angeles warehouse": you need to know it's in Los Angeles, not which building. The street address is for the last mile; the locality is for the regional sort.
- Market analysis. "How many customers do we have in the Bay Area?": city-level geocoding places customers in San Francisco, Oakland, San Jose, etc. Street-level accuracy adds detail but doesn't change the regional answer.
- Gazetteer coverage is good for localities. WOF has ~200,000 locality records globally. OSM has millions of place nodes. GeoNames has ~12 million populated places. For most of the world's population, the locality they live in is in a gazetteer.
- You serve countries without street-level addressing. Japan uses block-based addressing without named streets. Rural India, sub-Saharan Africa, and informal settlements use descriptive addresses without standard street names. Locality-only geocoding works where street-level geocoding has nothing to work with.
- You need global coverage fast. One gazetteer, one lookup strategy, every country. No per-country regexes, no per-locale rules. Add a country by adding its gazetteer data.
What you lose
- Everything finer than city-level. The building, the street, the neighborhood. A geocode at the city centroid is correct at the city level but wrong for any specific address within the city. A delivery to "123 Main St, Springfield" routed to the Springfield centroid is off by up to several miles.
- Large cities. Los Angeles, New York, Tokyo, London: a centroid in a 500-square-mile city is off by up to 15 miles from the actual address. City-level accuracy in a large city is approximately county-level accuracy.
- Duplicate locality names. 34 Springfields. Three Newports (UK). Two Eursinges (Netherlands, same province). Without a state or postcode, the gazetteer returns all candidates and the ranker picks by population. The largest Springfield is in Missouri. If the user meant Springfield, MA, the geocoder is wrong, and confident about it.
- Postal city vs. legal city. USPS accepts "Los Angeles" as the mailing city for addresses in Beverly Hills and West Hollywood. The gazetteer lists Beverly Hills as a separate city. A locality-only geocoder that looks up "Los Angeles" in the gazetteer returns the Los Angeles centroid, 5 miles from the actual address in Beverly Hills.
- Neighborhoods as cities. "Brooklyn, NY": USPS accepts Brooklyn as a mailing city for many NYC ZIP codes. The gazetteer may have Brooklyn as a neighborhood of New York City, or may not have it as a locality at all. A locality-only geocoder that doesn't find "Brooklyn" in the locality index returns nothing for a real address.
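As a back-of-envelope check on the large-city figure in the list above, the sketch below treats a city as a perfect circle with the centroid at its center and computes the distance to the edge. The circular shape is an assumption for illustration only, and `equivalent_radius_miles` is a hypothetical helper name. Real city boundaries are irregular and often elongated, so the true worst case is larger than the circular estimate.

```python
import math


def equivalent_radius_miles(area_sq_miles: float) -> float:
    """Radius of a circle with the same area as the city.

    Under the circular-city assumption, this is the largest possible
    distance between the centroid and an address inside the city.
    """
    return math.sqrt(area_sq_miles / math.pi)


if __name__ == "__main__":
    radius = equivalent_radius_miles(500)
    print(f"{radius:.1f} miles")  # prints "12.6 miles"
```

A perfectly circular 500-square-mile city already puts the farthest address about 12.6 miles from the centroid, so an error of up to 15 miles for an irregularly shaped city is plausible.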
Where Mailwoman fits
Mailwoman's resolver uses locality as the primary search key, constrained by region and postcode. The resolver returns top-K candidates per administrative span. The reconciler picks the coherent candidate from the joint parse. This is locality-only geocoding with more constraints: the resolver knows the region and postcode from the parser, so it doesn't need to disambiguate 34 Springfields from population alone.
A system that starts with locality-only geocoding can add Mailwoman as a refinement pass. Locality-only gives you city-level accuracy. Mailwoman gives you the street and building number, feeding a more precise resolver query. The two approaches share the same gazetteer infrastructure: the resolver is the same code, queried with more or fewer constraints.
See also
- Gazetteer-first geocoding: the full-index approach
- Postcode-only geocoding: the other single-field approach
- Falsehoods about administrative hierarchy: the city-name assumptions that break this
- Resolver and Who's On First: the gazetteer Mailwoman uses