
Locality-only geocoding

Parse only the locality: the city, town, or village. Everything else is supplementary. Geocode to city-level accuracy. This is sufficient for statistical aggregation, market analysis, regional routing, and most applications that don't need to find a specific building.

The approach

  1. Find the locality. Tokenize the input. Try each token and bigram against a gazetteer of locality names. "Springfield" matches. "New York" matches as a bigram. "West 26th" does not match and is ignored. The approach is a targeted version of gazetteer-first geocoding: skip everything except locality names.
  2. Disambiguate if possible. If the input also contains a state abbreviation or postcode, use it to filter candidates. Springfield, IL → Springfield, IL. Springfield alone → 34 candidates, ranked by population.
  3. Return the locality centroid. The geographic center of the city. Off by up to several miles for large cities (Los Angeles is 500 square miles), within a few hundred meters for small towns.
  4. Ignore the rest. The street, building number, apartment, and venue are not needed for city-level geocoding. They are preserved in the raw input for display purposes.

This is the minimum viable geocoder: city-level accuracy from a gazetteer lookup. It handles every country with a gazetteer. It handles ambiguous names (Springfield) by surfacing the ambiguity rather than picking wrong. It handles every script the gazetteer supports.
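The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Locality` record, the `geocode_locality` function, and the four-row toy gazetteer (with rounded populations and approximate centroids) are all assumptions made for the example. A real gazetteer would be loaded from WOF, OSM, or GeoNames data.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Locality:
    name: str
    region: str      # state abbreviation
    population: int
    lat: float       # centroid latitude
    lon: float       # centroid longitude


# Toy gazetteer (hypothetical): rounded, illustrative values only.
GAZETTEER = [
    Locality("Springfield", "MO", 169_000, 37.21, -93.29),
    Locality("Springfield", "MA", 156_000, 42.10, -72.59),
    Locality("Springfield", "IL", 114_000, 39.80, -89.64),
    Locality("New York", "NY", 8_800_000, 40.71, -74.01),
]

# Index localities by lowercased name for exact-match lookup.
BY_NAME = {}
for loc in GAZETTEER:
    BY_NAME.setdefault(loc.name.lower(), []).append(loc)

REGIONS = {loc.region for loc in GAZETTEER}


def geocode_locality(text):
    """Return candidate localities for a raw address string, best first."""
    # Step 1: tokenize, then try every bigram and every single token.
    tokens = re.findall(r"\w+", text)
    lowered = [t.lower() for t in tokens]
    grams = [" ".join(lowered[i:i + 2]) for i in range(len(lowered) - 1)]
    grams += lowered
    candidates = []
    for gram in grams:
        for loc in BY_NAME.get(gram, []):
            if loc not in candidates:
                candidates.append(loc)
    # Step 2: if a known state abbreviation appears, use it as a filter.
    regions = {t for t in tokens if t in REGIONS}
    if regions:
        filtered = [c for c in candidates if c.region in regions]
        candidates = filtered or candidates
    # Step 3: rank by population; each candidate carries its centroid.
    # Step 4: tokens that matched nothing (street, number) are ignored.
    return sorted(candidates, key=lambda c: c.population, reverse=True)
```

With this toy data, `geocode_locality("123 Main St, Springfield, IL")` returns the single Illinois record, while `geocode_locality("Springfield")` returns all three Springfields ordered by population. The sketch only treats uppercase tokens as state abbreviations, since abbreviations such as "IN" and "OR" collide with ordinary words.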

When it works

  • Statistical aggregation. Customer distribution by city. Sales by metropolitan area. Disease incidence by municipality. City-level accuracy is the unit of analysis for most public health, marketing, and demographic applications.
  • Regional routing. "Route this delivery to the Los Angeles warehouse." You need to know it's in Los Angeles, not which building. The street address is for the last mile; the locality is for the regional sort.
  • Market analysis. "How many customers do we have in the Bay Area?" City-level geocoding places customers in San Francisco, Oakland, San Jose, etc. Street-level accuracy adds detail but doesn't change the regional answer.
  • Gazetteer coverage is good for localities. WOF has ~200,000 locality records globally. OSM has millions of place nodes. GeoNames lists ~12 million features, of which roughly 4.8 million are populated places. For most of the world's population, the locality they live in is in a gazetteer.
  • You serve countries without street-level addressing. Japan uses block-based addressing without named streets. Rural India, sub-Saharan Africa, and informal settlements use descriptive addresses without standard street names. Locality-only geocoding works where street-level geocoding has nothing to work with.
  • You need global coverage fast. One gazetteer, one lookup strategy, every country. No per-country regexes, no per-locale rules. Add a country by adding its gazetteer data.

What you lose

  • Everything finer than city-level. The building, the street, the neighborhood. A geocode at the city centroid is correct at the city level but wrong for any specific address within the city. A delivery to "123 Main St, Springfield" routed to the Springfield centroid is off by up to several miles.
  • Large cities. Los Angeles, New York, Tokyo, London: a centroid in a 500-square-mile city can be up to 15 miles from the actual address. City-level accuracy in a large city is approximately county-level accuracy.
  • Duplicate locality names. 34 Springfields. Three Newports (UK). Two Eursinges (Netherlands, same province). Without a state or postcode, the gazetteer returns all candidates and the ranker picks by population. The largest Springfield is in Missouri. If the user meant Springfield, MA, the geocoder is wrong, and confident about it.
  • Postal city vs. legal city. USPS accepts "Los Angeles" as the mailing city for addresses in Beverly Hills and West Hollywood. The gazetteer lists Beverly Hills as a separate city. A locality-only geocoder that looks up "Los Angeles" in the gazetteer returns the Los Angeles centroid, 5 miles from the actual address in Beverly Hills.
  • Neighborhoods as cities. Brooklyn, NY: USPS accepts Brooklyn as a mailing city for many NYC ZIP codes. The gazetteer may have Brooklyn as a neighborhood of New York City, or may not have it as a locality at all. A locality-only geocoder that doesn't find "Brooklyn" in the locality index returns nothing for a real address.
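The large-city error figures can be sanity-checked with a little geometry. The sketch below assumes, purely for illustration, that the city is a perfect circle with the centroid at its center and addresses spread uniformly across it; both function names are hypothetical. Real city boundaries are irregular and elongated, so the true worst case is larger than the circular estimate.

```python
import math


def equivalent_radius_miles(area_sq_miles):
    """Radius of a circle with the given area.

    For an idealized circular city this is the worst-case distance
    between the centroid and an address on the city boundary.
    """
    return math.sqrt(area_sq_miles / math.pi)


def mean_error_miles(area_sq_miles):
    """Mean distance from the center to a uniformly random point.

    For a disc of radius R this is 2R/3.
    """
    return 2 * equivalent_radius_miles(area_sq_miles) / 3


print(round(equivalent_radius_miles(500), 1))  # 12.6
print(round(mean_error_miles(500), 1))         # 8.4
```

Under these assumptions a 500-square-mile city gives a worst case of about 12.6 miles and a typical error of about 8.4 miles, which is in line with the "up to 15 miles" figure once the irregular shape of a real city is taken into account.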

Where Mailwoman fits

Mailwoman's resolver uses locality as the primary search key, constrained by region and postcode. The resolver returns top-K candidates per administrative span. The reconciler picks the coherent candidate from the joint parse. This is locality-only geocoding with more constraints: the resolver knows the region and postcode from the parser, so it doesn't need to disambiguate 34 Springfields from population alone.

A system that starts with locality-only geocoding can add Mailwoman as a refinement pass. Locality-only gives you city-level accuracy. Mailwoman gives you the street and building number, feeding a more precise resolver query. The two approaches share the same gazetteer infrastructure: the resolver is the same code, queried with more or fewer constraints.
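The idea of one resolver queried with more or fewer constraints can be illustrated as follows. This is not Mailwoman's API: the `Place` record, the `resolve` function, its parameters, and the three-row table are hypothetical, and the populations are rounded. The point is only that the same lookup serves both the locality-only case and the constrained case.

```python
from typing import NamedTuple


class Place(NamedTuple):
    name: str
    region: str
    postcode_prefixes: tuple  # leading ZIP digits served by this place
    population: int


# Hypothetical toy table; values are illustrative only.
PLACES = [
    Place("Springfield", "MO", ("658",), 169_000),
    Place("Springfield", "MA", ("011",), 156_000),
    Place("Springfield", "IL", ("627",), 114_000),
]


def resolve(locality, region=None, postcode=None, k=3):
    """Return up to k candidates; each supplied constraint narrows the set."""
    hits = [p for p in PLACES if p.name.lower() == locality.lower()]
    if region is not None:
        hits = [p for p in hits if p.region == region]
    if postcode is not None:
        # str.startswith accepts a tuple of prefixes.
        hits = [p for p in hits if postcode.startswith(p.postcode_prefixes)]
    hits.sort(key=lambda p: p.population, reverse=True)
    return hits[:k]
```

Calling `resolve("Springfield")` falls back to population ranking, while `resolve("Springfield", region="MA")` or `resolve("Springfield", postcode="62704")` returns a single candidate without relying on population at all.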

See also