37 docs tagged with "domain"

Addresses that break geocoders

What is an address? covered the data model. This article goes the other way: a tour of address shapes that consistently break parsers and resolvers, with concrete examples for each. If you are building or evaluating a geocoder, this is the failure-mode catalogue worth keeping next to your test suite.

Amenity queries

An amenity query asks for the nearest instance of a generic category. The user doesn't care which gas station — they care which one is closest, open, or cheapest. The geocoder's job is to translate a category label into a set of candidate locations and rank them.

Close-enough geocoding

Not every application needs sub-meter accuracy. For most applications, "in the right city" is close enough. A geocode that places a customer in the correct metropolitan area is sufficient for market analysis, sales territory assignment, and regional logistics. The coordinate doesn't need to be correct — it needs to be useful.

Exotic point-of-interest queries

Not every geocoder query is an address. A large fraction of real-world searches ask for points of interest — named places, categories of things, brands, landmarks, and transit infrastructure. "Find the nearest gas station" is not an address. "Where is the Eiffel Tower?" is not an address. "Show me every Hilton in Manhattan" is not an address.

Falsehoods about address format

The falsehoods

Falsehoods about address shapes and dimensions

An address is not necessarily a point on a map. It is not necessarily a polygon. It is not necessarily a discrete building. It is not necessarily at ground level. It is not necessarily the only address at its coordinates. And it is not necessarily stationary.

Falsehoods about addresses

This article series is inspired by and cites Michael Tandy's excellent, exhaustive original — the canonical catalogue of address falsehoods, maintained since 2013. Tandy's article is a taxonomy of assumptions that break parsers, validators, and databases. This series expands on that taxonomy, adding historical context on how geocoders have handled (or failed to handle) each category, and what Mailwoman's neural approach changes.

Falsehoods about administrative hierarchy

The falsehoods

Falsehoods about geocoded precision and frontages

"Close enough" is a statement about your use case, not about the coordinate. A geocode that is correct for statistical aggregation may be catastrophically wrong for emergency dispatch. And the coordinate itself answers a question — "the front door" — that nobody bothered to define.

Falsehoods about numbers in addresses

The falsehoods

Falsehoods about postcodes

The falsehoods

Falsehoods about street names

The falsehoods

Franchise and brand queries

A franchise query names a specific chain or brand. The user doesn't want the nearest restaurant — they want the nearest McDonalds. The brand name is the query, and the geocoder's job is to resolve it to one or more specific locations.

Gazetteer-first geocoding

The most radical simplification: don't parse at all. Treat the input as an information retrieval problem. Tokenize the string, try every token and n-gram against a gazetteer index, return the best placetype match. The parser is a search engine. The resolver is the search result.

How can a building have two addresses?

A single physical building can have multiple valid addresses. For commercial buildings, multifamily housing, corner properties, and any structure that touches more than one administrative system, that is the normal state of affairs. A parser that assumes "one building = one address" will fail silently on a large fraction of real-world queries.

How humans break addresses

Users do not type addresses the way gazetteers store them. They type what they know, in the order they think of it, with the spellings their keyboard supports, trusting autocomplete suggestions they didn't verify. A parser that only handles well-formed addresses fails on real input.

How mail delivery actually works

Before you can judge a parser, you need to know what it is parsing for. Address parsing is not an academic exercise in string labelling — it exists to route physical objects to physical places through a system designed in the 19th century and patched ever since.

Human-in-the-loop geocoding

Don't parse. Don't resolve. Don't guess. Show the user what you think they meant and let them confirm. The parser is a suggestion engine, not an authority. The user is the ground truth.

Landmark queries

A landmark query names a specific, usually unique, place. The user doesn't want the nearest instance of a category — they want the Eiffel Tower, the Golden Gate Bridge, the Empire State Building. The geocoder's job is to recognize the landmark name and resolve it to a single coordinate or small candidate set.

Locality-only geocoding

Parse only the locality — the city, town, or village. Everything else is supplementary. Geocode to city-level accuracy. This is sufficient for statistical aggregation, market analysis, regional routing, and most applications that don't need to find a specific building.

Normalize to match

The simplest geocoder doesn't parse addresses at all. It normalizes them — strips punctuation, lowercases, expands abbreviations — and matches the result against a known database of addresses. The "parser" is a fuzzy string matcher. The "resolver" is a hash table lookup.

Postcode-only geocoding

The fastest geocoder extracts the postcode, looks it up in a postcode-to-coordinate table, and returns the centroid. It doesn't parse the street, the building number, or the locality. It doesn't need to. The postcode is the most machine-readable part of any address, and for many applications, postcode-level accuracy is sufficient.

Regex-anchored fields

Most applications don't need a full address parse. They need three or four fields: the postcode, the state, the street number, maybe the street name. Extract those with regexes. Ignore everything else. The unparsed tokens are "the rest" — available for display, label printing, and downstream processing, but not part of the geocoding decision.

Regional variant queries

A regional variant is a local term for an amenity, brand, or category that differs from the global or standard name. "Servo" means gas station in Australia. "Bodega" means corner store in New York City. "マクド" (makudo) means McDonalds in the Kansai region of Japan. In their region, these are the standard — the everyday word millions of people reach for, every bit as legitimate as the global name.

The 90% trap

A geocoder that correctly parses 90% of addresses sounds good. It sounds especially good when the alternative — building or buying a better one — costs engineering time or API fees. The trap is that the 10% tail is not random. It is concentrated in the populations you most need to serve: rural areas, developing economies, multifamily housing, and anyone whose address doesn't look like a suburban US single-family home.

The case for simple geocoders

Not every geocoding problem needs a neural parser. This series steel-mans the reasonably defensible compromises that work for most applications — the architectures that ship in days, cover 90% of addresses, and cost nothing to maintain. Each article describes one approach, when it works, what it loses, and where Mailwoman fits (or doesn't).

The database fallacy

There is a persistent belief in geocoding: "If we just had a database of all addresses, this problem would be trivial." The belief is wrong in three independent ways. Together, they make the database approach not merely incomplete but structurally inadequate.

The tokenization tautology

Traditional address parsers split the input into tokens, classify each token independently, then try to reassemble the pieces into a coherent parse. This sequence contains a structural circularity: you cannot group tokens correctly without knowing their types, and you cannot type them correctly without knowing their groups. The traditional architecture resolves this with heuristics, exceptions, and solver post-processing. The exception pile grows without bound.

Transit queries

A transit query names a station, stop, airport, terminal, or interchange. The user wants to find the transit facility — to depart from it, arrive at it, or navigate near it. Transit queries sit between landmark queries (named stations are landmarks) and amenity queries (unnamed bus stops are amenities).

What is a concordance?

In Mailwoman's architecture, a concordance is the resolver's answer to the question: "Do these parsed components form a coherent place in the real world?" It is the mechanism that prevents the parser from emitting a parse that is structurally valid but geographically impossible — like "Paris, Texas" labelled as "Paris, Île-de-France" or "NY-NY Steakhouse, Houston TX" with "NY" tagged as a region.

What is a postcode?

A postcode is a routing instruction, not a geographic area. It tells a postal service how to sort and deliver mail. It does not tell a geocoder where a building is, what municipality it belongs to, or what polygon contains it. Confusing these two things is the most common error in address geocoding — and the source of a surprising fraction of production bugs.

What is a ZIP Code?

The US ZIP Code is the most influential postal code system in the world — not because it is the best, but because US-origin address data dominates geocoding training sets and shapes what parsers expect. Understanding its structure is essential for understanding why US-trained parsers fail on international addresses.

What is an address?

Addresses look obvious. You see one every day. But parsing them turns out to be one of the harder problems in natural language processing, because a single string can mean many places, and a single place can be written many ways.

What is an intersection address?

An intersection address names a location by the crossing of two streets. Unlike a conventional street address ("350 5th Ave"), it does not specify a building number. It says "find where these two streets meet." This is common in urban navigation, emergency dispatch, and some countries (notably Japan) where the street-address system is entirely different.

Why a neural parser?

The first five articles in this track established the problem:

Why not just use Google's API?

Google's Geocoding API is the default choice for geocoding. It is fast, globally comprehensive, and returns results in milliseconds. For many applications, calling Google is the right move. For applications that need to own their address data, audit their results, or operate at scale, it is the expensive kind of cheap.

Why not use geocode.earth?

If Google's API is the proprietary default, geocode.earth is the open-source default. It runs Pelias — the geocoder Mailwoman forked — as a hosted service with global coverage, open data, and transparent pricing. For many applications, geocode.earth is the right answer. For applications that need to own their parser, it has the same fundamental limitation as every hosted API: you do not own the parsing layer.