
Falsehoods programmers believe about addresses

This article series is inspired by and cites Michael Tandy's excellent, exhaustive original: the canonical catalogue of address falsehoods, maintained since 2013. Tandy's article is a taxonomy of assumptions that break parsers, validators, and databases. This series expands on that taxonomy, adding historical context on how geocoders have handled (or failed to handle) each category, and what Mailwoman's neural approach changes.

The falsehoods are the central cases

Tandy's falsehoods are not edge cases. They are the central cases that rule-based geocoders fail on. Each falsehood is a place where a human can see what's happening ("that's a building number, even though it's a fraction") but a regex cannot. The thesis of Mailwoman's neural approach is that a model trained on diverse address data can learn to handle these cases without explicit rules and, more importantly, can handle combinations of falsehoods that no rule set could enumerate.
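To make the regex point concrete, here is a minimal sketch of the kind of rule a hand-written parser might start with. The pattern, the function name `naive_split`, and the sample addresses are all invented for illustration; none of them come from Tandy's article or from Mailwoman.

```python
import re

# Hypothetical rule: a house number is a run of digits at the start of the line.
NAIVE_HOUSE_NUMBER = re.compile(r"^(\d+)\s+(.*)$")


def naive_split(address: str):
    """Return (house_number, rest), or None when the rule does not match."""
    match = NAIVE_HOUSE_NUMBER.match(address)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Illustrative inputs only, not real data.
plain = naive_split("12 Main Street")
fraction = naive_split("43 1/2 Main Street")
named = naive_split("Rose Cottage, Mill Lane")

print(plain)     # the rule works on the case it was written for
print(fraction)  # the fraction is pushed into the street name
print(named)     # a named building has no number at all
```

A reader sees "43 1/2" as one building number. The rule sees "43" followed by a street called "1/2 Main Street". Patching the pattern for fractions is easy; the trouble is that each patch covers one falsehood, and the combinations multiply.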

Mailwoman is not the first project to notice this. Deepparse (2020) showed that a BiLSTM could match libpostal on structured addresses. The academic literature since has confirmed that transformers beat CRFs on noisy and multilingual address data. What Mailwoman adds is a staged pipeline that separates concerns: a phrase grouper proposes boundaries, a neural classifier types spans, a CRF enforces sequence validity, and a reconciler checks joint coherence against a gazetteer. Each stage handles a different class of falsehood without the others needing to know about it.
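The separation of concerns described above can be sketched as a chain of small functions, each owning one stage. This is a toy under stated assumptions, not Mailwoman's actual API: every name here (`Span`, `group_phrases`, `classify_spans`, `enforce_sequence`, `reconcile`, `TOY_GAZETTEER`) is hypothetical, the classifier is a hand-written heuristic standing in for a neural model, and the sequence check is a trivial stand-in for a CRF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    text: str
    label: str = "unknown"


def group_phrases(raw: str) -> list[Span]:
    # Stage 1 stand-in: propose span boundaries by splitting on commas.
    return [Span(part.strip()) for part in raw.split(",") if part.strip()]


def classify_spans(spans: list[Span]) -> list[Span]:
    # Stage 2 stand-in: a heuristic in place of the neural classifier.
    typed = []
    for span in spans:
        if span.text.replace(" ", "").isdigit():
            label = "postcode"
        elif span.text[0].isdigit():
            label = "street"
        else:
            label = "locality"
        typed.append(Span(span.text, label))
    return typed


def enforce_sequence(spans: list[Span]) -> list[Span]:
    # Stage 3 stand-in for the CRF: each label may appear at most once;
    # repeats are demoted to "unknown" instead of being trusted.
    seen: set[str] = set()
    checked = []
    for span in spans:
        if span.label in seen:
            checked.append(Span(span.text, "unknown"))
        else:
            seen.add(span.label)
            checked.append(span)
    return checked


# Made-up gazetteer of (locality, postcode) pairs. Postcodes are strings,
# so the leading zero survives.
TOY_GAZETTEER = {("Exampleville", "01234")}


def reconcile(spans: list[Span], gazetteer=TOY_GAZETTEER) -> bool:
    # Stage 4 stand-in: do the typed spans agree with each other?
    fields = {span.label: span.text for span in spans}
    return (fields.get("locality"), fields.get("postcode")) in gazetteer


def parse(raw: str) -> tuple[list[Span], bool]:
    spans = group_phrases(raw)
    for stage in (classify_spans, enforce_sequence):
        spans = stage(spans)
    return spans, reconcile(spans)


spans, coherent = parse("12 Main Street, Exampleville, 01234")
_, mismatched = parse("12 Main Street, Exampleville, 99999")
```

Each stage takes and returns plain spans, so any one of them could be swapped for a stronger implementation without touching the others. That is the property the staged design is after; the heuristics inside each function here are placeholders.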

The categories

Each article in this series takes one category of falsehood, explains what traditional geocoders assumed, what counterexamples broke those assumptions, and how Mailwoman's architecture addresses the class of problem rather than the individual counterexample.

Category | Article | What it covers
Numbers | Falsehoods about numbers in addresses | Zero, negative, fractions, duplicates, ranges, names that are numbers
Streets | Falsehoods about street names | Missing suffixes, numbered streets, recurring names, addresses with no streets at all
Postcodes | Falsehoods about postcodes | Leading zeros, multi-city postcodes, per-building postcodes, missing postcodes
Hierarchy | Falsehoods about administrative hierarchy | No states, no counties, duplicate city names, city-states
Format | Falsehoods about address format | Punctuation, non-ASCII, variable ordering, mixed character sets, changing addresses
Shapes | Falsehoods about address shapes and dimensions | Not a point, not a polygon, not a building, not at ground level, not unique per coordinate
Precision | Falsehoods about geocoded precision and frontages | Not one correct coordinate, not always "close enough," not the front door

The original

If you haven't read Michael Tandy's original article, start there: Falsehoods programmers believe about addresses. It is the reference this series builds on. Every falsehood in these articles originates in Tandy's catalogue or in the operator's own production-geocoder experience. Attribution is in each article's introduction.

See also