Falsehoods programmers believe about addresses
This article series is inspired by and cites Michael Tandy's excellent, exhaustive original: the canonical catalogue of address falsehoods, maintained since 2013. Tandy's article is a taxonomy of assumptions that break parsers, validators, and databases. This series expands on that taxonomy, adding historical context on how geocoders have handled (or failed to handle) each category, and what Mailwoman's neural approach changes.
The falsehoods are the central cases
Tandy's falsehoods are not edge cases. They are the central cases that rule-based geocoders fail on. Each falsehood is a place where a human can see what's happening ("that's a building number, even though it's a fraction") but a regex cannot. The thesis of Mailwoman's neural approach is that a model trained on diverse address data can learn to handle these cases without explicit rules and, more importantly, can handle combinations of falsehoods that no rule set could enumerate.
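To make the fraction example concrete, here is a minimal sketch of a deliberately naive house-number regex. The pattern, the `naive_split` helper, and the sample addresses are illustrative assumptions, not code from any particular geocoder.

```python
import re

# Hypothetical naive rule: "an address is digits, whitespace, then a street".
NAIVE_ADDRESS = re.compile(r"^(\d+)\s+(.+)$")


def naive_split(address):
    """Return (number, street), or None when the rule does not apply."""
    match = NAIVE_ADDRESS.match(address)
    if match is None:
        return None
    return match.group(1), match.group(2)


# The conventional case works.
print(naive_split("221 Example Street"))

# A fraction character is not a decimal digit, so the rule rejects the address.
print(naive_split("43½ Example Street"))

# A spelled-out fraction is silently pushed into the street name.
print(naive_split("123 1/2 Example Street"))

# A range fails the same way the fraction character does.
print(naive_split("12-14 Example Street"))
```

Each failure can be patched with another pattern, but every patch is one more rule, and the patches have to be written for every combination as well.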
Mailwoman is not the first project to notice this. Deepparse (2020) showed that a BiLSTM could match libpostal on structured addresses. The academic literature since has confirmed that transformers beat CRFs on noisy and multilingual address data. What Mailwoman adds is a staged pipeline that separates concerns: a phrase grouper proposes boundaries, a neural classifier types spans, a CRF enforces sequence validity, and a reconciler checks joint coherence against a gazetteer. Each stage handles a different class of falsehood without the others needing to know about it.
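The separation of concerns described above can be sketched as four small functions. This is a toy under stated assumptions, not Mailwoman's actual API: every name here (`Span`, `group_phrases`, `classify_spans`, `sequence_is_valid`, `reconcile`, `parse`) is hypothetical, the classifier is a pair of hard-coded rules standing in for a neural model, the sequence check is a fixed transition whitelist standing in for a CRF, and the gazetteer is a set of made-up locality and postcode pairs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Span:
    text: str
    label: Optional[str] = None


def group_phrases(raw: str) -> List[Span]:
    # Stage 1 stand-in: propose span boundaries by splitting on commas.
    return [Span(part.strip()) for part in raw.split(",") if part.strip()]


def classify_spans(spans: List[Span]) -> List[Span]:
    # Stage 2 stand-in: toy rules where a real system would run a neural classifier.
    for span in spans:
        if span.text.isdigit():
            span.label = "postcode"
        elif span.text[0].isdigit():
            span.label = "street_address"
        else:
            span.label = "locality"
    return spans


# Stage 3 stand-in: a hard whitelist of label transitions. A real CRF would
# score transitions and decode the best label sequence instead.
ALLOWED_TRANSITIONS: Set[Tuple[str, str]] = {
    ("street_address", "locality"),
    ("locality", "postcode"),
}


def sequence_is_valid(spans: List[Span]) -> bool:
    labels = [span.label for span in spans]
    return all(pair in ALLOWED_TRANSITIONS for pair in zip(labels, labels[1:]))


def reconcile(spans: List[Span], gazetteer: Set[Tuple[str, str]]) -> bool:
    # Stage 4 stand-in: the locality and postcode must appear together in the gazetteer.
    by_label: Dict[Optional[str], str] = {span.label: span.text for span in spans}
    return (by_label.get("locality"), by_label.get("postcode")) in gazetteer


def parse(raw: str, gazetteer: Set[Tuple[str, str]]) -> dict:
    spans = classify_spans(group_phrases(raw))
    return {
        "labels": [(span.text, span.label) for span in spans],
        "sequence_ok": sequence_is_valid(spans),
        "coherent": reconcile(spans, gazetteer),
    }


# Made-up gazetteer entry and address, for illustration only.
GAZETTEER = {("Springfield", "01234")}
result = parse("12 Example Road, Springfield, 01234", GAZETTEER)
print(result)
```

The point of the layout is that each stage can be replaced without touching the others: swapping the toy rules in `classify_spans` for a trained model changes nothing in `reconcile`, and a richer gazetteer changes nothing upstream.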
The categories
Each article in this series takes one category of falsehood and explains what traditional geocoders assumed, which counterexamples broke those assumptions, and how Mailwoman's architecture addresses the class of problem rather than the individual counterexample.
| Category | Article | What it covers |
|---|---|---|
| Numbers | Falsehoods about numbers in addresses | Zero, negative, fractions, duplicates, ranges, names that are numbers |
| Streets | Falsehoods about street names | Missing suffixes, numbered streets, recurring names, addresses with no streets at all |
| Postcodes | Falsehoods about postcodes | Leading zeros, multi-city postcodes, per-building postcodes, missing postcodes |
| Hierarchy | Falsehoods about administrative hierarchy | No states, no counties, duplicate city names, city-states |
| Format | Falsehoods about address format | Punctuation, non-ASCII, variable ordering, mixed character sets, changing addresses |
| Shapes | Falsehoods about address shapes and dimensions | Not a point, not a polygon, not a building, not at ground level, not unique per coordinate |
| Precision | Falsehoods about geocoded precision and frontages | Not one correct coordinate, not always "close enough," not the front door |
The original
If you haven't read Michael Tandy's original article, start there: Falsehoods programmers believe about addresses. It is the reference this series builds on. Every falsehood in these articles originates in Tandy's catalogue or in the operator's own production-geocoder experience. Attribution is in each article's introduction.
See also
- How mail delivery actually works: the system these falsehoods enter
- How humans break addresses: the failure taxonomy organized by root cause
- The tokenization tautology: why rule-based parsers can't handle combinations of falsehoods