Skip to main content

Falsehoods programmers believe about address format

The falsehoods

Programmers consistently assume addresses are written in ASCII, follow a fixed ordering, don't contain special characters, and stay the same over time. Real addresses violate all of these — and the violations compound: an address that uses Cyrillic, in non-Western ordering, with an apostrophe in the street name, at a location that changed jurisdictions last year.

"Addresses are written in ASCII."

The Greek tax office: Χανδρή 1 & Θεσσαλονίκης, Τ.Κ. 18346, Αθήνα. A Russian parcel with a Cyrillic address label. A Japanese address: 東京都千代田区丸の内1-1-1. A Chinese address: 北京市东城区东长安街1号. An Arabic address: شارع الملك فهد, الرياض.

Rule-based parsers that operate on ASCII character ranges ([a-zA-Z] for street names, \d for numbers) are blind to non-Latin scripts. The tokenizer must handle multi-byte characters, the classifier must recognize script-specific patterns, and the resolver must match names in the script they were written in.

Mailwoman's SentencePiece tokenizer uses byte_fallback=true — when a character is not in the trained vocabulary, the tokenizer falls back to byte-level encoding. This means the tokenizer can handle any Unicode character, even scripts it was not explicitly trained on. The A1 tokenizer (v0.5.0) halved the byte-fallback rate from 36.7% to 18.2% by training on a corpus that includes CJK, Cyrillic, Hangul, Han, and Armenian scripts.

"Addresses follow a fixed ordering."

The United States uses most-specific-to-least-specific: [number] [street] [city] [state] [zip]. Japan uses least-specific-to-most-specific: [zip] [prefecture] [city] [ward] [district] [block] [lot]. Hungary uses three different orders depending on how many lines are available:

  • One line: {zip} {town}, {street} {buildingNr}
  • Multiple lines: {street} {buildingNr}. {zip} {town}, {country}
  • Envelope: {street} {buildingNr}. {town} {zip} {country}

The same address, three different written orders. A parser that assumes a single ordering will successfully parse one form and fail on the other two.

European addresses often place the building number after the street name: République 12 (France), Hauptstraße 5 (Germany), Kornetintie 6 A II krs (Finland). The Anglophone pattern of number-before-street is the exception.

"Addresses don't contain special characters."

St. Judes & St. Pauls C of E (Va) Primary School, 10 Kingsbury Road, London, N1 4AZ contains an ampersand, parentheses, a comma, and a period. 1 Acre View, Bo'ness, EH51 9RQ contains an apostrophe. 1 Highview Terrace, Westward Ho!, Bideford, EX39 1AQ contains an exclamation mark (omitted in some databases). Flat 1.4, Ziggurat Building, 60-66 Saffron Hill, London, EC1N 8QX contains a decimal point in the flat number.

Kirkland, Lane, Mathias & Perry, North Muskham Prebend Church Street, Southwell, NG25 0HQ contains two commas and an ampersand — in the organization name, not as address separators.

The Hague's official Dutch name is 's-Gravenhage — starting with an apostrophe-s and containing a hyphen. A parser that strips punctuation from the start of tokens will reduce 's-Gravenhage to s-Gravenhage — a different name.

"A single address uses a single character set."

International mail often includes the destination country in both the source and destination character sets, so postal workers at both ends can read it. A parcel from Japan to France might have France written in both Latin and Japanese scripts. A parser that expects a single character set per address will see two country names at different positions in different scripts.

"Written addresses don't change."

Douglas Perreault's condo in Florida changed address three times: 14100 N 46th St., Alpha 39, Lutz, FL 33549Tampa, FL 33612 (new post office) → FL 33613 (ZIP Code change) → 14410 Hanging Moss Circle, #101 (block renaming). Same physical location, four different written addresses over a few years.

Cities, streets, and entire countries get renamed. The county of Gwent, UK no longer exists. Lenin Street became something else across Eastern Europe. Territory disputes change which administrative region an address is officially in — sometimes without changing the building's physical location.

"An address fits in 5 lines."

Department For Environment Food & Rural Affairs (D E F R A), State Veterinary Service, Animal Health Office, Hadrian House, Wavell Drive, Rosehill Industrial Estate, Carlisle, CA1 2TB, United Kingdom needs 8 lines. Addresses that include organization names, department names, and industrial-park locations routinely exceed the 5-line assumption built into address forms.

"Street names are short."

The longest street name in Germany: Bischöflich-Geistlicher-Rat-Josef-Zinnbauer-Straße in Dingolfing, Bavaria. An 89-character street name in Bihać, Bosnia: Aleja Alije Izetbegovića Prvig Predsjednika Predsjedništva Republika Bosna i Hercegovina. A form field with a 50-character limit on street names rejects real addresses.

"Real place names won't contain rude words."

Middlesex, Scunthorpe, Penistone, and many others are real English place names. Address validators that filter for profanity reject real places. The Scunthorpe problem (named for the town whose name contains a substring interpreted as profanity) is a classic example of overzealous content filtering.

How traditional geocoders handled these

libpostal handles multi-script addresses through its CRF's character-level features. The model operates on character n-grams, not words, so it can handle any script without script-specific tokenization. libpostal's normalize step handles Unicode normalization (NFD → NFC) and transliteration, but it does not handle mixed-script addresses (different scripts in different components).

Pelias tokenizes on whitespace and punctuation. Non-ASCII characters are preserved in the token stream. Pelias's rule classifiers operate on the token text directly — if a token matches a dictionary entry, it classifies; if not, it moves on. This means Pelias can handle non-ASCII names that appear in its dictionaries (WOF gazetteer entries, OpenStreetMap tags) but not names in scripts that are not well-represented in open data.

Google's API handles multi-script, multi-lingual addresses comprehensively. Google's model is trained on global address data and can parse addresses in dozens of scripts. The trade-off is that the model is proprietary — you cannot inspect how it handles a specific script or format.

What the neural approach changes

The normalizer (Stage 1) handles Unicode normalization (NFD → NFC), case folding, and whitespace collapsing. The normalizer operates on bytes, not semantics — it cleans the input for downstream stages without making classification decisions.

The tokenizer uses SentencePiece with byte-fallback. Characters in the trained vocabulary (16K for A0, 48K for A1) produce subword tokens with known embeddings. Characters outside the vocabulary fall back to byte-level encoding — the model sees raw bytes and learns byte-level patterns. This means the tokenizer can handle any Unicode character, even scripts that were not in the training data, without crashing or producing garbage.

The locale gate (Stage 2) detects the script family before classification. CJK characters → ja-JP, Cyrillic → ru, Arabic → ar. The locale detection biases the classifier's expectations: a Japanese-formatted address triggers Japanese ordering expectations, a Cyrillic address triggers Russian postcode patterns. The classifier does not need to handle every script's ordering simultaneously.

The phrase grouper (Stage 2.7) uses structural cues (punctuation, whitespace, capitalization) that are script-agnostic. An apostrophe in Bo'ness does not confuse the grouper — it proposes Bo'ness as a single span because the apostrophe is word-internal punctuation, not a token boundary. An exclamation mark in Westward Ho! is similarly treated as part of the token, not a separator.

The classifier (Stage 3) learns from the corpus: tokens appearing in address contexts with certain structural features (position, surrounding tokens, phrase grouper kind hypothesis) are classified based on learned distributions, not dictionary matches. A Cyrillic street name is classified as street because the training data contains Cyrillic street names in street positions — not because a Cyrillic dictionary exists.

What Mailwoman still can't do

  • Mixed-script addresses. An address like Moscow, ул. Тверская 12 — English city, Cyrillic street — requires the model to handle two scripts in one input. The tokenizer can handle both scripts, but the model's per-token embeddings may not have strong representations for tokens in scripts that are sparse in the training corpus.
  • Character encoding recovery. A parcel label where Cyrillic was mangled by the wrong character encoding (latin1 instead of UTF-8) produces garbage text. A Russian postal worker might reverse the mapping and deliver the parcel. A parser cannot — this is a pre-parsing input-quality problem.
  • Addresses that have changed since the corpus was built. The model learns from the training data's address formats. If an administrative region was renamed after the corpus freeze, the model will not recognize the new name — it will classify it based on structural features (position, surrounding tokens) rather than semantic knowledge. This is acceptable because the model's job is structural parsing, not gazetteer lookup.

References