Normalize to match
The simplest geocoder doesn't parse addresses at all. It normalizes them — strips punctuation, lowercases, expands abbreviations — and matches the result against a known database of addresses. The "parser" is a fuzzy string matcher. The "resolver" is a hash table lookup.
The approach
You have a database of known addresses — your customers, your delivery points, your properties. Each entry has a canonical form: 123 Main St, Springfield, IL 62701. When a user types 123 main street springfield illinois 62701, you don't parse it. You normalize both strings to a common form and compare.
The normalization pipeline:
- Lowercase everything. Springfield and springfield are the same.
- Strip punctuation. St. and St are the same. If you only care about the 5-digit ZIP, also drop the ZIP+4 suffix so that 62701-1234 and 62701 are the same.
- Expand abbreviations. St → Street, IL → Illinois, Ave → Avenue. This makes 123 Main St match 123 Main Street.
- Remove noise tokens. Apartment numbers, "Attn:" lines, floor numbers: if your database doesn't have them, strip them from the input.
- Fuzzy match. After normalization, compute edit distance or Jaccard similarity between the input and each candidate. Return the best match above a threshold.
This is not parsing. It is not understanding address structure. It is string normalization plus similarity matching. It works surprisingly well when your universe of possible addresses is bounded.
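The pipeline above can be sketched in a few dozen lines. This is a minimal illustration, not a reference implementation: the abbreviation table, the function names (normalize, jaccard, best_match), and the 0.8 threshold are all assumptions chosen for the example, and the noise-token step is omitted for brevity.

```python
import re
from typing import Dict, Optional, Tuple

# Illustrative table only; a real one would be far larger.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "il": "illinois"}


def normalize(address: str) -> str:
    """Lowercase, trim ZIP+4, strip punctuation, expand abbreviations."""
    text = address.lower()
    # Reduce ZIP+4 to the 5-digit ZIP before the hyphen is stripped.
    text = re.sub(r"\b(\d{5})-\d{4}\b", r"\1", text)
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated token sets."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def best_match(
    query: str, known: Dict[int, str], threshold: float = 0.8
) -> Optional[Tuple[int, float]]:
    """Return (address_id, score) for the best candidate, or None below threshold."""
    normalized_query = normalize(query)
    scored = [
        (jaccard(normalized_query, normalize(addr)), addr_id)
        for addr_id, addr in known.items()
    ]
    if not scored:
        return None
    score, addr_id = max(scored)
    return (addr_id, score) if score >= threshold else None


known_addresses = {
    4572: "123 Main St, Springfield, IL 62701",
    4573: "9 Oak Ave, Peoria, IL 61602",
}
print(best_match("123 main street springfield illinois 62701", known_addresses))
# prints (4572, 1.0)
```

Token-level Jaccard treats a misspelled word as a complete miss. If typos inside tokens matter, a character-level measure such as edit distance, or the ratio from the standard library's difflib.SequenceMatcher, may be a better fit for the scoring step.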
When it works
- You have a known address database. A logistics company with 10,000 delivery points. A utility company with 500,000 service addresses. A retailer with a customer address book. The universe is finite and you control it.
- Your input is messy but recognizable. Customers typing their own addresses make spelling errors, use abbreviations, omit ZIP+4, add apartment numbers. Normalization absorbs these variations.
- You don't need component-level output. You don't need to know which token is the street and which is the city. You just need to know "this input matches address ID #4572."
- Your addresses are in one country. US addresses have a small set of standard abbreviations (USPS Pub 28 defines them all). Expanding St → Street and IL → Illinois covers the common cases. International addresses have no such standard abbreviation table.
- You need to ship today. Normalization is a hundred lines of code. No training data, no model, no gazetteer. Ship in an afternoon.
What you lose
- Any address not in your database. A new customer, a new delivery point, a one-time destination — the normalizer can only match against known entries. If the address is not in the database, the normalizer returns nothing.
- Ambiguity between similar addresses. 123 Main St, Springfield, IL and 123 Main St, Springfield, MA normalize to nearly identical strings. If both are in your database, the fuzzy matcher picks whichever candidate scores higher, which may be the wrong one.
- International addresses. French rue de la République abbreviates nothing like US Republic St. UK postcodes (SW1A 1AA) don't normalize like US ZIP codes. The abbreviation table is per-country and grows without bound.
- New construction. A building built last month is not in your database. A customer who moved last week is at an address you don't have. The normalizer returns nothing for addresses that didn't exist when the database was built.
- Structural errors. 123 Main St, Springfield and 123 Springfield St, Main normalize to similar strings because "Main" and "Springfield" appear in both. The fuzzy matcher doesn't know that the street and city are different fields; it just sees word overlap.
- No confidence signal. The fuzzy matcher returns a similarity score, not a confidence. A 0.85 similarity might mean "this is the right address with a typo" or "this is a different address in the same city." The downstream system cannot distinguish.
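The structural-error and confidence problems in the list above are easy to reproduce. The sketch below assumes token-set Jaccard as the similarity measure (one of the two options named earlier); token_jaccard is a hypothetical helper written for this example, and the inputs are already normalized.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase, whitespace-separated token sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Structural error: street and city swapped, yet the token sets are identical.
swapped = token_jaccard(
    "123 main street springfield",
    "123 springfield street main",
)

# Right address, one typo in the city name.
typo = token_jaccard(
    "123 main street springfeild illinois",
    "123 main street springfield illinois",
)

# Different address: same street and city, different state.
other_state = token_jaccard(
    "123 main street springfield illinois",
    "123 main street springfield massachusetts",
)

print(swapped)              # 1.0
print(typo == other_state)  # True
```

The swapped pair scores a perfect 1.0, and the typo and the wrong-state candidate score identically (4 shared tokens out of 6), so no threshold on this score can separate them. A character-level measure would shift the numbers but still has no notion of which field a token belongs to.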
Where Mailwoman fits
Normalize-to-match and Mailwoman's parser are complementary, not competing. A system that normalizes against a known database can use Mailwoman to ingest new addresses into that database:
- A new customer signs up. Their address is not in the database.
- Mailwoman parses 123 Main St, Springfield, IL 62701 into {house_number: 123, street: Main St, locality: Springfield, region: IL, postcode: 62701}.
- The parsed components are normalized (St → Street, IL → Illinois) and stored as a canonical form.
- Future inputs that normalize to the same canonical form match the existing entry.
The parser handles the cold-start problem — adding new addresses to the database. The normalizer handles the hot path — matching subsequent inputs against known entries. This is the architecture behind most address verification services (USPS AMS, SmartyStreets, Melissa Data): a parsing step to normalize the input, a matching step against a known database, and a confidence score for the match quality.
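The split between ingestion and the hot path might look roughly like the sketch below. It does not call Mailwoman: the parsed dictionary is a hand-written stand-in for parser output using the field names from the example above, and AddressIndex, canonical_form, normalize_text, and the abbreviation table are hypothetical names invented for illustration.

```python
from typing import Dict, Optional

# Illustrative abbreviation table and field order; both are assumptions.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "il": "illinois"}
FIELD_ORDER = ("house_number", "street", "locality", "region", "postcode")


def normalize_text(raw: str) -> str:
    """Lowercase, trim periods and commas from tokens, expand abbreviations."""
    tokens = (tok.strip(".,") for tok in raw.lower().split())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens if tok)


def canonical_form(components: Dict[str, str]) -> str:
    """Join parsed components in a fixed order, then normalize."""
    joined = " ".join(components.get(field, "") for field in FIELD_ORDER)
    return normalize_text(joined)


class AddressIndex:
    """Maps canonical forms to address IDs with a plain dictionary."""

    def __init__(self) -> None:
        self._by_canonical: Dict[str, int] = {}

    def ingest(self, address_id: int, components: Dict[str, str]) -> None:
        # Cold start: store a newly parsed address under its canonical form.
        self._by_canonical[canonical_form(components)] = address_id

    def lookup(self, raw: str) -> Optional[int]:
        # Hot path: normalize the raw input and do an exact lookup.
        return self._by_canonical.get(normalize_text(raw))


# Stand-in for parser output; not produced by a real Mailwoman call.
parsed = {
    "house_number": "123",
    "street": "Main St",
    "locality": "Springfield",
    "region": "IL",
    "postcode": "62701",
}

index = AddressIndex()
index.ingest(4572, parsed)
print(index.lookup("123 main street springfield illinois 62701"))  # 4572
print(index.lookup("9 Oak Ave, Peoria, IL 61602"))                 # None
```

The exact lookup only hits when the raw input normalizes to the same token sequence as the stored form. Inputs with typos or reordered fields would miss here, which is where a fuzzy match against the stored canonical forms, or a fresh parse, would take over.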
See also
- Postcode-only geocoding — the simplest geographic approach
- Regex-anchored fields — when you care about a few specific components
- The database fallacy — why no database contains all addresses
- How humans break addresses — the failure modes normalization absorbs