
How it used to work: Mailwoman v1

Mailwoman v1 (the pre-2026 version, still living on as the rule classifiers inside v2) parses an address in four steps. This article walks through each one with a concrete example.

The input we will use:

West 26th Street, New York, NYC, 10010

Step 1: Tokenization

The input string is split into tokens: single words or punctuation marks. Tokenization is more than string.split(" ") because Mailwoman keeps track of where each token came from in the original string.

Each token carries metadata: its character position in the original string, what kind of separator came before it (comma, space, tab, newline), and what other tokens it is grouped with. The grouping idea (Mailwoman calls these sections) is important: tokens that are separated by commas usually belong to different address components, so the solver treats them differently.
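The metadata described above can be sketched as a small tokenizer. This is a minimal illustration, not Mailwoman's actual code: the names TokenSketch, SeparatorSketch, separatorOf and tokenizeSketch are hypothetical, and the sketch treats commas purely as separators instead of emitting them as punctuation tokens.

```typescript
// Hypothetical token shape: field names are illustrative, not Mailwoman's actual API.
type SeparatorSketch = "start" | "comma" | "space" | "tab" | "newline";

interface TokenSketch {
  text: string;
  start: number; // offset of the first character in the original input
  end: number; // offset one past the last character
  separatorBefore: SeparatorSketch;
  section: number; // index of the comma-delimited group this token belongs to
}

// Name the strongest separator found in the gap between two tokens.
function separatorOf(gap: string): SeparatorSketch {
  if (gap.indexOf(",") !== -1) return "comma";
  if (gap.indexOf("\n") !== -1) return "newline";
  if (gap.indexOf("\t") !== -1) return "tab";
  return "space";
}

function tokenizeSketch(input: string): TokenSketch[] {
  const tokens: TokenSketch[] = [];
  const wordPattern = /[^\s,]+/g; // runs of characters that are neither whitespace nor commas
  let section = 0;
  let previousEnd = 0;
  let match: RegExpExecArray | null = wordPattern.exec(input);
  while (match !== null) {
    const text = match[0];
    const start = match.index;
    const separatorBefore: SeparatorSketch =
      tokens.length === 0 ? "start" : separatorOf(input.slice(previousEnd, start));
    if (separatorBefore === "comma") section += 1; // a comma opens a new section
    previousEnd = start + text.length;
    tokens.push({ text, start, end: previousEnd, separatorBefore, section });
    match = wordPattern.exec(input);
  }
  return tokens;
}
```

Run on the example input, this sketch yields seven tokens in four sections: West 26th Street, then New York, then NYC, then 10010. Because every token keeps its start and end offsets, input.slice(token.start, token.end) always recovers the original text.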

Tokenization is also the place where pre-processing happens: lowercase normalization, abbreviation expansion (St. → Street), and accent handling for non-English input.
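As a rough sketch of that pre-processing, the function below lowercases a token, strips diacritics, and expands a few abbreviations. Everything here is an assumption made for illustration: the abbreviation table and the name normalizeTokenSketch are hypothetical, and stripping accents is only one possible form of accent handling. A real table would also need context, since "St." can mean "Saint" as well as "Street".

```typescript
// Hypothetical abbreviation table; real coverage would be larger and locale-specific.
const ABBREVIATIONS_SKETCH = new Map<string, string>([
  ["st", "street"],
  ["ave", "avenue"],
  ["blvd", "boulevard"],
]);

function normalizeTokenSketch(raw: string): string {
  const lowered = raw.toLowerCase();
  // NFD splits an accented letter into a base letter plus combining marks;
  // removing the marks (U+0300 to U+036F) leaves the unaccented letter.
  const unaccented = lowered.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
  // Look the token up without its trailing period, so "st." and "st" both match.
  const key = unaccented.replace(/\.$/, "");
  const expanded = ABBREVIATIONS_SKETCH.get(key);
  return expanded !== undefined ? expanded : unaccented;
}
```

With this table, normalizeTokenSketch("St.") returns "street", and a token with no entry, such as "West", comes back lowercased and otherwise unchanged.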

Deep dive: concepts/tokenization.md.

Step 2: Rule classifiers vote

Each token is shown to every rule classifier in parallel. A rule classifier is a small piece of code that tries to label a token as one address component type.

Examples of rule classifiers in Mailwoman v1:

  • house_number: does this token look like a number (with optional letter suffix like 123A)?
  • postcode: does this token match a country-specific postcode pattern? (10010 matches the US 5-digit pattern; 75008 matches the FR 5-digit pattern.)
  • street_prefix: is this token a known direction (North, West, SE) or a known street-type prefix (Avenue, Rue, Boulevard)?
  • whos_on_first: is this token (or this short phrase) found in the Who's On First gazetteer as the name of a country, region, locality, or neighbourhood?

Each classifier produces zero or more classifications for each token. A classification is a triple:

{ component: "postcode", confidence: 0.95, source: "rule:postcode" }
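To make the triple concrete, here is a minimal sketch of two rule classifiers written as plain functions that return zero or more classifications. The type and function names are hypothetical, the regular expressions are simplified guesses at what such rules check, and the confidence values are illustrative only.

```typescript
// Hypothetical shape of the classification triple described above.
interface ClassificationSketch {
  component: string;
  confidence: number;
  source: string;
}

// A rule looks at one token and returns zero or more classifications.
type RuleSketch = (tokenText: string) => ClassificationSketch[];

// Digits with an optional single-letter suffix, such as 123 or 123A.
const houseNumberRuleSketch: RuleSketch = (tokenText) =>
  /^\d+[A-Za-z]?$/.test(tokenText)
    ? [{ component: "house_number", confidence: 0.6, source: "rule:house_number" }]
    : [];

// Country-specific patterns; only the two five-digit cases from the text are listed.
const POSTCODE_PATTERNS_SKETCH: Array<[string, RegExp]> = [
  ["US", /^\d{5}$/],
  ["FR", /^\d{5}$/],
];

const postcodeRuleSketch: RuleSketch = (tokenText) =>
  POSTCODE_PATTERNS_SKETCH.some((entry) => entry[1].test(tokenText))
    ? [{ component: "postcode", confidence: 0.95, source: "rule:postcode" }]
    : [];
```

Note that 10010 satisfies both rules in this sketch: it looks like a house number and like a postcode. Each rule reports what it sees without knowing about the other.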

A single token can collect multiple classifications. For our example, New might collect:

  • { component: "locality", confidence: 0.4, source: "rule:whos_on_first" } (because "New" alone matches a few WOF entries)
  • { component: "street_prefix", confidence: 0.1, source: "rule:street_prefix" } (because "New" sometimes precedes a street name)

This is intentional: rule classifiers are allowed to be uncertain and contradictory. The next step resolves the contradiction.
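The voting step can be sketched as a loop that shows every token to every rule and keeps all the answers, contradictions included. The two toy rules below are hypothetical stand-ins that hard-code the "New" example from above; they are not how the real whos_on_first or street_prefix classifiers work. The sketch also runs the rules one after another, whereas the text describes them as running in parallel.

```typescript
interface VoteSketch {
  component: string;
  confidence: number;
  source: string;
}

type VotingRuleSketch = (tokenText: string) => VoteSketch[];

// Toy stand-in for a gazetteer lookup: only knows the token "New".
const toyGazetteerRule: VotingRuleSketch = (tokenText) =>
  tokenText === "New"
    ? [{ component: "locality", confidence: 0.4, source: "rule:whos_on_first" }]
    : [];

// Toy stand-in for the street prefix rule: knows "West" and, weakly, "New".
const toyStreetPrefixRule: VotingRuleSketch = (tokenText) => {
  if (tokenText === "West") {
    return [{ component: "street_prefix", confidence: 0.85, source: "rule:street_prefix" }];
  }
  if (tokenText === "New") {
    return [{ component: "street_prefix", confidence: 0.1, source: "rule:street_prefix" }];
  }
  return [];
};

// Show every token to every rule; no rule sees another rule's output.
function collectVotesSketch(
  tokenTexts: string[],
  rules: VotingRuleSketch[],
): Array<{ tokenText: string; votes: VoteSketch[] }> {
  return tokenTexts.map((tokenText) => {
    const votes: VoteSketch[] = [];
    for (const rule of rules) {
      for (const vote of rule(tokenText)) votes.push(vote);
    }
    return { tokenText, votes };
  });
}
```

Calling collectVotesSketch(["West", "New", "York"], [toyGazetteerRule, toyStreetPrefixRule]) leaves "New" holding two competing votes and "York" holding none, which is exactly the kind of unresolved state the solver receives.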

Deep dive: concepts/rule-based-classifiers.md.

Step 3: The solver picks a winning combination

The solver looks at all the classifications produced for all the tokens and tries to find a self-consistent interpretation. "Self-consistent" means:

  • Each address component appears at most once (you can have one locality, not two).
  • Components do not overlap on the same tokens.
  • The combination obeys soft preferences (a US-style postcode after a region is more likely than a postcode at the start, for example).

Mailwoman v1's solver is an ExclusiveCartesianSolver with filters and augmenters. In plain terms: it generates every plausible combination, filters out the ones that violate the hard rules, scores the rest, and returns them ranked.
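The generate, filter, score, rank loop can be illustrated with a brute-force sketch. This is not the real ExclusiveCartesianSolver: it simply enumerates every subset of candidate spans, drops subsets that break the two hard rules, and scores the rest by summing confidences. Soft preferences, filters and augmenters are left out, and all names are hypothetical.

```typescript
interface SpanCandidateSketch {
  component: string;
  value: string;
  firstToken: number; // index of the first token covered, inclusive
  lastToken: number; // index of the last token covered, inclusive
  confidence: number;
}

interface RankedSolutionSketch {
  parts: SpanCandidateSketch[];
  score: number;
}

// Hard rules: no component twice, and no two spans sharing a token.
function violatesHardRulesSketch(combo: SpanCandidateSketch[]): boolean {
  return combo.some((a, i) =>
    combo.some(
      (b, j) =>
        j > i &&
        (a.component === b.component ||
          (a.firstToken <= b.lastToken && b.firstToken <= a.lastToken)),
    ),
  );
}

function solveSketch(candidates: SpanCandidateSketch[]): RankedSolutionSketch[] {
  const solutions: RankedSolutionSketch[] = [];
  const subsetCount = 1 << candidates.length; // exponential: only for a handful of candidates
  for (let mask = 1; mask < subsetCount; mask++) {
    // Bit i of the mask decides whether candidate i is part of this combination.
    const combo = candidates.filter((_, index) => (mask & (1 << index)) !== 0);
    if (violatesHardRulesSketch(combo)) continue;
    // Assumed scoring: the plain sum of confidences.
    const score = combo.reduce((sum, part) => sum + part.confidence, 0);
    solutions.push({ parts: combo, score });
  }
  return solutions.sort((x, y) => y.score - x.score); // best first
}
```

Exhaustive enumeration grows exponentially with the number of candidates, so it only suits a toy input. It is used here because it makes the filter and ranking stages easy to see.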

For our example, a top-ranked output looks like:

[
  { "component": "street_prefix", "value": "West", "confidence": 0.85 },
  { "component": "house_number", "value": "26th", "confidence": 0.6 },
  { "component": "street", "value": "Street", "confidence": 0.4 },
  { "component": "locality", "value": "New York", "confidence": 0.9 },
  { "component": "locality", "value": "NYC", "confidence": 0.7 },
  { "component": "postcode", "value": "10010", "confidence": 1.0 }
]

Notice the two locality candidates. The solver returns multiple ranked solutions; the consumer (the CLI, the API) picks the top one or shows all of them.

Step 4: Resolve

Parsing answered "what kind of thing is each part of this string?". Resolving answers "where is the resulting place?". The parsed components are looked up in a gazetteer (Who's On First, in Mailwoman's case), and the gazetteer returns coordinates, a stable place ID, and optionally a bounding box.
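A toy version of that lookup might look like the sketch below. It assumes an in-memory table standing in for the gazetteer. The place ID is a placeholder, not a real Who's On First ID, the coordinates are only approximate values for New York City, and all names are hypothetical.

```typescript
interface ParsedComponentSketch {
  component: string;
  value: string;
}

interface ResolvedPlaceSketch {
  placeId: string;
  latitude: number;
  longitude: number;
  boundingBox?: [number, number, number, number]; // optional, as in the text
}

// Toy stand-in for the gazetteer, keyed by "component:lowercased value".
const TOY_GAZETTEER_SKETCH = new Map<string, ResolvedPlaceSketch>([
  // Placeholder ID and approximate coordinates, for illustration only.
  ["locality:new york", { placeId: "toy-id-1", latitude: 40.7128, longitude: -74.006 }],
]);

// Return the first parsed component that the toy gazetteer knows about.
function resolveSketch(components: ParsedComponentSketch[]): ResolvedPlaceSketch | undefined {
  for (const part of components) {
    const hit = TOY_GAZETTEER_SKETCH.get(part.component + ":" + part.value.toLowerCase());
    if (hit !== undefined) return hit;
  }
  return undefined;
}
```

The function takes parsed components as input and knows nothing about tokens or classifiers, which mirrors the separation between parser and resolver.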

The resolver is a separate concern from the parser. Read about it in concepts/resolver-and-wof.md.

What this approach is good at

  • Determinism. A rule classifier produces the same answer on the same input every time. No retraining, no randomness.
  • Explainability. When the parser is wrong, you can read the rule and see why.
  • Fast iteration on a single bug. "The postcode pattern misses Canadian H1A-X9X" is a one-line code change.

What this approach is bad at

  • The long tail. Every new address shape needs a new rule or a tweak to an existing one. The list of rules grows forever.
  • Words that look like multiple things. "Buffalo" is a US locality, a venue name (Buffalo Wild Wings), and an animal. Rules can declare all three; only data can rank them.
  • Multi-word components. "Saint Petersburg" is one locality, not two. A rule that recognises common multi-word names is brittle.
  • Languages other than English. Every locale needs its own rule set, hand-written, by someone who reads that language well.

These weaknesses are why Mailwoman v2 brought in the neural classifier. Continue with how-it-works-now.md.