A walkthrough — NY-NY Steakhouse, Houston, TX
The knowledge ladder explains why the v0.5.0 pipeline grew two new information layers (Stage 2.7 phrase grouper, expanded Stage 5 reconcile). This article walks through what they actually do on one concrete input, end-to-end.
Attention and bidirectional context
The neural classifier's encoder is a 6-layer transformer with bidirectional
BIO labels
BIO labelling is the trick that lets a token classifier (which decides one token at a time) emit spans (groups of consecutive tokens that mean one thing together). It is the standard approach for sequence labelling tasks in NLP — Named Entity Recognition, part-of-speech tagging, address parsing.
Confidence calibration
Every span the parser emits carries a number when the model says it's 94% sure, is it actually right 94% of the time?
CRF decoder
A Conditional Random Field (CRF) is a structured-prediction layer that sits on top of a per-token classifier and finds the best whole-sequence assignment of labels, subject to learnable transition rules between consecutive labels.
From Pelias to Mailwoman
A short lineage for readers who already know Pelias.
FST gazetteer prior
The FST (finite-state transducer) gazetteer prior is the structure that lets the neural classifier benefit from everything Who's On First already knows. The neural model knows grammar; the gazetteer knows places. The FST is the bridge — pre-computed at build time so the classifier can consult it at inference time without paying gazetteer-lookup costs per token.
FST priors as shallow fusion
The neural classifier doesn't know what Boston is. It knows that the
How it will work
This article describes where Mailwoman is heading. The work is tracked in GitHub issues and the plan/ directory. Status is current as of May 2026.
How it works now
Mailwoman v2 (current as of May 2026) runs addresses through a six-stage pipeline. Each stage adds one kind of knowledge the stages below it cannot easily derive. Rule classifiers and a neural classifier coexist. A policy registry decides whose vote wins for each address component.
How the model reasons
What kind of computation does mailwoman's neural parser do? When it looks at
Neural classification
The neural classifier is a small transformer encoder that reads an address as a sequence of tokens and emits one BIO label for each token. This article explains what a transformer encoder is, what makes Mailwoman's specific model the shape it is, and what the model is and is not capable of.
ONNX runtime
ONNX (Open Neural Network Exchange) is a standardized format for serializing trained neural networks. ONNX Runtime is the family of inference engines that consume those files. Mailwoman uses ONNX so the same trained model can run in Node.js, browsers, mobile devices, or anywhere else with an ONNX Runtime build.
The knowledge ladder
The staged pipeline is a contract for decomposition by what each layer knows. Every stage is the rightful home of a particular kind of information; pushing work to the wrong stage produces fragile systems that try to learn things from data that they could have looked up, or look up things that they could have learned. This article catalogues the layers, what each one knows, and the two layers we don't ship yet but should.
The tokenization tautology
Traditional address parsers split the input into tokens, classify each token independently, then try to reassemble the pieces into a coherent parse. This sequence contains a structural circularity: you cannot group tokens correctly without knowing their types, and you cannot type them correctly without knowing their groups. The traditional architecture resolves this with heuristics, exceptions, and solver post-processing. The exception pile grows without bound.
Training pipeline
The training pipeline turns raw address data sources into a model file that ships on npm. This article walks through each stage end-to-end. The Corpus construction article digs into the first three stages in more detail.
Viterbi and BIO validity
After the encoder produces per-token emission logits and the FST priors
Why a neural parser?
The first five articles in this track established the problem: