Glossary
Every technical term used in this documentation, defined once. Skim it before reading the concepts/ track, or come back when a word is unfamiliar.
Aโ
Alignment โ the step in the corpus pipeline that takes a (raw, components) pair from an adapter and produces a (raw, tokens, BIO labels) row by finding each component's text inside the raw string and labelling the matching tokens. See concepts/corpus-construction.md.
Attention โ the core mechanism inside a transformer encoder. Each token's representation is updated by looking at every other token, with learned weights deciding how much each one matters. Read more in concepts/neural-classification.md.
Adversarial corpus โ eval entries chosen specifically to break the parser. Mailwoman's adversarial set covers cases like "Buffalo Buffalo" (a venue named after a city), "St. Petersburg" (multi-word locality), and prefix-honorific names.
Bโ
BIO labels โ a labelling scheme where each token gets one of B-X (beginning of an X span), I-X (inside an X span), or O (outside any span). The 21-label vocabulary in the current model is O + 10 components ร {B-, I-}. See concepts/bio-labels.md.
bf16 / bfloat16 โ a 16-bit floating point format used during training. Half the memory of fp32 with similar numeric range. Required to fit the training batch on the lab's GPU.
Cโ
CRF โ Conditional Random Field โ a structured prediction layer that sits on top of per-token model outputs. The CRF rejects label sequences that violate structural rules (e.g. O followed by I-locality is invalid because an I- must follow a matching B- or I-). See concepts/crf-decoder.md.
Calibration โ how well the model's reported confidence matches its actual accuracy. A well-calibrated model that says "90% confident" should be right 90% of the time. The eval reports include calibration buckets.
Canonicalization โ mapping a span to its canonical value: "St" and "Street" both become the USPS suffix ST; "USA" and "United States" both become ISO US. The operation that handles synonymy. Canonical values come from @mailwoman/codex, not a hand-grown alias list. See concepts/synonymy-and-homonymy.md.
Cartographer โ Mailwoman's mapping utilities. Composes MapLibre style specifications + builds vector source records for the demo's protomaps basemap.
Checkpoint โ a saved snapshot of the model weights + optimizer state during training. Mailwoman saves a checkpoint every 100 steps so training can resume after the GPU's periodic firmware hang.
Classification proposal โ the shared shape that every classifier (rule or neural) writes. The solver consumes proposals without knowing which kind of classifier produced them.
Component / ComponentTag โ one type of address part: country, region, locality, postcode, street, house_number, venue, etc. The full union is in core/types/component.ts.
Corpus โ the curated dataset used for training. The current corpus is corpus-v0.3.0 with 677 million aligned rows.
Dโ
Decoder โ in a transformer encoder-decoder model, the part that produces output sequences. Mailwoman's classifier is encoder-only (no decoder); the "CRF decoder" is a different thing โ a structured-prediction layer that picks the best label sequence from the encoder's outputs.
Eโ
Encoder โ the part of a transformer that turns input tokens into contextualized vector representations. Mailwoman's encoder has 6 layers, 4 attention heads, 256 hidden dimensions. See concepts/neural-classification.md.
Eval / Evaluation โ running the model against a held-out golden dataset and computing per-component F1, exact-match, calibration, etc. The latest report is the Tier 2 step-001800 eval (filename preserves the historical "stage2" naming).
Fโ
F1 score โ the harmonic mean of precision and recall, a single number between 0 and 1 summarizing how good a classifier is at a class. F1 = 2ยทPยทR / (P + R).
Fine label โ in Mailwoman's vocabulary, the three Tier 2 labels added in v3.0.0: venue, street, house_number. As opposed to coarse labels (country, region, locality, postcode).
fp32 / fp16 โ 32-bit and 16-bit floating point formats. Mailwoman trains in bf16 (a 16-bit variant) and exports the ONNX model in int8 for size.
Gโ
Gazetteer โ a database of named places (countries, regions, cities, neighbourhoods) with their canonical names, coordinates, and IDs. Mailwoman uses Who's On First as its gazetteer.
Gazetteer anchor โ an input-layer feature channel that attaches a per-token candidate-tag set from the gazetteer/codex (e.g. "this surface is a known country-and-region") so the model conditions on lexicon membership without being overruled by it. The knowledge lives outside the weights โ extend the gazetteer, no retrain. See the knowledge ladder, rung 3.2.
Golden set โ a hand-labelled evaluation dataset. The current golden set is v0.1.2 with 4,535 entries (US + FR + adversarial).
gradient clipping โ a training trick: if the gradient norm exceeds a threshold, scale it down. Stops a single bad batch from blowing up the model weights.
Hโ
Homonymy / homograph โ one surface, many referents: "Georgia" is a US state in Atlanta, Georgia and a country in Tbilisi, Georgia. Handled by disambiguation, split across two stages: the parser resolves the tag from in-string context; the resolver late-binds the referent with geographic context (concordance). See concepts/synonymy-and-homonymy.md.
Iโ
Iteration log โ the running record of each model release. Lives in plan/phases/PHASE_2_training.md and is the canonical "what shipped when".
int8 quantization โ converting model weights from 32-bit float to 8-bit integer. Shrinks the model file ~4x with usually minor accuracy loss.
Lโ
Label smoothing โ a training regularization technique. Instead of training to put 1.0 probability on the correct label, train to put 1 - ฮต on the correct label and ฮต / (N-1) on each wrong label. Improves calibration. Mailwoman v3.0.0 disabled it (label_smoothing = 0) for stability reasons.
libpostal โ an open-source C address parser used by Pelias. Replaced by Mailwoman v1.
locale โ the combination of language + country that an address comes from. en-US and fr-FR are the locales Mailwoman ships weights for.
Mโ
macro F1 โ the unweighted average of per-class F1 scores. Treats every class equally. Mailwoman's primary eval metric.
Mermaid โ a markdown-friendly diagram syntax. Used in these docs for flowcharts.
Nโ
NAD โ National Address Database โ a US Department of Transportation dataset of structured address points. Added to corpus-v0.3.0 as usgov-nad, contributing 57.9 million rows.
Neural classifier โ Mailwoman's transformer-based classifier (the model). Ships as @mailwoman/neural + per-locale weight packages.
NLL โ Negative Log Likelihood โ the standard form of a loss function in probability-based models. loss = -log P(correct_label | input). The CRF uses NLL of the entire sequence.
Oโ
ONNX โ Open Neural Network Exchange โ a portable model file format. Mailwoman exports the trained PyTorch model to ONNX so it can be loaded by onnxruntime-node (server) and onnxruntime-web (browser) with the same file. See concepts/onnx-runtime.md.
Orphan-I โ a label sequence bug where I-X appears without a matching preceding B-X (e.g. O, I-locality). Structurally invalid in BIO. The CRF prevents this. See concepts/bio-labels.md.
Pโ
Pelias โ an open-source geocoder, Mailwoman's spiritual predecessor. See From Pelias to Mailwoman.
Phase โ a milestone in the implementation plan (plan/README.md). Phases 0โ3 shipped (Foundation โ Corpus โ Training โ Integration), Phases 4โ6 are forward-looking. Distinct from Stage (runtime pipeline) and Tier (model vocabulary).
Per-token argmax โ picking the highest-probability label at each position independently. Fast and simple but can produce structurally invalid sequences. The opposite of Viterbi.
Postcode โ the country-specific postal code (US ZIP, FR code postal, etc.). Mailwoman handles postcode parsing entirely by rule classifier โ it is a regex problem, not an ML problem.
Policy registry โ the per-component table that decides which classifier (rule or neural) has authority for each address component. The Ship-of-Theseus dial.
Rโ
Record linkage โ deciding whether two rows from two datasets describe the same real-world entity. The reason a parse must be a lossless typed decomposition: two records match when their canonicalized span-sets agree, regardless of surface. A geocoder that silently drops unmatched spans cannot do this. See concepts/synonymy-and-homonymy.md.
Rule classifier โ a hand-written piece of code that labels tokens by pattern matching. The Mailwoman v1 approach. See concepts/rule-based-classifiers.md.
Resolver โ the post-parse step that takes labelled components and looks them up in a gazetteer to return coordinates. See concepts/resolver-and-wof.md.
Sโ
SentencePiece โ Google's subword tokenizer library. Mailwoman uses it to split input text into a fixed 16,000-piece vocabulary that handles both English and French (and falls back to byte-level for everything else). See concepts/tokenization.md.
Shallow fusion โ blending an external knowledge source (a gazetteer, an FST prior) into a model's decision as a signal, rather than retraining the model or overriding its output. Mailwoman applies it at two layers: the FST prior biases emissions at decode time; the gazetteer anchor feeds membership features at the input layer. RAG-shaped, but the retrieval lands as feature vectors, not prompt text.
Ship of Theseus โ the migration pattern Mailwoman uses: replace rule classifiers with neural one component at a time, only when metrics justify. Name from the philosophical thought experiment.
Soft prior โ outside knowledge fed to the model or resolver as overridable evidence (a feature, a score term) rather than a hard filter. A focus-country hint becomes an anchor feature; a focus-point becomes a ranking term. The opposite of a positional mask, which constrains what the system may consider. See concepts/synonymy-and-homonymy.md.
Synonymy โ many surfaces, one referent: "123 N Main St" and "123 North Main Street" are the same place. Handled by canonicalization, never by guessing which spelling is right. The dual of homonymy. See concepts/synonymy-and-homonymy.md.
Shard โ a partial output file of the corpus build, written in Parquet format. The training pipeline streams shards row-by-row.
SQLite-wasm โ an open-source build of the SQLite library compiled to WebAssembly. Mailwoman uses it to run the WOF resolver inside the browser.
Span โ a contiguous range of characters in the input string. The basic unit of address parsing.
Stage (runtime) โ one of the six dataflow stages in the runtime pipeline (the-staged-pipeline): Normalize, Locale gate, Kind classify, Phrase group, Token classify, Sequence correct, Reconcile, Resolve. Distinct from Tier (vocabulary expansion across model versions).
Tier 1 / Tier 2 / Tier 3 โ internal versioning of which label classes the model emits. Tier 1 is the coarse components (country, region, locality, postcode). Tier 2 adds venue, street, house_number. Tier 3 (future) would add attention, po_box, and POI venue subtyping. Historically called "Stage 1/2/3" before the runtime-pipeline naming made that ambiguous (renamed 2026-05-22); some shipped artifacts (npm v3.0.0 model card, stage2-step-001800-eval.md) preserve the old name.
Tโ
Token โ one word or subword in the tokenized input. For the neural classifier, tokens come from SentencePiece (subword units). For the rule classifiers, tokens are whitespace-and-punctuation-separated words.
Tokenizer โ the component that splits an input string into tokens. The neural classifier ships its own (SentencePiece, learned from the corpus); the rule classifiers use Mailwoman's hand-written word tokenizer.
Thread โ a parallel workstream within a release. v0.5.0 was organized as six threads (AโF): tokenizer retrain, kryptonite catalogue, classifier code path, reconcile, phrase grouper, verdict-smoke discipline. Threads compose; they are not sequential milestones like Phases.
Transformer โ a neural network architecture invented in 2017 ("Attention Is All You Need"). The basis of every modern NLP model from BERT to GPT-4. Mailwoman uses a small, encoder-only transformer. See concepts/neural-classification.md.
TIGER โ US Census topographically-integrated geographic database. Used as a corpus source for street-segment data.
Vโ
Viterbi โ an algorithm that finds the highest-probability label sequence under a CRF's transition constraints. The CRF's "decode" operation. Replaces the per-token argmax with something structurally valid.
Wโ
WOF โ Who's On First โ an open gazetteer of named places, maintained by SFO Museum. Mailwoman uses a SQLite distribution as its primary gazetteer. See concepts/resolver-and-wof.md.
Warmup โ the early phase of training where the learning rate ramps up from 0 to its peak value. Mailwoman uses 1,000 steps of linear warmup, then cosine decay.
Weights package โ the npm package that ships the trained model file: @mailwoman/neural-weights-en-us, @mailwoman/neural-weights-fr-fr. Versioned independently from the runtime.