Closed-vocab fields: model-first, with a soft gazetteer anchor
How Mailwoman handles closed-vocabulary address fields β country, po_box,
cedex β and why we resolve them inside the model (with a gazetteer fed in as a
soft feature) rather than overriding the model with a deterministic post-parse lookup.
A first pass (2026-06-09 night shift) concluded these fields should be tagged by a
deterministic post-parse overlay (ClosedVocabTagger, issue #464). That conclusion was
reversed after review: it was reached from an unfair comparison and it contradicts
the project's model-first ethos. This doc is the corrected design record. Pressure-tested
across a two-turn DeepSeek consult; see the rationale below.
The decisionβ
- The model is the authority for every field, including closed-vocab ones. We do not override its output with a lookup table.
- The gazetteer enters as a soft feature, never a substitution β the same principle
as the postcode anchor.
codex's matchers (matchCountry,matchPOBox, the cedex pattern) flip role from override to feature-source. - Starved closed-vocab fields are covered by training, the same negative-space
recipe that fixed
unitβ see negative space.
Why not a deterministic overlay β the reasoningβ
The over-firing was a data artifact, not a model limitation. The night-9 country
shard was almost entirely trailing-country rows, so the model learned one cheap
shortcut β "trailing token β country" β and tagged cities and regions as countries (49
F1, 23% precision). That is a skewed training distribution, not a transformer+CRF ceiling.
Context-sensitive token classifiers (BERT+CRF, LUKE, BLINK) routinely resolve gazetteer
homographs from local context; "Atlanta, Georgia" β region vs "Tbilisi, Georgia" β
country is the canonical job of a context-sensitive tagger.
The "deterministic = 100%" result was a mirage. It was measured on a curated eval with zero homographs, against the crippled (badly-trained) model. The homograph case β the one that matters β is exactly where a flat lookup cannot win without a hand-coded guard, and that guard is the growing-exception-list smell the project rejects (no load-bearing trivia). The overlay buys perfect precision only on the cases that were never hard.
It contradicts a principle we already committed to. The postcode-anchor work established lexicon signals as a soft channel, never a substitution. A post-parse override is precisely the substitution we ruled out.
The honest framing: a wrong country is a wrong continent downstream, so the
instinct to want a hard guarantee is sound β but the place to enforce it is the
resolver, using calibrated confidence (we ship isotonic conf=): if the model is
low-confidence and the gazetteer has an unambiguous match, the resolver may prefer the
lookup as a candidate. That is a downstream re-rank at the resolution layer, not a
parser substitution β the same safety, at the right layer.
The revised taxonomy (three tiers, all model-first)β
The night-9 "open-vocab β train / closed-vocab β deterministic" split was too coarse. The real partition is by what the field needs to learn:
| Tier | Fields | Treatment |
|---|---|---|
| Open-vocab, starved | unit, street_prefix, street_suffix | coverage shard + retrain (proven: unit 0β92%, sharpened neighbors) |
| Closed-vocab, no homographs, just starved | po_box, cedex | also just a coverage shard β they were starved, not unlearnable. "PO Box"/"CEDEX" carry no ambiguity, so no anchor is needed; train them like unit |
| Closed-vocab, large set + homographs + long tail | country | balanced coverage shard plus a soft gazetteer anchor feature (long-tail recall floor + homograph flag). The only tier that needs the feature |
The deterministic override appears in none of them.
The architecture: a soft gazetteer anchorβ
For country, the matcher produces per-token features fed into the transformer's token
representation (same machinery as the postcode anchor), and the model decides every tag:
- Binary gazetteer-match flag β "this token/sequence is a known country surface form." Its main job is long-tail recall: with ~250 countries at realistic prevalence, a small country appears a handful of times in training; the flag gives the model a prior it can't get from the subword embeddings of 3 examples. (It cannot help truly OOV / misspelled forms β there the model is on its own, which is correct.)
- Known-homograph flag β "this surface form is also a US state / region name." This
is the single highest-value bit: it tells the model "context is non-negotiable here,
don't fall back on a shortcut." Produced statically by intersecting the country and
region gazetteers. Expected to improve
regionprecision, not dent it. - Optional entity-type embedding β a small learned vector encoding the signal type (is-country / country+region-homograph / cedex / po_box). Near-zero cost; lets the model learn distinct behaviors per signal.
The matcher (codex) is reused, not discarded β it stops overriding and starts informing.
The data + training recipeβ
Built to deny the model the cheap shortcut and give it a fair shot:
- Position variety β leading / middle / trailing / absent country. If every row ends with a country, the CRF learns a transition prior that kills recall on the (common) absent-country address.
- Explicit homograph pairs disambiguated within the same address window β Atlanta vs Tbilisi Georgia; Paris TX vs Paris FR; San JosΓ© CR vs CA.
- Hard negatives β state/province codes that are also country codes (CA, GA, IN, MAβ¦); person-name countries (Jordan, Chad, Dominica).
- Realistic prevalence with loss-level compensation. Don't inflate country everywhere (that re-creates the over-fire). Target ~12β18% of rows carrying a country, then compensate the imbalance with a class-weighted CRF loss (~3β5Γ on B/I-country) plus light transition regularization, rather than raw oversampling. Validate on a production-prevalence (~5%) held-out set so the transition prior stays honest.
Eval-first β because the last measurement misled usβ
Before any retrain, build a hard, non-gameable homograph eval (~500β1000 rows) and re-measure the current model to get the true baseline (the 27% figure was a separate US-order artifact; the real number under homograph stress is unknown). It must contain:
- β₯50 homograph pairs (same ambiguous token, opposite label, distinguished only by real neighbor tokens β no regex-exploitable tell).
- β₯30 state-code-as-country negatives (CA/GA/IN/MAβ¦ as state and as country).
- β₯20 person-name-as-country negatives.
- β₯20 non-trailing positions and β₯30% absent-country rows (to measure O-emission / hallucination).
- a handful of long-tail countries (Eswatini, Timor-Leste, SΓ£o TomΓ©) for recall.
Gate (per-tag no-regression): country up, region/locality flat. Run DeepSeek's
A/B/C control β current model, soft-feature-on-old-corpus, soft-feature-on-balanced-corpus
β so we learn whether the feature or the data does the work and don't pay for the
feature if balanced data alone suffices.
How the field handles this (prior art)β
Two paradigms divide the space:
- Parse-time structured tagging β produce an authoritative field-tagged record (Mailwoman, Pelias, libpostal, deepparse). The over-firing failure mode lives here.
- Index-time disambiguation β don't commit a parse; let retrieval decide (Airmail).
| System | Approach to country / closed-vocab | Lexicon used as |
|---|---|---|
| libpostal | country/po_box are learned labels, same footing as everything | input feature + training-data generator (end-to-end; inherits the over-fire risk) |
| deepparse | learned labels, configurable tag set | β (no overlay) |
| Pelias Parser | dictionary classifiers + cartesian solver | fully deterministic (no ML to inform) |
| Pelias Placeholder | deterministic WoF hierarchy resolution downstream of the parse | a resolution layer, not span re-tagging |
| Nominatim | countrycodes is a hard filter; special phrases | deterministic overlays on search |
| Google libaddressinput | per-ISO-country deterministic metadata | validation/format engine, no free-text parse |
| Airmail | parserless by default; small BERT to aid malformed queries; country baked into a flat content field at index time, S2 cells for hierarchy | OSM key=value as a separate deterministic index-time tag channel |
Where we sit. The field is bimodal: libpostal/deepparse learn everything (and inherit the over-fire risk); Pelias/Nominatim/libaddressinput treat country deterministically but were never ML parsers. Our position β keep the model authoritative for every field, but feed it the gazetteer as a soft feature β is the model-first synthesis: the lexicon informs rather than overrides. The closest kindred is Airmail's instinct that closed categorical vocab deserves a deterministic channel (its OSM tags) alongside the fuzzy path β but Airmail applies it at index time and goes the opposite way on label granularity (it merges labels and tolerates confusion; we add labels and sharpen).
A useful motivating example for the homograph guard the model must learn: Pelias's legacy
addressit parsed CA into a region='CA' subquery and never a country='Canada'
alternative (pelias/api #1287).
Citationsβ
- McCallum & Li, "Early results for NER with CRFs, Feature Induction and Web-Enhanced Lexicons", CoNLL-2003 β aclanthology.org/W03-0430 (lexicon-as-feature gives modest, dataset-dependent gains β the feature path, applied here as a soft anchor).
- "Gazetteer-Enhanced Attentive Neural Networks", EMNLP-2019 β aclanthology.org/D19-1646 (gazetteers valued for boundary precision on closed-class spans).
- libpostal β Inside Libpostal, openvenues/libpostal.
- Pelias β pelias/parser, pelias/placeholder, pelias/api #1287.
- Nominatim β Search Manual.
- Google β libaddressinput.
- Airmail β github.com/ellenhp/airmail, Address parsing: state of the art, and beyond.