Skip to main content

Closed-vocab fields: model-first, with a soft gazetteer anchor

How Mailwoman handles closed-vocabulary address fields β€” country, po_box, cedex β€” and why we resolve them inside the model (with a gazetteer fed in as a soft feature) rather than overriding the model with a deterministic post-parse lookup.

Supersedes the night-9 overlay conclusion

A first pass (2026-06-09 night shift) concluded these fields should be tagged by a deterministic post-parse overlay (ClosedVocabTagger, issue #464). That conclusion was reversed after review: it was reached from an unfair comparison and it contradicts the project's model-first ethos. This doc is the corrected design record. Pressure-tested across a two-turn DeepSeek consult; see the rationale below.

The decision​

  1. The model is the authority for every field, including closed-vocab ones. We do not override its output with a lookup table.
  2. The gazetteer enters as a soft feature, never a substitution β€” the same principle as the postcode anchor. codex's matchers (matchCountry, matchPOBox, the cedex pattern) flip role from override to feature-source.
  3. Starved closed-vocab fields are covered by training, the same negative-space recipe that fixed unit β€” see negative space.

Why not a deterministic overlay β€” the reasoning​

The over-firing was a data artifact, not a model limitation. The night-9 country shard was almost entirely trailing-country rows, so the model learned one cheap shortcut β€” "trailing token β‡’ country" β€” and tagged cities and regions as countries (49 F1, 23% precision). That is a skewed training distribution, not a transformer+CRF ceiling. Context-sensitive token classifiers (BERT+CRF, LUKE, BLINK) routinely resolve gazetteer homographs from local context; "Atlanta, Georgia" β†’ region vs "Tbilisi, Georgia" β†’ country is the canonical job of a context-sensitive tagger.

The "deterministic = 100%" result was a mirage. It was measured on a curated eval with zero homographs, against the crippled (badly-trained) model. The homograph case β€” the one that matters β€” is exactly where a flat lookup cannot win without a hand-coded guard, and that guard is the growing-exception-list smell the project rejects (no load-bearing trivia). The overlay buys perfect precision only on the cases that were never hard.

It contradicts a principle we already committed to. The postcode-anchor work established lexicon signals as a soft channel, never a substitution. A post-parse override is precisely the substitution we ruled out.

The honest framing: a wrong country is a wrong continent downstream, so the instinct to want a hard guarantee is sound β€” but the place to enforce it is the resolver, using calibrated confidence (we ship isotonic conf=): if the model is low-confidence and the gazetteer has an unambiguous match, the resolver may prefer the lookup as a candidate. That is a downstream re-rank at the resolution layer, not a parser substitution β€” the same safety, at the right layer.

The revised taxonomy (three tiers, all model-first)​

The night-9 "open-vocab β†’ train / closed-vocab β†’ deterministic" split was too coarse. The real partition is by what the field needs to learn:

TierFieldsTreatment
Open-vocab, starvedunit, street_prefix, street_suffixcoverage shard + retrain (proven: unit 0β†’92%, sharpened neighbors)
Closed-vocab, no homographs, just starvedpo_box, cedexalso just a coverage shard β€” they were starved, not unlearnable. "PO Box"/"CEDEX" carry no ambiguity, so no anchor is needed; train them like unit
Closed-vocab, large set + homographs + long tailcountrybalanced coverage shard plus a soft gazetteer anchor feature (long-tail recall floor + homograph flag). The only tier that needs the feature

The deterministic override appears in none of them.

The architecture: a soft gazetteer anchor​

For country, the matcher produces per-token features fed into the transformer's token representation (same machinery as the postcode anchor), and the model decides every tag:

  • Binary gazetteer-match flag β€” "this token/sequence is a known country surface form." Its main job is long-tail recall: with ~250 countries at realistic prevalence, a small country appears a handful of times in training; the flag gives the model a prior it can't get from the subword embeddings of 3 examples. (It cannot help truly OOV / misspelled forms β€” there the model is on its own, which is correct.)
  • Known-homograph flag β€” "this surface form is also a US state / region name." This is the single highest-value bit: it tells the model "context is non-negotiable here, don't fall back on a shortcut." Produced statically by intersecting the country and region gazetteers. Expected to improve region precision, not dent it.
  • Optional entity-type embedding β€” a small learned vector encoding the signal type (is-country / country+region-homograph / cedex / po_box). Near-zero cost; lets the model learn distinct behaviors per signal.

The matcher (codex) is reused, not discarded β€” it stops overriding and starts informing.

The data + training recipe​

Built to deny the model the cheap shortcut and give it a fair shot:

  • Position variety β€” leading / middle / trailing / absent country. If every row ends with a country, the CRF learns a transition prior that kills recall on the (common) absent-country address.
  • Explicit homograph pairs disambiguated within the same address window β€” Atlanta vs Tbilisi Georgia; Paris TX vs Paris FR; San JosΓ© CR vs CA.
  • Hard negatives β€” state/province codes that are also country codes (CA, GA, IN, MA…); person-name countries (Jordan, Chad, Dominica).
  • Realistic prevalence with loss-level compensation. Don't inflate country everywhere (that re-creates the over-fire). Target ~12–18% of rows carrying a country, then compensate the imbalance with a class-weighted CRF loss (~3–5Γ— on B/I-country) plus light transition regularization, rather than raw oversampling. Validate on a production-prevalence (~5%) held-out set so the transition prior stays honest.

Eval-first β€” because the last measurement misled us​

Before any retrain, build a hard, non-gameable homograph eval (~500–1000 rows) and re-measure the current model to get the true baseline (the 27% figure was a separate US-order artifact; the real number under homograph stress is unknown). It must contain:

  • β‰₯50 homograph pairs (same ambiguous token, opposite label, distinguished only by real neighbor tokens β€” no regex-exploitable tell).
  • β‰₯30 state-code-as-country negatives (CA/GA/IN/MA… as state and as country).
  • β‰₯20 person-name-as-country negatives.
  • β‰₯20 non-trailing positions and β‰₯30% absent-country rows (to measure O-emission / hallucination).
  • a handful of long-tail countries (Eswatini, Timor-Leste, SΓ£o TomΓ©) for recall.

Gate (per-tag no-regression): country up, region/locality flat. Run DeepSeek's A/B/C control β€” current model, soft-feature-on-old-corpus, soft-feature-on-balanced-corpus β€” so we learn whether the feature or the data does the work and don't pay for the feature if balanced data alone suffices.

How the field handles this (prior art)​

Two paradigms divide the space:

  • Parse-time structured tagging β€” produce an authoritative field-tagged record (Mailwoman, Pelias, libpostal, deepparse). The over-firing failure mode lives here.
  • Index-time disambiguation β€” don't commit a parse; let retrieval decide (Airmail).
SystemApproach to country / closed-vocabLexicon used as
libpostalcountry/po_box are learned labels, same footing as everythinginput feature + training-data generator (end-to-end; inherits the over-fire risk)
deepparselearned labels, configurable tag setβ€” (no overlay)
Pelias Parserdictionary classifiers + cartesian solverfully deterministic (no ML to inform)
Pelias Placeholderdeterministic WoF hierarchy resolution downstream of the parsea resolution layer, not span re-tagging
Nominatimcountrycodes is a hard filter; special phrasesdeterministic overlays on search
Google libaddressinputper-ISO-country deterministic metadatavalidation/format engine, no free-text parse
Airmailparserless by default; small BERT to aid malformed queries; country baked into a flat content field at index time, S2 cells for hierarchyOSM key=value as a separate deterministic index-time tag channel

Where we sit. The field is bimodal: libpostal/deepparse learn everything (and inherit the over-fire risk); Pelias/Nominatim/libaddressinput treat country deterministically but were never ML parsers. Our position β€” keep the model authoritative for every field, but feed it the gazetteer as a soft feature β€” is the model-first synthesis: the lexicon informs rather than overrides. The closest kindred is Airmail's instinct that closed categorical vocab deserves a deterministic channel (its OSM tags) alongside the fuzzy path β€” but Airmail applies it at index time and goes the opposite way on label granularity (it merges labels and tolerates confusion; we add labels and sharpen).

A useful motivating example for the homograph guard the model must learn: Pelias's legacy addressit parsed CA into a region='CA' subquery and never a country='Canada' alternative (pelias/api #1287).

Citations​