FST Gazetteer LM
Goal: Pre-compute a finite-state transducer from the WOF SQLite gazetteer that maps token sequences โ (placetype, wof_id, parent_chain, importance) entries. Use it as an emission prior in the neural Viterbi decoder, as a CLI introspection tool, and as the autocomplete backend.
Principle: Pay down the combinatorial cross-product of "all valid place-name paths through the WOF hierarchy" at build time. At query time, walking the FST is O(depth), not O(gazetteer_size).
Shipped metrics (US admin)โ
| Metric | Value |
|---|---|
| FST states | 114,214 |
| Name insertions | 163,271 |
| Binary size | 5.57 MB |
| Load time | ~10 ms |
| Build time (from unified SQLite) | 2.7 s |
| Build time (unified SQLite from 293K GeoJSON) | 43 s |
| Wikipedia importance matches | 47,348 places |
| Population fallback | 108,111 places |
Architecture overviewโ
The FST data structureโ
FstState {
id: u32 // dense index into states[]
edges: Vec<FstEdge> // 2-5 entries avg
places: Vec<PlaceEntry> // accepting states only, 1-5 entries avg
}
FstEdge {
token: string // interned, lowercase, NFKC-normalized
target: u32 // target state id
}
PlaceEntry {
wof_id: u32 // WOF numeric ID
placetype: u8 // PlacetypeId enum
parent_chain: Vec<u32> // [country_id, region_id, county_id?], leaf-to-root
population: u32
lat: f32, lon: f32
}
Tokenization is whitespace-split with punctuation stripping. The FST edge label is the normalized token. Accepting states carry ALL valid interpretations โ "New York" ends at a state with entries for both NYC (locality) and NY state (region). The FST never picks; the neural model + reconciler do.
Build pipelineโ
WOF SQLite (spr + names tables)
โ
โผ
fst-builder.ts โ resolver-wof-sqlite/fst-builder.ts
โ
โโ Normalize names (NFKC, lowercase, strip punctuation)
โโ Tokenize each name + alt_names into token sequences
โโ Insert into incremental trie (shared prefixes)
โโ At each terminal: attach PlaceEntry
โโ Determinize + Minimize (Hopcroft)
โ
โผ
fst.bin โ compact binary (~8-12 MB for US)
โ
โผ
FstMatcher class (fast prefix walk)
Integration with the neural pipelineโ
The FST provides per-token emission biases, composing additively with the existing QueryShape soft prior:
let emissions = logits
if (opts?.queryShape) {
emissions = addEmissionMatrix(emissions, buildEmissionPriors(...))
}
if (opts?.fst) {
emissions = addEmissionMatrix(emissions, buildFstEmissionPriors(...))
}
// โ Viterbi decode over biased emissions
The FST prior gets STRONGER as the prefix grows longer โ a "wedge that tightens with information." A bare "S" spreads bias across many places; "San Fran" concentrates it on San Francisco.
The WFST analogy to speech recognitionโ
| Speech recognition | Address parsing |
|---|---|
| Acoustic model (DNN) | Neural classifier (transformer) |
| Pronunciation lexicon (L) | FST (token sequences โ place entries) |
| Language model (G) | Address grammar (valid component sequences) |
| Shallow fusion at decode time | Additive emission biases at Viterbi time |
The neural model handles non-gazetteer components (streets, venues, house numbers, typos). The FST handles gazetteer components (countries, regions, localities). They compose via shallow fusion โ same as modern ASR systems.
CLI proof-of-conceptโ
mailwoman fst buildโ
mailwoman fst build --db /path/to/wof-admin-us.db --output fst-en-US.bin --countries US
mailwoman fst queryโ
mailwoman fst query 'New York' --fst fst-en-US.bin --show-continuations
Example output:
"new york" (complete match)
โโโ Accepts: 2 interpretations
โ โโโ locality New York City pop 8,804,000 wof:85977539
โ โ chain: US โ New York (state) โ New York (city)
โ โโโ region New York State pop 20,200,000 wof:85688543
โ chain: US โ New York (state)
โโโ Valid continuations:
โโโ "ny" โ region (NY state, disambiguated)
โโโ "10001" โ postalcode
โโโ [end] โ valid terminal
Negative evidence example โ 'Buffalo Health Clinic, Buffalo':
"buffalo" โ 14 places (locality Buffalo, NY is top)
"health" โ NOT IN FST (no gazetteer match)
โ Strong signal: this span is NOT an admin component
โ Fall back to neural model for typing (likely venue)
Locale strategyโ
Per-locale FSTs (recommended for v1). One FST per locale: fst-en-US.bin, fst-fr-FR.bin. Filter by country during build. Names in all languages are included (a user typing "Munich" in an en-US context won't match โ and that's correct, Munich isn't in the US).
| Locale | Places | FST size |
|---|---|---|
| en-US | ~30K | ~8 MB |
| fr-FR | ~36K (communes) | ~10 MB |
| ja-JP | ~1,800 | ~5 MB |
| en-GB | ~20K | ~6 MB |
Why not combined+filterโ
A single FST with per-edge locale bitsets is space-efficient but query-slower. For v1, per-locale FSTs match the existing per-locale weights model. Revisit if multi-locale deployment becomes necessary.
Size estimatesโ
Admin only (Phase 1): ~8-12 MB for US. Smaller than the current 35 MB WOF slim DB but encodes more structural information (parent chains, population, valid continuations).
With streets (Phase 2+): ~200-500 MB for full US streets (3-5M unique street names). Too large for browser โ shard per metro area (~5 MB each) for browser deployment. Server-side loads the full set.
Implementation phasesโ
Phase 1: FST builder + CLI โ shipped (#170)โ
resolver-wof-sqlite/fst-builder.ts,fst-matcher.ts,fst-types.tsresolver-wof-sqlite/fst-serialize.ts(binary format, VERSION 2)scripts/fst-query.ts(interactive CLI)- 24 integration tests against WOF US admin data
Phase 2: Neural emission prior โ shipped (#170, #173)โ
neural/fst-prior.tsโbuildFstEmissionPriors()with Wikipedia importance weightingneural/classifier.tsโ FST threaded throughParseOpts.fstcore/pipeline/runtime-pipeline.tsโ FST threaded throughRuntimePipelineStages- Wikipedia importance ETL:
scripts/build-importance.ts - Region-aware locality bias guard:
neural/query-shape-prior.ts(#174)
Phase 3: Autocomplete prototype โ partially shipped (#170)โ
resolver-wof-sqlite/fst-autocomplete.tsโ prefix walk + BFS expansion- CLI not yet wired (standalone script only)
Phase 4: Browser deployment โ shippedโ
resolver-wof-sqlite/fst-deserialize-web.tsโ browser-compatible deserializer using DataView + TextDecoder (no Node Buffer dependency)/demopage loadsfst-en-US.bin(~9 MB) and passes it toclassifier.parse()asopts.fst- FST module import, binary fetch, and deserialization happen in parallel with ONNX model load
- Graceful degradation: if the FST fails to load, the demo runs without it
Key design decisionsโ
-
FST is an emission PRIOR, not a replacement. The neural model remains the authority for non-gazetteer components. The FST handles what the model can't: knowing which place names exist and their hierarchies.
-
Wikipedia importance-weighted bias when ambiguous. Each place carries a [0,1] importance score derived from Wikipedia link count (Nominatim methodology). "New York" biases both locality (0.95) and region (0.85) proportionally. Washington DC locality (0.815) correctly outranks Washington state (0.764) despite lower population. Formula:
importance ร biasScale ร maxBias(linear, capped at 3.0 logits). -
Negative suppression on non-place labels. When the FST matches a place name, B-street, I-street, B-house_number, I-house_number, and B-venue receive -1.5 logit suppression. This narrows the gap between place-tag and non-place-tag logits without overriding the model.
-
Negative evidence is free. When a token doesn't extend any FST path, that's a strong signal it's NOT an admin component โ the neural model handles it alone. This is how "Buffalo Health Clinic" gets correctly NOT-biased toward locality.
-
The FST eliminates reconciler beam search for admin components. Only concordant paths exist in the FST. Invalid admin combinations were pruned at build time.
-
Autocomplete is a prefix walk, not an index scan. O(depth ร branching) vs Elasticsearch's O(matches). The FST IS the autocomplete index.
Relationship to existing architectureโ
The FST is the "pay-it-forward" implementation of the mail-carrier philosophy: each carrier's knowledge is pre-compiled into a structure that narrows the search space with each token consumed. The deliberative assembly (phrase grouper + neural model + reconciler) still adjudicates โ the FST just gives them much better priors to work with.
Where the QueryShape soft prior says "this 5-digit token is probably a postcode" (structural pattern), the FST prior says "this token sequence matches 'New York' which is either locality WOF:85977539 or region WOF:85688543" (factual knowledge from the gazetteer). Both are additive biases; neither overrides the neural model.
See alsoโ
QUERY_SHAPE.mdโ the existing structural-prior system this extendsDEMO_PRESET_DIAGNOSIS.mdโ the locality/region confusion this addressesTRAINING_RECIPE_LEVERS.mdโ training-side fixes (complementary)docs/articles/understanding/exotic-poi/โ venue detection that benefits from FST negative evidence