Skip to main content

FST Gazetteer LM

Shipped

Phases 1-2 shipped in v0.5.2 (#170, #173). Wikipedia importance integration in #173. Unified SQLite builder in #176. Phase 3 (autocomplete) partially shipped. Phase 4 (browser) shipped โ€” the /demo page loads a 9 MB FST binary and passes it as an emission prior to the neural classifier.

Goal: Pre-compute a finite-state transducer from the WOF SQLite gazetteer that maps token sequences โ†’ (placetype, wof_id, parent_chain, importance) entries. Use it as an emission prior in the neural Viterbi decoder, as a CLI introspection tool, and as the autocomplete backend.

Principle: Pay down the combinatorial cross-product of "all valid place-name paths through the WOF hierarchy" at build time. At query time, walking the FST is O(depth), not O(gazetteer_size).

Shipped metrics (US admin)โ€‹

MetricValue
FST states114,214
Name insertions163,271
Binary size5.57 MB
Load time~10 ms
Build time (from unified SQLite)2.7 s
Build time (unified SQLite from 293K GeoJSON)43 s
Wikipedia importance matches47,348 places
Population fallback108,111 places

Architecture overviewโ€‹

The FST data structureโ€‹

FstState {
id: u32 // dense index into states[]
edges: Vec<FstEdge> // 2-5 entries avg
places: Vec<PlaceEntry> // accepting states only, 1-5 entries avg
}

FstEdge {
token: string // interned, lowercase, NFKC-normalized
target: u32 // target state id
}

PlaceEntry {
wof_id: u32 // WOF numeric ID
placetype: u8 // PlacetypeId enum
parent_chain: Vec<u32> // [country_id, region_id, county_id?], leaf-to-root
population: u32
lat: f32, lon: f32
}

Tokenization is whitespace-split with punctuation stripping. The FST edge label is the normalized token. Accepting states carry ALL valid interpretations โ€” "New York" ends at a state with entries for both NYC (locality) and NY state (region). The FST never picks; the neural model + reconciler do.

Build pipelineโ€‹

WOF SQLite (spr + names tables)
โ”‚
โ–ผ
fst-builder.ts โ† resolver-wof-sqlite/fst-builder.ts
โ”‚
โ”œโ”€ Normalize names (NFKC, lowercase, strip punctuation)
โ”œโ”€ Tokenize each name + alt_names into token sequences
โ”œโ”€ Insert into incremental trie (shared prefixes)
โ”œโ”€ At each terminal: attach PlaceEntry
โ”œโ”€ Determinize + Minimize (Hopcroft)
โ”‚
โ–ผ
fst.bin โ† compact binary (~8-12 MB for US)
โ”‚
โ–ผ
FstMatcher class (fast prefix walk)

Integration with the neural pipelineโ€‹

The FST provides per-token emission biases, composing additively with the existing QueryShape soft prior:

let emissions = logits
if (opts?.queryShape) {
emissions = addEmissionMatrix(emissions, buildEmissionPriors(...))
}
if (opts?.fst) {
emissions = addEmissionMatrix(emissions, buildFstEmissionPriors(...))
}
// โ†’ Viterbi decode over biased emissions

The FST prior gets STRONGER as the prefix grows longer โ€” a "wedge that tightens with information." A bare "S" spreads bias across many places; "San Fran" concentrates it on San Francisco.

The WFST analogy to speech recognitionโ€‹

Speech recognitionAddress parsing
Acoustic model (DNN)Neural classifier (transformer)
Pronunciation lexicon (L)FST (token sequences โ†’ place entries)
Language model (G)Address grammar (valid component sequences)
Shallow fusion at decode timeAdditive emission biases at Viterbi time

The neural model handles non-gazetteer components (streets, venues, house numbers, typos). The FST handles gazetteer components (countries, regions, localities). They compose via shallow fusion โ€” same as modern ASR systems.


CLI proof-of-conceptโ€‹

mailwoman fst buildโ€‹

mailwoman fst build --db /path/to/wof-admin-us.db --output fst-en-US.bin --countries US

mailwoman fst queryโ€‹

mailwoman fst query 'New York' --fst fst-en-US.bin --show-continuations

Example output:

"new york" (complete match)
โ”œโ”€โ”€ Accepts: 2 interpretations
โ”‚ โ”œโ”€โ”€ locality New York City pop 8,804,000 wof:85977539
โ”‚ โ”‚ chain: US โ† New York (state) โ† New York (city)
โ”‚ โ””โ”€โ”€ region New York State pop 20,200,000 wof:85688543
โ”‚ chain: US โ† New York (state)
โ””โ”€โ”€ Valid continuations:
โ”œโ”€โ”€ "ny" โ†’ region (NY state, disambiguated)
โ”œโ”€โ”€ "10001" โ†’ postalcode
โ””โ”€โ”€ [end] โ†’ valid terminal

Negative evidence example โ€” 'Buffalo Health Clinic, Buffalo':

"buffalo" โ†’ 14 places (locality Buffalo, NY is top)
"health" โ†’ NOT IN FST (no gazetteer match)
โ†’ Strong signal: this span is NOT an admin component
โ†’ Fall back to neural model for typing (likely venue)

Locale strategyโ€‹

Per-locale FSTs (recommended for v1). One FST per locale: fst-en-US.bin, fst-fr-FR.bin. Filter by country during build. Names in all languages are included (a user typing "Munich" in an en-US context won't match โ€” and that's correct, Munich isn't in the US).

LocalePlacesFST size
en-US~30K~8 MB
fr-FR~36K (communes)~10 MB
ja-JP~1,800~5 MB
en-GB~20K~6 MB

Why not combined+filterโ€‹

A single FST with per-edge locale bitsets is space-efficient but query-slower. For v1, per-locale FSTs match the existing per-locale weights model. Revisit if multi-locale deployment becomes necessary.


Size estimatesโ€‹

Admin only (Phase 1): ~8-12 MB for US. Smaller than the current 35 MB WOF slim DB but encodes more structural information (parent chains, population, valid continuations).

With streets (Phase 2+): ~200-500 MB for full US streets (3-5M unique street names). Too large for browser โ€” shard per metro area (~5 MB each) for browser deployment. Server-side loads the full set.


Implementation phasesโ€‹

Phase 1: FST builder + CLI โ€” shipped (#170)โ€‹

  • resolver-wof-sqlite/fst-builder.ts, fst-matcher.ts, fst-types.ts
  • resolver-wof-sqlite/fst-serialize.ts (binary format, VERSION 2)
  • scripts/fst-query.ts (interactive CLI)
  • 24 integration tests against WOF US admin data

Phase 2: Neural emission prior โ€” shipped (#170, #173)โ€‹

  • neural/fst-prior.ts โ€” buildFstEmissionPriors() with Wikipedia importance weighting
  • neural/classifier.ts โ€” FST threaded through ParseOpts.fst
  • core/pipeline/runtime-pipeline.ts โ€” FST threaded through RuntimePipelineStages
  • Wikipedia importance ETL: scripts/build-importance.ts
  • Region-aware locality bias guard: neural/query-shape-prior.ts (#174)

Phase 3: Autocomplete prototype โ€” partially shipped (#170)โ€‹

  • resolver-wof-sqlite/fst-autocomplete.ts โ€” prefix walk + BFS expansion
  • CLI not yet wired (standalone script only)

Phase 4: Browser deployment โ€” shippedโ€‹

  • resolver-wof-sqlite/fst-deserialize-web.ts โ€” browser-compatible deserializer using DataView + TextDecoder (no Node Buffer dependency)
  • /demo page loads fst-en-US.bin (~9 MB) and passes it to classifier.parse() as opts.fst
  • FST module import, binary fetch, and deserialization happen in parallel with ONNX model load
  • Graceful degradation: if the FST fails to load, the demo runs without it

Key design decisionsโ€‹

  1. FST is an emission PRIOR, not a replacement. The neural model remains the authority for non-gazetteer components. The FST handles what the model can't: knowing which place names exist and their hierarchies.

  2. Wikipedia importance-weighted bias when ambiguous. Each place carries a [0,1] importance score derived from Wikipedia link count (Nominatim methodology). "New York" biases both locality (0.95) and region (0.85) proportionally. Washington DC locality (0.815) correctly outranks Washington state (0.764) despite lower population. Formula: importance ร— biasScale ร— maxBias (linear, capped at 3.0 logits).

  3. Negative suppression on non-place labels. When the FST matches a place name, B-street, I-street, B-house_number, I-house_number, and B-venue receive -1.5 logit suppression. This narrows the gap between place-tag and non-place-tag logits without overriding the model.

  4. Negative evidence is free. When a token doesn't extend any FST path, that's a strong signal it's NOT an admin component โ€” the neural model handles it alone. This is how "Buffalo Health Clinic" gets correctly NOT-biased toward locality.

  5. The FST eliminates reconciler beam search for admin components. Only concordant paths exist in the FST. Invalid admin combinations were pruned at build time.

  6. Autocomplete is a prefix walk, not an index scan. O(depth ร— branching) vs Elasticsearch's O(matches). The FST IS the autocomplete index.


Relationship to existing architectureโ€‹

The FST is the "pay-it-forward" implementation of the mail-carrier philosophy: each carrier's knowledge is pre-compiled into a structure that narrows the search space with each token consumed. The deliberative assembly (phrase grouper + neural model + reconciler) still adjudicates โ€” the FST just gives them much better priors to work with.

Where the QueryShape soft prior says "this 5-digit token is probably a postcode" (structural pattern), the FST prior says "this token sequence matches 'New York' which is either locality WOF:85977539 or region WOF:85688543" (factual knowledge from the gazetteer). Both are additive biases; neither overrides the neural model.

See alsoโ€‹

  • QUERY_SHAPE.md โ€” the existing structural-prior system this extends
  • DEMO_PRESET_DIAGNOSIS.md โ€” the locality/region confusion this addresses
  • TRAINING_RECIPE_LEVERS.md โ€” training-side fixes (complementary)
  • docs/articles/understanding/exotic-poi/ โ€” venue detection that benefits from FST negative evidence