
Layer 1 (street-morphology FST) eval, 2026-05-28

Applied the newly-landed street-morphology FST as a decoder-only fix on the v0.6.1 weights and re-ran the 4561-entry golden set. Goal: measure whether the morphology prior alone suppresses v0.6.1's 1066 dependent_locality hallucinations without retraining.

TL;DR: It does not. The mechanism is structurally correct (dep_loc hallucinations drop monotonically as the penalty is strengthened) but the model's overconfidence on synth-street-induced predictions is too high for any practical decoder-time bias to flip. The right deployment of Layer 1 is alongside a v0.6.2 retrain that adds O-tagged street slots to the negative-example corpus.

Setup

  • Model: model-v061-step-100000-int8.onnx (the experimental v0.6.1 weights with synth-street shard, weight 2.0, that produced the original 1066 dep_loc regression)
  • Tokenizer: v0.6.0-a0 (multi-script; matches the model's training tokenizer)
  • Admin FST: fst-en-us.bin (the production admin FST)
  • Morphology FST: built in-process from the 60 libpostal street_types.txt dictionaries (core/data/libpostal/dictionaries/{locale}/street_types.txt)
  • Golden set: data/eval/golden/v0.1.2/, 4561 entries (US + FR + adversarial)
  • Script: scripts/eval-morphology-fst.ts (new, takes explicit weight + FST paths)

Results

| Configuration | dep_loc hall. | street_suffix hall. | street_prefix hall. | locality recall | street recall | exact match |
|---|---|---|---|---|---|---|
| v0.6.1 neural-only (on record) | 1066 | 198 | 31 | 31.1% | 27.5% | 18.8% |
| v0.6.1 + admin FST only | 1063 | 204 | 32 | 31.2% | 27.5% | 18.8% |
| + morphology, defaults | 1050 | 332 ❌ | 40 | 31.2% | 26.2% | 18.2% ❌ |
| + morphology, length≥3 filter | 1058 | 238 | 32 | 31.2% | 27.1% | - |
| + morphology, low bias + −6.0 pen | 1044 | 213 | 32 | 31.3% | 27.5% | - |

Defaults: maxAffixBias=3.0, maxNeighbourStreetBias=2.0, dependentLocalityPenalty=2.0. Tuned: maxAffixBias=1.0, maxNeighbourStreetBias=1.0, dependentLocalityPenalty=6.0.
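The deltas discussed in the findings can be recomputed directly from the table. The following is a minimal sketch with the counts copied from the rows above; the `Row` type and the `relChange` helper are illustrative names and not part of the repository.

```typescript
// Counts copied from the results table; type and helper names are illustrative.
type Row = { config: string; depLoc: number; streetSuffix: number };

// "v0.6.1 + admin FST only" is the baseline the morphology runs are compared to.
const baseline: Row = { config: "v0.6.1 + admin FST only", depLoc: 1063, streetSuffix: 204 };

const runs: Row[] = [
  { config: "+ morphology, defaults", depLoc: 1050, streetSuffix: 332 },
  { config: "+ morphology, length>=3 filter", depLoc: 1058, streetSuffix: 238 },
  { config: "+ morphology, low bias + 6.0 penalty", depLoc: 1044, streetSuffix: 213 },
];

// Relative change of `value` against `base`, as a percentage.
function relChange(value: number, base: number): number {
  return ((value - base) / base) * 100;
}

for (const r of runs) {
  const suppressed = baseline.depLoc - r.depLoc;
  const suppressedPct = (suppressed / baseline.depLoc) * 100;
  const suffixPct = relChange(r.streetSuffix, baseline.streetSuffix);
  console.log(
    `${r.config}: dep_loc suppressed ${suppressed}/${baseline.depLoc} ` +
      `(${suppressedPct.toFixed(1)}%), street_suffix ${suffixPct.toFixed(1)}% vs baseline`,
  );
}
```

For the tuned row this gives 19 suppressed dep_loc hallucinations (about 1.8% of 1063) at a street_suffix cost of about 4.4% over the baseline.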

Findings

Admin FST does nothing for street-side errors

v0.6.1 neural-only vs v0.6.1 + admin FST only is essentially a no-op (1066 vs 1063 dep_loc hallucinations). Confirms the structural diagnosis from the WOF hierarchy gap doc: the admin FST has no street placetypes to match, so it can provide no negative-evidence anchor for street tokens. Synth-street exploited exactly this vacuum.

Layer 1 mechanism is sound but its magnitude is bounded by the model's confidence

The dep_loc hallucination count drops monotonically as dependentLocalityPenalty is strengthened (default 2.0 → 6.0 → ...). The direction is correct, but the absolute reduction is small: even at a −6.0 penalty (3× the default), only 19 of 1063 hallucinations are suppressed. Suppressing more would require pushing the penalty higher, at which point it starts corrupting legitimate decisions on other tokens.

The model's confidence on its synth-street-induced dep_loc predictions is genuinely high. Per the v0.6.1 calibration probe design: when the hallucinations are high-confidence, retraining is required, not just a decoder-time threshold or bias.
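A toy example shows why a bounded additive penalty cannot flip a confident prediction. This sketch assumes the penalty is simply subtracted from the dep_loc logit before a per-token argmax; the real decoder may combine priors differently (for example inside a sequence-level decode), and the label set, logit values, and helper names below are all hypothetical.

```typescript
// Toy illustration: an additive penalty flips the argmax only when it
// exceeds the logit margin between dep_loc and the best competing label.
// Labels and logit values are made up for illustration.
const labels = ["O", "street", "dependent_locality", "locality"];
const DEP_LOC = labels.indexOf("dependent_locality");

function argmaxLabel(logits: number[]): string {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return labels[best];
}

// Subtract `penalty` from the dep_loc logit, leaving the others untouched.
function applyDepLocPenalty(logits: number[], penalty: number): number[] {
  return logits.map((v, i) => (i === DEP_LOC ? v - penalty : v));
}

// Margin between the dep_loc logit and the best competing logit.
function depLocMargin(logits: number[]): number {
  const others = logits.filter((_, i) => i !== DEP_LOC);
  return logits[DEP_LOC] - Math.max(...others);
}

const cases: { name: string; logits: number[] }[] = [
  { name: "borderline", logits: [0.5, 2.0, 3.5, 1.0] }, // margin 1.5
  { name: "overconfident", logits: [0.2, 1.0, 9.0, 0.5] }, // margin 8.0
];

for (const c of cases) {
  for (const penalty of [2.0, 6.0]) {
    const label = argmaxLabel(applyDepLocPenalty(c.logits, penalty));
    console.log(`${c.name} (margin ${depLocMargin(c.logits)}), penalty ${penalty} -> ${label}`);
  }
}
```

In this toy setup the borderline token flips to street at either penalty, while the overconfident token stays dependent_locality even at 6.0, because its margin of 8.0 is larger than any penalty tried. That is the pattern the eval suggests for the synth-street-induced predictions.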

Default morphology bias is over-aggressive and causes collateral street_suffix damage

At default magnitudes (maxAffixBias=3.0), the morphology FST inflates street_suffix hallucinations from 204 → 332, a +63% increase. Investigation: the libpostal street_types.txt dictionaries contain many 1-2 character abbreviations (a, b, av, bd, br, ...) that collide with US state abbreviations (OR, CA, ND, NY) and short tokens. A minimum-length-3 surface-form filter mitigates this (av no longer matches, but avenue, rue, blvd still do; see resolver-wof-sqlite/street-morphology-fst-builder.ts's minVariantLength option).
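The effect of the length filter can be sketched as follows. This is an illustration only: the sample surface forms are the ones named above, and `filterVariants` is a hypothetical stand-in rather than the builder's actual code. Only the `minVariantLength` option name comes from the source.

```typescript
// Hypothetical sketch of a minimum-length surface-form filter.
// Only the option name `minVariantLength` is taken from the builder;
// the function itself is illustrative.
function filterVariants(variants: string[], minVariantLength: number): string[] {
  // Array.from splits by code point, so accented forms are measured per character.
  return variants.filter((v) => Array.from(v.trim()).length >= minVariantLength);
}

// Surface forms mentioned in the text above.
const sample = ["a", "b", "av", "bd", "br", "avenue", "rue", "blvd"];

const unfiltered = new Set(filterVariants(sample, 1));
const filtered = new Set(filterVariants(sample, 3));

// A short token such as "av" matches the unfiltered set but not the filtered one.
for (const token of ["av", "rue", "blvd"]) {
  console.log(`${token}: unfiltered=${unfiltered.has(token)} filtered=${filtered.has(token)}`);
}
```

With a minimum length of 3, only avenue, rue, and blvd survive from this sample, so one- and two-character tokens can no longer pick up the affix bias.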

Even with the filter, default bias magnitudes still produce 238 hallucinations vs the 204 baseline, a smaller but real regression. Lowering maxAffixBias to 1.0 brings collateral damage back to 213, basically baseline-equivalent.

Layer 1 is a real deliverable on top of a v0.6.2 retrain

Architecturally, Layer 1 IS the dual-FST integration plumbing that future layers (Layer 1.5 candidacy, Layer 2 street identity, Layer 4 brand FST) flow through unchanged. The infrastructure work tonight wasn't speculative: ParseOpts.fstStreetMorphology, buildStreetMorphologyEmissionPriors, the PlacetypeId extension, the two PLACETYPE_ORDER synchronization points, and the libpostal dictionary walker are all correct and tested. It just needs a backbone model that wasn't trained to be wrong about dep_loc.

What this means for v0.6.2

Per DeepSeek's turn 2 recipe and the street-supplement architecture doc:

  1. Retrain with synth-street weight 0.5 (down from 2.0): reduces the gradient pressure that pushed the model into overconfident dep_loc predictions.
  2. Explicit O-tags on street slots in non-street corpus rows: the negative-example counter-distribution that synth-street is currently missing.
  3. Layer 1 prior at inference: the morphology FST as an additive anchor; meaningful on top of a corrected backbone, insufficient on its own.

The morphology FST infrastructure landed in this shift is exactly the inference plumbing v0.6.2 will use.

Reproducing

# Baseline: v0.6.1 + admin FST only
node --experimental-strip-types scripts/eval-morphology-fst.ts \
--model /mnt/playpen/mailwoman-data/models/quantized/model-v061-step-100000-int8.onnx \
--tokenizer /mnt/playpen/mailwoman-data/models/tokenizer/v0.6.0-a0/tokenizer.model \
--model-card neural-weights-en-us/model-card.json \
--admin-fst /mnt/playpen/mailwoman-data/wof/fst-per-locale/fst-en-us.bin \
--golden data/eval/golden/v0.1.2 \
--no-morphology

# With morphology FST (defaults)
node --experimental-strip-types scripts/eval-morphology-fst.ts \
--model /mnt/playpen/mailwoman-data/models/quantized/model-v061-step-100000-int8.onnx \
--tokenizer /mnt/playpen/mailwoman-data/models/tokenizer/v0.6.0-a0/tokenizer.model \
--model-card neural-weights-en-us/model-card.json \
--admin-fst /mnt/playpen/mailwoman-data/wof/fst-per-locale/fst-en-us.bin \
--golden data/eval/golden/v0.1.2

# Tuned (low affix bias, strong dep_loc penalty)
node --experimental-strip-types scripts/eval-morphology-fst.ts \
--model /mnt/playpen/mailwoman-data/models/quantized/model-v061-step-100000-int8.onnx \
--tokenizer /mnt/playpen/mailwoman-data/models/tokenizer/v0.6.0-a0/tokenizer.model \
--model-card neural-weights-en-us/model-card.json \
--admin-fst /mnt/playpen/mailwoman-data/wof/fst-per-locale/fst-en-us.bin \
--golden data/eval/golden/v0.1.2 \
--max-affix-bias 1.0 --max-neighbour-street-bias 1.0 --dep-locality-penalty 6.0
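When comparing several parameter settings, the invocations above can be generated rather than typed by hand. A minimal sketch follows: the flag names and paths are the ones used in the commands above, but the `MorphologyOpts` type and the `buildArgs` helper are assumptions for illustration and not part of the repository. It also assumes that passing the default values explicitly behaves the same as omitting the flags.

```typescript
// Hypothetical helper that assembles eval-morphology-fst.ts arguments.
// Flag names mirror the commands above; the type and function are illustrative.
interface MorphologyOpts {
  maxAffixBias: number;
  maxNeighbourStreetBias: number;
  dependentLocalityPenalty: number;
}

// Shared arguments, copied from the commands above.
const BASE_ARGS: string[] = [
  "--experimental-strip-types",
  "scripts/eval-morphology-fst.ts",
  "--model", "/mnt/playpen/mailwoman-data/models/quantized/model-v061-step-100000-int8.onnx",
  "--tokenizer", "/mnt/playpen/mailwoman-data/models/tokenizer/v0.6.0-a0/tokenizer.model",
  "--model-card", "neural-weights-en-us/model-card.json",
  "--admin-fst", "/mnt/playpen/mailwoman-data/wof/fst-per-locale/fst-en-us.bin",
  "--golden", "data/eval/golden/v0.1.2",
];

function buildArgs(opts: MorphologyOpts): string[] {
  return [
    ...BASE_ARGS,
    "--max-affix-bias", opts.maxAffixBias.toFixed(1),
    "--max-neighbour-street-bias", opts.maxNeighbourStreetBias.toFixed(1),
    "--dep-locality-penalty", opts.dependentLocalityPenalty.toFixed(1),
  ];
}

// The two parameter sets reported in the results.
const DEFAULTS: MorphologyOpts = { maxAffixBias: 3.0, maxNeighbourStreetBias: 2.0, dependentLocalityPenalty: 2.0 };
const TUNED: MorphologyOpts = { maxAffixBias: 1.0, maxNeighbourStreetBias: 1.0, dependentLocalityPenalty: 6.0 };

for (const opts of [DEFAULTS, TUNED]) {
  console.log(["node", ...buildArgs(opts)].join(" "));
}
```

The printed lines can be pasted into a shell or handed to a process runner; keeping the shared paths in one place avoids drift between runs.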

See also