Layer 1 (street-morphology FST) eval β 2026-05-28
Applied the newly-landed street-morphology FST
as a decoder-only fix on the v0.6.1 weights and re-ran the 4561-entry golden set. Goal: measure
whether the morphology prior alone suppresses v0.6.1's 1066 dependent_locality hallucinations
without retraining.
TL;DR: It does not. The mechanism is structurally correct (dep_loc hallucinations drop monotonically as the penalty is strengthened) but the model's overconfidence on synth-street-induced predictions is too high for any practical decoder-time bias to flip. The right deployment of Layer 1 is alongside a v0.6.2 retrain that adds O-tagged street slots to the negative-example corpus.
Setupβ
- Model:
model-v061-step-100000-int8.onnx(the experimental v0.6.1 weights with synth-street shard, weight 2.0, that produced the original 1066 dep_loc regression) - Tokenizer:
v0.6.0-a0(multi-script; matches the model's training tokenizer) - Admin FST:
fst-en-us.bin(the production admin FST) - Morphology FST: built in-process from the 60 libpostal
street_types.txtdictionaries (core/data/libpostal/dictionaries/{locale}/street_types.txt) - Golden set:
data/eval/golden/v0.1.2/β 4561 entries (US + FR + adversarial) - Script:
scripts/eval-morphology-fst.ts(new, takes explicit weight + FST paths)
Resultsβ
| Configuration | dep_loc hall. | street_suffix hall. | street_prefix hall. | locality recall | street recall | exact match |
|---|---|---|---|---|---|---|
| v0.6.1 neural-only (on record) | 1066 | 198 | 31 | 31.1% | 27.5% | 18.8% |
| v0.6.1 + admin FST only | 1063 | 204 | 32 | 31.2% | 27.5% | 18.8% |
| + morphology, defaults | 1050 | 332 β | 40 | 31.2% | 26.2% | 18.2% β |
| + morphology, lengthβ₯3 filter | 1058 | 238 | 32 | 31.2% | 27.1% | β |
| + morphology, low bias + β6.0 pen | 1044 | 213 | 32 | 31.3% | 27.5% | β |
Defaults: maxAffixBias=3.0, maxNeighbourStreetBias=2.0, dependentLocalityPenalty=2.0.
Tuned: maxAffixBias=1.0, maxNeighbourStreetBias=1.0, dependentLocalityPenalty=6.0.
Findingsβ
Admin FST does nothing for street-side errorsβ
v0.6.1 neural-only vs v0.6.1 + admin FST only is essentially a no-op (1066 vs 1063 dep_loc
hallucinations). Confirms the structural diagnosis from the
WOF hierarchy gap doc: the admin FST has no street
placetypes to match, so it can provide no negative-evidence anchor for street tokens. Synth-street
exploited exactly this vacuum.
Layer 1 mechanism is sound but its magnitude is bounded by the model's confidenceβ
The dep_loc hallucination count drops monotonically as the dependentLocalityPenalty is
strengthened (defaults 2.0 β 6.0 β ...). The direction is correct. But the absolute reduction is
small: even at β6.0 penalty (3Γ the default), only 19/1063 hallucinations get suppressed. To
suppress more, you'd need to push the penalty higher β at which point it starts corrupting
legitimate decisions on other tokens.
The model's confidence on its synth-street-induced dep_loc predictions is genuinely high. Per the v0.6.1 calibration probe design: when the hallucinations are high-confidence, retraining is required, not just a decoder-time threshold or bias.
Default morphology bias is over-aggressive and causes collateral street_suffix damageβ
At default magnitudes (maxAffixBias=3.0), the morphology FST inflates street_suffix
hallucinations from 204 β 332 β a +63% increase. Investigation: the libpostal
street_types.txt dictionaries contain many 1-2 character abbreviations (a, b, av, bd,
br, ...) that collide with US state abbreviations (OR, CA, ND, NY) and short tokens.
A minimum-length-3 surface-form filter mitigates this (av no longer matches, but avenue,
rue, blvd still do β see
resolver-wof-sqlite/street-morphology-fst-builder.ts's minVariantLength option).
Even with the filter, default bias magnitudes still produce 238 hallucinations vs the 204
baseline β a smaller but real regression. Lowering maxAffixBias to 1.0 brings collateral
damage back to 213, basically baseline-equivalent.
Layer 1 is a real deliverable on top of a v0.6.2 retrainβ
Architecturally, Layer 1 IS the dual-FST integration plumbing that future layers (Layer 1.5
candidacy, Layer 2 street identity, Layer 4 brand FST) flow through unchanged. The
infrastructure work tonight wasn't speculative β ParseOpts.fstStreetMorphology,
buildStreetMorphologyEmissionPriors, the PlacetypeId extension, the two PLACETYPE_ORDER
synchronization points, the libpostal dictionary walker β all of it is correct and tested. It
just needs a backbone model that wasn't trained to be wrong about dep_loc.
What this means for v0.6.2β
Per DeepSeek's turn 2 recipe and the street-supplement architecture doc:
- Retrain with synth-street weight 0.5 (down from 2.0) β reduces the gradient pressure that pushed the model into overconfident dep_loc predictions.
- Explicit O-tags on street slots in non-street corpus rows β the negative-example counter-distribution that synth-street is currently missing.
- Layer 1 prior at inference β the morphology FST as an additive anchor; meaningful on top of a corrected backbone, insufficient on its own.
The morphology FST infrastructure landed in this shift is exactly the inference plumbing v0.6.2 will use.
Reproducingβ
# Baseline: v0.6.1 + admin FST only
node --experimental-strip-types scripts/eval-morphology-fst.ts \
--model /mnt/playpen/mailwoman-data/models/quantized/model-v061-step-100000-int8.onnx \
--tokenizer /mnt/playpen/mailwoman-data/models/tokenizer/v0.6.0-a0/tokenizer.model \
--model-card neural-weights-en-us/model-card.json \
--admin-fst /mnt/playpen/mailwoman-data/wof/fst-per-locale/fst-en-us.bin \
--golden data/eval/golden/v0.1.2 \
--no-morphology
# With morphology FST (defaults)
node --experimental-strip-types scripts/eval-morphology-fst.ts \
--model /mnt/playpen/mailwoman-data/models/quantized/model-v061-step-100000-int8.onnx \
--tokenizer /mnt/playpen/mailwoman-data/models/tokenizer/v0.6.0-a0/tokenizer.model \
--model-card neural-weights-en-us/model-card.json \
--admin-fst /mnt/playpen/mailwoman-data/wof/fst-per-locale/fst-en-us.bin \
--golden data/eval/golden/v0.1.2
# Tuned (low affix bias, strong dep_loc penalty)
node --experimental-strip-types scripts/eval-morphology-fst.ts \
--model /mnt/playpen/mailwoman-data/models/quantized/model-v061-step-100000-int8.onnx \
--tokenizer /mnt/playpen/mailwoman-data/models/tokenizer/v0.6.0-a0/tokenizer.model \
--model-card neural-weights-en-us/model-card.json \
--admin-fst /mnt/playpen/mailwoman-data/wof/fst-per-locale/fst-en-us.bin \
--golden data/eval/golden/v0.1.2 \
--max-affix-bias 1.0 --max-neighbour-street-bias 1.0 --dep-locality-penalty 6.0
See alsoβ
- Street-supplement architecture β the design this eval validates against
- WOF hierarchy gap β the structural cause
- v0.6.1 error analysis β the original regression
- 2026-05-28 night-2 postmortem β the postmortem that drove the design consult