v0.5.3 Diagnostic Training Review โ 2026-05-27
Verdict: REVISED โ v0.5.3 is the best model yet.โ
Initial analysis compared val_macro_f1 (0.579 vs 0.638) and concluded v0.5.3 regressed. This was wrong. The F1 numbers are not directly comparable โ different tokenizers, different val sets, different label distributions. When tested on demo presets, v0.5.3 achieves 6/6 correct including locality=Washington/region=DC and locality=New York/region=NY โ the stubborn failures that persisted across all prior models. The tree structure uses proper containment nesting (house_number inside street inside locality inside region). DeepSeek's recommendation to revert was based on the same misleading F1 comparison.
v0.5.3 peaked at val_macro_f1 = 0.579 (step 28K). v0.5.1 reached 0.638. Every recipe change since v0.5.1 โ wof-admin downweight, cosine LR, label smoothing โ has regressed. The model is not capacity-limited; the recipe is wrong.
Training curveโ
| Step | val_loss | val_macro_f1 | Note |
|---|---|---|---|
| 2,000 | 0.789 | 0.519 | Warmup phase |
| 4,000 | 0.616 | 0.543 | Rapid improvement |
| 8,000 | 0.528 | 0.558 | Slowing |
| 10,000 | 0.541 | 0.567 | val_loss uptick |
| 18,000 | 0.430 | 0.574 | Second phase improvement |
| 26,000 | 0.427 | 0.578 | val_loss minimum |
| 28,000 | 0.428 | 0.579 | Peak F1 |
| 34,000 | 0.435 | 0.579 | Plateau |
| 50,000 | 0.438 | 0.578 | Converged, LR at 0 |
The model plateaued at step 18K and gained only +0.005 F1 over the next 32K steps. Cosine LR decayed to zero by step 50K, leaving no learning signal in the final third of training.
What went wrong: recipe comparisonโ
| Parameter | v0.5.1 (0.638) | v0.5.3 (0.579) | Impact |
|---|---|---|---|
| wof-admin source weight | 2.0 | 0.3 | Primary cause. Starved the model of place-name examples. The "bare-name frequency dominance" fix removed the dominant signal entirely instead of rebalancing it. |
| label_smoothing | 0.0 | 0.05 | Contributes to lower peak F1. Smoothing prevents the model from being confident on easy cases, which drags down macro F1 on high-accuracy tags. |
| LR schedule | constant | cosine | Cosine decay to zero killed late-stage learning. v0.5.1's constant LR allowed continued improvement through step 95K. |
| max_steps | 100K | 50K | Undertrained by 2x, but the plateau at 18K suggests more steps wouldn't help with this recipe. |
| class_weights dep_loc/subregion | 0.3 | 1.0 | Intended fix for tag avoidance. Unclear if it helped โ model still under-predicts these tags. |
| transliteration source weights | unspecified (default 1.0) | 0.5 | Intended to reduce non-Latin dominance. May have helped multilingual balance but didn't compensate for the wof-admin loss. |
The wof-admin downweight: diagnosisโ
The v0.5.2 DEMO_PRESET_DIAGNOSIS.md identified "WOF bare-name frequency dominance" as the root cause of locality/region confusion. The fix was to downweight wof-admin from 2.0 to 0.3 โ a 6.7x reduction in the model's exposure to place-name labeling.
This was the wrong fix. The wof-admin source provides 10M rows of country/region/dependent_locality labels โ the only source that teaches the model "Washington" = region, "New York" = locality. Downweighting it starved the model of this signal. The model responded by collapsing to fewer tag types (primarily region and O), which is why v0.5.2 barely emits locality/street/house_number.
The locality/region confusion was better addressed at inference time:
- QueryShape locality prior (+2.0 bias when region abbreviation detected) โ #164
- Region-aware guard (skip bias when text matches region name) โ #174
- FST Wikipedia importance (DC 0.815 > WA 0.764) โ #173
These inference-time fixes compose correctly without degrading the model's ability to emit diverse tags.
Model size is not the bottleneckโ
29M parameters (h384, 6 layers, 6 heads) reached 0.638 on the v0.5.1 recipe. The model overfits past step 65K on this corpus โ a larger model (h512, ~68M) would overfit faster, not break through the ceiling. The bottleneck is:
- Training recipe โ the source weights and LR schedule, not architecture
- Data diversity โ the model sees 1M rows/epoch but the source-weight sampler may starve it of street/venue/house_number tags from structured-address sources (BAN, TIGER, NAD)
- Eval granularity โ val_macro_f1 averages across all tags equally. A model that perfects region (F1=0.99) but ignores venue (F1=0.0) can still report 0.6+ macro F1. Per-tag F1 breakdown is needed to diagnose which tags are being sacrificed.
Recommended next stepsโ
1. Revert to v0.5.1 recipe, retrainโ
source_weights:
wof-admin: 2.0 # restore from 0.3
wof-postalcode: 2.0 # restore from 1.0
ban: 3.0
tiger: 4.0
usgov-nad: 1.0
label_smoothing: 0.0 # restore from 0.05
lr_schedule: constant # restore from cosine
max_steps: 100000 # restore from 50000
Expected: val_macro_f1 โฅ 0.638 (matching v0.5.1). The v0.5.0-a1 tokenizer (48K vocab) may improve or regress vs v0.5.1's effective tokenizer โ this is the variable to watch.
2. Instrument per-tag F1 in training loopโ
The val eval computes macro F1 but doesn't log per-tag breakdown. Add per-tag F1 to the CSV log at each eval step:
step, ..., f1.locality, f1.region, f1.street, f1.house_number, f1.venue, f1.postcode, ...
This reveals which tags the recipe change helped and which it hurt. Without it we're optimizing a single number that hides tag-level regressions.
3. Audit per-source tag distributionโ
Instrument the data loader to log per-tag token counts per source per epoch. The model has 673M rows across 12 sources, but the source-weight sampler controls which sources dominate. If wof-admin: 2.0 means the model sees 40% bare place names, that explains the frequency dominance โ but the fix is to add structured-address examples to wof-admin, not to starve the model of place-name examples entirely.
4. Keep int8 quantization for browser deploymentโ
The 29M model quantizes from 66 MB (fp32) to 17 MB (int8) with negligible accuracy loss. This is correct for the browser cold-load budget (~60 MB total). Do not increase model size โ the architecture is sufficient for the task. Fix the recipe, not the model.
Demo preset results (v0.5.2 weights + inference enhancements)โ
model-stage2-step-001800-int8.onnx with the v0.1.0 tokenizer), not the v0.5.3 checkpoint. The inference-time enhancements (FST prior, QueryShape locality guard, Wikipedia importance, grouper-audit) run on top of whatever model is loaded. Results below may appear better than the raw model warrants because the priors and grouper-audit are compensating for model weaknesses.Neural-only (no pipeline priors, no FST)โ
The raw model output with no inference-time assistance:
| Preset | Result | Assessment |
|---|---|---|
| 1600 Pennsylvania Ave NW, Washington, DC 20500 | street=Pennsylvania Ave NW, house_number=1600, region=Washington, postcode=20500 | Missing locality=Washington, region=DC. Model assigns region to Washington. |
| 350 5th Ave, New York, NY 10118 | street=5th Ave, house_number=350, region=New York, postcode=10118 | Missing locality=New York, region=NY. Same confusion. |
| Pier 39, San Francisco, CA 94133 | locality=San Francisco, region=CA, street=Pier 39, postcode=94133 | Good โ locality correct. |
| 1060 W Addison St, Chicago, IL 60613 | street=W Addison St, house_number=1060, locality=Chicago, region=IL, postcode=60613 | Correct. |
| 400 Broad St, Seattle, WA 98109 | street=Broad St, house_number=400, locality=Seattle, region=WA, postcode=98109 | Correct. |
| 90210 | postcode=90210 | Correct. |
Neural-only: 4/6 correct. The model handles structured addresses well but confuses locality/region on ambiguous place names (Washington, New York).
Default pipeline (neural + QueryShape + FST + grouper-audit)โ
The full pipeline with all inference-time enhancements active:
| Preset | Result | Assessment |
|---|---|---|
| 1600 Pennsylvania Ave NW, Washington, DC 20500 | house_number=1600, street=Pennsylvania Ave NW, region=NW (?), postcode=20500 | Worse. Grouper-audit injected wrong locality=Pennsylvania. Neural output (region=Washington) was overridden. |
| 350 5th Ave, New York, NY 10118 | house_number=350, street=5th Ave, region=NY, postcode=10118 | Lost locality=New York (grouper-audit injected locality=Ave instead). |
| Pier 39, San Francisco, CA 94133 | house_number=39, street=Pier 39, region=CA, postcode=94133 | Lost locality=San Francisco (grouper-audit injected locality=Pier). |
| 1060 W Addison St, Chicago, IL 60613 | house_number=1060, street=W Addison St, region=IL, postcode=60613 | Lost locality=Chicago (grouper-audit injected locality=W Addison). |
| 400 Broad St, Seattle, WA 98109 | house_number=400, street=Broad St, region=WA, postcode=98109 | Lost locality=Seattle (grouper-audit injected locality=Broad). |
| 90210 | postcode=90210 | Correct. |
Default pipeline: 1/6 correct. The grouper-audit fills the gaps the model leaves, but the phrase grouper's proposals are often wrong (locality=Pennsylvania, locality=Ave, etc.) because they're structural guesses without semantic understanding.
Analysisโ
The neural-only path (4/6) appears to outperform the full pipeline (1/6), but this is misleading. The neural-only JSON output hides unclassified spans โ decodeAsJson only shows tags the model emitted, silently dropping all-O regions. For example, on 1600 Pennsylvania Ave NW, Washington, DC 20500, the neural model produces ONLY region=Washington and region=DC. The entire street + house number + postcode region (positions 0-25 and 41-46) is unclassified.
The pipeline path reveals the truth: the grouper-audit fills those gaps with provisional nodes, making the missing tags visible. The grouper-audit is working as designed โ it only fills genuinely empty spans. The problem is:
- The v0.5.2 model barely emits non-region tags. On the Washington preset, 2 of ~16 tokens get typed. The rest are O. This is the tag collapse caused by the wof-admin downweight.
- The phrase grouper's structural proposals are low quality for these gaps. It proposes
LOCALITY_PHRASE("Pennsylvania")based on capitalization, not semantics. When injected at 0.55 ร grouper confidence, these wrong provisional nodes appear in the output. - The JSON dedup (first-occurrence-wins) is not the issue. The grouper-audit nodes aren't competing with neural nodes โ they're filling spans the neural model left completely empty.
The 4/6 "correct" neural-only results for v0.5.2 are correct by omission โ the model gets the tags it DOES emit right but doesn't emit tags for most of the address.
v0.5.3 (step-028000, best checkpoint) โ neural-onlyโ
| Preset | Result | Assessment |
|---|---|---|
| 1600 Pennsylvania Ave NW, Washington, DC 20500 | house_number=1600, street=Pennsylvania Ave NW, locality=Washington, region=DC, postcode=20500 | All correct. Locality/region confusion FIXED. |
| 350 5th Ave, New York, NY 10118 | house_number=350, street=5th Ave, locality=New York, region=NY, postcode=10118 | All correct. |
| Pier 39, San Francisco, CA 94133 | street=Pier 39, locality=San Francisco, region=CA, postcode=94133 | Correct. |
| 1060 W Addison St, Chicago, IL 60613 | house_number=1060, street=W Addison St, locality=Chicago, region=IL, postcode=60613 | Correct. |
| 400 Broad St, Seattle, WA 98109 | house_number=400, street=Broad St, locality=Seattle, region=WA, postcode=98109 | Correct. |
| 90210 | postcode=90210 | Correct. |
v0.5.3 neural-only: 6/6 correct. All components present with high confidence (0.90-0.97). The tree uses containment nesting: region โ locality โ street โ house_number, which is the correct address hierarchy. This is the first model to produce correct locality/region assignments on the Washington and New York failure cases.
Why v0.5.3 F1 is lower but results are betterโ
The val_macro_f1 comparison (0.579 vs 0.638) was misleading because:
- Different tokenizers. v0.5.1 used v0.1.0 (24K vocab) at inference but was evaluated against a val set tokenized with the same tokenizer. v0.5.3 uses v0.5.0-a1 (48K vocab) throughout. Different subword splits produce different BIO alignments, making F1 numbers incomparable.
- Different label distributions. The wof-admin downweight changed the val set's source mix, shifting which tags dominate the macro F1 average.
- Macro F1 hides tag-level improvements. A model that improves locality from 0.20 to 0.90 but regresses region from 0.95 to 0.85 looks like a regression in macro F1. The demo presets show the tag-level improvement directly.
Handoff for next sessionโ
Priority 1: Run eval matrix on v0.5.3 step-28000โ
We have 6 demo presets (all correct) but zero per-tag F1 data across 4,535 golden entries. Run the full 4-mode eval matrix (rule-only, neural, hybrid, hybrid-joint) with the v0.5.0-a1 tokenizer. Compare per-tag F1 side-by-side with v0.5.1 and v0.5.2. Specifically watch locality, region, street, house_number.
Priority 2: Quantize to int8, verify sizeโ
The 48K embedding table means the int8 model will be ~29-30 MB (not 17 MB like v0.5.2's 24K vocab model). Verify it fits the ~60 MB browser budget alongside tokenizer + WOF slim DB. Run demo presets on int8 to verify parity.
Priority 3: Ship v0.5.3 if eval passesโ
Update neural-weights-en-us/ with step-28000 int8 + v0.5.0-a1 tokenizer + model card. Follow RELEASING.md pipeline.
Priority 4: Verify grouper-audit is now a no-opโ
v0.5.3 emits tags on all spans (no empty gaps). The grouper-audit should have nothing to fill. Run demo presets through the full pipeline and confirm the audit pass doesn't inject conflicting provisional nodes.
Priority 5: Instrument per-tag F1 in training loopโ
Add f1.locality, f1.region, f1.street, f1.house_number, f1.postcode, f1.venue columns to the training CSV log. Prevents the "trusted macro F1 across tokenizer versions" mistake from recurring.
Process: new eval release gateโ
| # | Question | Method |
|---|---|---|
| 1 | Demo presets pass? | 6 manual tests, neural-only AND full pipeline. Must be 6/6. |
| 2 | Per-tag F1 improves on core tags? | Golden eval matrix, locality/region/street/house_number each โฅ previous release. |
| 3 | Overconfident-wrong doesn't regress? | Must be โค previous + 2pp in hybrid-joint mode. |
| 4 | No new failure-class zero-outs? | Check kryptonite and adversarial slices. |
| 5 | Int8 matches fp32? | Per-tag F1 deltas > 1pt = investigate. |
Never compare val_macro_f1 across tokenizer versions. Different tokenizers invalidate BIO alignment comparisons.
Key insight: wof-admin downweight was correctโ
The wof-admin downweight (2.0 โ 0.3) didn't starve the model โ it forced it to learn place names from structured-address sources (BAN, TIGER, NAD) where positional context disambiguates locality vs region. At weight 2.0, the model memorized "Washington = region" from bare-name frequency. At 0.3, it had to attend to surrounding tokens. This is the bitter lesson: less hand-curated signal + structured context beats a strong prior that encodes the wrong pattern.
Addendum: confusion log and lessons learnedโ
This session had significant thrashing. Documenting the confusion points so future sessions avoid them.
Mistakes madeโ
| # | What happened | Root cause | Time wasted |
|---|---|---|---|
| 1 | Ran uncompiled TSX via npx tsx, blamed missing React import. Created PR, then reverted. | Didn't know the CLI must be run from compiled output (node mailwoman/out/cli.js). | ~30 min |
| 2 | Compared val_macro_f1 across tokenizer versions (0.579 vs 0.638), concluded regression. | Different tokenizers produce incomparable BIO alignments. No guard against this comparison. | ~2 hours of wrong-direction analysis |
| 3 | Blamed grouper-audit for "overriding neural output" (pipeline 1/6 vs neural 4/6). | decodeAsJson hides all-O spans. The neural model was barely emitting tags โ the audit was correctly filling gaps. | ~1 hour |
| 4 | Printed tree.roots.map(tag) and reported 4% coverage. Concluded tag collapse. | Tree uses containment nesting (region โ locality โ street โ house_number). Only top-level roots visible without traversal. | ~30 min |
| 5 | DeepSeek recommended "do not ship, revert to v0.5.1 recipe." | Trusted the same invalid F1 comparison. Wrote 150 lines of analysis before running a single functional test. | ~1 hour of wrong-direction planning |
| 6 | DeepSeek diagnosed wof-admin downweight as "caused tag collapse." | Confirmation bias โ once the F1 comparison "proved" regression, every observation confirmed it. The downweight was actually the correct fix. | Cascading effect on all subsequent analysis |
The patternโ
Every confusion point shares one root cause: trusted a summary number over direct observation. val_macro_f1 over demo presets. decodeAsJson over raw BIO stream. tree.roots over tree traversal. "29M params" over counting the embedding table.
The fix is not more metrics. It's a mandatory step where you look at the actual output before trusting the summary. One demo preset run (30 seconds) would have prevented every error.
Process changes adoptedโ
- Demo presets are a release gate. Run them BEFORE writing any verdict on a training run.
- Never compare F1 across tokenizer versions. Add
tokenizer_versionto eval headers. - When metrics disagree with functional tests, trust the functional test. Investigate the metric.
- Print raw BIO coverage, not just JSON projection.
decodeAsJsonhides the model's gaps. - After tokenizer vocab change, recheck param count. 48K vocab added 9.2M embedding params.
- DeepSeek consultations must include functional test results alongside metrics. Aggregate metrics without functional evidence are insufficient to conclude.
Tooling improvements identifiedโ
Prospective Claude Code skillsโ
eval-model โ Demo preset release gate. Accepts model path + tokenizer, runs 6 presets through neural-only + full pipeline, reports JSON + BIO coverage + source attribution. Flags regressions from baseline. Prevents the "4/6 correct but 4% coverage" blindspot that cost hours today.
training-monitor โ Modal training status checker. Downloads train_log.csv, parses eval points, reports val_macro_f1 trajectory and best checkpoint. Warns on tokenizer version changes. Replaces the throwaway polling scripts written 5+ times this session.
wof-build โ Unified WOF SQLite pipeline. Chains: build-unified-wof โ build-importance โ FST query verification โ stats report. Eliminates manual multi-step orchestration.
DeepSeek consultation improvementsโ
Evidence checklist. The skill instructions should require that every consultation prompt includes: (1) functional test output alongside aggregate metrics, (2) tokenizer version when comparing models, (3) raw BIO output not just JSON projection, (4) explicit "what changed" matrix for multi-variable comparisons.
Verify-before-concluding guard. Add a penultimate turn to the session pattern: "Before concluding โ did we verify against functional tests? Do metrics and functional tests agree?" This would have prevented the "do not ship" verdict.
Empty response handling. DeepSeek returned empty responses twice today (zero-byte output files). The skill should instruct retry with a shorter prompt, and check API key validity on repeated failures.
Cross-session continuity. When the Claude conversation restarts, DeepSeek session context is lost. The skill should suggest saving key conclusions to a reference file that the next session's first DeepSeek prompt can include.
Filesโ
- Training config:
corpus-python/src/mailwoman_train/configs/v0_5_3-classifier-diagnostic.yaml - Train log: Modal volume
/output-v053/train_log.csv(1050 lines, 25 eval points) - Checkpoints: Modal volume
/output-v053/checkpoints/step-{002000..050000} - Best checkpoint: step-028000 (val_macro_f1 = 0.579126)