v0.5.3 Diagnostic Training Review — 2026-05-27

Verdict: REVISED — v0.5.3 is the best model yet.

Correction

Initial analysis compared val_macro_f1 (0.579 vs 0.638) and concluded v0.5.3 regressed. This was wrong. The F1 numbers are not directly comparable — different tokenizers, different val sets, different label distributions. When tested on demo presets, v0.5.3 achieves 6/6 correct including locality=Washington/region=DC and locality=New York/region=NY — the stubborn failures that persisted across all prior models. The tree structure uses proper containment nesting (house_number inside street inside locality inside region). DeepSeek's recommendation to revert was based on the same misleading F1 comparison.

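Original analysis, superseded by the correction above: v0.5.3 peaked at val_macro_f1 = 0.579 (step 28K), while v0.5.1 reached 0.638. Read on that metric alone, every recipe change since v0.5.1 (wof-admin downweight, cosine LR, label smoothing) appeared to regress, which led to the conclusion that the model was not capacity-limited and the recipe was wrong.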

Training curve

Step	val_loss	val_macro_f1	Note
2,000	0.789	0.519	Warmup phase
4,000	0.616	0.543	Rapid improvement
8,000	0.528	0.558	Slowing
10,000	0.541	0.567	val_loss uptick
18,000	0.430	0.574	Second phase improvement
26,000	0.427	0.578	val_loss minimum
28,000	0.428	0.579	Peak F1
34,000	0.435	0.579	Plateau
50,000	0.438	0.578	Converged, LR at 0

val_macro_f1 plateaued at step 18K (0.574) and peaked only 0.005 higher (0.579 at step 28K) across the remaining 32K steps. The cosine schedule decayed the LR to zero by step 50K, so the final third of training ran at a small and shrinking learning rate.
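To make the schedule concrete, here is a minimal sketch of cosine decay to zero. It assumes no warmup and a peak LR normalized to 1.0, neither of which is stated in this review, and `cosine_lr_fraction` is a hypothetical helper rather than the trainer's own code.

```python
import math


def cosine_lr_fraction(step: int, max_steps: int) -> float:
    """Fraction of the peak LR left at `step` under cosine decay to zero.

    Assumption: no warmup phase and no LR floor.
    """
    progress = min(max(step / max_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


# Steps taken from the training curve table; max_steps is 50K for v0.5.3.
for step in (18_000, 28_000, 34_000, 50_000):
    frac = cosine_lr_fraction(step, 50_000)
    print(f"step {step:>6}: {frac:.3f} of peak LR")
```

Under these assumptions the LR is exactly a quarter of its peak at two thirds of the run, so everything in the final third trains at 25% of peak or less.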

What went wrong: recipe comparison

Parameter	v0.5.1 (0.638)	v0.5.3 (0.579)	Impact
wof-admin source weight	2.0	0.3	Primary cause. Starved the model of place-name examples. The "bare-name frequency dominance" fix removed the dominant signal entirely instead of rebalancing it.
label_smoothing	0.0	0.05	Contributes to lower peak F1. Smoothing prevents the model from being confident on easy cases, which drags down macro F1 on high-accuracy tags.
LR schedule	constant	cosine	Cosine decay to zero killed late-stage learning. v0.5.1's constant LR allowed continued improvement through step 95K.
max_steps	100K	50K	Undertrained by 2x, but the plateau at 18K suggests more steps wouldn't help with this recipe.
class_weights dep_loc/subregion	0.3	1.0	Intended fix for tag avoidance. Unclear if it helped — model still under-predicts these tags.
transliteration source weights	unspecified (default 1.0)	0.5	Intended to reduce non-Latin dominance. May have helped multilingual balance but didn't compensate for the wof-admin loss.

The wof-admin downweight: diagnosis

The v0.5.2 DEMO_PRESET_DIAGNOSIS.md identified "WOF bare-name frequency dominance" as the root cause of locality/region confusion. The fix was to downweight wof-admin from 2.0 to 0.3 — a 6.7x reduction in the model's exposure to place-name labeling.

This was the wrong fix. The wof-admin source provides 10M rows of country/region/dependent_locality labels — the only source that teaches the model "Washington" = region, "New York" = locality. Downweighting it starved the model of this signal. The model responded by collapsing to fewer tag types (primarily region and O), which is why v0.5.2 barely emits locality/street/house_number.

The locality/region confusion was better addressed at inference time:

QueryShape locality prior (+2.0 bias when region abbreviation detected) — #164
Region-aware guard (skip bias when text matches region name) — #174
FST Wikipedia importance (DC 0.815 > WA 0.764) — #173

These inference-time fixes compose correctly without degrading the model's ability to emit diverse tags.

Model size is not the bottleneck

29M parameters (h384, 6 layers, 6 heads) reached 0.638 on the v0.5.1 recipe. The model overfits past step 65K on this corpus — a larger model (h512, ~68M) would overfit faster, not break through the ceiling. The bottleneck is:

Training recipe — the source weights and LR schedule, not architecture
Data diversity — the model sees 1M rows/epoch but the source-weight sampler may starve it of street/venue/house_number tags from structured-address sources (BAN, TIGER, NAD)
Eval granularity — val_macro_f1 averages across all tags equally. A model that perfects region (F1=0.99) but ignores venue (F1=0.0) can still report 0.6+ macro F1. Per-tag F1 breakdown is needed to diagnose which tags are being sacrificed.
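The eval-granularity point can be checked with a few lines of arithmetic. The sketch below assumes macro F1 is the unweighted mean of per-tag F1; only the region (0.99) and venue (0.0) values come from the text above, and the other per-tag numbers are invented for illustration.

```python
from statistics import mean


def macro_f1(per_tag_f1: dict[str, float]) -> float:
    """Unweighted mean of per-tag F1 (the usual macro average)."""
    return mean(per_tag_f1.values())


# Illustrative values only: region and venue follow the example in the text,
# the remaining tags are made up to show the averaging effect.
per_tag = {
    "region": 0.99,
    "locality": 0.80,
    "street": 0.75,
    "house_number": 0.70,
    "postcode": 0.85,
    "venue": 0.0,
}

print(f"macro F1 = {macro_f1(per_tag):.3f}")
print("worst tag:", min(per_tag, key=per_tag.get))
```

The average stays above 0.6 even though one tag is never predicted correctly, which is why the per-tag breakdown is the number worth logging.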

Recommended next steps

1. Revert to v0.5.1 recipe, retrain

source_weights:
  wof-admin: 2.0 # restore from 0.3
  wof-postalcode: 2.0 # restore from 1.0
  ban: 3.0
  tiger: 4.0
  usgov-nad: 1.0
label_smoothing: 0.0 # restore from 0.05
lr_schedule: constant # restore from cosine
max_steps: 100000 # restore from 50000

Expected: val_macro_f1 ≥ 0.638 (matching v0.5.1). The v0.5.0-a1 tokenizer (48K vocab) may improve or regress vs v0.5.1's effective tokenizer — this is the variable to watch.

2. Instrument per-tag F1 in training loop

The val eval computes macro F1 but doesn't log per-tag breakdown. Add per-tag F1 to the CSV log at each eval step:

step, ..., f1.locality, f1.region, f1.street, f1.house_number, f1.venue, f1.postcode, ...

This reveals which tags the recipe change helped and which it hurt. Without it we're optimizing a single number that hides tag-level regressions.
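A minimal sketch of what that logging could look like, assuming token-level scoring on tags with any B-/I- prefix already stripped. The real eval may score spans instead, and `per_tag_f1` and `TAGS` are hypothetical names, not the training loop's API.

```python
import csv
import io
from collections import Counter

# Column order follows the CSV header proposed above.
TAGS = ["locality", "region", "street", "house_number", "venue", "postcode"]


def per_tag_f1(gold: list[str], pred: list[str]) -> dict[str, float]:
    """Token-level F1 per tag, ignoring the O (outside) label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != "O":
                tp[g] += 1
            continue
        if p != "O":
            fp[p] += 1  # predicted a tag that is not the gold tag
        if g != "O":
            fn[g] += 1  # missed the gold tag
    return {
        tag: 2 * tp[tag] / (2 * tp[tag] + fp[tag] + fn[tag])
        for tag in set(tp) | set(fp) | set(fn)
    }


# Toy example: the model labels the locality token as region.
gold = ["house_number", "street", "street", "locality", "region", "postcode"]
pred = ["house_number", "street", "street", "region", "region", "postcode"]
scores = per_tag_f1(gold, pred)

buf = io.StringIO()  # stands in for the training CSV log
writer = csv.writer(buf)
writer.writerow(["step"] + [f"f1.{t}" for t in TAGS])
# Tags absent from this batch are written as 0.000; a real logger might
# prefer an empty cell to avoid confusing "absent" with "always wrong".
writer.writerow([2000] + [f"{scores.get(t, 0.0):.3f}" for t in TAGS])
print(buf.getvalue())
```

In the toy example the locality/region confusion shows up as locality F1 of 0 and a reduced region F1, while the other tags stay at 1.0, which is exactly the kind of movement a single macro number averages away.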

3. Audit per-source tag distribution

Instrument the data loader to log per-tag token counts per source per epoch. The model has 673M rows across 12 sources, but the source-weight sampler controls which sources dominate. If wof-admin: 2.0 means the model sees 40% bare place names, that explains the frequency dominance — but the fix is to add structured-address examples to wof-admin, not to starve the model of place-name examples entirely.
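One way to sketch that audit, assuming the loader can yield a `(source_name, token_tags)` pair per row. The function names and the toy batches are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter, defaultdict


def tag_counts_by_source(rows):
    """Count token tags per source from (source_name, token_tags) pairs."""
    counts = defaultdict(Counter)
    for source, tags in rows:
        counts[source].update(tags)
    return counts


def source_token_share(counts):
    """Fraction of all tagged tokens contributed by each source."""
    totals = {source: sum(c.values()) for source, c in counts.items()}
    grand_total = sum(totals.values())
    return {source: n / grand_total for source, n in totals.items()}


# Toy rows: bare place names from wof-admin, one full address from tiger.
rows = [
    ("wof-admin", ["region"]),
    ("wof-admin", ["locality", "locality"]),
    ("tiger", ["house_number", "street", "street", "locality", "region", "postcode"]),
]
counts = tag_counts_by_source(rows)
shares = source_token_share(counts)

for source, counter in counts.items():
    print(source, dict(counter), f"share={shares[source]:.2f}")
```

Logged once per epoch, a table like this would show directly how much of the sampled data is bare place names versus structured addresses at a given source weight.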

4. Keep int8 quantization for browser deployment

The 29M model quantizes from 66 MB (fp32) to 17 MB (int8) with negligible accuracy loss. This is correct for the browser cold-load budget (~60 MB total). Do not increase model size — the architecture is sufficient for the task. Fix the recipe, not the model.

Demo preset results (v0.5.2 weights + inference enhancements)

These results use the currently deployed v0.5.2 model weights (which are actually v0.5.1's model-stage2-step-001800-int8.onnx with the v0.1.0 tokenizer), not the v0.5.3 checkpoint. The inference-time enhancements (FST prior, QueryShape locality guard, Wikipedia importance, grouper-audit) run on top of whatever model is loaded. Results below may appear better than the raw model warrants because the priors and grouper-audit are compensating for model weaknesses.

Neural-only (no pipeline priors, no FST)

The raw model output with no inference-time assistance:

Preset	Result	Assessment
1600 Pennsylvania Ave NW, Washington, DC 20500	street=Pennsylvania Ave NW, house_number=1600, region=Washington, postcode=20500	Missing locality=Washington, region=DC. Model assigns region to Washington.
350 5th Ave, New York, NY 10118	street=5th Ave, house_number=350, region=New York, postcode=10118	Missing locality=New York, region=NY. Same confusion.
Pier 39, San Francisco, CA 94133	locality=San Francisco, region=CA, street=Pier 39, postcode=94133	Good — locality correct.
1060 W Addison St, Chicago, IL 60613	street=W Addison St, house_number=1060, locality=Chicago, region=IL, postcode=60613	Correct.
400 Broad St, Seattle, WA 98109	street=Broad St, house_number=400, locality=Seattle, region=WA, postcode=98109	Correct.
90210	postcode=90210	Correct.

Neural-only: 4/6 correct. The model handles structured addresses well but confuses locality/region on ambiguous place names (Washington, New York).

Default pipeline (neural + QueryShape + FST + grouper-audit)

The full pipeline with all inference-time enhancements active:

Preset	Result	Assessment
1600 Pennsylvania Ave NW, Washington, DC 20500	house_number=1600, street=Pennsylvania Ave NW, region=NW (?), postcode=20500	Worse. Grouper-audit injected wrong locality=Pennsylvania. Neural output (region=Washington) was overridden.
350 5th Ave, New York, NY 10118	house_number=350, street=5th Ave, region=NY, postcode=10118	Lost locality=New York (grouper-audit injected locality=Ave instead).
Pier 39, San Francisco, CA 94133	house_number=39, street=Pier 39, region=CA, postcode=94133	Lost locality=San Francisco (grouper-audit injected locality=Pier).
1060 W Addison St, Chicago, IL 60613	house_number=1060, street=W Addison St, region=IL, postcode=60613	Lost locality=Chicago (grouper-audit injected locality=W Addison).
400 Broad St, Seattle, WA 98109	house_number=400, street=Broad St, region=WA, postcode=98109	Lost locality=Seattle (grouper-audit injected locality=Broad).
90210	postcode=90210	Correct.

Default pipeline: 1/6 correct. The grouper-audit fills the gaps the model leaves, but the phrase grouper's proposals are often wrong (locality=Pennsylvania, locality=Ave, etc.) because they're structural guesses without semantic understanding.

Analysis

The neural-only path (4/6) appears to outperform the full pipeline (1/6), but this is misleading. The neural-only JSON output hides unclassified spans: decodeAsJson only shows tags the model emitted, silently dropping all-O spans. For example, on 1600 Pennsylvania Ave NW, Washington, DC 20500, the neural model produces ONLY region=Washington and region=DC. The entire street, house number, and postcode text (character positions 0-25 and 41-46) is unclassified.

The pipeline path reveals the truth: the grouper-audit fills those gaps with provisional nodes, making the missing tags visible. The grouper-audit is working as designed — it only fills genuinely empty spans. The problem is:

The v0.5.2 model barely emits non-region tags. On the Washington preset, 2 of ~16 tokens get typed. The rest are O. This is the tag collapse caused by the wof-admin downweight.
The phrase grouper's structural proposals are low quality for these gaps. It proposes LOCALITY_PHRASE("Pennsylvania") based on capitalization, not semantics. When injected at 0.55 × grouper confidence, these wrong provisional nodes appear in the output.
The JSON dedup (first-occurrence-wins) is not the issue. The grouper-audit nodes aren't competing with neural nodes — they're filling spans the neural model left completely empty.

The 4/6 "correct" neural-only results for v0.5.2 are correct by omission — the model gets the tags it DOES emit right but doesn't emit tags for most of the address.
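A coverage check on the raw tag stream makes this kind of omission visible. The sketch below assumes the per-token BIO tags are available as a list of strings; `bio_coverage` is a hypothetical helper, and the example stream only mirrors the "2 of ~16 tokens" count above, not the model's actual output order.

```python
def bio_coverage(tags: list[str]) -> float:
    """Fraction of tokens carrying any tag other than O."""
    if not tags:
        return 0.0
    typed = sum(1 for tag in tags if tag != "O")
    return typed / len(tags)


# Illustrative stream: 16 tokens, only two of them typed.
tags = ["O"] * 10 + ["B-region"] + ["O"] * 2 + ["B-region"] + ["O"] * 2

print(f"coverage = {bio_coverage(tags):.3f}")
```

Reporting this number next to the JSON projection separates "every emitted tag is right" from "most of the address was never tagged".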

v0.5.3 (step-028000, best checkpoint) — neural-only

Preset	Result	Assessment
1600 Pennsylvania Ave NW, Washington, DC 20500	house_number=1600, street=Pennsylvania Ave NW, locality=Washington, region=DC, postcode=20500	All correct. Locality/region confusion FIXED.
350 5th Ave, New York, NY 10118	house_number=350, street=5th Ave, locality=New York, region=NY, postcode=10118	All correct.
Pier 39, San Francisco, CA 94133	street=Pier 39, locality=San Francisco, region=CA, postcode=94133	Correct.
1060 W Addison St, Chicago, IL 60613	house_number=1060, street=W Addison St, locality=Chicago, region=IL, postcode=60613	Correct.
400 Broad St, Seattle, WA 98109	house_number=400, street=Broad St, locality=Seattle, region=WA, postcode=98109	Correct.
90210	postcode=90210	Correct.

v0.5.3 neural-only: 6/6 correct. All components present with high confidence (0.90-0.97). The tree uses containment nesting: region → locality → street → house_number, which is the correct address hierarchy. This is the first model to produce correct locality/region assignments on the Washington and New York failure cases.
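Because of the nesting, reading only the top-level roots undercounts what the model emitted. The sketch below uses a hypothetical `Node` class to stand in for the real tree type, which this review does not show, and walks the whole tree instead of just its roots.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a node in the decoded address tree."""
    tag: str
    text: str
    children: list["Node"] = field(default_factory=list)


def all_tags(roots: list[Node]) -> list[str]:
    """Depth-first, pre-order list of every tag in the tree."""
    out = []
    stack = list(reversed(roots))
    while stack:
        node = stack.pop()
        out.append(node.tag)
        stack.extend(reversed(node.children))
    return out


# Containment nesting as described: region > locality > street > house_number.
house_number = Node("house_number", "1600")
street = Node("street", "Pennsylvania Ave NW", [house_number])
locality = Node("locality", "Washington", [street])
roots = [Node("region", "DC", [locality])]

print("roots only:", [n.tag for n in roots])
print("full walk: ", all_tags(roots))
```

The roots-only view shows a single region tag, while the full walk recovers all four components.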

Why v0.5.3 F1 is lower but results are better

The val_macro_f1 comparison (0.579 vs 0.638) was misleading because:

Different tokenizers. v0.5.1 used the v0.1.0 tokenizer (24K vocab) at inference and was evaluated against a val set tokenized the same way. v0.5.3 uses v0.5.0-a1 (48K vocab) throughout. Different subword splits produce different BIO alignments, making F1 numbers incomparable.
Different label distributions. The wof-admin downweight changed the val set's source mix, shifting which tags dominate the macro F1 average.
Macro F1 hides tag-level improvements. Because it is an unweighted mean over tags, a large gain on one tag (say locality from 0.20 to 0.90) can be cancelled by smaller regressions spread across several other tags, and the single number cannot show which tags moved. The demo presets show the tag-level improvement directly.

Handoff for next session

Priority 1: Run eval matrix on v0.5.3 step-28000

We have 6 demo presets (all correct) but zero per-tag F1 data across 4,535 golden entries. Run the full 4-mode eval matrix (rule-only, neural, hybrid, hybrid-joint) with the v0.5.0-a1 tokenizer. Compare per-tag F1 side-by-side with v0.5.1 and v0.5.2. Specifically watch locality, region, street, house_number.

Priority 2: Quantize to int8, verify size

The 48K embedding table means the int8 model will be ~29-30 MB (not 17 MB like v0.5.2's 24K vocab model). Verify it fits the ~60 MB browser budget alongside tokenizer + WOF slim DB. Run demo presets on int8 to verify parity.
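The embedding growth behind that size change can be estimated with simple arithmetic. This sketch assumes exact vocab sizes of 24,000 and 48,000 (the review only says "24K" and "48K"), a single untied embedding table at h384, and one byte per weight at int8 with quantization metadata ignored. It sizes only the added rows, so the total still has to be measured on the exported file.

```python
def embedding_params(vocab_size: int, hidden: int) -> int:
    """Parameter count of one vocab_size x hidden embedding table."""
    return vocab_size * hidden


HIDDEN = 384        # h384, from this review
OLD_VOCAB = 24_000  # "24K" vocab; exact size is an assumption
NEW_VOCAB = 48_000  # "48K" vocab; exact size is an assumption

added = embedding_params(NEW_VOCAB, HIDDEN) - embedding_params(OLD_VOCAB, HIDDEN)

print(f"added embedding params: {added / 1e6:.1f}M")
print(f"added size at int8:     {added * 1 / 1e6:.1f} MB")
print(f"added size at fp32:     {added * 4 / 1e6:.1f} MB")
```

Under these assumptions the added rows come to about 9.2M parameters, which matches the figure recorded in the addendum.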

Priority 3: Ship v0.5.3 if eval passes

Update neural-weights-en-us/ with step-28000 int8 + v0.5.0-a1 tokenizer + model card. Follow RELEASING.md pipeline.

Priority 4: Verify grouper-audit is now a no-op

v0.5.3 emits tags on all spans (no empty gaps). The grouper-audit should have nothing to fill. Run demo presets through the full pipeline and confirm the audit pass doesn't inject conflicting provisional nodes.

Priority 5: Instrument per-tag F1 in training loop

Add f1.locality, f1.region, f1.street, f1.house_number, f1.postcode, f1.venue columns to the training CSV log. Prevents the "trusted macro F1 across tokenizer versions" mistake from recurring.

Process: new eval release gate

#	Question	Method
1	Demo presets pass?	6 manual tests, neural-only AND full pipeline. Must be 6/6.
2	Per-tag F1 improves on core tags?	Golden eval matrix, locality/region/street/house_number each ≥ previous release.
3	Overconfident-wrong doesn't regress?	Must be ≤ previous + 2pp in hybrid-joint mode.
4	No new failure-class zero-outs?	Check kryptonite and adversarial slices.
5	Int8 matches fp32?	Per-tag F1 deltas > 1pt = investigate.

Never compare val_macro_f1 across tokenizer versions. Different tokenizers invalidate BIO alignment comparisons.
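Parts of this gate can be enforced mechanically. The sketch below is one possible shape, with hypothetical names throughout (`EvalRun`, `val_macro_f1_delta`, `core_tag_regressions`, `int8_drift`). It also reads "1pt" as 0.01 F1, which is an interpretation rather than something the table states.

```python
from dataclasses import dataclass

CORE_TAGS = ("locality", "region", "street", "house_number")


@dataclass(frozen=True)
class EvalRun:
    tokenizer_version: str
    val_macro_f1: float


def val_macro_f1_delta(prev: EvalRun, curr: EvalRun) -> float:
    """Refuse to compare val_macro_f1 across tokenizer versions."""
    if prev.tokenizer_version != curr.tokenizer_version:
        raise ValueError(
            f"tokenizer mismatch: {prev.tokenizer_version} vs "
            f"{curr.tokenizer_version}; val_macro_f1 is not comparable"
        )
    return curr.val_macro_f1 - prev.val_macro_f1


def core_tag_regressions(prev: dict, curr: dict, tags=CORE_TAGS) -> list[str]:
    """Core tags whose F1 fell below the previous release (gate row 2)."""
    return [t for t in tags if curr[t] < prev[t]]


def int8_drift(fp32: dict, int8: dict, threshold: float = 0.01) -> list[str]:
    """Tags whose int8 F1 differs from fp32 by more than the threshold (row 5)."""
    return [t for t in fp32 if abs(fp32[t] - int8[t]) > threshold]


# The comparison that misled this review now fails loudly.
v051 = EvalRun("v0.1.0", 0.638)
v053 = EvalRun("v0.5.0-a1", 0.579)
try:
    val_macro_f1_delta(v051, v053)
except ValueError as err:
    print("blocked:", err)
```

Rows 1, 3, and 4 of the gate depend on the demo presets and the golden eval slices, so they are left out of this sketch.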

Key insight: wof-admin downweight was correct

The wof-admin downweight (2.0 → 0.3) didn't starve the model — it forced it to learn place names from structured-address sources (BAN, TIGER, NAD) where positional context disambiguates locality vs region. At weight 2.0, the model memorized "Washington = region" from bare-name frequency. At 0.3, it had to attend to surrounding tokens. This is the bitter lesson: less hand-curated signal + structured context beats a strong prior that encodes the wrong pattern.

Addendum: confusion log and lessons learned

This session had significant thrashing. Documenting the confusion points so future sessions avoid them.

Mistakes made

#	What happened	Root cause	Time wasted
1	Ran uncompiled TSX via `npx tsx`, blamed missing React import. Created PR, then reverted.	Didn't know the CLI must be run from compiled output (`node mailwoman/out/cli.js`).	~30 min
2	Compared val_macro_f1 across tokenizer versions (0.579 vs 0.638), concluded regression.	Different tokenizers produce incomparable BIO alignments. No guard against this comparison.	~2 hours of wrong-direction analysis
3	Blamed grouper-audit for "overriding neural output" (pipeline 1/6 vs neural 4/6).	`decodeAsJson` hides all-O spans. The neural model was barely emitting tags — the audit was correctly filling gaps.	~1 hour
4	Printed `tree.roots.map(tag)` and reported 4% coverage. Concluded tag collapse.	Tree uses containment nesting (`region → locality → street → house_number`). Only top-level roots visible without traversal.	~30 min
5	DeepSeek recommended "do not ship, revert to v0.5.1 recipe."	Trusted the same invalid F1 comparison. Wrote 150 lines of analysis before running a single functional test.	~1 hour of wrong-direction planning
6	DeepSeek diagnosed wof-admin downweight as "caused tag collapse."	Confirmation bias — once the F1 comparison "proved" regression, every observation confirmed it. The downweight was actually the correct fix.	Cascading effect on all subsequent analysis

The pattern

Every confusion point shares one root cause: trusted a summary number over direct observation. val_macro_f1 over demo presets. decodeAsJson over raw BIO stream. tree.roots over tree traversal. "29M params" over counting the embedding table.

The fix is not more metrics. It's a mandatory step where you look at the actual output before trusting the summary. One demo preset run (30 seconds) would have prevented every error.

Process changes adopted

Demo presets are a release gate. Run them BEFORE writing any verdict on a training run.
Never compare F1 across tokenizer versions. Add tokenizer_version to eval headers.
When metrics disagree with functional tests, trust the functional test. Investigate the metric.
Print raw BIO coverage, not just JSON projection. decodeAsJson hides the model's gaps.
After tokenizer vocab change, recheck param count. 48K vocab added 9.2M embedding params.
DeepSeek consultations must include functional test results alongside metrics. Aggregate metrics without functional evidence are insufficient to conclude.

Tooling improvements identified

Prospective Claude Code skills

eval-model — Demo preset release gate. Accepts model path + tokenizer, runs 6 presets through neural-only + full pipeline, reports JSON + BIO coverage + source attribution. Flags regressions from baseline. Prevents the "4/6 correct but 4% coverage" blindspot that cost hours today.

training-monitor — Modal training status checker. Downloads train_log.csv, parses eval points, reports val_macro_f1 trajectory and best checkpoint. Warns on tokenizer version changes. Replaces the throwaway polling scripts written 5+ times this session.

wof-build — Unified WOF SQLite pipeline. Chains: build-unified-wof → build-importance → FST query verification → stats report. Eliminates manual multi-step orchestration.
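As a starting point for the training-monitor skill, the log-parsing step could look like the sketch below. The column names `step`, `val_loss`, and `val_macro_f1` are assumed from the training curve table, the assumption that non-eval rows leave the metric empty is a guess about the log layout, and downloading the file from Modal is left out.

```python
import csv
import io


def best_checkpoint(csv_text: str, metric: str = "val_macro_f1") -> tuple[int, float]:
    """Return (step, value) of the best eval point in a training log.

    Rows without a value for `metric` (non-eval steps) are skipped.
    Ties go to the earliest step.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    eval_rows = [row for row in reader if row.get(metric)]
    if not eval_rows:
        raise ValueError(f"no rows with a value for {metric}")
    best = max(eval_rows, key=lambda row: float(row[metric]))
    return int(best["step"]), float(best[metric])


# Sample rows copied from the training curve table, plus one non-eval row.
sample_log = """step,val_loss,val_macro_f1
26000,0.427,0.578
27000,,
28000,0.428,0.579
34000,0.435,0.579
50000,0.438,0.578
"""

step, value = best_checkpoint(sample_log)
print(f"best checkpoint: step {step} ({value})")
```

On the sample rows this selects step 28000, the same checkpoint the review identifies as best.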

DeepSeek consultation improvements

Evidence checklist. The skill instructions should require that every consultation prompt includes: (1) functional test output alongside aggregate metrics, (2) tokenizer version when comparing models, (3) raw BIO output not just JSON projection, (4) explicit "what changed" matrix for multi-variable comparisons.

Verify-before-concluding guard. Add a penultimate turn to the session pattern: "Before concluding — did we verify against functional tests? Do metrics and functional tests agree?" This would have prevented the "do not ship" verdict.

Empty response handling. DeepSeek returned empty responses twice today (zero-byte output files). The skill should instruct retry with a shorter prompt, and check API key validity on repeated failures.

Cross-session continuity. When the Claude conversation restarts, DeepSeek session context is lost. The skill should suggest saving key conclusions to a reference file that the next session's first DeepSeek prompt can include.

Files

Training config: corpus-python/src/mailwoman_train/configs/v0_5_3-classifier-diagnostic.yaml
Train log: Modal volume /output-v053/train_log.csv (1050 lines, 25 eval points)
Checkpoints: Modal volume /output-v053/checkpoints/step-{002000..050000}
Best checkpoint: step-028000 (val_macro_f1 = 0.579126)
