v0.4.0 ablation campaign
Iteration window: 2026-05-23, ~9 hours wall-clock, single Radeon 780M iGPU. Outcome: shipped a smaller subset of the planned recipe than intended; identified one process meta-bug; deferred two of three training-side improvements to v0.4.1.
This is the technical postmortem. For the public-facing narrative, see the corresponding blog post. For the line-by-line iteration log entry, see PHASE_2_training.md's v0.4.0 entry.
Setupβ
- Model: 256-dim, 6 layers, 4 heads, intermediate 1024, max_pos 128 β 9M params total. Unchanged from v0.3.0.
- Corpus:
corpus-v0.3.0(677M aligned rows). Unchanged from v0.3.0. - Tokenizer: v0.1.0 sentencepiece. Unchanged from v0.3.0.
- Label vocab: 21 BIO classes. Unchanged from v0.3.0.
- Hardware: gfx1103 Radeon 780M iGPU, batch=32, grad_accum=4 β effective 128, ~4.5 steps/sec sustained.
The only intended differences from v0.3.0 were:
- Β§1
model.crf_normalization: per_token(was implicitper_sequence) +crf_loss_weight: 1.0(was 0.05). - Β§3
model.class_weights: {O: 1.0, B-country: 2.0, ..., B-venue: 0.5, ...}(was uniform). - Β§4 Source weights rebalanced:
usgov-nad: 2.0 β 1.0;wof-admin: 1.0 β 2.0;wof-postalcode: 1.0 β 2.0.
Β§2 (longer training to step 5000+ floor) and Β§5 (JS-side Viterbi + model-card label loading) were process improvements, not loss-surface changes. Β§5 had shipped before training. Β§6 was "reuse existing corpus" β non-action.
Run-by-runβ
Run 1 β full recipe, lr=5e-4β
configs/v0_4_0.yaml. lr=5e-4 had been the v0.2.0 baseline before v0.3.0's dual-loss instability forced it down to 1.5e-4. The Β§1 hypothesis was that per-token CRF normalization would make crf_loss_weight=1.0 safe at 5e-4 again.
Diverged step 750. Training loss had dropped monotonically 6.5 β 0.3 over steps 0-700, then spiked 0.3 β 3.3 over steps 700-1000. Val macro-F1 collapsed 0.36 β 0.11. Killed via train_with_resume.sh interrupt; checkpoint at step-500 (best so far) preserved at /data/models/checkpoints/v0_4_0/step-000500.
Run 2 β full recipe, lr=3e-4β
configs/v0_4_0-lr3e4.yaml. Operator brief said "if divergence recurs by step 2000, bisect down."
Diverged step 1000. Same fingerprint. Step-750 checkpoint at macro_f1 0.37 preserved.
Run 3 β full recipe, lr=1.5e-4 (v0.3.0-stable LR)β
configs/v0_4_0-lr1.5e4.yaml. By this point the bisect pattern was visible: 5e-4 β step 750, 3e-4 β step 1000. A pure-LR explanation predicted lr=1.5e-4 β step 1875 (linear) or ~step 2200 (proportional). It diverged at step 2000.
A factor-3.3Γ LR drop bought a factor-2.7Γ step delay. Sub-linear β confirming that LR controls when the divergence appears but isn't the root cause. The destabilizer is in the recipe, not the LR knob. Step-1250 checkpoint at macro_f1 0.39 preserved.
Runs 4 & 5 β ablations at lr=5e-4β
configs/v0_4_0-ablate-crf.yaml (drop Β§1, keep Β§3+Β§4) and configs/v0_4_0-ablate-cw.yaml (drop Β§3, keep Β§1+Β§4). Per the issue's prescribed ablation matrix.
Both diverged at step 1000, identically. At lr=5e-4 every single-knob revert behaved like the full recipe.
Conclusion: lr=5e-4 is structurally unreachable for this codebase's dual-loss landscape regardless of which Β§1/Β§3 knob is active. Pivot to the stable LR.
Runs 6-8 β stable-LR verdict smokes at lr=1.5e-4β
Three orthogonal cells, max_steps=3000 cosine schedule:
| Run | Recipe | Verdict | Peak macro_f1 | Drift (last 5 evals) |
|---|---|---|---|---|
| 6 | source-only (Β§4 only) | PASS | 0.4190 | 0.005 |
| 7 | cw-only (Β§3 + Β§4, v0.3.0 CRF base) | PASS | 0.4279 | 0.0036 |
| 8 | crf-only (Β§1 + Β§4, no class_weights) | FAIL | n/a (train_loss=1.24 at step 3000) | n/a |
cw-only looked like the clear winner β higher peak, lower drift, both under the 0.10 collapse threshold.
Run 9 β promote cw-only to full 50Kβ
configs/v0_4_0-final.yaml, identical to cw-only but with max_steps=50000 and a fresh output_dir.
Diverged at step 2250. Same fingerprint as the full recipe at lr=1.5e-4. The smoke had been a false-positive.
Run 10 β fall back to source-only at full 50Kβ
configs/v0_4_0-stableLR-source-only.yaml with max_steps=50000.
Survived. Peak macro_f1 0.42 at step ~2200; gradual cosine-decay decline through the rest. Step-2200 checkpoint became the shipped artifact: v0_4_0-stableLR-source-only/step-002200.
The meta-bugβ
cw-only's smoke (3000 steps, cosine schedule) had its LR back near zero by step 2750. The "pass" criterion ("macrof1 stable across the last three evals past step 2000") was measuring stability _under a decayed-to-near-zero LR. The full 50K run kept the LR near its 1.5e-4 peak for thousands of steps. That sustained-peak exposure was where the destabilization happened in every other run too.
The smoke wasn't testing the same loss landscape as the full run. The cosine schedule's tail had been doing the heavy lifting of "stability" all along.
Process fix for future smokes:
- Constant LR for the verdict window, OR
- max_steps large enough that the cosine tail doesn't dominate. With warmup=1000 + cosine=N, the LR is > 60% of peak roughly through step
warmup + N/3. Picking max_steps so the verdict window sits in that range (e.g. 10000 keeps LR > 60% peak through step ~4300).
Math sanity-checkβ
Before deferring Β§1 and Β§3 we audited model.py and crf.py for implementation bugs that could explain the destabilization:
crf.py:155-164β theper_tokenreduction isnll.sum() / total_tokens.clamp(min=1). Mathematically what the docstring claims. No NaN-prone path; the.clamp(min=1)guards the empty-batch edge case.model.py:270-300β class_weights enter viann.functional.cross_entropy(..., weight=class_weights). The PyTorch-standard path. No silent broadcast mismatch (class_weights.shape == (num_labels,)is asserted in the model constructor).- Dual-loss fusion:
loss = ce_loss + self.crf_loss_weight * crf_loss. Equal-weight summing atcrf_loss_weight=1.0withper_tokenreduction was the Β§1 hypothesis. The hypothesis was that per-token normalization brings CRF NLL to CE magnitude. Empirically, when Β§1 is active, training destabilizes β so either per-token CRF NLL is NOT actually comparable to per-token CE on this corpus, or the magnitude comparison is right but some other dynamic dominates.
No implementation bug found. The destabilization is a real recipe interaction.
Post-hoc regression diagnosticβ
After shipping, we ran a categorized per-tag FP/FN analysis on the shipped checkpoint against golden v0.1.2 (4535 entries). Tool: corpus-python/scripts/diagnose_regression.py (in tree from this iteration).
Postcode FN (1217 total)β
| Category | Count | Share | Pattern |
|---|---|---|---|
| empty_pred | 789 | 65% | Paris 75008 β no postcode in output |
| non_latin | 213 | 18% | γγΌ, 47110 γ΅γ³γ β byte-fallback noise |
| num_confused | 136 | 11% | 47110 SLL, 22 Rue Jasmin β predicts 22 |
| bio_slip | 73 | 6% | LE TRΓPORT, 76470 β ", 7647" |
| other | 6 | 0.5% | β |
Empty-pred dominates. Β§4's NAD downweight (2.0 β 1.0) removed the dominant source of "postcode comes first" patterns (NAD's 57M structured 911-grade rows + many FR mid-position patterns from auxiliary sources). The model now treats mid-position numeric tokens as house_number by default.
Country FN (194 total)β
178 of 194 (92%) are adversarial transliteration entries β gold has English country names but raw input contains mixed CJK/Cyrillic/Armenian script. Examples:
Ψ¨Ψ§Ψ± ΩΩΩ ΩΨ§ΫΩΩ
ΫΩΪ―, Wyoming, United States of America β pred: "yoming, United Sta"
γ΅γΌγ’γγͺγΉ, WY, United States of America β pred: ", WY, United State"
France, LozΓ¨re, ΥΤ΅ΥΥ-Τ±ΤΌΤ²Τ±Υ-ΥΥ
ΥΥΥ-ΤΌΤ»ΥΤ±ΥΥ
ΥΤΌ β pred: "" (empty)
This is v0.3.0's documented non-Latin-byte-fallback failure mode. The v0.4.0 weights did not regress this slice; the eval suite is counting v0.3.0's known failure modes against v0.4.0. After excluding adversarials, country FN drops 194 β ~16.
The country β0.07 F1 regression is mostly a golden-set adversarial-weighting artifact, not a real recipe regression.
The decoder sidecarβ
The bio_slip slice (6% of postcode FN) is a decoder bug, not a model bug. The model's BIO tag attribution is correct; the emitted span includes leading/trailing punctuation tokens. Fixed in core/decoder/build-tree.ts (commit c72ab4c on main):
function trimBoundary(raw: string, start: number, end: number): { start: number; end: number } {
let s = start,
e = end
const isWordChar = (i: number): boolean => /[\p{L}\p{N}]/u.test(raw[i] ?? "")
while (s < e && !isWordChar(s)) s++
while (e > s && !isWordChar(e - 1)) e--
return { start: s, end: e }
}
Both node.value and node.start/node.end trim in sync so consumers slicing raw[start:end] get the same string as node.value. Spans that trim to empty (all-punctuation, pathological model output) are dropped. Word-internal punctuation (hyphens in Sainte-Livrade-sur-Lot, accents in MontrΓ©al) is preserved. 6 new boundary-trim tests pass; full repo suite (1592 tests under singleFork + 30s timeout) green.
What deferred to v0.4.1β
Β§1 (per-token CRF norm) and Β§3 (class weights) both deferred. Both individually broke training in at least one tested configuration. The leading hypothesis at this iteration's boundary is that a high-variance adapter slice in corpus-v0.3.0 produces gradient spikes the per-token-normalized CRF can't dampen β a corpus-audit + gradient-norm probe pass per source-id would surface a candidate.
v0.4.1 thread proposals (drafted in PR i116 body):
- Thread A β source-weight tweak (NAD 1.0 β 1.5 partial restore) + synthesis pass over component-order permutations + case-norm eval. Targets the 65% empty_pred slice. 1-2 days.
- Thread B β corpus-side gradient-norm probe per source-id. Find the high-variance adapter. 3-5 days.
- Thread C β verdict-smoke framework redesign (constant-LR) + milder class_weights ratio. 1-2 days.
Process improvements landed during the iterationβ
Six items from the TODO.md parallel-work list shipped to main during the GPU-bound training windows (host-claude worked them in parallel):
| Commit | What |
|---|---|
ceb2c1f | @mailwoman/locale-gate workspace β Stage 2 of the runtime pipeline |
58eee3b | mailwoman parse --candidates <N> β Springfield-class disambiguation surface |
5566cd2 | corpus-audit tool β shard distribution Γ source_weights diagnostic |
150c7db | runtime-pipeline test hardening (AbortSignal, timing-budget, non-graceful failure) |
d94261b | mailwoman parse --benchmark <N> β per-stage p50/p95/p99 |
c1f82ac | docs/articles/concepts/staged-pipeline-contract.md β "how to plug in a custom stage" |
Plus the decoder sidecar (c72ab4c) and the PHASE_2 iteration-log updates (6ddc170 / d499288 / 09d0d9f / 404cceb).
Numbers referenceβ
Final eval, shipped checkpoint vs v0.3.0, golden v0.1.2:
| Tag | v0.4.0 | v0.3.0 | Ξ | With adversarials excluded |
|---|---|---|---|---|
| country | 0.21 | 0.28 | β0.07 | ~β0.01 (nearly flat) |
| region | 0.19 | 0.18 | +0.01 | similar |
| locality | 0.27 | 0.27 | flat | similar |
| postcode | 0.69 | 0.76 | β0.07 | similar (the regression is real, NAD downweight) |
| venue | 0.39 | 0.39 | flat | similar |
| street | 0.30 | 0.27 | +0.03 | similar |
| house_number | 0.79 | 0.78 | +0.01 | similar (issue #57 floor held) |
Macro F1 raw: 0.357 vs 0.293. Token confidence: 0.806 vs 0.857. Full-parse exact match: 0.082 vs 0.107.
Issue #116's "β₯2 of 4 axes improved" metric: only fine F1 cleanly improved (street +0.03, house_number +0.01). Coarse F1 is mixed once adversarial denominators are excluded. Calibration flat. Training stability negative (the central finding of the campaign).
See alsoβ
- Blog post β public-facing writeup
PHASE_2_training.mdv0.4.0 entry β canonical iteration log- Issue #116 β original work plan
corpus-python/scripts/diagnose_regression.pyβ categorized FP/FN bucketer (in v0.4.0 branch)PR i116β 10 commits with the campaign artifacts