v0.8.1 — The MLM decisive round
The question
Does masked-language-model pre-training of the encoder improve the supervised parser at the task ceiling? The 40k round (lr 5e-5, cosine schedule) showed secondary gains from pre-training (calibration +2.0pp, harness +4.8pp), but both arms were under-trained on the product metric (resolver 48–50% vs the shipped 96.1%). That makes the 40k round a broken recipe, not a verdict on initialization. This round re-runs the A/B at the v0.7.2-proven recipe so the comparison happens where the model can actually reach ~96% resolver.
What changed (single variable)
Both arms are identical except the encoder initialization:
| knob | both arms |
|---|---|
| learning rate | 1.5e-4 constant (v0.7.2-proven) |
| steps / warmup | 100,000 / 1,000 |
| label smoothing | 0.1 (both arms — keeps calibration fair) |
| CRF | off (crf_loss_weight=0.0) |
| precision / seed | bf16 / 42 |
| corpus / tokenizer | v0.4.0 / v0.6.0-a0 → A/B F1 comparison is valid |

| arm | encoder init |
|---|---|
| A pretrained | init_from output-v080-mlm-pretrain/checkpoints/step-020000 (MLM) |
| B scratch | fresh random init |
Tokenizer + corpus are held constant, so the "never compare F1 across tokenizer versions" rule is not in play — this is a clean single-variable comparison.
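The single-variable claim can be checked mechanically. The sketch below is illustrative only: the values come from the tables above, but the key names (other than `crf_loss_weight` and `init_from`) and the `differing_keys` helper are hypothetical, not the project's actual config schema.

```python
# Shared knobs copied from the "both arms" table.
# Key names are hypothetical except crf_loss_weight and init_from.
SHARED = {
    "learning_rate": 1.5e-4,
    "lr_schedule": "constant",
    "steps": 100_000,
    "warmup_steps": 1_000,
    "label_smoothing": 0.1,
    "crf_loss_weight": 0.0,
    "precision": "bf16",
    "seed": 42,
    "corpus": "v0.4.0",
    "tokenizer": "v0.6.0-a0",
}

# Arm A starts from the MLM checkpoint; Arm B uses None to mean
# "fresh random init" (an assumed convention for this sketch).
ARM_A = {**SHARED, "init_from": "output-v080-mlm-pretrain/checkpoints/step-020000"}
ARM_B = {**SHARED, "init_from": None}


def differing_keys(a, b):
    """Return the sorted keys whose values differ between two configs."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


print(differing_keys(ARM_A, ARM_B))  # ['init_from']
```

A check like this, run before launching both arms, would catch an accidental second difference (for example a stray seed or tokenizer change) that would invalidate the A/B.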
Held-out metrics
Three metrics, auto-parsed from each runner's own JSON/MD sidecar (eval figures are never hand-typed — see #211/#212):
- resolver locality Acc@1 — the PRODUCT metric, on real OpenAddresses points (non-circular). Primary decision metric.
- harness v0-neural pass-rate — the Pelias-lineage regression gate (secondary).
- calibration — fraction of WRONG predictions emitted at confidence ≥ 0.9 (lower is better; the original pathology was ~81% at ≥0.9).
Kill-point rule (DeepSeek turn-4 consult, 2026-05-31)
- pretrained is no better than scratch on resolver (tied within ±0.5pp, or behind) and the harness + calibration gains disappear or reverse → DROP pretraining (v0.7.2 stays default).
- pretrained matches/slightly beats scratch resolver (even +0.3pp) and retains the calibration/harness gains → SHIP with pretraining.
- pretrained beats scratch resolver by ≥ 1pp → SCALE pretrain 20k → 100k and re-run.
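The three branches above can be written as a small decision function. This is a sketch under stated assumptions, not the project's tooling: the function name and arguments are hypothetical, the ±0.5pp band is read as "a deficit of up to 0.5pp still counts as a tie", and whether the secondary (harness + calibration) gains are retained is supplied by hand as a boolean.

```python
def kill_point(resolver_delta_pp: float, secondary_gains_retained: bool) -> str:
    """Map an A/B outcome to DROP / SHIP / SCALE.

    resolver_delta_pp: pretrained minus scratch resolver Acc@1, in
        percentage points.
    secondary_gains_retained: True if the harness and calibration gains
        from the earlier round are still present.
    """
    # Clear resolver win: scale the pre-train and re-run.
    if resolver_delta_pp >= 1.0:
        return "SCALE"
    # Tie (assumed: no worse than -0.5pp) or small win, with the
    # secondary gains still present: ship with pre-training.
    if resolver_delta_pp >= -0.5 and secondary_gains_retained:
        return "SHIP"
    # Everything else: pre-training did not earn its keep.
    return "DROP"


print(kill_point(-0.20, False))  # DROP
print(kill_point(0.3, True))     # SHIP
print(kill_point(1.2, True))     # SCALE
```

Writing the rule down before looking at results is what makes it a pre-registered kill-point: the thresholds cannot drift to fit the outcome.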
Results
Generated by scripts/eval-v081-decisive.sh (resolver re-run at 10k OA); figures are
auto-parsed from each runner's sidecar, never hand-typed.
| metric | A pretrained | B scratch | Δ (A−B) | v0.7.2 ref |
|---|---|---|---|---|
| resolver locality Acc@1 (10k OA) | 97.3% (9730/10000) | 97.5% (9750/10000) | −0.20 (0.9σ — tie) | 96.1% |
| harness pass-rate | 19.04% (79/415) | 20.72% (86/415) | −1.69 | 19.5% |
| wrong-pred confidence p90 (cal.) | 0.949 | 0.949 | 0.00 | — |
Reference column = shipped v0.7.2, cited from 2026-05-30-v0.7.2-eval.md (resolver locality Acc@1 96.1%, harness 19.5%).
Resolver was re-run at the full 10,000-row OA sample (the auto-runner's 3k first pass showed −0.60pp, within noise; 10k tightens it to −0.20pp / 0.9σ — a statistical tie). Both arms use the post-#222 gazetteer-alias matching, and both clear shipped v0.7.2 (96.1%) and v0/Pelias (95.8%) on real points.
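The 0.9σ figure can be reproduced with a two-proportion z-score on the counts in the results table. The report does not say how σ was computed, so this sketch assumes an unpooled standard error that treats the two arms as independent samples; since both arms are scored on the same 10k rows, a paired test would be the stricter choice.

```python
import math


def two_proportion_z(hits_a: int, hits_b: int, n_a: int, n_b: int) -> float:
    """z-score of (p_a - p_b) using the unpooled standard error."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Variance of a difference of independent proportions is the sum
    # of the per-arm binomial variances.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return (p_a - p_b) / se


# Counts from the results table: A pretrained 9730/10000, B scratch 9750/10000.
z = two_proportion_z(9730, 9750, 10_000, 10_000)
print(f"{z:+.2f}")  # -0.89
```

A magnitude under 1σ is well inside sampling noise, which is why the −0.20pp gap is read as a tie rather than a loss for pre-training.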
Calibration note: the metric here is the 90th-percentile confidence of WRONG predictions (lower = better calibrated), identical for both arms (0.949). It is NOT the "fraction of wrong preds at ≥0.9" that the auto-runner's column label implied; that label conflated a percentile with a fraction and has been corrected. Either way the init-attributable calibration delta is zero.
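To make the distinction concrete, the sketch below computes both quantities on the same list of wrong-prediction confidences. The confidence values are made-up toy data, the function names are hypothetical, and the nearest-rank percentile is an assumption; the runner's actual percentile method is not stated here.

```python
import math


def frac_wrong_at_or_above(wrong_confidences, threshold=0.9):
    """Fraction of wrong predictions emitted at confidence >= threshold."""
    if not wrong_confidences:
        return 0.0
    hits = sum(1 for c in wrong_confidences if c >= threshold)
    return hits / len(wrong_confidences)


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    # Nearest-rank: smallest value with at least pct% of data at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Toy confidences of WRONG predictions only (illustrative, not eval data).
wrong_conf = [0.35, 0.52, 0.61, 0.70, 0.78, 0.85, 0.91, 0.93, 0.95, 0.99]

print(frac_wrong_at_or_above(wrong_conf))       # 0.4
print(percentile_nearest_rank(wrong_conf, 90))  # 0.95
```

The first number is a share of predictions and the second is a confidence value, so they are not interchangeable and cannot be compared against each other or against a figure reported in the other form.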
Reading
At the task ceiling, MLM pre-training of the encoder buys nothing:
- Product metric (resolver): a tie (−0.20pp, 0.9σ). Pre-training is neutral, not harmful.
- Harness: scratch is ahead by 7 assertions (−1.69pp) — small, but the wrong direction for pre-training.
- Calibration: flat (identical wrong-pred confidence distribution).
The decisive point is the disappearance of the 40k round's gains. That under-trained round (lr 5e-5, 40k) had shown pre-training ahead by +4.8pp on harness, with a calibration edge as well, which is what kept the track alive. Re-run at the v0.7.2-proven recipe, where the model actually reaches the ceiling (Arm A's step-28k eval was already v0.7.2-class: locality 0.84 / region 0.75 / street 0.94 / house_number 0.99 / postcode 0.999), those gains vanished. They were an under-training artifact: the from-scratch encoder simply had not caught up by 40k steps, so the pre-trained init looked better. Given enough steps, scratch catches up and the gap closes.
Decision
DROP pre-training. v0.7.2 stays the default; the MLM track is closed as a clean negative result. This is exactly DeepSeek's pre-registered kill-point (turn-4): the SHIP branch required pre-training to retain the calibration/harness gains, and at ceiling it does not. Because the product metric is a tie (not a loss), the honest conclusion is "no benefit," not "actively worse." Even so, an extra training stage (20k MLM pre-train) and the attendant pipeline complexity are not worth a statistical tie plus a slightly worse harness.
What we keep from the track: the MLM pre-training code (pretrain.py, masking.py,
forward_mlm) lands cleanly and is reusable if a future, larger model or a different objective
(e.g. ELECTRA-style, or pre-training on a much larger unlabeled corpus) revisits it — the
negative result is specific to this 29M-param encoder, 20k-step pre-train, en-us corpus.
The forward prize remains resolver depth (see the resolver failure analysis + backlog),
where the product-metric gains have been real.