v0.8.1 — The MLM decisive round

The question

Does masked-language-model pre-training of the encoder improve the supervised parser at the task ceiling? The 40k round (lr 5e-5 cosine) showed pretraining's secondary gains (calibration +2.0pp, harness +4.8pp) but BOTH arms under-trained the product metric (resolver 48–50% vs shipped 96.1%) — a broken recipe, not an init verdict. This round re-runs the A/B at the v0.7.2-proven recipe so the comparison happens where the model can actually reach ~96% resolver.

What changed (single variable)

Both arms are identical except the encoder initialization:

knob	both arms
learning rate	1.5e-4 constant (v0.7.2-proven)
steps / warmup	100,000 / 1,000
label smoothing	0.1 (both arms — keeps calibration fair)
CRF	off (`crf_loss_weight=0.0`)
precision / seed	bf16 / 42
corpus / tokenizer	v0.4.0 / v0.6.0-a0 → A/B F1 comparison is valid

arm	encoder init
A pretrained	`init_from` output-v080-mlm-pretrain/checkpoints/step-020000 (MLM)
B scratch	fresh random init

Tokenizer + corpus are held constant, so the "never compare F1 across tokenizer versions" rule is not in play — this is a clean single-variable comparison.
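As a minimal sketch of what "single variable" means here, the two arms can be written as one shared recipe plus a per-arm override, with a check that only the init differs. The values come from the tables above. Key names other than `init_from` and `crf_loss_weight` (for example `lr_schedule`, `max_steps`, `label_smoothing`) are hypothetical and may not match the real training config, and using `None` for a fresh random init is an assumed convention.

```python
# Shared recipe for both arms. Values are from the knob table; key names
# other than `init_from` and `crf_loss_weight` are hypothetical.
SHARED = {
    "learning_rate": 1.5e-4,
    "lr_schedule": "constant",
    "max_steps": 100_000,
    "warmup_steps": 1_000,
    "label_smoothing": 0.1,
    "crf_loss_weight": 0.0,  # CRF off
    "precision": "bf16",
    "seed": 42,
    "corpus": "v0.4.0",
    "tokenizer": "v0.6.0-a0",
}

# Arm A starts from the MLM checkpoint; arm B starts from random weights.
ARM_A = {**SHARED, "init_from": "output-v080-mlm-pretrain/checkpoints/step-020000"}
ARM_B = {**SHARED, "init_from": None}  # None = fresh random init (assumed convention)


def differing_keys(a: dict, b: dict) -> set:
    """Return the set of keys whose values differ between two configs."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}


DIFF = differing_keys(ARM_A, ARM_B)
print(DIFF)
```

A guard like this can run before launch, so that an accidental second difference (a changed seed or tokenizer, say) fails fast instead of contaminating the A/B.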

Held-out metrics

Three metrics, auto-parsed from each runner's own JSON/MD sidecar (eval figures are never hand-typed — see #211/#212):

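resolver locality Acc@1: the PRODUCT metric, on real OpenAddresses points (non-circular). Primary decision metric.
harness v0-neural pass-rate: the Pelias-lineage regression gate (secondary).
calibration: how confidently WRONG predictions are emitted (lower is better; the original pathology was ~81% of wrong predictions at confidence ≥ 0.9). The Results table reports this as the 90th-percentile confidence of wrong predictions, not as a fraction at ≥ 0.9; see the calibration note under Results.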

Kill-point rule (DeepSeek turn-4 consult, 2026-05-31)

pretrained resolver is not ≥ scratch (within ±0.5pp) and the harness + calibration gains disappear or reverse → DROP pretraining (v0.7.2 stays default).
pretrained matches/slightly beats scratch resolver (even +0.3pp) and retains the calibration/harness gains → SHIP with pretraining.
pretrained beats scratch resolver by ≥ 1pp → SCALE pretrain 20k → 100k and re-run.
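The three branches above can be sketched as a small decision function. This is one possible reading of the rule, not the project's own code: `kill_point`, `gains_retained`, and the parameter names are hypothetical. Collapsing the harness and calibration gains into a single boolean is an assumption, as is returning `"UNSPECIFIED"` for the one combination the rule does not cover (resolver worse than the tolerance while the secondary gains persist).

```python
def kill_point(delta_resolver_pp: float, gains_retained: bool,
               tol_pp: float = 0.5, scale_pp: float = 1.0) -> str:
    """Map an A-minus-B resolver delta (in pp) and a gains flag to a branch.

    delta_resolver_pp: pretrained minus scratch resolver Acc@1, in pp.
    gains_retained:    True if the harness + calibration gains persist
                       (assumed to be pre-computed by the caller).
    """
    if delta_resolver_pp >= scale_pp:
        return "SCALE"        # pretrained beats scratch by >= 1pp
    if delta_resolver_pp >= -tol_pp and gains_retained:
        return "SHIP"         # match or slight win, secondary gains kept
    if not gains_retained:
        return "DROP"         # no clear resolver win and the gains are gone
    return "UNSPECIFIED"      # resolver clearly worse but gains kept


# Illustrative inputs only.
print(kill_point(1.2, True))    # SCALE
print(kill_point(0.3, True))    # SHIP
print(kill_point(-0.2, False))  # DROP
```

Writing the rule down as code before the results arrive is one way to keep the pre-registration honest, since the branch is then determined by the measured deltas alone.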

Results

Generated by scripts/eval-v081-decisive.sh (resolver re-run at 10k OA); figures are auto-parsed from each runner's sidecar, never hand-typed.

metric	A pretrained	B scratch	Δ (A−B)	v0.7.2 ref
resolver locality Acc@1 (10k OA)	97.3% (9730/10000)	97.5% (9750/10000)	−0.20 (0.9σ — tie)	96.1%
harness pass-rate	19.04% (79/415)	20.72% (86/415)	−1.69	19.5%
wrong-pred confidence p90 (cal.)	0.949	0.949	0.00	—

Reference column = shipped v0.7.2, cited from 2026-05-30-v0.7.2-eval.md (resolver locality Acc@1 96.1%, harness 19.5%).

Resolver was re-run at the full 10,000-row OA sample (the auto-runner's 3k first pass showed −0.60pp, within noise; 10k tightens it to −0.20pp / 0.9σ — a statistical tie). Both arms use the post-#222 gazetteer-alias matching, and both clear shipped v0.7.2 (96.1%) and v0/Pelias (95.8%) on real points.
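The 0.9σ figure can be reproduced with a two-proportion z-score on the counts in the table. The sketch below uses an unpooled standard error and treats the two arms as independent samples. That is an assumption about how the figure was derived. Both arms are scored on the same OA rows, so a paired test on per-row outcomes would be the stricter choice if those are available. The function name is hypothetical.

```python
import math


def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int):
    """Return (delta in percentage points, z-score) for arm A minus arm B."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return 100 * (p_a - p_b), (p_a - p_b) / se


# Counts from the Results table: A = 9730/10000, B = 9750/10000.
delta_pp, z = two_proportion_z(9730, 10_000, 9750, 10_000)
print(f"delta = {delta_pp:+.2f}pp, z = {z:+.2f}")  # delta = -0.20pp, z = -0.89
```

Because the standard error shrinks roughly as 1/√n, moving from a 3k to a 10k sample narrows the noise band by a factor of about 1.8, which is why the larger run reads more cleanly as a tie.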

Calibration note: the metric here is the 90th-percentile confidence of WRONG predictions (lower = better calibrated), identical for both arms (0.949). It is NOT the "fraction of wrong preds at ≥0.9" that the auto-runner's column label implied; that label conflated a percentile with a fraction and has been corrected. Either way the init-attributable calibration delta is zero.
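To make the distinction concrete, the sketch below computes both quantities on the same list of wrong-prediction confidences. The data are made up for illustration and are not from either arm. The nearest-rank percentile is an assumption, since the auto-runner's percentile method is not stated here. Function names are hypothetical.

```python
import math


def frac_wrong_at_or_above(conf_wrong, threshold=0.9):
    """Share of wrong predictions emitted at confidence >= threshold."""
    return sum(c >= threshold for c in conf_wrong) / len(conf_wrong)


def percentile_nearest_rank(values, q):
    """q-th percentile (integer q in 1..100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up confidences of wrong predictions, for illustration only.
toy = [0.55, 0.62, 0.70, 0.81, 0.88, 0.91, 0.93, 0.95, 0.97, 0.99]

FRAC = frac_wrong_at_or_above(toy)       # a share of predictions
P90 = percentile_nearest_rank(toy, 90)   # a confidence value
print(FRAC, P90)
```

On this toy list the fraction is 0.5 and the 90th percentile is 0.97. The first is a share of predictions and the second is a confidence value, so the two numbers are not interchangeable even though both fall between 0 and 1.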

Reading

At the task ceiling, MLM pre-training of the encoder buys nothing:

Product metric (resolver): a tie (−0.20pp, 0.9σ). Pre-training is neutral, not harmful.
Harness: scratch is ahead by 7 assertions (−1.69pp) — small, but the wrong direction for pre-training.
Calibration: flat (identical wrong-pred confidence distribution).

The decisive point is the disappearance of the 40k round's gains. That under-trained round (lr 5e-5, 40k) had shown pre-training +4.8pp harness and a calibration edge — which is what kept the track alive. Re-run at the v0.7.2-proven recipe where the model actually reaches the ceiling (Arm A's step-28k eval was already v0.7.2-class: locality 0.84 / region 0.75 / street 0.94 / house_number 0.99 / postcode 0.999), those gains vanished. They were an under-training artifact: a from-scratch encoder simply hadn't caught up yet at 40k, so the pre-trained init looked better. Given enough steps, scratch catches up and the gap closes.

Decision

DROP pre-training. v0.7.2 stays the default; the MLM track is closed as a clean negative result. This is exactly DeepSeek's pre-registered kill-point (turn-4): the SHIP branch required pre-training to retain the calibration/harness gains, and at ceiling it does not. The product metric being a tie (not a loss) means the honest conclusion is "no benefit," not "actively worse" — but a +1 training stage (20k MLM pre-train) and the attendant pipeline complexity are not worth a statistical tie plus a slightly-worse harness.

What we keep from the track: the MLM pre-training code (pretrain.py, masking.py, forward_mlm) lands cleanly and is reusable if a future, larger model or a different objective (e.g. ELECTRA-style, or pre-training on a much larger unlabeled corpus) revisits it — the negative result is specific to this 29M-param encoder, 20k-step pre-train, en-us corpus. The forward prize remains resolver depth (see the resolver failure analysis + backlog), where the product-metric gains have been real.
