v0.7.2 eval — intersection-bare retrain (2026-05-30)
Verdict: EXPERIMENTAL. Best neural model to date and strictly dominates v0.6.0, but does not clear the ≥25% harness PROMOTE bar. Do not promote to default on the strict gate; ship as experimental. The path to ≥25% is coverage and the resolver, not another recipe tweak.
What v0.7.2 was
Single variable vs v0.7.1: the synth-intersection shard regenerated ~60% bare
(no , City, ST tail) + an @ connector, weight held at 0.2. v0.7.1 had proven
the model learns intersections (assigned-prob 0.0001 → 0.99 after the shard)
but only in the format it trained on, so it fumbled the harness's bare X & Y
cases (4/65). v0.7.2 tests whether format diversity in the shard fixes that.
Base recipe, tokenizer (v0.6.0-a0), and everything else unchanged → F1 valid.
Primary metric — harness pass rate
| Model | Harness (mailwoman/test, +postcode-repair) |
|---|---|
| v0.6.0 | 14.4% |
| v0.7.0 calibration (ls=0.1) | 13.8% |
| v0.7.2 | 19.5% (81/415) |
| — gate bands | ≤14% sidegrade · 15–24% experimental · ≥25% promote |
+5.1pp over v0.6.0 — the largest single-release harness gain of the v0.6/0.7 cycle. v0 (rules) is at 93.7% on this suite, as expected (it's a Pelias/addressit port; see the capability map — neural is the robustness layer, not a better v0).
Intersection cluster — the retrain's target
intersection.test.ts: 4/65 → 11/65 (17%). The bare-format + @ fix worked
directionally (+7 assertions) but did not reach "most-of-65." Format diversity
helped; it wasn't sufficient. Bare-intersection coverage is still thin relative
to the 65 hand-curated adversarial cases.
Per-tag gate vs v0.6.0 — PASS
Golden exact-match 21.1% → 22.0%. No tag regressed >2pp. Highlights:
| Tag | Recall Δ | Hallucination Δ |
|---|---|---|
| street | +2.9pp (27.7→30.5) | +29 |
| postcode | +0.9pp | −25 |
| locality | +0.5pp | −121 (much cleaner) |
| venue | +0.9pp | +13 |
| country | −0.8pp (within tolerance) | −29 |
The street gain + locality-hallucination drop are the bare-format shard generalizing beyond intersections (street segmentation improved a little).
Capability map on v0.7.2 (all three arenas)
| Arena | n | v0 | neural | crossover |
|---|---|---|---|---|
| libpostal (clean/canonical) | 69 | 29% | 16% | rules win |
| corpus-perturbation (noisy/degraded) | 398 | 39% | 61% | neural wins |
| postal-standards (edge formats) | 38 | 26% | 8% | rules ahead; both weak |
The crossover holds on v0.7.2: clean → rules, noisy → neural. The third arena
(new this cycle) adds the edge dimension — by edge_class: canonical 36/7,
intl-format 43/29, secondary-unit 29/0, and military APO/FPO, PO-box,
rural-route, directional all 0/0 (coverage gaps shared by both parsers). A
neural-only win surfaced: Dun Laoghaire, Co. Dublin (en-IE).
unit-repair (parser backlog B1) measured here
Opt-in --unit-repair post-decode pass. v0.7.2 showed unit-drop has two modes:
- Mislabel-as-locality (
Flat 2 14 Smith St→locality) — fixed by allowing ADD over locality. Postal unit-tag recall 0/8 → 2/8, 0 locality regressions. - Absorbed-into-street (
STE 12insideMAIN ST NW STE 12) — not regex-repairable; a coverage gap (units underrepresented), same as intersections were.
Net: the repair patches the locality half safely; the street-absorption half needs a coverage shard (backlog B3'). Stays opt-in.
Recommendation
- Don't promote v0.7.2 to default on the strict gate (19.5% < 25%). But it
strictly dominates the current default (v0.6.0): +5.1pp harness, +7
intersections, per-tag gate clean. Shipping it as experimental (model card,
not
defaultVersion) is the right call — it's the best artifact we have and is A/B-able. The promote-to-default decision is the operator's. - The 25% bar is not a recipe-tuning problem. Two cycles (v0.6.x, v0.7.x) of corpus tweaks have moved harness 14→19.5%. The remaining gap is (a) coverage of the long tail — bare intersections, street-less last-lines, mid-street units, non-US locales (the third arena's 0% edge classes) — and (b) the resolver (Direction B), which routes per-input on quality (rules for clean, neural for noisy) instead of grading neural on rules' home turf. Both are bigger than a shard weight.
- Next parser-backlog item: B2 country-repair (v0.7.2 mislabels trailing
AUSTRALIA→ locality; country −0.8pp) — a closed-lexicon post-decode pass, same low-risk mould as postcode/unit repair.
Artifacts (volatile, /tmp/v072-eval/ + /tmp/external-arenas-v072/): harness.md,
gate-report.md, eval-morphology.json, the three arena results, postal-unitrepair-v2.json.
Model checkpoint is durable on the Modal volume: output-v072-intersection/checkpoints/step-100000.