v0.7.2 eval — intersection-bare retrain (2026-05-30)

Verdict: EXPERIMENTAL. Best neural model to date and strictly dominates v0.6.0, but does not clear the ≥25% harness PROMOTE bar. Do not promote to default on the strict gate; ship as experimental. The path to ≥25% is coverage and the resolver, not another recipe tweak.

What v0.7.2 was

Single variable vs v0.7.1: the synth-intersection shard regenerated ~60% bare (no , City, ST tail) + an @ connector, weight held at 0.2. v0.7.1 had proven the model learns intersections (assigned-prob 0.0001 → 0.99 after the shard) but only in the format it trained on, so it fumbled the harness's bare X & Y cases (4/65). v0.7.2 tests whether format diversity in the shard fixes that. Base recipe, tokenizer (v0.6.0-a0), and everything else unchanged → F1 valid.

Primary metric — harness pass rate

Model	Harness (mailwoman/test, +postcode-repair)
v0.6.0	14.4%
v0.7.0 calibration (ls=0.1)	13.8%
v0.7.2	19.5% (81/415)
— gate bands	≤14% sidegrade · 15–24% experimental · ≥25% promote

+5.1pp over v0.6.0 — the largest single-release harness gain of the v0.6/0.7 cycle. v0 (rules) is at 93.7% on this suite, as expected (it's a Pelias/addressit port; see the capability map — neural is the robustness layer, not a better v0).

Intersection cluster — the retrain's target

intersection.test.ts: 4/65 → 11/65 (17%). The bare-format + @ fix worked directionally (+7 assertions) but did not reach "most-of-65." Format diversity helped; it wasn't sufficient. Bare-intersection coverage is still thin relative to the 65 hand-curated adversarial cases.

Per-tag gate vs v0.6.0 — PASS

Golden exact-match 21.1% → 22.0%. No tag regressed >2pp. Highlights:

Tag	Recall Δ	Hallucination Δ
street	+2.9pp (27.7→30.5)	+29
postcode	+0.9pp	−25
locality	+0.5pp	−121 (much cleaner)
venue	+0.9pp	+13
country	−0.8pp (within tolerance)	−29

The street gain + locality-hallucination drop are the bare-format shard generalizing beyond intersections (street segmentation improved a little).

Capability map on v0.7.2 (all three arenas)

Arena	n	v0	neural	crossover
libpostal (clean/canonical)	69	29%	16%	rules win
corpus-perturbation (noisy/degraded)	398	39%	61%	neural wins
postal-standards (edge formats)	38	26%	8%	rules ahead; both weak

The crossover holds on v0.7.2: clean → rules, noisy → neural. The third arena (new this cycle) adds the edge dimension — by edge_class: canonical 36/7, intl-format 43/29, secondary-unit 29/0, and military APO/FPO, PO-box, rural-route, directional all 0/0 (coverage gaps shared by both parsers). A neural-only win surfaced: Dun Laoghaire, Co. Dublin (en-IE).

unit-repair (parser backlog B1) measured here

Opt-in --unit-repair post-decode pass. v0.7.2 showed unit-drop has two modes:

Mislabel-as-locality (Flat 2 14 Smith St → locality) — fixed by allowing ADD over locality. Postal unit-tag recall 0/8 → 2/8, 0 locality regressions.
Absorbed-into-street (STE 12 inside MAIN ST NW STE 12) — not regex-repairable; a coverage gap (units underrepresented), same as intersections were.

Net: the repair patches the locality half safely; the street-absorption half needs a coverage shard (backlog B3'). Stays opt-in.

Recommendation

Don't promote v0.7.2 to default on the strict gate (19.5% < 25%). But it strictly dominates the current default (v0.6.0): +5.1pp harness, +7 intersections, per-tag gate clean. Shipping it as experimental (model card, not defaultVersion) is the right call — it's the best artifact we have and is A/B-able. The promote-to-default decision is the operator's.
The 25% bar is not a recipe-tuning problem. Two cycles (v0.6.x, v0.7.x) of corpus tweaks have moved harness 14→19.5%. The remaining gap is (a) coverage of the long tail — bare intersections, street-less last-lines, mid-street units, non-US locales (the third arena's 0% edge classes) — and (b) the resolver (Direction B), which routes per-input on quality (rules for clean, neural for noisy) instead of grading neural on rules' home turf. Both are bigger than a shard weight.
Next parser-backlog item: B2 country-repair (v0.7.2 mislabels trailing AUSTRALIA → locality; country −0.8pp) — a closed-lexicon post-decode pass, same low-risk mould as postcode/unit repair.

Artifacts (volatile, /tmp/v072-eval/ + /tmp/external-arenas-v072/): harness.md, gate-report.md, eval-morphology.json, the three arena results, postal-unitrepair-v2.json. Model checkpoint is durable on the Modal volume: output-v072-intersection/checkpoints/step-100000.

What v0.7.2 was​

Primary metric — harness pass rate​

Intersection cluster — the retrain's target​

Per-tag gate vs v0.6.0 — PASS​

Capability map on v0.7.2 (all three arenas)​

unit-repair (parser backlog B1) measured here​

Recommendation​