Skip to main content

Parity Scorecard — 2026-06-09 (baseline: v4.1.0)

Question this answers: "How close is the neural parser to v0 (Pelias/addressit rules) capability parity, per component, and where do we still bleed?" One authoritative table per model version, so "are we at ~90%?" stops being a scatter of one-off evals.

Read it with two lenses — they disagree on purpose:

  1. Arena head-to-head (whole-address-strict): the 3 unbiased capability arenas (scripts/eval/external-arenas.sh, --symmetric-match --postcode-repair). A row counts only if the WHOLE parse matches. This is the honest "does the system produce a usable parse" lens — but it understates per-tag wins (a unit-perfect parse scores 0 if any other tag slips). Example below: postal-arena secondary-unit reads 0% here while per-tag unit is 92%.
  2. Per-tag F1: per-locale-f1.ts on golden dev (US/FR) + the curated real-OOD evals. The granular tag-health lens — this is what the parity campaign moves.

Self-emitted (scripts/eval/external-arenas.sh + per-locale-f1.ts); do not hand-edit numbers.


Lens 1 — capability arenas (v4.1.0 int8 vs v0)

arenanv0neuralbothneural-onlyv0-onlyboth-fail
libpostal (clean/canonical)6929%22%14%7%14%64%
perturb (noisy/degraded)39839%60%32%28%8%33%
postal (edge formats)3826%11%5%5%21%68%

Routing truth (unchanged since #15): rules win on clean/canonical, neural wins decisively on noisy/degraded (+21pp), both are weak on edge formats (PO-box/military/rural-route). The resolver should route by input shape.

Postal-arena edge classes where BOTH are 0% (the parity frontier): po-box (4), military-apofpo (3), rural-route (1), directional (2). secondary-unit reads 0% whole-match here despite 92% per-tag (lens caveat).


Lens 2 — per-tag F1 (golden dev, v4.1.0 int8, anchor-on)

TagUSFRstatus
postcode98.399.4✅ healthy
house_number96.291.2✅ healthy
venue90.20.0✅ US / ❌ FR (#330)
street78.560.1◐ (street-eats-affix boundary)
region78.427.8◐ US / ❌ FR (#330)
locality60.169.7
country35.246.5❌ starved → deterministic tagger, not retrain (#464) — see below
street_prefix0.00.0❌ starved → affix retrain in flight (v0.9.8)
street_suffix0.0❌ starved → affix retrain in flight
street_prefix_particle0.0❌ starved (FR)
unit6.3¹0.0✅ FIXED (92.3 real-OOD; golden has ~no unit rows)
po_box~18²✅ deterministic tagger = 100% real-OOD (#464)
intersection_a/b0.0❌ starved (experimental, regressed before)
dependent_locality0.00.0(intentionally down-weighted — WOF-schema artifact)

¹ golden dev carries ~no unit rows; the real-OOD eval is the truth (92.3%). ² from the #15 assessment.

Real-OOD evals (the trustworthy lens for the campaign tags)

evaltagv4.1.0v0.9.8-affixnote
unit-real-designators (34)unit92.393.8retained ✓
street-affix-real (32)street_prefix0.078.0P100/R64
street-affix-realstreet_suffix0.066.7P100/R50
street-affix-realstreet (name)0.0³50.0unfolded
country-real (34)country (model v1)49.0over-fires: golden precision 23%
country-real (34)country (deterministic)100.0matchCountry on trailing segment, P=R=100
po-box-real (25)po_box (model, #15)~18starved
po-box-real (25)po_box (deterministic)100.0matchPOBox per segment; 7 negatives, 0 FP
cedex-real (15)cedex (deterministic, FR)100.0CEDEX regex in-segment; 5 negatives, 0 FP

³ v4.1.0 lumps "Wacker Dr" into one street span; the affix retrain teaches the split (7/10 → split, perfect precision).

METHODOLOGY GOTCHA (load-bearing for the campaign): per-locale-f1.ts's foldToComponents JOINS street_prefix+street+street_suffix into one street string — so it cannot measure the affix split and reports 0% even when the model splits perfectly. Use scripts/eval/score-affix.ts (unfolded decodeAsJson) for street_prefix/street_suffix. The folded street is still the right no-regression metric (the fold recomposes → golden street holds/rises).


Parity verdict (v4.1.0)

Common tags (postcode/house_number/street/locality/region/venue-US) are at usable parity. The gap is a small set of starved long-tail tagsunit is now FIXED (the first campaign win), street_prefix/street_suffix are in flight (v0.9.8), country and po_box have a measured deterministic path (P=R=F1=100, #464 — not a retrain), and intersection/FR-venue/cedex remain. Not yet at 90% macro parity; the campaign is the path. Each lever is compounding (covering a tag sharpens its neighbors — unit lifted US street +3pp), and the lever-shape taxonomy below now routes each remaining tag to the right tool (retrain vs deterministic matcher).

Campaign status

Levertag(s)status
unitunit✅ shipped v4.1.0 (0→92%)
affixstreet_prefix/suffixGATED v0.9.8: 0→78/67 (P=100), US net-positive, negative-space street +2.1. FR postcode −3.9 (US-shard dilution) trips the >2pp gate → promote DEFERRED to operator (#462; DeepSeek recommends promote-with-annotation). int8 ship-ready.
countrycountryRESOLVED — deterministic, not retrain. The synth shard makes the model over-fire (v1 country-real 49 F1, golden precision 23% — it learns "trailing token = country"). A deterministic matchCountry tagger on the trailing comma-segment scores P=R=F1=100 on country-real. Lever moves to a post-parse ProposalClassifier (#464); model path kept as exploration record (#463).
po_boxpo_boxRESOLVED — deterministic (probe measured this shift): matchPOBox per comma-segment = P=R=F1=100 on po-box-real (n=25, 7 negatives, 0 FP). Same closed-vocab shape as country → joins #464 as a ProposalClassifier, not a retrain.
cedex (FR)cedexRESOLVED — deterministic (probe this shift): CEDEX regex in-segment = P=R=F1=100 on cedex-real (n=15, 5 negatives, 0 FP). Different locale + match shape than country/po_box, same taxonomy → joins #464.
intersectionintersection_a/b⏳ (gated — regressed before)
FRvenue/region⏳ (#330; cedex now resolved deterministically above)
consolidation⏳ v1.0 (bake winners + full regression gate) — needs > 20k steps: see dilution note

CUMULATIVE-DILUTION LESSON (load-bearing for consolidation): stacking synth shards into one 20k-step run dilutes each tag vs its solo run — the country run (base+unit+affix+country) dropped affix street_suffix 67→59 and street_prefix 78→75 while still under-serving country. Each added shard spreads a fixed step budget thinner. Prove each lever solo, then consolidate with a larger step budget (the v2-vs-v3 unit lesson, re-confirmed). Don't read a cumulative run's per-tag number as that lever's ceiling.

Lever-shape taxonomy (why not everything is a retrain)

Correction (reviewed later same day): the "closed-vocab → deterministic post-parse override" conclusion below was reversed. The over-firing was a training-distribution artifact (a trailing-country-only shard), and the "deterministic = 100%" was measured on a no-homograph eval — an unfair comparison. The corrected decision is model-first with a soft gazetteer anchor (the lexicon informs, never overrides), and the matchers become feature-sources, not a ProposalClassifier override. The deterministic-match P=R=F1=100 numbers below stand as matcher-accuracy evidence (good feature-sources), not as a justification to override the model. See Closed-vocab fields: model-first.

The campaign has surfaced two tag shapes that want different tools:

  • Distributional / open-vocab tags (street_prefix, street_suffix, unit, locality) — the model has to learn the boundary from context. Synth shard + retrain is the right tool; these are where negative-space training compounds.
  • Closed-vocab / fixed-position tags (country ✓, po_box ✓, cedex ✓) — a finite surface-form set in a predictable slot. A deterministic matcher gives perfect precision with zero training; a retrained model only dilutes it (over-fires). Use a ProposalClassifier, keep it default-off + byte-stable. All three now measured at P=R=F1=100 via this route, across 2 locales (US/FR) and 2 match shapes (whole-segment + in-segment regex) — build one ClosedVocabTagger parameterized by (matcher, tag, slot) rather than per-tag classifiers (#464).

Baseline captured 2026-06-09 ~04:50 UTC during the night shift; finalized ~12:25 UTC with the country + po_box + cedex deterministic results. Affix "after" = the v0.9.8 gate run; country = the v0.9.9 v1 cumulative run + the deterministic probe; po_box + cedex = deterministic probes only (no retrain — taxonomy applied up front).