OpenAddresses real-point resolver eval — the non-circular accuracy track (2026-05-30)

Direction-C resolver-depth. This is the first non-circular end-to-end accuracy number for the resolver: real US addresses with real government coordinates, resolved against a gazetteer they do not come from. Because v0 is a TypeScript port of the Pelias parser, the same run doubles as the neural-vs-Pelias-parser head-to-head (the parser axis of "beat Pelias"), with no Docker Pelias stack needed.

Why this is the honest scoreboard

The WOF-bootstrap eval (the +8.5pp exact-match-tiering result) renders WOF places back into address strings and resolves WOF→WOF. It's a legitimate ranking test, but it's circular by construction — the ground truth is the same gazetteer the resolver consults, so it can't measure whether we resolve real-world addresses to the right place on the map.

OpenAddresses is independent: each row is a real US address with a real lat/lon harvested from authoritative government address points, and the resolver consults the WOF gazetteer — a different source. So:

Admin-match (did we resolve to the expected locality/region, by canonical gazetteer name vs OA's ground truth) measures resolver correctness independent of WOF id conventions.
Coordinate error (great-circle from the resolved admin centroid to OA's real point) is a genuine map-accuracy signal — un-gameable, since OA's point was never in the gazetteer.
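The coordinate-error figure is a great-circle distance between two lat/lon points. A minimal sketch of one common way to compute it, the haversine formula on a spherical Earth, is shown here. The names `LatLon` and `haversineKm` are illustrative, not taken from the runner, and the runner may use a different distance formula.

```typescript
// Mean Earth radius in km (spherical approximation; an assumption of this sketch).
const EARTH_RADIUS_KM = 6371.0088;

// Hypothetical point type: both fields are in decimal degrees.
interface LatLon {
  lat: number;
  lon: number;
}

const toRadians = (deg: number): number => (deg * Math.PI) / 180;

// Great-circle distance between two points using the haversine formula.
function haversineKm(a: LatLon, b: LatLon): number {
  const dLat = toRadians(b.lat - a.lat);
  const dLon = toRadians(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLon / 2) ** 2;
  // Clamp guards against floating-point drift pushing sqrt(h) slightly above 1.
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1, Math.sqrt(h)));
}
```

Calling `haversineKm(resolvedCentroid, oaPoint)` per row would give the per-address error that the p50/p90 columns summarise.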

The set: data/eval/external/openaddresses-us-sample.jsonl (10,000 rows across the 7 jurisdictions in the per-state table, six states plus DC, stratified dense-urban → rural so no single state dominates).

Two-tier metric

Per the DeepSeek resolver consult, a blanket sub-10km coordinate bar is unattainable for admin-centroid resolution: a city centroid is legitimately tens of km from its edge addresses. So the metric is split:

Admin-match Acc (the headline) — locality-match and region-match rates, granularity-independent.
Coord error p50/p90 — reported as the admin-centroid tier. The street-level tier (TIGER) will own a sub-km coordinate bar in a later phase.
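The two tiers reduce to a match rate over per-row booleans and percentiles over per-row distances. The sketch below uses the nearest-rank percentile definition. That choice is an assumption: the runner's actual percentile method is not stated here, and `matchRate` and `percentile` are hypothetical names.

```typescript
// Fraction of rows where the admin-match check passed (tier 1).
function matchRate(flags: boolean[]): number {
  if (flags.length === 0) throw new Error("matchRate of empty sample");
  return flags.filter(Boolean).length / flags.length;
}

// Nearest-rank percentile over per-row coordinate errors in km (tier 2).
// p is in (0, 100]; the result is always one of the observed values.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty sample");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...values].sort((x, y) => x - y);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}
```

Nearest-rank is a simple choice that never interpolates between observations. An interpolating definition would give slightly different p50/p90 values on small samples.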

Head-to-head: neural vs the Pelias parser (v0.7.2 model, 10,000 rows)

mailwoman's v0 rule parser is a TypeScript port of the Pelias parser, so running both parsers through the same resolver makes this a direct neural-vs-Pelias-parser comparison on real, non-circular addresses — no Docker Pelias stack required. The table below is emitted verbatim by the eval runner (--out-md); eval figures are never hand-typed (see the integrity note).

parser	locality-match	region-match	resolved	coord p50 km	coord p90 km	p99 km
neural	96.1%	100.0%	100.0%	2.4	10.6	25.0
v0 (Pelias)	94.4%	99.5%	99.8%	2.4	10.6	25.0

Neural beats the Pelias parser on real US addresses: +1.7pp locality, +0.5pp region, and a higher resolve rate (100.0% vs 99.8%). Per state, it leads on locality everywhere except SD, where the two tie at 96.8%, and it leads on region in every state (per-state below). Both share identical coordinate error because they feed the same resolver and, when both resolve to the right admin, land on the same centroid; the difference is purely which addresses each parser resolves correctly at all.

Neural per-state (locality-match)

state	n	neural loc	v0 loc	neural reg	v0 reg
CA	1429	99.9%	99.7%	100.0%	99.9%
DC	1429	99.5%	99.2%	99.9%	99.2%
IA	1429	94.3%	86.4%	99.8%	99.0%
IL	1429	98.7%	97.6%	100.0%	99.7%
MT	1428	96.7%	95.3%	100.0%	99.4%
SD	1428	96.8%	96.8%	100.0%	99.7%
VT	1428	87.1%	85.7%	100.0%	99.5%

Headline: neural locality-match 96.1%, region-match 100.0% on 10,000 real US addresses, resolved 100.0%; coord p50 2.4km / p90 10.6km / p99 25.0km (admin-centroid tier: the median is centroid-to-address distance, not a geocoding miss). Neural's largest margin over the Pelias parser is IA at +7.9pp (suburban/rural Midwest); the weakest state for both is VT (rural Northeast, sparse gazetteer coverage), where neural still leads 87.1% vs 85.7%.

Eval-integrity note

This doc's tables are produced by scripts/eval/oa-resolver-eval.ts --out-md and pasted verbatim. The runner also writes --out-json; the two are computed from the same aggregates so they cannot disagree. (Earlier in this work an OA table was hand-typed and shipped wrong numbers — the self-reporting --out-md flag exists to make that class of error impossible.)
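The "same aggregates" guarantee can be illustrated as follows: a single in-memory aggregate array is the only input to both renderers, so the Markdown and JSON cannot drift apart. This is a sketch of the idea, not the runner's code. The `ParserAggregate` shape and both function names are assumptions.

```typescript
// Hypothetical aggregate row; rates are fractions in [0, 1].
interface ParserAggregate {
  parser: string;
  localityMatch: number;
  regionMatch: number;
  resolved: number;
}

// Format a fraction as a percentage with one decimal place.
const pct = (fraction: number): string => `${(fraction * 100).toFixed(1)}%`;

// Render the aggregates as a Markdown table.
function renderMarkdown(rows: ParserAggregate[]): string {
  const header = "| parser | locality-match | region-match | resolved |";
  const rule = "| --- | --- | --- | --- |";
  const body = rows.map(
    (r) => `| ${r.parser} | ${pct(r.localityMatch)} | ${pct(r.regionMatch)} | ${pct(r.resolved)} |`,
  );
  return [header, rule, ...body].join("\n");
}

// Render the very same aggregates as JSON; no recomputation happens here.
function renderJson(rows: ParserAggregate[]): string {
  return JSON.stringify(rows, null, 2);
}
```

Because neither renderer recomputes anything from raw rows, any number that appears in one output is derivable from the other.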

What it measures vs. doesn't

It does confirm the resolver maps real addresses to the right city/state at scale, independent of the gazetteer's own id scheme — the credibility check the WOF-bootstrap number couldn't give.
It does not measure street/house precision (the resolver is admin-level; coord error reflects centroid-to-point distance, not a geocoding miss).
Region-match required a name↔abbrev map: resolved regions carry the gazetteer's canonical full name ("California", "District of Columbia") while OA carries the USPS abbreviation ("CA", "DC"). An early cut scored region-match at 30% purely from that mismatch — a matcher bug, not a resolver one; fixed in the runner.
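The name↔abbrev fix described above amounts to normalising both sides to one form before comparing. A minimal sketch, assuming a lookup that covers only the jurisdictions in this eval set; the map contents and function names are illustrative rather than the runner's actual implementation.

```typescript
// Canonical full region name (lowercased) to USPS abbreviation.
// Only the jurisdictions in this eval set are listed; a real map would cover all of them.
const USPS_BY_REGION_NAME = new Map<string, string>([
  ["california", "CA"],
  ["district of columbia", "DC"],
  ["iowa", "IA"],
  ["illinois", "IL"],
  ["montana", "MT"],
  ["south dakota", "SD"],
  ["vermont", "VT"],
]);

// Normalise either a full name or a two-letter code to a USPS abbreviation.
function toUspsAbbrev(region: string): string | undefined {
  const key = region.trim().toLowerCase();
  if (/^[a-z]{2}$/.test(key)) return key.toUpperCase();
  return USPS_BY_REGION_NAME.get(key);
}

// Compare the gazetteer's canonical name against OA's abbreviation.
function regionMatches(resolvedName: string, oaRegion: string): boolean {
  const resolved = toUspsAbbrev(resolvedName);
  return resolved !== undefined && resolved === toUspsAbbrev(oaRegion);
}
```

Comparing the raw strings ("California" vs "CA") fails on every correctly resolved row, which is the kind of matcher bug that deflated the early region-match score.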

Resolver change that landed with this

core/resolver/resolve.ts now stamps metadata.resolver_name (the resolved place's canonical gazetteer name) alongside resolver_score. Without it the eval could only compare against the parser's own text span, not the place the resolver actually chose — so it couldn't tell a right-name/wrong-place resolution from a correct one. The name is also generally useful to consumers (display the canonical name, not the raw input span).
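To show why the stamped name matters, here is a sketch of a locality check that scores the place the resolver chose rather than the parser's text span. Only `metadata.resolver_name` and `resolver_score` come from the description above. The `ResolvedComponent` shape, the `span` field, and the helper names are assumptions made for illustration.

```typescript
// Hypothetical shape of a resolved component.
interface ResolvedComponent {
  span: string; // raw text the parser labelled as the locality (assumed field)
  metadata: {
    resolver_name?: string; // canonical gazetteer name of the chosen place
    resolver_score?: number;
  };
}

// Case- and whitespace-insensitive comparison key.
const normalizeName = (s: string): string => s.trim().toLowerCase().replace(/\s+/g, " ");

// Score against the resolver's chosen place; an unresolved component is a miss.
function localityMatches(component: ResolvedComponent, oaCity: string): boolean {
  const chosen = component.metadata.resolver_name;
  if (chosen === undefined) return false;
  return normalizeName(chosen) === normalizeName(oaCity);
}
```

A check based on `span` alone would pass whenever the parser extracted the right text, even if the resolver then picked a different place; comparing on `resolver_name` surfaces that case as a miss.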

Reproduce

node --experimental-strip-types scripts/eval/oa-resolver-eval.ts \
  --eval data/eval/external/openaddresses-us-sample.jsonl \
  --model <v0.7.2.onnx> --tokenizer <v0.6.0-a0> --model-card <card> \
  --wof admin-global-priority.db,postalcode-us.db \
  --out-json /tmp/oa-full.json

Add --limit N for a quick subset. The per-state breakdown is in the runner's output.
