Skip to main content

OpenAddresses real-point resolver eval — the non-circular accuracy track (2026-05-30)

Direction-C resolver-depth. The first non-circular end-to-end accuracy number for the resolver — real US addresses with real government coordinates, resolved against a gazetteer they don't come from — and, because v0 is a TypeScript port of the Pelias parser, the neural-vs-Pelias-parser head-to-head at the same time (the parser axis of "beat Pelias", with no Docker Pelias stack needed).

Why this is the honest scoreboard

The WOF-bootstrap eval (the +8.5pp exact-match-tiering result) renders WOF places back into address strings and resolves WOF→WOF. It's a legitimate ranking test, but it's circular by construction — the ground truth is the same gazetteer the resolver consults, so it can't measure whether we resolve real-world addresses to the right place on the map.

OpenAddresses is independent: each row is a real US address with a real lat/lon harvested from authoritative government address points, and the resolver consults the WOF gazetteer — a different source. So:

  • Admin-match (did we resolve to the expected locality/region, by canonical gazetteer name vs OA's ground truth) measures resolver correctness independent of WOF id conventions.
  • Coordinate error (great-circle from the resolved admin centroid to OA's real point) is a genuine map-accuracy signal — un-gameable, since OA's point was never in the gazetteer.

The set: data/eval/external/openaddresses-us-sample.jsonl (10,000 rows, 8 states, stratified dense-urban → rural so no single state dominates).

Two-tier metric

Per the DeepSeek resolver consult, a sub-10km coordinate bar is impossible for admin-centroid resolution — a city centroid is legitimately tens of km from its edge addresses. So the metric is split:

  1. Admin-match Acc (the headline) — locality-match and region-match rates, granularity-independent.
  2. Coord error p50/p90 — reported as the admin-centroid tier. The street-level tier (TIGER) will own a sub-km coordinate bar in a later phase.

Head-to-head: neural vs the Pelias parser (v0.7.2 model, 10,000 rows)

mailwoman's v0 rule parser is a TypeScript port of the Pelias parser, so running both parsers through the same resolver makes this a direct neural-vs-Pelias-parser comparison on real, non-circular addresses — no Docker Pelias stack required. The table below is emitted verbatim by the eval runner (--out-md); eval figures are never hand-typed (see the integrity note).

parserlocality-matchregion-matchresolvedcoord p50 kmcoord p90 kmp99 km
neural96.1%100.0%100.0%2.410.625.0
v0 (Pelias)94.4%99.5%99.8%2.410.625.0

Neural beats the Pelias parser on real US addresses — +1.7pp locality, +0.5pp region, and a higher resolve rate — and wins in every state (per-state below). Both share identical coordinate error because they feed the same resolver and, when both resolve to the right admin, land on the same centroid; the difference is purely which addresses each parser resolves correctly at all.

Neural per-state (locality-match)

statenneural locv0 locneural regv0 reg
CA142999.9%99.7%100.0%99.9%
DC142999.5%99.2%99.9%99.2%
IA142994.3%86.4%99.8%99.0%
IL142998.7%97.6%100.0%99.7%
MT142896.7%95.3%100.0%99.4%
SD142896.8%96.8%100.0%99.7%
VT142887.1%85.7%100.0%99.5%

Headline: neural locality-match 96.1%, region-match 100.0% on 10,000 real US addresses, resolved 100.0%; coord p50 2.4km / p90 10.6km / p99 25.0km (admin-centroid tier — median is centroid-to-address distance, not a geocoding miss). Neural's largest margin over the Pelias parser is IA +7.9pp (suburban/ rural midwest); the weakest state for both is VT (rural-northeast, sparse gazetteer coverage), where neural still leads 87.1% vs 85.7%.

Eval-integrity note

This doc's tables are produced by scripts/eval/oa-resolver-eval.ts --out-md and pasted verbatim. The runner also writes --out-json; the two are computed from the same aggregates so they cannot disagree. (Earlier in this work an OA table was hand-typed and shipped wrong numbers — the self-reporting --out-md flag exists to make that class of error impossible.)

What it measures vs. doesn't

  • It does confirm the resolver maps real addresses to the right city/state at scale, independent of the gazetteer's own id scheme — the credibility check the WOF-bootstrap number couldn't give.
  • It does not measure street/house precision (the resolver is admin-level; coord error reflects centroid-to-point distance, not a geocoding miss).
  • Region-match required a name↔abbrev map: resolved regions carry the gazetteer's canonical full name ("California", "District of Columbia") while OA carries the USPS abbreviation ("CA", "DC"). An early cut scored region-match at 30% purely from that mismatch — a matcher bug, not a resolver one; fixed in the runner.

Resolver change that landed with this

core/resolver/resolve.ts now stamps metadata.resolver_name (the resolved place's canonical gazetteer name) alongside resolver_score. Without it the eval could only compare against the parser's own text span, not the place the resolver actually chose — so it couldn't tell a right-name/wrong-place resolution from a correct one. The name is also generally useful to consumers (display the canonical name, not the raw input span).

Reproduce

node --experimental-strip-types scripts/eval/oa-resolver-eval.ts \
--eval data/eval/external/openaddresses-us-sample.jsonl \
--model <v0.7.2.onnx> --tokenizer <v0.6.0-a0> --model-card <card> \
--wof admin-global-priority.db,postalcode-us.db \
--out-json /tmp/oa-full.json

--limit N for a quick subset. Per-state breakdown is in the runner's output.