v0.7–v0.8: the neural parser vs the rules baseline
When: 2026-05-29 → 2026-05-31.
This retrospective asks three questions. Is the neural parser ready to replace v0 (our
TypeScript port of the Pelias rule engine)? If so, why does our own headline test suite say
it's nowhere close? And when those two answers disagree, which one should we believe?
Well… it depends on how we're keeping score. On our in-house suite the rules parser wins in a landslide, 93.7% to 20.7%. On the scoreboard that matches the product — real addresses, resolved end to end — the neural parser comes out ahead, 97.3% to 95.8%. Both numbers are real; they're just counting different things.
This is the deep-dive companion to the blog post (Our parser fails 80% of our own tests). It pins down the numbers, the bias, and the forward plan, with enough detail to be useful in three months.
All neural numbers below are v0.7.2-class (the shipped default; its v0.8.1 retrain at the
same recipe is statistically identical). Tokenizer v0.6.0-a0 throughout, so cross-run
comparisons are valid. v0 is the rule parser, scored through the same machinery.
Scoreboard 1 — the harness says neural is 4.5× worse
The regression gate is scripts/harness-v0-neural.ts: 415 tag-level assertions ported
from the Pelias / addressit test suites. Current standings:
| parser | pass | rate |
|---|---|---|
| v0 (rule-based) | 389 | 93.7% |
| neural | 86 | 20.7% |
| both pass | 76 | 18.3% |
| v0 only | 313 | 75.4% |
| neural only | 10 | 2.4% |
| both fail | 16 | 3.9% |
On paper the rules parser wins in a walk. Then look at the per-file breakdown: v0 passes
100% of every functional file (usa, intersection, functional, fra, nld, nzd…). Naturally —
v0 is our port of Pelias, and these assertions are Pelias / addressit's own expected
outputs, so it passes by construction. The suite cannot, even in principle, show v0
being wrong on these cases; it encodes Pelias's tagging conventions as ground truth.
So "neural scores 20.7%" measures one thing: how often neural disagrees with Pelias's exact tagging conventions. It fails 79.3% of the assertions, and on 75.4% of them v0 passes where neural does not. The disagreements come from different tokenization of multi-word streets, different venue/locality boundaries, and the cases addressit happened to encode. The number says nothing about whether neural is wrong. The harness is a faithful regression gate (it catches when a retrain breaks a convention we previously matched), but not a parsing-quality verdict. Treating it as one is the harness-lineage trap.
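A minimal sketch of how the four-way split in the standings table could be tallied, assuming each assertion yields one pass/fail flag per parser. The `AssertionResult` shape and both function names are hypothetical; they are not taken from scripts/harness-v0-neural.ts.

```typescript
// Hypothetical record: one entry per harness assertion.
interface AssertionResult {
  id: string;
  v0Pass: boolean;
  neuralPass: boolean;
}

interface Buckets {
  both: number;
  v0Only: number;
  neuralOnly: number;
  bothFail: number;
}

// Assign every assertion to exactly one of the four buckets.
function bucketResults(results: AssertionResult[]): Buckets {
  const b: Buckets = { both: 0, v0Only: 0, neuralOnly: 0, bothFail: 0 };
  for (const r of results) {
    if (r.v0Pass && r.neuralPass) b.both++;
    else if (r.v0Pass) b.v0Only++;
    else if (r.neuralPass) b.neuralOnly++;
    else b.bothFail++;
  }
  return b;
}

// Format a count as a percentage of the total, to one decimal place.
function rate(count: number, total: number): string {
  return ((100 * count) / total).toFixed(1) + "%";
}
```

Because the buckets are disjoint, each parser's pass count is recoverable from them: v0 is `both + v0Only` (76 + 313 = 389) and neural is `both + neuralOnly` (76 + 10 = 86). `rate(86, 415)` gives `20.7%`, matching the neural row.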
Decomposing the 20% — three unbiased arenas
To see where neural actually earns its keep, we score both parsers on three arenas drawn
from outside the Pelias lineage (scripts/eval/external-arenas.sh, all --symmetric-match):
| arena | n | v0 | neural | both | neural-only | v0-only | both-fail |
|---|---|---|---|---|---|---|---|
| libpostal (clean / canonical) | 69 | 29% | 16% | 9% | 7% | 20% | 64% |
| perturb (noisy / degraded) | 398 | 39% | 61% | 32% | 29% | 7% | 32% |
| postal (edge formats) | 38 | 26% | 8% | 5% | 3% | 21% | 71% |
Three different stories:
- Clean, canonical input → rules win (29% vs 16%). libpostal-style strings are exactly what hand-tuned rules are built for. This also explains the harness: it is all canonical, Pelias-convention input, so it is the arena where neural is weakest, sampled 415 times.
- Noisy, degraded, reordered input → neural wins, decisively (61% vs 39%). This is the largest arena (398 cases), generated by perturbing golden addresses (dropped commas, abbreviations, reordering, casing). It is the closest proxy we have to real geocoder traffic, and it is the entire reason the neural model exists: it generalizes where rules brittle-fail.
- Edge formats → both are bad, v0 marginally ahead (26% vs 8%). The `edge_class` breakdown of the postal arena:

  | edge_class | n | v0 | neural |
  |---|---|---|---|
  | canonical | 14 | 36% | 7% |
  | intl-format | 7 | 43% | 29% |
  | secondary-unit | 7 | 29% | 0% |
  | directional | 2 | 0% | 0% |
  | military-apofpo | 3 | 0% | 0% |
  | po-box | 4 | 0% | 0% |
  | rural-route | 1 | 0% | 0% |

  PO boxes, military APO/FPO, and rural routes are 0% for both; neither parser was built or trained for them.
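To make the perturb arena concrete, here is a rough sketch of the four perturbation kinds named above (dropped commas, abbreviations, reordering, casing) applied to a golden address. Every name here is hypothetical and the abbreviation table is a toy; the actual generator may differ in all details.

```typescript
type Perturbation = (address: string) => string;

// Toy abbreviation table (an assumption); a real one would be far larger.
const ABBREVIATIONS = new Map<string, string>([
  ["Street", "St"],
  ["Avenue", "Ave"],
  ["Road", "Rd"],
  ["North", "N"],
]);

const dropCommas: Perturbation = (a) => a.replace(/,/g, "");
const lowerCase: Perturbation = (a) => a.toLowerCase();

// Replace each whole word that has an entry in the abbreviation table.
const abbreviate: Perturbation = (a) =>
  a.replace(/\b[A-Za-z]+\b/g, (word) => ABBREVIATIONS.get(word) ?? word);

// Reorder by moving the first comma-separated component to the end.
const rotateComponents: Perturbation = (a) => {
  const parts = a.split(",").map((p) => p.trim());
  if (parts.length < 2) return a;
  return [...parts.slice(1), parts[0]].join(", ");
};

// Apply the perturbations left to right.
function perturb(address: string, steps: Perturbation[]): string {
  return steps.reduce((acc, step) => step(acc), address);
}
```

With the golden string `123 North Main Street, Springfield, IL 62701`, applying `[abbreviate, dropCommas, lowerCase]` yields `123 n main st springfield il 62701`. Order matters: `abbreviate` is case-sensitive in this sketch, so it has to run before `lowerCase`.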
Scoreboard 2 — the resolver, on real addresses, end to end
The product is a geocoder, not a tag-matcher. The honest, non-circular metric is the OpenAddresses real-point eval: every row is a real US address with a real government lat/lon, and we score whether each parser, run through the same WOF resolver, lands on the right locality. On 10,000 rows:
| parser | locality match | region match | resolved |
|---|---|---|---|
| neural | 97.3% | 99.9% | 100.0% |
| v0 (Pelias) | 95.8% | 99.5% | 99.8% |
On the metric that matches the product, neural beats the rules parser (+1.5pp locality, on real addresses). The north-star claim — "a parser superior to Pelias's regex/rules system" — is demonstrated here, not merely asserted.
(Coordinate accuracy is still admin-centroid tier: p50 2.4 km, p90 10.6 km. That is the geometry gap, not a parsing gap — addressed in the roadmap below.)
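For reference, p50/p90 figures like the ones above can be produced from per-row distance errors. The sketch below assumes great-circle (haversine) distance with a mean Earth radius of 6371 km and a nearest-rank percentile; the eval may use a different distance formula or percentile definition, and the function names are illustrative.

```typescript
const EARTH_RADIUS_KM = 6371; // mean Earth radius (assumption)

interface LatLon {
  lat: number;
  lon: number;
}

// Great-circle distance between two points, in kilometres.
function haversineKm(a: LatLon, b: LatLon): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Nearest-rank percentile: the smallest sample with at least p% of
// all samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty sample");
  const sorted = [...values].sort((x, y) => x - y);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}
```

Feeding `percentile` the list of `haversineKm(resolved, groundTruth)` values with `p = 50` and `p = 90` gives the two summary numbers.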
Where neural is genuinely worse at parsing — and why
Stripping out the harness bias, the real neural deficits are:
- Clean canonical forms (libpostal 29 vs 16). Rules encode the exact pattern; neural has no inherent edge there, and its documented overconfidence (≈81% of wrong predictions emitted at ≥0.9 confidence) plus reliance on comma/format cues can mis-segment an otherwise tidy string.
- Edge / postal formats: PO box, military, rural route (0% for both parsers); secondary-unit (0% neural vs 29% v0). Why: under-represented in the synthetic training corpus, so never learned.
- Non-US locales. On the harness the neural model scores 24% (FR), 23% (NL), 0% (NZ) against v0's 100%. Why: it is an en-us model; it was not trained for NZ/NL/FR conventions.
- Concrete parse bugs surfaced in the resolver failure analysis: multi-word locality truncation (`Belle Fourche` → `Belle`, `Fort Pierre` → `Fort`), DC directional quadrants (`…Street Ne` → tags `Ne` as a locality), and literal `<Null>` tokens in dirty input that v0 shrugs off.
Every one of these is corpus-addressable — they are gaps in what the model was shown, not architectural limits.
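The overconfidence figure cited in the first bullet (the share of wrong predictions emitted at ≥0.9 confidence) is simple to compute. A hedged sketch follows, assuming one record per prediction with a correctness flag and a confidence in [0, 1]; the `Prediction` shape and function name are hypothetical.

```typescript
// Hypothetical record: one entry per predicted tag.
interface Prediction {
  correct: boolean;
  confidence: number; // model probability for the chosen tag, in [0, 1]
}

// Fraction of wrong predictions whose confidence is at or above the threshold.
// Returns 0 when there are no wrong predictions at all.
function highConfidenceErrorShare(preds: Prediction[], threshold = 0.9): number {
  const wrong = preds.filter((p) => !p.correct);
  if (wrong.length === 0) return 0;
  const confidentlyWrong = wrong.filter((p) => p.confidence >= threshold);
  return confidentlyWrong.length / wrong.length;
}
```

The denominator is wrong predictions only, so the metric is independent of overall accuracy. A value near 0.81 means confidence is a poor signal for catching errors.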
Where neural wins — and why we run it
Noisy real-world input (perturb 61 vs 39) and the product metric (97.3 vs 95.8 on real points). Real traffic is abbreviated, reordered, comma-light, misspelled — exactly where rules brittle-fail and a learned model generalizes. The 10 harness "neural-only" wins (up from ~2 historically) are the leading edge of the same effect.
The MLM detour — a clean negative result
We spent part of this window testing whether masked-language-model pre-training of the encoder would lift the model (attacking the overconfidence + format-reliance pathologies). A 40k-step fine-tune from a pre-trained encoder looked like a win (+4.8pp harness, a calibration edge). But running the decisive A/B at the full v0.7.2 recipe — both arms to the task ceiling, single variable = encoder init — the gains vanished:
| metric | pretrained | scratch | Δ |
|---|---|---|---|
| resolver locality (10k) | 97.3% | 97.5% | −0.20 (0.9σ, tie) |
| harness | 19.04% | 20.72% | −1.69 |
| wrong-pred conf p90 | 0.949 | 0.949 | 0.00 |
The 40k "gains" were an under-training artifact: a pre-trained init is simply ahead before a scratch encoder catches up; given 100k steps it does. Per a pre-registered kill-point, we dropped pre-training. v0.7.2 stays the default. The code is kept for a possible future larger model — the negative result is specific to this 29M-param encoder, 20k-step pre-train, en-us corpus.
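One way to sanity-check the "tie" call on the resolver row is a two-proportion z-score. The sketch below assumes an unpooled standard error and treats the two arms as independent samples of 10,000 rows each. That is an assumption: both arms were scored on the same rows, so a paired test would be the stricter choice, and this may not be how the σ in the table was derived.

```typescript
// Two-proportion z-score with an unpooled standard error.
// p1, p2 are observed rates in [0, 1]; n1, n2 are sample sizes.
function twoProportionZ(p1: number, p2: number, n1: number, n2: number): number {
  const se = Math.sqrt((p1 * (1 - p1)) / n1 + (p2 * (1 - p2)) / n2);
  return (p1 - p2) / se;
}

// Pretrained (97.3%) vs scratch (97.5%) on 10,000 rows each.
const z = twoProportionZ(0.973, 0.975, 10_000, 10_000);
console.log(z.toFixed(2));
```

Under those assumptions the magnitude comes out a little under 0.9, well short of the conventional 1.96 threshold, which is consistent with reading the 0.2pp gap as noise.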
What we'd do differently / lessons
- Don't let a Pelias-derived suite judge a non-Pelias parser. A test suite ported from the system you're trying to beat scores that system at 100% by construction. Use it as a regression gate; judge quality on independent arenas + the product metric.
- Pick the metric that matches the product. Per-tag agreement is not geocoding accuracy. The resolver-on-real-points number is the one that tracks the north star, and it tells the opposite story from the harness.
- Pre-register kill-points. The MLM track stayed alive on a 40k mirage; the pre-registered "ship only if the gains hold at ceiling" rule is what killed it cleanly instead of inviting another speculative round.
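The first lesson, using the harness as a regression gate rather than a quality score, amounts to comparing sets of passing assertions between runs instead of comparing pass rates. A minimal sketch, assuming each run can be reduced to a set of passing assertion ids (function names are hypothetical):

```typescript
// Ids that passed in the baseline run but no longer pass in the candidate run.
function findRegressions(
  baselinePassing: Set<string>,
  candidatePassing: Set<string>,
): string[] {
  return Array.from(baselinePassing)
    .filter((id) => !candidatePassing.has(id))
    .sort();
}

// The gate holds only if nothing that used to pass has broken.
// New passes are welcome but do not offset regressions.
function gatePasses(baselinePassing: Set<string>, candidatePassing: Set<string>): boolean {
  return findRegressions(baselinePassing, candidatePassing).length === 0;
}
```

Framed this way, a retrain with a higher overall pass rate still fails the gate if it breaks a convention the previous model matched, and a low absolute pass rate is not a failure in itself.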
Forward plan (DeepSeek turn-5, ranked)
- Street-level geometry (TIGER/Line + house-number interpolation). We win on locality but coordinates are admin-centroid only (p50 2.4 km). This is the km → tens-of-meters jump that makes "production geocoder" literally true. Gate: p50 ≤ 200 m, p90 ≤ 1 km on a 1,000-point ground-truth set. ⚠ Requires geometry we don't yet have (our `tiger.db` is street-name + ZIP only): a new data pipeline, not a wiring job.
- Resolver ranking depth: fix the same-state wrong-town bug (`Saint Albans, VT` → `St. Johnsbury`) and harden the population/importance tiebreak. Cheap; lifts locality ~97.3 → ~98%. Good first/parallel win.
- Parse-robustness corpus pass: multi-word localities, directional quadrants, noisy tokens, the edge formats at 0%. Feeds the next train, closes the genuine neural deficits above.
- Multi-locale: extend beyond en-us once the US geocoder is street-level complete.