v0-vs-Neural harness — 2026-05-28
The neural parser has never been measured against the legacy rule-based pipeline's hand-tuned
acceptance criteria. The 4561-entry golden set at data/eval/golden/v0.1.2/ measures the
neural on what it was trained for; the 376 assertions in mailwoman/test/*.test.ts measure
the rule-based parser on what it was hand-tuned for. This eval bridges the two.
Per DeepSeek consult turn 6: without this harness, v0.6.2's corpus augmentation is "try something and hope." With it, augmentation becomes "fix the specific assertions that fail."
Setup
- Harness: scripts/harness-v0-neural.ts extracts every assert(input, ...expected) call from mailwoman/test/*.test.ts via TypeScript AST (376 assertions across 30 files spanning 20+ locales), runs each input through BOTH createAddressParser() (v0 rule-based) and NeuralAddressClassifier (v0.6.0 + admin FST + morphology FST), and reports per-file / per-locale pass rates plus a JSON sidecar for downstream clustering.
- Model: v0.6.0 default (model-v060-step-100000-int8.onnx)
- Admin FST: fst-en-us.bin
- Morphology FST: built in-process from libpostal street_types.txt dictionaries (1,707 canonicals / 3,763 variants after length-3 filter)
- Comparison semantics:
  - v0: vitest toEqual strict deep-equality, position-by-position against ranked solutions (matches the existing test semantics: pass only if every expected solution deep-equals solutions[i].classifications).
  - Neural: flatten AddressTree to Map<ComponentTag, string>, fold neural-only tags (street_prefix + street + street_suffix → street; intersection_a + intersection_b → street as two values), then check if any expected solution matches the resulting record (substring containment in either direction allowed).
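The neural-side comparison can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the harness's code: the type names, the helper names, and the rule that extra actual tags are ignored are all hypothetical. Only the folding rules and the either-direction substring check come from the description above.

```typescript
// Hypothetical shapes; the harness's real types are not shown in this report.
type Classifications = Record<string, string[]>;
type TaggedSpan = { tag: string; text: string };

const STREET_PARTS = new Set(["street_prefix", "street", "street_suffix"]);
const INTERSECTION_SIDES = new Set(["intersection_a", "intersection_b"]);

function append(record: Classifications, tag: string, value: string): void {
  const list = record[tag] ?? (record[tag] = []);
  list.push(value);
}

// Fold neural-only tags into v0's vocabulary. Spans arrive in reading order.
function foldNeural(spans: TaggedSpan[]): Classifications {
  const out: Classifications = {};
  const streetPieces: string[] = [];
  for (const { tag, text } of spans) {
    if (STREET_PARTS.has(tag)) streetPieces.push(text);
    else if (INTERSECTION_SIDES.has(tag)) append(out, "street", text);
    else append(out, tag, text);
  }
  // prefix + street + suffix collapse into a single street value.
  if (streetPieces.length > 0) append(out, "street", streetPieces.join(" "));
  return out;
}

// Substring containment in either direction; empty strings never match.
function looselyEqual(a: string, b: string): boolean {
  return a.length > 0 && b.length > 0 && (a.includes(b) || b.includes(a));
}

// Pass if ANY expected solution is satisfied. Assumption: a solution is
// satisfied when every expected value has a loosely equal actual value under
// the same tag; extra actual tags are ignored here.
function neuralPasses(actual: Classifications, expected: Classifications[]): boolean {
  return expected.some((solution) =>
    Object.entries(solution).every(([tag, values]) =>
      values.every((want) => (actual[tag] ?? []).some((got) => looselyEqual(got, want))),
    ),
  );
}
```

Under these assumptions a merged span such as street: ["Am Nordkanal 11"] still fails against an expectation that also lists house_number: ["11"], because containment is checked per tag and the house_number tag is absent.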
Result
| Parser | Pass | Rate |
|---|---|---|
| v0 (rule-based) | 376 | 100.0% |
| Neural | 54 | 14.4% |
| Category | Count | Rate |
|---|---|---|
| Both pass | 54 | 14.4% |
| v0 only | 322 | 85.6% |
| Neural only | 0 | 0.0% |
| Both fail | 0 | 0.0% |
Zero neural-only wins. Every assertion the neural passes is also passed by v0. The rule-based pipeline strictly dominates the neural pipeline on its own test suite.
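The four-way split above follows mechanically from two booleans per assertion. The sketch below shows that bookkeeping and reproduces the reported rates from the reported counts. The Outcome shape and function names are assumptions for illustration; the JSON sidecar's actual field names are not documented here.

```typescript
// Hypothetical per-assertion record; the sidecar's real field names may differ.
type Outcome = { v0Pass: boolean; neuralPass: boolean };
type Category = "both_pass" | "v0_only" | "neural_only" | "both_fail";

function categorize({ v0Pass, neuralPass }: Outcome): Category {
  if (v0Pass && neuralPass) return "both_pass";
  if (v0Pass) return "v0_only";
  if (neuralPass) return "neural_only";
  return "both_fail";
}

function tally(outcomes: Outcome[]): Record<Category, number> {
  const counts: Record<Category, number> = {
    both_pass: 0,
    v0_only: 0,
    neural_only: 0,
    both_fail: 0,
  };
  for (const outcome of outcomes) counts[categorize(outcome)] += 1;
  return counts;
}

function rate(count: number, total: number): string {
  return total === 0 ? "n/a" : `${((100 * count) / total).toFixed(1)}%`;
}

// Rebuild the reported split: 54 assertions pass in both, 322 pass in v0 only.
const outcomes: Outcome[] = [
  ...Array.from({ length: 54 }, () => ({ v0Pass: true, neuralPass: true })),
  ...Array.from({ length: 322 }, () => ({ v0Pass: true, neuralPass: false })),
];
const counts = tally(outcomes);
console.log(rate(counts.both_pass, outcomes.length)); // 14.4%
console.log(rate(counts.v0_only, outcomes.length)); // 85.6%
```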
Per-file breakdown
| File | Total | Neural % | Comment |
|---|---|---|---|
| address.usa.test.ts | 73 | 23% | Heaviest training distribution; still 56/73 missing |
| intersection.test.ts | 65 | 0% | Total failure mode; Main St & 5th Ave-style inputs |
| functional.test.ts | 34 | 3% | Cross-cutting cases |
| address.fra.test.ts | 33 | 24% | Some French in training |
| address.nld.test.ts | 22 | 9% | NL compact addresses largely unhandled |
| address.nzd.test.ts | 22 | 0% | NZ format |
| addressit.usa.test.ts | 21 | 81% | Autocomplete-style — best performer |
| address.deu.test.ts | 17 | 0% | German format entirely missed |
| place.fra.test.ts | 13 | 0% | French place lookups |
| addressit.aus.test.ts | 11 | 64% | Australian autocomplete |
| address.aus.test.ts | 9 | 0% | Australian unit notation (Unit 12/345) |
| address.nor.test.ts | 9 | 0% | Norwegian |
| address.prt.test.ts | 8 | 0% | Portuguese |
| address.pol.test.ts | 6 | 0% | Polish |
| venue.usa.test.ts | 6 | 0% | Venue-only inputs |
| address.rom.test.ts | 5 | 20% | Romanian |
| address.swe.test.ts | 4 | 0% | Swedish |
| compound_street.test.ts | 4 | 0% | Compound street tests |
| address.cze.test.ts | 3 | 0% | Czech |
| address.gbr.test.ts | 3 | 0% | UK format |
| (other locales) | 1-2 each | 0% | bra, esp, hrv, ind, svk, transit |
| libpostal.test.ts | 1 | 100% | Single passing fixture |
Failure clusters
The 322 v0-only-passes split into four structural categories:
1. Tokenization issues with non-ASCII
Neural tokenizer + BIO decoder garbles multi-byte characters. Examples:
- Korunní 810, Praha → neural: {locality: ["Korunn"], region: ["ha"]}. The ní is split mid-character.
- Rua Raul Leite Magalhães, 65, Tapiraí - SP, 18180-000, Brazil → neural: {street: ["es"], region: ["zil"], venue: ["Rua Raul Leite Magalh"]}. Portuguese diacritics destroy span boundaries.
This is a tokenizer/encoder issue, not a schema issue. The model can't recover when SentencePiece pieces don't reassemble to the original characters.
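One way to start the investigation is a round-trip check: rejoin the tokenizer's pieces and compare against the input. The sketch below assumes the pieces are available as plain strings and that comparing under NFC is appropriate; SentencePiece's own normalization rules may differ, and the piece sequences in the usage note are made up for illustration rather than real tokenizer output.

```typescript
// SentencePiece marks word boundaries with U+2581 (LOWER ONE EIGHTH BLOCK).
const WORD_BOUNDARY = "\u2581";

// Rejoin pieces into text: concatenate, then turn boundary markers into spaces.
function reassemble(pieces: string[]): string {
  return pieces.join("").split(WORD_BOUNDARY).join(" ").replace(/^ /, "");
}

// True when the pieces reproduce the original string. Both sides are compared
// under NFC (an assumption) so composed and decomposed diacritics are equal.
function roundTrips(original: string, pieces: string[]): boolean {
  return reassemble(pieces).normalize("NFC") === original.normalize("NFC");
}
```

For example, roundTrips("Korunní 810, Praha", ["\u2581Korunní", "\u2581810", ",", "\u2581Praha"]) holds, while a hypothetical sequence in which the final í came back as an unknown piece would not. Inputs that fail this check cannot yield correct spans regardless of what the BIO decoder does.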
2. Schema gap: unit_designator
v0 has both unit and unit_designator — for Unit 12/345 Main St, expected is
{unit_designator: ["Unit"], unit: ["12"], house_number: ["345"], street: ["Main St"]}. The
neural schema only has unit. Every Australian unit-notation test fails on this gap alone:
- Apartment 12/345 Main St → expected unit_designator: ["Apartment"] but neural has no tag for it
- U 12 345 Main St → same problem with U
- Lot 12/345 Illawarra Road... → 9 of 22 NLD/AUS tests fail on this single missing tag
This is solvable by adding unit_designator to the Stage 3 schema (and the training corpus's
adapters), but it's a schema change requiring retraining.
3. House-number / street boundary
The neural parser keeps reading street and house_number as one span. German examples:
- Am Nordkanal 11, 47877 Willich → expected street: "Am Nordkanal" + house_number: "11"; neural: street: "Am Nordkanal 11"
- Am Falkpl. 5, 10437 Berlin → expected street: "Am Falkpl."; neural: street: "Am Falkpl" (period dropped), which is close but not equal under strict comparison
The model never saw enough German addresses to learn the street-then-house-number pattern, in which the number follows the street name instead of preceding it. v0's HouseNumberClassifier hard-codes the heuristic; the neural has to learn it from data.
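To make the expected split concrete, here is a deliberately blunt rule that peels a trailing number off a street span. It is not v0's HouseNumberClassifier, whose logic is not shown in this report; the function name and regex are hypothetical and only illustrate the boundary the tests expect.

```typescript
type StreetSplit = { street: string; house_number: string | null };

// Split "<street> <number>[letter]" into its two parts. Spans that do not end
// in a number are returned unchanged with house_number set to null.
function splitTrailingHouseNumber(span: string): StreetSplit {
  const trimmed = span.trim();
  const match = /^(.*\S)\s+(\d+[a-zA-Z]?)$/.exec(trimmed);
  if (match === null) return { street: trimmed, house_number: null };
  return { street: match[1], house_number: match[2] };
}
```

Applied to "Am Nordkanal 11" this yields street "Am Nordkanal" and house_number "11". The same rule would also split "Plein 1944" into a street and a number, which is the kind of misfire the falsehoods catalog is meant to catch.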
4. Intersections (0/65)
Every intersection test fails. Main St & 5th Ave → expected
{street: ["Main St", "5th Ave"]} (two street values); neural produces single-street
output or routes one side into a non-street tag. This is the failure mode the falsehoods
doc flagged. v0 has a dedicated IntersectionClassifier + CompositeIntersectionClassifier;
the neural has intersection_a/b BIO tags but the training data appears to undersample
intersections relative to v0's hand-tuned coverage.
What this means for v0.6.2
This is the "honest assessment" the postmortem called for. The implications:
- Locale coverage is the dominant gap. US, FR, and a slice of AUS are in distribution; everything else is structurally untaught. v0.6.2's corpus augmentation has to expand beyond US-only synth-street.
- Schema additions are needed before retraining helps Aus/UK unit notation. The unit_designator gap can't be papered over by training-data tweaks.
- Intersections need targeted corpus work. v0.6.0's synth-street shard taught street decomposition but didn't include intersection patterns at all.
- Tokenizer is suspect for non-ASCII. Czech / Portuguese / German diacritics are destroying span boundaries. Needs investigation independent of the corpus path.
- Strict equality might be the wrong bar for the neural parser. v0's tests assert exact structural equality including punctuation. The neural pipeline currently doesn't preserve token-level punctuation. A looser equivalence (Am Falkpl. ≈ Am Falkpl) would change the picture for cluster #3 specifically; whether that looser bar is acceptable is a product decision.
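A looser equivalence of the kind described in the last point could be expressed as a normalization applied to both sides before comparison. The sketch below is one possible definition, not something the harness implements; which characters to drop is exactly the product decision mentioned above, and the function names are hypothetical.

```typescript
// Normalize a component value for punctuation-insensitive comparison:
// compose diacritics, drop punctuation that ends a token, collapse whitespace.
function normalizeComponent(value: string): string {
  return value
    .normalize("NFC")
    .replace(/[.,;:]+(?=\s|$)/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

function looselyEquivalent(a: string, b: string): boolean {
  return normalizeComponent(a) === normalizeComponent(b);
}
```

With this definition "Am Falkpl." and "Am Falkpl" compare equal, while "Am Nordkanal 11" and "Am Nordkanal" remain different, so the relaxation affects dropped punctuation only and leaves merged house numbers as failures.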
Falsehoods extension
The harness also accepts a --falsehoods <dir> argument for JSONL files of structured rows.
data/eval/falsehoods/streets.jsonl lands the catalog from
falsehoods-streets.md as 22 explicit
test rows — Piccadilly, rue de Rivoli, Plein 1944, Avenue Road, Gondel 2695, the
A1-as-composite-street, dependent streets, Japanese block addressing, rural routes, Mannheim
grid. Each row carries falsehood (the category) and expected_failure (whether this is a
known gap rather than a regression vector).
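A row loader for such a file might look like the sketch below. Only falsehood and expected_failure are named in this report; the input and expected fields, the function names, and the regression rule are assumptions made for illustration.

```typescript
interface FalsehoodRow {
  input: string; // assumed field name: the raw address string
  expected: Record<string, string[]>[]; // assumed field name and shape
  falsehood: string; // the category
  expected_failure: boolean; // true for a known gap
}

// Parse JSONL text: one JSON object per non-empty line.
// The expected field is not validated in this sketch.
function parseFalsehoods(jsonl: string): FalsehoodRow[] {
  const rows: FalsehoodRow[] = [];
  jsonl.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return;
    const row = JSON.parse(line) as Partial<FalsehoodRow>;
    if (
      typeof row.input !== "string" ||
      typeof row.falsehood !== "string" ||
      typeof row.expected_failure !== "boolean"
    ) {
      throw new Error(`line ${index + 1}: missing or mistyped field`);
    }
    rows.push(row as FalsehoodRow);
  });
  return rows;
}

// A failing row counts as a regression only when it was not a known gap.
function isRegression(row: FalsehoodRow, passed: boolean): boolean {
  return !passed && !row.expected_failure;
}
```

Keeping expected_failure on the row lets a future release distinguish "still unsolved" from "newly broken" without maintaining a separate skip list.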
With the falsehoods rows added the picture is:
| Parser | Pass | Rate |
|---|---|---|
| v0 (rule-based) | 385 / 398 | 96.7% |
| Neural | 56 / 398 | 14.1% |
| Category | Count | Rate |
|---|---|---|
| Both pass | 54 | 13.6% |
| v0 only | 331 | 83.2% |
| Neural only | 2 | 0.5% |
| Both fail | 11 | 2.8% |
Two neural-only wins appear once the falsehoods are in play — cases where the neural parser succeeds and the rule-based pipeline fails:
- 8 Seven Gardens Burgh, WOODBRIDGE, IP13 6SU: number-in-street-name. v0's HouseNumberClassifier grabs the leading 8 then mis-tags the rest; neural reads it as a single street span correctly.
- R 5, 6-13, D-68161 Mannheim: grid address with no street. v0's classifiers force a street-shaped reading that doesn't fit; neural correctly produces just locality + postcode.
The pattern: neural is sometimes BETTER on formats it has truly never seen, because v0's hand-tuned rules actively misclassify them. The "ambiguous-but-common" cases (rue de Rivoli, Broadway, place de la Concorde, Avenue Road) are where v0's rules dominate, because neural lacks the training distribution.
v0 also fails 13 of the 22 falsehoods rows: the 2 neural-only wins plus the 11 both-fail rows in the table above. Piccadilly, Hauptstraße 5, Plein 1944, Gondel 2695, 6 Elm Avenue, Japanese block addressing, and rural routes are unsolved by both parsers. The falsehoods catalog is the right capability-boundary regression suite for any future release.
Reproducing
node --experimental-strip-types scripts/harness-v0-neural.ts \
--tests mailwoman/test \
--model /mnt/playpen/mailwoman-data/models/quantized/model-v060-step-100000-int8.onnx \
--tokenizer /mnt/playpen/mailwoman-data/models/tokenizer/v0.6.0-a0/tokenizer.model \
--model-card neural-weights-en-us/model-card.json \
--admin-fst /mnt/playpen/mailwoman-data/wof/fst-per-locale/fst-en-us.bin \
--out-json /tmp/harness-full.json \
> /tmp/harness-full-report.md
Total runtime: ~5 seconds for 376 assertions.
See also
- Street-supplement architecture — the design context
- Layer 1 morphology FST eval — preceding eval that established the decoder-only fix is insufficient
- Falsehoods about street names — the edge cases the harness's falsehoods row source captures
- 2026-05-28 night-2 postmortem — postmortem that triggered this assessment