No-street counter-example corpus shard β 2026-05-28
The corpus-side fix for v0.6.1's dependent_locality regression. Per the
2026-05-28 night-2 postmortem and the
Layer 1 morphology FST eval: the
morphology FST as a decoder-only fix is insufficient because the model is too
overconfident on its synth-street-induced dep_loc predictions. The recipe per
DeepSeek's turn 2 consult is to add explicit counter-distribution training data:
addresses where the model should NOT emit street labels at all.
This eval doc captures the new synthesizer + pipeline glue that produces that counter-distribution shard.
What landsβ
corpus/src/synthesize-no-street.tsβ six-template synthesizer:venue-adversarial(35% weight) β venues whose names contain street-typing tokens (Wall Street Industries,5th Avenue Theatre,Park Avenue Dental,Highway 61 Diner, etc.). The load-bearing slice: these are exactly the rows that v0.6.1's "decompose mode" mis-tags. Explicit B-venue / I-venue labels on tokens likeStreet/Avenue/Highwayteach the model to suppress street-side predictions when the surrounding context is venue-shaped.venue-plain(25%) β venues without street-typing tokens.locality-region-postcode(20%) β"Boston, MA 02101".locality-region(12%) β"Boston, MA".postcode-only(6%) β"02101".country-only(2%) β"United States".
corpus/src/synthesize-no-street.test.tsβ 8 unit tests covering each template, including a 500-iteration "never emits any street-side tag" contract check and a 1000-iteration template-distribution sanity check.scripts/build-no-street-shard.mjsβ mirrorsbuild-po-box-shard.mjs: reads{locality, region, postcode, country}tuples from JSONL stdin and emits alignedLabeledRowJSONL ready for the parquet sharding step.corpus/src/index.tsβ exports the new synthesizer.
Verified end-to-endβ
Smoke test with 5 base tuples (US, FR, DE) and --variants 4:
read 5 tuples, emitted 20 rows, skipped 0
template distribution:
venue-plain: 8 (40.0%)
locality-region: 5 (25.0%)
venue-adversarial: 4 (20.0%)
postcode-only: 2 (10.0%)
country-only: 1 (5.0%)
Critical check β 0 street-side BIO labels across all 20 synthesized rows. The contract holds.
Sample of the load-bearing adversarial output:
raw: 7th Street Bistro, Boston, MA 02101
tokens: ['7th', 'Street', 'Bistro', 'Boston', 'MA', '02101']
labels: ['B-venue', 'I-venue', 'I-venue', 'B-locality', 'B-region', 'B-postcode']
raw: Memorial Drive Medical Center, MΓΌnchen, Bayern 80331
tokens: ['Memorial', 'Drive', 'Medical', 'Center', 'MΓΌnchen', 'Bayern', '80331']
labels: ['B-venue', 'I-venue', 'I-venue', 'I-venue', 'B-locality', 'B-region', 'B-postcode']
raw: South Park Children's Center, MΓΌnchen, Bayern 80331
tokens: ['South', 'Park', "Children's", 'Center', 'MΓΌnchen', 'Bayern', '80331']
labels: ['B-venue', 'I-venue', 'I-venue', 'I-venue', 'B-locality', 'B-region', 'B-postcode']
The tokens Street, Drive, Park carry explicit B-venue / I-venue labels
β the model sees direct evidence that these tokens are NOT street components in
this surrounding context.
v0.6.2 recipe (proposed)β
With this shard landed, the full v0.6.2 retrain recipe per DeepSeek turn 2 is runnable:
- Reduce synth-street weight 2.0 β 0.5
- Add synth-no-street shard at weight 1.0, 50Kβ100K rows, US-primary with FR + DE + GB locales sprinkled in proportion to the base tuple availability.
- Enable Layer 1 morphology FST at inference (infrastructure already landed). Provides additive anchoring on top of the corpus-side fix.
- 2D pre-publish eval gate before promoting:
(recall drop >2pp AND baseline >10%)OR(hallucination spike >100 AND rate >20% of golden occurrences)
- Re-run v0-vs-neural harness post-train and target the failure clusters surfaced by the harness eval.
What this doesn't fixβ
- Non-US locale gap. The harness showed neural at 0% on most non-US/non-FR locales. This shard helps with US/FR/DE/GB venue-vs-street confusion but does not fix the locale-distribution gap. Per-locale corpus expansion is a v0.7+ topic.
- Unit designator schema gap. The Australian unit-notation tests
(
Unit 12/345,Apt 12) fail because the Stage 3 schema has nounit_designatortag. This is a schema change requiring a separate retrain, not a corpus tweak. - Tokenization on non-ASCII. Czech, Portuguese diacritics are destroying span boundaries in the harness output. Independent of the corpus path.
See alsoβ
- v0-vs-neural harness eval β establishes the failure clusters this shard addresses
- Layer 1 morphology FST eval β proves the decoder-only fix is insufficient and the corpus side is required
- 2026-05-28 night-2 postmortem β original postmortem driving the v0.6.2 recipe
- Street-supplement architecture β the layered design context