v0.9.4 dual-injection + the PIP pivot — the German "collapse" was two different problems

Verdict: v0.9.4 not promoted, and we stop retraining German. Three retrains (v0.9.2 both-order, v0.9.3 region-tail, v0.9.4 dual-injection) chased a single international-order locality-match number that turns out to conflate two unrelated problems — one of which is a measurement artifact, the other of which no anchor change can touch. A non-gameable metric (PIP-containment) pulled them apart. Tracking: #327.

v0.9.4 result

The dual-injection retrain (model.inject_first_token=true, same corpus, one variable changed vs v0.9.3) gave this 2×2 of DE locality-match by name:

	anchor OFF	anchor ON
US order	47.3%	43.7%
native DE	48.3%	83.5%

US holds at 96.4% and FR at 84.9%, so there is no regression. International anchor-ON locality is 43.7%, below v0.9.3's 44.7% and far under the 70% promote bar. Adding the position-0 country cue moved region-match (38→58%) and resolved-rate (72→88%) but not locality-match. Across all three retrains the anchor consistently lowers international locality-match (off ~47 > on ~44) while coordinate p50 holds at ~6 km. That pattern, where the geography stays right while the string-match metric drops, is the tell.

PIP-containment splits the problem in half

So we measured the thing that can't be gamed: is the gold OpenAddresses point inside the polygon of the locality we resolved? (scripts/eval/pip-containment.py, point-in-polygon over the WOF GeoJSON.)
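For readers who have not seen the script, here is a minimal sketch of what a containment check of this kind can look like. It is not the contents of scripts/eval/pip-containment.py; it assumes the resolved locality arrives as a GeoJSON Polygon or MultiPolygon geometry and the gold point as (lon, lat), and every function name is hypothetical.

```python
def _in_ring(pt, ring):
    """Ray casting: count crossings of a ray going east from pt."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i][0], ring[i][1]
        x2, y2 = ring[(i + 1) % n][0], ring[(i + 1) % n][1]
        # Only edges that straddle the horizontal line through pt can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def _in_polygon(pt, rings):
    # GeoJSON convention: first ring is the exterior, the rest are holes.
    if not rings or not _in_ring(pt, rings[0]):
        return False
    return not any(_in_ring(pt, hole) for hole in rings[1:])


def contains(geometry, pt):
    """True if pt (lon, lat) lies inside a GeoJSON Polygon or MultiPolygon."""
    if geometry["type"] == "Polygon":
        return _in_polygon(pt, geometry["coordinates"])
    if geometry["type"] == "MultiPolygon":
        return any(_in_polygon(pt, poly) for poly in geometry["coordinates"])
    return False


def pip_containment(rows):
    """rows: iterable of (gold_point, resolved_geometry_or_None).

    A row with no resolved geometry counts as a miss, so the score
    cannot be raised by resolving less often.
    """
    rows = list(rows)
    if not rows:
        return 0.0
    hits = sum(1 for pt, geom in rows if geom is not None and contains(geom, pt))
    return hits / len(rows)
```

Counting unresolved rows as misses is the design choice that matters here: the denominator is every evaluated row, not only the rows that resolved.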

intl DE (v0.9.4, 3000 rows)   name-match   PIP-containment   delta
OVERALL                          43.7%          56.1%        +12.4pp
Berlin   (n=1500)                36.3%          36.3%         -0.1pp
Sachsen  (n=1500)                51.1%          75.9%        +24.8pp

Two completely different stories under one average.

Saxony is a name-match artifact. The model and resolver place Saxon addresses correctly 76% of the time, but the name-match metric only credits 51% — a 24.8-point gap. The reason is visible in every miss: OpenAddresses gold carries a regional suffix that WOF's canonical name drops.

gold "Plauen Vogtl"   resolved "Plauen"       (point inside Plauen ✓)
gold "Chemnitz Sachs" resolved "Chemnitz"     (point inside Chemnitz ✓)
gold "Marienberg Erzgeb" resolved "Marienberg" (point inside Marienberg ✓)
gold "Treuen Vogtl"   resolved "Treuen"       (point inside Treuen ✓)

These are not errors. The resolver found the right place; the metric demands an exact string the gazetteer doesn't use. Retraining cannot fix a measurement bug.
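To make the mismatch concrete, here is a sketch of a comparison that tolerates a trailing regional abbreviation. It is an illustration under assumptions, not the resolver's code: the function names are hypothetical, the abbreviation list contains only the ones that appear in this post, and the sketch strips the suffix rather than expanding it.

```python
# Regional abbreviations seen in this post; a real list would be longer.
REGION_SUFFIXES = {
    "vogtl": "Vogtland",
    "sachs": "Sachsen",
    "erzgeb": "Erzgebirge",
    "ol": "Oberlausitz",
}


def strip_region_suffix(name: str) -> str:
    """Drop a trailing regional abbreviation, e.g. 'Plauen Vogtl' -> 'Plauen'."""
    tokens = name.split()
    # Never strip the only token, so a bare name is left untouched.
    if len(tokens) > 1 and tokens[-1].rstrip(".").lower() in REGION_SUFFIXES:
        tokens = tokens[:-1]
    return " ".join(tokens)


def locality_match(gold: str, resolved: str) -> bool:
    """Case-insensitive name match after removing regional suffixes."""
    return strip_region_suffix(gold).casefold() == strip_region_suffix(resolved).casefold()


for gold, resolved in [
    ("Plauen Vogtl", "Plauen"),
    ("Chemnitz Sachs", "Chemnitz"),
    ("Marienberg Erzgeb", "Marienberg"),
    ("Treuen Vogtl", "Treuen"),
]:
    print(gold, "->", resolved, locality_match(gold, resolved))
```

All four example pairs compare equal under this rule, while a genuinely different locality still fails the match.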

Berlin is a genuine failure, but not the one we were retraining for. Berlin's PIP equals its name-match (36.3%), so there's no hidden artifact. Looking closer: of 1,500 Berlin rows, 955 resolve to nothing at all, and the 545 that resolve all resolve correctly to Berlin. Berlin's problem is not a wrong pick or a too-small polygon. The model drops the locality span entirely in the city-state layout "…, Berlin, Berlin 14199", where locality and region are the same word. One "Berlin" gets labeled region, the other is lost, and there's nothing left for the resolver to place. This is a real segmentation bug, but it's specific to city-states (Berlin/Hamburg/Bremen) and has nothing to do with the postcode anchor or word order in general, which is exactly why three anchor/order retrains never touched it.
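The split between "resolved to nothing" and "resolved to the wrong place" is what separates a segmentation bug from a resolver bug. A minimal sketch of that breakdown follows, assuming each eval row exposes a resolved name (or None) and a containment flag; both field names are hypothetical, and the rows are rebuilt from the Berlin counts quoted above.

```python
from collections import Counter


def classify(row: dict) -> str:
    """Bucket one eval row by how it failed or succeeded."""
    if row["resolved_name"] is None:
        return "unresolved"  # nothing reached the resolver
    return "inside" if row["gold_inside"] else "outside"


def breakdown(rows) -> dict:
    """Return {bucket: (count, percent of all rows)} for the three buckets."""
    counts = Counter(classify(r) for r in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        bucket: (counts[bucket], round(100 * counts[bucket] / total, 1))
        for bucket in ("unresolved", "outside", "inside")
    }


# Berlin counts from the text: 955 unresolved, 545 resolved correctly.
berlin = [{"resolved_name": None, "gold_inside": False}] * 955
berlin += [{"resolved_name": "Berlin", "gold_inside": True}] * 545
print(breakdown(berlin))
```

An empty "outside" bucket means every miss is a dropped span, so a better resolver or larger polygons would not change the score.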

Native German is essentially solved

The other half of the story is native order, the layout real feeds and users actually use. By PIP it is 96.2% (anchor on), against a name-match of 83.5%, a 12.7pp undersell. Berlin native is 99.2%, so the city-state bug appears only in international order. The full 2×2 (scripts/eval/de-pip-eval.sh):

order	anchor	name-match	PIP-containment
native	on	83.5%	96.2%
native	off	48.3%	61.0%
intl	on	43.7%	56.1%
intl	off	47.3%	59.6%

Two things fall out. The anchor does real work on native order (PIP 61 → 96), and it slightly hurts international PIP (59.6 off → 56.1 on) — the anchor was never the international lever. And native German, measured honestly, beats Pelias comfortably and is done. The genuine residual is one thing: Berlin in international order. Everything else is the name-match artifact.
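The anchor's effect per order is just anchor-on minus anchor-off in each row of the 2×2. A small sketch of that arithmetic, using only the percentages from the table above; the dictionary layout and function name are illustrative.

```python
# (order, anchor) -> (name_match %, PIP-containment %), copied from the table.
RESULTS = {
    ("native", "on"): (83.5, 96.2),
    ("native", "off"): (48.3, 61.0),
    ("intl", "on"): (43.7, 56.1),
    ("intl", "off"): (47.3, 59.6),
}


def anchor_effect(order: str) -> dict:
    """Anchor-on minus anchor-off, in percentage points, for both metrics."""
    on_name, on_pip = RESULTS[(order, "on")]
    off_name, off_pip = RESULTS[(order, "off")]
    return {
        "name_match": round(on_name - off_name, 1),
        "pip": round(on_pip - off_pip, 1),
    }


for order in ("native", "intl"):
    print(order, anchor_effect(order))
```

Under either metric the anchor is worth about 35pp on native order and costs roughly 3.5pp on international order, so switching metrics changes the level of every cell but not the direction of the anchor's effect.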

Why three retrains "failed"

They optimized an aggregate that was half artifact, half a bug the lever couldn't reach. The anchor was never the problem (coord p50 ~6 km throughout); the metric was conflating a gazetteer-name mismatch in Saxony with a city-state parse failure in Berlin. With those separated, neither calls for another anchor retrain.

What's next (no A100)

DeepSeek signed off this pivot across four turns. The two real fixes, both off the GPU:

Resolver name-match: alias the regional suffixes (Vogtl→Vogtland, Sachs→Sachsen, Erzgeb→Erzgebirge, OL→Oberlausitz, …) so Plauen Vogtl credits Plauen. This recovers the 24.8pp Saxon gap in the eval metric without any model change. Filed as a follow-up.
Berlin city-state segmentation: the model needs to learn the "City, City Postcode" layout where locality == region. A focused data-augmentation pass (city-state addresses in both orders) is the fix, but it's a gated retrain for a future shift, not tonight. Filed as a follow-up.
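As a rough idea of what the city-state augmentation could emit, here is a sketch that renders one address in both orders with character-span labels. It is an assumption-heavy illustration, not the training pipeline: the label names, the span format, the function names, and the treatment of the street line as one opaque string are all hypothetical, and the example street is made up.

```python
CITY_STATES = ("Berlin", "Hamburg", "Bremen")


def build(pieces):
    """pieces: list of (text, label_or_None) -> (string, [(start, end, label)])."""
    out, spans, pos = [], [], 0
    for text, label in pieces:
        if label is not None:
            spans.append((pos, pos + len(text), label))
        out.append(text)
        pos += len(text)
    return "".join(out), spans


def city_state_examples(street_line: str, postcode: str, city: str):
    """Return (native, international) labelled renderings of one address."""
    if city not in CITY_STATES:
        raise ValueError(f"{city!r} is not a city-state")
    # Native German order: street line, then postcode and city.
    native = build([
        (street_line, "street"), (", ", None),
        (postcode, "postcode"), (" ", None),
        (city, "locality"),
    ])
    # International order: the city appears twice, once as locality
    # and once as region, which is the layout the model currently drops.
    intl = build([
        (street_line, "street"), (", ", None),
        (city, "locality"), (", ", None),
        (city, "region"), (" ", None),
        (postcode, "postcode"),
    ])
    return native, intl


native, intl = city_state_examples("Beispielstraße 1", "14199", "Berlin")
print(native[0])
print(intl[0])
```

Labelling the first occurrence as locality and the second as region is a choice made for this sketch; whichever convention the corpus uses, the point is that both spans carry a label so neither one is lost.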

v0.9.5 (the [COUNTRY_DE] start-token that was queued if PIP fell below 90%) is cancelled: the country signal helps neither half — Saxony is already correct, and Berlin's country is never in doubt. The corpus and anchor levers are spent; the German story moves to the resolver and a narrow city-state data fix.
