v0.9.2 both-order retrain — the anchor and the word order fight over German

Date: 2026-06-06 · Run: v0.9.2-de-bothorder (A100, 20k, from-scratch, single variable vs the v0.9.1 anchor-on pilot) · Question: the German "collapse" was largely an eval-order artifact — the model trained native German order, our OA eval renders US/international order. Does training the German shard in both orders close the international-order gap?

The honest answer: partly, and not where it counts yet — and the 2×2 says something more interesting than a single number would have.

The 2×2 (DE locality-match, 3,000 real OpenAddresses points, through the resolver)

The one variable vs the pilot is the German shard's rendering (native-only → ~60/40 native/international, --intl-fraction 0.4). Everything else is held: anchor on, self-conditioning on, seed 42.

DE locality	v0.9.2 anchor OFF	v0.9.2 anchor ON	pilot anchor OFF	pilot anchor ON
native	48.2%	82.1%	48.4%	83.8%
international (US-order)	48.4%	44.5%	35.9%	45.5%

Two real movements, and they pull in opposite directions:

Both-order training improved the model's intrinsic international parsing. Anchor-off, international order went 35.9% → 48.4% (+12.5pp). The model genuinely learned something about the US/feed layout it had never seen — without the anchor, it now reads both orders at ~48%, where before it could only read the native one.
But with the anchor fed — the production config — international is unchanged (45.5% → 44.5%), because the anchor now hurts on international order. Anchor-off international (48.4%) beats anchor-on (44.5%). The anchor still helps native enormously (+34pp), so it's the anchor that creates the whole native/international asymmetry: without it the model is order-agnostic at ~48%; with it, native rockets to 82% and international slips to 44%.
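The two movements are plain subtractions over the 2×2 table above. A minimal sketch that recomputes them; the dictionary layout and the function names are illustrative, not part of the eval harness:

```python
# DE locality-match (%) copied from the 2x2 table, keyed by (word order, anchor state).
V092 = {
    ("native", "off"): 48.2, ("native", "on"): 82.1,
    ("international", "off"): 48.4, ("international", "on"): 44.5,
}
PILOT = {
    ("native", "off"): 48.4, ("native", "on"): 83.8,
    ("international", "off"): 35.9, ("international", "on"): 45.5,
}

def anchor_effect(table, order):
    """Percentage-point change from switching the anchor on, for one word order."""
    return round(table[(order, "on")] - table[(order, "off")], 1)

def retrain_effect(new, old, order, anchor):
    """Percentage-point change from the pilot to the retrain, for one cell."""
    return round(new[(order, anchor)] - old[(order, anchor)], 1)

print(retrain_effect(V092, PILOT, "international", "off"))  # 12.5
print(retrain_effect(V092, PILOT, "international", "on"))   # -1.0
print(anchor_effect(V092, "native"))                        # 33.9
print(anchor_effect(V092, "international"))                 # -3.9
```

The last two lines are the asymmetry in one place: the same anchor is worth about +34pp in native order and about −4pp in international order.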

The harm concentrates exactly where you'd predict — Berlin, the colliding 10xxx postcode. Berlin international goes 40.5% (anchor off) → 34.4% (anchor on): the ambiguous {DE, US} posterior, injected at the trailing postcode, drags the already-read city toward the US reading. In native order the postcode sits before the city and primes it correctly; in international order it trails, and the anchor fires too late, on the wrong side of the locality.

The residual gap has a clear, cheap cause

Why is international still ~48% when native is 82%? The model mangles a tail it was never trained on. Our eval renders real US-order German with the region in the tail ("27 Straußstraße, Berlin, Berlin 12623"; "1 Talstraße, Mülsen, Sachsen 08132"), but the international synth dropped the region ("…, Berlin, 12623"). Parsing the eval's rows through v0.9.2 shows exactly that wound:

Straußstraße, Berlin, Berlin 12623  →  street="straße, Berlin"   (locality DROPPED into street)
Talstraße, Mülsen, Sachsen 08132     →  locality="Mü" + "sen, Sachsen"  (split + region absorbed)
Goetheallee, Dresden, Sachsen 01309 →  locality="Dresden" + locality="Sachsen"  (region mis-tagged)

The model never saw a German City, Region Postcode tail, so it doesn't know how to segment one — the region gets swallowed into the locality or the locality collapses. That is the v0.9.3 fix: render the international order with the region in the tail, matching real US/feed renderings.
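The difference between the two international renderings, and the ~60/40 mix set by --intl-fraction 0.4, can be sketched as follows. This is an illustration under assumptions, not the synth's actual code: the function names, the address dict keys, and the exact native layout ("street number, postcode city") are all hypothetical.

```python
import random

def render_native(addr):
    """Assumed native German layout: street and house number, then postcode before city."""
    return f"{addr['street']} {addr['house_number']}, {addr['postcode']} {addr['city']}"

def render_international(addr, with_region=True):
    """US/feed layout: house number first, postcode trailing.

    with_region=False reproduces the v0.9.2 synth tail (region dropped);
    with_region=True is the City, Region Postcode tail the eval actually renders.
    """
    head = f"{addr['house_number']} {addr['street']}, {addr['city']}"
    if with_region:
        return f"{head}, {addr['region']} {addr['postcode']}"
    return f"{head}, {addr['postcode']}"

def render_shard_row(addr, rng, intl_fraction=0.4):
    """Pick international order with probability intl_fraction, native otherwise."""
    if rng.random() < intl_fraction:
        return render_international(addr)
    return render_native(addr)

addr = {"house_number": "27", "street": "Straußstraße", "city": "Berlin",
        "region": "Berlin", "postcode": "12623"}
print(render_international(addr))                     # 27 Straußstraße, Berlin, Berlin 12623
print(render_international(addr, with_region=False))  # 27 Straußstraße, Berlin, 12623
print(render_native(addr))                            # Straußstraße 27, 12623 Berlin
```

The second print is the shape the model was trained on in v0.9.2; the first is the shape the eval feeds it.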

No collapse, no regression

Per-tag val F1 (anchor-aware, both-order val): locality 0.820, postcode 1.00, street 0.950, house_number 0.998, region 0.703, macro 0.781. Locality is intact — none of the v0.8.0/PR3 collapse signature. Training was clean (no NaN), cross_pollution 0.00% throughout, locale_acc 0.965.
Functional check — all 6 demo presets parse identically to the v0.6.0 default (the both-order German change does not touch US).
No-regression — US 97.5% (pilot 96.1%), FR 84.9% (pilot 85.1%). Both held; the both-order German shard didn't cost the incumbents a thing.

Verdict — not shipped; a research result that earns its next experiment

This is the multi-locale research line, not the production default (v0.6.0 stays the shipped en-US model), so there's no promotion at stake — the gate's job was to tell us whether both-order training works. It does, intrinsically (+12.5pp anchor-off), but the production-anchor config is flat because the anchor and the word order now fight. v0.9.2 is not shipped. What it bought us is two precise, cheap next steps:

Render the region in the international tail (the synth currently drops it) — the direct fix for the City, Region Postcode collapse the diagnostic caught.
Make the anchor word-order-aware, or down-weight it when the postcode trails the locality — it helps when the postcode leads (native) and hurts when it trails (international).

v0.9.3 follow-up: we rendered the region tail. The corpus wasn't the ceiling.

We took the first fix — render the City, Region Postcode tail the synth had been dropping — and retrained (v0.9.3, single variable, DeepSeek-signed). The 2×2 came back almost identical to v0.9.2:

v0.9.3 DE locality (%)	anchor OFF	anchor ON
native	48.3	83.6
international	47.1	44.7

US 97.2 / FR 84.9 held; no collapse. The region tail did exactly what it was designed to: international region-match went 0% → ~40% (Berlin 45.6, Sachsen 30.9), and Berlin's international locality nudged 34.4 → 38.7. Overall locality stayed flat, though. The model now segments the tail correctly; it just doesn't move the number we care about, because that number was never capped by the tail. The anchor-on international gap is 83.6 − 44.7 = 38.9pp, nearly four times the 10pp the pre-registered gate allowed.
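The gate arithmetic is small enough to write down. A sketch, assuming the gate is simply "anchor-on native minus anchor-on international must not exceed 10pp"; the function names and the inclusive comparison are assumptions, and only the 10pp threshold and the two inputs come from the run:

```python
def order_gap_pp(native_on, international_on):
    """Anchor-on gap between native and international locality-match, in pp."""
    return round(native_on - international_on, 1)

def passes_gate(native_on, international_on, max_gap_pp=10.0):
    """True when the order gap is within the pre-registered allowance."""
    return order_gap_pp(native_on, international_on) <= max_gap_pp

print(order_gap_pp(83.6, 44.7))  # 38.9
print(passes_gate(83.6, 44.7))   # False
```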

So the corpus lever — both-order rendering, then the region tail — is exhausted for the anchor-on international gap. Three independent lines now point at the same place: anchor-off is order-agnostic at ~48%, forcing the posterior to DE=1.0 changed nothing, and rendering the region tail changed nothing. The ceiling is the anchor's structure — an additive vector injected at the postcode token, which lands on the wrong side of the locality when the postcode trails it.

Next is architectural, and it's the operator's call. The designed fix (v0.9.4) is dual-injection: inject the same anchor at the postcode token and the first token, an order-independent global signal the locality can attend back to. It's a training-code change, not model surgery — the model already adds the anchor per-token, so it's a matter of placing one at position 0. But three iterations in without a promotion, the honest question DeepSeek raised is whether a trained-in always-on anchor is the right design at all: it's a +35pp native win and a −4pp international loss, and an order-conditioned anchor is the alternative.
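As a rough picture of what dual-injection means, here is a toy version on plain Python lists. It assumes the anchor is one additive vector scaled by a confidence c and added to per-token hidden states; the function name, the list-of-lists representation, and the choice to add only once when the postcode already sits at position 0 are all assumptions made for illustration, not the training code.

```python
def inject_anchor(hidden, anchor, postcode_idx, c, dual=True):
    """Return a copy of `hidden` with c * anchor added at the postcode token
    and, when dual=True, also at position 0.

    hidden: list of per-token vectors; anchor: one vector of the same width.
    A set is used for the targets so a postcode at position 0 is not hit twice.
    """
    out = [list(row) for row in hidden]
    targets = {postcode_idx, 0} if dual else {postcode_idx}
    for t in targets:
        out[t] = [h + c * a for h, a in zip(out[t], anchor)]
    return out

hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # 3 tokens, width 2
anchor = [0.5, -0.5]

print(inject_anchor(hidden, anchor, postcode_idx=2, c=1.0))
# [[0.5, -0.5], [1.0, 1.0], [2.5, 1.5]]
print(inject_anchor(hidden, anchor, postcode_idx=2, c=0.0) == hidden)
# True
```

The second call shows the c=0 identity: with zero confidence the hidden states come back unchanged, whichever positions are targeted.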

v0.9.4 dual-injection: the anchor's asymmetry doesn't yield to architecture either

We took the fork. DeepSeek's signed-off fix was to pool the anchor and inject it also at position 0, an order-independent cue the locality can attend back to regardless of where the postcode sits (it explicitly preferred this over fragile position-gating). It's a clean change: no new parameters, the c=0 identity holds, de-risked locally (8/8 anchor tests, ONNX export equivalent to the baseline). And it did nothing to the international gap:

v0.9.4 DE locality (%)	anchor OFF	anchor ON
native	48.3	83.5
international	47.3	43.7

US 96.4 / FR 84.9 held; native beats Pelias. The anchor still hurts international (off 47.3 > on 43.7) — the position-0 cue changed nothing. Three retrains now tell one story:

retrain	native-on (%)	international-on (%)
v0.9.2 both-order corpus	82.1	44.5
v0.9.3 region tail	83.6	44.7
v0.9.4 dual-injection	83.5	43.7

The anchor-on international number is immovable at ~44%. We threw corpus order coverage, the region tail, and an architectural change at it; none of them moved it. The anchor's native-help (+35pp) / international-hurt (−4pp) asymmetry is fundamental — not a corpus gap, not the region tail, not where the anchor sits in the sequence. And it's a real geographic miss, not a metric mirage: PIP-containment is 96.3% on native German, 57% on international.

So the corpus and the architecture are both exhausted, and the honest conclusion is the one DeepSeek raised at the very start: a trained-in, always-on anchor may simply be the wrong shape for order robustness. The next move is a strategic choice, not another tuning pass — an order-conditioned anchor, or accepting the asymmetry (native German is excellent and that's what production would ship), or a different international approach entirely. It's also cross-cutting: the same anchor would carry the same asymmetry into every locale, so it gates the multi-locale retrain program. v0.6.0 stays the production default throughout.

The decision: accept the asymmetry, and the anchor learns its direction per locale

With the operator away and the data in, we put the fork to DeepSeek, which held delegated authority for the shift, across a third consult turn. Its call was (b): accept the asymmetry, and burn no more GPU. Three independent retrains had failed to move the anchor-on international number; the native gain (+35pp) is large, robust, and generalizes (US 96.4, FR 84.9 hold), while the international penalty (−4pp) is small and stable. Anchor-off international already lands near 48% locality-match, competitive with the non-neural baselines, so a model can serve international-order German through the c=0 identity path and lose nothing the anchor was giving it. This is the research line; v0.6.0 ships unchanged regardless.

The turn earned its keep on the why, and it reframes the whole multi-locale question. The anchor's directionality is learned from each locale's dominant training order, not baked into the architecture. US addresses always train with the postcode trailing the city, so the US anchor learns to look left for the locality — which is exactly why it never hurt US. German's native order puts the postcode before the city, so the German anchor learns to look right; feeding it both orders asked a single additive vector to point two ways at once, and it settled on an average that serves the dominant native order and misfires on the trailing one. So the asymmetry is locale-specific learning showing through, and the multi-locale plan is constrained rather than dead: give each locale its own anchor direction. The highest-value next thread is a country-conditioned anchor vector — the DE anchor specializes on postcode-before-city, the US anchor on postcode-after-city, no shared vector forced to compromise. That is a future iteration with its own gate, not a one-run patch, so it waits.

One cheap operational lever falls out of (b) in the meantime, no GPU and no retrain: route an input to the c=0 (anchor-off) path when a lightweight order check (the postcode's position relative to the locality in the raw string) says the layout is international, and keep the anchor on when the postcode leads. The research-line model would then read native German at 83.5% and international German at ~47% (the v0.9.4 anchor-off figure, 47.3%), taking the better of the two columns on each address. Filed against #327 alongside the country-conditioned-anchor thread.
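One way such an order check could look, as a sketch only. It uses a cruder proxy than the one described: instead of locating the locality, it asks whether a five-digit postcode is the last thing in the string. The function names, the regex, and the default of keeping the anchor on when no postcode is found are all assumptions.

```python
import re

# German postcodes are five digits; this pattern is a heuristic, not a validator.
POSTCODE_RE = re.compile(r"\b\d{5}\b")

def postcode_trails(raw):
    """True when the last five-digit run is the final token of the string."""
    matches = list(POSTCODE_RE.finditer(raw))
    if not matches:
        return False
    # Nothing but separators may follow the postcode.
    return raw[matches[-1].end():].strip(" ,.") == ""

def anchor_confidence(raw):
    """0.0 routes to the anchor-off (c=0) path, 1.0 keeps the anchor on."""
    return 0.0 if postcode_trails(raw) else 1.0

print(anchor_confidence("27 Straußstraße, Berlin, Berlin 12623"))  # 0.0
print(anchor_confidence("Straußstraße 27, 12623 Berlin"))          # 1.0
```

A string check like this runs before the model sees the input, which is why it costs no retrain. Its failure mode is a native-order address with nothing after the postcode, which would be routed to the weaker anchor-off path.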

Artifacts: corpus v0.4.2-de-bothorder + v0.4.3-de-regiontail; configs v0.9.2-de-bothorder.yaml + v0.9.3-de-regiontail.yaml + v0.9.4-de-dualanchor.yaml; the 2×2 harness scripts/eval/de-order-eval.sh; follow-up issue #327. Sibling: the order-artifact correction in 2026-06-06-anchor-pilot.md.
