Night 6 postmortem — 2026-06-05 (CJK arena: Japan lands)

Window: 04:18 → 14:00 UTC. Modal budget: $15 — came in at ~$11.55, under the cap (operator-confirmed). My overnight "may have exceeded" flag was a false alarm: the slow v2 continue-train wasted wall-clock and some GPU-hours, but the dollars stayed contained because I caught and stopped it. The step-rate lesson below still stands. Status: final.

What shipped

#292: Japan coarse resolution (PR #303, merged). The first CJK locale, and it validated the whole arena architecture. WOF has no municipality polygons in CJK (point geometry, confirmed JP/KR/TW), so the European point-in-polygon build does not apply (it reaches only ~25% on JP). Pivoted to an authoritative name-match build (KEN_ALL romanized municipality + GeoNames point → cross-placetype WOF match) feeding the same postcode_area_resolution strategy with zero new resolver code. Build 94.9%, end-to-end resolver 98.5% (KEN_ALL gold) / 93.9% (independent GeoNames cross-check), all above the 85% bar. EU (DE/FR/GB/NL) byte-identical after the merge.
CJK arena eval report (PR #304, merged) — docs/articles/evals/2026-06-05-cjk-arena.md, plus the Direction-E design-doc correction (it had assumed PIP was uniform).
#293 — Korea coarse resolution (PR #312, merged; EXPERIMENTAL, not promoted). The second CJK locale, and the real test of "less special": Korea's data inverts Japan's, so the build inverts. GeoNames postal KR already carries postcode → (place_name, province, lat, lon) (Hangul names), and WOF's names table holds Hangul (kor + und) — so KR is point-primary (nearest WOF locality by coordinate, Hangul name as a confirmation signal), feeding the same postcode_area_resolution strategy with zero resolver code changed. Province + coordinate land at 100% (p50 0.96 km); the name tier is 26.3% (vs JP's 94.9%) — a WOF-data ceiling (ri-granularity offset + no 구 urban-district localities), not an architecture limit. Caught and fixed a homonym bug (global name-match landed 500 km away → proximity-constrained). Honest writeup: docs/articles/evals/2026-06-05-kr-point-primary.md. TW (#294) confirmed blocked one rung lower — GeoNames postal TW is a hard 404 upstream (no postcode→point input at all), so even point-primary can't run; admin-tw.db is unbuilt but building it is premature without the postal source (Chunghwa Post, a deliberate acquisition).
CJK provenance in the build manifest (PR #307) — pinned the JP WOF repo commits + KEN_ALL fetch chain + GeoNames points source (reproducibility / build-from-source discipline).
Two v0.8.0 training experiments (configs #306 ls=0.05, #308 bare-street + the bareProb corpus feature) — run, evaluated, neither promoted (verdict below). The durable wins: the harness failure analysis (143 targetable vs 175 blocked) and the reusable bareProb synthesizer.
Issues: filed #305 (the exact-tier/conflict-flag design question); logged KR (#293) + TW (#294) data blockers; groomed #14 (Japan milestone).
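The KR point-primary recipe above (nearest WOF locality by coordinate, Hangul name as a confirmation signal only) can be sketched roughly as follows. This is an illustration, not the code in build-postcode-locality-kr.py: the `Locality` and `Resolution` types, the `resolve_point_primary` name, the 25 km `max_km` cutoff, and the sample rows with their rough coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt
from typing import Optional, Sequence

EARTH_RADIUS_KM = 6371.0


@dataclass(frozen=True)
class Locality:
    wof_id: int  # hypothetical ids, not real WOF ids
    name: str
    lat: float
    lon: float


@dataclass(frozen=True)
class Resolution:
    locality: Locality
    distance_km: float
    name_confirmed: bool


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two WGS84 points, in kilometres."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def resolve_point_primary(
    place_name: str,
    lat: float,
    lon: float,
    localities: Sequence[Locality],
    max_km: float = 25.0,  # assumed cutoff, not a value from the build
) -> Optional[Resolution]:
    """Pick the nearest locality by coordinate; the name only confirms the pick."""
    nearest = min(
        localities,
        key=lambda loc: haversine_km(lat, lon, loc.lat, loc.lon),
        default=None,
    )
    if nearest is None:
        return None
    distance = haversine_km(lat, lon, nearest.lat, nearest.lon)
    if distance > max_km:
        return None  # nothing plausible nearby: refuse rather than guess
    return Resolution(nearest, distance, nearest.name == place_name)


# Two same-named localities far apart (illustrative rows, approximate coordinates).
SAMPLE = [
    Locality(2, "고성", 34.97, 128.32),
    Locality(1, "고성", 38.38, 128.47),
]
```

Because the coordinate decides and the name only confirms, a same-named locality hundreds of kilometres away cannot win, which is the shape of the homonym fix described above.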

Training verdict (both runs — NEITHER promoted; v0.7.2 stays default)

run	per-tag gate	harness	verdict
ls=0.05	postcode +4.2pp, but street −4.1 / venue −4.5 / house_number −2.4 (3 tags >2pp)	+1.6pp (20.2%)	not promoted — calibration trades postcode for street/venue; fails the gate, not significant
bare-street	street −2.8pp (golden full-address streets shifted)	+0.9pp (19.5%); usa +7pp (22→29%) but functional 2/34→1/34 — target missed; net not significant	not promoted — redistributes (US harness up, golden street down), fails the gate
bare-street v2 (continue, stopped)	— (+4k steps only)	harness 20.7%; functional STILL 1/34 (`main pl`/`10th ave` still → `locality`)	not promoted — the no-house-number fix did not move the cluster

Post-handoff update (re-engaged on the auto-classifier's nudge): the bare-street v2 retry FAILED, and a budget caution. v2 corrected the shard (47% pure-bare streets vs v1's ~9% — the diagnosed no-house-number gap, #310) and continue-trained the v1 checkpoint. At +4k steps the functional cluster did not budge — the model's bare-input→locality prior (100k steps deep) didn't yield. The bare-format-shard approach has now failed twice.

Then I tested the decode-time lever too (free, no GPU): the street-morphology FST. It also failed, and revealed why the whole cluster is stuck. (1) The FST biases adjacent tokens away from dependent_locality, but the functional failures are plain locality: a target mismatch. (2) I added an opt-in locality bias and it still didn't flip them (0/6, even at penalty 4.0), because the FST doesn't even fire on bare abbreviated suffixes (ave/st/pl in 10th Ave).

So five distinct levers now fail the functional cluster: ls=0.05, bare-street v1, bare-street v2, the morphology FST (dep_locality target), and the morphology FST (locality target). The model labels bare 10th ave as locality with immovable confidence, and the FST suffix-matcher doesn't engage on bare 2-token inputs. Conclusion: the functional/bare-input gap is a deep model+FST-matching property, not a data or decode-bias problem. It needs the FST suffix-matcher extended to bare inputs AND/OR a different model treatment of bare spans. (The localityPenalty extension was reverted as unproven. The diagnostic is scripts/diag-functional-morphology.ts, kept local.)

⚠️ The v2 continue-train ran pathologically slowly (~0.3 steps/s vs ~15 fresh; +4k steps over a long wall-clock window) before I caught it and stopped it. (Budget update: the shift came in at ~$11.55, under the $15 cap; the slow run wasted GPU-hours but not dollars, because I stopped it.) Root cause unknown (throttled instance / resume-IO); don't continue-train again without watching the real step-rate (the rate field is broken on resume). My miss: I trusted the rate field and didn't check the step-rate against the wall clock sooner.
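One way to watch the real step-rate is to ignore the reported rate field and difference two (timestamp, step) samples against the wall clock. A minimal sketch under assumptions: `StepSample`, `wall_clock_rate`, `is_pathologically_slow`, and the 0.25 `floor_ratio` are hypothetical names and values, and the sample numbers are made up to reproduce the ~0.3 steps/s figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepSample:
    unix_time: float  # wall-clock seconds when the sample was taken
    step: int         # global training step at that moment


def wall_clock_rate(earlier: StepSample, later: StepSample) -> float:
    """Steps per second measured against the wall clock, not the logged rate."""
    elapsed = later.unix_time - earlier.unix_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (later.step - earlier.step) / elapsed


def is_pathologically_slow(observed: float, expected: float, floor_ratio: float = 0.25) -> bool:
    """Flag a run whose measured rate is below a fraction of the expected rate."""
    return observed < expected * floor_ratio


# Made-up samples: 180 steps in 10 minutes on a resumed run.
rate = wall_clock_rate(StepSample(0.0, 100_000), StepSample(600.0, 100_180))
slow = is_pathologically_slow(rate, expected=15.0)
```

Checking this a few minutes after a resume would have flagged the v2 run long before +4k steps had accumulated.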

The honest answer to "can we deliver a v0.8.0 harness breakthrough tonight?": no, not with the safe levers. Calibration trades tags; the bare-street shard helped US contexts but missed its functional target and regressed golden street. The real harness gap is the 175/318 untrained-locale failures (the multi-locale PARSER problem), which is the unsolved German-end-of-string-collapse direction and not something to brute-force autonomously. The two runs are clean, informative data points (the ls=0.05 fork is now answered; the bare-street weight 0.2 over-shifts), but v0.7.2 remains the right default. The shift's real model win is the JP RESOLVER, not a new parser.

Experiments + baselines

ls=0.05 (ap-VMb3...) and bare-street (ap-hzrf...) — both 100k steps on A100, detached, concurrent, exported + evaluated. Verdict above; neither promoted.

v0.7.2 baselines (the current default, kept as default):
- Harness (the operator's "v0 test-suite coverage"): neural 18.6% pass (77/415) vs v0 93.7%; both-pass 17.1%, v0-only 76.6%, neural-only 1.4%. Per-file: usa 22%, intersection 17%, functional 6%, fra 33%, nld 9%.
- Per-tag golden (the pre-publish gate): exact-match 31.4%; locality 39.1%, region 64.9%, postcode 79.3%, street 47.8%, house_number 80.5%.
- Differentiators: the 22 falsehoods in the harness (/tmp/v072-harness.json).
bare-street (#308) — the harness lever, from the analysis below. Val macro_f1 0.81. Result: usa +7pp but functional target missed, golden street −2.8pp → not promoted (verdict above).

ls=0.05 calibration — DONE, NOT PROMOTED (the staged fork is answered). Per-tag vs v0.7.2: postcode +4.2pp (calibration's win) but street −4.1, venue −4.5, house_number −2.4 (three tags regress >2pp → fails the pre-publish gate). Exact-match flat (31.2 vs 31.4). Harness +1.6pp (20.2% vs 18.6%) — real but not "significant," and the per-tag regressions disqualify it anyway. Verdict: ls=0.05 trades postcode for street/venue, net not a clean win; calibration is not the v0.8.0 lever. The fork the operator staged is closed.

The harness analysis that drove the second run. Broke down the 318 v0-only harness failures (where v0 passes, neural fails):

143 in-distribution (US / intersection / functional) — safely targetable. The functional.test.ts cluster (32/34) is bare street names (10th Ave, Main St, 1 Main Pl) mislabeled locality, because synthesizeStreetRow only ever emitted streets with a , City, ST ZIP tail. → the bare-street shard (#308): teach bare streets → street, the bare-format analogue of intersection-bare. Potential ~+5-8pp toward the 25% bar, no German-collapse risk.
175 untrained-locale (deu 17/17, nzd 22/22, nld 20/22, place.fra 13/13, …) — BLOCKED. v0.7.2 trains US+FR only; these locales fail because they're out-of-distribution. Covering them is the multi-locale PARSER problem, which is the known-unsolved German-end-of-string-collapse direction (the v0.8.0 order-shard reverted for exactly this). Not attempted autonomously — it needs the anchor-based / collapse-fix work, not a naive retrain.

So the honest answer to "can we move the harness tonight?": the safe lever (calibration) can't (adds no coverage); the bare-street lever can, partially (the in-distribution cluster); the big gap (untrained locales) is blocked on unsolved parser work.
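The baseline percentages and the 143/175 split are mutually consistent, which can be checked from the counts. In this sketch only 77, 415, 143 and 175 are stated in the postmortem; the both-pass count of 71 is back-computed from the reported 17.1% and should be read as inferred, not reported.

```python
TOTAL_CASES = 415
NEURAL_PASS = 77                        # stated: 77/415
TARGETABLE = 143                        # stated: in-distribution v0-only failures
BLOCKED = 175                           # stated: untrained-locale v0-only failures
V0_ONLY = TARGETABLE + BLOCKED          # 318
BOTH_PASS = 71                          # inferred from 17.1% of 415 (assumption)
NEURAL_ONLY = NEURAL_PASS - BOTH_PASS   # inferred, follows from the line above


def pct(count: int, total: int = TOTAL_CASES) -> float:
    """Percentage rounded to one decimal, as the report quotes them."""
    return round(100.0 * count / total, 1)


summary = {
    "neural_pass": pct(NEURAL_PASS),
    "v0_pass": pct(BOTH_PASS + V0_ONLY),
    "both_pass": pct(BOTH_PASS),
    "v0_only": pct(V0_ONLY),
    "neural_only": pct(NEURAL_ONLY),
    "blocked_share_of_v0_only": pct(BLOCKED, V0_ONLY),
}
```

Under these counts, a little over half of the v0-only failures sit in the blocked untrained-locale bucket.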

Both runs were exported + evaluated (per-tag eval-error-analysis.ts + harness-v0-neural.ts vs the v0.7.2 baselines). Neither cleared the promote gate (harness ↑ meaningfully AND no tag −2pp) — verdict table above. v0.7.2 stays the default; nothing uploaded to HF.
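The promote gate (harness up meaningfully AND no tag regressing by more than 2pp) can be written down as a small predicate. This is a sketch, not the project's gate code: the function names are hypothetical, and because the postmortem does not quantify "meaningfully", `min_harness_gain_pp` is a required parameter rather than a built-in constant. The deltas are the ones reported above.

```python
from typing import Dict, List


def regressed_tags(tag_deltas_pp: Dict[str, float], max_regression_pp: float = 2.0) -> List[str]:
    """Tags whose per-tag score dropped by more than the allowed margin."""
    return sorted(tag for tag, delta in tag_deltas_pp.items() if delta < -max_regression_pp)


def clears_promote_gate(
    harness_delta_pp: float,
    tag_deltas_pp: Dict[str, float],
    min_harness_gain_pp: float,  # assumption: caller decides what "meaningful" is
    max_regression_pp: float = 2.0,
) -> bool:
    """Both conditions must hold: enough harness gain and no disqualifying tag drop."""
    enough_gain = harness_delta_pp >= min_harness_gain_pp
    return enough_gain and not regressed_tags(tag_deltas_pp, max_regression_pp)


# Per-tag deltas vs v0.7.2 as reported in the verdict table.
LS_005 = {"postcode": 4.2, "street": -4.1, "venue": -4.5, "house_number": -2.4}
BARE_STREET = {"street": -2.8}

ls_blockers = regressed_tags(LS_005)
bare_blockers = regressed_tags(BARE_STREET)
```

Even with the harness threshold set to zero, both runs fail on the per-tag condition alone, which matches the verdict.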

What went well

Probe-before-build paid off twice. Quantifying the point-geometry wall (25%, then the cross-placetype jump to 94.9%) before committing a production build avoided shipping a broken recipe — and the cross-placetype insight (JP municipalities split across locality/county/localadmin/borough) was the whole unlock.
The convention engine earned its keep exactly as designed — JP is a different build (name-match) feeding one unchanged resolver. No special-casing.
Independent cross-check (GeoNames vs KEN_ALL) kept the headline number honest (93.9%, non-circular).
Byte-stability discipline caught nothing because nothing broke — every shared-asset merge was guarded and the EU dump stayed identical.
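A byte-stability guard of the kind mentioned above can be as simple as comparing digests of the resolver dump before and after a merge. A minimal sketch using only the standard library; the function names and the idea of comparing two dump files on disk are assumptions, not a description of the project's actual guard.

```python
import hashlib
from pathlib import Path
from typing import Union

PathLike = Union[str, Path]


def sha256_of(path: PathLike, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large dumps do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def byte_identical(before: PathLike, after: PathLike) -> bool:
    """True when two dumps have the same SHA-256 digest."""
    return sha256_of(before) == sha256_of(after)
```

Run against the EU dump from the base branch and from the merge candidate, a False result would block the merge.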

What could've gone better

The exact-name-tier "fix" (#305) looked like a quick win and turned out to entangle with the conflict-flag design — caught it by reading the code before implementing, but it cost a detour.
KR/TW stalled on external data (gov sites geo/login-walled, no GeoNames TW); logged and pivoted per plan, no spin, but the arena fully landed only one locale tonight (KR shipped experimental, TW stayed blocked).

Decisions made autonomously

Launched the ls=0.05 run early (06:00, not the planned 12:00 gate) to maximize eval/react time, since the config was a clean single-variable fork reusing v0.7.2's exact corpus + tokenizer (low risk). Honest expectation logged: unlikely to clear the "significant harness gain" bar; run as a data point closing the staged fork.
Launched the bare-street run (concurrent, second A100) after the harness analysis showed a clean in-distribution lever — and committed it (#308) before knowing the result. Honest call: a real, safe shot at the operator's harness goal; it didn't pan out, but the analysis + the bareProb feature are durable.
Did NOT attempt the multi-locale parser retrain (the 175/318 untrained-locale failures) — that's the unsolved German-collapse direction, the wrong thing to brute-force in an autonomous session.
Did NOT promote either run — both fail the per-tag >2pp gate and neither is significant. Default to don't-ship; v0.7.2 stays.
Deferred #305, KR/TW rather than spin on walled data / a byte-stable-path risk.

Open questions for the operator

The v0.8.0 harness goal is blocked on the multi-locale PARSER problem. The safe levers (calibration, bare-street) can't deliver "significant harness improvement" — 175/318 v0-only failures are untrained locales (deu/nzd/nld/…), and covering them re-triggers the German end-of-string collapse. This is the real next prize and it needs the anchor-based / collapse-fix work, not another shard. Worth a focused (non-autonomous) push?
#305: name-wins-and-flag (current) vs postcode-wins when the exact-name is cross-region? Affects EU byte-stability — your call.
KR/TW: worth the manual fetch (KR Juso romanized / TW Chunghwa Post) the way you fetched KEN_ALL, or shelve CJK at Japan for now?

Concrete next steps

bare-street follow-up (cheap, diagnosed): the functional cluster didn't move because the model still labels pure bare streets (main pl, 10th ave) as locality — and the cause is a flaw in my shard: includeHouseNumberProb=0.85, so 85% of the bare rows carried a house number (62 NW Lakeview Cir E). The with-number case improved (usa +7pp) but the no-number case (the functional pattern, 10th Ave alone) stayed undertrained. The fix is a bare-street shard with includeHouseNumberProb ~0.3–0.5 (many pure Main St / 10th Ave rows) at a lower weight (0.1) to avoid the golden-street regression. One-config retry.
KR name tier (the real follow-up): Korea shipped experimental at 26% name-confirmed; the path to a JP-grade tier is the Juso / 도로명주소 road-name database (carries 구/동 natively), which is government-key-walled — a deliberate acquisition, not a scrape. The point-primary build (build-postcode-locality-kr.py) is in place and waiting for it.
TW: needs a national postal source first — GeoNames postal TW is a confirmed 404, so there's no postcode→point input. Chunghwa Post zip data is the acquisition; admin-tw.db (WOF repo on disk, 19.7k features) is a one-command build after that, not before.
#305 design decision → careful PR with the full EU resolver-eval guard.
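The effect of the includeHouseNumberProb knob from the bare-street follow-up can be illustrated with a toy synthesizer. This is a Python stand-in, not synthesizeStreetRow: the function names, the fixed street list, and the row shape (a list of token/tag pairs) are assumptions made for the example.

```python
import random
from typing import List, Tuple

Row = List[Tuple[str, str]]  # (token, tag) pairs

# Bare patterns named in the functional cluster.
STREETS = ["Main St", "10th Ave", "Main Pl"]


def synthesize_bare_street_row(rng: random.Random, include_house_number_prob: float) -> Row:
    """Emit a street with no city/state/zip tail, optionally led by a house number."""
    row: Row = [(token, "street") for token in rng.choice(STREETS).split()]
    if rng.random() < include_house_number_prob:
        row.insert(0, (str(rng.randint(1, 9999)), "house_number"))
    return row


def pure_bare_fraction(rows: List[Row]) -> float:
    """Share of rows that contain street tokens only (the `10th Ave` alone case)."""
    pure = sum(1 for row in rows if all(tag == "street" for _, tag in row))
    return pure / len(rows)


def build_shard(seed: int, size: int, include_house_number_prob: float) -> List[Row]:
    rng = random.Random(seed)  # seeded so the shard is reproducible
    return [synthesize_bare_street_row(rng, include_house_number_prob) for _ in range(size)]
```

In this toy, the pure-bare share is roughly one minus the probability, so lowering the knob is what puts more number-free rows in front of the model.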

Numbers


shift window	04:18 → 14:00 UTC (substantive work wrapped ~11:10)
PRs merged	#303 (JP), #304 (CJK report), #306 (ls=0.05), #307 (manifest), #308 (bare-street v1), #309 (postmortem), #310 (bare-street v2), #311 (v2 result), #312 (KR coarse)
issues filed/updated	#305 filed; #293 advanced (KR shipped experimental); #294 confirmed-blocked (postal 404); #14 commented
models trained	2 (ls=0.05, bare-street) — neither promoted; v0.7.2 stays default
Modal cost	~$11.55 / $15 (under cap; 2× A100 ~1.8h each + 3 exports + the stopped v2 run)
NaN incidents	0
CI failures	1 (transient registry-network flake on #304, re-ran green)
regressions shipped	0 (EU byte-identical; nothing promoted)
