Night 6 postmortem — 2026-06-05 (CJK arena: Japan lands)
Window: 04:18 → 14:00 UTC. Modal budget: $15 — came in at ~$11.55, under the cap (operator-confirmed). My overnight "may have exceeded" flag was a false alarm: the slow v2 continue-train wasted wall-clock and some GPU-hours, but the dollars stayed contained because I caught and stopped it. The step-rate lesson below still stands. Status: final.
What shipped
- #292: Japan coarse resolution (PR #303, merged). The first CJK locale, and it validated the whole arena architecture. WOF has no municipality polygons in CJK (point geometry, confirmed JP/KR/TW), so the European point-in-polygon build is inapplicable (~25% JP). Pivoted to an authoritative name-match build (KEN_ALL romanized municipality + GeoNames point → cross-placetype WOF match) feeding the _same_ `postcode_area_resolution` strategy with zero new resolver code. Build 94.9%, end-to-end resolver 98.5% (KEN_ALL gold) / 93.9% (independent GeoNames cross-check), all above the 85% bar. EU (DE/FR/GB/NL) byte-identical after the merge.
- CJK arena eval report (PR #304, merged): `docs/articles/evals/2026-06-05-cjk-arena.md`, plus the Direction-E design-doc correction (it had assumed PIP was uniform).
- #293: Korea coarse resolution (PR #312, merged; EXPERIMENTAL, not promoted). The second CJK locale, and the real test of "less special": Korea's data inverts Japan's, so the build inverts. GeoNames postal KR already carries `postcode → (place_name, province, lat, lon)` (Hangul names), and WOF's `names` table holds Hangul (kor+und), so KR is point-primary (nearest WOF locality by coordinate, Hangul name as a confirmation signal), feeding the same `postcode_area_resolution` strategy with zero resolver code changed. Province + coordinate land at 100% (p50 0.96 km); the name tier is 26.3% (vs JP's 94.9%), a WOF-data ceiling (ri-granularity offset + no 구 urban-district localities), not an architecture limit. Caught and fixed a homonym bug (global name-match landed 500 km away → proximity-constrained). Honest writeup: `docs/articles/evals/2026-06-05-kr-point-primary.md`. TW (#294) confirmed blocked one rung lower: GeoNames postal TW is a hard 404 upstream (no postcode→point input at all), so even point-primary can't run; `admin-tw.db` is unbuilt, but building it is premature without the postal source (Chunghwa Post, a deliberate acquisition).
- CJK provenance in the build manifest (PR #307): pinned the JP WOF repo commits + KEN_ALL fetch chain + GeoNames points source (reproducibility / build-from-source discipline).
- Two v0.8.0 training experiments (configs #306 ls=0.05, #308 bare-street + the `bareProb` corpus feature): run, evaluated, neither promoted (verdict below). The durable wins: the harness failure analysis (143 targetable vs 175 blocked) and the reusable `bareProb` synthesizer.
- Issues: filed #305 (the exact-tier/conflict-flag design question); logged KR (#293) + TW (#294) data blockers; groomed #14 (Japan milestone).
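The point-primary idea used for Korea (nearest locality by coordinate, name only as a confirmation signal) can be sketched as follows. This is a minimal illustration under stated assumptions, not the shipped build: the `Locality` record, the `match_point_primary` function, and the `max_km` radius are hypothetical names and values, and the distance uses the standard haversine formula.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


@dataclass(frozen=True)
class Locality:
    wof_id: int  # hypothetical identifier field
    name: str
    lat: float
    lon: float


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def match_point_primary(
    place_name: str,
    lat: float,
    lon: float,
    localities: Iterable[Locality],
    max_km: float = 25.0,  # assumed proximity cap, not a value from the build
) -> Optional[Tuple[Locality, float, bool]]:
    """Pick the nearest locality within max_km; report whether its name agrees.

    The coordinate decides the match. The name is compared only against the
    nearest candidate, so a same-named place far away can never win.
    """
    best: Optional[Tuple[Locality, float]] = None
    for loc in localities:
        d = haversine_km(lat, lon, loc.lat, loc.lon)
        if d <= max_km and (best is None or d < best[1]):
            best = (loc, d)
    if best is None:
        return None
    loc, d = best
    return loc, d, loc.name == place_name
```

With two same-named candidates, one a few hundred metres from the postcode point and one hundreds of kilometres away, only the near one can be returned. A global name lookup has no such guard, which is the shape of the homonym bug described above.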
Training verdict (both runs — NEITHER promoted; v0.7.2 stays default)
| run | per-tag gate | harness | verdict |
|---|---|---|---|
| ls=0.05 | postcode +4.2pp, but street −4.1 / venue −4.5 / house_number −2.4 (3 tags >2pp) | +1.6pp (20.2%) | not promoted — calibration trades postcode for street/venue; fails the gate, not significant |
| bare-street | street −2.8pp (golden full-address streets shifted) | +0.9pp (19.5%); usa +7pp (22→29%) but functional 2/34→1/34 — target missed; net not significant | not promoted — redistributes (US harness up, golden street down), fails the gate |
| bare-street v2 (continue, stopped) | — (+4k steps only) | harness 20.7%; functional STILL 1/34 (main pl/10th ave still → locality) | not promoted — the no-house-number fix did not move the cluster |
Post-handoff update (re-engaged on the auto-classifier's nudge): the bare-street v2 retry FAILED, and a budget caution. v2 corrected the shard (47% pure-bare streets vs v1's ~9% — the diagnosed no-house-number gap, #310) and continue-trained the v1 checkpoint. At +4k steps the functional cluster did not budge — the model's bare-input→locality prior (100k steps deep) didn't yield. The bare-format-shard approach has now failed twice.
Then I tested the decode-time lever too (free, no GPU): the street-morphology FST. It also failed, and revealed why the whole cluster is stuck. (1) The FST biases adjacent tokens away from `dependent_locality`, but the functional failures are plain `locality`: a target mismatch. (2) I added an opt-in `locality` bias and it still didn't flip them (0/6, even at penalty 4.0), because the FST doesn't even fire on bare abbreviated suffixes (`ave`/`st`/`pl` in `10th Ave`).
So five distinct levers now fail the functional cluster: ls=0.05, bare-street v1, bare-street v2, the morphology FST (dep_locality target), and the morphology FST (locality target). The model labels bare `10th ave` as `locality` with immovable confidence, and the FST suffix-matcher doesn't engage on bare 2-token inputs.
Conclusion: the functional/bare-input gap is a deep model+FST-matching property, not a data or decode-bias problem. It needs the FST suffix-matcher extended to bare inputs AND/OR a different model treatment of bare spans. (The `localityPenalty` extension was reverted as unproven. The diagnostic is `scripts/diag-functional-morphology.ts`, kept local.)
⚠️ The v2 continue-train ran pathologically slowly (~0.3 steps/s vs ~15 fresh; +4k steps over a long wall-clock window) before I caught it and stopped it. (Budget update: the shift came in at ~$11.55, under the $15 cap. The slow run wasted GPU-hours but not dollars, because I stopped it.) Root cause unknown (throttled instance / resume-IO); don't continue-train again without watching the real step-rate (the rate field is broken on resume). My miss: I trusted the rate field and didn't check the step-rate against the wall clock sooner.
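Since the reported rate field cannot be trusted on resume, the real step-rate has to come from two (step, wall-clock) observations. The sketch below shows one way to do that; `Checkpoint`, `real_step_rate`, `is_stalled`, and the `min_fraction` threshold are hypothetical names and values, not part of the training harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    step: int      # global training step at observation time
    wall_s: float  # wall-clock seconds (e.g. from time.time()) at observation time


def real_step_rate(prev: Checkpoint, cur: Checkpoint) -> float:
    """Steps per second measured from the wall clock, ignoring any logged rate."""
    dt = cur.wall_s - prev.wall_s
    if dt <= 0:
        raise ValueError("observations must be in increasing wall-clock order")
    return (cur.step - prev.step) / dt


def is_stalled(
    prev: Checkpoint,
    cur: Checkpoint,
    expected_rate: float,
    min_fraction: float = 0.25,  # assumed alarm threshold
) -> bool:
    """True when the measured rate falls below a fraction of the expected rate."""
    return real_step_rate(prev, cur) < expected_rate * min_fraction


# A resumed run that advanced 300 steps in 1000 s, against ~15 steps/s fresh.
prev = Checkpoint(step=100_000, wall_s=0.0)
cur = Checkpoint(step=100_300, wall_s=1_000.0)
stalled = is_stalled(prev, cur, expected_rate=15.0)  # True: 0.3 steps/s < 3.75
```

For scale: at ~0.3 steps/s, 4k steps take about 3.7 hours, where ~15 steps/s would need under 5 minutes. A check like this after the first few hundred resumed steps would have flagged the run early.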
The honest answer to "can we deliver a v0.8.0 harness breakthrough tonight?": no, not with the safe levers. Calibration trades tags; the bare-street shard helped US contexts but missed its functional target and regressed golden street. The real harness gap is the 175/318 untrained-locale failures (the multi-locale PARSER problem), which is the unsolved German-end-of-string-collapse direction, not something to brute-force autonomously. The two runs are clean, informative data points (the ls=0.05 fork is now answered; the bare-street weight 0.2 over-shifts), but v0.7.2 remains the right default. The shift's real model win is the JP RESOLVER, not a new parser.
Experiments + baselines
- ls=0.05 (`ap-VMb3...`) and bare-street (`ap-hzrf...`): both 100k steps on A100, detached, concurrent, exported + evaluated. Verdict above; neither promoted.
v0.7.2 baselines (the current default, kept as default):
- Harness (the operator's "v0 test-suite coverage"): neural 18.6% pass (77/415) vs v0 93.7%; both-pass 17.1%, v0-only 76.6%, neural-only 1.4%. Per-file: usa 22%, intersection 17%, functional 6%, fra 33%, nld 9%.
- Per-tag golden (the pre-publish gate): exact-match 31.4%; locality 39.1%, region 64.9%, postcode 79.3%, street 47.8%, house_number 80.5%.
- Differentiators: the 22 falsehoods in the harness (`/tmp/v072-harness.json`).
- bare-street (#308): the harness lever, from the analysis below. Val macro_f1 0.81. Result: usa +7pp but functional target missed, golden street −2.8pp → not promoted (verdict above).
ls=0.05 calibration — DONE, NOT PROMOTED (the staged fork is answered). Per-tag vs v0.7.2: postcode +4.2pp (calibration's win) but street −4.1, venue −4.5, house_number −2.4 (three tags regress >2pp → fails the pre-publish gate). Exact-match flat (31.2 vs 31.4). Harness +1.6pp (20.2% vs 18.6%) — real but not "significant," and the per-tag regressions disqualify it anyway. Verdict: ls=0.05 trades postcode for street/venue, net not a clean win; calibration is not the v0.8.0 lever. The fork the operator staged is closed.
The harness analysis that drove the second run. Broke down the 318 v0-only harness failures (where v0 passes, neural fails):
- 143 in-distribution (US / intersection / functional): safely targetable. The `functional.test.ts` cluster (32/34) is bare street names (`10th Ave`, `Main St`, `1 Main Pl`) mislabeled `locality`, because `synthesizeStreetRow` only ever emitted streets with a `, City, ST ZIP` tail. → the bare-street shard (#308): teach bare streets → street, the bare-format analogue of intersection-bare. Potential ~+5-8pp toward the 25% bar, no German-collapse risk.
- 175 untrained-locale (deu 17/17, nzd 22/22, nld 20/22, place.fra 13/13, …): BLOCKED. v0.7.2 trains US+FR only; these locales fail because they're out-of-distribution. Covering them is the multi-locale PARSER problem, which is the known-unsolved German-end-of-string-collapse direction (the v0.8.0 order-shard reverted for exactly this). Not attempted autonomously: it needs the anchor-based / collapse-fix work, not a naive retrain.
So the honest answer to "can we move the harness tonight?": the safe lever (calibration) can't (adds no coverage); the bare-street lever can, partially (the in-distribution cluster); the big gap (untrained locales) is blocked on unsolved parser work.
Both runs were exported + evaluated (per-tag `eval-error-analysis.ts` + `harness-v0-neural.ts` vs the v0.7.2 baselines). Neither cleared the promote gate (harness ↑ meaningfully AND no tag regressing by more than 2pp); see the verdict table above. v0.7.2 stays the default; nothing uploaded to HF.
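The promote gate described here is a conjunction of two checks, and a small sketch makes the verdicts easy to reproduce. The function name `promote_gate` and the `min_harness_gain_pp` value are assumptions: the log only says the harness gain must be "meaningful" and gives no numeric bar, so the 5.0 below is a placeholder. The per-tag deltas are the ones reported for the two runs.

```python
from typing import Dict, List, Tuple


def promote_gate(
    per_tag_delta_pp: Dict[str, float],
    harness_delta_pp: float,
    min_harness_gain_pp: float,          # assumed: no numeric bar is stated in the log
    max_tag_regression_pp: float = 2.0,  # the ">2pp" per-tag rule
) -> Tuple[bool, List[str]]:
    """Return (promote?, tags that regressed by more than the allowed margin)."""
    regressed = sorted(
        tag for tag, delta in per_tag_delta_pp.items()
        if delta < -max_tag_regression_pp
    )
    promote = harness_delta_pp >= min_harness_gain_pp and not regressed
    return promote, regressed


# Deltas vs v0.7.2, in percentage points, as reported in the verdict table.
ls005 = {"postcode": 4.2, "street": -4.1, "venue": -4.5, "house_number": -2.4}
bare_street = {"street": -2.8}

ls005_verdict = promote_gate(ls005, harness_delta_pp=1.6, min_harness_gain_pp=5.0)
bare_verdict = promote_gate(bare_street, harness_delta_pp=0.9, min_harness_gain_pp=5.0)
```

Both runs fail on the per-tag check alone, so the verdict does not depend on whichever harness threshold is chosen.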
What went well
- Probe-before-build paid off twice. Quantifying the point-geometry wall (25%, then the cross-placetype jump to 94.9%) before committing a production build avoided shipping a broken recipe, and the cross-placetype insight (JP municipalities split across `locality`/`county`/`localadmin`/`borough`) was the whole unlock.
- The convention engine earned its keep exactly as designed: JP is a different build (name-match) feeding one unchanged resolver. No special-casing.
- Independent cross-check (GeoNames vs KEN_ALL) kept the headline number honest (93.9%, non-circular).
- Byte-stability discipline caught nothing because nothing broke — every shared-asset merge was guarded and the EU dump stayed identical.
What could've gone better
- The exact-name-tier "fix" (#305) looked like a quick win and turned out to entangle with the conflict-flag design — caught it by reading the code before implementing, but it cost a detour.
- KR/TW stalled on external data (gov sites geo/login-walled, no GeoNames TW) — logged-and-pivoted per plan, no spin, but the arena only advanced one locale tonight.
Decisions made autonomously
- Launched the ls=0.05 run early (06:00, not the planned 12:00 gate) to maximize eval/react time, since the config was a clean single-variable fork reusing v0.7.2's exact corpus + tokenizer (low risk). Honest expectation logged: unlikely to clear the "significant harness gain" bar; run as a data point closing the staged fork.
- Launched the bare-street run (concurrent, second A100) after the harness analysis showed a clean in-distribution lever, and committed it (#308) before knowing the result. Honest call: a real, safe shot at the operator's harness goal; it didn't pan out, but the analysis + the `bareProb` feature are durable.
- Did NOT attempt the multi-locale parser retrain (the 175/318 untrained-locale failures): that's the unsolved German-collapse direction, the wrong thing to brute-force in an autonomous session.
- Did NOT promote either run — both fail the per-tag >2pp gate and neither is significant. Default to don't-ship; v0.7.2 stays.
- Deferred #305, KR/TW rather than spin on walled data / a byte-stable-path risk.
Open questions for the operator
- The v0.8.0 harness goal is blocked on the multi-locale PARSER problem. The safe levers (calibration, bare-street) can't deliver "significant harness improvement" — 175/318 v0-only failures are untrained locales (deu/nzd/nld/…), and covering them re-triggers the German end-of-string collapse. This is the real next prize and it needs the anchor-based / collapse-fix work, not another shard. Worth a focused (non-autonomous) push?
- #305: name-wins-and-flag (current) vs postcode-wins when the exact-name is cross-region? Affects EU byte-stability — your call.
- KR/TW: worth the manual fetch (KR Juso romanized / TW Chunghwa Post) the way you fetched KEN_ALL, or shelve CJK at Japan for now?
Concrete next steps
- bare-street follow-up (cheap, diagnosed): the functional cluster didn't move because the model still labels pure bare streets (`main pl`, `10th ave`) as `locality`, and the cause is a flaw in my shard: `includeHouseNumberProb=0.85`, so 85% of the bare rows carried a house number (`62 NW Lakeview Cir E`). The with-number case improved (usa +7pp) but the no-number case (the functional pattern, `10th Ave` alone) stayed undertrained. The fix is a bare-street shard with `includeHouseNumberProb` ~0.3–0.5 (many pure `Main St` / `10th Ave` rows) at a lower weight (0.1) to avoid the golden-street regression. One-config retry.
- KR name tier (the real follow-up): Korea shipped experimental at 26% name-confirmed; the path to a JP-grade tier is the Juso / 도로명주소 road-name database (carries 구/동 natively), which is government-key-walled: a deliberate acquisition, not a scrape. The point-primary build (`build-postcode-locality-kr.py`) is in place and waiting for it.
- TW: needs a national postal source first. GeoNames postal TW is a confirmed 404, so there's no postcode→point input. Chunghwa Post zip data is the acquisition; `admin-tw.db` (WOF repo on disk, 19.7k features) is a one-command build after that, not before.
- #305 design decision → careful PR with the full EU resolver-eval guard.
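The house-number knob in the bare-street follow-up can be illustrated with a small sampler. This is a Python analogue written for illustration only; the real synthesizer is not shown in this log, so `synthesize_bare_street_row`, `pure_bare_fraction`, and the label names are hypothetical. It shows how the house-number probability controls the share of pure bare rows.

```python
import random
from typing import List, Sequence, Tuple

Row = Tuple[List[str], List[str]]  # (tokens, per-token labels)


def synthesize_bare_street_row(
    rng: random.Random,
    streets: Sequence[str],
    include_house_number_prob: float,
) -> Row:
    """Emit a street-only row, optionally prefixed with a house number."""
    tokens = rng.choice(streets).split()
    labels = ["street"] * len(tokens)
    if rng.random() < include_house_number_prob:
        tokens = [str(rng.randint(1, 9999))] + tokens
        labels = ["house_number"] + labels
    return tokens, labels


def pure_bare_fraction(rows: Sequence[Row]) -> float:
    """Share of rows that carry no house number at all."""
    return sum(1 for _, labels in rows if "house_number" not in labels) / len(rows)


rng = random.Random(0)
streets = ["10th Ave", "Main St", "Main Pl"]
rows = [synthesize_bare_street_row(rng, streets, 0.4) for _ in range(10_000)]
share = pure_bare_fraction(rows)  # close to 0.6 when the probability is 0.4
```

At a probability of 0.85 only about 15% of rows are pure bare streets; dropping it into the 0.3–0.5 range makes them half or more of the shard.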
Numbers
| metric | value |
|---|---|
| shift window | 04:18 → 14:00 UTC (substantive work wrapped ~11:10) |
| PRs merged | #303 (JP), #304 (CJK report), #306 (ls=0.05), #307 (manifest), #308 (bare-street v1), #309 (postmortem), #310 (bare-street v2), #311 (v2 result), #312 (KR coarse) |
| issues filed/updated | #305 filed; #293 advanced (KR shipped experimental); #294 confirmed-blocked (postal 404); #14 commented |
| models trained | 2 (ls=0.05, bare-street) — neither promoted; v0.7.2 stays default |
| Modal cost | ~$11.55 / $15 (under cap; 2× A100 ~1.8h each + 3 exports + the stopped v2 run) |
| NaN incidents | 0 |
| CI failures | 1 (transient registry-network flake on #304, re-ran green) |
| regressions shipped | 0 (EU byte-identical; nothing promoted) |