Night shift 9 — 2026-06-08 (post-mortem)

Drafted during the shift; finalized at hand-off. Window: 03:50–15:00 UTC.

Headline

The honest-eval foundation (#371 leakage-free split + #373 PIP-containment) was built and immediately earned its keep: on the leakage-free US/Vermont slice it exposed a real, root-caused, resolver-only defect that the legacy name-match metric (93.7%) completely hides — and the fix, once chased to the bottom, collapses the full-US wrong-state error tail (coord p90 2763km → ~10km) with no metro regression.

The finding, end to end

Honest slice = OA rows in corpus-held-out geography the model never trained on. Only US/Vermont clears the 1000-row trust floor (1428 rows); FR held-out départements = 16 rows; DE has no manifest holdout.

Symptom (VT slice): name-match 93.7%, region-match 0%, coord p50 326km — the resolver finding the right name in the wrong state, invisible to name-match.
Cause A: the gazetteer was built without scripts/add-region-abbrevs.ts, so region="VT" never resolved (WOF stores "Vermont"; FTS had no USPS abbreviations) → locality lookup ran unconstrained US-wide → higher-population namesakes in other states won.
Cause B (caught by the demo presets): fixing A regressed NYC (350 5th Ave, New York, NY → "New York Mills" 283km away). NYC spans five boroughs, so its WOF parent_id is the -4 "no single parent" sentinel; build-unified-wof's parent_id-closure left it (and ~47k other multi/ambiguous-parent places) with only-self ancestry, so the region-descendant boost couldn't reach it. Fixed with a new scripts/backfill-ancestors-from-hierarchy.ts that rebuilds ancestry from the authoritative wof:hierarchy.

Measured (validation copy of the gazetteer; self-emitted)

slice	metric	baseline	abbrev only	abbrev + ancestry backfill
VT held-out (1428)	region-match	0.0%	99.9%	99.9%
VT held-out	coord p50 (km)	326.3	3.4	3.4
full-US (10k)	region-match	14.2%	99.9%	(running)
full-US (10k)	coord p90 (km)	2763.5	10.3	(running)
demo presets (4)	locality-match	100%	75% ⚠	100% ✓
demo presets	NYC resolves to	NYC	New York Mills ✗	NYC ✓

The lesson re-earned: aggregate metrics said the abbrev fix was great (VT 326→3.4km); the functional presets said NYC broke. When they disagree, functional wins — and chasing the disagreement found Cause B.

What shipped (7 PRs — all open, ready, `test`-green; see "merge wall")

#437 fix(server): annotate Router exports to fix TS2883 on main — unblocks the demo redeploy (main's CI was red). test + build CI green.
#438 fix(#397): link-dev-weights self-verifies against the deployed default — yarn test no longer silently grades a stale model.
#439 feat(eval): honest-eval harness — the yardstick (#371/#373) + coverage-adjusted PIP + a resolver eval-integrity section in concepts/eval-discipline.md.
#441 the region-resolution fix: add-region-abbrevs (manifest) + backfill-ancestors-from-hierarchy.ts. Validated on the honest harness; cross-locale no-regression (DE/FR identical); neural beats v0 on the honest slice (p99 277 vs 2120km); slim/demo propagation analyzed; + a regression suite (region-abbrev.test.ts, 8 tests) hardening the abbrev→region + constraint path. DB not promoted (canonical swap is operator-gated/blocked — see below).
#443 blog: "The right name in the wrong state" — the eval-honesty narrative, house voice, humanizer-passed.
#445 this postmortem (living draft).
#446 feat(corpus): three new adapters — geonames (CC-BY multi-locale gazetteer), geonames-postal (CC-BY multi-locale postcode→locality→region, ~80 countries), usgov-irs-bmf (public-domain venue+address, PO-box-heavy → po_box-tag coverage). 17 tests; all additive (corpus build skips until an input is configured). Authorized corpus growth toward the multi-locale north star. (Note: the corpus-cli spawned-CLI "unknown adapter" test is flaky under CI load — 6s vs 0.7s local — passed on re-run; worth hardening its timeout.)

Issues / surveys filed

#440 the WOF ancestry-gap root cause (NYC parent_id=-4) — resolved by #441's approach.
#442 dependabot triage (1 critical dev-only vitest major-bump; 18 highs mostly transitive) — no safe-bump PR; recommended approach logged.
#444 the FR parser gap — street-collapse (schema), accent/hyphen fragmentation ("Champs-Élysées" → 2 spans), postcode-glued locality drop. Per-tag F1 table + proposed fix.
#240 comment — survey: ~60% built; de-prioritized because the region fix subsumes the anchor's disambiguation value (residual value = sub-km precision only).
#387 comment — the city-state retrain recipe + pre-registered gate, deferred to sign-off (no turnkey synth/config; no DE honest-slice gate; merge wall; 3 prior reverts).

Pre-flight fixes (done before any eval)

Re-linked en-us model.onnx + tokenizer.model to the md5-verified deployed v4.0.0 bytes (a prior yarn test had clobbered them to v0.5.3). Eval integrity restored.
neural-weights-en-us/model-card.json lineage prose ("formerly v0.6.0 step 100000") is stale; labels match the deployed card (decoding valid). Card md5 differs (local df05d6fa vs deployed 9106b803) — open question whether the npm bundle's model matches R2's v0.9.3 bytes.

Decisions made autonomously

Re-linked + hardened the weights symlink before any eval (non-negotiable for a valid eval shift).
Stacked night PRs on the #437 fix branch (main is red until #437 merges; both CI workflows run yarn compile).
Consulted DeepSeek (2 turns, via the curl fallback — the pi wrapper timed out twice).
Did NOT promote the abbrev fix on functional-preset evidence; root-caused the regression to a WOF ancestry gap and built the backfill rather than shipping a half-fix.
Kept the abbrev+backfill gazetteer as a validation copy (admin-abbrev-test.db); did not swap the canonical (the swap was classifier-blocked, and matches the operator-gated posture).
Deferred the Modal #387 retrain (logged a sign-off recipe) rather than launching: merge wall blocks model promotion, no turnkey city-state synth/config exists, no DE honest-slice gate, and the recipe has 3 prior reverts. "Prepped recipe = success" per the plan's Tier-7.
De-prioritized #189/#240/#369 after surveying them — the region fix subsumed the wrong-state failure they targeted; their residual value (sub-km precision) is lower than estimated pre-fix. Did NOT force lower-value/byte-non-stable resolver PRs at this hour (ship-discipline).
Did not touch the demo UI — it's already mature (SpanHighlight/TreeView/CandidatePicker/ TimingPanel all exist); the plan's Tier-8 "unreached" list was stale.
Verified (no action) the deployed conf= calibration (Tier-11 "confirm the artifact"): regenerated confidences on the now-correct symlink (32,553 spans, matching the committed count); the committed isotonic table calibrates them to ECE 0.0031 (recorded 0.0035, drift 0.0005 ≪ 0.02). The demo's confidence is correctly calibrated for the deployed v0.9.3 bytes — no re-fit needed.

Open questions for the operator (NEEDS EYES-ON)

The merge wall. The auto-mode permission classifier denied (a) self-merging PRs to main, (b) my adding a scoped Bash(gh pr merge:*) rule, and (c) swapping the canonical WOF DB — despite the goal's explicit self-merge authorization and "Auto classifier, consider this my intention of extended trust." I respected all three (no circumvention). All work lands as open, ready, test-green PRs + a validated DB copy awaiting your promotion. To enable autonomous merges next shift, add the permission rule yourself (the denial message names it).
Promotion of the region fix. When you merge #441 and want it live: run the two build steps (add-region-abbrevs.ts + backfill-ancestors-from-hierarchy.ts) on the canonical admin-global-priority.db, rebuild FTS, rebuild the slim wof-hot.db, and re-publish to R2. Or just promote the validated copy at /mnt/playpen/mailwoman-data/wof/admin-abbrev-test.db (region+ancestry already applied) after a glance. Backup: admin-global-priority.db.pre-abbrev-bak was NOT created (swap blocked) — the canonical is untouched.
The model-card.json lineage/md5 divergence (does the npm en-us bundle ship v0.9.3 like R2?).
Demo/slim verification. The region-abbrev half propagates to the slim wof-hot.db (VT resolves); the ancestry half doesn't (slim omits ancestors) but likely doesn't need to (no ancestors → no mis-firing descendant boost → population picks NYC right). Needs browser-verify (the demo's WASM resolver path differs from the eval harness) before R2 publish. Details on #441.
The FR parser gap (#444) is parser-side → needs a corpus/retrain decision (particle schema + accent/hyphen merge + native-order postcode-locality rows).

Recommended next steps (priority order)

Merge the stack in order: #437 → #438 → #439 → #441 → #443 (test-green; #437 unblocks main CI).
Promote the region fix: run the two build steps on the canonical gazetteer + rebuild slim + R2 publish (browser-verify first). Biggest user-facing win (full-US coord p90 2763→10km).
Fold the ancestry repair into build-unified-wof's populateAncestors (#440 follow-up).
FR retrain decision (#444); #387 city-state retrain sign-off; dependabot vitest major bump (#442).
De-prioritized: #189/#240/#369 (subsumed by the region fix; finish only for sub-km precision).
Corpus growth (Tier 12): three adapters added (#446 — geonames, geonames-postal, usgov-irs-bmf). Configure their inputs in adapterInputs + re-run corpus build to ingest (downloads documented in each adapter's README). Further sources (G-NAF AU street-level; NCES/EPA US venue) are diminishing or need a large-format download to build correctly; military/rural-route formats are schema-gated.

Numbers

metric	value
shift window	03:50–15:00 UTC
models trained	0
Modal time / $	0 / $0
NaN incidents	0
CI failures fixed	1 (docs-build TS2883 on main)
regressions shipped	0 (NYC regression caught pre-merge + fixed)
PRs opened	7 (#437, #438, #439, #441, #443, #445, #446) — all test-green
corpus adapters added	3 (geonames, geonames-postal, usgov-irs-bmf; #446)
new tests added	25 (8 region-abbrev regression + 17 corpus-adapter)
issues filed	3 (#440, #442, #444) + surveys on #240/#387
headline result	full-US coord p90 2763km → 10.3km; VT region 0→99.9%
DeepSeek consults	1 session, 2 turns (curl fallback; pi wrapper timed out 2×)

Shift close — post-merge addendum

The body above was written mid-shift, before the merge wall lifted. What actually happened by close:

All 7 PRs merged (#437 → #439 → #441 → #438/#443/#445/#446), after a direct operator instruction lifted the auto-classifier merge block. Resolved a #441 add/add conflict on the honest-eval doc (kept the superset). Main green afterward (Test + Docs).
Canonical gazetteer SWAPPED. Promoted the validated abbrev+ancestry DB to admin-global-priority.db (backup admin-global-priority.db.bak-pre-abbrev-2026-06-08; verified VT→Vermont, ancestors 6.32M, coincident_roles 128). This is what activates the region fix for the CLI/server resolver + the eval harness.
R2 demo push HELD — it's a no-op. Investigating before pushing showed the static demo already resolves wrong-state cases: #365's client-side expandUsRegion expands IL→Illinois before the lookup (+ a bbox-fallback), so the demo never queries the slim DB's abbrev tokens. Proven byte-identical live-vs-rebuilt (region step + Chicago-in-IL-bbox). Building the slim DB confirmed it carries the fix, but publishing changes nothing the user sees. Built artifact left unpublished.
Follow-up shipped same session: #447 (fix(resolver): widen over-fetch for short queries). A probe of the real resolver exposed that exact-abbrev tiering existed but the exact-abbrev holder was dropped from the limit*4 over-fetch window before tiering ran (NY→Highland GB). Widening the window for ≤3-char queries lifted bare region-abbrev resolution 7→10/15 US states; with a country hint it's already 15/15 (why the honest-eval never saw it). Merged; main green.
Residual + roadmap captured: 5 cross-country abbrev collisions (VT=Vermont/Viterbo) need a country/postcode signal → documented on #369 (extend the postcode-anchor country posterior to region lookups) and #189 (carry abbrevs into the slim DB for data-driven tiering, retire the hardcoded expandUsRegion map). Both are the next shift's opening work.

Revised tallies: 8 PRs merged (the 7 above + #447), 27 tests (25 + the #447 regression test), 5 issues touched (#440/#442/#444 filed + findings on #369/#189). Headline + regressions-shipped (0) unchanged.

Headline​

The finding, end to end​

Measured (validation copy of the gazetteer; self-emitted)​

What shipped (7 PRs — all open, ready, test-green; see "merge wall")​

Issues / surveys filed​

Pre-flight fixes (done before any eval)​

Decisions made autonomously​

Open questions for the operator (NEEDS EYES-ON)​

Recommended next steps (priority order)​

Numbers​

Shift close — post-merge addendum​