Direction E โ the Geographic Rule Engine
Date: 2026-06-05
Status: design, signed off (operator + DeepSeek 3-turn consult)
Provenance: Japan cheap-probe (scripts/diag-japan-parse.ts) + DeepSeek consult (.agents/skills/deepseek-consult/session-notes-2026-06-05-japan-resolver-architecture.md)
Sibling: Direction D / epic #239 โ that one is the parser (anchor-based parsing); this one is the resolver (what happens after the parse).
Where this came fromโ
We went looking at Japan to learn what the non-Latin story costs us, and the probe split the problem in two so cleanly that the whole design falls out of it. Run a real Japanese address through the current v0.7.2 universal parser and you see two failures that are nothing alike:
- (A) The model is out-of-distribution. Romaji in Western order parses almost correctly (
1-1-1 Ginza, Chuo-ku, Tokyo 104-0061โ housenumber / street / locality / postcode, all roughly right). Native kanji scatters the labels. This is a coverage problem โ trainable โ and in the meantime the postcode is a clean regex hit in _every sample, so coordinate-first resolution carries Japan exactly the way it carried out-of-distribution Dutch to 94.9%. The coarse win needs no parser change at all. - (B) The schema cannot represent the address. There is no tag for ไธ็ฎ/็ช/ๅท โ chลme/banchi/gล, the district/block/building coordinates that most of Japan uses instead of streets. The model jams them into
house_number/street/postcodebecause those are the only slots that exist. A perfectly trained model still couldn't emit a correct structure. Training can never fix this; it's a vocabulary gap.
That distinction is the spine of everything below. (A) lives in the resolver. (B) is the one thing that has to change in the parser โ and we're going to change it the smallest way that works.
The thesisโ
The machinery after the parse is not one monolithic resolver, and it is not a Matryoshka of nested resolver objects either (orthogonal concerns โ grid morphology vs postal schema vs alias lists โ don't fit a single inheritance tree, and adding a city shouldn't mean a code deploy). It's a flat Geographic Rule Engine:
A
conventiontable keyed by Who's-On-First polygon id, deep-merged up the ancestor chain (Sapporo's row inherits Hokkaidล's inherits Japan's), that dispatches a short ordered sequence of named strategy functions. The data carries the routing and parameters; the strategies are a small registry of code primitives. "Plug in and out as needed" means a row in a table plus a pick from existing primitives โ new code only for a genuinely novel spatial system.
Two things fall out for free:
- The dispatch chicken-and-egg dissolves. To pick Sapporo's rules you must know you're in Sapporo โ but
postcode โ WOF locality โ ancestorsalready hands you the entire admin chain in the coarse pass we run today. Convention selection rides on it at zero cost. - The two audiences are one pipeline at two depths. Our need (postcode โ region/locality, for training gold and coarse geocode) and a user's need (a specific address down to a building) are the same machine stopped at different depths, not two systems. The 2025 Digital Address code is just an early-exit pure-lookup strategy at the front.
This also retires an older open question. Direction D imagined a coarse-placer that decides which locale model to reach for (#244/#245). With the rule engine, one universal parser plus a geography-keyed convention table replaces the model router for every Latin locale and every locale the postcode can anchor. We mostly don't need to reach for a different model.
The one foundational decisionโ
The seductive idea is to throw out the semantic tags and have the parser emit positional levels (L0โฆLn) that the convention names per locale. It's elegant and would serve every future country. It is also the highest-regret move on the board: it detonates four working locales, forces a from-scratch retrain plus a resolver-contract rewrite, with NL's 94.9% to re-earn from zero โ and it breaks on partial queries, where the level index of a bare "Berlin" is undefined until you've already resolved the country. We are not doing that, and Phase 3 below exists only to record the tripwire that would ever reopen it.
The additive cut that buys ~80% of the value with zero regression risk: keep every existing semantic tag, and add exactly one new multi-valued tag โ locator[] โ for sub-locality spatial units that aren't a Western street + number (chลme/banchi/gล, grid coordinates, block numbers). Street-based countries never emit it; block-based countries never emit street; the convention maps locator[] per locale. DE/FR/GB/NL never reference it, so they cannot regress.
Make it less special: Japan, South Korea, Taiwan togetherโ
Building only for Japan would bake Japanese quirks into something we'd call "general." So the design is validated against three CJK block-style systems at once โ Japan (chลme/banchi/gล), South Korea (๋๋ก๋ช
road-name and ์ง๋ฒ lot systems), Taiwan (ๆฎต/ๅทท/ๅผ/่). If the convention table + locator[] + the strategy registry express all three without per-country code beyond data, the abstraction has earned the word "general."
Update (2026-06-05) โ the CJK coarse build is a name-match, not PIP. Phase-1 implementation surfaced a structural finding (confirmed across JP/KR/TW): WOF admin geometry is point-based at the municipality level โ there are no municipality polygons โ so the European point-in-polygon coordinate-first build is inapplicable to CJK (it gives ~25% JP coverage). CJK resolves by authoritative name-match instead: postcode โ national-postal-authority municipality NAME (JP = KENALL) + GeoNames point โ cross-placetype match (
locality+county+localadmin+borough) โ WOF id, written to the _samepostcode_localitytable so the samepostcode_area_resolutionstrategy consumes it. This is the convention engine earning its keep exactly as designed โ a different build for a different data shape, feeding one unchanged resolver. Japan: 94.9% built, 93.9% resolved (independent gold). PIP-containment is replaced by name-agreement as the CJK metric. Full write-up:docs/articles/evals/2026-06-05-cjk-arena.md.
Sequencing (reversible-first)โ
Phase 1 โ resolver-side, additive, reversible, no parser change. This is largely reshaping what we already have โ addressing-conventions.json is a primitive convention table, and coordinate-first is postcode_area_resolution โ into the general form, then adding rows for JP/KR/TW. Delivers coarse locality geocoding off the postcode alone, the same trick that got Dutch-OOD NL to 94.9%. Pure resolver work, the part of the stack that has never bitten us.
Phase 2 โ parser-side, additive, gated. Teach the parser to emit locator[] from a handful of JP/KR/TW examples. Existing tags untouched, model rollback-able. This inherits every retrain fragility we've hit (the German end-of-string collapse, the PR3 negative), so it's gated on per-locale-F1 + resolver utility exactly like every other retrain, and Phase 1 must prove the engine first.
Phase 3 โ deferred, decision-only. Revisit L0โฆLn only if onboarding more systems (India, China, โฆ) proves the additive scheme genuinely insufficient โ by then with real data and a mature engine, as a migration rather than a leap.
The coarse/fine split maps onto a safe/risky split: Phase 1 is cheap and near-certain; building-level precision (Phase 2+) is a separate, genuinely harder bet, and we may live on Phase 1 for a good while.
The datasets, as strategy primitivesโ
Heterogeneous authoritative sources sit behind the uniform strategy interface; the resolver never cares which source answered. Built from source, never a prebuilt dump โ same discipline as every other table we ship.
| Dataset | Primitive | Tier |
|---|---|---|
| KEN_ALL (JP postal authority: 7-digit โ prefecture/municipality/district) | postcode_area_resolution (+ block_area_lookup once block polygons are georeferenced) | coarse (+ fine) |
| JIGYOSYO (JP large-org / PO-box codes) | org_code_lookup | fine |
| Digital Address 2025 (JP 7-char code โ registered building) | digital_code_lookup (early-exit) | fine |
| ๋๋ก๋ช ์ฃผ์ / ์ง๋ฒ (KR) | postcode_area_resolution + named_linear_feature_with_offset | coarse + fine |
| TW postal + ๆฎต/ๅทท/ๅผ/่ | postcode_area_resolution + block_area_lookup | coarse + fine |
| WOF admin polygons (already shipped) | ancestor walk for convention selection | all |
Seed strategy registry: postcode_area_resolution, block_area_lookup, intersection_resolver, named_linear_feature_with_offset, alias_normalisation, transliterate_reorder (pre-parse), digital_code_lookup, org_code_lookup, fallback_fuzzy_name_match.
Pre-registered decision rulesโ
- Phase 1 promotion: a locale ships when postcode-route PIP-containment clears ~85% on a real OA sample, with no regression to DE/FR/GB/NL (the existing coordinate-first numbers are byte-stable).
- Phase 2 gate: the
locator[]fine-tune promotes only on per-locale-F1 + resolver utility, and never if it drops an existing locale >2pp. Default to not promoting; live on Phase 1's coarse win. - The
locator[]shape, the first-ship scope, and the convention-asset format are open forks to settle at Phase 1 kickoff (plain ordered slot vs typed; JP-only branch vs general refactor; WOF-id-keyed read-only sqlite asset likepostcode-locality-intl.db). - Phase 3 tripwire (the only thing that reopens
L0โฆLn): the additivelocator[]scheme fails to express a newly-onboarded system as data and forces per-country parser-output post-processing in โฅ2 locales.