Phase 4: Resolver
Goal: add a resolver layer that takes parsed components and resolves them to canonical place identifiers + coordinates, with source provenance threaded through the output. The parser and geocoder share one representation: the same AddressTree the decoder already produces, decorated in-place.
Status: opened 2026-05-20, supersedes the original sketch. Phases 0-3 have shipped (@mailwoman/neural@2.1.0 on npm; CLI + per-component policy live). Real-world deployment has not yet generated feedback, but the team has accepted the architectural risk of beginning Phase 4 work now so that the output shape (src attr already reserved on the XML serializer) can land before downstream consumers depend on its absence.
Branch: sub-phase branches off main (feature/phase-4-<slice>). Each sub-phase ships independently.
Depends on: @mailwoman/core@2.x decoder pipeline (PR #58 lineage), @mailwoman/neural@2.x.
Why now, why this shape
Three forcing functions:
- The XML serializer already reserves the src attribute (serialize-xml.ts:18, ../../core/decoder/serialize-xml.ts). The TODO comment is a public commitment; shipping a release that adds the attr is non-breaking only because consumers don't depend on its absence yet. Every release that goes out without src makes the eventual flip costlier.
- The neural classifier emits proposals with source + source_id fields that the decoder discards. That's free debugging signal we're throwing away.
- Resolver feedback into parsing was the project creator's original vision (per reference/ARCHITECTURE.md's opening). The resolver is not bolted on: it shares the AddressTree representation.
Architecture decision: Option B (SQLite FTS5 + WOF SQLite)
The original sketch listed three options. This plan picks Option B.
- Option A (tantivy / Airmail): rejected for v1. Introduces Rust into the runtime, which contradicts the project's "TypeScript-first" hard constraint (docs/plan/README.md). Revisit only if Option B's recall floor is unacceptable at planet scale.
- Option C (external geocoder API): rejected as the default. Network dependency + rate limits + privacy implications are all hostile to a library that's meant to be embedded. We will expose a RemoteResolver adapter for users who prefer Pelias / BAN / Nominatim, but the in-package default is local.
- Option B (SQLite FTS5 + WOF SQLite): picked. WOF mirrors at data.geocode.earth/wof/dist/sqlite/ (per project-geocode-earth-voltron notes) ship as a known-good packaging. Pure Node via node:sqlite (built-in since Node 22.5) or better-sqlite3 (lighter dependency surface). Pros: zero new runtime languages, deterministic, offline-capable, fits the existing weights-* package shape (data packages downloaded on demand). Cons: slower at planet scale than tantivy, simpler ranking. That is acceptable for v1 because the parser narrows the search space (locality + region + country are already extracted).
Sub-phase breakdown
Phase 4 ships in three slices, each independently mergeable:
| Slice | Goal | Independently useful? |
|---|---|---|
| 4.1: Source provenance (this PR) | Thread source + source_id from ClassificationProposal through the decoder; AddressNode gains optional source + sourceId; XML serializer emits src attribute; JSON / tuple projections unchanged. | Yes: surfaces classifier provenance to debug + downstream filtering. No resolver yet. |
| 4.2: WOF SQLite loader package | New package @mailwoman/resolver-wof-sqlite (or fold into @mailwoman/neural? decide in 4.2). Loads a WOF SQLite distribution, exposes an FTS5-backed lookup findPlace({ locality, region, country, locale }) returning candidate WOF places with confidence. | Yes: usable standalone for "what's the WOF id for Paris, FR?" without going through the full parser. |
| 4.3: Resolver integration | Resolver interface + WofSqliteResolver impl + resolveTree(tree, resolver) that walks the AddressTree, queries the resolver per node, decorates with src="wof-admin:<id>", lat, lon, wof_id. CLI --resolve flag. | Closes the loop: outputs gain real-world identifiers. |
Sub-phases 4.2 and 4.3 will each get their own plan doc (PHASE_4_2_*.md, PHASE_4_3_*.md) written when they begin. This doc is the spine.
Phase 4.1: Source provenance (current)
Pre-flight
- PR #58 (decoder + 3 projections) merged.
- ClassificationProposal.source + source_id defined in core/types/.
Tasks
- Decoder types
  - core/decoder/types.ts: extend AddressNode with source?: string and sourceId?: string. Both optional; the existing decoder paths that emit AddressNode without these continue to work.
  - Update the file header to describe the provenance fields.
- proposalsToTree
  - core/decoder/proposals-to-tree.ts: carry p.source and p.source_id through into each emitted root. Drop the fields when the proposal lacks them (defensive: the type allows it).
- buildAddressTree
  - core/decoder/build-tree.ts: optional BuildTreeOpts { source?: string; sourceId?: string } param. The neural pipeline's caller stamps source: "neural" + sourceId: <model-card-version> on every emitted span. No per-span variation here: one model, one source.
- XML serializer
  - core/decoder/serialize-xml.ts: emit src="<value>" when node.source or node.sourceId is set. Format: src="<source>:<sourceId>" if both present, src="<source>" if only source. Add includeSrc?: boolean opt (default true) for callers who want to suppress.
  - Update the file header: drop "reserved for Phase 4" wording; replace with the actual semantics.
- JSON + tuple projections: explicitly unchanged
  - decodeAsJson stays libpostal-compat (shape: { tag: value }). No provenance.
  - decodeAsTuples stays [tag, value][]. No provenance.
  - Rationale documented in the file headers.
- Tests
  - core/decoder/provenance.test.ts (new): verify the src attr through both proposalsToTree and buildAddressTree paths; verify includeSrc: false suppresses it; verify JSON/tuple projections are unchanged when provenance is set.
  - Update existing serialize.test.ts only if necessary (existing fixtures don't set provenance, so src should be absent, and that's a feature).
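The provenance threading described in the tasks above could look roughly like the sketch below. It is an illustration under assumptions, not the shipped decoder: the ClassificationProposal and AddressNode shapes are trimmed down to the fields this plan mentions, and proposalToRoot and stampSpans are hypothetical stand-ins for the relevant parts of proposalsToTree and buildAddressTree.

```typescript
// Hypothetical, trimmed-down shapes. The real types live in core/types/
// and core/decoder/types.ts and carry more fields than shown here.
interface ClassificationProposal {
  tag: string
  value: string
  source?: string
  source_id?: string
}

interface AddressNode {
  tag: string
  value: string
  children: AddressNode[]
  source?: string
  sourceId?: string
}

interface BuildTreeOpts {
  source?: string
  sourceId?: string
}

// Stand-in for the proposalsToTree path: carry provenance onto the emitted
// root, and leave the keys off entirely when the proposal lacks them.
function proposalToRoot(p: ClassificationProposal): AddressNode {
  const node: AddressNode = { tag: p.tag, value: p.value, children: [] }
  if (p.source !== undefined) node.source = p.source
  if (p.source_id !== undefined) node.sourceId = p.source_id
  return node
}

// Stand-in for the buildAddressTree path: the caller passes one source and
// one sourceId, and every emitted span gets the same stamp.
function stampSpans(
  spans: Array<{ tag: string; value: string }>,
  opts: BuildTreeOpts = {},
): AddressNode[] {
  return spans.map((s) => {
    const node: AddressNode = { tag: s.tag, value: s.value, children: [] }
    if (opts.source !== undefined) node.source = opts.source
    if (opts.sourceId !== undefined) node.sourceId = opts.sourceId
    return node
  })
}
```

Omitting the keys (rather than setting them to undefined) keeps existing fixtures byte-identical under deep-equality checks, which is one way to satisfy the "existing tests pass unchanged" criterion.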
Success criteria
- All existing decoder tests pass unchanged.
- New provenance test passes for both decoder entry points.
- A decodeAsXml call on a proposal-derived tree emits <locality src="rule:whos_on_first" ...>Paris</locality>-style output.
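As a rough illustration of the src formatting rule and the output shape named in the last criterion, here is a minimal sketch. The names formatSrc, renderElement, and ProvenancedNode are hypothetical, not the serializer's real API. The plan does not say what happens when only sourceId is set, so the sketch assumes the attribute is omitted in that case.

```typescript
// Hypothetical flat node; the real AddressNode has children and more fields.
interface ProvenancedNode {
  tag: string
  value: string
  source?: string
  sourceId?: string
}

interface SerializeOpts {
  includeSrc?: boolean // defaults to true, per the plan
}

// Escape the characters that are unsafe in XML text and attribute values.
function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
}

// "<source>:<sourceId>" when both are present, "<source>" when only source is.
// Assumption: a node with only sourceId gets no src attribute.
function formatSrc(node: ProvenancedNode): string | undefined {
  if (node.source === undefined) return undefined
  return node.sourceId === undefined ? node.source : `${node.source}:${node.sourceId}`
}

function renderElement(node: ProvenancedNode, opts: SerializeOpts = {}): string {
  const includeSrc = opts.includeSrc ?? true
  const src = includeSrc ? formatSrc(node) : undefined
  const attr = src === undefined ? "" : ` src="${escapeXml(src)}"`
  return `<${node.tag}${attr}>${escapeXml(node.value)}</${node.tag}>`
}

const paris: ProvenancedNode = {
  tag: "locality",
  value: "Paris",
  source: "rule",
  sourceId: "whos_on_first",
}
console.log(renderElement(paris))
// <locality src="rule:whos_on_first">Paris</locality>
console.log(renderElement(paris, { includeSrc: false }))
// <locality>Paris</locality>
```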
Out of scope for 4.1
- Resolver lookup (4.3).
- Lat/lon attrs (4.3).
- WOF SQLite loader (4.2).
- decodeAsJson shape change (deferred indefinitely; libpostal compat is load-bearing).
Phase 4.2: WOF SQLite loader (sketch)
Standalone package. Loads a WOF SQLite distribution from a path or URL. Exposes:
interface WofPlace {
  wof_id: number
  name: string
  // ...plus the remaining WOF placetypes, elided in this sketch
  placetype: "country" | "region" | "locality" | "neighbourhood" | "microhood"
  lat: number
  lon: number
  parent_id?: number
  country: string // ISO-3166 alpha-2
}

interface PlaceLookup {
  findPlace(query: {
    text: string
    placetype?: WofPlace["placetype"]
    country?: string
    parentId?: number
  }): Promise<Array<{ place: WofPlace; score: number }>>
}
FTS5 over wof.name + wof.name_alts. Score = FTS5 BM25 + boosts for placetype + country match. Distribution-versioning piggy-backs on the existing neural-weights-* pattern: @mailwoman/wof-sqlite-<region> packages, one per geographic shard, pulled on demand.
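One way the "BM25 + boosts" score could be combined is sketched below. SQLite's FTS5 bm25() function returns values where a better match is numerically lower (more negative), so the sketch negates it before adding boosts. The boost weights, the Candidate shape, and the function names are placeholder assumptions for illustration only, and the candidate ids are made up.

```typescript
// Hypothetical candidate row as it might come back from an FTS5 query.
interface Candidate {
  wof_id: number
  placetype: string
  country: string
  bm25: number // raw FTS5 bm25() value: more negative means a better match
}

interface ScoreQuery {
  placetype?: string
  country?: string
}

// Placeholder weights, chosen only for illustration.
const PLACETYPE_BOOST = 2
const COUNTRY_BOOST = 1

function scoreCandidate(c: Candidate, q: ScoreQuery): number {
  let score = -c.bm25 // flip the sign so that higher is better
  if (q.placetype !== undefined && c.placetype === q.placetype) score += PLACETYPE_BOOST
  if (q.country !== undefined && c.country === q.country) score += COUNTRY_BOOST
  return score
}

// Score every candidate and return them best-first.
function rank(
  cands: Candidate[],
  q: ScoreQuery,
): Array<{ candidate: Candidate; score: number }> {
  return cands
    .map((candidate) => ({ candidate, score: scoreCandidate(candidate, q) }))
    .sort((a, b) => b.score - a.score)
}

// Made-up ids: two localities that both match the text "Paris".
const ranked = rank(
  [
    { wof_id: 1, placetype: "locality", country: "US", bm25: -3.0 },
    { wof_id: 2, placetype: "locality", country: "FR", bm25: -2.5 },
  ],
  { placetype: "locality", country: "FR" },
)
console.log(ranked.map((r) => [r.candidate.wof_id, r.score]))
// [ [ 2, 5.5 ], [ 1, 5 ] ]
```

In this example the country boost lets the FR candidate overtake a slightly better text match, which is the effect the parser-extracted country is meant to have on ranking.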
Decisions deferred to 4.2:
- Sync (better-sqlite3) vs async (node:sqlite + Worker): depends on what onnxruntime-node already does and whether resolver lookups end up in a hot loop.
- Whether to fold into @mailwoman/neural or split: split is cleaner but means another package to publish.
- Region sharding strategy (US-only first vs full planet vs admin-2 shards).
Phase 4.3: Resolver integration (sketch)
Resolver interface composes a PlaceLookup (4.2) with the AddressTree:
interface Resolver {
  resolveTree(tree: AddressTree, opts?: ResolveOpts): Promise<AddressTree>
}
Walk the tree top-down (country → region → locality → ...), use each resolved parent's wof_id to constrain the child lookup. Decorate matched nodes with source: "resolver", sourceId: "wof-admin:<wof_id>", and new fields on AddressNode: lat?: number; lon?: number; placeId?: string. The XML serializer gains those as additional attributes when present.
When the resolver "wins" attribution, the classifier's original source moves into metadata so debugging tools can still see it. The XML attr shows the winning source.
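The top-down walk might be sketched as below. Everything here is an assumption made for illustration: the tree is modelled as nested TreeNode objects whose tag doubles as the placetype, classifierSource is a hypothetical slot for the displaced classifier source, placeId is filled with the stringified wof_id, and the in-memory lookup with made-up ids and approximate coordinates stands in for the 4.2 package.

```typescript
interface Place { wof_id: number; name: string; placetype: string; lat: number; lon: number; parent_id?: number }
type Hit = { place: Place; score: number }

interface Lookup {
  findPlace(q: { text: string; placetype?: string; parentId?: number }): Promise<Hit[]>
}

interface TreeNode {
  tag: string
  value: string
  children: TreeNode[]
  source?: string
  sourceId?: string
  lat?: number
  lon?: number
  placeId?: string
  classifierSource?: string // hypothetical: where the original source moves to
}

async function resolveNode(node: TreeNode, lookup: Lookup, parentId?: number): Promise<TreeNode> {
  const hits = await lookup.findPlace({ text: node.value, placetype: node.tag, parentId })
  const best = hits.reduce<Hit | undefined>(
    (top, h) => (top === undefined || h.score > top.score ? h : top),
    undefined,
  )
  // Decorate a copy; the input node is left untouched.
  const out: TreeNode = { ...node, children: [] }
  if (best !== undefined) {
    if (node.source !== undefined) out.classifierSource = node.source
    out.source = "resolver"
    out.sourceId = `wof-admin:${best.place.wof_id}`
    out.lat = best.place.lat
    out.lon = best.place.lon
    out.placeId = String(best.place.wof_id)
  }
  // Children are constrained by this node's match; if it did not resolve,
  // they inherit whatever constraint this node was given.
  const nextParent = best?.place.wof_id ?? parentId
  for (const child of node.children) {
    out.children.push(await resolveNode(child, lookup, nextParent))
  }
  return out
}

// Made-up data: two places named "Paris", only one of them under id 1.
const places: Place[] = [
  { wof_id: 1, name: "France", placetype: "country", lat: 46, lon: 2 },
  { wof_id: 2, name: "Paris", placetype: "locality", lat: 48.86, lon: 2.35, parent_id: 1 },
  { wof_id: 3, name: "Paris", placetype: "locality", lat: 33.66, lon: -95.56, parent_id: 9 },
]

const memoryLookup: Lookup = {
  async findPlace(q) {
    return places
      .filter((p) => p.name === q.text)
      .filter((p) => q.placetype === undefined || p.placetype === q.placetype)
      .filter((p) => q.parentId === undefined || p.parent_id === q.parentId)
      .map((place) => ({ place, score: 1 }))
  },
}

const sampleTree: TreeNode = {
  tag: "country",
  value: "France",
  source: "neural",
  children: [{ tag: "locality", value: "Paris", source: "neural", children: [] }],
}

resolveNode(sampleTree, memoryLookup).then((t) => console.log(t.children[0].sourceId))
// wof-admin:2
```

Real WOF parentage is deeper than one level (a locality usually hangs off a region or county rather than a country), so a production walk would likely constrain on ancestors rather than the direct parent alone.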
CLI: mailwoman parse --resolve --format xml toggles the pipeline on. Default off until 4.3 ships.
Decisions deferred until 4.2 / 4.3 begin
- Feedback loop (resolver-corrects-parser): not in v1. The output is decorated, not rewritten. A future sub-phase 4.4 can add the loop.
- Whether to expose the joint {tree, resolution} type publicly: deferred to 4.3.
- BAN-specific resolver for FR: likely a separate WofSqliteResolver peer using the BAN data, gated on whether WOF's France coverage is acceptable in the eval set. Defer until 4.3 hits the eval bench.
Reading material to revisit at each sub-phase
- ellenhp/airmail: even though Option A is rejected, the indexer's ranking heuristics are worth borrowing.
- pelias/placeholder: closest prior art for Option B; cribbing welcome.
- WOF's placetype taxonomy: the canonical hierarchy walking strategy.
- project-geocode-earth-voltron operator note: sanity-check the SQLite schema against the source before trusting it.
- project-mailwoman-licensing: WOF is CC-BY 4.0; attribution required in any redistribution. The resolver package's README must carry it.
Changelog
- 2026-05-20: sketch (the original three-option overview) replaced with this detailed plan. Picked Option B. Defined sub-phases 4.1 / 4.2 / 4.3 and started 4.1.