Mailwoman Neural: Implementation Plan
The plan that drove the neural address parser work inside Mailwoman: TypeScript-first, deployed as a new classifier that progressively replaces rule-based classifiers as confidence metrics justify (Ship of Theseus).
Status (last edited 2026-05-22): Phases 0 through 3 shipped; Tier 2 label expansion landed in v3.0.0 (historically called "Stage 2"; see PHASE_2_training.md terminology note); Phase 4 Resolver subphases through 4.3.x shipped.
- @mailwoman/neural published to npm (v2.x runtime).
- @mailwoman/neural-weights-{en-us,fr-fr} at v3.0.0: Tier 2 vocabulary (21 BIO classes: O + 7 coarse × {B-,I-} + 3 fine × {B-,I-} for venue/street/house_number). Trained on corpus-v0.3.0 (677M aligned rows; adds US DOT NAD as 57.9M structured 911-grade address points). Includes a linear-chain CRF decoder over a frozen BIO transition mask so orphan I-* sequences (Saint Petersburg → Petersburg) are structurally impossible. Eval against golden v0.1.2 (4,535 entries): macro F1 0.32, house_number F1 0.78, venue F1 0.39, street F1 0.27, postcode F1 0.76, region F1 0.18, locality F1 0.27. Coarse F1 regressed vs v0.2.0's small-slice eval; v0.4.0 follow-up targets it (see PHASE_2_training.md iteration log).
- mailwoman parse --neural --format json|tuple|xml works end-to-end against the v3.0.0 weights.
- Decoder + tokenizer + ONNX runtime + per-component policy registry all live; neural/labels.ts knows the 21-label vocabulary.
- Phase 4 Resolver: WOF SQLite path shipped through 4.3.x (FTS5 prefix, R*Tree proximity, population-weighted ranking, multi-shard ATTACH). Browser-side path lives in @mailwoman/neural-web + @mailwoman/resolver-wof-wasm (the demo at https://mailwoman.sister.software/demo runs both client-side).
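The BIO vocabulary and the transition mask mentioned in the status list can be illustrated with a minimal sketch. The function names here are hypothetical and are not taken from neural/labels.ts; the component names passed in are placeholders, since the canonical tag list lives in reference/SCHEMA.md. The sketch only shows the rule a BIO mask encodes: an I-X label may follow only B-X or I-X, and may not open a sequence. A CRF decoder would typically apply such a mask by giving disallowed transitions a score of negative infinity.

```typescript
type BioLabel = string;

/** Build "O" plus a B-/I- pair for every component tag (1 + 2n labels). */
function buildBioLabels(components: readonly string[]): BioLabel[] {
  const labels: BioLabel[] = ["O"];
  for (const c of components) {
    labels.push(`B-${c}`, `I-${c}`);
  }
  return labels;
}

/** mask[prev][next] is true when the transition prev -> next is legal BIO. */
function buildTransitionMask(labels: readonly BioLabel[]): boolean[][] {
  return labels.map((prev) =>
    labels.map((next) => {
      if (!next.startsWith("I-")) return true; // O and B-* may follow anything
      const tag = next.slice(2);
      return prev === `B-${tag}` || prev === `I-${tag}`;
    })
  );
}

/** Valid means: known labels only, no leading I-*, every step allowed by the mask. */
function isValidBioSequence(
  seq: readonly BioLabel[],
  labels: readonly BioLabel[],
  mask: boolean[][]
): boolean {
  const index = new Map(labels.map((l, i) => [l, i] as const));
  for (let t = 0; t < seq.length; t++) {
    const cur = index.get(seq[t]);
    if (cur === undefined) return false;
    if (t === 0) {
      if (seq[t].startsWith("I-")) return false;
      continue;
    }
    const prev = index.get(seq[t - 1])!; // checked on the previous iteration
    if (!mask[prev][cur]) return false;
  }
  return true;
}

// Placeholder components; the shipped vocabulary has 10 (7 coarse + 3 fine).
const labels = buildBioLabels(["locality", "street", "house_number"]);
const mask = buildTransitionMask(labels);
isValidBioSequence(["B-locality", "I-locality"], labels, mask); // true
isValidBioSequence(["O", "I-locality"], labels, mask); // false: orphan I-*
```

With ten component tags the same builder yields 1 + 10 × 2 = 21 labels, matching the count quoted above.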
What's described below as "next" or "future" should be read as the historical plan; the actual state is whatever this directory's most-recently-committed phase file says, plus LOG.md at the repo root for the live cadence.
Read order
1. This file
2. reference/CONTEXT.md: background, prior art, design rationale
3. reference/ARCHITECTURE.md: system shape, key abstractions, locale strategy
4. reference/SCHEMA.md: canonical component tag union (single source of truth)
5. reference/INTERFACES.md: TypeScript contracts at every boundary
6. reference/OPERATIONS.md: how to work, how to commit, how to report progress
7. phases/PHASE_0_foundation.md: start here once you have read 1 to 6
Subsequent phases (phases/PHASE_1_* through phases/PHASE_6_*) are sequential. Do not begin a phase until the previous phase's success criteria are met.
Hard constraints
- TypeScript-first. Inference runtime is onnxruntime-node. Training is allowed in Python but is not shipped to npm.
- Coexistence, not replacement. The neural classifier is additive. Rule classifiers are not deleted until per-component metrics justify retirement.
- Locale scope (v1): US + France. English + French. Japanese is a deliberate Phase 6 stress test of the architecture, not v1.
- Staged components: coarse (country/region/locality/postcode) first, street second, venue third. Each stage ships.
- No new ML in places where rules are already correct. Postcodes are a regex problem. Don't burn a model on them.
- Schema-first. Touching reference/SCHEMA.md requires a written rationale in the commit. Downstream code keys off it.
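To make the "postcodes are a regex problem" constraint concrete, here is a minimal sketch for the two v1 locales. It is an illustration, not Mailwoman's rule classifier: the Locale type, the POSTCODE_PATTERNS table, and isPostcode are hypothetical names, and the patterns assume only the common shapes (US ZIP or ZIP+4, French five-digit code postal). They test shape, not whether a code is actually assigned.

```typescript
type Locale = "en-US" | "fr-FR";

// Assumed shapes: 5 digits with an optional -4 suffix (US), exactly 5 digits (France).
// No "g" flag, so RegExp.test() keeps no state between calls.
const POSTCODE_PATTERNS: Record<Locale, RegExp> = {
  "en-US": /^\d{5}(?:-\d{4})?$/,
  "fr-FR": /^\d{5}$/,
};

/** True when the token has the shape of a postcode for the given locale. */
function isPostcode(token: string, locale: Locale): boolean {
  return POSTCODE_PATTERNS[locale].test(token.trim());
}

isPostcode("94103-1234", "en-US"); // true
isPostcode("75008", "fr-FR"); // true
isPostcode("94103-1234", "fr-FR"); // false
```

A bare five-digit token matches both patterns, so shape alone cannot tell the two locales apart; the locale has to come from elsewhere in the parse.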
What good looks like
By end of Phase 3 you will have shipped @mailwoman/neural@0.1.0 to npm. It loads a quantized ONNX model on demand, emits ClassificationProposal objects that flow through Mailwoman's existing solver unchanged, and beats rule-based Mailwoman on the held-out golden set for at least the country and region components.
By end of Phase 6 the same architecture supports Japanese addresses with schema additions but no core refactor. That is the validation that the design is sound.
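The Phase 3 bar ("beats rule-based Mailwoman on the held-out golden set for at least the country and region components") implies a per-component comparison. The sketch below assumes the conventional definitions: F1 = 2·TP / (2·TP + FP + FN) per component, and macro F1 as the unweighted mean across components. The Counts shape and the function names are hypothetical; how the project's eval harness actually counts matches is not specified here.

```typescript
interface Counts {
  tp: number; // predicted spans that match a gold span
  fp: number; // predicted spans with no gold match
  fn: number; // gold spans that were missed
}

/** F1 from raw counts; defined as 0 when there is nothing to score. */
function f1({ tp, fp, fn }: Counts): number {
  const denom = 2 * tp + fp + fn;
  return denom === 0 ? 0 : (2 * tp) / denom;
}

/** Unweighted mean of per-component F1. */
function macroF1(perComponent: Record<string, Counts>): number {
  const scores = Object.values(perComponent).map(f1);
  return scores.length === 0 ? 0 : scores.reduce((a, b) => a + b, 0) / scores.length;
}

/** True only if the candidate strictly beats the baseline on every listed component. */
function beatsBaseline(
  candidate: Record<string, Counts>,
  baseline: Record<string, Counts>,
  components: readonly string[]
): boolean {
  const zero: Counts = { tp: 0, fp: 0, fn: 0 };
  return components.every((c) => f1(candidate[c] ?? zero) > f1(baseline[c] ?? zero));
}
```

Calling beatsBaseline(neuralCounts, ruleCounts, ["country", "region"]) would express the Phase 3 criterion, with both count tables computed on the same golden set.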
What failure looks like
- Schema churn after Phase 1: this means Phase 0 was rushed.
- Tokenizer drift between Python training and TypeScript inference: silent accuracy loss in prod.
- Training data leakage (same locality in train and eval splits): the model looks great in eval, then fails in the wild.
- Coupling the model to a specific Mailwoman internal type, then needing to refactor to ship the model standalone later.
Any of these means stop and re-read reference/ARCHITECTURE.md.
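One common guard against the leakage failure listed above is to assign splits by locality rather than by row, so every address in a given locality lands on the same side. The sketch below is an assumption about how that could be done, not the project's corpus tooling: the key format, the 10% default eval fraction, and the function names are all hypothetical. It uses an FNV-1a style hash over UTF-16 code units, which is enough for deterministic bucketing even though it is not the canonical byte-wise FNV.

```typescript
type Split = "train" | "eval";

/** 32-bit FNV-1a style hash over UTF-16 code units (deterministic, not cryptographic). */
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

/**
 * Map a locality key to a split. The same key always yields the same split,
 * so no locality can appear in both train and eval.
 */
function assignSplit(localityKey: string, evalFraction = 0.1): Split {
  const normalized = localityKey.trim().toLowerCase();
  const bucket = fnv1a(normalized) / 0x100000000; // in [0, 1)
  return bucket < evalFraction ? "eval" : "train";
}

// Hypothetical key format: country/region/locality.
const split = assignSplit("us/ca/san francisco");
```

Because the assignment depends only on the key, it stays stable when the corpus grows, and it needs no stored lookup table.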
How to ask for help
You will inevitably hit decisions the plan does not anticipate. When that happens:
- Write the decision and the options to DECISIONS.md (create it if absent) under a new heading with date.
- Pick the option that is most reversible and lowest-blast-radius.
- Continue. Do not block waiting for input from outside the lab unless the decision is genuinely irreversible (npm package name, public API contract, data license commitment).
How to report progress
Append to LOG.md (create if absent) after every meaningful unit of work. One line per entry: YYYY-MM-DD HH:MM | <phase> | <what was done> | <next>. Keep it terse: the radio-console format from the user preferences. This log is the only thing the human will read between check-ins.
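The entry format above is simple enough to generate mechanically. The following is a minimal sketch, assuming timestamps are written in UTC (the plan does not say which timezone to use) and that a pipe inside a field should be replaced so it cannot be mistaken for the delimiter. formatLogEntry is a hypothetical helper, not part of Mailwoman.

```typescript
/**
 * Build one LOG.md line: YYYY-MM-DD HH:MM | <phase> | <what was done> | <next>.
 * Assumption: the timestamp is rendered in UTC.
 */
function formatLogEntry(when: Date, phase: string, done: string, next: string): string {
  const iso = when.toISOString(); // e.g. 2026-05-22T14:03:00.000Z
  const stamp = `${iso.slice(0, 10)} ${iso.slice(11, 16)}`;
  // Keep each field on one line and free of the "|" delimiter.
  const clean = (s: string): string =>
    s.replace(/\s*\|\s*/g, " / ").replace(/\s+/g, " ").trim();
  return [stamp, clean(phase), clean(done), clean(next)].join(" | ");
}

formatLogEntry(
  new Date("2026-05-22T14:03:00Z"),
  "Phase 4",
  "shipped FTS5 prefix search",
  "R*Tree proximity"
);
// "2026-05-22 14:03 | Phase 4 | shipped FTS5 prefix search | R*Tree proximity"
```

In Node, the returned line plus a trailing newline can be written with fs.appendFileSync, which creates the file if it does not exist.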