Skip to main content

How it will work โ€” the near future

This article describes where Mailwoman is heading. The work is tracked in GitHub issues and the plan/ directory. Status is current as of May 2026.

What shipped: v0.4.0 through v0.5.0 scaffoldingโ€‹

v0.4.0 (shipped May 2026)โ€‹

The v0.4.0 ablation campaign attempted to combine three training improvements โ€” per-token CRF normalization, class-weighted cross-entropy, and source-weight rebalance โ€” into one release. Five of six training runs diverged. The shipped checkpoint (v0_4_0-stableLR-source-only/step-002200) uses only source-weight rebalance layered on v0.3.0's existing recipe.

Key outcomes:

  • Source rebalance works. NAD downweight (2.0 โ†’ 1.0) and WOF promotion (1.0 โ†’ 2.0) trains stably.
  • Fine labels improved slightly. street 0.27 โ†’ 0.30, house_number 0.78 โ†’ 0.79.
  • Postcode regressed. 0.76 โ†’ 0.69, driven by NAD downweight removing "postcode-first" positional patterns from training.
  • Country F1 regression is mostly an eval artifact. 92% of country false-negatives are adversarial transliteration entries the model was never trained for.
  • JS-side Viterbi decoder shipped. CRF with frozen mask now runs in the browser.
  • Verdict-smoke discipline hardened. Constant-LR smokes and full-run effective-batch matching prevent the cosine-LR meta-bug that hid v0.4.0's divergence.

v0.5.0 scaffolding (shipped May 2026)โ€‹

The v0.5.0 "fresh-slate" bundle shipped five of six planned threads to main:

ThreadComponentStatus
A0Tokenizer harness + A0 baseline weightsShipped
A1Tokenizer retrain on corpus-v0.4.0Trained; not yet used in stable classifier
BKryptonite catalogue (4,771 adversarial rows)Shipped
B2Transliteration pairs (~73K US/FR โ†’ CJK/Cyrillic/Hangul/Han/Armenian)Shipped
C-sClassifier code path (top-k inference + phrase-prior conditioning)Scaffolded in main; full train (C-train) pending stability
D-sStage 5 reconcile (joint decoding with concordance scoring)Shipped, opt-in behind feature flag
EPhrase grouper (Stage 2.7, rule-based)Shipped as @mailwoman/phrase-grouper
FVerdict-smoke disciplineShipped (VERDICT_SMOKES.md, --smoke-mode constant)

The A1 tokenizer halved byte-fallback on multi-script addresses (36.7% โ†’ 18.2%). The phrase grouper and joint decoder close the two architectural gaps v0.4.0 exposed. The kryptonite catalogue gives the training pipeline an adversarial validation set.

What is in progress: training stabilityโ€‹

The C-train โ€” the full classifier training run that would produce v0.5.0 weights โ€” has not converged. Four training attempts using the v0.5.0 recipe all diverged with the same fingerprint seen in v0.4.0: loss descends through warmup, bottoms out, then climbs catastrophically under sustained peak learning rate.

The May 2026 diagnostic work identified that the CRF-NLL term dominates the CE term by 8-20ร— in gradient magnitude. This is the opposite of what the v0.4.0 campaign assumed (it hypothesized CRF gradient collapse, not CRF gradient dominance). The current experiment is CE-only training: remove the CRF loss term entirely, train on cross-entropy alone, keep the CRF as an inference-time structural decoder with the frozen mask.

Resolution path:

  1. CE-only smoke (in progress). 2,000-step constant-LR run with crf_loss_weight=0.0. Gate: no loss climb past step 2,000 AND val_macro_f1 โ‰ฅ 0.35.
  2. If CE-only stable โ†’ full 50K C-train. Quality ceiling TBD โ€” the win is stability, not quality. Quality gains come from recipe knobs (class weights, source weights, longer training) that were unsafe under dual-loss.
  3. If CE-only diverges โ†’ bisect tokenizer/corpus. A1 tokenizer + corpus-v0.4.0 vs v0.1 tokenizer + corpus-v0.4.0. Isolate the destabilizer.
  4. Parallel: reconcile integration. Wire joint decode as the default Stage 5 path behind a feature flag. Evaluate against kryptonite catalogue + golden v0.1.2. Gate: +15pp exact-match on kryptonite, โ‰ค1pt macro_F1 regression on golden.

Beyond training: the integration horizonโ€‹

Once stable training is achieved, the next steps:

Joint decode as defaultโ€‹

The reconciler (Stage 5) currently operates behind an opt-in flag. Wiring it as the default path requires:

  • TS-side per-span logit aggregation (softmax over phrase-grouper spans) to feed top-K to the reconciler.
  • Evaluation against the kryptonite catalogue (NY-NY Steakhouse, Paris TX, St. Petersburg FL).
  • A/B comparison of joint-decode vs argmax fallback on golden v0.1.2.

v0.5.1 and beyondโ€‹

MilestoneWhat it adds
Learned phrase grouper1-2M-param span proposer trained on segment boundaries. Replaces the rule-based grouper.
Tier 3 label expansionattention, po_box, richer POI taxonomy. Requires corpus adapter updates.
top-k Resolver APIMulti-candidate resolver output surfaced in CLI + demo. "Springfield" shows IL, MA, MO candidates.
Phase 5 โ€” StudioWeb UI for human correction of parses. Corrections feed into retraining.
Phase 6 โ€” JapanJapanese address validation. Architecture stress test โ€” no streets, block-based addressing.

What stays the sameโ€‹

  • Rule classifiers are not replaced. They stay deterministic, fast, and reliable. The neural classifier earns each component one at a time.
  • The model stays small. The transformer is staying at roughly 9 million parameters. The win comes from better architecture (phrase grouper, joint decode) and better training, not from scaling up.
  • Browser support is a hard constraint. The pipeline must stay under ~60MB cold load. The demo is the canary.
  • Locale parity is tracked but not urgent. The en-us model has had the most attention. fr-fr will catch up, ja-jp is the validation stress test.

What we are not doingโ€‹

  • Multi-language understanding. We are not training a model that reads prose in 50 languages. We are training one that labels tokens in addresses for the locales we ship.
  • Generative output. The neural classifier labels existing tokens. It does not write text.
  • Replacing upstream projects. Mailwoman uses Pelias, libpostal, OSM, WOF, and other open data as sources and inspiration. It is a complementary project, not a competitor.

See alsoโ€‹