Phase 2: Model Training

Goal: train a token-classification model on corpus-v0.1.0+, export to ONNX, quantize to int8, evaluate against the golden set. End state is shippable neural-weights-en-us + neural-weights-fr-fr weight packages ready for publishing.

Cadence (revised 2026-05-18): ~3-7 days per iteration, multiple iterations rather than a single 2-week run. See reference/ARCHITECTURE.md § "Training cadence vs. plan" for the iteration history + roadmap. The original "2 weeks single run to >95% F1" framing has been superseded.

Branch: ad-hoc per iteration (e.g. feat/v0.2.0-shipping, feat/v0.3.0-stage2), merged to main between iterations.

Depends on: Phase 1 complete (corpus build pipeline running, golden set v0.1.0 in place).

Language: Python in packages/corpus-python/. No TypeScript in this phase.

Terminology note (2026-05-22): This document uses "Tier" for label-vocabulary expansion (Tier 1 = coarse, Tier 2 = +fine, Tier 3 = +POI). The tiers were historically called "Stage 1/2/3" and were renamed to free the word "Stage" for runtime-pipeline stages (see concepts/the-staged-pipeline). Shipped artifacts (model cards, eval filenames like stage2-step-001800-eval.md, npm v3.0.0 releases) preserve the old "Stage" naming as historical record.

Iteration log (live)

  • v0.1.0 (shipped 2026-05-18, PR #42): first Tier 1 ship. F1 below targets (~0.03 macro) due to positional-heuristic overfit; calibration 0.337 in conf>0.9 bucket. Honest below-target ship with v0.2.0 retrain recipe in session-notes.

  • v0.2.0 (shipped 2026-05-18, PR #53): source_weights mechanism + relaxed coarse gate. 9× macro-F1 (0.037 → 0.335); calibration tightened 2.6× (0.337 → 0.882 in conf>0.9 bucket). Still below 95% target on country/region/locality but real, measurable, ship-worthy improvement.

  • v0.3.0 → v3.0.0 (shipped 2026-05-22, PR #115; published as @mailwoman/neural-weights-{en-us,fr-fr}@3.0.0, major-bumped from npm's 2.0.6 line to signal the 15→21 BIO breaking change). Tier 2 label expansion (venue / street / house_number BIO classes), linear-chain CRF decoder over a frozen BIO transition mask, dual loss (CE + 0.05·CRF NLL), corpus-v0.3.0 rebuild (677M aligned rows, adds usgov-nad at 57.9M with full venue+street+house_number coverage). Four hparam iterations converged on lr=1.5e-4 + grad_clip=1.0 + crf_loss_weight=0.05 + label_smoothing=0 (down from v0.2.0's lr=5e-4 CE-only; dual loss is far more LR-sensitive, see DECISIONS.md "v0.3.0 Stage 2 dual loss"). Early-stopped at step 1800 of 50K on val plateau. Eval against golden v0.1.2 (4,535 entries): macro F1 0.32; capability-surface win on the 3 new label classes (house_number F1 0.78, near #57's 0.8 target; venue 0.39 and street 0.27, both below floor and carried over as v0.4.0 targets); coarse F1 regressed substantially vs v0.2.0 (region 0.83 → 0.18, locality 0.65 → 0.27, postcode 0.86 → 0.76), attributed to under-training at step 1800 plus the expanded label space pulling prior mass off coarse predictions. CRF makes orphan-I-* decode structurally impossible (verified on the demo's "Saint Petersburg" case).

  • JS Viterbi (shipped 2026-05-22): neural/viterbi.ts lands the Viterbi decoder + BIO structural transition mask. NeuralAddressClassifier defaults to Viterbi decode; orphan-I-* sequences are now structurally impossible at JS runtime (matching the eval-time path). Works on the v3.0.0 weights as-is: no model retraining is required because the structural mask is built from the labels list. Learned CRF transitions will compose on top once a future weights release ships them.

  • v0.4.0 (shipped to packaged artifacts 2026-05-23, NOT npm-published; issue #116). Six planned work areas: (1) per-token CRF NLL normalization; (2) longer training (reach step-5000+ before judging); (3) class-weighted CE biased toward coarse classes; (4) source-weight rebalance away from NAD-heavy fine-label mix; (5) JS-side Viterbi decode + model-card label loading (landed 2026-05-22, see entry above); (6) reuse corpus-v0.3.0. Shipped recipe is §4 + §5 only; §1 and §3 deferred to v0.4.1 after destabilizing the dual-loss training.

    Ablation campaign (2026-05-23, 5 divergence runs + 3 verdict smokes + 1 full 50K + 1 successful full run):

    1. The full §1+§3+§4 recipe diverged at all three tested LRs (lr=5e-4 step 750, lr=3e-4 step 1000, lr=1.5e-4 step 2000). Lowering the LR delays divergence roughly proportionally but does not prevent it: the destabilizer is in the recipe, not the LR.
    2. At lr=5e-4, both single-knob ablations failed identically (ablate-§1: 0.35 → 0.11; ablate-§3: 0.35 → 0.14). lr=5e-4 is structurally unreachable for this codebase's dual-loss landscape regardless of which §1/§3 knob is active.
    3. Switched to lr=1.5e-4 (v0.3.0-stable) verdict-smoke matrix at max_steps=3000:
      • source-only (§4): PASS, peak 0.4190 step 2250
      • cw-only (§3+§4): PASS, peak 0.4279 step 2250
      • crf-only (§1+§4): FAIL, train_loss=1.24 at step 3000
    4. Verdict-smoke framework had a false-positive on cw-only: the smoke's cosine LR decayed to ~0 by step 2750, masking sustained-peak-LR divergence. When promoted to the full 50K run, cw-only diverged at step 2250 (macro_f1 collapse 0.41 → 0.29). Process improvement for v0.4.1: verdict smokes should use a constant LR or much longer max_steps. (Landed in v0.5.0 Thread F as --smoke-mode constant|long-tail; framework + decision matrix in VERDICT_SMOKES.md.)
    5. Math sanity-check of model.py + crf.py found no implementation bug: per_token reduction is nll.sum() / total_tokens.clamp(min=1); class_weights enter via cross_entropy(weight=...). Destabilization is a real recipe interaction, not a coding artifact.
    6. Shipped checkpoint: v0_4_0-stableLR-source-only/step-002200, i.e. §4 source rebalance + v0.3.0 dual-loss base + lr=1.5e-4. Only recipe that stayed clean AND outperformed cw-only on golden v0.1.2 eval.

    Golden v0.1.2 eval (4535 entries): mixed result, issue #116 success metric NOT cleanly met:

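    | tag | v0.4.0 shipped | v0.3.0 | Δ |
    | --- | --- | --- | --- |
    | country | 0.21 | 0.28 | -0.07 regression |
    | region | 0.19 | 0.18 | +0.01 |
    | locality | 0.27 | 0.27 | flat |
    | postcode | 0.69 | 0.76 | -0.07 regression |
    | venue | 0.39 | 0.39 | flat |
    | street | 0.30 | 0.27 | +0.03 improvement |
    | house_number | 0.79 | 0.78 | +0.01 (issue #57 floor held) |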

    Macro F1 raw average: 0.357 vs 0.293. Mean token confidence: 0.806 vs 0.857. Full-parse exact match: 0.082 vs 0.107 (a regression: better per-component agreement, worse full-address agreement).

    Issue #116 asked for "clear progress on at least two of (coarse F1, fine F1, calibration, training stability)":

    • coarse F1: NEGATIVE (country/postcode each -0.07)
    • fine F1: SMALL POSITIVE (street +0.03, house_number +0.01)
    • calibration: FLAT
    • training stability: NEGATIVE (recipe destabilization is the campaign's central finding)

    Only one clean improvement axis. §1 (per_token CRF) and §3 (class_weights) deferred to v0.4.1 after a corpus-side investigation of why the full recipe destabilizes past step 2000.

    Post-hoc regression diagnostic (categorized, 4535 entries, 1217 postcode FNs + 194 country FNs):

    | error class | postcode FN | country FN |
    | --- | --- | --- |
    | empty_pred (model emits nothing) | 789 (65%) | 9 (5%) |
    | non-Latin transliteration (v0.3.0 pre-existing) | 213 (18%) | 178 (92%) |
    | num_confused (house# picked instead of postcode) | 136 (11%) | n/a |
    | bio_slip (boundary off ±1 sentencepiece) | 73 (6%) | small |
    | other | 6 (0.5%) | ~5% |

    Headline takeaways:

    • The dominant postcode FN is empty_pred (65%) on mid-position / short-form postcodes (Paris 75008, 47110 Sainte-Livrade-sur-Lot). v0.4.0's source rebalance traded structured-address postcode exposure for coarse-only 10118-style forms.
    • Country FN is 92% adversarial transliteration (CJK/Cyrillic raw input → English country gold). This is a v0.3.0 pre-existing failure mode; after excluding adversarials, country FN drops 194 → ~16. The country -0.07 F1 regression is mostly a golden-set adversarial-weighting artifact, not a real recipe regression.
    • Decoder span-trim sidecar (commit c72ab4c, no retrain required) addresses the 6% bio_slip slice + an unmeasured share of FP rate. Material impact on the headline numbers is smaller than the initial diagnostic suggested but the trim is still a correct decoder bug fix.

    v0.4.1 implications: source-weight tweak alone partially addresses the 11% num_confused slice. The 65% empty_pred slice requires a different intervention (likely source-weight + synthesis pass over component-order permutations, or an aggressive wof-postalcode bump). Full v0.4.1 scope draft staged at .playpen/control/drafts/v0_4_1-scope.md on the i116 container.

  • v0.5.0 thread C-s, scaffold only (shipped 2026-05-23, see Phase 8 fresh-slate plan § Thread C). v0.4.1 was explicitly skipped in favor of a fresh-slate iteration. Thread C-s lands the model code path for the v0.5.0 architecture changes; no training run is part of this ship. Training kicks off after Threads A (tokenizer retrain) and B (corpus-v0.4.0 with adversarial expansion) complete.

    Architectural changes shipped in the scaffold:

    • Top-k inference (predict_top_k): the encoder now emits the K most-probable tag sequences with calibrated log-probability scores (crf.top_k_decode, list-Viterbi over the structural mask). Default k=5. This is what Stage 5 reconcile (Thread D) consumes: the classifier becomes a candidate generator rather than a single-answer predictor. Argmax path (predict) unchanged for back-compat.
    • Phrase-prior input-layer conditioning: when model.use_phrase_priors=true, the encoder takes a per-token feature tensor (B, S, PHRASE_FEATURE_DIM=10) carrying BIE markers (phrase_start/phrase_mid/phrase_end) + a 7-way one-hot over the PhraseKind taxonomy (NUMERIC, STREET_PHRASE, LOCALITY_PHRASE, REGION_ABBREVIATION, POSTCODE, VENUE_PHRASE, HYPHENATED_COMPOUND; mirrored from Thread E's TS contract). Features are concatenated onto the token+position embedding and projected back to hidden_size with a learned linear. Default false preserves v0.3.0/v0.4.0 behavior bit-identically, which enables clean ablation of the phrase-prior contribution.
    • Hidden-size knob kept at v0.3.0/v0.4.0 baseline (256). The plan doc mentions a 256 → 384 or 512 bump "likely paid for by rented GPU." For the scaffold, hold the bump: validate that the new architecture trains cleanly at the current size first; the bump becomes an orthogonal v0_5_0-classifier-large.yaml follow-up once the baseline lands clean.
    • Label vocab unchanged from v0.3.0 (21 BIO classes; see mailwoman_train.labels.ACTIVE_BIO_LABELS). POI taxonomy expansion is out of scope per the plan doc.
    • Recipe parked: configs/v0_5_0-classifier-smoke.yaml defines the short verdict-smoke (constant-LR per Thread F's VERDICT_SMOKES.md, --smoke-mode constant). Invocable once Threads A + B + F have all landed. NOT executed as part of this PR.

    Forward-pass smoke (tests/mailwoman_train/test_v0_5_0_forward_pass.py, 14 cases) verifies wiring on stub batches in seconds, with no loss.backward and no optimizer step. It is the integration check that decides whether the new architecture is ready for verdict-smoke training, not whether it converges.
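
The structural BIO transition mask and Viterbi decode mentioned in the v3.0.0, JS Viterbi, and v0.5.0 entries can be sketched in a few lines. This is a minimal illustration in plain Python, assuming per-token log-scores as nested lists; the function names are hypothetical and this is not the project's crf.py or neural/viterbi.ts. It covers the structural mask only (no learned transition scores, no top-k list-Viterbi).

```python
import math


def bio_transition_mask(labels):
    """mask[i][j] is True when labels[j] may directly follow labels[i]."""
    def allowed(prev, nxt):
        if not nxt.startswith("I-"):
            return True  # O and B-* may follow any label
        kind = nxt[2:]
        # I-X may only continue a span of the same type X
        return prev in (f"B-{kind}", f"I-{kind}")
    return [[allowed(p, n) for n in labels] for p in labels]


def viterbi_decode(emissions, labels):
    """Best label sequence under the BIO mask.

    emissions: one list of log-scores per token, aligned with `labels`.
    """
    if not emissions:
        return []
    mask = bio_transition_mask(labels)
    n = len(labels)
    # A sequence may not open with I-*.
    score = [
        -math.inf if labels[j].startswith("I-") else emissions[0][j]
        for j in range(n)
    ]
    backpointers = []
    for row in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n):
            candidates = [i for i in range(n) if mask[i][j]]
            best = max(candidates, key=lambda i: score[i])
            new_score.append(score[best] + row[j])
            pointers.append(best)
        score = new_score
        backpointers.append(pointers)
    # Walk the backpointers from the best final label.
    last = max(range(n), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    return [labels[j] for j in reversed(path)]
```

Because the mask is derived from the label list alone, a sketch like this needs no retraining to apply to existing weights, which matches the note in the JS Viterbi entry. A per-token argmax over the same scores could still emit an orphan I-* tag; the masked decode cannot.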

Pre-flight

  • corpus-v0.1.0 exists at /data/corpus/versioned/corpus-v0.1.0/
  • Golden set v0.1.0 exists at /data/eval/golden/v0.1.0/
  • Python environment set up with PyTorch, Transformers, ONNX, ONNX Runtime, datasets, sentencepiece
  • GPU available (Phase 2 is GPU-bound). If the lab has no GPU, document training time and proceed on CPU; the model is small.

Tasks

1. Python project setup

  • packages/corpus-python/pyproject.toml with deps locked
  • packages/corpus-python/src/mailwoman_train/ package
  • Subcommand CLI: python -m mailwoman_train <command>

2. Data loading

  • data_loader.py: load Parquet shards via datasets.load_dataset('parquet', ...). Lazy, streaming, memory-stable.
  • Stratified sampling: per-country weights to prevent imbalance. Configurable in YAML.
  • Length filtering: drop rows where token count > 128. Address text is short by nature; long rows are usually adapter bugs.
  • Verify tokenizer alignment: load tokenizer-v0.1.0, re-tokenize a sample of raw strings, assert the tokenization matches the stored tokens field. If it doesn't, the corpus is corrupt; stop and investigate.
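
The length filter and the tokenizer alignment check above can be sketched as two small helpers. This is a minimal sketch, not the actual data_loader.py: the row field name `raw` is an assumption (only a stored `tokens` field is named above), and `tokenize` is any callable that maps a string to a list of pieces.

```python
from typing import Callable, Iterable, Iterator, List

MAX_TOKENS = 128  # length cutoff from the task list above


def filter_by_length(rows: Iterable[dict], max_tokens: int = MAX_TOKENS) -> Iterator[dict]:
    """Lazily drop rows whose stored token count exceeds max_tokens."""
    for row in rows:
        if len(row["tokens"]) <= max_tokens:
            yield row


def assert_tokenizer_alignment(
    rows: Iterable[dict],
    tokenize: Callable[[str], List[str]],
) -> int:
    """Re-tokenize each raw string and compare against the stored tokens.

    Raises ValueError on the first mismatch; returns the number of rows checked.
    The "raw" field name is hypothetical.
    """
    checked = 0
    for row in rows:
        fresh = tokenize(row["raw"])
        if fresh != row["tokens"]:
            raise ValueError(
                f"tokenizer mismatch for {row['raw']!r}: "
                f"stored={row['tokens']} fresh={fresh}"
            )
        checked += 1
    return checked
```

Both helpers work on any iterable, so they compose with a streaming dataset without materializing it. With SentencePiece, `tokenize` would typically wrap the processor's encode call with string output; the exact wiring depends on how the corpus stored its tokens.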

3. Model architecture

  • model.py: small encoder-only transformer
    • Layers: 6
    • Hidden: 256
    • Attention heads: 4
    • FF intermediate: 1024
    • Max position: 128
    • Vocab: from tokenizer v0.1.0
    • Token classification head: linear → |BIO_LABELS| logits
  • Use HuggingFace transformers library: BertConfig + BertForTokenClassification with custom small config. Or RoBERTa. Pick one and stick with it.
  • From-scratch initialization (not pretrained). Address vocabulary is too small to benefit from English-internet pretraining.
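
To get a feel for how small this encoder is, the parameter budget can be worked out from the hyperparameters above. The sketch below is an approximation under stated assumptions: the dictionary keys mirror the keyword names BertConfig uses, the count omits BERT-specific extras such as token-type embeddings and the pooler, and the vocabulary size used in the note afterwards is a placeholder, not the real tokenizer-v0.1.0 size.

```python
SMALL_CONFIG = {
    "num_hidden_layers": 6,
    "hidden_size": 256,
    "num_attention_heads": 4,
    "intermediate_size": 1024,
    "max_position_embeddings": 128,
}


def encoder_layer_params(hidden: int, intermediate: int) -> int:
    """Weights + biases in one standard transformer encoder layer."""
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    layer_norms = 2 * (2 * hidden)  # two LayerNorms, scale + shift each
    return attention + feed_forward + layer_norms


def approx_total_params(cfg: dict, vocab_size: int, num_labels: int) -> int:
    """Approximate count: embeddings + encoder stack + linear classification head."""
    h = cfg["hidden_size"]
    embeddings = vocab_size * h + cfg["max_position_embeddings"] * h + 2 * h
    layers = cfg["num_hidden_layers"] * encoder_layer_params(h, cfg["intermediate_size"])
    head = h * num_labels + num_labels
    return embeddings + layers + head
```

Each layer carries 789,760 parameters, so the six-layer stack is about 4.7M. With a placeholder vocabulary of 32,000 pieces and 15 labels, `approx_total_params` gives roughly 13M, most of it in the token embedding table, so the vocabulary size is the main lever on model size.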

4. Training loop

  • train.py
  • Optimizer: AdamW, lr 5e-4, weight decay 0.01
  • LR schedule: linear warmup 1000 steps, then cosine decay
  • Batch size: 256 (lower if OOM)
  • Steps: ~50k for Tier 1 coarse-only. Eval every 2k steps on val set.
  • Mixed precision (fp16/bf16) on GPU
  • Save checkpoint every 5k steps to /data/models/checkpoints/
  • Track: train loss, val loss, val F1 per component, val full-parse exact match
  • Use Weights & Biases or TensorBoard or plain CSV; pick one and stick with it. Don't ship a logging refactor in the middle of training.
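
The warmup-then-cosine schedule above can be written as a pure function of the step, which makes it easy to inspect before a run. This is a minimal sketch with the defaults taken from the list above; the function name is hypothetical and the decay-to-zero floor is an assumption.

```python
import math


def lr_at(step: int, peak_lr: float = 5e-4, warmup_steps: int = 1000,
          max_steps: int = 50_000) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    span = max(1, max_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Evaluating the same function with max_steps=3000 shows why a short cosine smoke can hide divergence, as the iteration log records: assuming the smoke kept the 1000-step warmup, `lr_at(2750, peak_lr=1.5e-4, max_steps=3000)` is under 4% of the peak, so the run spends its final steps at a learning rate too low to expose instability.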

5. Tiered training plan

Tier 1: Coarse-only (this phase)

  • Train only on rows where country and at least one of (region, locality, postcode) are present
  • Labels restricted to: country, region, locality, dependent_locality, postcode, subregion, cedex, O
  • Target: > 95% F1 per component on golden set
  • This is the v0.1.0 model.
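
The Tier 1 row filter and label restriction can be sketched as follows. This is illustrative only: the label ordering and helper names are assumptions, and the authoritative list lives in mailwoman_train.labels.ACTIVE_BIO_LABELS. The component names come from the list above and from the Tier 2 expansion recorded in the iteration log.

```python
TIER1_COMPONENTS = [
    "country", "region", "locality", "dependent_locality",
    "postcode", "subregion", "cedex",
]
TIER2_ADDITIONS = ["venue", "street", "house_number"]


def bio_labels(components):
    """Expand component names into a BIO label list: O plus B-/I- per component."""
    labels = ["O"]
    for name in components:
        labels += [f"B-{name}", f"I-{name}"]
    return labels


def is_tier1_row(components: dict) -> bool:
    """True when country is present along with region, locality, or postcode.

    `components` maps component name -> text; absent components are missing or empty.
    """
    has_country = bool(components.get("country"))
    has_anchor = any(components.get(k) for k in ("region", "locality", "postcode"))
    return has_country and has_anchor
```

The counts line up with the iteration log: seven Tier 1 components give 15 BIO classes, and adding the three Tier 2 components gives 21, which is the 15→21 change that triggered the major version bump.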

Tiers 2 (street) and 3 (venue) are explicitly future iterations. Do not attempt to train all tiers in Phase 2.

6. Evaluation

  • eval.py: load checkpoint, run inference on golden set, compute metrics
  • Metrics:
    • Per-component F1, precision, recall
    • Full-parse exact match (all components correct)
    • Mean token confidence
    • Calibration: bucket predictions by confidence, check accuracy per bucket
  • Output: a markdown report saved alongside the checkpoint
  • Compare against rule-based Mailwoman on the same golden set. Rule baseline numbers should be cached so this doesn't require running Mailwoman during training.
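
Two of the metrics above, per-component precision/recall/F1 and the confidence-bucket calibration check, can be sketched in plain Python. This is a minimal sketch and not the actual eval.py: the bucket edges other than 0.9 are assumptions, and the inputs are assumed to be already-aligned per-prediction confidences and correctness flags.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from raw counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def calibration_buckets(confidences, correct, edges=(0.0, 0.5, 0.7, 0.9, 1.0)):
    """Accuracy per confidence bucket.

    Buckets are [lo, hi), except the last one, which also includes hi.
    Empty buckets report accuracy None.
    """
    buckets = []
    for lo, hi in zip(edges, edges[1:]):
        is_last = hi == edges[-1]
        idx = [
            i for i, c in enumerate(confidences)
            if lo <= c < hi or (is_last and c == hi)
        ]
        accuracy = sum(correct[i] for i in idx) / len(idx) if idx else None
        buckets.append({"range": (lo, hi), "count": len(idx), "accuracy": accuracy})
    return buckets
```

A well-calibrated model shows accuracy close to the confidence range in each bucket. The conf>0.9 figures quoted in the iteration log (0.337 for v0.1.0, 0.882 for v0.2.0) are the kind of number the last bucket would report.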

7. ONNX export

  • export_onnx.py
  • Export with dynamic axes for batch and sequence length
  • Opset 17
  • Verify ONNX inference matches PyTorch inference within 1e-4 on a sample of 1000 inputs
  • Output: /data/models/onnx/model-v0.1.0-en-us.onnx, same for fr-fr
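
The parity check above boils down to running the same inputs through both runtimes and tracking the worst element-wise gap. The sketch below is framework-agnostic and hypothetical: `run_reference` and `run_exported` stand in for whatever wraps the PyTorch model and the ONNX Runtime session, and outputs are assumed to be nested lists of floats.

```python
def max_abs_diff(a, b) -> float:
    """Largest element-wise gap between two equally shaped nested lists of floats."""
    if isinstance(a, (list, tuple)):
        if len(a) != len(b):
            raise ValueError("shape mismatch between reference and exported outputs")
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


def check_parity(inputs, run_reference, run_exported, tol: float = 1e-4) -> float:
    """Compare two inference callables over `inputs`; raise if any gap exceeds tol.

    Returns the worst gap observed so it can be written into the eval report.
    """
    worst = 0.0
    for x in inputs:
        worst = max(worst, max_abs_diff(run_reference(x), run_exported(x)))
    if worst > tol:
        raise AssertionError(f"parity check failed: max abs diff {worst:.2e} > {tol:.0e}")
    return worst
```

Since the export uses dynamic axes, the sample is more informative when it varies both batch size and sequence length rather than reusing the tracing shape.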

⚠ If you trained a single multilingual model (recommended for Tier 1, since coarse is cheap to share), export it twice with the same weights, named per locale. Splitting into per-locale models is a Phase 3 decision based on size and load behavior.

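8. Quantization

  • quantize.py: int8 dynamic quantization via onnxruntime.quantization
  • Calibration on 1000 val-set examples applies only if static quantization is used instead; dynamic quantization computes activation ranges at runtime and takes no calibration set.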
  • Verify quantized model F1 on golden set drops by less than 0.5% from fp32. If it drops more, investigate (likely a quantization config issue, not a fundamental limit).
  • Output: /data/models/quantized/model-v0.1.0-en-us-int8.onnx

9. Weights package preparation

  • packages/neural-weights-en-us/ and packages/neural-weights-fr-fr/
  • Each contains:
    • model.onnx (int8 quantized)
    • tokenizer.model (SentencePiece)
    • model-card.json (ModelCard per reference/INTERFACES.md)
    • package.json with name, version, license
    • README.md describing the model, training corpus, eval scores
  • These packages are data-only. No JS code. They are loaded by @mailwoman/neural at runtime.
  • Verify package size: aim for < 30MB int8 model. Tokenizer is ~1MB. Total package < 40MB.

10. Model card

  • ModelCard filled in honestly
  • Include: training corpus version, training duration, hardware, eval scores per component on golden set + holdout splits, known failure modes (e.g., "underperforms on Hawaiian addresses", "confused by historical Paris arrondissement notation pre-1860")
  • This is a public document. Users will read it before adopting.

Success criteria checklist

  • Tier 1 model trained, checkpoint saved
  • ONNX export verified parity with PyTorch
  • Int8 quantized model meets eval threshold
  • neural-weights-en-us@0.1.0 and neural-weights-fr-fr@0.1.0 package directories complete
  • Model cards filled in
  • Eval reports committed to git (numbers, not models)
  • Beats rule-based Mailwoman on golden set for country and region components by at least 2 F1 points. If not, investigate before proceeding: the architecture is fine, the corpus is probably the issue.

When to ship vs train more (original framing, superseded by iteration cadence)

If after the first training run, golden F1 is:

  • > 95% per coarse component → ship.

  • 90–95% → analyze failure modes. Likely fixable with corpus tweaks (more synthesis, deduplication, a missing source). One additional iteration is fine.
  • < 90% → stop and re-examine. Could be: tokenizer mismatch, label misalignment, schema bug, severely imbalanced training data. Do not "train longer" without diagnosing first.

⚠ Resist the urge to add street-level components in this phase to "get more value." Tier 1 ships coarse. Tier 2 is its own iteration. Mixing them blurs the metrics and slows iteration.

Revised (2026-05-18)

The above original framing assumed a single training run. After v0.1.0 + v0.2.0, the actual cadence is:

  • Each iteration ships an artifact, whether or not it hits target. Below-target ships are acceptable when they are documented honestly (model card + eval ledger entry) and because, under the Ship-of-Theseus coexistence model, the rule classifiers still run alongside.
  • Per-iteration F1 floor (not target): each iteration must improve over the prior one in at least one of: per-component F1, calibration tightness, or capability surface (new labels supported).
  • Vocabulary tier expansion (Tier 2 → venue+street+house_number; Tier 3 → organization/POI venue) happens as iteration deltas within Phase 2, not as separate phases. Same encoder, more head classes. Each lands in its own retrain.
  • The eval ledger is the success record: evals/scores-by-version.json with corpus + golden-set sha-pinning makes every iteration's delta empirically defensible.

When to call this phase done

When the weights packages exist on disk, model cards are accurate, eval scores beat the rule baseline on coarse components, and the only remaining work for shipping is TypeScript integration (Phase 3).