Phase 2: Model Training
Goal: train a token-classification model on corpus-v0.1.0+, export to ONNX, quantize to int8, evaluate against the golden set. End state is shippable neural-weights-en-us + neural-weights-fr-fr weight packages ready for publishing.
Cadence (revised 2026-05-18): ~3-7 days per iteration, multiple iterations rather than a single 2-week run. See reference/ARCHITECTURE.md § "Training cadence vs. plan" for the iteration history + roadmap. The original "2 weeks single run to >95% F1" framing has been superseded.
Branch: ad-hoc per iteration (e.g. feat/v0.2.0-shipping, feat/v0.3.0-stage2), merged to main between iterations.
Depends on: Phase 1 complete (corpus build pipeline running, golden set v0.1.0 in place).
Language: Python in packages/corpus-python/. No TypeScript in this phase.
Terminology note (2026-05-22): This document uses "Tier" for label-vocabulary expansion (Tier 1 = coarse, Tier 2 = +fine, Tier 3 = +POI). Historically called "Stage 1/2/3"; renamed to free the word "Stage" for runtime-pipeline stages (see concepts/the-staged-pipeline). Shipped artifacts (model cards, eval filenames like stage2-step-001800-eval.md, npm v3.0.0 releases) preserve the old "Stage" naming as historical record.
Iteration log (live)
- v0.1.0 (shipped 2026-05-18, PR #42): first Tier 1 ship. F1 below targets (~0.03 macro) due to positional-heuristic overfit; calibration 0.337 in conf>0.9 bucket. Honest below-target ship with v0.2.0 retrain recipe in session-notes.
- v0.2.0 (shipped 2026-05-18, PR #53): source_weights mechanism + relaxed coarse gate. 9× macro-F1 (0.037 → 0.335); calibration tightened 2.6× (0.337 → 0.882 in conf>0.9 bucket). Still below the 95% target on country/region/locality, but a real, measurable, ship-worthy improvement.
- v0.3.0 → v3.0.0 (shipped 2026-05-22, PR #115; published as @mailwoman/neural-weights-{en-us,fr-fr}@3.0.0, major-bumped from npm's 2.0.6 line to signal the 15→21 BIO breaking change). Tier 2 label expansion (venue/street/house_number BIO classes), linear-chain CRF decoder over a frozen BIO transition mask, dual loss (CE + 0.05·CRF NLL), corpus-v0.3.0 rebuild (677M aligned rows, adds usgov-nad at 57.9M with full venue+street+house_number coverage). Four hparam iterations converged on lr=1.5e-4 + grad_clip=1.0 + crf_loss_weight=0.05 + label_smoothing=0 (down from v0.2.0's lr=5e-4 CE-only; dual loss is far more LR-sensitive, see DECISIONS.md "v0.3.0 Stage 2 dual loss"). Early-stopped at step 1800 of 50K on val plateau. Eval against golden v0.1.2 (4,535 entries): macro F1 0.32; capability-surface win on the 3 new label classes (house_number F1 0.78 near #57's 0.8 target; venue 0.39 and street 0.27 both below floor, v0.4.0 targets); coarse F1 regressed substantially vs v0.2.0 (region 0.83 → 0.18, locality 0.65 → 0.27, postcode 0.86 → 0.76): under-trained at step 1800, and the expanded label space pulled prior mass off coarse predictions. CRF makes orphan-I-* decode structurally impossible (verified on the demo's "Saint Petersburg" case).
- JS Viterbi (shipped 2026-05-22): neural/viterbi.ts lands the Viterbi decoder + BIO structural transition mask. NeuralAddressClassifier defaults to Viterbi decode; orphan-I-* sequences are now structurally impossible at JS runtime (matching the eval-time path). Works on the v3.0.0 weights as-is; no model retraining is required because the structural mask is built from the labels list. Learned CRF transitions will compose on top once a future weights release ships them.
- v0.4.0 (shipped to packaged artifacts 2026-05-23, NOT npm-published; issue #116): six planned work areas: (1) per-token CRF NLL normalization; (2) longer training (reach step-5000+ before judging); (3) class-weighted CE biased toward coarse classes; (4) source-weight rebalance away from NAD-heavy fine-label mix; (5) JS-side Viterbi decode + model-card label loading (landed 2026-05-22, see entry above); (6) reuse corpus-v0.3.0. Shipped recipe is §4 + §5 only; §1 and §3 deferred to v0.4.1 after destabilizing the dual-loss training.
Ablation campaign (2026-05-23, 5 divergence runs + 3 verdict smokes + 1 full 50K + 1 successful full run):
- The full §1+§3+§4 recipe diverged at all three tested LRs (lr=5e-4 step 750, lr=3e-4 step 1000, lr=1.5e-4 step 2000). LR delays divergence proportionally: the destabilizer is in the recipe, not the LR.
- At lr=5e-4, both single-knob ablations failed identically (ablate-§1: 0.35 → 0.11; ablate-§3: 0.35 → 0.14). lr=5e-4 is structurally unreachable for this codebase's dual-loss landscape regardless of which §1/§3 knob is active.
- Switched to lr=1.5e-4 (v0.3.0-stable) verdict-smoke matrix at max_steps=3000:
  - source-only (§4): PASS, peak 0.4190 step 2250
  - cw-only (§3+§4): PASS, peak 0.4279 step 2250
  - crf-only (§1+§4): FAIL, train_loss=1.24 at step 3000
- Verdict-smoke framework had a false-positive on cw-only: the smoke's cosine LR decayed to ~0 by step 2750, masking sustained-peak-LR divergence. When promoted to the full 50K run, cw-only diverged at step 2250 (macro_f1 collapse 0.41 → 0.29). Process improvement for v0.4.1: verdict smokes should use a constant LR or much longer max_steps. (Landed in v0.5.0 Thread F as --smoke-mode constant|long-tail; framework + decision matrix in VERDICT_SMOKES.md.)
- Math sanity-check of model.py + crf.py found no implementation bug: per_token reduction is nll.sum() / total_tokens.clamp(min=1); class_weights enter via cross_entropy(weight=...). Destabilization is a real recipe interaction, not a coding artifact.
- Shipped checkpoint: v0_4_0-stableLR-source-only/step-002200 (§4 source rebalance + v0.3.0 dual-loss base + lr=1.5e-4). Only recipe that stayed clean AND outperformed cw-only on golden v0.1.2 eval.
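The verdict-smoke false positive is easier to see with the schedule written out. The sketch below compares the learning rate a 3000-step smoke and a 50K-step run would apply at the same step. It assumes linear warmup followed by cosine decay to zero, and a 1000-step warmup borrowed from the original training-loop plan; the actual v0.4.0 configs may differ, and lr_at is a hypothetical helper, not a function from train.py.

```python
import math


def lr_at(step: int, peak_lr: float, warmup_steps: int, max_steps: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to 0 at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


PEAK_LR = 1.5e-4
WARMUP = 1000  # assumption: taken from the original plan, not from the v0.4.0 configs

for step in (2250, 2750):
    smoke = lr_at(step, PEAK_LR, WARMUP, max_steps=3000)
    full = lr_at(step, PEAK_LR, WARMUP, max_steps=50_000)
    print(f"step {step}: smoke {smoke / PEAK_LR:.1%} of peak, full {full / PEAK_LR:.1%} of peak")
```

Under these assumptions the smoke is already at roughly a third of peak LR by step 2250 and at a few percent of peak by step 2750, while the 50K run is still essentially at peak at step 2250. A recipe that only diverges under sustained peak LR can therefore pass the smoke and fail the full run.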
Golden v0.1.2 eval (4535 entries), mixed result, issue #116 success metric NOT cleanly met:

| tag | v0.4.0 shipped | v0.3.0 | Δ |
| --- | --- | --- | --- |
| country | 0.21 | 0.28 | -0.07 regression |
| region | 0.19 | 0.18 | +0.01 |
| locality | 0.27 | 0.27 | flat |
| postcode | 0.69 | 0.76 | -0.07 regression |
| venue | 0.39 | 0.39 | flat |
| street | 0.30 | 0.27 | +0.03 improvement |
| house_number | 0.79 | 0.78 | +0.01 (issue #57 floor held) |

Macro F1 raw average: 0.357 vs 0.293. Mean token confidence: 0.806 vs 0.857. Full-parse exact match: 0.082 vs 0.107 (regression: better per-component agreement, worse full-address agreement).
Issue #116 asked for "clear progress on at least two of (coarse F1, fine F1, calibration, training stability)":
- coarse F1: NEGATIVE (country/postcode each -0.07)
- fine F1: SMALL POSITIVE (street +0.03, house_number +0.01)
- calibration: FLAT
- training stability: NEGATIVE (recipe destabilization is the campaign's central finding)
Only one clean improvement axis. §1 (per_token CRF) and §3 (class_weights) deferred to v0.4.1 after a corpus-side investigation of why the full recipe destabilizes past step 2000.
Post-hoc regression diagnostic (categorized, 4535 entries, 1217 postcode FNs + 194 country FNs):

| error class | postcode FN | country FN |
| --- | --- | --- |
| empty_pred (model emits nothing) | 789 (65%) | 9 (5%) |
| non-Latin transliteration (v0.3.0 pre-existing) | 213 (18%) | 178 (92%) |
| num_confused (house# picked instead of postcode) | 136 (11%) | n/a |
| bio_slip (boundary off ±1 sentencepiece) | 73 (6%) | small |
| other | 6 (0.5%) | ~5% |

Headline takeaways:
- The dominant postcode FN is empty_pred (65%) on mid-position / short-form postcodes (Paris 75008, 47110 Sainte-Livrade-sur-Lot). v0.4.0's source rebalance traded structured-address postcode exposure for coarse-only 10118-style forms.
- Country FN is 92% adversarial transliteration (CJK/Cyrillic raw input → English country gold). This is a v0.3.0 pre-existing failure mode; after excluding adversarials, country FN drops 194 → ~16. The country -0.07 F1 regression is mostly a golden-set adversarial-weighting artifact, not a real recipe regression.
- Decoder span-trim sidecar (commit c72ab4c, no retrain required) addresses the 6% bio_slip slice + an unmeasured share of FP rate. Material impact on the headline numbers is smaller than the initial diagnostic suggested, but the trim is still a correct decoder bug fix.
v0.4.1 implications: source-weight tweak alone partially addresses the 11% num_confused slice. The 65% empty_pred slice requires a different intervention (likely source-weight + synthesis pass over component-order permutations, or an aggressive wof-postalcode bump). Full v0.4.1 scope draft staged at .playpen/control/drafts/v0_4_1-scope.md on the i116 container.
- v0.5.0 thread C-s, scaffold only (shipped 2026-05-23, see Phase 8 fresh-slate plan § Thread C): v0.4.1 was explicitly skipped in favor of a fresh-slate iteration. Thread C-s lands the model code path for the v0.5.0 architecture changes; no training run is part of this ship. Training kicks off after Threads A (tokenizer retrain) and B (corpus-v0.4.0 with adversarial expansion) complete.
Architectural changes shipped in the scaffold:
- Top-k inference (predict_top_k): the encoder now emits the K most-probable tag sequences with calibrated log-probability scores (crf.top_k_decode, list-Viterbi over the structural mask). Default k=5. This is what Stage 5 reconcile (Thread D) consumes: the classifier becomes a candidate generator rather than a single-answer predictor. Argmax path (predict) unchanged for back-compat.
- Phrase-prior input-layer conditioning: when model.use_phrase_priors=true, the encoder takes a per-token feature tensor (B, S, PHRASE_FEATURE_DIM=10) carrying BIE markers (phrase_start/phrase_mid/phrase_end) + a 7-way one-hot over the PhraseKind taxonomy (NUMERIC, STREET_PHRASE, LOCALITY_PHRASE, REGION_ABBREVIATION, POSTCODE, VENUE_PHRASE, HYPHENATED_COMPOUND; mirrored from Thread E's TS contract). Features are concatenated onto the token+position embedding and projected back to hidden_size with a learned linear. Default false preserves v0.3.0/v0.4.0 behavior bit-identically, which enables clean ablation of the phrase-prior contribution.
- Hidden-size knob kept at v0.3.0/v0.4.0 baseline (256). The plan doc mentions a 256 → 384 or 512 bump "likely paid for by rented GPU." For the scaffold, hold the bump: validate that the new architecture trains cleanly at the current size first; the bump becomes an orthogonal v0_5_0-classifier-large.yaml follow-up once the baseline lands clean.
- Label vocab unchanged from v0.3.0 (21 BIO classes; see mailwoman_train.labels.ACTIVE_BIO_LABELS). POI taxonomy expansion is out of scope per the plan doc.
- Recipe parked: configs/v0_5_0-classifier-smoke.yaml defines the short verdict-smoke (constant-LR per Thread F's VERDICT_SMOKES.md, --smoke-mode constant). Invocable once Threads A + B + F have all landed. NOT executed as part of this PR.
Forward-pass smoke (tests/mailwoman_train/test_v0_5_0_forward_pass.py, 14 cases) verifies wiring on stub batches in seconds: no loss.backward, no optimizer step. It is the integration check that decides whether the new architecture is ready for verdict-smoke training, not whether it converges.
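Several entries in the log rely on a BIO structural transition mask that makes orphan-I-* sequences impossible to decode. The following is a minimal Python sketch of that idea, not the shipped crf.py or neural/viterbi.ts: build_bio_mask and viterbi are hypothetical names, the scores are made-up per-token log-scores, and learned CRF transitions are left out so that only the structural constraint is shown.

```python
import math


def build_bio_mask(labels):
    """allowed[i][j] is True when labels[j] may directly follow labels[i]."""
    def ok(prev, nxt):
        if not nxt.startswith("I-"):
            return True  # O and B-* may follow any label
        kind = nxt[2:]
        return prev in (f"B-{kind}", f"I-{kind}")
    return [[ok(p, n) for n in labels] for p in labels]


def viterbi(scores, labels):
    """scores[t][j] is the log-score of labels[j] at token t; returns the best valid path."""
    allowed = build_bio_mask(labels)
    n = len(labels)
    # A sequence may not open with I-*.
    best = [s if not labels[j].startswith("I-") else -math.inf
            for j, s in enumerate(scores[0])]
    back = []
    for t in range(1, len(scores)):
        new_best, ptr = [], []
        for j in range(n):
            top, arg = max((best[i] if allowed[i][j] else -math.inf, i) for i in range(n))
            new_best.append(top + scores[t][j])
            ptr.append(arg)
        best = new_best
        back.append(ptr)
    j = max(range(n), key=lambda k: best[k])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return [labels[k] for k in reversed(path)]


labels = ["O", "B-locality", "I-locality"]
# Illustrative scores where per-token argmax would pick I-locality twice (an orphan I-*).
scores = [[-2.0, -1.0, -0.5], [-3.0, -2.0, -0.1]]
argmax_path = [labels[max(range(3), key=lambda k: row[k])] for row in scores]
decoded = viterbi(scores, labels)
```

Here argmax_path opens with I-locality, while decoded is forced to open the span with B-locality. Because the mask depends only on the label list, the same construction applies to existing weights without retraining, which is the property the JS Viterbi entry describes.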
Pre-flight
- corpus-v0.1.0 exists at /data/corpus/versioned/corpus-v0.1.0/
- Golden set v0.1.0 exists at /data/eval/golden/v0.1.0/
- Python environment set up with PyTorch, Transformers, ONNX, ONNX Runtime, datasets, sentencepiece
- GPU available (Phase 2 is GPU-bound). If the lab has no GPU, document training time and proceed on CPU; the model is small.
Tasks
1. Python project setup
- packages/corpus-python/pyproject.toml with deps locked
- packages/corpus-python/src/mailwoman_train/ package
- Subcommand CLI:
python -m mailwoman_train <command>
2. Data loading
- data_loader.py: load Parquet shards via datasets.load_dataset('parquet', ...). Lazy, streaming, memory-stable.
- Stratified sampling: per-country weights to prevent imbalance. Configurable in YAML.
- Length filtering: drop rows where token count > 128. Address text is short by nature; long rows are usually adapter bugs.
- Verify tokenizer alignment: load tokenizer-v0.1.0, re-tokenize a sample of raw strings, assert the tokenization matches the stored tokens field. If it doesn't, the corpus is corrupt; stop and investigate.
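The length filter and the tokenizer alignment check can be sketched as two small functions over row dictionaries. This is an illustration under assumptions: rows are taken to carry the raw and tokens fields named above, length_ok and find_misaligned are hypothetical helpers, and the tokenizer is passed in as a plain callable. In the real pipeline that callable would wrap the SentencePiece model for tokenizer-v0.1.0; a whitespace split stands in for it here so the example runs without model files.

```python
from typing import Callable, Iterable, List

MAX_TOKENS = 128  # from the length-filtering rule


def length_ok(row: dict) -> bool:
    """Keep rows whose stored token count does not exceed MAX_TOKENS."""
    return len(row["tokens"]) <= MAX_TOKENS


def find_misaligned(rows: Iterable[dict], tokenize: Callable[[str], List[str]]) -> List[int]:
    """Return indices of rows whose stored tokens differ from a fresh tokenization of raw."""
    return [i for i, row in enumerate(rows) if tokenize(row["raw"]) != row["tokens"]]


# Toy sample; the second row is deliberately inconsistent with its raw string.
sample = [
    {"raw": "Paris 75008", "tokens": ["Paris", "75008"]},
    {"raw": "Berlin 10118", "tokens": ["Berlin", "101", "18"]},
]
kept = [row for row in sample if length_ok(row)]
misaligned = find_misaligned(kept, str.split)  # str.split is a stand-in tokenizer
if misaligned:
    print(f"{len(misaligned)} misaligned row(s); stop and investigate: {misaligned}")
```

Returning the offending indices rather than asserting on the first mismatch keeps the check useful for diagnosis, since a systematic tokenizer mismatch and a single corrupt shard look very different in the index list.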
3. Model architecture
- model.py: small encoder-only transformer
  - Layers: 6
  - Hidden: 256
  - Attention heads: 4
  - FF intermediate: 1024
  - Max position: 128
  - Vocab: from tokenizer v0.1.0
  - Token classification head: linear → |BIO_LABELS| logits
- Use HuggingFace transformers library: BertConfig + BertForTokenClassification with custom small config. Or RoBERTa. Pick one and stick with it.
- From-scratch initialization (not pretrained). Address vocabulary is too small to benefit from English-internet pretraining.
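A back-of-the-envelope parameter count shows how small this architecture is. The sketch below counts weights for a generic encoder with the dimensions listed above. It is an approximation: it ignores token-type embeddings and any framework-specific extras, the vocabulary size of 32,000 is a placeholder because the plan only says "from tokenizer v0.1.0", and 15 labels assumes B-/I- tags for the seven Tier 1 components plus O.

```python
def encoder_param_count(vocab_size, num_labels, hidden=256, layers=6, ff=1024, max_pos=128):
    """Approximate trainable parameters of a small encoder-only token classifier."""
    # Token and position embedding tables, plus one LayerNorm (gain + bias).
    embeddings = vocab_size * hidden + max_pos * hidden + 2 * hidden
    # Q, K, V and output projections: four hidden x hidden matrices with biases.
    attention = 4 * (hidden * hidden + hidden)
    # Two feed-forward projections with biases.
    feed_forward = (hidden * ff + ff) + (ff * hidden + hidden)
    # Two LayerNorms per layer.
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms
    # Linear classification head.
    head = hidden * num_labels + num_labels
    return embeddings + layers * per_layer + head


ASSUMED_VOCAB = 32_000  # placeholder; the real value comes from tokenizer v0.1.0
TIER1_BIO_LABELS = 15   # assumption: 7 components x (B, I) + O

total = encoder_param_count(ASSUMED_VOCAB, TIER1_BIO_LABELS)
print(f"{total:,} parameters, about {total / 1e6:.1f} MB at one byte per weight")
```

With these placeholder inputs the embedding table dominates the count, so the tokenizer's vocabulary size matters more for package size than the number of layers. The actual ONNX file size also depends on which operators the quantizer covers, so this is only a sanity bound.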
4. Training loop
- train.py
- Optimizer: AdamW, lr 5e-4, weight decay 0.01
- LR schedule: linear warmup 1000 steps, then cosine decay
- Batch size: 256 (lower if OOM)
- Steps: ~50k for Tier 1 coarse-only. Eval every 2k steps on val set.
- Mixed precision (fp16/bf16) on GPU
- Save checkpoint every 5k steps to /data/models/checkpoints/
- Track: train loss, val loss, val F1 per component, val full-parse exact match
- Use Weights & Biases or TensorBoard or plain CSV; pick one and stick with it. Don't ship a logging refactor in the middle of training.
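If the plain-CSV option is chosen, the eval and checkpoint cadence above reduces to two modulo checks around a csv.DictWriter. The sketch below is one possible shape, not the project's train.py: the column names are hypothetical, an in-memory buffer stands in for the log file, and metric values are left blank because no real training happens here.

```python
import csv
import io

EVAL_EVERY = 2_000        # from the plan: eval every 2k steps
CHECKPOINT_EVERY = 5_000  # from the plan: checkpoint every 5k steps
FIELDS = ["step", "train_loss", "val_loss", "val_f1_country", "val_exact_match"]  # hypothetical columns


def due(step: int, every: int) -> bool:
    """True on positive multiples of `every`."""
    return step > 0 and step % every == 0


buffer = io.StringIO()  # stand-in for an open CSV file
writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="")
writer.writeheader()

eval_steps, checkpoint_steps = [], []
for step in range(0, 10_001, 1_000):
    if due(step, EVAL_EVERY):
        eval_steps.append(step)
        writer.writerow({"step": step})  # real metric values would be filled in here
    if due(step, CHECKPOINT_EVERY):
        checkpoint_steps.append(step)

print(buffer.getvalue())
```

Fixing the column list up front keeps every row the same width, so the file stays readable by the same loader for the whole run.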
5. Tiered training plan
Tier 1: Coarse-only (this phase)
- Train only on rows where country and at least one of (region, locality, postcode) is present
- Labels restricted to: country, region, locality, dependent_locality, postcode, subregion, cedex, O
- Target: > 95% F1 per component on golden set
- This is the v0.1.0 model.
Tiers 2 (street) and 3 (venue) are explicitly future iterations. Do not attempt to train all tiers in Phase 2.
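The Tier 1 row filter and label restriction can be written down directly from the two rules above. This is a sketch under assumptions: rows are represented as a dictionary of component values, the helper names are hypothetical, and mapping out-of-vocabulary labels to O (rather than dropping the row) is a choice made here for illustration.

```python
TIER1_COMPONENTS = (
    "country", "region", "locality", "dependent_locality",
    "postcode", "subregion", "cedex",
)
COARSE_ANCHORS = ("region", "locality", "postcode")


def is_tier1_row(components: dict) -> bool:
    """country must be present, plus at least one of region/locality/postcode."""
    return bool(components.get("country")) and any(components.get(k) for k in COARSE_ANCHORS)


def restrict_label(label: str) -> str:
    """Keep Tier 1 BIO labels; map everything else to O (assumed behavior)."""
    if label == "O":
        return "O"
    _, _, kind = label.partition("-")
    return label if kind in TIER1_COMPONENTS else "O"


def tier1_bio_labels() -> list:
    """O plus B-/I- for every Tier 1 component."""
    return ["O"] + [f"{prefix}-{c}" for c in TIER1_COMPONENTS for prefix in ("B", "I")]


print(len(tier1_bio_labels()))
```

Seven components with B- and I- variants plus O gives 15 labels, which lines up with the 15-class starting point of the later 15→21 BIO expansion.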
6. Evaluation
- eval.py: load checkpoint, run inference on golden set, compute metrics
- Metrics:
  - Per-component F1, precision, recall
  - Full-parse exact match (all components correct)
  - Mean token confidence
  - Calibration: bucket predictions by confidence, check accuracy per bucket
- Output: a markdown report saved alongside the checkpoint
- Compare against rule-based Mailwoman on the same golden set. Rule baseline numbers should be cached so this doesn't require running Mailwoman during training.
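Two of the metrics above are short enough to spell out: precision/recall/F1 from counts, and calibration as accuracy per confidence bucket. The sketch below is illustrative rather than the project's eval.py; the function names and the bucket edges are assumptions, and the sample inputs are toy values chosen only to exercise the code.

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def calibration_buckets(confidences, correct, edges=(0.0, 0.5, 0.7, 0.9, 1.0)):
    """Return {(lo, hi): (count, accuracy)} for non-empty buckets; the last bucket includes hi."""
    out = {}
    for lo, hi in zip(edges, edges[1:]):
        is_last = hi == edges[-1]
        hits = [ok for c, ok in zip(confidences, correct)
                if lo <= c < hi or (is_last and c == hi)]
        if hits:
            out[(lo, hi)] = (len(hits), sum(hits) / len(hits))
    return out


# Toy inputs: four predictions with their confidence and whether each was right.
buckets = calibration_buckets([0.95, 0.92, 0.97, 0.6], [True, False, True, True])
count, accuracy = buckets[(0.9, 1.0)]
print(f"conf>0.9 bucket: {count} predictions, accuracy {accuracy:.3f}")
```

A well-calibrated model has bucket accuracy close to the bucket's confidence range. The "calibration in conf>0.9 bucket" figures in the iteration log are readings of this kind for the top bucket.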
7. ONNX export
- export_onnx.py
- Export with dynamic axes for batch and sequence length
- Opset 17
- Verify ONNX inference matches PyTorch inference within 1e-4 on a sample of 1000 inputs
- Output: /data/models/onnx/model-v0.1.0-en-us.onnx, same for fr-fr
If you trained a single multilingual model (recommended for Tier 1, since coarse is cheap to share), export it twice with the same weights, named per locale. Splitting into per-locale models is a Phase 3 decision based on size and load behavior.
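The parity requirement amounts to a maximum absolute difference check over paired logits. The sketch below shows only that comparison step, using nested Python lists so it runs without PyTorch or ONNX Runtime installed; in practice the two sides would be the model outputs converted with .tolist(). The helper names are hypothetical and the sample values are made up.

```python
def max_abs_diff(reference, candidate) -> float:
    """Largest element-wise gap between two equally shaped nested lists of floats."""
    if isinstance(reference, (int, float)):
        return abs(reference - candidate)
    if len(reference) != len(candidate):
        raise ValueError("shape mismatch between reference and candidate")
    return max((max_abs_diff(r, c) for r, c in zip(reference, candidate)), default=0.0)


def parity_failures(pairs, atol: float = 1e-4) -> list:
    """pairs: iterable of (pytorch_logits, onnx_logits). Returns indices that exceed atol."""
    return [i for i, (ref, cand) in enumerate(pairs) if max_abs_diff(ref, cand) > atol]


# Made-up logits: the first pair is within tolerance, the second is not.
pairs = [
    ([[0.1, 0.2]], [[0.10001, 0.2]]),
    ([[0.1, 0.2]], [[0.1, 0.21]]),
]
failures = parity_failures(pairs)
print(f"{len(failures)} of {len(pairs)} inputs exceed tolerance: {failures}")
```

Reporting failing indices, rather than a single boolean, makes it possible to see whether mismatches cluster on long sequences or particular inputs, which usually points at a dynamic-axes problem.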
8. Quantization
- quantize.py: int8 dynamic quantization via onnxruntime.quantization
- Calibrate on 1000 val-set examples (calibration data is only consumed by static quantization; dynamic quantization computes activation ranges at runtime, so this step applies only if the static path is used)
- Verify quantized model F1 on golden set drops by less than 0.5% from fp32. If it drops more, investigate (likely a quantization config issue, not a fundamental limit).
- Output: /data/models/quantized/model-v0.1.0-en-us-int8.onnx
9. Weights package preparation
- packages/neural-weights-en-us/ and packages/neural-weights-fr-fr/
- Each contains:
  - model.onnx (int8 quantized)
  - tokenizer.model (SentencePiece)
  - model-card.json (ModelCard per reference/INTERFACES.md)
  - package.json with name, version, license
  - README.md describing the model, training corpus, eval scores
- These packages are data-only. No JS code. They are loaded by @mailwoman/neural at runtime.
- Verify package size: aim for < 30MB int8 model. Tokenizer is ~1MB. Total package < 40MB.
10. Model card
- ModelCard filled in honestly
- Include: training corpus version, training duration, hardware, eval scores per component on golden set + holdout splits, known failure modes (e.g., "underperforms on Hawaiian addresses", "confused by historical Paris arrondissement notation pre-1860")
- This is a public document. Users will read it before adopting.
Success criteria checklist
- Tier 1 model trained, checkpoint saved
- ONNX export verified parity with PyTorch
- Int8 quantized model meets eval threshold
- neural-weights-en-us@0.1.0 and neural-weights-fr-fr@0.1.0 package directories complete
- Model cards filled in
- Eval reports committed to git (numbers, not models)
- Beats rule-based Mailwoman on golden set for country and region components by at least 2 F1 points. If not, investigate before proceeding; the architecture is fine, the corpus is probably the issue.
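Two of the checklist items are numeric gates and can be expressed as small predicates. The sketch below makes two interpretive assumptions that the checklist itself does not settle: "2 F1 points" is read as 0.02 on a 0-1 scale, and the quantization threshold of 0.5% is read as an absolute drop of 0.005 per component. Function names and the sample scores are hypothetical.

```python
def beats_rule_baseline(neural_f1: dict, rule_f1: dict,
                        components=("country", "region"), margin: float = 0.02) -> dict:
    """Per component: does the neural model lead the rule baseline by at least `margin`?"""
    return {c: neural_f1[c] - rule_f1[c] >= margin for c in components}


def quantization_ok(fp32_f1: dict, int8_f1: dict, max_drop: float = 0.005) -> bool:
    """True when no component loses `max_drop` or more F1 after int8 quantization."""
    return all(fp32_f1[c] - int8_f1[c] < max_drop for c in fp32_f1)


# Hypothetical scores, for illustration only.
neural = {"country": 0.95, "region": 0.91}
rules = {"country": 0.90, "region": 0.90}
gate = beats_rule_baseline(neural, rules)
print(gate, quantization_ok(neural, {"country": 0.949, "region": 0.909}))
```

Returning a per-component dictionary instead of one boolean keeps the "investigate before proceeding" step concrete, since it shows which component missed the margin.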
When to ship vs train more (original framing, superseded by iteration cadence)
If after the first training run, golden F1 is:
- > 95% per coarse component: ship.
- 90–95%: analyze failure modes. Likely fixable with corpus tweaks (more synthesis, deduplication, a missing source). One additional iteration is fine.
- < 90%: stop and re-examine. Could be: tokenizer mismatch, label misalignment, schema bug, severely imbalanced training data. Do not "train longer" without diagnosing first.
Resist the urge to add street-level components in this phase to "get more value." Tier 1 ships coarse. Tier 2 is its own iteration. Mixing them blurs the metrics and slows iteration.
Revised (2026-05-18)
The above original framing assumed a single training run. After v0.1.0 + v0.2.0, the actual cadence is:
- Each iteration ships an artifact, target or no target. Below-target ships are OK if they're honest about it (model card + eval ledger entry) AND the Ship-of-Theseus coexistence model means the rule classifiers still run alongside.
- Per-iteration F1 floor (not target): each iteration must improve over the prior one in at least one of: per-component F1, calibration tightness, or capability surface (new labels supported).
- Vocabulary tier expansion (Tier 2 → venue+street+house_number; Tier 3 → organization/POI venue) happens as iteration deltas within Phase 2, not as separate phases. Same encoder, more head classes. Each lands in its own retrain.
- The eval ledger is the success record: evals/scores-by-version.json with corpus + golden-set sha-pinning makes every iteration's delta empirically defensible.
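The per-iteration floor can be checked mechanically against two ledger entries. The sketch below assumes a simplified entry shape (an f1 mapping per component and an optional high-confidence calibration figure); the real schema of evals/scores-by-version.json is not shown in this document, so the field names are hypothetical. The F1 values are the coarse and fine scores quoted in the iteration log for v0.2.0 and v0.3.0.

```python
def improved_axes(prev: dict, curr: dict) -> list:
    """Which of the three floor axes improved from prev to curr."""
    axes = []
    shared = set(prev["f1"]) & set(curr["f1"])
    if any(curr["f1"][c] > prev["f1"][c] for c in shared):
        axes.append("per_component_f1")
    prev_cal, curr_cal = prev.get("calibration_hi_conf"), curr.get("calibration_hi_conf")
    if prev_cal is not None and curr_cal is not None and curr_cal > prev_cal:
        axes.append("calibration")
    if set(curr["f1"]) - set(prev["f1"]):
        axes.append("capability_surface")  # new labels supported
    return axes


# Hypothetical entry shape; numbers as quoted in the iteration log.
v0_2_0 = {"f1": {"region": 0.83, "locality": 0.65, "postcode": 0.86},
          "calibration_hi_conf": 0.882}
v0_3_0 = {"f1": {"region": 0.18, "locality": 0.27, "postcode": 0.76,
                 "house_number": 0.78, "venue": 0.39, "street": 0.27}}

axes = improved_axes(v0_2_0, v0_3_0)
print(axes or "floor not met")
```

For this pair the only axis that registers is capability surface, which matches how the log describes v0.3.0: a win on the new label classes alongside a coarse F1 regression.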
When to call this phase done
When the weights packages exist on disk, model cards are accurate, eval scores beat the rule baseline on coarse components, and the only remaining work for shipping is TypeScript integration (Phase 3).