Contributing model work โ the runbook
You want to improve a tag, add a locale, or change anything that touches the model. This page is the procedure. The concepts behind it live elsewhere (What Mailwoman is, Eval discipline, Negative space); this is the part you follow with your hands.
The iron rulesโ
- Real-OOD eval before training. Build (or find) the held-out real-world eval for your tag first, and pre-register the gate โ target numbers, no-regression bounds โ in the training config as a comment before step 1 runs. If you can't define the gate, you aren't ready to train.
- Never compare F1 across tokenizer versions. Same tokenizer or the comparison is void
(
per-locale-f1.ts --tokenizerenforces this; the model card records the version). - Compare fp32-to-fp32. Int8 is a release artifact, not an experiment lens.
- One variable per run. Prove a lever solo, then consolidate. Stacking shards in one 20k run dilutes every tag (measured, twice โ see the cumulative-dilution note in the parity scorecard).
- Recompile before eval. Eval harnesses load compiled
core/outโ a stale build makes your core change a silent no-op.
Which tool for which tag (the lever-shape taxonomy)โ
- Open-vocab / distributional tags (street, affixes, unit, locality): the model must learn the boundary from context โ format-diverse synth shard + retrain is the tool.
- Closed-vocab tags (country, po_box, cedex): a finite surface set in a predictable slot โ gazetteer/codex lexicon as a soft anchor feature โ the matcher informs the model, it never overrides it. We measured the override version and reversed it; read Closed-vocab fields: model-first before proposing a deterministic tagger.
- Format quirks both systems miss (rare edge shapes): consider a query-shape route before a shard โ some formats are structurally different enough that routing beats tagging.
The standard gate setโ
Run for every candidate, regardless of what you changed (this is what
promotion-gate.sh, #479 automates):
| Check | Command | Bar |
|---|---|---|
| Per-locale floors | scripts/eval/per-locale-f1.ts (US/FR, --tokenizer) + per-locale-f1-floors.py | within 1pp, fp32-to-fp32 |
| German native order | scripts/eval/de-order-eval.sh | NO-REGRESSION |
| Unit retention | unit-real designators eval | โฅ 88 |
| Affix split | scripts/eval/score-affix.ts (NOT per-locale-f1 โ its fold can't see the split) | hold gated numbers |
| Demo smoke | eval-model skill / demo-preset-compare.ts | presets stable |
| Honest geography | scripts/eval/honest-eval.sh | region-match + coord p50/p90 + PIP hold |
Plus your tag's pre-registered real-OOD gate. Pass โ append the run to
evals/scores-by-version.json and write the dated report in docs/articles/evals/.
Adding a shardโ
- Generator lives in
corpus/(synthesize-*.ts) orscripts/build-*-shard.mjs; every row carriessynth: {method, base_source_id}ancestry. - Audit format diversity against the real eval's observed forms before generating โ thin diversity teaches lexical pattern-matching, not the boundary.
- Overlay manifests declare lineage against a base corpus version
(
scripts/assemble-*-overlay-manifest.pyis the pattern); never hand-edit a manifest. - Train via
modal run -d scripts/modal/train_remote.py --config <yaml> --resume auto(~1h A100). Config lives incorpus-python/src/mailwoman_train/configs/โ copy the latest, change one variable, write the gate comment.
Reading the scorecardโ
Two lenses that disagree on purpose: arena head-to-head is whole-parse-strict (the honest "usable parse" lens; understates per-tag wins) and per-tag F1 is what the campaign moves. Golden-dev per-tag numbers lie for tags the golden set barely contains (unit reads 6.3 there and 92.3 on real-OOD) โ for campaign tags, the real-OOD column is the truth.
When your lever doesn't gateโ
Tune-and-retry once, then stop and write the postmortem instead of a third run โ the v0.6.x cycle is the cautionary retrospective. A negative result recorded in the ledger is a contribution; a recipe-tweak treadmill is not.