Night Shift 2: 2026-05-28
Second night shift focused on v0.6.0: Stage 3 label activation + PO box corpus synthesis + multi-locale variant alias foundation.
Shipped
PO box synthesis pipeline
Locale-aware PO box generator covering en-US/CA/GB/AU, fr-FR/CA, es-ES/MX/AR.
- corpus/src/synthesize-po-box.ts: per-locale leader templates, number-format noise (10%), PMB variant logic
- corpus/src/adapters/synth-po-box/adapter.ts: JSONL tuples → CanonicalRows
- scripts/extract-tuples.py: pulls (locality, region, postcode, country) from WOF SQLite via a locality→county→region join + state-prefix-to-ZIP synthesis
- scripts/build-po-box-shard.mjs: runs synthesizer + alignRow, writes LabeledRow JSONL
- scripts/jsonl-to-parquet.py: converts to v0.4.0 parquet schema
Produced: 50K labeled rows / 3 MB parquet shard. Sample:
P.O. Box 9, Bancroft, ID 83603
tokens: ['P', 'O', 'Box', '9', 'Bancroft', 'ID', '83603']
labels: ['B-po_box', 'I-po_box', 'I-po_box', 'I-po_box', 'B-locality', 'B-region', 'B-postcode']
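The sample above can be reproduced with a small alignment sketch. This is not the project's alignRow; the tokenizer (runs of letters and digits) and the `label_segments` helper are assumptions chosen only so that "P.O." splits into "P" and "O" as in the sample.

```python
import re

# Assumed tokenizer: runs of letters/digits, so punctuation acts as a separator.
TOKEN_RE = re.compile(r"[^\W_]+")

def label_segments(segments):
    """Turn (text, tag) segments into parallel token and BIO-label lists."""
    tokens, labels = [], []
    for text, tag in segments:
        for i, tok in enumerate(TOKEN_RE.findall(text)):
            tokens.append(tok)
            # First token of a segment opens the span, the rest continue it.
            labels.append(("B-" if i == 0 else "I-") + tag)
    return tokens, labels

tokens, labels = label_segments([
    ("P.O. Box 9", "po_box"),   # whole phrase is one span, not just "9"
    ("Bancroft", "locality"),
    ("ID", "region"),
    ("83603", "postcode"),
])
```

Because the leader and the number sit in one segment, the whole phrase becomes a single po_box span.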
Design decisions
Consulted DeepSeek (3 turns) + USPS Pub 28 §28C2.040 + DMM 508 §4.1.4/§4.5.4:
- PMB shares the po_box tag. USPS treats PMB as a PO Box alias in CASS. Downstream code can distinguish via presence of a street line.
- Whole-phrase span ("PO Box 123" not "123"). Matches existing golden eval convention.
- Strategy A (replace street, not augment): PO box and street are mutually exclusive per USPS. Synthesizing fake (street + PMB) tuples would teach the model an invalid pattern.
- 10% number-format noise: commas, dashes, embedded spaces. Real OCR/transcription input is lousy; ship with that as native.
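A minimal sketch of the last two decisions, assuming a dict-shaped row. The function names, the field names `house_number` and `street`, and the single-separator noise model are illustrative and are not taken from synthesize-po-box.ts.

```python
import random

SEPARATORS = [",", "-", " "]  # commas, dashes, embedded spaces

def add_number_noise(number, rng):
    """Insert one separator at a random interior position of the box number."""
    if len(number) < 2:
        return number  # a single digit has no interior position
    pos = rng.randrange(1, len(number))
    return number[:pos] + rng.choice(SEPARATORS) + number[pos:]

def synthesize_po_box_line(leader, number, rng, noise_rate=0.10):
    """Join leader and number, distorting the number with probability noise_rate."""
    if rng.random() < noise_rate:
        number = add_number_noise(number, rng)
    return f"{leader} {number}"

def apply_strategy_a(row, po_box_line):
    """Strategy A: drop the street fields and put the PO box in their place."""
    out = {k: v for k, v in row.items() if k not in ("house_number", "street")}
    out["po_box"] = po_box_line
    return out

rng = random.Random(0)  # seeded so a shard build is reproducible
clean = synthesize_po_box_line("PO Box", "1234", rng, noise_rate=0.0)
noisy = synthesize_po_box_line("PO Box", "1234", rng, noise_rate=1.0)
row = apply_strategy_a(
    {"house_number": "12", "street": "Main St", "locality": "Bancroft"}, clean
)
```

Passing the random generator in explicitly keeps the noise rate testable at the extremes (0.0 never distorts, 1.0 always does).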
Stage 3 activation
# Before
ACTIVE_TAGS = STAGE2_TAGS # 10 tags, 21 BIO labels
# After
ACTIVE_TAGS = STAGE3_TAGS # 16 tags, 33 BIO labels
STAGE3 appends to STAGE2 without reordering, so IDs are preserved. Existing v0.4.0 shards work unchanged. Models trained on STAGE2 IDs decode correctly against STAGE3; the new logit slots just never get argmax'd.
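The append-only property can be checked with a short sketch. The label layout (O first, then a B-/I- pair per tag) is an assumption that happens to match the 21 and 33 counts above; the tag names are placeholders, not the real tag lists.

```python
def bio_labels(tags):
    """Lay out labels as O, B-t1, I-t1, B-t2, I-t2, ... (2 * len(tags) + 1 total)."""
    labels = ["O"]
    for tag in tags:
        labels += [f"B-{tag}", f"I-{tag}"]
    return labels

# Placeholder names; only the counts (10 and 16) come from the notes above.
STAGE2_TAGS = [f"stage2_tag_{i}" for i in range(10)]
STAGE3_TAGS = STAGE2_TAGS + [f"stage3_tag_{i}" for i in range(6)]  # append only

stage2_labels = bio_labels(STAGE2_TAGS)
stage3_labels = bio_labels(STAGE3_TAGS)

# Every STAGE2 label keeps its integer ID under STAGE3.
ids_preserved = stage3_labels[: len(stage2_labels)] == stage2_labels
```

Under this layout, inserting a tag anywhere but the end would shift every later ID, which is why the ordering constraint matters for existing shards.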
Golden eval expansion
PO box coverage in data/eval/golden/v0.1.2/:
- Before: 1 entry (just "PO Box 123, Burlington, VT 05401")
- After: 27 entries
  - 20 US variants: PO Box, P.O. Box, P. O. Box, POB, POBOX, Post Office Box, Box, P.O.Box; PMB at CMRA; single-digit through 7-digit
  - 6 FR/CA variants: BP, B.P., Boîte Postale, Case Postale, CP
Per-tag error analysis
scripts/eval-error-analysis.ts now emits a per-tag recall table. v0.5.4 baseline:
- Stage 2 tags (locality, region, postcode, street, house_number, ...): high recall as expected
- Stage 3 tags (street_prefix, street_suffix, unit, po_box, intersection_a/b, attention, cedex): 0% recall (the model doesn't emit them yet). This is the baseline for v0.6.0 to compare against.
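For reference, one way to compute a per-tag recall table from BIO sequences is sketched below. It assumes span-level exact matching; whether scripts/eval-error-analysis.ts scores spans or tokens is not stated here, and the function names are illustrative.

```python
from collections import Counter

def bio_spans(labels):
    """Extract (tag, start, end) spans from a BIO sequence; end is exclusive."""
    spans, start, tag = set(), None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes a trailing span
        continues = label.startswith("I-") and label[2:] == tag
        if tag is not None and not continues:
            spans.add((tag, start, i))
            start, tag = None, None
        if label.startswith("B-"):
            start, tag = i, label[2:]
    return spans

def per_tag_recall(examples):
    """examples: iterable of (gold_labels, predicted_labels) pairs."""
    gold_count, hit_count = Counter(), Counter()
    for gold, pred in examples:
        pred_spans = bio_spans(pred)
        for span in bio_spans(gold):
            gold_count[span[0]] += 1
            hit_count[span[0]] += span in pred_spans  # exact boundary + tag match
    return {tag: hit_count[tag] / n for tag, n in gold_count.items()}

recall = per_tag_recall([
    (["B-po_box", "I-po_box", "B-locality"], ["O", "O", "B-locality"]),
])
```

A model that never emits a tag scores 0.0 for it, which is the situation described for the Stage 3 tags at the v0.5.4 baseline.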
NaN recovery
v0.6.0 training diverged with NaN loss at step 950 (right at end of warmup, peak LR). Root cause per DeepSeek: CRF transition gradient at peak LR amplified by the new 33-label transition table.
Two-knob fix:
- learning_rate: 1.5e-4 → 1.0e-4
- crf_loss_weight: 0.5 → 0.1
Skipped: increased warmup (delays blowup), tighter grad clip (masks rather than fixes). After ship, raise CRF weight incrementally.
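To make the two knobs concrete, here is a sketch of where each one would act. It assumes a linear warmup and a weighted-sum loss (CE plus weight times CRF negative log-likelihood); neither is confirmed by these notes, and the warmup length of 1000 steps is a made-up value for illustration.

```python
import math

def lr_at(step, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then hold (any decay phase is omitted here)."""
    return peak_lr * min(1.0, step / warmup_steps)

def total_loss(ce_loss, crf_nll, crf_loss_weight):
    """Weighted sum of the two terms; refuses to return a non-finite value."""
    loss = ce_loss + crf_loss_weight * crf_nll
    if not math.isfinite(loss):
        raise FloatingPointError(f"non-finite loss: {loss}")
    return loss

BEFORE = {"learning_rate": 1.5e-4, "crf_loss_weight": 0.5}
AFTER = {"learning_rate": 1.0e-4, "crf_loss_weight": 0.1}

WARMUP_STEPS = 1000  # hypothetical; the real warmup length is not given above
peak_before = lr_at(WARMUP_STEPS, BEFORE["learning_rate"], WARMUP_STEPS)
peak_after = lr_at(WARMUP_STEPS, AFTER["learning_rate"], WARMUP_STEPS)
```

In this framing the lower learning rate shrinks every update at the end of warmup, while the lower CRF weight shrinks only the share of the gradient that comes from the transition table.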
Variant alias table (#166 follow-up)
Foundation shipped earlier in the day (@mailwoman/variant-aliases). 37 entries covering amenity + brand variants in en-AU/GB/CA, fr-FR/CA, ja-JP. Locale-gated lookup with confidence scoring. The runtime integration into the kind classifier remains v0.6.0+ work.
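A locale-gated lookup with a confidence threshold might look like the sketch below. The entries, the `Alias` shape, and the `lookup` signature are all illustrative assumptions; they are not the @mailwoman/variant-aliases API or its real table.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Alias:
    canonical: str
    confidence: float

# Illustrative entries only; the real table has 37 entries not reproduced here.
ALIASES = {
    ("en-AU", "servo"): Alias("gas_station", 0.9),
    ("en-GB", "chemist"): Alias("pharmacy", 0.9),
    ("fr-FR", "pharmacie"): Alias("pharmacy", 0.95),
}

def lookup(term: str, locale: str, min_confidence: float = 0.5) -> Optional[Alias]:
    """Return the alias only if it exists for this locale and clears the threshold."""
    alias = ALIASES.get((locale, term.strip().lower()))
    if alias is None or alias.confidence < min_confidence:
        return None
    return alias

hit = lookup("Servo", "en-AU")
miss = lookup("servo", "en-GB")  # same term, wrong locale: gated out
```

Keying on the (locale, term) pair means a regional variant never leaks into a locale where the word means something else.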
v0.6.0 ship pipeline
scripts/ship-v0.6.0.sh runs the staged end-to-end pipeline: export ONNX from Modal → quantize int8 → link as dev → 9 demo presets (6 canonical + 3 PO box) → error analysis → upload to HF.
Open issues
- #189: alt_names FTS5 split. SQLite BM25's doc-length normalization can't be fixed with column weights alone. Real fix requires schema migration to separate FTS5 tables. Documented in docs/articles/concepts/importance-vs-population.md.
- #166: variant alias runtime integration. Requires new QueryKind values + POI index.
v0.6.0 results
Training completed at step 100K. CE-only (CRF training disabled after two NaN attempts).
Demo presets: 11/11 pass (6 canonical + 5 Stage 3 variants).
Golden eval per-tag (v0.5.4 → v0.6.0):
| Tag | v0.5.4 | v0.6.0 | Δ |
|---|---|---|---|
| postcode | 75.7% | 76.0% | +0.3 |
| house_number | 78.7% | 79.0% | +0.3 |
| region | 65.0% | 65.0% | flat |
| locality | 39.4% | 39.7% | +0.3 |
| street | 28.0% | 27.9% | -0.1 (flat) |
| venue | 29.4% | 29.2% | -0.2 (flat) |
| po_box | 0.0% | 51.9% | +51.9 |
| street_prefix | 0.0% | 0.0% | flat (corpus rebuild pending) |
| street_suffix | 0.0% | 0.0% | flat (corpus rebuild pending) |
| unit | 0.0% | 0.0% | flat (corpus rebuild pending) |
| intersection_a/b | 0.0% | 0.0% | flat (corpus rebuild pending) |
v0.6.0 ships:
- HF model repo sister-software/mailwoman-en-us updated
- HF bucket en-us/v0.6.0/* populated (model + tokenizer + FST + wof-hot.db + model-card)
- releases.json updated, defaultVersion: v0.6.0
- neural-weights-en-us package bumped to 0.6.0
Deferred to v0.6.1:
- CRF learned transitions (NaN root cause investigation: bf16 + 33×33 transition table)
- Street decomposition recall (needs corpus rebuild with updated TIGER/NAD/BAN adapters)
- Stage 3 intersection / unit tags (same corpus rebuild requirement)