Night Shift 2 — 2026-05-28

Second night shift focused on v0.6.0: Stage 3 label activation + PO box corpus synthesis + multi-locale variant alias foundation.

Shipped

PO box synthesis pipeline

Locale-aware PO box generator covering en-US/CA/GB/AU, fr-FR/CA, es-ES/MX/AR.

corpus/src/synthesize-po-box.ts — per-locale leader templates, number-format noise (10%), PMB variant logic
corpus/src/adapters/synth-po-box/adapter.ts — JSONL tuples → CanonicalRows
scripts/extract-tuples.py — pulls (locality, region, postcode, country) from WOF SQLite via locality→county→region join + state-prefix-to-ZIP synthesis
scripts/build-po-box-shard.mjs — runs synthesizer + alignRow, writes LabeledRow JSONL
scripts/jsonl-to-parquet.py — converts to v0.4.0 parquet schema
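
The per-locale leader idea can be sketched in a few lines. This is a minimal illustration, not the code in synthesize-po-box.ts: the `LEADERS` table, the function name, and the split of French leaders between fr-FR and fr-CA are assumptions, and the leader strings are limited to variants named elsewhere in these notes.

```python
import random

# Hypothetical leader templates; the fr-FR / fr-CA split is an assumption.
LEADERS = {
    "en-US": ["PO Box", "P.O. Box", "P. O. Box", "POB", "Post Office Box"],
    "fr-FR": ["BP", "B.P.", "Boîte Postale"],
    "fr-CA": ["Case Postale", "CP"],
}


def synthesize_po_box(locale, locality, region, postcode, rng):
    """Build one synthetic PO box address line from a WOF-style tuple."""
    leader = rng.choice(LEADERS[locale])
    number = rng.randint(1, 9_999_999)  # single-digit through 7-digit boxes
    return f"{leader} {number}, {locality}, {region} {postcode}"


rng = random.Random(0)  # seeded so a shard build is reproducible
line = synthesize_po_box("en-US", "Bancroft", "ID", "83603", rng)
```

Passing the random generator in explicitly keeps the output deterministic per seed, which makes a shard easy to regenerate.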

Produced: 50K labeled rows / 3 MB parquet shard. Sample:

P.O. Box 9, Bancroft, ID 83603
  tokens: ['P', 'O', 'Box', '9', 'Bancroft', 'ID', '83603']
  labels: ['B-po_box', 'I-po_box', 'I-po_box', 'I-po_box', 'B-locality', 'B-region', 'B-postcode']
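
The sample above can be reproduced with a small BIO labeling sketch. It is not the alignRow implementation; it assumes tokens are runs of word characters and that each component is labeled as one whole-phrase span. The function names are hypothetical.

```python
import re


def tokenize(text):
    # Assumption: tokens are runs of word characters, so "P.O." -> "P", "O".
    return re.findall(r"\w+", text)


def label_components(components):
    """Turn (tag, text) pairs into parallel token and BIO label lists."""
    tokens, labels = [], []
    for tag, text in components:
        for i, tok in enumerate(tokenize(text)):
            tokens.append(tok)
            # First token of a component opens the span, the rest continue it.
            labels.append(("B-" if i == 0 else "I-") + tag)
    return tokens, labels


tokens, labels = label_components([
    ("po_box", "P.O. Box 9"),
    ("locality", "Bancroft"),
    ("region", "ID"),
    ("postcode", "83603"),
])
```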

Design decisions

Consulted DeepSeek (3 turns) + USPS Pub 28 §28C2.040 + DMM 508 §4.1.4/§4.5.4:

PMB shares po_box tag. USPS treats PMB as a PO Box alias in CASS. Downstream code can distinguish via presence of a street line.
Whole-phrase span ("PO Box 123" not "123"). Matches existing golden eval convention.
Strategy A (replace street, not augment) — PO box and street are mutually exclusive per USPS. Synthesizing fake (street + PMB) tuples would teach the model an invalid pattern.
10% number-format noise — commas, dashes, embedded spaces. Real OCR and transcription input is noisy, so the model trains on that noise from the start.
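
A minimal sketch of the number-format noise, assuming a single separator is injected at a random interior position. The real synthesizer may apply noise differently; `NOISE_RATE` and `add_number_noise` are hypothetical names.

```python
import random

NOISE_RATE = 0.10  # share of box numbers that get a formatting defect


def add_number_noise(number, rng):
    """With probability NOISE_RATE, insert a comma, dash, or space."""
    if len(number) < 2 or rng.random() >= NOISE_RATE:
        return number  # too short to split, or left clean
    sep = rng.choice([",", "-", " "])
    pos = rng.randint(1, len(number) - 1)  # interior position only
    return number[:pos] + sep + number[pos:]


rng = random.Random(42)
samples = [add_number_noise("12345", rng) for _ in range(10_000)]
noisy_share = sum(s != "12345" for s in samples) / len(samples)
```

Stripping the separator always recovers the original digits, so the gold label span is unaffected by the noise.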

Stage 3 activation

# Before
ACTIVE_TAGS = STAGE2_TAGS  # 10 tags, 21 BIO labels

# After
ACTIVE_TAGS = STAGE3_TAGS  # 16 tags, 33 BIO labels

STAGE3 appends to STAGE2 without reordering — IDs are preserved. Existing v0.4.0 shards work unchanged. Models trained on STAGE2 IDs decode correctly against STAGE3; the new logit slots just never get argmax'd.
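
The ID-preservation claim follows from an append-only label layout. The sketch below assumes labels are laid out as one `O` label followed by a `B-`/`I-` pair per tag, which matches the 21 and 33 label counts above; the tag names are placeholders, not the real tag lists.

```python
def bio_labels(tags):
    """Lay out BIO labels as O, then a B-/I- pair per tag, in tag order."""
    labels = ["O"]
    for tag in tags:
        labels += [f"B-{tag}", f"I-{tag}"]
    return labels


# Placeholder tag names; only the counts (10 and 16) come from the notes.
STAGE2_TAGS = [f"stage2_tag_{i}" for i in range(10)]
STAGE3_TAGS = STAGE2_TAGS + [f"stage3_tag_{i}" for i in range(6)]

stage2_labels = bio_labels(STAGE2_TAGS)
stage3_labels = bio_labels(STAGE3_TAGS)

# Every Stage 2 label keeps its index in the Stage 3 table.
ids_preserved = stage3_labels[: len(stage2_labels)] == stage2_labels
```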

Golden eval expansion

PO box coverage in data/eval/golden/v0.1.2/:

Before: 1 entry (just "PO Box 123, Burlington, VT 05401")
After: 27 entries
- 20 US variants: PO Box, P.O. Box, P. O. Box, POB, POBOX, Post Office Box, Box, P.O.Box; PMB at CMRA; single-digit through 7-digit
- 6 FR/CA variants: BP, B.P., Boîte Postale, Case Postale, CP

Per-tag error analysis

scripts/eval-error-analysis.ts now emits a per-tag recall table. v0.5.4 baseline:

Stage 2 tags (locality, region, postcode, street, house_number, ...) — high recall as expected
Stage 3 tags (street_prefix, street_suffix, unit, po_box, intersection_a/b, attention, cedex) — 0% recall (model doesn't emit them yet). Baseline for v0.6.0 to compare against.
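
For reference, per-tag recall can be computed as sketched below. This assumes token-level scoring that ignores the B-/I- distinction; scripts/eval-error-analysis.ts may score at span level instead, and the function names are hypothetical.

```python
from collections import Counter


def tag_of(label):
    """Strip the BIO prefix; 'O' has no tag."""
    return None if label == "O" else label.split("-", 1)[1]


def per_tag_recall(gold_rows, pred_rows):
    """Fraction of gold tokens per tag that the prediction tagged the same."""
    total, hit = Counter(), Counter()
    for gold, pred in zip(gold_rows, pred_rows):
        for g, p in zip(gold, pred):
            tag = tag_of(g)
            if tag is None:
                continue  # recall only counts tokens that carry a gold tag
            total[tag] += 1
            if tag_of(p) == tag:
                hit[tag] += 1
    return {tag: hit[tag] / total[tag] for tag in total}


gold = [["B-po_box", "I-po_box", "I-po_box", "I-po_box",
         "B-locality", "B-region", "B-postcode"]]
# A model that cannot emit po_box yet, as in the v0.5.4 baseline.
pred = [["O", "O", "O", "B-house_number",
         "B-locality", "B-region", "B-postcode"]]
recall = per_tag_recall(gold, pred)
```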

NaN recovery

v0.6.0 training diverged with NaN loss at step 950 (right at end of warmup, peak LR). Root cause per DeepSeek: CRF transition gradient at peak LR amplified by the new 33-label transition table.

Two-knob fix:

learning_rate: 1.5e-4 → 1.0e-4
crf_loss_weight: 0.5 → 0.1

Skipped: a longer warmup (only delays the blowup) and a tighter grad clip (masks the instability rather than fixing it). After shipping, raise the CRF weight incrementally.
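
To see what the second knob does, here is a sketch of a weighted loss. It assumes the total loss is the cross-entropy term plus the weighted CRF negative log-likelihood; the actual combination in the trainer may differ, and the loss values are made up for illustration.

```python
import math


def combined_loss(ce_loss, crf_nll, crf_loss_weight):
    # Assumed form: CE plus a scaled CRF term.
    return ce_loss + crf_loss_weight * crf_nll


def check_finite(loss, step):
    """Fail fast instead of training on a NaN or inf loss."""
    if not math.isfinite(loss):
        raise FloatingPointError(f"non-finite loss at step {step}")
    return loss


# Illustrative values only: the same batch under the old and new weight.
old_loss = check_finite(combined_loss(2.0, 40.0, 0.5), step=950)
new_loss = check_finite(combined_loss(2.0, 40.0, 0.1), step=950)
```

Under this assumed form, dropping the weight from 0.5 to 0.1 scales the CRF gradient contribution by the same factor of five, independent of the learning-rate change.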

Variant alias table (#166 follow-up)

Foundation shipped earlier in the day (@mailwoman/variant-aliases). 37 entries covering amenity + brand variants in en-AU/GB/CA, fr-FR/CA, ja-JP. Locale-gated lookup with confidence scoring. The runtime integration into the kind classifier remains v0.6.0+ work.
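
A locale-gated lookup with confidence scoring could look like the sketch below. The `Alias` shape, the `lookup` signature, the threshold, and both entries are hypothetical; they are not taken from the @mailwoman/variant-aliases table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alias:
    variant: str        # surface form, stored case-folded
    canonical: str      # normalized amenity or brand
    locales: frozenset  # locales where the variant applies
    confidence: float


# Illustrative entries only.
ALIASES = [
    Alias("chemist", "pharmacy", frozenset({"en-AU", "en-GB"}), 0.9),
    Alias("servo", "gas station", frozenset({"en-AU"}), 0.8),
]


def lookup(term, locale, min_confidence=0.5):
    """Return the best alias for term in this locale, or None."""
    key = term.casefold()
    matches = [
        a for a in ALIASES
        if a.variant == key
        and locale in a.locales          # the locale gate
        and a.confidence >= min_confidence
    ]
    return max(matches, key=lambda a: a.confidence, default=None)
```

The locale gate is what keeps a regional variant from rewriting queries in locales where the word means something else.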

v0.6.0 ship pipeline

scripts/ship-v0.6.0.sh — staged end-to-end: export ONNX from Modal → quantize int8 → link as dev → 9 demo presets (6 canonical + 3 PO box) → error analysis → upload to HF.

Open issues

#189 — alt_names FTS5 split. SQLite BM25's doc-length normalization can't be fixed with column weights alone. Real fix requires schema migration to separate FTS5 tables. Documented in docs/articles/concepts/importance-vs-population.md.
#166 — variant alias runtime integration. Requires new QueryKind values + POI index.
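
One reading of the #189 problem, sketched with the textbook Okapi BM25 term score (k1 = 1.2, b = 0.75). It assumes column weights scale term frequency while document length counts tokens from every column, so a row with a long alt_names column is penalized whatever weights are chosen. The numbers are illustrative, and scores here are positive where higher is better.

```python
import math

K1, B = 1.2, 0.75


def bm25_term(tf, doc_len, avg_doc_len, n_docs, n_docs_with_term, weight=1.0):
    """Okapi BM25 score for one query term, with a column weight on tf."""
    idf = math.log((n_docs - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5))
    wtf = weight * tf
    # Length normalization sees the whole row, not just the matched column.
    norm = K1 * (1 - B + B * doc_len / avg_doc_len)
    return idf * wtf * (K1 + 1) / (wtf + norm)


def compare(weight):
    """Score a short row and a row padded with alt_names for one name hit."""
    short_row = bm25_term(1, doc_len=3, avg_doc_len=10, n_docs=1000,
                          n_docs_with_term=10, weight=weight)
    long_row = bm25_term(1, doc_len=200, avg_doc_len=10, n_docs=1000,
                         n_docs_with_term=10, weight=weight)
    return short_row, long_row


unweighted = compare(weight=1.0)
boosted = compare(weight=10.0)  # boosting the name column for both rows
```

In both cases the short row outscores the long one, which is consistent with the note that separate FTS5 tables, each with its own length statistics, are the real fix.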

v0.6.0 results

Training completed at step 100K. CE-only (CRF training disabled after two NaN attempts).

Demo presets: 11/11 pass (6 canonical + 5 Stage 3 variants).

Golden eval per-tag (v0.5.4 → v0.6.0):

Tag	v0.5.4	v0.6.0	Δ
postcode	75.7%	76.0%	+0.3
house_number	78.7%	79.0%	+0.3
region	65.0%	65.0%	flat
locality	39.4%	39.7%	+0.3
street	28.0%	27.9%	-0.1 (flat)
venue	29.4%	29.2%	-0.2 (flat)
po_box	0.0%	51.9%	+51.9
street_prefix	0.0%	0.0%	flat (corpus rebuild pending)
street_suffix	0.0%	0.0%	flat (corpus rebuild pending)
unit	0.0%	0.0%	flat (corpus rebuild pending)
intersection_a/b	0.0%	0.0%	flat (corpus rebuild pending)

v0.6.0 ships:

HF model repo sister-software/mailwoman-en-us updated
HF bucket en-us/v0.6.0/* populated (model + tokenizer + FST + wof-hot.db + model-card)
releases.json updated, defaultVersion: v0.6.0
neural-weights-en-us package bumped to 0.6.0

Deferred to v0.6.1:

CRF learned transitions (NaN root cause investigation — bf16 + 33×33 transition table)
Street decomposition recall (needs corpus rebuild with updated TIGER/NAD/BAN adapters)
Stage 3 intersection / unit tags (same corpus rebuild requirement)
