Night Shift 2: 2026-05-28

Second night shift, focused on v0.6.0: Stage 3 label activation, PO box corpus synthesis, and the multi-locale variant alias foundation.

Shipped

PO box synthesis pipeline

Locale-aware PO box generator covering en-US/CA/GB/AU, fr-FR/CA, es-ES/MX/AR.

  • corpus/src/synthesize-po-box.ts: per-locale leader templates, number-format noise (10%), PMB variant logic
  • corpus/src/adapters/synth-po-box/adapter.ts: JSONL tuples → CanonicalRows
  • scripts/extract-tuples.py: pulls (locality, region, postcode, country) from WOF SQLite via a locality→county→region join plus state-prefix-to-ZIP synthesis
  • scripts/build-po-box-shard.mjs: runs synthesizer + alignRow, writes LabeledRow JSONL
  • scripts/jsonl-to-parquet.py: converts to the v0.4.0 parquet schema
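As a rough illustration of the per-locale leader templates, the sketch below picks a leader spelling for a locale and attaches a box number. The spellings come from the golden eval variants listed in this log; the `LEADERS` table, its grouping by locale, and the `po_box_phrase` helper are hypothetical and do not mirror the actual contents of synthesize-po-box.ts.

```python
import random

# Leader spellings as listed in the golden eval section of this log.
# The per-locale grouping below is an assumption made for illustration.
LEADERS = {
    "en-US": ["PO Box", "P.O. Box", "P. O. Box", "POB", "POBOX",
              "Post Office Box", "Box", "P.O.Box"],
    "fr-FR": ["BP", "B.P.", "Boîte Postale"],
    "fr-CA": ["Case Postale", "CP"],
}


def po_box_phrase(locale: str, number: str, rng: random.Random) -> str:
    """Return a PO box phrase such as 'P.O. Box 9' for the given locale."""
    leader = rng.choice(LEADERS[locale])
    return f"{leader} {number}"


rng = random.Random(42)  # seeded so a corpus build would be reproducible
print(po_box_phrase("en-US", "9", rng))
print(po_box_phrase("fr-FR", "120", rng))
```

Passing the random generator in explicitly keeps the synthesizer deterministic for a given seed, which makes shard rebuilds comparable.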

Produced: 50K labeled rows / 3 MB parquet shard. Sample:

P.O. Box 9, Bancroft, ID 83603
tokens: ['P', 'O', 'Box', '9', 'Bancroft', 'ID', '83603']
labels: ['B-po_box', 'I-po_box', 'I-po_box', 'I-po_box', 'B-locality', 'B-region', 'B-postcode']
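The sample above can be reproduced with a small BIO labeler. This is a minimal sketch, not the project's alignRow: the `tokenize` rule (split on non-word characters) and the `bio_label` helper are assumptions chosen so that "P.O." splits into "P" and "O" as in the sample.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split on punctuation and whitespace, so 'P.O. Box 9' -> P, O, Box, 9."""
    return re.findall(r"\w+", text)


def bio_label(components: list[tuple[str, str]]) -> tuple[list[str], list[str]]:
    """Label each component's first token B-<tag> and the rest I-<tag>."""
    tokens, labels = [], []
    for tag, text in components:
        for i, tok in enumerate(tokenize(text)):
            tokens.append(tok)
            labels.append(("B-" if i == 0 else "I-") + tag)
    return tokens, labels


tokens, labels = bio_label([
    ("po_box", "P.O. Box 9"),   # whole phrase is one po_box span
    ("locality", "Bancroft"),
    ("region", "ID"),
    ("postcode", "83603"),
])
print(tokens)
print(labels)
```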

Design decisions

Consulted DeepSeek (3 turns) + USPS Pub 28 §28C2.040 + DMM 508 §4.1.4/§4.5.4:

  1. PMB shares po_box tag. USPS treats PMB as a PO Box alias in CASS. Downstream code can distinguish via presence of a street line.
  2. Whole-phrase span ("PO Box 123" not "123"). Matches existing golden eval convention.
  3. Strategy A (replace street, not augment): PO box and street are mutually exclusive per USPS. Synthesizing fake (street + PMB) tuples would teach the model an invalid pattern.
  4. 10% number-format noise: commas, dashes, embedded spaces. Real OCR/transcription input is noisy, so the model trains on that noise from the start.
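Decision 4 can be sketched as a small perturbation function. The three separators and the 10% rate come from the list above; the function name and the choice to insert the separator at a random interior position are assumptions, and the real synthesizer may place separators differently.

```python
import random

SEPARATORS = [",", "-", " "]  # commas, dashes, embedded spaces


def add_number_noise(number: str, rng: random.Random, rate: float = 0.10) -> str:
    """With probability `rate`, insert one separator inside the box number."""
    # Single-character numbers have no interior position to perturb.
    if len(number) < 2 or rng.random() >= rate:
        return number
    pos = rng.randrange(1, len(number))  # interior split point
    return number[:pos] + rng.choice(SEPARATORS) + number[pos:]


rng = random.Random(0)
samples = [add_number_noise("12345", rng) for _ in range(10)]
print(samples)
```

Roughly one in ten multi-digit numbers comes back perturbed; stripping the separators always recovers the original digits, so the label span is unaffected.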

Stage 3 activation

# Before
ACTIVE_TAGS = STAGE2_TAGS # 10 tags, 21 BIO labels

# After
ACTIVE_TAGS = STAGE3_TAGS # 16 tags, 33 BIO labels

STAGE3 appends to STAGE2 without reordering, so label IDs are preserved. Existing v0.4.0 shards work unchanged. Models trained on STAGE2 IDs decode correctly against STAGE3; the new logit slots are simply never selected by argmax.
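The ID-preservation argument can be checked with a toy label table. The sketch assumes the label list is laid out as "O" followed by a B-/I- pair per tag, which is consistent with the 21 and 33 label counts above (2n + 1); the tag lists here are shortened stand-ins, not the real STAGE2_TAGS or STAGE3_TAGS.

```python
def bio_labels(tags: list[str]) -> list[str]:
    """Build the label list: 'O', then B-<tag>, I-<tag> for each tag in order."""
    labels = ["O"]
    for tag in tags:
        labels += [f"B-{tag}", f"I-{tag}"]
    return labels


# Toy stand-ins for the real tag lists (assumption: Stage 3 only appends).
TOY_STAGE2 = ["locality", "region", "postcode"]
TOY_STAGE3 = TOY_STAGE2 + ["po_box", "unit"]

stage2_labels = bio_labels(TOY_STAGE2)
stage3_labels = bio_labels(TOY_STAGE3)

# Every Stage 2 label keeps its integer ID under the Stage 3 table.
ids_preserved = stage3_labels[: len(stage2_labels)] == stage2_labels
print(ids_preserved)

# Label counts for the real tag counts quoted above.
print(2 * 10 + 1, 2 * 16 + 1)
```

Inserting a new tag anywhere but the end would shift every later ID and invalidate existing shards, which is why appending matters.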

Golden eval expansion

PO box coverage in data/eval/golden/v0.1.2/:

  • Before: 1 entry (just "PO Box 123, Burlington, VT 05401")
  • After: 27 entries
    • 20 US variants: PO Box, P.O. Box, P. O. Box, POB, POBOX, Post Office Box, Box, P.O.Box; PMB at CMRA; single-digit through 7-digit
    • 6 FR/CA variants: BP, B.P., Boîte Postale, Case Postale, CP

Per-tag error analysis

scripts/eval-error-analysis.ts now emits a per-tag recall table. v0.5.4 baseline:

  • Stage 2 tags (locality, region, postcode, street, house_number, ...): high recall as expected
  • Stage 3 tags (street_prefix, street_suffix, unit, po_box, intersection_a/b, attention, cedex): 0% recall (the model doesn't emit them yet). This is the baseline v0.6.0 will be compared against.
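A per-tag recall table of this kind can be computed as follows. This is a minimal sketch that scores at the token level with exact label match; that scoring rule and the `per_tag_recall` name are assumptions, and scripts/eval-error-analysis.ts may well score whole spans instead.

```python
from collections import Counter


def per_tag_recall(gold_rows: list[list[str]], pred_rows: list[list[str]]) -> dict[str, float]:
    """Token-level recall per tag: correct gold tokens / all gold tokens."""
    total, hit = Counter(), Counter()
    for gold, pred in zip(gold_rows, pred_rows):
        for g, p in zip(gold, pred):
            if g == "O":
                continue
            tag = g.split("-", 1)[1]  # 'B-po_box' -> 'po_box'
            total[tag] += 1
            hit[tag] += int(g == p)
    return {tag: hit[tag] / total[tag] for tag in total}


gold = [["B-po_box", "I-po_box", "I-po_box", "I-po_box",
         "B-locality", "B-region", "B-postcode"]]
# A model that has never seen po_box labels emits 'O' for those tokens.
pred = [["O", "O", "O", "O", "B-locality", "B-region", "B-postcode"]]

recall = per_tag_recall(gold, pred)
print(recall)
```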

NaN recovery

v0.6.0 training diverged with NaN loss at step 950 (right at the end of warmup, at peak LR). Suspected root cause, per DeepSeek: the CRF transition gradient at peak LR, amplified by the new 33-label transition table.

Two-knob fix:

  • learning_rate: 1.5e-4 → 1.0e-4
  • crf_loss_weight: 0.5 → 0.1

Skipped: a longer warmup (only delays the blowup) and tighter gradient clipping (masks the problem rather than fixing it). After shipping, raise the CRF weight incrementally.
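To see how much the two knobs reduce pressure on the CRF parameters together, here is a back-of-the-envelope sketch. It assumes the total loss is the cross-entropy plus `crf_loss_weight` times the CRF negative log-likelihood, and that updates scale like plain SGD (learning rate times gradient); with an adaptive optimizer the real effect will differ. The function names are illustrative.

```python
def combined_loss(ce_loss: float, crf_nll: float, crf_loss_weight: float) -> float:
    """Assumed loss form: CE plus a weighted CRF negative log-likelihood."""
    return ce_loss + crf_loss_weight * crf_nll


def crf_step_scale(learning_rate: float, crf_loss_weight: float) -> float:
    """Relative size of an SGD-style update driven by the CRF term."""
    return learning_rate * crf_loss_weight


before = crf_step_scale(1.5e-4, 0.5)   # original settings
after = crf_step_scale(1.0e-4, 0.1)    # two-knob fix
reduction = before / after

print(f"CRF update scale shrinks by about {reduction:.1f}x")
```

Under these assumptions the learning-rate change alone is a 1.5x reduction and the weight change alone is 5x, so most of the relief comes from the CRF weight.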

Variant alias table (#166 follow-up)

Foundation shipped earlier in the day (@mailwoman/variant-aliases). 37 entries covering amenity + brand variants in en-AU/GB/CA, fr-FR/CA, ja-JP. Locale-gated lookup with confidence scoring. The runtime integration into the kind classifier remains v0.6.0+ work.
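A locale-gated lookup with confidence scoring might look like the sketch below. Everything in it is hypothetical: the `Alias` record, the three example entries, their confidence values, and the `lookup` signature are illustrations and are not taken from @mailwoman/variant-aliases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alias:
    variant: str           # surface form, stored casefolded
    canonical: str         # normalized amenity or brand
    locales: frozenset     # locales where the variant is valid
    confidence: float      # 0..1, how safe the rewrite is


# Hypothetical entries for illustration only.
ALIASES = [
    Alias("servo", "gas station", frozenset({"en-AU"}), 0.9),
    Alias("chemist", "pharmacy", frozenset({"en-AU", "en-GB"}), 0.8),
    Alias("dépanneur", "convenience store", frozenset({"fr-CA"}), 0.9),
]


def lookup(term: str, locale: str, min_confidence: float = 0.5) -> Optional[Alias]:
    """Return the best alias valid in `locale`, or None if nothing qualifies."""
    key = term.casefold()
    matches = [
        a for a in ALIASES
        if a.variant == key and locale in a.locales and a.confidence >= min_confidence
    ]
    return max(matches, key=lambda a: a.confidence, default=None)


print(lookup("Servo", "en-AU"))
print(lookup("servo", "en-GB"))  # gated out: not valid in this locale
```

Gating on locale first keeps a term that is an alias in one country from rewriting an ordinary word or name in another.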

v0.6.0 ship pipeline

scripts/ship-v0.6.0.sh runs the release end to end in stages: export ONNX from Modal → quantize int8 → link as dev → 9 demo presets (6 canonical + 3 PO box) → error analysis → upload to HF.

Open issues

  • #189: alt_names FTS5 split. SQLite BM25's doc-length normalization can't be fixed with column weights alone. Real fix requires schema migration to separate FTS5 tables. Documented in docs/articles/concepts/importance-vs-population.md.
  • #166: variant alias runtime integration. Requires new QueryKind values + POI index.
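The #189 claim about doc-length normalization can be illustrated with the BM25 per-term formula. This sketch follows the formula in the SQLite FTS5 documentation, where k1 = 1.2 and b = 0.75 are hard-coded and column weights scale the phrase frequency; treating document length as the token count of the whole row is my reading of that documentation, and the token counts below are made up.

```python
K1, B = 1.2, 0.75  # hard-coded in FTS5's built-in bm25(), per its documentation


def bm25_term(idf: float, weighted_freq: float, doc_len: int, avg_doc_len: float) -> float:
    """Per-term BM25 score; the column weight is folded into weighted_freq."""
    length_norm = 1 - B + B * doc_len / avg_doc_len
    return idf * weighted_freq * (K1 + 1) / (weighted_freq + K1 * length_norm)


NAME_WEIGHT = 10.0  # hypothetical column weight on the primary name column
AVG_LEN = 20.0      # hypothetical average row length in tokens

# Same single hit in the name column; only the amount of alt_names text differs.
few_alt_names = bm25_term(1.0, 1 * NAME_WEIGHT, doc_len=10, avg_doc_len=AVG_LEN)
many_alt_names = bm25_term(1.0, 1 * NAME_WEIGHT, doc_len=200, avg_doc_len=AVG_LEN)

print(few_alt_names, many_alt_names)
```

The row with many alt_names scores lower even with a heavy weight on the name column, because the weight only enters the frequency term while the length penalty sees the whole row. Splitting alt_names into its own FTS5 table removes that text from the length term.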

v0.6.0 results

Training completed at step 100K. CE-only (CRF training disabled after two NaN attempts).

Demo presets: 11/11 pass (6 canonical + 5 Stage 3 variants).

Golden eval per-tag (v0.5.4 → v0.6.0):

| Tag | v0.5.4 | v0.6.0 | Δ |
| --- | --- | --- | --- |
| postcode | 75.7% | 76.0% | +0.3 |
| house_number | 78.7% | 79.0% | +0.3 |
| region | 65.0% | 65.0% | flat |
| locality | 39.4% | 39.7% | +0.3 |
| street | 28.0% | 27.9% | flat |
| venue | 29.4% | 29.2% | flat |
| po_box | 0.0% | 51.9% | +51.9 |
| street_prefix | 0.0% | 0.0% | flat (corpus rebuild pending) |
| street_suffix | 0.0% | 0.0% | flat (corpus rebuild pending) |
| unit | 0.0% | 0.0% | flat (corpus rebuild pending) |
| intersection_a/b | 0.0% | 0.0% | flat (corpus rebuild pending) |

v0.6.0 ships:

  • HF model repo sister-software/mailwoman-en-us updated
  • HF bucket en-us/v0.6.0/* populated (model + tokenizer + FST + wof-hot.db + model-card)
  • releases.json updated, defaultVersion: v0.6.0
  • neural-weights-en-us package bumped to 0.6.0

Deferred to v0.6.1:

  • CRF learned transitions (NaN root cause investigation: bf16 + 33×33 transition table)
  • Street decomposition recall (needs corpus rebuild with updated TIGER/NAD/BAN adapters)
  • Stage 3 intersection / unit tags (same corpus rebuild requirement)