Skip to main content

v0.5.0 โ€” as shipped

A single landing page for anyone catching up on the v0.5.0 fresh-slate work.

The v0.5.0 plan (PHASE_8_v0_5_0_fresh_slate.md) split the work into six threads (A through F). All six landed in main by 2026-05-24. This article catalogues what is now in the tree, how the pieces fit together, and what is still cooking on the training side.

Why v0.5.0 existsโ€‹

The v0.4.0 ablation campaign retrospective ended with three structural ceilings that small recipe tweaks could not clear:

  1. The v0.1.0 tokenizer's byte-fallback rate on non-Latin scripts (the cause of 92% of country FN in the golden eval).
  2. The classifier doing boundary discovery and type classification at the same time via BIO labels (the cause of the bio_slip slice on ", 22220" for 22220).
  3. Stage 5 reconcile only sorting spans โ€” it could not catch joint-incoherent parses like NY-NY Steakhouse, Houston, TX.

Patching each ceiling individually meant three medium iterations. v0.5.0 bundles them into one fresh-slate ship โ€” pay the cost once, clear all three at once.

The "fresh-slate" framing is described in the operator's sharpen-the-axe note โ€” it is the right shape when several improvements share a common cost (a retrain, an interface change) and bundling avoids paying that cost three times.

The six threadsโ€‹

ThreadNameWhat it addsPR
A0Tokenizer harness + A0 weightsSentencePiece training pipeline; A0 trained on corpus-v0.3.0 as a byte-fallback baseline. A1 will retrain on corpus-v0.4.0 once B + B2 are both in.#129
BKryptonite catalogue4,771 adversarial address rows generated via DeepSeek โ€” NY-NY Steakhouse, Paris, Texas, Saint Petersburg, FL, etc. Each annotated with the correct parse.#130
B2Transliteration pairs~73K US/FR addresses transliterated into CJK / Cyrillic / Hangul / Han / Armenian script via DeepSeek. Validated by substring-match aligner (~1.1% reject rate). Included in corpus-v0.4.0.#119
C-sClassifier code pathTop-k inference (returns ranked sequences, not just argmax) + phrase-prior input features that condition on Stage 2.7's proposed spans. No training run in this PR โ€” that lands once A1 + corpus-v0.4.0 are ready.#128
D-sStage 5 reconcilereconcile.ts joint decoder โ€” beam search over (span ร— tag ร— resolver candidate) with concordance scoring via WOF parent-id chains. Catches the kryptonite cases at runtime.#131
EPhrase grouperNew @mailwoman/phrase-grouper workspace โ€” Stage 2.7 proposes coherent input spans (street, postcode, locality, โ€ฆ) before Stage 3 runs. Rule-based v1 only; learned span proposer scoped for v0.5.1 (PHASE_8_E).#126
FVerdict-smoke disciplineNew VERDICT_SMOKES.md + --smoke-mode {constant,long-tail} flag, so the cosine-LR mask that hid v0.4.0's divergence cannot reoccur.#125

How the threads composeโ€‹

The threads were designed to be parallel where possible and stacked where unavoidable. The actual data dependencies look like this:

Solid boxes are merged. The A1 tokenizer was retrained on corpus-v0.4.0 and halved byte-fallback (36.7% โ†’ 18.2%). The C-train (full classifier training run) is the remaining item โ€” CE-only training in progress as of 2026-05-25.

The single biggest schedule reframe was A0 โ€” the plan originally said "A is blocked by B" because the tokenizer trains on the same corpus as the classifier. We split that into:

  • A0 trains on corpus-v0.3.0 immediately to validate the harness and establish a byte-fallback baseline.
  • A1 retrains on corpus-v0.4.0 (= v0.3.0 + B + B2) when the corpus is complete.

That way A0 could ship in parallel with B + E + the scaffolds, instead of waiting for B + B2 to finish.

The other parallel-friendly move was C-s and D-s as scaffolds. The plan had C and D as serial training/integration work that needed real classifier output to validate. We pulled the code portion to t0 (forward-pass tests against stub data, mocked top-k for Stage 5) and deferred only the actual training run. That left the only mandatory serial portion as B โ†’ A1 โ†’ C-train.

What is still cookingโ€‹

ItemStateETA
CE-only full 50K C-trainIn progress. Smoke passed: val_macro_f1=0.444, no divergence past step 2000. Full 50K run targeting first stable v0.5.0 weights.~6-8h from launch.
C-train full classifier runBlocked on A1 + corpus-v0.4.0. Estimated single H100-day on rented compute per the plan.After A1 lands.

Three smaller follow-ups carried over from postmortems:

  • Wire reconcileSpans as the default Stage 5 in runPipeline (currently behind an explicit call). Lands once the classifier emits real top-k. Tracked in D-s's PR body.
  • Learned phrase-grouper variant (1-2M-param span proposer). Scope written in PHASE_8_E_learned_span_proposer.md. v0.5.1 stretch.
  • Reservoir-sampling parallelisation in the tokenizer harness (~5ร— speedup possible per A0's session notes). Out-of-scope until A1 turnaround time becomes a real cost.

If you are catching up on the architecture, read in this order:

  1. The knowledge ladder โ€” the conceptual framing for why the two new layers (Stage 2.7 phrase grouper, expanded Stage 5 reconcile) exist.
  2. The staged pipeline โ€” the runtime composition end-to-end.
  3. STAGES.md โ€” formal per-stage type contracts.
  4. VERDICT_SMOKES.md โ€” the process discipline that kept v0.5.0 from repeating v0.4.0's cosine-LR mask bug.

If you want the line-by-line technical history, the v0.4.0 retrospective (retrospectives/v0-4-0-ablation-campaign.md) is what motivated v0.5.0; this article is the bookend.