Skip to main content

Tokenizer A0 β€” baseline

The v0.5.0 plan (PHASE_8 Β§A) splits the tokenizer retrain into two steps:

  • A0 trains on corpus-v0.3.0 (the existing en-US + fr-FR corpus, no transliterations) using the new SentencePiece harness. Its purpose is to validate the harness and measure a byte-fallback baseline before the new synthetic data lands.
  • A1 retrains on corpus-v0.4.0 (= v0.3.0 + Thread B kryptonite + Thread B2 transliteration). Its purpose is to actually hit the plan's < 5% byte-fallback target on non-Latin scripts.

This article records the A0 numbers so the A1 delta is interpretable.

The headline numbers​

A0 was trained 2026-05-23 on a single host (no rented compute), 1.04 hours wall-clock. SentencePiece configuration:

KnobA0v0.1.0 (predecessor)
model_typeunigramunigram
vocab_size48,00016,000
character_coverage0.99990.9995
byte_fallbacktruetrue
user_defined_symbols2,1100
Training corpuscorpus-v0.3.0, US + FR @ 500K each + 2K mined postcode literals(historical)

Eval fixture: 43 hand-curated lines covering CJK / Cyrillic / Armenian / Greek / Arabic / Hebrew / Devanagari / Thai / Latin-with-diacritics. The fixture lives at data/eval/multi-script/v0.5.0-a0.jsonl.

ScriptA0 byte-fallbackv0.1.0 baselineΞ”
overall36.7%18.0%+18.7 pt
cjk80.0%51.6%+28.4 pt
cyrillic2.2%0.0%+2.2 pt
armenian21.3%0.0%+21.3 pt
greek16.3%0.0%+16.3 pt
arabic5.0%0.0%+5.0 pt
hebrew10.0%0.0%+10.0 pt
devanagari12.8%0.0%+12.8 pt
thai46.7%10.0%+36.7 pt
latin5.8%0.0%+5.8 pt
other34.3%0.0%+34.3 pt

A0 is worse than v0.1.0 on every script. This is not a bug. It is the expected outcome and it is the point of running A0.

Why A0 looks worse β€” the WOF admin-name leakage hypothesis​

The v0.1.0 tokenizer was trained on a corpus that included Who's On First admin-name variants. WOF ships place names in many scripts β€” Tokyo appears as Tokyo and 東京 and Tōkyō and in Cyrillic transliteration, all in the same admin record. Mailwoman v0.1.0's training pipeline pulled these admin records to populate Stage 6 lookups, and the tokenizer training scanned the same data.

The result: v0.1.0's byte-fallback rate on Cyrillic / Armenian / Greek / Arabic / Hebrew / Devanagari was effectively zero, not because the corpus had real coverage of those scripts but because the place-name admin variants accidentally taught the tokenizer enough sub-pieces to avoid byte-fallback on a small evaluation slice.

When we moved to corpus-v0.3.0 (US DOT NAD + WOF postalcode + cleaner US/FR address rows), those incidental admin-name variants stopped being in the tokenizer training set. A0 β€” trained on the same en-US + fr-FR mass without the WOF admin leakage β€” has effectively no non-Latin coverage. Hence the regressions you see in the table.

The v0.1.0 numbers were never an honest measurement of multi-script capability; they were a leakage artefact. A0 is the first honest measurement of "what does our tokenizer do on adversarial transliteration input when we have not trained it on any". The answer is: badly. As expected.

What A1 will measure​

A1 retrains on corpus-v0.4.0 = corpus-v0.3.0 + Thread B (4,771 kryptonite rows) + Thread B2 (~75K transliteration pairs into Cyrillic / Japanese / Hangul / Han / Armenian). The transliteration pairs are the part that matters for byte-fallback β€” they put real non-Latin character sequences into the tokenizer's training set.

The headline metric is the A1 vs A0 delta on each script:

  • Target: A1 byte-fallback < 5% overall on the same fixture, with per-script rates approaching the v0.1.0 numbers (the leakage numbers) as a floor.
  • Stretch: per-script rates well below v0.1.0 on the scripts B2 explicitly covers (cjk, cyrillic, hangul, han, armenian).
  • Anti-goal: regression on Latin. A1 should not lose ground on the en-US + fr-FR mass.

A1 retrain is a single re-invocation of the A0 harness against corpus-v0.4.0 β€” deterministic with --seed 42, takes the same ~1 hour. The output writes to /data/models/tokenizer/v0.5.0-a1/ and the model card carries the same shape (a JSON file with sentencepiece flags, training-line count, training duration, byte-fallback per-script, and a SHA-256 of the .model file).

What A0 is good for, even though it lost​

Three things:

  1. The harness is validated. A0 trained end-to-end on real data. Reservoir sampling, the user-defined-symbols normalisation step (see SP UDS whitespace gotcha β€” or, if you are reading on the published docs site, the version in the mailwoman repo at corpus-python/src/mailwoman_train/tokenizer_train.py), the model card writer, the byte-fallback eval β€” every piece worked. A1 has zero new code to write.
  2. The eval fixture is validated. 43 hand-curated multi-script lines is small but the per-script breakdown is informative enough to tell us where a retrain helped vs hurt. Future tokenizer iterations will use the same fixture and the same harness, so A0 / A1 / A2 numbers will be directly comparable.
  3. The user-defined-symbols list is validated. A0 has 2,110 UDS β€” US state codes, country abbreviations, postcode literal patterns (5-digit ZIP, 5-4 ZIP+4, UK alphanumeric blocks, JP 100-0005). The A0 harness substitutes ASCII space β†’ U+2581 internally so SP's normalisation does not silently disable the UDS (the SP UDS whitespace gotcha). That substitution is now battle-tested.

Model card schema​

A0's model_card.json is the contract for every future tokenizer. It carries:

{
"tokenizer_version": "v0.5.0-a0",
"corpus_version": "v0.3.0",
"vocab_size": 48000,
"training_lines": 1000000,
"training_duration_seconds": 6294.237,
"generated_at": "2026-05-23T19:53:06Z",
"git_commit": "378a55d…",
"model_sha256": "a864fde…",
"sentencepiece_flags": { … },
"sampling": {
"countries": ["US", "FR"],
"per_country": 500000,
"seed": 42
},
"byte_fallback_eval": {
"overall": { "lines": 43, "pieces": 1075, "byte_fallback_pieces": 395, "rate": 0.367 },
"per_script": { "cjk": …, "cyrillic": …, … }
}
}

META.json is a thin shim with the same content, kept for compatibility with @mailwoman/neural-weights-* package loaders that already read it.

See also​