Skip to main content

3 posts tagged with "Infrastructure"

Articles about the training-data pipeline, model training, tokenization, and other non-model components.

View All Tags

Zero byte-fallback: a multi-script tokenizer from WOF-earth

· 3 min read
Teffen Ellis
Sister Software

The v0.5.0-a1 tokenizer had a dirty secret: it was trained exclusively on US and French addresses. When it encountered Chinese, Japanese, Korean, Thai, or Arabic text, it fell back to encoding individual bytes — 50-75% of tokens for CJK scripts. Every byte-fallback token is a lost opportunity for the model to learn meaningful subword patterns.

Today we fixed that.