3 posts tagged with "Infrastructure"

Articles about the training-data pipeline, model training, tokenization, and other non-model components.

Zero byte-fallback: a multi-script tokenizer from WOF-earth

May 28, 2026 · 3 min read

Sister Software

The v0.5.0-a1 tokenizer had a dirty secret: it was trained exclusively on US and French addresses. When it encountered Chinese, Japanese, Korean, Thai, or Arabic text, it fell back to encoding individual bytes — 50-75% of tokens for CJK scripts. Every byte-fallback token is a lost opportunity for the model to learn meaningful subword patterns.

Today we fixed that.

Our model worked in CI but broke on every real device

May 27, 2026 · 4 min read

Teffen Ellis

Sister Software

We shipped a browser-based address parser that runs a 29 MB ONNX model entirely client-side. The Playwright tests showed perfect results. Chrome desktop looked great. Then someone opened it on an iPhone.

Night Shift 2 — from thermal hangs to a shipped model in one session

May 25, 2026 · 7 min read

Teffen Ellis

Sister Software

The second night shift ran from roughly 2am to 2pm UTC on May 25th, 2026. It started with a GPU that wouldn't stop crashing and ended with a trained model, an ONNX export, and a full evaluation report. This is the story of how infrastructure choices turned a hardware problem into a non-issue.