11 posts tagged with "Model training"

Articles about training runs, loss recipes, verdict smokes, and divergence diagnostics.

A lookup table scored 100%. We shipped the model anyway.

June 9, 2026 · 5 min read

Sister Software

This morning we published a post that ended with a tidy rule: some address tags don't want a neural network, they want a lookup table. Country names are a closed list in a known position. Our deterministic matcher scored a perfect 100 on the eval. The retrained model scored a mess. Case closed, we wrote.

By the afternoon we'd reopened the case, and the verdict flipped — hard enough that we've retracted the morning post rather than leave the wrong conclusion lying around for someone to cite. This is the story of how a perfect score nearly talked us out of the entire premise of the project.

Which way does a postcode point?

June 6, 2026 · 10 min read

Teffen Ellis

Sister Software

We left the last postcode story with a promise and a bill. The promise was that the "which country is this" signal has to come from the trained model reading the whole string, because the postcode on its own settles the question less than half the time. The bill was that this is the expensive version of the feature. This is the post where we paid it: we built the country signal into the model, watched it do something genuinely great, and then watched it refuse, in the most instructive way we've hit all month, to do that same great thing in a different word order.

The great thing first, because you've earned it. We took the postcode's gazetteer membership, that [us, de, fr] answer from last time, and instead of handing it to a regex we injected it into the model at the postcode token itself. A small additive nudge on the hidden state, right where the five digits sit, carrying "here is what this code could be." On German addresses written the way Germans actually write them, it was worth thirty-five points of locality accuracy. It beat Pelias. For one evening we were heroes.
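To make "a small additive nudge on the hidden state" concrete, here is a minimal sketch in plain Python. It is not the project's code: the hidden size, the per-country vectors, and the names `anchor_vector` and `inject_anchor` are all assumptions made for illustration. It only shows the shape of the idea, which is that the gazetteer's candidate set becomes a vector that is added at the postcode token's position and nowhere else.

```python
# Hypothetical sketch: sizes, vectors, and function names are illustrative
# assumptions, not Mailwoman's implementation.
HIDDEN = 4  # toy hidden size

# One vector per candidate country. In a real model these would be learned;
# here they are fixed toy values.
COUNTRY_VECTORS = {
    "us": [0.1, 0.0, 0.0, 0.0],
    "de": [0.0, 0.1, 0.0, 0.0],
    "fr": [0.0, 0.0, 0.1, 0.0],
}


def anchor_vector(candidates):
    """Sum the vectors of every country the postcode could belong to."""
    vec = [0.0] * HIDDEN
    for country in candidates:
        for i, value in enumerate(COUNTRY_VECTORS[country]):
            vec[i] += value
    return vec


def inject_anchor(hidden_states, postcode_index, candidates):
    """Return a copy of hidden_states with the anchor added at one position."""
    out = [row[:] for row in hidden_states]
    nudge = anchor_vector(candidates)
    out[postcode_index] = [h + n for h, n in zip(out[postcode_index], nudge)]
    return out


# Three tokens, postcode at index 1, gazetteer answer [us, de, fr].
hidden = [[1.0] * HIDDEN for _ in range(3)]
nudged = inject_anchor(hidden, 1, ["us", "de", "fr"])
```

Only the row at the postcode index changes. Every other token's hidden state passes through untouched, which is why the anchor's usefulness can end up tied to where the postcode sits in the string.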

Then we looked at the international numbers and the floor gave way. Same model, same anchor, the same German cities, but now written house-number-first with the postcode trailing the city, the way our test feed renders them, and it scored a hair above a coin flip. The hero anchor was, on those rows, slightly worse than no anchor at all.

Three questions sit under the rest of this, so let me put them on the table before we start:

When a parser "collapses" on a test, is the parser wrong, or is the test?
Can you train one model to read an address in any order, or does each order quietly cost you the other?
And the one that took three retrains to answer honestly: what does a learned anchor actually learn, the thing you asked for, or the shape of where you kept putting it?

The model that never saw an intersection

May 29, 2026 · 5 min read

Teffen Ellis

Sister Software

We spent a night trying to make our neural address parser less cocky. We ended it having learned something more useful. The model wasn't cocky — it was uninformed. It had never been shown whole categories of address.

This is the story of chasing the wrong number, and the diagnostics that pointed at the right one.

Zero byte-fallback: a multi-script tokenizer from WOF-earth

May 28, 2026 · 3 min read

Teffen Ellis

Sister Software

The v0.5.0-a1 tokenizer had a dirty secret: it was trained exclusively on US and French addresses. When it encountered Chinese, Japanese, Korean, Thai, or Arabic text, it fell back to encoding individual bytes, and for CJK scripts 50-75% of the resulting tokens were byte-fallback tokens. Every byte-fallback token is a lost opportunity for the model to learn meaningful subword patterns.
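A byte-fallback rate like that can be measured by counting how many pieces in a tokenized string are raw-byte pieces. The sketch below assumes byte pieces are rendered as `<0xNN>`, which is how SentencePiece-style tokenizers print them. The excerpt does not say which tokenizer library is in use, so treat the pattern, the function name, and the sample pieces as assumptions.

```python
import re

# Assumption: byte-fallback pieces look like "<0xE6>" (SentencePiece-style).
BYTE_PIECE = re.compile(r"^<0x[0-9A-Fa-f]{2}>$")


def byte_fallback_rate(pieces):
    """Fraction of pieces that are raw-byte fallbacks (0.0 for empty input)."""
    if not pieces:
        return 0.0
    fallback = sum(1 for piece in pieces if BYTE_PIECE.match(piece))
    return fallback / len(pieces)


# Hypothetical tokenization: one character split into three byte pieces,
# one character covered by the vocabulary.
sample = ["▁", "<0xE6>", "<0x9D>", "<0xB1>", "京"]
rate = byte_fallback_rate(sample)
```

Here three of the five pieces are byte pieces, so `rate` is 0.6, inside the 50-75% band described above.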

Today we fixed that.

Why Japanese addresses break Western parsers

May 28, 2026 · 5 min read

Teffen Ellis

Sister Software

In Tokyo, the address of Tokyo Tower is 〒105-0011 東京都港区芝公園4-2-8.

In English: "4-2-8 Shibakōen, Minato City, Tokyo 105-0011".

The Japanese form runs in the opposite order from the English one, largest unit first. The prefecture (都道府県) comes first, then the city or ward (市区町村), then a numbered district (丁目) and a block-number-style locator. There's no street name, just a grid.

This is why every rule-based address parser written for Western addresses breaks on Japan.

PO Box Boîte Postale Apartado: Stage 3 ships with 6 new tags

May 28, 2026 · 6 min read

Teffen Ellis

Sister Software

For its first six versions, Mailwoman emitted ten BIO tags. The model could pick street out of a row but not street_prefix, street_suffix, unit, or po_box. Real addresses are messier than that. The golden eval set has known failures: 6220 SE Salmon St, Portland, OR 97215 (Stage 2 collapses prefix+name+suffix); 123 Main St Apt 4B, Springfield, IL 62701 (loses the apartment); and PO Box 123, Burlington, VT 05401 (treats it as a malformed street).

v0.6.0 adds six tags: street_prefix, street_suffix, unit, po_box, intersection_a, intersection_b. The model is the same h384/6L/6H transformer. The recipe is the same v0.5.1 settings. The tokenizer is the same v0.6.0-a0 multi-script bundle. The only structural change is the output head: 21 BIO labels → 33.
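The jump from 21 to 33 labels is what a standard BIO scheme gives you: each tag contributes a B- and an I- label, plus a single shared O. The sketch below checks that arithmetic. The six new tag names come from the text above. The ten original tags are not named in this excerpt, so they appear as placeholders, and the `bio_labels` helper is hypothetical.

```python
def bio_labels(tags):
    """Build a BIO label list: one shared "O", then B-/I- for each tag."""
    labels = ["O"]
    for tag in tags:
        labels.append(f"B-{tag}")
        labels.append(f"I-{tag}")
    return labels


# Placeholders: the excerpt does not list the original ten tags.
ORIGINAL_TAGS = [f"tag_{i}" for i in range(10)]

# The six tags added in v0.6.0, as named in the post.
NEW_TAGS = [
    "street_prefix",
    "street_suffix",
    "unit",
    "po_box",
    "intersection_a",
    "intersection_b",
]

before = bio_labels(ORIGINAL_TAGS)            # 2 * 10 + 1 labels
after = bio_labels(ORIGINAL_TAGS + NEW_TAGS)  # 2 * 16 + 1 labels
```

Under that assumption, `len(before)` is 21 and `len(after)` is 33, matching the output-head change.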

Night Shift 2 — from thermal hangs to a shipped model in one session

May 25, 2026 · 7 min read

Teffen Ellis

Sister Software

The second night shift ran from roughly 2am to 2pm UTC on May 25th, 2026. It started with a GPU that wouldn't stop crashing and ended with a trained model, an ONNX export, and a full evaluation report. This is the story of how infrastructure choices turned a hardware problem into a non-issue.

Five tries, same failure — narrowing v0.5.0's training problem by elimination

May 24, 2026 · 9 min read

Teffen Ellis

Sister Software

This is a follow-up to yesterday's post about the v0.5.0 C-train failures. Yesterday we ran four attempts and ruled out three suspects. Today we ran a fifth and ruled out a fourth. We're now down to one remaining hypothesis — and the way we got here is a kind of debugging that translates pretty cleanly from software engineering, so this post is pitched at engineers who haven't run a training campaign before.

If you've ever bisected a regression in a piece of software (used git bisect, narrowed a test failure by reverting changes one at a time, taken a known-good build and a known-broken build and asked which of the changes between them caused the breakage), then you already understand the core move. The rest is vocabulary.
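For readers who have not done that, the core move can be sketched as a binary search over an ordered list of changes. This is a generic illustration, not the procedure used in the training campaign. The change names and the `is_broken` check are hypothetical, and the sketch assumes that once the breaking change is applied, every later build is also broken.

```python
def first_bad_change(changes, is_broken):
    """Find the change that introduces the breakage.

    changes: ordered list of changes between a good and a broken build.
    is_broken(k): True if the build with the first k changes applied fails.
    Assumes is_broken(0) is False, is_broken(len(changes)) is True, and the
    result never flips back to False once it has become True.
    """
    good, bad = 0, len(changes)  # good: known-good count, bad: known-broken
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_broken(mid):
            bad = mid
        else:
            good = mid
    return changes[bad - 1]


# Hypothetical example: the third change is the culprit.
changes = ["new tokenizer", "expanded corpus", "new architecture", "new loss weights"]
culprit = first_bad_change(changes, lambda k: k >= 3)
```

With four changes this takes two checks instead of four. The saving matters more when each check is a training run rather than a unit test.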

Two voices arguing inside a model — a beginner-friendly debugging story

May 24, 2026 · 11 min read

Teffen Ellis

Sister Software

If you found this via search

Mailwoman is an open-source address parser that runs in Node and the browser. It uses a small neural model to label address components ("350" = house number, "NY" = region, etc.). Try the live demo.

This post is a beginner-friendly debugging story — no ML background needed. If you just want the project status, see what ships today.

This is the third post in a series about a training problem we've been chasing. The first two were written for software engineers. This one is for someone who is just starting to learn about AI and machine learning — no jargon assumed, no math beyond high-school algebra. The point is to show you what real ML debugging looks like, using a problem we actually had this week.

If you've been programming for a while but ML feels opaque, this post is for you. The core technique we used — figuring out which of two instructions our model was listening to — turns out to be much more like ordinary debugging than the field usually makes it sound.

Four training runs, zero shipped weights — bisecting v0.5.0's divergence

May 24, 2026 · 11 min read

Teffen Ellis

Sister Software

If you found this via search

Mailwoman is an open-source address parser. This post is a training log entry from May 2026 documenting the v0.5.0 divergence investigation. For current project status, see what ships today.

v0.5.0 was the fresh-slate ship: new tokenizer, expanded corpus, new architecture, new reconcile stage. The plan was to bundle several months of structural improvements into one big iteration and pay the cost once. Most of it landed clean. The classifier didn't.

This post walks through the four training attempts the v0.5.0 C-train made overnight, the bisect that ruled out three plausible explanations, and what we think the remaining culprit is. It's a sister piece to the v0.4.0 retrospective — same shape of failure, different diagnostic ladder.