Why Japanese addresses break Western parsers

· 5 min read
Teffen Ellis
Sister Software

Written in Japanese, the address of Tokyo Tower is 〒105-0011 東京都港区芝公園4-2-8.

In English: "4-2-8 Shibakōen, Minato City, Tokyo 105-0011".

The Japanese form runs in the opposite order from the English form: largest unit first, smallest last. The prefecture (都道府県) comes first, then the city or ward (市区町村), then a district and its numbered subdivision (丁目), followed by a block-number-style locator. There's no street name, only numbered areas and blocks.
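As a rough illustration of that big-to-small ordering, here is a minimal Python sketch that splits the Tokyo Tower address with a single regular expression. The pattern and the function name `split_japanese_address` are assumptions made for this example, not Mailwoman's implementation, and the pattern is far too naive for production use (it ignores kanji numerals, building names, and municipalities whose names contain 市 themselves).

```python
import re

# Illustrative pattern only; real Japanese addresses have many exceptions.
JP_ADDRESS = re.compile(
    r"(?:〒\s*(?P<postcode>\d{3}-\d{4})\s*)?"                   # optional postal code
    r"(?P<prefecture>東京都|北海道|(?:京都|大阪)府|.{2,3}県)"  # 都道府県
    r"(?P<municipality>.+?[市区町村])"                          # 市区町村
    r"(?P<district>\D+?)"                                       # district name
    r"(?P<locator>\d+(?:-\d+)*)$"                               # numbered blocks
)

def split_japanese_address(text):
    """Return the named parts of a simple Japanese address, or None."""
    match = JP_ADDRESS.match(text.strip())
    return match.groupdict() if match else None

parts = split_japanese_address("〒105-0011 東京都港区芝公園4-2-8")
print(parts)
# {'postcode': '105-0011', 'prefecture': '東京都', 'municipality': '港区',
#  'district': '芝公園', 'locator': '4-2-8'}
```

Note that nothing in the result is a street: the locator is a chain of numbered areas, which is exactly the field a street-first parser has no slot for.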

This is why every rule-based address parser written for Western addresses breaks on Japan.

PO Box, Boîte Postale, Apartado: Stage 3 ships with 6 new tags

· 6 min read
Teffen Ellis
Sister Software

For its first six versions, Mailwoman emitted ten BIO tags. The model could pick street out of a row but not street_prefix, street_suffix, unit, or po_box. Real addresses are messier than that. The golden eval set has known failures: 6220 SE Salmon St, Portland, OR 97215 (Stage 2 collapses prefix+name+suffix into a single street), 123 Main St Apt 4B, Springfield, IL 62701 (loses the apartment), and PO Box 123, Burlington, VT 05401 (treated as a malformed street).

v0.6.0 adds six tags: street_prefix, street_suffix, unit, po_box, intersection_a, intersection_b. The model is the same h384/6L/6H transformer. The recipe is the same v0.5.1 settings. The tokenizer is the same v0.6.0-a0 multi-script bundle. The only structural change is the output head: 21 BIO labels → 33.
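The jump from 21 to 33 labels follows from the usual BIO arithmetic: each tag contributes a B- and an I- label, plus one shared O label. A short Python sketch of that count, assuming the standard BIO scheme; the ten original tag names are not listed here, so `tag_0` through `tag_9` are placeholders.

```python
def bio_labels(tags):
    """Return the BIO label set for a list of tags: O plus B-/I- per tag."""
    labels = ["O"]
    for tag in tags:
        labels += [f"B-{tag}", f"I-{tag}"]
    return labels

# Placeholder names: the original ten tags are not enumerated in the post.
old_tags = [f"tag_{i}" for i in range(10)]
new_tags = ["street_prefix", "street_suffix", "unit",
            "po_box", "intersection_a", "intersection_b"]

print(len(bio_labels(old_tags)))             # 21
print(len(bio_labels(old_tags + new_tags)))  # 33
```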

FST gazetteer ships to the browser

· 3 min read
Teffen Ellis
Sister Software

The /demo page now loads a 9 MB FST (finite-state transducer) gazetteer alongside the 29 MB ONNX model. 94,000 US admin places with Wikipedia importance scores feed directly into the neural classifier's Viterbi decoder as emission priors — the same pipeline that runs server-side now runs entirely in the browser.
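To make "emission priors" concrete, here is a minimal Python sketch of the general idea: a gazetteer hit adds a bonus to a label's per-token score before Viterbi decoding picks the best path. This is a generic log-space Viterbi written for illustration, not Mailwoman's decoder; the function names, label set, scores, and prior values are all made up for the example.

```python
def add_priors(emissions, priors, weight=1.0):
    """Add a per-token, per-label prior (hypothetical gazetteer score) to emissions."""
    return [
        {label: score + weight * prior.get(label, 0.0) for label, score in scores.items()}
        for scores, prior in zip(emissions, priors)
    ]

def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label path; all scores are log-space and additive."""
    best = [dict(emissions[0])]
    back = []
    for scores in emissions[1:]:
        column, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            column[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        best.append(column)
        back.append(pointers)
    path = [max(labels, key=lambda label: best[-1][label])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

labels = ["O", "locality", "region"]
# Made-up model scores for the tokens "Portland" and "OR".
emissions = [
    {"O": -1.0, "locality": -1.1, "region": -3.0},
    {"O": -1.2, "locality": -3.0, "region": -0.5},
]
# Made-up gazetteer prior: "Portland" is a known place name.
priors = [{"locality": 0.5}, {}]

print(viterbi(emissions, {}, labels))                      # ['O', 'region']
print(viterbi(add_priors(emissions, priors), {}, labels))  # ['locality', 'region']
```

In this toy case the model alone narrowly prefers O for "Portland", and the prior is just enough to tip it to locality.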

Five tries, same failure — narrowing v0.5.0's training problem by elimination

· 9 min read
Teffen Ellis
Sister Software

This is a follow-up to yesterday's post about the v0.5.0 C-train failures. Yesterday we ran four attempts and ruled out three suspects. Today we ran a fifth and ruled out a fourth. We're now down to one remaining hypothesis — and the way we got here is a kind of debugging that translates pretty cleanly from software engineering, so this post is pitched at engineers who haven't run a training campaign before.

If you've ever bisected a regression in a piece of software (used git bisect, narrowed a test failure by reverting changes one at a time, taken a known-good build and a known-broken build and asked which of the changes between them caused the breakage), then you already understand the core move. The rest is vocabulary.
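The "revert one change at a time" move can be sketched in a few lines of Python. Everything here is hypothetical: the setting names, the `find_culprit` helper, and the `is_broken` check, which in a real training campaign would be a full training run rather than a cheap function call. The sketch also assumes a single culprit; interacting changes need a more careful search.

```python
def find_culprit(good, bad, is_broken):
    """Revert one changed setting at a time; return the first revert that fixes the run."""
    changed = [key for key in good if bad.get(key) != good[key]]
    for key in changed:
        trial = dict(bad)
        trial[key] = good[key]  # undo exactly one change
        if not is_broken(trial):
            return key
    return None

# Hypothetical settings: three changes between a known-good and a known-broken run.
good = {"change_a": "old", "change_b": "old", "change_c": "old"}
bad = {"change_a": "new", "change_b": "new", "change_c": "new"}

# Stand-in for "train and see whether it diverges".
def is_broken(config):
    return config["change_c"] == "new"

print(find_culprit(good, bad, is_broken))  # change_c
```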

Taming Who's On First — making sense of the world's open place data

· 10 min read
Teffen Ellis
Sister Software
If you found this via search

Mailwoman is an open-source address parser + geocoder that uses Who's On First as its gazetteer. This post is a practical reference on WOF's gotchas and the tooling we built to work around them. Try the demo or see what ships today.

Who's On First is the best open gazetteer we have. It's also one of the strangest datasets you'll encounter as a developer. This post is about what makes it hard to use, what makes it worth the effort, and the tooling we built inside Mailwoman to tame it.

If you've ever tried to answer "what city is this address in?" programmatically, using open data without paying a geocoding API, you've probably already run into WOF. And you probably had some questions.

Two voices arguing inside a model — a beginner-friendly debugging story

· 11 min read
Teffen Ellis
Sister Software
If you found this via search

Mailwoman is an open-source address parser that runs in Node and the browser. It uses a small neural model to label address components ("350" = house number, "NY" = region, etc.). Try the live demo.

This post is a beginner-friendly debugging story — no ML background needed. If you just want the project status, see what ships today.

This is the third post in a series about a training problem we've been chasing. The first two were written for software engineers. This one is for someone who is just starting to learn about AI and machine learning — no jargon assumed, no math beyond high-school algebra. The point is to show you what real ML debugging looks like, using a problem we actually had this week.

If you've been programming for a while but ML feels opaque, this post is for you. The core technique we used — figuring out which of two instructions our model was listening to — turns out to be much more like ordinary debugging than the field usually makes it sound.

Four training runs, zero shipped weights — bisecting v0.5.0's divergence

· 11 min read
Teffen Ellis
Sister Software
If you found this via search

Mailwoman is an open-source address parser. This post is a training log entry from May 2026 documenting the v0.5.0 divergence investigation. For current project status, see what ships today.

v0.5.0 was the fresh-slate ship: new tokenizer, expanded corpus, new architecture, new reconcile stage. The plan was to bundle several months of structural improvements into one big iteration and pay the cost once. Most of it landed clean. The classifier didn't.

This post walks through the four training attempts the v0.5.0 C-train made overnight, the bisect that ruled out three plausible explanations, and what we think the remaining culprit is. It's a sister piece to the v0.4.0 retrospective — same shape of failure, different diagnostic ladder.

Five training runs, one shipped checkpoint — what we learned from v0.4.0

· 10 min read
Teffen Ellis
Sister Software
If you found this via search

Mailwoman is an open-source address parser. This post is a historical retrospective on the v0.4.0 training campaign (May 2026). For current project status, see what ships today.

@mailwoman/neural-weights-en-us@v0.4.0 (and the fr-fr sibling) shipped today as packaged artifacts (the npm publish is a separate step we do by hand). It is a mixed-result release: one clear win on fine-grained labels, two regressions on coarse labels that turned out to be mostly artifacts of how we measured. The premise we started from, that three orthogonal training improvements could be combined into one ship, was almost entirely falsified by a divergence pattern we hadn't seen before.

This is a writeup of how the campaign went. We're publishing it for two reasons: to be honest about what the headline numbers mean, and because the way the failures stacked up is worth thinking about if you train your own NER-style models.