Mailwoman log

A lookup table scored 100%. We shipped the model anyway.

2026-06-09T00:00:00.000Z

This morning we published a post that ended with a tidy rule: some address tags don't want a neural network, they want a lookup table. Country names are a closed list in a known position. Our deterministic matcher scored a perfect 100 on the eval. The retrained model scored a mess. Case closed, we wrote.

By the afternoon we'd reopened the case, and the verdict flipped — hard enough that we've retracted the morning post rather than leave the wrong conclusion lying around for someone to cite. This is the story of how a perfect score nearly talked us out of the entire premise of the project.

The score was real. The fight was rigged.

Here's what the morning's comparison actually was. In one corner: a flat lookup, matching the trailing chunk of an address against the ISO country list. In the other: a model we had retrained on a synthetic shard where every single row ended in a country. That model learned exactly what we taught it — "the last thing in an address is a country" — and started promoting cities and states to nationhood. Precision: 23%.

And the referee? An eval with no homographs in it. Not one "Georgia." Not one "CA." Fifty-four addresses where the trailing token was never ambiguous, which is to say, fifty-four addresses where a lookup table cannot lose.
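To make the rigged fight concrete, here is a minimal sketch of a trailing-chunk lookup. The function name, the tiny country set, and the postcode-stripping regex are all assumptions for illustration, not the matcher we actually ran. It shows why an eval without homographs cannot hurt this design, and what happens the moment one shows up.

```typescript
// Hypothetical stand-in for the ISO country list (lowercased names).
const COUNTRY_NAMES = new Set<string>(["france", "georgia", "jamaica", "jordan"]);

// Flat lookup: match the trailing comma-separated chunk against the list.
function trailingCountry(address: string): string | null {
  const chunks = address.split(",").map((chunk) => chunk.trim());
  // Drop a trailing postcode such as "30309" before the lookup.
  const last = (chunks[chunks.length - 1] ?? "").replace(/\s+\d[\d-]*$/, "");
  return COUNTRY_NAMES.has(last.toLowerCase()) ? last : null;
}

// Unambiguous tail: the lookup is right.
console.log(trailingCountry("Tbilisi, Georgia"));
// Homograph tail: the same lookup promotes a US state to a country.
console.log(trailingCountry("Atlanta, Georgia 30309"));
```

Both calls return "Georgia" as a country. Only the first is correct, and nothing inside the lookup can tell them apart.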

A crippled model, an unloseable eval, and a perfect score. We looked at that 100% and wrote down a design principle. You've done this too — a benchmark hands you a clean number, the number agrees with the architecture you were already tempted by, and the question "what exactly did this measure?" quietly leaves the room.

The objection that reopened it

The pushback, when it came, was about the soul of the thing. Mailwoman is a model system. The entire bet is that a human reads "Atlanta, Georgia" and "Tbilisi, Georgia" and resolves them without a rulebook, so a context-reading model should too. A lookup table can't do that. It needs a hand-coded guard for every collision — Georgia, Jordan, Lebanon, CA — and a growing list of exceptions is precisely the disease we left rules-based parsing to escape.

So we did what we should have done in the morning: gave the model a fair fight.

We rebuilt the training shard with the homographs in it, both ways. "Tbilisi, Georgia" labeled as a country, "Atlanta, Georgia 30309" labeled as a state, the same surface form pulling in opposite directions until the only way to win is to read the neighbors. We added addresses with no country at all, so abstaining stays on the menu. Then we built the eval the morning's comparison never had: Paris, Texas against Paris, France; Kingston, New York against Kingston, Jamaica; person-named countries; the works.
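A contrast shard is easy to get wrong silently, so a sanity check helps. The sketch below is hypothetical (the row shape and function name are ours for illustration, not the real shard format): it lists every trailing name that appears under more than one tag, which is the property the rebuilt shard depends on.

```typescript
type TailTag = "country" | "region";
type ShardRow = { text: string; tail: string; tag: TailTag };

// Illustrative rows only; the first two are the pair quoted in the post.
const shardRows: ShardRow[] = [
  { text: "Tbilisi, Georgia", tail: "Georgia", tag: "country" },
  { text: "Atlanta, Georgia 30309", tail: "Georgia", tag: "region" },
  { text: "Paris, France", tail: "France", tag: "country" },
  { text: "Paris, Texas", tail: "Texas", tag: "region" },
];

// Tails that appear under more than one tag: these are the contrast pairs.
function contrastTails(rows: ShardRow[]): string[] {
  const tagsByTail = new Map<string, TailTag[]>();
  for (const row of rows) {
    const tags = tagsByTail.get(row.tail) ?? [];
    if (tags.indexOf(row.tag) < 0) tags.push(row.tag);
    tagsByTail.set(row.tail, tags);
  }
  const result: string[] = [];
  tagsByTail.forEach((tags, tail) => {
    if (tags.length > 1) result.push(tail);
  });
  return result;
}

console.log(contrastTails(shardRows));
```

If that list comes back empty, the shard is teaching position again, not context.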

The fair fight

The retrained model's first result: country recognition went from literally zero to alive — at 100% precision, with zero over-fires. Not one city promoted to a nation. Not one "Georgia" guessed wrong. Region accuracy improved while the new tag came online. The contrast pairs did exactly what the theory said: the model learned that the label is contextual, because we finally showed it contexts that disagree.

What it missed, it missed honestly: Eswatini. Timor-Leste. Bhutan. Countries the training data mentions a handful of times. That failure mode is recognition, a vocabulary problem, and vocabulary is what gazetteers are for.

Which is where the lookup table re-enters the story — demoted. It doesn't get to be the judge anymore; it gets to be a witness. We feed gazetteer membership into the model as a per-token clue: this word is on the country list; this word is on two lists, so pay attention. The model still rules on every tag. Add Liechtenstein to the gazetteer tomorrow and the clue fires with no retrain, because the knowledge lives outside the weights. The morning's matcher survives intact, doing the one job it was always qualified for: knowing what's on the list. Reading the room was never on its résumé.
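Here is a rough sketch of what "gazetteer membership as a per-token clue" could look like. Everything in it is an assumption made for illustration: the list contents, the clue shape, and the single-token matching (a real feature would need to handle multi-word names). How the clue is actually encoded for the model is not shown.

```typescript
type TokenClue = { token: string; lists: string[]; ambiguous: boolean };

// Hypothetical gazetteer lists, lowercased. Real lists would be far larger.
const countryList = new Set<string>(["georgia", "jordan", "lebanon", "france"]);
const regionList = new Set<string>(["georgia", "texas", "vermont"]);
const gazetteerLists: Array<[string, Set<string>]> = [
  ["country", countryList],
  ["region", regionList],
];

// One clue per token: which lists it is on, and whether it is on more than one.
function gazetteerClues(tokens: string[]): TokenClue[] {
  return tokens.map((token) => {
    const key = token.toLowerCase();
    const lists = gazetteerLists
      .filter(([, members]) => members.has(key))
      .map(([listName]) => listName);
    return { token, lists, ambiguous: lists.length > 1 };
  });
}

// Adding a name changes the clue immediately; no weights are involved.
countryList.add("liechtenstein");
console.log(gazetteerClues(["Tbilisi", "Georgia", "Liechtenstein"]));
```

The clue only reports list membership. Deciding what "Georgia" means in a given address stays with the model.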

The lesson we're keeping

The seductive thing about a deterministic component is that it cannot be wrong in the cases you thought to test. The treacherous thing is the same sentence with the emphasis moved: it cannot be right in the cases you didn't. Our 100% was an artifact of an eval that only contained the easy half of the problem.

When a benchmark tells you the simple thing beats the learned thing, before you celebrate, check who the learned thing was that day, and check what the benchmark left out. Sometimes the simple thing genuinely wins. Ours lost the rematch the moment the hard cases showed up — and the model, given one honest shot at the data, did the thing we built it to do.

Trust the model. Feed it better.

The right name in the wrong state

2026-06-08T00:00:00.000Z

Our resolver scored 93.7% on the metric we'd been quoting for months. On the same addresses, its median answer was 326 kilometers from the truth.

Both numbers are correct. That's the uncomfortable part.

A metric that reads the label and never checks the map

When the resolver turns a parsed address into a place, we used to grade it one way: did the place it picked carry the same name as the gold answer? Gold says the locality is "Sheldon", resolver says "Sheldon", that's a point. It's a reasonable-sounding check, and it is wrong in a way that took us months to see. It can only fail when the name is wrong, and the name is almost never wrong.

There are ten places called "Sheldon" in the United States. "New York" is a city, a state, and a village some 280 kilometers from the city. "Washington" is a town in most states you can name. When you grade by name, every one of those is a tie, and the resolver gets full marks for picking any of them. The metric was answering "is this the right word?" when the only question that matters is "is this the right place on Earth?"

So we built a harness that asks the second question, and pointed it at the one slice of data where it would tell the truth.

Leakage-free, or it's just a memory test

The honest slice matters as much as the honest metric. Our model trains on a corpus that covers the same towns the eval tests, so a random evaluation partly measures memorization: the model recalling a place it has already seen rather than generalizing to one it hasn't. The corpus deliberately holds a few regions out of training entirely. Evaluate only on those held-out places and you're testing the model on geography it has genuinely never met.

In our current data that's Vermont: 1,428 addresses the model trained around, not on. We ran the full pipeline on them and stopped grading by name. We measured region-match, the great-circle distance from the gold point to the resolved one, and PIP-containment (point-in-polygon: whether the gold coordinate actually falls inside the resolved place's polygon). None of those can be fooled by a matching string.
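For readers who want the two geometric checks spelled out, here is a minimal sketch: a haversine great-circle distance and a ray-casting point-in-polygon test. It assumes a spherical Earth with a 6,371 km radius and a single polygon ring with no holes, multipolygons, or antimeridian handling. The names are ours for illustration, not the harness's.

```typescript
type LonLat = { lon: number; lat: number };

const EARTH_RADIUS_KM = 6371; // mean radius; spherical approximation

// Haversine great-circle distance between two points, in kilometers.
function greatCircleKm(a: LonLat, b: LonLat): number {
  const rad = (deg: number) => (deg * Math.PI) / 180;
  const dLat = rad(b.lat - a.lat);
  const dLon = rad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(a.lat)) * Math.cos(rad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Ray casting over one ring: count edge crossings of a ray going east from p.
function ringContains(ring: LonLat[], p: LonLat): boolean {
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const a = ring[i];
    const b = ring[j];
    if (!a || !b) continue;
    const crosses =
      a.lat > p.lat !== b.lat > p.lat &&
      p.lon < ((b.lon - a.lon) * (p.lat - a.lat)) / (b.lat - a.lat) + a.lon;
    if (crosses) inside = !inside;
  }
  return inside;
}
```

Neither function ever sees a place name, which is the whole point of grading this way.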

Here is what the honest slice said, next to the number we'd been quoting:

metric                    what we quoted    the honest number
locality name-match       93.7%             93.7%
region-match              n/a               0.0%
coordinate error (p50)    n/a               326 km

Region-match: zero. Not low. Zero. The resolver was getting the state right essentially never, and the name-match metric had no way to tell us, because "Sheldon, Vermont" and "Sheldon, Iowa" are the same word.

Following the 326 kilometers down

The model wasn't the problem. Hand it 226 Bridge Rd, North Hero, VT 05474 and it cleanly tags region="VT", locality="North Hero", the street, the number, the postcode. The parse is right. The resolver throws the region away.

It throws it away because it can't read it. Who's On First stores Vermont as "Vermont"; our search index carried no abbreviations, so findPlace("VT") matched nothing. With no resolved region, the resolver had no parent to constrain the locality search, so it searched the whole country — and when ten Sheldons compete with no geographic filter, the one with the largest population wins. Vermont's Sheldon (population 932) loses to Iowa's (population 5,455) every single time. The 326 kilometers was the distance between the right name and the famous one.
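The failure chain is short enough to sketch. The code below is a hypothetical miniature, not our resolver: the function names, the alias map, and the two-candidate pool are made up for illustration, with the Sheldon populations taken from the paragraph above. It shows how an unreadable "VT" turns into a nationwide popularity contest.

```typescript
type LocalityCandidate = { label: string; region: string; population: number };

const sheldons: LocalityCandidate[] = [
  { label: "Sheldon", region: "Vermont", population: 932 },
  { label: "Sheldon", region: "Iowa", population: 5455 },
];

// Region lookup: full names always match; abbreviations only via the alias map.
function findRegion(raw: string, aliases: Map<string, string>): string | null {
  const fullNames = ["Vermont", "Iowa"];
  if (fullNames.indexOf(raw) >= 0) return raw;
  return aliases.get(raw) ?? null;
}

// With a region, filter to it. Without one, every candidate competes
// and the largest population wins.
function pickLocality(
  cands: LocalityCandidate[],
  region: string | null
): LocalityCandidate | undefined {
  const pool = region === null ? cands : cands.filter((c) => c.region === region);
  return pool.slice().sort((a, b) => b.population - a.population)[0];
}

const noAliases = new Map<string, string>();
const withAliases = new Map<string, string>([["VT", "Vermont"], ["IA", "Iowa"]]);

console.log(pickLocality(sheldons, findRegion("VT", noAliases))?.region);
console.log(pickLocality(sheldons, findRegion("VT", withAliases))?.region);
```

Same candidates, same ranking rule; the only thing that changes the answer is whether the abbreviation resolves.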

The fix already existed in the repo. A build step that pulls state abbreviations from a reference dataset we already ship had simply fallen out of the build manifest, so the gazetteer went out without it. We put it back, rebuilt the index, and re-ran the same slice:

metric                    before    after
region-match              0.0%      99.9%
coordinate error (p50)    326 km    3.4 km

Across the full US sample, the long tail told the same story louder: the 90th-percentile error fell from 2,763 kilometers to 10. We carry a flag called --default-country, the one that makes you tell the resolver the answer it's supposed to find, and it exists largely to paper over this exact blindness. The resolver can read the region now.

The number was right; the screwdriver was wrong

This is where it would be tidy to stop. It wasn't tidy.

Before promoting anything we ran the demo presets, the eight addresses we look at by hand, and one of them had gotten worse. 350 5th Ave, New York, NY used to resolve to New York City. Now it resolved to "New York Mills", a village 283 kilometers upstate. The aggregate said the fix was a triumph; the functional check said we'd broken the most famous address in the set. When those two disagree, the functional check is the one telling the truth, and that disagreement is where you go looking.

The clue led somewhere worth knowing. Now that the region resolved, the resolver was boosting places that descend from it, and it works out descent from a precomputed ancestry table. New York City spans five boroughs, so Who's On First gives it the "no single parent" sentinel for a parent id, and our table-builder, which only ever followed parent ids, had recorded NYC's ancestry as just itself. No link to New York state. So the region boost lifted the correctly-filed village over the city, and a village of three thousand beat a city of eight million on a technicality of bookkeeping.

The ancestry was never actually missing. NYC's source record carries the full hierarchy, with New York state in all five of its borough branches, sitting in a field our builder didn't read. So we read it: a repair pass that rebuilds ancestry from the authoritative hierarchy fixed 47,129 places. New York City resolves to New York City again, Vermont stayed at 3.4 kilometers, and the metro regression was gone.
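The two ways of deriving ancestry can be sketched side by side. This is an illustration under stated assumptions: the record shape, the `hierarchies` field name, the ids, and the sentinel value are placeholders, not the real Who's On First schema or our builder's code.

```typescript
type PlaceRecord = {
  id: number;
  parentId: number; // NO_SINGLE_PARENT when the place spans several parents
  hierarchies: number[][]; // hypothetical field: one ancestor-id chain per branch
};

const NO_SINGLE_PARENT = -1; // placeholder value, not the real sentinel
const US = 1, NY_STATE = 10, NYC = 20, NY_MILLS = 40;

const placeRecords = new Map<number, PlaceRecord>([
  [NY_STATE, { id: NY_STATE, parentId: US, hierarchies: [[US]] }],
  // Two of the five borough branches shown; ids 31 and 32 are made up.
  [NYC, { id: NYC, parentId: NO_SINGLE_PARENT, hierarchies: [[US, NY_STATE, 31], [US, NY_STATE, 32]] }],
  [NY_MILLS, { id: NY_MILLS, parentId: NY_STATE, hierarchies: [[US, NY_STATE]] }],
]);

// Old behavior: walk parent ids only. Stops dead at the sentinel.
function ancestryFromParents(id: number, records: Map<number, PlaceRecord>): Set<number> {
  const seen = new Set<number>([id]);
  let current = records.get(id);
  while (current && current.parentId !== NO_SINGLE_PARENT && !seen.has(current.parentId)) {
    seen.add(current.parentId);
    current = records.get(current.parentId);
  }
  return seen;
}

// Repair pass: union every ancestor listed in every hierarchy branch.
function ancestryFromHierarchies(id: number, records: Map<number, PlaceRecord>): Set<number> {
  const seen = new Set<number>([id]);
  for (const chain of records.get(id)?.hierarchies ?? []) {
    for (const ancestorId of chain) seen.add(ancestorId);
  }
  return seen;
}

console.log(ancestryFromParents(NYC, placeRecords).has(NY_STATE));
console.log(ancestryFromHierarchies(NYC, placeRecords).has(NY_STATE));
```

In this toy, the parent walk gives the village a link to the state and leaves the city with none; the hierarchy pass gives both their link.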

What we're keeping

Two things, and they're the same shape.

The first is about metrics. A measurement that grades by name can be gamed by coincidence and will flatter you right up until a customer geocodes into the wrong state. The coordinate can't be gamed: a point is either inside the right boundary or it isn't. We lead with region-match and distance now, and we report containment honestly, point geometry and all. The yardstick comes before the optimization, because every win you book against a dishonest yardstick is a win you might have to give back.

The second is about trust. The aggregate loved the abbreviation fix. The eight addresses we read with our own eyes caught the regression the aggregate buried, and chasing why those eight disagreed is what surfaced the ancestry bug underneath. Numbers scale and that is exactly their weakness; they average away the one case that would have embarrassed you. Keep reading the addresses by hand. The disagreement is where the bug lives.

We spent three retrains fixing a German bug that didn't exist

2026-06-07T00:00:00.000Z

There is a particular kind of engineering misery where you fix a bug three times and it never gets better, because the bug is in your ruler. This is that story.

Our neural parser handles German two ways. Native order — Hauptstraße 5, 10115 Berlin — is the layout real German feeds and real German people use. International order — 5 Hauptstraße, Berlin, 10115 — is the Americanized layout our evaluation set happens to ship. For months, international-order German "collapsed": locality accuracy sat around 44% while native cleared 80%. We had a story for it. The postcode anchor — a side-channel that feeds the model a country hint derived from the postcode — sits at the trailing postcode, which in international order lands on the far side of the locality from where it's needed. Plausible. So we retrained.

Three swings

The first retrain taught the model both word orders. It moved the model's intrinsic parsing but the production number stayed flat. The second re-added a region tail the synthetic data had dropped. It fixed region tagging — and left locality exactly where it was. The third injected the country hint at the front of the sentence too, so word order couldn't hide it. Locality-match went from 44.7% to 43.7%. Down. Three swings, and the needle would not move.

Across all three, one number sat there glowing and we kept not looking at it: the median coordinate error was about 6 kilometers. Six kilometers is city-centroid accuracy. That is not what a "collapse" looks like. A model that genuinely couldn't parse German addresses would be putting them in the wrong country, not six kilometers from the front door. The geography was fine the whole time while the locality-match score fell. When your accuracy metric drops and your distance-to-truth doesn't, the metric is the thing that's broken.

Measuring the thing that can't be gamed

So we measured it. PIP-containment: forget whether the resolved name string matches the gold string — is the address's real GPS point physically inside the polygon of the place we resolved it to? You cannot game that with a string trick. It either lands in the right place or it doesn't.

The international-order German result split clean down the middle:

                name-match   PIP-containment
Saxony            51.1%          75.9%        (+24.8pp)
Berlin            36.3%          36.3%        ( 0.0pp)

Two completely different stories had been hiding under one average.

Saxony was never broken. The model places Saxon addresses correctly three times in four; the name-match metric only credited half of them. Look at what it was rejecting:

gold "Plauen Vogtl"        resolved "Plauen"        point inside Plauen ✓
gold "Chemnitz Sachs"      resolved "Chemnitz"      point inside Chemnitz ✓
gold "Marienberg Erzgeb"   resolved "Marienberg"    point inside Marienberg ✓

OpenAddresses tags these with the regional district — Vogtländischer Kreis, Sachsen, Erzgebirge — and Who's On First's canonical name doesn't carry the suffix. So Plauen Vogtl ≠ Plauen, the string check fails, and the model eats a miss for resolving an address to exactly the right town. Twenty-five points of "collapse" was our ruler refusing to call Plauen Plauen.
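A sketch of what an alias-aware comparison might look like, using only the three pairs listed above. The alias map and function names are hypothetical; the real fix would need a sourced alias list, not a hand-typed one.

```typescript
// Hypothetical alias map: suffixed gold spelling -> canonical name (lowercased).
const localityAliases = new Map<string, string>([
  ["plauen vogtl", "plauen"],
  ["chemnitz sachs", "chemnitz"],
  ["marienberg erzgeb", "marienberg"],
]);

function canonicalLocality(raw: string): string {
  const key = raw.trim().toLowerCase();
  return localityAliases.get(key) ?? key;
}

// Strict check: the ruler we had been using.
function strictNameMatch(gold: string, resolved: string): boolean {
  return gold === resolved;
}

// Alias-aware check: compare canonical forms on both sides.
function aliasNameMatch(gold: string, resolved: string): boolean {
  return canonicalLocality(gold) === canonicalLocality(resolved);
}

console.log(strictNameMatch("Plauen Vogtl", "Plauen"));
console.log(aliasNameMatch("Plauen Vogtl", "Plauen"));
```

An alias table still only compares strings, so it repairs the name metric without replacing the containment check.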

Berlin was genuinely broken — just not the way we'd been retraining for. Of 1,500 Berlin addresses, 955 resolved to nothing at all. The model drops the locality entirely in the city-state layout …, Berlin, Berlin 10115, where the city and the state are the same word: one Berlin gets labeled the region, the other vanishes, and the resolver has nothing to place. That's a real bug. It is also specific to Berlin, Hamburg, and Bremen, and it has nothing whatsoever to do with the postcode anchor or word order — which is precisely why three anchor-and-order retrains never laid a finger on it.

What native German was actually doing

And then the part that stung. We ran the same honest metric on native order, the layout that actually matters:

                name-match   PIP-containment
native German     83.5%          96.2%

Ninety-six percent. Native German, measured by where the addresses actually land, was essentially solved and beating the rules-based baseline comfortably, while we'd been reading 83.5% off the name string and quietly wishing it were better. The metric had been low-balling our best locale by nearly thirteen points the whole time.

The bill

Three retrains, an A100 each, to discover that the model was fine and the scoreboard was broken. The honest accounting: one bug was a measurement artifact in the resolver's name comparison (the fix is an alias, not a training run), one was a narrow city-state parsing bug (a small data fix, not a country hint), and the model's German was a good deal better than any of our numbers had admitted. We cancelled the fourth retrain that was already queued.

The thing I keep turning over is that the coordinate error sat at six kilometers across all three runs and we kept retraining anyway, because the metric we'd built our gates around was the one telling us to. A benchmark you can fail while being right is worse than no benchmark, because it doesn't just fail to help — it actively points you at the wrong fix and lets you feel diligent while you chase it. We have a non-gameable metric now. We should have built it first.

The 2×2s, the PIP-containment harness, and the per-state breakdowns are in scripts/eval/de-pip-eval.sh and docs/articles/evals/. Numbers in this post are generated, not hand-typed.

Which Berlin? When your metric grades the wrong thing

2026-06-07T00:00:00.000Z

Ask a geocoder for "Berlin" and it has to make a choice. There's the one in Germany, obviously. There's also Berlin, New Hampshire (population nine thousand and change); Berlin, Wisconsin; Berlin, Connecticut; and a dozen more scattered across the United States like the name was on sale. The parser hands you the word Berlin tagged as a locality; something downstream has to decide which dot on the map that is. How would you even know if it picked right?

For a long time our answer was a scorecard that checked the name. Did the resolved place's name equal the expected name? Tick. Move on. It is a completely reasonable thing to measure, and it was lying to us for months.

The gold star for New Hampshire

Here's the failure the name check can't see. Feed it a German address, let the resolver land on Berlin, New Hampshire, and ask the scorecard how it did. The resolved name is "Berlin." The expected name is "Berlin." Tick. Gold star. We just put a Berlin address an ocean away from Berlin and the metric congratulated us for it.

This isn't a contrived edge case. Bare locality names collide constantly across borders, and a name-only check is structurally blind to the collision. Whenever the model dropped a German locality on its American namesake, our headline number stayed perfectly, serenely flat. The bug and the scorecard were made for each other.

We only tripped over it by accident, chasing something else entirely.

The hint that did nothing, loudly

Every address carries a postcode, and a postcode mostly pins down a country. So we built a small extractor that turns the postcode into a guess about which country you're in, and we ran a simulation: feed that country guess into the resolver's ranking, give candidates from the right country a nudge, and see how much the name-match score improves.

It improved by nothing. Zero. Flat line.

Which, briefly, looked like a dead end. The hint was supposed to help and the number said it didn't. Then it clicked: the number couldn't say it helped, because the number grades by name, and fixing a wrong-country pick doesn't change the name. We'd handed our metric exactly the kind of improvement it was built to ignore.
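The blind spot is easy to reproduce in miniature. In this hypothetical sketch (the candidate scores, the boost size, and the function name are invented for illustration, not taken from the simulation), a country hint flips which Berlin wins while the name the scorecard sees never changes.

```typescript
type RankedPlace = { placeName: string; country: string; score: number };

// Made-up scores: without a hint, the American namesake edges ahead.
const berlins: RankedPlace[] = [
  { placeName: "Berlin", country: "us", score: 0.6 },
  { placeName: "Berlin", country: "de", score: 0.55 },
];

// Nudge candidates whose country is among the postcode's hinted countries.
function rerankWithCountryHint(
  cands: RankedPlace[],
  hintedCountries: string[],
  boost = 0.5
): RankedPlace[] {
  return cands
    .map((c) => ({
      ...c,
      score: c.score + (hintedCountries.indexOf(c.country) >= 0 ? boost : 0),
    }))
    .sort((a, b) => b.score - a.score);
}

const topBefore = rerankWithCountryHint(berlins, [])[0];
const topAfter = rerankWithCountryHint(berlins, ["de"])[0];

// The country changes; the name, which is all the old metric read, does not.
console.log(topBefore?.country, topAfter?.country);
console.log(topBefore?.placeName === topAfter?.placeName);
```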

Measure the distance and the floor falls out

So we threw out the name check and graded by distance instead. We have the real government coordinates for every test address, so we can ask the only question that actually matters: how far is the resolver's pick from where the address really is?

The picture inverted immediately. On German addresses, the postcode hint dragged 33 picks back across the Atlantic to where they belonged, erasing about 117,000 kilometers of total error. On American addresses it pulled 333 of them more than 100 km closer to the truth and pushed only 7 the wrong way, a roughly fifty-to-one trade. The hint was quietly worth a continent, and the name scorecard had been sitting there the whole time reporting that absolutely nothing was happening.
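A per-row delta tally of this kind is only a few lines. The sketch below is an assumed shape, not the contents of the harness: the row type and function name are ours, and using the same 100 km threshold in both directions is a guess at how "the wrong way" was counted.

```typescript
type DeltaRow = { beforeKm: number; afterKm: number };

// Count rows that moved more than thresholdKm closer or farther,
// and sum the total distance saved across all rows.
function tallyDeltas(rows: DeltaRow[], thresholdKm = 100) {
  let improved = 0;
  let worsened = 0;
  let totalSavedKm = 0;
  for (const row of rows) {
    const savedKm = row.beforeKm - row.afterKm;
    if (savedKm > thresholdKm) improved += 1;
    else if (savedKm < -thresholdKm) worsened += 1;
    totalSavedKm += savedKm;
  }
  return { improved, worsened, totalSavedKm };
}

// Toy rows: one transatlantic rescue, one small wobble, one regression.
const toyRows: DeltaRow[] = [
  { beforeKm: 6000, afterKm: 5 },
  { beforeKm: 50, afterKm: 40 },
  { beforeKm: 10, afterKm: 300 },
];
console.log(tallyDeltas(toyRows));
```

Counting wins and losses separately matters: a single net total would let a few continent-sized rescues hide a pile of small regressions.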

A metric you can satisfy without being right will let you be wrong forever, cheerfully, in production. "Berlin" matches "Berlin" no matter which one you meant. The distance to the real point does not care what you call the place; it just measures whether you found it. We switched the yardstick, and we're building the country hint into the resolver for real now that we can finally see what it does.

The same week, the same lesson

This landed the same week we did something that sounds unrelated and turns out to be the identical problem: we calibrated the parser's confidence. Every span comes out stamped with a conf= number, and we'd never checked whether a 0.9 actually meant right-nine-times-in-ten. It didn't, until we fit a correction that made it honest (the calibration writeup has the details, including the weather-forecaster version of the story).
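Checking whether a 0.9 means right-nine-times-in-ten is a bucketing exercise. Here is a minimal reliability-table sketch, assuming each span carries a confidence and a correct/incorrect flag; the types and names are hypothetical, and the correction we fitted afterwards is not shown.

```typescript
type ScoredSpan = { conf: number; correct: boolean };
type ReliabilityBin = {
  lower: number;
  upper: number;
  count: number;
  meanConf: number;
  accuracy: number;
};

// Group spans into equal-width confidence bins and compare the average
// stated confidence with the observed accuracy in each bin.
function reliabilityBins(spans: ScoredSpan[], binCount = 10): ReliabilityBin[] {
  const sums: Array<{ count: number; conf: number; hits: number }> = [];
  for (let i = 0; i < binCount; i++) sums.push({ count: 0, conf: 0, hits: 0 });

  for (const span of spans) {
    const index = Math.min(binCount - 1, Math.max(0, Math.floor(span.conf * binCount)));
    const bin = sums[index];
    if (!bin) continue;
    bin.count += 1;
    bin.conf += span.conf;
    bin.hits += span.correct ? 1 : 0;
  }

  const bins: ReliabilityBin[] = [];
  sums.forEach((bin, i) => {
    if (bin.count === 0) return; // skip empty bins
    bins.push({
      lower: i / binCount,
      upper: (i + 1) / binCount,
      count: bin.count,
      meanConf: bin.conf / bin.count,
      accuracy: bin.hits / bin.count,
    });
  });
  return bins;
}
```

A calibrated parser has meanConf and accuracy roughly equal in every bin; a gap between them is the dishonesty being measured.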

Both are the same realization wearing different hats. A geocoder reports numbers about itself constantly: how confident it is in a tag, how well it scored on a benchmark. Those numbers are worthless decoration until you've checked that they mean what they say. A confidence that isn't calibrated is a vibe with a decimal point. A benchmark you can game is a way to feel good while shipping the wrong Berlin.

So the next time a metric tells you everything is fine, ask it the one thing it isn't measuring. Ours was measuring the spelling. It should have been measuring the distance.

The harness, the per-row deltas, and the reproducible reports live in scripts/eval/anchor-resolver-delta.ts and docs/articles/evals/. Numbers in this post are generated, not hand-typed.

Which way does a postcode point?

2026-06-06T00:00:00.000Z

We left the last postcode story with a promise and a bill. The promise was that the "which country is this" signal has to come from the trained model reading the whole string, because the postcode on its own settles the question less than half the time. The bill was that this is the expensive version of the feature. This is the post where we paid it: we built the country signal into the model, watched it do something genuinely great, and then watched it refuse, in the most instructive way we've hit all month, to do that same great thing in a different word order.

The great thing first, because you've earned it. We took the postcode's gazetteer membership, that [us, de, fr] answer from last time, and instead of handing it to a regex we injected it into the model at the postcode token itself. A small additive nudge on the hidden state, right where the five digits sit, carrying "here is what this code could be." On German addresses written the way Germans actually write them, it was worth thirty-five points of locality accuracy. It beat Pelias. For one evening we were heroes.

Then we looked at the international numbers and the floor gave way. Same model, same anchor, the same German cities, but now written house-number-first with the postcode trailing the city, the way our test feed renders them, and it scored a little under a coin flip. The hero anchor was, on those rows, slightly worse than no anchor at all.

Three questions sit under the rest of this, so let me put them on the table before we start:

When a parser "collapses" on a test, is the parser wrong, or is the test?
Can you train one model to read an address in any order, or does each order quietly cost you the other?
And the one that took three retrains to answer honestly: what does a learned anchor actually learn, the thing you asked for, or the shape of where you kept putting it?

The collapse that was a rendering bug

Before you can fix a collapse you have to be sure it's real, and ours mostly wasn't. The number that scared us, German international addresses parsing around 45% while native ones sat in the eighties, turned out to be measuring our test harness as much as our model.

Here's the thing we'd quietly done to ourselves. Our German evaluation set is rendered from OpenAddresses in the layout our US-trained tooling defaults to: 27 Straußstraße, Berlin, Berlin 12623. House number first, postcode after the city, region hanging off the tail. No German has ever written an address that way. They write Straußstraße 27, 12623 Berlin, street then number, postcode before the city. The model had trained on the German order and we were grading it on the American one, then reading the low score as a model failure.
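The two layouts differ only in how the same fields are arranged, which a pair of renderers makes plain. This is a sketch with a made-up record type and function names, not our rendering code.

```typescript
type AddressFields = {
  street: string;
  houseNumber: string;
  postcode: string;
  locality: string;
  region?: string;
};

// Native German order: street, number, then postcode before the city.
function renderNativeDe(a: AddressFields): string {
  return `${a.street} ${a.houseNumber}, ${a.postcode} ${a.locality}`;
}

// International order: number first, city, then region (if any) and postcode.
function renderInternational(a: AddressFields): string {
  const tail = a.region ? `${a.region} ${a.postcode}` : a.postcode;
  return `${a.houseNumber} ${a.street}, ${a.locality}, ${tail}`;
}

const sample: AddressFields = {
  street: "Straußstraße",
  houseNumber: "27",
  postcode: "12623",
  locality: "Berlin",
  region: "Berlin",
};

console.log(renderNativeDe(sample)); // Straußstraße 27, 12623 Berlin
console.log(renderInternational(sample)); // 27 Straußstraße, Berlin, Berlin 12623
```

Rendering both strings from one record is what lets the same cities be graded in either order.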

So we re-rendered the same cities in their native order and measured again. The "collapsing" model read them at 83.8%, comfortably past Pelias's 78.7%. The collapse was, to a first approximation, us holding the test sideways. That's worth saying plainly because it's the cheap half of the lesson: when a model falls over on exactly one slice of your data, suspect the slice before you suspect the model. We've now been burned by eval-order twice, and both times the fix was free.

Only the first approximation was free, though. After we corrected the rendering, a residual gap stayed behind, and it had nothing to do with order artifacts. With the anchor switched on, international-order German still came in a few points below the same model with the anchor switched off. The boost that was worth +35 on native addresses had flipped its sign. No rendering fix was going to explain that one away; the anchor was actively making the harder order worse, and chasing why is where the rest of the story lives.

Three swings at the residual

We did the obvious thing first, and the obvious thing told us something real. If the model had only ever seen German in native order, of course it stumbled on the international one, so we rebuilt the training shard to render both orders, roughly sixty/forty. The model with the anchor off responded exactly as you'd hope: international-order parsing climbed from 35.9% to 48.4%. The capability is learnable. Show the model both layouts and it reads both.

The model with the anchor on didn't move. International stayed stuck around 44%, with the anchor still dragging it below the anchor-off number. So we'd proven the corpus wasn't the ceiling, which is genuinely useful and was not the result we wanted.

Swing two. We noticed the international synth had been dropping the region from the tail while the eval included it, so the model was being asked to segment a City, Region Postcode ending it had never trained on. Reasonable suspect. We rendered the region back into the tail and retrained. The fix did exactly its job: international region accuracy went from zero to about forty percent, and the locality number we actually cared about did not budge. The tail wasn't the ceiling either.

Swing three was the architectural one, and it's the one we'd have bet on. If the anchor lands on the postcode and the postcode trails the city in international order, then by the time the city gets read the anchor is firing on the wrong side of it. Fine: inject the anchor a second time, at the very first token, where every locality can attend back to it no matter where the postcode ended up. A clean change, no new parameters, the zero-confidence case stays a perfect identity. We retrained.

It did nothing. International held at 43.7%, the anchor still underwater.

retrain                     native, anchor on    international, anchor on
both-order corpus           82.1%                44.5%
region in the tail          83.6%                44.7%
second anchor at token 0    83.5%                43.7%

Three swings, one number that would not move. At some point a column of results that flat stops being a series of failed fixes and starts being the finding itself.

What the anchor actually learned

Here's where it helps to stop asking "why won't it improve" and start asking what the thing in front of you is actually doing. We'd been describing the anchor as if it carried a meaning, "this postcode could be German," and meanings don't have a handedness. What we actually add to that one position is a vector, and the model spends all of training learning what to do with the nudge. What it learned to do, it turns out, has a direction baked into it.
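Mechanically, the nudge is small. The sketch below shows an additive injection on plain arrays, under stated assumptions: the anchor vector is treated as given (how it is derived from gazetteer membership is not shown), scaling by a confidence is our reading of "the zero-confidence case stays a perfect identity", and the function name is hypothetical.

```typescript
// hidden: one row of numbers per token. anchor: one vector of the same width.
// positions: token indices that receive the nudge (for example the postcode
// token, plus token 0 in the third retrain).
function injectAnchor(
  hidden: number[][],
  anchor: number[],
  confidence: number,
  positions: number[]
): number[][] {
  return hidden.map((row, tokenIndex) =>
    positions.indexOf(tokenIndex) >= 0
      ? row.map((value, d) => value + confidence * (anchor[d] ?? 0))
      : row.slice()
  );
}

const hiddenStates = [[1, 2], [3, 4], [5, 6]];
const anchorVector = [0.5, -0.5];

// Confidence 0: every row comes back unchanged.
console.log(injectAnchor(hiddenStates, anchorVector, 0, [2, 0]));
// Confidence 1: only rows 0 and 2 move, both by the same vector.
console.log(injectAnchor(hiddenStates, anchorVector, 1, [2, 0]));
```

Note what the code does not contain: any notion of left or right. The direction lives in what the downstream layers learn to do with the added vector, not in the injection itself.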

Think about where the city sits relative to the postcode in each training distribution. In native German, the postcode comes before the city: 12623 Berlin. Every time the anchor fired during training, the locality it was supposed to help was sitting just to its right. So the model learned an anchor that reaches rightward, and on native addresses it reaches right and finds Berlin every time, which is your +35 points. Hand that same model an international-order address and the postcode is now after the city. The anchor reaches right out of long habit, finds the region or the end of the string, and meanwhile the actual city it was meant to rescue is sitting behind it, unhelped and slightly shoved.

The clean confirmation was hiding in the data the whole time, in the one locale that never suffered. American addresses put the postcode after the city, Seattle WA 98101, and the US anchor never hurt anything; US held at 96 to 97%. Of course it did. US training is consistently postcode-after-city, so the US anchor learned to reach left, toward the city behind it, and it's right every time because the layout never varies. Same architecture, same injection point, opposite learned direction, because the two countries write their addresses in opposite orders and the anchor simply absorbed whichever one it was fed.

That's the asymmetry, and it's why it's fundamental rather than a tuning problem. A single added vector can encode "reach toward the city." It cannot encode "reach toward the city, which is sometimes to my left and sometimes to my right." Mix both orders into one shard and you're asking one direction to point two ways; it settles on the average and serves the dominant order, which is exactly the flat international number we kept retraining into. To check we weren't chasing a name-matching mirage, we ran a containment metric (does the resolved point land inside the right city's polygon?), and the gap held: 96% on native German, 57% on the international order. The miss is geographic and real, not a scoreboard artifact.

Accepting the asymmetry

When you've thrown corpus, tail, and architecture at a number and it hasn't twitched, the honest move is to stop calling it a bug. We brought the whole arc to our second-opinion model, the same one that talked us out of the doomed feature last time, and it made the call we'd been circling: accept the asymmetry, ship the native win.

The case is stronger than "we gave up." The native gain is large, it's stable across every retrain, and it generalizes; US and French held throughout. The international penalty is small and just as stable, and an international-order German address can route around the anchor entirely, since the model reads both orders fine on its own once it's seen them. You lose nothing real by switching the anchor off for the layout it was never going to help. So that's production: anchor on where the postcode leads the city, off where it trails it, and the +35 points kept exactly where they were earned.

The asymmetry doesn't kill the bigger plan either, which was the part worth keeping. If one vector can only ever point one way, then a cleverer single anchor was never going to save us. What we want is an anchor per locale, each one free to learn its own country's direction: the German anchor reaches right, the American one reaches left, and nobody is forced to average. That's a real week of work for another day, but it's a justified one now instead of a hopeful one, which is the same place the last postcode story left us standing.

The lesson, which is older than this anchor

What we'd missed, going in, is that a learned signal doesn't carry the meaning you named it after. It carries the geometry of the data you trained it on. We called the thing a "country anchor" and reasoned about it as if it knew a fact about a postcode, when what it had absorbed was a habit about where cities tend to sit. The name was a label we put on the outside; the direction was the thing inside, and the direction is what shipped.

So when you train a helper signal and it works beautifully on the distribution you built it against, the question to ask before you trust it somewhere new is what it actually learned the shape of, and whether that shape still holds one locale to the left. Ours didn't. The good news is it told us so in three clean retrains, and the better news is that the thing it learned, narrow as it is, is worth thirty-five points right where we'll keep it.

The map runs out before the country does

2026-06-05T00:00:00.000Z

We spent a good month teaching our resolver exactly one trick. Take a postcode, drop its centroid into the city polygon that happens to contain it, read off the city. It's a genuinely good trick. It got the Netherlands to 95% and Germany to 93%, and for a while it felt like the whole problem was going to fall to it. Then we pointed it at Japan, and Japan calmly informed us that it has no city polygons to drop anything into.

What follows is a two-country story about what a geocoder can still do when the map underneath it goes thin, and where it finally can't. Japan we resolved anyway, 94% of the way, by putting the polygon down and asking a different question. Korea handed the same problem back to us turned inside-out: it let us pin the coordinate perfectly, every time, and then stopped us cold at the one thing we were really after, which is the name of the place you've landed in.

Three questions sit under all of it, so let me put them on the table before we start:

What do you do when the gazetteer gives you points where you expected shapes?
Does the move that rescues Japan actually generalize, or did we get lucky once and dress it up as a method?
And the question with no comfortable answer: what happens when the map is simply missing the part of a country you most need to see?

Japan has no city polygons

Quick recap for anyone who missed the last Japan post: a few weeks ago we pulled Japan's address hierarchy out of Who's On First and learned that Japanese addresses run backwards and have no street names at all. This is the sequel, the one where we try to actually resolve them.

The European recipe is point-in-polygon, and it's about as simple as geocoding gets. A postcode comes with a centroid. Who's On First gives you administrative polygons. You ask which locality polygon contains the centroid, and that's your city. Clean, fast, and it carried four European locales without complaint.
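For readers who haven't written one, here is a minimal sketch of that recipe, assuming simple single-ring polygons and a ray-casting containment test. The two square "cities", their coordinates, and the function names are invented for illustration; the real WOF geometries are multi-ring and far messier.

```python
from typing import Optional

Point = tuple[float, float]   # (lon, lat)
Polygon = list[Point]         # one outer ring; the last vertex need not repeat the first


def point_in_polygon(point: Point, ring: Polygon) -> bool:
    """Ray casting: count how many edges a ray cast to the right crosses."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def resolve_locality(centroid: Point, localities: dict[str, Polygon]) -> Optional[str]:
    """Return the first locality whose polygon contains the postcode centroid."""
    for name, ring in localities.items():
        if point_in_polygon(centroid, ring):
            return name
    return None


# Hypothetical data: two square "cities".
LOCALITIES = {
    "Amsterdam": [(4.7, 52.3), (5.1, 52.3), (5.1, 52.5), (4.7, 52.5)],
    "Utrecht": [(5.0, 52.0), (5.2, 52.0), (5.2, 52.2), (5.0, 52.2)],
}

print(resolve_locality((4.9, 52.37), LOCALITIES))   # Amsterdam
print(resolve_locality((139.7, 35.7), LOCALITIES))  # None
```

The second call is the Japanese situation in miniature: a perfectly good centroid, and no shape anywhere that contains it.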

It gets Japan to 25%, and it took us an embarrassingly long while to see why, because the failure wears the costume of a tuning problem and is nothing of the sort. We went digging through WOF's Japanese geometry placetype by placetype, and the pattern repeats every time. The prefectures have polygons. The wards and sub-prefectures have polygons. The municipality — the 市区町村 level a postcode actually resolves to — is essentially all points. Not coarse polygons, not bad ones, just points: a latitude and a longitude with nothing to be inside of.

So point-in-polygon has nothing to contain, and no amount of fiddling with a containment test rescues a containment test when there are no containers. We checked Korea and Taiwan while we were down there, and they tell the identical story. The municipality layer across all three countries is dots on a map where Europe gave us regions. This is the shape of the whole problem, and it means the recipe we were so pleased with simply doesn't travel east.

You stop asking the polygon and start asking Japan Post

If you can't ask "which shape am I inside," you ask the postal authority something more direct: "what's the municipality for this postcode?" Then you go find that municipality in WOF by name. Japan Post publishes exactly that mapping in a file called KEN_ALL, and, crucially, a romanized edition whose municipality column reads SAPPORO SHI CHUO KU, in the same alphabet WOF uses for its romanized place names. Two romanized strings you can actually compare. That's the whole pivot.

Getting the file was its own small comedy. Every KEN_ALL download URL we had on record returned a 404. The replacements turned out to be gated behind JavaScript and a Japan-only fetch, so a plain script came home with a polite error page instead of data. And when the file finally arrived, it was CP932 (Shift-JIS) encoded, in the year 2026. We got there, and the file carries the one thing WOF's own postcode hierarchy refuses to give up: the municipality, where WOF stops at the prefecture and leaves you a hundred kilometres too coarse.

The matching had one wrinkle worth knowing about. Japanese municipalities don't sit in a single WOF placetype. A regular city lands in locality, a ward in county or localadmin, a Tokyo special ward in borough. Match against just one of those and you cap out around 55%. Search all of them at once and you get 94.3% of postcodes matched to a real municipality, with end-to-end resolution landing between 94 and 98% depending on which gold set you grade against. Comfortably past our 85% bar, and the European locales come out of the change byte-for-byte identical, because the new path only fires for the countries that need it.
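The shape of that wrinkle is easy to show with a toy. This is a sketch under loud assumptions: the name table, the IDs, and the function names are all hypothetical, and the real join runs against WOF's names table rather than a dict. Only the spread across placetypes mirrors what we described.

```python
# Hypothetical slice of WOF's romanized names, grouped by placetype.
# The IDs are made up; only the placetype spread mirrors the text.
WOF_NAMES = {
    "locality": {"SAPPORO SHI": 1, "OSAKA SHI": 2},
    "county": {"SAPPORO SHI CHUO KU": 3},
    "localadmin": {"OSAKA SHI KITA KU": 4},
    "borough": {"MINATO KU": 5},
}
ALL_PLACETYPES = ("locality", "county", "localadmin", "borough")


def normalize(name: str) -> str:
    """Uppercase and collapse whitespace so both sources compare equal."""
    return " ".join(name.upper().split())


def find_municipality(name, placetypes=ALL_PLACETYPES):
    """Return (placetype, wof_id) for the first placetype holding the name."""
    key = normalize(name)
    for placetype in placetypes:
        wof_id = WOF_NAMES.get(placetype, {}).get(key)
        if wof_id is not None:
            return placetype, wof_id
    return None


def match_rate(names, placetypes=ALL_PLACETYPES) -> float:
    """Share of municipality names that match in any of the given placetypes."""
    hits = sum(find_municipality(n, placetypes) is not None for n in names)
    return hits / len(names)


# Stand-in for the romanized municipality column of KEN_ALL.
ken_all_column = ["SAPPORO SHI CHUO KU", "Osaka  Shi", "MINATO KU", "OSAKA SHI KITA KU"]

print(match_rate(ken_all_column, ("locality",)))  # 0.25
print(match_rate(ken_all_column))                 # 1.0
```

The toy rates are not our real numbers. The point is only that the cap comes from where you look, not from how well you compare.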

The same strategy, a build shaped to the country

Here's the part I want to dwell on, because it's the part that decides whether any of this scales. The Japanese build feeds the exact same resolver strategy the European one does. All we wrote was a Japan build: a different way of filling in the one table the resolver already reads. The resolver itself never changed, never even noticed. Postcode in, locality out, the same code path Amsterdam runs through.

That's the bet the whole "rule engine" design rests on: one strategy, and a per-country table that each country gets to populate however its data allows. Japan populates it by asking its postal authority for names. The question we hadn't answered was whether a genuinely different country, with genuinely different data, could populate that same table without us bolting on a pile of special cases. Which is where Korea comes in, and where the story stops being a victory lap.

Korea, the same trick inverted

Korea's data is the mirror image of Japan's, so the build came out mirrored too.

Japan made us go fetch the names from a postal authority. Korea hands them over for free: the GeoNames postal file for Korea already carries, in one place, the postcode, the place name, the province, and a latitude and longitude. No saga, no Shift-JIS. The snag is that the names are in Hangul (추자면), while WOF's romanized spr.name for Korea is some transliteration that may or may not line up. Matching Hangul against a Latin-script romanization goes nowhere, and that's exactly why Korea sat on our "blocked" list for a while.

It turns out that read was half right and gave up one step early. WOF doesn't only keep the romanized name. Its names table also carries Hangul: 13,120 native entries, plus several thousand more filed under "undetermined language" that are Hangul all the same. So a Hangul-to-Hangul join is on the table after all. And because every Korean postcode arrives with a coordinate already attached, we could lead with the coordinate and treat the Hangul name as a second opinion. Korea's build is point-primary: take the postcode's coordinate, find the nearest WOF locality, confirm it by name where a name exists. That is a different first move from Japan's and the same table out the other end. And, the thing we were actually testing for, it took not one line of new resolver code.
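A minimal sketch of a point-primary lookup, assuming a flat list of localities and a brute-force nearest search. The coordinates, the second place name, and the function names are hypothetical; a real build would use a spatial index over all of WOF's Korean localities.

```python
import math


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) pairs, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


# Hypothetical WOF localities with invented coordinates.
WOF_LOCALITIES = [
    {"name": "Mung", "coord": (33.945, 126.320)},
    {"name": "한림읍", "coord": (33.410, 126.270)},
]


def resolve_point_primary(row, localities=WOF_LOCALITIES):
    """Lead with the coordinate; use the name only as a second opinion."""
    nearest = min(localities, key=lambda loc: haversine_km(row["coord"], loc["coord"]))
    return {
        "locality": nearest["name"],
        "distance_km": haversine_km(row["coord"], nearest["coord"]),
        "name_confirmed": nearest["name"] == row["place_name"],
    }


# Stand-ins for GeoNames postal rows (coordinates invented).
chuja = {"place_name": "추자면", "coord": (33.946, 126.321)}
hallim = {"place_name": "한림읍", "coord": (33.412, 126.268)}

print(resolve_point_primary(chuja)["name_confirmed"])   # False
print(resolve_point_primary(hallim)["name_confirmed"])  # True
```

Both rows get a nearby locality; only one gets its name confirmed. Keeping those as two separate fields is what lets the build report a coordinate tier and a name tier honestly.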

On the parts this build resolves, it is excellent. WOF's Korean locality layer is dense, 21,139 of them, near enough one per village, so the nearest locality to a postcode sits a median of 0.96 km away. The province falls out for free and exact: GeoNames' province name matches WOF's Korean region name 17 times out of 17. Hand us a Korean address and we'll put it in the right province on 100% of postcodes, and typically within a kilometre of the right spot. For a coarse fix, that's money in the bank.

Where the map runs out

Then you ask for the administrative name, and the floor gives way. The name confirms on 26% of Korean postcodes. Japan was at 94%. Same method, same care, less than a third of the hit rate, and the whole gap is a story about what the map happens to hold.

Two things go wrong, and both earn their names because they tell you where to dig. The first is a granularity mismatch. GeoNames names a postcode at the eup/myeon/dong level, 추자면, Chuja-myeon. WOF's locality layer is one rung finer, down at the hamlet, so the nearest point to that postcode is a village called "Mung" sitting inside Chuja-myeon. The coordinate is dead-on and the name belongs to a smaller, different place. Both sources are telling the truth about different rungs of the same ladder, and the ladder doesn't line up.

The second one is worse, and it's the one I'd lose sleep over. The single biggest bucket of misses is 구 (gu), the urban districts. Gangnam-gu. Haeundae-gu. The level that is the address for most of Seoul and Busan. WOF Korea doesn't carry those as named localities at all, so there is nothing on the map to confirm against. The single most address-dense slice of the country is the slice the gazetteer is thinnest on. You can have a method that works and a map that's blank exactly where the people are, and that is the honest ceiling on Korea today. No recipe tweak gets past a name that was never in the dataset.

A bug the verifier caught, and you should want it to

One detour, because it's the kind of mistake that ships quietly if you let it. The first version of the Korean build reported 56% name confirmation, and we were briefly delighted. Then we looked at the distances, and the "confirmed" matches were averaging 71 kilometres from the postcode, a few of them out past 500.

Korean place names repeat. A lot. Dozens of villages share a name up and down the country, and the matcher had been finding a name match anywhere in Korea and then taking the nearest copy, which can still be a province away. The fix is the same proximity leash Japan's build already wore: a name only counts as confirmation if the place it names also sits nearby. That pulled the number down to its honest 26% and the average distance back under five kilometres. One signal in a costume of two is worse than the one signal alone — the inflated 56% would have told us Korea was twice as solved as it is. Make your two signals genuinely agree, or don't get to call it two.
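Here is the bug and the leash side by side, as a sketch. The repeated village name, the coordinates, and the 10 km threshold are all assumptions made for the example; we haven't stated the real leash length here.

```python
import math


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) pairs, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


# Hypothetical: the same village name filed in two distant places.
PLACES = [
    {"name": "신촌리", "coord": (37.55, 126.94)},
    {"name": "신촌리", "coord": (35.10, 128.90)},
]


def confirm_by_name(place_name, coord, places, max_km=None):
    """Distance to the nearest same-named place, or None if unconfirmed.

    With max_km=None this reproduces the bug: any copy of the name,
    anywhere in the country, counts as confirmation.
    """
    distances = [haversine_km(coord, p["coord"]) for p in places if p["name"] == place_name]
    if not distances:
        return None
    nearest = min(distances)
    if max_km is not None and nearest > max_km:
        return None
    return nearest


postcode_coord = (36.35, 127.38)  # far from both copies

unleashed = confirm_by_name("신촌리", postcode_coord, PLACES)
leashed = confirm_by_name("신촌리", postcode_coord, PLACES, max_km=10)
print(unleashed is not None)  # True: "confirmed", more than 100 km away
print(leashed)                # None
```

The unleashed version is one signal (the name) reported as if it were two. The leash makes the name agree with the coordinate before it gets to count.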

What we keep, and what the map still owes us

So where does that leave the bet? The architecture held. A point-primary build and a name-primary build, two countries whose data shares almost nothing, both poured into the same resolver strategy with no new resolver code between them. The "less special" thing we wanted to prove, that this generalizes past one lucky locale, is proven. What it can't do is conjure place names that Who's On First was never handed.

So Korea ships as honest as it is: rock-solid on province and coordinate, explicit about a 26% name tier, marked experimental and kept out of the default bundle until the rest catches up. The catch-up has an address. Korea's road-name database, Juso, carries the gu and dong names natively. It's locked behind a government API key, so getting it is a deliberate acquisition, and it's next on the list to go fetch. Taiwan is one rung further back: there's no GeoNames postal file for it at all, a flat 404, so there isn't even a coordinate to begin with until we source one.

If there's a portable lesson in two countries' worth of this, it's that a geocoder is only ever as good as the map it stands on, and a map's favourite way to lie is to leave things out. Japan's map was missing shapes, and we could route around that. Korea's map is missing names, right where the cities are, and there's no routing around a blank. So before you tune a model or argue with a matcher, go look at what your reference data actually holds in the exact spot you care about most. The country is all still there. Whether the map admits it is a separate question, and it's usually the one that decides how far you get.

Does a postcode know what country it's in?

2026-06-03T00:00:00.000Z

We set out to fix a small wart in our address parser and came away with a number that told us to put the screwdriver down.

Here is the wart. When our postcode extractor sees a five-digit run and wants to know whether it's a real postcode or just a house number that happens to look like one, it peeks at the words sitting next to it and checks them against every country's street vocabulary we know — American, German, French, all at once. That "all at once" is fine at three countries. At twenty it gets loud, and a German street suffix starts shadowing an English word by sheer coincidence. So we went looking for the clean way to tell the extractor which country's words to bother with.

That question has a much bigger sibling, and chasing the sibling is where the story actually is.

The thing we actually wanted

Our resolver, the part that turns a parsed address into a point on Earth, takes a --default-country flag. You hand it US and it searches the American gazetteer; you hand it DE and it searches the German one. It works, and we hate it, because in production nobody hands you the country. The whole reason you're parsing the address is that you don't know where it is yet. A flag that makes you supply the answer up front is a flag that solves the easy half of the problem and leaves the hard half on the floor.

So here's the dream, and it's a good one. The postcode is the most information-dense token in an address: five or six characters that encode a routing hierarchy, a region, often a neighbourhood. We already extract it before the neural parser runs. What if the postcode just told the resolver which country to search? Delete the flag, let the address speak for itself, and as a bonus we'd have the locale signal the street-vocabulary check was asking for in the first place. One stone, several birds.

You can probably feel the shape of the questions piling up:

Where should that "which country" signal come from: the extractor, the resolver, the model?
Is the street-vocabulary blindness even a real problem, or a tidy-minded itch?
And the load-bearing one: is a postcode actually a strong enough signal to retire the flag?

We brought all three to a second opinion before touching anything.

A second opinion, and a sharper question

When a decision feels heavier than it looks, we run it past a second model (a different architecture, with no stake in our assumptions) and let it push back. This was one of those. Four turns in, it had stopped answering the question and started reframing it — and the reframe is the part worth keeping.

The street-vocabulary blindness, our second opinion argued, is a symptom wearing the costume of a bug. Conditioning that one helper on a locale would scratch the itch and teach us nothing. The actual gap underneath is that there is no single, early, reliable place where "which country is this" gets decided once and shared. We had three half-answers scattered across the codebase: the extractor computing a country posterior from the gazetteer, a rule-based stage guessing locale from the postcode's shape, the model's eventual learned guess. No one agreed which was the source of truth, or how they were supposed to relate. The blind helper was just the loose thread you could see.

That reframe pointed at a clean design, and I'll give you the one idea worth keeping: unify the data, not the modules. Every address system in our reference package already owns its own postcode shape. So the one new thing we built is the inverse of those shapes — a function that takes a postcode and asks every system at once, "is this yours?" A bare 68161 comes back [us, de, fr], because a five-digit shape genuinely belongs to all three. Both the extractor and the rule-based stage read from that one function instead of keeping their own divergent copies. Nobody calls anybody; they share a table. That's the part that scales.
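A minimal sketch of that inverse, assuming each address system exposes its postcode shape as a regular expression. The patterns and the name systems_for_postcode are illustrative simplifications, not the shapes in our reference package.

```python
import re

# Hypothetical, simplified postcode shapes, one per address system.
POSTCODE_SHAPES = {
    "us": re.compile(r"\d{5}(-\d{4})?"),
    "de": re.compile(r"\d{5}"),
    "fr": re.compile(r"\d{5}"),
    "gb": re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}"),
}


def systems_for_postcode(postcode: str) -> list[str]:
    """Ask every system at once: is this shape yours?"""
    token = postcode.strip().upper()
    return [system for system, shape in POSTCODE_SHAPES.items() if shape.fullmatch(token)]


print(systems_for_postcode("68161"))       # ['us', 'de', 'fr']
print(systems_for_postcode("68161-1234"))  # ['us']
print(systems_for_postcode("SW1A 1AA"))    # ['gb']
```

Note what this answers and what it doesn't: shape membership says which systems could own a code, not which one does. That gap is the thing the probe in the next section measures.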

The rest of the design followed from there: a small fused "locale prior" object, and a clean rule that the resolver always takes that prior's shape while the thing producing it can be swapped (a cheap pre-pass today, the trained model later). It's tidy. It's the kind of architecture you sketch on a whiteboard and feel good about.

And then, before building a line of it, we did the thing we should always do and rarely want to: we tried to kill it.

Measure before you build

The whole edifice rests on one assumption: that the postcode is present and unambiguous often enough to carry the country on its own. That's testable today, on real addresses, with no model and no new code beyond a probe. So we wrote the probe: take a thousand-plus real US addresses and a thousand-plus German ones, extract the postcode, resolve it against the gazetteer, and ask how confidently it names a single country.

The postcode is present every time. OpenAddresses is postcode-rich; an anchor fired on 100% of rows. That part of the dream survives.

Here's the part that doesn't.

	US	DE
postcode present	100%	100%
names one country, confidently	27.9%	44.1%

A US postcode pins its own country a little over a quarter of the time. A German one, not quite half. The rest of the time the strongest signal in the address shrugs and offers you a menu.

The reason is the most ordinary thing in the world: a five-digit code is five digits in a lot of places. 75001 is the first arrondissement of Paris. It is also Addison, Texas. The gazetteer, asked in good faith, reports both, and a uniform posterior over {FR, US} is an honest answer to a question the postcode simply cannot settle. Same script, same length, two continents. Multiply that across every numeric-postcode country and the confident cases are the minority.

(One trap worth flagging, since I nearly fell in it: an early version of the probe looked far rosier because of an alphabetical tie-break. When the posterior is a flat {DE, US}, "DE" sorts first and quietly wins, so the German numbers looked almost perfect. They were an artifact of the sort order, not the signal. The honest reading is the confident-single-country rate above, and only that.)
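The trap fits in a few lines. This is a sketch, assuming a uniform posterior over whichever countries the gazetteer reports; the function names and the 0.9 confidence threshold are hypothetical choices for the example, not the probe's actual settings.

```python
def uniform_posterior(countries):
    """Spread belief evenly over every country the gazetteer reports."""
    return {country: 1.0 / len(countries) for country in countries}


def alphabetical_top(posterior):
    """The rosy version: on a tie, max() keeps the first key in sorted order."""
    return max(sorted(posterior), key=posterior.get)


def confident_country(posterior, threshold=0.9):
    """The honest version: name a country only when one clearly dominates."""
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= threshold else None


flat = uniform_posterior(["US", "DE"])
print(alphabetical_top(flat))                        # DE
print(confident_country(flat))                       # None
print(confident_country(uniform_posterior(["US"])))  # US
```

Score German addresses with alphabetical_top and every flat tie counts as a hit. Score them with confident_country and the ties show up as what they are.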

What the number was actually telling us

A weak result is still a clue, so it's worth being precise about what it ruled out and what it confirmed.

It ruled out the bonus. An extractor-only locale prior cannot retire the --default-country flag, because more than half the time it would hand the resolver a coin-flip, and a coin-flip is worse than a default. The clean PR we'd sketched would have failed its own acceptance test. We just hadn't written it yet, which is the entire return on running the probe first.

What it confirmed is the more interesting half, and it's something our own design document had asserted on faith months ago: figuring out the country is most of what parsing an address is. If the single most information-dense token settles the question well under half the time (27.9% for US, 44.1% for German), then the rest of the answer has to come from everything around it: the city, the street, the order the pieces arrive in. You can't get that from a regex run before the model; you get it from the model itself, reading the whole string at once and conditioning its own decisions on what it infers. The number didn't break the plan. It told us which layer the country actually lives in, and that layer is the expensive one.

What shipped, and what we left alone

So we shipped the piece that survived contact with the evidence. The street-vocabulary check is now gated by the postcode's real gazetteer membership: a US-only ZIP consults the American vocabulary and never asks the German one, because there's nothing German about it. An unrelated language's words can no longer down-weight a code that was never theirs. It scales to twenty countries cleanly, the resolver evals come out byte-identical to before (a precision change you can't see on a clean sample is exactly the change you want), and the shared inverse-shape function is now in place for whatever reads it next.

And we left the flag alone, on purpose, with a number to point at. --default-country stays until the country signal comes from where the evidence says it has to: the trained model, conditioning on the full address. That's a heavier piece of work, and now it's a justified one rather than a hopeful one.

The cheaper lesson is the one I'd actually press on you. We came within one satisfying afternoon of building a clean, well-argued, doomed feature. What stopped us wasn't taste or a code review — it was a few hours of measurement aimed squarely at the assumption everything else rested on. Find the load-bearing assumption in whatever you're about to build, and go try to break it before you write the part that depends on it. The probe that saves you a week looks, going in, exactly like the probe that wastes you an afternoon. Run it anyway.

Our parser fails 80% of our own tests. We shipped it anyway.

2026-05-31T00:00:00.000Z

Our neural address parser passes 20.7% of our test suite. The rule-based parser it's meant to replace passes 93.7%. By that scoreboard, we should delete the neural model and go home.

We shipped the neural model instead. Here's why both numbers are true — and why the one that matters says the opposite.

Two parsers, one bench

Mailwoman carries two address parsers. v0 is a hand-written rule engine — a TypeScript port of the Pelias parser, all regexes and dictionaries and heuristics. The other is a 29M-parameter encoder-only transformer that tags each token (street, locality, postcode, …) and was trained on synthetic and real corpora. The whole bet of the neural model is that it generalizes to messy real-world input where rules brittle-fail.

To check the bet, we run both through the same 415-assertion test suite. The rules parser wins in a landslide: 93.7% to 20.7%.

The catch: the bench was built by the opponent

Look one level down, at the per-file results, and something jumps out: v0 passes 100% of every functional file. Not 99%. Every single one.

That's not skill — it's lineage. Every one of those 415 assertions was ported from the Pelias and addressit test suites, and v0 is our port of Pelias, so the suite is grading a parser against its own author's answer key. It cannot, even in principle, catch v0 being wrong, because v0's output is the definition of correct.

So "neural scores 20.7%" measures one thing: how often neural disagrees with Pelias's exact conventions — where to split a multi-word street, where a venue ends and a locality begins, the dozens of micro-decisions addressit happened to encode. It says nothing about how often neural is wrong. Useful as a regression gate (did a retrain break something we used to match?); useless as a verdict on which parser is better.

Decomposing the 20%

To judge quality fairly we need benches drawn from outside the Pelias lineage. We score both parsers on three:

arena	what it is	n	v0	neural
libpostal	clean, canonical strings	69	29%	16%
perturb	noisy, abbreviated, reordered	398	39%	61%
postal	edge formats (PO box, military…)	38	26%	8%

Three different stories:

Clean input → rules win. Canonical strings are exactly what hand-tuned regexes are for. The harness consists entirely of this kind of input (all canonical, all Pelias-convention), which is why neural looks worst there.
Messy input → neural wins, decisively (61% vs 39%) — and this is the biggest bench by far (398 cases), built by perturbing real addresses: dropped commas, abbreviations, reordering, weird casing. It's the closest proxy we have to what people actually type, and it's the whole reason the neural model exists.
Edge formats → both are bad. PO boxes, military APO/FPO, and rural routes are 0% for both parsers. Neither was built for them.

The scoreboard that matters

A geocoder's job is to put a real address on the map. So the honest test is end-to-end: take 10,000 real US addresses with real government coordinates, run each parser through the same resolver, and ask which one lands on the right city.

parser	locality match (10k real addresses)
neural	97.3%
v0 (Pelias)	95.8%

On the metric that matches the product — real addresses, end to end — the neural parser beats the rules parser. The 20.7% and the 97.3% are measuring two completely different things: agreement with Pelias's answer key, versus getting real addresses right.

The lesson

If you port your test suite from the system you're trying to beat, that system scores 100% by construction, and your challenger will always look broken. The suite is doing its job: faithfully measuring agreement with the incumbent. Just don't mistake that for a measure of quality.

Measure on the distribution you actually serve. For us that's messy, abbreviated, real-world addresses — and there, the learned model is ahead.

The full breakdown is in the v0.7–v0.8 retrospective: every arena, the genuine neural deficits (it does truncate Belle Fourche to Belle), the masked-LM pre-training experiment that turned into a clean negative result, and what's next (street-level geometry, to go from "right city" to "right spot").

The model that never saw an intersection

2026-05-29T00:00:00.000Z

We spent a night trying to make our neural address parser less cocky. We ended it having learned something more useful. The model wasn't cocky — it was uninformed. It had never been shown whole categories of address.

This is the story of chasing the wrong number, and the diagnostics that pointed at the right one.

The hypothesis: it's overconfident

Across the v0.6.x training cycle, one pattern kept surfacing: when the model was wrong, it was confidently wrong. On a held-out test set, 86% of its incorrect predictions were made at ≥0.9 confidence — and most of those at a flat 1.00. A model that hedged appropriately would, we reasoned, stop steamrolling good answers with bad high-confidence ones.

The standard tool for that is label smoothing: instead of training toward a one-hot target (1.0 for the right tag, 0 for the rest), you train toward something softer (0.9 / spread-the-rest). It caps how peaked the model's outputs can get. So we ran a clean, single-variable experiment (the v0.6.0 recipe plus label_smoothing=0.1, nothing else changed) and measured.
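To make the softened target concrete, here is a small sketch of one common formulation, where the smoothing mass is spread uniformly over all classes, so the true class ends up slightly above 0.9. The four-class setup and the two example predictions are invented; this is not our training code.

```python
import math


def smooth_targets(true_index, num_classes, smoothing=0.1):
    """Mix the one-hot target with a uniform distribution over all classes."""
    share = smoothing / num_classes
    target = [share] * num_classes
    target[true_index] += 1.0 - smoothing
    return target


def cross_entropy(probs, target):
    """Cross-entropy of predicted probabilities against a (soft) target."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))


target = smooth_targets(true_index=0, num_classes=4)

peaked = [0.997, 0.001, 0.001, 0.001]  # near-certain prediction
softer = [0.925, 0.025, 0.025, 0.025]  # prediction that matches the soft target

print(cross_entropy(peaked, target) > cross_entropy(softer, target))  # True
```

Under the smoothed target the near-certain prediction pays a higher loss than the softer one, even though both pick the right tag. That penalty is the whole mechanism: it discourages outputs pinned at 1.00, and does nothing else.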

It worked, exactly as advertised. Overconfidence-on-wrong dropped 86% → 67%; the mass at 1.00 confidence vanished, capped around 0.95. Postcode recall even ticked up.

And the metric we actually ship on — harness pass rate — didn't move. 14.6% → 13.8%. If anything, slightly down. Two tags (house numbers, streets) regressed.

Following the evidence

A well-calibrated model that's no better at the job is a clue, not a victory. So instead of tuning the smoothing knob again, we asked a blunter question: of everything the harness gets wrong, what kind of wrong is it?

We categorized every failure. The answer reframed the whole project:

55% of the gap was missing labels — the model emitted no tag at all where one belonged. Not a wrong value, not a fuzzy boundary. Silence.
The most-missed tags were street (×197) and house_number (×100).
One cluster stood out: intersections — addresses like Broadway & W 42nd St. They're 17% of our harness, and the model scored 0% on them.

Calibration softens the confidence of labels the model does emit. It is structurally incapable of conjuring a label the model never produces. That's why it left the harness flat: we'd been sharpening the model's aim at targets it wasn't even shooting at.
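The categorization itself is not sophisticated. A sketch of the idea, assuming each harness case reduces to an expected and a predicted tag-to-value mapping; the three category names and the sample cases are made up for illustration.

```python
from collections import Counter


def classify_failures(expected, predicted):
    """Label each disagreement as missing, wrong, or spurious."""
    failures = []
    for tag, value in expected.items():
        if tag not in predicted:
            failures.append(("missing_label", tag))   # silence where a tag belonged
        elif predicted[tag] != value:
            failures.append(("wrong_value", tag))     # tag emitted, value off
    failures.extend(("spurious_label", tag) for tag in predicted if tag not in expected)
    return failures


# Hypothetical harness cases: (expected, predicted).
CASES = [
    ({"house_number": "12", "street": "Main St"}, {"house_number": "12"}),
    ({"street": "Elm St", "locality": "Austin"}, {"street": "Elm", "locality": "Austin"}),
    ({"intersection_a": "Broadway", "intersection_b": "W 42nd St"}, {}),
]

by_kind = Counter()
missing_by_tag = Counter()
for expected, predicted in CASES:
    for kind, tag in classify_failures(expected, predicted):
        by_kind[kind] += 1
        if kind == "missing_label":
            missing_by_tag[tag] += 1

print(by_kind["missing_label"], by_kind["wrong_value"])  # 3 1
```

Splitting "missing" from "wrong" is the part that matters. A single accuracy number folds them together, and they call for opposite fixes.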

The probe

We ran a single probe on a canonical intersection. For every token in Broadway & W 42nd St, we read off the probability the model assigned to the intersection_a / intersection_b tags.

The maximum, across every token, was ~0.0001.

Uncertainty doesn't look like that. A model that's merely unsure still puts some probability on the right tag; ~0.0001 means the model has no representation of intersections whatsoever. The labels existed in its output vocabulary; it had simply never learned to use them.

Why? We checked the corpus pipeline. There are synthesizers for streets, no-street venues, PO boxes, house+venue combinations… and nothing that generates intersections. The real-world adapters don't emit them in that form either. The training signal for intersections was, to a very good approximation, zero. The model never saw one — so it never learned one. No loss function, no calibration trick, no bigger model recovers a category that isn't in the data.

A different coverage gap, a different fix

Calibration's one genuine win (a small postcode bump) pointed at a second coverage story, this one about tokenization.

Alphanumeric postcodes (SW1A 1AA, M5V 2T6) get shredded by the subword tokenizer into fragments like ["S","##W","##1","##A", "1","##AA"]. The seven-character shape a regex would trivially recognize is invisible to a model reasoning over disconnected pieces. The result: GB/CA/NL postcodes at 0%.

Here the fix wasn't training at all. A deterministic regex repair runs after the model decodes: detect a postcode-shaped substring, and snap the label span to it. On the postcode harness that single pass fixed 135 cases and regressed zero, taking GB/CA/DE/PT to 100%. Sometimes the right tool is a retrain. Sometimes it's eight lines of pattern-matching and a careful "longest-match-wins" rule so a US ZIP+4 doesn't get mistaken for a Dutch postcode.
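A sketch of what such a repair pass might look like, assuming character-offset label spans and a handful of simplified postcode shapes. The patterns, the span format, and the contrived "1234 US" collision are our illustration here, not the shipped rule set.

```python
import re

# Hypothetical, simplified postcode shapes.
SHAPES = {
    "us_zip4": re.compile(r"\b\d{5}-\d{4}\b"),
    "us_zip": re.compile(r"\b\d{5}\b"),
    "nl": re.compile(r"\b\d{4} ?[A-Z]{2}\b", re.I),
    "gb": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b", re.I),
    "ca": re.compile(r"\b[A-Z]\d[A-Z] ?\d[A-Z]\d\b", re.I),
}


def find_postcode_span(text):
    """Return (start, end, shape_name) of the longest postcode-shaped match."""
    best = None
    for name, shape in SHAPES.items():
        for m in shape.finditer(text):
            if best is None or (m.end() - m.start()) > (best[1] - best[0]):
                best = (m.start(), m.end(), name)
    return best


def repair(text, spans):
    """Snap the postcode label to the regex span, dropping overlapping labels."""
    found = find_postcode_span(text)
    if found is None:
        return spans
    start, end, _ = found
    kept = [s for s in spans if s[2] != "postcode" and (s[1] <= start or s[0] >= end)]
    return sorted(kept + [(start, end, "postcode")])


# "1234 US" is shaped like a Dutch postcode; the longer ZIP+4 match wins.
us_text = "Springfield IL 62704-1234 US"
print(find_postcode_span(us_text)[2])  # us_zip4

# The model labeled only the first fragment of a shredded GB postcode.
gb_text = "10 Downing St London SW1A 2AA"
model_spans = [(0, 2, "house_number"), (3, 13, "street"), (21, 25, "postcode")]
print(repair(gb_text, model_spans)[-1])  # (21, 29, 'postcode')
```

The repair never invents a label from the model's side; it only moves the postcode span onto text that a deterministic pattern already vouches for.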

What we actually learned

A few lessons we're keeping:

Pick a metric that can't be gamed by the thing you're optimizing. Per-tag F1 looked fine while the product was stuck; harness pass rate (does the whole address come out right?) told the truth.
A confident-wrong model and an ignorant model need opposite fixes. We assumed the former; the data showed the latter. Calibration for one, coverage for the other.
Structural validity is its own signal. A checker that flags incoherent parses — a house number with no street, an orphaned unit — caught a mid-training regression that the headline accuracy number completely hid.
You can't learn what you never see. The most expensive-sounding problem of the night had the cheapest root cause: a missing synthesizer.
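The structural check in the third lesson can be as small as this. It's a sketch with two invented rules; the real checker's rule list and tag names may differ.

```python
def structural_issues(tags):
    """Flag parses whose tag set is incoherent on its face."""
    present = set(tags)
    issues = []
    if "house_number" in present and "street" not in present:
        issues.append("house_number without street")
    if "unit" in present and "street" not in present:
        issues.append("orphaned unit")
    return issues


def incoherent_rate(parses):
    """Share of parses with at least one structural issue."""
    return sum(bool(structural_issues(p)) for p in parses) / len(parses)


# Hypothetical parses, reduced to the tags they emitted.
PARSES = [
    ["house_number", "street", "locality"],
    ["house_number", "locality"],
    ["unit", "locality", "postcode"],
    ["street", "locality"],
]

print(incoherent_rate(PARSES))  # 0.5
```

It needs no gold labels at all, which is why it can run on every checkpoint and catch a regression that an accuracy average hides.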

So the real fix for intersections is mundane: a couple thousand synthetic X & Y St examples, labeled and dropped into the corpus as a small targeted supplement, plus a retrain that finally gives the model something to learn from. That run is training as we publish this.
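A synthesizer for this doesn't need to be clever. Here is a rough sketch, assuming word-level labels, an "O" tag for the connector, and a tiny invented street list; only the intersection_a / intersection_b tag names come from our label set.

```python
import random

# Hypothetical inventory; a real synthesizer would draw from gazetteer street names.
STREETS = ["Broadway", "W 42nd St", "Main St", "5th Ave", "Elm St"]
CONNECTORS = ["&", "and", "at", "@"]


def label_intersection(street_a, connector, street_b):
    """Build parallel token and label lists for one 'X & Y St' example."""
    tokens, labels = [], []
    for word in street_a.split():
        tokens.append(word)
        labels.append("intersection_a")
    tokens.append(connector)
    labels.append("O")  # assumption: the connector itself carries no tag
    for word in street_b.split():
        tokens.append(word)
        labels.append("intersection_b")
    return tokens, labels


def synth_intersection(rng):
    """Sample two distinct streets and a connector."""
    street_a, street_b = rng.sample(STREETS, 2)
    return label_intersection(street_a, rng.choice(CONNECTORS), street_b)


rng = random.Random(0)  # seeded so the supplement is reproducible
examples = [synth_intersection(rng) for _ in range(2000)]

print(label_intersection("Broadway", "&", "W 42nd St"))
```

Varying the connector matters as much as varying the streets; otherwise the model learns that "&" means intersection and nothing about the structure around it.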

We'll report what the model does once it has, for the first time, actually seen an intersection.

Zero byte-fallback: a multi-script tokenizer from WOF-earth

2026-05-28T00:00:00.000Z

The v0.5.0-a1 tokenizer had a dirty secret: it was trained exclusively on US and French addresses. When it encountered Chinese, Japanese, Korean, Thai, or Arabic text, it fell back to encoding individual bytes, for anywhere from 41% to 75% of tokens in CJK scripts. Every byte-fallback token is a lost opportunity for the model to learn meaningful subword patterns.

Today we fixed that.

The data

Who's On First ships one GitHub repo per country, each containing GeoJSON files for every administrative place. Every place carries localized name variants — "New York" has a name:zho of "纽约", a name:jpn of "ニューヨーク", a name:kor of "뉴욕", and dozens more.

We cloned 7 priority countries (US, FR, JP, CN, KR, DE, GB) — 1.74 million GeoJSON files — and built them into a unified SQLite database using our WAL + Freeze pipeline:

Country	GeoJSON files	Time
CN	680K	-
US	449K	-
FR	231K	-
DE	189K	-
GB	73K	-
JP	63K	-
KR	54K	-
Total	1.74M	3 min

The result: 1.29 million places with 10.2 million name variants in 20+ languages. 768K Chinese names, 184K Japanese, 264K French, 261K German, 285K Arabic.

The tokenizer

We extracted a balanced multi-script training set (2.19M lines) from the global WOF names table, shuffled across script groups:

500K Latin (English, French, German, Spanish, ...)
500K Chinese
468K Cyrillic (Russian, Ukrainian, ...)
285K Arabic
183K Japanese
94K Korean
160K other (Thai, Hindi, Hebrew, Greek, ...)
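
A minimal sketch of the capping-and-shuffling step, assuming the names have already been bucketed by script group and that the two 500K entries reflect a per-group cap. The function name `balanced_sample`, the `cap` parameter, and the fixed seed are illustrative, not the project's extraction script.

```python
import random


def balanced_sample(groups: dict[str, list[str]], cap: int, seed: int = 0) -> list[str]:
    """Take at most `cap` lines per script group, then shuffle across groups."""
    rng = random.Random(seed)
    lines: list[str] = []
    for name in sorted(groups):  # sorted keys keep the result reproducible
        bucket = groups[name]
        # Large groups are down-sampled; small groups are kept whole.
        picked = rng.sample(bucket, cap) if len(bucket) > cap else list(bucket)
        lines.extend(picked)
    rng.shuffle(lines)  # interleave scripts so no group arrives as one block
    return lines
```

Capping the large groups while keeping the small ones whole is what stops Latin and Chinese from crowding Korean or Thai out of the vocabulary.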

SentencePiece trained in 28 seconds. Same 48K vocab size as before, same user-defined symbols (US state abbreviations, postcode formats). The difference: the vocab now allocates subword pieces for CJK characters, Hangul syllables, Thai consonant clusters, and Arabic word fragments, instead of spending those slots on the Latin-only subwords that the old training data favored.

The result

Script	v0.5.0-a1 (old)	v0.6.0-a0 (new)
Chinese	50-75% byte-fallback	0%
Japanese	58-60%	0%
Korean	41%	0%
Thai	30%	0%
Arabic	0%	0%
Latin	0%	0%
Aggregate	36.6%	0.0%

Issue #120 targeted less than 5% byte-fallback. We hit zero.

The tokenizer also produces fewer pieces per input. "北京市朝阳区建国路79号" (Beijing address) went from 19 pieces (63% byte-fallback) to 11 pieces (0% byte-fallback). That means more of the 128-token sequence budget is available for actual content instead of being consumed by byte encoding.
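
For readers who want to measure this on their own tokenizer, here is a small sketch of the byte-fallback metric. It relies on SentencePiece spelling byte pieces as `<0xNN>`; the helper name and the example piece list are illustrative, and the list is not the real tokenization of the Beijing address.

```python
import re

# SentencePiece spells byte-fallback pieces as <0x00> through <0xFF>.
BYTE_PIECE = re.compile(r"^<0x[0-9A-Fa-f]{2}>$")


def byte_fallback_rate(pieces: list[str]) -> float:
    """Fraction of pieces that are raw-byte fallbacks rather than learned subwords."""
    if not pieces:
        return 0.0
    return sum(1 for p in pieces if BYTE_PIECE.match(p)) / len(pieces)


# Made-up split: 7 learned pieces plus 12 byte pieces, 19 in total.
example = ["▁", "北", "京", "市", "79", "号", "区"] + ["<0xE6>"] * 12
print(f"{byte_fallback_rate(example):.0%}")
```

Twelve byte pieces out of 19 is one split that rounds to the 63% quoted above; the real piece boundaries may differ.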

What's training

v0.5.4 is now running on a Modal A100 with the new tokenizer. It uses the v0.5.1 proven recipe (the one that achieved 0.638 F1) but with the multi-script tokenizer. If the model learns CJK address patterns as well as it learns Latin ones, this is the foundation for JP/CN/KR locale support.

The pipeline

The global WOF build pipeline follows the WAL + Freeze design brief:

Enumerate: glob **/data/**/*.geojson across all country repos
Ingest: WAL mode, parallel file reads (asyncParallelIterator), single-thread writer, batched transactions
Freeze: WAL checkpoint, journal_mode=DELETE, create indexes, ANALYZE, VACUUM INTO

The frozen artifact is a clean 1.09 GB SQLite with no sidecars, verified read-only, integrity-checked. It's available for download from the Hugging Face bucket.
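
The ingest and freeze stages can be sketched with Python's built-in `sqlite3` module. This is an approximation under stated assumptions, not the project's pipeline: the `names` table, its columns, and the function names are hypothetical, and the parallel file readers are left out so that only the single-writer side is shown.

```python
import sqlite3
from pathlib import Path


def _write_batch(conn: sqlite3.Connection, batch: list[tuple[int, str]]) -> None:
    conn.execute("BEGIN")
    conn.executemany("INSERT INTO names VALUES (?, ?)", batch)
    conn.execute("COMMIT")


def build_and_freeze(rows, work_db: Path, frozen_db: Path, batch_size: int = 1000) -> None:
    """Ingest (place_id, name) rows in WAL mode, then freeze into a sidecar-free file."""
    conn = sqlite3.connect(work_db, isolation_level=None)  # autocommit; explicit BEGIN/COMMIT
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE names (place_id INTEGER, name TEXT)")

    # Ingest: one writer, committing in batches.
    batch: list[tuple[int, str]] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            _write_batch(conn, batch)
            batch = []
    if batch:
        _write_batch(conn, batch)

    # Freeze: fold the WAL back in, leave WAL mode, index, analyze, compact.
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    conn.execute("PRAGMA journal_mode=DELETE")
    conn.execute("CREATE INDEX idx_names_place ON names(place_id)")
    conn.execute("ANALYZE")
    conn.execute("VACUUM INTO ?", (str(frozen_db),))  # needs SQLite 3.27+; target must not exist
    conn.close()


def verify_frozen(frozen_db: Path) -> bool:
    """Open the artifact read-only and run SQLite's integrity check."""
    conn = sqlite3.connect(frozen_db.resolve().as_uri() + "?mode=ro", uri=True)
    try:
        return conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    finally:
        conn.close()
```

Writing the compacted copy with `VACUUM INTO` means the working database and its WAL files never have to be shipped; only the frozen file leaves the build machine.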

Why Japanese addresses break Western parsers

2026-05-28T00:00:00.000Z

In Tokyo, the address of Tokyo Tower is 〒105-0011 東京都港区芝公園4-2-8.

In English: "4-2-8 Shibakōen, Minato City, Tokyo 105-0011".

The Japanese form runs in the reverse order of the English form: broadest unit first, narrowest last. The prefecture (都道府県) comes first, then the city or ward (市区町村), then a district (丁目) and a block-number-style locator. There's no street name, just a grid.

This is why every rule-based address parser written for Western addresses breaks on Japan.

The hierarchy

Who's On First ships Japan's admin hierarchy as one repo with 62,896 GeoJSON files. After pulling it into our unified SQLite, the placetype distribution looks like this:

Placetype (English)	Japanese	Count
country	国	1
region (prefecture)	都道府県	47
county (city)	郡	2,287
locality (ward/town)	市区町村	43,886
neighbourhood (chome)	丁目	7,736

47 prefectures. The whole country. Every chome (city block district) tagged with a name like １丁目 (1-chome), ２丁目 (2-chome).

Reversed ordering

Western address: [house_number] [street] [unit?], [locality], [region] [postcode].

Japanese address: 〒[postcode]? [region][locality][chome][block]-[sub-block]-[house_number].

The order matters for parsers because we use position as a feature. A model trained on "1600 Pennsylvania Avenue NW, Washington, DC 20500" expects digits at the start, region near the end. A Japanese address inverts this entirely. Walking the parent chain in the WOF database confirms the inversion:

neighbourhood   jpn=１丁目      eng=１丁目
locality        jpn=世田谷区     eng=Setagaya
county          jpn=世田谷区     eng=Setagaya
region          jpn=東京        eng=Tokyo
country         jpn=日本        eng=Japan

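To synthesize a JP address you concatenate the parent chain from the broadest unit down (the reverse of the listing above), leaving out the country and the repeated county name: 東京 + 世田谷区 + １丁目 → 東京世田谷区１丁目.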
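
A minimal sketch of that concatenation, using the chain listed above. The function name is hypothetical, and the two skip rules (drop the country, drop a name that repeats its parent) are inferred from the example rather than taken from the adapter's source.

```python
def synthesize_jp(chain: list[tuple[str, str]]) -> str:
    """Join a WOF parent chain (narrowest first) into a JP-style address string.

    `chain` holds (placetype, japanese_name) pairs in the order shown in the listing.
    """
    parts: list[str] = []
    for placetype, name in reversed(chain):  # broadest unit first
        if placetype == "country":
            continue  # assumed: domestic addresses omit the country
        if parts and parts[-1] == name:
            continue  # county and locality can carry the same name
        parts.append(name)
    return "".join(parts)


chain = [
    ("neighbourhood", "１丁目"),
    ("locality", "世田谷区"),
    ("county", "世田谷区"),
    ("region", "東京"),
    ("country", "日本"),
]
print(synthesize_jp(chain))  # 東京世田谷区１丁目
```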

No street names

Western addresses identify locations by street + number. "1600 Pennsylvania Avenue NW" picks a specific building because Pennsylvania Avenue is a known line and 1600 is a known offset along that line.

Japan uses block addressing instead. Read 4-2-8 in 芝公園 as chome 4, block 2, building 8 within the 芝公園 district. There's no "芝公園 street" for the number to sit on; the grid is the addressing primitive, not the line.

Implications for the parser:

street_prefix and street_suffix don't apply (no street).
house_number becomes a hyphenated triple: 4-2-8.
The "丁目" suffix is a categorical marker, not a street type.

For now we map chome to dependent_locality since it's the closest existing tag. A proper JP locale would introduce block and sub_block tags per the schema in core/types/component.ts (declared but unused until JP ships).

Prefix postcode

Japanese addresses prefix the postcode with 〒, the postal mark. Format: 〒NNN-NNNN. Examples:

〒100-0005 — Tokyo Marunouchi
〒530-0001 — Osaka Umeda
〒810-0001 — Fukuoka Tenjin

A parser needs to read 〒 as a categorical marker: the postal mark that flags the following 7 digits + dash as a postcode. SentencePiece tokenizes 〒 as a separate piece. Our new v0.6.0-a0 multi-script tokenizer handles this cleanly (0% byte-fallback on the 〒 character).
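
As a rule-side illustration of the same marker, here is a small regex sketch that peels a leading postcode off a string. It assumes ASCII digits and an ASCII hyphen; full-width digits would need normalizing first. The function name is hypothetical, and this is not how the neural parser handles 〒.

```python
import re
from typing import Optional

# Optional postal mark, then NNN-NNNN, anchored at the start of the string.
JP_POSTCODE = re.compile(r"(?:〒\s*)?([0-9]{3})-([0-9]{4})")


def split_jp_postcode(text: str) -> tuple[Optional[str], str]:
    """Return (postcode, remainder) when the text starts with a JP postcode."""
    m = JP_POSTCODE.match(text)
    if m is None:
        return None, text
    return f"{m.group(1)}-{m.group(2)}", text[m.end():].strip()


print(split_jp_postcode("〒105-0011 東京都港区芝公園4-2-8"))
```

The trailing 4-2-8 is left alone because the pattern only matches at the start and requires the 3+4 digit shape.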

What we shipped today

The wof-admin-jp adapter prototype walks the WOF parent chain for every 丁目 in the Japanese repo and synthesizes a training row. Output:

{
	"raw": "東京港区芝公園",
	"components": {
		"region": "東京",
		"locality": "港区",
		"dependent_locality": "芝公園",
		"country": "JP"
	}
}

6,373 rows from 47 prefectures and 269 localities — that's training data we didn't have yesterday. Top prefectures by row count:

Prefecture	Rows
東京 (Tokyo)	2,251
神奈川 (Kanagawa)	888
大阪 (Osaka)	460
千葉 (Chiba)	380
埼玉 (Saitama)	263

Tokyo dominates because of its density of named neighborhoods — every chome of every ward is tagged. Smaller prefectures have fewer registered neighborhoods.

What's still missing

Real JP addresses include house numbers (4-2-8) which WOF doesn't track. To complete a Stage 3 JP corpus we need a separate source — the MLIT national address database or JapanPost postcode CSVs. Both are public.

Once those land, the JP corpus becomes a 100K+ row source with full Stage 3 + Phase 6 tags (block, sub_block, house_number). v0.6.0 trains on US/FR. v0.7.0 could ship JP if the data pipeline holds.

Schema readiness

The infrastructure is already in place. core/types/component.ts declares JP-specific Phase 6 tags:

// JP-specific (Phase 6 — declared but unused until then)
"prefecture",
"municipality",
"district",
"block",
"sub_block",
"building_number",
"building_name",

The schema, formatting, runtime pipeline, and now the corpus prototype are ready. The blockers are: (1) the missing house-number data source, and (2) training time on a JP-aware recipe.

Where rules fail and learning wins

Every address parser written for Western input fails on Japan in a specific, predictable way: it parses the prefecture as a country, then runs out of tokens. The locality and chome get lumped into a single span. The block-number triple gets parsed as a postcode or dropped entirely.

Mailwoman's transformer architecture is locale-agnostic at the BIO level. The same model can learn region → locality → chome ordering if it sees enough examples. The 6,373 rows we generated today are the first batch.

PO Box Boîte Postale Apartado: Stage 3 ships with 6 new tags

2026-05-28T00:00:00.000Z

For its first six versions, Mailwoman emitted ten component tags (21 BIO labels). The model could pick street out of a row but not street_prefix, street_suffix, unit, or po_box. Real addresses are messier than that. The golden eval set has known examples: 6220 SE Salmon St, Portland, OR 97215 (Stage 2 collapses prefix+name+suffix); 123 Main St Apt 4B, Springfield, IL 62701 (loses the apartment); PO Box 123, Burlington, VT 05401 (treats it as a malformed street).

v0.6.0 adds six tags: street_prefix, street_suffix, unit, po_box, intersection_a, intersection_b. The model is the same h384/6L/6H transformer. The recipe is the same v0.5.1 settings. The tokenizer is the same v0.6.0-a0 multi-script bundle. The only structural change is the output head: 21 BIO labels → 33.

The schema was already there

core/types/component.ts has declared the canonical ComponentTag union since Phase 0, including all six new tags plus seven JP-specific ones (Phase 6). The schema was forward-declared. The runtime pipeline, the formatter, the golden eval, and even the rule classifiers (StreetPrefixClassifier, StreetSuffixClassifier) all knew about these tags. Only one constant was missing: the active training label set.

# corpus-python/src/mailwoman_train/labels.py

# Old:
ACTIVE_TAGS: Final[tuple[str, ...]] = STAGE2_TAGS  # 10 tags

# New:
ACTIVE_TAGS: Final[tuple[str, ...]] = STAGE3_TAGS  # 16 tags

The label IDs are stable: STAGE3 appends to STAGE2 without reordering. Old parquet shards work unchanged — they just don't emit the new tags. Models trained on STAGE2 IDs would still decode correctly against a STAGE3 classifier head; the new logit slots just never get picked.
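
The append-only property is easy to check in a few lines. This sketch assumes the label layout is a single O followed by a B-/I- pair per tag, which matches the 21 and 33 counts above. The Stage 2 tag order and the two placeholder names are hypothetical, since this post names the six new tags but not the full Stage 2 list.

```python
def bio_labels(tags: tuple[str, ...]) -> list[str]:
    """Expand component tags into BIO labels: one O, then B-/I- per tag."""
    labels = ["O"]
    for tag in tags:
        labels += [f"B-{tag}", f"I-{tag}"]
    return labels


# Hypothetical Stage 2 ordering; the last two names are placeholders.
STAGE2_TAGS = ("country", "region", "locality", "dependent_locality", "postcode",
               "street", "house_number", "venue", "placeholder_9", "placeholder_10")
STAGE3_TAGS = STAGE2_TAGS + ("street_prefix", "street_suffix", "unit",
                             "po_box", "intersection_a", "intersection_b")

old, new = bio_labels(STAGE2_TAGS), bio_labels(STAGE3_TAGS)
print(len(old), len(new))      # 21 33
print(new[: len(old)] == old)  # True: every Stage 2 ID keeps its meaning
```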

Where the data comes from

For street decomposition, the data was already there too. Three existing adapters got Stage 3 enhancements:

TIGER (corpus/src/adapters/tiger/) — FULLNAME like "SE Salmon St" gets decomposed via decomposeStreet(), which uses the curated libpostal/en directional + street-type dictionaries (same dictionaries that back the runtime StreetPrefixClassifier).
NAD (corpus/src/adapters/usgov-nad/) — NAD already has structured St_PreDir, St_PreTyp, St_Name, St_PosTyp, St_PosDir fields. The adapter now emits them as separate components instead of joining into one monolithic street. Unit/Building/Floor/Room chain into the new unit tag.
BAN (corpus/src/adapters/ban/) — French street types are leading words: "Rue de Rivoli", "Avenue des Champs-Élysées". decomposeFrStreet() uses libpostal/fr/street_types.txt to pick off the leading type word as street_prefix.

These changes give the model thousands of correctly-labeled Stage 3 examples per adapter without any change to the upstream data sources.

PO box: the synthesis case

PO boxes are different. No corpus adapter has explicit po_box data — TIGER is street segments, NAD has buildings, BAN is street-level addresses, WOF is the admin hierarchy. We need synthesis.

The good news: PO boxes are highly templated. USPS Pub 28 §28C2.040 and DMM 508 §4.1.4/§4.5.4 specify the allowed forms. Multi-locale extension is similarly bounded:

Locale	Leaders
en-US	PO Box, P.O. Box, POB, Post Office Box, PMB, Box, #
en-CA	PO Box, P.O. Box, POB
en-GB	PO Box, P.O. Box, Post Office Box
en-AU	PO Box, GPO Box, Locked Bag
fr-FR	BP, B.P., Boîte Postale
fr-CA	CP, C.P., Case Postale, BP
es-ES	Apdo., Apartado, Apartado de Correos
es-MX	Apdo., Apartado Postal, AP
es-AR	Casilla, Casilla de Correo, CC

corpus/src/synthesize-po-box.ts ships these templates plus three design decisions from a DeepSeek consultation:

PMB shares the po_box tag. USPS treats PMB as a PO Box alias in CASS processing; downstream code can distinguish via "is a street line also present?" without needing a separate label.
Whole-phrase spans ("PO Box 123" as one po_box span, not "123" alone). Matches the existing golden eval convention.
10% number-format noise (commas, dashes, embedded spaces). Real OCR'd input is lousy with "Box 1,234" and "PMB-200", so the parser sees that noise as ordinary input from its first training run.
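
The three decisions above can be sketched as a small generator. This is an illustration, not the contents of synthesize-po-box.ts: the leader lists are three locales copied from the table, while the function name, the number range, and the exact noise styles are assumptions.

```python
import random

# Leaders for three locales, copied from the table above.
LEADERS = {
    "en-US": ["PO Box", "P.O. Box", "POB", "Post Office Box", "PMB", "Box", "#"],
    "fr-FR": ["BP", "B.P.", "Boîte Postale"],
    "es-ES": ["Apdo.", "Apartado", "Apartado de Correos"],
}


def synth_po_box(locale: str, rng: random.Random, noise_rate: float = 0.10) -> str:
    """Return one whole-phrase po_box span, e.g. 'P.O. Box 9'."""
    leader = rng.choice(LEADERS[locale])
    number = rng.randint(1, 9_999_999)  # assumed range: single-digit to 7-digit
    if rng.random() >= noise_rate:
        return f"{leader} {number}"     # the clean form, about 90% of rows
    style = rng.choice(["comma", "dash", "space"])
    if style == "comma":
        return f"{leader} {number:,}"   # Box 1,234
    if style == "dash":
        return f"{leader}-{number}"     # PMB-200
    digits = str(number)
    return f"{leader} {digits[:1]} {digits[1:]}".rstrip()  # Box 1 234


rng = random.Random(0)
print([synth_po_box("fr-FR", rng) for _ in range(3)])
```

Because the function returns the whole phrase, the caller can label every token inside it as po_box (B- on the first, I- on the rest), matching the whole-phrase span convention.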

The pipeline

WOF SQLite (1.29M places, 7 countries)
  ↓  scripts/extract-tuples.py
50K (locality, region, postcode, country) tuples
  ↓  scripts/build-po-box-shard.mjs
50K LabeledRow JSONL with B-po_box/I-po_box spans
  ↓  scripts/jsonl-to-parquet.py
3 MB Parquet shard → Modal volume
  ↓
v0.6.0 training (source_weight: 1.5)

Sample output:

P.O. Box 9, Bancroft, ID 83603
  tokens: ['P', 'O', 'Box', '9', 'Bancroft', 'ID', '83603']
  labels: ['B-po_box', 'I-po_box', 'I-po_box', 'I-po_box', 'B-locality', 'B-region', 'B-postcode']

Four tokens get po_box: the span covers the whole "P.O. Box 9" phrase, . punctuation included. The model learns the span shape, the leader vocabulary, and the locale-to-template mapping all at once.

Golden eval expansion

Test data matters as much as training data. The golden v0.1.2 set had 1 PO box entry — not enough to fail meaningfully, let alone measure progress. We added 26:

20 US variants across all leader forms (PO Box, P.O. Box, P. O. Box, POB, POBOX, Post Office Box, Box, P.O.Box) and number ranges (single-digit to 7-digit)
3 PMB variants ("100 Main St PMB 200", "1234 Wilshire Blvd #500")
6 FR/CA variants (BP, B.P., Boîte Postale, Case Postale, CP)

Results

v0.6.0 trained for 100K steps on a Modal A100. The run was CE-only (crf_loss_weight: 0) after two attempts with CRF training enabled ended in NaN; the 33×33 transition table combined with bf16 was numerically unstable. The inference-time CRF is still active via the structural mask. v0.6.1 will investigate.

Demo presets: 11/11 parse (6 canonical addresses + 5 Stage 3 variants).

Per-tag golden eval (4,561 entries):

Tag	v0.5.4 recall	v0.6.0 recall
postcode	75.7%	76.0%
house_number	78.7%	79.0%
region	65.0%	65.0%
locality	39.4%	39.7%
street	28.0%	27.9%
venue	29.4%	29.2%
po_box	0.0%	51.9%
street_prefix	0.0%	0.0%
street_suffix	0.0%	0.0%
unit	0.0%	0.0%
intersection_a/b	0.0%	0.0%

PO box recognition went from impossible to functional in one training run. Sample:

"PO Box 123, Burlington, VT 05401"
→ { region: "VT", locality: "Burlington",
    po_box: "PO Box 123", postcode: "05401" }

Stage 2 metrics held flat: the new tags extended the schema without displacing the old ones.

What's deferred

The other Stage 3 tags (street_prefix, street_suffix, unit, intersection) stayed at 0% recall because the TIGER/NAD/BAN adapter changes that emit them haven't been baked into a corpus rebuild yet. The training data still has monolithic street spans like "SE Salmon St" instead of decomposed street_prefix: "SE", street: "Salmon", street_suffix: "St". v0.6.1 needs a fresh corpus build to surface those.

CRF learned transitions are also deferred. Two attempts (crf_loss_weight: 0.5, then 0.1) both diverged to NaN post-warmup. The hypothesis: bf16 combined with the larger transition table (33×33 = 1,089 entries vs 21×21 = 441) is numerically unstable. v0.6.1 will try fp32 precision for the CRF parameters specifically, or a gradient-clipped warmup-only schedule.

What this proves

The pattern works. A new tag in the canonical schema + a focused synthesis source + a one-line corpus config change + 100K training steps = working tag recognition. Total elapsed time tonight: ~6 hours from "no PO box training data exists" to a 28 MB model that hits PO box correctly more than half the time on a hostile eval set.

The same recipe scales to street decomposition, intersection, unit, and the JP-specific Phase 6 tags. The schema is already declared. Each new tag is the same shape of work as PO box was tonight.

FST gazetteer ships to the browser

2026-05-27T00:00:00.000Z

The /demo page now loads a 9 MB FST (finite-state transducer) gazetteer alongside the 29 MB ONNX model. 94,000 US admin places with Wikipedia importance scores feed directly into the neural classifier's Viterbi decoder as emission priors — the same pipeline that runs server-side now runs entirely in the browser.

What changed

The FST binary encodes every US admin place name from Who's On First as a trie: "new york" walks to a node with 7 interpretations (NYC locality, NY state region, New York County, etc.). At query time, the classifier receives additive logit biases proportional to each place's Wikipedia importance, so Washington DC (importance 0.815) correctly outranks Washington state (0.764).
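
A rough sketch of how importance can become an additive emission prior. A plain dict stands in for the FST, the two importance values are the ones quoted above, and everything else is assumed for illustration: the `BIAS_SCALE` constant, the function name, and the treatment of Washington DC as a locality.

```python
# Interpretations keyed by normalized name; a dict stands in for the FST lookup.
GAZETTEER = {
    "washington": [("locality", 0.815), ("region", 0.764)],
}

BIAS_SCALE = 2.0  # hypothetical: logits granted to a place with importance 1.0


def emission_bias(span: str, tags: list[str]) -> list[float]:
    """Additive logit bias per tag, proportional to the best importance for that tag."""
    bias = {tag: 0.0 for tag in tags}
    for tag, importance in GAZETTEER.get(span.lower(), []):
        if tag in bias:
            bias[tag] = max(bias[tag], BIAS_SCALE * importance)
    return [bias[tag] for tag in tags]


tags = ["locality", "region", "street"]
logits = [0.2, 0.2, 0.2]  # made-up neural emissions, tied on purpose
biased = [l + b for l, b in zip(logits, emission_bias("Washington", tags))]
print(tags[biased.index(max(biased))])
```

Because the bias is added rather than substituted, a confident neural emission can still override the gazetteer; the prior only breaks ties and near-ties.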

The browser integration required a new deserializer (fst-deserialize-web.ts) that uses DataView + TextDecoder instead of Node's Buffer. Same binary format, zero Node dependencies. The FST loads in parallel with the ONNX model — no added latency on the critical path.

The tokenizer incident

While wiring the FST, we discovered the live demo was serving the wrong tokenizer. The v0.5.3 model (48K vocab, 29 MB) was paired with the old v0.1.0 tokenizer (24K vocab, 474 KB). This produced garbage output — every span labeled as locality with sub-0.5 confidence. Nobody noticed because the demo was "working" (it showed results), just badly.

The root cause: docs/static/mailwoman/ was manually managed. Model and tokenizer were copied independently, and the tokenizer copy was missed during the v0.5.3 update.

The fix is a Docusaurus plugin (docs/plugins/demo-assets/) that stages all binary assets from the neural-weights-en-us package at build time. Model card version is the source of truth. The tokenizer/model mismatch can't recur because both come from the same source.

What we fixed along the way

The night shift addressed every recommendation from the v0.5.3 training review:

Per-tag F1 in training CSV. The macro F1 comparison that caused hours of wrong analysis in the v0.5.3 session (0.579 vs 0.638 across different tokenizers) is now impossible — per-tag breakdown logged at every eval step.
Grouper-audit fix. The audit was checking only top-level tree roots for coverage, missing nested children in containment trees. "400 Broad St, Seattle, WA 98109" was getting locality=Broad injected because the audit didn't see street=Broad St nested inside locality=Seattle.
Phrase grouper hardening. "Pennsylvania" was proposed as LOCALITY_PHRASE on "1600 Pennsylvania Ave NW" because any capitalized word matched. Now penalized -0.20 when the word is a US state name in a non-tail position. "Paris, Texas" is preserved (tail position).
CRF transition export pipeline. The Python training side can now export learned CRF transition scores to crf-transitions.json. The TypeScript classifier loads and composes them with the structural BIO mask. Not yet trained (v0.5.4 will be the first model to use this).

Browser verification

Playwright headless test against the live site:

400 Broad St, Seattle, WA 98109

  house_number: "400"    (0.97)
  street:       "Broad St"   (0.98)
  locality:     "Seattle"    (0.98)
  region:       "WA"         (0.98)
  postcode:     "98109"      (0.96)

6/6 demo presets correct, zero grouper-audit nodes. The model works.

Try it

mailwoman.sister.software/demo — type any US address. The neural classifier, FST gazetteer, and WOF locality resolver all run in your browser. No server round-trips after the initial ~75 MB asset load.

Our model worked in CI but broke on every real device

2026-05-27T00:00:00.000Z

We shipped a browser-based address parser that runs a 29 MB ONNX model entirely client-side. The Playwright tests showed perfect results. Chrome desktop looked great. Then someone opened it on an iPhone.

What we saw

Every address component was classified as "locality" with 0.2–0.4 confidence. "400 Broad St, Seattle, WA 98109" became three locality spans with no street, no region, no postcode. The model was producing near-uniform logits — as if it hadn't been trained at all.

Toggling to the WASM backend in our debug UI produced perfect results immediately. Same model bytes, same tokenizer, same input. The GPU path was the problem.

The wrong hypotheses

We burned hours on each of these before finding the real cause:

Stale browser cache. We'd recently updated the model from 25 MB (old tokenizer) to 29 MB (new tokenizer). The old model with the wrong tokenizer produces exactly this symptom — garbage output. We added cache-busting query params, migrated assets to a CDN, and verified file sizes. The files were correct.

Tokenizer mismatch. The v0.5.3 model uses a 48K-vocab tokenizer but an older 24K-vocab tokenizer was briefly deployed. We verified hashes. The tokenizer was correct.

Model version drift. We have four model versions on the CDN. Maybe the wrong one was being loaded. We added a version selector to the demo page and confirmed v0.5.3 was selected. The model was correct.

Browser-specific WASM numerics. Maybe Safari's WASM implementation handles int8 quantization differently. We tested WASM on Safari — it worked perfectly. The problem was WebGPU-specific.

Why Playwright couldn't catch it

Every automated test we ran passed. The reason: headless Chromium does not have a WebGPU adapter. When you request executionProviders: ["webgpu", "wasm"], the runtime silently falls back to WASM. WASM handles int8 correctly, so the test passes.

We had a verify skill that launched a real headless browser, navigated to the live demo, typed an address, and checked the parse output. It ran after every deployment. It passed every time. And it was useless for catching this bug, because it could never exercise the code path that was broken.

The real cause

onnxruntime-web ships two WebGPU execution providers in the same npm package:

onnxruntime-web          → ort.bundle.min.mjs     → JSEP (old, broken)
onnxruntime-web/webgpu   → ort.webgpu.bundle.min.mjs → Native EP (fixed)

The JSEP (JavaScript-based execution provider) has a slice kernel bug that produces incorrect results when reversing a tensor on a specific axis. This corrupts the dequantization of int8 weights. The bug is worse on Safari's Metal backend than Chrome's Dawn/Vulkan backend — Chrome happened to mask it in our case.

The native WebGPU EP handles the same operations correctly on all backends.

The fix

- import * as ort from "onnxruntime-web"
+ import * as ort from "onnxruntime-web/webgpu"

One line. The API is identical. Session creation, tensor I/O, and provider fallback all work the same way. The native bundle is also smaller (113 KB vs 405 KB).

After this change, the model produces correct results on Chrome, Safari macOS, and iOS Safari — all via WebGPU, no WASM fallback needed.

What we should have done differently

The diagnostic path that would have saved hours:

Force WASM. If results become correct, the problem is GPU-side.
Check which execution provider is actually active. We didn't have this instrumentation — we've since added a backend indicator to the demo page.
Check the import path. grep "onnxruntime-web" in your source. If you're importing the bare package, you're on the JSEP.
Test on Safari. If it fails on Safari but works on Chrome, the JSEP is the prime suspect.

The deeper lesson: test infrastructure that lacks GPU access will never catch GPU-specific bugs. Headless browsers are not real browsers when it comes to hardware acceleration. If your product runs on GPUs, you need at least one test that exercises the GPU path on a device that has one.

Night Shift 2 — from thermal hangs to a shipped model in one session

2026-05-25T00:00:00.000Z

The second night shift ran from roughly 2am to 2pm UTC on May 25th, 2026. It started with a GPU that wouldn't stop crashing and ended with a trained model, an ONNX export, and a full evaluation report. This is the story of how infrastructure choices turned a hardware problem into a non-issue.

The hardware wall

The lab runs on a small form factor desktop with an AMD Radeon 780M integrated GPU. For short bursts (a 2-minute smoke test, a 10-minute diagnostic probe), it works fine. For sustained multi-hour training at 98% GPU utilization, it overheats. The firmware detects thermal stress and resets the GPU, killing whatever process was running on it.

During this session, the GPU hit 22 resets before we stopped counting. Every 60-90 minutes of training, the hardware would fault. A watchdog script would wait 15 minutes for the chassis to cool, then restart from the last checkpoint. Net progress: about 8,800 training steps out of a target 50,000.

At that rate — 90 minutes of compute, 15 minutes of cooldown, 500 steps lost per restart — the full training run would take roughly 38 hours of wall-clock time. That's fine for a research prototype, but it's not a productive use of a night shift.
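
The estimate can be reproduced with a back-of-envelope sketch. It assumes the 0.56 steps-per-second local rate quoted later in this post and treats each restart as a flat 500 lost steps; the function name and the choice to evaluate both ends of the 60 to 90 minute window are illustrative.

```python
def wall_clock_hours(target_steps: int, steps_per_sec: float,
                     compute_min: float, cooldown_min: float, lost_steps: int) -> float:
    """Estimate wall-clock hours when every compute window ends in a GPU reset."""
    net_steps = compute_min * 60 * steps_per_sec - lost_steps  # progress kept per cycle
    cycles = target_steps / net_steps
    return cycles * (compute_min + cooldown_min) / 60


# Both ends of the observed fault interval, with a 15-minute cooldown.
slow = wall_clock_hours(50_000, 0.56, compute_min=60, cooldown_min=15, lost_steps=500)
fast = wall_clock_hours(50_000, 0.56, compute_min=90, cooldown_min=15, lost_steps=500)
print(f"{fast:.1f} to {slow:.1f} hours")
```

Under these assumptions the two ends of the fault interval bracket the rough 38-hour figure, and the estimate is quite sensitive to how long the GPU survives between resets.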

Modal is a cloud compute platform where you write a Python function, decorate it with @app.function(gpu="A100"), and it runs in a datacenter with a proper GPU. No SSH, no Docker, no instance management.

The pivot took about an hour:

Upload the corpus to Cloudflare R2 — 30 GB of training data, synced via rclone. This took about 15 minutes (the data was already on a fast local drive; the upload was bandwidth-limited but not painfully so).
Write a Modal wrapper — 20 lines around the existing training script. The wrapper pulls the corpus from R2 into a Modal Volume (a persistent disk), runs the train, writes checkpoints back.
Debug three small issues — the Modal worker needed the R2 credentials passed as secrets (first attempt used empty env vars), the training config wasn't on the volume yet, and the ONNX export needed onnxscript added to the image.
Run the training — 50,000 steps on an NVIDIA A100-SXM4-40GB in 2 hours. No hangs, no resets, no watchdog. Just clean, uninterrupted compute at 6.9 steps per second (vs 0.56 on the local iGPU).

Total cost: about $5, covered entirely by Modal's $30/month free credits for new accounts.

The results

The CE-only model (which drops the problematic CRF loss term that caused nine previous runs to diverge) trained to completion:

val_macro_f1: 0.605 (final), 0.621 (peak at step 35K)
Train loss: 0.068 (final)
Zero divergence across all 50,000 steps
ONNX export: 66 MB (full-precision training artifact; the shipped weights are quantized to ~25 MB for the npm package, and smaller still for the browser demo)

For context: v0.4.0 shipped at macro_f1 = 0.36. This is a 68% relative improvement on the same evaluation set.

The eval matrix

After the model shipped, we ran the full product-level evaluation — four pipeline modes compared on 4,535 hand-curated golden addresses:

Mode	Exact Match	Macro F1	Empty Parse	Overconf Wrong
Rule-only	30.8%	22.0%	6.3%	2.4%
Neural	0.1%	7.3%	0.3%	54.5%
Hybrid	0.1%	7.3%	0.3%	54.5%
Hybrid-joint (reconciler)	6.0%	16.6%	0.0%	0.1%

A few things jump out:

The neural model hallucinates components it shouldn't. On the golden set, it invented a dependent_locality — a sub-city neighborhood — 956 times where none existed. Two explanations look tempting, and both lead nowhere. Calibration? These predictions come out at high confidence; the model commits hard to the wrong answer. Decoding? Viterbi with the structural mask is already running. What's left is training: cross-entropy treats every mislabeling equally, so the model never learned that dependent_locality is rare and should be emitted sparingly. Class-weighted CE — which was blocked in v0.4.0 because it destabilized the dual-loss training — puts a thumb on the scale: mislabeling a rare tag costs more. Now that CE-only training is proven stable, this lever is unlocked.

Hybrid mode shows identical numbers to neural alone. The hybrid mode fuses rule classifications with neural output, but in this iteration the raw neural decoder's overconfidence drowns out the rules, hence the identical numbers. The reconciler (hybrid-joint) is the mode that actually disciplines the merge.

The reconciler fixes the honesty problem. It drops overconfident-wrong from 54.5% to 0.1% by checking whether parsed components form a coherent real-world hierarchy. It also eliminates empty parses entirely (0.0% vs rules' 6.3%): it always produces something, even if conservative.

The rules are a ceiling. The neural model is a ramp. Rule-only at 30.8% exact match is a mature system, hand-tuned over years. Each additional percentage point costs engineering time. The neural model at 6.0% (hybrid-joint) after one stable training run is learning from data, which means each new training run can improve across every component and every locale simultaneously. The 68% improvement from v0.4.0 to v0.5.0 is the trend that matters — and the ramp just proved it can climb.

The infrastructure lesson

The overnight session could have been a write-off. A GPU that crashes every 90 minutes, a 50,000-step training target, and 12 hours of wall-clock to fill. Instead:

Corpus on R2 means any GPU provider can pull it at datacenter speed. Upload once, train anywhere.
Modal's per-second billing means we paid $0 for data upload, $0 for debugging, and ~$5 for the actual GPU compute.
Checkpoints every 500 steps on the Modal Volume means even if a Modal preemption happened (it didn't), we'd lose at most 7 minutes of work.
The same training script ran locally (for smoke tests) and remotely (for the full run) without modification: the config just points at /data/, which is either the local mount or the Modal Volume.

The local iGPU still has a role: smoke tests, gradient probes, quick 50-step experiments. The expensive runs go to the cloud. The separation happened naturally once we accepted that the hardware wall was real and not worth engineering around.

What's next

Now that we have cloud GPU access at $5 per full training run, several decisions we made for the local hardware no longer apply. The v0.5.0 model was trained with constraints that made sense on a thermal-limited iGPU but don't make sense on an A100:

Hidden size 256 — we wanted 384 but fell back when it wouldn't train locally. The A100 has 40 GB of VRAM; 384 or 512 are trivial.
Effective batch 128 via gradient accumulation (batch=16, accumulate 8 steps) — a workaround for limited GPU memory. The A100 can do batch 128 directly, which changes the gradient noise characteristics and potentially the training dynamics.
50,000 steps — sized for "affordable locally." At 6.9 steps/second on the A100, 100K steps costs $10. We might be undertrained.
Phrase-prior conditioning disabled — turned off during debugging and never turned back on. The architectural thesis was built around it.
Class-weighted cross-entropy disabled — the v0.4.0 recipe lever that addresses the 956-FP hallucination problem is now safe to use.

The next iteration removes all of these constraints at once: h384, direct large-batch, phrase priors on, class weights on, longer schedule. Same corpus, same tokenizer, same CE-only stability fix — just the model the architecture was designed to produce. One A100 run, a few hours, covered by free credits.

Where to look

Getting started — 5-minute install + first parse
Project status — what ships today, per package
Eval matrix report — full per-component breakdown
What the eval numbers mean — plain-English interpretation
Modal training wrapper — the 250-line script that runs the whole thing
Dual-loss curvature conflict — why CE-only works when nine dual-loss runs didn't

Five tries, same failure — narrowing v0.5.0's training problem by elimination

2026-05-24T00:00:00.000Z

This is a follow-up to yesterday's post about the v0.5.0 C-train failures. Yesterday we ran four attempts and ruled out three suspects. Today we ran a fifth and ruled out a fourth. We're now down to one remaining hypothesis — and the way we got here is a kind of debugging that translates pretty cleanly from software engineering, so this post is pitched at engineers who haven't run a training campaign before.

If you've ever bisected a regression in a piece of software (used git bisect, narrowed a test failure by reverting changes one at a time, taken a known-good build and a known-broken build and asked which of the changes between them caused the breakage), then you already understand the core move. The rest is vocabulary.

The setup, in software terms

Last week we shipped a "v0.4.0" model. Think of a model as a long-lived process — millions of internal numbers (weights) that we tune by feeding it labelled examples for hours and adjusting based on how wrong each guess is. The output of all that tuning is a single file (~50MB) we copy to production.

v0.4.0 worked. We then changed a handful of things in parallel to ship v0.5.0:

New tokenizer (the thing that splits input strings into model-readable units; we made a bigger, smarter one because the old one fell back to raw bytes on non-Latin scripts).
New corpus (we added synthetic adversarial examples + transliteration pairs to the training data).
New input layer ("phrase priors": pre-computed hints about where each meaningful span starts and ends, fed into the model alongside the raw tokens).
Bigger hidden size (the internal width of the model — more capacity, in principle).
Plus a bunch of new code surrounding it (top-k inference, joint decoding, a new reconcile stage).

Item 5 is pure new code, so it's fine. Items 1-4 are the suspects. Any combination of them could be the thing that breaks training. Welcome to a multi-variable regression.

The failure mode

When we trained the model — which is just a long loop, run for ~50,000 iterations, watching a number called "loss" go down — the loss went down beautifully for the first ~1000 iterations and then started going up. Catastrophically. By the time we noticed, the model had unlearned everything useful and was producing garbage.

In software terms: imagine a process that runs fine for the first hour and then enters a kind of cascading state corruption that slowly destroys all its in-memory data, even though no individual operation looks wrong. There's no segfault, no exception. The numbers just slowly drift away from useful and toward useless.

This pattern has a fingerprint:

descent through warmup  →  brief plateau at a low loss  →  sharp climb back to nonsense

Every single run we've done so far has shown this exact fingerprint at slightly different points. The depth of the plateau varies; the moment the climb starts shifts; the climb itself is always there.
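If you want to spot this fingerprint mechanically rather than by eye, a minimal sketch might look like the following. It is not the check our training harness uses; `findClimb`, the `tolerance` parameter, and the sample numbers are all made up for illustration. The idea is to track the running minimum of the loss and flag the first step that rises well above it.

```typescript
// Hypothetical helper: report the best loss seen and the first step at which
// the loss rises more than `tolerance` (relative) above that running minimum.
interface ClimbReport {
	bestLoss: number;
	bestStep: number;
	climbStep: number | null; // null means no climb was detected
}

function findClimb(losses: number[], tolerance = 0.25): ClimbReport {
	let bestLoss = Infinity;
	let bestStep = -1;
	for (let step = 0; step < losses.length; step++) {
		const loss = losses[step] ?? Infinity;
		if (loss < bestLoss) {
			bestLoss = loss;
			bestStep = step;
		} else if (loss > bestLoss * (1 + tolerance)) {
			return { bestLoss, bestStep, climbStep: step };
		}
	}
	return { bestLoss, bestStep, climbStep: null };
}

// Invented values shaped like the fingerprint: descent, plateau, climb.
const fingerprintCurve = [2.4, 1.6, 0.9, 0.6, 0.42, 0.41, 0.43, 0.6, 1.1, 2.0];
const climbReport = findClimb(fingerprintCurve);
console.log(climbReport);
```

A real loss curve is noisy, so in practice you would probably smooth it before applying a threshold like this.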

Bisecting by elimination

Five attempts now, each varying one knob from the previous:

Run	What changed	Best loss before climb	When the climb started
v1	(all v0.5.0 changes ON)	0.61	step 700
v2	lowered LR (1.5e-4 → 1e-4)	0.51	step 1000
v3	turned off two loss-side knobs (§1, §3)	0.41	step 800
bisect-h256	reverted hidden-size bump	0.31	step 1050
bisect-phrase-off	reverted phrase-prior input layer	0.38	step 1050

The bisect-phrase-off run is the new one (today). The previous post covered v1 through bisect-h256.

What every bisect attempt has in common: the model is provably learning something useful for several hundred iterations (the loss decreases, validation accuracy climbs), and then it falls off a cliff. This means the model isn't fundamentally broken. It can fit the data, it just can't stay fit. Something is pushing it off the cliff.

Each bisect tested a different "is this the cliff?" hypothesis:

v2 tested "is the learning rate too high?" No. Lowering LR delayed the climb but didn't stop it.
v3 tested "are the new loss-side weighting knobs destabilising?" No. v0.4.0's known-stable loss settings still produce the cliff.
bisect-h256 tested "is the bigger model the problem?" No. We reverted hidden size to v0.4.0's value and got the cleanest training so far (best validation macro-F1 we've ever measured), and the cliff still happened.
bisect-phrase-off (today) tested "is the phrase-prior input layer the problem?" No. We turned off the phrase-prior feature concatenation entirely and the cliff is still there, in the same shape, at almost the same step.

Five attempts, five identical fingerprints, four hypotheses eliminated. Exactly one architectural change from the known-stable v0.4.0 setup is still in play: the tokenizer + corpus pair.
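The bookkeeping behind that sentence is simple enough to write down. Here is a rough sketch, assuming each run reverts exactly one suspect to its known-good setting; the type names and the `remainingSuspects` helper are invented for this post, and the real campaign lives in notes and run logs rather than in code like this.

```typescript
type Suspect = "learning-rate" | "loss-knobs" | "hidden-size" | "phrase-priors" | "tokenizer+corpus";

interface BisectRun {
	label: string;
	reverted: Suspect | null; // the suspect this run put back to a known-good setting
	stillDiverged: boolean;
}

// If a run reverts a suspect and the failure persists, that suspect is cleared.
function remainingSuspects(all: Suspect[], runs: BisectRun[]): Suspect[] {
	const cleared = new Set<Suspect>();
	for (const run of runs) {
		if (run.reverted !== null && run.stillDiverged) cleared.add(run.reverted);
	}
	return all.filter((suspect) => !cleared.has(suspect));
}

const allSuspects: Suspect[] = ["learning-rate", "loss-knobs", "hidden-size", "phrase-priors", "tokenizer+corpus"];
const v05Runs: BisectRun[] = [
	{ label: "v1", reverted: null, stillDiverged: true },
	{ label: "v2", reverted: "learning-rate", stillDiverged: true },
	{ label: "v3", reverted: "loss-knobs", stillDiverged: true },
	{ label: "bisect-h256", reverted: "hidden-size", stillDiverged: true },
	{ label: "bisect-phrase-off", reverted: "phrase-priors", stillDiverged: true },
];
const stillInPlay = remainingSuspects(allSuspects, v05Runs);
console.log(stillInPlay);
```

The sketch glosses over one thing: clearing a suspect this way only shows it is not sufficient on its own to cause the failure. It cannot rule out interactions between suspects.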

What's left, and why it's interesting

The two remaining variables are linked:

A1 tokenizer: a new vocabulary of 48,000 sub-pieces that the model uses to chop input strings into atomic units. It was trained on the v0.4.0 corpus (which includes the new transliteration data) so it knows about CJK / Cyrillic / Hangul / Han / Armenian script. The old tokenizer just gave up on non-Latin scripts and emitted raw bytes.
corpus-v0.4.0: the old corpus plus ~78,000 new rows generated by an LLM — adversarial "trick" addresses and transliteration pairs in non-Latin scripts.

These two are bundled. A1's vocabulary was constructed from corpus-v0.4.0's content. So the model is simultaneously seeing new tokens (vocab change) and new data (corpus change) for the first time, and we can't fully separate them without retraining one or the other.

But we have one cheap experiment that gets us most of the way there. Hold the tokenizer constant (keep A1), and just swap the corpus back to corpus-v0.3.0 (the old data, no transliteration mass). That tests whether the transliteration data is the destabiliser, while preserving the tokenizer-side win.

This is the next bisect. If it trains cleanly, we'll know the synthetic data (specifically the transliteration pairs added in corpus-v0.4.0, Thread B2 in the v0.5.0 plan) is what's breaking training. That'd be a useful answer because we have a couple of obvious follow-ups:

Downweight transliteration in the training mix. The corpus has per-source weights; we just turn down the new stuff. Lossy but cheap.
Investigate why the transliteration data destabilises. The honest hypothesis is that the LLM-generated rows have systematically different gradient signatures than human-validated address data — they might be too structured, or have repetitive patterns the model overfits to and then explodes on. We have tooling (corpus-audit) that can quantify this.
Ship A1 (tokenizer wins) + corpus-v0.3.0 model (proven-stable) for v0.5.0, defer transliteration training to v0.5.1.
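To make the first option concrete, here is a small sketch of what turning down a per-source weight does to the training mix. The `CorpusSource` shape, the `mixShares` helper, and the one-million-row base corpus are assumptions for illustration only; the transliteration row count is the one from the v0.5.0 corpus notes.

```typescript
// Hypothetical shape for a corpus source with a sampling weight.
interface CorpusSource {
	label: string;
	rows: number;
	weight: number;
}

// Share of the training mix each source gets, proportional to rows * weight.
function mixShares(sources: CorpusSource[]): Map<string, number> {
	const total = sources.reduce((sum, source) => sum + source.rows * source.weight, 0);
	const shares = new Map<string, number>();
	for (const source of sources) {
		shares.set(source.label, total === 0 ? 0 : (source.rows * source.weight) / total);
	}
	return shares;
}

// The base row count is invented; only the transliteration count comes from the post.
const mixBefore = mixShares([
	{ label: "base", rows: 1_000_000, weight: 1 },
	{ label: "transliteration", rows: 73_316, weight: 1 },
]);
const mixAfter = mixShares([
	{ label: "base", rows: 1_000_000, weight: 1 },
	{ label: "transliteration", rows: 73_316, weight: 0.25 },
]);
console.log(mixBefore.get("transliteration"), mixAfter.get("transliteration"));
```

This is why the option is "lossy but cheap": the rows are still there, the model just sees them less often.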

If the corpus-revert bisect also fails, we're left with the A1 tokenizer itself as the destabiliser. That's a stranger answer (tokenizer training is mostly orthogonal to classifier training), but not impossible. New vocabulary means a fresh embedding table the model has never seen; unusual sub-piece frequencies could in principle produce unusual gradient norms.

What we'd tell a software engineer reading this

Three things about ML debugging that don't translate cleanly from regular software:

There's no stack trace. Loss is the only signal you get. You don't get to step into the model and see what's wrong. You change one knob, run the experiment for hours, and read the resulting curve like a fortune teller. This is the part that makes ML feel unscientific — but if you're disciplined about it (one knob at a time, write down the result, save the artifacts), it's exactly the same bisect-by-elimination workflow as git bisect.
Iterations are expensive. Each "is the bug here?" check costs hours of GPU time. You can't make 100 tries and look at the distribution. You make 5-10 tries, and each one has to be carefully designed to maximise the information yield. This is why ML researchers obsess over "ablation studies" — they're the equivalent of unit tests, but each one costs $5 of compute.
The "obvious" suspect is often wrong. When v0.5.0 started failing we assumed the bigger model was at fault. (We made it bigger! That's a lot of new parameters to break!) The h256 bisect ruled that out cleanly. Then we assumed it was the new input layer. The phrase-off bisect ruled that out too. The remaining suspect — the tokenizer + corpus — was our least-favourite hypothesis going in, because the tokenizer was the headline win of v0.5.0. But the data has a way of being indifferent to your preferences. You keep elimination-bisecting until you find the answer the data is actually telling you.

Where we go next

The corpus-revert bisect is the next experiment. It's a 25-30 hour training run on the lab's GPU, so we'll start it tonight and check on it tomorrow morning. If it trains clean, we have a clear shippable v0.5.0 (with a v0.5.1 follow-up to figure out the transliteration data destabilisation). If it doesn't, we'll have the cleanest possible signal that the tokenizer change itself is interacting weirdly with classifier training — a much more interesting problem to write about.

Either way the bisect ladder is short now. Five experiments in, one hypothesis left, and a clear next experiment that resolves it. The frustrating part of ML debugging is the long iteration cycle; the satisfying part is that the same systematic elimination always works in the end.

Where to look

v0.5.0 fresh-slate plan
Yesterday's post on the first four attempts
The v0.4.0 retrospective (the original "destabilisation fingerprint" we recognised in v0.5.0)
VERDICT_SMOKES.md — discipline doc for the smoke-test framework we built during v0.4.0 to catch divergences early

If you do ML work and have ideas about what classes of corpus distribution shift could produce a "trains fine for a thousand steps then catastrophically diverges" pattern, the mailbox is open: contact@sister.software.

Taming Who's On First — making sense of the world's open place data

2026-05-24T00:00:00.000Z

If you found this via search

Mailwoman is an open-source address parser + geocoder that uses Who's On First as its gazetteer. This post is a practical reference on WOF's gotchas and the tooling we built to work around them. Try the demo or see what ships today.

Who's On First is the best open gazetteer we have. It's also one of the strangest datasets you'll encounter as a developer. This post is about what makes it hard to use, what makes it worth the effort, and the tooling we built inside Mailwoman to tame it.

If you've ever tried to answer "what city is this address in?" programmatically, using open data without paying a geocoding API, you've probably already run into WOF. And you probably had some questions.

What Who's On First actually is

WOF is a gazetteer — a structured database of places. Not addresses, not roads, not buildings. Places: countries, regions, counties, cities, neighbourhoods, venues. Each one gets a stable numeric ID, a parent-child hierarchy, multilingual name variants, and a polygon geometry.

It was created by Mapzen (RIP, 2018) as the successor to GeoPlanet (Yahoo's gazetteer, also RIP). The data lives on GitHub as approximately 100 repositories under the whosonfirst-data org, totalling several million individual GeoJSON files. Geocode Earth maintains the canonical SQLite distributions at data.geocode.earth.

The key thing WOF gives you that no other open dataset does: a consistent hierarchy with stable IDs. You can take a locality (Houston, id 85922029), follow its parent_id to a region (Texas, id 85688753), follow that to a country (United States, id 85633793), and know the chain is consistent. OpenStreetMap doesn't give you this. GeoNames gives you a partial version. WOF gives you the whole thing, with an opinion on how the world's administrative boundaries nest.
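As a rough illustration of what "follow the parent_id" means in code, here is a sketch over an in-memory index. The `WofPlace` shape and the `ancestry` helper are made up for this post, and the chain is simplified to match the example above (real records may have intermediate levels such as counties).

```typescript
// Hypothetical minimal record: just enough to walk the hierarchy.
interface WofPlace {
	id: number;
	label: string;
	placetype: string;
	parentId: number;
}

// Walk parent links upward until we hit a parent that is not in the index.
// The `seen` set guards against accidental cycles in bad data.
function ancestry(index: Map<number, WofPlace>, startId: number): WofPlace[] {
	const chain: WofPlace[] = [];
	const seen = new Set<number>();
	let current = index.get(startId);
	while (current !== undefined && !seen.has(current.id)) {
		chain.push(current);
		seen.add(current.id);
		current = index.get(current.parentId);
	}
	return chain;
}

const wofIndex = new Map<number, WofPlace>([
	[85922029, { id: 85922029, label: "Houston", placetype: "locality", parentId: 85688753 }],
	[85688753, { id: 85688753, label: "Texas", placetype: "region", parentId: 85633793 }],
	// The continent record is deliberately left out of this toy index.
	[85633793, { id: 85633793, label: "United States", placetype: "country", parentId: 102191575 }],
]);
const houstonChain = ancestry(wofIndex, 85922029).map((place) => place.label);
console.log(houstonChain.join(" > "));
```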

What makes it hard

One file per place

WOF stores each place as a separate .geojson file in a directory tree. A US admin dataset has roughly 120,000 individual files. The French equivalent has about 80,000. Opening, parsing, and indexing 200,000 JSON files is a meaningful engineering problem before you've even asked a question of the data.

This per-file layout made sense for WOF's original use case: git-trackable changes to individual places. You can see who edited Houston last, what changed, and when. But for a geocoder that needs to query "all localities named Houston" across 120K files, it's the wrong access pattern entirely.

The property namespace explosion

A WOF GeoJSON feature's properties object looks like this:

{
	"wof:id": 85830005,
	"wof:name": "Lawrence Corner",
	"wof:placetype": "neighbourhood",
	"wof:parent_id": 1729442683,
	"wof:country": "US",
	"wof:hierarchy": [
		{
			"continent_id": 102191575,
			"country_id": 85633793,
			"county_id": 102085493,
			"localadmin_id": 404477193,
			"locality_id": 1729442683,
			"neighbourhood_id": 85830005,
			"region_id": 85688689
		}
	],
	"name:eng_x_preferred": ["Lawrence Corner"],
	"name:eng_x_variant": ["Lawrence Cor"],
	"src:geom": "quattroshapes",
	"edtf:inception": "uuuu",
	"edtf:cessation": "uuuu",
	"geom:area": 0.000047,
	"geom:bbox": "-74.73,40.08,-74.72,40.09",
	"mz:hierarchy_label": 1
}

There are a few things to notice:

Namespaced keys everywhere. wof:, name:, src:, edtf:, geom:, mz:. Each prefix is a different source or concern. The schema is mostly flat (one object, with wof:hierarchy as the nested exception), with meaning encoded in the key name.
Name variants are language-coded. name:eng_x_preferred is the preferred English name. name:fra_x_preferred would be French. name:zho_x_preferred would be Chinese. The _x_ separator splits language code from name kind (preferred, variant, colloquial, abbr, short).
Some places have dozens of name keys. A major city like Paris has name: entries in 50+ languages. A rural US neighbourhood might have only one.
The hierarchy is pre-computed. Instead of walking parent_id up the tree at query time, WOF bakes the full ancestry chain into each record. Convenient for display; redundant for storage; occasionally stale when a parent is reclassified.
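The name-key convention is regular enough to parse with a single pattern. The sketch below handles only the simple three-letter form shown in this post (name:eng_x_preferred and friends); `parseNameKey` is a hypothetical helper, not the project's pluckPlacetypeSpec, and real WOF data may contain key shapes this pattern does not cover.

```typescript
interface NameKey {
	language: string;
	kind: string;
}

// Split "name:eng_x_preferred" into { language: "eng", kind: "preferred" }.
// Returns null for anything that is not a simple name key.
function parseNameKey(key: string): NameKey | null {
	const match = /^name:([a-z]{3})_x_([a-z]+)$/.exec(key);
	if (match === null) return null;
	const [, language, kind] = match;
	if (language === undefined || kind === undefined) return null;
	return { language, kind };
}

const preferredKey = parseNameKey("name:eng_x_preferred");
const notANameKey = parseNameKey("wof:name");
console.log(preferredKey, notANameKey);
```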

Brooklyn Integers

WOF IDs are issued by a service called Brooklyn Integers, a distributed ID generator that guarantees uniqueness across the dataset. The IDs are not sequential and not geographically meaningful, and sorting them tells you nothing useful. They're just unique numbers. This is fine for lookup but means you can't reason about "nearby" places by ID proximity.

Supersession chains

Places get deprecated: a neighbourhood is absorbed by a neighbouring one, a county boundary changes, a locality is merged. WOF tracks this via wof:superseded_by arrays. A query that doesn't check supersession may return a place that hasn't existed since 2015.
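One way to guard against this is to resolve every ID through its supersession chain before using it. The sketch below assumes the wof:superseded_by arrays have already been loaded into a `supersededBy` field; the record shape, the helper name, and the toy IDs are all invented for illustration.

```typescript
interface SupersedableRecord {
	id: number;
	supersededBy: number[]; // loaded from wof:superseded_by
}

// Follow supersession links until we reach records that nothing supersedes.
// A place can be superseded by several successors, so the result is a list.
function resolveSupersession(index: Map<number, SupersedableRecord>, startId: number): number[] {
	const live: number[] = [];
	const seen = new Set<number>();
	const queue: number[] = [startId];
	while (queue.length > 0) {
		const id = queue.shift();
		if (id === undefined || seen.has(id)) continue;
		seen.add(id);
		const record = index.get(id);
		if (record === undefined) continue; // successor is missing from our index
		if (record.supersededBy.length === 0) live.push(id);
		else queue.push(...record.supersededBy);
	}
	return live;
}

// Toy IDs: 1 was replaced by 2, which was later replaced by 3.
const supersessionIndex = new Map<number, SupersedableRecord>([
	[1, { id: 1, supersededBy: [2] }],
	[2, { id: 2, supersededBy: [3] }],
	[3, { id: 3, supersededBy: [] }],
]);
const liveIds = resolveSupersession(supersessionIndex, 1);
console.log(liveIds);
```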

Parent ID = -1

A parent_id of -1 means "we don't know the parent." A parent_id of 0 means "no parent (this is a continent or Earth itself)." The first French postalcode dataset was ingested with parent_id: -1 for every record, making hierarchy traversal useless until someone manually assigned parents. Some of those records still have -1.
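In code, it helps to turn those sentinel values into an explicit type before anything tries to walk the hierarchy. A minimal sketch follows; `ParentLink` and `classifyParentId` are hypothetical names, and treating every negative value as "unknown" is an assumption of this sketch rather than something the WOF documentation is quoted on here.

```typescript
type ParentLink =
	| { kind: "unknown" }
	| { kind: "root" }
	| { kind: "known"; parentId: number };

// Map the raw parent_id onto an explicit state so callers cannot
// accidentally treat -1 or 0 as a real place ID.
function classifyParentId(parentId: number): ParentLink {
	if (parentId === 0) return { kind: "root" };
	if (parentId < 0) return { kind: "unknown" }; // assumption: all negatives mean "no usable parent"
	return { kind: "known", parentId };
}

console.log(classifyParentId(-1), classifyParentId(0), classifyParentId(85688753));
```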

What we built to tame it

Mailwoman needs WOF for two things:

Rule classifiers: "is this token a known locality name?" (Used by the locality/region/country dictionaries in the rule-based classifiers.)
Reconciler concordance scoring: "does this parse's locality/region/country assignment form a valid parent_id chain in the world?" (Used by Stage 5 joint decoding.)

Each use case has a different access pattern, so we built two layers:

Layer 1: normalised placename index (`WOFPlacenameCache`)

For the rule classifiers, all we need is a fast "is this string a placename in any language?" lookup. We don't need coordinates, hierarchy, or geometry — just the normalised string and which languages it's valid in.

WOFPlacenameCache builds this index by streaming GeoJSON files via TextSpliterator (our line-delimited streaming library), extracting name:* properties, normalising them (case folding, accent stripping), and inserting into an in-memory Map keyed by the normalised form. The value is a Set of language codes the name appears in.

The normalisation matters because WOF stores "São Paulo" with the accent, but user input might arrive as "Sao Paulo" or "SAO PAULO". The index needs to match all three.
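A minimal sketch of that normalisation, assuming case folding plus accent stripping via Unicode decomposition is all that's needed. This is not WOFPlacenameCache itself; `normalisePlacename` and `addPlacename` are illustrative names, and the real index may normalise more aggressively.

```typescript
// Decompose accented characters, drop the combining marks, then lower-case.
function normalisePlacename(raw: string): string {
	return raw
		.normalize("NFD")
		.replace(/[\u0300-\u036f]/g, "")
		.toLowerCase();
}

// Index shape described above: normalised name -> set of language codes.
function addPlacename(index: Map<string, Set<string>>, rawName: string, language: string): void {
	const key = normalisePlacename(rawName);
	const languages = index.get(key) ?? new Set<string>();
	languages.add(language);
	index.set(key, languages);
}

const placenameIndex = new Map<string, Set<string>>();
addPlacename(placenameIndex, "São Paulo", "por");
addPlacename(placenameIndex, "Sao Paulo", "eng");

// All three spellings land on the same key.
const saoPauloHit = placenameIndex.get(normalisePlacename("SAO PAULO"));
console.log(saoPauloHit);
```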

Layer 2: per-placetype SQLite DB (`PlacetypeDataSource`)

For the reconciler, we need richer queries: "give me all localities named Houston with their parent_id chains" and "walk this locality's parent_id up to region — does it reach Texas?"

PlacetypeDataSource is a SQLite database per (placetype, language) combination. Schema:

CREATE TABLE records (
  id         INTEGER NOT NULL,
  src        TEXT NOT NULL,
  name       TEXT NOT NULL,
  preferred  TEXT,
  variant    TEXT,
  colloquial TEXT,
  abbr       TEXT,
  short      TEXT,
  parent_id  INTEGER,
  PRIMARY KEY (id, src, name)
);

One row per name variant. "Saint Petersburg" and "St. Petersburg" and "St Petersburg" are three rows for the same id, different name/variant/short columns. The reconciler can query any variant form and get the same parent_id chain, which is how we intend to solve the "not found" problem we hit in testing.

The Piscina pipeline (stalled, documented)

Processing 120K GeoJSON files into these DBs is an embarrassingly-parallel problem. Our commands/wof/prepare command uses Piscina (a Node.js worker-thread pool) to dispatch files across all available CPU cores. Each worker:

1. Reads a GeoJSON file.
2. Calls pluckPlacetypeSpec to extract the structured fields + all name variants per language.
3. (Should) upsert into the appropriate PlacetypeDataSource.

Step 3 currently targets Redis (a leftover from an earlier prototype). The migration to SQLite is documented but not yet complete. The design intent was in-memory SQLite per worker (zero disk I/O during the hot path) with a consolidation step at the end — but that never got past the design stage.

`AsyncSpliterator.asMany` — the file that got away

When the data arrives as a single bulk NDJSON dump rather than 120K files, the access pattern changes. Instead of "glob files, dispatch per-file," you want "split one huge file into N byte-range chunks, process each chunk in parallel."

AsyncSpliterator.asMany(source, delimiter, concurrency) was built for this case. Given a file handle and a desired concurrency, it seeks to N roughly-equal byte positions in the file, snaps each position to the nearest delimiter boundary (so no line gets split between workers), and returns N independent async iterators that each process their own byte range.

The analogy: you have a book with a million pages. Instead of having one person read cover-to-cover, you measure the book's thickness, divide it into N roughly-equal stacks, and hand each stack to a different reader. Each reader finds the nearest chapter boundary at their stack's start and end (the delimiter-snap), then processes independently.
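Here is the delimiter-snap idea reduced to an in-memory sketch. It is not the AsyncSpliterator.asMany implementation: it works on a byte array instead of a file handle, it always snaps forward rather than to the nearest boundary, and `splitAtDelimiters` is a name invented for this post.

```typescript
// Split `data` into up to `parts` [start, end) byte ranges, moving each cut
// forward until it sits just after a delimiter so no line is split in two.
function splitAtDelimiters(data: Uint8Array, delimiter: number, parts: number): Array<[number, number]> {
	const cuts: number[] = [0];
	for (let i = 1; i < parts; i++) {
		let pos = Math.floor((data.length * i) / parts);
		while (pos < data.length && data[pos - 1] !== delimiter) pos++;
		const last = cuts[cuts.length - 1] ?? 0;
		if (pos > last && pos < data.length) cuts.push(pos); // skip empty or duplicate cuts
	}
	cuts.push(data.length);

	const ranges: Array<[number, number]> = [];
	for (let i = 0; i + 1 < cuts.length; i++) {
		const start = cuts[i];
		const end = cuts[i + 1];
		if (start !== undefined && end !== undefined) ranges.push([start, end]);
	}
	return ranges;
}

// Four newline-terminated "lines" of increasing length (ASCII only).
const sampleText = "a\nbb\nccc\ndddd\n";
const sampleBytes = Uint8Array.from(Array.from(sampleText, (ch) => ch.charCodeAt(0)));
const twoWay = splitAtDelimiters(sampleBytes, 10, 2);
const threeWay = splitAtDelimiters(sampleBytes, 10, 3);
console.log(twoWay, threeWay);
```

Because every cut lands right after a delimiter, each range can be handed to a separate worker and parsed as complete lines.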

We built it, marked it @internal, and haven't exercised it at scale because the per-file path was sufficient for the repos we actually use. But when someone wants to process the full Geocode Earth SQLite distribution (3+ GB of admin data across all countries), this is the right primitive.

What's next

Three things, in priority order:

1. Finish the SQLite migration. The worker targets Redis; it should target PlacetypeDataSource. The pluckPlacetypeSpec output already matches the schema. The remaining work is plumbing, not design.
2. Wire PlacetypeDataSource into the reconciler. The concordance scoring currently uses the raw WOF spr SQLite table (from Geocode Earth's distribution). It should use our per-placetype/per-language DBs, which carry the name variants the raw table doesn't expose. This fixes the "Saint Petersburg not found" class of lookup failures.
3. Benchmark the in-memory-then-consolidate pattern. If 120K individual writes to the same few DB files from N concurrent workers bottleneck on SQLite's WAL writer (likely), the in-memory-SQLite-per-worker → ATTACH-and-merge pattern is the escape hatch. Whether it's actually needed depends on whether step 1 is fast enough without it.

So why put up with WOF?

Every geocoder needs a gazetteer. The choice is: pay for one (Google, HERE, Mapbox), use an open one (WOF, GeoNames, OSM Nominatim), or build your own from government sources (BAN, NAD, TIGER).

WOF is the best open option for hierarchy and multilingual names. But it's hard to use raw. The per-file layout, the flat namespace, the supersession chains, the parent_id: -1 holes — each one is a trap for a naive consumer.

The tooling we built (WOFPlacenameCache, PlacetypeDataSource, the Piscina pipeline, pluckPlacetypeSpec, AsyncSpliterator.asMany) is our attempt to close the gap between "WOF exists" and "WOF is usable as a geocoder component." It's not complete, but the architecture is sound and the incomplete pieces are documented.

If you're building a geocoder or any location-aware system and you need hierarchy + multilingual names from open data, WOF is probably your starting point. The gotchas above are the things we wish someone had told us when we started.

Where to look

Who's On First on GitHub — the source repos
Geocode Earth WOF distributions — pre-built SQLite files
Spelunker — the official WOF browser/explorer
docs/articles/concepts/whosonfirst-gotchas.md — the stable reference version of this article (data model, gotchas, tooling architecture)
docs/articles/concepts/wof-data-pipeline.md — our internal architecture doc for the ingest pipeline
docs/articles/concepts/resolver-and-wof.md — how the runtime resolver queries WOF
core/resources/whosonfirst/ — the TypeScript tooling source

Two voices arguing inside a model — a beginner-friendly debugging story

2026-05-24T00:00:00.000Z

If you found this via search

Mailwoman is an open-source address parser that runs in Node and the browser. It uses a small neural model to label address components ("350" = house number, "NY" = region, etc.). Try the live demo.

This post is a beginner-friendly debugging story — no ML background needed. If you just want the project status, see what ships today.

This is the third post in a series about a training problem we've been chasing. The first two were written for software engineers. This one is for someone who is just starting to learn about AI and machine learning — no jargon assumed, no math beyond high-school algebra. The point is to show you what real ML debugging looks like, using a problem we actually had this week.

If you've been programming for a while but ML feels opaque, this post is for you. The core technique we used — figuring out which of two instructions our model was listening to — turns out to be much more like ordinary debugging than the field usually makes it sound.

What we're building, in one paragraph

Mailwoman is a piece of software that reads address strings ("350 5th Avenue, New York, NY 10118") and turns them into structured place information ("this is in Manhattan, here are the coordinates, here's the postcode, etc."). It uses a small AI model to do the parsing. "Small" by AI standards: about 9 million numbers inside it. (For comparison: GPT-4 is rumoured to have over a trillion.)

We don't need a giant model because the task is narrow: addresses follow patterns, and we just need to identify which parts of a string are which (350 is a house number, 5th Avenue is a street, etc.).

What "training a model" actually looks like

Forget everything you've seen about AI in movies. Training a model is, mechanically, this:

You have a model with millions of numbers inside it (call them "weights"). At the start they're random.
You have a pile of example data — addresses with the correct answers labelled, like flashcards.
You show the model an address. It guesses what each part is.
You compare the guess to the correct answer. The difference is a number called loss: low loss means a good guess, high loss means a bad guess.
The training algorithm then tweaks the millions of internal numbers to make the loss a little bit smaller next time.
Repeat thousands or millions of times.

The "intelligence" of the model is just an enormous lookup table of patterns, refined slowly by 50,000 rounds of "you said it was a street name, but it was actually a postcode; here, nudge these specific numbers a tiny bit so you'll guess better next time."

If you've ever debugged a function by running it, looking at the wrong output, and tweaking one parameter at a time until the output came out right, you've done a single iteration of model training by hand.
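To make the loop concrete, here is a toy version with exactly one weight. It is nothing like Mailwoman's real training code: the "model" just multiplies its input by the weight, the flashcards are invented, and the weight starts at 0 instead of a random value so the run is repeatable.

```typescript
interface Flashcard {
	input: number;
	answer: number;
}

// Loss: the average squared difference between guess and correct answer.
function meanSquaredLoss(weight: number, cards: Flashcard[]): number {
	let total = 0;
	for (const card of cards) {
		const miss = weight * card.input - card.answer;
		total += miss * miss;
	}
	return total / cards.length;
}

// Repeat: measure the loss, work out which way to nudge, nudge a little.
function trainOneWeight(cards: Flashcard[], steps: number, learningRate: number) {
	let weight = 0;
	const lossTrace: number[] = [];
	for (let step = 0; step < steps; step++) {
		lossTrace.push(meanSquaredLoss(weight, cards));
		let gradient = 0;
		for (const card of cards) {
			gradient += 2 * (weight * card.input - card.answer) * card.input;
		}
		gradient /= cards.length;
		weight -= learningRate * gradient;
	}
	return { weight, lossTrace };
}

// The hidden rule behind these flashcards is "answer = 3 * input".
const toyCards: Flashcard[] = [
	{ input: 1, answer: 3 },
	{ input: 2, answer: 6 },
	{ input: 3, answer: 9 },
];
const toyResult = trainOneWeight(toyCards, 200, 0.05);
console.log(toyResult.weight);
```

The real model does the same thing with millions of weights at once, which is why it needs a GPU and hours rather than a blink.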

The "loss curve" you keep hearing about

People who train models stare at a chart called the loss curve all day. It looks like this:

loss
 ^
 |   X
 |    X
 |     X
 |      XX
 |        XXX
 |           XXXXX
 |                XXXXXXXXXX
 +----------------------------> step
   0      500      1000     1500

Each X is one round of training. Loss starts high (the model is randomly guessing) and decreases as the model learns. A good training run looks exactly like that — descending until it plateaus.

Now here's our actual loss curve from one of nine training runs we did this month:

loss
 ^                                  XXXX
 |                                 X    X
 |                                X      XX
 |                               X         XX
 |    X                         X            XX
 |     X                       X                X
 |      X                    XX                  X
 |       XX                 X                     ...
 |         XXX             X
 |            XXXXX       X
 |                 XXXXXXX
 +----------------------------> step
   0     500     800     1100

The model descends nicely for 500 steps (warmup), settles at a low loss for a bit — and then climbs back up. By the end, it's worse than when it started. We trained it on 50,000 examples and it got worse.

Every training engineer's heart sinks at this curve. It means something is wrong, and the model isn't telling us what. There's no stack trace. There's no exception. There's just a chart that says "I learned, and then I unlearned."

We saw this exact shape in nine different runs. Different learning speeds, different model sizes, different feature combinations. Every time: clean descent, then catastrophic climb.

The clue we'd been missing

To find a bug in a program, you usually narrow it down by ruling out parts of the code one at a time. ML debugging works the same way: you change one thing, retrain, look at the curve. But each "retrain" takes hours and costs real money on a rented GPU. You learn to be careful about which experiments are worth running.

For weeks we'd been ruling out hypotheses:

Maybe the learning rate is too high? (No: lowering it just delayed the climb.)
Maybe the model is too big? (No: we shrank it back to the previous size and the same thing happened.)
Maybe a new feature we added is destabilising it? (No: we turned it off, same problem.)
Maybe the data has a bug? (Couldn't rule out, expensive to check.)

Then somebody pointed out a thing we hadn't questioned: we were training the model with two different goals at once.

The model has two scoring systems. We'll call them Voice A and Voice B.

Voice A says: "How good are your guesses for each individual word? Did you tag '350' correctly as a house number? Did you tag 'NY' correctly as a region?"
Voice B says: "How sensible is your overall pattern? Is your sequence of tags structurally valid? Does it look like a real address?"

Both voices are useful. A working geocoder needs both per-word accuracy and sensible patterns. We'd been adding their feedback together (with Voice B's contribution scaled down to 5%) and using that combined signal to train the model.

The question we'd never asked: were Voice A and Voice B telling the model to do the same thing?

The five-minute diagnostic

Here's the part that surprised me about ML debugging — the technique we used could be explained to a curious teenager.

When you train a model, every weight inside it gets nudged in a particular direction based on the combined loss. That nudge is called a gradient. If gradients are big, the weight moves a lot per step; if they're small, it moves a little.

The two voices each contribute their own gradient. They get added together (with Voice B at 5%) to produce the final nudge.

So we asked: at the moment just before the model starts unlearning, which voice is doing most of the talking? We took a saved snapshot of the model from that moment, fed it a few example addresses, and measured the size of each voice's contribution to the gradient separately.

We expected the answer to be something like "Voice A is 20× louder than Voice B" — meaning Voice B was contributing almost nothing, which would mean the 5% scaling we'd set was actually appropriate.

What we got instead: Voice B's gradient was 16× LOUDER than Voice A's.

Wait. Voice B was supposed to be scaled to 5%. But the raw gradient was 16× larger than Voice A's. Multiply 5% by 16 and Voice B's effective contribution to the model's training was actually 80% of Voice A's. The hand-tuned scaling knob we'd been treating as "Voice B contributes lightly" was secretly producing "Voice B contributes nearly as much as Voice A."
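The arithmetic is worth seeing once in code. This sketch is not our actual probe: the two gradient vectors are invented so that their sizes come out 16× apart, and `gradientNorm` and `compareVoices` are illustrative names. It measures each voice's gradient size, then applies the 5% weight.

```typescript
// Size of a gradient: the square root of the sum of squared components.
function gradientNorm(gradient: number[]): number {
	return Math.sqrt(gradient.reduce((sum, value) => sum + value * value, 0));
}

// Compare Voice B to Voice A before and after the loss weight is applied.
function compareVoices(gradA: number[], gradB: number[], weightB: number) {
	const normA = gradientNorm(gradA);
	const normB = gradientNorm(gradB);
	return {
		rawRatio: normB / normA, // how loud B is before scaling
		effectiveRatio: (weightB * normB) / normA, // how loud B is in the combined nudge
	};
}

// Made-up gradients: A has size 0.5, B has size 8.
const voiceProbe = compareVoices([0.3, 0.4], [4.8, 6.4], 0.05);
console.log(voiceProbe);
```

With these numbers the raw ratio is about 16 and the effective ratio is about 0.8, the same "5% of 16×" calculation as above.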

The cooperative-vs-conflict picture

Here's the framing that made it all click.

Imagine you're a hiker on a foggy hill, trying to walk to the lowest point in the landscape. You can't see far, so you have two GPS devices that each tell you "go downhill, that way."

At the top of the hill (high loss), both GPSes agree: every direction is downhill, so they both point you roughly the same way. You make progress. Loss decreases.
As you descend into a specific valley (loss gets lower), the landscape becomes more detailed. Suddenly the two GPSes start disagreeing: Voice A says the valley floor is to the left; Voice B says it's to the right. They don't see the same valley.

When that happens, your hiking direction is mostly determined by whichever GPS is shouting louder. Voice B's raw signal was 16× louder than Voice A's, and even after the 5% scaling it was still about 80% as loud, so you stop following Voice A's directions and get dragged toward Voice B's, even though Voice A was correct about where the valley floor actually was. You climb out of one valley toward a different point that Voice B prefers, and Voice A's loss (the per-word accuracy) gets worse.

That's literally what happened to our model. Above loss 0.41 (high up on the hill), both voices agreed and the model descended cleanly. Below 0.41, they started disagreeing, and Voice B's outsized gradient pulled the model away from the basin Voice A had been guiding it toward. The model's per-word accuracy got worse and worse, which we saw as loss climbing back up.
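You can reproduce the shape of this with a single number standing in for the whole model. In the toy below, Voice A wants the weight at 1 and Voice B wants it at -1; both loss functions are invented, and nothing here comes from the real model. Far from both targets the voices agree, so Voice A's loss falls. Close in, they disagree, and the combined nudge drags the weight past Voice A's target.

```typescript
// Toy losses: Voice A is happiest at w = 1, Voice B at w = -1.
const lossVoiceA = (w: number): number => (w - 1) * (w - 1);
const gradVoiceA = (w: number): number => 2 * (w - 1);
const gradVoiceB = (w: number): number => 32 * (w + 1); // gradient of 16 * (w + 1)^2

// Gradient descent on lossA + weightB * lossB, recording Voice A's loss only.
function descend(start: number, steps: number, learningRate: number, weightB: number): number[] {
	let w = start;
	const voiceATrace: number[] = [];
	for (let step = 0; step < steps; step++) {
		voiceATrace.push(lossVoiceA(w));
		w -= learningRate * (gradVoiceA(w) + weightB * gradVoiceB(w));
	}
	return voiceATrace;
}

const conflictTrace = descend(5, 100, 0.05, 0.05); // Voice B scaled to 5%
const soloTrace = descend(5, 100, 0.05, 0); // Voice B silenced

console.log("lowest Voice A loss with B on:", Math.min(...conflictTrace));
console.log("final Voice A loss with B on:", conflictTrace[conflictTrace.length - 1]);
console.log("final Voice A loss with B off:", soloTrace[soloTrace.length - 1]);
```

With Voice B on, Voice A's loss dips close to zero and then climbs back up and stays there. With Voice B off, it keeps falling. Same data, same starting point; the only difference is the second voice.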

The fix

Once you understand what's happening, the fix is mechanical: silence Voice B during training. Don't let it contribute to the gradient at all.

But we still want Voice B's contribution somewhere, because it really does encode useful structural rules (no orphan tags, no invalid BIO sequences, BIO being the begin/inside/outside scheme that marks where each labelled span starts and continues). So we keep Voice B for inference (the moment when we actually use the trained model to parse an address) but not during training.

This is a one-line change in the code: when Voice B's weight is set to 0, don't even compute it. The training loop then runs purely on Voice A's gradient, which has been the well-behaved one all along.
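In spirit, the change looks something like the sketch below. It is a simplified stand-in, not the project's training loop: `trainingLoss` and the callback shape are invented, and the numbers are placeholders. The point is the early return, which means Voice B is never even evaluated when its weight is zero.

```typescript
// Combine the two voices, but skip Voice B entirely when its weight is 0.
function trainingLoss(voiceA: number, computeVoiceB: () => number, weightB: number): number {
	if (weightB === 0) return voiceA; // Voice B is not computed at all
	return voiceA + weightB * computeVoiceB();
}

// Count how often the (pretend-expensive) Voice B computation actually runs.
const voiceBStats = { calls: 0 };
const expensiveVoiceB = (): number => {
	voiceBStats.calls += 1;
	return 12.5;
};

const silencedLoss = trainingLoss(0.4, expensiveVoiceB, 0);
console.log(silencedLoss, voiceBStats.calls);
```

Passing Voice B as a function rather than a precomputed number is what makes the skip possible: a value you never ask for costs nothing and contributes no gradient.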

What we tell ourselves we learned

A few things stand out:

"Add two losses together with weights" sounds simple. It can be a disaster. Two loss functions can have wildly different gradient magnitudes even when their loss values look comparable. Multiplicative scaling on the loss does NOT produce balanced contributions to the optimiser. Watching the loss values fooled us for nine training runs.
The five-minute diagnostic was more valuable than the previous month of retraining experiments. Every "what if we change this knob and retrain" experiment cost hours. The gradient-norm probe cost five minutes and gave a sharper answer than any of them. It works because it asks a more fundamental question: not "what's the result," but "what's the model actually listening to?"
ML debugging is more like programming debugging than the field admits. Once you have a vocabulary for what's happening, the techniques are familiar: bisect, isolate, instrument, hypothesise, test. The hard part is finding the right vocabulary for what's actually happening inside the model. Once you have it, the bug is usually findable.
Cheap experiments first. A 5-minute probe should always run before a 25-hour retrain. We didn't think to run the probe earlier because nobody had told us it was a thing. Now we know.

Where to read more

The original failure post — written for engineers, has the loss curves and recipe details.
The bisect-by-elimination post — what we ruled out before the diagnostic.
The technical writeup — for engineers who want the gradient math and the cooperative-vs-conflict framing in detail.

If you're starting out in ML and any of this helped, the mailbox is open: contact@sister.software. We'd genuinely like to hear what gaps the existing intro material still leaves.

Update — it worked

We wrote this post while the CE-only experiment (Voice A only, with Voice B silenced during training) was still running. It passed. The model trained past step 2000, by which point every prior run had already diverged, with no loss climb at all. Final validation accuracy: 0.444, the best number any run in this project has ever produced. The full 50,000-step training run is now in progress.

Without the competing voice, the model settled deeper into its basin than any prior run could before being dragged out.

Four training runs, zero shipped weights — bisecting v0.5.0's divergence

2026-05-24T00:00:00.000Z

If you found this via search

Mailwoman is an open-source address parser. This post is a training log entry from May 2026 documenting the v0.5.0 divergence investigation. For current project status, see what ships today.

v0.5.0 was the fresh-slate ship: new tokenizer, expanded corpus, new architecture, new reconcile stage. The plan was to bundle several months of structural improvements into one big iteration and pay the cost once. Most of it landed clean. The classifier didn't.

This post walks through the four training attempts the v0.5.0 C-train made overnight, the bisect that ruled out three plausible explanations, and what we think the remaining culprit is. It's a sister piece to the v0.4.0 retrospective — same shape of failure, different diagnostic ladder.

What v0.5.0 shipped before the train started

Six threads merged to main before the C-train attempts began:

Thread A1 — sentencepiece tokenizer retrained on corpus-v0.4.0. Overall byte-fallback 36.7% → 18.2% on the multi-script eval; CJK 80% → 45.2%; Armenian / Devanagari to 0%. Halved on the eval fixture and validated against a real adversarial slice.
Thread B + B2 — corpus-v0.4.0 adds 4,771 kryptonite rows (NY-NY Steakhouse, Paris-Texas, Saint Petersburg FL) plus 73,316 transliteration pairs across CJK / Cyrillic / Hangul / Han / Armenian, all DeepSeek-generated and validated through a substring-match aligner that rejected ~1.1% of rows as misaligned LLM output.
Thread C-s — classifier code path with top-k inference and a phrase-prior input layer that conditions on Stage 2.7's proposed spans. Forward-pass tested on stub data; no full train.
Thread D-s — reconcile.ts joint decoder. Beam search over (span × tag × resolver candidate) with concordance scoring via WOF parent_id chains. The empty-parse trap was caught early and fixed with an inclusion log-bonus.
Thread E — @mailwoman/phrase-grouper workspace. Rule-based span proposer feeding Stage 2.7.
Thread F — verdict-smoke discipline. New --smoke-mode constant flag so the cosine-LR mask that hid v0.4.0's divergence cannot reoccur.
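
The substring-match validation mentioned under Thread B + B2 can be pictured with a short sketch. This is an illustration only, not the project's aligner: the function name `align_components`, the `(tag, text)` row shape, and the left-to-right search policy are all assumptions made for the example.

```python
from typing import Optional


def align_components(
    raw: str, components: list[tuple[str, str]]
) -> Optional[list[tuple[str, int, int]]]:
    """Locate each (tag, text) component verbatim in `raw`, left to right.

    Returns (tag, start, end) character spans, or None when any component
    cannot be found after the previous one. A None result means the
    generated row is rejected rather than repaired.
    """
    spans = []
    cursor = 0
    for tag, text in components:
        if not text:
            return None
        start = raw.find(text, cursor)
        if start == -1:
            return None
        end = start + len(text)
        spans.append((tag, start, end))
        cursor = end
    return spans


def reject_rate(rows: list[tuple[str, list[tuple[str, str]]]]) -> float:
    """Share of (raw, components) rows that fail alignment."""
    if not rows:
        return 0.0
    rejected = sum(1 for raw, comps in rows if align_components(raw, comps) is None)
    return rejected / len(rows)
```

A row whose labels paraphrase the input (for example a label of "TX" when the raw text says "Texas") fails the substring check and is dropped, which is the kind of misalignment a reject rate would count.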

C-train was the experiment that actually used all of it together for the first time.

The recipe we tried first

Going in we had a clean confirmation from the operator: hidden_size=384 (up from v0.4.0's 256), effective batch 128 via batch=16 grad_accum=8, constant LR, starting LR=1.5e-4 (the same lr v0.3.0 had to drop to), top-k inference and phrase-prior conditioning ON (PR #128). The corpus was A1 tokenizer + corpus-v0.4.0. Recipe knobs §1 (per_token CRF normalization) and §3 (class_weights) were carried over from the C-s scaffold: a host-side YAML-drafting decision, not an operator confirmation.

The 50-step constant-LR smoke passed cleanly. val_macro_f1 climbed 0.121 → 0.187 across 50 steps. The recipe looked stable.

Promoted to full. The full run diverged at step 1000.

Four attempts, one fingerprint

The pattern repeated, with each variant getting marginally further before the same shape of failure took over. All four runs use the same constant-LR schedule (mode A per VERDICT_SMOKES.md) and the same effective batch of 128.

v1: h384, §1+§3 ON,  lr=1.5e-4
  step  500: train_loss=0.69 (warmup end, LR plateau)
  step  600: train_loss=0.61 (settled)
  step  700: train_loss=1.49 (climb start)
  step 1000: train_loss=3.29 (killed)

v2: h384, §1+§3 ON,  lr=1e-4
  step  500: train_loss=0.90 (warmup end)
  step  900: train_loss=0.51 (best)
  step 1000: train_loss=0.69 (climb start)
  step 1200: train_loss=1.96 (killed)

v3: h384, §1+§3 OFF, lr=1.5e-4  ← v0.4.0-stable recipe
  step  500: train_loss=0.63 (warmup end)
  step  700: train_loss=0.41 (best ever — better than v1/v2)
  step  800: train_loss=1.21 (climb start)
  step  900: train_loss=1.97 (killed)

h256-bisect: h256, §1+§3 OFF, lr=1.5e-4, eff_batch=128
  step  500: train_loss=0.67 + val_macro_f1=0.311
  step 1000: train_loss=0.31 + val_macro_f1=0.399 (best ever)
  step 1050: train_loss=0.42 (climb start)
  step 1500: train_loss=1.85 + val_macro_f1=0.229 (killed)

The fingerprint is identical to v0.4.0's. Loss descends through warmup, settles for a few hundred steps near the bottom, then climbs back to its starting magnitude over 100-300 steps. val_macro_f1 (where we measured it) does the same: peaks around the time the loss bottoms out, then collapses.

The only thing that shifts between runs is how deep the loss gets before the climb starts. v1 bottomed at 0.61, v2 at 0.51, v3 at 0.41, h256-bisect at 0.31. Each successive variant trained better for longer, and then collapsed in exactly the same way.
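
The fingerprint can be read off the logged points mechanically. The sketch below is one way to do it, not the tooling used for the runs: `find_divergence` and the 1.3× climb threshold are assumptions chosen for illustration, and the curves are the four sets of logged points listed above.

```python
def find_divergence(curve, climb_factor=1.3):
    """Find the loss minimum and the first logged step where the climb begins.

    curve: list of (step, train_loss) pairs sorted by step.
    Returns (best_step, best_loss, climb_step). climb_step is the first
    logged step after the minimum whose loss exceeds climb_factor * best_loss,
    or None if the loss never climbs that far.
    """
    best_step, best_loss = min(curve, key=lambda point: point[1])
    for step, loss in curve:
        if step > best_step and loss > climb_factor * best_loss:
            return best_step, best_loss, step
    return best_step, best_loss, None


# Logged points copied from the four runs above.
RUNS = {
    "v1": [(500, 0.69), (600, 0.61), (700, 1.49), (1000, 3.29)],
    "v2": [(500, 0.90), (900, 0.51), (1000, 0.69), (1200, 1.96)],
    "v3": [(500, 0.63), (700, 0.41), (800, 1.21), (900, 1.97)],
    "h256-bisect": [(500, 0.67), (1000, 0.31), (1050, 0.42), (1500, 1.85)],
}

for name, curve in RUNS.items():
    best_step, best_loss, climb_step = find_divergence(curve)
    print(f"{name}: best={best_loss} at step {best_step}, climb starts at {climb_step}")
```

With the assumed threshold, the detected climb steps line up with the "climb start" annotations in the run logs. A different threshold would shift the result on denser curves.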

What the bisect ruled out

We ran three knob-changes between v1 and h256-bisect, each motivated by a different hypothesis. None of them held.

Learning rate isn't it. v0.4.0's bisect already showed that a factor-2 LR drop only buys a factor-1.3 step delay, ruling out "we picked too high an LR" as a full explanation. v1 → v2 confirmed the same shape: 1.5e-4 → 1e-4 moved the divergence from step 700 to step 1000. Same dynamic, just shifted later. The LR controls when, not whether.

Recipe knobs §1+§3 aren't it. v0.4.0's retrospective concluded the destabilizer was in the recipe, not the LR, and shipped with §1 (per_token CRF) and §3 (class_weights) OFF. v3 reverted those knobs and dropped back to LR=1.5e-4, the canonical v0.4.0-stable recipe. v3 trained better than v1/v2 (bottom of 0.41 vs 0.61/0.51), had zero GPU hangs (vs v2's six), and still diverged at step 900.

Hidden size isn't it. h256-bisect reverted the only architectural change still in the recipe: 384 → 256 hidden, 6 → 4 heads, 1536 → 1024 intermediate. With everything else at v0.4.0-shipped settings (eff_batch=128, LR=1.5e-4, §1+§3 OFF), this configuration is identical to v0.4.0's shipped recipe, except for two architectural pieces we haven't touched yet. h256-bisect was the best-performing run of the four (peak val_macro_f1=0.399, train_loss=0.31), and it diverged anyway.

The remaining suspects

Only two architectural changes from the proven-stable v0.4.0 baseline remain:

Phrase-prior input features (PR #128). The classifier's input embedding takes 10 extra per-token features encoding the phrase grouper's proposed spans — is_phrase_start / is_phrase_mid / is_phrase_end plus a one-hot for the proposed PhraseKind. New projection layer; new gradient pathway.
A1 tokenizer. New 48K vocab (up from v0.1.0's 16K), trained on corpus-v0.4.0 including the transliteration adapter. The embedding table is sized to that vocab and the model has never trained against it before.
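
To make the first suspect concrete, here is a sketch of what a 10-wide phrase-prior feature vector could look like. It is a guess at the shape, not the code from PR #128: the seven `PHRASE_KINDS` names, the `ProposedSpan` type, and the choice to set both the start and end flags on single-token spans are all hypothetical. The only facts taken from the post are the three position flags and the total width of 10.

```python
from dataclasses import dataclass

# Hypothetical PhraseKind inventory. The post only gives the total width (10),
# which implies 7 kinds alongside the 3 position flags.
PHRASE_KINDS = ["street", "locality", "region", "postcode", "country", "venue", "unit"]


@dataclass
class ProposedSpan:
    start: int  # first token index, inclusive
    end: int    # last token index, inclusive
    kind: str   # one of PHRASE_KINDS


def phrase_prior_features(num_tokens, spans):
    """Build one feature row per token.

    Layout: [is_phrase_start, is_phrase_mid, is_phrase_end, one-hot kind...].
    Tokens outside every proposed span keep an all-zero row.
    """
    width = 3 + len(PHRASE_KINDS)
    features = [[0.0] * width for _ in range(num_tokens)]
    for span in spans:
        kind_slot = 3 + PHRASE_KINDS.index(span.kind)
        for i in range(span.start, span.end + 1):
            row = features[i]
            if i == span.start:
                row[0] = 1.0
            if span.start < i < span.end:
                row[1] = 1.0
            if i == span.end:
                row[2] = 1.0
            row[kind_slot] = 1.0
    return features
```

Whatever the real layout is, rows like these pass through a new projection before joining the token embedding, which is why the post treats them as a fresh gradient pathway.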

Both are real architectural changes from v0.4.0. Either could plausibly produce a confident-wrong degenerate minimum that the loss bottoms out at and then escapes from. We can't tell from the curves alone — the shape is the same regardless.

The cheapest next bisect is phrase priors OFF (revert PR #128's input-layer features, keep A1 tokenizer). One YAML knob change, ~15min smoke + a partial train if smoke passes. That isolates whether the destabilizer is the phrase-prior projection or the new tokenizer's interaction.

If phrase-priors-off ALSO diverges, the next bisect is the A1 tokenizer itself: revert to v0.1.0's sentencepiece weights against the same corpus + recipe + h256. At that point we're back to v0.4.0's proven-stable shipping configuration; if even that diverges, the destabilizer is in corpus-v0.4.0's composition (the transliteration shards' distribution might be the issue, not the tokenizer trained on them).

A discipline lesson worth keeping

We caught a real bug in our verdict-smoke discipline along the way. The original 50-step smoke for v1 passed cleanly — train_loss descended, val_macro_f1 climbed, no NaN or spike. We promoted to full. Full diverged at step 1000 in a regime the smoke had never reached.

Two things had to change for the smoke to be a real predictor:

Smoke length matters. 50 steps captures only the warmup descent. The sustained-peak-LR regime where the recipe destabilises starts at step 500 in our schedule. Smokes need to be long enough to spend several hundred steps near peak LR — we ended up at 1500 steps as the floor.
Effective batch must match the full run. v3's smoke ran at batch_size=8 grad_accum=1 (eff_batch=8); the full run was eff_batch=128. The smoke said stable; the full diverged. The recipe's stability is batch-geometry-dependent. A smoke that doesn't reproduce the full-run gradient noise is a smoke that can't detect this class of failure.
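
Both requirements are mechanical enough to check before a smoke is allowed to act as a gate. The sketch below is one possible pre-flight check, not the contents of VERDICT_SMOKES.md: `RunConfig`, `smoke_gate_problems`, and the `min_peak_steps=1000` default (a 1500-step smoke minus the 500-step warmup) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    batch_size: int
    grad_accum: int
    max_steps: int
    warmup_steps: int
    lr_schedule: str  # "constant" or "cosine"

    @property
    def eff_batch(self) -> int:
        return self.batch_size * self.grad_accum


def smoke_gate_problems(
    smoke: RunConfig, full: RunConfig, min_peak_steps: int = 1000
) -> list[str]:
    """Return the reasons a smoke config cannot predict the full run (empty if none)."""
    problems = []
    if smoke.lr_schedule != "constant":
        problems.append("smoke must hold a constant LR after warmup")
    if smoke.eff_batch != full.eff_batch:
        problems.append(
            f"eff_batch mismatch: smoke={smoke.eff_batch}, full={full.eff_batch}"
        )
    peak_steps = max(smoke.max_steps - smoke.warmup_steps, 0)
    if peak_steps < min_peak_steps:
        problems.append(
            f"only {peak_steps} steps at peak LR, need at least {min_peak_steps}"
        )
    return problems
```

Under these assumptions, a smoke at batch_size=8 grad_accum=1 is flagged for its effective batch even when its length is fine, and a 50-step smoke is flagged for never leaving warmup.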

The constant-LR-mode discipline that landed in Thread F is still correct: it's what made the v0.5.0 destabilization observable at all instead of hiding under cosine decay. But the smoke configuration needs to mirror the full-run throughput characteristics on top of the LR schedule.

Both lessons will land in VERDICT_SMOKES.md as a follow-up. The current text describes constant-LR mode as the gate but doesn't say "your smoke's eff_batch must match the full run." That's an obvious-in-hindsight footgun.

What didn't get burned

Most of the v0.5.0 fresh-slate work survives this episode intact:

A1 tokenizer's byte-fallback wins are real. Multi-script eval went from 36.7% to 18.2%, with B2's targeted scripts (CJK, Cyrillic, Hangul, Han, Armenian) all hitting or beating their v0.1.0 leakage baselines. That's a tokenizer that's actually fit for non-Latin addresses. It works fine for inference even though the classifier we'd train on top of it diverges.
corpus-v0.4.0 is sound. corpus-audit passes; the substring-match validator caught the LLM's alignment failures; both adapter additions land cleanly via the new MANIFEST.json-driven harness. Whatever destabilises the train, it isn't a corpus integrity problem.
Stage 5 reconcile + phrase grouper + verdict-smoke discipline all shipped and live in main. They run on v0.4.0 weights right now and produce correct output on the kryptonite catalogue. They'll keep working when v0.5.0 weights land.
TRAINING_ENV.md documents the playpen container's ROCm bootstrap recipe so the next training spawn doesn't re-discover the wall. ~15min one-time setup that took us most of an hour to invent the first time.

The honest read

We spent roughly four hours of GPU time on four diverging training runs, learned what isn't the destabilizer, and stopped before we burned a fifth shot at it. v0.5.0's classifier weights aren't shipping today.

That's still a useful outcome. We have a smaller hypothesis space (two architectural pieces left to bisect), better infrastructure than we started with (TRAINING_ENV, MANIFEST harness fix, longer smoke discipline), and a concrete recommendation for v0.5.0.1's first move. The v0.4.0 model continues to ship in production; nothing downstream is blocked.

What we'd tell a future ourselves

Smoke geometry must match the full-run geometry. Constant-LR isn't enough if eff_batch differs. Either match the full-run batch shape or run two smokes — one at the smaller geometry for fast iteration, one at the full geometry as the actual gate.
The "destabilizes a few hundred steps after warmup ends" fingerprint isn't unique to v0.4.0. It's appearing in v0.5.0 too with a different recipe. Whatever it is, it's a deeper issue with the dual-loss landscape under sustained peak LR than either retrospective has so far named.
Plan for divergence retries in the time budget. A single full-train shot is rarely the experiment that ships. v0.4.0 needed five runs; v0.5.0 has needed four so far with at least two more bisects ahead. Realistic v0.X.0 release cadence is probably 8-12 training runs per cycle, not one.
Operator-side and host-claude-side recipe knobs need to be distinguished early. §1+§3 entered the v0.5.0 recipe by being in the C-s scaffold YAML — a host-claude inheritance decision, not an operator confirmation. That cost us v3 in the bisect ladder.

Where to look

docs/articles/plan/v0-5-0-shipped.md — what landed and what didn't in the v0.5.0 bundle
docs/articles/plan/reference/VERDICT_SMOKES.md — the smoke discipline (with the eff_batch lesson pending)
docs/articles/plan/reference/TRAINING_ENV.md — playpen container ROCm bootstrap
Diverged train CSVs at c-train-full-{DIVERGED-lr1.5e4,v2-watchdog-DIVERGED-step1200,v3-DIVERGED-step900,h256-bisect}.csv
v0.4.0 retrospective — the sister piece

Next: phrase-priors-off bisect. If that lands a stable train, we ship v0.5.0 weights without phrase-prior conditioning and pick up the priors as a v0.5.1 follow-up. If it doesn't, we revert the A1 tokenizer and confirm v0.4.0-shipped configuration trains cleanly on corpus-v0.4.0 — which would isolate the destabilizer to corpus distribution effects from B2's transliteration mass.

Five training runs, one shipped checkpoint — what we learned from v0.4.0

2026-05-23T00:00:00.000Z

If you found this via search

Mailwoman is an open-source address parser. This post is a historical retrospective on the v0.4.0 training campaign (May 2026). For current project status, see what ships today.

@mailwoman/neural-weights-en-us@v0.4.0 (and the fr-fr sibling) shipped today as packaged artifacts (the npm publish is a separate step we do by hand). It is a mixed-result release: one clear win on fine-grained labels, two regressions on coarse labels that turned out to be mostly artifacts of how we measured. Almost everything we set out to do — combine three orthogonal training improvements into one ship — was empirically falsified by a divergence pattern we hadn't seen before.

This is a writeup of how the campaign went. We're publishing it for two reasons: to be honest about what the headline numbers mean, and because the way the failures stacked up is worth thinking about if you train your own NER-style models.

What v0.4.0 was supposed to do

v0.3.0 had shipped with a known regression on coarse labels (country, region, locality) — the cost of expanding the label vocabulary from 15 to 21 BIO classes without enough training steps. Issue #116 named six work areas for v0.4.0:

Per-token CRF NLL normalization — eliminate the hand-tuned crf_loss_weight=0.05 knob by scaling the CRF loss to per-token magnitude so it sums cleanly with cross-entropy.
Longer training — v0.3.0 early-stopped at step 1800 of 50K; the v0.4.0 floor was step 5000.
Class-weighted cross-entropy — pull softmax mass back onto the coarse classes the 21-label expansion had diluted.
Source-weight rebalance — drop NAD's per-sample weight (it had ended up at ~52% of the sampled corpus); promote the WOF admin sources to compensate.
JS-side Viterbi decoder + label vocabulary loading from model-card.json — runtime cleanup.
Reuse corpus-v0.3.0 — no rebuild needed.

Items 5 and 6 were pure engineering; they landed cleanly the day before training started. The contested ones were 1, 3, and 4: the recipe changes that actually touch the loss surface.

What actually happened

Five training runs. Three of them on different learning rates with the same full recipe; two of them as ablations dropping one item at a time. All five diverged. The fingerprint was distinctive: training loss dropped monotonically through a long warmup, plateaued at the bottom for several hundred steps, then spiked back up to its starting magnitude over 50-150 steps. Validation macro-F1 mirrored the train loss: it climbed to a peak around the LR's peak step, then collapsed to roughly the random-output baseline.

The collapse step shifted with the learning rate:

Learning rate	Collapsed at step
5e-4 (target)	750
3e-4	1000
1.5e-4 (v0.3.0 LR)	2000

Three runs each at a different LR, each diverging in the same shape, with the divergence delayed proportionally — but only roughly. A factor-2 LR drop bought a factor-1.3 step delay, not factor-2. That ruled out "we just picked too high an LR" as the explanation. The destabilizer was in the recipe, not the learning rate.

So we ran the ablations the issue had prescribed:

Ablation	LR	Result
Drop §1 (CRF norm)	5e-4	Diverged step 1000
Drop §3 (class weights)	5e-4	Diverged step 1000

Identical failures. At lr=5e-4 neither single-knob revert was enough, meaning lr=5e-4 was structurally unreachable for the codebase's dual-loss landscape regardless of which knob we touched.

We dropped back to the safe LR (1.5e-4, the same lr v0.3.0 had been forced down to) and ran a three-cell orthogonal matrix:

Recipe	Peak macro-F1	Verdict
§4 only (source rebalance)	0.419	Pass
§3 + §4 (class weights + source)	0.428	Best — pass
§1 + §4 (CRF norm + source)	—	Fail

The §3+§4 verdict-smoke peaked at 0.428 — better than v0.3.0's final 0.36 by a comfortable margin. So we promoted that recipe to the full 50K-step run.

It diverged at step 2250. Same fingerprint as the full §1+§3+§4 recipe at the same LR.

The meta-bug in the smoke framework

The verdict-smoke ran each ablation for 3000 steps with a cosine learning-rate schedule. With max_steps=3000 and warmup_steps=1000, the LR peaks around step 1000 and is back near zero by step 2750. The smoke's "pass" criterion (macro-F1 stable across the last three evals past step 2000) was actually measuring stability under a near-zero learning rate. The full 50K run kept the LR near its peak for thousands of steps. That sustained-peak exposure was where the destabilization happened.

The smoke wasn't testing what we thought it was testing. By the time it would have noticed the divergence, its own LR schedule had already saved it.

We didn't see this coming. The fix for future smokes is to use a constant LR for the verdict window, or to set max_steps large enough that the cosine tail doesn't dominate (something like 10000 keeps LR > 60% of peak through the relevant range).
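
The arithmetic behind that is easy to reproduce. The sketch below assumes the common schedule shape of linear warmup followed by cosine decay to zero at max_steps, with no LR floor. The post does not spell out the exact formula, so treat the function as an assumption rather than the project's scheduler.

```python
import math


def cosine_lr_fraction(step: int, max_steps: int, warmup_steps: int) -> float:
    """Fraction of peak LR at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


# The 3000-step smoke: the verdict window sits on the cosine tail.
for step in (1000, 2000, 2750):
    print("smoke", step, round(cosine_lr_fraction(step, 3000, 1000), 3))

# A 10000-step schedule over the same early window.
for step in (1000, 2000, 3000):
    print("long", step, round(cosine_lr_fraction(step, 10000, 1000), 3))
```

Under this assumed formula the 3000-step smoke is down to half of peak LR at step 2000 and about 4% at step 2750, while the 10000-step schedule is still at roughly 88% of peak at step 3000.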

What we shipped

The §4-only recipe. Source rebalance layered on top of v0.3.0's existing dual-loss recipe, at v0.3.0's safe LR. It's the only thing that stayed clean through both a verdict smoke AND a full 50K run.

The shipped checkpoint is v0_4_0-stableLR-source-only/step-002200. Architecture is unchanged from v0.3.0 (256-dim, 6 layers, 9M params). The label vocabulary is unchanged (the same 21 BIO classes). The only thing that's different is which shards the training loader oversamples.

The honest read on the eval numbers

Per-tag F1 on golden v0.1.2 (4535 entries):

Tag	v0.4.0	v0.3.0	Δ
country	0.21	0.28	−0.07
region	0.19	0.18	+0.01
locality	0.27	0.27	flat
postcode	0.69	0.76	−0.07
venue	0.39	0.39	flat
street	0.30	0.27	+0.03
house_number	0.79	0.78	+0.01

Macro raw average: 0.357 vs 0.293. Two regressions on coarse labels, two small improvements on fine labels.

This is where it would have been easy to ship the headline as "v0.4.0 mostly regressed" and walk away. We instead bucketed the 1217 postcode false-negatives and 194 country false-negatives into categories, by manually inspecting the differences between gold and prediction.

The picture changes meaningfully:

Country false-negatives: 92% are adversarial transliteration entries — golden has English country names but the raw input is mixed-script. Examples:

بار نون وایومینگ, Wyoming, United States of America   →  pred: "yoming, United Sta"
サーモポリス, WY, United States of America              →  pred: ", WY, United State"
France, Lozère, ՍԵՆՏ-ԱԼԲԱՆ-ՍՅՈՒՐ-ԼԻՄԱՆՅՈԼ              →  pred: "" (empty)

These are v0.3.0's documented known-failure modes. The byte-fallback tokenizer treats non-Latin scripts as the same opaque sequence, and the model gives up on the prefix. We already knew v0.3.0 was bad at these. The v0.4.0 weights didn't change anything about this slice, so it isn't a real recipe regression. After excluding adversarial inputs, country FN drops from 194 to roughly 16. The −0.07 country regression is mostly a golden-set adversarial-weighting artifact.

Postcode false-negatives split into four buckets:

Category	Share	Example
Empty prediction	65%	`Paris 75008` → model emits nothing for postcode
Non-Latin transliteration	18%	Same v0.3.0 failure mode
House number confused for postcode	11%	`47110 Sainte-Livrade-sur-Lot, 22 Rue Jasmin` → predicts `22`
BIO span boundary slip	6%	`LE TRÉPORT, 76470` → predicts `", 7647"`
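
The bucketing in the post was done by manual inspection. For readers who want a first pass they can automate, the sketch below shows crude heuristics for the same four categories. It is not diagnose_regression.py: the function names, the order of the checks, and the code-point cutoff used to detect non-Latin input are all assumptions, and some cases will land in the wrong bucket (a short house number that happens to be a substring of the postcode, for instance).

```python
import re


def has_non_latin(text: str) -> bool:
    """Rough check: any letter beyond the Latin Extended-B block (U+024F)."""
    return any(ch.isalpha() and ord(ch) > 0x024F for ch in text)


def bucket_postcode_fn(raw: str, gold: str, pred: str) -> str:
    """Assign one postcode false-negative to a coarse category."""
    if has_non_latin(raw):
        return "non_latin_transliteration"
    if not pred.strip():
        return "empty_prediction"
    # Strip leading/trailing punctuation and spaces before comparing.
    core = re.sub(r"^\W+|\W+$", "", pred)
    if core and (core in gold or gold in core):
        return "boundary_slip"
    if core.isdigit():
        return "house_number_confusion"
    return "other"
```

Run over the three Latin-script examples in the table, these heuristics give empty_prediction, house_number_confusion, and boundary_slip respectively. Anything landing in "other" still needs a human look.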

The empty-prediction slice is the real story. NAD's downweight was the most aggressive change in the §4 source rebalance. NAD carried a lot of "postcode comes first" patterns (47110 Sainte-Livrade-sur-Lot, ND 58701, 44th Ave), and reducing its share removed that positional exposure. The model now defaults to tagging mid-position numeric tokens as house_number instead of postcode.

That's a real regression, and fixing it is a v0.4.1 target. It is not the problem the headline regression number suggests. The 6% boundary-slip slice is a different bug: the model gets the tag right but emits a span that includes the preceding comma and space. That's a decoder fix, no retraining required, and it has already landed on main as commit c72ab4c. The decoder now trims leading and trailing non-word characters from each span.
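
The trimming idea fits in a few lines. The shipped fix lives in the JS decoder, so the Python below is a stand-in that shows the technique, not a port of commit c72ab4c. The name `trim_span` and the use of `str.isalnum` as the "word character" test are assumptions.

```python
def trim_span(text: str, start: int, end: int) -> tuple[int, int]:
    """Shrink the half-open span [start, end) until both edges are alphanumeric.

    A span made only of punctuation and whitespace collapses to an empty span.
    """
    while start < end and not text[start].isalnum():
        start += 1
    while end > start and not text[end - 1].isalnum():
        end -= 1
    return start, end


text = "LE TRÉPORT, 76470"
start, end = trim_span(text, 10, 16)  # raw span covers ", 7647"
print(repr(text[start:end]))
```

Trimming removes the comma and space but cannot restore a character the span never covered, so a prediction that also stops one digit short still needs the model (or a separate rule) to get the right edge correct.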

What didn't get fixed (the v0.4.1 list)

The two destabilizers — §1 per-token CRF normalization and §3 class-weighted cross-entropy — are deferred. Both individually look like reasonable training-side improvements; both, at this LR + this loss landscape, made the model find a confident-wrong degenerate minimum after several hundred post-warmup steps. The sanity-check pass over model.py and crf.py ruled out implementation bugs: the per_token reduction is mathematically nll.sum() / total_tokens.clamp(min=1), and class_weights enters via PyTorch-standard cross_entropy(weight=...). The destabilization is a real recipe interaction.
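
For readers unfamiliar with the reduction, the sketch below restates it with plain Python lists standing in for tensors. The real code operates on PyTorch tensors in crf.py. The function names and the toy numbers here are assumptions for illustration.

```python
def crf_nll_per_token(nll_per_sequence, tokens_per_sequence):
    """Plain-Python mirror of nll.sum() / total_tokens.clamp(min=1).

    nll_per_sequence: one CRF negative log-likelihood per sequence in the batch.
    tokens_per_sequence: the number of real (unpadded) tokens in each sequence.
    """
    total_tokens = max(sum(tokens_per_sequence), 1)  # clamp(min=1) guards empty batches
    return sum(nll_per_sequence) / total_tokens


def crf_nll_per_sequence_mean(nll_per_sequence):
    """The unnormalized alternative: average the sequence-level NLLs."""
    return sum(nll_per_sequence) / max(len(nll_per_sequence), 1)


# Toy batch: longer sequences carry proportionally larger sequence-level NLL.
nll = [12.0, 30.0, 6.0]
lengths = [8, 20, 4]
print(crf_nll_per_token(nll, lengths))  # 48 / 32
print(crf_nll_per_sequence_mean(nll))   # 48 / 3
```

The per-token value sits on the same scale as a per-token cross-entropy, which is the motivation for dropping a hand-tuned weight like crf_loss_weight=0.05. The sequence-level mean grows with address length and needs that kind of scaling to sum sensibly with cross-entropy.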

The leading hypothesis at the v0.4.0 boundary is that some adapter slice in corpus-v0.3.0 is producing high-variance gradients that the per-token-normalized CRF still can't dampen. We built the corpus-audit tool during the campaign (it measures per-source shard distribution against the training config's source_weights, with concentration warnings), and the v0.4.1 starting point will be running gradient-norm probes per adapter to find what's causing the spike.

Three orthogonal threads for v0.4.1:

Source-weight tweak (bump NAD partway back to recover positional exposure) + synthesis pass over component-order permutations.
Corpus-side investigation of which adapter slice destabilizes CRF gradients.
Schedule + class-weight ratio redesign (constant-LR smokes, milder weight ratios).

What we'd tell a future ourselves

A few things this campaign made obvious that we'd want to put on the shelf for the next iteration:

Verdict smokes need to test the same conditions as the full run. If the full run will spend thousands of steps near peak LR, the smoke needs to spend several thousand steps near peak LR too. Cosine decay is a perfectly reasonable training schedule and a terrible verdict schedule.
Look at failure modes before trusting the F1 delta. The country regression looked real until we bucketed the failures. 92% of the "regression" was the golden set holding v0.4.0 accountable for v0.3.0's known failure modes. Always categorize before reporting.
One change at a time. v0.4.0 stacked three orthogonal training changes — per-token CRF, class weights, source rebalance — and shipped them together. The campaign spent most of its time un-stacking them. Next time, ship one per release. The risk asymmetry (one change destabilizing, vs three changes each needing isolation) just isn't worth it.
Diagnostic tooling early. The corpus-audit and diagnose_regression.py tools we built during the campaign would have saved most of v0.3.0's investigation time if they'd existed earlier. We're keeping them in the tree.

Where to look

Issue #116 — the original work plan
PR for the v0.4.0 ship branch — 10 commits with the campaign retrospective in the merge body
docs/articles/plan/phases/PHASE_2_training.md — the canonical iteration log with the full campaign narrative
corpus-python/scripts/diagnose_regression.py — the per-tag FN/FP bucketer

Next: v0.4.1 scope discussion. The empty-pred slice on postcodes is the highest-confidence single-target fix. The §1 CRF investigation is the higher-risk, higher-reward thread. We'll likely run both in parallel.
