<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://mailwoman.sister.software/blog</id>
    <title>Mailwoman log</title>
    <updated>2026-06-09T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://mailwoman.sister.software/blog"/>
    <subtitle>Mailwoman Blog</subtitle>
    <icon>https://mailwoman.sister.software/img/favicon-32.png</icon>
    <rights>Copyright © 2026 Sister Software.</rights>
    <entry>
        <title type="html"><![CDATA[A lookup table scored 100%. We shipped the model anyway.]]></title>
        <id>https://mailwoman.sister.software/blog/the-lookup-table-that-almost-fooled-us</id>
        <link href="https://mailwoman.sister.software/blog/the-lookup-table-that-almost-fooled-us"/>
        <updated>2026-06-09T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[This morning we published a post that ended with a tidy rule: some address tags don't]]></summary>
        <content type="html"><![CDATA[<p>This morning we published a post that ended with a tidy rule: some address tags don't
want a neural network, they want a lookup table. Country names are a closed list in a
known position. Our deterministic matcher scored a perfect 100 on the eval. The
retrained model scored a mess. Case closed, we wrote.</p>
<p>By the afternoon we'd reopened the case, and the verdict flipped — hard enough that
we've retracted the morning post rather than leave the wrong conclusion lying around
for someone to cite. This is the story of how a perfect score nearly talked us out of
the entire premise of the project.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-score-was-real-the-fight-was-rigged">The score was real. The fight was rigged.<a href="https://mailwoman.sister.software/blog/the-lookup-table-that-almost-fooled-us#the-score-was-real-the-fight-was-rigged" class="hash-link" aria-label="Direct link to The score was real. The fight was rigged." title="Direct link to The score was real. The fight was rigged." translate="no">​</a></h2>
<p>Here's what the morning's comparison actually was. In one corner: a flat lookup,
matching the trailing chunk of an address against the ISO country list. In the other: a
model we had retrained on a synthetic shard where <em>every single row</em> ended in a country.
That model learned exactly what we taught it — "the last thing in an address is a
country" — and started promoting cities and states to nationhood. Precision: 23%.</p>
<p>And the referee? An eval with no homographs in it. Not one "Georgia." Not one "CA."
Fifty-four addresses where the trailing token was never ambiguous, which is to say,
fifty-four addresses where a lookup table cannot lose.</p>
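<p>To make the morning's matcher concrete, here is a minimal sketch of a trailing-chunk lookup. The tiny country set, the function name <code>trailingCountry</code>, and the street addresses are ours for illustration, not Mailwoman's actual code. Note what it never does: look at anything but the last chunk.</p>

```typescript
// Hypothetical miniature of a country gazetteer; a real one would hold the full ISO list.
const COUNTRY_NAMES = new Set(["georgia", "jordan", "lebanon", "france", "jamaica"])

// Flat lookup: take the text after the last comma and test membership. No context is read.
function trailingCountry(address: string): string | null {
  const chunks = address.split(",").map((chunk) => chunk.trim())
  const last = chunks[chunks.length - 1] ?? ""
  return COUNTRY_NAMES.has(last.toLowerCase()) ? last : null
}

// A real country in trailing position: the lookup is right.
console.log(trailingCountry("12 Rustaveli Ave, Tbilisi, Georgia"))
// A US state in trailing position: the lookup gives the same answer, and is wrong.
console.log(trailingCountry("75 5th St NW, Atlanta, Georgia"))
// Nothing on the list: the lookup abstains.
console.log(trailingCountry("1 Main St, Springfield"))
```

<p>An eval with no second case in it can only ever exercise the first and third, which is how a table like this walks away with a perfect score.</p>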
<p>A crippled model, an unloseable eval, and a perfect score. We looked at that 100% and
wrote down a design principle. You've done this too — a benchmark hands you a clean
number, the number agrees with the architecture you were already tempted by, and the
question "what exactly did this measure?" quietly leaves the room.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-objection-that-reopened-it">The objection that reopened it<a href="https://mailwoman.sister.software/blog/the-lookup-table-that-almost-fooled-us#the-objection-that-reopened-it" class="hash-link" aria-label="Direct link to The objection that reopened it" title="Direct link to The objection that reopened it" translate="no">​</a></h2>
<p>The pushback, when it came, was about the soul of the thing. Mailwoman is a model
system. The entire bet is that a human reads "Atlanta, Georgia" and "Tbilisi, Georgia"
and resolves them without a rulebook, so a context-reading model should too. A lookup
table can't do that. It needs a hand-coded guard for every collision — Georgia, Jordan,
Lebanon, CA — and a growing list of exceptions is precisely the disease we left
rules-based parsing to escape.</p>
<p>So we did what we should have done in the morning: gave the model a fair fight.</p>
<p>We rebuilt the training shard with the homographs <em>in</em> it, both ways. "Tbilisi, Georgia"
labeled as a country, "Atlanta, Georgia 30309" labeled as a state, the same surface
form pulling in opposite directions until the only way to win is to read the
neighbors. We added addresses with no country at all, so abstaining stays on the menu.
Then we built the eval the morning's comparison never had: Paris, Texas against Paris,
France; Kingston, New York against Kingston, Jamaica; person-named countries; the works.</p>
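<p>A contrast shard of the kind described above might look like the sketch below. The types, the four rows, and the helper names are hypothetical; the real shard is synthetic data at a much larger scale. The point is only that the same surface token shows up under opposite labels.</p>

```typescript
type Tag = "country" | "region" | "locality" | "postcode"
type Span = { text: string; tag: Tag }
type Example = { spans: Span[] }

// Render an example back to the surface string the model would see.
function surface(example: Example): string {
  return example.spans
    .map((span, i) => (i === 0 ? "" : span.tag === "postcode" ? " " : ", ") + span.text)
    .join("")
}

// Hypothetical rows: same token, opposite labels, and some rows with no country at all.
const contrastShard: Example[] = [
  { spans: [{ text: "Tbilisi", tag: "locality" }, { text: "Georgia", tag: "country" }] },
  { spans: [{ text: "Atlanta", tag: "locality" }, { text: "Georgia", tag: "region" }, { text: "30309", tag: "postcode" }] },
  { spans: [{ text: "Paris", tag: "locality" }, { text: "France", tag: "country" }] },
  { spans: [{ text: "Paris", tag: "locality" }, { text: "Texas", tag: "region" }] },
]

// Tokens that appear under more than one tag are the ones that force the model to read context.
function ambiguousTokens(shard: Example[]): string[] {
  const tagsByText: { [text: string]: Tag[] } = {}
  for (const example of shard) {
    for (const span of example.spans) {
      const seen = tagsByText[span.text] ?? []
      if (seen.indexOf(span.tag) === -1) seen.push(span.tag)
      tagsByText[span.text] = seen
    }
  }
  return Object.keys(tagsByText).filter((text) => (tagsByText[text] ?? []).length > 1)
}

console.log(contrastShard.map(surface))
console.log(ambiguousTokens(contrastShard))
```

<p>A check like <code>ambiguousTokens</code> is a cheap way to confirm a shard actually contains disagreement before spending a training run on it.</p>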
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-fair-fight">The fair fight<a href="https://mailwoman.sister.software/blog/the-lookup-table-that-almost-fooled-us#the-fair-fight" class="hash-link" aria-label="Direct link to The fair fight" title="Direct link to The fair fight" translate="no">​</a></h2>
<p>The retrained model's first result: country recognition went from literally zero to
alive — at <strong>100% precision, with zero over-fires</strong>. Not one city promoted to a nation.
Not one "Georgia" guessed wrong. Region accuracy <em>improved</em> while the new tag came
online. The contrast pairs did exactly what the theory said: the model learned that the
label is contextual, because we finally showed it contexts that disagree.</p>
<p>What it missed, it missed honestly: Eswatini. Timor-Leste. Bhutan. Countries the
training data mentions a handful of times. That failure mode is recognition, a
vocabulary problem, and vocabulary is what gazetteers are for.</p>
<p>Which is where the lookup table re-enters the story — demoted. It doesn't get to be
the judge anymore; it gets to be a witness. We feed gazetteer membership into the model
as a per-token clue: <em>this word is on the country list; this word is on two lists, so
pay attention.</em> The model still rules on every tag. Add Liechtenstein to the gazetteer
tomorrow and the clue fires with no retrain, because the knowledge lives outside the
weights. The morning's matcher survives intact, doing the one job it was always
qualified for: knowing what's on the list. Reading the room was never on its résumé.</p>
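<p>As a rough sketch of the witness role, gazetteer membership can be computed per token and handed to the model alongside the text. The list contents, the <code>TokenClue</code> shape, and the function name are assumptions for illustration; how the clue is actually encoded for the model is not shown here.</p>

```typescript
// Hypothetical gazetteer lists; in practice these would be loaded from data, not hard-coded.
const gazetteer = {
  country: new Set(["georgia", "jordan", "lebanon"]),
  region: new Set(["georgia", "texas", "new york"]),
}

type TokenClue = { token: string; onCountryList: boolean; onRegionList: boolean; ambiguous: boolean }

// Membership is looked up at inference time, so editing a list changes the clue
// without touching the model's weights.
function gazetteerClues(tokens: string[]): TokenClue[] {
  return tokens.map((token) => {
    const key = token.toLowerCase()
    const onCountryList = gazetteer.country.has(key)
    const onRegionList = gazetteer.region.has(key)
    // "On two lists, so pay attention."
    const ambiguous = onCountryList ? onRegionList : false
    return { token, onCountryList, onRegionList, ambiguous }
  })
}

console.log(gazetteerClues(["Atlanta", "Georgia"]))

// Adding an entry makes the clue fire on the next call, with no retrain.
gazetteer.country.add("liechtenstein")
console.log(gazetteerClues(["Vaduz", "Liechtenstein"]))
```

<p>The clue never assigns a tag. It only reports what is on the list, and the model still makes the call.</p>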
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-lesson-were-keeping">The lesson we're keeping<a href="https://mailwoman.sister.software/blog/the-lookup-table-that-almost-fooled-us#the-lesson-were-keeping" class="hash-link" aria-label="Direct link to The lesson we're keeping" title="Direct link to The lesson we're keeping" translate="no">​</a></h2>
<p>The seductive thing about a deterministic component is that it cannot be wrong in the
cases you thought to test. The treacherous thing is the same sentence with the
emphasis moved: it cannot be <em>right</em> in the cases you didn't. Our 100% was an artifact
of an eval that only contained the easy half of the problem.</p>
<p>When a benchmark tells you the simple thing beats the learned thing, before you
celebrate, check who the learned thing was that day, and check what the benchmark left
out. Sometimes the simple thing genuinely wins. Ours lost the rematch the moment the
hard cases showed up — and the model, given one honest shot at the data, did the thing
we built it to do.</p>
<p>Trust the model. Feed it better.</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Neural classifier" term="Neural classifier"/>
        <category label="parsing" term="parsing"/>
        <category label="evaluation" term="evaluation"/>
        <category label="Model training" term="Model training"/>
        <category label="Architecture" term="Architecture"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The right name in the wrong state]]></title>
        <id>https://mailwoman.sister.software/blog/2026/06/08/the-right-name-in-the-wrong-state</id>
        <link href="https://mailwoman.sister.software/blog/2026/06/08/the-right-name-in-the-wrong-state"/>
        <updated>2026-06-08T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Our resolver scored 93.7% on the metric we'd been quoting for months. On the same addresses, its median answer was 326 kilometers from the truth. Both numbers were correct. Only one of them was honest.]]></summary>
        <content type="html"><![CDATA[<p>Our resolver scored 93.7% on the metric we'd been quoting for months. On the same addresses, its median answer was 326 kilometers from the truth.</p>
<p>Both numbers are correct. That's the uncomfortable part.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="a-metric-that-reads-the-label-and-never-checks-the-map">A metric that reads the label and never checks the map<a href="https://mailwoman.sister.software/blog/2026/06/08/the-right-name-in-the-wrong-state#a-metric-that-reads-the-label-and-never-checks-the-map" class="hash-link" aria-label="Direct link to A metric that reads the label and never checks the map" title="Direct link to A metric that reads the label and never checks the map" translate="no">​</a></h2>
<p>When the resolver turns a parsed address into a place, we used to grade it one way: did the place it picked carry the same <em>name</em> as the gold answer? Gold says the locality is "Sheldon", resolver says "Sheldon", that's a point. It's a reasonable-sounding check, and it is wrong in a way that took us months to see. It can only fail when the name is wrong, and the name is almost never wrong.</p>
<p>There are ten places called "Sheldon" in the United States. "New York" is a city and a state and a village 280 kilometers apart. "Washington" is a town in most states you can name. When you grade by name, every one of those is a tie, and the resolver gets full marks for picking <em>any</em> of them. The metric was answering "is this the right word?" when the only question that matters is "is this the right place on Earth?"</p>
<p>So we built a harness that asks the second question, and pointed it at the one slice of data where it would tell the truth.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="leakage-free-or-its-just-a-memory-test">Leakage-free, or it's just a memory test<a href="https://mailwoman.sister.software/blog/2026/06/08/the-right-name-in-the-wrong-state#leakage-free-or-its-just-a-memory-test" class="hash-link" aria-label="Direct link to Leakage-free, or it's just a memory test" title="Direct link to Leakage-free, or it's just a memory test" translate="no">​</a></h2>
<p>The honest slice matters as much as the honest metric. Our model trains on a corpus that covers the same towns the eval tests, so a randomly sampled eval partly measures memorization: the model recalling a place it has already seen rather than generalizing to one it hasn't. The corpus deliberately holds a few regions out of training entirely. Evaluate only on those held-out places and you're testing the model on geography it has genuinely never met.</p>
<p>In our current data that's Vermont: 1,428 addresses the model trained around, not on. We ran the full pipeline on them and stopped grading by name. We measured three things: <strong>region-match</strong> (whether the resolved place sits in the gold state), <strong>coordinate error</strong> (the great-circle distance from the gold point to the resolved one), and <strong>PIP-containment</strong> (whether the gold coordinate actually falls inside the resolved place's polygon). None of those can be fooled by a matching string.</p>
<p>Here is what the honest slice said, next to the number we'd been quoting:</p>
<table><thead><tr><th>metric</th><th style="text-align:right">what we quoted</th><th style="text-align:right">the honest number</th></tr></thead><tbody><tr><td>locality name-match</td><td style="text-align:right">93.7%</td><td style="text-align:right">93.7%</td></tr><tr><td>region-match</td><td style="text-align:right">—</td><td style="text-align:right">0.0%</td></tr><tr><td>coordinate error (p50)</td><td style="text-align:right">—</td><td style="text-align:right">326 km</td></tr></tbody></table>
<p>Region-match: zero. Not low. Zero. The resolver was getting the state right essentially never, and the name-match metric had no way to tell us, because "Sheldon, Vermont" and "Sheldon, Iowa" are the same word.</p>
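<p>The gap between the two ways of grading fits in a few lines. This is a sketch, not our harness: the <code>Place</code> shape is hypothetical and the coordinates are approximate town locations chosen for illustration. The distance uses the standard haversine formula on a spherical Earth.</p>

```typescript
type Point = { lat: number; lon: number }
type Place = { placeName: string; region: string; centroid: Point }

const EARTH_RADIUS_KM = 6371

// Great-circle distance via the haversine formula.
function haversineKm(a: Point, b: Point): number {
  const toRad = (deg: number) => (deg * Math.PI) / 180
  const dLat = toRad(b.lat - a.lat)
  const dLon = toRad(b.lon - a.lon)
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h))
}

const nameMatch = (gold: Place, resolved: Place) => gold.placeName === resolved.placeName
const regionMatch = (gold: Place, resolved: Place) => gold.region === resolved.region

// Approximate coordinates, for illustration only.
const sheldonVT: Place = { placeName: "Sheldon", region: "VT", centroid: { lat: 44.88, lon: -72.94 } }
const sheldonIA: Place = { placeName: "Sheldon", region: "IA", centroid: { lat: 43.18, lon: -95.86 } }

console.log("name-match:", nameMatch(sheldonVT, sheldonIA))
console.log("region-match:", regionMatch(sheldonVT, sheldonIA))
console.log("distance km:", Math.round(haversineKm(sheldonVT.centroid, sheldonIA.centroid)))
```

<p>The name check passes, the region check fails, and the distance is what a customer would actually experience.</p>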
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="following-the-326-kilometers-down">Following the 326 kilometers down<a href="https://mailwoman.sister.software/blog/2026/06/08/the-right-name-in-the-wrong-state#following-the-326-kilometers-down" class="hash-link" aria-label="Direct link to Following the 326 kilometers down" title="Direct link to Following the 326 kilometers down" translate="no">​</a></h2>
<p>The model wasn't the problem. Hand it <code>226 Bridge Rd, North Hero, VT 05474</code> and it cleanly tags <code>region="VT"</code>, <code>locality="North Hero"</code>, the street, the number, the postcode. The parse is right. The resolver throws the region away.</p>
<p>It throws it away because it can't read it. Who's On First stores Vermont as "Vermont"; our search index carried no abbreviations, so <code>findPlace("VT")</code> matched nothing. With no resolved region, the resolver had no parent to constrain the locality search, so it searched the whole country — and when ten Sheldons compete with no geographic filter, the one with the largest population wins. Vermont's Sheldon (population 932) loses to Iowa's (population 5,455) every single time. The 326 kilometers was the distance between the right name and the famous one.</p>
<p>The fix already existed in the repo. A build step that pulls state abbreviations from a reference dataset we already ship had simply fallen out of the build manifest, so the gazetteer went out without it. We put it back, rebuilt the index, and re-ran the same slice:</p>
<table><thead><tr><th>metric</th><th style="text-align:right">before</th><th style="text-align:right">after</th></tr></thead><tbody><tr><td>region-match</td><td style="text-align:right">0.0%</td><td style="text-align:right">99.9%</td></tr><tr><td>coordinate error (p50)</td><td style="text-align:right">326 km</td><td style="text-align:right">3.4 km</td></tr></tbody></table>
<p>Across the full US sample, the long tail told the same story louder: the 90th-percentile error fell from <strong>2,763 kilometers to 10</strong>. We carry a flag called <code>--default-country</code>, the one that makes you tell the resolver the answer it's supposed to find, and it exists largely to paper over this exact blindness. The resolver can read the region now.</p>
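<p>The failure and the fix can be sketched together. Everything here is a simplification under stated assumptions: <code>resolveLocality</code>, the <code>Candidate</code> shape, and the alias table are hypothetical names, and the real resolver ranks on more than population. The two populations are the figures quoted above.</p>

```typescript
type Candidate = { label: string; region: string; population: number }

const sheldons: Candidate[] = [
  { label: "Sheldon", region: "Vermont", population: 932 },
  { label: "Sheldon", region: "Iowa", population: 5455 },
]

// Hypothetical alias table; in the post these come from a reference dataset at build time.
const stateAliases: { [alias: string]: string } = { VT: "Vermont", IA: "Iowa" }

function resolveLocality(
  candidates: Candidate[],
  regionText: string,
  aliases: { [alias: string]: string },
): Candidate | undefined {
  const regionName = aliases[regionText] ?? regionText
  const inRegion = candidates.filter((c) => c.region === regionName)
  // With no region match there is nothing to constrain the search,
  // so the whole pool competes and the largest population wins.
  const pool = inRegion.length > 0 ? inRegion : candidates
  return [...pool].sort((a, b) => b.population - a.population)[0]
}

console.log(resolveLocality(sheldons, "VT", {})?.region) // index without abbreviations
console.log(resolveLocality(sheldons, "VT", stateAliases)?.region) // index with abbreviations
```

<p>Nothing about the ranking changed between the two calls. The only difference is whether the resolver could read the two letters the parser had already handed it.</p>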
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-number-was-right-the-screwdriver-was-wrong">The number was right; the screwdriver was wrong<a href="https://mailwoman.sister.software/blog/2026/06/08/the-right-name-in-the-wrong-state#the-number-was-right-the-screwdriver-was-wrong" class="hash-link" aria-label="Direct link to The number was right; the screwdriver was wrong" title="Direct link to The number was right; the screwdriver was wrong" translate="no">​</a></h2>
<p>This is where it would be tidy to stop. It wasn't tidy.</p>
<p>Before promoting anything we ran the demo presets, the eight addresses we look at by hand, and one of them had gotten worse. <code>350 5th Ave, New York, NY</code> used to resolve to New York City. Now it resolved to "New York Mills", a village 283 kilometers upstate. The aggregate said the fix was a triumph; the functional check said we'd broken the most famous address in the set. When those two disagree, the functional check is the one telling the truth, and that disagreement is where you go looking.</p>
<p>The clue led somewhere worth knowing. Now that the region resolved, the resolver was boosting places that descend from it, and it works out descent from a precomputed ancestry table. New York City spans five boroughs, so Who's On First gives it the "no single parent" sentinel for a parent id, and our table-builder, which only ever followed parent ids, had recorded NYC's ancestry as <em>just itself</em>. No link to New York state. So the region boost lifted the correctly-filed village over the city, and a village of three thousand beat a city of eight million on a technicality of bookkeeping.</p>
<p>The ancestry was never actually missing. NYC's source record carries the full hierarchy, with New York state in all five of its borough branches, sitting in a field our builder didn't read. So we read it: a repair pass that rebuilds ancestry from the authoritative hierarchy fixed 47,129 places. New York City resolves to New York City again, Vermont stayed at 3.4 kilometers, and the metro regression was gone.</p>
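<p>Here is a toy version of the two ancestry builders. The record shape, the field names <code>parentId</code> and <code>hierarchies</code>, the ids, and the sentinel value are all stand-ins for illustration and should not be read as Who's On First's actual schema. The city spans two counties here rather than five boroughs to keep the data short.</p>

```typescript
// Stand-in sentinel meaning "no single parent"; not the real value.
const NO_SINGLE_PARENT = -1

type PlaceRecord = { id: number; parentId: number; hierarchies: number[][] }

const records: { [id: number]: PlaceRecord } = {
  1: { id: 1, parentId: 0, hierarchies: [[1]] }, // country (0 is not a record, so walks end here)
  2: { id: 2, parentId: 1, hierarchies: [[1, 2]] }, // state
  10: { id: 10, parentId: 2, hierarchies: [[1, 2, 10]] }, // county A
  11: { id: 11, parentId: 2, hierarchies: [[1, 2, 11]] }, // county B
  3: { id: 3, parentId: NO_SINGLE_PARENT, hierarchies: [[1, 2, 10], [1, 2, 11]] }, // city spanning both counties
  4: { id: 4, parentId: 2, hierarchies: [[1, 2, 4]] }, // village filed directly under the state
}

// Original approach: walk parent ids upward. The sentinel ends the walk immediately.
function ancestryFromParents(id: number): number[] {
  const chain: number[] = []
  let current: PlaceRecord | undefined = records[id]
  while (current !== undefined) {
    if (chain.indexOf(current.id) !== -1) break // guard against cycles
    chain.push(current.id)
    current = records[current.parentId]
  }
  return chain
}

// Repair pass: union every branch of the record's own hierarchy.
function ancestryFromHierarchies(id: number): number[] {
  const record = records[id]
  if (record === undefined) return []
  const all = record.hierarchies.reduce((acc, branch) => acc.concat(branch), [id])
  return all.filter((value, index) => all.indexOf(value) === index)
}

console.log(ancestryFromParents(3)) // the city knows only itself
console.log(ancestryFromParents(4)) // the village reaches the state
console.log(ancestryFromHierarchies(3)) // the city reaches the state through every branch
```

<p>A region boost that asks "does this place descend from the state?" gets a different answer for the city depending on which builder filled the table.</p>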
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-were-keeping">What we're keeping<a href="https://mailwoman.sister.software/blog/2026/06/08/the-right-name-in-the-wrong-state#what-were-keeping" class="hash-link" aria-label="Direct link to What we're keeping" title="Direct link to What we're keeping" translate="no">​</a></h2>
<p>Two things, and they're the same shape.</p>
<p>The first is about metrics. A measurement that grades by name can be gamed by coincidence and will flatter you right up until a customer geocodes into the wrong state. The coordinate can't be gamed: a point is either inside the right boundary or it isn't. We lead with region-match and distance now, and we report containment honestly, point geometry and all. The yardstick comes before the optimization, because every win you book against a dishonest yardstick is a win you might have to give back.</p>
<p>The second is about trust. The aggregate loved the abbreviation fix. The eight addresses we read with our own eyes caught the regression the aggregate buried, and chasing <em>why</em> those eight disagreed is what surfaced the ancestry bug underneath. Numbers scale and that is exactly their weakness; they average away the one case that would have embarrassed you. Keep reading the addresses by hand. The disagreement is where the bug lives.</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="eval" term="eval"/>
        <category label="Resolver / WOF" term="Resolver / WOF"/>
        <category label="wof" term="wof"/>
        <category label="coordinate-first" term="coordinate-first"/>
        <category label="methodology" term="methodology"/>
        <category label="Advanced" term="Advanced"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[We spent three retrains fixing a German bug that didn't exist]]></title>
        <id>https://mailwoman.sister.software/blog/2026/06/07/three-retrains-and-a-phantom</id>
        <link href="https://mailwoman.sister.software/blog/2026/06/07/three-retrains-and-a-phantom"/>
        <updated>2026-06-07T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Our parser's international-order German 'collapsed' to 44%, so we retrained it three times. Then we measured the thing that can't be gamed and found out the model had been right all along — the metric was lying. Native German was sitting at 96%.]]></summary>
        <content type="html"><![CDATA[<p>There is a particular kind of engineering misery where you fix a bug three times and it never gets better, because the bug is in your ruler. This is that story.</p>
<p>Our neural parser handles German two ways. Native order — <code>Hauptstraße 5, 10115 Berlin</code> — is the layout real German feeds and real German people use. International order — <code>5 Hauptstraße, Berlin, 10115</code> — is the Americanized layout our evaluation set happens to ship. For months, international-order German "collapsed": locality accuracy sat around 44% while native cleared 80%. We had a story for it. The postcode anchor — a side-channel that feeds the model a country hint derived from the postcode — sits at the <em>trailing</em> postcode, which in international order lands on the far side of the locality from where it's needed. Plausible. So we retrained.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="three-swings">Three swings<a href="https://mailwoman.sister.software/blog/2026/06/07/three-retrains-and-a-phantom#three-swings" class="hash-link" aria-label="Direct link to Three swings" title="Direct link to Three swings" translate="no">​</a></h2>
<p>The first retrain taught the model both word orders. It moved the model's intrinsic parsing but the production number stayed flat. The second re-added a region tail the synthetic data had dropped. It fixed <em>region</em> tagging — and left locality exactly where it was. The third injected the country hint at the front of the sentence too, so word order couldn't hide it. Locality-match went from 44.7% to 43.7%. Down. Three swings, and the needle would not move.</p>
<p>Across all three, one number sat there glowing and we kept not looking at it: the median coordinate error was about <strong>6 kilometers</strong>. Six kilometers is city-centroid accuracy. That is not what a "collapse" looks like. A model that genuinely couldn't parse German addresses would be putting them in the wrong country, not six kilometers from the front door. The geography was fine the whole time the locality-match score sat stuck near 44%. When your accuracy metric says collapse and your distance-to-truth doesn't, the metric is the thing that's broken.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="measuring-the-thing-that-cant-be-gamed">Measuring the thing that can't be gamed<a href="https://mailwoman.sister.software/blog/2026/06/07/three-retrains-and-a-phantom#measuring-the-thing-that-cant-be-gamed" class="hash-link" aria-label="Direct link to Measuring the thing that can't be gamed" title="Direct link to Measuring the thing that can't be gamed" translate="no">​</a></h2>
<p>So we measured it. PIP-containment: forget whether the resolved name <em>string</em> matches the gold string — is the address's real GPS point physically <em>inside the polygon</em> of the place we resolved it to? You cannot game that with a string trick. It either lands in the right place or it doesn't.</p>
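<p>For readers who haven't written one, a point-in-polygon test is short. The sketch below uses ray casting over a single ring and assumes no holes and no multipolygons, which real boundaries often have. The square "town boundary" and the gold point are made up; only the two name strings come from our data.</p>

```typescript
type LonLat = [number, number] // [longitude, latitude]

// Ray casting: count how many ring edges a ray running east from the point crosses.
function pointInRing(point: LonLat, ring: LonLat[]): boolean {
  const [x, y] = point
  let inside = false
  for (let i = 0, j = ring.length - 1; ring.length > i; j = i++) {
    const [xi, yi] = ring[i]!
    const [xj, yj] = ring[j]!
    // Only edges that straddle the point's latitude can be crossed by the ray.
    if ((yi > y) !== (yj > y)) {
      const xCross = ((xj - xi) * (y - yi)) / (yj - yi) + xi
      if (xCross > x) inside = !inside
    }
  }
  return inside
}

// Made-up square standing in for a town boundary.
const townRing: LonLat[] = [[12.0, 50.4], [12.2, 50.4], [12.2, 50.6], [12.0, 50.6]]
const goldPoint: LonLat = [12.13, 50.49]

// The strings disagree, but containment never looks at strings.
const goldLabel = "Plauen Vogtl"
const resolvedLabel = "Plauen"
console.log("name-match:", goldLabel === resolvedLabel)
console.log("contained:", pointInRing(goldPoint, townRing))
```

<p>The containment rate is then just the share of test addresses for which this returns true.</p>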
<p>The international-order German result split clean down the middle:</p>
<div class="language-text codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-text codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">                name-match   PIP-containment</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Saxony            51.1%          75.9%        (+24.8pp)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Berlin            36.3%          36.3%        ( 0.0pp)</span><br></div></code></pre></div></div>
<p>Two completely different stories had been hiding under one average.</p>
<p><strong>Saxony was never broken.</strong> The model places Saxon addresses correctly three times in four; the name-match metric credited only about half of all Saxon addresses. Look at what it was rejecting:</p>
<div class="language-text codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-text codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">gold "Plauen Vogtl"     resolved "Plauen"        point inside Plauen ✓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">gold "Chemnitz Sachs"   resolved "Chemnitz"      point inside Chemnitz ✓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">gold "Marienberg Erzgeb" resolved "Marienberg"   point inside Marienberg ✓</span><br></div></code></pre></div></div>
<p>OpenAddresses tags these with the regional district — <em>Vogtländischer Kreis</em>, <em>Sachsen</em>, <em>Erzgebirge</em> — and Who's On First's canonical name doesn't carry the suffix. So <code>Plauen Vogtl</code> ≠ <code>Plauen</code>, the string check fails, and the model eats a miss for resolving an address to <em>exactly the right town</em>. Twenty-five points of "collapse" was our ruler refusing to call Plauen Plauen.</p>
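<p>One way to stop the ruler from rejecting these is an alias step before the comparison. This is a sketch of the idea only: the table below is hand-written from the three examples above, and the function names are hypothetical rather than the resolver's actual code.</p>

```typescript
// Hypothetical alias table mapping a suffixed source spelling to the canonical name.
const localityAliases: { [sourceName: string]: string } = {
  "Plauen Vogtl": "Plauen",
  "Chemnitz Sachs": "Chemnitz",
  "Marienberg Erzgeb": "Marienberg",
}

const canonical = (placeName: string) => localityAliases[placeName] ?? placeName

// The check as it was: raw string equality.
function strictNameMatch(gold: string, resolved: string): boolean {
  return gold === resolved
}

// The check with aliases applied to both sides first.
function aliasNameMatch(gold: string, resolved: string): boolean {
  return canonical(gold) === canonical(resolved)
}

console.log(strictNameMatch("Plauen Vogtl", "Plauen"))
console.log(aliasNameMatch("Plauen Vogtl", "Plauen"))
console.log(aliasNameMatch("Plauen Vogtl", "Chemnitz"))
```

<p>An alias table fixes the scoring without a training run, but it still grades strings. It complements the containment check rather than replacing it.</p>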
<p><strong>Berlin was genuinely broken</strong> — just not the way we'd been retraining for. Of 1,500 Berlin addresses, 955 resolved to nothing at all. The model drops the locality entirely in the city-state layout <code>…, Berlin, Berlin 10115</code>, where the city and the state are the same word: one <code>Berlin</code> gets labeled the region, the other vanishes, and the resolver has nothing to place. That's a real bug. It is also specific to Berlin, Hamburg, and Bremen, and it has nothing whatsoever to do with the postcode anchor or word order — which is precisely why three anchor-and-order retrains never laid a finger on it.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-native-german-was-actually-doing">What native German was actually doing<a href="https://mailwoman.sister.software/blog/2026/06/07/three-retrains-and-a-phantom#what-native-german-was-actually-doing" class="hash-link" aria-label="Direct link to What native German was actually doing" title="Direct link to What native German was actually doing" translate="no">​</a></h2>
<p>And then the part that stung. We ran the same honest metric on native order, the layout that actually matters:</p>
<div class="language-text codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-text codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">                name-match   PIP-containment</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">native German     83.5%          96.2%</span><br></div></code></pre></div></div>
<p><strong>Ninety-six percent.</strong> Native German, measured by where the addresses actually land, was essentially solved and beating the rules-based baseline comfortably — while we'd been reading 83.5% off the name string and quietly wishing it were better. The metric had been low-balling our best locale by thirteen points the whole time.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-bill">The bill<a href="https://mailwoman.sister.software/blog/2026/06/07/three-retrains-and-a-phantom#the-bill" class="hash-link" aria-label="Direct link to The bill" title="Direct link to The bill" translate="no">​</a></h2>
<p>Three retrains, an A100 each, to discover that the model was fine and the scoreboard was broken. The honest accounting: one bug was a measurement artifact in the resolver's name comparison (the fix is an alias, not a training run), one was a narrow city-state parsing bug (a small data fix, not a country hint), and the model's German was a good deal better than any of our numbers had admitted. We cancelled the fourth retrain that was already queued.</p>
<p>The thing I keep turning over is that the coordinate error sat at six kilometers across all three runs and we kept retraining anyway, because the metric we'd built our gates around was the one telling us to. A benchmark you can fail while being right is worse than no benchmark, because it doesn't just fail to help — it actively points you at the wrong fix and lets you feel diligent while you chase it. We have a non-gameable metric now. We should have built it first.</p>
<p><em>The 2×2s, the PIP-containment harness, and the per-state breakdowns are in <code>scripts/eval/de-pip-eval.sh</code> and <code>docs/articles/evals/</code>. Numbers in this post are generated.</em></p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Neural classifier" term="Neural classifier"/>
        <category label="Resolver / WOF" term="Resolver / WOF"/>
        <category label="Geocoder hubris" term="Geocoder hubris"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Which Berlin? When your metric grades the wrong thing]]></title>
        <id>https://mailwoman.sister.software/blog/2026/06/07/which-berlin</id>
        <link href="https://mailwoman.sister.software/blog/2026/06/07/which-berlin"/>
        <updated>2026-06-07T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Our resolver kept dropping German addresses in New Hampshire, and our scorecard handed it a gold star every time. Here's how a name-match metric lied to us for months, and what a postcode hint was quietly worth once we measured by distance instead.]]></summary>
        <content type="html"><![CDATA[<p>Ask a geocoder for "Berlin" and it has to make a choice. There's the one in Germany, obviously. There's also Berlin, New Hampshire (population nine thousand and change), Berlin, Wisconsin, Berlin, Connecticut, and a dozen more scattered across the United States like the name was on sale. The parser hands you the word <code>Berlin</code> tagged as a locality; something downstream has to decide <em>which</em> dot on the map that is. How would you even know if it picked right?</p>
<p>For a long time our answer was a scorecard that checked the name. Did the resolved place's name equal the expected name? Tick. Move on. It is a completely reasonable thing to measure, and it was lying to us for months.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-gold-star-for-new-hampshire">The gold star for New Hampshire<a href="https://mailwoman.sister.software/blog/2026/06/07/which-berlin#the-gold-star-for-new-hampshire" class="hash-link" aria-label="Direct link to The gold star for New Hampshire" title="Direct link to The gold star for New Hampshire" translate="no">​</a></h2>
<p>Here's the failure the name check can't see. Feed it a German address, let the resolver land on Berlin, New Hampshire, and ask the scorecard how it did. The resolved name is "Berlin." The expected name is "Berlin." Tick. Gold star. We just put a Berlin address an ocean away from Berlin and the metric congratulated us for it.</p>
<p>This isn't a contrived edge case. Bare locality names collide constantly across borders, and a name-only check is structurally blind to the collision. Whenever the model dropped a German locality on its American namesake, our headline number stayed perfectly, serenely flat. The bug and the scorecard were made for each other.</p>
<p>We only tripped over it by accident, chasing something else entirely.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-hint-that-did-nothing-loudly">The hint that did nothing, loudly<a href="https://mailwoman.sister.software/blog/2026/06/07/which-berlin#the-hint-that-did-nothing-loudly" class="hash-link" aria-label="Direct link to The hint that did nothing, loudly" title="Direct link to The hint that did nothing, loudly" translate="no">​</a></h2>
<p>Every address carries a postcode, and a postcode mostly pins down a country. So we built a small extractor that turns the postcode into a guess about which country you're in, and we ran a simulation: feed that country guess into the resolver's ranking, give candidates from the right country a nudge, and see how much the name-match score improves.</p>
<p>It improved by nothing. Zero. Flat line.</p>
<p>Which, briefly, looked like a dead end. The hint was supposed to help and the number said it didn't. Then it clicked: the number <em>couldn't</em> say it helped, because the number grades by name, and fixing a wrong-country pick doesn't change the name. We'd handed our metric exactly the kind of improvement it was built to ignore.</p>
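<p>The blind spot is easy to reproduce in miniature. In this sketch the country hint is taken as given, and the scores and the nudge size are invented for illustration; <code>pick</code> is a hypothetical stand-in for the resolver's ranking, which assumes a non-empty candidate list.</p>

```typescript
type Ranked = { label: string; country: string; score: number }

// Hypothetical nudge size added to candidates in the hinted country.
const COUNTRY_NUDGE = 0.5

function pick(candidates: Ranked[], countryHint?: string): Ranked {
  const scored = candidates.map((c) => ({
    ...c,
    score: c.score + (c.country === countryHint ? COUNTRY_NUDGE : 0),
  }))
  return scored.sort((a, b) => b.score - a.score)[0]!
}

// Made-up scores in which the American namesake starts out ahead.
const berlins: Ranked[] = [
  { label: "Berlin", country: "US", score: 0.6 },
  { label: "Berlin", country: "DE", score: 0.4 },
]

const expectedLabel = "Berlin"
const withoutHint = pick(berlins)
const withHint = pick(berlins, "DE")

// The name check gives the same verdict both times; only the country moved.
console.log(withoutHint.country, withoutHint.label === expectedLabel)
console.log(withHint.country, withHint.label === expectedLabel)
```

<p>Both picks pass the name check, so a name-match score computed over a thousand of these would not move by a single point.</p>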
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="measure-the-distance-and-the-floor-falls-out">Measure the distance and the floor falls out<a href="https://mailwoman.sister.software/blog/2026/06/07/which-berlin#measure-the-distance-and-the-floor-falls-out" class="hash-link" aria-label="Direct link to Measure the distance and the floor falls out" title="Direct link to Measure the distance and the floor falls out" translate="no">​</a></h2>
<p>So we threw out the name check and graded by distance instead. We have the real government coordinates for every test address, so we can ask the only question that actually matters: how far is the resolver's pick from where the address really is?</p>
<p>The picture inverted immediately. On German addresses, the postcode hint dragged 33 picks back across the Atlantic to where they belonged, erasing about 117,000 kilometers of total error. On American addresses it pulled 333 of them more than 100 km closer to the truth and pushed only 7 the wrong way, a roughly fifty-to-one trade. The hint was quietly worth a continent, and the name scorecard had been sitting there the whole time reporting that absolutely nothing was happening.</p>
<p>A metric you can satisfy without being right will let you be wrong forever, cheerfully, in production. "Berlin" matches "Berlin" no matter which one you meant. The distance to the real point does not care what you call the place; it just measures whether you found it. We switched the yardstick, and we're building the country hint into the resolver for real now that we can finally see what it does.</p>
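<p>Grading by distance is a few lines of arithmetic. The sketch below uses the standard haversine formula; the <code>Point</code> shape is hypothetical and the coordinates are approximate city centres chosen for illustration, not rows from our eval set.</p>

```typescript
interface Point {
  lat: number;
  lon: number;
}

const EARTH_RADIUS_KM = 6371;

// Great-circle distance between two points, in kilometres.
function haversineKm(a: Point, b: Point): number {
  const toRad = (deg: number) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Approximate coordinates, for illustration only.
const expected = { name: "Berlin", point: { lat: 52.52, lon: 13.405 } }; // Germany
const resolved = { name: "Berlin", point: { lat: 44.47, lon: -71.18 } }; // New Hampshire

const nameMatch = resolved.name === expected.name; // the old scorecard
const errorKm = haversineKm(resolved.point, expected.point); // the new one
```

<p>The name check passes on this pair while the distance comes out in the thousands of kilometres, which is the whole disagreement between the two yardsticks.</p>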
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-same-week-the-same-lesson">The same week, the same lesson<a href="https://mailwoman.sister.software/blog/2026/06/07/which-berlin#the-same-week-the-same-lesson" class="hash-link" aria-label="Direct link to The same week, the same lesson" title="Direct link to The same week, the same lesson" translate="no">​</a></h2>
<p>This landed the same week we did something that sounds unrelated and turns out to be the identical problem: we calibrated the parser's confidence. Every span comes out stamped with a <code>conf=</code> number, and we'd never checked whether a 0.9 actually meant right-nine-times-in-ten. It didn't, until we fit a correction that made it honest (the <a class="" href="https://mailwoman.sister.software/docs/concepts/confidence-calibration">calibration writeup</a> has the details, including the weather-forecaster version of the story).</p>
<p>Both are the same realization wearing different hats. A geocoder reports numbers about itself constantly: how confident it is in a tag, how well it scored on a benchmark. Those numbers are worthless decoration until you've checked that they mean what they say. A confidence that isn't calibrated is a vibe with a decimal point. A benchmark you can game is a way to feel good while shipping the wrong Berlin.</p>
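<p>Checking whether a 0.9 means right nine times in ten can be done by binning predictions by confidence and comparing each bin's mean confidence to its accuracy. The sketch below computes that gap as an expected calibration error. It covers only the check, not the correction we fit, and the <code>Prediction</code> shape is an assumption.</p>

```typescript
// Hypothetical record: one span's reported confidence and whether it was right.
interface Prediction {
  conf: number; // in [0, 1]
  correct: boolean;
}

interface Bin {
  count: number;
  meanConf: number;
  accuracy: number;
}

// Group predictions into equal-width confidence bins, skipping empty ones.
function reliabilityBins(preds: Prediction[], binCount = 10): Bin[] {
  const bins: Bin[] = [];
  for (let i = 0; i < binCount; i++) {
    const lower = i / binCount;
    const upper = (i + 1) / binCount;
    const isLast = i === binCount - 1;
    const inBin = preds.filter(
      (p) => p.conf >= lower && (p.conf < upper || (isLast && p.conf <= upper)),
    );
    if (inBin.length === 0) continue;
    const meanConf = inBin.reduce((sum, p) => sum + p.conf, 0) / inBin.length;
    const accuracy = inBin.filter((p) => p.correct).length / inBin.length;
    bins.push({ count: inBin.length, meanConf, accuracy });
  }
  return bins;
}

// Weighted average gap between stated confidence and observed accuracy.
function expectedCalibrationError(preds: Prediction[], binCount = 10): number {
  return reliabilityBins(preds, binCount).reduce(
    (sum, b) => sum + (b.count / preds.length) * Math.abs(b.meanConf - b.accuracy),
    0,
  );
}

// Ten spans all stamped 0.9, only six of them right.
const overconfident: Prediction[] = Array.from({ length: 10 }, (_, i) => ({
  conf: 0.9,
  correct: i < 6,
}));
const gap = expectedCalibrationError(overconfident);
```

<p>A well-calibrated parser drives that gap toward zero; the toy set above is off by roughly thirty points.</p>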
<p>So the next time a metric tells you everything is fine, ask it the one thing it isn't measuring. Ours was measuring the spelling. It should have been measuring the distance.</p>
<p><em>The harness, the per-row deltas, and the reproducible reports live in <code>scripts/eval/anchor-resolver-delta.ts</code> and <code>docs/articles/evals/</code>. Numbers in this post are generated, not hand-typed.</em></p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Neural classifier" term="Neural classifier"/>
        <category label="Resolver / WOF" term="Resolver / WOF"/>
        <category label="Geocoder hubris" term="Geocoder hubris"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Which way does a postcode point?]]></title>
        <id>https://mailwoman.sister.software/blog/2026/06/06/which-way-does-a-postcode-point</id>
        <link href="https://mailwoman.sister.software/blog/2026/06/06/which-way-does-a-postcode-point"/>
        <updated>2026-06-06T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We injected a postcode's country signal straight into the model, and German addresses jumped thirty-five points. Then three retrains taught us the same anchor can't help an address written in the other order, because it never learned to find the city. It learned which way to look.]]></summary>
        <content type="html"><![CDATA[<p>We left the last postcode story with a promise and a bill. The promise was that the "which country is this" signal has to come from the trained model reading the whole string, because the postcode on its own settles the question less than half the time. The bill was that this is the expensive version of the feature. This is the post where we paid it: we built the country signal into the model, watched it do something genuinely great, and then watched it refuse, in the most instructive way we've hit all month, to do that same great thing in a different word order.</p>
<p>The great thing first, because you've earned it. We took the postcode's gazetteer membership, that <code>[us, de, fr]</code> answer from last time, and instead of handing it to a regex we injected it into the model at the postcode token itself. A small additive nudge on the hidden state, right where the five digits sit, carrying "here is what this code could be." On German addresses written the way Germans actually write them, it was worth thirty-five points of locality accuracy. It beat Pelias. For one evening we were heroes.</p>
<p>Then we looked at the international numbers and the floor gave way. Same model, same anchor, the same German cities, but now written house-number-first with the postcode trailing the city, the way our test feed renders them, and it scored a little under a coin flip. The hero anchor was, on those rows, slightly worse than no anchor at all.</p>
<p>Three questions sit under the rest of this, so let me put them on the table before we start:</p>
<ul>
<li class="">When a parser "collapses" on a test, is the parser wrong, or is the test?</li>
<li class="">Can you train one model to read an address in any order, or does each order quietly cost you the other?</li>
<li class="">And the one that took three retrains to answer honestly: what does a learned anchor actually learn, the thing you asked for, or the shape of where you kept putting it?</li>
</ul>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-collapse-that-was-a-rendering-bug">The collapse that was a rendering bug<a href="https://mailwoman.sister.software/blog/2026/06/06/which-way-does-a-postcode-point#the-collapse-that-was-a-rendering-bug" class="hash-link" aria-label="Direct link to The collapse that was a rendering bug" title="Direct link to The collapse that was a rendering bug" translate="no">​</a></h2>
<p>Before you can fix a collapse you have to be sure it's real, and ours mostly wasn't. The number that scared us, German international addresses parsing around 45% while native ones sat in the eighties, turned out to be measuring our test harness as much as our model.</p>
<p>Here's the thing we'd quietly done to ourselves. Our German evaluation set is rendered from OpenAddresses in the layout our US-trained tooling defaults to: <code>27 Straußstraße, Berlin, Berlin 12623</code>. House number first, postcode after the city, region hanging off the tail. No German has ever written an address that way. They write <code>Straußstraße 27, 12623 Berlin</code>, street then number, postcode <em>before</em> the city. The model had trained on the German order and we were grading it on the American one, then reading the low score as a model failure.</p>
<p>So we re-rendered the same cities in their native order and measured again. The "collapsing" model read them at 83.8%, comfortably past Pelias's 78.7. The collapse was, to a first approximation, us holding the test sideways. That's worth saying plainly because it's the cheap half of the lesson: when a model falls over on exactly one slice of your data, suspect the slice before you suspect the model. We've now been burned by eval-order twice, and both times the fix was free.</p>
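<p>The two layouts differ only in how the same fields are ordered, which is easy to see side by side. A minimal sketch, assuming a hypothetical <code>AddressParts</code> record; the real harness renders from OpenAddresses rows and may differ in detail.</p>

```typescript
// Hypothetical field names; the real OpenAddresses-derived rows may differ.
interface AddressParts {
  street: string;
  houseNumber: string;
  postcode: string;
  city: string;
  region?: string;
}

// Native German order: street then number, postcode before the city.
function renderNativeGerman(a: AddressParts): string {
  return `${a.street} ${a.houseNumber}, ${a.postcode} ${a.city}`;
}

// US-style "international" order: number first, postcode after the city and region.
function renderInternational(a: AddressParts): string {
  const tail = a.region ? `${a.region} ${a.postcode}` : a.postcode;
  return `${a.houseNumber} ${a.street}, ${a.city}, ${tail}`;
}

const sample: AddressParts = {
  street: "Straußstraße",
  houseNumber: "27",
  postcode: "12623",
  city: "Berlin",
  region: "Berlin",
};

const nativeOrder = renderNativeGerman(sample);
const internationalOrder = renderInternational(sample);
```

<p>Rendering every eval row through both functions gives two slices of the same addresses, so a score gap between them can only come from the ordering.</p>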
<p>Only the first approximation was free, though. After we corrected the rendering, a residual gap stayed behind, and it had nothing to do with order artifacts. With the anchor switched on, international-order German <em>still</em> came in a few points below the same model with the anchor switched off. The boost that was worth +35 on native addresses had flipped its sign. No rendering fix was going to explain that one away; the anchor was actively making the harder order worse, and chasing why is where the rest of the story lives.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="three-swings-at-the-residual">Three swings at the residual<a href="https://mailwoman.sister.software/blog/2026/06/06/which-way-does-a-postcode-point#three-swings-at-the-residual" class="hash-link" aria-label="Direct link to Three swings at the residual" title="Direct link to Three swings at the residual" translate="no">​</a></h2>
<p>We did the obvious thing first, and the obvious thing told us something real. If the model had only ever seen German in native order, of course it stumbled on the international one, so we rebuilt the training shard to render both orders, roughly sixty/forty. The model with the anchor <em>off</em> responded exactly as you'd hope: international-order parsing climbed from 35.9% to 48.4%. The capability is learnable. Show the model both layouts and it reads both.</p>
<p>The model with the anchor <em>on</em> didn't move. International stayed stuck around 44%, with the anchor still dragging it below the anchor-off number. So we'd proven the corpus wasn't the ceiling, which is genuinely useful and was not the result we wanted.</p>
<p>Swing two. We noticed the international synth had been dropping the region from the tail while the eval fed it, so the model was being asked to segment a <code>City, Region Postcode</code> ending it had never trained on. Reasonable suspect. We rendered the region back into the tail and retrained. The region-matching did exactly its job, international region accuracy going from zero to about forty percent, and the locality number we actually cared about did not budge. The tail wasn't the ceiling either.</p>
<p>Swing three was the architectural one, and it's the one we'd have bet on. If the anchor lands on the postcode and the postcode trails the city in international order, then by the time the city gets read the anchor is firing on the wrong side of it. Fine: inject the anchor a second time, at the very first token, where every locality can attend back to it no matter where the postcode ended up. A clean change, no new parameters, the zero-confidence case stays a perfect identity. We retrained.</p>
<p>It did nothing. International held at 43.7%, the anchor still underwater.</p>
<table><thead><tr><th>retrain</th><th style="text-align:right">native, anchor on</th><th style="text-align:right">international, anchor on</th></tr></thead><tbody><tr><td>both-order corpus</td><td style="text-align:right">82.1</td><td style="text-align:right">44.5</td></tr><tr><td>region in the tail</td><td style="text-align:right">83.6</td><td style="text-align:right">44.7</td></tr><tr><td>second anchor at token 0</td><td style="text-align:right">83.5</td><td style="text-align:right">43.7</td></tr></tbody></table>
<p>Three swings, one number that would not move. At some point a column of results that flat stops being a series of failed fixes and starts being the finding itself.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-the-anchor-actually-learned">What the anchor actually learned<a href="https://mailwoman.sister.software/blog/2026/06/06/which-way-does-a-postcode-point#what-the-anchor-actually-learned" class="hash-link" aria-label="Direct link to What the anchor actually learned" title="Direct link to What the anchor actually learned" translate="no">​</a></h2>
<p>Here's where it helps to stop asking "why won't it improve" and start asking what the thing in front of you is actually doing. We'd been describing the anchor as if it carried a meaning, "this postcode could be German," and meanings don't have a handedness. What we actually add to that one position is a vector, and the model spends all of training learning what to <em>do</em> with the nudge. What it learned to do, it turns out, has a direction baked into it.</p>
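<p>Mechanically, the injection is just vector addition at one position. The sketch below shows that shape on plain arrays. Scaling the anchor by a <code>confidence</code> value is an assumption made here so that zero confidence leaves the hidden states untouched, and the function name is hypothetical.</p>

```typescript
// hidden: one vector per token; anchor: a learned vector of the same width.
// Returns a new array; only the row at `position` is changed.
function injectAnchor(
  hidden: number[][],
  position: number,
  anchor: number[],
  confidence: number, // assumed scale factor; 0 means "no anchor"
): number[][] {
  return hidden.map((vec, i) =>
    i === position ? vec.map((v, d) => v + confidence * anchor[d]) : vec.slice(),
  );
}

// Three tokens, width two; pretend the postcode is token 1.
const hiddenStates = [
  [0.1, 0.2],
  [0.3, 0.4],
  [0.5, 0.6],
];
const anchorVector = [1, -1];

const untouched = injectAnchor(hiddenStates, 1, anchorVector, 0);
const nudged = injectAnchor(hiddenStates, 1, anchorVector, 1);
```

<p>Nothing in the addition says "left" or "right". Any direction comes from what the downstream layers learn to do with the nudged position.</p>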
<p>Think about where the city sits relative to the postcode in each training distribution. In native German, the postcode comes <em>before</em> the city: <code>12623 Berlin</code>. Every time the anchor fired during training, the locality it was supposed to help was sitting just to its right. So the model learned an anchor that reaches rightward, and on native addresses it reaches right and finds Berlin every time, which is your +35 points. Hand that same model an international-order address and the postcode is now <em>after</em> the city. The anchor reaches right out of long habit, finds the region or the end of the string, and meanwhile the actual city it was meant to rescue is sitting behind it, unhelped and slightly shoved.</p>
<p>The clean confirmation was hiding in the data the whole time, in the one locale that never suffered. American addresses put the postcode after the city, <code>Seattle WA 98101</code>, and the US anchor never hurt anything; US held at 96 to 97%. Of course it did. US training is consistently postcode-after-city, so the US anchor learned to reach <em>left</em>, toward the city behind it, and it's right every time because the layout never varies. Same architecture, same injection point, opposite learned direction, because the two countries write their addresses in opposite orders and the anchor simply absorbed whichever one it was fed.</p>
<p>That's the asymmetry, and it's why it's fundamental rather than a tuning problem. A single added vector can encode "reach toward the city." It cannot encode "reach toward the city, which is sometimes to my left and sometimes to my right." Mix both orders into one shard and you're asking one direction to point two ways; it settles on the average and serves the dominant order, which is exactly the flat international number we kept retraining into. To check we weren't chasing a name-matching mirage, we ran a containment metric, does the resolved point land inside the right city's polygon, and the gap held: 96% on native German, 57% on the international order. The miss is geographic and real, not a scoreboard artifact.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="accepting-the-asymmetry">Accepting the asymmetry<a href="https://mailwoman.sister.software/blog/2026/06/06/which-way-does-a-postcode-point#accepting-the-asymmetry" class="hash-link" aria-label="Direct link to Accepting the asymmetry" title="Direct link to Accepting the asymmetry" translate="no">​</a></h2>
<p>When you've thrown corpus, tail, and architecture at a number and it hasn't twitched, the honest move is to stop calling it a bug. We brought the whole arc to our second-opinion model, the same one that talked us out of the doomed feature last time, and it made the call we'd been circling: accept the asymmetry, ship the native win.</p>
<p>The case is stronger than "we gave up." The native gain is large, it's stable across every retrain, and it generalizes; US and French held throughout. The international penalty is small and just as stable, and an international-order German address can route around the anchor entirely, since the model reads both orders fine on its own once it's seen them. You lose nothing real by switching the anchor off for the layout it was never going to help. So that's production: anchor on where the postcode leads the city, off where it trails it, and the +35 points kept exactly where they were earned.</p>
<p>The asymmetry doesn't kill the bigger plan either, which was the part worth keeping. If one vector can only ever point one way, then a cleverer single anchor was never going to save us. What we want is an anchor <em>per</em> locale, each one free to learn its own country's direction: the German anchor reaches right, the American one reaches left, and nobody is forced to average. That's a real week of work for another day, but it's a justified one now instead of a hopeful one, which is the same place the last postcode story left us standing.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-lesson-which-is-older-than-this-anchor">The lesson, which is older than this anchor<a href="https://mailwoman.sister.software/blog/2026/06/06/which-way-does-a-postcode-point#the-lesson-which-is-older-than-this-anchor" class="hash-link" aria-label="Direct link to The lesson, which is older than this anchor" title="Direct link to The lesson, which is older than this anchor" translate="no">​</a></h2>
<p>What we'd missed, going in, is that a learned signal doesn't carry the meaning you named it after. It carries the geometry of the data you trained it on. We called the thing a "country anchor" and reasoned about it as if it knew a fact about a postcode, when what it had absorbed was a habit about where cities tend to sit. The name was a label we put on the outside; the direction was the thing inside, and the direction is what shipped.</p>
<p>So when you train a helper signal and it works beautifully on the distribution you built it against, the question to ask before you trust it somewhere new is what it actually learned the shape of, and whether that shape still holds one locale to the left. Ours didn't. The good news is it told us so in three clean retrains, and the better news is that the thing it learned, narrow as it is, is worth thirty-five points right where we'll keep it.</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Architecture" term="Architecture"/>
        <category label="Neural classifier" term="Neural classifier"/>
        <category label="Postcode components" term="Postcode components"/>
        <category label="International" term="International"/>
        <category label="Model training" term="Model training"/>
        <category label="Advanced" term="Advanced"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The map runs out before the country does]]></title>
        <id>https://mailwoman.sister.software/blog/2026/06/05/the-map-runs-out-before-the-country-does</id>
        <link href="https://mailwoman.sister.software/blog/2026/06/05/the-map-runs-out-before-the-country-does"/>
        <updated>2026-06-05T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We taught a resolver to drop a postcode into the city polygon that contains it. Japan has no city polygons, so we resolved it another way and hit 94%. Then Korea let us nail the coordinate every time and quietly refused to give up the name — and the gazetteer is the reason.]]></summary>
        <content type="html"><![CDATA[<p>We spent a good month teaching our resolver exactly one trick. Take a postcode, drop its centroid into the city polygon that happens to contain it, read off the city. It's a genuinely good trick. It got the Netherlands to 95% and Germany to 93%, and for a while it felt like the whole problem was going to fall to it. Then we pointed it at Japan, and Japan calmly informed us that it has no city polygons to drop anything into.</p>
<p>What follows is a two-country story about what a geocoder can still do when the map underneath it goes thin, and where it finally can't. Japan we resolved anyway, 94% of the way, by putting the polygon down and asking a different question. Korea handed the same problem back to us turned inside-out: it let us pin the coordinate perfectly, every time, and then stopped us cold at the one thing we were really after, which is the name of the place you've landed in.</p>
<p>Three questions sit under all of it, so let me put them on the table before we start:</p>
<ul>
<li class="">What do you do when the gazetteer gives you points where you expected shapes?</li>
<li class="">Does the move that rescues Japan actually generalize, or did we get lucky once and dress it up as a method?</li>
<li class="">And the question with no comfortable answer: what happens when the map is simply missing the part of a country you most need to see?</li>
</ul>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="japan-has-no-city-polygons">Japan has no city polygons<a href="https://mailwoman.sister.software/blog/2026/06/05/the-map-runs-out-before-the-country-does#japan-has-no-city-polygons" class="hash-link" aria-label="Direct link to Japan has no city polygons" title="Direct link to Japan has no city polygons" translate="no">​</a></h2>
<p>Quick recap for anyone who missed the last Japan post: a few weeks ago we pulled Japan's address hierarchy out of Who's On First and learned that Japanese addresses run backwards and have no street names at all. This is the sequel, the one where we try to actually resolve them.</p>
<p>The European recipe is point-in-polygon, and it's about as simple as geocoding gets. A postcode comes with a centroid. Who's On First gives you administrative polygons. You ask which locality polygon contains the centroid, and that's your city. Clean, fast, and it carried four European locales without complaint.</p>
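<p>The recipe above fits in a few lines. This sketch uses a standard ray-casting test against a single outer ring; real administrative polygons have holes and multiple parts, which it ignores. The <code>LocalityPolygon</code> shape and the toy square are made up for illustration.</p>

```typescript
type LonLat = [number, number];

// Hypothetical, simplified shape: one outer ring per locality.
interface LocalityPolygon {
  name: string;
  ring: LonLat[];
}

// Ray casting: count how many ring edges a horizontal ray from the point crosses.
function pointInRing(point: LonLat, ring: LonLat[]): boolean {
  const [x, y] = point;
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    const straddles = (yi > y) !== (yj > y);
    if (straddles && x < ((xj - xi) * (y - yi)) / (yj - yi) + xi) {
      inside = !inside;
    }
  }
  return inside;
}

// Return the first locality whose ring contains the postcode centroid, if any.
function localityForCentroid(centroid: LonLat, localities: LocalityPolygon[]): string | null {
  const hit = localities.find((l) => pointInRing(centroid, l.ring));
  return hit ? hit.name : null;
}

// A made-up square "city" for illustration.
const toyLocalities: LocalityPolygon[] = [
  { name: "Toyville", ring: [[0, 0], [10, 0], [10, 10], [0, 10]] },
];

const insideHit = localityForCentroid([5, 5], toyLocalities);
const outsideHit = localityForCentroid([15, 5], toyLocalities);
```

<p>The function can only ever answer with a locality that has a ring to test against, which is the assumption the rest of this post runs into.</p>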
<p>It gets Japan to 25%, and it took us an embarrassingly long while to see why, because the failure wears the costume of a tuning problem and is nothing of the sort. We went digging through WOF's Japanese geometry placetype by placetype, and the pattern repeats every time. The prefectures have polygons. The wards and sub-prefectures have polygons. The municipality — the 市区町村 level a postcode actually resolves to — is essentially all points. Not coarse polygons, not bad ones, just points: a latitude and a longitude with nothing to be inside of.</p>
<p>So point-in-polygon has nothing to contain, and no amount of fiddling with a containment test rescues a containment test when there are no containers. We checked Korea and Taiwan while we were down there, and they tell the identical story. The municipality layer across all three countries is dots on a map where Europe gave us regions. This is the shape of the whole problem, and it means the recipe we were so pleased with simply doesn't travel east.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="you-stop-asking-the-polygon-and-start-asking-japan-post">You stop asking the polygon and start asking Japan Post<a href="https://mailwoman.sister.software/blog/2026/06/05/the-map-runs-out-before-the-country-does#you-stop-asking-the-polygon-and-start-asking-japan-post" class="hash-link" aria-label="Direct link to You stop asking the polygon and start asking Japan Post" title="Direct link to You stop asking the polygon and start asking Japan Post" translate="no">​</a></h2>
<p>If you can't ask "which shape am I inside," you ask the postal authority something more direct: "what's the municipality for this postcode?" Then you go find that municipality in WOF by name. Japan Post publishes exactly that mapping in a file called KEN_ALL, and, crucially, a romanized edition whose municipality column reads <code>SAPPORO SHI CHUO KU</code>, in the same alphabet WOF uses for its romanized place names. Two romanized strings you can actually compare. That's the whole pivot.</p>
<p>Getting the file was its own small comedy. Every KEN_ALL download URL we had on record returned a 404. The replacements turned out to be gated behind JavaScript and a Japan-only fetch, so a plain script came home with a polite error page instead of data. And when the file finally arrived, it was CP932 (Shift-JIS) encoded, in the year 2026. We got there, and it carries the one thing WOF's own postcode hierarchy refuses to give up: the municipality, where WOF stops at the prefecture and leaves you a hundred kilometres too coarse.</p>
<p>The matching had one wrinkle worth knowing about. Japanese municipalities don't sit in a single WOF placetype. A regular city lands in <code>locality</code>, a ward in <code>county</code> or <code>localadmin</code>, a Tokyo special ward in <code>borough</code>. Match against just one of those and you cap out around 55%. Search all of them at once and you get <strong>94.3% of postcodes matched to a real municipality</strong>, with end-to-end resolution landing between 94 and 98% depending on which gold set you grade against. Comfortably past our 85% bar, and the European locales come out of the change byte-for-byte identical, because the new path only fires for the countries that need it.</p>
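<p>The wrinkle above is easy to sketch: the same name lookup, run over one placetype versus all four. The <code>WofPlace</code> shape, the normalization, and the sample record are assumptions for illustration, including the assumption that the two romanized spellings agree once case and spacing are normalized.</p>

```typescript
// Hypothetical, trimmed-down view of a WOF record.
interface WofPlace {
  id: number;
  name: string; // romanized name
  placetype: string;
}

// The four placetypes the post says Japanese municipalities land in.
const MUNICIPALITY_PLACETYPES = ["locality", "county", "localadmin", "borough"];

// Assumed normalization: trim, uppercase, collapse runs of whitespace.
const normalizeName = (s: string): string => s.trim().toUpperCase().replace(/\s+/g, " ");

// Find a place by romanized name, restricted to the given placetypes.
function findMunicipality(
  romanized: string,
  places: WofPlace[],
  placetypes: string[] = MUNICIPALITY_PLACETYPES,
): WofPlace | null {
  const target = normalizeName(romanized);
  const allowed = new Set(placetypes);
  return places.find((p) => allowed.has(p.placetype) && normalizeName(p.name) === target) ?? null;
}

// Made-up record: a ward filed under `county` rather than `locality`.
const samplePlaces: WofPlace[] = [{ id: 1, name: "Sapporo Shi Chuo Ku", placetype: "county" }];

const localityOnly = findMunicipality("SAPPORO SHI CHUO KU", samplePlaces, ["locality"]);
const allPlacetypes = findMunicipality("SAPPORO SHI CHUO KU", samplePlaces);
```

<p>Restricting the search to <code>locality</code> misses the ward entirely, while the four-placetype search finds it.</p>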
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-same-strategy-a-build-shaped-to-the-country">The same strategy, a build shaped to the country<a href="https://mailwoman.sister.software/blog/2026/06/05/the-map-runs-out-before-the-country-does#the-same-strategy-a-build-shaped-to-the-country" class="hash-link" aria-label="Direct link to The same strategy, a build shaped to the country" title="Direct link to The same strategy, a build shaped to the country" translate="no">​</a></h2>
<p>Here's the part I want to dwell on, because it's the part that decides whether any of this scales. The Japanese build feeds the <strong>exact same resolver strategy</strong> the European one does. All we wrote was a Japan <em>build</em>: a different way of filling in the one table the resolver already reads. The resolver itself never changed, never even noticed. Postcode in, locality out, the same code path Amsterdam runs through.</p>
<p>That's the bet the whole "rule engine" design rests on: one strategy, and a per-country table that each country gets to populate however its data allows. Japan populates it by asking its postal authority for names. The question we hadn't answered was whether a genuinely <em>different</em> country, with genuinely different data, could populate that same table without us bolting on a pile of special cases. Which is where Korea comes in, and where the story stops being a victory lap.</p>
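<p>In code terms, the bet looks something like the sketch below: one lookup function, and a per-country build that fills the table it reads. Every name here is hypothetical and the rows are toy examples, not the resolver's real schema.</p>

```typescript
// Hypothetical row and table shapes; the real table has more columns.
interface LocalityRow {
  localityId: number;
  localityName: string;
}
type PostcodeTable = Map<string, LocalityRow>;

// Each country supplies its own way of filling the same table.
type CountryBuild = () => PostcodeTable;

// The one strategy: postcode in, locality out. It never asks how the table was built.
function resolvePostcode(table: PostcodeTable, postcode: string): LocalityRow | null {
  return table.get(postcode) ?? null;
}

// Toy builds. A real European build would run point-in-polygon;
// a real Japanese build would join postal-authority names against WOF.
const builds: Record<string, CountryBuild> = {
  nl: () => new Map([["1012", { localityId: 1, localityName: "Amsterdam" }]]),
  jp: () => new Map([["0600000", { localityId: 2, localityName: "Sapporo Shi Chuo Ku" }]]),
};

const dutchHit = resolvePostcode(builds.nl(), "1012");
const japaneseHit = resolvePostcode(builds.jp(), "0600000");
```

<p>Adding a country under this design means writing another entry in <code>builds</code> and leaving <code>resolvePostcode</code> alone.</p>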
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="korea-the-same-trick-inverted">Korea, the same trick inverted<a href="https://mailwoman.sister.software/blog/2026/06/05/the-map-runs-out-before-the-country-does#korea-the-same-trick-inverted" class="hash-link" aria-label="Direct link to Korea, the same trick inverted" title="Direct link to Korea, the same trick inverted" translate="no">​</a></h2>
<p>Korea's data is the mirror image of Japan's, so the build came out mirrored too.</p>
<p>Japan made us go fetch the names from a postal authority. Korea hands them over for free: the GeoNames postal file for Korea already carries, in one place, the postcode, the place name, the province, and a latitude and longitude. No saga, no Shift-JIS. The snag is that the names are in Hangul (추자면), while WOF's romanized <code>spr.name</code> for Korea is some transliteration that may or may not line up. Matching Hangul against a romanized string goes nowhere, and that's exactly why Korea sat on our "blocked" list for a while.</p>
<p>It turns out that read was half right and gave up one step early. WOF doesn't only keep the romanized name. Its <code>names</code> table also carries Hangul, 13,120 native entries plus several thousand more filed under "undetermined language" that are Hangul all the same. So a Hangul-to-Hangul join is on the table after all. And because every Korean postcode arrives with a coordinate already attached, we could lead with the coordinate and treat the Hangul name as a second opinion. Korea's build is <strong>point-primary</strong>: take the postcode's coordinate, find the nearest WOF locality, confirm it by name where a name exists. A different first move from Japan, the same table out the other end, and the thing we were testing for, not one line of new resolver code.</p>
<p>On the parts that build resolves, it is excellent. WOF's Korean locality layer is dense, 21,139 of them, near enough one per village, so the nearest locality to a postcode sits a median of <strong>0.96 km away</strong>. The province falls out for free and exact: GeoNames' province name matches WOF's Korean region name 17 times out of 17. Hand us a Korean postcode and we'll put it in the right province on 100% of postcodes, with the nearest locality typically about a kilometre away. For a coarse fix, that's money in the bank.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="where-the-map-runs-out">Where the map runs out<a href="https://mailwoman.sister.software/blog/2026/06/05/the-map-runs-out-before-the-country-does#where-the-map-runs-out" class="hash-link" aria-label="Direct link to Where the map runs out" title="Direct link to Where the map runs out" translate="no">​</a></h2>
<p>Then you ask for the administrative name, and the floor gives way. The name confirms on <strong>26% of Korean postcodes</strong>. Japan was 94. Same method, same care, barely over a quarter of the hit rate, and the whole gap is a story about what the map happens to hold.</p>
<p>Two things go wrong, and both earn their names because they tell you where to dig. The first is a granularity mismatch. GeoNames names a postcode at the eup/myeon/dong level, 추자면, Chuja-<em>myeon</em>. WOF's locality layer is one rung finer, down at the hamlet, so the nearest point to that postcode is a village called "Mung" sitting <em>inside</em> Chuja-myeon. The coordinate is dead-on and the name belongs to a smaller, different place. Both sources are telling the truth about different rungs of the same ladder, and the ladder doesn't line up.</p>
<p>The second one is worse, and it's the one I'd lose sleep over. The single biggest bucket of misses is 구 (gu), the urban districts. Gangnam-gu. Haeundae-gu. The level that <em>is</em> the address for most of Seoul and Busan. WOF Korea doesn't carry those as named localities at all, so there is nothing on the map to confirm against. <strong>The single most address-dense slice of the country is the slice the gazetteer is thinnest on.</strong> You can have a method that works and a map that's blank exactly where the people are, and that is the honest ceiling on Korea today. No recipe tweak gets past a name that was never in the dataset.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="a-bug-the-verifier-caught-and-you-should-want-it-to">A bug the verifier caught, and you should want it to<a href="https://mailwoman.sister.software/blog/2026/06/05/the-map-runs-out-before-the-country-does#a-bug-the-verifier-caught-and-you-should-want-it-to" class="hash-link" aria-label="Direct link to A bug the verifier caught, and you should want it to" title="Direct link to A bug the verifier caught, and you should want it to" translate="no">​</a></h2>
<p>One detour, because it's the kind of mistake that ships quietly if you let it. The first version of the Korean build reported 56% name confirmation, and we were briefly delighted. Then we looked at the distances, and the "confirmed" matches were averaging 71 kilometres from the postcode, a few of them out past 500.</p>
<p>Korean place names repeat. A lot. Dozens of villages share a name up and down the country, and the matcher had been finding a name match <em>anywhere in Korea</em> and then taking the nearest copy, which can still be a province away. The fix is the same proximity leash Japan's build already wore: a name only counts as confirmation if the place it names also sits nearby. That pulled the number down to its honest 26% and the average distance back under five kilometres. <strong>One signal in a costume of two is worse than the one signal alone</strong> — the inflated 56% would have told us Korea was twice as solved as it is. Make your two signals genuinely agree, or don't get to call it two.</p>
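<p>The difference between the inflated check and the leashed one is a single extra condition. A minimal sketch, assuming a hypothetical <code>Place</code> shape, a made-up place name, and an illustrative <code>maxKm</code> radius that is not the threshold the build uses.</p>

```typescript
interface Place {
  name: string;
  lat: number;
  lon: number;
}

// Great-circle distance in kilometres (haversine).
function haversineKm(a: Place, b: Place): number {
  const toRad = (deg: number) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * 6371 * Math.asin(Math.sqrt(h));
}

// The inflated check: a same-named place anywhere in the country counts.
function confirmsAnywhere(postcode: Place, places: Place[]): boolean {
  return places.some((p) => p.name === postcode.name);
}

// The leashed check: the same-named place must also sit within `maxKm`.
// `maxKm` is an illustrative radius, not the value the build uses.
function confirmsNearby(postcode: Place, places: Place[], maxKm = 10): boolean {
  return places.some((p) => p.name === postcode.name && haversineKm(postcode, p) <= maxKm);
}

// Made-up data: one namesake roughly 280 km to the north, none nearby.
const postcodePlace: Place = { name: "Example-ri", lat: 35.0, lon: 127.0 };
const farNamesake: Place = { name: "Example-ri", lat: 37.5, lon: 127.0 };
const nearNamesake: Place = { name: "Example-ri", lat: 35.01, lon: 127.0 };

const inflated = confirmsAnywhere(postcodePlace, [farNamesake]);
const leashed = confirmsNearby(postcodePlace, [farNamesake]);
const leashedWithNeighbour = confirmsNearby(postcodePlace, [farNamesake, nearNamesake]);
```

<p>With only the distant namesake on the map, the first check confirms and the second refuses, which is the gap between 56% and 26% in miniature.</p>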
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-we-keep-and-what-the-map-still-owes-us">What we keep, and what the map still owes us<a href="https://mailwoman.sister.software/blog/2026/06/05/the-map-runs-out-before-the-country-does#what-we-keep-and-what-the-map-still-owes-us" class="hash-link" aria-label="Direct link to What we keep, and what the map still owes us" title="Direct link to What we keep, and what the map still owes us" translate="no">​</a></h2>
<p>So where does that leave the bet? The architecture held. A point-primary build and a name-primary build, two countries whose data shares almost nothing, both poured into the same resolver strategy with no new resolver code between them. The "less special" thing we wanted to prove, that this generalizes past one lucky locale, is proven. What it can't do is conjure place names that Who's On First was never handed.</p>
<p>So Korea ships as honest as it is: rock-solid on province and coordinate, explicit about a 26% name tier, marked experimental and kept out of the default bundle until the rest catches up. The catch-up has an address. Korea's road-name database, Juso, carries the gu and dong names natively. It's locked behind a government API key, so getting it is a deliberate acquisition, and it's next on the list to go fetch. Taiwan is one rung further back: there's no GeoNames postal file for it at all, a flat 404, so there isn't even a coordinate to begin with until we source one.</p>
<p>If there's a portable lesson in two countries' worth of this, it's that a geocoder is only ever as good as the map it stands on, and a map's favourite way to lie is to leave things out. Japan's map was missing shapes, and we could route around that. Korea's map is missing names, right where the cities are, and there's no routing around a blank. So before you tune a model or argue with a matcher, go look at what your reference data actually holds in the exact spot you care about most. The country is all still there. Whether the map admits it is a separate question, and it's usually the one that decides how far you get.</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Architecture" term="Architecture"/>
        <category label="Resolver / WOF" term="Resolver / WOF"/>
        <category label="Postcode components" term="Postcode components"/>
        <category label="Japan" term="Japan"/>
        <category label="International" term="International"/>
        <category label="Non-Latin scripts" term="Non-Latin scripts"/>
        <category label="Advanced" term="Advanced"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Does a postcode know what country it's in?]]></title>
        <id>https://mailwoman.sister.software/blog/2026/06/03/does-a-postcode-know-what-country-its-in</id>
        <link href="https://mailwoman.sister.software/blog/2026/06/03/does-a-postcode-know-what-country-its-in"/>
        <updated>2026-06-03T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We thought a postcode was a strong enough signal to delete a configuration flag. Then we measured it. A five-digit code pins its own country less than half the time — and that number talked us out of writing the code we were about to write.]]></summary>
        <content type="html"><![CDATA[<p>We set out to fix a small wart in our address parser and came away with a number that told us to put the screwdriver down.</p>
<p>Here is the wart. When our postcode extractor sees a five-digit run and wants to know whether it's a real postcode or just a house number that happens to look like one, it peeks at the words sitting next to it and checks them against every country's street vocabulary we know — American, German, French, all at once. That "all at once" is fine at three countries. At twenty it gets loud, and a German street suffix starts shadowing an English word by sheer coincidence. So we went looking for the clean way to tell the extractor <em>which</em> country's words to bother with.</p>
<p>That question has a much bigger sibling, and chasing the sibling is where the story actually is.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-thing-we-actually-wanted">The thing we actually wanted<a href="https://mailwoman.sister.software/blog/2026/06/03/does-a-postcode-know-what-country-its-in#the-thing-we-actually-wanted" class="hash-link" aria-label="Direct link to The thing we actually wanted" title="Direct link to The thing we actually wanted" translate="no">​</a></h2>
<p>Our resolver, the part that turns a parsed address into a point on Earth, takes a <code>--default-country</code> flag. You hand it <code>US</code> and it searches the American gazetteer; you hand it <code>DE</code> and it searches the German one. It works, and we hate it, because in production nobody hands you the country. The whole reason you're parsing the address is that you <em>don't</em> know where it is yet. A flag that makes you supply the answer up front is a flag that solves the easy half of the problem and leaves the hard half on the floor.</p>
<p>So here's the dream, and it's a good one. The postcode is the most information-dense token in an address: five or six characters that encode a routing hierarchy, a region, often a neighbourhood. We already extract it before the neural parser runs. What if the postcode just <em>told</em> the resolver which country to search? Delete the flag, let the address speak for itself, and as a bonus we'd have the locale signal the street-vocabulary check was asking for in the first place. One stone, several birds.</p>
<p>You can probably feel the shape of the questions piling up:</p>
<ul>
<li class="">Where <em>should</em> that "which country" signal come from: the extractor, the resolver, the model?</li>
<li class="">Is the street-vocabulary blindness even a real problem, or a tidy-minded itch?</li>
<li class="">And the load-bearing one: is a postcode actually a strong enough signal to retire the flag?</li>
</ul>
<p>We brought all three to a second opinion before touching anything.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="a-second-opinion-and-a-sharper-question">A second opinion, and a sharper question<a href="https://mailwoman.sister.software/blog/2026/06/03/does-a-postcode-know-what-country-its-in#a-second-opinion-and-a-sharper-question" class="hash-link" aria-label="Direct link to A second opinion, and a sharper question" title="Direct link to A second opinion, and a sharper question" translate="no">​</a></h2>
<p>When a decision feels heavier than it looks, we run it past a second model (a different architecture, with no stake in our assumptions) and let it push back. This was one of those. Four turns in, it had stopped answering the question and started reframing it — and the reframe is the part worth keeping.</p>
<p>The street-vocabulary blindness, our second opinion argued, is a symptom wearing the costume of a bug. Conditioning that one helper on a locale would scratch the itch and teach us nothing. The actual gap underneath is that there is no <em>single, early, reliable place</em> where "which country is this" gets decided once and shared. We had three half-answers scattered across the codebase: the extractor computing a country posterior from the gazetteer, a rule-based stage guessing locale from the postcode's shape, the model's eventual learned guess. No one agreed which was the source of truth, or how they were supposed to relate. The blind helper was just the loose thread you could see.</p>
<p>That reframe pointed at a clean design, and I'll give you the one idea worth keeping: <strong>unify the data, not the modules.</strong> Every address system in our reference package already owns its own postcode shape. So the one new thing we built is the <em>inverse</em> of those shapes — a function that takes a postcode and asks every system at once, "is this yours?" A bare <code>68161</code> comes back <code>[us, de, fr]</code>, because a five-digit shape genuinely belongs to all three. Both the extractor and the rule-based stage read from that one function instead of keeping their own divergent copies. Nobody calls anybody; they share a table. That's the part that scales.</p>
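<p>As a rough sketch of that inverse function, assuming each system's postcode shape can be written as a regular expression: the <code>POSTCODE_SHAPES</code> table and the <code>systems_claiming</code> name below are illustrative, not the reference package's real API, and the patterns are simplified.</p>

```python
import re

# Illustrative shapes only; the real reference package owns the full set.
POSTCODE_SHAPES = {
    "us": re.compile(r"\d{5}(-\d{4})?"),
    "de": re.compile(r"\d{5}"),
    "fr": re.compile(r"\d{5}"),
    "gb": re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}"),
}


def systems_claiming(postcode):
    """Ask every address system at once whether this postcode fits its shape."""
    text = postcode.strip().upper()
    # fullmatch: the whole token must fit the shape, not just a prefix of it.
    return [code for code, shape in POSTCODE_SHAPES.items() if shape.fullmatch(text)]
```

<p>Under these assumed shapes, <code>systems_claiming("68161")</code> returns <code>['us', 'de', 'fr']</code>, while <code>"SW1A 1AA"</code> is claimed by <code>gb</code> alone. Both callers read the same table, so there is only one place for a shape to drift.</p>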
<p>The rest of the design followed from there: a small fused "locale prior" object, and a clean rule that the resolver always takes that prior's <em>shape</em> while the thing <em>producing</em> it can be swapped (a cheap pre-pass today, the trained model later). It's tidy. It's the kind of architecture you sketch on a whiteboard and feel good about.</p>
<p>And then, before building a line of it, we did the thing we should always do and rarely want to: we tried to kill it.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="measure-before-you-build">Measure before you build<a href="https://mailwoman.sister.software/blog/2026/06/03/does-a-postcode-know-what-country-its-in#measure-before-you-build" class="hash-link" aria-label="Direct link to Measure before you build" title="Direct link to Measure before you build" translate="no">​</a></h2>
<p>The whole edifice rests on one assumption: that the postcode is present and unambiguous often enough to carry the country on its own. That's testable today, on real addresses, with no model and no new code beyond a probe. So we wrote the probe: take a thousand-plus real US addresses and a thousand-plus German ones, extract the postcode, resolve it against the gazetteer, and ask how confidently it names a single country.</p>
<p>The postcode is <em>present</em> every time. OpenAddresses is postcode-rich; an anchor fired on 100% of rows. That part of the dream survives.</p>
<p>Here's the part that doesn't.</p>
<table><thead><tr><th></th><th style="text-align:right">US</th><th style="text-align:right">DE</th></tr></thead><tbody><tr><td>postcode present</td><td style="text-align:right">100%</td><td style="text-align:right">100%</td></tr><tr><td><strong>names one country, confidently</strong></td><td style="text-align:right"><strong>27.9%</strong></td><td style="text-align:right"><strong>44.1%</strong></td></tr></tbody></table>
<p>A US postcode pins its own country a little over a quarter of the time. A German one, not quite half. The rest of the time the strongest signal in the address shrugs and offers you a menu.</p>
<p>The reason is the most ordinary thing in the world: a five-digit code is five digits in a lot of places. <code>75001</code> is the first arrondissement of Paris. It is also Addison, Texas. The gazetteer, asked in good faith, reports both, and a uniform posterior over <code>{FR, US}</code> is an honest answer to a question the postcode simply cannot settle. Same script, same length, two continents. Multiply that across every numeric-postcode country and the confident cases are the minority.</p>
<p>(One trap worth flagging, since I nearly fell in it: an early version of the probe looked far rosier because of an alphabetical tie-break. When the posterior is a flat <code>{DE, US}</code>, "DE" sorts first and quietly wins, so the German numbers looked almost perfect. They were an artifact of the sort order, not the signal. The honest reading is the confident-single-country rate above, and only that.)</p>
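<p>The difference between the rosy reading and the honest one fits in a few lines. This is a sketch under assumptions: the function names and the 0.9 <code>threshold</code> are hypothetical, and a posterior is modelled as a plain dict from country code to probability.</p>

```python
def naive_top_country(posterior):
    """The trap: ties fall back to sort order, so a flat posterior still 'wins'."""
    return sorted(posterior.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]


def confident_country(posterior, threshold=0.9):
    """Return a country only when one candidate clearly dominates."""
    country, prob = max(posterior.items(), key=lambda kv: kv[1])
    return country if prob >= threshold else None


def confident_rate(posteriors, threshold=0.9):
    """Share of rows whose postcode names a single country with confidence."""
    hits = sum(1 for p in posteriors if confident_country(p, threshold) is not None)
    return hits / len(posteriors)
```

<p>On a flat <code>{"DE": 0.5, "US": 0.5}</code>, <code>naive_top_country</code> returns <code>"DE"</code> and <code>confident_country</code> returns <code>None</code>. Only the second behaviour is safe to count.</p>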
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-the-number-was-actually-telling-us">What the number was actually telling us<a href="https://mailwoman.sister.software/blog/2026/06/03/does-a-postcode-know-what-country-its-in#what-the-number-was-actually-telling-us" class="hash-link" aria-label="Direct link to What the number was actually telling us" title="Direct link to What the number was actually telling us" translate="no">​</a></h2>
<p>A weak result is still a clue, so it's worth being precise about what it ruled out and what it confirmed.</p>
<p>It ruled out the bonus. An extractor-only locale prior cannot retire the <code>--default-country</code> flag, because more than half the time it would hand the resolver a coin-flip, and a coin-flip is worse than a default. The clean PR we'd sketched would have failed its own acceptance test. We just hadn't written it yet, which is the entire return on running the probe first.</p>
<p>What it confirmed is the more interesting half, and it's something our own design document had asserted on faith months ago: <em>figuring out the country is most of what parsing an address is.</em> If the single most information-dense token settles the question less than half the time (27.9% for US rows, 44.1% for German ones), then the rest of the answer has to come from everything around it: the city, the street, the order the pieces arrive in. You can't get that from a regex run before the model; you get it from the model itself, reading the whole string at once and conditioning its own decisions on what it infers. The number didn't break the plan. It told us which layer the country actually lives in, and that layer is the expensive one.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-shipped-and-what-we-left-alone">What shipped, and what we left alone<a href="https://mailwoman.sister.software/blog/2026/06/03/does-a-postcode-know-what-country-its-in#what-shipped-and-what-we-left-alone" class="hash-link" aria-label="Direct link to What shipped, and what we left alone" title="Direct link to What shipped, and what we left alone" translate="no">​</a></h2>
<p>So we shipped the piece that survived contact with the evidence. The street-vocabulary check is now gated by the postcode's real gazetteer membership: a US-only ZIP consults the American vocabulary and never asks the German one, because there's nothing German about it. An unrelated language's words can no longer down-weight a code that was never theirs. It scales to twenty countries cleanly, the resolver evals come out byte-identical to before (a precision change you can't see on a clean sample is exactly the change you want), and the shared inverse-shape function is now in place for whatever reads it next.</p>
<p>And we left the flag alone, on purpose, with a number to point at. <code>--default-country</code> stays until the country signal comes from where the evidence says it has to: the trained model, conditioning on the full address. That's a heavier piece of work, and now it's a justified one rather than a hopeful one.</p>
<p>The cheaper lesson is the one I'd actually press on you. We came within one satisfying afternoon of building a clean, well-argued, doomed feature. What stopped us wasn't taste or a code review — it was a few hours of measurement aimed squarely at the assumption everything else rested on. Find the load-bearing assumption in whatever you're about to build, and go try to break it before you write the part that depends on it. The probe that saves you a week looks, going in, exactly like the probe that wastes you an afternoon. Run it anyway.</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Architecture" term="Architecture"/>
        <category label="Neural classifier" term="Neural classifier"/>
        <category label="Postcode components" term="Postcode components"/>
        <category label="Resolver / WOF" term="Resolver / WOF"/>
        <category label="International" term="International"/>
        <category label="Advanced" term="Advanced"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Our parser fails 80% of our own tests. We shipped it anyway.]]></title>
        <id>https://mailwoman.sister.software/blog/two-scoreboards</id>
        <link href="https://mailwoman.sister.software/blog/two-scoreboards"/>
        <updated>2026-05-31T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A neural address parser that scores 20% on our test suite and beats Pelias on real addresses. Both numbers are true — here's why we trust the second one.]]></summary>
        <content type="html"><![CDATA[<p>Our neural address parser passes <strong>20.7%</strong> of our test suite. The rule-based parser it's
meant to replace passes <strong>93.7%</strong>. By that scoreboard, we should delete the neural model and
go home.</p>
<p>We shipped the neural model instead. Here's why both numbers are true — and why the one that
matters says the opposite.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="two-parsers-one-bench">Two parsers, one bench<a href="https://mailwoman.sister.software/blog/two-scoreboards#two-parsers-one-bench" class="hash-link" aria-label="Direct link to Two parsers, one bench" title="Direct link to Two parsers, one bench" translate="no">​</a></h2>
<p>Mailwoman carries two address parsers. <code>v0</code> is a hand-written rule engine — a TypeScript port
of the <a href="https://github.com/pelias/parser" target="_blank" rel="noopener noreferrer" class="">Pelias</a> parser, all regexes and dictionaries and
heuristics. The other is a 29M-parameter encoder-only transformer that tags each token (street,
locality, postcode, …) and was trained on synthetic and real corpora. The whole bet of the
neural model is that it generalizes to messy real-world input where brittle rules fail.</p>
<p>To check the bet, we run both through the same 415-assertion test suite. The rules parser wins
in a landslide: 93.7% to 20.7%.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-catch-the-bench-was-built-by-the-opponent">The catch: the bench was built by the opponent<a href="https://mailwoman.sister.software/blog/two-scoreboards#the-catch-the-bench-was-built-by-the-opponent" class="hash-link" aria-label="Direct link to The catch: the bench was built by the opponent" title="Direct link to The catch: the bench was built by the opponent" translate="no">​</a></h2>
<p>Look one level down, at the per-file results, and something jumps out: <strong><code>v0</code> passes 100% of
every functional file.</strong> Not 99%. Every single one.</p>
<p>That's not skill — it's lineage. Every one of those 415 assertions was ported from the
Pelias and <code>addressit</code> test suites, and <code>v0</code> <em>is</em> our port of Pelias, so the suite is grading a
parser against its own author's answer key. It cannot, even in principle, catch <code>v0</code> being wrong,
because <code>v0</code>'s output <strong>is</strong> the definition of correct.</p>
<p>So "neural scores 20.7%" measures one thing: <strong>how often neural disagrees with Pelias's exact
conventions</strong> — where to split a multi-word street, where a venue ends and a locality begins, the
dozens of micro-decisions addressit happened to encode. It says nothing about how often neural is
<em>wrong</em>. Useful as a regression gate (did a retrain break something we used to match?); useless
as a verdict on which parser is better.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="decomposing-the-20">Decomposing the 20%<a href="https://mailwoman.sister.software/blog/two-scoreboards#decomposing-the-20" class="hash-link" aria-label="Direct link to Decomposing the 20%" title="Direct link to Decomposing the 20%" translate="no">​</a></h2>
<p>To judge quality fairly we need benches drawn from <em>outside</em> the Pelias lineage. We score both
parsers on three:</p>
<table><thead><tr><th>arena</th><th>what it is</th><th style="text-align:right">n</th><th style="text-align:right">v0</th><th style="text-align:right">neural</th></tr></thead><tbody><tr><td><strong>libpostal</strong></td><td>clean, canonical strings</td><td style="text-align:right">69</td><td style="text-align:right"><strong>29%</strong></td><td style="text-align:right">16%</td></tr><tr><td><strong>perturb</strong></td><td>noisy, abbreviated, reordered</td><td style="text-align:right">398</td><td style="text-align:right">39%</td><td style="text-align:right"><strong>61%</strong></td></tr><tr><td><strong>postal</strong></td><td>edge formats (PO box, military…)</td><td style="text-align:right">38</td><td style="text-align:right"><strong>26%</strong></td><td style="text-align:right">8%</td></tr></tbody></table>
<p>Three different stories:</p>
<ul>
<li class=""><strong>Clean input → rules win.</strong> Canonical strings are exactly what hand-tuned regexes are for.
The harness is made up <em>entirely</em> of this kind of input (all canonical, all Pelias-convention), which is why
neural looks worst there.</li>
<li class=""><strong>Messy input → neural wins, decisively</strong> (61% vs 39%) — and this is the biggest bench by
far (398 cases), built by perturbing real addresses: dropped commas, abbreviations,
reordering, weird casing. It's the closest proxy we have to what people actually type, and
it's the whole reason the neural model exists.</li>
<li class=""><strong>Edge formats → both are bad.</strong> PO boxes, military APO/FPO, and rural routes are <strong>0% for
both</strong> parsers. Neither was built for them.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-scoreboard-that-matters">The scoreboard that matters<a href="https://mailwoman.sister.software/blog/two-scoreboards#the-scoreboard-that-matters" class="hash-link" aria-label="Direct link to The scoreboard that matters" title="Direct link to The scoreboard that matters" translate="no">​</a></h2>
<p>A geocoder's job is to put a real address on the map. So the honest test is end-to-end:
take 10,000 real US addresses with real government
coordinates, run each parser through the <em>same</em> resolver, and ask which one lands on the right
city.</p>
<table><thead><tr><th>parser</th><th style="text-align:right">locality match (10k real addresses)</th></tr></thead><tbody><tr><td><strong>neural</strong></td><td style="text-align:right"><strong>97.3%</strong></td></tr><tr><td>v0 (Pelias)</td><td style="text-align:right">95.8%</td></tr></tbody></table>
<p>On the metric that matches the product — real addresses, end to end — <strong>the neural parser
beats the rules parser.</strong> The 20.7% and the 97.3% are measuring two completely different
things: agreement with Pelias's answer key, versus getting real addresses right.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-lesson">The lesson<a href="https://mailwoman.sister.software/blog/two-scoreboards#the-lesson" class="hash-link" aria-label="Direct link to The lesson" title="Direct link to The lesson" translate="no">​</a></h2>
<p>If you port your test suite from the system you're trying to beat, that system scores 100% by
construction, and your challenger will always look broken. The suite is doing its job:
faithfully measuring <em>agreement with the incumbent</em>. Just don't mistake that for a measure of
quality.</p>
<p>Measure on the distribution you actually serve. For us that's messy, abbreviated, real-world
addresses — and there, the learned model is ahead.</p>
<hr>
<p>The full breakdown is in the
<a class="" href="https://mailwoman.sister.software/docs/retrospectives/v0-7-v0-8-neural-vs-rules-retrospective">v0.7–v0.8 retrospective</a>: every
arena, the genuine neural deficits (it does truncate <code>Belle Fourche</code> to <code>Belle</code>), the masked-LM
pre-training experiment that turned into a clean negative result, and what's next (street-level
geometry, to go from "right city" to "right spot").</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Neural classifier" term="Neural classifier"/>
        <category label="Rule-based classifiers" term="Rule-based classifiers"/>
        <category label="Resolver / WOF" term="Resolver / WOF"/>
        <category label="Geocoding" term="Geocoding"/>
        <category label="Advanced" term="Advanced"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The model that never saw an intersection]]></title>
        <id>https://mailwoman.sister.software/blog/2026/05/29/the-model-that-never-saw-an-intersection</id>
        <link href="https://mailwoman.sister.software/blog/2026/05/29/the-model-that-never-saw-an-intersection"/>
        <updated>2026-05-29T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We set out to fix our address parser's overconfidence. We discovered overconfidence was never the problem — and that you can't calibrate your way out of a coverage gap.]]></summary>
        <content type="html"><![CDATA[<p>We spent a night trying to make our neural address parser less cocky. We ended it having learned something more useful. The model wasn't cocky — it was uninformed. It had <strong>never been shown</strong> whole categories of address.</p>
<p>This is the story of chasing the wrong number, and the diagnostics that pointed at the right one.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-hypothesis-its-overconfident">The hypothesis: it's overconfident<a href="https://mailwoman.sister.software/blog/2026/05/29/the-model-that-never-saw-an-intersection#the-hypothesis-its-overconfident" class="hash-link" aria-label="Direct link to The hypothesis: it's overconfident" title="Direct link to The hypothesis: it's overconfident" translate="no">​</a></h2>
<p>Across the v0.6.x training cycle, one pattern kept surfacing: when the model was wrong, it was <em>confidently</em> wrong. On a held-out test set, <strong>86% of its incorrect predictions were made at ≥0.9 confidence</strong> — and most of those at a flat 1.00. A model that hedged appropriately would, we reasoned, stop steamrolling good answers with bad high-confidence ones.</p>
<p>The standard tool for that is <strong>label smoothing</strong>: instead of training toward a one-hot target (1.0 for the right tag, 0 for the rest), you train toward something softer (0.9 / spread-the-rest). It caps how peaked the model's outputs can get. So we ran a clean, single-variable experiment (the v0.6.0 recipe plus <code>label_smoothing=0.1</code>, nothing else changed) and measured.</p>
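<p>To make the softer target concrete, here is a small sketch of the arithmetic in plain Python. It follows the "0.9 / spread-the-rest" description above; the helper names are hypothetical, and some frameworks instead spread the smoothing mass over all classes including the true one, so the exact numbers in a given training stack may differ slightly.</p>

```python
import math


def smoothed_target(true_index, num_classes, smoothing=0.1):
    """Soft target: 1 - smoothing on the true tag, the rest shared equally."""
    off_value = smoothing / (num_classes - 1)
    target = [off_value] * num_classes
    target[true_index] = 1.0 - smoothing
    return target


def cross_entropy(probs, target):
    """Cross-entropy between a predicted distribution and a (soft) target."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)


target = smoothed_target(0, 5)                   # about [0.9, 0.025, 0.025, 0.025, 0.025]
peaked = [0.996, 0.001, 0.001, 0.001, 0.001]     # a near-1.00 prediction
peaked_loss = cross_entropy(peaked, target)
matched_loss = cross_entropy(target, target)     # prediction equal to the soft target
```

<p>Against the smoothed target, the near-1.00 prediction costs more than one that matches the target, so pushing confidence to 1.00 stops paying off. Against a one-hot target the ordering reverses, which is where the peaked outputs came from.</p>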
<p>It worked, exactly as advertised. Overconfidence-on-wrong dropped <strong>86% → 67%</strong>; the mass at 1.00 confidence vanished, capped around 0.95. Postcode recall even ticked up.</p>
<p>And the metric we actually ship on — <strong>harness pass rate</strong> — didn't move. 14.6% → 13.8%. If anything, slightly down. Two tags (house numbers, streets) <em>regressed</em>.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="following-the-evidence">Following the evidence<a href="https://mailwoman.sister.software/blog/2026/05/29/the-model-that-never-saw-an-intersection#following-the-evidence" class="hash-link" aria-label="Direct link to Following the evidence" title="Direct link to Following the evidence" translate="no">​</a></h2>
<p>A well-calibrated model that's no better at the job is a clue, not a victory. So instead of tuning the smoothing knob again, we asked a blunter question: <strong>of everything the harness gets wrong, what kind of wrong is it?</strong></p>
<p>We categorized every failure. The answer reframed the whole project:</p>
<ul>
<li class=""><strong>55% of the gap was missing labels</strong> — the model emitted <em>no tag at all</em> where one belonged. Not a wrong value, not a fuzzy boundary. Silence.</li>
<li class="">The most-missed tags were <code>street</code> (×197) and <code>house_number</code> (×100).</li>
<li class="">One cluster stood out: <strong>intersections</strong> — addresses like <code>Broadway &amp; W 42nd St</code>. They're 17% of our harness, and the model scored <strong>0%</strong> on them.</li>
</ul>
<p>Calibration softens the confidence of labels the model <em>does</em> emit. It is structurally incapable of conjuring a label the model never produces. That's why it left the harness flat: we'd been sharpening the model's aim at targets it wasn't even shooting at.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-probe">The probe<a href="https://mailwoman.sister.software/blog/2026/05/29/the-model-that-never-saw-an-intersection#the-probe" class="hash-link" aria-label="Direct link to The probe" title="Direct link to The probe" translate="no">​</a></h2>
<p>We ran a single probe on a canonical intersection. For every token in <code>Broadway &amp; W 42nd St</code>, we read off the probability the model assigned to the <code>intersection_a</code> / <code>intersection_b</code> tags.</p>
<p>The maximum, across every token, was <strong>~0.0001</strong>.</p>
<p>Uncertainty doesn't look like that. A model that's merely unsure still puts <em>some</em> probability on the right tag; ~0.0001 means the model has <em>no representation of intersections whatsoever</em>. The labels existed in its output vocabulary; it had simply never learned to use them.</p>
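<p>A probe of this kind is short to write. The sketch below assumes you can get raw per-token logits out of the model; <code>softmax</code> and <code>max_tag_probability</code> are hypothetical helpers, and any logits passed in are whatever your own model emits.</p>

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_tag_probability(token_logits, tag_names, wanted):
    """Highest probability any token assigns to any of the wanted tags."""
    wanted_idx = [i for i, name in enumerate(tag_names) if name in wanted]
    if not wanted_idx:
        raise ValueError("none of the wanted tags exist in tag_names")
    best = 0.0
    for logits in token_logits:
        probs = softmax(logits)
        best = max(best, max(probs[i] for i in wanted_idx))
    return best
```

<p>Reading the maximum rather than the argmax is the point: the argmax only tells you the tag lost, while the maximum tells you whether it was ever in the running.</p>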
<p>Why? We checked the corpus pipeline. There are synthesizers for streets, no-street venues, PO boxes, house+venue combinations… and <strong>nothing that generates intersections</strong>. The real-world adapters don't emit them in that form either. The training signal for intersections was, to a very good approximation, zero. The model never saw one — so it never learned one. No loss function, no calibration trick, no bigger model recovers a category that isn't in the data.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="a-different-coverage-gap-a-different-fix">A different coverage gap, a different fix<a href="https://mailwoman.sister.software/blog/2026/05/29/the-model-that-never-saw-an-intersection#a-different-coverage-gap-a-different-fix" class="hash-link" aria-label="Direct link to A different coverage gap, a different fix" title="Direct link to A different coverage gap, a different fix" translate="no">​</a></h2>
<p>Calibration's one genuine win (a small postcode bump) pointed at a <em>second</em> coverage story, this one about tokenization.</p>
<p>Alphanumeric postcodes (<code>SW1A 1AA</code>, <code>M5V 2T6</code>) get shredded by the subword tokenizer into fragments like <code>["S","##W","##1","##A", "1","##AA"]</code>. The seven-character shape a regex would trivially recognize is invisible to a model reasoning over disconnected pieces. The result: GB/CA/NL postcodes at 0%.</p>
<p>Here the fix wasn't training at all. A <strong>deterministic regex repair</strong> runs <em>after</em> the model decodes: detect a postcode-shaped substring, and snap the label span to it. On the postcode harness that single pass fixed <strong>135 cases and regressed zero</strong>, taking GB/CA/DE/PT to 100%. Sometimes the right tool is a retrain. Sometimes it's eight lines of pattern-matching and a careful "longest-match-wins" rule so a US ZIP+4 doesn't get mistaken for a Dutch postcode.</p>
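<p>A hedged sketch of what such a repair pass might look like follows. The patterns are simplified, and <code>POSTCODE_PATTERNS</code>, <code>find_postcode_span</code>, and <code>repair_postcode</code> are illustrative names rather than the shipped code. Labels are assumed to be <code>(start, end, tag)</code> character spans.</p>

```python
import re

# Simplified, illustrative shapes; matching is case-insensitive.
POSTCODE_PATTERNS = {
    "us_zip4": re.compile(r"\b\d{5}-\d{4}\b", re.IGNORECASE),
    "us_zip": re.compile(r"\b\d{5}\b", re.IGNORECASE),
    "nl": re.compile(r"\b\d{4} ?[A-Z]{2}\b", re.IGNORECASE),
    "gb": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b", re.IGNORECASE),
    "ca": re.compile(r"\b[A-Z]\d[A-Z] ?\d[A-Z]\d\b", re.IGNORECASE),
}


def find_postcode_span(text):
    """Return (start, end, kind) for the longest postcode-shaped substring."""
    best = None
    for kind, pattern in POSTCODE_PATTERNS.items():
        for match in pattern.finditer(text):
            span = (match.start(), match.end(), kind)
            # Longest match wins; on a tie the earlier pattern keeps its place.
            if best is None or (span[1] - span[0]) > (best[1] - best[0]):
                best = span
    return best


def repair_postcode(text, labels):
    """Snap the postcode label onto the detected span; leave other labels alone."""
    found = find_postcode_span(text)
    if found is None:
        return labels
    start, end, _kind = found
    kept = [span for span in labels if span[2] != "postcode"]
    return sorted(kept + [(start, end, "postcode")])
```

<p>The longest-match rule earns its keep on a string like <code>62704-1234 IL</code>: the tail <code>1234 IL</code> fits the Dutch shape, but the ten-character ZIP+4 is longer and wins.</p>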
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-we-actually-learned">What we actually learned<a href="https://mailwoman.sister.software/blog/2026/05/29/the-model-that-never-saw-an-intersection#what-we-actually-learned" class="hash-link" aria-label="Direct link to What we actually learned" title="Direct link to What we actually learned" translate="no">​</a></h2>
<p>A few lessons we're keeping:</p>
<ul>
<li class=""><strong>Pick a metric that can't be gamed by the thing you're optimizing.</strong> Per-tag F1 looked fine while the product was stuck; harness pass rate (does the <em>whole</em> address come out right?) told the truth.</li>
<li class=""><strong>A confident-wrong model and an ignorant model need opposite fixes.</strong> We assumed the former; the data showed the latter. Calibration for one, coverage for the other.</li>
<li class=""><strong>Structural validity is its own signal.</strong> A checker that flags incoherent parses — a house number with no street, an orphaned unit — caught a mid-training regression that the headline accuracy number completely hid.</li>
<li class=""><strong>You can't learn what you never see.</strong> The most expensive-sounding problem of the night had the cheapest root cause: a missing synthesizer.</li>
</ul>
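<p>The structural-validity idea from the list above can be sketched as a pair of rules over the tags a parse emits. The function name, the tag names, and the definition of an "orphaned" unit here are assumptions for illustration; a real checker would likely carry more rules.</p>

```python
def structural_issues(tags):
    """Flag parses whose parts do not hang together.

    tags: the set of tag names one parse emitted for a single address.
    """
    issues = []
    if "house_number" in tags and "street" not in tags:
        issues.append("house_number without street")
    # Assumption: a unit is orphaned when nothing street-level anchors it.
    if "unit" in tags and "house_number" not in tags and "street" not in tags:
        issues.append("orphaned unit")
    return issues
```

<p>Counting non-empty results per evaluation run gives a signal that is independent of per-tag accuracy, which is why it can move when the headline number does not.</p>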
<p>So the real fix for intersections is mundane: a couple thousand synthetic <code>X &amp; Y St</code> examples, labeled and dropped into the corpus as a small targeted supplement, plus a retrain that finally gives the model something to learn from. That run is training as we publish this.</p>
<p>We'll report what the model does once it has, for the first time, actually seen an intersection.</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Neural classifier" term="Neural classifier"/>
        <category label="Model training" term="Model training"/>
        <category label="Night shift" term="Night shift"/>
        <category label="Advanced" term="Advanced"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Zero byte-fallback: a multi-script tokenizer from WOF-earth]]></title>
        <id>https://mailwoman.sister.software/blog/2026/05/28/global-wof-tokenizer</id>
        <link href="https://mailwoman.sister.software/blog/2026/05/28/global-wof-tokenizer"/>
        <updated>2026-05-28T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We pulled Who's On First data for 7 countries, trained a new tokenizer on 2.19M multi-script place names, and eliminated CJK byte-fallback entirely.]]></summary>
        <content type="html"><![CDATA[<p>The v0.5.0-a1 tokenizer had a dirty secret: it was trained exclusively on US and French addresses. When it encountered Chinese, Japanese, Korean, or Thai text, it fell back to encoding individual bytes: 30-75% of tokens depending on the script, and 50-75% for Chinese. Every byte-fallback token is a lost opportunity for the model to learn meaningful subword patterns.</p>
<p>Today we fixed that.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-data">The data<a href="https://mailwoman.sister.software/blog/2026/05/28/global-wof-tokenizer#the-data" class="hash-link" aria-label="Direct link to The data" title="Direct link to The data" translate="no">​</a></h2>
<p>Who's On First ships one GitHub repo per country, each containing GeoJSON files for every administrative place. Every place carries localized name variants — "New York" has a <code>name:zho</code> of "纽约", a <code>name:jpn</code> of "ニューヨーク", a <code>name:kor</code> of "뉴욕", and dozens more.</p>
<p>We cloned 7 priority countries (US, FR, JP, CN, KR, DE, GB) — 1.74 million GeoJSON files — and built them into a unified SQLite database using our WAL + Freeze pipeline:</p>
<table><thead><tr><th>Country</th><th>GeoJSON files</th><th>Time</th></tr></thead><tbody><tr><td>CN</td><td>680K</td><td>-</td></tr><tr><td>US</td><td>449K</td><td>-</td></tr><tr><td>FR</td><td>231K</td><td>-</td></tr><tr><td>DE</td><td>189K</td><td>-</td></tr><tr><td>GB</td><td>73K</td><td>-</td></tr><tr><td>JP</td><td>63K</td><td>-</td></tr><tr><td>KR</td><td>54K</td><td>-</td></tr><tr><td><strong>Total</strong></td><td><strong>1.74M</strong></td><td><strong>3 min</strong></td></tr></tbody></table>
<p>The result: 1.29 million places with 10.2 million name variants in 20+ languages. 768K Chinese names, 184K Japanese, 264K French, 261K German, 285K Arabic.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-tokenizer">The tokenizer<a href="https://mailwoman.sister.software/blog/2026/05/28/global-wof-tokenizer#the-tokenizer" class="hash-link" aria-label="Direct link to The tokenizer" title="Direct link to The tokenizer" translate="no">​</a></h2>
<p>We extracted a balanced multi-script training set (2.19M lines) from the global WOF names table, shuffled across script groups:</p>
<ul>
<li class="">500K Latin (English, French, German, Spanish, ...)</li>
<li class="">500K Chinese</li>
<li class="">468K Cyrillic (Russian, Ukrainian, ...)</li>
<li class="">285K Arabic</li>
<li class="">183K Japanese</li>
<li class="">94K Korean</li>
<li class="">160K other (Thai, Hindi, Hebrew, Greek, ...)</li>
</ul>
<p>SentencePiece trained in 28 seconds. Same 48K vocab size as before, same user-defined symbols (US state abbreviations, postcode formats). The difference: the vocab now allocates subword pieces for CJK characters, Hangul syllables, Thai consonant clusters, and Arabic word fragments — instead of wasting slots on Latin-only subwords that the old training data biased toward.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-result">The result<a href="https://mailwoman.sister.software/blog/2026/05/28/global-wof-tokenizer#the-result" class="hash-link" aria-label="Direct link to The result" title="Direct link to The result" translate="no">​</a></h2>
<table><thead><tr><th>Script</th><th>v0.5.0-a1 (old)</th><th>v0.6.0-a0 (new)</th></tr></thead><tbody><tr><td>Chinese</td><td>50-75% byte-fallback</td><td><strong>0%</strong></td></tr><tr><td>Japanese</td><td>58-60%</td><td><strong>0%</strong></td></tr><tr><td>Korean</td><td>41%</td><td><strong>0%</strong></td></tr><tr><td>Thai</td><td>30%</td><td><strong>0%</strong></td></tr><tr><td>Arabic</td><td>0%</td><td>0%</td></tr><tr><td>Latin</td><td>0%</td><td>0%</td></tr><tr><td><strong>Aggregate</strong></td><td><strong>36.6%</strong></td><td><strong>0.0%</strong></td></tr></tbody></table>
<p>Issue <a href="https://github.com/sister-software/mailwoman/issues/120" target="_blank" rel="noopener noreferrer" class="">#120</a> targeted less than 5% byte-fallback. We hit zero.</p>
<p>The tokenizer also produces fewer pieces per input. "北京市朝阳区建国路79号" (Beijing address) went from 19 pieces (63% byte-fallback) to 11 pieces (0% byte-fallback). That means more of the 128-token sequence budget is available for actual content instead of being consumed by byte encoding.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="whats-training">What's training<a href="https://mailwoman.sister.software/blog/2026/05/28/global-wof-tokenizer#whats-training" class="hash-link" aria-label="Direct link to What's training" title="Direct link to What's training" translate="no">​</a></h2>
<p>v0.5.4 is now running on a Modal A100 with the new tokenizer. It uses the v0.5.1 proven recipe (the one that achieved 0.638 F1) but with the multi-script tokenizer. If the model learns CJK address patterns as well as it learns Latin ones, this is the foundation for JP/CN/KR locale support.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-pipeline">The pipeline<a href="https://mailwoman.sister.software/blog/2026/05/28/global-wof-tokenizer#the-pipeline" class="hash-link" aria-label="Direct link to The pipeline" title="Direct link to The pipeline" translate="no">​</a></h2>
<p>The global WOF build pipeline follows the <a class="" href="https://mailwoman.sister.software/docs/reviews/2026-05-28-sqlite-wal-strategy">WAL + Freeze design brief</a>:</p>
<ol>
<li class=""><strong>Enumerate</strong>: glob <code>**/data/**/*.geojson</code> across all country repos</li>
<li class=""><strong>Ingest</strong>: WAL mode, parallel file reads (asyncParallelIterator), single-thread writer, batched transactions</li>
<li class=""><strong>Freeze</strong>: WAL checkpoint, journal_mode=DELETE, create indexes, ANALYZE, VACUUM INTO</li>
</ol>
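<p>The ingest and freeze steps can be sketched with Python's built-in <code>sqlite3</code> module. This is an illustration under assumptions, not the project's pipeline: the <code>places</code> table, the index name, and the function names are invented for the example, and <code>VACUUM INTO</code> needs SQLite 3.27 or newer.</p>

```python
import sqlite3


def ingest(db_path, batches):
    """Write batches of (id, name) rows in WAL mode, one transaction per batch."""
    conn = sqlite3.connect(db_path, isolation_level=None)  # we manage transactions
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS places (id INTEGER PRIMARY KEY, name TEXT)")
    for batch in batches:
        conn.execute("BEGIN")
        conn.executemany("INSERT INTO places (id, name) VALUES (?, ?)", batch)
        conn.execute("COMMIT")
    conn.close()


def freeze(db_path, frozen_path):
    """Checkpoint the WAL, leave WAL mode, index, analyze, and write a compact copy."""
    conn = sqlite3.connect(db_path, isolation_level=None)
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    conn.execute("PRAGMA journal_mode=DELETE")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_places_name ON places (name)")
    conn.execute("ANALYZE")
    # frozen_path must not already exist; VACUUM INTO refuses to overwrite.
    conn.execute("VACUUM INTO ?", (frozen_path,))
    conn.close()


def verify(frozen_path):
    """Run SQLite's integrity check on the frozen file and return its verdict."""
    conn = sqlite3.connect(frozen_path)
    verdict = conn.execute("PRAGMA integrity_check").fetchone()[0]
    conn.close()
    return verdict
```

<p>Switching back to <code>journal_mode=DELETE</code> before the copy is what keeps the frozen file free of <code>-wal</code> and <code>-shm</code> sidecars.</p>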
<p>The frozen artifact is a clean 1.09 GB SQLite with no sidecars, verified read-only, integrity-checked. It's available for download from the <a href="https://huggingface.co/buckets/sister-software/mailwoman" target="_blank" rel="noopener noreferrer" class="">Hugging Face bucket</a>.</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Neural classifier" term="Neural classifier"/>
        <category label="Model training" term="Model training"/>
        <category label="Infrastructure" term="Infrastructure"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Why Japanese addresses break Western parsers]]></title>
        <id>https://mailwoman.sister.software/blog/2026/05/28/japanese-address-hierarchy</id>
        <link href="https://mailwoman.sister.software/blog/2026/05/28/japanese-address-hierarchy"/>
        <updated>2026-05-28T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Japanese addresses run backwards: prefecture, then city, then block, then number. They have no street names. We just pulled 6,373 of them from Who's On First.]]></summary>
        <content type="html"><![CDATA[<p>In Tokyo, the address of Tokyo Tower is <code>〒105-0011 東京都港区芝公園4-2-8</code>.</p>
<p>In English: "4-2-8 Shibakōen, Minato City, Tokyo 105-0011".</p>
<p>The Japanese form runs <strong>largest unit first</strong>, the reverse of the English form. The prefecture (都道府県) comes first, then the city or ward (市区町村), then a district (丁目) and a block-number-style locator. There's no street name, just a grid.</p>
<p>This is why every rule-based address parser written for Western addresses breaks on Japan.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-hierarchy">The hierarchy<a href="https://mailwoman.sister.software/blog/2026/05/28/japanese-address-hierarchy#the-hierarchy" class="hash-link" aria-label="Direct link to The hierarchy" title="Direct link to The hierarchy" translate="no">​</a></h2>
<p>Who's On First ships Japan's admin hierarchy as one repo with 62,896 GeoJSON files. After pulling it into our unified SQLite, the placetype distribution looks like this:</p>
<table><thead><tr><th>Placetype (English)</th><th>Japanese</th><th>Count</th></tr></thead><tbody><tr><td>country</td><td>国</td><td>1</td></tr><tr><td>region (prefecture)</td><td>都道府県</td><td>47</td></tr><tr><td>county (city)</td><td>郡</td><td>2,287</td></tr><tr><td>locality (ward/town)</td><td>市区町村</td><td>43,886</td></tr><tr><td>neighbourhood (chome)</td><td>丁目</td><td>7,736</td></tr></tbody></table>
<p>47 prefectures. The whole country. Every chome (city block district) tagged with a name like <code>１丁目</code> (1-chome), <code>２丁目</code> (2-chome).</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="reversed-ordering">Reversed ordering<a href="https://mailwoman.sister.software/blog/2026/05/28/japanese-address-hierarchy#reversed-ordering" class="hash-link" aria-label="Direct link to Reversed ordering" title="Direct link to Reversed ordering" translate="no">​</a></h2>
<p>Western address: <code>[house_number] [street] [unit?], [locality], [region] [postcode]</code>.</p>
<p>Japanese address: <code>〒[postcode]? [region][locality][chome][block]-[sub-block]-[house_number]</code>.</p>
<p>The order matters for parsers because we use position as a feature. A model trained on "1600 Pennsylvania Avenue NW, Washington, DC 20500" expects digits at the start, region near the end. A Japanese address inverts this entirely. Walking the parent chain in the WOF database confirms the inversion:</p>
<div class="language-text codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-text codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">neighbourhood   jpn=１丁目      eng=１丁目</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">locality        jpn=世田谷区     eng=Setagaya</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">county          jpn=世田谷区     eng=Setagaya</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">region          jpn=東京        eng=Tokyo</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">country         jpn=日本        eng=Japan</span><br></div></code></pre></div></div>
<p>To synthesize a JP address you concatenate the parent chain from the top of the hierarchy down, leaving out the country and writing the shared county/locality name once: <code>東京 + 世田谷区 + １丁目 → 東京世田谷区１丁目</code>.</p>
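<p>A minimal sketch of that concatenation, assuming the chain arrives as leaf-first <code>(placetype, name)</code> pairs like the parent-chain listing above. The function name and input shape are hypothetical, not the adapter's actual interface.</p>

```python
def synthesize_jp(chain):
    """Join a leaf-first parent chain into a Japanese-order address string."""
    parts = []
    for placetype, name in reversed(chain):
        if placetype == "country":
            continue  # the country is not part of the synthesized string
        if parts and parts[-1] == name:
            continue  # county and locality can carry the same name
        parts.append(name)
    return "".join(parts)


chain = [
    ("neighbourhood", "１丁目"),
    ("locality", "世田谷区"),
    ("county", "世田谷区"),
    ("region", "東京"),
    ("country", "日本"),
]
address = synthesize_jp(chain)
```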
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="no-street-names">No street names<a href="https://mailwoman.sister.software/blog/2026/05/28/japanese-address-hierarchy#no-street-names" class="hash-link" aria-label="Direct link to No street names" title="Direct link to No street names" translate="no">​</a></h2>
<p>Western addresses identify locations by street + number. "1600 Pennsylvania Avenue NW" picks a specific building because Pennsylvania Avenue is a known line and 1600 is a known offset along that line.</p>
<p>Japan uses block addressing instead. Read <code>4-2-8</code> in <code>芝公園</code> as chome 4, block 2, building 8 within the 芝公園 district. There's no "芝公園 street" for the number to sit on; the grid is the addressing primitive, not the line.</p>
<p>Implications for the parser:</p>
<ul>
<li class=""><code>street_prefix</code> and <code>street_suffix</code> don't apply (no street).</li>
<li class=""><code>house_number</code> becomes a hyphenated triple: <code>4-2-8</code>.</li>
<li class="">The "丁目" suffix is a categorical marker, not a street type.</li>
</ul>
<p>For now we map chome to <code>dependent_locality</code> since it's the closest existing tag. A proper JP locale would introduce <code>block</code> and <code>sub_block</code> tags per the schema in <code>core/types/component.ts</code> (declared but unused until JP ships).</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="prefix-postcode">Prefix postcode<a href="https://mailwoman.sister.software/blog/2026/05/28/japanese-address-hierarchy#prefix-postcode" class="hash-link" aria-label="Direct link to Prefix postcode" title="Direct link to Prefix postcode" translate="no">​</a></h2>
<p>Japanese addresses prefix the postcode with <code>〒</code>, the postal mark. Format: <code>〒NNN-NNNN</code>. Examples:</p>
<ul>
<li class=""><code>〒100-0005</code> — Tokyo Marunouchi</li>
<li class=""><code>〒530-0001</code> — Osaka Umeda</li>
<li class=""><code>〒810-0001</code> — Fukuoka Tenjin</li>
</ul>
<p>A parser needs to read <code>〒</code> as a categorical marker: the postal mark that flags the following 7 digits + dash as a postcode. SentencePiece tokenizes <code>〒</code> as a separate piece. Our new v0.6.0-a0 multi-script tokenizer handles this cleanly (0% byte-fallback on the <code>〒</code> character).</p>
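<p>For comparison, this is roughly what a rule-based reading of the postal mark looks like. It is a sketch only: the pattern assumes ASCII digits, a mandatory dash, and a postcode at the very start of the string, and the helper name is hypothetical.</p>

```python
import re

# Optional postal mark, optional whitespace, then NNN-NNNN in ASCII digits.
JP_POSTCODE = re.compile(r"〒?\s*([0-9]{3})-([0-9]{4})")


def split_jp_postcode(address):
    """Return (postcode, remainder); postcode is None when no prefix matches."""
    match = JP_POSTCODE.match(address)
    if match is None:
        return None, address
    postcode = f"{match.group(1)}-{match.group(2)}"
    return postcode, address[match.end():].lstrip()
```

<p>Applied to <code>〒105-0011 東京都港区芝公園4-2-8</code>, this peels off <code>105-0011</code> and leaves the block-addressed remainder for the rest of the parser.</p>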
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-we-shipped-today">What we shipped today<a href="https://mailwoman.sister.software/blog/2026/05/28/japanese-address-hierarchy#what-we-shipped-today" class="hash-link" aria-label="Direct link to What we shipped today" title="Direct link to What we shipped today" translate="no">​</a></h2>
<p>The <code>wof-admin-jp</code> adapter prototype walks the WOF parent chain for every 丁目 in the Japanese repo and synthesizes a training row. Output:</p>
<div class="language-json codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-json codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"raw"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"東京港区芝公園"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"components"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">		</span><span class="token property" style="color:#36acaa">"region"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"東京"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">		</span><span class="token property" style="color:#36acaa">"locality"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"港区"</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">		</span><span class="token property" style="color:#36acaa">"dependent_locality"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"芝公園"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">		</span><span class="token property" style="color:#36acaa">"country"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"JP"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p><strong>6,373 rows from 47 prefectures and 269 localities</strong> — that's training data we didn't have yesterday. Top prefectures by row count:</p>
<table><thead><tr><th>Prefecture</th><th>Rows</th></tr></thead><tbody><tr><td>東京 (Tokyo)</td><td>2,251</td></tr><tr><td>神奈川 (Kanagawa)</td><td>888</td></tr><tr><td>大阪 (Osaka)</td><td>460</td></tr><tr><td>千葉 (Chiba)</td><td>380</td></tr><tr><td>埼玉 (Saitama)</td><td>263</td></tr></tbody></table>
<p>Tokyo dominates because of its density of named neighborhoods — every chome of every ward is tagged. Smaller prefectures have fewer registered neighborhoods.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="whats-still-missing">What's still missing<a href="https://mailwoman.sister.software/blog/2026/05/28/japanese-address-hierarchy#whats-still-missing" class="hash-link" aria-label="Direct link to What's still missing" title="Direct link to What's still missing" translate="no">​</a></h2>
<p>Real JP addresses include house numbers (<code>4-2-8</code>) which WOF doesn't track. To complete a Stage 3 JP corpus we need a separate source — the <a href="https://www.mlit.go.jp/" target="_blank" rel="noopener noreferrer" class="">MLIT national address database</a> or <a href="https://www.post.japanpost.jp/zipcode/download.html" target="_blank" rel="noopener noreferrer" class="">JapanPost postcode CSVs</a>. Both are public.</p>
<p>Once those land, the JP corpus becomes a 100K+ row source with full Stage 3 + Phase 6 tags (<code>block</code>, <code>sub_block</code>, <code>house_number</code>). v0.6.0 trains on US/FR. v0.7.0 could ship JP if the data pipeline holds.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="schema-readiness">Schema readiness<a href="https://mailwoman.sister.software/blog/2026/05/28/japanese-address-hierarchy#schema-readiness" class="hash-link" aria-label="Direct link to Schema readiness" title="Direct link to Schema readiness" translate="no">​</a></h2>
<p>The infrastructure is already in place. <code>core/types/component.ts</code> declares JP-specific Phase 6 tags:</p>
<div class="language-ts codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-ts codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic">// JP-specific (Phase 6 — declared but unused until then)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token string" style="color:#e3116c">"prefecture"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token string" style="color:#e3116c">"municipality"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token string" style="color:#e3116c">"district"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token string" style="color:#e3116c">"block"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token string" style="color:#e3116c">"sub_block"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token string" style="color:#e3116c">"building_number"</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token string" style="color:#e3116c">"building_name"</span><span class="token punctuation" style="color:#393A34">,</span><br></div></code></pre></div></div>
<p>The schema, formatting, runtime pipeline, and now the corpus prototype are ready. The blockers are: (1) the missing house-number data source, and (2) training time on a JP-aware recipe.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="where-rules-fail-and-learning-wins">Where rules fail and learning wins<a href="https://mailwoman.sister.software/blog/2026/05/28/japanese-address-hierarchy#where-rules-fail-and-learning-wins" class="hash-link" aria-label="Direct link to Where rules fail and learning wins" title="Direct link to Where rules fail and learning wins" translate="no">​</a></h2>
<p>Every address parser written for Western input fails on Japan in a specific, predictable way: it parses the prefecture as a country, then runs out of tokens. The locality and chome get lumped into a single span. The block-number triple gets parsed as a postcode or dropped entirely.</p>
<p>Mailwoman's transformer architecture is locale-agnostic at the BIO level. The same model can learn <code>region → locality → chome</code> ordering if it sees enough examples. The 6,373 rows we generated today are the first batch.</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Neural classifier" term="Neural classifier"/>
        <category label="Model training" term="Model training"/>
        <category label="Japan" term="Japan"/>
        <category label="Locale-specific" term="Locale-specific"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[PO Box Boîte Postale Apartado: Stage 3 ships with 6 new tags]]></title>
        <id>https://mailwoman.sister.software/blog/2026/05/28/stage-3-po-box</id>
        <link href="https://mailwoman.sister.software/blog/2026/05/28/stage-3-po-box"/>
        <updated>2026-05-28T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[v0.6.0 expands the schema from 10 to 16 component tags. PO box recognition: 0% → 51.9% in one training run.]]></summary>
        <content type="html"><![CDATA[<p>For its first six versions, Mailwoman emitted ten component tags (21 BIO labels). The model could pick <code>street</code> out of a row but not <code>street_prefix</code>, <code>street_suffix</code>, <code>unit</code>, or <code>po_box</code>. Real addresses are messier than that. The golden eval set has known failure cases: <code>6220 SE Salmon St, Portland, OR 97215</code> (Stage 2 collapses prefix+name+suffix), <code>123 Main St Apt 4B, Springfield, IL 62701</code> (loses the apartment), <code>PO Box 123, Burlington, VT 05401</code> (treats it as a malformed street).</p>
<p>v0.6.0 adds six tags: <code>street_prefix</code>, <code>street_suffix</code>, <code>unit</code>, <code>po_box</code>, <code>intersection_a</code>, <code>intersection_b</code>. The model is the same h384/6L/6H transformer. The recipe is the same v0.5.1 settings. The tokenizer is the same v0.6.0-a0 multi-script bundle. The only structural change is the output head: 21 BIO labels → 33.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-schema-was-already-there">The schema was already there<a href="https://mailwoman.sister.software/blog/2026/05/28/stage-3-po-box#the-schema-was-already-there" class="hash-link" aria-label="Direct link to The schema was already there" title="Direct link to The schema was already there" translate="no">​</a></h2>
<p><code>core/types/component.ts</code> has declared the canonical <code>ComponentTag</code> union since Phase 0, including all six new tags plus seven JP-specific ones (Phase 6). The schema was forward-declared. The runtime pipeline, the formatter, the golden eval, and even the rule classifiers (<code>StreetPrefixClassifier</code>, <code>StreetSuffixClassifier</code>) all knew about these tags. Only one constant was missing: the active training label set.</p>
<div class="language-python codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-python codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># corpus-python/src/mailwoman_train/labels.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Old:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">ACTIVE_TAGS</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Final</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">tuple</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token punctuation" style="color:#393A34">.</span><span class="token punctuation" style="color:#393A34">.</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> STAGE2_TAGS  </span><span class="token comment" style="color:#999988;font-style:italic"># 10 tags</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># New:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">ACTIVE_TAGS</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Final</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">tuple</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token punctuation" style="color:#393A34">.</span><span class="token punctuation" style="color:#393A34">.</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> STAGE3_TAGS  </span><span class="token comment" style="color:#999988;font-style:italic"># 16 tags</span><br></div></code></pre></div></div>
<p>The label IDs are stable: STAGE3 appends to STAGE2 without reordering. Old parquet shards work unchanged — they just don't emit the new tags. Models trained on STAGE2 IDs would still decode correctly against a STAGE3 classifier head; the new logit slots just never get picked.</p>
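<p>The arithmetic behind the head size and the ID stability can be sketched as follows. The Stage 2 tag names are placeholders (the ten real names are not listed here), and the label layout, <code>O</code> first followed by a <code>B-</code>/<code>I-</code> pair per tag, is an assumption rather than the layout in <code>labels.py</code>.</p>

```python
# Placeholder names standing in for the ten Stage 2 tags.
STAGE2_TAGS = tuple(f"stage2_tag_{i}" for i in range(10))
STAGE3_NEW = (
    "street_prefix", "street_suffix", "unit",
    "po_box", "intersection_a", "intersection_b",
)
STAGE3_TAGS = STAGE2_TAGS + STAGE3_NEW  # append only, never reorder


def bio_labels(tags):
    """Expand component tags into BIO labels: one O plus a B-/I- pair per tag."""
    labels = ["O"]
    for tag in tags:
        labels += [f"B-{tag}", f"I-{tag}"]
    return labels


stage2_labels = bio_labels(STAGE2_TAGS)  # 2 * 10 + 1 = 21 labels
stage3_labels = bio_labels(STAGE3_TAGS)  # 2 * 16 + 1 = 33 labels
```

<p>Under this layout the first 21 entries of the Stage 3 list are identical to the Stage 2 list, which is why shards encoded with the old IDs keep their meaning.</p>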
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="where-the-data-comes-from">Where the data comes from<a href="https://mailwoman.sister.software/blog/2026/05/28/stage-3-po-box#where-the-data-comes-from" class="hash-link" aria-label="Direct link to Where the data comes from" title="Direct link to Where the data comes from" translate="no">​</a></h2>
<p>For street decomposition, the data was already there too. Three existing adapters got Stage 3 enhancements:</p>
<ul>
<li class=""><strong>TIGER</strong> (<code>corpus/src/adapters/tiger/</code>) — <code>FULLNAME</code> like "SE Salmon St" gets decomposed via <code>decomposeStreet()</code>, which uses the curated libpostal/en directional + street-type dictionaries (same dictionaries that back the runtime <code>StreetPrefixClassifier</code>).</li>
<li class=""><strong>NAD</strong> (<code>corpus/src/adapters/usgov-nad/</code>) — NAD already has structured <code>St_PreDir</code>, <code>St_PreTyp</code>, <code>St_Name</code>, <code>St_PosTyp</code>, <code>St_PosDir</code> fields. The adapter now emits them as separate components instead of joining into one monolithic <code>street</code>. <code>Unit</code>/<code>Building</code>/<code>Floor</code>/<code>Room</code> chain into the new <code>unit</code> tag.</li>
<li class=""><strong>BAN</strong> (<code>corpus/src/adapters/ban/</code>) — French street types are leading words: "Rue de Rivoli", "Avenue des Champs-Élysées". <code>decomposeFrStreet()</code> uses libpostal/fr/street_types.txt to pick off the leading type word as <code>street_prefix</code>.</li>
</ul>
<p>These changes immediately give the model thousands of correctly-labeled Stage 3 examples per adapter without needing any new upstream data.</p>
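<p>A rough sketch of dictionary-driven street decomposition. The tiny word sets below are stand-ins for the libpostal dictionaries, the Python function names are hypothetical counterparts of <code>decomposeStreet()</code> and <code>decomposeFrStreet()</code>, and how the real adapters treat articles such as "de" is not something this example claims to know.</p>

```python
# Stand-in dictionaries; the real ones are far larger.
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}
STREET_TYPES = {"St", "Ave", "Blvd", "Rd", "Dr", "Ln"}
FR_STREET_TYPES = {"Rue", "Avenue", "Boulevard", "Place", "Impasse"}


def decompose_street(fullname):
    """Split a US-style street into optional prefix, name, and optional suffix."""
    tokens = fullname.split()
    parts = {}
    if len(tokens) > 1 and tokens[0] in DIRECTIONALS:
        parts["street_prefix"] = tokens.pop(0)
    if len(tokens) > 1 and tokens[-1] in STREET_TYPES:
        parts["street_suffix"] = tokens.pop()
    parts["street"] = " ".join(tokens)
    return parts


def decompose_fr_street(fullname):
    """Pick off a leading French street type as street_prefix; the rest stays in street."""
    tokens = fullname.split()
    parts = {}
    if len(tokens) > 1 and tokens[0] in FR_STREET_TYPES:
        parts["street_prefix"] = tokens.pop(0)
    parts["street"] = " ".join(tokens)
    return parts
```

<p>The <code>len(tokens) > 1</code> guards keep a one-word street such as "Broadway" or "Avenue" from being consumed entirely as a prefix or suffix.</p>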
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="po-box-the-synthesis-case">PO box: the synthesis case<a href="https://mailwoman.sister.software/blog/2026/05/28/stage-3-po-box#po-box-the-synthesis-case" class="hash-link" aria-label="Direct link to PO box: the synthesis case" title="Direct link to PO box: the synthesis case" translate="no">​</a></h2>
<p>PO boxes are different. No corpus adapter has explicit <code>po_box</code> data — TIGER is street segments, NAD has buildings, BAN is street-level addresses, WOF is the admin hierarchy. We need synthesis.</p>
<p>The good news: PO boxes are highly templated. USPS Pub 28 §28C2.040 and DMM 508 §4.1.4/§4.5.4 specify the allowed forms. Multi-locale extension is similarly bounded:</p>
<table><thead><tr><th>Locale</th><th>Leaders</th></tr></thead><tbody><tr><td>en-US</td><td>PO Box, P.O. Box, POB, Post Office Box, PMB, Box, #</td></tr><tr><td>en-CA</td><td>PO Box, P.O. Box, POB</td></tr><tr><td>en-GB</td><td>PO Box, P.O. Box, Post Office Box</td></tr><tr><td>en-AU</td><td>PO Box, GPO Box, Locked Bag</td></tr><tr><td>fr-FR</td><td>BP, B.P., Boîte Postale</td></tr><tr><td>fr-CA</td><td>CP, C.P., Case Postale, BP</td></tr><tr><td>es-ES</td><td>Apdo., Apartado, Apartado de Correos</td></tr><tr><td>es-MX</td><td>Apdo., Apartado Postal, AP</td></tr><tr><td>es-AR</td><td>Casilla, Casilla de Correo, CC</td></tr></tbody></table>
<p><code>corpus/src/synthesize-po-box.ts</code> ships these templates plus three design decisions from a DeepSeek consultation:</p>
<ol>
<li class=""><strong>PMB shares the <code>po_box</code> tag</strong>. USPS treats PMB as a PO Box alias in CASS processing; downstream code can distinguish via "is a street line also present?" without needing a separate label.</li>
<li class=""><strong>Whole-phrase spans</strong> ("PO Box 123" as one <code>po_box</code> span, not "123" alone). Matches the existing golden eval convention.</li>
<li class=""><strong>10% number-format noise</strong> (commas, dashes, embedded spaces). Real OCR'd input is lousy with "Box 1,234" and "PMB-200", so the parser sees those forms as native input during training.</li>
</ol>
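<p>The third decision can be sketched in a few lines. This is a minimal illustration, not the code in <code>corpus/src/synthesize-po-box.ts</code>: the function name <code>format_po_box</code>, the three noise styles, and the way each style is applied are assumptions chosen to reproduce the "Box 1,234" and "PMB-200" shapes mentioned above.</p>

```python
import random

# Hypothetical noise styles, named after the three kinds listed above.
NOISE_STYLES = ("comma", "dash", "space")


def format_po_box(leader: str, number: str, rng: random.Random,
                  noise_rate: float = 0.10) -> str:
    """Join a leader and a box number, adding format noise at `noise_rate`."""
    clean = f"{leader} {number}"
    if rng.random() >= noise_rate:
        return clean  # the common case: no noise
    style = rng.choice(NOISE_STYLES)
    if style == "comma" and len(number) > 3 and not number.startswith("0"):
        return f"{leader} {int(number):,}"      # "Box 1,234"
    if style == "dash":
        return f"{leader}-{number}"             # "PMB-200"
    if len(number) > 1:
        mid = len(number) // 2
        return f"{leader} {number[:mid]} {number[mid:]}"  # embedded space
    return clean  # single-digit numbers have nothing to perturb


rng = random.Random(0)
samples = [format_po_box("PO Box", "1234", rng) for _ in range(10)]
```

<p>Passing a seeded <code>random.Random</code> keeps the synthesized shard reproducible, which matters when a later run needs to be compared against this one.</p>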
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-pipeline">The pipeline<a href="https://mailwoman.sister.software/blog/2026/05/28/stage-3-po-box#the-pipeline" class="hash-link" aria-label="Direct link to The pipeline" title="Direct link to The pipeline" translate="no">​</a></h2>
<div class="language-text codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-text codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">WOF SQLite (1.29M places, 7 countries)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ↓  scripts/extract-tuples.py</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">50K (locality, region, postcode, country) tuples</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ↓  scripts/build-po-box-shard.mjs</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">50K LabeledRow JSONL with B-po_box/I-po_box spans</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ↓  scripts/jsonl-to-parquet.py</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">3 MB Parquet shard → Modal volume</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">v0.6.0 training (source_weight: 1.5)</span><br></div></code></pre></div></div>
<p>Sample output:</p>
<div class="language-text codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-text codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">P.O. Box 9, Bancroft, ID 83603</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  tokens: ['P', 'O', 'Box', '9', 'Bancroft', 'ID', '83603']</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  labels: ['B-po_box', 'I-po_box', 'I-po_box', 'I-po_box', 'B-locality', 'B-region', 'B-postcode']</span><br></div></code></pre></div></div>
<p>Four tokens get <code>po_box</code>: the whole "P.O. Box 9" phrase, with the <code>.</code> punctuation falling inside the span. The model learns the span shape, the leader vocabulary, and the locale-to-template mapping all at once.</p>
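<p>For readers who want to see how whole-phrase spans turn into BIO labels, here is a minimal sketch that reproduces the sample above. The word-level <code>tokenize</code> helper and the <code>bio_labels</code> function are illustrative assumptions; the real shard builder and the model's tokenizer may split text differently.</p>

```python
import re


def tokenize(text: str) -> list:
    # Assumption: split on anything that is not a letter or digit,
    # so "P.O." becomes ['P', 'O'] as in the sample output.
    return re.findall(r"[^\W_]+", text)


def bio_labels(components: list) -> tuple:
    """Turn (tag, text) components into parallel token and label lists."""
    tokens, labels = [], []
    for tag, text in components:
        for i, token in enumerate(tokenize(text)):
            tokens.append(token)
            # First token of a span opens it with B-, the rest continue with I-.
            labels.append(("B-" if i == 0 else "I-") + tag)
    return tokens, labels


tokens, labels = bio_labels([
    ("po_box", "P.O. Box 9"),
    ("locality", "Bancroft"),
    ("region", "ID"),
    ("postcode", "83603"),
])
```

<p>Because the whole phrase is one component, the leader and the number always share a span; labeling "9" alone would need a second component and a second convention.</p>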
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="golden-eval-expansion">Golden eval expansion<a href="https://mailwoman.sister.software/blog/2026/05/28/stage-3-po-box#golden-eval-expansion" class="hash-link" aria-label="Direct link to Golden eval expansion" title="Direct link to Golden eval expansion" translate="no">​</a></h2>
<p>Test data matters as much as training data. The golden v0.1.2 set had 1 PO box entry — not enough to fail meaningfully, let alone measure progress. We added 26:</p>
<ul>
<li class="">20 US variants across all leader forms (PO Box, P.O. Box, P. O. Box, POB, POBOX, Post Office Box, Box, P.O.Box) and number ranges (single-digit to 7-digit)</li>
<li class="">3 PMB variants ("100 Main St PMB 200", "1234 Wilshire Blvd #500")</li>
<li class="">6 FR/CA variants (BP, B.P., Boîte Postale, Case Postale, CP)</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="results">Results<a href="https://mailwoman.sister.software/blog/2026/05/28/stage-3-po-box#results" class="hash-link" aria-label="Direct link to Results" title="Direct link to Results" translate="no">​</a></h2>
<p>v0.6.0 trained 100K steps on a Modal A100 with CE-only loss. We set <code>crf_loss_weight: 0</code> after two attempts with CRF training enabled ended in NaN; the 33×33 transition table combined with bf16 appears to be numerically unstable. The inference-time CRF is still active via the structural mask, and v0.6.1 will investigate.</p>
<p>Demo presets: 11/11 parse (6 canonical addresses + 5 Stage 3 variants).</p>
<p>Per-tag golden eval (4,561 entries):</p>
<table><thead><tr><th>Tag</th><th>v0.5.4 recall</th><th>v0.6.0 recall</th></tr></thead><tbody><tr><td>postcode</td><td>75.7%</td><td>76.0%</td></tr><tr><td>house_number</td><td>78.7%</td><td>79.0%</td></tr><tr><td>region</td><td>65.0%</td><td>65.0%</td></tr><tr><td>locality</td><td>39.4%</td><td>39.7%</td></tr><tr><td>street</td><td>28.0%</td><td>27.9%</td></tr><tr><td>venue</td><td>29.4%</td><td>29.2%</td></tr><tr><td><strong>po_box</strong></td><td><strong>0.0%</strong></td><td><strong>51.9%</strong></td></tr><tr><td>street_prefix</td><td>0.0%</td><td>0.0%</td></tr><tr><td>street_suffix</td><td>0.0%</td><td>0.0%</td></tr><tr><td>unit</td><td>0.0%</td><td>0.0%</td></tr><tr><td>intersection_a/b</td><td>0.0%</td><td>0.0%</td></tr></tbody></table>
<p>PO box recognition went from impossible to functional in one training run. Sample:</p>
<div class="language-text codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-text codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">"PO Box 123, Burlington, VT 05401"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">→ { region: "VT", locality: "Burlington",</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    po_box: "PO Box 123", postcode: "05401" }</span><br></div></code></pre></div></div>
<p>Stage 2 metrics held flat: the new tags extended the schema without displacing the old ones.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="whats-deferred">What's deferred<a href="https://mailwoman.sister.software/blog/2026/05/28/stage-3-po-box#whats-deferred" class="hash-link" aria-label="Direct link to What's deferred" title="Direct link to What's deferred" translate="no">​</a></h2>
<p>The other Stage 3 tags (street_prefix, street_suffix, unit, intersection) stayed at 0% recall because the TIGER/NAD/BAN adapter changes that emit them haven't been baked into a corpus rebuild yet. The training data still has monolithic <code>street</code> spans like "SE Salmon St" instead of decomposed <code>street_prefix: "SE", street: "Salmon", street_suffix: "St"</code>. v0.6.1 needs a fresh corpus build to surface those.</p>
<p>CRF learned transitions are also deferred. Two attempts (<code>crf_loss_weight: 0.5</code> then <code>0.1</code>) both diverged to NaN post-warmup. The hypothesis: bf16 plus the larger transition table (33×33 vs 21×21, about 2.5× the entries) is numerically unstable. v0.6.1 will try fp32 precision for the CRF parameters specifically, or a gradient-clipped warmup-only schedule.</p>
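<p>The structural mask that stays active at inference time needs no training, which is why it survives the CE-only fallback. A minimal sketch of the idea, assuming a plain BIO scheme where an <code>I-</code> label may only follow a <code>B-</code> or <code>I-</code> label of the same tag (the function name and the five-label example are ours, not the classifier's actual code):</p>

```python
NEG_INF = float("-inf")


def structural_mask(labels: list) -> list:
    """mask[i][j] is 0.0 if labels[j] may follow labels[i], else -inf."""
    def allowed(prev: str, nxt: str) -> bool:
        if not nxt.startswith("I-"):
            return True  # "O" and any "B-*" may follow anything
        tag = nxt[2:]
        # An inside label must continue a span of the same tag.
        return prev in (f"B-{tag}", f"I-{tag}")

    return [[0.0 if allowed(p, n) else NEG_INF for n in labels] for p in labels]


labels = ["O", "B-po_box", "I-po_box", "B-locality", "I-locality"]
mask = structural_mask(labels)
```

<p>Added to the transition scores before Viterbi decoding, the <code>-inf</code> entries rule out sequences such as <code>O → I-po_box</code>. Learned transitions, once stable, would be summed on top of this mask rather than replace it.</p>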
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-this-proves">What this proves<a href="https://mailwoman.sister.software/blog/2026/05/28/stage-3-po-box#what-this-proves" class="hash-link" aria-label="Direct link to What this proves" title="Direct link to What this proves" translate="no">​</a></h2>
<p>The pattern works. A new tag in the canonical schema + a focused synthesis source + a one-line corpus config change + 100K training steps = working tag recognition. Total elapsed time tonight: ~6 hours from "no PO box training data exists" to a 28 MB model that hits PO box correctly more than half the time on a hostile eval set.</p>
<p>The same recipe scales to street decomposition, intersection, unit, and the JP-specific Phase 6 tags. The schema is already declared. Each new tag is the same shape of work as PO box was tonight.</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Neural classifier" term="Neural classifier"/>
        <category label="Model training" term="Model training"/>
        <category label="Stage 3 (extended schema)" term="Stage 3 (extended schema)"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[FST gazetteer ships to the browser]]></title>
        <id>https://mailwoman.sister.software/blog/2026/05/27/fst-ships-to-browser</id>
        <link href="https://mailwoman.sister.software/blog/2026/05/27/fst-ships-to-browser"/>
        <updated>2026-05-27T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[The v0.5.3 model now runs with a 9 MB finite-state transducer in the browser — 94K US admin places as an emission prior for the neural classifier.]]></summary>
        <content type="html"><![CDATA[<p>The <code>/demo</code> page now loads a 9 MB FST (finite-state transducer) gazetteer alongside the 29 MB ONNX model. 94,000 US admin places with Wikipedia importance scores feed directly into the neural classifier's Viterbi decoder as emission priors — the same pipeline that runs server-side now runs entirely in the browser.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-changed">What changed<a href="https://mailwoman.sister.software/blog/2026/05/27/fst-ships-to-browser#what-changed" class="hash-link" aria-label="Direct link to What changed" title="Direct link to What changed" translate="no">​</a></h2>
<p>The FST binary encodes every US admin place name from Who's On First as a trie: <code>"new york"</code> walks to a state with 7 interpretations (NYC locality, NY state region, New York County, etc.). At query time, the classifier receives additive logit biases proportional to each place's Wikipedia importance — Washington DC (importance 0.815) correctly outranks Washington state (0.764).</p>
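<p>As a rough mental model, the lookup behaves like the character trie below. This is a sketch under stated assumptions, not the FST binary format: the <code>TrieNode</code> class, the tag assigned to each Washington entry, and the <code>scale</code> factor on the bias are all illustrative.</p>

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)  # next character -> TrieNode
    places: list = field(default_factory=list)    # (tag, importance) pairs ending here


def insert(root: TrieNode, name: str, tag: str, importance: float) -> None:
    node = root
    for ch in name.lower():
        node = node.children.setdefault(ch, TrieNode())
    node.places.append((tag, importance))


def lookup(root: TrieNode, phrase: str) -> list:
    node = root
    for ch in phrase.lower():
        node = node.children.get(ch)
        if node is None:
            return []  # phrase is not a known place name
    return node.places


def emission_biases(places: list, scale: float = 1.0) -> dict:
    """Additive logit bias per tag, proportional to the best importance seen."""
    biases: dict = {}
    for tag, importance in places:
        biases[tag] = max(biases.get(tag, 0.0), scale * importance)
    return biases


root = TrieNode()
insert(root, "Washington", "locality", 0.815)  # Washington DC (tag is an assumption)
insert(root, "Washington", "region", 0.764)    # Washington state
biases = emission_biases(lookup(root, "washington"))
```

<p>With these two entries the locality reading gets the larger bias, which is the DC-over-state ordering described above. The decoder still weighs the bias against the model's own logits, so context can overrule it.</p>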
<p>The browser integration required a new deserializer (<code>fst-deserialize-web.ts</code>) that uses <code>DataView</code> + <code>TextDecoder</code> instead of Node's <code>Buffer</code>. Same binary format, zero Node dependencies. The FST loads in parallel with the ONNX model — no added latency on the critical path.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-tokenizer-incident">The tokenizer incident<a href="https://mailwoman.sister.software/blog/2026/05/27/fst-ships-to-browser#the-tokenizer-incident" class="hash-link" aria-label="Direct link to The tokenizer incident" title="Direct link to The tokenizer incident" translate="no">​</a></h2>
<p>While wiring the FST, we discovered the live demo was serving the <strong>wrong tokenizer</strong>. The v0.5.3 model (48K vocab, 29 MB) was paired with the old v0.1.0 tokenizer (24K vocab, 474 KB). This produced garbage output — every span labeled as locality with sub-0.5 confidence. Nobody noticed because the demo was "working" (it showed results), just badly.</p>
<p>The root cause: <code>docs/static/mailwoman/</code> was manually managed. Model and tokenizer were copied independently, and the tokenizer copy was missed during the v0.5.3 update.</p>
<p>The fix is a Docusaurus plugin (<code>docs/plugins/demo-assets/</code>) that stages all binary assets from the <code>neural-weights-en-us</code> package at build time. Model card version is the source of truth. The tokenizer/model mismatch can't recur because both come from the same source.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-we-fixed-along-the-way">What we fixed along the way<a href="https://mailwoman.sister.software/blog/2026/05/27/fst-ships-to-browser#what-we-fixed-along-the-way" class="hash-link" aria-label="Direct link to What we fixed along the way" title="Direct link to What we fixed along the way" translate="no">​</a></h2>
<p>The night shift addressed every recommendation from the <a class="" href="https://mailwoman.sister.software/docs/evals/2026-05-27-v0.5.3-diagnostic-training-review">v0.5.3 training review</a>:</p>
<ul>
<li class=""><strong>Per-tag F1 in training CSV.</strong> The macro F1 comparison that caused hours of wrong analysis in the v0.5.3 session (0.579 vs 0.638 across different tokenizers) is now impossible — per-tag breakdown logged at every eval step.</li>
<li class=""><strong>Grouper-audit fix.</strong> The audit was checking only top-level tree roots for coverage, missing nested children in containment trees. "400 Broad St, Seattle, WA 98109" was getting <code>locality=Broad</code> injected because the audit didn't see <code>street=Broad St</code> nested inside <code>locality=Seattle</code>.</li>
<li class=""><strong>Phrase grouper hardening.</strong> "Pennsylvania" was proposed as <code>LOCALITY_PHRASE</code> on "1600 Pennsylvania Ave NW" because any capitalized word matched. Now penalized -0.20 when the word is a US state name in a non-tail position. "Paris, Texas" is preserved (tail position).</li>
<li class=""><strong>CRF transition export pipeline.</strong> The Python training side can now export learned CRF transition scores to <code>crf-transitions.json</code>. The TypeScript classifier loads and composes them with the structural BIO mask. No model has been trained with learned transitions yet; v0.5.4 will be the first to use this.</li>
</ul>
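<p>The phrase grouper rule is small enough to sketch. The snippet below assumes a candidate is scored per token and that "tail position" means the last token of the input; the state list is truncated and the function name is hypothetical. Only the 0.20 penalty comes from the change described above.</p>

```python
# Truncated for illustration; a real list would cover every US state name.
US_STATE_NAMES = {"pennsylvania", "texas", "washington"}
STATE_PENALTY = 0.20


def locality_phrase_score(tokens: list, index: int, base_score: float) -> float:
    """Score tokens[index] as a LOCALITY_PHRASE candidate."""
    is_state = tokens[index].lower() in US_STATE_NAMES
    is_tail = index == len(tokens) - 1
    if is_state and not is_tail:
        # A state name in the middle of an address is more likely part of a street.
        return base_score - STATE_PENALTY
    return base_score


street_case = locality_phrase_score(["1600", "Pennsylvania", "Ave", "NW"], 1, 0.5)
tail_case = locality_phrase_score(["Paris", "Texas"], 1, 0.5)
```

<p>The penalty lowers the candidate's score instead of vetoing it, so a strong signal elsewhere in the pipeline can still win.</p>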
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="browser-verification">Browser verification<a href="https://mailwoman.sister.software/blog/2026/05/27/fst-ships-to-browser#browser-verification" class="hash-link" aria-label="Direct link to Browser verification" title="Direct link to Browser verification" translate="no">​</a></h2>
<p>Playwright headless test against the live site:</p>
<div class="language-text codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-text codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">400 Broad St, Seattle, WA 98109</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  house_number: "400"    (0.97)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  street:       "Broad St"   (0.98)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  locality:     "Seattle"    (0.98)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  region:       "WA"         (0.98)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  postcode:     "98109"      (0.96)</span><br></div></code></pre></div></div>
<p>6/6 demo presets correct, zero grouper-audit nodes. The model works.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="try-it">Try it<a href="https://mailwoman.sister.software/blog/2026/05/27/fst-ships-to-browser#try-it" class="hash-link" aria-label="Direct link to Try it" title="Direct link to Try it" translate="no">​</a></h2>
<p><a href="https://mailwoman.sister.software/demo/" target="_blank" rel="noopener noreferrer" class="">mailwoman.sister.software/demo</a> — type any US address. The neural classifier, FST gazetteer, and WOF locality resolver all run in your browser. No server round-trips after the initial ~75 MB asset load.</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Neural classifier" term="Neural classifier"/>
        <category label="Browser" term="Browser"/>
        <category label="FST" term="FST"/>
        <category label="Demo" term="Demo"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Our model worked in CI but broke on every real device]]></title>
        <id>https://mailwoman.sister.software/blog/2026/05/27/webgpu-safari-bug</id>
        <link href="https://mailwoman.sister.software/blog/2026/05/27/webgpu-safari-bug"/>
        <updated>2026-05-27T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A one-line import change fixed a WebGPU inference bug that took hours to diagnose. The root cause: onnxruntime-web ships two WebGPU backends, and the default one is broken.]]></summary>
        <content type="html"><![CDATA[<p>We shipped a browser-based address parser that runs a 29 MB ONNX model entirely client-side. The Playwright tests showed perfect results. Chrome desktop looked great. Then someone opened it on an iPhone.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-we-saw">What we saw<a href="https://mailwoman.sister.software/blog/2026/05/27/webgpu-safari-bug#what-we-saw" class="hash-link" aria-label="Direct link to What we saw" title="Direct link to What we saw" translate="no">​</a></h2>
<p>Every address component was classified as "locality" with 0.2–0.4 confidence. "400 Broad St, Seattle, WA 98109" became three locality spans with no street, no region, no postcode. The model was producing near-uniform logits — as if it hadn't been trained at all.</p>
<p>Toggling to the WASM backend in our debug UI produced perfect results immediately. Same model bytes, same tokenizer, same input. The GPU path was the problem.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-wrong-hypotheses">The wrong hypotheses<a href="https://mailwoman.sister.software/blog/2026/05/27/webgpu-safari-bug#the-wrong-hypotheses" class="hash-link" aria-label="Direct link to The wrong hypotheses" title="Direct link to The wrong hypotheses" translate="no">​</a></h2>
<p>We burned hours on each of these before finding the real cause:</p>
<p><strong>Stale browser cache.</strong> We'd recently updated the model from 25 MB (old tokenizer) to 29 MB (new tokenizer). The old model with the wrong tokenizer produces exactly this symptom — garbage output. We added cache-busting query params, migrated assets to a CDN, and verified file sizes. The files were correct.</p>
<p><strong>Tokenizer mismatch.</strong> The v0.5.3 model uses a 48K-vocab tokenizer but an older 24K-vocab tokenizer was briefly deployed. We verified hashes. The tokenizer was correct.</p>
<p><strong>Model version drift.</strong> We have four model versions on the CDN. Maybe the wrong one was being loaded. We added a version selector to the demo page and confirmed v0.5.3 was selected. The model was correct.</p>
<p><strong>Browser-specific WASM numerics.</strong> Maybe Safari's WASM implementation handles int8 quantization differently. We tested WASM on Safari — it worked perfectly. The problem was WebGPU-specific.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="why-playwright-couldnt-catch-it">Why Playwright couldn't catch it<a href="https://mailwoman.sister.software/blog/2026/05/27/webgpu-safari-bug#why-playwright-couldnt-catch-it" class="hash-link" aria-label="Direct link to Why Playwright couldn't catch it" title="Direct link to Why Playwright couldn't catch it" translate="no">​</a></h2>
<p>Every automated test we ran passed. The reason: headless Chromium does not have a WebGPU adapter. When you request <code>executionProviders: ["webgpu", "wasm"]</code>, the runtime silently falls back to WASM. WASM handles int8 correctly, so the test passes.</p>
<p>We had a <code>verify</code> skill that launched a real headless browser, navigated to the live demo, typed an address, and checked the parse output. It ran after every deployment. It passed every time. And it was useless for catching this bug, because it could never exercise the code path that was broken.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-real-cause">The real cause<a href="https://mailwoman.sister.software/blog/2026/05/27/webgpu-safari-bug#the-real-cause" class="hash-link" aria-label="Direct link to The real cause" title="Direct link to The real cause" translate="no">​</a></h2>
<p>onnxruntime-web ships two WebGPU execution providers in the same npm package:</p>
<div class="language-text codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-text codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">onnxruntime-web          → ort.bundle.min.mjs     → JSEP (old, broken)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">onnxruntime-web/webgpu   → ort.webgpu.bundle.min.mjs → Native EP (fixed)</span><br></div></code></pre></div></div>
<p>The JSEP (JavaScript-based execution provider) has a <a href="https://github.com/microsoft/onnxruntime/issues/25227" target="_blank" rel="noopener noreferrer" class="">slice kernel bug</a> that produces incorrect results when reversing a tensor on a specific axis. This corrupts the dequantization of int8 weights. The bug is worse on Safari's Metal backend than Chrome's Dawn/Vulkan backend — Chrome happened to mask it in our case.</p>
<p>The native WebGPU EP handles the same operations correctly on all backends.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-fix">The fix<a href="https://mailwoman.sister.software/blog/2026/05/27/webgpu-safari-bug#the-fix" class="hash-link" aria-label="Direct link to The fix" title="Direct link to The fix" translate="no">​</a></h2>
<div class="language-diff codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-diff codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">- import * as ort from "onnxruntime-web"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">+ import * as ort from "onnxruntime-web/webgpu"</span><br></div></code></pre></div></div>
<p>One line. The API is identical. Session creation, tensor I/O, and provider fallback all work the same way. The native bundle is also smaller (113 KB vs 405 KB).</p>
<p>After this change, the model produces correct results on Chrome, Safari macOS, and iOS Safari — all via WebGPU, no WASM fallback needed.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-we-should-have-done-differently">What we should have done differently<a href="https://mailwoman.sister.software/blog/2026/05/27/webgpu-safari-bug#what-we-should-have-done-differently" class="hash-link" aria-label="Direct link to What we should have done differently" title="Direct link to What we should have done differently" translate="no">​</a></h2>
<p>The diagnostic path that would have saved hours:</p>
<ol>
<li class="">Force WASM. If results become correct, the problem is GPU-side.</li>
<li class="">Check which execution provider is actually active. We didn't have this instrumentation — we've since added a backend indicator to the demo page.</li>
<li class="">Check the import path. <code>grep "onnxruntime-web"</code> in your source. If you're importing the bare package, you're on the JSEP.</li>
<li class="">Test on Safari. If it fails on Safari but works on Chrome, the JSEP is the prime suspect.</li>
</ol>
<p>The deeper lesson: test infrastructure that lacks GPU access will never catch GPU-specific bugs. Headless browsers are not real browsers when it comes to hardware acceleration. If your product runs on GPUs, you need at least one test that exercises the GPU path on a device that has one.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="references">References<a href="https://mailwoman.sister.software/blog/2026/05/27/webgpu-safari-bug#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<ul>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/understanding/onnxruntime-web-webgpu-gotcha">Technical reference: the two WebGPU providers</a></li>
<li class=""><a href="https://github.com/microsoft/onnxruntime/issues/25227" target="_blank" rel="noopener noreferrer" class="">microsoft/onnxruntime#25227</a></li>
<li class=""><a href="https://github.com/huggingface/transformers.js/issues/1512" target="_blank" rel="noopener noreferrer" class="">huggingface/transformers.js#1512</a></li>
</ul>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Neural classifier" term="Neural classifier"/>
        <category label="Browser" term="Browser"/>
        <category label="Infrastructure" term="Infrastructure"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Night Shift 2 — from thermal hangs to a shipped model in one session]]></title>
        <id>https://mailwoman.sister.software/blog/night-shift-2-model-ships</id>
        <link href="https://mailwoman.sister.software/blog/night-shift-2-model-ships"/>
        <updated>2026-05-25T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[The second night shift ran from roughly 2am to 2pm UTC on May 25th, 2026. It started with a GPU that wouldn't stop crashing and ended with a trained model, an ONNX export, and a full evaluation report. This is the story of how infrastructure choices turned a hardware problem into a non-issue.]]></summary>
        <content type="html"><![CDATA[<p>The second night shift ran from roughly 2am to 2pm UTC on May 25th, 2026. It started with a GPU that wouldn't stop crashing and ended with a trained model, an ONNX export, and a full evaluation report. This is the story of how infrastructure choices turned a hardware problem into a non-issue.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-hardware-wall">The hardware wall<a href="https://mailwoman.sister.software/blog/night-shift-2-model-ships#the-hardware-wall" class="hash-link" aria-label="Direct link to The hardware wall" title="Direct link to The hardware wall" translate="no">​</a></h2>
<p>The lab runs on a small form factor desktop with an AMD Radeon 780M integrated GPU. For short bursts (a 2-minute smoke test, a 10-minute diagnostic probe), it works fine. For sustained multi-hour training at 98% GPU utilization, it overheats. The firmware detects thermal stress and resets the GPU, killing whatever process was running on it.</p>
<p>During this session, the GPU hit 22 resets before we stopped counting. Every 60-90 minutes of training, the hardware would fault. A watchdog script would wait 15 minutes for the chassis to cool, then restart from the last checkpoint. Net progress: about 8,800 training steps out of a target 50,000.</p>
<p>At that rate — 90 minutes of compute, 15 minutes of cooldown, 500 steps lost per restart — the full training run would take roughly 38 hours of wall-clock time. That's fine for a research prototype, but it's not a productive use of a night shift.</p>
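<p>The estimate is easy to reproduce. The sketch below uses the numbers from this post (the 0.56 steps per second figure is the local iGPU throughput quoted further down) and treats the crash interval as the only free variable; the function name and the simple "every cycle looks the same" model are our assumptions.</p>

```python
def wall_clock_hours(target_steps: int, steps_per_sec: float, run_min: float,
                     cooldown_min: float, steps_lost: int) -> float:
    """Hours to reach target_steps when every run ends in a crash and a cooldown."""
    net_steps = steps_per_sec * run_min * 60 - steps_lost  # progress kept per cycle
    cycles = target_steps / net_steps
    return cycles * (run_min + cooldown_min) / 60


# Crash every 90 minutes (best case) versus every 60 minutes (worst case).
best_case = wall_clock_hours(50_000, 0.56, 90, 15, 500)
worst_case = wall_clock_hours(50_000, 0.56, 60, 15, 500)
```

<p>Under these assumptions the two cases come out at roughly 35 and 41 hours, which brackets the 38-hour figure above.</p>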
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-pivot-to-modal">The pivot to Modal<a href="https://mailwoman.sister.software/blog/night-shift-2-model-ships#the-pivot-to-modal" class="hash-link" aria-label="Direct link to The pivot to Modal" title="Direct link to The pivot to Modal" translate="no">​</a></h2>
<p><a href="https://modal.com/" target="_blank" rel="noopener noreferrer" class="">Modal</a> is a cloud compute platform where you write a Python function, decorate it with <code>@app.function(gpu="A100")</code>, and it runs in a datacenter with a proper GPU. No SSH, no Docker, no instance management.</p>
<p>The pivot took about an hour:</p>
<ol>
<li class="">
<p><strong>Upload the corpus to Cloudflare R2</strong> — 30 GB of training data, synced via rclone. This took about 15 minutes (the data was already on a fast local drive; the upload was bandwidth-limited but not painfully so).</p>
</li>
<li class="">
<p><strong>Write a Modal wrapper</strong> — 20 lines around the existing training script. The wrapper pulls the corpus from R2 into a Modal Volume (a persistent disk), runs the train, writes checkpoints back.</p>
</li>
<li class="">
<p><strong>Debug three small issues</strong> — the Modal worker needed the R2 credentials passed as secrets (first attempt used empty env vars), the training config wasn't on the volume yet, and the ONNX export needed <code>onnxscript</code> added to the image.</p>
</li>
<li class="">
<p><strong>Run the training</strong> — 50,000 steps on an NVIDIA A100-SXM4-40GB in 2 hours. No hangs, no resets, no watchdog. Just clean, uninterrupted compute at 6.9 steps per second (vs 0.56 on the local iGPU).</p>
</li>
</ol>
<p>Total cost: about $5, covered entirely by Modal's $30/month free credits for new accounts.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-results">The results<a href="https://mailwoman.sister.software/blog/night-shift-2-model-ships#the-results" class="hash-link" aria-label="Direct link to The results" title="Direct link to The results" translate="no">​</a></h2>
<p>The CE-only model (which drops the problematic CRF loss term that caused nine previous runs to diverge) trained to completion:</p>
<ul>
<li class=""><strong>val_macro_f1: 0.605</strong> (final), 0.621 (peak at step 35K)</li>
<li class=""><strong>Train loss: 0.068</strong> (final)</li>
<li class=""><strong>Zero divergence</strong> across all 50,000 steps</li>
<li class=""><strong>ONNX export: 66 MB</strong> (full-precision training artifact; the shipped weights are quantized to ~25 MB for the npm package, and smaller still for the browser demo)</li>
</ul>
<p>For context: v0.4.0 shipped at macro_f1 = 0.36. This is a 68% relative improvement on the same evaluation set.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-eval-matrix">The eval matrix<a href="https://mailwoman.sister.software/blog/night-shift-2-model-ships#the-eval-matrix" class="hash-link" aria-label="Direct link to The eval matrix" title="Direct link to The eval matrix" translate="no">​</a></h2>
<p>After the model shipped, we ran the full product-level evaluation — four pipeline modes compared on 4,535 hand-curated golden addresses:</p>
<table><thead><tr><th>Mode</th><th>Exact Match</th><th>Macro F1</th><th>Empty Parse</th><th>Overconf Wrong</th></tr></thead><tbody><tr><td>Rule-only</td><td><strong>30.8%</strong></td><td>22.0%</td><td>6.3%</td><td>2.4%</td></tr><tr><td>Neural</td><td>0.1%</td><td>7.3%</td><td>0.3%</td><td><strong>54.5%</strong></td></tr><tr><td>Hybrid</td><td>0.1%</td><td>7.3%</td><td>0.3%</td><td>54.5%</td></tr><tr><td>Hybrid-joint (reconciler)</td><td>6.0%</td><td>16.6%</td><td><strong>0.0%</strong></td><td><strong>0.1%</strong></td></tr></tbody></table>
<p>A few things jump out:</p>
<p><strong>The neural model hallucinates components it shouldn't.</strong> On the golden set, it invented a <code>dependent_locality</code> — a sub-city neighborhood — 956 times where none existed. Two explanations look tempting, and both lead nowhere. Calibration? These predictions come out at high confidence; the model commits hard to the wrong answer. Decoding? Viterbi with the structural mask is already running. What's left is training: cross-entropy treats every mislabeling equally, so the model never learned that <code>dependent_locality</code> is rare and should be emitted sparingly. Class-weighted CE — which was blocked in v0.4.0 because it destabilized the dual-loss training — puts a thumb on the scale: mislabeling a rare tag costs more. Now that CE-only training is proven stable, this lever is unlocked.</p>
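<p>To make the class-weighting idea concrete, here is a minimal sketch. The inverse-frequency weighting scheme and both function names are assumptions for illustration; the weights used in a future training run may be chosen differently.</p>

```python
import math
from collections import Counter


def inverse_frequency_weights(labels: list) -> dict:
    """Weight each tag by total / (n_classes * count), so rare tags weigh more."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {tag: total / (n_classes * count) for tag, count in counts.items()}


def weighted_cross_entropy(probs: list, gold: list, weights: dict) -> float:
    """Mean of -w[gold] * log p(gold), normalized by the total weight."""
    weighted = [weights[g] * -math.log(p[g]) for p, g in zip(probs, gold)]
    return sum(weighted) / sum(weights[g] for g in gold)


gold = ["locality"] * 9 + ["dependent_locality"]
weights = inverse_frequency_weights(gold)
uniform = [{"locality": 0.5, "dependent_locality": 0.5}] * len(gold)
loss = weighted_cross_entropy(uniform, gold, weights)
```

<p>In this toy set the single <code>dependent_locality</code> example carries nine times the weight of each <code>locality</code> example, so an error on the rare tag moves the loss as much as nine errors on the common one.</p>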
<p><strong>Hybrid mode shows identical numbers to neural alone.</strong> The hybrid mode fuses rule classifications with neural output, but in this iteration the raw neural decoder's overconfidence drowns out the rules, hence the identical numbers. The reconciler (hybrid-joint) is the mode that actually disciplines the merge.</p>
<p><strong>The reconciler fixes the honesty problem.</strong> It drops overconfident-wrong from 54.5% to 0.1% by checking whether parsed components form a coherent real-world hierarchy. It also eliminates empty parses entirely (0.0% vs rules' 6.3%): it always produces something, even if conservative.</p>
<p><strong>The rules are a ceiling. The neural model is a ramp.</strong> Rule-only at 30.8% exact match is a mature system, hand-tuned over years. Each additional percentage point costs engineering time. The neural model at 6.0% (hybrid-joint) after one stable training run is learning from data, which means each new training run can improve across every component and every locale simultaneously. The 68% improvement from v0.4.0 to v0.5.0 is the trend that matters — and the ramp just proved it can climb.</p>
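<p>To make the class-weighting lever concrete, here is a minimal sketch in plain Python. It assumes inverse-frequency weights, which is one common choice and not necessarily the weighting the project uses; the function names and the toy label counts are hypothetical.</p>

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight = total / (num_classes * count), so rare tags weigh more."""
    counts = Counter(labels)
    total = sum(counts.values())
    num_classes = len(counts)
    return {tag: total / (num_classes * count) for tag, count in counts.items()}


def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -log p(true tag); each token is scaled by its true tag's weight."""
    numerator = 0.0
    denominator = 0.0
    for token_probs, true_tag in zip(probs, targets):
        weight = weights[true_tag]
        numerator += -weight * math.log(token_probs[true_tag])
        denominator += weight
    return numerator / denominator


# Toy corpus: 8 common tags for every 2 rare ones (made-up numbers).
train_labels = ["locality"] * 8 + ["dependent_locality"] * 2
weights = inverse_frequency_weights(train_labels)

# One confident correct prediction on a common tag, one miss on a rare tag.
probs = [
    {"locality": 0.9, "dependent_locality": 0.1},
    {"locality": 0.9, "dependent_locality": 0.1},
]
targets = ["locality", "dependent_locality"]

uniform = {"locality": 1.0, "dependent_locality": 1.0}
plain_loss = weighted_cross_entropy(probs, targets, uniform)
weighted_loss = weighted_cross_entropy(probs, targets, weights)
```

<p>With uniform weights both tokens count the same. With the assumed inverse-frequency weights, the miss on the rare tag dominates the average, which is the "thumb on the scale" described above.</p>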
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-infrastructure-lesson">The infrastructure lesson<a href="https://mailwoman.sister.software/blog/night-shift-2-model-ships#the-infrastructure-lesson" class="hash-link" aria-label="Direct link to The infrastructure lesson" title="Direct link to The infrastructure lesson" translate="no">​</a></h2>
<p>The overnight session could have been a write-off. A GPU that crashes every 90 minutes, a 50,000-step training target, and 12 hours of wall-clock to fill. Instead:</p>
<ul>
<li class=""><strong>Corpus on R2</strong> means any GPU provider can pull it at datacenter speed. Upload once, train anywhere.</li>
<li class=""><strong>Modal's per-second billing</strong> means we paid $0 for data upload, $0 for debugging, and ~$5 for the actual GPU compute.</li>
<li class=""><strong>Checkpoints every 500 steps</strong> on the Modal Volume mean that even if a Modal preemption had happened (it didn't), we'd have lost at most a minute or two of work: 500 steps at 6.9 steps/second is about 72 seconds.</li>
<li class=""><strong>The same training script</strong> ran locally (for smoke tests) and remotely (for the full run) without modification: the config just points at <code>/data/</code>, which is either the local mount or the Modal Volume.</li>
</ul>
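<p>The checkpoint-and-resume pattern in the list above can be sketched as follows. This is an illustration under stated assumptions, not the project's training script: the file naming, the JSON checkpoint payload, and the <code>crash_at</code> parameter are all hypothetical, and the loop body stands in for the real forward and backward pass.</p>

```python
import json
import tempfile
from pathlib import Path

CHECKPOINT_EVERY = 500  # steps, as quoted in the post


def latest_checkpoint(ckpt_dir):
    """Return the step of the newest checkpoint in ckpt_dir, or 0 if there is none."""
    checkpoints = sorted(ckpt_dir.glob("step-*.json"))  # zero-padded names sort by step
    if not checkpoints:
        return 0
    return json.loads(checkpoints[-1].read_text())["step"]


def train(data_root, total_steps, crash_at=None):
    """Run (or resume) a fake training loop rooted at data_root; return the resume step."""
    ckpt_dir = Path(data_root) / "checkpoints"
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    start = latest_checkpoint(ckpt_dir)
    for step in range(start + 1, total_steps + 1):
        if crash_at is not None and step == crash_at:
            raise RuntimeError(f"simulated crash at step {step}")
        # A real loop would compute the loss and update the weights here.
        if step % CHECKPOINT_EVERY == 0:
            payload = json.dumps({"step": step})
            (ckpt_dir / f"step-{step:08d}.json").write_text(payload)
    return start


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())  # stands in for /data/ (local mount or remote volume)
    try:
        train(root, 2000, crash_at=1234)
    except RuntimeError as error:
        print(error)
    print("resumed from step", train(root, 2000))
```

<p>Because the only thing the loop knows about its environment is <code>data_root</code>, the same code runs unchanged whether that path is a local directory or a mounted volume.</p>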
<p>The local iGPU still has a role: smoke tests, gradient probes, quick 50-step experiments. The expensive runs go to the cloud. The separation happened naturally once we accepted that the hardware wall was real and not worth engineering around.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="whats-next">What's next<a href="https://mailwoman.sister.software/blog/night-shift-2-model-ships#whats-next" class="hash-link" aria-label="Direct link to What's next" title="Direct link to What's next" translate="no">​</a></h2>
<p>Now that we have cloud GPU access at $5 per full training run, several decisions we made for the local hardware no longer apply. The v0.5.0 model was trained with constraints that made sense on a thermal-limited iGPU but don't make sense on an A100:</p>
<ul>
<li class=""><strong>Hidden size 256</strong> — we wanted 384 but fell back when it wouldn't train locally. The A100 has 40 GB of VRAM; 384 or 512 are trivial.</li>
<li class=""><strong>Effective batch 128 via gradient accumulation</strong> (batch=16, accumulate 8 steps) — a workaround for limited GPU memory. The A100 can do batch 128 directly, which changes the gradient noise characteristics and potentially the training dynamics.</li>
<li class=""><strong>50,000 steps</strong> — sized for "affordable locally." At 6.9 steps/second on the A100, 100K steps costs $10. We might be undertrained.</li>
<li class=""><strong>Phrase-prior conditioning disabled</strong> — turned off during debugging and never turned back on. The architectural thesis was built around it.</li>
<li class=""><strong>Class-weighted cross-entropy disabled</strong> — the v0.4.0 recipe lever that addresses the 956-FP hallucination problem is now safe to use.</li>
</ul>
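<p>The arithmetic behind the batch and budget numbers in this list is short enough to write down. In the sketch below, the throughput, step counts, and the $5 figure come from the post; the hourly rate is only inferred from those numbers and is not a quoted provider price. The function names are hypothetical.</p>

```python
STEPS_PER_SECOND = 6.9  # A100 throughput quoted in the post


def effective_batch(micro_batch, accumulation_steps):
    """Gradient accumulation: examples that contribute to one optimizer update."""
    return micro_batch * accumulation_steps


def run_hours(steps, steps_per_second=STEPS_PER_SECOND):
    """Wall-clock hours for a run of the given length."""
    return steps / steps_per_second / 3600


def implied_hourly_rate(cost_usd, steps):
    """Hourly rate implied by a known run cost (an inference, not a price list)."""
    return cost_usd / run_hours(steps)


def projected_cost(steps, hourly_rate):
    """Cost of a longer run at the same throughput and rate."""
    return run_hours(steps) * hourly_rate


rate = implied_hourly_rate(5.0, 50_000)
print(effective_batch(16, 8))                  # 128
print(round(run_hours(50_000), 2))             # about 2 hours
print(round(projected_cost(100_000, rate), 2)) # 10.0
```

<p>Doubling the schedule doubles the cost only if throughput holds; a wider model or a larger direct batch would change the steps-per-second figure, so the projection is a floor for the next run, not a quote.</p>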
<p>The next iteration removes all of these constraints at once: h384, direct large-batch, phrase priors on, class weights on, longer schedule. Same corpus, same tokenizer, same CE-only stability fix — just the model the architecture was designed to produce. One A100 run, a few hours, covered by free credits.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="where-to-look">Where to look<a href="https://mailwoman.sister.software/blog/night-shift-2-model-ships#where-to-look" class="hash-link" aria-label="Direct link to Where to look" title="Direct link to Where to look" translate="no">​</a></h2>
<ul>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/getting-started">Getting started</a> — 5-minute install + first parse</li>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/status">Project status</a> — what ships today, per package</li>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/evals/2026-05-25-v0.5.0-ce-only-eval-matrix">Eval matrix report</a> — full per-component breakdown</li>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/understanding/our-approach/what-the-eval-numbers-mean">What the eval numbers mean</a> — plain-English interpretation</li>
<li class=""><a href="https://github.com/sister-software/mailwoman/blob/main/scripts/modal/train_remote.py" target="_blank" rel="noopener noreferrer" class="">Modal training wrapper</a> — the 250-line script that runs the whole thing</li>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/concepts/dual-loss-curvature-conflict">Dual-loss curvature conflict</a> — why CE-only works when nine dual-loss runs didn't</li>
</ul>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Model training" term="Model training"/>
        <category label="Infrastructure" term="Infrastructure"/>
        <category label="Night shift" term="Night shift"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Five tries, same failure — narrowing v0.5.0's training problem by elimination]]></title>
        <id>https://mailwoman.sister.software/blog/v0-5-0-bisect-by-elimination</id>
        <link href="https://mailwoman.sister.software/blog/v0-5-0-bisect-by-elimination"/>
        <updated>2026-05-24T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[This is a follow-up to yesterday's post about the v0.5.0 C-train failures. Yesterday we ran four attempts and ruled out three suspects. Today we ran a fifth and ruled out a fourth. We're now down to one remaining hypothesis — and the way we got here is a kind of debugging that translates pretty cleanly from software engineering, so this post is pitched at engineers who haven't run a training campaign before.]]></summary>
        <content type="html"><![CDATA[<p>This is a follow-up to <a class="" href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect">yesterday's post about the v0.5.0 C-train failures</a>. Yesterday we ran four attempts and ruled out three suspects. Today we ran a fifth and ruled out a fourth. We're now down to one remaining hypothesis — and the way we got here is a kind of debugging that translates pretty cleanly from software engineering, so this post is pitched at engineers who haven't run a training campaign before.</p>
<p>If you've ever bisected a regression in a piece of software (used <code>git bisect</code>, narrowed a test failure by reverting changes one at a time, taken a known-good build and a known-broken build and asked which of the changes between them caused the breakage), then you already understand the core move. The rest is vocabulary.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-setup-in-software-terms">The setup, in software terms<a href="https://mailwoman.sister.software/blog/v0-5-0-bisect-by-elimination#the-setup-in-software-terms" class="hash-link" aria-label="Direct link to The setup, in software terms" title="Direct link to The setup, in software terms" translate="no">​</a></h2>
<p>Last week we shipped a "v0.4.0" model. Think of a model as a long-lived process — millions of internal numbers (weights) that we tune by feeding it labelled examples for hours and adjusting based on how wrong each guess is. The output of all that tuning is a single file (~50MB) we copy to production.</p>
<p>v0.4.0 worked. We then changed a handful of things in parallel to ship v0.5.0:</p>
<ol>
<li class="">New <strong>tokenizer</strong> (the thing that splits input strings into model-readable units; we made a bigger, smarter one because the old one fell back to raw bytes on non-Latin scripts).</li>
<li class="">New <strong>corpus</strong> (we added synthetic adversarial examples + transliteration pairs to the training data).</li>
<li class="">New <strong>input layer</strong> ("phrase priors": pre-computed hints about where each meaningful span starts and ends, fed into the model alongside the raw tokens).</li>
<li class="">Bigger <strong>hidden size</strong> (the internal width of the model — more capacity, in principle).</li>
<li class="">Plus a bunch of new code surrounding it (top-k inference, joint decoding, a new reconcile stage).</li>
</ol>
<p>Item 5 is pure new code, so it's fine. Items 1-4 are the suspects. Any combination of them could be the thing that breaks training. Welcome to a multi-variable regression.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-failure-mode">The failure mode<a href="https://mailwoman.sister.software/blog/v0-5-0-bisect-by-elimination#the-failure-mode" class="hash-link" aria-label="Direct link to The failure mode" title="Direct link to The failure mode" translate="no">​</a></h2>
<p>When we trained the model — which is just a long loop, run for ~50,000 iterations, watching a number called "loss" go down — the loss went down beautifully for the first ~1000 iterations and then started going up. Catastrophically. By the time we noticed, the model had unlearned everything useful and was producing garbage.</p>
<p>In software terms: imagine a process that runs fine for the first hour and then enters a kind of cascading state corruption that slowly destroys all its in-memory data, even though no individual operation looks wrong. There's no segfault, no exception. The numbers just slowly drift away from useful and toward useless.</p>
<p>This pattern has a fingerprint:</p>
<div class="language-text codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-text codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">descent through warmup  →  brief plateau at a low loss  →  sharp climb back to nonsense</span><br></div></code></pre></div></div>
<p>Every single run we've done so far has shown this exact fingerprint at slightly different points. The depth of the plateau varies; the moment the climb starts shifts; the climb itself is always there.</p>
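<p>A fingerprint like this can be checked mechanically. The sketch below is one hedged way to do it, assuming positive loss values logged at regular intervals; the function name and the <code>rise_ratio</code> threshold are hypothetical and would need tuning against real curves.</p>

```python
def find_divergence(losses, rise_ratio=1.5):
    """Return the index of the lowest loss if the curve later climbs to at least
    rise_ratio times that minimum; return None if no such climb follows."""
    if not losses:
        return None
    best_index = min(range(len(losses)), key=losses.__getitem__)
    best_loss = losses[best_index]
    after_best = losses[best_index + 1:]
    if after_best and max(after_best) >= best_loss * rise_ratio:
        return best_index
    return None


# Made-up curves for illustration: one diverges after its plateau, one keeps descending.
diverging = [2.0, 1.2, 0.7, 0.4, 0.38, 0.39, 0.9, 1.8]
healthy = [2.0, 1.2, 0.7, 0.4, 0.38, 0.37]

print(find_divergence(diverging))  # 4, the plateau's low point
print(find_divergence(healthy))    # None
```

<p>The returned index corresponds to the "best loss before climb" column in the table that follows, which is why the detector reports the low point instead of the first rising step.</p>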
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="bisecting-by-elimination">Bisecting by elimination<a href="https://mailwoman.sister.software/blog/v0-5-0-bisect-by-elimination#bisecting-by-elimination" class="hash-link" aria-label="Direct link to Bisecting by elimination" title="Direct link to Bisecting by elimination" translate="no">​</a></h2>
<p>Five attempts now, each varying one knob from the previous:</p>
<table><thead><tr><th>Run</th><th>What changed</th><th>Best loss before climb</th><th>When the climb started</th></tr></thead><tbody><tr><td>v1</td><td>(all v0.5.0 changes ON)</td><td>0.61</td><td>step 700</td></tr><tr><td>v2</td><td>lowered LR (1.5e-4 → 1e-4)</td><td>0.51</td><td>step 1000</td></tr><tr><td>v3</td><td>turned off two loss-side knobs (§1, §3)</td><td>0.41</td><td>step 800</td></tr><tr><td><strong>bisect-h256</strong></td><td>reverted hidden-size bump</td><td>0.31</td><td>step 1050</td></tr><tr><td><strong>bisect-phrase-off</strong></td><td>reverted phrase-prior input layer</td><td>0.38</td><td>step 1050</td></tr></tbody></table>
<p>The bisect-phrase-off run is the new one (today). The previous post covered v1 through bisect-h256.</p>
<p>What every bisect attempt has in common: <strong>the model is provably learning something useful for several hundred iterations</strong> (the loss decreases, validation accuracy climbs), and <strong>then it falls off a cliff</strong>. This means the model isn't fundamentally broken. It can fit the data, it just can't <em>stay</em> fit. Something is pushing it off the cliff.</p>
<p>Each bisect tested a different "is this the cliff?" hypothesis:</p>
<ul>
<li class=""><strong>v2 tested "is the learning rate too high?"</strong> No. Lowering LR delayed the climb but didn't stop it.</li>
<li class=""><strong>v3 tested "are the new loss-side weighting knobs destabilising?"</strong> No. v0.4.0's known-stable loss settings still produce the cliff.</li>
<li class=""><strong>bisect-h256 tested "is the bigger model the problem?"</strong> No. We reverted hidden size to v0.4.0's value and got the cleanest training so far (best validation macro-F1 we've ever measured), and the cliff still happened.</li>
<li class=""><strong>bisect-phrase-off (today) tested "is the phrase-prior input layer the problem?"</strong> No. We turned off the phrase-prior feature concatenation entirely and the cliff is still there, in the same shape, at almost the same step.</li>
</ul>
<p>Five attempts, five identical fingerprints, four hypotheses eliminated. <strong>Exactly one architectural change from the known-stable v0.4.0 setup is still in play</strong>: the tokenizer + corpus pair.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="whats-left-and-why-its-interesting">What's left, and why it's interesting<a href="https://mailwoman.sister.software/blog/v0-5-0-bisect-by-elimination#whats-left-and-why-its-interesting" class="hash-link" aria-label="Direct link to What's left, and why it's interesting" title="Direct link to What's left, and why it's interesting" translate="no">​</a></h2>
<p>The two remaining variables are linked:</p>
<ul>
<li class=""><strong>A1 tokenizer</strong>: a new vocabulary of 48,000 sub-pieces that the model uses to chop input strings into atomic units. It was trained on the v0.4.0 corpus (which includes the new transliteration data) so it knows about CJK / Cyrillic / Hangul / Han / Armenian script. The old tokenizer just gave up on non-Latin scripts and emitted raw bytes.</li>
<li class=""><strong>corpus-v0.4.0</strong>: the old corpus plus ~78,000 new rows generated by an LLM — adversarial "trick" addresses and transliteration pairs in non-Latin scripts.</li>
</ul>
<p>These two are bundled. A1's vocabulary was constructed <em>from</em> v0.4.0's content. So the model is simultaneously seeing new tokens (vocab change) and new data (corpus change) for the first time, and we can't fully separate them without retraining one or the other.</p>
<p>But we have one cheap experiment that gets us most of the way there. Hold the tokenizer constant (keep A1), and just swap the corpus back to v0.3.0 (the old data, no transliteration mass). That tests whether the transliteration data is the destabiliser, while preserving the tokenizer-side win.</p>
<p>This is the next bisect. If it trains cleanly, we'll know the synthetic data — specifically, B2's transliteration mass — is what's breaking training. That'd be a useful answer because we have a couple of obvious follow-ups:</p>
<ul>
<li class=""><strong>Downweight transliteration in the training mix</strong>. The corpus has per-source weights; we just turn down the new stuff. Lossy but cheap.</li>
<li class=""><strong>Investigate why the transliteration data destabilises</strong>. The honest hypothesis is that the LLM-generated rows have systematically different gradient signatures than human-validated address data — they might be <em>too</em> structured, or have repetitive patterns the model overfits to and then explodes on. We have tooling (<code>corpus-audit</code>) that can quantify this.</li>
<li class=""><strong>Ship A1 (tokenizer wins) + corpus-v0.3.0 model (proven-stable)</strong> for v0.5.0, defer transliteration training to v0.5.1.</li>
</ul>
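<p>The first option, turning down per-source weights, amounts to changing sampling probabilities. Here is a minimal sketch, assuming the weights are simple non-negative multipliers per source; the source names and weight values are hypothetical and are not taken from the actual corpus config.</p>

```python
import random
from collections import Counter


def mix_probabilities(weights):
    """Normalize per-source weights into sampling probabilities."""
    total = sum(weights.values())
    return {source: weight / total for source, weight in weights.items()}


def sample_sources(weights, n, seed=0):
    """Draw n source names in proportion to their weights (seeded for repeatability)."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=n)


# Hypothetical mixes: equal weighting versus transliteration turned down to a quarter.
equal_mix = {"human_validated": 1.0, "transliteration": 1.0}
downweighted_mix = {"human_validated": 1.0, "transliteration": 0.25}

equal_counts = Counter(sample_sources(equal_mix, 10_000))
downweighted_counts = Counter(sample_sources(downweighted_mix, 10_000))
print(mix_probabilities(downweighted_mix))
print(equal_counts["transliteration"], downweighted_counts["transliteration"])
```

<p>This is why the option is described as lossy but cheap: no rows are deleted, the model simply sees the suspect source less often per step.</p>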
<p>If the corpus-revert bisect also fails, we're left with the A1 tokenizer itself as the destabiliser. That's a stranger answer (tokenizer training is mostly orthogonal to classifier training), but not impossible. New vocabulary means a fresh embedding table the model has never seen; unusual sub-piece frequencies could in principle produce unusual gradient norms.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-wed-tell-a-software-engineer-reading-this">What we'd tell a software engineer reading this<a href="https://mailwoman.sister.software/blog/v0-5-0-bisect-by-elimination#what-wed-tell-a-software-engineer-reading-this" class="hash-link" aria-label="Direct link to What we'd tell a software engineer reading this" title="Direct link to What we'd tell a software engineer reading this" translate="no">​</a></h2>
<p>Three things about ML debugging that don't translate cleanly from regular software:</p>
<ol>
<li class=""><strong>There's no stack trace.</strong> Loss is the only signal you get. You don't get to step into the model and see what's wrong. You change one knob, run the experiment for hours, and read the resulting curve like a fortune teller. This is the part that makes ML feel unscientific — but if you're disciplined about it (one knob at a time, write down the result, save the artifacts), it's exactly the same bisect-by-elimination workflow as <code>git bisect</code>.</li>
<li class=""><strong>Iterations are expensive.</strong> Each "is the bug here?" check costs hours of GPU time. You can't make 100 tries and look at the distribution. You make 5-10 tries, and each one has to be carefully designed to maximise the information yield. This is why ML researchers obsess over "ablation studies" — they're the equivalent of unit tests, but each one costs $5 of compute.</li>
<li class=""><strong>The "obvious" suspect is often wrong.</strong> When v0.5.0 started failing we assumed the bigger model was at fault. (We made it bigger! That's a lot of new parameters to break!) The h256 bisect ruled that out cleanly. Then we assumed it was the new input layer. The phrase-off bisect ruled that out too. The remaining suspect — the tokenizer + corpus — was our least-favourite hypothesis going in, because the tokenizer was the <em>headline win</em> of v0.5.0. But the data has a way of being indifferent to your preferences. You keep elimination-bisecting until you find the answer the data is actually telling you.</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="where-we-go-next">Where we go next<a href="https://mailwoman.sister.software/blog/v0-5-0-bisect-by-elimination#where-we-go-next" class="hash-link" aria-label="Direct link to Where we go next" title="Direct link to Where we go next" translate="no">​</a></h2>
<p>The corpus-revert bisect is the next experiment. It's a 25-30 hour training run on the lab's GPU, so we'll start it tonight and check on it tomorrow morning. If it trains clean, we have a clear shippable v0.5.0 (with a v0.5.1 follow-up to figure out the transliteration data destabilisation). If it doesn't, we'll have the cleanest possible signal that the tokenizer change itself is interacting weirdly with classifier training — a much more interesting problem to write about.</p>
<p>Either way the bisect ladder is short now. Five experiments in, one hypothesis left, and a clear next experiment that resolves it. The frustrating part of ML debugging is the long iteration cycle; the satisfying part is that the same systematic elimination always works in the end.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="where-to-look">Where to look<a href="https://mailwoman.sister.software/blog/v0-5-0-bisect-by-elimination#where-to-look" class="hash-link" aria-label="Direct link to Where to look" title="Direct link to Where to look" translate="no">​</a></h2>
<ul>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/plan/v0-5-0-shipped">v0.5.0 fresh-slate plan</a></li>
<li class=""><a class="" href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect">Yesterday's post on the first four attempts</a></li>
<li class=""><a class="" href="https://mailwoman.sister.software/blog/v0-4-0-ablation-campaign">The v0.4.0 retrospective</a> (the original "destabilisation fingerprint" we recognised in v0.5.0)</li>
<li class=""><code>VERDICT_SMOKES.md</code> — discipline doc for the smoke-test framework we built during v0.4.0 to catch divergences early</li>
</ul>
<p>If you do ML work and have ideas about what classes of corpus distribution shift could produce a "trains fine for a thousand steps then catastrophically diverges" pattern, the mailbox is open: <code>contact@sister.software</code>.</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Model training" term="Model training"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Taming Who's On First — making sense of the world's open place data]]></title>
        <id>https://mailwoman.sister.software/blog/taming-whosonfirst</id>
        <link href="https://mailwoman.sister.software/blog/taming-whosonfirst"/>
        <updated>2026-05-24T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Mailwoman is an open-source address parser + geocoder that uses Who's On First as its gazetteer. This post is a practical reference on WOF's gotchas and the tooling we built to work around them. Try the demo or see what ships today.]]></summary>
        <content type="html"><![CDATA[<div class="theme-admonition theme-admonition-note admonition_WCGJ alert alert--secondary"><div class="admonitionHeading_GCBg"><span class="admonitionIcon_L39b"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>If you found this via search</div><div class="admonitionContent_pbrs"><p><strong><a href="https://mailwoman.sister.software/" target="_blank" rel="noopener noreferrer" class="">Mailwoman</a></strong> is an open-source address parser + geocoder that uses Who's On First as its gazetteer. This post is a practical reference on WOF's gotchas and the tooling we built to work around them. <a class="" href="https://mailwoman.sister.software/demo">Try the demo</a> or see <a class="" href="https://mailwoman.sister.software/docs/status">what ships today</a>.</p></div></div>
<p>Who's On First is the best open gazetteer we have. It's also one of the strangest datasets you'll encounter as a developer. This post is about what makes it hard to use, what makes it worth the effort, and the tooling we built inside Mailwoman to tame it.</p>
<p>If you've ever tried to answer "what city is this address in?" programmatically, using open data without paying a geocoding API, you've probably already run into WOF. And you probably had some questions.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-whos-on-first-actually-is">What Who's On First actually is<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#what-whos-on-first-actually-is" class="hash-link" aria-label="Direct link to What Who's On First actually is" title="Direct link to What Who's On First actually is" translate="no">​</a></h2>
<p>WOF is a gazetteer — a structured database of places. Not addresses, not roads, not buildings. <em>Places</em>: countries, regions, counties, cities, neighbourhoods, venues. Each one gets a stable numeric ID, a parent-child hierarchy, multilingual name variants, and a polygon geometry.</p>
<p>It was created by Mapzen (RIP, 2018) as the successor to GeoPlanet (Yahoo's gazetteer, also RIP). The data lives on GitHub as approximately 100 repositories under the <code>whosonfirst-data</code> org, totalling several million individual GeoJSON files. Geocode Earth maintains the canonical SQLite distributions at <code>data.geocode.earth</code>.</p>
<p>The key thing WOF gives you that no other open dataset does: <strong>a consistent hierarchy with stable IDs</strong>. You can take a locality (<code>Houston</code>, id <code>85922029</code>), follow its <code>parent_id</code> to a region (<code>Texas</code>, id <code>85688753</code>), follow <em>that</em> to a country (<code>United States</code>, id <code>85633793</code>), and know the chain is consistent. OpenStreetMap doesn't give you this. GeoNames gives you a partial version. WOF gives you the whole thing, with an opinion on how the world's administrative boundaries nest.</p>
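<p>Following the chain is a short loop. The sketch below uses the three IDs from the paragraph above in a hand-built table; real WOF records carry many more properties and usually have intermediate levels such as counties, so treat the table as a simplified stand-in. The function name is hypothetical.</p>

```python
# Simplified stand-in for WOF records, keyed by wof:id.
PLACES = {
    85922029: {"name": "Houston", "placetype": "locality", "parent_id": 85688753},
    85688753: {"name": "Texas", "placetype": "region", "parent_id": 85633793},
    # The parent here is a continent ID that is deliberately absent from this table.
    85633793: {"name": "United States", "placetype": "country", "parent_id": 102191575},
}


def ancestors(place_id, places):
    """Walk parent_id links from place_id upward, returning (placetype, name) pairs.

    Stops when a parent is missing from the table or when a cycle is detected."""
    chain = []
    seen = set()
    current = place_id
    while current in places and current not in seen:
        seen.add(current)
        place = places[current]
        chain.append((place["placetype"], place["name"]))
        current = place["parent_id"]
    return chain


for placetype, name in ancestors(85922029, PLACES):
    print(placetype, name)
```

<p>The cycle guard is cheap insurance: a consistent hierarchy should never loop, but a loader that trusts that assumption will hang on the one record that violates it.</p>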
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-makes-it-hard">What makes it hard<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#what-makes-it-hard" class="hash-link" aria-label="Direct link to What makes it hard" title="Direct link to What makes it hard" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_tleR" id="one-file-per-place">One file per place<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#one-file-per-place" class="hash-link" aria-label="Direct link to One file per place" title="Direct link to One file per place" translate="no">​</a></h3>
<p>WOF stores each place as a separate <code>.geojson</code> file in a directory tree. A US admin dataset has roughly 120,000 individual files. The French equivalent has about 80,000. Opening, parsing, and indexing 200,000 JSON files is a meaningful engineering problem before you've even asked a question of the data.</p>
<p>This per-file layout made sense for WOF's original use case: git-trackable changes to individual places. You can see who edited Houston last, what changed, and when. But for a geocoder that needs to query "all localities named Houston" across 120K files, it's the wrong access pattern entirely.</p>
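<p>One way to bridge the two access patterns is to walk the file tree once and load the handful of properties a lookup needs into SQLite. The sketch below is a rough illustration using only the standard library; the table layout and function names are hypothetical and are not the schema of the published SQLite distributions or of Mailwoman's own tooling.</p>

```python
import json
import sqlite3
from pathlib import Path


def build_index(root, db_path=":memory:"):
    """Load every .geojson file under root into a small SQLite lookup table."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE place ("
        "id INTEGER PRIMARY KEY, name TEXT, placetype TEXT, parent_id INTEGER)"
    )
    db.execute("CREATE INDEX place_name ON place (name, placetype)")
    for path in Path(root).rglob("*.geojson"):
        props = json.loads(path.read_text(encoding="utf-8"))["properties"]
        db.execute(
            "INSERT OR REPLACE INTO place VALUES (?, ?, ?, ?)",
            (
                props["wof:id"],
                props["wof:name"],
                props["wof:placetype"],
                props.get("wof:parent_id"),
            ),
        )
    db.commit()
    return db


def find_by_name(db, name, placetype):
    """Return the IDs of all places with this exact name and placetype."""
    rows = db.execute(
        "SELECT id FROM place WHERE name = ? AND placetype = ? ORDER BY id",
        (name, placetype),
    )
    return [row[0] for row in rows]
```

<p>After the one-time load, "all localities named Houston" becomes a single indexed query, for example <code>find_by_name(db, "Houston", "locality")</code>, with no further file reads.</p>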
<h3 class="anchor anchorTargetStickyNavbar_tleR" id="the-property-namespace-explosion">The property namespace explosion<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#the-property-namespace-explosion" class="hash-link" aria-label="Direct link to The property namespace explosion" title="Direct link to The property namespace explosion" translate="no">​</a></h3>
<p>A WOF GeoJSON feature's properties object looks like this:</p>
<div class="language-json codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-json codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"wof:id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">85830005</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"wof:name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Lawrence Corner"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"wof:placetype"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"neighbourhood"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"wof:parent_id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token 
number" style="color:#36acaa">1729442683</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"wof:country"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"US"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"wof:hierarchy"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">		</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">			</span><span class="token property" style="color:#36acaa">"continent_id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">102191575</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">			</span><span class="token property" style="color:#36acaa">"country_id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">85633793</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain">			</span><span class="token property" style="color:#36acaa">"county_id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">102085493</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">			</span><span class="token property" style="color:#36acaa">"localadmin_id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">404477193</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">			</span><span class="token property" style="color:#36acaa">"locality_id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1729442683</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">			</span><span class="token property" style="color:#36acaa">"neighbourhood_id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">85830005</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">			</span><span class="token property" style="color:#36acaa">"region_id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">85688689</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">		</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"name:eng_x_preferred"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"Lawrence Corner"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"name:eng_x_variant"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"Lawrence Cor"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"src:geom"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"quattroshapes"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain">	</span><span class="token property" style="color:#36acaa">"edtf:inception"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"uuuu"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"edtf:cessation"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"uuuu"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"geom:area"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0.000047</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"geom:bbox"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"-74.73,40.08,-74.72,40.09"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">	</span><span class="token property" style="color:#36acaa">"mz:hierarchy_label"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p>There are a few things to notice:</p>
<ol>
<li class=""><strong>Namespaced keys everywhere.</strong> <code>wof:</code>, <code>name:</code>, <code>src:</code>, <code>edtf:</code>, <code>geom:</code>, <code>mz:</code>: each prefix marks a different source or concern. The schema is mostly flat (one object, with meaning encoded in the key name); <code>wof:hierarchy</code> is the visible exception, since it holds an array of objects.</li>
<li class=""><strong>Name variants are language-coded.</strong> <code>name:eng_x_preferred</code> is the preferred English name. <code>name:fra_x_preferred</code> would be French. <code>name:zho_x_preferred</code> would be Chinese. The <code>_x_</code> separator splits language code from name kind (<code>preferred</code>, <code>variant</code>, <code>colloquial</code>, <code>abbr</code>, <code>short</code>).</li>
<li class=""><strong>Some places have dozens of name keys.</strong> A major city like Paris has <code>name:</code> entries in 50+ languages. A rural US neighbourhood might have only one.</li>
<li class=""><strong>The hierarchy is pre-computed.</strong> Instead of walking <code>parent_id</code> up the tree at query time, WOF bakes the full ancestry chain into each record. Convenient for display; redundant for storage; occasionally stale when a parent is reclassified.</li>
</ol>
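<p>The name-key convention from the list above can be sketched as a small parser. This is a minimal illustration, not Mailwoman's code: <code>parseNameKey</code> and <code>ParsedNameKey</code> are hypothetical names, and the sketch assumes everything between <code>name:</code> and <code>_x_</code> is the language tag.</p>

```typescript
// Hypothetical result type: the two halves of a WOF name key.
interface ParsedNameKey {
  language: string
  kind: string
}

// Split a key such as "name:eng_x_preferred" into language and name kind.
// Returns null for keys that do not follow the "name:<language>_x_<kind>" shape.
function parseNameKey(key: string): ParsedNameKey | null {
  const prefix = "name:"
  const separator = "_x_"
  if (!key.startsWith(prefix)) return null

  const rest = key.slice(prefix.length)
  const separatorIndex = rest.indexOf(separator)
  if (separatorIndex <= 0) return null

  const language = rest.slice(0, separatorIndex)
  const kind = rest.slice(separatorIndex + separator.length)
  if (kind === "") return null

  return { language, kind }
}

// Usage: keys outside the name: namespace are simply skipped.
for (const key of ["name:eng_x_preferred", "name:fra_x_variant", "wof:country"]) {
  console.log(key, parseNameKey(key))
}
```

<p>Returning <code>null</code> instead of throwing keeps the caller's loop simple, since most keys on a record are not name keys at all.</p>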
<h3 class="anchor anchorTargetStickyNavbar_tleR" id="brooklyn-integers">Brooklyn Integers<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#brooklyn-integers" class="hash-link" aria-label="Direct link to Brooklyn Integers" title="Direct link to Brooklyn Integers" translate="no">​</a></h3>
<p>WOF IDs are issued by a service called Brooklyn Integers, a distributed ID generator that guarantees uniqueness across the dataset. The IDs are not sequential and not geographically meaningful, and sorting them tells you nothing about the places they identify. They're just unique numbers. This is fine for lookup, but it means you can't reason about "nearby" places by ID proximity.</p>
<h3 class="anchor anchorTargetStickyNavbar_tleR" id="supersession-chains">Supersession chains<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#supersession-chains" class="hash-link" aria-label="Direct link to Supersession chains" title="Direct link to Supersession chains" translate="no">​</a></h3>
<p>Places get deprecated: a neighbourhood is absorbed by a neighbouring one, a county boundary changes, a locality is merged. WOF tracks this via <code>wof:superseded_by</code> arrays. A query that doesn't check supersession may return a place that hasn't existed since 2015.</p>
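<p>A minimal sketch of the supersession check, assuming a record's properties have already been parsed into an object. The <code>WOFRecordProperties</code> shape is a simplification for illustration, and <code>isCurrent</code> and <code>onlyCurrent</code> are hypothetical helpers rather than part of any WOF library.</p>

```typescript
// Simplified record shape: only the two properties this sketch needs.
interface WOFRecordProperties {
  "wof:id": number
  "wof:superseded_by"?: number[]
}

// A record counts as current when no other record has superseded it.
// A missing array is treated the same as an empty one (an assumption).
function isCurrent(properties: WOFRecordProperties): boolean {
  const supersededBy = properties["wof:superseded_by"] ?? []
  return supersededBy.length === 0
}

// Drop superseded records from a result set before handing it to a caller.
function onlyCurrent(records: readonly WOFRecordProperties[]): WOFRecordProperties[] {
  return records.filter(isCurrent)
}

// Usage with made-up IDs: record 1 was replaced by record 2.
const candidates: WOFRecordProperties[] = [
  { "wof:id": 1, "wof:superseded_by": [2] },
  { "wof:id": 2, "wof:superseded_by": [] },
]
console.log(onlyCurrent(candidates).map((record) => record["wof:id"]))
```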
<h3 class="anchor anchorTargetStickyNavbar_tleR" id="parent-id---1">Parent ID = -1<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#parent-id---1" class="hash-link" aria-label="Direct link to Parent ID = -1" title="Direct link to Parent ID = -1" translate="no">​</a></h3>
<p>A <code>parent_id</code> of -1 means "we don't know the parent." A <code>parent_id</code> of 0 means "no parent (this is a continent or Earth itself)." The first French postalcode dataset was ingested with <code>parent_id: -1</code> for every record, making hierarchy traversal useless until someone manually assigned parents. Some of those records still have <code>-1</code>.</p>
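<p>The difference between the two sentinels matters most when walking a hierarchy. The sketch below is an assumption-laden illustration: <code>walkAncestors</code> is a hypothetical helper, the <code>parentOf</code> map stands in for whatever store holds the records, and treating every negative value like <code>-1</code> is our simplification, since only <code>-1</code> and <code>0</code> are described here.</p>

```typescript
// Why a walk stopped: a real root, a hole in the data, or a broken chain.
type WalkEnd = "root" | "unknown-parent" | "missing-record" | "cycle"

interface AncestorWalk {
  chain: number[] // ancestor IDs, nearest first
  end: WalkEnd
}

// Follow parent_id links upward from startId until the chain ends.
function walkAncestors(startId: number, parentOf: Map<number, number>): AncestorWalk {
  const chain: number[] = []
  const seen = new Set<number>()
  let current = startId

  while (true) {
    if (seen.has(current)) return { chain, end: "cycle" }
    seen.add(current)

    const parent = parentOf.get(current)
    if (parent === undefined) return { chain, end: "missing-record" }
    if (parent === 0) return { chain, end: "root" }
    // Assumption: any negative value is handled like -1 ("parent unknown").
    if (parent < 0) return { chain, end: "unknown-parent" }

    chain.push(parent)
    current = parent
  }
}

// Usage with made-up IDs: 101 -> 102 -> 103 (root), and 7 with an unknown parent.
const parents = new Map([[101, 102], [102, 103], [103, 0], [7, -1]])
console.log(walkAncestors(101, parents))
console.log(walkAncestors(7, parents))
```

<p>Reporting <em>why</em> the walk stopped lets a caller tell "this place really has no parent" apart from "the data has a hole here", which a bare list of ancestors cannot express.</p>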
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-we-built-to-tame-it">What we built to tame it<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#what-we-built-to-tame-it" class="hash-link" aria-label="Direct link to What we built to tame it" title="Direct link to What we built to tame it" translate="no">​</a></h2>
<p>Mailwoman needs WOF for two things:</p>
<ol>
<li class=""><strong>Rule classifiers</strong>: "is this token a known locality name?" (Used by the locality/region/country dictionaries in the rule-based classifiers.)</li>
<li class=""><strong>Reconciler concordance scoring</strong>: "does this parse's locality/region/country assignment form a valid <code>parent_id</code> chain in the world?" (Used by Stage 5 joint decoding.)</li>
</ol>
<p>Each use case has a different access pattern, so we built two layers:</p>
<h3 class="anchor anchorTargetStickyNavbar_tleR" id="layer-1-normalised-placename-index-wofplacenamecache">Layer 1: normalised placename index (<code>WOFPlacenameCache</code>)<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#layer-1-normalised-placename-index-wofplacenamecache" class="hash-link" aria-label="Direct link to layer-1-normalised-placename-index-wofplacenamecache" title="Direct link to layer-1-normalised-placename-index-wofplacenamecache" translate="no">​</a></h3>
<p>For the rule classifiers, all we need is a fast "is this string a placename in any language?" lookup. We don't need coordinates, hierarchy, or geometry — just the normalised string and which languages it's valid in.</p>
<p><code>WOFPlacenameCache</code> builds this index by streaming GeoJSON files via <code>TextSpliterator</code> (our line-delimited streaming library), extracting <code>name:*</code> properties, normalising them (case folding, accent stripping), and inserting into an in-memory Map keyed by the normalised form. The value is a Set of language codes the name appears in.</p>
<p>The normalisation matters because WOF stores "São Paulo" with the accent, but user input might arrive as "Sao Paulo" or "SAO PAULO". The index needs to match all three.</p>
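<p>A rough sketch of that normalise-then-index idea, under stated assumptions: <code>normalisePlacename</code> and <code>PlacenameIndex</code> are hypothetical names rather than the real <code>WOFPlacenameCache</code> API, lowercasing stands in for full case folding, and accent stripping is limited to the basic combining-diacritic range.</p>

```typescript
// Reduce a placename to a comparison key: decompose accents, strip the
// combining marks, lowercase, and collapse whitespace.
function normalisePlacename(raw: string): string {
  return raw
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .trim()
    .replace(/\s+/g, " ")
}

// Maps each normalised name to the set of language codes it appears in.
class PlacenameIndex {
  private readonly languagesByName = new Map<string, Set<string>>()

  add(name: string, language: string): void {
    const key = normalisePlacename(name)
    let languages = this.languagesByName.get(key)
    if (languages === undefined) {
      languages = new Set<string>()
      this.languagesByName.set(key, languages)
    }
    languages.add(language)
  }

  // Returns an empty set when the name is unknown.
  languagesFor(name: string): ReadonlySet<string> {
    return this.languagesByName.get(normalisePlacename(name)) ?? new Set<string>()
  }
}

// Usage: all three spellings resolve to the same entry.
const index = new PlacenameIndex()
index.add("São Paulo", "por")
for (const input of ["São Paulo", "Sao Paulo", "SAO PAULO"]) {
  console.log(input, [...index.languagesFor(input)])
}
```

<p>Normalising on both the write path and the read path is the important part: the index never has to guess which spelling a caller will use.</p>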
<h3 class="anchor anchorTargetStickyNavbar_tleR" id="layer-2-per-placetype-sqlite-db-placetypedatasource">Layer 2: per-placetype SQLite DB (<code>PlacetypeDataSource</code>)<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#layer-2-per-placetype-sqlite-db-placetypedatasource" class="hash-link" aria-label="Direct link to layer-2-per-placetype-sqlite-db-placetypedatasource" title="Direct link to layer-2-per-placetype-sqlite-db-placetypedatasource" translate="no">​</a></h3>
<p>For the reconciler, we need richer queries: "give me all localities named Houston with their parent_id chains" and "walk this locality's parent_id up to region — does it reach Texas?"</p>
<p><code>PlacetypeDataSource</code> is a SQLite database per (placetype, language) combination. Schema:</p>
<div class="language-sql codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-sql codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">CREATE</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">TABLE</span><span class="token plain"> records </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  id        </span><span class="token keyword" style="color:#00009f">INTEGER</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">NOT</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">NULL</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  src       </span><span class="token keyword" style="color:#00009f">TEXT</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">NOT</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">NULL</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  name      </span><span class="token keyword" style="color:#00009f">TEXT</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">NOT</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">NULL</span><span class="token punctuation" style="color:#393A34">,</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  preferred </span><span class="token keyword" style="color:#00009f">TEXT</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  variant   </span><span class="token keyword" style="color:#00009f">TEXT</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  colloquial </span><span class="token keyword" style="color:#00009f">TEXT</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  abbr      </span><span class="token keyword" style="color:#00009f">TEXT</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  short     </span><span class="token keyword" style="color:#00009f">TEXT</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  parent_id </span><span class="token keyword" style="color:#00009f">INTEGER</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token keyword" style="color:#00009f">PRIMARY</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">KEY</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">id</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"> src</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> name</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
<p>One row per name variant. "Saint Petersburg" and "St. Petersburg" and "St Petersburg" are three rows for the same <code>id</code>, with different <code>name</code>/<code>variant</code>/<code>short</code> columns. Once the reconciler reads from these DBs, it can query any variant form and get the same <code>parent_id</code> chain, which is the fix for the "not found" problem we hit in testing.</p>
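<p>To make the one-row-per-variant idea concrete, here is an in-memory sketch. It models only four of the schema's columns, the IDs and <code>src</code> values are invented, and <code>parentIdsForName</code> is a hypothetical stand-in for what would really be a SQL query against <code>records</code>.</p>

```typescript
// A trimmed-down view of one row in the records table.
interface PlaceRow {
  id: number
  src: string
  name: string
  parent_id: number | null
}

// Find every place that carries the given name, keyed by place id.
// Several rows can share an id, so the Map collapses them.
function parentIdsForName(rows: readonly PlaceRow[], name: string): Map<number, number | null> {
  const parentById = new Map<number, number | null>()
  for (const row of rows) {
    if (row.name === name) parentById.set(row.id, row.parent_id)
  }
  return parentById
}

// Made-up data: three spellings of one place, all pointing at parent 500.
const rows: PlaceRow[] = [
  { id: 42, src: "example", name: "Saint Petersburg", parent_id: 500 },
  { id: 42, src: "example", name: "St. Petersburg", parent_id: 500 },
  { id: 42, src: "example", name: "St Petersburg", parent_id: 500 },
]
console.log(parentIdsForName(rows, "St Petersburg"))
```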
<h3 class="anchor anchorTargetStickyNavbar_tleR" id="the-piscina-pipeline-stalled-documented">The Piscina pipeline (stalled, documented)<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#the-piscina-pipeline-stalled-documented" class="hash-link" aria-label="Direct link to The Piscina pipeline (stalled, documented)" title="Direct link to The Piscina pipeline (stalled, documented)" translate="no">​</a></h3>
<p>Processing 120K GeoJSON files into these DBs is an embarrassingly parallel problem. Our <code>commands/wof/prepare</code> command uses Piscina (a Node.js worker-thread pool) to dispatch files across all available CPU cores. Each worker:</p>
<ol>
<li class="">Reads a GeoJSON file.</li>
<li class="">Calls <code>pluckPlacetypeSpec</code> to extract the structured fields + all name variants per language.</li>
<li class=""><em>(Should)</em> upsert into the appropriate <code>PlacetypeDataSource</code>.</li>
</ol>
<p>Step 3 currently targets Redis (a leftover from an earlier prototype). The migration to SQLite is documented but not yet complete. The design intent was in-memory SQLite per worker (zero disk I/O during the hot path) with a consolidation step at the end — but that never got past the design stage.</p>
<h3 class="anchor anchorTargetStickyNavbar_tleR" id="asyncspliteratorasmany--the-file-that-got-away"><code>AsyncSpliterator.asMany</code> — the file that got away<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#asyncspliteratorasmany--the-file-that-got-away" class="hash-link" aria-label="Direct link to asyncspliteratorasmany--the-file-that-got-away" title="Direct link to asyncspliteratorasmany--the-file-that-got-away" translate="no">​</a></h3>
<p>When the data arrives as a single bulk NDJSON dump rather than 120K files, the access pattern changes. Instead of "glob files, dispatch per-file," you want "split one huge file into N byte-range chunks, process each chunk in parallel."</p>
<p><code>AsyncSpliterator.asMany(source, delimiter, concurrency)</code> was built for this case. Given a file handle and a desired concurrency, it seeks to N roughly-equal byte positions in the file, snaps each position to the nearest delimiter boundary (so no line gets split between workers), and returns N independent async iterators that each process their own byte range.</p>
<p>The analogy: you have a book with a million pages. Instead of having one person read cover-to-cover, you measure the book's thickness, divide it into N roughly-equal stacks, and hand each stack to a different reader. Each reader finds the nearest chapter boundary at their stack's start and end (the delimiter-snap), then processes independently.</p>
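<p>The split-then-snap step can be illustrated on an in-memory buffer. This is a simplified sketch, not the <code>AsyncSpliterator.asMany</code> implementation: <code>splitAtDelimiters</code> and <code>ByteRange</code> are hypothetical names, it works on a <code>Uint8Array</code> instead of a file handle, and it always snaps forward to the next delimiter, not to the nearest one.</p>

```typescript
// A half-open byte range: start is inclusive, end is exclusive.
interface ByteRange {
  start: number
  end: number
}

// Cut data into at most `parts` ranges of roughly equal size, extending each
// range forward so that it ends just after a delimiter (or at end of data).
function splitAtDelimiters(data: Uint8Array, delimiter: number, parts: number): ByteRange[] {
  if (!Number.isInteger(parts) || parts < 1) {
    throw new RangeError("parts must be a positive integer")
  }

  const ranges: ByteRange[] = []
  const targetSize = Math.max(1, Math.ceil(data.length / parts))
  let start = 0

  while (start < data.length) {
    let end = Math.min(start + targetSize, data.length)
    // Snap: keep extending until the last byte in the range is a delimiter.
    while (end < data.length && data[end - 1] !== delimiter) {
      end++
    }
    ranges.push({ start, end })
    start = end
  }

  return ranges
}
```

<p>Because every range ends on a delimiter, each worker can parse its slice without coordinating with its neighbours. Long lines can make the function return fewer ranges than requested, which is usually acceptable for a concurrency hint.</p>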
<p>We built it, marked it <code>@internal</code>, and haven't exercised it at scale because the per-file path was sufficient for the repos we actually use. But when someone wants to process the full Geocode Earth SQLite distribution (3+ GB of admin data across all countries), this is the right primitive.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="whats-next">What's next<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#whats-next" class="hash-link" aria-label="Direct link to What's next" title="Direct link to What's next" translate="no">​</a></h2>
<p>Three things, in priority order:</p>
<ol>
<li class=""><strong>Finish the SQLite migration.</strong> The worker targets Redis; it should target <code>PlacetypeDataSource</code>. The <code>pluckPlacetypeSpec</code> output already matches the schema. The remaining work is plumbing, not design.</li>
<li class=""><strong>Wire PlacetypeDataSource into the reconciler.</strong> The concordance scoring currently uses the raw WOF <code>spr</code> SQLite table (from Geocode Earth's distribution). It should use our per-placetype/per-language DBs, which carry the name variants the raw table doesn't expose. This fixes the "Saint Petersburg not found" class of lookup failures.</li>
<li class=""><strong>Benchmark the in-memory-then-consolidate pattern.</strong> If 120K individual writes to the same few DB files from N concurrent workers bottleneck on SQLite's single writer lock (likely, since even in WAL mode a database admits only one writer at a time), the in-memory-SQLite-per-worker → ATTACH-and-merge pattern is the escape hatch. Whether it's actually needed depends on whether step 1 is fast enough without it.</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="so-why-put-up-with-wof">So why put up with WOF?<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#so-why-put-up-with-wof" class="hash-link" aria-label="Direct link to So why put up with WOF?" title="Direct link to So why put up with WOF?" translate="no">​</a></h2>
<p>Every geocoder needs a gazetteer. The choice is: pay for one (Google, HERE, Mapbox), use an open one (WOF, GeoNames, OSM Nominatim), or build your own from government sources (BAN, NAD, TIGER).</p>
<p>WOF is the best open option for hierarchy and multilingual names. But it's hard to use raw. The per-file layout, the flat namespace, the supersession chains, the <code>parent_id: -1</code> holes — each one is a trap for a naive consumer.</p>
<p>The tooling we built (<code>WOFPlacenameCache</code>, <code>PlacetypeDataSource</code>, the Piscina pipeline, <code>pluckPlacetypeSpec</code>, <code>AsyncSpliterator.asMany</code>) is our attempt to close the gap between "WOF exists" and "WOF is usable as a geocoder component." It's not complete, but the architecture is sound and the incomplete pieces are documented.</p>
<p>If you're building a geocoder or any location-aware system and you need hierarchy + multilingual names from open data, WOF is probably your starting point. The gotchas above are the things we wish someone had told us when we started.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="where-to-look">Where to look<a href="https://mailwoman.sister.software/blog/taming-whosonfirst#where-to-look" class="hash-link" aria-label="Direct link to Where to look" title="Direct link to Where to look" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://github.com/whosonfirst-data" target="_blank" rel="noopener noreferrer" class="">Who's On First on GitHub</a> — the source repos</li>
<li class=""><a href="https://data.geocode.earth/wof/dist/sqlite/" target="_blank" rel="noopener noreferrer" class="">Geocode Earth WOF distributions</a> — pre-built SQLite files</li>
<li class=""><a href="https://spelunker.whosonfirst.org/" target="_blank" rel="noopener noreferrer" class="">Spelunker</a> — the official WOF browser/explorer</li>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/concepts/whosonfirst-gotchas"><code>docs/articles/concepts/whosonfirst-gotchas.md</code></a> — the stable reference version of this article (data model, gotchas, tooling architecture)</li>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/concepts/wof-data-pipeline"><code>docs/articles/concepts/wof-data-pipeline.md</code></a> — our internal architecture doc for the ingest pipeline</li>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/concepts/resolver-and-wof"><code>docs/articles/concepts/resolver-and-wof.md</code></a> — how the runtime resolver queries WOF</li>
<li class=""><a href="https://github.com/sister-software/mailwoman/tree/main/core/resources/whosonfirst" target="_blank" rel="noopener noreferrer" class=""><code>core/resources/whosonfirst/</code></a> — the TypeScript tooling source</li>
</ul>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Resolver / WOF" term="Resolver / WOF"/>
        <category label="Geocoding" term="Geocoding"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Two voices arguing inside a model — a beginner-friendly debugging story]]></title>
        <id>https://mailwoman.sister.software/blog/two-voices-arguing-in-a-model</id>
        <link href="https://mailwoman.sister.software/blog/two-voices-arguing-in-a-model"/>
        <updated>2026-05-24T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Mailwoman is an open-source address parser that runs in Node and the browser. It uses a small neural model to label address components ("350" = house number, "NY" = region, etc.). Try the live demo.]]></summary>
        <content type="html"><![CDATA[<div class="theme-admonition theme-admonition-note admonition_WCGJ alert alert--secondary"><div class="admonitionHeading_GCBg"><span class="admonitionIcon_L39b"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>If you found this via search</div><div class="admonitionContent_pbrs"><p><strong><a href="https://mailwoman.sister.software/" target="_blank" rel="noopener noreferrer" class="">Mailwoman</a></strong> is an open-source address parser that runs in Node and the browser. It uses a small neural model to label address components ("350" = house number, "NY" = region, etc.). <a class="" href="https://mailwoman.sister.software/demo">Try the live demo.</a></p><p>This post is a beginner-friendly debugging story — no ML background needed. If you just want the project status, see <a class="" href="https://mailwoman.sister.software/docs/status">what ships today</a>.</p></div></div>
<p>This is the third post in a series about a training problem we've been chasing. The <a class="" href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect">first two</a> <a class="" href="https://mailwoman.sister.software/blog/v0-5-0-bisect-by-elimination">were</a> written for software engineers. This one is for someone who is just starting to learn about AI and machine learning — no jargon assumed, no math beyond high-school algebra. The point is to show you what real ML debugging looks like, using a problem we actually had this week.</p>
<p>If you've been programming for a while but ML feels opaque, this post is for you. The core technique we used — figuring out which of two instructions our model was listening to — turns out to be much more like ordinary debugging than the field usually makes it sound.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-were-building-in-one-paragraph">What we're building, in one paragraph<a href="https://mailwoman.sister.software/blog/two-voices-arguing-in-a-model#what-were-building-in-one-paragraph" class="hash-link" aria-label="Direct link to What we're building, in one paragraph" title="Direct link to What we're building, in one paragraph" translate="no">​</a></h2>
<p>Mailwoman is a piece of software that reads address strings (<code>"350 5th Avenue, New York, NY 10118"</code>) and turns them into structured place information ("this is in Manhattan, here are the coordinates, here's the postcode, etc."). It uses a small AI model to do the parsing. "Small" by AI standards: about 9 million numbers inside it. (For comparison: GPT-4 is rumoured to have over a trillion.)</p>
<p>We don't need a giant model because the task is narrow: addresses follow patterns, and we just need to identify which parts of a string are which (<code>350</code> is a house number, <code>5th Avenue</code> is a street, etc.).</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-training-a-model-actually-looks-like">What "training a model" actually looks like<a href="https://mailwoman.sister.software/blog/two-voices-arguing-in-a-model#what-training-a-model-actually-looks-like" class="hash-link" aria-label="Direct link to What &quot;training a model&quot; actually looks like" title="Direct link to What &quot;training a model&quot; actually looks like" translate="no">​</a></h2>
<p>Forget everything you've seen about AI in movies. Training a model is, mechanically, this:</p>
<ol>
<li class="">You have a model with millions of numbers inside it (call them "weights"). At the start they're random.</li>
<li class="">You have a pile of example data — addresses with the correct answers labelled, like flashcards.</li>
<li class="">You show the model an address. It guesses what each part is.</li>
<li class="">You compare the guess to the correct answer. The difference is a number called <strong>loss</strong>: low loss means a good guess, high loss means a bad guess.</li>
<li class="">The training algorithm then tweaks the millions of internal numbers to make the loss a little bit smaller next time.</li>
<li class="">Repeat thousands or millions of times.</li>
</ol>
<p>The "intelligence" of the model is, loosely speaking, an enormous table of patterns encoded in those numbers, refined slowly by 50,000 rounds of "you said it was a street name, but it was actually a postcode; here, nudge these specific numbers a tiny bit so you'll guess better next time."</p>
<p>If you've ever debugged a function by running it, looking at the wrong output, and tweaking one parameter at a time until the output came out right, then each run-look-tweak cycle was one iteration of model training, done by hand.</p>
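<p>Here is that loop shrunk down to a toy you can read in one sitting. It is only an illustration, not how Mailwoman is trained: the "model" has a single weight instead of millions, the flashcards are made-up numbers where the right answer is always three times the input, and the weight starts at zero (not at a random value) so the run is repeatable. The names <code>train</code> and <code>Example</code> are invented for this sketch.</p>

```typescript
// One flashcard: an input and the answer we want the model to give.
interface Example {
  input: number
  target: number
}

// Train a one-weight model whose guess is simply weight * input.
// Returns the final weight and the loss recorded at every step.
function train(examples: readonly Example[], steps: number, learningRate: number) {
  let weight = 0
  const losses: number[] = []

  for (let step = 0; step < steps; step++) {
    let loss = 0
    let nudge = 0
    for (const { input, target } of examples) {
      const error = weight * input - target // how far off the guess was
      loss += error * error                 // squared, so misses never cancel out
      nudge += 2 * error * input            // which direction makes the loss smaller
    }
    losses.push(loss / examples.length)
    // Tweak the weight a little in the direction that reduces the loss.
    weight -= learningRate * (nudge / examples.length)
  }

  return { weight, losses }
}

const flashcards: Example[] = [
  { input: 1, target: 3 },
  { input: 2, target: 6 },
  { input: 3, target: 9 },
]
const result = train(flashcards, 50, 0.05)
console.log("learned weight:", result.weight)
console.log("first loss:", result.losses[0], "last loss:", result.losses[result.losses.length - 1])
```

<p>Run it and the learned weight lands very close to 3, with the recorded loss shrinking step by step. A real model does the same thing with millions of weights at once.</p>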
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-loss-curve-you-keep-hearing-about">The "loss curve" you keep hearing about<a href="https://mailwoman.sister.software/blog/two-voices-arguing-in-a-model#the-loss-curve-you-keep-hearing-about" class="hash-link" aria-label="Direct link to The &quot;loss curve&quot; you keep hearing about" title="Direct link to The &quot;loss curve&quot; you keep hearing about" translate="no">​</a></h2>
<p>People who train models stare at a chart called the loss curve all day. It looks like this:</p>
<div class="language-text codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-text codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">loss</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> ^</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |   X</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |    X</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |     X</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |      XX</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |        XXX</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |           XXXXX</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |                XXXXXXXXXX</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> +----------------------------&gt; step</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">   0      500      1000     1500</span><br></div></code></pre></div></div>
<p>Each X marks the loss at one point in training. Loss starts high (the model is randomly guessing) and decreases as the model learns. A good training run looks exactly like that: descending until it plateaus.</p>
<p>Now here's our actual loss curve from one of nine training runs we did this month:</p>
<div class="language-text codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-text codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">loss</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> ^                                  XXXX</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |                                 X    X</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |                                X      XX</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |                               X         XX</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |    X                         X            XX</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |     X                       X                X</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |      X                    XX                  X</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |       XX                 X                     ...</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |         XXX             X</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |            XXXXX       X</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> |                 XXXXXXX</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> +----------------------------&gt; step</span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">   0     500     800     1100</span><br></div></code></pre></div></div>
<p>The model descends nicely for 500 steps (warmup), settles at a low loss for a bit — and then climbs back up. By the end, it's worse than when it started. We trained it on 50,000 examples and it got <em>worse</em>.</p>
<p>Every training engineer's heart sinks at this curve. It means something is wrong, and the model isn't telling us what. There's no stack trace. There's no exception. There's just a chart that says "I learned, and then I unlearned."</p>
<p>We saw this exact shape in nine different runs. Different learning speeds, different model sizes, different feature combinations. Every time: clean descent, then catastrophic climb.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-clue-wed-been-missing">The clue we'd been missing<a href="https://mailwoman.sister.software/blog/two-voices-arguing-in-a-model#the-clue-wed-been-missing" class="hash-link" aria-label="Direct link to The clue we'd been missing" title="Direct link to The clue we'd been missing" translate="no">​</a></h2>
<p>To find a bug in a program, you usually narrow it down by ruling out parts of the code one at a time. ML debugging works the same way: you change one thing, retrain, look at the curve. But each "retrain" takes hours and costs real money on a rented GPU. You learn to be careful about which experiments are worth running.</p>
<p>For weeks we'd been ruling out hypotheses:</p>
<ul>
<li class="">Maybe the learning rate is too high? (No — lowering it just delayed the climb.)</li>
<li class="">Maybe the model is too small? (No — we made it bigger and the same thing happened.)</li>
<li class="">Maybe a new feature we added is destabilising it? (No — we turned it off, same problem.)</li>
<li class="">Maybe the data has a bug? (Couldn't rule out, expensive to check.)</li>
</ul>
<p>Then somebody pointed out a thing we hadn't questioned: <strong>we were training the model with two different goals at once.</strong></p>
<p>The model has two scoring systems. We'll call them Voice A and Voice B.</p>
<ul>
<li class=""><strong>Voice A</strong> says: "How good are your guesses for each individual word? Did you tag '350' correctly as a house number? Did you tag 'NY' correctly as a region?"</li>
<li class=""><strong>Voice B</strong> says: "How sensible is your overall pattern? Is your sequence of tags structurally valid? Does it look like a real address?"</li>
</ul>
<p>Both voices are useful. A working geocoder needs both per-word accuracy <em>and</em> sensible patterns. We'd been adding their feedback together (with Voice B's contribution scaled down to 5%) and using that combined signal to train the model.</p>
<p>The question we'd never asked: <strong>were Voice A and Voice B telling the model to do the same thing?</strong></p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-five-minute-diagnostic">The five-minute diagnostic<a href="https://mailwoman.sister.software/blog/two-voices-arguing-in-a-model#the-five-minute-diagnostic" class="hash-link" aria-label="Direct link to The five-minute diagnostic" title="Direct link to The five-minute diagnostic" translate="no">​</a></h2>
<p>Here's the part that surprised me about ML debugging — the technique we used could be explained to a curious teenager.</p>
<p>When you train a model, every weight inside it gets nudged in a particular direction based on the combined loss. That nudge is called a <strong>gradient</strong>. If gradients are big, the weight moves a lot per step; if they're small, it moves a little.</p>
<p>The two voices each contribute their own gradient. They get added together (with Voice B at 5%) to produce the final nudge.</p>
<p>So we asked: at the moment just before the model starts unlearning, <strong>which voice is doing most of the talking?</strong> We took a saved snapshot of the model from that moment, fed it a few example addresses, and measured the size of each voice's contribution to the gradient separately.</p>
<p>We expected the answer to be something like "Voice A is 20× louder than Voice B" — meaning Voice B was contributing almost nothing, which would mean the 5% scaling we'd set was actually appropriate.</p>
<p>What we got instead: <strong>Voice B's gradient was 16× LOUDER than Voice A's.</strong></p>
<p>Wait. Voice B was supposed to be scaled to 5%. But the raw gradient was 16× larger than Voice A's. Multiply 5% by 16 and Voice B's effective contribution to the model's training was actually 80% of Voice A's. The hand-tuned scaling knob we'd been treating as "Voice B contributes lightly" was secretly producing "Voice B contributes nearly as much as Voice A."</p>
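<p>To make the probe concrete, here is a minimal sketch in plain Python. It is not our training code: the two loss functions are hypothetical toys chosen so the raw ratio comes out near 16, and <code>numeric_grad</code> stands in for whatever autograd your framework provides. The point is only the shape of the measurement: take each loss's gradient separately, compare the norms, then apply the weight.</p>

```python
import math

def numeric_grad(loss_fn, params, eps=1e-6):
    """Central-difference gradient; a toy stand-in for a framework's autograd."""
    grad = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grad

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

# Hypothetical toy losses (assumptions, not the real model's losses).
def voice_a(p):
    return 0.5 * ((p[0] - 1.0) ** 2 + p[1] ** 2)

def voice_b(p):
    return 8.0 * ((p[0] + 1.0) ** 2 + p[1] ** 2)

def probe(params, weight_b):
    """Measure each loss's gradient norm separately, before and after weighting."""
    norm_a = l2_norm(numeric_grad(voice_a, params))
    norm_b = l2_norm(numeric_grad(voice_b, params))
    raw_ratio = norm_b / norm_a
    return {
        "norm_a": norm_a,
        "norm_b": norm_b,
        "raw_ratio": raw_ratio,
        "effective_ratio": weight_b * raw_ratio,
    }

# Probe one saved point with Voice B scaled to 5%.
report = probe([0.0, 1.0], weight_b=0.05)
print(report)
```

<p>On this toy the raw ratio is about 16 and the effective ratio is about 0.8, the same arithmetic as above: a 5% weight does not mean a 5% share of the gradient.</p>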
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-cooperative-vs-conflict-picture">The cooperative-vs-conflict picture<a href="https://mailwoman.sister.software/blog/two-voices-arguing-in-a-model#the-cooperative-vs-conflict-picture" class="hash-link" aria-label="Direct link to The cooperative-vs-conflict picture" title="Direct link to The cooperative-vs-conflict picture" translate="no">​</a></h2>
<p>Here's the framing that made it all click.</p>
<p>Imagine you're a hiker on a foggy hill, trying to walk to the lowest point in the landscape. You can't see far, so you have two GPS devices that each tell you "go downhill, that way."</p>
<ul>
<li class=""><strong>At the top of the hill (high loss),</strong> both GPSes agree: every direction is downhill, so they both point you roughly the same way. You make progress. Loss decreases.</li>
<li class=""><strong>As you descend into a specific valley (loss gets lower),</strong> the landscape becomes more detailed. Suddenly the two GPSes start disagreeing: Voice A says the valley floor is to the left; Voice B says it's to the right. They don't see the same valley.</li>
</ul>
<p>When that happens, your hiking direction is mostly determined by <strong>whichever GPS is shouting louder</strong>. Voice B's raw signal was 16× louder than Voice A's, so even turned down to 5% it was still shouting about 80% as loudly as Voice A. That is loud enough that you stop following Voice A's instructions and get dragged toward Voice B's, even though Voice A was correct about where the valley floor actually was. You climb out of one valley toward a different point that Voice B prefers, and Voice A's loss (the per-word accuracy) gets worse.</p>
<p>That's literally what happened to our model. Above loss 0.41 (high up on the hill), both voices agreed and the model descended cleanly. Below 0.41, they started disagreeing, and Voice B's louder gradient pulled the model away from the basin Voice A had been guiding it toward. The model's per-word accuracy got worse and worse, which we saw as loss climbing back up.</p>
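<p>One way to put a number on "agree" and "disagree" is the cosine between the two gradients: near +1 they point the same way, near -1 they pull against each other. The sketch below uses a hypothetical two-parameter landscape (Voice A's valley floor at x=+1, Voice B's at x=-1) with hand-written gradients, so treat it as an illustration of the picture rather than a measurement from our model.</p>

```python
import math

def cosine(u, v):
    """Cosine of the angle between two gradient vectors: +1 agree, -1 oppose."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy gradients (assumptions for illustration only).
def grad_a(p):
    return [p[0] - 1.0, p[1]]

def grad_b(p):
    return [16.0 * (p[0] + 1.0), 16.0 * p[1]]

def combined_step(p, weight_b=0.05):
    """Direction the optimiser follows: minus (grad_a + weight_b * grad_b)."""
    return [-(a + weight_b * b) for a, b in zip(grad_a(p), grad_b(p))]

high_on_hill = [0.0, 10.0]   # far from both valley floors
near_floor = [0.0, 0.1]      # down where the two floors differ

agree = cosine(grad_a(high_on_hill), grad_b(high_on_hill))
conflict = cosine(grad_a(near_floor), grad_b(near_floor))

print(f"high on the hill: cosine = {agree:+.2f}")
print(f"near the floor:   cosine = {conflict:+.2f}")
print("Voice A alone would step:", [-g for g in grad_a(near_floor)])
print("combined step:           ", combined_step(near_floor))
```

<p>High on the hill the cosine is close to +1; near the floor it flips to close to -1, and the combined step toward Voice A's floor is much shorter than the step Voice A would take alone.</p>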
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-fix">The fix<a href="https://mailwoman.sister.software/blog/two-voices-arguing-in-a-model#the-fix" class="hash-link" aria-label="Direct link to The fix" title="Direct link to The fix" translate="no">​</a></h2>
<p>Once you understand what's happening, the fix is mechanical: <strong>silence Voice B during training</strong>. Don't let it contribute to the gradient at all.</p>
<p>But we still want Voice B's contribution somewhere, because it really does encode useful structural rules (no orphan tags, no invalid BIO sequences). So we keep Voice B for <strong>inference</strong> (the moment when we actually use the trained model to parse an address) but not during training.</p>
<p>This is a one-line change in the code: when Voice B's weight is set to 0, don't even compute it. The training loop then runs purely on Voice A's gradient, which has been the well-behaved one all along.</p>
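<p>In code, the change looks roughly like the sketch below. The function and parameter names are hypothetical (our real training loop is not shown here); what matters is the early return, which means the structural loss is never evaluated when its weight is 0 and therefore contributes no gradient.</p>

```python
def training_loss(per_token_loss, structure_loss_fn, structure_weight):
    """Combine the two losses, skipping the structural term entirely at weight 0."""
    if structure_weight == 0:
        # Voice B is never evaluated, so it cannot add anything to the gradient.
        return per_token_loss
    return per_token_loss + structure_weight * structure_loss_fn()

# Track whether the (hypothetical) structural loss was ever computed.
calls = []

def structure_loss():
    calls.append("called")
    return 4.0

old = training_loss(0.41, structure_loss, structure_weight=0.05)
new = training_loss(0.41, structure_loss, structure_weight=0.0)

print("with Voice B at 5%:", old)
print("with Voice B silenced:", new)
print("times Voice B was computed:", len(calls))
```

<p>Passing the structural loss as a callable, rather than a precomputed value, is a design choice for the sketch: it makes "don't even compute it" something you can check.</p>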
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-we-tell-ourselves-we-learned">What we tell ourselves we learned<a href="https://mailwoman.sister.software/blog/two-voices-arguing-in-a-model#what-we-tell-ourselves-we-learned" class="hash-link" aria-label="Direct link to What we tell ourselves we learned" title="Direct link to What we tell ourselves we learned" translate="no">​</a></h2>
<p>A few things stand out:</p>
<ol>
<li class=""><strong>"Add two losses together with weights" sounds simple. It can be a disaster.</strong> Two loss functions can have wildly different gradient magnitudes even when their loss <em>values</em> look comparable. Multiplicative scaling on the loss does NOT produce balanced contributions to the optimiser. Watching the loss values fooled us for nine training runs.</li>
<li class=""><strong>The five-minute diagnostic was more valuable than the previous month of retraining experiments.</strong> Every "what if we change this knob and retrain" experiment cost hours. The gradient-norm probe cost five minutes and gave a sharper answer than any of them. It works because it asks a more fundamental question: not "what's the result," but "what's the model actually listening to?"</li>
<li class=""><strong>ML debugging is more like programming debugging than the field admits.</strong> Once you have a vocabulary for what's happening, the techniques are familiar: bisect, isolate, instrument, hypothesise, test. The hard part is finding the right vocabulary for <em>what's actually happening inside the model</em>. Once you have it, the bug is usually findable.</li>
<li class=""><strong>Cheap experiments first.</strong> A 5-minute probe should always run before a 25-hour retrain. We didn't think to run the probe earlier because nobody had told us it was a thing. Now we know.</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="where-to-read-more">Where to read more<a href="https://mailwoman.sister.software/blog/two-voices-arguing-in-a-model#where-to-read-more" class="hash-link" aria-label="Direct link to Where to read more" title="Direct link to Where to read more" translate="no">​</a></h2>
<ul>
<li class=""><a class="" href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect">The original failure post</a> — written for engineers, has the loss curves and recipe details.</li>
<li class=""><a class="" href="https://mailwoman.sister.software/blog/v0-5-0-bisect-by-elimination">The bisect-by-elimination post</a> — what we ruled out before the diagnostic.</li>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/concepts/dual-loss-curvature-conflict">The technical writeup</a> — for engineers who want the gradient math and the cooperative-vs-conflict framing in detail.</li>
</ul>
<p>If you're starting out in ML and any of this helped, the mailbox is open: <code>contact@sister.software</code>. We'd genuinely like to hear what gaps the existing intro material still leaves.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="update--it-worked">Update — it worked<a href="https://mailwoman.sister.software/blog/two-voices-arguing-in-a-model#update--it-worked" class="hash-link" aria-label="Direct link to Update — it worked" title="Direct link to Update — it worked" translate="no">​</a></h2>
<p>We wrote this post while the CE-only experiment was still running. It passed. The model trained past step 2000 — the point where every prior run had diverged — with no loss climb at all. Final validation accuracy: 0.444, the best number any run in this project has ever produced. The full 50,000-step training run is now in progress.</p>
<p>Without the competing voice, the model settled deeper into its basin than any prior run could before being dragged out.</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Model training" term="Model training"/>
        <category label="Beginner-friendly" term="Beginner-friendly"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Four training runs, zero shipped weights — bisecting v0.5.0's divergence]]></title>
        <id>https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect</id>
        <link href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect"/>
        <updated>2026-05-24T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Mailwoman is an open-source address parser. This post is a training log entry from May 2026 documenting the v0.5.0 divergence investigation. For current project status, see what ships today.]]></summary>
        <content type="html"><![CDATA[<div class="theme-admonition theme-admonition-note admonition_WCGJ alert alert--secondary"><div class="admonitionHeading_GCBg"><span class="admonitionIcon_L39b"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>If you found this via search</div><div class="admonitionContent_pbrs"><p><strong><a href="https://mailwoman.sister.software/" target="_blank" rel="noopener noreferrer" class="">Mailwoman</a></strong> is an open-source address parser. This post is a training log entry from May 2026 documenting the v0.5.0 divergence investigation. For current project status, see <a class="" href="https://mailwoman.sister.software/docs/status">what ships today</a>.</p></div></div>
<p>v0.5.0 was the <strong>fresh-slate ship</strong>: new tokenizer, expanded corpus, new architecture, new reconcile stage. The plan was to bundle several months of structural improvements into one big iteration and pay the cost once. Most of it landed clean. The classifier didn't.</p>
<p>This post walks through the four training attempts the v0.5.0 C-train made overnight, the bisect that ruled out three plausible explanations, and what we think the remaining culprit is. It's a sister piece to <a class="" href="https://mailwoman.sister.software/blog/v0-4-0-ablation-campaign">the v0.4.0 retrospective</a> — same shape of failure, different diagnostic ladder.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-v050-shipped-before-the-train-started">What v0.5.0 shipped before the train started<a href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect#what-v050-shipped-before-the-train-started" class="hash-link" aria-label="Direct link to What v0.5.0 shipped before the train started" title="Direct link to What v0.5.0 shipped before the train started" translate="no">​</a></h2>
<p>Six threads merged to <code>main</code> before the C-train attempts began:</p>
<ul>
<li class=""><strong>Thread A1</strong> — sentencepiece tokenizer retrained on corpus-v0.4.0. Overall byte-fallback 36.7% → 18.2% on the multi-script eval; CJK 80% → 45.2%; Armenian / Devanagari to 0%. Halved on the eval fixture and validated against a real adversarial slice.</li>
<li class=""><strong>Thread B + B2</strong> — <code>corpus-v0.4.0</code> adds 4,771 kryptonite rows (NY-NY Steakhouse, Paris-Texas, Saint Petersburg FL) plus 73,316 transliteration pairs across CJK / Cyrillic / Hangul / Han / Armenian, all DeepSeek-generated and validated through a substring-match aligner that rejected ~1.1% of the LLM output as misaligned.</li>
<li class=""><strong>Thread C-s</strong> — classifier code path with top-k inference and a phrase-prior input layer that conditions on Stage 2.7's proposed spans. Forward-pass tested on stub data; no full train.</li>
<li class=""><strong>Thread D-s</strong> — <code>reconcile.ts</code> joint decoder. Beam search over (span × tag × resolver candidate) with concordance scoring via WOF parent_id chains. The empty-parse trap was caught early and fixed with an inclusion log-bonus.</li>
<li class=""><strong>Thread E</strong> — <code>@mailwoman/phrase-grouper</code> workspace. Rule-based span proposer feeding Stage 2.7.</li>
<li class=""><strong>Thread F</strong> — verdict-smoke discipline. New <code>--smoke-mode constant</code> flag so the cosine-LR mask that hid v0.4.0's divergence cannot reoccur.</li>
</ul>
<p>C-train was the experiment that actually used all of it together for the first time.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-recipe-we-tried-first">The recipe we tried first<a href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect#the-recipe-we-tried-first" class="hash-link" aria-label="Direct link to The recipe we tried first" title="Direct link to The recipe we tried first" translate="no">​</a></h2>
<p>Going in, we had a clean confirmation from the operator: hidden_size=384 (up from v0.4.0's 256), effective batch 128 via batch=16 × grad_accum=8, constant LR, starting LR=1.5e-4 (the same LR v0.3.0 had to drop to), top-k inference and phrase-prior conditioning ON (PR #128). The corpus was A1 tokenizer + corpus-v0.4.0. Recipe knobs §1 (per_token CRF normalization) and §3 (class_weights) were carried over from the C-s scaffold: a host-side YAML-drafting decision, not an operator confirmation.</p>
<p>The 50-step constant-LR smoke passed cleanly. val_macro_f1 climbed 0.121 → 0.187 across 50 steps. The recipe looked stable.</p>
<p>We promoted it to a full run. The loss started climbing at step 700, and we killed the run at step 1000.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="four-attempts-one-fingerprint">Four attempts, one fingerprint<a href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect#four-attempts-one-fingerprint" class="hash-link" aria-label="Direct link to Four attempts, one fingerprint" title="Direct link to Four attempts, one fingerprint" translate="no">​</a></h2>
<p>The pattern repeated, with each variant getting marginally further before the same shape of failure took over. All four runs use the same constant-LR schedule (mode A per <code>VERDICT_SMOKES.md</code>) and the same effective batch of 128.</p>
<div class="language-text codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-text codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">v1: h384, §1+§3 ON,  lr=1.5e-4</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step  500: train_loss=0.69 (warmup end, LR plateau)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step  600: train_loss=0.61 (settled)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step  700: train_loss=1.49 (climb start)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step 1000: train_loss=3.29 (killed)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">v2: h384, §1+§3 ON,  lr=1e-4</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step  500: train_loss=0.90 (warmup end)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step  900: train_loss=0.51 (best)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step 1000: train_loss=0.69 (climb start)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step 1200: train_loss=1.96 (killed)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">v3: h384, §1+§3 OFF, lr=1.5e-4  ← v0.4.0-stable recipe</span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">  step  500: train_loss=0.63 (warmup end)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step  700: train_loss=0.41 (best ever — better than v1/v2)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step  800: train_loss=1.21 (climb start)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step  900: train_loss=1.97 (killed)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">h256-bisect: h256, §1+§3 OFF, lr=1.5e-4, eff_batch=128</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step  500: train_loss=0.67 + val_macro_f1=0.311</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step 1000: train_loss=0.31 + val_macro_f1=0.399 (best ever)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step 1050: train_loss=0.42 (climb start)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  step 1500: train_loss=1.85 + val_macro_f1=0.229 (killed)</span><br></div></code></pre></div></div>
<p>The fingerprint is identical to v0.4.0's. Loss descends through warmup, settles for a few hundred steps near the bottom, then climbs back to its starting magnitude over 100-300 steps. val_macro_f1 (where we measured it) does the same: peaks around the time the loss bottoms out, then collapses.</p>
<p>The only thing that shifts between runs is <strong>how deep the loss gets before the climb starts</strong>. v1 bottomed at 0.61, v2 at 0.51, v3 at 0.41, h256-bisect at 0.31. Each successive variant trained better for longer, and then collapsed in exactly the same way.</p>
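<p>The same reading can be done mechanically. The sketch below takes the logged points from the block above (only those sparse checkpoints, not the full CSVs) and reports, for each run, the lowest logged loss and the first later checkpoint where the loss is higher. The <code>summarize</code> helper is ours for illustration and is not part of the training tooling.</p>

```python
# (step, train_loss) pairs copied from the run log above.
RUNS = {
    "v1": [(500, 0.69), (600, 0.61), (700, 1.49), (1000, 3.29)],
    "v2": [(500, 0.90), (900, 0.51), (1000, 0.69), (1200, 1.96)],
    "v3": [(500, 0.63), (700, 0.41), (800, 1.21), (900, 1.97)],
    "h256-bisect": [(500, 0.67), (1000, 0.31), (1050, 0.42), (1500, 1.85)],
}

def summarize(points):
    """Return (best_step, best_loss, climb_step) for one run's logged points."""
    best_step, best_loss = min(points, key=lambda point: point[1])
    # First logged checkpoint after the bottom where the loss is higher again.
    climb_step = next(
        (step for step, loss in points if step > best_step and loss > best_loss),
        None,
    )
    return best_step, best_loss, climb_step

summary = {name: summarize(points) for name, points in RUNS.items()}

for name, (best_step, best_loss, climb_step) in summary.items():
    print(f"{name:12s} bottom={best_loss:.2f} at step {best_step}, climb from step {climb_step}")
```

<p>Run against the logged points, the bottoms come out as 0.61, 0.51, 0.41 and 0.31 in run order, each followed by a climb at the very next checkpoint.</p>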
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-the-bisect-ruled-out">What the bisect ruled out<a href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect#what-the-bisect-ruled-out" class="hash-link" aria-label="Direct link to What the bisect ruled out" title="Direct link to What the bisect ruled out" translate="no">​</a></h2>
<p>We ran three knob-changes between v1 and h256-bisect, each motivated by a different hypothesis. None of them held.</p>
<p><strong>Learning rate isn't it.</strong> v0.4.0's bisect already showed that a factor-2 LR drop only buys a factor-1.3 step delay, ruling out "we picked too high an LR" as a full explanation. v1 → v2 confirmed the same shape: 1.5e-4 → 1e-4 moved the divergence from step 700 to step 1000. Same dynamic, just shifted later. The LR controls <em>when</em>, not <em>whether</em>.</p>
<p><strong>Recipe knobs §1+§3 aren't it.</strong> v0.4.0's retrospective concluded the destabilizer was in the recipe, not the LR, and shipped with §1 (per_token CRF) and §3 (class_weights) OFF. v3 reverted those knobs and dropped back to LR=1.5e-4, the canonical v0.4.0-stable recipe. v3 trained better than v1/v2 (bottom of 0.41 vs 0.61/0.51), had zero GPU hangs (vs v2's six), and still diverged: the climb started at step 800 and the run was killed at step 900.</p>
<p><strong>Hidden size isn't it.</strong> h256-bisect reverted the only architectural change still in the recipe: 384 → 256 hidden, 6 → 4 heads, 1536 → 1024 intermediate. With everything else at v0.4.0-shipped settings (eff_batch=128, LR=1.5e-4, §1+§3 OFF), this configuration is <em>identical to v0.4.0's shipped recipe</em>, except for two architectural pieces we haven't touched yet. h256-bisect was the best-performing run of the four (peak val_macro_f1=0.399, train_loss=0.31), and it diverged anyway.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-remaining-suspects">The remaining suspects<a href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect#the-remaining-suspects" class="hash-link" aria-label="Direct link to The remaining suspects" title="Direct link to The remaining suspects" translate="no">​</a></h2>
<p>Only two architectural changes from the proven-stable v0.4.0 baseline remain:</p>
<ol>
<li class="">
<p><strong>Phrase-prior input features</strong> (PR #128). The classifier's input embedding takes 10 extra per-token features encoding the phrase grouper's proposed spans — <code>is_phrase_start</code> / <code>is_phrase_mid</code> / <code>is_phrase_end</code> plus a one-hot for the proposed <code>PhraseKind</code>. New projection layer; new gradient pathway.</p>
</li>
<li class="">
<p><strong>A1 tokenizer</strong>. New 48K vocab (up from v0.1.0's 16K), trained on corpus-v0.4.0 including the transliteration adapter. The embedding table is sized to that vocab and the model has never trained against it before.</p>
</li>
</ol>
<p>Both are real architectural changes from v0.4.0. Either could plausibly produce a confident-wrong degenerate minimum that the loss bottoms out at and then escapes from. We can't tell from the curves alone — the shape is the same regardless.</p>
<p>The cheapest next bisect is <strong>phrase priors OFF</strong> (revert PR #128's input-layer features, keep A1 tokenizer). One YAML knob change, ~15min smoke + a partial train if smoke passes. That isolates whether the destabilizer is the phrase-prior projection or the new tokenizer's interaction.</p>
<p>If phrase-priors-off ALSO diverges, the next bisect is the A1 tokenizer itself: revert to v0.1.0's sentencepiece weights against the same corpus + recipe + h256. At that point we're back to v0.4.0's proven-stable shipping configuration; if even that diverges, the destabilizer is in <code>corpus-v0.4.0</code>'s composition (the transliteration shards' distribution might be the issue, not the tokenizer trained on them).</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="a-discipline-lesson-worth-keeping">A discipline lesson worth keeping<a href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect#a-discipline-lesson-worth-keeping" class="hash-link" aria-label="Direct link to A discipline lesson worth keeping" title="Direct link to A discipline lesson worth keeping" translate="no">​</a></h2>
<p>We caught a real bug in our verdict-smoke discipline along the way. The original 50-step smoke for v1 passed cleanly — train_loss descended, val_macro_f1 climbed, no NaN or spike. We promoted to full. Full diverged at step 1000 in a regime the smoke had never reached.</p>
<p>Two things had to change for the smoke to be a real predictor:</p>
<ul>
<li class=""><strong>Smoke length matters.</strong> 50 steps captures only the warmup descent. The sustained-peak-LR regime where the recipe destabilises starts at step 500 in our schedule. Smokes need to be long enough to spend several hundred steps near peak LR — we ended up at 1500 steps as the floor.</li>
<li class=""><strong>Effective batch must match the full run.</strong> v3's smoke ran at batch_size=8 grad_accum=1 (eff_batch=8); the full run was eff_batch=128. The smoke said stable; the full diverged. <strong>The recipe's stability is batch-geometry-dependent.</strong> A smoke that doesn't reproduce the full-run gradient noise is a smoke that can't detect this class of failure.</li>
</ul>
<p>The constant-LR-mode discipline that landed in Thread F is still correct: it's what made the v0.5.0 destabilization observable at all instead of hiding under cosine decay. But the smoke configuration also needs to mirror the full run's <em>batch geometry</em>, not just its LR schedule.</p>
<p>Both lessons will land in <code>VERDICT_SMOKES.md</code> as a follow-up. The current text describes constant-LR mode as the gate but doesn't say "your smoke's eff_batch must match the full run." That's an obvious-in-hindsight footgun.</p>
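<p>A pre-flight check along these lines could enforce both lessons before a smoke is accepted as a gate. This is a sketch, not something that exists in the repo: <code>RunGeometry</code>, <code>smoke_problems</code> and <code>MIN_SMOKE_STEPS</code> are hypothetical names, and the only numbers taken from this post are the 1500-step floor and the batch settings.</p>

```python
from dataclasses import dataclass

MIN_SMOKE_STEPS = 1500  # floor we settled on; assumes warmup ends at step 500

@dataclass(frozen=True)
class RunGeometry:
    batch_size: int
    grad_accum: int

    @property
    def effective_batch(self):
        # Samples contributing to each optimiser step.
        return self.batch_size * self.grad_accum

def smoke_problems(smoke, smoke_steps, full):
    """Return a list of reasons this smoke cannot stand in for the full run."""
    problems = []
    if smoke.effective_batch != full.effective_batch:
        problems.append(
            f"eff_batch mismatch: smoke={smoke.effective_batch}, full={full.effective_batch}"
        )
    if MIN_SMOKE_STEPS > smoke_steps:
        problems.append(f"smoke too short: {smoke_steps} steps, need {MIN_SMOKE_STEPS}")
    return problems

full_run = RunGeometry(batch_size=16, grad_accum=8)  # eff_batch=128

# Illustration combining both failure modes: small geometry and a 50-step smoke.
bad = smoke_problems(RunGeometry(batch_size=8, grad_accum=1), 50, full_run)
good = smoke_problems(RunGeometry(batch_size=16, grad_accum=8), 1500, full_run)

print("bad smoke:", bad)
print("good smoke:", good)
```

<p>Returning a list of problems instead of a single boolean keeps the failure message specific, which matters when the check runs unattended before an overnight train.</p>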
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-didnt-get-burned">What didn't get burned<a href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect#what-didnt-get-burned" class="hash-link" aria-label="Direct link to What didn't get burned" title="Direct link to What didn't get burned" translate="no">​</a></h2>
<p>Most of the v0.5.0 fresh-slate work survives this episode intact:</p>
<ul>
<li class=""><strong>A1 tokenizer's byte-fallback wins are real.</strong> Multi-script eval went from 36.7% to 18.2%, with B2's targeted scripts (CJK, Cyrillic, Hangul, Han, Armenian) all hitting or beating their v0.1.0 leakage baselines. That's a tokenizer that's actually fit for non-Latin addresses. It works fine for inference even though the classifier we'd train on top of it diverges.</li>
<li class=""><strong><code>corpus-v0.4.0</code> is sound.</strong> corpus-audit passes; the substring-match validator caught the LLM's alignment failures; both adapter additions land cleanly via the new MANIFEST.json-driven harness. Whatever destabilises the train, it isn't a corpus integrity problem.</li>
<li class=""><strong>Stage 5 reconcile + phrase grouper + verdict-smoke discipline</strong> all shipped and live in <code>main</code>. They run on v0.4.0 weights right now and produce correct output on the kryptonite catalogue. They'll keep working when v0.5.0 weights land.</li>
<li class=""><strong><code>TRAINING_ENV.md</code></strong> documents the playpen container's ROCm bootstrap recipe so the next training spawn doesn't re-discover the wall. ~15min one-time setup that took us most of an hour to invent the first time.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-honest-read">The honest read<a href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect#the-honest-read" class="hash-link" aria-label="Direct link to The honest read" title="Direct link to The honest read" translate="no">​</a></h2>
<p>We spent roughly four hours of GPU time on four diverging training runs, learned what isn't the destabilizer, and stopped before we burned a fifth shot at it. v0.5.0's classifier weights aren't shipping today.</p>
<p>That's still a useful outcome. We have a smaller hypothesis space (two architectural pieces left to bisect), better infrastructure than we started with (TRAINING_ENV, MANIFEST harness fix, longer smoke discipline), and a concrete recommendation for v0.5.0.1's first move. The v0.4.0 model continues to ship in production; nothing downstream is blocked.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-wed-tell-a-future-ourselves">What we'd tell a future ourselves<a href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect#what-wed-tell-a-future-ourselves" class="hash-link" aria-label="Direct link to What we'd tell a future ourselves" title="Direct link to What we'd tell a future ourselves" translate="no">​</a></h2>
<ol>
<li class=""><strong>Smoke geometry must match the full-run geometry.</strong> Constant-LR isn't enough if eff_batch differs. Either match the full-run batch shape or run two smokes — one at the smaller geometry for fast iteration, one at the full geometry as the actual gate.</li>
<li class=""><strong>The "destabilizes a few hundred steps after warmup ends" fingerprint isn't unique to v0.4.0.</strong> It's appearing in v0.5.0 too with a different recipe. Whatever it is, it's a deeper issue with the dual-loss landscape under sustained peak LR than either retrospective has so far named.</li>
<li class=""><strong>Plan for divergence retries in the time budget.</strong> A single full-train shot is rarely the experiment that ships. v0.4.0 needed five runs; v0.5.0 has needed four so far with at least two more bisects ahead. Realistic v0.X.0 release cadence is probably 8-12 training runs per cycle, not one.</li>
<li class=""><strong>Operator-side and host-claude-side recipe knobs need to be distinguished early.</strong> §1+§3 entered the v0.5.0 recipe by being in the C-s scaffold YAML — a host-claude inheritance decision, not an operator confirmation. That cost us v3 in the bisect ladder.</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="where-to-look">Where to look<a href="https://mailwoman.sister.software/blog/v0-5-0-c-train-bisect#where-to-look" class="hash-link" aria-label="Direct link to Where to look" title="Direct link to Where to look" translate="no">​</a></h2>
<ul>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/plan/v0-5-0-shipped"><code>docs/articles/plan/v0-5-0-shipped.md</code></a> — what landed and what didn't in the v0.5.0 bundle</li>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/plan/reference/VERDICT_SMOKES"><code>docs/articles/plan/reference/VERDICT_SMOKES.md</code></a> — the smoke discipline (with the eff_batch lesson pending)</li>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/plan/reference/TRAINING_ENV"><code>docs/articles/plan/reference/TRAINING_ENV.md</code></a> — playpen container ROCm bootstrap</li>
<li class="">Diverged train CSVs at <code>c-train-full-{DIVERGED-lr1.5e4,v2-watchdog-DIVERGED-step1200,v3-DIVERGED-step900,h256-bisect}.csv</code></li>
<li class=""><a class="" href="https://mailwoman.sister.software/blog/v0-4-0-ablation-campaign">v0.4.0 retrospective</a> — the sister piece</li>
</ul>
<p>Next: phrase-priors-off bisect. If that lands a stable train, we ship v0.5.0 weights without phrase-prior conditioning and pick up the priors as a v0.5.1 follow-up. If it doesn't, we revert the A1 tokenizer and confirm v0.4.0-shipped configuration trains cleanly on <code>corpus-v0.4.0</code> — which would isolate the destabilizer to corpus distribution effects from B2's transliteration mass.</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Model training" term="Model training"/>
        <category label="Retrospective" term="Retrospective"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Five training runs, one shipped checkpoint — what we learned from v0.4.0]]></title>
        <id>https://mailwoman.sister.software/blog/v0-4-0-ablation-campaign</id>
        <link href="https://mailwoman.sister.software/blog/v0-4-0-ablation-campaign"/>
        <updated>2026-05-23T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Mailwoman is an open-source address parser. This post is a historical retrospective on the v0.4.0 training campaign (May 2026). For current project status, see what ships today.]]></summary>
        <content type="html"><![CDATA[<div class="theme-admonition theme-admonition-note admonition_WCGJ alert alert--secondary"><div class="admonitionHeading_GCBg"><span class="admonitionIcon_L39b"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>If you found this via search</div><div class="admonitionContent_pbrs"><p><strong><a href="https://mailwoman.sister.software/" target="_blank" rel="noopener noreferrer" class="">Mailwoman</a></strong> is an open-source address parser. This post is a historical retrospective on the v0.4.0 training campaign (May 2026). For current project status, see <a class="" href="https://mailwoman.sister.software/docs/status">what ships today</a>.</p></div></div>
<p><code>@mailwoman/neural-weights-en-us@v0.4.0</code> (and the fr-fr sibling) shipped today as packaged artifacts (the npm publish is a separate step we do by hand). It is a mixed-result release: one clear win on fine-grained labels, two regressions on coarse labels that turned out to be mostly artifacts of how we measured. Almost everything we set out to do — combine three orthogonal training improvements into one ship — was empirically falsified by a divergence pattern we hadn't seen before.</p>
<p>This is a writeup of how the campaign went. We're publishing it for two reasons: to be honest about what the headline numbers mean, and because the way the failures stacked up is worth thinking about if you train your own NER-style models.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-v040-was-supposed-to-do">What v0.4.0 was supposed to do<a href="https://mailwoman.sister.software/blog/v0-4-0-ablation-campaign#what-v040-was-supposed-to-do" class="hash-link" aria-label="Direct link to What v0.4.0 was supposed to do" title="Direct link to What v0.4.0 was supposed to do" translate="no">​</a></h2>
<p>v0.3.0 had shipped with a known regression on coarse labels (country, region, locality) — the cost of expanding the label vocabulary from 15 to 21 BIO classes without enough training steps. <a href="https://github.com/sister-software/mailwoman/issues/116" target="_blank" rel="noopener noreferrer" class="">Issue #116</a> named six work areas for v0.4.0:</p>
<ol>
<li class=""><strong>Per-token CRF NLL normalization</strong> — eliminate the hand-tuned <code>crf_loss_weight=0.05</code> knob by scaling the CRF loss to per-token magnitude so it sums cleanly with cross-entropy.</li>
<li class=""><strong>Longer training</strong> — v0.3.0 early-stopped at step 1800 of 50K; the v0.4.0 floor was step 5000.</li>
<li class=""><strong>Class-weighted cross-entropy</strong> — pull softmax mass back onto the coarse classes the 21-label expansion had diluted.</li>
<li class=""><strong>Source-weight rebalance</strong> — drop NAD's per-sample weight (it had ended up at ~52% of the sampled corpus); promote the WOF admin sources to compensate.</li>
<li class=""><strong>JS-side Viterbi decoder</strong> + label vocabulary loading from <code>model-card.json</code> — runtime cleanup.</li>
<li class=""><strong>Reuse <code>corpus-v0.3.0</code></strong> — no rebuild needed.</li>
</ol>
<p>Items 5 and 6 were pure engineering; they landed cleanly the day before training started. The contested ones were 1, 3, and 4: the recipe changes that actually touch the loss surface.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-actually-happened">What actually happened<a href="https://mailwoman.sister.software/blog/v0-4-0-ablation-campaign#what-actually-happened" class="hash-link" aria-label="Direct link to What actually happened" title="Direct link to What actually happened" translate="no">​</a></h2>
<p>Five training runs. Three of them on different learning rates with the same full recipe; two of them as ablations dropping one item at a time. All five diverged. The fingerprint was distinctive: training loss dropped monotonically through a long warmup, plateaued at the bottom for several hundred steps, then spiked back up to its starting magnitude over 50-150 steps. Validation macro-F1 mirrored the train loss: it climbed to a peak around the LR's peak step, then collapsed to roughly the random-output baseline.</p>
<p>The collapse step shifted with the learning rate:</p>
<table><thead><tr><th>Learning rate</th><th style="text-align:right">Collapsed at step</th></tr></thead><tbody><tr><td>5e-4 (target)</td><td style="text-align:right">750</td></tr><tr><td>3e-4</td><td style="text-align:right">1000</td></tr><tr><td>1.5e-4 (v0.3.0 LR)</td><td style="text-align:right">2000</td></tr></tbody></table>
<p>Three runs each at a different LR, each diverging in the same shape, with the divergence delayed only roughly in proportion. The first step down, from 5e-4 to 3e-4 (a factor of about 1.7), bought a factor-1.3 step delay; across the whole sweep, a 3.3x LR cut bought a 2.7x delay. Lowering the LR postponed the collapse without preventing it, which ruled out "we just picked too high an LR" as the explanation. The destabilizer was in the recipe, not the learning rate.</p>
<p>So we ran the ablations the issue had prescribed:</p>
<table><thead><tr><th>Ablation</th><th>LR</th><th>Result</th></tr></thead><tbody><tr><td>Drop §1 (CRF norm)</td><td>5e-4</td><td>Diverged step 1000</td></tr><tr><td>Drop §3 (class weights)</td><td>5e-4</td><td>Diverged step 1000</td></tr></tbody></table>
<p>Identical failures. At lr=5e-4 neither single-knob revert was enough: that LR looked structurally out of reach for the codebase's dual-loss landscape, whichever of the two knobs we reverted.</p>
<p>We dropped back to the safe LR (1.5e-4, the same LR v0.3.0 had been forced down to) and ran a three-cell orthogonal matrix:</p>
<table><thead><tr><th>Recipe</th><th style="text-align:right">Peak macro-F1</th><th>Verdict</th></tr></thead><tbody><tr><td>§4 only (source rebalance)</td><td style="text-align:right">0.419</td><td>Pass</td></tr><tr><td>§3 + §4 (class weights + source)</td><td style="text-align:right">0.428</td><td><strong>Best</strong> — pass</td></tr><tr><td>§1 + §4 (CRF norm + source)</td><td style="text-align:right">—</td><td>Fail</td></tr></tbody></table>
<p>The §3+§4 verdict-smoke peaked at 0.428 — better than v0.3.0's final 0.36 by a comfortable margin. So we promoted that recipe to the full 50K-step run.</p>
<p>It diverged at step 2250. Same fingerprint as the full §1+§3+§4 recipe at the same LR.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-meta-bug-in-the-smoke-framework">The meta-bug in the smoke framework<a href="https://mailwoman.sister.software/blog/v0-4-0-ablation-campaign#the-meta-bug-in-the-smoke-framework" class="hash-link" aria-label="Direct link to The meta-bug in the smoke framework" title="Direct link to The meta-bug in the smoke framework" translate="no">​</a></h2>
<p>The verdict-smoke ran each ablation for 3000 steps with a cosine learning-rate schedule. With <code>max_steps=3000</code> and <code>warmup_steps=1000</code>, the LR peaks around step 1000 and is back near zero by step 2750. The smoke's "pass" criterion (macro-F1 stable across the last three evals past step 2000) was actually measuring stability <em>under a near-zero learning rate</em>. The full 50K run kept the LR near its peak for thousands of steps. That sustained-peak exposure was where the destabilization happened.</p>
<p>The smoke wasn't testing what we thought it was testing. By the time it would have noticed the divergence, its own LR schedule had already saved it.</p>
<p>We didn't see this coming. The fix for future smokes is to use a constant LR for the verdict window, or to set max_steps large enough that the cosine tail doesn't dominate (something like 10000 keeps LR &gt; 60% of peak through the relevant range).</p>
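<p>The schedule arithmetic can be checked with a short sketch. It assumes linear warmup followed by cosine decay to zero with no minimum-LR floor; the function name <code>lr_fraction</code> is ours, and the real trainer's schedule may differ in detail.</p>

```python
import math


def lr_fraction(step: int, max_steps: int, warmup_steps: int) -> float:
    """Fraction of peak LR at `step`: linear warmup, then cosine decay to zero."""
    if warmup_steps > 0 and warmup_steps > step:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(max_steps - warmup_steps, 1)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


# Compare the 3000-step smoke with a 10000-step smoke at the same warmup.
for max_steps in (3000, 10000):
    fractions = {
        step: round(lr_fraction(step, max_steps, 1000), 2)
        for step in (1000, 2000, 2750)
    }
    print(max_steps, fractions)
```

<p>Under these assumptions the 3000-step smoke is at about 4% of peak LR by step 2750, while a 10000-step schedule is still above 90% of peak at the same step.</p>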
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-we-shipped">What we shipped<a href="https://mailwoman.sister.software/blog/v0-4-0-ablation-campaign#what-we-shipped" class="hash-link" aria-label="Direct link to What we shipped" title="Direct link to What we shipped" translate="no">​</a></h2>
<p>The §4-only recipe. Source rebalance layered on top of v0.3.0's existing dual-loss recipe, at v0.3.0's safe LR. It's the only thing that stayed clean through both a verdict smoke AND a full 50K run.</p>
<p>The shipped checkpoint is <code>v0_4_0-stableLR-source-only/step-002200</code>. Architecture is unchanged from v0.3.0 (256-dim, 6 layers, 9M params). The label vocabulary is unchanged (the same 21 BIO classes). The only thing that's different is which shards the training loader oversamples.</p>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="the-honest-read-on-the-eval-numbers">The honest read on the eval numbers<a href="https://mailwoman.sister.software/blog/v0-4-0-ablation-campaign#the-honest-read-on-the-eval-numbers" class="hash-link" aria-label="Direct link to The honest read on the eval numbers" title="Direct link to The honest read on the eval numbers" translate="no">​</a></h2>
<p>Per-tag F1 on golden v0.1.2 (4535 entries):</p>
<table><thead><tr><th>Tag</th><th style="text-align:right">v0.4.0</th><th style="text-align:right">v0.3.0</th><th style="text-align:right">Δ</th></tr></thead><tbody><tr><td>country</td><td style="text-align:right">0.21</td><td style="text-align:right">0.28</td><td style="text-align:right"><strong>−0.07</strong></td></tr><tr><td>region</td><td style="text-align:right">0.19</td><td style="text-align:right">0.18</td><td style="text-align:right">+0.01</td></tr><tr><td>locality</td><td style="text-align:right">0.27</td><td style="text-align:right">0.27</td><td style="text-align:right">flat</td></tr><tr><td>postcode</td><td style="text-align:right">0.69</td><td style="text-align:right">0.76</td><td style="text-align:right"><strong>−0.07</strong></td></tr><tr><td>venue</td><td style="text-align:right">0.39</td><td style="text-align:right">0.39</td><td style="text-align:right">flat</td></tr><tr><td>street</td><td style="text-align:right">0.30</td><td style="text-align:right">0.27</td><td style="text-align:right">+0.03</td></tr><tr><td>house_number</td><td style="text-align:right">0.79</td><td style="text-align:right">0.78</td><td style="text-align:right">+0.01</td></tr></tbody></table>
<p>Macro raw average: 0.357 vs 0.293. Two regressions (country and postcode, −0.07 each), two small improvements on fine labels (street +0.03, house_number +0.01). Region, locality, and venue are within ±0.01.</p>
<p>This is where it would have been easy to ship the headline as "v0.4.0 mostly regressed" and walk away. We instead bucketed the 1217 postcode false-negatives and 194 country false-negatives into categories, by manually inspecting the differences between gold and prediction.</p>
<p>The picture changes meaningfully:</p>
<p><strong>Country false-negatives</strong>: <strong>92%</strong> are adversarial transliteration entries — golden has English country names but the raw input is mixed-script. Examples:</p>
<div class="language-text codeBlockContainer_mQmQ theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_t_Hd"><pre tabindex="0" class="prism-code language-text codeBlock_RMoD thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_AclH"><div class="token-line" style="color:#393A34"><span class="token plain">بار نون وایومینگ, Wyoming, United States of America   →  pred: "yoming, United Sta"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">サーモポリス, WY, United States of America              →  pred: ", WY, United State"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">France, Lozère, ՍԵՆՏ-ԱԼԲԱՆ-ՍՅՈՒՐ-ԼԻՄԱՆՅՈԼ              →  pred: "" (empty)</span><br></div></code></pre></div></div>
<p>These are v0.3.0's documented known-failure modes. The byte-fallback tokenizer treats every non-Latin script as the same kind of opaque byte sequence, and the model gives up on the prefix. We already knew v0.3.0 was bad at these. The v0.4.0 weights didn't change anything about this slice, so it isn't a real recipe regression. After excluding adversarial inputs, country FN drops from 194 to roughly 16. The −0.07 country regression is mostly a golden-set adversarial-weighting artifact.</p>
<p><strong>Postcode false-negatives</strong> split into four buckets:</p>
<table><thead><tr><th>Category</th><th style="text-align:right">Share</th><th>Example</th></tr></thead><tbody><tr><td>Empty prediction</td><td style="text-align:right"><strong>65%</strong></td><td><code>Paris 75008</code> → model emits nothing for postcode</td></tr><tr><td>Non-Latin transliteration</td><td style="text-align:right">18%</td><td>Same v0.3.0 failure mode</td></tr><tr><td>House number confused for postcode</td><td style="text-align:right">11%</td><td><code>47110 Sainte-Livrade-sur-Lot, 22 Rue Jasmin</code> → predicts <code>22</code></td></tr><tr><td>BIO span boundary slip</td><td style="text-align:right">6%</td><td><code>LE TRÉPORT, 76470</code> → predicts <code>", 7647"</code></td></tr></tbody></table>
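<p>We bucketed by hand, but the categories are mechanical enough to sketch. What follows is a hypothetical illustration, not the logic in <code>diagnose_regression.py</code>: the function names, the bucket labels, and the order of the checks are all assumptions. The rows reuse examples quoted in this post.</p>

```python
import unicodedata
from collections import Counter


def has_non_latin_letters(text: str) -> bool:
    """True when any alphabetic character is outside the Latin script."""
    return any(
        ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
        for ch in text
    )


def bucket_false_negative(raw: str, gold: str, pred: str) -> str:
    """Assign one false negative to a bucket. The check order is an assumption."""
    if not pred.strip():
        return "empty_prediction"
    if has_non_latin_letters(raw):
        return "non_latin_transliteration"
    # Strip punctuation and spaces, then look for overlap with the gold span.
    core = "".join(ch for ch in pred if ch.isalnum())
    if core and (core in gold or gold in core):
        return "span_boundary_slip"
    return "wrong_token"


# (raw input, gold span, predicted span)
examples = [
    ("Paris 75008", "75008", ""),
    ("47110 Sainte-Livrade-sur-Lot, 22 Rue Jasmin", "47110", "22"),
    ("LE TRÉPORT, 76470", "76470", ", 7647"),
    ("サーモポリス, WY, United States of America",
     "United States of America", ", WY, United State"),
]
counts = Counter(bucket_false_negative(*row) for row in examples)
print(counts.most_common())
```

<p>The substring overlap test is deliberately crude. A short numeric prediction can land inside the gold span by accident, so a real bucketer would want character offsets rather than string containment.</p>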
<p>The empty-prediction slice is the real story. NAD's downweight was the most aggressive change in the §4 source rebalance: it carried a lot of "postcode comes first" patterns (<code>47110 Sainte-Livrade-sur-Lot</code>, <code>ND 58701, 44th Ave</code>) and reducing its share removed that positional exposure. The model now defaults to tagging mid-position numeric tokens as house_number instead of postcode.</p>
<p>That's a real regression, and fixing it is what v0.4.1 should target. It is not the problem the headline regression number suggests. The 6% boundary-slip slice is a different bug: the model gets the tag right but emits a span that includes the preceding comma and space. That's a decoder fix, no retraining required, and it has already landed on <code>main</code> as commit <code>c72ab4c</code>. The decoder now trims leading and trailing non-word characters from spans.</p>
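<p>A sketch of that kind of span trim, written in Python for consistency with the training-side tooling even though the runtime decoder is JS-side. This is not the code from commit <code>c72ab4c</code>; <code>trim_span</code> and its offset-based signature are assumptions made for illustration.</p>

```python
import re

_LEADING_NON_WORD = re.compile(r"^\W+")
_TRAILING_NON_WORD = re.compile(r"\W+$")


def trim_span(text: str, start: int, end: int) -> tuple:
    """Shrink [start, end) so it neither begins nor ends on a non-word character."""
    lead = _LEADING_NON_WORD.match(text[start:end])
    if lead:
        start += lead.end()
    trail = _TRAILING_NON_WORD.search(text[start:end])
    if trail:
        end -= len(trail.group())
    return start, end


text = "LE TRÉPORT, 76470"
raw_start = text.index(",")          # span wrongly starts at the comma
start, end = trim_span(text, raw_start, len(text))
print(repr(text[start:end]))
```

<p>Python's <code>\W</code> is Unicode-aware for <code>str</code> patterns, so accented letters such as <code>É</code> count as word characters and are never trimmed. A span made only of punctuation collapses to an empty range.</p>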
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-didnt-get-fixed-the-v041-list">What didn't get fixed (the v0.4.1 list)<a href="https://mailwoman.sister.software/blog/v0-4-0-ablation-campaign#what-didnt-get-fixed-the-v041-list" class="hash-link" aria-label="Direct link to What didn't get fixed (the v0.4.1 list)" title="Direct link to What didn't get fixed (the v0.4.1 list)" translate="no">​</a></h2>
<p>The two destabilizers — §1 per-token CRF normalization and §3 class-weighted cross-entropy — are deferred. Both individually look like reasonable training-side improvements; both, at this LR + this loss landscape, made the model find a confident-wrong degenerate minimum after several hundred post-warmup steps. The sanity-check pass over <code>model.py</code> and <code>crf.py</code> ruled out implementation bugs: the <code>per_token</code> reduction is mathematically <code>nll.sum() / total_tokens.clamp(min=1)</code>, and class_weights enters via PyTorch-standard <code>cross_entropy(weight=...)</code>. The destabilization is a real recipe interaction.</p>
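<p>To make the <code>per_token</code> reduction concrete, here is the same arithmetic on plain Python lists. It mirrors <code>nll.sum() / total_tokens.clamp(min=1)</code> but is not the code in <code>crf.py</code>, and the numbers are invented for illustration.</p>

```python
def per_token_nll(sequence_nll, sequence_lengths):
    """Plain-list mirror of nll.sum() / total_tokens.clamp(min=1)."""
    total_tokens = max(sum(sequence_lengths), 1)
    return sum(sequence_nll) / total_tokens


# Illustrative numbers only: summed CRF NLL for three sequences in a batch.
sequence_nll = [12.0, 30.0, 18.0]
sequence_lengths = [4, 10, 6]

summed = sum(sequence_nll)  # grows with the number of tokens in the batch
normalized = per_token_nll(sequence_nll, sequence_lengths)
print(summed, normalized)  # 60.0 3.0
```

<p>The summed value scales with batch token count, which is why it previously needed a hand-tuned <code>crf_loss_weight</code> before being added to a per-token cross-entropy. The normalized value sits on the same per-token scale, and the floor of 1 on the denominator keeps an empty batch from dividing by zero.</p>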
<p>The leading hypothesis at the v0.4.0 boundary is that some adapter slice in <code>corpus-v0.3.0</code> is producing high-variance gradients that the per-token-normalized CRF still can't dampen. We built the <code>corpus-audit</code> tool during the campaign (it measures per-source shard distribution against the training config's source_weights, with concentration warnings), and the v0.4.1 starting point will be running gradient-norm probes per adapter to find what's causing the spike.</p>
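<p>The kind of check <code>corpus-audit</code> performs can be sketched as follows. This is a hypothetical reconstruction from the description above, not the tool's code: the function name, both thresholds, and every source name and weight other than NAD's roughly 52% sampled share are made up.</p>

```python
from collections import Counter


def audit_sources(sampled_sources, source_weights, tolerance=0.05, max_share=0.5):
    """Compare observed per-source shares with configured weights.

    Hypothetical sketch: `tolerance` and `max_share` are made-up thresholds.
    """
    counts = Counter(sampled_sources)
    total = sum(counts.values())
    weight_sum = sum(source_weights.values())
    warnings = []
    for source in sorted(set(counts) | set(source_weights)):
        observed = counts[source] / total if total else 0.0
        expected = source_weights.get(source, 0) / weight_sum if weight_sum else 0.0
        if abs(observed - expected) > tolerance:
            warnings.append(f"{source}: observed {observed:.2f}, configured {expected:.2f}")
        if observed > max_share:
            warnings.append(f"{source}: concentration {observed:.2f} exceeds {max_share:.2f}")
    return warnings


# Illustrative sample: NAD at 52% of draws despite a configured weight of 30.
sampled = ["nad"] * 52 + ["wof_admin"] * 28 + ["other"] * 20
weights = {"nad": 30, "wof_admin": 50, "other": 20}
for line in audit_sources(sampled, weights):
    print(line)
```

<p>Weights are normalized inside the function, so they can be given as relative integers. With this sample, NAD trips both the drift check and the concentration check.</p>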
<p>Three orthogonal threads for v0.4.1:</p>
<ul>
<li class="">Source-weight tweak (bump NAD partway back to recover positional exposure) + synthesis pass over component-order permutations.</li>
<li class="">Corpus-side investigation of which adapter slice destabilizes CRF gradients.</li>
<li class="">Schedule + class-weight ratio redesign (constant-LR smokes, milder weight ratios).</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="what-wed-tell-a-future-ourselves">What we'd tell a future ourselves<a href="https://mailwoman.sister.software/blog/v0-4-0-ablation-campaign#what-wed-tell-a-future-ourselves" class="hash-link" aria-label="Direct link to What we'd tell a future ourselves" title="Direct link to What we'd tell a future ourselves" translate="no">​</a></h2>
<p>A few things this campaign made obvious that we'd want to put on the shelf for the next iteration:</p>
<ol>
<li class=""><strong>Verdict smokes need to test the same conditions as the full run.</strong> If the full run will spend thousands of steps near peak LR, the smoke needs to spend several thousand steps near peak LR too. Cosine decay is a perfectly reasonable training schedule and a terrible verdict schedule.</li>
<li class=""><strong>Look at failure modes before trusting the F1 delta.</strong> The country regression looked real until we bucketed the failures. 92% of the "regression" was the golden set holding v0.4.0 accountable for v0.3.0's known failure modes. Always categorize before reporting.</li>
<li class=""><strong>One change at a time.</strong> v0.4.0 stacked three orthogonal training changes (per-token CRF, class weights, source rebalance) and tried to ship them together. The campaign spent most of its time un-stacking them. Next time, ship one per release. The risk asymmetry (one change destabilizing, vs three changes each needing isolation) just isn't worth it.</li>
<li class=""><strong>Diagnostic tooling early.</strong> The <code>corpus-audit</code> and <code>diagnose_regression.py</code> tools we built during the campaign would have saved most of v0.3.0's investigation time if they'd existed earlier. We're keeping them in the tree.</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_tleR" id="where-to-look">Where to look<a href="https://mailwoman.sister.software/blog/v0-4-0-ablation-campaign#where-to-look" class="hash-link" aria-label="Direct link to Where to look" title="Direct link to Where to look" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://github.com/sister-software/mailwoman/issues/116" target="_blank" rel="noopener noreferrer" class="">Issue #116</a> — the original work plan</li>
<li class=""><a href="https://github.com/sister-software/mailwoman/tree/issue-116-phase-2-x-v0-4-0" target="_blank" rel="noopener noreferrer" class="">PR for the v0.4.0 ship branch</a> — 10 commits with the campaign retrospective in the merge body</li>
<li class=""><a class="" href="https://mailwoman.sister.software/docs/plan/phases/PHASE_2_training"><code>docs/articles/plan/phases/PHASE_2_training.md</code></a> — the canonical iteration log with the full campaign narrative</li>
<li class=""><code>corpus-python/scripts/diagnose_regression.py</code> — the per-tag FN/FP bucketer</li>
</ul>
<p>Next: v0.4.1 scope discussion. The empty-pred slice on postcodes is the highest-confidence single-target fix. The §1 CRF investigation is the higher-risk, higher-reward thread. We'll likely run both in parallel.</p>]]></content>
        <author>
            <name>Teffen Ellis</name>
            <uri>https://github.com/GirlBossRush</uri>
        </author>
        <category label="Model training" term="Model training"/>
        <category label="Retrospective" term="Retrospective"/>
    </entry>
</feed>