Skip to main content

12 posts tagged with "Neural classifier"

Articles about the encoder-only transformer model, BIO labeling, ONNX runtime, and training.

View All Tags

A lookup table scored 100%. We shipped the model anyway.

· 5 min read
Teffen Ellis
Sister Software

This morning we published a post that ended with a tidy rule: some address tags don't want a neural network, they want a lookup table. Country names are a closed list in a known position. Our deterministic matcher scored a perfect 100 on the eval. The retrained model scored a mess. Case closed, we wrote.

By the afternoon we'd reopened the case, and the verdict flipped — hard enough that we've retracted the morning post rather than leave the wrong conclusion lying around for someone to cite. This is the story of how a perfect score nearly talked us out of the entire premise of the project.

We spent three retrains fixing a German bug that didn't exist

· 5 min read
Teffen Ellis
Sister Software

There is a particular kind of engineering misery where you fix a bug three times and it never gets better, because the bug is in your ruler. This is that story.

Our neural parser handles German two ways. Native order — Hauptstraße 5, 10115 Berlin — is the layout real German feeds and real German people use. International order — 5 Hauptstraße, Berlin, 10115 — is the Americanized layout our evaluation set happens to ship. For months, international-order German "collapsed": locality accuracy sat around 44% while native cleared 80%. We had a story for it. The postcode anchor — a side-channel that feeds the model a country hint derived from the postcode — sits at the trailing postcode, which in international order lands on the far side of the locality from where it's needed. Plausible. So we retrained.

Which Berlin? When your metric grades the wrong thing

· 4 min read
Teffen Ellis
Sister Software

Ask a geocoder for "Berlin" and it has to make a choice. There's the one in Germany, obviously. There's also Berlin, New Hampshire (population nine thousand and change), Berlin, Wisconsin, Berlin, Connecticut, and a dozen more scattered across the United States like the name was on sale. The parser hands you the word Berlin tagged as a locality; something downstream has to decide which dot on the map that is. How would you even know if it picked right?

For a long time our answer was a scorecard that checked the name. Did the resolved place's name equal the expected name? Tick. Move on. It is a completely reasonable thing to measure, and it was lying to us for months.

Which way does a postcode point?

· 10 min read
Teffen Ellis
Sister Software

We left the last postcode story with a promise and a bill. The promise was that the "which country is this" signal has to come from the trained model reading the whole string, because the postcode on its own settles the question less than half the time. The bill was that this is the expensive version of the feature. This is the post where we paid it: we built the country signal into the model, watched it do something genuinely great, and then watched it refuse, in the most instructive way we've hit all month, to do that same great thing in a different word order.

The great thing first, because you've earned it. We took the postcode's gazetteer membership, that [us, de, fr] answer from last time, and instead of handing it to a regex we injected it into the model at the postcode token itself. A small additive nudge on the hidden state, right where the five digits sit, carrying "here is what this code could be." On German addresses written the way Germans actually write them, it was worth thirty-five points of locality accuracy. It beat Pelias. For one evening we were heroes.

Then we looked at the international numbers and the floor gave way. Same model, same anchor, the same German cities, but now written house-number-first with the postcode trailing the city, the way our test feed renders them, and it scored a hair above a coin flip. The hero anchor was, on those rows, slightly worse than no anchor at all.

Three questions sit under the rest of this, so let me put them on the table before we start:

  • When a parser "collapses" on a test, is the parser wrong, or is the test?
  • Can you train one model to read an address in any order, or does each order quietly cost you the other?
  • And the one that took three retrains to answer honestly: what does a learned anchor actually learn, the thing you asked for, or the shape of where you kept putting it?

Does a postcode know what country it's in?

· 8 min read
Teffen Ellis
Sister Software

We set out to fix a small wart in our address parser and came away with a number that told us to put the screwdriver down.

Here is the wart. When our postcode extractor sees a five-digit run and wants to know whether it's a real postcode or just a house number that happens to look like one, it peeks at the words sitting next to it and checks them against every country's street vocabulary we know — American, German, French, all at once. That "all at once" is fine at three countries. At twenty it gets loud, and a German street suffix starts shadowing an English word by sheer coincidence. So we went looking for the clean way to tell the extractor which country's words to bother with.

That question has a much bigger sibling, and chasing the sibling is where the story actually is.

Our parser fails 80% of our own tests. We shipped it anyway.

· 4 min read
Teffen Ellis
Sister Software

Our neural address parser passes 20.7% of our test suite. The rule-based parser it's meant to replace passes 93.7%. By that scoreboard, we should delete the neural model and go home.

We shipped the neural model instead. Here's why both numbers are true — and why the one that matters says the opposite.

Zero byte-fallback: a multi-script tokenizer from WOF-earth

· 3 min read
Teffen Ellis
Sister Software

The v0.5.0-a1 tokenizer had a dirty secret: it was trained exclusively on US and French addresses. When it encountered Chinese, Japanese, Korean, Thai, or Arabic text, it fell back to encoding individual bytes — 50-75% of tokens for CJK scripts. Every byte-fallback token is a lost opportunity for the model to learn meaningful subword patterns.

Today we fixed that.

Why Japanese addresses break Western parsers

· 5 min read
Teffen Ellis
Sister Software

In Tokyo, the address of Tokyo Tower is 〒105-0011 東京都港区芝公園4-2-8.

In English: "4-2-8 Shibakōen, Minato City, Tokyo 105-0011".

The Japanese form runs right-to-left compared to the English form. The prefecture (都道府県) comes first, then the city or ward (市区町村), then a district (丁目) and a block-number-style locator. There's no street name — just a grid.

This is why every rule-based address parser written for Western addresses breaks on Japan.

PO Box Boîte Postale Apartado: Stage 3 ships with 6 new tags

· 6 min read
Teffen Ellis
Sister Software

For its first six versions, Mailwoman emitted ten BIO tags. The model could pick street out of a row but not street_prefix, street_suffix, unit, or po_box. Real addresses are messier than that. The golden eval set has known examples — 6220 SE Salmon St, Portland, OR 97215 (Stage 2 collapses prefix+name+suffix), 123 Main St Apt 4B, Springfield, IL 62701 (loses the apartment), PO Box 123, Burlington, VT 05401 (treats it as a malformed street).

v0.6.0 adds six tags: street_prefix, street_suffix, unit, po_box, intersection_a, intersection_b. The model is the same h384/6L/6H transformer. The recipe is the same v0.5.1 settings. The tokenizer is the same v0.6.0-a0 multi-script bundle. The only structural change is the output head: 21 BIO labels → 33.