The better posterior was confidently wrong
Date: 2026-06-05. Scope: the postcode anchor's country posterior (#240): uniform vs. a de-biased Bayesian weighting, decided on held-out real collisions before changing a line of shipped code.
The postcode anchor turns a code like 75001 into a distribution over the countries it could belong to. The shipped version uses a uniform posterior: 1/k over the k countries a code exists in. 75001 is in both France and the US, so it comes back {FR: 0.5, US: 0.5}: an honest shrug that leaves the disambiguation to the parser and the resolver.
That shrug throws away real information, and we knew it. The true posterior is P(country | postcode) ∝ N_c(x), where N_c(x) is the number of real addresses in country c that carry postcode x: the actual address-count ratio. 75001 is the 1st arrondissement of Paris, dense and central; it is also Addison, Texas, a quiet suburb. Those don't have equal address counts, so the honest answer isn't 50/50. A de-biased Bayesian posterior (within-country frequency times a real-world address-volume prior) targets that ratio directly, and on paper it's the more correct shape. So we went to adopt it. Then we did the thing our earlier postcode-country investigation was entirely about: we measured the load-bearing assumption before building on it.
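To make the two shapes concrete, here is a minimal Python sketch of both posteriors. The function names, the input shapes (`gazetteers`, `counts`, `volume_prior`) and every number in the usage lines are illustrative assumptions, not the shipped code and not real corpus counts.

```python
def uniform_posterior(postcode, gazetteers):
    """1/k over the k countries whose gazetteer contains the postcode.

    gazetteers: dict country -> set of postcodes (hypothetical shape).
    """
    countries = sorted(c for c, codes in gazetteers.items() if postcode in codes)
    if not countries:
        return {}
    p = 1.0 / len(countries)
    return {c: p for c in countries}


def debiased_posterior(postcode, counts, volume_prior):
    """Within-country frequency times an address-volume prior, normalized.

    counts: dict country -> dict postcode -> address count in the training corpus.
    volume_prior: dict country -> real-world address volume (any positive scale).
    Both shapes are assumptions made for this sketch.
    """
    scores = {}
    for country, per_code in counts.items():
        n_code = per_code.get(postcode, 0)
        if n_code == 0:
            continue  # the postcode was never seen in this country
        freq = n_code / sum(per_code.values())  # within-country frequency
        scores[country] = freq * volume_prior[country]
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()} if z else {}


# Invented toy inputs, for illustration only.
gazetteers = {"FR": {"75001", "13001"}, "US": {"75001", "10001"}}
counts = {
    "FR": {"75001": 30, "13001": 70},  # many rows per code
    "US": {"75001": 1, "10001": 3},    # a few rows per code
}
volume_prior = {"FR": 1.0, "US": 4.0}

print(uniform_posterior("75001", gazetteers))
# {'FR': 0.5, 'US': 0.5}
print({c: round(p, 3) for c, p in debiased_posterior("75001", counts, volume_prior).items()})
# {'FR': 0.231, 'US': 0.769}
```

The uniform version needs only membership, while the de-biased version is only as good as the per-country counts it is fed.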
The measurement
We A/B'd three posteriors on the canonical collision, US ∩ FR. This is the literal 75001 case, and the one pair where we have rich frequency data on both sides:
- f̂ (train): corpus v0.1.0 (4.4M real addresses), counted per (country, postcode).
- candidate set: the 13,186 five-digit postcodes that exist in both the US and FR postal gazetteers.
- test (held-out): openaddresses-{us,fr}-sample.jsonl, a different extraction, so f̂ is never graded on the data it was fit to.
Three posteriors, scored per true country and then balanced (so a US-leaning prior can't win just because US collisions outnumber FR ones in the test set):
| posterior | balanced logloss | balanced top-1 | high-conf errors (true-FR) |
|---|---|---|---|
| uniform (shipped) | 0.6931 | 50.0% | 0.0% |
| naive-count | 0.7555 | 39.4% | 0.0% |
| de-biased (the "smart" one) | 2.2292 | 51.6% | 96.9% |
De-biased has about 3.2 times the balanced log loss of the shrug (2.2292 vs. 0.6931), and it is confidently wrong on 96.9% of true-FR cases, the minority country. Its one bragging point, +1.6pp top-1, is pure test-imbalance arbitrage: it nails the abundant US cases and immolates the FR ones. naive-count, the control, comes in worse than uniform too, exactly as the raw-count skeptics warned.
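For readers who want to see how balanced scores of this kind can be computed, the sketch below averages each metric within a true country first and then across countries. It is not the contents of scripts/eval/postcode-posterior-ab.py: the function name, the `high_conf=0.9` threshold and the tie-break rule are assumptions. It does reproduce one fact from the table: a 50/50 posterior always scores ln 2 ≈ 0.6931.

```python
import math
from collections import defaultdict


def balanced_metrics(examples, posterior_fn, high_conf=0.9, eps=1e-12):
    """Per-true-country metrics, then an unweighted mean across countries.

    examples: iterable of (postcode, true_country).
    posterior_fn: postcode -> non-empty dict country -> probability.
    high_conf: assumed threshold for counting an error as "high-confidence".
    """
    # Per country: [n, summed log loss, top-1 hits, high-confidence errors]
    stats = defaultdict(lambda: [0, 0.0, 0, 0])
    for postcode, truth in examples:
        post = posterior_fn(postcode)
        # Deterministic tie-break: first maximum in country-code order.
        guess, p_guess = max(sorted(post.items()), key=lambda kv: kv[1])
        s = stats[truth]
        s[0] += 1
        s[1] += -math.log(max(post.get(truth, 0.0), eps))
        s[2] += int(guess == truth)
        s[3] += int(guess != truth and p_guess >= high_conf)

    per_country = {
        c: {"logloss": s[1] / s[0], "top1": s[2] / s[0], "high_conf_err": s[3] / s[0]}
        for c, s in stats.items()
    }
    k = len(per_country)
    return {
        "balanced_logloss": sum(m["logloss"] for m in per_country.values()) / k,
        "balanced_top1": sum(m["top1"] for m in per_country.values()) / k,
        "per_country": per_country,
    }


# Toy, deliberately imbalanced test set: three true-US rows, one true-FR row.
examples = [("75001", "US")] * 3 + [("75001", "FR")]

uniform = balanced_metrics(examples, lambda pc: {"FR": 0.5, "US": 0.5})
skewed = balanced_metrics(examples, lambda pc: {"FR": 0.03, "US": 0.97})

print(round(uniform["balanced_logloss"], 4))       # 0.6931
print(uniform["balanced_top1"])                    # 0.5
print(skewed["per_country"]["FR"]["high_conf_err"])  # 1.0
```

Averaging per country before combining is what keeps the abundant class from dominating the score.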
Why, and it isn't the math
The de-biasing math is sound. The data underneath it is half-broken. f̂ needs per-country address frequency on both sides of a collision, and we only have it on one. France arrives as 1.13M real street addresses (ban, the national address base), so a busy Paris postcode genuinely shows up thousands of times. The US arrives as 58k rows from wof-postalcode, essentially a membership list: one row per code, no frequency signal at all (97.5% of v0.1.0's US rows are wof-admin places that carry no postcode). So f̂_US is estimated off a tiny, flat denominator and comes out systematically inflated; multiply by a US-favoring volume prior and the posterior screams "US" at nearly every collision, right or wrong.
Feed sound math a lopsided f̂ and you don't get a slightly-off posterior; you get a confidently wrong one. And a confidently-wrong soft prior is the single thing that corrupts the resolver re-rank we're building toward: uniform's honest 0.5 gets gated out when the resolver is already sure, but a 0.97-to-the-wrong-country prior sails through the confidence gate and flips a correct answer. The shrug is worse on paper and far safer in practice.
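The failure mode can be illustrated with a toy gate. The re-rank is not built yet, so everything here is hypothetical: the function name, the rule that the prior only gets a vote when it is more confident than the resolver, and the multiplicative combination are all assumptions chosen to make the argument visible.

```python
def rerank_country(resolver_scores, prior):
    """Pick a country from resolver scores, optionally adjusted by a soft prior.

    resolver_scores: dict country -> resolver confidence (hypothetical shape).
    prior: dict country -> prior probability from the postcode anchor.
    """
    resolver_best = max(resolver_scores, key=resolver_scores.get)
    if not prior:
        return resolver_best
    # Hypothetical gate: ignore the prior unless it is more confident
    # than the resolver's own top score.
    if max(prior.values()) <= resolver_scores[resolver_best]:
        return resolver_best
    combined = {c: s * prior.get(c, 0.0) for c, s in resolver_scores.items()}
    return max(combined, key=combined.get)


resolver = {"FR": 0.8, "US": 0.2}  # the resolver is already sure, and correct

print(rerank_country(resolver, {"FR": 0.5, "US": 0.5}))    # FR
print(rerank_country(resolver, {"FR": 0.03, "US": 0.97}))  # US
```

Under this assumed gate the 0.5 prior never gets a vote, while the 0.97 prior clears the gate and overturns the resolver's correct answer.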
Verdict
Uniform stays. Not because de-biasing is the wrong idea, but because we can't feed it. This is a data-availability finding wearing a posterior-design costume: the day we have balanced per-country address frequency (bulk OpenAddresses for the US, or census ZIP-population weights, neither on disk today), it's worth re-measuring, because the shape it targets really is the correct one. Until then, weighting the guess makes it worse.
The measurement is scripts/eval/postcode-posterior-ab.py; re-run it when the data changes. It cost an afternoon and saved us from shipping a posterior that is wrong with conviction, which is the only kind of wrong a soft anchor can't afford to be.