
The better posterior was confidently wrong

Date: 2026-06-05
Scope: the postcode anchor's country posterior (#240): uniform vs a de-biased Bayesian weighting, decided on held-out real collisions before changing a line of shipped code.

The postcode anchor turns a code like 75001 into a distribution over the countries it could belong to. The shipped version uses a uniform posterior: 1/k over the k countries a code exists in. 75001 is in both France and the US, so it comes back {FR: 0.5, US: 0.5}: an honest shrug that leaves the disambiguation to the parser and the resolver.
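A minimal sketch of that uniform rule, assuming the gazetteers are available as one set of postcodes per country. The function name, the `gazetteers` layout, and the toy contents are illustrative assumptions, not the shipped code.

```python
def uniform_posterior(postcode, gazetteers):
    """Return 1/k over the k countries whose gazetteer contains the postcode."""
    countries = sorted(c for c, codes in gazetteers.items() if postcode in codes)
    if not countries:
        return {}  # unknown postcode: no evidence for any country
    weight = 1.0 / len(countries)
    return {c: weight for c in countries}


# Toy gazetteers (hypothetical contents, for illustration only).
gazetteers = {
    "FR": {"75001", "13001"},
    "US": {"75001", "10001"},
    "DE": {"10115"},
}
print(uniform_posterior("75001", gazetteers))  # {'FR': 0.5, 'US': 0.5}
```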

That shrug throws away real information, and we knew it. The true posterior is P(country | postcode) ∝ N_c(x), the number of addresses in country c that carry postcode x: the actual address-count ratio. 75001 is the 1st arrondissement of Paris, dense and central; it is also Addison, Texas, a quiet suburb. Those don't have equal address counts, so the honest answer isn't 50/50. A de-biased Bayesian posterior (within-country frequency times a real-world address-volume prior) targets that ratio directly, and on paper it's the more correct shape. So we went to adopt it. Then we did the thing our earlier postcode-country investigation was entirely about: we measured the load-bearing assumption before building on it.
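The de-biased shape can be sketched as follows, assuming per-country postcode counts from a training corpus and a per-country volume prior. The names `counts` and `volume_prior`, their layout, and every number in the example are invented for illustration; the real weighting may differ in detail.

```python
def debiased_posterior(postcode, counts, volume_prior):
    """P(country | postcode) proportional to within-country frequency * volume prior.

    counts: {country: {postcode: n}} from a training corpus (hypothetical layout).
    volume_prior: {country: relative real-world address volume} (hypothetical values).
    """
    scores = {}
    for country, per_code in counts.items():
        total = sum(per_code.values())
        n = per_code.get(postcode, 0)
        if total > 0 and n > 0:
            # n / total estimates the share of this country's addresses at the postcode;
            # scaling by the country's address volume approximates N_c(x).
            scores[country] = (n / total) * volume_prior[country]
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()} if z > 0 else {}


# Toy numbers, invented for illustration only.
counts = {"FR": {"75001": 60, "13001": 40}, "US": {"75001": 10, "10001": 90}}
volume_prior = {"FR": 1.0, "US": 2.0}
print(debiased_posterior("75001", counts, volume_prior))
```

With these toy inputs the FR share is about three times the US share, because the within-country frequency (0.6 vs 0.1) outweighs the 2x volume prior. The estimate is only as trustworthy as the per-country counts that feed it.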

The measurement

We A/B'd three posteriors on the canonical collision (US ↔ FR, the literal 75001 case, and the one pair where we have rich frequency data on both sides):

  • f̂ (train): corpus v0.1.0 (4.4M real addresses), counting (country, postcode).
  • candidate set: the 13,186 five-digit postcodes that exist in both the US and FR postal gazetteers.
  • test (held-out): openaddresses-{us,fr}-sample.jsonl, a different extraction, so f̂ is never graded on the data it was fit to.
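The train and candidate-set steps above can be sketched like this. The row schema (`country` and `postcode` keys) and both function names are assumptions made for the example; the actual corpus and gazetteer formats are not shown in this write-up.

```python
from collections import Counter


def fit_counts(rows):
    """Count (country, postcode) pairs; rows without a postcode are skipped."""
    return Counter((r["country"], r["postcode"]) for r in rows if r.get("postcode"))


def candidate_set(us_codes, fr_codes):
    """Five-digit codes that exist in both gazetteers."""
    return {c for c in us_codes & fr_codes if len(c) == 5 and c.isdigit()}


# Toy rows with a hypothetical schema, for illustration only.
rows = [
    {"country": "FR", "postcode": "75001"},
    {"country": "FR", "postcode": "75001"},
    {"country": "US", "postcode": "75001"},
    {"country": "US", "postcode": None},  # e.g. an admin place with no postcode
]
f_hat = fit_counts(rows)
collisions = candidate_set({"75001", "10001"}, {"75001", "13001"})
```

Keeping the held-out test file out of `fit_counts` entirely is what lets the comparison grade f̂ on data it has not seen.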

Three posteriors, scored per true country and balanced (so a US-leaning prior can't win by exploiting that US collisions outnumber FR in the test):

| posterior | balanced logloss | balanced top-1 | high-conf errors (true-FR) |
|---|---|---|---|
| uniform (shipped) | 0.6931 | 50.0% | 0.0% |
| naive-count | 0.7555 | 39.4% | 0.0% |
| de-biased (the "smart" one) | 2.2292 | 51.6% | 96.9% |

De-biased is roughly three times worse-calibrated than the shrug (logloss 2.2292 vs 0.6931), and confidently wrong about the minority country 97% of the time. Its one bragging point, +1.6pp top-1, is pure test-imbalance arbitrage: it nails the abundant US cases and immolates the FR ones. naive-count, the control, comes in worse than uniform too, exactly as the raw-count skeptics warned.
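For readers who want to reproduce the shape of these metrics, here is a minimal sketch of per-true-country (balanced) scoring. It is not the code in the evaluation script: the 0.9 high-confidence threshold, the fractional credit for tied top-1 predictions, and all names are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def balanced_scores(examples, posterior, high_conf=0.9, eps=1e-12):
    """Score a posterior per true country, then average the per-country means.

    examples: iterable of (postcode, true_country).
    posterior: callable postcode -> non-empty {country: probability}.
    high_conf: assumed threshold for a "confident" prediction.
    """
    loss, hit, confident_wrong = defaultdict(list), defaultdict(list), defaultdict(list)
    for postcode, truth in examples:
        p = posterior(postcode)
        loss[truth].append(-math.log(max(p.get(truth, 0.0), eps)))
        best = max(p.values())
        winners = [c for c, v in p.items() if v == best]
        # Assumption: a tie for first place earns fractional credit.
        hit[truth].append(1.0 / len(winners) if truth in winners else 0.0)
        wrong = any(c != truth and v >= high_conf for c, v in p.items())
        confident_wrong[truth].append(1.0 if wrong else 0.0)

    def per_country(d):
        return {c: sum(v) / len(v) for c, v in d.items()}

    def macro(d):
        means = per_country(d)
        return sum(means.values()) / len(means)

    return {
        "balanced_logloss": macro(loss),
        "balanced_top1": macro(hit),
        "high_conf_error": per_country(confident_wrong),
    }


# Imbalanced toy test set: three US cases for every FR case.
examples = [("75001", "US")] * 3 + [("75001", "FR")]
uniform = balanced_scores(examples, lambda x: {"FR": 0.5, "US": 0.5})
```

Under these assumptions a 50/50 posterior scores a logloss of ln 2 ≈ 0.6931, 50% top-1, and no high-confidence errors, which matches the uniform row. Because every metric is averaged per true country first, a posterior that always shouts "US" gains nothing from US cases outnumbering FR ones.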

Why (and it isn't the math)

The de-biasing math is sound. The data underneath it is half-broken. f̂ needs per-country address frequency on both sides of a collision, and we only have it on one. France arrives as 1.13M real street addresses (ban, the national address base), so a busy Paris postcode genuinely shows up thousands of times. The US arrives as 58k rows from wof-postalcode: essentially a membership list, one row per code, no frequency signal at all (97.5% of v0.1.0's US rows are wof-admin places that carry no postcode). So f̂_US is estimated off a tiny, flat denominator and comes out systematically inflated; multiply by a US-favoring volume prior and the posterior screams "US" at nearly every collision, right or wrong.

Feed sound math a lopsided f̂ and you don't get a slightly-off posterior; you get a confidently wrong one. And a confidently-wrong soft prior is the single thing that corrupts the resolver re-rank we're building toward: uniform's honest 0.5 gets gated out when the resolver is already sure, but a 0.97-to-the-wrong-country prior sails through the confidence gate and flips a correct answer. The shrug is worse on paper and far safer in practice.
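To make that failure mode concrete, here is one hypothetical shape such a gate could take. The resolver re-rank is not built yet, so the rule below (the prior overrides only when it is more confident than the resolver), the function name, and the numbers are all assumptions for illustration.

```python
def rerank_country(resolver_choice, resolver_conf, prior):
    """Hypothetical gate: the prior overrides only if it is more confident than the resolver."""
    prior_choice = max(prior, key=prior.get)
    if prior[prior_choice] > resolver_conf:
        return prior_choice
    return resolver_choice


# The resolver is correct and fairly sure that the address is French.
shrug = rerank_country("FR", 0.9, {"FR": 0.5, "US": 0.5})
shout = rerank_country("FR", 0.9, {"FR": 0.03, "US": 0.97})
print(shrug, shout)  # FR US
```

In this sketch the uniform prior can never beat a resolver confidence above 0.5, so it stays out of the way, while the 0.97 prior clears the gate and replaces the correct answer.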

Verdict

Uniform stays. Not because de-biasing is the wrong idea, but because we can't feed it. This is a data-availability finding wearing a posterior-design costume: the day we have balanced per-country address frequency (bulk OpenAddresses for the US, or census ZIP-population weights, neither on disk today), it's worth re-measuring, because the shape it targets really is the correct one. Until then, weighting the guess makes it worse.

The measurement is scripts/eval/postcode-posterior-ab.py; re-run it when the data changes. It cost an afternoon and saved us from shipping a posterior that is wrong with conviction, which is the only kind of wrong a soft anchor can't afford to be.