Isotonic confidence calibration — neural-weights-en-us v4.0.0

Post-hoc calibration of the decoder's per-span softmax confidence (the conf= a resolver or human reads off the parse). Method: isotonic regression (PAVA) over (raw confidence, correct?) pairs from a 50/50 OpenAddresses + training-corpus calibration set. The fit uses 80% of the set; every number below is measured on the held-out 20%. Task #59 (#240 PR3).
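As an illustration of the method named above, here is a minimal pure-Python sketch of pool-adjacent-violators (PAVA). It is not the contents of fit-isotonic-calibration.py; the function name `pava` and the toy inputs are hypothetical, and ties in raw confidence are simply kept in sorted order rather than pooled.

```python
def pava(raw_confidences, correct):
    """Non-decreasing least-squares fit of `correct` (0/1) ordered by raw confidence.

    Hypothetical helper for illustration only.
    """
    order = sorted(range(len(raw_confidences)), key=lambda i: raw_confidences[i])
    # Each block stores [sum of labels, count]; its fitted value is sum / count.
    blocks = []
    for i in order:
        blocks.append([float(correct[i]), 1])
        # Pool backwards while the previous block's mean exceeds the newest one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            label_sum, count = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
    # Expand the pooled blocks back to one fitted value per sorted sample.
    fitted_sorted = []
    for label_sum, count in blocks:
        fitted_sorted.extend([label_sum / count] * count)
    # Restore the caller's original sample order.
    fitted = [0.0] * len(order)
    for rank, i in enumerate(order):
        fitted[i] = fitted_sorted[rank]
    return fitted


toy_fit = pava([0.2, 0.4, 0.6, 0.8, 0.9], [0, 1, 0, 1, 1])
```

On the toy input, the violating pair at 0.4 and 0.6 is pooled to a shared value of 0.5, which is what makes the resulting map monotone in raw confidence.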

correct? is a normalized exact-or-token-subset span match (so street decomposition and multi-word fragmentation aren't penalized), so the absolute accuracy runs mildly optimistic. Isotonic regression corrects the shape of the reliability curve, which the lenient match criterion leaves intact. The corpus half is in-domain (the model trained on it); the OA-only row in the Headline table below is the trustworthy held-out ECE.
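One way to read "normalized exact-or-token-subset" is sketched below. The normalization (lowercasing, splitting on non-word characters) and the subset direction (predicted tokens contained in the gold tokens) are assumptions made for illustration; the source does not spell out either, and `normalize` and `span_correct` are hypothetical names.

```python
import re


def normalize(text):
    """Lowercase and split into word tokens (assumed normalization)."""
    return re.findall(r"\w+", text.lower())


def span_correct(predicted, gold):
    """True on an exact normalized match, or when every predicted token
    appears in the gold span (assumed direction of the subset test)."""
    predicted_tokens = normalize(predicted)
    gold_tokens = normalize(gold)
    if not predicted_tokens:
        return False
    return predicted_tokens == gold_tokens or set(predicted_tokens) <= set(gold_tokens)
```

Under these assumptions a fragment such as "Main" counts as correct against a gold span "Main Street", which is the leniency that makes the absolute accuracy mildly optimistic.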

Headline

| Split | ECE raw | ECE calibrated | target |
|---|---|---|---|
| Combined (deliverable) | 0.0673 | 0.0035 | <0.05 |
| OA-only (held-out, trustworthy) | 0.0706 | 0.0067 | |
| corpus-only (in-domain) | 0.0659 | 0.0061 | |

MCE (bins n≥20) 0.2891 → 0.1829 · Brier 0.0340 → 0.0270 · n_fit=26043 n_eval=6510 spans.

MCE is reported over bins with ≥20 samples. The model is confident: after calibration ~94% of held-out spans (6132 of 6510) sit in [0.93, 1.0], and 84% have a raw confidence of at least 0.87. Equal-width bins below ~0.7 therefore hold few samples each, and the all-bins max gap (0.845 on raw confidence, from a bin holding a single span) is single-sample noise, not a calibration failure. ECE (sample-weighted) is the headline; it weights each bin by its mass.

Reliability (held-out eval, raw confidence)

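| confidence bin | n | mean conf | accuracy | gap |
|---|---|---|---|---|
| [0.13, 0.20) | 1 | 0.155 | 1.000 | 0.845 |
| [0.20, 0.27) | 2 | 0.217 | 1.000 | 0.783 |
| [0.27, 0.33) | 16 | 0.308 | 0.500 | 0.192 |
| [0.33, 0.40) | 20 | 0.370 | 0.500 | 0.130 |
| [0.40, 0.47) | 26 | 0.432 | 0.615 | 0.183 |
| [0.47, 0.53) | 49 | 0.500 | 0.592 | 0.092 |
| [0.53, 0.60) | 59 | 0.568 | 0.831 | 0.263 |
| [0.60, 0.67) | 70 | 0.639 | 0.929 | 0.289 |
| [0.67, 0.73) | 107 | 0.700 | 0.897 | 0.197 |
| [0.73, 0.80) | 227 | 0.770 | 0.982 | 0.212 |
| [0.80, 0.87) | 464 | 0.841 | 0.963 | 0.122 |
| [0.87, 0.93) | 2373 | 0.912 | 0.974 | 0.062 |
| [0.93, 1.00) | 3096 | 0.950 | 0.986 | 0.037 |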
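The headline raw ECE and MCE can be recomputed from the rows of the raw reliability table. The sketch below copies (n, mean conf, accuracy) from that table; because those values are rounded to three decimals, the results agree with the reported 0.0673 and 0.2891 only approximately. The names `RAW_BINS`, `ece` and `mce` are illustrative, not taken from the eval scripts.

```python
# (n, mean raw confidence, accuracy) per bin, copied from the raw reliability table.
RAW_BINS = [
    (1, 0.155, 1.000), (2, 0.217, 1.000), (16, 0.308, 0.500),
    (20, 0.370, 0.500), (26, 0.432, 0.615), (49, 0.500, 0.592),
    (59, 0.568, 0.831), (70, 0.639, 0.929), (107, 0.700, 0.897),
    (227, 0.770, 0.982), (464, 0.841, 0.963), (2373, 0.912, 0.974),
    (3096, 0.950, 0.986),
]


def ece(bins):
    """Expected calibration error: per-bin |accuracy - confidence|, weighted by bin mass."""
    total = sum(n for n, _, _ in bins)
    return sum(n * abs(acc - conf) for n, conf, acc in bins) / total


def mce(bins, min_count=20):
    """Maximum calibration error over bins holding at least `min_count` samples."""
    return max(abs(acc - conf) for n, conf, acc in bins if n >= min_count)


raw_ece = ece(RAW_BINS)
raw_mce = mce(RAW_BINS)                      # bins with n >= 20 only
raw_mce_all_bins = mce(RAW_BINS, min_count=1)  # dominated by the n=1 bin
```

Dropping the n≥20 filter moves the maximum to the single-span bin at [0.13, 0.20), which is why the filtered MCE is the one reported.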

Reliability (held-out eval, calibrated confidence)

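| confidence bin | n | mean cal | accuracy | gap |
|---|---|---|---|---|
| [0.00, 0.07) | 1 | 0.046 | 1.000 | 0.954 |
| [0.33, 0.40) | 6 | 0.348 | 0.667 | 0.319 |
| [0.40, 0.47) | 32 | 0.425 | 0.500 | 0.075 |
| [0.47, 0.53) | 14 | 0.521 | 0.571 | 0.051 |
| [0.67, 0.73) | 10 | 0.684 | 0.800 | 0.116 |
| [0.73, 0.80) | 51 | 0.751 | 0.569 | 0.183 |
| [0.80, 0.87) | 116 | 0.841 | 0.879 | 0.038 |
| [0.87, 0.93) | 148 | 0.924 | 0.919 | 0.006 |
| [0.93, 1.00) | 6132 | 0.980 | 0.980 | 0.000 |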

ECE by locale (held-out eval, raw → calibrated)

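| locale | n | accuracy | ECE raw | ECE calibrated |
|---|---|---|---|---|
| NL | 192 | 0.995 | 0.1786 | 0.0465 |
| DE | 195 | 0.887 | 0.1111 | 0.0973 |
| US | 5320 | 0.972 | 0.0684 | 0.0070 |
| FR | 803 | 0.966 | 0.0594 | 0.0169 |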

ECE by tag (held-out eval, raw → calibrated)

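| tag | n | accuracy | ECE raw | ECE calibrated |
|---|---|---|---|---|
| venue | 434 | 0.931 | 0.1138 | 0.0430 |
| postcode | 1547 | 0.999 | 0.0893 | 0.0227 |
| locality | 1668 | 0.972 | 0.0759 | 0.0083 |
| region | 1219 | 0.998 | 0.0639 | 0.0185 |
| street | 817 | 0.965 | 0.0624 | 0.0180 |
| house_number | 783 | 0.888 | 0.0322 | 0.0821 |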

Abstention curve (calibrated confidence)

Accept spans at or above the threshold; route the rest to review. Precision is the accuracy of the accepted set.

| threshold | coverage (accepted) | precision | reviewed |
|---|---|---|---|
| 0.50 | 99.4% | 97.20% | 0.6% |
| 0.80 | 98.2% | 97.64% | 1.8% |
| 0.90 | 96.5% | 97.82% | 3.5% |
| 0.95 | 94.2% | 97.96% | 5.8% |
| 0.97 | 73.1% | 98.44% | 26.9% |
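Each row of the abstention curve follows from the accept-at-or-above rule. A minimal sketch, assuming parallel lists of calibrated confidences and 0/1 correctness flags; `abstention_row` and the five toy spans are hypothetical and are not drawn from the eval set.

```python
def abstention_row(calibrated, correct, threshold):
    """Return (coverage, precision, reviewed) for one acceptance threshold.

    Spans at or above the threshold are accepted; the rest go to review.
    """
    accepted = [ok for conf, ok in zip(calibrated, correct) if conf >= threshold]
    coverage = len(accepted) / len(calibrated)
    # Precision is the accuracy of the accepted set; undefined if nothing is accepted.
    precision = sum(accepted) / len(accepted) if accepted else float("nan")
    return coverage, precision, 1.0 - coverage


# Toy data for illustration only.
toy_conf = [0.99, 0.98, 0.96, 0.60, 0.40]
toy_correct = [1, 1, 0, 1, 0]
toy_row = abstention_row(toy_conf, toy_correct, threshold=0.95)
```

Raising the threshold can only shrink the accepted set, so coverage is non-increasing in the threshold; precision usually rises with it but is not guaranteed to.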

The single global table is fit across all locales and tags, so it under-serves the worst-calibrated subgroups. The per-locale and per-tag rows show where the one-size table leaves residual error: the OOD locales and rare tags run far higher than the US/FR-dominated global ECE. After calibration DE stays at 0.0973 and NL at 0.0465 against 0.0070 for US, and house_number gets worse (0.0322 raw to 0.0821 calibrated). A per-locale table is the natural next step once the deployed multi-locale model is the calibration target (#368).

20-bin lookup table (raw → calibrated)

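| bin center | calibrated |
|---|---|
| 0.025 | 0.000 |
| 0.075 | 0.000 |
| 0.125 | 0.000 |
| 0.175 | 0.211 |
| 0.225 | 0.348 |
| 0.275 | 0.348 |
| 0.325 | 0.425 |
| 0.375 | 0.426 |
| 0.425 | 0.521 |
| 0.475 | 0.738 |
| 0.525 | 0.776 |
| 0.575 | 0.840 |
| 0.625 | 0.851 |
| 0.675 | 0.924 |
| 0.725 | 0.925 |
| 0.775 | 0.957 |
| 0.825 | 0.963 |
| 0.875 | 0.969 |
| 0.925 | 0.983 |
| 0.975 | 0.986 |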
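A lookup table of this shape can be applied by mapping a raw confidence to its equal-width bin. The sketch below assumes plain nearest-bin lookup with no interpolation between bin centers; the shipped calibrator may behave differently (for example by interpolating), and `CALIBRATED` and `calibrate` are hypothetical names that hard-code the values of the table above.

```python
# Calibrated value per bin, in bin-center order 0.025, 0.075, ..., 0.975.
CALIBRATED = [
    0.000, 0.000, 0.000, 0.211, 0.348, 0.348, 0.425, 0.426, 0.521, 0.738,
    0.776, 0.840, 0.851, 0.924, 0.925, 0.957, 0.963, 0.969, 0.983, 0.986,
]


def calibrate(raw):
    """Map a raw confidence in [0, 1] to the calibrated value of its bin.

    Assumption: nearest-bin lookup, no interpolation.
    """
    if not 0.0 <= raw <= 1.0:
        raise ValueError(f"raw confidence out of range: {raw}")
    # raw == 1.0 would index one past the end, so clamp to the last bin.
    index = min(int(raw * len(CALIBRATED)), len(CALIBRATED) - 1)
    return CALIBRATED[index]


example = calibrate(0.91)  # falls in the bin centered at 0.925
```

Because the table is non-decreasing, the mapping preserves the ranking of spans by confidence up to ties, so calibration changes the reported values without reordering which spans look most reliable.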

How it's wired

The table ships as data/eval/calibration/isotonic-en-us-v4.0.0.json and is turned into a (raw)=>calibrated function by the OPT-IN decoder calibrator (core/decoder/calibration.ts, createCalibrator). Default parse output is unchanged (byte-stable); pass the calibrator via ParseOpts.calibrate / BuildTreeOpts.calibrate to emit calibrated conf=. Regenerate with scripts/eval/{build-calibration-set.py,collect-span-confidences.ts,fit-isotonic-calibration.py}.