Confidence calibration
Every span the parser emits carries a number when the model says it's 94% sure, is it actually right 94% of the time?
Every span the parser emits carries a number when the model says it's 94% sure, is it actually right 94% of the time?
Mailwoman's eval methodology learned its most important lessons the hard way — from shipping two model versions that regressed on headline F1 but told a different story when the failures were examined properly. This article documents the discipline: what to measure, what not to trust, and how to read a model release report.