2 docs tagged with "eval"

Confidence calibration

Every span the parser emits carries a number. When the model says it's 94% sure, is it actually right 94% of the time?
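One common way to test that question is to bin predictions by stated confidence and compare each bin's mean confidence with its observed accuracy. The sketch below is illustrative only: the function names `reliability_table` and `expected_calibration_error`, the equal-width binning, and the choice of ten bins are assumptions for this example, not Mailwoman's documented method.

```python
from collections import defaultdict


def reliability_table(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns rows of (bin_index, count, mean_confidence, accuracy).
    Hypothetical helper; the binning scheme is an assumption.
    """
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    rows = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((idx, len(items), mean_conf, accuracy))
    return rows


def expected_calibration_error(confidences, correct, n_bins=10):
    """Count-weighted average gap between confidence and accuracy."""
    total = len(confidences)
    return sum(
        count / total * abs(mean_conf - accuracy)
        for _, count, mean_conf, accuracy in reliability_table(
            confidences, correct, n_bins
        )
    )
```

A well-calibrated model yields rows where mean confidence and accuracy sit close together, and an expected calibration error near zero. A model that says 94% but is right half the time shows a gap of about 0.44 in that bin.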

Eval discipline — reading the numbers honestly

Mailwoman's eval methodology learned its most important lessons the hard way — from shipping two model versions that regressed on headline F1 but told a different story when the failures were examined properly. This article documents the discipline: what to measure, what not to trust, and how to read a model release report.