Eval discipline — reading the numbers honestly

Mailwoman's eval methodology learned its most important lessons the hard way — from shipping two model versions that regressed on headline F1 but told a different story when the failures were examined properly. This article documents the discipline: what to measure, what not to trust, and how to read a model release report.

Why aggregate F1 is misleading

Per-component F1 scores are the standard metric for sequence-labelling models. A table like this looks authoritative:

Component	v0.4.0	v0.3.0	Δ
country	0.21	0.28	−0.07
region	0.19	0.18	+0.01
locality	0.27	0.27	flat
postcode	0.69	0.76	−0.07
street	0.30	0.27	+0.03
house_number	0.79	0.78	+0.01

The easy reading: v0.4.0 regressed on country and postcode. Ship the previous version.

The honest reading, after bucketing the 1,217 postcode false-negatives and 194 country false-negatives into failure categories, reveals that the headline regressions are mostly eval artifacts, not real model degradation.

The false-negative bucketing methodology

For every component where F1 moves more than a few points between releases, manually inspect a sample of the disagreements between the model's prediction and the golden-set label. Categorize each failure into buckets. The buckets will typically fall into a few patterns:

Pattern 1: Adversarial eval entries

The golden set contains entries chosen specifically to break the parser — multi-script addresses, ambiguous locality names, prefix-honorific homographs. If the model has a known limitation (e.g., the v0.1.0 tokenizer's byte-fallback on non-Latin scripts) and the golden set includes entries that exercise that limitation, then F1 deltas on those entries are measuring whether the limitation got fixed, not whether the new weights are better or worse.

In v0.4.0's case, 92% of the country false-negatives were adversarial transliteration entries:

بار نون وایومینگ, Wyoming, United States of America   →  pred: "yoming, United Sta"
サーモポリス, WY, United States of America              →  pred: ", WY, United State"

The model was never trained to handle these cases. The v0.4.0 weights didn't change the behaviour on this slice — the regression was the golden set holding v0.4.0 accountable for v0.3.0's known failure modes. After excluding adversarial inputs, country false-negatives dropped from 194 to roughly 16.

The discipline: report F1 both with and without known-adversarial slices. The "with" number is the honest ceiling; the "without" number is the signal for whether the recipe changed anything real.
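
A minimal sketch of the two-number report, assuming each golden entry carries per-component true-positive, false-positive, and false-negative counts plus an adversarial flag. The `EntryResult` shape and function names are hypothetical, not Mailwoman's actual eval code, and the data is a toy example.

```python
from dataclasses import dataclass


@dataclass
class EntryResult:
    """Per-entry counts for one component (hypothetical record shape)."""
    tp: int
    fp: int
    fn: int
    adversarial: bool  # True if the golden entry is tagged adversarial


def f1(results):
    """Micro-averaged F1 over the given entries; 0.0 when undefined."""
    tp = sum(r.tp for r in results)
    fp = sum(r.fp for r in results)
    fn = sum(r.fn for r in results)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_with_and_without(results):
    """Return (F1 on all entries, F1 with adversarial entries excluded)."""
    clean = [r for r in results if not r.adversarial]
    return f1(results), f1(clean)


# Toy data, not Mailwoman results.
toy = [
    EntryResult(tp=1, fp=0, fn=0, adversarial=False),
    EntryResult(tp=1, fp=0, fn=0, adversarial=False),
    EntryResult(tp=0, fp=1, fn=1, adversarial=True),
]
with_adv, without_adv = f1_with_and_without(toy)
```

Computing both numbers from the same per-entry records keeps the comparison honest: the only thing that differs between them is the slice filter.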

Pattern 2: Empty predictions

When the model emits nothing for a component that the golden set expects, the failure is usually a training-distribution effect — the model learned a positional prior that doesn't apply to the eval set.

In v0.4.0, 65% of postcode false-negatives were empty predictions on mid-position postcodes like Paris 75008. The NAD downweight (the most aggressive change in the source rebalance) removed "postcode-first" positional patterns from the training mix. The model learned to tag mid-position numeric tokens as house_number instead of postcode.

This is a real regression — the recipe change had an unintended side effect — but it's a training-data distribution problem, not a model architecture problem. It suggests a targeted fix (bump NAD weight back up, synthesize component-order permutations) rather than a rollback.

Pattern 3: Label confusion

The model picks the wrong label for a span. In v0.4.0, 11% of postcode false-negatives were house-number confusion: 47110 Sainte-Livrade-sur-Lot, 22 Rue Jasmin → the model predicted 22 as postcode instead of the leading 47110.

These are genuine model errors. They suggest the label vocabulary is ambiguous for numeric tokens in certain positions.

Pattern 4: Span boundary slip

The model gets the label right but the span wrong. In v0.4.0, 6% of postcode false-negatives were boundary-slip cases: LE TRÉPORT, 76470 → model predicted ", 7647" for postcode. The tag was correct (postcode) but the span included the preceding comma and space, and sometimes truncated the final digit.

This is a decoder problem, not a model problem — no retraining required. The fix (trimming spans past leading/trailing non-word characters) landed in the decoder without touching the model weights.
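
The trimming step can be sketched as follows. This is an illustration under stated assumptions, not the decoder's actual code: `trim_span` is a hypothetical name, spans are half-open character offsets, and "word character" is approximated with `str.isalnum()`.

```python
def trim_span(text, start, end):
    """Shrink [start, end) so it neither starts nor ends on a non-word character.

    "Word character" is approximated with str.isalnum(); a real decoder may
    use a different definition.
    """
    while start < end and not text[start].isalnum():
        start += 1
    while end > start and not text[end - 1].isalnum():
        end -= 1
    return start, end


text = "LE TRÉPORT, 76470"
start = text.index(",")        # the predicted span begins at the comma
end = start + len(", 7647")    # and stops one digit short
new_start, new_end = trim_span(text, start, end)
trimmed = text[new_start:new_end]
```

Trimming removes the leading comma and space but cannot restore a truncated final digit; a span that ends early needs a separate extension rule.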

The discipline: always ask whether the failure is a model problem or a decoder problem. Decoder fixes are cheap and don't require a retrain. Many "model regressions" turn out to be decoder bugs on closer inspection.
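
The four patterns above can be turned into a first-pass bucketing heuristic. The sketch below assumes each false negative is available as a golden span, a predicted span for the same component (or `None`), and an adversarial flag; the function name, bucket names, and offsets are illustrative, and the result is a triage aid rather than a replacement for manual inspection.

```python
from collections import Counter


def bucket_false_negative(gold_span, pred_span, adversarial):
    """Assign one false negative to a bucket.

    gold_span / pred_span: half-open (start, end) character offsets for the
    component; pred_span is None when the model emitted nothing for it.
    """
    if adversarial:
        return "adversarial"          # attribute to the golden set first
    if pred_span is None:
        return "empty_prediction"
    gold_start, gold_end = gold_span
    pred_start, pred_end = pred_span
    if pred_start < gold_end and gold_start < pred_end:
        return "boundary_slip"        # right region, wrong edges
    return "label_confusion"          # the label landed on a different token


# Toy false negatives: (gold span, predicted span, adversarial flag).
failures = [
    ((12, 17), (10, 16), False),  # "LE TRÉPORT, 76470", predicted ", 7647"
    ((0, 5), (30, 32), False),    # "47110 Sainte-Livrade-sur-Lot, 22 Rue Jasmin"
    ((6, 11), None, False),       # "Paris 75008", nothing emitted
    ((0, 10), (3, 9), True),      # entry tagged adversarial
]
counts = Counter(bucket_false_negative(g, p, a) for g, p, a in failures)
```

Checking the adversarial flag first is a design choice: it keeps known golden-set stress cases out of the model and decoder buckets, so the remaining counts point at things a recipe or decoder change can fix.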

Golden-set hygiene

The golden eval set (v0.1.2, 4,535 entries) is the single most important artifact in the eval pipeline. A few rules:

Adversarial entries belong in their own slice. Report F1 with and without them. The adversarial slice is a stress test, not a release gate.
Golden-set versions are pinned. Every eval report references a specific golden-set version. If the golden set is expanded, the old reports are not retroactively recomputed — that would falsify the historical record.
Annotation noise is real. At typical 1% annotator error rates in human-labeled NER data, a 0.5–1.5pt macro_F1 shift can be noise. When a regression lands in this band, manually inspect the disagreement entries before deciding.
Small eval sets amplify noise. 4,535 entries means a 1pt macro_F1 regression is ~45 flipped entries. At 1% annotator error, the false-positive rate on "regression detected" is ~10%.
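
The "~45 flipped entries" figure is simple arithmetic, sketched below. It treats the metric as a per-entry rate, which is a simplification for macro_F1, and the helper name is hypothetical.

```python
def entries_for_shift(eval_size, shift_points):
    """Approximate entry count behind a shift of `shift_points` percentage points.

    Treats the metric as a per-entry rate, a simplification for macro_F1.
    """
    return eval_size * shift_points / 100


eval_size = 4535  # golden set v0.1.2
one_point = entries_for_shift(eval_size, 1.0)
noise_band = (
    entries_for_shift(eval_size, 0.5),
    entries_for_shift(eval_size, 1.5),
)
```

On this approximation the 0.5–1.5pt band corresponds to roughly 23 to 68 entries, which is few enough to read by hand.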

Resolver eval: the name is gameable, the coordinate is not

The discipline above grades parser spans against a golden set. The resolver (the stage that turns a parsed address into a place on Earth) needs its own discipline, and it fails differently. The trap here is grading resolution by name-match: did the resolved place carry the same name string as the gold? That question can only fail when the name is wrong, and the name is almost never wrong. There are ten US localities called "Sheldon"; "New York" is a city, a state, and a village 280km apart. A name-match metric scores every one of those as a tie and gives the resolver full marks for picking any of them.

The 2026-06-08 honest-eval run made the cost concrete: on a leakage-free Vermont slice the resolver scored 93.7% locality name-match while its median coordinate error was 326km. It was finding the right name in the wrong state, and name-match could not see it. Two disciplines follow.

Evaluate on geography the model never trained on

Random evaluation flatters the model: the corpus covers the same towns the eval tests, so component recall is partly memorization. The corpus holds specific regions out of training (corpus/src/split.ts defaultHoldouts(): VT/WY/ND for US, Corse/Lozère/Creuse for FR). OpenAddresses rows in that held-out geography test the model on places it has never seen — the only honest estimate of generalization. Require a minimum slice size (we use 1,000 rows) and report it UNTRUSTED below that rather than scoring noise.
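
A minimal sketch of the slice selection and the minimum-size guard. The real holdout list lives in corpus/src/split.ts; the Python below is only an illustration, and the row shape and function names are assumptions.

```python
MIN_SLICE_ROWS = 1000  # threshold quoted in the text

# Held-out US regions named in the text; the row shape below is an assumption.
US_HOLDOUTS = {"VT", "WY", "ND"}


def held_out_slice(rows, holdouts):
    """Keep only rows whose region was excluded from training."""
    return [row for row in rows if row["region"] in holdouts]


def slice_status(rows, min_rows=MIN_SLICE_ROWS):
    """Label a slice UNTRUSTED when it is too small to score."""
    return "OK" if len(rows) >= min_rows else "UNTRUSTED"


# Toy rows, not real OpenAddresses data.
rows = (
    [{"region": "VT"}] * 1200
    + [{"region": "NY"}] * 5000
    + [{"region": "WY"}] * 40
)
vt = held_out_slice(rows, {"VT"})
wy = held_out_slice(rows, {"WY"})
```

Here the Vermont slice would be scored, while the 40-row Wyoming slice would be reported as UNTRUSTED instead of receiving a number.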

Grade by the coordinate, not the string

Three metrics survive a name collision:

region-match — does the resolved region equal the input region? 100% checkable, and the single fastest way to catch a wrong-state resolve.
coordinate error (great-circle, gold point to resolved centroid, p50/p90) — handles point geometry natively and exposes the wrong-instance resolves that name-match ties.
PIP-containment — is the gold point inside the resolved place's polygon? Name-surface-independent, but report it with a polygon-coverage denominator: WOF point-geometry localities have no polygon and would otherwise count as silent failures, and tight municipal polygons reject rural addresses ascribed to the nearest town. Lead with region-match and coordinate; treat locality-PIP as a coverage-adjusted secondary.
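
The three metrics can be sketched in a few functions. This is an illustration, not the contents of honest-eval.sh or pip-containment.py: the function names are hypothetical, the percentile uses the nearest-rank convention, and the point-in-polygon test itself is taken as an input (`True`, `False`, or `None` when the resolved place has no polygon).

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def region_match_rate(pairs):
    """Share of (input_region, resolved_region) pairs that agree."""
    return sum(1 for want, got in pairs if want == got) / len(pairs)


def pip_report(results):
    """Containment rate with an explicit polygon-coverage denominator."""
    covered = [r for r in results if r is not None]
    return {
        "polygon_coverage": len(covered) / len(results),
        "containment_given_polygon": sum(covered) / len(covered) if covered else None,
    }


# Toy numbers, not results from any run.
errors_km = [1.0, 2.0, 3.0, 4.0, 500.0]
p50, p90 = percentile(errors_km, 50), percentile(errors_km, 90)
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
regions = region_match_rate([("VT", "VT"), ("VT", "NY")])
pip = pip_report([True, False, None, True])
```

Reporting coverage and containment separately keeps point-geometry localities from being counted as misses, and the p90 in the toy data shows how a single wrong-state resolve stands out in the tail even when the median looks fine.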

scripts/eval/honest-eval.sh + scripts/eval/pip-containment.py implement this; see the honest-eval report for the full run.

When aggregate and functional disagree, functional wins

The region fix that came out of that run looked like a pure triumph on aggregate (Vermont 326→3.4km), yet the eight demo presets we read by hand caught it regressing the most famous address in the set (New York, NY → "New York Mills", 283km upstate). The aggregate had averaged that one case away. Chasing why the eight disagreed is what surfaced the deeper bug. Run the functional presets before any verdict, and treat a disagreement between them and the aggregate as a clue, not noise. It is the same lesson the parser side learned, now on the resolver side.

Verdict smokes and eval infrastructure

A separate but related discipline surrounds training experiments. See VERDICT_SMOKES.md for the full framework. The eval-relevant lessons:

Constant-LR smokes, not cosine. Cosine decay hides divergence under a near-zero learning rate. A verdict smoke that uses cosine decay will report "stable" even when the recipe would diverge under sustained peak LR. v0.4.0's cosine-LR meta-bug cost five training runs before it was diagnosed.
Full-run batch geometry. A smoke that runs at a different effective batch size than the full run is testing a different gradient-noise regime. The smoke's "pass" verdict is not transferable.
Run smokes before expensive retrains. The smoke framework exists to catch divergence, NaN, and sampler starvation before they cost a full GPU run. It's the cheapest experiment in the training loop.
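
To see why a cosine schedule can mask divergence, compare the learning rate a run actually experiences under each schedule. The sketch below uses the standard cosine-decay formula without warmup; the peak value and step count are hypothetical, not Mailwoman's training configuration.

```python
import math


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup)."""
    progress = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


def constant_lr(step, total_steps, peak_lr):
    """Hold the peak learning rate for the whole run."""
    return peak_lr


def mean_lr(schedule, total_steps, peak_lr):
    """Average learning rate over the run for a given schedule."""
    return sum(schedule(s, total_steps, peak_lr) for s in range(total_steps)) / total_steps


peak = 3e-4   # hypothetical peak learning rate
steps = 1000  # hypothetical smoke length
late_cosine = cosine_lr(900, steps, peak)          # LR at 90% of the run
cosine_share = mean_lr(cosine_lr, steps, peak) / peak
constant_share = mean_lr(constant_lr, steps, peak) / peak
```

Under cosine decay the run spends its final tenth below 3% of peak and averages about half of peak overall, so instability that needs sustained peak LR to show up has little chance to appear. A constant-LR smoke holds the stress for every step.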

The discipline checklist

Before shipping a model release:

Report per-component F1 with and without adversarial eval slices. The adversarial number is the honest ceiling; the non-adversarial number is the recipe-change signal.
Bucket false negatives into categories. Empty predictions, label confusion, span boundary slip, adversarial artifacts — each has a different fix and a different urgency.
Distinguish model problems from decoder problems. Span boundary slip is a decoder fix; label confusion is a model fix. Don't retrain for a decoder bug.
Inspect borderline regressions manually. A 0.5–1.5pt macro_F1 shift could be annotator noise. Look at the actual disagreements before deciding.
Run verdict smokes at full-run geometry with constant LR. Cosine decay and mismatched batch sizes produce false confidence.
Build diagnostic tooling before you need it. corpus-audit and diagnose_regression.py were built during the v0.4.0 campaign — they would have saved most of v0.3.0's investigation time if they'd existed earlier.
Grade the resolver by coordinate on a leakage-free slice. Never ship a resolver change on name-match alone — it ties same-named places in different states. Evaluate on held-out geography (honest-eval.sh), lead with region-match + coordinate error, and read the demo presets before the verdict.
