What the eval numbers mean
Mailwoman evaluates itself by running against a set of 4,535 hand-labelled addresses (the "golden set") and measuring how well each pipeline mode does. This article explains the four modes, the metrics, and what the v0.5.0 results actually tell us, in plain terms.
The four modes
Mailwoman can parse addresses in four different ways. Each one uses a different combination of tools:
| Mode | What it uses | Analogy |
|---|---|---|
| Rule-only | Hand-written rules, pattern matching, dictionaries | A postmaster who memorises the rulebook |
| Neural | The AI model's best guess, decoded with structural constraints | A student who writes their first instinct, checked for grammar |
| Hybrid | Rules + AI model working together | The postmaster and the student collaborating |
| Hybrid-joint | Rules + AI + a "sanity checker" that rejects incoherent guesses | The collaboration, plus an editor who crosses out answers that contradict each other |
These analogies simplify a single staged pipeline: each "mode" is a different composition of the same underlying stages, not a separate parser.
The metrics
Exact match: did the parser get every single component of the address right? Every component (house number, street, city, region, postcode) must match the human-labelled answer exactly. This is harsh. Getting 4 out of 5 components right scores zero.
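The all-or-nothing scoring can be illustrated with a minimal sketch. The dictionary layout and field names (`house_number`, `street`, and so on) are assumptions made for this example, not Mailwoman's actual data model.

```python
from typing import Dict, List

# Hypothetical representation: component name -> value.
Parse = Dict[str, str]


def exact_match_rate(predicted: List[Parse], golden: List[Parse]) -> float:
    """Fraction of addresses whose predicted components all equal the golden labels."""
    if len(predicted) != len(golden):
        raise ValueError("predicted and golden must have the same length")
    if not golden:
        return 0.0
    hits = sum(1 for p, g in zip(predicted, golden) if p == g)
    return hits / len(golden)


golden = [
    {"house_number": "12", "street": "Main St", "city": "Houston",
     "region": "TX", "postcode": "77002"},
    {"house_number": "5", "street": "Elm Rd", "city": "Austin",
     "region": "TX", "postcode": "73301"},
]
predicted = [
    dict(golden[0]),                     # all five components right
    {**golden[1], "postcode": "73300"},  # four of five right: still scores zero
]

rate = exact_match_rate(predicted, golden)
print(rate)  # 0.5
```

The second address gets no partial credit, which is why exact match sits so far below the per-component metrics.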
Macro F1: a softer measure that balances two questions per component type: did you find it when it was there? (recall) and did you make it up when it wasn't? (precision). Each component type gets an F1 score (the harmonic mean of its precision and recall), and the macro score averages those with equal weight across all component types. A parser that's great at postcodes but bad at venues gets partial credit.
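As a rough sketch of that averaging, the function below computes per-type precision, recall, and F1, then takes the unweighted mean. It assumes a component counts as correct only when its value equals the golden value exactly; how Mailwoman's evaluator actually matches components is not stated here, so treat the matching rule and the data layout as assumptions.

```python
from typing import Dict, List, Set

# Hypothetical representation: component name -> value.
Parse = Dict[str, str]


def macro_f1(predicted: List[Parse], golden: List[Parse]) -> float:
    """Unweighted mean of per-component-type F1 over every type seen on either side."""
    types: Set[str] = set()
    for parse in predicted + golden:
        types.update(parse)

    scores = []
    for t in sorted(types):
        tp = fp = fn = 0
        for p, g in zip(predicted, golden):
            if t in p and t in g and p[t] == g[t]:
                tp += 1
            else:
                if t in p:
                    fp += 1  # emitted, but wrong or absent from the golden label
                if t in g:
                    fn += 1  # in the golden label, but missed or wrong
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


golden = [{"postcode": "77002", "venue": "City Hall"}, {"postcode": "73301"}]
predicted = [{"postcode": "77002"}, {"postcode": "73301", "venue": "Elm"}]

score = macro_f1(predicted, golden)
print(score)  # 0.5: postcode F1 is 1.0, venue F1 is 0.0
```

Both addresses would score zero on exact match, yet the perfect postcodes still earn half credit here.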
Empty-parse rate: how often does the parser give up entirely and return nothing? Lower is better. A parser that always guesses something (even if wrong) scores 0% here.
Overconfident-wrong rate: how often does the parser say "I'm very sure" (confidence above 90%) but get the parse wrong? This is the most dangerous failure mode for downstream consumers: a geocoder that's confidently wrong will silently return the wrong coordinates with no signal that something went amiss.
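The two rates above can be sketched as follows. The `Result` class, the single overall confidence per parse, and the choice to divide by all inputs are assumptions for illustration; only the 90% threshold comes from the definition above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical representation: component name -> value.
Parse = Dict[str, str]


@dataclass
class Result:
    """Hypothetical parser output: components plus one overall confidence."""
    components: Parse = field(default_factory=dict)
    confidence: float = 0.0


def empty_parse_rate(results: List[Result]) -> float:
    """Share of inputs for which the parser returned no components at all."""
    if not results:
        return 0.0
    return sum(1 for r in results if not r.components) / len(results)


def overconfident_wrong_rate(
    results: List[Result], golden: List[Parse], threshold: float = 0.9
) -> float:
    """Share of inputs parsed wrongly with confidence above the threshold."""
    if not results:
        return 0.0
    bad = sum(
        1 for r, g in zip(results, golden)
        if r.confidence > threshold and r.components != g
    )
    return bad / len(results)


golden = [{"city": "Houston"}, {"city": "Austin"}, {"city": "Dallas"}, {"city": "Waco"}]
results = [
    Result({"city": "Houston"}, 0.97),  # confident and right
    Result({"city": "Boston"}, 0.95),   # confident and wrong: counted
    Result({"city": "Plano"}, 0.40),    # wrong, but not confident
    Result(),                           # gave up: empty parse
]

print(empty_parse_rate(results))                  # 0.25
print(overconfident_wrong_rate(results, golden))  # 0.25
```

Note that the low-confidence wrong answer is not counted: the metric only penalises errors a downstream consumer would have had no reason to doubt.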
The v0.5.0 results
| Mode | Exact Match | Macro F1 | Empty Parse | Overconf Wrong |
|---|---|---|---|---|
| Rule-only | 30.8% | 22.0% | 6.3% | 2.4% |
| Neural | 0.1% | 7.3% | 0.3% | 54.5% |
| Hybrid | 0.1% | 7.3% | 0.3% | 54.5% |
| Hybrid-joint | 6.0% | 16.6% | 0.0% | 0.1% |
What this tells us
Rule-only is still the most accurate on addresses it covers
30.8% exact match means: for just under 1 in 3 addresses in the golden set, the rule parser gets every component perfectly right. This sounds low, but exact match is a strict measure, and the rule parser only knows the patterns it was hand-taught. It has zero coverage on addresses outside those patterns (different countries, unusual formats).
The rule parser's weaknesses: a 6.3% empty-parse rate (it gives up on some inputs entirely) and only 22% macro F1 (it is good at some component types but bad at others; venue detection is weak at 24% F1).
The neural model learned to spell words but not write sentences
The v0.5.0 neural model achieved val_macro_f1=0.605 during training, which sounds good. But on the eval matrix it scores 0.1% exact match and 54.5% overconfident-wrong. What happened?
Training eval asks "did the model label each word correctly?", which is a local question. The golden eval asks "did the parser produce a correct address?", which is a global question. These are different. The model can score 0.605 macro F1 on the first and 0.001 exact match (0.1%) on the second because correct per-token labeling doesn't guarantee correct parses: one wrong token cascades into a structurally invalid address.
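A toy example makes the gap between the local and global questions concrete. It uses plain per-token accuracy rather than macro F1 to keep the arithmetic obvious, and the label sequences are invented for illustration.

```python
from typing import List

Labels = List[str]


def token_accuracy(pred: List[Labels], gold: List[Labels]) -> float:
    """Local view: share of individual tokens labelled correctly."""
    total = sum(len(g) for g in gold)
    correct = sum(p == g for ps, gs in zip(pred, gold) for p, g in zip(ps, gs))
    return correct / total


def sequence_exact_match(pred: List[Labels], gold: List[Labels]) -> float:
    """Global view: share of addresses with every token labelled correctly."""
    return sum(ps == gs for ps, gs in zip(pred, gold)) / len(gold)


gold = [
    ["house_number", "street", "street", "city", "postcode"],
    ["house_number", "street", "city", "region", "postcode"],
]
# One wrong label per address: the city is tagged as a dependent_locality.
pred = [
    ["house_number", "street", "street", "dependent_locality", "postcode"],
    ["house_number", "street", "dependent_locality", "region", "postcode"],
]

print(token_accuracy(pred, gold))        # 0.8
print(sequence_exact_match(pred, gold))  # 0.0
```

Eight of ten tokens are right, yet neither address survives the exact-match test. The longer the address, the more chances a single token has to sink it.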
The concrete failure: the model invented a dependent_locality (a sub-city neighborhood) 956 times where none existed in the golden labels. The model was actively hallucinating a component it hadn't learned to distinguish; overconfidence was only the symptom. Cross-entropy treats every mislabeling equally, so the model never learned that dependent_locality is rare and should be emitted sparingly.
In hybrid mode, the neural model's overconfidence drowns out the rules entirely: when the neural decoder says "this token is a dependent_locality" at 95% confidence and the rule parser disagrees, the neural vote wins. This is why hybrid and neural show identical numbers: the rules never get a say.
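One way to picture this is a highest-confidence-wins arbiter. This is a hypothetical sketch, not Mailwoman's actual merging logic: the `Vote` type, the `arbitrate` function, and the rule confidence of 0.80 are all invented to show how an overconfident voter can silence a correct one.

```python
from typing import NamedTuple


class Vote(NamedTuple):
    """Hypothetical per-token vote from one stage of the pipeline."""
    label: str
    confidence: float
    source: str


def arbitrate(rule: Vote, neural: Vote) -> Vote:
    """Assumed policy: the higher-confidence vote wins; ties go to the rules."""
    return neural if neural.confidence > rule.confidence else rule


rule_vote = Vote("city", 0.80, "rule")                      # correct, modest confidence
neural_vote = Vote("dependent_locality", 0.95, "neural")    # wrong, very confident

winner = arbitrate(rule_vote, neural_vote)
print(winner.source, winner.label)  # neural dependent_locality
```

Under a policy like this, a model that reports near-certain confidence on most tokens wins every disagreement, so the merged output is indistinguishable from the neural output alone.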
The reconciler fixes the honesty problem
Hybrid-joint mode (the reconciler) drops overconfident-wrong from 54.5% to 0.1%. How? By checking whether the parsed components form a consistent real-world hierarchy. "Is there actually a city called Houston in a state called NY?" If not, the parse is rejected or rewritten.
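A coherence check of this kind might look like the sketch below. The toy gazetteer, the `reconcile` function, and the policy of dropping the conflicting components (rather than rejecting the whole parse) are assumptions for illustration; the real reconciler's data source and rewrite rules are not described here.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: component name -> value.
Parse = Dict[str, str]

# Toy gazetteer of known (city, region) pairs; a real one would be far larger.
KNOWN_CITY_REGIONS: Set[Tuple[str, str]] = {
    ("Houston", "TX"),
    ("New York", "NY"),
}


def reconcile(parse: Parse) -> Parse:
    """Return the parse unchanged if coherent, else a conservative rewrite.

    Assumed policy: when city and region contradict the gazetteer, drop both
    and keep the remaining components instead of returning nothing.
    """
    city, region = parse.get("city"), parse.get("region")
    if city is None or region is None:
        return dict(parse)  # nothing to cross-check
    if (city, region) in KNOWN_CITY_REGIONS:
        return dict(parse)
    return {k: v for k, v in parse.items() if k not in ("city", "region")}


coherent = reconcile({"city": "Houston", "region": "TX", "postcode": "77002"})
incoherent = reconcile({"city": "Houston", "region": "NY", "postcode": "77002"})
print(coherent)    # {'city': 'Houston', 'region': 'TX', 'postcode': '77002'}
print(incoherent)  # {'postcode': '77002'}
```

A rewrite like this gives up components it cannot vouch for, which trades exact-match accuracy for fewer confidently wrong answers.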
The reconciler also eliminates empty parses entirely (0.0%): it always produces something, even if conservative.
The trade-off: exact match drops from 30.8% (rule-only) to 6.0% (hybrid-joint), although that is still well above the 0.1% of plain hybrid. The reconciler is more honest but less accurate on well-formed addresses. This is a calibration-vs-accuracy trade-off that the next iteration will address by re-adding class weights to the training recipe.
The architecture is working; the quality isn't there yet
The staged pipeline (rules for structure, neural for ambiguity, reconciler for honesty) is producing the behaviour it was designed for. Each layer adds value:
- Rules contribute high precision on common patterns.
- Neural contributes coverage on unusual inputs (0.3% empty parse vs rules' 6.3%).
- Reconciler contributes honesty (0.1% overconfident-wrong vs 54.5%).
The quality gap is in the neural classifier's per-component accuracy. This is addressable without architectural changes: class-weighted cross-entropy (pulling the model's attention back to underperforming tags) and longer training are both now safe to try because the dual-loss instability that blocked them is gone.
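For readers unfamiliar with class weighting, here is a minimal, framework-free sketch of weighted cross-entropy. The tag names, probabilities, and weight values are invented, and the normalisation by the sum of weights is one common convention; the actual training recipe may differ.

```python
import math
from typing import Dict, List


def weighted_cross_entropy(
    probs: List[Dict[str, float]],
    targets: List[str],
    class_weights: Dict[str, float],
) -> float:
    """Mean of -w[y] * log p(y) over tokens, normalised by the sum of weights."""
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, targets):
        w = class_weights.get(y, 1.0)  # tags without an explicit weight count as 1.0
        total += -w * math.log(p[y])
        weight_sum += w
    return total / weight_sum


# Predicted probabilities for two tokens (hypothetical values).
probs = [{"street": 0.9, "venue": 0.1}, {"street": 0.6, "venue": 0.4}]
targets = ["street", "venue"]

plain = weighted_cross_entropy(probs, targets, {})
weighted = weighted_cross_entropy(probs, targets, {"venue": 3.0})
print(round(plain, 3), round(weighted, 3))
```

With the heavier weight, the poorly predicted venue token dominates the loss, so gradient updates are pulled toward fixing that tag rather than polishing tags the model already gets right.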
See also
- The staged pipeline: how the four modes compose
- The knowledge ladder: why each layer exists
- Dual-loss curvature conflict: why the training was unstable before and what fixed it
- v0.5.0, as shipped: what the six threads delivered