Negative space — why training every component sharpens each one

A useful intuition guides Mailwoman's corpus-coverage work: as we add training signal for the address components we've been missing (unit, the street affixes, and more), the model gets better at the components we already handle — not just the new ones. The reason is that a sequence labeller learns a tag by its boundaries, so teaching it what a token isn't is how it learns what the neighbours are. (One caveat that the campaign earned the hard way: this logic applies to tags the model must learn from context, not to every starved tag — see Two kinds of starved tag at the end.)

This has names at four altitudes:

  • Philosophy calls it via negativa — defining a thing by what it is not.
  • Art calls it negative space — you draw the vase by drawing the gap.
  • Linguistics has it as Saussure's whole thesis: a sign has no meaning in isolation, only value through difference. "Street" means street partly because it isn't "unit" or "locality."
  • Machine learning calls it discriminative learning — the model fits the decision boundary between classes, not a description of each class in isolation.

The catch-all failure it cures

The softmax at each token position is a competition: every tag's probability is normalised against all the others, and the mass has to land somewhere. When a real category has no training signal, the model never learns a detector for it — so tokens that genuinely belong to that category get dumped into the nearest tag it did learn. That tag becomes a catch-all (or "dumping-ground") label.
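
The competition can be seen in a toy calculation. The sketch below is illustrative only: the logits are invented numbers rather than Mailwoman outputs, and it holds the other logits fixed when unit is added, which a real retrain would not do.

```python
import math

def softmax(logits):
    """Normalise a dict of tag -> logit into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tag: math.exp(value - peak) for tag, value in logits.items()}
    total = sum(exps.values())
    return {tag: e / total for tag, e in exps.items()}

# Hypothetical logits for the token "Apt" in "123 Main St Apt 456 Oakland CA".
# With no unit class, the mass has to land on one of the tags that exist.
without_unit = softmax({"street": 2.0, "locality": 0.5, "region": -1.0})

# The same token once unit is a trained class with its own detector.
with_unit = softmax({"street": 2.0, "locality": 0.5, "region": -1.0, "unit": 4.0})

print("winner without unit:", max(without_unit, key=without_unit.get))
print("winner with unit:   ", max(with_unit, key=with_unit.get))
print("street mass before/after:", without_unit["street"], with_unit["street"])
```

Nothing about the street logit changed between the two calls. Its probability fell only because a better-fitting competitor entered the normalisation.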

This is not hypothetical for us. The v0-parity assessment caught it directly:

  • 123 Main St Apt 456 Oakland CA → the model tags Apt 456 as street, because unit was never a class it could reach for.
  • 1 Main St Pittsburg PA → street absorbs the city, because the model has never been shown a clean example of what legitimately comes after a street, so it doesn't know where street ends.

street is acting as a garbage collector for everything on the road line. The moment unit becomes a trained category, those tokens have a home: their probability mass moves off street and onto unit, and street gets sharper as a side effect — we improved a tag by teaching its neighbours.

The honest caveat

This is usually net-positive but not guaranteed monotonic. Adding a class is free when it's a genuine category the old label was wrongly swallowing (our case). It can backfire when two categories are genuinely ambiguous (you trade one confusion for another), or through capacity and label-noise effects — Mailwoman has scar tissue here, where adding a locale interfered with an existing one on some retrains. So negative space is a strong hypothesis, not a theorem.

Which is exactly why coverage work is gated on measurability first. Before generating corpus rows for a starved tag, the eval set must carry enough held-out examples of it that its F1 is real signal, not noise from a handful of rows (see eval discipline and the val-set stratification work). The coverage eval is the apparatus that lets the numbers say whether negative space held for our model — whether covering unit lifted street/locality precision, and whether anything regressed — rather than assuming it.
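
A minimal sketch of that gate, under stated assumptions: scoring here is token-level, the function name `per_tag_scores` and the `min_support` threshold of 50 are hypothetical, and the real coverage eval may score spans and use a different cutoff.

```python
from collections import Counter

def per_tag_scores(gold, predicted, min_support=50):
    """Token-level precision/recall/F1 per tag, plus a flag for whether the
    tag has enough held-out gold examples for its score to mean anything."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted tag swallowed a token it should not have
            fn[g] += 1  # the gold tag lost a token it should have claimed
    scores = {}
    for tag in sorted(set(gold) | set(predicted)):
        support = tp[tag] + fn[tag]  # number of gold tokens carrying this tag
        claimed = tp[tag] + fp[tag]  # number of tokens the model gave this tag
        precision = tp[tag] / claimed if claimed else 0.0
        recall = tp[tag] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
            "measurable": support >= min_support,
        }
    return scores

# Toy rows: the model has no unit class, so "Apt 456" is tagged as street.
gold = ["house_number", "street", "street", "unit", "unit", "locality", "region"]
predicted = ["house_number", "street", "street", "street", "street", "locality", "region"]

scores = per_tag_scores(gold, predicted)
for tag, row in scores.items():
    print(tag, row)
```

In this toy run the catch-all shows up as low street precision alongside zero unit recall, and every tag is flagged as not measurable because seven tokens are far too few to trust.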

Two kinds of starved tag — and why the model handles both

Negative space tells you which tags are taxing their neighbours; it doesn't tell you the only way to cover them is more rows of the same shape. The parity campaign surfaced two shapes of starved tag — but the conclusion is that the model stays the authority for both; the difference is only what kind of help it needs.

  • Open-vocabulary, boundary-defined tags: unit, street_prefix, street_suffix, locality. Their members are open-ended and their extent depends on context, so the model has to learn the boundary. A balanced coverage shard is the whole fix, and this is where the negative-space effect compounds: covering unit (0 → 92%) lifted US street by about three points.
  • Closed-vocabulary tags: country, po_box, cedex. A finite, enumerable surface set in a predictable slot. The tempting shortcut is a deterministic lookup, and an early pass took it. That was wrong, for an instructive reason.

A first attempt trained country on a shard that was almost entirely trailing-country rows, so the model over-learned the cheap correlation "trailing token ⇒ country" and started promoting cities and regions to nationhood (49 F1, 23% precision). A flat lookup scored 100, but only on an eval with no homographs. The homograph case ("Atlanta, Georgia" → region vs "Tbilisi, Georgia" → country) is exactly where a lookup can't win without a hand-coded guard, and resolving it from context is the model's whole reason to exist. So the over-firing was a training-distribution artifact, not a model limitation, and "deterministic wins" was a mirage measured on the easy cases.
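
The mirage is easy to reproduce in miniature. The sketch below is a toy, not the campaign's eval: the lexicon, the rows, and the function names are all invented for illustration, and the "lookup" only checks whether the trailing token is a known country name.

```python
# Hypothetical three-entry country lexicon; a real gazetteer is far larger.
COUNTRY_LEXICON = {"georgia", "france", "germany"}

def lookup_says_country(address):
    """Flat lookup: call the trailing token a country if the lexicon contains it."""
    last_token = address.replace(",", " ").split()[-1].lower()
    return last_token in COUNTRY_LEXICON

def accuracy(rows):
    """Fraction of rows where the lookup agrees with the gold answer."""
    hits = sum(lookup_says_country(address) == gold for address, gold in rows)
    return hits / len(rows)

# Each row is (address, gold answer to "is the trailing token a country?").
easy_eval = [
    ("10 Rue de Rivoli, Paris, France", True),
    ("1 Main St, Oakland, CA", False),
]
homograph_eval = [
    ("Tbilisi, Georgia", True),   # the country
    ("Atlanta, Georgia", False),  # the US state, so the gold tag is region
]

print("easy eval accuracy:     ", accuracy(easy_eval))
print("homograph eval accuracy:", accuracy(homograph_eval))
```

The lookup is perfect on the easy rows and wrong on the Atlanta row, because the surface string alone carries no information about which Georgia is meant.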

The corrected rule: cover every starved tag by training; let the lexicon inform the model, never override it. Open-vocab and closed-vocab-without-homographs (po_box, cedex) just need a balanced coverage shard. A closed-vocab tag with a large set and homographs (country) additionally gets the gazetteer as a soft anchor feature — a match flag plus a "known-homograph" bit — fed into the model the same way the postcode anchor is, so the model decides with the lexicon as evidence. The matcher becomes a feature-source, not a post-parse verdict. Full rationale, data recipe, and the eval-first plan: Closed-vocab fields: model-first.
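
One way to picture the matcher as a feature-source is the sketch below. It is an assumption-laden illustration, not Mailwoman's implementation: the `GazetteerFeatures` type, the tiny name sets, and `anchor_features` are hypothetical, and the matcher handles single-token names only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GazetteerFeatures:
    country_match: bool    # the token appears in the country gazetteer
    known_homograph: bool  # the matched name is also a known non-country name

# Hypothetical lexicon slices, for illustration only.
COUNTRY_NAMES = {"georgia", "france", "jordan"}
HOMOGRAPH_NAMES = {"georgia", "jordan"}

def anchor_features(tokens):
    """Emit one feature record per token. Nothing here assigns a tag: the
    flags are evidence for the model, which still decides from context."""
    features = []
    for token in tokens:
        key = token.strip(",.").lower()
        match = key in COUNTRY_NAMES
        features.append(GazetteerFeatures(match, match and key in HOMOGRAPH_NAMES))
    return features

for address in (["Atlanta,", "Georgia"], ["Tbilisi,", "Georgia"], ["Paris,", "France"]):
    print(address, anchor_features(address))
```

Both Georgia rows produce identical flags for the final token, which is the point: the lexicon reports a match and warns that it is ambiguous, and the surrounding tokens are what let the model choose between region and country.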