Skip to main content

Pre-publish 2D eval gate

Mechanical guard that runs before any neural-weights release lands on the default HF channel. Compares a candidate model's per-tag eval output against a baseline (typically the current default), applies a two-dimensional threshold, and exits non-zero if any tag violates either dimension.

The shape of the gate is the lesson the 2026-05-28 night-2 postmortem and the Layer 1 eval both pointed at: a recall-only gate would have passed v0.6.1 (no tag regressed > 2pp in recall on tags where the model already had > 10% baseline), letting the 1066 dependent_locality hallucinations through. The two-dimensional gate catches both halves.

The 2D ruleโ€‹

FAIL if (recall drop > recall_threshold_pp AND baseline_recall > recall_min_baseline_pct)
OR (hallucination spike > hall_abs_threshold
AND new_hallucination_rate > hall_rate_threshold_pct)

Defaults:

ThresholdDefaultWhy
recall_threshold_pp2A drop smaller than 2pp is noise on aggregated golden-set runs
recall_min_baseline_pct10Drops on already-low-recall tags don't move user-visible quality much
hall_abs_threshold100Below this is plausibly normal training drift
hall_rate_threshold_pct20A tag hallucinating more than 20% of its expected count is a structural failure

Each dimension stands alone but they compose: a regression has to clear BOTH gates to ship.

Why both dimensions matter โ€” the v0.6.1 retroactive checkโ€‹

Running the gate against v0.6.0 โ†’ v0.6.1 produces:

GATE FAILED: 3 violation(s).
- locality (recall) โ€” recall 39.7% โ†’ 31.1% (ฮ” -8.6pp; baseline > 10%)
- house_number (recall) โ€” recall 79.0% โ†’ 75.9% (ฮ” -3.1pp; baseline > 10%)
- dependent_locality (hallucination) โ€” hallucinated 0 โ†’ 1066 (ฮ” +1066; new rate 2665.0%)

The first two would be caught by a recall-only gate. The third โ€” dependent_locality going 0 โ†’ 1066 โ€” would NOT be caught by a recall-only gate (recall there actually improved, 0% โ†’ 30%). That's the v0.6.1 failure pattern from the postmortem encoded as a mechanical check: a tag silently exploding in false-positive count even as nominal recall holds up.

Workflowโ€‹

  1. Score the baseline:

    node --experimental-strip-types scripts/eval-morphology-fst.ts \
    --model /mnt/playpen/.../model-v060.onnx \
    --tokenizer /mnt/playpen/.../v0.6.0-a0/tokenizer.model \
    --model-card neural-weights-en-us/model-card.json \
    --admin-fst /mnt/playpen/.../fst-en-us.bin \
    --golden data/eval/golden/v0.1.2 \
    --name v0.6.0-default \
    --out-json /tmp/eval-v060.json
  2. Score the candidate (same eval, new weights).

  3. Run the gate:

    node --experimental-strip-types scripts/eval-gate.ts \
    --baseline /tmp/eval-v060.json \
    --candidate /tmp/eval-v062.json \
    --out-md /tmp/gate-report.md

    Exit 0 โ†’ safe to promote. Exit 1 โ†’ block the release-it dispatch.

The gate is reusable forever โ€” it doesn't know about specific tags or specific models, only about the threshold contract. v0.7+ models with dependent_street, unit_designator, JP tags will be measured the same way.

What the gate does NOT replaceโ€‹

  • The v0-vs-neural harness (per-locale + per-file pass rates across 376 hand-tuned assertions). The gate works on aggregate golden-set metrics; the harness works on the legacy rule-based pipeline's harsh acceptance criteria. Both should be green.
  • Manual demo-preset eval. The /eval-model skill remains the human-eyeballs gate before any push to default.
  • Falsehoods catalog regression check (data/eval/falsehoods/streets.jsonl).

The gate is one slice of the pre-publish protocol, not the whole thing.

See alsoโ€‹