v0.6.3 step 100K eval

The v0.6.3 retrain (per the street-supplement architecture) reached step 100K and was evaluated. Verdict: HOLD, with three gate violations. The predicted dilution failure mode materialized.

Setup

Model: v0.6.3 step 100K (/data/output-v063/checkpoints/step-100000)
Training: A100-SXM4-40GB, ~12.8 steps/s, ~130 min wall-clock. No NaN.
Corpus changes from v0.6.2: filtered venue pool (removed 5th Avenue Theatre, 7th Street Bistro); ADDED synth-house-venue-v063 shard at weight 1.0 (32K rows of house_number + venue + street coexistence); dropped synth-no-street weight 1.0 → 0.5 (kept v0.6.2b's setting).
Tokenizer: v0.6.0-a0 (unchanged)
Eval: v0.1.2 golden set, --stage3-fold enabled, v0.6.0 baseline

Headline numbers

Metric	v0.6.0 baseline	v0.6.2 step 100K	v0.6.3 step 100K
Exact match	21.1%	22.4%	21.8%
Gate violations	(ref)	1	3
`dependent_locality` hallucinations	0	0 ✓	844 ❌
`locality` recall	40.0%	41.0%	34.7% ❌ (-5.3pp vs v0.6.0)
`house_number` recall	79.0%	74.0%	77.0% ✓ (+3pp vs v0.6.2)
`street` recall	27.7%	27.1%	29.2% ✓ (+1.5pp vs v0.6.0)
`postcode` recall	76.0%	84.1%	83.4% ✓
`country` recall	24.5%	33.1%	33.5% ✓
Harness pass rate	14.4%	14.0%	12.5% ❌
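
The per-field deltas quoted in this doc can be recomputed from the table. A minimal sketch, with the recall values copied from the rows above; the `RECALL` layout and the `delta_pp` and `deltas` helpers are illustrative names and not part of the eval harness:

```python
# Recall values (percent) copied from the headline table.
# Tuple order: (v0.6.0 baseline, v0.6.2 step 100K, v0.6.3 step 100K).
RECALL = {
    "locality": (40.0, 41.0, 34.7),
    "house_number": (79.0, 74.0, 77.0),
    "street": (27.7, 27.1, 29.2),
    "postcode": (76.0, 84.1, 83.4),
    "country": (24.5, 33.1, 33.5),
}


def delta_pp(new: float, old: float) -> float:
    """Percentage-point difference, rounded to one decimal place."""
    return round(new - old, 1)


def deltas(field: str) -> dict:
    """v0.6.3 deltas for one field against both reference runs."""
    base, v062, v063 = RECALL[field]
    return {
        "vs_v0.6.0": delta_pp(v063, base),
        "vs_v0.6.2": delta_pp(v063, v062),
    }


if __name__ == "__main__":
    for field in RECALL:
        d = deltas(field)
        print(f"{field:>13}: {d['vs_v0.6.0']:+.1f}pp vs v0.6.0, "
              f"{d['vs_v0.6.2']:+.1f}pp vs v0.6.2")
```

With these inputs the street gain is +1.5pp against v0.6.0 and +2.1pp against v0.6.2, and the locality drop is -5.3pp against v0.6.0 and -6.3pp against v0.6.2, so it matters which reference a delta is quoted against.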

What worked

house_number recovered by 3pp vs v0.6.2 (74.0% → 77.0%). The synth-house-venue companion shard did what it was designed to do: it taught the model that house_number and venue coexist, recovering 3pp of the 5pp lost between v0.6.0 (79.0%) and v0.6.2.
street recall improved (27.1% → 29.2%, +2.1pp vs v0.6.2; +1.5pp vs the v0.6.0 baseline of 27.7%). The filtered venue pool (no more digit+ordinal venues) likely contributed.
postcode, country held their v0.6.2 gains.

What broke (badly)

`dependent_locality` regression — predicted and confirmed

The original v0.6.1 problem came back. 844 hallucinations vs v0.6.2's 0. Rate: 2110% of expected occurrences (40 expected dep_loc spans in the golden set; the model emitted 844 hallucinations).
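
The 2110% figure is the hallucination count divided by the number of expected spans. A minimal sketch of that arithmetic; `hallucination_rate_pct` is a hypothetical helper, not a function from the eval scripts:

```python
def hallucination_rate_pct(hallucinated: int, expected: int) -> float:
    """Hallucinated spans as a percentage of expected spans for one tag."""
    if expected <= 0:
        raise ValueError("expected span count must be positive")
    return 100.0 * hallucinated / expected


if __name__ == "__main__":
    # 844 dep_loc hallucinations against 40 expected dep_loc spans.
    print(hallucination_rate_pct(844, 40))  # 2110.0
```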

This was DeepSeek's turn 10 dilution diagnosis: dropping synth-no-street weight 1.0 → 0.5 AND adding 32K rows of synth-house-venue at weight 1.0 changed the anti-decompose:companion-shard ratio from v0.6.2's 1.0:0 to v0.6.3's 0.5:1.0. The anti-decompose signal got proportionally weaker. Decompose mode came back over the 20K → 100K window.
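
One way to make the dilution concrete is to look at the anti-decompose shard's share of the combined weight of the two shards. This is a rough sketch under stated assumptions: it considers only synth-no-street and synth-house-venue, ignores row counts and the rest of the corpus, and assumes the trainer samples in proportion to weight. The v0.6.4 entry further assumes the companion shard stays at weight 1.0. `anti_decompose_share` and `MIXES` are illustrative names.

```python
def anti_decompose_share(no_street_weight: float, house_venue_weight: float) -> float:
    """Fraction of the two-shard weight carried by synth-no-street."""
    total = no_street_weight + house_venue_weight
    if total <= 0:
        raise ValueError("at least one shard weight must be positive")
    return no_street_weight / total


# (synth-no-street weight, synth-house-venue weight) per run.
MIXES = {
    "v0.6.2": (1.0, 0.0),   # no companion shard
    "v0.6.3": (0.5, 1.0),   # weights as described above
    "v0.6.4": (0.75, 1.0),  # ASSUMPTION: companion shard weight unchanged
}

if __name__ == "__main__":
    for run, (no_street, house_venue) in MIXES.items():
        share = anti_decompose_share(no_street, house_venue)
        print(f"{run}: anti-decompose share = {share:.1%}")
```

Under these assumptions the share falls from 100% in v0.6.2 to about 33% in v0.6.3, and the pre-staged v0.6.4 would bring it back to about 43%.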

The trajectory confirms it:

v0.6.3 step 20K: 1 dep_loc hallucination
v0.6.3 step 100K: 844 dep_loc hallucinations

The damage accumulated between those two checkpoints, which is consistent with the model overfitting to the diluted distribution as training progressed.

`locality` recall dropped 5.3pp

40.0% → 34.7% against the v0.6.0 baseline, and 41.0% → 34.7% (-6.3pp) against v0.6.2. This is a side effect of the dep_loc explosion: many tokens that should be tagged locality are instead tagged dependent_locality. The model confuses the two tags more than v0.6.2 did.

Harness pass rate dropped below v0.6.0

12.5% vs v0.6.0's 14.4%. On the breadth eval v0.6.3 is worse than v0.6.0, which puts it in sidegrade territory. Per the decision tree, this alone would be grounds for HOLD even if the gate had passed.

Gate verdict: HOLD

Per the decision tree:

Gate eval	harness pass	Action
Pass	≥ 25%	Promote to default
Pass	15-24%	Ship as experimental
Pass	≤ 14.4%	HOLD (sidegrade)
Fail	any	HOLD
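
The table can be read as a small function. The sketch below is an illustration, not the project's gating code. Assumptions: `gate_passed` means zero gate violations, and harness values that fall between the listed bands (above 14.4% but below 15%, or above 24% but below 25%) are assigned to the lower band, since the table does not cover them.

```python
def gate_action(gate_passed: bool, harness_pass_pct: float) -> str:
    """Map a gate result and harness pass rate to the decision-tree action."""
    if not gate_passed:
        return "HOLD"  # row 4: a failed gate holds regardless of harness
    if harness_pass_pct >= 25.0:
        return "Promote to default"
    if harness_pass_pct >= 15.0:
        return "Ship as experimental"
    # ASSUMPTION: anything below 15% is treated as the "≤ 14.4%" row.
    return "HOLD (sidegrade)"


if __name__ == "__main__":
    # v0.6.3 step 100K: gate failed (3 violations), harness pass rate 12.5%.
    print(gate_action(False, 12.5))  # HOLD
    # Same harness rate with a hypothetical passing gate.
    print(gate_action(True, 12.5))   # HOLD (sidegrade)
```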

v0.6.3 falls in row 4: HOLD. Two paths from here:

1. Launch the pre-staged v0.6.4 (synth-no-street weight 0.5 → 0.75). DeepSeek's mechanism-3 hypothesis just got empirically confirmed by v0.6.3 going wrong in the predicted way. v0.6.4 directly counters the dilution.
2. Step back and rethink v0.6.x as a whole. Three iterations in, the pattern is clear: every corpus-weight change produces non-trivial trade-offs. Maybe the issue is upstream of any recipe: overfitting, evaluation methodology, schema design, or model capacity.

The operator chose path 2. This doc captures what we learned. The v0.6.x retrospective synthesizes what the cycle revealed and proposes a rethink agenda.

Pre-staged v0.6.4 — held but not discarded

The v0.6.4 yaml + parquet artifacts remain on the Modal volume. If the rethink concludes that "weight rebalance is the right path but we just need to land the right setpoint," v0.6.4 is one modal run away. If the rethink concludes that something more structural needs to change, v0.6.4 is reference data for what would have happened.

Reproducing

./scripts/eval-v062-checkpoint.sh 100000 v063

Artifacts in /tmp/v063-eval-step-100000/.
