v0.6.3 step 100K eval
The v0.6.3 retrain (per the street-supplement architecture) reached step 100K and was evaluated. Verdict: HOLD, with three gate violations. The predicted dilution failure mode materialized.
Setup
- Model: v0.6.3 step 100K (`/data/output-v063/checkpoints/step-100000`)
- Training: A100-SXM4-40GB, ~12.8 steps/s, ~130 min wall-clock. No NaN.
- Corpus changes from v0.6.2: filtered venue pool (removed `5th Avenue Theatre`, `7th Street Bistro`); ADDED `synth-house-venue-v063` shard at weight 1.0 (32K rows of house_number + venue + street coexistence); dropped synth-no-street weight 1.0 → 0.5 (kept v0.6.2b's setting).
- Tokenizer: v0.6.0-a0 (unchanged)
- Eval: v0.1.2 golden set, `--stage3-fold` enabled, v0.6.0 baseline
Headline numbers
| Metric | v0.6.0 baseline | v0.6.2 step 100K | v0.6.3 step 100K |
|---|---|---|---|
| Exact match | 21.1% | 22.4% | 21.8% |
| Gate violations | (ref) | 1 | 3 |
| dependent_locality hallucinations | 0 | 0 ✓ | 844 ❌ |
| locality recall | 40.0% | 41.0% | 34.7% ❌ (-5.3pp vs v0.6.0) |
| house_number recall | 79.0% | 74.0% | 77.0% ✓ (+3pp vs v0.6.2) |
| street recall | 27.7% | 27.1% | 29.2% ✓ (+1.5pp vs v0.6.0) |
| postcode recall | 76.0% | 84.1% | 83.4% ✓ |
| country recall | 24.5% | 33.1% | 33.5% ✓ |
| Harness pass rate | 14.4% | 14.0% | 12.5% ❌ |
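The deltas quoted in this report mix two reference points (the v0.6.0 baseline and v0.6.2), so it can help to recompute both from the table. The snippet below is a minimal sketch: the numbers are copied from the table above, while the dictionary keys and the `deltas` helper are illustrative names, not part of the eval harness.

```python
# Percentages copied from the headline table: (v0.6.0, v0.6.2, v0.6.3).
# Key names are illustrative; they are not the eval harness's metric ids.
METRICS = {
    "exact_match": (21.1, 22.4, 21.8),
    "locality_recall": (40.0, 41.0, 34.7),
    "house_number_recall": (79.0, 74.0, 77.0),
    "street_recall": (27.7, 27.1, 29.2),
    "postcode_recall": (76.0, 84.1, 83.4),
    "country_recall": (24.5, 33.1, 33.5),
    "harness_pass_rate": (14.4, 14.0, 12.5),
}


def deltas(metrics):
    """Return {name: (delta_vs_v060, delta_vs_v062)} in percentage points."""
    return {
        name: (round(v063 - v060, 1), round(v063 - v062, 1))
        for name, (v060, v062, v063) in metrics.items()
    }


if __name__ == "__main__":
    for name, (d060, d062) in deltas(METRICS).items():
        print(f"{name:20s} vs v0.6.0: {d060:+.1f}pp   vs v0.6.2: {d062:+.1f}pp")
```

Printing both columns side by side makes it explicit which reference point a given "+Npp" refers to.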
What worked
- `house_number` recovered by +3pp vs v0.6.2 (74.0% → 77.0%). The `synth-house-venue` companion shard did what it was designed to do: it taught the model that house_number and venue coexist, recovering 3 of the 5 pp lost in v0.6.2 (79.0% → 74.0%).
- `street` recall improved (27.1% → 29.2%, +2.1pp vs v0.6.2 and +1.5pp vs the v0.6.0 baseline). The filtered venue pool (no more digit+ordinal venues) likely contributed.
- `postcode`, `country` held their v0.6.2 gains.
What broke (badly)
dependent_locality regression — predicted and confirmed
The original v0.6.1 problem came back. 844 hallucinations vs v0.6.2's 0. Rate: 2110% of expected occurrences (40 expected dep_loc spans in the golden set; the model emitted 844 hallucinations).
This was DeepSeek's turn 10 dilution diagnosis: dropping synth-no-street weight 1.0 → 0.5 AND adding 32K rows of synth-house-venue at weight 1.0 changed the anti-decompose:companion-shard ratio from v0.6.2's 1.0:0 to v0.6.3's 0.5:1.0. The anti-decompose signal got proportionally weaker. Decompose mode came back over the 20K → 100K window.
The trajectory confirms it:
- v0.6.3 step 20K: 1 dep_loc hallucination
- v0.6.3 step 100K: 844 dep_loc hallucinations
The damage accumulated between those two checkpoints as the model overfit to patterns in the diluted distribution.
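The ratio shift described above can be restated as the share of synthetic-shard weight carried by the anti-decompose signal. The sketch below is a simplification under stated assumptions: it considers only the two shard weights named in this doc, ignores shard row counts and the rest of the corpus, and assumes v0.6.4 leaves the companion shard at weight 1.0. The function and dictionary names are hypothetical.

```python
def anti_decompose_share(no_street_weight: float, house_venue_weight: float) -> float:
    """Fraction of the two shards' combined weight held by synth-no-street.

    Assumption: only these two sampling weights matter. Row counts and all
    other corpus shards are ignored, so this is a rough proxy, not the
    trainer's actual mixing proportion.
    """
    total = no_street_weight + house_venue_weight
    return no_street_weight / total


# (synth-no-street weight, synth-house-venue weight) per recipe, from this doc.
# The v0.6.4 companion-shard weight of 1.0 is an assumption (unchanged from v0.6.3).
RECIPES = {
    "v0.6.2": (1.0, 0.0),
    "v0.6.3": (0.5, 1.0),
    "v0.6.4 (pre-staged)": (0.75, 1.0),
}

if __name__ == "__main__":
    for name, (no_street, house_venue) in RECIPES.items():
        share = anti_decompose_share(no_street, house_venue)
        print(f"{name:20s} anti-decompose share: {share:.1%}")
```

Under these assumptions the share falls from 100% in v0.6.2 to about 33% in v0.6.3, and the pre-staged v0.6.4 would bring it back to roughly 43%.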
locality recall dropped 5.3pp
40.0% → 34.7% (v0.6.0 baseline → v0.6.3; v0.6.2 was at 41.0%). This is a side effect of the dep_loc explosion: many tokens that should be tagged locality are instead tagged dependent_locality. The model confuses the two tags more than v0.6.2 did.
Harness pass rate dropped below v0.6.0
12.5% vs v0.6.0's 14.4%. v0.6.3 is a sidegrade: worse than v0.6.0 on the breadth eval. Per the decision tree, this alone would be grounds for HOLD even if the gate had passed.
Gate verdict: HOLD
Per the decision tree:
| Gate eval | harness pass | Action |
|---|---|---|
| Pass | ≥ 25% | Promote to default |
| Pass | 15-24% | Ship as experimental |
| Pass | ≤ 14.4% | HOLD (sidegrade) |
| Fail | any | HOLD |
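The decision table above can be sketched as a small function. This is an illustrative encoding, not the project's actual gating code: the name `gate_action` is hypothetical, and because the table does not say what happens for harness pass rates strictly between 14.4% and 15% or between 24% and 25%, the sketch raises an error there rather than guessing.

```python
def gate_action(gate_passed: bool, harness_pass_pct: float) -> str:
    """Map (gate result, harness pass rate in percent) to an action.

    Thresholds are taken from the decision table. Pass rates that fall
    between the table's rows are treated as unspecified.
    """
    if not gate_passed:
        return "HOLD"  # row 4: gate failure holds regardless of harness
    if harness_pass_pct >= 25.0:
        return "Promote to default"
    if 15.0 <= harness_pass_pct <= 24.0:
        return "Ship as experimental"
    if harness_pass_pct <= 14.4:
        return "HOLD (sidegrade)"
    raise ValueError(
        f"harness pass rate {harness_pass_pct}% is not covered by the decision table"
    )


if __name__ == "__main__":
    # v0.6.3: gate failed (3 violations), harness pass rate 12.5%.
    print(gate_action(False, 12.5))
    # Counterfactual: the same harness number with a passing gate.
    print(gate_action(True, 12.5))
```

The counterfactual call shows why the harness result alone would also have held the release.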
v0.6.3 falls in row 4: HOLD. Two paths from here:
- Launch the pre-staged v0.6.4 (synth-no-street weight 0.5 → 0.75). DeepSeek's mechanism-3 hypothesis just got empirically confirmed by v0.6.3 going wrong in the predicted way. v0.6.4 directly counters the dilution.
- Step back and rethink v0.6.x as a whole. Three iterations in, the pattern is clear: every corpus-weight change produces non-trivial trade-offs. Maybe the issue is upstream of any recipe — overfitting, evaluation methodology, schema design, or model capacity.
The operator chose path 2. This doc captures what we learned. The v0.6.x retrospective synthesizes what the cycle revealed and proposes a rethink agenda.
Pre-staged v0.6.4 — held but not discarded
The v0.6.4 yaml + parquet artifacts remain on the Modal volume. If the
rethink concludes that "weight rebalance is the right path but we just
need to land the right setpoint," v0.6.4 is one modal run away. If the
rethink concludes that something more structural needs to change, v0.6.4
is reference data for what would have happened.
Reproducing
./scripts/eval-v062-checkpoint.sh 100000 v063
Artifacts in /tmp/v063-eval-step-100000/.
See also
- v0.6.2 step 100K eval — the immediately-prior release
- Layer 1 morphology FST eval — established the v0.6.x recipe direction
- Street-supplement architecture — the architectural framing
- Corpus poisoning vulnerability — what the v0.6.x cycle teaches about fundamental architectural risk