corpus-v0.4.0 Generation
How corpus-v0.4.0 is produced from corpus-v0.3.0. This is Thread B of the
PHASE_8_v0_5_0_fresh_slate plan: synthetic
adversarial data generated by DeepSeek and folded into the training corpus as a pure
adapter addition. Pairs with concepts/the-knowledge-ladder
for the conceptual framing and with OPERATIONS.md for the working norms.
Scope
corpus-v0.4.0 = corpus-v0.3.0 + DeepSeek-generated rows. The v0.3.0 shards are not re-emitted; the v0.4.0 MANIFEST points at the same on-disk parquet files plus the new kryptonite shard(s).
Two row classes are in scope for the v0.5.0 plan. Both are now shipped: kryptonite in
Thread B's commit (d8a6bae), transliteration in this commit (Thread B2).
| Class | Status (corpus-v0.4.0) | Target row count | Adapter source id |
|---|---|---|---|
| Kryptonite | Shipped (Thread B, commit d8a6bae) | ~5,000 | deepseek-kryptonite |
| Transliteration | Shipped (Thread B2) | ~50,000–75,000 | deepseek-translit-<scrp> |
The kryptonite slice is what unblocks Stage 5 reconcile (Thread D) and the Stage 2.5
kind-classifier's joint-decoding test surface. Transliteration is what unblocks Thread A's
<5% byte-fallback target on non-Latin scripts. See
PHASE_8 §B for the threading rationale.
PHASE_8 §B originally enumerated eight scripts (CJK + Cyrillic + Armenian + Greek + Arabic +
Hebrew + Devanagari + Thai). Thread B's smoke testing validated the prompt + alignment
pipeline against five (cyrl, jpan, hans, hang, armn); those five ship in Thread B2.
The remaining scripts (grek, arab, hebr, deva, thai) are additive (extending
TRANSLIT_SCRIPTS + KNOWN_SOURCE_PREFIXES is sufficient), but the prompt + substring
invariant should be re-smoked per-script before a production run, so they are deferred.
Pipeline
corpus-python/scripts/generate_deepseek_corpus.py
  │ prompts DeepSeek → raw JSONL (one chat-completion per batch)
  │ validates substring-match invariant on each row
  │ persists raw HTTP payload alongside canonical row (reproducibility)
  ▼
corpus-v0.4.0/kryptonite/canonical-kryptonite.jsonl (canonical rows)
corpus-v0.4.0/kryptonite/raw-deepseek-kryptonite.jsonl (raw API responses)
corpus-v0.4.0/kryptonite/.kryptonite-checkpoint.json (resumable batches)
  │
  │ corpus/scripts/build-kryptonite-shard.ts
  │ streams JSONL → alignRow (BIO labels) → parquet shard
  │ composes MANIFEST = v0.3.0.shards + new shard
  ▼
/mnt/playpen/mailwoman-data/corpus/versioned/v0.4.0/corpus-v0.4.0/
├── MANIFEST.json
└── train/part-0000.parquet ← new kryptonite shard
Generator (corpus-python/scripts/generate_deepseek_corpus.py)
Model + API contract
| Knob | Value |
|---|---|
| Model | deepseek-v4-flash |
| Reasoning effort | low |
| Endpoint | https://api.deepseek.com/v1/chat/completions |
| Max tokens | 20,000 per response |
| Retries | 5, exponential backoff at 2.0× from 2s |
| Timeout | 300s per call |
The OpenAI-compatible Chat Completions schema is used directly via stdlib urllib, with
no SDK dependency. 429 and 5xx responses are retried; any other 4xx is raised as fatal.
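The retry contract in the table can be illustrated with a minimal sketch. It assumes that "5 retries" means up to five waits after the initial attempt and that the delay doubles from 2 s each time; the helper names `backoff_delays` and `is_retryable` are hypothetical and are not taken from the generator script.

```python
def backoff_delays(retries: int = 5, base_s: float = 2.0, factor: float = 2.0) -> list[float]:
    """Seconds to wait before each retry (assumed schedule: 2, 4, 8, ...)."""
    return [base_s * factor**attempt for attempt in range(retries)]


def is_retryable(status: int) -> bool:
    """429 and 5xx are retried; any other 4xx is treated as fatal."""
    return status == 429 or 500 <= status <= 599


print(backoff_delays())        # schedule of waits between attempts
print(sum(backoff_delays()))   # worst-case total wait under these assumptions
```

Under these assumptions a batch that fails every attempt waits 62 s in total before the error is surfaced, which sits well inside the 300 s per-call timeout.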
Concurrency + rate
Default --concurrency 15, --batch-size 50. Wall-clock observation on the production
run: ~12 rows/s sustained, dominated by DeepSeek server latency rather than client
throughput. 5,000 rows takes ~7 minutes. No rate-limit headers are surfaced; if the API
ever returns 429 the backoff loop handles it.
License hygiene
Raw DeepSeek API responses are AGPL-compatible for the operator's use case (no output-source-attribution constraints from DeepSeek's terms as of 2026-05-23). Every emitted row carries:
- license: "Synthetic (DeepSeek-v4-flash, AGPL-compatible)"
- synth.method: "deepseek-kryptonite:<category>" (or deepseek-translit:<script>)
- synth.base_source_id: "kryptonite-seed:<category>" (or the seed row's source_id)
source: "deepseek-kryptonite" (or "deepseek-translit-<slug>") makes provenance
explicit at audit + downstream-training time so the model card can disclose synthetic
fraction.
Reproducibility
Three artifacts pin the generation:
- Prompts: the system + user prompt strings live in generate_deepseek_corpus.py constants (KRYPTONITE_SYSTEM, KRYPTONITE_USER_TEMPLATE, TRANSLIT_SYSTEM, KRYPTONITE_CATEGORIES). Version-pinned through git history.
- Raw responses: every chat-completion is persisted to raw-deepseek-kryptonite.jsonl with {batch_id, category, n_requested, model, finish_reason, usage, response_content}. Regenerating from this raw log requires only the JSON validator, not another DeepSeek call.
- Checkpoint: .kryptonite-checkpoint.json records the set of completed batch_ids. Reruns with identical args skip done batches.
batch_id is a deterministic SHA-256 prefix over (category, batch_index, n), so
re-running the generator with the same args produces the same batch ids; the
checkpoint set composes with itself across runs.
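A minimal sketch of how a deterministic batch id makes the checkpoint resumable. The serialization of the tuple (fields joined with `|`), the 16-character prefix length, and the helper names are assumptions for illustration; the real script may hash a different encoding.

```python
import hashlib


def batch_id(category: str, batch_index: int, n: int, prefix_len: int = 16) -> str:
    # Assumed serialization: "category|batch_index|n". Only determinism matters here.
    key = f"{category}|{batch_index}|{n}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:prefix_len]


def pending_batches(planned: list[str], completed: set[str]) -> list[str]:
    """Batches still to run: everything planned that the checkpoint has not recorded."""
    return [b for b in planned if b not in completed]


planned = [batch_id("po-box", i, 50) for i in range(3)]
completed = {planned[0]}  # stand-in for the ids loaded from the checkpoint file
print(pending_batches(planned, completed))
```

Because the ids depend only on the arguments, a second run with the same arguments rebuilds the same `planned` list, and anything already in the checkpoint drops out without extra bookkeeping.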
Kryptonite categories
The generator's KRYPTONITE_CATEGORIES list defines 10 adversarial flavours, each with
a category id, a description, 2-3 hand-written seed examples, and a weight that
allocates the total budget proportionally:
| Category | Description | Weight |
|---|---|---|
| venue-shadow-region | Venue brand contains a region-like token; actual region is elsewhere | 1.0 |
| locality-shadow-country | US locality shadows a famous foreign city (Paris TX, Moscow ID) | 1.0 |
| mid-position-postcode | Postcode appears between locality and country, not at the end | 1.0 |
| repeated-token | Same token in venue and locality (Buffalo Buffalo, Walla Walla) | 0.9 |
| abbrev-collision | State abbreviation collides with a venue/street token | 0.8 |
| saint-shadow | Saint X / St. X colliding with a famous European saint-name city | 0.8 |
| compass-prefix | Compass-prefixed locality whose base name is a famous other place | 0.7 |
| region-shadow-venue | Venue brand embeds a US state name as a token | 0.7 |
| french-saint | FR equivalent of saint-shadow | 0.7 |
| po-box | PO Box intermixed with street-style tokens | 0.5 |
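To make the proportional allocation concrete, here is a minimal sketch that splits a 5,000-row budget across the weights in the table. The largest-remainder rounding and the function name `allocate` are assumptions; the generator may round differently.

```python
WEIGHTS = {
    "venue-shadow-region": 1.0,
    "locality-shadow-country": 1.0,
    "mid-position-postcode": 1.0,
    "repeated-token": 0.9,
    "abbrev-collision": 0.8,
    "saint-shadow": 0.8,
    "compass-prefix": 0.7,
    "region-shadow-venue": 0.7,
    "french-saint": 0.7,
    "po-box": 0.5,
}


def allocate(total: int, weights: dict[str, float]) -> dict[str, int]:
    """Split `total` rows proportionally to the weights (largest-remainder rounding)."""
    weight_sum = sum(weights.values())
    exact = {k: total * w / weight_sum for k, w in weights.items()}
    alloc = {k: int(v) for k, v in exact.items()}  # floor each share
    leftover = total - sum(alloc.values())
    # Hand the rows lost to flooring to the categories with the largest fractions.
    by_remainder = sorted(weights, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


for category, rows in allocate(5000, WEIGHTS).items():
    print(f"{category:26s} {rows}")
```

The weights sum to 8.1, so under this rounding a weight-1.0 category receives about 617 rows and po-box about 309.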
The substring-match invariant (every component value must appear verbatim in raw) is
enforced in the system prompt and revalidated locally before the row is committed
to the canonical JSONL. Failure rate observed on production run: ~0.02%.
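The local revalidation step can be sketched as below. The row shape (a `raw` string plus a `components` mapping of tag to value) and the function name are assumptions for illustration; the canonical row schema is not shown in this document.

```python
def invariant_violations(row: dict) -> list[str]:
    """Return the component tags whose value does not appear verbatim in row['raw']."""
    raw = row["raw"]
    return [tag for tag, value in row["components"].items() if value not in raw]


good = {
    "raw": "Paris Diner, 12 Main St, Paris, TX 75460",
    "components": {"venue": "Paris Diner", "locality": "Paris", "region": "TX"},
}
bad = {
    "raw": "Paris Diner, 12 Main St, Paris, TX 75460",
    "components": {"venue": "Paris Diner", "region": "Texas"},  # expanded, not verbatim
}

print(invariant_violations(good))
print(invariant_violations(bad))
```

A row is committed only when the returned list is empty; anything else is rejected before it reaches the canonical JSONL.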
Builder (corpus/scripts/build-kryptonite-shard.ts)
Streams the canonical JSONL through alignRow (corpus/src/align.ts) to produce
tokens + labels, writes a single parquet shard under train/, and emits the
combined MANIFEST.json. v0.3.0 shard descriptors are copied verbatim into the new
manifest; no v0.3.0 bytes are touched. New shard descriptors are stamped with
source: "deepseek-kryptonite" so corpus-audit can attribute them without the
filename-prefix inference fallback.
Quarantined rows (alignment failures) are logged to quarantine-kryptonite.tsv
alongside the new shard. Surface-form validation already happens in the Python
generator, so the alignment step should reject <1%; anything above 5% indicates a
prompt or substring-match regression and should be investigated before commit.
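The two thresholds above can be turned into a simple pre-commit check. This is a sketch only: the function name and the three status labels are hypothetical, and the "elevated" band between 1% and 5% is an interpretation, since the text only names the two endpoints.

```python
def quarantine_status(quarantined: int, total: int) -> str:
    """Classify the alignment rejection rate against the 1% and 5% thresholds."""
    rate = quarantined / total
    if rate < 0.01:
        return "expected"
    if rate > 0.05:
        return "investigate"
    return "elevated"  # assumed middle band: not alarming, but worth a look


print(quarantine_status(10, 5000))   # 0.2% of rows quarantined
print(quarantine_status(300, 5000))  # 6% of rows quarantined
```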
Split policy
All kryptonite rows land in train. Synthetic adversarial data is augmentation; it
must not appear in val or test where it would inflate the eval against itself. The
v0.3.0 splitter's locality-holdout policy (corpus/src/split.ts) does not apply:
kryptonite rows have no natural locality boundary and are not produced by the holdout
regions anyway.
The kryptonite catalogue's eval surface lives elsewhere: Stage 5 reconcile (Thread D)
ships its own hand-curated fixture set. See PHASE_8 §D.
Auditing
npx tsx corpus/scripts/audit.ts \
/mnt/playpen/mailwoman-data/corpus/versioned/v0.4.0/corpus-v0.4.0 \
--config corpus-python/src/mailwoman_train/configs/v0_5_0.yaml
The audit reports per-source shard counts and (with --config) effective sample
weights. deepseek-kryptonite is now in KNOWN_SOURCE_PREFIXES. Expected:
1 train shard with source: "deepseek-kryptonite", sub-1% of total shards (the
v0.3.0 baseline has 674 train shards; one more is noise at the audit level).
Transliteration generation (Thread B2)
Goals
- Seed corpus: /data/corpus/versioned/v0.4.0/staging/seeds-en-us.jsonl (~4.5K US rows) and seeds-fr-fr.jsonl (~10.5K FR rows), already sampled from the v0.3.0 train set by the previous session.
- For each seed, generate one transliteration per target script. Five scripts:
  - cyrl: Russian Cyrillic (locale ru-RU)
  - jpan: Japanese Katakana + Kanji (locale ja-JP)
  - hans: Simplified Chinese (locale zh-CN)
  - hang: Korean Hangul (locale ko-KR)
  - armn: Armenian (locale hy-AM)
- Target row count: 15K seeds × 5 scripts = ~75K transliterations. The original plan said ~50K; the seed pool grew during sampling and the upper bound is what fits the v0.5.0 byte-fallback budget on Thread A.
Mode + invocation
Implemented in generate_deepseek_corpus.py:
python3 corpus-python/scripts/generate_deepseek_corpus.py \
--mode transliteration \
--out-dir /data/corpus/versioned/v0.4.0/transliteration \
--seed-paths /data/corpus/versioned/v0.4.0/staging/seeds-en-us.jsonl \
/data/corpus/versioned/v0.4.0/staging/seeds-fr-fr.jsonl \
--scripts cyrl jpan hans hang armn \
--batch-size 50 --concurrency 15
The Thread B2 production run consumed the full seed pool (14,978 rows = 4,478 en-US + 10,500 fr-FR). 5 scripts × 14,978 seeds = 74,890 planned rows. Wall-clock and rejection rate are pinned in the Changelog below.
Prompt design
TRANSLIT_SYSTEM in the generator script defines the contract:
- Keep digits / commas / periods / hyphens verbatim.
- Transliterate place names + street-type words using natural conventions for the target script; do not translate semantically.
- Mirror the input component tags exactly (every input tag appears in output).
- Surface-form invariant holds: every component value substring-matches the transliterated raw.
Each batch is built per-script (not per-seed): one chat completion handles 50 seeds in
one target script. The response is JSONL keyed by i (batch index 0..49). Substring
validation rejects malformed transliterations before commit; rejection rate in initial
prompt-engineering smoke was ~3%, which is acceptable, but the prompt is worth tuning before
the production 75K pass.
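A minimal sketch of reading one batch response, assuming each output line is a JSON object carrying its batch index under `i`. The function name and the decision to skip unparseable lines (for example the last line of a truncated response) are assumptions, not the generator's confirmed behaviour.

```python
import json


def parse_batch(response_content: str, batch_size: int) -> tuple[dict[int, dict], list[int]]:
    """Index JSONL rows by their `i` key and report which batch indices are missing."""
    rows: dict[int, dict] = {}
    for line in response_content.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed or truncated line: treat the row as missing
        i = obj.get("i") if isinstance(obj, dict) else None
        if isinstance(i, int) and 0 <= i < batch_size:
            rows[i] = obj
    missing = [i for i in range(batch_size) if i not in rows]
    return rows, missing


content = '{"i": 0, "raw": "a"}\n{"i": 2, "raw": "c"}\n{"i": 1, "ra'
rows, missing = parse_batch(content, batch_size=3)
print(sorted(rows), missing)
```

Keying on `i` instead of line order lets the seed for each surviving row be recovered even when the model drops or reorders rows; substring validation then runs on the rows that parsed.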
Cost + wall time estimate
- 75K rows / 50 per batch = 1,500 batches.
- Per-batch latency is API-dominated; with reasoning headroom (see below), 30–60 s.
- Wall time at conc=15: ~1,500 batches × 45 s / 15 ≈ 75 min on a clean run.
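The estimate above is plain arithmetic and can be reproduced with a short sketch. It assumes perfectly parallel batches with no retries or stragglers; the function name is hypothetical.

```python
import math


def estimate_wall_minutes(rows: int, batch_size: int, seconds_per_batch: float, concurrency: int) -> float:
    """Idealized wall time: batches run `concurrency` at a time with uniform latency."""
    batches = math.ceil(rows / batch_size)
    return batches * seconds_per_batch / concurrency / 60


for latency_s in (30, 45, 60):
    minutes = estimate_wall_minutes(75_000, 50, latency_s, 15)
    print(f"{latency_s} s per batch -> {minutes:.0f} min")
```

Across the quoted 30 to 60 s latency range the same formula gives 50 to 100 minutes, so the 75-minute figure is the midpoint, not a bound.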
max_tokens and the reasoning budget
DeepSeek-v4-flash with reasoning_effort=low still consumes a non-trivial reasoning budget
on transliteration batches. Thread B2's first launch ran at max_tokens=20000 (the same knob
that worked for kryptonite) and saw 30/32 batches hit finish_reason=length because
reasoning ate 12–15K of the 20K budget, leaving <5K for the 50-row JSONL output. Truncated
responses produced ~8 rows per batch instead of 50.
The validated production knob is max_tokens=60000: empirically reasoning peaks around
~15K and 50-row output uses ~5K, so 60K leaves comfortable headroom. The Thread B2 generator
also marks finish_reason=length batches with a !RETRY: prefix so they don't checkpoint
and get retried on subsequent runs. This is defence in depth for the rare batches that still
truncate at 60K.
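One plausible reading of the `!RETRY:` mechanism is sketched below: a truncated batch is recorded under a prefixed id, so a later lookup by the plain batch id does not find it and the batch runs again. Where the prefix is actually attached in the generator is an assumption, and the helper names are hypothetical.

```python
RETRY_PREFIX = "!RETRY:"


def checkpoint_entry(batch_id: str, finish_reason: str) -> str:
    """Id to record for a finished call; truncated batches get the retry prefix."""
    if finish_reason == "length":
        return RETRY_PREFIX + batch_id
    return batch_id


def is_done(batch_id: str, entries: set[str]) -> bool:
    """A batch counts as done only if its plain id was recorded."""
    return batch_id in entries


entries = {
    checkpoint_entry("a1b2", "stop"),    # completed normally
    checkpoint_entry("c3d4", "length"),  # truncated by max_tokens
}
print(is_done("a1b2", entries), is_done("c3d4", entries))
```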
This budget reasoning is transliteration-specific; the kryptonite mode at max_tokens=20000
remains correct because English-only ASCII output uses ~1 token per row character and 50
rows fit comfortably.
What Thread B2 adds
- Runs the transliteration generator end-to-end against the full seed pool.
- New build-transliteration-shard.ts mirroring build-kryptonite-shard.ts: buckets canonical rows by source and emits one parquet shard per script (train/part-translit-<slug>.parquet).
- Composes the new MANIFEST as (v0.3.0 base shards) + (Thread B kryptonite shard) + (Thread B2 translit-<slug> shards).
- Migrates v0.3.0 shard descriptors in MANIFEST from /mnt/playpen/mailwoman-data/... to /data/..., the form that container loaders resolve at training time. The on-disk bytes are unchanged; only the path strings in MANIFEST move. (Thread B's MANIFEST mixed the two forms; this PR canonicalizes.)
- KNOWN_SOURCE_PREFIXES in audit.ts already contains the five deepseek-translit-* slugs from Thread B, so no audit.ts edit is needed.
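The path canonicalization amounts to a prefix rewrite on the strings stored in MANIFEST. The sketch below is in Python for illustration only (the builder itself is TypeScript); the function name and the example v0.3.0 shard path are hypothetical.

```python
OLD_PREFIX = "/mnt/playpen/mailwoman-data/"
NEW_PREFIX = "/data/"


def canonicalize_path(path: str) -> str:
    """Rewrite a host-side shard path to the container-side form; leave others alone."""
    if path.startswith(OLD_PREFIX):
        return NEW_PREFIX + path[len(OLD_PREFIX):]
    return path


# Hypothetical shard path, used only to show the rewrite.
host_path = "/mnt/playpen/mailwoman-data/corpus/versioned/v0.3.0/train/part-0001.parquet"
print(canonicalize_path(host_path))
```

The rewrite is idempotent, so applying it to a MANIFEST that already mixes both forms leaves the `/data/...` entries untouched.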
Known risks
- Cross-script seed pollution. If the seed pool is biased (e.g. all FR seeds are Île-de-France because the BAN sampling skewed there), the resulting transliterations would teach the tokenizer about that bias 5× over. The current staging files were sampled stratified by source only. Region stratification is deferred; this is accepted as a known limitation of this pass and can be revisited if Thread A's byte-fallback eval surfaces a regional skew.
- Whitespace-tokenizer interaction with CJK. Alignment uses the whitespace tokenizer, which treats space-less CJK as a single token. The substring invariant still holds (raw and component values are space-matched verbatim), so alignment passes, but per-character labelling at training time comes from the sentencepiece tokenizer Thread A retrains, not from the corpus alignment.
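The CJK behaviour is easy to see with a stand-in tokenizer. This sketch uses Python's `str.split()` as an approximation of a whitespace tokenizer; the real alignment code in corpus/src/align.ts may treat punctuation differently.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on runs of whitespace, the way a simple whitespace tokenizer would."""
    return text.split()


latin = whitespace_tokens("12 Main St, Springfield")
cjk = whitespace_tokens("東京都渋谷区")

print(len(latin), latin)
print(len(cjk), cjk)
```

The Latin address splits into several tokens, while the space-less CJK string comes back as one, so any BIO label from the corpus alignment covers the whole string at once.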
Changelog
| Date | Change |
|---|---|
| 2026-05-23 | Initial doc + kryptonite slice generation (5K rows, deepseek-v4-flash). |
| 2026-05-24 | Transliteration slice generation shipped under Thread B2: 73,319 rows from 5 scripts × 14,978 seeds (en-US + fr-FR), max_tokens=60000, 5.2 rps sustained, ~234 min wall-clock at conc=15. 73,316 rows aligned (3 quarantined). 5 shards added: part-translit-{armn,cyrl,hang,hans,jpan}.parquet. v0.3.0 shard paths in MANIFEST canonicalized from /mnt/playpen/mailwoman-data/... to /data/.... |