
corpus-v0.4.0 Generation

How corpus-v0.4.0 is produced from corpus-v0.3.0. This is Thread B of the PHASE_8_v0_5_0_fresh_slate plan: synthetic adversarial data generated by DeepSeek and folded into the training corpus as a pure adapter addition. Pairs with concepts/the-knowledge-ladder for the conceptual framing and with OPERATIONS.md for the working norms.

Scope

corpus-v0.4.0 = corpus-v0.3.0 + DeepSeek-generated rows. The v0.3.0 shards are not re-emitted; the v0.4.0 MANIFEST points at the same on-disk parquet files plus the new kryptonite shard(s).

Two row classes are in scope for the v0.5.0 plan. Both are now shipped: kryptonite in Thread B's commit (d8a6bae), transliteration in this commit (Thread B2).

| Class | Status (corpus-v0.4.0) | Target row count | Adapter source id |
| --- | --- | --- | --- |
| Kryptonite | Shipped (Thread B, commit d8a6bae) | ~5,000 | deepseek-kryptonite |
| Transliteration | Shipped (Thread B2) | ~50,000–75,000 | deepseek-translit-<scrp> |

The kryptonite slice is what unblocks Stage 5 reconcile (Thread D) and the Stage 2.5 kind-classifier's joint-decoding test surface. Transliteration is what unblocks Thread A's <5% byte-fallback target on non-Latin scripts. See PHASE_8 §B for the threading rationale.

PHASE_8 §B originally enumerated eight script groups, counting CJK as one (CJK + Cyrillic + Armenian + Greek + Arabic + Hebrew + Devanagari + Thai). Thread B's smoke testing validated the prompt + alignment pipeline against five script slugs (cyrl, jpan, hans, hang, armn); those five ship in Thread B2. The remaining scripts (grek, arab, hebr, deva, thai) are additive: extending TRANSLIT_SCRIPTS + KNOWN_SOURCE_PREFIXES is sufficient to enable them. They are deferred because the prompt + substring invariant should be re-smoked per script before a production run.

Pipeline

corpus-python/scripts/generate_deepseek_corpus.py
│   prompts DeepSeek → raw JSONL (one chat-completion per batch)
│   validates substring-match invariant on each row
│   persists raw HTTP payload alongside canonical row (reproducibility)
▼
corpus-v0.4.0/kryptonite/canonical-kryptonite.jsonl (canonical rows)
corpus-v0.4.0/kryptonite/raw-deepseek-kryptonite.jsonl (raw API responses)
corpus-v0.4.0/kryptonite/.kryptonite-checkpoint.json (resumable batches)
│
│   corpus/scripts/build-kryptonite-shard.ts
│   streams JSONL → alignRow (BIO labels) → parquet shard
│   composes MANIFEST = v0.3.0.shards + new shard
▼
/mnt/playpen/mailwoman-data/corpus/versioned/v0.4.0/corpus-v0.4.0/
├── MANIFEST.json
└── train/part-0000.parquet ← new kryptonite shard

Generator (corpus-python/scripts/generate_deepseek_corpus.py)

Model + API contract

| Knob | Value |
| --- | --- |
| Model | deepseek-v4-flash |
| Reasoning effort | low |
| Endpoint | https://api.deepseek.com/v1/chat/completions |
| Max tokens | 20,000 per response |
| Retries | 5, exponential backoff at 2.0× from 2s |
| Timeout | 300s per call |

The OpenAI-compatible Chat Completions schema is used directly via stdlib urllib, with no SDK dependency. 429 and 5xx responses are retried; any other 4xx is raised as fatal.
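The retry contract above can be sketched as follows. This is a minimal illustration, not the generator's actual code: the names `backoff_delays`, `is_retryable`, and `call_with_retries` are hypothetical, and it assumes "5 retries" means up to five re-attempts after the first call.

```python
import urllib.error

RETRIES = 5          # from the knob table
BASE_DELAY_S = 2.0   # first backoff delay, in seconds
FACTOR = 2.0         # multiplier applied after each failed attempt


def backoff_delays(retries=RETRIES, base=BASE_DELAY_S, factor=FACTOR):
    """Delays slept before each retry: 2, 4, 8, 16, 32 seconds."""
    return [base * factor ** i for i in range(retries)]


def is_retryable(status):
    """429 and any 5xx are treated as transient; every other 4xx is fatal."""
    return status == 429 or 500 <= status <= 599


def call_with_retries(send, sleep):
    """Call send() until it succeeds, fails fatally, or retries run out.

    `send` and `sleep` are injected (hypothetical parameters) so the policy
    can be exercised without network access or real waiting.
    """
    delays = backoff_delays()
    for attempt in range(len(delays) + 1):
        try:
            return send()
        except urllib.error.HTTPError as err:
            # Re-raise fatal statuses, and the last failure once retries are spent.
            if not is_retryable(err.code) or attempt == len(delays):
                raise
            sleep(delays[attempt])
```

In real use `send` would wrap a `urllib.request.urlopen` call with the 300s timeout and `sleep` would be `time.sleep`.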

Concurrency + rate

Default --concurrency 15, --batch-size 50. Wall-clock observation on the production run: ~12 rows/s sustained, dominated by DeepSeek server latency rather than client throughput. 5,000 rows takes ~7 minutes. No rate-limit headers are surfaced; if the API ever returns 429 the backoff loop handles it.

License hygiene

Raw DeepSeek API responses are AGPL-compatible for the operator's use case (no output-source-attribution constraints from DeepSeek's terms as of 2026-05-23). Every emitted row carries:

  • license: "Synthetic (DeepSeek-v4-flash, AGPL-compatible)"
  • synth.method: "deepseek-kryptonite:<category>" (or deepseek-translit:<script>)
  • synth.base_source_id: "kryptonite-seed:<category>" (or the seed row's source_id)

source: "deepseek-kryptonite" (or "deepseek-translit-<slug>") makes provenance explicit at audit + downstream-training time so the model card can disclose synthetic fraction.

Reproducibility

Three artifacts pin the generation:

  1. Prompts: the system + user prompt strings live in generate_deepseek_corpus.py constants (KRYPTONITE_SYSTEM, KRYPTONITE_USER_TEMPLATE, TRANSLIT_SYSTEM, KRYPTONITE_CATEGORIES). Version-pinned through git history.
  2. Raw responses: every chat-completion is persisted to raw-deepseek-kryptonite.jsonl with {batch_id, category, n_requested, model, finish_reason, usage, response_content}. Regenerating from this raw log requires only the JSON validator, not another DeepSeek call.
  3. Checkpoint: .kryptonite-checkpoint.json records the set of completed batch_ids. Reruns with identical args skip done batches.

batch_id is a deterministic SHA-256 prefix over (category, batch_index, n), so re-running the generator with the same args produces the same batch ids. The checkpoint set therefore composes with itself across runs.
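A sketch of how such an id and the checkpoint skip could work. The serialization of the tuple (a `|`-joined string), the 16-character prefix length, and the helper `pending_batches` are assumptions made for illustration; the source only states that the id is a SHA-256 prefix over (category, batch_index, n).

```python
import hashlib


def batch_id(category: str, batch_index: int, n: int, prefix_len: int = 16) -> str:
    """Deterministic id: same (category, batch_index, n) always gives the same hex prefix."""
    key = f"{category}|{batch_index}|{n}".encode("utf-8")  # assumed serialization
    return hashlib.sha256(key).hexdigest()[:prefix_len]


def pending_batches(planned, done):
    """Keep only the planned (category, batch_index, n) tuples not yet checkpointed."""
    return [b for b in planned if batch_id(*b) not in done]


planned = [("po-box", 0, 50), ("po-box", 1, 50)]
done = {batch_id("po-box", 0, 50)}          # as if loaded from the checkpoint file
remaining = pending_batches(planned, done)  # only the second batch is left
```

Because the id depends only on the arguments, a rerun recomputes identical ids and the checkpoint set filters them without any extra bookkeeping.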

Kryptonite categories

The generator's KRYPTONITE_CATEGORIES list defines 10 adversarial flavours, each with a category id, a description, 2-3 hand-written seed examples, and a weight that allocates the total budget proportionally:

| Category | Description | Weight |
| --- | --- | --- |
| venue-shadow-region | Venue brand contains a region-like token; actual region is elsewhere | 1.0 |
| locality-shadow-country | US locality shadows a famous foreign city (Paris TX, Moscow ID) | 1.0 |
| mid-position-postcode | Postcode appears between locality and country, not at the end | 1.0 |
| repeated-token | Same token in venue and locality (Buffalo Buffalo, Walla Walla) | 0.9 |
| abbrev-collision | State abbreviation collides with a venue/street token | 0.8 |
| saint-shadow | Saint X / St. X colliding with a famous European saint-name city | 0.8 |
| compass-prefix | Compass-prefixed locality whose base name is a famous other place | 0.7 |
| region-shadow-venue | Venue brand embeds a US state name as a token | 0.7 |
| french-saint | FR equivalent of saint-shadow | 0.7 |
| po-box | PO Box intermixed with street-style tokens | 0.5 |
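To make the proportional allocation concrete, the sketch below splits a 5,000-row budget across the ten weights in the table. The rounding rule (largest remainder) and the function name `allocate` are assumptions; the source does not say how the generator rounds fractional shares.

```python
WEIGHTS = {
    "venue-shadow-region": 1.0,
    "locality-shadow-country": 1.0,
    "mid-position-postcode": 1.0,
    "repeated-token": 0.9,
    "abbrev-collision": 0.8,
    "saint-shadow": 0.8,
    "compass-prefix": 0.7,
    "region-shadow-venue": 0.7,
    "french-saint": 0.7,
    "po-box": 0.5,
}


def allocate(total, weights):
    """Split `total` rows proportionally to weight, summing exactly to `total`."""
    weight_sum = sum(weights.values())
    exact = {k: total * w / weight_sum for k, w in weights.items()}
    counts = {k: int(v) for k, v in exact.items()}  # floor of each share
    leftover = total - sum(counts.values())
    # Hand the remaining rows to the categories with the largest fractional part.
    by_remainder = sorted(weights, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


budget = allocate(5000, WEIGHTS)
```

With the weights summing to 8.1, a weight-1.0 category receives roughly 617 rows and po-box roughly half of that.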

The substring-match invariant (every component value must appear verbatim in raw) is enforced in the system prompt and revalidated locally before the row is committed to the canonical JSONL. Failure rate observed on production run: ~0.02%.
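The local revalidation step can be illustrated with a small check. The row shape used here (a `raw` string plus a `components` mapping of tag to value) and the function name `invariant_violations` are assumptions for the example, not the generator's confirmed schema.

```python
def invariant_violations(row):
    """Return the component tags whose value does not appear verbatim in raw."""
    raw = row["raw"]
    return [tag for tag, value in row["components"].items() if value not in raw]


good = {
    "raw": "120 Main St, Paris, TX 75460",
    "components": {"locality": "Paris", "region": "TX", "postcode": "75460"},
}
bad = {
    "raw": "120 Main St, Paris, TX 75460",
    "components": {"locality": "Paris", "region": "Texas"},  # "Texas" is not in raw
}

ok = invariant_violations(good) == []
rejected_tags = invariant_violations(bad)
```

A row is committed to the canonical JSONL only when the violation list is empty; anything else is dropped before alignment sees it.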

Builder (corpus/scripts/build-kryptonite-shard.ts)

Streams the canonical JSONL through alignRow (corpus/src/align.ts) to produce tokens + labels, writes a single parquet shard under train/, and emits the combined MANIFEST.json. v0.3.0 shard descriptors are copied verbatim into the new manifest; no v0.3.0 bytes are touched. New shard descriptors are stamped with source: "deepseek-kryptonite" so corpus-audit can attribute them without the filename-prefix inference fallback.

Quarantined rows (alignment failures) are logged to quarantine-kryptonite.tsv alongside the new shard. Surface-form validation already happens in the Python generator, so the alignment step should reject <1%. Anything above 5% indicates a prompt or substring-match regression and should be investigated before commit.

Split policy

All kryptonite rows land in train. Synthetic adversarial data is augmentation; it must not appear in val or test where it would inflate the eval against itself. The v0.3.0 splitter's locality-holdout policy (corpus/src/split.ts) does not apply: kryptonite rows have no natural locality boundary and are not produced by the holdout regions anyway.

The kryptonite catalogue's eval surface lives elsewhere: Stage 5 reconcile (Thread D) ships its own hand-curated fixture set. See PHASE_8 §D.

Auditing

npx tsx corpus/scripts/audit.ts \
/mnt/playpen/mailwoman-data/corpus/versioned/v0.4.0/corpus-v0.4.0 \
--config corpus-python/src/mailwoman_train/configs/v0_5_0.yaml

The audit reports per-source shard counts and (with --config) effective sample weights. deepseek-kryptonite is now in KNOWN_SOURCE_PREFIXES. Expected: 1 train shard with source: "deepseek-kryptonite", sub-1% of total shards (the v0.3.0 baseline has 674 train shards; one more is noise at the audit level).

Transliteration generation (Thread B2)

Goals

  • Seed corpus: /data/corpus/versioned/v0.4.0/staging/seeds-en-us.jsonl (~4.5K US rows) and seeds-fr-fr.jsonl (~10.5K FR rows), already sampled from the v0.3.0 train set by the previous session.
  • For each seed, generate one transliteration per target script. Five scripts:
    • cyrl: Russian Cyrillic (locale ru-RU)
    • jpan: Japanese Katakana + Kanji (locale ja-JP)
    • hans: Simplified Chinese (locale zh-CN)
    • hang: Korean Hangul (locale ko-KR)
    • armn: Armenian (locale hy-AM)
  • Target row count: 15K seeds × 5 scripts = ~75K transliterations. The original plan said ~50K; the seed pool grew during sampling and the upper bound is what fits the v0.5.0 byte-fallback budget on Thread A.

Mode + invocation

Implemented in generate_deepseek_corpus.py:

python3 corpus-python/scripts/generate_deepseek_corpus.py \
--mode transliteration \
--out-dir /data/corpus/versioned/v0.4.0/transliteration \
--seed-paths /data/corpus/versioned/v0.4.0/staging/seeds-en-us.jsonl \
/data/corpus/versioned/v0.4.0/staging/seeds-fr-fr.jsonl \
--scripts cyrl jpan hans hang armn \
--batch-size 50 --concurrency 15

The Thread B2 production run consumed the full seed pool (14,978 rows = 4,478 en-US + 10,500 fr-FR). 5 scripts × 14,978 seeds = 74,890 planned rows. Wall-clock and rejection rate are pinned in the Changelog below.

Prompt design

TRANSLIT_SYSTEM in the generator script defines the contract:

  • Keep digits / commas / periods / hyphens verbatim.
  • Transliterate place names + street-type words using natural conventions for the target script; do not translate semantically.
  • Mirror the input component tags exactly (every input tag appears in output).
  • Surface-form invariant holds: every component value substring-matches the transliterated raw.

Each batch is built per script (not per seed): one chat completion handles 50 seeds in one target script. The response is JSONL keyed by i (batch index 0..49). Substring validation rejects malformed transliterations before commit. The rejection rate in the initial prompt-engineering smoke was ~3%, which is acceptable, but the prompt is worth tuning before the production 75K pass.
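The per-script batching and the i-keyed JSONL response can be sketched like this. The helper names `chunk` and `parse_batch_response`, and the choice to silently skip malformed or out-of-range lines, are assumptions for illustration; only the batch size of 50 and the `i` key come from the description above.

```python
import json


def chunk(seeds, size=50):
    """Split the seed list into consecutive batches of at most `size` seeds."""
    return [seeds[i:i + size] for i in range(0, len(seeds), size)]


def parse_batch_response(text, batch_len):
    """Map batch index i to its parsed row, skipping lines that cannot be used."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line, e.g. a response truncated mid-row
        i = obj.get("i") if isinstance(obj, dict) else None
        if isinstance(i, int) and 0 <= i < batch_len and i not in rows:
            rows[i] = obj
    return rows


response = '{"i": 0, "raw": "a"}\n{"i": 1, "raw": "b"}\n{"i": 2, "raw'
parsed = parse_batch_response(response, batch_len=50)
missing = [i for i in range(3) if i not in parsed]  # seeds with no usable output
```

Keying by `i` rather than by line position means a dropped or truncated row does not shift every later row onto the wrong seed.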

Cost + wall time estimate

  • 75K rows / 50 per batch = 1,500 batches.
  • Per-batch latency is API-dominated; with reasoning headroom (see below), 30–60 s.
  • Wall time at conc=15: ~1,500 batches × 45 s / 15 ≈ 75 min on a clean run.
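The estimate above is plain arithmetic and can be reproduced as follows. The function name `estimate_wall_minutes` is hypothetical, and the model assumes all 15 workers stay busy for the whole run with no retries, which is why it is only a lower-bound style estimate.

```python
import math


def estimate_wall_minutes(rows, batch_size, seconds_per_batch, concurrency):
    """Return (batch count, wall-clock minutes) assuming perfectly packed workers."""
    batches = math.ceil(rows / batch_size)
    minutes = batches * seconds_per_batch / concurrency / 60
    return batches, minutes


batches, mid = estimate_wall_minutes(75_000, 50, 45, 15)   # 1500 batches, 75.0 min
_, fast = estimate_wall_minutes(75_000, 50, 30, 15)        # 50.0 min at 30 s per batch
_, slow = estimate_wall_minutes(75_000, 50, 60, 15)        # 100.0 min at 60 s per batch
```

The 30–60 s latency range therefore brackets the clean-run wall time between roughly 50 and 100 minutes.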

max_tokens and the reasoning budget

DeepSeek-v4-flash with reasoning_effort=low still consumes a non-trivial reasoning budget on transliteration batches. Thread B2's first launch ran at max_tokens=20000 (the same knob that worked for kryptonite) and saw 30/32 batches hit finish_reason=length because reasoning ate 12–15K of the 20K budget, leaving <5K for the 50-row JSONL output. Truncated responses produced ~8 rows per batch instead of 50.

The validated production knob is max_tokens=60000: empirically reasoning peaks around ~15K and 50-row output uses ~5K, so 60K leaves comfortable headroom. The Thread B2 generator also marks finish_reason=length batches with a !RETRY: prefix so they don't checkpoint and get retried on subsequent runs. This is defence in depth for the rare batches that still truncate at 60K.
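One possible reading of the budget and the !RETRY: marker, sketched under stated assumptions: the source does not say where the prefix is stored, so this example assumes it is applied to the batch id written to the checkpoint, and the function names are hypothetical.

```python
RETRY_PREFIX = "!RETRY:"  # marker named in the text; its exact placement is assumed


def output_headroom(max_tokens: int, reasoning_tokens: int) -> int:
    """Tokens left for the JSONL payload once reasoning has been spent."""
    return max_tokens - reasoning_tokens


def checkpoint_entry(batch_id: str, finish_reason: str) -> str:
    """Prefix truncated batches so their entry never equals a planned batch id."""
    if finish_reason == "length":
        return RETRY_PREFIX + batch_id
    return batch_id


def completed_ids(entries) -> set:
    """Only unprefixed entries count as done; prefixed ones are retried next run."""
    return {e for e in entries if not e.startswith(RETRY_PREFIX)}


tight = output_headroom(20_000, 15_000)   # 5000: no margin over the ~5K output
roomy = output_headroom(60_000, 15_000)   # 45000: ample room for 50 rows
entries = [checkpoint_entry("ab12", "stop"), checkpoint_entry("cd34", "length")]
done = completed_ids(entries)
```

Under this reading a truncated batch leaves a visible trace in the checkpoint file while still being picked up again on the next invocation.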

This budget reasoning is transliteration-specific; the kryptonite mode at max_tokens=20000 remains correct because English-only ASCII output uses ~1 token per row character and 50 rows fit comfortably.

What Thread B2 adds

  1. Runs the transliteration generator end-to-end against the full seed pool.
  2. New build-transliteration-shard.ts mirroring build-kryptonite-shard.ts: it buckets canonical rows by source and emits one parquet shard per script (train/part-translit-<slug>.parquet).
  3. Composes the new MANIFEST as (v0.3.0 base shards) + (Thread B kryptonite shard) + (Thread B2 translit-<slug> shards).
  4. Migrates v0.3.0 shard descriptors in MANIFEST from /mnt/playpen/mailwoman-data/... to /data/..., the form that container loaders resolve at training time. The on-disk bytes are unchanged; only the path strings in MANIFEST move. (Thread B's MANIFEST mixed the two forms; this PR canonicalizes.)
  5. KNOWN_SOURCE_PREFIXES in audit.ts already contains the five deepseek-translit-* slugs from Thread B, so no audit.ts edit is needed.
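The path migration in step 4 amounts to a prefix rewrite over the manifest's shard descriptors. The sketch below assumes a manifest shaped as a `shards` list whose entries carry a `path` string; those key names and the function `canonicalize_paths` are hypothetical, and the real builder is TypeScript rather than Python.

```python
OLD_PREFIX = "/mnt/playpen/mailwoman-data/"
NEW_PREFIX = "/data/"


def canonicalize_paths(manifest):
    """Return a copy of the manifest with shard paths moved to the /data/ form."""
    shards = []
    for shard in manifest["shards"]:
        path = shard["path"]
        if path.startswith(OLD_PREFIX):
            path = NEW_PREFIX + path[len(OLD_PREFIX):]
        shards.append({**shard, "path": path})  # other descriptor fields untouched
    return {**manifest, "shards": shards}


before = {
    "shards": [
        {"path": "/mnt/playpen/mailwoman-data/corpus/x.parquet", "source": "a"},
        {"path": "/data/corpus/y.parquet", "source": "b"},  # already canonical
    ]
}
after = canonicalize_paths(before)
```

Only the path strings change; the input manifest is left unmodified and paths already in the /data/ form pass through as-is, which is what makes the rewrite safe to apply to a manifest that mixes both forms.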

Known risks

  • Cross-script seed pollution. If the seed pool is biased (e.g. all FR seeds are Île-de-France because the BAN sampling skewed there), the resulting transliterations would teach the tokenizer about that bias 5× over. The current staging files were sampled stratified by source only. Region stratification is deferred and accepted as a known limitation of this pass; it can be revisited if Thread A's byte-fallback eval surfaces a regional skew.
  • Whitespace-tokenizer interaction with CJK. Alignment uses the whitespace tokenizer, which treats space-less CJK as a single token. The substring invariant still holds (raw and component values are space-matched verbatim), so alignment passes. Per-character labelling at training time, however, comes from the sentencepiece tokenizer Thread A retrains, not from the corpus alignment.

Changelog

| Date | Change |
| --- | --- |
| 2026-05-23 | Initial doc + kryptonite slice generation (5K rows, deepseek-v4-flash). |
| 2026-05-24 | Transliteration slice generation shipped under Thread B2: 73,319 rows generated out of 74,890 planned (5 scripts × 14,978 seeds, en-US + fr-FR), max_tokens=60000, 5.2 rows/s sustained, ~234 min wall-clock at conc=15. 73,316 rows aligned (3 quarantined). 5 shards added: part-translit-{armn,cyrl,hang,hans,jpan}.parquet. v0.3.0 shard paths in MANIFEST canonicalized from /mnt/playpen/mailwoman-data/... to /data/.... |