Phase 1 β Corpus Pipeline
Goal: build a reproducible, versioned dataset pipeline. End state is corpus-v0.1.0 on disk, containing aligned BIO-labeled Parquet shards for US + FR, with a hand-labeled golden eval set.
Duration estimate: 2 weeks.
Branch: neural/phase-1-corpus
Depends on: Phase 0 complete and tagged.
Pre-flightβ
- Phase 0 success criteria all green
- Disk space available at
/data/corpus/perreference/OPERATIONS.mdlayout - You understand BIO labeling and the
LabeledRowshape
Tasksβ
1. Package scaffoldingβ
- Create
packages/corpus/workspace - Create
packages/corpus-python/(Python, not in workspaces β has its ownpyproject.toml) -
packages/corpus/src/types.tsβCanonicalRow,LabeledRow,CorpusAdapter,AdapterOptionsperreference/INTERFACES.md - Add to root tsconfig references, root package.json scripts
2. Adapter frameworkβ
-
packages/corpus/src/adapter.tsβ baseCorpusAdapterinterface, helper utilities (streaming, checksum, dedup) -
packages/corpus/src/runner.tsβ drives an adapter, writes intermediate JSONL, reports progress, handles backpressure - CLI:
npx mailwoman corpus run <adapter-id> --input <path> --output <dir> [--country XX] [--limit N]
3. Adapters β priority orderβ
Build adapters in this order. Each must have:
- A small fixture in
packages/corpus/fixtures/<adapter-id>/(license-clean, hand-crafted, < 1MB) - An integration test that runs the adapter against the fixture
- A README documenting the source, license, download instructions, expected schema, known quirks
3a. wof-admin (P1, coarse)β
- Input: WOF SQLite distributions (
whosonfirst-data-admin-us-latest.spatial.db,-fr) - Emit one row per WOF record, with synthesized
rawstrings using country-appropriate templates - Walk the ancestry chain β emit hierarchical variants (city alone, city+region, city+region+country)
- License: WOF is CC0
- Expected output: ~50k US localities, ~36k FR communes, plus regions/countries
3b. wof-postalcode (P1, coarse)β
- Input: WOF postalcode SQLite distributions
- Emit rows pairing postcode with its parent locality/region
- License: CC0
3c. osm-places (P1, coarse corroboration)β
- Input: OSM PBF file, filtered to
place=city|town|village|hamlet|suburb|neighbourhood - Emit rows for each named place, with admin hierarchy from
is_in:*tags or reverse-geocoded from WOF - License: ODbL β tag in
licensefield - Use
osm-pbf-parser-nodeor equivalent (verify maintenance before committing)
3d. address-formatting (P1, synthesis support)β
- Vendor or fetch OpenCageData's
address-formattingrepo (MIT). It contains country-keyed templates for rendering component dicts into strings. -
packages/corpus/src/format.tsexposesformatAddress(components: ComponentDict, country: string): string - This is not an adapter β it's a utility used by adapters and synthesis to render
rawfrom component dicts.
3e. ban (P2, FR street-level)β
- Input: Base Adresse Nationale dump from
adresse.data.gouv.fr. Format: CSV, very large (~25M rows). - Emit rows with
house_number, street, postcode, locality, region, country='FR' - License: ODbL / Licence Ouverte
- Note: BAN has authoritative French addresses. This is your highest-quality FR source.
3f. openaddresses (P2, US/global street-level)β
- Input: OpenAddresses GeoJSON, country-partitioned downloads
- License: varies per source β propagate per-row
- Emit street-level rows
3g. osm-addr (P2, US/FR street-level)β
- Same OSM PBF, filtered to
addr:*tagged ways/nodes - Lower quality than OpenAddresses or BAN but broader coverage
3h. US gov registries (P3, US venue-level)β
Defer to end of Phase 1 or push to Phase 2 if running long. Each is a small adapter, none individually critical:
-
usgov/hrsaβ Health Resources & Services Administration facility list -
usgov/npiβ National Provider Identifier (medical providers) -
usgov/fccβ FCC licensee addresses
3i. SIRENE (P3, FR venue-level)β
- Input: SIRENE bulk download (FR business registry)
- License: Licence Ouverte
- Defer if running long
4. Alignmentβ
Given a CanonicalRow with raw and components, produce a LabeledRow with token-level BIO labels.
-
packages/corpus/src/align.ts - Strategy: for each component value, find its character span in
rawusing fuzzy match (fastest-levenshtein). Tokenizerawwith SentencePiece. Assign BIO labels to tokens whose spans overlap component spans. - Reject rows where any component cannot be aligned (component text doesn't appear in
rawwithin edit distance threshold). Write rejected rows to/data/corpus/quarantine/with a reason for human review. - Unit tests: alignment correct on hand-crafted examples covering: missing components, reordered components, abbreviated forms, accented vs unaccented text.
β Use the same SentencePiece model that training and inference will use. Train it first (next task), then run alignment.
5. SentencePiece tokenizer trainingβ
The tokenizer is trained on the corpus, not picked off the shelf.
-
packages/corpus-python/scripts/train_tokenizer.py - Input: a sample (say 5M lines) of
rawstrings from coarse adapters, balanced US/FR - Train SentencePiece with:
vocab_size=16000,character_coverage=0.9995,model_type=unigram,byte_fallback=true - Output:
tokenizer.modelandtokenizer.vocabwritten to/data/models/tokenizer/v0.1.0/ - Run alignment using this tokenizer.
β Tokenizer version is locked into corpus version. corpus-v0.1.0 ships with tokenizer-v0.1.0. Don't retrain mid-corpus.
6. Synthesis / augmentationβ
-
packages/corpus/src/synthesize.ts - Augmentations for both US and FR:
- Case perturbation (random upper/lower)
- Punctuation drop/add (commas)
- Abbreviation swap (
StreetβStusing Mailwoman's existing dictionaries β reuse fromresources/) - Whitespace normalization variants (single/double space, tab/newline)
- Typo injection (single-char edits, low rate ~2%)
- FR-specific augmentations:
- Accent stripping (
HΓ΄telβHotel) - Particle variants (
Rue de la RΓ©publiqueβRue RΓ©publique) - Arrondissement notation (
Paris 8eβParis VIIIβ75008 Paris) - CEDEX variants (with and without)
- Accent stripping (
- US-specific augmentations:
- State abbreviation vs full (
ORβOregon) - Directional abbreviation (
SEβSoutheast) - ZIP+4 with and without dash
- State abbreviation vs full (
- Each augmented row carries
synth.methodandsynth.base_source_id.
7. Parquet outputβ
-
packages/corpus/src/parquet.tsβ write labeled rows to Parquet shards, ~1M rows per shard - Schema:
raw: string, tokens: list<string>, labels: list<string>, country: string, locale: string, source: string, source_id: string, corpus_version: string, license: string, synth_method: string?, synth_base_id: string? - Use
@dsnp/parquetjsor, if that proves limiting, write JSONL and convert via a tiny Python script (PyArrow). Either is acceptable. - Output path:
/data/corpus/versioned/corpus-v0.1.0/
8. Eval splitsβ
β This is where corpora silently leak. Do it correctly.
-
packages/corpus/src/split.ts - Split strategy: hold out by locality, not random row sampling
- US: hold out all rows from Vermont, Wyoming, North Dakota (low-density states, ensures generalization)
- FR: hold out all rows from Corse, Lozère, Creuse (small departments)
- Output: split manifests as JSON files listing source_ids in each split. Manifests go in git.
- Splits: 90% train, 5% val, 5% test
- Document the holdout choices in
DECISIONS.md
9. Golden eval setβ
-
/data/eval/golden/v0.1.0/ - 500 US addresses, 500 FR addresses, hand-labeled by a human (the maintainer). Use the existing Mailwoman parser as a starting point and hand-correct.
- Each entry:
{ raw, components: { ... }, source: 'golden', notes: '...' } - Cover: residential, commercial, PO boxes, intersections, venues, edge cases (single-line, multi-line, abbreviations, typos)
- Check into git. This is the contract for "what good looks like."
βΆ Drafting the golden set takes real time. Start it in week 1 of Phase 1 even while building adapters. Don't leave it for the end.
10. Corpus build pipelineβ
-
npx mailwoman corpus build --version 0.1.0β single command that runs all adapters, alignment, synthesis, splits, writes Parquet - Reproducible β same inputs β same outputs
- Logs progress, writes a manifest with file checksums to
/data/corpus/versioned/corpus-v0.1.0/MANIFEST.json
Success criteria checklistβ
-
corpus-v0.1.0/exists on disk with Parquet shards - MANIFEST.json has checksums for every shard
- At least 5M labeled rows (coarse + street where available)
- Eval split manifests in git
- Golden set in git: 1000 hand-labeled entries
- Tokenizer v0.1.0 saved alongside corpus
- All adapters have fixtures + integration tests
-
npm testgreen -
LOG.md,DECISIONS.mdup to date - Branch tagged
neural-phase-1-complete
Common pitfallsβ
- β Splitting train/test randomly. Locality holdout or it doesn't count.
- β Retraining the tokenizer after some alignment has been done. Tokens shift, labels become wrong.
- β Skipping the quarantine pile. The alignment failures are where future bugs hide.
- β Forgetting to record the license per source. You'll regret this when someone asks if they can use a derived model commercially.
- β Letting OSM dominate by row count. Stratified sampling at training time, but warn now if one source is > 60% of rows.
When to call this phase doneβ
When you can run npx mailwoman corpus build --version 0.1.0 from a clean state, get a complete corpus on disk, and run a small Python script that loads a shard and prints (tokens, labels) pairs that look obviously correct.