Operations
๐งช Operator documentation. For contributors running corpus builds, training experiments, and releases. If you want to use Mailwoman, see Getting started.
Working normsโ
You are an autonomous agent. Optimize for legibility to a human checking in once a day, not for clever one-shot completions.
Pacingโ
- Work in small, complete units. A "unit" is anything that ends with a green test suite or a working command.
- Commit at every unit boundary. Better too many commits than too few.
- Do not start Phase N+1 until Phase N's success criteria checklist is fully green and committed.
Decision disciplineโ
When you hit a fork in the road:
- If the plan answers it: follow the plan.
- If the plan is silent and the decision is reversible (file naming, test structure, internal variable names): pick the most conventional option and move on.
- If the plan is silent and the decision is hard to reverse (public API surface, schema change, npm package name, license commitment, data format): write the decision and the alternatives to
DECISIONS.mdand pick the most conservative option (the one that preserves the most future options). - Never block waiting for input. If a decision feels truly blocking, write it to
DECISIONS.mdunder a## BLOCKEDheading and continue with adjacent work.
Commit messagesโ
Format:
<phase>: <terse imperative>
<optional body: rationale, what changed, what didn't>
Refs: <plan file paths if relevant>
Example:
phase-0: wrap legacy classifiers in ClassificationProposal adapter
All existing rule classifiers now route through wrapLegacyClassifier.
Output shape is identical to pre-refactor for SDK consumers.
Solver code untouched.
Refs: reference/INTERFACES.md, phases/PHASE_0_foundation.md ยง3
Branchingโ
- Work on a branch named
neural/phase-N-<slug>per phase. - Squash-merge to
mainat phase completion. - Tag
mainat each phase completion:neural-phase-N-complete.
Progress reportingโ
LOG.mdโ
Append-only, one line per meaningful event. Format:
YYYY-MM-DD HH:MM | phase-N | <what was done> | <next>
Examples:
2026-05-16 14:30 | phase-0 | added ComponentTag union, BIO_LABELS derivation, 47 unit tests passing | wrap rule classifiers
2026-05-16 16:05 | phase-0 | wrapped 12 rule classifiers via adapter, zero behavior change | ClassifierPolicy registry
2026-05-17 09:12 | phase-1 | OSM PBF adapter streams 100k rows in 38s, memory steady at 220MB | WOF admin adapter
Keep it terse. The radio-console format.
DECISIONS.mdโ
Decisions that affect future code. One entry per decision. Format:
## YYYY-MM-DD โ <decision title>
**Context:** what was being done
**Options considered:**
1. <option> โ pros, cons
2. <option> โ pros, cons
**Chosen:** <option>
**Rationale:** <one paragraph>
**Reversibility:** <reversible | costly | irreversible>
BLOCKERS.mdโ
Only created if you have a genuine blocker. One entry per blocker.
## YYYY-MM-DD โ <blocker title>
**What's blocked:** <task>
**Why:** <reason>
**What you tried:** <attempts>
**What would unblock:** <answer needed / resource needed / decision needed>
Always continue with adjacent work while a blocker is open. Do not idle.
Code qualityโ
Testsโ
- Unit tests for every classifier (rule and neural alike).
- Integration tests for every adapter against a small fixture in
corpus/fixtures/. - Golden-set tests for the full pipeline. Regression here blocks merge.
- Use Vitest (already in Mailwoman).
Typesโ
- Strict TypeScript.
strict: truein every package. - No
any. Useunknownand narrow. - Public API exports must have explicit type annotations.
Linting and formattingโ
- ESLint (already in Mailwoman). Run on every commit.
- Prettier (already in Mailwoman). Run on every commit.
- Add a pre-commit hook via Husky (already present) if not already enforced.
Performance budgetsโ
| Path | Budget |
|---|---|
| Rule-only classify, single address | < 5ms p95 |
| Neural classify, single address, CPU | < 20ms p95 |
| Neural classify, single address, GPU | < 5ms p95 |
| Cold model load | < 2s |
| Tokenizer parity check, 10k addresses | < 30s |
If a change blows a budget, profile before optimizing. Add a microbenchmark to bench/ so the regression doesn't return.
Dependenciesโ
Adding a dependency requires:
- Justification in the commit message (why this, why not X, why not a few lines of code).
- License check โ must be MIT, Apache-2.0, BSD, or ISC. AGPL-compatible licenses fine for the package itself (Mailwoman is AGPL) but verify case-by-case.
- Size check โ flag in commit if it adds > 1MB to bundled output.
- Stewardship check โ verify the dep is actively maintained (commit in last 12 months, > 1 maintainer or > 100 stars).
Approved dependencies for this project:
onnxruntime-nodeโ ONNX inference@bloomberg/sentencepiece-wasm(or equivalent) โ SentencePiece tokenization in JS. Verify and pick.@dsnp/parquetjsโ Parquet read/write in TSosm-pbf-parser-node(or equivalent) โ OSM PBF streaming. Verify and pick.better-sqlite3โ for WOF SQLite readsfast-csvโ CSV parsing for government datafastest-levenshteinโ alignment fuzzy matching- Existing Mailwoman deps โ all retained
Data handlingโ
Storage layout in the home labโ
/data/
corpus/
sources/ # raw downloads, never modified
osm/
wof/
ban/
openaddresses/
usgov/
intermediate/ # adapter outputs before alignment
aligned/ # post-alignment Parquet shards
versioned/ # frozen corpus versions, e.g. corpus-v0.1.0/
models/
checkpoints/ # PyTorch training checkpoints
onnx/ # exported ONNX models
quantized/ # int8 quantized models
eval/
golden/ # hand-labeled golden set, locked, versioned
splits/ # train/val/test split manifests
Versioningโ
- Corpus versions:
corpus-v<major>.<minor>.<patch>. Major = schema change. Minor = source added. Patch = synthesis tweak. - Model versions:
model-v<major>.<minor>.<patch>-<locale>. Same scheme. - Both versions appear in
ModelCard.trainedOnandLabeledRow.corpus_versionso any prediction is traceable to its training data.
What never goes in gitโ
- Raw source data (too large, redistribution risk for some sources)
- Trained model checkpoints over 100MB (use Git LFS or external storage)
- Anything containing PII
What goes in git:
- All code
- Corpus manifests (file lists + checksums, not the files)
- Eval splits (lists of source IDs in each split)
- The hand-labeled golden set (small, human-verified)
- Adapter fixtures (small, hand-crafted, license-clean)
Communication with the humanโ
The human checks in periodically. They will read:
LOG.mdfirstDECISIONS.mdifLOG.mdreferences a decisionBLOCKERS.mdif you have one open- The actual code only if something looks off in the logs
Write logs assuming the reader hasn't seen the code yet. "Wrapped 12 rule classifiers" is fine. "Refactored the thing" is not.
When the human asks a question, answer in the radio-console format (see user preferences they've set). Lead with โ. No greetings.
When you finish a phaseโ
- Run the full test suite. Green.
- Run benchmarks. Within budget.
- Update
LOG.mdwith phase-complete entry. - Squash-merge the phase branch to
main. - Tag
main:git tag neural-phase-N-complete && git push --tags. - Open
phases/PHASE_<N+1>_*.mdand begin.
When you finish all phasesโ
Do not "polish" indefinitely. Write a final entry in LOG.md saying phase 6 is complete, run one final test pass, push, and stop. The human will pick it up from there.