Skip to main content

7 docs tagged with "corpus"

View all tags

Corpus construction

The training corpus is the largest single source of leverage in the project. The model can only learn patterns that appear in the data. This article walks through how Mailwoman builds its corpus — what sources are in it, how rows are aligned to BIO labels, and how the synthesis step multiplies the effective size.

How it will work

This article describes where Mailwoman is heading. The work is tracked in GitHub issues and the plan/ directory. Status is current as of May 2026.

Synthetic corpus — alignment validation is load-bearing

Both v0.5.0 corpus threads (B kryptonite and B2 transliteration) used an LLM (DeepSeek) to generate annotated training rows. Both surfaced the same lesson: the substring-match alignment check is structural infrastructure, not a quality filter you can drop later. This article explains why.

Training pipeline

The training pipeline turns raw address data sources into a model file that ships on npm. This article walks through each stage end-to-end. The Corpus construction article digs into the first three stages in more detail.