7 docs tagged with "corpus"

Corpus construction

The training corpus is the largest single source of leverage in the project. The model can only learn patterns that appear in the data. This article walks through how Mailwoman builds its corpus: which sources it includes, how rows are aligned to BIO labels, and how the synthesis step multiplies the effective size.

Corpus poisoning vulnerability

The empirical-learner nature of the neural parser

DeepSeek — max_tokens covers reasoning, not just output

A short article on one DeepSeek API quirk that has burned every Mailwoman thread that called it with reasoning enabled. Worth knowing before you write the next one.

How it will work

This article describes where Mailwoman is heading. The work is tracked in GitHub issues and the plan/ directory. Status is current as of May 2026.

Negative space — why training every component sharpens each one

A useful intuition guides Mailwoman's corpus-coverage work: as we add training

Synthetic corpus — alignment validation is load-bearing

Both v0.5.0 corpus threads (B kryptonite and B2 transliteration) used an LLM (DeepSeek) to generate annotated training rows. Both surfaced the same lesson: the substring-match alignment check is structural infrastructure, not a quality filter you can drop later. This article explains why.

Training pipeline

The training pipeline turns raw address data sources into a model file that ships on npm. This article walks through each stage end-to-end. The Corpus construction article digs into the first three stages in more detail.