9 docs tagged with "training"

BIO labels

BIO labelling is the trick that lets a token classifier, which decides one token at a time, emit spans: groups of consecutive tokens that together mean one thing. It is the standard approach for span-level sequence labelling tasks in NLP such as named entity recognition, chunking, and address parsing.
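
To make the idea concrete, here is a minimal sketch of decoding per-token BIO tags back into spans. The function name `bio_to_spans` and the label names (`house_number`, `street`, `city`) are illustrative assumptions, not Mailwoman's actual code or label set.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group per-token BIO tags into (span_type, span_text) pairs.

    "B-x" opens a span of type x, "I-x" continues an open span of type x,
    and "O" marks a token outside any span. An "I-x" tag that does not
    continue an open span of the same type is treated like "O" here;
    that is one possible policy among several.
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def close_span() -> None:
        # Emit the span collected so far, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            close_span()
            current_type = label[2:]
            current_tokens = [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            close_span()
            current_type = None
            current_tokens = []

    close_span()
    return spans


# Hypothetical address example: four one-token decisions become three spans.
example = bio_to_spans(
    ["221B", "Baker", "Street", "London"],
    ["B-house_number", "B-street", "I-street", "B-city"],
)
```

The B/I distinction is what allows two adjacent spans of the same type to stay separate: a second `B-street` would start a new span instead of extending the previous one.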

Corpus construction

The training corpus is the largest single source of leverage in the project. The model can only learn patterns that appear in the data. This article walks through how Mailwoman builds its corpus — what sources are in it, how rows are aligned to BIO labels, and how the synthesis step multiplies the effective size.

Eval discipline — reading the numbers honestly

Mailwoman's eval methodology learned its most important lessons the hard way — from shipping two model versions that regressed on headline F1 but told a different story when the failures were examined properly. This article documents the discipline: what to measure, what not to trust, and how to read a model release report.

How it will work

This article describes where Mailwoman is heading. The work is tracked in GitHub issues and the plan/ directory. Status is current as of May 2026.

Synthetic corpus — alignment validation is load-bearing

Both v0.5.0 corpus threads (B kryptonite and B2 transliteration) used an LLM (DeepSeek) to generate annotated training rows. Both surfaced the same lesson: the substring-match alignment check is structural infrastructure, not a quality filter you can drop later. This article explains why.
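
As a rough illustration of what a substring-match alignment check can look like, the sketch below accepts a generated row only if every annotated component appears verbatim in the raw address. The function name `align_components`, the left-to-right ordering requirement, and the label names are assumptions made for this example; the article itself describes the check Mailwoman actually uses.

```python
from typing import List, Optional, Tuple


def align_components(
    raw: str, components: List[Tuple[str, str]]
) -> Optional[List[Tuple[str, int, int]]]:
    """Map (label, text) components to character offsets in `raw`.

    Returns a list of (label, start, end) offsets, or None if any component
    is empty or cannot be found verbatim at or after the end of the previous
    component. Assumption: components are listed in the order they appear.
    """
    offsets: List[Tuple[str, int, int]] = []
    cursor = 0
    for label, text in components:
        if not text:
            return None
        start = raw.find(text, cursor)
        if start == -1:
            # The generator produced text that is not in the raw address,
            # so no trustworthy token-level labels can be derived.
            return None
        end = start + len(text)
        offsets.append((label, start, end))
        cursor = end
    return offsets


raw_address = "12 Rue de la Paix, Paris"
good_row = align_components(
    raw_address,
    [("house_number", "12"), ("street", "Rue de la Paix"), ("city", "Paris")],
)
bad_row = align_components(
    raw_address,
    [("house_number", "12"), ("city", "Lyon")],
)
```

A row that fails the check has no character offsets, so there is nothing to convert into BIO labels. That is the sense in which a check like this is structural and not an optional quality filter.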

Training pipeline

The training pipeline turns raw address data sources into a model file that ships on npm. This article walks through each stage end-to-end. The Corpus construction article digs into the first three stages in more detail.