Skip to main content

38 docs tagged with "concepts"

View all tags

A walkthrough — NY-NY Steakhouse, Houston, TX

The knowledge ladder explains why the v0.5.0 pipeline grew two new information layers (Stage 2.7 phrase grouper, expanded Stage 5 reconcile). This article walks through what they actually do on one concrete input, end-to-end.

Addresses that break geocoders

What is an address? covered the data model. This article goes the other way: a tour of address shapes that consistently break parsers and resolvers, with concrete examples for each. If you are building or evaluating a geocoder, this is the failure-mode catalogue worth keeping next to your test suite.

BIO labels

BIO labelling is the trick that lets a token classifier (which decides one token at a time) emit spans (groups of consecutive tokens that mean one thing together). It is the standard approach for sequence labelling tasks in NLP — Named Entity Recognition, part-of-speech tagging, address parsing.

Confidence calibration

Every span the parser emits carries a number when the model says it's 94% sure, is it actually right 94% of the time?

Corpus construction

The training corpus is the largest single source of leverage in the project. The model can only learn patterns that appear in the data. This article walks through how Mailwoman builds its corpus — what sources are in it, how rows are aligned to BIO labels, and how the synthesis step multiplies the effective size.

CRF decoder

A Conditional Random Field (CRF) is a structured-prediction layer that sits on top of a per-token classifier and finds the best whole-sequence assignment of labels, subject to learnable transition rules between consecutive labels.

Dual-role places and hierarchy completion

Some places are two things at once. Berlin is a city and a federal state. Milano is a comune and the province named after it. Bristol is a city and a unitary authority. Utrecht is a city and a province. A census of our gazetteer found about 122 of these across nine countries — Italy alone has 70 (its provinces are nearly all named after their capital), then Spain (14), the UK (12), Japan (10), Korea (7), and a handful each for France, Germany, the Netherlands, and China.

Eval discipline — reading the numbers honestly

Mailwoman's eval methodology learned its most important lessons the hard way — from shipping two model versions that regressed on headline F1 but told a different story when the failures were examined properly. This article documents the discipline: what to measure, what not to trust, and how to read a model release report.

FST gazetteer prior

The FST (finite-state transducer) gazetteer prior is the structure that lets the neural classifier benefit from everything Who's On First already knows. The neural model knows grammar; the gazetteer knows places. The FST is the bridge — pre-computed at build time so the classifier can consult it at inference time without paying gazetteer-lookup costs per token.

How can a building have two addresses?

A single physical building can have multiple valid addresses. For commercial buildings, multifamily housing, corner properties, and any structure that touches more than one administrative system, that is the normal state of affairs. A parser that assumes "one building = one address" will fail silently on a large fraction of real-world queries.

Neural classification

The neural classifier is a small transformer encoder that reads an address as a sequence of tokens and emits one BIO label for each token. This article explains what a transformer encoder is, what makes Mailwoman's specific model the shape it is, and what the model is and is not capable of.

ONNX runtime

ONNX (Open Neural Network Exchange) is a standardized format for serializing trained neural networks. ONNX Runtime is the family of inference engines that consume those files. Mailwoman uses ONNX so the same trained model can run in Node.js, browsers, mobile devices, or anywhere else with an ONNX Runtime build.

Resolver and Who's On First

Parsing answers "what kind of thing is each part of this string?". Resolving answers "where is the resulting place?". They are different jobs and Mailwoman keeps them apart on purpose.

Rule-based classifiers

A rule classifier is a small piece of hand-written code that labels tokens. Pelias and Mailwoman v1 are built almost entirely on rule classifiers. Mailwoman v2 keeps them and adds the neural classifier alongside — see How it works now for the hybrid.

Street-supplement architecture

This is the design reference for the work that fills the WOF hierarchy gap. It synthesizes the architectural decisions reached during the v0.6.1 postmortem and the subsequent design consult. Code in neural/, resolver-wof-sqlite/, and core/ refers back here; this article tells you why each piece looks the way it does.

Synthetic corpus — alignment validation is load-bearing

Both v0.5.0 corpus threads (B kryptonite and B2 transliteration) used an LLM (DeepSeek) to generate annotated training rows. Both surfaced the same lesson: the substring-match alignment check is structural infrastructure, not a quality filter you can drop later. This article explains why.

The pipeline contract

You don't have to take Mailwoman's pipeline as-is. The runtime coordinator (createRuntimePipeline) accepts each of the six stages as an injectable function or interface; an integrator can swap any of them for a custom implementation without forking the core.

The WOF hierarchy gap

Who's On First is a place gazetteer. Mailwoman is an address parser. The two are misaligned at one specific point in the hierarchy — and that misalignment shapes how the neural model fails on street-level inputs.

Tokenization

Tokenization is the step that turns an input string into a list of small pieces called tokens. Every downstream component — rule classifiers, the neural classifier, the solver, the resolver — works with tokens, not with raw strings.

Training pipeline

The training pipeline turns raw address data sources into a model file that ships on npm. This article walks through each stage end-to-end. The Corpus construction article digs into the first three stages in more detail.

Training-run charts

The v0.6.x eval reports keep claiming things like "same noise pattern across

What is a concordance?

In Mailwoman's architecture, a concordance is the resolver's answer to the question: "Do these parsed components form a coherent place in the real world?" It is the mechanism that prevents the parser from emitting a parse that is structurally valid but geographically impossible — like "Paris, Texas" labelled as "Paris, Île-de-France" or "NY-NY Steakhouse, Houston TX" with "NY" tagged as a region.

What is a postcode?

A postcode is a routing instruction, not a geographic area. It tells a postal service how to sort and deliver mail. It does not tell a geocoder where a building is, what municipality it belongs to, or what polygon contains it. Confusing these two things is the most common error in address geocoding — and the source of a surprising fraction of production bugs.

What is a ZIP Code?

The US ZIP Code is the most influential postal code system in the world — not because it is the best, but because US-origin address data dominates geocoding training sets and shapes what parsers expect. Understanding its structure is essential for understanding why US-trained parsers fail on international addresses.

What is an address?

Addresses look obvious. You see one every day. But parsing them turns out to be one of the harder problems in natural language processing, because a single string can mean many places, and a single place can be written many ways.

What is an intersection address?

An intersection address names a location by the crossing of two streets. Unlike a conventional street address ("350 5th Ave"), it does not specify a building number. It says "find where these two streets meet." This is common in urban navigation, emergency dispatch, and some countries (notably Japan) where the street-address system is entirely different.

Who's On First — data model and gotchas

Who's On First (WOF) is the best open gazetteer available. It's also one of the strangest datasets you'll encounter as a developer. This article documents the gotchas — the structural quirks that trip up new consumers — and the tooling Mailwoman built to work around them.