Containment and tree projection
After the encoder produces emissions, the FST priors add their biases, and
Viterbi finds the best valid BIO sequence
(see those sub-articles), the last step turns
that flat label sequence into a hierarchical AddressTree. This is the
piece that lets downstream consumers ask "what's the locality?" or "how
does this street nest inside its admin parent?" without doing the structural
walk themselves.
The piece is small but load-bearing. It's where the schema's parent/child rules turn an array of labeled tokens into a queryable tree.
What the tree looks like
For 123 Main St, Boston, MA 02101:
Five AddressNodes with these relationships:
regionparents nothing (root in this address)localityparents toregionstreetparents tolocalityhouse_numberparents tostreetpostcodeparents tolocality(siblings of street)
The decoder didn't "discover" this structure from the address text. It walked a rule table that says "given these tags, here's how they nest."
The rule table: PARENT_OF
core/decoder/containment.ts exports PARENT_OF: Partial<Record<ComponentTag, ComponentTag[]>>. Each tag lists its
preferred parents in priority order. Excerpt:
export const PARENT_OF: Partial<Record<ComponentTag, ComponentTag[]>> = {
// Universal coarse — geographic granularity
region: ["country"],
subregion: ["region", "country"],
locality: ["subregion", "region", "country"],
dependent_locality: ["locality"],
postcode: ["locality", "subregion", "region", "country"],
cedex: ["postcode", "locality"],
// Street-level
street: ["dependent_locality", "locality", "subregion", "region"],
street_prefix: ["street"],
street_suffix: ["street"],
house_number: ["street"],
unit: ["street", "house_number"],
// Venue / mailing
venue: ["street", "locality"],
po_box: ["locality", "subregion", "region"],
// JP-specific
prefecture: ["country"],
block: ["district"],
sub_block: ["block"],
building_number: ["sub_block", "block"],
}
Reading the rule for street: a street tries to nest under
dependent_locality first, then locality, then subregion, then
region. The tree builder walks this list and parents the street to the
first listed tag that ACTUALLY has a labeled span in this address.
Missing tags fall through.
The tree-building walk
core/decoder/build-tree.ts does the projection in three passes:
- Span aggregation. Walk the BIO sequence, group consecutive
same-tag tokens into spans.
[B-street, I-street, I-street]becomes one street span covering all three tokens. - Parent resolution. For each span, walk its
PARENT_OFlist. Find the first parent-tag that has a labeled span in this address. If multiple spans of the parent tag exist (multiple localities, say), pick the one nearest by character distance. - Tree assembly. Spans whose parent tag has no labeled span become tree roots. All others attach as children of their resolved parent.
The output is AddressTree { raw, roots: AddressNode[] }. Each
AddressNode carries tag, start, end, value, confidence,
children.
What the projection buys you
It's deterministic, given the labels
The encoder + FST priors + Viterbi produce the labels. Everything after that is rule-table application. Two different runs that produce the same BIO sequence will produce bit-identical trees. This makes the decoder behavior easy to reason about and easy to test.
It surfaces structural correctness as a separate axis
A model can produce per-token labels that are correct in isolation but structurally weird:
- A locality span with no parent region (just floats at the root)
- Two locality spans in one address (Brooklyn AND New York)
- A house_number without a street parent (the resolver downstream doesn't know how to handle this)
These are "labels correct, structure wrong" failures. They show up in the tree, not in per-tag recall. A tree validator (planned, not yet shipped — see corpus-poisoning-vulnerability.md's deferred items) would check the AddressTree's structural validity as a final quality gate.
It's the schema's source of truth
The PARENT_OF map is the canonical statement of "what nests in what."
The
street-supplement architecture
treats this as the schema's source of truth — extending it (e.g. adding
dependent_street → street) is a schema-level decision that flows into
training corpus design, eval golden sets, and downstream consumers.
The
WOF hierarchy gap doc explores how WOF's
placetype DAG and mailwoman's PARENT_OF map are related but distinct
ontologies — the model's tags answer "what role does this span play in
an address," while WOF's placetypes answer "what kind of geographical
feature is this." The containment rules let us project model output
into either ontology cleanly.
What this CAN'T do
- Multi-parent. Each span has exactly one parent. The DAG-like
fact that a
boroughcan nest under bothlocalityandlocaladminin WOF doesn't carry over — the decoder picks one. - Cross-address linkage. The tree is per-address. If you parse a list of addresses and want to detect that they all share the same locality, that's a resolver concern, not a decoder one.
- Disagreement resolution. If the BIO sequence has two locality spans that disagree (say one is "Brooklyn" and one is "New York" in the same input), the tree builder produces both as separate spans parenting to whatever they parent to. The downstream resolver decides what to do with that.
See also
- How the model reasons — the central pipeline; this article expands stage 6
- Viterbi and BIO validity — what produces the BIO sequence this article consumes
- WOF hierarchy gap — the relationship between mailwoman's PARENT_OF and WOF's placetype DAG
- Street-supplement architecture —
the layered design that extends
PARENT_OFfor street-side components core/decoder/containment.ts— the actual rule tablecore/decoder/build-tree.ts— the projection algorithm