Component Schema

This document defines the canonical ComponentTag union. It is the single source of truth.

Rule: any change to this file requires:

  1. A written rationale in the commit message.
  2. A migration plan for existing corpus rows tagged with the old schema.
  3. A check that downstream alignment, training, and inference code is updated in the same commit.

Tag inventory

Universal (Phase 1, all locales)

Tag                 Description                                              Example
──────────────────  ───────────────────────────────────────────────────────  ───────────────────────────
country             Sovereign state name or code                             USA, France, FR
region              First-level admin (state, région)                        OR, Île-de-France
locality            City, town, commune                                      Portland, Paris
dependent_locality  Sub-locality (neighborhood, arrondissement, ward)        Brooklyn, 8e arrondissement
postcode            Postal code                                              97215, 75008
subregion           Optional county-level admin between region and locality  Multnomah County

Street-level (Phase 2)

Tag                     Description                                     Example
──────────────────────  ──────────────────────────────────────────────  ────────────────────────────────
house_number            Building number on a street                     6220, 12bis
street                  Street name proper                              Salmon St, République
street_prefix           Directional or descriptive prefix (Anglophone)  SE, North
street_prefix_particle  Non-English grammatical particle (FR)           de la, du, des
street_suffix           Street type suffix (Anglophone)                 Street, Boulevard, Ave
intersection_a          First street of an intersection query           5th Ave (in "5th Ave & 42nd St")
intersection_b          Second street of an intersection query          42nd St (in "5th Ave & 42nd St")
unit                    Apartment, suite, floor                         Apt 4B, Suite 200, 5e étage

Venue-level (Phase 3)

Tag        Description                             Example
─────────  ──────────────────────────────────────  ────────────────────────────
venue      Named place (business, landmark, park)  Mt Tabor Park, Eiffel Tower
attention  "Attention" or "care of" line           c/o Jane Doe, Att: Sales Dept
po_box     Post office box                         PO Box 1234, BP 42

Locale-specific

Tag    Locale  Description                         Example
─────  ──────  ──────────────────────────────────  ────────────────────────────────
cedex  FR      Special postal routing designation  CEDEX 08 in 75008 PARIS CEDEX 08

JP-specific (Phase 6; listed for forward compatibility, not used in Phases 1–3)

Tag              Description                          Example
───────────────  ───────────────────────────────────  ─────────────────
prefecture       JP first-level admin (都道府県)      東京都, Tokyo
municipality     JP city/ward (市区町村)              千代田区, Chiyoda
district         JP district (大字)                   丸の内, Marunouchi
block            JP chōme (丁目)                      1丁目
sub_block        JP banchi (番地)                     1番地
building_number  JP gō (号)                           1号
building_name    JP named building (often in romaji)  Tokyo Building

Note on JP forward compatibility: apart from their declaration in COMPONENT_TAGS, the JP-specific tags above must not be referenced anywhere in core code in Phases 0–5. They are listed in this document so that adding them in Phase 6 does not require a core rewrite. The componentsSupported field on LocaleProfile is how the system knows which tags a locale actually uses.
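As a rough illustration of how componentsSupported can gate what a locale sees, the sketch below derives a per-locale BIO label set. The LocaleProfile shape (beyond componentsSupported), the bioLabelsFor helper, and the contents of frProfile are assumptions made for this example, not part of the schema; the tag list is trimmed for brevity.

```typescript
// Trimmed copy of the tag list so the sketch is self-contained.
const COMPONENT_TAGS = [
  "country",
  "region",
  "locality",
  "postcode",
  "cedex",
  "prefecture",
  "municipality",
] as const
type ComponentTag = (typeof COMPONENT_TAGS)[number]
type BioLabel = "O" | `B-${ComponentTag}` | `I-${ComponentTag}`

// Hypothetical profile shape: only componentsSupported is named in this document.
interface LocaleProfile {
  locale: string
  componentsSupported: readonly ComponentTag[]
}

// Hypothetical helper: emits "O" plus B-/I- labels for the tags the locale
// declares, in canonical COMPONENT_TAGS order so label order stays stable.
function bioLabelsFor(profile: LocaleProfile): BioLabel[] {
  const labels: BioLabel[] = ["O"]
  for (const tag of COMPONENT_TAGS) {
    if (profile.componentsSupported.includes(tag)) {
      labels.push(`B-${tag}`, `I-${tag}`)
    }
  }
  return labels
}

// Illustrative profile contents (assumed): no JP tags are listed,
// so no JP labels reach this locale.
const frProfile: LocaleProfile = {
  locale: "fr",
  componentsSupported: ["country", "region", "locality", "postcode", "cedex"],
}
const frLabels = bioLabelsFor(frProfile)
```

Under these assumptions, frLabels holds "O" plus a B-/I- pair for each of the five listed tags, and the JP tags stay invisible to the FR locale even though they are declared in the schema.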

BIO labeling

For training and inference, each tag T becomes two labels:

  • B-T: beginning of a span tagged T
  • I-T: inside (continuation) of a span tagged T

Plus one universal label:

  • O: outside any address component (punctuation, noise, junk)

Example labeling of "6220 SE Salmon St, Portland OR":

Token     Label
────────  ──────────────
6220      B-house_number
SE        B-street_prefix
Salmon    B-street
St        I-street
,         O
Portland  B-locality
OR        B-region
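The labeling above can be reproduced with a small encoder that turns token-level spans into BIO labels. This is a minimal sketch: the Span shape and the toBioLabels helper are hypothetical names introduced here, and the tag list is trimmed to the tags the example needs.

```typescript
// Trimmed to the tags used in this example; the canonical list is COMPONENT_TAGS.
const TAGS = ["house_number", "street_prefix", "street", "locality", "region"] as const
type Tag = (typeof TAGS)[number]
type Label = "O" | `B-${Tag}` | `I-${Tag}`

// Hypothetical span shape: token indices [start, end) carrying one tag.
interface Span {
  tag: Tag
  start: number
  end: number
}

// First token of each span gets B-, the remaining tokens get I-,
// and every token not covered by a span stays O.
function toBioLabels(tokenCount: number, spans: Span[]): Label[] {
  const labels: Label[] = new Array<Label>(tokenCount).fill("O")
  for (const { tag, start, end } of spans) {
    for (let i = start; i < end; i++) {
      labels[i] = i === start ? `B-${tag}` : `I-${tag}`
    }
  }
  return labels
}

const tokens = ["6220", "SE", "Salmon", "St", ",", "Portland", "OR"]
const labels = toBioLabels(tokens.length, [
  { tag: "house_number", start: 0, end: 1 },
  { tag: "street_prefix", start: 1, end: 2 },
  { tag: "street", start: 2, end: 4 },
  { tag: "locality", start: 5, end: 6 },
  { tag: "region", start: 6, end: 7 },
])
```

The comma at index 4 is not covered by any span, so it keeps the O label, matching the table.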

Implementation notes

TypeScript representation

// packages/core/src/types/component.ts

export const COMPONENT_TAGS = [
  // Universal
  "country",
  "region",
  "locality",
  "dependent_locality",
  "postcode",
  "subregion",
  // Street-level
  "house_number",
  "street",
  "street_prefix",
  "street_prefix_particle",
  "street_suffix",
  "intersection_a",
  "intersection_b",
  "unit",
  // Venue-level
  "venue",
  "attention",
  "po_box",
  // FR-specific
  "cedex",
  // JP-specific (Phase 6; declared but unused until then)
  "prefecture",
  "municipality",
  "district",
  "block",
  "sub_block",
  "building_number",
  "building_name",
] as const

export type ComponentTag = (typeof COMPONENT_TAGS)[number]

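// `as const` inside the callback keeps each label a literal type; without it
// the template strings widen to `string` and BioLabel collapses to `string`.
export const BIO_LABELS = [
  "O",
  ...COMPONENT_TAGS.flatMap((t) => [`B-${t}`, `I-${t}`] as const),
] as const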

export type BioLabel = (typeof BIO_LABELS)[number]

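The as const assertion and the derived types are deliberate. Because ComponentTag and BioLabel are computed from COMPONENT_TAGS, adding, renaming, or removing a tag changes both types, and TypeScript reports a compile-time error wherever a stale or misspelled tag is referenced.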
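Compile-time checks only cover tags written in source code; strings arriving from corpus rows or model output still need narrowing at runtime. One possible pattern is sketched below. The isComponentTag guard and parseBioLabel parser are hypothetical helpers, not part of the schema, and the tag list is trimmed for brevity.

```typescript
// Trimmed copy of the tag list so the sketch is self-contained.
const COMPONENT_TAGS = ["country", "region", "locality", "postcode", "cedex"] as const
type ComponentTag = (typeof COMPONENT_TAGS)[number]

// Hypothetical type guard: narrows an untrusted string to ComponentTag.
function isComponentTag(value: string): value is ComponentTag {
  return (COMPONENT_TAGS as readonly string[]).includes(value)
}

// Hypothetical parser: splits a BIO label into position and tag.
// Returns null for "O" and for any tag outside the schema.
function parseBioLabel(
  label: string,
): { position: "B" | "I"; tag: ComponentTag } | null {
  for (const position of ["B", "I"] as const) {
    const prefix = `${position}-`
    if (label.startsWith(prefix)) {
      const tag = label.slice(prefix.length)
      return isComponentTag(tag) ? { position, tag } : null
    }
  }
  return null
}

// In source code the compiler does this work instead: an assignment such as
//   const t: ComponentTag = "postcod"
// is rejected because "postcod" is not a member of the union.
const parsed = parseBioLabel("B-postcode")
```

Keeping the guard next to the tag list means a schema change updates the compile-time union and the runtime check together.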

Validation rule

Every LocaleProfile.componentsSupported array must be a subset of COMPONENT_TAGS. Check this at runtime when the profile is registered, and fail loudly on any violation.
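A minimal sketch of such a registration-time check follows. The registry map, the registerLocaleProfile function, and the locale field are assumptions for illustration; only componentsSupported and COMPONENT_TAGS come from this document. The field is typed as string[] here on the assumption that profiles may be loaded from untyped configuration, which is the case a runtime check exists for.

```typescript
// Trimmed copy of the tag list so the sketch is self-contained.
const COMPONENT_TAGS = ["country", "region", "locality", "postcode", "cedex"] as const

// Hypothetical profile shape, deliberately loose: values may come from JSON.
interface LocaleProfile {
  locale: string
  componentsSupported: readonly string[]
}

// Hypothetical in-memory registry keyed by locale code.
const registry = new Map<string, LocaleProfile>()

// Rejects any profile that lists a tag outside COMPONENT_TAGS,
// naming every offending tag in the error message.
function registerLocaleProfile(profile: LocaleProfile): void {
  const known: readonly string[] = COMPONENT_TAGS
  const unknown = profile.componentsSupported.filter((tag) => !known.includes(tag))
  if (unknown.length > 0) {
    throw new Error(
      `LocaleProfile "${profile.locale}" lists unknown component tags: ${unknown.join(", ")}`,
    )
  }
  registry.set(profile.locale, profile)
}

// Accepted: every tag is in the schema.
registerLocaleProfile({ locale: "fr", componentsSupported: ["country", "postcode", "cedex"] })
```

Throwing before the profile is stored keeps an invalid profile out of the registry entirely, so later code never sees a tag the schema does not define.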

Rationale for specific choices

Why dependent_locality and not neighborhood or borough? WOF and ISO use dependent_locality for the general concept. Names like borough are locale-specific. Pick the umbrella term.

Why split street_prefix from street_prefix_particle? English SE and French de la are grammatically different and synthesis pipelines need to treat them differently. Conflating them produces worse training data.

Why expose subregion if it's optional? Some US addresses include county (rare in display but common in government data). Modeling it explicitly is better than forcing it into region or locality.

Why is cedex FR-specific and not subsumed by postcode? A CEDEX designation is a postal routing instruction, not a postcode. Treating it as one corrupts FR postal code statistics.

Why list JP tags here at all before Phase 6? Listing them forces the Phase 0 type design to handle them. If core code reaches Phase 6 and then needs seven new tags plus a rewrite of the policy system, the schema-first principle has failed.