Skip to main content

Query Shape β€” Cheap Structural Priors

Captures the design intent for the QueryShape sub-system, a stage 2 + 2.5 addition to the runtime pipeline (the-staged-pipeline). Bitter-lesson-safe because it recognizes what shape an input has, not which strings are localities.

Status​

The bitter-lesson framing test​

Mailwoman v1 (Pelias lineage) hit the long-tail trap by using rules to encode place knowledge: dictionaries of 50K city names, hand-curated state abbreviation tables, country-specific keyword sets. Every new locale meant a new dictionary, and the maintenance burden compounded.

The QueryShape sub-system intentionally avoids that trap by restricting itself to universal structural patterns β€” what character classes the tokens are, where the punctuation is, whether a substring matches a known postcode regex. The locale-test:

Would this code change if Mailwoman added a 50th locale?

PrimitiveAdds per new localeVerdict
Character class detectorNothinguniversal
Segmentation grammar~5 lines of locale punctuation rulesbounded
Known-format regex set~1 postcode patternbounded
User-location priorNothinguniversal

Compare to v1's gazetteer-based locality classifier (~50K place names per locale). That is the long-tail explosion this sub-system explicitly does not bring back.

The QueryShape data structure​

interface QueryShape {
/** Dominant character class of the whole input. */
characterClass: "numeric" | "alpha" | "alphanumeric" | "cjk" | "cyrillic" | "arabic" | "mixed"

/** Per-token character class. */
tokenClasses: TokenClass[]

/** Punctuation-bounded segments, locale-aware. */
segments: Segment[]

/** Known-format regex hits (US ZIP, UK postcode, FR postcode, etc.). */
knownFormats: KnownFormatHit[]

totalLength: number
whitespacePattern: "single" | "double" | "tab" | "mixed" | "none"
}

interface TokenClass {
span: Span
class: "digit" | "alpha" | "mixed" | "punct" | "cjk" | "cyrillic" | "arabic"
length: number
}

interface Segment {
span: Span
body: string
index: number // position in the segment list
separator: "comma" | "newline" | "tab" | "whitespace" | "japanese-style" | null
}

interface KnownFormatHit {
format:
| "us_zip" // \d{5}
| "us_zip4" // \d{5}-\d{4}
| "uk_postcode" // SW1A 1AA pattern
| "fr_postcode" // \d{5} with French context
| "ca_postcode" // A1A 1A1 pattern
| "de_postcode"
| "jp_postcode" // \d{3}-\d{4}
| "po_box" // PO Box / BP pattern
span: Span
confidence: number
}

Compute time: microseconds. No ML. Runs on every input before the encoder.

The four primitives​

1. Meta queries (character class)​

Pure-character analysis of the input. Cheap, deterministic, language-agnostic. Examples:

InputCharacter class signal
"10118"pure-digit, length 5 β†’ strong postcode prior
"10118-1234"digit-hyphen-digit, length 10 β†’ US ZIP+4 prior
"SW1A 1AA"letter-digit-letter pattern β†’ UK postcode prior
"NYC NY"all-alpha, very short β†’ likely locality-only query
"東京駅"all CJK β†’ script signal + likely landmark/venue
"دبي"all Arabic β†’ script signal
"350 5th Ave, New York, NY 10118"mixed alphanumeric + 3 commas β†’ structured address

These signals feed Stage 2 (locale gate) and Stage 2.5 (kind classifier) as priors.

2. Phrasification (segmentation)​

Punctuation-bounded chunking. Currently the encoder re-learns segment structure from punctuation tokens; making it explicit gives downstream stages a free structural prior.

Three uses, in order of complexity:

  1. As encoder input feature β€” segment-id per token concatenated to the input embedding. Cheap; no architecture change.
  2. As attention mask β€” let tokens within a segment attend strongly to each other, weakly across. Modest architecture change.
  3. As per-segment kind hint β€” segment[3] is 5-digit-numeric β†’ postcode-prior for this segment only. Stage 2.5 emits per-segment QueryShape; encoder gets per-segment priors.

Locale-dependence is bounded. Commas segment in US/FR/UK; whitespace segments in JP; honorifics in Korean. The punctuation grammar itself is locale-aware Stage-1 normalization, but the output is an abstract Segment[] that downstream stages consume uniformly. Each new locale adds ~5 lines of segmentation rules, not 50K dictionary entries.

3. Known-format detection​

A bounded set of universal postcode and well-known-format regexes:

  • US ZIP / ZIP+4
  • UK postcode
  • FR postcode (5 digits with French context)
  • CA postcode (A1A 1A1)
  • DE postcode
  • JP postcode (\d3-\d4)
  • PO Box / BP variants

When a known-format pattern matches with high confidence AND the user-location prior agrees, Stage 2.5 can route directly to Stage 6 (resolver), skipping the encoder entirely. This is a measurable browser-side win β€” ~5 ms encoder inference + ~25 MB model load saved per call for the trivial cases.

⚠ Fast paths are advisory, not authoritative. "10118" could be a partial typed input ("10118 Maple St" mid-keystroke). Every fast path must:

  • Honor a force_full_pipeline?: boolean option (default false)
  • Apply a conservative confidence floor before short-circuiting
  • Still report alternatives in the result, not collapse to one answer

If autocomplete or streaming-input workflows ever break because fast-paths short-circuited prematurely, the confidence floor was too low.

4. User physical location​

External signal joining the input-side priors. Goes into Stage 6 (resolver) scoring:

score(candidate) = match Γ— population_prior Γ— regional_prior_from_language Γ— regional_prior_from_user_location

All four factors are soft priors. The TX-IP user typing "Paris" in French input gets a tie between Paris-FR and Paris-TX; structural cues (state abbreviation, postcode shape, etc.) break it. None of the priors hard-filter β€” that would violate the project's stated "possibilities, not constraints" principle (reference/CONTEXT.md).

API shape (optional input):

interface ResolveOpts {
userLocation?:
| { lat: number; lon: number } // precise (GPS, IP-geolocated)
| { country: string } // coarse (CDN region, user setting)
| { region: string; country: string } // medium (US state)
}

Where this lands in the pipeline​

QueryShape is computed once, at the boundary between Stage 1 and Stage 2, and threaded through the rest of the pipeline as additional context. It is not its own stage β€” it is a shared data structure that Stages 2, 2.5, 3 (optional), and 6 all consume.

The packaging decision​

Bundle as a single workspace package:

packages/query-shape/
β”œβ”€β”€ src/
β”‚ β”œβ”€β”€ character-class.ts # token-level + input-level classification
β”‚ β”œβ”€β”€ segmentation.ts # punctuation-grammar-aware chunking
β”‚ β”œβ”€β”€ known-formats.ts # postcode + PO-box regex set
β”‚ β”œβ”€β”€ compute.ts # entry point: (input, locale?) β†’ QueryShape
β”‚ └── types.ts # the QueryShape interface
β”œβ”€β”€ test/
└── package.json

Zero ML. Zero npm dependencies beyond what @mailwoman/core already pulls. Tiny (< 100 KB compiled). Reusable in browser and Node.

Locality soft prior (v0.5.2 extension)​

Added 2026-05-25 after diagnosing demo preset locality/region confusion (DEMO_PRESET_DIAGNOSIS.md).

The existing buildEmissionPriors path applies logit boosts for known-format hits (e.g., postcode regex β†’ B-postcode). The locality soft prior extends this: when an unambiguous 2-letter region abbreviation is detected (e.g., DC, NY, CA), preceding tokens that match a WOF locality entry at the detected locale receive a +2.0 logit boost toward B-locality / I-locality.

New field on QueryShape:

regionAbbreviations?: Array<{ start: number; span: string }>

Detection is a single regex per locale (/,\s*[A-Z]{2}\b/ for en-us), same cost as postcode-format detection. The WOF verification is the safety constraint β€” only bias tokens that are verified localities, not every preceding token.

This passes the bitter-lesson framing test: the regex is locale-bounded, not a gazetteer. The WOF check is a safety rail that prevents over-biasing non-locality tokens like "Pennsylvania" (street) in the same address.

Open questions for implementation​

These are decisions to make when the implementation phase opens; recording so they aren't re-derived from scratch.

  1. Where does QueryShape get computed? Eagerly at parse-entry (always pay the microseconds cost), or lazily on first downstream consumer (some inputs may not need it)? Recommend eagerly β€” it's cheaper to compute than to threaded-lazy-evaluate.
  2. Known-format regex set extensibility? Hardcoded vs. registerable. Recommend hardcoded for v1 (the postcode set rarely changes) and re-evaluate if community adapters want to register country-specific formats.
  3. Confidence threshold for fast-path routing? Pick conservatively (~0.95) and tune from real-world data. Failure mode of too-low is catastrophic (wrong answer instead of full pipeline); failure mode of too-high is just slower inference.
  4. Segment-id as encoder feature β€” additive or concatenated? Concatenated to input embeddings is simplest and avoids architecture change. Decide when the encoder retrain happens.
  5. How does QueryShape interact with LocaleDetector? The detector probably uses QueryShape.characterClass + QueryShape.tokenClasses as input features. Suggests LocaleDetector runs after QueryShape.compute(). Document this dependency.

What this is not​

  • Not the gazetteer. Place-name lookup lives in Stage 6 / Resolver. QueryShape never asks "is this a city name?".
  • Not the kind classifier. The kind classifier (Stage 2.5) consumes QueryShape but is itself a separate component that may or may not use ML.
  • Not a replacement for the encoder. Trivial inputs short-circuit; structured inputs still hit the full pipeline.
  • Not free of locale concerns. Segmentation grammar and the known-format set both grow per locale, but they grow at O(1) per locale rather than O(N place names).

See also​