
Why Japanese addresses break Western parsers

· 5 min read
Teffen Ellis
Sister Software

In Tokyo, the address of Tokyo Tower is 〒105-0011 東京都港区芝公園4-2-8.

In English: "4-2-8 Shibakōen, Minato City, Tokyo 105-0011".

The Japanese form reverses the component order of the English form: it runs from the largest unit to the smallest. The prefecture (都道府県) comes first, then the city or ward (市区町村), then a district with its numbered chome (丁目) and a block-number-style locator. There's no street name, just a grid.

This is why every rule-based address parser written for Western addresses breaks on Japan.

The hierarchy

Who's On First (WOF) ships Japan's admin hierarchy as one repo with 62,896 GeoJSON files. After pulling it into our unified SQLite database, the placetype distribution looks like this:

| Placetype (English) | Japanese | Count |
| --- | --- | --- |
| country | | 1 |
| region (prefecture) | 都道府県 | 47 |
| county (city) | | 2,287 |
| locality (ward/town) | 市区町村 | 43,886 |
| neighbourhood (chome) | 丁目 | 7,736 |

47 prefectures. The whole country. Every chome (city block district) tagged with a name like 1丁目 (1-chome), 2丁目 (2-chome).

Reversed ordering

Western address: [house_number] [street] [unit?], [locality], [region] [postcode].

Japanese address: 〒[postcode]? [region][locality][chome][block]-[sub-block]-[house_number].

The order matters for parsers because we use position as a feature. A model trained on "1600 Pennsylvania Avenue NW, Washington, DC 20500" expects digits at the start, region near the end. A Japanese address inverts this entirely. Walking the parent chain in the WOF database confirms the inversion:

neighbourhood jpn=1丁目 eng=1丁目
locality jpn=世田谷区 eng=Setagaya
county jpn=世田谷区 eng=Setagaya
region jpn=東京 eng=Tokyo
country jpn=日本 eng=Japan

To synthesize a JP address, concatenate the parent chain from the largest unit down, skipping the country and collapsing the duplicated county/locality entry: 東京 + 世田谷区 + 1丁目 → 東京世田谷区1丁目.
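The concatenation can be sketched in a few lines. This is a minimal illustration, not the adapter's actual code: the `ChainNode` shape, the `JP_ORDER` list, and the `synthesizeJpAddress` name are all assumptions made for the example.

```typescript
// Hypothetical shape for one step of a WOF parent chain; field names are assumptions.
interface ChainNode {
  placetype: "neighbourhood" | "locality" | "county" | "region" | "country";
  jpn: string;
}

// Largest-to-smallest order used when writing a Japanese address (country omitted).
const JP_ORDER: ChainNode["placetype"][] = ["region", "county", "locality", "neighbourhood"];

function synthesizeJpAddress(chain: ChainNode[]): string {
  const parts: string[] = [];
  for (const placetype of JP_ORDER) {
    const node = chain.find((n) => n.placetype === placetype);
    // Skip missing levels and consecutive duplicates
    // (county and locality can carry the same name, as with 世田谷区).
    if (node && parts[parts.length - 1] !== node.jpn) {
      parts.push(node.jpn);
    }
  }
  return parts.join("");
}

// The chain from the walk above, listed smallest-first as the database returns it.
const setagaya: ChainNode[] = [
  { placetype: "neighbourhood", jpn: "1丁目" },
  { placetype: "locality", jpn: "世田谷区" },
  { placetype: "county", jpn: "世田谷区" },
  { placetype: "region", jpn: "東京" },
  { placetype: "country", jpn: "日本" },
];

console.log(synthesizeJpAddress(setagaya)); // 東京世田谷区1丁目
```

Ordering by placetype rather than by array position means the function does not care whether the chain arrives smallest-first or largest-first.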

No street names

Western addresses identify locations by street + number. "1600 Pennsylvania Avenue NW" picks a specific building because Pennsylvania Avenue is a known line and 1600 is a known offset along that line.

Japan uses block addressing instead. Read 4-2-8 in 芝公園 as chome 4, block 2, building 8 within the 芝公園 district. There's no "芝公園 street" for the number to sit on; the grid is the addressing primitive, not the line.

Implications for the parser:

  • street_prefix and street_suffix don't apply (no street).
  • house_number becomes a hyphenated triple: 4-2-8.
  • The "丁目" suffix is a categorical marker, not a street type.

For now we map chome to dependent_locality since it's the closest existing tag. A proper JP locale would introduce block and sub_block tags per the schema in core/types/component.ts (declared but unused until JP ships).
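To make the hyphenated triple concrete, here is a rough sketch of how a trailing locator like 4-2-8 could be split into its three parts. The `BlockLocator` shape and the `parseBlockLocator` name are hypothetical, and the sketch assumes the triple is the last thing in the string (no building name after it).

```typescript
// Hypothetical result shape; field names follow the article's reading of 4-2-8.
interface BlockLocator {
  chome: number;
  block: number;
  building: number;
}

// ASCII or full-width digits, separated by a few hyphen variants seen in JP text.
// Assumption: the triple sits at the very end of the address string.
const TRIPLE = /([0-9０-９]+)[-－−‐]([0-9０-９]+)[-－−‐]([0-9０-９]+)$/;

function toAsciiDigits(s: string): string {
  // Full-width digits U+FF10..U+FF19 sit at a fixed offset (0xFEE0) from ASCII 0..9.
  return s.replace(/[０-９]/g, (d) => String.fromCharCode(d.charCodeAt(0) - 0xfee0));
}

function parseBlockLocator(address: string): BlockLocator | null {
  const m = TRIPLE.exec(address.trim());
  if (!m) return null;
  return {
    chome: parseInt(toAsciiDigits(m[1]), 10),
    block: parseInt(toAsciiDigits(m[2]), 10),
    building: parseInt(toAsciiDigits(m[3]), 10),
  };
}

console.log(parseBlockLocator("東京都港区芝公園4-2-8"));
// { chome: 4, block: 2, building: 8 }
```

A regex like this is only a stopgap for inspecting data; the point of the article is that the model, not a rule, should learn these spans.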

Prefix postcode

Japanese addresses prefix the postcode with 〒, the postal mark. Format: 〒NNN-NNNN. Examples:

  • 〒100-0005 — Tokyo Marunouchi
  • 〒530-0001 — Osaka Umeda
  • 〒810-0001 — Fukuoka Tenjin

A parser needs to read 〒 as a categorical marker: the postal mark that flags the following 7 digits + dash as a postcode. SentencePiece tokenizes 〒 as a separate piece. Our new v0.6.0-a0 multi-script tokenizer handles this cleanly (0% byte-fallback on the character).
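As a point of comparison, the 〒NNN-NNNN format is simple enough to detect with a regular expression. The sketch below is illustrative only; `splitPostcode` and `PostcodeSplit` are made-up names, and it assumes ASCII digits and a leading postal mark.

```typescript
// 〒 (U+3012), optional whitespace, three digits, a hyphen, four digits.
const JP_POSTCODE = /^〒\s*(\d{3})-(\d{4})/;

interface PostcodeSplit {
  postcode: string | null;
  rest: string;
}

function splitPostcode(raw: string): PostcodeSplit {
  const m = JP_POSTCODE.exec(raw);
  // No postal mark at the start: return the input untouched.
  if (!m) return { postcode: null, rest: raw };
  return {
    postcode: `${m[1]}-${m[2]}`,
    // Everything after the postcode, with surrounding whitespace removed.
    rest: raw.slice(m[0].length).trim(),
  };
}

console.log(splitPostcode("〒105-0011 東京都港区芝公園4-2-8"));
// { postcode: "105-0011", rest: "東京都港区芝公園4-2-8" }
```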

What we shipped today

The wof-admin-jp adapter prototype walks the WOF parent chain for every 丁目 in the Japanese repo and synthesizes a training row. Output:

{
  "raw": "東京港区芝公園",
  "components": {
    "region": "東京",
    "locality": "港区",
    "dependent_locality": "芝公園",
    "country": "JP"
  }
}
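One sanity check that fits rows of this shape: the component values, read in address order, should reassemble the raw string exactly. The sketch below assumes that invariant holds for synthesized rows; `TrainingRow`, `JP_TAG_ORDER`, and `rawMatchesComponents` are names invented for the example, not part of the adapter.

```typescript
// Assumed row shape, mirroring the JSON above.
interface TrainingRow {
  raw: string;
  components: { [tag: string]: string };
}

// Tags in the order their values are expected to appear in a JP raw string (an assumption).
// "country" is deliberately absent: it is metadata and never appears in raw.
const JP_TAG_ORDER = ["region", "locality", "dependent_locality"];

function rawMatchesComponents(row: TrainingRow): boolean {
  let cursor = 0;
  for (const tag of JP_TAG_ORDER) {
    const value = row.components[tag];
    if (value === undefined) continue;
    // Each value must start exactly where the previous one ended.
    if (!row.raw.startsWith(value, cursor)) return false;
    cursor += value.length;
  }
  // Nothing in raw may be left unaccounted for.
  return cursor === row.raw.length;
}

const row: TrainingRow = {
  raw: "東京港区芝公園",
  components: { region: "東京", locality: "港区", dependent_locality: "芝公園", country: "JP" },
};

console.log(rawMatchesComponents(row)); // true
```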

6,373 rows from 47 prefectures and 269 localities — that's training data we didn't have yesterday. Top prefectures by row count:

| Prefecture | Rows |
| --- | --- |
| 東京 (Tokyo) | 2,251 |
| 神奈川 (Kanagawa) | 888 |
| 大阪 (Osaka) | 460 |
| 千葉 (Chiba) | 380 |
| 埼玉 (Saitama) | 263 |

Tokyo dominates because of its density of named neighborhoods — every chome of every ward is tagged. Smaller prefectures have fewer registered neighborhoods.

What's still missing

Real JP addresses include house numbers (4-2-8), which WOF doesn't track. To complete a Stage 3 JP corpus we need a separate source: the MLIT national address database or the Japan Post postcode CSVs. Both are public.

Once those land, the JP corpus becomes a 100K+ row source with full Stage 3 + Phase 6 tags (block, sub_block, house_number). v0.6.0 trains on US/FR. v0.7.0 could ship JP if the data pipeline holds.

Schema readiness

The infrastructure is already in place. core/types/component.ts declares JP-specific Phase 6 tags:

// JP-specific (Phase 6 — declared but unused until then)
"prefecture",
"municipality",
"district",
"block",
"sub_block",
"building_number",
"building_name",

The schema, formatting, runtime pipeline, and now the corpus prototype are ready. The blockers are: (1) the missing house-number data source, and (2) training time on a JP-aware recipe.
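For readers unfamiliar with the pattern, a tag list like the one quoted from core/types/component.ts can double as a type. The sketch below shows one common TypeScript idiom for this; it is an assumption about style, and the real file may declare things differently. `JP_PHASE6_TAGS`, `JpPhase6Tag`, and `isJpPhase6Tag` are hypothetical names.

```typescript
// Assumed declaration style; the actual core/types/component.ts may differ.
const JP_PHASE6_TAGS = [
  "prefecture",
  "municipality",
  "district",
  "block",
  "sub_block",
  "building_number",
  "building_name",
] as const;

// Union of the string literals above: "prefecture" | "municipality" | ...
type JpPhase6Tag = (typeof JP_PHASE6_TAGS)[number];

// Runtime guard so untyped strings (e.g. from a corpus file) can be narrowed safely.
function isJpPhase6Tag(tag: string): tag is JpPhase6Tag {
  return (JP_PHASE6_TAGS as readonly string[]).indexOf(tag) !== -1;
}

console.log(isJpPhase6Tag("sub_block")); // true
console.log(isJpPhase6Tag("street_suffix")); // false
```

Declaring the tags once and deriving the type from them keeps the runtime list and the compile-time union from drifting apart.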

Where rules fail and learning wins

Every address parser written for Western input fails on Japan in a specific, predictable way: it parses the prefecture as a country, then runs out of tokens. The locality and chome get lumped into a single span. The block-number triple gets parsed as a postcode or dropped entirely.

Mailwoman's transformer architecture is locale-agnostic at the BIO level. The same model can learn region → locality → chome ordering if it sees enough examples. The 6,373 rows we generated today are the first batch.
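To show what "locale-agnostic at the BIO level" means in practice, here is a small sketch that turns a raw string plus ordered component spans into BIO labels. It labels one UTF-16 code unit at a time for simplicity, which is an assumption made for the example: the real model labels tokenizer pieces, and the `Span` type and `bioLabels` function are illustrative names only.

```typescript
// A component value with its tag, listed in the order it appears in the raw string.
type Span = { tag: string; value: string };

function bioLabels(raw: string, spans: Span[]): string[] {
  // Start with every position outside any component.
  const labels: string[] = new Array<string>(raw.length).fill("O");
  let cursor = 0;
  for (const { tag, value } of spans) {
    if (value.length === 0) continue;
    const start = raw.indexOf(value, cursor);
    if (start === -1) continue; // span not found: leave those positions as O
    labels[start] = `B-${tag}`; // first position of the span
    for (let i = start + 1; i < start + value.length; i++) {
      labels[i] = `I-${tag}`; // remaining positions of the span
    }
    cursor = start + value.length;
  }
  return labels;
}

console.log(
  bioLabels("東京港区芝公園", [
    { tag: "region", value: "東京" },
    { tag: "locality", value: "港区" },
    { tag: "dependent_locality", value: "芝公園" },
  ]),
);
// ["B-region", "I-region", "B-locality", "I-locality",
//  "B-dependent_locality", "I-dependent_locality", "I-dependent_locality"]
```

The same function labels a Western string without changes; only the order in which tags appear differs, and that order is what the model has to learn from examples.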