
Wikipedia Importance Scores

Place importance scores derived from Wikipedia link counts, replacing raw population as the FST emission prior weight. Shipped in #173.

Source

Nominatim's wikimedia-importance.csv.gz: 19M rows mapping Wikidata IDs to importance scores. The score is log(total_links) / log(max_links), where total_links is the number of internal links plus cross-language links to the Wikipedia article. Scores are normalized to [0, 1], with the United States article (5.2M links) at 1.0.
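The normalization above can be sketched as a small function. This is only an illustration of the stated formula: the names `importanceFromLinks` and `MAX_LINKS` are hypothetical, and the handling of articles with zero or one link is an assumption, not something taken from Nominatim.

```typescript
// Approximate link count of the top-ranked article, as quoted in the text.
const MAX_LINKS = 5.2e6;

// Hypothetical helper: log(total_links) / log(max_links), clamped to [0, 1].
function importanceFromLinks(totalLinks: number, maxLinks: number = MAX_LINKS): number {
  // Assumption: 0 or 1 links map to 0 (log(1) = 0, log(0) is undefined).
  if (totalLinks <= 1) return 0;
  const score = Math.log(totalLinks) / Math.log(maxLinks);
  // Clamp in case totalLinks exceeds the assumed maximum.
  return Math.min(1, Math.max(0, score));
}

console.log(importanceFromLinks(5.2e6)); // 1
console.log(importanceFromLinks(1000).toFixed(3));
```

Because the ratio is taken on a log scale, an article with a thousand links still lands near the middle of the range rather than near zero.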

ETL pipeline

wikimedia-importance.csv.gz (19M rows, ~250 MB compressed)
│
▼
scripts/build-importance.ts
│
├─ Load WOF concordances (other_source='wd:id') → Set of needed Wikidata IDs
├─ Stream-decompress TSV, filter to matching IDs only
├─ Collapse duplicates by MAX(importance) per Wikidata ID
├─ JOIN: wof_id → concordance wikidata_id → importance score
├─ Write place_importance(id, importance) table
└─ Population fallback: min(1.0, log2(1+pop/1000)/14) for places without Wikidata
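
The filter, collapse, join, and fallback steps in the diagram can be sketched in memory as follows. This is a minimal sketch, not the contents of build-importance.ts: the row shapes (`ImportanceRow`, `WofPlace`), the function names, and the sample IDs are all hypothetical, and it assumes the fallback also applies when a place has a Wikidata ID that is missing from the importance file.

```typescript
// Hypothetical row shapes; the real script's types are not shown in this document.
interface ImportanceRow { wikidataId: string; importance: number; }
interface WofPlace { wofId: number; wikidataId?: string; population?: number; }

// Keep only needed IDs and collapse duplicates by MAX(importance) per Wikidata ID.
function collapseByMax(rows: ImportanceRow[], needed: Set<string>): Map<string, number> {
  const best = new Map<string, number>();
  for (const { wikidataId, importance } of rows) {
    if (!needed.has(wikidataId)) continue;
    const prev = best.get(wikidataId);
    if (prev === undefined || importance > prev) best.set(wikidataId, importance);
  }
  return best;
}

// Population fallback from the diagram: min(1.0, log2(1 + pop/1000) / 14).
function populationFallback(population: number): number {
  return Math.min(1.0, Math.log2(1 + Math.max(0, population) / 1000) / 14);
}

// Join each place to its Wikipedia score, else fall back to population.
function buildPlaceImportance(places: WofPlace[], rows: ImportanceRow[]): Map<number, number> {
  const needed = new Set<string>();
  for (const p of places) if (p.wikidataId) needed.add(p.wikidataId);
  const scores = collapseByMax(rows, needed);
  const out = new Map<number, number>();
  for (const p of places) {
    const wiki = p.wikidataId ? scores.get(p.wikidataId) : undefined;
    if (wiki !== undefined) out.set(p.wofId, wiki);
    else if (p.population !== undefined) out.set(p.wofId, populationFallback(p.population));
  }
  return out;
}

// Made-up sample data, for illustration only.
const table = buildPlaceImportance(
  [{ wofId: 1, wikidataId: "Q1" }, { wofId: 2, population: 1000 }],
  [
    { wikidataId: "Q1", importance: 0.4 },
    { wikidataId: "Q1", importance: 0.6 }, // duplicate: the larger score wins
    { wikidataId: "Q9", importance: 0.9 }, // not needed: filtered out
  ],
);
console.log(table.get(1)); // 0.6
```

With the divisor of 14, the fallback reaches 1.0 only once 1 + pop/1000 reaches 2^14, that is, at a population of roughly 16.4 million.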

Coverage (US admin)

Source                                            Places
Wikipedia importance (via Wikidata concordance)   47,348
Population fallback                               108,111
Total in place_importance                         155,459

How scores flow into the FST

  1. build-importance.ts writes place_importance table into the WOF SQLite
  2. fst-builder.ts reads place_importance (falls back to place_population → pseudo-importance)
  3. PlaceEntry.importance carries the score through serialization (Float32 in the binary FST)
  4. fst-prior.ts computes bias: importance × biasScale × maxBias (linear, capped at 3.0 logits)
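
Steps 3 and 4 can be illustrated with a short sketch. It is not the code in fst-prior.ts: the function names are hypothetical, the default `biasScale` of 1.0 is an assumption, and the 3.0-logit cap is read here as the value of `maxBias`.

```typescript
// Assumption: the "capped at 3.0 logits" limit is the value of maxBias.
const MAX_BIAS = 3.0;

// Simulates the Float32 round trip that the score takes through the binary FST.
function roundTripFloat32(importance: number): number {
  return Math.fround(importance);
}

// Linear bias: importance × biasScale × maxBias, never exceeding maxBias.
function importanceBias(importance: number, biasScale: number = 1.0, maxBias: number = MAX_BIAS): number {
  const clamped = Math.min(1, Math.max(0, importance));
  return Math.min(maxBias, clamped * biasScale * maxBias);
}

console.log(importanceBias(0.5)); // 1.5
console.log(importanceBias(0.9, 2)); // 3 (capped)
console.log(importanceBias(roundTripFloat32(0.815)));
```

Float32 keeps roughly seven significant decimal digits, so the stored score differs from the original only far below the resolution that matters for a bias of at most 3 logits.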

Why not population

Signal                 Washington DC   Washington state   Winner
Population             678K            7.6M               State (wrong for bare "Washington")
Wikipedia importance   0.815           0.764              DC (correct: more culturally prominent)

Population is an administrative headcount. Wikipedia importance captures actual cultural prominence: how often the place is referenced, linked to, and discussed across languages.
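
The comparison in the table can be reproduced with a few lines. This is a sketch using only the figures quoted above; the `Candidate` shape and `pickBy` helper are hypothetical and stand in for whatever ranking the resolver actually performs.

```typescript
// Hypothetical candidate shape carrying both signals from the table.
interface Candidate { name: string; population: number; importance: number; }

const candidates: Candidate[] = [
  { name: "Washington DC", population: 678000, importance: 0.815 },
  { name: "Washington state", population: 7600000, importance: 0.764 },
];

// Return the candidate with the highest value of the chosen signal.
function pickBy(items: Candidate[], score: (c: Candidate) => number): Candidate {
  return items.reduce((best, c) => (score(c) > score(best) ? c : best));
}

const byPopulation = pickBy(candidates, (c) => c.population);
const byImportance = pickBy(candidates, (c) => c.importance);

console.log(byPopulation.name); // Washington state
console.log(byImportance.name); // Washington DC
```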

Regeneration

Run after any WOF data refresh:

node scripts/build-importance.js --db /path/to/wof-unified.db [--tsv /path/to/wikimedia-importance.csv.gz]

The TSV is cached at /tmp/wikimedia-importance.csv.gz after first download. Pass --tsv to skip the download.
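
The caching behaviour described above might look like the following decision function. It is a sketch under assumptions: `planTsvSource` and `TsvPlan` are hypothetical names, and it assumes an existing cache file is reused without a new download. Only the --tsv flag and the cache path come from the text.

```typescript
// Cache location stated in the text.
const CACHE_PATH = "/tmp/wikimedia-importance.csv.gz";

// Hypothetical result type: which file to read and whether to download it first.
interface TsvPlan { path: string; download: boolean; }

function planTsvSource(argv: string[], cacheExists: boolean): TsvPlan {
  const i = argv.indexOf("--tsv");
  // An explicit --tsv path always skips the download.
  if (i !== -1 && i + 1 < argv.length) {
    return { path: argv[i + 1], download: false };
  }
  // Assumption: otherwise use the cache and download only if it is missing.
  return { path: CACHE_PATH, download: !cacheExists };
}

console.log(planTsvSource(["--db", "wof-unified.db"], false));
console.log(planTsvSource(["--db", "wof-unified.db", "--tsv", "/data/importance.csv.gz"], false));
```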

See also