Wikipedia Importance Scores
Place importance scores derived from Wikipedia link count, replacing raw population as the FST emission prior weight. Shipped in #173.
Source
Nominatim's wikimedia-importance.csv.gz: 19M rows mapping Wikidata IDs to importance scores. The score is log(total_links) / log(max_links), where total_links = internal links + cross-language links to the Wikipedia article. Scores are normalized to [0, 1], with the United States article (5.2M links) at 1.0.
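The normalization above can be sketched as a small function. This is an illustration of the stated formula only, not Nominatim's actual code; the name `wikipediaImportance`, the constant `MAX_LINKS`, and the handling of articles with one link or fewer are assumptions made for the example.

```typescript
// Approximate link count of the top-ranked article (United States), as quoted above.
const MAX_LINKS = 5_200_000;

// Sketch of the score: log(total_links) / log(max_links).
// Hypothetical helper; the upstream dataset ships precomputed scores.
function wikipediaImportance(totalLinks: number, maxLinks: number = MAX_LINKS): number {
  // Assumption: articles with 0 or 1 links get the floor score, since log(1) = 0.
  if (totalLinks <= 1) return 0;
  const score = Math.log(totalLinks) / Math.log(maxLinks);
  // Clamp in case totalLinks exceeds the assumed maximum.
  return Math.min(1, Math.max(0, score));
}

console.log(wikipediaImportance(1_000).toFixed(3));
console.log(wikipediaImportance(MAX_LINKS));
```

Because the scale is logarithmic, an article with a thousand links already lands near the middle of the range rather than near zero.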
ETL pipeline
wikimedia-importance.csv.gz (19M rows, ~250 MB compressed)
    │
    ▼
scripts/build-importance.ts
    │
    ├─ Load WOF concordances (other_source='wd:id') → Set of needed Wikidata IDs
    ├─ Stream-decompress TSV, filter to matching IDs only
    ├─ Collapse duplicates by MAX(importance) per Wikidata ID
    ├─ JOIN: wof_id → concordance wikidata_id → importance score
    ├─ Write place_importance(id, importance) table
    └─ Population fallback: min(1.0, log2(1+pop/1000)/14) for places without Wikidata
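Two of the steps above, the MAX collapse and the population fallback, are easy to show in isolation. The following is a minimal sketch under stated assumptions: the `ImportanceRow` shape and the function names are hypothetical, and the real script streams rows from the gzip file rather than taking an in-memory array.

```typescript
// Hypothetical row shape; the real TSV columns may differ.
interface ImportanceRow {
  wikidataId: string;
  importance: number;
}

// Keep the highest score per Wikidata ID, skipping IDs that have no WOF concordance.
function collapseByMax(rows: ImportanceRow[], needed: Set<string>): Map<string, number> {
  const best = new Map<string, number>();
  for (const row of rows) {
    if (!needed.has(row.wikidataId)) continue;
    const prev = best.get(row.wikidataId);
    if (prev === undefined || row.importance > prev) {
      best.set(row.wikidataId, row.importance);
    }
  }
  return best;
}

// Population fallback from the pipeline: min(1.0, log2(1 + pop/1000) / 14).
function populationFallback(population: number): number {
  return Math.min(1.0, Math.log2(1 + Math.max(0, population) / 1000) / 14);
}
```

With this formula the fallback saturates at 1.0 once a place reaches roughly 16.4 million people, since log2(1 + pop/1000) = 14 when pop = (2^14 - 1) × 1000.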
Coverage (US admin)
| Source | Places |
|---|---|
| Wikipedia importance (via Wikidata concordance) | 47,348 |
| Population fallback | 108,111 |
| Total in place_importance | 155,459 |
How scores flow into the FST
1. build-importance.ts writes the place_importance table into the WOF SQLite database
2. fst-builder.ts reads place_importance (falls back to place_population → pseudo-importance)
3. PlaceEntry.importance carries the score through serialization (Float32 in the binary FST)
4. fst-prior.ts computes bias: importance × biasScale × maxBias (linear, capped at 3.0 logits)
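The last two steps can be illustrated with a short sketch. It is not the code in fst-prior.ts: the function names, the default `biasScale` of 1.0, and the assumption that `maxBias` equals the 3.0-logit cap are all made up for the example.

```typescript
// Assumption: the cap and the default maxBias are the same 3.0 logits mentioned above.
const BIAS_CAP_LOGITS = 3.0;

// Linear bias: importance × biasScale × maxBias, capped at BIAS_CAP_LOGITS.
function importanceBias(
  importance: number,
  biasScale: number = 1.0,
  maxBias: number = BIAS_CAP_LOGITS,
): number {
  return Math.min(importance * biasScale * maxBias, BIAS_CAP_LOGITS);
}

// Simulates the Float32 round trip a score takes through the binary FST.
function toFloat32(score: number): number {
  return Math.fround(score);
}

const stored = toFloat32(0.815);
console.log(stored, importanceBias(stored));
```

Storing scores as Float32 loses a little precision (less than 1e-7 for values in [0, 1]), which is negligible next to a bias measured in whole logits.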
Why not population
| Signal | Washington DC | Washington state | Winner |
|---|---|---|---|
| Population | 678K | 7.6M | State (wrong for bare "Washington") |
| Wikipedia importance | 0.815 | 0.764 | DC (correct: more culturally prominent) |
Population is an administrative headcount. Wikipedia importance captures actual cultural prominence: how often the place is referenced, linked to, and discussed across languages.
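The comparison can be reproduced with the figures from the table. This sketch maps population onto [0, 1] using the pipeline's fallback formula so that both signals share a scale; the `Candidate` type and helper names are hypothetical.

```typescript
interface Candidate {
  name: string;
  population: number;
  wikipediaImportance: number;
}

// Figures taken from the comparison table above.
const washingtons: Candidate[] = [
  { name: "Washington DC", population: 678_000, wikipediaImportance: 0.815 },
  { name: "Washington state", population: 7_600_000, wikipediaImportance: 0.764 },
];

// Population fallback formula: min(1.0, log2(1 + pop/1000) / 14).
function pseudoImportance(population: number): number {
  return Math.min(1.0, Math.log2(1 + population / 1000) / 14);
}

// Name of the highest-scoring candidate; assumes a non-empty list.
function topCandidate(candidates: Candidate[], score: (c: Candidate) => number): string {
  return candidates.reduce((best, c) => (score(c) > score(best) ? c : best)).name;
}

const byPopulation = topCandidate(washingtons, (c) => pseudoImportance(c.population));
const byWikipedia = topCandidate(washingtons, (c) => c.wikipediaImportance);
console.log({ byPopulation, byWikipedia });
```

Ranking by population picks the state, while ranking by Wikipedia importance picks DC, matching the table.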
Regeneration
Run after any WOF data refresh:
node scripts/build-importance.js --db /path/to/wof-unified.db [--tsv /path/to/wikimedia-importance.csv.gz]
The TSV is cached at /tmp/wikimedia-importance.csv.gz after first download. Pass --tsv to skip the download.
See also
- FST Gazetteer LM: the FST architecture this feeds into
- Nominatim importance docs: upstream methodology