Address Data Sources for NER Training

Catalog of public-domain and openly licensed data sources containing facility names + addresses, intended as raw material for training a token-classification (BIO-tagging) postal address parser. Companion to federal/aggregator sources (OpenAddresses, OSM, NAD, TIGER, NPI, IRS BMF, etc.); this list focuses on state-level sources that add component-level variety, residential coverage, and venue-name-paired records the federal sources lack.

Compiled with web-search verification of canonical agency URLs. Some per-state entries in the Regulated Facilities section are based on known agency patterns rather than direct URL verification; those are noted inline.

Why state sources matter for an address NER model

  1. Component supervision is automatic. State directories are typically already field-structured (separate columns for street, city, postcode, etc.). BIO tags derive from which column each token came from, so no manual labeling is required.
  2. Format diversity per-state. 50 states + DC means 50+ schema flavors, abbreviation conventions, and transcription styles. Excellent for training format robustness.
  3. Venue+address pairs are everywhere. Every facility/professional/establishment listing pairs an organization name with a structured address: natural B-venue / I-venue training spans.
  4. Residential coverage. Real estate, contractor, cosmetology, notary, and license-lookup tools surface actual home/office addresses that federal facility datasets don't.
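As a minimal sketch of the column-derived labeling described in point 1, the snippet below tags each whitespace token with the column it came from. The column names, the label names other than venue, and the sample record are all hypothetical; real state schemas differ and would need their own mapping.

```python
from typing import Dict, List, Tuple

# Hypothetical column -> label mapping; every state schema needs its own.
COLUMN_LABELS = {
    "name": "venue",
    "house_number": "house_number",
    "street": "street",
    "city": "city",
    "state": "state",
    "postcode": "postcode",
}


def bio_tag_record(record: Dict[str, str], order: List[str]) -> List[Tuple[str, str]]:
    """Tag each whitespace token with B-/I- plus the label of its source column."""
    tagged = []
    for column in order:
        value = (record.get(column) or "").strip()
        label = COLUMN_LABELS[column]
        for i, token in enumerate(value.split()):
            prefix = "B" if i == 0 else "I"  # first token opens the span
            tagged.append((token, f"{prefix}-{label}"))
    return tagged


# Illustrative record, not taken from any real directory.
record = {
    "name": "Riverside Clinic",
    "house_number": "120",
    "street": "N Main St",
    "city": "Springfield",
    "state": "IL",
    "postcode": "62701",
}
tokens = bio_tag_record(
    record, ["name", "house_number", "street", "city", "state", "postcode"]
)
```

Varying the field order passed to bio_tag_record is one cheap way to generate several surface layouts from the same structured record.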

Formats to watch for beyond CSV

Many state agencies distribute address data in geospatial formats. The DBF attribute table inside a shapefile is essentially a CSV with geometry attached; extract with ogr2ogr -f CSV and the records are training-ready. Common formats:

  • Shapefile (.shp/.dbf/.shx/.prj): most common state GIS distribution
  • GeoJSON: increasingly standard on ArcGIS Hub portals
  • GeoPackage (.gpkg): modern SQLite-based, multi-layer
  • File Geodatabase (.gdb): ESRI native, common from state GIS offices
  • KML / KMZ: Google Earth, sometimes used for facility points
  • FlatGeobuf: emerging streaming format (used by OpenAddresses)
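The ogr2ogr extraction mentioned above can be wrapped in a few lines. This is a sketch that assumes GDAL's ogr2ogr is installed and on PATH; the file names are hypothetical, and multi-layer sources (GeoPackage, File Geodatabase) may additionally need a layer name appended to the command.

```python
import subprocess
from typing import List


def ogr2ogr_csv_command(source: str, dest_csv: str) -> List[str]:
    """Build the argv for `ogr2ogr -f CSV <dest> <source>` (destination comes first)."""
    return ["ogr2ogr", "-f", "CSV", dest_csv, source]


def convert_to_csv(source: str, dest_csv: str) -> None:
    """Run the conversion; raises CalledProcessError if ogr2ogr exits non-zero."""
    subprocess.run(ogr2ogr_csv_command(source, dest_csv), check=True)


# Hypothetical file names, for illustration only.
cmd = ogr2ogr_csv_command("address_points.shp", "address_points.csv")
```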

The highest-value geospatial layers for address NER:

  • Site address points / E911 / NG911 address points: every addressable location as a point with full component breakdown (housenumber, prefix, street, suffix, unit, city, ZIP, parcel ID). Published by counties and increasingly by states via ArcGIS Hub. The strongest single layer type for component-supervised training.
  • Parcel polygons with situs + owner addresses: two addresses per record (property situs + owner mailing), often diverging structurally (P.O. boxes for owners, street addresses for situs). State-aggregated in MA, NC, MN, UT, MT, VT, RI, and a handful of others.
  • Road centerlines with address ranges: state-current analogue to TIGER; each segment has from/to house numbers per side.
  • Critical infrastructure / facilities layers: state equivalents of HIFLD covering hospitals, schools, fire/police, jails, courts, public buildings. Each is a B-venue + address pair.
  • EPA/state-DEP regulated sites โ€” Superfund, brownfields, RCRA, NPDES, leaking USTs as shapefile/GeoJSON.
  • NCES, NREL AFDC, FCC ASR/ULS โ€” federal but distributed as shapefile in addition to CSV.

Most state ArcGIS Hubs publish data under CC-BY or equivalent open terms; check per-dataset.

Licensing reality

State records aren't "public domain" in the IP sense, but factual data (name + address) isn't copyrightable under Feist v. Rural Telephone. A few caveats:

  • Washington OSPI school directory: state law forbids commercial use of directory data.
  • OSM (ODbL): share-alike obligations may extend to models trained on it; unsettled, but a risk if the corpus is republished.
  • CC-BY state ArcGIS Hub data: attribution required if redistributing the corpus.
  • G-NAF, CC-BY-4.0: clean for AU coverage.

For maximum licensing safety on published model weights, restrict to US-federal sources (NCES, NAD, TIGER, NPPES, IRS BMF) + state sources confirmed as public records with no use restrictions. Lose coverage, gain unambiguous redistribution rights.


Gazetteer / resolver coordinate sources (companion layer)

Everything below this section is corpus material: address strings and name/address pairs to train the BIO parser. This section is the other half: the gazetteer the resolver and the postcode anchor consume, i.e. the datasets that turn a parsed span into a place + coordinate. Different shopping list, and it's where WOF's gaps actually bite (see resolver-wof-sqlite/POSTCODE-ANCHOR.md and the 2026-06-03 postcode-anchor postmortem).

WOF is an excellent admin gazetteer (the in-DB hierarchy + ancestors table is our Elasticsearch-free differentiator), and it's CC0, which is half of why it came naturally. But it was never complete for every token type: no street node (neighbourhood → address directly), thin and uneven postcode geometry outside a few countries (US/NL carry own coords; DE ~66% placeable via ancestor-borrow; ES/IT orphan-heavy and effectively unplaceable), and limited POI/venue coverage. The canonical fix is the rest of the Pelias four-layer stack, WOF (admin) + OpenAddresses (addresses/postcodes) + OpenStreetMap (streets/POIs) + GeoNames (places/postcodes), supplemented per gap.

By gap layer

  • Postcode → centroid. GeoNames postal (download.geonames.org/export/zip, CC-BY 4.0, ~80+ countries) is a ready-made postcode → place + admin + lat/lon table, the cheapest fill for the DE/ES/IT gap with no point-cloud math (coarser than OA in places, often locality-level). OpenAddresses point aggregation (median centroid per postcode) is higher-fidelity where we have the points. National authorities are the gold standard but fragmented: UK Code-Point Open (OS, OGL), France BAN, NL via PDOK.
  • Street / rooftop coords. OpenAddresses (primary), OSM addr:* on buildings (great in DE/NL, patchy elsewhere), national registries: France BAN, Australia G-NAF (CC-BY; backlog #31).
  • POIs / landmarks / salience (the generalized-anchor and exotic-POI layer). OSM is the richest open POI source. Wikidata / Wikipedia is the right home for the data-driven salience signal the anchor guardrail demands (notability + coordinates + multilingual names), and we already pull wikimedia-importance.csv in build-importance, so the path has precedent.
  • Admin (the spine). Keep WOF. Alternatives if it is ever reconsidered: GeoNames admin, Overture Divisions, geoBoundaries.
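The "median centroid per postcode" aggregation mentioned in the first bullet could look roughly like the sketch below. It assumes the address points are already reduced to (postcode, lat, lon) tuples and takes a component-wise median, which is not a true geometric median and would misbehave for postcodes straddling the antimeridian. The sample coordinates are made up.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def postcode_centroids(
    points: Iterable[Tuple[str, float, float]]
) -> Dict[str, Tuple[float, float]]:
    """Component-wise median of lat and lon for each postcode string."""
    lats: Dict[str, List[float]] = defaultdict(list)
    lons: Dict[str, List[float]] = defaultdict(list)
    for postcode, lat, lon in points:
        key = postcode.strip()
        if not key:
            continue  # skip points with no postcode
        lats[key].append(lat)
        lons[key].append(lon)
    return {pc: (median(lats[pc]), median(lons[pc])) for pc in lats}


# Made-up points; the third one is a deliberate outlier.
points = [
    ("10115", 52.530, 13.380),
    ("10115", 52.532, 13.384),
    ("10115", 52.900, 13.900),
    ("20095", 53.550, 10.000),
]
centroids = postcode_centroids(points)
```

The median is used rather than the mean so that a few mis-geocoded points do not drag the centroid away from the bulk of the postcode.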

Probe: GeoNames vs the WOF postcode gap (2026-06-03)

Measured GeoNames postal against the WOF shard's actual gap (the postcodes WOF carries as membership but cannot place). GeoNames carries coordinates for 100% of its DE/ES/IT records:

locale | WOF placement | GeoNames closes | result
IT | 0% (orphans + wrong links) | 4,447 / 4,936 (90%) | and it fixes the bad links: Milan 20121 → 45.46, 9.19 (correct), where WOF pointed at a Liguria village
DE | 66% (ancestor-borrow) | 4,819 / 10,061 unplaced (~48%) | would lift DE to ~82%
ES | not built (orphan-heavy) | ~full (11,150 codes, all with coords) | the clean way to add ES at all
NL | 100% (own PC6 coords) | no help: GeoNames is PC4-level (4,086), coarser than WOF's PC6 | keep WOF

So GeoNames postal is the cheap, CC-BY fill for ES/IT (and roughly half of DE's remaining gap), and it corrects WOF's Italian mis-links as a side effect. Integration stays clean under the discipline above: match on the postcode string, keep the WOF id, write the GeoNames lat/lon as the centroid. (Overture was not probed: for postcodes it folds in OpenAddresses, which we already use, so its differential value is the POI layer; that probe wants a DuckDB + S3-parquet setup and is the right next step for the generalized-anchor work, not for postcodes.)
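A rough sketch of that integration rule follows. It assumes the GeoNames postal dump layout (tab-separated, no header, latitude and longitude in the tenth and eleventh columns) and a hypothetical in-memory shape for the WOF postcode records; the wof_id values are placeholders, and the real build reads from the WOF SQLite shard instead. This version only fills missing centroids; correcting known-bad links would need an explicit override list.

```python
import csv
from typing import Dict, Iterable, List, Tuple

# Assumed GeoNames postal columns: 0 country code, 1 postal code, 2 place name,
# 3-8 admin names/codes, 9 latitude, 10 longitude, 11 accuracy.
LAT_COL, LON_COL = 9, 10


def load_geonames_postal(lines: Iterable[str]) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Map (country, postcode) -> (lat, lon), keeping the first row per postcode."""
    coords: Dict[Tuple[str, str], Tuple[float, float]] = {}
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) <= LON_COL or not row[LAT_COL] or not row[LON_COL]:
            continue  # skip malformed or coordinate-less rows
        coords.setdefault((row[0], row[1]), (float(row[LAT_COL]), float(row[LON_COL])))
    return coords


def fill_centroids(wof_postcodes: List[dict], geonames: dict) -> List[dict]:
    """Match on the postcode string, keep the WOF id, write lat/lon as the centroid."""
    filled = []
    for rec in wof_postcodes:
        out = dict(rec)  # never mutate the input; wof_id is carried over untouched
        if out.get("centroid") is None:
            hit = geonames.get((out["country"], out["postcode"]))
            if hit is not None:
                out["centroid"] = hit
                out["centroid_source"] = "geonames"  # provenance for attribution
        filled.append(out)
    return filled


# One sample row in the assumed layout; admin fields left empty on purpose.
sample = "\t".join(["IT", "20121", "Milano", "", "", "", "", "", "", "45.46", "9.19", ""])
wof = [
    {"wof_id": 1, "country": "IT", "postcode": "20121", "centroid": None},
    {"wof_id": 2, "country": "NL", "postcode": "1012AB", "centroid": (52.37, 4.89)},
]
filled = fill_centroids(wof, load_geonames_postal([sample]))
```

Recording centroid_source alongside the coordinate keeps the CC-BY attribution traceable per record.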

The modern entrant: Overture Maps

The Meta/Microsoft/AWS/TomTom consortium (2023+, post-dates the corpus catalog below). Its Places theme is a large open POI dataset and its Addresses theme folds in OpenAddresses, so it covers two of our three gaps in one source, with stable GERS entity IDs. Distributed as cloud parquet (DuckDB/S3). The one to probe before committing to per-source OSM imports.

Licensing gradient (the real selection axis)

For an AGPL product, license terms decide more than coverage does:

Source | License | Friction for a shipped DB
WOF | CC0 | none (public domain)
OpenAddresses | per-source, mostly open | per-source attribution tracking (we already do this)
GeoNames | CC-BY 4.0 | attribution
Overture | CDLA-Permissive 2.0 | attribution
OpenStreetMap | ODbL | attribution + share-alike on derived databases: the spicy one; can pull copyleft on

So the gradient argues for GeoNames + Overture + OpenAddresses as the supplements, and treating OSM carefully (derived signals, not the shipped DB).

Integrity discipline

Attach supplemental data as attributes on WOF-keyed entities (a centroid here, a popularity score there), not as imported foreign-id entities. WOF stays the spine and the eval keys; the coordinates and the long tail come from elsewhere. Mixing in Overture's GERS ids (or any parallel id space) as primary keys is what would quietly break the WOF-id-keyed resolver evals, which is the reason behind the "extend the custom WOF build, never a prebuilt dump" rule. OA centroid aggregation passes this test because the postcode keeps its WOF id; only its coordinate comes from OA.


Education

Federal aggregators

State school directories

Libraries

Federal aggregator

State public library directories


Licensed professionals

State-level public databases of licensed professionals (with practice/office addresses) suitable as training data for a postal address parsing NER model. All URLs verified via web search. Most state license-lookup tools are query-only (search-by-name/number); the NPPES NPI Registry and several state portals offer bulk CSV exports.

Federal / multi-state

State unified license portals

(Some states route many professions through a single search; per-profession entries below reference these where applicable.)

Per state

Alabama

Alaska

Arizona

Arkansas

California

Colorado

Connecticut

Delaware

  • All four via DELPROS - https://delpros.delaware.gov/OH_VerifyLicense - query-only
  • Medical: Delaware Board of Medical Licensure & Discipline (via DELPROS)
  • Nursing: Delaware Board of Nursing (via DELPROS)
  • Real estate: Delaware Real Estate Commission (via DELPROS)
  • Contractor: Local-only at general-contractor level; specialty trades (electrical, plumbing) licensed via DELPROS

District of Columbia

Florida

Georgia

Hawaii

Idaho

Illinois

  • All four via Illinois IDFPR License Lookup - https://online-dfpr.micropact.com/lookup/licenselookup.aspx - query-only
  • Medical: IL Medical Board (via IDFPR)
  • Nursing: IL Board of Nursing (via IDFPR)
  • Real estate: IL Division of Real Estate (via IDFPR)
  • Contractor: Illinois does not have a statewide general contractor license (local-only at GC level); roofing & plumbing licensed via IDFPR

Indiana

  • All via Indiana PLA - https://secure.in.gov/apps/pla/search and https://mylicense.in.gov/everification/ - query-only
  • Medical: Indiana Medical Licensing Board (via PLA)
  • Nursing: Indiana State Board of Nursing (via PLA)
  • Real estate: Indiana Real Estate Commission (via PLA)
  • Contractor: Local-only, no state general contractor license (Plumbing Commission licenses plumbers via PLA)

Iowa

Kansas

Kentucky

Louisiana

Maine

Maryland

Massachusetts

Michigan

  • All four via MiPLUS / LARA - https://val.apps.lara.state.mi.us/ - query-only
  • Medical: MI Board of Medicine (via MiPLUS)
  • Nursing: MI Board of Nursing (via MiPLUS)
  • Real estate: MI Board of Real Estate Brokers & Salespersons (via MiPLUS)
  • Contractor: MI Residential Builders & M&A Contractors (via MiPLUS)

Minnesota

Mississippi

Missouri

  • All via MO DPR Licensee Search - https://pr.mo.gov/licensee-search.asp - query-only; bulk "Downloadable Listings" updated nightly at https://pr.mo.gov/
  • Medical: Missouri Board of Registration for the Healing Arts (via DPR)
  • Nursing: Missouri State Board of Nursing (via DPR)
  • Real estate: Missouri Real Estate Commission (via DPR)
  • Contractor: Local-only, no state general contractor license

Montana

Nebraska

Nevada

New Hampshire

  • All four via NH OPLC License Lookup - https://www.oplc.nh.gov/license-lookup - query-only
  • Medical: NH Board of Medicine (via OPLC)
  • Nursing: NH Board of Nursing (via OPLC)
  • Real estate: NH Real Estate Commission (via OPLC)
  • Contractor: Local-only, no state general contractor license (NH licenses some trades, e.g. electricians, via OPLC)

New Jersey

New Mexico

New York

North Carolina

North Dakota

Ohio

Oklahoma

Oregon

Pennsylvania

  • All medical/nursing/real estate via PALS - https://www.pals.pa.gov/ - query-only
  • Medical: PA State Board of Medicine (via PALS)
  • Nursing: PA State Board of Nursing (via PALS)
  • Real estate: PA Real Estate Commission (via PALS)
  • Contractor: PA Attorney General Home Improvement Contractor Registration - https://hicsearch.attorneygeneral.gov/ - query-only

Rhode Island

South Carolina

South Dakota

Tennessee

Texas

Utah

  • All four via Utah DOPL License Lookup Verification - https://secure.utah.gov/llv/search/index.html - query-only
  • Medical: Utah Physician Licensing Board (via DOPL)
  • Nursing: Utah Board of Nursing (via DOPL)
  • Real estate: Utah Division of Real Estate (via DOPL); separate division but uses DOPL lookup
  • Contractor: Utah Contractors (via DOPL)

Vermont

Virginia

Washington

West Virginia

Wisconsin

  • All four via Wisconsin DSPS License Search - https://licensesearch.wi.gov/ - query-only
  • Medical: WI Medical Examining Board (via DSPS)
  • Nursing: WI Board of Nursing (via DSPS)
  • Real estate: WI Real Estate Examining Board (via DSPS)
  • Contractor: WI Dwelling Contractor / trades (via DSPS)

Wyoming


Secretary of State, business, notary, charity, lobbyist

Notes on scope and quirks

  • Business entity searches: every state offers free name/agent query; only a minority offer bulk downloads (FL, OR partial via open data, LA via paid API, HI by-record paid). DE does not publish principal addresses. NV charges for some access; TX SOSDirect charges $1/search.
  • Notary registries: most states publish searchable directories. NH, TN, WY publish only PDF lists or none. GA notaries are commissioned by county clerks but indexed centrally by GSCCCA.
  • Charity registries: typically AG, but SoS in CO, GA, MD, MS, NC, ND, OK, PA, SC, TN, WA, WV; DBR in RI; DCP in CT; DLCP in DC; VDACS in VA; DPFR in ME; DFI in WI; FDACS in FL; DCP in UT (transitioning to DCCC). DE, ID, MT, SD, WY have no state-level charity registration (only fundraisers in some). AZ repealed charity registration in 2013 (veterans + fundraisers still register).
  • Lobbyist registries: SoS for most; Ethics Commission for AL, GA, HI, KS, MO, OK, SC, TN, TX, WI, WV; Legislature for IA, NE, NV; PIC for DE; APOC for AK; DLS Ethics Council for VA; PDC for WA; OGEC for OR; ELEC for NJ; BEGA for DC; JLEC/OLIG for OH; KLEC for KY; LREC/ILRC for IN; ME Ethics; MA SoC; NH SoS; MD SEC.

Per state

Alabama

Alaska

Arizona

Arkansas

California

Colorado

Connecticut

Delaware

District of Columbia

Florida

Georgia

Hawaii

Idaho

Illinois

Indiana

Iowa

Kansas

Kentucky

Louisiana

Maine

Maryland

Massachusetts

Michigan

Minnesota

Mississippi

Missouri

Montana

Nebraska

Nevada

New Hampshire

New Jersey

New Mexico

New York

North Carolina

North Dakota

Ohio

Oklahoma

Oregon

Pennsylvania

Rhode Island

South Carolina

South Dakota

Tennessee

Texas

Utah

Vermont

Virginia

Washington

West Virginia

Wisconsin

Wyoming


Regulated facilities, open data portals, and parcels

State open data portals

License terms note: most state ArcGIS Hub deployments default to Esri Open Data terms (effectively public-domain or attribution); confirm per dataset. Socrata-hosted state portals generally publish under each state's open-data terms (typically attribution-only).


Regulated establishments per state

Each state lists, where available: Child care licensing, Assisted living/nursing home (state-level), Alcohol/ABC, Cannabis (where legal & public), Auto dealer licensing.

Alabama

Alaska

Arizona

Arkansas

California

Colorado

Connecticut

Delaware

District of Columbia

Florida

Georgia

Hawaii

Idaho

Illinois

Indiana

Iowa

Kansas

Kentucky

Louisiana

Maine

Maryland

Massachusetts

Michigan

Minnesota

Mississippi

Missouri

Montana

Nebraska

Nevada

New Hampshire

New Jersey

New Mexico

New York

North Carolina

North Dakota

Ohio

Oklahoma

Oregon

Pennsylvania

Rhode Island

South Carolina

South Dakota

Tennessee

Texas

Utah

Vermont

Virginia

Washington

West Virginia

Wisconsin

Wyoming


State-aggregated parcel data

Only states with a true statewide parcel aggregation (not just per-county) listed. License notes added where visible.

States NOT included here either lack state-level parcel aggregation (county-only) or only publish partial datasets: AL, AK, AZ, CA, CO, FL, GA, HI, ID, IL, IN, IA, KS, KY, LA, MI, MS, MO, NE, NV, NJ, NM, NY, ND, OH, OK, OR, PA, SC, SD, TX, WA, WV, WI. Some of these (e.g., NJ, TX, WI, OR) maintain statewide indexes pointing to county data but do not redistribute parcels as a unified statewide layer at no cost.


Notes on bulk vs. query, and license terms

  • ArcGIS Hub-hosted state portals (Esri Open Data) generally allow CSV/GeoJSON/SHP downloads with attribution; Esri terms are public-domain-equivalent by default but each dataset may override.
  • Socrata-hosted portals (data.ca.gov, data.ny.gov, opendata.maryland.gov, etc.) expose SODA APIs for query and CSV/JSON bulk export per dataset.
  • Many regulator websites (DMV dealer lookups, liquor license search portals) are query-only; bulk acquisition typically requires FOIA/public records request unless an open-data dataset is also published.
  • Cannabis: included only where state has a legal commercial program and publishes a licensee list. States marked N/A either prohibit cannabis or only permit narrow medical/CBD programs without a meaningful public dispensary roster.
  • Always check per-state terms of use; CC-BY and CC0 are common but not universal.
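For the Socrata-hosted portals noted above, bulk export can be paged through the SODA resource endpoint. The sketch below only builds the paged CSV URLs and makes no network calls; the dataset id abcd-1234 is a placeholder, and the page size and the :id ordering are assumptions to check against each portal's limits.

```python
from typing import Iterator
from urllib.parse import urlencode


def soda_csv_url(domain: str, dataset_id: str, limit: int = 50000, offset: int = 0) -> str:
    """Build a SODA CSV export URL for one page of a dataset."""
    # safe="$:" keeps the SoQL parameter names and the :id system field readable.
    query = urlencode({"$limit": limit, "$offset": offset, "$order": ":id"}, safe="$:")
    return f"https://{domain}/resource/{dataset_id}.csv?{query}"


def soda_page_urls(domain: str, dataset_id: str, total_rows: int, limit: int = 50000) -> Iterator[str]:
    """Yield one URL per page needed to cover total_rows."""
    for offset in range(0, total_rows, limit):
        yield soda_csv_url(domain, dataset_id, limit=limit, offset=offset)


# Placeholder dataset id; substitute the id shown on the dataset's portal page.
url = soda_csv_url("data.ny.gov", "abcd-1234", limit=1000, offset=2000)
pages = list(soda_page_urls("data.ny.gov", "abcd-1234", total_rows=2500, limit=1000))
```

A stable sort order matters when paging, since without it rows can shift between requests and be duplicated or skipped.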