# NPPES Provider Golden Record

### About the Dataset

**Every team that builds on NPPES eventually builds the same thing — a cleanup layer to make the raw registry usable.** The Python scripts, the address scrubbers, the dedup logic, the exclusion cross-checks, the FHIR mapping work. It's the project nobody budgeted for and everybody ends up owning. We built it once, so your team doesn't have to.

The **NPPES Provider Golden Record** is the cleaned, deduplicated, enriched version of the Centers for Medicare & Medicaid Services (CMS) **National Plan and Provider Enumeration System (NPPES)** — refreshed weekly within days of each CMS release.

We validate every practice address against USPS postal records, standardize it to deliverable form, and score each with a confidence rating — so claim denials from bad ZIPs and typos stop at our pipeline, not yours. We surface the duplicate NPI pairs — candidate matches across millions of providers — and resolve them with tiered-confidence scoring, so the same physician doesn't show up three times in your network map. We cross-reference every NPI against **CMS PECOS** for Medicare enrollment, specialty, and group practice PAC ID — and flag **OIG/LEIE** exclusions and deactivated NPIs before they reach your billing team.

We fill missing NUCC taxonomy gaps, attach hospital CCN and group practice affiliations from **CMS Care Compare**, and enrich every address with census tract, CBSA metro, **HPSA** shortage scores, MUA designations, congressional and state-legislative districts, IANA timezone, and a native Snowflake `geo_point` GEOGRAPHY column for `ST_DISTANCE` and `ST_DWITHIN` spatial queries. And we publish pre-built **FHIR R4.0.1** Practitioner, Organization, Location, and PractitionerRole resources — so your next **TEFCA**, QHIN, HIE, or EHR integration project starts on day thirty instead of day zero.

One weekly-refreshed source of truth for revenue cycle, credentialing, network planning, compliance, and TEFCA-ready interop.

{% hint style="info" %}
**Get Full Access** | [Snowflake Marketplace](https://app.snowflake.com/marketplace/listing/GZT1Z7QRT0IO) | [Free Trial](https://trial.dataplex-consulting.com)
{% endhint %}

{% hint style="warning" %}
**Upgrade Path for CMS NPPES Provider Dataset Subscribers:** This is the premium counterpart to the raw [CMS NPPES Provider Dataset](/data-catalog/cms-nppes-provider-dataset.md). The Golden Record delivers every NPI in NPPES *after* address validation, census/ACS enrichment, PECOS enrollment joins, OIG exclusion checks, duplicate clustering, and FHIR mapping — consolidating work that teams typically distribute across internal data quality, interoperability, and compliance pipelines into a single analytics-ready dataset refreshed weekly on the CMS release cadence.

**Backwards-compatible by design:** every table name, column, and join path from the CMS NPPES Provider Dataset is preserved in the Golden Record. Existing subscribers can swap the database name in their queries — `PROVIDERS`, `PROVIDERS_ADDRESSES`, `PROVIDERS_IDENTIFIERS`, `PROVIDERS_LICENSES`, `PROVIDERS_TAXONOMIES`, and `TAXONOMIES` keep their original shape, so existing SQL runs unchanged. New enrichment columns and tables are layered alongside.
{% endhint %}

### What You Get

|                                               |                                                                                                                                                                  |
| --------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Addresses validated & geocoded**            | Every practice address USPS-standardized, confidence-scored, and paired with a native Snowflake GEOGRAPHY point for `ST_DISTANCE` / `ST_DWITHIN` spatial queries |
| **Duplicate NPIs resolved**                   | Tiered-confidence clustering collapses multi-NPI providers into unique real-world entities, with the canonical and all alternate NPIs preserved                  |
| **Medicare enrollment joined**                | PECOS enrollment status, specialty, enrollment state, and group practice PAC ID attached to every NPI                                                            |
| **Compliance risks flagged**                  | OIG/LEIE exclusions, deactivated NPIs, and critical data-quality issues surfaced as clean booleans                                                               |
| **Taxonomy gaps filled**                      | Missing NUCC primary taxonomy codes inferred and flagged, so specialty classification is complete across every record                                            |
| **Hospital & group affiliations appended**    | CMS Care Compare hospital CCN affiliations, group practice, medical school, graduation year                                                                      |
| **Geographic & demographic context enriched** | Census tract, CBSA metro, HPSA shortage scores, MUA designations, congressional and state legislative districts, IANA timezone, ACS demographics and economics   |
| **Interop resources pre-built**               | FHIR R4.0.1 Practitioner, Organization, Location, and PractitionerRole resources ready to drop into TEFCA, QHIN, HIE, or EHR pipelines                           |
| **Coverage**                                  | Every US state, DC, and territories                                                                                                                              |
| **Refresh cadence**                           | Weekly — our pipeline monitors for new CMS NPPES releases and loads the validated, enriched data within days of publication, automatically                       |

**Consumer tables** (11 in the share):

* **Provider master**: `PROVIDERS`, `PROVIDERS_ADDRESSES`, `PROVIDERS_IDENTIFIERS`, `PROVIDERS_LICENSES`, `PROVIDERS_TAXONOMIES`, `TAXONOMIES`
* **Enrichment outputs**: `PROVIDERS_ENTITIES`, `PROVIDERS_FHIR`, `PROVIDERS_DUPLICATES`
* **Metadata**: `PRODUCT_INFO`, `DATA_DICTIONARY`

## Overview

The NPPES Provider Golden Record provides analytics-ready access to every US healthcare provider with enrichment that isn't available in the raw CMS feed.

**Provider master:**

* **PROVIDERS** – One row per NPI. Identity, lifecycle, secondary-name fields, taxonomy enrichment, OIG exclusion, CMS Care Compare hospital/group affiliations, PECOS Medicare enrollment, duplicate cluster flags, quality signals
* **PROVIDERS\_ADDRESSES** – One row per (NPI, address\_type). Both `Practice_Location` and `Business_Mailing` rows per provider. Practice rows carry USPS-validated address, geocoded lat/long + native Snowflake `geo_point` GEOGRAPHY, census tract/block/place codes, ACS demographics, HPSA and MUA shortage designations, congressional + state-legislative districts, IANA timezone — mailing rows are raw NPPES passthrough
* **PROVIDERS\_IDENTIFIERS** – Other provider identifiers (PECOS, Medicaid, prior NPIs, state license numbers per NPPES `OTHER_PROVIDER_IDENTIFIER_*`). Join `PROVIDER_ID = PROVIDERS.ID`
* **PROVIDERS\_LICENSES** – State medical licenses (state + license number)
* **PROVIDERS\_TAXONOMIES** – Per-NPI taxonomy rows with the raw NPPES `PRIMARY` flag (sparsely populated \~15% of rows — for dense primary-taxonomy signal use `PROVIDERS.INFERRED_PRIMARY_TAXONOMY_CODE`)
* **TAXONOMIES** – NUCC taxonomy reference (\~880 codes with grouping, classification, specialization, definition)

**Enrichment outputs:**

* **PROVIDERS\_ENTITIES** – One row per unique real-world provider with multi-NPI duplicate clusters pre-collapsed and critical rollups preserved (`any_medicare_enrolled`, `any_oig_excluded`) — query this for accurate unique-provider counts
* **PROVIDERS\_FHIR** – FHIR R4.0.1 `Practitioner` (or `Organization`), `PractitionerRole` for Medicare enrollment, and `Location` resources stored as VARIANT, ready to drop into TEFCA, QHIN, EHR, or HIE pipelines
* **PROVIDERS\_DUPLICATES** – Bridge table of suspected duplicate NPI pairs with similarity scores, match type, confidence tier, and recommended surviving NPI — forensic view behind `PROVIDERS_ENTITIES`

### Metadata Tables

Every Dataplex data product includes these standard metadata tables:

| Table             | Purpose                                                                                                |
| ----------------- | ------------------------------------------------------------------------------------------------------ |
| `PRODUCT_INFO`    | Single-row product snapshot — current provider count, address count, enrichment coverage, last refresh |
| `DATA_DICTIONARY` | Column descriptions for every consumer-facing view                                                     |

## Entity Relationship Diagram

![NPPES Provider Golden Record Entity Relationship Diagram](/files/ytC5x4XjFmeTDOYzEqgN)

The primary join key is `npi` for every consumer-facing table. `PROVIDERS.ID` (VARCHAR UUID) is the FK target for `PROVIDERS_IDENTIFIERS.PROVIDER_ID`, `PROVIDERS_LICENSES.PROVIDER_ID`, and `PROVIDERS_TAXONOMIES.PROVIDER_ID`. Address enrichment joins via `NPI`: `PROVIDERS_ADDRESSES` is one row per (NPI, address\_type) so each provider has up to two rows (Practice + Mailing).

## Data Tables

### PROVIDERS — the primary consumer table

One row per NPI. Identity, taxonomy, OIG exclusion, CMS Care Compare, PECOS Medicare enrollment, duplicate flags, and quality scoring. **Join to `PROVIDERS_ADDRESSES` on `npi`** for validated addresses and geo/demographic enrichment. Filter `is_canonical = TRUE` to dedupe a roster. The `id` column (VARCHAR UUID) is the FK target for `PROVIDERS_IDENTIFIERS`, `PROVIDERS_LICENSES`, and `PROVIDERS_TAXONOMIES` — join via `PROVIDER_ID = PROVIDERS.ID`.

**Key Features:**

* One row per NPI (9.5M+ US healthcare providers including active + deactivated)
* Validated identity and lifecycle fields including the `PROVIDER_OTHER_*` secondary-name family (maiden, former, DBA/AKA)
* **PECOS Medicare enrollment** joined per NPI — `medicare_enrolled` is strict TRUE/FALSE
* **OIG/LEIE exclusion** flagged per NPI with exclusion type and date
* **CMS Care Compare** hospital affiliations, group practice, medical school, graduation year
* Duplicate cluster labeling with a stable `is_canonical` filter for safe roster dedup
* Refreshed weekly on the CMS NPPES release cadence

#### Key Columns

{% hint style="info" %}
This table highlights the columns used most often. **`PROVIDERS` has 85+ columns in total** including identity fields (`id`, 10× `PROVIDER_OTHER_*`). Query `DWV.DATA_DICTIONARY WHERE table_name = 'PROVIDERS'` for the complete, authoritative spec with types and descriptions — we publish the metadata to Snowflake so it always matches what's in the view.
{% endhint %}

| Column                                            | Type      | Description                                                                          |
| ------------------------------------------------- | --------- | ------------------------------------------------------------------------------------ |
| npi                                               | NUMBER    | National Provider Identifier — 10-digit CMS ID, primary key                          |
| address\_id                                       | VARCHAR   | FK to `PROVIDERS_ADDRESSES.address_id` — join for address + enrichment               |
| entity\_type\_code                                | NUMBER    | 1 = Individual, 2 = Organization                                                     |
| is\_individual / is\_organization                 | BOOLEAN   | Convenience flags derived from `entity_type_code`                                    |
| provider\_organization\_name                      | VARCHAR   | Legal business name (organizations)                                                  |
| provider\_last\_name / first\_name / middle\_name | VARCHAR   | Name parts (individuals)                                                             |
| provider\_full\_name                              | VARCHAR   | Derived display name                                                                 |
| provider\_credential\_text                        | VARCHAR   | MD, DO, NP, etc.                                                                     |
| provider\_enumeration\_date                       | DATE      | When CMS first issued the NPI                                                        |
| is\_active                                        | BOOLEAN   | Currently active (not deactivated)                                                   |
| npi\_status                                       | VARCHAR   | `active` or `deactivated`                                                            |
| years\_in\_practice                               | NUMBER    | Years since enumeration                                                              |
| primary\_taxonomy\_code                           | VARCHAR   | NUCC taxonomy code (e.g., `207R00000X` = Internal Medicine)                          |
| specialty\_description                            | VARCHAR   | Human-readable specialty from NUCC lookup                                            |
| is\_specialty\_inferred                           | BOOLEAN   | TRUE when taxonomy was inferred to fill a NPPES gap                                  |
| is\_excluded                                      | BOOLEAN   | On the OIG/LEIE exclusion list                                                       |
| exclusion\_type\_description                      | VARCHAR   | OIG exclusion reason                                                                 |
| exclusion\_date                                   | DATE      | Date added to OIG exclusion list                                                     |
| medicare\_enrolled                                | BOOLEAN   | Enrolled in Medicare per PECOS (strict TRUE/FALSE)                                   |
| medicare\_specialty\_description                  | VARCHAR   | PECOS-reported Medicare specialty                                                    |
| medicare\_enrollment\_state                       | VARCHAR   | State of PECOS enrollment                                                            |
| medicare\_group\_practice\_pac\_id                | VARCHAR   | PECOS group practice PAC ID                                                          |
| medical\_school                                   | VARCHAR   | From CMS Care Compare                                                                |
| graduation\_year                                  | NUMBER    | From CMS Care Compare                                                                |
| hospital\_affiliation\_count                      | NUMBER    | Number of CCN affiliations (0–3)                                                     |
| group\_practice\_name                             | VARCHAR   | Primary group practice from CMS Care Compare                                         |
| is\_canonical                                     | BOOLEAN   | TRUE for the canonical NPI in a duplicate cluster — filter for a deduplicated roster |
| is\_duplicate\_suspected                          | BOOLEAN   | Participates in a high-confidence duplicate pair                                     |
| duplicate\_group\_id                              | NUMBER    | Cluster ID across all members (NULL if singleton)                                    |
| duplicate\_cluster\_size                          | NUMBER    | Count of NPIs in this cluster                                                        |
| has\_quality\_flags                               | BOOLEAN   | Any active data quality flag                                                         |
| has\_critical\_quality\_flag                      | BOOLEAN   | Critical-severity flag (e.g., undeliverable address)                                 |
| overall\_quality\_score                           | FLOAT     | 0.0–1.0 composite quality score                                                      |
| created\_at                                       | TIMESTAMP | When the source data was last loaded                                                 |
| updated\_at                                       | TIMESTAMP | When this record was last refreshed                                                  |

### PROVIDERS\_ADDRESSES — validated addresses with full geo and demographic enrichment

One row per (NPI, address\_type) — both `Practice_Location` and `Business_Mailing` rows per provider. Practice rows carry the full enrichment stack (USPS validation, Geocodio census/CBSA/HPSA/MUA, lat/lng/`geo_point GEOGRAPHY`, political districts, timezone, ACS demographics). Mailing rows are raw NPPES fields only with enrichment columns NULL.

**Key Features:**

* One row per (NPI, address\_type) — **\~19M rows total**, \~9.5M practice + \~9.5M mailing
* Raw NPPES fields: `street1`, `street2`, `city`, `state`, `zipcode`, `countrycode`, `telephone_number`, `fax_number`, `address_type`, `id` (VARCHAR UUID), `provider_id` (VARCHAR UUID FK to `PROVIDERS.ID`)
* Practice rows carry USPS validation with `address_validation_status` indicating valid / corrected / undeliverable / not\_found (see [status values](#address_validation_status-values) below)
* Geocoded lat/long coordinates and native `geo_point GEOGRAPHY` on practice rows
* Census tract, block, place FIPS codes
* ACS demographics and economics (population, median age, race/ethnicity, median household income, housing, education, veteran status)
* HRSA HPSA shortage scores (primary care, mental health, dental) and MUA designations
* Congressional and state legislative districts
* IANA timezone with UTC offset and DST observance

{% hint style="info" %}
Filter `WHERE address_type = 'Practice_Location'` for enriched geo analytics. Filter `WHERE address_type = 'Business_Mailing'` for correspondence addresses.
{% endhint %}

#### Key Columns

{% hint style="info" %}
This table highlights the columns used most often. **`PROVIDERS_ADDRESSES` has \~60 columns in total.** Query `DWV.DATA_DICTIONARY WHERE table_name = 'PROVIDERS_ADDRESSES'` for the complete, authoritative spec — we publish the metadata to Snowflake so it always matches what's in the view.
{% endhint %}

| Column                                                                                            | Type        | Description                                                                                                    |
| ------------------------------------------------------------------------------------------------- | ----------- | -------------------------------------------------------------------------------------------------------------- |
| id                                                                                                | VARCHAR(36) | Row PK                                                                                                         |
| provider\_id                                                                                      | VARCHAR(36) | FK to `PROVIDERS.ID`                                                                                           |
| npi                                                                                               | NUMBER      | NPI — matches `PROVIDERS.NPI`                                                                                  |
| address\_type                                                                                     | VARCHAR     | `Practice_Location` or `Business_Mailing`                                                                      |
| street1 / street2 / city / state / zipcode                                                        | VARCHAR     | Address as filed with CMS                                                                                      |
| countrycode                                                                                       | VARCHAR     | ISO country code (`US`)                                                                                        |
| telephone\_number / fax\_number                                                                   | VARCHAR     | Contact numbers as filed                                                                                       |
| validated\_address\_line1 / validated\_city / validated\_state / validated\_zip / validated\_zip4 | VARCHAR     | USPS-validated canonical address (practice rows only)                                                          |
| latitude / longitude                                                                              | FLOAT       | Geocoded coordinates (practice rows only)                                                                      |
| geo\_point                                                                                        | GEOGRAPHY   | Native Snowflake spatial type — use with `ST_DISTANCE`, `ST_DWITHIN`, `ST_CONTAINS` (practice rows only)       |
| address\_validation\_status                                                                       | VARCHAR     | Outcome of address verification (practice rows) — see [status values](#address_validation_status-values) below |
| address\_confidence\_score                                                                        | FLOAT       | 0.0–1.0 validation confidence                                                                                  |
| state\_fips / county\_fips                                                                        | VARCHAR     | Census FIPS codes                                                                                              |
| county\_name                                                                                      | VARCHAR     | County display name                                                                                            |
| tract\_code / block\_code / block\_group                                                          | VARCHAR     | Census Bureau tract/block geography                                                                            |
| metro\_area\_name / metro\_area\_code / metro\_area\_type                                         | VARCHAR     | CBSA name, code, metropolitan/micropolitan                                                                     |
| combined\_statistical\_area\_name                                                                 | VARCHAR     | CSA (where applicable)                                                                                         |
| congressional\_district / congressional\_district\_name                                           | VARCHAR     | US House district                                                                                              |
| state\_senate\_district / state\_house\_district                                                  | VARCHAR     | State legislative districts                                                                                    |
| timezone\_name                                                                                    | VARCHAR     | IANA timezone (e.g., `America/New_York`)                                                                       |
| timezone\_utc\_offset                                                                             | NUMBER      | UTC offset in hours                                                                                            |
| total\_population                                                                                 | NUMBER      | ACS total population for the tract                                                                             |
| median\_household\_income                                                                         | NUMBER      | ACS median household income                                                                                    |
| pct\_bachelors\_or\_higher                                                                        | FLOAT       | ACS educational attainment                                                                                     |
| is\_in\_hpsa                                                                                      | BOOLEAN     | County has an active HRSA HPSA designation                                                                     |
| hpsa\_primary\_care\_score / mental\_health\_score / dental\_score                                | NUMBER      | HPSA severity (0–26, higher = more severe shortage)                                                            |
| is\_in\_mua                                                                                       | BOOLEAN     | HRSA Medically Underserved Area designation                                                                    |
| is\_rural                                                                                         | BOOLEAN     | Not metropolitan                                                                                               |
| created\_at / updated\_at                                                                         | TIMESTAMP   | Lineage                                                                                                        |

#### `address_validation_status` values

What each outcome means when you're working with the data:

| Value           | What it means for you                                                                                                                                                                                                             |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `valid`         | The address as filed with CMS is deliverable and standardized — no changes needed. Safe to use `practice_*` or `validated_*` columns interchangeably.                                                                             |
| `corrected`     | The filed address had issues (typos, non-canonical casing, missing ZIP+4). The `validated_*` columns contain the corrected, deliverable version — **use those for mailing and claim submission**, not the `practice_*` originals. |
| `undeliverable` | Address verification determined the location cannot receive mail (invalid apartment, deleted street, etc.). Flag these providers for outreach or directory correction before using them for claims or mailings.                   |
| `not_found`     | The address could not be matched against authoritative address records — could be a new construction, an irregular PO box format, or a data-entry error. Verify before mailing or submitting claims.                              |

### PROVIDERS\_ENTITIES — deduplicated provider roster

One row per unique real-world provider. Multi-NPI duplicate clusters collapsed into a single canonical entity with the full NPI list preserved in `all_npis`. Critical rollups (`any_oig_excluded`, `any_medicare_enrolled`, `any_has_critical_quality_flag`) aggregate across cluster members so compliance signals are never silently deduped away.

**Key Features:**

* Use this for accurate unique-provider counts (avoid inflating from multi-NPI clusters)
* `all_npis` ARRAY preserves every NPI in the cluster for roster remediation
* Any OIG exclusion or Medicare enrollment on any cluster member is surfaced

#### Key Columns

{% hint style="info" %}
This table highlights the columns used most often. **`PROVIDERS_ENTITIES` has 38 columns in total.** Query `DWV.DATA_DICTIONARY WHERE table_name = 'PROVIDERS_ENTITIES'` for the complete spec.
{% endhint %}

| Column                                                                                    | Type                     | Description                                                                            |
| ----------------------------------------------------------------------------------------- | ------------------------ | -------------------------------------------------------------------------------------- |
| entity\_id                                                                                | NUMBER                   | Stable cluster key (`duplicate_group_id` for clusters, `canonical_npi` for singletons) |
| canonical\_npi                                                                            | NUMBER                   | Representative NPI for the entity                                                      |
| all\_npis                                                                                 | ARRAY                    | Every NPI in this entity (length 1 for singletons)                                     |
| npi\_count                                                                                | NUMBER                   | Cluster size                                                                           |
| any\_medicare\_enrolled                                                                   | BOOLEAN                  | TRUE if ANY cluster NPI is Medicare-enrolled per PECOS                                 |
| any\_oig\_excluded                                                                        | BOOLEAN                  | Compliance rollup — TRUE if ANY cluster NPI is on the OIG/LEIE list                    |
| any\_has\_quality\_flags                                                                  | BOOLEAN                  | TRUE if ANY cluster NPI has a data quality flag                                        |
| any\_has\_critical\_quality\_flag                                                         | BOOLEAN                  | TRUE if ANY cluster NPI has a critical quality flag                                    |
| entity\_type\_code                                                                        | NUMBER                   | 1 = Individual, 2 = Organization                                                       |
| provider\_full\_name                                                                      | VARCHAR                  | Derived display name from canonical NPI                                                |
| provider\_organization\_name / last/first/middle\_name                                    | VARCHAR                  | Name parts from canonical NPI                                                          |
| provider\_credential\_text                                                                | VARCHAR                  | Credentials from canonical NPI                                                         |
| primary\_taxonomy\_code / specialty\_description                                          | VARCHAR                  | NUCC taxonomy from canonical NPI                                                       |
| cms\_specialty                                                                            | VARCHAR                  | CMS Care Compare specialty from canonical NPI                                          |
| is\_active                                                                                | BOOLEAN                  | Canonical NPI's active status                                                          |
| npi\_status                                                                               | VARCHAR                  | `active` or `deactivated`                                                              |
| is\_excluded / exclusion\_type\_description / exclusion\_date                             | BOOLEAN / VARCHAR / DATE | Canonical NPI's OIG status                                                             |
| medical\_school / graduation\_year / hospital\_affiliation\_count / group\_practice\_name | —                        | From CMS Care Compare                                                                  |
| overall\_quality\_score                                                                   | FLOAT                    | Composite score from canonical NPI                                                     |
| address\_id                                                                               | VARCHAR                  | FK to `PROVIDERS_ADDRESSES` (canonical NPI's practice address)                         |
| created\_at / updated\_at                                                                 | TIMESTAMP                | Lineage                                                                                |

### PROVIDERS\_FHIR — drop-in FHIR R4.0.1 resources

One row per NPI, 1:1 with `PROVIDERS`. Three FHIR R4.0.1 resources stored as VARIANT: a `Practitioner` or `Organization` (switched on `entity_type_code`), a `PractitionerRole` representing Medicare enrollment (NULL when not PECOS-enrolled), and a `Location` with coordinates. Ready to drop directly into TEFCA, QHIN, EHR, or HIE pipelines — no FHIR transformer required.

**Key Features:**

* US Core-aligned field conventions
* Vendor-scoped extensions for OIG/LEIE exclusion on `Practitioner.extension`
* `meta.lastUpdated` preserved from source
* Conditional `PractitionerRole` (Medicare enrollment only) mirrors real-world semantics

#### Column Reference

All 7 columns:

| Column                   | Type      | Description                                                                   |
| ------------------------ | --------- | ----------------------------------------------------------------------------- |
| npi                      | NUMBER    | National Provider Identifier (1:1 with `PROVIDERS.npi`)                       |
| fhir\_resource\_type     | VARCHAR   | `Practitioner` (individual) or `Organization` (Type 2 NPI)                    |
| fhir\_practitioner       | VARIANT   | FHIR R4.0.1 Practitioner or Organization resource as JSON                     |
| fhir\_practitioner\_role | VARIANT   | FHIR R4.0.1 PractitionerRole for Medicare enrollment (NULL when not enrolled) |
| fhir\_location           | VARIANT   | FHIR R4.0.1 Location resource with coordinates                                |
| created\_at              | TIMESTAMP | When the source data was last loaded                                          |
| updated\_at              | TIMESTAMP | When this record was last refreshed                                           |

### PROVIDERS\_DUPLICATES — bridge table of suspected duplicate pairs

Tiered-confidence duplicate NPI pairs with similarity scores and recommended primary NPI. The forensic view behind `PROVIDERS_ENTITIES` — use this to audit dedup quality or build custom dedup logic.

#### Column Reference

All 10 columns:

| Column               | Type      | Description                                                                                                       |
| -------------------- | --------- | ----------------------------------------------------------------------------------------------------------------- |
| npi\_1               | NUMBER    | First NPI in the pair                                                                                             |
| npi\_2               | NUMBER    | Second NPI (always greater than `npi_1`)                                                                          |
| similarity\_score    | FLOAT     | 0.0–1.0 weighted similarity score across name, address, phone, and taxonomy                                       |
| confidence\_tier     | VARCHAR   | Evidence strength for treating the two NPIs as the same entity — see [tier values](#confidence_tier-values) below |
| match\_type          | VARCHAR   | Pattern of similarity that triggered the match — see [match type values](#match_type-values) below                |
| entity\_class        | VARCHAR   | `individual` or `organization`                                                                                    |
| recommended\_primary | NUMBER    | Suggested survivor NPI (lower-numbered = earlier CMS registration)                                                |
| detection\_date      | DATE      | When the pair was first detected                                                                                  |
| created\_at          | TIMESTAMP | When the source data was last loaded                                                                              |
| updated\_at          | TIMESTAMP | When this record was last refreshed                                                                               |

#### `confidence_tier` values

How much trust to place in a pair when collapsing a roster:

| Value             | What it means for you                                                                                                                                                                                                |
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `high_confidence` | Strong evidence the two NPIs are the same real-world provider (exact name + shared address or phone). **Safe to use directly for roster deduplication** — `PROVIDERS_ENTITIES` pre-collapses these clusters for you. |
| `possible_match`  | Partial similarity that warrants a human review step before deduplication. Useful for surfacing dedup candidates in an MDM workflow, but don't merge blindly.                                                        |

#### `match_type` values

The pattern of similarity that triggered the pair:

| Value            | What it means for you                                                                                                                                                                                                      |
| ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `exact_match`    | Name, address, and/or phone match character-for-character on both NPIs. Highest-confidence category.                                                                                                                       |
| `probable_match` | Strong fuzzy similarity (e.g., name + shared practice address or phone). Meaningful overlap but with minor variations.                                                                                                     |
| `fuzzy_variant`  | Name spelling variants sharing location or taxonomy signals — typically hyphen/space/punctuation differences, or common name aliases. Lower precision but surfaces legitimate name changes and data-entry inconsistencies. |

### PROVIDERS\_IDENTIFIERS — other provider identifiers

Other provider identifiers per NPI — PECOS enrollment IDs, Medicaid provider IDs, prior NPIs, state license numbers per NPPES `OTHER_PROVIDER_IDENTIFIER_*` spec. One row per (provider, identifier). \~2.76M rows.

**Key Features:**

* \~2.76M identifier rows across every NPI that filed one
* Join `PROVIDER_ID = PROVIDERS.ID`
* Includes ALL NPIs (canonical + non-canonical) so historical claim joins don't silently drop deactivated providers

| Column                    | Type        | Description                                |
| ------------------------- | ----------- | ------------------------------------------ |
| id                        | VARCHAR(36) | Row PK                                     |
| provider\_id              | VARCHAR(36) | FK to `PROVIDERS.ID`                       |
| identifier                | VARCHAR     | The other identifier value                 |
| type\_code                | VARCHAR     | NPPES Other Provider Identifier Type Code  |
| type\_value               | VARCHAR     | Human-readable type label                  |
| state                     | VARCHAR     | State code where the identifier was issued |
| issuer                    | VARCHAR     | Issuing authority                          |
| created\_at / updated\_at | TIMESTAMP   | Lineage                                    |

### PROVIDERS\_LICENSES — state medical licenses

State medical licenses per provider. One row per (provider, license). \~6.96M rows.

**Key Features:**

* \~6.96M license rows
* Join `PROVIDER_ID = PROVIDERS.ID`

| Column                    | Type        | Description                        |
| ------------------------- | ----------- | ---------------------------------- |
| id                        | VARCHAR(36) | Row PK                             |
| provider\_id              | VARCHAR(36) | FK to `PROVIDERS.ID`               |
| licenses\_number          | VARCHAR     | License number as filed            |
| state                     | VARCHAR     | State code that issued the license |
| created\_at / updated\_at | TIMESTAMP   | Lineage                            |

### PROVIDERS\_TAXONOMIES — per-NPI taxonomy rows

Per-NPI taxonomy rows with the raw NPPES `PRIMARY` flag preserved. One row per (NPI, taxonomy) — up to 15 taxonomies per provider. \~9.94M rows. The `HEATLHCARE_PROVIDER_TAXONOMY_CODE` column name reflects the spelling carried over from the source NPPES schema.

{% hint style="info" %}
The raw `PRIMARY` flag is sparsely populated (\~15% of rows) per NPPES spec. For a dense inferred-primary signal, use `PROVIDERS.INFERRED_PRIMARY_TAXONOMY_CODE` on the main `PROVIDERS` table — we use a ROW\_NUMBER fallback to pick the best taxonomy when the raw flag is missing.
{% endhint %}

**Key Features:**

* \~9.94M taxonomy rows
* Join `PROVIDER_ID = PROVIDERS.ID`
* Join `HEATLHCARE_PROVIDER_TAXONOMY_CODE = TAXONOMIES.CODE` to get classification/specialization names

| Column                               | Type        | Description                                 |
| ------------------------------------ | ----------- | ------------------------------------------- |
| id                                   | VARCHAR(36) | Row PK                                      |
| provider\_id                         | VARCHAR(36) | FK to `PROVIDERS.ID`                        |
| taxonomy\_id                         | VARCHAR(36) | FK to `TAXONOMIES.ID`                       |
| heatlhcare\_provider\_taxonomy\_code | VARCHAR     | NUCC taxonomy code                          |
| primary                              | BOOLEAN     | Raw NPPES primary flag — sparsely populated |
| created\_at / updated\_at            | TIMESTAMP   | Lineage                                     |

### TAXONOMIES — NUCC taxonomy reference

NUCC (National Uniform Claim Committee) taxonomy code definitions. \~880 rows, updated quarterly. Join from `PROVIDERS_TAXONOMIES.HEATLHCARE_PROVIDER_TAXONOMY_CODE` or `PROVIDERS.PRIMARY_TAXONOMY_CODE` to `TAXONOMIES.CODE` for classification names.

| Column                    | Type      | Description                                                    |
| ------------------------- | --------- | -------------------------------------------------------------- |
| id                        | VARCHAR   | Row PK                                                         |
| code                      | VARCHAR   | 10-char NUCC taxonomy code                                     |
| grouping                  | VARCHAR   | Top-level grouping (e.g., Allopathic & Osteopathic Physicians) |
| classification            | VARCHAR   | Mid-level classification (e.g., Internal Medicine)             |
| specialization            | VARCHAR   | Specialization (e.g., Cardiovascular Disease)                  |
| definition                | VARCHAR   | NUCC's plain-language definition                               |
| notes                     | VARCHAR   | Additional NUCC notes                                          |
| display\_name             | VARCHAR   | Pre-concatenated human-readable name                           |
| section                   | VARCHAR   | NUCC spec section reference                                    |
| created\_at / updated\_at | TIMESTAMP | Lineage                                                        |

## Data Quality

### Standardization

Every address is USPS-validated and geocoded, with AI-driven normalization as a fallback for addresses USPS can't resolve. Taxonomy codes are filled in via inference when NPPES is missing a primary — `is_specialty_inferred` flags which records were enriched. NPI lifecycle fields (`is_active`, `npi_status`) are computed from deactivation and reactivation dates to handle NPPES's sparse status flagging.

### Data Freshness

Check when data was last refreshed using `PRODUCT_INFO`:

```sql
SELECT product_name, provider_count, address_count,
       enrichment_coverage_pct, last_updated_at, version
FROM DWV.PRODUCT_INFO;
```

## Getting Started

### Platform Schema Reference

This dataset is available on Snowflake Marketplace. Queries use a schema-only reference — the database is set by the share context:

| Platform      | Schema | Example         |
| ------------- | ------ | --------------- |
| **Snowflake** | `DWV`  | `DWV.PROVIDERS` |

### Discover Available Data

Start with `PRODUCT_INFO` for a product-wide snapshot and `DATA_DICTIONARY` for column-level descriptions on every view.

{% tabs %}
{% tab title="Snowflake" %}

```sql
-- Product snapshot: freshness and enrichment coverage
SELECT product_name, provider_count, address_count,
       enrichment_coverage_pct, duplicate_pairs_count,
       active_provider_count, last_updated_at, version
FROM DWV.PRODUCT_INFO;
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="Snowflake" %}

```sql
-- Column-level data dictionary for every consumer-facing view
SELECT table_name, column_name, data_type, column_description
FROM DWV.DATA_DICTIONARY
WHERE table_name = 'PROVIDERS'
ORDER BY ordinal_position;
```

{% endtab %}
{% endtabs %}

### Foundational Queries

#### Specialty breakdown with Medicare enrollment rate

{% tabs %}
{% tab title="Snowflake" %}

```sql
SELECT specialty_description,
       COUNT(*) AS total_providers,
       SUM(CASE WHEN is_individual THEN 1 ELSE 0 END) AS individuals,
       SUM(CASE WHEN is_organization THEN 1 ELSE 0 END) AS organizations,
       SUM(CASE WHEN medicare_enrolled THEN 1 ELSE 0 END) AS medicare_enrolled
FROM DWV.PROVIDERS
WHERE is_active AND is_canonical
GROUP BY specialty_description
ORDER BY total_providers DESC
LIMIT 20;
```

{% endtab %}
{% endtabs %}

### Revenue Cycle: Reduce Claim Denials from Bad NPI Data

See original CMS addresses next to USPS-corrected addresses side by side — the records that would have triggered denials on the raw NPPES feed.

{% tabs %}
{% tab title="Snowflake" %}

```sql
SELECT p.npi, p.provider_full_name, p.specialty_description,
       a.practice_city AS original_city, a.practice_state AS original_state,
       a.validated_city, a.validated_state, a.validated_zip,
       a.address_validation_status
FROM DWV.PROVIDERS p
JOIN DWV.PROVIDERS_ADDRESSES a ON p.address_id = a.address_id
WHERE a.address_validation_status = 'corrected'
ORDER BY p.npi
LIMIT 25;
```

{% endtab %}
{% endtabs %}

### Spatial and Demographic Analytics

#### Market sizing: provider density by metro with ACS demographics

{% tabs %}
{% tab title="Snowflake" %}

```sql
SELECT a.metro_area_name,
       COUNT(DISTINCT p.npi) AS providers,
       a.median_household_income,
       a.total_population,
       ROUND(COUNT(DISTINCT p.npi) * 10000.0
             / NULLIF(a.total_population, 0), 2) AS providers_per_10k,
       a.is_in_hpsa
FROM DWV.PROVIDERS p
JOIN DWV.PROVIDERS_ADDRESSES a ON p.address_id = a.address_id
WHERE p.is_active AND p.is_canonical
  AND a.metro_area_name IS NOT NULL
GROUP BY a.metro_area_name, a.median_household_income,
         a.total_population, a.is_in_hpsa
ORDER BY providers DESC
LIMIT 20;
```

{% endtab %}
{% endtabs %}

#### Network adequacy: providers within 10 miles of an anchor NPI (native `geo_point`)

Anchor the search to a real provider's practice location and find every other provider within a radius. Swap `anchor_npi` for any NPI to re-target the search.

{% tabs %}
{% tab title="Snowflake" %}

```sql
-- Anchor: find everyone within 10 miles of a reference provider's practice location.
WITH anchor AS (
    SELECT a.geo_point,
           p.provider_full_name || ' (' || a.validated_city || ', ' || a.validated_state || ')' AS anchor_label
    FROM DWV.PROVIDERS p
    JOIN DWV.PROVIDERS_ADDRESSES a ON p.address_id = a.address_id
    WHERE p.npi = 1003000126   -- anchor_npi: swap for any NPI
)
SELECT p.npi, p.provider_full_name, p.specialty_description,
       a.validated_city, a.validated_state,
       anchor.anchor_label,
       ROUND(ST_DISTANCE(a.geo_point, anchor.geo_point) / 1609.344, 1) AS distance_miles
FROM DWV.PROVIDERS p
JOIN DWV.PROVIDERS_ADDRESSES a ON p.address_id = a.address_id
CROSS JOIN anchor
WHERE p.is_active AND p.is_canonical
  AND p.npi != 1003000126
  AND ST_DWITHIN(a.geo_point, anchor.geo_point, 16093)   -- 16,093 meters = 10 miles
ORDER BY distance_miles
LIMIT 25;
```

{% endtab %}
{% endtabs %}

### TEFCA and FHIR-Ready Interop

#### Drop-in FHIR resources for EHR, HIE, and QHIN pipelines

Three FHIR R4.0.1 resources per NPI — a `Practitioner` or `Organization`, a `PractitionerRole` for Medicare enrollment, and a `Location`. Inspect the VARIANT in Snowsight's JSON viewer to verify conformance.

{% tabs %}
{% tab title="Snowflake" %}

```sql
SELECT npi,
       fhir_practitioner,
       fhir_practitioner_role,
       fhir_location
FROM DWV.PROVIDERS_FHIR
LIMIT 10;
```

{% endtab %}
{% endtabs %}

#### FHIR bulk export: one resource type per row for NDJSON streaming

Compose a FHIR bulk export bundle — UNION ALL the three resource types into a single NDJSON-shaped stream for QHIN or `$export` pipelines.

{% tabs %}
{% tab title="Snowflake" %}

```sql
SELECT 'Practitioner'     AS resource_type, fhir_practitioner      AS resource
FROM DWV.PROVIDERS_FHIR WHERE fhir_practitioner IS NOT NULL
UNION ALL
SELECT 'PractitionerRole' AS resource_type, fhir_practitioner_role AS resource
FROM DWV.PROVIDERS_FHIR WHERE fhir_practitioner_role IS NOT NULL
UNION ALL
SELECT 'Location'         AS resource_type, fhir_location          AS resource
FROM DWV.PROVIDERS_FHIR WHERE fhir_location IS NOT NULL
LIMIT 30;
```

{% endtab %}
{% endtabs %}

### Medicare Network Planning: PECOS-Enrolled Providers in HPSA Areas

Cross-reference PECOS Medicare enrollment with HRSA HPSA shortage designations to surface in-network providers where they're most needed.

{% tabs %}
{% tab title="Snowflake" %}

```sql
SELECT p.npi, p.provider_full_name, p.specialty_description,
       p.medicare_specialty_description, p.medicare_enrollment_state,
       a.validated_city, a.validated_state,
       a.hpsa_primary_care_score
FROM DWV.PROVIDERS p
JOIN DWV.PROVIDERS_ADDRESSES a ON p.address_id = a.address_id
WHERE p.is_active AND p.is_canonical
  AND p.medicare_enrolled
  AND a.is_in_hpsa
ORDER BY a.hpsa_primary_care_score DESC
LIMIT 25;
```

{% endtab %}
{% endtabs %}

### Roster Dedup Without Losing Signal

#### High-confidence duplicate NPI pairs with similarity scores

Tiered-confidence duplicate pairs with match type and recommended primary NPI — safe to use for roster collapse without losing historical claim lineage.

{% tabs %}
{% tab title="Snowflake" %}

```sql
SELECT d.npi_1, d.npi_2, d.confidence_tier, d.similarity_score,
       d.match_type, d.entity_class,
       COALESCE(p1.provider_last_name, p1.provider_organization_name) AS name_1,
       COALESCE(p2.provider_last_name, p2.provider_organization_name) AS name_2,
       d.recommended_primary
FROM DWV.PROVIDERS_DUPLICATES d
JOIN DWV.PROVIDERS p1 ON d.npi_1 = p1.npi
JOIN DWV.PROVIDERS p2 ON d.npi_2 = p2.npi
WHERE d.confidence_tier = 'high_confidence'
ORDER BY d.similarity_score DESC
LIMIT 25;
```

{% endtab %}
{% endtabs %}

### HIPAA Compliance: OIG-Excluded Providers Still Active in NPPES

OIG/LEIE-flagged NPIs that are still showing as active in NPPES — pull this before billing or claim submission.

{% tabs %}
{% tab title="Snowflake" %}

```sql
SELECT p.npi, p.provider_full_name, p.specialty_description,
       p.exclusion_type_description, p.exclusion_date,
       p.is_active, p.npi_status
FROM DWV.PROVIDERS p
WHERE p.is_excluded
ORDER BY p.exclusion_date DESC
LIMIT 25;
```

{% endtab %}
{% endtabs %}

## Who Uses This Data

### Common Use Cases

* **Revenue cycle: reduce first-pass claim denials** — Replace raw NPPES with validated, USPS-corrected addresses, verified NPI lifecycle status, and PECOS-confirmed Medicare enrollment to cut denials caused by bad provider master data
* **Biennial revalidation and HIPAA compliance** — OIG/LEIE exclusion flags and revalidation-aware NPI status help meet the 2026 CMS biennial revalidation mandate and reduce exposure to HIPAA fines
* **TEFCA, QHIN, and HIE interoperability** — Query pre-built FHIR R4.0.1 `Practitioner`, `Organization`, `PractitionerRole`, and `Location` resources for drop-in TEFCA and bulk-export pipelines without building a FHIR transformer
* **Provider network design for VBC and ACO programs** — Build deduplicated provider networks with `PROVIDERS_ENTITIES`, cross-reference PECOS enrollment, and filter by HPSA shortage and congressional district
* **Network adequacy modeling and market sizing** — Use native `geo_point` GEOGRAPHY with `ST_DWITHIN` / `ST_DISTANCE` plus ACS demographics (median household income, population, education) and CBSA metro boundaries
* **Provider roster master data management (MDM)** — `PROVIDERS_DUPLICATES` and `PROVIDERS_ENTITIES` together provide a golden record of unique real-world providers with the signal (OIG, Medicare, quality) preserved across every multi-NPI cluster
* **Underserved-area provider analytics** — HRSA HPSA primary care, mental health, and dental shortage scores plus MUA/MUP designations layered on every provider for rural health, CMMI, and 340B analysis

### Related Datasets & Research

This dataset pairs well with:

* [**CMS NPPES Provider Dataset**](/data-catalog/cms-nppes-provider-dataset.md) — The raw source registry that this Golden Record is built on. Subscribe to the raw feed if you need the unprocessed relational tables (addresses, licenses, identifiers, taxonomies) for custom research; subscribe to the Golden Record if you want the same provider universe cleaned, deduped, enriched, and FHIR-ready
* [**HRSA Healthcare Resources Dataset**](/data-catalog/hrsa-dataset.md) — Source of the county-level HPSA, MUA, and AHRF workforce signals used here. Join to HRSA for workforce rate analysis by county
* [**CMS Data Feeds Dataset**](/data-catalog/cms-data-feeds-dataset.md) — Complete CMS data catalog including Care Compare and PECOS source feeds
* **TEFCA / QHIN implementations** — Drop-in FHIR R4.0.1 resources compatible with the ASTP TEFCA Common Agreement and the QHIN Directory schema

### Frequently Asked Questions

**How is the NPPES Provider Golden Record different from the raw CMS NPPES Provider Dataset?**\
The raw dataset is the unprocessed NPI registry as CMS publishes it weekly — addresses with typos and outdated ZIPs, no deduplication, no Medicare enrollment, no OIG exclusion flags, no FHIR mapping, no geo enrichment. The Golden Record is that same universe after USPS address validation (around 23% of addresses get corrected), geocoding with census and ACS enrichment, PECOS Medicare enrollment joins, OIG/LEIE exclusion checks, CMS Care Compare hospital and group affiliations, NUCC taxonomy gap-filling, tiered-confidence duplicate detection, and FHIR R4.0.1 resource generation. Most NPPES consumers eventually write their own cleanup pipeline — this dataset replaces that pipeline.

**How often is the data updated?**\
**Weekly, automatically.** Our pipeline monitors CMS for each new NPPES release and loads the updated, validated, and enriched data within days of publication — no manual intervention needed. Coverage spans every active and deactivated NPI in the registry (9M+ providers).

**What FHIR version is supported?**\
FHIR R4.0.1. Every row in `PROVIDERS_FHIR` contains a US Core-aligned `Practitioner` or `Organization` resource, a `PractitionerRole` for Medicare enrollment (NULL when not PECOS-enrolled), and a `Location` resource. The resources use standard `http://hl7.org/fhir/sid/us-npi` identifier systems and can be dropped directly into a QHIN, EHR, or HIE pipeline.

**Can I use the Snowflake GEOGRAPHY column for spatial queries?**\
Yes. `PROVIDERS_ADDRESSES.geo_point` is a native Snowflake GEOGRAPHY type built from the geocoded lat/long. Use it with `ST_DISTANCE`, `ST_DWITHIN`, `ST_CONTAINS`, and other Snowflake spatial functions for proximity, radius, and polygon containment queries without converting from lat/long on every query.

**How do I query "unique providers" without double-counting multi-NPI entities?**\
Use `PROVIDERS_ENTITIES`. It pre-collapses multi-NPI duplicate clusters into one row per real-world entity while preserving `any_oig_excluded` and `any_medicare_enrolled` rollups so compliance signals aren't silently deduped away. Alternatively, filter `PROVIDERS` with `is_canonical = TRUE` for a one-row-per-entity view that keeps the full column set.

**Are inactive or deactivated NPIs included?**\
Yes. NPPES never recycles an NPI and historical claims still reference deactivated NPIs, so the dataset preserves them. Use `is_active = TRUE` and `is_canonical = TRUE` for a clean active-provider roster, or query without those filters for historical analysis.

**What's the pricing and is there a trial?**\
Full access is $800/month on Snowflake Marketplace. A 14-day free trial includes 1,500 NPIs that are consistent across every view — the trial set is pre-materialized for fast evaluation queries. Metadata views (`PRODUCT_INFO`, `DATA_DICTIONARY`) are always fully available.

## Get Started

{% hint style="success" %}
**Ready to access the NPPES Provider Golden Record?**

Questions? [Contact our team](mailto:support@dataplex-consulting.com) for a walkthrough.
{% endhint %}

| Platform       | Action                                                                                       |
| -------------- | -------------------------------------------------------------------------------------------- |
| **Snowflake**  | [Get on Snowflake Marketplace →](https://app.snowflake.com/marketplace/listing/GZT1Z7QRT0IO) |
| **Free Trial** | [Start 14-Day Free Trial →](https://trial.dataplex-consulting.com)                           |

### About Dataplex

Dataplex Consulting & Data Products delivers turnkey, analytics-ready data products that make complex public and commercial data easy to use across modern data platforms. Our data pipelines include automated quality checks and active monitoring to ensure timely, reliable, and well-structured data that is ready for downstream analytics, machine learning, and operational use.

In addition to data products, Dataplex provides data engineering and analytics consulting services to organizations of all sizes. We bring deep, hands-on experience supporting both early-stage companies and large enterprises, helping teams build scalable data platforms, improve data reliability, and become more data-driven.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.dataplex-consulting.com/data-catalog/nppes-validated-dataset.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
