CDC Open Data Product
About the Dataset
The CDC Open Data Product transforms and delivers over 1,300 high-quality, up-to-date public health datasets from the Centers for Disease Control and Prevention (CDC). It currently holds more than 500 GB of data and over 27,000 attributes, and it continues to grow, giving researchers, analysts, and data scientists access to a broad range of public health information. See the catalog for the full list of tables.
Dataset Features
Covers topics including infectious diseases, chronic conditions, health behaviors, and healthcare utilization
Continuous monitoring for new CDC datasets and updates
Rigorous quality assurance process ensuring data reliability
Schema evolution support to handle changes in source data structures
Data Quality and Maintenance
The CDC Open Data Product undergoes a rigorous quality assurance process:
Automated Checks: Each dataset batch is subject to automated QA checks before being made available.
Schema Evolution: The system automatically adapts to changes in source data schemas, ensuring data consistency over time.
Data Validation: Checks are performed to ensure data integrity, including row count validation and data type consistency.
Regular Updates: Datasets are updated based on the CDC's update frequency for each dataset.
Freshness Tracking: The last_refresh_timestamp in the dwv.datasets table indicates the most recent update for each dataset.
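As a rough illustration of the row count, data type, and freshness checks listed above, the Python sketch below shows one way such checks could be expressed. It is not the product's actual QA pipeline: the function names check_batch and is_stale, the sample rows, and the seven-day threshold are all assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def check_batch(rows, expected_row_count, expected_types):
    """Return a list of issues found in one batch; an empty list means it passed."""
    issues = []
    # Row count validation: compare loaded rows with the count expected from the source.
    if len(rows) != expected_row_count:
        issues.append(f"row count {len(rows)} != expected {expected_row_count}")
    # Data type consistency: every non-null value must match its declared type.
    for index, row in enumerate(rows):
        for column, expected in expected_types.items():
            value = row.get(column)
            if value is not None and not isinstance(value, expected):
                issues.append(
                    f"row {index}: {column} is {type(value).__name__}, "
                    f"expected {expected.__name__}"
                )
    return issues


def is_stale(last_refresh_timestamp, max_age, now=None):
    """True when the last refresh is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - last_refresh_timestamp > max_age


# Made-up sample batch: the second row carries a string where an integer is expected.
rows = [{"state": "GA", "deaths": 12}, {"state": "TX", "deaths": "n/a"}]
issues = check_batch(rows, expected_row_count=3, expected_types={"state": str, "deaths": int})
for issue in issues:
    print(issue)

# Hypothetical last_refresh_timestamp value, checked against a fixed "now".
refreshed = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(is_stale(refreshed, timedelta(days=7), now=datetime(2024, 1, 10, tzinfo=timezone.utc)))
```

Returning a list of issues rather than raising on the first failure lets a batch report every problem at once, which is usually more useful when deciding whether to publish it.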
Business Applications
The CDC Open Data Product can be utilized in various business applications, including:
Public health research and analysis
Healthcare policy development
Epidemiological studies
Health risk assessment and management
Population health management
Healthcare resource allocation
Disease outbreak monitoring and prediction
Example Use Cases
COVID-19 Impact Analysis: Analyze trends in COVID-19 deaths across different age groups and regions.
Tobacco Consumption Trends: Track changes in tobacco consumption patterns over time and across states.
Bacterial Surveillance: Monitor the prevalence of invasive bacterial infections across different demographics.
Cardiovascular Disease Risk Assessment: Analyze risk factors and trends in cardiovascular diseases across populations.
Immunization Coverage Evaluation: Assess vaccination rates and their impact on disease outbreaks.
Data Structure
The CDC Open Data Product is organized into three main table types: two metadata tables in the DWV schema and one data table per CDC dataset:
datasets: Contains metadata about each CDC dataset.
datasets_batches: Provides information about data processing batches.
{table_name}: Individual dataset tables, one for each CDC dataset, within DWV_ schemas tied to the dataset category.
Common system columns across all datasets include:
id: Unique identifier for each record
dataset_id: Reference to the dataset metadata
batch_id: Reference to the processing batch
source_dataset_id: Original CDC dataset identifier
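To make the role of the system columns concrete, the sketch below traces one record back to its dataset metadata and processing batch. It uses Python's built-in sqlite3 module as a local stand-in for the warehouse. Only the table names datasets and datasets_batches and the four system columns come from the description above; the schema name dwv_example, the table some_dataset, the key columns datasets.id and datasets_batches.id, the name and created_at columns, and all row values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach in-memory databases so that schema-qualified names such as dwv.datasets resolve.
conn.execute("ATTACH DATABASE ':memory:' AS dwv")
conn.execute("ATTACH DATABASE ':memory:' AS dwv_example")  # hypothetical category schema
conn.executescript("""
CREATE TABLE dwv.datasets (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dwv.datasets_batches (id INTEGER PRIMARY KEY, dataset_id INTEGER, created_at TEXT);
CREATE TABLE dwv_example.some_dataset (
    id INTEGER PRIMARY KEY, dataset_id INTEGER, batch_id INTEGER,
    source_dataset_id TEXT, measure TEXT
);
-- Made-up rows for illustration only.
INSERT INTO dwv.datasets VALUES (1, 'Example dataset');
INSERT INTO dwv.datasets_batches VALUES (10, 1, '2024-01-02 00:00:00');
INSERT INTO dwv_example.some_dataset VALUES (500, 1, 10, 'abcd-1234', 'example value');
""")

# Follow dataset_id to the metadata row and batch_id to the batch that loaded the record.
lineage = conn.execute(
    """
    SELECT r.id, r.source_dataset_id, d.name, b.created_at
    FROM dwv_example.some_dataset AS r
    JOIN dwv.datasets AS d ON d.id = r.dataset_id
    JOIN dwv.datasets_batches AS b ON b.id = r.batch_id
    WHERE r.id = ?
    """,
    (500,),
).fetchone()
print(lineage)
```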
Entity Relationship Diagram
Sample Queries
List all available datasets:
Get the latest processing batch for a specific dataset:
Query COVID-19 related datasets:
Query the latest batch data for a single dataset:
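The four queries named above might look roughly like the following. This is a sketch, not the product's published SQL: it runs against an in-memory SQLite stand-in through Python's sqlite3 module, and apart from dwv.datasets, datasets_batches, last_refresh_timestamp, and the system columns, every name here (the dwv_example schema, the covid_deaths table, the name and created_at columns, the id key columns) and every row is an assumption made for illustration. Column names and SQL dialect details in the real product may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS dwv")
conn.execute("ATTACH DATABASE ':memory:' AS dwv_example")  # hypothetical category schema
conn.executescript("""
CREATE TABLE dwv.datasets (id INTEGER PRIMARY KEY, name TEXT, last_refresh_timestamp TEXT);
CREATE TABLE dwv.datasets_batches (id INTEGER PRIMARY KEY, dataset_id INTEGER, created_at TEXT);
CREATE TABLE dwv_example.covid_deaths (
    id INTEGER PRIMARY KEY, dataset_id INTEGER, batch_id INTEGER,
    source_dataset_id TEXT, age_group TEXT, deaths INTEGER
);
-- Made-up rows for illustration only.
INSERT INTO dwv.datasets VALUES
    (1, 'COVID-19 deaths by age group', '2024-01-09 00:00:00'),
    (2, 'Tobacco consumption by state', '2024-01-02 00:00:00');
INSERT INTO dwv.datasets_batches VALUES
    (10, 1, '2024-01-02 00:00:00'), (11, 1, '2024-01-09 00:00:00'), (20, 2, '2024-01-02 00:00:00');
INSERT INTO dwv_example.covid_deaths VALUES
    (1, 1, 10, 'abcd-1234', '65+', 100),
    (2, 1, 11, 'abcd-1234', '65+', 104),
    (3, 1, 11, 'abcd-1234', '18-64', 37);
""")

# 1. List all available datasets.
all_datasets = conn.execute(
    "SELECT id, name, last_refresh_timestamp FROM dwv.datasets ORDER BY name"
).fetchall()

# 2. Get the latest processing batch for a specific dataset.
latest_batch = conn.execute(
    "SELECT id, created_at FROM dwv.datasets_batches "
    "WHERE dataset_id = ? ORDER BY created_at DESC LIMIT 1",
    (1,),
).fetchone()

# 3. Find COVID-19 related datasets by name.
covid_datasets = conn.execute(
    "SELECT id, name FROM dwv.datasets WHERE name LIKE ?", ("%COVID-19%",)
).fetchall()

# 4. Query only the rows loaded by the latest batch of a single dataset.
latest_rows = conn.execute(
    """
    SELECT age_group, deaths
    FROM dwv_example.covid_deaths
    WHERE batch_id = (
        SELECT id FROM dwv.datasets_batches
        WHERE dataset_id = ? ORDER BY created_at DESC LIMIT 1
    )
    ORDER BY id
    """,
    (1,),
).fetchall()

print(all_datasets, latest_batch, covid_datasets, latest_rows, sep="\n")
```

Filtering on the latest batch_id, as in the fourth query, avoids mixing rows from older loads of the same dataset if earlier batches are retained.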
Support and Contact
For questions, support, or feedback regarding the CDC Open Data Product, please contact our data support team at support@dataplex-consulting.com.
About Dataplex
Dataplex Consulting & Data Products offers top-notch, turnkey data products, making data easily accessible for any business. Our data pipelines feature automatic quality checks and active monitoring, ensuring timely, clean, and high-quality data designed for seamless ingestion.
We also offer data consulting services to companies of all sizes. With 20+ years of experience serving small businesses and Fortune 500 companies, our team has gained a wealth of practical expertise in the field. Our track record shows success in enhancing data management, boosting revenue, and helping companies become more data-driven.