This material provides a quick tour of much of the data available from the Human Microbiome Project, but it is not an exhaustive inventory of all data sets and analysis products. Many approximations and generalizations are made for the sake of intelligibility. It is also focused on the subset of data products that are likely to be both tractable and interesting for the average researcher.
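The HMP is generating large amounts of genomic and metagenomic sequence data. There are two primary portals for accessing data: NCBI, which holds most of the raw sequence data, and the DCC, which hosts value-added sequence data.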
One way of organizing much (though not all) of the metagenomic sequence data generated under the project is to split it by cohort type and data type.
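There are two primary cohort types: the Center "Healthy Cohort" and the Demonstration Project "disease cohorts".
There are three primary data types: reference microbial genomes, mWGS metagenomic sequence, and 16S metagenomic sequence.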
The resulting division can be roughly represented by the following table:
| Data type | Center "Healthy Cohort" | Demonstration Project "disease cohorts" (NCBI BioProject 46305) |
|---|---|---|
| Reference microbial genomes (NCBI BioProject 28331) | ~1000 strains | Hundreds of strains |
| mWGS metagenomic sequence (NCBI BioProject 43017) | Subset of the 300 subjects, multiple timepoints, 15+ body sites | 5 projects, each with unique sampling sites, conditions, etc. |
| 16S metagenomic sequence (NCBI BioProject 48489) | 300 subjects, multiple timepoints, 15+ body sites | 14 projects, each with unique sampling sites, conditions, etc. 4 projects contain both 16S and mWGS components |
There are other data types being generated under the project and many nuances even within this approximate organization. All of the sequence data listed above is openly available for download. To protect subject privacy, data has been filtered to remove contaminating human sequence.
In addition to the metagenomic sequence data (mWGS and/or 16S), information about the human subjects, or metadata, was also collected. To protect subject privacy, those data are available only through NCBI's dbGaP to qualified researchers. "Qualified researchers" are defined as PI-level investigators at legitimate institutions who can describe how they plan to use the data and can follow a series of precautions to safeguard patient privacy. Detailed information on accessing private data is available at the NCBI dbGaP site.
Only the following clinical metadata are available outside of dbGaP, directly embedded in the sequence file metadata:
No approval is necessary to access these data.
Most of the raw sequence data reside at NCBI's Sequence Read Archive (SRA). The most straightforward way to identify all of the SRA data associated with a particular dataset is to enter through the BioProject pages referred to above. Each project-level BioProject page provides links to all associated SRA experiments (accession prefix: SRX). Alternatively, it is possible to begin in the SRA and search for all experiments that are linked to a given BioProject ID. Both processes can be performed manually through NCBI's website or programmatically by using E-utilities.
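The SRA-first route can be sketched with the E-utilities `esearch` endpoint. This is a minimal sketch, not an official HMP recipe: the function names are hypothetical, and it assumes that a numeric BioProject ID such as 43017 corresponds to the accession `PRJNA43017` and that the SRA database accepts the `[BioProject]` search field. Note that `esearch` returns internal SRA record IDs, not SRX accessions; resolving them to accessions would need a further call (for example `esummary`), which is not shown here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_sra_search_url(bioproject_id, retmax=20, retstart=0):
    """Build an esearch URL for SRA records linked to one BioProject.

    Assumption: the numeric ID maps to the accession "PRJNA<id>".
    """
    params = {
        "db": "sra",
        "term": f"PRJNA{bioproject_id}[BioProject]",
        "retmode": "json",
        "retmax": retmax,      # number of IDs to return in this page
        "retstart": retstart,  # offset of the first ID, for paging
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def parse_esearch_result(payload):
    """Return (total hit count, list of record IDs) from an esearch JSON reply."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), list(result["idlist"])


def fetch_sra_ids(bioproject_id, retmax=20):
    """Run the search over the network and return (count, ids)."""
    with urlopen(build_sra_search_url(bioproject_id, retmax=retmax)) as response:
        return parse_esearch_result(response.read().decode("utf-8"))
```

Calling `fetch_sra_ids(43017)` would query the mWGS BioProject listed in the table above. For large result sets, the `retstart` parameter can be stepped in increments of `retmax` to page through all IDs.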
The DCC hosts value-added sequence data, with datasets representing numerous steps along common analysis paths. This is intended to allow researchers to begin analysis pipelines mid-stream, dedicating time to the areas they find most important.