A map of human genome variation from population-scale sequencing
The 1000 Genomes Project Consortium*
The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother–father–child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.
Understanding the relationship between genotype and phenotype is one of the central goals in biology and medicine.
The reference human genome sequence [1] provides a foundation for the study of human genetics, but systematic investigation of human variation requires full knowledge of DNA sequence variation across the entire spectrum of allele frequencies and types of DNA differences. Substantial progress has already been made. By 2008 the public catalogue of variant sites (dbSNP 129) contained approximately 11 million single nucleotide polymorphisms (SNPs) and 3 million short insertions and deletions (indels) [2–4]. Databases of structural variants (for example, dbVAR) indexed the locations of large genomic variants. The International HapMap Project catalogued both allele frequencies and the correlation patterns between nearby variants, a phenomenon known as linkage disequilibrium (LD), across several populations for 3.5 million SNPs [3,4]. These resources have driven disease gene discovery in the first generation of genome-wide association studies (GWAS), wherein genotypes at several hundred thousand variant sites, combined with the knowledge of LD structure, allow the vast majority of common variants (here, those with >5% minor allele frequency (MAF)) to be tested for association [4] with disease. Over the past 5 years, association studies have identified more than a thousand genomic regions associated with disease susceptibility and other common traits [5]. Genome-wide collections of both common and rare structural variants have similarly been tested for association with disease [6]. Despite these successes, much work is still needed to achieve a deep understanding of the genetic contribution to human phenotypes [7]. Once a region has been identified as harbouring a risk locus, detailed study of all genetic variants in the locus is required to discover the causal variant(s), to quantify their contribution to disease susceptibility, and to elucidate their roles in functional pathways.
Low-frequency and rare variants (here defined as 0.5% to 5% MAF, and below 0.5% MAF, respectively) vastly outnumber common variants...
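The frequency classes used in this text (common above 5% MAF, low-frequency from 0.5% to 5% MAF, rare below 0.5% MAF) can be made concrete with a short sketch. The function names, the treatment of the exact boundary values, and the allele counts in the usage lines are illustrative assumptions and are not taken from the project's analysis pipeline.

```python
def minor_allele_frequency(alt_count: int, total_alleles: int) -> float:
    """Return the minor allele frequency for a biallelic site.

    alt_count: number of sampled chromosomes carrying the non-reference allele.
    total_alleles: number of sampled chromosomes (two per diploid individual).
    """
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    if not 0 <= alt_count <= total_alleles:
        raise ValueError("alt_count must lie between 0 and total_alleles")
    freq = alt_count / total_alleles
    # The minor allele is whichever of the two alleles is less frequent.
    return min(freq, 1.0 - freq)


def classify_by_maf(maf: float) -> str:
    """Assign a frequency class using the thresholds stated in the text.

    Assumption: values of exactly 0.5% and exactly 5% are both placed in
    the low-frequency class, since the text gives that class as
    "0.5% to 5%" and defines common variants as those above 5%.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf > 0.05:
        return "common"
    if maf >= 0.005:
        return "low-frequency"
    return "rare"


# Hypothetical allele counts among 1,000 sampled chromosomes.
for alt in (3, 20, 100, 900):
    maf = minor_allele_frequency(alt, 1000)
    print(alt, round(maf, 4), classify_by_maf(maf))
```

Taking the minimum of the two allele frequencies means that a site where the non-reference allele is the majority (900 of 1,000 in the usage lines) is classified by the frequency of the rarer reference allele.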