Bio-session package - Social Science Genetic Association Consortium
SSGAC June Biological Annotation

Introduction

Identifying and Resolving Targets
● Limitations of current GWAS
  Visscher, Peter M., et al. "Five years of GWAS discovery." The American Journal of Human Genetics.
● Functional annotation (Tõnu)
  HaploReg: www.broadinstitute.org/mammals/haploreg/haploreg.php
  SNPinfo: http://snpinfo.niehs.nih.gov
  SNPnexus: http://www.snp-nexus.org
  RegulomeDB: http://regulome.stanford.edu
● eQTL (Tõnu)
  GTEx Consortium: https://commonfund.nih.gov/GTEx
  Fu, Jingyuan, et al. "Unraveling the regulatory mechanisms underlying tissue-dependent genetic variation of gene expression." PLoS Genetics.
  RegulomeDB: http://regulome.stanford.edu
● 1000 Genomes and Sequencing (Tõnu)
  http://1000genomes.org and 1000 Genomes Consortium papers
● Gene-based tests (Jaime)
  VEGAS: http://gump.qimr.edu.au/VEGAS
  Liu, Jimmy Z., et al. "A versatile gene-based test for genome-wide association studies." American Journal of Human Genetics.

Describing Targets
● Identifying previous associations (Jaime)
  human: http://www.genome.gov/gwastudies; mouse: http://www.informatics.jax.org; zebrafish: http://zfin.org
● Pathway analysis (Jaime)
  INRICH: http://atgu.mgh.harvard.edu/inrich
  Lee, Phil H., et al. "INRICH: interval-based enrichment analysis for genome-wide association studies." Bioinformatics.
● Gene function prediction (Lude)
● Gene prioritization (Lude)
● Analysis of chromatin marks (Gosia)
  https://www.broadinstitute.org/mpg/epigwas
  Trynka, Gosia, et al. "Chromatin marks identify critical cell types for fine mapping complex trait variants." Nature Genetics.
  ENCODE links: http://www.nature.com/encode/#/threads, http://genome.ucsc.edu/ENCODE, http://www.roadmapepigenomics.org

Summary/Discussion

Tõnu Esko, Lude Franke, Jaime Derringer, Gosia Trynka

REVIEW

Five Years of GWAS Discovery

Peter M. Visscher,¹,²,* Matthew A. Brown,¹ Mark I. McCarthy,³,⁴ and Jian Yang⁵

The past five years have seen many scientific and biological discoveries made through the experimental design of genome-wide association studies (GWASs).
These studies were aimed at detecting variants at genomic loci that are associated with complex traits in the population and, in particular, at detecting associations between common single-nucleotide polymorphisms (SNPs) and common diseases such as heart disease, diabetes, auto-immune diseases, and psychiatric disorders. We start by giving a number of quotes from scientists and journalists about perceived problems with GWASs. We will then briefly give the history of GWASs and focus on the discoveries made through this experimental design, what those discoveries tell us and do not tell us about the genetics and biology of complex traits, and what immediate utility has come out of these studies. Rather than giving an exhaustive review of all reported findings for all diseases and other complex traits, we focus on the results for auto-immune diseases and metabolic diseases. We return to the perceived failure or disappointment about GWASs in the concluding section.

Introduction: Have GWASs Been a Failure?

In the past five years, genome-wide association studies (GWASs) have led to many scientific discoveries, and yet at the same time, many people have pointed to various problems and perceived failures of this experimental design. Let us begin by considering a number of criticisms that have been made against GWASs. We do not list these quotes to discredit any of the scientists or journalists involved, nor to deliberately cite them out of context. Rather, they serve to confirm that the points we discuss in this review are related to beliefs held by a significant number of scientific commentators and therefore warrant consideration.
From an interview with Sir Alec Jeffreys, ESHG Award Lecturer 2010:

“One of the great hopes for GWAS was that, in the same way that huge numbers of Mendelian disorders were pinned down at the DNA level and the gene and mutations involved identified, it would be possible to simply extrapolate from single gene disorders to complex multigenic disorders. That really hasn’t happened. Proponents will argue that it has worked and that all sorts of fascinating genes that predispose to or protect against diabetes or breast cancer, for example, have been identified, but the fact remains that the bulk of the heritability in these conditions cannot be ascribed to loci that have emerged from GWAS, which clearly isn’t going to be the answer to everything.”

From McClellan and King, Cell 2010¹:

“To date, genome-wide association studies (GWAS) have published hundreds of common variants whose allele frequencies are statistically correlated with various illnesses and traits. However, the vast majority of such variants have no established biological relevance to disease or clinical utility for prognosis or treatment.”

“An odds ratio of 3.0, or even of 2.0 depending on population allele frequencies, would be robust to such population stratification. However, odds ratios of the magnitude generally detected by GWAS (<1.5) can frequently be explained by cryptic population stratification, regardless of the p value associated with them.”

“More generally, it is now clear that common risk variants fail to explain the vast majority of genetic heritability for any human disease, either individually or collectively (Manolio et al., 2009).”

“The general failure to confirm common risk variants is not due to a failure to carry out GWAS properly. The problem is underlying biology, not the operationalization of study design. The common disease–common variant model has been the primary focus of human genomics over the last decade.
Numerous international collaborative efforts representing hundreds of important human diseases and traits have been carried out with large well-characterized cohorts of cases and controls. If common alleles influenced common diseases, many would have been found by now. The issue is not how to develop still larger studies, or how to parse the data still further, but rather whether the common disease–common variant hypothesis has now been tested and found not to apply to most complex human diseases.”

¹University of Queensland Diamantina Institute, Princess Alexandra Hospital, Brisbane, Queensland 4102, Australia; ²The Queensland Brain Institute, The University of Queensland, Brisbane, Queensland 4072, Australia; ³Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK; ⁴Oxford Centre for Diabetes, Endocrinology and Metabolism, Churchill Hospital Old Road, Headington Oxford OX3 7LJ, UK; ⁵Queensland Institute of Medical Research, 300 Herston Road, Brisbane, Queensland 4006, Australia
*Correspondence: [email protected]
DOI 10.1016/j.ajhg.2011.11.029. ©2012 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 90, 7–24, January 13, 2012

From Nicholas Wade in the New York Times, March 20 2011:

“More common diseases, like cancer, are thought to be caused by mutations in several genes, and finding the causes was the principal goal of the $3 billion human genome project. To that end, medical geneticists have invested heavily over the last eight years in an alluring shortcut. But the shortcut was based on a premise that is turning out to be incorrect. Scientists thought the mutations that caused common diseases would themselves be common. So they first identified the common mutations in the human population in a $100 million project called the HapMap. Then they compared patients’ genomes with those of healthy genomes.
The comparisons relied on ingenious devices called SNP chips, which scan just a tiny portion of the genome. (SNP, pronounced “snip,” stands for single nucleotide polymorphism.) These projects, called genome-wide association studies, each cost around $10 million or more. The results of this costly international exercise have been disappointing. About 2,000 sites on the human genome have been statistically linked with various diseases, but in many cases the sites are not inside working genes, suggesting there may be some conceptual flaw in the statistics. And in most diseases the culprit DNA was linked to only a small portion of all the cases of the disease. It seemed that natural selection has weeded out any disease-causing mutation before it becomes common.”

From Tim Crow, Molecular Psychiatry 2011²:

“There comes a point at which the genetic skeptic can be pardoned the suggestion that if the genes are so small and so multiple, what they are hardly matters, the dividing line between polygenes and no genes is of little practical consequence. Have we reached this point?”

From a commentary article by Jonathan Latham, on guardian.co.uk, 17 April 2011:

“Among all the genetic findings for common illnesses, such as heart disease, cancer and mental illnesses, only a handful are of genuine significance for human health. Faulty genes rarely cause, or even mildly predispose us, to disease, and as a consequence the science of human genetics is in deep crisis. Since the Collins paper [Manolio et al. 2009³] was published nothing has happened to change that conclusion. It now seems that the original twin-study critics were more right than they imagined. The most likely explanation for why genes for common diseases have not been found is that, with few exceptions, they do not exist.”

These quotes raise a number of different issues about the methodology, research outcomes, and utility of the research findings.
The pertinent points made in these quotes are: (1) GWASs are founded on a flawed assumption that genetics plays an important role in the risk to common diseases; (2) GWASs have been disappointing in not explaining more genetic variation in the population; (3) GWASs have not delivered meaningful, biologically relevant knowledge or results of clinical or any other utility; and (4) GWAS results are spurious.

In this review we will briefly give the history of GWASs and then focus on the discoveries made through this experimental design, what those discoveries tell us and do not tell us about the genetics and biology of complex traits, and what immediate utility has come out of these studies. We will focus on the results for auto-immune diseases and metabolic diseases, although there have been important findings for other diseases and complex traits. In the concluding section, we will again consider the perceived failure or disappointment of GWASs.

What Are GWASs, and How Did We Get There?

Attempts to use linkage analysis to map genomic loci that have an effect on disease or other complex traits have been ubiquitous in the last two decades. Gene mapping by linkage relies on the cosegregation of causal variants with marker alleles within pedigrees. We define and discuss what we mean by “causal” in Box 1. Because the number of recombination events per meiosis is relatively small, tagging a causal variant requires only a few genetic markers per chromosome. The downside of the small number of recombination events is that the mapping resolution, i.e., how close to the causal variant one can get through linked markers, is typically low. Linkage mapping has been extremely successful in mapping genes and gene variants affecting Mendelian traits (e.g., single-gene disorders).⁴ Mapping loci underlying common diseases and, in particular, identifying causative mutations have had much less success.
There are many reasons for the failure of linkage analyses to reliably identify complex-trait loci in human pedigrees. One reason is that the effect sizes (“penetrance”) of individual causal variants are too small to allow detection via cosegregation within pedigrees.

GWASs are based upon the principle of linkage disequilibrium (LD) at the population level. LD is the nonrandom association between alleles at different loci. It is created by evolutionary forces such as mutation, drift, and selection and is broken down by recombination.⁵ Generally, loci that are physically close together exhibit stronger LD than loci that are farther apart on a chromosome. The larger the (effective) population size, the weaker the LD for a given distance.⁶ (Linkage analysis exploits the large LD within pedigrees.) The genomic distance at which LD decays determines how many genetic markers are needed to “tag” a haplotype, and the number of such tagging markers is much smaller than the total number of segregating variants in the population. For example, a selection of approximately 500,000 common SNPs in the human genome is sufficient to tag common variation in non-African populations, even though the total number of common SNPs exceeds 10 million.⁷

Geneticists realized some time ago that they could exploit population-based LD to map genes. For example, Bodmer suggested in 1986 that fine-mapping using population association could lead to closer linkage between a causative mutation and a linked marker.⁸² However, fine-mapping still relied on having an initial genomic location that is obtained from linkage analysis in family studies. What if we do not have any prior information on genomic loci or, alternatively, we deliberately want an unbiased scan of the genome? In a landmark paper, Risch and Merikangas⁸³ showed that performing an association scan involving one million variants in the genome and a sample of unrelated individuals could be more powerful than performing a linkage analysis with a few hundred markers. It took only 10 years before this theoretical design became reality. What was needed was the discovery (accelerated by the sequencing of the human genome) of hundreds of thousands of single-nucleotide variants, the quantification of the correlation (LD) structure of those markers in the human genome, and the ability to accurately genotype hundreds of thousands of markers in an automated and affordable manner. The LD structure was investigated in the HapMap project,⁷ and the outcome was a list of tag SNPs that captured most of the common genomic variation in a number of human populations. Concurrently, commercial companies produced dense SNP arrays that could genotype many markers in a single assay. The technological advances together with biobanks of either population cohorts or case-control samples facilitated the ability to conduct GWASs.

Although GWASs are unbiased with respect to prior biological knowledge (or prior beliefs) and with respect to genome location, they are not unbiased in terms of what is detectable. GWASs rely on LD between genotyped SNPs and ungenotyped causal variants. The strength of statistical association between alleles at two loci in the genome strongly depends on their allele frequencies, such that a rare variant (say, one with a frequency <0.01) will be in low LD (as measured by r²) with a nearby common variant, even if they map to the same recombination interval.⁸⁴ But the SNPs that are on the SNP chips have been selected to be common (most have a minor allele frequency >0.05). Therefore, GWASs are by design powered to detect association with causal variants that are relatively common in the population. Is it realistic to assume common causal variants for disease segregate in the population? This is discussed in Box 2.

Box 1. What Is a Causal Variant?

New mutations that contribute to an increase or decrease in risk to disease arise in populations all the time. Some of these mutations can reach an appreciable frequency in the population, for example by random drift or by natural selection. As discussed in the main text, these mutations will be associated with other variants in the genome through LD. Such associations will include those with SNPs that are genotyped on “SNP chips.” Because there are many more segregating variants in the population than those genotyped in GWASs, it is unlikely, but not impossible, that a mutation is genotyped itself, and so its effect usually will be detected through an association with a genotyped variant. This genotyped variant can be robustly associated with disease in multiple samples from the same population, or even across populations, but it is not the mutation that causes variation in risk. The results from GWASs have shown that variants at many genetic loci in the genome are associated with disease, and these also reflect many ancestral mutations with an effect on susceptibility to disease. Therefore, the effect size (in terms of increasing or decreasing the absolute probability of disease) is, on average, small, and individual variants are neither necessary nor sufficient to cause disease.

Herein lies the problem of defining “causal”: How do we prove that a particular mutation causes the observed effect on variation in the population? Engineering the same mutation in a cell or animal model might give a relevant phenotype, but that is not a proof. The mutation can have a direct effect on gene expression in human tissues or be functional in another way, but that doesn’t prove it has a causal effect on disease risk. Operationally, in this review what we mean by “causal variant” is an (unknown) variant that has a direct or indirect functional effect on disease risk, rather than a variant that is associated with disease risk through LD, even if we don’t have the tools available at present to prove causality beyond reasonable doubt. Hence, it is the variant that causes the observed association signal.

Box 2. The CDCV Hypothesis

Currently, the allele frequency of variants that contribute to cause common disease is a subject of some debate.⁸⁵,⁸⁶ The common disease-common variant (CDCV) hypothesis is sometimes said to be one side of this debate; the other side holds that disease-causing alleles are typically rare. But what is the precise “hypothesis” in the CDCV hypothesis? We tried to find the origin of the CDCV hypothesis. Many researchers cite either Lander⁸⁷ or Risch and Merikangas.⁸³ We will add Chakravarti⁸⁸ and Reich and Lander⁸⁹ as key studies.

Lander⁸⁷ noted from the then-available data that there is a limited diversity in coding regions at genes, in that most variants are very rare, and therefore the effective number of alleles is small. In addition, he provided “tantalizing examples” of common alleles with large effects (for example, such alleles include APOE [MIM 107741], MTHFR [MIM 607093], and ACE [MIM 106180]). Reich and Lander⁸⁹ presented a theoretical population-genetics model that predicted a relatively simple spectrum of the frequency of disease risk alleles at a particular disease locus. They (re)phrased the CDCV hypothesis as the prediction that the expected allelic identity is high for those disease loci that are responsible for most of the population risk for disease. These studies did not appear to make any prediction about the number of disease loci or, therefore, about the effect size. What the authors stated was that if a disease was common, there was likely to be one disease-causing allele that was much more common than all the other disease-causing alleles at the same locus.⁸⁷,⁸⁹

Risch and Merikangas⁸³ quantified two important points regarding the detection of disease loci: first, that detection by association is more powerful than linkage when the genotype-relative risk is modest or small and the risk-allele frequency is large (say, >10%); and second, that the multiple-testing burden of a genome scan by association does not prevent the detection of genome-wide-significant findings. This paper was essentially about experimental design and statistical power (and hence feasibility), not about the CDCV hypothesis as such. Finally, Chakravarti⁸⁸ pointed out that if individuals with disease needed to be homozygous for risk variants at multiple loci, then the risk alleles at those loci must be more common than they would be in a model in which homozygosity at any risk locus is sufficient to cause disease. We note that without the assumption of strong epistasis on the scale of liability, there is no need for risk variants to be common. For example, Risch’s multilocus multiplicative model,⁹⁰ which implies an additive model on the log (risk) scale (it is one of the “exchangeable” models⁹¹), does not rely on a particular allelic spectrum of risk-allele frequencies.

What all these landmark papers have in common is a remarkable foresight in predicting the GWAS era well before the publication of the full draft of the human genome sequence, the HapMap project, or the availability of commercial genotyping. But what can we conclude about the origin and specifics of the CDCV hypothesis? As implicitly or explicitly stated in these key papers, there is no strong prediction about the exact allele-frequency spectrum of risk variants in the genome, nor a prediction about the effect size at any disease loci and hence about the total number of risk alleles in the genome.

The current debate is about the frequency spectrum of disease-causing alleles. Phrasing the debate as an either/or question is not very helpful because examples of both common and rare alleles are already known, but there is still an open question as to whether most genetic variation contributing to complex traits in the population is caused by rare variants or common variants. A more general question regards the spectrum of allele frequencies of disease-causing alleles and the joint distribution between risk-allele frequency and effect size. In the special case of an evolutionarily neutral model and a constant effective population size, most causal variants that are segregating in the population will be rare, but most heritability will be due to common variants.⁷⁹,⁹² The reason for this apparent paradox is that the number of segregating variants is proportional to 1/[p(1 − p)], where p is the allele frequency of a risk-increasing allele (so the smaller p, the more variants of that frequency), whereas the heritability contributed at that frequency is proportional to p(1 − p). The net effect is that the heritability is distributed equally over all frequencies, and cumulatively most heritability is contributed by common variants.

(Nearly) Five Years of Discovery

Although the first results from a GWAS were reported in 2005⁸ and 2006,⁹ we take the 2007 Wellcome Trust Case Control Consortium (WTCCC) paper in Nature¹⁰ as a starting point. The reason for this is that the WTCCC study was the first large, well-designed GWAS for complex diseases to employ a SNP chip that had good coverage of the genome. There are many ways to summarize the discoveries based on GWASs in the last five years. We have tried to separate the discoveries quantitatively and to focus on the biology. There are now well over 2000 loci that are significantly and robustly associated with one or more complex traits (see GWAS catalog in Web Resources), as shown in Figure 1. The vast majority of the loci identified are new, i.e., before 2007 their association with disease or other complex traits was not known. Essentially, these are 2000 new biological leads.

Figure 1. GWAS Discoveries over Time
Data obtained from the Published GWAS Catalog (see Web Resources). Only the top SNPs representing loci with association p values < 5 × 10⁻⁸ are included, and so that multiple counting is avoided, SNPs identified for the same traits with LD r² > 0.8 estimated from the entire HapMap samples are excluded.

The number of loci identified per complex trait varies substantially, from a handful for psychiatric diseases to a hundred or more for inflammatory bowel disease (IBD1 [MIM 266600], including Crohn disease [CD]¹¹ and ulcerative colitis [UC]¹²) and stature.¹³ Importantly, the number of discovered variants is strongly correlated with experimental sample size (Figure 2), which predicts that an ever-increasing discovery sample size will increase the number of discovered variants: very roughly, after a minimum sample-size threshold below which no variants are detected is reached, a doubling in sample size leads to a doubling of the number of associated variants discovered. The proportion of genetic variation explained by significantly associated SNPs is usually low (typically less than 10%) for many complex traits, but for diseases such as CD and multiple sclerosis (MS [MIM 126200]), and for quantitative traits such as height and lipid traits, between 10% and 20% of genetic variance has been accounted for (Table 1).
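As a rough check on the proportions quoted above, the share of pedigree-based heritability captured by genome-wide-significant hits, and by all GWAS SNPs together, can be computed from a few rows of Table 1. The sketch below is illustrative only: the names `TABLE1_ROWS`, `midpoint`, and `share_of_heritability` are hypothetical, the citation superscripts are dropped from the table values, and summarising a reported range by its midpoint is an assumption made here for convenience, not something the review does.

```python
# Selected rows of Table 1, entered as (low, high) bounds. A single reported
# value is entered with low == high. Citation superscripts are omitted.
TABLE1_ROWS = {
    "Height": {
        "pedigree": (0.8, 0.8), "gwas_hits": (0.1, 0.1), "all_snps": (0.5, 0.5)},
    "Crohn's disease": {
        "pedigree": (0.6, 0.8), "gwas_hits": (0.1, 0.1), "all_snps": (0.4, 0.4)},
    "Schizophrenia": {
        "pedigree": (0.7, 0.8), "gwas_hits": (0.01, 0.01), "all_snps": (0.3, 0.3)},
    "Obesity (BMI)": {
        "pedigree": (0.4, 0.6), "gwas_hits": (0.01, 0.02), "all_snps": (0.2, 0.2)},
}


def midpoint(bounds):
    """Midpoint of a (low, high) range (an assumption used to summarise ranges)."""
    low, high = bounds
    return (low + high) / 2.0


def share_of_heritability(row, column):
    """Fraction of pedigree heritability captured by `column`.

    `column` is either 'gwas_hits' or 'all_snps'.
    """
    return midpoint(row[column]) / midpoint(row["pedigree"])


if __name__ == "__main__":
    for trait, row in TABLE1_ROWS.items():
        hits = share_of_heritability(row, "gwas_hits")
        tagged = share_of_heritability(row, "all_snps")
        print(f"{trait:16s} hits: {hits:6.1%}   all SNPs: {tagged:6.1%}")
```

With these inputs, the genome-wide-significant hits account for less than a fifth of the pedigree heritability in every row, whereas all GWAS SNPs considered together account for roughly 40% to just over 60%.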
In comparison to the pre-GWAS era, the proportion of genetic variation accounted for by newly discovered variants that are segregating in the population is large. It is clear that for most complex traits that have been investigated by GWAS, multiple identified loci have genome-wide statistical significance, and thus it is likely that there are (many) other loci that have not been identified because of a lack of statistical significance (false negatives). Recently, researchers have developed and applied methods to quantify the proportion of phenotypic variation that is tagged when one considers all SNPs simultaneously.¹²–¹⁴ These methods focus on estimation rather than hypothesis testing and do not suffer from false negatives caused by small effect sizes.¹⁵ Whole-genome approaches to estimating genetic variation have shown that approximately one-third to one-half of additive genetic variation in the population is being tagged when all GWAS SNPs are considered simultaneously.¹²–¹⁴ This is a surprisingly large proportion given that evolutionary theory predicts that most variants affecting disease risk ought to be found at a low frequency in the population if they affect fitness,¹⁶,¹⁷ and such risk variants would not be in sufficient LD with the common SNPs to be detected in GWASs.

Figure 2. Increase in Number of Loci Identified as a Function of Experimental Sample Size
(A) Selected quantitative traits.
(B) Selected diseases.
The coordinates are on the log scale. The complex traits were selected with the criteria that there were at least three GWAS papers published on each in journals with a 2010–2011 journal impact factor >9 (e.g., Nature, Nature Genetics, the American Journal of Human Genetics, and PLoS Genetics) and that at least one paper contained more than ten genome-wide significant loci. These traits are a representative selection among all complex traits that fulfilled these criteria.

Table 1. Population Variation Explained by GWAS for a Selected Number of Complex Traits

Trait or Disease | h² Pedigree Studies | h² GWAS Hitsᵃ | h² All GWAS SNPsᵇ
Type 1 diabetes | 0.9⁹⁸ | 0.6⁹⁹ | 0.3¹²
Type 2 diabetes | 0.3–0.6¹⁰⁰ | 0.05–0.10³⁴ |
Obesity (BMI) | 0.4–0.6¹⁰¹,¹⁰² | 0.01–0.02³⁶ | 0.2¹⁴
Crohn’s disease | 0.6–0.8¹⁰³ | 0.1¹¹ | 0.4¹²
Ulcerative colitis | 0.5¹⁰³ | 0.05¹² |
Multiple sclerosis | 0.3–0.8¹⁰⁴ | 0.1⁴⁵ |
Ankylosing spondylitis | >0.90¹⁰⁵ | 0.2¹⁰⁶,ᶜ |
Rheumatoid arthritis | 0.6¹⁰⁷ | |
Schizophrenia | 0.7–0.8¹⁰⁸ | 0.01⁷⁹ | 0.3¹⁰⁹
Bipolar disorder | 0.6–0.7¹⁰⁸ | 0.02⁷⁹ | 0.4¹²
Breast cancer | 0.3¹¹⁰ | 0.08¹¹¹ |
Von Willebrand factor | 0.66–0.75¹¹²,¹¹³ | 0.13¹¹⁴ | 0.25¹⁴
Height | 0.8¹¹⁵,¹¹⁶ | 0.1¹³ | 0.5¹³,¹⁴
Bone mineral density | 0.6–0.8¹¹⁷ | 0.05¹¹⁸ |
QT interval | 0.37–0.60¹¹⁹,¹²⁰ | 0.07¹²¹ | 0.2¹⁴
HDL cholesterol | 0.5¹²² | 0.1⁵⁷ |
Platelet count | 0.8¹²³ | 0.05–0.1⁵⁸ |

ᵃ Proportion of phenotypic variance or variance in liability explained by genome-wide-significant and validated SNPs. For a number of diseases, other parameters were reported, and these were converted and approximated to the scale of total variation explained. Blank cells indicate that these parameters have not been reported in the literature.
ᵇ Proportion of phenotypic variance or variance in liability explained when all GWAS SNPs are considered simultaneously. Blank cells indicate that these parameters have not been reported in the literature.
ᶜ Includes pre-GWAS loci with large effects.

Autoimmune Diseases

We concentrate on seven auto-immune diseases: ankylosing spondylitis (AS [MIM 106300]), rheumatoid arthritis (RA [MIM 180300]), systemic lupus erythematosus (SLE [MIM 152700]), type 1 diabetes (T1D [MIM 222100]), MS, CD, and UC. Table 2 summarizes the number of genes that have been identified for these diseases. Across these diseases, 19 loci (mainly related to human leukocyte antigen) were known prior to 2007, and 277 have been discovered from 2007 onward. The total of 277 includes multiple counts of loci that have been implicated across a number of diseases; such loci include BLK (MIM 191305), TNFAIP3 (MIM 191163) and CD40 (MIM 109535). Inflammatory bowel disease (IBD, not to be confused here with identity by descent) is thought to arise from dysregulation of intestinal homeostasis.¹⁸ GWASs of IBD (CD and UC) have been highly successful in terms of the number of loci identified (99 nonoverlapping loci in total¹⁸), and a substantial proportion of familial risk, about 20%, has been accounted for.¹¹,¹²,¹⁸ Twenty-eight risk loci are shared between CD and UC, despite the fact that these diseases display distinct clinical features, and it has been suggested that the two diseases share pathways and are part of a mechanistic continuum.¹⁸ There are also strong overlaps between genes involved in CD and UC, AS,¹⁹ and psoriasis (MIM 177900), again suggesting shared aetiopathogenic mechanisms in these conditions.
Pleiotropic genetic effects are becoming increasingly widely identified, including in classical autoimmune diseases.²⁰ For example, a coding variant in the gene PTPN22 (MIM 600716) confers strong risk for T1D and RA as well as protection against CD.¹⁸

Metabolic Diseases

In terms of metabolic diseases, we focus here specifically on type 2 diabetes (T2D [MIM 125853]); fasting glucose and insulin levels; body-mass index (BMI) and obesity; and fat distribution. A recent review²¹ already covered these complex traits, but we have updated that review wherever necessary. Table 3 gives an overview of the number of loci identified. More than 20 major GWASs for T2D have been published to date,²¹–²⁴ and there has been a cumulative tally of around 50 genome-wide-significant hits,²¹,²³,²⁴ only three of which were known before the GWAS era. Most of these studies have involved individuals of European descent; the latest published effort is from the DIAGRAM (Diabetes Genetics Replication and Meta-analysis) Consortium and includes more than 47,000 GWAS individuals and 94,000 samples for replication. More recently, equivalent studies have emerged from samples of East Asians,²³,²⁵–²⁷ South Asians,²² and Hispanics,²⁸,²⁹ and large studies involving African Americans and other major ethnic groups are underway. Notwithstanding differences in allele frequency and LD patterns, most of the signals found in one ethnic group show some evidence of association in others, indicating that the common-variant signals identified by GWASs are likely to be the result of widely distributed causal alleles that are of relatively high frequency. This is an important observation because it indicates that most of the GWAS-identified associations for T2D reflect high LD with a causal variant that has a small effect size rather than low LD with a causal variant that has a large effect size.
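The contrast between "high LD with a small-effect causal variant" and "low LD with a large-effect causal variant" can be made concrete with a back-of-the-envelope calculation. The sketch below is not taken from the review. It assumes a single causal variant with a purely additive effect, so that the variance explained at a marker is r² times the variance explained at the causal variant, and it uses the standard population-genetics upper bound on r² between two biallelic loci with given minor-allele frequencies. All numerical inputs (a marker explaining 0.1% of variance, frequencies of 0.30 and 0.01, an r² of 0.9) and all function names are hypothetical.

```python
import math


def max_r2(maf_a, maf_b):
    """Upper bound on r^2 between two biallelic loci, given minor-allele frequencies.

    The bound is reached when the rarer minor allele occurs only on haplotypes
    that carry the other locus's minor allele (complete LD, D' = 1).
    """
    for maf in (maf_a, maf_b):
        if not 0.0 < maf <= 0.5:
            raise ValueError("minor-allele frequencies must lie in (0, 0.5]")
    rare, common = sorted((maf_a, maf_b))
    return (rare * (1.0 - common)) / (common * (1.0 - rare))


def implied_causal_variance(marker_variance, r2):
    """Variance explained by the causal variant, given the marker's share and r^2."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return marker_variance / r2


def per_allele_effect_sd(variance_explained, allele_freq):
    """Additive per-allele effect (in phenotypic SD units) under Hardy-Weinberg."""
    return math.sqrt(variance_explained / (2.0 * allele_freq * (1.0 - allele_freq)))


if __name__ == "__main__":
    marker_variance = 0.001  # hypothetical: marker explains 0.1% of variance
    marker_maf = 0.30        # hypothetical common tag SNP

    # Scenario 1: common causal variant (same frequency as the marker) in high LD.
    v_common = implied_causal_variance(marker_variance, 0.9)
    print("common causal variant:", per_allele_effect_sd(v_common, marker_maf))

    # Scenario 2: rare causal variant, tagged as well as the frequencies allow.
    rare_maf = 0.01
    r2_bound = max_r2(rare_maf, marker_maf)
    v_rare = implied_causal_variance(marker_variance, r2_bound)
    print("best-case r^2 for rare variant:", r2_bound)
    print("rare causal variant:", per_allele_effect_sd(v_rare, rare_maf))
```

Under these assumptions the rare-variant scenario requires a per-allele effect more than an order of magnitude larger than the common-variant scenario to produce the same marker signal. Consistent replication across ethnic groups with different LD patterns is more easily explained by the common-variant scenario, which is the inference the text draws.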
The largest common-variant signal identified for T2D remains TCF7L2 (MIM 602228) (detected just prior to the GWAS era30), which has a per-allele odds ratio (OR) of around 1.35. The remaining signals detected by GWAS have allelic ORs in the range between 1.05 and 1.25. Collectively, the most strongly associated variants at these loci are estimated to explain around 10% of familial aggregation of T2D in European populations.

The MAGIC (Meta-Analysis of Glucose- and Insulin-Related Traits Consortium) investigators have been carrying out equivalent analyses focused on the identification of variants influencing variation in glucose and insulin levels in healthy nondiabetic individuals.31–33 Prior to the GWAS era, the only compelling association signal for fasting glucose levels was known at GCK (MIM 138079) (glucokinase),34 but GWASs in European samples (46,000 GWAS and 76,000 replication samples) have expanded that number to 16 signals.32 These variants explain around 10% of the inherited variation in fasting glucose levels. Only two signals (near GCKR [MIM 600842] and IGF1 [MIM 147440]) were shown to influence fasting insulin levels in the same analysis. Equivalent analyses for 2 h glucose33 (15,000 GWAS samples and up to 30,000 replication samples) identified further signals, including variants near the GIP (MIM 137240) receptor (GIPR [MIM 137241]).

Before the GWAS era, the only robust association between DNA sequence variation and either BMI or weight involved low-frequency variants in MC4R (MIM 155541).35 Now, there are more than 30. In the most recent study from the GIANT consortium,36 these analyses extended to almost 250,000 samples, half of them in the stage 1 GWAS, the remainder for replication. The largest signal remains that at FTO (MIM 610966),37 where the average between-homozygotes difference in weight is around 2.5 kg. The effects at other loci are smaller, and in combination, these variants explain no more than 1%–2% of overall variation in adult BMI (although this percentage rises to almost 20% if the analysis is extended to all GWA variants, not just those that reach genome-wide significance14).

As well as these studies of BMI and obesity in population samples, there have been several studies focused on extreme obesity phenotypes.38,39 The genome-wide-significant loci identified by these efforts only partially overlap with those emerging from population-based studies, raising the possibility that some of the most extreme cases of obesity are driven by highly penetrant, low-frequency variants. Variation at copy-number variants (CNVs) has some impact on BMI. This is true of common CNVs (the NEGR1 association seems likely to be driven by a common CNV40) and also rarer CNVs for which evidence is starting to accumulate (e.g., 16p CNV and effect on morbid obesity and developmental delay41).

The adverse metabolic effects of obesity depend not only on the overall level of adiposity but also on the distribution of fat around the body; visceral (abdominal) fat has particularly adverse consequences for overall health. GWASs of fat-distribution phenotypes (including waist circumference, waist:hip ratio, and body-fat percentage studied in close to 200,000 individuals) have revealed almost 20 loci with genome-wide significance40,42–44 and relatively little overlap with those loci influencing overall adiposity. As with BMI, the proportion of variance explained by these loci is small (around 1% after adjustment for BMI, age, and sex).

The American Journal of Human Genetics 90, 7–24, January 13, 2012

Table 2. Summary of GWAS Findings for Seven Autoimmune Diseases (a)
Disease | Prior to 2007: Number of Loci | Prior to 2007: Loci | 2007 Onward: Number of Loci | 2007 Onward: Some or All of the Loci
Ankylosing spondylitis | 1 | HLA-B27 | 13 | IL23R, ERAP1, 2p15, 21q22, CARD9 (MIM 607212), IL12B (MIM 161561), PTGER4 (MIM 601586), IL1R2 (MIM 147811), TNFR1, TBKBP1 (MIM 608476), ANTXR2 (MIM 608041), RUNX3 (MIM 600210), KIF21B (MIM 608322)
Rheumatoid arthritis | 3 | HLA-DRB1, PADI4, CTLA4 | 30 | AFF3 (MIM 601464), BLK, CCL21 (MIM 602737), CD2/CD58 (MIM 186990/153420), CD28, CD40, FCGR2A (MIM 146790), HLA-DRB1, IL2/IL21 (MIM 147680/605384), IL2RA, IL2RB (MIM 146710), KIF5A/PIP4K2C, PRDM1 (MIM 603423), PRKCQ (MIM 600448), PTPRC (MIM 151460), REL (MIM 164910), STAT4 (MIM 600558), TAGAP, TNFAIP3, TNFRSF14, TRAF1/C5 (MIM 120900/601711), TRAF6 (MIM 602355), IL6ST (MIM 600694), SPRED2 (MIM 609292), RBPJ (MIM 147183), CCR6 (MIM 601835), IRF5 (MIM 607218), PXK (MIM 611450)
Systemic lupus erythematosus | 3 | HLA, PTPN22, IRF5 (MIM 607218) | 31 | BANK1 (MIM 610292), BLK (MIM 191305), C1q, C2 (MIM 613927), C4A/B (MIM 120820/120810), CRP (MIM 123260), ETS1 (MIM 164720), FcGR2A–FcGR3A (MIM 146790/146740), FcGR3B (MIM 610665), HIC2-UBE2L3 (MIM 607712/603721), IKZF1 (MIM 603023), IL10 (MIM 124092), IRAK1 (MIM 300283), ITGAM–ITGAX (MIM 120980/151510), JAZF1, KIAA1542/PHRF1, LRRC18-WDFY4, LYN (MIM 165120), NMNAT2 (MIM 608701), PRDM1 (MIM 603423), PTTG1 (MIM 604147), PXK (MIM 611450), RASGRP3 (MIM 609531), SLC15A4, STAT1 (MIM 600555), TNFAIP3, TNFSF4 (MIM 603594), TNIP1 (MIM 607714), TREX1 (MIM 606609), UHRF1BP1, XKR6
Type 1 diabetes | 4 | HLA, INS (MIM 176730), PTPN22, CTLA4 | 40 | RGS1, IL18RAP (MIM 604509), IFIH1 (MIM 606951), CCR5 (MIM 601373), IL2 (MIM 147680), IL7R, MHC, BACH2 (MIM 605394), TNFAIP3, TAGAP, IL2RA, PRKCQ (MIM 600448), INS (MIM 176730), ERBB3 (MIM 190151), 12q13.3, SH2B3 (MIM 605093), CTSH (MIM 116820), CLEC16A (MIM 611303), PTPN2 (MIM 176887), CD226 (MIM 605397), UBASH3A (MIM 605736), C1QTNF6, IL10 (MIM 124092), 4p15.2, C6orf173, 7p15.2, COBL (MIM 610317), GLIS3 (MIM 610192), C10orf59, CD69 (MIM 107273), 14q24.1, 14q32.2, IL27 (MIM 608273), 16q23.1, ORMDL3 (MIM 610075), 17q21.2, 19q13.32, 20p13, 22q12.2, Xq28
Multiple sclerosis | 1 | HLA | 52 | BACH2 (MIM 605394), BATF (MIM 612476), CBLB, CD40, CD58, CD6 (MIM 186720), CD86, CLEC16A (MIM 611303), CLECL1, CYP24A1, CYP27B1, DKKL1 (MIM 605418), EOMES (MIM 604615), EVI5 (MIM 602942), GALC (MIM 606890), HHEX (MIM 604420), IL12A, IL12B, IL22RA2, IL2RA, IL7, IL7R, IRF8, KIF21B (MIM 608322), MALT1, MAPK1 (MIM 176948), MERTK (MIM 604705), MMEL1, MPHOSPH9 (MIM 605501), MPV17L2, MYB (MIM 189990), MYC (MIM 190080), OLIG3 (MIM 609323), PLEK (MIM 173570), PTGER4 (MIM 601586), PVT1 (MIM 165140), RGS1, SCO2 (MIM 604272), SP140 (MIM 608602), STAT3, TAGAP, THEMIS (MIM 613607), TMEM39A, TNFRSF1A, TNFSF14 (MIM 604520), TYK2, VCAM1, ZFP36L1 (MIM 601064), ZMIZ1 (MIM 607159), ZNF767
Crohn’s disease | 4 | NOD2 (MIM 605956), IBD5 (MIM 606348), DRB1*0103, IL23R | 67 | SMAD3 (MIM 603109), ERAP2 (MIM 609497), IL10 (MIM 124092), IL2RA, TYK2, FUT2 (MIM 182100), DNMT3A (MIM 602769), DENND1B (MIM 613292), BACH2 (MIM 605394), ATG16L1 (MIM 610767)
Ulcerative colitis | 3 | DRB1*1502, DRB1*0103, IL23R | 44 | IL1R2 (MIM 147811), IL8RA-IL8RB, IL7R, IL12B, DAP (MIM 600954), PRDM1 (MIM 603423), JAK2 (MIM 147796), IRF5 (MIM 607218), GNA12 (MIM 604394), LSP1 (MIM 153432), ATG16L1 (MIM 610767)
Total | 19 | | 277 |
(a) The names of the loci are signposts and do not indicate that these loci are necessarily biologically relevant. A number of associated variants are distant from protein-coding genes.

Table 3. Summary of GWAS Findings for Metabolic Traits (a)
Disease | Prior to 2007: Number of Loci | Prior to 2007: Loci | 2007 Onward: Number of Loci | 2007 Onward: Some or All of the Loci
Type 2 diabetes | 3 | PPARG, KCNJ11 (MIM 600937), TCF7L2 | 50 | NOTCH2 (MIM 600275), PROX1 (MIM 601546), GCKR, THADA (MIM 611800), BCL11A (MIM 606557), RBMS1 (MIM 602310), IRS1, ADAMTS9, ADCY5 (MIM 600293), IGF2BP2 (MIM 608289), WFS1, ZBED3, CDKAL1, DGKB (MIM 604070), JAZF1, GCK, KLF14, TP53INP1 (MIM 606185), SLC30A8 (MIM 611145), PTPRD (MIM 601598), CDKN2A, CHCHD9, CDC123, HHEX (MIM 604420), DUSP8 (MIM 602038), KCNQ1, CENTD2, MTNR1B, HMGA2 (MIM 600698), TSPAN8 (MIM 600769), HNF1A, ZFAND6 (MIM 610183), PRC1 (MIM 603484), FTO, SRR (MIM 606477), HNF1B (MIM 189907), DUSP9 (MIM 300134), CDCD4A, UBE2E2 (MIM 602163), GRB14 (MIM 601524), ST6GAL1 (MIM 109675), VPS26A (MIM 605506), HMG20A (MIM 605534), AP3S2 (MIM 602416), HNF4A (MIM 600281), SPRY2 (MIM 602466)
Body-mass index | 1 | MC4R | 30 | NEGR1 (MIM 613173), TNNI3K (MIM 613932), PTBP2 (MIM 608449), TMEM18 (MIM 613220), POMC, FANCL (MIM 608111), LRP1B (MIM 608766), CADM2 (MIM 609938), ETV5 (MIM 601600), GNPDA2 (MIM 613222), SLC39A8 (MIM 608732), HMGCR (MIM 142910), PCSK1, ZNF608, NCR3 (MIM 611550), HMGA1 (MIM 600701), LRRN6C, TUB (MIM 601197), BDNF, MTCH2 (MIM 613221), FAIM3 (MIM 606015), MTIF3, PRKD1 (MIM 605435), MAP2K5 (MIM 602520), FTO, SH2B1, GPRC5B (MIM 605948), KCTD15, GIPR, TMEM160
Glucose or insulin | 1 | GCK | 15 | GCKR, G6PC2, IGF1, ADCY5 (MIM 600293), MADD (MIM 603584), ADRA2A, CRY2 (MIM 603732), FADS1 (MIM 606148), GLIS3 (MIM 610192), SLC2A2, PROX1 (MIM 601546), C2CD4B (MIM 610344), DGKB (MIM 604070), GIPR, VPS13C (MIM 608879)
Fat distribution | 0 | | 20 | TBX15 (MIM 604127), LYPLAL1, IRS1, SPRY2 (MIM 602466), GRB14 (MIM 601524), STAB1 (MIM 608560), ADAMTS9, CPEB4 (MIM 610607), VEGFA (MIM 192240), TFAP2B (MIM 601601), LY86 (MIM 605241), RSPO3 (MIM 610574), NFE2L3 (MIM 604135), MSRA (MIM 601250), ITPR2 (MIM 600144), HOXC13 (MIM 142976), NRXN3 (MIM 600567), ZNRF3 (MIM 612062), PIGC (MIM 601730)
Total | 5 | | 107 |
(a) The names of the loci are signposts and do not indicate that these loci are necessarily biologically relevant. A number of associated variants are distant from protein-coding genes.

New Biology Arising from GWAS Discoveries

Autoimmune Diseases

Thus far nearly all genes associated with MS have been involved in autoimmune pathways rather than in neurologic degenerative diseases.45 Indeed, of the two MS-associated genes involved in neurodegeneration, one (KIF21B) is also associated with AS and CD, suggesting that it is actually an autoimmunity gene.
The genes involved in MS include genes coding for components of the cytokine pathway (CXCR5 [MIM 601613], IL2RA [MIM 147730], IL7R [MIM 146661], IL7 [MIM 146660], IL12RB1 [MIM 601604], IL22RA2 [MIM 606648], IL12A [MIM 161560], IL12B [MIM 161561], IRF8 [MIM 601565], TNFRSF1A [MIM 191190], TNFRSF14 [MIM 602746], and TNFSF14 [MIM 604520]), costimulatory molecules (CD37 [MIM 151523], CD40, CD58 [MIM 153420], CD80 [MIM 112203], CD86 [MIM 601020], and CLECL1 [MIM 607467]), and signal-transduction molecules of immunological relevance (CBLB [MIM 604491], GPR65 [MIM 604620], MALT1 [MIM 604860], RGS1 [MIM 600323], STAT3 [MIM 102582], TAGAP [MIM 609667], and TYK2 [MIM 176941]). Interestingly, these genes mainly implicate T-helper cells in MS pathogenesis.

Genetic findings have had a major impact on AS research and therapeutics. The associations of the genes IL23R (MIM 607562)46 and IL12B19 have pointed to the involvement of the IL-23R pathway, and hence IL-17-producing proinflammatory cell populations, in the aetiopathogenesis of AS. The involvement of this pathway in AS was not considered until the genetic discoveries were reported. The recent demonstration that ERAP1 (MIM 606832) polymorphisms are associated with HLA-B27-positive but not HLA-B27-negative AS has shed important light on the mechanism by which HLA-B27 induces AS; this mechanism has remained an enigma since the discovery of the association of HLA-B27 with AS in the early 1970s. ERAP1 is involved in peptide processing before HLA class I molecule presentation; the restriction of the association of ERAP1 variants to HLA-B27-positive disease indicates that HLA-B27 operates to cause AS by a mechanism that involves peptide presentation.
Protective variants of ERAP1 have been shown to have lower peptide-processing capacity and thus to reduce the amount of peptide available to HLA-B27.47 Thus HLA-B27 is more likely to cause AS when it is processing more peptides.

The finding that PADI4 (MIM 605347) is associated with RA focused research interest on the role of anti-citrullinated peptide antibodies (ACPAs) in disease.48 PADI4 is involved in the citrullination of peptides against which ACPAs develop. The association of PADI4 variants with RA therefore indicated that ACPAs are directly involved in RA pathogenesis and are not an indirect manifestation of immune dysregulation in the disease. Subsequently, it was discovered that the association of HLA-DRB1 (MIM 142857) with RA was restricted to ACPA-positive disease and that there was a strong gene-environment interaction, such that cigarette smoking increases the risk of ACPA-positive but not ACPA-negative RA.49 Because ACPA-positive disease is more severe than ACPA-negative disease and has a greater propensity toward joint-damaging erosion, this provided further evidence supporting public-health measures against cigarette smoking.

The genetic loci identified for IBD through GWASs have highlighted a number of pathways, including antibacterial autophagy and signaling pathways (e.g., IL-10 signaling, T-cell-negative regulators, and pathways involving B cells and innate sensors).18 Some of these pathways were previously not suspected to be important for these diseases. The roles of a number of pathways, for example the IL-23R pathway, the autophagy pathway, and innate immunity, have all emerged from hypothesis-generating genetics research, not from immunology or hypothesis-driven research. Similar advances could be described for many other autoimmune diseases but are beyond the scope of this review.
Metabolic Traits

Most loci affecting T2D and fasting glucose levels map to regulatory sequences, and in many cases, the “causal” transcript, i.e., the transcript responsible for mediating the effect of the associated variants, is not yet known. At other loci, a combination of coding variants, strong biological candidates, and/or cis expression QTL data has defined the transcript through which the effect is mediated (HNF1A [MIM 142410], GCK, IRS1 [MIM 147545], WFS1 [MIM 606201], PPARG [MIM 601487], CAMK1D [MIM 607957], JAZF1 [MIM 606246], KLF14 [MIM 609393], and others) as a first step to inferring biology.50 Some of these stories are now starting to be fleshed out into biological mechanisms (e.g., KLF14).51

There is incomplete overlap with the loci influencing physiological variation in glucose and insulin. Some loci (e.g., MTNR1B [MIM 600804]) have a relatively large effect on both, whereas others (e.g., G6PC2 [MIM 608058]) influence fasting glucose levels but have a minimal effect on T2D risk. Still others (e.g., CDKN2A and CDKN2B [MIM 600160 and 600431]) impact T2D and have surprisingly modest effects on fasting glucose levels in healthy, nondiabetic individuals.32,33,50 Most of these loci appear to have their primary effect on the function of beta cells rather than on insulin resistance, highlighting the importance of the former with respect to normal and abnormal glucose homeostasis.50 Of the subset of loci (including PPARG, KLF14, and ADAMTS9 [MIM 605421]) shown to influence T2D risk through a primary effect on insulin resistance, only FTO seems to act primarily through an effect on obesity.50 Several of the T2D loci overlap genes that are known to harbor rare variants responsible for penetrant, monogenic forms of diabetes (such genes include KCNQ1 [MIM 607542], PPARG, HNF1A, GCK, and WFS1), indicating that multiple causal variants at the same locus segregate in the population at different frequencies.
There is overlap between signals influencing T2D risk and those influencing body weight (CDKAL1 [MIM 611259] and ADCY5 [MIM 600293]), indicating that some of the observed epidemiological associations between these traits are attributable to shared susceptibility variants.52

Whereas many of the fasting-glucose and fasting-insulin signals map near strong biological candidates for relevant traits (such candidate genes include IRS1, IGF1, ADRA2A [MIM 104210], SLC2A2 [MIM 138160], GCK, and GCKR) and fit within established models of our understanding of islet biology, this is far from the case with the loci identified for T2D. Efforts to demonstrate that the genes mapping close to T2D risk loci are enriched for particular pathways or processes have met with only limited success; the most robust finding yet has been in relation to cell-cycle regulation (and was consistent with a model in which the regulation of islet mass is a key component of risk).50 Either T2D is especially heterogeneous or else key aspects of its pathophysiology are as yet poorly codified in existing databases.

As for T2D and fasting glucose, most of the signals for obesity and fat distribution map to regulatory sequences, and the causal transcript is known at only a minority of the loci. Signals influencing BMI appear to be enriched for genes implicated in neuronal processes, whereas those influencing fat distribution seem to be more closely related to adipose development.36,43 Overlap with signals and genes implicated in more severe forms of disease (morbid obesity, lipodystrophy) is seen at some loci (PCSK1 [MIM 162150], POMC [MIM 176830], BDNF [MIM 113505], MC4R, and SH2B1 [MIM 608937]) but is far from complete (some loci implicated in extreme obesity case-control studies show no association with BMI at the population level).36 The strongest signal for overall adiposity is the one mapping to FTO.37
FTO is thought to be a DNA demethylase,53 but its function is poorly understood. Murine models demonstrate that modulation of Fto expression is associated with changes in body weight,54–56 but no direct evidence linking coding variants in FTO in humans to body-weight variation has been demonstrated. For the time being, FTO remains the strongest candidate, but the role of other genes (e.g., RPGRIP1L [MIM 610937]) in the region cannot be discounted. This example demonstrates the difficulties that remain in relating GWAS signals to downstream biology. Fat distribution is a strongly gender-dimorphic phenotype, and many of the signals associated with fat distribution seem to have a selective effect on this phenotype in women.43

Quantitative Traits

In addition to having been performed on the quantitative traits discussed previously (e.g., BMI and fasting-glucose and -insulin levels), GWASs have been done on a number of quantitative risk factors for disease and for traits that are models for the genetic architecture of complex traits. For bone mineral density (BMD), a risk factor for osteoporotic fracture, a total of 34 loci, together explaining ~5% of narrow-sense heritability, have been identified (Estrada et al., abstract presented at the American Society for Bone and Mineral Research 2010 Annual Meeting, published in J. Bone Miner. Res. 25 [Suppl S1], p. 1243). Among these genes, there is a major over-representation of genes in the Wnt-signaling pathway, which was first implicated in osteoporosis (MIM 166710) from studies in families with high or low BMD phenotypes. Many other examples exist in osteoporosis and other human diseases in which GWASs have shown that more-prevalent but less-severe variants in genes initially identified from studies of severe familial diseases are important for the risk of disease in the general population.
For human height, a combined discovery and validation cohort of ~180,000 samples identified 180 robustly associated loci, many in meaningful biological pathways and with evidence for multiple segregating variants at the same loci.13 Together these loci explain approximately 12%–14% of additive genetic variation (~10% of phenotypic variation). A meta-analysis of more than 100,000 individuals of European ancestry detected a total of 95 loci significantly associated with plasma concentrations of cholesterol and triglycerides, known risk factors for coronary artery disease,57 and it provided evidence that the GWAS loci were of biological and clinical relevance.

A meta-analysis from the HaemGen consortium on platelet count and platelet volume, which are endophenotypes for myocardial infarction (MIM 608446), discovered 68 loci.58 When the genes of a number of these loci were silenced in Drosophila, 11 showed a clear platelet phenotype. These genes are previously unknown regulators of blood cell formation. The identification of so many loci has uncovered new gene functions in megakaryopoiesis and platelet formation. That is, new biology has resulted directly from the identification of SNPs that are associated with variation in platelet phenotypes.

Across these quantitative traits, a number of loci discovered through GWASs were known to be a mutational target for those traits because Mendelian forms with extreme phenotypes existed. Taken together, the inferences from quantitative traits, in terms of the (large) number of loci involved, the allelic frequency spectrum of associated variants, and the nature of the candidate genes, suggest that models arising from quantitative traits appropriately reflect the genetic architecture of disease, and they reinforce the emerging evidence that it is the cumulative effect of many loci that underlies susceptibility to disease.
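The two percentages quoted above for height are linked through narrow-sense heritability. The following is a minimal sketch, assuming only the definitional relationship that the fraction of phenotypic variance explained equals h² times the fraction of additive genetic variance explained. The function name is hypothetical, and the inputs are the rounded figures from the text (about 10% of phenotypic variation and 12%–14% of additive genetic variation).

```python
def implied_heritability(phenotypic_fraction: float, genetic_fraction: float) -> float:
    """Narrow-sense heritability implied by two variance-explained figures.

    phenotypic_fraction: share of phenotypic variance explained by the loci.
    genetic_fraction: share of additive genetic variance explained by the same loci.
    Uses: phenotypic_fraction = h2 * genetic_fraction.
    """
    if genetic_fraction <= 0.0:
        raise ValueError("genetic_fraction must be positive")
    return phenotypic_fraction / genetic_fraction


# Rounded figures quoted in the text for the 180 height loci.
phenotypic = 0.10
for genetic in (0.12, 0.14):
    h2 = implied_heritability(phenotypic, genetic)
    print(f"genetic fraction {genetic:.0%}: implied h2 = {h2:.2f}")
```

With these rounded inputs the implied heritability falls between roughly 0.71 and 0.83, which shows that the two figures are mutually consistent only for a highly heritable trait.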
From GWAS to Translation: Clinical Relevance

Autoimmune Diseases

Many of the MS-associated genes discovered by GWASs represent excellent potential therapeutic targets. Of particular note is the identification of two genes involved in vitamin D metabolism (CYP27B1 [MIM 609506] and CYP24A1 [MIM 126065]). This identification might help to explain the latitudinal variation in MS incidence, i.e., higher MS prevalence at more extreme latitudes is most likely due to higher rates of vitamin D deficiency. Two other identified genes are already targets of MS therapies, highlighting the relevance of the findings to the disease pathogenesis (natalizumab targets VCAM1 [MIM 192225], and daclizumab targets IL2RA).

The findings for AS have stimulated trials of therapies against the identified pathways. Anti-IL-17 treatment has been shown in a phase 2 trial to have efficacy equivalent to that of the current gold-standard treatment, TNF inhibition, in AS.

The relevance of the RA-related genetic findings to therapeutic development is highlighted by the fact that some existing therapies already target genes or gene pathways highlighted by the genetic associations with RA; such therapies include those involving TNF inhibitors (e.g., infliximab) and co-stimulation inhibitors (e.g., abatacept). Abatacept is a fusion protein of CTLA-4 and immunoglobulin. It acts by preventing costimulation of T-helper cells by the binding of the T cell’s CD28 protein to the B7 protein on the antigen-presenting cell. CTLA4 (MIM 123890) and CD28 (MIM 186760) polymorphisms are associated with RA. The RA-associated genes include many involved in the NF-κB signaling pathway and place this pathway at the center of RA pathogenesis. As in MS, mouse research prior to the genetic discoveries had implicated the IL-23-dependent Th17-lymphocyte pathway in RA pathogenesis.
To date there has been very little genetic support for this with regard to human diseases, in contrast to the situation in seronegative diseases such as AS, psoriasis, and IBD, where strong genetic associations exist and treatments targeting the pathway are in clinical use.

Metabolic Diseases

The main relevance of GWASs lies in the insights into disease biology (see above) and the potential for clinical translation through novel approaches to the diagnosis, prevention, treatment, and monitoring of disease. This will take some time, in particular given that most GWAS discoveries were made in the last few years. The predictive power of disease risk ascertained from genetic data remains poor because for most diseases only a small proportion of additive genetic variation has been accounted for. Although it is possible for T2D to identify individuals who are at the extremes of the genotype risk score distribution and who differ appreciably in T2D risk (they have twice or half the average risk for the upper and lower 1%–2%, respectively), many of these would already be identifiable on the basis of classical risk factors. In fact, when using receiver operating characteristic (ROC) analyses, BMI and age do a far better job of discrimination than the genetic variants so far discovered.59 This may change as low-frequency and rare causal alleles are found. Although individual prediction is not yet practical with the variants at hand, it should be possible to identify groups of individuals who are at a substantially greater-than-average risk for diabetes, and this might be of value, for example, with respect to clinical-trial enrichment.

One obvious route to early translation involves the identification of diagnostic biomarkers on the basis of the processes that have been uncovered. These may have predictive impact well beyond the genetic variants that led to their discovery.
This was recently demonstrated by a GWAS of C-reactive protein (CRP) levels; that study found that common variants near the HNF1A gene were associated with variation in CRP.60 The authors asked whether rare HNF1A mutations that are causal for the Mendelian MODY (MIM 606391) subtype of diabetes are also associated with differences in CRP levels and whether it would be possible to use CRP levels as a diagnostic marker to help identify individuals who have early-onset diabetes and who are likely to have HNF1A-MODY (and to direct those individuals to sequence-based diagnostics). They were able to show marked differences in CRP levels between HNF1A-MODY and other types of diabetes and demonstrated that diagnosis based on CRP levels has a discriminative accuracy of more than 80% for this diagnostic classification.61,62

Otherwise, GWAS findings have as yet had no impact on therapeutic optimization. Recent studies have identified variants that influence therapeutic response to metformin63 and might herald better understanding of how such drugs work.

New Science Facilitated by GWASs

Although the GWAS approach was designed for the detection of associations between DNA markers and disease, as a by-product such studies have generated new scientific discoveries. A detailed description and discussion is outside the scope of this review, and we highlight only a few of these advances: the discovery of genes affecting genetic recombination and their correlation with natural selection64–66 and new insight into human population structure and evolution.67–73

Interpretation of GWAS Results

GWASs conducted in the last five years were designed and powered to detect associations through LD between genotyped (or imputed) common SNP markers and unknown causal variants. What do the results imply in terms of variance explained in the population, common versus rare variants underlying complex traits, and the nature of complex-trait variation and evolution?
It is too early to be able to quantify the joint distribution of risk-allele frequencies and their effect sizes because there are very few causal variants identified by GWAS and because systematic study of rare variants (through exome or whole-genome sequencing) is in an early stage. To understand the allelic spectrum of risk variants and thereby inform optimal design of experiments aiming to detect causal variants, one must differentiate between two explanations for observed associations between genotyped common SNPs and disease: the association can be caused by one or more causal variants that have large effect sizes and are in low LD with the genotyped SNPs, or it can be caused by causal variants that have small effects and are in high LD with the genotyped SNPs. Low LD occurs when the allele frequencies of the unknown causal variants and those at the genotyped SNPs are very different from each other, for example when the allele frequency of causal variants is much lower than that of the SNPs. For a single robustly associated SNP in a homogeneous population, we cannot distinguish between the hypotheses that the association signal is caused by a rare variant of large effect or a common variant with small effect. However, variants at multiple loci and GWASs in other ethnic populations help to narrow the boundaries of the genetic architecture of diseases. At this point in time, we can conclude that
(1) Many loci contribute to complex-trait variation (e.g., Figure 2).
(2) At a number of identified risk loci, there are multiple alleles associated with disease at a wide range of frequencies.
(3) There is evidence for pleiotropy, i.e., that the same variants are associated with multiple traits.66,74,75
(4) A number of variants associated with disease or complex traits in one ethnic population are also associated with the same disease or traits in other populations (see above for T2D examples).
(5) The hypothesis76 that the causal variant(s) leading to the association between common SNPs and disease are mostly rare (say, have an allele frequency of 1% or lower) is not consistent with theoretical and empirical results.77,78 In particular, there is no widespread evidence for the existence of “synthetic associations” (see Box 3). Numerically, we expect that most causal variants that segregate in the population are rare, consistent with evolutionary theory, but the proportion of genetic variation that these variants cumulatively explain depends on their correlation with fitness.79
(6) A surprisingly large proportion of additive genetic variation is tagged when all SNPs are considered simultaneously.12–14

Box 3. Synthetic Associations

Dickson and colleagues suggested that the observed association between a common SNP and a complex trait might result when one or more rare variants at the locus are in LD with that SNP.76,93 Because the properties of LD mean that common SNP alleles and rare causal variants cannot be highly correlated,84 the hypothesis of “synthetic” associations implies that the effect sizes of the causal variants are much larger than the effect size observed at the common SNP and suggests that (re)sequencing studies might detect such variants. The hypothesis is not about whether GWASs work as an experimental design but about what the likely interpretation of GWAS hits is in terms of the allele spectrum of causal risk alleles.

Are empirical data consistent with this hypothesis? Several lines of evidence suggest that associations observed with common SNPs are rarely due to synthetic associations with rare variants. First, because the LD correlation between common and rare variants is so low (typically 0.01–0.02), synthetic associations imply that the variance explained by the causal variants at the locus is 50–100 times larger than the variance explained at the genotyped SNP.78 So, if the SNP explains 0.1% of phenotypic variation in the population, the causal variant would explain 5%–10%. But as shown in this review, for many complex traits and diseases tens to hundreds of common variants are identified, and so their combined effects would explain too much variation if synthetic associations were the norm. Second, empirical data from (re)sequencing studies and trans-ethnic mapping suggest that both common and rare variants contribute to disease risk.77 At most loci detected by GWASs, there is no evidence (despite extensive genotyping and/or re-sequencing) that the common-variant signal is driven by low-frequency or rarer variants. Where rare risk alleles are uncovered at the same loci, they seem much more likely to be independent signals.94–96

Together these observations point to a highly polygenic model of disease susceptibility with causal variants across the entire range of the allele-frequency spectrum. By “polygenic,” we mean that segregating variants at many genomic loci (tens, hundreds, or even thousands) contribute to genetic variation for susceptibility in the population. The observations imply that, for most common complex diseases, nearly everyone in the population carries some risk alleles and that affected individuals are likely to have a different portfolio of risk alleles.79 They also imply that any single risk allele is neither necessary nor sufficient to cause disease. For the etiology of disease, these observations provide empirical evidence to support a threshold or burden model involving multiple variants and environmental factors, and they appear to be inconsistent with a single cause (e.g., a single mutation). A rare-variant-only model of disease, characterized by locus heterogeneity and rare mutations of large effects and proposed by, for example, McClellan and King,1 is not consistent with empirical observations.77,79,97

The Cost of GWASs

If we assume that the GWAS results from Figure 1 represent a total of 500,000 SNP chips and that on average a chip costs $500, then this is a total investment of $250 million. If there are a total of ~2,000 loci detected across all traits, then this implies an investment of $125,000 per discovered locus. Is that a good investment? We think so: The total amount of money spent on candidate-gene studies and linkage analyses in the 1990s and 2000s probably exceeds $250 million, and in total they have had little to show for it. Also, it is worthwhile to put these amounts in context. $250 million is of the order of the cost of one or two stealth fighter jets and much less than the cost of a single navy submarine. It is a fraction of the ~$9 billion cost of the Large Hadron Collider. It would also pay for about 100 R01 grants. Would those 100 non-funded R01 grants have made breakthrough discoveries in biology and medicine? We simply can’t answer this question, but we can conclude that a tremendous number of genuinely new discoveries have been made in a period of only five years.

Concluding Comments

In this review we have attempted to summarize the tremendous quality and quantity of discoveries that have been made by GWASs in the last five years. Because of space limitations, we have been able to discuss only a subset of diseases and have not mentioned those made in common cancers, pediatric diseases, and ophthalmological diseases, to name but a few.
We now return to the perceived failure of GWASs as summarized in the introductory section:

(1) Is the GWAS approach founded on a flawed assumption that genetics plays an important role in the risk for common diseases? Pedigree studies, including those involving twins, suggest that a substantial proportion of variation in susceptibility for common disease is due to genetic factors. The proportion of total variation explained by genome-wide-significant variants has reached 10%–20% for a number of diseases, and clearly there are additional variants with such small effect sizes that they have not been detected with stringent significance. As reviewed here, many of the detected loci are in biologically meaningful pathways for the diseases investigated. Whole-genome analyses involving GWAS data have estimated that 20%–50% of phenotypic variation is captured when all SNPs are considered simultaneously for a number of complex diseases and traits. These estimates are based on population-wide studies and provide a lower limit of the total proportion of phenotypic variation due to genetic factors. Inference from GWASs is independent of inference drawn from close relatives (pedigree/family studies), and therefore these studies have provided independent evidence for the role of genetics in common diseases.

(2) Have GWASs been disappointing in not explaining more genetic variation in the population? This criticism implies that the aim of GWASs is to explain all genetic variation. This is a misrepresentation of the objective of GWASs. As was the aim of linkage studies in pedigrees for complex diseases prior to the GWAS era, the aim of GWASs is to detect loci that are associated with complex traits. The detection of such loci has led to the discovery of new biological knowledge about disease, knowledge that was absent only five years ago.
But even ignoring the aim of GWASs, for a number of complex traits the proportion of genetic variation uncovered by GWASs is actually substantial. For example, for T2D, MS, and CD, approximately 10%, 20%, and 20%, respectively, of genetic variation in the population has been accounted for. Apart from diseases with a known major locus (which is usually the major histocompatibility locus), the baseline of variation explained five years ago was essentially zero.

(3) Have GWASs delivered meaningful biologically relevant knowledge or results of clinical or any other utility? As we have highlighted in this review, the answer to this question is a definite “yes.” For example, the discovery of the importance of the autophagy pathway in Crohn disease, the IL-23R pathway in rheumatoid arthritis, and factor H in age-related macular degeneration (MIM 610149)9 has given important biological insight with direct clinical relevance. Hunter and Kraft put it this way back in 2007: “There have been few, if any, similar bursts of discovery in the history of medical research.”80

(4) Are GWAS results spurious? The combination of large sample sizes and stringent significance testing has led to a large number of robust and replicable associations between complex traits and genetic variants, many of which are in meaningful biological pathways. A number of variants or different variants at the same loci have been shown to be associated with the same trait in different ethnic populations, and some loci are even replicated across species.81 The combination of multiple variants with small effect sizes has been shown to predict disease status or phenotype in independent samples from the same population. Clearly, these results are not consistent with flawed inferences from GWASs.
In conclusion, in a period of less than five years, the GWAS experimental design in human populations has led to new discoveries about genes and pathways involved in common diseases and other complex traits, has provided a wealth of new biological insights, has led to discoveries with direct clinical utility, and has facilitated basic research in human genetics and genomics. For the future, technological advances enabling the sequencing of entire genomes in large samples at affordable prices are likely to reveal additional genes, pathways, and biological insights, as well as to identify causal mutations.

Acknowledgments

We acknowledge funding from the Australian National Health and Medical Research Council (NHMRC grants 389892, 496667, 613672, 613601, and 1011506) and the Australian Research Council (ARC grant DP1093502). P.M.V. and M.A.B. are funded by NHMRC Senior Principal Research Fellowships. We thank two referees for many helpful comments.

Web Resources

The URLs for data presented herein are as follows:
Online Mendelian Inheritance in Man (OMIM), http://www.omim.org
GWAS Catalog, http://www.genome.gov/26525384

References

1. McClellan, J., and King, M.C. (2010). Genetic heterogeneity in human disease. Cell 141, 210–217.
2. Crow, T.J. (2011). ‘The missing genes: what happened to the heritability of psychiatric disorders?’. Mol. Psychiatry 16, 362–364.
3. Manolio, T.A., Collins, F.S., Cox, N.J., Goldstein, D.B., Hindorff, L.A., Hunter, D.J., McCarthy, M.I., Ramos, E.M., Cardon, L.R., Chakravarti, A., et al. (2009). Finding the missing heritability of complex diseases. Nature 461, 747–753.
4. Botstein, D., and Risch, N. (2003). Discovering genotypes underlying human phenotypes: Past successes for mendelian disease, future approaches for complex disease. Nat. Genet. Suppl. 33, 228–237.
5. Hartl, D.L., and Clark, A.G. (1997). Principles of population genetics (Sunderland: Sinauer Associates).
6. Hill, W.G., and Robertson, A. (1968). The effects of inbreeding at loci with heterozygote advantage. Genetics 60, 615–628.
7. Altshuler, D., Brooks, L.D., Chakravarti, A., Collins, F.S., Daly, M.J., and Donnelly, P.; International HapMap Consortium. (2005). A haplotype map of the human genome. Nature 437, 1299–1320.
8. Dewan, A., Liu, M., Hartman, S., Zhang, S.S., Liu, D.T., Zhao, C., Tam, P.O., Chan, W.M., Lam, D.S., Snyder, M., et al. (2006). HTRA1 promoter polymorphism in wet age-related macular degeneration. Science 314, 989–992.
9. Klein, R.J., Zeiss, C., Chew, E.Y., Tsai, J.Y., Sackler, R.S., Haynes, C., Henning, A.K., SanGiovanni, J.P., Mane, S.M., Mayne, S.T., et al. (2005). Complement factor H polymorphism in age-related macular degeneration. Science 308, 385–389.
10. Wellcome Trust Case Control Consortium. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678.
11. Franke, A., McGovern, D.P., Barrett, J.C., Wang, K., Radford-Smith, G.L., Ahmad, T., Lees, C.W., Balschun, T., Lee, J., Roberts, R., et al. (2010). Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nat. Genet. 42, 1118–1125.
12. Anderson, C.A., Boucher, G., Lees, C.W., Franke, A., D’Amato, M., Taylor, K.D., Lee, J.C., Goyette, P., Imielinski, M., Latiano, A., et al. (2011). Meta-analysis identifies 29 additional ulcerative colitis risk loci, increasing the number of confirmed associations to 47. Nat. Genet. 43, 246–252.
13. Lango Allen, H., Estrada, K., Lettre, G., Berndt, S.I., Weedon, M.N., Rivadeneira, F., Willer, C.J., Jackson, A.U., Vedantam, S., Raychaudhuri, S., et al. (2010). Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, 832–838.
14. Yang, J., Manolio, T.A., Pasquale, L.R., Boerwinkle, E., Caporaso, N., Cunningham, J.M., de Andrade, M., Feenstra, B., Feingold, E., Hayes, M.G., et al. (2011). Genome partitioning of genetic variation for complex traits using common SNPs. Nat. Genet. 43, 519–525.
15. Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders, A.K., Nyholt, D.R., Madden, P.A., Heath, A.C., Martin, N.G., Montgomery, G.W., et al. (2010). Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569.
16. Eyre-Walker, A. (2010). Evolution in health and medicine Sackler colloquium: Genetic architecture of complex traits and its implications for fitness and genome-wide association studies. Proc. Natl. Acad. Sci. USA 107 (Suppl 1), 1752–1756.
17. Pritchard, J.K. (2001). Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. 69, 124–137.
18. Khor, B., Gardet, A., and Xavier, R.J. (2011). Genetics and pathogenesis of inflammatory bowel disease. Nature 474, 307–317.
19. Danoy, P., Pryce, K., Hadler, J., Bradbury, L.A., Farrar, C., Pointon, J., Ward, M., Weisman, M., Reveille, J.D., Wordsworth, B.P., et al.; Australo-Anglo-American Spondyloarthritis Consortium; Spondyloarthritis Research Consortium of Canada. (2010). Association of variants at 1q32 and STAT3 with ankylosing spondylitis suggests genetic overlap with Crohn’s disease. PLoS Genet. 6, e1001195.
20. Cotsapas, C., Voight, B.F., Rossin, E., Lage, K., Neale, B.M., Wallace, C., Abecasis, G.R., Barrett, J.C., Behrens, T., Cho, J., et al.; FOCiS Network of Consortia. (2011). Pervasive sharing of genetic effects in autoimmune disease. PLoS Genet. 7, e1002254.
21. McCarthy, M.I. (2010). Genomics, type 2 diabetes, and obesity. N. Engl. J. Med. 363, 2339–2350.
22. Kooner, J.S., Saleheen, D., Sim, X., Sehmi, J., Zhang, W., Frossard, P., Been, L.F., Chia, K.S., Dimas, A.S., Hassanali, N., et al.; DIAGRAM; MuTHER. (2011). Genome-wide association study in individuals of South Asian ancestry identifies six new type 2 diabetes susceptibility loci. Nat. Genet. 43, 984–989.
23. Yamauchi, T., Hara, K., Maeda, S., Yasuda, K., Takahashi, A., Horikoshi, M., Nakamura, M., Fujita, H., Grarup, N., Cauchi, S., et al. (2010). A genome-wide association study in the Japanese population identifies susceptibility loci for type 2 diabetes at UBE2E2 and C2CD4A-C2CD4B. Nat. Genet. 42, 864–868.
24. Shu, X.O., Long, J., Cai, Q., Qi, L., Xiang, Y.B., Cho, Y.S., Tai, E.S., Li, X., Lin, X., Chow, W.H., et al. (2010). Identification of new genetic risk variants for type 2 diabetes. PLoS Genet. 6, e1001127.
25. Yasuda, K., Miyake, K., Horikawa, Y., Hara, K., Osawa, H., Furuta, H., Hirota, Y., Mori, H., Jonsson, A., Sato, Y., et al. (2008). Variants in KCNQ1 are associated with susceptibility to type 2 diabetes mellitus. Nat. Genet. 40, 1092–1097.
26. Unoki, H., Takahashi, A., Kawaguchi, T., Hara, K., Horikoshi, M., Andersen, G., Ng, D.P., Holmkvist, J., Borch-Johnsen, K., Jørgensen, T., et al. (2008). SNPs in KCNQ1 are associated with susceptibility to type 2 diabetes in East Asian and European populations. Nat. Genet. 40, 1098–1102.
27. Tsai, F.J., Yang, C.F., Chen, C.C., Chuang, L.M., Lu, C.H., Chang, C.T., Wang, T.Y., Chen, R.H., Shiu, C.F., Liu, Y.M., et al. (2010). A genome-wide association study identifies susceptibility variants for type 2 diabetes in Han Chinese. PLoS Genet. 6, e1000847.
28. Below, J.E., Gamazon, E.R., Morrison, J.V., Konkashbaev, A., Pluzhnikov, A., McKeigue, P.M., Parra, E.J., Elbein, S.C., Hallman, D.M., Nicolae, D.L., et al. (2011). Genome-wide association and meta-analysis in populations from Starr County, Texas, and Mexico City identify type 2 diabetes susceptibility loci and enrichment for expression quantitative trait loci in top signals. Diabetologia 54, 2047–2055.
29. Parra, E.J., Below, J.E., Krithika, S., Valladares, A., Barta, J.L., Cox, N.J., Hanis, C.L., Wacher, N., Garcia-Mena, J., Hu, P., et al.; Diabetes Genetics Replication and Meta-analysis (DIAGRAM) Consortium. (2011). Genome-wide association study of type 2 diabetes in a sample from Mexico City and a meta-analysis of a Mexican-American sample from Starr County, Texas. Diabetologia 54, 2038–2046.
30. Grant, S.F., Thorleifsson, G., Reynisdottir, I., Benediktsson, R., Manolescu, A., Sainz, J., Helgason, A., Stefansson, H., Emilsson, V., Helgadottir, A., et al. (2006). Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat. Genet. 38, 320–323.
31. Prokopenko, I., Langenberg, C., Florez, J.C., Saxena, R., Soranzo, N., Thorleifsson, G., Loos, R.J., Manning, A.K., Jackson, A.U., Aulchenko, Y., et al. (2009). Variants in MTNR1B influence fasting glucose levels. Nat. Genet. 41, 77–81.
32. Dupuis, J., Langenberg, C., Prokopenko, I., Saxena, R., Soranzo, N., Jackson, A.U., Wheeler, E., Glazer, N.L., Bouatia-Naji, N., Gloyn, A.L., et al.; DIAGRAM Consortium; GIANT Consortium; Global BPgen Consortium; Anders Hamsten on behalf of Procardis Consortium; MAGIC investigators. (2010). New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nat. Genet. 42, 105–116.
33. Saxena, R., Hivert, M.F., Langenberg, C., Tanaka, T., Pankow, J.S., Vollenweider, P., Lyssenko, V., Bouatia-Naji, N., Dupuis, J., Jackson, A.U., et al.; GIANT consortium; MAGIC investigators. (2010). Genetic variation in GIPR influences the glucose and insulin responses to an oral glucose challenge. Nat. Genet. 42, 142–148.
34. Weedon, M.N., Clark, V.J., Qian, Y., Ben-Shlomo, Y., Timpson, N., Ebrahim, S., Lawlor, D.A., Pembrey, M.E., Ring, S., Wilkin, T.J., et al. (2006). A common haplotype of the glucokinase gene alters fasting glucose and birth weight: Association in six studies and population-genetics analyses. Am. J. Hum. Genet. 79, 991–1001.
35. Larsen, L.H., Echwald, S.M., Sørensen, T.I., Andersen, T., Wulff, B.S., and Pedersen, O. (2005). Prevalence of mutations and functional analyses of melanocortin 4 receptor variants identified among 750 men with juvenile-onset obesity. J. Clin. Endocrinol. Metab. 90, 219–224.
36. Speliotes, E.K., Willer, C.J., Berndt, S.I., Monda, K.L., Thorleifsson, G., Jackson, A.U., Allen, H.L., Lindgren, C.M., Luan, J., Mägi, R., et al.; MAGIC; Procardis Consortium. (2010). Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat. Genet. 42, 937–948.
37. Frayling, T.M., Timpson, N.J., Weedon, M.N., Zeggini, E., Freathy, R.M., Lindgren, C.M., Perry, J.R., Elliott, K.S., Lango, H., Rayner, N.W., et al. (2007). A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316, 889–894.
38. Meyre, D., Delplanque, J., Chèvre, J.C., Lecoeur, C., Lobbens, S., Gallina, S., Durand, E., Vatin, V., Degraeve, F., Proença, C., et al. (2009). Genome-wide association study for early-onset and morbid adult obesity identifies three new risk loci in European populations. Nat. Genet. 41, 157–159.
39. Scherag, A., Dina, C., Hinney, A., Vatin, V., Scherag, S., Vogel, C.I., Müller, T.D., Grallert, H., Wichmann, H.E., Balkau, B., et al. (2010). Two new loci for body-weight regulation identified in a joint analysis of genome-wide association studies for early-onset extreme obesity in French and German study groups. PLoS Genet. 6, e1000916.
40. Willer, C.J., Speliotes, E.K., Loos, R.J., Li, S., Lindgren, C.M., Heid, I.M., Berndt, S.I., Elliott, A.L., Jackson, A.U., Lamina, C., et al.; Wellcome Trust Case Control Consortium; Genetic Investigation of ANthropometric Traits Consortium. (2009). Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat. Genet. 41, 25–34.
41. Walters, R.G., Jacquemont, S., Valsesia, A., de Smith, A.J., Martinet, D., Andersson, J., Falchi, M., Chen, F., Andrieux, J., Lobbens, S., et al. (2010). A new highly penetrant form of obesity due to deletions on chromosome 16p11.2. Nature 463, 671–675.
42. Heard-Costa, N.L., Zillikens, M.C., Monda, K.L., Johansson, A., Harris, T.B., Fu, M., Haritunians, T., Feitosa, M.F., Aspelund, T., Eiriksdottir, G., et al. (2009). NRXN3 is a novel locus for waist circumference: A genome-wide association study from the CHARGE Consortium. PLoS Genet. 5, e1000539.
43. Heid, I.M., Jackson, A.U., Randall, J.C., Winkler, T.W., Qi, L., Steinthorsdottir, V., Thorleifsson, G., Zillikens, M.C., Speliotes, E.K., Mägi, R., et al.; MAGIC. (2010). Meta-analysis identifies 13 new loci associated with waist-hip ratio and reveals sexual dimorphism in the genetic basis of fat distribution. Nat. Genet. 42, 949–960.
44. Kilpelainen, T.O., Zillikens, M.C., Stancakova, A., Finucane, F.M., Ried, J.S., Langenberg, C., Zhang, W., Beckmann, J.S., Luan, J., Vandenput, L., et al. (2011). Genetic variation near IRS1 associates with reduced adiposity and an impaired metabolic profile. Nat. Genet. 43, 753–760.
45. Sawcer, S., Hellenthal, G., Pirinen, M., Spencer, C.C., Patsopoulos, N.A., Moutsianas, L., Dilthey, A., Su, Z., Freeman, C., Hunt, S.E., et al.; International Multiple Sclerosis Genetics Consortium; Wellcome Trust Case Control Consortium 2. (2011). Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis. Nature 476, 214–219.
46. Burton, P.R., Clayton, D.G., Cardon, L.R., Craddock, N., Deloukas, P., Duncanson, A., Kwiatkowski, D.P., McCarthy, M.I., Ouwehand, W.H., Samani, N.J., et al.; Wellcome Trust Case Control Consortium; Australo-Anglo-American Spondylitis Consortium (TASC); Biologics in RA Genetics and Genomics Study Syndicate (BRAGGS) Steering Committee; Breast Cancer Susceptibility Collaboration (UK). (2007). Association scan of 14,500 nonsynonymous SNPs in four diseases identifies autoimmunity variants. Nat. Genet. 39, 1329–1337.
47. Evans, D.M., Spencer, C.C., Pointon, J.J., Su, Z., Harvey, D., Kochan, G., Oppermann, U., Dilthey, A., Pirinen, M., Stone, M.A., et al.; Spondyloarthritis Research Consortium of Canada (SPARCC); Australo-Anglo-American Spondyloarthritis Consortium (TASC); Wellcome Trust Case Control Consortium 2 (WTCCC2). (2011). Interaction between ERAP1 and HLA-B27 in ankylosing spondylitis implicates peptide handling in the mechanism for HLA-B27 in disease susceptibility. Nat. Genet. 43, 761–767.
48. Suzuki, A., Yamada, R., Chang, X., Tokuhiro, S., Sawada, T., Suzuki, M., Nagasaki, M., Nakayama-Hamada, M., Kawaida, R., Ono, M., et al. (2003). Functional haplotypes of PADI4, encoding citrullinating enzyme peptidylarginine deiminase 4, are associated with rheumatoid arthritis. Nat. Genet. 34, 395–402.
49. Padyukov, L., Silva, C., Stolt, P., Alfredsson, L., and Klareskog, L. (2004). A gene-environment interaction between smoking and shared epitope genes in HLA-DR provides a high risk of seropositive rheumatoid arthritis. Arthritis Rheum. 50, 3085–3092.
50. Voight, B.F., Scott, L.J., Steinthorsdottir, V., Morris, A.P., Dina, C., Welch, R.P., Zeggini, E., Huth, C., Aulchenko, Y.S., Thorleifsson, G., et al.; MAGIC investigators; GIANT Consortium. (2010). Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat. Genet. 42, 579–589.
51. Small, K.S., Hedman, A.K., Grundberg, E., Nica, A.C., Thorleifsson, G., Kong, A., Thorsteindottir, U., Shin, S.Y., Richards, H.B., Soranzo, N., et al.; GIANT Consortium; MAGIC Investigators; DIAGRAM Consortium; MuTHER Consortium. (2011). Identification of an imprinted master trans regulator at the KLF14 locus related to multiple metabolic phenotypes. Nat. Genet. 43, 561–564.
52. Freathy, R.M., Mook-Kanamori, D.O., Sovio, U., Prokopenko, I., Timpson, N.J., Berry, D.J., Warrington, N.M., Widen, E., Hottenga, J.J., Kaakinen, M., et al.; Genetic Investigation of ANthropometric Traits (GIANT) Consortium; Meta-Analyses of Glucose and Insulin-related traits Consortium; Wellcome Trust Case Control Consortium; Early Growth Genetics (EGG) Consortium. (2010). Variants in ADCY5 and near CCNL1 are associated with fetal growth and birth weight. Nat. Genet. 42, 430–435.
53. Gerken, T., Girard, C.A., Tung, Y.C., Webby, C.J., Saudek, V., Hewitson, K.S., Yeo, G.S., McDonough, M.A., Cunliffe, S., McNeill, L.A., et al. (2007). The obesity-associated FTO gene encodes a 2-oxoglutarate-dependent nucleic acid demethylase. Science 318, 1469–1472.
54. Church, C., Lee, S., Bagg, E.A., McTaggart, J.S., Deacon, R., Gerken, T., Lee, A., Moir, L., Mecinović, J., Quwailid, M.M., et al. (2009). A mouse model for the metabolic effects of the human fat mass and obesity associated FTO gene. PLoS Genet. 5, e1000599.
55. Church, C., Moir, L., McMurray, F., Girard, C., Banks, G.T., Teboul, L., Wells, S., Brüning, J.C., Nolan, P.M., Ashcroft, F.M., and Cox, R.D. (2010). Overexpression of Fto leads to increased food intake and results in obesity. Nat. Genet. 42, 1086–1092.
56. Freathy, R.M., Timpson, N.J., Lawlor, D.A., Pouta, A., Ben-Shlomo, Y., Ruokonen, A., Ebrahim, S., Shields, B., Zeggini, E., Weedon, M.N., et al. (2008). Common variation in the FTO gene alters diabetes-related metabolic traits to the extent expected given its effect on BMI. Diabetes 57, 1419–1426.
57. Teslovich, T.M., Musunuru, K., Smith, A.V., Edmondson, A.C., Stylianou, I.M., Koseki, M., Pirruccello, J.P., Ripatti, S., Chasman, D.I., Willer, C.J., et al. (2010). Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707–713.
58. Gieger, C., Radhakrishnan, A., Cvejic, A., Tang, W., Porcu, E., Pistis, G., Serbanovic-Canic, J., Elling, U., Goodall, A.H., Labrune, Y., et al. (2011). New gene functions in megakaryopoiesis and platelet formation. Nature 480, 201–208.
59. Mihaescu, R., Meigs, J., Sijbrands, E., and Janssens, A.C. (2011). Genetic risk profiling for prediction of type 2 diabetes. PLoS Curr. 3, RRN1208.
60. Elliott, P., Chambers, J.C., Zhang, W., Clarke, R., Hopewell, J.C., Peden, J.F., Erdmann, J., Braund, P., Engert, J.C., Bennett, D., et al. (2009). Genetic loci associated with C-reactive protein levels and risk of coronary heart disease. JAMA 302, 37–48.
61. Owen, K.R., Thanabalasingham, G., James, T.J., Karpe, F., Farmer, A.J., McCarthy, M.I., and Gloyn, A.L. (2010). Assessment of high-sensitivity C-reactive protein levels as diagnostic discriminator of maturity-onset diabetes of the young due to HNF1A mutations. Diabetes Care 33, 1919–1924.
62. Thanabalasingham, G., Shah, N., Vaxillaire, M., Hansen, T., Tuomi, T., Gasperikova, D., Szopa, M., Tjora, E., James, T.J., Kokko, P., et al. (2011). A large multi-centre European study validates high-sensitivity C-reactive protein (hsCRP) as a clinical biomarker for the diagnosis of diabetes subtypes. Diabetologia 54, 2801–2810.
63. Zhou, K., Bellenguez, C., Spencer, C.C., Bennett, A.J., Coleman, R.L., Tavendale, R., Hawley, S.A., Donnelly, L.A., Schofield, C., Groves, C.J., et al.; GoDARTS and UKPDS Diabetes Pharmacogenetics Study Group; Wellcome Trust Case Control Consortium 2; MAGIC investigators. (2011). Common variants near ATM are associated with glycemic response to metformin in type 2 diabetes. Nat. Genet. 43, 117–120.
64. Stefansson, H., Helgason, A., Thorleifsson, G., Steinthorsdottir, V., Masson, G., Barnard, J., Baker, A., Jonasdottir, A., Ingason, A., Gudnadottir, V.G., et al. (2005). A common inversion under selection in Europeans. Nat. Genet. 37, 129–137.
65. Kong, A., Barnard, J., Gudbjartsson, D.F., Thorleifsson, G., Jonsdottir, G., Sigurdardottir, S., Richardsson, B., Jonsdottir, J., Thorgeirsson, T., Frigge, M.L., et al. (2004). Recombination rate and reproductive success in humans. Nat. Genet. 36, 1203–1206.
66. Hinch, A.G., Tandon, A., Patterson, N., Song, Y., Rohland, N., Palmer, C.D., Chen, G.K., Wang, K., Buxbaum, S.G., Akylbekova, E.L., et al. (2011). The landscape of recombination in African Americans. Nature 476, 170–175.
67. Seldin, M.F., Tian, C., Shigeta, R., Scherbarth, H.R., Silva, G., Belmont, J.W., Kittles, R., Gamron, S., Allevi, A., Palatnik, S.A., et al. (2007). Argentine population genetic structure: Large variance in Amerindian contribution. Am. J. Phys. Anthropol. 132, 455–462.
68. Seldin, M.F., Shigeta, R., Villoslada, P., Selmi, C., Tuomilehto, J., Silva, G., Belmont, J.W., Klareskog, L., and Gregersen, P.K. (2006). European population substructure: Clustering of northern and southern populations. PLoS Genet. 2, e143.
69. Tian, C., Hinds, D.A., Shigeta, R., Kittles, R., Ballinger, D.G., and Seldin, M.F. (2006). A genomewide single-nucleotide-polymorphism panel with high ancestry information for African American admixture mapping. Am. J. Hum. Genet. 79, 640–649.
70. McEvoy, B.P., Montgomery, G.W., McRae, A.F., Ripatti, S., Perola, M., Spector, T.D., Cherkas, L., Ahmadi, K.R., Boomsma, D., Willemsen, G., et al. (2009). Geographical structure and differential natural selection among North European populations. Genome Res. 19, 804–814.
71. Heath, S.C., Gut, I.G., Brennan, P., McKay, J.D., Bencko, V., Fabianova, E., Foretova, L., Georges, M., Janout, V., Kabesch, M., et al. (2008). Investigation of the fine structure of European populations with applications to disease association studies. Eur. J. Hum. Genet. 16, 1413–1429.
72. Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A.R., Auton, A., Indap, A., King, K.S., Bergmann, S., Nelson, M.R., et al. (2008). Genes mirror geography within Europe. Nature 456, 98–101.
73. Price, A.L., Butler, J., Patterson, N., Capelli, C., Pascali, V.L., Scarnicci, F., Ruiz-Linares, A., Groop, L., Saetta, A.A., Korkolopoulou, P., et al. (2008). Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 4, e236.
74. Manolio, T.A. (2010). Genomewide association studies and assessment of the risk of disease. N. Engl. J. Med. 363, 166–176.
75. Sivakumaran, S., Agakov, F., Theodoratou, E., Prendergast, J.G., Zgaga, L., Manolio, T., Rudan, I., McKeigue, P., Wilson, J.F., and Campbell, H. (2011). Abundant pleiotropy in human complex diseases and traits. Am. J. Hum. Genet. 89, 607–618.
76. Dickson, S.P., Wang, K., Krantz, I., Hakonarson, H., and Goldstein, D.B. (2010). Rare variants create synthetic genome-wide associations. PLoS Biol. 8, e1000294.
77. Anderson, C.A., Soranzo, N., Zeggini, E., and Barrett, J.C. (2011). Synthetic associations are unlikely to account for many common disease genome-wide association signals. PLoS Biol. 9, e1000580.
78. Wray, N.R., Purcell, S.M., and Visscher, P.M. (2011). Synthetic associations created by rare variants do not explain most GWAS results. PLoS Biol. 9, e1000579.
79. Visscher, P.M., Goddard, M.E., Derks, E.M., and Wray, N.R. (2011). Evidence-based psychiatric genetics, AKA the false dichotomy between common and rare variant hypotheses. Molecular Psychiatry, in press. Published online 14 June 2011. 2010.1038/mp.2011.2065.
80. Hunter, D.J., and Kraft, P. (2007). Drinking from the fire hose—Statistical issues in genomewide association studies. N. Engl. J. Med. 357, 436–439.
81. Pryce, J.E., Hayes, B.J., Bolormaa, S., and Goddard, M.E. (2011). Polymorphic regions affecting human height also control stature in cattle. Genetics 187, 981–984.
82. Bodmer, W.F. (1986). Human genetics: The molecular challenge. Cold Spring Harb. Symp. Quant. Biol. 51, 1–13.
83. Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1516–1517.
84. Wray, N.R. (2005). Allele frequencies and the r² measure of linkage disequilibrium: impact on design and interpretation of association studies. Twin Res. Hum. Genet. 8, 87–94.
85. McClellan, J.M., Susser, E., and King, M.C. (2007). Schizophrenia: A common disease caused by multiple rare alleles. Br. J. Psychiatry 190, 194–199.
86. Craddock, N., O’Donovan, M.C., and Owen, M.J. (2007). Phenotypic and genetic complexity of psychosis. Invited commentary on ... Schizophrenia: a common disease caused by multiple rare alleles. Br. J. Psychiatry 190, 200–203.
87. Lander, E.S. (1996). The new genomics: Global views of biology. Science 274, 536–539.
88. Chakravarti, A. (1999). Population genetics—Making sense out of sequence. Nat. Genet. 21 (1, Suppl), 56–60.
89. Reich, D.E., and Lander, E.S. (2001). On the allelic spectrum of human disease. Trends Genet. 17, 502–510.
90. Risch, N. (1990). Linkage strategies for genetically complex traits. I. Multilocus models. Am. J. Hum. Genet. 46, 222–228.
91. Slatkin, M. (2008). Exchangeable models of complex inherited diseases. Genetics 179, 2253–2261.
92. Hill, W.G., Goddard, M.E., and Visscher, P.M. (2008). Data and theory point to mainly additive genetic variance for complex traits. PLoS Genet. 4, e1000008.
93. Wang, K., Dickson, S.P., Stolle, C.A., Krantz, I.D., Goldstein, D.B., and Hakonarson, H. (2010). Interpretation of association signals and identification of causal variants from genome-wide association studies. Am. J. Hum. Genet. 86, 730–742.
94. Nejentsev, S., Walker, N., Riches, D., Egholm, M., and Todd, J.A. (2009). Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science 324, 387–389.
95. Momozawa, Y., Mni, M., Nakamura, K., Coppieters, W., Almer, S., Amininejad, L., Cleynen, I., Colombel, J.F., de Rijk, P., Dewit, O., et al. (2011). Resequencing of positional candidates identifies low frequency IL23R coding variants protecting against inflammatory bowel disease. Nat. Genet. 43, 43–47.
96. Rivas, M.A., Beaudoin, M., Gardet, A., Stevens, C., Sharma, Y., Zhang, C.K., Boucher, G., Ripke, S., Ellinghaus, D., Burtt, N., et al.; National Institute of Diabetes and Digestive Kidney Diseases Inflammatory Bowel Disease Genetics Consortium (NIDDK IBDGC); United Kingdom Inflammatory Bowel Disease Genetics Consortium; International Inflammatory Bowel Disease Genetics Consortium. (2011). Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease. Nat. Genet. 43, 1066–1073.
97. Wang, K., Bucan, M., Grant, S.F., Schellenberg, G., and Hakonarson, H. (2010). Strategies for genetic studies of complex diseases. Cell 142, 351–353, author reply 353–355.
98. Hyttinen, V., Kaprio, J., Kinnunen, L., Koskenvuo, M., and Tuomilehto, J. (2003). Genetic liability of type 1 diabetes and the onset age among 22,650 young Finnish twin pairs: A nationwide follow-up study. Diabetes 52, 1052–1055.
99. Polychronakos, C., and Li, Q. (2011). Understanding type 1 diabetes through genetics: Advances and prospects. Nat. Rev. Genet. 12, 781–792.
100. Poulsen, P., Kyvik, K.O., Vaag, A., and Beck-Nielsen, H. (1999). Heritability of type II (non-insulin-dependent) diabetes mellitus and abnormal glucose tolerance—A population-based twin study. Diabetologia 42, 139–145.
101. Magnusson, P.K., and Rasmussen, F. (2002). Familial resemblance of body mass index and familial risk of high and low body mass index. A study of young men in Sweden. Int. J. Obes. Relat. Metab. Disord. 26, 1225–1231.
102. Schousboe, K., Willemsen, G., Kyvik, K.O., Mortensen, J., Boomsma, D.I., Cornes, B.K., Davis, C.J., Fagnani, C., Hjelmborg, J., Kaprio, J., et al. (2003). Sex differences in heritability of BMI: A comparative study of results from twin studies in eight countries. Twin Res. 6, 409–421.
103. Tysk, C., Lindberg, E., Järnerot, G., and Flodérus-Myrhed, B. (1988). Ulcerative colitis and Crohn’s disease in an unselected population of monozygotic and dizygotic twins. A study of heritability and the influence of smoking. Gut 29, 990–996.
104. Hawkes, C.H., and Macgregor, A.J. (2009). Twin studies and the heritability of MS: A conclusion. Mult. Scler. 15, 661–667.
105. Brown, M.A., Kennedy, L.G., MacGregor, A.J., Darke, C., Duncan, E., Shatford, J.L., Taylor, A., Calin, A., and Wordsworth, P. (1997). Susceptibility to ankylosing spondylitis in twins: The role of genes, HLA, and the environment. Arthritis Rheum. 40, 1823–1828.
106. Brown, M.A. (2011). Progress in the genetics of ankylosing spondylitis. Brief. Funct. Genomics 10, 249–257.
107. MacGregor, A.J., Snieder, H., Rigby, A.S., Koskenvuo, M., Kaprio, J., Aho, K., and Silman, A.J. (2000). Characterizing the quantitative genetic contribution to rheumatoid arthritis using data from twins. Arthritis Rheum. 43, 30–37.
108. Lichtenstein, P., Yip, B.H., Björk, C., Pawitan, Y., Cannon, T.D., Sullivan, P.F., and Hultman, C.M. (2009). Common genetic determinants of schizophrenia and bipolar disorder in Swedish families: A population-based study. Lancet 373, 234–239.
109. Purcell, S.M., Wray, N.R., Stone, J.L., Visscher, P.M., O’Donovan, M.C., Sullivan, P.F., and Sklar, P.; International Schizophrenia Consortium. (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752.
110. Lichtenstein, P., Holm, N.V., Verkasalo, P.K., Iliadou, A., Kaprio, J., Koskenvuo, M., Pukkala, E., Skytthe, A., and Hemminki, K. (2000). Environmental and heritable factors in the causation of cancer—Analyses of cohorts of twins from Sweden, Denmark, and Finland. N. Engl. J. Med. 343, 78–85.
Turnbull, C., Ahmed, S., Morrison, J., Pernet, D., Renwick, A., Maranian, M., Seal, S., Ghoussaini, M., Hines, S., Healey, C.S., et al; Breast Cancer Susceptibility Collaboration (UK). (2010). Genome-wide association study identifies five new breast cancer susceptibility loci. Nat. Genet. 42, 504–507. Orstavik, K.H., Magnus, P., Reisner, H., Berg, K., Graham, J.B., and Nance, W. (1985). Factor VIII and factor IX in a twin population. Evidence for a major effect of ABO locus on factor VIII level. Am. J. Hum. Genet. 37, 89–101. de Lange, M., Snieder, H., Ariëns, R.A., Spector, T.D., and Grant, P.J. (2001). The genetics of haemostasis: A twin study. Lancet 357, 101–105. Smith, N.L., Chen, M.H., Dehghan, A., Strachan, D.P., Basu, S., Soranzo, N., Hayward, C., Rudan, I., Sabater-Lleal, M., Bis, J.C., et al; Wellcome Trust Case Control Consortium. (2010). Novel associations of multiple genetic loci with plasma levels of factor VII, factor VIII, and von Willebrand factor: The CHARGE (Cohorts for Heart and Aging Research in Genome Epidemiology) Consortium. Circulation 121, 1382–1392. Visscher, P.M., Medland, S.E., Ferreira, M.A., Morley, K.I., Zhu, G., Cornes, B.K., Montgomery, G.W., and Martin, N.G. (2006). Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2, e41. 116. Silventoinen, K., Sammalisto, S., Perola, M., Boomsma, D.I., Cornes, B.K., Davis, C., Dunkel, L., De Lange, M., Harris, J.R., Hjelmborg, J.V., et al. (2003). Heritability of adult body height: A comparative study of twin cohorts in eight countries. Twin Res. 6, 399–408. 117. Peacock, M., Turner, C.H., Econs, M.J., and Foroud, T. (2002). Genetics of osteoporosis. Endocr. Rev. 23, 303–326. 118. Duncan, E.L., Danoy, P., Kemp, J.P., Leo, P.J., McCloskey, E., Nicholson, G.C., Eastell, R., Prince, R.L., Eisman, J.A., Jones, G., et al. (2011). 
Genome-wide association study using extreme truncate selection identifies novel genes affecting bone mineral density and fracture risk. PLoS Genet. 7, e1001372. 119. Dalageorgou, C., Ge, D., Jamshidi, Y., Nolte, I.M., Riese, H., Savelieva, I., Carter, N.D., Spector, T.D., and Snieder, H. (2008). Heritability of QT interval: how much is explained by genes for resting heart rate? J. Cardiovasc. Electrophysiol. 19, 386–391. 120. Russell, M.W., Law, I., Sholinsky, P., and Fabsitz, R.R. (1998). Heritability of ECG measurements in adult male twins. J. Electrocardiol. Suppl. 30, 64–68. 121. Shah, S.H., and Pitt, G.S. (2009). Genetics of cardiac repolarization. Nat. Genet. 41, 388–389. 122. Hunt, S.C., Hasstedt, S.J., Kuida, H., Stults, B.M., Hopkins, P.N., and Williams, R.R. (1989). Genetic heritability and common environmental components of resting and stressed blood pressures, lipids, and body mass index in Utah pedigrees and twins. Am. J. Epidemiol. 129, 625–638. 123. Evans, D.M., Frazer, I.H., and Martin, N.G. (1999). Genetic and environmental causes of variation in basal levels of blood cells. Twin Research: The Official Journal of the International Society for Twin Studies 2, 250–257. 24 The American Journal of Human Genetics 90, 7–24, January 13, 2012 Unraveling the Regulatory Mechanisms Underlying Tissue-Dependent Genetic Variation of Gene Expression Jingyuan Fu1,2*, Marcel G. M. Wolfs3, Patrick Deelen4, Harm-Jan Westra1, Rudolf S. N. Fehrmann1,5, Gerard J. te Meerman1, Wim A. Buurman6, Sander S. M. Rensen6, Harry J. M. Groen7, Rinse K. Weersma8, Leonard H. van den Berg9, Jan Veldink9, Roel A. Ophoff10, Harold Snieder2, David van Heel11, Ritsert C. Jansen12, Marten H. 
Hofker3, Cisca Wijmenga1, Lude Franke1*
1 Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands, 2 Department of Epidemiology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands, 3 Department of Pathology and Medical Biology, Molecular Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands, 4 Hanze University Groningen, Groningen, The Netherlands, 5 Department of Medical Oncology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands, 6 Department of Surgery, University Hospital Maastricht and Nutrition and Toxicology Research Institute (NUTRIM), Maastricht University, Maastricht, The Netherlands, 7 Department of Pulmonology, University Medical Centre Groningen, University of Groningen, Groningen, The Netherlands, 8 Department of Gastroenterology and Hepatology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands, 9 Department of Neurology, Rudolf Magnus Institute of Neuroscience, University Medical Center Utrecht, Utrecht, The Netherlands, 10 Department of Medical Genetics, University Medical Center Utrecht, Utrecht, The Netherlands, 11 Blizard Institute of Cell and Molecular Science, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom, 12 Groningen Bioinformatics Centre, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Haren, The Netherlands
Abstract
It is known that genetic variants can affect gene expression, but it is not yet completely clear through which mechanisms genetic variation mediates this expression. We therefore compared the cis-effect of single nucleotide polymorphisms (SNPs) on gene expression between blood samples from 1,240 human subjects and four primary non-blood tissues (liver, subcutaneous and visceral adipose tissue, and skeletal muscle) from 85 subjects. We characterized four different mechanisms for 2,072 probes that show tissue-dependent genetic regulation between blood and non-blood tissues: on average 33.2% only showed cis-regulation in non-blood tissues; 14.5% of the eQTL probes were regulated by different, independent SNPs depending on the tissue of investigation; 47.9% showed a different effect size although they were regulated by the same SNPs. Surprisingly, we observed that 4.4% were regulated by the same SNP but with opposite allelic direction. We show here that SNPs that are located in transcriptional regulatory elements are enriched for tissue-dependent regulation, including SNPs at 3′ and 5′ untranslated regions (P = 1.84×10⁻⁵ and 4.7×10⁻⁴, respectively) and SNPs that are synonymous-coding (P = 9.9×10⁻⁴). SNPs that are associated with complex traits more often exert a tissue-dependent effect on gene expression (P = 2.6×10⁻¹⁰). Our study yields new insights into the genetic basis of tissue-dependent expression and suggests that complex trait associated genetic variants have even more complex regulatory effects than previously anticipated.
Citation: Fu J, Wolfs MGM, Deelen P, Westra H-J, Fehrmann RSN, et al. (2012) Unraveling the Regulatory Mechanisms Underlying Tissue-Dependent Genetic Variation of Gene Expression. PLoS Genet 8(1): e1002431. doi:10.1371/journal.pgen.1002431
Editor: Greg Gibson, Georgia Institute of Technology, United States of America
Received July 19, 2011; Accepted November 8, 2011; Published January 19, 2012
Copyright: © 2012 Fu et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by IOP Genomics grant IGE05012A, the Netherlands Organisation for Scientific Research (NWO) VICI grant 918.66.620 (CW), a European Union FP7 COPACETIC grant 201379 (CW), the Dutch Diabetes Foundation (2006.00.007), the Wellcome Trust (084743 to DvH), the Medical Research Council UK (G1001158 to DvH), Juvenile Diabetes Research Foundation (33-2008-402 to DvH), a NWO VENI grant 863.09.007 (JF), a NWO VENI grant 916.10.135, a Horizon Breakthrough grant 92519031 from the Netherlands Genomics Initiative (LF), a NWO clinical fellowship grant 90.700.281 (RKW), the Netherlands ALS foundation and the Adessium Foundation (LHvdB), the Thierry Latran Foundation (JV), and a Transnational University Limburg (TUL) grant (SSMR). The research leading to these results has received funding from the European Community’s Health Seventh Framework Programme (FP7/2007–2013) under grant agreement no. 259867. This study was financed in part by the SIA-raakPRO subsidy for project BioCOMP. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected] (JF); [email protected] (LF)
Author Summary
Gene expression can be affected by genetic variation, e.g. single nucleotide polymorphisms (SNPs). These are called expression-affecting SNPs or eSNPs. Gene expression levels are known to vary across different tissues in the same individual, despite the fact that genetic variation is the same in these tissues. We explored the different mechanisms by which genetic variants can mediate tissue-dependent gene expression. We observed that the genetic variants that are associated with complex traits are more likely to affect gene expression in a tissue-dependent manner. Our results suggest that complex traits are even more complex than we had anticipated, and they underline the great importance of using expression data from tissues relevant to the disease being studied in order to further the understanding of the biology underlying the disease association.
Introduction
It has become clear that human genetic variants, such as single nucleotide polymorphisms (SNPs), can in cis affect the expression of nearby genes [1], [2]. Many loci exist that contain genetic variants that affect gene expression (expression quantitative trait loci, eQTL, usually assessed by investigating SNPs and expression probes that are within 250 kb up to 1 Mb apart). These cis-eQTL analyses have been performed in many different human tissues and cell types, including lymphoblastoid cell lines (LCL) [3], [4], liver [5]–[7], blood [8], [9], brain [10], [11], adipose tissues [6], [8], skin [12], [13] and primary fibroblasts [12]. However, considerable heterogeneity of cis-eQTL effects is possible between different tissues: a recent study reported that the proportion of heritability due to gene expression attributable to cis-regulation differs between tissues (37% in blood and 24% in adipose tissue) [14]. By comparing the overlap of significant cis-eQTL at a predefined threshold, estimates on the tissue-dependence of cis-eQTL were between 30% (liver, adipose tissues) and 70–80% (LCLs, fibroblasts, T cells) [8], [9], [15], [16]. However, due to statistical power issues, it is likely that the tissue-dependency of cis-eQTL has been overestimated by studies solely assessing the overlap of cis-eQTL between tissues based on a certain threshold. Realizing this problem, Ding et al. used a refined statistical method to estimate the percentage of overlap by adding a power parameter to the model [12]. They reported that only 30% of cis-eQTL in LCLs were not shared with fibroblast cis-eQTL. Similarly, a recent study by Nica et al. [13] examined the tissue-dependence of cis-eQTL in three human tissues (LCL, skin and fat) in a continuous manner by quantifying the proportion of overlap of cis-eQTL from the enrichment of low P-values. They observed that 29% of cis-eQTL appear to be exclusively tissue-dependent, and also observed that the effect sizes of 10–20% of the cis-eQTL present in multiple tissues differ per tissue type. These observations are in line with a large-scale transcriptomic analysis of 46 human tissues, which found that while only 6.0% of genes were ubiquitously expressed across all the assessed tissues, 3.1% of genes were only expressed in a single tissue [17].
To gain a better understanding of this subtle tissue-dependent regulation and to address the question of how genetic variants mediate tissue-dependent expression, we compared cis-regulation between whole peripheral blood from a large cohort of 1,240 individuals and four smaller primary human tissues (liver, subcutaneous adipose tissue (SAT), visceral adipose tissue (VAT) and skeletal muscle) obtained from a set of 85 subjects. We first applied a robust sampling procedure to estimate accurately how often genes showed different cis-eQTL effects between tissues. We then investigated in what way genes are differently associated with SNPs in different tissues. Finally, we assessed various functional properties for the SNPs involved in tissue-dependent cis-regulation and their association with complex traits.
Results
Cis-eQTL Mapping in Five Primary Tissues
For this study, we collected data for four different tissues from a set of 85 unrelated obese Dutch subjects. We successfully collected data on 74 liver samples, 62 muscle samples, 83 subcutaneous adipose tissue (SAT) samples and 77 visceral adipose tissue (VAT) samples (for 48 individuals all four tissues were available). The fifth tissue, blood, was collected from a different group of 1,240 unrelated Dutch individuals (Table S1). The gene expression levels in all five tissues were profiled using the same Illumina HumanHT12 v3 platform (see Materials and Methods). After normalization, we further removed strong expression differences between these tissues by removing the 50 principal components from this dataset and using the residuals for further analysis (described in [18] and Materials and Methods, Figure S1).
We first performed cis-eQTL analysis in each of these datasets separately, by testing the correlation between SNPs and probes that were mapping within 1 Mb distance. At a false-discovery rate (FDR) of 0.05, we identified a non-overlapping set of 195,078 probe-SNP pairs that were significant in at least one of the tissues under study: 4,700 probe-SNP pairs were significantly associated in liver, 7,161 pairs in SAT, 5,323 pairs in VAT, 1,971 pairs in muscle, and 190,278 pairs in blood (Figure S2). Owing to the much larger sample size, 182,569 probe-SNP pairs (93.6%) were solely detected in blood, while only 601 probe-SNP pairs (0.31%) were significant in each of the five different tissues (Figure S3). Although a previous study showed that the heritability of gene expression levels is higher in blood (37%) compared to adipose tissue (24%) [14], we believe that the large difference in the detected probe-SNP pairs between blood and non-blood tissues is due to statistical power issues that result from substantial sample size differences.
As we had initially run cis-eQTL analyses in each of the tissues separately, we subsequently conducted a weighted Z-score meta-analysis across the four non-blood tissues and detected 23,878 probe-SNP pairs at FDR of 0.05. Out of these, 23.2% (5,550 out of 23,878 probe-SNP pairs) had not been identified in any of the single-tissue analyses (Figure S4). In total, the single-tissue analyses and meta-analysis yielded a non-overlapping set of 200,629 significant probe-SNP pairs, corresponding to 103,968 unique expression altering SNPs (eSNPs) and 11,618 probes (eProbes) that represent 8,561 unique genes (eGenes) (Figure S2).
Cis-eQTL Effects Differ per Tissue Type
To assess the tissue-dependency of the cis-eQTL, we compared the Spearman correlation of each probe-SNP pair between tissues. However, due to the small sample sizes of the non-blood datasets we had very limited statistical power to determine whether there were cis-eQTL effect differences between non-blood tissues. We therefore confined ourselves to comparisons between the large blood dataset and each of the smaller non-blood tissues. To correct for sample size differences, we employed a resampling procedure, permitting us to derive an empirical distribution of association Z-scores (calculated based on the Spearman correlation) of each probe-SNP pair in blood of the same sample size as in non-blood tissues (see Materials and Methods; Figure S5). We observed that 18,456 pairs (9.2% of 200,629 probe-SNP pairs) showed a significantly different Z-score between blood and at least one of the non-blood tissues at P < 6.23×10⁻⁸ (corresponding to a conservative Bonferroni-corrected P < 0.05), implying a discordant association between blood and non-blood tissues. The remaining 182,173 probe-SNP pairs, which we called “concordant association”, had similar association Z-scores between the tissues under study (Figure S2). The “discordant associations” accounted for 15.4% of the eSNPs (15,974 out of 103,968 eSNPs), 28.7% of the eProbes (3,330 out of 11,618 eProbes), and 34.1% of the unique eGenes (2,919 out of 8,561 eGenes) (Table S2 and Figure S2).
We further assessed, for each probe-SNP pair, whether the discordance was detected between blood and multiple non-blood tissues, or only between blood and one specific non-blood tissue. We observed that 14,388 probe-SNP pairs (78.0% of the 18,456 discordant probe-SNP pairs) only showed a discordant effect between blood and one specific non-blood tissue. Only 125 probe-SNP pairs (corresponding to 31 eProbes) showed a discordant association in all four comparisons, suggesting similar regulation in the four non-blood tissues but markedly different regulation in blood (Figure S6). As such these results reveal there are considerable differences in the genetically determined regulation of gene expression between liver, SAT, VAT and muscle tissues, even though the RNA from these tissues had been derived from the same individuals and was collected at exactly the same time.
To ensure that our sampling procedure was robust, we used the same procedure to assess how often our method incorrectly concluded that a probe-SNP Spearman correlation differed between two independent eQTL datasets in the same peripheral blood tissue: we used the 1,240 blood samples as discovery set and used an independent set of 229 blood samples, whose expression was profiled using Illumina H8-v2 chips [18], [19], as validation (see Materials and Methods). In this analysis, our method incorrectly deemed that 0.45% of the probe-SNP pairs showed a significant difference at the previously used P < 6.23×10⁻⁸ level (Figure S7). In our comparisons between blood and non-blood tissues we had observed that 9.2% of the probe-SNP pairs showed a discordant effect, which is substantially higher and indicates that the number of discordant associations that we identified when comparing different tissues is not expected by chance (Fisher’s exact test: OR = 20.6 and P < 10⁻³⁰⁰). We also assessed whether imputation accuracy differences between datasets might confound some of the results, but did not find evidence for this (see Materials and Methods).
Properties of eSNPs
For the significant 200,629 probe-SNP pairs, we observed that for 146,480 pairs (73.0%) the eSNPs were located within 250 kb distance of the eProbe while 54,149 probe-SNP pairs (27.0%) mapped between 250 kb and 1 Mb apart. Consistent with a previous study [15], we observed that eSNPs at a larger distance from the probes tend to have smaller effects (Figure S8). However, we realize that due to extensive LD many different SNPs are usually significantly correlated with one single cis-eQTL probe. To address this, we performed step-wise conditional analyses in each tissue type to ascertain whether there were multiple SNPs that independently affected the expression levels of the same probe. We observed this for 26.8% of the eProbes in the large blood dataset (2,794 out of 10,443 eProbes had multiple independent eSNPs; Table S3). The secondary, tertiary and quaternary eSNPs usually map further away from the probe (Wilcoxon test P = 2.25×10⁻²⁶⁶, Figure S9), potentially reflecting some regulatory elements such as enhancers that usually reside further away from genes. In the non-blood tissues, we lacked statistical power to detect many secondary and tertiary effects (Table S3).
Interestingly, there was a very high overlap between the discordant eProbes (detected in our comparison across tissues) and the eProbes with multiple independent effects in blood (detected in the aforementioned analysis that solely used blood samples). Out of the 10,443 eProbes in blood, 2,528 eProbes had discordant association and 7,915 eProbes had concordant association. We observed that 47.5% of the discordant eProbes had multiple independent eSNPs present in blood (1,202 out of 2,528), whereas only 20.1% of the concordant eProbes had multiple independent eSNPs (1,592 out of 7,915, Fisher’s exact test P = 3.85×10⁻⁸¹). This observation suggests that for eProbes: 1) different independent eSNPs can exist and 2) these independent eSNPs can exert an effect in one tissue while they do not exert an effect in another tissue.
We subsequently analyzed the most significant eSNP per eProbe per tissue and the top eSNP per eProbe from the meta-analysis of four non-blood tissues. In total, we ended up with 13,603 probe-SNP pairs (12,549 top eSNPs that were affecting 11,575 probes) from these six analyses. Among them, 2,612 probe-SNP pairs (19.2%) showed a discordant effect among tissues at the P = 6.23×10⁻⁸ level (genome-wide test level), accounting for 2,466 (19.7%) unique eSNPs. We found that the top eSNPs with discordant effect had a significantly higher minor allele frequency (MAF) than the concordant top eSNPs (Wilcoxon test P = 8.27×10⁻²¹). The eSNPs at a smaller distance from the eProbe (≤250 kb) were more likely to show a discordant effect compared to the eSNPs at larger distance (250 kb–1 Mb distance, OR = 1.62, P = 3.6×10⁻²², Figure S10).
Although we acknowledge that the top eSNPs do not necessarily reflect the true causal variants, we annotated the functional properties of the top eSNPs to understand the potential roles of the eSNPs (irrespective of whether these reflect concordant or discordant eProbes). We observed that most of the eSNPs were located in intragenic regions (67.0%) and intronic regions (14.9%), where their function often remains undetermined. Interestingly, eSNPs with discordant effect were (compared to concordant eSNPs) significantly enriched for synonymous-coding SNPs (Fisher’s exact P value 9.9×10⁻⁴), and more often mapped in the 3′ and 5′ untranslated regions (UTRs, Fisher’s exact P values 1.84×10⁻⁵ and 4.7×10⁻⁴, respectively) (Figure 1).
As shown before, we observed that SNPs associated with complex traits and diseases are more likely to be eSNPs [2], [6], [8], [18], [19]. We subsequently analysed 1,954 trait-associated SNPs (at P < 5×10⁻⁸, retrieved from the GWAS catalog per 16 September 2011) [20] and observed that 907 trait-associated SNPs (46.4%) were eSNPs. Of these, 261 trait-associated eSNPs (28.7%) showed discordant effects on gene expression, which is significantly higher than what we observed for all 103,968 trait- and non-trait-associated eSNPs (15.4% discordant, Fisher’s exact test P = 1.10×10⁻³³) and also significantly higher than if we compare this to only the 12,549 top eSNPs (19.7% discordant, Fisher’s exact test P = 2.6×10⁻¹⁰).
Four Categories of Tissue-Dependent Cis-Regulation
As we have shown above, discordant eProbes are more likely to be influenced by multiple independent eSNPs. However, solely assessing the discordance of a single SNP-probe pair does not provide an extensive landscape of the tissue-dependent genetic determinants of gene expression. To gain further insight into this, we created ‘association profiles’ for the discordant eProbes and compared these across tissues. An association profile refers to the association Z-scores of all tested SNPs within 1 Mb distance of the eProbe under study (see Materials and Methods), and takes into account multiple SNPs and linkage disequilibrium. We created such association profiles for 2,007 discordant eProbes (521 eProbes from liver, 708 eProbes from SAT, 526 eProbes from VAT, and 252 eProbes from muscle, Figure S2).
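As a rough illustration of what an association profile contains, the sketch below computes a Spearman-based Z-score for every SNP tested against one probe. It is not the authors' pipeline: the conversion from correlation to Z-score (a Fisher transformation with the 1.06 variance factor) is an assumption made for illustration, since the text only states that Z-scores were calculated from the Spearman correlation, and all function names and the toy genotype data are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic SNP or constant expression: no association
    return cov / math.sqrt(var_x * var_y)


def association_z(genotypes, expression):
    """Approximate Z-score for one probe-SNP pair (assumed Fisher transform)."""
    n = len(genotypes)
    r = spearman_r(genotypes, expression)
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    return math.atanh(r) * math.sqrt((n - 3) / 1.06)


def association_profile(snp_genotypes, expression):
    """Map each SNP id to its Z-score against one probe's expression values."""
    return {snp: association_z(g, expression) for snp, g in snp_genotypes.items()}


# Toy data: allele dosages (0/1/2) for three hypothetical SNPs in eight samples.
expression = [1.0, 1.2, 2.1, 2.0, 3.2, 2.9, 1.1, 3.0]
snps = {
    "snp_a": [0, 0, 1, 1, 2, 2, 0, 2],  # tracks expression closely
    "snp_b": [2, 2, 1, 1, 0, 0, 2, 0],  # snp_a counted on the other allele
    "snp_c": [1, 1, 1, 1, 1, 1, 1, 1],  # monomorphic, carries no signal
}
profile = association_profile(snps, expression)
```

Computing one such profile per tissue for the same probe, over the same set of SNPs, gives two vectors of Z-scores that can then be correlated with each other, which is the comparison the text describes.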
Upon inspection of these association profiles for the discordant eProbes, we identified four main different categories of tissuedependent genetic regulation of gene expression. If the association profiles for one single eProbe did not correlate at all between two tissues, we further checked whether the eProbe was significant in both tissues: If the probe had a significant association in one tissue but not in the other, we deemed this ‘‘specific cis-regulation’’. If instead the eProbe was significant in both tissues, but was associated to different (unlinked) eSNPs in the different tissues, we deemed it ‘‘alternative cis-regulation’’ between tissues. For those association profiles where two tissues showed a correlation, we checked the direction and the effect size of allelic effect on gene expression. If the allelic direction was the same and the effect size was different, 3 January 2012 | Volume 8 | Issue 1 | e1002431 Mechanisms Underlying Tissue-Dependent cis-eQTL Figure 1. Functional Properties of eSNPs with tissue-dependent effect and concordant effect. The bar plot shows the frequency of the eSNP per function property. The eSNPs were annotated using the web-based tool of SNP Annotation and Proxy Search (SNAP; http://www. broadinstitute.org/mpg/snap/), based on the HapMap CEU population panel (release 22) and genome build 36.3. The asterisks indicate the significance of Fisher’s exact test by comparing the eSNPs with concordant effect and with discordant effect, as given in the legend. doi:10.1371/journal.pgen.1002431.g001 replicated this specific cis-regulation in liver (Figure 3A). The association Z-score for rs12740374 with SORT1 expression variation in liver was 8.24 (N = 74, P = 1.41610215) but in blood we observed no effect (Z-score = 0.07, N = 1,240, P = 0.8), nor did we observe any associations in SAT, VAT or muscle, and the association profiles for this gene show no correlation between different tissues (all spearman correlation P values.0.39). 
Thus, in our data, rs12740374 only exerts an effect on SORT1 gene expression in liver, although we did observe that SORT1 was expressed abundantly in all tissues. Alternative regulation. Alternative regulation between tissues refers to a gene that is cis-associated with a SNP in a particular tissue and associated with a different, independent SNP in another tissue. Such an alternative cis-regulation is also a common phenomenon, as we found it applied to on average 14.5% of the we concluded the eProbe belonged to the category ‘‘different effect size’’. If the allelic direction was instead opposite, the probes had tissue-dependent regulation with an ‘‘opposite allelic direction’’ (see Materials and Methods). We discuss each of these four categories in detail below and in Figure 2 and Figure 3. Specific regulation. Specific cis-regulation refers to a gene that is cis-regulated in only one specific tissue. We found this type of regulation is a common phenomenon as it accounted for on average 33.2% of the discordant eProbes (Figure 2). One well-established example is the SORT1 gene at the 1p13 cholesterol locus, to which SNPs map that affect low-density lipoprotein cholesterol (LDL-C) and the risk of myocardial infarction (MI) in humans [21], [22]. Recently, it was shown that the functional variant rs12740374 alters the binding site for C/EBP transcription factors and consequently alters the hepatic expression of the SORT1 gene [23]. Our data PLoS Genetics | www.plosgenetics.org 4 January 2012 | Volume 8 | Issue 1 | e1002431 Mechanisms Underlying Tissue-Dependent cis-eQTL Figure 2. cis-regulation of gene expression between tissues. The associated probe-SNP pairs were classified to be concordant or discordant between tissues. The small pie plot shows the proportion of probes that have only concordant association (red part) or at least one discordant association (blue part). 
The probes with discordant association were under tissue-dependent regulation and we characterized four different mechanisms: specific regulation, alternative regulation, different effect size and opposite effect sizes. Their proportions are shown in the large blue pie plot. The concordant cis-regulation and the four different mechanisms are illustrated by the correlation between SNP genotypes (AA, AG and GG) and gene expression levels in two tissues: brown dots represent the expression of a gene in tissue 1 and purple dots the expression of a gene in tissue 2. doi:10.1371/journal.pgen.1002431.g002 Different effect size. The different effect size refers to a common phenomenon that a gene is associated with the same SNP with alleles that have the same direction of effect but with a different magnitude in different tissues (Figure 2). For eProbe that showed this, we observed a significantly positive correlation between the association profiles of the tissues. We observed it applies to on average 47.9% of the probes that show tissuedependent regulation (Figure S2), in line with a previous report [13]. One example is the O-6-methylguanin-DNAmethyltransferase (MGMT) gene that plays an important role in DNA repair and which suppresses tumor development [24]. We observed a cis-eQTL for MGMT across each of the five tissues. However, the effect size in blood was substantially smaller than that in SAT tissues (Figure 3C). Opposite allelic direction. Surprisingly, we observed that some genes were associated with the same SNPs in different tissues probes with tissue-dependent regulation (Figure 2). One particular example is the trans-membrane gene TMEM176A, also known as hepatocellular carcinoma-associated antigen 112. The expression of TMEM176A was associated with intronic SNP rs714885 in liver (N = 74, P = 5.761026) but with the 19.5 kb upstream SNP rs6464104 in blood (N = 1,240, P = 5.076102132) (Figure 3B). 
These two SNPs are unlinked variants (r2 = 0.002 and D9 = 0.054 based on the HapMap phase II CEU panel). We observed the same alternative association for different probes of TMEM176A in an independent liver eQTL dataset (profiled using a custom ink-jet microarrays [7] and in the aforementioned independent blood eQTL dataset that was profiled using Illumina HumanRef-8 v2 BeadChips) (Table S4) [18], [19]. This clearly shows that 1) multiple, unrelated variants can sometimes affect exactly the same gene, and 2) these independent variants sometimes only exert an effect on the gene expression in a particular tissue. PLoS Genetics | www.plosgenetics.org 5 January 2012 | Volume 8 | Issue 1 | e1002431 Mechanisms Underlying Tissue-Dependent cis-eQTL Figure 3. Case examples for tissue-dependent cis-regulation. (A) The liver-specific regulation of the SORT1 gene. (B) The alternative regulation of the TMEM176A gene in blood and liver. (C) The cis-regulation for the MGMT gene had different effect sizes in blood and SAT. (D) The cisregulation for the DDT gene show opposite allelic direction between blood and liver. For each gene, the left panel shows the cis-eQTL association profile in the corresponding tissue (liver or SAT, in blue) vs the association profile in blood (red). The x-axis is the genome position based on genome build 36.3 (in Mb). The y-axis at the left is the association strength in terms of Z-score. The Z-score in blood has been weighted by the square root of the sample size, corresponding to the compared tissue. The dashed green line indicates the significance level of association at FDR 0.05. We use the absolute Z-scores to show the association in (A–C), but use the Z-scores in (D) for a better illustration of allelic direction. We assigned the association Z-scores in blood a negative value. If the allelic direction in SAT is the same as that in blood, the Z-score in SAT is negative too; otherwise, the Z-score in SAT is positive. 
The black line shows the recombination rate at this locus based on the HapMap II CEU panel, and the scale is indicated on the right-hand y-axis. The green line with arrow at the bottom shows the genome position of the gene and the arrow indicates the transcription direction. The right panel shows the correlation of the Z-scores between two tissues. The r-value indicates the Pearson correlation coefficient. doi:10.1371/journal.pgen.1002431.g003

effect of the regulatory factors (e.g., stimulating or suppressing the expression) and the size of their effects could lead to the observations of different categories (Figure 4).

but with alleles having an opposite effect on the gene expression between tissues. For a probe under this regulation, we then also observed a strong negative correlation between its association profiles across different tissues. This “opposite allelic direction” mechanism accounted for on average 4.4% of the probes under tissue-dependent regulation (Figure 2), which is much less common than the three previous mechanisms. However, this is still much more often than would be expected by chance, as determined by a comparison between two blood datasets in which we found the allelic directions were nearly always identical (Figure S7).

One striking opposite allelic direction was observed for D-dopachrome tautomerase (DDT), which showed completely opposite effects between blood and liver (Figure 3D). Consistently, we found this opposite effect in the independent liver [7] and blood (H8v2) datasets, even when different probes were assessed.
The minor allele rs5751777-C was associated with higher expression in liver (P = 9.95 × 10^-22 in the discovery set and P = 2.86 × 10^-211 in the validation set), but with lower expression in blood (P = 3.98 × 10^-119 in the discovery set and P = 4.37 × 10^-24 in the validation set) (Table S5). Strikingly, this opposite allelic direction was also observed when comparing liver with SAT, VAT and muscle, tissues that were all obtained from exactly the same set of individuals (Figure S11).

Another notable gene with an opposite allelic direction is ORMDL3. Although its function remains unclear, genetic variants near ORMDL3 are associated with various immune-related diseases, including asthma, type 1 diabetes, Crohn’s disease, ulcerative colitis and primary biliary cirrhosis [25]–[29]. ORMDL3 had a genome-wide significant cis-eQTL in blood, and its association in SAT showed near-genome-wide significance (Figure S12). All disease-associated SNPs in this locus showed association in cis with the expression level of ORMDL3 (Table S6), including the functional variant rs12936231 that has been implicated in playing a causal role in chromatin remodeling [30]. The risk alleles for asthma and preventive alleles for other autoimmune diseases showed consistent up-regulation in blood (and were also reported in LCLs) [25], [30]. However, to our surprise, the effect in SAT was completely reversed, leading to down-regulation.

Although we have only provided a few examples here, these observations indicate that conclusions drawn about mechanistic up- or down-regulation from a single tissue cannot necessarily be translated to other tissues, as they may sometimes lead to completely different conclusions depending on the tissues studied.
In the supplementary material (Tables S7, S8, S9, S10 and Figures S13, S14, S15, S16), we have summarized the observed tissue-dependent regulation for 156 genes that have been reported to be associated with complex traits at P = 5 × 10^-8 (based on the genes mentioned in the Catalog of Published Genome-wide Association Studies, as of 16 September 2011). Some of these plots also show that the genetic regulation of gene expression is sometimes even more complicated than what we have described here: some genes can have multiple cis-eQTL that were either shared or specific to the tissues, e.g., the MTMR3 gene, which has been associated with lung cancer [31], nephropathy [32], and inflammatory bowel disease [33], [34] (Figure S17).

The four categories of tissue-dependent cis-regulation we have observed can be explained by two molecular models: 1) the tissue-dependent use of the same causal variant, i.e., the same eSNPs tag the same causal variant that is activated differentially by tissue-dependent factors; 2) the tissue-dependent causal variants, i.e., the same or different eSNPs tag different causal variants depending on the tissues under study. The extent of the linkage disequilibrium (LD) between the causal variants and tag eSNPs, and the direction of

Discussion

Gene expression levels are partly determined by genetic variation, and eQTL mapping in different cell types and tissues has identified many cis-eQTL. However, the effect of cis-eQTL is strongly dependent upon the studied tissue. In this study, we compared the genetic architecture of gene expression regulation in blood and four non-blood primary tissues. We found that the majority (71.3%) of the detected probes under genetic control (eProbes) show a concordant association across tissues. However, the remaining 28.7% of the eProbes show discordant, tissue-dependent regulation. Strikingly, many of those discordantly associated eProbes are affected by multiple, independent eSNPs.
We followed up the genes under tissue-dependent regulation and identified four different mechanisms: specific regulation, alternative regulation, different effect size, and opposite allelic direction. We are the first to provide a comprehensive landscape of the different mechanisms of tissue-dependent cis-regulation.

Of the four mechanisms identified, the opposite allelic direction mechanism, where alleles can have opposing effects on gene expression between tissues, is of particular interest. Although this mechanism is less common than the other three, it has important implications for inferring the transcriptional effects of alleles from other tissue data, especially for the risk alleles of complex diseases. The use of different tissues could result in the completely opposite conclusion! This finding highlights the great importance of investigating disease-relevant tissues in order to correctly characterize the functional effects of disease-associated variants.

We observed that SNPs at various transcriptional regulatory regions exert tissue-dependent regulation more often than expected, although most of the eSNPs were located in intergenic and intronic regions whose functions remain undefined. However, we must emphasize that the causal variants remained undefined. Furthermore, because of the LD structure, although the same eSNPs can be associated with the expression of the same gene in different tissues, this does not necessarily mean that the same regulatory variants act in the different tissues. We have proposed two molecular models and suggested that tissue-dependent cis-regulation can be explained by the tissue-dependent use of the same causal variants or by the use of different tissue-dependent causal variants.
Because of the limited resolution of cis-eQTL mapping, further fine-mapping and functional analyses are needed to identify the causal variants and to understand how they are used in different tissues: it is known that regulatory cis-elements are generally only a few base pairs in size (i.e., the binding sites of transcription factors or microRNAs), whereas linkage disequilibrium blocks are generally in the range of 10–100 kb [35]. Furthermore, as the molecular models that we have proposed are quite simple, we cannot exclude other molecular mechanisms acting in these processes, e.g., the competition of different regulatory factors and binding sites in different tissues, or the role of tissue-specific methylation [36], [37] and chromatin remodeling [38].

It is well known that trait-associated SNPs are more likely to have effects on gene expression but, to our surprise, we found that they are also more likely to exert tissue-dependent effects. This observation adds an extra layer of complexity to complex traits.

We acknowledge that our study has some limitations. Firstly, we compared cis-regulation between peripheral blood and four non-blood tissues with rather small sample sizes. We lacked statistical power to compare

Figure 4. Molecular models of tissue-dependent cis-regulation. The observed tissue-dependent cis-regulations can be explained by two molecular models: (A) the tissue-dependent use of the same causal variants, or (B) the use of tissue-dependent causal variants. The ovals indicate the two regulatory factors (e.g., transcription factors) that play regulatory roles in different tissues (brown in tissue 1 and purple in tissue 2). These factors can recognize the same or different cis-elements (the yellow region). The genetic variants are shown as SNPs with A/G alleles. The SNPs in red are causal variants and the SNPs in blue are tag SNPs.
The red line between them indicates the linkage disequilibrium. The arrows indicate the effect of regulatory factors: the up arrows represent expression stimulators and the down arrows expression suppressors. The size of the arrows indicates the size of the differences between the expression of A and G alleles, i.e., the cis-eQTL effect size. doi:10.1371/journal.pgen.1002431.g004

the cis-regulations between two non-blood tissues well. Secondly, the identified discordant eQTLs are determined by the limited tissues that we studied. Thirdly, although we corrected for substantial expression differences across samples by employing principal component analysis, it is still possible that some of the observed tissue-dependent cis-regulation is due to tissue heterogeneity (i.e., different proportions of cell types per tissue). Likewise, it is also possible that some of the identified discordant cis-eQTL could be due to differences in the baseline expression between tissues. However, we observed this to be the case for both concordant and discordant cis-eQTL when investigating the original (non-PCA corrected) expression data (see Table S11). Nevertheless, our results indicate that natural genetic variation can affect gene expression levels in complex ways. Further analyses using different tissues and specific cell types and using larger sample sizes are required to gain a deeper understanding of the genetic variation of gene expression and to gain better insight into the full complexity of disease.

BeadChips, we further used the program IMPUTE v2 to impute the genotypes of SNPs that were present on the Omni1-Quad chips but not directly genotyped on the Hap370 and 610-Quad platforms [39]. The reference panel for imputation was the CEU population from HapMap release 22.
The directly genotyped SNPs were coded as 0, 1 or 2, while the imputed SNP dosage values were called at a 0.95 confidence level, ranging between 0 and 2. In this way, we obtained the genotypes of the same set of 1,140,419 SNPs for all five tissues under study.

RNA profiling. Anti-sense RNA was synthesized, amplified and purified using the Ambion Illumina TotalPrep Amplification Kit (Ambion, USA) following the manufacturer’s protocol. Complementary RNA was hybridized to Illumina HumanHT-12 arrays and scanned on the Illumina BeadArray Reader. Raw probe intensity data for these samples were extracted using Illumina’s BeadStudio Gene Expression module v3.2 (no background correction was applied, nor did we remove probes with low expression).

Materials and Methods

Genotyping and Expression Profiling on Liver, Muscle, and Adipose Fat Tissues from the Same Population

Subjects. From April 2006 to January 2009, 85 morbidly obese Dutch subjects (23 male and 62 female subjects) with a body mass index (BMI) between 35 and 70 were included in the study. They all underwent elective bariatric surgery at the Department of General Surgery, Maastricht University Medical Centre. Patients with acute or chronic inflammatory diseases (e.g., autoimmune diseases), degenerative diseases, reported alcohol consumption (>10 g/day), and/or using anti-inflammatory drugs were excluded. The average age of the subjects was 43.9 years, with a range of 17 to 67 years. This study was approved by the Medical Ethical Board of Maastricht University Medical Centre, in line with the guidelines of the 1975 Declaration of Helsinki. Informed consent in writing was obtained from each subject personally. The subject information is provided in Table S1.

Genotyping. Venous blood samples were obtained after 8 hours of fasting on the morning of surgery. DNA was extracted from this blood using the Chemagic Magnetic Separation Module 1 (Chemagen) integrated with a Multiprobe II Pipetting robot (PerkinElmer).
All samples were genotyped using Illumina HumanOmni1-Quad BeadChips that contain 1,140,419 SNPs. Genotyping was performed according to standard protocols from Illumina.

RNA profiling in four tissues. Wedge biopsies of liver, visceral adipose tissue (VAT, omentum majus), subcutaneous adipose tissue (SAT, abdominal), and muscle (musculus rectus abdominis) were taken during surgery. RNA was isolated using the Qiagen Lipid Tissue Mini Kit (Qiagen, Crawley, West Sussex, UK, 74804). Assessment of RNA quality and concentration was done with an Agilent Bioanalyzer (Agilent Technologies, Santa Clara, USA). Starting with 200 ng of RNA, the Ambion Illumina TotalPrep Amplification Kit was used for anti-sense RNA synthesis, amplification, and purification according to the protocol provided by the manufacturer (Ambion, Austin, USA). 750 ng of complementary RNA was hybridized to Illumina HumanHT12 BeadChips and scanned on the Illumina BeadArray Reader. Raw probe intensity data for these samples were extracted using Illumina’s BeadStudio Gene Expression module v3.2 (no background correction was applied, nor did we remove probes with low expression).

Genotyping and Expression Profiling in an Independent Blood Dataset of 229 Samples

Subjects. To ascertain whether our method for identifying tissue-dependent cis-eQTL was robust, we compared the large peripheral blood dataset with an independent blood eQTL dataset that comprised 229 samples. We have described this cohort in previous studies [9], [18]. In brief, this study comprised 111 English celiac disease patients, 59 Dutch amyotrophic lateral sclerosis patients and 59 Dutch healthy controls. The peripheral blood (2.5 ml) was collected with the PAXgene system (PreAnalytix GmbH, UK).

Genotyping and imputation. The samples were genotyped using the Illumina (Illumina, San Diego, USA) HumanHap300 platform. We further used IMPUTE v2 to impute the genotypes of all HapMap II SNPs.
The reference panel for imputation was the CEU population from HapMap release 22. The directly genotyped SNPs were coded as 0, 1 or 2, while the imputed SNP dosage values were called at a 0.95 confidence level, ranging between 0 and 2.

RNA profiling. Anti-sense RNA was synthesized, amplified and purified using the Ambion Illumina TotalPrep Amplification Kit (Ambion, USA) following the manufacturer’s protocol. Complementary RNA was hybridized to Illumina HumanRef-8 v2 arrays (further referred to as H8v2) and scanned on the Illumina BeadArray Reader.

Genotyping and Expression Profiling on Blood

Subjects. The genetical genomics samples for blood were collected from unrelated Dutch individuals in four studies: 324 healthy individuals were collected in the University Medical Centre Utrecht, 414 amyotrophic lateral sclerosis (ALS) patients were collected in the University Medical Centre Utrecht, 49 ulcerative colitis (UC) patients came from a part of the inflammatory bowel disease (IBD) cohort of the University Medical Centre Groningen, and 453 patients with chronic obstructive pulmonary disease (COPD) were collected within the NELSON study. All samples were collected after informed consent and with approval by local ethical review boards. Individual sample information is provided in Table S1.

Genotyping and imputation. DNA from all samples was hybridized to oligonucleotide arrays from Illumina. The 324 healthy individuals and 414 ALS patients were genotyped using the Hap370 platform. The 453 COPD patients and 49 UC patients were genotyped on the 610-Quad platform. Because the subjects with liver, muscle and adipose fat tissues were genotyped using the denser genotyping platform Illumina HumanOmni1-Quad

Normalization and PCA Correction

The raw expression intensities from five tissues were jointly quantile normalized and log2 transformed.
We further applied a principal component analysis (PCA) on the expression correlation matrix and observed that genes are differentially expressed among different tissue types (Figure S1). We argue that the dominant principal components (PCs) will primarily capture sample differences in expression that reflect physiological or environmental variation (e.g., tissue type and phenotype difference) as well as systematic experimental variation (e.g., batch and technical effects). In order to target the difference in the genetic variation of expression among tissues, we removed the global variation in expression among tissues by using the residual expression for each probe in each tissue after removing 50 PCs (identical to what we have described before [18]). Our previous analysis on the same dataset showed that the number of significantly detected cis-eQTL probes increased two-fold when 50 PCs were removed from the expression data (see Figure S7 in ref [18]). For the independent blood dataset with 229 subjects, we followed the same quantile normalization and PCA correction.

secondary eSNPs were present, we repeated the entire procedure to detect tertiary eSNPs by regressing out both the primary and secondary effect (using appropriate multivariate regression analysis). This procedure was repeated until no significant associations were detected any more.

Population Stratification and SNP Quality Control

We tested population stratification between the two cohorts using the program PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml). This program uses complete linkage agglomerative clustering, based on pair-wise identity-by-state (IBS) distances. The fact that all the individuals from both cohorts were clustered together indicates there was no population stratification.
We also checked the allelic frequencies between the two cohorts by treating the 85 individuals with four tissue samples as cases and the 1,240 individuals with blood samples as controls. For the imputed SNPs, we used the genotype with the highest probability as the discrete genotype for QC purposes. We removed SNPs that showed significant differences in allele frequency at P < 0.01. Then the SNPs were quality controlled for minor allele frequency > 5%, a call rate > 95% and an exact Hardy-Weinberg equilibrium (HWE) P value > 0.001. To be certain of the direction of the allelic effect on gene expression (up-regulating or down-regulating), we further removed SNPs with two types of transversion alleles (A/T and G/C) and confined our analysis to SNPs with transition alleles (A/G or C/T) and other types of transversion alleles (A/C or G/T). This quality control resulted in 710,035 SNPs for further analysis.

Sampling Approach to Identify Tissue-Dependent eQTL

Comparing blood and non-blood tissues. For each of the 200,629 probe-SNP pairs that were significantly associated at the FDR 0.05 level, we further assessed whether the detected Z-scores differed per tissue. We used the Z-scores in blood as a reference because the blood samples were independent from the other tissue samples and the sample size was much larger. To correct for the sample size difference, we randomly selected, without replacement, the same number of samples out of the 1,240 blood samples for the comparison with liver (N = 74), SAT (N = 83), VAT (N = 77) and muscle (N = 62). For a given probe-SNP pair, we re-calculated the association Z-score in blood for the selected sample size. The sampling procedure was repeated 100 times. We subsequently fitted a generalized extreme value distribution (GEVD) to the Z-scores of the 100 sampling procedures in blood. GEVD is a flexible model with three parameters: location (c), scale (b) and shape (a). GEVD can resemble different distributions with different settings of parameters.
For example, when a = 0, it resembles the Gumbel type of distribution (Type I); when a > 0, it resembles the Fréchet type of distribution (Type II); when a < 0, it resembles the Weibull type of distribution (Type III). Therefore, fitting the GEVD permits us to estimate a realistic distribution of the Z-scores of this probe-SNP pair in blood (Figure S3). We then assessed the deviation of the Z-score of the same probe-SNP pair in the other four tissues from the estimated GEVD in blood and computed a P value for the difference of Z-scores between tissues. We did this analysis in R (version 2.10.1) using the package evd: Functions for extreme value distributions (version 2.2–4). This analysis was done for each of the 200,629 probe-SNP pairs and between blood and each of the four non-blood tissues. Considering the possible dependence of the eQTL effect among tissues, the significance was controlled at the conservative Bonferroni-corrected 0.05 level, corresponding to a P value of 6.23 × 10^-8 (0.05/200,629 probe-SNP pairs/4 tissue comparisons). The probe-SNP pairs with P ≤ 6.23 × 10^-8 were called “discordant associations”, while probe-SNP pairs with P > 6.23 × 10^-8 were called “concordant associations”.

The expression profiling in all five tissues used the same platform. Therefore, the discordant association cannot be explained by the hybridization efficiency. Because all of the tested SNPs were directly genotyped in non-blood tissues but most of them were imputed in blood, we further checked whether the discordance was caused by the imputation. We did not observe that imputation accuracy confounded our results: 69.3% of the discordant eSNPs were imputed in blood whereas 68.0% of the concordant eSNPs were imputed in blood (Fisher’s exact test P value = 0.60). We also assessed whether there was heterogeneity in effect when comparing the different subgroups of phenotypes. We did not find evidence for this (see Table S6 in ref [18]).
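The comparison against the fitted distribution and the Bonferroni cut-off can be illustrated with a short sketch. It assumes the standard three-parameter GEV form with location c, scale b and shape a as named in the text, and it takes the fitted parameters as given rather than estimating them (the original analysis used the R package evd for the fit). The function names, the two-sided tail convention and the example parameter values are illustrative assumptions, not the authors' implementation.

```python
import math

N_PAIRS = 200_629   # significant probe-SNP pairs (from the text)
N_COMPARISONS = 4   # blood vs liver, SAT, VAT and muscle (from the text)
ALPHA = 0.05


def bonferroni_threshold(alpha=ALPHA, n_pairs=N_PAIRS, n_comp=N_COMPARISONS):
    """Bonferroni-corrected P value threshold: alpha / pairs / comparisons."""
    return alpha / n_pairs / n_comp


def _gev_exponent(z, loc, scale, shape):
    """Return y such that the GEV CDF at z equals exp(-y)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    s = (z - loc) / scale
    try:
        if shape == 0.0:                      # Gumbel type (Type I)
            return math.exp(-s)
        t = 1.0 + shape * s
        if t <= 0.0:
            # z lies outside the support: below the lower bound when a > 0
            # (Frechet type), above the upper bound when a < 0 (Weibull type).
            return math.inf if shape > 0 else 0.0
        return t ** (-1.0 / shape)
    except OverflowError:
        return math.inf


def gev_cdf(z, loc, scale, shape):
    """Cumulative distribution function of the GEV distribution."""
    return math.exp(-_gev_exponent(z, loc, scale, shape))


def two_sided_p(z, loc, scale, shape):
    """Two-sided tail probability of z under the fitted GEV (assumed convention)."""
    y = _gev_exponent(z, loc, scale, shape)
    lower = math.exp(-y)        # P(Z <= z)
    upper = -math.expm1(-y)     # P(Z > z), computed stably for small tails
    return min(1.0, 2.0 * min(lower, upper))


def classify(p, threshold=None):
    """Label a probe-SNP pair as discordant or concordant between tissues."""
    thr = bonferroni_threshold() if threshold is None else threshold
    return "discordant" if p <= thr else "concordant"


if __name__ == "__main__":
    print(f"threshold = {bonferroni_threshold():.3g}")
    # Hypothetical fitted parameters for one probe-SNP pair in blood.
    p = two_sided_p(z=30.0, loc=0.0, scale=1.0, shape=0.0)
    print(classify(p))
```

With these definitions, a Z-score from a non-blood tissue is compared against the blood-derived distribution for the same probe-SNP pair, and only pairs whose tail probability falls at or below the corrected threshold are labeled discordant.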
Comparing two independent blood datasets. To further validate the tissue-dependent effect we had detected, we compared the cis-eQTL effects between the blood datasets HT12 and H8v2, using the same sampling procedure as described above. Because of the difference in expression platform, we could only make comparisons for those probes that were present in both datasets. We only investigated SNPs that showed similar allele frequencies between the two blood datasets (SNPs with allele frequency P < 0.01 were excluded from analysis and as the H8v2 dataset contained 111 celiac disease patients that were nearly all HLA-

eQTL Discovery

In order to detect cis-eQTLs, analysis was confined to those probe-SNP combinations for which the distance from the probe transcript midpoint to the SNP genomic location was ≤ 1 Mb. For each probe-SNP pair, we used Spearman correlation to detect association between SNPs and the variation of gene expression in liver, SAT, VAT, muscle and blood, respectively. We calculated the Spearman correlation coefficient and corresponding P value and subsequently transformed this into a Z-score. To maximize the power of eQTL discovery in non-blood tissues, we further performed a meta-analysis that combines the association signals across the four non-blood tissues under study. An overall, joint P value was calculated using a weighted Z-method (weights equal to the square root of the dataset sample number). Please see ref [40] for a comprehensive overview of this method. To correct for multiple testing, we controlled the false-discovery rate (FDR) at 0.05: the distribution of observed P values was used to calculate the FDR, by comparison with the distribution obtained from permuting expression phenotypes relative to genotypes 100 times.
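The weighted Z-method used for the meta-analysis above can be sketched as follows. This assumes the common weighted Stouffer form, in which each per-tissue Z-score is weighted by the square root of its sample size and the weighted sum is rescaled to unit variance; the text defers to ref [40] for the details, so the exact formula, the function names and the example Z-scores are assumptions made for illustration.

```python
import math


def weighted_z(z_scores, sample_sizes):
    """Combine per-dataset Z-scores using sqrt(sample size) weights."""
    if not z_scores or len(z_scores) != len(sample_sizes):
        raise ValueError("need one sample size per Z-score")
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator


def two_sided_p_from_z(z):
    """Two-sided P value of a standard normal Z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Sample sizes from the text: liver, SAT, VAT, muscle.
    sizes = [74, 83, 77, 62]
    # Hypothetical per-tissue Z-scores for one probe-SNP pair.
    zs = [2.1, 2.8, 2.4, 1.9]
    z_meta = weighted_z(zs, sizes)
    print(z_meta, two_sided_p_from_z(z_meta))
```

Weighting by the square root of the sample size lets the larger tissue datasets contribute more to the joint signal while keeping the combined statistic on the standard normal scale under the null.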
At the FDR = 0.05 level, the significance P value threshold was 1.37 × 10^-5 for significantly associated probe-SNP pairs in liver, 2.07 × 10^-5 for significant association in SAT, 1.54 × 10^-5 for significant association in VAT, 5.64 × 10^-6 for significant association in muscle, 4.8 × 10^-4 for significant association in blood and 1.10 × 10^-4 for significant association in the meta-analysis of four non-blood tissues. For these significant probe-SNP pairs, we termed the corresponding SNP, probe and gene the expression SNP (eSNP), regulated probe (eProbe) and regulated gene (eGene), respectively.

Conditional Regression Analysis to Detect Independent eSNPs

Due to the linkage disequilibrium among the tested SNPs, we usually found numerous eSNPs for each eProbe. In order to detect independent eSNPs, we performed conditional regression analysis for the eProbes per tissue type. For each eProbe, we first regressed out the main effect of the top eSNP. We then subjected the residuals to eQTL mapping to detect potential secondary, independent eSNPs. We again controlled the false-discovery rate at 0.05 by running 100 permutations, as described before in the Methods section “eQTL Discovery”. If

repeated this permutation 100 times and determined the empirical threshold r_thres = 0.21 at the FDR 0.05 level using the model FDR = n0{r0 ≥ r_thres} / n{r ≥ r_thres}, where r and r0 refer to the Pearson correlation coefficient of real data and permuted data, respectively; n refers to the number of probes where r ≥ r_thres and n0 refers to the average number of probes where r0 ≥ r_thres from the 100 permutations.

Based on the correlation of association profiles between tissues, we identified four different categories of tissue-dependent genetic regulation of gene expression.
If the association profiles for one single probe did not correlate between two tissues (r < 0.21), we further checked whether the eProbe was significant in both tissues: if the probe had a significant association in one tissue but not in the other, we deemed this “specific cis-regulation”; if instead the eProbe was significant in both tissues, but was associated with different (unlinked) eSNPs in the different tissues, we deemed it “alternative cis-regulation”. For those association profiles where two tissues showed a correlation (r ≥ 0.21), we checked the direction and the effect size of the allelic effect on gene expression: if the allelic direction was the same and the effect size was different, we concluded the eProbe belonged to the category “different effect size”; if the allelic direction was instead opposite, the probes had tissue-dependent regulation with an “opposite allelic direction”.

DQ2.2 or HLA-DQ2.5 positive we also excluded the HLA from this analysis). After filtering we could compare 93,656 probe-SNP pairs.

Enrichment for SNP Properties

The minor allele frequency (MAF) and functional properties of eSNPs were annotated by the web-based tool SNP Annotation and Proxy Search (SNAP) (www.broadinstitute.org/mpg/snap) [41], using the CEU population panel from HapMap release 22. We performed Fisher’s exact test to compare the enrichment between eSNPs with a tissue-dependent effect on expression across tissues and eSNPs with a static effect.

Cis-eQTL Analysis of Trait-Associated SNPs

To directly assess the effect of trait-associated SNPs on gene expression, we confined our cis-eQTL analysis to 1,954 SNPs (with alleles A/G) that were associated with complex traits at P < 5.0 × 10^-8 in the ‘Catalog of Published Genome-wide Association Studies’ (per 16 September 2011) [20] and assessed the tissue-dependency of the eQTL effect across the tissues, following the same analysis and permutation procedures.
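The four-category decision rule described earlier in this section, based on the correlation of association profiles and the threshold r = 0.21, can be sketched as a small function. The names `pearson` and `classify_tissue_dependence`, the boolean inputs and the example profiles are hypothetical; the sketch also assumes it is applied only to eProbes already called discordant, so that a shared allelic direction implies a different effect size.

```python
import math

R_THRES = 0.21  # empirical correlation threshold at FDR 0.05 (from the text)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def classify_tissue_dependence(r, significant_in_both, same_direction):
    """Assign a discordant eProbe to one of the four categories."""
    if r < R_THRES:
        # Uncorrelated profiles: either one tissue lacks the association,
        # or both have it but through different (unlinked) eSNPs.
        if significant_in_both:
            return "alternative cis-regulation"
        return "specific cis-regulation"
    # Correlated profiles: the same signal is seen in both tissues.
    if same_direction:
        return "different effect size"
    return "opposite allelic direction"


if __name__ == "__main__":
    # Hypothetical absolute Z-score profiles of one probe in two tissues.
    blood = [0.5, 1.0, 4.0, 6.0, 2.0]
    liver = [0.4, 1.2, 3.5, 5.0, 1.5]
    r = pearson(blood, liver)
    print(classify_tissue_dependence(r, significant_in_both=True,
                                     same_direction=False))
```

Keeping the correlation step separate from the categorical rule mirrors the text, where the profile correlation is computed first and the significance and direction checks are applied afterwards.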
The cis-eQTL significance threshold P values were set at P = 4.6 × 10^-3 in blood, 2.6 × 10^-4 in liver, 2.5 × 10^-4 in muscle, 1.8 × 10^-4 in VAT, 3.2 × 10^-5 in SAT, and 1.1 × 10^-3 for the meta-analysis of four non-blood tissues. At these levels, a total of 2,990 probe-SNP pairs were significant in at least one eQTL analysis.

Differential Expression

For the probes with tissue-dependent cis-regulation, we assessed whether they were also differentially expressed between the tissues where they showed different cis-regulation. To do so, we relied upon the quantile-normalized expression intensity before any removal of the first 50 principal components. For each discordant eProbe, we used a Wilcoxon Mann-Whitney U test to assess the differential expression between the tissues. We performed the same analysis for a random set of concordant eProbes, equal in size to the set of discordant eProbes. The significance of differential expression was controlled at a Bonferroni-corrected P value of 0.05.

Characterizing the Tissue-Dependent Mechanisms of Cis-Regulation

To characterize the tissue-dependent mechanisms of cis-regulation, we reasoned that comparing the association at a single probe-SNP level cannot provide a complete picture of the tissue-dependent genetic determinants of gene expression. To gain further insight into the tissue-dependent cis-regulation, we extended the analysis for the eProbes with discordant cis-eQTL that were determined by single probe-SNP comparison and compared their whole association profiles across tissues. The association profile refers to the set of the absolute Z-scores of all N tested SNPs within 1 Mb of the midpoint of the probe under study, i.e., {|Z1|, |Z2|, |Z3|, …, |ZN|}. Such a profile can represent the combined association signals of the multiple independent eSNPs and their linkage disequilibrium. Most of the eProbes only showed significant association in blood and were not significantly associated in the smaller non-blood tissues.
For those eProbes, we had limited statistical power to determine whether the association in non-blood tissues is truly absent or is not detected due to power issues. Therefore, we confined our comparison of association profiles to the eProbes that were significantly associated in non-blood tissues and compared them to those in blood. To assess the similarity of association profiles across tissues, we computed the Pearson correlation coefficient (r) of the association profiles between two tissues. Because the SNPs were likely in strong linkage disequilibrium, there is strong dependency among the Z-scores within the association profile. To determine the empirical threshold for the significance of the correlation between the association profiles while considering the dependency of the SNPs, we performed permutation analysis by randomly assigning genomes to the individuals per tissue type. We thus obtained the association profiles per probe per tissue for the permuted genotypes. These permuted association profiles retained the same correlation structure among SNPs, and the Pearson correlation coefficient between the permuted association profiles (r0) would mainly explain the correlation among SNPs. We

Accession Numbers

Expression data for both the blood dataset and the four non-blood datasets have been deposited in GEO with accession numbers GSE20142 (1,240 peripheral blood samples, hybridized to HT12 arrays) and GSE22070 (subcutaneous adipose, visceral adipose, muscle and liver samples). The expression data of the validation blood eQTL dataset (229 samples) have been deposited in GEO with accession number GSE203332.

Supporting Information

Figure S1 The effect of removing principal components from expression data. (PDF)

Figure S2 Flowchart for the analysis of the tissue-dependent cis-eQTL across the five human tissues. (PDF)

Figure S3 Overlap of the associated probe-SNP pairs across the tissues.
(PDF)
Figure S4 Overlap of the associated probe-SNP pairs across the single-tissue analysis and meta-analysis. (PDF)
Figure S5 Sampling procedure. We assessed the difference in association strength between blood and four other tissues (liver, SAT, VAT and muscle). As an example, for liver, we randomly sampled 74 subjects out of the 1,240 blood subjects (matching the sample size of the liver dataset) and re-measured the association strength for each significantly associated probe-SNP pair, in terms of Z-scores. This sampling procedure was repeated 100 times. The histogram shows the distribution of Z-scores of a certain cis-eQTL in 74 blood subjects. We then assessed the deviation of the Z-score detected in liver (the red arrow) from the distribution of Z-scores in blood by fitting the extreme value distribution (EVD) (the red line). The same analysis was performed for comparing blood with SAT, VAT and muscle, by randomly sampling N blood subjects (N = 83 for the SAT sample size, 77 for the VAT sample size, and 62 for the muscle sample size). (PDF)
Figure S6 The overlap of discordantly associated probe-SNP pairs. (PDF)
Figure S7 The comparison of Z-scores between two independent blood datasets. The comparison of cis-eQTL effects was confined to the set of 93,656 probe-SNP pairs that have been tested in two independent blood datasets, e.g., a discovery set of 1,240 subjects profiled on the Illumina HT12 expression platform (HT12) and a validation set of 229 subjects profiled on the Illumina H8v2 expression platform (H8v2). The Z-scores of cis-eQTLs in the discovery set were the mean of the Z-scores from 100 samplings of 229 out of the 1,240 blood subjects. The gray dots indicate the concordantly associated probe-SNP pairs between the two blood samples. The red dots indicate the discordantly associated probe-SNP pairs (the false-positive tissue-dependent associations). The black line is the diagonal. (PDF)
Figure S8 The probe-SNP distance for associated probe-SNP pairs. The distance was calculated as the base pair (bp) position of the SNP minus the bp position of the midpoint of the probe. (PNG)
Figure S9 Probe-SNP distance for 2,794 eProbes in blood with multiple independent eSNPs. (PDF)
Figure S10 The discordant probe-SNP pairs vs. the probe-SNP distance. The histogram shows the number of probe-SNP pairs at different distances. The numbers on each bar show the total number of probe-SNP pairs and the percentage of pairs with discordant association. The 2×2 table for Fisher's exact test is shown. (PDF)
Figure S11 The direction of allelic effect of rs5751777 on DDT expression. The correlation between the genotype of rs5751777 and the expression intensity of the DDT gene (residual variance after 50 PCs removed) in five tissues. Each dot represents one subject, red for females and blue for males. The x-axis represents the genotypes and the y-axis represents the expression rank of the probes. (PDF)
Figure S12 The opposite association of the ORMDL3 gene between blood and SAT. The x-axis is the genome position based on genome build 36.3 (in Mb). The y-axis at the left shows the association profiles in terms of Z-scores. The Z-scores in blood, represented as the red dots, have been weighted by the square root of the sample size corresponding to the compared tissue. The blue dots represent the Z-scores in SAT. The dashed green line indicates the significance level at FDR 0.05. For a better illustration of allelic direction, we assigned the association Z-scores in blood a positive value. If the allelic direction in SAT is the same as that in blood, the Z-scores in SAT are positive too; otherwise, the Z-scores in SAT are negative. (PDF)
Figure S13 The association profiles of the selected trait-associated genes that show discordant association between blood and liver. The x-axis is the genome position based on genome build 36.3. The y-axis at the left shows the association profiles in terms of Z-scores. The Z-scores in blood are represented as red or orange dots. The red dots refer to the Z-scores that have been weighted by the square root of the sample size corresponding to the compared tissue. To make subtle effects in blood visible, weak associations in blood are shown as orange dots where the Z-scores have not been weighted by sample size, i.e., the Z-scores reported in 1,240 subjects. The blue dots represent the Z-scores in liver. The dashed green line indicates a Z-score of 3.49, representing the significance level in blood at FDR 0.05. The right panel shows the correlation of the absolute association Z-scores between the two tissues. The rho-value indicates the Pearson correlation coefficient. (PDF)
Figure S14 The association profiles of the selected trait-associated genes that show discordant association between blood and SAT. The x-axis is the genome position based on genome build 36.3. The y-axis at the left shows the association profiles in terms of Z-scores. The Z-scores in blood are represented as red or orange dots. The red dots refer to the Z-scores that have been weighted by the square root of the sample size corresponding to the compared tissue. To make subtle effects in blood visible, weak associations in blood are shown as orange dots where the Z-scores have not been weighted by sample size, i.e., the Z-scores reported in 1,240 subjects. The blue dots represent the Z-scores in SAT. The dashed green line indicates a Z-score of 3.49, representing the significance level in blood at FDR 0.05. The right panel shows the correlation of the absolute association Z-scores between the two tissues. The rho-value indicates the Pearson correlation coefficient. (PDF)
Figure S15 The association profiles of the selected trait-associated genes that show discordant association between blood and VAT. The x-axis is the genome position based on genome build 36.3. The y-axis at the left shows the association profiles in terms of Z-scores. The Z-scores in blood are represented as red or orange dots. The red dots refer to the Z-scores that have been weighted by the square root of the sample size corresponding to the compared tissue. To make subtle effects in blood visible, weak associations in blood are shown as orange dots where the Z-scores have not been weighted by sample size, i.e., the Z-scores reported in 1,240 subjects. The blue dots represent the Z-scores in VAT. The dashed green line indicates a Z-score of 3.49, representing the significance level in blood at FDR 0.05. The right panel shows the correlation of the absolute association Z-scores between the two tissues. The rho-value indicates the Pearson correlation coefficient. (PDF)
Figure S16 The association profiles of the selected trait-associated genes that show discordant association between blood and muscle. The x-axis is the genome position based on genome build 36.3. The y-axis at the left shows the association profiles in terms of Z-scores. The Z-scores in blood are represented as red or orange dots. The red dots refer to the Z-scores that have been weighted by the square root of the sample size corresponding to the compared tissue. To make subtle effects in blood visible, weak associations in blood are shown as orange dots where the Z-scores have not been weighted by sample size, i.e., the Z-scores reported in 1,240 subjects. The blue dots represent the Z-scores in muscle. The dashed green line indicates a Z-score of 3.49, representing the significance level in blood at FDR 0.05. The right panel shows the correlation of the absolute association Z-scores between the two tissues. The rho-value indicates the Pearson correlation coefficient. (PDF)
Figure S17 Association profiles of MTMR3 in blood and liver. The x-axis is the genome position based on genome build 36.3 (in Mb). The y-axis at the left indicates the association Z-score. The Z-scores in blood, represented as the red dots, have been weighted by the square root of the sample size corresponding to the compared tissue. The blue dots represent the Z-scores in SAT. The dashed green line indicates a Z-score of 3.49, representing the significance level in blood at FDR 0.05. The right panel shows the correlation of the absolute association Z-scores between the two tissues. The r-value indicates the Pearson correlation coefficient. (PDF)
Table S1 Characteristics of Samples. (XLS)
Table S2 The number of discordant cis-eQTL between blood and non-blood tissues. (DOC)
Table S3 The Number of independent eSNPs per probe. (DOC)
Table S4 Replication of tissue-alternative cis-eQTL of TMEM176A. (DOC)
Table S5 Replication of cis-eQTL of DDT in blood and liver that show opposite allelic direction. (DOC)
Table S6 Allelic effect of disease-associated SNPs on the expression of ORMDL3. (DOC)
Table S7 The tissue-dependent regulation of 45 trait-associated genes that show discordant association between blood and liver. (XLS)
Table S8 The tissue-dependent regulation of 50 trait-associated genes that show discordant association between blood and SAT. (XLS)
Table S9 The tissue-dependent regulation of 46 trait-associated genes that show discordant association between blood and VAT. (XLS)
Table S10 The tissue-dependent regulation of 19 trait-associated genes that show discordant association between blood and muscle. (XLS)
Table S11 The number of the differentially expressed eProbes. (DOC)

Acknowledgments
We thank Robert Hartholt for helping with the DNA isolation and Pieter van der Vlies, Elvira Oosterom, Marcel Bruinenberg, and Bahram Sanjabi for the genotyping and gene expression profiling. We also thank Eric Schadt for providing the allelic directions of eQTL detected in human liver and Jackie Senior for editing the manuscript.

Author Contributions
Conceived and designed the experiments: JF LF CW MHH. Wrote the paper: JF LF. Collected tissues: WAB SSMR HJMG RKW LHvdB JV DvH. Conducted genotyping: CW RAO. Conducted expression profiling: MGMW CW MHH. Bioinformatics and statistical analyses: JF PD H-JW LF. Bioinformatics support: RCJ. PCA-based normalization: RSNF GJtM LF. Helped to improve the manuscript: HS RCJ CW.

PLoS Genetics | www.plosgenetics.org | January 2012 | Volume 8 | Issue 1 | e1002431

REPORT
A Versatile Gene-Based Test for Genome-wide Association Studies
Jimmy Z. Liu,1,* Allan F. Mcrae,1 Dale R. Nyholt,1 Sarah E. Medland,1 Naomi R. Wray,1 Kevin M. Brown,2 AMFS Investigators,3 Nicholas K. Hayward,1 Grant W. Montgomery,1 Peter M. Visscher,1 Nicholas G. Martin,1 and Stuart Macgregor1,*

We have derived a versatile gene-based test for genome-wide association studies (GWAS). Our approach, called VEGAS (versatile gene-based association study), is applicable to all GWAS designs, including family-based GWAS, meta-analyses of GWAS on the basis of summary data, and DNA-pooling-based GWAS, where existing approaches based on permutation are not possible, as well as singleton data, where they are.
The test incorporates information from a full set of markers (or a defined subset) within a gene and accounts for linkage disequilibrium between markers by using simulations from the multivariate normal distribution. We show that for an association study using singletons, our approach produces results equivalent to those obtained via permutation in a fraction of the computation time. We demonstrate proof-of-principle by using the gene-based test to replicate several genes known to be associated on the basis of results from a family-based GWAS for height in 11,536 individuals and a DNA-pooling-based GWAS for melanoma in ~1300 cases and controls. Our method has the potential to identify novel associated genes; provide a basis for selecting SNPs for replication; and be directly used in network (pathway) approaches that require per-gene association test statistics. We have implemented the approach in both an easy-to-use web interface, which only requires the uploading of markers with their association p-values, and a separate downloadable application.

Gene-based tests for association are increasingly being seen as a useful complement to genome-wide association studies (GWAS) [1]. A gene-based approach considers association between a trait and all markers (usually SNPs) within a gene rather than each marker individually. Depending on the underlying genetic architecture, gene-based approaches can be more powerful than traditional individual-SNP-based GWAS. For example, if a gene contains more than one causative variant, then several SNPs within that gene might show marginal levels of significance that are often indistinguishable from random noise in the initial GWAS results. By combining the effects of all SNPs in a gene into a test statistic and correcting for linkage disequilibrium (LD), the gene-based test might be able to detect these effects.
Gene-based tests are also ideally suited for network (or pathway) approaches to interpreting the findings from GWAS [2–7]. These approaches are necessarily gene centric and require a measure of the relative importance of each gene to the phenotype of interest. The gene-based approach also reduces the multiple-testing problem of GWAS by only considering statistical tests for ~20,000 genes per genome as opposed to testing more than half a million SNPs in a typical GWAS.

Ideally, a gene-based test statistic can be obtained with permutations, where LD structure and other possible confounding factors, such as gene size, will be accounted for. Computing a gene-based test for basic GWAS designs via permutations is conceptually simple and is currently implemented as the "set-based test" in the PLINK software package [8]; however, heavy computational requirements have restricted this method from being adopted on a genome-wide scale. Other gene-based tests, such as those based on genetic distances [9] or entropy [10], are often also restricted to situations where individual genotype information is available or to specific GWAS designs (usually case-control designs). There are several important situations in which permutations or existing methods cannot be used; these include family-based GWAS, GWAS meta-analyses based on summary data, and DNA-pooling-based GWAS.

In contrast, our approach, called VEGAS (versatile gene-based association study), only requires individual marker p values in order to allow computation of a gene-based p value, and it can be applied to virtually any association study design. The method tests the evidence for association on a per-gene basis by summarizing either the full set of markers (typically SNPs) in the gene or a subset of the most significant markers (for example, the 10% most significant SNPs). For some genes, an approach considering all the markers might be the most powerful; for others, focusing on just the most associated markers might be apt.
The true underlying genetic architecture is seldom known in advance. The default gene-based test in our implementation and in the following examples uses the full set of markers in the gene.

1 Genetics and Population Health Division, Queensland Institute of Medical Research, Brisbane, Queensland 4006, Australia; 2 Integrated Cancer Genomics Division, The Translation Genomics Research Institute, Phoenix, Arizona 85028, USA; 3 Australian Melanoma Family Study. List of participants and affiliations appear in the Acknowledgements
*Correspondence: [email protected] (J.Z.L.), [email protected] (S.M.)
DOI 10.1016/j.ajhg.2010.06.009. ©2010 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 87, 139–145, July 9, 2010

Our approach takes account of LD between markers in a gene by using simulation based on the LD structure of a set of reference individuals from a HapMap phase 2 population (CEU [Utah residents with ancestry from northern and western Europe]; CHB and JPT [Han Chinese in Beijing, China and Japanese in Tokyo, Japan]; or YRI [Yoruba in Ibadan, Nigeria]), which provides approximately 2.1 million autosomal SNPs [11], or a custom set of individuals if genotype information is available. Our method assigns SNPs to each of 17,787 autosomal genes according to positions on the UCSC Genome Browser hg18 assembly. In order to capture regulatory regions and SNPs in LD, we define gene boundaries in this case as ±50 kb of the 5′ and 3′ UTRs. Then, for a given gene with n SNPs, association p values are first converted to upper-tail chi-squared statistics with one degree of freedom (df). The gene-based test statistic is then the sum of all (or a predefined subset) of the chi-squared 1 df statistics within that gene. If the SNPs are in perfect linkage equilibrium, the test statistic will have a chi-squared distribution with n degrees of freedom under the null hypothesis.
Because this is unlikely to be the case, however, the true null distribution given the LD structure (and hence p values that correlate accordingly) will need to be taken into account. Ideally, one would achieve this by performing a large number of permutations; however, this is very computationally intensive, requires individual genotype information, and assumes that individuals are unrelated. Instead, our Monte Carlo approach makes use of simulations from the multivariate normal distribution and is both much faster and agnostic regarding the GWAS design.

For a gene with n SNPs, we simulate an n-element multivariate normally distributed vector with mean 0 and covariance $\Sigma$, the $n \times n$ matrix of pairwise LD (r) values. A vector of n independent, standard normally distributed random variables is first generated and then multiplied by the Cholesky decomposition matrix of $\Sigma$, that is, the $n \times n$ lower triangular matrix $C$ such that $CC^{T} = \Sigma$. The new random vector, $Z = (z_1, z_2, \ldots, z_n)$, will have a multivariate normal distribution, $Z \sim N_n(0, \Sigma)$. $Z$ is then transformed into a vector of correlated chi-squared 1 df variables, $Q = (q_1, q_2, \ldots, q_n)$, with $q_i = z_i^2$. The simulated gene-based test statistic is then the sum of all (or a predefined subset) of the elements of $Q$ and will have approximately the same distribution as our observed gene-based test statistic under the null hypothesis. A large number of multivariate normal vectors are simulated, and the empirical gene-based p value is the proportion of simulated test statistics that exceed the observed gene-based test statistic.

We have implemented VEGAS both as an easy-to-use web interface and as a downloadable application for Linux and Unix. The only user input required is a text file consisting of two columns, SNP rs-name and association p value, along with specification of the reference population (CEU, CHB and JPT, or YRI).
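The Monte Carlo procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the VEGAS implementation (which relies on PLINK and R): the function names, the two-SNP LD matrix, and the p values are hypothetical, and the LD matrix is assumed to be positive definite already.

```python
import math
import random
from statistics import NormalDist


def p_to_chisq_1df(p):
    """Convert a p value (0 < p <= 1) to an upper-tail chi-squared 1 df statistic."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail quantile; squaring removes the sign
    return z * z


def cholesky(sigma):
    """Return the lower-triangular C with C C^T = sigma (sigma must be positive definite)."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                d = sigma[i][i] - s
                if d <= 0.0:
                    raise ValueError("LD matrix is not positive definite")
                c[i][j] = math.sqrt(d)
            else:
                c[i][j] = (sigma[i][j] - s) / c[j][j]
    return c


def gene_based_p(p_values, sigma, n_sims=1000, rng=None):
    """Empirical gene-based p value using the full set of SNPs in the gene."""
    rng = rng or random.Random()
    observed = sum(p_to_chisq_1df(p) for p in p_values)
    c = cholesky(sigma)
    n = len(p_values)
    exceed = 0
    for _ in range(n_sims):
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]           # independent N(0, 1)
        z = [sum(c[i][k] * e[k] for k in range(i + 1)) for i in range(n)]  # Z = C e
        if sum(v * v for v in z) > observed:                  # sum of q_i = z_i^2
            exceed += 1
    return exceed / n_sims


# Hypothetical gene with two SNPs in LD (r = 0.8) and made-up p values.
ld = [[1.0, 0.8], [0.8, 1.0]]
print(gene_based_p([0.01, 0.03], ld, n_sims=2000, rng=random.Random(42)))
```

The sketch uses a fixed number of simulations and only the standard library so that it stays self-contained; a matrix library would be the natural choice for genes with many SNPs.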
The downloadable version also allows the use of custom individual genotypes if available, as well as specification of gene boundaries. Pairwise LD correlation matrices are calculated in PLINK. The R corpcor package is used to correct for non-positive definite correlation matrices [12], and multivariate normal random vectors are simulated with the mvtnorm package [13].

The number of simulations per gene is determined adaptively. In the first stage, $10^3$ simulations will be performed. If the resulting empirical p value is less than 0.1, $10^4$ simulations will be performed. If the empirical p value from $10^4$ simulations is less than 0.001, the program will perform $10^6$ simulations. At each stage, the simulations are mutually exclusive. For computational reasons, if the empirical p value is 0, then no more simulations will be performed. An empirical p value of 0 from $10^6$ simulations can be interpreted as $p < 10^{-6}$, which exceeds a Bonferroni-corrected threshold of $p < 2.8 \times 10^{-6}$ ($\approx 0.05/17{,}787$; this threshold is likely to be conservative given the overlap between genes).

The user may select whether to perform the gene-based test on the full set of SNPs within a gene, a specified percentage of the most significant SNPs, or just the single most significant SNP. Because the program depends upon the output from other programs, it is important to take correct GWAS quality-control measures to account for issues such as population stratification or pooling errors before using VEGAS.

Using a test with permutations as the "gold standard," we compared the results from VEGAS to those from the PLINK set-based test [8] with permutations (with parameters --set-p 1 --set-r2 1 --maf 0.01) on a GWAS for height in 3,611 unrelated Australian individuals drawn from community-based twin studies conducted from 1980 to 2004. Several recent genetic studies of other traits [14–16] have used these samples and have described genotype and phenotype data cleaning.
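The adaptive simulation schedule described earlier in this section ($10^3$, then $10^4$, then $10^6$ simulations) can be sketched as follows. This is an illustrative sketch rather than the VEGAS source: `run_stage` is a hypothetical callback that returns the empirical p value from a given number of fresh simulations, and the toy single-SNP example is an assumption made only to keep the block runnable.

```python
import random

N_GENES = 17787
BONFERRONI = 0.05 / N_GENES  # about 2.8e-6, the genome-wide gene-based threshold


def adaptive_empirical_p(run_stage):
    """Return (p, n_sims) following the 10^3 -> 10^4 -> 10^6 schedule.

    run_stage(n_sims) must run n_sims fresh simulations and return the
    empirical p value; results from earlier stages are not reused.
    """
    n_sims = 10**3
    p = run_stage(n_sims)
    if p < 0.1:
        n_sims = 10**4
        p = run_stage(n_sims)
        if p < 0.001:
            n_sims = 10**6
            p = run_stage(n_sims)
    return p, n_sims


def make_toy_stage(observed_stat, rng):
    """Hypothetical stage runner for a gene with one SNP (chi-squared 1 df null)."""
    def run_stage(n_sims):
        exceed = sum(rng.gauss(0.0, 1.0) ** 2 > observed_stat for _ in range(n_sims))
        return exceed / n_sims
    return run_stage


# A weak signal (observed statistic 1.0) stops after the first stage.
p, n_used = adaptive_empirical_p(make_toy_stage(1.0, random.Random(0)))
print(p, n_used)
```

Passing the simulator in as a callback keeps the stopping rule separate from the simulation itself, so the same schedule could wrap either the full-set or the single-best-SNP statistic.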
In brief, height was corrected for age and sex before being converted to standard z scores. PLINK was used for performing genome-wide association, from which the results were used in our method. For a given set of SNPs, the PLINK set-based test initially performs a standard association test and then uses the average association test statistic across these SNPs as the "set-based" test statistic (VEGAS uses the sum rather than the average; the two methods are equivalent in calculations of empirical p values). Then, for the permutation procedure, the phenotypes are randomly shuffled among individuals, and the process is repeated several thousand times, from which an empirical p value is obtained.

Because of computational limitations, we only performed the PLINK set-based test on 413 genes on chromosome 22 with $10^4$ permutations each. To see how both tests deal with more significant genes, we performed $10^6$–$10^7$ permutations on seven additional genes. These genes were chosen on the basis of having p values $< 10^{-3}$ when VEGAS was applied across all chromosomes. The results from both tests are shown in Figure 1, which compares the corresponding $-\log_{10}$(p value)s from the PLINK set-based test and VEGAS for 420 genes. For the majority of genes, both methods produced very similar results. Correlation between the p values was very high (Pearson r = 0.999), as was that between the rankings (Spearman r = 0.998).

In addition to being agnostic toward GWAS design, a major advantage of our method over permutations is speed. The PLINK set-based test on our computer took ~12 hr to compute the 413 chromosome 22 genes plus 2 days for the seven additional genes. In contrast, our approach with $10^3$ to $10^6$ simulations per gene computed the same set of genes in less than thirty minutes.

Figure 1. Comparison of the $-\log_{10}$(p value)s from the PLINK Set-Based Test and VEGAS on a GWAS of Height in 3,611 Individuals
The PLINK set-based test was performed on 413 genes on chromosome 22 with $10^4$ permutations (circles) and on seven genes on other chromosomes; these were selected on the basis of having the smallest p values from the VEGAS analysis, at $10^6$ to $10^7$ permutations (triangles). The p values from VEGAS were obtained by running $10^3$ to $10^7$ multivariate normal simulations per gene. The straight diagonal line indicates a 1:1 relationship.

We selected nine nonoverlapping genes of various sizes on chromosome 22 to further investigate the type I error rate of our method compared to that from permutations. The previous height data were permuted 1000 times. VEGAS and the PLINK set-based test were applied to the association results of each permutation for each of the genes. The comparison of the p values for each of the nine genes is shown in Figure S1. Overall, there does not appear to be any major bias involved with VEGAS.

Nevertheless, it should be noted that our method will produce spurious results if the incorrect reference population, and hence LD structure, is used. Biases toward smaller p values will occur if the reference population is older than the study population, and larger p values will occur in the opposite situation. When the same 420 genes and 3,611 Australian individuals were used, running VEGAS with the HapMap CEU population as the reference produced results comparable to those from permutation (Figure S2A), whereas using the HapMap YRI population produced significant biases toward smaller p values (Figure S2B). Slight biases might also potentially occur for genes with a non-positive definite LD correlation matrix. In our dataset, this was a property of ~80% of genes, inhibiting the direct use of Cholesky decomposition.
For these genes, the nearest positive semidefinite matrix is estimated with the R corpcor package.12,17 Matrices that require a large adjustment might explain some of the discrepancy between VEGAS and permutations, although as seen in Figure 1, this does not appear to have a major effect.

Figure 2. Comparison of the -log10(p value)s from Permutations and VEGAS When Only the Single Best SNP from Each Gene Is Considered. Results are based on a GWAS of height in 3611 individuals. Permutations were performed on 413 genes on chromosome 22 with 10^3 permutations and on seven additional genes with 10^5–10^6 permutations. The p values from VEGAS were obtained from 10^3–10^6 multivariate normal simulations per gene. The straight diagonal line indicates a 1:1 relationship.

Under some genetic architectures, a more powerful gene-based method may be to consider only the most significant SNP in a gene rather than the full set of SNPs and then correct this SNP's association p value for gene size and other possible confounders. Our approach can readily be applied to this situation. For a gene with n SNPs, recall the simulated vector of n correlated chi-squared 1 df variables, Q = (q1, q2, ..., qn). For the "Top-SNP" method, we define Qmax as the simulated test statistic of the maximum element of Q. Then, by simulating a large number of Qmax test statistics, the empirical gene-based p value is the proportion of simulated Qmax test statistics that exceed the observed test statistic of the most significant SNP in the gene. Using the same 420 genes as in our previous analysis with the full set of SNPs, we compared the VEGAS Top-SNP method and permutations (Figure 2). Note that in this case, we ran our own permutations by using R rather than the PLINK set-based test because the two methods are not equivalent. As with the test considering the full set of SNPs, VEGAS produces results very similar to those from permutations.
Correlation between the p values was very high (Pearson r = 0.996), as was that between the rankings (Spearman r = 0.996).

Our method of using the full set of SNPs per gene was applied to two situations where permutation tests are not applicable: a family-based GWAS for height, where permutation cannot account for phenotypic correlation between family members, and a DNA-pooling GWAS for melanoma (MIM 155600), where individual genotype information is not available.

Table 1. VEGAS Results for the 15 Most Significant Genes from a Family-Based GWAS for Height in 11,536 Individuals

Chromosome  Gene        Number of SNPs  Start Position  Stop Position  Test Statistic  p Value       Best SNP    SNP p Value
4           HHIP(a)     26              145786622       145879331      263.505         1 × 10^-6     rs1812175   1.06 × 10^-9
6           GPR126(a)   23              142664748       142809096      169.912         5 × 10^-6     rs6570507   2.16 × 10^-7
8           CHCHD7(a)   4               57286868        57293730       31.82           3.2 × 10^-5   rs7833986   2.20 × 10^-4
6           HMGA1(a)    6               34312627        34321986       38.934          8.4 × 10^-5   rs1776897   6.71 × 10^-6
15          ADAMTSL3(a) 85              82113841        82499597       344.52          1.34 × 10^-4  rs7183263   3.89 × 10^-7
4           LCORL(a)    30              17453940        17632474       222.748         1.38 × 10^-4  rs6817306   7.63 × 10^-6
20          GDF5(a)     10              33484562        33489441       81.199          1.78 × 10^-4  rs4911494   1.39 × 10^-4
12          HMGA2(a)    34              64504506        64646338       147.824         3.00 × 10^-4  rs8756      4.26 × 10^-7
1           MFAP2       15              17173585        17180668       76.961          3.71 × 10^-4  rs11203280  6.03 × 10^-4
17          C17orf78    5               32807097        32823775       27.012          5.31 × 10^-4  rs8067120   1.80 × 10^-3
6           HIST1H3G(a) 16              26379124        26379591       86.062          5.77 × 10^-4  rs10946808  2.48 × 10^-5
2           NMUR1       18              232096114       232103426      102.955         6.05 × 10^-4  rs1434519   3.29 × 10^-5
4           ADH5        26              100211152       100228954      142.218         8.01 × 10^-4  rs1042364   2.45 × 10^-4
8           SPATC1      8               145158594       145174003      58.172          8.30 × 10^-4  rs3936211   7.35 × 10^-4
2           EMX1        13              72998111        73015528       60.278          9.62 × 10^-4  rs10183113  3.71 × 10^-6

(a) These genes have been implicated in previous GWAS of height.22 The signal in HIST1H3G is driven by a variant previously implicated in the neighboring HIST1H1G.
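The simulation-based tests described above (the sum over all SNPs in a gene and the Top-SNP maximum) can be illustrated with a minimal Python sketch. It is not the VEGAS implementation, which the Web Resources list as relying on R with the corpcor and mvtnorm packages; the function names `cholesky` and `gene_based_p`, their parameters, and the toy LD matrix are hypothetical. The sketch assumes a positive definite LD correlation matrix and does not attempt the nearest positive semidefinite adjustment.

```python
import math
import random


def cholesky(matrix):
    """Return the lower-triangular Cholesky factor of a symmetric matrix.

    Raises ValueError when the matrix is not positive definite, the case in
    which the text says direct Cholesky decomposition cannot be used.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                diag = matrix[i][i] - partial
                if diag <= 0.0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(diag)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def gene_based_p(ld, observed, n_sims=10000, top_snp=False, seed=0):
    """Empirical gene-based p value from multivariate normal simulations.

    ld       : LD correlation matrix for the n SNPs in the gene (assumed input)
    observed : observed sum of 1 df chi-squared statistics, or the statistic
               of the single best SNP when top_snp is True
    """
    rng = random.Random(seed)
    lower = cholesky(ld)
    n = len(ld)
    exceed = 0
    for _ in range(n_sims):
        # Independent standard normals, then correlate them with the factor.
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = [sum(lower[i][k] * e[k] for k in range(i + 1)) for i in range(n)]
        q = [value * value for value in z]  # correlated chi-squared 1 df values
        simulated = max(q) if top_snp else sum(q)
        if simulated > observed:
            exceed += 1
    return exceed / n_sims


if __name__ == "__main__":
    toy_ld = [[1.0, 0.6, 0.2], [0.6, 1.0, 0.4], [0.2, 0.4, 1.0]]
    print(gene_based_p(toy_ld, observed=12.0, n_sims=5000))
    print(gene_based_p(toy_ld, observed=9.0, n_sims=5000, top_snp=True))
```

With the same seed, the Top-SNP p value can never exceed the sum-based p value for the same observed value, because the maximum of the simulated statistics is at most their sum.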
For height, we included an extra 7,935 relatives of those in our original GWAS of 3,611 unrelated individuals. These consisted of parents, offspring, siblings, twins, and other family members, all typed with the same SNP chip as the unrelated individuals used in the first calculation. The results of the family-based association analysis were previously published in Liu, et al.18 Table 1 lists the 15 most significant height-associated genes obtained from VEGAS. One gene, the previously implicated HHIP (MIM 606178; p = 1 × 10^-6),19–21 exceeded a Bonferroni corrected threshold of p < 2.8 × 10^-6. Overall, nine of the top 15 genes have been previously implicated in published GWAS of height at genome-wide significance.22 It remains to be seen whether any of the remaining genes play a role in height. The gene NMUR1 (MIM 604153; p = 6.05 × 10^-4) is a G-protein-coupled receptor and is also involved in neuropeptide signaling, similar to the previously implicated GPR126 (MIM 612243; p = 5 × 10^-6). Height might also be mediated by MFAP2 (MIM 156790; p = 3.71 × 10^-4) through its role as a glycoprotein component of connective-tissue microfibrils,23 for which normal connective-tissue development is essential for height growth. Mutations in other microfibril components have been linked to Marfan syndrome (MIM 154700), a genetic disorder characterized by skeletal overgrowth.24 These results suggest that despite having a relatively small sample size for a GWAS for height, the gene-based test has the potential to identify novel genes. In a two-stage GWAS, the most significant genes may also be used as a basis for selecting SNPs for replication samples.

For melanoma, the gene-based test was performed on the results from a GWAS that used pooled DNA in 1354 melanoma cases and 1291 controls. The sample was originally part of a larger previously published GWAS for melanoma,25 and pooling and association methods are described in that study.
This study was performed with the approval of the appropriate ethics committee and with informed consent from all participants.

As for height, the results from the gene-based test are consistent with our current understanding of the genetics of melanoma (Table 2). Overall, all of the top 15 genes are in regions known to harbor melanoma-susceptibility genes. Seven genes identified are located on 20q11.22, the region originally implicated by Brown et al.25 and containing the skin pigmentation gene ASIP (MIM 600201); these include MAP1LC3A (MIM 601242; p < 10^-6), PIGU (MIM 608528; p = 2 × 10^-6), DYNLRB1 (MIM 607167; p = 7 × 10^-6), TP53INP2 (p = 4.7 × 10^-5), and NCOA6 (MIM 605299; p = 1.38 × 10^-4). ASIP itself, however, was nonsignificant (p = 0.116). Given the size of this associated region, it could be the case that a distant enhancer rather than nonsynonymous or proximal regulatory elements is driving the association with ASIP. Similarly, a large number of associated genes are also located on 16q24.3; the most significant of these genes was DEF8 (p = 4 × 10^-5). Given that DEF8 lies ~30 kb downstream of the known melanoma-susceptibility gene, MC1R (MIM 155555), it is likely that this signal is driven by variants in and around MC1R, which was only nominally significant (p = 1.30 × 10^-3), rather than DEF8 itself. Likewise, the gene RXFP3 (p = 1.95 × 10^-4) is adjacent to SLC45A2 (MIM 606202; p = 8.91 × 10^-3), a known melanoma-susceptibility gene, and MYEF2 (p = 4 × 10^-6) is adjacent to SLC24A5 (MIM 609802; p = 2.34 × 10^-3), a gene associated with skin pigmentation.

Table 2. VEGAS Results for the 15 Most Significant Genes from a DNA-Pooling GWAS for Melanoma in 1354 Cases and 1291 Controls

Chromosome  Gene      Number of SNPs  Start Position  Stop Position  Test Statistic  p Value       Best SNP    SNP p Value
20          MAP1LC3A  59              32598352        32611810       762.618         <10^-6        rs910873    1.00 × 10^-16
20          PIGU      93              32612006        32728750       964.294         2 × 10^-6     rs910873    1.00 × 10^-16
15          MYEF2     25              46218920        46257850       50.865          4 × 10^-6     rs2470102   4.18 × 10^-4
20          DYNLRB1   58              32567864        32592423       548.265         7 × 10^-6     rs910873    1.00 × 10^-16
20          SNTA1     39              31459423        31495359       242.906         9 × 10^-6     rs291695    6.60 × 10^-11
16          DEF8      73              88542651        88561968       318.251         4.0 × 10^-5   rs1805007   3.33 × 10^-16
20          TP53INP2  44              32755808        32764898       312.611         4.7 × 10^-5   rs4417778   5.35 × 10^-9
20          NCOA6     81              32766238        32877094       563.953         1.38 × 10^-4  rs4911442   2.71 × 10^-10
20          CDK5RAP1  55              31410305        31452998       260.851         1.53 × 10^-4  rs291695    6.60 × 10^-11
5           RXFP3     48              33972247        33974099       138.421         1.95 × 10^-4  rs35389     1.31 × 10^-8
16          C16orf55  49              88251710        88265176       244.276         3.12 × 10^-4  rs258322    1.34 × 10^-7
16          MGC16385  59              88563701        88566443       218.033         3.99 × 10^-4  rs8049897   9.74 × 10^-7
16          DPEP1     58              88207216        88232340       248.214         4.54 × 10^-4  rs12918773  4.47 × 10^-7
16          CHMP1A    52              88238344        88251630       248.105         4.60 × 10^-4  rs258322    1.34 × 10^-7
16          SPG7      73              88102305        88151675       370.214         4.66 × 10^-4  rs4785686   2.76 × 10^-7

Although VEGAS was able to produce results equivalent to those obtained through permutations at a fraction of the time taken, as well as replicate several known height- and melanoma-associated genes, there are several situations in which use of the gene-based test is limited.
The effectiveness of VEGAS, along with other gene-based methods, is determined by the underlying genetic architecture of the gene and phenotype of interest. Although gene-based methods are more powerful than single-marker analysis for identifying significant genes with multiple causal variants, the converse is also true. If a gene contains only one causal variant, then the inclusion of a large number of nonsignificant markers into the gene-based test will dilute this gene's significance. The correct genetic model to use is seldom known in advance, although our method can be performed on a specified subset of markers or just the single most significant marker rather than all markers in a gene. Similarly, the use of ±50 kb to define gene boundaries is an arbitrary choice. Large boundaries mean that some markers are included in multiple genes, resulting in a situation similar to our results for melanoma, where it may be difficult to pinpoint the causal gene when multiple adjacent genes are statistically significant. Specifying stringent boundaries, however, may not fully capture regulatory regions or those SNPs in high LD with variants in the gene. Moreover, given that the majority of SNPs so far identified in GWAS are found in nongenic regions,26 these SNPs would not be included in any gene-centric analysis at all.

For these reasons, gene-based methods should not be seen as a replacement for traditional single-marker association studies but rather as a complement to GWAS and an essential step for network- and pathway-based approaches. We offer our gene-based test not as a definitive solution to the problem but as one tool in the complex-trait geneticist's toolbox for post-GWAS analysis.

Supplemental Data
Supplemental Data include two figures and Supplemental Acknowledgments and can be found with this article online at http://www.cell.com/AJHG/.

Acknowledgments
Australian Melanoma Family Study Investigators: Graham J. Mann and Richard F.
Kefford (Westmead Institute of Cancer Research, University of Sydney at Westmead Millennium Institute and Melanoma Institute Australia, PO Box 412, Westmead, NSW 2145, Australia); John L. Hopper (Centre for Molecular, Environmental, Genetic, and Analytic Epidemiology, School of Population Health, Level 2, 723 Swanston Street, University of Melbourne, VIC 3052, Australia); Joanne F. Aitken (Viertel Centre for Research in Cancer Control, The Queensland Cancer Council Queensland, PO Box 201, Spring Hill, QLD 4004, Australia); Graham G. Giles (Cancer Epidemiology Centre, The Cancer Council Victoria, Carlton, VIC 3053, Australia); and Bruce K. Armstrong (School of Public Health, A27, University of Sydney, NSW 2006, Australia). J.Z.L. is supported by National Health and Medical Research Council (NHMRC) project grant 496675. S.M., N.K.H., G.W.M., P.M.V., A.F.M., and S.E.M. are supported by the NHMRC Fellowships scheme. N.R.W. and D.R.N. are supported by Australian Research Council Fellowships. K.M.B. is a recipient of a Career Development Award from the Melanoma Research Foundation and is supported by the National Cancer Institute, National Institutes of Health (CA109544, CA083115). We thank Joseph Powell for suggesting the name VEGAS. Additional acknowledgements are provided in the Supplemental Data.

Received: April 29, 2010
Revised: June 7, 2010
Accepted: June 11, 2010
Published online: July 1, 2010

Web Resources
The URLs for data presented herein are as follows:
corpcor, http://strimmerlab.org/software/corpcor
mvtnorm, http://cran.r-project.org/package=mvtnorm
Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/Omim
PLINK, http://pngu.mgh.harvard.edu/~purcell/plink
R, http://www.r-project.org
UCSC Genome Browser, http://genome.ucsc.edu
VEGAS, http://genepi.qimr.edu.au/general/softwaretools.cgi

References
1. Neale, B.M., and Sham, P.C. (2004).
The future of association studies: gene-based analysis and replication. Am. J. Hum. Genet. 75, 353–362.
2. Wang, K., Li, M., and Bucan, M. (2007). Pathway-based approaches for analysis of genomewide association studies. Am. J. Hum. Genet. 81, 1278–1283.
3. Perry, J.R.B., McCarthy, M.I., Hattersley, A.T., Zeggini, E., Weedon, M.N., and Frayling, T.M.; Wellcome Trust Case Control Consortium. (2009). Interrogating type 2 diabetes genome-wide association data using a biological pathway-based approach. Diabetes 58, 1463–1467.
4. Holmans, P., Green, E.K., Pahwa, J.S., Ferreira, M.A.R., Purcell, S.M., Sklar, P., Owen, M.J., O'Donovan, M.C., and Craddock, N.; Wellcome Trust Case-Control Consortium. (2009). Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am. J. Hum. Genet. 85, 13–24.
5. Ruano, D., Abecasis, G.R., Glaser, B., Lips, E.S., Cornelisse, L.N., de Jong, A.P., Evans, D.M., Davey Smith, G., Timpson, N.J., Smit, A.B., et al. (2010). Functional gene group analysis reveals a role of synaptic heterotrimeric G proteins in cognitive ability. Am. J. Hum. Genet. 86, 113–125.
6. Baranzini, S.E., Galwey, N.W., Wang, J., Khankhanian, P., Lindberg, R., Pelletier, D., Wu, W., Uitdehaag, B.M.J., Kappos, L., Polman, C.H., et al.; GeneMSA Consortium. (2009). Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum. Mol. Genet. 18, 2078–2090.
7. Elbers, C.C., van Eijk, K.R., Franke, L., Mulder, F., van der Schouw, Y.T., Wijmenga, C., and Onland-Moret, N.C. (2009). Using genome-wide pathway analysis to unravel the etiology of complex diseases. Genet. Epidemiol. 33, 419–431.
8. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A.R., Bender, D., Maller, J., Sklar, P., de Bakker, P.I.W., Daly, M.J., and Sham, P.C. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575.
9. Buil, A., Martinez-Perez, A., Perera-Lluna, A., Rib, L., Caminal, P., and Soria, J.M. (2009). A new gene-based association test for genome-wide association studies. BMC Proc 3 (Suppl 7), S130.
10. Cui, Y., Kang, G., Sun, K., Qian, M., Romero, R., and Fu, W. (2008). Gene-centric genomewide association study via entropy. Genetics 179, 637–650.
11. Frazer, K.A., Ballinger, D.G., Cox, D.R., Hinds, D.A., Stuve, L.L., Gibbs, R.A., Belmont, J.W., Boudreau, A., Hardenbol, P., Leal, S.M., et al.; International HapMap Consortium. (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–861.
12. Schaefer, J., Opgen-Rhein, R., and Strimmer, K. (2009). Efficient estimation of covariance and (partial) correlation. http://strimmerlab.org/software/corpcor/.
13. Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., and Hothorn, T. (2009). mvtnorm: Multivariate normal and t distributions. http://CRAN.R-project.org/package=mvtnorm.
14. Medland, S.E., Nyholt, D.R., Painter, J.N., McEvoy, B.P., McRae, A.F., Zhu, G., Gordon, S.D., Ferreira, M.A., Wright, M.J., Henders, A.K., et al. (2009). Common variants in the trichohyalin gene are associated with straight hair in Europeans. Am. J. Hum. Genet. 85, 750–755.
15. Cornes, B.K., Medland, S.E., Ferreira, M.A., Morley, K.I., Duffy, D.L., Heijmans, B.T., Montgomery, G.W., and Martin, N.G. (2005). Sex-limited genome-wide linkage scan for body mass index in an unselected sample of 933 Australian twin families. Twin Res. Hum. Genet. 8, 616–632.
16. Benyamin, B., Perola, M., Cornes, B.K., Madden, P.A.F., Palotie, A., Nyholt, D.R., Montgomery, G.W., Peltonen, L., Martin, N.G., and Visscher, P.M. (2008). Within-family outliers: segregating alleles or environmental effects? A linkage analysis of height from 5815 sibling pairs. Eur. J. Hum. Genet. 16, 516–524.
17. Higham, N.J. (1988). Computing a nearest symmetric positive semidefinite matrix. Linear Algebra Appl. 103, 103–118.
18. Liu, J.Z., Medland, S.E., Wright, M.J., Henders, A.K., Heath, A.C., Madden, P.A., Duncan, A.D., Montgomery, G.W., Martin, N.G., and McRae, A.F. (2010). Genome-wide association study of height and body mass index in Australian twin families. Twin Res. Hum. Genet. 13, 179–193.
19. Weedon, M.N., Lango, H., Lindgren, C.M., Wallace, C., Evans, D.M., Mangino, M., Freathy, R.M., Perry, J.R.B., Stevens, S., Hall, A.S., et al.; Diabetes Genetics Initiative, Wellcome Trust Case Control Consortium, Cambridge GEM Consortium. (2008). Genome-wide association analysis identifies 20 loci that influence adult height. Nat. Genet. 40, 575–583.
20. Gudbjartsson, D.F., Walters, G.B., Thorleifsson, G., Stefansson, H., Halldorsson, B.V., Zusmanovich, P., Sulem, P., Thorlacius, S., Gylfason, A., Steinberg, S., et al. (2008). Many sequence variants affecting diversity of adult human height. Nat. Genet. 40, 609–615.
21. Lettre, G., Jackson, A.U., Gieger, C., Schumacher, F.R., Berndt, S.I., Sanna, S., Eyheramendy, S., Voight, B.F., Butler, J.L., Guiducci, C., et al.; Diabetes Genetics Initiative, FUSION, KORA, Prostate, Lung Colorectal and Ovarian Cancer Screening Trial, Nurses' Health Study, SardiNIA. (2008). Identification of ten loci associated with height highlights new biological pathways in human growth. Nat. Genet. 40, 584–591.
22. Hindorff, L., Junkins, H., Mehta, J., and Manolio, T. (2009). A catalog of published genome-wide association studies. http://www.genome.gov/gwastudies/ (Accessed: April 26 2010).
23. Faraco, J., Bashir, M., Rosenbloom, J., and Francke, U. (1995). Characterization of the human gene for microfibril-associated glycoprotein (MFAP2), assignment to chromosome 1p36.1-p35, and linkage to D1S170. Genomics 25, 630–637.
24. Judge, D.P., and Dietz, H.C. (2005). Marfan's syndrome. Lancet 366, 1965–1976.
25.
Brown, K.M., Macgregor, S., Montgomery, G.W., Craig, D.W., Zhao, Z.Z., Iyadurai, K., Henders, A.K., Homer, N., Campbell, M.J., Stark, M., et al. (2008). Common sequence variants on 20q11.22 confer melanoma susceptibility. Nat. Genet. 40, 838–840.
26. Hindorff, L.A., Sethupathy, P., Junkins, H.A., Ramos, E.M., Mehta, J.P., Collins, F.S., and Manolio, T.A. (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA 106, 9362–9367.

Bioinformatics Advance Access published April 17, 2012

INRICH: Interval-based Enrichment Analysis for Genome Wide Association Studies
Phil H. Lee,1,2,3 Colm O'Dushlaine,3 Brett Thomas,1 Shaun M. Purcell1,2,3,4∗
1 Analytical Translational Genetics Unit, Center for Human Genetic Research, Massachusetts General Hospital, MA 02114; 2 Department of Psychiatry, Harvard Medical School, Boston, MA 02115; 3 Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142; 4 Mount Sinai School of Medicine, New York, NY 10029, USA
Associate Editor: Dr. Jeffrey Barrett

1 INTRODUCTION
Multilocus approaches (often known as pathway or gene-set enrichment analysis methods) can be used to ask whether sets of single nucleotide polymorphisms (SNPs), often defined by groups of functionally-related genes, are in aggregate more highly associated with a phenotype than expected by chance. Consideration of the biological relationships amongst the "top hits" in a genome-wide association study (GWAS) can provide orthogonal evidence, over and above the functionally-agnostic analysis of the number, statistical significance and/or variance explained of those hits. For example, that a GWAS has three independent SNPs with p values around 1×10^-6 is in itself unremarkable.
However, if the associated regions independently map to three of a small set of functionally-related genes, this will be very unlikely to occur by chance: consequently, we would likely wish to put more weight on these associations. As well as providing additional statistical evidence to sub-threshold association results, another use of gene-set analysis can be called in silico fine-mapping, or prioritizing specific genes in loci that contain multiple genes with equivalent association evidence. For example, of ten associated genes within a block of strong linkage disequilibrium (LD), we may find that only one shows above-chance relatedness to genes that appear in other, statistically-independent association intervals. All other things being equal, one would presumably consider that gene as more likely to be causally-related compared to the other nine. Furthermore, the identity of the particular enriched gene-sets may offer insights into disease mechanism and biology, although this will be contingent on the gene-sets' accuracy, comprehensiveness and relevance to the phenotype's underlying biology.

∗ to whom correspondence should be addressed

Over the past few years, several gene-set methods for GWAS have been developed (Wang et al., 2007; Holmans et al., 2009). Still, there clearly exist challenges and limitations to be addressed (Hong et al., 2009). Desirable properties of a gene-set test include that it is (i) robust, and so able to calculate experiment-wide significance, with adjustment for common biases due to gene size, LD within and between genes, etc.; (ii) flexible, with application to (summary) data from different sources, such as GWAS, imputed data, copy number variant (CNV) studies, targeted sequencing, tables in manuscripts, etc.; and (iii) computationally manageable, allowing genome-wide analysis in a reasonable time on a single machine.
Here we describe the gene-set enrichment analysis tool INRICH (INterval enRICHment analysis) that aims to satisfy the above properties. INRICH takes a set of independent, nominally-associated genomic intervals and then tests for the enrichment of predefined gene-sets. An "interval" will typically correspond to a genomic region of SNP association defined by LD from a genome-wide scan, although intervals could also represent, e.g., deletion or duplication events observed in cases, regions identified as homozygous-by-descent, etc.

2 METHODS
We describe the method implemented in INRICH, focussing on the case of SNP association from GWAS data. Specifically, the analysis proceeds in the following three steps:

© The Author (2012). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
Downloaded from http://bioinformatics.oxfordjournals.org/ at Univ of Colorado Libraries on May 3, 2012

ABSTRACT
Summary: Here we present INRICH, a pathway-based genome-wide association analysis tool that tests for enriched association signals of predefined gene-sets across independent genomic intervals. INRICH has wide applicability, fast running time and, most importantly, robustness to potential genomic biases and confounding factors. Such factors, including varying gene size and SNP density, linkage disequilibrium within and between genes and overlapping genes with similar annotations, are often not accounted for by existing gene-set enrichment methods. By using a genomic permutation procedure, we generate experiment-wide empirical significance values, corrected for the total number of sets tested, implicitly taking overlap of sets into account. By simulation we confirm a properly controlled type I error rate and reasonable power of INRICH under diverse parameter settings. As a proof of principle, we describe the application of INRICH on the NHGRI GWAS catalog.
Availability: A standalone C++ program, user manual, and datasets can be freely downloaded from: http://atgu.mgh.harvard.edu/inrich/.
Contact: [email protected]
Supplementary information: available from the journal web-site.

3 DATA ANALYSIS AND SUMMARY
INRICH takes disease-associated genomic intervals as input, for example, all GWAS SNPs (and the other, local SNPs in LD) that are associated with a phenotype at p<1×10^-4. Either PLINK (Purcell et al., 2007) LD-clumping or tag SNP selection commands (or similar tools) can be used to define such independent regions of association, which ensures that multiple, adjacent SNPs that potentially tag the same causal variant are analyzed as one independent association unit. Due to space limitation, we provide a detailed instruction manual on the data generation and testing procedure at our website (http://atgu.mgh.harvard.edu/inrich/).

We first conducted a simulation study to assess the Type I error rates of INRICH using two GWAS datasets: HapMap III (CEU+TSI; n=200), and a schizophrenia case/control study (n=1,468) (Lieberman et al., 2005). Tested parameter settings include different enrichment statistics (i.e., "interval" or "target" mode), LD-clumping r^2 measures (r^2 = 0.2), as well as significant p value thresholds to define associated regions (1×10^-3 and 5×10^-3). Under each setting, we repeated the following procedures 200 times, and calculated the average type I error rate: i) Generate random phenotype labels for subjects; ii) Apply standard χ^2 association analysis on individual SNPs; and iii) Run INRICH on the association results using the KEGG gene-sets (Kanehisa et al., 2010). We also conducted the same simulation study using two commonly used gene-set enrichment approaches: GenGen (Wang et al., 2007) (i.e., GSEA tool specifically designed for GWAS) and the hypergeometric test. Compared to these methods, the average type I error rates of INRICH did not exceed the nominal 5% level.
In contrast, under some conditions, the hypergeometric test yielded a type I error rate as high as 100%. We also considered the power under conditions where the hypergeometric test is valid, and confirmed that INRICH gives a comparably good power to the hypergeometric test (S3.xlsx for details).

Phenotype-permutation-based gene-set enrichment methods (such as GenGen) provide statistically rigorous tests, but are computationally very demanding (particularly if based on imputed datasets, or complex family-based association tests, etc.). In contrast, other gene-set enrichment methods based on summary data alone (such as the hypergeometric test) are not computationally intensive, but can be very anti-conservative, as our simulations show, due to unwarranted assumptions of independence. We argue that INRICH is well-placed between these two poles, providing an efficient yet robust middle-ground.

As a proof of concept to demonstrate the performance of INRICH under the alternative hypothesis, we applied INRICH to the summary association data from the NHGRI (National Human Genome Research Institute) GWAS catalog (Hindorff et al., 2009). First, we downloaded a list of 4,689 SNPs that are associated with 411 complex diseases/traits at a p value <1×10^-5 (download date: 2011-Mar-04). This analysis focused on 236 diseases/traits that have at least five associated SNPs. For each phenotype, LD-independent intervals were generated around the associated SNPs using PLINK, and the enrichment test was conducted using 3,182 Gene Ontology (GO) terms (gene-set size between 5 and 200 genes) (The Gene Ontology Consortium, 2000) and 10^6 replicates in the first round of permutation and 10^4 in the second. We excluded all genes and intervals mapping to the broad MHC region (chr6:25-35Mb): in practice, because this region contains so many genes, it is unlikely to improve the power of gene-set enrichment analysis in most cases.
After multiple testing correction, 47 disorders were predicted with at least one significantly enriched GO term at α=5%. Many of the associations were consistent with known pathology of examined complex diseases/traits. For example, Type II diabetes-associated intervals were most significantly enriched for genes involved in glucose homeostasis (corrected p=0.001) and Crohn's disease-associated intervals were enriched for regulation of activated T cell proliferation (corrected p=0.003).

In summary, we have implemented a new gene-set enrichment method in the INRICH package, based on a constrained reshuffling of associated intervals, to test whether more genes from particular sets are contained in those intervals than expected by chance. Importantly, we preserve the properties of the original data whilst reshuffling, in terms of the number, SNP density and gene-density. We have shown appropriate type I error rates, even when correcting for hundreds of partially-overlapping gene-sets. Preliminary application to the NHGRI GWAS catalogue indicates good power to detect true signals. INRICH was recently applied to a large GWAS of bipolar disorder, implicating calcium ion channel genes as enriched (Psychiatric GWAS Consortium Bipolar Disorder Working Group, 2011). Practically, INRICH is fast, applicable without individual genotype data, and freely available either as a command-line tool or with a GUI.

2.2 Overlapping Interval/Gene Merging
It is not uncommon for functionally-related genes to show physical clustering, and therefore to yield inflated false positive rates for such gene sets if dependent signals are assumed to be independent (Hong et al., 2009; Holmans et al., 2009). To avoid this potential bias due to multi-counting physically clustered genes belonging to the same set, we merge overlapping genes belonging to the same gene-set. We also merge overlapping testing intervals to ensure that testing units are statistically independent from each other.
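The merging step described above can be sketched in a few lines of Python. This is only an illustration of merging overlapping (start, end) ranges on one chromosome; the function name `merge_intervals` and the tuple representation are assumptions, not part of the INRICH program.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) ranges on a single chromosome.

    Ranges are treated as closed, so two ranges that share an end point
    are merged into one unit.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous unit: extend it rather than add a new one.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    print(merge_intervals([(5, 10), (1, 3), (2, 6), (20, 25)]))
```

The same helper could be applied once to the genes of each gene-set and once to the testing intervals, so that each physical cluster is counted as a single unit.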
2.3 Set-based Enrichment Tests
The primary enrichment statistic E for each gene-set is the number of intervals that overlap at least one "target" gene (i.e., gene in the tested set), which we refer to as the interval mode. An alternative test instead counts the number of target genes that overlap at least one interval, which is useful for analyzing structural variation data (e.g., CNVs) that typically span large genomic regions and therefore are likely to disrupt multiple, non-overlapping genes. We call this setting the target mode. We use a permutation approach, described below, to calculate empirical significance p values for each gene-set. Suppose that input data I includes k intervals, I = {i_1, ..., i_k}, and target gene-set T includes m genes, T = {t_1, ..., t_m}.
1. Null interval set R is generated by randomly assigning intervals to genomic locations with the constraints that each null interval r_i ∈ R approximately matches the original interval i_i ∈ I (i = 1, ..., k) in terms of the number of SNPs and overlapping genes; we also ensure approximately similar SNP density per kilobase. Figure S1 illustrates the three matching criteria.
2. Corresponding to the selected testing mode as described above, the null enrichment statistic E is calculated as the number of overlapping intervals (or genes) between target gene-set T and randomly matched null set R.
3. Steps 1 and 2 are repeated N times to generate a distribution of the enrichment statistics for target gene-set T under the null hypothesis.
4. The empirical p value for T is the proportion of N replicates where the enrichment statistic is at least as large as that of original interval set I.
5. Multiple testing correction is achieved via a second, nested round of permutation to assess the null distribution of the minimum empirical p value across all tested gene-sets.
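Steps 1 to 4 above can be sketched as follows for the interval mode. This is a deliberately simplified illustration, not the INRICH algorithm: null intervals are matched only on length and placed uniformly on a single linear genome, whereas the text matches on the number of SNPs, the number of overlapping genes and SNP density, and step 5 (the nested correction) is omitted. All names (`overlaps`, `enrichment`, `empirical_p`, `genome_length`) are hypothetical.

```python
import random


def overlaps(a, b):
    """True when closed ranges a = (start, end) and b = (start, end) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def enrichment(intervals, target_genes):
    """Interval mode: count intervals overlapping at least one target gene."""
    return sum(
        any(overlaps(interval, gene) for gene in target_genes)
        for interval in intervals
    )


def empirical_p(intervals, target_genes, genome_length, n_reps=1000, seed=0):
    """Return (observed statistic, empirical p value) for one gene-set.

    Simplifying assumption: each null interval keeps only the length of its
    original interval and is placed uniformly at random on the genome.
    """
    rng = random.Random(seed)
    observed = enrichment(intervals, target_genes)
    at_least_as_large = 0
    for _ in range(n_reps):
        null_set = []
        for start, end in intervals:
            length = end - start
            new_start = rng.randint(0, genome_length - length)
            null_set.append((new_start, new_start + length))
        if enrichment(null_set, target_genes) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_reps


if __name__ == "__main__":
    genes = [(1000, 2000), (500000, 501000)]
    associated = [(1500, 1600), (500500, 500600), (900000, 900100)]
    print(empirical_p(associated, genes, genome_length=1_000_000))
```

Switching to the target mode would only require counting target genes that overlap at least one interval instead of intervals that overlap at least one gene.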
This permutation procedure therefore respects the relationship between gene size and the probability of chance overlap, namely that large genes are more likely to be hit by chance. As previously reported, large genes are not representative of all genes in terms of function (Raychaudhuri et al., 2010). INRICH also presents global enrichment statistics Gp that test for an excess of enriched genes at nominal gene-set p=0.001, 0.01, 0.05. This test is based on the number of unique genes within an association interval that are in at least one nominally enriched gene-set. The empirical significance of Gp is evaluated within the same permutation procedure described above.

Downloaded from http://bioinformatics.oxfordjournals.org/ at Univ of Colorado Libraries on May 3, 2012

2.1 Interval Data Generation

ACKNOWLEDGEMENT
The authors thank Dr. Peter Holmans for insightful comments.

REFERENCES
Hindorff, L., Sethupathy, P., Junkins, H., Ramos, E., Mehta, J., Collins, F., and Manolio, T. (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA, 106(23), 9362–9367.
Holmans, P., Green, E., Pahwa, J., Ferreira, M., Purcell, S., Sklar, P., Wellcome Trust Case-Control Consortium, Owen, M., O’Donovan, M., and Craddock, N. (2009). Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am J Hum Genet, 85(1), 13–24.
Hong, M. G., Pawitan, Y., Magnusson, P. K., and Prince, J. A. (2009). Strategies and issues in the detection of pathway enrichment in genome-wide association studies. Hum Genet, 126(2), 289–301.
Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M., and Hirakawa, M. (2010). KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res, 38, D355–D360.
Lieberman, J., Stroup, T., McEvoy, J., Swartz, M., Rosenheck, R., Perkins, D., Keefe, R., Davis, S., Davis, C., Lebowitz, B., Severe, J., Hsiao, J., and Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) Investigators (2005). Effectiveness of antipsychotic drugs in patients with chronic schizophrenia. N Engl J Med, 353, 1209–1223.
Psychiatric GWAS Consortium Bipolar Disorder Working Group (2011). Large-scale genome-wide association analysis of bipolar disorder identifies a new susceptibility locus near ODZ4. Nat Genet, 43, 977–983.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M., Bender, D., Maller, J., Sklar, P., de Bakker, P., Daly, M., and Sham, P. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81(3), 559–575.
Raychaudhuri, S., Korn, J., McCarroll, S., Altshuler, D., Sklar, P., Purcell, S., Daly, M., and International Schizophrenia Consortium (2010). Accurately assessing the risk of schizophrenia conferred by rare copy-number variation affecting genes with brain function. PLoS Genet, 6, e1001097.
The Gene Ontology Consortium (2000). Gene ontology: tool for the unification of biology. Nat Genet, 25(1), 25–29.
Wang, K., Li, M., and Bucan, M. (2007). Pathway-based approaches for analysis of genomewide association studies. Am J Hum Genet, 81, 1278–1283.

ANALYSIS

Chromatin marks identify critical cell types for fine mapping complex trait variants

© 2012 Nature America, Inc. All rights reserved.

Gosia Trynka1–4,8, Cynthia Sandor1–4,8, Buhm Han1–4, Han Xu5, Barbara E Stranger1,4,7, X Shirley Liu5 & Soumya Raychaudhuri1–4,6

If trait-associated variants alter regulatory regions, then they should fall within chromatin marks in relevant cell types. However, it is unclear which of the many marks are most useful in defining cell types associated with disease and fine mapping variants.
We hypothesized that informative marks are phenotypically cell type specific; that is, SNPs associated with the same trait likely overlap marks in the same cell type. We examined 15 chromatin marks and found that those highlighting active gene regulation were phenotypically cell type specific. Trimethylation of histone H3 at lysine 4 (H3K4me3) was the most phenotypically cell type specific (P < 1 × 10−6), driven by colocalization of variants and marks rather than gene proximity (P < 0.001). H3K4me3 peaks overlapped with 37 SNPs for plasma low-density lipoprotein concentration in the liver (P < 7 × 10−5), 31 SNPs for rheumatoid arthritis within CD4+ regulatory T cells (P = 1 × 10−4), 67 SNPs for type 2 diabetes in pancreatic islet cells (P = 0.003) and the liver (P = 0.003), and 14 SNPs for neuropsychiatric disease in neuronal tissues (P = 0.007). We show how cell type–specific H3K4me3 peaks can inform the fine mapping of associated SNPs to identify causal variation.

1Division of Genetics, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, USA. 2Division of Rheumatology, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, USA. 3Partners Center for Personalized Genetic Medicine, Boston, Massachusetts, USA. 4Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA. 5Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, Massachusetts, USA. 6Faculty of Medical and Human Sciences, University of Manchester, Manchester, UK. 7Present addresses: Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, USA and Institute for Genomics and Systems Biology, University of Chicago, Chicago, Illinois, USA. 8These authors contributed equally to this work. Correspondence should be addressed to S.R. ([email protected]).

Received 6 July; accepted 28 November; published online 23 December 2012; doi:10.1038/ng.2504

NATURE GENETICS ADVANCE ONLINE PUBLICATION

Recent work showing that common phenotypically associated SNPs are enriched for expression quantitative trait loci (eQTLs)1–6 suggests that they might act by altering gene regulatory regions. One example is a common non-coding variant associated with plasma low-density lipoprotein (LDL) concentration. This variant modifies a CEBPB transcription factor–binding site in an enhancer and, in doing so, alters the expression of SORT1, a gene that affects plasma LDL concentration7. A similar example is an intergenic risk allele for systemic lupus erythematosus (SLE) that decreases TNFAIP3 transcription by modifying the nuclear factor (NF)-κB–binding site within a promoter8. Whereas many eQTLs and regulatory variants act universally, the ones most relevant to disease might have tissue-specific activity6. The cell type specificity of regulatory elements is one of the major limitations in pursuing functional studies to investigate the regulatory potential of common alleles9–13.

One approach to identify regulatory elements influenced by common variants involves assaying epigenetic chromatin marks14–16. For example, H3K4me3 and monomethylation at H3K4 (H3K4me1) highlight active promoters and enhancers. A practical challenge of this approach, however, is that dozens of chromatin marks might potentially be assayed17, and it is prohibitive to conduct studies on all of them in large numbers of different tissues or in samples collected from many individuals. However, because chromatin marks colocalize18, the status of a small subset of the most informative marks might be characterized, allowing for more focused assays in tissue libraries and populations to link variants to regulatory mechanisms. Additionally, for a given phenotype, it is challenging to know in which cell type(s) chromatin marks should be assayed in order to fine map risk alleles.
If the critical cell types were known, then it might be possible to identify the biologically important cell type–specific eQTLs. Here, we hypothesize that a proportion of alleles for a given phenotype influence gene regulation by altering regulatory elements that control expression within the cell types most relevant to the phenotype. If this is the case, then variants associated with the same phenotype should overlap marks preferentially occurring within the same cell type. Therefore, to identify the most informative chromatin marks, we quantify the degree to which their activity in specific cell types near phenotypically associated variants tracks with phenotype. We then show how those chromatin marks that are most phenotypically cell type specific can identify causal cell types, asserting that cell type–specific marks might be used to fine map and identify the plausible causal variant at a particular locus.

Figure 1 Overview of the statistical approach. (a) For phenotypically associated variants, other variants in tight LD are found. For each SNP associated with a phenotype from genetic studies (lead SNP, blue diamond; top), we define a locus by identifying SNPs in tight LD (r² > 0.8, dashed red line; bottom) using data from the 1000 Genomes Project (blue dots; bottom). (b) Each locus is scored on the height and distance of the nearest peak to a variant in LD. For a selected chromatin mark, we define peaks (red) in n cell types across the genome. For each SNP in the locus (blue diamond and light-blue circles), we compute a score equal to the height of the closest peak (vertical purple line) divided by the distance to the summit in each of the n cell types (horizontal purple line). In each locus within each cell type, we note the value of the SNP with the highest score: this measure reflects the overlap between a locus and a cell type–specific regulatory element. (c) Across many phenotypes, we assess whether marks overlap alleles in specific cell types. Here, the measure of cell type specificity of each risk locus is represented by the intensity of red color. A phenotypically cell type–specific mark should consistently give signal in one or a small number of cell types for a given phenotype (yellow outline). We quantify the phenotypic cell type specificity of each mark. (d) Permutations are performed to assess the significance of phenotypic cell type specificity. To compute the significance of the phenotypic cell type specificity for a chromatin mark, we permute SNPs from different loci across phenotypes; this preserves tissue-specific signals without altering the correlation and prevalence of tissue-specific signals.
[Figure 1 panels: (a) observed association (–log10 P, with a threshold line at P < 5 × 10−8) and LD (r², 1000 Genomes Project) against genomic position (kb); (b) marks in cell types a, b, ..., n with Score_a = h_a/d_a, Score_b = h_b/d_b, ..., Score_n = h_n/d_n; (c, d) grids of loci by cell types for phenotypes 1 to m, colored by cell type specificity (h/d) from 0 to 1.]

RESULTS
Summary of statistical methods
We first sought to define a score that corresponds to the possibility that a phenotypically associated SNP or a variant in tight linkage disequilibrium (LD) with it can alter cell type–specific gene regulation, as highlighted by a specific chromatin mark. We define chromatin marks as precise positions in the genome where there is a significant excess of reads from chromatin immunoprecipitation and sequencing (ChIP-seq) data over control sequencing data.
We assume that variants close to or directly under tall chromatin mark peaks in specific cell types might be involved in cell type–specific gene regulation; on the other hand, variants that are far from chromatin mark peaks are much less likely to have a direct role in gene regulation. First, for each phenotypically associated SNP, we identified each SNP or insertion and/or deletion (indel) in tight LD (r² > 0.8 in 1000 Genomes Project data19; Fig. 1a). Next, for each cell type, we assigned each variant in LD a score proportional to the height of the nearest chromatin mark peak (referred to as h; Online Methods) divided by the physical distance to the summit (h/d in Fig. 1b; referred to as s; Online Methods). If the physical distance to the nearest peak is more than 2.5 kb, then the score is set to 0 to obviate any confounding distal effects. Thus, a variant in LD directly under a strong peak will receive a very high score. For each cell type, we assigned the phenotypically associated SNP the maximum score achieved by any of its variants in LD. To quantify the specificity of signals across cell types (as opposed to the absolute magnitude), we normalized the h/d scores so that their Euclidean norm across cell types was one (normalized h/d scores, referred to as sn; Online Methods). Thus, a SNP within a chromatin mark that is active in only one cell type will have a high score of 1 in that cell type and 0 in others. In contrast, a SNP close to chromatin marks that are not cell type specific will have similarly modest scores across cell types. Then, we wanted to quantify the phenotypic cell type specificity of the overlap between SNPs and chromatin marks. To do this, we identified sets of SNPs associated with different phenotypes and then assessed the phenotypic cell type specificity of different marks (Fig. 1c). For informative marks, one or few cell types should consistently score highly across many of the SNPs for a given phenotype.
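The scoring just described (h/d to the nearest summit, a 2.5-kb cutoff, the maximum over variants in LD, then normalization across cell types) can be sketched as follows. This is a minimal illustration rather than the authors' code: each peak is reduced to a hypothetical `(summit_position, height)` pair, the cutoff is applied to the summit distance, and the distance is clamped to 1 bp so that a variant exactly on a summit gets a finite score. The clamp and all function names are assumptions.

```python
import math

MAX_DISTANCE_BP = 2500  # scores are set to 0 beyond 2.5 kb, as in the text

def variant_score(position, peaks):
    """h/d for the nearest peak summit in one cell type.

    peaks: list of (summit_position, height) pairs (hypothetical layout).
    """
    if not peaks:
        return 0.0
    summit, height = min(peaks, key=lambda p: abs(p[0] - position))
    distance = abs(summit - position)
    if distance > MAX_DISTANCE_BP:
        return 0.0
    # Assumption: clamp d to 1 bp to avoid division by zero on the summit.
    return height / max(distance, 1)

def locus_scores(ld_variant_positions, peaks_by_cell_type):
    """Per cell type, the maximum h/d over all variants in LD with the lead SNP."""
    return {
        cell: max(variant_score(pos, peaks) for pos in ld_variant_positions)
        for cell, peaks in peaks_by_cell_type.items()
    }

def normalize(scores):
    """Scale scores so that their Euclidean norm across cell types is 1."""
    norm = math.sqrt(sum(s * s for s in scores.values()))
    if norm == 0:
        return dict(scores)
    return {cell: s / norm for cell, s in scores.items()}

# Hypothetical locus: two variants in LD, one liver peak nearby, one distant islet peak.
peaks = {"liver": [(1000, 50.0)], "islets": [(9000, 40.0)]}
print(normalize(locus_scores([950, 1200], peaks)))
```

In this toy locus only the liver peak lies within 2.5 kb of a variant, so the normalized score is 1 in liver and 0 in islets, matching the behavior described above for a mark active in a single cell type.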
For an uninformative chromatin mark, the cell types with the greatest scores vary from SNP to SNP within the same phenotype. Therefore, for informative marks, there should be minimal deviation of scores within a phenotype across multiple cell types. To quantify the phenotypic cell type specificity of a chromatin mark, we defined a metric representing the variation of signal seen within a cell type within a specific phenotype (referred to as d; Online Methods). We evaluated the statistical significance of this metric with permutations in which we randomly reassigned SNPs to phenotypes (Fig. 1d). This permutation strategy restricts analysis to only phenotypically associated SNPs and, in doing so, avoids biases that might result from known differences between phenotypically associated SNPs and non–phenotypically associated SNPs in local LD structure, gene density and epigenetic activity. We note that this approach accurately estimates type I error (Supplementary Fig. 1a).

Active gene regulation is phenotypically cell type specific
To test the phenotypic cell type specificity of individual marks, we identified a set of SNPs associated with any one of many complex traits20. We selected only SNPs associated in European populations to facilitate LD calculations. To ensure adequate power, we selected only those traits that had at least 15 reported associations in European populations. Then, we pruned SNPs by LD so that they were all independent (r² < 0.1 and >100 kb away from other associated SNPs in the genome; Online Methods). This resulted in a set of 510 independent SNPs associated with 31 complex traits. After defining the genomic locations and heights of peaks for 15 chromatin marks assayed in 14 Encyclopedia of DNA Elements (ENCODE) cell types15 (Supplementary Table 1), we observed statistically significant phenotypic cell type specificity for 4 marks (P < 0.0033 = 0.05/15; Fig. 2).
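The permutation strategy can be illustrated with a short sketch. The exact metric d is defined in the Online Methods and is not reproduced in this text, so the statistic below is only a stand-in: the summed squared deviation of each SNP's normalized score from its phenotype mean, per cell type, where smaller values mean that SNPs for one phenotype agree on cell types. The function names and the add-one p value convention are assumptions of this sketch.

```python
import random

def within_phenotype_variation(scores, labels):
    """Stand-in statistic (NOT the paper's metric): within-phenotype sum of
    squared deviations of per-cell-type scores from the phenotype mean."""
    groups = {}
    for row, label in zip(scores, labels):
        groups.setdefault(label, []).append(row)
    total = 0.0
    for rows in groups.values():
        for c in range(len(rows[0])):
            mean = sum(r[c] for r in rows) / len(rows)
            total += sum((r[c] - mean) ** 2 for r in rows)
    return total

def permutation_p(scores, labels, n_perm=2000, seed=0):
    """Reassign SNPs to phenotypes at random; count permutations that look
    at least as cell type specific (statistic at least as small) as observed."""
    rng = random.Random(seed)
    observed = within_phenotype_variation(scores, labels)
    shuffled = list(labels)
    as_small = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if within_phenotype_variation(scores, shuffled) <= observed:
            as_small += 1
    # Add-one convention chosen here so that p is never exactly 0.
    return (as_small + 1) / (n_perm + 1)

# Hypothetical data: 12 SNPs, 2 cell types, perfectly phenotype-specific signal.
scores = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 6
labels = ["A"] * 6 + ["B"] * 6
print(permutation_p(scores, labels))
```

Because only the phenotype labels are shuffled, every permuted data set contains the same SNPs and scores, which is the property the text relies on to avoid comparing associated with non-associated SNPs. The Bonferroni threshold quoted above is simply 0.05/15 ≈ 0.0033 for the 15 marks tested.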
The most strongly associated chromatin marks were H3K4me3 and acetylation of histone H3 at lysine 9 (H3K9ac) (P < 1 × 10−6), which are known to highlight active gene promoters16,21. In fact, all four most significant modifications are known to occur at regions of the genome involved in active gene transcription; DNase I hypersensitivity sites (DHSs; P < 1 × 10−3) and dimethylation of histone H3 at lysine 79 (H3K79me2; P < 1 × 10−5) identify active promoter, enhancer or transcribed regions. Because some chromatin marks colocalize (Supplementary Fig. 2), we performed conditional analyses to assess whether chromatin marks contributed to phenotypic cell type specificity independently (Supplementary Fig. 3). We observed that the highly significant associations of H3K4me3, DHSs and H3K9ac were generally not independent. In contrast, we found that chromatin marks that did not correspond to active gene regulation were not phenotypically cell type specific. In particular, H3K9me1, H3K9me3, CTCF-binding sites and trimethylation at histone H3 lysine 27 (H3K27me3), highlighting transcriptionally repressed, heterochromatic, insulator and polycomb-repressed regions, respectively, showed no evidence of being phenotypically cell type specific (P > 0.40).

To assess the reproducibility of these results, we conducted a similar analysis of data from the US National Institutes of Health (NIH) Epigenomics Project, consisting of assays for 6 different chromatin marks in 38 different cell types22 (Supplementary Table 2). We again observed that the most informative mark was H3K4me3 (P < 1 × 10−6), along with H3K4me1 (Fig. 2). H3K9ac was only nominally significant (P = 0.03), perhaps owing to the fewer cell types assayed in this experiment. The concordance of the results from these two data sets was reassuring when considering that the data from the ENCODE Project were obtained on cell lines, whereas most of the NIH Epigenomics Project data were obtained using primary cell types.

Our approach benefits from taking advantage of 1000 Genomes Project data to identify variants in LD (Fig. 1a). Repeating our analysis using only the reported lead SNPs and not examining SNPs in LD resulted in considerably less significant results (Supplementary Fig. 1b). We note that some of the variation in phenotypic cell type specificity could be related to the variable number of assayed cell types for different chromatin marks; power to detect phenotypic cell type specificity correlates with the number of assayed cell types (Supplementary Fig. 4).

Figure 2 Evaluating the significance of phenotypic cell type specificity for different marks. We used two data sets of marks assayed in different cell types: the ENCODE Project and NIH Epigenomics Project. For each mark, we performed up to 1 million permutations of SNPs and phenotypes to calculate the null distribution of phenotypic cell type specificity for comparison to observed phenotypic cell type specificity. Below, we show the observed phenotypic cell type specificity (green lines) against the null distribution (black and gray density plots). Above, we plot the corresponding P values. The red dashed line indicates the significance threshold after correcting for the testing of multiple independent hypotheses.
[Figure 2 panels: ENCODE Project (14 cell types) and NIH Epigenomics Project (38 cell types). Marks shown: H3K27me3, H3K9me3, CTCF-binding site, H3K9me1, H4K20me1, H3K4me2, H3K36me3, H3K4me1, H2A.Z, H3K27ac, DNase I HS, H3K79me2, H3K9ac, Pol2b-binding site, H3K4me3. Axes: phenotypic cell type specificity (P value, 1 to 10−6); score (observed versus random).]

Variants colocalize with cell type–specific H3K4me3 peaks
Because chromatin marks tend to concentrate in and around genes, we considered the possibility that the observed overlap between H3K4me3 peaks and variants might be an artifact of proximity to gene transcript sequences with phenotypically cell type–specific expression. To assess the role of the specific peak locations versus proximity to specifically expressed genes, we repeated our analyses after randomly shifting the specific location of peaks locally (±10 kb, s.d. of 2.5 kb) within phenotypically associated loci. While these small shifts would maintain the proximity of peaks to genes, they would disrupt the specific colocalization of variants and H3K4me3 peaks. Indeed, in 1,000 such experiments, we found that shifting peak locations lowered the significance of phenotypic cell type specificity (median P = 0.03), and we did not observe any instance where the phenotypic cell type specificity was more significant than it was in the actual data (Supplementary Fig. 5). This result strongly suggests that the specific colocalization of variants in LD with phenotypically associated SNPs and H3K4me3 peaks, rather than proximity to gene structures, is driving the phenotypic cell type specificity signal (P < 0.001 by permutation).

Enhancers and promoters underlie phenotypic cell type specificity
To understand whether the phenotypic cell type specificity that we observed was driven by the activity of promoters or enhancers, we divided chromatin peaks into those falling within proximal promoter regions (including the transcriptional start site (TSS) ± 2 kb) and those falling outside of promoter regions and repeated our analysis.
Whereas phenotypic cell type specificity was seen both within and outside of the immediate promoter regions, H3K4me3, H3K79me2 and DHSs were more significantly phenotypically cell type specific outside of promoter regions than within (Supplementary Fig. 6). We note that, although H3K4me3 marks are not generally thought of as being enriched in enhancers, there is evidence that they can be enriched in strong and disease-associated enhancers9,23,24. Alternatively, H3K4me3 enrichment outside of promoter sites might also represent unannotated sites. We further assessed the phenotypic cell type specificity of previously published functional annotations on the basis of hidden Markov model states capturing information on nine separate chromatin marks9. We observed that hidden states 4 and 5, corresponding to active proximal enhancers and active distal enhancers, respectively, were most significantly phenotypically cell type specific (Supplementary Fig. 7). State 4 is highly enriched for H3K4me3 peaks, the mark that we observed to be the most phenotypically cell type specific.

Identification of key cell types for four phenotypes
We identified the cell types within which common variants likely influence gene regulation using published SNPs for 4 distinct phenotypes (Fig. 3 and Supplementary Table 3) and H3K4me3 data from the Epigenomics Project for a panel of 34 cell types22. We selected these phenotypes because there is a reasonable sense of what the critical cell types might be and because a sufficient number of associated SNPs had been identified. For each phenotype, we assigned a cell type specificity score to each of its associated variants (Fig. 1a,b and Online Methods) and compared these scores to those from equal-sized matched SNP sets sampled from 45,950 LD-pruned SNPs3.
Because phenotypically associated SNPs have more epigenetic activity than other SNPs, we were careful to match sampled SNPs so that they had similar total numbers of H3K4me3 peaks across all 34 cell types as associated SNPs. Results were generally consistent in a more stringent analysis when we sampled instead from only phenotypically associated SNPs from the National Human Genome Research Institute (NHGRI) genome-wide association study (GWAS) catalog20 (Supplementary Fig. 8). In addition to these phenotypes, we present separately the results for four additional phenotypes, B-cell–specific cis eQTL associations, SLE, type 1 diabetes (T1D) and body mass index (BMI) (Supplementary Fig. 9); in all of those instances, except BMI, we were able to identify highly significant cell types. Application to plasma LDL concentration implicates liver As a positive control, we tested 37 SNPs associated with LDL concentration25 for overlap with H3K4me3 marks in different tissues. These variants should implicate regulatory activity within the liver, according to previous work7,26,27. In aggregate, we observed that the 37 SNPs implicated a total of 1,501 H3K4me3 peaks in 34 different cell types. The most significant cell type was adult liver tissue (P = 7.2 × 10−5; Fig. 3a). We observed overlap with liver-specific peaks using other phenotypically cell type–specific marks, including H3K9ac (P = 0.003) and H3K4me1 (P = 0.002). In contrast, we observed little association with liver for the H3K27me3 or H3K9me3 marks (Fig. 2 and Supplementary Table 4). Examining the relative proximity and specificity of the SNPs within 10,000 sets of matched SNP sets used to calculate statistical significance, we identified the 95th-percentile threshold at a score of 0.58 (Fig. 4a). Of the 37 SNPs associated with LDL concentration, 7 (19%) were near to a highly liver-specific chromatin mark at this threshold. 
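The thresholding step can be sketched as below: pool the specificity scores of the matched SNP sets, take their 95th percentile as the cutoff, and count the associated SNPs at or above it. This is a hedged illustration, not the authors' procedure in detail; the nearest-rank percentile definition, the at-or-above comparison and the function names are assumptions, and the example data are synthetic.

```python
import math

def percentile_threshold(null_scores, percent=95):
    """Nearest-rank percentile of the pooled null (matched-SNP) scores.

    Assumption: nearest-rank is one of several percentile conventions;
    the source does not state which one was used.
    """
    ordered = sorted(null_scores)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]

def count_above(observed_scores, threshold):
    """Number of associated SNPs whose score reaches the threshold."""
    return sum(score >= threshold for score in observed_scores)

# Synthetic null scores 0.01, 0.02, ..., 1.00 and four made-up observed scores.
null = [i / 100 for i in range(1, 101)]
cutoff = percentile_threshold(null)
print(cutoff, count_above([0.1, 0.96, 0.99, 0.5], cutoff))

# Arithmetic check of the proportion quoted in the text: 7 of 37 LDL SNPs.
print(round(100 * 7 / 37))  # 19
```

With the paper's numbers, the cutoff for liver was a score of 0.58, and 7 of the 37 LDL SNPs (about 19%) exceeded it.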
These seven SNPs are generally in tight LD with a variant that is very close to cell type–specific H3K4me3 peaks (median of 132 bp away; see Supplementary Table 3 for details on the specific SNPs).

Figure 3 SNPs for four complex traits overlap H3K4me3 marks in specific cell types. (a–d) We considered four phenotypes: LDL cholesterol plasma concentration (a), rheumatoid arthritis (b), neuropsychiatric disorders (schizophrenia and bipolar disease) (c) and T2D (d). For each phenotype, we calculated the cell type–specific overlap with H3K4me3 histone modification peaks in 34 tissues (listed on the left). The histograms on the right show the significance of the overlap for each tissue with variants from each of the phenotypes, estimated by sampling sets of SNPs matched so that the total number of peaks overlapping SNPs in LD was the same as in the test set. Adjacent to each histogram, we present correlation coefficients between two tissues based on scores computed from randomly sampled sets of independent loci. Colored boxes in d show independent P values for pancreatic islets and liver computed by removing the SNPs scoring highly in one tissue but not the other.
[Figure 3 panels: (a) LDL (37 loci), top tissue adult liver; (b) rheumatoid arthritis (31 loci), top tissue Treg primary cells; (c) neuropsychiatric disorders (14 loci), top tissue anterior caudate; (d) T2D (67 loci), top tissues pancreatic islets and adult liver. Tissue groups: hematopoietic; brain; musculoskeletal, endocrine & others; gastrointestinal. Tissues: CD34+ primary cells, mobilized CD34+ primary cells, CD3+ primary cells, CD19+ primary cells, CD8+ memory primary cells, CD8+ naive primary cells, CD34+ cultured cells, CD4+ naive primary cells, CD4+ memory primary cells, Treg primary cells, mesenchymal stem cells (bone marrow), cingulate gyrus, anterior caudate, substantia nigra, inferior temporal lobe, mid-frontal lobe, hippocampus middle, pancreatic islets, chondrocytes (mesenchymal stem cells), adipose nuclei, adult kidney, mesenchymal stem cells (adipose), muscle satellite cultured cells, skeletal muscle, adipocyte (mesenchymal stem cells), adult liver, mucosa (colon), duodenum smooth muscle, stomach smooth muscle, mucosa (stomach), rectal smooth muscle, mucosa (rectum), mucosa (duodenum), smooth muscle (colon). Axes: enrichment for H3K4me3 peaks (P value, 1 to 10−5); correlation (0 to 1).]

Application to rheumatoid arthritis implicates CD4+ Treg cells
For rheumatoid arthritis and other autoimmune diseases, the critical immune cell types are often not clearly defined in the literature and can be controversial28–30. When we tested the 31 SNPs associated with rheumatoid arthritis31, we observed that they implicated 1,328 H3K4me3 peaks in 34 tissues, with the most significant association to CD4+ T cells and, in particular, CD4+ regulatory T (Treg) cells (P = 1.3 × 10−4; Fig. 3b). The phenotypically similar CD4+ memory T cells were also highly significantly associated (P = 7.0 × 10−4)32. Of the 31 SNPs associated with rheumatoid arthritis, we found that 6 (19.3%) were close to chromatin marks that were highly specific to CD4+ Treg cells, with relative specificity of 0.53 or greater (permuted 95th-percentile threshold; Fig. 4b). These 6 SNPs are generally in tight LD with a variant that is very close to cell type–specific H3K4me3 peaks (median of 37 bp away; see Supplementary Table 3 for details on the specific SNPs). In instances where dense genotyping has been applied to localize the association signal, we speculate that cell type–specific overlap might become more apparent. For the 31 loci associated with rheumatoid arthritis, we examined recent results from a fine-mapping study using the dense genotyping platform the Immunochip33.
Indeed, when repeating the analysis with the newly defined index SNPs from each locus using dense genotyping data, we found that the significance of the enrichment for CD4+ Treg cells increased (P = 5.1 × 10−5; Supplementary Fig. 10) and that the median specificity score for each locus increased from 0.13 to 0.16.

Application to psychiatric disorders implicates neuronal tissues
The 14 independent SNPs from neuropsychiatric disorders34,35 mapped within 874 H3K4me3 peaks. Despite the limited power of this analysis, we were encouraged to see that these SNP associations implicated multiple neuronal tissues, including the anterior caudate nucleus (P = 0.0076) and the mid-frontal lobe of the brain (P = 0.044) (Fig. 3c); we also observed a likely spurious association with colonic smooth muscle (P = 0.026). The role of the frontal lobe in neuropsychiatric disease in particular has long been appreciated36–38. Although none of these results reached a conservative level of significance after correcting for multiple-hypothesis testing, we are hopeful that additional SNP discoveries will help clarify this result further. Of the 14 SNPs associated with neuropsychiatric disorders, 3 (21%) had a tissue-specific chromatin mark within the anterior caudate, with a relative specificity of 0.28 or greater (permuted 95th percentile; Fig. 4c).

Figure 4 Cell type specificity for four sets of SNPs. (a–d) The distribution of cell type–specificity scores (h/d; Fig. 1b) is shown for SNPs associated with LDL cholesterol concentration, rheumatoid arthritis, neuropsychiatric disorders and T2D within liver (a), CD4+ Treg cells (b), anterior caudate nucleus (c) and jointly in pancreatic islets (x axis) and liver (y axis) (d). Blue points represent cell type specificity scores. Red circles indicate overlapping points, representing SNPs with very similar scores. We compare these scores to specificity scores in the same tissue of 10,000 sampled sets of matched SNPs from HapMap (yellow density plots). We plot the median specificity for both the distribution of observed SNPs and the sampled sets of matched SNPs (solid lines). Also, we present the 95th-percentile threshold for the sampled sets of matched SNPs (dashed line), which we use as a specificity cutoff. For each phenotype, about one-fourth of variants overlap cell type–specific H3K4me3 peaks.
[Figure 4 panels: (a) LDL (37 loci) versus 10,000 sets of matched SNPs, adult liver, P = 7.2 × 10−5; (b) rheumatoid arthritis (31 loci), Treg cells, P = 1.3 × 10−4; (c) neuropsychiatric disorders (14 loci), anterior caudate, P = 7.6 × 10−3; (d) T2D (67 loci), pancreatic islets (P = 6.1 × 10−3) versus adult liver (P = 7.9 × 10−3). Vertical axis: H3K4me3 cell type specificity score (0 to 1.0).]

Application to T2D implicates pancreatic islets and liver
In certain instances, it might be plausible that multiple tissues could be implicated in a disease. When we examined 67 SNPs for type 2 diabetes (T2D)39–50, implicating a total of 2,776 H3K4me3 peaks within 34 different cell types, we observed the most significant enrichment in pancreatic islets (P = 0.0061) and the liver (P = 0.0079) (Fig. 3d).
In particular, of the 67 SNPs associated with type 2 diabetes, 14 (20.1%) were highly specific for chromatin marks within either the liver (at a 0.57 permuted 95th-percentile threshold) or pancreatic islets (at a 0.65 permuted 95th-percentile threshold); these SNPs are in tight LD with a marker that has a median distance of 46 bp from the summit of a cell type–specific peak. When we tested the pancreatic islet and liver tissues together, we found that the combination of liver and pancreatic islets was even more significant than the tissues individually (P = 2.0 × 10−4; Online Methods) and was more significant than all other possible tissue pairs. We found that the SNPs driving the overlap in the two tissues were distinct (Fig. 4d). When we removed the SNPs most specific for pancreatic islet marks (score > 0.3), we observed that enrichment in liver was even more apparent (P = 0.0032); similarly, when we removed the SNPs most specific for overlap with liver marks (score > 0.3), we observed that the enrichment in pancreatic islets was also more apparent (P = 0.0026). Both islet cells and the liver have long been known to have a key role in mediating glucose synthesis, insulin secretion and diabetes51.

Fine mapping with cell type–specific H3K4me3 peaks
One of the major challenges in understanding complex trait associations is to identify the causal variants and the mechanisms through which they affect genes. Associated variants can be fine mapped to variants in tight LD within cell type–specific chromatin marks in the appropriate cell type. Here, we present examples where cell type–specific H3K4me3 peaks can potentially be used to localize associated variants to causal variants. First, we considered the rs629301 SNP that is associated with plasma LDL concentration in the region including the SORT1 gene (Fig. 5a).
A liver-specific H3K4me3 peak, not seen as prominently in other cell types, overlapped with this SNP and three other variants in tight LD with it. This H3K4me3 peak is located far from the TSS region and corresponds to a hepatocyte enhancer region7. The closest SNP to the summit of the peak (87 bp away) is the rs12740374 functional variant. This variant is known to alter a CEBPB-binding site within the enhancer region controlling SORT1.

Another example is provided by the locus for T2D represented by the rs704184 reported SNP association. rs10814915, in tight LD with the reported GWAS SNP (r² = 0.93), scored highly for pancreatic islets but showed no tissue specificity for the liver (Fig. 5b). This SNP is located only 84 bp away from the summit of a highly pancreatic islet–specific peak. rs10814915 is predicted to be present within a sequence bound by the glucocorticoid receptor (GR)52, which is known to have a role in pancreatic islets and glucose regulation. The SNP resides within an intron of the GLIS3 gene, which is involved in the development of pancreatic islets.

Finally, we examined the locus for rheumatoid arthritis defined by a reported association with the rs13119723 SNP in the intron of a gene with unknown function, KIAA1109. This SNP is in LD with other variants spanning over 500 kb within this locus, rendering fine-mapping efforts particularly challenging. We identified a SNP, rs13140464, in tight LD with rs13119723 (r² = 0.9) (Fig. 5c), which maps only 116 bp from the summit of an H3K4me3 peak that is highly specific to CD4+ Treg cells, with a score of 0.63. This SNP is located between the IL2 and IL21 genes, 122 kb downstream of IL2 and 34 kb upstream of IL21, and is 280 kb away from the published SNP. It is tempting to speculate that rs13140464 might act by altering a highly cell type–specific regulatory sequence controlling IL2 expression, which has a key role in CD4+ Treg maturation53.
[Figure 5 image: regional plots at Chr. 1 (p13.3), Chr. 9 (p24.2) and Chr. 4 (q27), with r² and H3K4me3 tag-count tracks for adult liver, Treg primary cells, pancreatic islets and skeletal muscle.]

Figure 5 Selected phenotypically associated loci with high cell type specificity. We present three examples of loci with cell type–specific overlap with H3K4me3 peaks. Top, genomic coordinates and genes near the associated SNP. Middle, lead SNP (blue diamond) and other nearby SNPs from the 1000 Genomes Project (red dots correspond to those with high r², blue dots correspond to those with low r²). We also show the SNP that is closest to the cell type–specific peak (red diamond). Bottom, H3K4me3 sequence tag counts for selected cell types. Colored horizontal lines in the tissue panels correspond to peak calls. Dashed vertical lines mark the summits of phenotypically cell type–specific peaks. (a–c) Shown are the SORT1 locus for LDL (a), the GLIS3 locus for T2D (b) and the IL2-IL21 locus for rheumatoid arthritis (c).

DISCUSSION

In this study, we demonstrated that chromatin marks highlighting active regulatory regions, such as H3K4me3, H3K9ac and DHSs, overlap phenotypically associated variants; furthermore, this overlap is phenotypically cell type specific. These results strongly support the hypothesis that many complex disease and trait alleles might act by influencing gene regulation in a cell type–specific manner. In addition, we quantified the degree to which different marks are cell type specific in their overlap with phenotypically associated SNPs.
These cell type–specific marks might not only be used to connect phenotypes to specific cell types, but they might also be useful in mapping phenotype-associated SNPs to potential regulatory variants. In particular, we consistently observed that H3K4me3 marks could be used to effectively identify specific cell types that are enriched among specific phenotypes. We note that this statistical approach can be applied to assess the significance of other chromatin marks or other cell type–specific gene annotations as they become available.

In the phenotypes that we examined, we found that about one-fourth of associated variants could be connected to a highly cell type–specific mark within a critical cell type (Fig. 5). In instances where we do not observe a SNP in tight LD within a highly cell type–specific H3K4me3 peak, it is possible that a regulatory region that is not cell type specific might be altered. Alternatively, in some instances the reported SNP association will need to be further refined with dense genotyping, or undiscovered variants in tight LD will need to be ascertained through sequencing, before the effect of a cell type–specific peak can be identified. Finally, for many phenotypes, multiple cell types could be involved, in which case this approach might have limited efficacy. We demonstrated one example of this type of scenario in T2D, where we detected effects both in liver and pancreatic islet cell types.

We acknowledge that our approach is potentially sensitive to the diversity and number of cell types assayed. For instance, a limited application to a set of hematopoietic cell types might not be particularly informative if a set of purely neurological phenotypes is assayed. We note that our approach depends critically on technical factors: for instance, the quality of antibody reagents, experimental protocols or other technical factors that might introduce noise into specific chromatin mark assays could mitigate true signals.
Our approach may perform better on the chromatin marks with higher quality assays.

Once variants and cell types are identified, they will likely be excellent candidates for cell type–specific functional investigations, including allelic imbalance assays to define cis-eQTL activity54, cell type–specific DHS quantitative trait locus (dsQTL) analyses55 and identification of active transcription factor–binding sites. These cell type–specific investigations in appropriately chosen cell types might ultimately help to lead investigators from common disease variation to causal variants and molecular mechanisms.

URLs. All software is available online at http://www.broadinstitute.org/mpg/epigwas/. ENCODE, http://genome.ucsc.edu/ENCODE/downloads.html; NIH Roadmap Epigenomics Mapping Consortium, http://www.roadmapepigenomics.org/; NHGRI GWAS catalog, http://www.genome.gov/gwastudies/.

METHODS
Methods and any associated references are available in the online version of the paper.

Note: Supplementary information is available in the online version of the paper.

ACKNOWLEDGMENTS
We thank M. Daly, M. Kellis, D. Diogo, X. Hu, Y. Okada, R. Plenge, S. Ripke, G. Srivastava, E. Stahl and S. Sunyaev for critical feedback and discussion. G.T. is supported by the Rubicon grant from The Netherlands Organization for Scientific Research (NWO). B.E.S. and S.R. are supported by the Harvard University Milton Fund, and Brigham and Women’s Hospital. S.R. is also supported by funds from the US NIH (K08AR055688 and U01HG0070033) and the Arthritis Foundation. X.S.L. is also supported by funds from the US NIH (R01 HG004069). We thank the ENCODE Project, supported by the NHGRI, and the NIH Roadmap Epigenomics Mapping Consortium for making data available.

AUTHOR CONTRIBUTIONS
S.R. led the study. G.T., C.S., S.R., B.H. and H.X. performed the analysis. G.T., C.S., S.R., B.E.S. and X.S.L. wrote the manuscript.
All authors reviewed the final manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

© 2012 Nature America, Inc. All rights reserved.

Published online at http://www.nature.com/doifinder/10.1038/ng.2504. Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

1. Nicolae, D.L. et al. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet. 6, e1000888 (2010).
2. Fraser, H.B. & Xie, X. Common polymorphic transcript variation in human disease. Genome Res. 19, 567–575 (2009).
3. Lango Allen, H. et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, 832–838 (2010).
4. Nica, A.C. et al. Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations. PLoS Genet. 6, e1000895 (2010).
5. Fehrmann, R.S. et al. eQTLs reveal that independent genetic variants associated with a complex phenotype converge on intermediate genes, with a major role for the HLA. PLoS Genet. 7, e1002197 (2011).
6. Fairfax, B.P. et al. Genetics of gene expression in primary immune cells identifies cell type–specific master regulators and roles of HLA alleles. Nat. Genet. 44, 502–510 (2012).
7. Musunuru, K. et al. From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature 466, 714–719 (2010).
8. Adrianto, I. et al. Association of a functional variant downstream of TNFAIP3 with systemic lupus erythematosus. Nat. Genet. 43, 253–258 (2011).
9. Ernst, J. et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49 (2011).
10. Creyghton, M.P. et al. Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc. Natl. Acad. Sci. USA 107, 21931–21936 (2010).
11. Waki, H. et al.
Global mapping of cell type–specific open chromatin by FAIRE-seq reveals the regulatory role of the NFI family in adipocyte differentiation. PLoS Genet. 7, e1002311 (2011).
12. Atchison, M.L. Enhancers: mechanisms of action and cell specificity. Annu. Rev. Cell Biol. 4, 127–153 (1988).
13. Song, L. et al. Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity. Genome Res. 21, 1757–1767 (2011).
14. Boyle, A.P. et al. High-resolution mapping and characterization of open chromatin across the genome. Cell 132, 311–322 (2008).
15. ENCODE Project Consortium. A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 9, e1001046 (2011).
16. Barski, A. et al. High-resolution profiling of histone methylations in the human genome. Cell 129, 823–837 (2007).
17. Kouzarides, T. Chromatin modifications and their function. Cell 128, 693–705 (2007).
18. Wang, Z. et al. Combinatorial patterns of histone acetylations and methylations in the human genome. Nat. Genet. 40, 897–903 (2008).
19. Thousand Genomes Project. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
20. Hindorff, L.A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA 106, 9362–9367 (2009).
21. Bernstein, B.E. et al. Genomic maps and comparative analysis of histone modifications in human and mouse. Cell 120, 169–181 (2005).
22. Bernstein, B.E. et al. The NIH Roadmap Epigenomics Mapping Consortium. Nat. Biotechnol. 28, 1045–1048 (2010).
23. Pekowska, A. et al. H3K4 tri-methylation provides an epigenetic signature of active enhancers. EMBO J. 30, 4198–4210 (2011).
24. Jia, L. et al. Functional enhancers at the gene-poor 8q24 cancer-linked locus. PLoS Genet. 5, e1000597 (2009).
25. Teslovich, T.M. et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707–713 (2010).
26. Smith, L.C., Pownall, H.J. & Gotto, A.M. Jr. The plasma lipoproteins: structure and metabolism. Annu. Rev. Biochem. 47, 751–757 (1978).
27. Hobbs, H.H., Brown, M.S. & Goldstein, J.L. Molecular genetics of the LDL receptor gene in familial hypercholesterolemia. Hum. Mutat. 1, 445–466 (1992).
28. Firestein, G.S. Evolving concepts of rheumatoid arthritis. Nature 423, 356–361 (2003).
29. Lee, D.M. et al. Mast cells: a cellular link between autoantibodies and inflammatory arthritis. Science 297, 1689–1692 (2002).
30. Boilard, E. et al. Platelets amplify inflammation in arthritis via collagen-dependent microparticle production. Science 327, 580–583 (2010).
31. Stahl, E.A. et al. Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci. Nat. Genet. 42, 508–514 (2010).
32. Akbar, A.N., Vukmanovic-Stejic, M., Taams, L.S. & Macallan, D.C. The dynamic co-evolution of memory and regulatory CD4+ T cells in the periphery. Nat. Rev. Immunol. 7, 231–237 (2007).
33. Eyre, S. et al. High-density genetic mapping identifies new susceptibility loci for rheumatoid arthritis. Nat. Genet. 44, 1336–1340 (2012).
34. Psychiatric GWAS Consortium Bipolar Disorder Working Group. Large-scale genome-wide association analysis of bipolar disorder identifies a new susceptibility locus near ODZ4. Nat. Genet. 43, 977–983 (2011).
35. Schizophrenia Genome-Wide Association Study (GWAS) Consortium. Genome-wide association study identifies five new schizophrenia loci. Nat. Genet. 43, 969–976 (2011).
36. Goldman-Rakic, P.S. & Selemon, L.D. Functional and anatomical aspects of prefrontal pathology in schizophrenia. Schizophr. Bull. 23, 437–458 (1997).
37. Goldstein, J.M. et al. Cortical abnormalities in schizophrenia identified by structural magnetic resonance imaging. Arch. Gen. Psychiatry 56, 537–547 (1999).
38. Strakowski, S.M., Delbello, M.P. & Adler, C.M.
The functional neuroanatomy of bipolar disorder: a review of neuroimaging findings. Mol. Psychiatry 10, 105–116 (2005).
39. Morris, A.P. et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat. Genet. 44, 981–990 (2012).
40. Cho, Y.S. et al. Meta-analysis of genome-wide association studies identifies eight new loci for type 2 diabetes in east Asians. Nat. Genet. 44, 67–72 (2012).
41. Dupuis, J. et al. New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nat. Genet. 42, 105–116 (2010).
42. Kong, A. et al. Parental origin of sequence variants associated with complex diseases. Nature 462, 868–874 (2009).
43. Kooner, J.S. et al. Genome-wide association study in individuals of South Asian ancestry identifies six new type 2 diabetes susceptibility loci. Nat. Genet. 43, 984–989 (2011).
44. Perry, J.R. et al. Stratifying type 2 diabetes cases by BMI identifies genetic risk variants in LAMA1 and enrichment for risk variants in lean compared to obese cases. PLoS Genet. 8, e1002741 (2012).
45. Qi, L. et al. Genetic variants at 2q24 are associated with susceptibility to type 2 diabetes. Hum. Mol. Genet. 19, 2706–2715 (2010).
46. Saxena, R. et al. Large-scale gene-centric meta-analysis across 39 studies identifies type 2 diabetes loci. Am. J. Hum. Genet. 90, 410–425 (2012).
47. Shu, X.O. et al. Identification of new genetic risk variants for type 2 diabetes. PLoS Genet. 6, e1001127 (2010).
48. Tsai, F.J. et al. A genome-wide association study identifies susceptibility variants for type 2 diabetes in Han Chinese. PLoS Genet. 6, e1000847 (2010).
49. Voight, B.F. et al. Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat. Genet. 42, 579–589 (2010).
50. Yamauchi, T. et al.
A genome-wide association study in the Japanese population identifies susceptibility loci for type 2 diabetes at UBE2E2 and C2CD4A-C2CD4B. Nat. Genet. 42, 864–868 (2010).
51. Seino, S., Shibasaki, T. & Minami, K. Dynamics of insulin secretion and the clinical implications for obesity and diabetes. J. Clin. Invest. 121, 2118–2125 (2011).
52. Ward, L.D. & Kellis, M. HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Res. 40, D930–D934 (2012).
53. Setoguchi, R., Hori, S., Takahashi, T. & Sakaguchi, S. Homeostatic maintenance of natural Foxp3+ CD25+ CD4+ regulatory T cells by interleukin (IL)-2 and induction of autoimmune disease by IL-2 neutralization. J. Exp. Med. 201, 723–735 (2005).
54. McCarroll, S.A. et al. Deletion polymorphism upstream of IRGM associated with altered IRGM expression and Crohn's disease. Nat. Genet. 40, 1107–1112 (2008).
55. Degner, J.F. et al. DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482, 390–394 (2012).

ONLINE METHODS

Chromatin mark data. We obtained two publicly available data sets for chromatin mark assays on different sets of tissues. We use the term chromatin mark broadly to include histone modifications and DHSs, as well as common epigenetic features, such as CTCF-binding sites. First, we used data from the ENCODE Project, which included sequence reads from ChIP-seq assays and controls in up to 14 different cell types from a diverse set of 15 chromatin marks: CTCF-binding sites, the variant H2A histone (H2A.Z), H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me1, H3K9me3, H4K20me1, Pol2b-binding sites and DHSs15 (Supplementary Tables 1 and 2). We separately obtained hidden chromatin state annotations for 8 of the 14 cell types defined by clustering chromatin marks9.
Second, we used data from the NIH Roadmap Epigenomics Mapping Consortium that assayed only six chromatin marks on a large number of cell types22. This data set included sequence reads from ChIP-seq assays and controls for 6 histone modifications (H3K27me3, H3K4me3, H3K36me3, H3K9ac, H3K4me1 and H3K9me3) assayed in 38 adult and fetal tissues (Supplementary Table 2). For both of these data sets, we downloaded data comprising hg19-mapped sequence reads. In instances where there were multiple replicates of a given ChIP-seq assay for the same tissue, we aggregated sequence reads for the individual assays. We also obtained mapped reads from control data comprising sequenced genomic DNA. We ran MACS (v1.4) software to identify significant peaks (P < 1 × 10⁻⁵), specific locations within the genome with enrichment of tag sequences, setting the bandwidth parameter to 300 bp56. For each chromatin mark peak, we located its summit, which represents the position with the highest pileup of sequence tags.

Processing chromatin mark data. Once we identified peaks, we used MACS to determine the fold enrichment of tags compared to controls, using the equation

$$f = \frac{\lambda_{\mathrm{peak}}}{\mathrm{mean}(\lambda_{\mathrm{local}})} \tag{1}$$

where $\lambda_{\mathrm{peak}}$ and $\lambda_{\mathrm{local}}$ are parameters for a Poisson distribution determined by fitting the local sequence tag distributions in the peak region from ChIP-seq data and control data, respectively. We considered f as the height of the peak instead of the raw number of tags, as this approach leverages control data to account for local biases in the genome (due to sequencing bias, mapping bias, chromatin structure and genome copy-number variations) and improves the robustness and specificity of the estimation.
We then corrected for global variation in multiple experiments for the same chromatin mark in different cell types, using the equation

$$h_{i,j,\mathrm{norm}} = f_{i,j} \times \frac{\max_{i \in \text{cell types}} \left( \sum_{j} f_{i,j} \right)}{\sum_{j} f_{i,j}} \tag{2}$$

where $f_{i,j}$ corresponds to fold enrichment for the peak j in the cell type i before normalization, and $h_{i,j,\mathrm{norm}}$ is the fold enrichment after normalization (or the height of the peak).

Phenotypes and associated SNPs. To estimate the phenotypic cell type specificity of each chromatin mark, we identified a comprehensive set of independent SNPs associated with unique phenotypes. We used data from a catalog summarizing results from recent GWAS20 (downloaded January 2012). We selected only the phenotype-associated SNPs with highly statistically significant associations (P < 5 × 10⁻⁸). To ensure the applicability of the 1000 Genomes Project resource, we used only those SNPs associated in populations of European descent. To limit the analysis to phenotypes that have an adequate number of SNP associations, we selected only phenotypes with at least 15 such SNP associations. To ensure the independence of the associated SNPs, we removed SNPs with r² > 0.1 and those that were <100 kb from a more strongly associated variant in the genome. To preserve a priori specific phenotypes for independent testing, we removed SNPs associated with rheumatoid arthritis, BMI and LDL plasma cholesterol concentration as well as height. For variants associated with multiple phenotypes, we selected a single phenotype association and discarded others; we selected the SNP associated with the phenotype with the fewest SNPs. Our final data set consisted of 510 risk variants associated with 31 diseases or traits. To test our approach, we also separately identified in the literature 37 SNPs associated with LDL plasma concentration25, 31 SNPs associated with rheumatoid arthritis risk31, 67 SNPs associated with T2D risk39–50 and 14 risk loci for neuropsychiatric disorders34,35.
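The independence filter described above can be illustrated with a short sketch. This is not the authors' code: the greedy strongest-first ordering, the reading of the two removal criteria as alternatives (either one triggers removal), the `Snp` record and the `r2` lookup callable are all assumptions made for illustration, and the toy data are invented.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: str
    pos: int        # base-pair position
    pvalue: float   # GWAS association P value


def prune_independent(
    snps: Iterable[Snp],
    r2: Callable[[str, str], float],  # hypothetical pairwise LD lookup
    p_max: float = 5e-8,
    r2_max: float = 0.1,
    min_dist: int = 100_000,
) -> List[Snp]:
    """Keep significant SNPs, dropping any that clash with a stronger kept SNP."""
    kept: List[Snp] = []
    # Visit the strongest associations first, so every clash is resolved in
    # favour of the more strongly associated variant (greedy; an assumption).
    significant = (s for s in snps if s.pvalue < p_max)
    for snp in sorted(significant, key=lambda s: s.pvalue):
        clash = any(
            k.chrom == snp.chrom
            and (abs(k.pos - snp.pos) < min_dist
                 or r2(k.rsid, snp.rsid) > r2_max)
            for k in kept
        )
        if not clash:
            kept.append(snp)
    return kept


# Toy data, invented for illustration only.
toy_ld = {frozenset(("rsC", "rsD")): 0.5}
toy_snps = [
    Snp("rsA", "1", 1_000, 1e-20),
    Snp("rsB", "1", 50_000, 1e-10),    # <100 kb from rsA
    Snp("rsC", "1", 500_000, 1e-9),    # in LD with the stronger rsD
    Snp("rsD", "1", 900_000, 1e-12),
    Snp("rsE", "2", 1_000, 1e-3),      # not genome-wide significant
]
kept = prune_independent(
    toy_snps, lambda a, b: toy_ld.get(frozenset((a, b)), 0.0)
)
```

In practice the `r2` lookup would be backed by LD computed from a reference panel; the sketch only fixes the order in which the thresholds from the text are applied.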
Evaluating marks for their phenotypically cell type–specific overlap with variants.

Step 1. Identifying variants in LD with associated SNPs. We recognized that the observed phenotype association of a given variant might be the consequence of other variants tightly linked to the associated variant (Fig. 1a). We therefore comprehensively ascertained variants from the 1000 Genomes Project to identify all variants (SNPs and indels) in LD19 (r² > 0.8) on the basis of haplotypes reconstructed with Beagle from the subset of 379 individuals of European descent.

Step 2. Scoring regulatory activity near a risk SNP. Next, we examined chromatin marks in the different cell types located near associated SNPs (Fig. 1b). We assumed that the closer an associated SNP (or variant in LD) was to a tall peak, the greater the chance that it might influence a regulatory element highlighted by that peak. We scored each associated SNP k within each cell type by identifying a SNP k′ (or indel) in tight LD that was closest to a chromatin mark peak j in tissue i. We then assigned a score $s_{i,k}$ equal to the height of peak j in the tissue i, $h_{i,j,\mathrm{norm}}$ (referred to as h in the main text), divided by the distance d between the SNP k′ and the summit of the peak j. If there was no peak within 2.5 kb of any SNP in LD with SNP k, then $s_{i,k}$ was set to zero.

Step 3. Normalization to obtain a cell type specificity score. For each associated SNP k and chromatin mark, we obtained a vector of scores for multiple cell types i. To compare the cell type specificity score across risk variants and phenotypes, we applied Euclidean normalization in the following equation:

$$sn_{i,k} = \frac{s_{i,k}}{\sqrt{\sum_{i} s_{i,k}^{2}}} \tag{3}$$

This ensured that $sn_{i,k}$ emphasized cell type specificity instead of the magnitude of the signal. For associated risk variants not near any peak, where $s_{i,k}$ is zero for all i, we replaced values with the average of values of other associated SNPs with at least one nonzero $s_{i,k}$ value over all cell types.

Step 4.
Estimating the phenotypic cell type specificity of a chromatin mark. If a chromatin mark is informative for phenotypic cell type specificity, then the deviance of chromatin mark overlap for associated SNPs ($sn_{i,k}$) should be minimal for a given phenotype and tissue. If a chromatin mark is not informative, then the deviance of chromatin mark overlap for associated SNPs will be high for a phenotype and tissue. Therefore, we defined a deviance-based metric of phenotypic cell type specificity for a mark, which was the aggregate sum of the squared differences between $sn_{i,k}$ values and mean values for fixed phenotypes p and cell types i,

$$d = \sum_{i \in \text{cell types}} \; \sum_{p \in \text{phenotypes}} \left( \sum_{k \in p} \left( \mathrm{mean}_{i,p}(sn) - sn_{i,k} \right)^{2} \right) \tag{4}$$

where $\mathrm{mean}_{i,p}(sn)$ is the mean of the normalized cell specificity scores in the cell type i for SNPs associated with phenotype p. If a mark is informative, then sn scores are dependent on the phenotype and cell type, and this sum of squares should be relatively small.

Step 5. Evaluating the statistical significance of phenotypic cell type specificity. To evaluate the statistical significance of phenotypic cell type specificity for particular marks, we conducted up to 1 million permutations reassigning SNPs to phenotypes randomly. This ensures that the properties of associated SNPs in the analysis are maintained, only disrupting their phenotypic associations. We recalculated d after each permutation. To compute P values, we calculated the proportion of d scores from permutations (these correspond to the null hypothesis) that were greater than the observed d score.

Using overlap with chromatin marks to identify the critical cell type(s) for a specific phenotype.
After identifying SNPs associated with a selected phenotype, we compute a cell type specificity score $c_{i,p}$ for a phenotype p by summing the normalized $sn_{i,k}$ scores for a cell type i and associated SNPs k in the following equation:

$$c_{i,p} = \sum_{k \in p} sn_{i,k} \tag{5}$$

To evaluate the statistical significance of cell type specificity scores $c_{i,p}$, we defined matched sets of SNPs not associated with phenotype p and used them to calculate cell type specificity scores. Statistical significance was calculated as the proportion of SNP sets with cell type specificity scores exceeding the observed scores for actual phenotypic SNPs. To define the matched SNP sets, we required that the sampled SNPs had the same total number of chromatin mark peaks in the region in LD across all cell types as associated SNPs. This ensures that randomly selected SNPs have similar nearby regulatory activity. For the primary analysis, we drew random SNPs from 45,950 independent HapMap SNPs that were clustered to ensure minimal independence3. In a secondary analysis, we drew SNPs from phenotypically associated SNPs from the NIH GWAS catalog20.

Using overlap with marks to identify pairs of critical cell types for a specific phenotype. To test possible pairs of n cell types for association, we constructed (n – 1) × n/2 artificial ChIP-seq profiles, one for each tissue pair. Each artificial profile consisted of all of the peaks defined in both tissues, where the peak heights were reduced to half of their original heights. We then tested for association with cell type pairs in the same way as for single cell types, except that we replaced individual cell type scores with scores for cell type pairs.

56. Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9, R137 (2008).
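The scoring in Steps 2 to 4 and the per-phenotype score of equation (5) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the published software: the function names are hypothetical, the distance is clamped to 1 bp so that a variant lying exactly on a summit does not divide by zero, all-zero SNPs are filled with the mean of the normalized scores of the other SNPs (the text does not say whether raw or normalized values are averaged), and the toy numbers are invented.

```python
import numpy as np


def score_snp(ld_positions, summits, heights, max_dist=2500.0):
    """s_{i,k} for one associated SNP in one cell type (Step 2)."""
    ld_positions = np.asarray(ld_positions, dtype=float)
    summits = np.asarray(summits, dtype=float)
    heights = np.asarray(heights, dtype=float)
    if ld_positions.size == 0 or summits.size == 0:
        return 0.0
    # Distance from every LD variant (rows) to every peak summit (columns).
    dist = np.abs(ld_positions[:, None] - summits[None, :])
    _, j = np.unravel_index(np.argmin(dist), dist.shape)
    d = dist.min()
    if d > max_dist:
        return 0.0
    # Assumption: clamp d to 1 bp to avoid division by zero on a summit.
    return float(heights[j] / max(d, 1.0))


def normalize_scores(s):
    """Equation (3): Euclidean normalization per SNP (rows) across cell types."""
    s = np.asarray(s, dtype=float)
    norms = np.sqrt((s ** 2).sum(axis=1))
    nonzero = norms > 0
    sn = np.zeros_like(s)
    sn[nonzero] = s[nonzero] / norms[nonzero, None]
    if nonzero.any():
        # Assumption: SNPs near no peak get the mean normalized profile.
        sn[~nonzero] = sn[nonzero].mean(axis=0)
    return sn


def deviance(sn, phenotypes):
    """Equation (4): within-phenotype sum of squares over all cell types."""
    phenotypes = np.asarray(phenotypes)
    total = 0.0
    for p in np.unique(phenotypes):
        block = sn[phenotypes == p]
        total += ((block - block.mean(axis=0)) ** 2).sum()
    return float(total)


def cell_type_scores(sn, phenotypes, p):
    """Equation (5): c_{i,p} for every cell type i, as a vector."""
    return sn[np.asarray(phenotypes) == p].sum(axis=0)


# Invented raw scores: 3 SNPs (rows) x 2 cell types (columns).
s = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
phenotypes = ["a", "a", "b"]
sn = normalize_scores(s)
d = deviance(sn, phenotypes)
c_a = cell_type_scores(sn, phenotypes, "a")
```

Keeping the scores as a SNP-by-cell-type matrix means the permutation test of Step 5 only needs to shuffle the `phenotypes` labels and call `deviance` again, since the matrix itself is unchanged by relabeling.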