Heritability and genomics of gene expression in peripheral blood
Paper: http://www.nature.com/ng/journal/v46/n5/full/ng.2951.html
Related news: http://news.ncsu.edu/releases/wrightnatgen/
Presenter: Pak-Kan, WONG (13/05/2014)

Contents
• Background
• Method Summary
• Results
  • Heritability in peripheral blood transcriptome
  • eQTL analysis
  • Biomedical relevance
• Discussions

Background

Expression Quantitative Trait Loci (eQTL)
• QTL: Stretches of DNA containing or linked to the genes that underlie a quantitative trait
• eQTL: QTL that regulate the expression levels of mRNAs or proteins
• Types shown in the figure: cis-eQTL, trans-eQTL, master trans-eQTL
Image credit: http://www.biostat.jhsph.edu/GenomeCAFE/ExpressionistSeminarSlides/eQTL_review_s.ppt

Peripheral Venous Blood
• The blood vessels outside the human heart make up the peripheral blood system.
• Peripheral vessels
• Venous blood is deoxygenated blood that travels from the peripheral vessels, through the venous system, into the right atrium.
Ref.: http://en.wikipedia.org/wiki/Peripheral_blood http://www.circulationfoundation.org.uk/help-advice/vascular-health/the-circulatory-system/

Classical Twin Design (CTD)
• Allows the study of varying family environments (across pairs) and widely differing genetic makeup:
• "Identical" or monozygotic (MZ) twins
  • Share nearly 100% of their genes, which means that most differences between the twins (such as height, susceptibility to boredom, intelligence, depression, etc.) are due to experiences that one twin has but the other twin does not.
• "Fraternal" or dizygotic (DZ) twins
  • Share only about 50% of their genes.
• Thus powerful tests of the effects of genes can be made. Twins share many aspects of their environment (e.g., uterine environment, parenting style, education, wealth, culture, community) by virtue of being born at the same time and place.
• The presence of a given genetic trait in only one member of a pair of identical twins (called discordance) provides a powerful window into environmental effects.
Ref.: http://en.wikipedia.org/wiki/Twin_study http://ibg.colorado.edu/cdrom2012/keller/Assumptions/Keller_Coventry_CTD_Indeterminacy_2005.pdf

Classical Twin Design Mathematical Model
• Monozygotic (MZ) twins: share all of their alleles
• Dizygotic (DZ) twins: share on average 50% of their polymorphic alleles
• Assumption: Equal environments for identical and fraternal twins
• Assesses the variance of a phenotype in a large group and attempts to estimate how much of it is due to:
1. Genetic effects (heritability): factors A, D
2. Shared environment (factor C): events that happen to both twins, affecting them in the same way
3. Unshared, or unique, environment (factor E): events that occur to one twin but not the other, or events that affect each twin in a different way

ACE Model
• A = h²: additive genetics
• C = c²: common environment
• E = e²: unique environment
• A + C + E = 1
• MZ: share 100% of their genes, share all of the common environment
  • The correlation between identical twins provides an estimate of A + C: $r_{MZ} = A + C$
• DZ: share on average 50% of their genes, share all of the common environment
  • The correlation between fraternal twins is a direct estimate of 0.5A + C: $r_{DZ} = 0.5A + C$
• Expectations
  • $E = 1 - r_{MZ}$
  • $A = 2(r_{MZ} - r_{DZ})$
  • $C = r_{MZ} - A$
Ref.: http://onlinelibrary.wiley.com/doi/10.1002/0470013192.bsa002/pdf

[Figure: from the Netherlands Twin Register (NTR) introduction]

Quantifying Human Transcriptomic Heritability
• Although genes with genome-wide significant eQTLs are by definition "heritable", additional polygenic variation may be widespread and fail to reach statistical significance by standard genotype-expression association.
• Genes with substantial polygenic variation may also be subject to unique selection pressures not apparent from the analysis of local eQTLs.

Association Analysis of Genetical Genomics Data
1. Only a few studies have a sample size > 1000, but > 3000 is required
2. Findings often do not replicate, even when using the same HapMap LCLs under standardized procedures
  • Especially for trans-eQTLs (due to tissue type, ancestry, winner's curse, …)
3. Gene expression in the commonly used LCLs is sensitive to EBV copy number and growth rates
Ref.: Franke, Lude, and Ritsert C. Jansen. "eQTL analysis in humans." Cardiovascular Genomics. Humana Press, 2009. 311-328.

Proposed Method
• Classical twin design (MZ vs. DZ)
• 2752 individual twins
• Cohort study
• Peripheral venous blood samples

Goals
1. To describe and evaluate the heritability of all transcripts measured in peripheral blood
2. To identify a comprehensive list of local and distant eQTLs and evaluate their characteristics and replicability
3. To assess the biomedical relevance of the identified eQTLs

Data Collection and Pre-Processing
• Subjects and biological sampling
  • Using harmonized protocols
  • Two longitudinal cohort studies (2-year follow-up)
  • Examined replication in eQTL analyses
  • Netherlands Twin Register (NTR): 2752 (out of 3516) samples
  • Netherlands Study of Depression and Anxiety (NESDA): 1895 (out of 2783) samples
  • 227 controls
  • Steady-state transcription in peripheral blood for 43638 probe sets from 18392 genes
• Gene expression assays
  • Removed sex-mismatched samples and additional samples of poor quality
  • Removal of the 19 samples with the lowest D values resulted in the largest number of significant transcripts (q<0.10)
• Genome-wide SNP assays
  • Among 714 monozygotic twin pairs, the intrapair agreement for 686895 autosomal SNPs was 0.9985
  • 8.3 million SNPs were used.

[Table: Demography of 2752 subjects from 1444 twin pairs for twin-based heritability analyses]

Twin-Based Transcript Heritability
• Maximize the logarithm of the profile restricted maximum likelihood (REML) function

$$
pl_{REML}(\sigma_e^2, \rho_a, \rho_c) = -\frac{1}{2}\log\left(|V| \cdot \left|[1, X]' V^{-1} [1, X]\right|\right) - \frac{n-p-1}{2}\log\sigma_e^2 - \frac{n-p-1}{2}\log 2\pi - \frac{r' V^{-1} r}{2\sigma_e^2},
$$

where
• $\rho_a = \sigma_a^2 / \sigma_e^2$
• $\rho_c = \sigma_c^2 / \sigma_e^2$
• $r = y - [1, X]\left([1, X]' V^{-1} [1, X]\right)^{-1} [1, X]' V^{-1} y$
• $V = \rho_a A + \rho_c C + I$
• $p$ is the rank of $X$
• $A$: the correlation matrix of zygosity
• $C$: the correlation matrix of twins
• $y$: the expression values
• $X$: the covariates
Twin-based heritability: $a^2 = \sigma_a^2 / (\sigma_a^2 + \sigma_c^2 + \sigma_e^2)$
Shared environmental effects: $c^2 = \sigma_c^2 / (\sigma_a^2 + \sigma_c^2 + \sigma_e^2)$

Results: Twin-Based Heritability in the Peripheral Blood Transcriptome

Investigating Expression Covariates
• To identify a minimal set of covariates
• Increases power for the expression heritability calculation and improves the eQTL mapping
• The covariates can be roughly divided into
1. Covariates related to technical variation
2. Clinical covariates that are subject specific
3. Covariates related to blood counts, which if not properly accounted for might produce spurious "eQTL" relationships.

Manhattan plot of heritability P values for the transcript with the highest h² estimate (figure; threshold at q = 0.05)
• 18392 genes
• h² for all genes: 0.101 ± 0.142
• h² for expressed genes: 0.138 ± 0.153
• Max h² = 0.905

K-means clustering of 777 (4.2%) genes with q<0.05 for h² estimates
• Mean within-cluster expression correlation r ranged from 0.46 to 0.006
[Figure: tissue relevance of clusters 1 to 9]

Heritability was strongly associated with expression mean and variance.
• Values in bold correspond to P<0.0022, for Bonferroni significance at α=0.05 for 23 tests in each of the uncorrected and corrected analyses.
• And numerous KEGG and GO pathways ...

Disease Relevance?
• NHGRI GWAS catalog: identified the nearest gene (GWAS gene) for each of 3628 significantly disease-associated SNPs (P ≤ 5×10⁻⁸), for a total of 2343 GWAS genes.
[Figure annotation: elevated]

Hypothesis: "Disease-causing genes are highly heritable."
• Given that GWAS genes were designated only on the basis of proximity to NHGRI-listed SNPs, these results may reflect an even stronger true tendency of disease-causing genes to be highly heritable.
• These results are complementary to observations that disease-associated SNPs show eQTL enrichment.
• The OMIM database shows similar heritability enrichment, even though NHGRI GWAS and OMIM only partly overlap (of genes in either list, 10% are in both).
• The OMIM genes with significant heritability (q<0.05) are also quite diverse, further supporting the potential relevance of peripheral blood to other tissues and developmental processes.
• Evolutionary associations are consistent with the observation that heritability is necessary for responsiveness to selection.
• Enrichment of disease-associated heritability may reflect other underlying sources of commonality but still points to transcription as an important intermediary in disease risk.

Results: Local Genetic Contributions and Bias in Heritability Estimation

Local Genetic Contributions and Bias in h² Estimation
• In published studies, estimates have been complicated by bias and variability in h² estimation.

Definitively Assessing the True Extent of Transcriptomic Heritability
• Model true h² as following a gamma distribution, with sampling variation determined by the ACE model
• [Figure annotations: 7.9%; similar mean h²; less variation]
• For twin-based h² estimates (n = 2752; 8818 expressed genes shown), subtracting the effects of sampling variation produces an estimated true distribution (blue curve). Resimulating from the fitted (assumed) true distribution closely approximates the observed h² estimates (black curve).

Discrepancy between NTR and MuTHER
• Expressed genes in both skin and LCLs with h²>0.5
  • MuTHER reported an estimated >700
  • NTR estimated ~100
• Effect of age?
  • NTR mean age was ~20 years younger
  • But age is not a covariate
• Effect of sample size?
  • The sample size of MuTHER is much smaller.
  • Apply the gamma fit and artificially add sampling error to the true distribution (inflating the sampling variation)
  • Fit the NTR estimated h² distribution again

How many samples do we need?
[Figure: Effect of Sample Size. Annotations: small sample size; 1.0 correlation is not attainable…]

Results: eQTL Analyses of Peripheral Blood

Genotypes as Predictors of Transcription
• Two types of eQTLs
  • Local: SNP located between 1 Mb upstream of the TSS and 1 Mb downstream of the TES
  • Distant: Otherwise
• Genes with at least one local eQTL (q<0.01) had significantly higher expression levels and heritability (P < 1×10⁻²⁰⁰ for both)

Number of Unique Genes with Evidence of Local Association
• With increasing sample size, it seems that most expressed genes (>10000) show evidence of local eQTL influence in peripheral blood.
• For NTR, the number of genes with significant eQTLs (q<0.01) was 11384. After the final quality control steps, 9640 significant genes remained.
• Little difference among the transformations

Overlap of local eQTL findings with two other large blood studies, at q<0.01
• [Figure panels: peripheral blood eQTL meta-analysis of Westra et al.; NTR; NESDA; local eQTL replication; annotated genes]
• True discovery rate: 59.6% and 59.7%

Results: Characteristics of Distant eQTLs

Number of unique genes with evidence (q<0.01) for distant association
• Roughly linear on a log-log scale

Overlap of distant eQTL findings (q<0.001) with previous studies (within 1 Mb of gene)
• [Figure panels: peripheral blood eQTL meta-analysis of Westra et al.; NTR; NESDA; distant eQTL replication]

Properties of Distant eQTLs
• Examined using Ensembl Variant Effect Predictor v2.8
• [Figure annotation: lowest rate of overlap with regulatory features or replication in NESDA]

eQTL Hotspots (SNPs influencing numerous transcripts)
• 304 distant eQTL SNPs in 203 regional clusters
  • 160 clusters: 1 SNP
  • 43 clusters: 2 kb to 2 Mb of DNA (median 89 kb)
• Potential hotspots: 11 clusters associated with ≥ 6 genes
• Estimated the proportion of associated transcripts influenced by the 304 SNPs, using NESDA data to avoid selection bias.
• eQTL hotspots and significant distant eQTLs influence relatively few genes.
  • Lower than reported in the MuTHER study

Putative eQTL Hotspot
• A distant eQTL hotspot on chr19 was associated with the expression of 12 distant genes and 1 local gene (MYO1F)
• MYO1F expression is independent of the expression of the other distant genes, given the expression of the transcription factor SOX13

Biomedical Relevance
• NHGRI GWAS catalog + filtering at P < 1×10⁻⁸
• 3415 SNPs, 498 traits and 4167 SNP-trait pairs from 927 reports
• Trait or Disease: Height | High-density lipoprotein cholesterol | Crohn's disease | Type 2 diabetes | Ulcerative colitis
• SNPs found: 248 | 92 | 98 | 81 | 155
• Of the 3118 genes in OMIM, 74.4% were part of a SNP-gene local eQTL pair (q<0.05).
• …

Conclusion
1. Assessed gene expression profiles in 2,752 twins
  • Classical twin design to quantify expression heritability and eQTLs in peripheral blood
2. Grouped 777 highly heritable genes into 9 clusters
3. Suggested that previous heritability estimates, examined in a replication set, may have been upwardly biased
4. Provided a new resource toward understanding the genetic control of transcription

Comments
• New resource to support newly identified SNPs
• Computational pipeline for a broad range of twin-based experiments
• Sampling variation in small sample sizes
• Why and how are they correlated?
• Functions of each gene in the cluster; multiple layers of control?
• New things to explore?
Data
• Nature paper + Supplementary Notes
  • http://www.nature.com/ng/journal/v46/n5/fig_tab/ng.2951_ft.html
• Expression data and genotypes (Affymetrix 6.0 and U219)
  • http://www.ncbi.nlm.nih.gov/gap/?term=phs000486
• Summary results in the seeQTL browser (GWAS results p<5e-8)
  • http://gbrowse.csbio.unc.edu/cgi-bin/gb2/gbrowse/seeqtl/

Related Links
• Netherlands Twin Register (NTR, in Dutch)
  • http://www.tweelingenregister.org/en/
• FastFacts about NTR
  • http://fastfacts.nl/en/content/netherlands-twin-register
• The Multiple Tissue Human Expression Resource (MuTHER)
  • http://www.muther.ac.uk/

Correlation Matrix
• Additive genetic correlation between individuals $i$ and $j$:

$$
a_{ij} = \mathrm{cor}(\gamma_i, \gamma_j) =
\begin{cases}
1 & \text{if } i \text{ and } j \text{ are MZ} \\
0.5 & \text{if } i \text{ and } j \text{ are DZ} \\
0 & \text{if } i \text{ and } j \text{ are unrelated}
\end{cases}
$$

• In the GW heritability analysis using DZ twins, the 0.5 entry was re-estimated by PLINK (mean 0.501, standard deviation 0.038)
• Common environment correlation between individuals $i$ and $j$:

$$
c_{ij} = \mathrm{cor}(\delta_i, \delta_j) =
\begin{cases}
1 & \text{if } i \text{ and } j \text{ are twins} \\
0 & \text{if } i \text{ and } j \text{ are unrelated}
\end{cases}
$$

• $y \sim \mathcal{N}(\mu 1 + X\beta, \Sigma)$
• where $\Sigma = \sigma_a^2 A + \sigma_c^2 C + \sigma_e^2 I$
• Re-express $\Sigma = \sigma_e^2 V$, where $V = \frac{\sigma_a^2}{\sigma_e^2} A + \frac{\sigma_c^2}{\sigma_e^2} C + I = \rho_a A + \rho_c C + I$

On the Profile Function for Twin-Based Heritability
• Accounts for the loss in degrees of freedom associated with the fixed-effect estimates.
• Estimates are less biased than the corresponding maximum likelihood estimates and control type I error better.
• The profile function has only three parameters regardless of the number of fixed effects, and is computationally more efficient than maximizing over the full REML function
• Developed an algorithm in R for twin-based heritability analysis
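
Worked sketch of the Correlation Matrix slide
The matrices A and C, the re-expressed covariance V = ρa·A + ρc·C + I, and the ACE expectations from the ACE Model slide can be illustrated in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the R algorithm mentioned above: the function names (`correlation_matrices`, `scaled_covariance`, `variance_fractions`, `falconer_ace`) and the `(pair_id, zygosity)` input layout are hypothetical, and the ρ values in the demo are made up for illustration.

```python
from typing import Dict, List, Tuple

Person = Tuple[int, str]  # hypothetical layout: (twin pair id, "MZ" or "DZ")
Matrix = List[List[float]]


def correlation_matrices(people: List[Person]) -> Tuple[Matrix, Matrix]:
    """Build A (zygosity correlation) and C (twin correlation) as on the slide."""
    n = len(people)
    A = [[0.0] * n for _ in range(n)]
    C = [[0.0] * n for _ in range(n)]
    for i, (pair_i, zyg_i) in enumerate(people):
        for j, (pair_j, _) in enumerate(people):
            if i == j:
                A[i][j] = 1.0  # cor(gamma_i, gamma_i) = 1
                C[i][j] = 1.0  # cor(delta_i, delta_i) = 1
            elif pair_i == pair_j:
                A[i][j] = 1.0 if zyg_i == "MZ" else 0.5  # MZ: 1, DZ: 0.5
                C[i][j] = 1.0  # co-twins share the common environment
            # unrelated individuals keep 0 in both matrices
    return A, C


def scaled_covariance(A: Matrix, C: Matrix, rho_a: float, rho_c: float) -> Matrix:
    """V = rho_a * A + rho_c * C + I, so that Sigma = sigma_e^2 * V."""
    n = len(A)
    return [
        [rho_a * A[i][j] + rho_c * C[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def variance_fractions(rho_a: float, rho_c: float) -> Tuple[float, float]:
    """a^2 and c^2 written in terms of rho_a and rho_c (sigma_e^2 cancels)."""
    total = rho_a + rho_c + 1.0
    return rho_a / total, rho_c / total


def falconer_ace(r_mz: float, r_dz: float) -> Dict[str, float]:
    """ACE expectations from the ACE Model slide."""
    a = 2.0 * (r_mz - r_dz)
    return {"A": a, "C": r_mz - a, "E": 1.0 - r_mz}


# Demo: one MZ pair and one DZ pair, with illustrative rho values.
people = [(1, "MZ"), (1, "MZ"), (2, "DZ"), (2, "DZ")]
A, C = correlation_matrices(people)
V = scaled_covariance(A, C, rho_a=1.0, rho_c=0.5)
r_mz = V[0][1] / V[0][0]  # correlation implied by V for the MZ pair
r_dz = V[2][3] / V[2][2]  # correlation implied by V for the DZ pair
print(variance_fractions(1.0, 0.5), falconer_ace(r_mz, r_dz))
```

In the demo, the MZ and DZ correlations implied by V are fed back into the ACE expectations, which recover the same a² and c² as `variance_fractions` up to floating-point rounding. This is only a consistency check between the two slides; actual estimation would maximize the profile REML function over the three parameters.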