Statistical challenges in (RNA-Seq)
Transcription
Statistical challenges in (RNA-Seq)
Statistics and biology Statistical challenges in (RNA-Seq) data analysis Julie Aubert UMR 518 AgroParisTech-INRA Mathématiques et Informatique Appliquées Statistics = study of the collection, analysis, interpretation, presentation and organization of data (source : Wikipedia) Ecole de bioinformatique, Station biologique de Roscoff, 2014 Oct. 7 J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 1 / 81 J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 2 / 81 Roscoff, 2014 Oct. 7 4 / 81 Outline Biological question Experimental design Sequencing experiment Low-level analysis Higher-level analysis Exploratory Data Analysis, image analysis, base calling, read mapping, metadata integration 1 Experimental design 2 Normalisation and Differential analysis 3 Conclusions Exploratory Data Analysis, normalization and expression quantification, differential analysis, metadata integration Biological validation and interprétation *Adapted from S. Dudoit, Berkeley J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 3 / 81 J. Aubert Stat. challenges (RNA-Seq) Experimental design Experimental design Another definition A statistical model : what for ? Aim of an experiment : answer to a biological question. Results of an experiment : (numerous, numerical) measurements. Statistics consists of trying to understand data and to obtain more understandable data. Savage (1977) Model : mathematical formula that relates the experimental conditions and the observed measurements (response). Key of a good data analysis : having good data to analyze. Requisite : a clearly defined research objective. (Statistical) modelling : translating a biological question into a mathematical model (”= PIPELINE !) The statistician who supposes that his main contribution to the planning of an experiment will involve statistical theory, finds repeatedly that he makes his most valuable contribution simply by persuading the investigator to explain why he wishes to do the experiment. Gertrude M Cox J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 5 / 81 Statistical model : mathematical formula involving the experimental conditions, the biological response, the parameters that describe the influence of the conditions on the (mean, theoretical) response, and a description of the (technical, biological) variability. J. Aubert Experimental design 2 3 4 Definition A good design is a list of experiments to conduct in order to answer to the asked question which maximize collected information and minimize the number of experiments (or the experiments cost) with respect to constraints. Considering the population under study, identifying appropriate sampling or experimental units, defining relevant variables, and determining how those variables will be measured. Basic principles - Fisher (1935) Describe the data analysis strategy (technical and biological) replications Replication (independent obs.) ”= Repeated measurements Anticipate eventual complications during the collection step and propose a way to handle them source : Northern Prairie Wildlife Research Center, Statistics for Wildlifers : How much and what kind ? Stat. challenges (RNA-Seq) 6 / 81 Experimental Design Formulate a broadly stated research problem in terms of explicit, addressable questions. J. Aubert Roscoff, 2014 Oct. 7 Experimental design Steps of experiment designing 1 Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 7 / 81 Randomization : randomize as much as is practical, to protect against unanticipated biases Blocking : dividing the observations into homogeneous groups. Isolating variation attributable to a nuisance variable (e.g. lane) J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 8 / 81 Experimental design Experimental design How to Design a good RNA-Seq experiment in an interdisciplinary context ? Rule 2 : Well define the biological question Some basic rules Rule 1 Share a minimal common language Rule 2 Well define the biological question Rule 3 Anticipate difficulties with a well designed experiment Make good choices : Replicates vs Sequencing depth From Alon, 2009 Choose scientific problems on feasibility and interest Order your objectives (primary and secondary) Ask yourself if RNA-seq is better than microarray regarding the biological question J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 9 / 81 J. Aubert Experimental design Roscoff, 2014 Oct. 7 10 / 81 Experimental design Objectives - RNA-Seq Rule 3 : Anticipate difficulties with a well designed experiment Identify differentially expressed (DE) genes ? Detect and estimate isoforms ? Construct a de Novo transcriptome ? J. Aubert Stat. challenges (RNA-Seq) Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 11 / 81 1 Prepare a checklist with all the needed elements to be collected, 2 Collect data and determine all factors of variation, 3 Choose bioinformatics and statistical models, 4 Draw conclusions on results. J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 12 / 81 Experimental design Experimental design Be aware of different types of bias RNA-seq experiment analysis : from A to Z Keep in mind the influence of effects on results : lane Æ run Æ RNA library preparation Æ biological (Marioni, 2008), (Bullard, 2010) J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 Adapted from Mutz, 2013 13 / 81 J. Aubert Experimental design Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 14 / 81 Experimental design Make good choices Biological and technical replicates How many reads ? 100M to detect 90% of the transcripts of 81% of human genes (Toung, 2011) 20M reads of 75bp can detect transcripts of medium and low abundance in chicken (Wand, 2011) 10M to cover by at least 10 reads 90% of all (human and zebrafish) genes (Hart, 2013) . . . Biological replicate : sampling of individuals from a population in order to make inferences about that population Technical replicate adresses the measurement error of the assay. J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 15 / 81 J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 16 / 81 Experimental design Experimental design Why increasing the number of biological replicates ? Replication : quality rather than quantity Technical variability => inconsistent detection of exons at low levels of coverage (<5reads per nucleotide) (McIntyre et al. 2011) Doing technical replication may be important in studies where low abundant mRNAs are the focus. To generalize to the population level To estimate to a higher degree of accuracy variation in individual transcript (Hart, 2013) To improve detection of DE transcripts and control of false positive rate : TRUE with at least 3 (Sonenson 2013, Robles 2012) J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 17 / 81 J. Aubert Stat. challenges (RNA-Seq) Experimental design Roscoff, 2014 Oct. 7 18 / 81 Experimental design More biological replicates or increasing sequencing depth ? Experimental Design : Scotty It depends ! (Haas, 2012), (Liu, 2014) It’s a balance : cost, precision ≈∆ nb bio. replicates, sequencing depth. DE transcript detection : (+) biological replicates An example output from the Scotty application. Construction and annotation of transcriptome : (+) depth and (+) sampling conditions Transcriptomic variants search : (+) biological replicates and (+) depth A solution : multiplexing. Tag or bar coded with specific sequences added during library construction and that allow multiple samples to be included in the same sequencing reaction (lane) Busby M A et al. Bioinformatics 2013;29:656-657 © The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected] Decision tools available : Scotty (Busby, 2013), RNAseqPower (Hart, 2013) J. Aubert Stat. challenges (RNA-Seq) This figure shows the user which of the tested experimental configurations do (white) and do not (shaded) conform to the user-defined constraints. Scotty then indicates the optimal configuration based on cost (filled triangle) and power (filled circle). Roscoff, 2014 Oct. 7 19 / 81 Busby M A etJ.al.Aubert Bioinformatics 2013 Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 20 / 81 Experimental design Normalisation and Differential analysis Experimental Design : key points RNA-sequencing The scientific question of interest drives the experimental choices Collect informations before planning All skills are needed to discussions right from project construction Sequencing and other technical biases potentially increase the required sample size and sequencing depth Optimum compromise between replication number and sequencing depth depends on the question Biological replicates are important in most RNA-seq experiments Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 21 / 81 Reverse transcription mRNAs from a sample __ _ __ _ __ … __ __ _ _ __ … __ __ _ Fragmented mRNAs PCR amplification & sequencing cDNAs mapping counting Gene 1 Gene 2 … Gene m 25 320 … 23 And do not forget : budget also includes cost of biological data acquisition, sequencing data backup, bioinformatics and statistical analysis. __ _ __ ___ … _____ # of reads mapped to Wherever possible apply the three Fisher’s principles of randomization, replication and local control (blocking) J. Aubert Random fragmentation _____ A vector of counts from Gene 19 ATTGCC... from Gene 23 … GCTAAC... … from Gene 56 AGCCTC... mapped reads A list of reads Adapted from Li et al. (2011) J. Aubert Normalisation and Differential analysis Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 22 / 81 Normalisation and Differential analysis Isoform detection and quantification Differential Analysis Identification of differentially expressed genes (DE) A gene is declared differentially expressed (DE) between two conditions if the observed difference is statisticially significant, ie more than only du to natural random variation. Statistical tools are necessary to take this decision. From E. Bernard J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 The main steps are : experimental design, normalisation and differential analysis, multiple testing. 23 / 81 J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 24 / 81 Normalisation and Differential analysis Normalisation and Differential analysis Fold Change approach and ideal cut-off values Fold Change approach and ideal cut-off values FCi = 1 2 3 4 Gene Gene1 Gene2 Gene3 Gene4 CondA1 5.00 800.00 700.00 500.00 CondA2 7.00 1000.00 1100.00 1300.00 xi . yi . CondB1 2.00 350.00 350.00 550.00 CondB2 2.00 250.00 250.00 50.00 FC 3.00 3.00 3.00 3.00 pvalue 0.06 0.03 0.10 0.33 FC does not take the variance of the samples into account. Problematic since variability in gene expression is partially gene-specific. J. Aubert Stat. challenges (RNA-Seq) Normalisation and Differential analysis Roscoff, 2014 Oct. 7 25 / 81 J. Aubert Normalization Stat. challenges (RNA-Seq) Normalisation and Differential analysis Normalization Sources of variability Definition Normalization is a process designed to identify and correct technical biases removing the least possible biological signal. This step is technology and platform-dependant. Within-sample Gene length Roscoff, 2014 Oct. 7 Normalization Between-sample Depth (total number of sequenced and mapped reads) Sampling bias in library construction ? Presence of majority fragments Between-lane normalization Normalisation enabling comparisons of fragments (genes) from different samples. Stat. challenges (RNA-Seq) 26 / 81 Sequence composition (GC content) Within-lane normalization Normalisation enabling comparisons of fragments (genes) from a same sample. No need in a differential analysis context. J. Aubert Roscoff, 2014 Oct. 7 27 / 81 Sequence composition du to PCR-amplification step in library preparation‘(Pickrell et al. 2010, Risso et al. 2011) J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 28 / 81 Normalisation and Differential analysis Normalization Normalisation and Differential analysis Comparison of normalization methods Normalization Normalisation methods Global methods : normalised counts are raw counts divided by a scaling factor calculated for each sample At lot of different normalization methods... Some are part of models for DE, others are ’stand-alone’ Distribution adjustment They do not rely on similar hypotheses Assumption (TC, UQ, Median) : read counts are prop. to expression level and sequencing depth Total number of reads : TC (Marioni et al. 2008), Quantile : FQ (Robinson and Smyth 2008), Upper Quartile : UQ (Bullard et al. 2010), Median But all of them claim to remove technical bias associated with RNA-seq data Which one is the best ? Method taking length into account Reads Per KiloBase Per Million Mapped : RPKM (Mortazavi et al. 2008) The Effective Library Size concept Trimmed Means of M-values TMM (Robinson et Oschlack 2010, edgeR) DESeq (Anders et Huber 2010, DESeq) J. Aubert Stat. challenges (RNA-Seq) Normalisation and Differential analysis Roscoff, 2014 Oct. 7 29 / 81 J. Aubert Normalization Stat. challenges (RNA-Seq) Normalisation and Differential analysis Roscoff, 2014 Oct. 7 30 / 81 Normalization 4 real datasets and one simulated dataset Comparison procedures RNA-seq or miRNA-seq, DE, at least 2 conditions, at least 2 bio. rep., no tech. rep. Distribution and properties of normalized datasets Boxplots, variability between biological replicates Comparison of DE genes Differential analysis : DESeq v1.6.1 (Anders and Huber 2010), default param. Number of common DE genes, similarity between list of genes (dendrogram - binary distance and Ward linkage) Power and control of the Type-I error rate simulated data non equivalent library sizes presence of majority genes J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 31 / 81 J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 32 / 81 Normalisation and Differential analysis Normalization Normalisation and Differential analysis So the Winner is ... ? Normalization Interpretation In most cases The methods yield similar results RawCount Often fewer differential expressed genes (A. fumigatus : no DE However ... Differences appear based on data characteristics TC, RPKM Sensitive to the presence of majority genes Less effective stabilization of distributions Ineffective (similar to RawCount) gene) FQ Can increase between group variance Is based on an very (too) strong assumption (similar distributions) Median High variability of housekeeping genes TC, RPKM, FQ, Med, UQ Adjustment of distributions, implies a similarity between RNA repertoires expressed J. Aubert Stat. challenges (RNA-Seq) Normalisation and Differential analysis Roscoff, 2014 Oct. 7 33 / 81 Normalization Stat. challenges (RNA-Seq) Normalisation and Differential analysis Conclusions Roscoff, 2014 Oct. 7 34 / 81 Normalization Normalization : key points Hypothesis : the majority of genes is invariant between two samples. Differences between methods when presence of majority sequences, very different library depths. TMM and DESeq : performant and robust methods in a DE analysis context on the gene scale. Normalisation is necessary and not trivial. J. Aubert J. Aubert Stat. challenges (RNA-Seq) RNA-seq data are affected by technical biaises (total number of mapped reads per lane, gene length, composition bias) ∆ A normalization is needed and has a great impact on the DE genes (Bullard et al 2010), (Dillies et al 2012) Detection of differential expression in RNA-seq data is inherently biased (more power to detect DE of longer genes) Do not normalise by gene length in a context of differential analysis. Roscoff, 2014 Oct. 7 35 / 81 J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 36 / 81 Normalisation and Differential analysis Differential analysis Normalisation and Differential analysis Differential analysis Differential analysis Differential analysis gene-by-gene- with replicates Aim : To detect differentially expressed genes between two conditions For each gene i Is there a significant difference in expression between condition A and B ? Discrete quantitative data Few replicates Statistical model (definition and parameter estimation) - Generalized linear framework Overdispersion problem Test : Equality of relative abundance of gene i in condition A and B vs non-equality Challenge : method which takes into account overdispersion and few number of replicates Proposed methods : edgeR, DESeq for the most used and known Anders et al. 2013, Nature Protocols The Poisson Model Let be Yijk the count for replicate j in condition k from gene i An abundant littérature Comparison of methods : Pachter et al. (2011), Kvam et Liu (2012), Soneson et Delorenzi (2013), Rapaport el al. (2013) J. Aubert Stat. challenges (RNA-Seq) Normalisation and Differential analysis Roscoff, 2014 Oct. 7 37 / 81 Yijk follows a Poisson distribution (µijk ). Property : Var (Yijk ) = Mean(Yijk ) = µijk J. Aubert Differential analysis Stat. challenges (RNA-Seq) Normalisation and Differential analysis Mean-Variance Relationship Roscoff, 2014 Oct. 7 38 / 81 Differential analysis Overdispersion in RNA-seq data Counts from biological replicates tend to have variance exceeding the mean (= overdispersion relative to the Poisson distribution). Poisson describes only technical variation. What causes this overdispersion ? Correlated gene counts Clustering of subjects Within-group heterogeneity Within-group variation in transcription levels Different types of noise present... In case of overdispersion, ø of the type I error rate (prob. to declare incorrectly a gene DE). From D. Robinson and D. McCarthy J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 39 / 81 J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 40 / 81 Normalisation and Differential analysis Differential analysis Normalisation and Differential analysis Differential analysis Alternative : Negative Binomial Models Tag ID A1 A2 A3 A4 B1 B2 B3 A supplementary dispersion parameter „ to model the variance ENSG00000124208 Poisson vs Negative Binomial Models 478 619 628 744 483 716 240 ENSG00000182463 27 20 27 26 48 55 24 ENSG00000125835 132 200 200 228 560 408 103 ENSG00000125834 42 60 72 86 131 99 30 ENSG00000197818 21 29 35 31 52 44 20 0 0 0 0 ENSG00000215443 4 4 4 0 9 7 4 ENSG00000222008 30 0 23 29 19 0 0 0 46 63 58 54 53 17 ENSG00000125831 ENSG00000101444 ENSG00000101333 0 2 2256 2793 3456 … 71 3362 Model assumptions Mj = library size ij = relative abundance of feature i 2702 2976 1320 …… Poisson describes technical variation: Yij ~ Pois( Mj * ij ) mean(Yij)= variance(Yij) = Mj * ij Negative binomial models biological variability using the dispersion parameter ϕ: Yij ~ NB( ij=Mj * ij , ϕi ) Same mean, variance is quadratic in the mean: variance( Yij ) = ij (1+ ij ϕi ) Critical parameter to estimate: dispersion J. Aubert Stat. challenges (RNA-Seq) Normalisation and Differential analysis Roscoff, 2014 Oct. 7 41 / 81 J. Aubert Differential analysis Stat. challenges (RNA-Seq) Normalisation and Differential analysis „ estimation : crucial ! Parametric Model Some proposed solutions Normalized Count Data Differential analysis Reference Anders et Huber (2010) Robinson et Smyth (2009) Di et al. (2011) P+ NB Quasi-L TSPM BaySeq, DESeq, ShrinkSeq EBSeq, edgeR, NBPseq, ShrinkSeq DESeq, edgeR, NBSeq, TSPM NBPseq : „ and – estimated by LM based on all the genes. Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 Bayesian Bayseq, EBSeq, ShrinkSeq Differential Analysis Classical Test DESeq : data-driven relationship of variance and mean estimated using parametric or local regression for robust fit across genes zéro inf. NB NOISeq, NOISeqBIO, SAMseq Estimation Variance µ(1 + „µ µ) µ(1 + „µ) µ(1 + „µ–≠1 ) Non Parametric Model semiparametric BaySeq, DESeq, EBSeq, edgeR, NBPSeq, ShrinkSeq, TSPM edgeR : borrow information across genes for stable estimates of „ 3 ways to estimate „ (common, trend, moderated) J. Aubert 42 / 81 A lot of statistical methods...still developped Many genes, very few biological samples - difficult to estimate „ on a gene-by-gene basis Method DESeq edgeR NBPseq Roscoff, 2014 Oct. 7 Wilcoxon's statistic Gaussian Kernel Empirical distribution « noise » SAMseq NOISeqBIO NOISeq Permutation SAMseq Bayesian Thresholding NOISeqBIO NOISeq Differential Analysis between two conditions 43 / 81 J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 44 / 81 Normalisation and Differential analysis Differential analysis Normalisation and Differential analysis Comparaison of differential analysis methods Comparaison of differential analysis methods Soneson et Delorenzi (2013) Soneson et Delorenzi (2013) Small number of replicates (2-3) or low expression æ be careful ! ! Evaluation of 11 methods on both simulated and real data. Obs 1 : The number of replicates matters ! (Differently for different methods) Obs 2 : Results are more accurate and less variable between methods if DE genes are regulated in both directions. Obs 3 : Outlier counts affect different methods in different ways Stat. challenges (RNA-Seq) Normalisation and Differential analysis Roscoff, 2014 Oct. 7 Large number of replicates (10 or so) or very high expression æ method choice does not matter much. Removing genes with outlier counts or using non-parametric methods reduce the sensitivity to outliers Allow tagwise dispersion values Normalization methods have problems when all DE genes are regulated in one direction. Iterative approaches like TCC improve performance The dispersion estimation method matters ! J. Aubert Differential analysis 45 / 81 J. Aubert Differential analysis Stat. challenges (RNA-Seq) Normalisation and Differential analysis DESeq2 Love and Huber (2013) Evalution on methods using SEQC benchmark dataset and ENCODE data. Differences with DESeq. Dispersion shrinkage Fold Change shrinkage (for PCA and Gene Set Enrichissment Analysis) Significant differences between methods. Array-based methods adapted perform comparably to specific methods. Detection of outliers Increasing the number of replicates samples significantly improves detection power over increased sequencing depth. J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 46 / 81 Differential analysis Comparaison of differential analysis methods Rapaport et al. (2013) Roscoff, 2014 Oct. 7 Improves power Only one command line 47 / 81 J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 48 / 81 Soneson and Delorenzi BMC Bioinformatics 2013, 14:91 Normalisation and Differential analysis Differential analysis http://www.biomedcentral.com/1471-2105/14/91 LETTERS Type I error rate at p_nom < 0.05, B00 A Type I error rate at p_nom < 0.05, P00 B 0.20 0.20 Challenge: edgeR can be sensitive to outliers 0.15 0.15 10 Sample 20 30 2000 40 50 60 Page 8 of 18 0.00 0 10 Sample 20 30 40 Sample 50 60 Figure 2. Counts from some miRNAs found to be very significant by edgeR do not seem to follow negative binomial 0 0 Type Ithe error rate from at p_nom < 0.05, inBthe Type I error at p_nom < 0.05, Type I 10th error rate rate at p_nom < 0.05, S0 P0 distributions. Each panel shows counts one miRNA Witten data.28 These miRNAs are the 7th, 0 and 11th most significant features detected by edgeR. The heights of vertical bars show the scaled counts from the samples. 29 bars, coloured blue, are 0.20 The first 29 bars, coloured red, are samples from the one class, and the other 0.20 0.20 from the other. The black broken line is also drawn to separate the two classes. In each panel, we see that one count has much larger values than all the other counts. A CB no outliers 0.00 DESeq.10 edgeR.10 vst.10 voom.10 TSPM.10 NBPSeq.10 0 DESeq.5 edgeR.5 vst.5 voom.5 TSPM.5 NBPSeq.5 60 Type I error rate at p_nom < 0.05, R00 D 0.20 presence of outliers Type I error rate Type I error rate DESeq.10 voom.10 vst.10 TSPM.10 edgeR.10 NBPSeq.10 DESeq.5 voom.5 vst.5 TSPM.5 edgeR.5 NBPSeq.5 DESeq.2 voom.2 vst.2 TSPM.2 edgeR.2 NBPSeq.2 DESeq.10 vst.10 voom.10 TSPM.10 edgeR.10 NBPSeq.10 DESeq.5 vst.5 voom.5 TSPM.5 edgeR.5 NBPSeq.5 Current policies (robustness) DESeq.2 vst.2 voom.2 TSPM.2 edgeR.2 NBPSeq.2 control was worse when all genes were regulated in the same direction. The high false discovery rate seen for ShrinkSeq can possibly be reduced by setting a non-zero value for the fold change threshold defining the null model. Also the variability of the baySeq performance was considerably reduced when there were both up- and downregulated genes among the DE ones. For the largest sample size (10 samples per group), ShrinkSeq, NBPSeq, EBSeq, edgeR and TSPM often found too many false positives. The remaining methods were essentially able to control the false discovery rate at the desired level under these conditions. A possible explanation for the high false discovery rates of NBPSeq is that the S00 (panel C) and R00 (panel D). Letting some counts follow a Poisson distribution (panel B) reduced the type I error rates for TSPM slightly but had overall a small effect. Including outliers with abnormally high counts (panels C and D) had a detrimental effect on the ability to control the type I error for edgeR and NBPSeq, while DESeq became slightly more conservative. edgeR ? one option : moderate dispersion less towards trend level. We put the FDR threshold at 0.05, and calculated control was worse when all genes were regulated in the DESeq ? take thesame fitdirection. (trended) or the feature-specific The high false discovery rate seen for the true false discoverythe rate asmaximum the fraction of theof genes called significant at this level that were indeed false dis- ShrinkSeq can possibly be reduced by setting a non-zero dispersion coveries. Since NOISeq does not return a statistic that is value for the fold change threshold defining the null recommended to use as an adjusted p-value or FDR esti- model. Also the variability of the baySeq performance Very robust, but many genes pay a penalty, less powerful. mate, it was excluded from this evaluation. For baySeq, was considerably reduced when there were both up- and We subsequently filtered reads that had low mapping quality, mapped sex chromosomes or mitochondrial DNA and were not correctly paired, which yielded 9.4 6 3.3 million reads. On average, 86% of the filtered reads mapped to known exons in Ensembl version 54(ref. 17) and 15% of read pairs spanned more than one exon. Evaluation of sequence and mapping quality measures was preformed to ensure that the data quality is acceptable for analysis (Supplementary Fig. 2, also see methods). We quantified reads for known exons, transcripts and whole genes. Read counts for each individual were scaled to a theoretical yield of 10 million reads and corrected for peak insert size across corresponding libraries. Each quantification was filtered to exclude those with missing data for . 10% of the individuals. For exons, this resulted in data for 90,064 exons for 10,777 genes. Of these, 95% had on average more than 10 reads, 38% more than 50 reads and 20% had a mean quantification of $ 100 reads (Supplementary Fig. 3). For transcript quantification, new methods needed to be developed to map reads into specific isoforms18,19. We developed a methodology, called the FluxCapacitor, to quantify abundances of annotated alternatively spliced transcripts (see Methods). Using this method, we obtain relative quantities for 15,967 transcripts from 11,674 genes. For each individual, we compared whole-gene read counts to array intensities generated with Illumina HG-6 version 2 microarrays. Correlations coefficients between RNA-Seq and array quantities and among RNA-Seq samples were high and consistent with previous studies20 (Supplementary Figs 4 and 5). Finally, we explored whether the correlation structure of abundance among exons could facilitate the development of a framework that will allow the imputation of abundance values for exons that are not screened, given a set of reference RNA-Seq samples. This is the same principle as using the correlation structure (Linkage Disequilibrium) of genetic variants to impute variants from a reference to any population sample of interest21. For each of the 10,777 genes, we assessed the pairwise correlation of all exons and on average, any two pairs of exons within a gene were moderately correlated (mean Pearson’s correlation R2 5 0.378 6 0.261) (Supplementary Fig. 6). This correlation increased with increase in total number of reads present in each exon. It is worth noting that the average correlation coefficient between SNPs within the same recombination hotspot interval in HapMap3 is R2 5 0.326 6 0.174, indicating that the correlation structure within genes is stronger and probably more accessible by imputation methodologies than SNPs; however, this needs to be assessed in a tissue-specific context. Association of gene expression measured by RNA-Seq with genetic variation was evaluated in cis with the use of 1.2 million HapMap3 4004 2538 4962 7921 6115 5156 2527 1115 3175 7951 7631 3437 NA19222 NA12287 NA19172 NA11881 NA18871 NA12872 NA18916 NA18856 NA19193 NA19140 0.0 1.9 178.1 0.0 0.5 0.0 0.0 0.0 0.0 0.0 2.0 0.6 235.5 6.8 60.2 1.0 0.0 0.0 2.5 1.3 3.5 0.6 429.5 1.0 35.9 0.0 0.4 0.0 0.0 4.7 1.0 5.1 78.9 2.9 0.0 0.0 0.8 0.0 0.0 0.4 0.0 1.3 0.0 1.9 0.0 0.5 46.1 0.0 100.1 1.3 13.8 1.3 30.7 0.0 7.1 0.0 0.0 1.0 0.0 1.3 23.7 111.0 228.8 77.0 129.5 10.0 45.3 27.4 26.3 19.1 2.0 15.2 1074.8 19.5 13.2 10.0 29.6 0.0 1.3 5.5 3.0 6.3 181.0 7.8 7.6 0.0 5.5 3.0 3.1 2.5 1.0 12.1 35.9 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 1.9 0.4 1.0 0.0 0.5 29.6 0.0 24.4 5.5 24.6 31.1 167.0 4.9 21.2 4.5 8.3 10.1 8.1 0.4 logFC logCPM LR PValue FDR -10.413038 4.186203 30.07924 4.147469e-08 0.0002239513 -5.942865 4.963086 29.60406 5.299369e-08 0.0002239513 -6.387829 5.576979 26.06085 3.308237e-07 0.0009320406 -5.808379 3.183079 22.51927 2.080466e-06 0.0043960241 5.746084 3.921353 21.37010 3.786299e-06 0.0064003595 -4.573655 2.512035 20.13483 7.217026e-06 0.0101663841 -2.154480 6.128702 18.44343 1.750229e-05 0.0211327628 -4.575934 6.873996 18.14127 2.051076e-05 0.0211672325 Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, 1211 Switzerland. Wellcome Trust Sanger Institute, Cambridge CB10 1HH, UK. -3.843458 4.473754 17.71318 2.568407e-05 0.0211672325 Center for Genomic Regulation, University Pompeu Fabra, Barcelona, Catalonia, 08003 Spain. 773 -4.786326 2.416892 17.66324 2.636730e-05 0.0211672325 ©2010 Macmillan Publishers Limited. All rights reserved 4.311717 2.683367 17.57990 2.754846e-05 0.0211672325 -3.014484 4.821100 17.05690 3.627624e-05 0.0255505626 J. Aubert 1 2 3 Stat. challenges (RNA-Seq) Normalisation and Differential analysis Roscoff, 2014 Oct. 7 50 / 81 Differential analysis Robinson’s simulations edgeR and edgeR robust are a bit liberal (5% FDR might mean 6% or 7% to control the false discovery rate at the desired level pare Figures 4A and 4B). The main difference between under these conditions. A possible explanation for the Goal (Robinson and : achieve middle ground false discovery rates of between NBPSeq is that protection the the two settings was seen co.) for ShrinkSeq, whose FDR ahigh against outliers while maintaining high power with observation weights Roscoff, 2014 Oct. 7 CPMs (counts per million) 4004 2538 4962 7921 6115 5156 2527 1115 3175 7951 7631 3437 DESeq2 is very powerful in the absence of outliers, but policy to filter outliers results in loss of power downregulated genes among the DE ones. For the largest Stat. challenges (RNA-Seq) in one lane of an Illumina GAII analyzer and yielded 16.9 6 5.9 notypes using custom and commercially available microarrays1–5. Second generation sequencing technologies are now providing unprecedented access to the fine structure of the transcriptome6–14. We have sequenced the mRNA fraction of the transcriptome in 60 extended HapMap individuals of European descent and have combined these data with genetic variants from the HapMap3 project15. We have quantified exon abundance based on read depth and have also developed methods to quantify whole transcript abundance. We have found that approximately 10 million reads of sequencing can provide access to the same dynamic range as arrays with better quantification of alternative and highly abundant transcripts. Correlation with SNPs (small nucleotide polymorphisms) leads to a larger discovery of eQTLs (expression quantitative trait loci) than with arrays. We also detect a substantial number of variants that influence the structure of mature transcripts indicating variants responsible for alternative splicing. Finally, measures of allelespecific expression allowed the identification of rare eQTLs and allelic differences in transcript structure. This analysis shows that high throughput sequencing technologies reveal new properties of genetic effects on the transcriptome and allow the exploration of genetic effects in cellular processes. Genetic variation in gene expression is an important determinant of human phenotypic variation; a number of studies have elucidated genome-wide patterns of heritability and population differentiation and are beginning to unravel the role of gene expression in the aetiology of disease1–5. Interrogation of the transcriptome in these studies has been greatly facilitated by the use of microarrays, which quantify transcript abundance by hybridization. However, microarrays possess several limitations and recent advances in transcriptome sequencing in second generation sequencing platforms have now provided singlenucleotide resolution of gene expression providing access to rare transcripts, more accurate quantification of abundant transcripts (above the signal saturation point of arrays), novel gene structure, alternative splicing and allele-specific expression6–14. Although RNA-Seq studies have addressed issues of transcript complexity, they have not yet addressed how genetic studies can benefit from this increased resolution to reveal novel effects of sequence variants on the transcriptome. To understand the quantitative differences in gene expression within a human population as determined from second generation sequencing, we sequenced the mRNA fraction of the transcriptome of lymphoblastoid cell lines (LCLs) from 60 CEU (HapMap individuals of European descent) individuals (from CEPH—Centre d’Etude du Polymorphisme Humain) using 37-base pairs (bp) paired-end Illumina sequencing. Each individual’s transcriptome was sequenced DESeq’s policy on outliers has a global effect, resulting in (sometimes drastic) drop in power EBSeq and ShrinkSeq, we imposed the desired threshold DESeq2 ? calculate Cook’s distance and filter outliers sample size (10 samplesgenes per group),with ShrinkSeq, NBPSeq, on the Bayesian FDR [28]. As above, when only 10% of the genes were DE, the EBSeq, edgeR and TSPM often found too many false Can inadvertently filter interesting genes The remaining methods were essentially able direction of their regulation had little effect on the false positives. J. Aubert Nature, 2010 Gene expression is an important phenotype that informs about Robust edgeR suffers a tiny bit in power with no outliers, but has good capacity to dampen their effect if present Allows dispersions to be driven more by the data discovery rate (simulation studies B1250 and B625 0 625 , com- DESeq.10 voom.10 vst.10 TSPM.10 edgeR.10 NBPSeq.10 DESeq.5 voom.5 vst.5 TSPM.5 edgeR.5 NBPSeq.5 DESeq.2 voom.2 vst.2 TSPM.2 edgeR.2 NBPSeq.2 Type I error rate DESeq.10 vst.10 DESeq.10 voom.10 edgeR.10 TSPM.10 vst.10 edgeR.10 voom.10 NBPSeq.10 TSPM.10 NBPSeq.10 DESeq.5 vst.5 DESeq.5 voom.5 edgeR.5 TSPM.5 vst.5 edgeR.5 voom.5 NBPSeq.5 TSPM.5 NBPSeq.5 DESeq.2 vst.2 DESeq.2 voom.2 edgeR.2 TSPM.2 vst.2 edgeR.2 voom.2 NBPSeq.2 TSPM.2 NBPSeq.2 Type I error rate DESeq.10 edgeR.10 vst.10 voom.10 TSPM.10 NBPSeq.10 DESeq.5 edgeR.5 vst.5 voom.5 TSPM.5 NBPSeq.5 DESeq.2 edgeR.2 vst.2 voom.2 TSPM.2 NBPSeq.2 Type I error rate Type I error rate 0.15 0.15 0.15 0.15 When the assumed model does not hold, our statistic is able to select significant features much more efficiently than parametric methods. Also, in contrast to parametric methods, our method gives a reliable estimate of the FDR. On several real data sets, our method is able to find features that are expressed consistently higher in one class, and these are more likely to 0.10 be biologically meaningful. 0.10 0.10 0.10 Moreover, the use of current parametric methods is limited in the outcome types that they can handle. Except for PoissonSeq,20 to our knowledge, existing methods can only be used for data with two-class outcomes. PoissonSeq can also be used for data with quantitative outcomes and multiple0.05 0.05 0.05 class0.05 outcomes, but not survival outcomes. Because of the complexity of parametric methods, it is often difficult to extend them to other types of outcomes. In contrast, our nonparametric method can be used for all the types of outcomes mentioned above. Further, the resampling strategy that we developed (Section 2.2) eliminates the difference between sequencing depths of experiments, making 0.00 0.00 0.00 0.00 it easy to generalize our method to other possible types of outcomes. The rest of this article is organized as follows. In Section 2, we propose a nonparametric statistic for data with a two-class outcome and the associated resampling strategy, as well as a permutation plug-in method to estimate the false discovery rate FDR. In Section 3, we study the performance of our nonparametric method on simulated data sets, and compare it with three available methods, Soneson and 2013 Type PoissonSeq I error rates.and Type I error rates, for the six Delorenzi, methods providing nominal p-values, in simulation studies B00 (panel A), P00 (panel B), edgeR, PoissonSeq and DESeq. In Section 4, we apply our method as Figure well as 3edgeR, I error rate at p_nom < 0.05, S0 Type I error rate at p_nom < 0.05, R00 0 S00 (panel D). as Letting some counts follow a Poisson distribution (panel B) reduced the type I error rates for TSPM slightly but had C on three real Type D C) and DESeq RNA-Seq data sets, and compare the list of features that Rare called 0 (panel overall small(RNA-Seq) effect. Including counts (panels C and D) had7a detrimental differentially expressed by different methods. In SectionStat. 5, we extend our anonparametric statisticoutliers with abnormally high J.0.20Aubert challenges Roscoff, 2014 Oct. 49 effect / 81on the ability to control the type I 0.20 to other types of outcomes, and show their performance on simulated data Section 6 contains error forsets. edgeR and NBPSeq, while DESeq became slightly more conservative. the discussion. 0.15 2 0.15 A nonparametric method for two-class data level. We put the FDR threshold Normalisation and Differential analysis Differential analysisat 0.05, and calculated 2.1 Wilcoxon statistic the true false discovery rate as the fraction of the genes For Feature j, suppose that we have counts N1j, . . . , Nnj from either Class 1 or Class 2. Suppose Class 0.10 0.10 called significant at this level that were indeed false disk contains nk samples, k ¼ 1, 2 and n1 + n2 ¼ n. Let Ck ¼ {i : Sample i is from Class k}, k ¼ 1, 2. If the coveries. Since NOISeq does not return a statistic that is recommended to use as an adjusted p-value or FDR esti0.05 mate, it0.05was excluded from this evaluation. For baySeq, EBSeq and ShrinkSeq, we imposed the desired threshold on the 0.00 Bayesian FDR [28]. 0.00 As above, when only 10% of the genes were DE, the direction of their regulation had little effect on the false discovery rate (simulation studies B1250 and B625 0 625 , compare Figures 4A and 4B). The main difference between 0 Figure 3 Type I error rates. Type I error rates, for the six methods providing nominal p-values, in simulation studies B (panel A), P00 (panel the two settings was seen for ShrinkSeq, whose FDRB), 0 Stephen B. Montgomery1,2, Micha Sammeth3, Maria Gutierrez-Arcelus1, Radoslaw P. Lach2, Catherine Ingle2, James Nisbett2, Roderic Guigo3 & Emmanouil T. Dermitzakis1,2 16 Results driven by outliers DESeq.2 edgeR.2 vst.2 voom.2 TSPM.2 NBPSeq.2 50 DESeq.10 edgeR.10 vst.10 voom.10 TSPM.10 NBPSeq.10 40 DESeq.5 edgeR.5 vst.5 voom.5 TSPM.5 NBPSeq.5 30 Transcriptome genetics using second generation sequencing in a Caucasian population (meanexpression 6 s.d.) million reads that were then mapped to the NCBI36 and environmental on cellular state. Many studies Random split of dataset: n1=5; n2=5 genetic Very littleeffects true differential assembly of the human genome (Supplementary Fig. 1) using MAQ . have previously identified genetic variants for gene expression phe- 0.10 0.05 DESeq.2 edgeR.2 vst.2 voom.2 TSPM.2 NBPSeq.2 20 Why is robustness needed? Li and Tibshirani, 2011 0 0 10 0.05 6000 Scaled counts 2000 Scaled counts 500 1000 6000 0 2000 Soneson and Delorenzi BMC Bioinformatics 2013, 14:91 http://www.biomedcentral.com/1471-2105/14/91 0 miR−375, No. 11 by edgeR 0.10 10000 miR−133b, No. 10 by edgeR 10000 miR−206, No. 7 by edgeR Type I error rate Statistical Methods in Medical Research 0(0) Type I error rate 4 Scaled counts Differential analysis Vol 464 | 1 April 2010 | doi:10.1038/nature08903 Outliers False positive rate Normalisation and Differential analysis Page 8 of 18 51 / 81 J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 52 / 81 Normalisation and Differential analysis Multiple testing Normalisation and Differential analysis Multiple Testing Multiple testing The Family Wise Error Rate (FWER) Definition False positive (FP) (type I error : –) : A not DE gene which is declared DE. For all ’genes’, we test H0 (gene i is not DE) vs H1 (the gene is DE) using a statistical test (calcul of a score) Pb : Let assume all the G genes are not DE. Each test is realized at – level Ex : G = 10000 genes and – = 0.05 æ E (FP) = 500 genes. FWER = P(FP > 0) Probability of having at least one false positive, of declaring DE at least one non DE gene. The Bonferroni procedure Either each test is realized at – = –ú /G level or use of adjusted pvalue pBonfi = min(1, pi ú G) and FWER Æ –ú . For G = 2000, Æ –ú = 0.05, – = 2.510≠5 . Easy but conservative and not powerful. When the number of tests increases, the FWER æ 1 with constant FP. J. Aubert Stat. challenges (RNA-Seq) Normalisation and Differential analysis Roscoff, 2014 Oct. 7 53 / 81 J. Aubert Multiple testing Stat. challenges (RNA-Seq) Normalisation and Differential analysis The False Discovery Rate Roscoff, 2014 Oct. 7 54 / 81 Multiple testing False Discovery Rate (FDR) Benjamini et Hochberg (95) FDR = E (FP/P) =1 si P > 0 sinon Idea : Do not control the error rate ut the proportion of error ∆ less conservative than control of the FWER. Prop FDR Æ FWER Source : M. Guedj, Pharnext The procedure of Benjamini-Hochberg (95) is one of the procedure which controls the False Discovery Rate FDR=E (FP/P) siP > 0. J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 55 / 81 J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 56 / 81 Normalisation and Differential analysis Multiple testing Normalisation and Differential analysis FDR Multiple testing Multiple testing : key points Important to control for multiple tests Principle : The number of declared positive elements P is given by the greater i p(i) Æ i–ú /G. FDR or FWER depends on the cost associated to FN and FP Controlling the FWER : Prop Having a great confidendence on the DE elements (strong control). Accepting to not detect some elements (lack of power … a few DE elements) In case of independant tests, FDR Æ (G0 /G)–ú Æ –ú Prop FDR Benjamini-Hochberg : fi0 = J. Aubert G0 G =1 Stat. challenges (RNA-Seq) Normalisation and Differential analysis Controlling the FDR : Accepting a proportion of FP among DE elements. Very interesting in exploratory study. Roscoff, 2014 Oct. 7 57 / 81 J. Aubert Multiple testing Stat. challenges (RNA-Seq) Normalisation and Differential analysis Interpretation - Statistical and practical significance Roscoff, 2014 Oct. 7 58 / 81 Multiple testing Differential Analysis : key points Methods dedicated to microarrays are not applicable to RNA-seq Small number of replicates (2-3) or low expression æ be careful ! ! Practical significance (importance) and statistical significance (detectability) have little to do with each other. An effect can be important, but undetectable (statistically insignificant) because the data are few, irrelevant, or of poor quality. An effect can be statistically significant (detectable) even if it is small and unimportant, if the data are many and of high quality. J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 59 / 81 Large number of replicates (10 or so) or very high expression æ method choice does not matter much. Removing genes with outlier counts or using non-parametric methods reduce the sensitivity to outliers Don’t forget to correct for multiple testing ! Adapt the method to your data (nb of rep.) Specific methods developped for few replicates. The need for ’sophisticated’ methods decreases when the number of replicates increases. J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 60 / 81 Normalisation and Differential analysis Multiple testing Conclusions Other questions General conclusions Gene-Set Enrichissment Analysis These tests assume that genes have the same chance to be declared DE. But sometimes over-detection of longer and mare expressed genes GOSeq (Young et al. 2011) Pratical conclusions Need to collaborate between biologists, bioinformaticians et statisticians and in a ideal world since the project construction Adaptation of methods and tools to the asked question (no pipeline) Check all the steps of the data analysis (quality, normalization, differential analysis . . . ) Filter or not Rau et al. 2013 ; Huber et al. Statistics not only useful for differential analysis with RNA-seq J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 61 / 81 J. Aubert Conclusions Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 62 / 81 Conclusions Aknowledgements StatOmique (in particular M.-A. Dillies, A. Rau, C. Hennequet-Antier) PEPI IBIS (INRA) Pôle planification expérimentale et RNA-Seq David Robinson, Charlotte Soneson From Kumamaru, 2012 J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 63 / 81 J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 64 / 81 Conclusions Conclusions References References Anders, S and Huber, W. (2010) Differential expression analysis for sequence count data, Genome Biology,11 :R106. Anders, S, McCarthy, DJ, Chen, Y, Okoniewski, M, Smyth GK, Huber, W and Robinson, MD (2013) Count-based differential expression analysis of RNA sequencing data using R and Bioconductor, Nature Protocols, doi :10.1038. Kvam V, Liu P (2012) A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data Bullard JH, Purdom E, Hansen KD, Dudoit S. (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments, BMC Bioinformatics, 11 :94 Di Y, Schaef, DW, Cumbie JS, Chang JH (2011) The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq, Statistical Applications in Genetics and Molecular Biology, 10(1), Article 24. J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 Dillies M-A et al. on behalf of The French StatOmique Consortium (2012) A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis, Briefing in Bioinformatics. 65 / 81 Li J, Jiang H, Wong WH, (2010) Modeling non-uniformity in short read rates in RNA-Seq data, Genome Biology, 11 :R50 Li J, Witten DM, Johnstone IM, Tisbhirani R (2011) Normalization, testing, and false discovery rate estimation for RNA-sequencing data, Biostatistics, 1-16 Marioni J.C., Mason C.E. et al. (2008) RNA-seq : An assessment of technical reproducibility and comparison with gene expression arrays, Genome Research, 18 : 1509-1517. J. Aubert Conclusions References Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. (2008) Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods, 5(7), 621-628 Pachter L (2011) Models for transcript quantification from RNA-seq Rapaport et al. (2013) Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data, Genome Biology,14 :R95 Robles et al (2012) Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing, preprint Robinson MD, Oshlack A. (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11 :R25 Robinson MD and Smyth, GK. (2008) Small-sample estimation of negative binomial dispersion, with applications to SAGE data J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 67 / 81 Biostatistics, 9, 2 ; 321-332 Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 66 / 81 Conclusions References Robinson MD, McCarthy DJ, Smyth, GK. (2009) edgeR : a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics Soneson, C, Delorenzi, M. (2013) A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics,14 :91 Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. (2011) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nature Biotechnology, 28(5) : 511 ?515. Young MD, Wakefiled MJ, Smyth GK., Oshlack A. (2011) Gene ontology analysis for RNA-seq :accounting for selection bias , Genome Biology J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 68 / 81 Conclusions Conclusions Notations RPKM normalization Reads Per Kilobase per Million mapped reads Adjust for lane sequencing depth (library size) and gene length Motivation greater lane sequencing depth and gene length => greater counts whatever the expression level Assumption read counts are proportional to expression level, gene length and sequencing depth (same RNAs in equal proportion) Method divide gene read count by total number of reads (in million) and gene length (in kilobase) xij : number of reads for gene i in sample j Nj : number of reads in sample j (library size of sample j) n : number of samples in the experiment ŝj : normalization factor associated with sample j xij ú 103 ú 106 Nj ú Li Li : length of gene i (1) Allows to compare expression levels between genes of the same sample Unbiased estimation of number of reads but affect the variance. (Oshlack et al. 2009) J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 69 / 81 J. Aubert Stat. challenges (RNA-Seq) Conclusions Roscoff, 2014 Oct. 7 70 / 81 Conclusions The Effective Library Size concept Methods based on the Effective Library Size Concept Trimmed Mean of M-values Robinson et al. 2010 (edgeR) Motivation Different biological conditions express different RNA repertoires, leading to different total amounts of RNA Filter on transcripts with nul counts, on the resp. 30% and 5% more extreme k Mi = log2( YYikÕ/N Õ ) and A values ik /Nk Hyp : We may not estimate the total ARN production in one condition but we may estimate a global expression change between two conditions from non extreme Mi distribution. Calculation of a scaling factor between two conditions and normalization to avoid Assumption A majority of transcripts is not differentially expressed dependance on a reference sample Anders and Huber 2010 - Package DESeq Aim Minimizing effect of (very) majority sequences sˆj = mediani ( kij ) (fivm=1 kiv )1/m kij : number of reads in sample j assigned to gene i denominator : pseudo-reference sample created from geometric mean across samples J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 71 / 81 J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 72 / 81 Conclusions Conclusions Length bias (Oshlack 2009, Bullard et al. 2010) Negative Binomial Models At same expression level, a long transcript will have more reads than a shorter transcript. Number of reads ”= expression level µijk = ⁄ij mjk where mj k : size factor (library size) Test : H0i : ⁄iA = ⁄iB vs H1i : ⁄iA ”= ⁄iB µ = E (X ) = cNL = Var (X ) X mesured number of reads in a library mapping a specific transcript, Poisson r.v. c proportionnality constant N total number of transcripts L gene length Test : X1 ≠ X2 t= (cN1 L + cN2 L) Power of test depends on a parameter prop. to (L). Identical result after normalization by gene length (but out of the Poisson framework). J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 edgeR Adjust observed counts up or down depending on whether library sizes are below or above the geometric mean => Creates approximately identically distributed pseudodata Estimation of „i by conditional ML conditioning on the TC for gene i Empirical Bayes procedure to shrink dispersions toward a consensus value An exact test analogous to Fisher’s exact test but adapted to overdispersed data (Robinson and Smyth 2008) DESeq Test similar to Fisher’s exact test (calculation has changed) 73 / 81 J. Aubert Conclusions Yijk ≥ NB(µijk , ‡ijk ), where µijk is the mean, and ‡ijk is the variance The mean µijk is the product of a condition-dependent per-gene value ⁄ij and a size factor (library size) mjk : Three sets of parameters need to be estimated : 1 2 µijk = ⁄ij mjk 3 4 Variance decomposition : The variance ‡ijk is the sum of a shot noise term and a raw variance term : ‡ijk = µijk + –i µ2 where –i the dispersion value. Per-gene dispersion –i or pooled – is a smooth function of the mean : –i = fj (⁄ij ) J. Aubert Stat. challenges (RNA-Seq) 74 / 81 DESeq Bioconductor package Assumptions : 2 Roscoff, 2014 Oct. 7 Conclusions Negative Binomial Models - DESeq 1 Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 75 / 81 Size factors mj k (normalization factors) (see normalization part) For each experimental condition j, n expression strength parameters ⁄ij estimated by average of counts from the replicates for each condition, transformed to the common scale : 1 ÿ yijk ⁄ˆij = rj k mˆj k 3 The smooth functions fj for each condition j to model dependence of –i on the expected mean ⁄ij : local or gamma GLM estimation (fit=’local’ or fit=’parametric’) J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 76 / 81 Conclusions Conclusions Practical considerations First commands Input Data = raw counts normalization offsets are included in the model Version matters : edgeR 2.6.7 et DESeq 1.6.1 (Bioconductor 2.9) edgeR TMM normalization Common dispersion must be estimated before tagwise dispersions GLM functionality (for experiments with multiple factors) now available Installation des packages : source("http ://www.bioconductor.org/biocLite.R") biocLite(c("DESeq", "edgeR")) Chargement des packages : library(DESeq) library(edgeR) DESeq Two possibilities to obtain a smooth functions fj (·) Conservative estimates : genes are assigned the maximum of the fitted and empirical values of –i (sharingMode = "maximum") Local fit regression (as described in paper) is no longer the default Each column = independent biological replicate J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 77 / 81 J. Aubert Conclusions Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 78 / 81 Roscoff, 2014 Oct. 7 80 / 81 Conclusions edgeR main commands DESeq main commands generate raw counts from NB, create list object y <- matrix(rnbinom(80,size=1/0.2,mu=10),nrow=20,ncol=4) rownames(y) <- paste("Gene",1 :nrow(y),sep=".") group <- factor(c(1,1,2,2)) cds <- newCountDataSet(y, group) cds <- estimateSizeFactors(cds) sizeFactors(cds) cds <- estimateDispersions(cds) res <- nbinomTest( cds, "1", "2" ) perform DA with edgeR y <- DGEList(counts=y,group=group) y <- calcNormFactors(y,method="TMM") y <- estimateCommonDisp(y) y <- estimateTagwiseDisp(y) result <- exactTest(y,dispersion="tagwise") Observe some results - DGE with FDR BH topTags(result) summary(decideTestsDGE(result),p.value=0.05) J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 79 / 81 J. Aubert Stat. challenges (RNA-Seq) Conclusions Quelques références pour débuter http://www.r-project.org/ : manuel, FAQ, RJournal, etc... http://www.bioconductor.org/help/publications/ cran.r-project.org/doc/contrib/Paradis-rdebuts_fr.pdf G. Millot, (2009), Comprendre et réaliser les tests statistiques à l’aide de R, Editions De Boeck, 704 p. J. Aubert Stat. challenges (RNA-Seq) Roscoff, 2014 Oct. 7 81 / 81