Statistical challenges in (RNA-Seq) data analysis
Julie Aubert
UMR 518 AgroParisTech-INRA Mathématiques et Informatique Appliquées
Ecole de bioinformatique, Station biologique de Roscoff, 2014 Oct. 7

Statistics and biology
Statistics = study of the collection, analysis, interpretation, presentation and organization of data (source: Wikipedia)
Outline
1 Experimental design
2 Normalisation and Differential analysis
3 Conclusions

Biological question
Experimental design
Sequencing experiment
Low-level analysis: Exploratory Data Analysis, image analysis, base calling, read mapping, metadata integration
Higher-level analysis: Exploratory Data Analysis, normalization and expression quantification, differential analysis, metadata integration
Biological validation and interpretation
*Adapted from S. Dudoit, Berkeley
Experimental design

Another definition
Statistics consists of trying to understand data and to obtain more understandable data. Savage (1977)
Key to a good data analysis: having good data to analyze.
Prerequisite: a clearly defined research objective.
The statistician who supposes that his main contribution to the planning of an experiment will involve statistical theory, finds repeatedly that he makes his most valuable contribution simply by persuading the investigator to explain why he wishes to do the experiment. Gertrude M. Cox

A statistical model: what for?
Aim of an experiment: to answer a biological question.
Results of an experiment: (numerous, numerical) measurements.
Model: mathematical formula that relates the experimental conditions and the observed measurements (response).
(Statistical) modelling: translating a biological question into a mathematical model (≠ PIPELINE!)
Statistical model : mathematical formula involving
the experimental conditions,
the biological response,
the parameters that describe the influence of the
conditions on the (mean, theoretical) response,
and a description of the (technical, biological) variability.
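
As an illustration only (this particular model is an assumption chosen for simplicity, not one taken from the slides), a one-factor linear model contains each of the ingredients listed above:

```latex
Y_{ij} = \mu + \alpha_i + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $i$ indexes the experimental condition and $j$ the replicate, $Y_{ij}$ is the biological response, $\mu$ and $\alpha_i$ are the parameters describing the influence of the conditions on the mean response, and $\sigma^2$ describes the (technical and biological) variability.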

Definition
A good design is a list of experiments to conduct in order to answer the question asked, one that maximizes the information collected and minimizes the number of experiments (or their cost) subject to the constraints.

Steps in designing an experiment
1 Formulate a broadly stated research problem in terms of explicit, addressable questions.
2 Consider the population under study, identify appropriate sampling or experimental units, define relevant variables, and determine how those variables will be measured.
3 Describe the data analysis strategy.
4 Anticipate possible complications during the collection step and propose a way to handle them.
source: Northern Prairie Wildlife Research Center, Statistics for Wildlifers: How much and what kind?

Basic principles - Fisher (1935)
Replication: technical and biological replicates. Replication (independent obs.) ≠ repeated measurements
Randomization: randomize as much as is practical, to protect against unanticipated biases
Blocking: dividing the observations into homogeneous groups, isolating variation attributable to a nuisance variable (e.g. lane)
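
The randomization and blocking principles can be sketched in a few lines. The function below is a hypothetical helper written for this example (its name, arguments and sample labels are assumptions): it treats each lane as a block, puts the same number of samples from every condition in each lane, and randomizes which samples go where.

```python
import random


def blocked_randomization(samples_by_condition, n_lanes, seed=0):
    """Assign samples to lanes (blocks) so that every lane receives the
    same number of samples from each condition, in random order."""
    rng = random.Random(seed)
    lanes = [[] for _ in range(n_lanes)]
    for condition, samples in samples_by_condition.items():
        if len(samples) % n_lanes != 0:
            raise ValueError("replicates per condition must be a multiple of n_lanes")
        shuffled = list(samples)
        rng.shuffle(shuffled)  # randomization within each condition
        for k, sample in enumerate(shuffled):
            lanes[k % n_lanes].append((condition, sample))  # balance across lanes
    for lane in lanes:
        rng.shuffle(lane)  # random loading order within a lane
    return lanes


design = blocked_randomization(
    {"A": ["A1", "A2", "A3", "A4"], "B": ["B1", "B2", "B3", "B4"]},
    n_lanes=2,
)
for number, lane in enumerate(design, start=1):
    print(f"lane {number}: {[sample for _, sample in lane]}")
```

Because both conditions are present in every lane, a lane effect cannot be confused with the condition effect.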

How to design a good RNA-Seq experiment in an interdisciplinary context?
Some basic rules
Rule 1: Share a minimal common language
Rule 2: Define the biological question well
Rule 3: Anticipate difficulties with a well-designed experiment
Make good choices: replicates vs sequencing depth

Rule 2: Define the biological question well
From Alon, 2009
Choose scientific problems based on feasibility and interest
Order your objectives (primary and secondary)
Ask yourself whether RNA-seq is better than microarrays for the biological question

Objectives - RNA-Seq
Identify differentially expressed (DE) genes?
Detect and estimate isoforms?
Construct a de novo transcriptome?

Rule 3: Anticipate difficulties with a well-designed experiment
1 Prepare a checklist of all the elements that need to be collected,
2 Collect data and determine all factors of variation,
3 Choose bioinformatics and statistical models,
4 Draw conclusions from the results.

Be aware of different types of bias
Keep in mind the influence of effects on results:
lane < run < RNA library preparation < biological
(Marioni, 2008), (Bullard, 2010)

RNA-seq experiment analysis: from A to Z
(Figure adapted from Mutz, 2013)

Make good choices
How many reads?
100M to detect 90% of the transcripts of 81% of human genes (Toung, 2011)
20M reads of 75bp can detect transcripts of medium and low abundance in chicken (Wang, 2011)
10M to cover by at least 10 reads 90% of all (human and zebrafish) genes (Hart, 2013) ...

Biological and technical replicates
Biological replicate: sampling of individuals from a population in order to make inferences about that population.
Technical replicate: addresses the measurement error of the assay.

Why increase the number of biological replicates?
To generalize to the population level
To estimate with greater accuracy the variation in individual transcripts (Hart, 2013)
To improve detection of DE transcripts and control of the false positive rate: true with at least 3 replicates (Soneson 2013, Robles 2012)

Replication: quality rather than quantity
Technical variability => inconsistent detection of exons at low levels of coverage (<5 reads per nucleotide) (McIntyre et al. 2011)
Technical replication may be important in studies where low-abundance mRNAs are the focus.

More biological replicates or increased sequencing depth?
It depends! (Haas, 2012), (Liu, 2014)
It's a balance: cost, precision ↔ number of biological replicates, sequencing depth.
DE transcript detection: (+) biological replicates
Construction and annotation of transcriptome: (+) depth and (+) sampling conditions
Transcriptomic variants search: (+) biological replicates and (+) depth
A solution: multiplexing.
Samples are tagged (barcoded) with specific sequences added during library construction, which allows multiple samples to be included in the same sequencing reaction (lane).
Decision tools available: Scotty (Busby, 2013), RNAseqPower (Hart, 2013)

Experimental Design: Scotty
An example output from the Scotty application. This figure shows the user which of the tested experimental configurations do (white) and do not (shaded) conform to the user-defined constraints. Scotty then indicates the optimal configuration based on cost (filled triangle) and power (filled circle).
Busby M A et al. Bioinformatics 2013;29:656-657
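
To make the replicates-versus-depth trade-off concrete, here is a minimal sketch of a normal-approximation sample size calculation. It is an illustration under stated assumptions, not the method implemented by Scotty or RNAseqPower: counts are taken to be negative binomial, so that the variance of a log count is roughly 1/depth + CV², and `depth` (average reads per gene), `cv` (biological coefficient of variation) and the function name are hypothetical inputs chosen for the example.

```python
import math
from statistics import NormalDist


def replicates_per_group(depth, cv, fold_change, alpha=0.05, power=0.8):
    """Approximate number of biological replicates per condition needed to
    detect `fold_change` for a gene covered by `depth` reads on average.

    Assumption: Var(log count) ~ 1/depth + cv**2 (delta method applied to a
    negative binomial count), two-sided test at level alpha."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    var_log = 1.0 / depth + cv ** 2
    n = 2 * (z_alpha + z_beta) ** 2 * var_log / math.log(fold_change) ** 2
    return math.ceil(n)


for depth in (20, 200, 2000):
    n = replicates_per_group(depth, cv=0.4, fold_change=2)
    print(f"depth {depth:>4} reads/gene: {n} replicates per group")
```

With these illustrative values, going from 20 to 200 reads per gene saves one replicate per group, while going from 200 to 2000 saves none: once 1/depth is small compared with CV², only additional biological replicates increase power.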

Experimental Design: key points
The scientific question of interest drives the experimental choices
Collect information before planning
All skills are needed in the discussions, right from project construction
The optimum compromise between replication number and sequencing depth depends on the question
Biological replicates are important in most RNA-seq experiments
Sequencing and other technical biases potentially increase the required sample size and sequencing depth
Wherever possible apply Fisher's three principles of randomization, replication and local control (blocking)
And do not forget: the budget also includes the cost of biological data acquisition, sequencing data backup, bioinformatics and statistical analysis.

Normalisation and Differential analysis

RNA-sequencing
mRNAs from a sample → random fragmentation → fragmented mRNAs → reverse transcription → cDNAs → PCR amplification & sequencing → a list of reads (ATTGCC..., GCTAAC..., AGCCTC..., ...) → mapping → mapped reads (from Gene 19, from Gene 23, from Gene 56, ...) → counting → a vector of counts
# of reads mapped to Gene 1, Gene 2, ..., Gene m: 25, 320, ..., 23
Adapted from Li et al. (2011)

Isoform detection and quantification
(Figure from E. Bernard)

Differential Analysis
Identification of differentially expressed (DE) genes
A gene is declared differentially expressed (DE) between two conditions if the observed difference is statistically significant, i.e. greater than expected from natural random variation alone.
Statistical tools are necessary to make this decision.
The main steps are: experimental design, normalisation and differential analysis, multiple testing.
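
The multiple testing step is only named here. As a hedged illustration of what it can look like, the sketch below implements the Benjamini-Hochberg adjustment, one standard way of controlling the false discovery rate when one test is run per gene. Choosing this procedure is an assumption made for the example, and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by increasing p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min  # enforce monotone adjusted values
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-gene p-values
adjusted = benjamini_hochberg(raw)
print([round(p, 3) for p in adjusted])
```

A gene is then declared DE when its adjusted p-value is below the chosen FDR level.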

Fold Change approach and ideal cut-off values
$FC_i = x_{i\cdot} / y_{i\cdot}$ (ratio of the average expression of gene i in the two conditions)

   Gene     CondA1    CondA2    CondB1    CondB2    FC      pvalue
1  Gene1      5.00      7.00      2.00      2.00    3.00    0.06
2  Gene2    800.00   1000.00    350.00    250.00    3.00    0.03
3  Gene3    700.00   1100.00    350.00    250.00    3.00    0.10
4  Gene4    500.00   1300.00    550.00     50.00    3.00    0.33

FC does not take the variance of the samples into account.
Problematic since variability in gene expression is partially gene-specific.
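
The table can be reproduced with a short script. The sketch below assumes that the p-values come from a two-sample Student t-test with pooled variance (this assumption does reproduce the four values shown); with two replicates per condition the test has 2 degrees of freedom, for which the p-value has a closed form.

```python
import math
from statistics import mean, variance

# Values from the table: (condition A replicates, condition B replicates)
counts = {
    "Gene1": ([5.0, 7.0], [2.0, 2.0]),
    "Gene2": ([800.0, 1000.0], [350.0, 250.0]),
    "Gene3": ([700.0, 1100.0], [350.0, 250.0]),
    "Gene4": ([500.0, 1300.0], [550.0, 50.0]),
}


def fold_change(a, b):
    """Ratio of the mean expression in the two conditions."""
    return mean(a) / mean(b)


def t_test_two_replicates(a, b):
    """Two-sided pooled-variance t-test for 2 replicates per condition."""
    assert len(a) == 2 and len(b) == 2
    pooled = (variance(a) + variance(b)) / 2  # equal group sizes
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / len(a) + 1 / len(b)))
    # Closed-form two-sided p-value of Student's t with 2 degrees of freedom
    return 1 - abs(t) / math.sqrt(t * t + 2)


results = {
    gene: (fold_change(a, b), t_test_two_replicates(a, b))
    for gene, (a, b) in counts.items()
}
for gene, (fc, p) in results.items():
    print(f"{gene}: FC = {fc:.2f}, p-value = {p:.2f}")
```

All four genes have the same fold change of 3, yet only the genes whose replicates agree closely reach small p-values.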

Normalization

Definition
Normalization is a process designed to identify and correct technical biases while removing as little biological signal as possible. This step is technology- and platform-dependent.
Within-lane normalization
Normalisation enabling comparisons of fragments (genes) from the same sample.
Not needed in a differential analysis context.
Between-lane normalization
Normalisation enabling comparisons of fragments (genes) from different samples.

Sources of variability
Within-sample
Gene length
Sequence composition (GC content)
Between-sample
Depth (total number of sequenced and mapped reads)
Sampling bias in library construction?
Presence of majority fragments
Sequence composition due to the PCR-amplification step in library preparation (Pickrell et al. 2010, Risso et al. 2011)

Normalisation methods
Global methods: normalised counts are raw counts divided by a scaling factor calculated for each sample
Distribution adjustment
Assumption (TC, UQ, Median): read counts are proportional to expression level and sequencing depth
Total number of reads: TC (Marioni et al. 2008), Quantile: FQ (Robinson and Smyth 2008), Upper Quartile: UQ (Bullard et al. 2010), Median
Method taking length into account
Reads Per KiloBase Per Million Mapped: RPKM (Mortazavi et al. 2008)
The Effective Library Size concept
Trimmed Mean of M-values: TMM (Robinson and Oshlack 2010, edgeR)
DESeq (Anders and Huber 2010, DESeq)

Comparison of normalization methods
A lot of different normalization methods...
Some are part of models for DE, others are 'stand-alone'
They do not rely on similar hypotheses
But all of them claim to remove technical bias associated with RNA-seq data
Which one is the best?
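
Three of the scaling-factor ideas named above can be sketched in plain Python. These are simplified illustrations written for this text, not the edgeR or DESeq implementations: the function names, the toy count matrix and the choice to skip genes containing a zero count are assumptions.

```python
import math
from statistics import median


def total_count_factors(counts):
    """TC: library size of each sample relative to the mean library size."""
    totals = [sum(column) for column in zip(*counts)]
    mean_total = sum(totals) / len(totals)
    return [total / mean_total for total in totals]


def median_ratio_factors(counts):
    """Median-of-ratios size factors, in the spirit of Anders and Huber (2010):
    each sample is compared with a per-gene geometric-mean reference."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # geometric mean undefined when a count is zero
        log_reference = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_reference)
    return [math.exp(median(r)) for r in log_ratios]


def rpkm(count, gene_length_bp, mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length_bp * mapped_reads)


# Toy matrix (rows = genes, columns = 2 samples), hypothetical values
toy = [[10, 20], [100, 200], [50, 100], [5, 1000]]
tc = total_count_factors(toy)
mr = median_ratio_factors(toy)
print("TC ratio sample2/sample1:", round(tc[1] / tc[0], 2))
print("median-of-ratios sample2/sample1:", round(mr[1] / mr[0], 2))
```

In this toy matrix the second sample is sequenced twice as deeply for three genes, and the last gene is a "majority gene" in that sample only. The TC factors differ by a factor of 8 because they are driven by that single gene, while the median-of-ratios factors differ by the expected factor of 2.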

4 real datasets and one simulated dataset
RNA-seq or miRNA-seq, DE, at least 2 conditions, at least 2 bio. rep., no tech. rep.

Comparison procedures
Distribution and properties of normalized datasets
Boxplots, variability between biological replicates
Comparison of DE genes
Differential analysis: DESeq v1.6.1 (Anders and Huber 2010), default param.
Number of common DE genes, similarity between lists of genes (dendrogram - binary distance and Ward linkage)
Power and control of the Type-I error rate
simulated data
non equivalent library sizes
presence of majority genes

So the Winner is ... ?
In most cases
The methods yield similar results
However ...
Differences appear based on data characteristics

Interpretation
RawCount: Often fewer differentially expressed genes (A. fumigatus: no DE gene)
TC, RPKM: Sensitive to the presence of majority genes; less effective stabilization of distributions; ineffective (similar to RawCount)
FQ: Can increase between-group variance; is based on a very (too) strong assumption (similar distributions)
Median: High variability of housekeeping genes
TC, RPKM, FQ, Med, UQ: Adjustment of distributions implies a similarity between the RNA repertoires expressed

Conclusions
Hypothesis: the majority of genes are invariant between two samples.
Differences between methods appear in the presence of majority sequences or very different library depths.
TMM and DESeq: performant and robust methods in a DE analysis context on the gene scale.
Normalisation is necessary and not trivial.

Normalization: key points
RNA-seq data are affected by technical biases (total number of mapped reads per lane, gene length, composition bias)
⇒ Normalization is needed and has a great impact on the DE genes (Bullard et al 2010), (Dillies et al 2012)
Detection of differential expression in RNA-seq data is inherently biased (more power to detect DE of longer genes)
Do not normalise by gene length in a context of differential analysis.

Differential analysis

Differential analysis gene-by-gene - with replicates
Aim: to detect differentially expressed genes between two conditions
Discrete quantitative data
Few replicates
Overdispersion problem
Challenge: a method that takes into account overdispersion and a small number of replicates
Proposed methods: edgeR and DESeq are the most used and best known
Anders et al. 2013, Nature Protocols
An abundant literature
Comparison of methods: Pachter et al. (2011), Kvam and Liu (2012), Soneson and Delorenzi (2013), Rapaport et al. (2013)

Differential analysis
For each gene i
Is there a significant difference in expression between condition A and B?
Statistical model (definition and parameter estimation) - Generalized linear framework
Test: equality of relative abundance of gene i in condition A and B vs non-equality

The Poisson Model
Let $Y_{ijk}$ be the count for replicate j in condition k from gene i
$Y_{ijk}$ follows a Poisson distribution with mean $\mu_{ijk}$.
Property: $\mathrm{Var}(Y_{ijk}) = \mathrm{Mean}(Y_{ijk}) = \mu_{ijk}$

Mean-Variance Relationship
(Figure from D. Robinson and D. McCarthy)

Overdispersion in RNA-seq data
Counts from biological replicates tend to have variance exceeding the mean (= overdispersion relative to the Poisson distribution). Poisson describes only technical variation.
What causes this overdispersion?
Correlated gene counts
Clustering of subjects
Within-group heterogeneity
Within-group variation in transcription levels
Different types of noise present...
If overdispersion is ignored, the type I error rate (prob. of incorrectly declaring a gene DE) increases.
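
Overdispersion can be checked directly on replicate counts. The sketch below uses the four condition-A counts of tag ENSG00000124208 (478, 619, 628, 744), taken from the count table shown with the negative binomial models in this section. It is a rough illustration only: the raw counts are not corrected for library size, so part of the excess variance may be technical, and the method-of-moments dispersion is an assumption of this example rather than the estimator used by edgeR or DESeq.

```python
from statistics import mean, variance

counts_a = [478, 619, 628, 744]  # replicates A1..A4 of tag ENSG00000124208

m = mean(counts_a)
v = variance(counts_a)         # unbiased sample variance
dispersion_index = v / m       # close to 1 on average under a Poisson model
phi_moment = (v - m) / m ** 2  # method-of-moments dispersion: var = m * (1 + m * phi)

print(f"mean = {m:.2f}, variance = {v:.2f}")
print(f"variance / mean = {dispersion_index:.1f}")
print(f"moment estimate of the dispersion = {phi_moment:.4f}")
```

The variance is about 19 times the mean, far from the ratio of 1 that a Poisson model would imply.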

Alternative: Negative Binomial Models
A supplementary dispersion parameter φ to model the variance

Poisson vs Negative Binomial Models

Tag ID             A1    A2    A3    A4    B1    B2    B3
ENSG00000124208   478   619   628   744   483   716   240
ENSG00000182463    27    20    27    26    48    55    24
ENSG00000125835   132   200   200   228   560   408   103
ENSG00000125834    42    60    72    86   131    99    30
ENSG00000197818    21    29    35    31    52    44    20
ENSG00000215443     4     4     4     0     9     7     4
ENSG00000222008     …
ENSG00000125831     …
ENSG00000101444     …
ENSG00000101333     …
…

Model assumptions
$M_j$ = library size
$\lambda_{ij}$ = relative abundance of feature i
Poisson describes technical variation:
$Y_{ij} \sim \mathrm{Pois}(M_j \lambda_{ij})$
$\mathrm{mean}(Y_{ij}) = \mathrm{variance}(Y_{ij}) = M_j \lambda_{ij}$
Negative binomial models biological variability using the dispersion parameter φ:
$Y_{ij} \sim \mathrm{NB}(\mu_{ij} = M_j \lambda_{ij}, \phi_i)$
Same mean, variance is quadratic in the mean:
$\mathrm{variance}(Y_{ij}) = \mu_{ij} (1 + \mu_{ij} \phi_i)$
Critical parameter to estimate: dispersion
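
The quadratic mean-variance relationship can be checked by simulation. The sketch below draws negative binomial counts as a gamma-Poisson mixture, a standard construction in which the gamma layer plays the role of biological variability and the Poisson layer the role of technical variability. The parameter values, the seed and the simple Poisson sampler are assumptions made for this example.

```python
import math
import random
from statistics import mean, variance


def poisson(rng, lam):
    """Knuth's multiplication method; adequate for the moderate means used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def negative_binomial(rng, mu, phi):
    """Gamma-Poisson mixture with mean mu and variance mu * (1 + mu * phi)."""
    # gammavariate(shape, scale): mean = mu, variance = phi * mu**2
    lam = rng.gammavariate(1.0 / phi, mu * phi)
    return poisson(rng, lam)


rng = random.Random(1)
mu, phi = 20.0, 0.5
pois = [poisson(rng, mu) for _ in range(20000)]
nb = [negative_binomial(rng, mu, phi) for _ in range(20000)]

print(f"Poisson: mean = {mean(pois):.1f}, variance = {variance(pois):.1f}")
print(f"NB:      mean = {mean(nb):.1f}, variance = {variance(nb):.1f}")
print(f"expected NB variance mu * (1 + mu * phi) = {mu * (1 + mu * phi):.1f}")
```

Both samples share the same mean, but the negative binomial variance is close to μ(1 + μφ) rather than μ.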

φ estimation: crucial!
A lot of statistical methods... still being developed
Many genes, very few biological samples - difficult to estimate φ on a gene-by-gene basis
Some proposed solutions

Method    Variance                          Reference
DESeq     $\mu(1 + \phi_\mu \mu)$           Anders and Huber (2010)
edgeR     $\mu(1 + \phi \mu)$               Robinson and Smyth (2009)
NBPseq    $\mu(1 + \phi \mu^{\alpha-1})$    Di et al. (2011)

DESeq: data-driven relationship of variance and mean estimated using parametric or local regression for robust fit across genes
edgeR: borrow information across genes for stable estimates of φ; 3 ways to estimate φ (common, trend, moderated)
NBPseq: φ and α estimated by LM based on all the genes.

Differential Analysis between two conditions
(Diagram: normalized count data → differential analysis, with two branches)
Parametric model: P+, NB, zero-inflated NB, quasi-likelihood (TSPM), semiparametric
Methods: BaySeq, DESeq, EBSeq, edgeR, NBPSeq, ShrinkSeq, TSPM
Test: classical (DESeq, edgeR, NBPSeq, TSPM) or Bayesian (BaySeq, EBSeq, ShrinkSeq)
Non-parametric model
Methods: NOISeq, NOISeqBIO, SAMseq
Estimation: Wilcoxon's statistic (SAMseq), Gaussian kernel (NOISeqBIO), empirical distribution of « noise » (NOISeq)
Test: permutation (SAMseq), Bayesian thresholding (NOISeqBIO, NOISeq)

Comparison of differential analysis methods
Soneson and Delorenzi (2013)
Evaluation of 11 methods on both simulated and real data.
Obs 1: The number of replicates matters! (Differently for different methods)
Obs 2: Results are more accurate and less variable between methods if DE genes are regulated in both directions.
Obs 3: Outlier counts affect different methods in different ways

Comparison of differential analysis methods
Soneson and Delorenzi (2013)
Small number of replicates (2-3) or low expression → be careful!!
Large number of replicates (10 or so) or very high expression → method choice does not matter much.
Removing genes with outlier counts or using non-parametric methods reduces the sensitivity to outliers
Allow tagwise dispersion values
Normalization methods have problems when all DE genes are regulated in one direction. Iterative approaches like TCC improve performance
The dispersion estimation method matters!

Comparison of differential analysis methods
Rapaport et al. (2013)
Evaluation of methods using the SEQC benchmark dataset and ENCODE data.
Significant differences between methods.
Array-based methods, once adapted, perform comparably to specific methods.
Increasing the number of replicate samples improves detection power significantly more than increasing sequencing depth.

DESeq2 Love and Huber (2013)
Differences with DESeq:
Dispersion shrinkage
Fold change shrinkage (for PCA and Gene Set Enrichment Analysis)
Detection of outliers
Improves power
Only one command line

Differential analysis
(Figure: Type I error rate at p_nom < 0.05 for DESeq, edgeR, vst, voom, TSPM and NBPSeq with 2, 5 and 10 samples per group. Panels A (B00) and B (P00): no outliers; panels C (S00) and D (R00): presence of outliers. Letting some counts follow a Poisson distribution (panel B) reduced the type I error rates for TSPM slightly but had overall a small effect. Including outliers with abnormally high counts (panels C and D) had a detrimental effect on the ability to control the type I error for edgeR and NBPSeq, while DESeq became slightly more conservative.)
Soneson and Delorenzi BMC Bioinformatics 2013, 14:91
http://www.biomedcentral.com/1471-2105/14/91

Challenge: edgeR can be sensitive to outliers
(Figure 2. Counts from some miRNAs found to be very significant by edgeR do not seem to follow negative binomial distributions. Each panel shows the counts from one miRNA in the Witten data. These miRNAs are the 7th, 10th and 11th most significant features detected by edgeR. The heights of vertical bars show the scaled counts from the samples. The first 29 bars, coloured red, are samples from the one class, and the other 29 bars, coloured blue, are from the other. The black broken line is also drawn to separate the two classes. In each panel, we see that one count has much larger values than all the other counts.)

Current policies (robustness)
edgeR: one option is to moderate dispersion less towards the trend
DESeq: take the maximum of the fit (trended) or the feature-specific dispersion
Very robust, but many genes pay a penalty, less powerful.

CPMs (counts per million)

Tag    NA19222  NA12287  NA19172  NA11881  NA18871  NA12872  NA18916  NA18856  NA19193  NA19140
4004       0.0      1.9    178.1      0.0      0.5      0.0      0.0      0.0      0.0      0.0
2538       2.0      0.6    235.5      6.8     60.2      1.0      0.0      0.0      2.5      1.3
4962       3.5      0.6    429.5      1.0     35.9      0.0      0.4      0.0      0.0      4.7
7921       1.0      5.1     78.9      2.9      0.0      0.0      0.8      0.0      0.0      0.4
6115       0.0      1.3      0.0      1.9      0.0      0.5     46.1      0.0    100.1      1.3
5156      13.8      1.3     30.7      0.0      7.1      0.0      0.0      1.0      0.0      1.3
2527      23.7    111.0    228.8     77.0    129.5     10.0     45.3     27.4     26.3     19.1
1115       2.0     15.2   1074.8     19.5     13.2     10.0     29.6      0.0      1.3      5.5
3175       3.0      6.3    181.0      7.8      7.6      0.0      5.5      3.0      3.1      2.5
7951       1.0     12.1     35.9      0.0      1.0      1.0      0.0      1.0      0.0      0.0
7631       0.0      1.9      0.4      1.0      0.0      0.5     29.6      0.0     24.4      5.5
3437      24.6     31.1    167.0      4.9     21.2      4.5      8.3     10.1      8.1      0.4

Tag         logFC    logCPM        LR        PValue           FDR
4004   -10.413038  4.186203  30.07924  4.147469e-08  0.0002239513
2538    -5.942865  4.963086  29.60406  5.299369e-08  0.0002239513
4962    -6.387829  5.576979  26.06085  3.308237e-07  0.0009320406
7921    -5.808379  3.183079  22.51927  2.080466e-06  0.0043960241
6115     5.746084  3.921353  21.37010  3.786299e-06  0.0064003595
5156    -4.573655  2.512035  20.13483  7.217026e-06  0.0101663841
2527    -2.154480  6.128702  18.44343  1.750229e-05  0.0211327628
1115    -4.575934  6.873996  18.14127  2.051076e-05  0.0211672325
3175    -3.843458  4.473754  17.71318  2.568407e-05  0.0211672325
7951    -4.786326  2.416892  17.66324  2.636730e-05  0.0211672325
7631     4.311717  2.683367  17.57990  2.754846e-05  0.0211672325
3437    -3.014484  4.821100  17.05690  3.627624e-05  0.0255505626

Robinson's simulations
edgeR and edgeR robust are a bit liberal (5% FDR might mean 6% or 7%)
DESeq2 is very powerful in the absence of outliers, but its policy of filtering outliers results in a loss of power
Goal (Robinson and co.): achieve a middle ground between protection against outliers and maintaining high power, using observation weights
To understand the quantitative differences in gene expression
within a human population as determined from second generation
sequencing, we sequenced the mRNA fraction of the transcriptome of
lymphoblastoid cell lines (LCLs) from 60 CEU (HapMap individuals
of European descent) individuals (from CEPH—Centre d’Etude du
Polymorphisme Humain) using 37-base pairs (bp) paired-end
Illumina sequencing. Each individual’s transcriptome was sequenced
DESeq’s policy on outliers has a global effect, resulting in (sometimes
drastic) drop in power
EBSeq and ShrinkSeq, we imposed the desired threshold
DESeq2
? calculate
Cook’s distance
and
filter
outliers
sample
size (10
samplesgenes
per group),with
ShrinkSeq,
NBPSeq,
on the Bayesian
FDR [28].
As above, when only 10% of the genes were DE, the EBSeq, edgeR and TSPM often found too many false
Can
inadvertently
filter
interesting
genes
The remaining methods were essentially able
direction
of their regulation had
little effect
on the false positives.
J. Aubert
Nature, 2010
Gene expression is an important phenotype that informs about
Robust edgeR suffers a tiny bit in power with no outliers, but has
good capacity to dampen their effect if present
Allows dispersions to be driven more by the data
discovery rate (simulation studies B1250
and B625
0
625 , com-
DESeq.10
voom.10
vst.10
TSPM.10
edgeR.10
NBPSeq.10
DESeq.5
voom.5
vst.5
TSPM.5
edgeR.5
NBPSeq.5
DESeq.2
voom.2
vst.2
TSPM.2
edgeR.2
NBPSeq.2
Type I error rate
DESeq.10
vst.10
DESeq.10
voom.10
edgeR.10
TSPM.10
vst.10
edgeR.10
voom.10
NBPSeq.10
TSPM.10
NBPSeq.10
DESeq.5
vst.5
DESeq.5
voom.5
edgeR.5
TSPM.5
vst.5
edgeR.5
voom.5
NBPSeq.5
TSPM.5
NBPSeq.5
DESeq.2
vst.2
DESeq.2
voom.2
edgeR.2
TSPM.2
vst.2
edgeR.2
voom.2
NBPSeq.2
TSPM.2
NBPSeq.2
Type I error rate
DESeq.10
edgeR.10
vst.10
voom.10
TSPM.10
NBPSeq.10
DESeq.5
edgeR.5
vst.5
voom.5
TSPM.5
NBPSeq.5
DESeq.2
edgeR.2
vst.2
voom.2
TSPM.2
NBPSeq.2
Type I error rate
Type I error rate
0.15 0.15
0.15
0.15
When
the assumed model does not hold, our statistic is able to select significant
features much more
efficiently than parametric methods. Also, in contrast to parametric methods, our method gives a
reliable estimate of the FDR. On several real data sets, our method is able to find features that are
expressed
consistently higher in one class, and these are more likely to 0.10
be biologically
meaningful.
0.10
0.10
0.10
Moreover, the use of current parametric methods is limited in the outcome types that they can
handle. Except for PoissonSeq,20 to our knowledge, existing methods can only be used for data with
two-class outcomes. PoissonSeq can also be used for data with quantitative outcomes and multiple0.05 0.05
0.05
class0.05
outcomes, but not survival outcomes. Because of the complexity of
parametric methods, it is
often difficult to extend them to other types of outcomes. In contrast, our nonparametric method
can be used for all the types of outcomes mentioned above. Further, the resampling strategy that we
developed
(Section 2.2) eliminates the difference between sequencing depths
of experiments, making
0.00 0.00
0.00
0.00
it easy to generalize our method to other possible types of outcomes.
The rest of this article is organized as follows. In Section 2, we propose a nonparametric statistic
for data with a two-class outcome and the associated resampling strategy, as well as a permutation
plug-in method to estimate the false discovery rate FDR. In Section 3, we study the performance of
our nonparametric method on simulated data sets, and compare it with three available methods,
Soneson
and
2013
Type PoissonSeq
I error rates.and
Type I error
rates, for the
six Delorenzi,
methods providing
nominal p-values, in simulation studies B00 (panel A), P00 (panel B),
edgeR, PoissonSeq and DESeq. In Section 4, we apply our method as Figure
well as 3edgeR,
I error rate at p_nom < 0.05, S0
Type I error rate at p_nom < 0.05, R00
0
S00 (panel
D). as
Letting some counts follow a Poisson distribution (panel B) reduced the type I error rates for TSPM slightly but had
C on three real Type
D C) and
DESeq
RNA-Seq data sets, and compare
the list of
features
that Rare
called
0 (panel
overall
small(RNA-Seq)
effect. Including
counts (panels
C and
D) had7a detrimental
differentially
expressed by different methods. In SectionStat.
5, we extend
our anonparametric
statisticoutliers with abnormally high
J.0.20Aubert
challenges
Roscoff,
2014
Oct.
49 effect
/ 81on the ability to control the type I
0.20
to other types of outcomes, and show their performance on simulated
data
Section
6 contains
error
forsets.
edgeR
and NBPSeq,
while DESeq became slightly more conservative.
the discussion.
0.15
2 0.15
A nonparametric method for two-class data
level. We put
the FDR threshold
Normalisation
and Differential analysis
Differential
analysisat 0.05, and calculated
2.1 Wilcoxon
statistic
the true false discovery rate as the fraction of the genes
For Feature j, suppose that we have counts N1j, . . . , Nnj from either Class 1 or Class 2. Suppose Class
0.10
0.10
called
significant
at
this
level
that were indeed false disk contains nk samples, k ¼ 1, 2 and n1 + n2 ¼ n. Let Ck ¼ {i : Sample i is from Class k}, k ¼ 1, 2. If the
coveries. Since NOISeq does not return a statistic that is
recommended to use as an adjusted p-value or FDR esti0.05
mate, it0.05was excluded from this evaluation. For baySeq,
EBSeq and ShrinkSeq, we imposed the desired threshold
on the 0.00
Bayesian FDR [28].
0.00
As above, when only 10% of the genes were DE, the
direction of their regulation had little effect on the false
discovery rate (simulation studies B1250
and B625
0
625 , compare Figures 4A and 4B). The main difference between
0
Figure 3 Type I error rates. Type I error rates, for the six methods providing
nominal
p-values,
in
simulation
studies
B
(panel
A), P00 (panel
the two settings was seen for ShrinkSeq,
whose
FDRB),
0
Stephen B. Montgomery1,2, Micha Sammeth3, Maria Gutierrez-Arcelus1, Radoslaw P. Lach2, Catherine Ingle2,
James Nisbett2, Roderic Guigo3 & Emmanouil T. Dermitzakis1,2
16
Results driven by outliers
DESeq.2
edgeR.2
vst.2
voom.2
TSPM.2
NBPSeq.2
50
DESeq.10
edgeR.10
vst.10
voom.10
TSPM.10
NBPSeq.10
40
DESeq.5
edgeR.5
vst.5
voom.5
TSPM.5
NBPSeq.5
30
Transcriptome genetics using second generation
sequencing in a Caucasian population
(meanexpression
6 s.d.) million reads that were then mapped to the NCBI36
and environmental
on cellular
state. Many studies
Random split of dataset: n1=5; n2=5 genetic
Very
littleeffects
true
differential
assembly of the human genome (Supplementary Fig. 1) using MAQ .
have previously identified genetic variants for gene expression phe-
0.10
0.05
DESeq.2
edgeR.2
vst.2
voom.2
TSPM.2
NBPSeq.2
20
Why is robustness
needed?
Li and Tibshirani, 2011
0
0
10
0.05
6000
Scaled counts
2000
Scaled counts
500 1000
6000
0
2000
Soneson and Delorenzi BMC Bioinformatics 2013, 14:91
http://www.biomedcentral.com/1471-2105/14/91
0
miR−375, No. 11 by edgeR
0.10
10000
miR−133b, No. 10 by edgeR
10000
miR−206, No. 7 by edgeR
Type I error rate
Statistical Methods in Medical Research 0(0)
Type I error rate
4
Scaled counts
Differential analysis
Vol 464 | 1 April 2010 | doi:10.1038/nature08903
Outliers
False positive rate
Normalisation and Differential analysis
Page 8 of 18
51 / 81
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
52 / 81
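To make the "results driven by outliers" point concrete, here is a minimal base-R sketch. The CPM values are invented for illustration (they are not the miRNA data of the figure), and the rank-based Wilcoxon test merely stands in for the non-parametric family; it is not the weighting scheme of robust edgeR.

```r
# Hypothetical CPM values for one gene, 5 samples per group (illustrative only).
group1 <- c(90, 100, 105, 110, 8000)   # the last value is an outlier
group2 <- c(95, 99, 101, 102, 108)

# Mean-based log2 fold change: dominated by the single outlier.
lfc_mean <- log2(mean(group1) / mean(group2))

# Median-based log2 fold change: essentially unaffected.
lfc_median <- log2(median(group1) / median(group2))

# Rank-based test: the outlier only counts as "the largest value".
p_wilcox <- wilcox.test(group1, group2)$p.value

round(c(lfc_mean = lfc_mean, lfc_median = lfc_median, p_wilcox = p_wilcox), 3)
```

With a random split and no true differential expression, a mean-based statistic can still rank such a gene near the top, which is the situation the slide warns about.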
Multiple testing
Definition
False positive (FP) (type I error, $\alpha$): a non-DE gene that is declared DE.
For every gene $i$, we test $H_0$ (gene $i$ is not DE) vs $H_1$ (gene $i$ is DE) using a statistical test (computation of a score).
Problem:
Assume that none of the $G$ genes is DE and that each test is performed at level $\alpha$.
Example: $G = 10000$ genes and $\alpha = 0.05$ $\Rightarrow$ $E(FP) = 500$ genes.
When the number of tests increases at a constant per-test level $\alpha$, the FWER $\to 1$.
The Family Wise Error Rate (FWER)
Definition
$\mathrm{FWER} = P(FP > 0)$
Probability of having at least one false positive, i.e. of declaring DE at least one non-DE gene.
The Bonferroni procedure
Either each test is performed at level $\alpha = \alpha^* / G$, or the adjusted p-value
$p_i^{\mathrm{Bonf}} = \min(1, p_i \times G)$
is used, so that $\mathrm{FWER} \le \alpha^*$.
For $G = 2000$ and $\alpha^* = 0.05$, $\alpha = 2.5 \times 10^{-5}$.
Easy but conservative and not powerful.
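The arithmetic of these two slides can be checked in a few lines of base R. This is only a sketch: the closed form `1 - (1 - alpha)^G` for the FWER assumes independent tests with every null hypothesis true (an assumption added here, not stated on the slide), and the p-values are invented.

```r
# Expected number of false positives when no gene is DE.
G <- 10000
alpha <- 0.05
expected_fp <- G * alpha                      # 500 genes

# FWER for G independent tests at level alpha, all nulls true (assumption).
fwer <- function(G, alpha) 1 - (1 - alpha)^G
fwer(c(1, 10, 100, 10000), alpha)             # grows towards 1 with G

# Bonferroni: per-test level for G = 2000 and alpha* = 0.05.
alpha_star <- 0.05
alpha_per_test <- alpha_star / 2000           # 2.5e-05

# Bonferroni-adjusted p-values, by hand and with stats::p.adjust.
p <- c(1e-6, 2e-5, 0.001, 0.04, 0.5)          # hypothetical p-values
p_bonf_manual <- pmin(1, p * length(p))
p_bonf <- p.adjust(p, method = "bonferroni")
```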
The False Discovery Rate
False Discovery Rate (FDR), Benjamini and Hochberg (1995)
Idea: do not control the probability of making at least one error, but the expected proportion of errors among the genes declared DE $\Rightarrow$ less conservative than controlling the FWER.
$\mathrm{FDR} = E(Q)$, where $Q = FP/P$ if $P > 0$ and $Q = 0$ otherwise, with $P$ the number of genes declared DE.
Property: $\mathrm{FDR} \le \mathrm{FWER}$
[Figure. Source: M. Guedj, Pharnext]
The Benjamini-Hochberg (1995) procedure is one of the procedures that control the False Discovery Rate.
FDR
Principle: sort the p-values, $p_{(1)} \le \dots \le p_{(G)}$. The number of elements declared positive, $P$, is the largest $i$ such that $p_{(i)} \le i \alpha^* / G$.
Property: in the case of independent tests, $\mathrm{FDR} \le (G_0 / G)\,\alpha^* \le \alpha^*$, where $G_0$ is the number of non-DE genes.
Property: the Benjamini-Hochberg procedure takes $\pi_0 = G_0 / G = 1$.
Multiple testing: key points
Important to control for multiple tests.
The choice between FDR and FWER depends on the costs associated with FN and FP.
Controlling the FWER:
Having great confidence in the elements declared DE (strong control).
Accepting not to detect some elements (lack of power, hence few DE elements).
Controlling the FDR:
Accepting a proportion of FP among the elements declared DE. Very useful in exploratory studies.
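The Benjamini-Hochberg principle stated on the FDR slide (largest $i$ with $p_{(i)} \le i\alpha^*/G$) can be sketched as follows. The helper name `bh_rejections` and the p-values are made up for the example; `p.adjust` is the standard base-R function and is used only as a cross-check.

```r
# Number of rejections P under the Benjamini-Hochberg step-up rule.
bh_rejections <- function(p, alpha_star = 0.05) {
  G <- length(p)
  p_sorted <- sort(p)
  ok <- which(p_sorted <= seq_len(G) * alpha_star / G)
  if (length(ok) == 0) 0L else max(ok)   # largest i satisfying the bound
}

# Hypothetical p-values for G = 8 genes.
p <- c(0.0001, 0.004, 0.02, 0.022, 0.2, 0.45, 0.6, 0.8)

P_bh <- bh_rejections(p)

# Cross-check with adjusted p-values, and compare with Bonferroni.
n_bh   <- sum(p.adjust(p, method = "BH") <= 0.05)
n_bonf <- sum(p.adjust(p, method = "bonferroni") <= 0.05)
c(BH = n_bh, Bonferroni = n_bonf)
```

Note that the third p-value (0.02) exceeds its own bound (0.01875) but is still declared positive, because the rule takes the largest index that satisfies the inequality. On these values the FDR control declares more genes than the FWER control, in line with the key points above.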
Interpretation - Statistical and practical significance
Practical significance (importance) and statistical significance (detectability) have little to do with each other.
An effect can be important but undetectable (statistically insignificant) because the data are few, irrelevant, or of poor quality.
An effect can be statistically significant (detectable) even if it is small and unimportant, if the data are many and of high quality.
Differential Analysis: key points
Methods dedicated to microarrays are not applicable to RNA-seq.
Small number of replicates (2-3) or low expression $\Rightarrow$ be careful!
Large number of replicates (10 or so) or very high expression $\Rightarrow$ the choice of method does not matter much.
Removing genes with outlier counts or using non-parametric methods reduces the sensitivity to outliers.
Don't forget to correct for multiple testing!
Adapt the method to your data (number of replicates).
Specific methods have been developed for few replicates.
The need for 'sophisticated' methods decreases when the number of replicates increases.
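The distinction between practical and statistical significance made above can be illustrated with a generic power calculation. This sketch uses a two-sample t-test through `stats::power.t.test`, not an RNA-seq count model, and the effect sizes and sample sizes are arbitrary choices for the illustration.

```r
# Large effect (one standard deviation), only 3 replicates per group:
# important, but rarely detected.
power_few <- power.t.test(n = 3, delta = 2, sd = 2, sig.level = 0.05)$power

# Tiny effect (0.025 standard deviation), 100000 observations per group:
# unimportant, but almost always detected.
power_many <- power.t.test(n = 1e5, delta = 0.05, sd = 2, sig.level = 0.05)$power

round(c(few_replicates = power_few, many_replicates = power_many), 3)
```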
Other questions
Gene-Set Enrichment Analysis
These tests assume that all genes have the same chance of being declared DE, but longer and more highly expressed genes are sometimes over-detected.
GOSeq (Young et al. 2011)
Filter or not
Rau et al. 2013; Huber et al.
Conclusions
General conclusions
Practical conclusions
Biologists, bioinformaticians and statisticians need to collaborate, and in an ideal world from the very design of the project.
Adapt methods and tools to the question asked (no pipeline).
Check all the steps of the data analysis (quality, normalization, differential analysis, ...).
Statistics are useful for more than differential analysis with RNA-seq.
Acknowledgements
StatOmique (in particular M.-A. Dillies, A. Rau, C. Hennequet-Antier)
PEPI IBIS (INRA) Pôle planification expérimentale et RNA-Seq
David Robinson, Charlotte Soneson
From Kumamaru, 2012
References
Anders S, Huber W (2010) Differential expression analysis for sequence count data, Genome Biology, 11:R106.
Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W, Robinson MD (2013) Count-based differential expression analysis of RNA sequencing data using R and Bioconductor, Nature Protocols, doi:10.1038.
Bullard JH, Purdom E, Hansen KD, Dudoit S (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments, BMC Bioinformatics, 11:94.
Di Y, Schafer DW, Cumbie JS, Chang JH (2011) The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq, Statistical Applications in Genetics and Molecular Biology, 10(1), Article 24.
Dillies M-A et al. on behalf of The French StatOmique Consortium (2012) A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis, Briefings in Bioinformatics.
Kvam V, Liu P (2012) A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data.
Li J, Jiang H, Wong WH (2010) Modeling non-uniformity in short-read rates in RNA-Seq data, Genome Biology, 11:R50.
Li J, Witten DM, Johnstone IM, Tibshirani R (2011) Normalization, testing, and false discovery rate estimation for RNA-sequencing data, Biostatistics, 1-16.
Marioni JC, Mason CE et al. (2008) RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays, Genome Research, 18:1509-1517.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-seq, Nature Methods, 5(7):621-628.
Pachter L (2011) Models for transcript quantification from RNA-seq.
Rapaport et al. (2013) Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data, Genome Biology, 14:R95.
Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics.
Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-seq data, Genome Biology, 11:R25.
Robinson MD, Smyth GK (2008) Small-sample estimation of negative binomial dispersion, with applications to SAGE data, Biostatistics, 9(2):321-332.
Robles et al. (2012) Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing, preprint.
Soneson C, Delorenzi M (2013) A comparison of methods for differential expression analysis of RNA-seq data, BMC Bioinformatics, 14:91.
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L (2011) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nature Biotechnology, 28(5):511-515.
Young MD, Wakefield MJ, Smyth GK, Oshlack A (2011) Gene ontology analysis for RNA-seq: accounting for selection bias, Genome Biology.
Notations
$x_{ij}$: number of reads for gene $i$ in sample $j$
$N_j$: number of reads in sample $j$ (library size of sample $j$)
$n$: number of samples in the experiment
$\hat{s}_j$: normalization factor associated with sample $j$
$L_i$: length of gene $i$
RPKM normalization
Reads Per Kilobase per Million mapped reads
Adjusts for lane sequencing depth (library size) and gene length.
Motivation: greater lane sequencing depth and gene length => greater counts, whatever the expression level.
Assumption: read counts are proportional to expression level, gene length and sequencing depth (same RNAs in equal proportion).
Method: divide the gene read count by the total number of reads (in millions) and by the gene length (in kilobases):
$$\mathrm{RPKM}_{ij} = \frac{x_{ij}}{N_j \times L_i} \times 10^3 \times 10^6 \quad (1)$$
Allows expression levels to be compared between genes of the same sample.
Unbiased estimation of the number of reads, but affects the variance (Oshlack et al. 2009).
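Equation (1) translates directly into base R. In this sketch the function name `rpkm`, the toy count matrix and the gene lengths are invented; gene lengths are taken in bases (implied by the factor $10^3$), and the library size $N_j$ is assumed to be the column total of the count matrix.

```r
# RPKM for a genes x samples matrix of raw counts.
# x: raw counts; gene_length: gene lengths in bases, one per row of x.
rpkm <- function(x, gene_length) {
  N <- colSums(x)                              # library size N_j per sample
  sweep(x, 2, N, "/") / gene_length * 1e3 * 1e6
}

# Toy data: 3 genes, 2 samples (illustrative values only).
x <- matrix(c(100, 200, 700,
              300, 300, 2400),
            nrow = 3,
            dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_length <- c(g1 = 1000, g2 = 2000, g3 = 500)

r <- rpkm(x, gene_length)
r
```

In sample s1, g2 has twice the count of g1 but is twice as long, so both genes get the same RPKM; g1 has three times more reads in s2 than in s1, but s2 is three times deeper, so its RPKM is unchanged.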
The Effective Library Size concept
Motivation
Different biological conditions express different RNA repertoires, leading to different total amounts of RNA.
Assumption
A majority of transcripts is not differentially expressed.
Aim
Minimizing the effect of (very) majority sequences.
Methods based on the Effective Library Size concept
Trimmed Mean of M-values, Robinson et al. 2010 (edgeR)
Filter out transcripts with null counts, and the 30% and 5% most extreme $M_i$ and $A$ values, respectively, where
$$M_i = \log_2\left(\frac{Y_{ik}/N_k}{Y_{ik'}/N_{k'}}\right)$$
with $Y_{ik}$ the count of transcript $i$ in sample $k$ and $N_k$ the library size of sample $k$.
Hypothesis: we may not be able to estimate the total RNA production in one condition, but we may estimate a global expression change between two conditions from the distribution of the non-extreme $M_i$.
Calculation of a scaling factor between two conditions, and normalization to avoid dependence on a reference sample.
Anders and Huber 2010 - Package DESeq
$$\hat{s}_j = \mathrm{median}_i \, \frac{k_{ij}}{\left(\prod_{v=1}^{m} k_{iv}\right)^{1/m}}$$
$k_{ij}$: number of reads in sample $j$ assigned to gene $i$
denominator: pseudo-reference sample created from the geometric mean across the $m$ samples
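Both normalization factors can be sketched in base R. `size_factors` follows the median-of-ratios formula above. `tmm_like` is a deliberately simplified, unweighted version of the trimmed mean of M-values: the edgeR implementation additionally uses precision weights, so this is not a substitute for `calcNormFactors`. Function names and data are invented for the example.

```r
# DESeq-style size factors: median ratio to a geometric-mean pseudo-reference.
size_factors <- function(k) {
  log_geo_mean <- rowMeans(log(k))            # log geometric mean per gene
  use <- is.finite(log_geo_mean)              # drop genes with a zero count
  apply(k, 2, function(col) exp(median(log(col[use]) - log_geo_mean[use])))
}

# Simplified TMM-like factor of sample k relative to a reference sample.
tmm_like <- function(y_k, y_ref, trim_m = 0.30, trim_a = 0.05) {
  keep <- y_k > 0 & y_ref > 0                 # filter out null counts
  r_k <- y_k[keep] / sum(y_k)
  r_ref <- y_ref[keep] / sum(y_ref)
  M <- log2(r_k / r_ref)                      # log fold changes
  A <- 0.5 * log2(r_k * r_ref)                # average log abundance
  in_m <- M >= quantile(M, trim_m) & M <= quantile(M, 1 - trim_m)
  in_a <- A >= quantile(A, trim_a) & A <= quantile(A, 1 - trim_a)
  2^mean(M[in_m & in_a])                      # unweighted trimmed mean
}

# Toy data: sample s2 is sample s1 sequenced twice as deeply.
k <- cbind(s1 = c(10, 20, 40, 80, 160), s2 = c(20, 40, 80, 160, 320))
sf <- size_factors(k)
tmm <- tmm_like(k[, "s2"], k[, "s1"])
```

For a pure depth difference the size factors differ by a factor of 2, while the TMM-like factor equals 1: it corrects library sizes for composition effects, and here there are none.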
Length bias (Oshlack 2009, Bullard et al. 2010)
At the same expression level, a long transcript will have more reads than a shorter transcript. Number of reads $\neq$ expression level.
$$\mu = E(X) = cNL = \mathrm{Var}(X)$$
$X$: measured number of reads in a library mapping to a specific transcript, Poisson r.v.
$c$: proportionality constant
$N$: total number of transcripts
$L$: gene length
Test:
$$t = \frac{X_1 - X_2}{\sqrt{cN_1L + cN_2L}}$$
The power of the test depends on a parameter proportional to $\sqrt{L}$.
Identical result after normalization by gene length (but outside the Poisson framework).
Negative Binomial Models
$\mu_{ijk} = \lambda_{ij} m_{jk}$ where $m_{jk}$: size factor (library size)
Test: $H_{0i}: \lambda_{iA} = \lambda_{iB}$ vs $H_{1i}: \lambda_{iA} \neq \lambda_{iB}$
edgeR
Adjust observed counts up or down depending on whether library sizes are below or above the geometric mean => creates approximately identically distributed pseudodata.
Estimation of $\phi_i$ by conditional ML, conditioning on the total count (TC) for gene $i$.
Empirical Bayes procedure to shrink dispersions toward a consensus value.
An exact test analogous to Fisher's exact test but adapted to overdispersed data (Robinson and Smyth 2008).
DESeq
Test similar to Fisher's exact test (calculation has changed).
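The length-bias argument (power driven by a quantity proportional to $\sqrt{L}$) can be checked numerically. In this sketch the expected counts are plugged into the statistic $t$, with a different proportionality constant per condition (`c1`, `c2`) to represent a true expression difference; these constants, the library sizes and the helper name `expected_t` are assumptions made for the illustration.

```r
# Signal-to-noise ratio of the Poisson test for a gene of length L.
expected_t <- function(L, c1, c2, N1 = 1e6, N2 = 1e6) {
  mu1 <- c1 * N1 * L                  # expected count, condition 1
  mu2 <- c2 * N2 * L                  # expected count, condition 2
  (mu1 - mu2) / sqrt(mu1 + mu2)       # Poisson: variance equals mean
}

L <- c(500, 2000, 8000)               # same fold change, increasing length
t_vals <- expected_t(L, c1 = 2e-8, c2 = 1e-8)

t_vals
t_vals / sqrt(L)                      # constant across gene lengths
```

With a fixed two-fold expression change, a gene four times longer yields a statistic twice as large, so long genes are more easily declared DE.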
Negative Binomial Models - DESeq
Assumptions:
1. $Y_{ijk} \sim NB(\mu_{ijk}, \sigma^2_{ijk})$, where $\mu_{ijk}$ is the mean and $\sigma^2_{ijk}$ is the variance.
2. The mean $\mu_{ijk}$ is the product of a condition-dependent per-gene value $\lambda_{ij}$ and a size factor (library size) $m_{jk}$: $\mu_{ijk} = \lambda_{ij} m_{jk}$
3. Variance decomposition: the variance $\sigma^2_{ijk}$ is the sum of a shot noise term and a raw variance term: $\sigma^2_{ijk} = \mu_{ijk} + \alpha_i \mu_{ijk}^2$, where $\alpha_i$ is the dispersion value.
4. The per-gene dispersion $\alpha_i$ (or pooled $\alpha$) is a smooth function of the mean: $\alpha_i = f_j(\lambda_{ij})$
DESeq Bioconductor package
Three sets of parameters need to be estimated:
1. Size factors $m_{jk}$ (normalization factors) (see normalization part).
2. For each experimental condition $j$, the expression strength parameters $\lambda_{ij}$ (one per gene), estimated by the average of the counts from the replicates of the condition, transformed to the common scale:
$$\hat{\lambda}_{ij} = \frac{1}{r_j} \sum_{k} \frac{y_{ijk}}{\hat{m}_{jk}}$$
where $r_j$ is the number of replicates in condition $j$.
3. The smooth functions $f_j$, for each condition $j$, to model the dependence of $\alpha_i$ on the expected mean $\lambda_{ij}$: local or gamma GLM estimation (fit='local' or fit='parametric').
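The variance decomposition and the estimator $\hat{\lambda}_{ij}$ can be illustrated in base R. This is a sketch with made-up values: `rnbinom` is parameterized with `size = 1 / alpha`, which gives the variance $\mu + \alpha\mu^2$, and the size factors `m_hat` are simply assumed rather than estimated.

```r
set.seed(1)

# Variance of the negative binomial: shot noise + raw variance.
nb_variance <- function(mu, alpha) mu + alpha * mu^2
mu <- 10
alpha <- 0.2
y_sim <- rnbinom(1e5, size = 1 / alpha, mu = mu)
c(theoretical = nb_variance(mu, alpha), empirical = var(y_sim))

# Expression strength per gene and condition: average of the counts
# divided by the size factors (counts brought to the common scale).
y <- cbind(A1 = c(10, 50), A2 = c(30, 150), B1 = c(20, 40), B2 = c(20, 40))
cond <- factor(c("A", "A", "B", "B"))
m_hat <- c(0.5, 1.5, 1, 1)                    # hypothetical size factors
scaled <- sweep(y, 2, m_hat, "/")
lambda_hat <- sapply(levels(cond), function(j) {
  rowMeans(scaled[, cond == j, drop = FALSE])
})
lambda_hat
```

The raw counts of the two replicates of condition A differ by a factor of 3, but after division by the size factors both replicates agree, which is what "transformed to the common scale" means.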
Practical considerations
Input data = raw counts
Normalization offsets are included in the model.
Each column = independent biological replicate
Version matters: edgeR 2.6.7 and DESeq 1.6.1 (Bioconductor 2.9)
edgeR
TMM normalization
Common dispersion must be estimated before tagwise dispersions.
GLM functionality (for experiments with multiple factors) now available.
DESeq
Two possibilities to obtain the smooth functions $f_j(\cdot)$
Conservative estimates: genes are assigned the maximum of the fitted and empirical values of $\alpha_i$ (sharingMode = "maximum")
Local fit regression (as described in the paper) is no longer the default.
First commands
Installing the packages:
# biocLite() matches Bioconductor 2.9; recent releases install packages with BiocManager::install()
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("DESeq", "edgeR"))
Loading the packages:
library(DESeq)
library(edgeR)
edgeR main commands
# generate raw counts from NB, create list object
counts <- matrix(rnbinom(80, size = 1/0.2, mu = 10), nrow = 20, ncol = 4)
rownames(counts) <- paste("Gene", 1:nrow(counts), sep = ".")
group <- factor(c(1, 1, 2, 2))
# perform DA with edgeR
y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y, method = "TMM")
y <- estimateCommonDisp(y)
y <- estimateTagwiseDisp(y)
result <- exactTest(y, dispersion = "tagwise")
# observe some results - DGE with FDR BH
topTags(result)
summary(decideTestsDGE(result, p.value = 0.05))
DESeq main commands
# same raw count matrix and groups as above
cds <- newCountDataSet(counts, group)
cds <- estimateSizeFactors(cds)
sizeFactors(cds)
cds <- estimateDispersions(cds)
res <- nbinomTest(cds, "1", "2")
Some references to get started
http://www.r-project.org/ : manual, FAQ, R Journal, etc.
http://www.bioconductor.org/help/publications/
cran.r-project.org/doc/contrib/Paradis-rdebuts_fr.pdf
G. Millot (2009), Comprendre et réaliser les tests statistiques à l'aide de R, Editions De Boeck, 704 p.