Slides - Alan Moses

Transcription

Bioinformatics, statistics and multiple testing
Alan Moses
ML4bio
With slides from Quaid Morris
Outline for Today
• Bioinformatics
– GO and other annotations
– The annoying thing about bioinformatics
• Review of hypothesis testing
– Parametric vs. non-parametric tests
– Exact tests
– Multivariate hypothesis testing
• Multiple hypothesis testing
– Bonferroni, FDR
– Application to gene set enrichment analysis
ConDens kinase substrate prediction
• Andy Lai was an MSc student in my lab who
developed a cool new way to predict kinase
substrates based on amino acid sequence
alignments.
• He predicted new lists of substrates for some
kinases, and wanted to show that the predictions
were good, without doing any experiments.
• Gene Set Enrichment Analysis is the answer
List of predicted Cbk1
targets in yeast
BOI1
SEC3
MPT5
SSD1
DSF2
FIR1
YNL058C
KIN1
YGR117C
KIN2
IRC8
YJL016W
ACE2
RGA2
List of predicted Cbk1
targets in Drosophila
CG8617
Oatp30B
CG9467
ec
pan
Where Do Gene Lists Come From?
• Molecular profiling e.g. mRNA, protein
– Identification → Gene list
– Quantification → Gene list + values
– Ranking, Clustering (biostatistics)
• Interactions: Protein interactions, microRNA targets, transcription factor binding sites (ChIP)
• Genetic screen e.g. of knock out library
• Association studies (Genome-wide)
– Single nucleotide polymorphisms (SNPs)
– Copy number variants (CNVs)
Quaid Morris
What is the Gene Ontology (GO)?
www.geneontology.org
• Set of biological phrases (terms) which are applied to genes:
– protein kinase
– apoptosis
– membrane
• Dictionary: term definitions
• Ontology: A formal system for describing knowledge
Jane Lomax @ EBI
GO Structure
• Terms are related within a hierarchy
– is-a
– part-of
• Describes multiple levels of detail of gene function
• Terms can have more than one parent or child
What GO Covers?
• GO terms divided into three aspects:
– cellular component
– molecular function (e.g. glucose-6-phosphate isomerase activity)
– biological process (important pathway source; e.g. cell division)
Terms
• Where do GO terms come from?
– GO terms are added by editors at EBI and gene annotation database groups
– Terms added by request
– Experts help with major development
– 32029 terms, >99% with definitions (as of July 15, 2010):
• 19639 biological_process
• 2859 cellular_component
• 9531 molecular_function
Annotations
• Genes are linked, or associated, with GO terms by trained curators at genome databases
– Known as ‘gene associations’ or GO annotations
– Multiple annotations per gene
• Some GO annotations created automatically (without human review)
Annotation Sources
• Manual annotation
– Curated by scientists
• High quality
• Small number (time‐consuming to create)
– Reviewed computational analysis
• Electronic annotation
– Annotation derived without human validation
• Computational predictions (accuracy varies)
• Lower ‘quality’ than manual codes
• Key point: be aware of annotation origin
Evidence Types
• Experimental Evidence Codes
– EXP: Inferred from Experiment
– IDA: Inferred from Direct Assay
– IPI: Inferred from Physical Interaction
– IMP: Inferred from Mutant Phenotype
– IGI: Inferred from Genetic Interaction
– IEP: Inferred from Expression Pattern
• Computational Analysis Evidence Codes
– ISS: Inferred from Sequence or Structural Similarity
– ISO: Inferred from Sequence Orthology
– ISA: Inferred from Sequence Alignment
– ISM: Inferred from Sequence Model
– IGC: Inferred from Genomic Context
– RCA: Inferred from Reviewed Computational Analysis
• Author Statement Evidence Codes
– TAS: Traceable Author Statement
– NAS: Non-traceable Author Statement
• Curator Statement Evidence Codes
– IC: Inferred by Curator
– ND: No biological Data available
• IEA: Inferred from Electronic Annotation
See http://www.geneontology.org
Wide & Variable Species Coverage
Lomax J. Get ready to GO! A biologist's guide to the Gene Ontology. Brief Bioinform. 2005 Sep;6(3):298-304.
Accessing GO: QuickGO
http://www.ebi.ac.uk/ego/
See also AmiGO: http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
Biomart 0.7
Ensembl BioMart
• Convenient access to gene list annotation
– Select genome
– Select filters
– Select attributes to download
Sources of Gene Attributes
• Ensembl BioMart (eukaryotes)
– http://www.ensembl.org
• Entrez Gene (general)
– http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
• Model organism databases
– E.g. SGD: http://www.yeastgenome.org/
• Also available through R
Why is it all such a mess?
• Naming of molecules was done by whoever found it first.
o Proteins and Genes do not always have consistent names.
o More important genes that were studied by many groups have many names. Competing research groups may purposefully omit the name(s) used by other groups
• Database identifiers (IDs) are unique, stable names or numbers that help track database records, but…
o Each database will typically use its own internal IDs and naming conventions
o The more important a gene/protein is, the more databases will have information for it, so it will have many IDs
o Databases are frequently updated, so we always have to keep track of the database version that was used
• Records for: Gene, DNA, RNA, Protein
o Important to recognize the correct record type
o Different data sources pertain to different data types (e.g., Pfam only has proteins)
o The relationship between Genes, DNA, RNA and Proteins is not 1 to 1
Common Identifiers
• Gene
– Ensembl: ENSG00000139618
– Entrez Gene: 675
– Unigene: Hs.34012
• RNA transcript
– GenBank: BC026160.1
– RefSeq: NM_000059
– Ensembl: ENST00000380152
• Protein
– Ensembl: ENSP00000369497
– RefSeq: NP_000050.2
– UniProt: BRCA2_HUMAN or A1YBP1_HUMAN
– IPI: IPI00412408.1
– EMBL: AF309413
– PDB: 1MIU
• Species-specific
– HUGO HGNC: BRCA2
– MGI: MGI:109337
– RGD: 2219
– ZFIN: ZDB-GENE-060510-3
– FlyBase: CG9097
– WormBase: WBGene00002299 or ZK1067.1
– SGD: S000002187 or YDL029W
• Annotations
– InterPro: IPR015252
– OMIM: 600185
– Pfam: PF09104
– Gene Ontology: GO:0000724
– SNPs: rs28897757
• Experimental Platform
– Affymetrix: 208368_3p_s_at
– Agilent: A_23_P99452
– CodeLink: GE60169
– Illumina: GI_4502450-S
Red = Recommended
ID Mapping Services
• Synergizer
– http://llama.med.harvard.edu/synergizer/translate/
• Ensembl BioMart
– http://www.ensembl.org
• PICR (proteins only)
– http://www.ebi.ac.uk/Tools/picr/
• R language annotation databases
– http://www.bioconductor.org
ID Mapping Challenges
• Avoid errors: map IDs correctly
• Gene name ambiguity – not a good ID
– e.g. FLJ92943, LFS1, TRP53, p53
– Better to use the standard gene symbol: TP53
• Excel error-introduction
– OCT4 is changed to October-4
• Problems reaching 100% coverage
– E.g. due to version issues
– Use multiple sources to increase coverage
Zeeberg BR et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics. 2004 Jun 23;5:80
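Both pitfalls (ambiguous gene names and Excel turning symbols into dates) can be checked before any analysis. The following is only a sketch: the alias table is hand-made from the TP53 aliases on the slide, the names `ALIAS_TO_SYMBOL`, `to_standard_symbol` and `looks_excel_mangled` are hypothetical, and the date pattern is an assumption, since Excel's actual output depends on locale and cell format.

```python
import re

# Hypothetical alias table; in practice it would come from an ID mapping
# service. The entries are the TP53 aliases listed on the slide.
ALIAS_TO_SYMBOL = {
    "FLJ92943": "TP53",
    "LFS1": "TP53",
    "TRP53": "TP53",
    "P53": "TP53",
    "TP53": "TP53",
}

# Date-like strings such as "4-Oct" or "October-4" (assumed pattern).
_MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
_DATE_LIKE = re.compile(
    rf"^(\d{{1,2}}-({_MONTHS})[a-z]*|({_MONTHS})[a-z]*-\d{{1,2}})$",
    re.IGNORECASE,
)


def to_standard_symbol(name):
    """Return the standard symbol for a gene name, or None if it is unmapped."""
    return ALIAS_TO_SYMBOL.get(name.strip().upper())


def looks_excel_mangled(name):
    """True if the string looks like a date rather than a gene name."""
    return bool(_DATE_LIKE.match(name.strip()))
```

Unmapped names come back as `None` rather than being guessed, so coverage problems stay visible.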
Summary so far
• GO (and other functional annotations) are a great way to tell us about the functions of a list of genes
• In order to use these, we need to compare our gene list to what’s in the GO database…
– Genes and their products and attributes have many identifiers (IDs)
– Bioinformatics often means converting or mapping IDs from one type to another
– ID mapping services are available
– Use standard, commonly used IDs to reduce ID mapping challenges
Outline for Today
• Bioinformatics
– GO and other annotations
– The annoying thing about bioinformatics
• Review of hypothesis testing
– Parametric vs. non-parametric tests
– Exact tests
– Multivariate hypothesis testing
• Multiple hypothesis testing
– Bonferroni, FDR
– Application to gene set enrichment analysis
What is a P-value?
• A) The probability that the null hypothesis is true
• B) Probability of a test statistic under the null distribution
• C) Probability of an incorrect rejection of the null hypothesis
• D) Some subset of the above
Modified from
Quaid Morris
Answer: None of these!!
What is a P-value?
• Probability of observing something as extreme or more under the null hypothesis
What is this thing?
• Usually it’s a “test statistic” but it can be any summary of the data…
• Always a sum or integral over the “tail” or “tails” of a distribution.
Hypothesis testing
• Random variables:
– H: H0 (null hypothesis) or H1 (alternative hypothesis)
– Data: X1, X2, … XN (independent and identically distributed – IID)
– t is a test statistic, t = f(X)
– t* observed value of test statistic
• Parameters:
– α: significance level
– Reject H0 if P-value < α
• P-value is:
– Pr[ t is “as or more extreme” than t* | H0 is true ]
P-value versus false rejections
• P-value is:
– Pr[ t is “as or more extreme” than t* | H0 is true ]
• False rejection probability:
– Pr[ H0 is true | H0 is rejected ]
– aka “False discovery rate”
P-value facts
• Note that: Pr[P-value < p | H0 is true] = p
• So under the null distribution, P-value is a random variable that is uniformly distributed between 0 and 1.
• Given different tests with P-values p1, p2, …, pN you can combine them into a single P-value. “Fisher’s method”
• Fisher figured out that the test statistic $X^2 = -2 \sum_i \ln p_i$ is chi-square with 2N degrees of freedom if the p’s are independent and uniform on [0,1]
• Sometimes called “meta analysis” because you can combine the results of many analyses this way
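Fisher's method is short enough to write out. This is a minimal sketch that assumes the combined tests are independent and that every P-value is greater than 0; `chi2_sf_even_df` and `fisher_combine` are hypothetical helper names. It uses the closed form of the chi-square tail for an even number of degrees of freedom, so no statistics library is needed.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2N the tail has the closed form
    exp(-x/2) * sum_{k=0}^{N-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def fisher_combine(p_values):
    """Return (X2 statistic, combined P-value) for independent P-values."""
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, 2 * len(p_values))


statistic, combined_p = fisher_combine([0.05, 0.05])
```

A useful sanity check: with a single P-value the combined value equals the input.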
E.g., 2-sample tests
[Figure: gene expression matrix; axes: samples, genes]
• We make several observations under two situations, and we want to find out whether there is a statistical difference.
• Which genes have differential expression in the different tumor types?
Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, Demeter
J, Perou CM, Lønning PE, Brown PO, Børresen-Dale AL, Botstein D. Repeated observation of breast tumor
subtypes in independent gene expression data sets. Proc Natl Acad Sci U S A. 2003 Jul 8;100(14):8418-23.
2 sample test
Gene expression levels for a single gene in two groups of samples: Normal Breast-like and Basal subtype
[Figure: distribution of gene expression levels in the two groups; x-axis: gene expression, y-axis: probability density]
Question: How likely is it that the difference between the two samples is due to chance?
2 sample t-test
Summarize the data with the so-called “t-statistic”
Normal Breast-like: N1 = 10
Mean: m1 = 5.6
Std: s1 = 1.6
Basal subtype: N2 = 13
Mean: m2 = 0.3
Std: s2 = 1.0
$$ T\text{-statistic} = \frac{m_1 - m_2}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}} $$
[Figure: distribution of gene expression levels in the two groups; x-axis: gene expression, y-axis: probability density]
H0: Black and red scores are drawn from a distribution with the same mean
H1: The two means are not equal
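As a worked example, the t-statistic for the summary numbers on this slide can be computed directly. This is a sketch: the function name `two_sample_t_statistic` is hypothetical, and the formula is the unequal-variance form, difference of means divided by sqrt(s1^2/N1 + s2^2/N2), often called Welch's t. Turning the statistic into a P-value needs the T-distribution, which the Python standard library does not provide, so that step is left out.

```python
import math


def two_sample_t_statistic(m1, s1, n1, m2, s2, n2):
    """t-statistic from the means, standard deviations and sizes of two groups."""
    standard_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (m1 - m2) / standard_error


# Normal Breast-like: N1 = 10, mean 5.6, std 1.6
# Basal subtype:      N2 = 13, mean 0.3, std 1.0
t = two_sample_t_statistic(5.6, 1.6, 10, 0.3, 1.0, 13)
```

For these numbers t comes out at roughly 9.2, far out in the tail of the null distribution.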
2 sample t-test
$$ T\text{-statistic} = \frac{m_1 - m_2}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}} $$
Distribution of this statistic is known under the null hypothesis (the T-distribution)
P-value = shaded area * 2
[Figure: T-distribution (probability density) centered at 0, with the tail beyond the observed T-statistic shaded]
Examples of inappropriate distributions for T-tests
T-test assumes data are (approximately) normally distributed
T-test detects differences between means, not necessarily between distributions
• Values are positive and have increasing density near zero, e.g. sequence counts
• Bimodal “two-bumped” distributions.
• Distributions with outliers, or “heavy-tailed” distributions
[Figures: probability density vs. gene expression for each of the three cases]
Solutions: “non-parametric two-sample tests”
1) Robust test for difference of medians (WMW)
2) Direct test of difference of distributions (K-S)
Wilcoxon Rank Sum
aka Mann-Whitney U test or simply “WMW”
Enrichment analysis with two-sample, not paired
1) Rank gene scores, calculate RB, sum of ranks of black values
Red values (N2 = 5): 2.1, 5.6, -1.1, -2.5, -0.5
Black values (N1 = 5): 3.2, 1.7, 6.5, 4.5, 0.1
Ranked scores (rank 1 = largest): 6.5, 5.6, 4.5, 3.2, 2.1, 1.7, 0.1, -0.5, -1.1, -2.5
RB = 21
[Figure: probability density of red and black scores; x-axis: gene expression]
H0: Probability that red ranks are greater than black ranks is 0.5
H1: red ranks are greater than black ranks
Wilcoxon-Mann-Whitney (WMW) test
aka Mann-Whitney U-test, Wilcoxon rank-sum test
2) Calculate Z-score:
$$ Z = \frac{R_B - N_1 \frac{N_1 + N_2 + 1}{2}}{\sigma_U} = -1.4 $$
where RB = 21 and (N1 + N2 + 1)/2 is the mean rank
3) Calculate P-value:
P-value = shaded area * 2
[Figure: Normal distribution (probability density) of Z centered at 0, with the tail beyond Z = -1.4 shaded]
H0: Probability that a random sample from distribution of red scores is > than one from black is 0.5
H1: Otherwise
WMW test details
• Described method is only applicable for large N1 and N2 and when there are no tied scores
• WMW test is robust to (a few) outliers
$$ \sigma_U = \sqrt{\frac{N_1 N_2 (N_1 + N_2 + 1)}{12}} $$
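The three steps (rank, Z-score, P-value) can be reproduced for the ten example scores. This sketch assumes no tied scores and ranks the largest score as 1, as on the slide, with σU taken as sqrt(N1 N2 (N1 + N2 + 1) / 12). The function name `wmw_z_test` is hypothetical, and the assignment of values to "black" and "red" is the one consistent with RB = 21. With only five values per group the normal approximation is rough, as the caveat above says.

```python
import math


def wmw_z_test(black, red):
    """Rank-sum test via the normal approximation (no ties, rank 1 = largest)."""
    n1, n2 = len(black), len(red)
    pooled = sorted(list(black) + list(red), reverse=True)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    r_b = sum(rank[v] for v in black)            # sum of ranks of black values
    expected_r_b = n1 * (n1 + n2 + 1) / 2        # N1 times the mean rank
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (r_b - expected_r_b) / sigma_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # both normal tails
    return r_b, z, p_two_sided


black = [3.2, 1.7, 6.5, 4.5, 0.1]
red = [2.1, 5.6, -1.1, -2.5, -0.5]
r_b, z, p_value = wmw_z_test(black, red)
```

This gives RB = 21 and Z of about -1.4, matching the slide; the two-sided P-value is a little under 0.18.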
Kolmogorov-Smirnov (K-S) test for difference of distributions
1) Calculate cumulative distributions of red and black
[Figure: probability density of red and black scores and the corresponding empirical (cumulative) distributions; x-axis: gene expression, y-axis: cumulative probability from 0 to 1.0]
Distance = 0.4
Test statistic: Maximum vertical difference between the two cumulative distributions
Distribution of test statistic is known regardless of the underlying distributions
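The K-S test statistic itself is simple to compute. The sketch below (hypothetical function name `ks_statistic`) evaluates both empirical cumulative distributions at every observed value and takes the largest gap. It is applied to the ten scores from the rank-sum example, which are not the data behind the figure here, so the result is not expected to equal the 0.4 shown. Converting the statistic to a P-value is not covered.

```python
def ks_statistic(sample_a, sample_b):
    """Maximum vertical distance between two empirical cumulative distributions."""
    def ecdf(sample, x):
        # Fraction of the sample that is less than or equal to x.
        return sum(1 for v in sample if v <= x) / len(sample)

    points = list(sample_a) + list(sample_b)
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


black = [3.2, 1.7, 6.5, 4.5, 0.1]
red = [2.1, 5.6, -1.1, -2.5, -0.5]
distance = ks_statistic(black, red)
```

The empirical distributions only change at observed values, so checking those points is enough to find the maximum.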
WMW and K-S test caveats
• Neither test is as sensitive as the T-test, i.e. they require more data points to detect the same amount of difference, so use the T-test whenever it is valid.
• K-S test and WMW can give you different answers: K-S detects difference of distributions, WMW detects whether samples from one tend to be higher than those from the other (or vice versa)
• Technical issue: Tied scores and/or small # of observations can be a problem for some implementations of the WMW or K-S test
Central limit theorem
• If you have a moderately large sample, you can do statistical tests that don’t depend on assumptions about the distribution of the data
→ E.g., black data mean is almost certainly greater than red mean, but there are a lot of tied ‘0’ values that might mess up K-S and WMW tests.
Central Limit Theorem: Distribution of the estimate of the mean is Gaussian. (Assuming your sample is big enough, i.i.d., and that the variance is finite)
[Figure: probability density of red and black data; x-axis: gene expression]
Under the null hypothesis, average red = average black and is N(μ,σ²), where μ is the mean and σ² is the variance.
What is the distribution of my data?
• Because of the central limit theorem and
permutation tests, you don’t usually have to
worry about it
• A good way to check is using a “qq-plot”.
– This compares the “theoretical quantiles” of a
particular distribution to the quantiles in your data.
– If they don’t disagree too badly, you can usually
be safe assuming your data are consistent with
that distribution
• With large genomics data sets, you will have
enough power to reject the hypothesis that
your data “truly” come from any distribution
Permutation tests
• Often, the null distribution of the test statistic is unclear or not analytical.
• In these cases, you can generate an empirical distribution by sampling from the null distribution and then evaluating your test statistic against this distribution.
• In many genomic applications it is often possible to get a sample from the null distribution by randomizing (i.e. permuting) the association between genes and corresponding data.
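A permutation test along these lines might look as follows. This is a sketch under stated assumptions: the test statistic is the absolute difference in group means, the null hypothesis is that group labels are exchangeable, and the function name `permutation_test`, the number of permutations and the fixed seed are all illustrative choices. The example data are the ten scores from the rank-sum example.

```python
import random


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation P-value for a difference in means."""
    def mean_diff(a, b):
        return sum(a) / len(a) - sum(b) / len(b)

    rng = random.Random(seed)
    observed = abs(mean_diff(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomize which scores carry which label
        permuted = abs(mean_diff(pooled[:n_a], pooled[n_a:]))
        if permuted >= observed - 1e-12:  # small tolerance for float round-off
            as_extreme += 1
    # Count the observed labelling itself, so the P-value is never exactly 0.
    return (as_extreme + 1) / (n_permutations + 1)


black = [3.2, 1.7, 6.5, 4.5, 0.1]
red = [2.1, 5.6, -1.1, -2.5, -0.5]
p_value = permutation_test(black, red)
```

What gets shuffled defines the null hypothesis: here it is the labels, which is only appropriate if the observations are exchangeable under H0.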
When permuting, you have to think deep thoughts about what your null hypothesis really is.
Janusz Dutkowski, Michael Kramer, Michal A Surma, Rama Balakrishnan, J Michael Cherry, Nevan J Krogan & Trey
Ideker Nature Biotechnology 31, 38–45 (2013) A gene ontology inferred from molecular networks
Exact tests
• Sometimes, the probability of an observation as extreme or more can be calculated directly under the H0
• In this case there is no “test statistic”
• E.g., “binomial test”, “Fisher’s Exact Test” and “hypergeometric test” (use for Gene Set Enrichment Analysis)
• These tests are feasible now because computers calculate these probabilities
E.g., Binomial test
• You did a poll where you get “yes” or “no” answers each time, and you have some prior belief about the frequency of “yes” or “no” under the null hypothesis. E.g., if people don’t care, then p should be 50%
P-value = Pr(73 or more “yes” | 102 total, p=50%)
$$ \text{P-value} = \sum_{X=73}^{102} \binom{102}{X} (0.5)^{X} (1 - 0.5)^{102 - X} $$
where
$$ \binom{n}{k} = \frac{n!}{(n-k)!\,k!} $$
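The binomial P-value is just the tail sum written out term by term, which is easy to do with exact binomial coefficients. This is a sketch; `binomial_upper_tail` is a hypothetical helper name.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Pr(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k, n + 1))


# 73 or more "yes" answers out of 102, with p = 50% under the null hypothesis.
p_value = binomial_upper_tail(73, 102, 0.5)
```

For the poll example the result is on the order of 1e-5, so "people don't care" is a poor explanation of 73 "yes" answers.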
E.g., Fisher’s Exact test
• You developed a prediction method where you got a 2 x 2 table as the result
[Table: observed (positive, negative) × predicted (positive, negative), with counts 14, 178, 7, 31]
I won’t bother you with the formula, but the probability of the “configuration” of the 2 x 2 table can be calculated exactly
→ doesn’t make any assumption about the distribution of positives and negatives
P-value = Pr(a “configuration” as extreme or more | no association)
To calculate this, you need to sum up a lot of possible tables
According to R, in this case P-value = 0.05666
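The "sum over possible tables" can be made explicit. The sketch below assumes the four counts form the table [[14, 178], [7, 31]] (the transposed layout gives the same P-value) and takes "as extreme or more" to mean every table with the same margins that is no more probable than the observed one. The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    lowest = max(0, col1 - (total - row1))
    highest = min(row1, col1)
    # Unnormalised hypergeometric weight of each table with the same margins,
    # indexed by the value of the top-left cell. Integers keep this exact.
    weights = {
        x: comb(row1, x) * comb(total - row1, col1 - x)
        for x in range(lowest, highest + 1)
    }
    observed = weights[a]
    # Sum every table that is no more probable than the observed one.
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / comb(total, col1)


p_value = fisher_exact_two_sided(14, 178, 7, 31)
```

Under these assumptions the result is about 0.0567, in agreement with the value quoted from R.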
The hypergeometric test
Gene list
RRP6
MRD1
RRP7
RRP43
RRP42
H0: List is a random sample from population
H1: More black genes than expected
Background population: 500 black genes, 4500 red genes
The hypergeometric function
Probability a random sample of k genes contains q black genes when the background population contains m black genes out of n total genes:
(# ways to choose q out of m genes) × (# ways to choose k−q out of n−m genes) / (# ways to choose k out of n genes)
$$ \Pr(q) = \frac{\binom{m}{q}\binom{n-m}{k-q}}{\binom{n}{k}} $$
$$ \binom{n}{k} = \frac{n!}{(n-k)!\,k!} $$
is called “n choose k”; for details see http://www.khanacademy.org/video/combinations
The hypergeometric test
Gene list
RRP6
MRD1
RRP7
RRP43
RRP42
[Figure: null distribution of the number of black genes in the list, with the P-value tail marked]
$$ \text{P-value} = \frac{\binom{500}{4}\binom{4500}{1}}{\binom{5000}{5}} + \frac{\binom{500}{5}\binom{4500}{0}}{\binom{5000}{5}} = 4.6 \times 10^{-4} $$
Background population: 500 black genes, 4500 red genes
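The same calculation in code, as a sketch with hypothetical function names. Following the notation of the hypergeometric function slide, q is the number of black genes in the list, k the list size, m the number of black genes in the background and n the background size; the remaining k − q genes in the list are drawn from the n − m red ones.

```python
from math import comb


def hypergeom_pmf(q, k, m, n):
    """Pr(exactly q black genes in a sample of k; m black genes among n total)."""
    return comb(m, q) * comb(n - m, k - q) / comb(n, k)


def hypergeom_enrichment_p(q, k, m, n):
    """Pr(q or more black genes in the sample): one-tailed enrichment P-value."""
    return sum(hypergeom_pmf(x, k, m, n) for x in range(q, min(k, m) + 1))


# 4 of the 5 listed genes are black; the background has 500 black genes of 5000.
p_value = hypergeom_enrichment_p(4, 5, 500, 5000)
```

This reproduces the two-term sum on the slide, about 4.6 × 10^-4.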
Important details
• One way to test for under-enrichment of “black”: test for over-enrichment of “red”
• Same as a “One-tailed Fisher’s Exact Test”
• Need to choose “background population” appropriately, e.g., if only portion of the total gene complement is queried (or available for annotation), only use that population as background.
• To test for enrichment of more than one independent type of annotation (red vs black and circle vs square), we need to apply the hypergeometric test separately for each type.
***multivariate hypothesis testing***
Multivariate hypothesis tests
P-value is the “probability of observing something as extreme or more under the null hypothesis”
• Basic problem is the “or more”
→ We would have to do the sum in all dimensions.
Instead there are two major strategies for multivariate hypothesis testing:
1. Likelihood ratio test – summarizes the multivariate hypothesis with a single test statistic, and then do the sum in a single dimension
2. Test each dimension independently – very conservative because it ignores the potential correlation between dimensions.
→ When we want to know which dimensions are causing the rejection of the null hypothesis, we typically use #2
Gene set enrichment analysis
• Which (if any) annotations are enriched in
our gene list?
• Test each annotation independently using
the hypergeometric test
• Need to correct P-values because there
are so many annotations tested…
Outline for Today
• Bioinformatics
– GO and other annotations
– The annoying thing about bioinformatics
• Review of hypothesis testing
– Parametric vs. non-parametric tests
– Exact tests
– Multivariate hypothesis testing
• Multiple hypothesis testing
– Bonferroni, FDR
– Application to gene set enrichment analysis
Multiple test correction:
Bonferroni and False Discovery Rate
Mark Gerstein P-value paradox
– His lab publishes about 30 research papers/year. E.g., published 33 papers in 2011 (>300 in the last 10 years)
– At P-value=0.05, how many significant results/year are expected from his lab under the null hypothesis?
How to win the P-value lottery, part 1
Random draws
… 7,834 draws later …
Expect a random draw with observed enrichment once every 1 / P-value draws
Background population: 500 black genes, 5000 red genes
How to win the P-value lottery, part 2
Keep the gene list the same, evaluate different annotations
Observed draw
RRP6
MRD1
RRP7
RRP43
RRP42
Different annotations
RRP6
MRD1
RRP7
RRP43
RRP42
ORA tests need correction
From the Gene Ontology website:
Current ontology statistics: 25206 terms
• 14825 biological process
• 2101 cellular component
• 8280 molecular function
→ Buying 1 or 2 or even 10 lottery tickets, you still have a small chance of winning. However, if you buy 25,000 tickets, your chances of winning start to improve.
Simple P-value correction: Bonferroni
If M = # of annotations tested:
Corrected P-value = M x original P-value
Corrected P-value is greater than or equal to the probability that one or more of the observed enrichments could be due to random draws. The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)”
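In code the Bonferroni correction is one line. This sketch takes M to be the number of P-values passed in; capping the result at 1 is a common convention rather than something stated on the slide, and the function name `bonferroni` is illustrative.

```python
def bonferroni(p_values):
    """Bonferroni-corrected P-values: M x original, capped at 1 (assumed cap)."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


corrected = bonferroni([0.001, 0.02, 0.5])
```

With thousands of annotations tested, only very small original P-values survive this correction.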
Bonferroni correction caveats
• Bonferroni correction is very stringent and
can “wash away” real enrichments.
• Often users are willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.
False discovery rate (FDR)
• FDR is the expected proportion of the
observed enrichments due to random
chance.
• Compare to Bonferroni correction which is a bound
on the probability that any one of the observed
enrichments could be due to random chance.
• Typically FDR corrections are calculated using the Benjamini-Hochberg procedure.
• FDR threshold is often called the “q-value”
Controlling FDR using the Benjamini-Hochberg procedure I
• Say you want to bound the FDR at α, you need to calculate the corresponding P-value threshold t
• First, calculate the P-values for all the tests, and then sort them so that p1 is the smallest (i.e. most significant) P-value, and pm is the least.
Benjamini, Y. & Hochberg, Y. (1995) J. R. Stat. Soc. B 57, 289–300
Controlling FDR using the Benjamini-Hochberg procedure II
• t = pr where r is the max value for which:
$$ p_r \le \frac{r \alpha}{m} $$
where α is the FDR threshold, r is the rank and m is the # of tests
Caveat: Assumes independent or positively correlated tests.
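The two steps of the procedure (sort the P-values, then find the largest rank r with p_r ≤ rα/m) translate directly into code. This is a sketch with hypothetical function names; it returns `None` as the threshold when no rank satisfies the condition, in which case nothing is called significant.

```python
def bh_threshold(p_values, alpha):
    """P-value threshold t = p_r for the largest rank r with p_r <= r * alpha / m."""
    m = len(p_values)
    threshold = None
    for r, p in enumerate(sorted(p_values), start=1):
        if p <= r * alpha / m:
            threshold = p  # keep overwriting: we want the largest such r
    return threshold


def bh_significant(p_values, alpha):
    """For each P-value, whether it is at or below the threshold."""
    t = bh_threshold(p_values, alpha)
    return [t is not None and p <= t for p in p_values]


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
t = bh_threshold(p_values, alpha=0.05)
```

Because r is the largest rank that passes, a P-value can be declared significant even if it fails its own rank's bound, as long as a later rank passes.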
Reducing multiple test correction stringency
• The correction to the P-value threshold α depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be
• Can control the stringency by reducing the number of tests: e.g. use GO slim; restrict testing to the appropriate GO annotations; or select only larger GO categories.
Summary
• Multiple test correction
– Bonferroni: stringent, controls probability of
at least one false positive
– FDR: more forgiving, controls expected proportion of false positives; typically use Benjamini-Hochberg