Slides - Alan Moses
Transcription
Bioinformatics, statistics and multiple testing
Alan Moses, ML4bio
With slides from Quaid Morris

Outline for Today
• Bioinformatics
  – GO and other annotations
  – The annoying thing about bioinformatics
• Review of hypothesis testing
  – Parametric vs. non-parametric tests
  – Exact tests
  – Multivariate hypothesis testing
• Multiple hypothesis testing
  – Bonferroni, FDR
  – Application to gene set enrichment analysis

ConDens kinase substrate prediction
• Andy Lai was an MSc student in my lab who developed a cool new way to predict kinase substrates based on amino acid sequence alignments.
• He predicted new lists of substrates for some kinases, and wanted to show that the predictions were good, without doing any experiments.
• Gene Set Enrichment Analysis is the answer.

List of predicted Cbk1 targets in yeast: BOI1, SEC3, MPT5, SSD1, DSF2, FIR1, YNL058C, KIN1, YGR117C, KIN2, IRC8, YJL016W, ACE2, RGA2
List of predicted Cbk1 targets in Drosophila: CG8617, Oatp30B, CG9467, ec, pan

Where Do Gene Lists Come From?
• Molecular profiling, e.g. mRNA, protein
  – Identification → Gene list
  – Quantification → Gene list + values
  – Ranking, clustering (biostatistics)
• Interactions: protein interactions, microRNA targets, transcription factor binding sites (ChIP)
• Genetic screen, e.g. of knock-out library
• Association studies (genome-wide)
  – Single nucleotide polymorphisms (SNPs)
  – Copy number variants (CNVs)

What is the Gene Ontology (GO)?
www.geneontology.org
• Set of biological phrases (terms) which are applied to genes:
  – protein kinase
  – apoptosis
  – membrane
• Dictionary: term definitions
• Ontology: a formal system for describing knowledge
Jane Lomax @ EBI

GO Structure
• Terms are related within a hierarchy
  – is-a
  – part-of
• Describes multiple levels of detail of gene function
• Terms can have more than one parent or child

What GO Covers?
• GO terms divided into three aspects:
  – cellular component
  – molecular function
  – biological process (important pathway source)
[Figure labels: glucose-6-phosphate isomerase activity; cell division]

Terms
• Where do GO terms come from?
  – GO terms are added by editors at EBI and gene annotation database groups
  – Terms added by request
  – Experts help with major development
  – 32029 terms, >99% with definitions (as of July 15, 2010)
    • 19639 biological_process
    • 2859 cellular_component
    • 9531 molecular_function

Annotations
• Genes are linked, or associated, with GO terms by trained curators at genome databases
  – Known as ‘gene associations’ or GO annotations
  – Multiple annotations per gene
• Some GO annotations are created automatically (without human review)

Annotation Sources
• Manual annotation
  – Curated by scientists
    • High quality
    • Small number (time-consuming to create)
  – Reviewed computational analysis
• Electronic annotation
  – Annotation derived without human validation
    • Computational predictions (accuracy varies)
    • Lower ‘quality’ than manual codes
• Key point: be aware of annotation origin

Evidence Types
• Experimental Evidence Codes
  • EXP: Inferred from Experiment
  • IDA: Inferred from Direct Assay
  • IPI: Inferred
from Physical Interaction
  • IMP: Inferred from Mutant Phenotype
  • IGI: Inferred from Genetic Interaction
  • IEP: Inferred from Expression Pattern
• Computational Analysis Evidence Codes
  • ISS: Inferred from Sequence or Structural Similarity
  • ISO: Inferred from Sequence Orthology
  • ISA: Inferred from Sequence Alignment
  • ISM: Inferred from Sequence Model
  • IGC: Inferred from Genomic Context
  • RCA: Inferred from Reviewed Computational Analysis
• Author Statement Evidence Codes
  • TAS: Traceable Author Statement
  • NAS: Non-traceable Author Statement
• Curator Statement Evidence Codes
  • IC: Inferred by Curator
  • ND: No biological Data available
• IEA: Inferred from Electronic Annotation
See http://www.geneontology.org

Wide & Variable Species Coverage
Lomax J. Get ready to GO! A biologist's guide to the Gene Ontology. Brief Bioinform. 2005 Sep;6(3):298-304.

Accessing GO: QuickGO
http://www.ebi.ac.uk/ego/
See also AmiGO: http://amigo.geneontology.org/cgi-bin/amigo/go.cgi

Biomart 0.7

Ensembl BioMart
• Convenient access to gene list annotation
  – Select genome
  – Select filters
  – Select attributes to download

Sources of Gene Attributes
• Ensembl BioMart (eukaryotes)
  – http://www.ensembl.org
• Entrez Gene (general)
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
• Model organism databases
  – E.g. SGD: http://www.yeastgenome.org/
• Also available through R

Why is it all such a mess?
• Naming of molecules was done by whoever found it first.
  o Proteins and genes do not always have consistent names.
  o More important genes that were studied by many groups have many names.
    Competing research groups may purposefully omit the name(s) used by other groups.
• Database identifiers (IDs) are unique, stable names or numbers that help track database records, but…
  o Each database will typically use its own internal IDs and naming conventions
  o The more important a gene/protein is, the more databases will have information for it, so it will have many IDs
  o Databases are frequently updated, so we always have to keep track of the database version that was used
• Records for: Gene, DNA, RNA, Protein
  o Important to recognize the correct record type
  o Different data sources pertain to different data types (e.g., Pfam only has proteins)
  o The relationship between Genes, DNA, RNA and Proteins is not 1 to 1

Common Identifiers
Gene
  Ensembl: ENSG00000139618
  Entrez Gene: 675
  Unigene: Hs.34012
RNA transcript
  GenBank: BC026160.1
  RefSeq: NM_000059
  Ensembl: ENST00000380152
Protein
  Ensembl: ENSP00000369497
  RefSeq: NP_000050.2
  UniProt: BRCA2_HUMAN or A1YBP1_HUMAN
  IPI: IPI00412408.1
  EMBL: AF309413
  PDB: 1MIU
Species-specific
  HUGO HGNC: BRCA2
  MGI: MGI:109337
  RGD: 2219
  ZFIN: ZDB-GENE-060510-3
  FlyBase: CG9097
  WormBase: WBGene00002299 or ZK1067.1
  SGD: S000002187 or YDL029W
Annotations
  InterPro: IPR015252
  OMIM: 600185
  Pfam: PF09104
  Gene Ontology: GO:0000724
  SNPs: rs28897757
Experimental Platform
  Affymetrix: 208368_3p_s_at
  Agilent: A_23_P99452
  CodeLink: GE60169
  Illumina: GI_4502450-S
Red = Recommended

ID Mapping Services
• Synergizer
  – http://llama.med.harvard.edu/synergizer/translate/
• Ensembl BioMart
  – http://www.ensembl.org
• PICR (proteins only)
  – http://www.ebi.ac.uk/Tools/picr/
• R language annotation databases
  – http://www.bioconductor.org

ID Mapping Challenges
• Avoid errors: map IDs correctly
• Gene name ambiguity: not a good ID
  – e.g.
FLJ92943, LFS1, TRP53, p53
  – Better to use the standard gene symbol: TP53
• Excel error-introduction
  – OCT4 is changed to October-4
• Problems reaching 100% coverage
  – E.g. due to version issues
  – Use multiple sources to increase coverage
Zeeberg BR et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics. 2004 Jun 23;5:80

Summary so far
• GO (and other functional annotations) are a great way to tell us about the functions of a list of genes
• In order to use these, we need to compare our gene list to what’s in the GO database…
  – Genes and their products and attributes have many identifiers (IDs)
  – Bioinformatics often means converting or mapping IDs from one type to another
  – ID mapping services are available
  – Use standard, commonly used IDs to reduce ID mapping challenges

Outline for Today
• Bioinformatics
  – GO and other annotations
  – The annoying thing about bioinformatics
• Review of hypothesis testing
  – Parametric vs. non-parametric tests
  – Exact tests
  – Multivariate hypothesis testing
• Multiple hypothesis testing
  – Bonferroni, FDR
  – Application to gene set enrichment analysis

What is a P-value?
• A) The probability that the null hypothesis is true
• B) Probability of a test statistic under the null distribution
• C) Probability of an incorrect rejection of the null hypothesis
• D) Some subset of the above
Answer: None of these!!
Modified from Quaid Morris

What is a P-value?
• Probability of observing something as extreme or more under the null hypothesis
What is this thing?
• Usually it’s a “test statistic” but it can be any summary of the data…
• Always a sum or integral over the “tail” or “tails” of a distribution.

Hypothesis testing
• Random variables:
  – H: H0 (null hypothesis) or H1 (alternative hypothesis)
  – Data: X1, X2, …, XN (independent and identically distributed, IID)
  – t is a test statistic, t = f(X)
  – t* is the observed value of the test statistic
• Parameters:
  – α: significance level
  – Reject H0 if P-value < α
• P-value is:
  – Pr[ t is “as or more extreme” than t* | H0 is true ]

P-value versus false rejections
• P-value is:
  – Pr[ t is “as or more extreme” than t* | H0 is true ]
• False rejection probability:
  – Pr[ H0 is true | H0 is rejected ]
  – aka “False discovery rate”

P-value facts
• Note that: Pr[P-value < p | H0 is true] = p
• So under the null distribution, the P-value is a random variable that is uniformly distributed between 0 and 1.
• Given different tests with P-values p1, p2, …, pN, you can combine them into a single P-value: “Fisher’s method”
• Fisher figured out that the test statistic $X^2 = -2 \sum_{i=1}^{N} \ln p_i$ is chi-square with 2N degrees of freedom if the p’s are uniform on [0, 1]
• Sometimes called “meta-analysis” because you can combine the results of many analyses this way

E.g., 2-sample tests
• We make several observations under two situations, and we want to find out whether there is a statistical difference.
• Which genes have differential expression in the different tumor types?
[Figure: data matrix with axes labeled “samples” and “genes”]
Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, Demeter J, Perou CM, Lønning PE, Brown PO, Børresen-Dale AL, Botstein D. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci U S A. 2003 Jul 8;100(14):8418-23.
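
Fisher's method from the “P-value facts” slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `fisher_combined_pvalue` is made up here, the tests being combined are assumed to be independent, and the chi-square tail is computed with the closed form that holds for an even number of degrees of freedom (2N), so only the standard library is needed.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent P-values with Fisher's method.

    Returns (X2, combined_p), where X2 = -2 * sum(ln p_i) and combined_p is the
    upper tail of a chi-square distribution with 2N degrees of freedom.
    The name and interface are illustrative assumptions.
    """
    n = len(pvalues)
    if n == 0 or any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("need at least one P-value, each in (0, 1]")
    x2 = -2.0 * sum(math.log(p) for p in pvalues)
    # For an even number of degrees of freedom (2N) the upper tail is
    #   Pr[X2 >= x] = exp(-x/2) * sum_{i=0}^{N-1} (x/2)^i / i!
    half = x2 / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, n):
        term *= half / i
        total += term
    return x2, math.exp(-half) * total


if __name__ == "__main__":
    # Three hypothetical P-values from independent analyses.
    x2, p = fisher_combined_pvalue([0.08, 0.04, 0.10])
    print(f"X2 = {x2:.3f}, combined P-value = {p:.4f}")
```

With a single P-value the function returns that P-value unchanged, which is a quick sanity check that the tail formula is consistent with the uniform null distribution described above.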
2 sample test
Gene expression levels for a single gene:
  Normal Breast-like: 6 5 9 6 4 5 6 3 7 5
  Basal subtype: 0 1 1 0 2 1 1 -1 -2 0 0 1 0
[Figure: Distribution of gene expression levels; x-axis: Gene expression, y-axis: Probability density]
Question: How likely is it that the difference between the two samples is due to chance?

2 sample t-test
Summarize the data with the so-called “t-statistic”:
  Normal Breast-like: N1 = 10, Mean: m1 = 5.6, Std: s1 = 1.6
  Basal subtype: N2 = 13, Mean: m2 = 0.3, Std: s2 = 1.0
  $T = \frac{m_1 - m_2}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}}$
H0: Black and red scores are drawn from a distribution with the same mean
H1: The two means are not equal
[Figure: Distribution of gene expression levels; x-axis: Gene expression, y-axis: Probability density]

2 sample t-test
• The distribution of the T-statistic is known under the null hypothesis (the T-distribution)
• P-value = shaded area * 2
[Figure: T-distribution (probability density) with the T-statistic marked]

Examples of inappropriate distributions for T-tests
• T-test assumes data are (approximately) normally distributed
• T-test detects differences between means, not necessarily between distributions
• Values are positive and have increasing density near zero, e.g. sequence counts
• Bimodal (“two-bumped”) distributions
• Distributions with outliers, or “heavy-tailed” distributions
[Figure: example densities; x-axis: Gene expression, y-axis: Probability density]
Solutions: “non-parametric two-sample tests”
1) Robust test for difference of medians (WMW)
2) Direct test of difference of distributions (K-S)

Enrichment analysis with the two-sample (not paired) Wilcoxon Rank Sum test
aka Mann-Whitney U test or simply “WMW”
1) Rank gene scores, calculate RB, the sum of ranks of black values
  N1 black values: 2.1 5.6 -1.1 -2.5 -0.5
  N2 red values: 3.2 1.7 6.5 4.5 0.1
  Sorted scores: 6.5 5.6 4.5 3.2 2.1 1.7 0.1 -0.5 -1.1 -2.5
  RB = 21 (counting ranks from the lowest score, the black values have ranks 1, 2, 3, 6 and 9)
H0: Probability that red ranks are greater than black ranks is 0.5
H1: Red ranks are greater than black ranks

Wilcoxon-Mann-Whitney (WMW) test
aka Mann-Whitney U-test, Wilcoxon rank-sum test
2) Calculate Z-score, with RB = 21 and mean rank (N1 + N2 + 1) / 2:
  $Z = \frac{R_B - N_1 \frac{N_1 + N_2 + 1}{2}}{\sigma_U} = -1.4$
3) Calculate P-value:
  P-value = shaded area * 2 (normal distribution, Z = -1.4)
H0: Probability that a random sample from the distribution of red scores is greater than one from black is 0.5
H1: Otherwise

WMW test details
• Described method is only applicable for large N1 and N2 and when there are no tied scores
• WMW test is robust to (a few) outliers
  $\sigma_U = \sqrt{N_1 N_2 (N_1 + N_2 + 1) / 12}$
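
The two-sample statistics from the slides above can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the lecture's own code: the function names are hypothetical, the t-statistic is computed from the summary numbers on the t-test slide, ranks are counted from the lowest score (which is what reproduces RB = 21), and the WMW P-value uses the large-sample normal approximation with no tied scores.

```python
import math


def two_sample_t_statistic(m1, s1, n1, m2, s2, n2):
    """T = (m1 - m2) / sqrt(s1^2/N1 + s2^2/N2), from summary statistics."""
    return (m1 - m2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def wmw_z_test(black, red):
    """Return (R_B, Z, two-sided P) for the rank-sum test, normal approximation.

    Ranks run from 1 (lowest score) upward; tied scores are rejected because
    the simple formula for sigma_U does not cover them.
    """
    pooled = sorted(list(black) + list(red))
    if len(set(pooled)) != len(pooled):
        raise ValueError("tied scores are not handled in this sketch")
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n1, n2 = len(black), len(red)
    r_b = sum(rank[value] for value in black)
    expected = n1 * (n1 + n2 + 1) / 2            # N1 times the mean rank
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (r_b - expected) / sigma_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # both tails of N(0, 1)
    return r_b, z, p_two_sided


if __name__ == "__main__":
    # Summary statistics from the t-test slide.
    print("T =", round(two_sample_t_statistic(5.6, 1.6, 10, 0.3, 1.0, 13), 2))
    # Black and red scores from the WMW slide; the black rank sum is 21.
    black = [2.1, 5.6, -1.1, -2.5, -0.5]
    red = [3.2, 1.7, 6.5, 4.5, 0.1]
    r_b, z, p = wmw_z_test(black, red)
    print("RB =", r_b, "Z =", round(z, 1), "P =", round(p, 3))
```

With only five values per group the normal approximation is rough, which is the point of the “only applicable for large N1 and N2” caveat; an exact or permutation-based P-value would be the safer choice at this sample size.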
Kolmogorov-Smirnov (K-S) test for difference of distributions
1) Calculate cumulative distributions of red and black
• Test statistic: maximum vertical difference between the two cumulative distributions (Distance = 0.4 in the example)
• Distribution of the test statistic is known regardless of the underlying distributions
[Figure: empirical (cumulative) distributions, cumulative probability vs. gene expression, shown with the probability densities]

WMW and K-S test caveats
• Neither test is as sensitive as the T-test, i.e. they require more data points to detect the same amount of difference, so use the T-test whenever it is valid.
• K-S test and WMW can give you different answers: K-S detects difference of distributions, WMW detects whether samples from one tend to be higher than those from the other (or vice versa)
• Technical issue: tied scores and/or a small # of observations can be a problem for some implementations of the WMW or K-S test

Central limit theorem
• If you have a moderately large sample, you can do statistical tests that don’t depend on assumptions about the distribution of the data
• E.g., black data mean is almost certainly greater than red mean, but there are a lot of tied ‘0’ values that might mess up K-S and WMW tests.
• Central Limit Theorem: the distribution of your estimate of the mean is Gaussian (assuming your sample is big enough, i.i.d., and that the variance is finite)
• Under the null hypothesis, average red = average black and is N(μ, σ²), where μ is the mean and σ² is the variance.
[Figure: probability density vs. gene expression]

What is the distribution of my data?
• Because of the central limit theorem and permutation tests, you don’t usually have to worry about it
• A good way to check is using a “qq-plot”.
  – This compares the “theoretical quantiles” of a particular distribution to the quantiles in your data.
  – If they don’t disagree too badly, you can usually be safe assuming your data are consistent with that distribution
• With large genomics data sets, you will have enough power to reject the hypothesis that your data “truly” come from any distribution

Permutation tests
• Often, the null distribution of the test statistic is unclear or not analytical.
• In these cases, you can generate an empirical distribution by sampling from the null distribution and then evaluating your test statistic against this distribution.
• In many genomic applications it is often possible to get a sample from the null distribution by randomizing (i.e. permuting) the association between genes and corresponding data.

When permuting, you have to think deep thoughts about what your null hypothesis really is.
Janusz Dutkowski, Michael Kramer, Michal A Surma, Rama Balakrishnan, J Michael Cherry, Nevan J Krogan & Trey Ideker. A gene ontology inferred from molecular networks. Nature Biotechnology 31, 38–45 (2013)

Exact tests
• Sometimes, the probability of an observation as extreme or more can be calculated directly under H0
• In this case there is no “test statistic”
• E.g., “binomial test”, “Fisher’s Exact Test” and “hypergeometric test”
  (Use for Gene Set Enrichment Analysis)
• These tests are feasible now because computers calculate these probabilities

E.g., Binomial test
• You did a poll where you get “yes” or “no” answers each time, and you have some prior belief about the frequency of “yes” or “no” under the null hypothesis.
E.g., if people don’t care, then p should be 50%
  P-value = Pr(73 or more “yes” | 102 total, p = 50%)
  $\text{P-value} = \sum_{X=73}^{102} \binom{102}{X} (0.5)^X (1 - 0.5)^{102 - X}$, where $\binom{n}{k} = \frac{n!}{(n-k)!\,k!}$

E.g., Fisher’s Exact test
• You developed a prediction method where you got a 2 x 2 table as the result (observed positive/negative vs. predicted positive/negative), with cell counts 14, 178, 7, 31
• I won’t bother you with the formula, but the probability of the “configuration” of the 2 x 2 table can be calculated exactly → doesn’t make any assumption about the distribution of positives and negatives
• P-value = Pr(a “configuration” as extreme or more | no association)
• To calculate this, you need to sum up a lot of possible tables
• According to R, in this case P-value = 0.05666

The hypergeometric test
Gene list: RRP6 MRD1 RRP7 RRP43 RRP42
H0: List is a random sample from population
H1: More black genes than expected
Background population: 500 black genes, 4500 red genes

The hypergeometric function
Probability a random sample of k genes contains q black genes when the background population contains m black genes out of n total genes:
  (# ways to choose q out of m genes) × (# ways to choose k − q out of n − m genes) / (# ways to choose k out of n genes)
  $= \frac{\binom{m}{q}\binom{n-m}{k-q}}{\binom{n}{k}}$
$\binom{n}{k} = \frac{n!}{(n-k)!\,k!}$ is called “n choose k”; for details see
http://www.khanacademy.org/video/combinations

The hypergeometric test
Gene list: RRP6 MRD1 RRP7 RRP43 RRP42
Background population: 500 black genes, 4500 red genes
Null distribution P-value:
  $\frac{\binom{500}{4}\binom{4500}{1}}{\binom{5000}{5}} + \frac{\binom{500}{5}\binom{4500}{0}}{\binom{5000}{5}} = 4.6 \times 10^{-4}$

Important details
• One way to test for under-enrichment of “black”: test for over-enrichment of “red”
• Same as a “One-tailed Fisher’s Exact Test”
• Need to choose “background population” appropriately, e.g., if only a portion of the total gene complement is queried (or available for annotation), only use that population as background.
• To test for enrichment of more than one independent type of annotation (red vs. black and circle vs. square), we need to apply the hypergeometric test separately for each type: ***multivariate hypothesis testing***

Multivariate hypothesis tests
P-value is the “probability of observing something as extreme or more under the null hypothesis”
• Basic problem is the “or more” → we would have to do the sum in all dimensions.
Instead there are two major strategies for multivariate hypothesis testing:
1. Likelihood ratio test: summarizes the multivariate hypothesis with a single test statistic, and then does the sum in a single dimension
2. Test each dimension independently: very conservative because it ignores the potential correlation between dimensions.
→ When we want to know which dimensions are causing the rejection of the null hypothesis, we typically use #2

Gene set enrichment analysis
• Which (if any) annotations are enriched in our gene list?
• Test each annotation independently using the hypergeometric test
• Need to correct P-values because there are so many annotations tested…

Outline for Today
• Bioinformatics
  – GO and other annotations
  – The annoying thing about bioinformatics
• Review of hypothesis testing
  – Parametric vs.
non-parametric tests
  – Exact tests
  – Multivariate hypothesis testing
• Multiple hypothesis testing
  – Bonferroni, FDR
  – Application to gene set enrichment analysis

Multiple test correction: Bonferroni and False Discovery Rate

Mark Gerstein P-value paradox
  – His lab publishes about 30 research papers/year. E.g., published 33 papers in 2011 (>300 in the last 10 years)
  – At P-value = 0.05, how many significant results/year are expected from his lab under the null hypothesis?

How to win the P-value lottery, part 1
• Random draws … 7,834 draws later …
• Expect a random draw with observed enrichment once every 1 / P-value draws
Background population: 500 black genes, 4500 red genes

How to win the P-value lottery, part 2
• Keep the gene list the same, evaluate different annotations
Observed draw: RRP6 MRD1 RRP7 RRP43 RRP42
Different annotations: RRP6 MRD1 RRP7 RRP43 RRP42

ORA tests need correction
From the Gene Ontology website: Current ontology statistics: 25206 terms
• 14825 biological process
• 2101 cellular component
• 8280 molecular function
→ Buying 1 or 2 or even 10 lottery tickets, you still have a small chance of winning. However, if you buy 25,000 tickets, your chances of winning start to improve.

Simple P-value correction: Bonferroni
If M = # of annotations tested: Corrected P-value = M x original P-value
Corrected P-value is greater than or equal to the probability that one or more of the observed enrichments could be due to random draws.
The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)”

Bonferroni correction caveats
• Bonferroni correction is very stringent and can “wash away” real enrichments.
• Often users are willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.
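
The hypergeometric enrichment P-value and the Bonferroni correction described above can be sketched together in Python. The function names are hypothetical, the argument names follow the q, k, m, n notation of the hypergeometric slide, and capping the corrected value at 1 is an added convention that the slide's formula (M x original P-value) does not state.

```python
from math import comb


def hypergeom_enrichment_pvalue(q, k, m, n):
    """Pr[a random sample of k genes contains at least q black genes].

    The background population has n genes, m of which are black.
    """
    total = comb(n, k)
    upper = min(k, m)
    hits = sum(comb(m, x) * comb(n - m, k - x) for x in range(q, upper + 1))
    return hits / total


def bonferroni(pvalue, num_tests):
    """Corrected P-value = M x original P-value, capped at 1 (assumed convention)."""
    return min(1.0, num_tests * pvalue)


if __name__ == "__main__":
    # Slide example: 4 or more black genes in a list of 5,
    # background of 500 black genes out of 5000 genes in total.
    p = hypergeom_enrichment_pvalue(q=4, k=5, m=500, n=5000)
    print(f"hypergeometric P-value = {p:.2e}")
    # Testing every one of the 25206 GO terms quoted above.
    print("Bonferroni-corrected:", bonferroni(p, 25206))
```

Running the example shows the “wash away” effect: a raw P-value of a few times 10^-4 is no longer significant once it is multiplied by tens of thousands of tested annotations, which is why the number of tests matters so much.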
False discovery rate (FDR)
• FDR is the expected proportion of the observed enrichments due to random chance.
• Compare to Bonferroni correction, which is a bound on the probability that any one of the observed enrichments could be due to random chance.
• Typically FDR corrections are calculated using the Benjamini-Hochberg procedure.
• FDR threshold is often called the “q-value”

Controlling FDR using the Benjamini-Hochberg procedure I
• Say you want to bound the FDR at α; you need to calculate the corresponding P-value threshold t
• First, calculate the P-values for all the tests, and then sort them so that p1 is the smallest (i.e. most significant) P-value, and pm is the least.
Benjamini, Y. & Hochberg, Y. (1995) J. R. Stat. Soc. B 57, 289–300

Controlling FDR using the Benjamini-Hochberg procedure II
• t = p_r, where r is the max value for which:
  $p_r \le \frac{r \alpha}{m}$
  (α: FDR threshold; r: rank; m: # of tests)
Caveat: Assumes independent or positively correlated tests.

Reducing multiple test correction stringency
• The correction to the P-value threshold α depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be
• Can control the stringency by reducing the number of tests: e.g. use GO slim; restrict testing to the appropriate GO annotations; or select only larger GO categories.

Summary
• Multiple test correction
  – Bonferroni: stringent, controls probability of at least one false positive
  – FDR: more forgiving, controls expected proportion of false positives; typically use Benjamini-Hochberg
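
The Benjamini-Hochberg threshold rule (t = p_r for the largest rank r with p_r ≤ rα/m) can be written as a short Python sketch. The function name and the example P-values are hypothetical, and returning `(None, 0)` when no test passes is a convention chosen here, not something the slides specify.

```python
def benjamini_hochberg_threshold(pvalues, alpha):
    """Return (t, r) for the Benjamini-Hochberg procedure.

    t is the P-value threshold p_(r), where r is the largest rank such that
    p_(r) <= r * alpha / m after sorting the m P-values in increasing order.
    Returns (None, 0) if no P-value satisfies the condition.
    """
    m = len(pvalues)
    ordered = sorted(pvalues)
    r = 0
    for rank, p in enumerate(ordered, start=1):
        if p <= rank * alpha / m:
            r = rank  # keep the largest rank that passes
    if r == 0:
        return None, 0
    return ordered[r - 1], r


if __name__ == "__main__":
    # Ten hypothetical enrichment P-values, FDR bounded at alpha = 0.05.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    t, r = benjamini_hochberg_threshold(pvals, alpha=0.05)
    print("threshold t =", t, "| tests called significant:", r)
```

Every test with a P-value at or below t is called significant. Because the loop keeps the largest passing rank, a P-value can be accepted even when a smaller one failed its own cutoff, which is what makes the procedure gentler than Bonferroni.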