Underpowered Studies - Open Science Framework
Transcription
Learning to Improve the Impact & Integrity of your Science: An Introduction to the Open Science Framework
https://osf.io

Lorraine Chuen
http://www.ooocanada.ca/ | @OOOpen_Canada | http://facebook.com/OOOCanada
OOOCanada Co-Founder
Studio-Y Fellow: MaRS Discovery District

Tag-Team Talk A: Science’s Reproducibility Crisis
David Groppe, PhD
Research Associate: Psychology, Univ. of Toronto
Adjunct Scientist: Neurosurgery, Feinstein Institute for Medical Research

Tag-Team Talk B: Practical Guide to using the Open Science Framework
https://osf.io
Liz Page-Gould, PhD
Associate Professor: Psychology, Univ. of Toronto
Canada Research Chair in Social Psychophysiology
OSF Ambassador

Motivate Change / How to Implement Change
To Download Our Slides: https://osf.io/avdhk/
CC license:

Tag-Team Talk A: Science’s Reproducibility Crisis
David Groppe, PhD

Science = Awesome
http://science.amorphia-apparel.com/design/robot/
For Example… Hubble Space Telescope, Newtonian Mechanics, Insulin (Kudos Toronto!)

Mistakes can Happen…
Johannes Andreas Grib Fibiger: 1926 Nobel Prize in Medicine for demonstrating that cancer was caused by nematodes.
• Mistakes should be rare (e.g., p<0.05)
• Mistakes should be corrected by subsequent research

Increasing Evidence that Mistakes are Common and Difficult to Correct
Image from: http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble

3 Large Scale Replication Studies
1. Open Science Collaboration (2015) Estimating the reproducibility of psychological science.
   • Systematic attempt to replicate 100 psychology experiments
   • 39% replication success
   • 83% of replicated effects were smaller than original effects
2. Begley & Ellis (2012) Raise standards for preclinical research.
   • Amgen’s attempt to replicate 53 “landmark” haematology & cancer studies
   • 11% replication success
3. Prinz, Schlange, & Asadullah (2011) Believe it or not: how much can we rely on published data on potential drug targets?
   • Bayer HealthCare’s attempt to replicate 67 studies in cancer, women’s health, & heart disease
   • 20-25% replication success
Mistakes can Happen… ?
Johannes Andreas Grib Fibiger: 1926 Nobel Prize in Medicine for demonstrating that cancer was caused by nematodes.
• Mistakes should be rare (e.g., p<0.05)
• Mistakes should be corrected by subsequent research

Reproducibility & Impact
REPRODUCIBILITY OF RESEARCH FINDINGS: Preclinical research generates many secondary publications, even when results cannot be reproduced.

Journal impact factor | Number of articles | Mean citations of non-reproduced articles* | Mean citations of reproduced articles
>20                   | 21                 | 248 (range 3–800)                          | 231 (range 82–519)
5–19                  | 32                 | 169 (range 6–1,909)                        | 13 (range 3–24)

Results from a ten-year retrospective analysis of experiments performed prospectively. The term ‘non-reproduced’ was assigned on the basis of findings not being sufficiently robust to drive a drug-development programme. *Source of citations: Google Scholar, May 2011.

Over a 10-year period, non-reproducible studies tended to have greater impact and in some cases spawned subfields and clinical trials. Begley & Ellis (2012)

Negative Results Rarely Published
ACCENTUATE THE POSITIVE: A literature analysis across disciplines reveals a tendency to publish only ‘positive’ studies, i.e. those that support the tested hypothesis. Psychiatry and psychology are the worst offenders.
[Figure: bar chart of the proportion of papers supporting the tested hypothesis (x-axis 50–90%) for each discipline, grouped into physical, biological, and social sciences, with psychiatry/psychology at the high end. Yong (2012) Bad Copy.]

Mistakes can Happen… How do we increase science’s credibility?
Johannes Andreas Grib Fibiger: 1926 Nobel Prize in Medicine for demonstrating that cancer was caused by nematodes.
• Mistakes should be rare (e.g., p<0.05)
• Mistakes should be corrected by subsequent research

Why is Reproducibility so Low? Important, General Culprits:
1. Biases in Scientific Practice
2. Underpowered Studies

Some Biases in Scientific Practice
• Confirmation bias
• Circular analysis/“p-hacking”
• Addition or removal of observations to generate statistical significance
• Publication bias

Confirmation bias: people are less likely to be critical of (e.g., double check) results that confirm expectations.

Circular analysis: “Now that I have my data, how should I analyze them?”

A Priori Analysis (statistically sound): decide on data collection and analysis parameters → collect data → perform analyses.

Circular Analysis (prone to inflated significance/confidence): decide on data collection and analysis parameters → collect data → use data to revise collection or analysis parameters → final analyses. Can be extremely valuable (but BEWARE!)

See: Gelman, A. & Loken, E. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.

Prevalence of Circular Analyses
1. Kriegeskorte et al. (2009) Circular analysis in systems neuroscience: the dangers of double dipping
   • Analyzed all 2008 fMRI studies in 5 high impact journals
   • 42-56% contained at least one circular analysis
2. John, Loewenstein, & Prelec (2012) Measuring the prevalence of questionable research practices with incentives for truth telling
   • Surveyed over 2000 psychologists
   • 35% reported an unexpected finding as having been predicted from the start in at least one paper

Addition of Data to Generate Significance: p=0.09! Collect a bit more data!
Removal of Data to Generate Significance: p=0.04! Done!
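The cost of “collect a bit more data and re-test” is easy to see in a simulation. The following is a minimal sketch, not from the talk: it assumes a one-sample t-test with no true effect, peeking at the p-value every 10 added observations up to a cap of 100.

```python
# Minimal sketch (assumed setup, not from the talk): optional stopping when the
# null is true. Peek at the p-value, stop as soon as p < 0.05, and otherwise
# "collect a bit more data" until a sample-size cap is reached.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
n_sims, n_start, n_max, step = 5000, 20, 100, 10
false_positives = 0
for _ in range(n_sims):
    data = list(rng.normal(0.0, 1.0, n_start))    # the null is true: mean really is 0
    while True:
        if ttest_1samp(data, 0.0).pvalue < 0.05:  # peek at the p-value
            false_positives += 1                  # declare "significance" and stop
            break
        if len(data) >= n_max:                    # give up at the cap
            break
        data.extend(rng.normal(0.0, 1.0, step))   # collect a bit more data
print(f"False-positive rate with optional stopping: {false_positives / n_sims:.1%}")
```

Even though every individual test uses alpha = 0.05, the repeated peeking pushes the overall false-positive rate to roughly two to three times the nominal 5%.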
Prevalence of Post-Hoc Data Selection
John, Loewenstein, & Prelec (2012) Measuring the prevalence of questionable research practices with incentives for truth telling
• Surveyed over 2000 psychologists
• 58% decided to collect more data after looking to see if results were significant
• 23% decided to stop collecting data earlier than planned because the expected result had been found
• 43% decided to exclude data after looking at the impact of doing so on results

How often have you read an a priori justification for sample size in a methods section?

Publication Bias
• The File Drawer Effect: null/small effects are rarely published
• The High-Impact Journal Effect: extraordinary results are more likely to be published

Negative Results Rarely Published (see the Yong (2012) figure above: psychiatry/psychology publish the highest proportion of papers supporting the tested hypothesis).

Biased p-Value Distribution: Psychology
[Figure: distribution of z-transformed p-values in the psychology literature; the dashed line marks the critical z-statistic (1.96) associated with p = 0.05, and interval widths (0.245, a multiple of 1.96) correspond to a 12.5% caliper; reported results cluster on the p<0.05 side of the threshold.]
Kühberger, A., Fritz, A., & Scherndl, T. (2014). Publication bias in psychology: a diagnosis based on the correlation between effect size and sample size.
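A small simulation (an illustrative sketch, not from the talk) shows both symptoms that Kühberger et al. diagnose: if only “positive” results of a modest true effect get published, the published effect sizes are inflated and correlate negatively with sample size.

```python
# Minimal sketch (assumed setup): file-drawer effect. Many two-group studies of
# the same modest true effect (d = 0.2) are run, but only those reaching p < 0.05
# in the expected direction are "published".
import numpy as np
from scipy.stats import ttest_ind, pearsonr

rng = np.random.default_rng(1)
true_d = 0.2
published_d, published_n = [], []
for _ in range(20000):
    n = int(rng.integers(10, 101))             # per-group sample size
    treated = rng.normal(true_d, 1.0, n)       # treatment group (SD = 1)
    control = rng.normal(0.0, 1.0, n)          # control group
    if ttest_ind(treated, control).pvalue < 0.05 and treated.mean() > control.mean():
        published_d.append(treated.mean() - control.mean())  # observed effect, in SD units
        published_n.append(n)
print(f"true d = {true_d}, mean 'published' effect = {np.mean(published_d):.2f}")
print(f"corr(published effect size, sample size) = {pearsonr(published_d, published_n)[0]:.2f}")
```

The “published” record substantially overstates the true effect, and the smallest studies report the largest effects, which is exactly the effect-size/sample-size correlation the paper uses as a diagnostic.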
Why is Reproducibility so Low? Important, General Culprits:
1. Biases in Scientific Practice
2. Underpowered Studies

Underpowered Studies
• An experiment’s “statistical power” is the probability of detecting an effect
• Power is a function of sample size, effect size, and alpha level
• “Significant” effects will necessarily be overestimated when effects are small & studies are underpowered
• Consequently, underpowered studies are highly susceptible to the aforementioned biases

A Hypothetical Example
• You want to determine if coffee affects performance on an IQ test
• The IQ test has been carefully designed to have a Gaussian distribution of scores with a mean of 100 and a standard deviation of 15 points
• The true mean effect of coffee on test performance is 3 points (a small effect, Cohen’s d = 3/15 = 0.2)

Adequately Powered Design
• 156 participants
• Power = 80%
• Minimum mean test score for p<0.05 = 102.0 (“Coffee increases average IQ to the 55th percentile”)

Underpowered Design
• 10 participants
• Power = 15%
• Minimum mean test score for p<0.05 = 107.8 (“Coffee increases average IQ to the 70th percentile”)
(A short calculation reproducing these numbers appears below.)

[Figure: post-coffee IQ score (100–108) for each design, comparing the true effect with the minimum experimental result needed for significance.]

What Factors Reduce Power?
• Small sample sizes
• Increased noise (uncontrolled factors that also influence the process of interest)
• Multiple statistical comparisons

Many Variables, Many Analyses: “Now that I have my data, how should I analyze them?” 40,000-500,000 voxels
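The hypothetical coffee example above can be reproduced in a few lines. This is a minimal sketch; it assumes a one-tailed, one-sample z-test against a mean of 100 with known SD 15, which matches the slide’s numbers.

```python
# Minimal sketch (assumed one-tailed one-sample z-test): power and the minimum
# "significant" result for the coffee/IQ example.
from scipy.stats import norm

MU0, SIGMA, TRUE_MEAN, ALPHA = 100.0, 15.0, 103.0, 0.05

def min_significant_mean(n):
    """Smallest sample mean that reaches p < ALPHA."""
    return MU0 + norm.ppf(1 - ALPHA) * SIGMA / n ** 0.5

def power(n):
    """Probability that a sample of size n yields a significant result."""
    se = SIGMA / n ** 0.5
    return 1 - norm.cdf((min_significant_mean(n) - TRUE_MEAN) / se)

for n in (156, 10):
    crit = min_significant_mean(n)
    pct = norm.cdf((crit - MU0) / SIGMA)   # percentile of the critical mean
    print(f"n = {n:3d}: power = {power(n):.1%}, "
          f"minimum significant mean = {crit:.1f} ({pct:.0%} percentile)")
# n = 156: power = 80.3%, minimum significant mean = 102.0 (55% percentile)
# n =  10: power = 15.6%, minimum significant mean = 107.8 (70% percentile)
```

This also makes the overestimation bullet concrete: with 10 participants, any significant result must show a mean of at least 107.8, far above the true mean of 103, so every significant finding overestimates the effect; with 156 participants the threshold (102.0) sits below the true mean, so significant results need not be inflated.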
Prevalence of Underpowered Studies
Button et al. (2013) Power failure: why small sample size undermines the reliability of neuroscience
• Analyzed meta-analyses from 2006-2011
• Average power is at best 8-30%
• This is very likely an overestimate!

“Ethical implications. Low average power in neuroscience studies also has ethical implications. In our analysis of animal model studies, the average sample size of 22 animals for the water maze experiments was only sufficient to detect an effect size of d = 1.26 […]”

[Figure: histogram of achieved statistical power across neuroscience meta-analyses; x-axis: power (%) in 10% bins from 0–10 to 91–100; y-axis: number of studies (N).]

How do We Improve Reliability? Important, General Culprits:
1. Biases in Scientific Practice
2. Underpowered Studies

Some Ways to Improve Reliability
• Design adequately powered experiments
  - Estimate power based on previous work/pilot data (see the sketch after this list)
  - Share data across labs to increase power
• Decide on analysis details before looking at data
  - Post-hoc analyses are valuable but should be labeled as such
  - Preregister study design
• Publish results no matter what
  - Increasing # of outlets for null results (PLoS One, F1000 Research, Open Science Framework)
• Publish data and analysis code
  - Best way to specify methods, ensure accuracy, and benefit the community
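Estimating the required sample size before data collection is a one-liner in most statistics packages. A minimal sketch using statsmodels follows; the d = 0.2 input is just the small-effect benchmark from the coffee example, and in practice you would plug in an estimate from prior work or pilot data, as the slide suggests.

```python
# Minimal sketch (assumed two-sample, two-sided t-test): a priori sample-size
# calculation for a small effect (d = 0.2) at 80% power.
import math
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05,
                                          power=0.80, alternative='two-sided')
print(f"d = 0.2, alpha = .05, power = .80 -> {math.ceil(n_per_group)} participants per group")
# -> 394 participants per group
```

Writing the result of this kind of calculation into the methods section (and into a preregistration) answers the earlier question about a priori sample-size justifications.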
Effect of Preregistration? Kaplan & Irvin (2015) Likelihood of null effects of large NHLBI clinical trials has increased over time.

OSF’s $1 Million Preregistration Challenge

The Open Science Framework can help with all of this! Moreover, OSF can:
• Improve your research log and general organization
• Improve the visibility of your work

But why should I make the extra effort to do more reliable research? Won’t this reduce my chances for high impact papers and make it easier for people to detect my mistakes?

Funders, Participants, & Patients Deserve More Reliable Science
Cassidy Megan, creator of Purple Day for Epilepsy

Why did you become a scientist? Hubble Space Telescope, Newtonian Mechanics, Insulin (Kudos Toronto!)

Why did you become a scientist? Johannes Andreas Grib Fibiger: 1926 Nobel Prize in Medicine for demonstrating that cancer was caused by nematodes

Tag-Team Talk B: Practical Guide to using the Open Science Framework
https://osf.io
Liz Page-Gould, PhD
Associate Professor: Psychology, Univ. of Toronto
Canada Research Chair in Social Psychophysiology
OSF Ambassador