Underpowered Studies - Open Science Framework

Transcription

Learning to Improve the Impact &
Integrity of your Science: An
Introduction to the Open Science
Framework
https://osf.io
Lorraine Chuen
http://www.ooocanada.ca/
@OOOpen_Canada
http://facebook.com/OOOCanada
OOOCanada
Co-Founder
Studio-Y Fellow:
MaRS Discovery District
Tag-Team Talk
A: Science’s Reproducibility Crisis
David Groppe, PhD
Research Associate:
Psychology, Univ. of Toronto
Adjunct Scientist:
Neurosurgery, Feinstein Institute
for Medical Research
Tag-Team Talk
B: Practical Guide to using the
Open Science Framework
Liz Page-Gould, PhD
https://osf.io
Associate Professor:
Psychology, Univ. of Toronto
Canada Research Chair in
Social Psychophysiology
OSF Ambassador
Motivate Change
How to Implement Change
To Download Our Slides:
https://osf.io/avdhk/
CC license:
Tag-Team Talk
A: Science’s Reproducibility Crisis
David Groppe, PhD
Research Associate:
Psychology, Univ. of Toronto
Adjunct Scientist:
Neurosurgery, Feinstein Institute
for Medical Research
Science=Awesome
http://science.amorphia-apparel.com/design/robot/
For Example…
Hubble Space
Telescope
Newtonian
Mechanics
Insulin
(Kudos Toronto!)
Mistakes can Happen…
Johannes Andreas
Grib Fibiger
Mistakes can Happen…
Johannes Andreas
Grib Fibiger
1926 Nobel Prize in Medicine for
demonstrating that cancer was caused
by nematodes
Mistakes can Happen…
Johannes Andreas
Grib Fibiger
Mistakes should be
rare (e.g., p<0.05)
1926 Nobel Prize in Medicine for
demonstrating that cancer was caused
by nematodes
Mistakes can Happen…
Johannes Andreas
Grib Fibiger
Mistakes should be
rare (e.g., p<0.05)
Mistakes should be
corrected by
subsequent research
1926 Nobel Prize in Medicine for
demonstrating that cancer was caused
by nematodes
Increasing Evidence that Mistakes
are Common and Difficult to Correct
Image from: http://www.economist.com/news/briefing/21588057scientists-think-science-self-correcting-alarming-degree-it-not-trouble
3 Large Scale Replication Studies
1. Open Science Collaboration (2015) Estimating the reproducibility of psychological science.
   • Systematic attempt to replicate 100 psychology experiments
   • 39% replication success
   • 83% of replicated effects were smaller than original effects
2. Begley & Ellis (2012) Raise standards for preclinical research.
   • Amgen’s attempt to replicate 53 “landmark” haematology & cancer studies
   • 11% replication success
3. Prinz, Schlange, & Asadullah (2011) Believe it or not: how much can we rely on published data on potential drug targets?
   • Bayer HealthCare’s attempt to replicate 67 studies in cancer, women’s health, & heart disease
   • 20-25% replication success
Mistakes can Happen…
Johannes Andreas
Grib Fibiger
?
Mistakes should be
rare (e.g., p<0.05)
Mistakes should be
corrected by
subsequent research
1926 Nobel Prize in Medicine for
demonstrating that cancer was caused
by nematodes
Reproducibility & Impact
REPRODUCIBILITY OF RESEARCH FINDINGS: Preclinical research generates many secondary publications, even when results cannot be reproduced.
Journal impact factor | Number of articles | Mean number of citations of non-reproduced articles* | Mean number of citations of reproduced articles
>20                   | 21                 | 248 (range 3–800)                                     | 231 (range 82–519)
5–19                  | 32                 | 169 (range 6–1,909)                                   | 13 (range 3–24)
Results from ten-year retrospective analysis of experiments performed prospectively. The term ‘non-reproduced’ was assigned on the basis of findings not being sufficiently robust to drive a drug-development programme.
*Source of citations: Google Scholar, May 2011.
Over a 10 year period, non-reproducible studies tended to have greater impact and in some cases spawned subfields and clinical trials.
Begley & Ellis (2012)
Negative Results Rarely Published
[Figure “ACCENTUATE THE POSITIVE”: A literature analysis across disciplines reveals a tendency to publish only ‘positive’ studies, i.e. those that support the tested hypothesis; psychiatry and psychology are the worst offenders. The chart plots the proportion of papers supporting the tested hypothesis (axis from 50% to 90%) for physical, biological, and social science fields.]
Yong (2012) Bad Copy
Mistakes can Happen…
Johannes Andreas Grib Fibiger
How do we increase science’s credibility?
Mistakes should be rare (e.g., p<0.05)
Mistakes should be corrected by subsequent research
1926 Nobel Prize in Medicine for demonstrating that cancer was caused by nematodes
Why is Reproducibility so Low?
Why is Reproducibility so Low?
Important, General Culprits
1. Biases in Scientific Practice
2. Underpowered Studies
Some Biases in Scientific Practice
•Confirmation bias
•Circular analysis/“p-hacking”
•Addition or removal of observations
to generate statistical significance
•Publication bias
Some Biases in Scientific Practice
•Confirmation bias
•Circular analysis/“p-hacking”
•Addition or removal of observations
to generate statistical significance
•Publication bias
People are less likely to be critical (e.g., double
check) results that confirm expectations.
Some Biases in Scientific Practice
•Confirmation bias
•Circular analysis
•Addition or removal of observations to generate statistical significance
•Publication bias
Now that I have my data, how should I analyze them?
A Priori Analysis: Statistically Sound
Decide on data
collection and
analysis parameters
Collect data
Perform
analyses
Circular Analysis: Prone to Inflated
Significance/Confidence
Decide on data
collection and
analysis parameters
Collect data
Use data to
revise
collection or
analysis
parameters
Final analyses
See: Gelman, A. & Loken, E. The garden of forking paths:
Why multiple comparisons can be a problem, even when
there is no “fishing expedition” or “p-hacking” and the
research hypothesis was posited ahead of time
Circular Analysis: Prone to Inflated
Significance/Confidence
Decide on data
collection and
analysis parameters
Collect data
Use data to
revise
collection or
analysis
parameters
Final analyses
Can be extremely valuable
(but BEWARE!)
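A minimal simulation sketch (illustrative only, not from the talk, assuming Python with NumPy/SciPy) makes the danger concrete: when there is no true effect but the analyst tries several analysis variants and reports whichever crosses p < 0.05, far more than 5% of null studies come out “significant”.

```python
# Illustrative simulation (not from the talk): why letting the data steer the
# analysis inflates significance. Under a true null, a researcher who tries
# several analysis variants (here, 5 different null outcome measures) and
# reports whichever reaches p < .05 "finds" an effect far more than 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group, n_variants = 5000, 30, 5

false_positives = 0
for _ in range(n_sims):
    found_significant = False
    for _ in range(n_variants):
        # Each variant: a different (null) outcome compared between two groups.
        group_a = rng.normal(size=n_per_group)
        group_b = rng.normal(size=n_per_group)
        _, p = stats.ttest_ind(group_a, group_b)
        if p < 0.05:
            found_significant = True
    false_positives += found_significant

print(f"False positive rate when reporting the best of {n_variants} analyses: "
      f"{false_positives / n_sims:.3f}")
# With one pre-specified analysis this would be ~0.05; with 5 chances it is
# roughly 1 - 0.95**5 ≈ 0.23.
```

This is why a result from an exploratory, data-dependent analysis deserves much weaker confidence than the same p value from a pre-specified analysis.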
Prevalence of Circular Analyses
1. Kriegeskorte, et al. (2009) Circular analysis in systems neuroscience: the dangers of double dipping
   • Analyzed all 2008 fMRI studies in 5 high impact journals
   • 42-56% contained at least one circular analysis
2. John, Loewenstein, & Prelec (2012) Measuring the prevalence of questionable research practices with incentives for truth telling
   • Surveyed over 2000 psychologists
   • 35% reported an unexpected finding as having been predicted from the start in at least one paper
Some Biases in Scientific Practice
•Confirmation bias
•Circular analysis
•Addition or removal of observations
to generate statistical significance
•Publication bias
Addition of Data to Generate Significance
p=0.09! Collect a bit more data!
Removal of Data to Generate Significance
p=0.04! Done!
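How much does this kind of data-dependent topping-up or trimming matter? A small illustrative simulation (an assumption-laden sketch, not an analysis from the talk) shows the false positive rate creeping above the nominal 5% even when only “almost significant” studies get extra data.

```python
# Illustrative simulation (not from the talk): the "p=0.09! Collect a bit more
# data!" strategy. Even with no true effect, re-testing after topping up the
# sample only when the first look was "almost significant" inflates the false
# positive rate above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_initial, n_extra = 5000, 20, 10

false_positives = 0
for _ in range(n_sims):
    a = rng.normal(size=n_initial)   # group 1, no true effect
    b = rng.normal(size=n_initial)   # group 2, no true effect
    _, p = stats.ttest_ind(a, b)
    if 0.05 <= p < 0.10:
        # "Almost significant": collect a bit more data and test again.
        a = np.concatenate([a, rng.normal(size=n_extra)])
        b = np.concatenate([b, rng.normal(size=n_extra)])
        _, p = stats.ttest_ind(a, b)
    false_positives += p < 0.05

print(f"False positive rate with data-dependent top-up: {false_positives / n_sims:.3f}")
# A fixed, pre-registered sample size keeps this at ~0.05.
```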
Prevalence of Post-Hoc Data Selection
John, Loewenstein, & Prelec (2012) Measuring the prevalence of questionable research practices with incentives for truth telling
• Surveyed over 2000 psychologists
• 58% decided to collect more data after looking to see if results were significant
• 23% decided to stop collecting data earlier than planned because the expected result had been found
• 43% decided to exclude data after looking at the impact of doing so on results
How often have you read an a priori justification for sample size in a methods section?
Some Biases in Scientific Practice
•Confirmation bias
•Circular analysis
•Addition or removal of observations
to generate statistical significance
•Publication bias
Publication Bias
The File Drawer Effect: Null/small effects rarely published
The High-Impact Journal Effect: Extraordinary results more likely to be published
Negative Results Rarely Published
[Figure “ACCENTUATE THE POSITIVE”: A literature analysis across disciplines reveals a tendency to publish only ‘positive’ studies, i.e. those that support the tested hypothesis; psychiatry and psychology are the worst offenders. The chart plots the proportion of papers supporting the tested hypothesis (axis from 50% to 90%) for physical, biological, and social science fields.]
Yong (2012) Bad Copy
Biased p-Value Distribution: Psychology
[Figure: distribution of z-transformed p values in psychology, split into p>0.05 and p<0.05, illustrating publication bias. The dashed line marks the critical z-statistic (1.96) associated with p = 0.05 for two-tailed tests; interval widths (0.245, i.e. a multiple of 1.96) correspond to a 12.5% caliper.]
Kühberger, A., Fritz, A., & Scherndl, T. (2014). Publication bias in psychology: a diagnosis based on the correlation between effect size and sample size. PLoS ONE.
Some Biases in Scientific Practice
•Confirmation bias
•Circular analysis
•Addition or removal of observations
to generate statistical significance
•Publication bias
Mistakes should be
rare (e.g., p<0.05)
Mistakes should be
corrected by
subsequent research
Why is Reproducibility so Low?
Important, General Culprits
1. Biases in Scientific Practice
2. Underpowered Studies
Underpowered Studies
•An experiment’s “statistical power” is the
probability of detecting an effect
•Power is a function of sample size, effect size, and
alpha level
•“Significant” effects will necessarily be
overestimated when effects are small & studies
are underpowered
•Consequently, underpowered studies are highly
susceptible to aforementioned biases
A Hypothetical Example
•You want to determine if coffee affects performance on
an IQ test
•The IQ test has been carefully designed to have a
Gaussian distribution of scores with a mean of 100 and a
standard deviation of 15 points
•The true mean effect of coffee on test performance is 3 points (a small effect, Cohen’s d=3/15=0.2)
Adequately Powered Design
•156 Participants
•Power=80%
•Minimum mean test score for p<0.05=102.0
-“Coffee increases average IQ to 55th percentile”
[Figure: post-coffee IQ score axis (100-104) showing the true effect and the minimum significant mean of 102.0]
Underpowered Design
•10 Participants
•Power=15%
•Minimum mean test score for p<0.05=107.8
-“Coffee increases average IQ to 70th percentile”
[Figure: post-coffee IQ score axis (106-108) showing the minimum experimental result for significance, 107.8]
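The numbers on these two slides can be reproduced in a few lines, assuming the test is a one-sided, one-sample z-test of the post-coffee mean against the known population mean of 100 (SD 15); the sketch below makes that assumption explicit.

```python
# Sketch reproducing the coffee/IQ example, assuming a one-sided one-sample
# z-test against the known population mean (100) and SD (15).
from scipy.stats import norm

mu0, sigma, true_effect, alpha = 100.0, 15.0, 3.0, 0.05

def design(n):
    se = sigma / n ** 0.5
    min_sig_mean = mu0 + norm.ppf(1 - alpha) * se              # smallest sample mean with p < .05
    power = 1 - norm.cdf(min_sig_mean, loc=mu0 + true_effect, scale=se)
    percentile = norm.cdf(min_sig_mean, loc=mu0, scale=sigma)  # where that mean sits in the IQ distribution
    return min_sig_mean, power, percentile

for n in (156, 10):
    m, p, pct = design(n)
    print(f"n={n:3d}: minimum significant mean = {m:5.1f}, power = {p:.0%}, "
          f"i.e. 'coffee raises average IQ to the {pct * 100:.0f}th percentile'")

# n=156: minimum significant mean ≈ 102.0, power ≈ 80%, ≈ 55th percentile
# n= 10: minimum significant mean ≈ 107.8, power ≈ 16% (the slide rounds to 15%), ≈ 70th percentile
```

This is the sense in which “significant” effects from underpowered studies are necessarily overestimates: the n=10 study can only reach significance when its observed effect is more than double the true 3-point effect.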
What Factors Reduce Power?
•Small sample sizes
•Uncontrolled factors that also influence the
process of interest (increased noise)
•Multiple statistical comparisons
Many Variables
Many Analyses
Now that I have my
data, how should I
analyze them?
40,000-500,000 voxels
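Testing tens or hundreds of thousands of voxels means the per-test threshold must be made far stricter to control false positives, and that alone erodes power. A rough sketch (hypothetical effect size and sample size, not figures from the talk) using a Bonferroni correction and the same one-sided z-test approximation as the coffee example:

```python
# Rough sketch (hypothetical d and n, not from the talk): per-test alpha shrinks
# as the number of comparisons grows, and power shrinks with it.
from scipy.stats import norm

def power_one_sided_z(d, n, alpha):
    """Power of a one-sided one-sample z-test for standardized effect size d."""
    z_crit = norm.ppf(1 - alpha)
    return 1 - norm.cdf(z_crit - d * n ** 0.5)

d, n = 0.5, 30                         # hypothetical medium effect, 30 observations
for m in (1, 100, 40_000, 500_000):    # number of simultaneous comparisons
    per_test_alpha = 0.05 / m          # Bonferroni-corrected per-test threshold
    print(f"{m:>7} comparisons: power = {power_one_sided_z(d, n, per_test_alpha):.2f}")

# Power falls from ~0.86 with a single test to ~0.02 at 40,000 comparisons,
# roughly the lower end of the 40,000-500,000 voxels in a whole-brain fMRI analysis.
```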
Prevalence of Underpowered Studies
Button et al. (2013) Power failure: why small sample size undermines the reliability of neuroscience
• Analyzed meta-analyses from 2006-2011
• Average power is at best 8-30%
[Figure: histogram of achieved statistical power (%) across the analyzed neuroscience meta-analyses]
Prevalence of Underpowered Studies
Button et al. (2013) Power failure: why small sample size undermines the reliability of neuroscience
• Analyzed meta-analyses from 2006-2011
• Average power is at best 8-30%
This is very likely an overestimate!
[Figure: same histogram of achieved statistical power (%)]
How do We Improve Reliability?
Important, General Culprits
1. Biases in Scientific Practice
2. Underpowered Studies
Some Ways to Improve Reliability
•Design adequately powered experiments
-Estimate power based on previous work/pilot data
-Share data across labs to increase power
•Decide on analysis details before looking at data
-Post-hoc analyses are valuable but should be labeled as such
-Preregister study design
•Publish results no matter what
-Increasing # of outlets for null results (PLoS One, F1000
Research, Open Science Framework)
•Publish data and analysis code
-Best way to specify methods, ensure accuracy, and benefits the
community
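For the first point, an a priori power analysis takes only a couple of lines; the sketch below assumes Python’s statsmodels package and a hypothetical pilot effect size of d = 0.4 (the slides do not prescribe a particular tool or effect size).

```python
# Minimal sketch of a priori sample-size planning. Assumes the statsmodels
# package; d = 0.4 is a hypothetical pilot estimate, not a number from the talk.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
pilot_d = 0.4   # standardized effect size from pilot data / previous work (hypothetical)

# How many participants per group for 80% power in a two-sided, two-sample t-test?
n_per_group = analysis.solve_power(effect_size=pilot_d, alpha=0.05, power=0.80,
                                    alternative='two-sided')
print(f"~{n_per_group:.0f} participants per group needed for 80% power at d = {pilot_d}")

# Conversely: how much power does a typical small study of 20 per group have?
achieved = analysis.power(effect_size=pilot_d, nobs1=20, alpha=0.05)
print(f"With 20 per group, power is only {achieved:.0%}")
```

Because effect sizes estimated from small pilots or a publication-biased literature tend to be inflated, it is often safer to power the study for the smallest effect that would still matter.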
Effect of Preregistration?
Kaplan & Irvin (2015) Likelihood of null effects of large NHLBI clinical trials has increased over time.
OSF’s $1 Million
Preregistration Challenge
The Open Science Framework
can help with all of this!
Moreover, OSF can:
Improve your research
log and general
organization:
Improve the visibility
of your work:
But why should I
make the extra effort to
do more reliable research?
Won’t this reduce my
chances for high impact
papers and make it easier
for people to detect my
mistakes?
Funders, Participants, & Patients
Deserve More Reliable Science
Cassidy Megan, creator of
Purple Day for Epilepsy
Why did you become a scientist?
Hubble Space
Telescope
Newtonian
Mechanics
Insulin
(Kudos Toronto!)
Why did you become a scientist?
Johannes Andreas
Grib Fibiger
1926 Nobel Prize in Medicine for
demonstrating that cancer was caused
by nematodes
Tag-Team Talk
B: Practical Guide to using the
Open Science Framework
https://osf.io
Liz Page-Gould, PhD
Associate Professor:
Psychology, Univ. of Toronto
Canada Research Chair in
Social Psychophysiology
OSF Ambassador