Taming the Statistical Shrew

Richard M. Rosenfeld, MD, MPH, FAAP
Professor of Otolaryngology, SUNY Downstate Medical Center
Chairman of Otolaryngology, Long Island College Hospital, Brooklyn, NY
Statistics
The science and art of collecting, summarizing, and analyzing
data that are subject to random variation
Last JM, A Dictionary of Epidemiology, 4th ed., Oxford Univ. Press 2001
Statistics are used to:
Develop sound judgment about data applicable to clinical care
Read the literature critically, understanding potential errors and fallacies
Apply epidemiologic information to patient care and disease prevention
Reach correct conclusions about diagnostic procedures and laboratory results
Publish and critique journal manuscripts
Create and evaluate research protocols
Dawson B, Trapp RG. Basic & Clinical Biostatistics, 3rd ed. NY: Lange 2001
Mummy Powder Cures Common Cold within 24 Hours for 85.7% of Subjects!!!
Cured: 86%
Not cured: 14%
How Confident Should You Be?
95% Confidence Interval vs. Sample Size for a Success Rate of 86%
Successes/total sample    Success rate    95% confidence interval
6/7                       86%             42 – 100%
12/14                     86%             57 – 98%
24/28                     86%             67 – 96%
48/56                     86%             74 – 94%
96/112                    86%             78 – 92%
192/224                   86%             80 – 90%
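The narrowing of the interval with sample size can be reproduced with a short sketch. The slide does not say which interval method was used; the code below assumes an exact (Clopper-Pearson) binomial interval, which appears to agree with the tabulated values after rounding. The function names are illustrative, not from the original.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_increasing(f, target, iters=100):
    """Bisection on [0, 1] for an increasing function f, solving f(p) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(successes, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided CI for a proportion (assumed method)."""
    if successes == 0:
        lower = 0.0
    else:
        # Lower limit: p at which P(X >= successes) equals alpha / 2.
        lower = solve_increasing(
            lambda p: 1 - binom_cdf(successes - 1, n, p), alpha / 2
        )
    if successes == n:
        upper = 1.0
    else:
        # Upper limit: p at which P(X <= successes) equals alpha / 2.
        upper = solve_increasing(lambda p: -binom_cdf(successes, n, p), -alpha / 2)
    return lower, upper


for x, n in [(6, 7), (12, 14), (24, 28), (48, 56), (96, 112), (192, 224)]:
    lo, hi = exact_ci(x, n)
    print(f"{x}/{n}: {x / n:.0%} (95% CI {lo:.0%} to {hi:.0%})")
```

The success rate stays at 86% in every row; only the width of the interval changes as the sample grows.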
Precision
The quality of being sharply defined or stated.
Statistical precision is the inverse of the variance for an estimate.
Last JM, A Dictionary of Epidemiology, 4th ed., Oxford Univ. Press 2001
Defined as: Consistency or repeatability
Threatened by: Random error (variance)
Independent of: Systematic error (bias)
Would you cross this
rickety old bridge?
Rule of Threes
The 95% CI Upper Limit Given No Events in n Trials is 3/n, Based on a Poisson Distribution
Trials with no events    Upper limit of 95% CI
10                       30%
15                       20%
20                       15%
25                       12%
30                       10%
35                       9%
40                       8%
45                       7%
50                       6%
The rule of threes can address the following type of question: “I am told by my physician that I need a serious operation and there has not been a fatality in 20 she performed. What is the potential postoperative mortality based on this information?”
Van Belle G. Statistical Rules of Thumb. NY: Wiley Interscience, 2002.
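As a rough check on the rule, the sketch below compares 3/n with the upper limit obtained by solving (1 - p)^n = 0.05 directly. The rule works because -ln(0.05) is about 3. The function names are illustrative assumptions.

```python
def rule_of_three(n):
    """Approximate upper 95% limit on the event rate after n event-free trials."""
    return 3 / n


def exact_upper_limit(n, confidence=0.95):
    """Event rate p at which n event-free trials have probability 1 - confidence."""
    return 1 - (1 - confidence) ** (1 / n)


for n in (10, 20, 50):
    print(f"n={n}: rule of threes {rule_of_three(n):.1%}, "
          f"direct solution {exact_upper_limit(n):.1%}")
```

For the surgeon with 20 operations and no fatalities, both versions put the plausible upper limit on mortality at roughly 14 to 15%.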
Should you
believe a
“zero” result?
It’s all a
question of
confidence.
Great Mysteries in Ear Tube Surgery
12 o’clock
6 o’clock
I say “put it
here”…
… but they
put it there
Accuracy
The degree to which a measurement or an estimate based on measurement
represents the true value of the attribute that is being measured
Last JM, A Dictionary of Epidemiology, 4th ed., Oxford Univ. Press 2001
Defined as: Nearness to the truth
Threatened by: Systematic error (bias)
Independent of: Random error (variance)
New Technique for Tonsillectomy
Reduces Pain, Improves Satisfaction
The Magic of Cherry-Picking Your Sample
1. AB, age 7y: normal diet by 2 days
2. RR, age 5y: ate everything next day like nothing happened; everyone brought gifts but she didn’t need them
3. PV, age 4y: normal diet next day
4. AC, age 8y: normal diet next day
5. MC, age 4y: went to dance class and ate pizza next day
6. NC, age 7y: ate everything next day like nothing happened
7. LG, age 4y: ate Pudgies’ Chicken, macaroni & cheese next day
8. KS, age 4y: normal diet next day
9. NM, age 5y: normal diet next day
10. DW, age 8y: no pain after surgery
11. AC, age 4y: ate Chinese food the same day for dinner
Validity
[Diagram: Findings in the Study (Study Sample), Truth in the Study, Truth in the Accessible Population, and Truth in the Target Population (Target Population), linked in that order by Observation, Inference, and Generalization. The diagram also labels Internal Validity and External Validity.]
Effect of Early vs. Delayed Tympanostomy
Tubes for OME on Child Development
Paradise et al, 2001-2007
RCT of 588 children, identified from monthly screening of 6,350 healthy infants, with cumulative duration of bilateral OME >90d or unilateral OME >135d, of which 429 were randomized to prompt vs. delayed (6-9m) tube insertion
Most children had unilateral OME (63%) or discontinuous OME (67%)
Bilateral continuous OME was uncommon (18%)
Early-treatment group had delays in tube placement: 31% within 30d, 33% in 31-60d, 15% in 61-180d, 18% never
Impact of early-tube placement was only 10% fewer days with OME over 24 months (30 vs. 40%), which equals 36 days per year
Developmental and academic tests at age 3y, 4y, 5y, 6y, and 9-11y showed no difference in outcomes for the prompt- vs. delayed-tube group
NEJM 2001;344:1179-87, Pediatr Infect Dis J 2003;22:309-14, Pediatrics 2003;112:265-77, NEJM 2005;353:576-86, NEJM 2007;356:248-61
Anatomy of an Estimate
Not All Statistics Are Created Equal
Is it Precise?
– Are the results consistent and repeatable?
Is it Accurate?
– Does it reflect the true value of the attribute being measured?
Is it Valid?
– Can we make inferences based on the estimate?
Mummy Powder Cures 85.7%
of Colds within 24 hours
Low Budget Study
Precision: 46 – 99% (N = 7)
Accuracy: Inclusion by stuffy nose; outcome by telephone contact
Validity: Judgmental sample drawn from waiting room of local chiropractor's office
Mummy Powder Cures 85.7%
of Colds within 24 hours
High Budget Study
Precision: 81 – 90% (N = 224)
Accuracy: Inclusion by X-ray and RAST; outcome by rhinomanometry
Validity: Two-stage cluster sample drawn from most recent US census report
Controlled Clinical Trial
James Lind, Scottish Surgeon, 1716-1794
Tröhler U (2003). James Lind and scurvy: 1747 to 1795.
The James Lind Library (www.jameslindlibrary.org)
Which Case Series are Worth Publishing?
Rosenfeld RM, Otolaryngol HNS 2007
The best case series report uncommon situations or deal with circumstances where RCTs would be unethical or impractical, AND:
1. Include a consecutive, well-defined sample of subjects that is fully described so readers can judge relevance
2. Report interventions with enough detail for reproduction, including any adjunctive treatments allowed
3. Account for all patients initially enrolled, and follow them long enough to overcome random disease fluctuations
4. Perform statistical analysis, preferably multivariate
5. Reach justifiable conclusions, devoid of “efficacy” claims
Otolaryngol Head Neck Surg 2007; 136:337-9
First Randomized Trial (Sealed Envelopes)
Medical Research Council, BMJ 1948
First clinical trial (streptomycin for tuberculosis) using random numbers
and sealed envelopes, instead of old practice of alternating cases
BMJ 1948; 2:769-82.
Centralized randomization scheme
Why Bother to Randomize?
What’s wrong with thoughtful, individualized allocation of patients to treatment or no treatment by insightful clinicians?
1. Randomization eliminates allocation bias, which can give false or misleading results when clinicians allocate treatment
2. Randomization provides proper estimates of random error, which are required for valid statistical analysis
Mummy Powder for Adenovirus URI
Symptom relief for 150 patients randomized to cellulose placebo (n=75)
vs. mummy powder (n=75) for adenovirus upper respiratory infection
χ2 = 3.14, P = .076
Rate difference = -13%
76%
63%
Cellulose placebo
Mummy powder
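The reported test statistic can be reproduced with a minimal sketch. The counts 57/75 and 47/75 are an assumption, back-calculated from the rounded 76% and 63% on the slide, and the test is assumed to be a Pearson chi-square without continuity correction. Which arm had which rate does not change the statistic.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Assumed counts: 57 of 75 relieved in one arm, 47 of 75 in the other.
chi2, p = chi_square_2x2(57, 75 - 57, 47, 75 - 47)
print(f"chi-square = {chi2:.2f}, P = {p:.3f}")
```

Under these assumptions the result lands close to the slide's χ2 = 3.14 and P = .076, a 13-point difference that does not reach the conventional .05 threshold with 75 patients per arm.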
P Value
P value is the probability that a test statistic would be as extreme as
or more extreme than observed if the null hypothesis were true
Last JM, A Dictionary of Epidemiology, 4th ed., Oxford Univ. Press 2001
P < .05: An alternative to the null hypothesis might better explain the observations
P ≥ .05: The null hypothesis satisfactorily explains the observations
P Value Dichotomy
Investigators may arbitrarily set their own significance levels, but in most
biomedical and epidemiologic work, a study whose probability value is less
than 5% (P < .05) or 1% ( P < .01) is considered sufficiently unlikely to have
occurred by chance to justify the designation “statistically significant”
Ronald A. Fisher
British Mathematician and
Biologist, 1890-1962
Introduced P values and randomization
into agricultural research in 1920s
“If P is between 0.1 and 0.9 there is
certainly no reason to suspect the (null)
hypothesis tested. If it is below 0.02 it is
strongly indicated that the hypothesis fails
to account for the whole of the facts. We
shall not often be astray if we draw a
conventional line at 0.05…”
Red Hot News Flash !!
New Studies Prove Efficacy of Recombinant
Mummy Powder for the Common Cold!!
New England Journal of Medicine
Mummy beats placebo,
P < .05, N = 3,000 !!!
Journal of Low Budget Research
Mummy beats placebo,
P < .05, N = 30 !!!
Statistical vs. Clinical Significance
For a Binary Outcome in 2 Groups
Group size    Smallest detectable absolute group difference
15            45%
20            40%
30            35%
40            30%
60            25%
90            20%
170           15%
400           10%
1500          5%
For example, a group size of 15 (N = 30) would need an absolute difference in outcome between groups of at least 45% to reach statistical significance
Amoxicillin for Acute Otitis Media
Kaleida et al, Pediatrics 1991
Children aged 7m – 12y with 980 episodes of AOM randomized to amoxicillin vs. placebo for 14 days
               Success    Failure
Amoxicillin    96%        4%
Placebo        92%        8%
Absolute increase in success rate: 4% [1 – 7%], P = .015
Relative decrease in failure rate: 50% [14 – 70%], P = .015
Pediatrics 1991; 87: 466-474
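The same trial result reads very differently as an absolute or a relative change. A minimal sketch of the arithmetic, using the failure rates from the slide (the function name is illustrative):

```python
def effect_measures(failure_control, failure_treated):
    """Return (absolute, relative) reduction in the failure rate."""
    absolute = failure_control - failure_treated
    relative = absolute / failure_control
    return absolute, relative


# Failure rates from the slide: 8% with placebo, 4% with amoxicillin.
absolute, relative = effect_measures(0.08, 0.04)
print(f"absolute change: {absolute:.0%}, relative change: {relative:.0%}")
```

Both figures describe the same data and carry the same P value; the relative figure simply divides a small difference by a small baseline.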
Absolute vs. Relative Risk
Your chance of winning > $50K in a lottery is 1:80 million
You are 12 times more likely to get killed in a 1 mile car
ride to buy a ticket than to actually win the lottery
Anatomy of an Estimate
Not All Statistics Are Created Equal
Is it Precise?
– Are the results consistent and repeatable?
Is it Accurate?
– Does it reflect the true value of the attribute being measured?
Is it Valid?
– Can we make inferences based on the estimate?
Bias in Treatment Studies
Bias is a systematic deviation from the truth, which may occur in the collection, analysis, interpretation, publication, or review of data
1. Design bias occurs when the study is planned to include subjects, endpoints, or outcomes that are more likely to support prior expectations
2. Ascertainment bias is caused by studying a subject sample that does not fairly represent the larger population to which results are to be applied
3. Allocation (selection) bias occurs when groups vary in prognosis because of demographics, illness severity, or other baseline characteristics
4. Observer (detection or measurement) bias can distort how outcomes are assessed if the observer is aware of the treatment received
5. Reviewer bias can lead to erroneous conclusions when an author selectively cites published studies that favor a particular viewpoint
Rosenfeld RM. Disclosure. Otolaryngol Head Neck Surg 2008; In press
Theriac for Sale: Universal
Antidote for Poisoning –
Also cures Aprosexia
Theriac for Aprosexia*
Randomized Double-Blind Clinical Trial
Research hypothesis: Theriac is more effective than placebo in treating aprosexia
Null hypothesis: Theriac is equivalent to or less effective than placebo
* Aprosexia is defined as the inability to concentrate due to
ocular, aural, or mental deficits or to mental weakness
Stedman’s Medical Dictionary, 23rd edition
Theriac for Aprosexia
Clinical cure rate for 50 patients randomized to
placebo (n=25) vs. theriac (n=25) for intractable aprosexia
χ2 = 9.74, P = .002
Rate difference = 44%
76%
32%
Placebo
Theriac
Type I (Alpha) Error
Probability of Occurrence is P value
Decision situation    Type I error
Null hypothesis       Reject even though true
Diagnostic test       False positive
Clinical trial        Promote worthless therapy
Judicial system       Sentence the innocent
Used car selection    Reject dependable car
Brown GW. Errors, types I and II. Am J Dis Child 137:586-91
Ronald Fisher: P value, 1925
Jerzy Neyman: Confidence Interval, 1937
P values vs. Confidence Intervals for a Rate Difference of 44%
[Figure: absolute rate difference (y-axis, 10% to 80%) by sample size (x-axis)]
Sample size    P value
50             P=.002
100            P<.001
200            P<.001
500            P<.001
1000           P<.001
Rosenfeld RM. Taming the Statistical Shrew. In: Johnson JT (ed),
Instructional Courses Volume Six. St. Louis, Mosby Year Book; 1993.
Confidence Interval
The computed interval with a given probability, e.g., 95%, that the true value
of a variable (mean, proportion, or rate) is contained within the interval
Last JM, A Dictionary of Epidemiology, 4th ed., Oxford Univ. Press 2001
Defined as: Range of sample means consistent with the data
Threatened by: Small samples (low precision)
Related to, but better than: P values
Gardner MJ. Br Med J 1986;292:746-50
Confidence Intervals
Use At Least 12 Observations in
Constructing a Confidence Interval
Gerald van Belle. Statistical Rules of
Thumb. NY: Wiley Interscience; 2002
Precision does not vary linearly
with sample size, but is related to
the square root of the number of
observations
Rule of thumb: the width of a CI
decreases rapidly until 12
observations are reached then
decreases slowly
For example, for a sample size of
15, the half-width of a 95% CI is
0.5 standard deviations
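The square-root relationship can be sketched in a few lines. The code assumes the half-width is taken as z/sqrt(n) standard deviations with the normal quantile z of about 1.96, which reproduces the 0.5 figure for n = 15; a Student's t quantile would give a somewhat wider interval at small n. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width_in_sd(n, confidence=0.95):
    """Half-width of a CI for a mean, in units of the standard deviation.

    Assumes a normal quantile (about 1.96 for 95%), not Student's t.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z / sqrt(n)


for n in (3, 6, 12, 15, 24, 48, 60):
    print(f"n={n:>2}: half-width = {ci_half_width_in_sd(n):.2f} SD")
```

Quadrupling the sample from 15 to 60 only halves the half-width, which is why gains in precision slow down as observations accumulate.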
Statistics in Otolaryngology Journals
Wasserman et al, Otolaryngol HNS
Survey of 1,924 clinical research articles from the 4 leading
peer-reviewed otolaryngology journals in 1993, 1998, and 2003
[Bar chart: percentage of articles with an internal control group, P-values, or confidence intervals in 1993, 1998, and 2003. Values shown: 36%, 39%, 45%, 43%, 35%, 26%, 1%, 4%, 8%.]
Otolaryngol Head & Neck Surg 2006; 134:717-23
Confidence Intervals
Overlapping Confidence Intervals
Do Not Imply Non-Significance
Gerald van Belle. Statistical Rules of
Thumb. NY: Wiley Interscience; 2002
It is sometimes claimed that if two
independent statistics have
overlapping confidence intervals, then
they are not significantly different
This is true for substantial overlap,
but the overlap can be surprisingly
large and the means still significantly
different
Rule of thumb: overlaps of 25% or
less still suggest statistical
significance
Red Hot News Flash !!
Randomized Controlled Trial Proves Efficacy
of Theriac for Intractable Aprosexia!!
Theriac 44% more effective
than placebo for aprosexia
(95% CI, 19-69%).
There is less than a 2 in 1,000
probability that this is a chance
finding!
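The interval quoted in the news flash can be reproduced with a simple sketch. It assumes a Wald (normal approximation) interval for the difference of two proportions and counts of 19/25 and 8/25, back-calculated from the 76% and 32% cure rates in the theriac trial; the slides do not state the method used.

```python
from math import sqrt
from statistics import NormalDist


def rate_difference_ci(x1, n1, x2, n2, confidence=0.95):
    """Wald CI for the difference of two independent proportions (assumed method)."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


# Assumed counts: 19 of 25 cured with theriac, 8 of 25 cured with placebo.
diff, lo, hi = rate_difference_ci(19, 25, 8, 25)
print(f"rate difference = {diff:.0%} (95% CI {lo:.0%} to {hi:.0%})")
```

The P value says the finding is unlikely to be chance; the interval adds that the true benefit could plausibly be anywhere from modest to very large.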
Red Hot News Flash !!
Urgent Alert from the
Food and Drug Administration
Complete and irreversible hair
loss linked to theriac ingestion
for aprosexia!
50% of subjects became bald
within one year of therapy.
Univariate Analysis of Risk Factors
for Post-Theriac Baldness
Statistically significant factors
Season of therapy (P = .018)
Shoe size (P = .002)
Favorite color (P = .007)
Statistically non-significant factors
Geographic region, Ethnic group, Climate, Height, Weight, Socioeconomic status, Hair type, Gender, Hair color, Age, Race, Eye color, Family history, Diet, Exercise, Smoking, Alcohol
Red Hot News Flash !!
FDA Issues New Precautions
When Using Theriac
Unless you seek baldness,
theriac is not recommended
for winter-time aprosexia if
your shoe size is 9 and your
favorite color is aquamarine
You seem in fine health, Mr. Cosgrove, but let’s give you
a series of tests. I’m sure we can find something wrong.
Consequences of Performing 20
Statistical Tests on a Single Set of Data
Assuming that:
Each test is performed with an alpha level of .05
And the observed findings are caused solely by
random variations
The probability of:
1 or more type I errors is 64%
2 or more type I errors is 26%
3 or more type I errors is 7%
A.K.A.
If you torture the
data sufficiently, they
will eventually confess
to something!
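These probabilities follow from the binomial distribution. The sketch below assumes the 20 tests are independent, which the slide does not state explicitly, and the function name is illustrative.

```python
from math import comb


def prob_at_least(k, n_tests=20, alpha=0.05):
    """P(at least k type I errors) among n_tests independent tests of true nulls."""
    p_fewer = sum(
        comb(n_tests, i) * alpha**i * (1 - alpha) ** (n_tests - i)
        for i in range(k)
    )
    return 1 - p_fewer


for k in (1, 2, 3):
    print(f"{k} or more type I errors: {prob_at_least(k):.1%}")
```

With pure noise and 20 looks at the data, at least one "significant" finding is more likely than not.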
Statistical Tests for Associating
an Outcome with Predictor Variables
Data scale for outcome       Parametric test               Nonparametric test
Nominal or ordinal           Discriminant analysis         Log-linear model
Dichotomous                  Discriminant analysis         Multiple-logistic regression
Numerical, 1 predictor       Pearson correlation           Spearman rank correlation
Numerical, 2 predictors      ANOVA, ANCOVA                 —
Numerical, ≥ 2 predictors    Multiple linear regression    —
Censored                     Cox regression                —
Parametric tests are used when group size ≥ 30 or if < 30 with normal distribution
Sample Size for Multivariate Analysis
Obtain At Least 10 Events
For Every Variable Investigated
Assume that 20% of subjects are expected to have the
event of interest and there are 5 predictor variables
About 10 events per variable are needed to get stable
estimates of the regression coefficients
Therefore, about 10 x 5 or 50 events are needed,
making the necessary sample size 250 subjects
Gerald van Belle. Statistical Rules of Thumb
New York: Wiley Interscience; 2002
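The arithmetic behind this rule of thumb is easy to sketch. The function and parameter names are illustrative, and the default of 10 events per variable is taken from the slide.

```python
from math import ceil


def sample_size_for_events(n_predictors, event_rate, events_per_variable=10):
    """Return (events needed, subjects needed) under the events-per-variable rule."""
    events_needed = events_per_variable * n_predictors
    # Small tolerance guards against floating-point round-off before rounding up.
    subjects_needed = ceil(events_needed / event_rate - 1e-9)
    return events_needed, subjects_needed


events, subjects = sample_size_for_events(n_predictors=5, event_rate=0.20)
print(f"{events} events needed, so about {subjects} subjects")
```

A rarer event or a longer list of candidate predictors drives the required sample up quickly.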
Statistical Tests for Comparing
Three or More Groups
Independent samples
Data scale                Parametric test    Nonparametric test
Dichotomous or nominal    —                  χ2 test, log-likelihood ratio
Ordinal                   —                  Kruskal-Wallis ANOVA, χ2 for trend
Numerical                 One-way ANOVA      Kruskal-Wallis ANOVA
Matched, paired, or repeated samples
Data scale                Parametric test    Nonparametric test
Dichotomous               —                  Mantel-Haenszel χ2, Cochran’s Q
Ordinal                   —                  Friedman ANOVA
Numerical                 Repeated ANOVA     Friedman ANOVA
Parametric tests are used when group size ≥ 30 or if < 30 with normal distribution
We need theriac! But baldness is not an option…
Unicorn Horn for Aprosexia
Randomized Double-Blind Clinical Trial
Research hypothesis: Unicorn horn is within 0.20 as effective as theriac
Null hypothesis: Theriac is more effective than unicorn horn by at least 0.20
Sample size calculation = 300 per group (600 overall) based on:
Alpha = .05 (one-sided)
Cure rate, theriac = .60
Beta = .20
Cure rate, unicorn = .40
Unicorn Horn vs. Theriac for Aprosexia
Clinical cure rate for 50 patients randomized to
theriac (n=25) vs. unicorn horn (n=25) for intractable aprosexia
χ2 = 0.33, P = .560
Rate difference = -8%
64%
Theriac
56%
Unicorn horn
Type II (Beta) Error
Power is One Minus Beta
Decision situation    Type II error
Null hypothesis       Accept even though false
Diagnostic test       False negative
Clinical trial        Miss a difference between groups
Judicial system       Free the guilty
Used car selection    Buy a lemon
Brown GW. Errors, types I and II. Am J Dis Child 137:586-91
Effect of Sample Size on Power and Precision for a Rate Difference of 8%
[Figure: absolute rate difference (y-axis, -40% to 50%) by sample size (x-axis)]
Sample size    P value    Power
50             .56        17%
100            .41        26%
200            .25        34%
300            .16        54%
600            .06        80%
Circles indicate the point estimate for the absolute rate difference, and vertical bars indicate the 95% confidence intervals. Positive values favor theriac.
Only the largest trial has adequate statistical power.
Rosenfeld RM. Taming the Statistical Shrew. In: Johnson JT (ed),
Instructional Courses Volume Six. St. Louis, Mosby Year Book; 1993.
Oxford: Oxford University
Press, 2001
Philadelphia: American College of
Physicians, 1997
Looking Beyond P values
Statistical Savvy 101
All P values
Significant P value
How many hypotheses were tested?
How many groups were compared?
Is the result clinically important?
What is the magnitude of outcome?
Is the result precise, accurate, and valid?
Was the correct statistical test used?
Non-significant P value
Was the statistical power adequate?
Is the result clinically important?
“Start out with the conviction that absolute
truth is hard to reach in matters relating to
our fellow creatures, healthy or diseased,
that slips in observation are inevitable
even with the best trained faculties, that
errors in judgment must occur in the
practice of an art which consists largely in
balancing probabilities.
Start, I say, with this attitude of mind, and
mistakes will be acknowledged and
regretted; but instead of a slow process of
self-deception, with ever increasing
inability to recognize truth, you will draw
from your errors the very lessons which
may enable you to avoid their repetition.”
Sir William Osler, Aequanimitas 1904
Taming the Statistical
Shrew 2008
Thank you for your
kind attention!
Richard M. Rosenfeld
[email protected]
