Course BIOS601: Comparisons of 2 Proportions π2 vs. π1: models / (frequentist) inference / planning

Contents

1. Comparative measures
2. Large-sample CIs for comparative parameters
3. Test-based CI's
4. Sample size considerations
5. Small sample methods: conditional vs. unconditional estimation of OR
6. Examples

Comparative measures / parameters:

Measure                        Comparative Parameter            Estimate                   New Scale
(Risk or Prevalence)
  Difference                   π1 − π2                          p1 − p2
NNT (Number Needed to Treat)   1/{π1 − π2}                      1/{p1 − p2}
(Risk or Prevalence)
  Ratio                        π1/π2                            p1/p2                      log[p1/p2] = log p1 − log p2
Odds Ratio¹                    {π1/(1−π1)}/{π2/(1−π2)}          odds1/odds2                log[odds1/odds2] = logit1 − logit2

2.1 In General:

Large-sample CI for Comparative Parameter (if 2 component estimates are uncorrelated):

estimate1 − estimate2 ± z × SE[estimate1 − estimate2]
= estimate1 − estimate2 ± z × (Var[estimate1] + Var[estimate2])^{1/2}.

(if work in new scale, must back-transform)

2.2 In Particular

Risk/Prevalence Difference

p1 − p2 ± z × SE[p1 − p2]
= p1 − p2 ± z × (SE²[p1] + SE²[p2])^{1/2}
= p1 − p2 ± z × (p1 q1/n1 + p2 q2/n2)^{1/2}

Risk/Prevalence Ratio

antilog{ log(p1/p2) ± z × (SE²[log p1] + SE²[log p2])^{1/2} },

where, for i = 1, 2,

SE²[log pi] = Var[log pi] = 1/#positive_i − 1/#total_i.

Odds Ratio

antilog{ log[odds ratio] ± z × (SE²[logit1] + SE²[logit2])^{1/2} },

where, for i = 1, 2,

SE²[logit_i] = Var[logit_i] = 1/#positive_i + 1/#negative_i.

Var[log or] = 1/a + 1/b + 1/c + 1/d for CI_OR → "Woolf's Method."

Cf. Rothman 2002 p. 135, Eqns 7-2, 7-3 and 7-6.

2.3 Large-sample test of π1 = π2

Equivalent to test of:

π1 − π2 = 0                     → Risk or Prevalence Difference = 0.
π1/π2 = 1                       → Risk or Prevalence Ratio = 1.
{π1/(1−π1)}/{π2/(1−π2)} = 1     → Odds Ratio = 1.

z = (p1 − p2 − {Δ = 0}) / SE[p1 − p2]
  = (p1 − p2) / (p[1 − p]/n1 + p[1 − p]/n2)^{1/2},

where p = y/n, with y = y1 + y2; n = n1 + n2.

¹ The Odds Ratio (OR) is close to the Risk Ratio when the 'denominator' odds is low, e.g. under 0.1, and the Risk Ratio is not extreme. For example, if π1 = 0.16, and π2 = 0.08, so that the Risk Ratio is 2, then OR = (0.16/0.84)/(0.08/0.92) = 2.2; but the approximation worsens with increasing π2 and increasing Risk Ratio.
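To make the recipes in 2.1-2.3 concrete, here is a minimal Python sketch (my addition, not part of the original notes; the function names are my own) implementing the three large-sample CIs:

Python

import math

def ci_risk_difference(y1, n1, y2, n2, z=1.96):
    # Wald CI: p1 - p2 +/- z * sqrt(p1*q1/n1 + p2*q2/n2)
    p1, p2 = y1 / n1, y2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - z * se, d + z * se

def ci_risk_ratio(y1, n1, y2, n2, z=1.96):
    # Log-scale CI with Var[log p_i] = 1/#positive_i - 1/#total_i, back-transformed
    rr = (y1 / n1) / (y2 / n2)
    se = math.sqrt(1/y1 - 1/n1 + 1/y2 - 1/n2)
    return rr * math.exp(-z * se), rr * math.exp(z * se)

def ci_odds_ratio(a, b, c, d, z=1.96):
    # Woolf's method: Var[log or] = 1/a + 1/b + 1/c + 1/d
    or_hat = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    return or_hat * math.exp(-z * se), or_hat * math.exp(z * se)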
Examples:

The generic 2 × 2 contingency table:

            +         –          Total
sample 1    y1 (%)    n1 − y1    n1 (100%)
sample 2    y2 (%)    n2 − y2    n2 (100%)
Total       y (%)     n − y      n (100%)

Bromocriptine for unexplained primary infertility:²

               Became pregnant   Did not   Total no. couples
Bromocriptine  7 (29%)           17        24 (100%)
Placebo        5 (22%)           18        23 (100%)
All            12 (26%)          35        47 (100%)

Vitamin C and the common cold:³

           No cold     ≥ 1 cold   Total subjects
Vitamin C  105 (26%)   302        407 (100%)
Placebo    76 (18%)    335        411 (100%)
Total      181 (22%)   637        818 (100%)

Stroke Unit vs. Medical Unit for Acute Stroke in the elderly?
Patient status at hospital discharge (BMJ 27 Sept 1980)

              Indep't.    Dep'nt   Total no. pts
Stroke Unit   67 (66%)    34       101 (100%)
Medical Unit  46 (51%)    45       91 (100%)
Total         113 (59%)   79       192 (100%)

Worked example: Stroke Unit vs. Medical Unit

95% CI for Δπ:
0.66 − 0.51 ± z × (0.66 × 0.34/101 + 0.51 × 0.49/91)^{1/2}
= 0.15 ± 1.96 × 0.07
= 0.15 ± 0.14.

Test Δπ = 0: [carrying several decimal places, for comparison with χ²]
z = (0.6634 − 0.5054) / (0.5885 × 0.4115 × {1/101 + 1/91})^{1/2}
  = 0.1580 / 0.0711
  = 2.22 → P = 0.026 (2-sided).

Worked example: Vitamin C and the common cold

95% CI for Δπ:
0.26 − 0.18 ± z × (0.26 × 0.74/407 + 0.18 × 0.82/411)^{1/2}
= 0.08 ± 1.96 × 0.03
= 0.08 ± 0.06.

Test Δπ = 0:
z = (0.258 − 0.185) / (0.221 × 0.779 × {1/407 + 1/411})^{1/2}
  = 0.073 / 0.029
  = 2.52 → P = 0.006 (1-sided) or 0.012 (2-sided).
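As a quick numerical check of the pooled-SE test in 2.3, here is a short Python sketch (mine, not part of the notes) that reproduces the two z-tests above:

Python

import math

def two_proportion_z(y1, n1, y2, n2):
    # z = (p1 - p2) / sqrt(p(1-p)/n1 + p(1-p)/n2), with p pooled = (y1+y2)/(n1+n2)
    p1, p2 = y1 / n1, y2 / n2
    p = (y1 + y2) / (n1 + n2)
    z = (p1 - p2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))
    return z, p_two_sided

print(two_proportion_z(105, 407, 76, 411))  # approx (2.52, 0.012)
print(two_proportion_z(67, 101, 46, 91))    # approx (2.22, 0.026)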
2.4 CI for Risk Ratio (a.k.a. Relative Risk) or Prevalence Ratio  cf. Rothman 2002 p. 135

Example: Vitamin C and the common cold ... Revisited

           No cold     ≥ 1 cold    Total no. subjects
Vitamin C  105 (26%)   302 (74%)   407 (100%)
Placebo    76 (18%)    335 (82%)   411 (100%)
Total      181 (22%)   637         818 (100%)

RR-hat = Prob[≥ 1 cold | Vitamin C] / Prob[≥ 1 cold | Placebo] = 74%/82% = 0.91.

CI[RR]:
antilog{ log 0.91 ± z × SE[log p1 − log p2] }
= antilog{ log 0.91 ± z × (SE²[log p1] + SE²[log p2])^{1/2} }.

SE²[log p1] = Var[log p1] = 1/302 − 1/407 = 0.000854;
SE²[log p2] = Var[log p2] = 1/335 − 1/411 = 0.000552.

So, CI[RR]:
antilog{ log 0.91 ± z × (0.000854 + 0.000552)^{1/2} }
= antilog{ log 0.91 ± 0.073 } = 0.85 to 0.98.

Shortcut: Calculate exp{z × SE[log RR-hat]} and use it as a multiplier and divider of RR-hat.
In our e.g., exp{z × SE[log RR-hat]} = exp{0.073} = 1.076.
Thus {RR_LOWER, RR_UPPER} = {0.91 ÷ 1.076, 0.91 × 1.076} = {0.85 to 0.98}.

You can use this shortcut whenever you are working with log-based CI's that you convert back to the original scale; there they become "multiply-divide" symmetric rather than "plus-minus" symmetric.

2.5 CI for Odds Ratio

           had cold(s)   avoided colds   # with cold(s) for every    odds of cold(s)
                                         1 who avoided colds
Vitamin C  302           105             2.88 (:1)                   2.88
Placebo    335           76              4.41 (:1)                   4.41

odds ratio = 2.88/4.41 = 0.65 → OR-hat = 0.65

CI[OR] = antilog{ log[OR-hat] ± z × SE[logit1 − logit2] },

where

SE²[logit1] = 1/#positive1 + 1/#negative1
SE²[logit2] = 1/#positive2 + 1/#negative2,

so that

SE[logit1 − logit2] = (1/302 + 1/105 + 1/335 + 1/76)^{1/2} = 0.17

z × SE[logit1 − logit2] = 1.96 × 0.17 = 0.33

antilog{ log 0.65 ± 0.33 } = exp{ −0.43 ± 0.33 } = 0.47 to 0.90

cf. Rothman 2002 p. 139

SAS

PROC FORMAT;
 VALUE onefirst 0="z0"
                1="a1";
DATA CI_RR_OR;
 INPUT vitC cold npeople;
LINES;
1 1 302
1 0 105
0 1 335
0 0 76
;
PROC FREQ data=CI_RR_OR
 ORDER=FORMATTED;
 TABLES vitC*cold / CMH;
 WEIGHT npeople;
 FORMAT vitC cold onefirst.;
RUN;

From SAS: see the statements for RR above (the output gives both RR and OR).

Be CAREFUL as to rows / cols. The index exposure category must be the 1st row; the reference exposure category must be the 2nd. If necessary, use FORMAT to have the table come out this way (note the trick, in the onefirst format above, to reverse the rows / cols). SAS doesn't know whether the data come from a 'case-control' or a 'cohort' study.

Stata

Immediate: csi 302 335 105 76
(cs stands for 'cohort study')

or, with the data in memory:

input vitc cold npeople
1 1 302
1 0 105
0 1 335
0 0 76
end

cs cold vitc [freq=npeople]
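For readers without SAS or Stata at hand, here is a small Python sketch (my own, using only the formulas above) that checks the Woolf interval and the multiply-divide shortcut:

Python

import math

def woolf_or_ci(a, b, c, d, z=1.96):
    # Woolf CI for the odds ratio: log or +/- z * sqrt(1/a + 1/b + 1/c + 1/d)
    or_hat = (a / b) / (c / d)
    factor = math.exp(z * math.sqrt(1/a + 1/b + 1/c + 1/d))  # multiply-divide factor
    return or_hat / factor, or_hat * factor

# Vitamin C data: a, b = cold(s) / no colds on vitamin C; c, d = same on placebo
print(woolf_or_ci(302, 105, 335, 76))  # approx (0.47, 0.91); the notes round to 0.47 to 0.90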
From Stata

Immediate: cci 302 335 105 76, woolf
(cc stands for 'case-control study')

or, with the data in memory:

input vit_c cold n_people
1 1 302
1 0 105
0 1 335
0 0 76
end

cc cold vit_c [freq=n_people], woolf

3 "Test-based CI's"

3.1 Preamble

In 1959, when Mantel and Haenszel developed their summary Odds Ratio measure over 2 or more strata, they did not supply a CI to accompany this point estimate. From 1955 onwards, the main competitor was the weighted average (in the log OR scale) and accompanying CI obtained by Woolf. But this latter method has problems with strata where one or more cell frequencies are zero. In 1976, Miettinen developed the "test-based" method for epidemiologic situations where the summary point estimate is easily calculated, the standard error estimate is unknown or hard to compute, but where a statistical test of the null value of the parameter of interest (derived by aggregating a "sub-statistic" from each stratum) is already available. Despite the 1986 development, by Robins, Breslow and Greenland, of a direct standard error for the log of the Mantel-Haenszel OR estimator, the "test-based" CI is still used (see A&B, KKM).

Even though its main usefulness is for summaries over strata, the idea can be explained using a simpler and familiar (single stratum) example, the comparison of two independent means using a z-test with large df (the principle does not depend on t vs. z). Suppose all that was reported was the difference in sample means, and the 2-sided p-value associated with a test of the null hypothesis that the mean difference was zero. From the sample means, and the p-value, how could we obtain a 95% CI for the difference in the 'population' means? The trick is to

1. work back (using a table of the normal distribution) from the p-value to the corresponding value of the z-statistic (the number of standard errors that the difference in sample means is from zero);

2. divide this observed difference by the observed z value, to get the standard error of the difference in sample means, and

3. use the observed difference, this standard error, and the desired multiple (1.645 for 90% CI, 1.96 for 95%, etc.) to create the CI.

The same procedure is directly applicable for the difference of two independently estimated proportions. If one tests the (null) difference using a z-test, one can obtain the SE of the difference by dividing the observed difference in proportions by the z statistic; if the difference was tested by a chi-square statistic, one can obtain the z-statistic by taking the square root of the observed chi-square value (authors call this square root an observed 'chi' value). Either way, the observed z-value leads directly to the SE, and from there to the CI. This is worked out in the next example, where it is assumed that the null hypothesis is tested via a chi-squared (χ²) test.

3.2 "Test-based" CI's ... specific applications

• Difference of 2 proportions π1 − π2 (Risk or Prevalence Difference)

Observe: p1 and p2 and (maybe via the p-value) the calculated value of X².

This implies that
(observed X² value)^{1/2} = observed X value = observed z value;
But ... observed z statistic = (p1 − p2) / SE[p1 − p2].
So ... SE[p1 − p2] = (p1 − p2) / observed z statistic   {use +ve sign}

95% CI for p1 − p2:
(p1 − p2) ∓ {z value for 95%} × SE[p1 − p2]
i.e. ...
(p1 − p2) ∓ {z value for 95%} × (p1 − p2)/(observed z statistic)
i.e., after re-arranging terms ...
(p1 − p2) × [1 ∓ (z value for 95% CI)/(observed z statistic)]   (1a)
or, in terms of a reported chi-squared statistic,

(p1 − p2) × [1 ∓ (z value for 95% CI)/Sqrt[observed chi-squared statistic]]   (1b)

See Section 12.3 of Miettinen's "Theoretical Epidemiology".

• Ratio of 2 proportions π1 / π2 (Risk Ratio; Prevalence Ratio; Relative Risk; "RR")

Observe:
1. rr = p1/p2 and
2. (maybe via the p-value) the value of the X² statistic (H0: RR = 1)
→ (observed X² value)^{1/2} = observed X value = observed z value.

In the log scale, in relation to log[RR_null] = 0, the observed z value would be:

observed z value = (log rr − 0)/SE[log rr]

This implies that

SE[log rr] = log[rr]/(observed z value)   {use +ve sign}

95% CI for log RR:
log rr ∓ {z value for 95% CI} × SE[log rr]
i.e. ...
log rr ∓ {z value for 95% CI} × log[rr]/(observed z value)
i.e., after re-arranging terms ...
log[rr] × [1 ∓ (z value for 95% CI)/(observed z statistic)]   (2a)

Going back to the RR scale, by taking antilogs⁴ ...

95% CI for RR:
rr to the power of 1 ∓ (z value for 95%)/(observed z statistic)   (2b)

• Ratio of 2 odds π1/(1 − π1) and π2/(1 − π2) (Odds Ratio; "OR")

Observe:
1. or = {p1/(1 − p1)}/{p2/(1 − p2)} = ad/bc and
2. (maybe via the p-value) the value of the X² statistic (H0: OR = 1)
→ (observed X² value)^{1/2} = observed X value = observed z value.

In the log scale, in relation to log[OR_null] = 0, the observed z value would be:

observed z value = (log or − 0)/SE[log or]

This implies that

SE[log or] = log or/(observed z value)   {use +ve sign}

95% CI for log OR:
log or ∓ {z value for 95% CI} × SE[log or]
i.e. ...
log or ∓ {z value for 95% CI} × log or/(observed z value)
i.e., after re-arranging terms ...
log or × [1 ∓ (z value for 95% CI)/(observed z statistic)]   (3a)

Going back to the OR scale, by taking antilogs⁵ ...

95% CI for OR:
or to the power of 1 ∓ (z value for 95%)/(observed z statistic)   (3b)

See Section 13.3 of Miettinen's "Theoretical Epidemiology".

⁴ antilog[log[a] × b] = exp[log[a] × b] = {exp[log[a]]} to the power of b = a to the power of b.

⁵ Technically, when the variance is a function of the parameter (as is the case with binary response data), the test-based CI is most accurate close to the Null. However, as you can verify by comparing test-based CIs with CI's derived in other ways, the inaccuracies are not as extreme as textbooks and manuals (e.g. Stata) suggest.
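A compact Python rendering of (1a), (2b) and (3b), a sketch I've added rather than part of the notes:

Python

def test_based_ci_difference(diff, z_obs, z_ci=1.96):
    # Eq (1a); z_obs taken with +ve sign
    lo, hi = sorted([diff * (1 - z_ci / z_obs), diff * (1 + z_ci / z_obs)])
    return lo, hi

def test_based_ci_ratio(ratio, z_obs, z_ci=1.96):
    # Eqs (2b)/(3b): ratio raised to the power 1 -/+ z_ci/z_obs
    lo, hi = sorted([ratio ** (1 - z_ci / z_obs), ratio ** (1 + z_ci / z_obs)])
    return lo, hi

# Vitamin C example: diff = 0.073 with z_obs = 2.52; rr = 0.91; or = 0.65
print(test_based_ci_difference(0.073, 2.52))  # approx (0.016, 0.130)
print(test_based_ci_ratio(0.91, 2.52))        # approx (0.85, 0.98)
print(test_based_ci_ratio(0.65, 2.52))        # approx (0.46, 0.91)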
4 Sample Size considerations...

Effect of Unequal Sample Sizes (n1 ≠ n2) on precision of estimated differences: see Notes on Sample Size Calculations for Inferences Concerning Means.

4.0.1 CI for π1 − π2

n's to produce a CI for the difference in π's of pre-specified margin of error (ME) at a stated confidence level:

• large-sample CI: p1 − p2 ± Z SE[p1 − p2] = p1 − p2 ± ME
• SE[p1 − p2] = {p1(1 − p1)/n1 + p2(1 − p2)/n2}^{1/2}.

Simplify (involves some approximation) by using an average p. If using equal n's, then

n per group = 2 × p(1 − p) × Z²_{α/2} / ME².

M&M use the fact that if p = 1/2 then p(1 − p) = 1/4, and so 2p(1 − p) = 1/2, so the above equation becomes

[max] n per group = (1/2) × Z²_{α/2} / ME².

4.0.2 Test involving πT and πC

Test H0: πT = πC vs. Ha: πT ≠ πC:
n's for power 1 − β if πT = πC + Δ; Prob[Type I error] = α.

n per group = {Z_{α/2} √(2πC[1 − πC]) − Zβ √(πC[1 − πC] + πT[1 − πT])}² / Δ²
            ≈ 2(Z_{α/2} − Zβ)² × π̄(1 − π̄) / Δ²   (4)

If α = 0.05 (2-sided) & β = 0.2 ... Z_{α/2} = 1.96; Zβ = −0.84, then 2(Z_{α/2} − Zβ)² = 2{1.96 − (−0.84)}² ≈ 16, i.e. n per group ≈ 16 × π̄(1 − π̄)/Δ².

→ nT ≈ 100 & nC ≈ 100 if πT = 0.6 & πC = 0.4.

See Sample Size Requirements for Comparison of 2 Proportions (from the text by Smith and Morrow) under Resources for Chapter 8.

4.0.3 Test involving OR

Test H0: OR = 1 vs. Ha: OR ≠ 1:
n's for power 1 − β if OR = OR_alt; Prob[Type I error] = α.

Work in the log or scale; SE[log or] = (1/a + 1/b + 1/c + 1/d)^{1/2}.

Need

Z_{α/2} SE_0[log or] + Zβ SE_alt[log or] < Δ,

where

Δ = log[OR_alt].

Substitute the expected a, b, c, d values under the null and alternative into the SE's and solve for the number of cases and controls.

References: Schlesselman; Breslow and Day, Volume II, ...

Key Points: log or is most precise when all 4 cells are of equal size; so ...

1. increasing the control:case ratio leads to diminishing marginal gains in precision. To see this, examine the function

1/(# of cases) + 1/(multiple of this # of controls)

for various values of "multiple" [cf. earlier notes "effect of unequal sample sizes"].

2. The more unequal the distribution of the etiologic / preventive factor, the less precise the estimate. Examine the functions

1/(no. of exposed cases) + 1/(no. of unexposed cases)
and
1/(no. of exposed controls) + 1/(no. of unexposed controls).
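A minimal Python sketch (added by me) of the 4.0.2 formulas; the exact form gives about 94 per group for πT = 0.6 vs. πC = 0.4, while the rounded "≈ 16" approximation gives the 100 quoted above:

Python

import math

def n_per_group(pi_c, pi_t, z_alpha=1.96, z_beta=-0.84):
    # Exact form: {z_a*sqrt(2*pC*qC) - z_b*sqrt(pC*qC + pT*qT)}^2 / delta^2
    delta = pi_t - pi_c
    num = (z_alpha * math.sqrt(2 * pi_c * (1 - pi_c))
           - z_beta * math.sqrt(pi_c * (1 - pi_c) + pi_t * (1 - pi_t)))
    return num ** 2 / delta ** 2

def n_per_group_approx(pi_c, pi_t, z_alpha=1.96, z_beta=-0.84):
    # Eq (4): 2 * (z_a - z_b)^2 * pbar * (1 - pbar) / delta^2
    pbar = (pi_c + pi_t) / 2
    return 2 * (z_alpha - z_beta) ** 2 * pbar * (1 - pbar) / (pi_t - pi_c) ** 2

print(n_per_group(0.4, 0.6))         # approx 94
print(n_per_group_approx(0.4, 0.6))  # approx 98; with the rounded factor 16, 100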
Add-ins for M&M §8 and §9: statistics for epidemiology

[Figure: Factors affecting variability of estimates from, and statistical power of, case-control studies.⁵ A 3 × 3 grid of panels, Cases: 100 / 200 / 400 (columns) by Exposure Prevalence: 2% / 8% / 32% (rows), each showing sampling distributions of the observed or (x-axis on a log scale: 0.25, 0.5, 1, 2, 4, 8) at true OR values 1, 1.5, 2.25 and 3.375. jh 1995-2003]

Reading the graphs: (Note the log scale for the observed or.) Take as an example the study in the middle panel, with 200 cases, and an exposure prevalence of 8%. Say that the Type I error rate is set at α = 0.05 (2-sided) so that the upper critical value (the one that cuts off the top 2.5% of the null distribution) is close to or = 2. Draw a vertical line at this critical value, and examine how much of each non-null distribution falls to the right of this critical value. This area to the right of the critical value is the power of the study, i.e., the probability of obtaining a significant or, when in fact the indicated non-null value of OR is correct. The two curves at each OR value are for studies with 1 (grey) and 4 (black) controls/case. Note that the OR values 1, 1.5, 2.25 and 3.375 are also on a log scale.

⁵ Power is larger if ... 1. the non-null OR >> 1 (cf. 1.5 vs 2.25 vs 3.375); 2. exposure is common (cf. 2% vs 8% vs 32%) and not near universal; 3. one uses more cases (cf. 100 vs. 200 vs. 400), and more controls/case (1 vs 4).
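As a rough companion to the figure, here is a back-of-envelope Python sketch (mine, not from the notes) that approximates the power of a case-control study using Woolf's SE evaluated at expected cell counts under the alternative; treat the numbers as indicative only:

Python

import math

def cc_power(n_cases, controls_per_case, p0, or_alt, z_alpha=1.96):
    # p0: exposure prevalence among controls (an assumption of this sketch);
    # power approximated as Phi(|log OR|/SE_alt - z_alpha)
    n_controls = n_cases * controls_per_case
    odds1 = or_alt * p0 / (1 - p0)          # exposure odds among cases
    p1 = odds1 / (1 + odds1)
    a, b = n_cases * p1, n_cases * (1 - p1)
    c, d = n_controls * p0, n_controls * (1 - p0)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # Woolf SE at expected counts
    z = abs(math.log(or_alt)) / se - z_alpha
    return 0.5 * math.erfc(-z / math.sqrt(2))  # Phi(z)

print(cc_power(200, 1, 0.08, 2.25))  # middle panel, 1 control/case: approx 0.71
print(cc_power(200, 4, 0.08, 2.25))  # middle panel, 4 controls/case: approx 0.94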
5 Small sample methods:

Test:

Since a risk difference of zero implies a risk ratio, or odds ratio, of 1, all three can be tested in the same way.

U (unconditional): Suissa S; Shuster JJ. Exact Unconditional Sample Sizes for the 2 × 2 Binomial Trial; Journal of the Royal Statistical Society, Series A (General), Vol. 148, No. 4 (1985), pp. 317-327.

C (conditional): Fisher 1935, JRSS Vol 98, p 48. (central) Hypergeometric distribution, obtained by conditioning on (treating as fixed) all marginal frequencies.

Confidence Interval:

5.1 Risk Difference

See section 3.1.2 of Sahai and Khurshid (1996).

5.2 Risk Ratio

See section 3.1.2 of Sahai and Khurshid (1996).

5.3 Odds Ratio: Point- and Interval-estimation

See section 4.1.2 of Sahai and Khurshid (1996), and Chapter of Volume I of Breslow and Day. See also example 1, pp 48-51, in Fisher 1935.

Elaboration on equation 4.11 in Sahai and Khurshid, and on the (what we now call the) non-central hypergeometric random variable, whose distribution is given in the middle of p 50 of Fisher's article.

Let Yi ∼ Binomial(ni, πi), i = 1, 2, be 2 independent binomial random variables. We wish to make inference regarding the parameter

ψ = {π1/(1 − π1)}/{π2/(1 − π2)}.

We can do so by considering only those data configurations which have the same total number of 'positives', y1 + y2 = y, say, as were observed in the actual study, and then considering the distribution of Y1 | y.

Prob[Y1 = y1; Y2 = y2] = C(n1, y1) π1^{y1}(1 − π1)^{n1−y1} × C(n2, y2) π2^{y2}(1 − π2)^{n2−y2}.

If we condition on Y1 + Y2 = y, then

Prob[Y1 = y1 | Y1 + Y2 = y] = Prob[Y1 = y1; Y2 = y − y1] / Prob[Y1 + Y2 = y].

If we rewrite the quantity

π1^{y1}(1 − π1)^{n1−y1} × π2^{y2}(1 − π2)^{n2−y2}

as

π1^{y1}(1 − π1)^{−y1} π2^{−y1}(1 − π2)^{y1} × (1 − π1)^{n1} π2^{y}(1 − π2)^{n2−y},

we see that it simplifies to

ψ^{y1} × (1 − π1)^{n1} π2^{y}(1 − π2)^{n2−y},

and that the last three terms do not involve ψ and do not involve the random variable y1. Since they appear in both the numerator and the denominator of the conditional probability, they cancel out.

Thus we can write the conditional probability Prob[Y1 = y1 | Y1 + Y2 = y] as

Prob[y1 | y] = C(n1, y1) C(n2, y − y1) ψ^{y1} / Σ C(n1, y1′) C(n2, y − y1′) ψ^{y1′},

where the summation is over those y1′ values that are compatible with the 4 marginal frequencies.

Aside: you will note that if we set ψ = 1, the probabilities are the same as those in the central hypergeometric distribution, used for Fisher's exact test of two binomial proportions. Indeed, Fisher, in pages 48-49 of his 1935 paper, first computes the null probabilities for the 2 × 2 table.

Conviction of Like-sex Twins of Criminals

             Convicted.   Not Convicted.   Total.
Monozygotic  10 (a)       3 (b)            13
Dizygotic    2 (c)        15 (d)           17
Total        12           18               30

[We use y1 and y2 where epidemiologists typically use a and c.]
He calculated that the probability that 1, 2, 3, ... monozygotic twins would escape conviction⁶ was (1/6,652,325) × {1, 102, 2992, ...}. Thus, "a discrepancy from proportionality as great or greater than that observed, will arise, subject to the conditions specified by the ancillary information, in exactly 3,095 trials out of 6,652,325 or approximately once in 2,150 trials."

He then went on to work out the lower limit of the 90% 2-sided CI (or a 95% 1-sided CI), for the odds ratio: i.e. for the odds, π_mono-z/(1 − π_mono-z), of criminals to non-criminals in twins of monozygotic criminals, divided by the corresponding odds, π_di-z/(1 − π_di-z), in twins of dizygotic criminals.

Let Y_mono be the number of MZ twins convicted. Fisher finds the value ψ_L such that

Prob[Y_mono ≥ 10 | ψ_L, y = 12] = 0.05.

He reports that this value is 1/0.28496 ≈ 3.509. In the Excel spreadsheet for Fisher's exact test and exact CI for OR (on the website), you can verify that indeed, with ψ_L = 3.509, Prob[Y_mono ≥ 10 | ψ = 3.509, y = 12] = 0.05.

One has to admire Fisher's ability, in 1935, to solve a polynomial equation of order 12, namely

(1 + 102ψ + 2992ψ²) / (1 + 102ψ + 2992ψ² + ··· + 476ψ¹²) = 0.05.

5.3.1 Point estimation of ψ under the Hypergeometric Model

See section x.x of Breslow and Day, Volume I.

It will come as a surprise to many that there are 2 point estimators of ψ:⁸

one, the familiar – unconditional – based on the "2 independent Binomials" model, with two random variables y1 and y2, and

the other – conditional – based on the single random variable y1 | y with a Non-Central Hypergeometric distribution.

While the two estimators yield similar estimates when sample sizes are large, the estimates can be quite different from each other in small-sample situations.

Estimator, based on Unconditional Approach:

The estimator derives from the principle that if there are two parameters θ1 and θ2, with Maximum Likelihood Estimators θ̂1 and θ̂2, then the Maximum Likelihood Estimator of θ1/θ2 is θ̂1/θ̂2. Thus, since π̂1 = 10/13, and π̂2 = 2/17, we have

ψ̂_UMLE = [(10/13)/(3/13)] / [(2/17)/(15/17)] = (10 × 15)/(3 × 2) = 25 = (a × d)/(b × c).

Estimator, based on Conditional Approach:

The Maximum Likelihood Estimate ψ̂_CMLE is the solution of d log L/dψ = 0. If we use Σ as shorthand for the denominator of Prob[y1 | y], then ψ̂_CMLE is the solution of

d log Σ/dψ = (dΣ/dψ) × (1/Σ) = y1/ψ.

Re-arranging, we find that ψ̂_CMLE is the solution of

y1 = E[Y1 | ψ].

In this case the CMLE of ψ is the same as the estimate obtained by equating the observed and expected moment (the "Method of Moments"). Using the same spreadsheet used above, we find that the value of ψ that satisfies this estimating equation is

ψ̂_CMLE = 21.3.

It can be shown that, in any given dataset, ψ̂_CMLE is closer to the null (i.e., to ψ = 1) than the ψ̂_UMLE is. Indeed, the CMLE can be seen as a UMLE that has been shrunk towards the null.⁷

⁶ The range is 1 to 13; 0 cannot escape, since then there would be 13 convicted in the first row, but there are only 12 convicted in all.

⁷ See Hanley JA, Miettinen OS. An Unconditional-like Structure for the Conditional Estimator of Odds Ratio from 2 x 2 Tables. Biometrical Journal 48 (2006) 1, 23-34. DOI: 10.1002/bimj.200510167

⁸ [Notes from JH]:

• The 5 tables from the tea-tasting experiment, with all marginal totals = 4, are another example of this hypergeometric distribution.

• Excel has the Hypergeometric probability function. It is like the Binomial, except that instead of specifying p, one specifies the size of the POPULATION and the NUMBER OF POSITIVES IN THE POPULATION; for example, to get P1 above, one would ask for HYPGEOMDIST(a; r1; c1; N). The spreadsheet "Fisher's Exact test" uses this function; to use the spreadsheet, simply type in the 4 cell frequencies, a, b, c, and d. The spreadsheet will calculate the probability for each possible table. Then you can find the tail areas yourself. You can also use it for the non-null (non-central) hypergeometric distribution.
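The conditional quantities above are also easy to compute directly. Below is a small Python sketch (my own addition, not the notes' spreadsheet) of the non-central hypergeometric distribution of 5.3, applied to Fisher's twins data (n1 = 13, n2 = 17, y = 12, y1 = 10); it should reproduce, up to rounding, the CMLE of 21.3 and the tail probability of 0.05 at ψ = 3.509 quoted above.

Python

from math import comb

def nc_hypergeom_pmf(psi, n1, n2, y):
    # P[Y1 = y1 | y; psi] proportional to C(n1,y1) * C(n2,y-y1) * psi**y1
    lo, hi = max(0, y - n2), min(n1, y)
    w = {y1: comb(n1, y1) * comb(n2, y - y1) * psi**y1 for y1 in range(lo, hi + 1)}
    total = sum(w.values())
    return {y1: wy / total for y1, wy in w.items()}

def expected_y1(psi, n1, n2, y):
    return sum(y1 * p for y1, p in nc_hypergeom_pmf(psi, n1, n2, y).items())

def cmle(y1_obs, n1, n2, y, lo=1e-6, hi=1e6):
    # Solve y1_obs = E[Y1 | psi] by bisection (E is increasing in psi)
    for _ in range(200):
        mid = (lo * hi) ** 0.5   # bisect on the log scale
        if expected_y1(mid, n1, n2, y) < y1_obs:
            lo = mid
        else:
            hi = mid
    return lo

print(cmle(10, 13, 17, 12))                          # approx 21.3
pmf = nc_hypergeom_pmf(3.509, 13, 17, 12)
print(sum(p for y1, p in pmf.items() if y1 >= 10))   # approx 0.05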
5.4 The "Exact" Test for 2 x 2 tables

5.4.1 Material taken from Armitage & Berry §4.9.

Material on hand-calculation of null probabilities is omitted.

Even with the continuity correction there will be some doubt about the adequacy of the χ² approximation when the frequencies are particularly small. An exact test was suggested almost simultaneously in the mid-1930s by R. A. Fisher, J. O. Irwin and F. Yates. It consists in calculating the exact probabilities of the possible tables described in the previous subsection. The probability of a table with frequencies

a    b    r1
c    d    r2
c1   c2   N

is given by the formula

P[a | r1, r2, c1, c2] = r1! r2! c1! c2! / (N! a! b! c! d!)   (5)

This is, in fact, the probability of the observed cell frequencies conditional on the observed marginal totals, under the null hypothesis of no association between the row and column classifications. Given any observed table, the probabilities of all tables with the same marginal totals can be calculated, and the P value for the significance test calculated by summation. Example 4.14 illustrates the calculations and some of the difficulties of interpretation which may arise. The data in Table 4.6, due to M. Hellman, are discussed by Yates (1934).

Table 1: Data on malocclusion of teeth in infants (Yates, 1934)

            Normal teeth   Infants with malocclusion   Total
Breast-fed  4              16                          20
Bottle-fed  1              21                          22
Total       5              37                          42

There are six possible tables with the same marginal totals as those observed, since neither a nor c (in the notation given above) can fall below 0 or exceed 5, the smallest marginal total in the table. The cell frequencies in each of these tables are shown in Table 2. Below them are shown the probabilities of these tables, calculated under the null hypothesis.

Table 2: Cell frequencies in tables with the same marginal totals as those in Table 1

0 20 20    1 19 20    2 18 20    3 17 20    4 16 20    5 15 20
5 17 22    4 18 22    3 19 22    2 20 22    1 21 22    0 22 22
5 37 42    5 37 42    5 37 42    5 37 42    5 37 42    5 37 42

a     0        1        2        3        4        5
Pa    0.0310   0.1720   0.3440   0.3096   0.1253   0.0182

This is the complete conditional distribution for the observed marginal totals, and the probabilities sum to unity as would be expected. Note the importance of carrying enough significant digits in the first probability to be calculated; the above calculations were carried out with more decimal places than recorded, by retaining each probability in the calculator for the next stage.

The observed table has a probability of 0.1253. To assess its significance we could measure the extent to which it falls into the tail of the distribution by calculating the probability of that table or of one more extreme. For a one-sided test the procedure clearly gives

P = 0.1253 + 0.0182 = 0.1435.

The result is not significant at even the 10% level.

For a two-sided test the other tail of the distribution must be taken into account, and here some ambiguity arises. Many authors advocate that the one-tailed P value should be doubled. In the present example, the one-tailed test gave P = 0.1435 and the two-tailed test would give P = 0.2870. An alternative approach is to calculate P as the total probability of tables, in either tail, which are at least as extreme as that observed, in the sense of having a probability at least as small. In the present example we should have

P = 0.1253 + 0.0182 + 0.0310 = 0.1745.

The first procedure is probably to be preferred on the grounds that a significant result is interpreted as strong evidence for a difference in the observed direction, and there is some merit in controlling the chance probability of such
a result to no more than half the two-sided significance level.

The results of applying the exact test in this example may be compared with those obtained by the χ² test with Yates's correction. We find X² = 2.39 (P = 0.12) without correction and X²_C = 1.14 (P = 0.29) with correction. The probability level of 0.29 for X²_C agrees well with the two-sided value 0.29 from the exact test, and the probability level of 0.12 for X² is a fair approximation to the exact mid-P value of 0.16.

Cochran (1954) recommends the use of the exact test, in preference to the X² test with continuity correction, (i) if N < 20, or (ii) if 20 < N < 40 and the smallest expected value is less than 5. With modern scientific calculators and statistical software the exact test is much easier to calculate than previously and should be used for any table with an expected value less than 5.

The exact test, and therefore the χ² test with Yates's correction for continuity, have been criticized over the last 50 years on the grounds that they are conservative, in the sense that a result significant at, say, the 5% level will be found in less than 5% of hypothetical repeated random samples from a population in which the null hypothesis is true. This feature was discussed in §4.7, and it was remarked that the problem is a consequence of the discrete nature of the data and causes no difficulty if the precise level of P is stated. Another source of criticism has been that the tests are conditional on the observed margins, which frequently would not all be fixed. For example, in Example 4.14 one could imagine repetitions of sampling in which 20 breast-fed infants were compared with 22 bottle-fed infants, but in many of these samples the number of infants with normal teeth would differ from 5. The conditional argument is that, whatever inference can be made about the association between breast-feeding and tooth decay, it has to be made within the context that exactly five children had normal teeth. If this number had been different then the inference would have been made in this different context, but that is irrelevant to inferences that can be made when there are five children with normal teeth. Therefore, we do not accept the various arguments that have been put forward for rejecting the exact test based on consideration of possible samples with different totals in one of the margins. The issues were discussed by Yates (1984) and in the ensuing discussion, and by Barnard (1989) and Upton (1992), and we will not pursue this point further. Nevertheless, the exact test and the corrected χ² test have the undesirable feature that the average value of the significance level, when the null hypothesis is true, exceeds 0.5. The mid-P value avoids this problem, and so is more appropriate when combining results from several studies (see §4.7).

As for a single proportion, the mid-P value corresponds to an uncorrected χ² test, whilst the exact P value corresponds to the corrected χ² test. The confidence limits for the difference, ratio or odds ratio of two proportions based on the standard errors given by (4.14), (4.17) or (4.19) respectively are all approximate, and the approximate values will be suspect if one or more of the frequencies in the 2 x 2 table are small. Various methods have been put forward to give improved limits, but all of these involve iterations and are tedious to carry out on a calculator. The odds ratio is the easiest case. Apart from exact limits, which involve an excessive amount of calculation, the most satisfactory limits are those of Cornfield (1956); see Example 16.1 and Breslow and Day (1980, §4.3) or Fleiss (1981, §5.6). For the ratio of two proportions a method was given by Koopman (1984) and Miettinen and Nurminen (1985) which can be programmed fairly readily. The confidence interval produced gives a good approximation to the required confidence coefficient, but the two tail probabilities are unequal due to skewness. Gart and Nam (1988) gave a correction for skewness, but this is tedious to calculate. For the difference of two proportions a method was given by Mee (1984) and Miettinen and Nurminen (1985). This involves more calculation than for the ratio limits, and again there could be a problem due to skewness (Gart and Nam, 1990).
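A short Python sketch (added by me, not A&B's) that reproduces the conditional distribution for Table 1 and the one-sided, doubled, "at least as extreme", and mid-P values quoted above:

Python

from math import comb

# Margins from Table 1: breast-fed 20, bottle-fed 22, 5 with normal teeth
r1, r2, c1, a_obs = 20, 22, 5, 4

pmf = {a: comb(r1, a) * comb(r2, c1 - a) / comb(r1 + r2, c1)
       for a in range(0, c1 + 1)}

one_sided = sum(p for a, p in pmf.items() if a >= a_obs)
doubled = 2 * one_sided
as_extreme = sum(p for p in pmf.values() if p <= pmf[a_obs] + 1e-12)
mid_p = 2 * (one_sided - 0.5 * pmf[a_obs])  # doubled one-sided mid-P

print(pmf)         # 0.0310, 0.1720, 0.3440, 0.3096, 0.1253, 0.0182
print(one_sided)   # 0.1435
print(doubled)     # 0.2870
print(as_extreme)  # 0.1745
print(mid_p)       # approx 0.16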
Notes by JH

• The word "exact" means that the p-values are calculated using a finite discrete reference distribution – the hypergeometric distribution (cousin of the binomial) – rather than using large-sample approximations. It doesn't mean that it is the correct test. [See the comment by A&B in their section dealing with mid-P values.] While greater accuracy is always desirable, this particular test uses a 'conditional' approach that not all statisticians agree with. Moreover, compared with some unconditional competitors, the test is somewhat conservative, and thus less powerful, particularly if sample sizes are very small.

• Fisher's exact test is usually used just as a test*; if one is interested in the difference Δ = π1 − π2, the conditional approach does not yield a corresponding confidence interval for Δ. [It does provide one for the comparative odds ratio parameter ψ = {π1/(1−π1)}/{π2/(1−π2)}.]

• Thus, one can find anomalous situations where the (conditional) test provides P > 0.05, making the difference 'not statistically significant', whereas the large-sample (unconditional) CI for Δ, computed as p1 − p2 ± z SE(p1 − p2), does not overlap 0, and so would indicate that the difference is 'statistically significant'. [* See the Breslow and Day text, Vol I, §4.2, for CI's for ψ derived from the conditional distribution.]
• See the letter from Begin & Hanley re 1/20 mortality with pentamidine vs. 5/20 with Trimethoprim-Sulfamethoxazole in patients with Pneumocystis carinii Pneumonia – Annals Int Med 106:474, 1987.

• Miettinen's test-based method of forming CI's, while it can have some drawbacks, keeps the correspondence between test and CI and avoids such anomalies.

• This illustrates one important point about parameters related to binary data – with means of interval data, we typically deal just with differences*; however, with binary data, we often switch between differences and ratios, either because the design of the study forces us to use odds ratios (case-control studies), or because the most readily available regression software uses a ratio (i.e. logistic regression for odds ratios), or because one is easier to explain than the other, or because one has a more natural interpretation (e.g. in assessing the cost per life saved of a more expensive and more efficacious management modality, it is the difference in, rather than the ratio of, mortality rates that comes into the calculation). [* The sampling variability of estimated ratios of means of interval data is also more difficult to calculate accurately.]

6 (Mis-)Application; Costly Application

6.1 Fisher's Exact Test in a Double-Blind Study of Symptom Provocation to Determine Food Sensitivity (N Engl J Med 1990; 323: 429-33)

Abstract

Background Some claim that food sensitivities can best be identified by intradermal injection of extracts of the suspected allergens to reproduce the associated symptoms. A different dose of an offending allergen is thought to "neutralize" the reaction.

Methods To assess the validity of symptom provocation, we performed a double-blind study that was carried out in the offices of seven physicians who were proponents of this technique and experienced in its use. Eighteen patients were tested in 20 sessions (two patients were tested twice) by the same technician, using the same extracts (at the same dilutions with the same saline diluent) as those previously thought to provoke symptoms during unblinded testing. At each session three injections of extract and nine of diluent were given in random sequence. The symptoms evaluated included nasal stuffiness, dry mouth, nausea, fatigue, headache, and feelings of disorientation or depression. No patient had a history of asthma or anaphylaxis.

Results The responses of the patients to the active and control injections were indistinguishable, as was the incidence of positive responses: 27 percent of the active injections (16 of 60) were judged by the patients to be the active substance, as were 24 percent of the control injections (44 of 180). Neutralizing doses given by some of the physicians to treat the symptoms after a response were equally efficacious whether the injection was of the suspected allergen or saline. The rate of judging injections as active remained relatively constant within the experimental sessions, with no major change in the response rate due to neutralization or habituation.

Conclusions When the provocation of symptoms to identify food sensitivities is evaluated under double-blind conditions, this type of testing, as well as the treatments based on "neutralizing" such reactions, appears to lack scientific validity. The frequency of positive responses to the injected extracts appears to be the result of suggestion and chance.
Responses of 18 Patients Forced to Decide Whether Injections Contained an Active Ingredient or Placebo

          Active Injection     Placebo Injection
Pt. No*   resp    no resp      resp    no resp      P Value†
3         2       1            1       8            0.13
1         2       1            2       7            0.24
14a       2       1            2       7            0.24
12        1       2            0       9            0.25
16        2       1            3       6            0.36
18        2       1            4       5            0.50
14b       1       2            2       7            0.87
4         1       2            2       7            0.87
5         1       2            2       7            0.87
9         0       3            0       9            —
2a        0       3            1       8            0.75
13        0       3            1       8            0.75
15        1       2            3       6            0.76
6         0       3            2       7            0.55
8         0       3            2       7            0.55
17        1       2            5       4            0.50
2b        0       3            3       6            0.38
7         0       3            3       6            0.38
10        0       3            3       6            0.38
11        0       3            3       6            0.38

*Patients were numbered in the order they were studied. The order in the table is related to the degree that the results agree with the hypothesis that patients could distinguish active injections from placebo injections. The results listed below those of Patient 9 do not support this hypothesis; placebo injections were identified as active at a higher rate than were true active injections. The letters a and b denote the first and second testing sessions, respectively, in Patients 2 and 14. ID denotes intradermal, and SC subcutaneous.

†Calculated according to Fisher's exact test, which assumes that the hypothesized direction of effect is the same as the direction of effect in the data. Therefore, when the effect is opposite to the hypothesis, as it is for the data below those of Patient 9, the P value computed is testing the null hypothesis that the results obtained were due to chance, as compared with the possibility that the patients were more likely to judge a placebo injection as active than an active injection.

[Elsewhere in the article:] The value is the P value associated with the test of whether the common odds ratio (the odds ratio for all patients) is equal to 1.0. The common odds ratio was equal to 1.13 (computed according to the Mantel-Haenszel test).

Notes on P-Values from Fisher's Exact Test in above article

Patient number 3:

          Response
          +    −
Active    2    1    3
Placebo   1    8    9
Total     3    9    12

All possible tables with a total of 3 +ve responses:

Table        Prob                                (pt #)           P-Value*
0 3 | 3 6    (9×8×7)/(12×11×10) = 0.382          (2b, 7, 10, 11)  1.0
1 2 | 2 7    0.382 × (3×3)/(1×7) = 0.491         (14b, 4, 5)      0.618
2 1 | 1 8    0.491 × (2×2)/(2×8) = 0.123         (3)              0.128
3 0 | 0 9    0.123 × (1×1)/(3×9) = 0.005                          0.005

Patient number 1:

          Response
          +    −
Active    2    1    3
Placebo   2    7    9
Total     4    8    12

All possible tables with a total of 4 +ve responses:
Table        Prob                                (pt #)     P-Value
0 3 | 4 5    (8×7×6)/(12×11×10) = 0.255                     1.0
1 2 | 3 6    0.255 × (3×4)/(1×6) = 0.510         (15)       0.745
2 1 | 2 7    0.510 × (2×3)/(2×7) = 0.218         (1, 14a)   0.236
3 0 | 1 8    0.218 × (1×2)/(3×8) = 0.018                    0.018

*1-sided, guided by Halt: π of +ve responses with Active > π of +ve responses with Placebo.
Patient number 18:

          Response
          +    −
Active    2    1    3
Placebo   4    5    9
Total     6    6    12

All possible tables with a total of 6 +ve responses:

Table        Prob                                (pt #)   P-Value (1-sided, as above)
0 3 | 6 3    (6×5×4)/(12×11×10) = 0.091                   1.0
1 2 | 5 4    0.091 × (3×6)/(1×4) = 0.409         (17)     0.909
2 1 | 4 5    0.409 × (2×5)/(2×5) = 0.409         (18)     0.500
3 0 | 3 6    0.409 × (1×4)/(3×6) = 0.091                  0.091

In the Table, the P-values for patients below patient 9 are calculated as 1-sided, but guided by the opposite Halt from that used for the patients in the upper half of the table, i.e. by

Halt: π of +ve responses with Active < π of +ve responses with Placebo.

It appears that the authors decided the "sided-ness" of the Halt after observing the data!!! And they used a different Halt for different patients!!!

Message: Tail areas for this test are tricky: it is best to lay out all the tables, so that one is clear which tables are being included in which tail!
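Following that advice, here is a short Python sketch (my addition) that lays out the whole conditional distribution for a patient and reports both one-sided tails, making the choice of Halt explicit:

Python

from math import comb

def fisher_tails(a, b, c, d):
    # Enumerate all 2x2 tables with the observed margins and return the
    # (upper, lower) one-sided exact P-values for the top-left cell a
    r1, r2, c1 = a + b, c + d, a + c
    lo, hi = max(0, c1 - r2), min(r1, c1)
    pmf = {x: comb(r1, x) * comb(r2, c1 - x) / comb(r1 + r2, c1)
           for x in range(lo, hi + 1)}
    upper = sum(p for x, p in pmf.items() if x >= a)  # Halt: Active > Placebo
    lower = sum(p for x, p in pmf.items() if x <= a)  # Halt: Active < Placebo
    return upper, lower

print(fisher_tails(2, 1, 4, 5))  # patient 18: approx (0.50, 0.91)
print(fisher_tails(1, 2, 2, 7))  # patients 14b/4/5: approx (0.62, 0.87);
                                 # the article reported 0.87, the lower tail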
6.2 Fisher's Exact Test and Rhinoceroses

Note: The Namibian government expelled the authors from Namibia following the publication of the following article; the reason given was that their "data and conclusions were premature."

Since 1900 the world's population has increased from about 1.6 to over 5 billion; the U.S. population has kept pace, growing from nearly 75 to 260 million. While the expansion of humans and environmental alterations go hand in hand, it remains uncertain whether conservation programs will slow our biotic losses. Current strategies focus on solutions to problems associated with diminishing and less continuous habitats, but in the past, when habitat loss was not the issue, active intervention prevented extirpation. Here we briefly summarize intervention measures and focus on tactics for species with economically valuable body parts, particularly on the merits and pitfalls of biological strategies tried for Africa's most endangered pachyderms, rhinoceroses.

[ ... ]

Given the inadequacies of protective legislation and enforcement, Namibia, Zimbabwe, and Swaziland are using a controversial preemptive measure, dehorning (Fig. D), with the hope that complete devaluation will buy time for implementing other protective measures (7). In Namibia and Zimbabwe, two species, black and white rhinos (Ceratotherium simum), are dehorned, a tactic resulting in sociological and biological uncertainty: Is poaching deterred? Can hornless mothers defend calves from dangerous predators?

On the basis of our work in Namibia during the last 3 years (8) and comparative information from Zimbabwe, some data are available. Horns regenerate rapidly, about 8.7 cm per animal per year, so that 1 year after dehorning the regrown mass exceeds 0.5 kg. Because poachers apparently do not prefer animals with more massive horns (8), frequent and costly horn removal may be required (9). In Zimbabwe, a population of 100 white rhinos, with at least 80 dehorned, was reduced to less than 5 animals in 18 months (10). These discouraging results suggest that intervention by itself is unlikely to eliminate the incentive for poaching. Nevertheless, some benefits accrue when governments, rather than poachers, practice horn harvesting, since less horn enters the black market. Whether horn stockpiles may be used to enhance conservation remains controversial, but mortality risks associated with anesthesia during dehorning are low (5).

Biologically, there have also been problems. Despite media attention and a bevy of allegations about the soundness of dehorning (11), serious attempts to determine whether dehorning is harmful have been remiss. A lack of negative effects has been suggested because (i) horned and dehorned individuals have interacted without subsequent injury; (ii) dehorned animals have thwarted the advance of dangerous predators; (iii) feeding is normal; and (iv) dehorned mothers have given birth (12). However, most claims are anecdotal and mean little without attendant data on demographic effects. For instance, while some dehorned females give birth, it may be that these females were pregnant when first immobilized. Perhaps others have not conceived or have lost calves after birth. Without knowing more about the frequency of mortality, it seems premature to argue that dehorning is effective. We gathered data on more than 40 known horned and hornless black rhinos in the presence and absence of dangerous carnivores in a 7,000 km² area of the northern Namib Desert and on 60 horned animals in the 22,000 km² Etosha National Park. On the basis of over 200 witnessed interactions between horned rhinos and spotted hyenas (Crocuta crocuta) and lions (Panthera leo), we saw no cases of predation, although mothers charged predators in about 45% of the cases. Serious interspecific aggression is not uncommon elsewhere in Africa, and calves missing ears and tails have been observed from South Africa, Kenya, Tanzania, and Namibia (13).

To evaluate the vulnerability of dehorned rhinos to potential predators, we developed an experimental design using three regions:

• Area A had horned animals with spotted hyenas and occasional lions,
• Area B had dehorned animals lacking dangerous predators,
• Area C consisted of dehorned animals that were sympatric with hyenas only.

Populations were discrete and inhabited similar xeric landscapes that averaged less than 125 mm of precipitation annually. Area A occurred north of a country-long veterinary cordon fence, whereas animals from areas B and C occurred to the south or east, and no individuals moved between regions.

The differences in calf survivorship were remarkable. All three calves in area C died within 1 year of birth, whereas all calves survived for both dehorned females living without dangerous predators (area B; n = 3) and for horned mothers in area A (n = 4). Despite admittedly restricted samples, the differences are striking [Fisher's (3 x 2) exact test, P = 0.017; area B versus C, P = 0.05; area A versus C, P = 0.029]††. The data offer a first assessment of an empirically derived relation between horns and recruitment.

Our results imply that hyena predation was responsible for calf deaths, but other explanations are possible. If drought affected one area to a larger extent than the others, then calves might be more susceptible to early mortality. This possibility appears unlikely because all of western Namibia has been experiencing drought and, on average, the desert rhinos in one area were in no poorer bodily condition than those in another. Also, the mothers who lost calves were between 15 to 25 years old, suggesting that they were not first-time, inexperienced mothers (14). What seems more likely is that the drought-induced migration of more than 85% of the large herbivore biomass (kudu, springbok, zebra, gemsbok, giraffe, and ostrich) resulted in hyenas preying on an alternative food, rhino neonates, when mothers with regenerating horns could not protect them.

Clearly, unpredictable events, including drought, may not be anticipated on a short-term basis. Similarly, it may not be possible to predict when governments can no longer fund antipoaching measures, an event that may have led to the collapse of Zimbabwe's dehorned white rhinos. Nevertheless, any effective conservation actions must account for uncertainty. In the case of dehorning, additional precautions must be taken. [ ... ]
†† Observed data:

     survived   died   total
A    4          0      4
B    3          0      3
C    0          3      3

B vs C: all possible tables with the observed margins (totals: survived 3, died 3):

B:  3 0 3     B:  2 1 3     B:  1 2 3     B:  0 3 3
C:  0 3 3     C:  1 2 3     C:  2 1 3     C:  3 0 3
Prob: 1/20    Prob: 9/20    Prob: 9/20    Prob: 1/20

A vs C: all possible tables with the observed margins (totals: survived 4, died 3):

A:  4 0 4     A:  3 1 4     A:  2 2 4     A:  1 3 4
C:  0 3 3     C:  1 2 3     C:  2 1 3     C:  3 0 3
Prob: 1/35    Prob: 12/35   Prob: 18/35   Prob: 4/35

So the one-sided P for area B versus C is 1/20 = 0.05, and for area A versus C it is 1/35 = 0.029, as quoted in the article.

"Data and conclusions were premature."

Agree?
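To close the loop on the quoted P = 0.017 for the full 3 × 2 table, here is a brief Python sketch (my own) of the Freeman-Halton extension of Fisher's exact test, enumerating the possible splits of the 3 deaths across the three areas:

Python

from math import comb
from itertools import product

def freeman_halton_3x2(n_rows, deaths_total, observed):
    # Exact P for a 3x2 table with row totals n_rows and a fixed column
    # total of deaths: sum the probabilities of all tables whose null
    # probability is <= that of the observed death split
    n = sum(n_rows)
    denom = comb(n, deaths_total)
    def prob(split):
        return (comb(n_rows[0], split[0]) * comb(n_rows[1], split[1])
                * comb(n_rows[2], split[2])) / denom
    p_obs = prob(observed)
    splits = [s for s in product(*(range(r + 1) for r in n_rows))
              if sum(s) == deaths_total]
    return sum(prob(s) for s in splits if prob(s) <= p_obs + 1e-12)

# areas A, B, C with 4, 3, 3 calves; 3 deaths, all in area C
print(freeman_halton_3x2((4, 3, 3), 3, (0, 0, 3)))  # approx 0.017 (= 2/120)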