EVALUATION METHODS AND PRACTICE

Sample Size Estimation in Research With Dependent Measures and Dichotomous Outcomes
I reviewed sample size estimation methods for research designs involving nonindependent data and a dichotomous response variable to examine the importance of proper sample size estimation and the need to align methods of sample size estimation with planned methods of statistical analysis. Examples and references to published literature are provided in this article.

When the method of sample size estimation is not in concert with the method of planned analysis, poor estimates may result. The effects of multiple measures over time also need to be considered.

Proper sample size estimation is often overlooked. Alignment of the sample size estimation method with the planned analysis method, especially in studies involving nonindependent data, will produce appropriate estimates. (Am J Public Health. 2004;94:372–377)
Kevin L. Delucchi, PhD
WHEN DESIGNING A STUDY—whether a program evaluation, a survey, a case–control comparison, or a clinical trial—investigators often overlook sample size estimation. For ethical and practical reasons, it is important to accurately estimate the required sample size when one is testing a hypothesis or estimating the size of an effect in observational research.1–3 I seek to advance the existing literature by examining 3 points: (1) the importance of sample size estimation in research, (2) the need for alignment of sample size estimation with the planned analysis, and (3) the special case of a design involving clustered or correlated data and a dichotomous outcome.

This discussion is framed primarily in terms of longitudinal study designs, which are more common and probably more familiar to many researchers than cluster-randomized designs. The broader points, however, apply to all research settings in which sample size is important. The more specific issues and methods apply to any design in which the data are nonindependent, such as studies of members of a household, comparisons of entire communities, and multiple measures of the same person.
This topic can be framed from 2 separate perspectives: testing hypotheses and estimating parameters. When testing a hypothesis, one is concerned with estimating the number of study participants required to ensure a minimal probability (power) of detecting an effect if it exists. With many public health applications, the goal is not to test a hypothesis but rather to estimate the size of an effect, such as an odds ratio, a correlation coefficient, or a proportion. The focus is then on the precision of the estimate, expressed by the width of the confidence interval, and the question becomes, "If I have a sample of a given size, how large will the confidence interval around my estimate be?" Proper sample size estimation is equally important in both perspectives.

The Importance of Good Estimation

I have assumed that the need for sample size estimation in planning a study is both understood and appreciated. This is not a trivial assumption. Lenth2 pointed out that despite the importance of this topic, only a limited body of published literature exists on methods of sample size estimation. Hoenig and Heisey3 demonstrated that some of the basic concepts of statistical power are still misunderstood, and Halpern et al.4 recently discussed the continuing appearance of underpowered medical research.
When I reviewed the literature, I found surprisingly little evidence of improvement in the application of sample size estimation to study design, despite the publication of numerous articles that have pointed to this problem.1,5 In 1988, Freiman et al. replicated a study they had first published in 1978.6 In this follow-up study, published in 1992,7 they concluded, as they had in the original work, that inadequate attention was being paid to the issue of statistical power in randomized clinical trials. Reviews within specialties have consistently found many studies to be underpowered.8–12

Although most of the literature on this topic is written from the experimental or clinical trials perspective, a few publications have addressed the estimation of sample size for confidence intervals.13–15 Volatier et al.16 discussed sample size estimation principles for a dietary survey, and Brogger et al.,17 Bennett et al.,18 and Panagiotakos et al.19 have provided recent examples of study design for effect size estimation. Additionally, several articles have addressed sample size estimation in the context of estimating gene–environment interactions.20–22

Estimating Required Sample Size

When one plans a research study, several steps are needed to estimate the number of study participants. Brief introductions to this subject can be found in articles by Streiner23 and Clark.24 The procedures for estimating a sample size can be summarized as follows: (1) design the study to meet its specific aims; (2) use pilot data and published study results to estimate the effect size, or neighborhood of effect sizes, for each statistical hypothesis to be tested or effect size to be estimated; (3) set the type I error rate (α, usually .05) and the minimal required power (1 − β, usually 80%); (4) compute the required number of study participants, or sets of study participants, for each estimated effect size and each tested hypothesis; and (5) if necessary, revise study parameters to accommodate a smaller number of study participants while retaining adequate power.28–30 In actual practice, the sample size estimation process is often more interactive and adaptive; slightly different versions of the process outlined here are provided by Castelloe and O'Brien,25 Maxwell,26 and Cohen.27
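
As a concrete illustration of steps 2 through 4, the following sketch computes per-group sample sizes for a two-group comparison of proportions over a neighborhood of effect sizes, using the standard pooled-variance normal approximation (the rates shown are hypothetical; the article's own figures come from PASS 2000 and similar packages):

```python
import math
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided test of 2 independent proportions
    (normal approximation, variance pooled under the null)."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# Step 2: a neighborhood of plausible effect sizes; step 4: n for each.
for p2 in (0.30, 0.35, 0.40):
    print(f"0.20 vs {p2:.2f}: {n_per_group(0.20, p2)} per group")
```

For the 0.20 versus 0.35 comparison used in the example that follows, this formula gives 138 per group (276 total), close to the roughly 275 total reported below from PASS 2000.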
With step 4, it is important to use an estimation method that closely matches the planned analysis method.31 Consider a study designed to compare 2 groups of participants on a dichotomous outcome with a logistic regression model to statistically control for a set of covariates; for instance, when one compares smoking rates, 1 group may have slightly higher levels of depression symptoms and a greater average age. To estimate the required sample size for a logistic regression, one requires an estimate of the expected outcome proportions of the 2 conditions (the effect size) plus the level of correlation (ρ, the population correlation coefficient) between group membership and the set of covariates that will be used in the logistic regression.32
If, however, one is unable to estimate that correlation, it may be tempting to base the sample size estimate on a simple comparison of the 2 proportions. For the sake of the example, if the proportions of the 2 groups are expected to be 0.20 and 0.35 (α = .05; β = .20 [80% power]), a sample size of approximately 275 participants is needed (in accordance with PASS 2000 software33). In effect, this example assumes that ρ is equal to 0.0. If, however, ρ is greater than 0.0, the study will be underpowered when the data are collected. To reach the targeted power level, the required sample size must increase as ρ² increases: 306 if ρ² is equal to 0.10 and 344 if ρ² is equal to 0.20. The specific inflation factor is 1/(1 − ρ²); this effect is illustrated in Figure 1.

FIGURE 1—Required sample size as a function of correlation of covariates with treatment group membership. [Figure: sample size (y-axis, roughly 270 to 370) plotted against the correlation coefficient ρ (x-axis, 0.00 to 0.50).]
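
The adjustment itself is a one-line computation. A minimal sketch, taking the article's PASS-based figure of 275 as the unadjusted total:

```python
import math

# Inflate the unadjusted n by 1 / (1 - rho**2) when the covariates are
# correlated with group membership (Hsieh et al.32).
n_unadjusted = 275
for rho_sq in (0.0, 0.10, 0.20):
    print(f"rho^2 = {rho_sq:.2f}: n = {math.ceil(n_unadjusted / (1 - rho_sq))}")
# prints 275, 306, and 344, matching the text and the curve in Figure 1
```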
In the end, the sample size must be a compromise between the competing demands of good science and the available resources of time and budget.

Methods of Sample Size Estimation for Longitudinal Designs

For designs in which the outcome data are continuous and nonindependent, a number of references34–37 and software packages30,38–41 provide resources for estimating sample requirements, depending on the planned analysis (see Muller et al.,31 Hedeker et al.,42 and Rochon43 for more complex models).

To illustrate sample size estimation for a dichotomous longitudinal outcome, consider estimating the sample size for a proposed study of smoking rates in 2 groups measured at 3 time points. The analysis plan is to conduct tests of the 3 effects of interest: a comparison of the rates between the 2 groups, the change over time, and the group-by-time interaction. Set α at .05 and power at 80% (i.e., type II error rate = .20), and assume the expected smoking rates will be 30%, 40%, and 60% for 1 group and 20%, 25%, and 30% for the other.
Use cross-sectional methods to approximate the sample size. A simple approximation ignores the time factor and either collapses across time or computes separate estimates for each assessment and adjusts the α level for the multiple tests. In this example, the average proportion is 0.25 for 1 sample and 0.43 for the other. A comparison of these 2 proportions requires approximately 108 study participants per group33 to ensure at least 80% power.
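
This figure can be approximated with open-source tools as well; the following statsmodels sketch (my illustration, not the author's tooling) uses the arcsine-transformed effect size and reproduces the 107-per-group figure discussed in the next paragraph:

```python
import math
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

h = proportion_effectsize(0.43, 0.25)  # Cohen's h, the arcsine-based effect size
n = NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.80,
                                 alternative='two-sided')
print(math.ceil(n))  # 107 per group
```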
Estimates vary slightly even for this simple comparison. For example, 107 study participants per group are required if an arcsine transformation is applied to the proportions first, 118 per group if a correction for continuity is used, and 117 per group if both are used. Rochon's SAS macro43 estimates 111 study participants per group, and O'Brien's UnifyPow38 estimates 109 per group when the Pearson χ² test is used and 111 when the Wald χ² test is used. The PASS manual33 states that use of the continuity correction, but not of the arcsine transformation, yields results close to those obtained with the Fisher exact test; when carried through to the data analysis, however, the continuity correction may be overly conservative.44
However, the analysis plan calls for testing for changes across time, and a better approximation may be to compare the proportions of study participants who smoked at each time point. This comparison requires a multiple-testing control, such as a Bonferroni-type correction that sets the testwise α at .05/3 = .0167 to maintain the type I error rate across the 3 tests. The per-group size estimates are 392 participants at the first time point, 203 at the second time point, and 57 at the third time point. Because the comparison at the first time point requires the largest sample size, a total sample of 784 study participants is required, a 263% increase over the estimate of 216 study participants obtained when the proportions are averaged across time. These estimates, however, still include no direct tests of change across time or of the group-by-time interaction, and they fail to take into account the assessment-to-assessment correlation that results from the repeated measurements.
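
The per-time-point estimates follow from the same two-proportion formula with the Bonferroni-adjusted α; a brief sketch (pooled-variance normal approximation, as above):

```python
import math
from scipy.stats import norm

def n_per_group(p1, p2, alpha, power=0.80):
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

alpha_bonf = 0.05 / 3  # testwise alpha of .0167
rates = [(0.20, 0.30), (0.25, 0.40), (0.30, 0.60)]  # time points 1, 2, 3
print([n_per_group(p1, p2, alpha_bonf) for p1, p2 in rates])
# prints [392, 203, 57]; doubling the largest gives the 784 total
```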
Incorporate the across-assessment correlation. To improve the approximation, one can apply methods used to analyze data from related designs: stratified contingency tables, cluster-randomized studies, and survey methods.

The example data form 3 2 × 2 tables of smoking status by group, 1 at each time point. The hypothesis of a common odds ratio can be tested with the Cochran–Mantel–Haenszel test45 for comparing binary outcomes between 2 groups while controlling for 1 or more stratifying variables, such as site in a multisite clinical trial. Zhang and Boos46 extended the Cochran–Mantel–Haenszel test to the case in which the outcomes are correlated and derived 2 related tests. Building on Wittes and Wallenstein's research,47 they also provided power calculations that incorporate the population correlation coefficient (the intraclass correlation) into their formula 3. Applied directly to the example data, their method yields estimates of 47 to 91 participants per group, depending on the correlation (assumed to range from ρ = 0.2 to 0.8).
Another way to incorporate the nonindependence among study participants in a power analysis comes from research using the cluster-randomized design, discussed by Donner48 and Donner and Klar49 for the continuous case; methods of power analysis for clustered binary data are discussed by Lee and Durbin,50 Jung et al.,51 and Pan.52 One can conceptualize a repeated-measures design as a cluster-randomized design by thinking of the set of assessments for each participant as the cluster that is randomized to a group. In this case the cluster size is fixed, and one should use the average assessment-to-assessment correlation as the estimate of the population correlation coefficient, which determines the variance inflation factor in this context. In the example, examining the same range of intraclass correlations (.20 to .80) and using the formula provided by Donner and Klar49 yields the same sample size estimates of 47 to 91 per group. (Rochon's program,43 assuming the same proportions across time, gives estimates of 53 and 98.)
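
A rough version of the cluster-based calculation uses the familiar design effect, 1 + (m − 1)ρ, where m is the cluster size (here, 3 assessments per participant). This is a simplification of the Zhang and Boos and Donner and Klar formulas: the base figure of 105 is my back-calculated approximation, and the result at ρ = .2 (49) differs slightly from the article's 47:

```python
import math

def n_clustered(n_indep, m, icc):
    """Inflate an independent-data estimate by the design effect
    1 + (m - 1) * icc, then spread it over the m observations that
    each participant contributes."""
    return math.ceil(n_indep * (1 + (m - 1) * icc) / m)

# Roughly 105 independent observations per group are needed for 0.25
# vs 0.43; each participant supplies m = 3 assessments.
for icc in (0.0, 0.2, 0.8):
    print(f"icc = {icc}: {n_clustered(105, 3, icc)} participants per group")
# prints 35, 49, and 91, tracking the 47-to-91 range reported above
```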
Although such methods allow the investigator to take into account correlation across time, I have had to assume that the correlations are equal from time to time (i.e., compound symmetric) and that the test is a simple comparison of 2 proportions. These methods for incorporating across-assessment correlation still do not provide estimates for either the test of change over time or the test of the group-by-time interaction. As these estimates and Muller et al.31 demonstrate, such approximations can be risky.
Use a fully aligned method. To completely align the sample size estimation with the analysis plan, rather than merely approximating the plan, one can use the methods provided by Rochon,43 Pan,52 and Liu and Liang.53 Pan's formulas are limited to 2 conditions and do not allow for dropout, although they do not require software implementation; Liu and Liang's method is limited to categorical covariates. Rochon's approach applies to the more general case.
Rochon's method43 is based on the Wald χ² test and is implemented in a SAS macro under Proc IML (SAS Institute Inc, Cary, NC). It requires estimates of effect, such as those in the example, and the specification of type I and type II error rates. The method also requires an estimate of the correlation of the outcome between the first 2 assessments (the first-order autocorrelation) and an estimate of the shape of the correlation matrix.
With the generalized estimating equation (GEE) approach, the correlation of error terms in a model is treated as a nuisance: it must be accounted for if one is to obtain robust estimates of the standard errors in the model, but it is not of direct interest. (Lindsey and Lambert54 have argued that such marginal models are not optimal for this analysis and that a mixed model should be used instead.) Although correct specification of the correlational structure will improve efficiency, the estimates of the mean structure will not be biased if the specification is incorrect.
Table 1 shows 3 correlation matrices, each of a different shape, from a 4-assessment design in which the first-order autocorrelation is ρ = .5. Table 1a is compound symmetric, or exchangeable, in shape; the correlation between any 2 time points is the same (i.e., .5). Table 1c shows a case in which the level of correlation declines as the assessment points grow farther apart in time; specifically, an autoregressive step 1 (AR[1]) shape, in which each correlation is defined as the first-order autocorrelation, ρ, raised to a power equal to the difference between the time points (e.g., ρ_13 = ρ^|1−3| = ρ²). Between them, Table 1b shows an autoregressive shape in which the correlation declines more slowly than in the full AR(1). The slower decline is produced by placing an exponent, θ, on the exponent of ρ: thus ρ² becomes ρ^(2^0.5) ≈ ρ^1.41 when θ is set to 0.50. The effect is to slow the rate of decline in the correlation over time if 0 ≤ θ ≤ 1 and to increase the rate of decline if θ > 1. A θ value of 0 produces the exchangeable matrix of Table 1a, a value of 0.5 produces Table 1b, and a value of 1.0 produces Table 1c. This method of raising the exponent to a further power to change the rate of decay is implemented in Rochon's approach and is based on the approach of Muñoz et al.55 (It is possible for the correlation between time points to be negative or to increase as the time span increases, but this is not common.)

TABLE 1—Three 4 × 4 Hypothetical Correlation Matrices of a Variable Measured at 4 Time Points

a. Compound symmetric
        1     2     3     4
1    1.00  0.50  0.50  0.50
2    0.50  1.00  0.50  0.50
3    0.50  0.50  1.00  0.50
4    0.50  0.50  0.50  1.00

b. Attenuated decline
        1     2     3     4
1    1.00  0.50  0.38  0.30
2    0.50  1.00  0.50  0.38
3    0.38  0.50  1.00  0.50
4    0.30  0.38  0.50  1.00

c. Autoregressive, step 1
        1     2     3     4
1    1.00  0.50  0.25  0.13
2    0.50  1.00  0.50  0.25
3    0.25  0.50  1.00  0.50
4    0.13  0.25  0.50  1.00
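
The parametric family of Muñoz et al.55 that generates these 3 shapes is easy to reproduce. The following sketch (Python with NumPy, offered as an illustration; Rochon's implementation is a SAS macro) rebuilds Table 1 from ρ = .5 and θ values of 0, 0.5, and 1:

```python
import numpy as np

def munoz_corr(rho, theta, n_times=4):
    """Correlation matrix with entries rho ** (|i - j| ** theta):
    theta = 0 gives compound symmetry, theta = 1 gives AR(1), and
    intermediate values slow the decay (Munoz et al. family)."""
    lags = np.abs(np.subtract.outer(np.arange(n_times), np.arange(n_times)))
    corr = rho ** (lags.astype(float) ** theta)
    np.fill_diagonal(corr, 1.0)  # keep the diagonal exact when theta = 0
    return corr

for theta in (0.0, 0.5, 1.0):  # Tables 1a, 1b, and 1c, respectively
    print(np.round(munoz_corr(0.5, theta), 2))
```

Rounding to 2 decimal places reproduces the entries in Table 1 (0.38 ≈ .5^(2^0.5) and 0.30 ≈ .5^(3^0.5) in Table 1b).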
Estimating these additional parameters (the correlation and the shape of the correlation matrix) places an additional burden on the researcher. Just as one may have multiple estimates of effect, one also may have multiple estimates of the additional parameters, and one should check the extent to which the estimated sample sizes vary as the parameter estimates vary.

Before considering the effects of these parameters on the sample size estimates, compare the estimates from the fully aligned analysis with the approximations based on the effects provided in the example data, which are summarized in Table 2. Use of a method aligned with the planned analysis provides estimates not only for the comparison of groups but also for the 2 effects that involve time. When used to test the group-by-time interaction, Rochon's approach indicates that 262 study participants per group (524 total) are required if one assumes that the first-order correlation equals .20 and that the shape of the correlation matrix is compound symmetric. This is the largest of the required sample sizes for the 3 hypotheses we wish to test (under those assumptions) and would be the final estimate for this example. If we used the estimate based only on the averaged treatment group comparison (108 per group), our ability to detect the interaction effect would be greatly underpowered, unless we had chosen the estimate of 392 on the basis of the 3 per-time-point comparisons at α = .0167, in which case the study would have too many participants.

TABLE 2—Sample Size Estimates for Each of 4 Methods in the Comparison of 2 Groups at Type I Error = .05 and Type II Error = .20

                                                          Correlation^a
Method                                     Effect             0     .2     .8
Cross-sectional design, single comparison  Group            108
Cross-sectional design, 3 comparisons      Group     392/203/57
Cluster-based design                       Group             35     47     91
GEE-based design                           Group             38     53     98
                                           Time              59     47     12
                                           Group-by-time    327    262     67

Note. GEE = generalized estimating equation.
^a Level of first-order autocorrelation, assuming compound symmetry.

When one focuses on the GEE-based estimates that are aligned with the analysis plan, the interaction test requires many more participants than either of the other 2 effects. Such a difference is quite common unless the interaction is very pronounced. Also notice the increase in sample size that accompanies the increase in assumed level of correlation for the treatment effect, and the reduction in sample size for the other 2 effects. The reason is that as the correlation from assessment to assessment rises, less information is available from each assessment for the treatment comparison, but more information is available about the changes over time. Note also that the study in this example would be overpowered if we conservatively assumed that no correlation across time existed (ρ = 0) when in fact such a correlation did exist. A study with too many participants is not desirable, because it is unethical and a waste of limited resources to expose more participants to research than necessary.

The relationship of correlational structure to the number of study participants can be seen in greater detail in Figure 2. Each of the 3 panels displays the required sample size, 1 effect per panel, as a function of the level of the correlation under 3 correlational structures: compound symmetric, step 1 autoregressive, and a structure midway between the other 2 that uses a dampening parameter set to .50, which translates to a slowing of the decay in the correlation (Table 1b). Note that the y-axis scales vary from panel to panel. As the population correlation coefficient increases, more study participants are needed to test the difference between conditions, whereas fewer are needed to test the effects that involve time.

FIGURE 2—Number of study participants per group as a function of correlation for 3 correlation matrices: compound symmetric (CS), autoregressive step 1 (AR1), and a midpoint between them (θ = 0.5). [Three panels plot the number per condition against the correlation: a, the group effect; b, the time effect; c, the group-by-time interaction.]
The assumed shape of the correlation matrix makes almost no difference in the case of treatment effects and makes only a small difference in the case of time-related effects. The differences can be meaningful, however, in cases where more study participants are needed, such as for the interaction effect. If ρ is equal to .50, 164 study participants per group are required under a compound-symmetric assumption, while 226 study participants are necessary under an autoregressive structure.
This approach can be applied to both continuous and categorical data, and it allows for more variations than are discussed in this article, including unequally spaced assessments, differential attrition among samples, and unequal numbers of subjects per group.43
Use simulations. One other option requires substantially more work but is quite accurate: run a series of computer-based simulations that draw samples of the proposed size from a population with known parameters. In this example, that would mean sampling from 1 theoretical population with X percent "abstinent" and from another with Y percent "abstinent" at each time point, with a given variance/covariance structure. For each sample, one would test the primary hypotheses, repeat this set of steps (sample a known population and test the hypothesis), and count how often the resultant P value was less than .05. Do this repeatedly with different sample sizes until the sample size is large enough that the test hypothesis, when false, is rejected at least 80% of the time.
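
A minimal sketch of this simulation strategy, assuming a Gaussian-copula data-generating model, statsmodels' GEE implementation, and an illustrative replication count (none of these choices is prescribed by the article):

```python
import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

rng = np.random.default_rng(2004)  # arbitrary seed for reproducibility

def simulate_group(n, probs, rho):
    """n participants x len(probs) correlated binary outcomes, generated
    by thresholding latent normals with exchangeable correlation rho."""
    t = len(probs)
    cov = np.full((t, t), rho)
    np.fill_diagonal(cov, 1.0)
    z = rng.multivariate_normal(np.zeros(t), cov, size=n)
    return (z < norm.ppf(probs)).astype(float)

def estimated_power(n_per_group, p0, p1, rho, alpha=0.05, n_sims=100):
    t = len(p0)
    group = np.repeat([0.0, 1.0], n_per_group * t)
    time = np.tile(np.arange(t, dtype=float), 2 * n_per_group)
    ids = np.repeat(np.arange(2 * n_per_group), t)
    X = sm.add_constant(np.column_stack([group, time, group * time]))
    rejected = 0
    for _ in range(n_sims):
        y = np.concatenate([simulate_group(n_per_group, p0, rho).ravel(),
                            simulate_group(n_per_group, p1, rho).ravel()])
        fit = sm.GEE(y, X, groups=ids, family=sm.families.Binomial(),
                     cov_struct=sm.cov_struct.Exchangeable()).fit()
        rejected += fit.pvalues[3] < alpha  # group-by-time interaction
    return rejected / n_sims

# Rates from the example; time is entered linearly as a simplification.
print(estimated_power(262, (0.20, 0.25, 0.30), (0.30, 0.40, 0.60), rho=0.2))
```

One would repeat this with different values of n_per_group until the estimated power first reaches .80; with 100 replications the estimate is coarse, so a real run would use several thousand.
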
Summary

In addition to the examples presented in this article, studies published by Cohen,27 Sedlmeier and Gigerenzer,5 Freiman et al.,6 Thornley and Adams,9 and Bezeau and Graves10 demonstrate that more careful attention to the sample sizes used in research is still needed. A poorly conducted sample size estimation can result in a study with very little chance of demonstrating any meaningful effect.
The 2 most important considerations when estimating the required number of participants are to align the sample size estimation with the data analysis and to verify the sensitivity of the resultant estimates. Although modern methods for data analysis seem to be expanding at a rapid rate, methods of sample size estimation are not far behind, and user-friendly software for conducting sample size estimation is increasingly available. The impact of aligning sample size estimation methods with data analytic methods is often overlooked; the closer the methods of estimating sample size are to the methods of analysis, the better the chances are that the actual power achieved will match the level of planned power.
Part of the cost of planning a more complex design and analysis derives from the additional information that must be acquired or approximated to accurately estimate how many participants will be required. The effort expended in gathering those pieces of information will necessarily be in proportion to the size of the study and the maturity of the research field in which the study is set.
Once the methods are aligned, efforts should be focused on estimating the required parameters, while recognizing that it is uncommon to be able to base sample size estimates on a single, well-established effect size. It is equally important to recognize that the effect size and some of the other parameters, such as attrition rates, are themselves estimates. The more the estimates of these parameters vary, the more the sample size estimates will vary. Whereas the scientifically conservative decision in the face of such variation would be to select the largest estimated sample size, that decision may be impractical and may be far in excess of the true requirement. Even well-established estimates of the parameters should be subjected to a sensitivity analysis to determine the extent to which the estimated sample size varies as the parameters vary.

Following these recommendations means more work for the investigators planning a study and for the reviewers of proposals and manuscripts, but it is work that pays off in the long run, both for the investigators themselves and for the scientific community as a whole.

About the Author
Requests for reprints should be sent to Kevin L. Delucchi, PhD, Department of Psychiatry, University of California, San Francisco, Box 0984-TRC, 401 Parnassus Ave, San Francisco, CA 94143-0984 (e-mail: [email protected]).
This article was accepted July 14, 2003.

Acknowledgments
This work was supported by National Institute on Drug Abuse grant P50DA09253.
Drs David Wasserman, Alan Bostrom, Roger Vaughan, and 3 anonymous reviewers provided many very helpful comments and suggestions.

Human Participant Protection
No protocol approval was needed for this study.

References
1. Cohen J. The statistical power of abnormal social and psychological research: a review. J Abnorm Soc Psychol. 1962;65:145–153.
2. Lenth RV. Some practical guidelines for effective sample size determination. Am Statistician. 2001;55:187–193.
3. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Statistician. 2001;55:19–24.
4. Halpern SD, Karlawish JHT, Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA. 2002;288:358–367.
5. Sedlmeier P, Gigerenzer G. Do studies of statistical power have an effect on the power of studies? Psychol Bull. 1989;105:309–316.
6. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized controlled trial. Survey of 71 "negative" trials. N Engl J Med. 1978;299:690–694.
7. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error, and sample size in the design and interpretation of the randomized controlled trial. In: Bailar JC III, Mosteller F, eds. Medical Uses of Statistics. 2nd ed. Boston, Mass: NEJM Books; 1992:357–373.
8. Sloan NL, Jordan E, Winikoff B. Effects of iron supplementation on maternal hematologic status in pregnancy. Am J Public Health. 2002;92:288–293.
9. Thornley B, Adams C. Content and quality of 2000 controlled trials in schizophrenia over 50 years. BMJ. 1998;317:1181–1184.
10. Bezeau S, Graves R. Statistical power and effect sizes of clinical neuropsychology research. J Clin Exp Neuropsychol. 2001;23:399–406.
11. Freedman KB, Bernstein J. Sample size and statistical power in clinical orthopaedic research. J Bone Joint Surg. 1999;81:1454–1460.
12. Dickison K, Bunn F, Wentz R, Edwards P, Roberts I. Size and quality of randomized controlled trials in head injury: review of published studies. BMJ. 2000;320:1308–1311.
13. Beal SL. Sample size determination for confidence intervals on the population mean and on the difference between two population means. Biometrics. 1989;45:969–977.
14. Daly LE. Confidence intervals and sample sizes: don't throw out all your old sample size tables. BMJ. 1991;302:333–336.
15. Satten GA, Kupper LL. Sample size requirements for interval estimation of the odds ratio. Am J Epidemiol. 1990;131:177–184.
16. Volatier JL, Turrini A, Welten D; EFCOSUM Group. Some statistical aspects of food intake assessment. Eur J Clin Nutr. 2002;56(suppl 2):S46–S52.
17. Brogger J, Bakke P, Eide GE, Gulsvik A. Comparison of telephone and postal survey modes on respiratory symptoms and risk factors. Am J Epidemiol. 2002;155:572–576.
18. Bennett S, Lienhardt C, Bah-Sow O, et al. Investigation of environmental and host-related risk factors for tuberculosis in Africa, II: investigation of host genetic factors. Am J Epidemiol. 2002;155:1074–1079.
19. Panagiotakos DB, Chrysohoou C, Pitsavos C, et al. The association between secondhand smoke and the risk of developing acute coronary syndromes, among non-smokers, under the presence of several cardiovascular risk factors: the CARDIO2000 case–control study. BMC Public Health. 2002;2(1):9.
20. Sturmer T, Brenner H. Flexible matching strategies to increase power and efficiency to detect and estimate gene–environment interactions in case–control studies. Am J Epidemiol. 2002;155:593–602.
21. Yang Q, Khoury MJ, Friedman JM, Flanders DW. On the use of population attributable fraction to determine sample size for case–control studies of gene–environment interaction. Epidemiology. 2003;14:161–167.
22. Umbach DM. On the determination of sample size. Epidemiology. 2003;14:137–138.
23. Streiner DL. Sample size and power in psychiatric research. Can J Psychiatry. 1990;35:616–620.
24. Clark V. Sample size determination. Plast Reconstr Surg. 1991;87:569–573.
25. Castelloe JM, O'Brien RG. Power and Sample Size Determination for Linear Models. Proceedings of the Twenty-Sixth Annual SAS Users Group International Conference, Long Beach, Calif, 22–25 April 2001. Cary, NC: SAS Institute Inc; 2001.
26. Maxwell SE. Sample size and multiple regression analysis. Psychol Methods. 2000;5:434–458.
27. Cohen J. Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum; 1988.
28. Kraemer HC. To increase power in randomized clinical trials without increasing sample size. Psychopharmacol Bull. 1991;27:217–224.
29. McAweeney MJ, Klockars AJ. Maximizing power in skewed distributions: analysis and assignment. Psychol Methods. 1998;3:117–122.
30. McClelland GH. Optimal design in psychological research. Psychol Methods. 1997;2:3–19.
31. Muller KE, LaVange LM, Landesman-Ramey S, Ramey CT. Power calculations for general linear multivariate models including repeated measures applications. J Am Stat Assoc. 1992;87:1209–1226.
32. Hsieh FY, Block DA, Larson MD. A simple method for sample size calculation for linear and logistic regression. Stat Med. 1998;17:1623–1634.
33. Hintze J. PASS 2000 [computer software]. Kaysville, Utah: Number Cruncher Statistical Software; 2000.
34. Muller KE, Barton CN. Approximate power for repeated measures ANOVA lacking sphericity. J Am Stat Assoc. 1989;84:549–555.
35. Overall JE, Doyle SR. Estimating sample sizes for repeated measurement designs. Control Clin Trials. 1994;15:100–123.
36. Overall JE, Atlas RS. Power of univariate and multivariate analyses of repeated measurements in controlled clinical trials. J Clin Psychol. 1999;55:465–485.
37. Rochon J. Sample size calculations for two-group repeated-measures experiments. Biometrics. 1991;47:1383–1398.
38. O'Brien RG. A Tour of UnifyPow, a SAS Module/Macro for Sample-Size Analysis. Proceedings of the Twenty-Third Annual SAS Users Group International Conference, Nashville, Tenn, 22–25 March 1998. Cary, NC: SAS Institute Inc; 1998.
39. Elashoff JD. nQuery Advisor [computer software]. Version 4.0. Saugus, Mass: Statistical Solutions; 2000.
40. Ahn C, Overall JE, Tonidandel S. Sample size and power calculations in repeated measurement analysis. Comput Methods Programs Biomed. 2001;64:121–124.
41. EgretSIZ [computer program]. Cambridge, Mass: Cytel Software Inc; 1994.
42. Hedeker D, Gibbons RD, Waternaux C. Sample size estimation for longitudinal designs with attrition: comparing time-related contrasts between two groups. J Educ Behav Stat. 1999;24:70–93.
43. Rochon J. Application of GEE procedures for sample size calculations in repeated measures experiments. Stat Med. 1998;17:1643–1658.
44. Delucchi KL. The use and misuse of chi-square: Lewis and Burke revisited. Psychol Bull. 1983;94:166–176.
45. Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst. 1959;22:719–748.
46. Zhang J, Boos DD. Mantel–Haenszel test statistics for correlated binary data. Biometrics. 1997;53:1185–1198.
47. Wittes J, Wallenstein S. The power of the Mantel–Haenszel test. J Am Stat Assoc. 1987;82:1104–1109.
48. Donner A. Sample size requirements for stratified cluster randomized designs. Stat Med. 1992;11:743–750.
49. Donner A, Klar N. Design and Analysis of Cluster Randomization Trials in Health Research. London, England: Arnold; 2000.
50. Lee EW, Durbin N. Estimation and sample size considerations for clustered binary responses. Stat Med. 1994;13:1241–1252.
51. Jung S-H, Kang S-H, Ahn C. Sample size calculations for clustered binary data. Stat Med. 2001;20:1971–1982.
52. Pan W. Sample size and power calculations with correlated binary data. Control Clin Trials. 2001;22:211–227.
53. Liu G, Liang K-Y. Sample size calculations for studies with correlated observations. Biometrics. 1997;53:937–947.
54. Lindsey JK, Lambert P. On the appropriateness of marginal models for repeated measurements in clinical trials. Stat Med. 1998;17:447–469.
55. Muñoz A, Carey V, Shouten JP, Segal M, Rosner B. A parametric family of correlation structures for the analysis of longitudinal data. Biometrics. 1992;48:733–742.