Sample Size Estimation in Research With Dependent Measures and Dichotomous Outcomes
EVALUATION METHODS AND PRACTICE

I reviewed sample size estimation methods for research designs involving nonindependent data and a dichotomous response variable to examine the importance of proper sample size estimation and the need to align methods of sample size estimation with planned methods of statistical analysis. Examples and references to the published literature are provided in this article. When the method of sample size estimation is not in concert with the method of planned analysis, poor estimates may result. The effects of multiple measures over time also need to be considered. Proper sample size estimation is often overlooked. Alignment of the sample size estimation method with the planned analysis method, especially in studies involving nonindependent data, will produce appropriate estimates. (Am J Public Health. 2004;94:372–377)

Kevin L. Delucchi, PhD

When designing a study, whether a program evaluation, a survey, a case–control comparison, or a clinical trial, investigators often overlook sample size estimation. For ethical and practical reasons, it is important to accurately estimate the required sample size when one is testing a hypothesis or estimating the size of an effect in observational research.1–3 I seek to advance the existing literature by examining 3 points: (1) the importance of sample size estimation in research, (2) the need to align sample size estimation with the planned analysis, and (3) the special case of a design involving clustered or correlated data and a dichotomous outcome.

This discussion is framed primarily in terms of longitudinal study designs, which are more common and probably more familiar to many researchers than cluster-randomized designs. The broader points, however, apply to all research settings in which sample size is important. The more specific issues and methods apply to any design in which the data are nonindependent, such as studies of members of a household, comparisons of entire communities, and multiple measures of the same person.

This topic can be framed from 2 separate perspectives: testing hypotheses and estimating parameters. When testing a hypothesis, one is concerned with estimating the number of study participants required to ensure a minimal probability (power) of detecting an effect if it exists. With many public health applications, however, the goal is not to test a hypothesis but rather to estimate the size of an effect, such as an odds ratio, a correlation coefficient, or a proportion. There the focus is on the precision of the estimate, expressed by the width of the confidence interval: "If I have a sample of a given size, how large will the confidence interval around my estimate be?" Proper sample size estimation is equally important in both perspectives.
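For the estimation perspective, that question can be answered directly by computing the expected interval width at candidate sample sizes. A minimal sketch in Python (my own illustration; the 0.30 proportion and the simple Wald interval are assumptions, not values from this article):

```python
from math import sqrt
from scipy.stats import norm

def wald_halfwidth(p, n, alpha=0.05):
    # Half-width of the large-sample (Wald) confidence interval for a
    # single proportion p estimated from n observations.
    return norm.ppf(1 - alpha / 2) * sqrt(p * (1 - p) / n)

# How precise would an estimated proportion of 0.30 be at each n?
for n in (50, 100, 200, 400):
    print(n, round(wald_halfwidth(0.30, n), 3))
```

Because the half-width shrinks with the square root of n, quadrupling the sample only halves the interval, which is why precision targets should be set before data collection.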
The Importance of Good Estimation

I have assumed that the need for sample size estimation in planning a study is both understood and appreciated. This is not a trivial assumption. Lenth2 pointed out that despite the importance of this topic, only a limited body of published literature exists on methods of sample size estimation. Hoenig and Heisey3 demonstrated that some of the basic concepts about statistical power are still misunderstood, and Halpern et al.4 recently discussed the continuing appearance of underpowered medical research.

When I reviewed the literature, I found surprisingly little evidence of improvement in the application of sample size estimation to study design, despite the publication of numerous articles pointing to this problem.1,5 Freiman et al. replicated a study they had first published in 1978.6 In this follow-up study, published in 1992,7 they concluded, as they had in the original work, that inadequate attention was being paid to statistical power in randomized clinical trials. Reviews within specialties have consistently found many studies to be underpowered.8–12 Although most of the literature on this topic is written from the experimental or clinical trials perspective, a few publications have addressed the estimation of sample size for confidence intervals.13–15 Volatier et al.16 discussed sample size estimation principles for a dietary survey, and Brogger et al.,17 Bennett et al.,18 and Panagiotakos et al.19 have provided recent examples of study design for effect size estimation. Additionally, several articles have addressed sample size estimation in the context of estimating gene–environment interactions.20–22

Estimating Required Sample Size

When one plans a research study, several steps are needed to estimate the required number of study participants. Brief introductions to this subject can be found in articles by Streiner23 and Clark.24 The procedure can be summarized as follows: (1) design the study to meet its specific aims; (2) use pilot data and published study results to estimate the effect size, or neighborhood of effect sizes, for each statistical hypothesis to be tested or effect size to be estimated; (3) set the type I error rate (α, usually .05) and the minimal required power (1 − β, usually 80%); (4) compute the required number of study participants, or sets of study participants, for each estimated effect size and each tested hypothesis; and (5) if necessary, revise the study parameters to accommodate a smaller number of study participants while retaining adequate power.28–30 It should be noted that in actual practice the sample size estimation process is often more iterative and adaptive (a slightly different version of the process outlined here is provided by Castelloe and O'Brien,25 Maxwell,26 and Cohen27).

In step 4, it is important to use an estimation method that closely matches the planned analysis method.31 Consider a study designed to compare 2 groups of participants on a dichotomous outcome with a logistic regression model that statistically controls for a set of covariates. For instance, when one compares smoking rates, 1 group may have slightly higher levels of depression symptoms and a greater average age. To estimate the required sample size for a logistic regression, one needs an estimate of the expected outcome proportions of the 2 conditions (the effect size) plus the level of correlation (ρ, the population correlation coefficient) between group membership and the set of covariates that will be used in the logistic regression.32 If one is unable to estimate that correlation, it may be tempting to use a simple comparison of the 2 proportions as the basis for estimating the sample size.

For the sake of the example, if the proportions of the 2 groups are expected to be 0.20 and 0.35 (α = .05; β = .20 [80% power]), a sample size of approximately 275 participants is needed (in accordance with PASS 2000 software33). In effect, this assumes that ρ is equal to 0.0. If, however, ρ is greater than 0.0, the study will be underpowered when the data are collected. To reach the targeted power level, the required sample size must increase with the value of ρ²: 306 if ρ² is equal to 0.10 and 344 if ρ² is equal to 0.20. The specific inflation factor is 1/(1 − ρ²); this effect is illustrated in Figure 1.

FIGURE 1—Required sample size as a function of correlation of covariates with treatment group membership.
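The inflation is easy to reproduce with standard software. A sketch in Python using statsmodels (my own check, not the article's PASS 2000 computation; because it uses Cohen's arcsine effect size, its base total of about 274 and inflated totals of about 304 and 343 land close to, but not exactly on, the 275, 306, and 344 quoted above):

```python
from math import ceil
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Base two-group comparison: 0.20 vs 0.35 at alpha = .05 and 80% power.
h = proportion_effectsize(0.35, 0.20)     # Cohen's arcsine effect size
n_per_group = NormalIndPower().solve_power(
    effect_size=h, alpha=0.05, power=0.80, alternative="two-sided")

# Hsieh-style inflation when covariates correlate with group membership:
# divide the uncorrected total by (1 - rho^2).
for rho_sq in (0.0, 0.10, 0.20):
    total = 2 * n_per_group / (1 - rho_sq)
    print(f"rho^2 = {rho_sq:.2f}: total N of about {ceil(total)}")
```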
In the end, the sample size must be a compromise between the competing demands of good science and the available resources of time and budget.

Methods of Sample Size Estimation for Longitudinal Designs

For designs in which the outcome data are continuous and nonindependent, a number of references34–37 and software packages30,38–41 provide resources for estimating sample requirements, depending on the planned analysis (see Muller et al.,31 Hedeker et al.,42 and Rochon43 for more complex models). To illustrate sample size estimation for a dichotomous longitudinal outcome, consider estimating the sample size for a proposed study of smoking rates in 2 groups measured at 3 time points. The analysis plan calls for 3 tests: a comparison of the rates between the 2 groups, the change over time, and the group-by-time interaction. Set α at .05 and power at 80% (i.e., a type II error rate of .20), and assume the expected smoking rates will be 30%, 40%, and 60% for 1 group and 20%, 25%, and 30% for the other.

Use cross-sectional methods to approximate the sample size. A simple approximation ignores the time factor and either collapses across time or computes separate estimates for each assessment, adjusting the α level for the multiple tests. In this example, the proportion averaged across time is 0.25 for 1 sample and 0.43 for the other. A comparison of these 2 proportions requires approximately 108 study participants per group33 to ensure at least 80% power. Estimates vary slightly even for this simple comparison: 107 study participants per group are required if an arcsine transformation is first applied to the proportions, 118 if a correction for continuity is used, and 117 if both are used. Rochon's SAS macro43 estimates 111 study participants per group, and O'Brien's UnifyPow38 estimates 109 when the Pearson χ² is used and 111 when the Wald χ² is used. The PASS manual33 states that use of the continuity correction, but not of the arcsine transformation, yields results close to those obtained with the Fisher exact test; when carried into the data analysis, however, the continuity correction may be overly conservative.44
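Both collapsed-comparison figures can be reproduced from first principles. A sketch in Python (my own illustration; the article's numbers come from PASS 2000 and the other programs cited): the pooled normal approximation returns the 108 quoted above, and the arcsine version returns 107.

```python
from math import asin, ceil, sqrt
from scipy.stats import norm

p1, p2 = 0.25, 0.43                    # smoking proportions averaged over time
za, zb = norm.ppf(1 - 0.05 / 2), norm.ppf(0.80)

# Pooled normal approximation for comparing 2 independent proportions.
pbar = (p1 + p2) / 2
n_normal = (za * sqrt(2 * pbar * (1 - pbar))
            + zb * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p1 - p2) ** 2

# Arcsine (variance-stabilizing) version; h is Cohen's effect size.
h = 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))
n_arcsine = 2 * (za + zb) ** 2 / h ** 2

print(ceil(n_normal), ceil(n_arcsine))  # 108 and 107 per group
```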
However, the analysis plan calls for testing changes across time, so a better approximation may be to compare the proportions of study participants who smoked at each time point. This comparison requires a multiple-testing control, such as a Bonferroni-type correction that sets the testwise α at .05/3 = .0167 to maintain the type I error rate across the 3 tests. The per-group estimates are 392 participants at the first time point, 203 at the second, and 57 at the third. Because the comparison at the first time point requires the largest sample, a total sample of 784 study participants is needed, a 263% increase over the estimate of 216 obtained by averaging the proportions across time. These estimates, however, do not include direct tests for change across time or for the group-by-time interaction, and they fail to take into account the assessment-to-assessment correlation that results from the repeated measurements.

Incorporate the across-assessment correlation. To improve the approximation, one can apply methods used to analyze data from related designs: stratified contingency tables, cluster-randomized studies, and survey methods. The example data form 3 matrices, each 2 × 2, of proportions of smokers by group at each time point. The hypothesis of a common odds ratio can be tested with the Cochran–Mantel–Haenszel test45 for comparing binary outcomes between 2 groups while controlling for 1 or more stratifying variables, such as site in a multisite clinical trial. Zhang and Boos46 extended the Cochran–Mantel–Haenszel test to the case in which the outcomes are correlated, and they derived 2 related tests. Building on Wittes and Wallenstein's research,47 they also provided power calculations that incorporate the population correlation coefficient (the intraclass correlation) into their formula 3. Applied directly to the example data, their formula yields estimates (depending on the correlation, assumed to range from ρ = 0.2 to 0.8) of 47 to 91 participants per group.

Another way to incorporate the nonindependence among observations into a power analysis comes from research on cluster-randomized designs, discussed by Donner48 and Donner and Klar49 for the continuous case; methods of power analysis for clustered binary data are discussed by Lee and Durbin,50 Jung et al.,51 and Pan.52 One can conceptualize a repeated-measures design as a cluster-randomized design by thinking of the set of assessments for each participant as the cluster that is randomized to a group. In this case, the cluster size is fixed, and one can use the average assessment-to-assessment correlation as the estimate of the population correlation coefficient when computing the variance inflation factor. In the example, if one examines the same range of intraclass correlations, from .20 to .80, and uses the formula provided by Donner and Klar,49 one obtains the same sample size estimates of 47 to 91 per group. (If one uses Rochon's program43 and assumes the same proportions across time, the estimates are 53 and 98.)
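The cluster-based reasoning reduces to one adjustment: inflate an independent-observations estimate by the design effect 1 + (m − 1)ρ, then convert observations back to participants by dividing by the m assessments each contributes. A sketch (my own simplification; the 47-to-91 range above comes from Donner and Klar's formula with a different base calculation, so these figures land nearby rather than exactly on it):

```python
from math import ceil

n_indep = 108   # per-group estimate treating all observations as independent
m = 3           # assessments per participant: the "cluster" size

for rho in (0.0, 0.2, 0.8):
    deff = 1 + (m - 1) * rho              # variance inflation (design effect)
    participants = ceil(n_indep * deff / m)
    print(f"rho = {rho}: about {participants} participants per group")
```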
Although such methods allow the investigator to take the correlation across time into account, I have had to assume that the correlations are equal from time point to time point (i.e., compound symmetric) and that the test is a simple comparison of 2 proportions. Methods that incorporate the across-assessment correlation still provide no estimates for either the test of change over time or the test of the group-by-time interaction. As these estimates and Muller et al.31 demonstrate, such approximations can be risky.

Use a fully aligned method. To align the sample size estimation fully with the analysis plan rather than merely approximating it, one can use the methods provided by Rochon,43 Pan,52 and Liu and Liang.53 Pan's formulas are limited to 2 conditions and do not allow for dropout, but they do not require software implementation; Liu and Liang's method is limited to categorical covariates. Rochon's method43 applies to the more general case. It is based on the Wald χ² test and is implemented in a SAS macro under Proc IML (SAS Institute Inc, Cary, NC); it requires estimates of effect, such as those in the example, and the specification of the type I and type II error rates. The method also requires an estimate of the correlation of the outcome between the first 2 assessments (the first-order autocorrelation) and an estimate of the shape of the correlation matrix.

With the generalized estimating equation (GEE) approach, the correlation of error terms in a model is treated as a nuisance: it must be accounted for if one is to obtain robust estimates of the standard errors in the model, but it is not of direct interest. (Lindsey and Lambert54 have argued that such marginal models are not optimal for this analysis and that a mixed model should be used instead.) Although correct specification of the correlational structure improves efficiency, the estimates of the mean structure are not biased if the specification is incorrect.

Table 1 shows 3 correlation matrices, each of a different shape, from a 4-assessment design in which the first-order autocorrelation is ρ = .5. Table 1a is compound symmetric, or exchangeable, in shape; the correlation between any 2 time points is the same (i.e., .5). Table 1c shows a case in which the level of correlation declines as the assessment points grow farther apart in time; specifically, it has an autoregressive step 1 (AR[1]) shape, in which each correlation is the first-order autocorrelation, ρ, raised to a power equal to the absolute difference between the time points (e.g., ρ₁₃ = ρ^|1−3| = ρ²). Between them, Table 1b shows an autoregressive shape in which the correlation declines more slowly than in the full AR(1). The slower decline is accomplished by placing an exponent, θ, on the exponent of ρ, so that the correlation between time points i and j is ρ^(|i − j|^θ); with θ set to 0.50, for example, ρ² becomes ρ^(2^0.5) ≈ ρ^1.41. The effect is to slow the rate of decline in the correlation over time if 0 ≤ θ ≤ 1 and to increase the decline if θ > 1. A θ value of 0 produces the exchangeable matrix of Table 1a, a value of .5 produces Table 1b, and a value of 1.0 produces Table 1c. This method of raising an exponent to a further power to change the rate of decay is implemented in Rochon's approach and is based on the approach of Muñoz et al.55 (It is possible for the correlation between time points to be negative and to increase as the time span increases, but this is not common.)

TABLE 1—Three 4 × 4 Hypothetical Correlation Matrices of a Variable Measured at 4 Time Points

a. Compound symmetric
       1     2     3     4
1   1.00  0.50  0.50  0.50
2   0.50  1.00  0.50  0.50
3   0.50  0.50  1.00  0.50
4   0.50  0.50  0.50  1.00

b. Attenuated decline
       1     2     3     4
1   1.00  0.50  0.38  0.30
2   0.50  1.00  0.50  0.38
3   0.38  0.50  1.00  0.50
4   0.30  0.38  0.50  1.00

c. Autoregressive step 1
       1     2     3     4
1   1.00  0.50  0.25  0.13
2   0.50  1.00  0.50  0.25
3   0.25  0.50  1.00  0.50
4   0.13  0.25  0.50  1.00
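Because the whole family reduces to ρ raised to the power |i − j|^θ, the entries in Table 1 are easy to verify. A short sketch in Python with NumPy (my own illustration of the Muñoz et al. family, not Rochon's SAS macro):

```python
import numpy as np

def dampened_corr(rho, theta, t=4):
    # Correlation between times i and j: rho ** (|i - j| ** theta).
    # theta = 0 yields compound symmetry, theta = 1 yields AR(1), and
    # intermediate values slow the decay of the correlation with lag.
    lag = np.abs(np.subtract.outer(np.arange(t), np.arange(t))).astype(float)
    r = rho ** (lag ** theta)
    # Guard the diagonal: with theta = 0, 0 ** 0 = 1 would put rho there.
    np.fill_diagonal(r, 1.0)
    return r

for theta, label in [(0.0, "1a"), (0.5, "1b"), (1.0, "1c")]:
    print(f"Table {label} (theta = {theta}):")
    print(dampened_corr(0.5, theta).round(2))
# Matches Table 1 except that 0.5**3 = 0.125 prints here as 0.12,
# where the table rounds up to 0.13.
```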
Estimating these additional parameters (the correlation and the shape of the correlation matrix) places an additional burden on the researcher. Just as one may have multiple estimates of effect, one also may have multiple estimates of these additional parameters, and one should check the extent to which the estimated sample sizes vary as the parameter estimates vary. Before considering the effects of these parameters on the sample size estimates, compare the estimates from the fully aligned analysis with the approximations based on the effects in the example data, which are summarized in Table 2.

TABLE 2—Sample Size Estimates per Group for Each of 4 Methods in the Example Comparison of 2 Groups at Type I Error = .05 and Type II Error = .20

                                                    Correlation (ρ)
Method                              Effect            0     .2     .8
Cross-sectional, single comparison  Group           108
Cross-sectional, 3 comparisons      Group     392/203/57
Cluster-based                       Group            35     47     91
GEE-based                           Group            38     53     98
GEE-based                           Time             59     47     12
GEE-based                           Group-by-time   327    262     67

Note. GEE = generalized estimating equation. ρ = level of first-order autocorrelation, assuming compound symmetry. Values for the 3-comparison method are for time points 1, 2, and 3.

Use of a method aligned with the planned analysis provides estimates not only for the comparison of groups but also for the 2 effects that involve time. When used to test the group-by-time interaction, Rochon's approach indicates that 262 study participants per group (524 total) are required if one assumes that the first-order correlation equals .20 and that the correlation matrix is compound symmetric. This is the largest of the required sample sizes for the 3 hypotheses to be tested (under those assumptions) and would be the final estimate for this example. If we had used the estimate based only on the averaged treatment group comparison (108 per group), the test of the interaction effect would have been greatly underpowered; if we had instead chosen the estimate of 392 based on the 3 per-time-point comparisons at α = .0167, the study would have had too many participants.

Focusing on the GEE-based estimates that are aligned with the analysis plan, the interaction test requires many more participants than either of the other 2 effects. Such a difference is quite common unless the interaction is very pronounced. Also notice the increase in sample size that accompanies the increase in the assumed level of correlation for the treatment effect and the reduction in sample size for the other 2 effects. The reason is that as the correlation from assessment to assessment rises, less information is available from each assessment for the treatment comparisons, but more information is available about the changes over time.

Also, the study in this example would be overpowered if we conservatively assumed that no correlation across time (ρ = 0) existed when in fact such a correlation did exist. A study with too many participants is not desirable, because it is unethical and a waste of limited resources to expose more participants to research than necessary.
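That tradeoff can be made concrete with 2 standard variance results for equicorrelated measures with unit variance: a subject's mean across m assessments, which feeds between-group comparisons, has variance (1 + (m − 1)ρ)/m, whereas a within-subject change score, which feeds time contrasts, has variance 2(1 − ρ). A small illustration (my own, not from the article):

```python
m = 3  # assessments per participant
for rho in (0.0, 0.2, 0.5, 0.8):
    var_mean = (1 + (m - 1) * rho) / m   # subject mean: between-group info
    var_change = 2 * (1 - rho)           # change score: within-subject info
    print(f"rho = {rho}: var(subject mean) = {var_mean:.2f}, "
          f"var(change score) = {var_change:.2f}")
```

As ρ rises, the first quantity grows and the second shrinks, matching the direction of the GEE-based estimates in Table 2.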
The relationship of correlational structure to the number of study participants can be seen in greater detail in Figure 2. Each of the 3 panels displays the required sample size, 1 effect per panel, as a function of the level of correlation under 3 correlational structures: compound symmetric, step 1 autoregressive, and a structure midway between the other 2 that uses a dampening parameter set to .50, which translates to a slowing of the decay in the correlation (Table 1b).

FIGURE 2—Number of study participants per group as a function of correlation for 3 correlation matrices: compound symmetric (CS), autoregressive step 1 (AR1), and a midpoint between them (θ = 0.5).

Note that the y-axis scales vary from panel to panel. As the population correlation coefficient increases, more study participants are needed to test the difference between conditions, whereas fewer are needed to test the effects that involve time. The assumed shape of the correlation matrix makes almost no difference in the case of treatment effects and makes only a small difference in the case of time-related effects. The differences can be meaningful, however, in cases where more study participants are needed, such as for the interaction effect: if ρ is equal to .50, 164 study participants per group are required under a compound-symmetric assumption, while 226 are necessary under an autoregressive structure. This approach can be applied to both continuous and categorical data, and it allows for more variations than are discussed in this article, including unequally spaced assessments, differential attrition among samples, and unequal numbers of subjects per group.43

Use simulations. One other option requires substantially more work but is quite accurate: run a series of computer-based simulations that draw samples of the proposed size from a population with known parameters. In this example, that would mean sampling from 1 theoretical population with X percent "abstinent" and from another with Y percent "abstinent" at each time point, with a given variance/covariance structure. For each sample, one would test the primary hypotheses, repeat this set of steps (sample from the known population and test the hypothesis), and count how often the resultant P value is less than .05. Do this repeatedly with different sample sizes until the sample size is large enough that the hypothesis under consideration, when false, is rejected at least 80% of the time.
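A sketch of that simulation loop in Python (entirely illustrative: the Gaussian-copula generator only approximates a target binary-scale correlation, and the subject-level t test is a crude stand-in for the planned GEE tests):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2004)

def simulate_group(n, probs, rho):
    # Gaussian copula: threshold an exchangeable multivariate normal at
    # each time point's normal quantile so the marginals match `probs`.
    t = len(probs)
    cov = np.full((t, t), rho)
    np.fill_diagonal(cov, 1.0)
    z = rng.multivariate_normal(np.zeros(t), cov, size=n)
    return (z < stats.norm.ppf(probs)).astype(int)  # 1 = smoker

def estimated_power(n_per_group, rho=0.2, reps=2000, alpha=0.05):
    # Fraction of simulated studies in which a two-sample t test on each
    # subject's mean over the 3 assessments rejects the null hypothesis.
    p1, p2 = [0.30, 0.40, 0.60], [0.20, 0.25, 0.30]
    hits = 0
    for _ in range(reps):
        g1 = simulate_group(n_per_group, p1, rho).mean(axis=1)
        g2 = simulate_group(n_per_group, p2, rho).mean(axis=1)
        if stats.ttest_ind(g1, g2).pvalue < alpha:
            hits += 1
    return hits / reps

# Walk the sample size up until the empirical power reaches 80%.
for n in (40, 50, 60, 70):
    print(n, estimated_power(n))
```

In practice one would replace the stand-in test with the planned analysis and rerun over a grid of sample sizes until the rejection rate first reaches the 80% target.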
Summary

In addition to the examples presented in this article, studies published by Cohen,27 Sedlmeier and Gigerenzer,5 Freiman et al.,6 Thornley and Adams,9 and Bezeau and Graves10 demonstrate that more careful attention to the sample sizes used in research is still needed. A poorly conducted sample size estimation can result in a study with very little chance of demonstrating any meaningful effect.

The 2 most important considerations in estimating the required number of participants are to align the sample size estimation with the planned data analysis and to verify the sensitivity of the resulting estimates. Although modern methods of data analysis are expanding at a rapid rate, methods of sample size estimation are not far behind, and user-friendly software for conducting sample size estimation is increasingly available. The impact of aligning sample size estimation methods with data-analytic methods is often overlooked; the closer the method of estimating the sample size is to the method of analysis, the better the chance that the power actually achieved will match the planned power.

Part of the cost of planning a more complex design and analysis derives from the additional information that must be acquired or approximated to accurately estimate how many participants will be required. The effort expended in gathering those pieces of information will necessarily be in proportion to the size of the study and the maturity of the research field in which the study is set. Once the methods are aligned, efforts should focus on estimating the required parameters, while recognizing that it is uncommon to be able to base sample size estimates on a single, well-established effect size. It is equally important to recognize that the effect size and some of the other parameters, such as attrition rates, are themselves estimates. The more the estimates of these parameters vary, the more the sample size estimates will vary. Whereas the scientifically conservative decision in the face of such variation would be to select the largest estimated sample size, that decision may be impractical and may be far in excess of the true requirement. Even well-established parameter estimates should be subjected to a sensitivity analysis to determine the extent to which the estimated sample size varies as the parameters vary.

Following these recommendations means more work for the investigators planning a study and for the reviewers of proposals and manuscripts, but it is work that pays off in the long run, both for the investigators themselves and for the scientific community as a whole.

About the Author

Requests for reprints should be sent to Kevin L. Delucchi, PhD, Department of Psychiatry, University of California, San Francisco, Box 0984-TRC, 401 Parnassus Ave, San Francisco, CA 94143-0984 (e-mail: [email protected]).
This article was accepted July 14, 2003.

Acknowledgments

This work was supported by National Institute on Drug Abuse grant P50DA09253.
Drs David Wasserman, Alan Bostrom, and Roger Vaughan and 3 anonymous reviewers provided many very helpful comments and suggestions.

Human Participant Protection

No protocol approval was needed for this study.

References

1. Cohen J. The statistical power of abnormal-social psychological research: a review. J Abnorm Soc Psychol. 1962;65:145–153.
2. Lenth RV. Some practical guidelines for effective sample size determination. Am Statistician. 2001;55:187–193.
3. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Statistician. 2001;55:19–24.
4. Halpern SD, Karlawish JHT, Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA. 2002;288:358–367.
5. Sedlmeier P, Gigerenzer G. Do studies of statistical power have an effect on the power of studies? Psychol Bull. 1989;105:309–316.
6. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized controlled trial. Survey of 71 "negative" trials. N Engl J Med. 1978;299:690–694.
7. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error, and sample size in the design and interpretation of the randomized controlled trial. In: Bailar JC III, Mosteller F, eds. Medical Uses of Statistics. 2nd ed. Boston, Mass: NEJM Books; 1992:357–373.
8. Sloan NL, Jordan E, Winikoff B. Effects of iron supplementation on maternal hematologic status in pregnancy. Am J Public Health. 2002;92:288–293.
9. Thornley B, Adams C. Content and quality of 2000 controlled trials in schizophrenia over 50 years. BMJ. 1998;317:1181–1184.
10. Bezeau S, Graves R. Statistical power and effect sizes of clinical neuropsychology research. J Clin Exp Neuropsychol. 2001;23:399–406.
11. Freedman KB, Bernstein J. Sample size and statistical power in clinical orthopaedic research. J Bone Joint Surg. 1999;81:1454–1460.
12. Dickinson K, Bunn F, Wentz R, Edwards P, Roberts I. Size and quality of randomized controlled trials in head injury: review of published studies. BMJ. 2000;320:1308–1311.
13. Beal SL. Sample size determination for confidence intervals on the population mean and on the difference between two population means. Biometrics. 1989;45:969–977.
14. Daly LE. Confidence intervals and sample sizes: don't throw out all your old sample size tables. BMJ. 1991;302:333–336.
15. Satten GA, Kupper LL. Sample size requirements for interval estimation of the odds ratio. Am J Epidemiol. 1990;131:177–184.
16. Volatier JL, Turrini A, Welten D; EFCOSUM Group. Some statistical aspects of food intake assessment. Eur J Clin Nutr. 2002;56(suppl 2):S46–S52.
17. Brogger J, Bakke P, Eide GE, Gulsvik A. Comparison of telephone and postal survey modes on respiratory symptoms and risk factors. Am J Epidemiol. 2002;155:572–576.
18. Bennett S, Lienhardt C, Bah-Sow O, et al. Investigation of environmental and host-related risk factors for tuberculosis in Africa, II: investigation of host genetic factors. Am J Epidemiol. 2002;155:1074–1079.
19. Panagiotakos DB, Chrysohoou C, Pitsavos C, et al. The association between secondhand smoke and the risk of developing acute coronary syndromes, among non-smokers, under the presence of several cardiovascular risk factors: the CARDIO2000 case–control study. BMC Public Health. 2002;2(1):9.
20. Sturmer T, Brenner H. Flexible matching strategies to increase power and efficiency to detect and estimate gene–environment interactions in case–control studies. Am J Epidemiol. 2002;155:593–602.
21. Yang Q, Khoury MJ, Friedman JM, Flanders WD. On the use of population attributable fraction to determine sample size for case–control studies of gene–environment interaction. Epidemiology. 2003;14:161–167.
22. Umbach DM. On the determination of sample size. Epidemiology. 2003;14:137–138.
23. Streiner DL. Sample size and power in psychiatric research. Can J Psychiatry. 1990;35:616–620.
24. Clark V. Sample size determination. Plast Reconstr Surg. 1991;87:569–573.
25. Castelloe JM, O'Brien RG. Power and Sample Size Determination for Linear Models. Proceedings of the Twenty-Sixth Annual SAS Users Group International Conference, Long Beach, Calif, April 22–25, 2001. Cary, NC: SAS Institute Inc; 2001.
26. Maxwell SE. Sample size and multiple regression analysis. Psychol Methods. 2000;5:434–458.
27. Cohen J. Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum; 1988.
28. Kraemer HC. To increase power in randomized clinical trials without increasing sample size. Psychopharmacol Bull. 1991;27:217–224.
29. McAweeney MJ, Klockars AJ. Maximizing power in skewed distributions: analysis and assignment. Psychol Methods. 1998;3:117–122.
30. McClelland GH. Optimal design in psychological research. Psychol Methods. 1997;2:3–19.
31. Muller KE, LaVange LM, Landesman Ramey S, Ramey CT. Power calculations for general linear multivariate models including repeated measures applications. J Am Stat Assoc. 1992;87:1209–1226.
32. Hsieh FY, Bloch DA, Larsen MD. A simple method for sample size calculation for linear and logistic regression. Stat Med. 1998;17:1623–1634.
33. Hintze J. PASS 2000 [computer software]. Kaysville, Utah: Number Cruncher Statistical Software; 2000.
34. Muller KE, Barton CN. Approximate power for repeated measures ANOVA lacking sphericity. J Am Stat Assoc. 1989;84:549–555.
35. Overall JE, Doyle SR. Estimating sample sizes for repeated measurement designs. Control Clin Trials. 1994;15:100–123.
36. Overall JE, Atlas RS. Power of univariate and multivariate analyses of repeated measurements in controlled clinical trials. J Clin Psychol. 1999;55:465–485.
37. Rochon J. Sample size calculations for two-group repeated-measures experiments. Biometrics. 1991;47:1383–1398.
38. O'Brien RG. A Tour of UnifyPow, a SAS Module/Macro for Sample-Size Analysis. Proceedings of the Twenty-Third Annual SAS Users Group International Conference, Nashville, Tenn, March 22–25, 1998. Cary, NC: SAS Institute Inc; 1998.
39. Elashoff JD. nQuery Advisor [computer software]. Version 4.0. Saugus, Mass: Statistical Solutions; 2000.
40. Ahn C, Overall JE, Tonidandel S. Sample size and power calculations in repeated measurement analysis. Comput Methods Programs Biomed. 2001;64:121–124.
41. EgretSIZ [computer program]. Cambridge, Mass: Cytel Software Inc; 1994.
42. Hedeker D, Gibbons RD, Waternaux C. Sample size estimation for longitudinal designs with attrition: comparing time-related contrasts between two groups. J Educ Behav Stat. 1999;24:70–93.
43. Rochon J. Application of GEE procedures for sample size calculations in repeated measures experiments. Stat Med. 1998;17:1643–1658.
44. Delucchi KL. The use and misuse of chi-square: Lewis and Burke revisited. Psychol Bull. 1983;94:166–176.
45. Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst. 1959;22:719–748.
46. Zhang J, Boos DD. Mantel-Haenszel test statistics for correlated binary data. Biometrics. 1997;53:1185–1198.
47. Wittes J, Wallenstein S. The power of the Mantel-Haenszel test. J Am Stat Assoc. 1987;82:1104–1109.
48. Donner A. Sample size requirements for stratified cluster randomized designs. Stat Med. 1992;11:743–750.
49. Donner A, Klar N. Design and Analysis of Cluster Randomization Trials in Health Research. London, England: Arnold; 2000.
50. Lee EW, Durbin N. Estimation and sample size considerations for clustered binary responses. Stat Med. 1994;13:1241–1252.
51. Jung S-H, Kang S-H, Ahn C. Sample size calculations for clustered binary data. Stat Med. 2001;20:1971–1982.
52. Pan W. Sample size and power calculations with correlated binary data. Control Clin Trials. 2001;22:211–227.
53. Liu G, Liang K-Y. Sample size calculations for studies with correlated observations. Biometrics. 1997;53:937–947.
54. Lindsey JK, Lambert P. On the appropriateness of marginal models for repeated measurements in clinical trials. Stat Med. 1998;17:447–469.
55. Muñoz A, Carey V, Schouten JP, Segal M, Rosner B. A parametric family of correlation structures for the analysis of longitudinal data. Biometrics. 1992;48:733–742.