Sample Size and Power Calculations With Correlated Binary Data
Transcription
Sample Size and Power Calculations With Correlated Binary Data
Sample Size and Power Calculations With Correlated Binary Data Wei Pan, PhD, Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN Address reprint request to: Dr. Wei Pan, Division of Biostatistics, A460 Mayo Building, MMC 303, 420 Delaware Street SE, Minneapolis, MN 55455-0378, (Email: [email protected]). 0 ABSTRACT: Correlated binary data are common in biomedical studies. Such data can be analyzed using Liang and Zeger's generalized estimating equations (GEE) approach. An attractive point of the GEE approach is that one can use a misspecied working correlation matrix, such as the working independence model (i.e. the identity matrix), and draw (asymptotically) valid statistical inference by using the so-called robust or sandwich variance estimator. In this article we derive some explicit formulas for sample size and power calculations under various common situations. The given formulas are based on using the robust variance estimator in GEE. We believe that these formulas will facilitate the practice in planning two-arm clinical trials with correlated binary outcome data. KEY WORDS: Clinical trials, GEE, logistic regression, sandwich estimator, z -test. 1 INTRODUCTION It is important and even necessary to do sample size and power calculations in planning a clinical trial 1]. This is relatively well studied and there is a large literature giving many specialized formulas to use. However, most formulas available are only for independent or continuous response data 1, 2]. In practice, correlated binary data also arise frequently. As a general and exible method for correlated discrete data, the generalized estimating equations (GEE) approach has become increasingly important and widely used in analyzing such data 3, 4]. An attractive property of GEE is that one can use a misspecied working correlation matrix and yet the regression coecient estimator is still consistent and asymptotically normal. The covariance matrix of the estimated regression coecients can be also consistently estimated using the so-called robust or sandwich estimator. A convenient choice of the working correlation structure is the working independence model: the correlated data can be treated as though they were independent, and the resulting regression parameter estimates along with the robust covariance estimator can be used to draw proper statistical inference. Recently Liu and Liang discussed a general method for sample size/power calculations in GEE 5]. Their method is based on the (generalized) score test in GEE and the resulting (asymptotically) non-central chi-square distribution of the test statistic under the alternative hypothesis. In general, there is no closed form available except for some special cases. As examples, they gave two explicit formulas for analyzing correlated binary data, where it is assumed that the working correlation structure is correctly specied as a compound symmetric matrix. They did not give formulas for other common special cases. For instance, it would be helpful to have a formula to use when the working independence model is used in GEE. It appears complicated and nontrivial to derive explicit formulas for some important and common situations, at least for many practitioners, though in principle it could be done following Liu and Liang's outline. Shih discussed an alternative approach to sample size/power calculations in GEE, which is based on the z -test using the robust variance estimator of the estimate of the parameter to be tested 6]. After discussing the general method, Shih only gave one specic example for binary data. In this article, we follow Shih's approach and derive explicit formulas for several important special cases. In addition to several situations where a working correlation is specied as the true correlation structure, we also discuss other situations when a misspecied working correlation matrix is used. Note that one needs to 2 specify the underlying correlation structure in sample size/power calculations and thus can use it as the working correlation structure. However, we argue that using other working correlation structures are sometimes useful or necessary. If we admit that we may not know the underlying true correlation structure at the analysis stage, it is doubtful that we can know it at the earlier design stage. Hence, at least, sensitivity analyses can be conducted using various combinations of the working/true correlation structures in the design stage. More comments on this issue will be given in a later section. We hope that the reader can derive relevant formulas for other settings by following the method outlined here. We also note that the issue of sample size and power calculations with binary data has been actively discussed in the literature. In addition to those with the GEE approach 5, 6], there are some other proposals with other estimation (non-GEE) methods, such as the weighted least squares 7-9]. Here we only consider the GEE approach due to its increasing use in practice. In this paper we loosely use \subjects" to refer to the subsampling or correlated sampling units, which consist of a higher level unit called \cluster." In familial studies, the cluster is each family while the subjects are family members. In longitudinal studies, the cluster consists of repeated measures from the same study participant and each measurement is treated as a \subject." The general set-up here is that we are comparing two treatments, a control and an active treatment. We consider two important scenarios. One is that the treatment is given at the cluster level: all the subjects from the same cluster receive the same treatment. There are many such examples. For instance, in a longitudinal study, each patient as a cluster is given only one treatment, and repeated measures are made on the same patient across a period of time. The second scenario is that each cluster may receive more than one treatment. This is also common in practice. For example, in two-period crossover trials, each patient receives each treatment in turn. Another example is that in ophthalmologic studies, one eye of a patient often serves as the control for teh patient's other eye. This article is organized as follows. In the next section we rst review GEE in the current context, and relate the issue of sample size/power calculations in GEE with that in the well-known two-sample z -test. The following section presents explicit sample size/power formulas for situations where the treatment is assigned at the cluster level. We later discuss several special cases where both treatments may be given to the same cluster. We end with a short discussion. 3 BACKGROUND AND THEORY The Model and GEE Denote the j th binary response (outcome) variable from cluster i as yij , and the corresponding 2 1 covariates as xij for j = 1 ::: ni and i = 1 ::: N . By convention we use the coding yij = 1 or 0. The rst element of xij is always 1 (for the intercept term) while the second is the indicator of the treatment group, which is 0 for control and 1 for treatment. Let yi = (yi1 ::: yin ) and Xi = (xi1 ::: xin ) respectively be the response vector and the design matrix for the ith cluster. In general, the components of yi are correlated while yi and yk are independent if i 6= k. In other words, there is a within-cluster correlation, but observations from dierent clusters are independent. Any statistical analysis has to take the within-cluster correlation into account properly. Dene the marginal mean and variance of yij as ij = E (yij jxij ) and V ar(yij ) = ij (1 ; ij ). We use the marginal logistic regression model: 0 i 0 i ! logit(ij ) = log 1 ;ij = xij ij 0 where = (1 2 ) is the unknown regression coecient vector. Note that 2 measures the treatment eect compared with the control 2 = 0 means that there is no dierence between the treatment and control. The GEE estimator ^ of can be obtained through solving the following so-called GEE: N X DiVi 1(yi ; i) = 0 0 0 ; i=1 where i = (i1 ::: in ) , Di = @i =@ and Vi is a working covariance matrix for yi . Let Ai be an ni ni diagonal matrix with the j th diagonal element as V ar(yij ). Then the working covariance matrix can be expressed as Vi = A1i =2 Rwi( )A1i =2 0 i where Rwi( ) is a working correlation matrix with possibly unknown parameters , which can be estimated using method of moments 3] or another set of estimating equations 10]. As usual we assume that the structure of Rwi and the true correlation structure R0i do not change across the clusters, hence we can denote them as Rw and R0 respectively (but keep in mind that their dimension is ni ni, which may change with the cluster i). It is noted that the true covariance of y, Cov(yi ) = A1i =2 R0A1i =2 , is in general unequal to the working covariance Vi = A1i =2 Rw A1i =2 unless R0 = Rw . 4 An attractive property of GEE is that the GEE estimator ^ is still consistent and asymptotically normal even if an incorrect correlation matrix Rw is used. Furthermore, the covariance matrix of ^ can be consistently estimated by the robust or sandwich estimator: R = (1 1 0 1 1 ) ; where 1 = N X i=1 Di Vi 1 Di and 0 = 0 ; (1) ; N X i=1 Di Vi 1 A1i =2 R0 A1i =2 Vi 1Di : 0 ; ; (2) 1 1 is the usual model-based covariance estimator: its (asymptotic) validity depends on the correctness of Rw = R0 , in which case the robust covariance is the same as the model-based one. For correlated discrete data, in general, it is dicult to know the structure of R0 in advance. There are several commonly used (working) correlation structures: ; The identity matrix: Rw = I . This is also called the working independence model. It is the most convenient and probably most common choice. It has been found that in many situations the resulting ^ is also highly ecient 3, 11]. In general, if there is no strong prior knowledge about R0 , we feel that Rw = I could be used at least, computationally, it is the simplest and most stable (in terms of the convergence of the iterative algorithm in solving the GEE). Compound symmetry (CS): 0 1 1 ::: BB C BB 1 ::: CCC B C Rw = B BB 1 ::: CCC : BB CC B@ ::: CA ::: 1 5 The rst order auto-regressive (AR(1)): 0 BB 1 BB 1 BB 2 Rw = B BB BB ::: @ n 1 n i; i ;2 2 1 n i ;3 ni ::: ::: n ::: n ::: 1 ; i ;2 i ;3 1 1 CC CC CC CC : CC CA Note that for simplicity we have suppressed the obvious dependence of Rw on ni . Throughout the paper, we use to denote the parameter in the CS or AR(1) correlation structure. Sample size and Power Calculations The goal is to test whether the treatment has a more favorable eect than the control does. The hypotheses of interest can be formulated as: H0 : 2 = 0 versus HA : 2 = b > 0: Note that since there are only two treatment groups, we can use p0 or p1 to denote the (marginal) mean of the binary response in the control or treatment group respectively: p0 = E yij jxij = (1 0)) and p1 = E yij jxij = (1 1)]. It is well-known that 2 is the log-odds ratio: 2 = log 1 p1p1 = 1 p0p0 . p Denote the robust variance estimate of N (^2 ; 2 ) as vR , which is the (2,2)th element of p N R . Since N (^2 ; 2 ) has an approximately normal distribution N (0 vR ), a z -test can be used to test the null hypothesis. A two-sided z -test with type I error and power 1 ; requires the sample size ; vR (Z=2 ; Z1 )2 b2 where Z = 1 (1 ; ) is the 100(1 ; )th percentile of the standard normal distribution. With a given sample size N , the power is N= ; ; (3) ; s ! 1 ; = 1 ; Z=2 ; b vN : R (4) The corresponding formulas for a one-sided test are straightforward. As to be shown, in the current setting, N corresponds to the number of clusters. The robust variance vR depends on the cluster size ni . Either ni or N can be regarded as the \sample size" since increasing either one leads 6 to a larger data set. Apparently there is a trade-o between increasing ni or N . The decision may depend on their respective cost. In this paper ni is treated as known and the goal is to determine N for a specied power. This strategy is general since for a given series of possible ni 's, one can determine the corresponding N 's and then pick up the most desirable combination of (ni N ). It is obvious that the key to sample size/power calculations is to obtain vR , for which, in general, there is no closed form available. However, in several important special cases, we show next that it is possible (but tedious) to derive the closed forms of vR , and hence those for sample size/power calculations. CONSTANT WITHIN-CLUSTER TREATMENT We rst discuss the situation when the treatment is given at the cluster level. Hence the covariate matrix is Xi = (1n 1n ) if the ith cluster is in the treatment group, and Xi = (1n 0n ) otherwise. Here we use 1n to denote an ni 1 vector of 1's, and similarly for 0n . We assume that the proportion of clusters assigned to the control group is and that to the treatment is 1 ; . In the following we rst discuss a general situation, then derive formulas for various special cases. For convenience the results of this section is summarized in Table 1. Table 1 about here i i i i i i A General Formula It can be veried that Di = p0 (1 ; p0 )1n 0n ] if the ith cluster is in the control group, and Di = p1 (1 ; p1 )1n 1n ] otherwise. Following Shih 6], we have i i i i Di Vi 1 Di = 1n Rw 1 1n M 0 0 ; ; i i where 1 0 p (1 ; p0 ) + (1 ; )p1 (1 ; p1 ) (1 ; )p1 (1 ; p1) C M =B A @ 0 (1 ; )p1 (1 ; p1 ) (1 ; )p1 (1 ; p1 ) 0 1 1 ;1 C M 1 = p (11; p ) B @ A: 0 0 ;1 1 + (1 p0)(1p1 (1p0 )p1 ) ; ; ; ; Hence, the model-based covariance estimate is 1 = ( 1 ; N X i=1 1n Rw 11n M ) 1 : 0 ; i ; i 7 It can be also veried that 0 = N X i=1 Di Vi 1 Cov(yi )Vi 1 Di = ; 0 ; N X i=1 1n Rw 1R0Rw 11n M 0 ; ; i i and thus the robust covariance estimate is PN 1 R 1 R R 1 1 R = (1 0 1 ) = i=1 P n w 0 w n M 1: ( Ni=1 1n Rw 1 1n )2 p The robust variance estimator of N (^2 ; 2 ) is PN 1 R 1R R 11 1 1 n w 0 w n + : vR = N i=1 P ( Ni=1 1n Rw 1 1n )2 p0 (1 ; p0 ) (1 ; )p1 (1 ; p1 ) 0 ; 1 1 ; ; i i ; 0 i i 0 ; ; i i 0 ; i ; ; (5) i Once vR is available, we can plug it into formula (3) or (4) to obtain an explicit formula for sample size or power calculations. We denote in this section PN 1 R 1 R R 1 1 c := N i=1 P n w 0 w n ( N 1 R 1 1 )2 0 ; ; i i i=1 ni w ni 0 ; (6) and now the key to sample size/power calculations is to obtain c. We will give explicit formulas for c for several important special cases. Special Case I: Rw = R0 = CS The rst special case is that Rw is specied correctly as CS. Then 0 = 1 , and we have c = PN N : n i=1 1+(nii 1) (7) ; When ni = n for all i, the formula reduces to the one given in Shih 6]: c = 1 + (nn; 1) = n1 1 + (n ; 1) ] : (8) Replacing the corresponding vR in formula (3) and comparing it with that for independent data 1], we will nd that the required sample size is inated by a factor 1+(n ; 1) due to the within-cluster correlation. A similar but dierent formula is also given in Liu and Liang 5]. The discrepancy arises since the derivations of Liu and Liang are based on the (generalized) score test whereas ours are based on the z -test, a special case of the Wald test. It is well known that the score test and Wald test are asymptotically equivalent. In some situations the score test may be preferred 12]. However, the 8 Wald test is more convenient to use. Currently, for the GEE methodology, only the Wald test is implemented in the two popular statistical packages, SAS and Splus. Hence, in practice, the Wald test is more commonly used than is the score test, and our proposal based on the Wald test will then be useful. As to be shown numerically, the results from these two approaches are often close. Special Case II: Rw = I and R0 = CS Now suppose that the correct correlation matrix is still CS but the working independence model is used. Then PN 1 R 1 N PN n 1 + (n ; 1)] i n 0 n = i=1 Pi c = N Pi=1 : (9) ( Ni=1 1n 1n )2 ( Ni=1 ni )2 In general, formula (9) is not equal to (7). However, when ni = n for all i, formula (9) reduces to (8) in other words, using the working independence model results in no loss of eciency. This is somewhat surprising. However, for bivariate data (i.e. n = 2), it is not surprising since McDonald 11] has proved that the GEE estimator ^ using the working independence model is the same as that using Rw = R0 , implying that their variances must be the same. Hence using the working independence model in this case results in no loss of information. Special Case III: Rw = R0 = AR(1) Without loss of generality, we always assume ni 2 when Rw or R0 is AR(1) in this paper. We have c = PN N 1 = PN N (1 + ) : (10) i=1 1n R0 1n i=1 ni ; (ni ; 2)] When ni = n for all i, (11) c = h 1 + 2 i: n 1 ; (1 ; n ) 0 i i 0 i i 0 ; i i In general, formula (11) is not equal to (8). However, when n = 2, formulas (11) and (8) are the same, because for bivariate data the AR(1) reduces to CS. Special Case IV: Rw = I and R0 = AR(1) PN n + 2(n ; 1) + 2(n ; 2)2 + ::: + 2n 1 ] i i c = N i=1 i : (12) P N ( i=1 ni)2 When ni = n for all i, it is 2 2 1 n 1 1 (13) c = 1 + 2(1 ; n ) + 2(1 ;n n ) + ::: + 2 n i; ; 9 which in general is not equal to formula (11). Furthermore, when < 1, formula (13) can be rewritten as the following: c = n1 1 + 1 2; 1 ; n1(1;;) : n (14) In the Appendix, we show that formula (8) > (14) > (11) (15) when 0 < < 1 and n > 2. An intuitive interpretation is as follows. When the true correlation structure is AR(1), for two observations yij and yik in the same cluster, as their \distance" jj ; kj increases, their correlation j k decreases. On the other hand, if the true correlation structure is CS, the correlation between yij and yik is always , not depending on jj ; kj. Hence, the average within-cluster correlation is weaker for AR(1) than that for CS. In other words, for two equally-sized clusters with AR(1) and CS correlation structures satisfying Corr(yij yij +1 ) = , the eective sample size in the cluster for AR(1) is larger than that for CS. This explains the rst inequality. For the second inequality, it is expected that using the working correlation matrix as the true underlying correlation structure will increase the eciency over that using an incorrect working correlation structure. j ; j Numerical Examples For illustration we rst consider a hypothetical example. Suppose the probabilities of having the desired outcome (i.e., yij = 1) for the control and the treatment are p0 = 0:6 and p1 = 0:8 respectively. The correlation parameter is = 0:4, hence for any j 6= k, Corr(yij yik ) = if the true correlation structure R0 is CS, and Corr(yij yik ) = j k if R0 is AR(1). A cluster (of subjects) is assigned to the control/treatment group with an equal probability: = 0:5. There are n = 4 subjects in each cluster. The c in formulas (8), (11) and (14) are respectively 0.5500, 0.4375 and 0.4480. Their order is consistent with inequality (15). Then we can calculate j ; j vR = c p (11; p ) + (1 ; )p1 (1 ; p ) 0 0 1 1 as 11.458, 9.115 and 9.333 respectively. Since b = log 1 p1p1 = 1 p0p0 = 0:981, based on sample size N = 40 we can calculate the powers using formula (3) as 44.9%, 53.8% and 52.8%, which are respectively for Case I or II (Rw = R0 = CS or Rw = I , R0 = CS ), Case III (Rw = R0 = AR(1)) ; ; 10 and Case IV (Rw = I , R0 = AR(1)). Similarly, for the specied power 80%, the required sample sizes (i.e., numbers of clusters) N for three cases are 94, 75 and 77 respectively. Using Liu and Liang's formula, for Rw = R0 = CS , to retain the 80% power the required sample size is N = 87, and the power is 47.9% for N = 40. More examples with various values of p0 , p1 and are listed in Tables 2 and 3. The set-ups in the two tables are similar except that in Table 2 the risk dierence p1 ; p0 remains as 0.2, whereas in Table 3 the odds ratio exp(b) = p1 (1 ; p0 )=p0 (1 ; p1 ) is always 2.5. It can be veried that the inequality (15) again holds. It is also noted that the dierence between the sample sizes for cases III and IV is very small. Unsurprisingly, as the correlation increases, the required sample size also increases. It appears that for Case I, using Liu and Liang's formula requires smaller sample sizes than does using our formula (see the columns L&L and I,II in Tables 2 and 3), though often the dierence is not large. As discussed earlier, Liu and Liang's formula is based on the score test whereas ours on the Wald test. Although the score test and the Wald test are asymptotically equivalent, it is possible that for nite samples the former is more ecient. The choice of the two tests in GEE needs to be further studied in the future. In the current context, which formula to use will depend on which test is planned to be used in the later analysis. For instance, if the score test is to be used, then it is natural to use Liu and Liang's formula. However, since the score test for GEE has not been implemented in many statistical packages, our formulas based on the Wald test is attractive. In addition, Liu and Liang did not give explicit formulas for other cases we consider. Tables 2 and 3 about here Special Case V: Rw = R0 = I To facilitate comparisons for the next two special cases, we give the formula for independent data: PN 1 1 c = N PNi=1 n n 2 = PNN ( 1 1 ) n 0 i i i=1 ni ni 0 i=1 i (16) which reduces to the familiar 1=n when all ni = n. Special Case VI: Rw = CS and R0 = I Now we discuss the situation when we incorrectly specify the correlation coecient as for 11 independent data. PN PN 1 R 21 n i=1 1+(n n i =1 n w c = N PN n ( 1 R 11 )2 = N PN 0 i ; i=1 ni w ni 0 ]2 i ;1) i i ; i=1 1+(ni i ; 1) 2 : (17) Note that if all ni = n, formula (17) reduces to c = 1=n, which is equal to that of using Rw = R0 = I . For illustration, we include the mathematical derivation of formula (17) in the appendix. Special Case VII: Rw = AR(1) and R0 = I PN 1 R 2 1 PN n i =1 n w n i=1 i c = N PN = N P : 1 2 N ni ; (ni ; 2)] 2 ( i=1 1n Rw 1n ) i=1 0 ; 0 (18) i i ; i i Although we do not have a proof, based on our numerical experience we hypothesize that formula (16) (17) (18), which implies that using incorrect working correlation matrices may lead to larger variances vR and thus larger sample sizes needed for a specied power. When all ni = n, we can simplify formula (18) as c = 1 2 > n1 : n 1 ; 1 ; n2 (19) Hence in this situation using Rw = AR(1) results in a larger sample size needed than that for Rw = R0 = I . VARYING WITHIN-CLUSTER TREATMENT Now we allow for both treatments to be allocated to the subjects within a single cluster. Without loss of generality, assume that within any cluster i, the rst ni0 subjects are assigned to the control group while the remaining ni1 = ni ; ni0 are assigned to the treatment group. Note that for varying within-cluster treatment, there may be no closed form solutions in many cases and thus we will not discuss cases III-VII considered earlier. A General Formula First it is straightforward to establish 0 p0(1 ; p0 ) 1n 0 0n 0 Di = B @ p1(1 ; p1 ) 1n 1 1n 1 1i = Di Ai 1=2 Rw 1 Ai 1=2 Di 0 ; ; ; i i i i 1 ]C A ] 0i = Di Ai 1=2 Rw 1 R0 Rw 1 Ai 1=2 Di : 0 ; ; ; ; 12 The model-based covariance matrix is 1 1 = ( ; while the robust covariance estimate is N X R = ( i=1 N X i=1 N X 1i ) 1 ( ; 1i ) 1 i=1 ; 0i )( N X i=1 1i ) 1 ]: ; In general, neither 1 1 nor R can be written in a closed form. But, at least numerically, with a given set of parameters, it is feasible to compute either 1 1 or R . Special Case I: Rw = R0 = CS We rst consider the general situation with Rw = R0 = CS and ni > 2. Denote ; ; 0 1 a b 1i = B @ i i CA bi di then it can be veried that s " # s " # ; p1) + n p (1 ; p ) u + (n ; 1)w p0(1 ; p0) ai = ni0p0(1 ; p0) ui + (ni0 ; 1)wi pp1(1 i1 1 1 i i1 i p (1 ; p ) 0 (1 ; p0 ) 1 1 q bi = ni0ni1 wi p0(1 ; p0 )p1(1 ; p1 ) + ni1 p1 (1 ; p1)ui + (ni1 ; 1)wi ] di = ni1 p1 (1 ; p1 )ui + (ni1 ; 1)wi ] where ui = ; (n ; 1)(ni2;;2)(n +; 12) ; 1 i i wi = (n ; 1)2 ;(n ; 2) ; 1 : i i Due to the complexity of the form of 1i1 , we will not give it here. In principle, with given P parameters one can calculate 1 = Ni=1 1i and thus R or 1 1 , from which vR can be obtained. For the bivariate case that n0i = n1i = 1 for all i, a direct calculation of Di 1 Vi Di 1 shows that the covariance matrix ; ; 0; ; 0 R = 1 1 = N1 B @ ; q p0 (1 p0 ) + 1 ; ; 1 p0 (1;p0 ) 1 p0 (1;p0 )p1 (1;p1 ) Hence 1 1 ; ; s ; vR = p (1 ; p ) + p (1 ; p ) ; 2 p (1 ; p )1p (1 ; p ) : 0 0 1 1 0 0 1 1 1 1 q 1 p0 (1 p0 ) + q p0 (1 p0 )p1 (1 p1 ) C A 1 1 1 + ; 2 p0 (1 p0 ) p1 (1 p1 ) p0 (1 p0 )p1 (1 p1 ) ; ; ; ; ; (20) 13 A sample size/power formula results by substituting formula (20) into formula (3) or (4). A dierent formula derived from the score test is also given by Liu and Liang 5]. Special Case II: Rw = I and R0 = CS Straightforward but tedious algebraic work can show that 0 1 n p (1 ; p0 ) + ni1 p1(1 ; p1 ) ni1 p1(1 ; p1) C 1i = B @ i0 0 A ni1 p1 (1 ; p1 ) ni1 p1(1 ; p1) 0 1 1 1 ; CA : n 0 p0 (1 p0 ) 1i1 = B @ n 0p0(1 1 p0 ) 1 1 ; n 0p0(1 p0 ) n 0 p0(1 p0 ) + n 1 p1(1 p1 ) Further, if we denote 0 1 e f 0i = B @ i i CA ; ; i ; i ; i ; i i ; fi gi then s " # s " # ; p1) +n p (1;p ) 1 + (n ; 1) + n p0(1 ; p0) ei = ni0 p0 (1;p0 ) 1 + (ni0 ; 1) + ni1 pp1 (1 i1 1 1 i1 i0 p (1 ; p ) 0 (1 ; p0 ) 1 1 q fi = ni0ni1 p0 (1 ; p0 )p1(1 ; p1 ) + ni1 1 + (ni1 ; 1)]p1 (1 ; p1 ) gi = ni1 1 + (ni1 ; 1)]p1 (1 ; p1): R could be calculated, but its closed form is too complex. Now we pursue it by further assuming that ni0 = m0 and ni1 = m1 for all i. Then 0 1 u w C R = N1 B @ A w vR where u = 1m+p(m(10 ;; p1)) 0 0 w = ; 1m+p(m(10 ;; p1)) + p 0 0 and 0 0 p0 (1 ; p0)p1 (1 ; p1) vR = 1m+p(m(10 ;; p1)) + 1m+p(m(11 ;; p1)) ; p 0 0 0 1 1 1 2 : p0(1 ; p0 )p1 (1 ; p1) (21) 14 In the bivariate case with m0 = m1 = 1, it is simple to verify that formula (21) reduces to formula (20). In other words, in this situation, there is no information loss by using a working independence model. Again this is not surprising in light of McDonald's result 11]: the GEE estimate ^ and thus its variance obtained by using Rw = I are equal to those using Rw = R0 when m0 = m1 = 1. Examples Diggle et al. give the data of a crossover trial on cerebrovascular deciency4]. Two treatments, an active drug and a placebo were compared. Thirty-four patients received the active drug rst followed by placebo. The other 33 patients received placebo rst, followed by the active drug. The binary endpoint is a normal/abnormal electrocardiogram reading after receiving each treatment for each patient. It was found that there was no signicant period or carryover eect. Hence, a marginal logistic model with an intercept term and a covariate indicating the treatment group can be tted. The GEE estimates of the regression coecients (using Rw = CS ) are ^ = (0:52 0:56) with an estimated within-cluster correlation ^ = 0:614. The active drug appears to be favorable the log odds-ratio of having a normal electrocardiogram reading is 0.56 when comparing the active drug with placebo. Now assume that the true values of the parameters are the same as the above estimates, and we want to estimate the required sample size to have a certain power to reject the null hypothesis that there is no treatment eect of the active drug. Note that since all ni = 2 in any two-period crossover trial, it is reasonable to assume the true correlation structure as CS. Using formula (20) or (21) we obtain vR = 3:723. To maintain the 80% and 90% power, the required sample sizes are 94 and 125 respectively. If we use Liu and Liang's formula, the required sample sizes are 91 and 122 respectively. As a comparison, we next consider a hypothetical example similar to the one in the section on Numerical Examples except that the cluster size is n = 2. Recall other parameters are p0 = 0:6, p1 = 0:8 and = 0:4. In each cluster one subject is assigned to the control and the other to the treatment. Using formula (20) or (21) we get vR = 6:334. Hence, using GEE with R0 = CS and either Rw = CS or Rw = I , for the given N = 40, the corresponding power is 69.3%. On the other hand, to obtain the power 80%, the required sample size is 52. Using Liu and Liang's formula yields slightly dierent results: the power is 72.7% for N = 40, and N = 48 is needed to obtain the 80% power. 0 15 Comparing the results here with those in the section on Numerical Examples, we see an apparent increase of the power here even though the cluster size is reduced from 4 to 2. This is not surprising: with a positive within-cluster correlation, the within-cluster comparison of the treatments can increase the eciency of detecting the treatment dierence. It is in fact one of the main advantages of crossover trials. DISCUSSION We have discussed the issue of sample size and power calculations in GEE with correlated binary data. We have presented relevant formulas for several important special cases. Many of our given formulas are explicit while the others are straightforward (but maybe tedious) to calculate. Note that all the formulas derived are only applicable to two-armed trials. In particular, we did not include any covariates (other than the treatment indicator) in the marginal logistic regression model. Extensions to the above situations may be complicated. Although this article focuses on correlated binary data, we believe that by following the same line one can derive relevant formulas for count or continuous data. We have emphasized the use of the robust variance estimator of the estimated regression coefcient due to its widespread use in practice. This is in fact an attractive property of the GEE: the (asymptotically) valid statistical inference only depends on whether the marginal mean structure (here the logistic regression model) is correctly specied, but not on the correct specication of the second moments of the response variable, including both the marginal variance and the withincluster correlation structure. For discrete data, it is more often that one does not have a high condence in the underlying true correlation structure. The GEE approach side-steps this diculty by using some working correlation structure. In practice, there are other reasons for using the working independence model. First, when the sample size (or cluster size) is large, as it is in many community-based trials, using non-diagonal working correlation matrices may lead to slow- or even non-convergence of the GEE program. Second, in some situations, such as in partly conditional modeling with longitudinal data, it may even be necessary to use the working independence model 13, 14]. Third, there may be some theoretical diculties with using a more complex working correlation structure 15]. This could happen when the true correlation structure is mis-specied as a non-simple one and we take this misspecied true correlation structure as the working structure. 16 On the other hand, we also note the limitation of the GEE approach. It is well-known that with nite samples the performance of the robust variance estimator may be less satisfactory 16, 17]. For small samples we recommend the use of simulations to conrm the results obtained from the given sample size/power formulas. ACKNOWLEDGMENTS The author is grateful to Bruce Lindgren for reading an earlier draft and giving helpful suggestions, to John Connett and Chap Le for discussions on the subject. The author thanks two referees, an associate editor and the editor for many detailed and helpful comments. APPENDIX 1. Proof of formula (15) If 0 < < 1, then > k for any k > 1. Hence from (13) we have 1 + 2(1 ; n1 ) + 2(1 ; n2 )2 + ::: + 2 n1 n (14) = n 1 + 2(1 ; n1 ) + 2(1 ; n2 ) + ::: + 2 n1 1 ; < n 2 1 + n 1 + 2 + ::: + (n ; 1)] = (8): n On the other hand, since 1 ; n < 1 ; and 2=n > 0, 1 1 " 2(1 ; 1 ) # 1 2 n = (11): (14) > n 1 + 1 ; 1 ; n 1 > n 1 + 1 ; + n2 = 2. Proof of formula (17) Since Rw = CS , we have 0 1 u w w ::: w i i i i BB C BB wi ui wi ::: wi CCC B C Rw 1 = B BB wi wi ui ::: wi CCC BB CC B@ ::: CA ; wi wi wi ::: ui where ui = ; (n ; 1)(ni2;;2)(n +; 12) ; 1 i i wi = (n ; 1)2 ;(n ; 2) ; 1 : i i 17 0 1 a b b ::: b i C BB i i i BB bi ai bi ::: bi CCC B C Rw 2 = B BB bi bi ai ::: bi CCC BB CC B@ ::: CA Then ; bi bi bi ::: ai where ai = u2i + (ni ; 1)wi2 bi = 2ui wi + (ni ; 2)wi2 : It is easy to verify that 1n Rw 11n = niui + (ni ; 1)wi ] = (n ; n1)i + 1 i 0 ; i i 1n Rw 21n = niui + (ni ; 1)wi ]2 0 ; i i and formula (17) is readily available. REFERENCES 1. Friedman LM, Furberg CD, DeMets DL. Fundamentals of Clinical Trials. 3nd ed. New York: Springer 1998. 2. Shuster JJ. Practical Handbook of Sample Size Guidelines for Clinical Trials. Boca Raton, FL: CRC Press 1993. 3. Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 198673:13-22. 4. Diggle PJ, Liang K-Y, Zeger SL. Analysis of Longitudinal Data. Oxford: Oxford 1994. 5. Liu G, Liang K-Y. Sample size calculations for studies with correlated observations. Biometrics 199753:937-947. 6. Shih WJ. Sample size and power calculations for periodontal and other studies with clustered samples using the method of generalized estimating equations. Biometrical Journal 199739:899-908. 7. Lipsitz SR, Fitzmaurice GM. Sample size for repeated measures studies with binary responses. Statistics in Medicine 199413:1233-1239. 18 8. Lee EW, Dubin N. Estimation and sample size considerations for clustered binary responses. Statistics in Medicine 199413:1241-1252. 9. Sargent DJ, Sloan JA, Cha SS. Sample size and design considerations for phase II clinical trials with correlated observations. Controlled Clinical Trials 199919:242-252. 10. Prentice RL. Correlated binary regression with covariates specic to each binary observation. Biometrics 198844:1033-1048. 11. McDonald BW. Estimating logistic regression parameters for bivariate binary data. Journal of the Royal Statistical Societyi, Ser. B 199355:391-397. 12. Breslow N. Tests of hypotheses in overdispersed Poisson regression and other quasi-likelihood models. Journal of the American Statistical Association 199085:565-571. 13. Pepe MS, Anderson GL. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communcations in Statististics, Ser. B 199423:939-951. 14. Pepe MS, Whitaker RC, Seidel K. Estimating and comparing univariate associations with application to the prediction of adult obesity. Statistics in Medicine 199918:163-173. 15. Crowder M. On the use of a working correlation matrix in using generalized linear models for repeated measures. Biometrika 199582:407-410. 16. Drum M, McCullagh P. Comment. Statistical Science 19938:300-301. 17. Emrich LJ, Piedmonte MR. On some small sample properties of generalized estimating equation estimates for multivariate dichotomous outcomes. Journal of Statistical Computation and Simulation 199241:19-29. 19 Table 1: Formulas for calculating c in various special cases. Once c is obtained, then one can use formulas (5) and (6) to calculate vR , based on which and using formulas (3) and (4) one can calculate the required sample size N or power 1 ; . Cases Unequal ni Equal ni n I. Rw = R0 = CS (7) (8) (9) (8) II. Rw = I , R0 = CS III. Rw = R0 = AR(1) (10) (11) IV. Rw = I , R0 = AR(1) (12) (13) V. Rw = R0 = I (16) 1=n VI. Rw = CS , R0 = I (17) 1=n (18) (19) VII. Rw = AR(1), R0 = I 20 Table 2: Sample sizes N required to attain 80% and 90% power for Cases I-IV. For comparison, results of Liu and Liang (L&L) for Case I are also included. In all set-ups, p1 ; p0 = 0:2, = 0:5 and all ni = n = 4. Set-up = :2, p0 = :1, p1 = :3 p0 = :2, p1 = :4 p0 = :3, p1 = :5 p0 = :4, p1 = :6 = :4, p0 = :1, p1 = :3 p0 = :2, p1 = :4 p0 = :3, p1 = :5 p0 = :4, p1 = :6 = :6, p0 = :1, p1 = :3 p0 = :2, p1 = :4 p0 = :3, p1 = :5 p0 = :4, p1 = :6 Power=0.80 L&L I,II III 48 55 46 63 68 57 73 77 64 76 80 67 65 76 60 87 94 75 100 106 84 104 110 88 83 96 79 110 119 98 127 135 110 132 140 114 IV 46 58 65 67 62 77 86 90 81 101 114 118 Power=0.90 L&L I,II III 64 74 62 85 92 76 97 103 86 101 107 89 87 101 81 116 126 100 133 142 113 139 147 117 111 129 105 148 160 131 170 180 147 177 187 153 IV 62 77 87 90 83 102 115 120 109 135 152 158 21 Table 3: Sample sizes N required to attain 80% and 90% power for Cases I-IV. For comparison, results of Liu and Liang (L&L) for Case I are also included. In all set-ups, the log odds ratio is b = log(2:5), = 0:5 and all ni = n = 4. Set-up = :2, p0 = :1, p1 = :217 p0 = :2, p1 = :385 p0 = :3, p1 = :517 p0 = :4, p1 = :625 = :4, p0 = :1, p1 = :217 p0 = :2, p1 = :385 p0 = :3, p1 = :517 p0 = :4, p1 = :625 = :6, p0 = :1, p1 = :217 p0 = :2, p1 = :385 p0 = :3, p1 = :517 p0 = :4, p1 = :625 Power=0.80 L&L I,II III 120 128 107 73 79 66 62 66 55 59 64 53 164 176 140 101 108 86 85 91 72 81 87 69 209 224 183 128 137 112 107 115 94 103 111 91 IV 108 66 56 53 144 88 74 71 190 116 98 94 Power=0.90 L&L I,II III 160 172 143 98 105 88 83 88 74 79 85 71 220 236 188 134 144 115 113 121 97 109 117 93 280 300 245 171 183 150 144 154 126 138 148 121 IV 144 88 74 71 192 118 99 95 254 155 131 125