A Survey of Exact Inference for Contingency Tables Alan Agresti Statistical Science
Transcription
A Survey of Exact Inference for Contingency Tables Alan Agresti Statistical Science
A Survey of Exact Inference for Contingency Tables Alan Agresti Statistical Science, Vol. 7, No. 1. (Feb., 1992), pp. 131-153. Stable URL: http://links.jstor.org/sici?sici=0883-4237%28199202%297%3A1%3C131%3AASOEIF%3E2.0.CO%3B2-A Statistical Science is currently published by Institute of Mathematical Statistics. Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at http://www.jstor.org/about/terms.html. JSTOR's Terms and Conditions of Use provides, in part, that unless you have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you may use content in the JSTOR archive only for your personal, non-commercial use. Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at http://www.jstor.org/journals/ims.html. Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission. The JSTOR Archive is a trusted digital repository providing for long-term preservation and access to leading academic journals and scholarly literature from around the world. The Archive is supported by libraries, scholarly societies, publishers, and foundations. It is an initiative of JSTOR, a not-for-profit organization with a mission to help the scholarly community take advantage of advances in technology. For more information regarding JSTOR, please contact [email protected]. http://www.jstor.org Mon Aug 27 14:34:53 2007 Statistical Science 1992,Vol. 7,No. 1,131-177 A Survey -of Exact Inference for Contingency Tables Alan Agresti Abstract. The past decade has seen substantial research on exact inference for contingency tables, both in terms of developing new analyses and developing efficient algorithms for computations. Coupled with concomitant improvements in computer power, this research has resulted in a greater variety of exact procedures becoming feasible for practical use and a considerable increase in the size of data sets to which the procedures can be applied. For some basic analyses of contingency tables, it is unnecessary to use large-sample approximations to sampling distributions when their adequacy is in doubt. This article surveys the current theoretical and computational developments of exact methods for contingency tables. Primary attention is given to the exact conditional approach, which eliminates nuisance parameters by conditioning on their sflicient statistics. The presentation of various exact inferences is unified by expressing them in terms of parameters and their sufficient statistics in loglinear models. Exact approaches for many inferences are not yet addressed in the literature, particularly for multidimensional contingency tables, and this article also suggests additional research for the next decade that would make exact methods yet more widely applicable. Key words and phrases: Categorical data, chi-squared tests, computational algorithms, conditional inference, Fisher's exact test, logistic regression, loglinear models, odds ratios, sufficient statistics. Table of Contents 1. Introduction 1.1 Historical perspective 1.2 Outline and notation 1.3 The exact conditional approach 2. Exact Inference for 2 x 2 Tables 2.1 Fisher's exact test 2.2 "Exact" estimation in 2 x 2 tables 2.3 Comparing dependent proportions '3. Exact Inference in I x J Tables' 3.1 Linking test statistics to alternatives 3.2 Exact estimation for I x J tables 4. Exact Inference in Three-Way Contingency Tables 4.1 Testing conditional independence in 2 x 2 x K tables 4.2 Testing homogeneity of odds ratios in 2 x 2 x K tables f Alan Agresti is Professor of Statistics, University of Florida, Griffin-Floyd Hall, Gainesville, Florida 32611. 4.3 Exact estimation in 2 x 2 x K tables 4.4 Exact methods for I x J x K tables 5. Exact Inference for Logistic Regression Models 6. Exact Goodness of Fit 7. Computing Feasibility 7.1 Algorithms 7.2 Software 8. Other Approaches to Exact Inference 8.1 Controversy over exact conditional approaches 8.2 Exact unconditional approach 8.3 Bayesian approaches 9. Future Research References 1. lNTRODUCTlON This article surveys the development of exact inferential methods for contingency tables. 1interrelate various inferences by expressing them in terms of parameters in a hierarchy of loglinear models. The presentation focuses primarily on exact conditional methods, in which one obtains sampling distributions not dependent on other unknown parameters by conditioning on their sufficient statistics. I also discuss some long-standing controversies for such methods and present topics for additional research that should soon be feasible given the continual improvement in computer hardware and software. 1.I Historical Perspective Traditionally, statistical inference for contingency tables has relied heavily on large-sample approximations for sampling distributions of parameter estimators and test statistics. Many of these approximations are special cases of ones that apply more generally than to categorical data [e.g., chi-squared approximations for likelihood-ratio statistics and normal approximations for maximum likelihood (ML) estimators of model parameters]. With this emphasis on large-sample methods, the development of inferential methods for categorical data parallels the historically earlier development of inferential methods for continuous data. For instance, E. S. Pearson's recently published manuscript about Student notes that studies analyzed by Karl Pearson's laboratory usually involved largesize data sets. When Gosset presented his queries about small samples that led to his development of techniques using the t-distribution, Pearson replied, "Only naughty brewers deal in small samples" (Pearson, 1990, page 73). R. A. Fisher's Statistical Methods for Research Workers was at the forefront of advocating exact procedures for small samples. In the preface of the first edition of that book (1925), Fisher stated, " . . . the traditional machinery of statistical processes is wholly unsuited to the needs of practical research. Not only does it take a cannon to shoot a sparrow, but it misses the sparrow! The elaborate mechanism built on the theory of infinitely large samples is not accurate enough for simple laboratory data. Only by systematically tackling small sample problems on their merits does it seem possible to apply accurate tests to practical data." Not surprisingly, t-procedures received strong emphasis iq that text, and "Fisher's exact ,test" for 2 x 2 contingency tables appeared in the 1934 and subsequent editions. The importance of improving the scope of exact methods for categorical data has become increasingly clear in recent years. Standard asymptotic methods apply to a fixed number of cells, as cell expected frequencies grow to infinity. Yet, researchers often attempt to alialyze additional variables as the sample size grows; thus, large expected frequencies may be the exception rather than the norm. Although recent research has introduced new asymptotic approaches that permit the number of cells to grow as the sample size grows (e.g., Morris, 1975; Haberman, 1977; Koehler, 1986; McCullagh, 1986; Zelterman, 1987), information on the adequacy of these approximations for standard models is at an infant stage. [Cressie and Read (1989) surveyed research on the adequacy of various asymptotic approximations.] Also, simulation studies have shown that it is hopeless to expect simple guidelines to indicate when asymptotic large-sample approximations are adequate (e.g., Koehler and Larntz, 1980). Even when the sample size is quite large, recent work has shown that large-sample approximations can be very poor when the contingency table contains both small and large expected frequencies (Haberman, 1988). Regarding adequacy of asymptotics, the sample size n often has less relevance than the discreteness of the sampling distribution. Thus, "small-sample methods" for categorical data more accurately refer to methods needed when there are few points or relatively large probabilities in the support of that distribution. Table 1, taken from Graubard and Korn (1987), illustrates that different statistics and approximations can give quite different results, even for very large samples. That table, which refers to a prospective study of maternal drinking and congenital malformations, has 32,574 observations. For testing independence of alcohol consumption and malformation, asymptotic chi-squared tests give pvalues of 0.017 using the Pearson statistic and 0.190 using the likelihood-ratio statistic. Exact tests of a type discussed in Section 3.1 using these criteria give p-values of 0.034 and 0.139, respectively. A test based on a trend alternative that utilizes the ordering of column categories by assigning scores (0,0.5,1.5,4,7) to them has exact p-value more than three times the p-value based on an asymptotic normal approximation (0.017 versus 0.005 in the one-sided case). The lag in the development and use of exact inferences for contingency tables is partly explained by the later development of methods for categorical data compared with continuous data, but also by the greater computational complexity. However, recent advances in computational power TABLE1 Maternal drinking and congenital malformations Alcohol consumption (average no. of drinks/day) Malformation 0 <1 1-2 3-5 2 Absent Present 17,066 48 14,464 38 788 5 126 1 37 Source: Graubard and Korn (1987). 6 1 EXACT INFERENCE FOR CONTINGENCY TABLES and efficiency of algorithms have made exact methods feasible for a wider variety of inferential analyses and for a larger collection of table sizes and sample sizes. In this survey of exact inferences for contingency tables, I indicate which computations can currently be done and also highlight areas in which current capabilities are inadequate. For two-way contingency tables, let X denote the row classification and Y the column classification. For three-way tables, denote the third classification by Z. For simplicity, denote loglinear models by standard symbols pertaining to their minimal suffi- cient statistics. For instance, (X, Y) denotes the model of statistical independence in a two-way table, and (XZ, YZ) denotes the model of condi- tional independence between X and Y, given Z, in a three-way table. The minimal sufficient statistics are i n i + ) and {n+j) for ( X , Y) and in,,,) and {n+jk) for (XZ, YZ). Finally, denote expected fre- quencies by m and sample proportions by p, for example, {mij = E(nij)} and { pij = nij/n} in a two-way table. 1.2 Outline and Notation Sections 2 to 4 present a variety of exact methods for contingency tables in the context of loglinear models. Sections 2 and 3 focus on two-way tables, Section 2 for the 2 x 2 case and Section 3 for the I x J case. Section 4 focuses on three-way tables, with emphasis on the 2 x 2 x K case. Section 5 discusses exact inference for logistic regression models, and Section 6 discusses exact goodness-of-fit testing for loglinear and logistic models. Section 7 discusses computing feasibility of exact methods, using currently available software. Nearly all the literature on exact methods for contingency tables emphasizes hypothesis testing; but, in each section, I also indicate the scope of work on interval estimation. My discussion throughout the article takes the viewpoint of classical frequentist conditional inference, with application to loglinear models for contingency tables. Section 8 mentions other approaches. It presents an unconditional approach to exact inference and also indicates how Bayesian inferences correspond to conditional inferences for certain choices of prior distributions. The final section discusses possible future directions for research on exact methods for contingency tables. -. Throughout the article, I assume a standard Poisson or mu1tinorr)ial sampling model for cell counts in the contingency table. For instance, in an I x J table, the cell counts inij) might have a multinomial distribution generated by n independent trials with IJ cell probabilities {.rrij). Or, the counts inij, j = 1 , . . . , J ) in row i might have a multinomial distribution, with counts in different rows ,being independent. In the first case ("full" multinomial sampling), n = CC nij is fixed. In the second case (independent multinomial sampling), i n i + = Cjnij, i = 1 , . . . , I ) are fixed. When inij) are independent Poisson random variables, conditioning on n yields full multinomial sampling; conditioning further on the row totals yields independent multinomial sampling. All sampling models lead to the same exact inferences, since those inferences condition on marginal totals that contain as a subset the naturally fixed totals, and since the parameters of usual interest are not the proportions in the margins that are fixed under some sampling designs but not under others. 133 1.3 The Exact Conditional Approach Historically, the most common approach to exact inference in contingency tables has been a condi- tional one. Suppose an inference refers to a param- eter in some loglinear model. Exact conditional inferential methods utilize the distribution of the sufficient statistic for that parameter, conditional on sufficient statistics for the other parameters ("nuisance" parameters) in the model. For in- stance, suppose one wants to test a hypothesis Ho that corresponds to a loglinear model symbolized by Mo, under the assumption that a more general model M, holds, corresponding to an alternative hypothesis HI. Denote minimal sufficient statistics for the models by To and TI. Exact inference uses the conditional distribution of T, given To (e.g., Andersen, 1974). By the definition of sufficiency, the conditional distribution does not depend on the nuisance parameters, thus making exact inference possible. To illustrate, for Poisson sampling in a 2 x 2 contingency table, the saturated loglinear model has the form ' The parameters {Aij) describing association in this model pertain to the odds ratio. For instance, suppose we achieve identifiability by setting = %= &, = A,, = &, = 0. Then A,, =. log(8), where 8 deThe ~ ~ m ~ ~ notes the odds ratio, 8 = ( m ~ ~ m ~ ~ ) / ( m parameter set is (p, A?, A,; A,,), and normally inter- est focuses on 8 = exp(A,,), the others being nui- sance parameters. The sufficient statistics are n for ,; and n,, for A,,. To conduct p, n,, for A?, n,, for A exact inference about 8, consider the conditional distribution of n,, given n, n,,, and n,,. This conditional distribution is the same as the one having elements P(Y, = n,,, Y, = n,, - n,, I Y, Y, = n+,), where {Yi, i = 1 , 2 ) are indepen- + 134 A. AGRESTI dent binomial random variables with parameters mi,)]. Straightforward calculation shows that it equals [ ni+, mi, /(mi, + the model by conditioning on sufficient statistics for them. 2. EXACT INFERENCE FOR 2 x 2 TABLES where the index of summation ranges from max(0, n,++ n+, - n) to min(n,+, n+,), the possible values for n,, for the given marginal totals. This is the noncentral hypergeometric distribution (Fisher, 1935a; Cornfield, 1956). To test statistical independence of X and Y [A,, = 0 in model (1.1)1, one uses this distribution with 8 = 1. Of course, to complete a test, one needs to specify a test statistic and the way to compute the p-value. In testing a loglinear model M, against a more complex model MI, in this article I will generally base p-values on the exact conditional distribution of Rao's efficient score statistic for testing that the extra parameters that are in M, but not in Mo equal zero. The efficient score statistic is based on the vector of partial derivatives of the log likelihood with respect to the extra parameters, evaluated at the null estimates (Rao, 1973, Section 6e). For testing independence in a 2 x 2 table, this suggests [ n,, - (n,+n+,)/ nl, suitably normalized, as a test statistic. The p-value, based on extreme values of the test statistic, is calculated using the distribution for that statistic that is_induced by the exact conditional distribution of { nij) . When M, contains one more parameter than M,, tests based on the efficient score statistic are uniformly most powerful unbiased (UMPU), as a consequence of a result for exponential families (e.g., Lehmann, 1986, Theorem 3, page 147). Denote the ML estimator of the parameters in M, by (e,, el), where 9, denotes the estimator of the parameters that are in M, but not in M,. For large samples with a fixed number of cells, these tests are asymptotically equivalent to ~ a l tests d based on el and likelihood-ratio tests of Mo against M, (Rao, 1973, pages 418-420). My reason for relating exact conditional inference to loglinear models in this article is as follows. Loglinear models are generalized linear models for categorical data that use the canonical link, that is, they directlyqmodel the natural parameter (the log mean) of a natural exponential family (the Poisson). Such generalized linear models permit reduction of the data through sufficient statistics (McCullagh and Nelder, 1989, page 32). Thus, one can eliminate nuisance parameters in This simple case still generates an enormous volume of publications, partly because of controversy (discussed in Section 8.1) about applying exact methods that condition on marginal totals that are not naturally fixed under Poisson, multinomial, or binomial sampling schemes. The most common sampling model for 2 x 2 tables is independent binomial sampling in the two rows. Denote the "success" probabilities by a, and a,, for which the odds ratio is 8 = [a, /(1 - a,)]/[a, /(1 - a,)]. The hypothesis of homogeneity states that a, = a,. For Poisson or full multinomial sampling, conditioning on { n,+, n,+ ) gives binomial sampling, and statistical independence is equivalent to homogeneity. Under the hypothesis of homogeneity, conditioning as well on n+, to eliminate the nuisance parameter (the common value of a, and a,) yields the hypergeometric distribution (1.2) with 8 = 1. 2.1 Fisher's Exact Test Fisher's exact test (Fisher, 1934, 1935a; Yates, 1934; Irwin, 1935) of H,: 8 = 1 is "well known," and I refer the reader to Lehmann (1986, pages 151-162) for details of its derivation under various sampling schemes. The null hypergeometric distribution has E(n,,) = n,+n+, / n and Var(n,,) = n1+n+,n2+n+,/n2(n- 1). To test H,: 8 = 1 against HI: 8 > 1, the p-value is where S = { t: t 2 n,,). The set S is identical to that of tables for which the sample odds ratio is at least as large as observed. The hypergeometric applies directly as a sampling model when both sets of marginal counts are naturally fixed. A classic example of this case is Fisher's (1935b) tea-tasting experiment, relating to a woman's claim to be able to judge whether tea or milk was poured in the cup first. The woman was given eight cups of tea, in four of which tea was poured first and in four of which milk was poured first, and was told to guess which four had tea added first. The contingency table for this design (see Table 2) has n = 8 and n,+= n+, = 4; n,, can take values O,1,2,3,4 with corresponding one-sided p-values (for HI: 8 > 1) of 1.0, 0.986, 0.757, 0.243 and 0.014. Ways of forming two-sided p-values in Fisher's exact test were discussed by Gibbons and Pratt (1975), Yates and discussants (1984), Davis (1986), EXACT INFERENCE FOR CONTINGENCY TABLES TABLE2 Fisher's tea-tasting experiment Guess poured first Poured first Milk Tea Total Milk Tea 3 1 1 3 4 4 Total 4 4 8 - Dupont (1986), Mantel (1987), and Lloyd (1988a). The most popular approaches are (a) double the one-sided p-value, (b) (2.1) with S = { t: f ( t l n, n,,, n,,) If(n1, I n, n,,, n,,)), and (c) (2.1) with S = {t: I t - E(n,,)I 2 1 n,, - E(n,,)I}. The third option is identical to the null probability that the Pearson chi-squared statistic is at least as large as observed. Different approaches can give different results because of the discreteness and potential skewness of the hypergeometric distribution. For instance, consider the table having counts by row (10,90/20,80) (i.e., 10 and 90 in the first row, 20 and 80 in the second), discussed by Dupont (1986). The null distribution of n,, is symmetric about 15, and the three p-values are identical and equal 0.073. However, for the nearly identical table (10,91/20,80), p-value (a) equals 0.069, whereas (b) and (c) equal 0.050. 2.2 "Exact" Estimation in 2 x 2 Tables The non-null conditional distribution (1.2) of n,, is used in constructing "exact" confidence intervals for the odds ratio. For data n,,, the conditional ML estimate of 6 is t h e value of 19 that maximizes probability (1.2). The estimate is obtained using iterative methods (Cornfield, 1956), and differs from the unconditional ML estimate 6 = n,, n,, / n,, n,,. For instance, for Fisher's tea-tasting data (Table 2), n,, = 3 and the unconditional ML estimate is 9; the conditional ML estimate maximizes the conditional likelihood (1.2), and equals 6.41. For testing H,: 8 = Oo against HI: 6 > O,, the p-value is P = C,f(t I n, n,,, n+,;8,), where S = {t: t 2 n,,}. For testing against H,: 0 < O, S = {t: t I n,,}. One can obtain an "exact" confidence interval for 19 by inverting the test (Cornfield, 1956; Mantel and Hankey, 1971'; Thomas, 1971). The lower endpoint is the 8, value for which P = a / 2 in testing H,: 19 = 8, against H I : 8 > 8,. The upper endpoint is the 8, value for which P = a / 2 in testing Ho against H,: 8 < 8,. If n,, = 0, then the lower endpoint is 0, and one uses P = a in obtain- 135 ing the upper endpoint; if n,, = min(n,+, n,,), then the upper endpoint is m, and one uses P = a in obtaining the lower endpoint. For the tea-tasting data, the 95% confidence interval obtained in this manner is (0.21,626.2). The discreteness of the distribution of n,, limits the confidence intervals to a discrete set of possible endpoints, for fixed a . Thus, the true confidence coefficient is at least 1 - a , rather than exactly 1 - a , and its value depends on the value of I9 (Neyman, 1935). It is strictly greater than 1 - a unless the true 19 is an attainable endpoint. I use quotes around "exact" in referring to such intervals to reflect this behavior. Baptista and Pike (1977) described an alternative approach that also guarantees the confidence level but sometimes gives slightly shorter intervals. Their interval is a generalization of Sterne's (1954) confidence interval for a single binomial parameter. One inverts a family of acceptance regions that are formed using the minimal number of most likely outcomes. Specifically, for each 00, one finds a set A(0,) of possible n,, values such that the probability of the set is at least 1 - a and such that every integer in the set is at least as likely to occur as every integer outside the set. The confidence interval is the set of 8, values for which the observed n,, falls in A(0,). It is not possible to construct "exact" confidence intervals for association measures that are not functions of the odds ratio. They do not occur as parameters in generalized linear models with Poisson or binomial random component using canonical links. Thus, the usual conditioning arguments do not eliminate nuisance parameters. For instance, consider estimation of the difference of probabilities 6 = a, - a, for independent binomial samples. The joint sampling distribution can be expressed in terms of 6 and a,, for instance, but conditioning on the marginal totals does not eliminate a,. Santner and Snell (1980) discussed this and other difficulties in interval estimation of the difference of proportions and the relative risk. They also described ways of getting conservative confidence intervals for these parameters, but the usual conservativeness due to discreteness is compounded because of the approach used to eliminate the nuisance parameter. "Exact" interval estimates exist in certain extreme situations in which asymptotic interval estimates do not. Suppose the conditional inference uses the distribution of TI, given To. When T, assumes its maximum or minimum value, the unconditional (or conditional) ML estimator of 6 and its asymptotic standard error do not exist (unless one adds some constant to the cells), but "exact" 136 A. AGRESTI one-sided confidence intervals do exist. For a point estimate in such cases, some authors (e.g., Hirji, Mehta and Patel, 1987, 1988; Hirji, Tsiatis and Mehta, 1989) use median unbiased estimates. To illustrate, for a 2 x 2 table, when n,, = 0, the ML estimator of the loglinear association parameter (the log odds ratio) and the ML estimator ( C C l / nij) of the asymptotic variance of the ML estimator do not exist. The median unbiased estimate is the 8, value that gives P = 0.5 and the "exact" upper 100(1 - a)% confidence limit is the 8, value that gives P = a , in testing against HI: 8 < 8,. 2.3 Comparing Dependent Proportions Next, consider the comparison of two proportions when each sample contains the same subjects, or the sample consists of matched pairs. For instance, one might have repeated measurement of a binary response at two occasions. Let nij be the number of observations in category i at occasion 1 and category j at occasion 2. Let a i j denote the probability of response i at occasion 1 and response j at occasion 2, and let pij = nij/n. The sample proportions of " S U C C ~ S S ~ S "p l + at occasion 1 and p + , at occasion 2 are dependent, rather than independent, because of the matching. The hypothesis of marginal homogeneity, H,: a, = a ,, corresponds to homogeneity for the 2 x 2 table consisting of the two marginal distributions. For binary responses, marginal homogeneity is equivalent to symmetry, a,, = a,, To obtain an exact test, one conditions on n* = n,, + n,,. Under H,, n,, has a binomial (n*, 112) distribution. A two-sided p-value is the sum of binomial probabilities for n,, values at least as far from n*/2 as observed. This is a small-sample version of McNemar's test (McNemar, 1947). Cox (1958a) provided an argument, of which I now sketch the outline, that motivates this conditional approach. Let ( Y,,, Y,,) denote the hth pair of observations, where a "1" response denotes category 1 (success) and "0" denotes category 2. Consider the logit model + pair, one eliminates the nuisance parameters { a h } by conditioning on the pairwise success totals { ylh + Y , ~ }Given . a total of 0, P(Ylh = YZh= 0) = 1, and given a total of 2, P(Ylh = Y,, = 1) = 1. The conditional distribution of ( Y, ,, Y2,) depends on 0 only when y,, + YZh = 1. For each of the n* such pairs, direct calculation using (2.2) shows that P(Ylh = 1, Y,, = 0) = [l + exp(0)l-l. Since n,, = Cy,, for these n* pairs, the distribution of n,, conditional on n* is binomial with parameter [l + exp(0)I-l. The parameter equals 112 under marginal homogeneity ( 0 = 0). This conditional analysis implies that pairs having identical response at the two occasions are irrelevant to testing a , + =a+,. Cox (1966) generalized this analysis for data in which each case is matched with several controls. Gart (1969) gave an exact test for comparing matched proportions in crossover designs. 3. EXACT INFERENCE IN I x J TABLES For I x J tables, statistical independence of X and Y corresponds to the loglinear model M, = (X, Y), which has form log mij = p + AT + A;. + where the indicator ' I ( t = 2) equals 1 when t = 2 and 0 when t = 1. This model (a special case of the Rasch model) permits separate response distributions for each pair, but assumes a common effect, exp(0) representing the odds ratio of success at occasion 2 compared with occasion 1. For the joint mass function of the data under the assumption that Ylh and Y,, are independent within each To test this model against the saturated model [ M, = (1.1) extended to I rows and J columnsl, one uses the null distribution of inij} given { n , + }and For multinomial sampling, the cell probability parameters {aij} in the distribution of {n,,} can be expressed in terms of the marginal probabilities and (I - 1)(J - 1) odds ratios. Conditional on i n , + } and {n+j}, Cornfield (1956) noted that the distribution of inij} depends only on the odds ratios. The conditional probabilities are proportional to where a i j = (aijaIJ)/(aiJaIj).For exact tests of statistical independence (all a i j = I), this distribution simplifies to the multiple hypergeometric. The probability of a table inij} having the given marginal totals equals n!n n n,,! 137 EXACT INFERENCE FOR CONTINGENCY TABLES 3.1 Linking Test Statistics to Alternatives Let p,,, denote the null probability (3.1) of the observed table. Freeman and Halton (1951) defined the p-value for a conditional test of independence to be the null probability of the set of tables having probability no greater than pobs. For the 2 x 2 case, this simplifies to a two-sided version of Fisher's exact test. Many statisticians have argued that the p-value should instead be based on the exact distribution of some meaningful statistic T (such as the efficient score statistic) for quantifying the departure of the data from Ho (e.g., Yates, 1934; Fisher, 1950; Healy, 1969; Agresti and Wackerly, 1977; Cressie and Read, 1989). The p-value for the Freeman-Halton test may be regarded as one that uses a statistic negatively related to p,,,, such as T = - 2 log( p,,,). When both classifications are nominal, the usual alternative to the independence model is the saturated model. The efficient score statistic is then the Pearson chi-squared statistic for testing the goodness-of-fit of the independence model, where mij = ni+n . I n . One could let the p-value be the null probabh!ity that X 2 is at least as large as observed (Yates, 1934; Agresti and Wackerly, 1977; Baker, 1977). More generally, one could use a power divergence statistic (Cressie and Read, 1989), special cases of which are the Pearson statistic and the likelihood-ratio statistic. When both classifict$ions are ordinal, it is often important to have good power for detecting a monotone trend in the association. To do this, one could let T relate to a Spearman or Pearson-type correlation (Patefield, 1982), the difference between the numbers of concordant and discordant pairs (Agresti and Wackerly, 1977), or the Jonckheere-Terpstra statistic (StatXact, 1991). In the first case, the statistic on which the reference tables are ordered is T = CCxiyj(nij - ni+n+j/n), .for two sets of mon,otone scores { xi) for rows and { yj) for columns. An exact conditional test using this statistic is an efficient score test for the loglinear model of linearby-linear association (Agresti, Mehta and Patel, 1990). That model has the form using T fall in a complete class of unbiased and admissible tests. When X is nominal and Y is ordinal, one might let T be the Kruskal-Wallis statistic adjusted for ties (Klotz and Teng, 1977). In that statistic, average ranks are used as scores {yj} for the column categories, and the test is sensitive to variation among the mean ranks computed for the conditional distributions within the rows. This is an efficient score statistic for a loglinear model having form (3.2), with "row effects" {fix,} treated as parameters. For 2 x J tables, Soms (1985) presented exact tests sensitive to other alternatives. See Haberman (1974), Goodman (1979) and Agresti (1990, Chapter 8) for discussions of loglinear models with scores. To illustrate that tables that are "more contradictory" to Ho according to some statistic T need not be less likely, consider the twelve 3 x 3 tables having row totals (6,l, 2) and column totals (1,2,6). Table 3 gives the conditional null distribution of T = CCxiyjnij, for { x i = y, = i - 2). The distribution has support between - 7 and 0 with a mean of - 2.22. It is-highly irregular, far from its limiting normal distribution. Table 3 shows that, for some margins, it is not possible to obtain small p-values for some alternatives (e.g., a test using T for the one-sided alternative of a positive association). The table having entries by row (0,2,4/1,0,0/0,0,2), one of two tables to have T = -2, has only a quarter of the probability of the table having entries (1,2,3/0,0,1/0,0,2), the only table having T = 0. Yet, the first table is less contradictory to Ho than the second, using I T - E(T) I as the criterion. The first table has a p-value of 1.0 using this criterion, but only 0.2856 in the Freeman-Halton test. By contrast, the I x I table having nii = 1for all i and nij = 0 for all i + j has P = (2/1!) using ) T - E(T) I but has P = 1.0 for the FreemanHalton test or the exact test using x2 as the criterion. TABLE3 Exact conditional distribution of T = CC(i - 2 ) ( j - 2)nij under independence, for margins ( 6 , 1 , 2 )and ( 1 , 2 , 6 ) t (Birch, 1965; Haberman, 1974; Goodman, 1979), with the special case 0 = 0 representing statistical independence of X and Y. For model (3.2), the sufficient statistics are T plus those for the Cohen and independence model ({ ni+),{ n,)). Sackrowitz (1991) showed that nonrandomized tests No. of tables Probabilities P(T = t ) 138 A. AGRESTI When I = 2, ordering the tables by T is equivalent to ordering them by U = Cj yjnIj. Many statistical tests use U for various choices of scores (Graubard and Korn, 1987). For arbitrary monotone scores, the exact test for T is a small-sample version of a trend test proposed by Cochran (1954) and Armitage (1955). The statistic with midrank scores is used in exact Wilcoxon tests for ordered categorical data (Klotz, 1966; Mehta, Pate1 and Tsiatis, 1984). 3.2 Exact Estimation for I x J Tables "Exact" confidence intervals are rarely used for I x J tables, probably because of the complexity. Reducing parameter dimensionality could be useful in many applications, by using an unsaturated loglinear model. Agresti, Mehta and Pate1 (1990) discussed confidence intervals for ordinal classifications when odds ratios satisfy the pattern implied by the linear-by-linear association model (3.2). For this pattern, the non-null distribution of T = CCxiyjnij is where C, is the sum of (11,IIjnij!)-l for all tables with the given marginal distributions having T = t . One obtains a confidence interval for 0 and thus {aij} by inverting the test of H,:p = Po, as in the case of the odds ratro for a 2 x 2 table. When T takes its maximum or minimum possible value for the given marginals, the conditional and unconditional ML estimators of P and asymptotic standard errors do not exist, but one-sided confidence bounds for 0 and odds ratios using it do exist. 4. EXACT INFERENCE IN THREE-WAY CONTINGENCY TABLES I,next consider three-way tables, paying special attention to the 2 x 2 x K case. Such tables occur, for instance, when one compares a binary response (Y) for two treatments (X), using data obtained at K levels of a possibly confounding factor (2).Two hypotheses of importance are (1) conditional independence of X and Y, given Z, and (2) no threefactor interaction, meaning' that the true X-Y odds ratio is identical at each level of Z. Conditional independence of X and Y, given Z, corresponds to the loglinear model symbolized by (XZ, YZ), and no three-factor interaction corresponds to the loglinear model (XY, XZ, YZ). For cell-expected fre- quencies { mijk),model (XY, XZ, YZ) has the form log mi,, =p + A?+ A+: % + hEY+ + and model (XZ, YZ) is its special case in which 0 xu {hij - 1. 4.1 Testing Conditional Independence in 2 x 2 x K Tables A common approach to testing conditional independence [ M, = (XZ, YZ)] performs the analysis under the assumption of no three-factor interaction [MI = (XY, XZ, YZ)]. The sufficient statistics are the X-Z and Y-Z two-way marginal tables for M,, and these as well as the X-Y marginal table for MI. The relevant conditional distribution is that of the X-Y marginal table, given the X-Z and Y-Z marginal tables. The conditioned totals are the marginal counts for the K partial tables. For the 2 x 2 x K case, the distribution simplifies to that of = Cknllk, given {nl+k,n2+k,n+lk, n+2k, = 1, . . . , K}. Under the assumption of no three-factor interaction, Birch (1964) showed that UMPU tests of conditional independence utilize T (see also Lehmann 1986, pages 163-164). Conditional on the strata margins, inllk, k = 1,. . . , K) have independent hypergeometric distributions, each of form (1.2) with 8 = 1. The product of the K mass functions determines the null distribution of their sum. For the one-sided alternative of a "positive" association (odds ratio greater than 1.0 in each level of Z), the p-value for Birch's exact test of conditional independence is the null probability that Cknllk is at least as large as observed, for the fixed marginal totals. Thomas (1975), Pagano and Tritchler (1983b), and Mehta, Pate1 and Gray (1985) gave algorithms for implementing this test, which may be regarded as an exact smallsample version of the Cochran-Mantel-Haenszel test. The exact McNemar test for matched pairs (Section 2.3) is a special case of Birch's exact test of conditional independence. In this representation, each of the n matched pairs has a 2 x 2 table relating occasion (or member of pair) to response. One tests M, = conditional independence of occasion and response, given the pair, under the assumption of MI = homogeneous odds ratios in the n 2 x 2 tables. For 2 x 2 x K tables in which it is unrealistic to expect the K conditional X-Y odds ratios to be similar or even of the same sign, the saturated model is a more relevant alternative than no three-factor interaction. In that case, the efficient score statistic is CkX;, where X l denotes the Pearson statistic for testing independence between 139 EXACT INFERENCE FOR CONTINGENCY TABLES X and Y within the kth level of Z. This statistic is commonly used in asymptotic tests of conditional independence against the general alternative. Currently, there do not seem to be any computer algorithms for this case. 4.2 Testing Homogeneity of Odds Ratios in 2 x 2 x K Tables Birch's exact test using test statistic C,n,,, assumes homogeneity of the odds ratios in the 2 x 2 x K table. Zelen (1971) presented an exact test of this assumption. Here, M, = (XY, XZ, YZ), and M, is the saturated model. The conditional distribution, given each of the three sets of two-way marginal totals, has probabilities proportional to (IIiIIjIIknijk!)-l.For the 2 x 2 x K case, the fixed totals are in,,,, n,+,, n,,,, n,,,, k = 1,. . K) and {n,,,, n,,,, n,,,, n,,,). Zelen defined the pvalue to be the sum of probabilities of all 2 x 2 x K tables that are no more probable than the observed table. Alternatively, one could define P by ordering the tables with the given two-way margins by x2 for testing the fit of loglinear model (XY, XZ, YZ). Thomas (1975) and Pagano and Tritcher (1983b) gave algorithms for implementing Zelen's test. To improve potential power in testing model (XY, XZ, YZ), one could instead test it against an unsaturated model. When the levels of Z have a natural ordering, one could use the alternative loglinear model by which the log odds ratios change linearly across the K strata; that is, a log mi,,= (XY, XZ, YZ) + I(i =j = , 1) 6zk, for fixed monotone scores { z,), where I(.) is the indicator function. The relevant distribution is then that of ~ , z , n , , , , conditional on all two-way marginal totals. Zelen (1971) also presented such an exact test. To illustrate some exact tests for conditional independence and for homogeneity of odds ratios for 2 x 2 x K tables, consider Table 4, a 2 x 2 x 3 table based on a larger table presented by Gast- TABLE4 Example of 2 x 2 x K analysis July promotions ' August promotions September promotions Race Yes No Yes No Yes No Black White 0 4 7 16 0 7 4 13 0 2 8 13 Source: Gastwirth (1988). wirth (1988, page 266). The data refers to P = whether promoted and R = race, stratified by M = month of promotion consideration (in 1974). Cases involved GS-13 level computer specialists being considered for promotion to level GS-14. (It appears that many of the subjects appeared in two-or all three strata, but this information is not available. So, like Gastwirth, I treat the promotion decisions as independent.) Under the assumption of a constant odds ratio 8 between P and R at each level of M, we first test H,: 8 = 1 against H,: 8 < 1. In using the one-sided alternative for the association between P and R, the test is sensitive to evidence of possible discrimination against blacks, in the sense of the probability of promotions being lower for blacks than whites. Given the P and R marginal totals at each level of M, n,,, can range between 0 and 4, n,,, can range between 0 and 4, and n,,, can range between 0 and 2. The test statistic T = Cn,,, can assume values between 0 and 10, and under H,, E(T) = 2.90 and u(T) = 1.35. Note that the sample data represent the most extreme possible result in each of the three cases. The observed test statistic is T = 0, and the p-value is the null value of P(T r 0), which is 0.026. Zelen's test of the hypothesis of a constant odds ratio is a test of fit of the loglinear model (PR, PM, RM). The reference set consists of the subset of these 2 x 2 x 3 tables that also satisfy n,,+= 0. The observed table is the only such table, so the test is degenerate, giving P = 1.0. 4.3 Exact Estimation in 2 x 2 x K Tables Assuming no three-factor interaction and conditioning on the strata totals, the joint distribution of { n . . . , n,,,) is the product of K terms of the type given in (1.2) for the 2 x 2 case. This distribution depends only on the common odds ratio. Birch (1964) discussed the conditional ML estimator of the common odds ratio, and Gart (1970) defined "exact" confidence intervals as a direct extension of Cornfield's "exact" intervals for 2 x 2 tables. Thomas (1975), Pagano and Tritchler (1983b), Mehta, Pate1 and Gray (1985), and Vollset, Hirji and Elashoff (1991) provided algorithms for calculating such intervals. When the sufficient statistic T = C knllk attains its minimum or maximum possible value, asymptotic confidence intervals (e.g., using the Mantel-Haenszel approach; see Agresti 1990, pages 235-236) do not exist, but one-sided "exact" confidence intervals are well defined. Table 4 has the boundary value T = 0, resulting in a degenerate conditional ML estimate of 0.0 for an assumed constant odds ratio. An "exact" 95% upper confidence bound for the odds ratio equals 0.779, the value 8, that gives P = 0.05 in testing ,,,, 140 A. AGRESTI H:, 19 = 6 , against HI: 19 < 19,. The median unbiased estimator equals 0.152. 4.4 Exact Methods for I x J x K Tables In principle, methods for exact testing and esti,nation extend to loglinear models for multiway tables. Current computational algorithms are restricted to certain analyses for 2 x J x K tables. In this subsection, I outline the types of inferences for which it would be useful to extend computational algorithms in the I x J x K case. Consider the hypotheses corresponding to the following five hierarchical loglinear models: (1) (X, Y, 2): Mutual independence of X, Y, and Z; (2) ( X, YZ): Joint independence of X and the Y-Z classification; (3) ( XZ, YZ): Conditional independence of X and Y, given Z; (4) (XY, XZ, YZ): No three-factor interaction; (5) ( XYZ): Saturated model. For each of models (1)to (4), one can consider exact testing against the alternative of the next most complex model, and against the general alternative of the saturated model. Certain tests are special cases of ones already developed for two-way tables, so they do not require separate consideration. Hypothesis [2: (X, YZ)] is a special case of statistical independence for a twoway table, in which the second classification consists of the J K combinations of categories of Y and Z. Thus, an exact test of [2: (X, YZ)] against [5: (XYZ)] is simply a standard exact test of independence for a two-way ( I x J K ) table. For instance, in Table 4, an exact test of ( P , RM) (i.e., that promotion is jointly independent of race and month of decision) is an exact test for the two-way table having rows (0,4,0,4,0,2/7,16,7,13,8,13). The Pearson test statistic of 5.62 has an exact pvalue of 0.353. [The attained significance for this joint test is weak compared to the p-value of 0.026 obtained for the P-R association alone in the 'one-sided exact test of (PM, RM) against (PR, PM, RM). This shows how severely the evidence represented by an effect of a certain size can diminish when the degrees of freedom on which it is based increase drastically.] A test of [2: (X, YZ)] against [3: (XZ, YZ)l tests whether the hxZ term in model (XZ, YZ) is zero. By standard collapsibility results (e.g., Bishop, 1971; Agresti, 1990, pages 146 and 230), when this model holds the hXZ term is identical to the hXZ term in the saturated two-way loglinear model for the marginal X-Z two-way table. Thus, one can conduct an exact test of [2: (X, YZ)] against [3: (XZ, YZ)] using an exact test of independence of X and Z in that two-way table. For instance, one can conduct an exact test of ( P , RM) against (PR, RM) by simply testing whether P and R are independent in the marginal table (0,22/10,42); for this table, the Pearson statistic has two-sided P = 0.056. Similarly, one can conduct an exact test of [I: ( X, Y, Z)] against [2: ( X, YZ)] using an exact test of independence of Y and Z in the two-way Y-Z marginal table. Computational algorithms are unavailable for the other situations, discussed in the remainder of this subsection. To test hypothesis [I: (X, Y, Z)l against the saturated model, one conditions on the sufficient statistics for (X, Y, Z), which are {n,++,n+j+,n++,). The resulting mass function is Stumpf and Steyn (1986) gave formulas for the first- and second-order moments of the cell counts. To construct an exact test of hypothesis [I: ( X , Y, Z)l against the saturated model, one could use this distribution to generate the exact conditional distribution of the Pearson statistic for testing the fit of model (X, Y, 2 ) . This case is relatively unimportant, as the hypothesis of mutual independence is plausible in very few applications. A more important case is a test of conditional independence [3: ( XZ, YZ)] of X and Y against [4: (XY, XZ, YZ)]. Such a test generalizes Birch's test for 2 x 2 x K tables. Here, one tests conditional independence under the assumption that the ( I - 1 ) ( J - 1) odds ratios relating X and Y are identical across the K levels of Z. Model (XZ, YZ) has the sufficient statistics ({ni+k),{n+jk)), and model (XY, XZ, YZ) has these plus { nij+). One uses the distribution of ({ nij+} I { ni+k), which is a product of multivariate hypergeometric distributions from the various layers of Z. Let d denote the ( I - 1)(J - 1) x 1 vector having elements The efficient score test orders tables in the reference set using d'VP1d, where V is the null covariance matrix of d. Birch (1965) proposed an asymptotic test of this type. EXACT INFERENCE FOR CONTINGENCY TABLES Alternatively, one could test conditional independence [3: (XZ, YZ)] against [5: (XYZ)], if one believes that the association between X and Y may vary considerably across levels of Z. An efficient score statistic is then the Pearson statistic for testing the fit of (XZ, YZ), which is C,X;, where X; is the Pearson statistic for testing independence between X and Y within the kth level of Z. This test would also use the conditional distribution that fixes the XZ and YZ marginal tables. Since it has a broader alternative than the generalized Birch test, this test would tend to be less powerful than that test when model (XY, XZ, YZ) provides a decent approximation to the actual distribution. An exact test of no three-factor interaction [4: (XY, XZ, YZ)] for I x J x K tables generalizes Zelen's test for 2 x 2 x K tables. Here, the null hypothesis states that the (I - 1)(J - 1) odds ratios relating X and Y are identical across the K levels of Z. The relevant conditional distribution has probabilities proportional to (IIiIIjIIknijk!)-l,for the reference set of tables having XY, XZ and YZ marginal tables identical to the observed ones. An efficient score test against the general alternative [5: (XYZ)] orders tables in the reference set by the Pearson statistic for testing the fit of the model. For ordinal variables, one would modify the above ideas by constructing tests to increase power against important alternatives, such as has been done for I x J tables. For instance, to test conditional independence (XZ, YZ), one might choose an alternative model that implies a monotone conditional X-Y association. One could use the single degreeof-freedom test statistic C ,{ C C xi yj[ nijk (ni+kn+jk)/n++kl) to order the reference set of tables having the fixed X-Z and Y-Z marginal tables. This results from comparing model (XZ, YZ) to a model of homogeneous linear-by-linear association, whereby the term in model (XY, XZ, YZ) is replaced by a term @xiyj for fixed monotone row and column scores. Such a test would be an exact analog of an asymptotic score test proposed by Mantel (1963) and Birch (1965). . There has been some work of this type for the 2 x J x K case with ordered levels of Y (Mehta, Pate1 and Senchaudhuri, 1991). In this case, one can take x, = 1and x, = 0, and consider the conditional distribution of C k[Cjyjnijk].For preselected monotone scores, this gives a stratified version of the Cochran-Armitage trend test. For rank { yj) scores, it gives a stratified version of the Wilcoxon test. Another application of this case is testing marginal homogeneity in an I x I table with the same ordered row and column categories. One can conduct this test by conducting the exact test of conditional independence for the 2 x I x n table, 141 where the two rows for stratum k contain one observation in each row, giving the responses at the two occasions for subject k. Such a test is an exact analog of asymptotic tests described by White, Landis and Cooper (1982) and Kuritz, Landis and Koch (1988). Interval estimation for I x J x K tables would seem to be awkward except for simpler models that reduce the dimensionality of the parameter space. Examples include models that describe the X-Y conditional association by a linear-by-linear term or describe three-factor interaction by a linear trend in conditional log odds ratios. 5. EXACT INFERENCE IN LOGISTIC REGRESSION MODELS Exact inferences for loglinear models discussed in previous sections have counterparts for other generalized linear models that use natural parameters as the basis of the link function. A closely related example for categorical data is logistic regression modeling. Assuming a binomial distribution with parameter ?r for the response, one uses the logit link, log[?r/(l - T)]. Logistic models are particularly useful when highlighting one categorical variable in the contingency table as a response and the others as explanatory. Loglinear models do not make this distinction, although logistic models with qualitative explanatory variables have equivalent loglinear representations. For subject i, let yi denote a binary response, and let x i = (xio,xi,, . . . , xi,) denote values of k explanatory variables, where xi, = 1. The logistic regression model is Under the usual assumption that { yi) are independent Bernoulli outcomes, the sufficient statistic for Pj is Tj = Ciyixij, j = 0,. . . , k. As noted by Cox (1958b, 1970), one can conduct exact inference for Pj using the distribution of Tj, conditional on {Ti, i # j). Such inference is called conditional logistic regression (Bayer and Cox, 1979; Breslow and Day, 1980, Chapter 7; Tritchler, 1984; Hirji, Mehta and Patel, 1987). To illustrate, Table 5 shows some data from a case-control study (Shapiro et al., 1979) relating cigarette smoking to myocardial infarction for women of various ages using oral contraceptives. 142 A. AGRESTI TABLE5 Example for exact logistic analysis Smoking level (cigaretteslday) Age Disease status 0 1-24 >24 25-29 Myocardial infarction Control 0 25 1 25 3 12 30-34 Myocardial infarction Control 0 13 1 10 8 10 Source: Shapiro et al. (1979). Let {nij,) denote the count at level i of smoking, j of disease status, and k of age. Let {xi) denote scores assigned to the levels of smoking. Let ?ri, denote the probability of disease for subjects at level i of smoking and k of age. One might consider the model Then {ni+,) are fixed for this model, and are fixed by the retrospective nature of the study. To conduct exact inference about p,, one considers the distribution of C ,(C xi nil,) (the sufficient statistic for PI), conditional on these totals. For {x, = 0, x2 = 12.5, x, = 301, the conditional ML is 0.130, and the exact p-value for estimate of testing p, = 0 against 0, > 0 is 0.000. Here, results are similar to those obtained with the unconditional ML analysis, for which the estimate of P, of 0.133 has an estimated standard error of 0.042. One could add an interaction term to the model and do an exact analysis for it. This would be particularly natural with additional age strata representing greater variation in the age factor, since one might expect the effect of smoking to increase at higher age levels. Several special cases of exact analyses for the logistic model have received attention in the literature., For instance, Breslow and Day (1980), Peritz (1982), and Hirji, Mehta and Pate1 (1988) discussed exact inference for matched case-control studies with the logistic model, in which case a subset of the explanatory variables are used for matching. When k = 1in (5.1), the exact test of 0, = 0 using T, = C yi xi, given To = X i yi is a special case of the linear-by-linear test described in Section 3.1 applied to I x 2 tables. Here, I represents the number of distinct sample values of the explanatory variable, and the test may be regarded as an exact version of the Cochran-Armitage trend test. Difficulties can arise in exact inference for logistic regression when some explanatory variables are @, continuous. The { y,) values may be completely determined by the given sufficient statistics, making the conditional distribution degenerate. Algorithms for exact conditional inference for logistic regression can be applied to perform inference for equivalent loglinear models. To illustrate, consider loglinear modeling of several I x 2 tables. Regard the tables as a three-way I x 2 x K crossclassification of X, Y and 2. The loglinear model is equivalent to a logistic model for response Y whenever that loglinear model has a general association term relating X and 2. For instance, the logistic model (5.2) for Table 5 corresponds to the loglinear model having form log mij, = A, + A? + A? + A: + A?: + :A? + p,xiyj where { y, = 1, y2 = 0) and S = smoking, D = disease status, and A = age. This is a special case of the loglinear model of homogeneous linearby-linear S-D association. 6. EXACT GOODNESS OF FIT One can interpret tests of independence, conditional independence and no three-factor interaction against general alternatives as tests of goodness of fit of loglinear models. In principle, one could use analogous methods to construct exact tests of goodness of fit for other loglinear or logistic regression models. For a particular model M, the reference set consists of all tables having the observed values for the minimal sufficient statistics. Given that the model holds, the conditional distribution of the data given those sufficient statistics is independent of any parameters. One could construct the test by computing the null distribution of a goodness-of-fit statistic, such as the Pearson statistic. The p-value for testing the model is the conditional probability 'that the goodness-of-fit statistic is at least as large as observed. The ML fitted values for the model are the same for all tables in the conditional reference set. For instance, in testing independence with Fisher's exact test, one also implicitly tests the adequacy of the loglinear model of independence, (X, Y). To test the more general model (3.2) of linear-by-linear association in a two-way table, one considers the conditional distribution of the data given i n i + ) , {n+j),and CCxi yjnij. McCullagh (1986) showed that, even for large samples, it is beneficial to perform goodnessof-fit tests using the conditional rather than unconditional distribution. Although an exact goodness-of-fit test makes theoretical sense for any model having simple sufficient statistics, a general EXACT INFERENCE FOR CONTINGENCY TABLES computer algorithm is not available for it. This is a useful topic for future research. There is also a need for work on exact distributions of localized measures of goodness of fit, such as cell residuals. Bedrick and Hill (1990) gave exact conditional tests for a single outlier and for multiple outliers in logistic regression. 7. COMPUTING FEASIBILITY Doing computations for exact conditional inference requires working with the set of contingency tables having the given values of the sufficient statistics that are fixed for the inference. The potentially huge cardinality of the conditional reference set has been a severe impediment to the use of exact tests. To illustrate, a 4 x 4 table with only 20 observations can have as many as 40,176 tables with the same margins; a 4 x 4 table with 100 observations has a maximum cardinality on the order of 7.2 x lo9. For given I and J and marginal proportions, the number of tables in the reference set of I x J tables with those fixed proportions increases exponentially in the sample size n. For fixed n, the number of tables having given row and column marginal proportions also increases rapidly as I and J increase or as the row and column proportions become more homogeneous. For instance, a 5 x 5 table has a maximum cardinality on the order of 2.1 x l o 6 for 20 observations and 9.2 x 1014 for 100 observations. Good (1976, 1977) and Gail and Mantel (1977) gave approximations for the cardinality, and Agresti and Wackerly (1977) and Agresti, Wackerly and Boyett (1979) gave maxima for several table dimensions and sample sizes. Enormous improvements achieved recently both in algorithms and in computer power have made exact inference much more feasible than it was a decade ago. Most analyses conducted then with a mainframe computer can be conducted now in the same order of time on a personal computer. When one is interested only in a p-value rather than the entire distribution of some statistic, substantial savings in time are obtained using algorithms that do not require total enumeration of the reference set (Pagano and Halvorsen, 1981; Mehta and Patel, 1983). With some algorithms, exact analyses can be easily conducted on a PC running on MS-DOS when the order of the cardinality is about lo7. They can be conducted when the cardinality is much larger, using computers having operating systems with larger memory capacities. To illustrate, consider the 3 x 4 table (60,4,1, 0/1,5,4,1/3,3,3,2), having n = 100 and 33,675 tables in the reference set. In 1978, I performed the 143 Freeman-Halton test using a state-of-the-art FORTRAN program on an IBM 3701165 in about 15 seconds CPU time. This year, I performed the same analysis in 15 seconds total time using the software package StatXact (1991) on a 386-version PC (the CompuAdd 386SX, with math coprocessing chip) running on MS-DOS at 16 mHz, and in 1second of CPU time using SAS (PROC FREQ) on a workstation (DEC 3100). The 4 x 4 table (7,5,0,0/1,15, 1,0/0,7,7,0/0,0,4,9) has 12,798,781 tables in the reference set. In 1977, Klotz and Teng (1977) estimated it would take 6 hours on a Univac 1110 at the University of Wisconsin to perform an exact Kruskal-Wallis test for a table having these margins. I performed this analysis with StatXact in about 8 minutes of total time on a 386 version PC. This table has a very small p-value (0.0000 rounded to four decimal places), which results in considerable savings in time for algorithms that are able to determine whether many tables contribute to the p-value without explicitly enumerating all their cells. 7.1 Algorithms A variety of algorithms have been used in computing exact conditional distributions. Verbeek and Kroonenberg (1985) presented a good survey of those used for I x J contingency tables. These include algorithms that provide total enumeration of the tables in the reference set (e.g., March, 1972; Boulton, 1974; Baker, 1977; Cantor, 1979; Balmer, 1988), algorithms that compute the characteristic function and invert it via Fourier transforms (e.g., Pagano and Tritchler, 1983a, b), network algorithms (Mehta and Patel, 1983) and Monte Carlo algorithms (e.g., Agresti, Wackerly, and Boyett, 1979). Their paper also has enlightening discussions of practical problems related to the algorithms and hardware for implementing them, such as how to ensure proper comparison of extremely small probabilities. Algorithms that provide total enumeration of the reference set are very time-consuming, and adequate only for small problems. In the characteristic function approach (Good, Gover and Mitchell, 1970; Good, 1982; Pagano and Tritchler, 1983a, b), one computes the characteristic function of the statistic of interest (such as a goodness-of-fit statistic) using a recurrence relation and then inverts it using a Fourier transform to obtain the relevant distribution. The fast Fourier transform (Cooley and Tukey, 1965) is a popular method for fast convolution of long sequences, and so it is a natural one to apply to analyses (such as those for 2 x 2 x K tables) involving convolutions of distributions. This method is relatively space and time efficient, computation 144 A. AGRESTI time increasing polynomially in the sample size, rather than exponentially. However, Vollset, Hirji and Elashoff (1991) noted that this method's use of trigonometric functions and complex arithmetic can introduce substantial round-off errors for non-null calculations when there is a wide range between the largest and smallest of the combinatorial coefficients that determine the exact distribution (e.g., a ratio of the two exceeding about lo2'). Among the most popular and versatile programs developed in the past decade have been ones using the network algorithm. This algorithm has been applied to several problems in a series of papers by Cyrus Mehta, Nitin Pate1 and some coworkers. For instance, see Mehta and Pate1 (1983, 1986) for its application to Freeman-Halton exact tests for I x J tables; Mehta, Pate1 and Tsiatis (1984) for exact tests for 2 x J tables; Mehta, Pate1 and Gray (1985) for inference for the common odds ratio in 2 x 2 x K tables; Hirji, Mehta and Pate1 (1987) for exact logistic regression; Agresti, Mehta and Pate1 (1990) for exact inference in I x J tables with ordered categories; Mehta, Pate1 and Senchaudhuri (1991) for inference for 2 x J x K tables; and Hilton, Mehta and Pate1 (1991) for Smirnov tests for categorical or continuous data. I provide here only a brief outline of the network representation for the reference set for an I x J table with fixed margins, and refer the reader to the previously mentioned papers for technical details on the algorithm itself. The network representation consists of nodes and arcs, constructed in J 1 stages. For k = 0,. . . , J, the nodes at stage k have the form (k,w,), where w, = (w,, , . . . , w,,), with wik = nil + .. + iiik and w, = 0. There are as many nodes at stage k as there are possible partial sums for the first k columns of the table. Arcs emanate from each node at any stage k, each arc being connected to a distinct node at stage k 1. The network is constructed recursively by specifying all successor nodes (k + 1,wk+,) that are connected by arcs to each node (k, w,). A path through the network is a sequence of arcs (0,O) -, (1, w,) -, . + ( J , w,). Each path represents a distinct table in the reference set, with entries (w,+, - w,) in column k + 1. The network representation is used in calculating the exact distribution by stagewise recursion, beginning at node (0,O). Figure 1 shows the network representation for the 3 x 3 table discussed in Section 3.1 having row margins (6,1,2) and column margins (1, 2, 6). The topmost path gives the table 3/0,0,1/0,0,2). For simply calculating p-values, one can dramatically increase the speed of the network algorithm by computing at each node lower and upper bounds on test statistic values for tables having path passing through that node. In this way, one can determine tables that necessarily do or do not contribute to the p-value, without processing the remaining parts of paths in the network passing through that node. Vollset, Hirji and Elashoff (1991) gave an adaptation of the network algorithm that uses an algebraic rather than geometric representation, treating convolutions of distributions using polynomial multiplication. Their algorithm computes the null distribution on the natural rather than logarithmic scale. Using this approach, they obtained faster computation of "exact" confidence limits for the common odds ratio in 2 x 2 x K tables. These days, algorithms such as the network algorithm can practically handle analyses for most I x J and 2 x 2 x K tables having a small to moderate number of cells and small to moderate cell counts. To illustrate, Table 6 reports the CPU time on a + + FIG. 1. Network representation for 3 x 3 tables having row margin ( 6 , 1 , 2 )and column margin ( 1 , 2 , 6 ) . TABLE 6 Sample CPU times (seconds) for Freeman-Halton test on I x I tables with uniform marginal counts andp-values approximately 0.05, using SAS on a DEC 3100 workstation Total sample size a More than 6 hours. EXACT INFERENCE FOR CONTINGENCY TABLES DEC 3100 workstation needed for SAS (PROC FREQ) to perform the Freeman-Halton test for I x I tables of various sizes having uniform marginal totals and cell frequencies changed from uniformity sufficiently to give p-values approximately equal to 0.05. Times are much faster when marginal counts are nonuniform. For instance, a 5 x 5 table with n = 100 having both margins equal to (60,25,10,3,2) took only 40 seconds CPU time. The remaining problematic area is large, sparse tables, for which there are a large number of cells and cell counts too small to appeal to standard asymptotic theory. For two-way tables, examples would be tables of at least about 50 cells, having several fitted values less than about 5. Even with current computing power, the conditional reference set for such tables is often too large to be handled by exact methods. Moreover, the cardinality grows so rapidly as a function of the number of cells that, regardless of future improvements in computer power, it may always be possible to produce tables that cannot be handled exactly. A good compromise for handling large, sparse tables is to estimate precisely the inferential characteristics of interest, such as exact p-values and confidence intervals. One can do this using Monte Carlo sampling of tables in the reference set, by simulating the conditional hypergeometric Sampling distribution (Agresti, Wackerly and Boyett, 1979; Boyett, 1979; Cox and Plackett, 1980; Patefield, 1981, 1982; Kreiner, 1987; StatXact, 1991). Each sampled table provides a Bernoulli random variable, indicating whether the test statistic is at least as large as observed. The estimated exact p-value is the sample mean of those Bernoulli random variables, which is the proportion of sampled tables that have test statistics at least as large as the observed one. The precision of the estimate is determined by the estimated sample variance P ( l - P)/N, where N is the number of tables Sampled. Sampling of 17,000 tables ensures the estimate is good to within 0.01 with confidence at least 0.99. Agresti (1990, page 308) gave a 16 x 5 table with 219 observations, relating various characteristics of alligators to their primary food choice (having categories: fish, invertebrate, reptile, bird, other). For such large tables, one could either sample a fixed number of tables N to guarantee a certain accuracy, or repeatedly take samples of N' tables until achieving a certain accuracy. Using the latter approach with N' = 2,000 and desired accuracy 0.0005 for a p-value for testing independence with the Pearson statistic, Monte Carlo sampling of 10,000 tables provides an estimated exact p-value 145 of 0.0004 and a 99% confidence interval for that exact p-value of (0.0000,O.0009). An advantage of the Monte Carlo method is that the amount of computational work is much less dependent on the sample size n and table size I x J than for methods for exact analysis. For a method of simulating tables that involves taking a random permutation of n integers, Agresti, Wackerly and Boyett (1979) noted that the CPU time is approximately linear in n and stable in I and J. Patefield (1981) provided a method that is more efficient for large n. Although it takes longer to generate each table with Monte Carlo methods, only a relatively small number need to be generated. In principle, this method could be used to approximate precisely any exact analysis, including those for which exact calculations may never be feasible. Mehta, Pate1 and Senchaudhuri (1988) described a more sophisticated and faster Monte Carlo approach, using importance sampling. Tables are sampled from the conditional reference set in proportion to their importance for reducing the variance of the estimated p-value, rather than in proportions corresponding to their hypergeometric probabilities. In importance sampling, each Sampled table provides an estimate of the p-value that is designed to be much better than the crude Bernoulli estimate provided by Monte Carlo Sampling. The tables are sampled using a network algorithm. For linear rank tests for 2 x J tables, they noted that importance sampling can be up to four orders of magnitude more efficient than Monte Carlo sampling. That is, to achieve a certain fixed accuracy with a p-value estimate, the ratio of the number of tables sampled using the Monte Carlo approach versus importance sampling was about 10,000 for tests such as the trend test. However, the initial overhead involved in using importance sampling, due partly to using backward induction with the network algorithm to set up the networkbased sampling scheme, makes it inefficient for certain very large data sets. ' 7.2 Software Until recently, software for exact methods for contingency tables was nearly nonexistent, at least in the most popular statistical packages. Even now, nearly all packages can perform Fisher's exact test but little if anything else. With a couple of exceptions, our discussion here is limited to the most commonly used packages. SAS (using procedure FREQ) and IMSL (using routine CTPRB) can perform Fisher's exact test and the Freeman-Halton extension for I x J tables, but do not give options to perform tests that base the p-values on goodness-of-fit statistics or 146 A. AGRESTI ordinal statistics. SAS uses the network algorithm from Mehta and Pate1 (1983), whereas IMSL uses an algorithm that enumerates the entire reference set. Thus, although neither program seems to have limits on table sizes, SAS can handle a much greater variety of tables in a reasonable amount of time. Currently, BMDP and SPSSXonly perform Fisher's exact test. All these packages seem to use the table probability as the basis of ordering the reference set for two-sided p-values. StatXact (1991) is a statistical package specializing in exact nonparametric inference and in exact inference for contingency table problems. Developed by Mehta and Pate1 and colleagues, it uses versions of the network algorithm described in their articles. For 2 x J tables, StatXact performs a general linear rank test that includes as special cases the Wilcoxon test and a trend test with arbitrary scores. For I x J tables with min(I, J ) I 5, it can perform the Freeman-Halton test and exact tests of independence using the Pearson or likelihoodratio chi-squared statistics. For tables with ordered columns, it can perform the exact test using the Kruskal-Wallis statistic when max(I, J ) I 5. When rows are also ordered, it can perform the exact test of linear-by-linear association described by Agresti, Mehta and Pate1 (1990) when min(I, J ) s 5, and it can use the Jonckheere-Terpstra statistic when I s 5. For 2 x 2 x K tables, it performs Birch's exact test of conditional independence (for K I 200), Zelen's exact test of homogeneity of odds ratios, and "exact" confidence intervals for an assumed common odds ratio. For 2 x J x K tables, StatXact performs str-atified linear rank tests and can perform inference for parameters in conditional logistic regression models. For I x J and 2 x J x K tables, StatXact performs Monte Carlo sampling for cases beyond its capability for exact inference (as long as I s 50, J s 50, K 5 200, and I J K s 2500), and it can perform importance sampling for 2 x J tables. The StatXact manual is also a good reference for many examples of exact conditional analyses. Many of the StatXact routines are also available in the EGRET statistical software package (EGRET, 1991). Baptista and Pike (1977) gave a FORTRAN program for the Sterne-type confidence interval for the odds ratio in a single 2 x 2 table. For 2 x 2 x K tables, Thomas (1975) gave a FORTRAN program for the conditional ML estimate and an "exact" confidence interval for a common odds ratio, and for exact tests of conditional independence and no three-factor interaction (when K I 20). This program can be slow since, unlike algorithms presented by Pagano and Tritchler (1983b) and StatXact (1991), it requires evaluating every table in the conditional reference set. Vollset and Hirji (1991) gave a fast GAUSS program for the exact test of conditional independence and confidence interval for a common odds ratio, and indicated that it can handle up to about 1,000 points in the distribution of C , n,,, . 8. OTHER APPROACHES TO EXACT INFERENCE This discussion has focused on the classical conditional approach to exact inference for contingency tables. This section discusses controversies regarding that approach and describes alternative approaches that produce results having some connection with those for exact conditional methods. 8.1 Controversy Over Exact Conditional Approaches Most of the debate about exact conditional methods for categorical data has focused on their use with Fisher's exact test when both margins of the table are not naturally fixed. I discuss the controversy only briefly, as it has already generated an enormous literature. See, for instance, Barnard (1945, 1947, 1949, 1979, 1989), Berkson (1978), Basu (1979), Kempthorne (1979), Upton (1982), Suissa and Shuster (1984, 1985), Yates and discussants (1984), Bhapkar (1986), Haber (1987, 1989), D'Agostino, Chase and Belanger (1988), Lloyd (1988b), Rice (1988), Little (1989), Camilli (1990), Mehta and Hilton (1990), Routledge (1990), Storer and Kim (1990), and Greenland (1991). The perceived problem with the test results mainly from the conditional distribution of n,, (or the odds ratio) being highly discrete, much more so than when one or neither margin is fixed. This results in the test being quite conservative in a conditional or unconditional sense, when used with a fixed significance level such as a = 0.05. The actual probability of rejecting the null hypothesis may be considerably less than the nominal level. Proponents of Fisher's test (e.g., Yates, 1984) argue that (1) one should not use arbitrary fixed significance levels (versus simply reporting the p-value), (2) one should not average into the calcu lation of the p-value other tables whose marg i n a l ~did not occur, and (3) no substantive loss of information about H,, results from conditioning on the marginals (i.e., the marginal counts are approximately ancillary). The randomized-decision version of Fisher's exact test, using randomization on the boundary of a critical region to achieve a fixed significance level a , is UMPU (Tocher, 1950). This is of little solace for practical work, since randomization is not used, although it indicates that Fisher's test may also 147 EXACT INFERENCE FOR CONTINGENCY TABLES perform well for the mid-p definition of a p-value. The mid-p value is half the probability of the observed result plus the probability of more extreme values. For discrete data, the mid-p value has null behavior more nearly like a uniform ( 0 , l ) random variable than the ordinary p-value (e.g., its null expected value is 0.5, and the sum of its two onetailed p-values equals 1.0). It has been recommended (e.g., by Lancaster, 1961; Plackett in the discussion of Yates, 1984; Barnard, 1989, 1990) as a good compromise between having a conservative test and using randomization on the boundary to eliminate problems from discreteness. For comparing two binomial probabilities n1 and a,, Hirji, Tan and Elashoff (1991) noted that the mid-p adjustment to Fisher's exact test has actual levels of significance closer to nominal levels than do classical asymptotic tests. This is especially true when the common value of i n i ) is near 0 or 1or when the sample sizes nl+ and n,, are quite different. Haber (1986b) obtained similar conclusions both for comparing binomials and for multinomial sampling over the four cells. Hirji (1991) showed that, for tests of parameters in conditional logistic models for case-control designs with unmatched binary covariates, mid-p adjustments to an exact test perform well in approximating nominal levels compared to the exact test and asymptotic score tests. Analogous remarks apply to "exact" interval estimation. Although necessarily conservative, "exact" interval estimates for the odds ratio can be so highly conservative as to be less useful than asymptotic large-sample approaches in terms of long-run coverage performance. An adaptation of Cornfield's "exact" method uses a mid-p adjustment, choosing 0 endpoints that have one-sided mid-p values of a / 2 . This approach does not guarantee the desired coverage probability, but simulations by Mehta and Walsh (1992) showed that it performs well in this respect. For small samples, the mid-p-adjusted intervals can be much narrower. For instance, for Fisher's tea-tasting data, having ,rows (3,1/1,3), the "exact" 95%' confidence interval' is (0.21,626.2), and the mid-p-based 95% confidence interval is (0.31,308.6). Vollset, Hirji and Afifi (1991) showed similar good performance of the mid-p approach for interval estimation of parameters in conditional logistic designs. 8.2 Exact Unconditional Approach Other methods for 2 x 2'tables have been motivated by the controversy about the conditioning argument and the conservativeness of Fisher's exact test. Some authors argue that it is better to use a robust asymptotic test than a possibly highly conservative exact test (e.g., D'Agostino, Chase and Belanger, 1988). Others recommend using an "exact" unconditional test. To illustrate the latter approach, consider testing n, = n, (0 = 1)for two independent binomial samples. One first computes an exact p-value P ( a ) for each possible common value n of n, and n,. For this p-value, one might use the product binomial probability that a z statistic (or chi-squared statistic) for comparing two proportions is at least as large as observed, when n is the value of the nuisance parameter. The global p-value for the test with unknown n is then P = sup, P(n). Proposed by Barnard (1945, 1947), this test was disavowed by him in later publications (e.g., Barnard, 1949, discussion of Yates, 1984). Boschloo (1970), McDonald, Davis and Milliken (1977), Suissa and Shuster (1984, 1985), Haber (1986a, 1987, 1989), Shuster (1988) discussed computational implementation of the unconditional test. Suissa and Shuster (1991) extended it to comparisons of dependent proportions for matched pairs. The 2 x 2 table having entries (3,0/0,3), discussed by Barnard (1945) and Fisher (1945) [see also Little (1989) and Routledge (1990)1, illustrates that results with the conditional and unconditional approaches can be quite discre ant. For HI: 0 # 1, = 0.100, and the the Fisher p-value is 2(:)(:) asymptotic Pearson chi-squared test has a p-value of 0.014. For the exact binomial test of n, = a, having only fixed row totals (3,3), the Pearson chi-squared value of 6.0 that occurs for the observed table and for table (0,3/3,0) is the maximum possible, and the p-value for given nuisance parameter a is 2a3(1 - a)3. The supremum of this over 0 Ia I1occurs at n = 112, giving p-value = 1/32. This unconditional p-value is related to Fisher conditional p-values by fi:) 6 P ( X 2 2 6) = p(x22 6 1 n,, = k) P ( n + , = k ) . k=O But the conditional probability that X 2 2 6 is zero except when n,, = 3, so the unconditional p-value is a weighted average of the Fisher p-value for the observed column marginals and p-values of 0 corresponding to the impossibility of getting results as extreme as observed if other marginals had occured, that is, 1/32 = 0.10[(:)(1/2)'], where the term in brackets is the binomial probability (when a = 112) of the column marginals (3,3). Fisher (1945) remarked, "It is my view that the existence of these less informative possibilities should not affect our judgment of significance based on the series actually observed. . . . The fact that such an unhelpful outcome as these might occur, or must 148 A. AGRESTI occur with a certain probability, is surely no reason for enhancing our judgment of significance in cases where it has not occurred;. . . it is only the sampling distribution of samples of the same type that can supply a rational test of significance." The unconditional test is so computationally intensive that it seems complex to extend it to larger contingency-table problems (Mehta and Hilton, 1990). In addition, removing nuisance parameters by taking a supremum may itself produce quite conservative analyses for tables having larger numbers of such parameters. It should be noted that asymptotic procedures can also be quite conservative. For instance, Koehler and Larntz (1980) noted that the likelihood-ratio test tends to be highly conservative when most expected frequencies are smaller than 0.5. To illustrate, consider the 3 x 9 table (0, 7, 0, 0, 0, 0, 0, 1, 111, 1, 1, 1, 1,1,1,0,0/0,8,0,0,0,0,0,0,0), discussed in the StatXact manual. For the likelihood-ratio statistic, the asymptotic p-value is 0.0837 and the exact p-value is 0.0015; for the Pearson statistic, the values are 0.1342 and 0.0013. 8.3 Bayesian Approaches For 2 x 2 tables, Bayesian approaches using certain "conservative" prior distributions give results equivalent to conditional tests. For instance, Altham (1969) gave an exact Bayesian analysis comparing parameters for two independent binomial samples. She tested H,: a, 5 a, against a, > a, using a beta(a,, Pi) prior distribution for a,; i.e., the prior for a, is proportional to (?ri)OL(l - a,)@, with a = a, - 1 and,@= Pi - 1, i = 1,2. The posterior distributigns are beta(a:, (3:) with a: = a i nil and Pi = Pi + ni2.Taking the Bayesian p-value to be the posterior probability that a, 4 ?r, Altham showed this equals the one-sided p-value for Fisher's exact test when one uses improper prior distributions (a,, (3,) = (1,O) and (a,, (3), = (0,l). This represents prior belief favoring the null hypothesis, in effect penalizing oneself against concluding that a, > ?r,. If a, = =,y, i = 1,2, where 0 +y I1, Altham showed that the Bayesian pvalue is smaller than the Fisher p-value, and the difference between the two is no greater than the null probability of the observed data. Altham (1971) also gave Bayesian analyses for dependent proportions. For a simple model in which the probability of success is the same for each subject at a given occasion, 'she again showed that the classical p-value is a Bayesian p-value for a prior distribution favoring the null hypothesis. For a model similar to Cox's, in which the probability of success varies by subject but the occasion effect + @, is constant, she showed that the Bayesian evidence against the null is weaker as the number of pairs giving the same response at both occasions increases, for fixed n,, and n,,. This result differs, and is perhaps more intuitively pleasing, than the exact conditional ML result. For examples of other Bayesian analyses for 2 x 2 tables, see Leonard (1975), Chen and Novick (1984) and Nurminen and Mutanen (1987). 9. FUTURE RESEARCH By the turn of the century, we should see advances in applicability of exact methodology for contingency tables at least comparable to those of the past decade. One does not need a crystal ball to predict that computer speed will continue to increase and algorithms will be further improved, so that tables not now feasible for analysis soon will be. In addition, it is reasonable to expect development of algorithms to handle new types of categorical data, in particular, more complex relationships for larger tables in higher dimensions. This article has emphasized exact inference for contingency tables in the context of loglinear modeling. I believe that presenting the methods in the context of inference for parameters in models helps to unify a variety of exact conditional methods. It also helps to identify areas in which additional research would yield fruitful results, such as exact inferences for I x J x K tables and a general goodness-of-fit test for loglinear models. This section summarizes other avenues for future research. An important but difficult area for future research is exact analysis of model parameters for contingency tables that are large and sparse. For such data, standard asymptotic methods behave poorly; yet, it has been impractical to apply exact methods. The size of the conditional reference set is often too large to handle when the table has a large number of cells. This can also happen when the table is small but contains very large as well as small cell counts. For such problematic tables, there should be additional research on (1) hybrid algorithms that use exact methods for some parts of the computation and approximations for other parts (Baglivo, Olivier and Pagano, 1988), (2) fast ways of simulating the exact distribution (Kreiner, 1987; Mehta, Patel and Senchaudhuri, 1988), and (3) better asymptotic approximations (e.g., Koehler, 1986), including saddlepoint approximations (Davison, 1988; Booth and Butler, 1990; Bedrick and Hill, 1992; Pierce and Peters, 1992). An area in which large, sparse tables commonly occur is modeling of longitudinal data with categorical responses. Conditioning on sufficient statistics EXACT INFERENCE FOR CONTINGENCY TABLES to eliminate nuisance parameters, such as subject random effects, can be very helpful. Even then, the large number of cells in the table make exact methods or approximations for them quite challenging. Another complication is that some useful models do not have reductions of data through sufficiency, such as loglinear and logistic models for the marginal probabilities rather than joint cell probabilities. An area that has scope for lots of additional work is exact inference for I x J x K tables. Section 4.4 described several exact inferences for standard loglinear models. Other exact inferences are of interest when at least one variable is ordinal. In testing conditional independence, one could use test statistics designed to improve power for narrower alternatives. For instance, Section 4.4 discussed the possibility of testing conditional independence against a linear-by-linear alternative. When X is nominal and Y is ordinal, one could use a statistic analogous to a stratified Kruskal-Wallis statistic. Such a statistic results from comparing (XZ, YZ) to a loglinear model in which the X-Y partial association term has the form p i yj, for parametric row effects { p i ) and rank scores { y j ) . Efficient score statistics for these cases are equivalent to statistics proposed by Birch (1965) and Landis, Heyman and Koch (1978) for large-sample testing of conditional independence. Similar remarks apply to exact tests of no three-factor interaction. Greater power for narrower alternatives (unsaturated models for three-factor interaction) would be achieved by using statistics that treat one, two or all three variables as ordinal. Of course, exact inferences for standard loglinear models and for models recognizing ordinality are also relevant for four-way and higher dimensional tables. Many tests of mutual independence or conditional independence can be re-expressed in terms of two-way or three-way tables, but not all. Also, complications may result in tests about three-factor and higher order interactions. The reference set of 'tables having the required fixed 'marginals may be difficult to generate or simulate. For instance, in a four-way table, an exact test of no three-factor interaction deals with a reference set in which all six two-way margins are fixed. There is unlikely to be as much controversy with exact conditional methods for models for higher dimensional tables as there has been with Fisher's exact test for 2 x 2 tables. That controversy is probably due less to philosophical disagreements about conditioning than to practical consequences of using. continuously defined measures such as p-values with highly discrete distributions. Except 149 in extreme cases in which nearly all observations fall in one or two marginal categories for each classification, sampling distributions are much "less discrete" for larger tables. As the number of dimensions or the number of response categories per dimension increases, the sampling distributions that determine exact tests and confidence intervals rapidly approach a continuous form. Thus, the conservativeness of the inferences becomes less problematic. For instance, performance of exact conditional tests should be closer to that of their randomized versions, many of which are UMPU. Also, the three approaches discussed for confidence intervals for odds ratios (test-based with a / 2 in each tail, adapted Sterne, test-based with mid-p values) should all have true confidence levels close to the nominal one. For cases in which discreteness is problematic, there should be more work on developing adaptations of exact methods that, while perhaps sacrificing "exactness," give performance closer to the randomized exact versions. Examples are adaptations of exact tests and confidence intervals using the mid-p value (e.g., Hirji, Tan and Elashoff, 1991; Mehta and Walsh, 1992), and tests and confidence intervals that use the exact distribution computed at the ML estimate of the nuisance parameter (e.g., Conlon and Thomas, 1990; Storer and Kim, 1990). So far there has been relatively little attention paid to algorithms for calculating power for exact conditional tests. The literature refers almost exclusively to 2 x 2 tables. For instance, Gail and Gart (1973), Haseman (1978) and Suissa and Shuster (1985) discussed the determination of sample size for obtaining desired powers in fixed-a use of Fisher's exact test. Finally, it would be very useful to have a general-purpose algorithm for exact tests comparing two nested loglinear models. This would unify exact tests of goodness of fit for all loglinear models, as well as tests against unsaturated alternatives. In summary, much has been accomplished in the past decade, but this is only the tip of the iceberg of what can be done. In the future, statistical practice for categorical data should place more emphasis on "exact" methods and less emphasis on methods using possibly inadequate large-sample approximations. ACKNOWLEDGMENTS This work was partially supported by NIH Grant GM43824. The author appreciates helpful comments on manuscript organization and content from Dr. Cyrus Mehta and two referees. Computations 150 A. AGRESTI for many of the examples presented in this article were obtained using StatXact (1991). REFERENCES AGRESTI, A. (1990). Categorical Data Analysis. Wiley, New York. AGRESTI,A., MEHTA,C. R. and PATEL,N. R. (1990). Exact inference for contingency tables with ordered categories. J. Amer. Statist. Assoc. 85 453-458. AGRESTI,A. and WACKERLY, D. (1977). Some exact conditional tests of independence for R x C cross-classification tables. Psychometrika 42 111-125. AGRESTI, A., WACKERLY, D. and BOYETT,J. (1979). Exact conditional tests for cross-classifications: Approximation of a t - tained significance levels. Psychometrika 44 75-83. ALTHAM,P. M. E. (1969). Exact Bayesian analysis of a 2 x 2 contingency table and Fisher's 'exact' significance test. J. Roy. Statist. Soc. Ser. B 31 261-269. ALTHAM, P. M. E. (1971). The analysis of matched proportions. Bwmetrika 58 561-576. ANDERSEN, A. H. (1974). Multidimensional contingency tables. Scand. J. Statist. 1 115-127. ARMITAGE, P. (1955). Tests for linear trends in proportions and frequencies. Biometries 11 375-386. J., OLIVIER, D. and PAGANO, M. (1988). Methods for the BAGLIVO, analysis of contingency tables with large and small cell counts. J. Amer. Statist. Assoc. 83 1006-1013. BAKER,R. J. (1977). Exact distributions derived from two-way tables. J. Roy. Statist. Soc. Ser. C 26 199-206. BALMER, D. W. (1988). Recursive enumeration of r x c tables for exact likelihood evaluation. J. Roy. Statist. Soc. Ser. C 37 290-301. BAPTISTA,J. and PIKE, M. C. (1977). Exact two-sided confidence limits for the odds ratio in a 2 x 2 table. J. Roy. Statist. Soc. Ser. C 26 214-220. BARNARD, G. A. (1945). A new test for 2 x 2 tables (Letter to the Editor). Nature 156 177. BARNARD,G. A. (1947). Significance tests for 2 x 2 tables. Biometrika 34 123-138. BARNARD, G. A. (1949). St'atistical inference. J. Roy. Statist. Soc. Ser. B 11 115-139. BARNARD, G. A. (1979). In contradiction to J. Berkson's dis- praise: Conditional tests can be more efficient. J. Statist. Plann. Inference 3 181-188. BARNARD, G. A. (1989). On alleged gains in power from lower p-values. Statistics in Medicine 8 1469-1477. G. A. (1990). Must clinical trials be large? The inter- BARNARD, pretation of P-values and the combination of test results. Statistics in Medicine 9 601-614. BASU,D. (1979). Discussion of "In dispraise of the exact test" by J. Berkson. J. Statist. Plann. Inference 3 189-192. BAYER,L. and Cox, C. (1979). Exact tests of significance in binary regression models. J. Roy. Statist. Soc. Ser. C 28 319-324. BEDRICK, E. J. and HILL,J. R. (1990). Outlier tests for logistic regression: A conditional approach, Bwmetrika 77 815-827. BEDRICK, E. J. and HILL, J. R. (1992). An empirical assessment of saddlepoint approximations for testing a logistic regression parameter. Bwmetrics. To$appear. BERKSON, J. (1978). In dispraise of the exact test. J. Statist. Plann. Inference 2 27-42. BHAPKAR,V. P. (1986). On conditionality and likelihood with nuisance parameters i n models for contingency tables. Technical Report 253, Dept. Statistics, Univ. Kentucky. BIRCH,M. W. (1964). The detection of partial association I: The 2 x 2 case. J. Roy. Statist. Soc. Ser. B 26 313-324. BIRCH,M. W. (1965). The detection of partial association 11: The general case. J. Roy. Statist. Soc. Ser. B 27 111-124. BISHOP,Y. M. M. (1971). Effects of collapsing multidimensional contingency tables. Biometries 27 545-562. BOOTH,J. and BUTLER,R. (1990). Randomization distributions and saddlepoint approximations in generalized linear mod- els. Biometrika 77 787-796. BOSCHLOO, R. D. (1970). Raised conditional level of significance for the 2 x 2 table when testing for the equality of two probabilities. Statist. Neerlandica 21 1-35. BOULTON, D. M. (1974). Remark on algorithm 434. Comm. ACM 17 326. BOYETT,J. (1979). Random R x C tables with given row and column totals. J. Roy. Statist. Soc. Ser. C 28 329-332. BRESLOW, N. E. and DAY,N. E. (1980). The Analysis of CaseControl Studies. IARC Scientific Publications No. 32, Lyon, France. CAMILLI, G. (1990). The test of homgeneity for 2 x 2 contingency tables: A review of and some personal opinions on the controversy. Psychological Bulletin 108 135-145. CANTOR, A. B. (1979). A computer algorithm for testing significance in M x K contingency tables. In Proceedings of the Statistical Computing Section 220-221. Amer. Statist. Assoc., Washington, DC. CHEN,J. J. and NOVICK,M. R. (1984). Bayesian analysis for binomial models with general beta prior distributions. Journal of Educational Statistics 9 163-175. COCHRAN, W. G. (1954). Some methods of strengthening the common x 2 tests. Biometries 10 417-451. H. B. COHEN,A. and SACKROWITZ, (1991). Tests for indepen- dence in contingency tables with ordered alternatives. J. Multivariate Anal. 36 56-67. CONLON, M. and THOMAS, R. G. (1990). A new coddence inter- val for the difference of two binomial proportions. Comput. Statist. Data Anal. 8 237-241. J. M. and TUKEY,J. W. (1965). An algorithm for the COOLEY, machine calculation of complex Fourier transforms. Math. Comp. 12 297-301. J. (1956). A statistical problem arising from retro- CORNFIELD, spective studies. Proc. Third Berkeley Symp. Math. Statist. Probab. 4 135-148 Univ. California Press, Berkeley. Cox, D. R. (1958a). Two further applications of a model for binary regression. Biometrika 45 562-565. Cox, D. R. (195813). The regression analysis of binary sequences (with discussion). J. Roy. Statist. Soc. Ser. B 20 215-242. Cox, D. R. (1966). A simple example of a comparison involving quanta1 data. Biometrika 53 215-220. Cox, D. R. (1970). Analysis of Binary Data. Chapman and Hall, London. Cox, M. A. A. and PLACKETT, R. L. (1980). Small samples in contingency tables. Biometrika 67 1-13. CRESSIE,N. and READ,T. R. C. (1989). Pearson x 2 and the loglikelihood ratio statistic G2: A comparative review. In- ternat. Statist. Rev. 57 19-43. D'AGOSTINO, R. B., CHASE,W. and BELANGER, A. (1988). The appropriateness of some common procedures for testing the equality of two independent binomial populations. Amer. Statist. 42 198-202. DAVIS,L. J. (1986). Exact tests for 2 by 2 contingency tables. Amer. Statist. 40 139-141. DAVISON,A. C. (1988). Approxima_te conditional inference in generalized linear models. J. Roy. Statist. Soc. Ser. B 50 445-461. DUPONT, W. D. (1986). Sensitivity of Fisher's exact test to minor perturbations in 2 x 2 contingency tables. Statistics in Medicine 5 629-635. EXACT INFERENCE FOR CONTINGENCY TABLES EGRET. (1991). EGRET Statistical Software. Statistics and Epidemiology Research Corporation, Seattle. FISHER,R. A. (1934). Statistical Methods for Research Workers. (Originally published 1925, 14th ed. 1970.) Oliver and Boyd, Edinburgh. FISHER,R. A. (1935a). The logic of inductive inference (with discussion). J. Roy. Statist. Soc. Ser. A 98 39-82. FISHER,R. A. (1935b). The Design of Experiments (8th ed. 1966). Oliver and Boyd, Edinburgh. FISHER,R. A. (1945). A new test for 2 x 2 tables (Letter to the Editor). Nature 156 388. FISHER,R . A. (1950). The significance of deviations from expec- tation in a Poisson series. Biometrics 6 17-24. FREEMAN, G. H. and HALTON,J. H. (1951). Note on a n exact treatment of contingency, goodness of fit and other problems of significance. Biometrika 38 141-149. GAIL,M. H. and GART,J. J. (1973). The determination of sample sizes for use with the exact conditional test in 2 x 2 compar- ative trials. Bwmetrics 29 441-448. GAIL,M. H. and MANTEL,N. (1977). Counting the number of r x c contingency tables with fixed margins. J. Amer. Statist. Assoc. 72 859-862. GART,J. J. (1969). An exact test for comparing matched propor- tions in crossover designs. Biometrika 56 75-80. GART,J. J. (1970). Point and interval estimation of the common odds ratio in the combination of 2 x 2 tables with fixed marginals. Biometrika 57 471-475. J. L. (1988). Statistical Reasoning in Law and Pub- GASTWIRTH, lic Policy 1. Academic, San Diego. GIBBONS, J. D. and PRATT,J. W. (1975). P-values: Interpretation and methodology. Amer. Statist. 29 20-25. GOOD,I. J. (1976). On the application of symmetric Dirichlet distributions and their mixtures to contingency tables. Ann. Statist. 4 1159-1189. GOOD,I. J. (1977). The enumeration of arrays and a generaliza- tion related to contingency tables. Discrete Math. 1 9 23-45. GOOD,I. J. (1982). The fast calculation of the exact distribution of Pearson's chi-squared and of the number of repeats within the cells of a multinomial by using a Fast Fourier Trans- form. J. Statist. Comput. Simulation 14 71-78. GOOD,I. J., GOVER,T. NF and MITCHELL, G. J. (1970). Exact distributions for x 2 and for the likelihood-ratio statistic for the equiprobable multinomial distribution. J. Amer. Statist. Assoc. 65 267-283. GOODMAN, L. A. (1979). Simple models for the analysis of associ- ation in cross-classifications having ordered categories. J. Amer. Statist. Assoc. 74 537-552. GRAUBARD, B. I. and KORN,E. L. (1987). Choice of column scores for testing independence in ordered 2 x K tables. Biomet- ries 43 471-476. GREENLAND, S. (1991). On the logical justification of conditional tests for two-by-two contingency tables. Amer. Statist. 45 248-251. HABER,M. (1986a). An exact unconditional test for the 2 x 2 comparative trial. Psychological Bulletin 99 129-132. HABER,M. (198613). A modified exact test for 2 x 2 contingency tables. Biometrical J. 28 455-463. HABER,M. (1987). A comparison of some conditional and uncon- ditional exact tests for 2 by 2 contingency tables. Comm. Statist. Simulation Comput. 16 999-1013. HABER,M. (1989). 3 0 the marginal totals of a 2 x 2 contingency table contain information regarding the table proportions? Comm. Statist. Theory Methods 18 147-156. HABERMAN, S. J. (1974). Log-linear models for frequency tables with ordered classifications. Biometrics 36 589-600. HABERMAN, S. J. (1977). Log-linear models and frequency tables with small expected cell counts. Ann. Statist. 5 1148-1169. 151 HABERMAN, S. J. (1988). A warning on the use of chi-squared statistics with frequency tables with small expected cell counts. J. Amer. Statist. Assoc. 83 555-560. HASEMAN, J. K. (1978). Exact sample sizes for use with the Fisher-Irwin test for 2 x 2 tables. Biometrics 34 106-110. HEALY,M. J. R. (1969). Exact tests of significance in contin- gency tables. Technometrics 11 393-395. HILTON, J. F., MEHTA,C. R. and PATEL,N. R. (1991). Exact Smirnov tests using a network algorithm. Technical Report 14, Dept. Epidemiology and Biostatistics, Univ. California, San Francisco. HIRJI, K. F. (1991). A comparison of exact, mid-P, and score tests for matched case-control studies. Biometrics 47 487-496. HIRJI,K. F., MEHTA,C. R. and PATEL,N. R. (1987). Computing distributions for exact logistic regression. J. Amer. Statist. Assoc. 82 1110-1117. HIRJI, K. F., MEHTA,C. R. and PATEL,N. R. (1988). Exact inference for matched case-control studies. Biometrics 44 803-814. HIRJI,K. F., TAN,S. and ELASHOFF, R. M. (1991). A quasi-exact test for comparing two binomial parameters. Statistics in Medicine 10 1137- 1153. HIRJI, K. F., TSIATIS,A. A. and MEHTA,C. R. (1989). Median unbiased estimation for binary data. Amer. Statist. 43 7-11. IRWIN,J. 0. (1935). Tests of significance for differences between percentages based on small numbers. Metron 12 83-94. 0. (1979). In dispraise of the exact test: Reactions. KEMPTHORNE, J. Statist. Plann. Inference 3 199-213. KLOTZ,J. (1966). The Wilcoxon, ties, and the computer. J. Amer. Statist. Assoc. 61 772-787. KLOTZ,J. and TENG,J. (1977). One-wajl layout for counts and the exact eumeration of the Kruskal-Wallis H distribution with ties. J. Amer. Statist. Assoc. 72 165-169. KOEHLER, K. (1986). Goodness-of-fittests for log-linear models in sparse contingency tables. J. Amer. Statist. Assoc. 81 483-493. KOEHLER, K. and LARNTZ, K. (1980). An empirical investigation of goodness-of-fit statistics for sparse multinomials. J. Amer. Statist. Assoc. 75 336-344. KREINER,S. (1987). Analysis of multidimensional contingency tables by exact conditional tests: Techniques and strategies. Scand. J. Statist. 14 97-112. KURITZ,S. J., LANDIS,J. R. and KOCH,G. G. (1988). A general overview of Mantel-Haenszel methods: Applications and re- . cent developments. Annual Review of Public Health 9 123-160. LANCASTER, H. 0. (1961). Significance tests in discrete distributions. J. Amer. Statist. Assoc. 56 223-234. E. R. and KOCH,G. G. (1978). Average LANDIS,J. R., HEYMAN, partial association in three-way contingency tables: A re- view and discussion of alternative tests. Internat. Statist. Rev. 46 237-254. LEHMANN, E. L. (1986). Testing Statistical Hypotheses, 2nd ed. Wiley, New York. T. (1975). Bayesian estimation methods for two-way LEONARD, contingency tables. J. Roy. Statist. Soc. Ser. B 37 23-37. LITTLE,R. J. A. (1989). Testing the equality of two independent binomial proportions. Amer. Statist. 43 283-288. LLOYD, C. J. (1988a). Doubling the one-sided P-value in testing independence in 2 x 2 tables against a two-sided alterna- tive. Statistics in Medicine 7 1297-1306. LLOYD,C. J. (1988b). Some issues arising from the analysis of 2 x 2 contingency tables. Austral. J. Statist. 30 35-46. 152 A. AGRESTI MANTEL, N. (1963). Chi-square tests with one degree of freedom: Extensions of the Mantel-Haenszel procedure. J . Amer. Statist. Assoc. 58 690-700. MANTEL,N. (1987). Exact tests for 2 x 2 contingency tables (Letter). Amer. Statist. 41 159. MANTEL,N. and HANKEY,B. J. (1971). Programmed analysis of a 2 x 2 contingency table. Amer. Statist. 25 40-44. MARCH,D. L. (1972). Exact probabilities for R x C contingency tables. Comm. ACM 15 991-992. P. (1986). The conditional distribution of goodness- MCCULLAGH, of-fit statistics for discrete data. J. Amer. Statist. Assoc. 81 104-107. MCCULLAGH, P. and NELDER,J. A. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, London. L. L., DAVIS,B. M. and MILLIKEN, MCDONALD, G. A. (1977). A nonrandomized unconditional test for comparing two pro- portions in 2 x 2 contingency tables. Technometrics 19 145-158. MCNEMAR, Q. (1947). Note on the sampling error of the differ- ence between correlated proportions or percentages. Psy- chometrika 12 153-157. MEHTA,C. and HILTON,J. (1990). Exact power of conditional and unconditional tests: Going beyond the 2 x 2 contingency table. Unpublished manuscript. MEHTA,C. R. and PATEL,N. R. (1983). A network algorithm for performing Fisher's. exact test in r x c contingency tables. J. Amer. Statist. Assoc. 78 427-434. MEHTA,C. R. and PATEL,N. R. (1986). FEXACT: A Fortran subroutine for Fisher's exact test in unordered r x c contin- gency tables. ACM Trans. Math. Soffware 12 154-161. MEHTA,C. R. and WALSH,S. J. (1992). Comparison of exact, mid-p, and Mantel-Haenszel confidence intervals for the common odds ratio across several 2 x 2 contingency tables. Amer. Statist. To appear. MEHTA,C. R., PATEL,N. R. and GRAY,R. (1985). Computing a n exact confidence interval for the common odds ratio in several 2 by 2 contingency tables. J. Amer. Statist. Assoc. 80 969-973. MEHTA,C. R., PATEL,N. R. and SENCHAUDHURI, P. (1988). Importance sampling for estimating exact probabilities in permutational inference. J. Amer. Statist. Assoc. 8 3 999-1005. P. (1991). Exact MEHTA,C . R., PATEL,N. R. and SENCHAUDHURI, stratified linear rank tests for binary data. In ComputingScience and Statistics: Proceedings of the 23rd Symposium on the Interface. m. M. Keramidas., ed.), 200-207. Interface ~oundation;Fairfax Station, VA. MEHTA,C. R., PATEL,N. R. and T s r ~ n s A. , A. (1984). Exact significance testing to establish treatment equivalence with ordered categorical data. Bwmetrics 40 819-825. MORRIS,C. N. (1975). Central limit theorems for multinomial s p . Ann. Statist. 3 165-188. NEYMAN, J. (1935). On the problem of confidence limits. Ann. Math. Statist. 6 111-116. NURMJNEN, M. and MUTANEN, P. (1987). Exact Bayesian analy- sis of two proportions. Scand. J. Statist. 14 67-77. PAGANO,M. and HALVORSEN, K. T. (1981). An algorithm for finding the exact significance levels of r x c contingency tables. J. Amer. Statist. Assoc. 76 931-934. PAGANO, M. and TRITCHLER, D. (1983a). On obtaining permuta- tion distributions in polynomial time. J. Amer. Statist. Assoc. 78 435-440. PAGANO,M. and TRITCHLER,D. (1983b). Algorithms for the analysis of several 2 x 2 contingency tables. SZAM J . Sci. Statist. Comput. 4 302-309. PATEFIELD,W. M. (1981). An efficient method of generating - random R x C tables with given row and column totals. J . Roy. Statist. Soc. Ser. C 30 91-97. PATEFIELD, W. M. (1982). Exact tests for trends in ordered contingency tables. J . Roy. Statist. Soc. Ser. C 31 32-43. PEARSON,E . S. (1990). 'Student' A Statistical Biography of William Sealy Gosset (R. L. Plackett and G. A. Barnard, eds.). Clarendon Press, Oxford, England. PERITZ,E. (1982). Exact tests for matched pairs: Studies with covariates. Comm. Statist. Theory Methods 11 2165-2166. [Correction: (1983) 12 1209-1210.1 PIERCE,D. A. and PETERS,D. (1991). Practical use of higher-order asymptotics for multiparameter exponential families. J . Roy. Statist. Soc. To appear. RAO,C. R. (1973). Linear Statistical Inference and Its Applications, 2nd ed. Wiley, New York. RICE, W. R. (1988). A new probability model for determining exact P-values for 2 x 2 contingency tables when compar- ing binomial proportions. Biometrics 44 1-22. ROUTLEDGE, R. D. (1990). Resolving the controversy over Fisher's exact test. Paper presented a t the International Biometric Conference, Budapest. SANTNER, T. J. and SNELL,M. K. (1980). Small-sample confi- dence intervals for p, - p, and p, l p , in 2 x 2 contingency tables. J. Amer. Statist. Assoc. 75 386-394. SHAPIRO,S., SLONE,D., ROSENBERG, L., KAUFMAN, D., STOLLEY, P. D. and MIEWINEN,0 . S. (1979). Oral contraceptive use in relation to myocardial infarction. Lancet 8119 743-746. SHUSTER, J. J. (1988). EXACTB and CONF: Exact unconditional procedures for binomial data. Amer. Statist. 42 234. SOMS,A. P. (1985). Permutation tests for k-sample binomial data with comparisons of exact and approximate P-levels. Comm. Statist. Theory Methods 14 217-233. STATXACT. (1991). StatXact: Statistical Software for Exact Nonparametric Inference, Version 2. Cytel Software, Cambridge, Mass. c efiducial STERNE,T. E. (1954). Some remarks on c ~ ~ d e n or limits. Biometrika 41 275-278. STORER, B. E. and KIM,C. (1990). Exact properties of some exact test statistics for comparing two binomial proportions. J . Amer. Statist. Assoc. 85 146-155. STUMPF,R. H. and STEYN,H. S. (1986). Exact distributions associated with a n I x J x K contingency table. Comm. Statist. Theory Methods 15 1889-1904. J. J. (1984). Are uniformly most power- SUISSA,S. and SHUSTER, ful unbiased tests really best? Amer. Statist. 38 204-206. SUISSA,S. and SHUSTER, J. J. (1985). Exact unconditional sam- ples sizes for the 2 by 2 binomial trial. J. Roy. Statist. Soc. Ser. A 148 317-327. J. J. (1991). The 2 x 2 matched pair SUISSA,S. and SHUSTER, trial: Exact unconditional design and analysis. Biometries 47 361-372. THOMAS,D. G. (1971). Exact confidence limits for the odds ratio in a 2 x 2 table. J . Roy. Statist. Soc. Ser. C 20 105-110. THOMAS,D. G. (1975). Exact and asymptotic methods for the combination of 2 x 2 tables. Computers and Biomedical Research 8 423-446. TOCHER, K. D. (1950). Extension of the Neyman-Pearson theory of tests to discontinuous variates. Biometrika 37 130-144. TRITCHLER, D. (1984). An algorithm for exact logistic regression. J . Amer. Statist. Assoc. 79 709-711. UPTON,G. J. G. (1982). A comparison of alternative tests for the 2 x 2 comparative trial. J . Roy. Statist. Soc. Ser. A 145 86-105. VERBEEK,A. and KROONENBERG, P. M. (1985). A survey of algorithms for exact distributions of test statistics in r x c 153 EXACT INFERENCE FOR CONTINGENCY TABLES contingency tables with fixed margins. Comput. Statist. Data Anal. 3 159-185. VOLLSET, S. E. and HIRJI, K. F. (1991). A microcomputer program for exact and asymptotic analysis of several 2 x 2 tables. Epidemiology 2 217-220. VOLLSET, S. E., HIRJI,K. F. and AFIFI,A. A. (1991). Evaluation of exact and asymptotic interval estimators in logistic analysis of matched case-control studies. Biometrics 47 1311-1325. R. M. (1991). Fast VOLLSET, S. E . , HIRJI, K. F. and ELASHOFF, computation of exact confidence limits for the common odds ratio in a series of 2 x 2 tables. J. Amer. Statist. Assoc. 86 404-409. WHITE,A. A,, LANDIS,J. R. and COOPER,M. M. (1982). A note on the equivalence of several marginal homogeneity test criteria for categorical data. Internat. Statist. Rev. 50 27-34. YATES,F. (1934). Contingency tables involving small numbers and the x 2 test. J. Roy. Statist. Soc. Supp. 1 217-235. YATES,F. (1984). Tests of significance for 2 x 2 contingency tables. J. Royal Statist. Soc. Ser. A 147 426-463. ZELEN,M . (1971). The analysis of several 2 x 2 contingency tables. Biometrika 58 129-137. ZELEN,M. (1972). Exact significance tests for contingency tables embedded in a 2" classification. Proc. Sixth Berkeley Symp. Math. Statist. Probab. 1 737-757. Univ. California Press, Berkeley. D. (1987). Goodness-of-fit tests for large sparse ZELTERMAN, multinomial distributions. J. Amer. Statist. Assoc. 82 624-629. Comment Edward J. Bedrick and Joe R. Hill We congratulate Professor Agresti for his comprehensive review of exact inference with categorical data. We share his enthusiasm for exact conditional methods and believe that the coming years will produce many important computational breakthroughs in this area. The mechanics of conditioning on sufficient statistics to generate reference distributions for estimation, testing and model checking with loglinear models for Poisson data and logistic regression models for binomial data are well-known, but the utility of conditioning in these settings is not universally agreed upon. Furthermore, the role of conditioning- in the analysis of discrete generalized linear models with noncanonical link functions has received little attention from most of the statistical community. As a result, scientists and statisticians are familiar with conditional methods, but many are unsure how ,such methods should be incorporated into a n overall strategy for analyzing categorical data. We feel that the use and abuse of conditional methods will not be fully understood or appreciated without such a strategy. We hope that Professor Agresti's survey and the ensuing discussions stimulate further work in this direction. Edward J. Bedrick is an Assistant Professor, Department of Mathematics and Statistics, University of New Mexico, Albuquerque, New Mexico 87131. Joe R. Hill is an R&D Specialist at EDS Research, 5951 Jefferson Street, NE, Albuquerque, New Mexico 87109. CHECKING LOGISTIC REGRESSION MODELS We would like to convey some of our recent work on model checking for logistic regression and some of our thoughts regarding conditional inference. For the sake of simplicity, we assume that a single model is under consideration. A little notation is required. The usual logistic regression model has two distinct parts: a sampling component and a structural component. The sampling component specifies that Y = (Y,, . . . , Y,)' is a vector of independent binomial random variables with Y, Bin(m,, n,). The structural or regression component of the model is given by - logit ( n ) = XP, where logit(*) is a n n x 1 vector of log-odds with elements log{ a,/ ( l - n,)), X is a n n x p fullcolumn rank design matrix with ith row xi, and 0 is a p x 1 vector of unknown regression parameters. Under model (I), S = X' Y is sufficient for 0. Let 7i be the MLE of a under this model. The distribution of the data pr(Y; P), indexed by 0, can be factored into the marginal distribution of the sufficient statistic S, and the conditional distribution of the data given the sufficient statistic: Taking a Fisherian point of view (Fisher, 1950), inferences about 0 are based on pr(S; P), whereas model checks use the conditional distribution pr(Y 1 S). Letting sobs= X' yobs be the observed value of the sufficient statistic for the logistic model,