Document 6516938

Power Analysis: What is Available and What You Need to Know

Gerry Hobbs
Departments of Statistics and of Community Medicine
West Virginia University

Introduction

Planning should be the first step in any scientific experimentation. A critical part of planning is the determination of the sample size necessary to achieve the goals of the particular project. It is clearly a waste of resources to have too many experimental units (subjects, animals, ingots or whatever). It is equally a waste of resources to carry out an investigation with so few experimental units that there is little hope of demonstrating whatever effect it is that we wish to demonstrate. SAS® Software provides tools that can guide us in our effort to, at least, estimate sample size requirements.

It is not always understood that a considerable amount of information is required for power and sample size calculations done by hand (perish the thought). No less information is needed if we want software to do the calculations for us. The purpose of this tutorial presentation is to discuss the information required, a priori, in order to calculate sample size and power, and to discuss the implementation of those calculations in the SAS environment.

Information Requirements

Inferential statistics is all about making inferences about populations from the information available in samples. One of the more common manifestations of that activity is testing hypotheses about the value of a parameter in a population or, even more commonly, comparing the values of parameters in two or more populations. In the case of designed experiments the populations may be thought of as a set of objects observed under differing conditions. In an experiment the researcher controls the conditions. By way of contrast, in observational studies the conditions are not under the control of the researcher.
In either event the information we have available allows us to compute estimates of the population parameter(s). For those estimates to be useful for hypothesis testing we also need to know something about their accuracy. Ordinarily, that is measured as the (estimated) standard error of the estimate(s).

Consider the simple problem of testing the null hypothesis Ho: µ = 5 based on a sample large enough that the distribution of the sample mean may reasonably be assumed to be close to normal. We can calculate the sample mean easily enough and, if it is close to 5, accept Ho. If it is far from 5 we can reject Ho. Obviously the trick is to define “close to” and “far from”. We do that by dividing the observed difference between the estimate and 5 by the standard error and then using the normal distribution as a reference to determine the “chances” that such a difference would occur if the null hypothesis were actually true. Since the standard error of an estimator is a function of the standard deviation of the population, we need that standard deviation for our power calculations.

Power is the probability that you will reject Ho when Ho is not true. Assume Ho is not true and let Δ be the difference between the postulated value (here, 5) and the actual value. Clearly, it will be easier to reject Ho if Δ is 10 than if it is 1. In the aggregate there are five quantities that are entwined in a mathematical expression. They are: significance level (α), power (1-β), sample size (n), standard deviation (σ) and the difference (Δ). Given any four of them we may find the fifth. In certain cases, including a few that we will discuss, the standard deviation may be inferred from other information.

SAS PROCs

SAS Software contains two procedures that produce both tabular and graphical results of power and sample size calculations. They are PROC POWER and PROC GLMPOWER. The POWER procedure does power and sample size calculations for: t-tests (paired and independent), one-way ANOVA, tests of proportions (paired and independent), tests of correlation, regression, tests comparing survival curves, confidence intervals for means and certain equivalence tests. The GLMPOWER procedure handles more complex fixed effect linear modeling problems. We first will consider PROC POWER.

The syntax for PROC POWER begins, of course, with a PROC statement.

    PROC POWER;

The only option available on the PROC statement is PLOTONLY. The option suppresses the tabular results and produces just the graphical results. We will not display those here.

Following the procedure statement there is an analysis statement that specifies one of the ten types of analyses for which we can get power and sample size calculations. For example, if we were examining the sample size requirements for a simple ANOVA problem the statement would take the form

    ONEWAYANOVA <options>;

If the problem were one of paired proportions (the McNemar test) the statement would take the form

    PAIREDFREQ <options>;

The options available on each analysis statement depend on the particular test being contemplated. Some of the options, ALPHA= and POWER= for instance, are common to all analysis statements. As might be suspected, other options are tied to a particular kind of analysis. An additional statement

    PLOT <plot options>;

requests that the graph(s) associated with the preceding analysis statement be produced. Certain graph options are also available and are very similar to those found in SAS/GRAPH® Software.

Examples Using PROC POWER

Four examples of the application of PROC POWER will be shown in order to illustrate the different sorts of information that have to be supplied depending on the specific test being planned.
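Before turning to those examples, the statement layout described above can be sketched for the Ho: µ = 5 problem from the Information Requirements section. This is only an illustration: the alternative mean of 6 and the standard deviation of 2 are hypothetical values chosen here, not figures from the text. The block shows how four of the five quantities (α, power, σ and Δ) are supplied while the fifth (n) is left missing so that the procedure solves for it.

```sas
/* Sketch only: MEAN=6 and STDDEV=2 are assumed values for illustration */
PROC POWER;
   ONESAMPLEMEANS TEST=T
      NULLMEAN = 5       /* value postulated under Ho                */
      MEAN     = 6       /* assumed true mean, so Delta = 1          */
      STDDEV   = 2       /* assumed population standard deviation    */
      ALPHA    = 0.05    /* significance level                       */
      POWER    = 0.9     /* desired power                            */
      NTOTAL   = .;      /* missing value marks the quantity to find */
   PLOT X=POWER MIN=0.5 MAX=0.95;   /* sample size as power varies   */
RUN;
```

Supplying a number for NTOTAL= and writing POWER=. instead would ask for the power at that sample size; the missing value always marks the quantity to be computed.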
In example 1 suppose you are involved in planning a clinical trial in which the primary goal is to compare the five-year survival rates of women with metastatic breast cancer (MBC) under two treatment regimens, one of which is Standard of Care (SoC). The trial involves randomizing women with MBC to each group in equal numbers. Suppose that the current five-year survival rate hovers around 20% and our expectation is that the new treatment modality could double the survival proportion to around 40%. Suppose further that we want to find the sample size necessary to be 90% certain of rejecting Ho under the conditions outlined above, assuming α=0.05.

Note particularly in this example that the test of equal proportions is a binomial distribution based test. As such, both the mean and variance are functions of π, so there is no need to specify a standard deviation in the SAS code. The statements that will cause the appropriate calculations to be carried out are as follows.

    PROC POWER;
       TWOSAMPLEFREQ TEST=PCHI ALPHA=.05
          GPS=(0.2 0.4) POWER=.9 NPERGROUP=.;
    RUN;

The displayed results from SAS software include a tabulation of the user-specified parameter settings as well as the sample size requirements requested by the “NPERGROUP=.” option. Those results follow.

    Computed N Per Group

    Actual Power    N Per Group
           0.902            109

In the description above the GPS= option indicates a doubling of the 5-year survival rate. It is common in clinical medicine to express changes as odds ratios. In this case doubling the odds from 1/4 to 1/2 (an odds ratio of 2) corresponds to increasing the 5-year survival fraction from 1/5 to 1/3. That is a smaller change. To carry out the calculations we could replace the GPS= option with OR=2 and REFP=.2. The resulting sample size requirement becomes 230 (per group). In order to reproduce the sample size in the display above we would have to set OR=2.66667.

At certain times it is convenient to have unequally sized samples.
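Before moving to unequal allocation, the two equal-allocation sample sizes quoted above (109 and 230 per group) can be checked by hand. What follows is a sketch using the standard normal approximation for comparing two proportions. It is an assumption that this is close to what the procedure computes internally, and the symbols π1, π2 and π̄ are introduced here only for illustration.

```latex
% Per-group sample size, two-sided test of two proportions, equal allocation
\[
n \approx
\frac{\left( z_{1-\alpha/2}\sqrt{2\bar{\pi}(1-\bar{\pi})}
      + z_{1-\beta}\sqrt{\pi_1(1-\pi_1)+\pi_2(1-\pi_2)} \right)^{2}}
     {(\pi_2-\pi_1)^{2}},
\qquad \bar{\pi} = \tfrac{1}{2}(\pi_1+\pi_2)
\]

% GPS=(0.2 0.4), alpha = 0.05, power = 0.9: z_{0.975} = 1.960, z_{0.90} = 1.282
\[
n \approx \frac{\left(1.960\sqrt{0.42} + 1.282\sqrt{0.40}\right)^{2}}{(0.4-0.2)^{2}}
  = \frac{(1.270 + 0.811)^{2}}{0.04} \approx 108.2
  \;\Rightarrow\; 109 \text{ per group}
\]

% OR=2 with REFP=.2, so pi_2 = 1/3
\[
n \approx \frac{\left(1.960\sqrt{0.3911} + 1.282\sqrt{0.3822}\right)^{2}}{(1/3-1/5)^{2}}
  = \frac{(1.226 + 0.792)^{2}}{0.01778} \approx 229
  \;\Rightarrow\; 230 \text{ per group}
\]
```

Both values are rounded up to the next whole subject and agree with the figures reported in the text. No separate σ appears because the variance terms are functions of the proportions themselves.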
Continuing to make changes in the displayed code by replacing NPERGROUP=. with GWEIGHTS=(2 1) and NTOTAL=. results in the following statements.

    PROC POWER;
       TWOSAMPLEFREQ TEST=PCHI ALPHA=.05
          OR=2.666667 REFP=.2 POWER=.9
          GWEIGHTS=(2 1) NTOTAL=.;
    RUN;

The request is that twice as many control subjects be used as new drug subjects. The resulting sample size is 240, to be divided up as 160 controls and 80 new drug subjects. It is to be expected that unequally sized samples will result in a somewhat larger total sample size requirement.

Other options can be used in various combinations in order to specify the information needed to carry out the computations. The TEST= option is illustrated above with the PCHI value. That specifies that the ordinary (Pearson) Chi-square test is going to be used. Other values that may be used are LRCHI (likelihood ratio Chi-square) and FISHER (Fisher’s Exact Test).

In example 2 we consider a situation where someone is interested in comparing the time to a certain event under two conditions. The time to event could be the length of time until some component in a manufacturing operation fails, the time until a customer needs warranty work on a car or the time until a patient with some form of cancer progresses to the next stage of disease. In each case we have to be aware of the fact that some components, customers or patients will not fail, return for warranty work or progress to a new stage of disease until after the study is completed. That is to say that some of the observations will be censored. First we will display some working code from PROC POWER and then discuss what some of the many and various options on the TWOSAMPLESURVIVAL statement actually do.

    PROC POWER;
       TWOSAMPLESURVIVAL GROUPWEIGHTS=(1 1) ALPHA=.05 POWER=.8
          SIDES=1 TEST=LOGRANK
          GROUPMEDSURVTIMES=(.3333 .6667)
          ACCRUALTIME=.01 FOLLOWUPTIME=10 NTOTAL=.;
    RUN;

First, of course, there is the PROC statement.
The TWOSAMPLESURVIVAL statement refers to the comparison of two survival curves. The first option, GROUPWEIGHTS=, can be shortened to GWEIGHTS= and in the displayed form refers to a 1 to 1 sample allocation (equal sample sizes). That also happens to be the default. ALPHA=, NTOTAL= and POWER= are as they were in example 1. SIDES=1 specifies that a one-tailed alternative hypothesis is being tested. TEST=LOGRANK identifies the Log-Rank test as the one being used. Other values for TEST= specify the Gehan (GEHAN) and Tarone-Ware (TARONEWARE) tests. GROUPMEDSURVTIMES= can be shortened to GMEDSURVTS=. The values displayed refer to one-third and two-thirds of a year. That is, the option specifies median survival times of 4 months versus 8 months for the two groups.

ACCRUALTIME= (also ACCT=) indicates that accrual of cases is assumed to be uniformly distributed between time=0 and the specified value. Here, the value is set to .01. That, effectively, means all the accrual occurs at the beginning of the study. FOLLOWUPTIME= (also FUT=) indicates the amount of time in the study beyond the accrual time. In this case the combination of specifications intentionally ensures that there is almost no chance of censoring. We are (almost) guaranteed that all of the events will occur prior to the end of the study. The results of the program include a tabulation of the parameters used in the calculations as well as the following.

    Computed N Total

    Actual Power    N Total
           0.800         56

There are alternative ways of specifying the required information. For instance, instead of specifying the median survival times for the two groups we can specify a hazard ratio. In addition we can specify rates at which the groups suffer loss to follow-up, either as hazard rates or as median loss times. Exponential probability models are used in all of these calculations.

In example 3 we will examine power calculations for the McNemar test.
The study situation can be likened to that appropriate for a matched-pairs t-test except that the response variable is now binary instead of continuous. Applications of the McNemar test are common in marketing, where we often want to see whether an intervention has changed customer opinions and, if so, in which direction. Other applications don’t necessarily involve the before/after scenario and occur in medicine as well as many other fields.

Assume the outcome (opinion) is A or B, each of which is measured on (say) a person before and after some intervention. The possible paired responses are then (A, A), (A, B), (B, A) or (B, B). (A, A) and (B, B) represent cases where there has been no change in opinion and, thus, don’t provide any information about the way in which opinions have shifted. A preponderance of (A, B)’s compared to (B, A)’s indicates a shift toward opinion B while the opposite outcome indicates a shift toward opinion A. In fact, the McNemar test statistic calculations are trivially easy and can be shown to be equivalent to “doing a sign-test on the changes”. That, in turn, is the same as testing Ho: π = 1/2, where π is the proportion of (A, B)’s among just the (A, B)’s and (B, A)’s in some imaginary population.

One facet of this test to keep in mind is that all of the useful information is contained in the “changes”, usually called discordances. A large sample of, say, 2,000 subjects becomes a small sample of only 20 subjects if only 1% of all the subjects change their minds. Therefore, one of the bits of information we need to have at hand is an estimate of the proportion of discordances. We will, again, display some code and proceed to explain what each option does.

    PROC POWER;
       PAIREDFREQ METHOD=MIETTINEN DIST=NORMAL
          DISCPROPDIFF=.1 TOTALPROPDISC=.2
          NPAIRS=. POWER=.9;
    RUN;

Following the PROC statement, PAIREDFREQ indicates that the power and sample size characteristics of the McNemar test will be evaluated. METHOD=MIETTINEN specifies one of two possible approximations to the exact binomial solution. The other is CONNOR. DIST=NORMAL specifies the assumed distribution of the test statistic. DISCPROPDIFF= indicates the difference between the discordant proportions and TOTALPROPDISC= indicates the sum of the two discordant proportions. Taken together, the two options selected, DISCPROPDIFF=.1 and TOTALPROPDISC=.2, imply that we are powering the test based on the assumption that P(A,B) = .15 and P(B,A) = .05, or vice versa. In this case POWER= sets the desired power to 90% and the significance level, α, is left to default to .05. Equivalent results may be obtained by replacing DISCPROPDIFF=.1 with DISCPRATIO=3 (as in .15/.05) or by replacing both the DISCPROPDIFF= and TOTALPROPDISC= options with DISCPS=.05 | .15. The resulting output contains a recapitulation of the parameter settings plus the following.

    Computed N Pairs

    Actual Power    N Pairs
           0.900        193

Calculations using exact binomial results (METHOD=EXACT) are limited to finding the power of the McNemar test given the value of NPAIRS.
Fortunately, we can use a number-list to do a “what if” analysis and obtain exact power results for various sample sizes, as the following program illustrates.

    PROC POWER;
       PAIREDFREQ METHOD=EXACT DIST=EXACT_COND
          DISCPS=.05 | .15
          NPAIRS=100 TO 200 BY 20 POWER=.;
    RUN;

Here, the program has been directed to use exact methods to compute the power for a range of sample sizes: 100, 120, 140, 160, 180 and 200. After the tabulation of the parameter settings the output displays the following.

    Computed Power

    Index    N Pairs    Actual Alpha    Power
        1        100          0.0294    0.544
        2        120          0.0296    0.633
        3        140          0.0303    0.711
        4        160          0.0315    0.777
        5        180          0.0331    0.832
        6        200          0.0346    0.875

From the table we can easily discern that 80% power would be achieved with approximately 170 pairs of observations and 90% power would be achieved with somewhere around 210 pairs of observations.

Examples Using PROC GLMPOWER

PROC GLMPOWER can be used to compute power and sample size requirements for fixed effect linear models problems, optionally with covariates that may be either continuous or categorical. In addition, the procedure can produce the same calculations for sub-hypotheses expressed as linear contrasts among the levels of a factor. The analysis begins with an “exemplary” data set, that is to say, a data set that expresses the researcher’s ideas (hopes) about the means of the underlying populations. For example, in a 2 x 2 factorial problem, we might postulate the following means.

    Treatments
                  A
                I    II
    B    I      4     6
         II     7     9

In brief, our conjecture is that the effect of factor A is a bit smaller than the effect of factor B and that there is no interaction between the two factors. The following data step creates the exemplary data set.

    DATA PERFECT;
       INPUT A $ B $ VOLTS;
       DATALINES;
    I  I  4
    I  II 7
    II I  6
    II II 9
    ;
    RUN;

Sample size calculations now can be focused on either of the two factors.
Since the effect sizes are different we shouldn’t anticipate that a sample size that yields 80% power for factor B will do the same for factor A. The following program requests sample size calculations for the no-interaction model assuming σ=2 and that we want the power to be 80%.

    PROC GLMPOWER DATA=PERFECT;
       CLASS A B;
       MODEL VOLTS = A B;
       POWER STDDEV=2 NTOTAL=. POWER=.8;
    RUN;

Output from the program again documents the parameters being assumed in the calculations and displays the results as follows.

    Computed N Total

    Index    Source    Test DF    Error DF    Actual Power    N Total
        1    A               1          33           0.829         36
        2    B               1          17           0.885         20

It might be that one of the factors is important and the other is just a control. From the display the sample size for the important factor can be chosen. If both are important then, of course, the larger of the two numbers needs to be chosen.

In our second example we will consider a 4 x 2 factorial in which the treatment means assume the following form. The new exemplary data set is stored in EXEMP but the data step is not shown here. For factor B the means have been chosen in a pattern consistent with the “least favorable configuration”, that is, the situation where the experimenter wants the maximum difference between means to be 3 and then calculates a sample size for the worst combination of means for which that is true.

    Treatments
                       B
                I     II    III     IV
    A    I    1.5      3      3    4.5
         II   2.5      4      4    5.5

    PROC GLMPOWER DATA=EXEMP;
       CLASS A B;
       MODEL VOLTS = A B;
       CONTRAST 'A=I VS A=IV' B -1 0 0 1;
       POWER STDDEV=2 NTOTAL=. POWER=.8;
    RUN;

The CONTRAST statement specifies that we are interested in showing that the mean for level I of factor B is different from the mean for level IV of the same factor. (The label reads 'A=I VS A=IV', but the coefficients are applied to the four levels of B.)
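The data step for EXEMP is not shown in the text. A minimal sketch that would be consistent with the table of means above is given below. It is a reconstruction, not the author's code: it assumes the same variable names (A, B and VOLTS) used in the PERFECT data set and in the PROC GLMPOWER call, with one row per cell of the 4 x 2 layout.

```sas
/* Hypothetical reconstruction of the exemplary data set EXEMP.   */
/* Cell means are copied from the table of postulated means;      */
/* variable names are assumed to match the earlier PERFECT step.  */
DATA EXEMP;
   INPUT A $ B $ VOLTS;
   DATALINES;
I  I   1.5
I  II  3
I  III 3
I  IV  4.5
II I   2.5
II II  4
II III 4
II IV  5.5
;
RUN;
```

With these values the marginal means for B are 2, 3.5, 3.5 and 5, so the extreme levels differ by 3 and the remaining two sit midway between them, while the two levels of A differ by 1.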
    Computed N Total

    Index    Type        Source          Test DF    Error DF    Actual Power    N Total
        1    Effect      A                     1         123           0.801        128
        2    Effect      B                     3          43           0.851         48
        3    Contrast    A=I VS A=IV           1          27           0.824         32

The difference between the two levels of A is small compared with the differences among the four levels of B, so the sample size required for factor A is substantially larger than that for factor B. And, of course, we chose the easy comparison for the contrast, so the estimated sample size for that comparison is smaller still.

References

SAS/STAT User’s Guide 9.1, Vol. 3 (2004). SAS Institute, Cary, NC 27513.
SAS/STAT User’s Guide 9.1, Vol. 5 (2004). SAS Institute, Cary, NC 27513.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.