Draft revised on 10-11-05

Testing equivalence between two bioanalytical laboratories or two methods using paired-sample analysis and interval hypothesis testing

Shixia Feng a,*, Qiwei Liang a, Robin D. Kinser a, Kirk Newland b and Rudolf Guilbaud c

a Philip Morris USA, Research Center, 4201 Commerce Road, Richmond, VA 23234, USA
b MDS Pharma Services, 621 Rose Street, Lincoln, NE 68502, USA
c MDS Pharma Services, 2350 Cohen Street, St. Laurent (Montreal), Quebec H4R 2N6, Canada

*Corresponding author: Shixia Feng, Ph.D., Philip Morris USA Research Center, 4201 Commerce Road, Richmond, VA 23234. Email: [email protected]

Abstract

A modified interval hypothesis testing procedure, based on paired-sample analysis, and its application in testing equivalence between two bioanalytical laboratories or two methods are described. This testing procedure has the advantage of reducing the risk of wrongly concluding equivalence when in fact the two laboratories or methods are not equivalent. The advantage of using paired-sample analysis is that the test is less confounded by inter-sample variability than unpaired-sample analysis when incurred biological samples spanning a wide concentration range are included in the experiments. Practical aspects, including experimental design, sample size calculation and power estimation, are also discussed through examples.

1. Introduction

Bioanalytical methods for the quantitative measurement of chemical compounds and their metabolites in biological samples must be validated to ensure that the methods yield reliable results [1]. To achieve the necessary analytical throughput, bioanalytical methods are often transferred from one laboratory to another, and sometimes samples from a clinical or animal study are analyzed using different methods or at different laboratories. To ensure that comparable results can be achieved between two laboratories or methods, a comparison should be conducted through a process of cross-validation [1] before a new laboratory or method is used to analyze the biological samples. We believe that the objectives of the experimental design for bioanalytical cross-validation should be three-fold: to establish statistical equivalence (no unacceptable bias) between the two laboratories or methods, to identify the sources of any differences, and to resolve the differences. Although statistical procedures exist for inter-laboratory comparisons (e.g., ISO 5725), such experiments often require more than six participating laboratories.

For comparing the two means generated by two laboratories or methods, the general t-test (point hypothesis test) is often used. The hypotheses are:

H0: μ2/μ1 = 1 (there is no bias) versus H1: μ2/μ1 ≠ 1 (bias exists)   (1)

If there is insufficient evidence to reject the null hypothesis, the two means are declared equal. It has been shown [2] that the general t-test has the undesirable property of penalizing higher precision. If the sample size is small and the variation within the two data sets is large relative to the difference between the two means, the t-test may be unable to detect a real difference; conversely, when the precision of each data set is very high, differences of little practical importance may be declared significant.
Hartmann et al. [3] recently compared the general t-test with the interval hypothesis test for testing the equivalence of two laboratories or two methods and concluded that, for laboratory or method comparison purposes, the latter is preferred because the error corresponding to the β-error of the general t-test can be controlled. The interval hypothesis test is based on Schuirmann's [2] two one-sided tests (TOST), which has become the standard approach in bioequivalence studies. This approach requires the pre-determination of an acceptance interval (lower limit θ1 and upper limit θ2) and then tests whether the measured bias is within the acceptance interval. In statistical terms, the hypotheses to be tested are:

H0: μ2/μ1 ≤ θ1 or μ2/μ1 ≥ θ2 versus H1: θ1 < μ2/μ1 < θ2   (2)

The hypotheses can also be expressed as two one-sided tests:

H01: μ2/μ1 ≤ θ1 versus H11: μ2/μ1 > θ1   (3)

H02: μ2/μ1 ≥ θ2 versus H12: μ2/μ1 < θ2   (4)

Compared to the general t-test, the null and alternative hypotheses have been reversed. Consequently, the definitions of the type I (α) and type II (β) errors in the interval hypothesis test are exactly switched with respect to the general t-test. The type I error is now the probability (α) of erroneously accepting equivalence when in fact the two laboratories or methods are not equivalent. The type II error is the probability (β) of erroneously accepting nonequivalence when in fact the two laboratories or methods are equivalent. Acceptance of the null hypotheses leads to the conclusion that the bias is not acceptable, and rejection of the null hypotheses leads to the conclusion that the bias is acceptable. This test therefore limits to a very low level the risk of erroneously accepting as unbiased a new laboratory or new method that is in fact biased.

However, close examination of the algorithm and procedures described by Hartmann et al. [3] led us to believe that they are only suitable for testing two means from two independent sets of samples (unpaired-sample analysis). An unpaired-sample experimental design in a bioanalytical cross-validation study may confound this statistical test because of a possibly large pooled variance that is actually due to inter-sample variability, especially for incurred biological samples obtained from clinical or animal studies. We believe that this problem can be overcome by applying paired-sample analysis. In this manuscript, we present a modified interval hypothesis testing procedure and a practical experimental design based on paired-sample analysis for testing equivalence between two bioanalytical laboratories or methods. The example of concern is the transfer of an LC-MS/MS method for S-phenylmercapturic acid (S-PMA), a benzene metabolite in human urine and a biomarker for exposure to benzene [4,5]. Both spiked quality control (QC) and incurred human samples were included in this study.

2. Methods

2.1 The interval hypothesis test procedure

The acceptance interval (θ1 and θ2) should be defined before the experiments begin.
As it is now widely accepted [1] that a validated bioanalytical method should have less than ±15% bias at concentration levels other than the lower limit of quantification (LLOQ), where a maximum of ±20% bias is acceptable, θ1 and θ2 in this study were set to 0.85 and 1.15, respectively (although we find these limits reasonable, one should define the acceptance interval based on one's own experience with the method). A paired-sample design was employed, and the concentration ratio, R = Conc_Lab2 (measured) / Conc_Lab1 (measured), was calculated for each sample. Let R̄_Lab2/Lab1 be the mean ratio for a sample set and R0 be the true ratio between the two laboratories (Lab2/Lab1) for the entire sample population. The hypotheses to be tested can be stated as:

H0: R0 ≤ 0.85 or R0 ≥ 1.15 versus H1: 0.85 < R0 < 1.15   (5)

or:

H01: R0 ≤ 0.85 versus H11: R0 > 0.85   (6)

and

H02: R0 ≥ 1.15 versus H12: R0 < 1.15   (7)

H0 will be rejected if

t_{cal(1)} = \frac{\bar{R}_{Lab2/Lab1} - 0.85}{SD_R/\sqrt{n}} > t_{\alpha,n-1}   (8)

where t_{α,n−1} is the 1−α quantile of the t distribution with n−1 degrees of freedom and SD_R is the standard deviation of the ratios in the sample set, or, by rearrangement, if

\bar{R}_{Lab2/Lab1} - t_{\alpha,n-1}\,SD_R/\sqrt{n} > 0.85   (9)

and if

t_{cal(2)} = \frac{\bar{R}_{Lab2/Lab1} - 1.15}{SD_R/\sqrt{n}} < -t_{\alpha,n-1}   (10)

or, by rearrangement, if

\bar{R}_{Lab2/Lab1} + t_{\alpha,n-1}\,SD_R/\sqrt{n} < 1.15   (11)

Notice that R̄_Lab2/Lab1 ± t_{α,n−1} SD_R/√n is the confidence interval of the mean ratio at the (1−2α)×100% level. Thus, H0 will be rejected at a significance level of α if the (1−2α)×100% confidence interval is contained entirely between 0.85 and 1.15.

In this study, basic data calculations were performed with Microsoft® Excel 2002. The computation of the interval hypothesis test and the statistical power was performed with Statistical Analysis System (SAS®) software.
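To illustrate the decision rule of Eq. 8-11, a minimal Python sketch is given below. It is not the SAS program used in the study; the function name and the use of NumPy/SciPy are our own assumptions.

```python
import numpy as np
from scipy import stats

def interval_equivalence_test(conc_lab1, conc_lab2,
                              theta1=0.85, theta2=1.15, alpha=0.025):
    """Paired-sample interval hypothesis test on the mean Lab2/Lab1 ratio.

    Implements the decision rule of Eq. 8-11: equivalence is declared when both
    one-sided t statistics are significant, i.e. when the (1-2*alpha)*100%
    confidence interval of the mean ratio lies entirely inside (theta1, theta2).
    """
    r = np.asarray(conc_lab2, dtype=float) / np.asarray(conc_lab1, dtype=float)
    n = r.size
    r_bar = r.mean()
    se = r.std(ddof=1) / np.sqrt(n)          # SD_R / sqrt(n)
    t_crit = stats.t.ppf(1 - alpha, n - 1)   # 1-alpha quantile, n-1 df

    t1 = (r_bar - theta1) / se               # Eq. 8
    t2 = (r_bar - theta2) / se               # Eq. 10
    ci = (r_bar - t_crit * se, r_bar + t_crit * se)   # Eq. 9/11
    equivalent = (t1 > t_crit) and (t2 < -t_crit)
    return r_bar, t1, t2, ci, equivalent
```

Applied to the incurred-sample concentrations in Table 2 with α = 0.025, this gives a mean ratio of about 0.94 and a 95% confidence interval of roughly 0.90-0.99, consistent with Table 3.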
2.2 The LC-MS/MS method

An LC-MS/MS method for S-PMA in human urine was initially developed and validated at Lab1, the originating laboratory. The full details of the method will be published separately. Briefly, the method used 1 mL of human urine and a solid-phase extraction procedure. The LC-MS/MS system included a Thermo Hypersil BioBasic® AX column with a guard column and a PE Sciex API 4000 MS detector with an electrospray ionization interface. Negative ions were monitored in the multiple reaction monitoring (MRM) mode.

After the method had been validated and used to analyze approximately 1800 human urine samples at Lab1, it was transferred to Lab2 (the accepting laboratory) to accommodate the need for higher throughput. The method used at Lab2 was identical to that at Lab1 except that the final step of the extraction procedure was slightly modified to allow for differences in available analytical equipment.

2.3 Cross-validation experimental design

The method transfer was conducted according to a method transfer protocol, which outlined the sample types and sources and the cross-validation acceptance criteria. Lab2 performed the initial between-site qualification batches to re-establish the lower and upper limits of quantification, inter-batch accuracy and precision, and the linear range. The performance characteristics at both laboratories are summarized in Table 1.

(Insert Table 1 here)

To establish equivalence between the two laboratories, a paired-sample design was employed: paired analyses of both spiked quality control (QC) samples at three fixed concentrations (low, medium, and high) and incurred human samples. QC samples at each concentration level were prepared by Lab1 and a portion of each sample was sent to Lab2. The nominal values for the low, medium, and high QC levels were 89.0, 2140, and 15100 pg/mL, respectively. Each laboratory measured the QC samples at each concentration level in twelve replicates in a paired fashion. In addition, 12 incurred human urine samples were collected at Lab1 and a portion of each sample was also sent to Lab2. The ratio of the measured concentrations for each sample pair was then calculated. The interval hypothesis test was performed on each QC set as well as on the incurred human sample set. Equivalence between the two laboratories was declared at the α = 0.025 level if the 95% confidence interval of the mean ratio was contained entirely between 0.85 and 1.15 for all QC levels and the incurred human samples.

3. Results and Discussion

3.1 Sample selection

The calibration curve of a bioanalytical method should represent the normal concentration distribution of the real biological samples. To establish equivalence between the two laboratories, we believe that at least three concentration levels representing the entire range of the calibration curve should be studied: one near the lower boundary of the curve, one near the center, and one near the upper boundary. Real biological samples that represent the normal concentration distribution of the analyte should also be included. The selection of QC levels and incurred human samples in this study was based on these considerations. The paired-sample design is based on the properties of the interval hypothesis test, which are discussed later in the section on paired versus unpaired sample analysis.

(Insert Table 2 here)

3.2 Type I and II errors, power, and sample size

As mentioned earlier, for the interval hypothesis test the type I error is the probability (α) of erroneously accepting equivalence when in fact the two laboratories or methods are not equivalent. The type II error is the probability (β) of erroneously accepting nonequivalence when in fact the two laboratories or methods are equivalent. The corresponding power of the test, 1−β, is therefore the probability of correctly declaring equivalence between the two laboratories or methods when they are truly equivalent.

As in bioequivalence studies [6,7], the primary concern of bioanalytical chemists should be the control of the α-error (equivalent to the β-error of the general t-test), because reporting clinical sample results generated by nonequivalent laboratories or methods as equivalent will cause comparability problems and confusion in data interpretation. The α level should be defined a priori, as should the acceptance interval (θ1 and θ2). In this study, we pre-defined in the cross-validation protocol that the 95% confidence interval of the mean ratio must be contained entirely between 0.85 and 1.15 for all QC levels and the incurred human samples. Therefore, the risk of wrongly concluding that the two laboratories are equivalent is limited to no more than 2.5%.

The actual probability of this risk is represented by the p-value and can be calculated with SAS® software using the two one-sided tests approach (Eq. 6 and 7). The p-value of the interval hypothesis test (Eq. 5) is the maximum of the p-values of the two one-sided tests. For example, as shown in Table 3, the p-values of the two one-sided tests for the incurred human sample set were 0.0008 and <0.0001, respectively. The p-value of the interval hypothesis test is therefore 0.0008, which is much lower than the pre-defined acceptable limit of 0.025. To reject H0, both p1 and p2 must be less than the α-value. In this example, the risks of declaring equivalence between the two laboratories while they are in fact not equivalent were extremely small for all sample sets tested.
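In the same spirit as the earlier sketch, the two one-sided p-values can be obtained from t_cal(1) and t_cal(2); a minimal sketch (again assuming SciPy, with our own function name) is:

```python
from scipy import stats

def tost_p_values(t1, t2, n):
    """p-values of the two one-sided tests (Eq. 6 and 7); the p-value of the
    interval hypothesis test (Eq. 5) is their maximum."""
    df = n - 1
    p1 = stats.t.sf(t1, df)    # P(T > t1), evidence against H01: R0 <= 0.85
    p2 = stats.t.cdf(t2, df)   # P(T < t2), evidence against H02: R0 >= 1.15
    return p1, p2, max(p1, p2)
```

As in Table 3, equivalence is declared only when both one-sided p-values fall below the pre-defined α.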
For the incurred sample set, we plotted the ratio versus the mean concentration from the two laboratories (not shown), which showed no correlation. This indicates that the two laboratories were not systematically different in different parts of the concentration range.

(Insert Table 3 here)

A basic question bioanalytical chemists often ask when designing the experiments is how many samples should be analyzed for the comparison to be statistically meaningful. Sample size (the number of sample pairs in our case) is one of the key factors that affect the statistical power and should be estimated based on the desired power and the estimated variability. The power of the interval hypothesis test is:

\text{Power} = P\{ t_{cal(1)} > t_{\alpha,n-1} \text{ and } t_{cal(2)} < -t_{\alpha,n-1} \} = P\{ 0.85 + t_{\alpha,n-1}\,\sigma_R/\sqrt{n} < \bar{R}_{Lab2/Lab1} < 1.15 - t_{\alpha,n-1}\,\sigma_R/\sqrt{n} \}   (12)

where P stands for probability, σ_R is the standard deviation of the ratios for the entire sample population, and the probability is evaluated at the true ratio R0. Using the same procedure described by Chow and Liu [8,9], we obtained Eq. 13 and 14 for estimating the minimum number of sample pairs needed to achieve the desired power:

n \ge (t_{\alpha,n-1} + t_{\beta,n-1})^2 \left[ \frac{CV \times \bar{R}_{Lab2/Lab1}}{1.15 - \bar{R}_{Lab2/Lab1}} \right]^2 \quad \text{for } 1.00 \le \bar{R}_{Lab2/Lab1} < 1.15   (13)

and

n \ge (t_{\alpha,n-1} + t_{\beta,n-1})^2 \left[ \frac{CV \times \bar{R}_{Lab2/Lab1}}{\bar{R}_{Lab2/Lab1} - 0.85} \right]^2 \quad \text{for } 0.85 < \bar{R}_{Lab2/Lab1} < 1.00   (14)

where CV is the coefficient of variation of the ratios for the sample set and t_{β,n−1} is the 1−β quantile of the t distribution with n−1 degrees of freedom.

Figure 1A-J illustrates the effect of the number of sample pairs, CV, mean ratio, and α-value on power for the paired-sample design (with θ1 and θ2 set at 0.85 and 1.15, respectively). The power increases as the CV decreases or as the number of sample pairs increases. The relationships shown in Figure 1A-J should be used to guide the planning of the experiments. However, because computing the power requires computer programming, and because the main concern of bioanalytical chemists in cross-validation is the control of the α-error, a post-study calculation of the actual power does not have to be part of the test. If the confidence interval is within the acceptance interval after the experiments, a simple way to verify whether the desired power has been achieved is to compare the calculated number of sample pairs for the desired power with the actual number of sample pairs: if the calculated number is less than the actual number, the desired power has been achieved. It is common in clinical studies to set β to less than 0.2, which corresponds to a statistical power greater than 80%. For example, for the incurred sample set in this study, the estimated number of sample pairs for 80% power is n = (2.20 + 0.88)² × [0.07/(0.94 − 0.85)]² ≈ 6. Since the actual number of pairs was 12, the power must have exceeded 80%; the power for this sample set is 98%.
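A minimal sketch of this post-study check (Eq. 13 and 14), assuming SciPy and using our own function name, is shown below; for prospective planning the same formula can be applied iteratively over candidate values of n.

```python
from math import ceil
from scipy import stats

def required_pairs(mean_ratio, cv, n_actual, alpha=0.025, beta=0.20,
                   theta1=0.85, theta2=1.15):
    """Minimum number of sample pairs for (1-beta) power (Eq. 13 and 14),
    evaluated with t quantiles at n_actual - 1 degrees of freedom, as in the
    post-study check described in the text."""
    df = n_actual - 1
    t_a = stats.t.ppf(1 - alpha, df)
    t_b = stats.t.ppf(1 - beta, df)
    # distance from the mean ratio to the nearer acceptance limit
    margin = theta2 - mean_ratio if mean_ratio >= 1.0 else mean_ratio - theta1
    return ceil((t_a + t_b) ** 2 * (cv * mean_ratio / margin) ** 2)

# Incurred sample set: mean ratio 0.94, CV 7.4%, 12 pairs analysed.
# required_pairs(0.94, 0.074, 12) returns 6 <= 12, so the desired 80% power
# was achieved.
```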
Each laboratory or method is associated with two types of errors: systematic and random. A mean ratio that is close to an acceptance boundary is an indication of systematic bias between the two laboratories or methods. One should first focus on finding the sources of systematic bias; for example, comparing the calibration standards, stock solutions, and working solutions of the two laboratories often helps resolve the problem. Sometimes it is useful in practice to conduct pilot experiments with a limited number of samples to generate preliminary data to guide the full-scale comparison.

The CV of the ratio is a function of the random errors of the two laboratories or methods being compared. Assuming there is no variation due to sampling, the CV of the ratio can be estimated with Eq. 15 from the precision data of each laboratory or method at a given concentration level, which should have been generated during method validation by repeated measurements of QC samples at different concentration levels:

CV_R = \sqrt{ \left( \frac{\sigma_{Lab1}}{Conc_{Lab1}} \right)^2 + \left( \frac{\sigma_{Lab2}}{Conc_{Lab2}} \right)^2 } = \sqrt{ CV_{Lab1}^2 + CV_{Lab2}^2 }   (15)

Note that Eq. 15 does not apply to the incurred human samples, which may vary over a wide concentration range. Because the accuracy and precision at different concentration levels are often different for a given laboratory or method, the incurred samples should be distributed approximately evenly over the entire concentration range. If the incurred samples are not distributed evenly, both the ratio and the CV may be skewed.
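Eq. 15 is simple enough to evaluate by hand, but for completeness a one-function Python sketch is given below; the numerical values in the comment are illustrative only and are not taken from this study.

```python
from math import sqrt

def cv_of_ratio(cv_lab1, cv_lab2):
    """Predicted CV of the paired ratio from each laboratory's precision
    (Eq. 15), assuming independent random errors and no sampling variation."""
    return sqrt(cv_lab1 ** 2 + cv_lab2 ** 2)

# Example: two laboratories with 5% and 6% CV at a given QC level give
# cv_of_ratio(0.05, 0.06) ~= 0.078, i.e. about a 7.8% CV for the ratio.
```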
3.3 Paired versus unpaired sample analysis

Interval hypothesis test procedures for testing either the ratio or the difference (δ) between two means from two independent random samples (unpaired-sample analysis) have been described by previous authors [3,10,11]. When testing the ratio of two means from two independent series of data with equal variances (σ1² = σ2²), the rejection criteria take the same form as Eq. 8 and 10: the calculated statistics t_cal(1) (Eq. 16) and t_cal(2) (Eq. 18) are formed from the ratio of the two independent means and its standard error, derived from the pooled variance

s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}   (17)

and are compared with the critical value t_{α, n1+n2−2}; the detailed expressions can be found in refs. 3, 10 and 11.

It should be noted that an unpaired-sample design in a bioanalytical cross-validation study may confound this statistical test, because a very large pooled variance can arise that is actually due to inter-sample variability, especially for incurred biological samples. Two laboratories or methods that are actually equivalent may then be declared nonequivalent unless the sample size is extremely large. The paired-sample design and testing of ratios remove this source of variability and therefore give the test higher sensitivity. For similar reasons, it is inappropriate to test the mean difference (δ) for incurred human samples even in a paired-sample analysis. For example, for the incurred human sample set in this study, δ̄ = −308.8 pg/mL, SD = 469.1, CV = 151.9%, 95% CI = (−606.9, −10.8), t1 = 1.83 and t2 = −6.39 (compared with t_{α,n−1} = 2.20). Notice that the variability of the paired differences in this sample set was very high (CV = 151.9%). This variability is mainly due to the variable sample concentrations. Therefore, the test performed on the mean difference led to the wrong conclusion that the two laboratories were not equivalent, because the test was confounded by the variability of the incurred samples themselves.
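For illustration, a short sketch of the paired mean-difference test from the example above follows. The draft does not state the acceptance limits used for δ; taking them as ±15% of the Lab1 mean concentration reproduces the quoted t values, so that assumption is built into the sketch (the function name and SciPy usage are likewise ours).

```python
import numpy as np
from scipy import stats

def paired_difference_tost(conc_lab1, conc_lab2, rel_limit=0.15, alpha=0.025):
    """TOST on the mean paired difference (Lab2 - Lab1).

    The acceptance limits are assumed to be +/- rel_limit of the Lab1 mean
    concentration (our assumption, not stated in the draft)."""
    x1 = np.asarray(conc_lab1, dtype=float)
    x2 = np.asarray(conc_lab2, dtype=float)
    d = x2 - x1
    n = d.size
    se = d.std(ddof=1) / np.sqrt(n)
    limit = rel_limit * x1.mean()
    t_crit = stats.t.ppf(1 - alpha, n - 1)
    t1 = (d.mean() + limit) / se      # against H0: delta <= -limit
    t2 = (d.mean() - limit) / se      # against H0: delta >= +limit
    equivalent = (t1 > t_crit) and (t2 < -t_crit)
    return d.mean(), t1, t2, equivalent

# For the incurred set in Table 2 this gives t1 ~ 1.83 and t2 ~ -6.39, so
# equivalence cannot be declared, illustrating how the wide concentration
# range inflates the variability of the differences.
```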
3.4 Some potential issues

Compared to unpaired-sample analysis, the number of degrees of freedom in paired-sample analysis is lower (n−1 versus n1+n2−2), which might seem to imply lower power. However, with good planning the desired power can still be achieved, as discussed above. Another issue is that the statistical testing procedure described herein assumes that the ratios are normally distributed around the mean, which is the case in this study. For non-normal ratio data, the procedure remains applicable if the sample size is sufficiently large and the variability is small. Log-transformation may also be considered if the data are not normally distributed.

Sometimes, especially when the sample size is small, outliers (ratios much larger or smaller than the others in a data set) may occur, and the outcome of the analysis can depend on how the outliers are treated. Potential outliers can be identified using statistical procedures such as Dixon's test or Grubbs' test [12], or simply as values lying more than three standard deviations from the mean. We recommend that the root causes be investigated before a decision is made to exclude outliers from the statistical analysis. The exclusion of an outlier may be justified if a review of the documentation clearly indicates an error in sample processing or analysis. One should keep in mind that the exclusion of outliers reduces the sample size and therefore affects both types of statistical error. An alternative is to reanalyze the outlying samples in either or both laboratories (or by both methods). However, if the repeat analyses confirm the original values, these values should not be excluded from the statistical analysis, and unfavorable conclusions may be reached in such cases. In our opinion, such a situation is likely due to large variability between the two laboratories or methods, which is itself an indication that they are not equivalent.
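As an aid to the outlier screening mentioned above, a minimal sketch of the two-sided Grubbs test is given below; this is a generic implementation under the usual normality assumption, not a procedure prescribed by this draft.

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs test for a single outlier in approximately normal data.

    Returns the most extreme value, the test statistic G and its critical
    value; the value is flagged as a potential outlier when G exceeds the
    critical value."""
    x = np.asarray(values, dtype=float)
    n = x.size
    dev = np.abs(x - x.mean())
    g = dev.max() / x.std(ddof=1)
    suspect = x[np.argmax(dev)]
    tcrit_sq = stats.t.ppf(1 - alpha / (2 * n), n - 2) ** 2
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(tcrit_sq / (n - 2 + tcrit_sq))
    return suspect, g, g_crit, g > g_crit
```

Grubbs' test screens one suspect value at a time; as noted above, a statistical flag alone is not sufficient grounds for exclusion without a documented root cause.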
Sample stability is another factor that may affect the cross-validation results. The shipping and storage conditions must be proven effective in preserving sample integrity; otherwise the results are confounded and may be misleading.

4. Conclusion

The interval hypothesis test and the paired-sample experimental design described in this manuscript can be used to test equivalence between two bioanalytical laboratories or methods. The premise for using this approach is that the two laboratories or methods being compared have been validated independently prior to the cross-validation. The acceptance interval and the limits for the two types of risk associated with this statistical test (the α- and β-errors) should be defined prior to the experiments. Equations for estimating the number of sample pairs based on power were also described; the number of sample pairs should be sufficiently large to achieve the desired power. The advantage of this test, and of the experimental design based on it, is that the risk of wrongly concluding equivalence when in fact the two laboratories or methods are not equivalent can be limited to a low level. In addition, when the number of sample pairs is determined from power calculations, the risk of wrongly concluding nonequivalence when in fact the two laboratories or methods are equivalent can also be limited to a low level (<20%). Finally, since both the systematic bias and the random errors of the two laboratories or methods are taken into account, this procedure should be suitable for use in practice.

Acknowledgement

The authors would like to thank Drs. Hans Roethig and Mohamadi Sarkar for helpful discussions and suggestions.

References

1. Shah, V. P.; Midha, K. K.; Findlay, J. W.; Hill, H. M.; Hulse, J. D.; McGilveray, I. J.; McKay, G.; Miller, K. J.; Patnaik, R. N.; Powell, M. L.; Tonelli, A.; Viswanathan, C. T.; Yacobi, A. Pharm. Res. 2000, 17, 1551-57. See also US DHHS FDA, Guidance for Industry: Bioanalytical Method Validation, http://www.fda.gov/cder/guidance/4252fnl.pdf, 2001, accessed October 11, 2005.
2. Schuirmann, D. J. J. Pharmacokinet. Biopharm. 1987, 15, 657-80.
3. Hartmann, C.; Smeyers-Verbeke, J.; Penninckx, W.; Vander Heyden, Y.; Vankeerberghen, P.; Massart, D. L. Anal. Chem. 1995, 67, 4491-99.
4. Boogaard, P. J.; van Sittert, N. J. Occup. Environ. Med. 1995, 52, 611-20.
5. Boogaard, P. J.; van Sittert, N. J. Environ. Health Perspect. 1996, 104 Suppl 6, 1151-57.
6. Diletti, E.; Hauschke, D.; Steinijans, V. W. Int. J. Clin. Pharmacol. Ther. Toxicol. 1991, 29, 1-8.
7. Phillips, K. F. J. Pharmacokinet. Biopharm. 1990, 18, 137-44.
8. Chow, S. C.; Liu, J. P. Design and Analysis of Bioavailability and Bioequivalence Studies, 2nd ed.; Marcel Dekker: New York, 1999; Chapter 5.
9. Liu, J. P.; Chow, S. C. J. Pharmacokinet. Biopharm. 1992, 20, 101-04.
10. Locke, C. S. J. Pharmacokinet. Biopharm. 1984, 12, 649-55.
11. Richter, S. J.; Richter, C. Quality Engineering 2002, 14, 375-380.
12. Miller, J. N.; Miller, J. C. Statistics and Chemometrics for Analytical Chemistry, 4th ed.; Prentice Hall: New York, 2000; Chapter 3.

Table 1. Validation summary for the two laboratories

Performance Parameter               Lab1*               Lab2**
Lower Limit of Quantitation         20 pg/mL            20 pg/mL
Linear Range                        20-20000 pg/mL      20-20000 pg/mL
Between-Run Precision (CV%)
  Low QC                            9.7 (n=24)          7.8 (n=17)
  Medium QC                         1.8 (n=24)          2.6 (n=18)
  High QC                           2.8 (n=24)          2.6 (n=18)
Between-Run Accuracy (% Nominal)
  Low QC                            100.0 (n=24)        103.3 (n=17)
  Medium QC                         97.7 (n=24)         96.0 (n=18)
  High QC                           99.3 (n=24)         96.1 (n=18)

*The nominal values for the low, medium, and high QC levels at Lab1 were 142, 2140, and 15100 pg/mL, respectively.
**The nominal values for the low, medium, and high QC levels at the receiving Lab2 were 48.6, 2070, and 15206 pg/mL, respectively.

Table 2. The analytical results from the cross-validation study

Conc1 and Conc2 are the concentrations (pg/mL) measured at Lab1 and Lab2, respectively; R2/1 is the Lab2/Lab1 ratio.

        QC1 (Nominal = 89.0 pg/mL)   QC2 (Nominal = 2140 pg/mL)   QC3 (Nominal = 15100 pg/mL)   Incurred
N       Conc1   Conc2   R2/1         Conc1   Conc2   R2/1         Conc1    Conc2    R2/1         Conc1     Conc2     R2/1
1       83.7    83.6    1.00         2090    2070    0.99         14500    14700    1.01         391       372       0.95
2       92.8    91.3    0.98         2150    2140    1.00         14700    14700    1.00         432       362       0.84
3       87.4    93.7    1.07         2100    2110    1.01         14700    14900    1.01         4430      4230      0.96
4       89.4    86.1    0.96         2070    2120    1.02         14600    14900    1.02         141       119       0.84
5       99.0    91.2    0.92         2040    2190    1.07         14500    14300    0.99         80.7      82.7      1.03
6       96.1    85.9    0.89         2230    2190    0.98         15100    14500    0.96         12800     11600     0.91
7       102     93.4    0.92         2110    2130    1.01         14400    14100    0.98         232       243       1.05
8       89.5    94.4    1.06         2000    2050    1.03         14800    14400    0.97         638       660       1.03
9       102     87.4    0.86         2180    2030    0.93         14900    13800    0.93         10800     9930      0.92
10      102     81.3    0.80         2150    2150    1.00         14600    14800    1.01         1120      1100      0.98
11      94.0    95.2    1.01         2170    2020    0.93         14800    14900    1.01         9750      8620      0.88
12      93.0    95.1    1.02         2020    2010    1.00         14500    14300    0.99         3710      3500      0.94
Mean    94.2    89.9    0.96         2109    2101    1.00         14675    14525    0.99         3710      3402      0.94
SD      6.14    4.83    0.083        69.60   63.60   0.039        200.57   354.52   0.028        4727.18   4272.11   0.07
CV (%)  6.51    5.37    8.64         3.30    3.03    3.90         1.37     2.44     2.79         127.40    125.59    7.41

Table 3. The interval hypothesis test results for the mean ratios

                        QC1 (n=12)   QC2 (n=12)   QC3 (n=12)   Incurred (n=12)
95% CI (lower, upper)   0.91, 1.01   0.97, 1.02   0.97, 1.01   0.90, 0.99
t1                      4.57         13.13        18.45        4.61
t2                      -8.29        -13.73       -21.09       -10.33
t(α, n-1)               2.20         2.20         2.20         2.20
p1                      <0.0001      <0.0001      <0.0001      0.0008
p2                      <0.0001      <0.0001      <0.0001      <0.0001
p                       <0.0001      <0.0001      <0.0001      0.0008
Power (%)               99           >99          >99          98

Figure caption:

Figure 1A-J. The effect of the number of sample pairs, CV%, and mean ratio on power at the α = 0.05 and 0.025 levels.

[Figure 1A-J: power versus number of sample pairs for CV values of approximately 5-15% at mean ratios of 0.90, 0.95, 1.00, 1.05, and 1.10, for α = 0.05 and α = 0.025; the plots themselves are not reproducible in this transcription.]