Tests of Hypotheses Based on a Single Sample Introduction
CHAPTER NINE
Tests of Hypotheses Based on a Single Sample

Introduction
A parameter can be estimated from sample data either by a single number (a point estimate) or an entire interval of plausible values (a confidence interval). Frequently, however, the objective of an investigation is not to estimate a parameter but to decide which of two contradictory claims about the parameter is correct. Methods for accomplishing this comprise the part of statistical inference called hypothesis testing. In this chapter, we first discuss some of the basic concepts and terminology in hypothesis testing and then develop decision procedures for the most frequently encountered testing problems based on a sample from a single population.

J.L. Devore and K.N. Berk, Modern Mathematical Statistics with Applications, Springer Texts in Statistics, DOI 10.1007/978-1-4614-0391-3_9, © Springer Science+Business Media, LLC 2012

9.1 Hypotheses and Test Procedures
A statistical hypothesis, or just hypothesis, is a claim or assertion either about the value of a single parameter (population characteristic or characteristic of a probability distribution), about the values of several parameters, or about the form of an entire probability distribution. One example of a hypothesis is the claim μ = $311, where μ is the true average one-term textbook expenditure for students at a university. Another example is the statement p < .50, where p is the proportion of adults who approve of the job that the President is doing. If μ1 and μ2 denote the true average decreases in systolic blood pressure for two different drugs, one hypothesis is the assertion that μ1 − μ2 = 0, and another is the statement μ1 − μ2 > 5. Yet another example of a hypothesis is the assertion that the stopping distance for a car under particular conditions has a normal distribution. Hypotheses of this latter sort will be considered in Chapter 13.
In this and the next several chapters, we concentrate on hypotheses about parameters.

In any hypothesis-testing problem, there are two contradictory hypotheses under consideration. One hypothesis might be the claim μ = $311 and the other μ ≠ $311, or the two contradictory statements might be p ≥ .50 and p < .50. The objective is to decide, based on sample information, which of the two hypotheses is correct. There is a familiar analogy to this in a criminal trial. One claim is the assertion that the accused individual is innocent. In the U.S. judicial system, this is the claim that is initially believed to be true. Only in the face of strong evidence to the contrary should the jury reject this claim in favor of the alternative assertion that the accused is guilty. In this sense, the claim of innocence is the favored or protected hypothesis, and the burden of proof is placed on those who believe in the alternative claim.

Similarly, in testing statistical hypotheses, the problem will be formulated so that one of the claims is initially favored. This initially favored claim will not be rejected in favor of the alternative claim unless sample evidence contradicts it and provides strong support for the alternative assertion.

DEFINITION
The null hypothesis, denoted by H0, is the claim that is initially assumed to be true (the "prior belief" claim). The alternative hypothesis, denoted by Ha, is the assertion that is contradictory to H0. The null hypothesis will be rejected in favor of the alternative hypothesis only if sample evidence suggests that H0 is false. If the sample does not strongly contradict H0, we will continue to believe in the plausibility of the null hypothesis. The two possible conclusions from a hypothesis-testing analysis are then reject H0 or fail to reject H0.

A test of hypotheses is a method for using sample data to decide whether the null hypothesis should be rejected. Thus we might test H0: μ = .75 against the alternative Ha: μ ≠ .75.
Only if sample data strongly suggest that μ is something other than .75 should the null hypothesis be rejected. In the absence of such evidence, H0 should not be rejected, since it is still quite plausible.

Sometimes an investigator does not want to accept a particular assertion unless and until data can provide strong support for the assertion. As an example, suppose a company is considering putting a new additive in the dried fruit that it produces. The true average shelf life with the current additive is known to be 200 days. With μ denoting the true average life for the new additive, the company would not want to make a change unless evidence strongly suggested that μ exceeds 200. An appropriate problem formulation would involve testing H0: μ = 200 against Ha: μ > 200. The conclusion that a change is justified is identified with Ha, and it would take conclusive evidence to justify rejecting H0 and switching to the new additive.

Scientific research often involves trying to decide whether a current theory should be replaced by a more plausible and satisfactory explanation of the phenomenon under investigation. A conservative approach is to identify the current theory with H0 and the researcher's alternative explanation with Ha. Rejection of the current theory will then occur only when evidence is much more consistent with the new theory. In many situations, Ha is referred to as the "research hypothesis," since it is the claim that the researcher would really like to validate. The word null means "of no value, effect, or consequence," which suggests that H0 should be identified with the hypothesis of no change (from current opinion), no difference, no improvement, and so on. Suppose, for example, that 10% of all computer circuit boards produced by a manufacturer during a recent period were defective. An engineer has suggested a change in the production process in the belief that it will result in a reduced defective rate.
Let p denote the true proportion of defective boards resulting from the changed process. Then the research hypothesis, on which the burden of proof is placed, is the assertion that p < .10. Thus the alternative hypothesis is Ha: p < .10.

In our treatment of hypothesis testing, H0 will generally be stated as an equality claim. If θ denotes the parameter of interest, the null hypothesis will have the form H0: θ = θ0, where θ0 is a specified number called the null value of the parameter (value claimed for θ by the null hypothesis). As an example, consider the circuit board situation just discussed. The suggested alternative hypothesis was Ha: p < .10, the claim that the defective rate is reduced by the process modification. A natural choice of H0 in this situation is the claim that p ≥ .10, according to which the new process is either no better or worse than the one currently used. We will instead consider H0: p = .10 versus Ha: p < .10. The rationale for using this simplified null hypothesis is that any reasonable decision procedure for deciding between H0: p = .10 and Ha: p < .10 will also be reasonable for deciding between the claim that p ≥ .10 and Ha. The use of a simplified H0 is preferred because it has certain technical benefits, which will be apparent shortly.

The alternative to the null hypothesis H0: θ = θ0 will look like one of the following three assertions:
1. Ha: θ > θ0 (in which case the implicit null hypothesis is θ ≤ θ0)
2. Ha: θ < θ0 (so the implicit null hypothesis states that θ ≥ θ0)
3. Ha: θ ≠ θ0

For example, let σ denote the standard deviation of the distribution of outside diameters (inches) for an engine piston. If the decision was made to use the piston unless sample evidence conclusively demonstrated that σ > .0001 in., the appropriate hypotheses would be H0: σ = .0001 versus Ha: σ > .0001. The number θ0 that appears in both H0 and Ha (and separates the alternative from the null) is called the null value.
Test Procedures
A test procedure is a rule, based on sample data, for deciding whether to reject H0. A test of H0: p = .10 versus Ha: p < .10 in the circuit board problem might be based on examining a random sample of n = 200 boards. Let X denote the number of defective boards in the sample, a binomial random variable; x represents the observed value of X. If H0 is true, E(X) = np = 200(.10) = 20, whereas we can expect fewer than 20 defective boards if Ha is true. A value x just a bit below 20 does not strongly contradict H0, so it is reasonable to reject H0 only if x is substantially less than 20. One such test procedure is to reject H0 if x ≤ 15 and not reject H0 otherwise. This procedure has two constituents: (1) a test statistic or function of the sample data used to make a decision and (2) a rejection region consisting of those x values for which H0 will be rejected in favor of Ha. For the rule just suggested, the rejection region consists of x = 0, 1, 2, . . . , 15. H0 will not be rejected if x = 16, 17, . . . , 199, or 200.

A test procedure is specified by the following:
1. A test statistic, a function of the sample data on which the decision (reject H0 or do not reject H0) is to be based
2. A rejection region, the set of all test statistic values for which H0 will be rejected
The null hypothesis will then be rejected if and only if the observed or computed test statistic value falls in the rejection region.

As another example, suppose a cigarette manufacturer claims that the average nicotine content μ of brand B cigarettes is (at most) 1.5 mg. It would be unwise to reject the manufacturer's claim without strong contradictory evidence, so an appropriate problem formulation is to test H0: μ = 1.5 versus Ha: μ > 1.5. Consider a decision rule based on analyzing a random sample of 32 cigarettes. Let X̄ denote the sample average nicotine content.
If H0 is true, E(X̄) = μ = 1.5, whereas if H0 is false, we expect X̄ to exceed 1.5. Strong evidence against H0 is provided by a value x̄ that considerably exceeds 1.5. Thus we might use X̄ as a test statistic along with the rejection region x̄ ≥ 1.60.

In both the circuit board and nicotine examples, the choice of test statistic and form of the rejection region make sense intuitively. However, the choice of cutoff value used to specify the rejection region is somewhat arbitrary. Instead of rejecting H0: p = .10 in favor of Ha: p < .10 when x ≤ 15, we could use the rejection region x ≤ 14. For this region, H0 would not be rejected if 15 defective boards are observed, whereas this occurrence would lead to rejection of H0 if the initially suggested region is employed. Similarly, the rejection region x̄ ≥ 1.55 might be used in the nicotine problem in place of the region x̄ ≥ 1.60.

Errors in Hypothesis Testing
The basis for choosing a particular rejection region lies in an understanding of the errors that one might be faced with in drawing a conclusion. Consider the rejection region x ≤ 15 in the circuit board problem. Even when H0: p = .10 is true, it might happen that an unusual sample results in x = 13, so that H0 is erroneously rejected. On the other hand, even when Ha: p < .10 is true, an unusual sample might yield x = 20, in which case H0 would not be rejected, again an incorrect conclusion. Thus it is possible that H0 may be rejected when it is true or that H0 may not be rejected when it is false. These possible errors are not consequences of a foolishly chosen rejection region. Either one of these two errors might result when the region x ≤ 14 is employed, or indeed when any other sensible region is used.

DEFINITION
A type I error consists of rejecting the null hypothesis H0 when it is true. A type II error involves not rejecting H0 when H0 is false.
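To make the two kinds of error concrete for the circuit board rule, the following is a minimal sketch that computes how often each cutoff would wrongly reject H0 when p = .10. The helper `binom_cdf` and the alternative value p = .07 are illustrative assumptions, not part of the text; only Python's standard library is used.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p), summed directly from the pmf."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n, p0 = 200, 0.10  # sample size and null value from the circuit board problem

# Probability of a type I error: rejecting H0 (x <= cutoff) when p = .10.
type1_cut15 = binom_cdf(15, n, p0)
type1_cut14 = binom_cdf(14, n, p0)

# Probability of a type II error at a hypothetical alternative p = .07
# (an assumed value for illustration): not rejecting means x > cutoff.
p_alt = 0.07
type2_cut15 = 1 - binom_cdf(15, n, p_alt)
type2_cut14 = 1 - binom_cdf(14, n, p_alt)

print(f"cutoff 15: P(type I) = {type1_cut15:.4f}, P(type II at p=.07) = {type2_cut15:.4f}")
print(f"cutoff 14: P(type I) = {type1_cut14:.4f}, P(type II at p=.07) = {type2_cut14:.4f}")
```

Under these assumptions, moving the cutoff from 15 to 14 lowers the chance of a type I error and raises the chance of a type II error, so neither region removes both risks.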
In the nicotine scenario, a type I error consists of rejecting the manufacturer's claim that μ = 1.5 when it is actually true. If the rejection region x̄ ≥ 1.60 is employed, it might happen that x̄ = 1.63 even when μ = 1.5, resulting in a type I error. Alternatively, it may be that H0 is false and yet x̄ = 1.52 is observed, leading to H0 not being rejected (a type II error).

In the best of all possible worlds, test procedures for which neither type of error is possible could be developed. However, this ideal can be achieved only by basing a decision on an examination of the entire population, which is almost always impractical. The difficulty with using a procedure based on sample data is that because of sampling variability, an unrepresentative sample may result. Even though E(X̄) = μ, the observed value x̄ may differ substantially from μ (at least if n is small). Thus when μ = 1.5 in the nicotine situation, x̄ may be much larger than 1.5, resulting in erroneous rejection of H0. Alternatively, it may be that μ = 1.6 yet an x̄ much smaller than this is observed, leading to a type II error.

Instead of demanding error-free procedures, we must look for procedures for which either type of error is unlikely to occur. That is, a good procedure is one for which the probability of making either type of error is small. The choice of a particular rejection region cutoff value fixes the probabilities of type I and type II errors. These error probabilities are traditionally denoted by α and β, respectively. Because H0 specifies a unique value of the parameter, there is a single value of α. However, there is a different value of β for each value of the parameter consistent with Ha.

Example 9.1
An automobile model is known to sustain no visible damage 25% of the time in 10-mph crash tests. A modified bumper design has been proposed in an effort to increase this percentage. Let p denote the proportion of all 10-mph crashes with this new bumper that result in no visible damage.
The hypotheses to be tested are H0: p = .25 (no improvement) versus Ha: p > .25. The test will be based on an experiment involving n = 20 independent crashes with prototypes of the new design. Intuitively, H0 should be rejected if a substantial number of the crashes show no damage. Consider the following test procedure:

Test statistic: X = the number of crashes with no visible damage
Rejection region: R8 = {8, 9, 10, . . . , 19, 20}; that is, reject H0 if x ≥ 8, where x is the observed value of the test statistic

This rejection region is called upper-tailed because it consists only of large values of the test statistic.

When H0 is true, X has a binomial probability distribution with n = 20 and p = .25. Then

α = P(type I error) = P(H0 is rejected when it is true)
  = P[X ≥ 8 when X ~ Bin(20, .25)] = 1 − B(7; 20, .25) = 1 − .898 = .102

That is, when H0 is actually true, roughly 10% of all experiments consisting of 20 crashes would result in H0 being incorrectly rejected (a type I error).

In contrast to α, there is not a single β. Instead, there is a different β for each different p that exceeds .25. Thus there is a value of β for p = .3 [in which case X ~ Bin(20, .3)], another value of β for p = .5, and so on. For example,

β(.3) = P(type II error when p = .3)
      = P(H0 is not rejected when it is false because p = .3)
      = P[X ≤ 7 when X ~ Bin(20, .3)] = B(7; 20, .3) = .772

When p is actually .3 rather than .25 (a "small" departure from H0), roughly 77% of all experiments of this type would result in H0 being incorrectly not rejected!

The accompanying table displays β for selected values of p (each calculated for the rejection region R8). Clearly, β decreases as the value of p moves farther to the right of the null value .25. Intuitively, the greater the departure from H0, the more likely it is that such a departure will be detected.
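The error probabilities of Example 9.1 can be checked numerically. The sketch below is one way to do so with Python's standard library; the helper name `binom_cdf` is an assumption introduced here, and B(x; n, p) in the text corresponds to `binom_cdf(x, n, p)`.

```python
from math import comb


def binom_cdf(k, n, p):
    """B(k; n, p) = P(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n, p0, cutoff = 20, 0.25, 8  # rejection region R8: reject H0 if x >= 8

# alpha: probability that X lands in the rejection region when p = .25
alpha = 1 - binom_cdf(cutoff - 1, n, p0)

# beta(p): probability that X stays below the cutoff when the true value is p
betas = {p: binom_cdf(cutoff - 1, n, p) for p in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8)}

print(f"alpha = {alpha:.3f}")
for p, b in betas.items():
    print(f"beta({p}) = {b:.3f}")
```

Rounded to three decimals, the results should agree with α = .102 and with the β values quoted in the example.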
p      .3    .4    .5    .6    .7    .8
β(p)   .772  .416  .132  .021  .001  .000

The proposed test procedure is still reasonable for testing the more realistic null hypothesis that p ≤ .25. In this case, there is no longer a single α, but instead there is an α for each p that is at most .25: α(.25), α(.23), α(.20), α(.15), and so on. It is easily verified, though, that α(p) < α(.25) = .102 if p < .25. That is, the largest value of α occurs for the boundary value .25 between H0 and Ha. Thus if α is small for the simplified null hypothesis, it will also be as small as or smaller for the more realistic H0. ■

Example 9.2
The drying time of a type of paint under specified test conditions is known to be normally distributed with mean value 75 min and standard deviation 9 min. Chemists have proposed a new additive designed to decrease average drying time. It is believed that drying times with this additive will remain normally distributed with σ = 9. Because of the expense associated with the additive, evidence should strongly suggest an improvement in average drying time before such a conclusion is adopted. Let μ denote the true average drying time when the additive is used. The appropriate hypotheses are H0: μ = 75 versus Ha: μ < 75. Only if H0 can be rejected will the additive be declared successful and used.

Experimental data is to consist of drying times from n = 25 test specimens. Let X1, . . . , X25 denote the 25 drying times, a random sample of size 25 from a normal distribution with mean value μ and standard deviation σ = 9. The sample mean drying time X̄ then has a normal distribution with expected value μ_X̄ = μ and standard deviation σ_X̄ = σ/√n = 9/√25 = 1.80. When H0 is true, μ_X̄ = 75, so only an x̄ value substantially less than 75 would strongly contradict H0. A reasonable rejection region has the form x̄ ≤ c, where the cutoff value c is suitably chosen.
Consider the choice c = 70.8, so that the test procedure consists of test statistic X̄ and rejection region x̄ ≤ 70.8. Because the rejection region consists only of small values of the test statistic, the test is said to be lower-tailed. Calculation of α and β now involves a routine standardization of X̄ followed by reference to the standard normal probabilities of Appendix Table A.3:

α = P(type I error) = P(H0 is rejected when it is true)
  = P(X̄ ≤ 70.8 when X̄ ~ normal with μ_X̄ = 75, σ_X̄ = 1.8)
  = Φ((70.8 − 75)/1.8) = Φ(−2.33) = .01

β(72) = P(type II error when μ = 72)
      = P(H0 is not rejected when it is false because μ = 72)
      = P(X̄ > 70.8 when X̄ ~ normal with μ_X̄ = 72, σ_X̄ = 1.8)
      = 1 − Φ((70.8 − 72)/1.8) = 1 − Φ(−.67) = 1 − .2514 = .7486

β(70) = 1 − Φ((70.8 − 70)/1.8) = .3300
β(67) = .0174

For the specified test procedure, only 1% of all experiments carried out as described will result in H0 being rejected when it is actually true. However, the chance of a type II error is very large when μ = 72 (only a small departure from H0), somewhat less when μ = 70, and quite small when μ = 67 (a very substantial departure from H0). These error probabilities are illustrated in Figure 9.1. Notice that α is computed using the probability distribution of the test statistic when H0 is true, whereas determination of β requires knowing the test statistic's distribution when H0 is false.

As in Example 9.1, if the more realistic null hypothesis μ ≥ 75 is considered, there is an α for each parameter value for which H0 is true: α(75), α(75.8), α(76.5), and so on. It is easily verified, though, that α(75) is the largest of all these type I error probabilities. Focusing on the boundary value amounts to working explicitly with the "worst case." ■

The specification of a cutoff value for the rejection region in the examples just considered was somewhat arbitrary. Use of the rejection region R8 = {8, 9, . . . , 20} in Example 9.1 resulted in α = .102, β(.3) = .772, and β(.5) = .132.
Many would think these error probabilities intolerably large. Perhaps they can be decreased by changing the cutoff value.

[Figure 9.1 α and β illustrated for Example 9.2, with the cutoff 70.8 marked in each panel: (a) the distribution of X̄ when μ = 75 (H0 true), shaded area = α = .01; (b) the distribution of X̄ when μ = 72 (H0 false), shaded area = β(72); (c) the distribution of X̄ when μ = 70 (H0 false), shaded area = β(70)]

Example 9.3 (Example 9.1 continued)
Let us use the same experiment and test statistic X as previously described in the automobile bumper problem but now consider the rejection region R9 = {9, 10, . . . , 20}. Since X still has a binomial distribution with parameters n = 20 and p,

α = P(H0 is rejected when p = .25)
  = P[X ≥ 9 when X ~ Bin(20, .25)] = 1 − B(8; 20, .25) = .041

The type I error probability has been decreased by using the new rejection region. However, a price has been paid for this decrease:

β(.3) = P(H0 is not rejected when p = .3)
      = P[X ≤ 8 when X ~ Bin(20, .3)] = B(8; 20, .3) = .887
β(.5) = B(8; 20, .5) = .252

Both these β's are larger than the corresponding error probabilities .772 and .132 for the region R8. In retrospect, this is not surprising; α is computed by summing over probabilities of test statistic values in the rejection region, whereas β is the probability that X falls in the complement of the rejection region. Making the rejection region smaller must therefore decrease α while increasing β for any fixed alternative value of the parameter. ■

Example 9.4 (Example 9.2 continued)
The use of cutoff value c = 70.8 in the paint-drying example resulted in a very small value of α (.01) but rather large β's. Consider the same experiment and test statistic X̄ with the new rejection region x̄ ≤ 72.
Because X̄ is still normally distributed with mean value μ_X̄ = μ and σ_X̄ = 1.8,

α = P(H0 is rejected when it is true)
  = P[X̄ ≤ 72 when X̄ ~ N(75, 1.8²)]
  = Φ((72 − 75)/1.8) = Φ(−1.67) = .0475 ≈ .05

β(72) = P(H0 is not rejected when μ = 72)
      = P(X̄ > 72 when X̄ is a normal rv with mean 72 and standard deviation 1.8)
      = 1 − Φ((72 − 72)/1.8) = 1 − Φ(0) = .5

β(70) = 1 − Φ((72 − 70)/1.8) = .1335
β(67) = .0027

The change in cutoff value has made the rejection region larger (it includes more x̄ values), resulting in a decrease in β for each fixed μ less than 75. However, α for this new region has increased from the previous value .01 to approximately .05. If a type I error probability this large can be tolerated, though, the second region (c = 72) is preferable to the first (c = 70.8) because of the smaller β's. ■

The results of these examples can be generalized in the following manner.

PROPOSITION
Suppose an experiment and a sample size are fixed and a test statistic is chosen. Then decreasing the size of the rejection region to obtain a smaller value of α results in a larger value of β for any particular parameter value consistent with Ha.

This proposition says that once the test statistic and n are fixed, there is no rejection region that will simultaneously make both α and all β's small. A region must be chosen to effect a compromise between α and β.

Because of the suggested guidelines for specifying H0 and Ha, a type I error is usually more serious than a type II error (this can always be achieved by proper choice of the hypotheses). The approach adhered to by most statistical practitioners is then to specify the largest value of α that can be tolerated and find a rejection region having that value of α rather than anything smaller. This makes β as small as possible subject to the bound on α. The resulting value of α is often referred to as the significance level of the test.
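The trade-off between the two error probabilities in the paint-drying example (Examples 9.2 and 9.4) can be checked numerically. The following is a minimal sketch using `statistics.NormalDist` from Python's standard library; the function names `alpha` and `beta` are assumptions introduced for illustration. Because the text rounds z to two decimals before using Appendix Table A.3, the computed values differ from the quoted ones in the third or fourth decimal place.

```python
from statistics import NormalDist

MU0 = 75.0                   # null value of the mean drying time (min)
SIGMA_XBAR = 9 / 25 ** 0.5   # sigma / sqrt(n) = 9 / 5 = 1.8


def alpha(cutoff):
    """P(X-bar <= cutoff) when mu = 75, i.e. H0 is rejected although true."""
    return NormalDist(MU0, SIGMA_XBAR).cdf(cutoff)


def beta(cutoff, mu_alt):
    """P(X-bar > cutoff) when mu = mu_alt < 75, i.e. H0 is wrongly retained."""
    return 1 - NormalDist(mu_alt, SIGMA_XBAR).cdf(cutoff)


for c in (70.8, 72.0):
    row = ", ".join(f"beta({m}) = {beta(c, m):.4f}" for m in (72, 70, 67))
    print(f"c = {c}: alpha = {alpha(c):.4f}, {row}")
```

Raising the cutoff from 70.8 to 72 enlarges the rejection region, so the computed α goes up while every β goes down, in line with the proposition.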
Traditional levels of significance are .10, .05, and .01, although the level in any particular problem will depend on the seriousness of a type I error: the more serious this error, the smaller should be the significance level. The corresponding test procedure is called a level α test (e.g., a level .05 test or a level .01 test). A test with significance level α is one for which the type I error probability is controlled at the specified level.

Example 9.5
Consider the situation mentioned previously in which μ was the true average nicotine content of brand B cigarettes. The objective is to test H0: μ = 1.5 versus Ha: μ > 1.5 based on a random sample X1, X2, . . . , X32 of nicotine contents. Suppose the distribution of nicotine content is known to be normal with σ = .20. It follows that X̄ is normally distributed with mean value μ_X̄ = μ and standard deviation σ_X̄ = .20/√32 = .0354.

Rather than use X̄ itself as the test statistic, let's standardize X̄ assuming that H0 is true.

Test statistic: Z = (X̄ − 1.5)/(σ/√n) = (X̄ − 1.5)/.0354

Z expresses the distance between X̄ and its expected value when H0 is true as some number of standard deviations. For example, z = 3 results from an x̄ that is 3 standard deviations larger than we would have expected it to be were H0 true. Rejecting H0 when x̄ "considerably" exceeds 1.5 is equivalent to rejecting H0 when z "considerably" exceeds 0. That is, the form of the rejection region is z ≥ c. Let's now determine c so that α = .05. When H0 is true, Z has a standard normal distribution. Thus

α = P(type I error) = P(rejecting H0 when it is true) = P[Z ≥ c when Z ~ N(0, 1)]

The value c must capture upper-tail area .05 under the z curve. Either from Section 4.3 or directly from Appendix Table A.3, c = z_.05 = 1.645. Notice that z ≥ 1.645 is equivalent to x̄ − 1.5 ≥ (.0354)(1.645); that is, x̄ ≥ 1.56. Then β is the probability that X̄ < 1.56 and can be calculated for any μ > 1.5.
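As a rough numerical companion to Example 9.5, the sketch below finds the level .05 cutoff and then evaluates β at a few alternatives. It uses `statistics.NormalDist` from Python's standard library; the variable and function names, and the particular alternative values of μ, are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, level = 1.5, 0.20, 32, 0.05

se = sigma / sqrt(n)                      # standard deviation of X-bar
c = NormalDist().inv_cdf(1 - level)       # z critical value with upper-tail area .05
xbar_cutoff = mu0 + c * se                # same rejection region on the x-bar scale


def beta(mu_alt):
    """P(X-bar < cutoff) when the true mean is mu_alt > 1.5 (type II error)."""
    return NormalDist(mu_alt, se).cdf(xbar_cutoff)


print(f"se = {se:.4f}, c = {c:.3f}, reject H0 if x-bar >= {xbar_cutoff:.2f}")
for mu_alt in (1.55, 1.60, 1.65):         # hypothetical alternatives
    print(f"beta({mu_alt}) = {beta(mu_alt):.4f}")
```

Fixing α first and then reading off β, as done here, mirrors the order of steps in the example: the significance level determines the cutoff, and the cutoff determines β at each alternative.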
■

Exercises Section 9.1 (1–14)

1. For each of the following assertions, state whether it is a legitimate statistical hypothesis and why:
a. H: σ > 100
b. H: x̃ = 45
c. H: s ≤ .20
d. H: σ1/σ2 < 1
e. H: X̄ − Ȳ = 5
f. H: λ ≤ .01, where λ is the parameter of an exponential distribution used to model component lifetime

2. For the following pairs of assertions, indicate which do not comply with our rules for setting up hypotheses and why (the subscripts 1 and 2 differentiate between quantities for two different populations or samples):
a. H0: μ = 100, Ha: μ > 100
b. H0: σ = 20, Ha: σ ≤ 20
c. H0: p ≠ .25, Ha: p = .25
d. H0: μ1 − μ2 = 25, Ha: μ1 − μ2 > 100
e. H0: S1² = S2², Ha: S1² ≠ S2²
f. H0: μ = 120, Ha: μ = 150
g. H0: σ1/σ2 = 1, Ha: σ1/σ2 ≠ 1
h. H0: p1 − p2 = −.1, Ha: p1 − p2 < −.1

3. To determine whether the girder welds in a new performing arts center meet specifications, a random sample of welds is selected, and tests are conducted on each weld in the sample. Weld strength is measured as the force required to break the weld. Suppose the specifications state that mean strength of welds should exceed 100 lb/in²; the inspection team decides to test H0: μ = 100 versus Ha: μ > 100. Explain why it might be preferable to use this Ha rather than μ < 100.

4. Let μ denote the true average radioactivity level (picocuries per liter). The value 5 pCi/L is considered the dividing line between safe and unsafe water. Would you recommend testing H0: μ = 5 versus Ha: μ > 5 or H0: μ = 5 versus Ha: μ < 5? Explain your reasoning. [Hint: Think about the consequences of a type I and type II error for each possibility.]

5. Before agreeing to purchase a large order of polyethylene sheaths for a particular type of high-pressure oil-filled submarine power cable, a company wants to see conclusive evidence that the true standard deviation of sheath thickness is less than .05 mm. What hypotheses should be tested, and why? In this context, what are the type I and type II errors?

6.
Many older homes have electrical systems that use fuses rather than circuit breakers. A manufacturer of 40-amp fuses wants to make sure that the mean amperage at which its fuses burn out is in fact 40. If the mean amperage is lower than 40, customers will complain because the fuses require replacement too often. If the mean amperage is higher than 40, the manufacturer might be liable for damage to an electrical system due to fuse malfunction. To verify the amperage of the fuses, a sample of fuses is to be selected and inspected. If a hypothesis test were to be performed on the resulting data, what null and alternative hypotheses would be of interest to the manufacturer? Describe type I and type II errors in the context of this problem situation.

7. Water samples are taken from water used for cooling as it is being discharged from a power plant into a river. It has been determined that as long as the mean temperature of the discharged water is at most 150°F, there will be no negative effects on the river's ecosystem. To investigate whether the plant is in compliance with regulations that prohibit a mean discharge-water temperature above 150°, 50 water samples will be taken at randomly selected times, and the temperature of each sample recorded. The resulting data will be used to test the hypotheses H0: μ = 150° versus Ha: μ > 150°. In the context of this situation, describe type I and type II errors. Which type of error would you consider more serious? Explain.

8. A regular type of laminate is currently being used by a manufacturer of circuit boards. A special laminate has been developed to reduce warpage. The regular laminate will be used on one sample of specimens and the special laminate on another sample, and the amount of warpage will then be determined for each specimen. The manufacturer will then switch to the special laminate only if it can be demonstrated that the true average amount of warpage for that laminate is less than for the regular laminate.
State the relevant hypotheses, and describe the type I and type II errors in the context of this situation.

9. Two different companies have applied to provide cable television service in a region. Let p denote the proportion of all potential subscribers who favor the first company over the second. Consider testing H0: p = .5 versus Ha: p ≠ .5 based on a random sample of 25 individuals. Let X denote the number in the sample who favor the first company and x represent the observed value of X.
a. Which of the following rejection regions is most appropriate and why?
R1 = {x: x ≤ 7 or x ≥ 18}, R2 = {x: x ≤ 8}, R3 = {x: x ≥ 17}
b. In the context of this problem situation, describe what type I and type II errors are.
c. What is the probability distribution of the test statistic X when H0 is true? Use it to compute the probability of a type I error.
d. Compute the probability of a type II error for the selected region when p = .3, again when p = .4, and also for both p = .6 and p = .7.
e. Using the selected region, what would you conclude if 6 of the 25 queried favored company 1?

10. For healthy individuals the level of prothrombin in the blood is approximately normally distributed with mean 20 mg/100 mL and standard deviation 4 mg/100 mL. Low levels indicate low clotting ability. In studying the effect of gallstones on prothrombin, the level of each patient in a sample is measured to see if there is a deficiency. Let μ be the true average level of prothrombin for gallstone patients.
a. What are the appropriate null and alternative hypotheses?
b. Let X̄ denote the sample average level of prothrombin in a sample of n = 20 randomly selected gallstone patients. Consider the test procedure with test statistic X̄ and rejection region x̄ ≤ 17.92. What is the probability distribution of the test statistic when H0 is true? What is the probability of a type I error for the test procedure?
c. What is the probability distribution of the test statistic when μ = 16.7?
Using the test procedure of part (b), what is the probability that gallstone patients will be judged not deficient in prothrombin, when in fact μ = 16.7 (a type II error)?
d. How would you change the test procedure of part (b) to obtain a test with significance level .05? What impact would this change have on the error probability of part (c)?
e. Consider the standardized test statistic Z = (X̄ − 20)/(σ/√n) = (X̄ − 20)/.8944. What are the values of Z corresponding to the rejection region of part (b)?

11. The calibration of a scale is to be checked by weighing a 10-kg test specimen 25 times. Suppose that the results of different weighings are independent of one another and that the weight on each trial is normally distributed with σ = .200 kg. Let μ denote the true average weight reading on the scale.
a. What hypotheses should be tested?
b. Suppose the scale is to be recalibrated if either x̄ ≥ 10.1032 or x̄ ≤ 9.8968. What is the probability that recalibration is carried out when it is actually unnecessary?
c. What is the probability that recalibration is judged unnecessary when in fact μ = 10.1? When μ = 9.8?
d. Let z = (x̄ − 10)/(σ/√n). For what value c is the rejection region of part (b) equivalent to the "two-tailed" region either z ≥ c or z ≤ −c?
e. If the sample size were only 10 rather than 25, how should the procedure of part (d) be altered so that α = .05?
f. Using the test of part (e), what would you conclude from the following sample data?
9.981 9.728 10.006 10.439 9.857 10.214 10.107 10.190 9.888 9.793
g. Re-express the test procedure of part (b) in terms of the standardized test statistic Z = (X̄ − 10)/(σ/√n).

12. A new design for the braking system on a certain type of car has been proposed. For the current system, the true average braking distance at 40 mph under specified conditions is known to be 120 ft.
It is proposed that the new design be implemented only if sample data strongly indicates a reduction in true average braking distance for the new design.
a. Define the parameter of interest and state the relevant hypotheses.
b. Suppose braking distance for the new system is normally distributed with $\sigma = 10$. Let $\bar{X}$ denote the sample average braking distance for a random sample of 36 observations. Which of the following rejection regions is appropriate: $R_1 = \{\bar{x} : \bar{x} \ge 124.80\}$, $R_2 = \{\bar{x} : \bar{x} \le 115.20\}$, $R_3 = \{\bar{x} : \text{either } \bar{x} \ge 125.13 \text{ or } \bar{x} \le 114.87\}$?
c. What is the significance level for the appropriate region of part (b)? How would you change the region to obtain a test with $\alpha = .001$?
d. What is the probability that the new design is not implemented when its true average braking distance is actually 115 ft and the appropriate region from part (b) is used?
e. Let $Z = (\bar{X} - 120)/(\sigma/\sqrt{n})$. What is the significance level for the rejection region $\{z : z \le -2.33\}$? For the region $\{z : z \le -2.88\}$?

13. Let $X_1, \ldots, X_n$ denote a random sample from a normal population distribution with a known value of $\sigma$.
a. For testing the hypotheses $H_0: \mu = \mu_0$ versus $H_a: \mu > \mu_0$ (where $\mu_0$ is a fixed number), show that the test with test statistic $\bar{X}$ and rejection region $\bar{x} \ge \mu_0 + 2.33\sigma/\sqrt{n}$ has significance level .01.
b. Suppose the procedure of part (a) is used to test $H_0: \mu \le \mu_0$ versus $H_a: \mu > \mu_0$. If $\mu_0 = 100$, $n = 25$, and $\sigma = 5$, what is the probability of committing a type I error when $\mu = 99$? When $\mu = 98$? In general, what can be said about the probability of a type I error when the actual value of $\mu$ is less than $\mu_0$? Verify your assertion.

14. Reconsider the situation of Exercise 11 and suppose the rejection region is $\{\bar{x} : \bar{x} \ge 10.1004 \text{ or } \bar{x} \le 9.8940\} = \{z : z \ge 2.51 \text{ or } z \le -2.65\}$.
a. What is $\alpha$ for this procedure?
b. What is $\beta$ when $\mu = 10.1$? When $\mu = 9.9$? Is this desirable?
9.2 Tests About a Population Mean

The general discussion in Chapter 8 of confidence intervals for a population mean $\mu$ focused on three different cases. We now develop test procedures for these same three cases.

Case I: A Normal Population with Known $\sigma$

Although the assumption that the value of $\sigma$ is known is rarely met in practice, this case provides a good starting point because of the ease with which general procedures and their properties can be developed. The null hypothesis in all three cases will state that $\mu$ has a particular numerical value, the null value, which we will denote by $\mu_0$. Let $X_1, \ldots, X_n$ represent a random sample of size $n$ from the normal population. Then the sample mean $\bar{X}$ has a normal distribution with expected value $\mu_{\bar{X}} = \mu$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$. When $H_0$ is true, $\mu_{\bar{X}} = \mu_0$. Consider now the statistic $Z$ obtained by standardizing $\bar{X}$ under the assumption that $H_0$ is true:

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

Substitution of the computed sample mean $\bar{x}$ gives $z$, the distance between $\bar{x}$ and $\mu_0$ expressed in "standard deviation units." For example, if the null hypothesis is $H_0: \mu = 100$, $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 10/\sqrt{25} = 2.0$, and $\bar{x} = 103$, then the test statistic value is given by $z = (103 - 100)/2.0 = 1.5$. That is, the observed value of $\bar{x}$ is 1.5 standard deviations (of $\bar{X}$) above what we expect it to be when $H_0$ is true. The statistic $Z$ is a natural measure of the distance between $\bar{X}$, the estimator of $\mu$, and its expected value when $H_0$ is true. If this distance is too great in a direction consistent with $H_a$, the null hypothesis should be rejected.

Suppose first that the alternative hypothesis has the form $H_a: \mu > \mu_0$. Then an $\bar{x}$ value less than $\mu_0$ certainly does not provide support for $H_a$. Such an $\bar{x}$ corresponds to a negative value of $z$ (since $\bar{x} - \mu_0$ is negative and the divisor $\sigma/\sqrt{n}$ is positive).
Similarly, an $\bar{x}$ value that exceeds $\mu_0$ by only a small amount (corresponding to $z$ which is positive but small) does not suggest that $H_0$ should be rejected in favor of $H_a$. The rejection of $H_0$ is appropriate only when $\bar{x}$ considerably exceeds $\mu_0$, that is, when the $z$ value is positive and large. In summary, the appropriate rejection region, based on the test statistic $Z$ rather than $\bar{X}$, has the form $z \ge c$.

As discussed in Section 9.1, the cutoff value $c$ should be chosen to control the probability of a type I error at the desired level $\alpha$. This is easily accomplished because the distribution of the test statistic $Z$ when $H_0$ is true is the standard normal distribution (that's why $\mu_0$ was subtracted in standardizing). The required cutoff $c$ is the $z$ critical value that captures upper-tail area $\alpha$ under the standard normal curve. As an example, let $c = 1.645$, the value that captures tail area .05 ($z_{.05} = 1.645$). Then,

$$\alpha = P(\text{type I error}) = P(H_0 \text{ is rejected when } H_0 \text{ is true}) = P[Z \ge 1.645 \text{ when } Z \sim N(0, 1)] = 1 - \Phi(1.645) = .05$$

More generally, the rejection region $z \ge z_\alpha$ has type I error probability $\alpha$. The test procedure is upper-tailed because the rejection region consists only of large values of the test statistic.

Analogous reasoning for the alternative hypothesis $H_a: \mu < \mu_0$ suggests a rejection region of the form $z \le c$, where $c$ is a suitably chosen negative number ($\bar{x}$ is far below $\mu_0$ if and only if $z$ is quite negative). Because $Z$ has a standard normal distribution when $H_0$ is true, taking $c = -z_\alpha$ yields $P(\text{type I error}) = \alpha$. This is a lower-tailed test. For example, $z_{.10} = 1.28$ implies that the rejection region $z \le -1.28$ specifies a test with significance level .10.

Finally, when the alternative hypothesis is $H_a: \mu \ne \mu_0$, $H_0$ should be rejected if $\bar{x}$ is too far to either side of $\mu_0$. This is equivalent to rejecting $H_0$ either if $z \ge c$ or if $z \le -c$. Suppose we desire $\alpha = .05$.
Then,

$$.05 = P(Z \ge c \text{ or } Z \le -c \text{ when } Z \text{ has a standard normal distribution}) = \Phi(-c) + 1 - \Phi(c) = 2[1 - \Phi(c)]$$

Thus $c$ is such that $1 - \Phi(c)$, the area under the standard normal curve to the right of $c$, is .025 (and not .05!). From Section 4.3 or Appendix Table A.3, $c = 1.96$, and the rejection region is $z \ge 1.96$ or $z \le -1.96$. For any $\alpha$, the two-tailed rejection region $z \ge z_{\alpha/2}$ or $z \le -z_{\alpha/2}$ has type I error probability $\alpha$ (since area $\alpha/2$ is captured under each of the two tails of the $z$ curve). Again, the key reason for using the standardized test statistic $Z$ is that because $Z$ has a known distribution when $H_0$ is true (standard normal), a rejection region with desired type I error probability is easily obtained by using an appropriate critical value.

The test procedure for Case I is summarized in the accompanying box, and the corresponding rejection regions are illustrated in Figure 9.2.

Null hypothesis: $H_0: \mu = \mu_0$
Test statistic value: $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$

Alternative Hypothesis and Rejection Region for Level $\alpha$ Test
$H_a: \mu > \mu_0$: $z \ge z_\alpha$ (upper-tailed test)
$H_a: \mu < \mu_0$: $z \le -z_\alpha$ (lower-tailed test)
$H_a: \mu \ne \mu_0$: either $z \ge z_{\alpha/2}$ or $z \le -z_{\alpha/2}$ (two-tailed test)

[Figure 9.2 Rejection regions for z tests: (a) upper-tailed test; (b) lower-tailed test; (c) two-tailed test. Each panel shows the $z$ curve (the probability distribution of the test statistic $Z$ when $H_0$ is true) with the rejection region shaded. In panels (a) and (b) the shaded area is $\alpha = P(\text{type I error})$; in panel (c) each tail has shaded area $\alpha/2$, for a total shaded area of $\alpha$.]

Use of the following sequence of steps is recommended when testing hypotheses about a parameter.

1. Identify the parameter of interest and describe it in the context of the problem situation.
2. Determine the null value and state the null hypothesis.
3. State the appropriate alternative hypothesis.
4. Give the formula for the computed value of the test statistic (substituting the null value and the known values of any other parameters, but not those of any sample-based quantities).
5. State the rejection region for the selected significance level $\alpha$.
6. Compute any necessary sample quantities, substitute into the formula for the test statistic value, and compute that value.
7. Decide whether $H_0$ should be rejected and state this conclusion in the problem context.

The formulation of hypotheses (steps 2 and 3) should be done before examining the data.

Example 9.6
A manufacturer of sprinkler systems used for fire protection in office buildings claims that the true average system-activation temperature is 130°F. A sample of $n = 9$ systems, when tested, yields a sample average activation temperature of 131.08°F. If the distribution of activation temperatures is normal with standard deviation 1.5°F, does the data contradict the manufacturer's claim at significance level $\alpha = .01$?

1. Parameter of interest: $\mu$ = true average activation temperature.
2. Null hypothesis: $H_0: \mu = 130$ (null value $= \mu_0 = 130$).
3. Alternative hypothesis: $H_a: \mu \ne 130$ (a departure from the claimed value in either direction is of concern).
4. Test statistic value:
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{\bar{x} - 130}{1.5/\sqrt{n}}$$
5. Rejection region: The form of $H_a$ implies use of a two-tailed test with rejection region either $z \ge z_{.005}$ or $z \le -z_{.005}$. From Section 4.3 or Appendix Table A.3, $z_{.005} = 2.58$, so we reject $H_0$ if either $z \ge 2.58$ or $z \le -2.58$.
6. Substituting $n = 9$ and $\bar{x} = 131.08$,
$$z = \frac{131.08 - 130}{1.5/\sqrt{9}} = \frac{1.08}{.5} = 2.16$$
That is, the observed sample mean is a bit more than 2 standard deviations above what would have been expected were $H_0$ true.
7. The computed value $z = 2.16$ does not fall in the rejection region ($-2.58 < 2.16 < 2.58$), so $H_0$ cannot be rejected at significance level .01. The data does not give strong support to the claim that the true average differs from the design value of 130.
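The seven steps of Example 9.6 can be sketched in a few lines of Python. This is only an illustration under the Case I assumptions (normal population, known $\sigma$); the function name `z_test` and its argument names are hypothetical, and the critical value comes from the standard library's `statistics.NormalDist` rather than from Appendix Table A.3.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alpha, alternative="two-sided"):
    """Case I z test (normal population, known sigma).

    Returns (z, critical value, reject H0?). The names and the
    `alternative` keyword are illustrative, not from the text.
    """
    z = (xbar - mu0) / (sigma / sqrt(n))      # standardize under H0
    std_normal = NormalDist()                 # N(0, 1)
    if alternative == "greater":              # Ha: mu > mu0, reject if z >= z_alpha
        crit = std_normal.inv_cdf(1 - alpha)
        reject = z >= crit
    elif alternative == "less":               # Ha: mu < mu0, reject if z <= -z_alpha
        crit = std_normal.inv_cdf(1 - alpha)
        reject = z <= -crit
    elif alternative == "two-sided":          # Ha: mu != mu0, reject if |z| >= z_{alpha/2}
        crit = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(z) >= crit
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, crit, reject


# Sprinkler data from Example 9.6: xbar = 131.08, mu0 = 130, sigma = 1.5, n = 9, alpha = .01
z, crit, reject = z_test(131.08, 130, 1.5, 9, 0.01)
print(round(z, 2), round(crit, 2), reject)    # 2.16 2.58 False
```

The decision agrees with step 7: $z = 2.16$ lies between $-2.58$ and $2.58$, so $H_0$ is not rejected at level .01.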
■

Another view of the analysis in the previous example involves calculating a 99% CI for $\mu$ based on Equation 8.5:

$$\bar{x} \pm 2.58\,\sigma/\sqrt{n} = 131.08 \pm 2.58(1.5/\sqrt{9}) = 131.08 \pm 1.29 = (129.79, 132.37)$$

Notice that the interval includes $\mu_0 = 130$, and it is not hard to see that the 99% CI excludes $\mu_0$ if and only if the two-tailed hypothesis test rejects $H_0$ at level .01. In general, the $100(1 - \alpha)\%$ CI excludes $\mu_0$ if and only if the two-tailed hypothesis test rejects $H_0$ at level $\alpha$. Although we will not always call attention to it, this kind of relationship between hypothesis tests and confidence intervals will occur over and over in the remainder of the book. It should be intuitively reasonable that the CI will exclude a value when the corresponding test rejects the value. There is a similar relationship between lower-tailed tests and upper confidence bounds, and also between upper-tailed tests and lower confidence bounds.

$\beta$ and Sample Size Determination

The $z$ tests for Case I are among the few in statistics for which there are simple formulas available for $\beta$, the probability of a type II error. Consider first the upper-tailed test with rejection region $z \ge z_\alpha$. This is equivalent to $\bar{x} \ge \mu_0 + z_\alpha \sigma/\sqrt{n}$, so $H_0$ will not be rejected if $\bar{x} < \mu_0 + z_\alpha \sigma/\sqrt{n}$. Now let $\mu'$ denote a particular value of $\mu$ that exceeds the null value $\mu_0$. Then,

$$\begin{aligned}
\beta(\mu') &= P(H_0 \text{ is not rejected when } \mu = \mu') \\
&= P(\bar{X} < \mu_0 + z_\alpha \sigma/\sqrt{n} \text{ when } \mu = \mu') \\
&= P\left(\frac{\bar{X} - \mu'}{\sigma/\sqrt{n}} < z_\alpha + \frac{\mu_0 - \mu'}{\sigma/\sqrt{n}} \text{ when } \mu = \mu'\right) \\
&= \Phi\left(z_\alpha + \frac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right)
\end{aligned}$$

As $\mu'$ increases, $\mu_0 - \mu'$ becomes more negative, so $\beta(\mu')$ will be small when $\mu'$ greatly exceeds $\mu_0$ (because the value at which $\Phi$ is evaluated will then be quite negative). Error probabilities for the lower-tailed and two-tailed tests are derived in an analogous manner.
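The closed-form expression for $\beta(\mu')$ just derived is easy to evaluate numerically. The following is a minimal sketch for the upper-tailed test only; the function name `beta_upper_tailed` is hypothetical, and the illustrative numbers ($\mu_0 = 30{,}000$, $\sigma = 1500$, $n = 16$, $\alpha = .01$) are the tire-life settings that appear in Example 9.7.

```python
from math import sqrt
from statistics import NormalDist


def beta_upper_tailed(mu0, mu_alt, sigma, n, alpha):
    """Type II error probability of the level-alpha upper-tailed z test
    at the alternative mu_alt: Phi(z_alpha + (mu0 - mu_alt)/(sigma/sqrt(n))).
    Function and argument names are illustrative assumptions.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)   # upper-tail critical value
    return std_normal.cdf(z_alpha + (mu0 - mu_alt) / (sigma / sqrt(n)))


# beta shrinks as the alternative moves farther above the null value
for mu_alt in (30_500, 31_000, 31_500):
    b = beta_upper_tailed(30_000, mu_alt, 1500, 16, 0.01)
    print(mu_alt, round(b, 4))
```

Because the code uses an unrounded $z_{.01}$ instead of the table value 2.33, its result at $\mu' = 31{,}000$ can differ from a hand calculation in the fourth decimal place.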
If $\sigma$ is large, the probability of a type II error can be large at an alternative value $\mu'$ that is of particular concern to an investigator. Suppose we fix $\alpha$ and also specify $\beta$ for such an alternative value. In the sprinkler example, company officials might view $\mu' = 132$ as a very substantial departure from $H_0: \mu = 130$ and therefore wish $\beta(132) = .10$ in addition to $\alpha = .01$. More generally, consider the two restrictions $P(\text{type I error}) = \alpha$ and $\beta(\mu') = \beta$ for specified $\alpha$, $\mu'$, and $\beta$. Then for an upper-tailed test, the sample size $n$ should be chosen to satisfy

$$\Phi\left(z_\alpha + \frac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right) = \beta$$

This implies that

$$-z_\beta = z_\alpha + \frac{\mu_0 - \mu'}{\sigma/\sqrt{n}}$$

where $-z_\beta$ is the $z$ critical value that captures lower-tail area $\beta$. It is easy to solve this equation for the desired $n$. A parallel argument yields the necessary sample size for lower- and two-tailed tests as summarized in the next box.

Alternative Hypothesis and Type II Error Probability $\beta(\mu')$ for a Level $\alpha$ Test
$H_a: \mu > \mu_0$: $\Phi\left(z_\alpha + \dfrac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right)$
$H_a: \mu < \mu_0$: $1 - \Phi\left(-z_\alpha + \dfrac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right)$
$H_a: \mu \ne \mu_0$: $\Phi\left(z_{\alpha/2} + \dfrac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right) - \Phi\left(-z_{\alpha/2} + \dfrac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right)$

where $\Phi(z)$ = the standard normal cdf.

The sample size $n$ for which a level $\alpha$ test also has $\beta(\mu') = \beta$ at the alternative value $\mu'$ is

$$n = \begin{cases} \left[\dfrac{\sigma(z_\alpha + z_\beta)}{\mu_0 - \mu'}\right]^2 & \text{for a one-tailed (upper or lower) test} \\[2ex] \left[\dfrac{\sigma(z_{\alpha/2} + z_\beta)}{\mu_0 - \mu'}\right]^2 & \text{for a two-tailed test (an approximate solution)} \end{cases}$$

Example 9.7
Let $\mu$ denote the true average tread life of a type of tire. Consider testing $H_0: \mu = 30{,}000$ versus $H_a: \mu > 30{,}000$ based on a sample of size $n = 16$ from a normal population distribution with $\sigma = 1500$. A test with $\alpha = .01$ requires $z_\alpha = z_{.01} = 2.33$.
The probability of making a type II error when $\mu = 31{,}000$ is

$$\beta(31{,}000) = \Phi\left(2.33 + \frac{30{,}000 - 31{,}000}{1500/\sqrt{16}}\right) = \Phi(-.34) = .3669$$

Since $z_{.1} = 1.28$, the requirement that the level .01 test also have $\beta(31{,}000) = .1$ necessitates

$$n = \left[\frac{1500(2.33 + 1.28)}{30{,}000 - 31{,}000}\right]^2 = (-5.42)^2 = 29.32$$

The sample size must be an integer, so $n = 30$ tires should be used. ■

Case II: Large-Sample Tests

When the sample size is large, the $z$ tests for Case I are easily modified to yield valid test procedures without requiring either a normal population distribution or known $\sigma$. The key result was used in Chapter 8 to justify large-sample confidence intervals: A large $n$ implies that the sample standard deviation $s$ will be close to $\sigma$ for most samples, so that the standardized variable

$$Z = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

has approximately a standard normal distribution. Substitution of the null value $\mu_0$ in place of $\mu$ yields the test statistic

$$Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

which has approximately a standard normal distribution when $H_0$ is true. The use of rejection regions given previously for Case I (e.g., $z \ge z_\alpha$ when the alternative hypothesis is $H_a: \mu > \mu_0$) then results in test procedures for which the significance level is approximately (rather than exactly) $\alpha$. The rule of thumb $n > 40$ will again be used to characterize a large sample size.

Example 9.8
A sample of bills for meals was obtained at a restaurant (by Erich Brandt). For each of 70 bills the tip was found as a percentage of the raw bill (before taxes). Does it appear that the population mean tip percentage for this restaurant exceeds the standard 15%?
Here are the 70 tip percentages:

14.21 19.12 29.87 13.46 11.48 15.23 21.53 20.24 20.37 17.92
16.79 13.96 16.09 12.76 20.10 15.29 19.74 19.03 21.58 19.19
18.07 14.94 18.39 22.73 19.19 11.94 11.91 14.11 15.69 27.55
14.56 19.23 19.02 18.21 15.86 15.04 16.01 15.16 12.39 17.73
15.37 20.67 12.04 10.94 16.09 16.89 20.07 16.31 15.66 20.16
13.52 16.42 18.93 40.09 16.03 18.54 17.85 17.42 19.07 13.56
19.88 48.77 27.88 16.35 14.48 13.74 17.70 22.79 12.31 13.81

Anderson-Darling Normality Test
  A-Squared     4.17
  P-Value     < 0.005
Mean           17.986
StDev           5.937
Variance       35.247
Skewness        2.9391
Kurtosis       12.0154
N              70
Minimum        10.940
1st Quartile   14.540
Median         16.840
3rd Quartile   19.358
Maximum        48.770
95% Confidence Interval for Mean      16.571  19.402
95% Confidence Interval for Median    15.913  18.402
95% Confidence Interval for StDev      5.090   7.124

Figure 9.3 MINITAB descriptive summary for the tip data of Example 9.8

Figure 9.3 shows a descriptive summary obtained from MINITAB. The sample mean tip percentage is >15. Notice that the distribution is positively skewed because there are some very large tips (and a normal probability plot therefore does not exhibit a linear pattern), but the large-sample $z$ tests do not require a normal population distribution.

1. $\mu$ = true average tip percentage
2. $H_0: \mu = 15$
3. $H_a: \mu > 15$
4. $z = \dfrac{\bar{x} - 15}{s/\sqrt{n}}$
5. Using a test with a significance level .05, $H_0$ will be rejected if $z \ge 1.645$ (an upper-tailed test).
6. With $n = 70$, $\bar{x} = 17.99$, and $s = 5.937$,
$$z = \frac{17.99 - 15}{5.937/\sqrt{70}} = \frac{2.99}{.7096} = 4.21$$
7. Since $4.21 > 1.645$, $H_0$ is rejected. There is evidence that the population mean tip percentage exceeds 15%.
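The large-sample calculation in Example 9.8 can be reproduced directly from the raw data. The sketch below is one way to do it with the Python standard library; the variable names are illustrative, and the sample standard deviation $s$ simply takes the place of the unknown $\sigma$, as Case II prescribes.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# The 70 tip percentages listed in Example 9.8
tips = [
    14.21, 19.12, 29.87, 13.46, 11.48, 15.23, 21.53, 20.24, 20.37, 17.92,
    16.79, 13.96, 16.09, 12.76, 20.10, 15.29, 19.74, 19.03, 21.58, 19.19,
    18.07, 14.94, 18.39, 22.73, 19.19, 11.94, 11.91, 14.11, 15.69, 27.55,
    14.56, 19.23, 19.02, 18.21, 15.86, 15.04, 16.01, 15.16, 12.39, 17.73,
    15.37, 20.67, 12.04, 10.94, 16.09, 16.89, 20.07, 16.31, 15.66, 20.16,
    13.52, 16.42, 18.93, 40.09, 16.03, 18.54, 17.85, 17.42, 19.07, 13.56,
    19.88, 48.77, 27.88, 16.35, 14.48, 13.74, 17.70, 22.79, 12.31, 13.81,
]

mu0 = 15                                  # null value (the "standard" tip)
alpha = 0.05

n = len(tips)
xbar = mean(tips)                         # sample mean
s = stdev(tips)                           # sample standard deviation (n - 1 divisor)
z = (xbar - mu0) / (s / sqrt(n))          # large-sample test statistic
crit = NormalDist().inv_cdf(1 - alpha)    # z_alpha, about 1.645
reject = z >= crit                        # upper-tailed rejection rule

print(f"n={n}  xbar={xbar:.3f}  s={s:.3f}  z={z:.2f}  reject H0: {reject}")
```

The printed summary values should agree, up to rounding, with the MINITAB mean and standard deviation in Figure 9.3 and with the $z$ value in step 6.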
■

Determination of $\beta$ and the necessary sample size for these large-sample tests can be based either on specifying a plausible value of $\sigma$ and using the Case I formulas (even though $s$ is used in the test) or on using the methods to be introduced shortly in connection with Case III.

Case III: A Normal Population Distribution with Unknown $\sigma$

When $n$ is small, the Central Limit Theorem (CLT) can no longer be invoked to justify the use of a large-sample test. We faced this same difficulty in obtaining a small-sample confidence interval (CI) for $\mu$ in Chapter 8. Our approach here will be the same one used there: We will assume that the population distribution is at least approximately normal and describe test procedures whose validity rests on this assumption. If an investigator has good reason to believe that the population distribution is quite nonnormal, a distribution-free test from Chapter 14 can be used. Alternatively, a statistician can be consulted regarding procedures valid for specific families of population distributions other than the normal family. Or a bootstrap procedure can be developed.

The key result on which tests for a normal population mean are based was used in Chapter 8 to derive the one-sample $t$ CI: If $X_1, X_2, \ldots, X_n$ is a random sample from a normal distribution, the standardized variable

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

has a $t$ distribution with $n - 1$ degrees of freedom (df). Consider testing $H_0: \mu = \mu_0$ against $H_a: \mu > \mu_0$ by using the test statistic $T = (\bar{X} - \mu_0)/(S/\sqrt{n})$. That is, the test statistic results from standardizing $\bar{X}$ under the assumption that $H_0$ is true (using $S/\sqrt{n}$, the estimated standard deviation of $\bar{X}$, rather than $\sigma/\sqrt{n}$). When $H_0$ is true, the test statistic has a $t$ distribution with $n - 1$ df. Knowledge of the test statistic's distribution when $H_0$ is true (the "null distribution") allows us to construct a rejection region for which the type I error probability is controlled at the desired level.
In particular, use of the upper-tail $t$ critical value $t_{\alpha,n-1}$ to specify the rejection region $t \ge t_{\alpha,n-1}$ implies that

$$P(\text{type I error}) = P(H_0 \text{ is rejected when it is true}) = P(T \ge t_{\alpha,n-1} \text{ when } T \text{ has a } t \text{ distribution with } n - 1 \text{ df}) = \alpha$$

The test statistic is really the same here as in the large-sample case but is labeled $T$ to emphasize that its null distribution is a $t$ distribution with $n - 1$ df rather than the standard normal ($z$) distribution. The rejection region for the $t$ test differs from that for the $z$ test only in that a $t$ critical value $t_{\alpha,n-1}$ replaces the $z$ critical value $z_\alpha$. Similar comments apply to alternatives for which a lower-tailed or two-tailed test is appropriate.

THE ONE-SAMPLE t TEST

Null hypothesis: $H_0: \mu = \mu_0$
Test statistic value: $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$

Alternative Hypothesis and Rejection Region for a Level $\alpha$ Test
$H_a: \mu > \mu_0$: $t \ge t_{\alpha,n-1}$ (upper-tailed)
$H_a: \mu < \mu_0$: $t \le -t_{\alpha,n-1}$ (lower-tailed)
$H_a: \mu \ne \mu_0$: either $t \ge t_{\alpha/2,n-1}$ or $t \le -t_{\alpha/2,n-1}$ (two-tailed)

Example 9.9
A well-designed and safe workplace can contribute greatly to increased productivity. It is especially important that workers not be asked to perform tasks, such as lifting, that exceed their capabilities. The accompanying data on maximum weight of lift (MAWL, in kg) for a frequency of four lifts/min was reported in the article "The Effects of Speed, Frequency, and Load on Measured Hand Forces for a Floor-to-Knuckle Lifting Task" (Ergonomics, 1992: 833–843); subjects were randomly selected from the population of healthy males age 18–30. Assuming that MAWL is normally distributed, does the following data suggest that the population mean MAWL exceeds 25?

25.8 36.6 26.3 21.8 27.2

Let's carry out a test using a significance level of .05.

1. $\mu$ = population mean MAWL
2. $H_0: \mu = 25$
3. $H_a: \mu > 25$
4. $t = \dfrac{\bar{x} - 25}{s/\sqrt{n}}$
5. Reject $H_0$ if $t \ge t_{\alpha,n-1} = t_{.05,4} = 2.132$.
6. $\sum x_i = 137.7$ and $\sum x_i^2 = 3911.97$, from which $\bar{x} = 27.54$, $s = 5.47$, and
$$t = \frac{27.54 - 25}{5.47/\sqrt{5}} = \frac{2.54}{2.45} = 1.04$$
The accompanying MINITAB output from a request for a one-sample $t$ test has the same calculated values (the P-value is discussed in Section 9.4).

Test of mu = 25.00 vs mu > 25.00
Variable   N   Mean    StDev   SE Mean   T      P-Value
mawl       5   27.54   5.47    2.45      1.04   0.18

7. Since 1.04 does not fall in the rejection region ($1.04 < 2.132$), $H_0$ cannot be rejected at significance level .05. It is still plausible that $\mu$ is (at most) 25. ■

$\beta$ and Sample Size Determination

The calculation of $\beta$ at the alternative value $\mu'$ in Case I was carried out by expressing the rejection region in terms of $\bar{x}$ (e.g., $\bar{x} \ge \mu_0 + z_\alpha \sigma/\sqrt{n}$) and then subtracting $\mu'$ to standardize correctly. An equivalent approach involves noting that when $\mu = \mu'$, the test statistic $Z = (\bar{X} - \mu_0)/(\sigma/\sqrt{n})$ still has a normal distribution with variance 1, but now the mean value of $Z$ is given by $(\mu' - \mu_0)/(\sigma/\sqrt{n})$. That is, when $\mu = \mu'$, the test statistic still has a normal distribution though not the standard normal distribution. Because of this, $\beta(\mu')$ is an area under the normal curve corresponding to mean value $(\mu' - \mu_0)/(\sigma/\sqrt{n})$ and variance 1. Both $\alpha$ and $\beta$ involve working with normally distributed variables.

The calculation of $\beta(\mu')$ for the $t$ test is much less straightforward. This is because the distribution of the test statistic $T = (\bar{X} - \mu_0)/(S/\sqrt{n})$ is quite complicated when $H_0$ is false and $H_a$ is true. Thus, for an upper-tailed test, determining

$$\beta(\mu') = P(T < t_{\alpha,n-1} \text{ when } \mu = \mu' \text{ rather than } \mu_0)$$

involves integrating a very unpleasant density function. This must be done numerically, but fortunately it has been done by research statisticians for both one- and two-tailed $t$ tests. The results are summarized in graphs of $\beta$ that appear in Appendix Table A.16.
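When neither the graphs nor statistical software are at hand, $\beta(\mu')$ for the $t$ test can also be approximated by simulation: draw many normal samples with mean $\mu'$, compute $t$ for each, and record how often $H_0$ fails to be rejected. The following is a rough Monte Carlo sketch, not the numerical integration described above. The function name, the number of replications, and the seed are arbitrary choices; the settings ($\mu_0 = 2.5$, $\mu' = 2.6$, $\sigma = .1$, $n = 10$) are those of Example 9.10, and the critical value $t_{.05,9} = 1.833$ is taken from a standard $t$ table.

```python
import random
from math import sqrt


def simulate_beta_t(mu0, mu_alt, sigma, n, t_crit, reps=20_000, seed=1):
    """Monte Carlo estimate of beta(mu_alt) for the upper-tailed one-sample
    t test that rejects H0: mu = mu0 when t >= t_crit.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)                 # fixed seed for repeatability
    not_rejected = 0
    for _ in range(reps):
        sample = [rng.gauss(mu_alt, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        t = (xbar - mu0) / (s / sqrt(n))
        if t < t_crit:                        # H0 not rejected: a type II error
            not_rejected += 1
    return not_rejected / reps


beta_hat = simulate_beta_t(mu0=2.5, mu_alt=2.6, sigma=0.1, n=10, t_crit=1.833)
print(round(beta_hat, 3))                     # an estimate, subject to simulation error
```

An estimate of this kind carries sampling error of roughly $\sqrt{\beta(1-\beta)/\text{reps}}$, so it is best treated as a check on the tabulated curves or on software output rather than as a replacement for them.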
There are four sets of graphs, corresponding to one-tailed tests at level .05 and level .01 and two-tailed tests at the same levels. To understand how these graphs are used, note first that both $\beta$ and the necessary sample size $n$ in Case I are functions not just of the absolute difference $|\mu_0 - \mu'|$ but of $d = |\mu_0 - \mu'|/\sigma$. Suppose, for example, that $|\mu_0 - \mu'| = 10$. This departure from $H_0$ will be much easier to detect (smaller $\beta$) when $\sigma = 2$, in which case $\mu_0$ and $\mu'$ are 5 population standard deviations apart, than when $\sigma = 10$. The fact that $\beta$ for the $t$ test depends on $d$ rather than just $|\mu_0 - \mu'|$ is unfortunate, since to use the graphs one must have some idea of the true value of $\sigma$. A conservative (large) guess for $\sigma$ will yield a conservative (large) value of $\beta(\mu')$ and a conservative estimate of the sample size necessary for prescribed $\alpha$ and $\beta(\mu')$.

Once the alternative $\mu'$ and value of $\sigma$ are selected, $d$ is calculated and its value located on the horizontal axis of the relevant set of curves. The value of $\beta$ is the height of the $n - 1$ df curve above the value of $d$ (visual interpolation is necessary if $n - 1$ is not a value for which the corresponding curve appears), as illustrated in Figure 9.4.

[Figure 9.4 A typical $\beta$ curve for the $t$ test: $\beta$ (vertical axis, from 0 to 1) plotted against $d$ (horizontal axis) for $n - 1$ df. The height of the curve above the value of $d$ corresponding to the specified alternative $\mu'$ is $\beta$ when $\mu = \mu'$.]

Rather than fixing $n$ (i.e., $n - 1$, and thus the particular curve from which $\beta$ is read), one might prescribe both $\alpha$ (.05 or .01 here) and a value of $\beta$ for the chosen $\mu'$ and $\sigma$. After computing $d$, the point $(d, \beta)$ is located on the relevant set of graphs. The curve below and closest to this point gives $n - 1$ and thus $n$ (again, interpolation is often necessary).

Example 9.10
The true average voltage drop from collector to emitter of insulated gate bipolar transistors of a certain type is supposed to be at most 2.5 V.
An investigator selects a sample of $n = 10$ such transistors and uses the resulting voltages as a basis for testing $H_0: \mu = 2.5$ versus $H_a: \mu > 2.5$ using a $t$ test with significance level $\alpha = .05$. If the standard deviation of the voltage distribution is $\sigma = .100$, how likely is it that $H_0$ will not be rejected when $\mu = 2.6$? With $d = |2.5 - 2.6|/.100 = 1.0$, the point on the $\beta$ curve at 9 df for a one-tailed test with $\alpha = .05$ above 1.0 has height approximately .1, so $\beta \approx .1$. The investigator might think that this is too large a value of $\beta$ for such a substantial departure from $H_0$ and may wish to have $\beta = .05$ for this alternative value of $\mu$. Since $d = 1.0$, the point $(d, \beta) = (1.0, .05)$ must be located. This point is very close to the 14 df curve, so using $n = 15$ will give both $\alpha = .05$ and $\beta = .05$ when the value of $\mu$ is 2.6 and $\sigma = .10$. A larger value of $\sigma$ would give a larger $\beta$ for this alternative, and an alternative value of $\mu$ closer to 2.5 would also result in an increased value of $\beta$. ■

Most of the widely used statistical computer packages will also calculate type II error probabilities and determine necessary sample sizes. As an example, we asked MINITAB to do the calculations from Example 9.10. Its computations are based on power, which is simply $1 - \beta$. We want $\beta$ to be small, which is equivalent to asking that the power of the test be large. For example, $\beta = .05$ corresponds to a value of .95 for power. Here is the resulting MINITAB output.
Power and Sample Size
Testing mean = null (versus > null)
Calculating power for mean = null + 0.1
Alpha = 0.05  Sigma = 0.1

Sample Size   Power
10            0.8975

Power and Sample Size
1-Sample t Test
Testing mean = null (versus > null)
Calculating power for mean = null + 0.1
Alpha = 0.05  Sigma = 0.1

Sample Size   Target Power   Actual Power
13            0.9500         0.9597

Notice from the second part of the output that the sample size necessary to obtain a power of .95 ($\beta = .05$) for an upper-tailed test with $\alpha = .05$ when $\sigma = .1$ and $\mu'$ is .1 larger than $\mu_0$ is only $n = 13$, whereas eyeballing our $\beta$ curves gave 15. When available, this type of software is more trustworthy than the curves.

Exercises Section 9.2 (15–35)

15. Let the test statistic $Z$ have a standard normal distribution when $H_0$ is true. Give the significance level for each of the following situations:
a. $H_a: \mu > \mu_0$, rejection region $z \ge 1.88$
b. $H_a: \mu < \mu_0$, rejection region $z \le -2.75$
c. $H_a: \mu \ne \mu_0$, rejection region $z \ge 2.88$ or $z \le -2.88$

16. Let the test statistic $T$ have a $t$ distribution when $H_0$ is true. Give the significance level for each of the following situations:
a. $H_a: \mu > \mu_0$, df = 15, rejection region $t \ge 3.733$
b. $H_a: \mu < \mu_0$, $n = 24$, rejection region $t \le -2.500$
c. $H_a: \mu \ne \mu_0$, $n = 31$, rejection region $t \ge 1.697$ or $t \le -1.697$

17. Answer the following questions for the tire problem in Example 9.7.
a. If $\bar{x} = 30{,}960$ and a level $\alpha = .01$ test is used, what is the decision?
b. If a level .01 test is used, what is $\beta(30{,}500)$?
c. If a level .01 test is used and it is also required that $\beta(30{,}500) = .05$, what sample size $n$ is necessary?
d. If $\bar{x} = 30{,}960$, what is the smallest $\alpha$ at which $H_0$ can be rejected (based on $n = 16$)?

18. Reconsider the paint-drying situation of Example 9.2, in which drying time for a test specimen is normally distributed with $\sigma = 9$. The hypotheses $H_0: \mu = 75$ versus $H_a: \mu < 75$ are to be tested using a random sample of $n = 25$ observations.
a. How many standard deviations (of $\bar{X}$) below the null value is $\bar{x} = 72.3$?
b. If $\bar{x} = 72.3$, what is the conclusion using $\alpha = .01$?
c. What is $\alpha$ for the test procedure that rejects $H_0$ when $z \le -2.88$?
d. For the test procedure of part (c), what is $\beta(70)$?
e. If the test procedure of part (c) is used, what $n$ is necessary to ensure that $\beta(70) = .01$?
f. If a level .01 test is used with $n = 100$, what is the probability of a type I error when $\mu = 76$?

19. The melting point of each of 16 samples of a brand of hydrogenated vegetable oil was determined, resulting in $\bar{x} = 94.32$. Assume that the distribution of melting point is normal with $\sigma = 1.20$.
a. Test $H_0: \mu = 95$ versus $H_a: \mu \ne 95$ using a two-tailed level .01 test.
b. If a level .01 test is used, what is $\beta(94)$, the probability of a type II error when $\mu = 94$?
c. What value of $n$ is necessary to ensure that $\beta(94) = .1$ when $\alpha = .01$?

20. Lightbulbs of a certain type are advertised as having an average lifetime of 750 h. The price of these bulbs is very favorable, so a potential customer has decided to go ahead with a purchase arrangement unless it can be conclusively demonstrated that the true average lifetime is smaller than what is advertised. A random sample of 50 bulbs was selected, the lifetime of each bulb determined, and the appropriate hypotheses were tested using MINITAB, resulting in the accompanying output.

Variable   N    Mean     StDev   SEMean   Z       P-Value
lifetime   50   738.44   38.20   5.40     -2.14   0.016

What conclusion would be appropriate for a significance level of .05? A significance level of .01? What significance level and conclusion would you recommend?

21. The true average diameter of ball bearings of a certain type is supposed to be .5 in. A one-sample $t$ test will be carried out to see whether this is the case. What conclusion is appropriate in each of the following situations?
a. $n = 13$, $t = 1.6$, $\alpha = .05$
b. $n = 13$, $t = -1.6$, $\alpha = .05$
c. $n = 25$, $t = -2.6$, $\alpha = .01$
d. $n = 25$, $t = -3.9$

22. The article "The Foreman's View of Quality Control" (Quality Engrg., 1990: 257–280) described an investigation into the coating weights for large pipes resulting from a galvanized coating process. Production standards call for a true average weight of 200 lb per pipe. The accompanying descriptive summary and boxplot are from MINITAB.

Variable   N    Mean     Median   TrMean   StDev   SEMean
ctg wt     30   206.73   206.00   206.81   6.35    1.16

Variable   Min      Max      Q1       Q3
ctg wt     193.00   218.00   202.75   212.00

[Boxplot of coating weight; horizontal axis from 190 to 220.]

a. What does the boxplot suggest about the status of the specification for true average coating weight?
b. A normal probability plot of the data was quite straight. Use the descriptive output to test the appropriate hypotheses.

23. Exercise 33 in Chapter 1 gave $n = 26$ observations on escape time (sec) for oil workers in a simulated exercise, from which the sample mean and sample standard deviation are 370.69 and 24.36, respectively. Suppose the investigators had believed a priori that true average escape time would be at most 6 min. Does the data contradict this prior belief? Assuming normality, test the appropriate hypotheses using a significance level of .05.

24. Reconsider the sample observations on stabilized viscosity of asphalt specimens introduced in Exercise 43 in Chapter 1 (2781, 2900, 3013, 2856, and 2888). Suppose that for a particular application, it is required that true average viscosity be 3000. Does this requirement appear to have been satisfied? State and test the appropriate hypotheses.

25. Recall the first-grade IQ scores of Example 1.2. Here is a random sample of 10 of those scores:
107 113 108 127 146 103 108 118 111 119
The IQ test score has approximately a normal distribution with mean 100 and standard deviation 15 for the entire U.S. population of first-graders. Here we are interested in seeing whether the population of first-graders at this school is different from the national population.
Assume that the normal distribution with standard deviation 15 is valid for the school, and test at the .05 level to see whether the school mean differs from the national mean. Summarize your conclusion in a sentence about these first-graders.

26. In recent years major league baseball games have averaged 3 h in duration. However, because games in Denver tend to be high-scoring, it might be expected that the games would be longer there. In 2001, the 81 games in Denver averaged 185.54 min with standard deviation 24.6 min. What would you conclude?

27. On the label, Pepperidge Farm bagels are said to weigh four ounces each (113 g). A random sample of six bagels resulted in the following weights (in grams):
117.6 109.5 111.6 109.2 119.1 110.8
a. Based on this sample, is there any reason to doubt that the population mean is at least 113 g?
b. Assume that the population mean is actually 110 g and that the distribution is normal with standard deviation 4 g. In a $z$ test of $H_0: \mu = 113$ against $H_a: \mu < 113$ with $\alpha = .05$, find the probability of rejecting $H_0$ with six observations.
c. Under the conditions of part (b) with $\alpha = .05$, how many more observations would be needed in order for the power to be at least .95?

28. Minor surgery on horses under field conditions requires a reliable short-term anesthetic producing good muscle relaxation, minimal cardiovascular and respiratory changes, and a quick, smooth recovery with minimal aftereffects so that horses can be left unattended. The article "A Field Trial of Ketamine Anesthesia in the Horse" (Equine Vet. J., 1984: 176–179) reports that for a sample of $n = 73$ horses to which ketamine was administered under certain conditions, the sample average lateral recumbency (lying-down) time was 18.86 min and the standard deviation was 8.6 min. Does this data suggest that true average lateral recumbency time under these conditions is less than 20 min?
Test the appropriate hypotheses at level of significance .10.

29. The amount of shaft wear (.0001 in.) after a fixed mileage was determined for each of n = 8 internal combustion engines having copper lead as a bearing material, resulting in $\bar{x} = 3.72$ and $s = 1.25$.
a. Assuming that the distribution of shaft wear is normal with mean $\mu$, use the t test at level .05 to test $H_0$: $\mu = 3.50$ versus $H_a$: $\mu > 3.50$.
b. Using $\sigma = 1.25$, what is the type II error probability $\beta(\mu')$ of the test for the alternative $\mu' = 4.00$?

30. The recommended daily dietary allowance for zinc among males older than age 50 years is 15 mg/day. The article “Nutrient Intakes and Dietary Patterns of Older Americans: A National Study” (J. Gerontol., 1992: M145–150) reports the following summary data on intake for a sample of males age 65–74 years: n = 115, $\bar{x} = 11.3$, and $s = 6.43$. Does this data indicate that average daily zinc intake in the population of all males age 65–74 falls below the recommended allowance?

31. In an experiment designed to measure the time necessary for an inspector’s eyes to become used to the reduced amount of light necessary for penetrant inspection, the sample average time for n = 9 inspectors was 6.32 s and the sample standard deviation was 1.65 s. It has previously been assumed that the average adaptation time was at least 7 s. Assuming adaptation time to be normally distributed, does the data contradict prior belief? Use the t test with $\alpha = .1$.

32. A sample of 12 radon detectors of a certain type was selected, and each was exposed to 100 pCi/L of radon. The resulting readings were as follows: 105.6 100.1 90.9 105.0 91.2 99.6 96.9 107.7 96.5 103.3 91.3 92.4
a. Does this data suggest that the population mean reading under these conditions differs from 100? State and test the appropriate hypotheses using $\alpha = .05$.
b. Suppose that prior to the experiment, a value of $\sigma = 7.5$ had been assumed.
How many determinations would then have been appropriate to obtain $\beta = .10$ for the alternative $\mu = 95$?

33. Show that for any $\Delta > 0$, when the population distribution is normal and $\sigma$ is known, the two-tailed test satisfies $\beta(\mu_0 - \Delta) = \beta(\mu_0 + \Delta)$, so that $\beta(\mu')$ is symmetric about $\mu_0$.

34. For a fixed alternative value $\mu'$, show that $\beta(\mu') \to 0$ as $n \to \infty$ for either a one-tailed or a two-tailed z test in the case of a normal population distribution with known $\sigma$.

35. The industry standard for the amount of alcohol poured into many types of drinks (e.g., gin for a gin and tonic, whiskey on the rocks) is 1.5 oz. Each individual in a sample of 8 bartenders with at least 5 years of experience was asked to pour rum for a rum and coke into a short, wide (tumbler) glass, resulting in the following data: 2.00 1.78 2.16 1.91 1.70 1.67 1.83 1.48 (Summary quantities agree with those given in the article “Bottoms Up! The Influence of Elongation on Pouring and Consumption Volume,” J. Consumer Res., 2003: 455–463.)
a. What does a boxplot suggest about the distribution of the amount poured?
b. Carry out a test of hypotheses to decide whether there is strong evidence for concluding that the true average amount poured differs from the industry standard.
c. Does the validity of the test you carried out in (b) depend on any assumptions about the population distribution? If so, check the plausibility of such assumptions.
d. Suppose the actual standard deviation of the amount poured is .20 oz. Determine the probability of a type II error for the test of (b) when the true average amount poured is actually (1) 1.6, (2) 1.7, (3) 1.8.

9.3 Tests Concerning a Population Proportion

Let p denote the proportion of individuals or objects in a population who possess a specified property (e.g., cars with manual transmissions or smokers who smoke a filter cigarette).
If an individual or object with the property is labeled a success (S), then p is the population proportion of successes. Tests concerning p will be based on a random sample of size n from the population. Provided that n is small relative to the population size, X (the number of S’s in the sample) has (approximately) a binomial distribution. Furthermore, if n itself is large, both X and the estimator $\hat{p} = X/n$ are approximately normally distributed. We first consider large-sample tests based on this latter fact and then turn to the small-sample case that directly uses the binomial distribution.

Large-Sample Tests

Large-sample tests concerning p are a special case of the more general large-sample procedures for a parameter $\theta$. Let $\hat{\theta}$ be an estimator of $\theta$ that is (at least approximately) unbiased and has approximately a normal distribution. The null hypothesis has the form $H_0$: $\theta = \theta_0$, where $\theta_0$ denotes a number (the null value) appropriate to the problem context. Suppose that when $H_0$ is true, the standard deviation of $\hat{\theta}$, $\sigma_{\hat{\theta}}$, involves no unknown parameters. For example, if $\theta = \mu$ and $\hat{\theta} = \bar{X}$, then $\sigma_{\hat{\theta}} = \sigma_{\bar{X}} = \sigma/\sqrt{n}$, which involves no unknown parameters only if the value of $\sigma$ is known. A large-sample test statistic results from standardizing $\hat{\theta}$ under the assumption that $H_0$ is true [so that $E(\hat{\theta}) = \theta_0$]:

Test statistic: $Z = \dfrac{\hat{\theta} - \theta_0}{\sigma_{\hat{\theta}}}$

If the alternative hypothesis is $H_a$: $\theta > \theta_0$, an upper-tailed test whose significance level is approximately $\alpha$ is specified by the rejection region $z \ge z_\alpha$. The other two alternatives, $H_a$: $\theta < \theta_0$ and $H_a$: $\theta \ne \theta_0$, are tested using a lower-tailed z test and a two-tailed z test, respectively. In the case $\theta = p$, $\sigma_{\hat{\theta}}$ will not involve any unknown parameters when $H_0$ is true, but this is atypical. When $\sigma_{\hat{\theta}}$ does involve unknown parameters, it is often possible to use an estimated standard deviation $S_{\hat{\theta}}$ in place of $\sigma_{\hat{\theta}}$ and still have Z approximately normally distributed when $H_0$ is true (because when n is large, $s_{\hat{\theta}} \approx \sigma_{\hat{\theta}}$ for most samples).
The large-sample test of the previous section furnishes an example of this: Because $\sigma$ is usually unknown, we use $s_{\hat{\theta}} = s_{\bar{X}} = s/\sqrt{n}$ in place of $\sigma/\sqrt{n}$ in the denominator of z.

The estimator $\hat{p} = X/n$ is unbiased [$E(\hat{p}) = p$], has approximately a normal distribution, and its standard deviation is $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$. These facts were used in Section 8.2 to obtain a confidence interval for p. When $H_0$ is true, $E(\hat{p}) = p_0$ and $\sigma_{\hat{p}} = \sqrt{p_0(1-p_0)/n}$, so $\sigma_{\hat{p}}$ does not involve any unknown parameters. It then follows that when n is large and $H_0$ is true, the test statistic

$Z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$

has approximately a standard normal distribution. If the alternative hypothesis is $H_a$: $p > p_0$ and the upper-tailed rejection region $z \ge z_\alpha$ is used, then

P(type I error) = P($H_0$ is rejected when it is true) = P($Z \ge z_\alpha$ when Z has approximately a standard normal distribution) $\approx \alpha$

Thus the desired level of significance $\alpha$ is attained by using the critical value that captures area $\alpha$ in the upper tail of the z curve. Rejection regions for the other two alternative hypotheses, lower-tailed for $H_a$: $p < p_0$ and two-tailed for $H_a$: $p \ne p_0$, are justified in an analogous manner.

Null hypothesis: $H_0$: $p = p_0$
Test statistic value: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$

Alternative Hypothesis     Rejection Region
$H_a$: $p > p_0$           $z \ge z_\alpha$ (upper-tailed)
$H_a$: $p < p_0$           $z \le -z_\alpha$ (lower-tailed)
$H_a$: $p \ne p_0$         either $z \ge z_{\alpha/2}$ or $z \le -z_{\alpha/2}$ (two-tailed)

These test procedures are valid provided that $np_0 \ge 10$ and $n(1-p_0) \ge 10$.

Example 9.11 Recent information suggests that obesity is an increasing problem in America among all age groups. The Associated Press (Oct.
9, 2002) reported that 1276 individuals in a sample of 4115 adults were found to be obese (a body mass index exceeding 30; this index is a measure of weight relative to height). A 1998 survey based on people’s own assessment revealed that 20% of adult Americans considered themselves obese. Does the recent data suggest that the true proportion of adults who are obese is more than 1.5 times the percentage from the self-assessment survey? Let’s carry out a test of hypotheses using a significance level of .10.

1. p = the proportion of all American adults who are obese.
2. Saying that the current percentage is 1.5 times the self-assessment percentage is equivalent to the assertion that the current percentage is 30%, from which we have the null hypothesis as $H_0$: $p = .30$.
3. The phrase “more than” in the problem description implies that the alternative hypothesis is $H_a$: $p > .30$.
4. Since $np_0 = 4115(.3) \ge 10$ and $nq_0 = 4115(.7) \ge 10$, the large-sample z test can certainly be used. The test statistic value is $z = (\hat{p} - .3)/\sqrt{(.3)(.7)/n}$.
5. The form of $H_a$ implies that an upper-tailed test is appropriate: Reject $H_0$ if $z \ge z_{.10} = 1.28$.
6. $\hat{p} = 1276/4115 = .310$, from which $z = (.310 - .3)/\sqrt{(.3)(.7)/4115} = .010/.0071 = 1.40$.
7. Since 1.40 exceeds the critical value 1.28, z lies in the rejection region. This justifies rejecting the null hypothesis. Using a significance level of .10, it does appear that more than 30% of American adults are obese. ■

$\beta$ and Sample Size Determination

When $H_0$ is true, the test statistic Z has approximately a standard normal distribution. Now suppose that $H_0$ is not true and that $p = p'$. Then Z still has approximately a normal distribution (because it is a linear function of $\hat{p}$), but its mean value and variance are no longer 0 and 1, respectively.
Instead,

$E(Z) = \dfrac{p' - p_0}{\sqrt{p_0(1-p_0)/n}} \qquad V(Z) = \dfrac{p'(1-p')/n}{p_0(1-p_0)/n}$

The probability of a type II error for an upper-tailed test is $\beta(p') = P(Z < z_\alpha \text{ when } p = p')$. This can be computed by using the given mean and variance to standardize and then referring to the standard normal cdf. In addition, if it is desired that the level $\alpha$ test also have $\beta(p') = \beta$ for a specified value of $\beta$, this equation can be solved for the necessary n as in Section 9.2. General expressions for $\beta(p')$ and n are given in the accompanying box.

Alternative Hypothesis: $H_a$: $p > p_0$
$\beta(p') = \Phi\left[\dfrac{p_0 - p' + z_\alpha\sqrt{p_0(1-p_0)/n}}{\sqrt{p'(1-p')/n}}\right]$

Alternative Hypothesis: $H_a$: $p < p_0$
$\beta(p') = 1 - \Phi\left[\dfrac{p_0 - p' - z_\alpha\sqrt{p_0(1-p_0)/n}}{\sqrt{p'(1-p')/n}}\right]$

Alternative Hypothesis: $H_a$: $p \ne p_0$
$\beta(p') = \Phi\left[\dfrac{p_0 - p' + z_{\alpha/2}\sqrt{p_0(1-p_0)/n}}{\sqrt{p'(1-p')/n}}\right] - \Phi\left[\dfrac{p_0 - p' - z_{\alpha/2}\sqrt{p_0(1-p_0)/n}}{\sqrt{p'(1-p')/n}}\right]$

The sample size n for which the level $\alpha$ test also satisfies $\beta(p') = \beta$ is

$n = \left[\dfrac{z_\alpha\sqrt{p_0(1-p_0)} + z_\beta\sqrt{p'(1-p')}}{p' - p_0}\right]^2$ (one-tailed test)

$n = \left[\dfrac{z_{\alpha/2}\sqrt{p_0(1-p_0)} + z_\beta\sqrt{p'(1-p')}}{p' - p_0}\right]^2$ (two-tailed test; an approximate solution)

Example 9.12 A package-delivery service advertises that at least 90% of all packages brought to its office by 9 a.m. for delivery in the same city are delivered by noon that day. Let p denote the true proportion of such packages that are delivered as advertised and consider the hypotheses $H_0$: $p = .9$ versus $H_a$: $p < .9$. If only 80% of the packages are delivered as advertised, how likely is it that a level .01 test based on n = 225 packages will detect such a departure from $H_0$? What should the sample size be to ensure that $\beta(.8) = .01$? With $\alpha = .01$, $p_0 = .9$, $p' = .8$, and n = 225,

$\beta(.8) = 1 - \Phi\left[\dfrac{.9 - .8 - 2.33\sqrt{(.9)(.1)/225}}{\sqrt{(.8)(.2)/225}}\right] = 1 - \Phi(2.00) = .0228$

Thus the probability that $H_0$ will be rejected using the test when $p = .8$ is .9772; roughly 98% of all samples will result in correct rejection of $H_0$. Using $z_\alpha = z_\beta = 2.33$ in the sample size formula yields

$n = \left[\dfrac{2.33\sqrt{(.9)(.1)} + 2.33\sqrt{(.8)(.2)}}{.8 - .9}\right]^2 \approx 266$ ■

Small-Sample Tests

Test procedures when the sample size n is small are based directly on the binomial distribution rather than the normal approximation. Consider the alternative hypothesis $H_a$: $p > p_0$ and again let X be the number of successes in the sample. Then X is the test statistic, and the upper-tailed rejection region has the form $x \ge c$. When $H_0$ is true, X has a binomial distribution with parameters n and $p_0$, so

P(type I error) = P($H_0$ is rejected when it is true) = $P[X \ge c \text{ when } X \sim \text{Bin}(n, p_0)]$ = $1 - P[X \le c - 1 \text{ when } X \sim \text{Bin}(n, p_0)]$ = $1 - B(c - 1; n, p_0)$

As the critical value c decreases, more x values are included in the rejection region and P(type I error) increases.
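The relationship between the critical value c and P(type I error) can be checked numerically. The following is a minimal sketch, assuming illustrative values n = 10 and $p_0 = .5$ that are not taken from the text; the helper names `binom_cdf` and `type_i_error` are hypothetical and use only the Python standard library.

```python
from math import comb


def binom_cdf(x, n, p):
    """B(x; n, p) = P(X <= x) for X ~ Bin(n, p); evaluates to 0 when x < 0."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(0, x + 1))


def type_i_error(c, n, p0):
    """P(X >= c) when X ~ Bin(n, p0), computed as 1 - B(c - 1; n, p0)."""
    return 1.0 - binom_cdf(c - 1, n, p0)


if __name__ == "__main__":
    n, p0 = 10, 0.5  # illustrative values only (an assumption, not from the text)
    # Lowering c enlarges the rejection region {c, ..., n}.
    for c in range(10, 6, -1):
        print(f"c = {c:2d}: P(type I error) = {type_i_error(c, n, p0):.4f}")
```

Under these assumed values the error probability grows from about .001 at c = 10 to about .172 at c = 7, in line with the statement that decreasing c increases P(type I error).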
Because X has a discrete probability distribution, it is usually not possible to find a value of c for which P(type I error) is exactly the desired significance level $\alpha$ (e.g., .05 or .01). Instead, the largest rejection region of the form {c, c + 1, . . . , n} satisfying $1 - B(c - 1; n, p_0) \le \alpha$ is used.

Let $p'$ denote an alternative value of p ($p' > p_0$). When $p = p'$, $X \sim \text{Bin}(n, p')$, so

$\beta(p')$ = P(type II error when $p = p'$) = $P[X < c \text{ when } X \sim \text{Bin}(n, p')]$ = $B(c - 1; n, p')$

That is, $\beta(p')$ is the result of a straightforward binomial probability calculation. The sample size n necessary to ensure that a level $\alpha$ test also has specified $\beta$ at a particular alternative value $p'$ must be determined by trial and error using the binomial cdf.

Test procedures for $H_a$: $p < p_0$ and for $H_a$: $p \ne p_0$ are constructed in a similar manner. In the former case, the appropriate rejection region has the form $x \le c$ (a lower-tailed test). The critical value c is the largest number satisfying $B(c; n, p_0) \le \alpha$. The rejection region when the alternative hypothesis is $H_a$: $p \ne p_0$ consists of both large and small x values.

Example 9.13 A plastics manufacturer has developed a new type of plastic trash can and proposes to sell them with an unconditional 6-year warranty. To see whether this is economically feasible, 20 prototype cans are subjected to an accelerated life test to simulate 6 years of use. The proposed warranty will be modified only if the sample data strongly suggests that fewer than 90% of such cans would survive the 6-year period. Let p denote the proportion of all cans that survive the accelerated test. The relevant hypotheses are then $H_0$: $p = .9$ versus $H_a$: $p < .9$. A decision will be based on the test statistic X, the number among the 20 that survive. If the desired significance level is $\alpha = .05$, c must satisfy $B(c; 20, .9) \le .05$. From Appendix Table A.1, $B(15; 20, .9) = .043$, and $B(16; 20, .9) = .133$.
The appropriate rejection region is therefore $x \le 15$. If the accelerated test results in x = 14, $H_0$ would be rejected in favor of $H_a$, necessitating a modification of the proposed warranty. The probability of a type II error for the alternative value $p' = .8$ is

$\beta(.8)$ = P[$H_0$ is not rejected when $X \sim \text{Bin}(20, .8)$] = $P[X \ge 16 \text{ when } X \sim \text{Bin}(20, .8)]$ = $1 - B(15; 20, .8)$ = $1 - .370 = .630$

That is, when $p = .8$, 63% of all samples consisting of n = 20 cans would result in $H_0$ being incorrectly not rejected. This error probability is high because 20 is a small sample size and $p' = .8$ is close to the null value $p_0 = .9$. ■

Exercises Section 9.3 (36–44)

36. State DMV records indicate that of all vehicles undergoing emissions testing during the previous year, 70% passed on the first try. A random sample of 200 cars tested in a particular county during the current year yields 124 that passed on the initial test. Does this suggest that the true proportion for this county during the current year differs from the previous statewide proportion? Test the relevant hypotheses using $\alpha = .05$.

37. A manufacturer of nickel–hydrogen batteries randomly selects 100 nickel plates for test cells, cycles them a specified number of times, and determines that 14 of the plates have blistered.
a. Does this provide compelling evidence for concluding that more than 10% of all plates blister under such circumstances? State and test the appropriate hypotheses using a significance level of .05. In reaching your conclusion, what type of error might you have committed?
b. If it is really the case that 15% of all plates blister under these circumstances and a sample size of 100 is used, how likely is it that the null hypothesis of part (a) will not be rejected by the level .05 test? Answer this question for a sample size of 200.
c. How many plates would have to be tested to have $\beta(.15) = .10$ for the test of part (a)?

38. A random sample of 150 recent donations at a blood bank reveals that 82 were type A blood.
Does this suggest that the actual percentage of type A donations differs from 40%, the percentage of the population having type A blood? Carry out a test of the appropriate hypotheses using a significance level of .01. Would your conclusion have been different if a significance level of .05 had been used?

39. A university library ordinarily has a complete shelf inventory done once every year. Because of new shelving rules instituted the previous year, the head librarian believes it may be possible to save money by postponing the inventory. The librarian decides to select at random 1000 books from the library’s collection and have them searched in a preliminary manner. If evidence indicates strongly that the true proportion of misshelved or unlocatable books is less than .02, then the inventory will be postponed.
a. Among the 1000 books searched, 15 were misshelved or unlocatable. Test the relevant hypotheses and advise the librarian what to do (use $\alpha = .05$).
b. If the true proportion of misshelved and lost books is actually .01, what is the probability that the inventory will be (unnecessarily) taken?
c. If the true proportion is .05, what is the probability that the inventory will be postponed?

40. The article “Statistical Evidence of Discrimination” (J. Amer. Statist. Assoc., 1982: 773–783) discusses the court case Swain v. Alabama (1965), in which it was alleged that there was discrimination against blacks in grand jury selection. Census data suggested that 25% of those eligible for grand jury service were black, yet a random sample of 1050 people called to appear for possible duty yielded only 177 blacks. Using a level .01 test, does this data argue strongly for a conclusion of discrimination?

41. A plan for an executive traveler’s club has been developed by an airline on the premise that 5% of its current customers would qualify for membership. A random sample of 500 customers yielded 40 who would qualify.
a.
Using this data, test at level .01 the null hypothesis that the company’s premise is correct against the alternative that it is not correct.
b. What is the probability that when the test of part (a) is used, the company’s premise will be judged correct when in fact 10% of all current customers qualify?

42. Each of a group of 20 intermediate tennis players is given two rackets, one having nylon strings and the other synthetic gut strings. After several weeks of playing with the two rackets, each player will be asked to state a preference for one of the two types of strings. Let p denote the proportion of all such players who would prefer gut to nylon, and let X be the number of players in the sample who prefer gut. Because gut strings are more expensive, consider the null hypothesis that at most 50% of all such players prefer gut. We simplify this to $H_0$: $p = .5$, planning to reject $H_0$ only if sample evidence strongly favors gut strings.
a. Which of the rejection regions {15, 16, 17, 18, 19, 20}, {0, 1, 2, 3, 4, 5}, or {0, 1, 2, 3, 17, 18, 19, 20} is most appropriate, and why are the other two not appropriate?
b. What is the probability of a type I error for the chosen region of part (a)? Does the region specify a level .05 test? Is it the best level .05 test?
c. If 60% of all enthusiasts prefer gut, calculate the probability of a type II error using the appropriate region from part (a). Repeat if 80% of all enthusiasts prefer gut.
d. If 13 out of the 20 players prefer gut, should $H_0$ be rejected using a significance level of .10?

43. A manufacturer of plumbing fixtures has developed a new type of washerless faucet. Let p = P(a randomly selected faucet of this type will develop a leak within 2 years under normal use). The manufacturer has decided to proceed with production unless it can be determined that p is too large; the borderline acceptable value of p is specified as .10.
The manufacturer decides to subject n of these faucets to accelerated testing (approximating 2 years of normal use). With X = the number among the n faucets that leak before the test concludes, production will commence unless the observed X is too large. It is decided that if $p = .10$, the probability of not proceeding should be at most .10, whereas if $p = .30$ the probability of proceeding should be at most .10. Can n = 10 be used? n = 20? n = 25? What is the appropriate rejection region for the chosen n, and what are the actual error probabilities when this region is used?

44. Scientists have recently become concerned about the safety of Teflon cookware and various food containers because perfluorooctanoic acid (PFOA) is used in the manufacturing process. An article in the July 27, 2005, New York Times reported that of 600 children tested, 96% had PFOA in their blood. According to the FDA, 90% of all Americans have PFOA in their blood.
a. Does the data on PFOA incidence among children suggest that the percentage of all children who have PFOA in their blood exceeds the FDA percentage for all Americans? Carry out an appropriate test of hypotheses.
b. If 95% of all children have PFOA in their blood, how likely is it that the null hypothesis tested in (a) will be rejected when a significance level of .01 is employed?
c. Referring back to (b), what sample size would be necessary for the relevant probability to be .10?

9.4 P-Values

Using the rejection region method to test hypotheses entails first selecting a significance level $\alpha$. Then after computing the value of the test statistic, the null hypothesis $H_0$ is rejected if the value falls in the rejection region and is otherwise not rejected. We now consider another way of reaching a conclusion in a hypothesis testing analysis. This alternative approach is based on calculation of a certain probability called a P-value.
One advantage is that the P-value provides an intuitive measure of the strength of evidence in the data against $H_0$.

DEFINITION The P-value is the probability, calculated assuming that the null hypothesis is true, of obtaining a value of the test statistic at least as contradictory to $H_0$ as the value calculated from the available sample.

The definition is quite a mouthful. Here are some key points:
• The P-value is a probability.
• This probability is calculated assuming that the null hypothesis is true.
• To determine the P-value, we must first decide which values of the test statistic are at least as contradictory to $H_0$ as the value obtained from our sample.

Example 9.14 Urban storm water can be contaminated by many sources, including discarded batteries. When ruptured, these batteries release metals of environmental significance. The paper “Urban Battery Litter” (J. Environ. Engr., 2009: 46–57) presented summary data for characteristics of a variety of batteries found in urban areas around Cleveland. A sample of 51 Panasonic AAA batteries gave a sample mean zinc mass of 2.06 g and a sample standard deviation of .141 g. Does this data provide compelling evidence for concluding that the population mean zinc mass exceeds 2.0 g? With $\mu$ denoting the true average zinc mass for such batteries, the relevant hypotheses are $H_0$: $\mu = 2.0$ versus $H_a$: $\mu > 2.0$. The sample size is large enough so that a z test can be used without making any specific assumption about the shape of the population distribution. The test statistic value is

$z = \dfrac{\bar{x} - 2.0}{s/\sqrt{n}} = \dfrac{2.06 - 2.0}{.141/\sqrt{51}} = 3.04$

Now we must decide which values of z are at least as contradictory to $H_0$. Let’s first consider an easier task: Which values of $\bar{x}$ are at least as contradictory to the null hypothesis as 2.06, the mean of the observations in our sample?
Because > appears in $H_a$, it should be clear that 2.10 is at least as contradictory to $H_0$ as is 2.06, so is 2.25, and so in fact is any $\bar{x}$ value that exceeds 2.06. But an $\bar{x}$ value that exceeds 2.06 corresponds to a value of z that exceeds 3.04. Thus the P-value is

P-value = $P(Z \ge 3.04 \text{ when } \mu = 2.0)$

Since the test statistic Z was created by subtracting the null value 2.0 in the numerator, when $\mu = 2.0$ (i.e., when $H_0$ is true) Z has approximately a standard normal distribution. As a result,

P-value = $P(Z \ge 3.04 \text{ when } \mu = 2.0)$ $\approx$ area under the z curve to the right of 3.04 = $1 - \Phi(3.04) = .0012$ ■

We will shortly illustrate how to determine the P-value for any z or t test; that is, any test where the reference distribution is the standard normal distribution (and z curve) or some t distribution (and corresponding t curve). For the moment, though, let’s focus on reaching a conclusion once the P-value is available. Because it is a probability, the P-value must be between 0 and 1. What kinds of P-values provide evidence against the null hypothesis? Consider two specific instances:
• P-value = .250: In this case, fully 25% of all possible test statistic values are at least as contradictory to $H_0$ as the one that came out of our sample. So our data is not that contradictory to the null hypothesis.
• P-value = .0018: Here, only .18%, much less than 1%, of all possible test statistic values are at least as contradictory to $H_0$ as what we obtained. Thus the sample appears to be highly contradictory to the null hypothesis.

More generally, the smaller the P-value, the more evidence there is in the sample data against the null hypothesis and for the alternative hypothesis. That is, $H_0$ should be rejected in favor of $H_a$ when the P-value is sufficiently small. So what constitutes “sufficiently small”?

DECISION RULE BASED ON THE P-VALUE Select a significance level $\alpha$ (as before, the desired type I error probability).
Then reject $H_0$ if P-value $\le \alpha$; do not reject $H_0$ if P-value $> \alpha$.

Thus if the P-value exceeds the chosen significance level, the null hypothesis cannot be rejected at that level. But if the P-value is less than or equal to $\alpha$, then there is enough evidence to justify rejecting $H_0$. In Example 9.14, we calculated P-value = .0012. Then using a significance level of .01, we would reject the null hypothesis in favor of the alternative hypothesis because $.0012 \le .01$. However, suppose we select a significance level of only .001, which requires more substantial evidence from the data before $H_0$ can be rejected. In this case we would not reject $H_0$ because $.0012 > .001$.

How does the decision rule based on the P-value compare to the decision rule employed in the rejection region approach? The two procedures (the rejection region method and the P-value method) are in fact identical. Whatever the conclusion reached by employing the rejection region approach with a particular $\alpha$, the same conclusion will be reached via the P-value approach using that same $\alpha$.

Example 9.15 The nicotine content problem discussed in Example 9.5 involved testing $H_0$: $\mu = 1.5$ versus $H_a$: $\mu > 1.5$ using a z test (i.e., a test which utilizes the z curve as the reference distribution). The inequality in $H_a$ implies that the upper-tailed rejection region $z \ge z_\alpha$ is appropriate. Suppose z = 2.10. Then using exactly the same reasoning as in Example 9.14 gives P-value = $1 - \Phi(2.10) = .0179$. Consider now testing with several different significance levels:

$\alpha = .10 \Rightarrow z_\alpha = z_{.10} = 1.28 \Rightarrow 2.10 \ge 1.28 \Rightarrow$ reject $H_0$
$\alpha = .05 \Rightarrow z_\alpha = z_{.05} = 1.645 \Rightarrow 2.10 \ge 1.645 \Rightarrow$ reject $H_0$
$\alpha = .01 \Rightarrow z_\alpha = z_{.01} = 2.33 \Rightarrow 2.10 < 2.33 \Rightarrow$ do not reject $H_0$

Because P-value = $.0179 \le .10$ and also $.0179 \le .05$, using the P-value approach results in rejection of $H_0$ for the first two significance levels. However, for $\alpha = .01$, 2.10 is not in the rejection region and .0179 is larger than .01.
More generally, whenever $\alpha$ is smaller than the P-value .0179, the critical value $z_\alpha$ will be beyond the computed value z = 2.10 and $H_0$ cannot be rejected by either method. This is illustrated in Figure 9.5.

[Figure 9.5: three standard normal (z) curves, each with the computed z = 2.10 marked]
Figure 9.5 Relationship between $\alpha$ and tail area captured by computed z: (a) tail area captured by computed z (shaded area = .0179); (b) when $\alpha > .0179$, $z_\alpha < 2.10$ and $H_0$ is rejected; (c) when $\alpha < .0179$, $z_\alpha > 2.10$ and $H_0$ is not rejected ■

Let’s reconsider the P-value .0012 in Example 9.14 once again. $H_0$ can be rejected only if $.0012 \le \alpha$. Thus the null hypothesis can be rejected if $\alpha$ = .05 or .01 or .005 or .0015 or .00125. What is the smallest significance level $\alpha$ here for which $H_0$ can be rejected? It is the P-value .0012.

PROPOSITION The P-value is the smallest significance level $\alpha$ at which the null hypothesis can be rejected. Because of this, the P-value is alternatively referred to as the observed significance level (OSL) for the data.

It is customary to call the data significant when $H_0$ is rejected and not significant otherwise. The P-value is then the smallest level at which the data is significant. An easy way to visualize the comparison of the P-value with the chosen $\alpha$ is to draw a picture like that of Figure 9.6. The calculation of the P-value depends on whether the test is upper-, lower-, or two-tailed. However, once it has been calculated, the comparison with $\alpha$ does not depend on which type of test was used.

[Figure 9.6: number line from 0 to 1 with the P-value (the smallest level at which $H_0$ can be rejected) marked; region (b) lies to its left and region (a) to its right]
Figure 9.6 Comparing $\alpha$ and the P-value: (a) reject $H_0$ when $\alpha$ lies here; (b) do not reject $H_0$ when $\alpha$ lies here

Example 9.16 The true average time to initial relief of pain for a best-selling pain reliever is known to be 10 min. Let $\mu$ denote the true average time to relief for a company’s newly developed reliever.
Suppose that when data from an experiment involving the new pain reliever was analyzed, the P-value for testing $H_0$: $\mu = 10$ versus $H_a$: $\mu < 10$ was calculated as .0384. Since $\alpha = .05$ is larger than the P-value [.05 lies in the interval (a) of Figure 9.6], $H_0$ would be rejected by anyone carrying out the test at level .05. However, at level .01, $H_0$ would not be rejected because .01 is smaller than the smallest level (.0384) at which $H_0$ can be rejected. ■

The most widely used statistical computer packages automatically include a P-value when a hypothesis-testing analysis is performed. A conclusion can then be drawn directly from the output, without reference to a table of critical values. With the P-value in hand, an investigator can see at a quick glance for which significance levels $H_0$ would or would not be rejected. Also, each individual can then select his or her own significance level. In addition, knowing the P-value allows a decision maker to distinguish between a close call (e.g., $\alpha = .05$, P-value = .0498) and a very clear-cut conclusion (e.g., $\alpha = .05$, P-value = .0003), something that would not be possible just from the statement “$H_0$ can be rejected at significance level .05.”

P-Values for z Tests

The P-value for a z test (one based on a test statistic whose distribution when $H_0$ is true is at least approximately standard normal) is easily determined from the information in Appendix Table A.3. Consider an upper-tailed test and let z denote the computed value of the test statistic Z. The null hypothesis is rejected if $z \ge z_\alpha$, and the P-value is the smallest $\alpha$ for which this is the case. Since $z_\alpha$ increases as $\alpha$ decreases, the P-value is the value of $\alpha$ for which $z = z_\alpha$. That is, the P-value is just the area captured by the computed value z in the upper tail of the standard normal curve. The corresponding cumulative area is $\Phi(z)$, so in this case P-value = $1 - \Phi(z)$.
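The upper-tail area $1 - \Phi(z)$ can be evaluated without a table. The sketch below is a minimal illustration that uses `statistics.NormalDist` from the Python standard library for $\Phi$; the function name `upper_tail_p_value` is a hypothetical helper, and the two z values are the computed statistics from Examples 9.14 and 9.15.

```python
from statistics import NormalDist


def upper_tail_p_value(z):
    """P-value for an upper-tailed z test: area to the right of z, 1 - Phi(z)."""
    return 1.0 - NormalDist().cdf(z)


if __name__ == "__main__":
    # z = 3.04 (Example 9.14) and z = 2.10 (Example 9.15)
    for z in (3.04, 2.10):
        print(f"z = {z:.2f}: P-value = {upper_tail_p_value(z):.4f}")
```

Rounded to four decimals, the results agree with the tabled values .0012 and .0179 quoted in those examples.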
An analogous argument for a lower-tailed test shows that the P-value is the area captured by the computed value z in the lower tail of the standard normal curve. More care must be exercised in the case of a two-tailed test. Suppose first that z is positive. Then the P-value is the value of α satisfying z = $z_{\alpha/2}$ (i.e., computed z = upper-tail critical value). This says that the area captured in the upper tail is half the P-value, so that P-value = 2[1 − Φ(z)]. If z is negative, the P-value is the α for which z = −$z_{\alpha/2}$, or, equivalently, −z = $z_{\alpha/2}$, so P-value = 2[1 − Φ(−z)]. Since −z = |z| when z is negative, the P-value = 2[1 − Φ(|z|)] for either positive or negative z.

$$
\text{P-value:}\quad P =
\begin{cases}
1 - \Phi(z) & \text{for an upper-tailed test} \\
\Phi(z) & \text{for a lower-tailed test} \\
2[1 - \Phi(|z|)] & \text{for a two-tailed test}
\end{cases}
$$

Each of these is the probability of getting a value at least as extreme as what was obtained (assuming H0 true). The three cases are illustrated in Figure 9.7.

[Figure 9.7 Determination of the P-value for a z test: 1. upper-tailed test (Ha contains the inequality >), P-value = area in upper tail = 1 − Φ(z); 2. lower-tailed test (Ha contains the inequality <), P-value = area in lower tail = Φ(z); 3. two-tailed test (Ha contains the inequality ≠), P-value = sum of area in two tails = 2[1 − Φ(|z|)]]

The next example illustrates the use of the P-value approach to hypothesis testing by means of a sequence of steps modified from our previously recommended sequence.

Example 9.17
The target thickness for silicon wafers used in a type of integrated circuit is 245 μm. A sample of 50 wafers is obtained and the thickness of each one is determined, resulting in a sample mean thickness of 246.18 μm and a sample standard deviation of 3.60 μm. Does this data suggest that true average wafer thickness is something other than the target value?

1. Parameter of interest: μ = true average wafer thickness
2. Null hypothesis: H0: μ = 245
3. Alternative hypothesis: Ha: μ ≠ 245
4. Formula for test statistic value: $z = \dfrac{\bar{x} - 245}{s/\sqrt{n}}$
5. Calculation of test statistic value: $z = \dfrac{246.18 - 245}{3.60/\sqrt{50}} = 2.32$
6. Determination of P-value: Because the test is two-tailed, P-value = 2[1 − Φ(2.32)] = .0204
7. Conclusion: Using a significance level of .01, H0 would not be rejected since .0204 > .01. At this significance level, there is insufficient evidence to conclude that true average thickness differs from the target value. ■

P-Values for t Tests

Just as the P-value for a z test is a z curve area, the P-value for a t test will be a t curve area. Figure 9.8 illustrates the three different cases. The number of df for the one-sample t test is n − 1.

The table of t critical values used previously for confidence and prediction intervals doesn't contain enough information about any particular t distribution to allow for accurate determination of desired areas. So we have included another t table in Appendix Table A.7, one that contains a tabulation of upper-tail t curve areas. Each different column of the table is for a different number of df, and the rows are for calculated values of the test statistic t ranging from 0.0 to 4.0 in increments of .1. For example, the number .074 appears at the intersection of the 1.6 row and the 8 df column, so the area under the 8 df curve to the right of 1.6 (an upper-tail area) is .074. Because t curves are symmetric, .074 is also the area under the 8 df curve to the left of −1.6 (a lower-tail area).

Suppose, for example, that a test of H0: μ = 100 versus Ha: μ > 100 is based on the 8 df t distribution. If the calculated value of the test statistic is t = 1.6, then the P-value for this upper-tailed test is .074. Because .074 exceeds .05, we would not be able to reject H0 at a significance level of .05.
If the alternative hypothesis is Ha: μ < 100 and a test based on 20 df yields t = −3.2, then Appendix Table A.7 shows that the P-value is the captured lower-tail area .002. The null hypothesis can be rejected at either level .05 or .01. Consider testing H0: μ1 − μ2 = 0 versus Ha: μ1 − μ2 ≠ 0; the null hypothesis states that the means of the two populations are identical, whereas the alternative hypothesis states that they are different without specifying a direction of departure from H0. If a t test is based on 20 df and t = 3.2, then the P-value for this two-tailed test is 2(.002) = .004. This would also be the P-value for t = −3.2. The tail area is doubled because values both larger than 3.2 and smaller than −3.2 are more contradictory to H0 than what was calculated (values farther out in either tail of the t curve).

[Figure 9.8 P-values for t tests, each using the t curve for the relevant df: 1. upper-tailed test (Ha contains the inequality >), P-value = area in upper tail; 2. lower-tailed test (Ha contains the inequality <), P-value = area in lower tail; 3. two-tailed test (Ha contains the inequality ≠), P-value = sum of area in two tails]

Example 9.18
In Example 9.9, we carried out a test of H0: μ = 25 versus Ha: μ > 25 based on 4 df. The calculated value of t was 1.04. Looking to the 4 df column of Appendix Table A.7 and down to the 1.0 row, we see that the entry is .187, so the P-value ≈ .187. This P-value is clearly larger than any reasonable significance level α (.01, .05, and even .10), so there is no reason to reject the null hypothesis. The MINITAB output included in Example 9.9 has P-value = .18. P-values from software packages will be more accurate than what results from Appendix Table A.7 since values of t in our table are accurate only to the tenths digit.
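The tail areas quoted from Appendix Table A.7 can be approximated without a table. The following is a minimal sketch, assuming only the standard formula for the t density and using Simpson's rule for the integration; the function names `t_pdf` and `t_upper_tail` are hypothetical, and this is not how the table or MINITAB was produced.

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t: float, df: int, steps: int = 2000) -> float:
    """Area under the t curve to the right of t (steps must be even)."""
    a = abs(t)
    if a == 0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3            # Simpson's rule: area between 0 and |t|
    right_of_abs = 0.5 - central       # symmetry: half the area lies right of 0
    return right_of_abs if t > 0 else 1.0 - right_of_abs

print(round(t_upper_tail(1.6, 8), 3))    # 0.074, the 8 df entry in the 1.6 row
print(round(t_upper_tail(1.0, 4), 3))    # 0.187, the table entry used in Example 9.18
print(round(t_upper_tail(1.04, 4), 2))   # 0.18, the value reported by MINITAB
print(2 * t_upper_tail(3.2, 20))         # two-tailed P-value for t = 3.2 or -3.2, 20 df
```

A lower-tailed P-value for a negative t is `t_upper_tail(abs(t), df)` by symmetry, and a two-tailed P-value doubles that area.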
■

More on Interpreting P-Values

The P-value resulting from carrying out a test on a selected sample is not the probability that H0 is true, nor is it the probability of rejecting the null hypothesis. Once again, it is the probability, calculated assuming that H0 is true, of obtaining a test statistic value at least as contradictory to the null hypothesis as the value that actually resulted. For example, consider testing H0: μ = 50 against Ha: μ < 50 using a lower-tailed z test. If the calculated value of the test statistic is z = −2.00, then

P-value = P(Z < −2.00 when μ = 50) = area under the z curve to the left of −2.00 = .0228

But if a second sample is selected, the resulting value of z will almost surely be different from −2.00, so the corresponding P-value will also likely differ from .0228. Because the test statistic value itself varies from one sample to another, the P-value will also vary from one sample to another. That is, the test statistic is a random variable, and so the P-value will also be a random variable. A first sample may give a P-value of .0228, a second sample result in a P-value of .1175, a third yield .0606 as the P-value, and so on.

If H0 is false, we hope the P-value will be close to 0 so that the null hypothesis can be rejected. On the other hand, when H0 is true, we'd like the P-value to exceed the selected significance level so that the correct decision to not reject H0 is made. The next example presents simulations to show how the P-value behaves both when the null hypothesis is true and when it is false.

Example 9.19
The fuel efficiency (mpg) of any particular new vehicle under specified driving conditions may not be identical to the EPA figure that appears on the vehicle's sticker. Suppose that four different vehicles of a particular type are to be selected and driven over a certain course, after which the fuel efficiency of each one is to be determined.
Let μ denote the true average fuel efficiency under these conditions. Consider testing H0: μ = 20 versus Ha: μ > 20 using the one-sample t test based on the resulting sample. Since the test is based on n − 1 = 3 degrees of freedom, the P-value for an upper-tailed test is the area under the t curve with 3 df to the right of the calculated t.

Let's first suppose that the null hypothesis is true. We asked MINITAB to generate 10,000 different samples, each containing 4 observations, from a normal population distribution with mean value μ = 20 and standard deviation σ = 2. The first sample and resulting summary quantities were

x1 = 20.830, x2 = 22.232, x3 = 20.276, x4 = 17.718
x̄ = 20.264, s = 1.8864, $t = \dfrac{20.264 - 20}{1.8864/\sqrt{4}} = .2799$

The P-value is the area under the 3-df t curve to the right of .2799, which according to MINITAB is .3989. Using a significance level of .05, the null hypothesis would of course not be rejected. The values of t for the next four samples were −1.7591, .6082, −.7020, and 3.1053, with corresponding P-values .912, .293, .733, and .0265.

Figure 9.9(a) shows a histogram of the 10,000 P-values from this simulation experiment. About 4.5% of these P-values are in the first class interval from 0 to .05. Thus when using a significance level of .05, the null hypothesis is rejected in roughly 4.5% of these 10,000 tests. If we continue to generate samples and carry out the test for each one at significance level .05, in the long run 5% of the P-values would be in the first class interval, because when H0 is true and a test with significance level .05 is used, by definition the probability of rejecting H0 is .05.

Looking at the histogram, it appears that the distribution of P-values is relatively flat. In fact, it can be shown that when H0 is true, the probability distribution of the P-value is a uniform distribution on the interval from 0 to 1.
That is, the density curve is completely flat on this interval, and thus must have a height of 1 if the total area under the curve is to be 1. Since the area under such a curve to the left of .05 is (.05)(1) = .05, we again have that the probability of rejecting H0 when it is true is .05, the chosen significance level.

[Figure 9.9 P-value simulation results for Example 9.19: histograms (a), (b), and (c) of percent versus P-value]

Now consider what happens when H0 is false because μ = 21. We again had MINITAB generate 10,000 different samples of size 4, each from a normal distribution with μ = 21 and σ = 2, calculate $t = (\bar{x} - 20)/(s/\sqrt{4})$ for each one, and then determine the P-value. The first such sample resulted in x̄ = 20.6411, s = .49637, t = 2.5832, P-value = .0408. Figure 9.9(b) gives a histogram of the 10,000 resulting P-values. The shape of this histogram is quite different from that of Figure 9.9(a): there is a much greater tendency for the P-value to be small (closer to 0) when μ = 21 than when μ = 20. Again H0 is rejected at significance level .05 whenever the P-value is at most .05 (in the first class interval). Unfortunately this is the case for only about 19% of the 10,000 P-values. So only about 19% of the 10,000 tests correctly reject the null hypothesis; for the other 81%, a type II error is committed. The difficulty is that the sample size is quite small and 21 is not very different from the value asserted by the null hypothesis.

Figure 9.9(c) illustrates what happens to the P-value when H0 is false because μ = 22 (still with n = 4 and σ = 2). The histogram is even more concentrated toward values close to 0 than was the case when μ = 21.
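A simulation in the spirit of this example can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the MINITAB run described in the text: it uses Python's `random` generator with an arbitrary seed, the function names are hypothetical, and the 3 df tail area comes from the closed-form cdf of the t distribution with 3 df, so the sample size is fixed at 4.

```python
import math
import random
import statistics

N = 4  # sample size; the closed-form tail area below is valid only for 3 df

def t3_upper_tail(t: float) -> float:
    """Area to the right of t under the t curve with 3 df (closed-form cdf)."""
    u = t / math.sqrt(3)
    cdf = 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi
    return 1.0 - cdf

def simulate_p_values(mu: float, sigma: float = 2.0, mu0: float = 20.0,
                      reps: int = 10_000, seed: int = 1) -> list:
    """P-values of the upper-tailed one-sample t test for reps normal samples."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(N)]
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)
        t = (xbar - mu0) / (s / math.sqrt(N))
        p_values.append(t3_upper_tail(t))
    return p_values

print(round(t3_upper_tail(0.2799), 4))   # 0.3989, matching the first sample in the text
for mu in (20, 21, 22):
    p = simulate_p_values(mu)
    frac = sum(v <= 0.05 for v in p) / len(p)
    print(f"mu = {mu}: fraction of P-values <= .05 is {frac:.3f}")
```

The three fractions should land near the roughly 5%, 19%, and just under 50% described in the text, although the exact figures depend on the seed and the random number generator.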
In general, as μ moves further to the right of the null value 20, the distribution of the P-value will become more and more concentrated on values close to 0. Even here a bit fewer than 50% of the 10,000 P-values are smaller than .05. So it is still slightly more likely than not that the null hypothesis is incorrectly not rejected. Only for values of μ much larger than 20 (e.g., at least 24 or 25) is it highly likely that the P-value will be smaller than .05 and thus give the correct conclusion.

The big idea of this example is that because the value of any test statistic is random, the P-value will also be a random variable and thus have a distribution. The farther the actual value of the parameter is from the value specified by the null hypothesis, the more the distribution of the P-value will be concentrated on values close to 0 and the greater the chance that the test will correctly reject H0 (corresponding to smaller β). ■

Exercises Section 9.4 (45–59)

45. For which of the given P-values would the null hypothesis be rejected when performing a level .05 test?
a. .001 b. .021 c. .078 d. .047 e. .148

46. Pairs of P-values and significance levels, α, are given. For each pair, state whether the observed P-value would lead to rejection of H0 at the given significance level.
a. P-value = .084, α = .05
b. P-value = .003, α = .001
c. P-value = .498, α = .05
d. P-value = .084, α = .10
e. P-value = .039, α = .01
f. P-value = .218, α = .10

47. Let μ denote the mean reaction time to a certain stimulus. For a large-sample z test of H0: μ = 5 versus Ha: μ > 5, find the P-value associated with each of the given values of the z test statistic.
a. 1.42 b. .90 c. 1.96 d. 2.48 e. .11

48. Newly purchased tires of a certain type are supposed to be filled to a pressure of 30 lb/in². Let μ denote the true average pressure. Find the P-value associated with each given z statistic value for testing H0: μ = 30 versus Ha: μ ≠ 30.
a. 2.10 b. 1.75 c. .55 d. 1.41 e. 5.3

49. Give as much information as you can about the P-value of a t test in each of the following situations:
a. Upper-tailed test, df = 8, t = 2.0
b. Lower-tailed test, df = 11, t = 2.4
c. Two-tailed test, df = 15, t = 1.6
d. Upper-tailed test, df = 19, t = .4
e. Upper-tailed test, df = 5, t = 5.0
f. Two-tailed test, df = 40, t = 4.8

50. The paint used to make lines on roads must reflect enough light to be clearly visible at night. Let μ denote the true average reflectometer reading for a new type of paint under consideration. A test of H0: μ = 20 versus Ha: μ > 20 will be based on a random sample of size n from a normal population distribution. What conclusion is appropriate in each of the following situations?
a. n = 15, t = 3.2, α = .05
b. n = 9, t = 1.8, α = .01
c. n = 24, t = .2

51. Let μ denote true average serum receptor concentration for all pregnant women. The average for all women is known to be 5.63. The article "Serum Transferrin Receptor for the Detection of Iron Deficiency in Pregnancy" (Amer. J. Clin. Nutrit., 1991: 1077–1081) reports that P-value > .10 for a test of H0: μ = 5.63 versus Ha: μ ≠ 5.63 based on n = 176 pregnant women. Using a significance level of .01, what would you conclude?

52. An aspirin manufacturer fills bottles by weight rather than by count. Since each bottle should contain 100 tablets, the average weight per tablet should be 5 grains. Each of 100 tablets taken from a very large lot is weighed, resulting in a sample average weight per tablet of 4.87 grains and a sample standard deviation of .35 grain. Does this information provide strong evidence for concluding that the company is not filling its bottles as advertised? Test the appropriate hypotheses using α = .01 by first computing the P-value and then comparing it to the specified significance level.

53. Because of variability in the manufacturing process, the actual yielding point of a sample of mild steel subjected to increasing stress will usually differ from the theoretical yielding point. Let p denote the true proportion of samples that yield before their theoretical yielding point. If on the basis of a sample it can be concluded that more than 20% of all specimens yield before the theoretical point, the production process will have to be modified.
a. If 15 of 60 specimens yield before the theoretical point, what is the P-value when the appropriate test is used, and what would you advise the company to do?
b. If the true percentage of "early yields" is actually 50% (so that the theoretical point is the median of the yield distribution) and a level .01 test is used, what is the probability that the company concludes a modification of the process is necessary?

54. Many consumers are turning to generics as a way of reducing the cost of prescription medications. The article "Commercial Information on Drugs: Confusing to the Physician?" (J. Drug Issues, 1988: 245–257) gives the results of a survey of 102 doctors. Only 47 of those surveyed knew the generic name for the drug methadone. Does this provide strong evidence for concluding that fewer than half of all physicians know the generic name for methadone? Carry out a test of hypotheses with a significance level of .01 using the P-value method.

55. A random sample of soil specimens was obtained, and the amount of organic matter (%) in the soil was determined for each specimen, resulting in the accompanying data (from "Engineering Properties of Soil," Soil Sci., 1998: 93–102).

1.10 0.14 3.98 0.76 5.09 4.47 3.17 1.17 0.97 1.20
3.03 1.57 1.59 3.50 2.21 2.62 4.60 5.02 0.69 1.66
0.32 0.55 1.45 4.67 5.22 2.69 4.47 3.31 1.17 2.05

The values of the sample mean, sample standard deviation, and (estimated) standard error of the mean are 2.481, 1.616, and .295, respectively.
Does this data suggest that the true average percentage of organic matter in such soil is something other than 3%? Carry out a test of the appropriate hypotheses at significance level .10 by first determining the P-value. Would your conclusion be different if α = .05 had been used? [Note: A normal probability plot of the data shows an acceptable pattern in light of the reasonably large sample size.]

56. The times of first sprinkler activation for a series of tests with fire prevention sprinkler systems using an aqueous film-forming foam were (in sec)

27 41 22 27 23 35 30 33 24 27 28 22 24

(see "Use of AFFF in Sprinkler Systems," Fire Tech., 1976: 5). The system has been designed so that true average activation time is at most 25 s under such conditions. Does the data strongly contradict the validity of this design specification? Test the relevant hypotheses at significance level .05 using the P-value approach.

57. A pen has been designed so that true average writing lifetime under controlled conditions (involving the use of a writing machine) is at least 10 h. A random sample of 18 pens is selected, the writing lifetime of each is determined, and a normal probability plot of the resulting data supports the use of a one-sample t test.
a. What hypotheses should be tested if the investigators believe a priori that the design specification has been satisfied?
b. What conclusion is appropriate if the hypotheses of part (a) are tested, t = 2.3, and α = .05?
c. What conclusion is appropriate if the hypotheses of part (a) are tested, t = 1.8, and α = .01?
d. What should be concluded if the hypotheses of part (a) are tested and t = 3.6?

58. A spectrophotometer used for measuring CO concentration [ppm (parts per million) by volume] is checked for accuracy by taking readings on a manufactured gas (called span gas) in which the CO concentration is very precisely controlled at 70 ppm.
If the readings suggest that the spectrophotometer is not working properly, it will have to be recalibrated. Assume that if it is properly calibrated, measured concentration for span gas samples is normally distributed. On the basis of the six readings (85, 77, 82, 68, 72, and 69), is recalibration necessary? Carry out a test of the relevant hypotheses using the P-value approach with α = .05.

59. The relative conductivity of a semiconductor device is determined by the amount of impurity "doped" into the device during its manufacture. A silicon diode to be used for a specific purpose requires an average cut-on voltage of .60 V, and if this is not achieved, the amount of impurity must be adjusted. A sample of diodes was selected and the cut-on voltage was determined. The accompanying SAS output resulted from a request to test the appropriate hypotheses.

N     Mean         Std Dev      T            Prob > |T|
15    0.0453333    0.0899100    1.9527887    0.0711

[Note: SAS explicitly tests H0: μ = 0, so to test H0: μ = .60, the null value .60 must be subtracted from each xi; the reported mean is then the average of the (xi − .60) values. Also, SAS's P-value is always for a two-tailed test.] What would be concluded for a significance level of .01? .05? .10?

9.5 Some Comments on Selecting a Test Procedure

Once the experimenter has decided on the question of interest and the method for gathering data (the design of the experiment), construction of an appropriate test procedure consists of three distinct steps:

1. Specify a test statistic (the decision is based on this function of the data).
2. Decide on the general form of the rejection region (typically, reject H0 for suitably large values of the test statistic, reject for suitably small values, or reject for either small or large values).
3. Select the specific numerical critical value or values that will separate the rejection region from the acceptance region (by obtaining the distribution of the test statistic when H0 is true, and then selecting a level of significance).

In the examples thus far, both steps 1 and 2 were carried out in an ad hoc manner through intuition. For example, when the underlying population was assumed normal with mean μ and known σ, we were led from X̄ to the standardized test statistic

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

For testing H0: μ = μ0 versus Ha: μ > μ0, intuition then suggested rejecting H0 when z was large. Finally, the critical value was determined by specifying the level of significance α and using the fact that Z has a standard normal distribution when H0 is true. The reliability of the test in reaching a correct decision can be assessed by studying type II error probabilities.

Issues to be considered in carrying out steps 1–3 encompass the following questions:

1. What are the practical implications and consequences of choosing a particular level of significance once the other aspects of a test procedure have been determined?
2. Does there exist a general principle, not dependent just on intuition, that can be used to obtain best or good test procedures?
3. When two or more tests are appropriate in a given situation, how can the tests be compared to decide which should be used?
4. If a test is derived under specific assumptions about the distribution or population being sampled, how well will the test procedure work when the assumptions are violated?

Statistical Versus Practical Significance

Although the process of reaching a decision by using the methodology of classical hypothesis testing involves selecting a level of significance and then rejecting or not rejecting H0 at that level, simply reporting the α used and the decision reached conveys little of the information contained in the sample data.
Especially when the results of an experiment are to be communicated to a large audience, rejection of H0 at level .05 will be much more convincing if the observed value of the test statistic greatly exceeds the 5% critical value than if it barely exceeds that value. This is precisely what led to the notion of P-value as a way of reporting significance without imposing a particular α on others who might wish to draw their own conclusions.

Even if a P-value is included in a summary of results, however, there may be difficulty in interpreting this value and in making a decision. This is because a small P-value, which would ordinarily indicate statistical significance in that it would strongly suggest rejection of H0 in favor of Ha, may be the result of a large sample size in combination with a departure from H0 that has little practical significance. In many experimental situations, only departures from H0 of large magnitude would be worthy of detection, whereas a small departure from H0 would have little practical significance.

Consider as an example testing H0: μ = 100 versus Ha: μ > 100 where μ is the mean of a normal population with σ = 10. Suppose a true value of μ = 101 would not represent a serious departure from H0 in the sense that not rejecting H0 when μ = 101 would be a relatively inexpensive error. For a reasonably large sample size n, this μ would lead to an x̄ value near 101, so we would not want this sample evidence to argue strongly for rejection of H0 when x̄ = 101 is observed. For various sample sizes, Table 9.1 records both the P-value when x̄ = 101 and also the probability of not rejecting H0 at level .01 when μ = 101.

Table 9.1 An illustration of the effect of sample size on P-values and β

n         P-value when x̄ = 101    β(101) for level .01 test
25        .3085                    .9664
100       .1587                    .9082
400       .0228                    .6293
900       .0013                    .2514
1600      .0000335                 .0475
2500      .000000297               .0038
10,000    7.69 × 10^−24            .0000
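The entries of Table 9.1 can be approximated with a few lines of code. This sketch assumes the usual upper-tailed z test with known σ, for which the P-value is 1 − Φ(z) with z = (x̄ − 100)/(10/√n), and β(101) = Φ(z.01 − (101 − 100)/(10/√n)) with z.01 = 2.33; the function names are illustrative. Tail areas are computed through `math.erfc` so that very small values are not lost to rounding.

```python
import math

MU0, SIGMA, XBAR, MU_ALT, Z_01 = 100.0, 10.0, 101.0, 101.0, 2.33

def upper_tail(z: float) -> float:
    """1 - Phi(z), computed with erfc so tiny tail areas stay accurate."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def phi(z: float) -> float:
    """Standard normal cdf Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def table_row(n: int):
    """(P-value when xbar = 101, beta(101) for the level .01 test) for sample size n."""
    se = SIGMA / math.sqrt(n)
    p_value = upper_tail((XBAR - MU0) / se)
    beta = phi(Z_01 - (MU_ALT - MU0) / se)
    return p_value, beta

for n in (25, 100, 400, 900, 1600, 2500, 10_000):
    p, b = table_row(n)
    print(f"{n:>6}  {p:.3e}  {b:.4f}")
```

The β column and the larger P-values should agree with the table to the digits shown; the smallest P-values may differ slightly in their later digits from the printed entries.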
The second column in Table 9.1 shows that even for moderately large sample sizes, the P-value of x̄ = 101 argues very strongly for rejection of H0, whereas the observed x̄ itself suggests that in practical terms the true value of μ differs little from the null value μ0 = 100. The third column points out that even when there is little practical difference between the true μ and the null value, for a fixed level of significance a large sample size will almost always lead to rejection of the null hypothesis at that level. To summarize, one must be especially careful in interpreting evidence when the sample size is large, since any small departure from H0 will almost surely be detected by a test, yet such a departure may have little practical significance.

Best Tests for Simple Hypotheses

The test procedures presented thus far are (hopefully) intuitively reasonable, but have not been shown to be best in any sense. How can an optimal test be obtained, one for which the type II error probability is as small as possible, subject to controlling the type I error probability at the desired level? Our starting point here will be a rather unrealistic situation from a practical viewpoint: testing a simple null hypothesis against a simple alternative hypothesis. A simple hypothesis is one which, when true, completely specifies the distribution of the sample Xi's. Suppose, for example, that the Xi's form a random sample from an exponential distribution with parameter λ. Then the hypothesis H: λ = 1 is simple, since when H is true each Xi has an exponential distribution with parameter λ = 1. We might then consider H0: λ = 1 versus Ha: λ = 2, both of which are simple hypotheses. The hypothesis H: λ ≤ 1 is not simple, because when H is true, the distribution of each Xi might be exponential with λ = 1 or with λ = .8 or . . . . Similarly, if the Xi's constitute a random sample from a normal distribution with known σ, then H: μ = 100 is a simple hypothesis.
But if the value of σ is unknown, this hypothesis is not simple because the distribution of each Xi is then not completely specified; it could be normal with μ = 100 and σ = 15 or normal with μ = 100 and σ = 12 or normal with μ = 100 and any other positive value of σ. For a hypothesis to be simple, the value of every parameter in the pmf or pdf of the Xi's must be specified.

The next result was a milestone in the theory of hypothesis testing: a method for constructing a best test for a simple null hypothesis versus a simple alternative hypothesis. Let f(x1, . . . , xn; θ) be the joint pmf or pdf of the Xi's. Then our null hypothesis will assert that θ = θ0 and the relevant alternative hypothesis will claim that θ = θa. The result will carry over to the case of more than one parameter as long as the value of each parameter is completely specified in both H0 and Ha.

THE NEYMAN–PEARSON THEOREM
For testing a simple null hypothesis H0: θ = θ0 versus a simple alternative hypothesis Ha: θ = θa, let k be a positive fixed number and form the rejection region

$$R^* = \left\{ (x_1, \ldots, x_n) : \frac{f(x_1, \ldots, x_n; \theta_a)}{f(x_1, \ldots, x_n; \theta_0)} \ge k \right\}$$

Thus R* is the set of all observations for which the likelihood ratio (the ratio of the alternative likelihood to the null likelihood) is at least k. The probability of a type I error for the test with this rejection region is α* = P[(X1, . . . , Xn) ∈ R* when θ = θ0], whereas the type II error probability β* is the probability that the Xi's lie in the complement of R* (in the "acceptance" region) when θ = θa. Then for any other test procedure with type I error probability α satisfying α ≤ α*, the probability of a type II error must satisfy β ≥ β*. Thus the test with rejection region R* has the smallest type II error probability among all tests for which the type I error probability is at most α*.
The choice of the constant k in the rejection region will determine the type I error probability α*. In the continuous case, k can be selected to give one of the traditional significance levels .05, .01, and so on, whereas in the discrete case α* = .057 or .039 may be as close as one can get to .05.

Example 9.20
Consider randomly selecting n = 5 new vehicles of a certain type and determining the number of major defects on each one. Letting Xi denote the number of such defects for the ith selected vehicle (i = 1, . . . , 5), suppose that the Xi's form a random sample from a Poisson distribution with parameter λ. Let's find the best test for testing H0: λ = 1 versus Ha: λ = 2. The Poisson likelihood is

$$f(x_1, \ldots, x_5; \lambda) = \frac{e^{-5\lambda} \lambda^{\Sigma x_i}}{\Pi x_i!}$$

Substituting first λ = 2, then λ = 1, and then taking the ratio of these two likelihoods gives the rejection region

$$R^* = \left\{ (x_1, \ldots, x_5) : e^{-5} 2^{\Sigma x_i} \ge k \right\}$$

Multiplying both sides of the inequality by $e^5$ and letting $k' = ke^5$ gives the rejection region $2^{\Sigma x_i} \ge k'$. Now take the natural logarithm of both sides and let c = ln(k′)/ln(2) to obtain the rejection region Σxi ≥ c. This latter rejection region is completely equivalent to R*: For any particular value k there will be a corresponding value c, and vice versa. But it is much easier to express the rejection region in this latter form and then select c to obtain a desired significance level than it is to determine an appropriate value of k for the likelihood ratio. In particular, T = ΣXi has a Poisson distribution with parameter 5λ (via a moment generating function argument), so when H0 is true T has a Poisson distribution with parameter 5. From the 5.0 column of our Poisson table (Table A.2), the cumulative probabilities for the values 8 and 9 are .932 and .968, respectively. Thus if we use c = 9 in the rejection region,

α* = P(Poisson rv with parameter 5 is ≥ 9) = 1 − .932 = .068

Choosing instead c = 10 gives α* = .032.
If we insist that the significance level be at most .05, then the optimal rejection region is Σxi ≥ 10. When Ha is true, the test statistic has a Poisson distribution with parameter 10. Thus

β* = P(H0 is not rejected when Ha is true) = P(Poisson rv with parameter 10 is ≤ 9) = .458

Obviously this type II error probability is quite large. This is because the sample size n = 5 is too small to allow for effective discrimination between λ = 1 and λ = 2. For a sample size of 10, the Poisson table reveals that the best test having significance level at most .05 uses c = 16, for which α* = .049 (Poisson parameter = 10) and β* = .157 (Poisson parameter = 20).

Finally, returning to a sample size of 5, c = 10 implies that 10 = ln(ke^5)/ln(2), from which k = 2^10/e^5 ≈ 6.9. For the best test to have a significance level of at most .05, the null hypothesis should be rejected only when the likelihood for the alternative value of λ is more than about 7 times what it is for the null value. ■

Example 9.21
Let X1, . . . , Xn be a random sample from a normal distribution with mean μ and variance 1 (the argument to be given will work for any other known value of σ²). Consider testing H0: μ = μ0 versus Ha: μ = μa where μa > μ0. The likelihood ratio is

$$
\frac{\left(\frac{1}{2\pi}\right)^{n/2} e^{-(1/2)\Sigma (x_i - \mu_a)^2}}{\left(\frac{1}{2\pi}\right)^{n/2} e^{-(1/2)\Sigma (x_i - \mu_0)^2}}
= e^{\mu_a \Sigma x_i - \mu_0 \Sigma x_i - (n/2)(\mu_a^2 - \mu_0^2)}
= \left[ e^{-n(\mu_a^2 - \mu_0^2)/2} \right] \cdot \left[ e^{(\mu_a - \mu_0)\Sigma x_i} \right]
$$

The term in the first set of brackets is a numerical constant. Then μa − μ0 > 0 implies that the likelihood ratio will be at least k if and only if Σxi ≥ k′, that is, if and only if x̄ ≥ k″, which means if and only if

$$z = \frac{\bar{x} - \mu_0}{1/\sqrt{n}} \ge c$$

If we now let c = z.01 = 2.33, this z test (one for which the test statistic has a standard normal distribution when H0 is true) will have minimum β among all tests for which α ≤ .01.
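Returning to the numbers in Example 9.20, the cutoff c, the error probabilities α* and β*, and the constant k can be recomputed directly from the Poisson pmf instead of Table A.2. This is a minimal sketch; the function names `poisson_cdf`, `best_test`, and `likelihood_ratio_cutoff` are hypothetical and chosen only for illustration.

```python
import math

def poisson_cdf(x: int, mean: float) -> float:
    """P(T <= x) for a Poisson rv T with the given mean (0 when x < 0)."""
    return sum(math.exp(-mean) * mean**j / math.factorial(j) for j in range(x + 1))

def best_test(n: int, lam0: float = 1.0, lam_a: float = 2.0, level: float = 0.05):
    """Smallest c with P(T >= c when lambda = lam0) <= level, where T = sum of n counts."""
    c = 0
    while 1.0 - poisson_cdf(c - 1, n * lam0) > level:
        c += 1
    alpha = 1.0 - poisson_cdf(c - 1, n * lam0)   # P(T >= c) under H0
    beta = poisson_cdf(c - 1, n * lam_a)         # P(T <= c - 1) under Ha
    return c, alpha, beta

def likelihood_ratio_cutoff(c: int, n: int, lam0: float = 1.0, lam_a: float = 2.0) -> float:
    """k such that the region sum(x_i) >= c matches likelihood ratio >= k."""
    return math.exp(-n * (lam_a - lam0)) * (lam_a / lam0) ** c

for n in (5, 10):
    c, alpha, beta = best_test(n)
    print(n, c, round(alpha, 3), round(beta, 3))   # 5 10 0.032 0.458, then 10 16 0.049 0.157
print(round(likelihood_ratio_cutoff(10, 5), 1))    # 6.9
```

Searching over c rather than k mirrors the example's point that the region Σxi ≥ c is far easier to calibrate than the likelihood ratio itself.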
■

The key idea in these last two examples cannot be overemphasized: Write an expression for the likelihood ratio, and then manipulate the inequality likelihood ratio $\ge k$ so it is equivalent to an inequality involving a test statistic whose distribution when $H_0$ is true is known or can be derived. Then this known or derived distribution can be used to obtain a test with the desired $\alpha$. In the first example the distribution was Poisson with parameter 5, and in the second it was the standard normal distribution.

Proof of the Neyman–Pearson Theorem: We shall consider the case in which the $X_i$'s have a discrete distribution, so that type I and type II error probabilities are obtained by summation. In the continuous case, integration replaces summation. Then

$$R^* = \{(x_1, \ldots, x_n) : f(x_1, \ldots, x_n; \theta_a) \ge k\, f(x_1, \ldots, x_n; \theta_0)\}$$

$$\alpha^* = P[(X_1, \ldots, X_n) \in R^* \text{ when } \theta = \theta_0] = \sum_{R^*} f(x_1, \ldots, x_n; \theta_0)$$

$$\beta^* = P[(X_1, \ldots, X_n) \in R^{*\prime} \text{ when } \theta = \theta_a] = \sum_{R^{*\prime}} f(x_1, \ldots, x_n; \theta_a)$$

($\beta^*$ is the sum over values in the complement $R^{*\prime}$ of the rejection region). Suppose that $R$ is a rejection region different from $R^*$ whose type I error probability is at most $\alpha^*$; that is,

$$\alpha = P[(X_1, \ldots, X_n) \in R \text{ when } \theta = \theta_0] = \sum_{R} f(x_1, \ldots, x_n; \theta_0) \le \alpha^*$$

We then wish to show that $\beta$ for this rejection region must be at least as large as $\beta^*$. Consider the difference

$$\Delta = \sum_{R^*} [f(x_1, \ldots, x_n; \theta_a) - k f(x_1, \ldots, x_n; \theta_0)] - \sum_{R} [f(x_1, \ldots, x_n; \theta_a) - k f(x_1, \ldots, x_n; \theta_0)]$$

$$= \left\{ \sum_{R^* \cap R} [\ldots] + \sum_{R^* \cap R'} [\ldots] \right\} - \left\{ \sum_{R \cap R^*} [\ldots] + \sum_{R \cap R^{*\prime}} [\ldots] \right\} = \sum_{R^* \cap R'} [\ldots] - \sum_{R \cap R^{*\prime}} [\ldots]$$

This last difference is nonnegative (i.e. $\ge 0$) because the term in the square brackets is $\ge 0$ for any set of $x_i$'s in $R^*$ and is negative for any set of $x_i$'s not in $R^*$. It then follows that

$$0 \le \Delta = \sum_{R^*} f(x_1, \ldots, x_n; \theta_a) - k \sum_{R^*} f(x_1, \ldots, x_n; \theta_0) - \sum_{R} f(x_1, \ldots, x_n; \theta_a) + k \sum_{R} f(x_1, \ldots, x_n; \theta_0)$$

$$= (1 - \beta^*) - k\alpha^* - (1 - \beta) + k\alpha = \beta - \beta^* - k(\alpha^* - \alpha) \le \beta - \beta^*$$

(since $\alpha \le \alpha^*$ implies that the term being subtracted is nonnegative). Thus we have shown that $\beta^* \le \beta$ as desired. ■

Power and Uniformly Most Powerful Tests

The Neyman–Pearson theorem can be restated in a slightly different way by considering the power of a test, first introduced in Section 9.2.

DEFINITION Let $\Omega_0$ and $\Omega_a$ be two disjoint sets of possible values of $\theta$, and consider testing $H_0$: $\theta \in \Omega_0$ versus $H_a$: $\theta \in \Omega_a$ using a test with rejection region $R$. Then the power function of the test, denoted by $\pi(\cdot)$, is the probability of rejecting $H_0$ considered as a function of $\theta$:

$$\pi(\theta') = P[(X_1, \ldots, X_n) \in R \text{ when } \theta = \theta']$$

Since we don't want to reject the null hypothesis when $\theta \in \Omega_0$ and do want to reject it when $\theta \in \Omega_a$, we wish a test for which the power function is close to 0 whenever $\theta'$ is in $\Omega_0$ and close to 1 whenever $\theta'$ is in $\Omega_a$. The power is easily related to the type I and type II error probabilities:

$$\pi(\theta') = \begin{cases} P(\text{type I error when } \theta = \theta') = \alpha(\theta') & \text{when } \theta' \in \Omega_0 \\ 1 - P(\text{type II error when } \theta = \theta') = 1 - \beta(\theta') & \text{when } \theta' \in \Omega_a \end{cases}$$

Thus large power when $\theta' \in \Omega_a$ is equivalent to small $\beta$ for such parameter values.

Example 9.22 The drying time (min) of a particular brand and type of paint on a test board under controlled conditions is known to be normally distributed with $\mu = 75$ and $\sigma = 9.4$. A new additive has been developed for the purpose of improving drying time. Assume that drying time with the additive is still normally distributed with the same standard deviation, and consider testing $H_0$: $\mu \ge 75$ versus $H_a$: $\mu < 75$ based on a sample of size $n = 100$. A test with significance level .01 rejects $H_0$ if $z \le -2.33$, where $z = (\bar{x} - 75)/(9.4/\sqrt{100}) = (\bar{x} - 75)/.94$. Manipulating the inequality in the rejection region to isolate $\bar{x}$ gives the equivalent rejection region $\bar{x} \le 72.81$.
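A minimal sketch of the power function for this lower-tailed z test, assuming the rounded critical value 2.33 used in the example. The function name `power` and its default arguments are illustrative choices, not notation from the text.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def power(mu, mu0=75.0, sigma=9.4, n=100, z_crit=2.33):
    """Probability of rejecting H0 when the true mean is mu.

    The test rejects when the sample mean falls at or below
    mu0 - z_crit * sigma / sqrt(n).
    """
    se = sigma / n ** 0.5          # standard deviation of the sample mean
    cutoff = mu0 - z_crit * se     # rejection region: sample mean <= cutoff
    return STD_NORMAL.cdf((cutoff - mu) / se)


# Evaluate the power function on a small grid of true means.
for mu in (70, 72, 74, 75, 76):
    print(mu, round(power(mu), 4))
```

Evaluating `power` at the boundary value 75 returns approximately the significance level .01, and the power decreases steadily as the true mean increases.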
Thus the power of the test when $\mu = 70$ (a substantial departure from the null hypothesis) is

$$\pi(70) = P(\bar{X} \le 72.81 \text{ when } \mu = 70) = \Phi\left(\frac{72.81 - 70}{9.4/\sqrt{100}}\right) = \Phi(2.99) = .9986$$

so $\beta = .0014$. It is easily verified that $\pi(75) = .01$, the significance level. The power when $\mu = 76$ (a parameter value for which $H_0$ is true) is

$$\pi(76) = P(\bar{X} \le 72.81 \text{ when } \mu = 76) = \Phi\left(\frac{72.81 - 76}{9.4/\sqrt{100}}\right) = \Phi(-3.39) = .0003$$

which is quite small as it should be. By repeating this calculation for various other values of $\mu$ we obtain the entire power function. A graph of the ideal power function appears in Figure 9.10(a) and the actual power function is graphed in Figure 9.10(b). The maximum power for $\mu \ge 75$ (i.e. in $\Omega_0$) occurs at $\mu = 75$, on the boundary between $\Omega_0$ and $\Omega_a$. Because the power function is continuous, there are values of $\mu$ smaller than 75 for which the power is quite small. Even with a large sample size, it is difficult to detect a very small departure from the null hypothesis.

Figure 9.10 Graphs of power functions for Example 9.22: (a) ideal, (b) actual (power plotted against the mean) ■

The Neyman–Pearson theorem says that when $\Omega_0$ consists of a single value $\theta_0$ and $\Omega_a$ also consists of a single value $\theta_a$, the rejection region $R^*$ specifies a test for which the power $\pi(\theta_a)$ at the alternative value $\theta_a$ (which is just $1 - \beta$) is maximized subject to $\pi(\theta_0) \le \alpha$ for some specified value of $\alpha$. That is, $R^*$ specifies a most powerful test subject to the restriction on the power when the null hypothesis is true. What about best tests when at least one of the two hypotheses is composite, that is, $\Omega_0$ or $\Omega_a$ (or both) consist of more than a single value?

Example 9.23 (Example 9.20 continued) Consider again a random sample of size $n = 5$ from a Poisson distribution, and suppose we now wish to test $H_0$: $\lambda \le 1$ versus $H_a$: $\lambda > 1$.
Both of these hypotheses are composite. Arguing as in Example 9.20, for any value $\lambda_a$ exceeding 1, a most powerful test of $H_0$: $\lambda = 1$ versus $H_a$: $\lambda = \lambda_a$ with significance level (power when $\lambda = 1$) .032 rejects the null hypothesis when $\sum x_i \ge 10$. Furthermore, it is easily verified that the power of this test at $\lambda'$ is smaller than .032 if $\lambda' < 1$. Thus the test that rejects $H_0$: $\lambda \le 1$ in favor of $H_a$: $\lambda > 1$ when $\sum x_i \ge 10$ has maximum power for any $\lambda' > 1$ subject to the condition that $\pi(\lambda') \le .032$ for all $\lambda' \le 1$. This test is uniformly most powerful. ■

More generally, a uniformly most powerful (UMP) level $\alpha$ test is one for which $\pi(\theta')$ is maximized for any $\theta' \in \Omega_a$ subject to $\pi(\theta') \le \alpha$ for any $\theta' \in \Omega_0$. Unfortunately UMP tests are fairly rare, especially in commonly encountered situations when $H_0$ and $H_a$ are assertions about a single parameter $\theta_1$ whereas the distribution of the $X_i$'s involves not only $\theta_1$ but also at least one other "nuisance parameter". For example, when the population distribution is normal with values of both $\mu$ and $\sigma$ unknown, $\sigma$ is a nuisance parameter when testing $H_0$: $\mu = \mu_0$ versus $H_a$: $\mu \ne \mu_0$. Be careful here: the null hypothesis is not simple because $\Omega_0$ consists of all pairs $(\mu, \sigma)$ for which $\mu = \mu_0$ and $\sigma > 0$, and there is certainly more than one such pair. In this situation, the one-sample t test is not UMP.

However, suppose we restrict attention to unbiased tests, those for which the smallest value of $\pi(\theta')$ for $\theta' \in \Omega_a$ is at least as large as the largest value of $\pi(\theta')$ for $\theta' \in \Omega_0$. Unbiasedness simply says that we are at least as likely to reject the null hypothesis when $H_0$ is false as we are to reject it when $H_0$ is true. The test proposed in Example 9.22 involving paint drying times is unbiased because, as Figure 9.10(b) shows, the power function at or to the right of 75 is smaller than it is to the left of 75.
It can be shown that the one-sample t test is UMP unbiased; that is, it is uniformly most powerful among all tests that are unbiased. Several other commonly used tests also have this property. Please consult one of the chapter references for more details.

Likelihood Ratio Tests

The likelihood ratio (LR) principle is the most frequently used method for finding an appropriate test statistic in a new situation. As before, denote the joint pmf or pdf of $X_1, \ldots, X_n$ by $f(x_1, \ldots, x_n; \theta)$. In the case of a random sample, it will be a product $f(x_1; \theta) \cdots f(x_n; \theta)$. When the $x_i$'s are the actual observations and $f(x_1, \ldots, x_n; \theta)$ is regarded as a function of $\theta$, it is called the likelihood function. Again consider testing $H_0$: $\theta \in \Omega_0$ versus $H_a$: $\theta \in \Omega_a$, where $\Omega_0$ and $\Omega_a$ are disjoint sets, and let $\Omega = \Omega_0 \cup \Omega_a$. In the Neyman–Pearson theorem, we focused on the ratio of the likelihood when $\theta \in \Omega_a$ to the likelihood when $\theta \in \Omega_0$, rejecting $H_0$ when the value of the ratio was "sufficiently large". Now we consider the ratio of the likelihood when $\theta \in \Omega_0$ to the likelihood when $\theta \in \Omega$. A very small value of this ratio argues against the null hypothesis, since a small value arises when the data is much more consistent with the alternative hypothesis than with the null hypothesis. More formally,

1. Find the largest value of the likelihood for any $\theta \in \Omega_0$ by finding the maximum likelihood estimate of $\theta$ within $\Omega_0$ and substituting this mle into the likelihood function to obtain $L(\hat{\Omega}_0)$.
2. Find the largest value of the likelihood for any $\theta \in \Omega$ by finding the maximum likelihood estimate of $\theta$ within $\Omega$ and substituting this mle into the likelihood function to obtain $L(\hat{\Omega})$. Because $\Omega_0$ is a subset of $\Omega$, this likelihood $L(\hat{\Omega})$ can't be any smaller than the likelihood $L(\hat{\Omega}_0)$ obtained in the first step, and will be much larger when the data is much more consistent with $H_a$ than with $H_0$.
3. Form the likelihood ratio $L(\hat{\Omega}_0)/L(\hat{\Omega})$ and reject the null hypothesis in favor of the alternative when this ratio is $\le k$. The critical value $k$ is chosen to give a test with the desired significance level. In practice, the inequality $L(\hat{\Omega}_0)/L(\hat{\Omega}) \le k$ is often re-expressed in terms of a more convenient statistic (such as the sum of the observations) whose distribution is known or can be derived.

The above prescription remains valid if the single parameter $\theta$ is replaced by several parameters $\theta_1, \ldots, \theta_k$. The mle's of all parameters must be obtained in both steps 1 and 2 and substituted back into the likelihood function.

Example 9.24 Consider a random sample from a normal distribution with the values of both parameters unknown. We wish to test $H_0$: $\mu = \mu_0$ versus $H_a$: $\mu \ne \mu_0$. Here $\Omega$ consists of all values of $\mu$ and $\sigma^2$ for which $-\infty < \mu < \infty$ and $\sigma^2 > 0$, and the likelihood function is

$$\left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-(1/(2\sigma^2)) \sum (x_i - \mu)^2}$$

In Section 7.2 we obtained the mle's as $\hat{\mu} = \bar{x}$, $\hat{\sigma}^2 = \sum (x_i - \bar{x})^2 / n$. Substituting these estimates back into the likelihood function gives

$$L(\hat{\Omega}) = \left(\frac{1}{2\pi \sum (x_i - \bar{x})^2 / n}\right)^{n/2} e^{-n/2}$$

Within $\Omega_0$, $\mu$ in the foregoing likelihood is replaced by $\mu_0$, so that only $\sigma^2$ must be estimated. It is easily verified that the mle is $\hat{\sigma}^2 = \sum (x_i - \mu_0)^2 / n$. Substitution of this estimate in the likelihood function yields

$$L(\hat{\Omega}_0) = \left(\frac{1}{2\pi \sum (x_i - \mu_0)^2 / n}\right)^{n/2} e^{-n/2}$$

Thus we reject $H_0$ in favor of $H_a$ when

$$\frac{L(\hat{\Omega}_0)}{L(\hat{\Omega})} = \left(\frac{\sum (x_i - \bar{x})^2}{\sum (x_i - \mu_0)^2}\right)^{n/2} \le k$$

Raising both sides of this inequality to the power $2/n$, we reject $H_0$ whenever

$$\frac{\sum (x_i - \bar{x})^2}{\sum (x_i - \mu_0)^2} \le k^{2/n} = k'$$

This is intuitively quite reasonable: the value $\mu_0$ is implausible for $\mu$ if the sum of squared deviations about the sample mean is much smaller than the sum of squared deviations about $\mu_0$.
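To see this ratio in action, the sketch below evaluates the likelihood ratio two ways for a small sample: directly from the two maximized normal log-likelihoods, and from the ratio of sums of squares raised to the power n/2. The data values and the null value 10 are hypothetical, chosen only for illustration, and the function names are assumptions rather than notation from the text.

```python
from math import exp, log, pi


def max_log_lik(xs, mu):
    """Normal log-likelihood at mean mu, with the variance set to its mle for that mu."""
    n = len(xs)
    var_hat = sum((x - mu) ** 2 for x in xs) / n
    return -(n / 2) * log(2 * pi * var_hat) - n / 2


def lr_from_likelihoods(xs, mu0):
    """Maximized likelihood under H0 divided by the unrestricted maximized likelihood."""
    xbar = sum(xs) / len(xs)
    return exp(max_log_lik(xs, mu0) - max_log_lik(xs, xbar))


def lr_from_sums_of_squares(xs, mu0):
    """The same ratio written as (SS about the sample mean / SS about mu0) ** (n/2)."""
    n = len(xs)
    xbar = sum(xs) / n
    ss_mean = sum((x - xbar) ** 2 for x in xs)
    ss_null = sum((x - mu0) ** 2 for x in xs)
    return (ss_mean / ss_null) ** (n / 2)


data = [10.2, 9.6, 11.1, 10.8, 9.9, 10.5]  # hypothetical observations
mu0 = 10.0                                 # hypothetical null value

lr_direct = lr_from_likelihoods(data, mu0)
lr_ss = lr_from_sums_of_squares(data, mu0)
print(lr_direct, lr_ss)
```

The two computations agree up to floating-point error, and the ratio can never exceed 1 because the sum of squared deviations is smallest when taken about the sample mean.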
The denominator of this latter ratio can be expressed as

$$\sum [(x_i - \bar{x}) + (\bar{x} - \mu_0)]^2 = \sum (x_i - \bar{x})^2 + 2 \sum (\bar{x} - \mu_0)(x_i - \bar{x}) + n(\bar{x} - \mu_0)^2$$

The middle (i.e., cross-product) term in this expression is 0, because the constant $\bar{x} - \mu_0$ can be moved outside the summation, and then the sum of deviations from the sample mean is 0. Thus we should reject $H_0$ when

$$\frac{\sum (x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2 + n(\bar{x} - \mu_0)^2} = \frac{1}{1 + n(\bar{x} - \mu_0)^2 / \sum (x_i - \bar{x})^2} \le k'$$

This latter ratio will be small when the second term in the denominator is large, so the condition for rejection becomes

$$\frac{n(\bar{x} - \mu_0)^2}{\sum (x_i - \bar{x})^2} \ge k''$$

Multiplying both sides by $n - 1$ (so that the denominator becomes $s^2 = \sum (x_i - \bar{x})^2/(n - 1)$) and taking square roots gives the rejection region

$$\text{either} \quad \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \ge c \quad \text{or} \quad \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \le -c$$

If we now let $c = t_{\alpha/2, n-1}$, we have exactly the two-tailed one-sample t test. The bottom line is that when testing $H_0$: $\mu = \mu_0$ against the two-sided ($\ne$) alternative, the one-sample t test is the likelihood ratio test. This is also true of the upper-tailed version of the t test when the alternative is $H_a$: $\mu > \mu_0$ and of the lower-tailed test when the alternative is $H_a$: $\mu < \mu_0$. We could trace back through the argument to recover the critical constant $k$ from $c$, but there is no point in doing this; the rejection region in terms of $t$ is much more convenient than the rejection region in terms of the likelihood ratio. ■

A number of tests discussed subsequently, including the "pooled" t test from the next chapter and various tests from ANOVA (the analysis of variance) and regression analysis, can be derived by the likelihood ratio principle. Rather frequently the inequality for the rejection region of a likelihood ratio test cannot be manipulated to express the test procedure in terms of a simple statistic whose distribution can be ascertained.
The following large-sample result, valid under fairly general conditions, can then be used: If the sample size $n$ is sufficiently large, then the statistic $-2[\ln(\text{likelihood ratio})]$ has approximately a chi-squared distribution with $\nu$ degrees of freedom, where $\nu$ is the difference between the number of "freely varying" parameters in $\Omega$ and the number of such parameters in $\Omega_0$. For example, if the distribution sampled is bivariate normal with the 5 parameters $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$ and the null hypothesis asserts that $\mu_1 = \mu_2$ and $\sigma_1 = \sigma_2$, then $\nu = 5 - 3 = 2$. By definition $L(\hat{\Omega}_0)/L(\hat{\Omega}) \le 1$, and the likelihood ratio test rejects $H_0$ when this likelihood ratio is much less than 1. This is equivalent to rejecting when the logarithm of the likelihood ratio is quite negative, that is, when $-\ln(\text{LR})$ is quite positive. The large-sample version of the test is thus upper-tailed: $H_0$ should be rejected if $-2\ln(\text{likelihood ratio}) \ge \chi^2_{\alpha,\nu}$ (an upper-tail critical value extracted from Table A.6).

Example 9.25 Suppose a scientist makes $n$ measurements of some physical characteristic, such as the specific gravity of a liquid. Let $X_1, \ldots, X_n$ denote the resulting measurement errors. Assume that these $X_i$'s are independent and identically distributed according to the double exponential (Laplace) distribution: $f(x) = .5e^{-|x - \theta|}$ for $-\infty < x < \infty$. This pdf is symmetric about $\theta$ with somewhat heavier tails than the normal pdf. If $\theta = 0$ then the measurements are unbiased, so it is natural to test $H_0$: $\theta = 0$ versus $H_a$: $\theta \ne 0$. Here $\nu = 1 - 0 = 1$. The likelihood is

$$L(\theta) = (.5)^n e^{-\sum |x_i - \theta|}$$

Because of the minus sign preceding the summation, the likelihood is maximized when $\sum |x_i - \theta|$ is minimized. The absolute value function is not differentiable, and therefore differential calculus cannot be used. Instead, consider for a moment the case $n = 5$ and let $y_1, \ldots, y_5$ denote the values of the $x_i$'s ordered from smallest to largest, so the $y_i$'s are the observed values of the order statistics.
For example, a random sample of size five from the Laplace distribution with $\theta = 0$ is $-.24998$, $.75446$, $-.19053$, $1.16237$, $.83229$, so $(y_1, \ldots, y_5) = (-.24998, -.19053, .75446, .83229, 1.16237)$. Then

$$\sum |y_i - \theta| = \sum |x_i - \theta| = \begin{cases} y_1 + y_2 + y_3 + y_4 + y_5 - 5\theta & \theta < y_1 \\ -y_1 + y_2 + y_3 + y_4 + y_5 - 3\theta & y_1 \le \theta < y_2 \\ -y_1 - y_2 + y_3 + y_4 + y_5 - \theta & y_2 \le \theta < y_3 \\ -y_1 - y_2 - y_3 + y_4 + y_5 + \theta & y_3 \le \theta < y_4 \\ -y_1 - y_2 - y_3 - y_4 + y_5 + 3\theta & y_4 \le \theta < y_5 \\ -y_1 - y_2 - y_3 - y_4 - y_5 + 5\theta & \theta \ge y_5 \end{cases}$$

The graph of this expression as a function of $\theta$ appears in Figure 9.11, from which it is apparent that the minimum occurs at $y_3 = \tilde{x} = .75446$, the sample median. The situation is similar whenever $n$ is odd. When $n$ is even, the function achieves its minimum for any $\theta$ between $y_{n/2}$ and $y_{(n/2)+1}$; one such $\theta$ is $(y_{n/2} + y_{(n/2)+1})/2 = \tilde{x}$. In summary, the mle of $\theta$ is the sample median.

Figure 9.11 Determining the mle of the double exponential parameter by minimizing $\sum |x_i - \theta|$ (the sum plotted against $\theta$)

The likelihood ratio statistic for testing the relevant hypotheses is $(.5)^n e^{-\sum |x_i|} / [(.5)^n e^{-\sum |x_i - \tilde{x}|}]$. Taking the natural log of the likelihood ratio and multiplying by $-2$ gives the rejection region $2\sum |x_i| - 2\sum |x_i - \tilde{x}| \ge \chi^2_{\alpha,1}$ for the large-sample version of the LR test.

Suppose that a sample of $n = 30$ errors results in $\sum |x_i| = 38.6$ and $\sum |x_i - \tilde{x}| = 37.3$. Then

$$-2\ln(\text{LR}) = 2\left(\sum |x_i| - \sum |x_i - \tilde{x}|\right) = 2.6$$

Comparing this to $\chi^2_{.05,1} = 3.84$, we would not reject the null hypothesis at the 5% significance level. It is plausible that the measurement process is indeed unbiased. ■

Exercises Section 9.5 (60–71)

60. Reconsider the paint-drying problem discussed in Example 9.2. The hypotheses were $H_0$: $\mu = 75$ versus $H_a$: $\mu < 75$, with $\sigma$ assumed to have value 9.0. Consider the alternative value $\mu = 74$, which in the context of the problem would presumably not be a practically significant departure from $H_0$.
a.
For a level .01 test, compute $\beta$ at this alternative for sample sizes $n = 100$, 900, and 2500.
b. If the observed value of $\bar{X}$ is $\bar{x} = 74$, what can you say about the resulting P-value when $n = 2500$? Is the data statistically significant at any of the standard values of $\alpha$?
c. Would you really want to use a sample size of 2500 along with a level .01 test (disregarding the cost of such an experiment)? Explain.

61. Consider the large-sample level .01 test in Section 9.3 for testing $H_0$: $p = .2$ against $H_a$: $p > .2$.
a. For the alternative value $p = .21$, compute $\beta(.21)$ for sample sizes $n = 100$, 2500, 10,000, 40,000, and 90,000.
b. For $\hat{p} = x/n = .21$, compute the P-value when $n = 100$, 2500, 10,000, and 40,000.
c. In most situations, would it be reasonable to use a level .01 test in conjunction with a sample size of 40,000? Why or why not?

62. For a random sample of $n$ individuals taking a licensing exam, let $X_i = 1$ if the $i$th individual in the sample passes the exam and $X_i = 0$ otherwise ($i = 1, \ldots, n$).
a. With $p$ denoting the proportion of all exam-takers who pass, show that the most powerful test of $H_0$: $p = .5$ versus $H_a$: $p = .75$ rejects $H_0$ when $\sum x_i \ge c$.
b. If $n = 20$ and you want $\alpha \le .05$ for the test of (a), would you reject $H_0$ if 15 of the 20 individuals in the sample pass the exam?
c. What is the power of the test you used in (b) when $p = .75$ [i.e., what is $\pi(.75)$]?
d. Is the test derived in (a) UMP for testing the hypotheses $H_0$: $p = .5$ versus $H_a$: $p > .5$? Explain your reasoning.
e. Graph the power function $\pi(p)$ of the test for the hypotheses of (d) when $n = 20$ and $\alpha \le .05$.
f. Return to the scenario of (a), and suppose the test is based on a sample size of 50. If the probability of a type II error is approximately .025, what is the approximate significance level of the test (use a normal approximation)?

63. The error $X$ in a measurement has a normal distribution with mean value 0 and variance $\sigma^2$. Consider testing $H_0$: $\sigma^2 = 2$ versus $H_a$: $\sigma^2 = 3$ based on a random sample $X_1, \ldots, X_n$ of errors.
a. Show that a most powerful test rejects $H_0$ when $\sum x_i^2 \ge c$.
b. For $n = 10$, find the value of $c$ for the test in (a) that results in $\alpha = .05$.
c. Is the test of (a) UMP for $H_0$: $\sigma^2 = 2$ versus $H_a$: $\sigma^2 > 2$? Justify your assertion.

64. Suppose that $X$, the fraction of a container that is filled, has pdf $f(x; \theta) = \theta x^{\theta - 1}$ for $0 < x < 1$ (where $\theta > 0$), and let $X_1, \ldots, X_n$ be a random sample from this distribution.
a. Show that the most powerful test for $H_0$: $\theta = 1$ versus $H_a$: $\theta = 2$ rejects the null hypothesis if $\sum \ln(x_i) \ge c$.
b. Is the test of (a) UMP for testing $H_0$: $\theta = 1$ versus $H_a$: $\theta > 1$? Explain your reasoning.
c. If $n = 50$, what is the (approximate) value of $c$ for which the test has significance level .05?

65. Consider a random sample of $n$ component lifetimes, where the distribution of lifetime is exponential with parameter $\lambda$.
a. Obtain a most powerful test for $H_0$: $\lambda = 1$ versus $H_a$: $\lambda = .5$, and express the rejection region in terms of a "simple" statistic.
b. Is the test of (a) uniformly most powerful for $H_0$: $\lambda = 1$ versus $H_a$: $\lambda < 1$? Justify your answer.

66. Consider a random sample of size $n$ from the "shifted exponential" distribution with pdf $f(x; \theta) = e^{-(x - \theta)}$ for $x > \theta$ and 0 otherwise (the graph is that of the ordinary exponential pdf with $\lambda = 1$ shifted so that it begins its descent at $\theta$ rather than at 0). Let $Y_1$ denote the smallest order statistic, and show that the likelihood ratio test of $H_0$: $\theta \le 1$ versus $H_a$: $\theta > 1$ rejects the null hypothesis if $y_1$, the observed value of $Y_1$, is $\ge c$.

67. Suppose that each of $n$ randomly selected individuals is classified according to his/her genotype with respect to a particular genetic characteristic and that the three possible genotypes are AA, Aa, and aa with long-run proportions (probabilities) $\theta^2$, $2\theta(1 - \theta)$, and $(1 - \theta)^2$, respectively ($0 < \theta < 1$). It is then straightforward to show that the likelihood is

$$\theta^{2x_1} [2\theta(1 - \theta)]^{x_2} (1 - \theta)^{2x_3}$$

where $x_1$, $x_2$, and $x_3$ are the number of individuals in the sample who have the AA, Aa, and aa genotypes, respectively. Show that the most powerful test for testing $H_0$: $\theta = .5$ versus $H_a$: $\theta = .8$ rejects the null hypothesis when $2x_1 + x_2 \ge c$. Is this test UMP for the alternative $H_a$: $\theta > .5$? Explain. [Note: The fact that the joint distribution of $X_1$, $X_2$, and $X_3$ is multinomial can be used to obtain the value of $c$ that yields a test with any desired significance level when $n$ is large.]

68. The error in a measurement is normally distributed with mean $\mu$ and standard deviation 1. Consider a random sample of $n$ errors, and show that the likelihood ratio test for $H_0$: $\mu = 0$ versus $H_a$: $\mu \ne 0$ rejects the null hypothesis when either $\bar{x} \ge c$ or $\bar{x} \le -c$. What is $c$ for a test with $\alpha = .05$? How does the test change if the standard deviation of an error is $\sigma_0$ (known) and the relevant hypotheses are $H_0$: $\mu = \mu_0$ versus $H_a$: $\mu \ne \mu_0$?

69. Measurement error in a particular situation is normally distributed with mean value $\mu$ and standard deviation 4. Consider testing $H_0$: $\mu = 0$ versus $H_a$: $\mu \ne 0$ based on a sample of $n = 16$ measurements.
a. Verify that the usual test with significance level .05 rejects $H_0$ if either $\bar{x} \ge 1.96$ or $\bar{x} \le -1.96$. [Note: That this test is unbiased follows from the fact that the way to capture the largest area under the z curve above an interval having width 3.92 is to center that interval at 0 (so it extends from $-1.96$ to 1.96).]
b. Consider the test that rejects $H_0$ if either $\bar{x} \ge 2.17$ or $\bar{x} \le -1.81$. What is $\alpha$, that is, $\pi(0)$?
c. What is the power of the test proposed in (b) when $\mu = .1$ and when $\mu = -.1$? (Note that .1 and $-.1$ are very close to the null value, so one would not expect large power for such values). Is the test unbiased?
d. Calculate the power of the usual test when $\mu = -.1$ and when $\mu = .1$. Is the usual test a most powerful test? [Hint: Refer to your calculations in (c).] [Note: It can be shown that the usual test is most powerful among all unbiased tests.]

70.
A test of whether a coin is fair will be based on $n = 50$ tosses. Let $X$ be the resulting number of heads. Consider two rejection regions: $R_1 = \{x: \text{either } x \le 17 \text{ or } x \ge 33\}$ and $R_2 = \{x: \text{either } x \le 18 \text{ or } x \ge 37\}$.
a. Determine the significance level (type I error probability) for each rejection region.
b. Determine the power of each test when $p = .49$. Is the test with rejection region $R_1$ a uniformly most powerful level .033 test? Explain.
c. Is the test with rejection region $R_2$ unbiased? Explain.
d. Sketch the power function for the test with rejection region $R_1$, and then do so for the test with the rejection region $R_2$. What does your intuition suggest about the desirability of using the rejection region $R_2$?

71. Consider Example 9.24.
a. With $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$, show that the likelihood ratio is equal to $\lambda = [1 + t^2/(n - 1)]^{-n/2}$, and therefore the approximate chi-square statistic is $-2[\ln(\lambda)] = n \ln[1 + t^2/(n - 1)]$.
b. Apply part (a) to test the hypotheses of Exercise 55, using the data given there. Compare your results with the answers found in Exercise 55.

Supplementary Exercises (72–94)

72. A sample of 50 lenses used in eyeglasses yields a sample mean thickness of 3.05 mm and a sample standard deviation of .34 mm. The desired true average thickness of such lenses is 3.20 mm. Does the data strongly suggest that the true average thickness of such lenses is something other than what is desired? Test using $\alpha = .05$.

73. In Exercise 72, suppose the experimenter had believed before collecting the data that the value of $\sigma$ was approximately .30. If the experimenter wished the probability of a type II error to be .05 when $\mu = 3.00$, was a sample size of 50 unnecessarily large?

74. It is specified that a certain type of iron should contain .85 g of silicon per 100 g of iron (.85%). The silicon content of each of 25 randomly selected iron specimens was determined, and the accompanying MINITAB output resulted from a test of the appropriate hypotheses.
Variable   N    Mean     StDev    SE Mean   T      P
sil cont   25   0.8880   0.1807   0.0361    1.05   0.30

a. What hypotheses were tested?
b. What conclusion would be reached for a significance level of .05, and why? Answer the same question for a significance level of .10.

75. One method for straightening wire before coiling it to make a spring is called "roller straightening." The article "The Effect of Roller and Spinner Wire Straightening on Coiling Performance and Wire Properties" (Springs, 1987: 27–28) reports on the tensile properties of wire. Suppose a sample of 16 wires is selected and each is tested to determine tensile strength (N/mm$^2$). The resulting sample mean and standard deviation are 2160 and 30, respectively.
a. The mean tensile strength for springs made using spinner straightening is 2150 N/mm$^2$. What hypotheses should be tested to determine whether the mean tensile strength for the roller method exceeds 2150?
b. Assuming that the tensile strength distribution is approximately normal, what test statistic would you use to test the hypotheses in part (a)?
c. What is the value of the test statistic for this data?
d. What is the P-value for the value of the test statistic computed in part (c)?
e. For a level .05 test, what conclusion would you reach?

76. A new method for measuring phosphorus levels in soil is described in the article "A Rapid Method to Determine Total Phosphorus in Soils" (Soil Sci. Amer. J., 1988: 1301–1304). Suppose a sample of 11 soil specimens, each with a true phosphorus content of 548 mg/kg, is analyzed using the new method. The resulting sample mean and standard deviation for phosphorus level are 587 and 10, respectively.
a. Is there evidence that the mean phosphorus level reported by the new method differs significantly from the true value of 548 mg/kg? Use $\alpha = .05$.
b. What assumptions must you make for the test in part (a) to be appropriate?

77.
The article "Orchard Floor Management Utilizing Soil-Applied Coal Dust for Frost Protection" (Agric. Forest Meteorol., 1988: 71–82) reports the following values for soil heat flux of eight plots covered with coal dust.

34.7 35.4 34.7 37.7 32.5 28.0 18.4 24.9

The mean soil heat flux for plots covered only with grass is 29.0. Assuming that the heat-flux distribution is approximately normal, does the data suggest that the coal dust is effective in increasing the mean heat flux over that for grass? Test the appropriate hypotheses using $\alpha = .05$.

78. The article "Caffeine Knowledge, Attitudes, and Consumption in Adult Women" (J. Nutrit. Ed., 1992: 179–184) reports the following summary data on daily caffeine consumption for a sample of adult women: $n = 47$, $\bar{x} = 215$ mg, $s = 235$ mg, and range = 5–1176.
a. Does it appear plausible that the population distribution of daily caffeine consumption is normal? Is it necessary to assume a normal population distribution to test hypotheses about the value of the population mean consumption? Explain your reasoning.
b. Suppose it had previously been believed that mean consumption was at most 200 mg. Does the given data contradict this prior belief? Test the appropriate hypotheses at significance level .10 and include a P-value in your analysis.

79. The accompanying output resulted when MINITAB was used to test the appropriate hypotheses about true average activation time based on the data in Exercise 56. Use this information to reach a conclusion at significance level .05 and also at level .01.

TEST OF MU = 25.000 VS MU G.T. 25.000
        N    MEAN     STDEV   SE MEAN   T      P VALUE
time    13   27.923   5.619   1.559     1.88   0.043

80. The true average breaking strength of ceramic insulators of a certain type is supposed to be at least 10 psi. They will be used for a particular application unless sample data indicates conclusively that this specification has not been met. A test of hypotheses using $\alpha = .01$ is to be based on a random sample of ten insulators.
Assume that the breaking-strength distribution is normal with unknown standard deviation.
a. If the true standard deviation is .80, how likely is it that insulators will be judged satisfactory when true average breaking strength is actually only 9.5? Only 9.0?
b. What sample size would be necessary to have a 75% chance of detecting that true average breaking strength is 9.5 when the true standard deviation is .80?

81. The accompanying observations on residual flame time (sec) for strips of treated children's nightwear were given in the article "An Introduction to Some Precision and Accuracy of Measurement Problems" (J. Test. Eval., 1982: 132–140). Suppose a true average flame time of at most 9.75 had been mandated. Does the data suggest that this condition has not been met? Carry out an appropriate test after first investigating the plausibility of assumptions that underlie your method of inference.

9.85 9.94 9.88 9.93 9.85 9.95 9.75 9.75 9.95 9.77
9.83 9.93 9.67 9.92 9.92 9.87 9.74 9.89 9.67 9.99

82. The incidence of a certain type of chromosome defect in the U.S. adult male population is believed to be 1 in 75. A random sample of 800 individuals in U.S. penal institutions reveals 16 who have such defects. Can it be concluded that the incidence rate of this defect among prisoners differs from the presumed rate for the entire adult male population?
a. State and test the relevant hypotheses using $\alpha = .05$. What type of error might you have made in reaching a conclusion?
b. What P-value is associated with this test? Based on this P-value, could $H_0$ be rejected at significance level .20?

83. In an investigation of the toxin produced by a certain poisonous snake, a researcher prepared 26
different vials, each containing 1 g of the toxin, and then determined the amount of antitoxin needed to neutralize the toxin. The sample average amount of antitoxin necessary was found to be 1.89 mg, and the sample standard deviation was .42. Previous research had indicated that the true average neutralizing amount was 1.75 mg/g of toxin. Does the new data contradict the value suggested by prior research? Test the relevant hypotheses using the P-value approach. Does the validity of your analysis depend on any assumptions about the population distribution of neutralizing amount? Explain.

84. The sample average unrestrained compressive strength for 45 specimens of a particular type of brick was computed to be 3107 psi, and the sample standard deviation was 188. The distribution of unrestrained compressive strength may be somewhat skewed. Does the data strongly indicate that the true average unrestrained compressive strength is less than the design value of 3200? Test using $\alpha = .001$.

85. To test the ability of auto mechanics to identify simple engine problems, an automobile with a single such problem was taken in turn to 72 different car repair facilities. Only 42 of the 72 mechanics who worked on the car correctly identified the problem. Does this strongly indicate that the true proportion of mechanics who could identify this problem is less than .75? Compute the P-value and reach a conclusion accordingly.

86. When $X_1, X_2, \ldots, X_n$ are independent Poisson variables, each with parameter $\lambda$, and $n$ is large, the sample mean $\bar{X}$ has approximately a normal distribution with $\mu = E(\bar{X}) = \lambda$ and $\sigma^2 = V(\bar{X}) = \lambda/n$. This implies that

$$Z = \frac{\bar{X} - \lambda}{\sqrt{\lambda/n}}$$

has approximately a standard normal distribution. For testing $H_0$: $\lambda = \lambda_0$, we can replace $\lambda$ by $\lambda_0$ in the equation for $Z$ to obtain a test statistic. This statistic is actually preferred to the large-sample statistic with denominator $S/\sqrt{n}$ (when the $X_i$'s are Poisson) because it is tailored explicitly to the Poisson assumption.
If the number of requests for consulting received by a certain statistician during a 5-day work week has a Poisson distribution and the total number of consulting requests during a 36-week period is 160, does this suggest that the true average number of weekly requests exceeds 4.0? Test using $\alpha = .02$.

87. A hot-tub manufacturer advertises that with its heating equipment, a temperature of 100°F can be achieved in at most 15 min. A random sample of 32 tubs is selected, and the time necessary to achieve a 100°F temperature is determined for each tub. The sample average time and sample standard deviation are 17.5 min and 2.2 min, respectively. Does this data cast doubt on the company’s claim? Compute the P-value and use it to reach a conclusion at level .05 (assume that the heating-time distribution is approximately normal).

88. Chapter 8 presented a CI for the variance $\sigma^2$ of a normal population distribution. The key result there was that the rv $\chi^2 = (n - 1)S^2/\sigma^2$ has a chi-squared distribution with $n - 1$ df. Consider the null hypothesis $H_0\colon \sigma^2 = \sigma_0^2$ (equivalently, $\sigma = \sigma_0$). Then when $H_0$ is true, the test statistic $\chi^2 = (n - 1)S^2/\sigma_0^2$ has a chi-squared distribution with $n - 1$ df. If the relevant alternative is $H_a\colon \sigma^2 > \sigma_0^2$, rejecting $H_0$ if $(n - 1)S^2/\sigma_0^2 \ge \chi^2_{\alpha, n-1}$ gives a test with significance level $\alpha$. To ensure reasonably uniform characteristics for a particular application, it is desired that the true standard deviation of the softening point of a certain type of petroleum pitch be at most .50°C. The softening points of ten different specimens were determined, yielding a sample standard deviation of .58°C. Does this strongly contradict the uniformity specification? Test the appropriate hypotheses using $\alpha = .01$.

89. Referring to Exercise 88, suppose an investigator wishes to test $H_0\colon \sigma^2 = .04$ versus $H_a\colon \sigma^2 < .04$ based on a sample of 21 observations. The computed value of $20s^2/.04$ is 8.58. Place bounds on the P-value and then reach a conclusion at level .01.

90.
When the population distribution is normal and $n$ is large, the sample standard deviation $S$ has approximately a normal distribution with $E(S) \approx \sigma$ and $V(S) \approx \sigma^2/(2n)$. We already know that in this case, for any $n$, $\bar{X}$ is normal with $E(\bar{X}) = \mu$ and $V(\bar{X}) = \sigma^2/n$.
a. Assuming that the underlying distribution is normal, what is an approximately unbiased estimator of the 99th percentile $\theta = \mu + 2.33\sigma$?
b. As discussed in Section 6.4, when the $X_i$’s are normal, $\bar{X}$ and $S$ are independent rv’s (one measures location whereas the other measures spread). Use this to compute $V(\hat{\theta})$ and $\sigma_{\hat{\theta}}$ for the estimator $\hat{\theta}$ of part (a). What is the estimated standard error $\hat{\sigma}_{\hat{\theta}}$?
c. Write a test statistic for testing $H_0\colon \theta = \theta_0$ that has approximately a standard normal distribution when $H_0$ is true. If soil pH is normally distributed in a certain region and 64 soil samples yield $\bar{x} = 6.33$, $s = .16$, does this provide strong evidence for concluding that at most 99% of all possible samples would have a pH of less than 6.75? Test using $\alpha = .01$.

91. Let $X_1, X_2, \ldots, X_n$ be a random sample from an exponential distribution with parameter $\lambda$. Then it can be shown that $2\lambda \sum X_i$ has a chi-squared distribution with $\nu = 2n$ (by first showing that $2\lambda X_i$ has a chi-squared distribution with $\nu = 2$).
a. Use this fact to obtain a test statistic and rejection region that together specify a level $\alpha$ test for $H_0\colon \mu = \mu_0$ versus each of the three commonly encountered alternatives. [Hint: $E(X_i) = \mu = 1/\lambda$, so $\mu = \mu_0$ is equivalent to $\lambda = 1/\mu_0$.]
b. Suppose that ten identical components, each having exponentially distributed time until failure, are tested. The resulting failure times are
95 16 11 3 42 71 225 64 87 123
Use the test procedure of part (a) to decide whether the data strongly suggests that the true average lifetime is less than the previously claimed value of 75.

92. Suppose the population distribution is normal with known $\sigma$. Let $\gamma$ be such that $0 < \gamma < \alpha$.
For testing $H_0\colon \mu = \mu_0$ versus $H_a\colon \mu \ne \mu_0$, consider the test that rejects $H_0$ if either $z \ge z_{\gamma}$ or $z \le -z_{\alpha - \gamma}$, where the test statistic is $Z = (\bar{X} - \mu_0)/(\sigma/\sqrt{n})$.
a. Show that $P(\text{type I error}) = \alpha$.
b. Derive an expression for $\beta(\mu')$. [Hint: Express the test in the form “reject $H_0$ if either $\bar{x} \ge c_1$ or $\le c_2$.”]
c. Let $\Delta > 0$. For what values of $\gamma$ (relative to $\alpha$) will $\beta(\mu_0 + \Delta) < \beta(\mu_0 - \Delta)$?

93. After a period of apprenticeship, an organization gives an exam that must be passed to be eligible for membership. Let $p = P(\text{randomly chosen apprentice passes})$. The organization wishes an exam that most but not all should be able to pass, so it decides that $p = .90$ is desirable. For a particular exam, the relevant hypotheses are $H_0\colon p = .90$ versus the alternative $H_a\colon p \ne .90$. Suppose ten people take the exam, and let $X$ = the number who pass.
a. Does the lower-tailed region $\{0, 1, \ldots, 5\}$ specify a level .01 test?
b. Show that even though $H_a$ is two-sided, no two-tailed test is a level .01 test.
c. Sketch a graph of $\beta(p')$ as a function of $p'$ for this test. Is this desirable?

94. A service station has six gas pumps. When no vehicles are at the station, let $p_i$ denote the probability that the next vehicle will select pump $i$ ($i = 1, 2, \ldots, 6$). Based on a sample of size $n$, we wish to test $H_0\colon p_1 = \cdots = p_6$ versus the alternative $H_a\colon p_1 = p_3 = p_5,\ p_2 = p_4 = p_6$ (note that $H_a$ is not a simple hypothesis). Let $X$ be the number of customers in the sample that select an even-numbered pump.
a. Show that the likelihood ratio test rejects $H_0$ if either $X \ge c$ or $X \le n - c$. [Hint: When $H_a$ is true, let $\theta$ denote the common value of $p_2$, $p_4$, and $p_6$.]
b. Let $n = 10$ and $c = 9$. Determine the power of the test both when $H_0$ is true and also when $p_2 = p_4 = p_6 = \frac{1}{10}$, $p_1 = p_3 = p_5 = \frac{7}{30}$.

Bibliography
See the bibliographies for Chapters 7 and 8.