Tests of Hypotheses Based on a Single Sample Introduction

CHAPTER NINE
Tests of Hypotheses Based on a Single Sample
Introduction
A parameter can be estimated from sample data either by a single number (a point
estimate) or an entire interval of plausible values (a confidence interval). Frequently, however, the objective of an investigation is not to estimate a parameter
but to decide which of two contradictory claims about the parameter is correct.
Methods for accomplishing this comprise the part of statistical inference called
hypothesis testing. In this chapter, we first discuss some of the basic concepts and
terminology in hypothesis testing and then develop decision procedures for the
most frequently encountered testing problems based on a sample from a single
population.
J.L. Devore and K.N. Berk, Modern Mathematical Statistics with Applications, Springer Texts in Statistics,
DOI 10.1007/978-1-4614-0391-3_9, © Springer Science+Business Media, LLC 2012
9.1 Hypotheses and Test Procedures
A statistical hypothesis, or just hypothesis, is a claim or assertion either about the
value of a single parameter (population characteristic or characteristic of a probability distribution), about the values of several parameters, or about the form of an
entire probability distribution. One example of a hypothesis is the claim μ = $311,
where μ is the true average one-term textbook expenditure for students at a
university. Another example is the statement p < .50, where p is the proportion of
adults who approve of the job that the President is doing. If μ1 and μ2 denote the true
average decreases in systolic blood pressure for two different drugs, one hypothesis
is the assertion that μ1 − μ2 = 0, and another is the statement μ1 − μ2 > 5.
Yet another example of a hypothesis is the assertion that the stopping distance
for a car under particular conditions has a normal distribution. Hypotheses of this
latter sort will be considered in Chapter 13. In this and the next several chapters, we
concentrate on hypotheses about parameters.
In any hypothesis-testing problem, there are two contradictory hypotheses
under consideration. One hypothesis might be the claim μ = $311 and the other
μ ≠ $311, or the two contradictory statements might be p ≥ .50 and p < .50. The
objective is to decide, based on sample information, which of the two hypotheses is
correct. There is a familiar analogy to this in a criminal trial. One claim is the assertion
that the accused individual is innocent. In the U.S. judicial system, this is the claim that
is initially believed to be true. Only in the face of strong evidence to the contrary
should the jury reject this claim in favor of the alternative assertion that the accused
is guilty. In this sense, the claim of innocence is the favored or protected hypothesis,
and the burden of proof is placed on those who believe in the alternative claim.
Similarly, in testing statistical hypotheses, the problem will be formulated so
that one of the claims is initially favored. This initially favored claim will not be
rejected in favor of the alternative claim unless sample evidence contradicts it and
provides strong support for the alternative assertion.
DEFINITION
The null hypothesis, denoted by H0, is the claim that is initially assumed to
be true (the “prior belief” claim). The alternative hypothesis, denoted by Ha,
is the assertion that is contradictory to H0.
The null hypothesis will be rejected in favor of the alternative hypothesis only if sample evidence suggests that H0 is false. If the sample does not
strongly contradict H0, we will continue to believe in the plausibility of the
null hypothesis. The two possible conclusions from a hypothesis-testing
analysis are then reject H0 or fail to reject H0.
A test of hypotheses is a method for using sample data to decide whether the null
hypothesis should be rejected. Thus we might test H0: μ = .75 against the alternative Ha: μ ≠ .75. Only if sample data strongly suggests that μ is something other
than .75 should the null hypothesis be rejected. In the absence of such evidence, H0
should not be rejected, since it is still quite plausible.
Sometimes an investigator does not want to accept a particular assertion unless
and until data can provide strong support for the assertion. As an example, suppose a
company is considering putting a new additive in the dried fruit that it produces.
The true average shelf life with the current additive is known to be 200 days. With μ
denoting the true average life for the new additive, the company would not want to
make a change unless evidence strongly suggested that μ exceeds 200. An appropriate problem formulation would involve testing H0: μ = 200 against Ha: μ > 200.
The conclusion that a change is justified is identified with Ha, and it would take
conclusive evidence to justify rejecting H0 and switching to the new additive.
Scientific research often involves trying to decide whether a current theory
should be replaced by a more plausible and satisfactory explanation of the phenomenon under investigation. A conservative approach is to identify the current theory
with H0 and the researcher’s alternative explanation with Ha. Rejection of the current
theory will then occur only when evidence is much more consistent with the new
theory. In many situations, Ha is referred to as the “research hypothesis,” since it is
the claim that the researcher would really like to validate. The word null means “of
no value, effect, or consequence,” which suggests that H0 should be identified with
the hypothesis of no change (from current opinion), no difference, no improvement,
and so on. Suppose, for example, that 10% of all computer circuit boards produced by
a manufacturer during a recent period were defective. An engineer has suggested a
change in the production process in the belief that it will result in a reduced defective
rate. Let p denote the true proportion of defective boards resulting from the changed
process. Then the research hypothesis, on which the burden of proof is placed, is the
assertion that p < .10. Thus the alternative hypothesis is Ha: p < .10.
In our treatment of hypothesis testing, H0 will generally be stated as an
equality claim. If θ denotes the parameter of interest, the null hypothesis will
have the form H0: θ = θ0, where θ0 is a specified number called the null value of
the parameter (value claimed for θ by the null hypothesis). As an example, consider
the circuit board situation just discussed. The suggested alternative hypothesis was
Ha: p < .10, the claim that the defective rate is reduced by the process modification. A natural choice of H0 in this situation is the claim that p ≥ .10, according to
which the new process is either no better or worse than the one currently used. We
will instead consider H0: p = .10 versus Ha: p < .10. The rationale for using this
simplified null hypothesis is that any reasonable decision procedure for deciding
between H0: p = .10 and Ha: p < .10 will also be reasonable for deciding between
the claim that p ≥ .10 and Ha. The use of a simplified H0 is preferred because it has
certain technical benefits, which will be apparent shortly.
The alternative to the null hypothesis H0: θ = θ0 will look like one of the
following three assertions:
1. Ha: θ > θ0 (in which case the implicit null hypothesis is θ ≤ θ0)
2. Ha: θ < θ0 (so the implicit null hypothesis states that θ ≥ θ0)
3. Ha: θ ≠ θ0.
For example, let σ denote the standard deviation of the distribution of outside diameters
(inches) for an engine piston. If the decision was made to use the piston unless sample
evidence conclusively demonstrated that σ > .0001 in., the appropriate hypotheses
would be H0: σ = .0001 versus Ha: σ > .0001. The number θ0 that appears in both H0
and Ha (separates the alternative from the null) is called the null value.
Test Procedures
A test procedure is a rule, based on sample data, for deciding whether to reject H0.
A test of H0: p = .10 versus Ha: p < .10 in the circuit board problem might be
based on examining a random sample of n = 200 boards. Let X denote the number
of defective boards in the sample, a binomial random variable; x represents the
observed value of X. If H0 is true, E(X) = np = 200(.10) = 20, whereas we can
expect fewer than 20 defective boards if Ha is true. A value x just a bit below 20
does not strongly contradict H0, so it is reasonable to reject H0 only if x is
substantially < 20. One such test procedure is to reject H0 if x ≤ 15 and not reject
H0 otherwise. This procedure has two constituents: (1) a test statistic or function of
the sample data used to make a decision and (2) a rejection region consisting of
those x values for which H0 will be rejected in favor of Ha. For the rule just
suggested, the rejection region consists of x = 0, 1, 2, . . . , 15. H0 will not be
rejected if x = 16, 17, . . . , 199, or 200.
A test procedure is specified by the following:
1. A test statistic, a function of the sample data on which the decision (reject
H0 or do not reject H0) is to be based
2. A rejection region, the set of all test statistic values for which H0 will be
rejected
The null hypothesis will then be rejected if and only if the observed or
computed test statistic value falls in the rejection region.
As another example, suppose a cigarette manufacturer claims that the average nicotine content μ of brand B cigarettes is (at most) 1.5 mg. It would be unwise
to reject the manufacturer’s claim without strong contradictory evidence, so an
appropriate problem formulation is to test H0: μ = 1.5 versus Ha: μ > 1.5. Consider a decision rule based on analyzing a random sample of 32 cigarettes. Let X̄
denote the sample average nicotine content. If H0 is true, E(X̄) = μ = 1.5, whereas
if H0 is false, we expect X̄ to exceed 1.5. Strong evidence against H0 is provided by
a value x̄ that considerably exceeds 1.5. Thus we might use X̄ as a test statistic along
with the rejection region x̄ ≥ 1.60.
In both the circuit board and nicotine examples, the choice of test statistic and
form of the rejection region make sense intuitively. However, the choice of cutoff
value used to specify the rejection region is somewhat arbitrary. Instead of rejecting
H0: p = .10 in favor of Ha: p < .10 when x ≤ 15, we could use the rejection region
x ≤ 14. For this region, H0 would not be rejected if 15 defective boards are
observed, whereas this occurrence would lead to rejection of H0 if the initially
suggested region is employed. Similarly, the rejection region x̄ ≥ 1.55 might be
used in the nicotine problem in place of the region x̄ ≥ 1.60.
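The effect of moving the cutoff can be seen by applying both versions of each rule to the same observation. The following is a minimal sketch in Python; the function names are illustrative, and the observed sample mean 1.57 mg is a hypothetical value chosen to fall between the two nicotine cutoffs.

```python
def lower_tailed_decision(x, cutoff):
    """Reject H0 when the observed statistic is at or below the cutoff."""
    return "reject H0" if x <= cutoff else "fail to reject H0"


def upper_tailed_decision(x, cutoff):
    """Reject H0 when the observed statistic is at or above the cutoff."""
    return "reject H0" if x >= cutoff else "fail to reject H0"


# Circuit boards: 15 defective boards observed in the sample of 200.
for cutoff in (15, 14):
    print(f"region x <= {cutoff}:", lower_tailed_decision(15, cutoff))

# Nicotine: a hypothetical observed sample mean of 1.57 mg (assumption).
for cutoff in (1.60, 1.55):
    print(f"region xbar >= {cutoff:.2f}:", upper_tailed_decision(1.57, cutoff))
```

The same data lead to opposite conclusions under the two cutoffs, which is why the choice of cutoff needs a principled basis.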
Errors in Hypothesis Testing
The basis for choosing a particular rejection region lies in an understanding of
the errors that one might be faced with in drawing a conclusion. Consider the
rejection region x ≤ 15 in the circuit board problem. Even when H0: p = .10 is
true, it might happen that an unusual sample results in x = 13, so that H0
is erroneously rejected. On the other hand, even when Ha: p < .10 is true,
an unusual sample might yield x = 20, in which case H0 would not be rejected,
again an incorrect conclusion. Thus it is possible that H0 may be rejected when it
is true or that H0 may not be rejected when it is false. These possible errors are not
consequences of a foolishly chosen rejection region. Either one of these two
errors might result when the region x ≤ 14 is employed, or indeed when any
other sensible region is used.
DEFINITION
A type I error consists of rejecting the null hypothesis H0 when it is true.
A type II error involves not rejecting H0 when H0 is false.
In the nicotine scenario, a type I error consists of rejecting the manufacturer’s claim
that μ = 1.5 when it is actually true. If the rejection region x̄ ≥ 1.60 is employed,
it might happen that x̄ = 1.63 even when μ = 1.5, resulting in a type I error.
Alternatively, it may be that H0 is false and yet x̄ = 1.52 is observed, leading to
H0 not being rejected (a type II error).
In the best of all possible worlds, test procedures for which neither type of
error is possible could be developed. However, this ideal can be achieved only by
basing a decision on an examination of the entire population, which is almost
always impractical. The difficulty with using a procedure based on sample data
is that because of sampling variability, an unrepresentative sample may result.
Even though E(X̄) = μ, the observed value x̄ may differ substantially from μ
(at least if n is small). Thus when μ = 1.5 in the nicotine situation, x̄ may be
much larger than 1.5, resulting in erroneous rejection of H0. Alternatively, it
may be that μ = 1.6 yet an x̄ much smaller than this is observed, leading to a
type II error.
Instead of demanding error-free procedures, we must look for procedures for
which either type of error is unlikely to occur. That is, a good procedure is one for
which the probability of making either type of error is small. The choice of a
particular rejection region cutoff value fixes the probabilities of type I and type II
errors. These error probabilities are traditionally denoted by α and β, respectively.
Because H0 specifies a unique value of the parameter, there is a single value of α.
However, there is a different value of β for each value of the parameter consistent
with Ha.
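To make these definitions concrete, the sketch below computes α and β for the circuit board test (n = 200, null value .10, rejection regions x ≤ 15 and x ≤ 14). It is an illustration only: the helper names are ours, and the alternative values p = .05 and p = .08 are assumptions chosen for the example, not values from the text.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p), summed directly from the pmf."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n, p0 = 200, 0.10


def alpha(cutoff):
    """P(reject H0) = P(X <= cutoff) when p equals the null value."""
    return binom_cdf(cutoff, n, p0)


def beta(cutoff, p):
    """P(fail to reject H0) = P(X > cutoff) for an alternative value p < p0."""
    return 1 - binom_cdf(cutoff, n, p)


for cutoff in (15, 14):
    print(f"region x <= {cutoff}: alpha = {alpha(cutoff):.4f}, "
          f"beta(.05) = {beta(cutoff, 0.05):.4f}, "
          f"beta(.08) = {beta(cutoff, 0.08):.4f}")
```

There is one α per rejection region, but β has to be evaluated separately at each alternative value of p.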
Example 9.1
An automobile model is known to sustain no visible damage 25% of the time in
10-mph crash tests. A modified bumper design has been proposed in an effort to
increase this percentage. Let p denote the proportion of all 10-mph crashes with this
new bumper that result in no visible damage. The hypotheses to be tested are H0:
p = .25 (no improvement) versus Ha: p > .25. The test will be based on an
experiment involving n = 20 independent crashes with prototypes of the new
design. Intuitively, H0 should be rejected if a substantial number of the crashes
show no damage. Consider the following test procedure:
Test statistic: X = the number of crashes with no visible damage
Rejection region: R8 = {8, 9, 10, . . . , 19, 20}; that is, reject H0 if x ≥ 8,
where x is the observed value of the test statistic
This rejection region is called upper-tailed because it consists only of large values
of the test statistic.
When H0 is true, X has a binomial probability distribution with n = 20 and
p = .25. Then
\[
\begin{aligned}
\alpha &= P(\text{type I error}) = P(H_0 \text{ is rejected when it is true}) \\
&= P[X \ge 8 \text{ when } X \sim \mathrm{Bin}(20, .25)] = 1 - B(7; 20, .25) \\
&= 1 - .898 = .102
\end{aligned}
\]
That is, when H0 is actually true, roughly 10% of all experiments consisting of
20 crashes would result in H0 being incorrectly rejected (a type I error).
In contrast to α, there is not a single β. Instead, there is a different β for each
different p that exceeds .25. Thus there is a value of β for p = .3 [in which case
X ~ Bin(20, .3)], another value of β for p = .5, and so on. For example,
\[
\begin{aligned}
\beta(.3) &= P(\text{type II error when } p = .3) \\
&= P(H_0 \text{ is not rejected when it is false because } p = .3) \\
&= P[X \le 7 \text{ when } X \sim \mathrm{Bin}(20, .3)] = B(7; 20, .3) = .772
\end{aligned}
\]
When p is actually .3 rather than .25 (a “small” departure from H0), roughly 77% of
all experiments of this type would result in H0 being incorrectly not rejected!
The accompanying table displays β for selected values of p (each calculated
for the rejection region R8). Clearly, β decreases as the value of p moves farther
to the right of the null value .25. Intuitively, the greater the departure from H0,
the more likely it is that such a departure will be detected.
p       .3     .4     .5     .6     .7     .8
β(p)    .772   .416   .132   .021   .001   .000
The proposed test procedure is still reasonable for testing the more realistic null
hypothesis that p ≤ .25. In this case, there is no longer a single α, but instead there
is an α for each p that is at most .25: α(.25), α(.23), α(.20), α(.15), and so on. It is
easily verified, though, that α(p) < α(.25) = .102 if p < .25. That is, the largest
value of α occurs for the boundary value .25 between H0 and Ha. Thus if α is small
for the simplified null hypothesis, it will also be as small as or smaller for the more
realistic H0.
■
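The error probabilities of Example 9.1 can be checked numerically. The sketch below recomputes α and the β(p) table for the region R8 from the binomial pmf, using only the Python standard library; the helper name binom_cdf is ours and plays the role of the tabulated B(x; n, p).

```python
from math import comb


def binom_cdf(k, n, p):
    """B(k; n, p) = P(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n, p0, cutoff = 20, 0.25, 8          # R8: reject H0 when x >= 8

# Type I error probability: P(X >= 8) under the null value p = .25.
alpha = 1 - binom_cdf(cutoff - 1, n, p0)
print(f"alpha = {alpha:.3f}")

# Type II error probability: P(X <= 7) under each alternative value of p.
beta = {p: binom_cdf(cutoff - 1, n, p) for p in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8)}
for p, b in beta.items():
    print(f"beta({p}) = {b:.3f}")
```

Rounded to three decimals, the output should agree with α = .102 and with the table of β(p) values above.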
Example 9.2
The drying time of a type of paint under specified test conditions is known to
be normally distributed with mean value 75 min and standard deviation 9 min.
Chemists have proposed a new additive designed to decrease average drying time.
It is believed that drying times with this additive will remain normally distributed
with σ = 9. Because of the expense associated with the additive, evidence should
strongly suggest an improvement in average drying time before such a conclusion
is adopted. Let μ denote the true average drying time when the additive is used.
The appropriate hypotheses are H0: μ = 75 versus Ha: μ < 75. Only if H0 can be
rejected will the additive be declared successful and used.
Experimental data is to consist of drying times from n = 25 test specimens.
Let X1, . . . , X25 denote the 25 drying times, a random sample of size 25 from a
normal distribution with mean value μ and standard deviation σ = 9. The sample
mean drying time X̄ then has a normal distribution with expected value μ_X̄ = μ and
standard deviation σ_X̄ = σ/√n = 9/√25 = 1.80. When H0 is true, μ_X̄ = 75, so
only an x̄ value substantially < 75 would strongly contradict H0. A reasonable
rejection region has the form x̄ ≤ c, where the cutoff value c is suitably chosen.
Consider the choice c = 70.8, so that the test procedure consists of test statistic X̄
and rejection region x̄ ≤ 70.8. Because the rejection region consists only of small
values of the test statistic, the test is said to be lower-tailed. Calculation of α and β
now involves a routine standardization of X̄ followed by reference to the standard
normal probabilities of Appendix Table A.3:
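\[
\begin{aligned}
\alpha &= P(\text{type I error}) = P(H_0 \text{ is rejected when it is true}) \\
&= P(\bar{X} \le 70.8 \text{ when } \bar{X} \sim \text{normal with } \mu_{\bar{X}} = 75,\ \sigma_{\bar{X}} = 1.8) \\
&= \Phi\left(\frac{70.8 - 75}{1.8}\right) = \Phi(-2.33) = .01 \\
\beta(72) &= P(\text{type II error when } \mu = 72) \\
&= P(H_0 \text{ is not rejected when it is false because } \mu = 72) \\
&= P(\bar{X} > 70.8 \text{ when } \bar{X} \sim \text{normal with } \mu_{\bar{X}} = 72,\ \sigma_{\bar{X}} = 1.8) \\
&= 1 - \Phi\left(\frac{70.8 - 72}{1.8}\right) = 1 - \Phi(-.67) = 1 - .2514 = .7486 \\
\beta(70) &= 1 - \Phi\left(\frac{70.8 - 70}{1.8}\right) = .3300 \qquad \beta(67) = .0174
\end{aligned}
\]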
For the specified test procedure, only 1% of all experiments carried out as described
will result in H0 being rejected when it is actually true. However, the chance of a
type II error is very large when μ = 72 (only a small departure from H0), somewhat
less when μ = 70, and quite small when μ = 67 (a very substantial departure
from H0). These error probabilities are illustrated in Figure 9.1.
Notice that α is computed using the probability distribution of the test statistic
when H0 is true, whereas determination of β requires knowing the test statistic’s
distribution when H0 is false.
As in Example 9.1, if the more realistic null hypothesis μ ≥ 75 is considered,
there is an α for each parameter value for which H0 is true: α(75), α(75.8), α(76.5),
and so on. It is easily verified, though, that α(75) is the largest of all these type I
error probabilities. Focusing on the boundary value amounts to working explicitly
with the “worst case.”
■
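The normal-curve calculations of Example 9.2 can be reproduced with the standard library's NormalDist class in place of Appendix Table A.3. This is a sketch under the example's assumptions (σ = 9, n = 25, cutoff 70.8); the function name reject_prob is ours.

```python
from statistics import NormalDist

sigma, n, cutoff = 9, 25, 70.8       # reject H0 when xbar <= 70.8
se = sigma / n ** 0.5                # standard deviation of the sample mean (1.8)


def reject_prob(mu):
    """P(Xbar <= cutoff) when the true average drying time is mu."""
    return NormalDist(mu, se).cdf(cutoff)


# alpha uses the distribution of Xbar under H0 (mu = 75).
alpha = reject_prob(75)
print(f"alpha = {alpha:.4f}")

# beta(mu) uses the distribution of Xbar under each alternative value.
for mu in (72, 70, 67):
    print(f"beta({mu}) = {1 - reject_prob(mu):.4f}")
```

Small differences from the hand calculation in the third decimal place are to be expected, because the table lookup rounds each z value to two decimals while the code does not.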
The specification of a cutoff value for the rejection region in the examples
just considered was somewhat arbitrary. Use of the rejection region R8 = {8, 9, . . ., 20}
in Example 9.1 resulted in α = .102, β(.3) = .772, and β(.5) = .132. Many would
think these error probabilities intolerably large. Perhaps they can be decreased by
changing the cutoff value.
Example 9.3 (Example 9.1 continued)
Let us use the same experiment and test statistic X as previously described in the
automobile bumper problem but now consider the rejection region R9 = {9, 10,
. . ., 20}. Since X still has a binomial distribution with parameters n = 20 and p,
\[
\begin{aligned}
\alpha &= P(H_0 \text{ is rejected when } p = .25) \\
&= P[X \ge 9 \text{ when } X \sim \mathrm{Bin}(20, .25)] = 1 - B(8; 20, .25) = .041
\end{aligned}
\]
[Figure 9.1 α and β illustrated for Example 9.2 (cutoff 70.8): (a) the distribution of X̄ when μ = 75 (H0 true), shaded area = α = .01; (b) the distribution of X̄ when μ = 72 (H0 false), shaded area = β(72); (c) the distribution of X̄ when μ = 70 (H0 false), shaded area = β(70)]
The type I error probability has been decreased by using the new rejection region.
However, a price has been paid for this decrease:
\[
\begin{aligned}
\beta(.3) &= P(H_0 \text{ is not rejected when } p = .3) \\
&= P[X \le 8 \text{ when } X \sim \mathrm{Bin}(20, .3)] = B(8; 20, .3) = .887 \\
\beta(.5) &= B(8; 20, .5) = .252
\end{aligned}
\]
Both these β’s are larger than the corresponding error probabilities .772 and .132
for the region R8. In retrospect, this is not surprising; α is computed by summing
over probabilities of test statistic values in the rejection region, whereas β is
the probability that X falls in the complement of the rejection region. Making the
rejection region smaller must therefore decrease α while increasing β for any fixed
alternative value of the parameter.
■
Example 9.4 (Example 9.2 continued)
The use of cutoff value c = 70.8 in the paint-drying example resulted in a very
small value of α (.01) but rather large β’s. Consider the same experiment and test
statistic X̄ with the new rejection region x̄ ≤ 72. Because X̄ is still normally
distributed with mean value μ_X̄ = μ and σ_X̄ = 1.8,
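\[
\begin{aligned}
\alpha &= P(H_0 \text{ is rejected when it is true}) \\
&= P[\bar{X} \le 72 \text{ when } \bar{X} \sim N(75, 1.8^2)] \\
&= \Phi\left(\frac{72 - 75}{1.8}\right) = \Phi(-1.67) = .0475 \approx .05 \\
\beta(72) &= P(H_0 \text{ is not rejected when } \mu = 72) \\
&= P(\bar{X} > 72 \text{ when } \bar{X} \text{ is a normal rv with mean 72 and standard deviation 1.8}) \\
&= 1 - \Phi\left(\frac{72 - 72}{1.8}\right) = 1 - \Phi(0) = .5 \\
\beta(70) &= 1 - \Phi\left(\frac{72 - 70}{1.8}\right) = .1335 \qquad \beta(67) = .0027
\end{aligned}
\]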
The change in cutoff value has made the rejection region larger (it includes more x̄
values), resulting in a decrease in β for each fixed μ less than 75. However, α for
this new region has increased from the previous value .01 to approximately .05. If a
type I error probability this large can be tolerated, though, the second region
(c = 72) is preferable to the first (c = 70.8) because of the smaller β’s.
■
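The comparison between the two paint-drying cutoffs can be laid out side by side. The following sketch evaluates α and β(72), β(70), β(67) for c = 70.8 and c = 72 under the assumptions of Example 9.2; the function name error_probs is ours.

```python
from statistics import NormalDist

se = 9 / 25 ** 0.5                   # standard deviation of the sample mean (1.8)


def error_probs(cutoff, alternatives=(72, 70, 67)):
    """Return (alpha, {mu: beta(mu)}) for the rule 'reject H0 when xbar <= cutoff'."""
    alpha = NormalDist(75, se).cdf(cutoff)
    betas = {mu: 1 - NormalDist(mu, se).cdf(cutoff) for mu in alternatives}
    return alpha, betas


for cutoff in (70.8, 72):
    alpha, betas = error_probs(cutoff)
    beta_text = ", ".join(f"beta({mu}) = {b:.4f}" for mu, b in betas.items())
    print(f"c = {cutoff}: alpha = {alpha:.4f}, {beta_text}")
```

Moving the cutoff from 70.8 to 72 raises α and lowers every β, which is the trade described in the text.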
The results of these examples can be generalized in the following manner.
PROPOSITION
Suppose an experiment and a sample size are fixed and a test statistic is
chosen. Then decreasing the size of the rejection region to obtain a smaller
value of α results in a larger value of β for any particular parameter value
consistent with Ha.
This proposition says that once the test statistic and n are fixed, there is no rejection
region that will simultaneously make both α and all β’s small. A region must be
chosen to effect a compromise between α and β.
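The trade-off can be seen by sweeping the cutoff in the bumper problem of Example 9.1. The sketch below tabulates α and β(.3) for the upper-tailed regions {c, c + 1, . . . , 20}; the range of cutoffs, 6 through 11, is an arbitrary choice for illustration.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n, p0, p_alt = 20, 0.25, 0.3

rows = []
for c in range(6, 12):               # rejection region {c, c+1, ..., 20}
    alpha = 1 - binom_cdf(c - 1, n, p0)    # P(X >= c) when p = .25
    beta = binom_cdf(c - 1, n, p_alt)      # P(X <= c - 1) when p = .3
    rows.append((c, alpha, beta))
    print(f"c = {c:2d}: alpha = {alpha:.3f}, beta(.3) = {beta:.3f}")
```

Each step that shrinks the region lowers α and raises β(.3); the rows for c = 8 and c = 9 correspond to the regions R8 and R9 of Examples 9.1 and 9.3.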
Because of the suggested guidelines for specifying H0 and Ha, a type I error is
usually more serious than a type II error (this can always be achieved by proper
choice of the hypotheses). The approach adhered to by most statistical practitioners
is then to specify the largest value of α that can be tolerated and find a rejection
region having that value of α rather than anything smaller. This makes β as small as
possible subject to the bound on α. The resulting value of α is often referred to as the
significance level of the test. Traditional levels of significance are .10, .05, and .01,
although the level in any particular problem will depend on the seriousness of a
type I error: the more serious this error, the smaller should be the significance
level. The corresponding test procedure is called a level α test (e.g., a level .05 test
or a level .01 test). A test with significance level α is one for which the type I error
probability is controlled at the specified level.
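For a discrete test statistic, this approach amounts to searching for the largest rejection region whose α does not exceed the tolerated level. The sketch below does so for an upper-tailed binomial test and applies it to the bumper problem (n = 20, null value .25); the function name upper_tailed_cutoff is ours, and the search is only one straightforward way to carry out the idea.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p); an empty sum (k < 0) gives 0."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def upper_tailed_cutoff(n, p0, level):
    """Smallest c with P(X >= c) <= level when X ~ Bin(n, p0), and that probability."""
    for c in range(n + 2):
        alpha = 1 - binom_cdf(c - 1, n, p0)
        if alpha <= level:
            return c, alpha


for level in (0.10, 0.05, 0.01):
    c, alpha = upper_tailed_cutoff(20, 0.25, level)
    print(f"level {level}: reject H0 when x >= {c} (actual alpha = {alpha:.3f})")
```

Because X takes only integer values, the attained α generally falls below the nominal level rather than matching it exactly; at level .05 the search returns the region R9, whose α is .041.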
Example 9.5
Consider the situation mentioned previously in which μ was the true average
nicotine content of brand B cigarettes. The objective is to test H0: μ = 1.5 versus
Ha: μ > 1.5 based on a random sample X1, X2, . . . , X32 of nicotine contents.
Suppose the distribution of nicotine content is known to be normal with σ = .20.
It follows that X̄ is normally distributed with mean value μ_X̄ = μ and standard
deviation σ_X̄ = .20/√32 = .0354.
Rather than use X̄ itself as the test statistic, let’s standardize X̄ assuming that
H0 is true.
Test statistic:
\[
Z = \frac{\bar{X} - 1.5}{\sigma/\sqrt{n}} = \frac{\bar{X} - 1.5}{.0354}
\]
Z expresses the distance between X̄ and its expected value when H0 is true as some
number of standard deviations. For example, z = 3 results from an x̄ that is 3
standard deviations larger than we would have expected it to be were H0 true.
Rejecting H0 when x̄ “considerably” exceeds 1.5 is equivalent to rejecting H0
when z “considerably” exceeds 0. That is, the form of the rejection region is z ≥ c.
Let’s now determine c so that α = .05. When H0 is true, Z has a standard normal
distribution. Thus
\[
\alpha = P(\text{type I error}) = P(\text{rejecting } H_0 \text{ when it is true}) = P[Z \ge c \text{ when } Z \sim N(0, 1)]
\]
The value c must capture upper-tail area .05 under the z curve. Either from
Section 4.3 or directly from Appendix Table A.3, c = z.05 = 1.645.
Notice that z ≥ 1.645 is equivalent to x̄ − 1.5 ≥ (.0354)(1.645); that is,
x̄ ≥ 1.56. Then β is the probability that X̄ < 1.56 and can be calculated for any
μ > 1.5.
■
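The steps of Example 9.5 can be traced numerically: find the critical value, convert it to a cutoff for x̄, and evaluate β at chosen alternatives. This sketch uses the standard library's NormalDist; the alternative values 1.55, 1.60, and 1.65 are assumptions picked for illustration, since the example itself does not compute any β.

```python
from statistics import NormalDist

mu0, sigma, n, level = 1.5, 0.20, 32, 0.05
se = sigma / n ** 0.5                        # standard deviation of Xbar

# Critical value capturing upper-tail area .05 under the standard normal curve.
z_crit = NormalDist().inv_cdf(1 - level)

# Equivalent cutoff on the original scale: xbar >= mu0 + z_crit * se.
xbar_cutoff = mu0 + z_crit * se


def beta(mu):
    """P(Xbar < xbar_cutoff) when the true average nicotine content is mu."""
    return NormalDist(mu, se).cdf(xbar_cutoff)


print(f"se = {se:.4f}, z_crit = {z_crit:.3f}, xbar cutoff = {xbar_cutoff:.2f}")
for mu in (1.55, 1.60, 1.65):                # illustrative alternatives (assumed)
    print(f"beta({mu}) = {beta(mu):.4f}")
```

The critical value and cutoff should round to 1.645 and 1.56, matching the hand calculation, and β shrinks as μ moves farther above 1.5.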
Exercises Section 9.1 (1–14)
1. For each of the following assertions, state whether
it is a legitimate statistical hypothesis and why:
a. H: σ > 100
b. H: x̃ = 45
c. H: s ≤ .20
d. H: σ1/σ2 < 1
e. H: X̄ − Ȳ = 5
f. H: λ ≤ .01, where λ is the parameter of an
exponential distribution used to model component lifetime
2. For the following pairs of assertions, indicate
which do not comply with our rules for setting
up hypotheses and why (the subscripts 1 and 2 differentiate between quantities for two different
populations or samples):
a. H0: μ = 100, Ha: μ > 100
b. H0: σ = 20, Ha: σ ≤ 20
c. H0: p ≠ .25, Ha: p = .25
d. H0: μ1 − μ2 = 25, Ha: μ1 − μ2 > 100
e. H0: S1² = S2², Ha: S1² ≠ S2²
f. H0: μ = 120, Ha: μ = 150
g. H0: σ1/σ2 = 1, Ha: σ1/σ2 ≠ 1
h. H0: p1 − p2 = .1, Ha: p1 − p2 < .1
3. To determine whether the girder welds in a
new performing arts center meet specifications,
a random sample of welds is selected, and tests
are conducted on each weld in the sample. Weld
strength is measured as the force required to break
the weld. Suppose the specifications state that
mean strength of welds should exceed 100 lb/in²;
the inspection team decides to test H0: μ = 100
versus Ha: μ > 100. Explain why it might be
preferable to use this Ha rather than μ < 100.
4. Let μ denote the true average radioactivity level
(picocuries per liter). The value 5 pCi/L is considered the dividing line between safe and unsafe
water. Would you recommend testing H0: μ = 5
versus Ha: μ > 5 or H0: μ = 5 versus Ha: μ < 5?
Explain your reasoning. [Hint: Think about the
consequences of a type I and type II error for
each possibility.]
5. Before agreeing to purchase a large order of
polyethylene sheaths for a particular type of
high-pressure oil-filled submarine power cable,
a company wants to see conclusive evidence that
the true standard deviation of sheath thickness is
< .05 mm. What hypotheses should be tested, and
why? In this context, what are the type I and
type II errors?
6. Many older homes have electrical systems that use
fuses rather than circuit breakers. A manufacturer
of 40-amp fuses wants to make sure that the mean
amperage at which its fuses burn out is in fact 40.
If the mean amperage is lower than 40, customers
will complain because the fuses require replacement too often. If the mean amperage is higher
than 40, the manufacturer might be liable for
damage to an electrical system due to fuse malfunction. To verify the amperage of the fuses, a
sample of fuses is to be selected and inspected. If a
hypothesis test were to be performed on the resulting data, what null and alternative hypotheses
would be of interest to the manufacturer? Describe
type I and type II errors in the context of this
problem situation.
7. Water samples are taken from water used for cooling as it is being discharged from a power plant
into a river. It has been determined that as long as
the mean temperature of the discharged water is at
most 150°F, there will be no negative effects on
the river’s ecosystem. To investigate whether the
plant is in compliance with regulations that prohibit a mean discharge-water temperature above
150°, 50 water samples will be taken at randomly
selected times, and the temperature of each sample
recorded. The resulting data will be used to test the
hypotheses H0: μ = 150° versus Ha: μ > 150°. In
the context of this situation, describe type I and
type II errors. Which type of error would you
consider more serious? Explain.
8. A regular type of laminate is currently being used
by a manufacturer of circuit boards. A special
laminate has been developed to reduce warpage.
The regular laminate will be used on one sample
of specimens and the special laminate on another
sample, and the amount of warpage will then be
determined for each specimen. The manufacturer
will then switch to the special laminate only if it
can be demonstrated that the true average amount
of warpage for that laminate is less than for the
regular laminate. State the relevant hypotheses,
and describe the type I and type II errors in the
context of this situation.
9. Two different companies have applied to provide
cable television service in a region. Let p denote
the proportion of all potential subscribers who
favor the first company over the second. Consider
testing H0: p = .5 versus Ha: p ≠ .5 based on a
random sample of 25 individuals. Let X denote the
number in the sample who favor the first company
and x represent the observed value of X.
a. Which of the following rejection regions is
most appropriate and why?
R1 = {x: x ≤ 7 or x ≥ 18},
R2 = {x: x ≤ 8}, R3 = {x: x ≥ 17}
b. In the context of this problem situation,
describe what type I and type II errors are.
c. What is the probability distribution of the test
statistic X when H0 is true? Use it to compute
the probability of a type I error.
d. Compute the probability of a type II error for
the selected region when p = .3, again when
p = .4, and also for both p = .6 and p = .7.
e. Using the selected region, what would you
conclude if 6 of the 25 queried favored company 1?
10. For healthy individuals the level of prothrombin in
the blood is approximately normally distributed
with mean 20 mg/100 mL and standard deviation
4 mg/100 mL. Low levels indicate low clotting
ability. In studying the effect of gallstones on prothrombin, the level of each patient in a sample is
measured to see if there is a deficiency. Let μ be the
true average level of prothrombin for gallstone
patients.
a. What are the appropriate null and alternative
hypotheses?
b. Let X̄ denote the sample average level of prothrombin in a sample of n = 20 randomly
selected gallstone patients. Consider the test
procedure with test statistic X̄ and rejection
region x̄ ≤ 17.92. What is the probability distribution of the test statistic when H0 is true?
What is the probability of a type I error for the
test procedure?
c. What is the probability distribution of the test
statistic when μ = 16.7? Using the test procedure of part (b), what is the probability that
gallstone patients will be judged not deficient
in prothrombin, when in fact μ = 16.7 (a type
II error)?
d. How would you change the test procedure of
part (b) to obtain a test with significance level
.05? What impact would this change have on
the error probability of part (c)?
e. Consider the standardized test statistic Z =
(X̄ − 20)/(σ/√n) = (X̄ − 20)/.8944. What
are the values of Z corresponding to the rejection region of part (b)?
11. The calibration of a scale is to be checked by
weighing a 10-kg test specimen 25 times. Suppose
that the results of different weighings are
independent of one another and that the weight on
each trial is normally distributed with $\sigma = .200$ kg.
Let $\mu$ denote the true average weight reading on
the scale.
a. What hypotheses should be tested?
b. Suppose the scale is to be recalibrated if either
$\bar{x} \ge 10.1032$ or $\bar{x} \le 9.8968$. What is the probability that recalibration is carried out when it
is actually unnecessary?
c. What is the probability that recalibration is
judged unnecessary when in fact $\mu = 10.1$?
When $\mu = 9.8$?
d. Let $z = (\bar{x} - 10)/(\sigma/\sqrt{n})$. For what value c is
the rejection region of part (b) equivalent to the
“two-tailed” region either $z \ge c$ or $z \le -c$?
e. If the sample size were only 10 rather than 25,
how should the procedure of part (d) be altered
so that $\alpha = .05$?
f. Using the test of part (e), what would you
conclude from the following sample data?
9.981 9.728 10.006 10.439 9.857 10.214 10.107 10.190 9.888 9.793
g. Re-express the test procedure of part (b) in
terms of the standardized test statistic
$Z = (\bar{X} - 10)/(\sigma/\sqrt{n})$.
12. A new design for the braking system on a certain
type of car has been proposed. For the current
system, the true average braking distance at 40
mph under specified conditions is known to be
120 ft. It is proposed that the new design be
implemented only if sample data strongly indicates a reduction in true average braking distance
for the new design.
a. Define the parameter of interest and state the
relevant hypotheses.
b. Suppose braking distance for the new system is
normally distributed with $\sigma = 10$. Let $\bar{X}$
denote the sample average braking distance
for a random sample of 36 observations.
Which of the following rejection regions
is appropriate: $R_1 = \{\bar{x} : \bar{x} \ge 124.80\}$, $R_2 = \{\bar{x} : \bar{x} \le 115.20\}$,
$R_3 = \{\bar{x} : \text{either } \bar{x} \ge 125.13 \text{ or } \bar{x} \le 114.87\}$?
c. What is the significance level for the appropriate region of part (b)? How would you change
the region to obtain a test with $\alpha = .001$?
d. What is the probability that the new design is
not implemented when its true average braking
distance is actually 115 ft and the appropriate
region from part (b) is used?
e. Let $Z = (\bar{X} - 120)/(\sigma/\sqrt{n})$. What is the significance level for the rejection region
$\{z : z \le -2.33\}$? For the region $\{z : z \le -2.88\}$?
13. Let $X_1, \ldots, X_n$ denote a random sample from a
normal population distribution with a known
value of $\sigma$.
a. For testing the hypotheses H0: $\mu = \mu_0$ versus
Ha: $\mu > \mu_0$ (where $\mu_0$ is a fixed number), show
that the test with test statistic $\bar{X}$ and rejection
region $\bar{x} \ge \mu_0 + 2.33\sigma/\sqrt{n}$ has significance
level .01.
b. Suppose the procedure of part (a) is used to test
H0: $\mu \le \mu_0$ versus Ha: $\mu > \mu_0$. If $\mu_0 = 100$,
$n = 25$, and $\sigma = 5$, what is the probability of
committing a type I error when $\mu = 99$? When
$\mu = 98$? In general, what can be said about
the probability of a type I error when the
actual value of $\mu$ is less than $\mu_0$? Verify your
assertion.
14. Reconsider the situation of Exercise 11 and suppose the rejection region is
$\{\bar{x} : \bar{x} \ge 10.1004 \text{ or } \bar{x} \le 9.8940\} = \{z : z \ge 2.51 \text{ or } z \le -2.65\}$.
a. What is $\alpha$ for this procedure?
b. What is $\beta$ when $\mu = 10.1$? When $\mu = 9.9$? Is
this desirable?
9.2 Tests About a Population Mean
The general discussion in Chapter 8 of confidence intervals for a population mean $\mu$
focused on three different cases. We now develop test procedures for these same
three cases.
Case I: A Normal Population with Known $\sigma$
Although the assumption that the value of $\sigma$ is known is rarely met in practice, this
case provides a good starting point because of the ease with which general
procedures and their properties can be developed. The null hypothesis in all three
cases will state that $\mu$ has a particular numerical value, the null value, which we will
denote by $\mu_0$. Let $X_1, \ldots, X_n$ represent a random sample of size n from the normal
population. Then the sample mean $\bar{X}$ has a normal distribution with expected value
$\mu_{\bar{X}} = \mu$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$. When H0 is true, $\mu_{\bar{X}} = \mu_0$. Consider
now the statistic Z obtained by standardizing $\bar{X}$ under the assumption that H0 is true:
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$
Substitution of the computed sample mean $\bar{x}$ gives z, the distance between $\bar{x}$ and
$\mu_0$ expressed in “standard deviation units.” For example, if the null hypothesis is
H0: $\mu = 100$, $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 10/\sqrt{25} = 2.0$ and $\bar{x} = 103$, then the test statistic
value is given by $z = (103 - 100)/2.0 = 1.5$. That is, the observed value of $\bar{x}$
is 1.5 standard deviations (of $\bar{X}$) above what we expect it to be when H0 is true.
The statistic Z is a natural measure of the distance between $\bar{X}$, the estimator of $\mu$,
and its expected value when H0 is true. If this distance is too great in a direction
consistent with Ha, the null hypothesis should be rejected.
Suppose first that the alternative hypothesis has the form Ha: $\mu > \mu_0$. Then an $\bar{x}$
value less than $\mu_0$ certainly does not provide support for Ha. Such an $\bar{x}$ corresponds to
a negative value of z (since $\bar{x} - \mu_0$ is negative and the divisor $\sigma/\sqrt{n}$ is positive).
Similarly, an $\bar{x}$ value that exceeds $\mu_0$ by only a small amount (corresponding to z
which is positive but small) does not suggest that H0 should be rejected in favor
of Ha. The rejection of H0 is appropriate only when $\bar{x}$ considerably exceeds $\mu_0$, that
is, when the z value is positive and large. In summary, the appropriate rejection
region, based on the test statistic Z rather than $\bar{X}$, has the form $z \ge c$.
As discussed in Section 9.1, the cutoff value c should be chosen to control the
probability of a type I error at the desired level $\alpha$. This is easily accomplished because
the distribution of the test statistic Z when H0 is true is the standard normal distribution
(that’s why $\mu_0$ was subtracted in standardizing). The required cutoff c is the z
critical value that captures upper-tail area $\alpha$ under the standard normal curve. As an
example, let $c = 1.645$, the value that captures tail area .05 ($z_{.05} = 1.645$). Then,
$$\alpha = P(\text{type I error}) = P(H_0 \text{ is rejected when } H_0 \text{ is true})$$
$$= P[Z \ge 1.645 \text{ when } Z \sim N(0, 1)] = 1 - \Phi(1.645) = .05$$
More generally, the rejection region $z \ge z_\alpha$ has type I error probability $\alpha$. The test
procedure is upper-tailed because the rejection region consists only of large values
of the test statistic.
Analogous reasoning for the alternative hypothesis Ha: $\mu < \mu_0$ suggests a
rejection region of the form $z \le c$, where c is a suitably chosen negative number
($\bar{x}$ is far below $\mu_0$ if and only if z is quite negative). Because Z has a standard normal
distribution when H0 is true, taking $c = -z_\alpha$ yields P(type I error) $= \alpha$. This is a
lower-tailed test. For example, $z_{.10} = 1.28$ implies that the rejection region
$z \le -1.28$ specifies a test with significance level .10.
Finally, when the alternative hypothesis is Ha: $\mu \ne \mu_0$, H0 should be rejected
if $\bar{x}$ is too far to either side of $\mu_0$. This is equivalent to rejecting H0 either if $z \ge c$ or
if $z \le -c$. Suppose we desire $\alpha = .05$. Then,
$$.05 = P(Z \ge c \text{ or } Z \le -c \text{ when } Z \text{ has a standard normal distribution})$$
$$= \Phi(-c) + 1 - \Phi(c) = 2[1 - \Phi(c)]$$
Thus c is such that $1 - \Phi(c)$, the area under the standard normal curve to the right
of c, is .025 (and not .05!). From Section 4.3 or Appendix Table A.3, $c = 1.96$, and
the rejection region is $z \ge 1.96$ or $z \le -1.96$. For any $\alpha$, the two-tailed rejection
region $z \ge z_{\alpha/2}$ or $z \le -z_{\alpha/2}$ has type I error probability $\alpha$ (since area $\alpha/2$ is captured
under each of the two tails of the z curve). Again, the key reason for using the
standardized test statistic Z is that because Z has a known distribution when H0 is
true (standard normal), a rejection region with desired type I error probability is
easily obtained by using an appropriate critical value.
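The relationship between a cutoff and its type I error probability can be checked numerically. The following is a minimal sketch using Python's standard library; the function name `significance_level` and its `tail` argument are illustrative choices, not notation from the text.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: the null distribution of Z in Case I


def significance_level(c: float, tail: str) -> float:
    """Type I error probability for a z test with positive cutoff c.

    tail = "upper": reject when z >= c
    tail = "lower": reject when z <= -c
    tail = "two":   reject when z >= c or z <= -c
    """
    upper_area = 1.0 - Z.cdf(c)  # area under the z curve to the right of c
    if tail in ("upper", "lower"):
        return upper_area  # by symmetry the lower tail has the same area
    if tail == "two":
        return 2.0 * upper_area  # area alpha/2 in each tail
    raise ValueError("tail must be 'upper', 'lower', or 'two'")


print(round(significance_level(1.645, "upper"), 3))  # 0.05
print(round(significance_level(1.28, "lower"), 3))   # 0.1
print(round(significance_level(1.96, "two"), 3))     # 0.05
```

The two-tailed call makes the point of the paragraph above concrete: the cutoff 1.96 leaves area .025, not .05, in each tail.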
The test procedure for Case I is summarized in the accompanying box, and
the corresponding rejection regions are illustrated in Figure 9.2.
Null hypothesis: H0: $\mu = \mu_0$
Test statistic value: $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$

Alternative Hypothesis      Rejection Region for Level $\alpha$ Test
Ha: $\mu > \mu_0$           $z \ge z_\alpha$ (upper-tailed test)
Ha: $\mu < \mu_0$           $z \le -z_\alpha$ (lower-tailed test)
Ha: $\mu \ne \mu_0$         either $z \ge z_{\alpha/2}$ or $z \le -z_{\alpha/2}$ (two-tailed test)
Figure 9.2 Rejection regions for z tests: (a) upper-tailed test; (b) lower-tailed test; (c) two-tailed test
[Each panel shows the z curve (the probability distribution of the test statistic Z when H0 is true). Panel (a): rejection region $z \ge z_\alpha$, shaded area $= \alpha$ = P(type I error). Panel (b): rejection region $z \le -z_\alpha$, shaded area $= \alpha$ = P(type I error). Panel (c): rejection region either $z \ge z_{\alpha/2}$ or $z \le -z_{\alpha/2}$, shaded area $\alpha/2$ in each tail, total shaded area $= \alpha$ = P(type I error).]
Use of the following sequence of steps is recommended when testing hypotheses
about a parameter.
1. Identify the parameter of interest and describe it in the context of the problem
situation.
2. Determine the null value and state the null hypothesis.
3. State the appropriate alternative hypothesis.
4. Give the formula for the computed value of the test statistic (substituting the null
value and the known values of any other parameters, but not those of any
sample-based quantities).
5. State the rejection region for the selected significance level $\alpha$.
6. Compute any necessary sample quantities, substitute into the formula for the test
statistic value, and compute that value.
7. Decide whether H0 should be rejected and state this conclusion in the problem
context.
The formulation of hypotheses (steps 2 and 3) should be done before
examining the data.
Example 9.6
A manufacturer of sprinkler systems used for fire protection in office buildings claims
that the true average system-activation temperature is 130°F. A sample of $n = 9$
systems, when tested, yields a sample average activation temperature of 131.08°F.
If the distribution of activation temperatures is normal with standard deviation 1.5°F, does
the data contradict the manufacturer’s claim at significance level $\alpha = .01$?
1. Parameter of interest: $\mu$ = true average activation temperature.
2. Null hypothesis: H0: $\mu = 130$ (null value $= \mu_0 = 130$).
3. Alternative hypothesis: Ha: $\mu \ne 130$ (a departure from the claimed value in
either direction is of concern).
4. Test statistic value:
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{\bar{x} - 130}{1.5/\sqrt{n}}$$
5. Rejection region: The form of Ha implies use of a two-tailed test with rejection
region either $z \ge z_{.005}$ or $z \le -z_{.005}$. From Section 4.3 or Appendix Table A.3,
$z_{.005} = 2.58$, so we reject H0 if either $z \ge 2.58$ or $z \le -2.58$.
6. Substituting $n = 9$ and $\bar{x} = 131.08$,
$$z = \frac{131.08 - 130}{1.5/\sqrt{9}} = \frac{1.08}{.5} = 2.16$$
That is, the observed sample mean is a bit more than 2 standard deviations above
what would have been expected were H0 true.
7. The computed value $z = 2.16$ does not fall in the rejection region
($-2.58 < 2.16 < 2.58$), so H0 cannot be rejected at significance level .01. The
data does not give strong support to the claim that the true average differs from
the design value of 130.
■
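The Case I procedure and the sprinkler calculation can be sketched in a few lines of Python. This is only an illustration under the stated assumptions (normal population, known $\sigma$); the function name `z_test` and the labels "greater", "less", and "two-sided" are hypothetical conveniences, and the critical values come from the standard library rather than from Appendix Table A.3.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alpha, alternative):
    """Return (z, reject) for a level-alpha z test of H0: mu = mu0.

    alternative: "greater" (Ha: mu > mu0), "less" (Ha: mu < mu0),
    or "two-sided" (Ha: mu != mu0).
    """
    z = (xbar - mu0) / (sigma / sqrt(n))  # standardize under H0
    std_normal = NormalDist()
    if alternative == "greater":
        reject = z >= std_normal.inv_cdf(1 - alpha)        # z >= z_alpha
    elif alternative == "less":
        reject = z <= -std_normal.inv_cdf(1 - alpha)       # z <= -z_alpha
    elif alternative == "two-sided":
        reject = abs(z) >= std_normal.inv_cdf(1 - alpha / 2)  # |z| >= z_{alpha/2}
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, reject


# Sprinkler data of Example 9.6: n = 9, xbar = 131.08, sigma = 1.5, alpha = .01
z, reject = z_test(xbar=131.08, mu0=130, sigma=1.5, n=9,
                   alpha=0.01, alternative="two-sided")
print(f"z = {z:.2f}, reject H0: {reject}")  # z = 2.16, reject H0: False
```

The computed critical value is about 2.576 rather than the rounded table value 2.58, which does not change the decision here.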
Another view of the analysis in the previous example involves calculating a 99% CI
for $\mu$ based on Equation 8.5:
$$\bar{x} \pm 2.58\,\sigma/\sqrt{n} = 131.08 \pm 2.58(1.5/\sqrt{9}) = 131.08 \pm 1.29 = (129.79, 132.37)$$
Notice that the interval includes $\mu_0 = 130$, and it is not hard to see that the 99% CI
excludes $\mu_0$ if and only if the two-tailed hypothesis test rejects H0 at level .01.
In general, the $100(1 - \alpha)\%$ CI excludes $\mu_0$ if and only if the two-tailed hypothesis
test rejects H0 at level $\alpha$. Although we will not always call attention to it, this kind
of relationship between hypothesis tests and confidence intervals will occur over
and over in the remainder of the book. It should be intuitively reasonable that the
CI will exclude a value when the corresponding test rejects the value. There is a
similar relationship between lower-tailed tests and upper confidence bounds, and
also between upper-tailed tests and lower confidence bounds.
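The equivalence between the two-tailed test and the confidence interval can be checked for a few candidate null values. The sketch below reuses the sprinkler numbers; the helper names `z_interval` and `two_tailed_rejects` and the particular null values tried are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, sigma, n, conf_level):
    """Two-sided z confidence interval for mu when sigma is known."""
    z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    half_width = z_crit * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


def two_tailed_rejects(xbar, mu0, sigma, n, alpha):
    """True when the level-alpha two-tailed z test rejects H0: mu = mu0."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return abs(z) >= NormalDist().inv_cdf(1 - alpha / 2)


lo, hi = z_interval(xbar=131.08, sigma=1.5, n=9, conf_level=0.99)
print(round(lo, 2), round(hi, 2))  # 129.79 132.37

# The interval excludes mu0 exactly when the level .01 test rejects H0: mu = mu0.
for mu0 in (129.5, 130.0, 132.0, 132.5):
    excluded = not (lo < mu0 < hi)
    rejected = two_tailed_rejects(131.08, mu0, 1.5, 9, alpha=0.01)
    print(mu0, excluded, rejected)
```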
$\beta$ and Sample Size Determination The z tests for Case I are among the few in
statistics for which there are simple formulas available for $\beta$, the probability of a
type II error. Consider first the upper-tailed test with rejection region $z \ge z_\alpha$. This is
equivalent to $\bar{x} \ge \mu_0 + z_\alpha \sigma/\sqrt{n}$, so H0 will not be rejected if $\bar{x} < \mu_0 + z_\alpha \sigma/\sqrt{n}$.
Now let $\mu'$ denote a particular value of $\mu$ that exceeds the null value $\mu_0$. Then,
$$\beta(\mu') = P(H_0 \text{ is not rejected when } \mu = \mu')$$
$$= P\left(\bar{X} < \mu_0 + z_\alpha \sigma/\sqrt{n} \text{ when } \mu = \mu'\right)$$
$$= P\left(\frac{\bar{X} - \mu'}{\sigma/\sqrt{n}} < z_\alpha + \frac{\mu_0 - \mu'}{\sigma/\sqrt{n}} \text{ when } \mu = \mu'\right)$$
$$= \Phi\left(z_\alpha + \frac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right)$$
As $\mu'$ increases, $\mu_0 - \mu'$ becomes more negative, so $\beta(\mu')$ will be small when $\mu'$
greatly exceeds $\mu_0$ (because the value at which $\Phi$ is evaluated will then be quite
negative). Error probabilities for the lower-tailed and two-tailed tests are derived in
an analogous manner.
If $\sigma$ is large, the probability of a type II error can be large at an alternative
value $\mu'$ that is of particular concern to an investigator. Suppose we fix $\alpha$ and also
specify $\beta$ for such an alternative value. In the sprinkler example, company officials
might view $\mu' = 132$ as a very substantial departure from H0: $\mu = 130$ and
therefore wish $\beta(132) = .10$ in addition to $\alpha = .01$. More generally, consider the
two restrictions P(type I error) $= \alpha$ and $\beta(\mu') = \beta$ for specified $\alpha$, $\mu'$, and $\beta$. Then
for an upper-tailed test, the sample size n should be chosen to satisfy
$$\Phi\left(z_\alpha + \frac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right) = \beta$$
This implies that
$$-z_\beta = (z \text{ critical value that captures lower-tail area } \beta) = z_\alpha + \frac{\mu_0 - \mu'}{\sigma/\sqrt{n}}$$
It is easy to solve this equation for the desired n. A parallel argument yields the
necessary sample size for lower- and two-tailed tests as summarized in the next
box.
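For the upper-tailed case, the algebra behind "solve this equation for the desired n" can be written out as follows. This is a sketch of the intermediate steps only; it uses the same symbols as the equation above and assumes $\mu' > \mu_0$.

```latex
\begin{align*}
-z_\beta &= z_\alpha + \frac{\mu_0 - \mu'}{\sigma/\sqrt{n}} \\
\frac{\sqrt{n}\,(\mu_0 - \mu')}{\sigma} &= -(z_\alpha + z_\beta) \\
\sqrt{n} &= \frac{\sigma (z_\alpha + z_\beta)}{\mu' - \mu_0} \\
n &= \left[\frac{\sigma (z_\alpha + z_\beta)}{\mu_0 - \mu'}\right]^2
\end{align*}
```

Squaring removes the sign of $\mu_0 - \mu'$, which is why the same expression serves for both upper- and lower-tailed tests.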
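Alternative Hypothesis      Type II Error Probability $\beta(\mu')$ for a Level $\alpha$ Test
Ha: $\mu > \mu_0$           $\Phi\left(z_\alpha + \dfrac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right)$
Ha: $\mu < \mu_0$           $1 - \Phi\left(-z_\alpha + \dfrac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right)$
Ha: $\mu \ne \mu_0$         $\Phi\left(z_{\alpha/2} + \dfrac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right) - \Phi\left(-z_{\alpha/2} + \dfrac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right)$
where $\Phi(z)$ = the standard normal cdf.
The sample size n for which a level $\alpha$ test also has $\beta(\mu') = \beta$ at the alternative
value $\mu'$ is
$$n = \begin{cases} \left[\dfrac{\sigma(z_\alpha + z_\beta)}{\mu_0 - \mu'}\right]^2 & \text{for a one-tailed (upper or lower) test} \\[2ex] \left[\dfrac{\sigma(z_{\alpha/2} + z_\beta)}{\mu_0 - \mu'}\right]^2 & \text{for a two-tailed test (an approximate solution)} \end{cases}$$
Example 9.7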
Let $\mu$ denote the true average tread life of a type of tire. Consider testing H0:
$\mu = 30{,}000$ versus Ha: $\mu > 30{,}000$ based on a sample of size $n = 16$ from a normal
population distribution with $\sigma = 1500$. A test with $\alpha = .01$ requires $z_\alpha = z_{.01} = 2.33$.
The probability of making a type II error when $\mu = 31{,}000$ is
$$\beta(31{,}000) = \Phi\left(2.33 + \frac{30{,}000 - 31{,}000}{1500/\sqrt{16}}\right) = \Phi(-.34) = .3669$$
Since $z_{.1} = 1.28$, the requirement that the level .01 test also have $\beta(31{,}000) = .1$
necessitates
$$n = \left[\frac{1500(2.33 + 1.28)}{30{,}000 - 31{,}000}\right]^2 = (-5.42)^2 = 29.32$$
The sample size must be an integer, so $n = 30$ tires should be used.
■
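The two calculations in the tire example can be reproduced with a short Python sketch. The function names below are hypothetical; the code uses unrounded critical values from the standard library, so intermediate figures differ slightly from the hand computation with 2.33 and 1.28.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def beta_upper_tailed(mu0, mu_alt, sigma, n, alpha):
    """P(type II error) at mu = mu_alt for the level-alpha upper-tailed z test."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(z_alpha + (mu0 - mu_alt) / (sigma / sqrt(n)))


def sample_size_one_tailed(mu0, mu_alt, sigma, alpha, beta):
    """Smallest integer n for a level-alpha one-tailed z test with beta(mu_alt) <= beta."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(1 - beta)
    n_exact = (sigma * (z_alpha + z_beta) / (mu0 - mu_alt)) ** 2
    return ceil(n_exact)  # round up so the beta requirement is still met


# Tire data of Example 9.7
print(round(beta_upper_tailed(30_000, 31_000, 1500, 16, alpha=0.01), 2))   # 0.37
print(sample_size_one_tailed(30_000, 31_000, 1500, alpha=0.01, beta=0.10))  # 30
```

Rounding up rather than to the nearest integer is a deliberate choice: a smaller n would leave $\beta$ above the requested value.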
Case II: Large-Sample Tests
When the sample size is large, the z tests for Case I are easily modified to yield
valid test procedures without requiring either a normal population distribution or
known $\sigma$. The key result was used in Chapter 8 to justify large-sample confidence
intervals: A large n implies that the sample standard deviation s will be close to $\sigma$
for most samples, so that the standardized variable
$$Z = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
has approximately a standard normal distribution. Substitution of the null value
$\mu_0$ in place of $\mu$ yields the test statistic
$$Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$
which has approximately a standard normal distribution when H0 is true. The use of
rejection regions given previously for Case I (e.g., $z \ge z_\alpha$ when the alternative
hypothesis is Ha: $\mu > \mu_0$) then results in test procedures for which the significance
level is approximately (rather than exactly) $\alpha$. The rule of thumb $n > 40$ will again
be used to characterize a large sample size.
Example 9.8
A sample of bills for meals was obtained at a restaurant (by Erich Brandt). For each
of 70 bills the tip was found as a percentage of the raw bill (before taxes). Does it
appear that the population mean tip percentage for this restaurant exceeds the
standard 15%? Here are the 70 tip percentages:
14.21 19.12 29.87 13.46 11.48 15.23 21.53 20.24 20.37 17.92
16.79 13.96 16.09 12.76 20.10 15.29 19.74 19.03 21.58 19.19
18.07 14.94 18.39 22.73 19.19 11.94 11.91 14.11 15.69 27.55
14.56 19.23 19.02 18.21 15.86 15.04 16.01 15.16 12.39 17.73
15.37 20.67 12.04 10.94 16.09 16.89 20.07 16.31 15.66 20.16
13.52 16.42 18.93 40.09 16.03 18.54 17.85 17.42 19.07 13.56
19.88 48.77 27.88 16.35 14.48 13.74 17.70 22.79 12.31 13.81

[MINITAB graphical summary: histogram and boxplot of the tip percentages (horizontal axis marked at 15.0, 22.5, 30.0, 37.5, 45.0; outliers flagged with asterisks), with a plot of the 95% confidence intervals for the mean and the median]

Anderson-Darling Normality Test
  A-Squared       4.17
  P-Value <       0.005

Mean              17.986
StDev             5.937
Variance          35.247
Skewness          2.9391
Kurtosis          12.0154
N                 70

Minimum           10.940
1st Quartile      14.540
Median            16.840
3rd Quartile      19.358
Maximum           48.770

95% Confidence Interval for Mean      16.571   19.402
95% Confidence Interval for Median    15.913   18.402
95% Confidence Interval for StDev     5.090    7.124
Figure 9.3 MINITAB descriptive summary for the tip data of Example 9.8
Figure 9.3 shows a descriptive summary obtained from MINITAB. The sample mean
tip percentage is greater than 15. Notice that the distribution is positively skewed because there
are some very large tips (and a normal probability plot therefore does not exhibit a
linear pattern), but the large-sample z tests do not require a normal population
distribution.
1. $\mu$ = true average tip percentage
2. H0: $\mu = 15$
3. Ha: $\mu > 15$
4. $z = \dfrac{\bar{x} - 15}{s/\sqrt{n}}$
5. Using a test with a significance level .05, H0 will be rejected if $z \ge 1.645$ (an
upper-tailed test).
6. With $n = 70$, $\bar{x} = 17.99$, and $s = 5.937$,
$$z = \frac{17.99 - 15}{5.937/\sqrt{70}} = \frac{2.99}{.7096} = 4.21$$
7. Since 4.21 > 1.645, H0 is rejected. There is evidence that the population mean
tip percentage exceeds 15%.
■
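The large-sample test of Example 9.8 can also be run directly on the 70 tip percentages. The sketch below is an illustration with Python's standard library, not the MINITAB procedure; the variable names are arbitrary, and the data list assumes the 70 values as they appear in the example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# The 70 tip percentages of Example 9.8
tips = [
    14.21, 19.12, 29.87, 13.46, 11.48, 15.23, 21.53, 20.24, 20.37, 17.92,
    16.79, 13.96, 16.09, 12.76, 20.10, 15.29, 19.74, 19.03, 21.58, 19.19,
    18.07, 14.94, 18.39, 22.73, 19.19, 11.94, 11.91, 14.11, 15.69, 27.55,
    14.56, 19.23, 19.02, 18.21, 15.86, 15.04, 16.01, 15.16, 12.39, 17.73,
    15.37, 20.67, 12.04, 10.94, 16.09, 16.89, 20.07, 16.31, 15.66, 20.16,
    13.52, 16.42, 18.93, 40.09, 16.03, 18.54, 17.85, 17.42, 19.07, 13.56,
    19.88, 48.77, 27.88, 16.35, 14.48, 13.74, 17.70, 22.79, 12.31, 13.81,
]

mu0 = 15
n = len(tips)
xbar = mean(tips)
s = stdev(tips)  # sample standard deviation (divisor n - 1)
z = (xbar - mu0) / (s / sqrt(n))
z_crit = NormalDist().inv_cdf(0.95)  # z_.05, about 1.645

print(f"n = {n}, xbar = {xbar:.3f}, s = {s:.3f}")
print(f"z = {z:.2f}, reject H0 at level .05: {z >= z_crit}")
```

The printed mean and standard deviation should agree closely with the MINITAB summary (17.986 and 5.937), and z should come out near 4.21.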
Determination of $\beta$ and the necessary sample size for these large-sample tests
can be based either on specifying a plausible value of $\sigma$ and using the Case I
formulas (even though s is used in the test) or on using the methods to be introduced
shortly in connection with Case III.
Case III: A Normal Population Distribution
with Unknown $\sigma$
When n is small, the Central Limit Theorem (CLT) can no longer be invoked to
justify the use of a large-sample test. We faced this same difficulty in obtaining a
small-sample confidence interval (CI) for $\mu$ in Chapter 8. Our approach here will be
the same one used there: We will assume that the population distribution is at least
approximately normal and describe test procedures whose validity rests on this
assumption. If an investigator has good reason to believe that the population
distribution is quite nonnormal, a distribution-free test from Chapter 14 can be
used. Alternatively, a statistician can be consulted regarding procedures valid for
specific families of population distributions other than the normal family. Or a
bootstrap procedure can be developed.
The key result on which tests for a normal population mean are based was
used in Chapter 8 to derive the one-sample t CI: If $X_1, X_2, \ldots, X_n$ is a random
sample from a normal distribution, the standardized variable
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
has a t distribution with $n - 1$ degrees of freedom (df). Consider testing H0:
$\mu = \mu_0$ against Ha: $\mu > \mu_0$ by using the test statistic $(\bar{X} - \mu_0)/(S/\sqrt{n})$. That is,
the test statistic results from standardizing $\bar{X}$ under the assumption that H0 is true
(using $S/\sqrt{n}$, the estimated standard deviation of $\bar{X}$, rather than $\sigma/\sqrt{n}$). When H0 is
true, the test statistic has a t distribution with $n - 1$ df. Knowledge of the test
statistic’s distribution when H0 is true (the “null distribution”) allows us to construct
a rejection region for which the type I error probability is controlled at the
desired level. In particular, use of the upper-tail t critical value $t_{\alpha,n-1}$ to specify the
rejection region $t \ge t_{\alpha,n-1}$ implies that
$$P(\text{type I error}) = P(H_0 \text{ is rejected when it is true})$$
$$= P(T \ge t_{\alpha,n-1} \text{ when } T \text{ has a } t \text{ distribution with } n - 1 \text{ df}) = \alpha$$
The test statistic is really the same here as in the large-sample case but is
labeled T to emphasize that its null distribution is a t distribution with $n - 1$ df
rather than the standard normal (z) distribution. The rejection region for the t test
differs from that for the z test only in that a t critical value $t_{\alpha,n-1}$ replaces the
z critical value $z_\alpha$. Similar comments apply to alternatives for which a lower-tailed
or two-tailed test is appropriate.
THE ONE-SAMPLE t TEST
Null hypothesis: H0: $\mu = \mu_0$
Test statistic value: $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$

Alternative Hypothesis      Rejection Region for a Level $\alpha$ Test
Ha: $\mu > \mu_0$           $t \ge t_{\alpha,n-1}$ (upper-tailed)
Ha: $\mu < \mu_0$           $t \le -t_{\alpha,n-1}$ (lower-tailed)
Ha: $\mu \ne \mu_0$         either $t \ge t_{\alpha/2,n-1}$ or $t \le -t_{\alpha/2,n-1}$ (two-tailed)
Example 9.9
A well-designed and safe workplace can contribute greatly to increased productivity. It is especially important that workers not be asked to perform tasks, such as
lifting, that exceed their capabilities. The accompanying data on maximum weight
of lift (MAWL, in kg) for a frequency of four lifts/min was reported in the article
“The Effects of Speed, Frequency, and Load on Measured Hand Forces for a
Floor-to-Knuckle Lifting Task” (Ergonomics, 1992: 833–843); subjects were
randomly selected from the population of healthy males age 18–30. Assuming
that MAWL is normally distributed, does the following data suggest that the
population mean MAWL exceeds 25?
25.8 36.6 26.3 21.8 27.2
Let’s carry out a test using a significance level of .05.
1. $\mu$ = population mean MAWL
2. H0: $\mu = 25$
3. Ha: $\mu > 25$
4. $t = \dfrac{\bar{x} - 25}{s/\sqrt{n}}$
5. Reject H0 if $t \ge t_{\alpha,n-1} = t_{.05,4} = 2.132$.
6. $\Sigma x_i = 137.7$ and $\Sigma x_i^2 = 3911.97$, from which $\bar{x} = 27.54$, $s = 5.47$, and
$$t = \frac{27.54 - 25}{5.47/\sqrt{5}} = \frac{2.54}{2.45} = 1.04$$
The accompanying MINITAB output from a request for a one-sample t test has
the same calculated values (the P-value is discussed in Section 9.4).
Test of mu = 25.00 vs mu > 25.00

Variable   N   Mean    StDev   SE Mean   T      P-Value
mawl       5   27.54   5.47    2.45      1.04   0.18
7. Since 1.04 does not fall in the rejection region (1.04 < 2.132), H0 cannot be
rejected at significance level .05. It is still plausible that $\mu$ is (at most) 25. ■
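The arithmetic of Example 9.9 is easy to reproduce from the five observations. The following sketch uses only the standard library, so the critical value $t_{.05,4} = 2.132$ is taken from the example rather than computed; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

mawl = [25.8, 36.6, 26.3, 21.8, 27.2]  # MAWL observations (kg) from Example 9.9
mu0 = 25
t_crit = 2.132  # t_{.05,4}, the critical value quoted in the example

n = len(mawl)
xbar = mean(mawl)
s = stdev(mawl)                    # sample standard deviation (divisor n - 1)
se = s / sqrt(n)                   # estimated standard deviation of the sample mean
t = (xbar - mu0) / se

print(f"xbar = {xbar:.2f}, s = {s:.2f}, SE = {se:.2f}, t = {t:.2f}")
# xbar = 27.54, s = 5.47, SE = 2.45, t = 1.04
print("reject H0:", t >= t_crit)   # reject H0: False
```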
$\beta$ and Sample Size Determination The calculation of $\beta$ at the alternative value $\mu'$
in Case I was carried out by expressing the rejection region in terms of $\bar{x}$ (e.g.,
$\bar{x} \ge \mu_0 + z_\alpha \sigma/\sqrt{n}$) and then subtracting $\mu'$ to standardize correctly. An equivalent
approach involves noting that when $\mu = \mu'$, the test statistic $Z = (\bar{X} - \mu_0)/(\sigma/\sqrt{n})$
still has a normal distribution with variance 1, but now the mean value of Z is
given by $(\mu' - \mu_0)/(\sigma/\sqrt{n})$. That is, when $\mu = \mu'$, the test statistic still has a
normal distribution though not the standard normal distribution. Because of this,
$\beta(\mu')$ is an area under the normal curve corresponding to mean value
$(\mu' - \mu_0)/(\sigma/\sqrt{n})$ and variance 1. Both $\alpha$ and $\beta$ involve working with normally
distributed variables.
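The two routes to $\beta(\mu')$ described above give the same number, which a small Python check makes visible. This is a sketch for the upper-tailed Case I test only; the function names are hypothetical, and the tire values of Example 9.7 serve as the test case.

```python
from math import sqrt
from statistics import NormalDist


def beta_via_phi(mu0, mu_alt, sigma, n, alpha):
    """beta(mu') from the formula Phi(z_alpha + (mu0 - mu') / (sigma / sqrt(n)))."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(z_alpha + (mu0 - mu_alt) / (sigma / sqrt(n)))


def beta_via_shifted_z(mu0, mu_alt, sigma, n, alpha):
    """beta(mu') as P(Z < z_alpha) when Z is normal with a shifted mean and variance 1."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    shift = (mu_alt - mu0) / (sigma / sqrt(n))   # mean of Z when mu = mu'
    z_under_alternative = NormalDist(mu=shift, sigma=1.0)
    return z_under_alternative.cdf(z_alpha)


args = dict(mu0=30_000, mu_alt=31_000, sigma=1500, n=16, alpha=0.01)
print(round(beta_via_phi(**args), 4), round(beta_via_shifted_z(**args), 4))
```

No such shortcut is available for the t test, because the statistic T does not simply shift when H0 is false.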
The calculation of $\beta(\mu')$ for the t test is much less straightforward. This
is because the distribution of the test statistic $T = (\bar{X} - \mu_0)/(S/\sqrt{n})$ is quite
complicated when H0 is false and Ha is true. Thus, for an upper-tailed test,
determining
$$\beta(\mu') = P(T < t_{\alpha,n-1} \text{ when } \mu = \mu' \text{ rather than } \mu_0)$$
involves integrating a very unpleasant density function. This must be done numerically,
but fortunately it has been done by research statisticians for both one- and
two-tailed t tests. The results are summarized in graphs of $\beta$ that appear in
Appendix Table A.16. There are four sets of graphs, corresponding to one-tailed
tests at level .05 and level .01 and two-tailed tests at the same levels.
To understand how these graphs are used, note first that both $\beta$ and the
necessary sample size n in Case I are functions not just of the absolute difference
$|\mu_0 - \mu'|$ but of $d = |\mu_0 - \mu'|/\sigma$. Suppose, for example, that $|\mu_0 - \mu'| = 10$.
This departure from H0 will be much easier to detect (smaller $\beta$) when $\sigma = 2$,
in which case $\mu_0$ and $\mu'$ are 5 population standard deviations apart, than when
$\sigma = 10$. The fact that $\beta$ for the t test depends on d rather than just $|\mu_0 - \mu'|$ is
unfortunate, since to use the graphs one must have some idea of the true value of $\sigma$.
A conservative (large) guess for $\sigma$ will yield a conservative (large) value of
$\beta(\mu')$ and a conservative estimate of the sample size necessary for prescribed
$\alpha$ and $\beta(\mu')$.
Once the alternative $\mu'$ and value of $\sigma$ are selected, d is calculated and its
value located on the horizontal axis of the relevant set of curves. The value of $\beta$
is the height of the $n - 1$ df curve above the value of d (visual interpolation is
necessary if $n - 1$ is not a value for which the corresponding curve appears), as
illustrated in Figure 9.4.
Figure 9.4 A typical $\beta$ curve for the t test
[Vertical axis: $\beta$, from 0 to 1; horizontal axis: d. The height of the $\beta$ curve for $n - 1$ df above the value of d corresponding to the specified alternative $\mu'$ gives $\beta$ when $\mu = \mu'$.]
Rather than fixing n (i.e., $n - 1$, and thus the particular curve from which $\beta$ is
read), one might prescribe both $\alpha$ (.05 or .01 here) and a value of $\beta$ for the chosen $\mu'$
and $\sigma$. After computing d, the point (d, $\beta$) is located on the relevant set of graphs.
The curve below and closest to this point gives $n - 1$ and thus n (again, interpolation is often necessary).
Example 9.10
The true average voltage drop from collector to emitter of insulated gate bipolar
transistors of a certain type is supposed to be at most 2.5 V. An investigator selects a
sample of $n = 10$ such transistors and uses the resulting voltages as a basis for testing
H0: $\mu = 2.5$ versus Ha: $\mu > 2.5$ using a t test with significance level $\alpha = .05$. If the
standard deviation of the voltage distribution is $\sigma = .100$, how likely is it that H0 will
not be rejected when $\mu = 2.6$? With $d = |2.5 - 2.6|/.100 = 1.0$, the point on the $\beta$
curve at 9 df for a one-tailed test with $\alpha = .05$ above 1.0 has height approximately .1,
so $\beta \approx .1$. The investigator might think that this is too large a value of $\beta$ for such a
substantial departure from H0 and may wish to have $\beta = .05$ for this alternative value
of $\mu$. Since $d = 1.0$, the point $(d, \beta) = (1.0, .05)$ must be located. This point is very
close to the 14 df curve, so using $n = 15$ will give both $\alpha = .05$ and $\beta = .05$ when
the value of $\mu$ is 2.6 and $\sigma = .10$. A larger value of $\sigma$ would give a larger $\beta$ for this
alternative, and an alternative value of $\mu$ closer to 2.5 would also result in an
increased value of $\beta$.
■
Most of the widely used statistical computer packages will also calculate
type II error probabilities and determine necessary sample sizes. As an example, we
asked MINITAB to do the calculations from Example 9.10. Its computations are
based on power, which is simply $1 - \beta$. We want $\beta$ to be small, which is equivalent
to asking that the power of the test be large. For example, $\beta = .05$ corresponds to a
value of .95 for power. Here is the resulting MINITAB output.
Power and Sample Size
Testing mean = null (versus > null)
Calculating power for mean = null + 0.1
Alpha = 0.05  Sigma = 0.1

Sample
Size     Power
10       0.8975

Power and Sample Size
1-Sample t Test
Testing mean = null (versus > null)
Calculating power for mean = null + 0.1
Alpha = 0.05  Sigma = 0.1

Sample   Target   Actual
Size     Power    Power
13       0.9500   0.9597
Notice from the second part of the output that the sample size necessary to obtain a
power of .95 ($\beta = .05$) for an upper-tailed test with $\alpha = .05$ when $\sigma = .1$ and $\mu'$ is
.1 larger than $\mu_0$ is only $n = 13$, whereas eyeballing our $\beta$ curves gave 15. When
available, this type of software is more trustworthy than the curves.
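Where no power routine is at hand, the power of the t test can be approximated by simulation. The sketch below is a Monte Carlo estimate, which is a different technique from the numerical integration behind the $\beta$ curves or MINITAB's calculation; the function name, the number of replications, and the seed are arbitrary assumptions, and the critical value $t_{.05,9} = 1.833$ is a standard table value.

```python
import random
from math import sqrt


def simulated_power(mu0, mu_alt, sigma, n, t_crit, reps=20_000, seed=1):
    """Estimate the power of the upper-tailed one-sample t test at mu = mu_alt.

    Draws `reps` normal samples of size n with mean mu_alt and standard
    deviation sigma, and returns the fraction for which t >= t_crit.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(mu_alt, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        t = (xbar - mu0) / (s / sqrt(n))
        if t >= t_crit:
            rejections += 1
    return rejections / reps


# Setting of Example 9.10: n = 10, alpha = .05 (t_.05,9 = 1.833), sigma = .1, mu' = 2.6
power = simulated_power(mu0=2.5, mu_alt=2.6, sigma=0.1, n=10, t_crit=1.833)
print(f"estimated power = {power:.3f}, estimated beta = {1 - power:.3f}")
```

With 20,000 replications the estimate carries a standard error of roughly .002, so it should fall close to the MINITAB power of .8975 without matching it exactly.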
Exercises Section 9.2 (15–35)
15. Let the test statistic Z have a standard normal
distribution when H0 is true. Give the significance
level for each of the following situations:
a. Ha: $\mu > \mu_0$, rejection region $z \ge 1.88$
b. Ha: $\mu < \mu_0$, rejection region $z \le -2.75$
c. Ha: $\mu \ne \mu_0$, rejection region $z \ge 2.88$ or $z \le -2.88$
16. Let the test statistic T have a t distribution when
H0 is true. Give the significance level for each of
the following situations:
a. Ha: $\mu > \mu_0$, df $= 15$, rejection region
$t \ge 3.733$
b. Ha: $\mu < \mu_0$, $n = 24$, rejection region
$t \le -2.500$
c. Ha: $\mu \ne \mu_0$, $n = 31$, rejection region $t \ge 1.697$
or $t \le -1.697$
17. Answer the following questions for the tire problem in Example 9.7.
a. If $\bar{x} = 30{,}960$ and a level $\alpha = .01$ test is used,
what is the decision?
b. If a level .01 test is used, what is $\beta(30{,}500)$?
c. If a level .01 test is used and it is also required
that $\beta(30{,}500) = .05$, what sample size n is
necessary?
d. If $\bar{x} = 30{,}960$, what is the smallest $\alpha$ at which
H0 can be rejected (based on $n = 16$)?
18. Reconsider the paint-drying situation of Example
9.2, in which drying time for a test specimen is
normally distributed with $\sigma = 9$. The hypotheses
H0: $\mu = 75$ versus Ha: $\mu < 75$ are to be tested
using a random sample of $n = 25$ observations.
a. How many standard deviations (of $\bar{X}$) below
the null value is $\bar{x} = 72.3$?
b. If $\bar{x} = 72.3$, what is the conclusion using
$\alpha = .01$?
c. What is $\alpha$ for the test procedure that rejects H0
when $z \le -2.88$?
d. For the test procedure of part (c), what is
$\beta(70)$?
e. If the test procedure of part (c) is used, what n
is necessary to ensure that $\beta(70) = .01$?
f. If a level .01 test is used with $n = 100$, what is
the probability of a type I error when $\mu = 76$?
19. The melting point of each of 16 samples of a brand
of hydrogenated vegetable oil was determined,
resulting in $\bar{x} = 94.32$. Assume that the distribution
of melting point is normal with $\sigma = 1.20$.
a. Test H0: $\mu = 95$ versus Ha: $\mu \ne 95$ using a
two-tailed level .01 test.
b. If a level .01 test is used, what is $\beta(94)$, the
probability of a type II error when $\mu = 94$?
c. What value of n is necessary to ensure that
$\beta(94) = .1$ when $\alpha = .01$?
20. Lightbulbs of a certain type are advertised as having
an average lifetime of 750 h. The price of these
bulbs is very favorable, so a potential customer
has decided to go ahead with a purchase arrangement unless it can be conclusively demonstrated
that the true average lifetime is smaller than what
is advertised. A random sample of 50 bulbs was
selected, the lifetime of each bulb determined, and
the appropriate hypotheses were tested using MINITAB, resulting in the accompanying output.
Variable   N    Mean     StDev   SEMean   Z       P-Value
lifetime   50   738.44   38.20   5.40     -2.14   0.016
What conclusion would be appropriate for a
significance level of .05? A significance level
of .01? What significance level and conclusion
would you recommend?
21. The true average diameter of ball bearings of a
certain type is supposed to be .5 in. A one-sample
t test will be carried out to see whether this is the
case. What conclusion is appropriate in each of
the following situations?
a. $n = 13$, $t = 1.6$, $\alpha = .05$
b. $n = 13$, $t = -1.6$, $\alpha = .05$
c. $n = 25$, $t = 2.6$, $\alpha = .01$
d. $n = 25$, $t = 3.9$
22. The article “The Foreman’s View of Quality Control” (Quality Engrg., 1990: 257–280) described
an investigation into the coating weights for large
pipes resulting from a galvanized coating process. Production standards call for a true average
weight of 200 lb per pipe. The accompanying
descriptive summary and boxplot are from
MINITAB.
Variable   N    Mean     Median   TrMean   StDev   SEMean
ctg wt     30   206.73   206.00   206.81   6.35    1.16

Variable   Min      Max      Q1       Q3
ctg wt     193.00   218.00   202.75   212.00

[Boxplot of coating weight; horizontal axis marked at 190, 200, 210, 220]
a. What does the boxplot suggest about the status
of the specification for true average coating
weight?
b. A normal probability plot of the data was quite
straight. Use the descriptive output to test the
appropriate hypotheses.
23. Exercise 33 in Chapter 1 gave $n = 26$ observations
on escape time (sec) for oil workers in a simulated
exercise, from which the sample mean and sample
standard deviation are 370.69 and 24.36, respectively. Suppose the investigators had believed a
priori that true average escape time would be at
most 6 min. Does the data contradict this prior
belief? Assuming normality, test the appropriate
hypotheses using a significance level of .05.
24. Reconsider the sample observations on stabilized
viscosity of asphalt specimens introduced in
Exercise 43 in Chapter 1 (2781, 2900, 3013,
2856, and 2888). Suppose that for a particular
application, it is required that true average viscosity
be 3000. Does this requirement appear to have been
satisfied? State and test the appropriate hypotheses.
25. Recall the first-grade IQ scores of Example 1.2.
Here is a random sample of 10 of those scores:
107 113 108 127 146 103 108 118 111 119
The IQ test score has approximately a normal
distribution with mean 100 and standard deviation
15 for the entire U.S. population of first-graders.
Here we are interested in seeing whether the population of first-graders at this school is different
from the national population. Assume that the
normal distribution with standard deviation 15 is
valid for the school, and test at the .05 level to see
whether the school mean differs from the national
mean. Summarize your conclusion in a sentence
about these first-graders.
26. In recent years major league baseball games have
averaged 3 h in duration. However, because games
in Denver tend to be high-scoring, it might be
expected that the games would be longer there.
In 2001, the 81 games in Denver averaged
185.54 min with standard deviation 24.6 min.
What would you conclude?
27. On the label, Pepperidge Farm bagels are said to
weigh four ounces each (113 g). A random sample
of six bagels resulted in the following weights (in
grams):
117.6  109.5  111.6  109.2  119.1  110.8
a. Based on this sample, is there any reason to
doubt that the population mean is at least 113 g?
b. Assume that the population mean is actually
110 g and that the distribution is normal with
standard deviation 4 g. In a z test of H0:
$\mu = 113$ against Ha: $\mu < 113$ with $\alpha = .05$,
find the probability of rejecting H0 with six
observations.
c. Under the conditions of part (b) with $\alpha = .05$,
how many more observations would be needed
in order for the power to be at least .95?
28. Minor surgery on horses under field conditions
requires a reliable short-term anesthetic producing
good muscle relaxation, minimal cardiovascular
and respiratory changes, and a quick, smooth
recovery with minimal aftereffects so that horses
can be left unattended. The article “A Field Trial
of Ketamine Anesthesia in the Horse” (Equine
Vet. J., 1984: 176–179) reports that for a sample
of $n = 73$ horses to which ketamine was administered under certain conditions, the sample average
lateral recumbency (lying-down) time was
18.86 min and the standard deviation was
8.6 min. Does this data suggest that true average
lateral recumbency time under these conditions is
less than 20 min? Test the appropriate hypotheses
at level of significance .10.
29. The amount of shaft wear (.0001 in.) after a fixed
mileage was determined for each of $n = 8$ internal
combustion engines having copper lead as a bearing material, resulting in $\bar{x} = 3.72$ and $s = 1.25$.
a. Assuming that the distribution of shaft wear is
normal with mean $\mu$, use the t test at level .05 to
test H0: $\mu = 3.50$ versus Ha: $\mu > 3.50$.
b. Using $\sigma = 1.25$, what is the type II error probability $\beta(\mu')$ of the test for the alternative
$\mu' = 4.00$?
30. The recommended daily dietary allowance for zinc
among males older than age 50 years is 15 mg/day.
The article “Nutrient Intakes and Dietary Patterns
of Older Americans: A National Study” (J. Gerontol., 1992: M145–150) reports the following summary data on intake for a sample of males age
65–74 years: $n = 115$, $\bar{x} = 11.3$, and $s = 6.43$.
Does this data indicate that average daily zinc
intake in the population of all males age 65–74
falls below the recommended allowance?
31. In an experiment designed to measure the time
necessary for an inspector’s eyes to become used
to the reduced amount of light necessary for penetrant inspection, the sample average time for
$n = 9$ inspectors was 6.32 s and the sample standard deviation was 1.65 s. It has previously been
assumed that the average adaptation time was at
least 7 s. Assuming adaptation time to be normally
distributed, does the data contradict prior belief?
Use the t test with $\alpha = .1$.
32. A sample of 12 radon detectors of a certain
type was selected, and each was exposed to
100 pCi/L of radon. The resulting readings were
as follows:
105.6  100.1  90.9  105.0  91.2  99.6
96.9  107.7  96.5  103.3  91.3  92.4
a. Does this data suggest that the population mean
reading under these conditions differs from
100? State and test the appropriate hypotheses
using $\alpha = .05$.
b. Suppose that prior to the experiment, a value of
$\sigma = 7.5$ had been assumed. How many determinations would then have been appropriate to
obtain $\beta = .10$ for the alternative $\mu = 95$?
33. Show that for any $\Delta > 0$, when the population
distribution is normal and $\sigma$ is known, the two-tailed test satisfies $\beta(\mu_0 - \Delta) = \beta(\mu_0 + \Delta)$, so
that $\beta(\mu')$ is symmetric about $\mu_0$.
34. For a fixed alternative value $\mu'$, show that
$\beta(\mu') \to 0$ as $n \to \infty$ for either a one-tailed or a
two-tailed z test in the case of a normal population
distribution with known $\sigma$.
35. The industry standard for the amount of alcohol
poured into many types of drinks (e.g., gin for a
gin and tonic, whiskey on the rocks) is 1.5 oz.
Each individual in a sample of 8 bartenders with
at least 5 years of experience was asked to pour
rum for a rum and coke into a short, wide (tumbler) glass, resulting in the following data:
2.00 1.78 2.16 1.91 1.70 1.67 1.83 1.48
(Summary quantities agree with those given in the
article “Bottoms Up! The Influence of Elongation on
Pouring and Consumption Volume,” J. Consumer
Res., 2003: 455–463.)
a. What does a boxplot suggest about the distribution of the amount poured?
b. Carry out a test of hypotheses to decide
whether there is strong evidence for concluding that the true average amount poured differs
from the industry standard.
c. Does the validity of the test you carried out in
(b) depend on any assumptions about the population distribution? If so, check the plausibility of such assumptions.
d. Suppose the actual standard deviation of the
amount poured is .20 oz. Determine the probability of a type II error for the test of (b) when
the true average amount poured is actually
(1) 1.6, (2) 1.7, (3) 1.8.
9.3 Tests Concerning a Population Proportion
Let p denote the proportion of individuals or objects in a population who possess
a specified property (e.g., cars with manual transmissions or smokers who smoke a
filter cigarette). If an individual or object with the property is labeled a success (S),
then p is the population proportion of successes. Tests concerning p will be based
on a random sample of size n from the population. Provided that n is small relative
to the population size, X (the number of S’s in the sample) has (approximately) a
binomial distribution. Furthermore, if n itself is large, both X and the estimator
$\hat{p} = X/n$ are approximately normally distributed. We first consider large-sample
tests based on this latter fact and then turn to the small-sample case that directly
uses the binomial distribution.
Large-Sample Tests
Large-sample tests concerning $p$ are a special case of the more general large-sample
procedures for a parameter $\theta$. Let $\hat{\theta}$ be an estimator of $\theta$ that is (at least approximately) unbiased and has approximately a normal distribution. The null hypothesis
has the form H0: $\theta = \theta_0$, where $\theta_0$ denotes a number (the null value) appropriate
to the problem context. Suppose that when H0 is true, the standard deviation of
$\hat{\theta}$, $\sigma_{\hat{\theta}}$, involves no unknown parameters. For example, if $\theta = \mu$ and $\hat{\theta} = \bar{X}$,
$\sigma_{\hat{\theta}} = \sigma_{\bar{X}} = \sigma/\sqrt{n}$, which involves no unknown parameters only if the value of $\sigma$
is known. A large-sample test statistic results from standardizing $\hat{\theta}$ under the
assumption that H0 is true [so that $E(\hat{\theta}) = \theta_0$]:

$$\text{Test statistic:}\quad Z = \frac{\hat{\theta} - \theta_0}{\sigma_{\hat{\theta}}}$$

If the alternative hypothesis is Ha: $\theta > \theta_0$, an upper-tailed test whose significance
level is approximately $\alpha$ is specified by the rejection region $z \ge z_\alpha$. The other two
alternatives, Ha: $\theta < \theta_0$ and Ha: $\theta \ne \theta_0$, are tested using a lower-tailed z test and a
two-tailed z test, respectively.
In the case $\theta = p$, $\sigma_{\hat{\theta}}$ will not involve any unknown parameters when H0 is
true, but this is atypical. When $\sigma_{\hat{\theta}}$ does involve unknown parameters, it is often
possible to use an estimated standard deviation $S_{\hat{\theta}}$ in place of $\sigma_{\hat{\theta}}$ and still have Z
approximately normally distributed when H0 is true (because when n is large,
$s_{\hat{\theta}} \approx \sigma_{\hat{\theta}}$ for most samples). The large-sample test of the previous section furnishes
an example of this: Because $\sigma$ is usually unknown, we use $s_{\hat{\theta}} = s_{\bar{X}} = s/\sqrt{n}$ in place
of $\sigma/\sqrt{n}$ in the denominator of z.
The estimator $\hat{p} = X/n$ is unbiased [$E(\hat{p}) = p$], has approximately a normal
distribution, and its standard deviation is $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$. These facts were
used in Section 8.2 to obtain a confidence interval for $p$. When H0 is true, $E(\hat{p}) = p_0$
and $\sigma_{\hat{p}} = \sqrt{p_0(1-p_0)/n}$, so $\sigma_{\hat{p}}$ does not involve any unknown parameters. It then
follows that when n is large and H0 is true, the test statistic

$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$

has approximately a standard normal distribution. If the alternative hypothesis is
Ha: $p > p_0$ and the upper-tailed rejection region $z \ge z_\alpha$ is used, then

$$\begin{aligned}
P(\text{type I error}) &= P(H_0 \text{ is rejected when it is true}) \\
&= P(Z \ge z_\alpha \text{ when } Z \text{ has approximately a standard normal distribution}) \approx \alpha
\end{aligned}$$

Thus the desired level of significance $\alpha$ is attained by using the critical value that
captures area $\alpha$ in the upper tail of the z curve. Rejection regions for the other two
alternative hypotheses, lower-tailed for Ha: $p < p_0$ and two-tailed for Ha: $p \ne p_0$,
are justified in an analogous manner.
Null hypothesis: H0: $p = p_0$

Test statistic value: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$

Alternative Hypothesis     Rejection Region
Ha: $p > p_0$              $z \ge z_\alpha$ (upper-tailed)
Ha: $p < p_0$              $z \le -z_\alpha$ (lower-tailed)
Ha: $p \ne p_0$            either $z \ge z_{\alpha/2}$ or $z \le -z_{\alpha/2}$ (two-tailed)

These test procedures are valid provided that $np_0 \ge 10$ and $n(1 - p_0) \ge 10$.
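As a rough illustration of the boxed procedure, the following Python sketch computes the test statistic value and applies the appropriate rejection region. The function name `one_proportion_z_test` and its argument names are illustrative choices, not part of the text; the standard normal critical values come from the standard library's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(x, n, p0, alpha, alternative="greater"):
    """Large-sample z test of H0: p = p0 (hypothetical helper).

    alternative is "greater" (Ha: p > p0), "less" (Ha: p < p0),
    or "two-sided" (Ha: p != p0). Returns (z, reject).
    """
    # Validity condition from the box: n*p0 >= 10 and n*(1 - p0) >= 10.
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("large-sample test not appropriate for this n and p0")

    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

    std_normal = NormalDist()
    if alternative == "greater":
        reject = z >= std_normal.inv_cdf(1 - alpha)           # z >= z_alpha
    elif alternative == "less":
        reject = z <= -std_normal.inv_cdf(1 - alpha)          # z <= -z_alpha
    elif alternative == "two-sided":
        reject = abs(z) >= std_normal.inv_cdf(1 - alpha / 2)  # |z| >= z_{alpha/2}
    else:
        raise ValueError("unknown alternative")
    return z, reject
```

For instance, `one_proportion_z_test(1276, 4115, 0.30, 0.10)` applies the upper-tailed test to 1276 successes in 4115 trials with null value .30. Because the code works with unrounded quantities, its z may differ slightly from a hand calculation that rounds intermediate values.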
Example 9.11
Recent information suggests that obesity is an increasing problem in America
among all age groups. The Associated Press (Oct. 9, 2002) reported that 1276
individuals in a sample of 4115 adults were found to be obese (a body mass index
exceeding 30; this index is a measure of weight relative to height). A 1998 survey
based on people’s own assessment revealed that 20% of adult Americans considered themselves obese. Does the recent data suggest that the true proportion of
adults who are obese is more than 1.5 times the percentage from the self-assessment
survey? Let’s carry out a test of hypotheses using a significance level of .10.
1. $p$ = the proportion of all American adults who are obese.
2. Saying that the current percentage is 1.5 times the self-assessment percentage is
equivalent to the assertion that the current percentage is 30%, from which we
have the null hypothesis as H0: $p = .30$.
3. The phrase “more than” in the problem description implies that the alternative
hypothesis is Ha: p > .30.
4. Since $np_0 = 4115(.3) \ge 10$ and $nq_0 = 4115(.7) \ge 10$ (where $q_0 = 1 - p_0$), the large-sample z test
can certainly be used. The test statistic value is

$$z = \frac{\hat{p} - .3}{\sqrt{(.3)(.7)/n}}$$

5. The form of Ha implies that an upper-tailed test is appropriate: Reject H0
if $z \ge z_{.10} = 1.28$.
6. $\hat{p} = 1276/4115 = .310$, from which

$$z = \frac{.310 - .3}{\sqrt{(.3)(.7)/4115}} = \frac{.010}{.0071} = 1.40$$
7. Since 1.40 exceeds the critical value 1.28, z lies in the rejection region. This
justifies rejecting the null hypothesis. Using a significance level of .10, it does
appear that more than 30% of American adults are obese.
■
$\beta$ and Sample Size Determination When H0 is true, the test statistic Z has
approximately a standard normal distribution. Now suppose that H0 is not true
and that $p = p'$. Then Z still has approximately a normal distribution (because it is a
linear function of $\hat{p}$), but its mean value and variance are no longer 0 and 1,
respectively. Instead,

$$E(Z) = \frac{p' - p_0}{\sqrt{p_0(1-p_0)/n}} \qquad V(Z) = \frac{p'(1-p')/n}{p_0(1-p_0)/n}$$

The probability of a type II error for an upper-tailed test is $\beta(p') = P(Z < z_\alpha \text{ when } p = p')$.
This can be computed by using the given mean and variance to standardize
and then referring to the standard normal cdf. In addition, if it is desired that
the level $\alpha$ test also have $\beta(p') = \beta$ for a specified value of $\beta$, this equation can
be solved for the necessary n as in Section 9.2. General expressions for $\beta(p')$ and n
are given in the accompanying box.
Alternative Hypothesis and $\beta(p')$

Ha: $p > p_0$

$$\beta(p') = \Phi\left[\frac{p_0 - p' + z_\alpha\sqrt{p_0(1-p_0)/n}}{\sqrt{p'(1-p')/n}}\right]$$

Ha: $p < p_0$

$$\beta(p') = 1 - \Phi\left[\frac{p_0 - p' - z_\alpha\sqrt{p_0(1-p_0)/n}}{\sqrt{p'(1-p')/n}}\right]$$

Ha: $p \ne p_0$

$$\beta(p') = \Phi\left[\frac{p_0 - p' + z_{\alpha/2}\sqrt{p_0(1-p_0)/n}}{\sqrt{p'(1-p')/n}}\right] - \Phi\left[\frac{p_0 - p' - z_{\alpha/2}\sqrt{p_0(1-p_0)/n}}{\sqrt{p'(1-p')/n}}\right]$$

The sample size n for which the level $\alpha$ test also satisfies $\beta(p') = \beta$ is

$$n = \begin{cases}
\left[\dfrac{z_\alpha\sqrt{p_0(1-p_0)} + z_\beta\sqrt{p'(1-p')}}{p' - p_0}\right]^2 & \text{one-tailed test} \\[2ex]
\left[\dfrac{z_{\alpha/2}\sqrt{p_0(1-p_0)} + z_\beta\sqrt{p'(1-p')}}{p' - p_0}\right]^2 & \text{two-tailed test (an approximate solution)}
\end{cases}$$
Example 9.12
A package-delivery service advertises that at least 90% of all packages brought to its
office by 9 a.m. for delivery in the same city are delivered by noon that day. Let
p denote the true proportion of such packages that are delivered as advertised and
consider the hypotheses H0: $p = .9$ versus Ha: $p < .9$. If only 80% of the packages
are delivered as advertised, how likely is it that a level .01 test based on $n = 225$
packages will detect such a departure from H0? What should the sample size be to
ensure that $\beta(.8) = .01$? With $\alpha = .01$, $p_0 = .9$, $p' = .8$, and $n = 225$,

$$\beta(.8) = 1 - \Phi\left[\frac{.9 - .8 - 2.33\sqrt{(.9)(.1)/225}}{\sqrt{(.8)(.2)/225}}\right] = 1 - \Phi(2.00) = .0228$$

Thus the probability that H0 will be rejected using the test when $p = .8$ is .9772;
roughly 98% of all samples will result in correct rejection of H0.
Using $z_\alpha = z_\beta = 2.33$ in the sample size formula yields

$$n = \left[\frac{2.33\sqrt{(.9)(.1)} + 2.33\sqrt{(.8)(.2)}}{.8 - .9}\right]^2 \approx 266$$
■
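The boxed expressions for $\beta(p')$ and n can be evaluated numerically. The sketch below is one possible Python rendering; the function names `beta_proportion` and `sample_size_proportion` are hypothetical, and the code uses unrounded critical values from `statistics.NormalDist` rather than table values such as 2.33.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def beta_proportion(p0, p_alt, n, alpha, alternative):
    """Approximate beta(p') for the large-sample test of H0: p = p0."""
    sd0 = sqrt(p0 * (1 - p0) / n)            # sd of p-hat when H0 is true
    sd_alt = sqrt(p_alt * (1 - p_alt) / n)   # sd of p-hat when p = p'
    if alternative == "greater":
        z_a = _STD.inv_cdf(1 - alpha)
        return _STD.cdf((p0 - p_alt + z_a * sd0) / sd_alt)
    if alternative == "less":
        z_a = _STD.inv_cdf(1 - alpha)
        return 1 - _STD.cdf((p0 - p_alt - z_a * sd0) / sd_alt)
    if alternative == "two-sided":
        z_a2 = _STD.inv_cdf(1 - alpha / 2)
        return (_STD.cdf((p0 - p_alt + z_a2 * sd0) / sd_alt)
                - _STD.cdf((p0 - p_alt - z_a2 * sd0) / sd_alt))
    raise ValueError("unknown alternative")


def sample_size_proportion(p0, p_alt, alpha, beta, two_tailed=False):
    """Sample size from the boxed formula, rounded up to an integer."""
    z_a = _STD.inv_cdf(1 - (alpha / 2 if two_tailed else alpha))
    z_b = _STD.inv_cdf(1 - beta)
    n = ((z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p_alt * (1 - p_alt)))
         / (p_alt - p0)) ** 2
    return ceil(n)
```

With the values of Example 9.12, `beta_proportion(0.9, 0.8, 225, 0.01, "less")` gives a result close to .0228 (the small discrepancy comes from using the unrounded critical value instead of 2.33), and `sample_size_proportion(0.9, 0.8, 0.01, 0.01)` rounds up to 266.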
Small-Sample Tests
Test procedures when the sample size n is small are based directly on the binomial
distribution rather than the normal approximation. Consider the alternative hypothesis Ha: $p > p_0$ and again let X be the number of successes in the sample. Then X
is the test statistic, and the upper-tailed rejection region has the form $x \ge c$. When
H0 is true, X has a binomial distribution with parameters n and $p_0$, so

$$\begin{aligned}
P(\text{type I error}) &= P(H_0 \text{ is rejected when it is true}) \\
&= P[X \ge c \text{ when } X \sim \text{Bin}(n, p_0)] \\
&= 1 - P[X \le c - 1 \text{ when } X \sim \text{Bin}(n, p_0)] \\
&= 1 - B(c - 1; n, p_0)
\end{aligned}$$

As the critical value c decreases, more x values are included in the rejection
region and P(type I error) increases. Because X has a discrete probability distribution, it is usually not possible to find a value of c for which P(type I error) is exactly
the desired significance level $\alpha$ (e.g., .05 or .01). Instead, the largest rejection region
of the form $\{c, c + 1, \ldots, n\}$ satisfying $1 - B(c - 1; n, p_0) \le \alpha$ is used.
Let $p'$ denote an alternative value of $p$ ($p' > p_0$). When $p = p'$, $X \sim \text{Bin}(n, p')$,
so

$$\beta(p') = P(\text{type II error when } p = p') = P[X < c \text{ when } X \sim \text{Bin}(n, p')] = B(c - 1; n, p')$$

That is, $\beta(p')$ is the result of a straightforward binomial probability calculation.
The sample size n necessary to ensure that a level $\alpha$ test also has specified $\beta$ at a
particular alternative value $p'$ must be determined by trial and error using the
binomial cdf.
Test procedures for Ha: $p < p_0$ and for Ha: $p \ne p_0$ are constructed in a similar
manner. In the former case, the appropriate rejection region has the form $x \le c$ (a lower-tailed test). The critical value c is the largest number satisfying $B(c; n, p_0) \le \alpha$.
The rejection region when the alternative hypothesis is Ha: $p \ne p_0$ consists of both large
and small x values.
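The critical values described above can be found by direct search over the binomial cdf. The following Python sketch is one way to do so; the helper names `binom_cdf`, `upper_tailed_c`, and `lower_tailed_c` are illustrative, and the cdf is computed from `math.comb` rather than read from Appendix Table A.1.

```python
from math import comb


def binom_cdf(x, n, p):
    """B(x; n, p) = P(X <= x) for X ~ Bin(n, p); equals 0 when x < 0."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(0, min(x, n) + 1))


def upper_tailed_c(n, p0, alpha):
    """Smallest c with 1 - B(c - 1; n, p0) <= alpha (region {c, ..., n})."""
    for c in range(n + 1):
        if 1 - binom_cdf(c - 1, n, p0) <= alpha:
            return c
    return None  # no nonempty region of this form has level <= alpha


def lower_tailed_c(n, p0, alpha):
    """Largest c with B(c; n, p0) <= alpha (region {0, ..., c})."""
    for c in range(n, -1, -1):
        if binom_cdf(c, n, p0) <= alpha:
            return c
    return None  # no nonempty region of this form has level <= alpha
```

Once c is known, the type II error probability is `binom_cdf(c - 1, n, p_alt)` for the upper-tailed test and `1 - binom_cdf(c, n, p_alt)` for the lower-tailed test, where `p_alt` stands for the alternative value $p'$.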
Example 9.13
A plastics manufacturer has developed a new type of plastic trash can and
proposes to sell them with an unconditional 6-year warranty. To see whether this
is economically feasible, 20 prototype cans are subjected to an accelerated life
test to simulate 6 years of use. The proposed warranty will be modified only if
the sample data strongly suggests that fewer than 90% of such cans would survive
the 6-year period. Let $p$ denote the proportion of all cans that survive the accelerated test. The relevant hypotheses are then H0: $p = .9$ versus Ha: $p < .9$. A decision
will be based on the test statistic X, the number among the 20 that survive.
If the desired significance level is $\alpha = .05$, c must satisfy $B(c; 20, .9) \le .05$.
From Appendix Table A.1, $B(15; 20, .9) = .043$, and $B(16; 20, .9) = .133$. The
appropriate rejection region is therefore $x \le 15$. If the accelerated test results in
$x = 14$, H0 would be rejected in favor of Ha, necessitating a modification of
the proposed warranty. The probability of a type II error for the alternative value
$p' = .8$ is

$$\begin{aligned}
\beta(.8) &= P[H_0 \text{ is not rejected when } X \sim \text{Bin}(20, .8)] \\
&= P[X \ge 16 \text{ when } X \sim \text{Bin}(20, .8)] \\
&= 1 - B(15; 20, .8) = 1 - .370 = .630
\end{aligned}$$

That is, when $p = .8$, 63% of all samples consisting of $n = 20$ cans would result in
H0 being incorrectly not rejected. This error probability is high because 20 is a
small sample size and $p' = .8$ is close to the null value $p_0 = .9$.
■
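Since $\beta(.8) = .630$ is large for $n = 20$, one might ask how many cans would be needed to bring it down. The text notes that this must be settled by trial and error with the binomial cdf; the Python sketch below automates such a search for the lower-tailed test. The function name `search_n_lower_tailed`, the cap `n_max`, and any target value of $\beta$ passed to it are assumptions made for illustration, not quantities from the example.

```python
from math import comb


def binom_cdf(x, n, p):
    """B(x; n, p) = P(X <= x) for X ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(0, min(x, n) + 1))


def search_n_lower_tailed(p0, p_alt, alpha, beta_target, n_max=300):
    """Try n = 1, 2, ..., n_max in turn. For each n, take the largest c with
    B(c; n, p0) <= alpha (rejection region {0, ..., c}) and compute
    beta(p_alt) = P(X > c when p = p_alt). Return (n, c, level, beta) for the
    first n whose beta is at most beta_target, or None if none is found."""
    for n in range(1, n_max + 1):
        candidates = [c for c in range(n + 1) if binom_cdf(c, n, p0) <= alpha]
        if not candidates:
            continue  # no rejection region with level <= alpha for this n
        c = max(candidates)
        beta = 1 - binom_cdf(c, n, p_alt)
        if beta <= beta_target:
            return n, c, binom_cdf(c, n, p0), beta
    return None
```

A call such as `search_n_lower_tailed(0.9, 0.8, 0.05, 0.10)` (the target $\beta = .10$ being a hypothetical requirement) returns the first sample size that qualifies together with its critical value and the actual error probabilities. Because X is discrete, $\beta$ need not decrease steadily as n grows, which is why the search checks each n in turn.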
Exercises Section 9.3 (36–44)
36. State DMV records indicate that of all vehicles
undergoing emissions testing during the previous
year, 70% passed on the first try. A random sample of 200 cars tested in a particular county during
the current year yields 124 that passed on the
initial test. Does this suggest that the true proportion for this county during the current year differs
from the previous statewide proportion? Test the
relevant hypotheses using $\alpha = .05$.
37. A manufacturer of nickel–hydrogen batteries randomly selects 100 nickel plates for test cells,
cycles them a specified number of times, and
determines that 14 of the plates have blistered.
a. Does this provide compelling evidence for concluding that more than 10% of all plates blister
under such circumstances? State and test the
appropriate hypotheses using a significance
level of .05. In reaching your conclusion,
what type of error might you have committed?
b. If it is really the case that 15% of all plates
blister under these circumstances and a sample
size of 100 is used, how likely is it that the null
hypothesis of part (a) will not be rejected by
the level .05 test? Answer this question for a
sample size of 200.
c. How many plates would have to be tested to
have $\beta(.15) = .10$ for the test of part (a)?
38. A random sample of 150 recent donations at a
blood bank reveals that 82 were type A blood.
Does this suggest that the actual percentage of
type A donations differs from 40%, the percentage
of the population having type A blood? Carry out
a test of the appropriate hypotheses using a significance level of .01. Would your conclusion have
been different if a significance level of .05 had
been used?
39. A university library ordinarily has a complete
shelf inventory done once every year. Because of
new shelving rules instituted the previous year, the
head librarian believes it may be possible to save
money by postponing the inventory. The librarian
decides to select at random 1000 books from the
library’s collection and have them searched in a
preliminary manner. If evidence indicates strongly
that the true proportion of misshelved or unlocatable books is <.02, then the inventory will be
postponed.
a. Among the 1000 books searched, 15 were misshelved or unlocatable. Test the relevant
hypotheses and advise the librarian what to do
(use $\alpha = .05$).
b. If the true proportion of misshelved and lost
books is actually .01, what is the probability
that the inventory will be (unnecessarily) taken?
c. If the true proportion is .05, what is the probability that the inventory will be postponed?
40. The article “Statistical Evidence of Discrimination” (J. Amer. Statist. Assoc., 1982: 773–783)
discusses the court case Swain v. Alabama
(1965), in which it was alleged that there was
discrimination against blacks in grand jury selection. Census data suggested that 25% of those
eligible for grand jury service were black, yet a
random sample of 1050 people called to appear for
possible duty yielded only 177 blacks. Using a
level .01 test, does this data argue strongly for a
conclusion of discrimination?
41. A plan for an executive traveler’s club has been
developed by an airline on the premise that 5% of
its current customers would qualify for membership. A random sample of 500 customers yielded
40 who would qualify.
a. Using this data, test at level .01 the null hypothesis that the company’s premise is correct
against the alternative that it is not correct.
b. What is the probability that when the test of
part (a) is used, the company’s premise will be
judged correct when in fact 10% of all current
customers qualify?
42. Each of a group of 20 intermediate tennis players
is given two rackets, one having nylon strings and
the other synthetic gut strings. After several weeks
of playing with the two rackets, each player will
be asked to state a preference for one of the two
types of strings. Let p denote the proportion of all
such players who would prefer gut to nylon, and
let X be the number of players in the sample who
prefer gut. Because gut strings are more expensive, consider the null hypothesis that at most 50%
of all such players prefer gut. We simplify this to
H0: $p = .5$, planning to reject H0 only if sample
evidence strongly favors gut strings.
a. Which of the rejection regions {15, 16, 17, 18,
19, 20}, {0, 1, 2, 3, 4, 5}, or {0, 1, 2, 3, 17, 18,
19, 20} is most appropriate, and why are the
other two not appropriate?
b. What is the probability of a type I error for the
chosen region of part (a)? Does the region specify a level .05 test? Is it the best level .05 test?
c. If 60% of all enthusiasts prefer gut, calculate
the probability of a type II error using the
appropriate region from part (a). Repeat if
80% of all enthusiasts prefer gut.
d. If 13 out of the 20 players prefer gut, should H0
be rejected using a significance level of .10?
43. A manufacturer of plumbing fixtures has developed a new type of washerless faucet. Let $p = P$(a
randomly selected faucet of this type will develop
a leak within 2 years under normal use). The
manufacturer has decided to proceed with production unless it can be determined that $p$ is too large;
the borderline acceptable value of $p$ is specified as
.10. The manufacturer decides to subject n of these
faucets to accelerated testing (approximating
2 years of normal use). With X = the number
among the n faucets that leak before the test concludes, production will commence unless the
observed X is too large. It is decided that if
$p = .10$, the probability of not proceeding should
be at most .10, whereas if $p = .30$ the probability
of proceeding should be at most .10. Can $n = 10$
be used? $n = 20$? $n = 25$? What is the appropriate
rejection region for the chosen n, and what are the
actual error probabilities when this region is used?
44. Scientists have recently become concerned about
the safety of Teflon cookware and various food
containers because perfluorooctanoic acid (PFOA)
is used in the manufacturing process. An article in
the July 27, 2005, New York Times reported that of
600 children tested, 96% had PFOA in their blood.
According to the FDA, 90% of all Americans have
PFOA in their blood.
a. Does the data on PFOA incidence among children suggest that the percentage of all children
who have PFOA in their blood exceeds the
FDA percentage for all Americans? Carry out
an appropriate test of hypotheses.
b. If 95% of all children have PFOA in their
blood, how likely is it that the null hypothesis
tested in (a) will be rejected when a significance level of .01 is employed?
c. Referring back to (b), what sample size would be
necessary for the relevant probability to be .10?
9.4 P-Values
Using the rejection region method to test hypotheses entails first selecting a
significance level $\alpha$. Then after computing the value of the test statistic, the null
hypothesis H0 is rejected if the value falls in the rejection region and is otherwise
not rejected. We now consider another way of reaching a conclusion in a hypothesis
testing analysis. This alternative approach is based on calculation of a certain
probability called a P-value. One advantage is that the P-value provides an intuitive
measure of the strength of evidence in the data against H0.
DEFINITION
The P-value is the probability, calculated assuming that the null hypothesis is
true, of obtaining a value of the test statistic at least as contradictory to H0 as
the value calculated from the available sample.
The definition is quite a mouthful. Here are some key points:
• The P-value is a probability.
• This probability is calculated assuming that the null hypothesis is true.
• To determine the P-value, we must first decide which values of the test
statistic are at least as contradictory to H0 as the value obtained from our sample.
Example 9.14
Urban storm water can be contaminated by many sources, including discarded
batteries. When ruptured, these batteries release metals of environmental significance.
The paper “Urban Battery Litter” (J. Environ. Engr., 2009: 46–57) presented summary data for characteristics of a variety of batteries found in urban areas around
Cleveland. A sample of 51 Panasonic AAA batteries gave a sample mean zinc mass of
2.06 g. and a sample standard deviation of .141 g. Does this data provide compelling
evidence for concluding that the population mean zinc mass exceeds 2.0 g.?
With $\mu$ denoting the true average zinc mass for such batteries, the relevant
hypotheses are H0: $\mu = 2.0$ versus Ha: $\mu > 2.0$. The sample size is large enough so
that a z test can be used without making any specific assumption about the shape of
the population distribution. The test statistic value is

$$z = \frac{\bar{x} - 2.0}{s/\sqrt{n}} = \frac{2.06 - 2.0}{.141/\sqrt{51}} = 3.04$$

Now we must decide which values of z are at least as contradictory to H0. Let’s first
consider an easier task: Which values of $\bar{x}$ are at least as contradictory to the null
hypothesis as 2.06, the mean of the observations in our sample? Because > appears
in Ha, it should be clear that 2.10 is at least as contradictory to H0 as is 2.06, so is
2.25, and so in fact is any $\bar{x}$ value that exceeds 2.06. But an $\bar{x}$ value that exceeds 2.06
corresponds to a value of z that exceeds 3.04. Thus the P-value is

$$\text{P-value} = P(Z \ge 3.04 \text{ when } \mu = 2.0)$$

Since the test statistic Z was created by subtracting the null value 2.0 in the
numerator, when $\mu = 2.0$ (i.e., when H0 is true) Z has approximately a standard
normal distribution. As a result,

$$\begin{aligned}
\text{P-value} &= P(Z \ge 3.04 \text{ when } \mu = 2.0) \\
&\approx \text{area under the z curve to the right of } 3.04 \\
&= 1 - \Phi(3.04) = .0012
\end{aligned}$$
■
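The calculation in Example 9.14 is easy to reproduce numerically. The short Python sketch below does so with the standard library's `statistics.NormalDist`; the variable names are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

# Summary data from Example 9.14
n, x_bar, s = 51, 2.06, 0.141
mu0 = 2.0  # null value

# Large-sample z statistic
z = (x_bar - mu0) / (s / sqrt(n))

# Ha: mu > 2.0, so the P-value is the area under the z curve to the right of z
p_value = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, P-value = {p_value:.4f}")
```

The results agree with 3.04 and .0012 up to rounding; the code carries the unrounded z into the tail-area calculation, whereas the hand calculation first rounds z to two decimal places.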
We will shortly illustrate how to determine the P-value for any z or t test; that is,
any test where the reference distribution is the standard normal distribution (and z
curve) or some t distribution (and corresponding t curve). For the moment, though,
let’s focus on reaching a conclusion once the P-value is available. Because it is a
probability, the P-value must be between 0 and 1. What kinds of P-values provide
evidence against the null hypothesis? Consider two specific instances:
• P-value = .250: In this case, fully 25% of all possible test statistic values are
at least as contradictory to H0 as the one that came out of our sample. So our data is
not that contradictory to the null hypothesis.
• P-value = .0018: Here, only .18%, much less than 1%, of all possible test
statistic values, are at least as contradictory to H0 as what we obtained. Thus the
sample appears to be highly contradictory to the null hypothesis.
More generally, the smaller the P-value, the more evidence there is in the sample
data against the null hypothesis and for the alternative hypothesis. That is,
H0 should be rejected in favor of Ha when the P-value is sufficiently small.
So what constitutes “sufficiently small”?
DECISION RULE BASED ON THE P-VALUE
Select a significance level $\alpha$ (as before, the desired type I error probability).
Then reject H0 if P-value $\le \alpha$; do not reject H0 if P-value $> \alpha$.
Thus if the P-value exceeds the chosen significance level, the null hypothesis cannot
be rejected at that level. But if the P-value is less than or equal to $\alpha$, then there is enough
evidence to justify rejecting H0. In Example 9.14, we calculated P-value = .0012.
Then using a significance level of .01, we would reject the null hypothesis in favor of
the alternative hypothesis because $.0012 \le .01$. However, suppose we select a
significance level of only .001, which requires more substantial evidence from the
data before H0 can be rejected. In this case we would not reject H0 because
$.0012 > .001$.
How does the decision rule based on the P-value compare to the decision
rule employed in the rejection region approach? The two procedures—the
rejection region method and the P-value method—are in fact identical. Whatever
the conclusion reached by employing the rejection region approach with a particular
$\alpha$, the same conclusion will be reached via the P-value approach using that same $\alpha$.
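To see why the two methods agree, consider the upper-tailed z test as one case (the lower-tailed and two-tailed cases follow by the same argument applied to the corresponding tail areas). The sketch below uses only two standard facts: the standard normal cdf $\Phi$ is strictly increasing, and $z_\alpha$ is defined by $1 - \Phi(z_\alpha) = \alpha$.

$$z \ge z_\alpha \iff \Phi(z) \ge \Phi(z_\alpha) = 1 - \alpha \iff 1 - \Phi(z) \le \alpha \iff \text{P-value} \le \alpha$$

So the computed z falls in the rejection region exactly when the P-value is at most $\alpha$.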
Example 9.15
The nicotine content problem discussed in Example 9.5 involved testing H0:
$\mu = 1.5$ versus Ha: $\mu > 1.5$ using a z test (i.e., a test which utilizes the z curve
as the reference distribution). The inequality in Ha implies that the upper-tailed
rejection region $z \ge z_\alpha$ is appropriate. Suppose $z = 2.10$. Then using exactly the
same reasoning as in Example 9.14 gives P-value $= 1 - \Phi(2.10) = .0179$. Consider now testing with several different significance levels:

$$\begin{aligned}
\alpha = .10 &\Rightarrow z_\alpha = z_{.10} = 1.28 \Rightarrow 2.10 \ge 1.28 \Rightarrow \text{reject } H_0 \\
\alpha = .05 &\Rightarrow z_\alpha = z_{.05} = 1.645 \Rightarrow 2.10 \ge 1.645 \Rightarrow \text{reject } H_0 \\
\alpha = .01 &\Rightarrow z_\alpha = z_{.01} = 2.33 \Rightarrow 2.10 < 2.33 \Rightarrow \text{do not reject } H_0
\end{aligned}$$

Because P-value $= .0179 \le .10$ and also $.0179 \le .05$, using the P-value approach
results in rejection of H0 for the first two significance levels. However, for $\alpha = .01$,
2.10 is not in the rejection region and .0179 is larger than .01. More generally,
whenever $\alpha$ is smaller than the P-value .0179, the critical value $z_\alpha$ will be beyond
the computed value $z = 2.10$ and H0 cannot be rejected by either method. This is illustrated in
Figure 9.5.
[Figure 9.5: three panels (a), (b), (c), each showing the standard normal (z) curve with the computed value z = 2.10 marked; the shaded upper-tail area is .0179 in panel (a) and $\alpha$, to the right of $z_\alpha$, in panels (b) and (c)]
Figure 9.5 Relationship between $\alpha$ and tail area captured by computed z: (a) tail
area captured by computed z; (b) when $\alpha > .0179$, $z_\alpha < 2.10$ and H0 is rejected;
(c) when $\alpha < .0179$, $z_\alpha > 2.10$ and H0 is not rejected
■
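The equivalence of the two decision rules in Example 9.15 can be checked numerically. The following is a minimal sketch rather than part of the original example: it assumes Python's standard-library `statistics.NormalDist` for Φ and its inverse, and the function name `upper_tailed_decisions` is made up for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution (mean 0, sd 1)

def upper_tailed_decisions(z, alpha):
    """Return (reject via rejection region, reject via P-value) for an upper-tailed z test."""
    z_alpha = Z.inv_cdf(1 - alpha)   # critical value capturing upper-tail area alpha
    p_value = 1 - Z.cdf(z)           # upper-tail area captured by the computed z
    return z >= z_alpha, p_value <= alpha

z = 2.10
p_value = 1 - Z.cdf(z)
print(f"P-value = {p_value:.4f}")
for alpha in (0.10, 0.05, 0.01):
    by_region, by_p_value = upper_tailed_decisions(z, alpha)
    print(f"alpha = {alpha}: rejection region -> {by_region}, P-value -> {by_p_value}")
```

For each α the two entries of the returned pair agree, which is the point of the example: z ≥ zα holds exactly when the P-value is at most α.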
Let’s reconsider the P-value .0012 in Example 9.14 once again. H0 can be
rejected only if .0012 ≤ α. Thus the null hypothesis can be rejected if α = .05 or
.01 or .005 or .0015 or .00125. What is the smallest significance level α here for
which H0 can be rejected? It is the P-value .0012.
PROPOSITION
The P-value is the smallest significance level α at which the null hypothesis
can be rejected. Because of this, the P-value is alternatively referred to as the
observed significance level (OSL) for the data.
It is customary to call the data significant when H0 is rejected and not
significant otherwise. The P-value is then the smallest level at which the data is
significant. An easy way to visualize the comparison of the P-value with the chosen
α is to draw a picture like that of Figure 9.6. The calculation of the P-value depends
on whether the test is upper-, lower-, or two-tailed. However, once it has been
calculated, the comparison with α does not depend on which type of test was used.
Figure 9.6 Comparing α and the P-value on a scale from 0 to 1, with the P-value (the smallest level at which H0 can be rejected) marked: (a) reject H0 when α lies here, at or above the P-value; (b) do not reject H0 when α lies here, below the P-value
Example 9.16
The true average time to initial relief of pain for a best-selling pain reliever is
known to be 10 min. Let μ denote the true average time to relief for a company’s
newly developed reliever. Suppose that when data from an experiment involving
the new pain reliever was analyzed, the P-value for testing H0: μ = 10 versus Ha:
μ < 10 was calculated as .0384. Since α = .05 is larger than the P-value [.05 lies in
the interval (a) of Figure 9.6], H0 would be rejected by anyone carrying out the test
at level .05. However, at level .01, H0 would not be rejected because .01 is smaller
than the smallest level (.0384) at which H0 can be rejected.
■
The most widely used statistical computer packages automatically include
a P-value when a hypothesis-testing analysis is performed. A conclusion can then
be drawn directly from the output, without reference to a table of critical values.
With the P-value in hand, an investigator can see at a quick glance for which
significance levels H0 would or would not be rejected. Also, each individual can
then select his or her own significance level. In addition, knowing the P-value allows
a decision maker to distinguish between a close call (e.g., α = .05, P-value = .0498)
and a very clear-cut conclusion (e.g., α = .05, P-value = .0003), something that
would not be possible just from the statement “H0 can be rejected at significance
level .05.”
P-Values for z Tests
The P-value for a z test (one based on a test statistic whose distribution when H0
is true is at least approximately standard normal) is easily determined from
the information in Appendix Table A.3. Consider an upper-tailed test and let z
denote the computed value of the test statistic Z. The null hypothesis is rejected if
z ≥ zα, and the P-value is the smallest α for which this is the case. Since zα increases
as α decreases, the P-value is the value of α for which z = zα. That is, the P-value
is just the area captured by the computed value z in the upper tail of the standard
normal curve. The corresponding cumulative area is Φ(z), so in this case
P-value = 1 − Φ(z).
An analogous argument for a lower-tailed test shows that the P-value is the area
captured by the computed value z in the lower tail of the standard normal curve. More
care must be exercised in the case of a two-tailed test. Suppose first that z is positive.
Then the P-value is the value of α satisfying z = zα/2 (i.e., computed z = upper-tail
critical value). This says that the area captured in the upper tail is half the P-value, so
that P-value = 2[1 − Φ(z)]. If z is negative, the P-value is the α for which z = −zα/2,
or, equivalently, −z = zα/2, so P-value = 2[1 − Φ(−z)]. Since −z = |z| when z is
negative, the P-value = 2[1 − Φ(|z|)] for either positive or negative z.
Figure 9.7 Determination of the P-value for a z test: (1) upper-tailed test, Ha contains the inequality >, P-value = area in upper tail = 1 − Φ(z); (2) lower-tailed test, Ha contains the inequality <, P-value = area in lower tail = Φ(z); (3) two-tailed test, Ha contains the inequality ≠, P-value = sum of area in two tails = 2[1 − Φ(|z|)]
\[
\text{P-value:}\quad P =
\begin{cases}
1 - \Phi(z) & \text{for an upper-tailed test} \\
\Phi(z) & \text{for a lower-tailed test} \\
2[1 - \Phi(|z|)] & \text{for a two-tailed test}
\end{cases}
\]
Each of these is the probability of getting a value at least as extreme as what was
obtained (assuming H0 true). The three cases are illustrated in Figure 9.7.
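The three cases translate directly into a short function. This is an illustrative sketch only: the name `z_test_p_value` and the `tail` labels are assumptions, and Φ is taken from the standard-library `statistics.NormalDist` rather than from Appendix Table A.3, so the last digit can differ slightly from table-based values.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cdf, written Φ in the text

def z_test_p_value(z, tail):
    """P-value for a z test; tail is 'upper', 'lower', or 'two'."""
    if tail == "upper":
        return 1 - PHI(z)             # area to the right of z
    if tail == "lower":
        return PHI(z)                 # area to the left of z
    if tail == "two":
        return 2 * (1 - PHI(abs(z)))  # both tails beyond |z|
    raise ValueError("tail must be 'upper', 'lower', or 'two'")

print(round(z_test_p_value(2.10, "upper"), 4))
print(round(z_test_p_value(-2.00, "lower"), 4))
print(round(z_test_p_value(-2.32, "two"), 4))
```

Using |z| in the two-tailed branch means the caller never has to treat positive and negative z separately.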
The next example illustrates the use of the P-value approach to hypothesis
testing by means of a sequence of steps modified from our previously recommended sequence.
Example 9.17
The target thickness for silicon wafers used in a type of integrated circuit is
245 μm. A sample of 50 wafers is obtained and the thickness of each one is
determined, resulting in a sample mean thickness of 246.18 μm and a sample
standard deviation of 3.60 μm. Does this data suggest that true average wafer
thickness is something other than the target value?
1. Parameter of interest: μ = true average wafer thickness
2. Null hypothesis: H0: μ = 245
3. Alternative hypothesis: Ha: μ ≠ 245
4. Formula for test statistic value: \( z = \dfrac{\bar{x} - 245}{s/\sqrt{n}} \)
5. Calculation of test statistic value: \( z = \dfrac{246.18 - 245}{3.60/\sqrt{50}} = 2.32 \)
6. Determination of P-value: Because the test is two-tailed,
P-value = 2[1 − Φ(2.32)] = .0204
7. Conclusion: Using a significance level of .01, H0 would not be rejected since
.0204 > .01. At this significance level, there is insufficient evidence to conclude
that true average thickness differs from the target value.
■
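Steps 4 through 7 of Example 9.17 can be reproduced from the summary statistics alone. The sketch below is not from the text; the helper name `two_tailed_z_test` is hypothetical. It keeps z unrounded, so the computed P-value may differ from the table-based .0204 in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_z_test(xbar, s, n, mu0):
    """Large-sample two-tailed z test from summary statistics; returns (z, P-value)."""
    z = (xbar - mu0) / (s / sqrt(n))               # standardized distance from the null value
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # area in both tails beyond |z|
    return z, p_value

z, p_value = two_tailed_z_test(xbar=246.18, s=3.60, n=50, mu0=245)
print(f"z = {z:.2f}, P-value = {p_value:.4f}")
for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value <= alpha else "do not reject H0"
    print(f"alpha = {alpha}: {decision}")
```

The loop shows why reporting the P-value is more informative than reporting a single decision: the same data lead to rejection at level .05 but not at level .01.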
P-Values for t Tests
Just as the P-value for a z test is a z curve area, the P-value for a t test will be a
t curve area. Figure 9.8 illustrates the three different cases. The number of df for the
one-sample t test is n − 1.
The table of t critical values used previously for confidence and prediction
intervals doesn’t contain enough information about any particular t distribution to
allow for accurate determination of desired areas. So we have included another
t table in Appendix Table A.7, one that contains a tabulation of upper-tail t curve
areas. Each different column of the table is for a different number of df, and the
rows are for calculated values of the test statistic t ranging from 0.0 to 4.0 in
increments of .1. For example, the number .074 appears at the intersection of the 1.6
row and the 8 df column, so the area under the 8 df curve to the right of 1.6 (an
upper-tail area) is .074. Because t curves are symmetric, .074 is also the area under
the 8 df curve to the left of −1.6 (a lower-tail area).
Suppose, for example, that a test of H0: μ = 100 versus Ha: μ > 100 is based
on the 8 df t distribution. If the calculated value of the test statistic is t = 1.6, then
the P-value for this upper-tailed test is .074. Because .074 exceeds .05, we would
not be able to reject H0 at a significance level of .05. If the alternative hypothesis is
Ha: μ < 100 and a test based on 20 df yields t = −3.2, then Appendix Table A.7
shows that the P-value is the captured lower-tail area .002. The null hypothesis can
be rejected at either level .05 or .01. Consider testing H0: μ1 − μ2 = 0 versus
Ha: μ1 − μ2 ≠ 0; the null hypothesis states that the means of the two populations
are identical, whereas the alternative hypothesis states that they are different
without specifying a direction of departure from H0. If a t test is based on 20 df
and t = 3.2, then the P-value for this two-tailed test is 2(.002) = .004. This would
also be the P-value for t = −3.2. The tail area is doubled because values both
larger than 3.2 and smaller than −3.2 are more contradictory to H0 than what was
calculated (values farther out in either tail of the t curve).
Figure 9.8 P-values for t tests (t curve for the relevant df): (1) upper-tailed test, Ha contains the inequality >, P-value = area in upper tail to the right of the calculated t; (2) lower-tailed test, Ha contains the inequality <, P-value = area in lower tail to the left of the calculated t; (3) two-tailed test, Ha contains the inequality ≠, P-value = sum of area in two tails beyond the calculated t and −t
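Tail areas like those tabulated in Appendix Table A.7 can also be approximated without a table. The sketch below is an illustration under stated assumptions, not the method used to build the table: it integrates the t density numerically with Simpson's rule using only the standard library, and the function names are hypothetical. A statistical package would normally be used instead.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Area under the t curve to the right of t (Simpson's rule on [0, |t|]; steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3      # area between 0 and |t|
    upper = 0.5 - central        # area to the right of |t|, by symmetry about 0
    return upper if t >= 0 else 1 - upper

def t_test_p_value(t, df, tail):
    """P-value for a t test; tail is 'upper', 'lower', or 'two'."""
    if tail == "upper":
        return t_upper_tail(t, df)
    if tail == "lower":
        return t_upper_tail(-t, df)            # lower-tail area equals upper-tail area of -t
    if tail == "two":
        return 2 * t_upper_tail(abs(t), df)
    raise ValueError("tail must be 'upper', 'lower', or 'two'")

print(round(t_test_p_value(1.6, 8, "upper"), 3))
print(round(t_test_p_value(-3.2, 20, "lower"), 3))
print(round(t_test_p_value(3.2, 20, "two"), 3))
```

The three calls correspond to the upper-tailed, lower-tailed, and two-tailed illustrations in the preceding paragraph.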
Example 9.18
In Example 9.9, we carried out a test of H0: μ = 25 versus Ha: μ > 25 based on
4 df. The calculated value of t was 1.04. Looking to the 4 df column of Appendix
Table A.7 and down to the 1.0 row, we see that the entry is .187, so the
P-value ≈ .187. This P-value is clearly larger than any reasonable significance
level α (.01, .05, and even .10), so there is no reason to reject the null hypothesis.
The MINITAB output included in Example 9.9 has P-value = .18. P-values from
software packages will be more accurate than what results from Appendix Table A.7
since values of t in our table are accurate only to the tenths digit.
■
More on Interpreting P-Values
The P-value resulting from carrying out a test on a selected sample is not the
probability that H0 is true, nor is it the probability of rejecting the null hypothesis.
Once again, it is the probability, calculated assuming that H0 is true, of obtaining a
test statistic value at least as contradictory to the null hypothesis as the value that
actually resulted. For example, consider testing H0: μ = 50 against Ha: μ < 50
using a lower-tailed z test. If the calculated value of the test statistic is z = −2.00,
then
P-value = P(Z < −2.00 when μ = 50)
= area under the z curve to the left of −2.00 = .0228
But if a second sample is selected, the resulting value of z will almost surely be
different from −2.00, so the corresponding P-value will also likely differ from
.0228. Because the test statistic value itself varies from one sample to another, the
P-value will also vary from one sample to another. That is, the test statistic is a
random variable, and so the P-value will also be a random variable. A first sample
may give a P-value of .0228, a second sample result in a P-value of .1175, a third
yield .0606 as the P-value, and so on.
If H0 is false, we hope the P-value will be close to 0 so that the null
hypothesis can be rejected. On the other hand, when H0 is true, we’d like the
P-value to exceed the selected significance level so that the correct decision to not
reject H0 is made. The next example presents simulations to show how the P-value
behaves both when the null hypothesis is true and when it is false.
Example 9.19
The fuel efficiency (mpg) of any particular new vehicle under specified driving
conditions may not be identical to the EPA figure that appears on the vehicle’s
sticker. Suppose that four different vehicles of a particular type are to be selected
and driven over a certain course, after which the fuel efficiency of each one is to be
determined. Let μ denote the true average fuel efficiency under these conditions.
Consider testing H0: μ = 20 versus Ha: μ > 20 using the one-sample t test
based on the resulting sample. Since the test is based on n − 1 = 3 degrees of
freedom, the P-value for an upper-tailed test is the area under the t curve with 3 df
to the right of the calculated t.
Let’s first suppose that the null hypothesis is true. We asked MINITAB to
generate 10,000 different samples, each containing 4 observations, from a normal
population distribution with mean value μ = 20 and standard deviation σ = 2. The
first sample and resulting summary quantities were
x1 = 20.830, x2 = 22.232, x3 = 20.276, x4 = 17.718
\[ \bar{x} = 20.264 \qquad s = 1.8864 \qquad t = \frac{20.264 - 20}{1.8864/\sqrt{4}} = .2799 \]
The P-value is the area under the 3-df t curve to the right of .2799, which according
to MINITAB is .3989. Using a significance level of .05, the null hypothesis would of
course not be rejected. The values of t for the next four samples were −1.7591,
.6082, −.7020, and 3.1053, with corresponding P-values .912, .293, .733, and .0265.
Figure 9.9(a) shows a histogram of the 10,000 P-values from this simulation
experiment. About 4.5% of these P-values are in the first class interval from 0 to
.05. Thus when using a significance level of .05, the null hypothesis is rejected in
roughly 4.5% of these 10,000 tests. If we continue to generate samples and carry
out the test for each one at significance level .05, in the long run 5% of the P-values
would be in the first class interval—because when H0 is true and a test with
significance level .05 is used, by definition the probability of rejecting H0 is .05.
Looking at the histogram, it appears that the distribution of P-values is
relatively flat. In fact, it can be shown that when H0 is true, the probability
distribution of the P-value is a uniform distribution on the interval from 0 to 1.
That is, the density curve is completely flat on this interval, and thus must have a
height of 1 if the total area under the curve is to be 1. Since the area under such a
curve to the left of .05 is (.05)(1) = .05, we again have that the probability of
rejecting H0 when it is true is .05, the chosen significance level.
Figure 9.9 P-value simulation results for Example 9.19 (histograms of percent versus P-value): (a) μ = 20, H0 true; (b) μ = 21; (c) μ = 22
Now consider what happens when H0 is false because μ = 21. We again had
MINITAB generate 10,000 different samples of size 4, each from a normal
distribution with μ = 21 and σ = 2, calculate \( t = (\bar{x} - 20)/(s/\sqrt{4}) \) for each one,
and then determine the P-value. The first such sample resulted in x̄ = 20.6411,
s = .49637, t = 2.5832, P-value = .0408. Figure 9.9(b) gives a histogram of the
10,000 resulting P-values. The shape of this histogram is quite different from that
of Figure 9.9(a): there is a much greater tendency for the P-value to be small (closer
to 0) when μ = 21 than when μ = 20. Again H0 is rejected at significance level .05
whenever the P-value is at most .05 (in the first class interval). Unfortunately this is
the case for only about 19% of the 10,000 P-values. So only about 19% of the
10,000 tests correctly reject the null hypothesis; for the other 81%, a type II error is
committed. The difficulty is that the sample size is quite small and 21 is not very
different from the value asserted by the null hypothesis.
Figure 9.9(c) illustrates what happens to the P-value when H0 is false because
μ = 22 (still with n = 4 and σ = 2). The histogram is even more concentrated
toward values close to 0 than was the case when μ = 21. In general, as μ moves
further to the right of the null value 20, the distribution of the P-value will become
more and more concentrated on values close to 0. Even here a bit fewer than 50% of
the 10,000 P-values are smaller than .05. So it is still slightly more likely than
not that the null hypothesis is incorrectly not rejected. Only for values of μ much
larger than 20 (e.g., at least 24 or 25) is it highly likely that the P-value will be
smaller than .05 and thus give the correct conclusion.
The big idea of this example is that because the value of any test statistic is
random, the P-value will also be a random variable and thus have a distribution.
The farther the actual value of the parameter is from the value specified by the null
hypothesis, the more the distribution of the P-value will be concentrated on values
close to 0 and the greater the chance that the test will correctly reject H0
(corresponding to smaller β).
■
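A simulation in the spirit of Example 9.19 can be run with a few lines of code. The sketch below is an assumption-laden stand-in for the MINITAB experiment, not a reproduction of it: it uses Python's `random` module with an arbitrary seed, so individual samples and percentages will not match the text exactly, and it relies on the known closed-form cdf of the t distribution with 3 df, which applies only because n = 4 here. All function names are hypothetical.

```python
import random
from math import atan, pi, sqrt
from statistics import mean, stdev

N = 4  # sample size; the closed form below is valid only for n - 1 = 3 df

def t3_upper_tail(t):
    """Area to the right of t under the t curve with 3 df (closed-form cdf)."""
    x = t / sqrt(3)
    cdf = 0.5 + (x / (1 + x * x) + atan(x)) / pi
    return 1 - cdf

def simulate_p_values(mu, sigma=2.0, mu0=20.0, reps=10_000, seed=1):
    """P-values of the upper-tailed one-sample t test for reps simulated normal samples."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(N)]
        t = (mean(sample) - mu0) / (stdev(sample) / sqrt(N))
        p_values.append(t3_upper_tail(t))
    return p_values

def rejection_rate(p_values, alpha=0.05):
    """Proportion of simulated tests that reject H0 at level alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)

rates = {mu: rejection_rate(simulate_p_values(mu)) for mu in (20, 21, 22)}
for mu, rate in rates.items():
    print(f"mu = {mu}: proportion of P-values at most .05 = {rate:.3f}")
```

When μ = 20 the proportion should be close to .05, and it should grow as μ moves to the right of 20, mirroring the three histograms in Figure 9.9.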
Exercises Section 9.4 (45–59)
45. For which of the given P-values would the null
hypothesis be rejected when performing a level
.05 test?
a. .001
b. .021
c. .078
d. .047
e. .148
46. Pairs of P-values and significance levels, α, are
given. For each pair, state whether the observed P-value would lead to rejection of H0 at the given
significance level.
a. P-value = .084, α = .05
b. P-value = .003, α = .001
c. P-value = .498, α = .05
d. P-value = .084, α = .10
e. P-value = .039, α = .01
f. P-value = .218, α = .10
47. Let μ denote the mean reaction time to a certain
stimulus. For a large-sample z test of H0: μ = 5
versus Ha: μ > 5, find the P-value associated with
each of the given values of the z test statistic.
a. 1.42
b. .90
c. 1.96
d. 2.48
e. .11
48. Newly purchased tires of a certain type are supposed to be filled to a pressure of 30 lb/in². Let
μ denote the true average pressure. Find the P-value
associated with each given z statistic value for testing H0: μ = 30 versus Ha: μ ≠ 30.
a. 2.10
b. 1.75
c. .55
d. 1.41
e. 5.3
49. Give as much information as you can about the
P-value of a t test in each of the following situations:
a. Upper-tailed test, df = 8, t = 2.0
b. Lower-tailed test, df = 11, t = 2.4
c. Two-tailed test, df = 15, t = 1.6
d. Upper-tailed test, df = 19, t = .4
e. Upper-tailed test, df = 5, t = 5.0
f. Two-tailed test, df = 40, t = 4.8
50. The paint used to make lines on roads must
reflect enough light to be clearly visible at night.
Let μ denote the true average reflectometer
reading for a new type of paint under consideration. A test of H0: μ = 20 versus Ha: μ > 20 will
be based on a random sample of size n from a
normal population distribution. What conclusion
is appropriate in each of the following situations?
a. n = 15, t = 3.2, α = .05
b. n = 9, t = 1.8, α = .01
c. n = 24, t = .2
51. Let μ denote true average serum receptor concentration for all pregnant women. The average for all
women is known to be 5.63. The article “Serum
Transferrin Receptor for the Detection of Iron Deficiency in Pregnancy” (Amer. J. Clin. Nutrit., 1991:
1077–1081) reports that P-value > .10 for a test of
H0: μ = 5.63 versus Ha: μ ≠ 5.63 based on
n = 176 pregnant women. Using a significance
level of .01, what would you conclude?
52. An aspirin manufacturer fills bottles by weight
rather than by count. Since each bottle should
contain 100 tablets, the average weight per tablet
should be 5 grains. Each of 100 tablets taken from
a very large lot is weighed, resulting in a sample
average weight per tablet of 4.87 grains and a
sample standard deviation of .35 grain. Does this
information provide strong evidence for concluding that the company is not filling its bottles as
advertised? Test the appropriate hypotheses using
α = .01 by first computing the P-value and then
comparing it to the specified significance level.
53. Because of variability in the manufacturing process, the actual yielding point of a sample of mild
steel subjected to increasing stress will usually
differ from the theoretical yielding point. Let
p denote the true proportion of samples that
yield before their theoretical yielding point. If on
the basis of a sample it can be concluded that more
than 20% of all specimens yield before the theoretical point, the production process will have to
be modified.
a. If 15 of 60 specimens yield before the theoretical point, what is the P-value when the appropriate test is used, and what would you advise
the company to do?
b. If the true percentage of “early yields” is actually 50% (so that the theoretical point is the
median of the yield distribution) and a level .01
test is used, what is the probability that the
company concludes a modification of the process is necessary?
54. Many consumers are turning to generics as a way
of reducing the cost of prescription medications.
The article “Commercial Information on Drugs:
Confusing to the Physician?” (J. Drug Issues,
1988: 245–257) gives the results of a survey of
102 doctors. Only 47 of those surveyed knew the
generic name for the drug methadone. Does this
provide strong evidence for concluding that fewer
than half of all physicians know the generic name
for methadone? Carry out a test of hypotheses
with a significance level of .01 using the P-value
method.
55. A random sample of soil specimens was obtained,
and the amount of organic matter (%) in the soil
was determined for each specimen, resulting in the
accompanying data (from “Engineering Properties
of Soil,” Soil Sci., 1998: 93–102).
1.10
0.14
3.98
0.76
5.09
4.47
3.17
1.17
0.97
1.20
3.03
1.57
1.59
3.50
2.21
2.62
4.60
5.02
0.69
1.66
0.32 0.55 1.45
4.67 5.22 2.69
4.47 3.31 1.17
2.05
The values of the sample mean, sample standard
deviation, and (estimated) standard error of the
mean are 2.481, 1.616, and .295, respectively.
Does this data suggest that the true average percentage of organic matter in such soil is something
other than 3%? Carry out a test of the appropriate
hypotheses at significance level .10 by first determining the P-value. Would your conclusion be
different if α = .05 had been used? [Note: A normal probability plot of the data shows an
acceptable pattern in light of the reasonably large
sample size.]
56. The times of first sprinkler activation for a series
of tests with fire prevention sprinkler systems
using an aqueous film-forming foam were (in sec)
27 41 22 27 23 35 30 33 24 27 28 22 24
(see “Use of AFFF in Sprinkler Systems,” Fire
Tech., 1976: 5). The system has been designed so
that true average activation time is at most 25 s
under such conditions. Does the data strongly
contradict the validity of this design specification?
Test the relevant hypotheses at significance level
.05 using the P-value approach.
57. A pen has been designed so that true average
writing lifetime under controlled conditions
(involving the use of a writing machine) is at
least 10 h. A random sample of 18 pens is selected,
the writing lifetime of each is determined, and a
normal probability plot of the resulting data supports the use of a one-sample t test.
a. What hypotheses should be tested if the investigators believe a priori that the design specification has been satisfied?
b. What conclusion is appropriate if the hypotheses of part (a) are tested, t = 2.3, and
α = .05?
c. What conclusion is appropriate if the hypotheses of part (a) are tested, t = 1.8, and
α = .01?
d. What should be concluded if the hypotheses of
part (a) are tested and t = 3.6?
58. A spectrophotometer used for measuring CO concentration [ppm (parts per million) by volume] is
checked for accuracy by taking readings on a
manufactured gas (called span gas) in which the
CO concentration is very precisely controlled at
70 ppm. If the readings suggest that the spectrophotometer is not working properly, it will have to
be recalibrated. Assume that if it is properly calibrated, measured concentration for span gas samples is normally distributed. On the basis of the six
readings—85, 77, 82, 68, 72, and 69—is recalibration necessary? Carry out a test of the relevant
hypotheses using the P-value approach with
α = .05.
59. The relative conductivity of a semiconductor
device is determined by the amount of impurity
“doped” into the device during its manufacture.
A silicon diode to be used for a specific purpose
requires an average cut-on voltage of .60 V, and if
this is not achieved, the amount of impurity must
be adjusted. A sample of diodes was selected and
the cut-on voltage was determined. The accompanying SAS output resulted from a request to test
the appropriate hypotheses.
N     Mean         Std Dev      T            Prob > |T|
15    0.0453333    0.0899100    1.9527887    0.0711
[Note: SAS explicitly tests H0: μ = 0, so to test
H0: μ = .60, the null value .60 must be subtracted
from each xi; the reported mean is then the average
of the (xi − .60) values. Also, SAS’s P-value is
always for a two-tailed test.] What would be concluded for a significance level of .01? .05? .10?
9.5 Some Comments on Selecting
a Test Procedure
Once the experimenter has decided on the question of interest and the method for
gathering data (the design of the experiment), construction of an appropriate test
procedure consists of three distinct steps:
1. Specify a test statistic (the decision is based on this function of the data).
2. Decide on the general form of the rejection region (typically, reject H0 for
suitably large values of the test statistic, reject for suitably small values, or reject
for either small or large values).
3. Select the specific numerical critical value or values that will separate the
rejection region from the acceptance region (by obtaining the distribution of the
test statistic when H0 is true, and then selecting a level of significance).
In the examples thus far, both steps 1 and 2 were carried out in an ad hoc manner
through intuition. For example, when the underlying population was assumed normal
with mean μ and known σ, we were led from X̄ to the standardized test statistic
\[ Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \]
For testing H0: μ = μ0 versus Ha: μ > μ0, intuition then suggested rejecting H0
when z was large. Finally, the critical value was determined by specifying the level
of significance α and using the fact that Z has a standard normal distribution when
H0 is true. The reliability of the test in reaching a correct decision can be assessed
by studying type II error probabilities.
Issues to be considered in carrying out steps 1–3 encompass the following
questions:
1. What are the practical implications and consequences of choosing a particular
level of significance once the other aspects of a test procedure have been
determined?
2. Does there exist a general principle, not dependent just on intuition, that can be
used to obtain best or good test procedures?
3. When two or more tests are appropriate in a given situation, how can the tests be
compared to decide which should be used?
4. If a test is derived under specific assumptions about the distribution or
population being sampled, how well will the test procedure work when the
assumptions are violated?
Statistical Versus Practical Significance
Although the process of reaching a decision by using the methodology of classical
hypothesis testing involves selecting a level of significance and then rejecting or
not rejecting H0 at that level, simply reporting the α used and the decision reached
conveys little of the information contained in the sample data. Especially when the
results of an experiment are to be communicated to a large audience, rejection of H0
at level .05 will be much more convincing if the observed value of the test statistic
greatly exceeds the 5% critical value than if it barely exceeds that value. This is
precisely what led to the notion of P-value as a way of reporting significance
without imposing a particular α on others who might wish to draw their own
conclusions.
Table 9.1 An illustration of the effect of sample size on P-values and β
n         P-value when x̄ = 101    β(101) for level .01 test
25        .3085                   .9664
100       .1587                   .9082
400       .0228                   .6293
900       .0013                   .2514
1600      .0000335                .0475
2500      .000000297              .0038
10,000    7.69 × 10⁻²⁴            .0000
Even if a P-value is included in a summary of results, however, there may be
difficulty in interpreting this value and in making a decision. This is because a
small P-value, which would ordinarily indicate statistical significance in that it
would strongly suggest rejection of H0 in favor of Ha, may be the result of a large
sample size in combination with a departure from H0 that has little practical
significance. In many experimental situations, only departures from H0 of large
magnitude would be worthy of detection, whereas a small departure from H0 would
have little practical significance.
Consider as an example testing H0: μ = 100 versus Ha: μ > 100 where μ is
the mean of a normal population with σ = 10. Suppose a true value of μ = 101
would not represent a serious departure from H0 in the sense that not rejecting H0
when μ = 101 would be a relatively inexpensive error. For a reasonably large
sample size n, this μ would lead to an x̄ value near 101, so we would not want this
sample evidence to argue strongly for rejection of H0 when x̄ = 101 is observed.
For various sample sizes, Table 9.1 records both the P-value when x̄ = 101 and also
the probability of not rejecting H0 at level .01 when μ = 101.
The second column in Table 9.1 shows that even for moderately large sample
sizes, the P-value of x̄ = 101 argues very strongly for rejection of H0, whereas
the observed x̄ itself suggests that in practical terms the true value of μ differs little
from the null value μ0 = 100. The third column points out that even when there is
little practical difference between the true μ and the null value, for a fixed level of
significance a large sample size will almost always lead to rejection of the null
hypothesis at that level. To summarize, one must be especially careful in interpreting evidence when the sample size is large, since any small departure from H0 will
almost surely be detected by a test, yet such a departure may have little practical
significance.
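The entries of Table 9.1 follow from two normal-curve areas, and recomputing them makes the role of n explicit. The sketch below is illustrative only. It assumes the rounded tabled critical value z.01 = 2.33, computes normal areas through the standard library's `math.erfc` so that extremely small tail areas are not lost to rounding, and uses made-up function names. The smallest P-values may differ a little from the printed table in their last digits.

```python
from math import erfc, sqrt

MU0, MU_TRUE, SIGMA = 100.0, 101.0, 10.0
Z_01 = 2.33  # critical value z_.01, rounded as in a standard normal table (an assumption)

def upper_tail(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 * erfc(z / sqrt(2))

def p_value_when_xbar_is_101(n):
    """P-value of the upper-tailed z test when the observed sample mean is 101."""
    z = (101.0 - MU0) / (SIGMA / sqrt(n))
    return upper_tail(z)

def beta_at_101(n):
    """Probability of not rejecting H0 at level .01 when mu = 101."""
    shift = (MU_TRUE - MU0) / (SIGMA / sqrt(n))
    return 1 - upper_tail(Z_01 - shift)   # Phi(z_.01 - shift)

for n in (25, 100, 400, 900, 1600, 2500, 10_000):
    print(f"n = {n:>6}: P-value = {p_value_when_xbar_is_101(n):.3g}, "
          f"beta(101) = {beta_at_101(n):.4f}")
```

Both columns shrink toward 0 as n grows even though the departure from the null value stays fixed at one unit, which is the distinction between statistical and practical significance drawn above.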
Best Tests for Simple Hypotheses
The test procedures presented thus far are (hopefully) intuitively reasonable, but
have not been shown to be best in any sense. How can an optimal test be obtained,
one for which the type II error probability is as small as possible, subject to
controlling the type I error probability at the desired level? Our starting point
here will be a rather unrealistic situation from a practical viewpoint: testing a
simple null hypothesis against a simple alternative hypothesis. A simple hypothesis
is one which, when true, completely specifies the distribution of the sample Xi’s.
Suppose, for example, that the Xi’s form a random sample from an exponential
distribution with parameter λ. Then the hypothesis H: λ = 1 is simple, since when
H is true each Xi has an exponential distribution with parameter λ = 1. We might
then consider H0: λ = 1 versus Ha: λ = 2, both of which are simple hypotheses.
The hypothesis H: λ ≤ 1 is not simple, because when H is true, the distribution of
each Xi might be exponential with λ = 1 or with λ = .8 or . . . . Similarly, if the Xi’s
constitute a random sample from a normal distribution with known σ, then
H: μ = 100 is a simple hypothesis. But if the value of σ is unknown, this hypothesis
is not simple because the distribution of each Xi is then not completely specified; it
could be normal with μ = 100 and σ = 15 or normal with μ = 100 and σ = 12 or
normal with μ = 100 and any other positive value of σ. For a hypothesis to be
simple, the value of every parameter in the pmf or pdf of the Xi’s must be specified.
The next result was a milestone in the theory of hypothesis testing: a method
for constructing a best test for a simple null hypothesis versus a simple alternative
hypothesis. Let f(x1, . . . , xn; θ) be the joint pmf or pdf of the Xi’s. Then our null
hypothesis will assert that θ = θ0 and the relevant alternative hypothesis will claim
that θ = θa. The result will carry over to the case of more than one parameter as
long as the value of each parameter is completely specified in both H0 and Ha.
THE NEYMAN–PEARSON THEOREM
For testing a simple null hypothesis H0: θ = θ0 versus a simple alternative
hypothesis Ha: θ = θa, let k be a positive fixed number and form the rejection
region
\[ R^* = \left\{ (x_1, \ldots, x_n) : \frac{f(x_1, \ldots, x_n; \theta_a)}{f(x_1, \ldots, x_n; \theta_0)} \ge k \right\} \]
Thus R* is the set of all observations for which the likelihood ratio (the ratio of
the alternative likelihood to the null likelihood) is at least k. The probability
of a type I error for the test with this rejection region is α* = P[(X1, . . . , Xn)
∈ R* when θ = θ0], whereas the type II error probability β* is the probability
that the Xi’s lie in the complement of R* (in the “acceptance” region) when
θ = θa.
Then for any other test procedure with type I error probability α
satisfying α ≤ α*, the probability of a type II error must satisfy β ≥ β*.
Thus the test with rejection region R* has the smallest type II error probability among all tests for which the type I error probability is at most α*.
The choice of the constant k in the rejection region will determine the type I
error probability α*. In the continuous case, k can be selected to give one of the
traditional significance levels .05, .01, and so on, whereas in the discrete case
α* = .057 or .039 may be as close as one can get to .05.
Example 9.20
Consider randomly selecting n = 5 new vehicles of a certain type and determining
the number of major defects on each one. Letting Xi denote the number of such
defects for the ith selected vehicle (i = 1, . . . , 5), suppose that the Xi’s form
a random sample from a Poisson distribution with parameter λ. Let’s find the
best test for testing H0: λ = 1 versus Ha: λ = 2. The Poisson likelihood is
f(x1, . . . , x5; λ) = e^{−5λ} λ^{Σxi} / Πxi!. Substituting first λ = 2, then λ = 1, and then
taking the ratio of these two likelihoods gives the rejection region
\[
R^* = \left\{ (x_1, \ldots, x_5) : e^{-5}\, 2^{\sum x_i} \ge k \right\}
\]
Multiplying both sides of the inequality by e^5 and letting k′ = ke^5 gives the
rejection region 2^{Σxi} ≥ k′. Now take the natural logarithm of both sides and let
c = ln(k′)/ln(2) to obtain the rejection region Σxi ≥ c.
This latter rejection region is completely equivalent to R*: For any particular
value k there will be a corresponding value c, and vice versa. But it is much easier to
express the rejection region in this latter form and then select c to obtain a desired
significance level than it is to determine an appropriate value of k for the likelihood
ratio. In particular, T = ΣXi has a Poisson distribution with parameter 5λ (via a
moment generating function argument), so when H0 is true T has a Poisson distribution with parameter 5. From the 5.0 column of our Poisson table (Table A.2),
the cumulative probabilities for the values 8 and 9 are .932 and .968, respectively.
Thus if we use c = 9 in the rejection region,
α* = P(Poisson rv with parameter 5 is ≥ 9) = 1 − .932 = .068
Choosing instead c = 10 gives α* = .032. If we insist that the significance level be
at most .05, then the optimal rejection region is Σxi ≥ 10.
When Ha is true, the test statistic has a Poisson distribution with parameter 10.
Thus
β* = P(H0 is not rejected when Ha is true)
   = P(Poisson rv with parameter 10 is ≤ 9) = .458
Obviously this type II error probability is quite large. This is because the sample
size n = 5 is too small to allow for effective discrimination between λ = 1 and
λ = 2. For a sample size of 10, the Poisson table reveals that the best test having
significance level at most .05 uses c = 16, for which α* = .049 (Poisson parameter = 10) and β* = .157 (Poisson parameter = 20).
Finally, returning to a sample size of 5, c = 10 implies that 10 = ln(ke^5)/ln(2),
from which k = 2^10/e^5 ≈ 6.9. For the best test to have a significance level of at most
.05, the null hypothesis should be rejected only when the likelihood for the alternative
value of λ is more than about 7 times what it is for the null value.
■
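The table lookups in Example 9.20 can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `poisson_cdf` and `error_probs` are illustrative and not part of the text.

```python
from math import exp, factorial

def poisson_cdf(x: int, mean: float) -> float:
    """P(T <= x) for T ~ Poisson(mean); an empty sum gives 0 when x < 0."""
    return sum(exp(-mean) * mean**j / factorial(j) for j in range(x + 1))

def error_probs(c: int, n: int, lam0: float = 1.0, lam_a: float = 2.0):
    """Return (alpha, beta) for the test that rejects H0 when sum(x_i) >= c.

    T = sum(X_i) is Poisson with parameter n*lambda, as stated in the example.
    """
    alpha = 1.0 - poisson_cdf(c - 1, n * lam0)  # P(T >= c) when H0 is true
    beta = poisson_cdf(c - 1, n * lam_a)        # P(T <= c - 1) when Ha is true
    return alpha, beta

alpha_9, _ = error_probs(c=9, n=5)
alpha_10, beta_10 = error_probs(c=10, n=5)
alpha_16, beta_16 = error_probs(c=16, n=10)

# Likelihood-ratio cutoff k that corresponds to c = 10 when n = 5
k = 2**10 / exp(5)

print(f"n=5,  c=9 : alpha={alpha_9:.3f}")
print(f"n=5,  c=10: alpha={alpha_10:.3f}, beta={beta_10:.3f}, k={k:.2f}")
print(f"n=10, c=16: alpha={alpha_16:.3f}, beta={beta_16:.3f}")
```

To three decimals the results should agree with the values read from Table A.2 in the example (.068, .032, .458, .049, .157), and k should come out close to 6.9.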
Example 9.21
Let X1, . . . , Xn be a random sample from a normal distribution with mean μ and
variance 1 (the argument to be given will work for any other known value of σ²).
Consider testing H0: μ = μ0 versus Ha: μ = μa where μa > μ0. The likelihood ratio is
\[
\frac{\left(\frac{1}{2\pi}\right)^{n/2} e^{-(1/2)\sum (x_i-\mu_a)^2}}
     {\left(\frac{1}{2\pi}\right)^{n/2} e^{-(1/2)\sum (x_i-\mu_0)^2}}
= e^{\mu_a \sum x_i - \mu_0 \sum x_i - (n/2)(\mu_a^2-\mu_0^2)}
= \left[ e^{-n(\mu_a^2-\mu_0^2)/2} \right] \left[ e^{(\mu_a-\mu_0)\sum x_i} \right]
\]
The term in the first set of brackets is a numerical constant. Then μa − μ0 > 0
implies that the likelihood ratio will be at least k if and only if Σxi ≥ k′, that is, if
and only if x̄ ≥ k″, which means if and only if
\[
z = \frac{\bar{x}-\mu_0}{1/\sqrt{n}} \ge c
\]
If we now let c = z.01 = 2.33, this z test (one for which the test statistic has a
standard normal distribution when H0 is true), will have minimum β among all tests
for which α ≤ .01.
■
The key idea in these last two examples cannot be overemphasized: Write an
expression for the likelihood ratio, and then manipulate the inequality likelihood
ratio ≥ k so it is equivalent to an inequality involving a test statistic whose distribution
when H0 is true is known or can be derived. Then this known or derived distribution
can be used to obtain a test with the desired α. In the first example the distribution was
Poisson with parameter 5, and in the second it was the standard normal distribution.
Proof of the Neyman-Pearson Theorem: We shall consider the case in which the
Xi’s have a discrete distribution, so that type I and type II error probabilities are
obtained by summation. In the continuous case, integration replaces summation.
Then
\[
R^* = \{(x_1,\ldots,x_n) : f(x_1,\ldots,x_n;\theta_a) \ge k\, f(x_1,\ldots,x_n;\theta_0)\}
\]
\[
\alpha^* = P[(X_1,\ldots,X_n) \in R^* \text{ when } \theta=\theta_0] = \sum_{R^*} f(x_1,\ldots,x_n;\theta_0)
\]
\[
\beta^* = P[(X_1,\ldots,X_n) \in R^{*\prime} \text{ when } \theta=\theta_a] = \sum_{R^{*\prime}} f(x_1,\ldots,x_n;\theta_a)
\]
(β* is the sum over values in the complement R*′ of the rejection region). Suppose that
R is a rejection region different from R* whose type I error probability is at most α*;
that is,
\[
\alpha = P[(X_1,\ldots,X_n) \in R \text{ when } \theta=\theta_0] = \sum_{R} f(x_1,\ldots,x_n;\theta_0) \le \alpha^*
\]
We then wish to show that β for this rejection region must be at least as large
as β*. Consider the difference
\[
\begin{aligned}
\Delta &= \sum_{R^*} [f(x_1,\ldots,x_n;\theta_a) - k f(x_1,\ldots,x_n;\theta_0)]
        - \sum_{R} [f(x_1,\ldots,x_n;\theta_a) - k f(x_1,\ldots,x_n;\theta_0)] \\
       &= \left\{ \sum_{R^* \cap R} [\ldots] + \sum_{R^* \cap R'} [\ldots] \right\}
        - \left\{ \sum_{R \cap R^*} [\ldots] + \sum_{R \cap R^{*\prime}} [\ldots] \right\} \\
       &= \sum_{R^* \cap R'} [\ldots] - \sum_{R \cap R^{*\prime}} [\ldots]
\end{aligned}
\]
This last difference is nonnegative (i.e. ≥ 0) because the term in the square brackets
is ≥ 0 for any set of xi’s in R* and is negative for any set of xi’s not in R*. It then
follows that
\[
\begin{aligned}
0 \le \Delta &= \sum_{R^*} f(x_1,\ldots,x_n;\theta_a) - k \sum_{R^*} f(x_1,\ldots,x_n;\theta_0)
              - \sum_{R} f(x_1,\ldots,x_n;\theta_a) + k \sum_{R} f(x_1,\ldots,x_n;\theta_0) \\
             &= (1-\beta^*) - k\alpha^* - (1-\beta) + k\alpha \\
             &= \beta - \beta^* - k(\alpha^* - \alpha) \le \beta - \beta^*
\end{aligned}
\]
(since α ≤ α* implies that the term being subtracted is nonnegative).
Thus we have shown that β* ≤ β as desired.
■
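The conclusion of the theorem can be checked by brute force on a small discrete problem. The sketch below is an illustration under assumed values, not part of the proof: it takes a single binomial count with n = 5 and the hypothetical simple hypotheses p = 1/2 versus p = 3/4, builds R* for one choice of k, and enumerates every possible rejection region. Exact rational arithmetic is used so that the comparisons α ≤ α* and β < β* are not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

n = 5
p0, pa = Fraction(1, 2), Fraction(3, 4)  # hypothetical null and alternative values
support = range(n + 1)

def pmf(x, p):
    """Binomial(n, p) probability of x successes, as an exact fraction."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def alpha_beta(region):
    """Type I and type II error probabilities for a given rejection region."""
    alpha = sum(pmf(x, p0) for x in region)
    beta = sum(pmf(x, pa) for x in support if x not in region)
    return alpha, beta

# Neyman-Pearson region: likelihood ratio at least k (k chosen as the ratio at x = 4)
k = pmf(4, pa) / pmf(4, p0)
r_star = {x for x in support if pmf(x, pa) >= k * pmf(x, p0)}
alpha_star, beta_star = alpha_beta(r_star)

# Search all 2^(n+1) regions for one with alpha <= alpha* but beta < beta*
violations = []
for size in range(n + 2):
    for region in combinations(support, size):
        a, b = alpha_beta(set(region))
        if a <= alpha_star and b < beta_star:
            violations.append(set(region))

print("R* =", sorted(r_star), "alpha* =", alpha_star, "beta* =", beta_star)
print("regions that beat R*:", violations)
```

The list of violations should be empty, which is exactly what the theorem asserts for this choice of k.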
Power and Uniformly Most Powerful Tests
The Neyman–Pearson theorem can be restated in a slightly different way by
considering the power of a test, first introduced in Section 9.2.
DEFINITION
Let Ω0 and Ωa be two disjoint sets of possible values of θ, and consider testing
H0: θ ∈ Ω0 versus Ha: θ ∈ Ωa using a test with rejection region R. Then the
power function of the test, denoted by π(·), is the probability of rejecting H0
considered as a function of θ:
\[
\pi(\theta') = P[(X_1,\ldots,X_n) \in R \text{ when } \theta = \theta']
\]
Since we don’t want to reject the null hypothesis when θ ∈ Ω0 and do want to reject
it when θ ∈ Ωa, we wish a test for which the power function is close to 0 whenever
θ′ is in Ω0 and close to 1 whenever θ′ is in Ωa. The power is easily related to the
type I and type II error probabilities:
\[
\pi(\theta') =
\begin{cases}
P(\text{type I error when } \theta = \theta') = \alpha(\theta') & \text{when } \theta' \in \Omega_0 \\
1 - P(\text{type II error when } \theta = \theta') = 1 - \beta(\theta') & \text{when } \theta' \in \Omega_a
\end{cases}
\]
Thus large power when θ′ ∈ Ωa is equivalent to small β for such parameter values.
Example 9.22
The drying time (min) of a particular brand and type of paint on a test board under
controlled conditions is known to be normally distributed with μ = 75 and σ = 9.4. A
new additive has been developed for the purpose of improving drying time. Assume
that drying time with the additive is still normally distributed with the same standard
deviation, and consider testing H0: μ ≥ 75 versus Ha: μ < 75 based on a sample of
size n = 100. A test with significance level .01 rejects H0 if z ≤ −2.33, where
z = (x̄ − 75)/(9.4/√100) = (x̄ − 75)/.94. Manipulating the inequality in the rejection region to isolate x̄ gives the equivalent rejection region x̄ ≤ 72.81. Thus the
power of the test when μ = 70 (a substantial departure from the null hypothesis) is
\[
\pi(70) = P(\bar{X} \le 72.81 \text{ when } \mu = 70) = \Phi\left(\frac{72.81 - 70}{9.4/\sqrt{100}}\right) = \Phi(2.99) = .9986
\]
so β(70) = .0014. It is easily verified that π(75) = .01, the significance level. The
power when μ = 76 (a parameter value for which H0 is true) is
\[
\pi(76) = P(\bar{X} \le 72.81 \text{ when } \mu = 76) = \Phi\left(\frac{72.81 - 76}{9.4/\sqrt{100}}\right) = \Phi(-3.39) = .0003
\]
which is quite small as it should be. By repeating this calculation for various
other values of μ we obtain the entire power function. A graph of the ideal
power function appears in Figure 9.10(a) and the actual power function is graphed
in Figure 9.10(b). The maximum power for μ ≥ 75 (i.e. in Ω0) occurs at μ = 75, on
the boundary between Ω0 and Ωa. Because the power function is continuous, there are
values of μ smaller than 75 for which the power is quite small. Even with a large
sample size, it is difficult to detect a very small departure from the null hypothesis.
[Figure 9.10 Graphs of power functions for Example 9.22: (a) ideal power function, (b) actual power function. Horizontal axis: MEAN (68 to 77); vertical axis: POWER (0.0 to 1.0).]
■
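The power calculations of Example 9.22 can be repeated for any value of the mean. The following is a minimal sketch using only the standard library; the function names `phi` and `power` are illustrative, and the cutoff is computed as 75 − 2.33(.94) rather than rounded to 72.81, so the last digit may differ slightly from the hand calculation.

```python
from math import erf, sqrt

def phi(z: float) -> float:
    """Standard normal cdf, computed from the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

MU0, SIGMA, N, Z_CRIT = 75.0, 9.4, 100, 2.33
SE = SIGMA / sqrt(N)          # standard deviation of the sample mean, .94
CUTOFF = MU0 - Z_CRIT * SE    # reject H0 when xbar <= CUTOFF (about 72.81)

def power(mu: float) -> float:
    """Probability that the lower-tailed z test rejects H0 when the true mean is mu."""
    return phi((CUTOFF - mu) / SE)

for mu in (70, 72, 74, 74.9, 75, 76):
    print(f"mu = {mu:5.1f}   power = {power(mu):.4f}")
```

Evaluating `power` on a fine grid from 68 to 77 traces out the curve described for Figure 9.10(b); values of the mean just below 75 give power only slightly above .01, which is the point made at the end of the example.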
The Neyman–Pearson theorem says that when Ω0 consists of a single value
θ0 and Ωa also consists of a single value θa, the rejection region R* specifies a test
for which the power π(θa) at the alternative value θa (which is just 1 − β) is
maximized subject to π(θ0) ≤ α for some specified value of α. That is, R* specifies
a most powerful test subject to the restriction on the power when the null hypothesis
is true.
What about best tests when at least one of the two hypotheses is composite,
that is, Ω0 or Ωa (or both) consist of more than a single value?
Example 9.23
(Example 9.20 continued)
Consider again a random sample of size n = 5 from a Poisson distribution, and
suppose we now wish to test H0: λ ≤ 1 versus Ha: λ > 1. Both of these hypotheses
are composite. Arguing as in Example 9.20, for any value λa exceeding 1, a most
powerful test of H0: λ = 1 versus Ha: λ = λa with significance level (power when
λ = 1) .032 rejects the null hypothesis when Σxi ≥ 10. Furthermore, it is easily
verified that the power of this test at λ′ is smaller than .032 if λ′ < 1. Thus the test
that rejects H0: λ ≤ 1 in favor of Ha: λ > 1 when Σxi ≥ 10 has maximum power
for any λ′ > 1 subject to the condition that π(λ′) ≤ .032 whenever λ′ ≤ 1. This test is
uniformly most powerful.
■
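The claim that the power stays at or below .032 throughout the null region can be checked numerically. The sketch below evaluates the power function of the test that rejects when Σxi ≥ 10 (with n = 5) on a small grid; the grid values and the function name `power` are arbitrary illustrative choices.

```python
from math import exp, factorial

def power(lam: float, n: int = 5, c: int = 10) -> float:
    """P(sum of n Poisson(lam) observations >= c); the sum is Poisson(n * lam)."""
    mean = n * lam
    return 1.0 - sum(exp(-mean) * mean**j / factorial(j) for j in range(c))

grid = [0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0]  # hypothetical values of lambda
values = [power(lam) for lam in grid]

for lam, p in zip(grid, values):
    region = "null" if lam <= 1.0 else "alternative"
    print(f"lambda = {lam:4.2f} ({region:11s})  power = {p:.4f}")
```

The power increases with λ, so its largest value over the null region occurs at the boundary λ = 1, where it equals the significance level of about .032.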
More generally, a uniformly most powerful (UMP) level α test is one for
which π(θ′) is maximized for any θ′ ∈ Ωa subject to π(θ′) ≤ α for any θ′ ∈ Ω0.
Unfortunately UMP tests are fairly rare, especially in commonly encountered
situations when H0 and Ha are assertions about a single parameter θ1 whereas the
distribution of the Xi’s involves not only θ1 but also at least one other “nuisance
parameter”. For example, when the population distribution is normal with values of
both μ and σ unknown, σ is a nuisance parameter when testing H0: μ = μ0 versus
Ha: μ ≠ μ0. Be careful here: the null hypothesis is not simple because Ω0 consists
of all pairs (μ, σ) for which μ = μ0 and σ > 0, and there is certainly more than one
such pair. In this situation, the one-sample t test is not UMP.
However, suppose we restrict attention to unbiased tests, those for which the
smallest value of π(θ′) for θ′ ∈ Ωa is at least as large as the largest value of π(θ′) for
θ′ ∈ Ω0. Unbiasedness simply says that we are at least as likely to reject the null
hypothesis when H0 is false as we are to reject it when H0 is true. The test proposed in
Example 9.22 involving paint drying times is unbiased because, as Figure 9.10(b)
shows, the power function at or to the right of 75 is smaller than it is to the left of 75.
It can be shown that the one-sample t test is UMP unbiased; that is, it is uniformly
most powerful among all tests that are unbiased. Several other commonly used tests
also have this property. Please consult one of the chapter references for more details.
Likelihood Ratio Tests
The likelihood ratio (LR) principle is the most frequently used method for finding an
appropriate test statistic in a new situation. As before, denote the joint pmf or pdf of
X1, . . . , Xn by f(x1, . . . , xn; θ). In the case of a random sample, it will be a product
f(x1; θ) · · · f(xn; θ). When the xi’s are the actual observations and f(x1, . . . , xn; θ) is
regarded as a function of θ, it is called the likelihood function. Again consider
testing H0: θ ∈ Ω0 versus Ha: θ ∈ Ωa, where Ω0 and Ωa are disjoint sets, and
let Ω = Ω0 ∪ Ωa. In the Neyman–Pearson theorem, we focused on the ratio of the
likelihood when θ ∈ Ωa to the likelihood when θ ∈ Ω0, rejecting H0 when the value of
the ratio was “sufficiently large”. Now we consider the ratio of the likelihood when
θ ∈ Ω0 to the likelihood when θ ∈ Ω. A very small value of this ratio argues against
the null hypothesis, since a small value arises when the data is much more consistent
with the alternative hypothesis than with the null hypothesis. More formally,
1. Find the largest value of the likelihood for any θ ∈ Ω0 by finding the maximum
likelihood estimate of θ within Ω0 and substituting this mle into the
likelihood function to obtain L(Ω̂0).
2. Find the largest value of the likelihood for any θ ∈ Ω by finding the maximum
likelihood estimate of θ within Ω and substituting this mle into the likelihood
function to obtain L(Ω̂). Because Ω0 is a subset of Ω, this likelihood L(Ω̂) can’t
be any smaller than the likelihood L(Ω̂0) obtained in the first step, and will be
much larger when the data is much more consistent with Ha than with H0.
3. Form the likelihood ratio L(Ω̂0)/L(Ω̂) and reject the null hypothesis in favor
of the alternative when this ratio is ≤ k. The critical value k is chosen to give a
test with the desired significance level. In practice, the inequality L(Ω̂0)/L(Ω̂) ≤ k
is often re-expressed in terms of a more convenient statistic (such as the sum
of the observations) whose distribution is known or can be derived.
The above prescription remains valid if the single parameter θ is replaced by
several parameters θ1, . . . , θk. The mle’s of all parameters must be obtained in
both steps 1 and 2 and substituted back into the likelihood function.
Example 9.24
Consider a random sample from a normal distribution with the values of both
parameters unknown. We wish to test H0: μ = μ0 versus Ha: μ ≠ μ0. Here Ω
consists of all values of μ and σ² for which −∞ < μ < ∞ and σ² > 0, and the
likelihood function is
\[
\left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-\frac{1}{2\sigma^2}\sum (x_i-\mu)^2}
\]
In Section 7.2 we obtained the mle’s as μ̂ = x̄, σ̂² = Σ(xi − x̄)²/n. Substituting
these estimates back into the likelihood function gives
\[
L(\hat{\Omega}) = \left(\frac{1}{2\pi \sum (x_i-\bar{x})^2/n}\right)^{n/2} e^{-n/2}
\]
Within Ω0, μ in the foregoing likelihood is replaced by μ0, so that only σ² must be
estimated. It is easily verified that the mle is σ̂² = Σ(xi − μ0)²/n. Substitution of
this estimate in the likelihood function yields
\[
L(\hat{\Omega}_0) = \left(\frac{1}{2\pi \sum (x_i-\mu_0)^2/n}\right)^{n/2} e^{-n/2}
\]
Thus we reject H0 in favor of Ha when
\[
\frac{L(\hat{\Omega}_0)}{L(\hat{\Omega})} = \left(\frac{\sum (x_i-\bar{x})^2}{\sum (x_i-\mu_0)^2}\right)^{n/2} \le k
\]
Raising both sides of this inequality to the power 2/n, we reject H0 whenever
\[
\frac{\sum (x_i-\bar{x})^2}{\sum (x_i-\mu_0)^2} \le k^{2/n} = k'
\]
This is intuitively quite reasonable: the value μ0 is implausible for μ if the sum of
squared deviations about the sample mean is much smaller than the sum of squared
deviations about μ0. The denominator of this latter ratio can be expressed as
\[
\sum [(x_i-\bar{x}) + (\bar{x}-\mu_0)]^2 = \sum (x_i-\bar{x})^2 + 2\sum (\bar{x}-\mu_0)(x_i-\bar{x}) + n(\bar{x}-\mu_0)^2
\]
The middle (i.e., cross-product) term in this expression is 0, because the constant
x̄ − μ0 can be moved outside the summation, and then the sum of deviations from
the sample mean is 0. Thus we should reject H0 when
\[
\frac{\sum (x_i-\bar{x})^2}{\sum (x_i-\bar{x})^2 + n(\bar{x}-\mu_0)^2}
= \frac{1}{1 + n(\bar{x}-\mu_0)^2 / \sum (x_i-\bar{x})^2} \le k'
\]
This latter ratio will be small when the second term in the denominator is large, so
the condition for rejection becomes
\[
\frac{n(\bar{x}-\mu_0)^2}{\sum (x_i-\bar{x})^2} \ge k''
\]
Multiplying both sides by n − 1 (so that the denominator becomes
s² = Σ(xi − x̄)²/(n − 1)) and taking square roots gives the rejection region
\[
\text{either } \frac{\bar{x}-\mu_0}{s/\sqrt{n}} \ge c \quad \text{or} \quad \frac{\bar{x}-\mu_0}{s/\sqrt{n}} \le -c
\]
If we now let c = t_{α/2,n−1}, we have exactly the two-tailed one-sample t test. The
bottom line is that when testing H0: μ = μ0 against the two-sided (≠) alternative,
the one-sample t test is the likelihood ratio test. This is also true of the upper-tailed
version of the t test when the alternative is Ha: μ > μ0 and of the lower-tailed test
when the alternative is Ha: μ < μ0. We could trace back through the argument to
recover the critical constant k from c, but there is no point in doing this; the
rejection region in terms of t is much more convenient than the rejection region
in terms of the likelihood ratio.
■
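The algebra of Example 9.24 implies that the likelihood ratio depends on the data only through the t statistic, since Σ(xi − μ0)²/Σ(xi − x̄)² = 1 + t²/(n − 1). The following sketch checks this numerically; the sample values and the null value 10.0 are made up for illustration and do not come from the text.

```python
from math import sqrt

def lr_and_t(xs, mu0):
    """Return the likelihood ratio L(Omega0-hat)/L(Omega-hat) and the one-sample t statistic."""
    n = len(xs)
    xbar = sum(xs) / n
    ss_mean = sum((x - xbar) ** 2 for x in xs)  # squared deviations about the sample mean
    ss_null = sum((x - mu0) ** 2 for x in xs)   # squared deviations about mu0
    lr = (ss_mean / ss_null) ** (n / 2)
    s = sqrt(ss_mean / (n - 1))                 # sample standard deviation
    t = (xbar - mu0) / (s / sqrt(n))
    return lr, t

sample = [9.8, 10.4, 10.1, 9.6, 10.9, 10.3]     # hypothetical measurements
n = len(sample)
lr, t = lr_and_t(sample, mu0=10.0)

# The same ratio written as a function of t alone
lr_from_t = (1 + t**2 / (n - 1)) ** (-n / 2)

print(f"t = {t:.4f}, LR = {lr:.6f}, LR from t = {lr_from_t:.6f}")
```

Because the ratio is a decreasing function of |t|, rejecting for small values of the likelihood ratio is the same as rejecting for large values of |t|, which is the conclusion of the example.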
A number of tests discussed subsequently, including the “pooled” t test from
the next chapter and various tests from ANOVA (the analysis of variance) and
regression analysis, can be derived by the likelihood ratio principle. Rather frequently the inequality for the rejection region of a likelihood ratio test cannot be
manipulated to express the test procedure in terms of a simple statistic whose
distribution can be ascertained. The following large-sample result, valid under
fairly general conditions, can then be used: If the sample size n is sufficiently
large, then the statistic −2[ln(likelihood ratio)] has approximately a chi-squared
distribution with ν degrees of freedom, where ν is the difference between the
number of “freely varying” parameters in Ω and the number of such parameters
in Ω0. For example, if the distribution sampled is bivariate normal with the 5 parameters μ1, μ2, σ1, σ2, and ρ and the null hypothesis asserts that μ1 = μ2 and
σ1 = σ2, then ν = 5 − 3 = 2. By definition L(Ω̂0)/L(Ω̂) ≤ 1, and the likelihood
ratio test rejects H0 when this likelihood ratio is much less than 1. This is equivalent
to rejecting when the logarithm of the likelihood ratio is quite negative, that is,
when −ln(LR) is quite positive. The large-sample version of the test is thus upper-tailed: H0 should be rejected if −2ln(likelihood ratio) ≥ χ²_{α,ν} (an upper-tail critical
value extracted from Table A.6).
Example 9.25
Suppose a scientist makes n measurements of some physical characteristic, such as
the specific gravity of a liquid. Let X1, . . . , Xn denote the resulting measurement
errors. Assume that these Xi’s are independent and identically distributed according
to the double exponential (Laplace) distribution: f(x) = .5e^{−|x−θ|} for −∞ < x < ∞.
This pdf is symmetric about θ with somewhat heavier tails than the normal pdf.
If θ = 0 then the measurements are unbiased, so it is natural to test H0: θ = 0 versus
Ha: θ ≠ 0. Here ν = 1 − 0 = 1. The likelihood is
\[
L(\theta) = (.5)^n e^{-\sum |x_i - \theta|}
\]
Because of the minus sign preceding the summation, the likelihood is maximized
when Σ|xi − θ| is minimized. The absolute value function is not differentiable,
and therefore differential calculus cannot be used. Instead, consider for a moment
the case n = 5 and let y1, . . . , y5 denote the values of the xi’s ordered from smallest
to largest, so the yi’s are the observed values of the order statistics. For example, a
random sample of size five from the Laplace distribution with θ = 0 is −.24998,
.75446, −.19053, 1.16237, .83229, so (y1, . . . , y5) = (−.24998, −.19053, .75446,
.83229, 1.16237). Then
\[
\sum |x_i - \theta| = \sum |y_i - \theta| =
\begin{cases}
y_1 + y_2 + y_3 + y_4 + y_5 - 5\theta & \theta < y_1 \\
-y_1 + y_2 + y_3 + y_4 + y_5 - 3\theta & y_1 \le \theta < y_2 \\
-y_1 - y_2 + y_3 + y_4 + y_5 - \theta & y_2 \le \theta < y_3 \\
-y_1 - y_2 - y_3 + y_4 + y_5 + \theta & y_3 \le \theta < y_4 \\
-y_1 - y_2 - y_3 - y_4 + y_5 + 3\theta & y_4 \le \theta < y_5 \\
-y_1 - y_2 - y_3 - y_4 - y_5 + 5\theta & \theta \ge y_5
\end{cases}
\]
The graph of this expression as a function of θ appears in Figure 9.11, from which it
is apparent that the minimum occurs at y3 = x̃ = .75446, the sample median. The
situation is similar whenever n is odd. When n is even, the function achieves its
minimum for any θ between y_{n/2} and y_{(n/2)+1}; one such θ is (y_{n/2} + y_{(n/2)+1})/2 = x̃.
In summary, the mle of θ is the sample median.
[Figure 9.11 Determining the mle of the double exponential parameter by minimizing Σ|xi − θ|. Horizontal axis: θ (−.5 to 1.5); vertical axis: Σ|xi − θ| (2.5 to 5.5).]
The likelihood ratio statistic for testing the relevant hypotheses is
(.5)^n e^{−Σ|xi|} / [(.5)^n e^{−Σ|xi − x̃|}]. Taking the natural log of the likelihood ratio and multiplying by −2 gives the rejection region 2Σ|xi| − 2Σ|xi − x̃| ≥ χ²_{α,1} for the large-sample version of the LR test.
Suppose that a sample of n = 30 errors results in Σ|xi| = 38.6 and
Σ|xi − x̃| = 37.3. Then
\[
-2\ln(LR) = 2\left(\sum |x_i| - \sum |x_i - \tilde{x}|\right) = 2.6
\]
Comparing this to χ²_{.05,1} = 3.84, we would not reject the null hypothesis at the
5% significance level. It is plausible that the measurement process is indeed
unbiased.
■
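Both parts of Example 9.25 can be checked with a few lines of code. The sketch below uses the five-point sample from the example, with the two smallest values taken as negative (as the stated ordering and the axis of Figure 9.11 imply), and then evaluates the large-sample statistic from the two reported sums. The upper-tail probability for one degree of freedom is obtained from the identity P(χ²₁ > x) = P(|Z| > √x), so no statistics library is assumed.

```python
from math import erfc, sqrt
from statistics import median

sample = [-0.24998, 0.75446, -0.19053, 1.16237, 0.83229]

def sum_abs_dev(theta, xs):
    """Sum of |x_i - theta|; minimizing this maximizes the Laplace likelihood."""
    return sum(abs(x - theta) for x in xs)

theta_hat = median(sample)

# Crude grid search over [-0.5, 1.5] as an independent check on the minimizer
grid = [i / 1000 for i in range(-500, 1501)]
best = min(grid, key=lambda th: sum_abs_dev(th, sample))

def chi2_1df_upper_tail(x):
    """P(chi-squared with 1 df > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))."""
    return erfc(sqrt(x / 2.0))

# Summary values reported in the example for the n = 30 sample
stat = 2 * (38.6 - 37.3)                 # -2 ln(LR)
p_value = chi2_1df_upper_tail(stat)

print(f"median = {theta_hat}, grid minimizer = {best}")
print(f"-2 ln(LR) = {stat:.1f}, approximate P-value = {p_value:.3f}")
```

The grid minimizer should land within the grid spacing of the sample median, and the approximate P-value exceeds .05, in line with the decision not to reject H0.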
Exercises Section 9.5 (60–71)
60. Reconsider the paint-drying problem discussed in
Example 9.2. The hypotheses were H0: μ = 75
versus Ha: μ < 75, with σ assumed to have value
9.0. Consider the alternative value μ = 74, which
in the context of the problem would presumably
not be a practically significant departure from H0.
a. For a level .01 test, compute β at this alternative for sample sizes n = 100, 900, and 2500.
b. If the observed value of X̄ is x̄ = 74, what can
you say about the resulting P-value when
n = 2500? Is the data statistically significant
at any of the standard values of α?
c. Would you really want to use a sample size
of 2500 along with a level .01 test (disregarding the cost of such an experiment)? Explain.
61. Consider the large-sample level .01 test in Section 9.3 for testing H0: p = .2 against Ha: p > .2.
a. For the alternative value p = .21, compute
β(.21) for sample sizes n = 100, 2500,
10,000, 40,000, and 90,000.
b. For p̂ = x/n = .21, compute the P-value when
n = 100, 2500, 10,000, and 40,000.
c. In most situations, would it be reasonable to
use a level .01 test in conjunction with a sample
size of 40,000? Why or why not?
62. For a random sample of n individuals taking a
licensing exam, let Xi = 1 if the ith individual in
the sample passes the exam and Xi = 0 otherwise
(i = 1, . . . , n).
a. With p denoting the proportion of all exam-takers who pass, show that the most powerful
test of H0: p = .5 versus Ha: p = .75 rejects H0
when Σxi ≥ c.
b. If n = 20 and you want α ≤ .05 for the test of
(a), would you reject H0 if 15 of the 20 individuals in the sample pass the exam?
c. What is the power of the test you used in (b)
when p = .75 [i.e., what is π(.75)]?
d. Is the test derived in (a) UMP for testing
the hypotheses H0: p = .5 versus Ha: p > .5?
Explain your reasoning.
e. Graph the power function π(p) of the test for the
hypotheses of (d) when n = 20 and α ≤ .05.
f. Return to the scenario of (a), and suppose the
test is based on a sample size of 50. If the
probability of a type II error is approximately
.025, what is the approximate significance level
of the test (use a normal approximation)?
63. The error X in a measurement has a normal distribution with mean value 0 and variance σ². Consider testing H0: σ² = 2 versus Ha: σ² = 3 based
on a random sample X1, . . . , Xn of errors.
a. Show that a most powerful test rejects H0 when
Σxi² ≥ c.
b. For n = 10, find the value of c for the test in
(a) that results in α = .05.
c. Is the test of (a) UMP for H0: σ² = 2 versus
Ha: σ² > 2? Justify your assertion.
64. Suppose that X, the fraction of a container that is
filled, has pdf f(x; θ) = θx^{θ−1} for 0 < x < 1
(where θ > 0), and let X1, . . . , Xn be a random
sample from this distribution.
a. Show that the most powerful test for H0: θ = 1
versus Ha: θ = 2 rejects the null hypothesis if
Σln(xi) ≥ c.
b. Is the test of (a) UMP for testing H0: θ = 1
versus Ha: θ > 1? Explain your reasoning.
c. If n = 50, what is the (approximate) value of c
for which the test has significance level .05?
65. Consider a random sample of n component lifetimes, where the distribution of lifetime is exponential with parameter λ.
a. Obtain a most powerful test for H0: λ = 1
versus Ha: λ = .5, and express the rejection
region in terms of a “simple” statistic.
b. Is the test of (a) uniformly most powerful for
H0: λ = 1 versus Ha: λ < 1? Justify your
answer.
66. Consider a random sample of size n from the
“shifted exponential” distribution with pdf
f(x; θ) = e^{−(x−θ)} for x > θ and 0 otherwise (the
graph is that of the ordinary exponential pdf with
λ = 1 shifted so that it begins its descent at θ rather
than at 0). Let Y1 denote the smallest order statistic,
and show that the likelihood ratio test of H0: θ ≤ 1
versus Ha: θ > 1 rejects the null hypothesis if y1,
the observed value of Y1, is ≥ c.
67. Suppose that each of n randomly selected individuals is classified according to his/her genotype
with respect to a particular genetic characteristic
and that the three possible genotypes are AA, Aa,
and aa with long-run proportions (probabilities) θ²,
2θ(1 − θ), and (1 − θ)², respectively (0 < θ < 1).
It is then straightforward to show that the
likelihood is
θ^{2x1} [2θ(1 − θ)]^{x2} (1 − θ)^{2x3}
where x1, x2, and x3 are the number of individuals
in the sample who have the AA, Aa, and aa genotypes, respectively. Show that the most powerful
test for testing H0: θ = .5 versus Ha: θ = .8
rejects the null hypothesis when 2x1 + x2 ≥ c. Is
this test UMP for the alternative Ha: θ > .5?
Explain. [Note: The fact that the joint distribution
of X1, X2, and X3 is multinomial can be used to
obtain the value of c that yields a test with any
desired significance level when n is large.]
68. The error in a measurement is normally distributed with mean μ and standard deviation 1.
Consider a random sample of n errors, and show
that the likelihood ratio test for H0: μ = 0 versus
Ha: μ ≠ 0 rejects the null hypothesis when either
x̄ ≥ c or x̄ ≤ −c. What is c for a test with
α = .05? How does the test change if the standard
deviation of an error is σ0 (known) and the relevant hypotheses are H0: μ = μ0 versus Ha: μ ≠ μ0?
69. Measurement error in a particular situation is
normally distributed with mean value μ and
standard deviation 4. Consider testing H0: μ = 0
versus Ha: μ ≠ 0 based on a sample of n = 16
measurements.
a. Verify that the usual test with significance
level .05 rejects H0 if either x̄ ≥ 1.96 or
x̄ ≤ −1.96. [Note: That this test is unbiased
follows from the fact that the way to capture
the largest area under the z curve above an
interval having width 3.92 is to center that interval at 0 (so it extends from −1.96 to 1.96).]
b. Consider the test that rejects H0 if either
x̄ ≥ 2.17 or x̄ ≤ −1.81. What is α, that is, π(0)?
c. What is the power of the test proposed in (b)
when μ = .1 and when μ = −.1? (Note that .1
and −.1 are very close to the null value, so one
would not expect large power for such values).
Is the test unbiased?
d. Calculate the power of the usual test when
μ = .1 and when μ = −.1. Is the usual test a
most powerful test? [Hint: Refer to your calculations in (c).] [Note: It can be shown that
the usual test is most powerful among all
unbiased tests.]
70. A test of whether a coin is fair will be based on
n = 50 tosses. Let X be the resulting number of
heads. Consider two rejection regions: R1 = {x:
either x ≤ 17 or x ≥ 33} and R2 = {x: either
x ≤ 18 or x ≥ 37}.
a. Determine the significance level (type I error
probability) for each rejection region.
b. Determine the power of each test when
p = .49. Is the test with rejection region R1 a
uniformly most powerful level .033 test?
Explain.
c. Is the test with rejection region R2 unbiased?
Explain.
d. Sketch the power function for the test with
rejection region R1, and then do so for the test
with the rejection region R2. What does your
intuition suggest about the desirability of using
the rejection region R2?
71. Consider Example 9.24.
a. With t = (x̄ − μ0)/(s/√n), show that the likelihood ratio is equal to λ = [1 + t²/(n − 1)]^{−n/2},
and therefore the approximate chi-square statistic is −2[ln(λ)] = n ln[1 + t²/(n − 1)].
b. Apply part (a) to test the hypotheses of
Exercise 55, using the data given there. Compare your results with the answers found in
Exercise 55.
Supplementary Exercises (72–94)
72. A sample of 50 lenses used in eyeglasses yields a
sample mean thickness of 3.05 mm and a sample
standard deviation of .34 mm. The desired true
average thickness of such lenses is 3.20 mm. Does
the data strongly suggest that the true average
thickness of such lenses is something other than
what is desired? Test using α = .05.
73. In Exercise 72, suppose the experimenter had
believed before collecting the data that the value
of σ was approximately .30. If the experimenter
wished the probability of a type II error to be .05
when μ = 3.00, was a sample size of 50 unnecessarily large?
74. It is specified that a certain type of iron should
contain .85 g of silicon per 100 g of iron (.85%).
The silicon content of each of 25 randomly
selected iron specimens was determined, and the
accompanying MINITAB output resulted from a
test of the appropriate hypotheses.
Variable    N    Mean     StDev    SE Mean   T      P
sil cont    25   0.8880   0.1807   0.0361    1.05   0.30
a. What hypotheses were tested?
b. What conclusion would be reached for a significance level of .05, and why? Answer the
same question for a significance level of .10.
75. One method for straightening wire before coiling
it to make a spring is called “roller straightening.”
The article “The Effect of Roller and Spinner
Wire Straightening on Coiling Performance and
Wire Properties” (Springs, 1987: 27–28) reports
on the tensile properties of wire. Suppose a sample
of 16 wires is selected and each is tested to determine tensile strength (N/mm²). The resulting sample mean and standard deviation are 2160 and 30,
respectively.
a. The mean tensile strength for springs made
using spinner straightening is 2150 N/mm².
What hypotheses should be tested to determine
whether the mean tensile strength for the roller
method exceeds 2150?
b. Assuming that the tensile strength distribution
is approximately normal, what test statistic
would you use to test the hypotheses in part (a)?
c. What is the value of the test statistic for this
data?
d. What is the P-value for the value of the test
statistic computed in part (c)?
e. For a level .05 test, what conclusion would you
reach?
76. A new method for measuring phosphorus levels in
soil is described in the article “A Rapid Method to
Determine Total Phosphorus in Soils” (Soil Sci.
Amer. J., 1988: 1301–1304). Suppose a sample of
11 soil specimens, each with a true phosphorus
content of 548 mg/kg, is analyzed using the new
method. The resulting sample mean and standard
deviation for phosphorus level are 587 and 10,
respectively.
a. Is there evidence that the mean phosphorus
level reported by the new method differs significantly from the true value of 548 mg/kg?
Use α = .05.
b. What assumptions must you make for the test
in part (a) to be appropriate?
77. The article “Orchard Floor Management Utilizing
Soil-Applied Coal Dust for Frost Protection”
(Agric. Forest Meteorol., 1988: 71–82) reports
the following values for soil heat flux of eight
plots covered with coal dust.
34.7 35.4 34.7 37.7 32.5 28.0 18.4 24.9
The mean soil heat flux for plots covered only
with grass is 29.0. Assuming that the heat-flux
distribution is approximately normal, does the
data suggest that the coal dust is effective in
increasing the mean heat flux over that for
grass? Test the appropriate hypotheses using
α = .05.
78. The article “Caffeine Knowledge, Attitudes, and
Consumption in Adult Women” (J. Nutrit. Ed.,
1992: 179–184) reports the following summary
data on daily caffeine consumption for a sample
of adult women: n = 47, x̄ = 215 mg, s = 235
mg, and range = 5–1176.
a. Does it appear plausible that the population
distribution of daily caffeine consumption is
normal? Is it necessary to assume a normal
population distribution to test hypotheses
about the value of the population mean consumption? Explain your reasoning.
b. Suppose it had previously been believed that
mean consumption was at most 200 mg. Does
the given data contradict this prior belief? Test
the appropriate hypotheses at significance level
.10 and include a P-value in your analysis.
79. The accompanying output resulted when MINITAB was used to test the appropriate hypotheses
about true average activation time based on the
data in Exercise 56. Use this information to reach
a conclusion at significance level .05 and also at
level .01.
TEST OF MU = 25.000 VS MU G.T. 25.000

         N     MEAN    STDEV   SE MEAN      T   P VALUE
time    13   27.923    5.619     1.559   1.88     0.043
80. The true average breaking strength of ceramic
insulators of a certain type is supposed to be at
least 10 psi. They will be used for a particular
application unless sample data indicates conclusively that this specification has not been met.
A test of hypotheses using $\alpha = .01$ is to be based
on a random sample of ten insulators. Assume
that the breaking-strength distribution is normal
with unknown standard deviation.
a. If the true standard deviation is .80, how
likely is it that insulators will be judged satisfactory when true average breaking strength is
actually only 9.5? Only 9.0?
b. What sample size would be necessary to have
a 75% chance of detecting that true average
breaking strength is 9.5 when the true standard deviation is .80?
81. The accompanying observations on residual flame
time (sec) for strips of treated children’s nightwear
were given in the article “An Introduction to Some
Precision and Accuracy of Measurement Problems” (J. Test. Eval., 1982: 132–140). Suppose a
true average flame time of at most 9.75 had been
mandated. Does the data suggest that this condition
has not been met? Carry out an appropriate test
after first investigating the plausibility of assumptions that underlie your method of inference.
9.85 9.94 9.88 9.93 9.85 9.95 9.75 9.75 9.95 9.77
9.83 9.93 9.67 9.92 9.92 9.87 9.74 9.89 9.67 9.99
82. The incidence of a certain type of chromosome
defect in the U.S. adult male population is
believed to be 1 in 75. A random sample of 800
individuals in U.S. penal institutions reveals 16
who have such defects. Can it be concluded that
the incidence rate of this defect among prisoners
differs from the presumed rate for the entire adult
male population?
a. State and test the relevant hypotheses using
$\alpha = .05$. What type of error might you have
made in reaching a conclusion?
b. What P-value is associated with this test?
Based on this P-value, could $H_0$ be rejected at
significance level .20?
83. In an investigation of the toxin produced by a
certain poisonous snake, a researcher prepared 26
different vials, each containing 1 g of the toxin, and
then determined the amount of antitoxin needed to
neutralize the toxin. The sample average amount of
antitoxin necessary was found to be 1.89 mg, and
the sample standard deviation was .42. Previous
research had indicated that the true average neutralizing amount was 1.75 mg/g of toxin. Does the
new data contradict the value suggested by prior
research? Test the relevant hypotheses using the
P-value approach. Does the validity of your analysis depend on any assumptions about the population
distribution of neutralizing amount? Explain.
84. The sample average unrestrained compressive
strength for 45 specimens of a particular type of
brick was computed to be 3107 psi, and the sample
standard deviation was 188. The distribution of
unrestrained compressive strength may be somewhat skewed. Does the data strongly indicate that
the true average unrestrained compressive
strength is less than the design value of 3200?
Test using $\alpha = .001$.
85. To test the ability of auto mechanics to identify
simple engine problems, an automobile with a
single such problem was taken in turn to 72 different car repair facilities. Only 42 of the 72 mechanics who worked on the car correctly identified the
problem. Does this strongly indicate that the true
proportion of mechanics who could identify this
problem is less than .75? Compute the P-value and
reach a conclusion accordingly.
86. When $X_1, X_2, \ldots, X_n$ are independent Poisson
variables, each with parameter $\lambda$, and $n$ is large,
the sample mean $\bar{X}$ has approximately a normal
distribution with $\mu = E(\bar{X}) = \lambda$ and
$\sigma^2 = V(\bar{X}) = \lambda/n$. This implies that
$$Z = \frac{\bar{X} - \lambda}{\sqrt{\lambda/n}}$$
has approximately a standard normal distribution.
For testing $H_0$: $\lambda = \lambda_0$, we can replace $\lambda$ by $\lambda_0$ in
the equation for $Z$ to obtain a test statistic. This
statistic is actually preferred to the large-sample
statistic with denominator $S/\sqrt{n}$ (when the $X_i$'s
are Poisson) because it is tailored explicitly to the
Poisson assumption. If the number of requests for
consulting received by a certain statistician during
a 5-day work week has a Poisson distribution and
the total number of consulting requests during a
36-week period is 160, does this suggest that the
true average number of weekly requests exceeds
4.0? Test using $\alpha = .02$.
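The statistic described in Exercise 86 can be sketched in a few lines of Python. This is only an illustrative sketch, not the text's own solution: the function name `poisson_mean_z_test` and its argument names are assumptions introduced here, and the upper-tailed P-value is computed from the standard normal CDF in the standard library.

```python
from math import sqrt
from statistics import NormalDist


def poisson_mean_z_test(total, n, lam0):
    """Upper-tailed large-sample z test of H0: lambda = lam0 for Poisson data.

    total : sum of the n observed counts
    n     : number of observations (assumed large)
    lam0  : null value of the Poisson mean
    """
    xbar = total / n
    # Under H0 the variance of the sample mean is lam0 / n.
    z = (xbar - lam0) / sqrt(lam0 / n)
    # Ha: lambda > lam0, so the P-value is the upper-tail area.
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


# Numbers taken from the exercise: 160 requests over 36 weeks, lam0 = 4.0.
z, p_value = poisson_mean_z_test(total=160, n=36, lam0=4.0)
alpha = 0.02
print(f"z = {z:.3f}, P-value = {p_value:.4f}, reject H0: {p_value <= alpha}")
```

The denominator uses $\lambda_0$ rather than the sample standard deviation, which is the design choice the exercise points out.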
87. A hot-tub manufacturer advertises that with its
heating equipment, a temperature of 100°F can be
achieved in at most 15 min. A random sample of 32
tubs is selected, and the time necessary to achieve a
100°F temperature is determined for each tub. The
sample average time and sample standard deviation
are 17.5 min and 2.2 min, respectively. Does this
data cast doubt on the company’s claim? Compute
the P-value and use it to reach a conclusion at level
.05 (assume that the heating-time distribution is
approximately normal).
88. Chapter 8 presented a CI for the variance $\sigma^2$ of a
normal population distribution. The key result
there was that the rv $\chi^2 = (n-1)S^2/\sigma^2$ has a
chi-squared distribution with $n - 1$ df. Consider
the null hypothesis $H_0$: $\sigma^2 = \sigma_0^2$ (equivalently,
$\sigma = \sigma_0$). Then when $H_0$ is true, the test statistic
$\chi^2 = (n-1)S^2/\sigma_0^2$ has a chi-squared distribution
with $n - 1$ df. If the relevant alternative is
$H_a$: $\sigma^2 > \sigma_0^2$, rejecting $H_0$ if
$(n-1)S^2/\sigma_0^2 \ge \chi^2_{\alpha,n-1}$ gives a test with
significance level $\alpha$. To
ensure reasonably uniform characteristics for a
particular application, it is desired that the true
standard deviation of the softening point of a
certain type of petroleum pitch be at most .50°C.
The softening points of ten different specimens
were determined, yielding a sample standard deviation of .58°C. Does this strongly contradict the
uniformity specification? Test the appropriate
hypotheses using $\alpha = .01$.
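As a rough illustration of the upper-tailed variance test in Exercise 88, the sketch below computes the statistic and compares it with a critical value. The helper name `variance_chi2_statistic` is an assumption made here, and the critical value 21.666 for $\chi^2_{.01,9}$ is entered by hand from a standard chi-squared table rather than computed.

```python
def variance_chi2_statistic(n, s, sigma0):
    """Test statistic (n - 1) s^2 / sigma0^2 for H0: sigma^2 = sigma0^2."""
    return (n - 1) * s**2 / sigma0**2


# Assumption: upper-tail critical value chi^2_{.01, 9}, copied from a
# standard chi-squared table (not computed in this sketch).
CHI2_CRIT_01_DF9 = 21.666

# Numbers taken from the exercise: n = 10, s = .58, sigma0 = .50.
stat = variance_chi2_statistic(n=10, s=0.58, sigma0=0.50)
reject = stat >= CHI2_CRIT_01_DF9
print(f"chi-squared statistic = {stat:.4f}, reject H0 at level .01: {reject}")
```

Only large values of the statistic count against $H_0$ here, because the alternative is $\sigma^2 > \sigma_0^2$.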
89. Referring to Exercise 88, suppose an investigator
wishes to test $H_0$: $\sigma^2 = .04$ versus $H_a$: $\sigma^2 < .04$
based on a sample of 21 observations. The computed value of $20s^2/.04$ is 8.58. Place bounds
on the P-value and then reach a conclusion at
level .01.
90. When the population distribution is normal and $n$ is
large, the sample standard deviation $S$ has approximately a normal distribution with $E(S) \approx \sigma$ and
$V(S) \approx \sigma^2/(2n)$. We already know that in this
case, for any $n$, $\bar{X}$ is normal with $E(\bar{X}) = \mu$ and
$V(\bar{X}) = \sigma^2/n$.
a. Assuming that the underlying distribution is
normal, what is an approximately unbiased
estimator of the 99th percentile $\theta = \mu + 2.33\sigma$?
b. As discussed in Section 6.4, when the $X_i$'s are
normal $\bar{X}$ and $S$ are independent rv's (one measures location whereas the other measures
spread). Use this to compute $V(\hat{\theta})$ and $\sigma_{\hat{\theta}}$ for
the estimator $\hat{\theta}$ of part (a). What is the estimated
standard error $\hat{\sigma}_{\hat{\theta}}$?
c. Write a test statistic for testing $H_0$: $\theta = \theta_0$ that
has approximately a standard normal distribution when $H_0$ is true. If soil pH is normally
distributed in a certain region and 64 soil samples yield $\bar{x} = 6.33$, $s = .16$, does this provide
strong evidence for concluding that at most
99% of all possible samples would have a pH
of less than 6.75? Test using $\alpha = .01$.
91. Let $X_1, X_2, \ldots, X_n$ be a random sample from an
exponential distribution with parameter $\lambda$. Then it
can be shown that $2\lambda \sum X_i$ has a chi-squared distribution with $\nu = 2n$ (by first showing that $2\lambda X_i$ has
a chi-squared distribution with $\nu = 2$).
a. Use this fact to obtain a test statistic and rejection region that together specify a level $\alpha$ test
for $H_0$: $\mu = \mu_0$ versus each of the three commonly encountered alternatives. [Hint: $E(X_i) = \mu = 1/\lambda$, so $\mu = \mu_0$ is equivalent to $\lambda = 1/\mu_0$.]
b. Suppose that ten identical components, each
having exponentially distributed time until
failure, are tested. The resulting failure times
are
95 16 11 3 42 71 225 64 87 123
Use the test procedure of part (a) to decide
whether the data strongly suggests that the
true average lifetime is less than the previously
claimed value of 75.
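One way to carry out the calculation in Exercise 91(b) is sketched below, under the assumption that the statistic $2\sum x_i/\mu_0$ is referred to a chi-squared distribution with $2n$ df and that small values support $H_a$: $\mu < \mu_0$. The helper `chi2_cdf_even_df` is a name introduced here. It relies on the closed-form chi-squared CDF that holds only for even degrees of freedom, so no external library is needed.

```python
from math import exp


def chi2_cdf_even_df(x, df):
    """CDF of a chi-squared rv with EVEN df, using the closed form
    1 - exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 0.0
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return 1.0 - exp(-half) * total


# Failure times and claimed mean taken from the exercise.
times = [95, 16, 11, 3, 42, 71, 225, 64, 87, 123]
mu0 = 75

stat = 2 * sum(times) / mu0            # 2 * lambda0 * sum(x_i), lambda0 = 1/mu0
df = 2 * len(times)                    # nu = 2n
p_value = chi2_cdf_even_df(stat, df)   # lower-tail area for Ha: mu < mu0
print(f"statistic = {stat:.3f}, df = {df}, P-value = {p_value:.3f}")
```

The lower tail is used because a true mean below $\mu_0$ tends to make $\sum x_i$, and hence the statistic, small.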
92. Suppose the population distribution is normal with
known $\sigma$. Let $\gamma$ be such that $0 < \gamma < \alpha$. For testing
$H_0$: $\mu = \mu_0$ versus $H_a$: $\mu \ne \mu_0$, consider the test
that rejects $H_0$ if either $z \ge z_\gamma$ or $z \le -z_{\alpha-\gamma}$, where
the test statistic is $Z = (\bar{X} - \mu_0)/(\sigma/\sqrt{n})$.
a. Show that P(type I error) $= \alpha$.
b. Derive an expression for $\beta(\mu')$. [Hint: Express
the test in the form "reject $H_0$ if either
$\bar{x} \ge c_1$ or $\le c_2$."]
c. Let $\Delta > 0$. For what values of $\gamma$ (relative to $\alpha$)
will $\beta(\mu_0 + \Delta) < \beta(\mu_0 - \Delta)$?
93. After a period of apprenticeship, an organization
gives an exam that must be passed to be eligible
for membership. Let $p$ = P(randomly chosen
apprentice passes). The organization wishes an
exam that most but not all should be able to pass,
so it decides that $p = .90$ is desirable. For a particular exam, the relevant hypotheses are $H_0$:
$p = .90$ versus the alternative $H_a$: $p \ne .90$. Suppose ten people take the exam, and let $X$ = the
number who pass.
a. Does the lower-tailed region {0, 1, . . . , 5}
specify a level .01 test?
b. Show that even though $H_a$ is two-sided, no
two-tailed test is a level .01 test.
c. Sketch a graph of $\beta(p')$ as a function of $p'$ for
this test. Is this desirable?
94. A service station has six gas pumps. When no
vehicles are at the station, let $p_i$ denote the probability that the next vehicle will select pump $i$
($i = 1, 2, \ldots, 6$). Based on a sample of size $n$,
we wish to test $H_0$: $p_1 = \cdots = p_6$ versus the alternative $H_a$: $p_1 = p_3 = p_5$, $p_2 = p_4 = p_6$ (note
that $H_a$ is not a simple hypothesis). Let $X$ be the
number of customers in the sample that select an
even-numbered pump.
a. Show that the likelihood ratio test rejects $H_0$ if
either $X \ge c$ or $X \le n - c$. [Hint: When $H_a$ is
true, let $\theta$ denote the common value of $p_2$, $p_4$,
and $p_6$.]
b. Let $n = 10$ and $c = 9$. Determine the power of
the test both when $H_0$ is true and also when
$p_2 = p_4 = p_6 = \frac{1}{10}$, $p_1 = p_3 = p_5 = \frac{7}{30}$.
Bibliography
See the bibliographies for Chapters 7 and 8.