Statistics E100
Final Exam
Extra Practice Problems
1. Multiple Choice: require no justification. Note: these parts are not related.
a) A magazine states the following hypotheses about the average age of their subscribers:
Ho: µ= 28 years vs. Ha: µ > 28 years.
Making a Type I error with this test means that:
a) The sample result gives little evidence to conclude that the average age of the subscribers is greater
than 28 years when in fact the average age is 28 years.
b) The sample result gives little evidence to conclude that the average age of the subscribers is greater
than 28 years when in fact the average age is much greater than 28 years.
c) The sample result gives strong evidence that the average age is greater than 28 years when in fact
the average age IS 28 years.
d) The sample result gives strong evidence that the average age is greater than 28 years when in fact
the average age is much greater than 28 years.
b) The business college computing center wants to determine the proportion of business students who have
personal computers at home. If the proportion differs from 25%, then the lab will modify a proposed
enlargement of its facilities. Suppose data is collected from 100 randomly chosen students of the business
college and the sample proportion is found to be 34%. What is the test statistic for testing H0: π = 0.25
versus HA: π > 0.25?
a) 1.65
b) 2.08
c) 1.90
d) 1.78
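As a check on the arithmetic in part (b), here is a minimal Python sketch of the usual one-sample z statistic for a proportion, z = (p-hat − π0) / sqrt(π0(1 − π0)/n). The function name is an illustrative choice, not something defined in the exam.

```python
import math

def proportion_z(p_hat, pi0, n):
    """One-sample z statistic for a proportion; the standard error uses the null value pi0."""
    se = math.sqrt(pi0 * (1 - pi0) / n)
    return (p_hat - pi0) / se

# Values from the problem: sample proportion 0.34, null value 0.25, n = 100.
z = proportion_z(0.34, 0.25, 100)
print(round(z, 2))
```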
c) What is the result of the hypothesis test in Problem (b) above?
a) Reject the null hypothesis
b) Fail to reject the null hypothesis
d) As the degrees of freedom for the t distribution increase, the distribution approaches
1) The value of zero for the mean.
2) The t distribution.
3) The normal distribution.
4) The binomial distribution.
e) Which statement is NOT true about hypothesis tests?
a) Hypothesis tests are only valid when the sample is representative of the population for
the question of interest.
b) Hypotheses are statements about the population represented by the samples.
c) Hypotheses are statements about the sample (or samples) from the population.
d) Conclusions are statements about the population represented by the samples.
f) In regression analysis, if the coefficient of determination (R²) is 1.0, then:
a. SSE (error sum of squares) must be 1.0
b. SSR (regression sum of squares) must be 1.0
c. SSE must be 0.0
d. SSR must be 0.0
g) A sample of 200 light bulbs was tested, and 11 were found to be defective. What is the 95% confidence
interval around this sample proportion?
a) 0.055 ± 0.0316
b) 0.055 ± 0.0079
c) 0.055 ± 0.0158
d) 0.055 ± 0.0180
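The margin of error in part (g) can be checked with a short sketch of the large-sample interval p-hat ± z* · sqrt(p-hat(1 − p-hat)/n). The multiplier z* = 1.96 for 95% confidence and the function name are assumptions made for illustration.

```python
import math

def proportion_margin(successes, n, z_star=1.96):
    """Return (p_hat, margin) for the large-sample confidence interval for a proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, z_star * se

# 11 defective bulbs out of 200 tested.
p_hat, margin = proportion_margin(11, 200)
print(f"{p_hat:.3f} +/- {margin:.4f}")
```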
h) You wish to estimate the proportion of shoppers that use credit cards. Determine the sample size needed
if the margin of error should be at most 0.01 (that is, we want the confidence interval to be +/- .01) and the
confidence level is 95%.
a) 8,298
b) 23,050
c) 15,914
d) 9,604
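A sketch of the sample-size calculation for part (h) follows. It assumes the conservative guess p = 0.5 (the problem gives no prior estimate of the proportion) and z* = 1.96; both are assumptions, and the function name is hypothetical.

```python
import math

def sample_size_for_proportion(margin, z_star=1.96, p_guess=0.5):
    """Smallest n with z* * sqrt(p(1-p)/n) <= margin, using p_guess for the unknown p."""
    raw = (z_star ** 2) * p_guess * (1 - p_guess) / margin ** 2
    # Round first to suppress floating-point noise, then round up to a whole subject.
    return math.ceil(round(raw, 6))

n_needed = sample_size_for_proportion(0.01)
print(n_needed)
```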
i) Suppose individuals with a certain gene have a 0.4 probability of eventually contracting a particular
disease. If 15 individuals with the gene participate in a lifetime study, what is the distribution of the random
variable X describing the number of these individuals who will contract the disease?
a) X is a binomial random variable with n=6 and p=1.897
b) X is a normally distributed random variable with mean 15 and variance 0.4
c) X is a binomial random variable with n=15 and p=0.4
d) X is a normally distributed random variable with mean 6 and variance 1.897
e) None of the above
j) After performing a simple linear regression, we calculated the residuals and obtained the residual plot
shown below. Does the plot indicate any potential problems with the regression?
a) The plot indicates the residuals are not normally distributed.
b) The plot shows curvature. Hence, a linear model is not appropriate.
c) The plot indicates that the error variance is not constant.
d) All of the above
e) There are no apparent problems.
k) What do residuals represent in the simple linear regression model?
a) The difference between the actual Y values and the mean of Y.
b) The difference between the actual Y values and the predicted Y values.
c) The square root of the slope.
d) The predicted value of Y for the average X value.
e) None of the above.
l) The probability that a region prone to hurricanes will be hit by a hurricane in any single year is 0.1 and
independent of other years. What is the probability of a hurricane hit at least once in the next 5 years?
a) 0.00001
b) 0.40951
c) 0.5
d) 0.99999
e) None of the above
m) What is the expected number of hurricanes to hit the area described above in the next 90 years?
a) 9
b) 3
c) 8.1
d) 2.85
e) None of the above
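The two hurricane questions reduce to the complement rule and the mean of a count of independent yearly trials. The sketch below assumes each year is treated as a single hit/no-hit trial with probability 0.1, which is one reading of the problem.

```python
p_hit = 0.1  # probability of a hurricane hit in any single year (from the problem)

# At least one hit in 5 independent years = 1 - P(no hit in all 5 years).
p_at_least_one = 1 - (1 - p_hit) ** 5

# Expected number of hit years in 90 years = n * p.
expected_hits = 90 * p_hit

print(round(p_at_least_one, 5), round(expected_hits, 2))
```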
q) For which of the following hypothesis tests would the p-value be the same whether
the sample mean is 44 or 46 (see table to the right)?
a) I.
b) I. and IV.
c) II. and III.
d) IV.
e) Stop bothering me with these silly questions.
r) (3 points) We are told that a 95% prediction interval for a response variable, y, is (23.2, 35.6)
from a simple regression on a sample of n = 100 observations at x* = 10. Which of the following is
a reasonable estimate for the confidence interval for µy at x* = 10?
a. (13.2, 45.6)
b. (24.2, 27.8)
c. (28.6, 30.2)
d. (33.2, 45.6)
2. (25 points total) Kellogg’s wants to increase sales of its Fruit Loops cereal, and decides to run an
experiment at Stop & Shop stores in New England. They randomize which shelf (bottom vs. middle
vs. top) Fruit Loops is placed at a total of 150 stores (50 on each shelf). The variables collected
were then:
sales: number of boxes sold at the store in one day
middle: a 0/1 binary variable to indicate if Fruit Loops was on the middle shelf
top: a 0/1 binary variable to indicate if Fruit Loops was on the top shelf
A regression was run in SPSS, and the results are shown below:
a) (7 points) Is there any evidence that sales varied across the 3 shelf locations? Perform a formal
hypothesis test to determine this: be sure to include your hypotheses, the test statistic, the degrees of
freedom, the p-value, and your conclusion in context of the problem.
b) (4 points) What is the interpretation of the coefficient, b1, for the middle variable in the above
regression model?
c) (6 points) Calculate the 95% confidence interval for the middle variable in the above regression
model.
d) (4 points) Which shelf location is predicted to have the most sales? Which shelf location is
predicted to have the fewest sales? Please justify.
e) (4 points) What percent of total variability in sales can be predicted by this model?
3. (15 points total) Suppose that past history shows that 35% of college students prefer Pepsi over
Coca-Cola.
a. (4 points) A sample of 5 students is selected. What is the probability that at least 1 prefers Pepsi?
b. (6 points) A sample of size 49 is collected. What are the mean and standard deviation for the number of
students who prefer Pepsi in this sample?
c. (5 points) In this sample of 49 students, what is the probability that the majority (strictly more than half)
of the students selected prefer Pepsi to Coca-Cola?
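For checking answers to Problem 3, the sketch below treats the number of Pepsi drinkers as binomial with p = 0.35. It computes the exact binomial tail for part (c); a course solution may instead use a normal approximation, so small differences are to be expected. The helper name binom_pmf is illustrative.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability P(X = k) for n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

p = 0.35

# Part (a): at least 1 of 5 prefers Pepsi.
p_at_least_one = 1 - (1 - p) ** 5

# Part (b): mean and standard deviation of the count in a sample of 49.
n = 49
mean = n * p
sd = math.sqrt(n * p * (1 - p))

# Part (c): strictly more than half of 49 means 25 or more.
p_majority_exact = sum(binom_pmf(k, n, p) for k in range(25, n + 1))

print(round(p_at_least_one, 4), round(mean, 2), round(sd, 4), p_majority_exact)
```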
4. (29 points total) An investigator is interested in modeling the progression over time in the Men’s
100 meter run in the Olympics. He measures 2 variables:
time: the winning time in the men’s 100 meter sprint, in seconds
year: the year of the Olympics (from 1900 to 2008)
Some relevant SPSS output is shown below:
a) (7 points) Are time and year significantly associated at α = 0.05? Be sure to include the
hypotheses, the test statistic, the degrees of freedom, the p-value, and your conclusion in the context
of the problem.
b) (4 points) What is the estimated correlation between time and year?
c) (3 points) What is the estimated standard deviation of winning times within Olympic year (aka,
standard deviation of the residuals)?
d) (4 points) Based on this model, in what year will the winning times be forecasted to drop below
9 seconds (please round up to the nearest year)?
e) (4 points) In 1 or 2 sentences, please comment on the validity of your forecast in part (d) above.
f) (7 points) Above are the histogram of residuals along with the scatterplot of the residuals vs. the
x-variable, year. Please comment on the validity of the assumptions for this regression model. (I’ve
commented on one; please list it and the others, and comment on the other 3.)
Assumptions:
1. _______________________________: cannot be checked here
2. _______________________________:
3. _______________________________:
4. _______________________________:
5. (16 points total) Kevin is flying directly to Philadelphia on Saturday for a friend's wedding next
weekend. He has his flight booked through US Airways. US Airways reports that whether his
flight is on time or not depends on the weather in Boston. If it is raining in Boston, the flight will be
late 50% of the time. If it is not raining in Boston, the flight will be late only 10% of the time.
There is a forecasted 25% chance for rain on Saturday (assume that this forecast is correct).
a. (8 points) What is the overall probability that Kevin's flight will be delayed? [If Kevin arrives
late, he will miss the beginning of the wedding]. Recall: P(A) = P(A and B) + P(A and B^c)
b. (8 points) Saturday rolls around and Kevin's friend notices Kevin has not arrived at the wedding
on time because his flight was delayed. What is the conditional probability that it actually was
raining up in Boston given the fact that Kevin's flight was delayed?
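Both parts of Problem 5 follow from the law of total probability and Bayes' rule. A minimal sketch, with variable names chosen for illustration:

```python
p_rain = 0.25              # forecast chance of rain in Boston
p_late_given_rain = 0.50   # flight late if raining
p_late_given_dry = 0.10    # flight late if not raining

# Part (a): P(late) = P(late and rain) + P(late and no rain).
p_late = p_rain * p_late_given_rain + (1 - p_rain) * p_late_given_dry

# Part (b): Bayes' rule, P(rain | late) = P(late and rain) / P(late).
p_rain_given_late = p_rain * p_late_given_rain / p_late

print(round(p_late, 3), round(p_rain_given_late, 3))
```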
6. An elevator serving a hospital is designed to hold up to 15 passengers and has a maximum safe
capacity of 2440 pounds. The weight of passengers who use the elevator is normally-distributed
with an average of 149 pounds and a standard deviation of 20 pounds.
a) What is the probability that a single passenger on the elevator weighs between 140 and 150 pounds?
b) What is the probability that a single passenger on the elevator weighs more than 200 pounds?
c) If five passengers enter the elevator together, what is the probability that all five of them weigh 200
pounds or less?
d) What is the probability that the elevator's safe capacity is exceeded by a full load of 15 passengers?
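The elevator questions can be checked with Python's standard-library NormalDist. The sketch assumes passenger weights are independent, which the problem does not state explicitly; under that assumption the total weight of 15 passengers is normal with mean 15 × 149 and standard deviation 20 × sqrt(15).

```python
import math
from statistics import NormalDist

weight = NormalDist(mu=149, sigma=20)

# Part (a): a single passenger between 140 and 150 pounds.
p_a = weight.cdf(150) - weight.cdf(140)

# Part (b): a single passenger over 200 pounds.
p_b = 1 - weight.cdf(200)

# Part (c): all five of five independent passengers at 200 pounds or less.
p_c = weight.cdf(200) ** 5

# Part (d): total weight of 15 independent passengers exceeds 2440 pounds.
total = NormalDist(mu=15 * 149, sigma=20 * math.sqrt(15))
p_d = 1 - total.cdf(2440)

print(p_a, p_b, p_c, p_d)
```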
7. (32 points total) GPA’s at Harvard are known to be approximately Normally distributed with a mean of µ
= 3.25 and a standard deviation σ = 0.30.
a. (6 points) Show that 20.33% of Harvard students have a GPA below 3.00.
b. (6 points) There are 6 students living in a suite in a Harvard house. If we assume their GPA’s to be
independent, what is the probability that at least one of them has a GPA below 3.00?
c. (6 points) A random sample of 50 Harvard students was taken. Assuming their GPA’s are independent,
what is the probability that at least 20 of them have a GPA below 3.00?
d. (6 points) What is the probability that the average GPA for these 50 randomly sampled students is below
3.00?
e. (8 points) A random sample of 50 Harvard athletes had a mean of x̄ = 3.12 and standard
deviation of s = 0.37 (there is no reason to suspect athletes have the same standard deviation as the
general Harvard population). Perform a formal hypothesis test to determine whether Harvard
athletes have a different mean GPA than all Harvard students. Be sure to include your hypotheses,
the test statistic, the degrees of freedom (if applicable), an estimate of the p-value, and your
conclusion in context of the problem.
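Several parts of Problem 7 can be checked numerically. The sketch below uses the exact normal CDF, so part (a) comes out near 0.2023; the 20.33% quoted in the problem matches a table lookup with z rounded to -0.83. For part (e) it assumes a one-sample t statistic with df = n − 1, since only the sample standard deviation is trusted for athletes.

```python
import math
from statistics import NormalDist

gpa = NormalDist(mu=3.25, sigma=0.30)

# Part (a): proportion of students with GPA below 3.00.
p_below = gpa.cdf(3.00)

# Part (b): at least one of 6 independent students below 3.00.
p_any_of_six = 1 - (1 - p_below) ** 6

# Part (d): the mean of 50 independent GPAs has standard deviation sigma / sqrt(n).
mean_of_50 = NormalDist(mu=3.25, sigma=0.30 / math.sqrt(50))
p_mean_below = mean_of_50.cdf(3.00)

# Part (e): one-sample t statistic for the athletes' sample (assumed test choice).
t_stat = (3.12 - 3.25) / (0.37 / math.sqrt(50))
df = 50 - 1

print(p_below, p_any_of_six, p_mean_below, t_stat, df)
```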
8. (32 points total) Over the last twenty years, the daily change (in decimal form) of a mutual fund
based on the S&P 500 Index fund is known to follow a normal distribution with a mean of
μ = 0.002849 and a sd of σ = 0.01175.
a. (8 points) What is the probability that this mutual fund loses money in any one day?
b. (8 points) What is the probability that this mutual fund loses money in at least one day over the
next week (5 days) assuming days are independent?
c. (8 points) What is the approximate probability that this mutual fund will lose money in at least
15 of the next 30 days assuming days are independent?
d. (8 points) Let X be a random variable to represent the average daily change across 250 days
(which is essentially a full year of business days). If you assume each day is independent, what is
the probability that your investment will have an average change below zero (essentially meaning
the fund lost money during the year)?
e. Now assume instead that this mutual fund's daily change is not independent from day to day,
but it actually has a positive correlation from one day to the next. Would the probability of losing
money increase, decrease or stay the same from your answer in part (d)? Please justify your answer.
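Parts (a), (b), and (d) of Problem 8 can be sketched with NormalDist as well. A loss is taken to mean a daily change below zero, and days are assumed independent as the problem states for those parts.

```python
import math
from statistics import NormalDist

daily = NormalDist(mu=0.002849, sigma=0.01175)

# Part (a): probability of a loss (change below zero) on any one day.
p_loss_day = daily.cdf(0)

# Part (b): at least one losing day in 5 independent days.
p_loss_in_week = 1 - (1 - p_loss_day) ** 5

# Part (d): the average of 250 independent days has sd = sigma / sqrt(250).
yearly_mean = NormalDist(mu=0.002849, sigma=0.01175 / math.sqrt(250))
p_loss_year = yearly_mean.cdf(0)

print(p_loss_day, p_loss_in_week, p_loss_year)
```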
9. (26 points total) The table and graph below show numerical and graphical summaries of the
monthly precipitation (in inches) over the last 60 months in Cambridge, MA.
a. (8 points) Is this distribution left-skewed, right-skewed, or symmetric? Briefly justify your
answer.
b. (10 points) Identify any suspected high outliers in the data using the quantitative methods
discussed in class. Show your work.
c. (8 points) Calculate the mean and standard deviation of monthly precipitation in centimeters (1
inch = 2.54 cm).
10. (20 points total) Below are the summary statistics for two variables measured on the top 10
grossing box office movies so far in 2011: how much revenue they generated in US markets and the
amount of revenue generated in all international markets combined (both in millions of US dollars),
along with the correlation table between the two, and the related scatterplot with international
revenue on the y-axis, and US revenue on the x-axis:
a. (7 points) What is the formula for the least-squares regression line to predict international
revenue based on US revenue?
b. (4 points) What is the predicted amount of international revenue for a movie that generated 162
million dollars in the US?
c. (4 points) Kung Fu Panda 2 made 162 million dollars in the US and 614 million dollars
internationally. What is Kung Fu Panda's estimated residual?
d. (5 points) What percentage of variability in international revenue can be explained by US
revenue?
10. (30 points total) Each part of this problem requires a short response with a brief explanation
(simply yes or no will not suffice). Note: these parts are not related.
a. (6 points) In a study of cold symptoms, every one of the 50 study subjects with a cold was found
to be improved 2 weeks after taking ginger pills. The authors concluded that ginger pills cure colds.
What is the major flaw in this study?
b. (6 points) Let H be the event that the Democrats win the majority of the seats in the House of
Representatives, and let S be the event that the Democrats win the majority of the seats in the
Senate. Let P(H) = 0.5, P(S) = 0.6, and P(H or S) = 0.7. Are H and S independent?
c. (5 points) The sensitivity for a diagnostic test, P(positive test | disease), is 0.85 and the specificity
of the same test, P(negative test | no disease), is also 0.85. Are the two events, (A = having the
disease) and (B = receiving a positive test), independent? Show your work.
d. (6 points) It is known that blue is the favorite color of 30% of young girls and of 70% of young
boys (you can also assume the population is split evenly into 50% boys and 50% girls). Are the two
events (being a boy) and (favorite color is blue) independent?
e. (6 points) Suppose that A and B are two disjoint events within the same sample space. In
addition, let P(A) = 1/8 and P(B) = 1/4. Are events A and B independent? Explain or show your
calculations.
f. (6 points) In 1990 a research organization sent questionnaires to all of the approximately 15,000
high school systems in the United States. These questionnaires asked about computer usage in the
school system. As many as 3,600 school systems returned answers. Of these 3,600, 60% indicated
that some of their students used computers.
In a speech shortly thereafter, an authority on the use of computers in high school education cited
this study as evidence that "students in 60% of the high school systems in the United States use
computers during their high school careers." Do you regard 60% as a trustworthy estimate of the
proportion of school systems providing computer access in 1990? In two sentences or fewer, explain
your answer.
g. (6 points) A company in Hawaii builds bridges for married couples to walk over during their
weddings. There are 3 islands in Hawaii that each have the same mean and variance of husbands’
weights and the wives’ weights. However, the relationship of weights within couples is different
on the 3 islands: in Inde: the weights within couples are independent; in Posi: they are positively
correlated; and in Nega the weights are negatively correlated. On which island should the company
build the strongest bridge? Defend your answer in 2 sentences or less.
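The independence questions in parts (b), (d), and (e) all come down to comparing P(A and B) with P(A)·P(B), or equivalently P(B | A) with P(B). A short sketch of those comparisons, with illustrative variable names:

```python
import math

# Part (b): inclusion-exclusion gives P(H and S) = P(H) + P(S) - P(H or S).
p_h, p_s, p_h_or_s = 0.5, 0.6, 0.7
p_h_and_s = p_h + p_s - p_h_or_s
independent_b = math.isclose(p_h_and_s, p_h * p_s)

# Part (d): compare P(blue) overall with P(blue | boy) = 0.7.
p_boy = 0.5
p_blue = p_boy * 0.7 + (1 - p_boy) * 0.3
independent_d = math.isclose(p_blue, 0.7)

# Part (e): disjoint events have P(A and B) = 0, to be compared with P(A) * P(B).
p_a, p_b = 1 / 8, 1 / 4
independent_e = math.isclose(0.0, p_a * p_b)

print(independent_b, independent_d, independent_e)
```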
11. (35 points total) A researcher is investigating variables that might be associated with death rates
in the US states. He examined data from 2008 for each of the 50 states plus Washington, D.C. The
data included information on the following variables:
deathrate: the annual death rate per one million inhabitants
smokers: the percent of inhabitants who smoke heavily, in percentage points
college: the percent of inhabitants that have a bachelor's degree, in percentage points
As part of his investigation, he ran the following multiple regression model:
deathrate = α + β1(smokers) + β2(college) + ε
This model was fit to the data using the method of least squares. The following results were
obtained from statistical software:
a. (4 points) What is the estimated standard deviation of the residuals?
b. (6 points) Suppose we wish to test the hypotheses H0: β1 = β2 = 0 versus Ha: at least one of the βj is not 0.
What is the value of the appropriate test statistic, the p-value, and conclusion to this test?
c. (6 points) What is the interpretation of the value for b1, the estimated coefficient for the variable
smokers?
d. (7 points) Calculate the 95% confidence interval for β1, the coefficient for the variable smokers.
e. (6 points) Briefly comment on the residual diagnostic plot for this model shown below. Please be specific
and limit your response to 3 sentences or bullet points.
Another researcher, using the same data, ran the following simple linear regression model:
deathrate = α + β1(college) + ε
The following results were obtained from statistical software:
f. (6 points) One researcher concluded that because the coefficient for the variable college was
negative in his results, spending additional money on education to have more college graduates would
decrease the death rate in his state. This researcher therefore recommended more money be spent on
education. The other researcher concluded that because the coefficient for the college variable was
positive in his results, spending additional money on students would increase the death rate. This researcher
therefore recommended less money be spent on education. Why are these two conclusions different even
though the researchers used the same data? Explain using a few concise sentences.
12. (20 points total) It is known that 20% of all Harvard students are varsity athletes. 50% of
varsity athletes eat breakfast on any particular weekday, while only 25% of all other Harvard
students eat breakfast on any particular weekday. Define the events:
A: the event that a student is a varsity athlete
B: the event that a student eats breakfast on a particular weekday
a. (5 points) Are the events A and B independent? Please briefly justify.
b. (5 points) Are the events A and B disjoint? Please briefly justify.
c. (5 points) Find the overall proportion of students that eat breakfast on any particular weekday.
d. (5 points) Given a student is eating breakfast on a particular weekday, what is the probability that
that student is a varsity athlete?
e. (4 points) In actuality the non-varsity-athlete students are comprised of two further subgroups:
30% of them are club athletes and 70% are nonathletes [so there are actually 3 distinct groups in the
Harvard student population: varsity athletes, club athletes, and nonathletes]. Of the club athletes,
40% eat breakfast.
i) What percent of nonathletes eat breakfast?
ii) Given a student is eating breakfast, what is the probability they are a nonathlete?
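Parts (c), (d), and (e) of Problem 12 can be checked with the law of total probability and Bayes' rule. In part (e) the nonathlete breakfast rate is backed out from the 25% rate for all non-varsity students; the variable names are illustrative.

```python
p_varsity = 0.20
p_bfast_varsity = 0.50
p_bfast_other = 0.25  # breakfast rate among all non-varsity students

# Part (c): overall proportion eating breakfast.
p_bfast = p_varsity * p_bfast_varsity + (1 - p_varsity) * p_bfast_other

# Part (d): P(varsity | breakfast) by Bayes' rule.
p_varsity_given_bfast = p_varsity * p_bfast_varsity / p_bfast

# Part (e): non-varsity students are 30% club athletes (40% eat breakfast)
# and 70% nonathletes; solve 0.30 * 0.40 + 0.70 * x = 0.25 for x.
share_club, share_non = 0.30, 0.70
p_bfast_club = 0.40
p_bfast_non = (p_bfast_other - share_club * p_bfast_club) / share_non

# P(nonathlete | breakfast).
p_non_given_bfast = (1 - p_varsity) * share_non * p_bfast_non / p_bfast

print(p_bfast, p_varsity_given_bfast, p_bfast_non, p_non_given_bfast)
```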
13. (55 points total) A study was conducted to determine the association between the maximum
distance at which a highway sign can be read (in feet) and the age of the driver (in years). Forty
drivers of various ages were studied. The summary statistics for distance and age are shown below
in a table from Stata:
a. (8 points) The correlation coefficient between distance and age in this sample is r = -0.5644.
Calculate a and b of the least-squares regression equation that would predict the distance at which a
highway sign can be read given the age of the driver.
b. (10 points) The standard error of b was calculated to be 0.9164 from Stata. Is “age” a significant
predictor of “distance” in this linear model? Conduct this statistical test of H0: β = 0 using α = 0.05.
Be sure to include your hypotheses, test statistic, degrees of freedom if appropriate, either the
p-value or critical value, and your conclusion in terms of the problem.
c. (4 points) What is the predicted distance that a sign can be read for someone who is 40 years old?
d. (6 points) What is R² for this regression model? What is the interpretation of R² here?
The investigators also decided to look at whether wearing glasses had an effect on the
distance a driver could read a sign. Below is the binary-predictor regression output, labeled as
Model A, of the distance someone was able to read the sign predicted from whether or not that
person wore glasses (which has value 1 for those wearing glasses or contact lenses, 0 otherwise):
Model A:
e. (3 points) What is the reference group in this model?
f. (4 points) What is the predicted distance that a sign can be read for someone who wears glasses
based on this model?
Below is the Stata output of a multiple regression, labeled as Model B, of the distance someone was
able to read the sign predicted from “age” and “glasses” (again, which has value 1 for individuals
wearing glasses or contact lenses, 0 otherwise):
Model B:
g. (5 points) What is the interpretation of the value -25.678 in this regression model?
h. (8 points) Compare the results of the two regressions, Model A and Model B, above.
Specifically mention any signs or significance that are different between the two models. Why do
you suspect this is the case?
i. (7 points) Above are the residual vs. fitted scatterplot and histogram of the residuals for the
multiple regression model (Model B) above. Use these plots to comment on whether the
assumptions for this model seem to be valid. Be specific.
14. (5 points) As part of a study on student loan debt, a national agency that underwrites student loans is
examining the differences in student loan debt for undergraduate students. One question the agency would
like to address specifically is whether the mean undergraduate debt of Hispanic students graduating in 2009
is less than the mean undergraduate debt of Asian-American students graduating in 2009. To conduct the
study, a random sample of 92 Hispanic students and a random sample of 110 Asian-American students who
completed an undergraduate degree in 2009 were taken. The undergraduate debt incurred for financing
college for each sampled student was collected. Let µH denote the population average student loan debt for
Hispanic students, and µA the population average student loan debt for Asian-American students. Using the
summary statistics below, test the hypothesis Ho: µA = µH vs. Ha: µA > µH. Clearly interpret your results.

Group            n     mean       Std. Dev.
Hispanic         92    18659.18   4700.04
Asian-American   110   20002.54   5807.52
Total            202   19390.71   5361.05
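One way to compute a test statistic for Problem 14 is the unpooled (Welch) two-sample t statistic, sketched below. Whether the course expects the unpooled or the pooled version, and which degrees-of-freedom rule it uses, are assumptions; the conservative df = min(n1, n2) − 1 is shown only as one common convention.

```python
import math

# Summary statistics from the table (Hispanic and Asian-American samples).
n_h, mean_h, sd_h = 92, 18659.18, 4700.04
n_a, mean_a, sd_a = 110, 20002.54, 5807.52

# Unpooled standard error of the difference in sample means.
se = math.sqrt(sd_h ** 2 / n_h + sd_a ** 2 / n_a)

# Test statistic for the one-sided question "is Hispanic mean debt less than Asian-American?"
t_stat = (mean_a - mean_h) / se

# Conservative degrees of freedom (an assumed convention, not stated in the exam).
df_conservative = min(n_h, n_a) - 1

print(round(se, 2), round(t_stat, 3), df_conservative)
```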
15. (33 points total) An investigator is trying to determine what factors are important in
determining the graduation rate at US colleges. She collects a random sample of 53 four-year
colleges, and records three variables:
GradRate: the graduation rate for the class of 2007 (as a percentage of all entering students that
were full-time, in percentage points: 0 to 100)
Tuition: the tuition in 2007-08 (in thousands of dollars)
SATMath: the median SAT math score for entering freshmen in 2007-08
Below is SPSS’s regression output of predicting GradRate based on Tuition and median
SATMath scores.
a. (5 points) Based on the above model, what is the predicted graduation rate for a college with
tuition of $35 thousand and median math SAT score of 750 (aka, Harvard)?
b. (5 points) In words, what is the interpretation of the coefficient for SATMath (which has value
0.1842) in the above table?
c. (4 points) What is the proportion of total variability in graduation rate that can be explained by
this model?
d. (6 points) Perform a single hypothesis test to determine whether any of the variables are
associated with graduation rate. Be sure to state your hypotheses, test statistic, degrees of freedom
(if applicable), p-value, and conclusion.
e. (7 points) Perform a hypothesis test to determine whether specifically tuition is associated with
graduation rate in the above model. Be sure to state your hypotheses, test statistic, degrees of
freedom (if applicable), p-value, and conclusion.
f. (6 points) The dean at a college sees these results and suggests to his board of trustees that they
raise their tuition in order to improve their graduation rate. What is the major mistake the Dean is
making in concluding from these data that raising their tuition will lead to a higher graduation rate?
16. (12 points) A survey of male and female university students asked which popular musical artist they
preferred. The survey focused on Lady Gaga and Justin Bieber but allowed for “other” artists as well. Some
of the values from the two-way table are missing, but you can determine what they are and answer the given
questions.
Columns: Artist, Male, Female, Total
Lady Gaga: a, 50
Justin Bieber: 90
Other: 100, 140
Total: 150, 300
a) What is the value of “a”?
b) What is the probability that a randomly chosen student will prefer Justin Bieber?
c) Given a student prefers Justin Bieber, what is the probability that they are female?
d) Are gender and artist preference dependent or independent? (do not use a chi-square test)
17. (24 points total) Your younger sister and brother are strong believers in the tooth fairy.
Whenever a baby tooth falls out, your sibling places it under his/her pillow before going to sleep,
and in the morning the tooth fairy replaces it with cash. You observe this week that their baby teeth
are close to falling out. Let A be the event that one of your sister’s teeth will fall out today, and let B
be the event that one of your brother’s teeth will fall out today. You estimate that P(A) = 0.3 and
P(B) = 0.2. Assume that for each sibling at most one tooth will fall out.
a. (6 points) Assuming whether your brother’s tooth falls out is independent of whether your
sister’s tooth falls out, the probability that neither falls out today is 0.56. Demonstrate with
appropriate calculations why this is true.
b. (6 points) Under the assumption of independence as in part (a), what is the probability that
exactly one (i.e., not both) of your siblings’ teeth falls out today?
c. (6 points) Describe a scenario involving your younger siblings where A and B are clearly not
independent events. Be sure to state this scenario in context of this problem (do not just give the
definition of dependence).
d. (6 points) The tooth fairy replaces a tooth with cash with probability 0.25 independently from
child to child. On a given night, 10 children in a town have placed teeth that have fallen out under
their pillows. What is the probability that at least 1 of these 10 children is visited by the tooth fairy?
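The numerical parts of Problem 17 use only the multiplication rule for independent events and the complement rule. A minimal sketch:

```python
p_a, p_b = 0.3, 0.2  # P(sister's tooth falls out), P(brother's tooth falls out)

# Part (a): neither tooth falls out, assuming independence.
p_neither = (1 - p_a) * (1 - p_b)

# Part (b): exactly one tooth falls out, assuming independence.
p_exactly_one = p_a * (1 - p_b) + (1 - p_a) * p_b

# Part (d): at least 1 of 10 children is visited, each with probability 0.25.
p_any_visit = 1 - (1 - 0.25) ** 10

print(round(p_neither, 2), round(p_exactly_one, 2), round(p_any_visit, 4))
```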
18. (16 points total) With the popularity of traditional lotteries waning across the US, many states
are turning to instant games, called “scratch-off tickets,” to lure new players and raise revenue.
However, many critics are concerned that instant gratification scratch-off tickets are more likely to
contribute to gambling addiction and take particular advantage of the poor members of society. A
survey of 100 randomly selected gamblers with below median incomes was conducted in the El
Paso area of Texas to study the association between gambling addiction and the primary type of
gambling (traditional state lottery versus scratch-off tickets). The results are given below.
Primary type of gambling   Diagnosed with a gambling addiction   No gambling addiction   Total
Scratch-off tickets        11                                    39                      50
Traditional lottery        2                                     48                      50
Total                      13                                    87                      100
a. (8 points) Is this significant evidence that the primary type of gambling affects the risk of a
gambling addiction? Test at level α = 0.05 and include the null and alternative hypotheses, the test
statistic, the rejection region, an estimate of the P-value, a statement of whether or not you reject the
null hypothesis, and a sentence summarizing your conclusion.
b. (3 points) Find the difference in proportions of a gambling addiction comparing scratch-off ticket
users to traditional lottery users.
c. (5 points) Find the 95% confidence interval for the difference in gambling addiction for
scratch-off ticket users vs. traditional lottery users.
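A sketch for Problem 18 using the two-proportion z approach: a pooled standard error for the test in part (a) and an unpooled standard error for the interval in part (c). The course may instead intend a chi-square test for part (a), so treat the choice of method as an assumption.

```python
import math

x1, n1 = 11, 50  # scratch-off ticket users diagnosed with an addiction
x2, n2 = 2, 50   # traditional lottery users diagnosed with an addiction

p1, p2 = x1 / n1, x2 / n2

# Part (b): difference in sample proportions.
diff = p1 - p2

# Part (a): z statistic with the pooled proportion under H0: p1 = p2.
p_pool = (x1 + x2) / (n1 + n2)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = diff / se_pool

# Part (c): 95% interval with the unpooled standard error and z* = 1.96.
se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (diff - 1.96 * se_unpooled, diff + 1.96 * se_unpooled)

print(round(diff, 2), round(z, 3), ci)
```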
19. The mean length of stay in a hospital is useful for planning purposes. Suppose that the following
is the distribution of the length of stay in a hospital after a minor operation.
Number of Days   2     3     4
Probability      0.2   0.3   0.5
a) What is the mean (expected value) length of stay?
b) What is the variance of length of stay?
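The mean and variance in Problem 19 follow directly from the definitions E[X] = Σ x·P(x) and Var(X) = Σ (x − E[X])²·P(x). A minimal sketch, with the distribution stored in a dictionary for illustration:

```python
# Length-of-stay distribution from the table: days -> probability.
dist = {2: 0.2, 3: 0.3, 4: 0.5}

# Part (a): expected value.
mean = sum(days * prob for days, prob in dist.items())

# Part (b): variance as the probability-weighted squared deviation from the mean.
variance = sum((days - mean) ** 2 * prob for days, prob in dist.items())

print(round(mean, 2), round(variance, 2))
```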