# 1. Ho: µ= 28 years vs. Ha: µ > 28 years.

## Transcription

Statistics E100 Final Exam Extra Practice Problems

1. Multiple Choice: require no justification. Note: these parts are not related.

a) A magazine states the following hypotheses about the average age of their subscribers: H0: µ = 28 years vs. Ha: µ > 28 years. Making a Type I error with this test means that:

- a) The sample result gives little evidence to conclude that the average age of the subscribers is greater than 28 years when in fact the average age is 28 years.
- b) The sample result gives little evidence to conclude that the average age of the subscribers is greater than 28 years when in fact the average age is much greater than 28 years.
- c) The sample result gives strong evidence that the average age is greater than 28 years when in fact the average age IS 28 years.
- d) The sample result gives strong evidence that the average age is greater than 28 years when in fact the average age is much greater than 28 years.

b) The business college computing center wants to determine the proportion of business students who have personal computers at home. If the proportion differs from 25%, then the lab will modify a proposed enlargement of its facilities. Suppose data is collected from 100 randomly chosen students of the business college and the sample proportion is found to be 34%. What is the test statistic for testing H0: π = 0.25 versus Ha: π > 0.25?

- a) 1.65
- b) 2.08
- c) 1.90
- d) 1.78

c) What is the result of the hypothesis test in Problem (b) above?

- a) Reject the null hypothesis
- b) Fail to reject the null hypothesis

d) As the degrees of freedom for the t distribution increase, the distribution approaches

- 1) The value of zero for the mean.
- 2) The t distribution.
- 3) The normal distribution.
- 4) The binomial distribution.

e) Which statement is NOT true about hypothesis tests?

- a) Hypothesis tests are only valid when the sample is representative of the population for the question of interest.
- b) Hypotheses are statements about the population represented by the samples.
- c) Hypotheses are statements about the sample (or samples) from the population.
- d) Conclusions are statements about the population represented by the samples.

f) In regression analysis, if the coefficient of determination (R²) is 1.0, then:

- a. SSE (error sum of squares) must be 1.0
- b. SSR (regression sum of squares) must be 1.0
- c. SSE must be 0.0
- d. SSR must be 0.0

g) A sample of 200 light bulbs was tested, and 11 were found to be defective. What is the 95% confidence interval around this sample proportion?

- a) 0.055 ± 0.0316
- b) 0.055 ± 0.0079
- c) 0.055 ± 0.0158
- d) 0.055 ± 0.0180

h) You wish to estimate the proportion of shoppers that use credit cards. Determine the sample size needed if the margin of error should be at most 0.01 (that is, we want the confidence interval to be +/- .01) and the confidence level is 95%.

- a) 8,298
- b) 23,050
- c) 15,914
- d) 9,604

i) Suppose individuals with a certain gene have a 0.4 probability of eventually contracting a particular disease. If 15 individuals with the gene participate in a lifetime study, what is the distribution of the random variable X describing the number of these individuals who will contract the disease?

- a) X is a binomial random variable with n=6 and p=1.897
- b) X is a normally distributed random variable with mean 15 and variance 0.4
- c) X is a binomial random variable with n=15 and p=0.4
- d) X is a normally distributed random variable with mean 6 and variance 1.897
- e) None of the above

j) After performing a simple linear regression, we calculated the residuals and obtained the residual plot shown below. Does the plot indicate any potential problems with the regression?

- a) The plot indicates the residuals are not normally distributed.
- b) The plot shows curvature. Hence, a linear model is not appropriate.
- c) The plot indicates that the error variance is not constant.
- d) All of the above
- e) There are no apparent problems.

k) What do residuals represent in the simple linear regression model?
- a) The difference between the actual Y values and the mean of Y.
- b) The difference between the actual Y values and the predicted Y values.
- c) The square root of the slope.
- d) The predicted value of Y for the average X value
- e) None of the above.

l) The probability that a region prone to hurricanes will be hit by a hurricane in any single year is 0.1 and independent of other years. What is the probability of a hurricane hit at least once in the next 5 years?

- a) 0.00001
- b) 0.40951
- c) 0.5
- d) 0.99999
- e) None of the above

m) What is the expected number of hurricanes to hit the area described above in the next 90 years?

- a) 9
- b) 3
- c) 8.1
- d) 2.85
- e) None of the above

q) For which of the following hypothesis tests would the p-value be the same whether the sample mean is 44 or 46 (see table to the right)?

- a) I.
- b) I. and IV.
- c) II. and III.
- d) IV.
- e) Stop bothering me with these silly questions.

r) (3 points) We are told that a 95% prediction interval for a response variable, y, is (23.2, 35.6) from a simple regression on a sample of n = 100 observations at x* = 10. Which of the following is a reasonable estimate for the confidence interval for µy at x* = 10?

- a. (13.2, 45.6)
- b. (24.2, 27.8)
- c. (28.6, 30.2)
- d. (33.2, 45.6)

2. (25 points total) Kellogg’s wants to increase sales of its Fruit Loops cereal, and decides to run an experiment at Stop & Shop stores in New England. They randomize which shelf (bottom vs. middle vs. top) Fruit Loops is placed on at a total of 150 stores (50 on each shelf). The variables collected were then:

- sales: number of boxes sold at the store in one day
- middle: a 0/1 binary variable to indicate if Fruit Loops was on the middle shelf
- top: a 0/1 binary variable to indicate if Fruit Loops was on the top shelf

A regression was run in SPSS, and the results are shown below:

a) (7 points) Is there any evidence that sales varied across the 3 shelf locations?
Perform a formal hypothesis test to determine this: be sure to include your hypotheses, the test statistic, the degrees of freedom, the p-value, and your conclusion in context of the problem.

b) (4 points) What is the interpretation of the coefficient, b1, for the middle variable in the above regression model?

c) (6 points) Calculate the 95% confidence interval for the middle variable in the above regression model.

d) (4 points) Which shelf location is predicted to have the most sales? Which shelf location is predicted to have the fewest sales? Please justify.

e) (4 points) What percent of total variability in sales can be predicted by this model?

3. (15 points total) Suppose that past history shows that 35% of college students prefer Pepsi over Coca-Cola.

a. (4 points) A sample of 5 students is selected. What is the probability that at least 1 prefers Pepsi?

b. (6 points) A sample of size 49 is collected. What are the mean and standard deviation for the number of students who prefer Pepsi in this sample?

c. (5 points) In this sample of 49 students, what is the probability that the majority (strictly more than half) of the students selected prefer Pepsi to Coke?

4. (29 points total) An investigator is interested in modeling the progression over time in the Men’s 100 meter run in the Olympics. He measures 2 variables:

- time: the winning time in the men’s 100 meter sprint, in seconds
- year: the year of the Olympics (from 1900 to 2008)

Some relevant SPSS output is shown below:

a) (7 points) Are time and year significantly associated at α = 0.05? Be sure to include the hypotheses, the test statistic, the degrees of freedom, the p-value, and your conclusion in the context of the problem.

b) (4 points) What is the estimated correlation between time and year?

c) (3 points) What is the estimated standard deviation of winning times within Olympic year (aka, standard deviation of the residuals)?

d) (4 points) Based on this model, in what year will the winning times be forecasted to drop below 9 seconds (please round up to the nearest year)?

e) (4 points) In 1 or 2 sentences, please comment on the validity of your forecast in part (d) above.

f) (7 points) Above are the histogram of residuals along with the scatterplot of the residuals vs. the x-variable, year. Please comment on the validity of the assumptions for this regression model. (I’ve commented on one; please name it and the others, and comment on the other 3.)

Assumptions:

1. _______________________________: cannot be checked here
2. _______________________________:
3. _______________________________:
4. _______________________________:

5. (16 points total) Kevin is flying directly to Philadelphia on Saturday for a friend's wedding next weekend. He has his flight booked through US Airways. US Airways reports that whether his flight is on time or not depends on the weather in Boston. If it is raining in Boston, the flight will be late 50% of the time. If it is not raining in Boston, the flight will be late only 10% of the time. There is a forecasted 25% chance for rain on Saturday (assume that this forecast is correct).

a. (8 points) What is the overall probability that Kevin's flight will be delayed? [If Kevin arrives late, he will miss the beginning of the wedding]. Recall: $P(A) = P(A \text{ and } B) + P(A \text{ and } B^C)$

b. (8 points) Saturday rolls around and Kevin's friend notices Kevin has not arrived at the wedding on time because his flight was delayed. What is the conditional probability that it actually was raining up in Boston given the fact that Kevin's flight was delayed?

6. An elevator serving a hospital is designed to hold up to 15 passengers and has a maximum safe capacity of 2440 pounds. The weight of passengers who use the elevator is normally distributed with an average of 149 pounds and a standard deviation of 20 pounds.

a) What is the probability that a single passenger on the elevator weighs between 140 and 150 pounds?

b) What is the probability that a single passenger on the elevator weighs more than 200 pounds?

c) If five passengers enter the elevator together, what is the probability that all five of them weigh 200 pounds or less?

d) What is the probability that the elevator's safe capacity is exceeded by a full load of 15 passengers?

7. (32 points total) GPAs at Harvard are known to be approximately Normally distributed with a mean of µ = 3.25 and a standard deviation σ = 0.30.

a. (6 points) Show that 20.33% of Harvard students have a GPA below 3.00.

b. (6 points) There are 6 students living in a suite in a Harvard house. If we assume their GPAs to be independent, what is the probability that at least one of them has a GPA below 3.00?

c. (6 points) A random sample of 50 Harvard students was taken. Assuming their GPAs are independent, what is the probability that at least 20 of them have a GPA below 3.00?

d. (6 points) What is the probability that the average GPA for these 50 randomly sampled students is below 3.00?

e. (8 points) A random sample of 50 Harvard athletes had a mean of x̄ = 3.12 and standard deviation of s = 0.37 (there is no reason to suspect athletes have the same standard deviation as the general Harvard population). Perform a formal hypothesis test to determine whether Harvard athletes have a different mean GPA than all Harvard students. Be sure to include your hypotheses, the test statistic, the degrees of freedom (if applicable), an estimate of the p-value, and your conclusion in context of the problem.

8. (32 points total) Over the last twenty years, the daily change (in decimal form) of a mutual fund based on the S&P 500 Index fund is known to follow a normal distribution with a mean of μ = 0.002849 and a sd of σ = 0.01175.

a. (8 points) What is the probability that this mutual fund loses money in any one day?

b.
(8 points) What is the probability that this mutual fund loses money in at least one day over the next week (5 days) assuming days are independent?

c. (8 points) What is the approximate probability that this mutual fund will lose money in at least 15 of the next 30 days assuming days are independent?

d. (8 points) Let X be a random variable to represent the average daily change across 250 days (which is essentially a full year of business days). If you assume each day is independent, what is the probability that your investment will have an average change below zero (essentially meaning the fund lost money during the year)?

e. Now assume instead that this mutual fund's daily change is not independent from day to day, but actually has a positive correlation from one day to the next. Would the probability of losing money increase, decrease or stay the same from your answer in part (d)? Please justify your answer.

9. (26 points total) The table and graph below show numerical and graphical summaries of the monthly precipitation (in inches) over the last 60 months in Cambridge, MA.

a. (8 points) Is this distribution left-skewed, right-skewed, or symmetric? Briefly justify your answer.

b. (10 points) Identify any suspected high outliers in the data using the quantitative methods discussed in class. Show your work.

c. (8 points) Calculate the mean and standard deviation of monthly precipitation in centimeters (1 inch = 2.54 cm).

10. (20 points total) Below are the summary statistics for two variables measured on the top 10 grossing box office movies so far in 2011: how much revenue they generated in US markets and the amount of revenue generated in all international markets combined (both in millions of US dollars), along with the correlation table between the two, and the related scatterplot with international revenue on the y-axis, and US revenue on the x-axis:

a.
(7 points) What is the formula for the least-squares regression line to predict international revenue based on US revenue?

b. (4 points) What is the predicted amount of international revenue for a movie that generated 162 million dollars in the US?

c. (4 points) Kung Fu Panda 2 made 162 million dollars in the US and 614 million dollars internationally. What is Kung Fu Panda 2's estimated residual?

d. (5 points) What percentage of variability in international revenue can be explained by US revenue?

10. (30 points total) Each part of this problem requires a short response with a brief explanation (simply yes or no will not suffice). Note: these parts are not related.

a. (6 points) In a study of cold symptoms, every one of the 50 study subjects with a cold was found to be improved 2 weeks after taking ginger pills. The authors concluded that ginger pills cure colds. What is the major flaw in this study?

b. (6 points) Let H be the event that the Democrats win the majority of the seats in the House of Representatives, and let S be the event that the Democrats win the majority of the seats in the Senate. Let P(H) = 0.5, P(S) = 0.6, and P(H or S) = 0.7. Are H and S independent?

c. (5 points) The sensitivity for a diagnostic test, P(positive test | disease), is 0.85 and the specificity of the same test, P(negative test | no disease), is also 0.85. Are the two events, (A = having the disease) and (B = receiving a positive test), independent? Show your work.

d. (6 points) It is known that 30% of young girls' favorite color is blue while 70% of young boys' favorite color is blue (you can also assume the population is split evenly into 50% boys and 50% girls). Are the two events (being a boy) and (favorite color is blue) independent?

e. (6 points) Suppose that A and B are two disjoint events within the same sample space. In addition, let P(A) = 1/8 and P(B) = 1/4. Are events A and B independent? Explain or show your calculations.

f.
(6 points) In 1990 a research organization sent questionnaires to all of the approximately 15,000 high school systems in the United States. These questionnaires asked about computer usage in the school system. As many as 3,600 school systems returned answers. Of these 3,600, 60% indicated that some of their students used computers. In a speech shortly thereafter, an authority on the use of computers in high school education cited this study as evidence that "students in 60% of the high school systems in the United States use computers during their high school careers." Do you regard 60% as a trustworthy estimate of the proportion of school systems providing computer access in 1990? In two sentences or fewer, explain your answer.

g. (6 points) A company in Hawaii builds bridges for married couples to walk over during their weddings. There are 3 islands in Hawaii that each have the same mean and variance of husbands’ weights and wives’ weights. However, the relationship of weights within couples is different on the 3 islands: on Inde, the weights within couples are independent; on Posi, they are positively correlated; and on Nega, they are negatively correlated. On which island should the company build the strongest bridge? Defend your answer in 2 sentences or fewer.

11. (35 points total) A researcher is investigating variables that might be associated with death rates in the US states. He examined data from 2008 for each of the 50 states plus Washington, D.C. The data included information on the following variables:

- deathrate: The annual death rate per one million inhabitants
- smokers: The percent of inhabitants who smoke heavily, in percentage points
- college: The percent of inhabitants that have a bachelor's degree, in percentage points

As part of his investigation, he ran the following multiple regression model:

$\text{deathrate} = \alpha + \beta_1(\text{smokers}) + \beta_2(\text{college}) + \varepsilon$

This model was fit to the data using the method of least squares.
The following results were obtained from statistical software:

a. (4 points) What is the estimated standard deviation of the residuals?

b. (6 points) Suppose we wish to test the hypotheses $H_0: \beta_1 = \beta_2 = 0$ versus $H_a$: at least one of the $\beta_j$ is not 0. What is the value of the appropriate test statistic, the p-value, and conclusion to this test?

c. (6 points) What is the interpretation of the value for b1, the estimated coefficient for the variable smokers?

d. (7 points) Calculate the 95% confidence interval for $\beta_1$, the coefficient for the variable smokers.

e. (6 points) Briefly comment on the residual diagnostic plot for this model shown below. Please be specific and limit your response to 3 sentences or bullet points.

Another researcher, using the same data, ran the following simple linear regression model:

$\text{deathrate} = \alpha + \beta_1(\text{college}) + \varepsilon$

The following results were obtained from statistical software:

f. (6 points) The second researcher concluded that because the coefficient for the variable college was negative in his results, spending additional money on education to have more college graduates would decrease the death rate in his state. This researcher therefore recommended more money be spent on education. The second researcher concluded that because the coefficient for the college variable was positive in his results, spending additional money on students would increase the death rate. This researcher therefore recommended less money be spent on education. Why are these two conclusions different even though the researchers used the same data? Explain using a few concise sentences.

12. (20 points total) It is known that 20% of all Harvard students are varsity athletes. 50% of varsity athletes eat breakfast on any particular weekday, while only 25% of all other Harvard students eat breakfast on any particular weekday. Define the events:

- A: the event that a student is a varsity athlete
- B: the event that a student eats breakfast on a particular weekday

a.
(5 points) Are the events A and B independent? Please briefly justify.

b. (5 points) Are the events A and B disjoint? Please briefly justify.

c. (5 points) Find the overall proportion of students that eat breakfast on any particular weekday.

d. (5 points) Given a student is eating breakfast on a particular weekday, what is the probability that that student is a varsity athlete?

e. (4 points) In actuality the non-varsity-athlete students are comprised of two further subgroups: 30% of them are club athletes and 70% are nonathletes [so there are actually 3 distinct groups in the Harvard student population: varsity athletes, club athletes, and nonathletes]. Of the club athletes, 40% eat breakfast.

- i) What percent of nonathletes eat breakfast?
- ii) Given a student is eating breakfast, what is the probability they are a nonathlete?

13. (55 points total) A study was conducted to determine the association between the maximum distance at which a highway sign can be read (in feet) and the age of the driver (in years). Forty drivers of various ages were studied. The summary statistics for distance and age are shown below in a table from Stata:

a. (8 points) The correlation coefficient between distance and age in this sample is r = -0.5644. Calculate a and b of the least-squares regression equation that would predict the distance at which a highway sign can be read given the age of the driver.

b. (10 points) The standard error of b was calculated to be 0.9164 from Stata. Is “age” a significant predictor of “distance” in this linear model? Conduct this statistical test of H0: β = 0 using α = 0.05. Be sure to include your hypotheses, test statistic, degrees of freedom if appropriate, either the p-value or critical value, and your conclusion in terms of the problem.

c. (4 points) What is the predicted distance that a sign can be read for someone who is 40 years old?

d. (6 points) What is R² for this regression model? What is the interpretation of R² here?
The investigators also decided to look at whether wearing glasses had an effect on the distance at which a driver could read a sign. Below is the binary-predictor regression output, labeled as Model A, of the distance someone was able to read the sign predicted from whether or not that person wore glasses (which has value 1 for those wearing glasses or contact lenses, 0 otherwise):

Model A:

e. (3 points) What is the reference group in this model?

f. (4 points) What is the predicted distance that a sign can be read for someone who wears glasses based on this model?

Below is the Stata output of a multiple regression, labeled as Model B, of the distance someone was able to read the sign predicted from “age” and “glasses” (again, which has value 1 for individuals wearing glasses or contact lenses, 0 otherwise):

Model B:

g. (5 points) What is the interpretation of the value -25.678 in this regression model?

h. (8 points) Compare the results of the two regressions, Model A and Model B, above. Specifically mention any signs or significance that are different between the two models. Why do you suspect this is the case?

i. (7 points) Above are the residual vs. fitted scatterplot and histogram of the residuals for the multiple regression model (Model B) above. Use these plots to comment on whether the assumptions for this model seem to be valid. Be specific.

14. (5 points) As part of a study on student loan debt, a national agency that underwrites student loans is examining the differences in student loan debt for undergraduate students. One question the agency would like to address specifically is whether the mean undergraduate debt of Hispanic students graduating in 2009 is less than the mean undergraduate debt of Asian-American students graduating in 2009. To conduct the study, a random sample of 92 Hispanic students and a random sample of 110 Asian-American students who completed an undergraduate degree in 2009 were taken.
The undergraduate debt incurred for financing college for each sampled student was collected. Let $\mu_H$ denote the population average student loan debt for Hispanic students, and $\mu_A$ the population average student loan debt for Asian-American students. Using the summary statistics below, test the hypothesis $H_0: \mu_A = \mu_H$ vs. $H_a: \mu_A > \mu_H$. Clearly interpret your results.

| Group | n | Mean | Std. Dev. |
| --- | --- | --- | --- |
| Hispanic | 92 | 18659.18 | 4700.04 |
| Asian-American | 110 | 20002.54 | 5807.52 |
| Total | 202 | 19390.71 | 5361.05 |

15. (33 points total) An investigator is trying to determine what factors are important in determining the graduation rate at US colleges. She collects a random sample of 53 four-year colleges, and records three variables:

- GradRate: the graduation rate for the class of 2007 (as a percentage of all entering students that were full-time, in percentage points: 0 to 100)
- Tuition: the tuition in 2007-08 (in thousands of dollars)
- SATMath: the median SAT math score for entering freshmen in 2007-08

Below is SPSS’s regression output of predicting GradRate based on Tuition and median SATMath scores.

a. (5 points) Based on the above model, what is the predicted graduation rate for a college with tuition of $35 thousand and median math SAT score of 750 (aka, Harvard)?

b. (5 points) In words, what is the interpretation of the coefficient for SATMath (which has value 0.1842) in the above table?

c. (4 points) What is the proportion of total variability in graduation rate that can be explained by this model?

d. (6 points) Perform a single hypothesis test to determine whether any of the variables are associated with graduation rate. Be sure to state your hypotheses, test statistic, degrees of freedom (if applicable), p-value, and conclusion.

e. (7 points) Perform a hypothesis test to determine whether specifically tuition is associated with graduation rate in the above model. Be sure to state your hypotheses, test statistic, degrees of freedom (if applicable), p-value, and conclusion.

f.
(6 points) The dean at a college sees these results and suggests to his board of trustees that they raise their tuition in order to improve their graduation rate. What is the major mistake the dean is making in concluding from these data that raising their tuition will lead to a higher graduation rate?

16. (12 points) A survey of male and female university students asked which popular musical artist they preferred. The survey focused on Lady Gaga and Justin Bieber but allowed for “other” artists as well. Some of the values from the two-way table are missing, but you can determine what they are and answer the given questions.

Two-way table (columns: Artist, Male, Female, Total; some cells are blank):

- Lady Gaga: a, 50
- Justin Bieber: 90
- Other: 100, 140
- Total: 150, 300

a) What is the value of “a”?

b) What is the probability that a randomly chosen student will prefer Justin Bieber?

c) Given a student prefers Justin Bieber, what is the probability that they are female?

d) Are gender and artist preference dependent or independent? (Do not use a chi-square test.)

17. (24 points total) Your younger sister and brother are strong believers in the tooth fairy. Whenever a baby tooth falls out, your sibling places it under his/her pillow before going to sleep, and in the morning the tooth fairy replaces it with cash. You observe this week that their baby teeth are close to falling out. Let A be the event that one of your sister’s teeth will fall out today, and let B be the event that one of your brother’s teeth will fall out today. You estimate that P(A) = 0.3 and P(B) = 0.2. Assume that for each sibling at most one tooth will fall out.

a. (6 points) Assuming whether your brother’s tooth falls out is independent of whether your sister’s tooth falls out, the probability that neither falls out today is 0.56. Demonstrate with appropriate calculations why this is true.

b. (6 points) Under the assumption of independence as in part (a), what is the probability that exactly one (i.e., not both) of your siblings’ teeth falls out today?

c. (6 points) Describe a scenario involving your younger siblings where A and B are clearly not independent events. Be sure to state this scenario in context of this problem (do not just give the definition of dependence).

d. (6 points) The tooth fairy replaces a tooth with cash with probability 0.25 independently from child to child. On a given night, 10 children in a town have placed teeth that have fallen out under their pillows. What is the probability that at least 1 of these 10 children is visited by the tooth fairy?

18. (16 points total) With the popularity of traditional lotteries waning across the US, many states are turning to instant games, called “scratch-off tickets,” to lure new players and raise revenue. However, many critics are concerned that instant gratification scratch-off tickets are more likely to contribute to gambling addiction and take particular advantage of the poor members of society. A survey of 100 randomly selected gamblers with below median incomes was conducted in the El Paso area of Texas to study the association between gambling addiction and the primary type of gambling (traditional state lottery versus scratch-off tickets). The results are given below.

| Primary type of gambling | Scratch-off tickets | Traditional lottery | Total |
| --- | --- | --- | --- |
| Diagnosed with a gambling addiction | 11 | 2 | 13 |
| No gambling addiction | 39 | 48 | 87 |
| Total | 50 | 50 | 100 |

a. (8 points) Is this significant evidence that the primary type of gambling affects the risk of a gambling addiction? Test at level α = 0.05 and include the null and alternative hypotheses, the test statistic, the rejection region, an estimate of the P-value, a statement of whether or not you reject the null hypothesis, and a sentence summarizing your conclusion.

b. (3 points) Find the difference in proportions of a gambling addiction comparing scratch-off ticket users to traditional lottery users.

c. (5 points) Find the 95% confidence interval for the difference in gambling addiction for scratch-off ticket users vs.
traditional lottery users.

19. The mean length of stay in a hospital is useful for planning purposes. Suppose that the following is the distribution of the length of stay in a hospital after a minor operation.

| Number of Days | Probability |
| --- | --- |
| 2 | 0.2 |
| 3 | 0.3 |
| 4 | 0.5 |

a) What is the mean (expected value) length of stay?

b) What is the variance of length of stay?
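
For checking work on problem 19, the definitions of expected value and variance for a discrete random variable can be sketched in a few lines of Python. This is only an illustrative check, not part of the exam: the names `distribution`, `expected_value`, and `variance` are chosen here for the example, and the probabilities are the ones listed in the problem (2 days: 0.2, 3 days: 0.3, 4 days: 0.5).

```python
# Length-of-stay distribution from problem 19: number of days -> probability.
# The variable and function names below are illustrative, not from the exam.
distribution = {2: 0.2, 3: 0.3, 4: 0.5}


def expected_value(dist):
    """E[X] = sum of x * P(X = x) over all outcomes x."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Var(X) = sum of P(X = x) * (x - E[X])^2 over all outcomes x."""
    mean = expected_value(dist)
    return sum(p * (x - mean) ** 2 for x, p in dist.items())


mean_stay = expected_value(distribution)
var_stay = variance(distribution)
print(f"mean = {mean_stay:.2f} days, variance = {var_stay:.2f} days^2")
```

The same two functions should work for any finite discrete distribution written as an outcome-to-probability dictionary, provided the probabilities sum to 1.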