UNIVERSITY OF TORONTO AT SCARBOROUGH SAMPLE FINAL EXAM STAB57
Duration - 3 hours

THIS EXAM IS OPEN BOOK (NOTES)
AIDS ALLOWED: Non-communicating calculator

LAST NAME_____________________________________________________
FIRST NAME_____________________________________________________
STUDENT NUMBER___________________________________________

All relevant work MUST be shown for credit. Answer alone (even though correct) will only qualify for ZERO credit. Please show your work in the space provided; you may use the back of the pages, if necessary, but you MUST remain organized.

PLEASE CHECK AND MAKE SURE THAT THERE ARE NO MISSING PAGES IN THIS BOOKLET.

(4 points) 1) Suppose that a statistical model is given by the family of exponential($\theta$) distributions where $\theta \in (0, \infty)$. If our interest is in making inferences about the third moment of the distribution, then determine the characteristic of interest as a function of $\theta$ (i.e. $\psi(\theta)$).
Note: The Exponential($\theta$) distribution has p.d.f. $f_\theta(x) = \theta e^{-\theta x}$ for $x > 0$ (and 0 otherwise).

Sol:
$$\psi(\theta) = E[X^3] = \int_0^\infty x^3 \theta e^{-\theta x}\,dx = \theta \int_0^\infty x^{4-1} e^{-\theta x}\,dx = \theta \frac{\Gamma(4)}{\theta^4} = \frac{3!}{\theta^3} = \frac{6}{\theta^3}$$

(5 points) 2) Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from a distribution with p.d.f.
$$f_\theta(x) = \begin{cases} e^{-(x-\theta)}, & x \ge \theta \\ 0, & \text{otherwise} \end{cases}$$
where $\theta \in R$. Show that $X_{(1)} = \min(X_1, X_2, \ldots, X_n)$ is a sufficient statistic for the model.

Sol:
$$L(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^n e^{-(x_i-\theta)} I(x_i \ge \theta) = e^{-\sum x_i} e^{n\theta} \prod_{i=1}^n I(x_i \ge \theta) = e^{-\sum x_i} e^{n\theta} I(x_{(1)} \ge \theta)$$
Let $g_\theta(x_{(1)}) = e^{n\theta} I(x_{(1)} \ge \theta)$ and $h(x_1, x_2, \ldots, x_n) = e^{-\sum x_i}$. Thus $X_{(1)} = \min(X_1, X_2, \ldots, X_n)$ is a sufficient statistic for the model (by the factorization theorem, i.e. Thm 6.1.1, p287 of the text).

3) The conditional distribution of X, the number of claims for an insured in one year, given $\lambda$ has p.d.f. given by:
$$p_\lambda(x) = 0.5\,\frac{e^{-\lambda}\lambda^x}{x!} + 0.5\,\frac{e^{-2\lambda}(2\lambda)^x}{x!}, \quad x = 0, 1, 2, \ldots$$
The prior distribution of the parameter $\lambda$ is exponential with a mean of 1. An insured is chosen at random and observed to have no claims in the first year (i.e. X = 0).
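The setup just described (a 50/50 mixture of Poisson probabilities, an exponential prior with mean 1, and the observation X = 0) can be checked with exact arithmetic. The following is a minimal sketch, not part of the original exam: the name `lam` for the claims-rate parameter and the helper `gamma_integral` are illustrative assumptions, and the code relies only on the standard identity that the integral of lam^k e^(-a lam) over (0, infinity) equals k!/a^(k+1).

```python
from fractions import Fraction
from math import factorial


def gamma_integral(k, a):
    """Exact integral of lam**k * exp(-a*lam) over (0, inf), i.e. k!/a**(k+1)."""
    return Fraction(factorial(k), a ** (k + 1))


# P(X = 0 | lam) * prior(lam) = 0.5*exp(-2*lam) + 0.5*exp(-3*lam),
# stored as (weight, rate) pairs. The symbol lam is an assumed name.
terms = [(Fraction(1, 2), 2), (Fraction(1, 2), 3)]

# Marginal probability of observing no claims.
marginal = sum(w * gamma_integral(0, a) for w, a in terms)

# Posterior density = sum of weight * exp(-rate*lam) with rescaled weights.
posterior_weights = [(w / marginal, a) for w, a in terms]

# Posterior mean = integral of lam * posterior density.
posterior_mean = sum(w * gamma_integral(1, a) for w, a in posterior_weights)

print(marginal)        # 5/12
print(posterior_mean)  # 13/30
```

Both posterior weights come out as 6/5, so under these assumptions the posterior density is (6/5)(e^(-2 lam) + e^(-3 lam)) and the posterior mean is 13/30, about 0.433.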
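(5 points) a) Determine the posterior density of $\lambda$.

Sol: Since $P(X = 0 \mid \lambda) = 0.5 e^{-\lambda} + 0.5 e^{-2\lambda}$ and the prior density is $\pi(\lambda) = e^{-\lambda}$ for $\lambda > 0$, the posterior density is proportional to $0.5\,(e^{-2\lambda} + e^{-3\lambda})$. The normalizing constant is $\int_0^\infty 0.5\,(e^{-2\lambda} + e^{-3\lambda})\,d\lambda = 0.5\,(1/2 + 1/3) = 5/12$, and so
$$\pi(\lambda \mid 0) = \frac{6}{5}\left(e^{-2\lambda} + e^{-3\lambda}\right), \quad \lambda > 0.$$

(4 points) b) Determine the posterior mean of $\lambda$ (numerical value is required).

Sol:
$$\text{Posterior mean} = \int_0^\infty \lambda\,\pi(\lambda \mid 0)\,d\lambda = \int_0^\infty \lambda\,\frac{6}{5}\left(e^{-2\lambda} + e^{-3\lambda}\right)d\lambda = \frac{6}{5}\int_0^\infty \lambda e^{-2\lambda}\,d\lambda + \frac{6}{5}\int_0^\infty \lambda e^{-3\lambda}\,d\lambda = \frac{6}{5}\cdot\frac{1}{2^2} + \frac{6}{5}\cdot\frac{1}{3^2} = \frac{13}{30} \approx 0.433$$

4) Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from the uniform distribution with p.d.f.
$$f_\theta(x) = \begin{cases} \dfrac{1}{2\theta+1}, & 0 \le x \le 2\theta+1 \\ 0, & \text{otherwise} \end{cases}$$

(6 points) a) Determine the maximum likelihood estimator of $\theta$.

Sol (a): The likelihood function is
$$\prod_{i=1}^n \frac{1}{2\theta+1}\, I(0 \le x_i \le 2\theta+1) = \frac{1}{(2\theta+1)^n}\, I(0 \le x_{(1)} \le 2\theta+1,\ 0 \le x_{(n)} \le 2\theta+1)$$
Note that $\frac{1}{(2\theta+1)^n}$ is a decreasing function of $\theta$ on the interval $(-\frac{1}{2}, \infty)$. As $0 \le x_{(n)} \le 2\theta+1$ (i.e. $\theta \ge \frac{1}{2}(x_{(n)} - 1)$), the maximum occurs when $\hat\theta = \frac{1}{2}(x_{(n)} - 1)$, i.e. $\hat\theta = \frac{1}{2}(x_{(n)} - 1)$ is the MLE of $\theta$.

(3 points) b) Determine the maximum likelihood estimator of the variance of this distribution.

Sol (b): The variance of the distribution is $\frac{(2\theta+1)^2}{12}$. Since $\theta > -\frac{1}{2}$, this is a one-to-one function of $\theta$. And so the MLE of the variance is
$$\frac{(2\hat\theta+1)^2}{12} = \frac{\left(2\cdot\frac{1}{2}(x_{(n)} - 1) + 1\right)^2}{12} = \frac{x_{(n)}^2}{12}$$

5) A study of the costs involved in a particular surgery was done in California. The 95% confidence interval for the mean cost was ($6061.41, $6338.59). No other information was given in the report, but we have enough information here to answer the following questions. Assume that this interval was calculated based on the normal distribution (i.e. using a location normal model with known standard deviation).

(5 points) a) Calculate the upper limit of the 90% confidence interval for the mean cost.

Sol: $\bar{x}$ = (6061.41 + 6338.59)/2 = 6200
$1.96\,\sigma/\sqrt{n}$ = (6338.59 - 6061.41)/2 = 138.59, and so $\sigma/\sqrt{n}$ = 138.59/1.96 = 70.70918367.
The 90% CI is therefore 6200 +/- 1.645 * 70.70918367, and the upper limit = 6316.316607.

(4 points) b) Calculate the p-value for testing the null hypothesis $H_0: \mu = 6100$ against the alternative hypothesis $H_a: \mu > 6100$. (Note: An interval or a range of possible values for the p-value is not sufficient for this question. A numerical answer is required.)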
Sol: $z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \dfrac{6200 - 6100}{70.71} = 1.41$, and the p-value is the area under the standard normal curve to the right of 1.41, which is 0.0793.

(6 points) 6) The sponsors of television shows targeted at the children's market wanted to know the amount of time children spend watching television, since the types and numbers of programs and commercials are greatly influenced by this information. A random sample of 100 children was asked to keep track of the number of hours of television they watch each week and the average time for this sample was 27.19 hours. From past experience, it is known that the population standard deviation of the weekly amount of television watched is $\sigma$ = 8.0 hours and that the weekly amount of television watched is normally distributed with mean µ. Suppose that the prior distribution of µ is $N(30, 10^2)$. Calculate a 95% credible interval for µ.

Sol: With prior $\mu \sim N(\mu_0, \tau_0^2)$ and known population variance $\sigma_0^2$, the posterior distribution $(\mu \mid X_1, \ldots, X_n)$ is
$$N\left( \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2} + \frac{n\bar{x}}{\sigma_0^2}\right),\ \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1} \right)$$
An HPD set that has credible probability $(1 - \alpha)$ is
$$\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2} + \frac{n\bar{x}}{\sigma_0^2}\right) \pm z_{1-\alpha/2}\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1/2}$$
For the given data, if we have a prior for µ of $N(30, 10^2)$, then the HPD set with credible probability 95% is
$$\left(\frac{1}{100} + \frac{100}{64}\right)^{-1}\left(\frac{30}{100} + \frac{100 \times 27.19}{64}\right) \pm 1.96\left(\frac{1}{100} + \frac{100}{64}\right)^{-1/2} = [25.645,\ 28.771]$$

7) (In this term, we could not discuss the material for this question, and so you may ignore it.) The following MINITAB output was obtained from a study of the relationship between the salary (in thousands of dollars) and length of service (in years) based on a random sample of 25 employees from a large firm.
Descriptive Statistics: Length, Salary

Variable   N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3
Length    25   0   9.720    0.797  3.985    3.000   6.500   9.000  12.500
Salary    25   0  29.229    0.994  4.971   21.353  26.047  28.446  31.721

Regression Analysis: Salary versus Length

The regression equation is
Salary = 20.3 + 0.915 Length

(6 points) a) Test whether there is a linear relationship between Length and Salary. Use $\alpha$ = 0.05. State the null and the alternative hypotheses.

Sol: $H_0: \beta_2 = 0$, $H_a: \beta_2 \ne 0$
SST = (n-1) var(Y) = 24*(4.971^2) = 593.060184
n = 25, sx = 3.985, b2 = 0.915
SSR = (b2^2)*(n-1)*(sx^2) = 319.087713
SSE = 593.060184 - 319.09 = 274.0
MSE = SSE/(n-2) = 274/(25-2) = 11.9
F = 319.09/11.9 = 26.81428571 ~ F(1, 23)
Table value 4.35 (for F(1, 20)) < F_calc, and so we reject the null hypothesis. That is, there is sufficient evidence of a linear relationship.

(4 points) b) Calculate a 95% confidence interval for the slope of the regression line of Salary on Length. Show your work clearly.

Sol: $S^2_{\hat\beta_2} = \dfrac{11.9}{24 \times 3.985^2} = 0.03122331915$, and so $S_{\hat\beta_2} = 0.1767012143$.
t(25-2, 0.05) = 2.069
CI = 0.915 +/- 2.069 * 0.1767 = 0.915 +/- 0.366, i.e. approximately (0.549, 1.281).

(4 points) c) Estimate the expected salary (i.e. mean salary) of employees with 5 years experience. Calculate the standard error of your estimate.

Sol: (We could not discuss multiple regression this term.)
$$Var(B_1 + B_2 x \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)$$
Estimate of the expected salary = 20.3 + 0.915 x 5 = 24.875
s = sqrt(11.9) = 3.45
$$\text{Std error} = s\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right)^{1/2} = 3.45\left(\frac{1}{25} + \frac{(5 - 9.72)^2}{24 \times 3.985^2}\right)^{1/2} \approx 1.08$$

8) (Ex 6.18 p251 Neter, some data deleted to make n = 65 and so dfError = 60) In a study of the relationship between rental rates (y) and other variables, a commercial real estate company collected data on n = 65 commercial properties.
The following variables were measured on each property:
y = rental rate
x2 = age
x3 = operating expenses and taxes
x4 = vacancy rate
x5 = total square footage

The company was interested in estimating the regression model:
$$E(Y \mid x_2, x_3, x_4, x_5) = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$$
with $x_1 = 1$ for all observations. Some MINITAB outputs (with some values deleted) used for estimating this model are given below:

Descriptive Statistics: x2, x3, x4, x5, y

Variable   N    Mean   StDev      Minimum           Q1      Q3  Maximum
x2        65   7.077   6.258  0.000000000        2.000  14.000   18.000
x3        65   9.462   2.655        3.000        7.955  11.620   14.620
x4        65  0.0892  0.1462  0.000000000  0.000000000  0.1250   0.7300
x5        65  158042  108701        27000        65000  237966   484290
y         65  15.182   1.850       10.500       14.000  16.500   19.250

Regression Analysis: y versus x2, x3, x4, x5

The regression equation is omitted

Predictor        Coef     SE Coef        T        P
Constant      omitted     omitted  omitted  omitted
x2           -0.15977     0.02517    -6.35    0.000
x3            0.27442     omitted     4.00    0.000
x4              0.302       1.144     0.26    0.793
x5         0.00000918  0.00000156     5.89    0.000

S = omitted   R-Sq = omitted

Analysis of Variance

Source          DF       SS       MS
Regression       4  138.715  omitted
Residual Error  60  omitted  omitted
Total           64  omitted

(5 points) a) Test whether or not there is a relationship between the response and the predictors. Use $\alpha$ = 0.05. State the null and the alternative hypotheses.

Sol: $H_0: \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$ (Note: this textbook uses $\beta_1$ for the intercept.)
$H_a$: at least one of $\beta_2$, $\beta_3$, $\beta_4$ or $\beta_5$ is not equal to 0
MSR = 138.715/4 = 34.67875
SST = (65-1)*Sy^2 = 64*(1.850^2) = 219.04
SSE = 219.04 - 138.715 = 80.325 and MSE = 80.325/60 = 1.33875
F = 34.67875/1.33875 = 25.9038282
F(4, 60, 0.95) = 2.53 (from the table, p669 of the text)
F calculated > F table, and so we reject H0.

(3 points) b) Calculate the least squares estimate of $\beta_1$.

Sol: 15.182 - (-0.15977)*7.077 - 0.27442*9.462 - 0.302*0.0892 - 0.00000918*158042 = 12.23836629

(3 points) c) Calculate and interpret the value of R-square.
Sol: R-sq = SSR/SST = 138.715/219.04 = 0.6332861578
63% of the variability in the y values is explained by this model.

(3 points) d) Calculate a 95% confidence interval for $\beta_3$, the coefficient of x3 in the above regression model.

Sol: The T value for x3 is Coef/SE Coef = 4.00, and so SE = 0.27442/4 = 0.068605.
CI = 0.27442 +/- t * 0.068605, where t has 60 df. With t = 2.000, this gives 0.27442 +/- 0.13721, i.e. approximately (0.137, 0.412).

(5 points) e) The least squares estimate of the simple linear regression equation with $x_2$ (and with $x_1 = 1$ for all observations) is y = 15.8 - 0.0835 x2. Use this information (and the information above) to test the null hypothesis $H_0: \beta_3 = \beta_4 = \beta_5 = 0$ against the alternative hypothesis $H_a$: at least one of $\beta_3$, $\beta_4$ or $\beta_5$ is not equal to 0.

Sol: SSR(x2) = (0.0835^2)*64*(6.258^2) = 17.47527596
SS(drop) = SSR(x1, ..., x5) - SSR(x1, x2) = 138.715 - 17.47527596 = 121.239724
F = (121.239724/3)/1.33875 = 30.2
Compare this with the F table value with df (3, 60), which is 2.76 at the 0.05 level. Since 30.2 > 2.76, we reject H0.

Here is the full MINITAB output:

Regression Analysis: y versus x2, x3, x4, x5

The regression equation is
y = 12.2 - 0.160 x2 + 0.274 x3 + 0.30 x4 + 0.000009 x5

Predictor        Coef     SE Coef      T      P
Constant      12.2392      0.6341  19.30  0.000
x2           -0.15977     0.02517  -6.35  0.000
x3            0.27442     0.06859   4.00  0.000
x4              0.302       1.144   0.26  0.793
x5         0.00000918  0.00000156   5.89  0.000

S = 1.15784   R-Sq = 63.3%   R-Sq(adj) = 60.8%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       4  138.715  34.679  25.87  0.000
Residual Error  60   80.435   1.341
Total           64   219.15

Regression Analysis: y versus x2

The regression equation is
y = 15.8 - 0.0835 x2

Predictor      Coef  SE Coef      T      P
Constant    15.7735   0.3364  46.88  0.000
x2         -0.08354  0.03573  -2.34  0.023

S = 1.78910   R-Sq = 8.0%   R-Sq(adj) = 6.5%

Analysis of Variance

Source          DF       SS      MS     F      P
Regression       1   17.495  17.495  5.47  0.023
Residual Error  63  201.655   3.201
Total           64  219.150

(6 points) 9) Consider a large population of families in which each family has exactly three children.
If the probability of a male birth is 0.5 and the genders of the three children in any family are independent of one another, the number of male children in a randomly selected family will have a binomial distribution with three trials (i.e. n = 3 and p = 0.5). Suppose a random sample of 160 families (each family with three children) yields the following results.

Number of male children    0    1   2 or more
Frequency                 14   66          80

Test whether the distribution of the number of males in a family with three children selected at random from this population has a binomial(3, 0.5) distribution. Use $\alpha$ = 0.05. State the null and the alternative hypotheses.

Sol:
H0: The distribution is Bin(3, 0.5)
Ha: The distribution is not Bin(3, 0.5)
P(X = 0) = 0.5^3 = 0.125
P(X = 1) = 3 * 0.5^3 = 0.375
P(X >= 2) = 1 - 0.125 - 0.375 = 0.5
And so the expected frequencies are 160*0.125 = 20, 60 and 80 respectively, and
Chi-sq = ((14-20)^2)/20 + ((66-60)^2)/60 + ((80-80)^2)/80 = 2.4
Chi-sq table value (df = 3-1, 0.05) = 5.99
Chi-sq calculated < table value, and so we do not reject the null hypothesis. There is no evidence against the assumption of a Bin(3, 0.5) distribution.

10) The idea of a 95% confidence interval is that the interval captures the true parameter value in 95% of all samples selected from the population. Write a MINITAB code to verify this by simulation. More specifically, write a MINITAB code to do the following:

(2 points) a) Generate 500 samples, each of size 10, from a normal distribution with mean 300 and standard deviation 5. Your MINITAB code should store the 500 samples in 500 rows of a MINITAB worksheet.

(3 points) b) For each sample generated in part (a) above, calculate a 95% confidence interval for the population mean assuming that the population standard deviation ($\sigma$) is known (and equal to 5).

(2 points) c) Calculate the proportion of the intervals (in part (b) above) containing the value 300.

Sol:
MTB > random 500 c1-c10;
SUBC> normal 300 5.
MTB > RMean c1-c10 c11.
MTB > let c12=c11-(1.96*5)/sqrt(10)
MTB > let c13=c11+(1.96*5)/sqrt(10)
MTB > let c14=(c12<300)*(300<c13)
MTB > let k1=mean(c14)
MTB > print k1

Data Display

K1    0.950000
MTB >

Here c11 holds the 500 sample means, c12 and c13 hold the lower and upper confidence limits, c14 indicates whether each interval contains 300, and k1 is the proportion of intervals that do.

Multiple-choice questions. Circle the most appropriate answer from the list of answers labeled A), B), C), D), and E). (3 points for each question below)

11) Two students selected from a large class had weights 130 and 150 pounds. Assuming that the distribution of weights of students in this class is normal, construct a 95% confidence interval for the mean weight of students in this class (i.e. population mean). Choose the closest value from the options below. All the values below are in pounds.

A) (77, 203)
B) (13, 267)
C) (97, 183)
D) (130, 150)
E) (120, 160)

Ans: B
$\bar{x} = 140$, $s = \sqrt{\dfrac{(130 - 140)^2 + (150 - 140)^2}{2 - 1}} = \sqrt{200}$
T has 2 - 1 = 1 df.
$$CI = \bar{x} \pm t^* \frac{s}{\sqrt{n}} = 140 \pm 12.71\,\frac{\sqrt{200}}{\sqrt{2}} = 140 \pm 127.1 \approx 140 \pm 127 = (13, 267)$$

12) In a study of the effects of college student employment on academic performance, the researchers analyzed the GPAs of a random sample of students who were employed (denoted Emp) and a random sample of students who were not employed (denoted NotEmp). Some MINITAB outputs obtained from this study are given below. In the questions below, $\mu_{Emp}$ denotes the population mean GPA of all students employed and $\mu_{NotEmp}$ denotes the population mean GPA of all students not employed.

Descriptive Statistics: Emp, NotEmp

Variable   N  N*    Mean  SE Mean  Minimum      Q1  Median      Q3  Maximum
Emp       55   0  2.8734   0.0579   1.9257  2.5987  2.8761  3.2340   3.6547
NotEmp    65   0  3.0224   0.0337   2.3223  2.8569  3.0286  3.2633   3.4923

[Figure: Probability Plot of Emp, NotEmp (Normal); panels Emp and NotEmp; vertical axis: Percent]

Based on the information given above, which of the following statements is true?

A) The 1.5 IQR criterion shows that the maximum observed GPA of the sample of students who were employed is an outlier.
Ans: F
Q3 = 3.2340, Q1 = 2.5987
IQR = Q3 - Q1 = 0.6353
Q3 + 1.5*IQR = 4.18695
The max in the sample of students who were employed is 3.6547 < Q3 + 1.5*IQR = 4.18695, and so the max is not an outlier.

B) The distribution of the GPAs in the sample of students who were not employed is right skewed.

Ans: F
The normal scores plot for the sample of students who were not employed is pretty close to a straight line, and so the distribution is pretty close to normal. (It might be seen as slightly left skewed, since the plot curves slightly to the left and the mean is less than the median, but it is certainly not right skewed.)

C) In the sample of students who were employed, there are more than 15 students with a GPA of 3.25 or higher.

Ans: F
The third quartile of the sample of students who were employed is 3.2340, i.e. about 25% (0.25 * 55 = 13.75) of the students have a GPA of 3.2340 or greater. But 3.2340 < 3.25, and so the proportion of students with a GPA of 3.25 or higher must be less than 25% (i.e. fewer than 13.75 students) and cannot be more than 15.

D) At least 25% of the students in the sample of employed students have a GPA equal to or below 2.6000.

Ans: T
Q1 = 2.5987. The percent of students at or below Q1 is 25% (it can be greater if there is more than one observation equal to Q1 in the data set). Since 2.5987 < 2.6000, at least 25% of the students have a GPA equal to or below 2.6000.

E) None of the above four statements (A)-(D) is true.

Ans: F (e.g. D is true)

Ans: D

13) A nutrition laboratory tested a random sample of 50 "reduced sodium" hot dogs. The mean sodium content of the sample was 309mg. The p-value of the t-test for testing the null hypothesis $H_0: \mu = 300$ against $H_a: \mu > 300$ was 0.038 ($\mu$ is the population mean sodium content). The p-value of the t-test for testing the null hypothesis $H_0: \mu = 298$ against $H_a: \mu > 298$ (using information from the same sample) was 0.015. Assume that the data satisfy all assumptions required for the t-procedures. If we calculate the 95% confidence interval (using t-procedures) for $\mu$ using the data from this sample, what can we say about its margin of error?
Choose the correct range for this margin of error from the following list.

A) it must be less than 4.00mg
B) it must be between 4.00mg and 8.00mg
C) it must be between 8.00mg and 12.00mg
D) it must be between 12.00mg and 16.00mg
E) it must be greater than 16.00mg

Ans: C
The p-value of 0.038 for $H_0: \mu = 300$ against $H_a: \mu > 300$ implies that the p-value for $H_0: \mu = 300$ against $H_a: \mu \ne 300$ is 0.038 x 2 = 0.076 > 0.05 (also note $\bar{x} = 309 > 300$). This implies that the value 300 is in the 95% CI. The 95% CI has its centre at $\bar{x} = 309$, and so the margin of error (i.e. half the length of the CI) is GREATER than 309 - 300 = 9.

Similarly, the p-value of 0.015 for $H_0: \mu = 298$ against $H_a: \mu > 298$ implies that the p-value for $H_0: \mu = 298$ against $H_a: \mu \ne 298$ is 0.015 x 2 = 0.030 < 0.05 (also note $\bar{x} = 309 > 298$). This implies that the value 298 is NOT in the 95% CI. The 95% CI has its centre at $\bar{x} = 309$, and so the margin of error (i.e. half the length of the CI) is LESS than 309 - 298 = 11.

That is, $9 < ME < 11$, and since $(9, 11) \subset (8, 12)$, $ME \in (9, 11)$ implies $ME \in (8, 12)$.

Here are the MINITAB outputs:

One-Sample T

Test of mu = 300 vs > 300

 N     Mean   StDev  SE Mean  95% Lower Bound     T      P
50  309.000  35.000    4.950          300.701  1.82  0.038

One-Sample T

Test of mu = 298 vs > 298

 N     Mean   StDev  SE Mean  95% Lower Bound     T      P
50  309.000  35.000    4.950          300.701  2.22  0.015

One-Sample T

 N     Mean   StDev  SE Mean              95% CI
50  309.000  35.000    4.950  (299.053, 318.947)

ME = (318.947 - 299.053)/2 = 19.894/2 = 9.947, which is between 8 and 12.

14) A total of 210 emphysema patients entering a clinic over a one-year period were treated with one of the two drugs (either the standard drug, A, or an experimental compound, B) for a period of one week. After this period each patient's condition was rated as greatly improved, improved, or no change.
The sample results and some useful MINITAB outputs are shown below:

                       Patient's Condition
Therapy           Greatly Improved  Improved  No change
Standard, A                     45        35         20
Experimental, B                 50        45         15

Tabulated statistics: Therapy, Condition

Rows: Therapy   Columns: Condition

      Greatly Improved  Improved  No Change     All
A                   45        35         20     100
                 45.24     38.10    omitted  100.00
B                   50        45         15     110
                 49.76     41.90      18.33  110.00
All                 95        80         35     210
                 95.00     80.00      35.00  210.00

Cell Contents: Count
               Expected count

The value of the chi-square statistic for the test of independence of patient's condition and therapy is:

A) less than 1.00
B) between 1.00 and 2.00
C) between 2.00 and 3.00
D) between 3.00 and 4.00
E) greater than 4.00

Ans: B

Tabulated statistics: Therapy, Condition

Using frequencies in Count

Rows: Therapy   Columns: Condition

      Greatly Improved  Improved  No Change     All
A                   45        35         20     100
                 45.24     38.10      16.67  100.00
B                   50        45         15     110
                 49.76     41.90      18.33  110.00
All                 95        80         35     210
                 95.00     80.00      35.00  210.00

Cell Contents: Count
               Expected count

Pearson Chi-Square = 1.755, DF = 2, P-Value = 0.416
Likelihood Ratio Chi-Square = 1.757, DF = 2, P-Value = 0.415
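
The Pearson statistic reported above can be reproduced by hand from the observed counts. The following is a minimal Python sketch, not part of the original MINITAB output; the function name `pearson_chi_square` is an illustrative assumption. It computes each expected count as (row total x column total)/(grand total) and sums (observed - expected)^2/expected over the six cells.

```python
def pearson_chi_square(table):
    """Return (statistic, df, expected) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


# Counts from the table above; columns: Greatly Improved, Improved, No Change.
counts = [
    [45, 35, 20],  # Standard, A
    [50, 45, 15],  # Experimental, B
]
statistic, df, expected = pearson_chi_square(counts)
print(round(statistic, 3), df)             # 1.755 2
print([round(e, 2) for e in expected[0]])  # [45.24, 38.1, 16.67]
```

The third expected count in row A, 16.67, is the value marked "omitted" in the question, and the statistic of about 1.755 falls between 1.00 and 2.00, which agrees with answer B.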