UNIVERSITY OF TORONTO AT SCARBOROUGH SAMPLE FINAL EXAM STAB57

Duration - 3 hours
THIS EXAM IS OPEN BOOK (NOTES)
AIDS ALLOWED: Non-communicating calculator
LAST NAME_____________________________________________________
FIRST NAME_____________________________________________________
STUDENT NUMBER___________________________________________
All relevant work MUST be shown for credit. Answer alone (even though correct) will only
qualify for ZERO credit. Please show your work in the space provided; you may use the back of
the pages, if necessary but you MUST remain organized.
PLEASE CHECK AND MAKE SURE THAT THERE ARE NO MISSING PAGES IN
THIS BOOKLET.
Page 1 of 20
(4 points) 1) Suppose that a statistical model is given by the family of Exponential($\theta$) distributions where $\theta \in (0, \infty)$. If our interest is in making inferences about the third moment of the distribution, then determine the characteristic of interest as a function of $\theta$ (i.e. $\psi(\theta)$).
Note: The Exponential($\theta$) distribution has p.d.f. $f(x) = \theta e^{-\theta x}$ for $x > 0$ (and 0 otherwise).

Sol
$\psi(\theta) = E[X^3] = \int_0^\infty x^3\, \theta e^{-\theta x}\, dx = \theta \int_0^\infty x^{4-1} e^{-\theta x}\, dx = \theta\, \frac{\Gamma(4)}{\theta^4} = \frac{3!}{\theta^3} = \frac{6}{\theta^3}$

(5 points) 2) Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from a distribution with p.d.f.
$f(x) = e^{-(x-\theta)}$ for $x > \theta$ (and 0 otherwise),
where $\theta \in R$. Show that $X_{(1)} = \min(X_1, X_2, \ldots, X_n)$ is a sufficient statistic for the model.
Sol
$L(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} e^{-(x_i-\theta)} I(\theta < x_i < \infty) = e^{-\sum x_i} e^{n\theta} \prod_{i=1}^{n} I(\theta < x_i < \infty) = e^{-\sum x_i} e^{n\theta} I(\theta < x_{(1)} < \infty)$
Let $g(x_{(1)}) = e^{n\theta} I(\theta < x_{(1)} < \infty)$ and $h(x_1, x_2, \ldots, x_n) = e^{-\sum x_i}$. Thus $X_{(1)} = \min(X_1, X_2, \ldots, X_n)$ is a sufficient statistic for the model (by the factorization theorem, i.e. Thm 6.1.1, p. 287 of the text).
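3) The conditional distribution of X, the number of claims for an insured in one year, given $\theta$, has p.d.f. given by:
$p(x) = 0.5\left[\frac{e^{-\theta}\theta^{x}}{x!}\right] + 0.5\left[\frac{e^{-2\theta}(2\theta)^{x}}{x!}\right], \quad x = 0, 1, 2, \ldots$
The prior distribution of the parameter $\theta$ is exponential with a mean of 1.
An insured is chosen at random and observed to have no claims in the first year (i.e. X = 0).
(5 points) a) Determine the posterior density of $\theta$.
Sol:
The prior density is $\pi(\theta) = e^{-\theta}$ for $\theta > 0$, and $p(0) = 0.5(e^{-\theta} + e^{-2\theta})$, so
$\pi(\theta \mid 0) \propto 0.5(e^{-\theta} + e^{-2\theta})\, e^{-\theta} = 0.5(e^{-2\theta} + e^{-3\theta})$
Since $\int_0^\infty 0.5(e^{-2\theta} + e^{-3\theta})\, d\theta = 0.5\left(\frac{1}{2} + \frac{1}{3}\right) = \frac{5}{12}$, the posterior density is
$\pi(\theta \mid 0) = \frac{6}{5}(e^{-2\theta} + e^{-3\theta}), \quad \theta > 0$
(4 points) b) Determine the posterior mean of $\theta$ (numerical value is required)
Sol
Posterior mean $= \int_0^\infty \theta\, \pi(\theta \mid 0)\, d\theta = \int_0^\infty \theta\, \frac{6}{5}(e^{-2\theta} + e^{-3\theta})\, d\theta$
$= \frac{6}{5}\int_0^\infty \theta e^{-2\theta}\, d\theta + \frac{6}{5}\int_0^\infty \theta e^{-3\theta}\, d\theta$
$= \frac{6}{5}\left(\frac{1}{2}\right)^2 + \frac{6}{5}\left(\frac{1}{3}\right)^2 = \frac{13}{30} \approx 0.433$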
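The arithmetic behind the posterior in question 3 can be verified with exact fractions. This is a minimal sketch that relies on the standard integrals of e^{-kt} and t e^{-kt} over (0, infinity), which equal 1/k and 1/k^2; the helper names are illustrative assumptions, not notation from the exam.

```python
from fractions import Fraction

def int_exp(k):
    # Integral of exp(-k*t) over (0, infinity) equals 1/k.
    return Fraction(1, k)

def int_t_exp(k):
    # Integral of t*exp(-k*t) over (0, infinity) equals 1/k^2.
    return Fraction(1, k * k)

# Unnormalized posterior given X = 0: 0.5 * (exp(-2t) + exp(-3t)).
norm_const = Fraction(1, 2) * (int_exp(2) + int_exp(3))
weight = Fraction(1, 2) / norm_const  # multiplier in the posterior density
posterior_mean = weight * (int_t_exp(2) + int_t_exp(3))

print(norm_const, weight, posterior_mean, float(posterior_mean))
```

The normalizing constant comes out as 5/12, the multiplier as 6/5, and the posterior mean as 13/30.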
4) Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from a uniform distribution with p.d.f.
$f(x) = \frac{1}{2\theta+1}$ for $0 \le x \le 2\theta+1$ (and 0 otherwise).
(6 points) a) Determine the maximum likelihood estimator of $\theta$.
Sol
(a)
The likelihood function is
$\prod_{i=1}^{n} \frac{1}{2\theta+1} I(0 \le x_i \le 2\theta+1) = \frac{1}{(2\theta+1)^n} I(0 \le x_{(1)} \le 2\theta+1,\ 0 \le x_{(n)} \le 2\theta+1)$
Note that $\frac{1}{(2\theta+1)^n}$ is a decreasing function of $\theta$ on the interval $(-\frac{1}{2}, \infty)$. As $0 \le x_{(n)} \le 2\theta+1$ (i.e. $\theta \ge \frac{1}{2}(x_{(n)} - 1) \ge -\frac{1}{2}$), the maximum occurs when $\hat{\theta} = \frac{1}{2}(x_{(n)} - 1)$, i.e. $\hat{\theta} = \frac{1}{2}(x_{(n)} - 1)$ is the MLE of $\theta$.
[3 points] b) Determine the maximum likelihood estimator of the variance of this distribution.
Sol (b) The variance of the distribution is $\frac{(2\theta+1)^2}{12}$. Since $\theta > -\frac{1}{2}$, this is a one-to-one function of $\theta$. And so the MLE of the variance is
$\frac{(2\hat{\theta}+1)^2}{12} = \frac{\left(2 \cdot \frac{1}{2}(x_{(n)} - 1) + 1\right)^2}{12} = \frac{x_{(n)}^2}{12}$
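The two estimators in question 4 are simple functions of the sample maximum, which the following sketch illustrates. The sample values are hypothetical and chosen only so that the results are easy to check by hand; the function names are also assumptions.

```python
def mle_theta(sample):
    # MLE of theta for Uniform(0, 2*theta + 1): (x_(n) - 1) / 2
    return (max(sample) - 1) / 2

def mle_variance(sample):
    # Invariance of the MLE: plug theta-hat into (2*theta + 1)^2 / 12,
    # which simplifies to x_(n)^2 / 12.
    theta_hat = mle_theta(sample)
    return (2 * theta_hat + 1) ** 2 / 12

sample = [0.4, 2.7, 1.3, 3.0, 0.9]  # hypothetical data, largest value 3.0
print(mle_theta(sample), mle_variance(sample))
```

With a largest observation of 3.0, the estimate of theta is 1.0 and the estimated variance is 9/12 = 0.75.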
5) A study of the costs involved in a particular surgery was done in California. The 95% confidence interval for the mean cost was ($6061.41, $6338.59). No other information was given in the report, but we have enough information here to answer the following questions. Assume that this interval was calculated based on the normal distribution (i.e. using a location normal model with known standard deviation).
(5 points) a) Calculate the upper limit of the 90% confidence interval for the mean cost.
Sol
$\bar{x}$ = (6061.41 + 6338.59)/2 = 6200
$1.96\, \frac{\sigma}{\sqrt{n}}$ = (6338.59 - 6061.41)/2 = 138.59 and
$\frac{\sigma}{\sqrt{n}}$ = 138.59/1.96 = 70.70918367
and so the 90% CI is 6200 +/- 1.645 * 70.70918367
upper limit = 6316.316607
(4 points) b) Calculate the p-value for testing the null hypothesis $H_0: \mu = 6100$ against the alternative hypothesis $H_a: \mu > 6100$. (Note: An interval or a range of possible values for the p-value is not sufficient for this question. A numerical answer is required.)
Sol $z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{6200 - 6100}{70.71} = 1.41$ and the p-value is the area under the standard normal curve to the right of 1.41, which is 0.0793.
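The steps of question 5 can be reproduced in a few lines. The sketch below assumes the table values 1.96 and 1.645 for the normal quantiles and computes the upper tail area from the complementary error function; variable names are illustrative.

```python
import math

def upper_tail(z):
    # P(Z > z) for a standard normal Z, via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))

lower95, upper95 = 6061.41, 6338.59
xbar = (lower95 + upper95) / 2         # centre of the interval
se = (upper95 - lower95) / 2 / 1.96    # sigma / sqrt(n)
upper90 = xbar + 1.645 * se            # upper limit of the 90% CI

z = (xbar - 6100) / se                 # test statistic for H0: mu = 6100
p_value = upper_tail(z)                # one-sided p-value

print(xbar, se, upper90, z, p_value)
```

Because z is not rounded to 1.41 here, the computed p-value is slightly smaller than the table-based 0.0793, but it agrees to two decimal places.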
(6 points) 6) The sponsors of television shows targeted at the children’s market wanted to know the amount of time children spend watching television, since the types and numbers of programs and commercials are greatly influenced by this information. A random sample of 100 children was asked to keep track of the number of hours of television they watch each week and the average time for this sample was 27.19 hours. From past experience, it is known that the population standard deviation of the weekly amount of television watched is $\sigma$ = 8.0 hours and that the weekly amount of television watched is normally distributed with mean µ. Suppose that the prior distribution of µ is $N(30, 10^2)$. Calculate a 95% credible interval for µ.
Sol
For the location normal model with known variance $\sigma_0^2$ and prior $\mu \sim N(\mu_0, \tau_0^2)$, the posterior distribution $\pi(\mu \mid X_1, \ldots, X_n)$ is
$N\left( \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1} \left(\frac{\mu_0}{\tau_0^2} + \frac{n\bar{X}}{\sigma_0^2}\right),\ \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1} \right)$
An HPD set that has credible probability $(1 - \alpha)$ is
$\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1} \left(\frac{\mu_0}{\tau_0^2} + \frac{n\bar{X}}{\sigma_0^2}\right) \pm z_{1-\alpha/2} \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1/2}$
For the given data, if we have a prior for µ of $N(30, 10^2)$, then the HPD set with credible probability 95% is
$\left(\frac{1}{100} + \frac{100}{64}\right)^{-1} \left(\frac{30}{100} + \frac{100 \times 27.19}{64}\right) \pm 1.96 \left(\frac{1}{100} + \frac{100}{64}\right)^{-1/2} = [25.645, 28.771]$
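The credible interval in question 6 follows from the usual normal-normal conjugate update, which can be sketched as below. The variable names (`mu0`, `tau0_sq`, `sigma0_sq`) are illustrative assumptions standing for the prior mean, the prior variance, and the known sampling variance.

```python
import math

n, xbar = 100, 27.19
sigma0_sq = 8.0 ** 2             # known population variance
mu0, tau0_sq = 30.0, 10.0 ** 2   # prior mean and prior variance

precision = 1 / tau0_sq + n / sigma0_sq                     # posterior precision
post_mean = (mu0 / tau0_sq + n * xbar / sigma0_sq) / precision
post_sd = math.sqrt(1 / precision)

lo = post_mean - 1.96 * post_sd
hi = post_mean + 1.96 * post_sd
print(post_mean, post_sd, lo, hi)
```

The posterior mean is pulled only slightly from 27.19 toward the prior mean of 30, because the data precision (100/64) is much larger than the prior precision (1/100).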
7) (This term we could not discuss the material for this question, and so you may ignore it.) The following MINITAB output was obtained from a study of the relationship between the salary (in thousands of dollars) and length of service (in years) based on a random sample of 25 employees from a large firm.
Descriptive Statistics: Length, Salary
Variable   N   N*    Mean   SE Mean   StDev   Minimum      Q1   Median      Q3
Length    25    0   9.720     0.797   3.985     3.000   6.500    9.000  12.500
Salary    25    0  29.229     0.994   4.971    21.353  26.047   28.446  31.721
Regression Analysis: Salary versus Length
The regression equation is
Salary = 20.3 + 0.915 Length
(6 points) a) Test whether there is a linear relationship between Length and Salary. Use $\alpha$ = 0.05. State the null and the alternative hypotheses.
Sol
$H_0: \beta_2 = 0$
$H_a: \beta_2 \neq 0$
SST = (n-1) var(Y) = 24*(4.971^2) = 593.060184
n = 25, sx = 3.985, b2 = 0.915
SSR = (b2^2)*(n-1)*(sx^2) = 319.087713
SSE = 593.060184 - 319.09 = 274.0
MSE = SSE/(n-2) = 274/(25-2) = 11.9
F = 319.09/11.9 = 26.81428571 ~ F(1, 23). The table value 4.35 (for F(1, 20)) is less than the calculated F, and so we reject the null hypothesis. That is, there is sufficient evidence of a linear relationship.
(4 points) b) Calculate a 95% confidence interval for the slope of the regression line of Salary on Length. Show your work clearly.
Sol
$S^2_{\hat{\beta}_2} = \frac{MSE}{(n-1)s_x^2} = \frac{11.9}{24 \times 3.985^2} = 0.03122331915$
$S_{\hat{\beta}_2} = 0.1767012143$
t(25-2, 0.05) = 2.069
CI = 0.915 +/- 2.069 * 0.1767 = (0.549, 1.281)

(4 points) c) Estimate the expected salary (i.e. mean salary) of employees with 5 years experience. Calculate the standard error of your estimate.
Sol (We could not discuss multiple regression this term.)
$Var(B_1 + B_2 x \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)$
Estimate of the expected salary = 20.3 + 0.915 x 5 = 24.875
s = sqrt(11.9) = 3.45
Std error $= s \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)^{1/2} = 3.45 \left( \frac{1}{25} + \frac{(5 - 9.72)^2}{24 \times 3.985^2} \right)^{1/2} \approx 1.08$
8) (Ex 6.18 p251 Neter, some data deleted to make n = 65 and so dfError = 60) In a study of the relationship between rental rates (y) and other variables, a commercial real estate company collected data on n = 65 commercial properties. The following variables were measured on each property:
y = rental rate
x2 = age
x3 = operating expenses and taxes
x4 = vacancy rate
x5 = total square footage
The company was interested in estimating the regression model:
$E(Y \mid x_2, x_3, x_4, x_5) = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$ with $x_1 = 1$ for all observations.
Some MINITAB outputs (with some values deleted) used for estimating this model are given below:
Descriptive Statistics: x2, x3, x4, x5, y
Variable    N     Mean    StDev      Minimum           Q1       Q3  Maximum
x2         65    7.077    6.258  0.000000000        2.000   14.000   18.000
x3         65    9.462    2.655        3.000        7.955   11.620   14.620
x4         65   0.0892   0.1462  0.000000000  0.000000000   0.1250   0.7300
x5         65   158042   108701        27000        65000   237966   484290
y          65   15.182    1.850       10.500       14.000   16.500   19.250
Regression Analysis: y versus x2, x3, x4, x5
The regression equation is
omitted
Predictor        Coef     SE Coef        T        P
Constant      omitted     omitted  omitted  omitted
x2           -0.15977     0.02517    -6.35    0.000
x3            0.27442     omitted     4.00    0.000
x4              0.302       1.144     0.26    0.793
x5         0.00000918  0.00000156     5.89    0.000
S = omitted   R-Sq = omitted
Analysis of Variance
Source          DF       SS       MS
Regression       4  138.715  omitted
Residual Error  60  omitted  omitted
Total           64  omitted
(5 points) a) Test whether or not there is a relationship between the response and the predictors. Use $\alpha$ = 0.05. State the null and the alternative hypotheses.
Sol
$H_0: \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$ (Note this textbook uses $\beta_1$ for the intercept.)
$H_a$: at least one of $\beta_2$, $\beta_3$, $\beta_4$ or $\beta_5$ is not equal to 0
MSR = 138.715/4 = 34.67875
SST = (65-1)*Sy^2 = 64*(1.850^2) = 219.04
SSE = 219.04 - 138.715 = 80.325 and MSE = 80.325/60 = 1.33875
F = 34.67875/1.33875 = 25.9038282
F(4, 60, 0.95) = 2.53 (from the table, p. 669 of the text). The calculated F exceeds the table value, and so we reject H0.
(3 points) b) Calculate the least squares estimate of $\beta_1$.
Sol
$b_1 = \bar{y} - b_2\bar{x}_2 - b_3\bar{x}_3 - b_4\bar{x}_4 - b_5\bar{x}_5$
= 15.182-(-0.15977)*7.077-0.27442*9.462-0.302*0.0892-0.00000918*158042 = 12.23836629
( 3 points) c) Calculate and interpret the value of R-square.
Sol
R-sq = SSR/SST =
138.715/219.04 = 0.6332861578
63% of the variability in the y values is explained by this model.
(3 points) d) Calculate a 95% confidence interval for $\beta_3$, the coefficient of x3 in the above regression model.
Sol
SE = Coef/T = 0.27442/4 = 0.068605
CI = 0.27442 +/- t * 0.068605, where t has df 60.
With t(60, 0.05) = 2.000, CI = 0.27442 +/- 0.13721 = (0.137, 0.412)
(5 points) e) The least squares estimate of the simple linear regression equation with $x_2$ (and with $x_1$ = 1 for all observations) is y = 15.8 - 0.0835 x2. Use this information (and the information above) to test the null hypothesis $H_0: \beta_3 = \beta_4 = \beta_5 = 0$ against the alternative hypothesis $H_a$: at least one of $\beta_3$, $\beta_4$ or $\beta_5$ is not equal to 0.
Sol SSR(x2) = (0.0835^2)*64*(6.258^2) = 17.47527596
SS(drop) = SSR(x2, x3, x4, x5) - SSR(x2) = 138.715 - 17.47527596 = 121.239724
F = (121.239724/3)/1.33875 = 30.2. Compare this with the F table value with df (3, 60): F(3, 60, 0.95) = 2.76 < 30.2, and so we reject H0.
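The overall F test, R-square, and the partial F test of question 8 all come from the same few sums of squares. The sketch below recomputes them from the values given in the partial MINITAB output; variable names are illustrative assumptions.

```python
n, k = 65, 4                  # observations and number of predictors x2..x5
ssr_full = 138.715            # regression SS from the ANOVA table
s_y = 1.850                   # sample standard deviation of y

sst = (n - 1) * s_y ** 2      # total SS
sse = sst - ssr_full          # error SS of the full model
mse = sse / (n - k - 1)       # 60 error degrees of freedom

f_overall = (ssr_full / k) / mse
r_sq = ssr_full / sst

# Reduced model with x2 only: SSR = b2^2 * (n - 1) * s_x2^2
b2_simple, s_x2 = 0.0835, 6.258
ssr_reduced = b2_simple ** 2 * (n - 1) * s_x2 ** 2

# Partial F for dropping x3, x4, x5 (3 numerator degrees of freedom)
f_partial = ((ssr_full - ssr_reduced) / 3) / mse

print(f_overall, r_sq, ssr_reduced, f_partial)
```

Both F statistics are far above the corresponding 5% table values (2.53 for df (4, 60) and 2.76 for df (3, 60)).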
Here is the full MINITAB output
Regression Analysis: y versus x2, x3, x4, x5
The regression equation is
y = 12.2 - 0.160 x2 + 0.274 x3 + 0.30 x4 + 0.000009 x5
Predictor        Coef     SE Coef      T      P
Constant      12.2392      0.6341  19.30  0.000
x2           -0.15977     0.02517  -6.35  0.000
x3            0.27442     0.06859   4.00  0.000
x4              0.302       1.144   0.26  0.793
x5         0.00000918  0.00000156   5.89  0.000
S = 1.15784   R-Sq = 63.3%   R-Sq(adj) = 60.8%
Analysis of Variance
Source          DF       SS      MS      F      P
Regression       4  138.715  34.679  25.87  0.000
Residual Error  60   80.435   1.341
Total           64   219.15
Regression Analysis: y versus x2
The regression equation is
y = 15.8 - 0.0835 x2
Predictor      Coef  SE Coef      T      P
Constant    15.7735   0.3364  46.88  0.000
x2         -0.08354  0.03573  -2.34  0.023
S = 1.78910   R-Sq = 8.0%   R-Sq(adj) = 6.5%
Analysis of Variance
Source          DF       SS      MS     F      P
Regression       1   17.495  17.495  5.47  0.023
Residual Error  63  201.655   3.201
Total           64  219.150
( 6 points) 9) Consider a large population of families in which each family has exactly three
children. If the probability of a male birth is 0.5 and the genders of the three children in any
family are independent of one another, the number of male children in a randomly selected
family will have a binomial distribution with three trials (i.e. n =3 and p = 0.5). Suppose a
random sample of 160 families (each family with three children) yields the following results.
Number of male children    0    1    2 or more
Frequency                 14   66    80
Test whether the distribution of the number of males in a family with three children selected at random from this population has a binomial(3, 0.5) distribution. Use $\alpha$ = 0.05. State the null and the alternative hypotheses.
Sol:
H0: The distribution is Bin (3, 0.5)
Ha: The distribution is not Bin (3, 0.5).
P( X =0) = 0.5^3 = 0.125
P(X = 1) = 3 * 0.5^3 = 0.375
P(X>=2)= 1- 0.125-0.375 = 0.5
And so expected frequencies are 160*0.125 = 20, 60 and 80 respectively and
Chi-sq= ((14-20)^2)/20+((66-60)^2)/60+((80-80)^2)/80 = 2.4
Chisq Table Value (df= 3-1, 0.05) = 5.99
ChiSq calculated < Table value and so we do not reject the null hypothesis and so no evidence
against the assumption of a bin(3, 0.5) distribution.
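The goodness-of-fit calculation in question 9 can be sketched as follows. The critical value 5.99 is the chi-square table value for 2 degrees of freedom quoted in the solution; variable names are illustrative.

```python
from math import comb

observed = [14, 66, 80]        # families with 0, 1, and 2 or more boys
n_families = sum(observed)
p = 0.5

# Binomial(3, 0.5) cell probabilities, with the last two outcomes pooled.
p0 = comb(3, 0) * p ** 3
p1 = comb(3, 1) * p ** 3
probs = [p0, p1, 1 - p0 - p1]

expected = [n_families * q for q in probs]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical = 5.99                # chi-square table value, df = 3 - 1 = 2
reject = chi_sq > critical
print(expected, chi_sq, reject)
```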
10) The idea of a 95% confidence interval is that the interval captures the true parameter value in
95% of all samples selected from the population. Write a MINITAB code to verify this by
simulation. More specifically, write a MINITAB code to do the following:
(2 points) a) Generate 500 samples, each of size 10 from a normal distribution with mean 300
and standard deviation 5. Your MINITAB code should store the 500 samples in 500 rows of a
MINITAB worksheet.
(3 points) b) For each sample generated in part (a) above, calculate a 95% confidence interval for the population mean assuming that the population standard deviation ($\sigma$) is known (and equal to 5).
(2 points) c) Calculate the proportion of the intervals (in part (b) above) containing the value
300.
Sol
MTB > random 500 c1-c10;
SUBC> normal 300 5.
MTB > RMean c1-c10 c11.
MTB > let c12=c11-(1.96*5)/sqrt(10)
MTB > let c13=c11+(1.96*5)/sqrt(10)
MTB > let c14=(c12<300)*(300<c13)
MTB > let k1=mean(c14)
MTB > print k1
Data Display
K1    0.950000
MTB >
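For readers without MINITAB, the same coverage simulation can be sketched in Python. This is only an illustrative analogue of the session commands above, not a replacement for the required MINITAB answer; the function name and the seed are assumptions, and the proportion obtained will vary with the seed.

```python
import math
import random

def coverage(n_samples=500, n=10, mu=300.0, sigma=5.0, z=1.96, seed=57):
    # Fraction of z-intervals (known sigma) that contain the true mean mu.
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(n_samples):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if sample_mean - half_width < mu < sample_mean + half_width:
            hits += 1
    return hits / n_samples

proportion = coverage()
print(proportion)
```

As with the MINITAB run, the printed proportion should be close to 0.95.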
Multiple-choice questions. Circle the most appropriate answer from the list of answers labeled A), B), C), D), and E). (3 points for each question below.)
11) Two students selected from a large class had weights 130 and 150 pounds. Assuming that the
distribution of weights of students in this class is normal, construct a 95% confidence interval for
the mean weight of students in this class (i.e. population mean.).
Choose the closest value from the options below. All the values below are in pounds.
A) (77, 203)
B) (13, 267)
C) (97, 183)
D) (130, 150)
E) (120, 160)
Ans: B
$\bar{x} = 140$
$s^2 = \frac{(130 - 140)^2 + (150 - 140)^2}{2 - 1} = 200$
T has 2 - 1 = 1 df.
$CI = \bar{x} \pm t^* \frac{s}{\sqrt{n}} = 140 \pm 12.71 \frac{\sqrt{200}}{\sqrt{2}} = 140 \pm 127.1 \approx 140 \pm 127 = (13, 267)$
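The interval in question 11 can be reproduced directly from the two observations. The sketch below takes the critical value 12.71 (t with 1 degree of freedom, 95% two-sided) from the solution rather than computing it; variable names are illustrative.

```python
import math
import statistics

weights = [130, 150]
n = len(weights)
xbar = statistics.mean(weights)
s = statistics.stdev(weights)   # sample SD with n - 1 in the denominator

t_crit = 12.71                  # t table value for 1 df, 95% two-sided
half_width = t_crit * s / math.sqrt(n)
ci = (xbar - half_width, xbar + half_width)
print(xbar, s, ci)
```

With only one degree of freedom the critical value is very large, which is why the interval is so much wider than the range of the data.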
12) In a study of the effects of college student employment on academic performance, the
researchers analyzed the GPAs of a random sample of students who were employed (denoted
Emp) and a random sample of students who were not employed (denoted NotEmp). Some
MINITAB outputs obtained from this study are given below. In the questions below, $\mu_{Emp}$ denotes the population mean GPA of all students employed and $\mu_{NotEmp}$ denotes the population mean GPA of all students not employed.
Descriptive Statistics: Emp, NotEmp
Variable   N   N*    Mean  SE Mean  Minimum      Q1  Median      Q3  Maximum
Emp       55    0  2.8734   0.0579   1.9257  2.5987  2.8761  3.2340   3.6547
NotEmp    65    0  3.0224   0.0337   2.3223  2.8569  3.0286  3.2633   3.4923
[Figure: Probability Plot of Emp, NotEmp (Normal). Two panels, Emp and NotEmp; vertical axis: Percent; horizontal axis: GPA.]
Based on the information given above, which of the following statements is true?
A) The 1.5 × IQR criterion shows that the maximum observed GPA of the sample of students who were employed is an outlier.
Ans: F
Q3 = 3.2340
Q1 = 2.5987
IQR = Q3 - Q1 = 0.6353
Q3 + 1.5*IQR = 4.18695
The max in the sample of students who were employed is 3.6547 < Q3 + 1.5*IQR = 4.18695, and so the max is not an outlier.
B) The distribution of the GPAs in the sample of students who were not employed is right skewed.
Ans: F
The normal scores plot for the sample of students who were not employed is pretty close to a straight line and so the distribution is pretty close to normal. (It might be seen as slightly left skewed, since the plot curves slightly to the left and the mean is less than the median, but it is certainly not right skewed.)
C) In the sample of students who were employed, there are more than 15 students with a GPA of 3.25 or higher.
Ans: F
The third quartile of the sample of students who were employed is 3.2340, i.e. about 25% of the students (0.25 * 55 = 13.75) have a GPA of 3.2340 or greater. But 3.2340 < 3.25, and so the proportion of students with a GPA of 3.25 or higher must be less than 25% (i.e. fewer than 13.75 students) and cannot be more than 15.
D) At least 25% of the students in the sample of employed students have a GPA equal to or below 2.6000.
Ans: T
Q1 = 2.5987 < 2.6000. The percent of students at or below Q1 is 25% (it can be greater if more than one observation in the data set is equal to Q1), and so at least 25% of the students have a GPA equal to or below 2.6000.
E) None of the above four statements (A)-(D) is true.
Ans: F (e.g. D is true)
Ans: D
13) A nutrition laboratory tested a random sample of 50 “reduced sodium” hot dogs. The mean sodium content of the sample was 309mg. The p-value of the t-test for testing the null hypothesis $H_0: \mu = 300$ against $H_a: \mu > 300$ was 0.038. ($\mu$ is the population mean sodium content.) The p-value of the t-test for testing the null hypothesis $H_0: \mu = 298$ against $H_a: \mu > 298$ (using information from the same sample) was 0.015. Assume that the data satisfy all assumptions required for the t-procedures.
If we calculate the 95% confidence interval (using t-procedures) for $\mu$ using the data from this sample, what can we say about its margin of error? Choose the correct range for this margin of error from the following list.
A) it must be less than 4.00mg
B) it must be between 4.00mg and 8.00mg
C) it must be between 8.00mg and 12.00mg
D) it must be between 12.00mg and 16.00mg
E) it must be greater than 16.00mg
Ans: C
The p-value for $H_0: \mu = 300$ against $H_a: \mu > 300$ = 0.038 implies that the p-value for $H_0: \mu = 300$ against $H_a: \mu \neq 300$ = 0.038 x 2 = 0.076 > 0.05 (also note $\bar{x} = 309 > 300$).
This implies that the value 300 is in the 95% CI. The 95% CI has its centre at $\bar{x} = 309$ and so the margin of error (i.e. half length of the CI) is GREATER than 309 - 300 = 9.
Similarly,
The p-value for $H_0: \mu = 298$ against $H_a: \mu > 298$ = 0.015 implies that the p-value for $H_0: \mu = 298$ against $H_a: \mu \neq 298$ = 0.015 x 2 = 0.030 < 0.05 (also note $\bar{x} = 309 > 298$).
This implies that the value 298 is NOT in the 95% CI. The 95% CI has its centre at $\bar{x} = 309$ and so the margin of error (i.e. half length of the CI) is LESS than 309 - 298 = 11.
i.e. $9 < ME < 11$ (and $9 < ME < 11 \Rightarrow 8 < ME < 12$)
$ME \in (9, 11) \Rightarrow ME \in (8, 12)$ (since $(9, 11) \subset (8, 12)$)
Here are the MINITAB outputs
One-Sample T
Test of mu = 300 vs > 300
                                 95% Lower
 N     Mean   StDev  SE Mean      Bound     T      P
50  309.000  35.000    4.950    300.701  1.82  0.038
One-Sample T
Test of mu = 298 vs > 298
                                 95% Lower
 N     Mean   StDev  SE Mean      Bound     T      P
50  309.000  35.000    4.950    300.701  2.22  0.015
One-Sample T
 N     Mean   StDev  SE Mean              95% CI
50  309.000  35.000    4.950  (299.053, 318.947)
ME = (318.947 - 299.053)/2 = 19.894/2 = 9.947, which is between 8 and 12.
14) A total of 210 emphysema patients entering a clinic over a one-year period were treated with
one of the two drugs (either the standard drug, A, or an experimental compound, B) for a period
of one week. After this period each patient’s condition was rated as greatly improved, improved,
or no change. The sample results and some useful MINITAB outputs are shown below:
                         Patient’s Condition
Therapy           Greatly Improved   Improved   No change
Standard, A                     45         35          20
Experimental, B                 50         45          15
Tabulated statistics: Therapy, Condition
Rows: Therapy   Columns: Condition
        Greatly
       Improved  Improved  No Change     All
A            45        35         20     100
          45.24     38.10    omitted  100.00
B            50        45         15     110
          49.76     41.90      18.33  110.00
All          95        80         35     210
          95.00     80.00      35.00  210.00
Cell Contents:  Count
                Expected count
The value of the chi-square statistic for the test of independence of patient’s condition and
therapy is:
A) less than 1.00
B) between 1.00 and 2.00
C) between 2.00 and 3.00
D) between 3.00 and 4.00
E) greater than 4.00
Ans: B
Tabulated statistics: Therapy, Condition
Using frequencies in Count
Rows: Therapy   Columns: Condition
        Greatly
       Improved  Improved  No Change     All
A            45        35         20     100
          45.24     38.10      16.67  100.00
B            50        45         15     110
          49.76     41.90      18.33  110.00
All          95        80         35     210
          95.00     80.00      35.00  210.00
Cell Contents:  Count
                Expected count
Pearson Chi-Square = 1.755, DF = 2, P-Value = 0.416
Likelihood Ratio Chi-Square = 1.757, DF = 2, P-Value = 0.415
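The Pearson chi-square value reported by MINITAB can be recomputed from the observed counts alone. The sketch below builds the expected counts as (row total × column total)/grand total and sums the usual terms; variable names are illustrative.

```python
# Rows: therapy A, B. Columns: greatly improved, improved, no change.
observed = [[45, 35, 20],
            [50, 45, 15]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence for each cell.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)
print(expected, chi_sq, df)
```

The expected count for therapy A with no change works out to about 16.67, the value omitted from the first table, and the statistic falls between 1.00 and 2.00.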