Chapter 5 Solutions for Resampling Methods
Textbook: An Introduction to Statistical Learning with Applications in R

Question 1: We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, ..., p predictors.

(a) Which of the three models with k predictors has the smallest training RSS?

Solution: Best subset selection has the smallest training RSS, since it examines every model with k predictors and keeps the one with the lowest training RSS. Forward and backward stepwise selection can only choose the best k-variable model that is reachable from the previously selected (k-1)- or (k+1)-variable model, so they may miss the overall best model of size k. (A small simulation illustrating this, together with the nesting property in part (c), is sketched after this question.)

(b) Which of the three models with k predictors has the smallest test RSS?

Solution: This cannot be determined in general. Best subset selection has the smallest training RSS, but the smallest training RSS does not guarantee the smallest test RSS; any of the three methods could, by chance, select the k-variable model that happens to perform best on the test data.

(c) True or False:

i) The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.

Solution: True. Forward stepwise selection builds the (k+1)-variable model by adding the single best remaining variable to the k predictors already chosen, so the k-variable model is always nested in the (k+1)-variable model.

ii) The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.

Solution: True. Backward stepwise selection obtains the k-variable model by removing the least useful variable from the (k+1)-variable model, so the k-variable model is always a subset of the (k+1)-variable model.

iii) The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.

Solution: False in general. The variables left in the k-variable model by backward stepwise selection can be entirely different from the variables chosen in the (k+1)-variable model by forward stepwise selection, because the two methods follow different search paths.

iv) The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.

Solution: False in general, for the same reason given in part iii.

v) The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k+1)-variable model identified by best subset selection.

Solution: False in general. Best subset selection chooses the best predictors for the k-variable model independently of those chosen for the (k+1)-variable model, so the two sets of predictors need not be nested.
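The claims in parts (a) and (c)(i) can be checked numerically. The following R sketch is not part of the original solutions: the simulated data set, the seed, and the object names are illustrative assumptions. It uses regsubsets() from the leaps package to compare the training RSS of the three methods at every model size and to verify that the forward stepwise models are nested.

    library(leaps)

    set.seed(1)
    n <- 100; p <- 8
    x <- matrix(rnorm(n * p), n, p)
    y <- x[, 1] + 0.5 * x[, 3] - x[, 5] + rnorm(n)
    dat <- data.frame(y = y, x)

    full <- regsubsets(y ~ ., data = dat, nvmax = p, method = "exhaustive")
    fwd  <- regsubsets(y ~ ., data = dat, nvmax = p, method = "forward")
    bwd  <- regsubsets(y ~ ., data = dat, nvmax = p, method = "backward")

    # (a) For every model size k, compare the training RSS of the three methods
    cbind(best = summary(full)$rss,
          forward = summary(fwd)$rss,
          backward = summary(bwd)$rss)

    # (c)(i) Forward stepwise models are nested: the k-variable model's predictors
    # are contained in the (k+1)-variable model's predictors for every k
    sel <- summary(fwd)$which
    all(sapply(1:(p - 1),
               function(k) all(which(sel[k, ]) %in% which(sel[k + 1, ]))))

For every model size the best subset column should be no larger than the other two, and the final expression should return TRUE, confirming the nesting property of forward stepwise selection.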
Question 2: For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Solution: Option iii. The lasso constrains the coefficients, so it is less flexible than least squares. It improves prediction accuracy when least squares has high variance, because the decrease in variance outweighs the increase in bias.

(b) Repeat (a) for ridge regression relative to least squares.

Solution: Option iii. Like the lasso, ridge regression shrinks the coefficients and is therefore less flexible than least squares; it improves prediction accuracy when the decrease in variance exceeds the increase in bias.

(c) Repeat (a) for non-linear methods relative to least squares.

Solution: Option ii. Non-linear methods are more flexible, which increases variance and reduces bias; they predict better when the increase in variance is less than the decrease in bias.

Question 3: Suppose we estimate the regression coefficients in a linear regression model by minimizing

\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s

for a particular value of s. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.

(a) As we increase s from 0, the training RSS will:

i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

Solution: Option iv. As s increases the constraint relaxes, the coefficients are allowed to move closer to their least squares values, and the training RSS steadily decreases.

(b) Repeat (a) for test RSS.

Solution: Option ii. The test RSS decreases at first as the model gains useful flexibility, then starts increasing once the model begins to overfit, giving a U shape.

(c) Repeat (a) for variance.

Solution: Option iii. As s increases, the model becomes more flexible, so the variance steadily increases.

(d) Repeat (a) for (squared) bias.

Solution: Option iv. As s increases from 0, the model moves toward the unbiased least squares fit, so the squared bias steadily decreases.

(e) Repeat (a) for the irreducible error.

Solution: Option v. The irreducible error does not depend on the model, so it remains constant.

Question 4: Suppose we estimate the regression coefficients in a linear regression model by minimizing

\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

for a particular value of λ. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.

(a) As we increase λ from 0, the training RSS will:

i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

Solution: Option iii. At λ = 0 the fit is the least squares fit, which minimizes the training RSS; as λ → ∞ the coefficients βj are shrunk toward 0, moving the fit away from the least squares solution, so the training RSS steadily increases.

(b) Repeat (a) for test RSS.

Solution: Option ii. The test RSS first decreases as the shrinkage reduces variance, then increases once the bias introduced by the shrinkage dominates, giving a U shape.

(c) Repeat (a) for variance.

Solution: Option iv. Increasing λ makes the model less flexible, so the variance steadily decreases.

(d) Repeat (a) for (squared) bias.

Solution: Option iii. Increasing λ moves the fit away from the unbiased least squares fit toward a more constrained model, so the squared bias steadily increases.

(e) Repeat (a) for the irreducible error.

Solution: Option v. The irreducible error cannot be changed by the choice of model, so it remains constant. (A small glmnet sketch of the ridge behaviour in part (a) follows this question.)
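The monotone behaviour described in Question 4(a) can be illustrated with ridge regression in glmnet (alpha = 0). This sketch is an illustration on simulated data, not part of the original solutions; the seed, the coefficient vector, and the lambda grid are assumptions. Note that glmnet standardizes the predictors and scales the penalty by its own convention, so only the qualitative pattern matters: the training RSS rises and the coefficients shrink as λ grows.

    library(glmnet)

    set.seed(2)
    n <- 100; p <- 10
    x <- matrix(rnorm(n * p), n, p)
    y <- drop(x %*% c(3, -2, 1.5, rep(0, p - 3)) + rnorm(n))

    # Fit ridge regression (alpha = 0) over a decreasing grid of lambda values
    lambdas <- c(100, 10, 1, 0.1, 0)
    fit <- glmnet(x, y, alpha = 0, lambda = lambdas)

    # Training RSS for each lambda; columns of predict() follow fit$lambda
    rss <- colSums((y - predict(fit, newx = x))^2)
    data.frame(lambda = fit$lambda, training_RSS = rss)

    # Coefficients shrink toward zero as lambda grows
    round(as.matrix(coef(fit)), 3)

Printing the data frame should show the training RSS increasing with λ, while the coefficient matrix shows every βj moving toward zero, matching the answers to parts (a), (c), and (d).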
Question 8: In this exercise, we will generate simulated data, and will then use this data to perform best subset selection. (An end-to-end R sketch of parts (a) through (e) is given after part (f).)

(a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector ε of length n = 100.

Solution: Both X and the noise vector were generated with rnorm(100).

(b) Generate a response vector Y of length n = 100 according to the model

Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon,

where β0, β1, β2, and β3 are constants of your choice.

Solution: The chosen constants were β0 = 5.00, β1 = 0.35, β2 = 4.37, and β3 = 0.87.

(c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors X, X^2, ..., X^10. What is the best model obtained according to Cp, BIC, and adjusted R^2? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y.

Solution: The best model according to Cp, BIC, and adjusted R^2 was the one with 3 predictors (X, X^2, and X^3). Its estimated coefficients were close to the true values β0 = 5.00, β1 = 0.35, β2 = 4.37, and β3 = 0.87.

(d) Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?

Solution: Forward and backward stepwise selection both chose the model of size 3, and its coefficients match the results of best subset selection and the true values.

(e) Now fit a lasso model to the simulated data, again using X, X^2, ..., X^10 as predictors. Use cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ. Report the resulting coefficient estimates, and discuss the results obtained.

Solution: The cross-validation plot shows the MSE as a function of log(λ), with the number of non-zero coefficients (degrees of freedom) along the top. Each red dot is the cross-validated MSE for a given λ, the line segment around it is an approximate 95% confidence interval, and the two dotted vertical lines mark the λ values suggested by cross-validation. The lasso almost exactly recovered the simulated model: it also included X^4, but with a very small coefficient. The resulting estimates were

β0 = 5.122005645
β1 = 0.171749843
β2 = 4.214476922
β3 = 0.875687373
β4 = 0.008424754

(f) Now generate a response vector Y according to the model Y = \beta_0 + \beta_7 X^7 + \epsilon, and repeat the analysis.

Solution: The coefficient values chosen were β0 = 5 and β7 = 0.78. The best λ selected by cross-validation was 1.105952, and the lasso identified X^7 as the relevant predictor. Thus the lasso is a very effective technique both for identifying the right model and for predicting accurately.
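The following R sketch walks through parts (a) through (e) of Question 8. The seed and object names are assumptions, and because the seed used in the original solution is not reported, the fitted values will differ slightly from the numbers quoted above; the true coefficients β0 = 5, β1 = 0.35, β2 = 4.37, and β3 = 0.87 are taken from part (b).

    library(leaps)
    library(glmnet)

    set.seed(3)
    n <- 100
    x   <- rnorm(n)                      # (a) predictor
    eps <- rnorm(n)                      # (a) noise
    y   <- 5 + 0.35 * x + 4.37 * x^2 + 0.87 * x^3 + eps   # (b) response

    dat <- data.frame(y = y, poly(x, 10, raw = TRUE))      # columns X1, ..., X10

    # (c) best subset selection over x, x^2, ..., x^10
    fit.best <- regsubsets(y ~ ., data = dat, nvmax = 10)
    which.min(summary(fit.best)$cp)      # model size chosen by Cp
    which.min(summary(fit.best)$bic)     # model size chosen by BIC
    which.max(summary(fit.best)$adjr2)   # model size chosen by adjusted R^2
    coef(fit.best, 3)                    # coefficients of the 3-variable model

    # (d) forward and backward stepwise selection
    fit.fwd <- regsubsets(y ~ ., data = dat, nvmax = 10, method = "forward")
    fit.bwd <- regsubsets(y ~ ., data = dat, nvmax = 10, method = "backward")
    coef(fit.fwd, 3); coef(fit.bwd, 3)

    # (e) lasso (alpha = 1) with lambda chosen by cross-validation
    xmat   <- as.matrix(dat[, -1])
    cv.out <- cv.glmnet(xmat, y, alpha = 1)
    plot(cv.out)                         # CV error as a function of log(lambda)
    predict(cv.out, type = "coefficients", s = "lambda.min")

For part (f), regenerating the response as y <- 5 + 0.78 * x^7 + eps and rerunning the regsubsets() and cv.glmnet() steps gives the comparison discussed above.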