Q1
We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain $p + 1$ models, containing $0, 1, 2, \ldots, p$ predictors. Explain your answers:
1.a
Which of the three models with k predictors has the smallest training RSS?
- When performing best subset selection, the model with k predictors is the model with the smallest RSS among all $\binom{p}{k}$ models containing exactly k predictors.
- When performing forward stepwise selection, the model with k predictors is the model with the smallest RSS among the $p - k + 1$ models that augment the predictors in $\mathcal{M}_{k-1}$ with one additional predictor.
- When performing backward stepwise selection, the model with k predictors is the model with the smallest RSS among the $k + 1$ models that contain all but one of the predictors in $\mathcal{M}_{k+1}$.
- So, the model with k predictors that has the smallest training RSS is the one obtained from best subset selection, since it is chosen among all possible k-predictor models, while the stepwise methods only consider a restricted set of candidates; a small simulated check of this follows below.
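A quick check of this on simulated data, using the leaps package (a sketch; the data set, seed, and the model size k = 4 below are arbitrary assumptions, not part of the question):

library(leaps)
set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 3] + rnorm(n)
dat <- data.frame(y, x)
k <- 4  # model size at which to compare the three approaches
fits <- lapply(c(best = "exhaustive", forward = "forward", backward = "backward"),
               function(m) regsubsets(y ~ ., data = dat, nvmax = p, method = m))
# Training RSS of the k-predictor model chosen by each method; best subset
# ("exhaustive") is never larger than forward or backward stepwise.
sapply(fits, function(f) summary(f)$rss[k])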
1.b
Which of the three models with k predictors has the smallest test RSS?
Difficult to answer: best subset selection may have the smallest test RSS because it considers more candidate models than the other methods.
However, since the models are chosen using only the training data, the stepwise methods might also pick a model with a smaller test RSS by sheer luck.
1.c
True or False:
i: The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
True. In forward stepwise selection, the model with (k+1) predictors is obtained by augmenting the predictors in the k-variable model with one additional predictor.
ii: The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
True. In backward stepwise selection, the model with k predictors is obtained by removing one predictor from the model with (k+1) predictors, so its predictors are a subset of those in the (k+1)-variable model.
iii: The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
False. There is no direct link between the models obtained from forward and backward stepwise selection, so the subset relationship need not hold.
iv: The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k+1)-variable model identified by best subset selection.
False. The model with (k+1) predictors is obtained by selecting among all possible models with (k+1) predictors, and so does not necessarily contain all the predictors selected for the k-variable model.
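We can also see the nesting property directly from regsubsets output (again a sketch on assumed simulated data): for forward stepwise the selected variables are nested across model sizes by construction, while for best subset they need not be.

library(leaps)
set.seed(1)
n <- 50; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + x[, 2] - x[, 5] + rnorm(n)
dat <- data.frame(y, x)
# summary()$which is a logical matrix: one row per model size, one column per term.
fwd  <- summary(regsubsets(y ~ ., data = dat, nvmax = p, method = "forward"))$which
best <- summary(regsubsets(y ~ ., data = dat, nvmax = p, method = "exhaustive"))$which
fwd[1:3, ]   # each row's selected variables are contained in the next row's
best[1:3, ]  # no such guarantee for best subset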
Q2
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
2.a
The lasso, relative to least squares, is:
i: More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii: More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii: Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. (TRUE)
iv: Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The lasso shrinks the coefficient estimates (and can set some exactly to zero), so it is less flexible than least squares; this adds bias but reduces variance, and prediction accuracy improves when the added bias is smaller than the variance reduction.
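A minimal glmnet sketch (the simulated data and the lambda grid below are my own assumptions) showing the reduced flexibility: as lambda grows, the lasso coefficients are shrunk toward zero.

library(glmnet)
set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 3] + rnorm(n)
lambdas <- c(2, 1, 0.5, 0.1, 0)                   # decreasing grid; lambda = 0 is (essentially) least squares
fit <- glmnet(x, y, alpha = 1, lambda = lambdas)  # alpha = 1 selects the lasso penalty
round(as.matrix(coef(fit)), 3)                    # one column per lambda: estimates shrink toward 0 as lambda grows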
2.b
Repeat (a) for ridge regression relative to least squares.
Same as the lasso: iii. Ridge regression also shrinks the coefficients toward zero, so it is less flexible than least squares and improves prediction accuracy when its increase in bias is less than its decrease in variance.
2.c
Repeat (a) for non-linear methods relative to least squares.
Non-linear methods are more flexible than least squares and will give improved prediction accuracy when their increase in variance is less than their decrease in bias (option ii).
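A small simulation sketch illustrating this (the non-linear truth and the choice of smooth.spline are my own assumptions, not part of the question): when the true relationship is non-linear, the flexible fit wins because its reduction in bias outweighs its extra variance.

set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)             # non-linear truth
train <- sample(n, n / 2)
lin <- lm(y ~ x, subset = train)                 # least squares (linear) fit
spl <- smooth.spline(x[train], y[train])         # flexible non-linear fit
mse <- function(pred) mean((y[-train] - pred)^2) # test-set mean squared error
c(linear    = mse(predict(lin, data.frame(x = x[-train]))),
  nonlinear = mse(predict(spl, x[-train])$y))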
Q3
Suppose we estimate the regression coefficients in a linear regression model by minimizing

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le s$$

for a particular value of $s$. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.
3.a
As we increase s from 0, the training RSS will:
Steadily decrease. As we increase s from 0, we restrict the coefficients less and less (they grow toward their least squares estimates), so the model becomes more and more flexible, which produces a steady decrease in the training RSS.
3.b
Repeat (a) for test RSS.
Decrease initially, and then eventually start increasing in a U shape. As we increase s from 0, the model becomes more and more flexible, which at first decreases the test RSS; once the model starts to overfit, the test RSS increases again, giving the typical U shape.
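A simulation sketch of (a) and (b) (the data and the use of glmnet here are my own assumptions): in the penalized form of the lasso, increasing the budget s corresponds to decreasing lambda, and along the fitted path the training error falls steadily while the test error traces a roughly U-shaped curve.

library(glmnet)
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)
train <- sample(n, n / 2)
fit <- glmnet(x[train, ], y[train], alpha = 1)
s_budget  <- colSums(abs(coef(fit)[-1, ]))                     # L1 norm of the coefficients plays the role of s
train_mse <- colMeans((y[train] - predict(fit, x[train, ]))^2)
test_mse  <- colMeans((y[-train] - predict(fit, x[-train, ]))^2)
plot(s_budget, train_mse, type = "l", ylim = range(train_mse, test_mse),
     xlab = "s (L1 norm of the coefficients)", ylab = "MSE")
lines(s_budget, test_mse, col = "red")   # red curve: test MSE, roughly U-shaped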
3.c
Repeat (a) for test variance.
Steadily increase. The model becomes more and more flexible as s increases, which produces a steady increase in variance.
3.d
Repeat (a) for (squared) bias.
Steadily decrease. The model becomes more and more flexible as s increases, which produces a steady decrease in (squared) bias.
3.e
Repeat (a) for irreducible error.
Remain constant. By definition, the irreducible error is independent of the model, and consequently independent of the value of s.
Q4
Suppose we estimate the regression coefficients in a linear regression model by minimizing

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$

for a particular value of $\lambda$. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.
The reasoning mirrors Q3, but with the directions reversed, since increasing $\lambda$ from 0 restricts the coefficients more and more (the opposite of increasing s): the training RSS steadily increases, the test RSS decreases initially and then eventually starts increasing in a U shape, the variance steadily decreases, the (squared) bias steadily increases, and the irreducible error remains constant.
Q5
It is well-known that ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso may give quite different coefficient values to correlated variables. We will now explore this property in a very simple setting.
Suppose that $n = 2$, $p = 2$, $x_{11} = x_{12}$, and $x_{21} = x_{22}$. Furthermore, suppose that $y_1 + y_2 = 0$ and $x_{11} + x_{21} = 0$ and $x_{12} + x_{22} = 0$, so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: $\hat{\beta}_0 = 0$.
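The property described above is easy to illustrate numerically with glmnet (the simulated data and the single lambda value below are my own assumptions): with two identical predictors, ridge regression spreads the coefficient (essentially) evenly across the two copies, whereas the lasso solution is not unique and may put all of the weight on one of them.

library(glmnet)
set.seed(1)
n <- 100
x1 <- rnorm(n)
x  <- cbind(x1, x1)        # two perfectly correlated (identical) predictors
y  <- 2 * x1 + rnorm(n)
coef(glmnet(x, y, alpha = 0, lambda = 0.1))   # ridge: the two coefficients come out (essentially) equal
coef(glmnet(x, y, alpha = 1, lambda = 0.1))   # lasso: the weight may concentrate on one copy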
Q6
We will now explore (6.12) and (6.13) further.
6.a
Consider (6.12) with $p = 1$. For some choice of $y_1$ and $\lambda > 0$, plot (6.12) as a function of $\beta_1$. Your plot should confirm that (6.12) is solved by (6.14).
# Ridge criterion (6.12) with p = 1: (y - beta)^2 + lambda * beta^2
y <- 3
lambda <- 2
beta <- seq(-10, 10, 0.1)
plot(beta, (y - beta)^2 + lambda * beta^2, pch = 20, xlab = "beta", ylab = "Ridge optimization")
# Closed-form ridge solution (6.14): beta = y / (1 + lambda)
beta.est <- y / (1 + lambda)
points(beta.est, (y - beta.est)^2 + lambda * beta.est^2, col = "red", pch = 4, lwd = 5)
We can see that the function is minimized at $\hat{\beta} = y / (1 + \lambda) = 3 / (1 + 2) = 1$.
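A grid search over the plotted values (continuing the chunk above) picks out the same point:

beta[which.min((y - beta)^2 + lambda * beta^2)]   # ~ 1, i.e. y / (1 + lambda)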
6.b
Consider (6.13) with $p = 1$. For some choice of $y_1$ and $\lambda > 0$, plot (6.13) as a function of $\beta_1$. Your plot should confirm that (6.13) is solved by (6.15).
# Lasso criterion (6.13) with p = 1: (y - beta)^2 + lambda * abs(beta)
y <- 3
lambda <- 2
beta <- seq(-10, 10, 0.1)
plot(beta, (y - beta)^2 + lambda * abs(beta), pch = 20, xlab = "beta", ylab = "Lasso optimization")
# Soft-thresholding solution (6.15) in the case y > lambda / 2: beta = y - lambda / 2
beta.est <- y - lambda / 2
points(beta.est, (y - beta.est)^2 + lambda * abs(beta.est), col = "red", pch = 4, lwd = 5)
We can see that the function is minimized at $\hat{\beta} = y - \lambda / 2 = 3 - 2/2 = 2$, as $y > \lambda / 2$.
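Again, a grid search over the plotted values (continuing the chunk above) agrees:

beta[which.min((y - beta)^2 + lambda * abs(beta))]   # ~ 2, i.e. y - lambda / 2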