The Validation Set Approach

It is generally a good idea to set a random seed when performing an analysis such as cross-validation that contains an element of randomness, so that the results obtained can be reproduced precisely at a later time.
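For example, resetting the seed before a random draw reproduces the draw exactly (a minimal illustration; the seed value is arbitrary):

set.seed(1)
sample(10, 3)   # some random triple from 1:10
set.seed(1)
sample(10, 3)   # the identical triple again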

  • Using the sample() function to split the set of observations into two halves.
  • Using the predict() function to estimate the response for all 392 observations; the -train index below selects only the observations that are not in the training set.
  • Using mean() function to calculate the MSE of the 196 observations in the validation set.
    library(ISLR)
    set.seed(1)
    # split the 392 observations into a training set of 196
    train = sample(392, 196)
    lm.fit = lm(mpg ~ horsepower, data=Auto, subset=train)
    attach(Auto)
    # MSE of the 196 observations in the validation set
    mean((mpg-predict(lm.fit, Auto))[-train]^2)
    
  • Using the poly() function to estimate the test error for the quadratic and cubic regressions.
    lm.fit2=lm(mpg ~ poly(horsepower, 2), data=Auto, subset=train)
    mean((mpg-predict(lm.fit2, Auto))[-train]^2)
    lm.fit3=lm(mpg ~ poly(horsepower, 3), data=Auto, subset=train)
    mean((mpg-predict(lm.fit3, Auto))[-train]^2)
    

A model that predicts mpg using a quadratic function of horsepower performs better than a model that involves only a linear function of horsepower.
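
These error estimates depend on the particular random split. A minimal sketch of rerunning the fits under a different seed (the value 2 is an arbitrary assumption) to see how the estimates change:

set.seed(2)
train = sample(392, 196)
lm.fit = lm(mpg ~ horsepower, data=Auto, subset=train)
mean((mpg-predict(lm.fit, Auto))[-train]^2)
lm.fit2 = lm(mpg ~ poly(horsepower, 2), data=Auto, subset=train)
mean((mpg-predict(lm.fit2, Auto))[-train]^2)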

Leave-One-Out Cross-Validation

We will perform linear regression using the glm() function rather than the lm() function, because the former can be used together with cv.glm().

glm.fit=glm(mpg~horsepower ,data=Auto)

loocv=function(fit){
  # shortcut formula for least squares: the i-th LOOCV residual equals
  # the i-th ordinary residual divided by (1 - h_i), where h_i is the leverage
  h=lm.influence(fit)$h
  mean((residuals(fit)/(1-h))^2)
}
loocv(glm.fit)
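
The same estimate can be obtained with cv.glm() from the boot library, whose default K is the number of observations (i.e. LOOCV). A sketch that also loops over polynomial degrees 1 to 5 (the names cv.error and degree are assumptions) and draws the base plot that the red k-fold curve below is overlaid on:

library(boot)
cv.glm(Auto, glm.fit)$delta[1]   # should match loocv(glm.fit)

# LOOCV error for polynomial fits of degree 1 to 5
cv.error=rep(0,5)
degree=1:5
for(d in degree){
  glm.fit=glm(mpg~poly(horsepower, d), data=Auto)
  cv.error[d]=loocv(glm.fit)
}
plot(degree, cv.error, type="b")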

k-Fold Cross-Validation

The cv.glm() function, from the boot library, can also be used to implement k-fold CV; below we use K=10. The first component of delta is the raw cross-validation estimate of the prediction error (the second is a bias-adjusted version).

library(boot)
cv.error10=rep(0,5)
degree=1:5
for(d in degree){
  glm.fit=glm(mpg~poly(horsepower, d), data=Auto)
  cv.error10[d] = cv.glm(Auto, glm.fit, K=10)$delta[1]
}
# overlay the 10-fold CV curve (red) on the LOOCV plot above
lines(degree, cv.error10, type="b", col="red")

There is a sharp drop in the estimated test MSE between the linear and quadratic fits, but using cubic or higher-order polynomial terms gives no clear improvement over the quadratic fit.

The Bootstrap

Estimating the Accuracy of a Statistic of Interest

Performing a bootstrap analysis in R entails only two steps:

  • Create a function that computes the statistic of interest.
  • Use the boot() function, which is part of the boot library, to perform the bootstrap by repeatedly sampling observations from the data set with replacement.
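Here the statistic of interest is alpha, the fraction of a two-asset portfolio invested in X (the remainder in Y) that minimizes the variance of the return alpha*X + (1-alpha)*Y. The minimizing value is alpha = (var(Y) - cov(X,Y)) / (var(X) + var(Y) - 2*cov(X,Y)), which the function below estimates from sample variances and covariances.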
alpha=function(x, y){
  vx=var(x)
  vy=var(y)
  cxy=cov(x,y)
  # minimum-variance allocation to x
  (vy-cxy)/(vx+vy-2*cxy)
}
alpha(Portfolio$X, Portfolio$Y)

What is the standard error of alpha?
alpha.fn=function(data, index){
  # compute alpha on the observations selected by index
  with(data[index,], alpha(X, Y))
}

The following command tells R to estimate alpha using all 100 observations:
alpha.fn(Portfolio, 1:100)

Use the sample() function to randomly select 100 observations from the range 1 to 100, with replacement:
set.seed(1)
alpha.fn(Portfolio, sample(1:100, 100, replace=TRUE))

We could implement a bootstrap analysis by performing this command many times, recording all of the corresponding estimates for alpha, and computing the resulting standard deviation. However, the boot() function automates this approach.

library(boot)
boot.out=boot(Portfolio, alpha.fn, R=1000)
boot.out

Estimating the Accuracy of a Linear Regression Model

The bootstrap approach can be used to assess the variability of the coefficient estimates and predictions from a statistical learning method.

Create a simple function that takes in the data set as well as a set of indices for the observations, and returns the intercept and slope estimates for the linear regression model.

boot.fn=function(data, index){
  return (coef(lm(mpg~horsepower, data=data, subset=index)))
}
boot.fn(Auto, 1:392)

Create bootstrap estimates for the intercept and slope terms by randomly sampling from among the observations with replacement:

set.seed(1)
boot.fn(Auto, sample(1:392, 392, replace=TRUE))

Next, use the boot() function to compute the standard errors of 1000 bootstrap estimates for the intercept and slope terms.

set.seed(1)
boot.out=boot(Auto, boot.fn, R=1000)
boot.out
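
These bootstrap standard errors can be compared with the formula-based standard errors from the standard least-squares output; a minimal check:

# standard errors from the usual least-squares formulas, for comparison
summary(lm(mpg~horsepower, data=Auto))$coef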
