In this data set, ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) were obtained for each of 442 diabetes patients, along with the response of interest, a quantitative measure of disease progression one year after baseline. This dataset was originally used in "Least Angle Regression" by Efron et al. (2004, Annals of Statistics).
#Read the tab-delimited diabetes data into a numeric matrix
loc <- "http://statweb.lsu.edu/faculty/li/IIT/diabetes.txt"
dat <- as.matrix(read.table(loc,header=T,sep="\t"))
dim(dat)
## [1] 442 11
head(dat)
## AGE SEX BMI BP S1 S2 S3 S4 S5 S6 Y
## [1,] 59 2 32.1 101 157 93.2 38 4 4.8598 87 151
## [2,] 48 1 21.6 87 183 103.2 70 3 3.8918 69 75
## [3,] 72 2 30.5 93 156 93.6 41 4 4.6728 85 141
## [4,] 24 1 25.3 84 198 131.4 40 5 4.8903 89 206
## [5,] 50 1 23.0 101 192 125.4 52 4 4.2905 80 135
## [6,] 23 1 22.6 89 139 64.8 61 2 4.1897 68 97
summary(dat)
## AGE SEX BMI BP
## Min. :19.00 Min. :1.000 Min. :18.00 Min. : 62.00
## 1st Qu.:38.25 1st Qu.:1.000 1st Qu.:23.20 1st Qu.: 84.00
## Median :50.00 Median :1.000 Median :25.70 Median : 93.00
## Mean :48.52 Mean :1.468 Mean :26.38 Mean : 94.65
## 3rd Qu.:59.00 3rd Qu.:2.000 3rd Qu.:29.27 3rd Qu.:105.00
## Max. :79.00 Max. :2.000 Max. :42.20 Max. :133.00
## S1 S2 S3 S4
## Min. : 97.0 Min. : 41.60 Min. :22.00 Min. :2.00
## 1st Qu.:164.2 1st Qu.: 96.05 1st Qu.:40.25 1st Qu.:3.00
## Median :186.0 Median :113.00 Median :48.00 Median :4.00
## Mean :189.1 Mean :115.44 Mean :49.79 Mean :4.07
## 3rd Qu.:209.8 3rd Qu.:134.50 3rd Qu.:57.75 3rd Qu.:5.00
## Max. :301.0 Max. :242.40 Max. :99.00 Max. :9.09
## S5 S6 Y
## Min. :3.258 Min. : 58.00 Min. : 25.0
## 1st Qu.:4.277 1st Qu.: 83.25 1st Qu.: 87.0
## Median :4.620 Median : 91.00 Median :140.5
## Mean :4.641 Mean : 91.26 Mean :152.1
## 3rd Qu.:4.997 3rd Qu.: 98.00 3rd Qu.:211.5
## Max. :6.107 Max. :124.00 Max. :346.0
We randomly split the dataset into training (342 obs.) and test (100 obs.) sets.
set.seed(11)
indx <- sample(1:442,size=342,replace=F)
x1 <- dat[indx,1:10] #Training X
x2 <- dat[-indx,1:10] #Test X
y1 <- dat[indx,11] #Training Y
y2 <- dat[-indx,11] #Test Y
dat1 <- as.data.frame(x1); dat1$Y <- y1
dat2 <- as.data.frame(x2); dat2$Y <- y2
We fit a multiple linear regression model on the training set and predict on the test set.
mlr.fit <- lm(Y~.,dat1)
mlr.prd <- predict(mlr.fit,newdata=dat2)
summary(mlr.fit)
##
## Call:
## lm(formula = Y ~ ., data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -137.537 -37.575 -0.885 34.533 156.133
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -349.54630 74.35403 -4.701 3.80e-06 ***
## AGE -0.05372 0.23320 -0.230 0.81795
## SEX -19.60413 6.21538 -3.154 0.00176 **
## BMI 5.08624 0.77431 6.569 1.97e-10 ***
## BP 1.07324 0.24258 4.424 1.31e-05 ***
## S1 -1.30539 0.61059 -2.138 0.03326 *
## S2 1.02829 0.55606 1.849 0.06532 .
## S3 0.35195 0.85342 0.412 0.68032
## S4 4.23307 6.75945 0.626 0.53159
## S5 73.94047 17.22621 4.292 2.33e-05 ***
## S6 0.48806 0.29477 1.656 0.09872 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.59 on 331 degrees of freedom
## Multiple R-squared: 0.5456, Adjusted R-squared: 0.5318
## F-statistic: 39.74 on 10 and 331 DF, p-value: < 2.2e-16
We wrote a summary function to calculate the R-squared and the root mean squared error (RMSE), which are defined below. \[ R^2=1-\frac{\sum_{i=1}^n (y_i-\hat{y}_i)^2}{\sum_{i=1}^n (y_i-\bar{y})^2} \] \[ RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2} \]
summary.output <- function(y,pred){
  #Calculate R2
  r2 <- 1-mean((y-pred)^2)/mean((y-mean(y))^2)
  #Calculate RMSE
  rmse <- sqrt(mean((y-pred)^2))
  c(r2,rmse)
}
summary.output(dat2$Y,mlr.prd)
## [1] 0.395155 62.770463
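As a sanity check (output not shown), applying the same function to the training data reproduces the Multiple R-squared of 0.5456 from the lm summary above, since for a linear model with an intercept this definition of R-squared matches the one reported by summary.lm.
summary.output(dat1$Y,fitted(mlr.fit)) #First element should equal 0.5456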
Elastic net is available in the glmnet package in R. There are two tuning parameters in elastic net: \(\alpha\) and \(\lambda\). When \(\alpha=1\), it reduces to the lasso; when \(\alpha=0\), it becomes ridge regression. \[ \hat{\beta}^{ENET}=\arg\min_{\beta} RSS + \lambda \left( \alpha \sum_{j=1}^p |\beta_j|+ (1-\alpha) \sum_{j=1}^p \beta_j^2 \right) \]
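To make the penalty term concrete, here is a minimal sketch (not part of the original analysis; the function name is ours) that evaluates it for a given coefficient vector. Note that glmnet itself parameterizes the objective slightly differently, dividing the RSS by \(2n\) and the ridge term by 2, but the roles of \(\alpha\) and \(\lambda\) are the same.
enet.penalty <- function(beta,lambda,alpha){
  #Penalty as written above: lasso part plus ridge part
  lambda*(alpha*sum(abs(beta))+(1-alpha)*sum(beta^2))
}
enet.penalty(beta=c(1,-2,0),lambda=0.5,alpha=1) #Pure lasso: 0.5*(|1|+|-2|+|0|)=1.5
enet.penalty(beta=c(1,-2,0),lambda=0.5,alpha=0) #Pure ridge: 0.5*(1+4+0)=2.5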
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.3.3
## Loading required package: Matrix
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.3.3
## Loaded glmnet 2.0-16
lasso.fit <- glmnet(x1,y1,family="gaussian",alpha=1,nlambda=100) #alpha=1: lasso, over a path of 100 lambda values
We can see the solution path for the fitted lasso model using the plot_glmnet function from the plotmo package. The label option sets the number of variable names displayed on the right side of the plot.
library(plotmo)
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 3.3.3
## Loading required package: TeachingDemos
## Warning: package 'TeachingDemos' was built under R version 3.3.3
plot_glmnet(lasso.fit, label=5)
We can run cross-validation to select the optimal value of \(\lambda\) for the lasso. The vertical line on the left marks the value of \(\lambda\) that gives the minimum CV error (lambda.min). The vertical line on the right marks the largest value of \(\lambda\) such that the CV error is within one standard error of the minimum (lambda.1se).
set.seed(1) #To make the results reproducible
cv.lasso <- cv.glmnet(x1,y1,alpha=1,nfolds=5)
plot(cv.lasso)
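The actual values of \(\lambda\) marked by the two vertical lines are stored in the cv.glmnet object (output not shown):
c(cv.lasso$lambda.min,cv.lasso$lambda.1se) #Lambda with minimum CV error, and the 1SE choice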
We can see that the prediction results for the lasso are as good as those for MLR; using lambda.min even does slightly better than MLR.
lasso.prd <- as.vector(predict(cv.lasso,x2,s="lambda.min"))
summary.output(y2,lasso.prd)
## [1] 0.397049 62.672108
lasso.prd2 <- as.vector(predict(cv.lasso,x2,s="lambda.1se"))
summary.output(y2,lasso.prd2)
## [1] 0.3536711 64.8873540
We can check the fitted coefficients from the lasso. The 1SE model selects only a subset of the variables, while the MIN model includes all of them.
coef(cv.lasso,x2,s="lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -293.13114037
## AGE -0.03120088
## SEX -18.84561587
## BMI 5.15143688
## BP 1.05305399
## S1 -0.75291857
## S2 0.53546005
## S3 -0.29427200
## S4 2.11035581
## S5 59.95394763
## S6 0.48362639
coef(cv.lasso,x2,s="lambda.1se")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -190.17800872
## AGE .
## SEX .
## BMI 4.71602166
## BP 0.62893688
## S1 .
## S2 .
## S3 -0.50278481
## S4 .
## S5 37.12712775
## S6 0.08347521
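To list only the variables retained by the 1SE rule, one can pull the nonzero rows out of the sparse coefficient matrix; a small sketch (output not shown):
b <- as.matrix(coef(cv.lasso,s="lambda.1se")) #Convert the sparse matrix to a regular one
rownames(b)[b!=0] #Intercept plus the selected variables: BMI, BP, S3, S5, S6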
We can also explore the optimal value of \(\alpha\) on a coarse grid. It seems the lasso (i.e. \(\alpha=1\)) is the best choice among the elastic net models.
alpha <- seq(from=0,to=1,by=0.2)
alpha
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
min.cv <- rep(0,length(alpha)) #Minimum CV error for each alpha
for (i in seq_along(alpha)){
  set.seed(1) #Use the same folds for each alpha
  cv.fit <- cv.glmnet(x1,y1,alpha=alpha[i],nfolds=5)
  min.cv[i] <- min(cv.fit$cvm)
}
which.min(min.cv)
## [1] 6
plot(alpha,min.cv,type="b",xlab="alpha",ylab="CV error")
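Since the minimum occurs at the sixth grid value, the chosen model is the lasso fit from before. As a follow-up sketch (results not shown), the ridge end of the grid can be evaluated on the test set for contrast:
best.alpha <- alpha[which.min(min.cv)] #Equals 1, i.e. the lasso
set.seed(1)
cv.ridge <- cv.glmnet(x1,y1,alpha=0,nfolds=5) #Ridge regression
ridge.prd <- as.vector(predict(cv.ridge,x2,s="lambda.min"))
summary.output(y2,ridge.prd) #Compare with the lasso and MLR results above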