Introduction to the diabetes dataset

In this dataset, ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) were obtained for each of 442 diabetes patients, along with the response of interest, a quantitative measure of disease progression one year after baseline. The dataset was originally used in "Least Angle Regression" by Efron et al. (2004), Annals of Statistics.

loc <- "http://statweb.lsu.edu/faculty/li/IIT/diabetes.txt"
dat <- as.matrix(read.table(loc,header=T,sep="\t"))
dim(dat)
## [1] 442  11
head(dat)
##      AGE SEX  BMI  BP  S1    S2 S3 S4     S5 S6   Y
## [1,]  59   2 32.1 101 157  93.2 38  4 4.8598 87 151
## [2,]  48   1 21.6  87 183 103.2 70  3 3.8918 69  75
## [3,]  72   2 30.5  93 156  93.6 41  4 4.6728 85 141
## [4,]  24   1 25.3  84 198 131.4 40  5 4.8903 89 206
## [5,]  50   1 23.0 101 192 125.4 52  4 4.2905 80 135
## [6,]  23   1 22.6  89 139  64.8 61  2 4.1897 68  97
summary(dat)
##       AGE             SEX             BMI              BP        
##  Min.   :19.00   Min.   :1.000   Min.   :18.00   Min.   : 62.00  
##  1st Qu.:38.25   1st Qu.:1.000   1st Qu.:23.20   1st Qu.: 84.00  
##  Median :50.00   Median :1.000   Median :25.70   Median : 93.00  
##  Mean   :48.52   Mean   :1.468   Mean   :26.38   Mean   : 94.65  
##  3rd Qu.:59.00   3rd Qu.:2.000   3rd Qu.:29.27   3rd Qu.:105.00  
##  Max.   :79.00   Max.   :2.000   Max.   :42.20   Max.   :133.00  
##        S1              S2               S3              S4      
##  Min.   : 97.0   Min.   : 41.60   Min.   :22.00   Min.   :2.00  
##  1st Qu.:164.2   1st Qu.: 96.05   1st Qu.:40.25   1st Qu.:3.00  
##  Median :186.0   Median :113.00   Median :48.00   Median :4.00  
##  Mean   :189.1   Mean   :115.44   Mean   :49.79   Mean   :4.07  
##  3rd Qu.:209.8   3rd Qu.:134.50   3rd Qu.:57.75   3rd Qu.:5.00  
##  Max.   :301.0   Max.   :242.40   Max.   :99.00   Max.   :9.09  
##        S5              S6               Y        
##  Min.   :3.258   Min.   : 58.00   Min.   : 25.0  
##  1st Qu.:4.277   1st Qu.: 83.25   1st Qu.: 87.0  
##  Median :4.620   Median : 91.00   Median :140.5  
##  Mean   :4.641   Mean   : 91.26   Mean   :152.1  
##  3rd Qu.:4.997   3rd Qu.: 98.00   3rd Qu.:211.5  
##  Max.   :6.107   Max.   :124.00   Max.   :346.0

We randomly split the dataset into training (342 obs.) and test (100 obs.) sets.

set.seed(11)
indx <- sample(1:442,size=342,replace=F)
x1   <- dat[indx,1:10] #Training X
x2   <- dat[-indx,1:10] #Test X
y1   <- dat[indx,11] #Training Y
y2   <- dat[-indx,11] #Test Y
dat1 <- as.data.frame(x1); dat1$Y <- y1
dat2 <- as.data.frame(x2); dat2$Y <- y2
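
A quick check of the resulting dimensions (not part of the original output) confirms the 342/100 split.

dim(x1); length(y1) #Training: 342 x 10 and 342
dim(x2); length(y2) #Test: 100 x 10 and 100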

Multiple linear regression

We fit a multiple linear regression model on the training set and predict on the test set.

mlr.fit <- lm(Y~.,dat1)
mlr.prd <- predict(mlr.fit,newdata=dat2)
summary(mlr.fit)
## 
## Call:
## lm(formula = Y ~ ., data = dat1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -137.537  -37.575   -0.885   34.533  156.133 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -349.54630   74.35403  -4.701 3.80e-06 ***
## AGE           -0.05372    0.23320  -0.230  0.81795    
## SEX          -19.60413    6.21538  -3.154  0.00176 ** 
## BMI            5.08624    0.77431   6.569 1.97e-10 ***
## BP             1.07324    0.24258   4.424 1.31e-05 ***
## S1            -1.30539    0.61059  -2.138  0.03326 *  
## S2             1.02829    0.55606   1.849  0.06532 .  
## S3             0.35195    0.85342   0.412  0.68032    
## S4             4.23307    6.75945   0.626  0.53159    
## S5            73.94047   17.22621   4.292 2.33e-05 ***
## S6             0.48806    0.29477   1.656  0.09872 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.59 on 331 degrees of freedom
## Multiple R-squared:  0.5456, Adjusted R-squared:  0.5318 
## F-statistic: 39.74 on 10 and 331 DF,  p-value: < 2.2e-16

We wrote a summary function to calculate the R-squared and root mean squared error (RMSE), which are defined below. \[ R^2=1-\frac{\sum_{i=1}^n (y_i-\hat{y}_i)^2}{\sum_{i=1}^n (y_i-\bar{y})^2} \] \[ RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2} \]

summary.output <- function(y,pred){
  #Calculate R2
  r2 <- 1-mean((y-pred)^2)/mean((y-mean(y))^2) 
  #Calculate RMSE
  rmse <- sqrt(mean((y-pred)^2)) 
  c(r2,rmse)
}
summary.output(dat2$Y,mlr.prd)  
## [1]  0.395155 62.770463

Lasso and elastic net

Elastic net is available in the glmnet package in R. There are two tuning parameters in elastic net: \(\alpha\) and \(\lambda\). When \(\alpha=1\), the penalty reduces to the lasso; when \(\alpha=0\), it becomes ridge regression. \[ \hat{\beta}^{ENET}=\arg\min_{\beta} RSS + \lambda \left( \alpha \sum_{j=1}^p |\beta_j|+ (1-\alpha) \sum_{j=1}^p \beta_j^2 \right) \]

library(glmnet)
## Warning: package 'glmnet' was built under R version 3.3.3
## Loading required package: Matrix
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.3.3
## Loaded glmnet 2.0-16
lasso.fit <- glmnet(x1,y1,family="gaussian",alpha=1,nlambda = 100)
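
For comparison, ridge regression uses the same call with alpha set to 0; this is a minimal sketch, and its output is not part of the original analysis.

ridge.fit <- glmnet(x1,y1,family="gaussian",alpha=0,nlambda=100) #alpha=0 gives ridge regression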

We can see the solution path for the fitted lasso model using the plot_glmnet function from the plotmo package. The label option sets the number of variable names displayed on the right of the plot.

library(plotmo)
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 3.3.3
## Loading required package: TeachingDemos
## Warning: package 'TeachingDemos' was built under R version 3.3.3
plot_glmnet(lasso.fit, label=5)

We can run cross-validation to select the optimal value of \(\lambda\) for the lasso. The vertical line on the left shows the value of \(\lambda\) that gives the minimum CV error (lambda.min). The vertical line on the right shows the largest value of \(\lambda\) such that the CV error is within one standard error of the minimum (lambda.1se).

set.seed(1) #To make the results reproducible
cv.lasso <- cv.glmnet(x1,y1,alpha=1,nfolds=5)
plot(cv.lasso)
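
The selected values of \(\lambda\) can be read directly from the fitted cv.glmnet object; this is a small additional check, and the exact numbers depend on the seed and fold assignment, so they are not reproduced here.

cv.lasso$lambda.min #lambda giving the minimum CV error
cv.lasso$lambda.1se #largest lambda within one SE of the minimum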

We can see that the prediction results for the lasso are as good as those from MLR; using lambda.min is even slightly better than MLR.

lasso.prd <- as.vector(predict(cv.lasso,x2,s="lambda.min"))
summary.output(y2,lasso.prd) 
## [1]  0.397049 62.672108
lasso.prd2 <- as.vector(predict(cv.lasso,x2,s="lambda.1se"))
summary.output(y2,lasso.prd2) 
## [1]  0.3536711 64.8873540

We can check the fitted coefficients from the lasso. The 1SE model selects only a subset of the variables, while the MIN model keeps all of them.

coef(cv.lasso,x2,s="lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -293.13114037
## AGE           -0.03120088
## SEX          -18.84561587
## BMI            5.15143688
## BP             1.05305399
## S1            -0.75291857
## S2             0.53546005
## S3            -0.29427200
## S4             2.11035581
## S5            59.95394763
## S6             0.48362639
coef(cv.lasso,x2,s="lambda.1se")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -190.17800872
## AGE            .         
## SEX            .         
## BMI            4.71602166
## BP             0.62893688
## S1             .         
## S2             .         
## S3            -0.50278481
## S4             .         
## S5            37.12712775
## S6             0.08347521
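
A quick way to see how many predictors each choice keeps is to count the non-zero coefficients (a small addition, not part of the original output); the intercept is dropped before counting.

sum(as.vector(coef(cv.lasso,s="lambda.min"))[-1] != 0) #all 10 predictors are non-zero
sum(as.vector(coef(cv.lasso,s="lambda.1se"))[-1] != 0) #only a subset is retained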

We can also explore the optimal value of \(\alpha\) on a coarse grid. It seems the lasso (i.e., \(\alpha=1\)) is the best choice for the elastic net model on this data.

alpha <- seq(from=0,to=1,by=0.2)
alpha
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
min.cv <- rep(0,6) #Minimum CV error for each alpha
for (i in 1:6){
  set.seed(1) #Use the same folds for every alpha
  cv.fit <- cv.glmnet(x1,y1,alpha=alpha[i],nfolds=5)
  min.cv[i] <- min(cv.fit$cvm)
}
which.min(min.cv)
## [1] 6
plot(alpha,min.cv,type="b",xlab="alpha",ylab="CV error")
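
As an optional follow-up (not produced in the original analysis), the CV fit at the selected \(\alpha\) can be evaluated on the test set with the same summary.output function; since the selected \(\alpha\) here is 1, this essentially reproduces the lasso results above.

set.seed(1) #Same seed as in the grid search
cv.best  <- cv.glmnet(x1,y1,alpha=alpha[which.min(min.cv)],nfolds=5)
best.prd <- as.vector(predict(cv.best,x2,s="lambda.min"))
summary.output(y2,best.prd) #Test-set R-squared and RMSE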