Data Description

A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. There are roughly two controls per case of CHD. Many of the CHD positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after their CHD event. In some cases the measurements were made after these treatments. These data are taken from a larger dataset, described in Rousseauw et al, 1983, South African Medical Journal.

  1. sbp systolic blood pressure

  2. tobacco cumulative tobacco (kg)

  3. ldl low densiity lipoprotein cholesterol

  4. adiposity

  5. famhist family history of heart disease (Present, Absent)

  6. typea type-A behavior

  7. obesity

  8. alcohol current alcohol consumption

  9. age age at onset

  10. chd response, coronary heart disease

Let's import the data.

#You can import data directly from the EOSL website
#loc="http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data"
loc="C:/Users/bli/Desktop/AppStat/datasets/SAheart.data"
tmp1=read.table(loc,    sep=",",head=T,row.names=1)
head(tmp1)
##   sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 1 160   12.00 5.73     23.11 Present    49   25.30   97.20  52   1
## 2 144    0.01 4.41     28.61  Absent    55   28.87    2.06  63   1
## 3 118    0.08 3.48     32.28 Present    52   29.14    3.81  46   0
## 4 170    7.50 6.41     38.03 Present    51   31.99   24.26  58   1
## 5 134   13.60 3.50     27.78 Present    60   25.99   57.34  49   1
## 6 132    6.20 6.47     36.21 Present    62   30.77   14.14  45   0

Let's first change the famhist to binary 0 and 1, and save the \(X\) and \(Y\) separately (convenient for glmnet working environment.)

table(tmp1$famhist)
## 
##  Absent Present 
##     270     192
tmp1$famhist=as.numeric(tmp1$famhist=="Present")
table(tmp1$famhist)
## 
##   0   1 
## 270 192
x=as.matrix(tmp1[,1:9])
y=tmp1[,10]

Logistic regression with \(L_1\) penalty

library(glmnet)
fit = glmnet(x, y, family = "binomial", alpha=1) #LASSO type penalty
plot(fit, xvar="norm") #Solution path

Let's run CV to select optimal \(\lambda\).

set.seed(1) #able to reproduce CV results
cv.fit = cv.glmnet(x, y, family = "binomial", type.measure = "class")
plot(cv.fit)

cv.fit
## 
## Call:  cv.glmnet(x = x, y = y, type.measure = "class", family = "binomial") 
## 
## Measure: Misclassification Error 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.04824    15  0.2662 0.01612       5
## 1se 0.05295    14  0.2771 0.01512       5
coef(cv.fit, s = "lambda.min")
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) -3.026249239
## sbp          .          
## tobacco      0.042431208
## ldl          0.077881736
## adiposity    .          
## famhist      0.484727396
## typea        0.004487505
## obesity      .          
## alcohol      .          
## age          0.031404116
pred1=predict(cv.fit,newx=x,s="lambda.min",type="class") #class
table(pred1,y)
##      y
## pred1   0   1
##     0 282  99
##     1  20  61
pred2=predict(cv.fit,newx=x,s="lambda.min",type="response") #probability
library(AUC)
plot(roc(pred2,factor(y)))
legend("bottomright",paste("AUC:",round(auc(roc(pred2,factor(y))),d=2),sep=""))