The original data are 506 observations on 14 variables, medv being the target variable:
crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq.ft
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitric oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted distances to five Boston employment centres
rad: index of accessibility to radial highways
tax: full-value property-tax rate per USD 10,000
ptratio: pupil-teacher ratio by town
b: \(1000(B - 0.63)^2\) where \(B\) is the proportion of blacks by town
lstat: percentage of lower status of the population
medv: median value of owner-occupied homes in USD 1000s
library(mlbench)
data(BostonHousing)
summary(BostonHousing)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 0:471
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1: 35
## Median : 0.25651 Median : 0.00 Median : 9.69
## Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :88.97620 Max. :100.00 Max. :27.74
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio b
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Since a bagged tree model is just a special case of a random forest with \(m=p\), the randomForest() function can be used to fit both. First, let’s split the dataset into training and test sets.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(2) # set random seed so the split is reproducible
indx <- sample(1:506, size = 506, replace = FALSE) # random permutation of the row indices
bh.train <- BostonHousing[indx[1:400], ]   # 400 training observations
bh.test  <- BostonHousing[indx[401:506], ] # 106 test observations
The default settings of randomForest() are used here.
# set random seed to make the results reproducible
set.seed(2)
fit1 <- randomForest(medv ~ ., data = bh.train,
                     xtest = bh.test[, -14], ytest = bh.test$medv,
                     keep.forest = TRUE) # keep the forest so we can call predict() later
fit1
##
## Call:
## randomForest(formula = medv ~ ., data = bh.train, xtest = bh.test[, -14], ytest = bh.test$medv, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 10.39105
## % Var explained: 87.67
## Test set MSE: 11.96
## % Var explained: 85.68
The default value of ntree is 500, and for regression the default mtry is \(\lfloor p/3 \rfloor = 4\). The MSE on the OOB samples is about 10 and the R-squared is about 88%; on the test set they are about 12 and 86%. If we instead use all the trees to predict the training samples (rather than, for each sample, only the trees for which it is out of bag), the MSE and R-squared will be different.
pred.train <- predict(fit1, bh.train) # predictions on the training set using all 500 trees
mean((pred.train - bh.train$medv)^2)  # training MSE using all trees
## [1] 2.061839
1 - sum((pred.train - bh.train$medv)^2) / (399 * var(bh.train$medv)) # training R-squared (399 = n - 1)
## [1] 0.9755414
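As two quick sanity checks on the printed summary (sketches, not part of the analysis), we can reproduce the default mtry and reconstruct the ‘% Var explained’ figure from the OOB MSE. The exact variance convention randomForest uses internally may differ slightly, so expect the second number to be close to, not exactly, the printed 87.67%.
# Default mtry for regression is floor(p/3), with p = 13 predictors here.
p <- ncol(bh.train) - 1
floor(p / 3) # 4, matching the summary above
# '% Var explained' is essentially 1 - OOB MSE / Var(medv);
# fit1$mse[i] is the OOB MSE after i trees, so the last entry uses all 500.
1 - tail(fit1$mse, 1) / var(bh.train$medv)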
We can plot the OOB MSE as a function of the number of trees.
plot(fit1, main="MSE on OOB samples")
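As a numeric companion to the plot (a small sketch using the same fitted object), fit1$mse stores the OOB MSE after 1, 2, …, ntree trees, so we can check where the error curve bottoms out:
which.min(fit1$mse) # number of trees at which the OOB MSE is lowest
min(fit1$mse)       # the corresponding OOB MSE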
We can compare the above random forest model with bagged trees.
# Bagged trees: a random forest with mtry = p = 13
set.seed(1)
fit2 <- randomForest(medv ~ ., data = bh.train,
                     xtest = bh.test[, -14], ytest = bh.test$medv, mtry = 13)
fit2
##
## Call:
## randomForest(formula = medv ~ ., data = bh.train, xtest = bh.test[, -14], ytest = bh.test$medv, mtry = 13)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 11.87102
## % Var explained: 85.92
## Test set MSE: 9.31
## % Var explained: 88.85
On this split, the bagged trees achieve a slightly lower test MSE than the RF with mtry = 4 (9.31 vs. 11.96), although their OOB MSE is slightly higher; with a single train/test split, differences of this size are partly noise. More generally, mtry is a tuning parameter that can be chosen by comparing OOB errors over a small grid, as in the sketch below.
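A minimal sketch of such a scan (the candidate values here are an arbitrary choice):
# Scan a few mtry values and record the OOB MSE with the full ensemble.
set.seed(3)
mtry.grid <- c(2, 4, 7, 10, 13)
oob.mse <- sapply(mtry.grid, function(m) {
  rf <- randomForest(medv ~ ., data = bh.train, mtry = m)
  tail(rf$mse, 1) # OOB MSE after all 500 trees
})
names(oob.mse) <- mtry.grid
oob.mse
Let’s now take a look at variable importance.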
importance(fit1)
## IncNodePurity
## crim 1948.6864
## zn 298.1840
## indus 2295.1513
## chas 138.8565
## nox 2431.0998
## rm 9414.1196
## age 921.4134
## dis 1826.0282
## rad 288.0495
## tax 1065.5845
## ptratio 1714.7493
## b 702.5280
## lstat 9980.3872
varImpPlot(fit1, sort=TRUE,main="Relative variable importance")
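Note that fit1 reports only IncNodePurity (the total decrease in node impurity, i.e. residual sum of squares, from splits on each variable) because importance = TRUE was not set when the model was fit. Refitting with that option also produces the permutation-based %IncMSE measure; a quick sketch:
set.seed(2)
fit1.imp <- randomForest(medv ~ ., data = bh.train, importance = TRUE)
importance(fit1.imp) # now has both %IncMSE and IncNodePurity columns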
Variables ‘lstat’ and ‘rm’ are the two most important in the RF model. Let’s look at their marginal effect on the response ‘medv’, averaging over the effects of the other variables.
par(mfrow=c(1,2))
partialPlot(fit1, bh.train, 'lstat')
partialPlot(fit1, bh.train, 'rm')
Because the model is tree-based, the partial dependence plots above are not smooth. The hash marks at the bottom of each plot indicate the deciles of the corresponding variable. The data are sparse near the edges, so the curves are less well determined in those regions. The partial dependence of median house price on ‘lstat’ is monotonically decreasing over the main body of the data; house price, on the other hand, generally increases with the number of rooms.
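Concretely, the partial dependence of the prediction on a variable \(x_j\) is \(\bar{f}(x_j) = \frac{1}{n}\sum_{i=1}^{n} \hat{f}(x_j, \mathbf{x}_{i,C})\), where \(\mathbf{x}_{i,C}\) denotes the observed values of the other variables for observation \(i\). A hand-rolled sketch for ‘lstat’ (the grid of evaluation points is an arbitrary choice) should approximately reproduce the left panel above:
# Partial dependence of medv on lstat by hand: fix lstat at each grid
# value, predict for every training row, and average the predictions.
grid <- quantile(bh.train$lstat, probs = seq(0.05, 0.95, by = 0.05))
pd <- sapply(grid, function(v) {
  tmp <- bh.train
  tmp$lstat <- v
  mean(predict(fit1, tmp))
})
plot(grid, pd, type = "l", xlab = "lstat", ylab = "partial dependence")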
The predict.randomForest() function can also return the prediction of each individual tree in the forest, so we can see how much is gained by averaging these ‘weak’ trees.
pred.test <- predict(fit1, bh.test, predict.all = TRUE)
# MSE of each single tree (one column of $individual per tree)
mse.each <- apply(pred.test$individual, 2,
                  function(x) mean((x - bh.test$medv)^2))
# MSE of the forest, i.e. of the averaged prediction
mse.rf <- mean((pred.test$aggregate - bh.test$medv)^2)
hist(mse.each, xlim = c(0, 100),
     main = "Histogram of MSE on individual trees from RF",
     xlab = "MSE on test set")
abline(v = mse.rf, col = "red")
legend("topright", legend = paste("Test MSE for RF is", round(mse.rf, 2)),
       col = "red", lty = 1)
As a reference, we can compare the RF with a linear regression model.
lm.fit <- lm(medv ~ ., data = bh.train)
lm.pred.test <- predict(lm.fit, newdata = bh.test)
mean((lm.pred.test - bh.test$medv)^2) # test MSE for the linear model
## [1] 20.79452
The linear model’s test MSE of about 20.8 is well above the random forest’s 11.96, illustrating the benefit of the flexible ensemble on this dataset.