The original data are 506 observations on 14 variables, medv being the target variable:
crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq.ft
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitric oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted distances to five Boston employment centres
rad: index of accessibility to radial highways
tax: full-value property-tax rate per USD 10,000
ptratio: pupil-teacher ratio by town
b: \(1000(B - 0.63)^2\) where \(B\) is the proportion of blacks by town
lstat: percentage of lower status of the population
medv: median value of owner-occupied homes in USD 1000s
library(mlbench)
data(BostonHousing)
summary(BostonHousing)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 0:471
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1: 35
## Median : 0.25651 Median : 0.00 Median : 9.69
## Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :88.97620 Max. :100.00 Max. :27.74
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio b
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Since a bagged tree model is just a special case of a random forest with \(m=p\), the randomForest() function can be used to fit both. First, let’s split the dataset into training and test sets.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(2) # set random seed so the split is reproducible
indx <- sample(1:506, size = 506, replace = FALSE) # random permutation of the row indices
bh.train <- BostonHousing[indx[1:400], ]   # 400 training observations
bh.test  <- BostonHousing[indx[401:506], ] # 106 test observations
The default settings of randomForest() are used here.
# set random seed to make the results reproducible
set.seed(2)
fit1 <- randomForest(medv ~ ., data = bh.train,
                     xtest = bh.test[, -14], ytest = bh.test$medv,
                     keep.forest = TRUE) # keep the forest so we can call predict() later
fit1
##
## Call:
## randomForest(formula = medv ~ ., data = bh.train, xtest = bh.test[, -14], ytest = bh.test$medv, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 10.39105
## % Var explained: 87.67
## Test set MSE: 11.96
## % Var explained: 85.68
The default value of ntree is 500, and for regression the default mtry is \(\lfloor p/3 \rfloor = 4\). The MSE on the OOB samples is about 10 and the R-squared is about 88%; on the test set they are about 12 and 86%. If we instead use all the trees to predict the training samples (rather than, for each sample, only the trees for which it is out of bag), the MSE and R-squared will be different.
pred.train <- predict(fit1, bh.train) # predictions on the training set using all 500 trees
mean((pred.train - bh.train$medv)^2)  # training MSE using all trees
## [1] 2.061839
1 - sum((pred.train - bh.train$medv)^2) / (399 * var(bh.train$medv)) # training R-squared (399 = n - 1)
## [1] 0.9755414
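As two quick sanity checks on the printed summary (sketches, not part of the analysis), we can reproduce the default mtry and reconstruct the ‘% Var explained’ figure from the OOB MSE. The exact variance convention randomForest uses internally may differ slightly, so expect the second number to be close to, not exactly, the printed 87.67%.
# Default mtry for regression is floor(p/3), with p = 13 predictors here.
p <- ncol(bh.train) - 1
floor(p / 3) # 4, matching the summary above
# '% Var explained' is essentially 1 - OOB MSE / Var(medv);
# fit1$mse[i] is the OOB MSE after i trees, so the last entry uses all 500.
1 - tail(fit1$mse, 1) / var(bh.train$medv)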
We can plot the OOB MSE as a function of the number of trees.
plot(fit1, main="MSE on OOB samples")
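As a numeric companion to the plot (a small sketch using the same fitted object), fit1$mse stores the OOB MSE after 1, 2, …, ntree trees, so we can check where the error curve bottoms out:
which.min(fit1$mse) # number of trees at which the OOB MSE is lowest
min(fit1$mse)       # the corresponding OOB MSE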
We can compare the above random forest model with bagged trees.
# Bagged trees: a random forest with mtry = p = 13
set.seed(1)
fit2 <- randomForest(medv ~ ., data = bh.train,
                     xtest = bh.test[, -14], ytest = bh.test$medv, mtry = 13)
fit2
##
## Call:
## randomForest(formula = medv ~ ., data = bh.train, xtest = bh.test[, -14], ytest = bh.test$medv, mtry = 13)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 11.87102
## % Var explained: 85.92
## Test set MSE: 9.31
## % Var explained: 88.85
On this split, the bagged trees achieve a slightly lower test MSE than the RF with mtry = 4 (9.31 vs. 11.96), although their OOB MSE is slightly higher; with a single train/test split, differences of this size are partly noise. More generally, mtry is a tuning parameter that can be chosen by comparing OOB errors over a small grid, as in the sketch below.
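A minimal sketch of such a scan (the candidate values here are an arbitrary choice):
# Scan a few mtry values and record the OOB MSE with the full ensemble.
set.seed(3)
mtry.grid <- c(2, 4, 7, 10, 13)
oob.mse <- sapply(mtry.grid, function(m) {
  rf <- randomForest(medv ~ ., data = bh.train, mtry = m)
  tail(rf$mse, 1) # OOB MSE after all 500 trees
})
names(oob.mse) <- mtry.grid
oob.mse
Let’s now take a look at variable importance.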
importance(fit1)
## IncNodePurity
## crim 1948.6864
## zn 298.1840
## indus 2295.1513
## chas 138.8565
## nox 2431.0998
## rm 9414.1196
## age 921.4134
## dis 1826.0282
## rad 288.0495
## tax 1065.5845
## ptratio 1714.7493
## b 702.5280
## lstat 9980.3872
varImpPlot(fit1, sort=TRUE,main="Relative variable importance")
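Note that fit1 reports only IncNodePurity (the total decrease in node impurity, i.e. residual sum of squares, from splits on each variable) because importance = TRUE was not set when the model was fit. Refitting with that option also produces the permutation-based %IncMSE measure; a quick sketch:
set.seed(2)
fit1.imp <- randomForest(medv ~ ., data = bh.train, importance = TRUE)
importance(fit1.imp) # now has both %IncMSE and IncNodePurity columns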
Variables ‘lstat’ and ‘rm’ are the two most important in the RF model. Let’s look at their marginal effect on the response ‘medv’, averaging over the effects of the other variables.
par(mfrow=c(1,2))
partialPlot(fit1, bh.train, 'lstat')
partialPlot(fit1, bh.train, 'rm')
Because the model is tree-based, the partial dependence plots above are not smooth. The hash marks at the bottom of each plot indicate the deciles of the corresponding variable. The data are sparse near the edges, so the curves are less well determined in those regions. The partial dependence of median house price on ‘lstat’ is monotonically decreasing over the main body of the data; house price, on the other hand, generally increases with the number of rooms.
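Concretely, the partial dependence of the prediction on a variable \(x_j\) is \(\bar{f}(x_j) = \frac{1}{n}\sum_{i=1}^{n} \hat{f}(x_j, \mathbf{x}_{i,C})\), where \(\mathbf{x}_{i,C}\) denotes the observed values of the other variables for observation \(i\). A hand-rolled sketch for ‘lstat’ (the grid of evaluation points is an arbitrary choice) should approximately reproduce the left panel above:
# Partial dependence of medv on lstat by hand: fix lstat at each grid
# value, predict for every training row, and average the predictions.
grid <- quantile(bh.train$lstat, probs = seq(0.05, 0.95, by = 0.05))
pd <- sapply(grid, function(v) {
  tmp <- bh.train
  tmp$lstat <- v
  mean(predict(fit1, tmp))
})
plot(grid, pd, type = "l", xlab = "lstat", ylab = "partial dependence")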
The predict.randomForest() function can also return the prediction of each individual tree in the forest, so we can see how much is gained by averaging these ‘weak’ trees.
pred.test <- predict(fit1, bh.test, predict.all = TRUE)
# MSE of each single tree (one column of $individual per tree)
mse.each <- apply(pred.test$individual, 2,
                  function(x) mean((x - bh.test$medv)^2))
# MSE of the forest, i.e. of the averaged prediction
mse.rf <- mean((pred.test$aggregate - bh.test$medv)^2)
hist(mse.each, xlim = c(0, 100),
     main = "Histogram of MSE on individual trees from RF",
     xlab = "MSE on test set")
abline(v = mse.rf, col = "red")
legend("topright", legend = paste("Test MSE for RF is", round(mse.rf, 2)),
       col = "red", lty = 1)
As a reference, we can compare the RF with a linear regression model.
lm.fit <- lm(medv ~ ., data = bh.train)
lm.pred.test <- predict(lm.fit, newdata = bh.test)
mean((lm.pred.test - bh.test$medv)^2) # test MSE for the linear model
## [1] 20.79452
The linear model’s test MSE of about 20.8 is well above the random forest’s 11.96, illustrating the benefit of the flexible ensemble on this dataset.