library(mlbench)
data(BostonHousing)
summary(BostonHousing)
##       crim                zn             indus       chas   
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:471  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1: 35  
##  Median : 0.25651   Median :  0.00   Median : 9.69          
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14          
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10          
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74          
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio            b         
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Since a bagging tree is just a special case of a random forest with \(m=p\), the randomForest() function can be used to fit both. First, let's split the dataset into training and test sets.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(1)  # set random seed for reproducibility
indx <- sample(1:506, size = 506, replace = FALSE)  # random permutation of the 506 rows
bh.train <- BostonHousing[indx[1:400],]
bh.test <- BostonHousing[indx[401:506],]
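
A quick sanity check that the split came out as intended:

# expect 400 training rows and 106 test rows, 14 columns each
dim(bh.train)
dim(bh.test)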

The default settings of randomForest() were used here.

#set random seed to make the results reproducible
set.seed(1) 
fit1 <- randomForest(medv ~ ., data = bh.train,
                     xtest = bh.test[, -14],  # drop column 14 (medv)
                     ytest = bh.test$medv, keep.forest = TRUE)
fit1
## 
## Call:
##  randomForest(formula = medv ~ ., data = bh.train, xtest = bh.test[,      -14], ytest = bh.test$medv, keep.forest = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 11.2653
##                     % Var explained: 86.57
##                        Test set MSE: 8.4
##                     % Var explained: 90.27

The default value for ntree is 500, and for regression mtry defaults to \(\lfloor p/3 \rfloor = 4\). The MSE on the OOB samples is about 11 and the R-square is about 87%; on the test samples they are around 8 and 90%. If we instead use all the trees to predict the training samples, the MSE and R-square will be overly optimistic, since every tree has seen most of the training points.

pred.train <- predict(fit1, bh.train)
mean((pred.train - bh.train$medv)^2)  # training MSE using all trees
## [1] 2.239407
# R-square using all trees; 399*var() is the total sum of squares (n-1 = 399)
1 - sum((pred.train - bh.train$medv)^2) / (399 * var(bh.train$medv))
## [1] 0.9733126
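
As a check, the fitted randomForest object stores the OOB predictions in fit1$predicted, so the OOB MSE reported above can be reproduced directly:

# OOB MSE from the stored OOB predictions; should match the
# "Mean of squared residuals" reported by print(fit1)
mean((fit1$predicted - bh.train$medv)^2)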

We can plot the OOB MSE as a function of the number of trees.

plot(fit1, main="MSE on OOB samples")
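
The per-tree OOB errors behind this curve are stored in fit1$mse (fit1$mse[k] is the OOB MSE using the first k trees), so we can also inspect the curve numerically:

which.min(fit1$mse)  # number of trees with the lowest OOB MSE
min(fit1$mse)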

We can compare the random forest model above with bagging trees by setting mtry to the total number of predictors, \(p = 13\).

# Bagging tree with mtry=p
set.seed(1) 
fit2 <- randomForest(medv ~ ., data = bh.train, xtest = bh.test[, -14],
                     ytest = bh.test$medv, mtry = 13)  # mtry = p gives bagging
fit2
## 
## Call:
##  randomForest(formula = medv ~ ., data = bh.train, xtest = bh.test[,      -14], ytest = bh.test$medv, mtry = 13) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 11.56671
##                     % Var explained: 86.22
##                        Test set MSE: 7.87
##                     % Var explained: 90.89
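
To see how the test error varies between the two extremes, we can sweep mtry over a grid. This is a quick sketch (helper names like mtry.grid are illustrative, and the exact numbers depend on the random seed):

# refit the forest for each mtry value and record the final test-set MSE
mtry.grid <- 1:13
test.mse <- sapply(mtry.grid, function(m) {
  set.seed(1)
  fit <- randomForest(medv ~ ., data = bh.train,
                      xtest = bh.test[, -14], ytest = bh.test$medv, mtry = m)
  tail(fit$test$mse, 1)  # test MSE using all 500 trees
})
plot(mtry.grid, test.mse, type = "b", xlab = "mtry", ylab = "Test set MSE")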

Bagging performs slightly better than the RF with mtry=4 on the test set (MSE 7.87 vs. 8.4), although its OOB error is slightly worse. Let's now take a look at variable importance.

importance(fit1)
##         IncNodePurity
## crim        1967.7439
## zn           137.4285
## indus       1933.0059
## chas         139.3136
## nox         2653.2365
## rm          8838.1791
## age         1116.1114
## dis         2154.7263
## rad          320.9337
## tax         1088.8654
## ptratio     2124.6905
## b            681.9328
## lstat       9530.9729
varImpPlot(fit1, sort=TRUE,main="Relative variable importance")
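
Note that importance(fit1) only reports IncNodePurity because the model was fit with the default importance=FALSE. Refitting with importance=TRUE also yields the permutation-based %IncMSE measure (a sketch; fit1.imp is a new object name):

# refit with importance=TRUE to get permutation importance (%IncMSE)
set.seed(1)
fit1.imp <- randomForest(medv ~ ., data = bh.train, importance = TRUE)
importance(fit1.imp)        # both %IncMSE and IncNodePurity
varImpPlot(fit1.imp, sort = TRUE)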

Variables ‘lstat’ and ‘rm’ are the two most important in the RF model. Let's look at their marginal effects on the response ‘medv’, adjusted for the average effects of the other variables.

par(mfrow=c(1,2))
partialPlot(fit1, bh.train, 'lstat')
partialPlot(fit1, bh.train, 'rm')

Because tree-based models produce piecewise-constant predictions, the partial dependence plots above are not strictly smooth. The hash marks at the bottom of each plot indicate the deciles of the corresponding variable. Note that the data are sparse near the edges, so the curves are less well determined in those regions. The partial dependence of median house price on ‘lstat’ is monotonically decreasing over the main body of the data; on the other hand, house price generally increases with the number of rooms.
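
For regression, partialPlot() works by fixing the chosen variable at each grid value and averaging the model's predictions over the training data. A minimal manual version for ‘lstat’ (an illustrative sketch; the package uses its own grid):

# fix lstat at each grid value and average RF predictions over all training rows
grid <- quantile(bh.train$lstat, probs = seq(0.05, 0.95, by = 0.05))
pd <- sapply(grid, function(v) {
  tmp <- bh.train
  tmp$lstat <- v
  mean(predict(fit1, tmp))
})
plot(grid, pd, type = "l", xlab = "lstat", ylab = "Partial dependence")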

The predict.randomForest() function can also return the prediction from each individual tree in the forest. Hence, we can see how much improvement we gain by averaging these ‘weak’ trees.

pred.test <- predict(fit1, bh.test, predict.all = TRUE)
FUN <- function(x) {
  mean((x - bh.test$medv)^2)
}
# pred.test$individual is a matrix with one column of predictions per tree
hist(apply(pred.test$individual, 2, FUN), xlim = c(0, 100),
     main = "Histogram of MSE on individual trees from RF",
     xlab = "MSE on test set")
abline(v = 8.4, col = "red")
legend("topright", legend = c("Test MSE for RF is 8.4"), col = "red", lty = 1)

As a reference, we can compare the RF with a linear regression model.

lm.fit <- lm(medv ~ ., data = bh.train)
lm.pred.test <- predict(lm.fit, newdata = bh.test)
mean((lm.pred.test - bh.test$medv)^2)  # test-set MSE for linear regression
## [1] 21.66111
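
The linear model's test MSE of about 21.7 is far worse than the forest's 8.4. For symmetry with the RF summaries, we can also compute its test-set R-square, mirroring the earlier calculation:

# test-set R-square for the linear model (n - 1 = 105 test observations)
1 - sum((lm.pred.test - bh.test$medv)^2) / (105 * var(bh.test$medv))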