Random Forest on Satellite Image Dataset

The database is a sub-area of a scene, consisting of 82 x 100 pixels. Each line of data corresponds to a 3x3 square neighbourhood of pixels completely contained within the 82x100 sub-area. Each line contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3x3 neighbourhood and the classification associated with the central pixel in each neighbourhood. The aim is to predict this classification, given the multi-spectral values.

library(mlbench)
data(Satellite)
summary(Satellite)

##       x.1             x.2              x.3              x.4        
##  Min.   : 39.0   Min.   : 27.00   Min.   : 53.00   Min.   : 33.00  
##  1st Qu.: 60.0   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 69.00  
##  Median : 68.0   Median : 87.00   Median :101.00   Median : 81.00  
##  Mean   : 69.4   Mean   : 83.59   Mean   : 99.29   Mean   : 82.59  
##  3rd Qu.: 80.0   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.00  
##  Max.   :104.0   Max.   :137.00   Max.   :140.00   Max.   :154.00  
##       x.5              x.6              x.7              x.8       
##  Min.   : 39.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.0  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 69.0  
##  Median : 68.00   Median : 85.00   Median :101.00   Median : 81.0  
##  Mean   : 69.15   Mean   : 83.24   Mean   : 99.11   Mean   : 82.5  
##  3rd Qu.: 80.00   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.0  
##  Max.   :104.00   Max.   :137.00   Max.   :145.00   Max.   :157.0  
##       x.9              x.10             x.11             x.12       
##  Min.   : 40.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.00  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 68.00  
##  Median : 67.00   Median : 85.00   Median :100.00   Median : 81.00  
##  Mean   : 68.91   Mean   : 82.89   Mean   : 98.85   Mean   : 82.39  
##  3rd Qu.: 79.00   3rd Qu.:102.00   3rd Qu.:113.00   3rd Qu.: 92.00  
##  Max.   :104.00   Max.   :130.00   Max.   :145.00   Max.   :157.00  
##       x.13             x.14             x.15             x.16       
##  Min.   : 39.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.00  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 69.00  
##  Median : 68.00   Median : 85.00   Median :101.00   Median : 81.00  
##  Mean   : 69.29   Mean   : 83.48   Mean   : 99.31   Mean   : 82.64  
##  3rd Qu.: 80.00   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.00  
##  Max.   :104.00   Max.   :137.00   Max.   :145.00   Max.   :154.00  
##       x.17             x.18             x.19             x.20      
##  Min.   : 40.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.0  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 69.0  
##  Median : 68.00   Median : 85.00   Median :100.00   Median : 81.0  
##  Mean   : 69.05   Mean   : 83.17   Mean   : 99.15   Mean   : 82.6  
##  3rd Qu.: 79.00   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.0  
##  Max.   :104.00   Max.   :130.00   Max.   :145.00   Max.   :157.0  
##       x.21             x.22             x.23             x.24       
##  Min.   : 39.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.00  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 68.00  
##  Median : 67.00   Median : 84.00   Median :100.00   Median : 81.00  
##  Mean   : 68.84   Mean   : 82.86   Mean   : 98.95   Mean   : 82.47  
##  3rd Qu.: 79.00   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.00  
##  Max.   :104.00   Max.   :130.00   Max.   :145.00   Max.   :157.00  
##       x.25             x.26             x.27             x.28       
##  Min.   : 39.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.00  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 69.00  
##  Median : 68.00   Median : 85.00   Median :100.00   Median : 81.00  
##  Mean   : 69.16   Mean   : 83.37   Mean   : 99.21   Mean   : 82.66  
##  3rd Qu.: 79.00   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.00  
##  Max.   :104.00   Max.   :131.00   Max.   :140.00   Max.   :154.00  
##       x.29             x.30             x.31             x.32       
##  Min.   : 39.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.00  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 69.00  
##  Median : 68.00   Median : 85.00   Median :100.00   Median : 81.00  
##  Mean   : 68.94   Mean   : 83.15   Mean   : 99.11   Mean   : 82.62  
##  3rd Qu.: 79.00   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.00  
##  Max.   :104.00   Max.   :130.00   Max.   :145.00   Max.   :157.00  
##       x.33             x.34             x.35             x.36       
##  Min.   : 39.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.00  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 68.00  
##  Median : 67.00   Median : 84.00   Median :100.00   Median : 81.00  
##  Mean   : 68.73   Mean   : 82.86   Mean   : 98.93   Mean   : 82.51  
##  3rd Qu.: 79.00   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.00  
##  Max.   :104.00   Max.   :130.00   Max.   :145.00   Max.   :157.00  
##                 classes    
##  red soil           :1533  
##  cotton crop        : 703  
##  grey soil          :1358  
##  damp grey soil     : 626  
##  vegetation stubble : 707  
##  very damp grey soil:1508

Randomly split the dataset into training and test sets.

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

N <- nrow(Satellite)
set.seed(1) #set random seed reproducible
indx     <- sample(1:N, size=N, replace=F)
si.train <- Satellite[indx[1:4000],]
si.test  <- Satellite[indx[4001:N],]

The default setting in randomForest was used to fit the training samples. The option keep.forest=TRUE is necessary to predict on new dataset and plot partial dependence function.

#set random seed to make the results reproducible
set.seed(2) 
fit1 <- randomForest(classes~.,data=si.train, xtest=si.test[,-37], ytest=si.test$classes,ntree=200,keep.forest=TRUE)
fit1

## 
## Call:
##  randomForest(formula = classes ~ ., data = si.train, xtest = si.test[,      -37], ytest = si.test$classes, ntree = 200, keep.forest = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 9.05%
## Confusion matrix:
##                     red soil cotton crop grey soil damp grey soil
## red soil                 929           3         8              0
## cotton crop                0         415         0              2
## grey soil                  6           1       802             22
## damp grey soil             5           2        72            235
## vegetation stubble        21           3         2              2
## very damp grey soil        0           1        15             53
##                     vegetation stubble very damp grey soil class.error
## red soil                             4                   0  0.01588983
## cotton crop                          9                   2  0.03037383
## grey soil                            3                  10  0.04976303
## damp grey soil                       3                  68  0.38961039
## vegetation stubble                 404                  24  0.11403509
## very damp grey soil                 21                 853  0.09544008
##                 Test set error rate: 8.38%
## Confusion matrix:
##                     red soil cotton crop grey soil damp grey soil
## red soil                 580           0         6              0
## cotton crop                0         268         1              2
## grey soil                  1           0       497             12
## damp grey soil             1           1        40            158
## vegetation stubble        15           3         0              3
## very damp grey soil        0           0        11             33
##                     vegetation stubble very damp grey soil class.error
## red soil                             3                   0  0.01528014
## cotton crop                          2                   2  0.02545455
## grey soil                            0                   4  0.03307393
## damp grey soil                       1                  40  0.34439834
## vegetation stubble                 219                  11  0.12749004
## very damp grey soil                 12                 509  0.09911504

Since there are six classes on the response, there are seven OOB error curves. The black curve is for the overall error rate, while the other six are class-specific.

#Compare the prediction error with OOB estimates
pred.train <- predict(fit1,si.train,type="response")
tab <- table(pred.train,si.train$classes)
tab

##                      
## pred.train            red soil cotton crop grey soil damp grey soil
##   red soil                 944           0         0              0
##   cotton crop                0         428         0              0
##   grey soil                  0           0       844              0
##   damp grey soil             0           0         0            385
##   vegetation stubble         0           0         0              0
##   very damp grey soil        0           0         0              0
##                      
## pred.train            vegetation stubble very damp grey soil
##   red soil                             0                   0
##   cotton crop                          0                   0
##   grey soil                            0                   0
##   damp grey soil                       0                   0
##   vegetation stubble                 456                   0
##   very damp grey soil                  0                 943

1-sum(diag(tab))/sum(tab)

## [1] 0

par(mar=c(5,5,4,2),cex.axis=1.5,cex.lab=1.7,cex.main=2)
plot(fit1, main="Error rates on OOB samples")
legend("topright",legend=c("OOB",levels(si.train$classes)),col=1:7,lty=1:7,cex=1.5)

In each line of data the four spectral values for the top-left pixel are given first followed by the four spectral values for the top-middle pixel and then those for the top-right pixel, and so on with the pixels read out in sequence left-to-right and top-to-bottom. Thus, the four spectral values for the central pixel are given by attributes 17,18,19 and 20.

Let’s examine the relative variable importance. From the relative variable importance plot, we see that the central pixels are the most important variables, particularly Variable ‘x.17’, ‘x.18’ and `x.20’.

varImpPlot(fit1, sort=TRUE)

For K-class classification, there are K partial dependence plot, one for each class. Therefore, you need to specify which class. The default is the first class, which is ‘red soil’ here.

levels(si.train$classes)

Let’s look the partial dependence plots for ‘x.17’ and ‘x.20’ on two classes: ‘red soil’ and ‘grey soil’ in particular. Obviously, the effects of ‘x.17’ on red soil and grey soil are very different. The log-odds of being red soil is generally monotonically decreasing with increasing ‘x.17’ value. For grey soil, it is the opposite. For ‘x.20’, we see the logit of being red soil increases with ‘x.20’ value.

par(mfrow=c(2,2),cex.axis=1.5,cex.lab=1.7,cex.main=2)
partialPlot(fit1, si.train, 'x.17',which.class='red soil',main="PDP of x.17 for red soil")
partialPlot(fit1, si.train, 'x.17',which.class='grey soil',main="PDP of x.17 for grey soil")
partialPlot(fit1, si.train, 'x.20',which.class='red soil',main="PDP of x.20 for red soil")

Now, let’s take a look of the density plot for ‘x.17’ on red and grey soil. and for ‘x.20’ on red soil. We can see that the red soil samples have smaller values of ‘x.17’ than grey soil samples. However, red soil samples tend to have large value on ‘x.20’.

d1 <- density(si.train$x.17[which(si.train$classes=='red soil')])
d2 <- density(si.train$x.17[which(si.train$classes=='grey soil')])
par(mfrow=c(1,2),cex.axis=1.5,cex.lab=1.7)
plot(d1,xlim=range(d1$x,d2$x), ylim=range(d1$y,d2$y), col="red",
     main="Density plot for x.17 on red/grey soil",
     xlab="x.17",lwd=2)
lines(d2,col="grey",lwd=2,lty=2)
legend("topleft",legend=c("red soil","grey soil"),lty=c(1,2),col=c("red","grey"),lwd=2,cex=1.2)

## Density plots on x.17 for red and grey soils
d3 <- density(si.train$x.20[which(si.train$classes=='red soil')])
plot(d1,xlim=range(d1$x,d3$x), ylim=range(d1$y,d3$y), col="red",
     main="Density plot for x.17 and x.20 on red soil",
     xlab="",lwd=2)
lines(d3,col="blue",lwd=2,lty=2)
legend("topleft",legend=c("x.17","x.20"),lty=c(1,2),col=c("red","blue"),lwd=2,cex=1.2)

Random Forest on Satellite Image Dataset

Bin Li