Random Forest on Satellite Image Dataset

This data set consists of four digital images (from Landsat MSS imagery) of the same scene in different spectral bands. Two of these are in the visible region (corresponding approximately to green and red regions of the visible spectrum) and two are in the (near) infra-red. Each pixel is a 8-bit binary word, with 0 corresponding to black and 255 to white. The spatial resolution of a pixel is about 80m \(\times\) 80m.

Each image contains 2340 \(\times\) 3380 such pixels. The database is a (tiny) sub-area of a scene, consisting of 82 \(\times\) 100 pixels. Each line of data corresponds to a \(3\times3\) square neighbourhood of pixels completely contained within the 82 \(\times\) 100 sub-area. Each line contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3 \(\times\) 3 neighbourhood and a number indicating the classification label of the central pixel.

The response variable contains six classes: red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble, very damp grey soil. The data set contains 36 input variables and 1 response and 6435 observations.

library(mlbench)

## Warning: package 'mlbench' was built under R version 3.3.3

data(Satellite)
dim(Satellite)

## [1] 6435   37

summary(Satellite)

##       x.1             x.2              x.3              x.4        
##  Min.   : 39.0   Min.   : 27.00   Min.   : 53.00   Min.   : 33.00  
##  1st Qu.: 60.0   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 69.00  
##  Median : 68.0   Median : 87.00   Median :101.00   Median : 81.00  
##  Mean   : 69.4   Mean   : 83.59   Mean   : 99.29   Mean   : 82.59  
##  3rd Qu.: 80.0   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.00  
##  Max.   :104.0   Max.   :137.00   Max.   :140.00   Max.   :154.00  
##       x.5              x.6              x.7              x.8       
##  Min.   : 39.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.0  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 69.0  
##  Median : 68.00   Median : 85.00   Median :101.00   Median : 81.0  
##  Mean   : 69.15   Mean   : 83.24   Mean   : 99.11   Mean   : 82.5  
##  3rd Qu.: 80.00   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.0  
##  Max.   :104.00   Max.   :137.00   Max.   :145.00   Max.   :157.0  
##       x.9              x.10             x.11             x.12       
##  Min.   : 40.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.00  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 68.00  
##  Median : 67.00   Median : 85.00   Median :100.00   Median : 81.00  
##  Mean   : 68.91   Mean   : 82.89   Mean   : 98.85   Mean   : 82.39  
##  3rd Qu.: 79.00   3rd Qu.:102.00   3rd Qu.:113.00   3rd Qu.: 92.00  
##  Max.   :104.00   Max.   :130.00   Max.   :145.00   Max.   :157.00  
##       x.13             x.14             x.15             x.16       
##  Min.   : 39.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.00  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 69.00  
##  Median : 68.00   Median : 85.00   Median :101.00   Median : 81.00  
##  Mean   : 69.29   Mean   : 83.48   Mean   : 99.31   Mean   : 82.64  
##  3rd Qu.: 80.00   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.00  
##  Max.   :104.00   Max.   :137.00   Max.   :145.00   Max.   :154.00  
##       x.17             x.18             x.19             x.20      
##  Min.   : 40.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.0  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 69.0  
##  Median : 68.00   Median : 85.00   Median :100.00   Median : 81.0  
##  Mean   : 69.05   Mean   : 83.17   Mean   : 99.15   Mean   : 82.6  
##  3rd Qu.: 79.00   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.0  
##  Max.   :104.00   Max.   :130.00   Max.   :145.00   Max.   :157.0  
##       x.21             x.22             x.23             x.24       
##  Min.   : 39.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.00  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 68.00  
##  Median : 67.00   Median : 84.00   Median :100.00   Median : 81.00  
##  Mean   : 68.84   Mean   : 82.86   Mean   : 98.95   Mean   : 82.47  
##  3rd Qu.: 79.00   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.00  
##  Max.   :104.00   Max.   :130.00   Max.   :145.00   Max.   :157.00  
##       x.25             x.26             x.27             x.28       
##  Min.   : 39.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.00  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 69.00  
##  Median : 68.00   Median : 85.00   Median :100.00   Median : 81.00  
##  Mean   : 69.16   Mean   : 83.37   Mean   : 99.21   Mean   : 82.66  
##  3rd Qu.: 79.00   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.00  
##  Max.   :104.00   Max.   :131.00   Max.   :140.00   Max.   :154.00  
##       x.29             x.30             x.31             x.32       
##  Min.   : 39.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.00  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 69.00  
##  Median : 68.00   Median : 85.00   Median :100.00   Median : 81.00  
##  Mean   : 68.94   Mean   : 83.15   Mean   : 99.11   Mean   : 82.62  
##  3rd Qu.: 79.00   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.00  
##  Max.   :104.00   Max.   :130.00   Max.   :145.00   Max.   :157.00  
##       x.33             x.34             x.35             x.36       
##  Min.   : 39.00   Min.   : 27.00   Min.   : 50.00   Min.   : 29.00  
##  1st Qu.: 60.00   1st Qu.: 71.00   1st Qu.: 85.00   1st Qu.: 68.00  
##  Median : 67.00   Median : 84.00   Median :100.00   Median : 81.00  
##  Mean   : 68.73   Mean   : 82.86   Mean   : 98.93   Mean   : 82.51  
##  3rd Qu.: 79.00   3rd Qu.:103.00   3rd Qu.:113.00   3rd Qu.: 92.00  
##  Max.   :104.00   Max.   :130.00   Max.   :145.00   Max.   :157.00  
##                 classes    
##  red soil           :1533  
##  cotton crop        : 703  
##  grey soil          :1358  
##  damp grey soil     : 626  
##  vegetation stubble : 707  
##  very damp grey soil:1508

Randomly split the dataset into training and test sets.

library(randomForest)

## Warning: package 'randomForest' was built under R version 3.3.3

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

N <- nrow(Satellite)
set.seed(1) #set random seed reproducible
indx     <- sample(1:N, size=N, replace=F)
si.train <- Satellite[indx[1:4000],]
si.test  <- Satellite[indx[4001:N],]

The default setting in randomForest was used to fit the training samples. The option keep.forest=TRUE is necessary to predict on new dataset and plot partial dependence function.

#set random seed to make the results reproducible
set.seed(1) 
fit1 <- randomForest(classes~.,data=si.train, xtest=si.test[,-37], ytest=si.test$classes,ntree=200,keep.forest=TRUE)
fit1

## 
## Call:
##  randomForest(formula = classes ~ ., data = si.train, xtest = si.test[,      -37], ytest = si.test$classes, ntree = 200, keep.forest = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 8.82%
## Confusion matrix:
##                     red soil cotton crop grey soil damp grey soil
## red soil                 930           3        10              1
## cotton crop                1         432         1              2
## grey soil                  5           1       827             18
## damp grey soil             2           2        76            229
## vegetation stubble        19           2         0              3
## very damp grey soil        0           0        20             45
##                     vegetation stubble very damp grey soil class.error
## red soil                             5                   0  0.02002107
## cotton crop                          5                   3  0.02702703
## grey soil                            0                   9  0.03837209
## damp grey soil                       5                  70  0.40364583
## vegetation stubble                 380                  27  0.11832947
## very damp grey soil                 18                 849  0.08905579
##                 Test set error rate: 8.58%
## Confusion matrix:
##                     red soil cotton crop grey soil damp grey soil
## red soil                 576           0         6              0
## cotton crop                0         256         0              1
## grey soil                  4           0       479              8
## damp grey soil             4           1        42            153
## vegetation stubble        13           3         1              0
## very damp grey soil        0           0         8             35
##                     vegetation stubble very damp grey soil class.error
## red soil                             2                   0  0.01369863
## cotton crop                          1                   1  0.01158301
## grey soil                            1                   6  0.03815261
## damp grey soil                       2                  40  0.36776860
## vegetation stubble                 242                  17  0.12318841
## very damp grey soil                 13                 520  0.09722222

Since there are six classes on the response, there are seven OOB error curves. The black curve is for the overall error rate, while the other six are class-specific.

par(mar=c(5,5,4,2),cex.axis=1.5,cex.lab=1.7,cex.main=2)
plot(fit1, main="Error rates on OOB samples")
legend("topright",legend=c("OOB",levels(si.train$classes)),col=1:7,lty=1:7,cex=1.5)

In each line of data the four spectral values for the top-left pixel are given first followed by the four spectral values for the top-middle pixel and then those for the top-right pixel, and so on with the pixels read out in sequence left-to-right and top-to-bottom. Thus, the four spectral values for the central pixel are given by attributes 17,18,19 and 20.

Let’s examine the relative variable importance. From the relative variable importance plot, we see that the central pixels are the most important variables, particularly Variable ‘x.17’, ‘x.18’ and `x.20’.

varImpPlot(fit1, sort=TRUE)

For K-class classification, there are K partial dependence plot, one for each class. Therefore, you need to specify which class. The default is the first class, which is ‘red soil’ here.

levels(si.train$classes)

## [1] "red soil"            "cotton crop"         "grey soil"          
## [4] "damp grey soil"      "vegetation stubble"  "very damp grey soil"

Let’s look the partial dependence plots for ‘x.17’ and ‘x.20’ on two classes: ‘red soil’ and ‘grey soil’ in particular. Obviously, the effects of ‘x.17’ on red soil and grey soil are very different. The log-odds of being red soil is generally monotonically decreasing with increasing ‘x.17’ value. For grey soil, it is the opposite. For ‘x.20’, we see the logit of being red soil increases with ‘x.20’ value.

par(mfrow=c(2,2),cex.axis=1.5,cex.lab=1.7,cex.main=2)
partialPlot(fit1, si.train, 'x.17',which.class='red soil',main="PDP of x.17 for red soil")
partialPlot(fit1, si.train, 'x.17',which.class='grey soil',main="PDP of x.17 for grey soil")
partialPlot(fit1, si.train, 'x.20',which.class='red soil',main="PDP of x.20 for red soil")

Now, let’s take a look of the density plot for ‘x.17’ on red and grey soil. and for ‘x.20’ on red soil. We can see that the red soil samples have smaller values of ‘x.17’ than grey soil samples. However, red soil samples tend to have large value on ‘x.20’.

d1 <- density(si.train$x.17[which(si.train$classes=='red soil')])
d2 <- density(si.train$x.17[which(si.train$classes=='grey soil')])
par(mfrow=c(1,2),cex.axis=1.5,cex.lab=1.7)
plot(d1,xlim=range(d1$x,d2$x), ylim=range(d1$y,d2$y), col="red",
     main="Density plot for x.17 on red/grey soil",
     xlab="x.17",lwd=2)
lines(d2,col="grey",lwd=2,lty=2)
legend("topleft",legend=c("red soil","grey soil"),lty=c(1,2),col=c("red","grey"),lwd=2,cex=1.2)

## Density plots on x.17 for red and grey soils
d3 <- density(si.train$x.20[which(si.train$classes=='red soil')])
plot(d1,xlim=range(d1$x,d3$x), ylim=range(d1$y,d3$y), col="red",
     main="Density plot for x.17 and x.20 on red soil",
     xlab="",lwd=2)
lines(d3,col="blue",lwd=2,lty=2)
legend("topleft",legend=c("x.17","x.20"),lty=c(1,2),col=c("red","blue"),lwd=2,cex=1.2)

Random Forest on Satellite Image Dataset

Bin Li

IIT Lecture Series