Soil Dataset Analysis using PCR

Introduction of the dataset

The dataset contains a total of 675 soil samples collected from Nebraska, California, and Texas in 2014. In particular, 75 sampling points were randomly selected within a single agricultural field in each of these states and soil samples were collected at three depths (0-15, 15-30, and 30-45 cm). Ten physicochemical properties were measured for all 675 soil sample. They are soil cation exchange capacity (CEC), electrical conductivity (EC), total nitrogen level, total carbon level, loss on ignition (LOI), soil organic matter (SOM), clay, sand, silt and soil pH level. All the soil samples were scanned using a portable VisNIR spectroradiometer with a spectral range of 350 to 2500 nm. After smoothing and taking first order derivatives, the processed reflectance spectra were measured from 360 to 2490 nm at 10 nm intervals. The purpose of the study is to predict the nine soil physicochemical properties from its reflectance spectra. This dataset was originally used in Wang, Chakraborty, Weindorf, Li et al. (2015).

loc  <- "http://statweb.lsu.edu/faculty/li/IIT/soil-dataset.txt"
temp <- read.table(loc,header=T,sep="\t")
soil <- as.matrix(temp[,4:13])
spec <- as.matrix(temp[,40:240])
wl   <- seq(from=450,to=2450,by=10)
#These are the ten response variables 
colnames(soil)

##  [1] "CEC"      "EC"       "Nitrogen" "Carbon"   "LOI"      "SOM"     
##  [7] "Clay"     "Sand"     "Silt"     "pH"

Figure below shows the first 30 VisNIR spectra from the dataset.

matplot(wl,t(spec[1:30,]),xlab="Wavelength (nm)", ylab="",type="l",lwd=0.5)

Principal component regression (PCR)

Let’s just focus on predicting CEC based on the VisNIR spectra using PCR. The pls package in R does PCR and Partial Least Squares Regression (PLSR).

library(pls)

## 
## Attaching package: 'pls'

## The following object is masked from 'package:stats':
## 
##     loadings

#Use CEC as the response variable
data   <- as.data.frame(spec)
data$y <- soil[,1]

The main tuning parameter in PCR (also in PLSR) is the number of components included in the regression model, which is similar to the number of variables included in the multiple linear regression. The more components included in the model, the more complex the model is. So we need to choose the number of components carefully. Usually, we can use \(k-\)fold CV or LOOCV to choose the number of components.

The sum of squared errors in LOOCV is defined as \[ SSE=\sum_{i=1}^n \left(y_i-\hat{y}_i^{(-i)}\right)^2. \] We see that the optimal number of components in PCR is 15.

ptm<-proc.time() #used to find the computing time in seconds
fit1   <- pcr(y~.,20, data=data,validation="LOO")
proc.time()-ptm #used to find the computing time in seconds

##    user  system elapsed 
##  190.51    5.24  210.12

#fit1$validation$PRESS is the SSE in LOOCV
fit1$validation$PRESS

##    1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## y 28964.45 21969.15 22029.03 20681.55 12937.86 8099.351 7250.569 7289.435
##    9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps 16 comps
## y 6184.348 5956.483 5892.017 5619.227 5545.221 5563.624 5457.652 5489.924
##   17 comps 18 comps 19 comps 20 comps
## y 5544.627 5546.122  5582.54  5559.72

opt.comp1 <- which.min(fit1$validation$PRESS)
opt.comp1

## [1] 15

The figure below shows the number of components against the corresponding SSE in LOOCV.

plot(1:20,fit1$validation$PRESS,xlab="No. of components in PCR",ylab="LOOCV results",type="b")
abline(v=opt.comp1,col="red")

The coefficient plot for the PCR with the optimal number of components.

plot(wl,fit1$coef[,,opt.comp1],type="l",xlab="Wavelength (nm)",ylab="PCR coefficient")

The figure below shows the predicted values against measured ones.

pred <- predict(fit1,data,ncomp=opt.comp1,type="response")
plot(data$y,pred,xlim=range(c(data$y,pred)),ylim=range(c(data$y,pred)), xlab="Measured CEC",ylab="Predicted CEC")
abline(0,1)

We can also use \(k-\)fold cross-validation to find the optimal number of components.

ptm<-proc.time()
fit2 <- pcr(y~.,20,data=data)
fit2.cv <- crossval(fit2,segments=5)
proc.time()-ptm

##    user  system elapsed 
##    2.13    0.12    2.86

#5-fold CV results
fit2.cv$valid$PRESS

##    1 comps  2 comps  3 comps  4 comps  5 comps  6 comps 7 comps  8 comps
## y 28929.18 22007.18 22242.19 20931.22 13070.73 8170.827 7313.58 7299.675
##   9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps 16 comps
## y 6237.96  5994.14 5932.679 5682.782 5651.342 5681.922 5573.853 5608.253
##   17 comps 18 comps 19 comps 20 comps
## y 5669.228 5689.267 5738.981  5696.24

opt.comp2 <- which.min(fit2.cv$valid$PRESS)
opt.comp2

## [1] 15

plot(1:20,fit2.cv$valid$PRESS,xlab="No. of components in PCR",ylab="CV results",type="b")
abline(v=opt.comp2,col="red")

We get the same results as using LOOCV, but much faster than LOOCV. In practice, if the dataset is high-dimensional (i.e. having many predictors) and/or large sample size, \(k-\)fold CV is usually more computationally efficient than LOOCV.

Soil Dataset Analysis using PCR

Bin Li

IIT Lecture Series

Introduction of the dataset

Principal component regression (PCR)