Simulation example

The data set, which consists of 10 \(X\) variables and a response \(Y\), was generated from the following model: \[\begin{eqnarray} y=1.5I(x_5>0)x_1-1.5I(x_5 \leq 0)x_1+0.25x_2^2+200\phi(x_3,x_4)+\epsilon, \,\, \epsilon \sim N(0,\sigma^2) \end{eqnarray}\]

where the \(X\) variables are drawn from a multivariate normal distribution (\(p=10\)) with zero mean. The variance of each \(X\) variable is 4, and the pairwise correlation is \(\rho=0.125\). The function \(\phi()\) is the probability density function of a bivariate normal distribution with \(mean=(0,0)\), \(variance=(4,4)\) and \(covariance=2\). Only the first five variables are related to the response \(Y\). The coefficients are chosen so that each term contributes approximately equally (i.e. the variance of each term is about the same), and \(\sigma\) is chosen so that the signal-to-noise ratio is 100 (i.e. \(var(y)=101\sigma^2\)). There are two pairs of bivariate interactions: \(X_1\) with \(X_5\), and \(X_3\) with \(X_4\).

library(mvtnorm)
n=500
avg=rep(0,10)
sigma=diag(3.5,10)+matrix(0.5,10,10)   # variance 4 on the diagonal, covariance 0.5 off the diagonal

set.seed(4)
x=rmvnorm(n, avg, sigma)
f=200*dmvnorm(x[,3:4],mean=c(0,0),sigma=diag(c(2,2))+matrix(2,2,2))+0.25*x[,2]^2+1.5*x[,1]*as.numeric(x[,5]>0)-
1.5*x[,1]*as.numeric(x[,5]<=0)
y=f+rnorm(n,mean=0,sd=sd(f)/10)        # noise sd chosen for a signal-to-noise ratio of 100
data=as.data.frame(x)
names(data)=paste("X",1:10,sep="")
data$Y=y

Gradient boosting with trees was applied to the data set. Out-of-bag (OOB) performance was used to monitor the fitting procedure and determine the optimal number of trees.

library(gbm)
set.seed(1)
fit.gbm1=gbm(Y~., distribution = "gaussian", data=data,
                 n.trees=100, interaction.depth=4, shrinkage=0.02,
                 bag.fraction=0.5, verbose=FALSE)
best.iter=gbm.perf(fit.gbm1, plot.it=FALSE, oobag.curve=FALSE, method="OOB")
while(fit.gbm1$n.trees-best.iter<20){
      fit.gbm1   = gbm.more(fit.gbm1, 50)           # do another 50 iterations
      best.iter = gbm.perf(fit.gbm1, plot.it=FALSE, oobag.curve=FALSE, method="OOB")
}
best.iter #optimal number of trees
## [1] 500
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4, 
##     length(x)/10), 50))
## 
## Number of Observations: 550 
## Equivalent Number of Parameters: 41.57 
## Residual Standard Error: 0.02215382

Feature importance

The iml package in R is a useful toolbox that makes machine learning models and their predictions interpretable. First, let's check the feature importance, which measures the increase in the model's prediction error after the feature's values are permuted. Shuffling a feature's values breaks its association with the true outcome, so the model prediction loss (error) is measured before and after the shuffle: the larger the increase in prediction error, the more important the feature. The shuffling is repeated several times to get more accurate results, since permutation feature importance tends to be quite unstable. For this regression task there are several choices of loss function, such as mae (mean absolute error) and rmse (root mean squared error); the full list of loss functions is available via library(help = "Metrics"). Once we create a new FeatureImp object, the importance is computed automatically. We can call the plot() function on the object or look at the results in a data.frame.
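Before calling iml, here is a minimal hand-rolled sketch of the idea, assuming the fit.gbm1 model and data from above (the helper name perm_imp is made up for illustration). It reports the ratio of the shuffled-feature MSE to the baseline MSE, averaged over a few repeats.

# a hand-rolled sketch of permutation importance (error ratio), for illustration only
perm_imp = function(feature, n_repeats=5){
  pred = predict(fit.gbm1, newdata=data[,1:10], n.trees=fit.gbm1$n.trees)
  mse0 = mean((data$Y - pred)^2)                       # baseline error
  ratios = replicate(n_repeats, {
    shuffled = data[,1:10]
    shuffled[[feature]] = sample(shuffled[[feature]])  # break the feature-outcome link
    pred.s = predict(fit.gbm1, newdata=shuffled, n.trees=fit.gbm1$n.trees)
    mean((data$Y - pred.s)^2) / mse0                   # error ratio after shuffling
  })
  mean(ratios)
}
perm_imp("X1")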

library(iml)
library(devtools)
library(ggplot2)
predict.fun1=function(obj,newdata){
  predict(obj,newdata=newdata,n.trees=fit.gbm1$n.trees)
}
obj1 = Predictor$new(fit.gbm1,data=data[,1:10],y=data$Y,predict.fun=predict.fun1)
imp1=FeatureImp$new(obj1,loss="mse")
par(cex.lab=2,cex.axis=2)
plot(imp1)

imp1
## Interpretation method:  FeatureImp 
## error function: mse
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 500 rows and 10 columns.
## 
## Head of results:
##   feature importance.05 importance importance.95 permutation.error
## 1      X1      5.871361   6.175999      7.452310          9.037452
## 2      X4      5.152372   5.913281      6.098329          8.653011
## 3      X5      5.779844   5.890748      6.639052          8.620038
## 4      X3      4.759601   5.145368      5.198751          7.529311
## 5      X2      3.960868   4.146214      4.626036          6.067230
## 6      X7      1.013457   1.176904      1.369538          1.722185

We see that the first five \(X\) variables (\(X_1,\ldots,X_5\)) dominate the importance. Now let’s check their effects on the response. There are a few approaches available.

Partial dependence function

The partial dependence function (PDF) shows the marginal effect that one or two features have on the predicted outcome of a machine learning model (Friedman, 2001). A partial dependence plot (PDP) can show whether the relationship between the target and a feature is linear, monotonic or more complex. Let \(\hat{f}\) denote the fitted model, \(S\) a set of features of interest (usually one or two), and \(C\) its complement (i.e. \(C \cup S=\{1,2,\ldots,p\}\)). Then the PDF of \(X_S\), \(PD_S(x_S)\), is defined as: \[\begin{eqnarray} PD_S(x_S)&=&E_{X_C}\left[\hat{f}(x_S,X_C)\right]\\ &=&\int \hat{f}(x_S,X_C)dP(X_C)\\ &\approx& \frac{1}{n}\sum_{i=1}^n\hat{f}(x_S,x_{iC}) \end{eqnarray}\]

The partial dependence function at a particular feature value represents the average prediction if we force all data points to assume that feature value. Therefore, the PDF assumes that the features in \(X_S\), for which the partial dependence is computed, are not correlated with the other features \(X_C\).
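To make the last approximation concrete, here is a minimal sketch that computes the partial dependence of \(X_2\) by hand, assuming the fit.gbm1 model and data from above (the grid size of 20 is arbitrary).

# hand-rolled partial dependence for X2: average predictions with X2 forced to each grid value
grid = seq(min(data$X2), max(data$X2), length.out=20)
pd.X2 = sapply(grid, function(v){
  tmp = data[,1:10]
  tmp$X2 = v                                           # force every observation to X2 = v
  mean(predict(fit.gbm1, newdata=tmp, n.trees=fit.gbm1$n.trees))
})
plot(grid, pd.X2, type="l", xlab="X2", ylab="partial dependence")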

plot(FeatureEffects$new(obj1,method="pdp"))

From the PDPs above, we see that:

  1. The PDP of \(X_1\) is almost flat, even though \(X_1\) is influential: its positive slope (when \(x_5>0\)) and negative slope (when \(x_5 \leq 0\)) cancel out when averaged over \(X_5\).

  2. \(X_2\) has the quadratic curve as in the true model.

  3. \(X_3\) and \(X_4\) have the bell-shaped curves as in the true model.

  4. Although \(X_5\) is one of the most influential variables, its PDP is flat; since \(X_5\) enters the true model only through the indicator that switches the sign of the \(X_1\) effect, this again agrees with the true model.

Accumulated local effects (ALE)

Accumulated local effects describe how features influence the prediction of a machine learning model on average. ALE plots are a faster and unbiased alternative to partial dependence plots (PDPs). ALE differs from the PDF in two respects:

  1. ALE shows differences in prediction instead of averages. For example, if the ALE at \(X_2=1\) is -2, it means that when \(X_2=1\), the prediction is lower than the average prediction by 2 units of \(y\).

  2. ALE averages the differences over a small neighborhood of \(x_{iS}\) (e.g. all the data points whose \(X_2\) value is close to 1). In contrast, the PDF averages over all the data points (e.g. with every \(X_2\) value replaced by 1).

Both methods reduce the function by averaging the effects of the other features, but they differ in whether averages of predictions or of differences in predictions are calculated and whether averaging is done over the marginal or conditional distribution.

Therefore, ALE plots tell us how the model predictions change within a small “window” of the feature around \(x_{iS}\) for the data instances in that window. PDPs tell us what the model predicts on average when every data instance is assigned the value \(x_{iS}\) for that feature, whether or not the value \(x_{iS}\) makes sense for that instance.
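As a rough, hand-rolled illustration of the "local differences, then accumulate" idea (not iml's actual implementation), the sketch below computes an approximate 1D ALE curve for \(X_2\) using the fit.gbm1 model and data from above; the number of intervals K=20 and the simple centering are arbitrary choices.

# rough 1D ALE sketch for X2: local prediction differences per interval, accumulated and centered
K = 20
breaks = quantile(data$X2, probs=seq(0, 1, length.out=K+1))
bin = cut(data$X2, breaks, include.lowest=TRUE, labels=FALSE)
local.eff = sapply(1:K, function(k){
  idx = which(bin == k)
  if(length(idx) == 0) return(0)
  lower = upper = data[idx, 1:10]
  lower$X2 = breaks[k]; upper$X2 = breaks[k+1]         # move each point to the interval edges
  mean(predict(fit.gbm1, newdata=upper, n.trees=fit.gbm1$n.trees) -
       predict(fit.gbm1, newdata=lower, n.trees=fit.gbm1$n.trees))
})
ale = cumsum(local.eff)
ale = ale - mean(ale)                                  # crude centering around zero
plot(breaks[-1], ale, type="l", xlab="X2", ylab="ALE (sketch)")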

plot(FeatureEffects$new(obj1,method="ale"))

From the ALE plots above, we make the following observations.

  1. They are all centered around 0, instead of around a non-zero constant as in the PDPs, because ALE plots show differences from the average prediction rather than average predictions.

  2. The ALE plots have similar shapes to PDPs.

  3. The ALE plots are generally less smooth than the PDPs, since each point averages over a small neighborhood instead of all the data points.

Compared to PDP, ALE plots have the following advantages:

  1. ALE plots are unbiased, which means they still work when features are correlated. Partial dependence plots fail in this scenario because they marginalize over unlikely or even physically impossible combinations of feature values.

  2. ALE plots are faster to compute than PDP.

  3. ALE plots are centered at zero. This makes their interpretation nice, because the value at each point of the ALE curve is the difference to the mean prediction.

  4. The 2D ALE plot only shows the interaction: If two features do not interact, the plot shows nothing.

Individual conditional expectation (ICE)

Individual conditional expectation (ICE) plots display one line per instance, showing how that instance's prediction changes when a feature changes. The PDF is computed as \(\frac{1}{n}\sum_{i=1}^n\hat{f}(x_S,x_{iC})\); the ICE curve of \(X_S\) for the \(i^{th}\) instance is \(\hat{f}(x_S,x_{iC})\). Thus, the PDF is the average of the ICE curves.

Then, what is the point of looking at ICE plots instead of PDPs? A PDP can obscure a heterogeneous relationship created by interactions: it shows what the average relationship between a feature and the prediction looks like, and therefore only works well when the interactions between \(X_S\) and \(X_C\) are weak. When there are interactions, ICE plots provide much more insight, as the sketch and plots below show.
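As a minimal sketch of the definition, the code below computes ICE curves of \(X_1\) by hand for a handful of observations, assuming the fit.gbm1 model and data from above (the choice of the first five rows and a 20-point grid is arbitrary).

# hand-rolled ICE curves for X1: vary X1 for a few rows while holding their other features fixed
grid = seq(min(data$X1), max(data$X1), length.out=20)
ice = sapply(1:5, function(i){
  tmp = data[rep(i, length(grid)), 1:10]
  tmp$X1 = grid                                        # vary X1, keep x_{iC} fixed
  predict(fit.gbm1, newdata=tmp, n.trees=fit.gbm1$n.trees)
})
matplot(grid, ice, type="l", lty=1, xlab="X1", ylab="ICE (sketch)")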

plot(FeatureEffect$new(obj1,"X1",method="ice"))

We see two groups of ICE curves for \(X_1\): one group with a positive slope and the other with a negative slope. This agrees with the true model (i.e. \(1.5I(x_5>0)x_1-1.5I(x_5 \leq 0)x_1\)). Let's take a closer look at subsets of the data defined by the sign of \(X_5\).

ind = which(data$X5>0)
obj2 = Predictor$new(fit.gbm1,data=data[ind,1:10],y=data$Y[ind],predict.fun=predict.fun1)
plot(FeatureEffect$new(obj2,"X1",method="ice"))

obj3 = Predictor$new(fit.gbm1,data=data[-ind,1:10],y=data$Y[-ind],predict.fun=predict.fun1)
plot(FeatureEffect$new(obj3,"X1",method="ice"))

Feature interaction

When features interact with each other in a prediction model, the prediction cannot be expressed as the sum of the feature effects (i.e. additive effects), because the effect of one feature depends on the value of the other feature. If two features do not interact, we can decompose the partial dependence function as follows (assuming the partial dependence functions are centered at zero): \[\begin{eqnarray} PD_{jk}(x_j,x_k)=PD_j(x_j)+PD_k(x_k), \end{eqnarray}\] where \(PD_{jk}(x_j,x_k)\) is the 2-way partial dependence function of both features and \(PD_j(x_j)\) and \(PD_k(x_k)\) the partial dependence functions of the single features. Likewise, if a feature has no interaction with any of the other features, we can express the prediction function \(\hat{f}\) as a sum of partial dependence functions, where the first summand depends only on \(j\) and the second on all other features except \(j\): \[\begin{eqnarray} \hat{f}=PD_j(x_j)+PD_{-j}(x_{-j}), \end{eqnarray}\]

where \(PD_{-j}(x_{-j})\) is the partial dependence function that depends on all features except the \(j^{th}\) feature. Friedman and Popescu (2008) proposed the H-statistic, which measures how strongly the observed partial dependence deviates from these no-interaction decompositions, to quantify the interaction between two sets of variables; it can also be applied in a classification setting if the prediction is a probability. They also propose a test statistic to evaluate whether the H-statistic differs significantly from zero, with the absence of interaction as the null hypothesis. To generate the interaction statistic under the null hypothesis, one must be able to adjust the model so that it has no interaction between feature \(j\) and \(k\) or all others. This is not possible for all types of models, so the test is model-specific rather than model-agnostic and is not covered here.
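As a rough illustration of the pairwise H-statistic (a hand-rolled sketch, not iml's implementation), the code below estimates the squared H-statistic for \(X_3\) and \(X_4\) by comparing the centered two-way partial dependence with the sum of the centered one-way partial dependences, using the fit.gbm1 model and a subsample of the data to keep the computation cheap.

# rough sketch of the squared H-statistic for X3 and X4 (Friedman and Popescu, 2008)
set.seed(2)
sub = data[sample(nrow(data), 100), 1:10]              # subsample to limit the number of predictions
pd.at = function(feats){
  # centered partial dependence evaluated at each subsampled point
  pd = sapply(1:nrow(sub), function(i){
    tmp = sub
    tmp[, feats] = sub[rep(i, nrow(sub)), feats]       # force feats to row i's values
    mean(predict(fit.gbm1, newdata=tmp, n.trees=fit.gbm1$n.trees))
  })
  pd - mean(pd)
}
pd.jk = pd.at(c("X3","X4")); pd.j = pd.at("X3"); pd.k = pd.at("X4")
sum((pd.jk - pd.j - pd.k)^2) / sum(pd.jk^2)            # squared H-statistic for the pair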

Let’s first examine the overall interaction strength for all the variables.

plot(Interaction$new(obj1))

Then, let’s examine the 2-way interaction strengths of \(X_1\) with all other features, and of \(X_3\) with all other features.

plot(Interaction$new(obj1,feature="X1"))

plot(Interaction$new(obj1,feature="X3"))

Finally, let’s look at the 2D PDP of \(X_3\) and \(X_4\).

plot(FeatureEffect$new(obj1,c("X3","X4"),method="pdp",grid.size=c(30,30)))

Reference

  1. Friedman, Jerome H. “Greedy function approximation: A gradient boosting machine.” The Annals of Statistics 29.5 (2001): 1189-1232.

  2. Friedman, Jerome H., and Bogdan E. Popescu. “Predictive learning via rule ensembles.” The Annals of Applied Statistics 2.3 (2008): 916-954.

  3. Christoph Molnar (2020). “Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.” https://christophm.github.io/interpretable-ml-book/