How to implement shapper:shap for whole dataset? - r

I have created a Random Forest model using the randomForest package
model_rf <- randomForest(y~ . , data = data_train,ntree=1000, keep.forest=TRUE,importance=TRUE)
To calculate Shapley values for the different features based on this RF model, I first create an "explainer object" and then use the "shapper" package
exp_rf <- DALEX::explain(model_rf, data = data_test[,-1], y = data_test[,1])
ive_rf <- shap(exp_rf, new_observation = data_test[1,-1])
To my knowledge, I can only apply the "shap" function to one observation (the "new_observation").
But I am looking for a way to calculate the shapley values for all of my respondents in my datafile.
I know this is possible in the "SHAP" package in Python; but is it also possible with the "shapper" package in R?
At the moment, I created a loop to calculate the shapley values for all respondents, but this will take me days to calculate for my entire datafile.
for(i in c(1:nrow(data_test)))
{
ive_rf <- shap(exp_rf,new_observation=data_test[i,-1])
shapruns<-cbind(shapruns,ive_rf[,"_attribution_"])
}
Any help would be much appreciated.

I recently published two R packages that are optimized for this kind of tasks: "kernelshap" (calculate SHAP values fast) and "shapviz" (plot SHAP values from any source). In your case, a working example would be:
library(randomForest)
library(kernelshap)
library(shapviz)
set.seed(1)
fit <- randomForest(Sepal.Length ~ ., data = iris,)
# Step 1: Calculate Kernel SHAP values
# bg_X is usually a small (50-200 rows) subset of the data
s <- kernelshap(fit, iris[-1], bg_X = iris)
# Step 2: Turn them into a shapviz object
sv <- shapviz(s)
# Step 3: Gain insights...
sv_importance(sv, kind = "bee")
sv_dependence(sv, v = "Petal.Length", color_var = "auto")

Related

shap plots for random forest models

I would like to get the Shap Contribution for variables for a Ranger/random forest model and have plots like this in R:
beeswarm plots
I have tried using the following libraries: DALEX, shapr, fastshap, shapper. I could only end up getting plots like this:
fastshap plot
Is it possible to get such plots? I have tried reticulate package and it still doesnt work.
Random forests need to grow many deep trees. While possible, crunching TreeSHAP for deep trees requires an awful lot of memory and CPU power. An alternative is to use the Kernel SHAP algorithm, which works for all kind of models.
library(ranger)
library(kernelshap)
library(shapviz)
set.seed(1)
fit <- ranger(Sepal.Length ~ ., data = iris,)
# Step 1: Calculate Kernel SHAP values
# bg_X is usually a small (50-200 rows) subset of the data
s <- kernelshap(fit, iris[-1], bg_X = iris)
# Step 2: Turn them into a shapviz object
sv <- shapviz(s)
# Step 3: Gain insights...
sv_importance(sv, kind = "bee")
sv_dependence(sv, v = "Petal.Length", color_var = "auto")
Disclaimer: I wrote "kernelshap" and "shapviz"

Is it possible to adapt standard prediction interval code for dlm in R with other distribution?

Using the dlm package in R I fit a dynamic linear model to a time series data set, consisting of 20 observations. I then use the dlmForecast function to predict future values (which I can validate against the genuine data for said period).
I use the following code to create a prediction interval;
ciTheory <- (outer(sapply(fut1$Q, FUN=function(x) sqrt(diag(x))), qnorm(c(0.05,0.95))) +
as.vector(t(fut1$f)))
However my data does not follow a normal distribution and I wondered whether it would be possible to
adapt the qnorm function for other distributions. I have tried qt, but am unable to apply qgamma.......
Just wondered if anyone knew how you would go about sorting this.....
Below is a reproduced version of my code...
library(dlm)
data <- c(20.68502, 17.28549, 12.18363, 13.53479, 15.38779, 16.14770, 20.17536, 43.39321, 42.91027, 49.41402, 59.22262, 55.42043)
mod.build <- function(par) {
dlmModPoly(1, dV = exp(par[1]), dW = exp(par[2]))
}
# Returns most likely estimate of relevant values for parameters
mle <- dlmMLE(a2, rep(0,2), mod.build); #nileMLE$conv
if(mle$convergence==0) print("converged") else print("did not converge")
mod1 <- dlmModPoly(dV = v, dW = c(0, w))
mod1Filt <- dlmFilter(a1, mod1)
fut1 <- dlmForecast(mod1Filt, n = 7)
Cheers

Issue with h2o Package in R using subsetted dataframes leading to near perfect prediction accuracy

I have been stumped on this problem for a very long time and cannot figure it out. I believe the issue stems from subsets of data.frame objects retaining information of the parent but I also feel it's causing issues when training h2o.deeplearning models on what I think is just my training set (though this may not be true). See below for sample code. I included comments to clarify what I'm doing but it's fairly short code:
dataset = read.csv("dataset.csv")[,-1] # Read dataset in but omit the first column (it's just an index from the original data)
y = dataset[,1] # Create response
X = dataset[,-1] # Create regressors
X = model.matrix(y~.,data=dataset) # Automatically create dummy variables
y=as.factor(y) # Ensure y has factor data type
dataset = data.frame(y,X) # Create final data.frame dataset
train = sample(length(y),length(y)/1.66) # Create training indices -- A boolean
test = (-train) # Create testing indices
h2o.init(nthreads=2) # Initiate h2o
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(y='y',training_frame=as.h2o(dataset[train,,drop=TRUE]),activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
predictions = h2o.predict(mlModel,newdata=as.h2o(dataset[test,-1])) # Predict using mlModel
predictions = as.data.frame(predictions) # Convert predictions to dataframe object. as.vector() caused issues for me
predictions = predictions[,1] # Extract predictions
mean(predictions!=y[test])
The problem is that if I evaluate this against my test subset I get almost 0% error:
[1] 0.0007531255
Has anyone encountered this issue? Have an idea of how to alleviate this problem?
It will be more efficient to use the H2O functions to load the data and split it.
data = h2o.importFile("dataset.csv")
y = 2 #Response is 2nd column, first is an index
x = 3:(ncol(data)) #Learn from all the other columns
data[,y] = as.factor(data[,y])
parts = h2o.splitFrame(data, 0.8) #Split 80/20
train = parts[[1]]
test = parts[[2]]
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(x=x, y=y, training_frame=train,activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
h2o.performance(mlModel, test)
It is hard to say what the problem with your original code is, without seeing the contents of dataset.csv and being able to try it. My guess is that train and test are not being split, and it is actually being trained on the test data.

How to loop an analysis in R while iteratively removing/replacing rows from the original dataset?

I have an excel csv file with mixed data that looks similar to the sample dataframe provided below.
Given the following sample data and analysis:
#Installing packages
library(cluster)
library(vegan)
size = c(5,300,500,4000,60000,2000)
diet = c('A','A','C','D','C','D')
area = c('Ae','Te','Fo','Ae','Te','Ae')
time = c('Di','No','Di','Cr','Ca','Ca')
distance = c(50,800,60,12000,150000,4200)
DF = data.frame(size,diet,area,time,distance)
row.names(DF) = c('Bird','Rat','Cobra','Dog','Human','Fish')
#Calculate Gower distance dissimilarity matrix for species in "DF"
DF.diss = daisy(DF, metric = "gower", type = list(logratio = c("size", "distance")))
attributes(DF.diss)
#Performing hierarchical cluster analysis on dissimilarity matrix
DF.Hclust = hclust(DF.diss, method = "average")
#Calculating metric for species community based on hclust tree
treeheight(DF.Hclust)
Starting with the all the rows as the example does, how would I go about rerunning the analysis while iteratively removing a row, rerunning the analysis, putting the row back, removing the next row, rerunning the analysis, and so on, until the analysis has been done once for every species removed/replaced.
I am interested in calculating the treeheight metric for the entire community while removing and replacing single species to gauge each of their contributions to overall treeheight.
Since my actual data set has well over 200 species it would be great if there was a way to do this in R without having to prepare over 200 separate csv files where I've removed single species and then running each through the provided analysis. Also is it possible to output each treeheight output/result to a table?
You can create a loop for this:
treeheights <- matrix(-9999, nrow(DF), 1) # make matrix to store answers.
# I set -9999 as standard value so I can check if everything went alright afterwards.
for ( i in 1:nrow(DF)) {
DF.LOO <- DF[-i,] # leave one (row) out
DF.diss.LOO <- daisy(DF.LOO, metric = "gower", type = list(logratio =
c("size", "distance")))
DF.HC.LOO <- hclust(DF.diss.LOO, method = "average")
treeheights[i,] <- treeheight(DF.HC.LOO)
}
This goes through all the rows and always leaves one row out. Hope this helps!

plot one of 500 trees in randomForest package

How can plot trees in output of randomForest function in same names packages in R? For example I use iris data and want to plot first tree in 500 output tress. my code is
model <-randomForest(Species~.,data=iris,ntree=500)
You can use the getTree() function in the randomForest package (official guide: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf)
On the iris dataset:
require(randomForest)
data(iris)
## we have a look at the k-th tree in the forest
k <- 10
getTree(randomForest(iris[, -5], iris[, 5], ntree = 10), k, labelVar = TRUE)
You may use cforest to plot like below, I have hardcoded the value to 5, you may change as per your requirement.
ntree <- 5
library("party")
cf <- cforest(Species~., data=iris,controls=cforest_control(ntree=ntree))
for(i in 1:ntree){
pt <- prettytree(cf#ensemble[[i]], names(cf#data#get("input")))
nt <- new("Random Forest BinaryTree")
nt#tree <- pt
nt#data <- cf#data
nt#responses <- cf#responses
pdf(file=paste0("filex",i,".pdf"))
plot(nt, type="simple")
dev.off()
}
cforest is another implementation of random forest, It can't be said which is better but in general there are few differences that we can see. The difference is that cforest uses conditional inferences where we put more weight to the terminal nodes in comparison to randomForest package where the implementation provides equal weights to terminal nodes.
In general cofrest uses weighted mean and randomForest uses normal average. You may want to check this .

Resources