shap plots for random forest models - r

I would like to get the Shap Contribution for variables for a Ranger/random forest model and have plots like this in R:
beeswarm plots
I have tried using the following libraries: DALEX, shapr, fastshap, shapper. I could only end up getting plots like this:
fastshap plot
Is it possible to get such plots? I have tried reticulate package and it still doesnt work.

Random forests need to grow many deep trees. While possible, crunching TreeSHAP for deep trees requires an awful lot of memory and CPU power. An alternative is to use the Kernel SHAP algorithm, which works for all kind of models.
library(ranger)
library(kernelshap)
library(shapviz)
set.seed(1)
fit <- ranger(Sepal.Length ~ ., data = iris,)
# Step 1: Calculate Kernel SHAP values
# bg_X is usually a small (50-200 rows) subset of the data
s <- kernelshap(fit, iris[-1], bg_X = iris)
# Step 2: Turn them into a shapviz object
sv <- shapviz(s)
# Step 3: Gain insights...
sv_importance(sv, kind = "bee")
sv_dependence(sv, v = "Petal.Length", color_var = "auto")
Disclaimer: I wrote "kernelshap" and "shapviz"

Related

How to implement shapper:shap for whole dataset?

I have created a Random Forest model using the randomForest package
model_rf <- randomForest(y~ . , data = data_train,ntree=1000, keep.forest=TRUE,importance=TRUE)
To calculate Shapley values for the different features based on this RF model, I first create an "explainer object" and then use the "shapper" package
exp_rf <- DALEX::explain(model_rf, data = data_test[,-1], y = data_test[,1])
ive_rf <- shap(exp_rf, new_observation = data_test[1,-1])
To my knowledge, I can only apply the "shap" function to one observation (the "new_observation").
But I am looking for a way to calculate the shapley values for all of my respondents in my datafile.
I know this is possible in the "SHAP" package in Python; but is it also possible with the "shapper" package in R?
At the moment, I created a loop to calculate the shapley values for all respondents, but this will take me days to calculate for my entire datafile.
for(i in c(1:nrow(data_test)))
{
ive_rf <- shap(exp_rf,new_observation=data_test[i,-1])
shapruns<-cbind(shapruns,ive_rf[,"_attribution_"])
}
Any help would be much appreciated.
I recently published two R packages that are optimized for this kind of tasks: "kernelshap" (calculate SHAP values fast) and "shapviz" (plot SHAP values from any source). In your case, a working example would be:
library(randomForest)
library(kernelshap)
library(shapviz)
set.seed(1)
fit <- randomForest(Sepal.Length ~ ., data = iris,)
# Step 1: Calculate Kernel SHAP values
# bg_X is usually a small (50-200 rows) subset of the data
s <- kernelshap(fit, iris[-1], bg_X = iris)
# Step 2: Turn them into a shapviz object
sv <- shapviz(s)
# Step 3: Gain insights...
sv_importance(sv, kind = "bee")
sv_dependence(sv, v = "Petal.Length", color_var = "auto")

Conditional simulation (with Kriging) in R with parallelization?

I am using gstat package in R to generate sequential gaussian simulations. My pc have 4 cores and I tried to parallelize the krige() function using the parallel package following the script provided by Guzmán to answer the question How to achieve parallel Kriging in R to speed up the process?.
The resulting simulations are, however, different from the ones using only one core at the time (no parallelization). It looks a geometry problem, but i can't find out how to fix it.
Next i will provide an example (using 4 cores) generating 2 simulations. You will see that after running the code, the simulated maps derived from parallelization show some artifacts (like vertical lines), and are different from the ones using only one core at the time.
The code needs the libraries gstat, sp, raster, parallel and spatstat. If any of the lines library() do not work, run install.packages() first.
library(gstat)
library(sp)
library(raster)
library(parallel)
library(spatstat)
# create a regular grid
nx=100 # number of columns
ny=100 # number of rows
srgr <- expand.grid(1:ny, nx:1)
names(srgr) <- c('x','y')
gridded(srgr)<-~x+y
# generate a spatial process (unconditional simulation)
g<-gstat(formula=z~x+y, locations=~x+y, dummy=T, beta=15, model=vgm(psill=3, range=10, nugget=0,model='Exp'), nmax=20)
sim <- predict(g, newdata=srgr, nsim=1)
r<-raster(sim)
# generate sample data (Poisson process)
int<-0.02
rpp<-rpoispp(int,win=owin(c(0,nx),c(0,ny)))
df<-as.data.frame(rpp)
coordinates(df)<-~x+y
# assign raster values to sample data
dfpp <-raster::extract(r,df,df=TRUE)
smp<-cbind(coordinates(df),dfpp)
smp<-smp[complete.cases(smp), ]
coordinates(smp)<-~x+y
# fit variogram to sample data
vs <- variogram(sim1~1, data=smp)
m <- fit.variogram(vs, vgm("Exp"))
plot(vs, model = m)
# generate 2 conditional simulations with one core processor
one <- krige(formula = sim1~1, locations = smp, newdata = srgr, model = m,nmax=12,nsim=2)
# plot simulation 1 and 2: statistics (min, max) are ok, simulations are also ok.
spplot(one["sim1"], main = "conditional simulation")
spplot(one["sim2"], main = "conditional simulation")
# generate 2 conditional with parallel processing
no_cores<-detectCores()
cl<-makeCluster(no_cores)
parts <- split(x = 1:length(srgr), f = 1:no_cores)
clusterExport(cl = cl, varlist = c("smp", "srgr", "parts","m"), envir = .GlobalEnv)
clusterEvalQ(cl = cl, expr = c(library('sp'), library('gstat')))
par <- parLapply(cl = cl, X = 1:no_cores, fun = function(x) krige(formula=sim1~1, locations=smp, model=m, newdata=srgr[parts[[x]],], nmax=12, nsim=2))
stopCluster(cl)
# merge all parts
mergep <- maptools::spRbind(par[[1]], par[[2]])
mergep <- maptools::spRbind(mergep, par[[3]])
mergep <- maptools::spRbind(mergep, par[[4]])
# create SpatialPixelsDataFrame from mergep
mergep <- SpatialPixelsDataFrame(points = mergep, data = mergep#data)
# plot mergep: statistics (min, max) are ok, but simulated maps show "vertical lines". i don't understand why.
spplot(mergep[1], main = "conditional simulation")
spplot(mergep[2], main = "conditional simulation")
I have tried your code and I think the problem lies with the way you split the work:
parts <- split(x = 1:length(srgr), f = 1:no_cores)
On my dual core machine that meant that all odd indices in srgr where handled by one process and all even indices where handled by the other process. This is probably the source of the vertical artifacts you are seeing.
A better way should be to split the data into consecutive chunks like this:
parts <- parallel::splitIndices(length(srgr), no_cores)
Using this splitting with the rest of your code I get results that look comparable to the sequential ones. At least to my untrained eyes ...
Original answer, which is only a minor effect. It still might make sense to fix the seed with set.seed for sequential and clusterSetRNGStream for parallel processing.
From what I have read about Kriging it requires you to draw random numbers. These random numbers will be different with parallel processing. See section 6 of the parallel vignette (vignette("parallel")) for more details.

plot one of 500 trees in randomForest package

How can plot trees in output of randomForest function in same names packages in R? For example I use iris data and want to plot first tree in 500 output tress. my code is
model <-randomForest(Species~.,data=iris,ntree=500)
You can use the getTree() function in the randomForest package (official guide: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf)
On the iris dataset:
require(randomForest)
data(iris)
## we have a look at the k-th tree in the forest
k <- 10
getTree(randomForest(iris[, -5], iris[, 5], ntree = 10), k, labelVar = TRUE)
You may use cforest to plot like below, I have hardcoded the value to 5, you may change as per your requirement.
ntree <- 5
library("party")
cf <- cforest(Species~., data=iris,controls=cforest_control(ntree=ntree))
for(i in 1:ntree){
pt <- prettytree(cf#ensemble[[i]], names(cf#data#get("input")))
nt <- new("Random Forest BinaryTree")
nt#tree <- pt
nt#data <- cf#data
nt#responses <- cf#responses
pdf(file=paste0("filex",i,".pdf"))
plot(nt, type="simple")
dev.off()
}
cforest is another implementation of random forest, It can't be said which is better but in general there are few differences that we can see. The difference is that cforest uses conditional inferences where we put more weight to the terminal nodes in comparison to randomForest package where the implementation provides equal weights to terminal nodes.
In general cofrest uses weighted mean and randomForest uses normal average. You may want to check this .

Cross-validation for kriging in R: how to include the trend while reestimating the variogram using xvalid?

I have a question very specific for the function xvalid (package geoR) in R which is used in spatial statistics only, so I hope it's not too specific for someone to be able to answer. In any case, suggestions for alternative functions/packages are welcome too.
I would like to compute a variogram, fit it, and then perform cross-validation. Function xvalid seems to work pretty nice to do the cross-validation. It works when I set reestimate=TRUE (so it reestimates the variogram for every point removed from the dataset in cross-validation) and it also works when using a trend. However, it does not seem to work when combining these two...
Here is an example using the classical Meuse dataset:
library(geoR)
library(sp)
data(meuse) # import data
coordinates(meuse) = ~x+y # make spatialpointsdataframe
meuse#proj4string <- CRS("+init=epsg:28992") # add projection
meuse_geo <- as.geodata(meuse) # create object of class geodata for geoR compatibility
meuse_geo$data <- meuse#data # attach all data (incl. covariates) to meuse_geo
meuse_vario <- variog(geodata=meuse_geo, data=meuse_geo$data$lead, trend= ~meuse_geo$data$elev) # variogram
meuse_vfit <- variofit(meuse_vario, nugget=0.1, fix.nugget=T) # fit
# cross-validation works fine:
xvalid(geodata=meuse_geo, data=meuse_geo$data$lead, model=meuse_vfit, variog.obj = meuse_vario, reestimate=F)
# cross-validation does not work when reestimate = T:
xvalid(geodata=meuse_geo, data=meuse_geo$data$lead, model=meuse_vfit, variog.obj = meuse_vario, reestimate=T)
The error I get is:
Error in variog(coords = cv.coords, data = cv.data, uvec = variog.obj$uvec, : coords and trend have incompatible sizes
It seems to remove the point from the dataset during cross-validation, but it doesn't seem to remove the point from the covariates/trend data. Any ideas on solving this or using a different package?
Thanks a lot in advance!

Plot in SVM model (e1071 Package) using DocumentTermMatrix

i trying do create a plot for my model create using SVM in e1071 package.
my code to build the model, predict and build confusion matrix is
ptm <- proc.time()
svm.classifier = svm(x = train.set.list[[0.999]][["0_0.1"]],
y = train.factor.list[[0.999]][["0_0.1"]],
kernel ="linear")
pred = predict(svm.classifier, test.set.list[[0.999]][["0_0.1"]], decision.values = TRUE)
time[["svm"]] = proc.time() - ptm
confmatrix = confusionMatrix(pred,test.factor.list[[0.999]][["0_0.1"]])
confmatrix
train.set.list and test.set.list contains the test and train set for several conditions. train and set factor has the true label for each set. Train.set and test.set are both documenttermmatrix.
Then i tried to see a plot of my data, i tried with
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]])
but i got the message:
"Error in plot.svm(svm.classifier, train.set.list[[0.999]][["0_0.1"]]) :
missing formula."
what i'm doing wrong? confusion matrix seems good to me even not using formula parameter in svm function
Without given code to run, it's hard to say exactly what the problem is. My guess, given
?plot.svm
which says
formula formula selecting the visualized two dimensions. Only needed if more than two input variables are used.
is that your data has more than two predictors. You should specify in your plot function:
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]], predictor1 ~ predictor2)

Resources