I would like to use the fastshap package to obtain SHAP value plots for every category of my outcome in a multiclass classification problem using a random forest classifier. I could only find chunks of code here and there, but no explanation of how to proceed from the beginning to obtain the SHAP values in this case. Here is the code I have so far (my y has 5 classes; here I am trying to obtain SHAP values for class 3):
library(randomForest)
library(fastshap)
set.seed(42)
sample <- sample.int(n = nrow(ITA), size = floor(0.75 * nrow(ITA)), replace = FALSE)
train <- ITA[sample, ]
test <- ITA[-sample, ]
set.seed(42)
rftrain <- randomForest(y ~ ., data = train, ntree = 500, importance = TRUE)
p_function_3 <- function(object, newdata)
  caret::predict.train(object,
                       newdata = newdata,
                       type = "prob")[, 3]
shap_values_G <- fastshap::explain(rftrain,
                                   X = train,
                                   pred_wrapper = p_function_3,
                                   nsim = 50,
                                   newdata = train[which(y == 3), ])
Now, I took the code largely from an example I found online and tried to adapt it (I am not an expert R user), but it does not work. Can you please help me correct it? Thanks!
Here is a working example (with a different dataset), but the logic is the same. Two things go wrong in your code: the prediction wrapper calls caret::predict.train() although rftrain was fitted with randomForest() directly rather than through caret, and train[which(y == 3), ] looks for y in the global environment instead of using train$y. Passing feature_names also keeps the response column out of the explanation.
library(randomForest)
library(fastshap)
set.seed(42)
ix <- sample(nrow(iris), 0.75 * nrow(iris))
train <- iris[ix, ]
test <- iris[-ix, ]
xvars <- c("Sepal.Width", "Sepal.Length")
yvar <- "Species"
fit <- randomForest(reformulate(xvars, yvar), data = train, ntree = 500)
pred_3 <- function(model, newdata) {
  predict(model, newdata = newdata, type = "prob")[, "virginica"]
}
shap_values_3 <- fastshap::explain(
  fit,
  X = train,               # Reference data
  feature_names = xvars,
  pred_wrapper = pred_3,
  nsim = 50,
  newdata = train[train$Species == "virginica", ]  # For these rows, explanations are calculated
)
head(shap_values_3)
# Sepal.Width Sepal.Length
# <dbl> <dbl>
# 1 0.101 0.381
# 2 0.159 -0.0109
# 3 0.0736 -0.0285
# 4 0.0564 0.161
# 5 0.0649 0.594
# 6 0.232 0.0305
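Since you also asked about plots: the SHAP values come back as a plain matrix-like object, so a simple importance plot can be built with base R alone. This is a minimal sketch that avoids depending on any particular fastshap plotting API (recent versions may also offer plotting helpers, but the sketch below does not rely on them):
# Mean absolute SHAP value per feature for the "virginica" class
imp <- sort(colMeans(abs(as.data.frame(shap_values_3))), decreasing = TRUE)
barplot(rev(imp), horiz = TRUE, las = 1,
        xlab = "mean(|SHAP value|)",
        main = "SHAP feature importance (virginica)")
Repeating this with a different class column in the prediction wrapper gives you one plot per outcome category.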
Context
I have a custom function myfun1 that fits a Cox model. Before fitting the model, I need to do a bit of processing on the data used to fit it: specifically, run the two lines dd = datadist(data) and options(datadist = 'dd').
If dd exists only in the function's local environment, myfun1 reports an error.
But when I assign dd to the global environment instead (myfun2 below), it works fine.
Question
Why does this happen?
How can I get myfun1 to run properly while keeping dd inside the function?
Reproducible code
library(survival)
library(rms)
data(cancer)
myfun1 <- function(data, x){
  x = rlang::sym(x)
  dd = datadist(data)
  options(datadist = 'dd')
  fit = rlang::inject(cph(Surv(time, status) ~ rcs(!!x), data = data))
  fit
}
myfun1(data = lung, x = 'meal.cal')
# Error in Design(data, formula, specials = c("strat", "strata")) :
# dataset dd not found for options(datadist=)
myfun2 <- function(data, x){
  x = rlang::sym(x)
  dd <<- datadist(data) # Changed here compared to myfun1
  options(datadist = 'dd')
  fit = rlang::inject(cph(Surv(time, status) ~ rcs(!!x), data = data))
  fit
}
myfun2(data = lung, x = 'meal.cal')
# Frequencies of Missing Values Due to Each Variable
# Surv(time, status)           meal.cal
#                  0                 47
#
# Cox Proportional Hazards Model
#
# cph(formula = Surv(time, status) ~ rcs(meal.cal), data = data)
#
#
#                     Model Tests      Discrimination
#                                         Indexes
# Obs       181    LR chi2      0.72    R2       0.004
# Events    134    d.f.            4    R2(4,181)0.000
# Center -0.3714   Pr(> chi2) 0.9485    R2(4,134)0.000
#                  Score chi2   0.76    Dxy      0.048
#                  Pr(> chi2) 0.9443
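For completeness, one workaround worth trying (a sketch, not necessarily the idiomatic rms solution): since the error shows that cph() resolves the name stored in options(datadist=) outside the function's local environment, you can place the object in the global environment only for the duration of the call and remove it on exit, so nothing leaks out permanently:
myfun3 <- function(data, x){
  x = rlang::sym(x)
  # Put the datadist object where rms can find it by name, but only
  # for the duration of this call; restore options and clean up on exit
  assign("..dd_tmp", datadist(data), envir = .GlobalEnv)
  old <- options(datadist = "..dd_tmp")
  on.exit({
    options(old)
    rm("..dd_tmp", envir = .GlobalEnv)
  }, add = TRUE)
  rlang::inject(cph(Surv(time, status) ~ rcs(!!x), data = data))
}
myfun3(data = lung, x = 'meal.cal')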
I'm using predict.lm(fit, newdata = newdata, interval = "prediction") to get predictions and their prediction intervals (PIs) for new observations. Now I would like to aggregate (sum and mean) these predictions and their PIs based on an additional variable (i.e., a spatial aggregation at the zip-code level of predictions for single households).
I learned from StackExchange that you cannot aggregate the prediction intervals of single predictions just by aggregating the limits of those intervals. The post is very helpful for understanding why this can't be done, but I have a hard time translating it into actual code.
Here's a reproducible example:
library(dplyr)
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit regression model
fit1 <- lm(Petal.Width ~ Petal.Length, data=train)
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
#Predict Pedal.Width for new data incl prediction intervals for each prediction
predictions1<-predict(fit1, newdata=pred, interval="prediction")
predictions2<-predict(fit2, newdata=pred, interval="prediction")
# Aggregate data by summing predictions per species
# NOT correct for the prediction intervals
predictions_agg1 <- data.frame(predictions1, Species = pred$Species) %>%
  group_by(Species) %>%
  summarise_all(list(sum = sum, mean = mean))
predictions_agg2 <- data.frame(predictions2, Species = pred$Species) %>%
  group_by(Species) %>%
  summarise_all(list(sum = sum, mean = mean))
I couldn't find a good tutorial or package that describes how to properly aggregate predictions and their PIs in R when using predict.lm(). Is there something out there? I would highly appreciate it if you could point me in the right direction on how to do this in R.
Your question is closely related to a thread I answered 2 years ago: linear model with `lm`: how to get prediction variance of sum of predicted values. It provides an R implementation of Glen_b's answer on Cross Validated. Thanks for quoting that Cross Validated thread; I didn't know it; perhaps I can leave a comment there linking the Stack Overflow thread.
I have polished my original answer, wrapping up line-by-line code cleanly into easy-to-use functions lm_predict and agg_pred. Solving your question is then simplified to applying those functions by group.
Consider the iris example in your question, and the second model fit2 for demonstration.
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
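lm_predict and agg_pred are defined in the linked Q&A rather than reproduced here. So that this answer reads self-contained, below are simplified sketches of my own that follow the same interface (the linked versions are more careful about input checking and numerical efficiency):
# Sketch of lm_predict: predictions plus their (full) covariance matrix
lm_predict <- function(newdata, lmObject, diag = TRUE) {
  Xp <- model.matrix(delete.response(terms(lmObject)), newdata)
  fit <- drop(Xp %*% coef(lmObject))
  V <- Xp %*% vcov(lmObject) %*% t(Xp)  # covariance of the fitted means
  if (diag) V <- diag(V)                # keep only point-wise variances
  list(fit = fit, var.fit = V,
       df = df.residual(lmObject), residual.var = sigma(lmObject)^2)
}

# Sketch of agg_pred: mean, variance, CI and PI of the weighted sum w'y
agg_pred <- function(w, predObject, alpha = 0.95) {
  m  <- sum(w * predObject$fit)                       # aggregated mean
  v  <- drop(crossprod(w, predObject$var.fit %*% w))  # variance of the mean part
  vp <- v + sum(w^2) * predObject$residual.var        # plus residual noise, for the PI
  q  <- qt(1 - (1 - alpha) / 2, predObject$df)
  list(mean = m, var = v,
       CI = c(lower = m - q * sqrt(v),  upper = m + q * sqrt(v)),
       PI = c(lower = m - q * sqrt(vp), upper = m + q * sqrt(vp)))
}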
We split pred by group Species, then apply lm_predict (with diag = FALSE) on all sub data frames.
oo <- lapply(split(pred, pred$Species), lm_predict, lmObject = fit2, diag = FALSE)
To use agg_pred we need to specify a weight vector whose length equals the number of data points. We can determine this by consulting the length of fit in each oo[[i]]:
n <- lengths(lapply(oo, "[[", 1))
#setosa versicolor virginica
# 11 13 14
If the aggregation operation is a sum, we do
w <- lapply(n, rep.int, x = 1)
str(w)
#List of 3
# $ setosa    : num [1:11] 1 1 1 1 1 1 1 1 1 1 ...
# $ versicolor: num [1:13] 1 1 1 1 1 1 1 1 1 1 ...
# $ virginica : num [1:14] 1 1 1 1 1 1 1 1 1 1 ...
SUM <- Map(agg_pred, w, oo)
SUM[[1]] ## result for the first group, for example
#$mean
#[1] 2.499728
#
#$var
#[1] 0.1271554
#
#$CI
# lower upper
#1.792908 3.206549
#
#$PI
# lower upper
#0.999764 3.999693
sapply(SUM, "[[", "CI") ## some nice presentation for CI, for example
# setosa versicolor virginica
#lower 1.792908 16.41526 26.55839
#upper 3.206549 17.63953 28.10812
If the aggregation operation is an average, we rescale w by n and call agg_pred again.
w <- mapply("/", w, n)
#List of 3
# $ setosa : num [1:11] 0.0909 0.0909 0.0909 0.0909 0.0909 ...
# $ versicolor: num [1:13] 0.0769 0.0769 0.0769 0.0769 0.0769 ...
# $ virginica : num [1:14] 0.0714 0.0714 0.0714 0.0714 0.0714 ...
AVE <- Map(agg_pred, w, oo)
AVE[[2]] ## result for the second group, for example
#$mean
#[1] 1.3098
#
#$var
#[1] 0.0005643196
#
#$CI
# lower upper
#1.262712 1.356887
#
#$PI
# lower upper
#1.189562 1.430037
sapply(AVE, "[[", "PI") ## some nice presentation for PI, for example
# setosa versicolor virginica
#lower 0.09088764 1.189562 1.832255
#upper 0.36360845 1.430037 2.072496
This is great, thank you so much! There is one thing I forgot to mention: in my actual application I need to sum ~300,000 predictions, which would create a full variance-covariance matrix of roughly 700 GB. Do you have any idea whether there is a computationally more efficient way to get directly at the sum over the variance-covariance matrix?
Use the fast_agg_pred function provided in the revision of the original Q&A. Let's start over from scratch.
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
## list of new data
newdatlist <- split(pred, pred$Species)
n <- sapply(newdatlist, nrow)
#setosa versicolor virginica
# 11 13 14
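fast_agg_pred itself is defined in the revised Q&A; as a sketch of the idea (my own simplified version, matching the call signature used below), the trick is to collapse the model matrix with the weights first, so only a length-p vector is ever formed instead of an n-by-n covariance matrix:
fast_agg_pred <- function(w, newdata, lmObject, alpha = 0.95) {
  Xp <- model.matrix(delete.response(terms(lmObject)), newdata)
  z  <- drop(crossprod(Xp, w))                    # t(Xp) %*% w, a length-p vector
  m  <- sum(z * coef(lmObject))                   # aggregated mean
  v  <- drop(crossprod(z, vcov(lmObject) %*% z))  # w' V w without ever forming V
  vp <- v + sum(w^2) * sigma(lmObject)^2          # add residual variance for the PI
  q  <- qt(1 - (1 - alpha) / 2, df.residual(lmObject))
  list(mean = m, var = v,
       CI = c(lower = m - q * sqrt(v),  upper = m + q * sqrt(v)),
       PI = c(lower = m - q * sqrt(vp), upper = m + q * sqrt(vp)))
}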
If the aggregation operation is a sum, we do
w <- lapply(n, rep.int, x = 1)
SUM <- mapply(fast_agg_pred, w, newdatlist,
MoreArgs = list(lmObject = fit2, alpha = 0.95),
SIMPLIFY = FALSE)
If the aggregation operation is an average, we do
w <- mapply("/", w, n)
AVE <- mapply(fast_agg_pred, w, newdatlist,
MoreArgs = list(lmObject = fit2, alpha = 0.95),
SIMPLIFY = FALSE)
Note that we can't use Map in this case, as we need to pass additional fixed arguments to fast_agg_pred. Use mapply in this situation, with MoreArgs and SIMPLIFY = FALSE.
I am trying to find the model with the lowest AIC. Models are returned from two for loops that build all possible combinations of columns. I am unable to make the function return the model with the lowest AIC. The code below demonstrates where I got stuck:
rm(list = ls())
data <- iris
data <- data[data$Species %in% c("setosa", "virginica"), ]
data$Species = ifelse(data$Species == 'virginica', 0, 1)
mod_headers <- names(data[1:ncol(data) - 1])
f <- function(mod_headers){
  for(i in 1:length(mod_headers)){
    tab <- combn(mod_headers, i)
    for(j in 1:ncol(tab)){
      tab_new <- c(tab[, j])
      mod_tab_new <- c(tab_new, "Species")
      model <- glm(Species ~ ., data = data[c(mod_tab_new)], family = binomial(link = "logit"))
    }
  }
  best_model <- model[which(AIC(model)[order(AIC(model))][1])]
  print(best_model)
}
f(mod_headers)
Any suggestions? Thanks!
I replaced your for loops with vectorised alternatives.
library(tidyverse)
library(iterators)
# Column names you want to use in glm model, saved as list
whichcols <- Reduce("c", map(1:length(mod_headers), ~lapply(iter(combn(mod_headers,.x), by="col"),function(y) c(y))))
# glm model results using selected column names, saved as list
models <- map(1:length(whichcols), ~glm(Species ~., data=data[c(whichcols[[.x]], "Species")], family = binomial(link = "logit")))
# selects model with lowest AIC
best <- models[[which.min(sapply(1:length(models),function(x)AIC(models[[x]])))]]
Output
Call: glm(formula = Species ~ ., family = binomial(link = "logit"),
data = data[c(whichcols[[.x]], "Species")])
Coefficients:
(Intercept) Petal.Length
55.40 -17.17
Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
Null Deviance: 138.6
Residual Deviance: 1.208e-09 AIC: 4
Using your loop, just put all the models in one list.
Then compute the AIC of all these models.
Finally return the model with the minimum AIC.
f <- function(mod_headers) {
models <- list()
k <- 1
for (i in 1:length(mod_headers)) {
tab <- combn(mod_headers, i)
for(j in 1:ncol(tab)) {
mod_tab_new <- c(tab[, j], "Species")
models[[k]] <- glm(Species ~ ., data = data[mod_tab_new],
family = binomial(link = "logit"))
k <- k + 1
}
}
models[[which.min(sapply(models, AIC))]]
}
glm() uses an iteratively reweighted least squares (IRLS) algorithm. Here it reaches the maximum number of iterations before converging (your two classes are perfectly separable by petal length, as the near-zero residual deviance above shows); raising this limit helps in your case:
glm(Species ~., data=data[mod_tab_new], family = binomial(link = "logit"), control = list(maxit = 50))
There was another issue with the use of which(); I replaced it with an if after each model fit that keeps track of the lowest AIC so far. However, I think there are better solutions than this for-loop approach.
f <- function(mod_headers){
lowest_aic <- Inf # added
best_model <- NULL # added
for(i in 1:length(mod_headers)){
tab <- combn(mod_headers,i)
for(j in 1:ncol(tab)){
tab_new <- tab[, j]
mod_tab_new <- c(tab_new, "Species")
model <- glm(Species ~., data=data[mod_tab_new], family = binomial(link = "logit"), control = list(maxit = 50))
if(AIC(model) < lowest_aic){ # added
lowest_aic <- AIC(model) # added
best_model <- model # added
}
}
}
return(best_model)
}
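Usage is unchanged, for example:
best <- f(mod_headers)
AIC(best)  # the lowest AIC found across all column combinations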
I want to calculate the Brier score and the integrated Brier score for my analysis using the ranger R package.
As an example, I use the veteran data from the survival package as follows:
install.packages("ranger")
library(ranger)
install.packages("survival")
library(survival)
#load veteran data
data(veteran)
data <- veteran
# training and test data
n <- nrow(data)
testind <- sample(1:n,n*0.7)
trainind <- (1:n)[-testind]
#train ranger
rg <- ranger(Surv(time, status) ~ ., data = data[trainind,])
# use rg to predict test data
pred <- predict(rg,data=data[testind,],num.trees=rg$num.trees)
#cumulative hazard function for each sample
pred$chf
#survival probability for each sample
pred$survival
How can I calculate Brier score and integrated Brier score?
The integrated Brier score (IBS) can be calculated using the pec function of the pec package, but you need to define a predictSurvProb method that extracts survival probability predictions from the ranger modelling approach (see ?pec:::predictSurvProb for a list of models supported out of the box).
A possible solution is:
predictSurvProb.ranger <- function(object, newdata, times, ...) {
  ptemp <- ranger:::predict.ranger(object, data = newdata, importance = "none")$survival
  pos <- prodlim::sindex(jump.times = object$unique.death.times,
                         eval.times = times)
  p <- cbind(1, ptemp)[, pos + 1, drop = FALSE]
  if (NROW(p) != NROW(newdata) || NCOL(p) != length(times))
    stop(paste("\nPrediction matrix has wrong dimensions:\nRequested newdata x times: ",
               NROW(newdata), " x ", length(times), "\nProvided prediction matrix: ",
               NROW(p), " x ", NCOL(p), "\n\n", sep = ""))
  p
}
This function can be used as follows:
library(ranger)
library(survival)
data(veteran)
dts <- veteran
n <- nrow(dts)
set.seed(1)
testind <- sample(1:n,n*0.7)
trainind <- (1:n)[-testind]
rg <- ranger(Surv(time, status) ~ ., data = dts[trainind,])
# A formula to be inputted into the pec command
frm <- as.formula(paste("Surv(time, status)~",
                        paste(rg$forest$independent.variable.names, collapse = "+")))
library(pec)
# Using pec for IBS estimation
PredError <- pec(object = rg,
                 formula = frm, cens.model = "marginal",
                 data = dts[testind, ], verbose = FALSE, maxtime = 200)
The IBS can be displayed using the print.pec command, specifying in times the time points at which to show it:
print(PredError, times=seq(10,200,50))
# ...
# Integrated Brier score (crps):
#
# IBS[0;time=10) IBS[0;time=60) IBS[0;time=110) IBS[0;time=160)
# Reference 0.043 0.183 0.212 0.209
# ranger 0.041 0.144 0.166 0.176
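If you want the Brier score as a curve over time rather than a printed table, the pec object can also be plotted directly, and crps() returns the integrated scores as a matrix (a short sketch; argument details may vary slightly across pec versions):
# Prediction error (Brier score) curves for the reference model and ranger
plot(PredError, xlim = c(0, 200))
# Integrated Brier score up to the requested time points
crps(PredError, times = seq(10, 200, 50))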
Is there a way to combine multiple predictions from different models in mlr into a single average prediction so that it can be used to calculate performance measures etc.?
library(mlr)
data(iris)
iris2 <- iris
iris2$Species <- ifelse(iris$Species=="setosa", "ja", "nein")
task = makeClassifTask(data = iris2, target = "Species")
lrn = makeLearner("classif.h2o.deeplearning", predict.type="prob")
model1 = train(lrn, task)
model2 = train(lrn, task)
pred1 = predict(model1, newdata=iris2)
pred2 = predict(model2, newdata=iris2)
performance(pred1, measures = auc)
g = generateThreshVsPerfData(pred1)
plotThreshVsPerf(g)
A workaround that shows what I mean might be:
pred_avg = pred1
pred_avg$data[,c("prob.ja","prob.nein")] = (pred1$data[,c("prob.ja","prob.nein")] +
pred2$data[,c("prob.ja","prob.nein")])/2
performance(pred_avg, measures = auc)
g_avg = generateThreshVsPerfData(pred_avg)
plotThreshVsPerf(g_avg)
Is there a way to do this without a workaround and could this workaround have any unwanted side effects?
It sounds like you are looking for a stacked learner, which is mlr's method of building ensembles.
From the docs:
# Regression
data(BostonHousing, package = "mlbench")
tsk = makeRegrTask(data = BostonHousing, target = "medv")
base = c("regr.rpart", "regr.svm")
lrns = lapply(base, makeLearner)
m = makeStackedLearner(base.learners = lrns,
                       predict.type = "response", method = "average")
tmp = train(m, tsk)
res = predict(tmp, tsk)
# Prediction: 506 observations
# predict.type: response
# threshold:
# time: 0.02
# id truth response
# 1 1 24.0 27.33742
# 2 2 21.6 22.08853
# 3 3 34.7 33.52007
# 4 4 33.4 32.49923
# 5 5 36.2 32.67973
# 6 6 28.7 22.99323
# ... (506 rows, 3 cols)
performance(res, rmse)
# rmse
# 3.138981
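For your binary classification case, the same approach should carry over. Here is a sketch under the assumption that makeStackedLearner accepts probabilistic base learners with method = "average" (averaging their predicted probabilities, i.e. what your workaround does manually, but inside mlr):
# Classification analogue of the regression example above
data(iris)
iris2 <- iris
iris2$Species <- factor(ifelse(iris$Species == "setosa", "ja", "nein"))
task = makeClassifTask(data = iris2, target = "Species")
base = c("classif.rpart", "classif.randomForest")
lrns = lapply(base, makeLearner, predict.type = "prob")
m = makeStackedLearner(base.learners = lrns,
                       predict.type = "prob", method = "average")
tmp = train(m, task)
res = predict(tmp, task)
performance(res, auc)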