How to understand a boxplot comparing performance metrics of machine learning models? (R)

I have created several models and plotted their accuracy using a boxplot. What do the circles outside the boxes mean?
I used the caret package to build the models and the following code to make the picture.
model_list <- list(
  rf = rf_model,
  gbm = gbm_model,
  xgboost = xgboost_model,
  treebag = treebag_model
)
resamples <- resamples(model_list)      # collect resampling results across models
bwplot(resamples, metric = "Accuracy")  # lattice box-and-whisker plot
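For context, the circles are the standard box-and-whisker convention for outliers: individual resample results lying more than 1.5 times the interquartile range beyond the box. The raw per-resample values behind each box can be inspected directly; a minimal sketch using the resamples object from the question:
# Numeric summary of each model's resampled metrics
summary(resamples)
# Raw metric values, one row per resampling iteration and one column per
# model/metric pair; the circles in the plot correspond to extreme entries here
head(resamples$values)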

Related

Generate SHAP dependence plots

Is there a package that allows estimating SHAP values for multiple observations for models that are not XGBoost or decision-tree based? I created a neural network using caret and nnet. I want to produce a beeswarm plot and SHAP dependence plots to explore the relationship between certain variables in my model and the outcome. The only success I have had is using the DALEX package to estimate SHAP values, but DALEX only does this for single instances and cannot do a global analysis using SHAP values. Any insight or help would be appreciated!
I have tried different SHAP packages (fastshap, shapr), but these require decision-tree-based models. I tried creating an XGBoost model in caret, but it did not work well with the SHAP packages in R and I could not get the outcome I wanted.
I have invested a bit of time in pushing R forward in this regard:
shapviz plots SHAP values from any source, including XGBoost, LightGBM, H2O, kernelshap, and fastshap.
kernelshap calculates Kernel SHAP values for all models with numeric output, even multivariate output. This will be your friend when it comes to models outside the TreeSHAP comfort zone...
Put differently: kernelshap + shapviz = explain any model.
Here is an example using caret for linear regression, but nnet works identically (see the sketch after the code).
library(caret)
library(kernelshap)
library(shapviz)
fit <- train(
  Sepal.Length ~ .,
  data = iris,
  method = "lm",
  tuneGrid = data.frame(intercept = TRUE),
  trControl = trainControl(method = "none")
)
# Explain rows in `X` based on background data `bg_X` (50-200 rows, not the full training data!)
shap <- kernelshap(fit, X = iris[, -1], bg_X = iris)
sv <- shapviz(shap)
sv_importance(sv)                # bar plot of mean absolute SHAP values
sv_importance(sv, kind = "bee")  # beeswarm summary plot
sv_dependence(sv, "Species", color_var = "auto")  # dependence plot
# Single observations
sv_waterfall(sv, 1)
sv_force(sv, 1)
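And, as noted above, the same workflow with a neural network. A minimal sketch assuming method = "nnet" with made-up tuning values; linout = TRUE is required for a numeric response:
fit_nn <- train(
  Sepal.Length ~ .,
  data = iris,
  method = "nnet",
  tuneGrid = data.frame(size = 3, decay = 0.1),  # assumed values, not tuned
  trControl = trainControl(method = "none"),
  linout = TRUE,   # linear output unit for regression
  trace = FALSE
)
shap_nn <- kernelshap(fit_nn, X = iris[, -1], bg_X = iris)
sv_nn <- shapviz(shap_nn)
sv_importance(sv_nn, kind = "bee")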

Prediction Intervals for R tidymodels Stacked model from stacks()

Is it possible to calculate prediction intervals from a tidymodels stacked model?
Working through the example from the stacks() package here yields the stacked frog model (which can be downloaded here for reprex) and the testing data:
data("tree_frogs")
tree_frogs <- tree_frogs %>%
filter(!is.na(latency)) %>%
select(-c(clutch, hatched))
set.seed(1)
tree_frogs_split <- initial_split(tree_frogs)
tree_frogs_train <- training(tree_frogs_split)
tree_frogs_test <- testing(tree_frogs_split)
I tried to run something like this:
pi <- predict(tree_frogs_model_st, tree_frogs_test, type = "pred_int")
but this gives an error:
Error in UseMethod("stack_predict") : no applicable method for 'stack_predict' applied to an object of class "NULL"
Reading the documentation of stacks() I also tried passing "pred_int" in the opts list:
pi <- predict(tree_frogs_model_st, tree_frogs_test, opts = list(type = "pred_int"))
but this just gives: opts is only used with type = raw and was ignored.
For reference, I am trying to do something similar to what is done in Ch. 19 of the Tidy Modeling with R book:
lm_fit <- fit(lm_wflow, data = Chicago_train)
predict(lm_fit, Chicago_test, type = "pred_int")
which seems to work fine for a single model fit like lm_fit, but apparently not for a stacked model?
Am I missing something? Is it not possible to calculate prediction intervals for stacked models for some reason?
This is very difficult to do.
Even if glmnet produced a prediction interval, it would be a significant underestimate since it doesn’t know anything about the error in each of the ensemble members.
We would have to get the standard error of prediction from all of the models to compute it for the stacking model. A lot of these models don’t/can’t generate that standard error.
The alternative is to use bootstrapping to get the interval, but you would have to bootstrap each model a large number of times to get the overall prediction interval.
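A rough sketch of that bootstrap idea, assuming a hypothetical fit_stack() helper that refits the entire ensemble (blend plus members) on one resample; note this is computationally expensive:
library(rsample)
library(purrr)

set.seed(1)
boots <- bootstraps(tree_frogs_train, times = 200)

# Refit the whole ensemble on each bootstrap resample and predict the test
# set; fit_stack() is a hypothetical wrapper around the full stacks workflow
boot_preds <- map(boots$splits, function(split) {
  refit <- fit_stack(analysis(split))
  predict(refit, tree_frogs_test)$.pred
})
pred_mat <- do.call(cbind, boot_preds)

# Percentile-based 90% prediction interval for each test row
pi <- t(apply(pred_mat, 1, quantile, probs = c(0.05, 0.95)))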

using caret for survival analysis (random survival forest)

Is there a way to use caret for survival analysis? I really like how easy it is to use. I tried fitting a random survival forest using the party package, which is on caret's list of supported models.
This works:
library(survival)
library(caret)
library(party)
fitcforest <- cforest(Surv(futime, death) ~ sex + age, data = flchain,
                      controls = cforest_classical(ntree = 1000))
but using caret I get an error:
fitControl <- trainControl(method = "repeatedcv",  # 10-fold CV
                           number = 10,
                           repeats = 2)
cforestfit <- train(Surv(futime, death) ~ sex + age, data = flchain,
                    method = "cforest", trControl = fitControl)
I get this error:
Error: nrow(x) == length(y) is not TRUE
Is there a way to make these Surv objects work with caret?
Can I use other survival-analysis-oriented packages with caret?
Thanks.
Not yet. That is one of two major updates that should be coming soon (the other expands pre-processing).
Contact me offline if you are interested in helping the development and/or testing of those features.
Thanks,
Max
I have found no way to train survival models with caret. As an alternative, the mlr framework (1) has a set of survival learners (2); a minimal sketch follows the links below. I have found mlr to be extremely user-friendly and useful.
mlr: http://mlr-org.github.io/mlr-tutorial/release/html/
survival learners in mlr: http://mlr-org.github.io/mlr-tutorial/release/html/integrated_learners/index.html#survival-analysis-15
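For illustration, a minimal mlr sketch on the flchain data from the question; surv.coxph is one of mlr's integrated survival learners, and the variable selection is an assumption:
library(mlr)
library(survival)

# Keep the variables used in the question; mlr expects a logical event column
df <- flchain[, c("futime", "death", "sex", "age")]
df$death <- as.logical(df$death)

task <- makeSurvTask(data = df, target = c("futime", "death"))
lrn <- makeLearner("surv.coxph")
mod <- train(lrn, task)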
There is an increasing number of packages in R that model survival data; examples:
For lasso and elastic nets: BioSpear.
For random forests: randomForestSRC (a quick sketch below).
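A quick randomForestSRC sketch on the same flchain data (default settings assumed):
library(survival)
library(randomForestSRC)

# Random survival forest with the same formula as the question
rsf <- rfsrc(Surv(futime, death) ~ sex + age, data = flchain, ntree = 1000)
print(rsf)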
Best, Loic

Forest plot from logistf

I have fit a few penalized logistic regression models in R using the logistf package. I now wish to plot forest plots for the results.
The sjPlot package (http://www.strengejacke.de/sjPlot/custplot/) provides an excellent function for glm output, but no function for logistf objects.
Any assistance?
The logistf objects differ in their structure compared to glm objects, but not too much. I've added support for logistf-fitted models; however, 1) model summaries can't be printed and 2) predicted probability plots are currently not supported with logistf models.
I'll update the code on GitHub tonight, so you can try the updated sjp.glm function...
library(sjPlot)
library(logistf)

data(sex2)
fit <- logistf(case ~ age + oc + vic + vicl + vis + dia, data = sex2)

# for this example, axis limits need to be specified manually
sjp.glm(fit, axisLimits = c(0.05, 25), transformTicks = TRUE)

Issues when using randomForest in caret with ROC as optimization metric

I'm having an issue when constructing random forest models using caret. I have a dataset of about 46k rows and 10 columns (one of which is the optimization target). From this dataset, I'm trying to compare different classifiers. I did the following:
ctrl <- trainControl(method = "boot",
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# GLM model:
model.glm <- train(x = d[, 2:10], y = d$CONV_BT,
                   method = "glm", trControl = ctrl,
                   metric = "ROC", family = "binomial")

# Random forest model:
model.rf <- train(x = d[, 2:10], y = d$CONV_BT,
                  method = "rf", trControl = ctrl, metric = "ROC")

# Naive Bayes model:
model.nb <- train(x = d[, 2:10], y = d$CONV_BT,
                  method = "nb", trControl = ctrl, metric = "ROC")
Then model.glm and model.nb both look pretty decent: across the 25 bootstrap replications, each has an ROC of around 0.7. However, something appears to be wrong with model.rf, because its reported ROC scores are all around 0.3. That suggests something is specified incorrectly, because I could just flip the random forest's predictions from p to 1 - p and the ROC would then be 0.7, right?
I'm sorry that I can't provide the data (it is large and proprietary). The other bizarre thing is that when I simulate data, I no longer have this issue. Any idea what this could be? Thanks for your help!
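One thing worth checking (an assumption, since the data isn't available): caret's twoClassSummary treats the first factor level of the outcome as the event of interest, so a reversed level order can produce exactly this kind of flipped ROC. A minimal check, where the level name "yes" is hypothetical:
# twoClassSummary computes ROC with the FIRST level as the event of interest
levels(d$CONV_BT)

# If the positive class is not first, reorder the levels
d$CONV_BT <- relevel(d$CONV_BT, ref = "yes")  # "yes" is a hypothetical level name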
