Predicted probabilities in R ranger package

I am trying to build a random forest classification model in R (adapting code by Ned Horning). I first used the randomForest package but then found ranger, which promises faster computation.
At first, I used the code below to get predicted probabilities for each class after fitting the model with randomForest:
predProbs <- as.data.frame(predict(randfor, imageBlock, type='prob'))
The probabilities here are vote fractions: with 500 trees in the model, if 250 of them say the observation is class 1, the predicted probability of class 1 is 250/500 = 50%.
In ranger, I realized that there is no type = 'prob' option.
I searched and tried some adjustments but made no progress. I need an object containing the class probabilities described above, produced with the ranger package.
Could anyone give some advice about the issue?

You need to train a "probabilistic classifier"-type ranger object:
library("ranger")
iris.ranger = ranger(Species ~ ., data = iris, probability = TRUE)
When passed to predict.ranger, this object yields an (n_samples, n_classes) matrix of class probabilities:
probabilities = predict(iris.ranger, data = iris)$predictions
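A minimal sketch of what the result contains (using the iris.ranger object from above): one row per observation, one column per class level, with rows summing to 1.
head(probabilities)                              # columns: setosa, versicolor, virginica
rowSums(head(probabilities))                     # each row sums to 1
colnames(probabilities)[max.col(probabilities)]  # most probable class per observation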

Related

Obtaining predictions from a pooled imputation model

I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps I have developed so far, using a fictitious example based on the pima data from the faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
Apply multiple imputation by chained equations using the mice package. For the sake of the example, I first randomly assign missing values to the pima dataset using the ampute function from the same package. Twenty imputed datasets are generated by setting the "m" argument to 20.
#-------------------assign missing values to data-----------------#
result <- ampute(pima)
result <- result$amp
#-------------------multiple imputation by chained equations--------#
# generate 20 imputed datasets
newresult <- mice(result, m = 20)
Run a logistic regression on each of the 20 imputed datasets. Inspecting convergence and comparing the original and imputed data distributions are skipped for the sake of the example. The "test" variable is the binary dependent variable.
# run a logistic regression on each of the 20 imputed datasets
model <- with(newresult,
              glm(test ~ pregnant + glucose + diastolic + triceps + age + bmi,
                  family = binomial(link = "logit")))
Combine the regression estimates from the 20 imputed datasets into a single pooled imputation model.
#pooled regressions
summary(pool(model))
Generate predictions from the pooled imputation model using the prediction function from the margins package. This function can generate predicted values with a covariate fixed at a specific level (for factors) or value (for continuous variables). In this example, I choose to generate new predicted probabilities, i.e. P(Y=1), while setting the pregnant variable (number of pregnancies) to 3. In other words, it would give me the distribution of the outcome in the counterfactual situation where every observation is set to 3 for this variable. Normally, I would just pass my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model from mice, the object is of class mipo rather than glm.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
Unrecognized variable name in 'at': (1) pregnant
I thought of two solutions:
a) changing the class of the object to fit prediction()'s requirements
b) extracting the pooled regression parameters and reconstructing them in a list that fits prediction()'s requirements
However, I'm not sure how to achieve either, and I would appreciate any advice that could get me closer to obtaining predictions from a pooled imputation model in R.
You might be interested to know that the pima dataset is a bit problematic (the Native Americans from whom the data were collected no longer want it used for research ...)
In addition to @Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
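Since the question asks for predicted probabilities P(Y=1), a hedged variant of the same call: emmeans can back-transform from the logit scale with type = "response" (assuming the mira support mentioned above).
emmeans(model, ~pregnant, at = list(pregnant = 3), type = "response")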
marginaleffects works in a different way. (Warning: I haven't really checked the results to make sure they make sense ...)
library(marginaleffects)
fit_reg <- function(dat) {
    mod <- glm(test ~ pregnant + glucose + diastolic +
                   triceps + age + bmi,
               data = dat, family = binomial)
    out <- predictions(mod, newdata = datagrid(pregnant = 3))
    return(out)
}
dat_mice <- mice(pima, m = 20, printFlag = FALSE, .Random.seed = 1024)
dat_mice <- complete(dat_mice, "all")        # list of 20 completed datasets
mod_imputation <- lapply(dat_mice, fit_reg)  # predictions at pregnant = 3 on each
mod_imputation <- pool(mod_imputation)       # pool across the imputations
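A small follow-up, assuming mice::pool() accepts the list of predictions objects (it pools objects that provide tidy() methods, which marginaleffects predictions do): the pooled result can then be inspected as usual with
summary(mod_imputation)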

Generate SHAP dependence plots

Is there a package that can estimate SHAP values for multiple observations for models that are not XGBoost or decision-tree based? I created a neural network using caret and nnet. I want to produce a beeswarm plot and SHAP dependence plots to explore the relationship between certain variables in my model and the outcome. The only success I have had is with the DALEX package, but DALEX only estimates SHAP values for single instances and cannot do a global analysis using SHAP values. Any insight or help would be appreciated!
I have tried different SHAP packages (fastshap, shapr), but these require decision-tree-based models. I tried creating an XGBoost model in caret, but it did not work well with the SHAP packages in R and I could not get the output I wanted.
I have invested a bit of time to push R forward in this regard:
shapviz plots SHAP values from any source, including XGBoost, LightGBM, H2O, kernelshap, and fastshap.
kernelshap calculates Kernel SHAP values for any model with numeric output, even multivariate output. It will be your friend for models outside the TreeSHAP comfort zone...
Put differently: kernelshap + shapviz = explain any model.
Here is an example using caret for linear regression; nnet works identically.
library(caret)
library(kernelshap)
library(shapviz)
fit <- train(
    Sepal.Length ~ .,
    data = iris,
    method = "lm",
    tuneGrid = data.frame(intercept = TRUE),
    trControl = trainControl(method = "none")
)
# Explain rows in `X` based on background data `bg_X` (50-200 rows, not the full training data!)
shap <- kernelshap(fit, X = iris[, -1], bg_X = iris)
sv <- shapviz(shap)
sv_importance(sv)
sv_importance(sv, kind = "bee")
sv_dependence(sv, "Species", color_var = "auto")
# Single observations
sv_waterfall(sv, 1)
sv_force(sv, 1)
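Since the question is about a caret/nnet classifier, here is a hedged sketch (untested, not from the original answer) of the same workflow for a probabilistic classifier: kernelshap's pred_fun argument is used to return class probabilities, and shapviz then holds one set of SHAP values per class.
fit_nn <- train(
    Species ~ .,
    data = iris,
    method = "nnet",
    trace = FALSE,
    tuneGrid = data.frame(size = 4, decay = 0.01),
    trControl = trainControl(method = "none")
)
shap_nn <- kernelshap(
    fit_nn,
    X = iris[, -5],
    bg_X = iris,
    pred_fun = function(m, X) predict(m, X, type = "prob")
)
sv_nn <- shapviz(shap_nn)
sv_importance(sv_nn, kind = "bee")  # one beeswarm per class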

Variable importance in Caret

I am using the caret package in R to train a logistic regression model for a binary classification problem. I have been able to get the results, accuracy, etc., but I also want the importance of the variables (in decreasing order of importance). I used the varImp() function, but according to the documentation, the importance depends on the class:
"For most classification models, each predictor will have a separate variable importance for each class (the exceptions are classification trees, bagged trees and boosted trees)."
How can I get the variable importance for each class?
Thank you
For the first part, have you tried:
round(importance(myModel$finalModel), 2)
For putting that in decreasing order:
imp <- round(importance(myModel$finalModel), 2)
dfimp <- data.frame(feature = rownames(imp), MeanDecreaseGini = as.numeric(imp))
dfimp[order(dfimp$MeanDecreaseGini, decreasing = TRUE), ]
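If the goal is one importance value per class, a hedged sketch (assuming the caret train object is called myModel): varImp() with useModel = FALSE computes model-free, filter-based importance, which for classification problems is reported per class (area under the ROC curve for each predictor).
varImp(myModel)                    # model-based importance
varImp(myModel, useModel = FALSE)  # filter-based, one column per class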

computing concordance index with ranger (R package)

I'm trying to use predictions from a random survival forest computed using ranger to calculate a c-index at specific time points. I know this can be done easily for a coxph model with the following code:
cox_model <- coxph(Surv(time, status == 1) ~ ., data = train)
c_index_test <- pec::cindex(cox_model, formula = cox_model$formula,
                            data = test, eval.times = c(30, 90, 730))
# evaluate at 1 month, 3 months, and 2 years
However, although I can easily calculate a c-index at these time points with a random forest generated using rfsrc(), I haven't been able to do it with ranger.
In addition to the pec cindex() function (which doesn't work with objects of class "ranger"), I've also tried the concordance.index function (part of the survcomp package) and different combinations of predict.ranger to generate survival probability predictions, but nothing has worked.
If anyone can provide code to calculate the c-index of a ranger RSF (at specific time points and on an external validation set) I would appreciate it immensely! I've been able to do it with randomForestSRC, but it takes so long that my R session often times out, and I haven't actually been able to get ANY results from runs with >10 trees...
The ranger package computes Harrell's c-index, which is the standard concordance statistic. If you have a fitted model rf, the attribute prediction.error is equal to 1 - Harrell's c-index, computed on the out-of-bag data.
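For the time-dependent part, here is a hedged sketch (untested, using the veteran data from the survival package as a stand-in for the asker's data): pec::cindex() dispatches on a predictSurvProb() method, so defining a minimal one for ranger objects that interpolates the predicted survival curves at the requested times should allow evaluation at specific time points and on an external dataset.
library(ranger)
library(survival)
library(pec)

data(veteran, package = "survival")
rf <- ranger(Surv(time, status) ~ ., data = veteran)

1 - rf$prediction.error  # Harrell's c-index on the out-of-bag data

# Minimal predictSurvProb() method for ranger: step-function interpolation
# of each predicted survival curve at the evaluation times.
predictSurvProb.ranger <- function(object, newdata, times, ...) {
    p <- predict(object, data = newdata)
    t(sapply(seq_len(nrow(newdata)), function(i) {
        stats::approx(p$unique.death.times, p$survival[i, ],
                      xout = times, method = "constant", rule = 2)$y
    }))
}

cindex(rf, formula = Surv(time, status) ~ 1, data = veteran,
       eval.times = c(30, 90, 365))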

H2O random forest plot in R

I'm new to h2o and I'm having difficulty with this package in R.
I'm using training and test sets of 5100 and 2300 observations respectively, with 18917 variables and a binary target (0, 1).
I ran a random forest:
train_h2o <- as.h2o(train)
test_h2o <- as.h2o(test)
forest <- h2o.randomForest(x = Words,
                           y = 18918,
                           training_frame = train_h2o,
                           ntrees = 250,
                           validation_frame = test_h2o,
                           seed = 8675309)
I know I can get a plot of logloss or MSE (etc.) as the number of trees changes.
But is there a way to plot an image of the model itself, i.e. the final ensemble of trees used for the final predictions?
Also, another question: with the randomForest package I could use the varImp function, which returned, as well as the absolute importance, class-specific measures (computed as mean decrease in accuracy) that I interpreted as class-relative measures of variable importance.
[varImp matrix from the randomForest package, shown as an image in the original post]
In the h2o package I only find the absolute importance measure; is there something similar?
There is no final tree at the end of a random forest built with the randomForest package. To make the final prediction, a random forest uses voting: for any observation,
P(class 0) = (number of trees that predict class 0) / (total number of trees in the forest)
and likewise P(class 1) = (number of trees that predict class 1) / (total number of trees in the forest).
However, if you want a single plottable tree, you can use ctree from the party package:
library("party")
x <- ctree(Class ~ ., data=data)
plot(x)
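Two hedged sketches that go beyond the original answer: individual trees of a randomForest ensemble can at least be inspected (though not nicely plotted) with getTree(), and on the h2o side the absolute variable importance table and plot are exposed via h2o.varimp() and h2o.varimp_plot().
library(randomForest)
rf <- randomForest(Species ~ ., data = iris, ntree = 250)
head(getTree(rf, k = 1, labelVar = TRUE))  # split structure of the first tree

# For the h2o model trained above (requires a running h2o cluster):
# h2o.varimp(forest)       # absolute importance table
# h2o.varimp_plot(forest)  # importance plot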
