Is there an R function to obtain the minimal depth distribution from a conditional random forest estimated with the party package? - r

I ran a conditional random forest regression model using the cforest function from the party package because I have both categorical and continuous predictor variables that are correlated with each other, and a continuous outcome variable.
Here is my code to run the conditional random forest model, obtain out-of-bag estimates, and estimate the permutation variable importance.
# 1. fit the random forest
crf <- party::cforest(Y ~ ., data = df,
                      controls = party::cforest_unbiased(ntree = 10000, mtry = 7))
# 2. obtain out-of-bag estimates
pred_oob <- as.data.frame(predict(crf, OOB = TRUE, newdata = NULL))
# 3. estimate permutation variable importance
vi <- permimp::permimp(crf, conditional = TRUE, threshold = 0.5, nperm = 1000, OOB = TRUE,
                       mincriterion = 0)
I would like to visualize the minimal depth distribution and calculate the mean minimal depth, similar to the output from the randomForestExplainer package. However, randomForestExplainer only accepts objects created by the randomForest function in the randomForest package. Using that function is not an option for me due to the nature of my data (described above).
I have been combing the internet and have not been able to find a solution. Can someone point me to a way to visualize the minimal depth distribution for all predictors and calculate the mean minimal depth?

Related

Generate SHAP dependence plots

Is there a package that allows estimation of SHAP values for multiple observations for models that are not XGBoost or decision tree based? I created a neural network using caret and nnet. I want to produce a beeswarm plot and SHAP dependence plots to explore the relationship between certain variables in my model and the outcome. The only success I have had is using the DALEX package to estimate SHAP values, but DALEX only does this for single instances and cannot do a global analysis using SHAP values. Any insight or help would be appreciated!
I have tried other SHAP packages (fastshap, shapr), but these require decision tree based models. I also tried creating an XGBoost model in caret, but it did not work well with the SHAP packages in R and I could not get the output I wanted.
I invested a little bit of time to push R in this regard:
shapviz plots SHAP values from any source, including XGBoost, LightGBM, H2O, kernelshap, and fastshap
kernelshap calculates Kernel SHAP values for all models with numeric output, even multivariate output. This will be your friend when it comes to models outside the TreeSHAP comfort zone...
Put differently: kernelshap + shapviz = explain any model.
Here is an example using "caret" for linear regression, but nnet works identically.
library(caret)
library(kernelshap)
library(shapviz)
fit <- train(
  Sepal.Length ~ .,
  data = iris,
  method = "lm",
  tuneGrid = data.frame(intercept = TRUE),
  trControl = trainControl(method = "none")
)
# Explain rows in `X` based on background data `bg_X` (50-200 rows, not the full training data!)
shap <- kernelshap(fit, X = iris[, -1], bg_X = iris)
sv <- shapviz(shap)
sv_importance(sv)
sv_importance(sv, kind = "bee")
sv_dependence(sv, "Species", color_var = "auto")
# Single observations
sv_waterfall(sv, 1)
sv_force(sv, 1)

Does *metafor* package in R provide forest plot for robust random effects models

I have fit a robust random-effects meta-regression model using the metafor package in R.
My full data and reproducible R code appear below.
Questions:
(1) What are the meaning and interpretation of grey-colored diamonds appearing over CIs?
(2) I won't get an overall mean effect when I have moderators, correct?
library(metafor)
d <- read.csv("https://raw.githubusercontent.com/izeh/m/master/d.csv", header = TRUE) ## DATA
res <- robust(rma.uni(yi = dint, sei = SD, mods = ~es.type, data = d, slab = d$study.name),
              cluster = d$id)
forest(res)
1) Quoting from help(forest.rma): "For models involving moderators, the fitted value for each study is added as a polygon to the plot." So, the grey-colored diamonds (or polygons) are the fitted values and the width of the diamonds/polygons reflects the width of the CI for the fitted values.
2) No, since there is no longer a single overall effect when your model includes moderators.

How can I pass a weight decay argument to mlogit()?

How can I specify weight decay in a model fit by mlogit()?
The multinom() function of nnet allows you to specify weight decay for the model being fit. I assumed that mlogit uses this function behind the scenes, so it should be possible to pass the decay argument through to multinom, but I have not found a way to do this.
So far I have tried simply passing a value in the call to mlogit(), like this.
library(mlogit)
set.seed(1)
data("Fishing", package = "mlogit")
Fishing$wts <- runif(nrow(Fishing)) #for some weights
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
fit1 <- mlogit(mode ~ 0 | income, data = Fish, weights = wts, decay = .01)
fit2 <- mlogit(mode ~ 0 | income, data = Fish, weights = wts)
But the output is exactly the same:
identical(logLik(fit1), logLik(fit2))
[1] TRUE
mlogit() and nnet::multinom() both fit multinomial logistic models (predicting probability of class membership for multiple classes) but they use different algorithms to fit the model. nnet::multinom() uses a neural network to fit the model and mlogit() uses maximum likelihood.
Weight decay is a parameter for neural networks and is not applicable to maximum likelihood.
The effect of weight decay is to keep the weights in the neural network from getting too large by penalizing larger weights during the weight update step of the fitting algorithm. This helps to prevent over-fitting and hopefully produces a model that generalizes better.
Consider using the pmlr function in the pmlr package. This function implements a "Penalized maximum likelihood estimation for multinomial logistic regression" when called with the default function parameter penalized = TRUE.
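To see the weight decay described above in action where it is supported, here is a minimal sketch using nnet::multinom() directly. The iris data and the decay value of 0.1 are arbitrary choices for illustration, not something taken from the question.

```r
library(nnet)  # recommended package, shipped with standard R installations

# Same model with and without weight decay (an L2 penalty on the coefficients).
# decay is passed through multinom()'s ... to nnet().
fit_plain <- multinom(Species ~ ., data = iris, trace = FALSE)
fit_decay <- multinom(Species ~ ., data = iris, decay = 0.1, trace = FALSE)

# The penalty shrinks the coefficients toward zero
ss_plain <- sum(coef(fit_plain)^2)
ss_decay <- sum(coef(fit_decay)^2)
c(plain = ss_plain, decay = ss_decay)
```

The sum of squared coefficients should come out smaller for the fit with decay, which is the shrinkage that mlogit() has no argument for.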

How to get the nodal raw numbers (from all the trees for a particular test vector) from which random forest calculates the prediction in R?

I'd like to predict a distribution rather than a single number using random forest regression in R. To do this, I'd like to get all the numbers from which random forest calculates (averages) the predicted value for a particular test vector. How can I get this done?
To be specific,
I'm not growing each tree to its full size, but limiting the size using the nodesize parameter. In this case, I'm interested not in the prediction of each tree in the forest (which is given by setting predict.all to TRUE), but in all the data points from which this prediction is calculated; that is, all the data points in the node that a new observation lands in, for every tree in the forest.
Thanks,
The function predict.randomForest has a boolean parameter predict.all exactly for this purpose.
library("randomForest")
rf = randomForest(Species ~ ., data = iris)
?predict.randomForest
allpred = predict(rf, newdata = iris, predict.all = TRUE)
Now allpred$individual is a matrix whose columns correspond to the individual decision trees.
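If what you need is the training observations behind each tree's prediction rather than the per-tree predictions themselves, predict.randomForest also accepts nodes = TRUE, which attaches a matrix of terminal node IDs as the "nodes" attribute. A rough sketch, assuming a regression forest on mtcars; the train/test split, the nodesize value, and the helper node_values() are made up for illustration:

```r
library(randomForest)
set.seed(1)

# Hypothetical split of mtcars, for illustration only
train <- mtcars[1:25, ]
test  <- mtcars[26:32, ]
rf <- randomForest(mpg ~ ., data = train, ntree = 50, nodesize = 5,
                   keep.inbag = TRUE)

# Terminal node IDs: one row per observation, one column per tree
train_nodes <- attr(predict(rf, newdata = train, nodes = TRUE), "nodes")
test_nodes  <- attr(predict(rf, newdata = test,  nodes = TRUE), "nodes")

# Hypothetical helper: training responses that share a terminal node with
# test row i, pooled over all trees. Only in-bag observations are used, and
# each is repeated as often as rf$inbag says it was drawn for that tree.
node_values <- function(i) {
  unlist(lapply(seq_len(rf$ntree), function(t) {
    same_node <- train_nodes[, t] == test_nodes[i, t]
    rep(train$mpg, times = rf$inbag[, t] * same_node)
  }))
}

vals <- node_values(1)
summary(vals)  # empirical distribution behind the prediction for test row 1
```

Pooling over trees treats every tree equally; whether that pooled sample is a sensible predictive distribution for your problem is a separate question.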

Fitting a zero inflated poisson distribution in R

I have a vector of count data that is strongly over dispersed and zero inflated.
The vector looks like this:
i.vec=c(0,63,1,4,1,44,2,2,1,0,1,0,0,0,0,1,0,0,3,0,0,2,0,0,0,0,0,2,0,0,0,0,
0,0,0,0,0,0,0,0,6,1,11,1,1,0,0,0,2)
m=mean(i.vec)
# 3.040816
sig=sd(i.vec)
# 10.86078
I would like to fit a distribution to this, which I strongly suspect will be a zero inflated poisson (ZIP). But I need to perform a significance test to demonstrate that a ZIP distribution fits the data.
If I had a normal distribution, I could do a chi square goodness of fit test using the function goodfit() in the package vcd, but I don't know of any tests that I can perform for zero inflated data.
Here is one approach
# LOAD LIBRARIES
library(fitdistrplus) # fits distributions using maximum likelihood
library(gamlss) # defines pdf, cdf of ZIP
# FIT DISTRIBUTION (mu = mean of the Poisson component, sigma = probability of a structural zero)
fit_zip = fitdist(i.vec, 'ZIP', start = list(mu = 2, sigma = 0.5))
# VISUALIZE TEST AND COMPUTE GOODNESS OF FIT
plot(fit_zip)
gofstat(fit_zip, print.test = T)
Based on this, it does not look like ZIP is a good fit.
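For reference, the likelihood that the fit above maximizes can also be written out in base R. This is only a sketch under the usual ZIP parameterization (sigma is the extra probability mass at zero, mu the Poisson mean); the function name zip_nll and the starting values are my own choices, and it does not replace the goodness-of-fit output above.

```r
# Data from the question
i.vec <- c(0,63,1,4,1,44,2,2,1,0,1,0,0,0,0,1,0,0,3,0,0,2,0,0,0,0,0,2,0,0,0,0,
           0,0,0,0,0,0,0,0,6,1,11,1,1,0,0,0,2)

# Negative log-likelihood of a zero-inflated Poisson.
# par[1] = log(mu), par[2] = logit(sigma), so the optimizer is unconstrained.
zip_nll <- function(par, x) {
  mu <- exp(par[1])
  sigma <- plogis(par[2])
  logp <- ifelse(x == 0,
                 log(sigma + (1 - sigma) * exp(-mu)),
                 log1p(-sigma) + dpois(x, mu, log = TRUE))
  -sum(logp)
}

fit <- optim(c(log(2), qlogis(0.5)), zip_nll, x = i.vec)
mu_hat <- exp(fit$par[1])
sigma_hat <- plogis(fit$par[2])

# Log-likelihoods of the ZIP fit and of a plain Poisson fit, for comparison
ll_zip <- -fit$value
ll_pois <- sum(dpois(i.vec, mean(i.vec), log = TRUE))
c(mu = mu_hat, sigma = sigma_hat, logLik_zip = ll_zip, logLik_pois = ll_pois)
```

A higher log-likelihood than the plain Poisson only says ZIP is the better of the two; it does not show that ZIP fits the data well.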
