plot one of 500 trees in randomForest package - r

How can I plot the trees produced by the randomForest function from the R package of the same name? For example, I use the iris data and want to plot the first of the 500 trees in the output. My code is:
model <- randomForest(Species ~ ., data = iris, ntree = 500)

You can use the getTree() function in the randomForest package (official guide: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf)
On the iris dataset:
require(randomForest)
data(iris)
## we have a look at the k-th tree in the forest
k <- 10
getTree(randomForest(iris[, -5], iris[, 5], ntree = 10), k, labelVar = TRUE)
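As a minimal sketch of how the question's own model could be inspected, the lines below fit the 500-tree forest and pull out the first tree. The names rf, tree1, and n_leaves are illustrative, and getTree() returns a table describing the splits rather than a plot.

```r
library(randomForest)

set.seed(1)  # make the forest reproducible
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# Extract the first of the 500 trees as a data frame, one row per node
tree1 <- getTree(rf, k = 1, labelVar = TRUE)
head(tree1)

# In the status column, -1 marks a terminal node
n_leaves <- sum(tree1$status == -1)
n_leaves
```

Each row gives the left and right daughter nodes, the split variable, the split point, the node status, and the prediction for terminal nodes.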

You can use cforest to plot the trees as shown below. I have hardcoded the number of trees to 5; change it as required.
ntree <- 5
library("party")
cf <- cforest(Species~., data=iris,controls=cforest_control(ntree=ntree))
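for (i in 1:ntree) {
  # S4 slots are accessed with @; prettytree is called via ::: in case it is not exported
  pt <- party:::prettytree(cf@ensemble[[i]], names(cf@data@get("input")))
  nt <- new("Random Forest BinaryTree")
  nt@tree <- pt
  nt@data <- cf@data
  nt@responses <- cf@responses
  # write each tree to its own PDF file
  pdf(file = paste0("filex", i, ".pdf"))
  plot(nt, type = "simple")
  dev.off()
}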
cforest is another implementation of random forests. Neither can be said to be better in general, but there are a few differences. cforest builds conditional inference trees and puts more weight on the terminal nodes, whereas the randomForest implementation gives the terminal nodes equal weight.
In general, cforest uses a weighted mean and randomForest uses a plain average.
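The difference between a plain average and a weighted mean can be illustrated with a toy calculation in base R. The predictions and weights below are hypothetical numbers chosen for illustration; this is only the arithmetic, not the internal code of either package.

```r
# Hypothetical predictions for a single observation
preds <- c(5.1, 4.8, 5.6, 5.0)

# Hypothetical weights attached to those predictions
w <- c(10, 2, 1, 7)

# Plain average: every prediction counts equally
plain_avg <- mean(preds)

# Weighted mean: predictions with larger weights count more
weighted_avg <- weighted.mean(preds, w)

c(plain = plain_avg, weighted = weighted_avg)
```

With these numbers the plain average is 5.125 and the weighted mean is 5.06, because the heavily weighted predictions pull the result toward themselves.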

Related

How to implement shapper:shap for whole dataset?

I have created a Random Forest model using the randomForest package
model_rf <- randomForest(y~ . , data = data_train,ntree=1000, keep.forest=TRUE,importance=TRUE)
To calculate Shapley values for the different features based on this RF model, I first create an "explainer object" and then use the "shapper" package
exp_rf <- DALEX::explain(model_rf, data = data_test[,-1], y = data_test[,1])
ive_rf <- shap(exp_rf, new_observation = data_test[1,-1])
To my knowledge, I can only apply the "shap" function to one observation (the "new_observation").
But I am looking for a way to calculate the shapley values for all of my respondents in my datafile.
I know this is possible in the "SHAP" package in Python; but is it also possible with the "shapper" package in R?
At the moment, I created a loop to calculate the shapley values for all respondents, but this will take me days to calculate for my entire datafile.
for(i in c(1:nrow(data_test)))
{
ive_rf <- shap(exp_rf,new_observation=data_test[i,-1])
shapruns<-cbind(shapruns,ive_rf[,"_attribution_"])
}
Any help would be much appreciated.
I recently published two R packages that are optimized for this kind of task: "kernelshap" (calculates SHAP values fast) and "shapviz" (plots SHAP values from any source). In your case, a working example would be:
library(randomForest)
library(kernelshap)
library(shapviz)
set.seed(1)
fit <- randomForest(Sepal.Length ~ ., data = iris)
# Step 1: Calculate Kernel SHAP values
# bg_X is usually a small (50-200 rows) subset of the data
s <- kernelshap(fit, iris[-1], bg_X = iris)
# Step 2: Turn them into a shapviz object
sv <- shapviz(s)
# Step 3: Gain insights...
sv_importance(sv, kind = "bee")
sv_dependence(sv, v = "Petal.Length", color_var = "auto")
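To get one row of SHAP values per respondent without a loop, a sketch along the following lines may help. It assumes the kernelshap result stores the SHAP matrix in its S element, and the background size of 50 rows is an arbitrary choice within the range mentioned in the comment above. The names bg and shap_matrix are illustrative.

```r
library(randomForest)
library(kernelshap)

set.seed(1)
fit <- randomForest(Sepal.Length ~ ., data = iris)

# Small random background sample (features only) instead of the full data
bg <- iris[sample(nrow(iris), 50), -1]

# One call explains all rows of X at once
s <- kernelshap(fit, X = iris[-1], bg_X = bg)

# Assumed layout: one row per observation, one column per feature
shap_matrix <- s$S
dim(shap_matrix)
head(shap_matrix)
```

Using a smaller background set is the main lever for run time, at the cost of a slightly noisier estimate of the baseline.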

shap plots for random forest models

I would like to get the SHAP contributions of the variables in a ranger/random forest model and produce plots like this in R:
[image: beeswarm plots]
I have tried the following libraries: DALEX, shapr, fastshap, and shapper. I could only get plots like this:
[image: fastshap plot]
Is it possible to get such plots? I have also tried the reticulate package, and it still doesn't work.
Random forests need to grow many deep trees. While possible, crunching TreeSHAP for deep trees requires an awful lot of memory and CPU power. An alternative is to use the Kernel SHAP algorithm, which works for all kinds of models.
library(ranger)
library(kernelshap)
library(shapviz)
set.seed(1)
fit <- ranger(Sepal.Length ~ ., data = iris)
# Step 1: Calculate Kernel SHAP values
# bg_X is usually a small (50-200 rows) subset of the data
s <- kernelshap(fit, iris[-1], bg_X = iris)
# Step 2: Turn them into a shapviz object
sv <- shapviz(s)
# Step 3: Gain insights...
sv_importance(sv, kind = "bee")
sv_dependence(sv, v = "Petal.Length", color_var = "auto")
Disclaimer: I wrote "kernelshap" and "shapviz"

Weighted Portmanteau Test for Fitted GARCH process

I have fitted a GARCH process to a time series and analyzed the ACF of the squared and absolute residuals to check the model's goodness of fit. But I also want to do a formal test, and after searching the internet, the Weighted Portmanteau Test (originally by Li and Mak) seems to be the one.
It's from the WeightedPortTest package and is one of the few (perhaps the only one?) that properly tests the GARCH residuals.
While going through the instructions in various documents I can't wrap my head around what the "h.t" argument wants. It says in the info in R that I need to assign "a numeric vector of the conditional variances". This may be simple to an experienced user, though I'm struggling to understand. What is it that I need to do and preferably how would I code it in R?
Thankful for any kind of help
Taken directly from the documentation:
h.t: a numeric vector of the conditional variances
A little toy example using the fGarch package follows:
library(fGarch)
library(WeightedPortTest)
spec <- garchSpec(model = list(alpha = 0.6, beta = 0))
simGarch11 <- garchSim(spec, n = 300)
fit <- garchFit(formula = ~ garch(1, 0), data = simGarch11)
Weighted.LM.test(fit@residuals, fit@h.t, lag = 10)
And using garch() from the tseries package:
library(tseries)
fit2 <- garch(as.numeric(simGarch11), order = c(0, 1))
summary(fit2)
# comparison of fitted values:
tail(fit2$fitted.values[,1]^2)
tail(fit@h.t)
# comparison of residuals after unstandardizing:
unstd <- fit2$residuals*fit2$fitted.values[,1]
tail(unstd)
tail(fit@residuals)
Weighted.LM.test(unstd, fit2$fitted.values[,1]^2, lag = 10)
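To make concrete what "a numeric vector of the conditional variances" means, here is a base-R sketch that simulates an ARCH(1) series by hand. The parameter values omega and alpha and the variable names e and h are assumptions chosen for illustration; in practice h.t comes from the fitted model, as in the examples above.

```r
set.seed(42)
n     <- 300
omega <- 0.1   # hypothetical constant term
alpha <- 0.6   # hypothetical ARCH(1) coefficient

e <- numeric(n)  # residuals (innovations times conditional sd)
h <- numeric(n)  # conditional variances, one per time point

# Start from the unconditional variance of an ARCH(1) process
h[1] <- omega / (1 - alpha)
e[1] <- sqrt(h[1]) * rnorm(1)

for (t in 2:n) {
  # Variance at time t depends on the previous squared residual
  h[t] <- omega + alpha * e[t - 1]^2
  e[t] <- sqrt(h[t]) * rnorm(1)
}

# e plays the role of the residuals argument and h the role of h.t
std_res <- e / sqrt(h)  # standardized residuals
tail(cbind(e = e, h = h, std_res = std_res))
```

The vector h has the same length as the residual vector, which is the shape the h.t argument expects.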

fitted GEV df parameters / different results for different packages (lmomRFA, lmom, extRemes, nsRFA)

I am trying to fit a GEV distribution to annual maxima using the L-moment method, but different packages give different parameters (location, scale, and shape) for the fitted GEV distribution.
I used several packages and the odd one out was lmomRFA. Could anyone help me identify the problem?
Here is a sample of what I am doing:
install.packages("extRemes")
install.packages("lmom")
install.packages("lmomRFA")
install.packages("nsRFA")
library (nsRFA)
library (extRemes)
library (lmomRFA)
library (lmom)
amax_standard <- c (0.6274510, 0.7545455, 0.6521739, 1.5102041, 2.0937500, 1.0000000, 0.7094017, 1.0315789, 1.3207547)
amax_standard_matrix <- as.matrix(amax_standard)## The AMAX as a matrix
amax_standard_vector <- as.vector(amax_standard_matrix) ## The AMAX as a vector
amax_standard_vector_order <- amax_standard_vector[order (amax_standard_vector)] ## The AMAX as an ordered vector
fit1 <- fevd(amax_standard_vector_order, type = "GEV", method = "Lmoments") # fitting the GEV using the "extRemes" package
lmom <- samlmu(amax_standard_vector_order) # calculating the L-moment ratios for the region; this function treats all the observations as one vector
fit2 <- pelgev(lmom) # fitting the GEV using the "lmom" package
regdata <- regsamlmu(amax_standard_matrix, nmom = 4, sort.data = TRUE, lcv = FALSE)
colnames(regdata)[4] <- "t" # to make the column name acceptable for the "regfit" function
fit3 <- regfit(regdata, "gev") # fitting the GEV using the "lmomRFA" package
The first two functions, from the extRemes and lmom packages, give similar estimates of the GEV parameters, while the third, from the lmomRFA package, gives a different answer.
I also tried another package, nsRFA, just to estimate the parameters. I passed its par.GEV function the estimated L-moment ratios, which are identical across all the packages and can be found in the variables lmom and regdata. The results were similar to those from extRemes and lmom. So what is the problem with the function in lmomRFA?
check <- par.GEV (1.0777622, 0.2760841, 0.3553799)

Predictive model decision tree

I want to build a predictive model using decision tree classification in R. I used this code:
library(rpart)
library(caret)
DataYesNo <- read.csv('DataYesNo.csv', header=T)
summary(DataYesNo)
worktrain <- sample(1:50, 40)
worktest <- setdiff(1:50, worktrain)
DataYesNo[worktrain,]
DataYesNo[worktest,]
M <- ncol(DataYesNo)
input <- names(DataYesNo)[1:(M-1)]
target <- "YesNo"
tree <- rpart(YesNo~Var1+Var2+Var3+Var4+Var5,
data=DataYesNo[worktrain, c(input,target)],
method="class",
parms=list(split="information"),
control=rpart.control(usesurrogate=0, maxsurrogate=0))
summary(tree)
plot(tree)
text(tree)
I got just one root (Var3) and two leaves (yes, no). I'm not sure about this result. How can I get the confusion matrix, accuracy, sensitivity, and specificity?
Can I get them with the caret package?
If you use your model to make predictions on your test set, you can use confusionMatrix() to get the measures you're looking for.
Something like this...
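# predict class labels for the test rows (worktest holds row indices, not data)
predictions <- predict(tree, DataYesNo[worktest, ], type = "class")
# compare against the true labels, using the same factor levels as the predictions
cmatrix <- confusionMatrix(predictions, factor(DataYesNo$YesNo[worktest], levels = levels(predictions)))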
print(cmatrix)
Once you have the confusion matrix, the printed output also reports other measures, including accuracy, sensitivity, and specificity.
Following your example, the confusion matrix can be obtained as follows.
fitted <- predict(tree, DataYesNo[worktest, c(input, target)], type = "class") # class labels, not probabilities
actual <- DataYesNo[worktest, target]
confusion <- table(data.frame(fitted = fitted, actual = actual))
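For readers who want the accuracy, sensitivity, and specificity without caret, they can be read off such a table directly. The sketch below uses two small hypothetical label vectors in place of the real predictions and treats "yes" as the positive class; both choices are assumptions for illustration.

```r
# Hypothetical true and predicted labels for ten test cases
actual <- factor(c("yes", "yes", "no", "no", "yes", "no", "yes", "no", "no", "yes"),
                 levels = c("yes", "no"))
fitted <- factor(c("yes", "no", "no", "no", "yes", "yes", "yes", "no", "no", "yes"),
                 levels = c("yes", "no"))

# Rows are predictions, columns are true labels
confusion <- table(fitted = fitted, actual = actual)

tp <- confusion["yes", "yes"]  # predicted yes, truly yes
tn <- confusion["no", "no"]    # predicted no, truly no
fp <- confusion["yes", "no"]   # predicted yes, truly no
fn <- confusion["no", "yes"]   # predicted no, truly yes

accuracy    <- (tp + tn) / sum(confusion)
sensitivity <- tp / (tp + fn)  # share of true "yes" cases found
specificity <- tn / (tn + fp)  # share of true "no" cases found

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

With these toy labels all three measures come out at 0.8.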
