How to get the nodal raw numbers (from all the trees for a particular test vector) from which random forest calculates the prediction in R?

I'd like to predict a distribution rather than a single number using random forest regression in R. To do this, I'd like to get all the numbers from which random forest calculates (averages) the predicted value for a particular test vector. How can I get this done?
To be specific: I'm not growing each tree to its full size, but limiting it with the nodesize parameter. In this case I'm interested not in the prediction of each tree in the forest (which is given by setting predict.all to TRUE), but in all the data points from which that prediction is calculated; that is, all the training points in the node on which a new observation lands, for every tree in the forest.
Thanks,

The function predict.randomForest has a boolean parameter predict.all exactly for this purpose.
library("randomForest")
rf = randomForest(Species ~ ., data = iris)
?predict.randomForest
allpred = predict(rf, newdata = iris, predict.all = TRUE)
Now allpred$individual is a matrix whose columns correspond to the individual decision trees and whose rows correspond to the observations in newdata.
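The question, though, asks for the raw training responses behind each tree's prediction rather than the per-tree predictions themselves. predict.randomForest also has a nodes argument that returns the terminal-node index of every observation in every tree, which gets most of the way there. A minimal sketch for a regression forest (the mtcars data and variable names are purely illustrative; note that rf$inbag records membership but not multiplicity, so averaging per-tree node means only approximately reproduces the forest prediction):
library(randomForest)
set.seed(1)
rf <- randomForest(mpg ~ ., data = mtcars, nodesize = 5, keep.inbag = TRUE)
# n x ntree matrices of terminal-node ids for the training data and one test vector
train_nodes <- attr(predict(rf, mtcars, nodes = TRUE), "nodes")
test_nodes  <- attr(predict(rf, mtcars[1, ], nodes = TRUE), "nodes")
# for each tree, the in-bag training responses sharing the test vector's node
node_values <- lapply(seq_len(rf$ntree), function(t) {
  same_node <- train_nodes[, t] == test_nodes[1, t]
  in_bag    <- rf$inbag[, t] > 0   # membership only; multiplicity is not recorded
  mtcars$mpg[same_node & in_bag]
})
# averaging the per-tree node means should be close to the forest prediction
mean(sapply(node_values, mean))
predict(rf, mtcars[1, ])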

Related

Is there an R function to obtain the minimal depth distribution from a conditional random forest estimated with the party package?

I ran a conditional random forest regression model using the cforest function from the party package because I have both categorical and continuous predictor variables that are correlated with each other, and a continuous outcome variable.
Here is my code to run the conditional random forest model, obtain out-of-bag estimates, and estimate the permutation variable importance.
# 1. fit the random forest
crf <- party::cforest(Y ~ ., data = df,
                      controls = party::cforest_unbiased(ntree = 10000, mtry = 7))
# 2. obtain out-of-bag estimates
pred_oob <- as.data.frame(predict(crf, OOB = TRUE, newdata = NULL))
# 3. estimate permutation variable importance
vi <- permimp::permimp(crf, conditional = TRUE, threshold = 0.5, nperm = 1000,
                       OOB = TRUE, mincriterion = 0)
I would like to visualize the minimal depth distribution and calculate mean minimal depth similar to the output from the randomForestExplainer package. However, randomForestExplainer only accepts objects from the randomForest function in the randomForest package, and that function is not an option for me due to the nature of my data (described above).
I have been combing the internet and have not been able to find a solution. Can someone point me to a way to visualize the minimal depth distribution for all predictors and calculate the mean minimal depth?
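No answer was recorded for this question, but here is a hedged sketch rather than a tested solution: the partykit package (party's successor) also provides cforest() and exposes accessors for walking the fitted trees, so minimal depth can be computed manually. The accessors used below (gettree, node_party, is.terminal, split_node, varid_split, kids_node) are real partykit exports; the aggregation logic is illustrative and unverified.
library(partykit)
cf <- partykit::cforest(Y ~ ., data = df, ntree = 1000, mtry = 7)  # same model as above
# minimal depth of every variable used in one tree (named by column index)
min_depths <- function(node, depth = 0) {
  if (is.terminal(node)) return(numeric(0))
  v    <- as.character(varid_split(split_node(node)))
  kids <- unlist(lapply(kids_node(node), min_depths, depth = depth + 1))
  d    <- c(setNames(depth, v), kids)
  tapply(d, names(d), min)  # keep the shallowest occurrence of each variable
}
md <- lapply(seq_len(1000), function(i)
  min_depths(node_party(gettree(cf, tree = i))))
# distribution and naive mean minimal depth per variable (indices refer to the
# columns of the learning data); randomForestExplainer additionally penalizes
# trees in which a variable never splits, which this naive mean does not
depths <- unlist(md)
sort(tapply(depths, names(depths), mean))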

How to identify the non-zero coefficients in the final caret elastic net model

I have used caret to build an elastic net model using 10-fold CV, and I want to see which coefficients are used in the final model (i.e. the ones that aren't reduced to zero). I have used the following code to view the coefficients; however, this appears to create a dataframe of every permutation of coefficient values used, rather than the ones used in the final model:
tr_control = train_control(method="cv",number=10)
formula = response ~ .
model1 = caret::train(formula,
                      data = training,
                      method = "glmnet",
                      trControl = tr_control,
                      metric = "Accuracy",
                      family = "binomial")
Then, to extract the coefficients from the final model using the best lambda value, I used the following:
data.frame(as.matrix(coef(model1$finalModel, model1$bestTune$.lambda)))
However, this just returns a dataframe of all the coefficients, and I can see different instances where coefficients have been reduced to zero, but I'm not sure which set the final model uses. Using slightly different code, I get slightly different results, but in this instance none of the coefficients are reduced to zero, which suggests to me that the final model isn't reducing any coefficients to zero:
data.frame(as.matrix(coef(model1$finalModel, model1$bestTune$lambda))) # I have removed the full stop preceding lambda
Basically, I want to know which features are in the final model to assess how the model has performed as a feature reduction process (alongside standard model evaluation metrics such as accuracy, sensitivity etc).
Since you did not provide any example data, I post an example based on the built-in iris dataset, slightly modified to better fit your need (a binomial outcome).
First, modify the dataset:
library(caret)
set.seed(5) # just for reproducibility
irisn <- iris[iris$Species != "virginica",]
irisn$Species <- factor(irisn$Species, levels = c("versicolor", "setosa"))
str(irisn)
summary(irisn)
Then fit the model (the caret function for setting control parameters for train is trainControl, not train_control):
tr_control = trainControl(method = "cv", number = 10)
model1 <- caret::train(Species ~ .,
                       data = irisn,
                       method = "glmnet",
                       trControl = tr_control,
                       family = "binomial")
You can extract the coefficients of the final model as you already did:
data.frame(as.matrix(coef(model1$finalModel, model1$bestTune$lambda)))
Also here the model did not reduce any coefficients to 0, but what if we add a random variable that explains nothing about the outcome?
irisn$new1 <- runif(nrow(irisn))
model2 <- caret::train(Species ~ .,
                       data = irisn,
                       method = "glmnet",
                       trControl = tr_control,
                       family = "binomial")
var <- data.frame(as.matrix(coef(model2$finalModel, model2$bestTune$lambda)))
Here, as you can see, the coefficient of the new variable has been shrunk to 0. You can extract the names of the variables retained by the model with:
rownames(var)[var$X1!=0]
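As a side note, glmnet's coef() method also accepts the penalty value by name via its s argument, which avoids the $.lambda vs $lambda confusion from the question:
coef(model2$finalModel, s = model2$bestTune$lambda)  # same coefficients, explicit s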
Finally, the accuracy metrics from the test set can be obtained with
confusionMatrix(predict(model1,test),test$outcome)

I am setting a seed on a Gradient Boosting Machine (GBM) model but I keep on getting different predictions

I am performing credit risk modelling using the Gradient Boosting Machine (GBM) algorithm, and when predicting the Probability of Default (PD) I keep getting different PDs on each run even though I have set.seed(1234) in my code.
What could be causing this, and how do I fix it? Here is my code below:
fitControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 5)
modelLookup(model='gbm')
#Creating the tuning grid
grid <- expand.grid(n.trees = c(10,20,50,100,500,1000),
                    shrinkage = c(0.01,0.05,0.1,0.5),
                    n.minobsinnode = c(3,5,10),
                    interaction.depth = c(1,5,10))
#SetSeed
set.seed(1234)
# training the model
model_gbm <- train(trainSet[,predictors], trainSet[,outcomeName],
                   method = 'gbm', trControl = fitControl, tuneGrid = grid)
# summarizing the model
print(model_gbm)
plot(model_gbm)
#using tuneLength instead of an explicit grid
model_gbm <- train(trainSet[,predictors], trainSet[,outcomeName],
                   method = 'gbm', trControl = fitControl, tuneLength = 10)
print(model_gbm)
plot(model_gbm)
#Checking variable importance for GBM
#Variable Importance
library(gbm)
varImp(object=model_gbm, numTrees = 50)
#Plotting variable importance for GBM
plot(varImp(object=model_gbm),main="GBM - Variable Importance")
#Checking variable importance for RF
varImp(object=model_rf)
#Plotting variable importance for Random Forest
plot(varImp(object=model_rf),main="RF - Variable Importance")
#Checking variable importance for NNET
varImp(object=model_nnet)
#Plotting Variable importance for Neural Network
plot(varImp(object=model_nnet),main="NNET - Variable Importance")
#Checking variable importance for GLM
varImp(object=model_glm)
#Plotting Variable importance for GLM
plot(varImp(object=model_glm),main="GLM - Variable Importance")
#Predictions
predictions<-predict.train(object=model_gbm,testSet[,predictors],type="raw")
table(predictions)
confusionMatrix(predictions,testSet[,outcomeName])
PD <- predict.train(object=model_gbm,credit_transformed[,predictors],type="prob")
I assume you are using train() from caret.
I recommend you use the seeds argument of trainControl() from the same package; it is more complex, but fully customizable.
As you can see from ?trainControl, the parameter seeds is:
an optional set of integers that will be used to set the seed at each
resampling iteration. This is useful when the models are run in
parallel. A value of NA will stop the seed from being set within the
worker processes while a value of NULL will set the seeds using a
random set of integers. Alternatively, a list can be used. The list
should have B+1 elements where B is the number of resamples, unless
method is "boot632" in which case B is the number of resamples plus 1.
The first B elements of the list should be vectors of integers of
length M where M is the number of models being evaluated. The last
element of the list only needs to be a single integer (for the final
model). See the Examples section below and the Details section.
Fixing the seeds this way should do the trick.
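A sketch following the quoted documentation (the variable names match the question's code; the tuning grid above has 6 * 4 * 3 * 3 = 216 combinations):
set.seed(1234)
B <- 5 * 5                                # 5-fold CV repeated 5 times = 25 resamples
M <- nrow(grid)                           # models evaluated per resample (216 here)
seeds <- vector(mode = "list", length = B + 1)
for (i in seq_len(B)) seeds[[i]] <- sample.int(100000, M)
seeds[[B + 1]] <- sample.int(100000, 1)   # single seed for the final model
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 5,
                           seeds = seeds)
Note that the second train() call with tuneLength = 10 evaluates a different number of models per resample, so it would need its own seeds list sized accordingly.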
Please, next time try to offer a dput() output or something analogous of your data, so that the example is reproducible.
Best!

H2O random forest plot in R

I'm new to h2o and I'm having difficulty with this package in R.
I'm using a training set and a test set (5100 and 2300 obs respectively) with 18917 variables and a binary target (0, 1).
I ran a random forest:
train_h2o <- as.h2o(train)
test_h2o <- as.h2o(test)
forest <- h2o.randomForest(x = Words,
                           y = 18918,
                           training_frame = train_h2o,
                           ntrees = 250,
                           validation_frame = test_h2o,
                           seed = 8675309)
I know I can get a plot of logloss or MSE etc. as the number of trees changes.
But is there a way to plot an image of the model itself, i.e. the final ensembled tree used for the final predictions?
Also, another question: with the randomForest package I could use the varImp function, which returned, as well as the absolute importance, the class-specific measures (computed as mean decrease in accuracy), which I interpreted as a class-relative measure of variable importance.
[image: varImp matrix, randomForest package]
In the h2o package I only find the absolute importance measure; is there something similar?
There is no single final tree at the end of a random forest in R's randomForest package. To make the final prediction, a random forest uses voting: for any observation, the vote share for class 0 is the number of trees that predict class 0 divided by the total number of trees in the forest, and likewise for class 1.
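A minimal illustration of these vote shares with the randomForest package (type = "prob" returns, for each observation, the fraction of trees voting for each class):
library(randomForest)
rf <- randomForest(Species ~ ., data = iris, ntree = 250)
head(predict(rf, iris, type = "prob"))  # each row sums to 1 across the classes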
However, you can fit and plot a single conditional inference tree with ctree:
library("party")
x <- ctree(Class ~ ., data=data)
plot(x)
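Regarding the variable importance sub-question: as far as I know, h2o reports only overall importances, not class-specific ones; they can be pulled with h2o.varimp():
h2o.varimp(forest)       # data frame of relative, scaled, and percentage importance
h2o.varimp_plot(forest)  # bar chart of the same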

Plotting a ROC curve from a random forest classification

I'm trying to plot the ROC curve of a random forest classification. Plotting works, but I think I'm plotting the wrong data, since the resulting plot only has one point (the accuracy).
This is the code I use:
set.seed(55)
data.controls <- cforest_unbiased(ntree=100, mtry=3)
data.rf <- cforest(type ~ ., data = dataset ,controls=data.controls)
pred <- predict(data.rf, type="response")
preds <- prediction(as.numeric(pred), dataset$type)
perf <- performance(preds,"tpr","fpr")
performance(preds,"auc")#y.values
confusionMatrix(pred, dataset$type)
plot(perf,col='red',lwd=3)
abline(a=0,b=1,lwd=2,lty=2,col="gray")
To plot a receiver operating characteristic curve you need to hand over continuous output of the classifier, e.g. posterior probabilities; that is, you need predict(data.rf, newdata, type = "prob").
Predicting with type = "response" already gives you the "hardened" factor as output, so your working point is implicitly fixed. With respect to that, your plot is correct.
Side note: in-bag predictions of random forests will be highly over-optimistic!
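A sketch of the suggested fix, assuming dataset$type is a two-level factor (party's predict with type = "prob" returns one probability vector per observation, and OOB = TRUE avoids the over-optimistic in-bag predictions):
library(ROCR)
prob <- predict(data.rf, OOB = TRUE, type = "prob")
p2   <- sapply(prob, function(p) p[2])        # probability of the second class level
preds <- prediction(p2, dataset$type)
perf  <- performance(preds, "tpr", "fpr")
plot(perf, col = "red", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")
performance(preds, "auc")@y.values            # AUC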
