H2O random forest plot in R

I'm new to h2o and I'm having difficulty with this package in R.
I'm using training and test sets of 5,100 and 2,300 observations respectively, with 18,917 variables and a binary target (0/1).
I ran a random forest:
train_h2o <- as.h2o(train)
test_h2o <- as.h2o(test)
forest <- h2o.randomForest(x = Words,            # vector of predictor names
                           y = 18918,            # index of the target column
                           training_frame = train_h2o,
                           ntrees = 250,         # h2o uses ntrees, not ntree
                           validation_frame = test_h2o,
                           seed = 8675309)
I know I can plot the logloss or MSE, etc., as the number of trees changes.
But is there a way to plot an image of the model itself? I mean the final ensembled tree used for the final predictions.
Also, another question: in the randomForest package I could use the varImp function, which returned, as well as the absolute importance, class-specific measures (computed as the mean decrease in accuracy), which I interpreted as a class-relative measure of variable importance.
(varImp matrix from the randomForest package shown as an image in the original post.)
In the h2o package I only find the absolute importance measure; is there something similar?
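For reference, a minimal sketch of the class-specific output I mean, using the built-in iris data rather than my own (randomForest needs importance = TRUE at fit time to produce the per-class columns):
library(randomForest)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
importance(rf)  # one mean-decrease-in-accuracy column per class, plus overall measures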

There is no final tree at the end of a random forest in R with the randomForest package. To make the final prediction, a random forest uses voting. Voting means, for any data point:
For Class 0:
(number of trees that predict the data point as Class 0) / (total number of trees in the forest)
For Class 1 it is the same as for Class 0:
(number of trees that predict the data point as Class 1) / (total number of trees in the forest)
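As a concrete illustration of the voting, here is a minimal sketch using the randomForest package on the built-in iris data (a stand-in, not your model):
library(randomForest)
rf <- randomForest(Species ~ ., data = iris, ntree = 250)
# Per-tree class votes for every observation
votes <- predict(rf, newdata = iris, predict.all = TRUE)$individual
# Vote fraction for class "setosa" on observation 1 ...
mean(votes[1, ] == "setosa")
# ... which matches the reported class probability
predict(rf, newdata = iris, type = "prob")[1, ]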
However, you can fit and plot a single conditional inference tree with ctree:
library("party")
x <- ctree(Class ~ ., data=data)
plot(x)
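If you specifically want to look inside the h2o forest itself, recent h2o releases expose individual trees from a fitted model; a hedged sketch (check ?h2o.getModelTree, as the API may vary by version):
# Assumes a recent h2o version providing h2o.getModelTree();
# returns an H2OTree object describing one tree of the ensemble
tree1 <- h2o.getModelTree(model = forest, tree_number = 1)
tree1@root_node  # traverse the tree structure from here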

Related

Feature importance plot using xgb and also ranger. Best way to compare

I'm working on a script that trains both a ranger random forest and an xgb regression. Depending on which performs best based on RMSE, one or the other is used to predict on the hold-out data.
I would also like to return feature importance for both in a comparable way.
With the xgboost library, I can get my feature importance table and plot like so:
> xgb.importance(model = regression_model)
                 Feature        Gain       Cover  Frequency
1:              spend_7d 0.981006272 0.982513621 0.79219969
2:                   IOS 0.006824499 0.011105014 0.08112324
3:  is_publisher_organic 0.006379284 0.002917203 0.06770671
4: is_publisher_facebook 0.005789945 0.003464162 0.05897036
Then I can plot it like so:
> xgb.importance(model = regression_model) %>% xgb.plot.importance()
That was using the xgboost library and its functions. With a ranger random forest, if I fit a regression model, I can get feature importance if I include importance = 'impurity' while fitting the model. Then:
regression_model$variable.importance
spend_7d d7_utility_sum recent_utility_ratio IOS is_publisher_organic is_publisher_facebook
437951687132 0 0 775177421 600401959 1306174807
I could just create a ggplot. But the scales are entirely different between what ranger returns in that table and what xgb shows in the plot.
Is there an out-of-the-box library or solution where I can plot the feature importance of either the xgb or ranger model in a comparable way?
Both the "Gain" column of xgboost and the ranger importances obtained with parameter "impurity" are constructed from the total decrease in impurity (hence gain) over the splits on a given variable.
The only difference appears to be that while xgboost automatically normalizes the importances to percentages, ranger keeps them as raw values (sums of squared-error decreases), which are not very handy to plot. You can therefore transform the ranger importances by dividing them by their total sum, so that you get the equivalent percentages as in xgboost.
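A minimal sketch of that rescaling, assuming a ranger object called regression_model fitted with importance = 'impurity' as above:
# Rescale ranger's raw impurity importances to shares of the total,
# making them comparable with xgboost's Gain column
imp <- regression_model$variable.importance
imp_pct <- sort(imp / sum(imp), decreasing = TRUE)
barplot(imp_pct, las = 2, main = "ranger importance (share of total)")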
Since impurity decrease can sometimes be misleading, however, I suggest you compute the variable importances for both models via permutation. This gives you importances that are directly comparable across different models, and it is more stable.
I suggest this incredibly helpful post.
Here is the permutation importance as defined there (sorry, it's Python, not R):
import numpy as np

def permutation_importances(rf, X_train, y_train, metric):
    # Score the model on the unmodified training data
    baseline = metric(rf, X_train, y_train)
    imp = []
    for col in X_train.columns:
        # Shuffle one column, re-score, then restore the column
        save = X_train[col].copy()
        X_train[col] = np.random.permutation(X_train[col])
        m = metric(rf, X_train, y_train)
        X_train[col] = save
        # Importance = how much the score drops when the column is shuffled
        imp.append(baseline - m)
    return np.array(imp)
However, ranger also allows permutation importances to be computed directly via importance = "permutation", and xgboost might do so as well.
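On the ranger side that is a single argument at fit time; a minimal sketch on the built-in iris data (your formula and data will differ):
library(ranger)
# Permutation importance is computed during fitting
fit <- ranger(Sepal.Length ~ ., data = iris, importance = "permutation")
sort(fit$variable.importance, decreasing = TRUE)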

Predicted probabilities in R ranger package

I am trying to build a model in R with random forest classification (by adapting code by Ned Horning). I first used the randomForest package but then found ranger, which promises faster calculations.
At first, I used the code below to get predicted probabilities for each class after fitting the model with randomForest:
predProbs <- as.data.frame(predict(randfor, imageBlock, type='prob'))
The kind of probability I mean is as follows:
we have 500 trees in the model, and 250 of them say the observation is class 1, hence the probability is 250/500 = 50%.
In ranger, I realized that there is no type = 'prob' option.
I searched and tried some adjustments but couldn't make any progress. I need an object containing probabilities, as described above, from the ranger package.
Could anyone give some advice about the issue?
You need to train a "probabilistic classifier"-type ranger object:
library("ranger")
iris.ranger <- ranger(Species ~ ., data = iris, probability = TRUE)
This object returns a matrix of shape (n_samples, n_classes) when used with the predict.ranger function:
probabilities <- predict(iris.ranger, data = iris)$predictions
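Each row then holds the per-class probabilities (the vote fractions described in the question) and sums to 1; a quick check:
head(probabilities)           # one column per class
rowSums(head(probabilities))  # each row sums to 1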

Variable importance in Caret

I am using the caret package in R to train a logistic regression model for a binary classification problem. I have been able to get the results, accuracy, etc., but I also want the importance of the variables (in decreasing order of importance). I used the varImp() function, but according to the documentation, the importance depends on the class:
"For most classification models, each predictor will have a separate variable importance for each class (the exceptions are classification trees, bagged trees and boosted trees)."
How can I get the variable importance for each class?
Thank you
For the first part, have you tried:
round(importance(myModel$finalModel), 2)
For putting that in decreasing order:
imp <- round(importance(myModel$finalModel), 2)
dfimp <- data.frame(feature = rownames(imp), MeanDecreaseGini = as.numeric(imp))
dfimp[order(dfimp$MeanDecreaseGini, decreasing = TRUE), ]

How can I pass a weight decay argument to mlogit()?

How can I specify weight decay in a model fit by mlogit()?
The multinom() function in nnet allows you to specify weight decay for the model being fit, and mlogit uses this function behind the scenes to fit its models, so I imagine it should be possible to pass the decay argument through to multinom(), but I have so far not found a way to do this.
So far I have attempted to simply pass a value in the model call, like this:
library(mlogit)
set.seed(1)
data("Fishing", package = "mlogit")
Fishing$wts <- runif(nrow(Fishing)) #for some weights
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
fit1 <- mlogit(mode ~ 0 | income, data = Fish, weights = wts, decay = .01)
fit2 <- mlogit(mode ~ 0 | income, data = Fish, weights = wts)
But the output is exactly the same:
identical(logLik(fit1), logLik(fit2))
[1] TRUE
mlogit() and nnet::multinom() both fit multinomial logistic models (predicting the probability of class membership for multiple classes), but they use different algorithms: nnet::multinom() fits the model as a neural network, while mlogit() uses maximum likelihood directly.
Weight decay is a parameter for neural networks and is not applicable to maximum likelihood.
The effect of weight decay is to keep the weights in the neural network from getting too large, by penalizing larger weights during the weight-update step of the fitting algorithm. This helps to prevent over-fitting and hopefully produces a more general model.
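If weight decay specifically is what you are after, you can fit the model with nnet::multinom() directly, which accepts decay; a sketch on the wide-format data (the parameterization differs from mlogit's):
library(nnet)
# decay is passed through to the underlying nnet fit as the penalty
fit_decay <- multinom(mode ~ income, data = Fishing, weights = wts, decay = 0.01)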
Alternatively, consider the pmlr function in the pmlr package. This function implements penalized maximum likelihood estimation for multinomial logistic regression when called with its default parameter penalized = TRUE.
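A hedged sketch of what that might look like on the same data (I have not verified pmlr's full argument list; see ?pmlr):
library(pmlr)
# Penalized maximum likelihood multinomial logit; penalized = TRUE is
# the default. Assumes pmlr accepts the wide-format Fishing data directly.
fit_pen <- pmlr(mode ~ income, data = Fishing, penalized = TRUE)
summary(fit_pen)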

How to get the nodal raw numbers (from all the trees, for a particular test vector) from which a random forest calculates its prediction in R?

I'd like to predict a distribution rather than a single number using random forest regression in R. To do this, I'd like to get all the numbers from which the random forest calculates (averages) the predicted value for a particular test vector. How can I get this done?
To be specific,
I'm not growing each tree to its full size but limiting it using the nodesize parameter. In this case, I'm interested not in the prediction of each tree in the forest (which is given by setting predict.all to TRUE), but in all the data points from which those predictions are calculated; that is, all the training points in the node on which a new observation lands, for every tree in the forest.
Thanks,
The function predict.randomForest has a logical parameter predict.all exactly for this purpose.
library("randomForest")
rf = randomForest(Species ~ ., data = iris)
?predict.randomForest
allpred = predict(rf, newdata = iris, predict.all = TRUE)
Now allpred$individual is a matrix whose columns correspond to the individual decision trees.
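If you need the raw training values inside each terminal node rather than the per-tree predictions, one hedged approach uses predict(..., nodes = TRUE), which reports the terminal node each observation falls into in every tree; a sketch on iris as a regression stand-in:
library(randomForest)
rf <- randomForest(Sepal.Length ~ ., data = iris, nodesize = 10)
# Terminal-node ids (rows = observations, columns = trees)
train_nodes <- attr(predict(rf, newdata = iris, nodes = TRUE), "nodes")
new_nodes <- attr(predict(rf, newdata = iris[1, ], nodes = TRUE), "nodes")
# For each tree, the training responses sharing a node with iris[1, ]
cotenants <- lapply(seq_len(rf$ntree), function(k)
  iris$Sepal.Length[train_nodes[, k] == new_nodes[1, k]])
# Pooled node values sketch a predictive distribution; note each tree
# actually averages only its in-bag points within a node
hist(unlist(cotenants))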
