Extract rules from a randomForest model in R

I have created a random forest model for classification, but its memory footprint is too large. I want to use this random forest model to predict class labels for new observations. How can I extract only the rules associated with each tree, to reduce the memory size? A similar question has been asked before here:
https://stats.stackexchange.com/questions/102667/reduce-random-forest-model-memory-size.
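A minimal sketch of one common workaround (assuming the randomForest package and, purely for illustration, the iris data): drop the training-time diagnostics that predict() does not need for new observations, and use getTree() as the raw material for rule extraction. Which components are safe to drop is my assumption and worth verifying on your own model:

    library(randomForest)

    rf <- randomForest(Species ~ ., data = iris, ntree = 500)
    format(object.size(rf), units = "Kb")

    # OOB votes, fitted labels and error traces are training-time
    # diagnostics; as far as I know, predict() on new data does not
    # consult them, so they can be dropped to shrink the object.
    rf$votes     <- NULL
    rf$predicted <- NULL
    rf$oob.times <- NULL
    rf$err.rate  <- NULL

    format(object.size(rf), units = "Kb")
    predict(rf, newdata = iris[1:5, ])    # prediction still works

    # getTree() returns the split structure of tree k as a data
    # frame -- the starting point for extracting explicit rules.
    head(getTree(rf, k = 1, labelVar = TRUE))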

Related

Is there a package that allows online updating of glm with new data?

Is there a package that allows updating of a logistic regression with new data, without retraining the model on all the data points again?
That is, say I have fitted a model glm(y ~ ., data = X). Some time later I receive more data (more rows), Y. I need a model trained on rbind(X, Y), but instead of re-training on this combined dataset, it would be good if the model could simply update itself using Y (by either a Bayesian or a frequentist method).
The reason I am seeking an update method is that over time the dataset will grow to be huge, so re-training will become increasingly computationally infeasible.
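Not a glm() answer as such, but a hedged sketch of one existing option: the biglm package fits linear models incrementally, and its update() method absorbs new chunks of rows without revisiting the old ones. (For logistic regression, bigglm needs several passes over the data for IRLS, so a true one-shot update is harder.) X and Y below are placeholders for your original and new data frames:

    library(biglm)

    # Initial fit on the data available so far.
    fit <- biglm(y ~ x1 + x2, data = X)

    # Later, fold the new rows Y into the existing fit
    # without touching X again.
    fit <- update(fit, Y)
    summary(fit)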

How to predict with a “getTree” tree

I have a random forest model. With the getTree function I can retrieve every tree created in my random forest model. Now I want to check the predictions made by each tree for some observations, so I need a way to predict with each individual tree of the forest.
I found this question with the same objective, but unfortunately it has not been answered:
https://stackoverflow.com/q/40875489/3834837
Any propositions?
If you are referring to the randomForest package, then you can do this with predict(..., predict.all = TRUE), which gets you the prediction of every individual tree. You can then select whichever you want.
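A minimal sketch of that on iris (purely illustrative):

    library(randomForest)

    rf <- randomForest(Species ~ ., data = iris, ntree = 50)

    # predict.all = TRUE returns a list: $aggregate is the usual
    # forest vote, $individual is an n x ntree matrix of per-tree labels.
    pred <- predict(rf, newdata = iris[1:10, ], predict.all = TRUE)

    pred$individual[, 7]   # predictions made by tree number 7
    pred$aggregate         # the combined forest prediction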

Balanced random forest in R using H2O

I'm currently working on a highly unbalanced multi-class classification problem, so I'm considering balanced random forests (https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf). Do you have experience implementing balanced random forests in H2O? If so, could you please elaborate on the following question:
Is it even possible to change H2O's default process of creating bootstrap samples so that each tree is grown on a balanced sub-sample of the original data set (for each iteration of the random forest, draw a bootstrap sample from the minority class, then randomly draw the same number of cases, with replacement, from the majority classes)?
H2O's random forest doesn't perform bootstrapping; instead it samples rows without replacement at a rate of 63.2% (the expected fraction of unique rows in a bootstrap sample).
If you want a balanced sample, you can use the parameter balance_classes (optionally together with class_sampling_factors), or a weights_column.
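A hedged sketch of those parameters in the R API; the frame train, the response column "class" and the sampling factors below are placeholders, not a recipe:

    library(h2o)
    h2o.init()

    # `train` stands in for your H2OFrame; "class" is assumed to be
    # a factor response with three levels in this example.
    rf <- h2o.randomForest(
      x = setdiff(names(train), "class"),
      y = "class",
      training_frame = train,
      balance_classes = TRUE,                    # over-/under-sample to balance
      class_sampling_factors = c(1.0, 2.5, 5.0)  # per-class rates, in the
    )                                            # order of h2o.levels()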

What does seed do in random forest?

I know that a seed is generally set so that results can be reproduced. But what does setting the seed actually do in the random forest itself? Does it change any of the arguments of the randomForest() function in R, like ntree or sampsize?
I am using a different seed for my random forest model each time, and I want to know how different seeds affect the model.
Trees grow from seeds and so do forests ;-) (scnr)
There are different ways to build a random forest, but what they all have in common is that multiple trees are built. To improve classification accuracy over a single decision tree, the individual trees in a random forest need to differ; otherwise you would get ntree copies of the same tree. This difference is achieved by introducing randomness into the generation of the trees. The randomness is influenced by the seed, and the most important point about the seed is that using the same seed should always generate the same result.
How does the randomness influence the tree build? There are multiple ways.
- Build each tree on a random subset: for each individual tree of the forest, a subset of the training examples is drawn (the bootstrap sample), and the tree is built on that subset.
- At each decision point in the tree, restrict the choice of split attribute to a random subset of the attributes (mtry in randomForest).
Often these two elements are combined.
http://link.springer.com/article/10.1023%2FA%3A1010933404324#page-1
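The practical consequence is easy to check in R; a quick sketch on iris:

    library(randomForest)

    set.seed(42)
    rf1 <- randomForest(Species ~ ., data = iris, ntree = 100)

    set.seed(42)
    rf2 <- randomForest(Species ~ ., data = iris, ntree = 100)

    # Same seed => same bootstrap draws and same candidate-attribute
    # draws at each split, hence an identical forest.
    identical(rf1$predicted, rf2$predicted)   # TRUE

    set.seed(43)
    rf3 <- randomForest(Species ~ ., data = iris, ntree = 100)
    identical(rf1$predicted, rf3$predicted)   # almost surely FALSE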

Massive datasets with the randomForest package

I have about 300,000 rows of data and 10 features in my model, and I want to fit a random forest from the randomForest package in R.
To maximise the number of trees I can grow in a fixed window of time without hurting generalisation, what are sensible ranges to set the parameters to?
Usually you can get away with tuning just mtry, as explained here, and the default is often best:
https://stats.stackexchange.com/questions/50210/caret-and-randomforest-number-of-trees
But there is a function tuneRF in randomForest that will help you find the optimal mtry, as explained here:
setting values for ntree and mtry for random forest regression model
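A hedged sketch of tuneRF, where dat and its response column y are placeholders for your 300,000 x 10 data set:

    library(randomForest)

    # `dat` is assumed to hold 10 predictors plus a response `y`.
    best <- tuneRF(
      x          = dat[, setdiff(names(dat), "y")],
      y          = dat$y,
      ntreeTry   = 100,    # trees grown per candidate mtry
      stepFactor = 1.5,    # multiply/divide mtry by this at each step
      improve    = 0.01    # minimum relative OOB improvement to continue
    )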
How long it takes you will have to test yourself; roughly, it is proportional to the product folds × tuning steps × ntree.
The only speculative point I would add is that with 300,000 rows of data you might reduce the runtime without loss of predictive accuracy by growing each tree on a small subsample of the data, as sketched below.
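That idea maps onto the sampsize argument; a sketch, again with dat as a placeholder:

    library(randomForest)

    rf <- randomForest(
      y ~ ., data = dat,
      ntree    = 1000,     # many cheap trees
      sampsize = 10000,    # grow each tree on a 10k subsample, not all 300k rows
      mtry     = 3         # ~ sqrt(10), the classification default
    )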
