I know that setting a seed is generally used so that we can reproduce the same results. But what does setting the seed actually do in the random forest part? Does it change any of the arguments of the randomForest() function in R, like ntree or sampsize?
I am using different seeds for my random forest model each time, but I want to know how different seeds affect a random forest model.
Trees grow from seeds and so do forests ;-) (scnr)
There are different ways to build a random forest, but what they all have in common is that multiple trees are built. To improve classification accuracy over a single decision tree, the individual trees in a random forest need to differ; otherwise you would have ntree copies of the same tree. This difference is achieved by introducing randomness into the generation of the trees. The randomness is controlled by the seed, and what is most important about the seed is that using the same seed should always generate the same result.
How does the randomness influence the tree building? There are multiple ways:
- Build each tree on a random subset: for each individual tree of the forest, a subset of the training examples is drawn, and then a tree is built for this subset.
- At each decision point in the tree, the split attribute is selected from a random subset of the attributes.
Often these two elements are combined.
http://link.springer.com/article/10.1023%2FA%3A1010933404324#page-1
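To see concretely that the seed changes no arguments and only fixes the random draws, here is a minimal sketch using randomForest on the built-in iris data:

```r
library(randomForest)

# Setting the seed does not change any randomForest() arguments such as
# ntree or sampsize; it only fixes R's random number stream, so the same
# bootstrap samples and split-variable draws are reproduced.
set.seed(123)
rf1 <- randomForest(Species ~ ., data = iris, ntree = 100)

set.seed(123)
rf2 <- randomForest(Species ~ ., data = iris, ntree = 100)

identical(rf1$predicted, rf2$predicted)  # TRUE: same forest both times
```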
I'm constrained by the memory footprint / size of my Random Forest model, so I would prefer the number of trees to be as low as possible and the trees to be as shallow as possible while minimizing any impact on performance. Rather than setting up hyperparameter tuning to optimize for this, I am wondering whether I can just build one large Random Forest composed of many deep trees. From this, can I then get an estimate of the performance of hypothetical smaller models enclosed within it (and save myself the time of hyperparameter tuning; again, I'm looking to tune only those parameters that generally just need to be "big enough" for the data/problem)?
For example, if I build a model with 1500 trees, could I just extract 500 of these and predict from them to estimate the performance of using just 500 trees? (If I do this repeatedly, each time evaluating performance on a holdout set, I figure this should give an estimate of the performance of building a model with 500 trees, unless I'm missing something.) I should be able to do this similarly with maximum tree depth or minimum node size, correct?
How would I do this in R on a ranger model?
(I would appreciate any examples; one with parsnip would be a bonus. Guidance or verification that this is a reasonable approach for avoiding hyperparameter tuning of Random Forest models, for those hyperparameters that simply need to be "big"/"deep" enough, would also be helpful.)
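A minimal sketch of this idea in plain ranger, using iris as stand-in data: predict.ranger() exposes a num.trees argument that restricts prediction to the first trees of a fitted forest, and since the trees are grown independently, the first 500 of 1500 should behave like a random 500.

```r
library(ranger)

# Fit one large forest on the training data.
set.seed(42)
big_fit <- ranger(Species ~ ., data = iris, num.trees = 1500)

# Predict with only the first 500 trees of the same forest;
# because trees are grown independently, this mimics a 500-tree model.
pred_500 <- predict(big_fit, data = iris, num.trees = 500)
mean(pred_500$predictions == iris$Species)  # accuracy with 500 trees
```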
The algorithm implemented in randomForest() generates bootstrapped data. It is possible that, within a given bootstrap sample, some variables have zero variation. In this case, are these bootstrapped constants dropped before the mtry candidate variables are selected, or are they drawn and then dropped from the pool of candidate variables?
Relatedly, we can imagine a case where all of the bootstrapped variables exhibit zero variation. What does the package do in such edge cases?
Based on the original paper by Breiman (2001), nothing excludes the possibility that the bootstrap may generate variables with near-zero variance. BUT... this is not a big issue in the context of the random forest algorithm, which is based on growing numerous decision trees that are weakly correlated with each other.
In random forests, the bagging (bootstrap aggregation) procedure is used to reduce the variance of algorithms that have high variance, like decision trees (CART, for instance). Decision trees are sensitive to the specific data on which they are trained. If the training data is changed (e.g. a tree is trained on a subset of the training data), the resulting decision tree can be quite different and, in turn, the predictions can be quite different.
But, when bagging with decision trees, we are less concerned about individual trees overfitting the training data. This is why the individual decision trees are grown deep (e.g. few training samples at each leaf-node of the tree) and the trees are not pruned. These trees will have both high variance and low bias. These are important characteristics of sub-models when combining predictions using bagging.
There is at least one problem when dealing with bagged decision trees like CART: they are greedy. They choose which variable to split on using a greedy algorithm that minimizes error. As such, even with Bagging, the decision trees can have a lot of structural similarities and in turn, have a high correlation in their predictions.
It is here that Breiman found a solution: injecting even more randomness into the construction of the trees. In CART, when selecting a split point, the learning algorithm is allowed to look through all variables and all variable values in order to select the optimal split point. The random forest algorithm changes this procedure so that the learning algorithm is limited to a random sample of features to search over (the size of this sample is set by the mtry parameter).
All this to say that, even if a zero-variance variable is produced, the algorithm is purpose-built to be resistant to the weaknesses of individual decision trees. In the voting process, the erroneous prediction of a single tree is dampened by the rest of the forest.
This does not directly answer your question of "how the algorithm deals with near-zero variance predictors", but it serves to say that this is actually not a real problem in random forests.
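As a small illustration of the mtry mechanism, here is a minimal sketch on the built-in iris data: the larger mtry is, the more variables each split searches over, and the more correlated the individual trees become.

```r
library(randomForest)

set.seed(1)
p <- ncol(iris) - 1  # number of predictors

# Default: mtry = floor(sqrt(p)) candidate variables per split.
rf_default <- randomForest(Species ~ ., data = iris)

# CART-like: every variable is a candidate at every split (mtry = p),
# which makes the individual trees more similar to one another.
rf_allvars <- randomForest(Species ~ ., data = iris, mtry = p)

# Final out-of-bag error rates for comparison.
tail(rf_default$err.rate[, "OOB"], 1)
tail(rf_allvars$err.rate[, "OOB"], 1)
```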
Since I'm currently working on a highly unbalanced multi-class classification problem, I'm considering balanced random forests (https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf). Do you have any experience implementing balanced random forests using H2O? If so, could you please elaborate on the following question:
Is it even possible to change the default process of creating bootstrap samples within H2O so as to produce a balanced sub-sample of the original data set for each tree to grow on (i.e. for each iteration of the random forest, draw a bootstrap sample from the minority class, then randomly draw the same number of cases, with replacement, from the majority classes)?
H2O's random forest doesn't perform bootstrapping; instead it samples rows without replacement at a rate of 63.2% (which is the expected fraction of unique rows in a bootstrap sample).
If you want to get a balanced sample, you can use the parameter balance_classes together with class_sampling_factors, or use a weights_column.
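For instance, a minimal sketch using iris as a stand-in for an unbalanced training frame (the class_sampling_factors values here are placeholders, one per class label):

```r
library(h2o)
h2o.init()

train <- as.h2o(iris)  # stand-in for your unbalanced training frame

rf <- h2o.randomForest(
  x = setdiff(names(train), "Species"),
  y = "Species",
  training_frame = train,
  ntrees = 200,
  balance_classes = TRUE,
  # Over/under-sampling ratios per class (placeholder values here);
  # this argument requires balance_classes = TRUE.
  class_sampling_factors = c(1.0, 1.0, 1.0)
)
```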
I am using Random Forest in my project. I want to use all of the training set samples, with replacement, for creating the individual random trees, rather than using roughly two-thirds of the training set and leaving the remaining third for the OOB error calculation.
How can I do this in Weka or R?
In short, I want to make sure the random forest uses all training samples to generate the individual trees, and not just two-thirds of the training set.
Any help is highly appreciated.
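In R's randomForest package, the relevant knobs are replace and sampsize. A minimal sketch on the built-in iris data; note that the default already draws n rows with replacement, and that using every row in every tree removes the OOB estimate:

```r
library(randomForest)

n <- nrow(iris)

# Default behaviour: n draws WITH replacement per tree, which still
# leaves roughly 1/3 of the rows out-of-bag for each tree.
rf_default <- randomForest(Species ~ ., data = iris,
                           replace = TRUE, sampsize = n)

# Every row in every tree: sample n rows WITHOUT replacement.
# No rows are ever out-of-bag, so no OOB error can be computed.
rf_all <- randomForest(Species ~ ., data = iris,
                       replace = FALSE, sampsize = n)
```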
I'm using the randomForest package in R to predict a binary class based upon 11 numerical predictors. Of the two classes, Hit or Miss, the class Hit is of more importance, i.e. I mainly care about how often Hit is predicted correctly.
Is there a way to give Hit a higher importance when training the random forest? Currently the trained random forest predicts merely 7% of the Hit cases correctly, and I would definitely like an improvement.
Higher importance? I don't know how to tell any algorithm "I'm not kidding this time: I want this analysis to be accurate."
You're always fighting the variance versus bias battle. If you improve the training accuracy too much, you run the risk of overfitting.
You can adjust a random forest by varying the size of the random sample of predictors considered at each split. If you have m predictors, the usual recommendation is p = sqrt(m) predictors per split (the mtry parameter in randomForest). You can also vary the number of trees. Plot the test classification error versus the number of trees for different values of p to see how you do.
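A minimal sketch of that comparison, using simulated stand-in data (your real data frame and Hit/Miss response would replace the simulated dat):

```r
library(randomForest)

# Simulated stand-in for the real data: a binary 'class' response
# (Hit/Miss) with 11 numeric predictors.
set.seed(7)
dat <- data.frame(matrix(rnorm(500 * 11), ncol = 11))
dat$class <- factor(ifelse(rowSums(dat[, 1:3]) + rnorm(500) > 1.5,
                           "Hit", "Miss"))

idx   <- sample(nrow(dat), 0.7 * nrow(dat))
train <- dat[idx, ]
test  <- dat[-idx, ]

# Test error for a few candidate values of mtry (p).
err <- sapply(c(2, 3, 5), function(p) {
  fit <- randomForest(class ~ ., data = train, mtry = p, ntree = 500)
  mean(predict(fit, test) != test$class)
})
err
```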
You can also try another algorithm, like gbm (Generalized Boosted Regression Models) or support vector machines.
How does your data look when you plot it? Any obvious groups jumping out at you when you look at them in scatterplots?
Regardless of the algorithm, I'd advise that you do n-fold cross-validation of your model.