I'm currently working on a highly unbalanced multi-class classification problem, so I'm considering balanced random forests (https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf). Do you have any experience implementing balanced random forests with H2O? If so, could you please elaborate on the following question:
Is it even possible to change H2O's default process of creating bootstrap samples so that each tree is grown on a balanced sub-sample of the original data set (i.e. for each tree in the forest, draw a bootstrap sample from the minority class, then randomly draw the same number of cases, with replacement, from the majority classes)?
H2O's random forest doesn't perform bootstrapping; instead, it samples rows at a rate of 63.2% (the expected fraction of unique rows in a bootstrap sample).
If you want a balanced sample, you can use the parameter balance_classes together with class_sampling_factors, or a weights_column.
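For illustration, here is a minimal R sketch of those parameters (the frame train_hex and the response column "class" are placeholders, not from the thread):
library(h2o)
h2o.init()

# Sketch only: balance_classes tells H2O to over/under-sample the training data,
# and class_sampling_factors gives explicit per-class ratios (in lexicographic
# order of the class labels); alternatively, supply a per-row weights_column.
rf <- h2o.randomForest(
  x                      = setdiff(names(train_hex), "class"),
  y                      = "class",
  training_frame         = train_hex,
  ntrees                 = 200,
  balance_classes        = TRUE,
  class_sampling_factors = c(1.0, 5.0, 10.0),  # assumes three classes
  seed                   = 42
)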
I know that a seed is generally set so that we can reproduce the same result. But what does setting the seed actually do in the random forest itself? Does it change any of the arguments of the randomForest() function in R, like ntree or sampsize?
I am using different seeds for my random forest model each time, but want to know how different seeds affect a random forest model.
Trees grow from seeds and so do forests ;-) (scnr)
There are different ways to build a random forest, but what they all have in common is that multiple trees are built. To improve classification accuracy over a single decision tree, the individual trees in a random forest need to differ; otherwise you would just have ntree copies of the same tree. This difference is achieved by introducing randomness into the generation of the trees. The randomness is influenced by the seed, and the most important property of the seed is that using the same seed should always generate the same result.
How does the randomness influence the tree building? There are multiple ways:
- Build each tree on a random subset: for each individual tree of the forest, a subset of the training examples is drawn, and a tree is built for that subset.
- At each split point in the tree, the splitting attribute is chosen from a random subset of the attributes.
Often these two elements are combined.
http://link.springer.com/article/10.1023%2FA%3A1010933404324#page-1
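For illustration, a minimal R sketch (using the built-in iris data, which is not from the thread) showing that the seed only controls the reproducibility of this randomness:
library(randomForest)

set.seed(1)
rf_a <- randomForest(Species ~ ., data = iris, ntree = 100)

set.seed(1)
rf_b <- randomForest(Species ~ ., data = iris, ntree = 100)

set.seed(2)
rf_c <- randomForest(Species ~ ., data = iris, ntree = 100)

# The same seed reproduces exactly the same forest ...
identical(rf_a$confusion, rf_b$confusion)   # TRUE
# ... while a different seed changes the bootstrap samples and the attribute
# subsets tried at each split, so the forest (and its OOB error) differs slightly.
identical(rf_a$confusion, rf_c$confusion)   # usually FALSE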
I am using random forests in my project. I want each individual tree to be built from all of the training samples, drawn with replacement, rather than from roughly 2/3 of the training set with the remaining 1/3 left out for the OOB error calculation.
How can I do this in Weka or R?
In short, I want to make sure the random forest uses all the training samples to build each individual tree, not just about 2/3 of the training set.
Any help is highly appreciated.
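In R's randomForest, one way to approach this (a sketch, not a definitive answer; iris is just a placeholder data set) is to control the per-tree sample with the sampsize and replace arguments. Note that if every row goes into every tree, there are no out-of-bag rows left, so the OOB error estimate is no longer meaningful:
library(randomForest)

# Drawing nrow(data) rows *without* replacement puts every training row into
# every tree; keep.inbag = TRUE lets us verify that afterwards.
rf_all <- randomForest(Species ~ ., data = iris,
                       ntree      = 100,
                       replace    = FALSE,
                       sampsize   = nrow(iris),
                       keep.inbag = TRUE)

all(rf_all$inbag == 1)   # TRUE: every row was used by every tree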
I'm trying to use random forests to predict the outcome of an extremely imbalanced data set (the rate of 1's is only about 1% or even less). Because the traditional randomForest minimizes the overall error rate rather than paying special attention to the positive class, it is not well suited to imbalanced data. So I want to assign a high cost to misclassification of the minority class (cost-sensitive learning).
I read in several sources that we can use the classwt option of randomForest in R, but I don't know how to use it. Do we have any other alternatives to the randomForest function?
classwt gives you the ability to assign a prior probability to each of the classes in your dataset. So, if you set classwt = c(0.5, 0.5), you are saying that before actually running the model on your specific dataset, you expect there to be around the same number of 0's as 1's. You can adjust these parameters as you like to try to minimize false negatives. Assigning a cost this way may seem like a smart idea in theory, but in practice it does not work so well: the prior probabilities tend to affect the algorithm more sharply than desired. Still, you could play around with it.
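For concreteness, a minimal sketch of how classwt is passed (the data frame data1 and the 0/1 factor response y are placeholders):
library(randomForest)

# classwt sets the class priors, in the order of levels(data1$y).
# Up-weighting the rare positive class pushes the forest toward predicting it.
rf_wt <- randomForest(y ~ ., data = data1,
                      ntree   = 500,
                      classwt = c(0.1, 0.9))   # priors for levels "0" and "1"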
An alternative solution is to run the regular random forest, and then for a prediction, use the type='prob' option in the predict() command. For instance, for a random forest rf1, where we are trying to predict the results of a dataset data1, we could do:
predictions <- predict(rf1, newdata = data1, type = 'prob')
Then, you can choose your own probability threshold for classifying the observations in your data. A nice way to see graphically which threshold may be desirable is to use the ROCR package, which generates receiver operating characteristic (ROC) curves.
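For example, a short sketch with ROCR (assuming data1$y holds the true 0/1 labels and predictions is the probability matrix from the call above):
library(ROCR)

# Column "1" of the probability matrix is the predicted probability of the
# positive class; ROCR sweeps over all possible thresholds.
pred_obj <- prediction(predictions[, "1"], data1$y)
roc      <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
plot(roc)                                      # pick a threshold along the curve
performance(pred_obj, "auc")@y.values[[1]]     # area under the ROC curve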
I'm using the randomForest package in R to predict a binary class based on 11 numerical predictors. Of the two classes, Hit or Miss, the class Hit is the more important one, i.e. I mainly care about how often Hit is predicted correctly.
Is there a way to give Hit a higher importance when training the random forest? Currently the trained random forest predicts merely 7% of the Hit cases correctly, and I would definitely like an improvement.
Higher importance? I don't know how to tell any algorithm "I'm not kidding this time: I want this analysis to be accurate."
You're always fighting the variance versus bias battle. If you improve the training accuracy too much, you run the risk of overfitting.
You can adjust a random forest by varying the size of the random sample of predictors considered at each split (mtry in randomForest). If you have m predictors, the usual recommendation for classification is p = sqrt(m) predictors tried at each split. You can also vary the number of trees. Plot the test classification error versus the number of trees for different values of p to see how you do.
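As a sketch (using a placeholder data frame dat with factor response class, and the OOB error as a convenient stand-in for the test error):
library(randomForest)

m     <- ncol(dat) - 1                       # number of predictors
mtrys <- c(floor(sqrt(m)), floor(m / 2), m)  # a few values of p to compare

for (i in seq_along(mtrys)) {
  rf <- randomForest(class ~ ., data = dat, ntree = 500, mtry = mtrys[i])
  # err.rate[, "OOB"] is the cumulative OOB error after 1..ntree trees
  if (i == 1) {
    plot(rf$err.rate[, "OOB"], type = "l", col = i,
         xlab = "number of trees", ylab = "OOB classification error")
  } else {
    lines(rf$err.rate[, "OOB"], col = i)
  }
}
legend("topright", legend = paste("mtry =", mtrys), col = seq_along(mtrys), lty = 1)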
You can also try another algorithm, like gbm (Generalized Boosted Regression Models) or support vector machines.
How does your data look when you plot it? Any obvious groups jumping out at you when you look at them in scatterplots?
Regardless of algorithm, I'd advise that you do n-fold validation of your model.
I have about 300,000 rows of data and 10 features in my model and I want to fit a random forest from the randomForest package in R.
To maximise the number of trees I can get in the forest in a fixed window of time, without ruining generalisation, what are sensible ranges to set the parameters to?
Usually you can get away with tuning just mtry, as explained here, and the default is often best:
https://stats.stackexchange.com/questions/50210/caret-and-randomforest-number-of-trees
But randomForest also has a function tuneRF that will help you find an optimal mtry (for a given ntreeTry), as explained here:
setting values for ntree and mtry for random forest regression model
How long it takes you will have to test yourself - it is going to be the product of folds x tuning values x ntrees.
The only speculative point I would add is that with 300,000 rows of data you might reduce the runtime, without losing predictive accuracy, by growing each tree on a smaller bootstrap sample of the data.
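A sketch of both ideas (X holding the 10 features and the response y are placeholders, not from the question):
library(randomForest)
set.seed(1)

# tuneRF searches over mtry, doubling/halving it (stepFactor) while the
# OOB error keeps improving by at least 'improve'.
tuned <- tuneRF(X, y, ntreeTry = 100, stepFactor = 2, improve = 0.01)

# Speculative speed-up for 300,000 rows: grow each tree on a smaller
# subsample via 'sampsize' and check whether accuracy actually suffers.
rf_fast <- randomForest(x = X, y = y, ntree = 500, sampsize = 30000)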