Massive datasets with the randomForest package in R

I have about 300,000 rows of data and 10 features in my model, and I want to fit a random forest with the randomForest package in R.
To maximise the number of trees I can grow in a fixed window of time without ruining generalisation, what are sensible ranges for the parameters?

Usually you can get away with tuning just mtry, as explained here, and the default is often best:
https://stats.stackexchange.com/questions/50210/caret-and-randomforest-number-of-trees
But randomForest also ships a function, tuneRF, that will help you find an optimal mtry (at a fixed ntreeTry), as explained here:
setting values for ntree and mtry for random forest regression model
The time it takes you will have to test yourself - it's going to scale with the product folds × tuning candidates × ntree.
The only speculative point I would add is that with 300,000 rows you might reduce the runtime without much loss of predictive accuracy by growing each tree on a smaller sample of the data (randomForest exposes this via its sampsize argument).
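A minimal sketch of both knobs, assuming a data frame df with response y and the ten predictors (all names and the specific values below are illustrative, not taken from the question):

library(randomForest)

# tuneRF searches over mtry, growing/shrinking it by stepFactor until
# the out-of-bag error stops improving by at least `improve`.
tuned <- tuneRF(x = df[, setdiff(names(df), "y")], y = df$y,
                ntreeTry = 100, stepFactor = 1.5, improve = 0.01)

# sampsize controls how many of the 300,000 rows each tree sees, which
# cuts runtime roughly in proportion; ntree can then be as large as the
# time budget allows.
fit <- randomForest(y ~ ., data = df, ntree = 500,
                    sampsize = 30000,   # e.g. 10% of the rows per tree
                    mtry = 3)           # or whatever tuneRF settled on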

Related

Review performance of smaller model subsets of a large Random Forest model?

I'm constrained by the memory footprint / size of my Random Forest model, so I would prefer the number of trees to be as low as possible and the trees to be as shallow as possible while minimizing any impact on performance. Rather than setting up hyperparameter tuning to optimize for this, I am wondering whether I can just build one large Random Forest composed of many deep trees. From that, can I then estimate the performance of the hypothetical smaller models enclosed within it (and save myself the time of hyperparameter tuning -- again, I'm looking to tune only those parameters that generally just need to be "big enough" for the data/problem)?
For example, if I build a model with 1500 trees, could I extract 500 of them and predict from those alone to estimate the performance of using just 500 trees? (If I do this repeatedly, each time evaluating performance on a holdout set, I figure this should give an estimate of the performance of building a 500-tree model -- unless I'm missing something.) I should be able to do the same with max tree depth or minimum node size, correct?
How would I do this in R on a ranger model?
(I would appreciate any examples; with parsnip would be a bonus. Guidance / verification that this is a reasonable way to avoid hyperparameter tuning for those Random Forest hyperparameters that simply need to be "big"/"deep" enough would also be helpful.)
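For the number-of-trees part at least, here is a hedged sketch (train, test, and the response y are placeholder names): predict.ranger takes a num.trees argument that restricts prediction to the first k trees of a fitted forest, and because the trees are grown independently, the first k behave like a random sample of k trees.

library(ranger)

big <- ranger(y ~ ., data = train, num.trees = 1500)

# One large forest stands in for smaller hypothetical models by
# predicting with only its first k trees.
for (k in c(100, 250, 500, 1000, 1500)) {
  pred <- predict(big, data = test, num.trees = k)$predictions
  cat(k, "trees: holdout RMSE =", sqrt(mean((test$y - pred)^2)), "\n")
}

Note that this trick only covers the number of trees: max.depth and min.node.size change how every tree is grown, so shallower-tree models cannot be read off a deep forest and still need their own fits (in parsnip terms, trees maps to num.trees, but min_n would need real re-tuning).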

Assessing LDA predictions with textmineR in R - Calculating perplexity?

I am working on an LDA model with textmineR, have calculated coherence and log-likelihood measures, and have optimized my model.
As a last step I would like to see how well the model predicts topics on unseen data. Thus, I am using the predict() function from the textmineR package in combination with Gibbs sampling on my test-set sample.
This results in predicted "Theta" values for each document in my test-set sample.
While I have read in another post that perplexity calculations are not available in the textmineR package (see this post: How do i measure perplexity scores on a LDA model made with the textmineR package in R?), I am now wondering what the purpose of the prediction function is then. Especially with a large dataset of over 100,000 documents it is hard to just visually assess whether the prediction has performed well or not.
I do not want to use perplexity for model selection (I am using coherence/log-likelihood instead), but as far as I understand, perplexity would help me to understand how good the prediction is and how "surprised" the model is by new, previously unseen data.
Since this does not seem to be available for textmineR, I am not sure how to assess the model prediction. Is there anything else that I could use to measure the prediction quality of my textmineR model?
Thank you!
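Not an official textmineR feature, but one hedged possibility: since predict() returns theta for the held-out documents and the fitted model carries phi, a held-out log-likelihood and perplexity can be assembled by hand. In this sketch, model, dtm_test, and the predict() settings are placeholders, and dtm_test is assumed to share its vocabulary (columns) with model$phi:

library(textmineR)

theta_test <- predict(model, dtm_test, method = "gibbs",
                      iterations = 200, burnin = 100)

# Model-implied word distribution for each document: theta %*% phi
p_words <- theta_test %*% model$phi

# Held-out log-likelihood of the observed counts (dense for brevity;
# use sparse arithmetic if the test DTM is large)
ll <- sum(as.matrix(dtm_test) * log(p_words + 1e-12))

# Perplexity: exponentiated negative average log-likelihood per token;
# lower means the model is less "surprised" by the unseen documents
exp(-ll / sum(dtm_test))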

Balanced random forest in R using H2O

Because I'm currently working on a highly unbalanced multi-class classification problem, I'm considering balanced random forests (https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf). Do you have some experience implementing balanced random forests using H2O? If so, could you please elaborate on the following question:
Is it even possible to change H2O's default process for creating bootstrap samples so that each tree grows on a balanced sub-sample of the original data set (for each iteration of the random forest, draw a bootstrap sample from the minority class, then randomly draw the same number of cases, with replacement, from the majority classes)?
H2O's random forest doesn't perform bootstrapping; instead it samples at a rate of 63.2%, which is the expected fraction of unique rows in a bootstrap sample.
If you want a balanced sample, you can use the parameter balance_classes together with class_sampling_factors, or use weights_column.
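A minimal sketch, with train as a placeholder H2OFrame whose response column is class and whose factor levels sort with the minority class first (the 5.0 is illustrative):

library(h2o)
h2o.init()

rf <- h2o.randomForest(
  x = setdiff(names(train), "class"),
  y = "class",
  training_frame = train,
  ntrees = 200,
  balance_classes = TRUE,                # re-sample to balance the classes
  class_sampling_factors = c(5.0, 1.0),  # per-class over/under-sampling ratios,
                                         # ordered by the lexicographic class levels
  seed = 42
)

Alternatively, leave the sampling untouched and pass a numeric column of per-row weights via weights_column.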

Feature selection and prediction accuracy in regression Forest in R

I am attempting to solve a regression problem where the input feature set is of size ~54.
Using OLS linear regression with a single predictor 'X1', I am not able to explain the variation in Y, so I am trying to find additional important features using a regression forest (i.e., random forest regression). The predictor 'X1' later turns out to be the most important feature.
My dataset has ~14500 entries. I have separated it into training and test sets in the ratio 9:1.
I have the following questions:
when trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?
Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?
For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared for both the training and test sets. I am getting high MSE and low R2 on the training data, and the reverse on the test data (shown below). Is this unusual?
forest <- randomForest(fmla, dTraining, ntree = 501, importance = TRUE)

# Note: predict.randomForest has no `data` argument, only `newdata`, so
# `data = dTraining` is silently ignored and these two calls return the
# out-of-bag predictions for the training set.
mean((dTraining$y - predict(forest, data = dTraining))^2)
# 0.9371891
rSquared(dTraining$y, dTraining$y - predict(forest, data = dTraining))
# 0.7431078

mean((dTest$y - predict(forest, newdata = dTest))^2)
# 0.009771256
rSquared(dTest$y, dTest$y - predict(forest, newdata = dTest))
# 0.9950448
Also, are R-squared and MSE good metrics for this problem, or do I need to look at other metrics to evaluate whether the model is good?
You should also try asking this on Cross Validated.
when trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?
Only on the training data. You want to prevent overfitting, which is why you do a train-test split in the first place.
Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?
Yes, but the purpose of feature selection is not necessarily to speed up computation. With enough features you can fit any pattern in the data (i.e., overfit). With feature selection, you're hoping to prevent overfitting by using only a few 'robust' features, as sketched below.
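For example, a hedged sketch of that workflow using randomForest's permutation importance; dTraining, dTest, and the cutoff of ten features are placeholders:

library(randomForest)

forest_all <- randomForest(y ~ ., data = dTraining, ntree = 501,
                           importance = TRUE)

# type = 1 gives %IncMSE, the permutation importance for regression
imp <- importance(forest_all, type = 1)
top_vars <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:10]

# Refit on the top features only and compare the holdout error
forest_top <- randomForest(reformulate(top_vars, response = "y"),
                           data = dTraining, ntree = 501)
mean((dTest$y - predict(forest_top, newdata = dTest))^2)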
For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared for both the training and test sets. I am getting high MSE and low R2 on the training data, and the reverse on the test data (shown below). Is this unusual?
Yes, it's unusual. You want low MSE and high R2 values for both your training and test data. (I would double-check your calculations.) Getting high MSE and low R2 on your training data would mean the training itself was poor, which is very surprising. One thing to check: predict(forest, data=dTraining) silently ignores data (the argument is newdata), so your "training" numbers are actually out-of-bag estimates. Also, if your rSquared is miscTools::rSquared, it expects the residuals as its second argument, so your existing calls are consistent with that.

class importance for random forest in r

I'm using the randomForest package in R to predict a binary class from 11 numerical predictors. Of the two classes, Hit and Miss, the class Hit is the more important one, i.e. I mainly care about how often Hit is correctly predicted.
Is there a way to give Hit a higher importance in training the random forest? Currently the trained random forest predicts merely 7% of the Hit cases correctly, and I would definitely like an improvement.
Higher importance? I don't know how to tell any algorithm "I'm not kidding this time: I want this analysis to be accurate."
You're always fighting the variance versus bias battle. If you improve the training accuracy too much, you run the risk of overfitting.
You can adjust a random forest by varying the size of the random sample of predictors considered at each split. If you have m predictors, the usual recommendation is to sample p = sqrt(m) predictors at each split (randomForest's mtry). You can also vary the number of trees. Plot the test classification error versus the number of trees for different values of p to see how you do (see the sketch at the end of this answer).
You can also try another algorithm, like gbm (Generalized Boosted Regression Models) or support vector machines.
How does your data look when you plot it? Any obvious groups jumping out at you when you look at them in scatterplots?
Regardless of the algorithm, I'd advise that you do n-fold cross-validation of your model.
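A minimal sketch of that plot, using the per-class out-of-bag error that randomForest stores in err.rate as a stand-in for test error; train and the response column class are placeholder names, and the mtry values bracket sqrt(11) ≈ 3.3:

library(randomForest)

mtry_vals <- c(2, 3, 6)
plot(NULL, xlim = c(1, 500), ylim = c(0, 1),
     xlab = "number of trees", ylab = "OOB error for class Hit")
for (i in seq_along(mtry_vals)) {
  fit <- randomForest(class ~ ., data = train,
                      mtry = mtry_vals[i], ntree = 500)
  # err.rate has one row per tree: an overall "OOB" column plus one
  # column per class; the "Hit" column is the error rate on Hit cases.
  lines(fit$err.rate[, "Hit"], col = i)
}
legend("topright", legend = paste("mtry =", mtry_vals),
       col = seq_along(mtry_vals), lty = 1)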
