setting values for ntree and mtry for random forest regression model - r

I'm using R package randomForest to do a regression on some biological data. My training data size is 38772 X 201.
I just wondered---what would be a good value for the number of trees ntree and the number of variable per level mtry? Is there an approximate formula to find such parameter values?
Each row in my input data is a 200 character representing the amino acid sequence, and I want to build a regression model to use such sequence in order to predict the distances between the proteins.

The default for mtry is quite sensible so there is not really a need to muck with it. There is a function tuneRF for optimizing this parameter. However, be aware that it may cause bias.
There is no optimization for the number of bootstrap replicates. I often start with ntree=501 and then plot the random forest object. This will show you the error convergence based on the OOB error. You want enough trees to stabilize the error but not so many that you over correlate the ensemble, which leads to overfit.
Here is the caveat: variable interactions stabilize at a slower rate than error so, if you have a large number of independent variables you need more replicates. I would keep the ntree an odd number so ties can be broken.
For the dimensions of you problem I would start ntree=1501. I would also recommended looking onto one of the published variable selection approaches to reduce the number of your independent variables.

The short answer is no.
The randomForest function of course has default values for both ntree and mtry. The default for mtry is often (but not always) sensible, while generally people will want to increase ntree from it's default of 500 quite a bit.
The "correct" value for ntree generally isn't much of a concern, as it will be quite apparent with a little tinkering that the predictions from the model won't change much after a certain number of trees.
You can spend (read: waste) a lot of time tinkering with things like mtry (and sampsize and maxnodes and nodesize etc.), probably to some benefit, but in my experience not a lot. However, every data set will be different. Sometimes you may see a big difference, sometimes none at all.
The caret package has a very general function train that allows you to do a simple grid search over parameter values like mtry for a wide variety of models. My only caution would be that doing this with fairly large data sets is likely to get time consuming fairly quickly, so watch out for that.
Also, somehow I forgot that the ranfomForest package itself has a tuneRF function that is specifically for searching for the "optimal" value for mtry.

Could this paper help ?
Limiting the Number of Trees in Random Forests
Abstract. The aim of this paper is to propose a simple procedure that
a priori determines a minimum number of classifiers to combine in order
to obtain a prediction accuracy level similar to the one obtained with the
combination of larger ensembles. The procedure is based on the McNemar
non-parametric test of significance. Knowing a priori the minimum
size of the classifier ensemble giving the best prediction accuracy, constitutes
a gain for time and memory costs especially for huge data bases
and real-time applications. Here we applied this procedure to four multiple
classifier systems with C4.5 decision tree (Breiman’s Bagging, Ho’s
Random subspaces, their combination we labeled ‘Bagfs’, and Breiman’s
Random forests) and five large benchmark data bases. It is worth noticing
that the proposed procedure may easily be extended to other base
learning algorithms than a decision tree as well. The experimental results
showed that it is possible to limit significantly the number of trees. We
also showed that the minimum number of trees required for obtaining
the best prediction accuracy may vary from one classifier combination
method to another
They never use more than 200 trees.

One nice trick that I use is to initially start with first taking square root of the number of predictors and plug that value for "mtry". It is usually around the same value that tunerf funtion in random forest would pick.

I use the code below to check for accuracy as I play around with ntree and mtry (change the parameters):
results_df <- data.frame(matrix(ncol = 8))
colnames(results_df)[1]="No. of trees"
colnames(results_df)[2]="No. of variables"
colnames(results_df)[3]="Dev_AUC"
colnames(results_df)[4]="Dev_Hit_rate"
colnames(results_df)[5]="Dev_Coverage_rate"
colnames(results_df)[6]="Val_AUC"
colnames(results_df)[7]="Val_Hit_rate"
colnames(results_df)[8]="Val_Coverage_rate"
trees = c(50,100,150,250)
variables = c(8,10,15,20)
for(i in 1:length(trees))
{
ntree = trees[i]
for(j in 1:length(variables))
{
mtry = variables[j]
rf<-randomForest(x,y,ntree=ntree,mtry=mtry)
pred<-as.data.frame(predict(rf,type="class"))
class_rf<-cbind(dev$Target,pred)
colnames(class_rf)[1]<-"actual_values"
colnames(class_rf)[2]<-"predicted_values"
dev_hit_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, predicted_values ==1))
dev_coverage_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, actual_values ==1))
pred_prob<-as.data.frame(predict(rf,type="prob"))
prob_rf<-cbind(dev$Target,pred_prob)
colnames(prob_rf)[1]<-"target"
colnames(prob_rf)[2]<-"prob_0"
colnames(prob_rf)[3]<-"prob_1"
pred<-prediction(prob_rf$prob_1,prob_rf$target)
auc <- performance(pred,"auc")
dev_auc<-as.numeric(auc#y.values)
pred<-as.data.frame(predict(rf,val,type="class"))
class_rf<-cbind(val$Target,pred)
colnames(class_rf)[1]<-"actual_values"
colnames(class_rf)[2]<-"predicted_values"
val_hit_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, predicted_values ==1))
val_coverage_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, actual_values ==1))
pred_prob<-as.data.frame(predict(rf,val,type="prob"))
prob_rf<-cbind(val$Target,pred_prob)
colnames(prob_rf)[1]<-"target"
colnames(prob_rf)[2]<-"prob_0"
colnames(prob_rf)[3]<-"prob_1"
pred<-prediction(prob_rf$prob_1,prob_rf$target)
auc <- performance(pred,"auc")
val_auc<-as.numeric(auc#y.values)
results_df = rbind(results_df,c(ntree,mtry,dev_auc,dev_hit_rate,dev_coverage_rate,val_auc,val_hit_rate,val_coverage_rate))
}
}

Related

Computational speed of a complex Hierarchical GAM

I have a large dataset (3.5+ million observations) of a binary response variable that I am trying to compute a Hierarchical GAM with a global smoother with individual effects that have a Shared penalty (e.g. 'GS' in Pedersen et al. 2019). Specifically I am trying to estimate the following structure: Global > Geographic Zone (N=2) > Bioregion (N=20) > Season (N varies by bioregion). In total, I am trying to estimate 36 different nested parameters.
Here is the the code I am currently using:
modGS <- bam(
outbreak ~
te(days_diff,NDVI_mean,bs=c("tp","tp"),k=c(5,5)) +
t2(days_diff, NDVI_mean, Zone, Bioregion, Season, bs=c("tp", "tp","re","re","re"),k=c(5, 5), m=2, full=TRUE) +
s(Latitude,Longitude,k=50),
family=binomial(),select = TRUE,data=dat)
My main issue is that it is taking a long time (5+ days) to construct the model. This nesting structure cannot be discretized, so I cannot compute it in parallel. Further I have tried gamm4 but I ran into memory limit issues. Here is the gamm4 code:
modGS <- gamm4(
outbreak ~
t2(days_diff,NDVI_mean,bs=c("tp","tp"),k=c(5,5)) +
t2(days_diff, NDVI_mean, Zone, Bioregion, Season, bs=c("tp", "tp","re","re","re"),k=c(5, 5), m=2, full=TRUE) +
s(Latitude,Longitude,k=50),
family=binomial(),select = TRUE,data=dat)
What is the best/most computationally feasible way to run this model?
I cut down the computational time by reducing the amount of bioregion levels and randomly sampling ca. 60% of the data. This actually allow me to calculate OOB error for the model.
There is an article I read recently that has a specific section on decreasing computational time. The main things they highlight are:
Use the bam function with it's useful fREML estimation, which refactorizes the model matrix to make calculation faster. Here it seems you have already done that.
Adding the discrete = TRUE argument, which assumes only a smaller finite number of unique values for estimation.
Manipulating nthreads in this function so it runs more than one core in parallel in your computer.
As the authors caution, the second option can reduce the amount of accuracy in your estimates. I fit some large models recently doing this and found that it was not always the same as the default bam function, so its best to use this as a quick inspection rather than the full result you are looking for.

h2o.GBM taking too long on a small sized dataset

I've got a rather small dataset (162,000 observations with 13 attributes)
that I'm trying to use for modelling, using h2o.GBM. The response variable is categorical with large number of levels (~ 20,000 levels)
The model doesn't run out of memory or give any errors, but it's been going for nearly 24 hours without any progress (says 0% on H2o.GBM reporting)
I finally gave in and stopped it.
I'm wondering if there's anything wrong with my hyperparameters, as data is not particularly large.
here's my code:
library(h2o)
localH2O <- h2o.init(nthreads = -1, max_mem_size = "12g")
train.h20 <- as.h2o(analdata_train)
gbm1 <- h2o.gbm(
y = response_var
, x = independ_vars
, training_frame = train.h20
, ntrees = 3
, max_depth = 5
, min_rows = 10
, stopping_tolerance = 0.001
, learn_rate = 0.1
, distribution = "multinomial"
)
The way H2O GBM multinomial classification works is, when you ask for 1 tree as a parameter, it actually builds a tree for each level in the response column underneath the hood.
So 1 tree really means 20,000 trees in your case.
2 trees would really mean 40,000, and so on...
(Note the binomial classification case takes a shortcut and builds only one tree for both classes.)
So... it will probably finish but it could take quite a long time!
It's probably not a good idea to train a classifier with 20,000 classes -- most GBM implementations won't even let you do that. Can you group/cluster the classes into a smaller number of groups so that you can train a model with a smaller number of classes? If so, then you could perform your training in a two-stage process -- the first model would have K classes (assuming you clustered your classes into K groups). Then you can train secondary models that further classify the observations into your original classes.
This type of two-stage process may make sense if your classes represent groups that naturally clusters into a hierarchy of groups -- such as zip codes or ICD-10 medical diagnostic codes, for example.
If your use-case really demands that you train a 20,000 class GBM (and there's no way around it), then you should get a bigger cluster of machines to use in your H2O cluster (it's unclear how many CPUs you are using currently). H2O GBM should be able to finish training, assuming it has enough memory and CPUs, but it may take a while.

Preprocess data in R

Im using R to create logistic regression classifier model.
Here is the code sample:
library(ROCR)
DATA_SET <- read.csv('E:/1.csv')
classOneCount= 4000
classZeroCount = 4000
sample.churn <- sample(which(DATA_SET$Class==1),classOneCount)
sample.nochurn <- sample(which(DATA_SET$Class==0),classZeroCount )
train.set <- DATA_SET[c(sample.churn,sample.nochurn),]
test.set <- DATA_SET[c(-sample.churn,-sample.nochurn),]
full.logit <- glm(Class~., data = train.set, family = binomial)
And it works fine, but I would like to preprocess the data to see if it improves classification model.
What I would like to do would be to divide input vector variables which are continuoes into intervals. Lets say that one variable is height in centimeters in float.
Sample values of height:
183.23
173.43
163.53
153.63
193.27
and so on, and I would like to split it into lets say 3 different intervals: small, medium, large.
And do it with all variables from my set - there are 32 variables.
What's more I would like to see at the end correlation between value of the variables (this intervals) and classification result class.
Is this clear?
Thank you very much in advance
The classification model creates some decision boundary and existing algorithms are rather good at estimating it. Let's assume that you have one variable - height - and linear decision boundary. Your algorithm can then decide between what values put decision boundary by estimating error on training set. If you perform quantization and create few intervals your algorithm have fewer places to put boundary(data loss). It will likely perform worse on such cropped dataset than on original one. It could help if your learning algorithm is suffering from high variance (is overfitting data) but then you could also try getting more training examples, use smaller set (subset) of features or use algorithm with regularization and increase regularization parameter
There are also many questions about how to choose number of intervals and how to divide data into them like: should all intervals be equally frequent or of equal width or most similar to each other inside each interval?
If you want just to experiment use some software like f.e. free version of RapidMiner Studio (it can read CSV and Excel files and have some quick quantization options) to convert your data

RandomForest for Regression in R

I'm experimenting with R and the randomForest Package, I have some experience with SVM and Neural Nets.
My first test is to try and regress: sin(x)+gaussian noise.
With Neural Nets and svm I obtain a "relatively" nice approximation of sin(x) so the noise is filtered out and the learning algorithm doesn't overfit. (for decent parameters)
When doing the same on randomForest I have a completely overfitted solution.
I simply use (R 2.14.0, tried on 2.14.1 too, just in case):
library("randomForest")
x<-seq(-3.14,3.14,by=0.00628)
noise<-rnorm(1001)
y<-sin(x)+noise/4
mat<-matrix(c(x,y),ncol=2,dimnames=list(NULL,c("X","Y")))
plot(x,predict(randomForest(Y~.,data=mat),mat),col="green")
points(x,y)
I guess there is a magic option in randomForest to make it work correctly, I tried a few but I did not find the right lever to pull...
You can use maxnodes to limit the size of the trees,
as in the examples in the manual.
r <- randomForest(Y~.,data=mat, maxnodes=10)
plot(x,predict(r,mat),col="green")
points(x,y)
You can do a lot better (rmse ~ 0.04, $R^2$ > 0.99) by training individual trees on small samples or bites as Breiman called them
Since there is a significant amount of noise in the training data, this problem is really about smoothing rather than generalization. In general machine learning terms this requires increasing regularization. For ensemble learner this means trading strength for diversity.
Diversity of randomForests can be increasing by reducing the number of candidate feature per split (mtry in R) or the training set of each tree (sampsize in R). Since there is only 1 input dimesions, mtry does not help, leaving sampsize. This leads to a 3.5x improvement in RMSE over the default settings and >6x improvement over the noisy training data itself. Since increased divresity means increased variance in the prediction of the individual learners, we also need to increase the number of trees to stabilize the ensemble prediction.
small bags, more trees :: rmse = 0.04:
>sd(predict(randomForest(Y~.,data=mat, sampsize=60, nodesize=2,
replace=FALSE, ntree=5000),
mat)
- sin(x))
[1] 0.03912643
default settings :: rmse=0.14:
> sd(predict(randomForest(Y~.,data=mat),mat) - sin(x))
[1] 0.1413018
error due to noise in training set :: rmse = 0.25
> sd(y - sin(x))
[1] 0.2548882
The error due to noise is of course evident from
noise<-rnorm(1001)
y<-sin(x)+noise/4
In the above the evaluation is being done against the training set, as it is in the original question. Since the issue is smoothing rather than generalization, this is not as egregious as it may seem, but it is reassuring to see that out of bag evaluation shows similar accuracy:
> sd(predict(randomForest(Y~.,data=mat, sampsize=60, nodesize=2,
replace=FALSE, ntree=5000))
- sin(x))
[1] 0.04059679
My intuition is that:
if you had a simple decision tree to fit a 1 dimensional curve f(x), that would be equivalent to fit with a staircase function (not necessarily with equally spaced jumps)
with random forests you will make a linear combination of staircase functions
For a staircase function to be a good approximator of f(x), you want enough steps on the x axis, but each step should contain enough points so that their mean is a good approximation of f(x) and less affected by noise.
So I suggest you tune the nodesize parameter. If you have 1 decision tree and N points, and nodesize=n, then your staircase function will have N/n steps. n too small brings to overfitting. I got nice results with n~30 (RMSE~0.07):
r <- randomForest(Y~.,data=mat, nodesize=30)
plot(x,predict(r,mat),col="green")
points(x,y)
Notice that RMSE gets smaller if you take N'=10*N and n'=10*n.

Random Forest with classes that are very unbalanced

I am using random forests in a big data problem, which has a very unbalanced response class, so I read the documentation and I found the following parameters:
strata
sampsize
The documentation for these parameters is sparse (or I didn´t have the luck to find it) and I really don´t understand how to implement it. I am using the following code:
randomForest(x=predictors,
y=response,
data=train.data,
mtry=lista.params[1],
ntree=lista.params[2],
na.action=na.omit,
nodesize=lista.params[3],
maxnodes=lista.params[4],
sampsize=c(250000,2000),
do.trace=100,
importance=TRUE)
The response is a class with two possible values, the first one appears more frequently than the second (10000:1 or more)
The list.params is a list with different parameters (duh! I know...)
Well, the question (again) is: How I can use the 'strata' parameter? I am using sampsize correctly?
And finally, sometimes I get the following error:
Error in randomForest.default(x = predictors, y = response, data = train.data, :
Still have fewer than two classes in the in-bag sample after 10 attempts.
Sorry If I am doing so many (and maybe stupid) questions ...
You should try using sampling methods that reduce the degree of imbalance from 1:10,000 down to 1:100 or 1:10. You should also reduce the size of the trees that are generated. (At the moment these are recommendations that I am repeating only from memory, but I will see if I can track down more authority than my spongy cortex.)
One way of reducing the size of trees is to set the "nodesize" larger. With that degree of imbalance you might need to have the node size really large, say 5-10,000. Here's a thread in rhelp:
https://stat.ethz.ch/pipermail/r-help/2011-September/289288.html
In the current state of the question you have sampsize=c(250000,2000), whereas I would have thought that something like sampsize=c(8000,2000), was more in line with my suggestions. I think you are creating samples where you do not have any of the group that was sampled with only 2000.
There are a few options.
If you have a lot of data, set aside a random sample of the data. Build your model on one set, then use the other to determine a proper cutoff for the class probabilities using an ROC curve.
You can also upsample the data in the minority class. The SMOTE algorithm might help (see the reference below and the DMwR package for a function).
You can also use other techniques. rpart() and a few other functions can allow different costs on the errors, so you could favor the minority class more. You can bag this type of rpart() model to approximate what random forest is doing.
ksvm() in the kernlab package can also use unbalanced costs (but the probability estimates are no longer good when you do this). Many other packages have arguments for setting the priors. You can also adjust this to put more emphasis on the minority class.
One last thought: maximizing models based on accuracy isn't going to get you anywhere (you can get 99.99% off the bat). The caret can tune models based on the Kappa statistic, which is a much better choice in your case.
Sorry, I don't know how to post a comment on the earlier answer, so I'll create a separate answer.
I suppose that the problem is caused by high imbalance of dataset (too few cases of one of the classes are present). For each tree in RF the algorithm creates bootstrap sample, which is a training set for this tree. And if you have too few examples of one of the classes in your dataset, then the bootstrap sampling will select examples of only one class (major class). And thus tree cannot be grown on only one class examples. It seems that there is a limit on 10 unsuccessful sampling attempts.
So the proposition of DWin to reduce the degree of imbalance to lower values (1:100 or 1:10) is the most reasonable one.
Pretty sure I disagree with the idea of removing observations from your sample.
Instead you might consider using a stratified sample to set a fixed percentage of each class each time it is resampled. This can be done with the Caret package. This way you will not be omitting observations by reducing the size of your training sample. It will not allow you to over represent your classes but will make sure that each subsample has a representative sample.
Here is an example I found:
len_pos <- nrow(example_dataset[example_dataset$target==1,])
len_neg <- nrow(example_dataset[example_dataset$target==0,])
train_model <- function(training_data, labels, model_type, ...) {
experiment_control <- trainControl(method="repeatedcv",
number = 10,
repeats = 2,
classProbs = T,
summaryFunction = custom_summary_function)
train(x = training_data,
y = labels,
method = model_type,
metric = "custom_score",
trControl = experiment_control,
verbose = F,
...)
}
# strata refers to which feature to do stratified sampling on.
# sampsize refers to the size of the bootstrap samples to be taken from each class. These samples will be taken as input
# for each tree.
fit_results <- train_model(example_dataset
, as.factor(sprintf("c%d", as.numeric(example_dataset$target)))
,"rf"
,tuneGrid = expand.grid(mtry = c( 3,5,10))
,ntree=500
,strata=as.factor(example_dataset$target)
,sampsize = c('1'=as.integer(len_pos*0.25),'0'=as.integer(len_neg*0.8))
)

Resources