How do I make a randomForest model size smaller? - r

I've been training randomForest models in R on 7 million rows of data (41 features). Here's an example call:
myModel <- randomForest(RESPONSE~., data=mydata, ntree=50, maxnodes=30)
I thought surely with only 50 trees and 30 terminal nodes that the memory footprint of "myModel" would be small. But it's 65 megs in a dump file. The object seems to be holding all sorts of predicted, actual, and vote data from the training process.
What if I just want the forest and that's it? I want a tiny dump file that I can load later to make predictions off of quickly. I feel like the forest by itself shouldn't be all that large...
Anyone know how to strip this sucker down to just something I can make predictions off of going forward?

Trying to get out of the habit of posting answers as comments...
?randomForest advises against using the formula interface with large numbers of variables... are the results any different if you don't use the formula interface? The Value section of ?randomForest also tells you how to turn off some of the output (importance matrix, the entire forest, proximity matrix, etc.).
For example:
myModel <- randomForest(mydata[, !grepl("RESPONSE", names(mydata))],
                        mydata$RESPONSE, ntree = 50, maxnodes = 30,
                        importance = FALSE, localImp = FALSE, keep.forest = FALSE,
                        proximity = FALSE, keep.inbag = FALSE)
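Note that keep.forest=FALSE in the example above drops the trees themselves, so the resulting object cannot be used for prediction. If the goal is a small object you can still predict from, here is one sketch (component names as documented in the Value section of ?randomForest): keep the forest, but NULL out the bulky training-time components before saving.
myModel <- randomForest(mydata[, !grepl("RESPONSE", names(mydata))],
                        mydata$RESPONSE, ntree = 50, maxnodes = 30,
                        keep.forest = TRUE)
# predict() mainly needs $forest plus the type/class metadata;
# the rest is training-time output that can be dropped
slim <- myModel
slim$predicted <- NULL   # fitted values for the training rows
slim$votes     <- NULL   # OOB vote matrix (classification only)
slim$oob.times <- NULL
slim$err.rate  <- NULL
slim$y         <- NULL   # copy of the 7-million-row response
slim$proximity <- NULL
slim$inbag     <- NULL
saveRDS(slim, "myModel_slim.rds")
Functions such as importance() or varImpPlot() will no longer work on the stripped object, but predict(slim, newdata) should.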

You can also use the tuneRF function in R to tune mtry (the number of variables tried at each split), which helps you settle on a smaller, better-tuned model:
tuneRF(data_train[, names(data_train) != "Response"], data_train$Response, stepFactor = 1.2, improve = 0.01, plot = TRUE, trace = TRUE)
See ?tuneRF for details on the arguments.

Related

Optimize the model after modeling in R: randomForest grid search

I have two regression models, rf1 and rf2, and I want to find values of the input variables for which the output of rf1 falls between 20 and 26 while the output of rf2 stays below 10.
I tried a grid search but found nothing. If you know how to do this with a heuristic (simulated annealing or a genetic algorithm), please help me.
You can find the code for this example in this repository here:
library(randomForest)
model_rf_fines <- readRDS(file = paste0("rf1.rds"))
model_rf_gros  <- readRDS(file = paste0("rf2.rds"))
# grid ------
grid_input_test <- expand.grid(
  "Poste"                 = "P1",
  "Qualité"               = "BTNBA",
  "CPT_2500"              = 13.83,
  "CPT400"                = 46.04,
  "CPT160"                = 15.12,
  "CPT125"                = 5.9,
  "CPT40"                 = 15.09,
  "CPT_40"                = 4.02,
  "retart"                = 0,
  "dure"                  = 0,
  "Débit_CV004"           = seq(1300, 1400, 10),
  "Dilution_SB002"        = seq(334.68, 400, 10),
  "Arrosage_Crible_SC003" = seq(250, 300, 10),
  "Dilution_HP14"         = 1200,
  "Dilution_HP15"         = 631.1,
  "Dilution_HP18"         = 500,
  "Dilution_HP19"         = seq(760.47, 800, 10),
  "Pression_PK12"         = c(0.59, 0.4),
  "Pression_PK13"         = c(0.8, 0.7),
  "Pression_PK14"         = c(0.8, 0.9, 0.99, 1),
  "Pression_PK16"         = c(0.5),
  "Pression_PK18"         = c(0.4, 0.5)
)
# levels correction ----
levels(grid_input_test$Qualité) <- model_rf_fines$forest$xlevels$Qualité
levels(grid_input_test$Poste)   <- model_rf_fines$forest$xlevels$Poste
for (i in 1:nrow(grid_input_test)) {
  print("----------------------------")
  print(i)
  # fines (rf1)
  print(paste0("Fines :", predict(object = model_rf_fines, newdata = grid_input_test[i, ])))
  # gros (rf2)
  print(paste0("Gros :", predict(object = model_rf_gros, newdata = grid_input_test[i, ])))
  if (predict(object = model_rf_gros, newdata = grid_input_test[i, ]) <= 10) break
}
Any suggestions will be greatly appreciated.
Thanks.
It might be that such an input simply does not exist. If rf1 and rf2 are random forest models with, say, more than 50 trees each, the averaging over trees smooths out the spikes/edges of the model.
Similar to the law of large numbers, the more trees in each forest, the closer the outputs of rf1 and rf2 stay to their averages. This assumes both models really are random forests trained on the same data; in that case, the more trees they have, the less likely it is that an input satisfying your conditions exists.
Do try a naive grid search first, and keep track of the minimum value of rf2 over the points where rf1 satisfies your condition. Call this minimum M_grid.
If you want to implement simulated annealing, I would start with a simple neighbour scheme: pick a random input variable and vary it a bit. You can use existing Python packages for the annealing schedule. If this simple scheme beats your M_grid by quite a bit and you feel you are close to the solution, you can play around with slower cooling schedules or more complicated neighbour proposals.
Also, don't settle on the objective for SA or GA too quickly. You probably want an objective that steers rf1 close to its lower edge of 20 and pushes rf2 as low as possible, perhaps with an exp() or cubic term to strongly reward going down.
I made some assumptions here, maybe wrong. But I hope this helps anyway.
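If it helps, here is a rough base-R sketch of that simulated-annealing idea (not a tuned implementation). It assumes the model_rf_fines (rf1) and model_rf_gros (rf2) objects and the grid_input_test frame from the question; the perturbation size, penalty weight and cooling rate are arbitrary choices.
objective <- function(x) {
  f <- predict(model_rf_fines, newdata = x)    # rf1: want 20 <= f <= 26
  g <- predict(model_rf_gros,  newdata = x)    # rf2: want g as low as possible (<= 10)
  penalty <- max(0, 20 - f) + max(0, f - 26)   # distance from the rf1 band
  g + 100 * penalty
}
x0 <- grid_input_test[1, ]
numeric_cols <- names(x0)[sapply(x0, is.numeric)]
best <- current <- x0
best_val <- current_val <- objective(x0)
temp <- 1
set.seed(1)
for (iter in 1:2000) {
  cand <- current
  j <- sample(numeric_cols, 1)                        # vary one random input a bit
  cand[[j]] <- cand[[j]] * (1 + rnorm(1, sd = 0.05))
  cand_val <- objective(cand)
  if (cand_val < current_val || runif(1) < exp((current_val - cand_val) / temp)) {
    current <- cand
    current_val <- cand_val
  }
  if (current_val < best_val) {
    best <- current
    best_val <- current_val
  }
  temp <- temp * 0.995                                # simple geometric cooling
}
best
Compare the best point found this way against your M_grid from the grid search before trusting it.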

number of trees in h2o.gbm

in the traditional gbm package, we can use
predict.gbm(model, newdata = ..., n.trees = ...)
so that I can compare results for different numbers of trees on the test data.
In h2o.gbm, although there is an n.tree argument to set, it seems to have no effect on the result; everything is the same as the default model:
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=100))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=10))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
Does anybody have a similar problem? How do you solve it? h2o.gbm is much faster than gbm, so it would be great if I could get detailed results for each number of trees.
I don't think H2O supports what you are describing.
BUT, if what you are after is to get the performance against the number of trees used, that can be done at model building time.
library(h2o)
h2o.init()
iris <- as.h2o(iris)
parts <- h2o.splitFrame(iris, c(0.8, 0.1))
train <- parts[[1]]
valid <- parts[[2]]
test  <- parts[[3]]
m <- h2o.gbm(1:4, 5, train,
             validation_frame = valid,
             ntrees = 100,            # maximum desired
             score_tree_interval = 1)
h2o.scoreHistory(m)
plot(m)
The score history will show the evaluation after adding each new tree, and plot(m) will show a chart of it. It looks like 20 trees are plenty for iris!
BTW, if your real purpose was to find out the optimum number of trees to use, then switch early stopping on, and it will do that automatically for you. (Just make sure you are using both validation and test data frames.)
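For reference, early stopping on the same iris frames could look something like this (stopping_rounds, stopping_metric and stopping_tolerance are standard h2o.gbm arguments; the exact values here are only illustrative):
m2 <- h2o.gbm(1:4, 5, train,
              validation_frame = valid,
              ntrees = 1000,                 # upper bound only
              score_tree_interval = 1,
              stopping_rounds = 5,           # stop after 5 scoring rounds without improvement
              stopping_metric = "logloss",
              stopping_tolerance = 1e-4)
h2o.scoreHistory(m2)   # shows where training actually stopped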
As of 3.20.0.6, H2O does support this. The method you are looking for is staged_predict_proba. For classification models it produces predicted class probabilities after each iteration (tree), for every observation in your testing frame. For regression models (i.e. when the response is numeric), although not really documented, it produces the actual prediction for every observation in your testing frame.
From these predictions it is also easy to compute various performance metrics (AUC, r2 etc), assuming that's what you're after.
Python API:
staged_predict_proba = model.staged_predict_proba(test)
R API:
staged_predict_proba <- h2o.staged_predict_proba(model, prostate.test)

h2o.GBM taking too long on a small sized dataset

I've got a rather small dataset (162,000 observations with 13 attributes)
that I'm trying to use for modelling with h2o.gbm. The response variable is categorical with a large number of levels (~20,000).
The model doesn't run out of memory or give any errors, but it had been going for nearly 24 hours without any progress (the H2O GBM reporting said 0%), so I finally gave in and stopped it.
I'm wondering if there's anything wrong with my hyperparameters, as the data is not particularly large.
Here's my code:
library(h2o)
localH2O <- h2o.init(nthreads = -1, max_mem_size = "12g")
train.h20 <- as.h2o(analdata_train)
gbm1 <- h2o.gbm(
  y = response_var,
  x = independ_vars,
  training_frame = train.h20,
  ntrees = 3,
  max_depth = 5,
  min_rows = 10,
  stopping_tolerance = 0.001,
  learn_rate = 0.1,
  distribution = "multinomial"
)
The way H2O GBM multinomial classification works is that when you ask for 1 tree as a parameter, it actually builds a tree for each level in the response column under the hood.
So 1 tree really means 20,000 trees in your case.
2 trees would really mean 40,000, and so on...
(Note the binomial classification case takes a shortcut and builds only one tree for both classes.)
So... it will probably finish but it could take quite a long time!
It's probably not a good idea to train a classifier with 20,000 classes -- most GBM implementations won't even let you do that. Can you group/cluster the classes into a smaller number of groups so that you can train a model with a smaller number of classes? If so, then you could perform your training in a two-stage process -- the first model would have K classes (assuming you clustered your classes into K groups). Then you can train secondary models that further classify the observations into your original classes.
This type of two-stage process may make sense if your classes represent groups that naturally cluster into a hierarchy -- such as zip codes or ICD-10 medical diagnostic codes, for example.
If your use-case really demands that you train a 20,000 class GBM (and there's no way around it), then you should get a bigger cluster of machines to use in your H2O cluster (it's unclear how many CPUs you are using currently). H2O GBM should be able to finish training, assuming it has enough memory and CPUs, but it may take a while.
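A very rough sketch of that two-stage idea, reusing the names from the question (analdata_train, response_var, independ_vars); the group_of lookup that maps the ~20,000 original classes to a small number of coarse groups is hypothetical and would have to come from your own domain knowledge or clustering:
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "12g")
# Stage 1: predict the coarse group (K classes instead of 20,000)
analdata_train$group <- factor(group_of[as.character(analdata_train[[response_var]])])
train.h2o <- as.h2o(analdata_train)
stage1 <- h2o.gbm(y = "group", x = independ_vars, training_frame = train.h2o,
                  ntrees = 50, max_depth = 5, distribution = "multinomial")
# Stage 2: one model per group, trained only on that group's rows,
# predicting the original fine-grained class (far fewer levels per model)
stage2 <- lapply(levels(analdata_train$group), function(g) {
  sub <- as.h2o(droplevels(analdata_train[analdata_train$group == g, ]))
  h2o.gbm(y = response_var, x = independ_vars, training_frame = sub,
          ntrees = 50, max_depth = 5, distribution = "multinomial")
})
names(stage2) <- levels(analdata_train$group)
At prediction time you would first score stage1 to pick a group, then score that group's stage2 model to get the final class.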

Preprocess data in R

I'm using R to create a logistic regression classifier model.
Here is the code sample:
library(ROCR)
DATA_SET <- read.csv('E:/1.csv')
classOneCount  <- 4000
classZeroCount <- 4000
sample.churn   <- sample(which(DATA_SET$Class == 1), classOneCount)
sample.nochurn <- sample(which(DATA_SET$Class == 0), classZeroCount)
train.set <- DATA_SET[c(sample.churn,sample.nochurn),]
test.set <- DATA_SET[c(-sample.churn,-sample.nochurn),]
full.logit <- glm(Class~., data = train.set, family = binomial)
And it works fine, but I would like to preprocess the data to see if it improves the classification model.
What I would like to do is divide the continuous input variables into intervals. Let's say one variable is height in centimeters, stored as a float.
Sample values of height:
183.23
173.43
163.53
153.63
193.27
and so on, and I would like to split it into, let's say, 3 intervals: small, medium, large.
And do this with all the variables in my set -- there are 32 of them.
What's more, I would like to see at the end the correlation between the values of the variables (these intervals) and the resulting classification class.
Is this clear?
Thank you very much in advance
A classification model creates a decision boundary, and existing algorithms are rather good at estimating it. Let's assume you have one variable -- height -- and a linear decision boundary. Your algorithm can then decide where to put the boundary by estimating the error on the training set. If you perform quantization and create a few intervals, your algorithm has fewer places to put the boundary (information is lost), so it will likely perform worse on such a cropped dataset than on the original one. Quantization could help if your learning algorithm suffers from high variance (is overfitting the data), but then you could also try getting more training examples, using a smaller subset of features, or using an algorithm with regularization and increasing the regularization parameter.
There are also many questions about how to choose the number of intervals and how to divide the data into them: should all intervals be equally frequent, of equal width, or as homogeneous as possible inside each interval?
If you just want to experiment, use software such as the free version of RapidMiner Studio (it can read CSV and Excel files and has some quick quantization options) to convert your data.
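For reference, here is a minimal base-R sketch of the binning described in the question (three quantile-based intervals per numeric variable, then a cross-tabulation of each binned variable against Class); it assumes the DATA_SET frame from the question:
binned <- DATA_SET
numeric_vars <- names(binned)[sapply(binned, is.numeric) & names(binned) != "Class"]
for (v in numeric_vars) {
  breaks <- unique(quantile(binned[[v]], probs = c(0, 1/3, 2/3, 1), na.rm = TRUE))
  binned[[v]] <- cut(binned[[v]], breaks = breaks,
                     labels = c("small", "medium", "large")[seq_len(length(breaks) - 1)],
                     include.lowest = TRUE)
}
# association between each binned variable and the response class
for (v in numeric_vars) {
  print(v)
  print(table(binned[[v]], binned$Class))
}
The binned frame could then be fed to glm() in place of the raw one to compare the two models.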

Random Forest with classes that are very unbalanced

I am using random forests on a big-data problem that has a very unbalanced response class, so I read the documentation and found the following parameters:
strata
sampsize
The documentation for these parameters is sparse (or I didn't have the luck to find it) and I really don't understand how to implement them. I am using the following code:
randomForest(x = predictors,
             y = response,
             data = train.data,
             mtry = lista.params[1],
             ntree = lista.params[2],
             na.action = na.omit,
             nodesize = lista.params[3],
             maxnodes = lista.params[4],
             sampsize = c(250000, 2000),
             do.trace = 100,
             importance = TRUE)
The response is a class with two possible values; the first appears much more frequently than the second (10,000:1 or more).
The lista.params object is a list with different parameters (duh! I know...).
Well, the question (again) is: how can I use the 'strata' parameter? Am I using sampsize correctly?
And finally, sometimes I get the following error:
Error in randomForest.default(x = predictors, y = response, data = train.data, :
Still have fewer than two classes in the in-bag sample after 10 attempts.
Sorry if I am asking so many (and maybe stupid) questions...
You should try using sampling methods that reduce the degree of imbalance from 1:10,000 down to 1:100 or 1:10. You should also reduce the size of the trees that are generated. (At the moment these are recommendations that I am repeating only from memory, but I will see if I can track down more authority than my spongy cortex.)
One way of reducing the size of trees is to set nodesize larger. With that degree of imbalance you might need the node size to be really large, say 5,000-10,000. Here's a thread on r-help:
https://stat.ethz.ch/pipermail/r-help/2011-September/289288.html
In the current state of the question you have sampsize=c(250000,2000), whereas I would have thought that something like sampsize=c(8000,2000) was more in line with my suggestions. With 250,000 draws from the majority class, I think you are still creating in-bag samples that contain hardly any of the class sampled with only 2,000.
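To answer the strata/sampsize part directly, here is a sketch of a stratified call (assuming response is a factor; the per-class counts follow the 8000/2000 suggestion above and are only illustrative):
library(randomForest)
fit <- randomForest(x = predictors,
                    y = response,
                    strata = response,            # stratify the bootstrap by class
                    sampsize = c(8000, 2000),     # rows drawn per class, in the order of levels(response)
                    ntree = 500,
                    nodesize = 5000,              # large nodes keep the trees small
                    do.trace = 100,
                    importance = TRUE)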
There are a few options.
If you have a lot of data, set aside a random sample of the data. Build your model on one set, then use the other to determine a proper cutoff for the class probabilities using an ROC curve.
You can also upsample the data in the minority class. The SMOTE algorithm might help (see the reference below and the DMwR package for a function).
You can also use other techniques. rpart() and a few other functions can allow different costs on the errors, so you could favor the minority class more. You can bag this type of rpart() model to approximate what random forest is doing.
ksvm() in the kernlab package can also use unbalanced costs (but the probability estimates are no longer good when you do this). Many other packages have arguments for setting the priors. You can also adjust this to put more emphasis on the minority class.
One last thought: maximizing models based on accuracy isn't going to get you anywhere (you can get 99.99% right off the bat). The caret package can tune models based on the Kappa statistic, which is a much better choice in your case.
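A minimal caret sketch of that idea (using the predictors/response objects from the question): tune on Kappa instead of accuracy.
library(caret)
ctrl <- trainControl(method = "cv", number = 5)
fit <- train(x = predictors, y = response,
             method = "rf", metric = "Kappa",
             trControl = ctrl, ntree = 500)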
Sorry, I don't know how to post a comment on the earlier answer, so I'll create a separate answer.
I suppose the problem is caused by the high imbalance of the dataset (too few cases of one of the classes). For each tree in the RF, the algorithm creates a bootstrap sample, which is the training set for that tree. If you have too few examples of one of the classes in your dataset, the bootstrap sampling may select examples of only one class (the major class), and a tree cannot be grown on examples of a single class. It seems there is a limit of 10 unsuccessful sampling attempts.
So the proposition of DWin to reduce the degree of imbalance to lower values (1:100 or 1:10) is the most reasonable one.
Pretty sure I disagree with the idea of removing observations from your sample.
Instead you might consider using a stratified sample to set a fixed percentage of each class each time it is resampled. This can be done with the caret package. This way you will not be omitting observations by reducing the size of your training sample. It will not allow you to over-represent your classes, but it will make sure that each subsample contains a representative sample of each class.
Here is an example I found:
library(caret)
library(randomForest)
len_pos <- nrow(example_dataset[example_dataset$target == 1, ])
len_neg <- nrow(example_dataset[example_dataset$target == 0, ])
train_model <- function(training_data, labels, model_type, ...) {
  experiment_control <- trainControl(method = "repeatedcv",
                                     number = 10,
                                     repeats = 2,
                                     classProbs = TRUE,
                                     summaryFunction = custom_summary_function)  # user-defined summary
  train(x = training_data,
        y = labels,
        method = model_type,
        metric = "custom_score",   # metric name returned by custom_summary_function
        trControl = experiment_control,
        verbose = FALSE,
        ...)
}
# strata refers to which feature to do stratified sampling on.
# sampsize refers to the size of the bootstrap samples to be taken from each class.
# These samples will be taken as input for each tree.
fit_results <- train_model(example_dataset,
                           as.factor(sprintf("c%d", as.numeric(example_dataset$target))),
                           "rf",
                           tuneGrid = expand.grid(mtry = c(3, 5, 10)),
                           ntree = 500,
                           strata = as.factor(example_dataset$target),
                           sampsize = c('1' = as.integer(len_pos * 0.25),
                                        '0' = as.integer(len_neg * 0.8)))

Resources