Making caret train rf faster when ranger is not an option - r

The website where I am trying to run the code uses an old version of R and does not accept ranger as a library, so I have to use the caret package. I am trying to process about 800,000 rows in my training data frame, and here is the code I use:
control <- trainControl(method = 'repeatedcv',
number = 3,
repeats = 1,
search = 'grid')
tunegrid <- expand.grid(.mtry = c(sqrt(ncol(train_1))))
fit <- train(value~.,
data = train_1,
method = 'rf',
ntree = 73,
tuneGrid = tunegrid,
trControl = control)
Following previous posts, I tried tuning my control parameters. Is there any way I can make the model run faster? Can I specify a setting so that it just fits one model with the parameters I set, rather than trying multiple options?
This is my ranger code, which I optimized and which currently gives an accurate model:
fit <- ranger(value ~ .,
data = train_1,
num.trees = 73,
max.depth = 35,mtry = 7,importance='impurity',splitrule = "extratrees")
Thank you so much for your time

When you specify method='rf', caret uses the randomForest package to build the model. If you don't want all the cross-validation that caret is useful for, just build your model with the randomForest package directly, for example:
library(randomForest)
fit <- randomForest(value ~ ., data=train_1)
You can specify values for ntree, mtry etc.
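As a minimal sketch of passing those arguments, assuming the built-in mtcars data as a stand-in for train_1 (the mtry and nodesize values are illustrative, not tuned). It uses the x/y interface, which the randomForest documentation recommends over the formula interface for large data sets:

```r
library(randomForest)

set.seed(1)

# mtcars stands in for train_1; mpg plays the role of `value` (assumption)
x <- mtcars[, setdiff(names(mtcars), "mpg")]
y <- mtcars$mpg

fit <- randomForest(x = x, y = y,
                    ntree = 73,    # number of trees, as in the question
                    mtry = 3,      # predictors tried at each split (illustrative)
                    nodesize = 5)  # minimum terminal node size (illustrative)
print(fit)
```

Larger nodesize values or a maxnodes limit give shallower trees and shorter run times, at some possible cost in accuracy.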
Note that the randomForest package is slow (or just won't work) for large datasets. If ranger is unavailable, have you tried the Rborist package?
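If you do need to stay inside caret, one option is to switch resampling off so that train fits exactly one model with the parameters you give it. The sketch below assumes the built-in iris data as a stand-in for train_1 and an illustrative mtry; trainControl(method = "none") requires a tuneGrid with a single row:

```r
library(caret)

set.seed(42)

# No resampling: caret fits a single model on the full training set
ctrl <- trainControl(method = "none")

# Exactly one candidate value of mtry (illustrative value)
grid <- data.frame(mtry = 2)

fit <- train(Species ~ ., data = iris,   # iris stands in for train_1 (assumption)
             method = "rf",
             ntree = 73,                 # passed through to randomForest
             tuneGrid = grid,
             trControl = ctrl)

head(predict(fit, newdata = iris))
```

This removes the cost of the cross-validation fits, but the single randomForest fit itself is no faster than calling the package directly.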

Related

Random forest in Caret running time

I need to compare model performance on a dataset (a binomial response, 170 variables, 4,000 observations) and I'm unable to make the caret random forest 'rf' model work.
My code is below; I stopped it after 2 hours:
myfolds <- caret::createMultiFolds(milk_training_purged$pred, k = 10, times = 3)
control <- caret::trainControl("repeatedcv", index = myfolds, selectionFunction = "oneSE")
model <- train(pred ~ ., data = milk_training_purged,
method = "rf",
metric = "Accuracy",
preProc = c("nzv","center","scale"),
tuneLength = 6,
trControl = control)
If I understand correctly, caret simply calls the randomForest package, so I tried running my dataset with it directly:
model <- randomForest(pred ~ ., data=milk_training_purged, proximity=TRUE)
and it takes only 1 minute to create the model.
I know that it doesn't do the cross-validation that caret does, but it shouldn't take so long.
I would simply like to rewrite the randomForest code as a caret::train call.
Thanks in advance for your help, and sorry for the silly question.

tuneRF vs caret tuning for random forest

I've been trying to tune a random forest model using the tuneRF tool included in the randomForest package, and I'm also using the caret package to tune my model. The issue is that I'm tuning mtry and getting different results from each approach. How do I know which approach is best, and based on what? I'm not clear whether I should expect similar or different results.
tuneRF: with this approach, the best mtry is 3.
t <- tuneRF(train[,-12], train[,12],
stepFactor = 0.5,
plot = TRUE,
ntreeTry = 100,
trace = TRUE,
improve = 0.05)
caret: with this approach, the best mtry is always all of the variables, in this case 6.
control <- trainControl(method="cv", number=5)
tunegrid <- expand.grid(.mtry=c(2:6))
set.seed(2)
custom <- train(CRTOT_03~., data=train, method="rf", metric="RMSE",
tuneGrid=tunegrid, ntree = 100, trControl=control)
There are a few differences. For each value of mtry, tuneRF fits one model on the whole dataset and reports the OOB error of that fit, then picks the mtry with the lowest OOB error. Since you have only one score (an OOB error value) per mtry, the result will change between runs.
In caret, you actually do cross-validation, so the held-out data from each fold is not used at all to fit the model. In principle this should give results similar to OOB, but you should be aware of the differences.
To get a better picture of the error, you can run tuneRF for a few rounds and use CV in caret:
library(randomForest)
library(mlbench)
data(BostonHousing)
train <- BostonHousing
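# Run tuneRF ten times to see how much the OOB error varies between runs
tuneRF_res = lapply(1:10, function(i) {
  tr = tuneRF(train[,-14], train[,14], mtryStart = 2, stepFactor = 0.9,
              ntreeTry = 100, trace = TRUE, improve = 1e-5)
  tr = data.frame(tr)
  tr$RMSE = sqrt(tr[,2])  # column 2 is the OOB error (MSE for regression)
  tr
})
tuneRF_res = do.call(rbind, tuneRF_res)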
control <- trainControl(method="cv", number=10,returnResamp="all")
tunegrid <- expand.grid(.mtry=c(2:7))
caret_res <- train(medv ~., data=train, method="rf", metric="RMSE",
tuneGrid=tunegrid, ntree = 100, trControl=control)
library(ggplot2)
df = rbind(
data.frame(tuneRF_res[,c("mtry","RMSE")],test="tuneRF"),
data.frame(caret_res$resample[,c("mtry","RMSE")],test="caret")
)
df = df[df$mtry!=1,]
ggplot(df,aes(x=mtry,y=RMSE,col=test))+
stat_summary(fun.data=mean_se,geom="errorbar",width=0.2) +
stat_summary(fun=mean,geom="line") + facet_wrap(~test)
You can see that the trend is more or less similar. My suggestion would be to use tuneRF to quickly check the range of mtry values to train over, then use cross-validation in caret to evaluate them properly.

How to deactivate embedded feature selection in caret package?

I am writing machine learning code using the caret package in R. A sample of the code could be:
weighted_fit <- train(outcome,
data = train,
method = 'glmnet',
trControl = ctrl)
As you know, some methods in the caret package, such as elastic net, have built-in feature selection. Is there any way to deactivate the built-in feature selection in this code?
Thanks in advance for any comment.
I will try to answer this question to the best of my ability.
The train function in the caret package has a tuneGrid parameter, which takes a data frame of tuning parameters.
For method = 'glmnet', the elastic net mixing parameter is alpha, and caret also expects a lambda column (the penalty strength) in the grid. The lambda value below is only a placeholder:
glmgrid <- expand.grid(alpha = 0, lambda = 0.1)  # ridge: shrinks coefficients but does not set them to zero, so no features are dropped
glmgrid <- expand.grid(alpha = 1, lambda = 0.1)  # lasso: can set coefficients exactly to zero (feature selection)
Then use:
weighted_fit <- train(outcome,
data = train,
method = 'glmnet',
trControl = ctrl,
tuneGrid = glmgrid)
In glmnet, alpha can take any value in the range [0, 1], including 0 and 1.
GLMNET - https://www.rdocumentation.org/packages/glmnet/versions/2.0-18/topics/glmnet
CARET - https://topepo.github.io/caret/index.html
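A small sketch of the ridge case, assuming the built-in mtcars data and an arbitrary lambda of 0.1 (both are stand-ins, not values from the question). It fits glmnet through caret with alpha = 0 and counts how many predictor coefficients were set exactly to zero:

```r
library(caret)
library(glmnet)

set.seed(1)

# Ridge penalty only; the lambda value is an arbitrary placeholder
grid <- expand.grid(alpha = 0, lambda = 0.1)

fit <- train(mpg ~ ., data = mtcars,   # mtcars stands in for the real data
             method = "glmnet",
             tuneGrid = grid,
             trControl = trainControl(method = "cv", number = 5))

# Coefficients of the final model at the chosen lambda
coefs <- as.matrix(coef(fit$finalModel, s = fit$bestTune$lambda))

# Number of predictors (intercept excluded) dropped by the penalty
n_dropped <- sum(coefs[-1, 1] == 0)
n_dropped
```

With alpha = 1 in the same grid, some coefficients may be exactly zero, which is the embedded feature selection the question refers to.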

Pre-Processing Data in Caret and Making Predictions on an Unknown Data Set

I am using the Caret package train function to fit a model and then predict to predict values on an unknown data set (which I then get feedback on so I know the quality of my predictions). I'm having problems and I'm convinced it has to do with preprocessing the unknown data.
Briefly and simply, this is what I'm doing:
Pre-Process Training Data:
preproc = preProcess(train_num,method = c("center", "scale"))
train_standardized <- predict(preproc, train_num)
Train the Model:
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
n.trees = c(100,500),
shrinkage = 0.1,
n.minobsinnode = 20)
train.boost = train(x=train_standardized[,-length(train_standardized)],
y=train_standardized$response,
method = "gbm",
metric = "ROC",
maximize = FALSE,
tuneGrid= gbmGrid,
trControl = trainControl(method="cv",
number=5,
classProbs = TRUE,
verboseIter = TRUE,
summaryFunction=twoClassSummary,
savePredictions = TRUE))
Prepare unknown data for predictions:
...
unknown_standardized <- predict(preproc, unknown_num)
...
Make the actual prediction on the unknown data:
preds <- predict(train.boost,newdata=unknown_standardized,type="prob")
Note that the "preproc" object is the same one that resulted from analysis of the training set and was used to produce the centered/standardized predictors on which the model was trained.
When I get my evaluation on the unknown data back, it is substantially worse than what was predicted using the training set (ROC using training data via cross-validation is about .83; ROC on the unknown data, as reported by the evaluating party, is about .70).
Do I have the process right? What am I doing wrong?
Thanks in advance.
In one sense, you are not doing anything wrong at all.
A model is likely to do better on its training sample because it used that data to build the model.
The whole point of held-out data is to see how well the model generalizes. The model is likely to overfit the training data to a greater or lesser extent and to do somewhat worse on new data.
Once you have your score against new data, you have a more honest estimate of the model's accuracy. If that accuracy is sufficient for your purposes, the model is usable and, because you have evaluated it on separate data, you can expect it to hold up on new data.
It is possible that the model could be better if it were trained on a wider variety of data. So, to improve real-world accuracy, it might be worth using k-fold cross-validation to train it on multiple slices of the data. Caret has a nice facility for that: http://machinelearningmastery.com/how-to-estimate-model-accuracy-in-r-using-the-caret-package/
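On the preprocessing step described in the question, one alternative is to pass preProcess to train itself, so that centering and scaling are estimated inside each resampling fold and reapplied automatically by predict. The sketch below is only an illustration under assumptions: it uses the built-in iris data and a k-nearest-neighbours model with k = 5 rather than the gbm model and data from the question:

```r
library(caret)

set.seed(123)

# Hold out 20% of iris to play the role of the "unknown" data (assumption)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[idx, ]
new_df <- iris[-idx, ]

fit <- train(Species ~ ., data = train_df,
             method = "knn",
             preProcess = c("center", "scale"),  # estimated on training data only
             tuneGrid = data.frame(k = 5),       # illustrative value
             trControl = trainControl(method = "cv", number = 5))

# predict() applies the stored centering/scaling to new_df before predicting
preds <- predict(fit, newdata = new_df)
mean(preds == new_df$Species)
```

Because the preprocessing is part of the fitted object, there is no separate step in which the new data could be standardized differently from the training data.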

How to track a progress while building model with the caret package?

I am trying to build a model using the train function from the caret package:
model <- train(class ~ ., data = training, method = "nb")
The training set contains about 20K observations, and each observation has more than 100 variables. I would like to know whether building a model from that dataset will take hours or days.
How can I estimate the time needed to train a model on this data? How can I track the progress of the training process when using functions from the caret package?
Assuming that you are training the model with an expanded grid of tuning parameters (all combinations of the tuning parameters) and a resampling technique of your choice (cross-validation, bootstrap, etc.), you could set
trainctrl <- trainControl(verboseIter = TRUE)
and pass it as the trControl argument of the train function to track the training progress:
model <- train(class ~ ., data = training, method = 'nb', trControl = trainctrl)
This prints the progress to the console at each resampling stage, which lets you gauge how far the training/parameter tuning has got.
To estimate the total running time, you could fit the model once, time it, and multiply by the number of fits implied by your resampling scheme and the number of parameter combinations. A single fit can be obtained by setting the trainControl method to 'none' and tuneLength to 1:
trainctrl <- trainControl(method = 'none')
model <- train(class ~ ., data = training, method = 'nb', trControl = trainctrl, tuneLength = 1)
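The multiplication can be sketched as below. The helper name estimate_total_seconds and the example numbers are hypothetical; in practice the single-fit time would come from wrapping the call above in system.time(). It treats every resampling fit as costing the same as the timed fit, so it is a rough estimate rather than a precise prediction:

```r
# Rough estimate of total training time from one timed fit (hypothetical helper).
estimate_total_seconds <- function(single_fit_seconds, n_folds, n_repeats,
                                   n_param_combos) {
  # One fit per fold, repeat and parameter combination,
  # plus one final fit on the full training set
  n_fits <- n_folds * n_repeats * n_param_combos + 1
  n_fits * single_fit_seconds
}

# Example: a 60-second fit, 10-fold CV repeated 3 times, 6 parameter combinations
total <- estimate_total_seconds(60, n_folds = 10, n_repeats = 3,
                                n_param_combos = 6)
total         # 10860 seconds
total / 3600  # about 3 hours
```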
Hope this helps! :)

Resources