Random forest in Caret running time - r

I need to compare model performance on a dataset (binomial outcome, 170 variables, 4,000 observations) and I'm unable to get caret's random forest ('rf') model to finish.
My code is below; I stopped it after 2 hours:
myfolds <- caret::createMultiFolds(milk_training_purged$pred, k = 10, times = 3)
control <- caret::trainControl("repeatedcv", index = myfolds, selectionFunction = "oneSE")

model <- train(pred ~ ., data = milk_training_purged,
               method = "rf",
               metric = "Accuracy",
               preProc = c("nzv", "center", "scale"),
               tuneLength = 6,
               trControl = control)
If I understand correctly, caret's 'rf' method is just a wrapper around the randomForest package, so I tried running my dataset with randomForest directly:
model <- randomForest(pred ~ ., data = milk_training_purged, proximity = TRUE)
and it only takes about one minute to build the model.
I know it doesn't do the cross-validation that caret performs, but it still shouldn't take that long.
I would simply like to translate the randomForest call into caret::train().
Thanks in advance for your help, and sorry for the silly question.
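For context on the timing gap: with 10-fold CV repeated 3 times and tuneLength = 6, train() fits 30 x 6 = 180 forests (plus a final one on the full data), each with randomForest's default of 500 trees, whereas the plain randomForest() call fits a single forest. A minimal sketch of a cheaper train() call, keeping the repeated-CV index from above but pinning mtry to one value and passing a smaller ntree through to randomForest (the ntree value here is an illustrative choice, not a caret default):

# Sketch: a single mtry value and fewer trees, so 30 resamples x 1 setting
# = 30 forests instead of 180, plus the final fit.
model <- caret::train(pred ~ ., data = milk_training_purged,
                      method    = "rf",
                      metric    = "Accuracy",
                      preProc   = c("nzv", "center", "scale"),
                      tuneGrid  = data.frame(mtry = floor(sqrt(170))),  # randomForest's usual classification default
                      ntree     = 250,                                  # illustrative; passed through to randomForest()
                      trControl = control)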

Related

Making caret train rf faster when ranger is not an option

The site where I am trying to run the code uses an old version of R and does not accept ranger as a library, so I have to use the caret package. I am trying to process about 800,000 rows in my training data frame, and here is the code I use:
control <- trainControl(method = 'repeatedcv',
                        number = 3,
                        repeats = 1,
                        search = 'grid')
tunegrid <- expand.grid(.mtry = c(sqrt(ncol(train_1))))

fit <- train(value ~ .,
             data = train_1,
             method = 'rf',
             ntree = 73,
             tuneGrid = tunegrid,
             trControl = control)
Following previous posts, I tried tuning my control parameters. Is there any way I can make the model run faster? Can I specify settings so that it just builds one model with the parameters I set, instead of trying multiple options?
This is my ranger code, which I have already optimized and which currently gives an accurate model:
fit <- ranger(value ~ .,
              data = train_1,
              num.trees = 73,
              max.depth = 35,
              mtry = 7,
              importance = 'impurity',
              splitrule = "extratrees")
Thank you so much for your time
When you specify method = 'rf', caret uses the randomForest package to build the model. If you don't need the cross-validation that caret provides, just build your model with the randomForest package directly, e.g.
library(randomForest)
fit <- randomForest(value ~ ., data=train_1)
You can specify values for ntree, mtry etc.
Note that the randomForest package is slow (or just won't work) for large datasets. If ranger is unavailable, have you tried the Rborist package?
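If you want caret to skip tuning and resampling entirely, another option (a rough sketch, reusing the values from the ranger call where randomForest has an equivalent; max.depth and splitrule do not carry over) is trainControl(method = "none") with a single-row tuneGrid:

library(caret)
# Sketch: fit exactly one forest through caret, with no resampling.
# method = "none" requires a tuneGrid with a single row.
fit <- train(value ~ .,
             data      = train_1,
             method    = "rf",
             tuneGrid  = data.frame(mtry = 7),
             ntree     = 73,                        # passed on to randomForest()
             trControl = trainControl(method = "none"))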

How do you extract the undersampled data from caret in R?

I created a random forest model using the train function from caret with the following train control parameters:
ctrl <- trainControl(sampling = "down")

set.seed(123)
rf_fit <- train(Dep ~ .,
                data = dt_train,
                method = "rf",
                metric = "Kappa",
                trControl = ctrl,
                importance = TRUE)
My forest is performing well, and now I need to explain it in simpler terms. I am using the inTrees package to do that, but to proceed I first need the undersampled data that train generated.
Is there any way to get the undersampled data from the train function?
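The down-sampled set built inside train() is not returned, but caret exports downSample(), which applies the same kind of class balancing. A hedged sketch (using the dt_train and Dep names from the question) that builds a comparable, though not identical, down-sampled data set:

library(caret)
# Sketch: recreate a down-sampled training set outside of train().
# This is a fresh random draw, not the exact rows train() used internally.
set.seed(123)
down_train <- downSample(x = dt_train[, setdiff(names(dt_train), "Dep")],
                         y = dt_train$Dep,
                         yname = "Dep")
table(down_train$Dep)   # both classes should now match the minority-class count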

Automate variable selection based on varimp in R

In R, I have a logistic regression model as follows
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(result ~ ., data = df,
                     trControl = train_control,
                     method = "glm",
                     family = binomial(link = "logit"))

calculatedVarImp <- varImp(logit_Model, scale = FALSE)
I run multiple datasets through the same code, so the variable importance changes for each dataset. Is there a way to get the names of the variables whose overall importance is less than some threshold n (e.g. 1), so I can automate removing those variables and rerun the model?
I was unable to extract this information from the 'calculatedVarImp' object by subsetting on the 'Overall' value:
lowVarImp <- subset(calculatedVarImp, importance$Overall < 1)
Also, is there a better way of doing variable selection?
Thanks in advance
You're using the caret package. Not sure if you're aware of this, but caret has a method for stepwise logistic regression using the Akaike Information Criterion: glmStepAIC.
It iteratively adds or drops predictors, refitting the model at each step, and stops when no step lowers the AIC any further.
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(y ~ ., data = train_data,
                     trControl = train_control,
                     method = "glmStepAIC",
                     family = binomial(link = "logit"),
                     na.action = na.omit)

logit_Model$finalModel
This is as automated as it gets, but it may be worth reading about the downsides of stepwise selection before relying on this method.
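For the subsetting part of the question, here is a small sketch (assuming the calculatedVarImp object from above; note that with factor predictors the dummy-coded names in varImp may not match the raw column names of df): the scores live in the object's $importance data frame, with variable names as row names.

# Sketch: names of variables with Overall importance below 1.
imp <- calculatedVarImp$importance            # data frame; row names are variable names
low_vars <- rownames(imp)[imp$Overall < 1]

# Refit without them (illustrative; same trainControl as above).
keep_vars <- setdiff(colnames(df), c("result", low_vars))
logit_Model2 <- train(reformulate(keep_vars, response = "result"), data = df,
                      trControl = train_control,
                      method = "glm",
                      family = binomial(link = "logit"))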

Different results with randomForest() and caret's randomForest (method = "rf")

I am new to caret, and I just want to ensure that I fully understand what it’s doing. Towards that end, I’ve been attempting to replicate the results I get from a randomForest() model using caret’s train() function for method="rf". Unfortunately, I haven’t been able to get matching results, and I’m wondering what I’m overlooking.
I’ll also add that given that randomForest uses bootstrapping to generate samples to fit each of the ntrees, and estimates error based on out-of-bag predictions, I’m a little fuzzy on the difference between specifying "oob" and "boot" in the trainControl function call. These options generate different results, but neither matches the randomForest() model.
I've read the caret package website (http://topepo.github.io/caret/index.html), as well as various Stack Overflow questions that seemed potentially relevant, but I haven't been able to figure out why the caret method = "rf" model produces different results from randomForest(). Thank you very much for any insight you might be able to offer.
Here’s a replicable example, using the CO2 dataset from the MASS package.
library(MASS)
data(CO2)

library(randomForest)
set.seed(1)
rf.model <- randomForest(uptake ~ .,
                         data = CO2,
                         ntree = 50,
                         nodesize = 5,
                         mtry = 2,
                         importance = TRUE,
                         metric = "RMSE")
library(caret)
set.seed(1)
caret.oob.model <- train(uptake ~ .,
                         data = CO2,
                         method = "rf",
                         ntree = 50,
                         tuneGrid = data.frame(mtry = 2),
                         nodesize = 5,
                         importance = TRUE,
                         metric = "RMSE",
                         trControl = trainControl(method = "oob"),
                         allowParallel = FALSE)

set.seed(1)
caret.boot.model <- train(uptake ~ .,
                          data = CO2,
                          method = "rf",
                          ntree = 50,
                          tuneGrid = data.frame(mtry = 2),
                          nodesize = 5,
                          importance = TRUE,
                          metric = "RMSE",
                          trControl = trainControl(method = "boot", number = 50),
                          allowParallel = FALSE)
print(rf.model)
print(caret.oob.model$finalModel)
print(caret.boot.model$finalModel)
Produces the following:
print(rf.model)
Mean of squared residuals: 9.380421
% Var explained: 91.88
print(caret.oob.model$finalModel)
Mean of squared residuals: 38.3598
% Var explained: 66.81
print(caret.boot.model$finalModel)
Mean of squared residuals: 42.56646
% Var explained: 63.16
And the code to look at variable importance:
importance(rf.model)
importance(caret.oob.model$finalModel)
importance(caret.boot.model$finalModel)
Using the formula interface in train converts factors to dummy variables. To compare results from caret with randomForest, you should use the non-formula interface.
In your case, you should also provide a seed inside trainControl to get the same result as in randomForest.
In the training section of the caret website there are some notes on reproducibility that explain how to use seeds.
library("randomForest")
set.seed(1)
rf.model <- randomForest(uptake ~ .,
data = CO2,
ntree = 50,
nodesize = 5,
mtry = 2,
importance = TRUE,
metric = "RMSE")
library("caret")
caret.oob.model <- train(CO2[, -5], CO2$uptake,
method = "rf",
ntree = 50,
tuneGrid = data.frame(mtry = 2),
nodesize = 5,
importance = TRUE,
metric = "RMSE",
trControl = trainControl(method = "oob", seed = 1),
allowParallel = FALSE)
If you are doing resampling, you should provide seeds for each resampling iteration and an additional one for the final model. Examples in ?trainControl show how to create them.
In the following example, the last seed is for the final model and I set it to 1.
seeds <- as.vector(c(1:26), mode = "list")
# For the final model
seeds[[26]] <- 1

caret.boot.model <- train(CO2[, -5], CO2$uptake,
                          method = "rf",
                          ntree = 50,
                          tuneGrid = data.frame(mtry = 2),
                          nodesize = 5,
                          importance = TRUE,
                          metric = "RMSE",
                          trControl = trainControl(method = "boot", seeds = seeds),
                          allowParallel = FALSE)
Defining the non-formula interface correctly with caret and setting the seed in trainControl, you will get the same results from all three models:
rf.model
caret.oob.model$finalModel
caret.boot.model$finalModel
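As a quick sanity check (not part of the original answer), if the seeds line up the three fitted forests should produce the same predictions on the training data:

# Sketch: compare predictions from the three models; they should agree
# when the underlying forests are identical.
p_rf   <- predict(rf.model, newdata = CO2)
p_oob  <- predict(caret.oob.model$finalModel, newdata = CO2[, -5])
p_boot <- predict(caret.boot.model$finalModel, newdata = CO2[, -5])
all.equal(unname(p_rf), unname(p_oob))
all.equal(unname(p_rf), unname(p_boot))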

Model trained with preprocess using impute not processing new data

I am using caret's train function with the preProcess option:
fit <- train(form,
             data = train,
             preProcess = c("YeoJohnson", "center", "scale", "bagImpute"),
             method = model,
             metric = "ROC",
             tuneLength = tune,
             trControl = fitControl)
This preprocesses the training data. However, when I predict, the observations with NAs are omitted even though I have bagImpute as an option. I know there is an na.action parameter on predict.train, but I can't get it to work.
predict.train(model, newdata = test, na.action = ???)
Is it correct to assume that the predict function automatically preprocesses the new data because the model was trained with the preProcess option? If so, shouldn't the new data be imputed and processed the same way as the training data? What am I doing wrong?
Thanks for any help.
You would use na.action = na.pass. The problem is, while making a working example, I found a bug that occurs with the formula method for train and imputation. Here is an example without the formula method:
library(caret)
set.seed(1)
training <- twoClassSim(100)
testing  <- twoClassSim(100)
testing$Linear05[4] <- NA

fitControl <- trainControl(classProbs = TRUE,
                           summaryFunction = twoClassSummary)

set.seed(2)
fit <- train(x = training[, -ncol(training)],
             y = training$Class,
             preProcess = c("YeoJohnson", "center", "scale", "bagImpute"),
             method = "lda",
             metric = "ROC",
             trControl = fitControl)

predict(fit, testing[1:5, -ncol(testing)], na.action = na.pass)
The bug will be fixed on the next release of the package.
Max
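As a small illustration of what na.action changes here (a sketch, not from the original answer): with the default na.omit, the row containing the NA is expected to be dropped from the output, while na.pass keeps it so the bagImpute step can fill in the missing value.

# Sketch: compare the default na.action with na.pass.
p_omit <- predict(fit, testing[1:5, -ncol(testing)])                      # default: na.omit
p_pass <- predict(fit, testing[1:5, -ncol(testing)], na.action = na.pass)
length(p_omit)   # expected 4: the row with the NA is dropped
length(p_pass)   # expected 5: all rows predicted, NA imputed by preProcess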
