Pass PCA preprocessing arguments to train() - r

I'm trying to build a predictive model in caret using PCA as pre-processing. The pre-processing would be as follows:
preProc <- preProcess(IL_train[,-1], method="pca", thresh = 0.8)
Is it possible to pass the thresh argument directly to caret's train() function? I've tried the following, but it doesn't work:
modelFit_pp <- train(IL_train$diagnosis ~ . , preProcess="pca",
thresh= 0.8, method="glm", data=IL_train)
If not, how can I pass the separate preProc results to the train() function?

As per the documentation, you specify additional preprocessing arguments with the preProcOptions argument of trainControl:
?trainControl
...
preProcOptions
A list of options to pass to preProcess. The type of pre-processing
(e.g. center, scaling etc) is passed in via the preProc option in train.
...
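Since your dataset is not reproducible, let's look at an example. I will use the Sonar dataset from mlbench and the pls algorithm just for fun. Note that preProcOptions only takes effect if preProcess is also set in train().
library(caret)
library(mlbench)
data(Sonar)
# thresh: cumulative share of variance the retained PCA components must explain (default 0.95)
ctrl <- trainControl(preProcOptions = list(thresh = 0.8))
mod <- train(Class ~ .,
data = Sonar,
method = "pls",
preProcess = "pca", # without this, preProcOptions is ignored
trControl = ctrl)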
Although documentation isn't the most exciting read, definitely make sure to try to go through it. Package authors work hard to create documentation and there are many wonders to be found within.
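On the second part of the question (passing separately computed preProc results), one option is to apply the preProcess object yourself with predict() and train on the transformed predictors. This is a minimal sketch, using Sonar again rather than the asker's IL_train, which is not available; the object names are made up for illustration. Bear in mind that a PCA estimated once on the full training set sits outside caret's resampling loop, so the resampled performance estimates can be somewhat optimistic compared with using preProcess = "pca" inside train().

```r
library(caret)
library(mlbench)
data(Sonar)

# Everything except the outcome column
predictors <- Sonar[, names(Sonar) != "Class"]

# Estimate the PCA rotation, keeping enough components for 80% of the variance
pp <- preProcess(predictors, method = "pca", thresh = 0.8)

# Apply it: the result holds the principal component scores (PC1, PC2, ...)
pc_scores <- predict(pp, predictors)

# Fit on the already-transformed predictors, with no further preProcess step
fit_pc <- train(x = pc_scores, y = Sonar$Class, method = "glm")
```

New data would have to go through the same predict(pp, newdata) step before being passed to predict(fit_pc, ...).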

Related

Making caret train rf faster when ranger is not an option

The website where I am trying to run the code uses an old version of R and does not accept ranger as a library, so I have to use the caret package. I am trying to process about 800,000 rows in my training data frame, and here is the code I use:
control <- trainControl(method = 'repeatedcv',
number = 3,
repeats = 1,
search = 'grid')
tunegrid <- expand.grid(.mtry = c(sqrt(ncol(train_1))))
fit <- train(value~.,
data = train_1,
method = 'rf',
ntree = 73,
tuneGrid = tunegrid,
trControl = control)
Looking at previous posts, I tried to tune my control parameters. Is there any way I can make the model run faster? Can I specify a setting so that it just fits one model with the parameters I set, and does not try multiple options?
This is my ranger code, which I optimized and which currently gives an accurate model:
fit <- ranger(value ~ .,
data = train_1,
num.trees = 73,
max.depth = 35,mtry = 7,importance='impurity',splitrule = "extratrees")
Thank you so much for your time
When you specify method='rf', caret uses the randomForest package to build the model. If you don't need the cross-validation that caret provides, just build your model with the randomForest package directly, e.g.
library(randomForest)
fit <- randomForest(value ~ ., data=train_1)
You can specify values for ntree, mtry etc.
Note that the randomForest package is slow (or just won't work) for large datasets. If ranger is unavailable, have you tried the Rborist package?
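If staying inside caret is required, another option (not part of the answer above, but a documented trainControl setting) is method = "none", which skips resampling and fits a single model; it needs a tuneGrid with exactly one row. This is a sketch on the built-in mtcars data, standing in for train_1, and the mtry value is arbitrary:

```r
library(caret)

# No resampling: caret fits one model on the full training set
ctrl_none <- trainControl(method = "none")

# Exactly one row of tuning parameters (arbitrary value for illustration)
grid_one <- data.frame(mtry = 3)

fit_single <- train(mpg ~ .,
                    data = mtcars,
                    method = "rf",       # still uses randomForest underneath
                    ntree = 73,          # passed through to randomForest()
                    tuneGrid = grid_one,
                    trControl = ctrl_none)
```

This removes the resampling overhead, but the single randomForest fit itself may still be slow on 800,000 rows.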

Missing Value Error although "na.action" was set to "na.roughfix"

I would like to create a Random Forest model with caret. Since there are missing values in the training set, I was looking for possible solutions and came across the option "na.roughfix" from the package "randomForest". If the library randomForest is loaded, this option can be used as argument for the parameter "na.action" within the train function of caret. Inside the train function I use a 5-fold CV and tune for the best ROC value. I do this to ensure comparability between other models. The method I've chosen for the Random Forest is "ranger".
But now something strange happens: When I trigger the train function, the calculation is started, but for example the following error message appears:
model fit failed for Fold5: mtry= 7, splitrule=gini, min.node.size= 5 Error : Missing data in columns: ...
The "..." stands for the columns in which the missing values occur. Moreover, this error message always occurs, no matter for which fold or value for mtry.
I am well aware that there are missing values in these columns ... that's why I use na.roughfix. I also remove the NZVs, but that doesn't help either.
I would be very happy about an explanation or even a solution!
Many greetings
Edit: I've now seen that when I want to choose the "na.action" argument in the train function, it is not suggested automatically, which it usually is. It seems that it's somehow lost ... maybe this is the reason why caret does not use na.roughfix ...
Edit 2: I guess that this is one part of the problem. train always behaves differently, depending on the previous arguments. In my train function I use a recipe from the recipes package to remove the NZVs. As soon as I remove the recipe, the na.action argument becomes available again. However, now the preProcess argument has vanished, meaning I cannot remove the NZVs anymore. This is really a mess :-/ Is there a possibility to apply na.action AND the preProcess argument at the same time, or any other solution for my missing-values-NZV problem?
Edit 3: As requested by the user missuse, I'll try to provide a code example. Unfortunately I cannot provide the data, since mine is relatively sensitive - thank you for your understanding.
At first, I create a "blueprint" which I hand over to the train function. Here, I remove the Near Zero Variance Variables.
blueprint <- recipe(target ~ ., data = train_data) %>%
step_nzv(all_predictors())
In the next step, I define the trainControl
train_control <- trainControl(method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
verboseIter = TRUE)
and a grid:
hyper_grid <- expand.grid(mtry=c(1:(ncol(train_data)-1)),
splitrule = c("gini", "extratrees"),
min.node.size = c(1, 3, 5, 7, 10))
Finally, I put it all together into the train function:
tuned_rf <- train(
blueprint,
data = train_data,
method = "ranger",
metric = "ROC",
trControl = train_control,
tuneGrid = hyper_grid,
na.action = na.roughfix
)
Here, the argument na.action doesn't get suggested by R, meaning that it is not available. This throws the error message in the opening question. However, if I remove the blueprint and write the model like this:
tuned_rf <- train(
target ~ .,
data = train_data,
method = "ranger",
metric = "ROC",
trControl = train_control,
tuneGrid = hyper_grid,
na.action = na.roughfix
)
na.action is available and na.roughfix can be used. However, now the pre-processing is missing. If I want to add the argument "preProcess =" to remove the NZVs, R does not suggest it, meaning that it is not available anymore. Therefore, I would have to replace the formula and the data with the training data X and the response variable Y. Now preProcess is available again ... but na.action has vanished, so I cannot use na.roughfix.
tuned_rf <- train(
X,
Y,
method = "ranger",
metric = "ROC",
trControl = train_control,
tuneGrid = hyper_grid,
preProcess = "nzv"
)
Of course I could identify the NZVs first and remove them manually - but if I want to apply further steps, the whole process gets complicated.
I hope, my problem is now more understandable ...
According to the help page ?randomForest::na.roughfix, it just performs median/mode imputation. When using a recipe, you can replace it with step_impute_median and step_impute_mode.
Your blueprint would look like:
library(recipes)
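blueprint <- recipe(target ~ ., data = train_data) %>%
step_nzv(all_predictors()) %>%
# impute predictors only, so the outcome column is not needed at prediction time
step_impute_median(all_numeric_predictors()) %>%
step_impute_mode(all_nominal_predictors())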
Perhaps also try
blueprint <- recipe(target ~ ., data = train_data) %>%
step_impute_median(all_numeric_predictors()) %>%
step_impute_mode(all_nominal_predictors()) %>%
step_nzv(all_predictors())
Depending on how step_nzv handles missing values.
I would also check performance with other imputing functions like
step_impute_bag
step_impute_knn
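To see what a given step order actually does before handing the recipe to train(), you can prep and bake it on its own and inspect the result. This is only a sketch on a made-up toy data frame (toy, x1, x2 and const are hypothetical stand-ins for the sensitive data), not the asker's setup:

```r
library(recipes)

set.seed(1)
# Toy data: one numeric and one nominal predictor with a missing value each,
# plus a constant column that a near-zero-variance filter should flag
toy <- data.frame(
  target = factor(rep(c("a", "b"), 10)),
  x1     = c(NA, rnorm(19)),
  x2     = factor(c(NA, rep(c("u", "v"), length.out = 19))),
  const  = 1
)

rec <- recipe(target ~ ., data = toy) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_nzv(all_predictors())

# Estimate the steps on the training data and return the processed training set
baked <- bake(prep(rec), new_data = NULL)

anyNA(baked)   # are any missing values left?
names(baked)   # which columns survived the filter?
```

The same check with the steps in the other order shows whether step_nzv copes with the missing values in your data.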

How to deactivate embedded feature selection in caret package?

I am writing machine learning code using the caret package in R. A sample of the code could be
weighted_fit <- train(outcome,
data = train,
method = 'glmnet',
trControl = ctrl)
As you know, some methods in the caret package, such as elastic net, have built-in feature selection. Is there any way to deactivate the built-in feature selection in this code?
Thanks in advance for any comment.
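#I will try to answer this question to the best of my ability:
#The train function in the caret package has a tuneGrid parameter, which takes a data frame of tuning parameters.
#For method = 'glmnet', caret tunes alpha (the elastic net mixing parameter) and lambda (the penalty strength), and the grid needs a column for each. The lambda values below are only illustrative.
#alpha = 0 gives ridge regularization, which shrinks coefficients but does not set them exactly to zero, so no features are dropped:
glmgrid <- expand.grid(alpha = 0, lambda = 10^seq(-3, 1, length.out = 20))
#alpha = 1 gives lasso regularization, which does perform feature selection:
#glmgrid <- expand.grid(alpha = 1, lambda = 10^seq(-3, 1, length.out = 20))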
#and then use
weighted_fit <- train(outcome,
data = train,
method = 'glmnet',
trControl = ctrl,
tuneGrid = glmgrid)
#In glmnet in R, alpha can take any value in the range [0, 1], i.e. from 0 to 1 inclusive.
# GLMNET - https://www.rdocumentation.org/packages/glmnet/versions/2.0-18/topics/glmnet
# CARET - https://topepo.github.io/caret/index.html
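One way to check that a ridge fit really keeps every predictor is to count the non-zero coefficients at the selected lambda. This is a rough sketch on simulated data; dat, ridge_grid and the lambda range are made up for illustration and are not the asker's outcome/train objects:

```r
library(caret)
library(glmnet)

set.seed(1)
# Simulated regression data: five predictors, two of them informative
n <- 100
x <- matrix(rnorm(n * 5), ncol = 5, dimnames = list(NULL, paste0("x", 1:5)))
y <- x[, 1] - 2 * x[, 2] + rnorm(n)
dat <- data.frame(y = y, x)

# alpha fixed at 0 (ridge); only lambda is tuned
ridge_grid <- expand.grid(alpha = 0, lambda = 10^seq(-3, 1, length.out = 20))

ridge_fit <- train(y ~ ., data = dat,
                   method = "glmnet",
                   tuneGrid = ridge_grid,
                   trControl = trainControl(method = "cv", number = 5))

# Coefficients of the final model at the lambda caret selected
coefs <- as.matrix(coef(ridge_fit$finalModel, s = ridge_fit$bestTune$lambda))

# Number of predictors (excluding the intercept) with a non-zero coefficient
sum(coefs[-1, 1] != 0)
```

Rerunning the same code with alpha = 1 in the grid lets you compare how many coefficients the lasso sets to zero.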

caretEnsemble: Component models do not have the same re-sampling strategies

I have several prediction models which are created using the same trainControl. These models have to be created beforehand (i.e. I can't use caretList to train multiple models simultaneously).
Below is my minimal example. When I manually combine multiple (already created) models and pass them to caretStack,
library("kernlab")
library("rpart")
library("caret")
library("caretEnsemble")
trainingControl <- trainControl(method='cv', number=10, savePredictions = "final", classProbs=TRUE)
data(spam)
ds <- spam
tr <- ds[sample(nrow(ds),3221),]
te <- ds[!(rownames(ds) %in% rownames(tr)),]
model <- train(tr[,-58], tr$type, 'svmRadial', trControl = trainingControl)
model2 <- train(tr[,-58], tr$type, 'rpart', trControl = trainingControl)
multimodel <- list(svm = model, nb = model2)
class(multimodel) <- "caretList"
stack <- caretStack(multimodel, method = "rf", metric = "ROC", trControl = trainingControl)
the library throws the error:
Component models do not have the same re-sampling strategies.
Why is that since I'm using the same strategy to generate the base models?
I found the "casting" to caretList class in the github discussion zachmayer/caretEnsemble/issues/104.
You are almost there. When you use caretEnsemble, you have to set the resampling indexes yourself via the 'index' option in trainControl. caretList tends to set this for you, but it is better to do it yourself, especially when you train the models outside of caretList, because you need to make sure the resampling is identical for every model. You can also see this in the example on GitHub you refer to.
trainingControl <- trainControl(method='cv',
number=10,
savePredictions = "final",
classProbs=TRUE,
index=createResample(tr$type)) # this needs to be set; tr must already exist at this point
This will make sure that your code will run.
Note that in the example code you have given, it will return with errors.
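Since createResample draws bootstrap samples, an alternative that keeps genuine 10-fold cross-validation is createFolds with returnTrain = TRUE. The sketch below uses the built-in iris data and two arbitrary models (rpart and knn) in place of the spam models, and only demonstrates that separately trained models end up with identical resampling indexes; it does not go on to call caretStack.

```r
library(caret)

set.seed(42)
# Training-row indexes for 10-fold CV; returnTrain = TRUE returns the rows
# used for fitting in each fold, which is what `index` expects
folds <- createFolds(iris$Species, k = 10, returnTrain = TRUE)

shared_ctrl <- trainControl(method = "cv",
                            number = 10,
                            savePredictions = "final",
                            classProbs = TRUE,
                            index = folds)

# Two models trained separately, but with the same control object
m1 <- train(Species ~ ., data = iris, method = "rpart", trControl = shared_ctrl)
m2 <- train(Species ~ ., data = iris, method = "knn", trControl = shared_ctrl)

# Both models carry the same resampling indexes
identical(m1$control$index, m2$control$index)
```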

Is there a way to enable grid tuning of mtry using Random Forest and PCA pre-processing in train function from Caret?

When I use Random Forest with PCA pre-processing with the train function from the caret package, if I add an expand.grid(n.comp = c(2,5,10,15)), I also need to provide a grid for mtry.
res <- train(Y~., data=df, method="icr", preProc = c("center",
"scale"), tuneGrid = expand.grid(n.comp = c(2,5,10,15)))
I would rather not provide it and let it work as it does when I run the same Random Forest with PCA pre-processing without specifying any expand.grid.
res <- train(Y~., data=df, method="icr", preProc = c("center",
"scale"))
Does anyone know how I can solve this?
Many Thanks
I found my answer; I'm posting it in case it helps someone.
You need to add the ICA parameter in trainControl, because ICA is used for pre-processing.
fitControl <- trainControl(preProcOptions = list(ICAcomp = 2))
res <- train(Y~., data=df, method="icr", preProc = c("center","scale"), trControl = fitControl)
Unfortunately, I don't think you can give a grid for ICAcomp in this case.
n.comp isn't exposed to the preProcess function when you call it from train.
One alternative is to use a custom method. Here is an example that does almost exactly what you want (but uses PLS instead of PCA).
Max

Resources