tuneLength definition in caret (R)

Can someone explain how tuneLength works for different models in the train() function of the caret package?
ctreeModel <- train(CompressiveStrength ~ .,
                    data = trainingSet,
                    method = "ctree",
                    tuneLength = 10,
                    trControl = controlObject)
In this case, is tuneLength used to define the number of predictors that are used in each split?

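It all depends on the model. tuneLength controls how many candidate values caret generates for each tuning parameter, and which parameters those are differs by method. A valuable function within caret is modelLookup(). Pass a string with the name of the model you're using, for example modelLookup("rf"), and it will tell you which parameters are tuned by tuneLength. In your case above, that is mincriterion, not the number of predictors per split: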
> modelLookup("ctree")
  model    parameter                 label forReg forClass probModel
1 ctree mincriterion 1 - P-Value Threshold   TRUE     TRUE      TRUE
You can also specify your own tuning range in a more customized way if you want to try out specific values. For that, pass a data frame to the tuneGrid argument of train(), with column names matching the parameter names from modelLookup(). If the model you're using has several tuning parameters, build the data frame with expand.grid().
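As a minimal sketch of both approaches: the built-in iris data and method = "knn" are assumptions chosen only because they need no extra packages, and the object names are hypothetical. tuneLength lets caret pick the candidate values, while tuneGrid fixes them explicitly.

```r
library(caret)

# Which parameters can be tuned for this method? For "knn" it is k.
knn_params <- modelLookup("knn")

set.seed(1)
ctrl <- trainControl(method = "cv", number = 3)

# tuneLength = 5: caret generates 5 candidate values of k on its own
fit_len <- train(Species ~ ., data = iris, method = "knn",
                 tuneLength = 5, trControl = ctrl)

# tuneGrid: column names must match the parameter names from modelLookup()
knn_grid <- expand.grid(k = c(3, 5, 7))
fit_grid <- train(Species ~ ., data = iris, method = "knn",
                  tuneGrid = knn_grid, trControl = ctrl)

fit_len$results$k   # candidate values caret generated
fit_grid$results$k  # the values supplied in knn_grid
```

The same pattern should carry over to ctree by building the grid with a mincriterion column instead of k.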

Related

Missing Value Error although "na.action" was set to "na.roughfix"

I would like to create a Random Forest model with caret. Since there are missing values in the training set, I was looking for possible solutions and came across the option "na.roughfix" from the package "randomForest". If the library randomForest is loaded, this option can be used as argument for the parameter "na.action" within the train function of caret. Inside the train function I use a 5-fold CV and tune for the best ROC value. I do this to ensure comparability between other models. The method I've chosen for the Random Forest is "ranger".
But now something strange happens: When I trigger the train function, the calculation is started, but for example the following error message appears:
model fit failed for Fold5: mtry= 7, splitrule=gini, min.node.size= 5 Error : Missing data in columns: ...
The "..." stands for the columns in which the missing values occur. Moreover, this error message always occurs, no matter for which fold or value for mtry.
I am well aware that there are missing values in these columns ... that's why I use na.roughfix. I also remove the NZVs, but that doesn't help either.
I would be very happy about an explanation or even a solution!
Many greetings
Edit.: I've now seen that when I want to set the "na.action" argument in the train function, it is not suggested automatically, which it usually is. It seems that it's somehow lost ... maybe this is the reason why caret does not use na.roughfix ...
Edit. 2: I guess that this is one part of the problem. train always behaves differently, depending on the previous arguments. In my train function I use a recipe from the recipes package to remove the NZVs. As soon as I remove the recipe, the na.action argument becomes available again. However, now the preProcess argument has vanished, meaning I cannot remove the NZVs anymore. This is really a mess :-/ Is there a possibility to apply the na.action AND the preProcess argument at the same time, or any other solution for my missing-values-NZV problem?
Edit. 3: As requested by the user missuse, I will try to provide a code example. Unfortunately I cannot provide data since mine is relatively sensitive - thank you for your understanding.
At first, I create a "blueprint" which I hand over to the train function. Here, I remove the Near Zero Variance Variables.
blueprint <- recipe(target ~ ., data = train_data) %>%
  step_nzv(all_predictors())
In the next step, I define the trainControl
train_control <- trainControl(method = "cv",
                              number = 5,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary,
                              verboseIter = TRUE)
and a grid:
hyper_grid <- expand.grid(mtry = c(1:(ncol(train_data) - 1)),
                          splitrule = c("gini", "extratrees"),
                          min.node.size = c(1, 3, 5, 7, 10))
Finally, I put it all together into the train function:
tuned_rf <- train(
  blueprint,
  data = train_data,
  method = "ranger",
  metric = "ROC",
  trControl = train_control,
  tuneGrid = hyper_grid,
  na.action = na.roughfix
)
Here, the argument na.action doesn't get suggested by R, meaning that it is not available. This throws the error message in the opening question. However, if I remove the blueprint and write the model like this:
tuned_rf <- train(
  target ~ .,
  data = train_data,
  method = "ranger",
  metric = "ROC",
  trControl = train_control,
  tuneGrid = hyper_grid,
  na.action = na.roughfix
)
na.action is available and na.roughfix can be used. However, now the preprocessing is missing. If I want to add the argument "preProcess =" to remove the NZVs, R does not suggest it, meaning that it is not available anymore. Therefore, I would have to replace the formula and the data with the training data X and the response variable Y. Now preProcess is available again ... but na.action has vanished, so I cannot use na.roughfix.
tuned_rf <- train(
  X,
  Y,
  method = "ranger",
  metric = "ROC",
  trControl = train_control,
  tuneGrid = hyper_grid,
  preProcess = "nzv"
)
Of course I could identify the NZVs first and remove them manually - but if I want to apply further steps, the whole process gets complicated.
I hope, my problem is now more understandable ...
According to ?randomForest::na.roughfix, the function just performs median/mode imputation. When using a recipe, you can replace it with step_impute_median and step_impute_mode.
Your blueprint would look like:
library(recipes)
blueprint <- recipe(target ~ ., data = train_data) %>%
  step_nzv(all_predictors()) %>%
  step_impute_median(all_numeric()) %>%
  step_impute_mode(all_nominal())
Perhaps also try
blueprint <- recipe(target ~ ., data = train_data) %>%
  step_impute_median(all_numeric()) %>%
  step_impute_mode(all_nominal()) %>%
  step_nzv(all_predictors())
Depending on how step_nzv handles missing values.
I would also check performance with other imputing functions like
step_impute_bag
step_impute_knn
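To see what such a recipe does before handing it to train(), it can be prepped and baked on its own. The following is only a sketch on a hypothetical toy data frame (toy, num, cat and const are made-up names), and it assumes a recipes version that provides the step_impute_* steps and the all_numeric_predictors() / all_nominal_predictors() selectors, which keep the outcome out of the imputation.

```r
library(recipes)

# Hypothetical toy data: one NA in a numeric column, one NA in a factor,
# plus a constant column that step_nzv() should drop.
toy <- data.frame(
  target = factor(c("a", "b", "a", "b", "a", "b")),
  num    = c(1, 2, NA, 4, 100, 3),
  cat    = factor(c("x", "y", "x", NA, "x", "y")),
  const  = rep(1, 6)
)

rec <- recipe(target ~ ., data = toy) %>%
  step_impute_median(all_numeric_predictors()) %>%  # median imputation for numerics
  step_impute_mode(all_nominal_predictors()) %>%    # most frequent level for factors
  step_nzv(all_predictors())                        # then drop (near-)zero-variance columns

# Estimate the steps on the data and return the processed training set
baked <- bake(prep(rec, training = toy), new_data = NULL)
baked
```

When the recipe is passed to train() instead, caret re-estimates these steps inside each resample, so the imputation values are not computed from the held-out fold.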

Pooled Regression Results using mice, caret, and glmnet

Not sure if this is more of a statistics question, but the closest similar problem I could find is here, although I couldn't get it to work for my case.
I am trying to develop a pooled, penalized logistic regression model. I used mice to create a mids object and then fit a model to each dataset using caret repeated cross-validation with elastic net regression (glmnet) to tune parameters. The fitted object is not of class "mira" but I think I fixed that by changing the object class with the right list items. The major issue is that glmnet does not have an associated vcov method, which is required by pool().
I would like to use penalized regression based on the amount of variables and uncertainty over which ones are the best predictors. My data consists of 4x numeric variables and 9x categorical variables of varying levels and I anticipate including interactions.
Does anyone know how I might be able to create my own vcov method or otherwise address this issue? I am not sure if this is possible.
Example data and code are enclosed, noting that I am not able to share the actual data.
library(mice)
library(caret)
dat <- as.data.frame(list(time=c(4,3,1,1,2,2,3,5,2,4,5,1,4,3,1,1,2,2,3,5,2,4,5,1),
                          status=c(1,1,1,0,2,2,0,0,NA,1,2,0,1,1,1,NA,2,2,0,0,1,NA,2,0),
                          x=c(0,2,1,1,NA,NA,0,1,1,2,0,1,0,2,1,1,NA,NA,0,1,1,2,0,1),
                          sex=c("M","M","M","M","F","F","F","F","M","F","F","M","F","M","M","M","F","F","M","F","M","F","M","F")))
imp <- mice(dat, m = 5, seed = 192)
control <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 3,
                        verboseIter = FALSE)
mod <- list(analyses = vector("list", imp$m))
for (i in 1:imp$m) {
  mod$analyses[[i]] <- train(sex ~ .,
                             data = complete(imp, i),
                             method = "glmnet",
                             family = "binomial",
                             trControl = control,
                             tuneLength = 10,
                             metric = "Kappa")
}
obj <- as.mira(mod)
obj <- list(call=mod$analyses[[1]]$call, call1=imp$call, nmis=imp$nmis, analyses=mod$analyses)
oldClass(obj) <- "mira"
pool(obj)
Produces:
Error in pool(obj) : Object has no vcov() method.

R - how to set a specific number of PCA components to train a prediction model

Using train() and preProcess() I want to build a predictive model using PCA with the first 7 principal components as my predictors.
The below works but I'm not able to specify the number of PCs:
predModel2 <- train(diagnosis~., data=training2, method = "glm", preProcess = "pca")
I've tried this to specify the number of PCs but I don't know how to incorporate it into train():
training_pre<-preProcess(training[,ILcols],method = c("center", "scale", "pca"),pcaComp= 7)
I've tried using:
predModel2 <- train(diagnosis~., data=training2, method = "glm", preProcess = "pca", pcaComp=7)
Error in train.default(x, y, weights = w, ...) : Stopping
UPDATE:
It seems I get around this by using predict() first:
training2_pca <- predict(training_pre, training2)
train(diagnosis~., data=training2_pca, method = "glm")
All preprocessing should be done within the training folds or, in this case, resamples. That prevents data leakage, so the first of the above approaches (preProcess inside train()) should be preferred; see e.g. this question.
The pcaComp argument goes into trainControl(), inside the preProcOptions list. Using the iris data, KNN and the first two principal components as an example:
predModel2 <- train(Species ~ ., data = iris, method = "knn", preProcess = "pca",
                    trControl = trainControl(preProcOptions = list(pcaComp = 2)))
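One way to check that the setting took effect is to inspect the preProcess object stored in the fitted model. This is a small sketch on the iris data; the 3-fold CV setting and the object name pca_fit are assumptions made only to keep the example quick.

```r
library(caret)

set.seed(1)
pca_fit <- train(Species ~ ., data = iris, method = "knn",
                 preProcess = "pca",
                 trControl = trainControl(method = "cv", number = 3,
                                          preProcOptions = list(pcaComp = 2)))

# The preProcess object kept in the fit records how many components were retained
pca_fit$preProcess$numComp
```

Because the PCA is part of the train() call, it is re-estimated within every resample rather than once on the full data.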

R Formula Interface Excluded Variables Still Referenced With predict()?

I'm using caret to train a gbm model in R. I've used the formula interface to exclude certain variables from my model:
gbmTune <- train(Outcome ~ . - VarA - VarB - VarC, data = train,
                 method = "gbm",
                 metric = "ROC",
                 tuneGrid = gbmGrid,
                 trControl = cvCtrl,
                 verbose = FALSE)
When I try to use predict() against my test set, R complains about new factor levels for a variable I've asked to be excluded. The only solution I've been able to come up with is to set those variables to NULL before training my model...remove them. That doesn't seem like the answer.
I'm fairly new at this, so I would love to know what I'm doing wrong!

Pass PCA preprocessing arguments to train()

I'm trying to build a predictive model in caret using PCA as pre-processing. The pre-processing would be as follows:
preProc <- preProcess(IL_train[,-1], method="pca", thresh = 0.8)
Is it possible to pass the thresh argument directly to caret's train() function? I've tried the following, but it doesn't work:
modelFit_pp <- train(IL_train$diagnosis ~ . , preProcess="pca",
                     thresh= 0.8, method="glm", data=IL_train)
If not, how can I pass the separate preProc results to the train() function?
As per the documentation, you specify additional preprocessing arguments with trainControl
?trainControl
...
preProcOptions
A list of options to pass to preProcess. The type of pre-processing
(e.g. center, scaling etc) is passed in via the preProc option in train.
...
Since your dataset is not reproducible, let's look at an example. I will use the Sonar dataset from mlbench and use the pls algorithm just for fun.
library(caret)
library(mlbench)
data(Sonar)
ctrl <- trainControl(preProcOptions = list(thresh = 0.95))
mod <- train(Class ~ .,
             data = Sonar,
             method = "pls",
             trControl = ctrl)
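To get a feel for what thresh does, the number of retained components can be compared with preProcess() used standalone. This is only an illustrative sketch on the numeric columns of iris (the data set and object names are assumptions, not part of the question).

```r
library(caret)

# thresh is the cumulative proportion of variance the retained PCs must explain
pp_80 <- preProcess(iris[, 1:4], method = "pca", thresh = 0.80)
pp_99 <- preProcess(iris[, 1:4], method = "pca", thresh = 0.99)

# A higher threshold can only keep the same number of components or more
c(pp_80$numComp, pp_99$numComp)

# Applying the preProcess object replaces the numeric columns with the PCs
pcs <- predict(pp_80, iris[, 1:4])
head(pcs)
```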
Although documentation isn't the most exciting read, definitely make sure to try to go through it. Package authors work hard to create documentation and there are many wonders to be found within.
