R formula interface: excluded variables still referenced with predict()?

I'm using caret to train a gbm model in R. I've used the formula interface to exclude certain variables from my model:
gbmTune <- train(Outcome ~ . - VarA - VarB - VarC, data = train,
                 method = "gbm",
                 metric = "ROC",
                 tuneGrid = gbmGrid,
                 trControl = cvCtrl,
                 verbose = FALSE)
When I try to use predict() on my test set, R complains about new factor levels for a variable I've asked to be excluded. The only workaround I've found is to set those variables to NULL before training my model, i.e. remove them entirely. That doesn't seem like the right answer.
I'm fairly new at this, so I would love to know what I'm doing wrong!
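One workaround that sidesteps the formula-exclusion quirk is to drop the columns before training, so neither train() nor predict() ever sees them. A minimal base-R sketch; toy_train is a stand-in for the asker's data, with VarA/VarB/VarC as in the question:

```r
# Drop the excluded columns up front instead of excluding them in the
# formula; predict() then never checks their factor levels.
toy_train <- data.frame(Outcome = c("a", "b"), VarA = 1:2,
                        VarB = 3:4, VarC = 5:6, Keep1 = 7:8)
keep <- setdiff(names(toy_train), c("VarA", "VarB", "VarC"))
toy_sub <- toy_train[, keep]
names(toy_sub)  # "Outcome" "Keep1"

# With the real data, the model would then be fit roughly as:
# gbmTune <- train(Outcome ~ ., data = train[, keep], ...)
# predict(gbmTune, newdata = test[, keep])
```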

Related

R function for finding the sensitivity given an alpha value

I am new to data analysis with R so any help is appreciated.
I have a dataset with some explanatory variables and one target variable. The target variable is either Yes or No only. So I would like to use logistic regression for model fitting.
This is how I plot a ROC curve:
myModel <- train(
  myTarget ~ .,
  myTrainData,
  method = "glm",
  metric = "ROC",
  trControl = myControl,
  na.action = na.pass
)
myPred <- predict(myModel, newdata = myTestData, type = "prob")
eval <- evalm(data.frame(myPred, myTestData$myTarget))
eval$roc
Now I would like to find the sensitivity for a given alpha value (Type I error), and show the information like the following if possible. How can I achieve it?
confusionMatrix(myPred, reference = myTestData$myTarget)
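One way to get the sensitivity at a chosen Type I error without extra packages is to scan thresholds on the predicted probabilities: the false positive rate at a cutoff is the Type I error, and the sensitivity at the largest cutoff still within alpha is the value asked for. A base-R sketch with toy stand-ins for myPred and myTestData$myTarget (pROC's coords() offers the same lookup on a fitted ROC object):

```r
# Sensitivity at a given Type I error (alpha), sketched in base R.
# probs/truth are toy stand-ins for the model's "Yes" probabilities
# and the true labels.
probs <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
truth <- c("Yes", "Yes", "No", "Yes", "No", "No")

sens_at_alpha <- function(probs, truth, alpha, positive = "Yes") {
  sens <- NA_real_
  for (t in sort(unique(probs), decreasing = TRUE)) {
    pred_pos <- probs >= t
    fpr <- mean(pred_pos[truth != positive])   # Type I error at this cutoff
    if (fpr > alpha) break                     # alpha exceeded: stop
    sens <- mean(pred_pos[truth == positive])  # sensitivity at this cutoff
  }
  sens
}
sens_at_alpha(probs, truth, alpha = 0.34)  # 1
sens_at_alpha(probs, truth, alpha = 0)     # 2/3
```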

Missing Value Error although "na.action" was set to "na.roughfix"

I would like to create a Random Forest model with caret. Since there are missing values in the training set, I was looking for possible solutions and came across the option "na.roughfix" from the package "randomForest". If the library randomForest is loaded, this option can be used as argument for the parameter "na.action" within the train function of caret. Inside the train function I use a 5-fold CV and tune for the best ROC value. I do this to ensure comparability between other models. The method I've chosen for the Random Forest is "ranger".
But now something strange happens: When I trigger the train function, the calculation is started, but for example the following error message appears:
model fit failed for Fold5: mtry= 7, splitrule=gini, min.node.size= 5 Error : Missing data in columns: ...
The "..." stands for the columns in which the missing values occur. Moreover, this error message always occurs, no matter for which fold or value for mtry.
I am well aware that there are missing values in these columns ... that's why I use na.roughfix. I also remove the NZVs, but that doesn't help either.
I would be very happy about an explanation or even a solution!
Many greetings
Edit: I've now noticed that the na.action argument does not get auto-suggested in the train function, which it usually does. It seems to be lost somehow ... maybe this is the reason why caret does not use na.roughfix ...
Edit 2: I guess this is one part of the problem. train always behaves differently depending on the previous arguments. In my train function I use a recipe from the recipes package to remove the NZVs. As soon as I remove the recipe, the na.action argument becomes available again. However, now the preProcess argument vanishes, meaning I cannot remove the NZVs anymore. This is really a mess :-/ Is there a possibility to apply the na.action AND the preProcess argument at the same time, or any other solution for my Missing-Values-NZV problem?
Edit 3: As requested by the user missuse, I'll try to provide a code example. Unfortunately I cannot provide data, since mine is relatively sensitive - thank you for your understanding.
At first, I create a "blueprint" which I hand over to the train function. Here, I remove the Near Zero Variance Variables.
blueprint <- recipe(target ~ ., data = train_data) %>%
  step_nzv(all_predictors())
In the next step, I define the trainControl
train_control <- trainControl(method = "cv",
                              number = 5,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary,
                              verboseIter = TRUE)
and a grid:
hyper_grid <- expand.grid(mtry = c(1:(ncol(train_data) - 1)),
                          splitrule = c("gini", "extratrees"),
                          min.node.size = c(1, 3, 5, 7, 10))
Finally, I put it all together into the train function:
tuned_rf <- train(
  blueprint,
  data = train_data,
  method = "ranger",
  metric = "ROC",
  trControl = train_control,
  tuneGrid = hyper_grid,
  na.action = na.roughfix
)
Here, the argument na.action doesn't get suggested by R, meaning it is not available. This leads to the error message from the opening question. However, if I remove the blueprint and write the model like this:
tuned_rf <- train(
  target ~ .,
  data = train_data,
  method = "ranger",
  metric = "ROC",
  trControl = train_control,
  tuneGrid = hyper_grid,
  na.action = na.roughfix
)
na.action is available and na.roughfix can be used. However, now the preprocessing is missing. If I try to add the preProcess = argument to remove the NZVs, R does not suggest it, meaning it is no longer available. I would therefore have to replace the formula and the data with a predictor matrix X and the response variable Y. Now preProcess is available again ... but na.action has vanished, so I cannot use na.roughfix.
tuned_rf <- train(
  X,
  Y,
  method = "ranger",
  metric = "ROC",
  trControl = train_control,
  tuneGrid = hyper_grid,
  preProcess = "nzv"
)
Of course I could identify the NZVs first and remove them manually - but if I want to apply further steps, the whole process gets complicated.
I hope, my problem is now more understandable ...
From the help page ?randomForest::na.roughfix: it just performs median/mode imputation, so when using a recipe you can replace it with step_impute_median and step_impute_mode.
Your blueprint would look like:
library(recipes)
blueprint <- recipe(target ~ ., data = train_data) %>%
  step_nzv(all_predictors()) %>%
  step_impute_median(all_numeric()) %>%
  step_impute_mode(all_nominal())
Perhaps also try
blueprint <- recipe(target ~ ., data = train_data) %>%
  step_impute_median(all_numeric()) %>%
  step_impute_mode(all_nominal()) %>%
  step_nzv(all_predictors())
depending on how step_nzv handles missing values.
I would also check performance with other imputation functions, like step_impute_bag and step_impute_knn.
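For intuition, the median/mode imputation that na.roughfix and the two impute steps perform amounts to the following in base R (a toy data frame stands in for the real training data):

```r
# Median imputation for numerics, mode imputation for factors,
# mirroring what na.roughfix / step_impute_median / step_impute_mode do.
toy <- data.frame(num = c(1, 2, NA, 4),
                  fac = factor(c("a", "a", NA, "b")))
toy$num[is.na(toy$num)] <- median(toy$num, na.rm = TRUE)     # median = 2
toy$fac[is.na(toy$fac)] <- names(which.max(table(toy$fac)))  # mode = "a"
toy
```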

Automate variable selection based on varimp in R

In R, I have a logistic regression model as follows
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(result ~ ., data = df,
                     trControl = train_control,
                     method = "glm",
                     family = binomial(link = "logit"))
calculatedVarImp <- varImp(logit_Model, scale = FALSE)
I use multiple datasets that run through the same code, so the variable importance changes for each dataset. Is there a way to get the names of the variables whose overall importance is less than n (e.g. 1), so I can automate the removal of those variables and rerun the model?
I was unable to get the information from the calculatedVarImp variable by subsetting on the Overall value:
lowVarImp <- subset(calculatedVarImp , importance$Overall <1)
Also, is there a better way of doing variable selection?
Thanks in advance
You're using the caret package. Not sure if you're aware of this, but caret has a method for stepwise logistic regression using the Akaike Information Criterion: glmStepAIC.
It performs stepwise selection, iteratively adding and removing predictors and stopping at the model with the lowest AIC.
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(y ~ ., data = train_data,
                     trControl = train_control,
                     method = "glmStepAIC",
                     family = binomial(link = "logit"),
                     na.action = na.omit)
logit_Model$finalModel
This is as automated as it gets, but it may be worth reading about the downsides of this method before relying on it.
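As for the subsetting attempted in the question: varImp()'s importance table is a data frame with variables as row names and an Overall column, so the low-importance names can be pulled out directly. A minimal base-R sketch, where imp stands in for calculatedVarImp$importance:

```r
# 'imp' stands in for calculatedVarImp$importance: one row per
# variable (row names) and an Overall importance column.
imp <- data.frame(Overall = c(2.5, 0.4, 1.7, 0.9),
                  row.names = c("x1", "x2", "x3", "x4"))
low_vars <- rownames(imp)[imp$Overall < 1]
low_vars  # "x2" "x4"

# The reduced data set for refitting would then be:
# df_reduced <- df[, !(names(df) %in% low_vars)]
```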

TuneLength definition in Caret R

Can Someone Explain how the TuneLength works in different models in the train function of the Caret package?
ctreeModel <- train(CompressiveStrength ~ .,
                    data = trainingSet,
                    method = "ctree",
                    tuneLength = 10,
                    trControl = controlObject)
In this case, has tuneLength been used to define the number of predictors that are used in each split?
It all depends on the model. A valuable function within caret is modelLookup(). Pass it a string with the name of the model you're using, for example modelLookup("rf"), and it will tell you which parameter is being tuned by tuneLength. In your case above:
> modelLookup("ctree")
  model    parameter                 label forReg forClass probModel
1 ctree mincriterion 1 - P-Value Threshold   TRUE     TRUE      TRUE
You can also specify your own range of tuning values in a more customized way, if you want to try out specific values. For that, pass a data frame with column names matching the parameters reported by modelLookup(), or, if the model has many tuning parameters, build it with expand.grid().
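For the ctree example above, a hand-picked grid would be a data frame whose column name matches the parameter reported by modelLookup("ctree"); the values here are illustrative:

```r
# One column per tuning parameter; ctree has only mincriterion.
ctreeGrid <- expand.grid(mincriterion = seq(0.90, 0.99, length.out = 10))
nrow(ctreeGrid)  # 10 candidate values

# Then pass it instead of tuneLength:
# ctreeModel <- train(CompressiveStrength ~ ., data = trainingSet,
#                     method = "ctree", tuneGrid = ctreeGrid,
#                     trControl = controlObject)
```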

R - how to set a specific number of PCA components to train a prediction model

Using train() and preProcess() I want to build a predictive model using PCA with the first 7 principal components as my predictors.
The below works but I'm not able to specify the number of PCs:
predModel2 <- train(diagnosis ~ ., data = training2, method = "glm", preProcess = "pca")
I've tried this to specify the number of PCs but I don't know how to incorporate it into train():
training_pre <- preProcess(training[, ILcols], method = c("center", "scale", "pca"), pcaComp = 7)
I've tried using:
predModel2 <- train(diagnosis ~ ., data = training2, method = "glm", preProcess = "pca", pcaComp = 7)
Error in train.default(x, y, weights = w, ...) : Stopping
UPDATE:
It seems I can get around this by applying predict() first:
training2_pca <- predict(training_pre, training2)
train(diagnosis ~ ., data = training2_pca, method = "glm")
All preprocessing should be done within the training folds or, in this case, resamples. That prevents data leaks, so the first of the above approaches should be preferred; see e.g. this question.
The pcaComp argument goes into trainControl(). Using the iris data, KNN and the first two principal components as an example:
predModel2 <- train(Species ~ ., data = iris, method = "knn", preProcess = "pca",
                    trControl = trainControl(preProcOptions = list(pcaComp = 2)))
