Model trained with preProcess imputation not preprocessing new data - R

I am using caret's train function with the preProcess option:
fit <- train(form,
             data = train,
             preProcess = c("YeoJohnson", "center", "scale", "bagImpute"),
             method = model,
             metric = "ROC",
             tuneLength = tune,
             trControl = fitControl)
This preprocesses the training data. However, when I predict, observations with NAs are omitted even though I have bagImpute as an option. I know there is an na.action parameter on predict.train, but I can't get it to work.
predict.train(model, newdata=test, na.action=???)
Is it correct to assume that the predict function automatically preprocesses the new data because the model was trained using the preProcess option? If so, shouldn't the new data be imputed and processed the same way as the training data? What am I doing wrong?
Thanks for any help.

You would use na.action = na.pass. The problem is, while making a working example, I found a bug that occurs when the formula method for train is combined with imputation. Here is an example without the formula method:
library(caret)

set.seed(1)
training <- twoClassSim(100)
testing  <- twoClassSim(100)
testing$Linear05[4] <- NA

fitControl <- trainControl(classProbs = TRUE,
                           summaryFunction = twoClassSummary)

set.seed(2)
fit <- train(x = training[, -ncol(training)],
             y = training$Class,
             preProcess = c("YeoJohnson", "center", "scale", "bagImpute"),
             method = "lda",
             metric = "ROC",
             trControl = fitControl)

predict(fit, testing[1:5, -ncol(testing)], na.action = na.pass)
The bug will be fixed on the next release of the package.
Max
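As a quick check of the behavior described above, comparing the number of returned predictions with the number of rows passed in shows whether the NA row was imputed rather than dropped; a minimal sketch using the objects from the example:
# With na.action = na.pass, the row containing the NA is filled in by the
# bagImpute preprocessing step, so all five rows should get a prediction.
preds <- predict(fit, testing[1:5, -ncol(testing)], na.action = na.pass)
length(preds)  # expected: 5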

Related

How to apply knn model on the test dataset after cross validation in R

I am trying to solve the well-known Titanic: Machine Learning from Disaster problem.
I want to apply kNN to predict survival on the test dataset. I also want to use cross-validation and then apply the resulting model to my test dataset.
The code structure is given below:
install.packages("caret")
library(caret)

knn_2_train <- knn_1_train                 # train dataset
knn_2_train$Survived <- train$Survived
Survived <- as.factor(train$Survived)      # train labels
knn_2_test <- knn_1_test                   # test dataset

trControl <- trainControl(method = "cv", number = 5)
fit <- train(knn_2_train, Survived,
             method = "knn",
             tuneGrid = expand.grid(k = 1:50),
             metric = "Accuracy",
             trControl = trControl)
Now I am not sure how I can apply the kNN model to the test dataset after cross-validation.
Any suggestions are appreciated.
You can do the following:
test.df$predObs <- predict(
  object = fit,
  newdata = test.df
)
This stores your predictions as predObs in your test set test.df, which you can then evaluate with various performance measures.
Good luck with your project!
Note: Remember to change test.df so that it corresponds to your test data. Let me know if it works for you!
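For example, if your held-out data also contains the true labels, a quick evaluation sketch could look like this (the Survived column on the test set is an assumption; adjust the name to your data):
# Compare predictions against the observed outcomes; this only applies to a
# split with known labels, since the Kaggle test set has no Survived column.
confusionMatrix(data      = test.df$predObs,
                reference = as.factor(test.df$Survived))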

Automate variable selection based on varimp in R

In R, I have a logistic regression model as follows
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(result ~ ., data = df,
                     trControl = train_control,
                     method = "glm",
                     family = binomial(link = "logit"))
calculatedVarImp <- varImp(logit_Model, scale = FALSE)
I use multiple datasets that run through the same code, so the variable importance changes for each dataset. Is there a way to get the names of the variables whose overall importance is less than n (e.g. 1), so I can automate the removal of those variables and rerun the model?
I was unable to get the information from the 'calculatedVarImp' variable by subsetting the 'Overall' value:
lowVarImp <- subset(calculatedVarImp, importance$Overall < 1)
Also, is there a better way of doing variable selection?
Thanks in advance
You're using the caret package. Not sure if you're aware of this, but caret has a method for stepwise logistic regression using the Akaike Information Criterion: glmStepAIC.
It iteratively trains models, adding or dropping one predictor at a time, and stops at the model with the lowest AIC.
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(y ~ ., data = train_data,
                     trControl = train_control,
                     method = "glmStepAIC",
                     family = binomial(link = "logit"),
                     na.action = na.omit)
logit_Model$finalModel
This is as automated as it gets, but it may be worth reading up on the downsides of stepwise selection before relying on this method.
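As for subsetting the importance scores directly, a minimal sketch could look like the following, assuming varImp() returned a single Overall column (the df_reduced name is just illustrative):
# The $importance element of the varImp() result is a data frame with one row
# per predictor (row names) and, for glm models, an "Overall" column.
imp <- calculatedVarImp$importance
lowVars <- rownames(imp)[imp$Overall < 1]

# Drop the low-importance predictors and rerun the model on the reduced data.
df_reduced <- df[, !(names(df) %in% lowVars)]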

caret: `predict` fails when `train` formula has deleted variables

TL;DR ANSWER: specify the training data in the newdata argument.
How do I consistently extract class probabilities from trained models with caret's predict? Currently I get an error when the model passed to predict was trained with formula notation and a variable was excluded with -variable.
This can be reproduced with:
fit.lda <- train(Species ~ . - Petal.Length,
                 data = iris,
                 preProcess = c("center", "scale"),
                 trControl = trainControl(method = "repeatedcv",
                                          number = 10,
                                          repeats = 3,
                                          classProbs = TRUE,
                                          savePredictions = "final",
                                          selectionFunction = "best",
                                          summaryFunction = multiClassSummary),
                 method = "lda",
                 metric = "Mean_F1")
and then the following line will fail:
predict(fit.lda, type = "prob")
Error in predict.lda(modelFit, newdata) : wrong number of variables
If -Petal.Length is omitted from the train formula, there is no error. Am I doing something wrong with the formula statement?
I suppose I could dig into the model's pred slot and grab the columns corresponding to the class types (see EDIT2), but this seems hackish. Is there a way to get predict to work as expected?
=====EDIT=====
I trained a number of different models (using formula notation) with caretList from the caretEnsemble package, and I got various errors when trying to use predict:
knn:
Error in knn3Train(train = c(....) : dims of 'test' and 'train' differ
svmRadial:
Warning message:
In method$prob(modelFit = modelFit, newdata = newdata, submodels = param) :
kernlab class probability calculations failed; returning NAs
mlpML:
Error in myFunc[[1]](x, ...) :
number of input data columns 28 does not match number of input neurons 20
Methods that worked without errors were nnet and tree-based methods (rf, xgbTree).
=====EDIT2=====
The following doesn't take repeated resampling into account. The selected answer is much simpler.
Here's a self-fashioned solution for extracting probabilities from the trained model, but for standardization, I'd prefer if it's possible to get predict to behave.
grabProbs <- function(model) model$pred[, colnames(model$pred) %in% model$levels]
grabProbs(fit.lda)
Just use the newdata parameter and it will work:
predict(fit.lda, newdata = iris, type = "prob")
[EDITED]
As we can see, for lda the prediction result is identical with and without newdata, but for randomForest it is not:
library(MASS)
fit.lda <- lda(Species ~ . - Petal.Length, data = iris)
identical(predict(fit.lda), predict(fit.lda, newdata = iris))
# [1] TRUE

library(randomForest)
fit.rf <- randomForest(Species ~ . - Petal.Length, data = iris)
identical(predict(fit.rf), predict(fit.rf, newdata = iris))
# [1] FALSE  (without newdata, randomForest returns out-of-bag predictions)

trainControl in caret package

In the caret package, there is a function called trainControl that allows us to perform various kinds of cross-validation. To perform repeated 10-fold cross-validation, one would use:
fitControl <- trainControl(method= "repeatedcv", number = 10, repeats = 10)
fitJ48_10_fold <- train(x = x, y =y, method = "J48", trControl= fitControl)
while to fit on the training set alone, without resampling, it is:
fitControl <- trainControl(method= "none")
fitJ48train <- train(x = x, y =y, method = "J48", trControl= fitControl)
However, the confusion matrices of these models are the same for both the 10-fold and the training-only fit.
Activity <- predict(fitJ48_10_fold, newdata = Train)
confusionMatrix(Activity, Train$Activity)

Activity <- predict(fitJ48train, newdata = Train)
confusionMatrix(Activity, Train$Activity)
I used the Weka classifier GUI, and indeed the performance of J48 under 10-fold cross-validation is lower than on the training set. Am I wrong to suspect that trainControl from caret isn't working, or am I passing it the wrong way?
Am I wrong to suspect that trainControl from caret isn't working, or am I passing it the wrong way?
A little. For J48, there is a tuning parameter, but the default grid only fits a single value, C = 0.25. The final model will be the same no matter what value of method you use in trainControl, so the confusion matrices will always be the same.
Max
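If the goal is to see the resampling actually influence the chosen model, one option is to hand train a wider tuning grid; a rough sketch (the tunable parameter names depend on the caret version, so check them first):
library(caret)

# Check which J48 parameters your caret version tunes (older versions expose
# only C; newer ones expose C and M).
getModelInfo("J48")$J48$parameters

# Assuming both C and M are tunable, a small grid might look like this:
j48Grid <- expand.grid(C = c(0.05, 0.1, 0.25, 0.5), M = c(1, 2, 3))

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
fitJ48_10_fold <- train(x = x, y = y, method = "J48",
                        tuneGrid = j48Grid, trControl = fitControl)

# The resampled accuracy per candidate is in the results slot; predicting on
# the training data always gives resubstitution performance, regardless of
# the resampling method.
fitJ48_10_fold$results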

Obtaining training Error using Caret package in R

I am using the caret package to train a K-Nearest Neighbors algorithm. For this, I am running this code:
Control <- trainControl(method = "cv", summaryFunction = twoClassSummary, classProbs = TRUE)
tGrid <- data.frame(k = 1:100)
trainingInfo <- train(Formula, data = trainData, method = "knn",
                      tuneGrid = tGrid, trControl = Control, metric = "ROC")
As you can see, I am interested in obtaining the AUC of the ROC curve. This code works well but returns the testing error (which the algorithm uses for tuning the k parameter of the model) as the mean of the error across the cross-validation folds. In addition to the testing error, I would like to obtain the training error (the mean across folds of the error obtained on the training data). How can I do it?
Thank you
What you are asking is a bad idea on multiple levels. You will grossly over-estimate the area under the ROC curve. Consider the 1-NN model: you will have perfect predictions every time.
To do this, you will need to run train again and modify the index and indexOut objects:
library(caret)

set.seed(1)
dat <- twoClassSim(200)

set.seed(2)
folds <- createFolds(dat$Class, returnTrain = TRUE)

Control <- trainControl(method = "cv",
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE,
                        index = folds,
                        indexOut = folds)
tGrid <- data.frame(k = 1:100)

set.seed(3)
a_bad_idea <- train(Class ~ ., data = dat,
                    method = "knn",
                    tuneGrid = tGrid,
                    trControl = Control, metric = "ROC")
Max
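For reference, the per-k summaries from that run end up in the results slot, so the (over-optimistic) ROC values can be inspected with something like:
# Each row is one candidate k with its resampled ROC, sensitivity and
# specificity; because index and indexOut point to the same rows, these are
# training ("resubstitution") estimates and will be overly optimistic.
head(a_bad_idea$results[, c("k", "ROC", "Sens", "Spec")])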
