Cross validation for linear models in R

I am trying to do cross validation of a linear model in R using cv.lm. I have tried capturing the output from cv.lm in a separate variable using something like:
cvOutput <- cv.lm(.....)
However, I cannot extract the predicted values for each fold, as cvOutput seems to contain no information about the folds. Is there any way to extract them?

Try this out. (I used the Caravan dataset from the ISLR package as an example.)
First you partition the data:
library(caret)
library(ISLR)

df <- Caravan
inTrain <- createDataPartition(df$Purchase,
                               p = 0.8,
                               list = FALSE)
training <- df[ inTrain, ]
testing  <- df[-inTrain, ]
Then you choose the resampling method:
fitControl <- trainControl(method = "cv", number = 10)
Then you can fit your cross-validated model:
fit <- train(Purchase ~ .,
             data = training,
             method = "lm",
             trControl = fitControl)
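If the goal from the original question is to look at the predictions from each fold, caret can keep those for you: a minimal sketch, assuming the same fit as above, is to add savePredictions to trainControl() and then inspect fit$pred, which holds the hold-out prediction, the observed value, the row index, and the fold label for every resample.
fitControl <- trainControl(method = "cv", number = 10,
                           savePredictions = "final")
# refit `fit` with this fitControl as above, then:
head(fit$pred)                          # columns pred, obs, rowIndex, Resample
subset(fit$pred, Resample == "Fold01")  # hold-out predictions from a single fold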

Related

The RMSE from the train() function in the caret package is very different from the value I get with the RMSE() function from the same package

I am trying to do model selection and want to retrieve the mean RMSE from 10-fold cross validation. For some models it is possible to use the train() function from the caret package; however, for other models I want to look at, I found a manual way to do k-fold cross validation here: https://www.r-bloggers.com/2016/06/bootstrap-and-cross-validation-for-evaluating-modelling-strategies/
The RMSE values for the cross-validated models are, however, more different than I would expect. Below is the code where I apply the different methods of retrieving the RMSE to the same model.
library(caret)
library(datasets)
exd<-warpbreaks
##k-fold cross validation
Repeats <- 100
cv_repeat_num <- Repeats / 10
the_control <- trainControl(method = "repeatedcv", number = 10, repeats = cv_repeat_num)
cv_ex <- train(breaks~wool+tension, data = exd, method = "glm",family= "poisson", trControl = the_control)
m_ex <- glm(data = exd, breaks~wool+tension, family = "poisson")
results <- numeric(10 * cv_repeat_num)
for (j in 0:(cv_repeat_num - 1)) {
  cv_group <- sample(1:10, nrow(exd), replace = TRUE)
  for (i in 1:10) {
    train_data <- exd[cv_group != i, ]
    test_data  <- exd[cv_group == i, ]
    m_ex <- update(m_ex, data = train_data)
    results[j * 10 + i] <- RMSE(
      predict(m_ex, newdata = test_data),
      test_data$breaks)
  }
}
#RMSE from manual cross validation
mean(results)
#RMSE from RMSE function, no cross validation
RMSE(predict(m_ex, exd), exd$breaks)
#RMSE from train function
mean(cv_ex$resample$RMSE)
This difference in RMSE does not occur when I use a simple linear model instead of a Poisson model as in this example. Can someone shed some light on why this is, and whether I could simply use the manual approach I found?
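A sketch that may help isolate the difference (not part of the original post): trainControl() accepts an index argument listing the training rows for each resample, so train() and a manual loop can be forced onto identical folds. Note also that predict() on a Poisson glm returns link-scale (log) predictions by default, while caret scores its hold-out predictions on the response scale as far as I can tell; adding type = "response" in the manual loop is therefore needed for a like-for-like RMSE, and since a Gaussian lm has the identity link this would also explain why the gap only appears for the Poisson model.
library(caret)
set.seed(1)
# build the resampling indices once (a list of training-row vectors per fold)
folds <- createMultiFolds(exd$breaks, k = 10, times = cv_repeat_num)

# hand the same folds to train()
the_control <- trainControl(method = "repeatedcv", number = 10,
                            repeats = cv_repeat_num, index = folds)
cv_ex <- train(breaks ~ wool + tension, data = exd, method = "glm",
               family = "poisson", trControl = the_control)

# reuse the same folds manually, predicting on the response scale
manual_rmse <- sapply(folds, function(idx) {
  fit <- glm(breaks ~ wool + tension, data = exd[idx, ], family = "poisson")
  RMSE(predict(fit, newdata = exd[-idx, ], type = "response"),
       exd$breaks[-idx])
})
mean(manual_rmse)
mean(cv_ex$resample$RMSE)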

How to apply knn model on the test dataset after cross validation in R

I am trying to solve the well-known Titanic: Machine Learning from Disaster problem.
I want to apply knn to predict Survived on the test dataset. I also want to use cross-validation and then apply the resulting model to my test dataset.
The code structure is given below:
install.packages("caret")
library(caret)
knn_2_train <- knn_1_train # train dataset
knn_2_train$Survived <- train$Survived
Survived <- as.factor(train$Survived) # train labels
knn_2_test <- knn_1_test # test dataset
trControl <- trainControl(method = "cv", number = 5)
fit <- train(knn_2_train, Survived,
             method = "knn",
             tuneGrid = expand.grid(k = 1:50),
             metric = "Accuracy",
             trControl = trControl)
Now I am not sure how I can apply the knn model to the test dataset after cross-validation.
Any suggestions are appreciated.
You can do the following:
test.df$predObs <- predict(
  object = fit,
  newdata = test.df
)
This stores your predictions as predObs in your test set test.df, which you can then evaluate with various performance measures.
Good luck with your project!
Note: Remember to change test.df so that it corresponds to your test data. Let me know if it works for you!
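If your test set actually contains the true labels (the Kaggle Titanic test file does not, so this is only a sketch under that assumption), you can evaluate the stored predictions with confusionMatrix(); fit$bestTune also shows which k the 5-fold cross-validation picked.
fit$bestTune   # the value of k selected by cross-validation

# only possible when test.df really has the true Survived labels
confusionMatrix(data      = test.df$predObs,
                reference = as.factor(test.df$Survived))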

Automate variable selection based on varimp in R

In R, I have a logistic regression model as follows
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(result ~ ., data = df,
                     trControl = train_control,
                     method = "glm",
                     family = binomial(link = "logit"))
calculatedVarImp <- varImp(logit_Model, scale = FALSE)
I use multiple datasets that run through the same code, so the variable importance changes for each dataset. Is there a way to get the names of the variables whose overall importance is less than n (e.g. 1), so that I can automate removing those variables and rerun the model?
I was unable to get this information from the 'calculatedVarImp' variable by subsetting on the 'Overall' value:
lowVarImp <- subset(calculatedVarImp , importance$Overall <1)
Also, is there a better way of doing variable selection?
Thanks in advance
You're using the caret package. Not sure if you're aware of this, but caret has a method for stepwise logistic regression using the Akaike Information Criterion: glmStepAIC.
It iteratively drops (or adds) predictors and stops at the model with the lowest AIC.
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(y ~ ., data = train_data,
                     trControl = train_control,
                     method = "glmStepAIC",
                     family = binomial(link = "logit"),
                     na.action = na.omit)
logit_Model$finalModel
This is as automated as it gets, but it may be worth reading up on the downsides of this method before relying on it.
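As for the original subsetting attempt: varImp() returns an object whose $importance slot is a data frame with the predictors as row names, so the low-importance names can be pulled out directly. A sketch, assuming calculatedVarImp, df, result and train_control from the question and the threshold n = 1; note that for factor predictors the rows are named after the dummy variables, so the name matching may need adjusting.
imp     <- calculatedVarImp$importance
lowVars <- rownames(imp)[imp$Overall < 1]   # variables with Overall importance below 1

# drop them and refit, keeping the outcome column `result`
df_reduced  <- df[, !(names(df) %in% lowVars)]
logit_Model <- train(result ~ ., data = df_reduced,
                     trControl = train_control,
                     method = "glm",
                     family = binomial(link = "logit"))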

How to get the predicted class instead of class probabilities?

I have trained a random forest using the caret package for a binary classification task.
library(caret)
set.seed(78)
inTrain <- createDataPartition(disambdata$Response, p=3/4, list = FALSE)
trainSet <- disambdata[inTrain,]
testSet <- disambdata[-inTrain,]
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
grid_rf <- expand.grid(.mtry = c(3,5,7,9))
set.seed(78)
m_rf <- train(Response ~ ., data = trainSet,
              method = "rf", metric = "Kappa",
              trcontrol = ctrl, tuneGrid = grid_rf)
The Response variable contains values {Valid, Invalid}.
Using the following I get the class probabilities for the testing data:
pred <- predict.train(m_rf, newdata = testSet,
type="prob", models=m_rf$finalModel)
However, I am interested in obtaining the predicted class, i.e. Valid or Invalid, instead of class probabilities, so that I can generate a confusion matrix.
I have already tried the argument type="raw" in the predict.train function but it returns a list of NAs.
By passing type = "prob" to the predict() function, you are specifically asking for probabilities. Just remove it and it will return the class labels:
pred <- predict.train(m_rf, newdata = testSet, models = m_rf$finalModel)
It seems that the caret package (caret_6.0-70) still has an issue with the formula interface. Expanding the formula from Response ~ . to one that explicitly mentions all predictors, e.g. Response ~ MaxLikelihood + n1 + n2 + count, resolves the problem, and predict.train(m_rf, newdata = testSet) then returns the predicted class.
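Once the raw class predictions come back correctly, the confusion matrix the question is after is simply (a sketch, assuming Response is a factor in testSet):
predClass <- predict(m_rf, newdata = testSet)   # class labels, not probabilities
confusionMatrix(data = predClass, reference = testSet$Response)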

R caret / Confusion matrix

I'd like to display the confusion matrix after calling train() from the caret library, but I have some doubts. Should train() be run on a training set? (I'm not sure, because of the "control" parameter.) Should predict() be run on the test set? It seems weird to predict on the whole data set...
# df_corpus = document-term matrix + 1 column Cos.code (the class labels, e.g. 203.2.2, 204.3.2, ...)
dataset <- df_corpus
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
seed <- 7
metric <- "Accuracy"
preProcess=c("center", "scale")
# Linear Discriminant Analysis
set.seed(seed)
fit.lda <- train(Cos.code ~ ., data = dataset, method = "lda", metric = metric,
                 preProc = c("center", "scale"), trControl = control)
ldaClasses <- predict(fit.lda)
cm <- confusionMatrix(data = ldaClasses, dataset$Cos.code)
F1_score(cm$table, "lda")
Thank you for your help
You can get the confusion matrix like this:
confusionMatrix(predict(fit.lda, newdata = dataset), dataset$Cos.code)
You can calculate the confusion matrix in the same manner for your testing set; just switch the datasets.
But I believe your model should already contain the information you want.
Examine the information given when printing these two objects:
fit.lda
fit.lda$finalModel
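To address the doubt about predicting on the whole data set: a sketch of the usual workflow is to hold out part of df_corpus before calling train() and to build the confusion matrix on the held-out rows only (object names follow the question's code).
set.seed(seed)
inTrain  <- createDataPartition(dataset$Cos.code, p = 0.8, list = FALSE)
trainSet <- dataset[ inTrain, ]
testSet  <- dataset[-inTrain, ]

fit.lda <- train(Cos.code ~ ., data = trainSet, method = "lda", metric = metric,
                 preProc = c("center", "scale"), trControl = control)

# confusion matrix on data the model never saw during training
confusionMatrix(predict(fit.lda, newdata = testSet), testSet$Cos.code)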
