Difference between the model and $finalModel for classification in R?

I currently have this random forest model, and I'm just seeing how well it predicts who is diabetes positive or diabetes negative.
The model is fitted using the caret workflow.
When looking at variable importance, I was told to use the code
randomForest::importance(model$finalModel)
What is the purpose of $finalModel? What is $finalModel compared to just the original model? Shouldn't the original model itself be passed in as the argument instead to view variable importance?
Example below:
library(tidyverse)
library(mlbench)
library(caret)
library(car)
library(glmnet)
library(rpart.plot)
library(rpart)
data("PimaIndiansDiabetes2")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples,]
test.data <- PimaIndiansDiabetes2[-training.samples,]
model_rf <- caret::train(
diabetes ~.,
data = train.data,
method = "rf",
trControl = trainControl("cv", number = 10),
importance = TRUE)
model_rf
model_rf$bestTune
model_rf$finalModel
# variable importance here
randomForest::importance(model_rf$finalModel)

From the documentation:
finalModel: A fit object using the best parameters
Most of the time with train you pass several candidate values for hyper-parameter tuning, to find the values that achieve the best performance (resampled as specified by trainControl).
Inside model_rf, the finalModel element is the model refit on the whole training set using those best parameter values.
FYI, caret also has a function for variable importance plotting: varImp.
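For example, a minimal sketch continuing from the model_rf object above (the column names in the importance output depend on the class labels and on importance = TRUE having been set):
# caret's model-agnostic wrapper: importance scaled 0-100 by default
vi <- varImp(model_rf)
vi
plot(vi)
# the underlying randomForest measures (per-class importance, MeanDecreaseAccuracy, MeanDecreaseGini)
randomForest::importance(model_rf$finalModel)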

Related

DALEX for classification problems

I built a penalized logistic regression model with caret and then I tried to create an object through DALEX::explain to subsequently analyze various aspects of the model.
Perhaps the problem lies in having a binary classification model.
Here is my reproducible code:
library(caret)
library(DALEX)
library(modelStudio)
set.seed(10)
data <- as.data.frame(mtcars)
data$vs <- as.factor(data$vs)
set.seed(10)
trc <- trainControl(method = "repeatedcv", number = 3, repeats = 4, classProbs = FALSE)
model <- caret::train(vs ~ ., data = data, trControl = trc, family = "binomial", method = "regLogistic")
explainer <- DALEX::explain(
  model = model,
  data = as.data.frame(data[, -which(colnames(data) %in% "vs")]),
  y = as.numeric(as.character(data$vs)),
  predict_function = predict,
  label = "regLogistic")
modelStudio::modelStudio(explainer)
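One thing worth checking (a hedged sketch, not a verified fix): DALEX::explain usually expects predict_function to return a numeric vector on the same scale as y, while predict on a caret classification model returns a factor, so wrapping the prediction may help:
explainer <- DALEX::explain(
  model = model,
  data = as.data.frame(data[, -which(colnames(data) %in% "vs")]),
  y = as.numeric(as.character(data$vs)),
  # return numeric 0/1 instead of a factor; if the method supports class
  # probabilities, predict(m, x, type = "prob") would be preferable
  predict_function = function(m, x) as.numeric(as.character(predict(m, x))),
  label = "regLogistic")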

How to apply knn model on the test dataset after cross validation in R

I am trying to solve the well-known problem named Titanic: Machine Learning from Disaster.
I want to apply knn to predict Survived for the test dataset. I also want to use cross-validation and then apply the fitted model to my test dataset.
The code structure is given below:
install.packages("caret")
library(caret)
knn_2_train <- knn_1_train # train dataset
knn_2_train$Survived <- train$Survived
Survived <- as.factor(train$Survived) # train labels
knn_2_test <- knn_1_test # test dataset
trControl <- trainControl(method = "cv", number = 5)
fit <- train(knn_2_train, Survived,
method = "knn",
tuneGrid = expand.grid(k = 1:50),
metric = "Accuracy",
trControl = trControl
)
Now, I am not sure how I can apply the knn model to the test dataset after the cross-validation.
Any suggestions are appreciated.
You can do the following:
test.df$predObs <- predict(
object = fit,
newdata = test.df
)
This stores your predictions as predObs in your test set test.df, which you can then evaluate with various performance measures.
Good luck with your project!
Note: remember to change test.df so that it corresponds to your test data. Let me know if it works for you!
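For example, if your held-out data actually contains the true labels (the Kaggle test file does not, but a split of the training data would), a quick hedged sketch of the evaluation step:
library(caret)
# both arguments must be factors with the same levels
test.df$Survived <- as.factor(test.df$Survived)   # assumes the true labels are available
confusionMatrix(data = test.df$predObs, reference = test.df$Survived)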

tuneRF vs caret tunning for random forest

I've been trying to tune a random forest model using the tuneRF tool included in the randomForest package, and I'm also using the caret package to tune the model. The issue is that I'm tuning to get mtry and I get different results for each approach. The question is: how do I know which approach is best, and based on what? I'm not clear on whether I should expect similar or different results.
tuneRF: with this approach I get that the best mtry is 3
t <- tuneRF(train[,-12], train[,12],
stepFactor = 0.5,
plot = TRUE,
ntreeTry = 100,
trace = TRUE,
improve = 0.05)
caret: with this approach I always get that the best mtry is all the variables, in this case 6
control <- trainControl(method="cv", number=5)
tunegrid <- expand.grid(.mtry=c(2:6))
set.seed(2)
custom <- train(CRTOT_03~., data=train, method="rf", metric="RMSE",
tuneGrid=tunegrid, ntree = 100, trControl=control)
There are a few differences. For each mtry value, tuneRF fits one model on the whole dataset and you get the OOB error from each of these fits; tuneRF then picks the mtry with the lowest OOB error. So for each value of mtry you have one score (or RMSE value), and it will change from run to run.
In caret you actually do cross-validation, so the test data from each fold is not used at all to fit the model. Although in principle the CV estimate should be similar to the OOB estimate, you should be aware of the differences.
An evaluation that gives a better picture of the error is to run tuneRF for a few rounds, and to use cross-validation in caret:
library(randomForest)
library(mlbench)
library(caret)
data(BostonHousing)
train <- BostonHousing

# run tuneRF several times; each run records the OOB error for the mtry values it visits
tuneRF_res = lapply(1:10, function(i) {
  tr = tuneRF(train[, -14], train[, 14], mtryStart = 2, stepFactor = 0.9,
              ntreeTry = 100, trace = TRUE, improve = 1e-5)
  tr = data.frame(tr)
  tr$RMSE = sqrt(tr[, 2])  # OOB error is an MSE for regression, so take the square root
  tr
})
tuneRF_res = do.call(rbind, tuneRF_res)

control <- trainControl(method = "cv", number = 10, returnResamp = "all")
tunegrid <- expand.grid(.mtry = c(2:7))
caret_res <- train(medv ~ ., data = train, method = "rf", metric = "RMSE",
                   tuneGrid = tunegrid, ntree = 100, trControl = control)
library(ggplot2)
df = rbind(
data.frame(tuneRF_res[,c("mtry","RMSE")],test="tuneRF"),
data.frame(caret_res$resample[,c("mtry","RMSE")],test="caret")
)
df = df[df$mtry!=1,]
ggplot(df,aes(x=mtry,y=RMSE,col=test))+
stat_summary(fun.data=mean_se,geom="errorbar",width=0.2) +
stat_summary(fun=mean,geom="line") + facet_wrap(~test)
You can see that the trend is more or less similar. My suggestion would be to use tuneRF to quickly check the range of mtry values to train over, and then use caret's cross-validation to evaluate them properly.
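A minimal sketch of that two-step workflow, continuing from the BostonHousing example above (the grid taken around the tuneRF optimum is just an illustrative choice):
# step 1: quick OOB-based scan for a sensible mtry neighbourhood
scan <- tuneRF(train[, -14], train[, 14], mtryStart = 2, stepFactor = 1.5,
               ntreeTry = 100, improve = 1e-5, trace = FALSE, plot = FALSE)
mtry_best <- scan[which.min(scan[, "OOBError"]), "mtry"]
# step 2: evaluate a small grid around that value with proper cross-validation
grid <- expand.grid(.mtry = max(1, mtry_best - 1):(mtry_best + 1))
ctrl <- trainControl(method = "cv", number = 10)
cv_fit <- train(medv ~ ., data = train, method = "rf", metric = "RMSE",
                tuneGrid = grid, ntree = 100, trControl = ctrl)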

Cross validation for linear models in R

I am trying to do cross validation of a linear model in R using cv.lm. I have tried capturing the output from cv.lm in a separate variable using something like:
cvOutput <- cv.lm(.....)
However, I cannot extract the predicted values from every fold as cvOutput seems to have no information about folds. Is there any way of extracting this?
Try this out. (I used the Boston housing data from the MASS package as an example, since a linear model needs a numeric outcome.)
First you partition the data:
library(MASS)
library(caret)
df <- Boston
inTrain <- createDataPartition(df$medv,
                               p = 0.8,
                               list = FALSE)
training <- df[inTrain, ]
testing <- df[-inTrain, ]
Then you choose the resampling method:
fitControl <- trainControl(method = "cv", number = 10)
Then you can fit your cross-validated model:
fit <- train(medv ~ .,
             data = training,
             method = "lm",
             trControl = fitControl)
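To actually get the held-out predictions from every fold (what the question asks for), caret can store them if you ask for it in trainControl; a minimal sketch, assuming the training data above:
fitControl <- trainControl(method = "cv", number = 10, savePredictions = "final")
fit <- train(medv ~ ., data = training, method = "lm", trControl = fitControl)
# one row per held-out observation: pred, obs, the row index and the fold (Resample)
head(fit$pred)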

Obtaining training Error using Caret package in R

I am using the caret package in order to train a K-Nearest Neighbors algorithm. For this, I am running this code:
Control <- trainControl(method = "cv", summaryFunction = twoClassSummary, classProbs = TRUE)
tGrid = data.frame(k = 1:100)
trainingInfo <- train(Formula, data = trainData, method = "knn", tuneGrid = tGrid,
                      trControl = Control, metric = "ROC")
As you can see, I am interested in obtaining the AUC of the ROC curve. This code works well, but it returns the testing error (which the algorithm uses for tuning the k parameter of the model) as the mean of the errors of the cross-validation folds. In addition to the testing error, I would like it to return the training error (the mean across the folds of the error obtained on the training data). How can I do it?
Thank you
What you are asking is a bad idea on multiple levels. You will grossly over-estimate the area under the ROC curve. Consider the 1-NN model: you will have perfect predictions every time.
To do this, you will need to run train again and modify the index and indexOut objects:
library(caret)
set.seed(1)
dat <- twoClassSim(200)
set.seed(2)
folds <- createFolds(dat$Class, returnTrain = TRUE)
Control <- trainControl(method = "cv",
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE,
                        index = folds,
                        indexOut = folds)
tGrid=data.frame(k=1:100)
set.seed(3)
a_bad_idea <- train(Class ~ ., data=dat,
method = "knn",
tuneGrid=tGrid,
trControl=Control, metric = "ROC")
Max
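With index and indexOut pointing at the same rows, the resampled metrics reported by train become resubstitution (training-set) estimates, which is why they look overly optimistic; for example:
# ROC/Sens/Spec here are computed on the same rows used to fit each resample
head(a_bad_idea$results[, c("k", "ROC", "Sens", "Spec")])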
