DALEX for classification problems - r

I built a penalized logistic regression model with caret, and I am trying to create an explainer object with DALEX::explain so that I can analyze various aspects of the model.
Perhaps the problem lies in the model being a binary classifier.
Here is my reproducible code:
library(caret)  # must be loaded before trainControl() is called
library(DALEX)
library(modelStudio)
set.seed(10)
data <- as.data.frame(mtcars)
data$vs <- as.factor(data$vs)
trc <- trainControl(method = "repeatedcv", number = 3, repeats = 4, classProbs = FALSE)
model <- caret::train(vs ~ ., data = data, trControl = trc, family = "binomial", method = "regLogistic")
explainer <- DALEX::explain(
  model = model,
  data = as.data.frame(data[, -which(colnames(data) %in% "vs")]),
  y = as.numeric(as.character(data$vs)),
  predict_function = predict,
  label = "regLogistic")
modelStudio::modelStudio(explainer)
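One likely culprit, offered as a guess since the question does not include the error message: for a caret classifier, plain `predict` returns class labels (a factor), whereas DALEX explainers generally work with numeric scores such as the probability of the positive class. A minimal sketch of that adjustment follows. The relabelled levels `V` and `S`, the helper name `prob_of_S`, and the switch to `classProbs = TRUE` are choices made here for illustration, not part of the original post.

```r
library(caret)
library(DALEX)

set.seed(10)
dat <- mtcars
# Assumption: relabel 0/1 so the class levels are valid R names,
# which caret requires when class probabilities are requested.
dat$vs <- factor(dat$vs, levels = c(0, 1), labels = c("V", "S"))

trc <- trainControl(method = "repeatedcv", number = 3, repeats = 4,
                    classProbs = TRUE)
model <- train(vs ~ ., data = dat, method = "regLogistic", trControl = trc)

# Hypothetical helper: return P(vs == "S") as a numeric vector.
prob_of_S <- function(m, newdata) predict(m, newdata, type = "prob")[, "S"]

predictors <- dat[, setdiff(colnames(dat), "vs")]
explainer <- DALEX::explain(
  model            = model,
  data             = predictors,
  y                = as.numeric(dat$vs == "S"),
  predict_function = prob_of_S,
  label            = "regLogistic",
  verbose          = FALSE
)
```

The resulting `explainer` can then be passed to `modelStudio::modelStudio(explainer)` as in the original code.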

Related

How to use ggRandomForests dependence plots for caret model in R?

I want to create a facet of partial dependence plots for my model using the gg_variable function from ggRandomForests package. I have the following but it does not work.
How can I do this?
library("caret")
library("ggRandomForests")
library("randomForest")
data("iris")
iris$Species<-NULL
control = trainControl(method="cv", number=5,savePredictions = TRUE)
in_train= createDataPartition(iris$Sepal.Length, p=.66, list=FALSE)
train_st=iris[in_train,]
test_st=iris[-in_train,]
trf_sep = train(Sepal.Length ~ ., data = train_st, ntree = 800, method = "rf",
                metric = "Rsquared", trControl = control, importance = TRUE)
gg_variable(trf_sep)  # Here is the problem
gg_variable requires the output of a randomForest model; it does not work with the object returned by caret::train. In this situation you can use caret's train function to tune mtry, then fit the random forest directly with the randomForest package using the tuned mtry, and apply gg_variable to that fit, like this:
library("caret")
library("ggRandomForests")
library("randomForest")
data("iris")
iris$Species<-NULL
control = trainControl(method="cv", number=5,savePredictions = TRUE)
in_train= createDataPartition(iris$Sepal.Length, p=.66, list=FALSE)
train_st=iris[in_train,]
test_st=iris[-in_train,]
trf_sep = train(Sepal.Length ~ ., data = train_st, ntree = 800, method = "rf",
                metric = "Rsquared", trControl = control, importance = TRUE)
# Refit with randomForest directly, reusing the mtry that caret selected
rf_fit <- randomForest(Sepal.Length ~ ., data = train_st, ntree = 800,
                       mtry = trf_sep$bestTune$mtry)
gg_dta <- gg_variable(rf_fit)
plot(gg_dta, xvar = c("Sepal.Width", "Petal.Length", "Petal.Width"),
     panel = TRUE)

Get predictions from each tree in a random forest model in R (using train function in training and predict in predicting)

I am using caret's train function to train a Random Forest model:
fitControl = trainControl(method="oob")
tuneGrid = expand.grid(.mtry=c(15))
rfmod = train(p ~
x +
y +
z,
method="rf",
data=train_df,
tuneGrid=tuneGrid,
trControl=fitControl,
importance=TRUE,
allowParallel=TRUE)
This is simplified code showing the structure of my model; train_df is my training data.
I want the prediction from each individual tree. After some searching, I tried this:
preds <- predict(rfmod, newdata = test_df[1, ], predict.all = TRUE)
I used only the first row of my test data, test_df. However, when I check preds it still contains a single prediction value, whereas I want the predictions from all the trees.
How can I get the predictions from every tree?
Thanks in advance!
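One approach that is often suggested, sketched here under assumptions: `predict.all` is an argument of randomForest's own predict method, so it may need to be applied to `rfmod$finalModel` rather than to the caret `train` object. The example uses the numeric columns of `iris` as stand-in data, since `train_df` and `test_df` are not shown. With factor predictors, the formula interface of `train` dummy-codes the inputs, and `newdata` would need the same encoding.

```r
library(caret)
library(randomForest)

set.seed(1)
num_iris <- iris[, 1:4]  # stand-in data: numeric columns only

rfmod <- train(Sepal.Length ~ ., data = num_iris, method = "rf",
               tuneGrid = expand.grid(mtry = 2),
               trControl = trainControl(method = "oob"),
               ntree = 50)

# One new observation, predictors only
new_row <- num_iris[1, c("Sepal.Width", "Petal.Length", "Petal.Width")]

# Call predict on the underlying randomForest object, not the train wrapper
preds <- predict(rfmod$finalModel, newdata = new_row, predict.all = TRUE)

preds$aggregate        # the forest's overall prediction
dim(preds$individual)  # one row per observation, one column per tree
```

Here `preds$individual` holds the per-tree predictions and `preds$aggregate` the combined forest prediction.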

Difference between Model and $FinalModel for classification in R?

I currently have this Random Forest model and am checking how well it predicts whether a patient is diabetes positive or diabetes negative.
The model is fitted using the caret workflow.
When looking at variable importance, I was told to use the code
randomForest::importance(model$finalModel)
What is the purpose of $finalModel? How does $finalModel differ from the original model object? Shouldn't the original model be passed as the argument to view variable importance?
Example below:
library(tidyverse)
library(mlbench)
library(caret)
library(car)
library(glmnet)
library(rpart.plot)
library(rpart)
data("PimaIndiansDiabetes2")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples,]
test.data <- PimaIndiansDiabetes2[-training.samples,]
model_rf <- caret::train(
diabetes ~.,
data = train.data,
method = "rf",
trControl = trainControl("cv", number = 10),
importance = TRUE)
model_rf
model_rf$bestTune
model_rf$finalModel
# variable importance here (namespace-qualified so it works even if randomForest is not attached)
randomForest::importance(model_rf$finalModel)
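From the documentation:
finalModel: a fit object using the best parameters
Most of the time with train you try several candidate values for the hyperparameters in order to find the ones that achieve the best resampled performance (the resampling scheme is set through trainControl).
Inside model_rf, finalModel holds the model refit with those best parameters. The train object itself is a wrapper that also stores the resampling results and the tuning grid, which is why randomForest::importance needs model_rf$finalModel rather than model_rf.
FYI, caret also has its own function for variable importance, varImp, whose result can be plotted.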
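The distinction can be seen by inspecting both objects. This is a small sketch on `iris` rather than the diabetes data, with `ntree = 100` and 5-fold CV chosen only to keep it quick:

```r
library(caret)
library(randomForest)

set.seed(123)
fit <- train(Species ~ ., data = iris, method = "rf",
             trControl = trainControl(method = "cv", number = 5),
             ntree = 100, importance = TRUE)

class(fit)             # caret's wrapper around the whole tuning run
class(fit$finalModel)  # the underlying randomForest fit

# The final model is fitted with the mtry that caret selected
fit$bestTune$mtry == fit$finalModel$mtry

randomForest::importance(fit$finalModel)  # importance from the raw model
varImp(fit)                               # caret's own importance summary
```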

Extract the Intercept from a Caret LASSO Model

I'm using the caret package in R to fit a LASSO regression model. My code runs fine, however I would like to extract the Intercept for the final model so I can build a scoring key using the selected predictors and coefficients.
For example, if "Extraversion" is the variable I am trying to model using survey items, I would like to produce the following scoring key:
Intercept + Survey_Item_1*Slope + Survey_Item_2*Slope + and so on
FWIW, I am able to extract the coefficients for the predictors.
My code for reference:
## Create training & test sets
set.seed(9808)
ind <- sample(0:1, nrow(df), replace = TRUE, prob = c(.75, .25))
train <- df[ind == 0, ]
test <- df[ind == 1, ]
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
## Train LASSO model
fit.lasso <- train(Extraversion ~ ., data = train, method = "lasso", preProc = c('scale', 'center', 'nzv'), trControl = ctrl)
fit.lasso
## Coefficients of the final model at the selected fraction
predict(fit.lasso$finalModel, type = 'coefficients', s = fit.lasso$bestTune$fraction, mode = 'fraction')
## Predict on the test data
lasso_test <- predict(fit.lasso, newdata = test, na.action = "na.pass")
postResample(pred = lasso_test, obs = test$Extraversion)
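One possible way to recover the intercept, sketched under assumptions: caret's `method = "lasso"` wraps `elasticnet::enet`, and the sketch assumes the `enet` fit stores the response mean in `mu` and the predictor means in `meanx`, so that the intercept is `mu - sum(meanx * beta)`. The example calls `enet` directly on `mtcars` as stand-in data with an arbitrary fraction `s = 0.5`, and checks the manual scoring key against `predict`. With `preProc = c('scale', 'center')` in caret, the coefficients and intercept would apply to the centered and scaled predictors, not the raw survey items.

```r
library(elasticnet)

x <- as.matrix(mtcars[, c("wt", "hp", "disp")])  # stand-in predictors
y <- mtcars$mpg                                  # stand-in response

fit <- enet(x, y, lambda = 0)  # lambda = 0 gives the lasso path
s <- 0.5                       # arbitrary fraction, for illustration only

beta <- predict(fit, s = s, mode = "fraction",
                type = "coefficients")$coefficients

# Assumed structure of the enet object: mu = mean(y), meanx = column means
intercept <- fit$mu - sum(fit$meanx * beta)

# Scoring key: Intercept + Item_1 * Slope_1 + Item_2 * Slope_2 + ...
manual <- intercept + drop(x %*% beta)
auto <- drop(predict(fit, newx = x, s = s, mode = "fraction",
                     type = "fit")$fit)

all.equal(unname(manual), unname(auto))
```

If the final comparison does not return `TRUE`, the assumption about how the intercept is stored does not hold for the installed version and the key should not be used.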

Obtaining training Error using Caret package in R

I am using the caret package to train a K-Nearest Neighbors model. For this, I am running this code:
Control <- trainControl(method = "cv", summaryFunction = twoClassSummary, classProbs = TRUE)
tGrid <- data.frame(k = 1:100)
trainingInfo <- train(Formula, data = trainData, method = "knn", tuneGrid = tGrid,
                      trControl = Control, metric = "ROC")
As you can see, I am interested in obtaining the area under the ROC curve (AUC). This code works well, but it returns only the testing error (which the algorithm uses to tune the k parameter of the model), computed as the mean over the cross-validation folds. In addition to the testing error, I would also like the training error (the mean across folds of the error obtained on each fold's training data). How can I do it?
Thank you
What you are asking is a bad idea on multiple levels. You will grossly over-estimate the area under the ROC curve. Consider the 1-NN model: each training point is its own nearest neighbor, so you will have perfect predictions every time.
To do this anyway, you will need to run train again and set the index and indexOut arguments so that each fold is evaluated on the same rows it was trained on:
library(caret)
set.seed(1)
dat <- twoClassSim(200)
set.seed(2)
folds <- createFolds(dat$Class, returnTrain = TRUE)
Control <- trainControl(method = "cv",
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE,
                        index = folds,
                        indexOut = folds)
tGrid <- data.frame(k = 1:100)
set.seed(3)
a_bad_idea <- train(Class ~ ., data = dat,
                    method = "knn",
                    tuneGrid = tGrid,
                    trControl = Control, metric = "ROC")
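The 1-NN remark can be checked directly. This is a small sketch that uses caret's `knn3` on the same simulated data and scores the model on its own training rows. With continuous predictors and no duplicated rows, the training accuracy should come out as 1.

```r
library(caret)

set.seed(1)
dat <- twoClassSim(200)

# Fit 1-nearest-neighbor and predict the rows it was trained on
fit_1nn <- knn3(Class ~ ., data = dat, k = 1)
pred <- predict(fit_1nn, dat, type = "class")

train_accuracy <- mean(pred == dat$Class)
train_accuracy
```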
Max
