I have a model like the following:
library(mlbench)
data(Sonar)
library(caret)
set.seed(998)
my_data <- Sonar
fitControl <- trainControl(
  method = "boot632",
  number = 10,
  classProbs = TRUE,
  savePredictions = TRUE,
  summaryFunction = twoClassSummary
)
model <- train(
  Class ~ .,
  data = my_data,
  method = "xgbTree",
  trControl = fitControl,
  metric = "ROC"
)
How do I plot the ROC curve for this model? As I understand it, the probabilities must be saved (which I did in trainControl), but because of the random sampling which bootstrapping uses to generate a 'test' set, I am not sure how caret calculates the ROC value and how to generate a curve.
To isolate the class probabilities for the best performing parameters, I am doing:
for (a in 1:length(model$bestTune)) {
  model$pred <-
    model$pred[model$pred[, paste(colnames(model$bestTune)[a])] == model$bestTune[1, a], ]
}
Please advise.
Thanks!
First an explanation:
If you do not need to check how each possible hyperparameter combination predicted on each sample in each resample, you can set savePredictions = "final" in trainControl to save space:
fitControl <- trainControl(
  method = "boot632",
  number = 10,
  classProbs = TRUE,
  savePredictions = "final",
  summaryFunction = twoClassSummary
)
After running the model:
model <- train(
  Class ~ .,
  data = my_data,
  method = "xgbTree",
  trControl = fitControl,
  metric = "ROC"
)
The results of interest are in model$pred.
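For orientation, a quick look at what is stored there (a sketch; the exact tuning-parameter columns depend on the xgbTree grid):
head(model$pred)
# columns include the observed class (obs), the predicted class (pred),
# the class probabilities (M and R), the row of my_data the prediction
# refers to (rowIndex), the tuning parameters and the resample id (Resample)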
Here you can check how many samples were tested in each resample (with number = 10 there are ten resamples, named Resample01 to Resample10):
nrow(model$pred[model$pred$Resample == "Resample01",])
#83
caret always provides predictions from rows not used to build the model.
nrow(my_data) #208
83/208 ≈ 0.40, which makes sense for the held-out 'test' samples with boot632: on average about (1 - 1/n)^n ≈ 37% of the rows are left out of each bootstrap sample.
Now to build the ROC curve. You have several options here:
- average the probability for each sample across resamples and use that (this is usual for CV, since every sample is held out the same number of times, but it can also be done with bootstrapping; a sketch is shown after the pROC example below),
- plot all held-out predictions as they are, without averaging,
- plot a ROC curve for each resample.
I will show the second approach:
Create a data frame of class probabilities and true outcomes:
for_lift = data.frame(Class = model$pred$obs, xgbTree = model$pred$R)
Plot the ROC curve:
pROC::plot.roc(pROC::roc(response = for_lift$Class,
                         predictor = for_lift$xgbTree,
                         levels = c("M", "R")),
               lwd = 1.5)
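For the first (averaging) option, a minimal sketch could look like this; it relies on the rowIndex column that caret stores in model$pred to identify the original row of my_data:
# average the class "R" probability for each original row across resamples
avg_pred <- aggregate(R ~ rowIndex, data = model$pred, FUN = mean)
avg_pred$obs <- my_data$Class[avg_pred$rowIndex]
pROC::plot.roc(pROC::roc(response = avg_pred$obs,
                         predictor = avg_pred$R,
                         levels = c("M", "R")),
               lwd = 1.5)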
You can also do this with ggplot2. To do so, I find it easiest to make a lift object using the caret function lift:
lift_obj = lift(Class ~ xgbTree, data = for_lift, class = "R")
The class argument specifies which class the probability column refers to.
library(ggplot2)
ggplot(lift_obj$data) +
  geom_line(aes(1 - Sp, Sn, color = liftModelVar)) +
  scale_color_discrete(guide = guide_legend(title = "method"))
Related
I used the caret and glmnet packages to run a lasso logistic regression with repeated cross-validation to select the optimal lambda.
glmnet.obj <- train(outcome ~ .,
                    data = df.train,
                    method = "glmnet",
                    metric = "ROC",
                    family = "binomial",
                    trControl = trainControl(
                      method = "repeatedcv",
                      repeats = 10,
                      number = 10,
                      summaryFunction = twoClassSummary,
                      classProbs = TRUE,
                      savePredictions = "all",
                      selectionFunction = "best"))
After that, I get the best lambda and alpha:
best_lambda <- get_best_result(glmnet.obj)$lambda
best_alpha <- get_best_result(glmnet.obj)$alpha
Then I obtain the predicted probabilities for the test set:
pred_prob <- predict(glmnet.obj, s = best_lambda, alpha = best_alpha, type = "prob", newx = x.test)
and then the predicted classes, which I intend to use in confusionMatrix():
pred_class <- predict(glmnet.obj, s = best_lambda, alpha = best_alpha, type = "raw", newx = x.test)
But when I just run pred_class it returns NULL.
What could I be missing here?
You need to use newdata = as opposed to newx =, because predict(glmnet.obj) calls predict.train on the caret object.
You did not provide the get_best_result function, but I suppose it is from this source:
get_best_result = function(caret_fit) {
  best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
  best_result = caret_fit$results[best, ]
  rownames(best_result) = NULL
  best_result
}
Using example data:
set.seed(111)
df = data.frame(outcome = factor(sample(c("y","n"), 100, replace = TRUE)),
                matrix(rnorm(1000), ncol = 10))
colnames(df)[-1] = paste0("col", 1:10)
df.train = df[1:70, ]
x.test = df[71:100, ]
After running your model as above, you can predict with:
pred_class <- predict(glmnet.obj, type = "raw", newdata = x.test)
confusionMatrix(table(pred_class, x.test$outcome))
Confusion Matrix and Statistics

pred_class   n   y
         n   1   5
         y  11  13
The s = (lambda) and newx = arguments come from glmnet. You can potentially use them on glmnet.obj$finalModel, but then you need to convert the data into a matrix, for example:
predict(glmnet.obj$finalModel, s = best_lambda, alpha = best_alpha,
        type = "class", newx = as.matrix(x.test[,-1]))
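A small additional sketch: with a binomial glmnet fit, type = "response" returns predicted probabilities instead of class labels from the final model:
# predicted probabilities from the underlying glmnet fit
predict(glmnet.obj$finalModel, s = best_lambda,
        type = "response", newx = as.matrix(x.test[,-1]))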
How do I choose the optimal df (degrees of freedom) for my splines?
I used Poisson regression with splines to adjust for non-linear changes. Using the caret package, I used the train function with method = "gamSpline", but it tested only 3 candidate values of df.
model <- train(
  RBC ~ elapsed,
  obgyn_aleph,
  method = "gamSpline",
  trControl = trainControl(
    method = "cv",
    number = 10,
    verboseIter = TRUE
  )
)
Aggregating results
Selecting tuning parameters
Fitting df = 3 on full training set
Is this the default? If so, how can I change it?
Tnx,
Daniel
The tuneGrid argument allows the user to specify a custom grid of tuning parameters, in this case df:
model <- train(
  RBC ~ elapsed,
  obgyn_aleph,
  method = "gamSpline",
  trControl = trainControl(
    method = "cv",
    number = 10,
    verboseIter = TRUE
  ),
  tuneGrid = data.frame(df = seq(2, 20, by = 2))
)
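As a short follow-up sketch (assuming the fitted object is called model as above), you can then check which df was selected:
model$results   # cross-validated performance for each candidate df
model$bestTune  # the selected df
plot(model)     # performance profile across the df grid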
I am trying to run a logistic regression on a classification problem. The dependent variable "SUBSCRIBEDYN" is a factor with 2 levels ("Yes" and "No").
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 10,
                              verboseIter = FALSE,
                              classProbs = TRUE,
                              summaryFunction = prSummary)
set.seed(13)
simple.logistic.regression <- caret::train(SUBSCRIBEDYN ~ .,
                                           data = train_data,
                                           method = "glm",
                                           metric = "Accuracy",
                                           trControl = train.control)
simple.logistic.regression
However, it does not accept Accuracy as a metric and warns:
"The metric "Accuracy" was not in the result set. AUC will be used instead"
The metric has to be one that your summaryFunction actually computes. prSummary returns AUC, Precision, Recall and F, so Accuracy is not in the result set, which is why caret falls back to AUC. For a two-class model you can either keep prSummary and use metric = "AUC", or switch to summaryFunction = twoClassSummary and use metric = "ROC". After training the model, you can still retrieve the accuracy from the confusion matrix, for example using the function confusionMatrix().
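A minimal sketch of that two-class setup, assuming train_data and SUBSCRIBEDYN as in the question:
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 10,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary,
                              savePredictions = "final")
set.seed(13)
fit <- caret::train(SUBSCRIBEDYN ~ .,
                    data = train_data,
                    method = "glm",
                    metric = "ROC",
                    trControl = train.control)
# accuracy afterwards, from the held-out predictions
caret::confusionMatrix(fit$pred$pred, fit$pred$obs)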
I am using the random forest and support vector machine methods in the caret package in R. I want to calculate the AUC under the ROC curve for both cases; however, I do not know how to do it in this particular case. My outcome is coded as 0 and 1. Here is an example of the code I am using:
set.seed(123)
cvCtrl <- trainControl(method = "cv", number = 10)
rf_moded <- train(readm30 ~ ., data = train, method = "rf", trControl = cvCtrl)
Do you want to train the model with ROC? Then you need the following:
For trainControl:
control <- trainControl(method = "cv", number = 10,
                        savePredictions = "final",
                        classProbs = TRUE,
                        summaryFunction = twoClassSummary)
And in train:
train(
  outcome ~ .,
  data = data,
  method = method,
  trControl = control,
  metric = "ROC"
)
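Two hedged follow-up notes: classProbs = TRUE requires the outcome to be a factor with valid R level names, so an outcome coded 0/1 (like readm30) has to be recoded first, and after training the cross-validated AUC sits in the ROC column of the results. A sketch using the question's objects:
# recode the 0/1 outcome into valid factor levels (required by classProbs = TRUE)
train$readm30 <- factor(train$readm30, levels = c(0, 1), labels = c("no", "yes"))
rf_moded <- train(readm30 ~ ., data = train, method = "rf",
                  trControl = control, metric = "ROC")
rf_moded$results[, c("mtry", "ROC")]   # cross-validated AUC per mtry value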
Is it possible to select a different ROC set point in the caret train function instead of using metric = "ROC" (which I believe maximizes the AUC)?
For example:
random.forest.orig <- train(pass ~ x + y,
                            data = meter.train,
                            method = "rf",
                            tuneGrid = tune.grid,
                            metric = "ROC",
                            trControl = train.control)
Specifically, I have a two-class problem (fail or pass) and I want to maximize the fail predictions while still maintaining a fail accuracy (or negative predictive value) of >80%, i.e. for every 10 predicted fails at least 8 are correct.
You can customize the caret::trainControl() object to use AUC, instead of accuracy, to tune the parameters of your models. Please check the caret documentation for details. (The built-in function, twoClassSummary, will compute the sensitivity, specificity and area under the ROC curve).
Note: in order to compute class probabilities, the pass outcome must be a factor.
Below is an example using 5-fold CV:
fitControl <- caret::trainControl(
  method = "cv",
  number = 5,
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE
)
So your code will be adjusted a bit:
random.forest.orig <- train(pass ~ x + y,
                            data = meter.train,
                            method = "rf",
                            tuneGrid = tune.grid,
                            metric = "ROC",
                            trControl = fitControl)
# Print the model to the console to examine the output
random.forest.orig
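On the >80% negative-predictive-value goal itself: metric = "ROC" only steers the tuning, it does not move the classification cut-off away from 0.5. A hedged sketch for inspecting alternative cut-offs, assuming savePredictions = "final" is added to fitControl and that "fail" is the name of the corresponding probability column in random.forest.orig$pred:
library(pROC)
# ROC built from the held-out predictions of the tuned model
roc_obj <- roc(response = random.forest.orig$pred$obs,
               predictor = random.forest.orig$pred$fail)
# threshold, sensitivity, specificity and NPV at each candidate cut-off
coords(roc_obj, x = "all",
       ret = c("threshold", "sensitivity", "specificity", "npv"))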