Calculating AUC of training dataset for glm function in R - r

I am trying to find AUC on a training data for my logistic regression model using glm
I split data to train and test set, fitted a logistic regression model regression model using glm, computed predicted value and trying to find AUC
d<-read.csv(file.choose(), header=T)
set.seed(12345)
train = runif(nrow(d))<.5
table(train)
fit = glm(y~ ., binomial, d)
phat<-predict(fit,type = 'response')
d$phat=phat
g <- roc(y ~ phat, data = d, print.auc=T)
plot(g)

Another user-friendly option is to use the caret library, which makes it pretty straightforward to fit and compare regression/classification models in R. The following example code uses the GermanCredit dataset to predict credit worthiness using a logistic regression model. The code is adapted from this blog: https://www.r-bloggers.com/evaluating-logistic-regression-models/.
library(caret)
## example from https://www.r-bloggers.com/evaluating-logistic-regression-models/
data(GermanCredit)
## 60% training / 40% test data
trainIndex <- createDataPartition(GermanCredit$Class, p = 0.6, list = FALSE)
GermanCreditTrain <- GermanCredit[trainIndex, ]
GermanCreditTest <- GermanCredit[-trainIndex, ]
## logistic regression based on 10-fold cross-validation
trainControl <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
fit <- train(
form = Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own +
CreditHistory.Critical,
data = GermanCreditTrain,
trControl = trainControl,
method = "glm",
family = "binomial",
metric = "ROC"
)
## AUC ROC for training data
print(fit)
## AUC ROC for test data
## See https://topepo.github.io/caret/measuring-performance.html#measures-for-class-probabilities
predictTest <- data.frame(
obs = GermanCreditTest$Class, ## observed class labels
predict(fit, newdata = GermanCreditTest, type = "prob"), ## predicted class probabilities
pred = predict(fit, newdata = GermanCreditTest, type = "raw") ## predicted class labels
)
twoClassSummary(data = predictTest, lev = levels(predictTest$obs))

I like using the performance command found in the ROCR library.
library(ROCR)
# responsev = response variable
d.prediction<-prediction(predict(fit, type="response"), train$responsev)
d.performance<-performance(d.prediction,measure = "tpr",x.measure="fpr")
d.test.prediction<-prediction(predict(fit,newdata=d.test, type="response"), d.test$DNF)
d.test.prefermance<-performance(d.test.prediction, measure="tpr", x.measure="fpr")
# What is the actual numeric performance of our model?
performance(d.prediction,measure="auc")
performance(d.test.prediction,measure="auc")

Related

How to get the continuous probability from SVM prediction

I have trained a binary SVM classifier and made predictions like the following:
classifier = svm(formula = type ~ .,
data = train,
type = 'C-classification',
kernel = 'polynomial')
y_pred = predict(classifier, newdata = test[1:57])
The label that I am training against (type) is a factor. The prediction (y_pred) in this case is also a factor list. How can I obtain the probability/logits of these predictions so that I can produce a ROC curve?
To solve this problem, probability = TRUE need to be specified both when constructing the classifier and making predictions:
classifier = svm(formula = type ~ .,
data = train,
type = 'C-classification',
probability=TRUE,
kernel = 'polynomial')
y_pred = predict(classifier, newdata = test[1:57], probability=TRUE)
Then the attr() can be used to retrieve the probability scores:
prob = as.data.frame(attr(y_pred, "probabilities”))

Listing model coefficients in descending order

I have a dataset with both continuous and categorical variables. I am running regression to predict one of the variables based on the other variables in the dataset. After comparing the results of ridge, lasso and elastic-net regression, the lasso regression is the best model to proceed with.
I used the 'coef' function to extract the model's coefficients, however, the result is a very long list with over 800 variables (as some of my categorical variables have many levels). Is there a way I can quickly rank the coefficients from largest to smallest? This is a glmnet model output
Reproducible problem with example code:
# Libraries Needed
library(caret)
library(glmnet)
library(mlbench)
library(psych)
# Data
data("BostonHousing")
data <- BostonHousing
str(data)
# Data Partition
set.seed(222)
ind <- sample(2, nrow(data), replace = T, prob = c(0.7, 0.3))
train <- data[ind==1,]
test <- data[ind==2,]
# Custom Control Parameters
custom <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
verboseIter = T)
# Linear Model
set.seed(1234)
lm <- train(medv ~.,
train,
method='lm',
trControl = custom)
# Results
lm$results
lm
summary(lm)
plot(lm$finalModel)
# Ridge Regression
set.seed(1234)
ridge <- train(medv ~.,
train,
method = 'glmnet',
tuneGrid = expand.grid(alpha = 0,
lambda = seq(0.0001, 1, length=5)),#try 5 values for lambda between 0.0001 and 1
trControl=custom)
#increasing lambda = increasing penalty and vice versa
#increase lambda therefore will cause coefs to shrink
# Plot Results
plot(ridge)
plot(ridge$finalModel, xvar = "lambda", label = T)
plot(ridge$finalModel, xvar = 'dev', label=T)
plot(varImp(ridge, scale=T))
# Lasso Regression
set.seed(1234)
lasso <- train(medv ~.,
train,
method = 'glmnet',
tuneGrid = expand.grid(alpha=1,
lambda = seq(0.0001,1, length=5)),
trControl = custom)
# Plot Results
plot(lasso)
lasso
plot(lasso$finalModel, xvar = 'lambda', label=T)
plot(lasso$finalModel, xvar = 'dev', label=T)
plot(varImp(lasso, scale=T))
# Elastic Net Regression
set.seed(1234)
en <- train(medv ~.,
train,
method = 'glmnet',
tuneGrid = expand.grid(alpha = seq(0,1,length=10),
lambda = seq(0.0001,1,length=5)),
trControl = custom)
# Plot Results
plot(en)
plot(en$finalModel, xvar = 'lambda', label=T)
plot(en$finalModel, xvar = 'dev', label=T)
plot(varImp(en))
# Compare Models
model_list <- list(LinearModel = lm, Ridge = ridge, Lasso = lasso, ElasticNet=en)
res <- resamples(model_list)
summary(res)
bwplot(res)
xyplot(res, metric = 'RMSE')
# Best Model
en$bestTune
best <- en$finalModel
coef(best, s = en$bestTune$lambda)
For most models all you'd have to do would be:
sort(coef(model), decreasing=TRUE)
Since you're using glmnet it's a little bit more complicated. I'm going to replicate a minimal version of your example here (the other models, plots, etc. are not necessary in order for us to be able to reproduce your problem ...)
## Packages
library(caret)
library(glmnet)
library(mlbench) ## for BostonHousing data
# Data
data("BostonHousing")
data <- BostonHousing
# Data Partition
set.seed(222)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train <- data[ind==1,]
test <- data[ind==2,]
# Custom Control Parameters
custom <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
verboseIter = TRUE)
# Elastic Net Regression
set.seed(1234)
en <- train(medv ~.,
train,
method = 'glmnet',
tuneGrid = expand.grid(alpha = seq(0,1,length=10),
lambda = seq(0.0001,1,length=5)),
trControl = custom)
# Best Model
best <- en$finalModel
coefs <- coef(best, s = en$bestTune$lambda)
(This could probably be made simpler: for example, do you really need the custom control parameters to show us the example? This would be even simpler without using caret - just using `glmnet - but I was afraid I might leave something out.)
Once you've got the coefficients, sorting does appear to work, albeit with a message about possible inefficiency:
sort(coefs, decreasing=TRUE)
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
## [1] 25.191049410 5.078589706 1.389548822 0.244605193 0.045600250
## [6] 0.008840485 0.004372752 -0.012701593 -0.028337745 -0.162794401
## [11] -0.335062819 -0.901475516 -1.395091095 -12.632336419
sort(as.numeric(coefs)) also appears to work fine.
If you want to sort the entire matrix (i.e. keeping the values for all penalization levels), you can take advantage of the fact that the penalization doesn't change the rank-order of the parameters:
coeftab <-coef(best)
lastvals <- coeftab[,ncol(coeftab)]
coeftab_s <- coeftab[order(lastvals,decreasing=TRUE),]
## plot, leaving out the intercept
matplot(t(coeftab_s)[,-1],type="l")

Is it normal to get the same Goodness of Fit values for a logit model and an LPM model, based on the same data?

I am using the Stata dataset ANES.dta with the information about the 2000 presidential election in the USA. I build two models on this dataset - one logit and one LPM. I want to compare the two models with each other using the following Goodness of fit measures - accuracy, sensitivity and specificity of the both models.
I am new to R, I have mainly used STATA so far and that's why I'm wondering if it is normal to get absolutely the same values in confusion matrices for a logit model and a LPM model, based on the same data? Am I doing something wrong?
rm(list=ls())
library(foreign)
dat <- read.dta("ANES.dta", convert.factors = FALSE)
dat_clear <- na.omit(dat)
head(dat_clear)
#Logit model
m1_logit <- glm(vote ~ gender + income + pro_choice ,
data = dat_clear, family = binomial(link = "logit") ,
na.action = na.omit)
summary(m1_logit)
#LPM
m2_lpm <- lm(vote ~ gender + income + pro_choice,
data = dat_clear, na.action = na.omit)
summary(m2_lpm)
#Confusion matrix for logit model
dat_clear$prediction_log <- predict(m1_logit, newdata = dat_clear, type = "response")
dat_clear$vote_pred_log <- as.numeric(dat_clear$prediction_log > .5)
table(observed = dat_clear$vote, predicted = dat_clear$vote_pred_log)
#Confusion matrix for LPM model
dat_clear$prediction_lpm <- predict(m2_lpm, newdata = dat_clear, type = "response")
dat_clear$vote_pred_lpm <- as.numeric(dat_clear$prediction_lpm > .5)
table(observed = dat_clear$vote, predicted = dat_clear$vote_pred_lpm)
This is what the confusion matrices look like

Plot ROC curve for bootstrapped caret model

I have a model like the following:
library(mlbench)
data(Sonar)
library(caret)
set.seed(998)
my_data <- Sonar
fitControl <-
trainControl(
method = "boot632",
number = 10,
classProbs = T,
savePredictions = T,
summaryFunction = twoClassSummary
)
model <- train(
Class ~ .,
data = my_data,
method = "xgbTree",
trControl = fitControl,
metric = "ROC"
)
How do I plot the ROC curve for this model? As I understand it, the probabilities must be saved (which I did in trainControl), but because of the random sampling which bootstrapping uses to generate a 'test' set, I am not sure how caret calculates the ROC value and how to generate a curve.
To isolate the class probabilities for the best performing parameters, I am doing:
for (a in 1:length(model$bestTune))
{model$pred <-
model$pred[model$pred[, paste(colnames(model$bestTune)[a])] == model$bestTune[1, a], ]}
Please advise.
Thanks!
First an explanation:
If you are not going to check how each possible hyper parameter combination predicted on each sample in each re-sample you can set savePredictions = "final" in trainControl to save space:
fitControl <-
trainControl(
method = "boot632",
number = 10,
classProbs = T,
savePredictions = "final",
summaryFunction = twoClassSummary
)
after running the model:
model <- train(
Class ~ .,
data = my_data,
method = "xgbTree",
trControl = fitControl,
metric = "ROC"
)
results of interest are in model$pred
here you can check how many samples were tested in each re-sample (I set 25 repetitions)
nrow(model$pred[model$pred$Resample == "Resample01",])
#83
caret always provides prediction from rows not used in the model build.
nrow(my_data) #208
83/208 makes sense for the test samples for boot632
Now to build the ROC curve. You may opt for several options here:
-average the probability for each sample and use that (this is usual for CV since you have all samples repeated the same number of times, but it can be done with boot also).
-plot all as is without averaging
-plot ROC for each re-sample.
I will show you the second approach:
Create a data frame of class probabilities and true outcomes:
for_lift = data.frame(Class = model$pred$obs, xgbTree = model$pred$R)
plot ROC:
pROC::plot.roc(pROC::roc(response = for_lift$Class,
predictor = for_lift$xgbTree,
levels = c("M", "R")),
lwd=1.5)
You can also do this with ggplot, to do so I find it easiest to make a lift object using caret function lift
lift_obj = lift(Class ~ xgbTree, data = for_lift, class = "R")
specify which class the probability was used ^.
library(ggplot2)
ggplot(lift_obj$data)+
geom_line(aes(1-Sp , Sn, color = liftModelVar))+
scale_color_discrete(guide = guide_legend(title = "method"))

R caret held-out sample and testing set ROC

I am building two different classifiers to predict a binary out come. Then I want to compare the results of the two models by using a ROC curve and the area under it (AUC).
I split the data set into a training and testing set. On the training set I perform a form of cross-validation. From the held-out samples of the cross validation I am able to build a ROC curve per model. Then I use the models on the testing set and build another set of ROC curves.
The results are contradictory which is confusing me. I am not sure which result is the correct one or if I am doing something completely wrong. The held-out sample ROC curve shows that RF is the better model and the training set ROC curve shows that SVM is the better model.
Analysis
library(ggplot2)
library(caret)
library(pROC)
library(ggthemes)
library(plyr)
library(ROCR)
library(reshape2)
library(gridExtra)
my_data <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
str(my_data)
names(my_data)[1] <- "Class"
my_data$Class <- ifelse(my_data$Class == 1, "event", "noevent")
my_data$Class <- factor(emr$Class, levels = c("noevent", "event"), ordered = TRUE)
set.seed(1732)
ind <- createDataPartition(my_data$Class, p = 2/3, list = FALSE)
train <- my_data[ ind,]
test <- my_data[-ind,]
Next I train two models: Random Forest and SVM. Here I also use Max Kuhns function to get the averaged ROC curves from held-out samples for both models and save those results into a another data.frame along with the AUC from the curves.
#Train RF
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = twoClassSummary)
grid <- data.frame(mtry = seq(1,3,1))
set.seed(1537)
rf_mod <- train(Class ~ .,
data = train,
method = "rf",
metric = "ROC",
tuneGrid = grid,
ntree = 1000,
trControl = ctrl)
rfClasses <- predict(rf_mod, test)
#This is the ROC curve from held out samples. Source is from Max Kuhns 2016 UseR! code here: https://github.com/topepo/useR2016
roc_train <- function(object, best_only = TRUE, ...) {
lvs <- object$modelInfo$levels(object$finalModel)
if(best_only) {
object$pred <- merge(object$pred, object$bestTune)
}
## find tuning parameter names
p_names <- as.character(object$modelInfo$parameters$parameter)
p_combos <- object$pred[, p_names, drop = FALSE]
## average probabilities across resamples
object$pred <- plyr::ddply(.data = object$pred,
.variables = c("obs", "rowIndex", p_names),
.fun = function(dat, lvls = lvs) {
out <- mean(dat[, lvls[1]])
names(out) <- lvls[1]
out
})
make_roc <- function(x, lvls = lvs, nms = NULL, ...) {
out <- pROC::roc(response = x$obs,
predictor = x[, lvls[1]],
levels = rev(lvls))
out$model_param <- x[1,nms,drop = FALSE]
out
}
out <- plyr::dlply(.data = object$pred,
.variables = p_names,
.fun = make_roc,
lvls = lvs,
nms = p_names)
if(length(out) == 1) out <- out[[1]]
out
}
temp <- roc_train(rf_mod)
plot_data_ROC <- data.frame(Model='Random Forest', sens = temp$sensitivities, spec=1-temp$specificities)
#This is the AUC of the held-out samples roc curve for RF
auc.1 <- abs(sum(diff(1-temp$specificities) * (head(temp$sensitivities,-1)+tail(temp$sensitivities,-1)))/2)
#Build SVM
set.seed(1537)
svm_mod <- train(Class ~ .,
data = train,
method = "svmRadial",
metric = "ROC",
trControl = ctrl)
svmClasses <- predict(svm_mod, test)
#ROC curve into df
temp <- roc_train(svm_mod)
plot_data_ROC <- rbind(plot_data_ROC, data.frame(Model='Support Vector Machine', sens = temp$sensitivities, spec=1-temp$specificities))
#This is the AUC of the held-out samples roc curve for SVM
auc.2 <- abs(sum(diff(1-temp$specificities) * (head(temp$sensitivities,-1)+tail(temp$sensitivities,-1)))/2)
Next I will plot the results
#Plotting Final
#ROC of held-out samples
q <- ggplot(data=plot_data_ROC, aes(x=spec, y=sens, group = Model, colour = Model))
q <- q + geom_path() + geom_abline(intercept = 0, slope = 1) + xlab("False Positive Rate (1-Specificity)") + ylab("True Positive Rate (Sensitivity)")
q + theme(axis.line = element_line(), axis.text=element_text(color='black'),
axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text())
#ROC of testing set
rf.probs <- predict(rf_mod, test,type="prob")
pr <- prediction(rf.probs$event, factor(test$Class, levels = c("noevent", "event"), ordered = TRUE))
pe <- performance(pr, "tpr", "fpr")
roc.data <- data.frame(Model='Random Forest',fpr=unlist(pe#x.values), tpr=unlist(pe#y.values))
svm.probs <- predict(svm_mod, test,type="prob")
pr <- prediction(svm.probs$event, factor(test$Class, levels = c("noevent", "event"), ordered = TRUE))
pe <- performance(pr, "tpr", "fpr")
roc.data <- rbind(roc.data, data.frame(Model='Support Vector Machine',fpr=unlist(pe#x.values), tpr=unlist(pe#y.values)))
q <- ggplot(data=roc.data, aes(x=fpr, y=tpr, group = Model, colour = Model))
q <- q + geom_line() + geom_abline(intercept = 0, slope = 1) + xlab("False Positive Rate (1-Specificity)") + ylab("True Positive Rate (Sensitivity)")
q + theme(axis.line = element_line(), axis.text=element_text(color='black'),
axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text())
#AUC of hold out samples
data.frame(Rf = auc.1, Svm = auc.2)
#AUC of testing set. Source is from Max Kuhns 2016 UseR! code here: https://github.com/topepo/useR2016
test_pred <- data.frame(Class = factor(test$Class, levels = c("noevent", "event"), ordered = TRUE))
test_pred$Rf <- predict(rf_mod, test, type = "prob")[, "event"]
test_pred$Svm <- predict(svm_mod, test, type = "prob")[, "event"]
get_auc <- function(pred, ref){
auc(roc(ref, pred, levels = rev(levels(ref))))
}
apply(test_pred[, -1], 2, get_auc, ref = test_pred$Class)
The results from the held-out samples and from the testing set are totally different (I know they will be different but by this much?).
Rf Svm
0.656044 0.5983193
Rf Svm
0.6326531 0.6453428
From the held-out samples one would choose the RF model but from the testing set one would pick the SVM model.
Which is the "correct" or "better" way to chose the model?
Am I making a big mistake somewhere or not understanding something correctly?
If I understand correctly then you have 3 labeled data sets:
Training
Hold-out CV sample from training
"Testing" CV sample
While, yes, under a hold-out sample CV strategy you normally choose your model based on the hold-out sample, you also don't normally also have a larger validation data sample.
Clearly, if both the hold-out and the Testing data sets are (a) labeled and (b) as close to the level of orthogonality as possible from from the training data, then you'd choose your model based on whichever has the larger sample size.
In your case it looks like what you're calling the hold-out sample is just the repeated CV resampling from training. That being the case you have even more reason to prefer the results from the Testing data set validation. See Steffen's related note on repeated CV.
In theory Random Forest's bagging has a inherit form of cross-validation through the OOB stats and the CV conducted within the training phase should give you some measure of validation. However, in practice it's common to observe a lack of orthogonality and an increased likelihood of overfitting since the samples are coming from the training data itself and may be reinforcing the mistake of overfitting for accuracy.
I can explain that theoretically as above to some extent, then beyond that I just have to tell you that empirically I've found that the performance results from the so-called CV and OOB error calculated from the training data can be highly misleading and the true hold-out (Testing) data that was never touched during training is the far better validation.
Your true hold-out sample is the Testing data set, since none of its data is using during training. Use those results.

Resources