AUC unexpected value - r

I have the following predictions after running a logistic regression model on a set of molecules that we suppose are predictive of tumors versus normals.
             Predicted class
               T      N
Actual   T     29      5
class    N    993    912
I have a list of scores that range from negative (<0) to positive (>0) values. Another column in my data.frame holds the labels (1 == tumours, 0 == normals) as predicted by the model. I tried to calculate the ROC using library(ROCR) in the following way:
pred = prediction(prediction, labels)
roc = performance(pred, "tpr", "fpr")
plot(roc, lwd=2, colorize=TRUE)
Using pROC instead:
roc_full_data <- roc(labels, prediction)
rounded_scores <- round(prediction, digits=1)
roc_rounded <- roc(labels, prediction)
Call:
roc.default(response = labels, predictor = prediction)
Data: prediction in 917 controls (category 0) < 1022 cases (category1).
Area under the curve: 1
The AUC is equal to 1. I'm not sure whether I ran everything correctly or whether I'm misinterpreting my results, because an AUC equal to 1 is quite rare.
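One possible explanation, going only by the numbers in the question: the pROC output reports 917 controls and 1022 cases, which equal the predicted-class column totals of the confusion matrix (5 + 912 and 29 + 993), not the actual-class row totals. That suggests the labels passed to roc() are the model's predicted classes. If so, the AUC is 1 by construction. The sketch below uses simulated data and a hypothetical helper, auc_rank(), written in base R; it is not the asker's data or pROC's implementation.

```r
# AUC via the rank (Mann-Whitney) formula; labels: 1 = positive, 0 = negative.
# auc_rank is a hypothetical helper for this illustration.
auc_rank <- function(scores, labels) {
  pos <- labels == 1
  n1 <- as.numeric(sum(pos))
  n0 <- as.numeric(sum(!pos))
  r <- rank(scores)  # average ranks in case of ties
  (sum(r[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
actual <- rbinom(500, 1, 0.5)               # simulated true classes
scores <- rnorm(500, mean = 0.5 * actual)   # weakly informative scores
predicted <- as.integer(scores > 0)         # classes derived from the scores

# Labels that are a function of the scores separate them perfectly.
auc_predicted <- auc_rank(scores, predicted)
# Labels that are the true classes give the AUC worth reporting.
auc_actual <- auc_rank(scores, actual)
print(c(predicted = auc_predicted, actual = auc_actual))
```

If this is what happened, passing the actual tumour/normal classes as the response should give a more believable AUC.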

There is a typo in your x.measure which should have thrown an error. You have "for" and not "fpr". Try the following code.
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)
# add a reference line to the graph
abline(a = 0, b = 1, lwd = 2, lty = 2)
# calculate AUC
perf.auc <- performance(pred, measure = "auc")
str(perf.auc)
as.numeric(perf.auc@y.values)

I use pROC to calculate AUC:
require(pROC)
set.seed(1)
pred = runif(100)
y = factor(sample(0:1, 100, TRUE))
auc = as.numeric(roc(response = y, predictor = pred)$auc)
print(auc) # 0.5430757
Or
require(AUC)
auc = AUC::auc(AUC::roc(pred, y))
print(auc) # 0.4569243
I can't explain why the results are different.
EDIT: The above aucs sum to 1.0, so one of the libs automatically 'inverted' the predictions.
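As far as I know, pROC's roc() defaults to direction = "auto", which picks the comparison direction from the group medians, so it can report the AUC of the reversed scores. The base R sketch below shows why the two numbers sum to 1. The helper auc_pairs() is hypothetical and is not how either package computes the AUC.

```r
# AUC as the share of (positive, negative) pairs ranked correctly,
# counting ties as one half. auc_pairs is a hypothetical helper.
auc_pairs <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  mean(outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n)))
}

set.seed(1)
pred <- runif(100)
y <- sample(0:1, 100, TRUE)

auc_up <- auc_pairs(pred, y)     # higher score means class 1
auc_down <- auc_pairs(-pred, y)  # lower score means class 1
print(c(up = auc_up, down = auc_down, sum = auc_up + auc_down))
```

Fixing the direction explicitly in pROC (for example direction = "<") should make the two packages agree.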

Subscript out of bound error in predict function of LASSO model [duplicate]

I am using a LASSO model for prediction, and I get the following error when running the predict function. Can someone help me overcome this?
ERROR MSG:
Error in predict(lasso_model, x, type = "response")[, 2] :
subscript out of bounds
#convert data to matrix format
x <- model.matrix(St_recurrence~.,tr)
#convert class to numerical variable
y <- tr$St_recurrence
#Cross validation - perform grid search to find optimal value of lambda
cv_out <- cv.glmnet(x, y, alpha=1, family = 'binomial', nfolds = 5, type.measure = "auc")
#best value of lambda
best_lambda <- cv_out$lambda.1se
# Rebuilding the model with best lamda value identified
lasso_model <- glmnet(x, y, family = "binomial", alpha = 1, lambda = best_lambda)
coef(lasso_model)
# odds ratio
exp(coef(lasso_model))
library(ROCR)
# Calculate the probability of new observations belonging to "yes"
predprob <- predict(lasso_model, x, type = "response")[,2] ## error comes in this line
# prediction is ROCR function
pred <- prediction(predprob, tr$Structural_recurrence)
# ROC curve (x-axis: fpr, y-axis: tpr)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main="ROC Curve for LASSO model", col=rainbow(10))
#compute area under curve
aucperf <- performance(pred, measure="auc")
print(aucperf@y.values)
For glmnet, if you do ?glmnet::predict.glmnet you can see under details:
type: Type of prediction required. Type ‘"link"’ gives the linear
predictors for ‘"binomial"’, ‘"multinomial"’, ‘"poisson"’ or
‘"cox"’ models; for ‘"gaussian"’ models it gives the fitted
values. Type ‘"response"’ gives the fitted probabilities for
‘"binomial"’ or ‘"multinomial"
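So it returns a single column holding the probability of class 1, unlike caret, which returns two columns. Subscripting that result with [,2] is what raises "subscript out of bounds".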
So you can do:
library(glmnet)
library(ROCR)
library(mlbench)  # provides the Sonar data set
data(Sonar)
# recode the two-level factor Class as 0/1
y = as.numeric(Sonar$Class) - 1
x = as.matrix(Sonar[, -ncol(Sonar)])
lasso_model = glmnet(x = x, y = y, family = "binomial", alpha = 1, lambda = 0.001)
# one column of fitted probabilities, so no [, 2] subscript
predprob <- predict(lasso_model, x, type = "response")
pred <- prediction(predprob, y)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC Curve for LASSO model")
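The error itself can be reproduced without glmnet. The sketch below uses a hand-made one-column matrix as a stand-in for the prediction (an assumption for illustration, not actual glmnet output) and shows that asking for a second column fails, while taking the only column works.

```r
# Stand-in for a one-column matrix of fitted probabilities (made-up values).
predprob <- matrix(c(0.2, 0.7, 0.9), ncol = 1)

# Asking for column 2 of a one-column matrix raises an error;
# tryCatch captures the message instead of stopping the script.
second_col <- tryCatch(predprob[, 2], error = function(e) conditionMessage(e))
print(second_col)

# Dropping to a plain vector (or using predprob[, 1]) works.
prob_vec <- as.vector(predprob)
print(prob_vec)
```

Checking dim(predprob) before subscripting is a quick way to see how many columns a predict method really returns.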

ROC Curve Ranger

I am trying to calculate the ROC curve and AUC using ranger for a binary classification problem (0 and 1), where the response variable is defined as BiClass.
Suppose I split a data frame into Train_Set and Test_Set (75% and 25% respectively) and compute binary class probabilities using:
library(ranger)
library(ROCR)
library(mlr)
library(pROC)
library(tidyverse)
Biclass.ranger <- ranger(BiClass ~ ., data=Train_Set, num.trees = 500, importance="impurity", save.memory = TRUE, probability=TRUE)
pred <- predict(Biclass.ranger, data = Test_Set, num.trees = 500, type='response', verbose = TRUE)
My intention is now to compute ROC curve (and AUC). I tried the following code, through which I get ROC curve (using ROCR and mlr packages):
pred_object <- prediction(pred$predictions[,2], Test_Set$BiClass)
per_measure <- performance(pred_object, "tnr", "fnr")
plot(per_measure, col="red", lwd=1)
abline(a=0,b=1,lwd=1,lty=1,col="gray")
Or, alternatively, using the pROC package:
probabilities <- as.data.frame(predict(Biclass.ranger, data = Test_Set, num.trees = 500, type='response', verbose = TRUE)$predictions)
probabilities$predic <- colnames(probabilities)[max.col(probabilities,ties.method="first")] # For each row, return the column name of the largest value from 0 and 1 columns (prediction column). This will be a character type
probabilities$prednum <- as.numeric(as.character(probabilities$predic)) # create prednum as a numeric data type in probabilities
probabilities <- dplyr::mutate_if(probabilities, is.character, as.factor) # convert character to factor
probabilities <- cbind(probabilities,BiClass=Test_Set$BiClass) # append BiClass. This data frame contains the response variable from the Test_Data, along with prediction (prednum) and probability classes (0 and 1)
ROC_ranger <- pROC::roc(Test_Set$BiClass, pred$predictions[,2])
plot(ROC_ranger, col = "blue", main = "ROC - Ranger")
paste("Accuracy % of ranger: ", mean(Test_Set$BiClass == round(pred$predictions[,2], digits = 0))) # print the performance of each model
The ROC curve obtained is given below:
I have the following questions:
1) How can I set a threshold value and plot confusion matrix for the set threshold?
I compute the confusion matrix presently using:
probabilities <- as.data.frame(predict(Biclass.ranger, data = Test_Set, num.trees = 500, type='response', verbose = TRUE)$predictions)
max.col(probabilities) - 1
confusionMatrix(table(Test_Set$BiClass, max.col(probabilities)-1))
2) How do I calculate the optimal threshold value (a global value at which I have more true positives or true negatives) through optimization?
Again, referring to the pROC and the guidelines proposed by its author using:
myroc <- pROC::roc(probabilities$BiClass, probabilities$`1`)
mycoords <- pROC::coords(myroc, "all", transpose = FALSE)
plot(mycoords$threshold, mycoords$specificity, type="l", col="red", xlab="Cutoff", ylab="Performance")
lines(mycoords$threshold, mycoords$sensitivity, type="l", col="blue")
legend(0.23,0.2, c("Specificity", "Sensitivity"), col=c("red", "blue"), lty=1)
best.coords <- coords(myroc, "best", best.method="youden", transpose = FALSE)
abline(v=best.coords$threshold, lty=2, col="grey")
abline(h=best.coords$specificity, lty=2, col="red")
abline(h=best.coords$sensitivity, lty=2, col="blue")
I was able to draw this curve using the Youden index:
[Figure: specificity and sensitivity versus cutoff, with the Youden-optimal threshold marked]
Does it mean there isn't a lot of freedom to vary threshold to play with specificity and sensitivity, since the dashed blue and red lines are not far away from each other?
3) How to evaluate AUC?
I calculated AUC using pROC again following the guidelines of its author. See below:
ROC_ranger <- pROC::roc(probabilities$BiClass, probabilities$`1`)
ROC_ranger_auc <- pROC::auc(ROC_ranger)
paste("Area under curve of random forest: ", ROC_ranger_auc) # AUC of the model
The final goal is to increase the True Negatives, which are presently defined by 1 in BiClass, and of course the True Positives (defined by 0 in BiClass) in the confusion matrix. At present, the accuracy of my classification algorithm is 0.74 and the AUC is 0.81.
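For questions 1 and 2, a minimal base R sketch of thresholding and of the Youden index may help. It uses simulated probabilities in place of the ranger output; confusion_at() and all variable names are hypothetical, and class 1 is treated as the positive class purely for illustration.

```r
set.seed(42)
truth <- rbinom(400, 1, 0.4)                             # simulated 0/1 classes
prob1 <- plogis(rnorm(400, mean = 1.5 * truth - 0.75))   # simulated P(class 1)

# Confusion matrix at a chosen threshold (predict 1 when prob >= threshold).
confusion_at <- function(prob, truth, threshold) {
  predicted <- factor(as.integer(prob >= threshold), levels = c(0, 1))
  table(Actual = factor(truth, levels = c(0, 1)), Predicted = predicted)
}
print(confusion_at(prob1, truth, 0.5))

# Youden's J = sensitivity + specificity - 1 at every observed probability.
thresholds <- sort(unique(prob1))
youden <- sapply(thresholds, function(t) {
  cm <- confusion_at(prob1, truth, t)
  sens <- cm["1", "1"] / sum(cm["1", ])
  spec <- cm["0", "0"] / sum(cm["0", ])
  sens + spec - 1
})
best_threshold <- thresholds[which.max(youden)]
print(best_threshold)
print(confusion_at(prob1, truth, best_threshold))
```

With the real model, the probability vector would presumably be the column named 1 of the ranger predictions, and the resulting table could be passed to confusionMatrix() as in the question.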

How to solve these problems about inverted ROC curve, small AUC, and the cutoff?

I am constructing this ROC curve from my SVM model, but the curve came out inverted. Also, although my SVM prediction has high accuracy (~93%), my ROC curve shows that my area under the curve is just about 2.7%. Moreover, it tells me that the optimal cutoff value is infinity, which is not what I expected from my model fitting.
I have fitted my SVM model using the built-in SVM function just like in the code I showed below, and then I predicted using the function predict(). Then, I computed the prediction() and calculated the performance(), the cutoff value, and the AUC (all code shown below)
svm.fit <- svm(label ~ NDAI + SD + CORR, data = trainSet, scale = FALSE, kernel = "radial", cost = 2, probability=TRUE)
svm.pred <- predict(svm.fit, testSet, probability=TRUE)
mean(svm.pred== testSet$label)*100
prediction.svm <- prediction(attr(svm.pred, "probabilities")[,2], testSet$label)
eval.svm <- performance(prediction.svm, "acc")
roc.svm <- performance(prediction.svm, "tpr", "fpr")
#identify best values and cutoff
max_index.svm <- which.max(slot(eval.svm, "y.values")[[1]])
max.acc_svm <- (slot(eval.svm, "y.values")[[1]])[max_index.svm]
opt.cutoff_svm <- (slot(eval.svm, "x.values")[[1]])[max_index.svm][[1]]
#AUC
auc.svm <- performance(prediction.svm, "auc")
auc.svm <- unlist(slot(auc.svm, "y.values"))
auc.svm <- round(auc.svm, 4)
plot(roc.svm,colorize=TRUE)
points(0.072, 0.93, pch= 20)
legend(.6,.2, auc.svm, title = "AUC", cex = 0.8)
legend(.8,.2, round(opt.cutoff_svm,4), title = "cutoff", cex = 0.8)
I expect the output to have an AUC close to 1 and a cutoff close to 0.5. Has anyone encountered a similar problem? If so, how should I fix my code?
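A high accuracy together with an AUC near 0 often means the scores belong to the other class. One possible cause here, stated as an assumption: column 2 of the probability matrix may not be the class that ROCR treats as positive, since the column order need not follow the factor levels. The base R sketch below builds a hypothetical probability matrix with swapped columns and compares selecting by position with selecting by name.

```r
# Rank-based AUC; labels: 1 = positive, 0 = negative (hypothetical helper).
auc_rank <- function(scores, labels) {
  pos <- labels == 1
  n1 <- as.numeric(sum(pos))
  n0 <- as.numeric(sum(!pos))
  (sum(rank(scores)[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(7)
label <- rbinom(300, 1, 0.5)
p1 <- plogis(rnorm(300, mean = 4 * label - 2))   # simulated P(label == 1)

# Hypothetical probability matrix whose columns are ordered "1", then "0".
probs <- cbind("1" = p1, "0" = 1 - p1)

auc_by_position <- auc_rank(probs[, 2], label)    # silently picks the "0" column
auc_by_name <- auc_rank(probs[, "1"], label)      # picks P(label == 1)
print(c(by_position = auc_by_position, by_name = auc_by_name))
```

Running colnames(attr(svm.pred, "probabilities")) on the real prediction would show which column is which. The infinite cutoff may have the same root cause: ROCR's cutoff list starts at Inf (nothing predicted positive), and with inverted scores that can be where accuracy peaks.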

Adding arbitrary curve with AUC 0.8 to ROC plot

I have a simple ROC plot that I am creating using pROC package:
plot.roc(response, predictor)
It is working fine, as expected, but I would like to add an "ideally" shaped reference curve with AUC 0.8 for comparison (the AUC of my ROC plot is 0.66).
Any thoughts?
Just to clarify, I am not trying to smoothen my ROC plot, but trying to add a reference curve that would represent AUC 0.8 (similar to the reference diagonal line representing AUC 0.5).
The reference diagonal line has a meaning (a model that guesses randomly), so you would similarly have to define the model associated with your reference curve of AUC 0.8. Different models would be associated with different reference curves.
For instance, one might define a model for which predicted probabilities are distributed evenly between 0 and 1 and, for a point with predicted probability p, the probability of the true outcome is p^(1/k) for some constant k. It turns out that for this model, k=2 yields a plot with AUC 0.8.
library(pROC)
set.seed(144)
probs <- seq(0, 1, length.out=10000)
truth <- runif(10000)^2 < probs
plot.roc(truth, probs)
# Call:
# plot.roc.default(x = truth, predictor = probs)
#
# Data: probs in 3326 controls (truth FALSE) < 6674 cases (truth TRUE).
# Area under the curve: 0.7977
Some algebra shows that this particular family of models has AUC (2+3k)/(2+4k), meaning it can generate curves with AUC between 0.75 and 1 depending on the value of k.
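The reference curve can also be drawn without simulation noise. Assuming the model is P(true | p) = p^(1/k) with p uniform on [0, 1], which is what runif(10000)^2 < probs implements for k = 2, integrating over p above a threshold t gives TPR(t) = 1 - t^(a+1) and FPR(t) = ((a+1)(1-t) - (1 - t^(a+1)))/a with a = 1/k. This is my own derivation, so the sketch checks it numerically against the closed form.

```r
k <- 2                               # P(true | p) = p^(1/k)
a <- 1 / k
t <- seq(1, 0, length.out = 2001)    # thresholds, from strict to lenient

# Exact ROC coordinates under the assumed model.
tpr <- 1 - t^(a + 1)
fpr <- ((a + 1) * (1 - t) - (1 - t^(a + 1))) / a

# Trapezoidal area under the (fpr, tpr) curve versus the closed form.
auc_trapezoid <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc_formula <- (2 + 3 * k) / (2 + 4 * k)
print(c(trapezoid = auc_trapezoid, formula = auc_formula))
```

On an existing pROC plot, whose x-axis is specificity, something like lines(1 - fpr, tpr, lty = 2) should overlay this curve.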
Another approach you could use is linked to logistic regression. If you had a logistic regression linear predictor value p (so the predicted probability would be 1/(1+exp(-p))), then you could label the true outcome as true if p plus some normally distributed noise exceeds 0, and as false otherwise. If the normally distributed noise has variance 0 your model will have AUC 1, and if the noise variance approaches infinity your model will have AUC 0.5.
If I assume the original predictions are drawn from the standard normal distribution, it looks like normally distributed noise with standard deviation 1.2 gives AUC 0.8 (I couldn't figure out a nice closed form for AUC, though):
set.seed(144)
pred.fxn <- rnorm(10000)
truth <- (pred.fxn + rnorm(10000, 0, 1.2)) >= 0
plot.roc(truth, pred.fxn)
# Call:
# plot.roc.default(x = truth, predictor = pred.fxn)
#
# Data: pred.fxn in 5025 controls (truth FALSE) < 4975 cases (truth TRUE).
# Area under the curve: 0.7987
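To see how the noise level in this second model moves the AUC between 1 and 0.5, a small sweep in base R can be used. The rank-based helper auc_rank() is hypothetical and stands in for pROC, and the noise levels are arbitrary choices for illustration.

```r
# Rank-based AUC; labels may be logical or 0/1 (hypothetical helper).
auc_rank <- function(scores, labels) {
  pos <- labels == 1
  n1 <- as.numeric(sum(pos))
  n0 <- as.numeric(sum(!pos))
  (sum(rank(scores)[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(144)
pred.fxn <- rnorm(10000)   # linear predictor values
noise <- rnorm(10000)      # unit noise, rescaled below

# Label the outcome TRUE when predictor plus scaled noise is non-negative.
noise_sd <- c(0, 0.5, 1.2, 3, 100)
auc_by_sd <- sapply(noise_sd, function(s) {
  auc_rank(pred.fxn, pred.fxn + s * noise >= 0)
})
names(auc_by_sd) <- noise_sd
print(round(auc_by_sd, 3))
```

Picking the standard deviation whose AUC is closest to the target gives the reference curve to plot.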
A quick/rough way is to add a circle of radius 1 onto your plot which will have AUC pi/4 = 0.7853982
library(pROC)
library(car)
n <- 100L
x1 <- rnorm(n, 2.0, 0.5)
x2 <- rnorm(n, -1.0, 2)
y <- rbinom(n, 1L, plogis(-0.4 + 0.5 * x1 + 0.1 * x2))
mod <- glm(y ~ x1 + x2, "binomial")
probs <- predict(mod, type = "response")
plot(roc(y, probs))
ellipse(c(0, 0), matrix(c(1,0,0,1), 2, 2), radius = 1, center.pch = FALSE, col = "blue")

How to create a ROC in R using predicted value from SAS?

I have a dataset from SAS: scored data with two columns, y and yhat. y is binary (0,1) and yhat is the scored value from a logistic regression model. I want to create a ROC curve in R for this SAS model and compare it with other models in R, but I have no clue how to accomplish this. Any suggestions? Thanks.
You can use the ROCR package like this:
## computing a simple ROC curve (x-axis: fpr, y-axis: tpr)
library(ROCR)
pred <- prediction( SASdataset$predictions, SASdataset$labels)
perf <- performance(pred, "tpr", "fpr")
plot(perf)
This is very simple if you know how ROC curves work. You want to classify people into your dichotomous outcomes (0 or 1 in the example below) using the predicted values from your model.
So if you select a cut-off of 0.5 for your predicted values, anyone above this threshold is considered positive/1/diseased/etc., and anyone below is considered 0/unaffected.
That's great, but can it be improved? The thought here is that if we go through a bunch of cutoff points, we can see which one is the most accurate at classifying people into our dichotomous outcomes, that is, we compare the predicted values from the model to the actual classifications that we know.
# some data
# some data: pred is the true class, predict is the model score
dat <- data.frame(pred = rep(0:1, each = 50),
                  predict = c(runif(50), runif(50, .5, 1.5)))
# candidate cutoffs
cutoffs <- seq(min(dat$predict), .95, 0.05)
# a matrix of the cutoffs, sensitivity, and specificity
p1 <- matrix(0, nrow = length(cutoffs), ncol = 3)
# for each cutoff value, create a 2x2 table and calculate your sens/spec
for (i in seq_along(cutoffs)) {
  p <- cutoffs[i]
  # fix both levels so the table is always 2x2
  t1 <- table(factor(dat$predict > p, levels = c(FALSE, TRUE)), dat$pred)
  p1[i, ] <- c(p, t1[2, 2] / sum(t1[ , 2]), t1[1, 1] / sum(t1[ , 1]))
}
# and plot
plot(1 - p1[ , 3], p1[ , 2], type = 'l',
     xlab = '1 - spec', ylab = 'sens',
     main = 'ROC', cex.main = .8)
There are some packages out there, ROCR is one I have used, but this takes me a couple minutes to program, is very simple to understand, and is in base R.
