ROC curve and AUC with ranger - r

I am trying to calculate a ROC curve and AUC using ranger for a binary classification problem (classes 0 and 1), where the response variable is named BiClass.
Suppose I split a data frame into Train_Set and Test_Set (75% and 25% respectively) and compute the class probabilities using:
library(ranger)
library(ROCR)
library(mlr)
library(pROC)
library(tidyverse)
Biclass.ranger <- ranger(BiClass ~ ., data = Train_Set, num.trees = 500, importance = "impurity", save.memory = TRUE, probability = TRUE)
pred <- predict(BiClass.ranger, data = Test_Set, num.trees = 500, type='response', verbose = TRUE)
My intention is now to compute the ROC curve (and the AUC). I tried the following code, which gives me a ROC curve (using the ROCR package):
pred_object <- prediction(pred$predictions[,2], Test_Set$BiClass)
per_measure <- performance(pred_object, "tnr", "fnr") # TNR vs FNR; the conventional ROC plots "tpr" against "fpr"
plot(per_measure, col="red", lwd=1)
abline(a=0,b=1,lwd=1,lty=1,col="gray")
Or, alternatively, using the pROC package:
probabilities <- as.data.frame(predict(Biclass.ranger, data = Test_Set, num.trees = 500, type='response', verbose = TRUE)$predictions)
probabilities$predic <- colnames(probabilities)[max.col(probabilities,ties.method="first")] # For each row, return the column name of the largest value from 0 and 1 columns (prediction column). This will be a character type
probabilities$prednum <- as.numeric(as.character(probabilities$predic)) # create prednum as a numeric data type in probabilities
probabilities <- dplyr::mutate_if(probabilities, is.character, as.factor) # convert character to factor
probabilities <- cbind(probabilities,BiClass=Test_Set$BiClass) # append BiClass. This data frame contains the response variable from the Test_Data, along with prediction (prednum) and probability classes (0 and 1)
ROC_ranger <- pROC::roc(Test_Set$BiClass, pred$predictions[,2])
plot(ROC_ranger, col = "blue", main = "ROC - Ranger")
paste("Accuracy % of ranger: ", mean(Test_Set$BiClass == round(pred$predictions[,2], digits = 0))) # print the performance of each model
The ROC curve obtained is shown below:
[ROC curve plot]
I have the following questions:
1) How can I set a threshold value and plot the confusion matrix for that threshold?
I currently compute the confusion matrix using:
library(caret) # for confusionMatrix()
probabilities <- as.data.frame(predict(Biclass.ranger, data = Test_Set, num.trees = 500, type = 'response', verbose = TRUE)$predictions)
confusionMatrix(table(Test_Set$BiClass, max.col(probabilities) - 1)) # max.col() - 1 gives the predicted class at an implicit 0.5 threshold
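One way to make the threshold explicit (a minimal sketch, assuming the class-1 probability column is named 1, as it is above) is to pick a cutoff, derive hard labels from it, and then tabulate:
threshold <- 0.6 # hypothetical cutoff; any value in (0, 1) can be tried here
pred_class <- ifelse(probabilities$`1` >= threshold, 1, 0)
confusionMatrix(table(Predicted = pred_class, Actual = Test_Set$BiClass))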
2) How do I calculate the optimal threshold value (the global value that gives me the most true positives and true negatives) through optimization?
Again, referring to pROC and the guidelines proposed by its author:
myroc <- pROC::roc(probabilities$BiClass, probabilities$`1`)
mycoords <- pROC::coords(myroc, "all", transpose = FALSE)
plot(mycoords$threshold, mycoords$specificity, type="l", col="red", xlab="Cutoff", ylab="Performance")
lines(mycoords$threshold, mycoords$sensitivity, type="l", col="blue")
legend(0.23,0.2, c("Specificity", "Sensitivity"), col=c("red", "blue"), lty=1)
best.coords <- coords(myroc, "best", best.method="youden", transpose = FALSE)
abline(v=best.coords$threshold, lty=2, col="grey")
abline(h=best.coords$specificity, lty=2, col="red")
abline(h=best.coords$sensitivity, lty=2, col="blue")
I was able to draw this curve using the Youden index:
[plot of sensitivity and specificity versus cutoff, with the Youden-optimal threshold marked]
Does this mean there isn't much freedom to vary the threshold to trade off specificity against sensitivity, since the dashed blue and red lines are not far from each other?
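As a sketch tying this back to question 1 (an illustration, not from the original post): the Youden-optimal threshold returned by coords() can be applied to the test-set probabilities to get the corresponding confusion matrix.
best_thr <- best.coords$threshold # from the coords(..., "best") call above
pred_best <- ifelse(probabilities$`1` >= best_thr, 1, 0)
table(Predicted = pred_best, Actual = Test_Set$BiClass)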
3) How do I evaluate the AUC?
I calculated the AUC using pROC, again following the guidelines of its author. See below:
ROC_ranger <- pROC::roc(probabilities$BiClass, probabilities$`1`)
ROC_ranger_auc <- pROC::auc(ROC_ranger)
paste("Area under curve of random forest: ", ROC_ranger_auc) # AUC of the model
The final goal is to increase the true negatives (defined by 1 in BiClass) and, of course, the true positives (defined by 0 in BiClass) in the confusion matrix. At present, the accuracy of my classification algorithm is 0.74 and the AUC is 0.81.

Related

ROC curve in R with rpart for a survival tree

I have an issue with creating a ROC curve for my survival tree built with the rpart package. My goal is to evaluate the survival tree through the area under the ROC curve (AUC). I have tried many ways to plot a ROC curve but failed. How should I approach the ROC curve plot?
Here is the R code I have so far:
library(survival)
library(rpart)
library(partykit)
library(rattle)
library(rpart.plot)
temp <- coxph(Surv(pgtime, pgstat) ~ age + eet + g2 + grade + gleason + ploidy, stagec)
newtime <- predict(temp, type = 'expected')
fit <- rpart(Surv(pgtime, pgstat) ~ age + eet + g2 + grade + gleason + ploidy, data = stagec)
fancyRpartPlot(fit)
tfit <- as.party(fit) # convert the "rpart" tree to a "party" object
predtree <- predict(tfit, newdata = stagec, type = "prob") # prediction
Here is the R code I have tried so far:
1.
library("ROCR")
predROCR <- prediction(predict(tfit, newdata = stagec, type = "prob")[, 2], labels = Surv(stagec$pgtime, stagec$pgstat))
Error in predict(tfit, newdata = stagec, type = "prob")[, 2] :
incorrect number of dimensions
It doesn't work. I checked the result of predict() and found that it is a list of survival ("survfit") objects (code and output below). I guess this method fails because it is not suitable for survival objects?
predict(tfit, newdata = stagec, type = "prob")[[1]]
Call: survfit(formula = y ~ 1, weights = w, subset = w > 0)
n events median 0.95LCL 0.95UCL
33 1 NA NA NA
I then tried to derive the survival function value of each terminal node and use these values to draw the ROC curve. Is this correct? The ROC curve drawn this way seems to treat the predicted classification results as continuous variables rather than categorical ones.
Here's the R code I tried:
tree2 <- fit
tree2$frame$yval <- as.numeric(rownames(tree2$frame))
# get the survival function value of each sample
Surv_value <- data.frame(predict(tree2, newdata = stagec, type = "matrix"))[, 1]
Out <- cbind(stagec, Surv_value)
# ROC
library(survivalROC)
roc <- survivalROC(Stime = Out$pgtime, status = Out$pgstat, marker = Out$Surv_value, predict.time = 5, method = "KM")
roc$AUC # get the AUC of the ROC plot
# plot ROC
aucText <- paste0("498", " (AUC=", sprintf("%.3f", roc$AUC), ")")
par(oma = c(0.5, 1, 0, 1), font.lab = 1.5, font.axis = 1.5)
plot(roc$FP, roc$TP, type = "l", xlim = c(0, 1), ylim = c(0, 1), col = "#f8766d",
     xlab = "False positive rate", ylab = "True positive rate",
     lwd = 2, cex.main = 1.3, cex.lab = 1.2, cex.axis = 1.2, font = 1.2)
legend("bottomright", aucText, lwd = 2, bty = "n", col = "#f8766d")
abline(0, 1)
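As a sketch of a simpler route (my suggestion, not from the original post): for an rpart tree fitted to a Surv() response, predict() on the rpart object itself returns a relative event rate per terminal node, which can serve directly as the marker for survivalROC.
# a minimal sketch, assuming predict() on a survival rpart returns relative risk
library(rpart)
library(survival)
library(survivalROC)
fit <- rpart(Surv(pgtime, pgstat) ~ age + eet + g2 + grade + gleason + ploidy,
             data = stagec)
risk <- predict(fit) # assumed: one relative-risk value per observation (by terminal node)
roc5 <- survivalROC(Stime = stagec$pgtime, status = stagec$pgstat,
                    marker = risk, predict.time = 5, method = "KM")
roc5$AUC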

How to solve these problems: inverted ROC curve, small AUC, and the cutoff?

I am constructing this ROC curve from my SVM model, but the curve came out inverted. Also, although my SVM prediction has high accuracy (~93%), my ROC curve shows that my area under the curve is just about 2.7%. Moreover, it tells me that the optimal cutoff value is infinity, which is not what I expected from my model fitting.
I fitted my SVM model using the svm() function as in the code below, and then predicted using predict(). Then I computed prediction() and calculated performance(), the cutoff value, and the AUC (all code shown below).
library(e1071) # svm()
library(ROCR) # prediction(), performance()
svm.fit <- svm(label ~ NDAI + SD + CORR, data = trainSet, scale = FALSE, kernel = "radial", cost = 2, probability = TRUE)
svm.pred <- predict(svm.fit, testSet, probability=TRUE)
mean(svm.pred== testSet$label)*100
prediction.svm <- prediction(attr(svm.pred, "probabilities")[,2], testSet$label)
eval.svm <- performance(prediction.svm, "acc")
roc.svm <- performance(prediction.svm, "tpr", "fpr")
#identify best values and cutoff
max_index.svm <- which.max(slot(eval.svm, "y.values")[[1]])
max.acc_svm <- (slot(eval.svm, "y.values")[[1]])[max_index.svm]
opt.cutoff_svm <- (slot(eval.svm, "x.values")[[1]])[max_index.svm][[1]]
#AUC
auc.svm <- performance(prediction.svm, "auc")
auc.svm <- unlist(slot(auc.svm, "y.values"))
auc.svm <- round(auc.svm, 4)
plot(roc.svm,colorize=TRUE)
points(0.072, 0.93, pch= 20)
legend(.6,.2, auc.svm, title = "AUC", cex = 0.8)
legend(.8,.2, round(opt.cutoff_svm,4), title = "cutoff", cex = 0.8)
I expected the output to have an AUC close to 1 and a cutoff close to 0.5. Has anyone encountered a similar problem? If so, how should I fix my code?
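A hedged guess at the cause (an assumption, not a confirmed diagnosis): e1071 orders the columns of the probabilities attribute by the response's factor levels as seen at training time, so [, 2] may pick the wrong class and invert the curve. Selecting the column by class name, and pinning the label ordering, avoids this:
# a minimal sketch: pick the probability column by label, not by position
prob_mat <- attr(svm.pred, "probabilities")
pos_class <- levels(testSet$label)[2] # assumption: the second level is the positive class
prediction.svm <- prediction(prob_mat[, pos_class], testSet$label,
                             label.ordering = levels(testSet$label))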

AUC unexpected value

I have the following predictions after running a logistic regression model on a set of molecules that we suppose are predictive of tumors versus normals.
                Predicted class
                    T    N
Actual class   T   29    5
               N  993  912
I have a list of scores that range from negative numbers (predictions < 0) to positive numbers (predictions > 0). Another column in my data.frame holds the labels (1 == tumours, 0 == normals) as predicted from the model. I tried to calculate the ROC curve using the ROCR library in the following way:
pred = prediction(prediction, labels)
roc = performance(pred, "tpr", "fpr")
plot(roc, lwd=2, colorize=TRUE)
Then, using pROC:
roc_full_data <- roc(labels, prediction)
rounded_scores <- round(prediction, digits = 1)
roc_rounded <- roc(labels, rounded_scores)
Call:
roc.default(response = labels, predictor = prediction)
Data: prediction in 917 controls (category 0) < 1022 cases (category 1).
Area under the curve: 1
The AUC is equal to 1. I'm not sure I ran everything correctly, or perhaps I'm misinterpreting my results, because an AUC of exactly 1 is quite rare.
There is a typo in your x.measure which should have thrown an error. You have "for" and not "fpr". Try the following code.
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)
# add a reference line to the graph
abline(a = 0, b = 1, lwd = 2, lty = 2)
# calculate AUC
perf.auc <- performance(pred, measure = "auc")
str(perf.auc)
as.numeric(perf.auc@y.values)
I use pROC to calculate AUC:
require(pROC)
set.seed(1)
pred = runif(100)
y = factor(sample(0:1, 100, TRUE))
auc = as.numeric(roc(response = y, predictor = pred)$auc)
print(auc) # 0.5430757
Or
require(AUC)
auc = AUC::auc(AUC::roc(pred, y))
print(auc) # 0.4569243
I can't explain why the results are different.
EDIT: The two AUCs above sum to 1.0, so one of the libraries automatically 'inverted' the predictions.
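A sketch of the likely explanation (my reading, not part of the original post): pROC's roc() auto-detects the comparison direction by default and flips it so the reported AUC is at least 0.5, while the AUC package takes the predictions as probabilities of class 1 as given. Pinning the levels and direction in pROC should reproduce the AUC-package value:
# a minimal sketch: fix levels and direction so pROC does not auto-flip
library(pROC)
auc_fixed <- roc(response = y, predictor = pred,
                 levels = c("0", "1"), direction = "<")$auc
as.numeric(auc_fixed) # expected to match AUC::auc, ~0.457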

ROC curve - model performance error

I am trying to plot a ROC curve to show my model's performance. The model is fitted using the randomForest package:
prediction <- predict(fit, test, type = "prob")
pred <- prediction(test$prediction, test$flag_cross_over)
pred2 <- prediction(abs(test$prediction + rnorm(length(test$prediction), 0, 0.1)),
                    flag_cross_over)
perf <- performance(pred, "tpr", "fpr")
perf2 <- performance(pred2, "tpr", "fpr")
plot(perf, colorize = TRUE)
plot(perf2, add = TRUE, colorize = TRUE)
Using the test data, I am trying to check the model's performance. The prediction column holds the predictions made, and flag_cross_over holds the labels for the model.
The error says:
Number of cross-validation runs must be equal for predictions and labels.
The prediction's dimensions are 410 x 2, while the labels' dimensions are 410 x 1. I am unsure why the prediction has an extra column.
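A sketch of the likely fix (my reading, not from the original post): predict(..., type = "prob") on a randomForest model returns one probability column per class, and ROCR::prediction() expects a single score vector, so pass just the positive-class column:
# a minimal sketch: use one probability column as the score vector
library(randomForest)
library(ROCR)
prob <- predict(fit, test, type = "prob") # 410 x 2: one column per class
pred <- prediction(prob[, 2], test$flag_cross_over) # assumption: column 2 is the positive class
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)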

How to create a ROC in R using predicted value from SAS?

I have a dataset from SAS: scored data with two columns, y and yhat, where y is binary (0, 1) and yhat is the scored value from a logistic regression model. I want to create a ROC curve in R for this SAS model and compare it with other models in R. I have no clue how to accomplish this. Any suggestions? Thanks.
You can use the ROCR package like this:
## computing a simple ROC curve (x-axis: fpr, y-axis: tpr)
library(ROCR)
pred <- prediction(SASdataset$yhat, SASdataset$y) # the scored value and the 0/1 outcome from the SAS data
perf <- performance(pred, "tpr", "fpr")
plot(perf)
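Since the stated goal is to compare the SAS model with models fit in R, the AUC can be pulled from the same ROCR objects (a short sketch using the objects above):
auc <- performance(pred, measure = "auc")
as.numeric(auc@y.values) # single AUC value for the SAS-scored model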
This is very simple if you know how ROC curves work. You want to classify people into your dichotomous outcomes (0 or 1 below) using the predicted values from your model.
So if you were to select a cut-off for your predicted values at 0.5, say, anyone above this threshold is considered positive (1, diseased) and anyone below is 0 (unaffected).
That's great, but can it be improved? The idea is that if we step through a range of cutoff points, we can find the one that classifies people into our dichotomous outcomes most accurately, by comparing the predicted values from the model to the actual classifications that we know.
# some data
dat <- data.frame(pred = rep(0:1, each = 50),
                  predict = c(runif(50), runif(50, .5, 1.5)))
# the cutoffs to evaluate, and a matrix to hold cutoff, sensitivity, specificity
cutoffs <- seq(min(dat$predict), .95, 0.05)
p1 <- matrix(0, nrow = length(cutoffs), ncol = 3)
i <- 1
# for each cutoff value, create a 2x2 table and calculate sens/spec
for (p in cutoffs) {
  t1 <- table(dat$predict > p, dat$pred)
  p1[i, ] <- c(p, t1[2, 2] / sum(t1[, 2]), t1[1, 1] / sum(t1[, 1]))
  i <- i + 1
}
# and plot
plot(1 - p1[, 3], p1[, 2], type = 'l',
     xlab = '1 - spec', ylab = 'sens',
     main = 'ROC', cex.main = .8)
There are packages out there (ROCR is one I have used), but this takes only a couple of minutes to program, is very simple to understand, and uses only base R.
