ROC curve in R with rpart for a survival tree - r

I am having trouble creating a ROC curve for a survival tree built with the rpart package. My goal is to evaluate the tree using the area under the ROC curve (AUC). I have tried several ways to plot the curve, but all of them failed. How should I approach plotting the ROC curve?
Here is the R code I have so far:
library(survival)
library("rpart")
library("partykit")
library(rattle)
library(rpart.plot)
temp = coxph(Surv(pgtime, pgstat) ~ age+eet+g2+grade+gleason+ploidy, stagec)
newtime = predict(temp, type = 'expected')
fit <- rpart(Surv(pgtime, pgstat) ~ age+eet+g2+grade+gleason+ploidy, data = stagec)
fancyRpartPlot(fit)
tfit <- as.party(fit) #Transfer "rpart" to "party"
predtree<-predict(tfit,newdata=stagec,type="prob") #Prediction
Here is the R code I have tried so far:
1.
library("ROCR")
predROCR <- prediction(predict(tfit, newdata = stagec, type = "prob")[, 2],labels=Surv(stagec$pgtime, stagec$pgstat))
Error in predict(tfit, newdata = stagec, type = "prob")[, 2] :
incorrect number of dimensions
It doesn't work. I checked the result of predict() and found that each element is a survfit object (code and output below). I guess this method fails because it is not suitable for survival objects?
predict(tfit, newdata = stagec, type = "prob")[[1]]
Call: survfit(formula = y ~ 1, weights = w, subset = w > 0)
n events median 0.95LCL 0.95UCL
33 1 NA NA NA
I then tried to derive the survival function value of each terminal node and use these values to draw the ROC curve. Is this correct? The ROC curve drawn this way seems to treat the predicted classification results as a continuous variable rather than a categorical one.
Here's the R code I tried:
tree2 = fit
tree2$frame$yval = as.numeric(rownames(tree2$frame))
#Get the survival function value of each sample
Surv_value = data.frame(predict(tree2, newdata=stagec,type = "matrix"))[,1]
Out=data.frame()
Out=cbind(stagec,Surv_value)
#ROC
library(survivalROC)
roc=survivalROC(Stime=Out$pgtime, status=Out$pgstat, marker = Out$Surv_value, predict.time =5, method="KM")
roc$AUC #Get the AUC of ROC plot
#Plot ROC
aucText=c()
par(oma=c(0.5,1,0,1),font.lab=1.5,font.axis=1.5)
plot(roc$FP, roc$TP, type="l", xlim=c(0,1), ylim=c(0,1),col="#f8766d",
xlab="False positive rate", ylab="True positive rate",
lwd = 2, cex.main=1.3, cex.lab=1.2, cex.axis=1.2, font=1.2)
aucText=c(aucText,paste0("498"," (AUC=",sprintf("%.3f",roc$AUC),")"))
legend("bottomright", aucText,lwd=2,bty="n",col=c("#f8766d","#00bfc4","blue","green"))
abline(0,1)
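One possible way to get a single discrimination number for the tree, sketched here as an alternative to a time-dependent ROC rather than as a replacement for it: for a survival response, predict() on the rpart fit returns the estimated relative event rate of the terminal node each observation falls into, and that can serve as a risk score. The sketch below passes the score to concordance() from the survival package, which gives Harrell's C (this is not the survivalROC AUC). It is evaluated on the training data, so the value is optimistic. The data frame d is a name introduced only for this illustration.

```r
library(survival)
library(rpart)

# Survival tree on the stagec data that ships with rpart
fit <- rpart(Surv(pgtime, pgstat) ~ age + eet + g2 + grade + gleason + ploidy,
             data = stagec)

# Risk score: relative event rate of the leaf each patient falls into
d <- stagec
d$risk <- predict(fit, newdata = stagec)

# A tree only yields as many distinct scores as it has leaves
n_scores <- length(unique(d$risk))

# Harrell's C; reverse = TRUE because a higher risk means shorter survival
cidx <- concordance(Surv(pgtime, pgstat) ~ risk, data = d, reverse = TRUE)
cidx$concordance
```

Because the score takes only a handful of distinct values, any ROC-type curve built from it is a step function with few points. That is expected for a tree and does not by itself mean the approach is wrong.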

Related

Getting Confidence Intervals from predicted values from a nlme model from package medrc

I am trying to figure out how to get confidence intervals from predicted values from a model run on medrc (nlme model). The code worked on the regular drc package model, which does not use random effects, so I assume there is something I am not doing right with this nlme model to get CI because I am getting errors.
Below is an example data frame of the data I am using
df <- data.frame(Geno = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,
9,9,9,9,10,10,10,10,11,11,11,11,12,12,12,12,13,13,13,13,14,14,14,14),
Treatment = c(3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",
3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",
3,6,9,"MMM",3,6,9,"MMM"),
Temp = c(32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,
32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,
32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,
32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,
32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535),
PAM = c(0.62225,0.593,0.35775,0.654,0.60625,0.5846667,0.316,0.60875,0.62275,0.60875,0.32125,
0.63725,0.60275,0.588,0.32275,0.60875,0.65225,0.6185,0.29925,0.64525,0.61925,0.61775,
0.11725,0.596,0.603,0.6065,0.2545,0.59025,0.586,0.5895,0.27025,0.59125,0.6345,0.6135,
0.3755,0.622,0.53375,0.552,0.2485,0.51925,0.6375,0.6256667,0.3575,0.63975,0.59375,0.6055,
0.333,0.64125,0.55275,0.51025,0.319,0.55725,0.6375,0.64725,0.348,0.66125))
df$Geno <- as.factor(df$Geno)
With this data, I am running this model that has 3 parameters for the dose-response curve model, b =slope, d= max, e= ED50.
model <- medrm(PAM ~ Temp,
data=df,
random= d + e ~ 1|Geno,
fct=LL.3(),
control=nlmeControl(msMaxIter = 2000, maxIter=2000, minScale=0.00001, tolerance=0.1, pnlsTol=1))
summary(model)
plot(model)
From this model I want to make prediction values for different temperatures along the model
model_preddata = data.frame(Temp = seq(28,39, length.out = 100))
model_pred = as.data.frame(predict(model, newdata = model_preddata, interval = 'confidence'))
With this I get an error, but I can make it predict the PAM values if I add level = 0:
model_pred = as.data.frame(predict(model, newdata = model_preddata, interval = 'confidence', level = 0))
However, this does not give me the lower and upper bound columns that I get when I run this code with non-mixed-effects models.
Can anyone help me figure out how to get the CIs for the predicted values of this model?

Predicting and Plotting Survival Curve with the CoxPH

I am trying to predict and plot the (estimated) survival curve for a new observation in R. Using the "survival" library and the "lung" data set, I first fit a cox proportional hazards model to the data. Then, I tried to predict and plot the survival curve for a hypothetical new observation (I entered the details for this hypothetical new observation in the "list" command). However, this is not working.
I have attached my code below:
#load library
library(survival)
data(lung)
#create survival object
s <- with(lung,Surv(time,status))
#create model
modelA <- coxph(s ~ as.factor(sex)+age+ph.ecog+wt.loss+ph.karno,data=lung, model=TRUE)
summary(modelA)
#plot
plot(survfit(modelA), ylab="Probability of Survival",
xlab="Time", col=c("red", "black", "black"))
#predict for a hypothetical NEW observation (here is where the error is)
lines(predict(modelA, newdata=list(sex=1,
age = 56,
ph.ecog = 1,
ph.karno = 50,
wt.loss = 11),
type="quantile",
p=seq(.01,.99,by=.01)),
seq(.99,.01,by=-.01),
col="blue")
## Error in match.arg(type) :
## 'arg' should be one of “lp”, “risk”, “expected”, “terms”, “survival”
Does anyone know what I am doing wrong? Thanks
This is what the survfit function is for. In your example, you plot the survfit for the model, but you can feed a newdata argument into this function and it will produce the estimated survival for these data.
If we reproduce your example:
library(survival)
s <- with(lung, Surv(time, status))
modelA <- coxph(s ~ as.factor(sex) + age + ph.ecog + wt.loss + ph.karno,
data = lung, model = TRUE)
plot(survfit(modelA), ylab = "Probability of Survival",
xlab = "Time", col = c("red", "black", "black"))
Then we can create a survival curve given your specified covariates like this:
est <- survfit(modelA, newdata = data.frame(sex = 1,
age = 56,
ph.ecog = 1,
ph.karno = 50,
wt.loss = 11))
Now est is an S3 object with members that include time and survival, so we can plot a blue line tracking the estimated survival of individuals with the given covariates like this:
lines(est$time, est$surv, col = 'blue', type = 's')
Or plot it on its own with a 95% confidence interval:
plot(est, ylab = "Probability of Survival",
xlab = "Time", col = c("red", "black", "black"))
Created on 2022-05-26 by the reprex package (v2.0.1)
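If only the estimated survival at a few time points is needed rather than the whole curve, summary() on the survfit object accepts a times argument. A minimal sketch, assuming the same model and covariates as above; the time points 180 and 365 days and the names newobs and s are chosen here purely for illustration:

```r
library(survival)

modelA <- coxph(Surv(time, status) ~ as.factor(sex) + age + ph.ecog +
                  wt.loss + ph.karno, data = lung)

# Hypothetical new observation, same covariates as in the question
newobs <- data.frame(sex = 1, age = 56, ph.ecog = 1,
                     ph.karno = 50, wt.loss = 11)
est <- survfit(modelA, newdata = newobs)

# Estimated survival and 95% limits at the chosen time points (in days)
s <- summary(est, times = c(180, 365))
data.frame(time = s$time, surv = s$surv, lower = s$lower, upper = s$upper)
```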
See the description of the predict() function (you can open it in R help by running ?predict.coxph):
type - the type of predicted value. Choices are the linear predictor
("lp"), the risk score exp(lp) ("risk"), the expected number of events
given the covariates and follow-up time ("expected"), and the terms of
the linear predictor ("terms"). The survival probability for a subject
is equal to exp(-expected).
You can see that your type="quantile" does not match expected input. If you call predict() without the type argument, in your case it will default to using lp (linear predictor).
When you call predict() function for your object modelA, it determines that it is of coxph class, so the predict.coxph() function is applied. The arguments like type="quantile" and p=seq(.01,.99,by=.01) are not acceptable for predict.coxph() (p is ignored, type raises error). They are used in another function, predict.survreg() - for it to be called, your modelA object must be of survreg class, i.e. it should be created using survreg() call instead of coxph() call.
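To illustrate the last point, here is a sketch of what the original call would look like with a survreg fit. The Weibull distribution is an assumption made for this example; the question does not say which parametric model, if any, was intended, and the names modelW, newobs, p and q are introduced here.

```r
library(survival)

# Parametric model, so predict() dispatches to predict.survreg()
modelW <- survreg(Surv(time, status) ~ as.factor(sex) + age + ph.ecog +
                    wt.loss + ph.karno, data = lung, dist = "weibull")

newobs <- data.frame(sex = 1, age = 56, ph.ecog = 1,
                     ph.karno = 50, wt.loss = 11)

# Quantiles of the failure time: by time q[i] a fraction p[i] has failed
p <- seq(0.01, 0.99, by = 0.01)
q <- as.numeric(predict(modelW, newdata = newobs, type = "quantile", p = p))

# Survival is therefore 1 - p at time q
plot(q, 1 - p, type = "l", col = "blue",
     xlab = "Time", ylab = "Probability of Survival")
```

Note that this is a different model from the Cox fit, so the curve will not coincide with the one from survfit(modelA, newdata = ...).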

ROC Curve Ranger

I am trying to calculate ROC Curve and AUC using ranger for a binomial classification problem (0 and 1), where the response variable is defined as BiClass.
Suppose I split a data frame into Train_Set and Test_Set (75% and 25%, respectively) and compute binary class probabilities using:
library(ranger)
library(ROCR)
library(mlr)
library(pROC)
library(tidyverse)
Biclass.ranger <- ranger(BiClass ~ ., data=Train_Set, num.trees = 500, importance="impurity", save.memory = TRUE, probability=TRUE)
pred <- predict(Biclass.ranger, data = Test_Set, num.trees = 500, type='response', verbose = TRUE)
My intention is now to compute ROC curve (and AUC). I tried the following code, through which I get ROC curve (using ROCR and mlr packages):
pred_object <- prediction(pred$predictions[,2], Test_Set$BiClass)
per_measure <- performance(pred_object, "tnr", "fnr")
plot(per_measure, col="red", lwd=1)
abline(a=0,b=1,lwd=1,lty=1,col="gray")
Or, alternatively, using the pROC package:
probabilities <- as.data.frame(predict(Biclass.ranger, data = Test_Set, num.trees = 500, type='response', verbose = TRUE)$predictions)
probabilities$predic <- colnames(probabilities)[max.col(probabilities,ties.method="first")] # For each row, return the column name of the largest value from 0 and 1 columns (prediction column). This will be a character type
probabilities$prednum <- as.numeric(as.character(probabilities$predic)) # create prednum as a numeric data type in probabilities
probabilities <- dplyr::mutate_if(probabilities, is.character, as.factor) # convert character to factor
probabilities <- cbind(probabilities,BiClass=Test_Set$BiClass) # append BiClass. This data frame contains the response variable from the Test_Data, along with prediction (prednum) and probability classes (0 and 1)
ROC_ranger <- pROC::roc(probabilities$BiClass, pred$predictions[,2])
plot(ROC_ranger, col = "blue", main = "ROC - Ranger")
paste("Accuracy % of ranger: ", mean(Test_Set$BiClass == round(pred$predictions[,2], digits = 0))) # print the performance of each model
The ROC curve obtained is given below:
I have the following questions:
1) How can I set a threshold value and plot confusion matrix for the set threshold?
I compute the confusion matrix presently using:
probabilities <- as.data.frame(predict(Biclass.ranger, data = Test_Set, num.trees = 500, type='response', verbose = TRUE)$predictions)
max.col(probabilities) - 1
confusionMatrix(table(Test_Set$BiClass, max.col(probabilities)-1))
2) How do I calculate the optimal threshold value (the global value at which I have more true positives or true negatives) through optimization?
Again, referring to the pROC and the guidelines proposed by its author using:
myroc <- pROC::roc(probabilities$BiClass, probabilities$`1`)
mycoords <- pROC::coords(myroc, "all", transpose = FALSE)
plot(mycoords$threshold, mycoords$specificity, type="l", col="red", xlab="Cutoff", ylab="Performance")
lines(mycoords$threshold, mycoords$sensitivity, type="l", col="blue")
legend(0.23,0.2, c("Specificity", "Sensitivity"), col=c("red", "blue"), lty=1)
best.coords <- coords(myroc, "best", best.method="youden", transpose = FALSE)
abline(v=best.coords$threshold, lty=2, col="grey")
abline(h=best.coords$specificity, lty=2, col="red")
abline(h=best.coords$sensitivity, lty=2, col="blue")
I was able to draw this curve using the Youden index:
Does it mean there isn't a lot of freedom to vary threshold to play with specificity and sensitivity, since the dashed blue and red lines are not far away from each other?
3) How to evaluate AUC?
I calculated AUC using pROC again following the guidelines of its author. See below:
ROC_ranger <- pROC::roc(probabilities$BiClass, probabilities$`1`)
ROC_ranger_auc <- pROC::auc(ROC_ranger)
paste("Area under curve of random forest: ", ROC_ranger_auc) # AUC of the model
The final goal is to increase the true negatives (presently defined by 1 in BiClass) and, of course, the true positives (defined by 0 in BiClass) in the confusion matrix. At present, the accuracy of my classification algorithm is 0.74 and the AUC is 0.81.
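For questions 1 and 2, the mechanics can be sketched in base R without any package. The block below uses synthetic labels and probabilities (truth and prob are made up for illustration and are not the data from the question), and it treats class 1 as the positive class, which is the opposite of the convention described above; swap the roles if 0 is your positive class. The helper names confusion_at and youden are hypothetical.

```r
set.seed(1)
truth <- rbinom(200, 1, 0.5)                      # synthetic 0/1 labels
prob  <- plogis(rnorm(200, mean = 2 * truth - 1)) # synthetic P(class = 1)

# Confusion matrix for a chosen probability threshold
confusion_at <- function(truth, prob, threshold) {
  pred <- factor(as.integer(prob >= threshold), levels = c(0, 1))
  table(actual = factor(truth, levels = c(0, 1)), predicted = pred)
}
cm <- confusion_at(truth, prob, 0.5)

# Youden's J = sensitivity + specificity - 1 at a given threshold
youden <- function(truth, prob, threshold) {
  m <- confusion_at(truth, prob, threshold)
  sens <- m["1", "1"] / sum(m["1", ])
  spec <- m["0", "0"] / sum(m["0", ])
  sens + spec - 1
}

# Evaluate J at every observed probability and keep the maximiser
thresholds <- sort(unique(prob))
j <- vapply(thresholds, function(t) youden(truth, prob, t), numeric(1))
best <- thresholds[which.max(j)]

cm
confusion_at(truth, prob, best)
```

With real predictions, prob would be the probability column for the class you treat as positive, and the resulting table can be passed to a reporting function of your choice.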

How to solve these problems about inverted ROC curve, small AUC, and the cutoff?

I am constructing this ROC curve from my SVM model, but the curve came out inverted. Also, although my SVM prediction has high accuracy (~93%), my ROC curve shows that my area under the curve is just about 2.7%. Moreover, it tells me that the optimal cutoff value is infinity, which is not what I expected from my model fitting.
I have fitted my SVM model using the built-in SVM function just like in the code I showed below, and then I predicted using the function predict(). Then, I computed the prediction() and calculated the performance(), the cutoff value, and the AUC (all code shown below)
svm.fit <- svm(label ~ NDAI + SD + CORR, data = trainSet, scale = FALSE, kernel = "radial", cost = 2, probability=TRUE)
svm.pred <- predict(svm.fit, testSet, probability=TRUE)
mean(svm.pred== testSet$label)*100
prediction.svm <- prediction(attr(svm.pred, "probabilities")[,2], testSet$label)
eval.svm <- performance(prediction.svm, "acc")
roc.svm <- performance(prediction.svm, "tpr", "fpr")
#identify best values and cutoff
max_index.svm <- which.max(slot(eval.svm, "y.values")[[1]])
max.acc_svm <- (slot(eval.svm, "y.values")[[1]])[max_index.svm]
opt.cutoff_svm <- (slot(eval.svm, "x.values")[[1]])[max_index.svm][[1]]
#AUC
auc.svm <- performance(prediction.svm, "auc")
auc.svm <- unlist(slot(auc.svm, "y.values"))
auc.svm <- round(auc.svm, 4)
plot(roc.svm,colorize=TRUE)
points(0.072, 0.93, pch= 20)
legend(.6,.2, auc.svm, title = "AUC", cex = 0.8)
legend(.8,.2, round(opt.cutoff_svm,4), title = "cutoff", cex = 0.8)
I expected an AUC close to 1 and a cutoff close to 0.5. Has anyone encountered a similar problem? If so, how should I fix my code?
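One plausible explanation, which cannot be confirmed from the code shown, is that column 2 of the probabilities attribute is not the class that prediction() treats as positive. Scoring with the probability of the other class mirrors the ROC curve and turns an AUC of a into 1 - a, which would be consistent with a high accuracy alongside an AUC near 0.03. The base R sketch below demonstrates that relationship on synthetic data; auc_rank is a helper written for this example (rank-based AUC), not a function from any package.

```r
# Rank-based AUC: probability that a random positive outranks a random negative
auc_rank <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(42)
label <- rbinom(300, 1, 0.4)                        # synthetic 0/1 labels
p_pos <- plogis(rnorm(300, mean = 3 * label - 1.5)) # synthetic P(label = 1)

auc_good    <- auc_rank(p_pos, label)       # score for the positive class
auc_flipped <- auc_rank(1 - p_pos, label)   # score for the other class

c(good = auc_good, flipped = auc_flipped, sum = auc_good + auc_flipped)
```

A quick check on the real model is to look at colnames(attr(svm.pred, "probabilities")) and select the column by class name instead of by position.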

ROC curve - model performance error

I am trying to plot a ROC curve to show my model performance. The model is fitted using the randomForest package
prediction <- predict(fit, test, type="prob")
pred <- prediction(test$prediction, test$flag_cross_over )
pred2 <- prediction(abs(test$prediction +
rnorm(length(test$prediction), 0, 0.1)), flag_cross_over)
perf <- performance( pred, "tpr", "fpr" )
perf2 <- performance(pred2, "tpr", "fpr")
plot( perf, colorize = TRUE)
plot(perf2, add = TRUE, colorize = TRUE)
So using the test data I am trying to check the model performance. So the prediction column holds the predictions made and the flag_cross_over is the labels for the model.
The error says:
Number of cross-validation runs must be equal for predictions and labels.
The dimensions of the predictions are 410 x 2 and the dimensions of the labels are 410 x 1.
I am unsure why the predictions have an extra column.
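The extra column is expected: predict() with type="prob" on a randomForest classifier returns a matrix with one column of class probabilities per class, so a two-class model gives 410 x 2. ROCR's prediction() needs a single score per observation, so one column has to be picked. A small sketch with a made-up probability matrix (prob_mat and labels are illustrative, and the class names "0" and "1" are assumptions):

```r
# Made-up stand-in for the matrix returned by predict(fit, test, type = "prob")
prob_mat <- matrix(c(0.9, 0.2, 0.6, 0.1,
                     0.1, 0.8, 0.4, 0.9),
                   ncol = 2, dimnames = list(NULL, c("0", "1")))
labels <- c(0, 1, 0, 1)

# Keep only the probability of the class treated as positive
score <- prob_mat[, "1"]

# Scores and labels now have the same length, as prediction() requires
length(score) == length(labels)

# With ROCR loaded, the call would then be:
# pred <- ROCR::prediction(score, labels)
# perf <- ROCR::performance(pred, "tpr", "fpr")
```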
