unexpected output in plotting ROC curve for SVM classifier - r

While trying to plot an ROC curve for an SVM used for discrete classification, I didn't get a curve that matches its accuracy rate. The accuracy rate for the SVM, computed from the confusion matrix, was 88.1%,
and it produced the following curve.
[ROC curve for SVM in R]
I also calculated the area under the curve (AUC) and got 1.0, which would mean the accuracy rate should be 100%, not 88.1%.
Here is the code I used to produce it:
library(e1071)   # for svm()

x <- subset(mov[3522:4521, -17])
q <- mov[3522:4521, 17]
svm_model1 <- svm(x, q, cost = 0.1, gamma = 0.5, probability = TRUE)
a <- predict(svm_model1, type = "prob", newdata = mov[3522:4521, -17],
             probability = TRUE)

library(ROCR)
rocc <- prediction(attr(a, "probabilities")[, 2], mov[3522:4521, 'y'])
per <- performance(rocc, "tpr", "fpr")
plot(per, colorize = TRUE, lwd = 3, main = "ROC curve for SVM")
I found a question that could be related, but unfortunately I couldn't make sense of it.
Does anyone know why I got this ROC curve?
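For reference, here is a tiny self-contained sketch (not from the original post) showing that a perfect AUC does not by itself force 100% accuracy at a 0.5 cutoff: AUC measures how well the scores rank positives above negatives, while accuracy depends on the chosen threshold.

library(ROCR)
# Four cases, perfectly ranked: every positive scores above every negative,
# so AUC = 1; but with a 0.5 cutoff all four are called positive, giving 50% accuracy.
scores <- c(0.6, 0.7, 0.8, 0.9)
labels <- c(0, 0, 1, 1)
pred <- prediction(scores, labels)
performance(pred, "auc")@y.values[[1]]    # 1.0
mean(as.integer(scores > 0.5) == labels)  # 0.5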

Related

error flexible calibration curve with val.prob.ci.2 in LASSO logistic regression model (internal calibration) in R

I want to calculate a flexible calibration curve after developing a logistic regression model with the cv.glmnet function (LASSO) in R.
Here's part of my code:
install.packages("glmnet")
install.packages("CalibrationCurves")
#building model with 4 predictors, binomial outcome, 10 fold cross-validation for identifying lambda.min
cv.fit <-cv.glmnet(x=data.matrix(data[,2:5]),y=data$outcome, alpha=1,standardize=TRUE,intercept=TRUE,type.measure="deviance",nfolds=10,weights=WeightsTest2)
#calculating predicted probabilities of model in original dataset
pred_original_logodds <- c(predict(cv.fit, newx=data.matrix(data[,2:5]), s="lambda.min",type="response"))
#calculating probabilities
predict_original <- exp(pred_original_logodds)/(1+exp(pred_original_logodds))
#Fit calibration plot using val.prob.ci.2 function
CalibrationCurves::val.prob.ci.2(p = predict_original, y = data$outcome[enter image description here][1])
I get the following warning message: Warning: collapsing to unique 'x' values.
#Output
A 95% confidence interval is given for the calibration intercept, calibration slope and c-statistic.
Dxy          =   0.5737520    C (ROC)      =   0.7868760
R2           =   0.2907194    D            =   0.2128636
D:Chi-sq     = 127.6538494    D:p          =   0.0000000
U            =   0.5822965    U:Chi-sq     = 348.4664350
U:p          =   0.0000000    Q            =  -0.3694329
Brier        =   0.2689491    Intercept    =  -1.4232385
Slope        =   6.7630170    Emax         =   0.4582670
Brier scaled =  -0.4947364    Eavg         =   0.3431554
ECI          =  13.4254419
The flexible calibration curve is not properly fitted, I think because of the "unique 'x' values" warning. What does this warning message mean?
Regards,
Max

Make roc curves like roc.glmnet in R

I am plotting several ROC curves in R to compare various models. In particular, I am checking LASSO, logistic regression and random forests. However, while LASSO has a dedicated function for that, namely:
plot(roc.glmnet(lasso.fit_SUM, newx = x.train.loop, newy=y.train.loop)[[10]])
Logistic and RF do not come with such functions.
Now the problem is that I need to present ROC curves as clean as the LASSO one. The LASSO ROC curve looks like this:
while the random forest (and logistic) curves look like this:
This is the code I am using:
library(randomForest)
library(pROC)

df_train_logit_rf_class <- df_train_logit_rf
df_test_rf_class <- df_test_rf
df_train_logit_rf_class$export_future <- as.factor(df_train_logit_rf_class$export_future)
df_test_rf_class$export_future <- as.factor(df_test_rf$export_future)

rf.fit_SUM_classification <- randomForest(formula = export_future ~ ., data = df_train_logit_rf_class,
                                          ntree = 500, maxnodes = 100, norm.votes = FALSE)
rf.pred_SUM_db <- as.data.frame(predict(rf.fit_SUM_classification, df_test_rf_class, type = "prob"))
rf.pred_SUM_db$predict <- names(rf.pred_SUM_db)[1:2][apply(rf.pred_SUM_db[, 1:2], 1, which.max)]
rf.pred_SUM_db$observed <- df_test_rf_class$export_future
# head(rf.pred_SUM_db)

# 1 ROC curve
roc.curve <- roc(ifelse(rf.pred_SUM_db$observed == 1, 1, 0), as.numeric(rf.pred_SUM_db$predict))
plot(roc.curve, col = "gray60")
but the outcome is the ugly ROC curve I showed above.
export_future is a factor variable taking either 0 or 1. There are many covariates (mainly interaction terms and dummies).
My aim is to plot a ROC curve for the random forest (and possibly the logistic regression) that looks like the LASSO one.
It seems like the logistic/RF curve is built from a single threshold (-Inf) and then interpolated, whereas it should use more thresholds.
Thank you in advance,
Federico
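(Not from the original thread: a hedged sketch of one likely fix. roc() needs the predicted probability of the positive class rather than the hardened label stored in $predict, otherwise only one threshold is available and the curve collapses into a single bend. The column name "1" is an assumption for the positive-class probability column produced by predict(..., type = "prob").)

library(pROC)
# Use the class-"1" probability column instead of the hardened label
# (column name assumed; check colnames(rf.pred_SUM_db)).
roc.curve <- roc(ifelse(rf.pred_SUM_db$observed == 1, 1, 0),
                 rf.pred_SUM_db[["1"]])
plot(roc.curve, col = "gray60")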

Using ROC curve to find optimum cutoff for my weighted binary logistic regression (glm) in R

I have built a binary logistic regression model for churn prediction in RStudio. Because the data are unbalanced, I also included weights. So far I have looked for the optimum cutoff by trial and error, but to complete my research I need to use ROC curves to find it. Below is the script I used to build the model (fit2). The weight is stored in W; it states that the cost of wrongly identifying a churner is 14 times the cost of wrongly identifying a non-churner.
# CH1 logistic regression
library(caret)

W <- 14
lvl <- levels(trainingset$CH1)
print(lvl)

# If positive we give it the defined weight, otherwise set it to 1
fit_wts <- ifelse(trainingset$CH1 == lvl[2], W, 1)
fit2 <- glm(CH1 ~ RET + ORD + LVB + REVA + OPEN + REV2KF + CAL + PSIZEF + COM_P_C + PEN + SHOP,
            data = trainingset, weights = fit_wts, family = binomial(link = "logit"))

# Test it on the test set
predlog1 <- ifelse(predict(fit2, testset, type = "response") > 0.5, lvl[2], lvl[1])
predlog1 <- factor(predlog1, levels = lvl)
predlog1
confusionMatrix(predlog1, testset$CH1, positive = lvl[2])
For this research I have also built ROC curves for decision trees using the pROC package. However, the same script does not work the same way for a logistic regression. I created a ROC curve for the logistic regression using the script below.
prob <- predict(fit2, testset, type = "response")
testset$prob <- prob

library(pROC)
g <- roc(CH1 ~ prob, data = testset)
g
plot(g)
Which resulted in the ROC curve below.
How do I get the optimum cutoff from this ROC curve?
Getting the "optimal" cutoff is totally independent of the type of model, so you can get it like you would for any other type of model with pROC. With the coords function:
coords(g, "best", transpose = FALSE)
Or directly on a plot:
plot(g, print.thres=TRUE)
Now the above simply maximizes the sum of sensitivity and specificity. This is often too simplistic, and you probably need a clear definition of "optimal" that is adapted to your use case. That's mostly beyond the scope of this question, but as a starting point you should take a look at the Best Thresholds section of the documentation of the coords function for some basic options.
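For instance, a hedged sketch of one such option; the numbers are purely illustrative (a false negative assumed to cost 14 times a false positive, matching W above, and an assumed prevalence of 0.1):

# Youden-based "best" threshold with misclassification cost and prevalence weights
# (both values are assumptions for illustration, not taken from the question).
coords(g, "best", best.method = "youden",
       best.weights = c(14, 0.1), transpose = FALSE)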

Plotting a ROC curve from a random forest classification

I'm trying to plot the ROC curve of a random forest classification. Plotting works, but I think I'm plotting the wrong data, since the resulting plot only has one point (the accuracy).
This is the code I use:
library(party)    # cforest, cforest_unbiased
library(ROCR)     # prediction, performance
library(caret)    # confusionMatrix

set.seed(55)
data.controls <- cforest_unbiased(ntree = 100, mtry = 3)
data.rf <- cforest(type ~ ., data = dataset, controls = data.controls)

pred <- predict(data.rf, type = "response")
preds <- prediction(as.numeric(pred), dataset$type)
perf <- performance(preds, "tpr", "fpr")
performance(preds, "auc")@y.values
confusionMatrix(pred, dataset$type)

plot(perf, col = "red", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")
To plot a receiver operating characteristic curve you need to hand over continuous output of the classifier, e.g. posterior probabilities. That is, you need predict(data.rf, newdata, type = "prob").
Predicting with type = "response" already gives you the "hardened" factor as output. Thus, your working point is implicitly fixed already. With respect to that, your plot is correct.
Side note: in-bag predictions of random forests will be highly over-optimistic!
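A minimal sketch of what that looks like with ROCR, assuming dataset$type is a two-level factor whose second level is the positive class (and, as the side note warns, predictions on the training data will be over-optimistic):

library(party)
library(ROCR)
# type = "prob" returns a list with one vector of class probabilities per observation;
# take the probability of the second (assumed positive) class.
prob_list <- predict(data.rf, newdata = dataset, type = "prob")
pos_prob <- sapply(prob_list, function(p) p[2])
preds <- prediction(pos_prob, dataset$type)
perf <- performance(preds, "tpr", "fpr")
plot(perf, col = "red", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")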

ROC curve plot: 0.50 significant and cross-validation

I have two questions about using the pROC package to plot ROC curves.
A. The Significance level or P-value is the probability that the observed sample Area under the ROC curve is found when in fact, the true (population) Area under the ROC curve is 0.5 (null hypothesis: Area = 0.5). If P is small (P<0.05) then it can be concluded that the Area under the ROC curve is significantly different from 0.5 and that therefore there is evidence that the laboratory test does have an ability to distinguish between the two groups.
Therefore, I would like to test whether the area under a ROC curve differs significantly from 0.50. I found the following pROC code for comparing TWO ROC curves, but I am not sure how to test whether a single curve's AUC differs from 0.5.
library(pROC)
data(aSAH)

rocobj1 <- plot.roc(aSAH$outcome, aSAH$s100b,
                    main = "Statistical comparison",
                    percent = TRUE, col = "#1c61b6")
rocobj2 <- lines.roc(aSAH$outcome, aSAH$ndka,
                     percent = TRUE, col = "#008600")
testobj <- roc.test(rocobj1, rocobj2)
text(50, 50,
     labels = paste("p-value =", format.pval(testobj$p.value)),
     adj = c(0, .5))
legend("bottomright", legend = c("S100B", "NDKA"),
       col = c("#1c61b6", "#008600"), lwd = 2)
B. I have done a k-fold cross-validation for my classification problem. For example, 5-fold cross-validation will produce 5 ROC curves. How can I plot the average of these 5 ROC curves using the pROC package (what I want to do is explained, in Python, on the scikit-learn page listed in the Refs below)? Also, can we get the confidence interval and the best threshold for this average ROC curve (something like the code below)?
rocobj <- plot.roc(aSAH$outcome, aSAH$s100b,
                   main = "Confidence intervals",
                   percent = TRUE, ci = TRUE,   # compute the CI (of AUC by default)
                   print.auc = TRUE)            # print the AUC (will contain the CI)
ciobj <- ci.se(rocobj,                          # CI of sensitivity
               specificities = seq(0, 100, 5))  # over a select set of specificities
plot(ciobj, type = "shape", col = "#1c61b6AA")  # plot as a blue shape
plot(ci(rocobj, of = "thresholds", thresholds = "best"))  # add one threshold
Refs:
http://web.expasy.org/pROC/screenshots.html
http://scikit-learn.org/0.13/auto_examples/plot_roc_crossval.html
http://www.talkstats.com/showthread.php/14487-ROC-significance
http://www.medcalc.org/manual/roc-curves.php
A. Use wilcox.test, which does exactly that.
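For example, with the aSAH data already used above (a minimal sketch; the Mann-Whitney U statistic is equivalent to the AUC, so its p-value tests the null hypothesis AUC = 0.5):

library(pROC)
data(aSAH)
# Tests whether s100b differs between the two outcome groups,
# which is equivalent to testing H0: AUC = 0.5.
wilcox.test(s100b ~ outcome, data = aSAH)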
B. See my answer to this question: Feature selection + cross-validation, but how to make ROC-curves in R. Simply concatenate the predictions from each fold of the cross-validation (but don't do that with bootstrap, LOO, when you repeat the whole cross-validation multiple times, or when the predictions can't be compared between runs).
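A hedged sketch of the concatenation approach for B; the names fold_probs and fold_labels are illustrative placeholders for the per-fold out-of-fold predictions and true labels collected during cross-validation:

library(pROC)
# Pool the out-of-fold predictions from all folds into single vectors,
# then build one ROC curve with a confidence interval and best threshold.
all_probs <- unlist(fold_probs)    # hypothetical list of per-fold predicted probabilities
all_labels <- unlist(fold_labels)  # hypothetical list of per-fold true labels
cv_roc <- roc(all_labels, all_probs, ci = TRUE)
plot(cv_roc, print.auc = TRUE, print.thres = "best")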
