ROC curve - model performance error - R

I am trying to plot a ROC curve to show my model performance. The model is fitted using the randomForest package
prediction <- predict(fit, test, type="prob")
pred <- prediction(test$prediction, test$flag_cross_over )
pred2 <- prediction(abs(test$prediction +
                    rnorm(length(test$prediction), 0, 0.1)), flag_cross_over)
perf <- performance( pred, "tpr", "fpr" )
perf2 <- performance(pred2, "tpr", "fpr")
plot( perf, colorize = TRUE)
plot(perf2, add = TRUE, colorize = TRUE)
Using the test data, I am trying to check the model performance. The prediction column holds the predictions made, and flag_cross_over holds the labels for the model.
The error says:
Number of cross-validation runs must be equal for predictions and labels.
The prediction object has dimensions 410 x 2, while the labels have dimensions 410 x 1. I am unsure why the prediction has an extra column.
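Since predict() with type = "prob" returns one probability column per class, the usual fix is to pass only a single column to ROCR's prediction(). A minimal sketch, assuming fit is the randomForest model and the second column corresponds to the positive class:
library(randomForest)
library(ROCR)
prob <- predict(fit, test, type = "prob")            # 410 x 2 matrix, one column per class
pred <- prediction(prob[, 2], test$flag_cross_over)  # a single probability vector plus the labels
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)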


Subscript out of bounds error in predict function of LASSO model

I am using a LASSO model for prediction, and I get the following error when running the predict function. Can someone help me overcome this?
ERROR MSG:
Error in predict(lasso_model, x, type = "response")[, 2] :
subscript out of bounds
#convert data to matrix format
x <- model.matrix(St_recurrence~.,tr)
#convert class to numerical variable
y <- tr$St_recurrence
#Cross validation - perform grid search to find optimal value of lambda
cv_out <- cv.glmnet(x, y, alpha=1, family = 'binomial', nfolds = 5, type.measure = "auc")
#best value of lambda
best_lambda <- cv_out$lambda.1se
# Rebuilding the model with best lamda value identified
lasso_model <- glmnet(x, y, family = "binomial", alpha = 1, lambda = best_lambda)
coef(lasso_model)
# odds ratio
exp(coef(lasso_model))
library(ROCR)
# Calculate the probability of new observations belonging to "yes"
predprob <- predict(lasso_model, x, type = "response")[,2] ## error comes in this line
# prediction is ROCR function
pred <- prediction(predprob, tr$Structural_recurrence)
# ROC curve (x-axis: fpr, y-axis: tpr)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main="ROC Curve for LASSO model", col=rainbow(10))
#compute area under curve
aucperf <- performance(pred, measure="auc")
print(aucperf@y.values)
For glmnet, if you run ?glmnet::predict.glmnet you can see under Details:
type: Type of prediction required. Type "link" gives the linear predictors for "binomial", "multinomial", "poisson" or "cox" models; for "gaussian" models it gives the fitted values. Type "response" gives the fitted probabilities for "binomial" or "multinomial" ...
It returns a single column of probabilities (for class 1), unlike caret, which returns two columns.
So you can do:
library(glmnet)
library(ROCR)
library(mlbench)  # provides the Sonar data set
data(Sonar)
y <- as.numeric(Sonar$Class) - 1
x <- as.matrix(Sonar[, -ncol(Sonar)])
lasso_model <- glmnet(x = x, y = y, family = "binomial", alpha = 1, lambda = 0.001)
predprob <- predict(lasso_model, x, type = "response")
pred <- prediction(predprob, y)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main="ROC Curve for LASSO model")
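If the AUC is also needed, as in the original snippet, the same ROCR prediction object can be reused (this mirrors the corrected aucperf@y.values call above):
auc_perf <- performance(pred, measure = "auc")  # area under the ROC curve
auc_perf@y.values[[1]]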

ROC Curve Ranger

I am trying to calculate a ROC curve and AUC using ranger for a binary classification problem (0 and 1), where the response variable is defined as BiClass.
Suppose I split a data frame into Train_Set and Test_Set (75% and 25% respectively) and compute binary class probabilities using:
library(ranger)
library(ROCR)
library(mlr)
library(pROC)
library(tidyverse)
Biclass.ranger <- ranger(BiClass ~ ., data=Train_Set, num.trees = 500, importance="impurity", save.memory = TRUE, probability=TRUE)
pred <- predict(Biclass.ranger, data = Test_Set, num.trees = 500, type='response', verbose = TRUE)
My intention now is to compute the ROC curve (and AUC). I tried the following code, through which I get a ROC curve (using the ROCR and mlr packages):
pred_object <- prediction(pred$predictions[,2], Test_Set$BiClass)
per_measure <- performance(pred_object, "tnr", "fnr")
plot(per_measure, col="red", lwd=1)
abline(a=0,b=1,lwd=1,lty=1,col="gray")
Or, alternatively, using the pROC package:
probabilities <- as.data.frame(predict(Biclass.ranger, data = Test_Set, num.trees = 500, type='response', verbose = TRUE)$predictions)
probabilities$predic <- colnames(probabilities)[max.col(probabilities,ties.method="first")] # For each row, return the column name of the largest value from 0 and 1 columns (prediction column). This will be a character type
probabilities$prednum <- as.numeric(as.character(probabilities$predic)) # create prednum as a numeric data type in probabilities
probabilities <- dplyr::mutate_if(probabilities, is.character, as.factor) # convert character to factor
probabilities <- cbind(probabilities,BiClass=Test_Set$BiClass) # append BiClass. This data frame contains the response variable from the Test_Data, along with prediction (prednum) and probability classes (0 and 1)
ROC_ranger <- pROC::roc(Test_Set$BiClass, pred$predictions[,2])
plot(ROC_ranger, col = "blue", main = "ROC - Ranger")
paste("Accuracy % of ranger: ", mean(Test_Set$BiClass == round(pred$predictions[,2], digits = 0))) # print the performance of each model
The ROC curve obtained is shown in the attached figure.
I have the following questions:
1) How can I set a threshold value and plot a confusion matrix for that threshold?
I compute the confusion matrix presently using:
probabilities <- as.data.frame(predict(Biclass.ranger, data = Test_Set, num.trees = 500, type='response', verbose = TRUE)$predictions)
max.col(probabilities) - 1
confusionMatrix(table(Test_Set$BiClass, max.col(probabilities)-1))
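For a specific cutoff, a minimal sketch of how the probabilities could be thresholded before tabulating; the 0.6 cutoff and the column name "1" are assumptions, not values from the original code:
cutoff <- 0.6                                               # hypothetical threshold
pred_class <- ifelse(probabilities[["1"]] >= cutoff, 1, 0)  # predict 1 when P(class 1) >= cutoff
table(observed = Test_Set$BiClass, predicted = pred_class)  # or wrap in caret::confusionMatrix() as above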
2) How do I calculate the optimal threshold value (the global value at which I get more true positives or true negatives) through optimization?
Again referring to pROC and the guidelines proposed by its author:
myroc <- pROC::roc(probabilities$BiClass, probabilities$`1`)
mycoords <- pROC::coords(myroc, "all", transpose = FALSE)
plot(mycoords$threshold, mycoords$specificity, type="l", col="red", xlab="Cutoff", ylab="Performance")
lines(mycoords$threshold, mycoords$sensitivity, type="l", col="blue")
legend(0.23,0.2, c("Specificity", "Sensitivity"), col=c("red", "blue"), lty=1)
best.coords <- coords(myroc, "best", best.method="youden", transpose = FALSE)
abline(v=best.coords$threshold, lty=2, col="grey")
abline(h=best.coords$specificity, lty=2, col="red")
abline(h=best.coords$sensitivity, lty=2, col="blue")
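The Youden-optimal cutoff itself is already contained in best.coords, so printing it answers question 2 directly (a small note, using the objects defined above; column names follow pROC's coords() output with transpose = FALSE):
best.coords            # data frame with threshold, specificity, sensitivity
best.coords$threshold  # the optimal cutoff value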
Using the Youden index, I was able to draw the curve shown in the attached figure.
Does it mean there isn't a lot of freedom to vary threshold to play with specificity and sensitivity, since the dashed blue and red lines are not far away from each other?
3) How do I evaluate the AUC?
I calculated AUC using pROC again following the guidelines of its author. See below:
ROC_ranger <- pROC::roc(probabilities$BiClass, probabilities$`1`)
ROC_ranger_auc <- pROC::auc(ROC_ranger)
paste("Area under curve of random forest: ", ROC_ranger_auc) # AUC of the model
The final goal is to increase the true negatives (defined by 1 in BiClass) and, of course, the true positives (defined by 0 in BiClass) in the confusion matrix. At present, the accuracy of my classification algorithm is 0.74 and the AUC is 0.81.

ROC curves comparing logistic regression and neural network predictions in R

I am trying to compare the prediction accuracy of a dataset using a logistic regression model and a neural network. Looking at the confusion matrices of the two methods, the ANN model gives better output than the logistic regression model. However, when plotting the ROC curves for the two methods, the logistic regression model appears to be better. I am wondering if there is something wrong with my code for the ROC curves.
For context, here is my procedure. First, I divided the dataset into training and testing data.
data = read.csv("heart.csv", header=TRUE)
set.seed(300)
index = sample(seq_len(nrow(data)), size = samplesize) # For logistic
train <- data[index,]
test <- data[-index,]
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
scaled <- as.data.frame(lapply(data, normalize))
index = sample(seq_len(nrow(scaled)), size = samplesize) # For ANN
trainset <- scaled[index, ]
testset <- scaled[-index, ]
The response variable is "target", so I fit the following GLM:
glm.fit <- glm(target ~ ., data=train, family=binomial(link = "logit"),control = list(maxit = 50))
For the ANN, I used R's neuralnet package and did the following:
library(neuralnet)
nn <- neuralnet(target ~ ., data=trainset, hidden=c(3,2), act.fct = "logistic", err.fct = "sse", linear.output=FALSE, threshold=0.01)
For my ROC curves, I did the following:
For ANN:
prob = compute(nn, testset[, -ncol(testset)] )
prob.result <- prob$net.result
detach(package:neuralnet,unload = T)
library(ROCR)
nn.pred = prediction(prob.result, testset$target)
pref <- performance(nn.pred, "tpr", "fpr")
plot(pref)
And for logistic regression:
prob=predict(glm.fit,type=c("response"))
library(ROCR)
pred <- prediction(prob, test$target)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(7), main="ROC curve Admissions", xlab="Specificity",
ylab="Sensitivity")
I would just like some guidance in understanding why the plots seem to suggest that the logistic regression model is better when the confusion matrix suggests otherwise, and in understanding what I am doing wrong.
Thank you for any input.
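One thing worth checking in the snippets above: the GLM is scored with predict(glm.fit, type = "response") and no newdata argument, so it returns fitted probabilities for the training rows, while the labels passed to prediction() come from test. A minimal sketch of a consistent comparison on the test set, under the same variable names as above:
prob_test <- predict(glm.fit, newdata = test, type = "response")  # score the held-out test set
library(ROCR)
pred_glm <- prediction(prob_test, test$target)
perf_glm <- performance(pred_glm, measure = "tpr", x.measure = "fpr")
plot(perf_glm, main = "ROC curve - logistic regression (test set)")
abline(a = 0, b = 1, lty = 2)  # chance line for reference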

How to solve these problems about inverted ROC curve, small AUC, and the cutoff?

I am constructing this ROC curve from my SVM model, but the curve came out inverted. Also, although my SVM prediction has high accuracy (~93%), my ROC curve shows that my area under the curve is just about 2.7%. Moreover, it tells me that the optimal cutoff value is infinity, which is not what I expected from my model fitting.
I fitted my SVM model using the built-in svm() function as in the code below, and then predicted using predict(). Then I computed prediction(), calculated performance(), the cutoff value, and the AUC (all code shown below).
svm.fit <- svm(label ~ NDAI + SD + CORR, data = trainSet, scale = FALSE, kernel = "radial", cost = 2, probability=TRUE)
svm.pred <- predict(svm.fit, testSet, probability=TRUE)
mean(svm.pred== testSet$label)*100
prediction.svm <- prediction(attr(svm.pred, "probabilities")[,2], testSet$label)
eval.svm <- performance(prediction.svm, "acc")
roc.svm <- performance(prediction.svm, "tpr", "fpr")
#identify best values and cutoff
max_index.svm <- which.max(slot(eval.svm, "y.values")[[1]])
max.acc_svm <- (slot(eval.svm, "y.values")[[1]])[max_index.svm]
opt.cutoff_svm <- (slot(eval.svm, "x.values")[[1]])[max_index.svm][[1]]
#AUC
auc.svm <- performance(prediction.svm, "auc")
auc.svm <- unlist(slot(auc.svm, "y.values"))
auc.svm <- round(auc.svm, 4)
plot(roc.svm,colorize=TRUE)
points(0.072, 0.93, pch= 20)
legend(.6,.2, auc.svm, title = "AUC", cex = 0.8)
legend(.8,.2, round(opt.cutoff_svm,4), title = "cutoff", cex = 0.8)
I expect the output to have an AUC close to 1 and a small cutoff close to 0.5. Has anyone encountered a similar problem? If so, how should I fix my code?
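An inverted curve with a near-zero AUC often means the probability column passed to prediction() belongs to the opposite class from the one ROCR treats as positive (by default, the higher label level). A minimal sketch of how that could be checked, assuming the e1071-style probabilities attribute used above; the choice of positive level is an assumption:
prob_mat <- attr(svm.pred, "probabilities")  # columns are named after the class levels
colnames(prob_mat)                           # check which column is which
pos_level <- levels(testSet$label)[2]        # assumed positive class (ROCR's default ordering)
prediction.svm <- prediction(prob_mat[, pos_level], testSet$label)
roc.svm <- performance(prediction.svm, "tpr", "fpr")
plot(roc.svm, colorize = TRUE)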

Calculating AUC from nnet model

For a bit of background, I am using the nnet package to build a simple neural network.
My dataset has a number of factor and continuous features. To handle the continuous variables, I apply scaling and centering, which subtracts the mean from each and divides by its standard deviation.
I'm trying to produce a ROC curve and AUC from the results of the neural network model.
Below is the code used to build my basic neural network model:
model1 <- nnet(Cohort ~ . - Cohort,
               data = train.sample,
               size = 1)
To get some predictions, I call the following function:
train.predictions <- predict(model1, train.sample)
Now, this assigns train.predictions to a large matrix consisting of 0 and 1 values. What I want is to get the class probabilities for each prediction so I can plot a ROC curve using the pROC package.
So, I tried adding the following parameter to my predict function:
train.predictions <- predict(model1, train.sample, type="prob")
But I get an error:
Error in match.arg(type) : 'arg' should be one of “raw”, “class”
How can I go about getting class probabilities from the output?
Assuming your test/validation data set is in train.test, and train.labels contains the true class labels:
train.predictions <- predict(model1, train.test, type="raw")
## This might not be necessary:
detach(package:nnet,unload = T)
library(ROCR)
## train.labels:= A vector, matrix, list, or data frame containing the true
## class labels. Must have the same dimensions as 'predictions'.
## computing a simple ROC curve (x-axis: fpr, y-axis: tpr)
pred = prediction(train.predictions, train.labels)
perf = performance(pred, "tpr", "fpr")
plot(perf, lwd=2, col="blue", main="ROC - Title")
abline(a=0, b=1)
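Since the end goal here is the AUC, it can be taken from the same objects; a short sketch under the same assumptions (train.labels is a 0/1 vector or factor), using pROC as mentioned in the question:
library(pROC)
roc_obj <- roc(train.labels, as.vector(train.predictions))  # response first, then predictor
auc(roc_obj)
## or, staying with ROCR:
performance(pred, measure = "auc")@y.values[[1]]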
