ROC curve in R using ROCR package - r

Can someone please explain to me how to plot a ROC curve with ROCR?
I know that I should first run:
prediction(predictions, labels, label.ordering = NULL)
and then:
performance(prediction.obj, measure, x.measure="cutoff", ...)
I am just not clear on what is meant by predictions and labels. I created models with ctree and cforest, and I want the ROC curve for both of them so I can compare them in the end. In my case the class attribute is y_n, which I suppose should be used for the labels. But what about the predictions? Here are the steps of what I do (dataset name = bank_part):
pred<-cforest(y_n~.,bank_part)
tablebank<-table(predict(pred),bank_part$y_n)
prediction(tablebank, bank_part$y_n)
After running the last line I get this error:
Error in prediction(tablebank, bank_part$y_n) :
Number of cross-validation runs must be equal for predictions and labels.
Thanks in advance!
Here's another example: I have a training dataset (bank_training) and a testing dataset (bank_testing), and I ran a randomForest as below:
bankrf<-randomForest(y~., bank_training, mtry=4, ntree=2,
keep.forest=TRUE,importance=TRUE)
bankrf.pred<-predict(bankrf, bank_testing, type='response')
Now bankrf.pred is a factor object with levels c("0", "1"). Still, I don't know how to plot the ROC curve, because I get stuck at the prediction step. Here's what I do:
library(ROCR)
pred<-prediction(bankrf.pred$y, bank_testing$c(0,1)
But this is still incorrect, because I get the error message:
Error in bankrf.pred$y_n : $ operator is invalid for atomic vectors

The predictions are the continuous scores your classifier produces (for example, predicted probabilities), and the labels are the binary truth for each observation.
So something like the following should work:
> pred <- prediction(c(0.1,.5,.3,.8,.9,.4,.9,.5), c(0,0,0,1,1,1,1,1))
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
to generate an ROC.
EDIT: It may be helpful for you to include sample reproducible code in the question (I'm having a hard time interpreting your comment).
There's no new code here, but... here's a function I use quite often for plotting an ROC:
plotROC <- function(truth, predicted, ...){
pred <- prediction(abs(predicted), truth)
perf <- performance(pred,"tpr","fpr")
plot(perf, ...)
}
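To make the roles of the two arguments concrete, here is a minimal base-R sketch of how the points of a ROC curve arise from continuous scores and 0/1 labels. The helper name roc_points is hypothetical (it is not part of ROCR), and the data are the toy vectors from the example above; ROCR's own implementation may differ in detail.

```r
# Hypothetical helper, not part of ROCR: ROC points from scores and 0/1 labels.
roc_points <- function(scores, labels) {
  stopifnot(length(scores) == length(labels), all(labels %in% c(0, 1)))
  # Candidate cutoffs: Inf (nothing called positive), then each distinct score.
  cutoffs <- c(Inf, sort(unique(scores), decreasing = TRUE))
  # An observation is called positive when its score is >= the cutoff.
  tpr <- sapply(cutoffs, function(k) sum(scores >= k & labels == 1) / sum(labels == 1))
  fpr <- sapply(cutoffs, function(k) sum(scores >= k & labels == 0) / sum(labels == 0))
  data.frame(cutoff = cutoffs, fpr = fpr, tpr = tpr)
}

scores <- c(0.1, 0.5, 0.3, 0.8, 0.9, 0.4, 0.9, 0.5)  # continuous predictions
labels <- c(0, 0, 0, 1, 1, 1, 1, 1)                  # binary truth
roc <- roc_points(scores, labels)
print(roc)

# Step plot of true positive rate against false positive rate.
plot(roc$fpr, roc$tpr, type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
```

Each distinct score contributes one cutoff, which is why a classifier that only outputs class labels cannot trace out a full curve.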

Like @Jeff said, your predictions need to be continuous for ROCR's prediction function. require(randomForest); ?predict.randomForest shows that, by default, predict.randomForest returns a prediction on the original scale (class labels, in classification), whereas predict.randomForest(..., type = 'prob') returns probabilities of each class. So:
require(randomForest)
require(ROCR)
data(iris)
iris$setosa <- factor(1*(iris$Species == 'setosa'))
iris.rf <- randomForest(setosa ~ ., data=iris[,-5])
summary(predict(iris.rf, iris[,-5]))
summary(iris.preds <- predict(iris.rf, iris[,-5], type = 'prob'))
preds <- iris.preds[,2]
plot(performance(prediction(preds, iris$setosa), 'tpr', 'fpr'))
gives you what you want. Different classification packages require different commands for getting predicted probabilities -- sometimes it's predict(..., type='probs'), predict(..., type='prob')[,2], etc., so just check out the help files for each function you're calling.

This is how you can do it:
Have your data in a CSV file ("data_file.csv"); you may need to give the full path. The file should have column headers, which here are "default_flag", "var1", "var2", "var3", where default_flag is 0 or 1 and the other variables can take any value.
R code:
rm(list=ls())
df <- read.csv("data_file.csv") #use the full path if needed
mylogit <- glm(default_flag ~ var1 + var2 + var3, family = "binomial" , data = df)
summary(mylogit)
library(ROCR)
df$score<-predict.glm(mylogit, type="response" )
pred<-prediction(df$score,df$default_flag)
perf<-performance(pred,"tpr", "fpr")
plot(perf)
auc<- performance(pred,"auc")
auc
Note that df$score holds the predicted probability of default.
If you want to apply this logit (same regression coefficients) to another data set, df2, for validation, use
df2 <- read.csv("data_file2.csv")
df2$score<-predict.glm(mylogit,newdata=df2, type="response" )
pred<-prediction(df2$score,df2$default_flag)
perf<-performance(pred,"tpr", "fpr")
plot(perf)
auc<- performance(pred,"auc")
auc
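In ROCR, performance(pred, "auc") returns an S4 object, and the number itself sits in its y.values slot (auc@y.values[[1]]). As a cross-check that needs no package, the AUC can also be computed from ranks. The sketch below uses simulated data in place of data_file.csv, and the helper name auc_rank is hypothetical.

```r
set.seed(1)
n  <- 200
# Simulated stand-in for data_file.csv (an assumption, not the original data).
df <- data.frame(var1 = rnorm(n), var2 = rnorm(n))
df$default_flag <- rbinom(n, 1, plogis(-0.5 + 1.5 * df$var1))

mylogit <- glm(default_flag ~ var1 + var2, family = binomial, data = df)
df$score <- predict(mylogit, type = "response")  # predicted probabilities

# Hypothetical helper: AUC as the normalised Mann-Whitney rank statistic.
# Ties receive average ranks, so a tied positive/negative pair counts as 0.5.
auc_rank <- function(score, label) {
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(rank(score)[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_value <- auc_rank(df$score, df$default_flag)
auc_value
```

The value should agree with ROCR's "auc" measure up to floating-point error, since both describe the probability that a randomly chosen positive scores higher than a randomly chosen negative.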

The problem is, as pointed out by others, prediction in ROCR expects numerical values. If you are inserting predictions from randomForest (as the first argument into prediction in ROCR), that prediction needs to be generated by type='prob' instead of type='response', which is the default. Alternatively, you could take type='response' results and convert to numerical (that is, if your responses are, say 0/1). But when you plot that, ROCR generates a single meaningful point on ROC curve. For having many points on your ROC curve, you really need the probability associated with each prediction - i.e. use type='prob' in generating predictions.
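A small base-R sketch of that point, using simulated values rather than a real random forest: hard class labels offer at most two distinct values to cut on, whereas probabilities offer many. It also shows a common pitfall when converting a 0/1 factor to numbers. All variable names here are illustrative.

```r
set.seed(42)
truth <- rbinom(100, 1, 0.5)
prob  <- plogis(2 * truth - 1 + rnorm(100))  # simulated class-1 probabilities
hard  <- as.numeric(prob >= 0.5)             # hard class labels, as with type = 'response'

# The number of distinct values bounds the number of useful cutoffs.
n_distinct_prob <- length(unique(prob))
n_distinct_hard <- length(unique(hard))
c(prob = n_distinct_prob, hard = n_distinct_hard)

# Converting a factor with levels "0"/"1" to numbers:
f <- factor(c("0", "1", "1"))
as.numeric(f)                # internal codes 1, 2, 2 (usually not what you want)
as.numeric(as.character(f))  # the values 0, 1, 1
```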

The problem may be that you want to run the prediction function on multiple runs, for example for cross-validation.
In that case, the "predictions" and "labels" arguments of prediction(predictions, labels, label.ordering = NULL) should each be a list or a matrix, with one element or column per run.
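A hedged sketch of the list form, using simulated scores and a made-up fold assignment. The ROCR calls only run if the package is installed. This likely also explains the error in the question: a two-dimensional table is matrix-like, so it is read as several runs while the label vector counts as one.

```r
set.seed(7)
n     <- 90
fold  <- sample(rep(1:3, length.out = n))      # hypothetical fold assignment
truth <- rbinom(n, 1, 0.4)
score <- plogis(1.5 * truth - 0.5 + rnorm(n))  # simulated continuous scores

# One list element per cross-validation run, in matching order.
pred_list  <- split(score, fold)
label_list <- split(truth, fold)

if (require("ROCR", quietly = TRUE)) {
  pred <- prediction(pred_list, label_list)
  perf <- performance(pred, "tpr", "fpr")
  plot(perf, avg = "vertical")  # average the per-run curves vertically
}
```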

Try this one:
library(ROCR)
pred <- ROCR::prediction(as.numeric(as.character(bankrf.pred)), bank_testing$y)
A function named prediction is present in many packages. You should explicitly specify the namespace (ROCR::) to use the one in ROCR. This one worked for me.

Error when calculating variable importance with categorical variables using the caret package (varImp)

I've been trying to compute the variable importance for a model with mixed scale features using the varImp function in the caret package. I've tried a number of approaches, including renaming and coding my levels numerically. In each case, I am getting the following error:
Error in auc3_(actual, predicted, ranks) :
Not compatible with requested type: [type=character; target=double].
The following dummy example should illustrate my point (edited to reflect @StupidWolf's correction):
library(caret)
#create small dummy dataset
set.seed(124)
dummy_data = data.frame(Label = factor(sample(c("a","b"),40, replace = TRUE)))
dummy_data$pred1 = ifelse(dummy_data$Label=="a",rnorm(40,-.5,2),rnorm(40,.5,2))
dummy_data$pred2 = factor(ifelse(dummy_data$Label=="a",rbinom(40,1,0.3),rbinom(40,1,0.7)))
# check varImp
control.lvq <- caret::trainControl(method="repeatedcv", number=10, repeats=3)
model.lvq <- caret::train(Label~., data=dummy_data,
method="lvq", preProcess="scale", trControl=control.lvq)
varImp.lvq <- caret::varImp(model.lvq, scale=FALSE)
The issue persists when using different models (like randomForest and SVM).
If anyone knows a solution or can tell me what is going wrong, I would highly appreciate that.
Thanks!
When you call varImp on an lvq model, it defaults to filterVarImp() because there is no model-specific variable importance for lvq. Now if you check the help page:
For two class problems, a series of cutoffs is applied to the
predictor data to predict the class. The sensitivity and specificity
are computed for each cutoff and the ROC curve is computed.
Now if you read the source code of varImp.train(), which feeds the data into filterVarImp(), you will see that it passes the original data frame and not whatever comes out of the preprocessing.
This means that if a variable in the original data is a factor, filterVarImp() cannot apply cutoffs to it, and it will throw an error like this:
filterVarImp(data.frame(dummy_data$pred2),dummy_data$Label)
Error in auc3_(actual, predicted, ranks) :
Not compatible with requested type: [type=character; target=double].
So, using my example, and as you have pointed out, you need to one-hot encode it:
set.seed(111)
dummy_data = data.frame(Label = rep(c("a","b"),each=20))
dummy_data$pred1 = rnorm(40,rep(c(-0.5,0.5),each=20),2)
dummy_data$pred2 = rbinom(40,1,rep(c(0.3,0.7),each=20))
dummy_data$pred2 = factor(dummy_data$pred2)
control.lvq <- caret::trainControl(method="repeatedcv", number=10, repeats=3)
ohe_data = data.frame(
Label = dummy_data$Label,
model.matrix(Label ~ 0+.,data=dummy_data))
model.lvq <- caret::train(Label~., data=ohe_data,
method="lvq", preProcess="scale",
trControl=control.lvq)
caret::varImp(model.lvq, scale=FALSE)
ROC curve variable importance
Importance
pred1 0.6575
pred20 0.6000
pred21 0.6000
If you use a model that doesn't have a specific variable importance method, then one option is to calculate the filter-based variable importance first and fit the model after that.
Note that this problem can be circumvented by replacing ordinal features (with d levels) by their (d-1)-dimensional indicator encoding:
model.matrix(~dummy_data$pred2-1)[, 1:(length(levels(dummy_data$pred2)) - 1), drop = FALSE]
However, why does varImp not handle this automatically? Further, this has the drawback that it yields an importance score for each of the d-1 indicators, not one unified importance score for the original feature.
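For reference, a small base-R sketch of the two encodings discussed in this thread, built on a simulated stand-in for dummy_data (the values are not those of the original example). Note drop = FALSE, which keeps a one-column result as a matrix.

```r
set.seed(124)
dummy_data <- data.frame(Label = factor(sample(c("a", "b"), 40, replace = TRUE)))
dummy_data$pred1 <- rnorm(40)
dummy_data$pred2 <- factor(rbinom(40, 1, 0.5))

# Full one-hot encoding: one indicator column per factor level (d columns).
full <- model.matrix(~ pred2 - 1, data = dummy_data)
colnames(full)

# Reduced encoding: keep d - 1 indicator columns.
d <- nlevels(dummy_data$pred2)
reduced <- full[, seq_len(d - 1), drop = FALSE]

# Numeric-only data frame that could be passed on to a model.
ohe_data <- data.frame(Label = dummy_data$Label,
                       pred1 = dummy_data$pred1,
                       reduced)
str(ohe_data)
```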

Subscript out of bound error in predict function of randomforest

I am using a random forest for prediction, and at the predict(fit, test_feature) line I get the following error. Can someone help me overcome this? I followed the same steps with another dataset and had no error, but I get an error here.
Error: Error in x[, vname, drop = FALSE] : subscript out of bounds
training_index <- createDataPartition(shufflled[,487], p = 0.8, times = 1)
training_index <- unlist(training_index)
train_set <- shufflled[training_index,]
test_set <- shufflled[-training_index,]
accuracies<- c()
k=10
n= floor(nrow(train_set)/k)
for(i in 1:k){
sub1<- ((i-1)*n+1)
sub2<- (i*n)
subset<- sub1:sub2
train<- train_set[-subset, ]
test<- train_set[subset, ]
test_feature<- test[ ,-487]
True_Label<- as.factor(test[ ,487])
fit<- randomForest(x= train[ ,-487], y= as.factor(train[ ,487]))
prediction<- predict(fit, test_feature) #The error line
correctlabel<- prediction == True_Label
t<- table(prediction, True_Label)
}
I had a similar problem a few weeks ago.
To get around the problem, you can do this:
df$label <- factor(df$label)
Instead of as.factor, try the plain factor function. Also, try naming your label variable first.
Are there identical column names in your training and validation x?
I had the same error message and solved it by renaming my column names because my data was a matrix and their colnames were all empty, i.e. "".
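A quick way to check this before calling predict() is sketched below in base R. The matrices are made up, and has_usable_names is a hypothetical helper, not a randomForest function.

```r
train_x <- matrix(rnorm(20), ncol = 4)  # a matrix without column names
test_x  <- matrix(rnorm(8),  ncol = 4)

# Hypothetical helper: TRUE if every column has a non-empty, unique name.
has_usable_names <- function(x) {
  nm <- colnames(x)
  !is.null(nm) && !any(is.na(nm)) && all(nzchar(nm)) && anyDuplicated(nm) == 0
}

before <- has_usable_names(train_x)  # FALSE: no column names at all

# Give both sets the same explicit names.
colnames(train_x) <- paste0("V", seq_len(ncol(train_x)))
colnames(test_x)  <- colnames(train_x)

after <- has_usable_names(train_x) && has_usable_names(test_x)
not_in_test <- setdiff(colnames(train_x), colnames(test_x))  # should be empty
```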
Your question is not very clear, but I will try to help.
First of all, check your data to see how the levels of your predictors and outcomes are distributed.
You may find that some predictor or outcome levels are highly skewed or very rare. I got that error when I was trying to predict a very rare outcome with a heavily tuned random forest, and some of the predictor levels were not actually present in the training data. A factor level then appears in the test data that the model fitted on the training data treats as out of bounds.
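One way to look for such levels is sketched below in base R with made-up data. The column name colour is only illustrative.

```r
train <- data.frame(colour = factor(c("red", "blue", "red", "blue")))
test  <- data.frame(colour = factor(c("red", "green")))

# Levels present in the test data but never seen during training.
unseen <- setdiff(levels(test$colour), levels(train$colour))
unseen

# Re-code the test factor on the training levels; unseen values become NA
# and can then be inspected or handled explicitly before predicting.
test$colour <- factor(test$colour, levels = levels(train$colour))
test
```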
Alternatively, check the names of your variables before calling predict() to make sure that they match.
Without your data files, it's hard to tell why your first example worked.
For example, you can try:
names(test) <- names(train)
Add the expression
dimnames(test_feature) <- NULL
before
prediction <- predict(fit, test_feature)

Error when predicting new fitted values from R gamlss object

I have a gamlss model that I'd like to use to make new y predictions (and confidence intervals) from in order to visualize how well the model fits the real data. I'd like to make predictions from a new data set of randomized predictor values (rather than the original data), but I'm running into an error message. Here's some example code:
library(gamlss)
# example data
irr <- c(0,0,0,0,0,0.93,1.4,1.4,2.3,1.5)
lite <- c(0,1,2,2.5)
blck <- 1:8
raw <- data.frame(
css =abs(rnorm(500, mean=0.5, sd=0.1)),
nit =abs(rnorm(500, mean=0.72, sd=0.5)),
irr =sample(irr, 500, replace=TRUE),
lit =sample(lite, 500, replace=TRUE),
block =factor(sample(blck, 500, replace=TRUE))
)
# the model
mod <- gamlss(css~nit + irr + lit + random(block),
sigma.fo=~irr*nit + random(block), data=raw, family=BE)
# new data (predictors) for making css predictions
pred <- data.frame(
nit =abs(rnorm(500, mean=0.72, sd=0.5)),
irr =sample(irr, 500, replace=TRUE),
lit =sample(lite, 500, replace=TRUE),
block =factor(sample(blck, 500, replace=TRUE))
)
# make predictions
predmu <- predict(mod, newdata=pred, what="mu", type="response")
This gives the following error:
Error in data[match(names(newdata), names(data))] :
object of type 'closure' is not subsettable
When I run this on my real data, it gives this slightly different error:
Error in `[.data.frame`(data, match(names(newdata), names(data))) :
undefined columns selected
When I use predict without newdata, it works fine making predictions on the original data, as in:
predmu <- predict(mod, what="mu", type="response")
Am I using predict wrong? Any suggestions are greatly appreciated! Thank you.
No, you are not wrong. I have experienced the same issue.
The documentation indicates that the implementation of predict is incomplete. This appears to be an example of an incomplete feature/function.
Hedgehog mentioned that predictions based on new data are not possible yet.
BonnieM therefore "moved the model" into lmer().
I would like to comment further on this idea:
BonnieM tried to get predictions based on the object mod
mod <- gamlss(css~nit + irr + lit + random(block),
sigma.fo=~irr*nit + random(block), data=raw, family=BE)
"Moving into lme()" in this scenario could look as follows:
mod2 <- gamlss(css~nit + irr + lit + re(random=~1|block),
sigma.fo=~irr*nit + re(random=~1|block),
data=raw,
family=BE)
Predictions on new data based on mod2 are implemented within the gamlss2 package.
Furthermore, mod and mod2 should be the same model.
See:
Stasinopoulos, M. D., Rigby, R. A., Heller, G. Z., Voudouris, V., & De Bastiani, F. (2017). Flexible regression and smoothing: using GAMLSS in R. Chapman and Hall/CRC. Chapter 10.9.1
Best regards
Kai
I had a lot of random problems in this direction, and found fitting using the weights argument, and some extra dummy observations set to weight zero (but the predictors I was interested in) to be one workaround.
I was able to overcome the undefined columns selected error by ensuring that the new data for the newdata parameter had the EXACT column structure as what was used when running the gamlss model.
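A base-R sketch of such an alignment is shown below. It uses toy stand-ins for raw and pred, and assumes the response column (css) does not need to be in newdata, which the reply above does not state either way.

```r
# Toy stand-ins for the question's raw and pred data frames.
raw <- data.frame(css = runif(5), nit = runif(5), irr = runif(5),
                  lit = runif(5), block = factor(1:5))
pred <- data.frame(block = factor(c(1, 2), levels = levels(raw$block)),
                   lit = c(0, 1), irr = c(0, 1.4), nit = c(0.5, 0.7),
                   extra = 1:2)  # wrong column order plus one stray column

# Predictor columns in the order used for fitting (response excluded).
predictors   <- setdiff(names(raw), "css")
missing_cols <- setdiff(predictors, names(pred))
stopifnot(length(missing_cols) == 0)  # fail early if a predictor is absent

# Same columns, same order, same factor levels as the fitting data.
pred_aligned <- pred[, predictors, drop = FALSE]
str(pred_aligned)
```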

Plot in SVM model (e1071 Package) using DocumentTermMatrix

I am trying to create a plot for a model built with SVM from the e1071 package.
My code to build the model, predict, and build the confusion matrix is:
ptm <- proc.time()
svm.classifier = svm(x = train.set.list[[0.999]][["0_0.1"]],
y = train.factor.list[[0.999]][["0_0.1"]],
kernel ="linear")
pred = predict(svm.classifier, test.set.list[[0.999]][["0_0.1"]], decision.values = TRUE)
time[["svm"]] = proc.time() - ptm
confmatrix = confusionMatrix(pred,test.factor.list[[0.999]][["0_0.1"]])
confmatrix
train.set.list and test.set.list contain the training and test sets for several conditions. train.factor.list and test.factor.list hold the true labels for each set. The training and test sets are both DocumentTermMatrix objects.
Then I tried to plot my data with
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]])
but I got the message:
"Error in plot.svm(svm.classifier, train.set.list[[0.999]][["0_0.1"]]) :
missing formula."
What am I doing wrong? The confusion matrix looks good to me even without using the formula parameter in the svm function.
Without runnable code, it's hard to say exactly what the problem is. My guess, given
?plot.svm
which says
formula: formula selecting the visualized two dimensions. Only needed if more than two input variables are used.
is that your data has more than two predictors. You should specify in your plot function:
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]], predictor1 ~ predictor2)
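For comparison, here is a pattern on the built-in iris data where a model with four predictors is plotted in two chosen dimensions. It assumes e1071 is installed, and the model is fitted through the formula interface on a data frame. Whether the same call works for a model fitted with x = and y = on a DocumentTermMatrix is not something this sketch establishes.

```r
if (require("e1071", quietly = TRUE)) {
  data(iris)
  # Four predictors, so plot() needs a formula naming the two axes to show.
  m <- svm(Species ~ ., data = iris)
  # slice fixes the predictors that are not shown at constant values.
  plot(m, iris, Petal.Width ~ Petal.Length,
       slice = list(Sepal.Width = 3, Sepal.Length = 4))
}
```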

Predict function from Caret package give an Error

I am doing a regular logistic regression using the caret package in R. I have a binomial response variable coded 1 or 0 called SALE_FLAG and 140 numeric predictor variables, which I transformed to dummy variables with the dummyVars function.
data <- dummyVars(~., data = data_2, fullRank=TRUE,sep="_",levelsOnly = FALSE )
dummies<-(predict(data, data_2))
model_data<- as.data.frame(dummies)
This gives me a data frame to work with. All of the variables are numeric. Next I split into training and testing:
trainIndex <- createDataPartition(model_data$SALE_FLAG, p = .80,list = FALSE)
train <- model_data[ trainIndex,]
test <- model_data[-trainIndex,]
Time to train my model using the train function:
model <- train(SALE_FLAG ~ ., data = train, method = "glm")
Everything runs nice and I get a model. But when I run the predict function it does not give me what I need:
predict(model, newdata =test,type="prob")
and I get an ERROR:
Error in dimnames(out)[[2]] <- modelFit$obsLevels :
length of 'dimnames' [2] not equal to array extent
On the other hand, when I replace "prob" with "raw" for type inside the predict function, I get predictions, but I need probabilities so I can code them into a binary variable given my threshold.
Not sure why this happens. I did the same thing without using the caret package and it worked how it should:
model2 <- glm(SALE_FLAG ~ ., family = binomial(logit), data = train)
predict(model2, newdata =test, type="response")
I spent some time looking at this, but I am not sure what is going on, and it seems very weird to me. I have tried many variations of the train function: I skipped the formula and used x and y instead, and I also tried method = 'bayesglm', which gave me the same error. I hope someone can help me out. I don't strictly need the train function to get what I need, but caret is a good package with lots of tools, and I would like to figure this out.
Show us str(train) and str(test). I suspect the outcome variable is numeric, which makes train think that you are doing regression. That should also be apparent from printing model. Make it a factor if you want to do classification.
Max
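A minimal base-R sketch of that suggestion, with made-up values: recode the 0/1 outcome as a factor with readable level names before training, and apply the threshold to the class probabilities afterwards. The label names "no"/"yes" and the probabilities below are assumptions for illustration.

```r
sale_flag <- c(1, 0, 0, 1, 1)  # numeric outcome, as in the question

# Factor outcome with level names that are valid R names, so that a
# classification model is fitted rather than a regression.
sale_factor <- factor(sale_flag, levels = c(0, 1), labels = c("no", "yes"))
levels(sale_factor)

# With a factor outcome, predict(model, newdata = test, type = "prob") in caret
# returns one probability column per class level. Hypothetical "yes" column:
p_yes <- c(0.81, 0.35, 0.52, 0.47, 0.90)

# Code the probabilities into a binary variable at a chosen threshold.
threshold      <- 0.5
predicted_flag <- as.integer(p_yes >= threshold)
predicted_flag
```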
