Plotting precision@k and recall@k in ROCR (R)

I'm evaluating a binary classifier in R with the ROCR package. My classifier outputs a score between 0 and 1 for target 0/1 labels.
I'd like to plot precision and recall @ k but can't find a way to do it. Calling performance() without specifying the x-axis measure plots the precision value by score cutoff:
library(ROCR)
#df <- a two-dimensional dataframe with prediction scores and actual labels of my classifier
pred <- prediction(df$score, df$label)
pr_curve <- performance(pred, measure="prec")
For precision (or recall) at k, I'd need to plot the precision against the rank of each prediction, ordered by descending score:
pred <- prediction(df$score, df$label)
pr_curve <- performance(pred, measure="prec", x.measure="rank") #but there seems to be no "rank" in ROCR!
Is there a way to do this in ROCR? I'm open to using alternative libraries if this isn't possible there.

Load libraries and define train and test set:
library(mlbench)
library(e1071)
library(ROCR)
data(BreastCancer)
df = BreastCancer
idx = sample(1:nrow(df),150)
trn = df[idx,]
test = df[-idx,]
Fit a naive Bayes model:
fit = naiveBayes(Class ~ .,data=trn)
In the manual for performance, it is written,
Precision/recall graphs: measure="prec", x.measure="rec".
Plot precision-recall:
pred = prediction(predict(fit,test,type="raw")[,2],test$Class)
#plot to see it is working correctly:
plot(performance(pred,measure="prec",x.measure="rec"))
Now, for your case of doing it at k, we can also compute precision and recall from scratch:
#combine prob, predicted labels, and actual labels
res = data.frame(prob = predict(fit, test, type="raw")[,2],
                 predicted_label = predict(fit, test),
                 label = test$Class)
res = res[order(res$prob,decreasing=TRUE),]
res$rank = 1:nrow(res)
# recall: cumulative share of all malignant cases recovered up to this rank
res$recall = cumsum(res$label=="malignant")/sum(res$label=="malignant")
# precision: share of the top-ranked cases that are actually malignant
res$precision = cumsum(res$label=="malignant")/res$rank
# check the two plots
par(mfrow=c(1,2))
plot(performance(pred,measure="prec",x.measure="rec"))
plot(res$recall,res$precision,type="l")
Now that you have it correct, getting or plotting precision at k is simply:
par(mfrow=c(1,2))
with(res,
plot(rank,precision,main="self-calculated",type="l"))
plot(pred@n.pos.pred[[1]],
     pred@tp[[1]]/(pred@fp[[1]]+pred@tp[[1]]),
     type="l",main="from ROCR")
I am not aware of a way to do this with the .plot.performance function, but you can use the variables stored in the prediction object: pred@tp is the true positive count, pred@fp is the false positive count, so tp / (tp + fp) gives precision, and pred@n.pos.pred essentially gives the rank.
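For convenience, that slot arithmetic can be wrapped in a small helper. This is just a sketch built on the prediction slots used above (the function name precision_at_k is mine, not part of ROCR):
# sketch: precision at k from a ROCR prediction object
precision_at_k <- function(pred, k) {
  prec <- pred@tp[[1]] / (pred@tp[[1]] + pred@fp[[1]])
  rank <- pred@n.pos.pred[[1]]
  # take the precision at the largest rank that does not exceed k
  prec[max(which(rank <= k))]
}
precision_at_k(pred, 10)  # precision over the top 10 ranked predictions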

Related

Understanding the iml (interpretable machine learning) output for a classification task

Consider this synthetic dataset for classification,
library(tidyverse)
library(iml)
library(randomForest)
# Generate data
set.seed(5)
x = matrix(rnorm(2000), nrow=500)
z = x %*% matrix(c(1,1,1,1), nrow=4)
y = round(1 / (1 + exp(-z)), 0) %>% as.integer()
x = cbind(x, rnorm(500))
y_factor = as.factor(y)
data = data.frame(x, y_factor)
# Train model
rf = randomForest(y_factor ~ X1+X2+X3+X4+X5, data=data, ntree = 50)
# Compute feature importance using iml package
x_df = data[,-6]
predictor_rf <- Predictor$new(rf, data=x_df, y=y_factor)
imp_rf <- FeatureImp$new(predictor_rf, loss = "ce")
plot(imp_rf)
Here, x is a matrix with 5 independent variables; 4 of them are related to the response and the fifth is just noise. I then train a random forest and finally compute the variable importance using feature permutation from the iml package, obtaining the output in the figure below. The package manual says that:
The importance is measured as
the factor by which the model’s prediction error increases when the feature is shuffled.
So here, variable X4 obtained a feature importance value of 0.2, which means that the prediction error "increased" by a factor of 0.2. However, since 0.2 is a factor smaller than 1, this means that the prediction error actually decreased when permuting X4, which makes no sense to me: on the one hand, it would imply that randomly shuffled numbers give better results than the actual variable, but on the other hand, the current model with the original variables obtains an accuracy of 100%. The same interpretation applies to the rest of the variables, except for variable X5, which was noise and obtained an importance of 0.
So... what am I missing here? What is that 0.2 value?
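For intuition, here is a minimal hand-rolled version of that "factor" for a single feature, reusing the rf, x_df and y_factor objects above. iml repeats the shuffling and evaluates its own loss, so treat this only as a sketch of the idea, not a reproduction of its output:
# baseline classification error of the forest on the data
orig_err <- mean(predict(rf, newdata = x_df) != y_factor)
# shuffle one feature and recompute the error
x_perm    <- x_df
x_perm$X1 <- sample(x_perm$X1)
perm_err  <- mean(predict(rf, newdata = x_perm) != y_factor)
# the "factor": ratio of permuted error to original error (>1 means shuffling hurt);
# note the ratio is undefined when the original error is 0
perm_err / orig_err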

Dirichlet Regression using Caret package

I am attempting to predict tree species composition using Sentinel 2A imagery and forest plot data. I have calculated the proportion of basal area (the cross-sectional area trees of a given species occupy within a plot divided by the total cross-sectional area all trees occupy within that plot) from the forest plot data and want to predict the proportion of basal area by species across the landscape as a raster using the intensity values from Sentinel 2A. Dirichlet regression seems appropriate here, because I have more than 2 categories of proportional data bounded on (0,1) where the summed proportions for each observational unit must equal 1. Therefore, I want to model the joint composition of proportional basal area as a function of spectral reflectance intensity using Dirichlet regression, with k-fold cross-validation using 10 folds and 5 repeats. This seems like a modeling exercise perfectly suited to caret::train() with a custom model function.
I followed the documentation for how to build a custom routine for DirichletReg::DirichReg(), but I am a bit stumped on how to feed a multivariate response variable to caret::train(). The DirichletReg::DirichReg() function requires that the response variable be formatted with the DirichletReg::DR_data() function prior to modeling. However, when I feed a DR_data object to caret::train(), I get this error:
Error: wrong model type for classification
I think this error is getting thrown because I have passed a multivariate response variable to the y= argument of caret::train() and the routine thinks that can only happen with classification algorithms. Does anyone have experience doing multivariate regression? Is there another way to change my custom routine so that caret::train() will accept multivariate response variables?
Below is my reproducible example:
##Loading Necessary Packages##
library(caret)
library(DirichletReg)
##Creating Fake Data##
set.seed(88)#For reproducibility
#Response variables#
PSME_BA<-rnorm(25,50, 15)
TSHE_BA<-rnorm(25,40,12)
ALRU2_BA<-rnorm(25,20,0.5)
Total_BA<-PSME_BA+TSHE_BA+ALRU2_BA
#Predictor variables#
B1<-runif(25, 0, 2000)
B2<-runif(25, 0, 1800)
B3<-runif(25, 0, 3000)
#Dataset for modeling#
df <- data.frame(PSME = PSME_BA/Total_BA, TSHE = TSHE_BA/Total_BA, ALRU2 = ALRU2_BA/Total_BA,
                 B1 = B1, B2 = B2, B3 = B3)
##Creating a Dirichlet regression modeling routine to feed to caret::train()##
#Method list#
Dreg <- list(type = 'Regression',
             library = 'DirichletReg',
             loop = NULL)
#Parameters element#
prm <- data.frame(parameter = c("model"),
                  class = "character")
Dreg$parameters <- prm
#Grid element#
DregGrid <- function(x, y, len = NULL, search = "grid"){
  if(search == "grid"){
    # force the strings to character, otherwise an error is thrown because
    # the model arguments expect 'chr' when fitting
    out <- expand.grid(model = c("common", "alternative"), stringsAsFactors = FALSE)
  }
  out
}
Dreg$grid <- DregGrid
#Fit element#
DregFit <- function(x, y, param, ...){
  dat <- if(is.data.frame(x)) x else as.data.frame(x)
  dat$.outcome <- y
  theDots <- list(...)
  modelArgs <- c(list(formula = as.formula(".outcome ~ ."), data = dat,
                      link = param$model, type = param$type), theDots)
  out <- do.call(DirichletReg::DirichReg, modelArgs)
  out$call <- NULL
  out
}
Dreg$fit <- DregFit
#Predict element#
DregPred <- function(modelFit, newdata, preProc = NULL, submodels = NULL){
  if(!is.data.frame(newdata)) newdata <- as.data.frame(newdata)
  DirichletReg::predict.DirichletRegModel(modelFit, newdata)
}
Dreg$predict <- DregPred
#prob element#
DregProb <- function(){
  return(NULL)
}
Dreg$prob <- DregProb
##Modeling the data using Dirichlet regression with repeated k-folds cross validation##
trCrtl<-trainControl(method="repeatedcv", number = 10, repeats = 5)
Y<-DR_data(df[,c(1:3)])#Converting the response variables to DR_data format
mod<-train(x=df[,-c(1:3)],Y, method=Dreg,trControl=trCrtl)#Throws error
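As a sanity check that the underlying model fits at all (outside caret, so without the cross-validation I am after), a direct DirichletReg call on the fake df above would look roughly like this:
#Fitting the Dirichlet regression directly, without caret#
df$Y <- DR_data(df[, 1:3]) #multivariate response in DR_data format
fit.dr <- DirichReg(Y ~ B1 + B2 + B3, data = df, model = "common")
summary(fit.dr)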

Identifying lead/lags using multivariate regression analysis

I have three time-series variables (x, y, z) measured in 3 replicates. x and z are the independent variables; y is the dependent variable; t is the time variable. All three variables follow diel variation: they increase during the day and decrease during the night. An example with a simulated dataset is below.
library(nlme)
library(tidyverse)
n <- 100
t <- seq(0,4*pi,,100)
a <- 3
b <- 2
c.unif <- runif(n)
amp <- 2
datalist = list()
for(i in 1:3){
  y <- 3*sin(b*t) + rnorm(n)*2
  x <- 2*sin(b*t + 2.5) + rnorm(n)*2
  z <- 4*sin(b*t - 2.5) + rnorm(n)*2
  data = as_tibble(cbind(y, x, z)) %>% mutate(t = 1:100) %>% mutate(replicate = i)
  datalist[[i]] <- data
}
df <- do.call(rbind, datalist)
ggplot(df) +
  geom_line(aes(t, x), color = 'red') + geom_line(aes(t, y), color = 'blue') +
  geom_line(aes(t, z), color = 'green') + facet_wrap(~replicate, nrow = 1) + theme_bw()
I can identify the lead/lag of y with respect to x and z individually. This can be done with the ccf() function in R. For example
ccf(x,y)
ccf(z,y)
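(As an aside, the lag at which the cross-correlation peaks can be pulled out of the ccf object rather than read off the plot; a small sketch:)
cc <- ccf(x, y, plot = FALSE)
cc$lag[which.max(abs(cc$acf))]  # lag with the largest absolute cross-correlation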
But I would like to do it in a multivariate regression approach. For example, the nlme package and its lme() function indicate that x and z are negatively affecting y:
lme = lme(data = df, y ~ x + z, random = ~ 1 | replicate, correlation = corCAR1(form = ~ t | replicate))
It is impossible (in actual data) that x and z can negatively affect y.
I need the time-lead/lag and also I would like to get the standardized coefficient (t-value to compare the effect size), both from the same model.
Is there any multivariate model available that can give me the lead/lag and also give me regression coefficient?
We might be considering the "statistical significance of a Cramér-Rao estimation of a lower bound". In order to find Xbeta - Xinfinity, taking the expectation of Xbeta and an assumed mean mu will yield a variable, mu^2, which can replace Xinfinity. Using the F-test likelihood ratio, the degrees of freedom are p2 - p1 = n - p2.
Put it this way, the estimates are n = (-2*mu^2/mu^2 + n), phi_t = y/Xbeta, and Xbeta = (y - beta0)/a.
The point estimate, Xbeta, is derived from y = a*Xbeta + b. The time lead/lag is phi_t and the standardized coefficient is n. The regression generates the lower bound Xbeta, where t = beta.
Spectral analysis of the linear distribution indicates a point estimate beta0 = 0.27, which is a significant peak of variability. Scaling Xbeta by beta0 would be an appropriate idea.

How to create a ROC in R using predicted value from SAS?

I have a dataset from SAS; it is scored data with two columns, y and yhat. y is binary (0,1), yhat is the scored value, and the model is logistic regression. I want to create a ROC curve in R for this SAS model and compare it with other models built in R. I have no clue how to accomplish this. Any suggestions? Thanks.
You can use the ROCR package like this:
## computing a simple ROC curve (x-axis: fpr, y-axis: tpr)
library(ROCR)
pred <- prediction( SASdataset$predictions, SASdataset$labels)
perf <- performance(pred, "tpr", "fpr")
plot(perf)
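Since you want to compare this model with others fitted in R, you can also pull the AUC from the same prediction object:
## area under the ROC curve, useful for comparing models
auc <- performance(pred, measure = "auc")@y.values[[1]]
auc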
This is quite simple if you know how ROC curves work. You want to classify people into your dichotomous outcomes (0 or 1 in the example below) using the predicted values from your model.
So if you were to select a cut-off of 0.5 for your predicted values, say anyone above this threshold is considered positive/1/diseased, and anyone below it a 0/unaffected.
That's great, but can it be improved? The idea is to step through a range of cutoff points and see which one classifies people most accurately, comparing the predicted values from the model to the actual classifications that we know.
# some data
dat <- data.frame(pred = rep(0:1, each = 50),
                  predict = c(runif(50), runif(50, .5, 1.5)))
# a matrix of the cutoffs, specificity, and sensitivity
p1 <- matrix(0, nrow = 19, ncol = 3)
i <- 1
# for each cutoff value, create a 2x2 table and calculate your sens/spec
for (p in seq(min(dat$predict), .95, 0.05)) {
  t1 <- table(dat$predict > p, dat$pred)
  p1[i, ] <- c(p, (t1[2, 2]) / sum(t1[ , 2]), (t1[1, 1]) / sum(t1[ , 1]))
  i <- i + 1
}
# and plot
plot(1 - p1[ , 3], p1[ , 2], type = 'l',
     xlab = '1 - spec', ylab = 'sens',
     main = 'ROC', cex.main = .8)
There are some packages out there, ROCR is one I have used, but this takes me a couple minutes to program, is very simple to understand, and is in base R.
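If you want a single number out of the hand-rolled curve for comparing models, a rough AUC via the trapezoid rule over the (1 - spec, sens) points would look like this (a sketch; it is approximate because the (0,0) and (1,1) endpoints are not included in p1):
fpr <- 1 - p1[, 3]
tpr <- p1[, 2]
o   <- order(fpr)
# trapezoid rule over the ordered points
sum(diff(fpr[o]) * (head(tpr[o], -1) + tail(tpr[o], -1)) / 2)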

Issues with predict function in R

I'm having issues with using the predict() function in R and I hope that I can get some help. Consider a dataset with two columns - 1) Y, 2) X
My goal is to fit a natural spline fit and get a 95% CI and to mark points outside of the 95% CI as outlier. Here is what I do:
1) Initially, no point in the dataset is marked as an outlier.
2) I fit my ns and, using its 95% CI, I mark the points outside of the CI as outliers.
3) I then exclude the initially marked outliers, fit another ns, and, using its 95% CI, I mark outliers again.
*Issue:*
Suppose my initial dataset has 1000 obs. I mark some outliers in the first round and get 23 outliers. Then I fit another ns (call it fit.ns) using the remaining 977 non-outliers. I then use ALL X's (all 1000) to get predicted values based on this new fit, but I get a warning AND an error that newdata in my predict call has 1000 obs while the fit has 977. The predicted values returned also have only 977 values, NOT 1000.
*My predict() code:*
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))
# Getting Fitted Values and 95% CI:
fit.ns.values <- predict(fit.ns, newdata = data.frame(Time = data.temp$Time),
                         interval = "prediction", level = 1 - 0.05) # ??? PROBLEM
I really appreciate your help.
Seems that I cannot upload the dataset, but my code is:
library(splines)
ns.knot <- 10
for (i in 1:2){
  # I exclude outliers so that my ns fit does not get affected by outliers
  data.ns <- data.temp[data.temp$OutlierInd == 0,]
  data.ns$BeatNum <- 1:nrow(data.ns) # BeatNum is like a row number for me and is an auxiliary variable

  # Place holder for natural spline results:
  data.temp$IBI.NSfit <- rep(NA, nrow(data.temp))
  data.temp$IBI.NSfit.L95 <- rep(NA, nrow(data.temp))
  data.temp$IBI.NSfit.U95 <- rep(NA, nrow(data.temp))

  # defining the knots in the n.s.:
  knots <- (data.ns$BeatNum)[seq(ns.knot, (length(data.ns$BeatNum) - ns.knot), by = ns.knot)]

  # Fitting a Natural Spline Fit (df = 3 by default)
  fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))

  # Getting Fitted Values and 95% CI:
  fit.ns.values <- predict(fit.ns, newdata = data.frame(Time = data.temp$Time),
                           interval = "prediction", level = 1 - 0.05) # ??? PROBLEM
  data.temp$IBI.NSfit <- fit.ns.values[,1]
  data.temp$IBI.NSfit.L95 <- fit.ns.values[,2]
  data.temp$IBI.NSfit.U95 <- fit.ns.values[,3]

  # Updating OutlierInd based on the natural spline 95% CI:
  data.temp$OutlierInd <- ifelse(data.temp$IBI < data.temp$IBI.NSfit.U95 & data.temp$IBI > data.temp$IBI.NSfit.L95, 0, 1)
}
Finally, I found the solution:
When I fit the model, I should use the "data =" option. In other words, instead of the command below,
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))
I should use the command below instead:
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(IBI ~ ns(Time, knots = Time[knots]), data = data.ns)
Then the predict function will work.
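To see why the original call misbehaves, here is a tiny self-contained demo (toy data, not the IBI dataset): when the formula refers to columns via data.ns$..., predict() cannot match the Time column in newdata and silently falls back to the rows used in fitting.
set.seed(1)
d <- data.frame(x = 1:10, y = rnorm(10))
f_bad  <- lm(d$y ~ d$x)        # terms are literally "d$x", so newdata is ignored
f_good <- lm(y ~ x, data = d)  # terms are "x", matched against newdata
length(predict(f_bad,  newdata = data.frame(x = 1:20)))  # 10, with a warning
length(predict(f_good, newdata = data.frame(x = 1:20)))  # 20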
I wanted to add a comment but my rep level doesn't allow that.
Anyway, I think it is a well documented point that predict() uses the exact variable names used in the fitting call. So, naming your variables consistently is the best way to get around this error in my experience.
So, in the case above, please redefine a data frame just for your fit purposes like this
library(splines)
#Fit part
fit.data <- data.frame(y=rnorm(30),x=rnorm(30))
fit.ns <- lm(y ~ ns(x,3),data=fit.data)
#Predict
pred.data <- data.frame(y=rnorm(10),x=rnorm(10))
pred.fit <- predict(fit.ns, newdata = data.frame(x = pred.data$x), interval = "confidence", level = 0.95)
IMHO, this should get rid of your error
