Make ROC curves like roc.glmnet in R

I am plotting several ROC curves in R to compare various models; in particular, I am comparing LASSO, logistic regression and random forests. However, while LASSO has a dedicated function for this, namely:
plot(roc.glmnet(lasso.fit_SUM, newx = x.train.loop, newy=y.train.loop)[[10]])
logistic regression and random forests do not come with such a function.
Now the problem is that I need to present ROC curves as pretty as the LASSO one. The LASSO ROC curve looks like this:
while the random forest (and logistic) curve looks like this:
This is the code I am using:
df_train_logit_rf_class=df_train_logit_rf
df_test_rf_class=df_test_rf
df_train_logit_rf_class$export_future=as.factor(df_train_logit_rf_class$export_future)
df_test_rf_class$export_future=as.factor(df_test_rf$export_future)
rf.fit_SUM_classification <- randomForest(formula = export_future ~ ., data = df_train_logit_rf_class, ntree = 500, maxnodes= 100, norm.votes = F)
rf.pred_SUM_db <- as.data.frame(predict(rf.fit_SUM_classification, df_test_rf_class, type = "prob"))
rf.pred_SUM_db$predict <- names(rf.pred_SUM_db)[1:2][apply(rf.pred_SUM_db[,1:2], 1, which.max)]
rf.pred_SUM_db$observed <- df_test_rf_class$export_future
#head(rf.pred_SUM_db)
# 1 ROC curve
roc.curve <- roc(ifelse(rf.pred_SUM_db$observed==1, 1, 0), as.numeric(rf.pred_SUM_db$predict))
plot(roc.curve, col = "gray60")
but the outcome is the ugly ROC curve I showed you before.
export_future is a factor variable taking the values 0 or 1. There are many covariates (mainly interaction terms and dummies).
My aim is to plot a ROC curve for the random forest (and possibly the logistic regression) which looks like the LASSO one.
It seems like the logistic (and random forest) curve is built from a single threshold (-Inf) and the rest of the curve is simply interpolated, whereas it should be evaluated at many thresholds.
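For reference, a minimal sketch of the kind of change that seems to be needed (assuming the object names above, and that the probability column for class 1 is named "1"): pass the predicted class-1 probability to roc() instead of the hardened class label, so sensitivity and specificity are evaluated at every threshold.
library(pROC)
# use the class-1 probability as the continuous score (column name is an assumption)
rf.prob <- predict(rf.fit_SUM_classification, df_test_rf_class, type = "prob")[, "1"]
roc.curve <- roc(response = df_test_rf_class$export_future, predictor = rf.prob)
plot(roc.curve, col = "gray60")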
Thank you in advance,
Federico

Related

Is there an R function to obtain the minimal depth distribution from a conditional random forest estimated with the party package?

I ran a conditional random forest regression model using the cforest function from the party package because I have both categorical and continuous predictor variables that are correlated with each other, and a continuous outcome variable.
Here is my code to run the conditional random forest model, obtain out-of-bag estimates, and estimate the permutation variable importance.
# 1. fit the random forest
crf <- party::cforest(Y ~ ., data = df,
                      controls = party::cforest_unbiased(ntree = 10000, mtry = 7))
# 2. obtain out-of-bag estimates
pred_oob <- as.data.frame(predict(crf, OOB = TRUE, newdata = NULL))
# 3. estimate permutation variable importance
vi <- permimp::permimp(crf, condition = TRUE, threshold = 0.5, nperm = 1000, OOB = TRUE,
                       mincriterion = 0)
I would like to visualize the minimal depth distribution and calculate mean minimal depth similar to the output from the RandomForestExplainer package. However, the RandomForestExplainer package only takes in objects from the randomForest function in the randomForest package. It's not an option for me to use this function due to the nature of my data (described above).
I have been combing the internet and have not been able to find a solution. Can someone point me to a way to visualize the minimal depth distribution for all predictors and calculate the mean minimal depth?

Using ROC curve to find optimum cutoff for my weighted binary logistic regression (glm) in R

I have built a binary logistic regression for churn prediction in RStudio. Due to the unbalanced data used for this model, I also included weights. I then tried to find the optimum cutoff by trial and error; however, to complete my research I have to use ROC curves to find the optimum cutoff. Below I provide the script I used to build the model (fit2). The weight is stored in W. It states that the cost of wrongly identifying a churner is 14 times as large as the cost of wrongly identifying a non-churner.
#CH1 logistic regression
library(caret)
W = 14
lvl = levels(trainingset$CH1)
print(lvl)
#if positive we give it the defined weight, otherwise set it to 1
fit_wts = ifelse(trainingset$CH1==lvl[2],W,1)
fit2 = glm(CH1 ~ RET + ORD + LVB + REVA + OPEN + REV2KF + CAL + PSIZEF + COM_P_C + PEN + SHOP, data = trainingset, weight=fit_wts, family=binomial(link='logit'))
# we test it on the test set
predlog1 = ifelse(predict(fit2,testset,type="response")>0.5,lvl[2],lvl[1])
predlog1 = factor(predlog1,levels=lvl)
predlog1
confusionMatrix(predlog1, testset$CH1, positive = lvl[2])
For this research I have also built ROC curves for decision trees using the pROC package. However, the same script does not work in the same way for a logistic regression, so I created the ROC curve for the logistic regression using the script below.
prob=predict(fit2, testset, type=c("response"))
testset$prob=prob
library(pROC)
g <- roc(CH1 ~ prob, data = testset)
g
plot(g)
Which resulted in the ROC curve below.
How do I get the optimum cut off from this ROC curve?
Getting the "optimal" cutoff is totally independent of the type of model, so you can get it like you would for any other type of model with pROC. With the coords function:
coords(g, "best", transpose = FALSE)
Or directly on a plot:
plot(g, print.thres=TRUE)
Now, the above simply maximizes the sum of sensitivity and specificity. This is often too simplistic and you probably need a clear definition of "optimal" that is adapted to your use case. That's mostly beyond the scope of this question, but as a starting point you should take a look at the Best Thresholds section of the documentation of the coords function for some basic options.
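If "optimal" should reflect the 14:1 cost structure described in the question, here is a hedged sketch using the best.weights argument of coords (how the prevalence is estimated below is an assumption about your test set):
# weight the threshold search by the relative cost of a false negative (14, per the question)
# and the prevalence of churners in the test set
prev <- mean(testset$CH1 == lvl[2])
coords(g, "best", best.method = "youden",
       best.weights = c(14, prev), transpose = FALSE)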

plot multiple ROC curves for logistic regression model in R

I have a logistic regression model (using R) as
fit6 <- glm(formula = survived ~ ascore + gini + failed, data=records, family = binomial)
summary(fit6)
I'm using the pROC package to draw ROC curves and figure out the AUC for six models, fit1 through fit6.
This is the approach I have used to plot one ROC curve.
prob6=predict(fit6,type=c("response"))
records$prob6 = prob6
g6 <- roc(survived~prob6, data=records)
plot(g6)
But is there a way I can combine the ROCs for all 6 curves in one plot and display the AUCs for all of them, and if possible the confidence intervals too?
You can use the add = TRUE argument of the plot function to plot multiple ROC curves.
Make up some fake data
library(pROC)
a=rbinom(100, 1, 0.25)
b=runif(100)
c=rnorm(100)
Get model fits
fit1=glm(a~b+c, family='binomial')
fit2=glm(a~c, family='binomial')
Predict on the same data you trained the model with (or hold some out to test on if you want)
preds=predict(fit1)
roc1=roc(a ~ preds)
preds2=predict(fit2)
roc2=roc(a ~ preds2)
Plot it up.
plot(roc1)
plot(roc2, add=TRUE, col='red')
This produces the different fits on the same plot. You can get the AUC of the ROC curve by roc1$auc, and can add it either using the text() function in base R plotting, or perhaps just toss it in the legend.
I don't know how to quantify confidence intervals...or if that is even a thing you can do with ROC curves. Someone else will have to fill in the details on that one. Sorry. Hopefully the rest helped though.
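For completeness, pROC can in fact compute confidence intervals for the AUC; a minimal sketch building on the fake-data example above:
# DeLong confidence intervals for each AUC
ci.auc(roc1)
ci.auc(roc2)
# redraw the curves with the AUCs shown in a legend
plot(roc1)
plot(roc2, add = TRUE, col = "red")
legend("bottomright",
       legend = c(sprintf("fit1 AUC = %.3f", as.numeric(auc(roc1))),
                  sprintf("fit2 AUC = %.3f", as.numeric(auc(roc2)))),
       col = c("black", "red"), lwd = 2)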

Plotting a ROC curve from a random forest classification

I'm trying to plot the ROC curve of a random forest classification. Plotting works, but I think I'm plotting the wrong data, since the resulting plot only has one point (the accuracy).
This is the code I use:
library(party)
library(ROCR)
library(caret)
set.seed(55)
data.controls <- cforest_unbiased(ntree = 100, mtry = 3)
data.rf <- cforest(type ~ ., data = dataset, controls = data.controls)
pred <- predict(data.rf, type = "response")
preds <- prediction(as.numeric(pred), dataset$type)
perf <- performance(preds, "tpr", "fpr")
performance(preds, "auc")@y.values
confusionMatrix(pred, dataset$type)
plot(perf, col = "red", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")
To plot a receiver operating characteristic (ROC) curve you need to hand over the continuous output of the classifier, e.g. posterior probabilities. That is, you need predict(data.rf, newdata, type = "prob").
Predicting with type = "response" already gives you the "hardened" factor as output, so your working point is implicitly fixed already. With respect to that, your plot is correct.
Side note: in-bag predictions of random forests will be highly overoptimistic!
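A minimal sketch of the probability-based version (object and column names follow the question's code; it is assumed that predict() on a cforest with type = "prob" returns a list of class-probability vectors, and OOB = TRUE is used per the side note above):
library(party)
library(ROCR)
prob_list <- predict(data.rf, type = "prob", OOB = TRUE)  # out-of-bag class probabilities
prob_pos <- sapply(prob_list, function(p) p[2])           # probability of the second level of 'type'
preds <- prediction(prob_pos, dataset$type)
perf <- performance(preds, "tpr", "fpr")
plot(perf, col = "red", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")
performance(preds, "auc")@y.values                        # AUC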

Advice on calculating a function to describe upper bound of data

I have a scatter plot of a dataset and I am interested in calculating the upper bound of the data. I don't know if there is a standard statistical approach for this, so what I was considering doing was splitting the X-axis data into small ranges, calculating the max for each range, and then trying to identify a function to describe these points. Is there a function already in R to do this?
If it's relevant there are 92611 points.
You might like to look into quantile regression, which is available in the quantreg package. Whether this is useful will depend on whether you want the absolute maximum within your "windows" or whether some extreme quantile, say the 95th or 99th, is acceptable. If you are not familiar with quantile regression, then consider linear regression, which fits a model for the expectation or mean response, conditional upon the model covariates. Quantile regression for the middle quantile (0.5) would fit a model to the median response, conditional upon the model covariates.
Here is an example using the quantreg package, to show you what I mean. First, generate some dummy data similar to the data you show:
set.seed(1)
N <- 5000
DF <- data.frame(Y = rev(sort(rlnorm(N, -0.9))) + rnorm(N),
                 X = seq_len(N))
plot(Y ~ X, data = DF)
Next, fit the model to the 99th percentile (or the 0.99 quantile):
library(quantreg)
mod <- rq(Y ~ log(X), data = DF, tau = 0.99)
To generate the "fitted line", we predict from the model at 100 equally spaced values in X
pDF <- data.frame(X = seq(1, 5000, length = 100))
pDF <- within(pDF, Y <- predict(mod, newdata = pDF))
and add the fitted model to the plot:
lines(Y ~ X, data = pDF, col = "red", lwd = 2)
This should give you this:
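As a follow-up, one might fit several upper quantiles at once to see how sensitive the "upper bound" is to the choice of tau (a sketch using the same simulated data and model as above):
taus <- c(0.90, 0.95, 0.99)
fits <- lapply(taus, function(t) rq(Y ~ log(X), data = DF, tau = t))
plot(Y ~ X, data = DF)
for (i in seq_along(fits)) {
  lines(pDF$X, predict(fits[[i]], newdata = pDF), col = i + 1, lwd = 2)
}
legend("topright", legend = paste("tau =", taus),
       col = seq_along(taus) + 1, lwd = 2)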
I would second Gavin's nomination of quantile regression. Your data look as though they could have been simulated with X and Y each log-normally distributed. You can see what a plot of the joint distribution of two independent (no imposed correlation, but not necessarily cor(x, y) == 0) log-normal variates looks like if you run:
x <- rlnorm(1000, log(300), sdlog=1)
y<- rlnorm(1000, log(7), sdlog=1)
plot(x,y, cex=0.3)
You might consider looking at their individual distributions with qqplot (in the base plotting functions), remembering that the tails of such distributions can behave in surprising ways. You should be more interested in how well the bulk of the values fits a particular distribution than in the extremes ... unless of course your applications are in finance or insurance. We don't want another global financial crisis because of poor modeling assumptions about tail behavior, now do we?
qqplot(x, rlnorm(10000, log(300), sdlog = 1))
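Along the same lines, a small sketch comparing the Y variable against a reference log-normal sample, with a 45-degree line as a visual guide (the log(7) location simply mirrors the simulation above):
qqplot(y, rlnorm(10000, log(7), sdlog = 1))
abline(0, 1, col = "red")   # points near this line suggest the bulk fits reasonably well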
