ROC curve error in randomForest - r

I am trying to create a ROC curve from the code below, but I get the following error: Error in prediction(bc_rf_predict_prob, bc_test$Class) :
Number of cross-validation runs must be equal for predictions and labels.
library(mlbench) #has the Breast Cancer dataset in it
library(caret)
data(BreastCancer) #two class model
bc_changed<-BreastCancer[2:11] #removes variables not to be used
#Create train and test/holdout samples (works fine)
set.seed(59)
bc_rand <- bc_changed[order(runif(699)), ] #699 observations
bc_rand <- sample(1:699, 499)
bc_train <- bc_changed[ bc_rand,]
bc_test <- bc_changed[-bc_rand,]
#random forest decision tree (works fine)
library(caret)
library(randomForest)
set.seed(59)
bc_rf <- randomForest(Class ~.,data=bc_train, ntree=500,na.action = na.omit, importance=TRUE)
#ROC
library(ROCR)
actual <- bc_test$Class
bc_rf_predict_prob<-predict(bc_rf, type="prob", bc_test)
bc.pred = prediction(bc_rf_predict_prob,bc_test$Class) # fails with the error below
Error in prediction(bc_rf_predict_prob, bc_test$Class) :
Number of cross-validation runs must be equal for predictions and labels.
I think the problem comes from this call:
bc_rf_predict_prob<-predict(bc_rf, type="prob", bc_test)
The result is a matrix with two columns: Benign with its probabilities and Malignant with its probabilities. My logic tells me I should have only a single vector of probabilities.

According to page 9 of the ROCR documentation, the prediction function has two required inputs, predictions and labels, which must have the same dimensions.
When they are supplied as a matrix or data frame, each column is treated as one cross-validation run, and all runs must have the same length.
Here bc_rf_predict_prob is a 200 x 2 matrix (see str(bc_rf_predict_prob)), so ROCR sees two runs of predictions but only one run of labels in bc_test$Class.
It sounds like you want only a single column of bc_rf_predict_prob (a vector of probabilities for one class, normally the positive one), but I can't be certain without looking at the data.
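A minimal sketch of that suggestion, using a small made-up probability matrix in place of the real predict(bc_rf, bc_test, type="prob") output. The column name "malignant" and the toy values are assumptions for illustration; the ROCR calls in the comments are not run here.

```r
# Toy stand-in for predict(bc_rf, bc_test, type = "prob"): one column per class.
prob <- cbind(benign    = c(0.9, 0.8, 0.3, 0.1, 0.6),
              malignant = c(0.1, 0.2, 0.7, 0.9, 0.4))
labels <- factor(c("benign", "benign", "malignant", "malignant", "benign"),
                 levels = c("benign", "malignant"))

# Keep a single column: the probability of the class treated as positive.
p_malignant <- prob[, "malignant"]

# Predictions and labels now have the same shape: one vector each.
stopifnot(length(p_malignant) == length(labels))

# With the question's objects, the equivalent calls would be (not run here):
# bc.pred <- ROCR::prediction(bc_rf_predict_prob[, "malignant"], bc_test$Class)
# bc.perf <- ROCR::performance(bc.pred, "tpr", "fpr")
# plot(bc.perf)
```

Selecting the column by name rather than by position avoids depending on the order of the factor levels.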

Related

Obtaining predictions from a pooled imputation model

I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps I have developed so far, using a fictitious example based on the pima data from the faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
1. Apply multiple imputation by chained equations using the mice package. For the sake of the example, I first randomly assigned missing values to the pima dataset using the ampute function from the same package. Twenty imputed datasets were generated by setting the "m" argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp
#-------------------multiple imputation by chained equation--------#
#generate 20 imputed datasets
newresult<-mice(result,m=20)
2. Run a logistic regression on each of the 20 imputed datasets. Inspecting convergence and the distributions of the original and imputed data is skipped for the sake of the example. The "test" variable is the binary dependent variable.
#run a logistic regression on each of the 20 imputed datasets
model<-with(newresult,glm(test~pregnant+glucose+diastolic+triceps+age+bmi,family = binomial(link="logit")))
3. Combine the regression estimates from the 20 imputed-data models into a single pooled model.
#pooled regressions
summary(pool(model))
4. Generate predictions from the pooled model using the prediction function from the margins package. This function can generate predicted values with a variable fixed at a specific level (for factors) or value (for continuous variables). In this example, I could generate new predicted probabilities, i.e. P(Y=1), while setting the pregnant variable (number of pregnancies) to 3. In other words, it would give me the distribution of the outcome in the counterfactual situation where every observation has this variable set to 3. Normally, I would just pass my model as the first argument of the prediction function (as below), but in the case of a pooled imputation model with mice, the object class is mipo and not glm.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
Unrecognized variable name in 'at': (1) <empty>p<empty>r<empty>e<empty>g<empty>n<empty>a<empty>n<empty>t<empty
I thought of two solutions:
a) changing the class object to make it fit prediction()'s requirements
b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements
However, I'm not sure how to achieve this and would appreciate any advice that could help me get closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to @Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
# fit the model on one completed dataset and predict with pregnant fixed at 3
fit_reg <- function(dat) {
  mod <- glm(test ~ pregnant + glucose + diastolic + triceps + age + bmi,
             data = dat, family = binomial)
  out <- predictions(mod, newdata = datagrid(pregnant = 3))
  return(out)
}
# seed is the mice argument for reproducible imputations
dat_mice <- mice(pima, m = 20, printFlag = FALSE, seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
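For intuition about what the pooling step does, here is a minimal base R sketch of Rubin's rules applied to a single predicted quantity. The five estimates and their squared standard errors are made-up numbers, and pooling on the logit scale before back-transforming is an assumption of this sketch, not a statement about what mice or marginaleffects do internally.

```r
# Hypothetical per-imputation results: the prediction at pregnant = 3 on the
# logit scale and its squared standard error, for m = 5 imputed datasets.
est <- c(-0.62, -0.55, -0.70, -0.58, -0.65)
se2 <- c(0.012, 0.011, 0.013, 0.012, 0.012)
m   <- length(est)

q_bar <- mean(est)                  # pooled point estimate
u_bar <- mean(se2)                  # within-imputation variance
b     <- var(est)                   # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b    # total variance (Rubin's rules)

pooled_prob <- plogis(q_bar)        # back-transform to a probability
pooled_se   <- sqrt(t_var)          # standard error on the logit scale
```

The total variance is always at least as large as the average within-imputation variance, which is how the extra uncertainty from the missing data enters the pooled result.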

R: glmrob can't predict models with dropped co-linear columns, while glm can?

I'm learning to implement robust glms in R, but can't figure out why I am unable to get glmrob to predict values from my regression models when I have a model where some columns are dropped due to co-linearity. Specifically when I use the predict function to predict values from a glmrob, it always gives NA for all values. I don't observe this when predicting values from the same data & model using glm. It doesn't seem to matter what data I use -- as long as there is a NA coefficient in the fitted model (and the NA isn't the last coefficient in the coefficient vector), the predict does not work.
This behavior holds for all datasets and models I have tried where an internal column is dropped due to co-linearity. I include a fake data set where two columns are dropped from the model, which gives two NAs in the coefficient list. Both glm and glmrob give nearly identical coefficients, yet predict only works with the glm model. So my question is: what don't I understand about robust regression that would prevent my glmrob models from generating predicted values?
library(robustbase)
#Make fake data with two categorical predictors
df <- data.frame("category" = rep(c("A","B","C"),each=6))
df$location <- rep(1:6,each=3)
val <- rep(c(500,50,5000),each=6)+rep(c(50,100,25,200,100,1),each=3)
df$value <- rpois(NROW(df),val)
#note that predict works if we omit the newdata parameter. However I need the newdata param
#so I use the original dataframe here as a stand-in.
mod <- glm(value ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) # works fine
mod <- glmrob(value ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) #predicts NA for all values
I've been digging into this and have concluded that the problem does not lie in my understanding of robust regression, but rather the problem lies with a bug in the robustbase package. The predict.lmrob function does not correctly pick the necessary coefficients from the model before the prediction. It needs to pick the first x non-NA coefficients (where x=rank of the model matrix). Instead it merely picks the first x coefficients without checking if they are NA. This explains why this problem only surfaces for models where the NA isn't the last coefficient in the coefficient vector.
To fix this, I copied the predict.lmrob source using:
getAnywhere(predict.lmrob)
and created my own replacement function. In this function I made a single modification to the code:
...
p <- object$rank
if (is.null(p)) {
    df <- Inf
    p <- sum(!is.na(coef(object)))
    #piv <- seq_len(p) # old code
    piv <- which(!is.na(coef(object))) # new code
}
else {
    p1 <- seq_len(p)
    piv <- if (p)
        qr(object)$pivot[p1]
}
...
I've run a few hundred datasets using this change and it has worked well.
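The effect of that one-line change can be seen in isolation with a small base R sketch. The coefficient names and values below are made up for illustration and do not come from robustbase.

```r
# Hypothetical coefficient vector with an aliased (NA) term in the middle.
cf <- c("(Intercept)" = 1.2, categoryB = -0.4, location2 = NA, location3 = 0.7)
p  <- sum(!is.na(cf))              # number of estimable coefficients

piv_old <- seq_len(p)              # positions 1, 2, 3: includes the NA term
piv_new <- which(!is.na(cf))       # positions 1, 2, 4: only estimable terms

x_new <- c(1, 1, 0, 1)             # one made-up row of a design matrix

pred_old <- sum(x_new[piv_old] * cf[piv_old])   # NA, because 0 * NA is NA
pred_new <- sum(x_new[piv_new] * cf[piv_new])   # 1.2 - 0.4 + 0.7
```

When the NA is the last coefficient, both index sets coincide, which matches the observation that the problem only appears for NAs in interior positions.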

Boosting classification tree in R

I'm trying to boost a classification tree using the gbm package in R and I'm a little bit confused about the kind of predictions I obtain from the predict function.
Here is my code:
#Load packages, set random seed
library(gbm)
set.seed(1)
#Generate random data
N<-1000
x<-rnorm(N)
y<-0.6^2*x+sqrt(1-0.6^2)*rnorm(N)
z<-rep(0,N)
for(i in 1:N){
  if(x[i]-y[i]+0.2*rnorm(1)>1.0){
    z[i]=1
  }
}
#Create data frame
myData<-data.frame(x,y,z)
#Split data set into train and test
train<-sample(N,800,replace=FALSE)
test<-(-train)
#Boosting
boost.myData<-gbm(z~.,data=myData[train,],distribution="bernoulli",n.trees=5000,interaction.depth=4)
pred.boost<-predict(boost.myData,newdata=myData[test,],n.trees=5000,type="response")
pred.boost
pred.boost is a vector with elements from the interval (0,1).
I would have expected the predicted values to be either 0 or 1, as my response variable z also consists of dichotomous values - either 0 or 1 - and I'm using distribution="bernoulli".
How should I proceed with my prediction to obtain a real classification of my test data set? Should I simply round the pred.boost values or is there anything I'm doing wrong with the predict function?
The behavior you observe is correct. From the documentation:
If type="response" then gbm converts back to the same scale as the
outcome. Currently the only effect this will have is returning
probabilities for bernoulli.
So getting probabilities with type="response" is the expected result. In addition, distribution="bernoulli" merely tells gbm that the labels follow a Bernoulli (0/1) pattern. If you omit it, gbm tries to guess the distribution from the response, so the model will still run.
To get class labels, use predict_class <- pred.boost > 0.5 (a cutoff of 0.5), or plot the ROC curve and choose the cutoff yourself.
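A minimal sketch of both options in base R. The vectors prob and truth are simulated stand-ins for pred.boost and myData$z[test], and maximising Youden's J is just one common ROC-based rule for picking a cutoff, chosen here as an assumption.

```r
set.seed(1)
# Hypothetical stand-ins for the held-out labels and predicted probabilities.
truth <- rbinom(200, 1, 0.3)
prob  <- plogis(2 * truth - 1 + rnorm(200))

# Option 1: fixed cutoff of 0.5; as.integer() gives 0/1 instead of TRUE/FALSE.
pred_class <- as.integer(prob > 0.5)
conf_mat   <- table(predicted = pred_class, actual = truth)

# Option 2: scan candidate cutoffs and keep the one that maximises
# sensitivity + specificity - 1 (Youden's J).
cutoffs <- seq(0.05, 0.95, by = 0.05)
youden <- sapply(cutoffs, function(k) {
  pred <- prob > k
  sens <- mean(pred[truth == 1])    # true positive rate at cutoff k
  spec <- mean(!pred[truth == 0])   # true negative rate at cutoff k
  sens + spec - 1
})
best_cutoff <- cutoffs[which.max(youden)]
```

Ideally the cutoff is tuned on data that is not used for the final evaluation, otherwise the reported accuracy will be optimistic.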
Try using adabag. Class, probabilities, votes and error are built into adabag, which makes the results easy to interpret and takes fewer lines of code.

Random predictions from linear model in R

I have some data with some missing values for one variable, and I want to be able to create (random) predictions for what these could be. Here's my first thought:
# miss indicates where the observations with missing response are
library(MASS)
model <- glm.nb(data[-miss,4] ~ ., data=data[-miss,-4])
predict(model, newdata=data[miss,-4])
However, if I repeat the last line, it gives the same answers over and over - it appears to give the predicted mean of responses given that data and the model. I want a random prediction which incorporates variance i.e. a random draw from the distribution of the response of an observation with such predictors under the given model.
It could have something to do with the pred.var argument, but I'm unsure how to use that.
Suppose we have data like this:
set.seed(101)
dd <- data.frame(x=(1:20)*0.1)
dd$y <- rnbinom(20,mu=exp(dd$x),size=1)
## make some missing values
miss <- c(2,3,5)
dd$y[miss] <- NA
Now fit a model:
m1 <- MASS::glm.nb(y~x,dd,na.action=na.exclude)
Now use predictions from that model to get the expected mean value and rnbinom to generate the random values ...
p <- predict(m1,newdata=dd,type="response")
randvals <- rnbinom(length(p),mu=p,size=m1$theta)
(This gives random values for every element, not just the missing ones, but obviously you can pick out just the ones you want ...) It would be nice if the simulate method did this, but it's not quite flexible enough ...
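A minimal sketch that restricts the draws to the missing rows, reusing the simulated data from this answer. The column name y_filled is introduced only for illustration. Note that this draws from the fitted distribution with the estimated coefficients and theta treated as fixed, so it ignores the uncertainty in those estimates.

```r
library(MASS)

set.seed(101)
dd <- data.frame(x = (1:20) * 0.1)
dd$y <- rnbinom(20, mu = exp(dd$x), size = 1)
miss <- c(2, 3, 5)
dd$y[miss] <- NA

m1 <- glm.nb(y ~ x, data = dd, na.action = na.exclude)

# Expected means for the missing rows only, then one random draw per row.
mu_miss <- predict(m1, newdata = dd[miss, , drop = FALSE], type = "response")
draws   <- rnbinom(length(mu_miss), mu = mu_miss, size = m1$theta)

# Fill the gaps in a copy of the response, leaving observed values untouched.
dd$y_filled <- dd$y
dd$y_filled[miss] <- draws
```

Rerunning the rnbinom line gives a different set of draws each time, which is the random behaviour the question asks for.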

How to obtain AUC using leave-one-out cross-validation in R?

I have a matrix (x) containing 100 samples (rows) and 10000 independent features (columns). The observations are binary: each sample is either good or bad {0,1} (stored in vector y). I want to perform leave-one-out cross-validation and determine the Area Under Curve (AUC) for each feature separately (something like colAUC from the caTools package). I tried to use glmnet, but it didn't work. As the manual suggests, I set the nfolds parameter equal to the number of observations (100).
result=cv.glmnet(x,y,nfolds=100,type.measure="auc",family="binomial")
And I'm getting these warnings:
Warning messages:
1: Too few (< 10) observations per fold for type.measure='auc' in
cv.lognet; changed to type.measure='deviance'. Alternatively, use smaller
value for nfolds
2: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
fold
Any ideas what I'm doing wrong? And is there any other way or R package to obtain LOO-balanced AUC values for each of the features?
I'll really appreciate any help. Thank you!
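When you do LOO-CV, each test set contains only one sample, and you cannot compute an AUC from a single prediction. However, you can loop over the samples and store the prediction from each step:
library(glmnet)
n <- nrow(x)
predictions <- numeric(n)
for (i in seq_len(n)) {
  model <- glmnet(x[-i, ], y[-i], family = "binomial")
  # keep x[i, ] as a 1-row matrix; s selects one lambda from the path (0.01 is only a placeholder)
  predictions[i] <- predict(model, newx = x[i, , drop = FALSE], s = 0.01, type = "response")
}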
So that in the end you can make a ROC curve, for example:
library(pROC)
roc(y, predictions)
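Since the question asks for an AUC per feature, the same loop can be applied to one column at a time. The sketch below is a base R illustration on simulated data: it swaps glmnet for a plain one-predictor logistic regression (glm) and computes the AUC with a rank-based helper, both of which are assumptions of this sketch and not part of the answer above.

```r
set.seed(42)
n <- 40; p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- rep(0:1, each = n / 2)
x[y == 1, 1] <- x[y == 1, 1] + 1.5    # make feature 1 informative

# Rank-based (Mann-Whitney) AUC for scores against 0/1 labels.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# For each feature: leave one sample out, fit on the rest, predict the
# held-out sample, then compute one AUC from the n pooled predictions.
loo_auc <- sapply(seq_len(p), function(j) {
  d <- data.frame(y = y, f = x[, j])
  pred <- vapply(seq_len(n), function(i) {
    fit <- glm(y ~ f, data = d[-i, ], family = binomial)
    predict(fit, newdata = d[i, , drop = FALSE], type = "response")
  }, numeric(1))
  auc(pred, y)
})
```

With 100 samples and 10000 features this means a million small model fits, so in practice the outer loop over features would probably need to be parallelised or replaced by something cheaper.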
