ROC with cross-validation for linear regression in R

A two-part question. I'm trying to figure out: (1) how to generate an ROC curve for a linear regression fit with lm() (and whether that is even a proper thing to do), and (2) how to do it with k-fold cross-validation so I can get the mean ROC curve (and AUC).
If the outcome is a continuous variable, it has to be converted into a binary variable, right? Normally I would fit a logistic regression model using glm(..., family = 'binomial') instead, but is that the most appropriate way? (It seems like I'm just fitting a different model.)
I would like something like this plot below from the cvAUC package's rdrr.io website (red line is the mean ROC curve, dotted lines are k-fold ROC curves), but I'm not sure how to get there with my data.
Example with data(USArrests):
library(dplyr)
library(pROC)
data(USArrests)
# create train and test sets
set.seed(2021)
dat <- mutate(USArrests, index=1:nrow(USArrests))
train.dat <- sample_frac(dat, 0.5) # splits `dat` in half
test.dat <- subset(dat, !dat$index %in% train.dat$index) # uses other half to test
# trying to build predictions with lm()
fit <- lm(Murder ~ Assault, data = train.dat)
predicted <- predict(fit, test.dat, type = "response")
# roc curve
roc(test.dat$Murder ~ predicted, plot = TRUE, print.auc = TRUE) # AUC = 1.000
The code above gets results, but gives a warning:
Warning message:
In roc.default(response, m[[predictors]], ...) :
'response' has more than two levels. Consider setting 'levels' explicitly or using 'multiclass.roc' instead
I don't know what to do with its suggestion. It also gives an AUC of 1.000 -- is this approach wrong, and if so, why?
Moreover, this only works with a single train/test split. I'm not sure how to train with k-fold sets. I think I have to combine it with caret::train() somehow. I tried the ROC solutions for random forest models from "ROC curve from training data in caret", but they are not working with my code.
Example:
library(caret)
library(MLeval)
train_control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
rfFit <- train(Murder ~ Assault, data = USArrests, trControl = train_control, method = "lm")
rfFit$pred$mtry # NULL
res <- MLeval::evalm(rfFit) # error with error message below
MLeval: Machine Learning Model Evaluation
Input: caret train function object
Not averaging probs.
Group 1 type: cv
Error in [.data.frame(preds, c(G1, G2, "obs")) :
undefined columns selected

You could do the cross-validation like this if you switched it to a 0/1 variable:
USArrests <- USArrests %>%
  mutate(Murder01 = as.numeric(Murder > mean(Murder, na.rm = TRUE)))
# set a seed so the folds are reproducible
set.seed(2021)
cvfun <- function(split, ...){
  mod <- glm(Murder01 ~ Assault, data = analysis(split), family = binomial)
  fit <- predict(mod, newdata = assessment(split), type = "response")
  data.frame(fit = fit,
             y = model.response(model.frame(formula(mod), data = assessment(split))))
}
library(rsample)
library(purrr)
library(tidyverse)
cv_out <- vfold_cv(USArrests, v = 10, repeats = 5) %>%
  mutate(fit = map(splits, cvfun)) %>%
  unnest(fit) %>%
  group_by(id) %>%
  summarise(auc = roc(y, fit, plot = FALSE)$auc[1])
cv_out
# # A tibble: 5 x 2
#   id        auc
# * <chr>   <dbl>
# 1 Repeat1 0.936
# 2 Repeat2 0.928
# 3 Repeat3 0.937
# 4 Repeat4 0.918
# 5 Repeat5 0.942
That said, I'm not sure this is better than using something like R-squared or MSE on the linear model. Also, I'm not super confident that the algorithm in the tutorial is actually doing something that makes sense statistically. I could definitely be wrong and would defer to someone with more expertise, but I can't see how it makes much sense, and it certainly doesn't produce something meaningful in this case -- you would think an AUC of 1 would only happen with perfect prediction.
Further, I'm not sure what probative value these numbers have. Generally you would use this sort of analysis to tune the model specification, often by finding nearly optimal values of hyper-parameters, or to compare different specifications. For example, you could evaluate the relative predictive power of a model with a second-degree polynomial in Assault versus one that is linear, as below.
cvfun2 <- function(split, ...){
  mod <- glm(Murder01 ~ poly(Assault, 2), data = analysis(split), family = binomial)
  fit <- predict(mod, newdata = assessment(split), type = "response")
  data.frame(fit = fit,
             y = model.response(model.frame(formula(mod), data = assessment(split))))
}
cv_out2 <- vfold_cv(USArrests, v = 10, repeats = 5) %>%
  mutate(fit = map(splits, cvfun2)) %>%
  unnest(fit) %>%
  group_by(id) %>%
  summarise(auc = roc(y, fit, plot = FALSE)$auc[1])
mean(cv_out2$auc)
# [1] 0.9123994
mean(cv_out$auc)
# [1] 0.9320451
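For comparison with the AUC approach, here is a minimal sketch (not part of the original answer) of how the same resampling scheme could report cross-validated RMSE for the untransformed lm(); cvfun_lm is a hypothetical helper name.
# Sketch: cross-validated RMSE for the original continuous outcome
cvfun_lm <- function(split, ...){
  mod <- lm(Murder ~ Assault, data = analysis(split))
  data.frame(pred = predict(mod, newdata = assessment(split)),
             y = assessment(split)$Murder)
}
cv_rmse <- vfold_cv(USArrests, v = 10, repeats = 5) %>%
  mutate(fit = map(splits, cvfun_lm)) %>%
  unnest(fit) %>%
  group_by(id) %>%
  summarise(rmse = sqrt(mean((y - pred)^2)))
cv_rmse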
Edit - Making the ROC plot
cv_out_plot <- vfold_cv(USArrests, v = 10, repeats = 5) %>%
  mutate(fit = map(splits, cvfun)) %>%
  unnest(fit) %>%
  group_by(id) %>%
  summarise(sens = roc(y, fit, plot = FALSE)$sensitivities,
            spec = roc(y, fit, plot = FALSE)$specificities,
            obs = 1:length(sens))
ave <- cv_out_plot %>%
  ungroup() %>%
  group_by(obs) %>%
  summarise(sens = mean(sens),
            spec = mean(spec),
            id = "Average")
cv_out_plot <- bind_rows(cv_out_plot, ave) %>%
  mutate(col = factor(ifelse(id == "Average", "Average", "Individual"),
                      levels = c("Individual", "Average")))
ggplot(cv_out_plot, aes(x = 1 - sens, y = spec, group = id, colour = col)) +
  geom_line(aes(size = col, alpha = col)) +
  scale_colour_manual(values = c("black", "red")) +
  scale_size_manual(values = c(.5, 1.25)) +
  scale_alpha_manual(values = c(.3, 1)) +
  theme_classic() +
  theme(legend.position = c(.75, .15)) +
  labs(x = "1-Sensitivity", y = "Specificity", colour = "", alpha = "", size = "")

Related

tidymodels roc auc results in multiple classification are affected by first level of factor

Using the iris dataset, a knn classifier was tuned with iterative search, using roc_auc as the metric, for a multiclass classification task.
One AUC value per candidate model was calculated, as expected. Nevertheless, this value is not stable; it is affected by:
the order of the levels ("setosa", "virginica", "versicolor") in the Species column of the initial dataset
the order of the probability columns in roc_auc(truth = Species, .pred_setosa, .pred_virginica, .pred_versicolor)
Does this indicate that the AUC is calculated as if the first level of the Species column were the positive event (which would be expected in binary classification, whereas in multiclass classification a single AUC based on e.g. a one-vs-all comparison would be appropriate)?
If so, is there a way to select a candidate model based on e.g. the average of the AUC values produced by the one-vs-all comparisons?
Could this also be implemented in the metric_set() during the iterative search?
Thank you in advance for your support!
library(tidyverse)
library(tidymodels)
tidymodels_prefer()
df <- iris %>%
  mutate(Species = factor(Species, levels = c("virginica", "versicolor", "setosa")))
set.seed(2023)
splits <- initial_split(df, strata = Species, prop = 4/5)
df_train <- training(splits)
df_test <- testing(splits)
df_rec <- recipe(Species ~ ., data = df_train)
knn_model <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")
df_wflow <- workflow() %>%
  add_model(knn_model) %>%
  add_recipe(df_rec)
set.seed(2023)
knn_cv <-
  df_wflow %>%
  tune_bayes(
    metrics = metric_set(roc_auc),
    resamples = vfold_cv(df_train, strata = "Species", v = 2),
    control = control_bayes(verbose = TRUE, save_pred = TRUE)
  )
cv_train_metrics <- knn_cv %>%
  collect_predictions() %>%
  group_by(.config, id) %>%
  roc_auc(truth = Species, .pred_setosa, .pred_virginica, .pred_versicolor)
roc_auc() expects that the columns that have the probability estimates are in the same order as the factor levels. We'll make the documentation better for that.
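For example (a sketch added for illustration, not from the original answer): with the factor levels set to c("virginica", "versicolor", "setosa") as in the question, the probability columns would be passed in that same order.
cv_train_metrics <- knn_cv %>%
  collect_predictions() %>%
  group_by(.config, id) %>%
  roc_auc(truth = Species, .pred_virginica, .pred_versicolor, .pred_setosa)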
By default, we use the method of Hand and Till to compute the area under a single multiclass ROC curve.
So this is not computing multiple ROC curves by default. You can change the estimator argument to use different averaging methods, though I would not suggest that for this metric.
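For completeness, here is a sketch (my addition) of what switching the estimator would look like; "macro" averages one-vs-all AUCs, which, as noted above, is not the recommended choice for roc_auc().
knn_cv %>%
  collect_predictions() %>%
  group_by(.config, id) %>%
  roc_auc(truth = Species, .pred_virginica, .pred_versicolor, .pred_setosa,
          estimator = "macro")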

How do I extract standard errors or variation from predicted ordinal logistic regression analyses?

I am undertaking an ordinal logistic regression using the R package MASS.
For example:
library(MASS)
house.plr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
summary(house.plr, digits = 3)
I am using the S3 method predict() to obtain the predicted values:
test_dat <- data.frame(Infl = factor(rep("Low", 4)),
                       Cont = factor(rep("Low", 4)),
                       Type = unique(housing$Type))
predict(house.plr, test_dat, type = "p")
        Low    Medium      High
1 0.3784493 0.2876752 0.3338755
2 0.5190445 0.2605077 0.2204478
3 0.4675584 0.2745383 0.2579033
4 0.6444840 0.2114256 0.1440905
The result is a table of predicted means for each level of Sat given the variables defined in the test_dat.
How might I extract the variation around each of these means in the form of a standard error or standard deviation?
First, your predicted values are the predicted probabilities of each outcome for each observation; they are not predicted means on the response scale.
Second, you can use the marginaleffects package to get delta-method standard errors for the predicted probabilities and then calculate the confidence intervals yourself. Alternatively, you can implement a non-parametric bootstrap. I implement both below. Note that I shifted the order of the columns in the test data to match the training data.
# Packages
library(MASS)
library(marginaleffects)
library(dplyr)
# Create a test set
N <- 4
test_dat <- data.frame(
  Infl = factor(rep("Low", N)),
  Type = unique(housing$Type),
  Cont = factor(rep("Low", N))
)
# Fit ordered logistic regression model
house.plr <- polr(Sat ~ Infl + Type + Cont,
                  weights = Freq,
                  data = housing,
                  Hess = TRUE)
# Demonstrate that predict() doesn't provide any measure of variability
# for the predicted class probabilities, as shown in question
predict(house.plr, test_dat, type = "probs")
# Use the marginaleffects package to get delta method standard errors for
# each predicted probability
probs <- marginaleffects::predictions(house.plr,
                                      newdata = test_dat,
                                      type = "probs")
# Compute CIs from the standard error using normal approximation
probs$predicted - 1.96*probs$std.error
probs$predicted + 1.96*probs$std.error
# Alternatively, use non-parametric bootstrapped confidence intervals.
# Note that this does not adjust the weights to a constant sum for
# each bootstrap, although it is easy to implement. You're free to
# determine how to handle the weights, including resampling based
# on the weights.
# Generate bootstrapped data.frames
set.seed(123)
sims <- 5
samples <- vector(mode = "list", length = sims)
samples <- lapply(samples, function(x){
  slice_sample(housing, n = nrow(housing), replace = TRUE)
})
# Fit model on each bootstrapped data.frame
models <- lapply(samples, function(x){
  polr(Sat ~ Infl + Type + Cont,
       weights = Freq,
       data = x,
       Hess = TRUE)
})
# Get test predictions into a data.frame
probs_boot <- lapply(models, function(x) {
  marginaleffects::predictions(x,
                               newdata = test_dat,
                               type = "probs")
})
probs_boot_df <- bind_rows(probs_boot)
# Compute CIs
probs_boot_df %>%
  group_by(group, Type.x, Infl, Type.y, Cont) %>%
  summarise(ci_low = quantile(predicted, probs = 0.025),
            ci_high = quantile(predicted, probs = 0.975))
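If you want a single standard error rather than an interval, one option (a sketch, assuming the column names produced by the marginaleffects/bind_rows code above) is to take the standard deviation of the bootstrap distribution:
# Bootstrap SE: the SD of the bootstrapped predicted probabilities,
# per outcome level and test-set row
probs_boot_df %>%
  group_by(group, Type.x, Infl, Type.y, Cont) %>%
  summarise(boot_se = sd(predicted))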

All values of AUC ROC Curve 1 using tidymodels

Trying to fit a LASSO model with a binary outcome using tidymodels, I have essentially copied the case study from the tidymodels webpage (https://www.tidymodels.org/start/case-study/) (the hotel stays dataset) and applied it to my own data, but for some reason all of the values of my area under the ROC curve are coming out as 1 (as you can see from the graph below). The only thing I have changed is the recipe (to try to suit my data):
recipe(outcome ~ ., data = df_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_medianimpute(all_predictors())
So I don't know whether my recipe is incorrect or my data is unsuitable for some reason. As mentioned, I have a binary outcome and 68 predictors (59 factors and 9 numeric); some have missing data, but I thought step_medianimpute() would deal with that. Many thanks for any help anyone can offer.
My AUC ROC Curve
Without seeing the data it is hard to know for sure, but your results indicate a couple of things.
Firstly, the AUC ROC of 1. An AUC ROC of 1 for a binary classification model indicates that the model can separate the two classes perfectly. This could either be a case of overfitting or mean that you simply have linearly separable classes.
Secondly, the constant metric value for different values of the penalty. For a LASSO model, as the penalty increases, more and more variables are shrunk to zero. In your case, for all values of the penalty (if you are following the post, that is 10^(-4) through 10^(-1)) you are seeing the same performance. That means that even with a penalty of 10^(-1) you still haven't shrunk enough predictors to hurt/change the performance. Reprex below:
set.seed(1234)
library(tidymodels)
response <- rep(c(0, 10), length.out = 1000)
data <- bind_cols(
  response = factor(response),
  map_dfc(seq_len(50), ~ rnorm(1000, response))
)
data_split <- initial_split(data)
data_train <- training(data_split)
data_test <- testing(data_split)
lasso_spec <- logistic_reg(mixture = 1, penalty = tune()) %>%
  set_engine("glmnet")
lasso_wf <- workflow() %>%
  add_model(lasso_spec) %>%
  add_formula(response ~ .)
data_folds <- vfold_cv(data_train)
param_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
tune_res <- tune_grid(
  lasso_wf,
  resamples = data_folds,
  grid = param_grid
)
autoplot(tune_res)
What you can do is expand the range of penalties until the performance changes. Below we see that once the penalty is high enough, the last important predictors are shrunk to zero and we lose performance.
param_grid <- tibble(penalty = 10^seq(-1, 0, length.out = 30))
tune_res <- tune_grid(
  lasso_wf,
  resamples = data_folds,
  grid = param_grid
)
autoplot(tune_res)
To verify, we fit the model using one of the good performance penalties and we get perfect predictions.
lasso_final <- finalize_workflow(lasso_wf, select_best(tune_res))
lasso_final_fit <- fit(lasso_final, data = data_train)
augment(lasso_final_fit, new_data = data_train) %>%
  conf_mat(truth = response, estimate = .pred_class)
#>           Truth
#> Prediction   0  10
#>         0  375   0
#>         10   0 375
Created on 2021-05-08 by the reprex package (v2.0.0)
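As a follow-up (my addition, assuming a reasonably recent workflows/parsnip setup), you could also inspect which coefficients remain non-zero at the selected penalty:
# Coefficients of the finalized LASSO fit; the non-zero terms are the
# predictors the model kept at the selected penalty
lasso_final_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(estimate != 0)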

How to incorporate tidy models PCA into the workflow of a model and make predictions

I am trying to incorporate tidymodels PCA into the workflow of a model. I want to have a predictive model that uses PCA as a preprocessing step and then makes predictions with that model.
I have tried the following approach,
diamonds <- diamonds %>%
  select(-clarity, -cut, -color)
diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_test <- vfold_cv(diamonds_train)
diamonds_recipe <-
  # The basic formula and all of the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = TRUE) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())
preprocesados <- prep(diamonds_recipe)
linear_model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")
pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)
lr_fitted_workflow <- pca_workflow %>% # option A: workflow, full dataset
  last_fit(diamonds_split)
performance <- lr_fitted_workflow %>% collect_metrics()
test_predictions <- lr_fitted_workflow %>% collect_predictions()
But I get this error:
x Resample1: model (predictions): Error: penalty should be a single numeric value. ...
Warning message:
“All models failed in [fit_resamples()]. See the .notes column.”
Following other tutorials, I tried this other approach, but I don't know how to use the model to make new predictions, because the new data comes in the original (non-PCA) form. So I tried this:
pca_fit <- juice(preprocesados) %>% # option C: no workflow at all
  lm(price ~ ., data = .)
prep_test <- prep(diamonds_recipe, new_data = diamonds_test)
truths <- juice(prep_test) %>%
  select(price)
ans <- predict(pca_fit, new_data = prep_test)
tib <- tibble(row = 1:length(ans), ans, truths)
ggplot(data = tib) +
  geom_smooth(mapping = aes(x = row, y = ans, colour = "predicted")) +
  geom_smooth(mapping = aes(x = row, y = price, colour = "true"))
And it produces something that seems reasonable, but by this point I have lost confidence, and some guidance would be much appreciated. :D
The problem is not in your recipe or the workflow. As described in chapter 7 of Tidy Modeling with R, the function for fitting your model is fit(), and for it to work you'll have to provide the data for the fitting process (here diamonds_train). The trade-off is that you don't have to prep() your recipe, as the workflow will take care of this itself.
So reducing your code slightly, the example below will work.
library(tidymodels)
data(diamonds)
diamonds <- diamonds %>%
  select(-clarity, -cut, -color)
diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_recipe <-
  # The basic formula and all of the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = TRUE) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())
linear_model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")
pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)
pca_fit <- fit(pca_workflow, data = diamonds_train)
As for cross-validation, one has to use fit_resamples() and should split the training set, not the testing set. But here I am currently getting the same error (my answer will be updated if I figure out why).
Edit
Now I've done a bit of digging, and the problem with cross-validation stems from the engine being glmnet. I am guessing that, of the many different aspects, this one has somehow been missed. I've filed a possible issue on the workflows package GitHub site. Answers there are often quick in coming, so one of the developers will likely reply soon.
As for cross-validation, if you instead fit using any of the other engines described in ?linear_reg, then we could do it as follows:
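In the meantime, one possible workaround (an assumption on my part, not something confirmed in that issue) is that glmnet needs a single penalty value at prediction time, so specifying one explicitly in the model spec may let fit_resamples() run:
# Hypothetical workaround: give glmnet an explicit penalty value
linear_model_glmnet <-
  linear_reg(penalty = 0.001) %>%
  set_engine("glmnet") %>%
  set_mode("regression")
fit_resamples(update_model(pca_workflow, linear_model_glmnet),
              resamples = vfold_cv(diamonds_train, 10))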
linear_model_base <-
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
pca_workflow <- update_model(pca_workflow, linear_model_base)
folds <- vfold_cv(diamonds_train, 10)
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds)
and where metrics are of interest, these can indeed be collected, as you did, using collect_metrics():
pca_folds_fit %>% collect_metrics()
If we are interested in the predictions, you'll have to tell the model to save these during the fitting process and then use collect_predictions():
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds, control = control_resamples(save_pred = TRUE))
collect_predictions(pca_folds_fit)
Note, however, that the output from this is the predictions from each fold, as you are literally fitting 10 models.
Usually cross-validation is used to compare multiple models or tuning parameters (e.g. random forest vs. linear model). The model with the best cross-validation performance (collect_metrics) would then be selected for use, and the test dataset would be used to evaluate this model's accuracy.
This is all described in TMwR chapters 10 & 11.
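A sketch (my addition) of that last step with the workflow above (using the lm engine, since it avoids the glmnet issue): fit once on the training data and evaluate once on the held-out test set with last_fit().
final_fit <- last_fit(pca_workflow, split = diamonds_split)
collect_metrics(final_fit)      # test-set metrics
collect_predictions(final_fit)  # test-set predictions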

Example of Time Series Prediction using Neural Networks in R

Does anyone have a quick, short, educational example of how to use neural networks (nnet in R) for prediction?
Here is an example, in R, of a time series:
T = seq(0,20,length=200)
Y = 1 + 3*cos(4*T+2) +.2*T^2 + rnorm(200)
plot(T,Y,type="l")
Many thanks
David
I think you can use the caret package, and especially the train function:
This function sets up a grid of tuning parameters for a number of classification and regression routines.
require(quantmod)
require(nnet)
require(caret)
T = seq(0, 20, length = 200)
y = 1 + 3*cos(4*T+2) + .2*T^2 + rnorm(200)
dat <- data.frame(y, x1 = Lag(y, 1), x2 = Lag(y, 2))
names(dat) <- c('y', 'x1', 'x2')
dat <- dat[c(3:200), ]  # delete first 2 observations
# Fit model
model <- train(y ~ x1 + x2,
               dat,
               method = 'nnet',
               linout = TRUE,
               trace = FALSE)
ps <- predict(model, dat)
# Examine results
plot(T, y, type = "l", col = 2)
lines(T[-c(1:2)], ps, col = 3)
legend(5, 70, c("y", "pred"), cex = 1.5, fill = 2:3)
The solution proposed by @agstudy is useful, but in-sample fits are not a reliable guide to out-of-sample forecasting accuracy. The gold standard in forecasting accuracy measurement is to use a holdout sample. Remove the last 5, 10 or 20 observations (depending on the length of the time series) from the training sample, fit your models to the rest of the data, use the fitted models to forecast the holdout sample, and simply compare accuracies on the holdout using Mean Absolute Deviations (MAD) or weighted Mean Absolute Percentage Errors (wMAPEs).
So to do this you can change the code above in this way:
require(quantmod)
require(nnet)
require(caret)
t = seq(0, 20, length = 200)
y = 1 + 3*cos(4*t+2) + .2*t^2 + rnorm(200)
dat <- data.frame(y, x1 = Lag(y, 1), x2 = Lag(y, 2))
names(dat) <- c('y', 'x1', 'x2')
train_set <- dat[c(3:185), ]
test_set <- dat[c(186:200), ]
# Fit model
model <- train(y ~ x1 + x2,
               train_set,
               method = 'nnet',
               linout = TRUE,
               trace = FALSE)
ps <- predict(model, test_set)
# Examine results
plot(t, y, type = "l", col = 2)
lines(t[c(186:200)], ps, col = 3)
legend(5, 70, c("y", "pred"), cex = 1.5, fill = 2:3)
This last line outputs the wMAPE of the forecasts from the model (dividing the summed absolute errors by the summed actual values of the holdout):
sum(abs(ps - test_set$y)) / sum(test_set$y)
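And, since the answer also mentions MAD, a one-line sketch (my addition) of the mean absolute deviation on the same holdout:
# MAD of the forecasts on the holdout sample
mean(abs(ps - test_set$y))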
