R | How to get accuracy from cv.glmnet - r

I've been using the cv.glmnet function in R to fit a lasso logistic regression model.
Here's my code. I'm using the iris dataset.
library(dplyr)
library(glmnet)
# keep two species so the outcome is binary
df = iris %>%
  mutate(Species = as.character(Species)) %>%
  filter(!(Species == "setosa")) %>%
  mutate(Species = as.factor(Species))
X = data.matrix(df %>% select(-Species))
y = df$Species
Model = cv.glmnet(X, y, alpha = 1, family = "binomial")
How do I get the model accuracy from the cv.glmnet object (Model)?
If I use caret on an ordinary logistic regression model, accuracy is already in the output.
train_control = trainControl(method = "cv", number = 10)
M2 = train(Species ~., data = df, trControl = train_control,
method = "glm", family = "binomial")
M2$results
but a cv.glmnet object doesn't seem to contain this information.

You want to add type.measure = 'class', as in Model2 below; otherwise the default for family = 'binomial' is 'deviance'.
df = iris %>%
mutate(Species = as.character(Species)) %>%
filter(!(Species =="setosa")) %>%
mutate(Species = as.factor(Species))
X = data.matrix(df %>% select(-Species))
y = df$Species
Model = cv.glmnet(X, y, alpha = 1, family = "binomial")
Model2 = cv.glmnet(X, y, alpha = 1, family = "binomial", type.measure = 'class')
Then cvm gives the cross-validated misclassification rate, so the accuracy is 1 - cvm.
Model2$lambda ## lambdas used in CV
Model2$cvm ## mean cross-validated error for each of those lambdas
If you want results for the best lambda, you can use lambda.min
Model2$lambda.min ## lambda with the lowest cvm
Model2$cvm[Model2$lambda==Model2$lambda.min] ## cvm for lambda.min
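A minimal sketch of turning the cross-validated misclassification rate into an accuracy, assuming the same two-class iris subset as in the question. The names cv_fit, idx and cv_accuracy are introduced here for illustration only, and the seed is arbitrary.

```r
library(glmnet)

# Two-class subset of iris, as in the question
df <- droplevels(iris[iris$Species != "setosa", ])
X <- data.matrix(df[, 1:4])
y <- df$Species

set.seed(1)  # fold assignment is random, so fix a seed for reproducibility
cv_fit <- cv.glmnet(X, y, alpha = 1, family = "binomial",
                    type.measure = "class")

# Position of lambda.min within the lambda sequence
idx <- which(cv_fit$lambda == cv_fit$lambda.min)

# Cross-validated accuracy = 1 - cross-validated misclassification rate
cv_accuracy <- 1 - cv_fit$cvm[idx]
cv_accuracy
```

Because the folds are drawn at random, the value changes slightly from run to run unless the seed is fixed.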

Related

GLM Family using tidymodels

I am trying to use the tidymodels package for a GLM and want to use the Gamma or Poisson distribution.
Using glm I would use something like the following
# using glm
mdl <- glm(data = data, y ~ x, family = Gamma(link = "inverse"))
mdl <- glm(data = data, y ~ x, family = poisson(link = "log"))
# using glmnet
library(glmnet)
mdl <- glmnet(data$x, data$y, family = Gamma(link = "inverse"))
mdl <- glmnet(data$x, data$y, family = poisson(link = "log"))
How can I achieve the same using tidymodels? Note that I am trying to do a regression and not a classification (logistic regression) for which I could use parsnip::logistic_reg().
I found one article on Generalized Linear Models on tidymodels, which belongs to the embed package but does not show how to specify the family.
I would expect something similar to this (which does not work as neither linear_reg has the parameters family or link, nor does set_engine support glm in linear regression mode)
mdl <- linear_reg(mode = "regression", family = "gamma", link = "inverse") %>% set_engine("glm") # or glmnet
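That was easier than expected: extra arguments given to set_engine() are passed on to the fitting function, so the family can be set there.
# a built-in glmnet family, given by name
mdl <- linear_reg(mode = "regression") %>%
  set_engine("glmnet", family = "poisson")
# or a glm() family object (needs glmnet >= 4.0; glmnet has no built-in "gamma" string)
mdl <- linear_reg(mode = "regression") %>%
  set_engine("glmnet", family = Gamma(link = "inverse"))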

preprocessing (center and scale) only specific variables (numeric variables)

I have a dataframe that consists of numerical and non-numerical variables. I am trying to fit a logistic regression model predicting my variable "risk" based on all other variables, optimizing AUC using 6-fold cross-validation.
However, I want to center and scale all numerical explanatory variables. My code raises no errors or warning but somehow I fail to figure out how to tell train() through preProcess (or in some other way) to just center and scale my numerical variables.
Here is the code:
test <- train(risk ~ .,
method = "glm",
data = df,
family = binomial(link = "logit"),
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv",
number = 6,
classProbs = TRUE,
summaryFunction = prSummary),
metric = "AUC")
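You could standardize all numerical variables in the original df first and then call train() on the scaled df, without the preProcess argument. Note that this computes the means and standard deviations once on the full data rather than within each fold.
library(dplyr)
library(caret)
# as.numeric() drops the matrix attributes that scale() returns
df <- df %>%
  dplyr::mutate_if(is.numeric, ~ as.numeric(scale(.)))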
test <- train(risk ~ .,
method = "glm",
data = df,
family = binomial(link = "logit"),
trControl = trainControl(method = "cv",
number = 6,
classProbs = TRUE,
summaryFunction = prSummary),
metric = "AUC")
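The scaling step can be checked in isolation. The sketch below uses base R only and a small made-up data frame (the column names age, income and sex are hypothetical, not from the question) to show that only the numeric columns are centered and scaled while factors are left alone.

```r
# Toy data: two numeric predictors, one factor predictor, and the outcome
df <- data.frame(
  age    = c(25, 40, 58, 33),
  income = c(30, 52, 75, 41),
  sex    = factor(c("f", "m", "f", "m")),
  risk   = factor(c("low", "high", "high", "low"))
)

# Flag the numeric columns, then center and scale just those
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], function(v) as.numeric(scale(v)))

colMeans(df[num_cols])      # approximately 0 for each numeric column
sapply(df[num_cols], sd)    # 1 for each numeric column
str(df)                     # sex and risk are still factors
```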

Using caret with recipes is leading to difficulties with resample

I've been using recipes to pipe into caret::train, which has been going well, but now I've tried some step_transforms, I'm getting the error:
Error in resamples.default(model_list) :
There are different numbers of resamples in each model
when I compare models with and without the transformations. The same code with step_centre and step_scale works fine.
library(caret)
library(tidyverse)
library(tidymodels)
formula <- price ~ carat
model_recipe <- recipe(formula, data = diamonds)
quadratic_model_recipe <- recipe(formula, data = diamonds) %>%
step_poly(all_predictors())
model_list <- list(
linear_model = NULL,
quadratic = NULL
)
model_list$linear_model <-
model_recipe %>% train(
data = diamonds,
method = "lm",
trControl = trainControl(method = "cv"))
model_list$quadratic_model <-
quadratic_model_recipe %>% train(
data = diamonds,
method = "lm",
trControl = trainControl(method = "cv"))
resamp <- resamples(model_list)
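The list is initialised with an element named quadratic, but the second model is later assigned to quadratic_model. That leaves an empty quadratic entry in model_list next to the two fitted models, and resamples() fails on it. quadratic = NULL should have been quadratic_model = NULL.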
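The effect of the name mismatch can be seen without fitting anything. In this sketch the strings "fit A" and "fit B" are placeholders standing in for the two train objects.

```r
# Same initialisation as in the question
model_list <- list(
  linear_model = NULL,
  quadratic = NULL
)

# Placeholders instead of real caret fits
model_list$linear_model <- "fit A"
model_list$quadratic_model <- "fit B"

names(model_list)              # three entries, not two
is.null(model_list$quadratic)  # the mis-named slot was never filled
```

Initialising the list with quadratic_model = NULL, or simply starting from an empty list(), avoids the leftover entry.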

Prediction of single bagged tree models dependent on pre-processing using caret

I'm using the caret package to predict a time series with method treebag. caret estimates bagging regression trees with 25 bootstrap replications.
What I'm struggling to understand is how the final prediction of that 'treebag model' relates to the predictions made by each of the 25 trees, depending on whether I use caret::preProcess, or not.
I am aware of this question and the linked resources therein. (But could not draw the right conclusions from it.)
Here is an example using the economics data. Let's say I want to predict unemploy_rate, which has to be created first.
# packages
library(caret)
library(tidyverse)
# data
data("economics")
economics$unemploy_rate <- economics$unemploy / economics$pop * 100
x <- economics[, -c(1, 7)]
y <- economics[["unemploy_rate"]]
I wrote a function that extracts the 25 individual trees from the train object, makes a prediction for each tree, averages these 25 predictions, and compares this average with the prediction from the train object. It returns a plot.
predict_from_treebag <- function(model) {
# extract 25 trees from train object
bagged_trees <- map(.x = model$finalModel$mtrees, .f = pluck, "btree")
# make a prediction for each tree
pred_trees <- map(bagged_trees, .f = predict, newdata = x)
names(pred_trees) <- paste0("tree_", seq_along(pred_trees))
# aggregate predictions
pred_trees <- as.data.frame(pred_trees) %>%
add_column(date = economics$date, .before = 1) %>%
gather(tree, value, matches("^tree")) %>%
group_by(date) %>%
mutate(mean_pred_from_trees = mean(value)) %>%
ungroup()
# add prediction from train object
pred_trees$bagging_model_prediction = predict(model, x)
pred_trees <- pred_trees %>%
gather(model, pred_value, 4:5)
# plot
p <- ggplot(data = pred_trees, aes(date)) +
geom_line(aes(y = value, group = tree), alpha = .2) +
geom_line(aes(y = pred_value, col = model)) +
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom"
)
p
}
Now I estimate two models, the first will be unscaled, the second will be centered and scaled.
preproc_opts <- list(unscaled = NULL,
scaled = c("center", "scale"))
# estimate the models
models <- map(preproc_opts, function(preproc)
train(
x = x,
y = y,
trControl = trainControl(method = "none"), # since there are no tuning parameters for this model
metric = "RMSE",
method = "treebag",
preProcess = preproc
))
# apply predict_from_treebag to each model
imap(.x = models,
.f = ~{predict_from_treebag(.x) + labs(title = .y)})
The results are shown below. The unscaled model prediction is the average of the 25 trees but why is each prediction from the 25 trees a constant when I use preProcess?
Thank you for any advice where I might be wrong.
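The problem is in this line of predict_from_treebag:
pred_trees <- map(bagged_trees, .f = predict, newdata = x)
The predict call here dispatches to predict.rpart, since
class(bagged_trees[[1]])
returns "rpart". predict.rpart does not know that caret pre-processed the data, so the trees, which were grown on centered and scaled predictors, are given the raw values in x.
Here is a quick fix that applies the stored pre-processing to x before predicting from the individual trees: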
predict_from_treebag <- function(model) {
# extract 25 trees from train object
bagged_trees <- map(.x = model$finalModel$mtrees, .f = pluck, "btree")
x <- economics[, -c(1, 7)]
# make a prediction for each tree
newdata = if(is.null(model$preProcess)) x else predict(model$preProcess, x)
pred_trees <- map(bagged_trees, .f = predict, newdata = newdata)
names(pred_trees) <- paste0("tree_", seq_along(pred_trees))
# aggreagte predictions
pred_trees <- as.data.frame(pred_trees) %>%
add_column(date = economics$date, .before = 1) %>%
gather(tree, value, matches("^tree")) %>%
group_by(date) %>%
mutate(mean_pred_from_trees = mean(value)) %>%
ungroup()
# add prediction from train object
pred_trees$bagging_model_prediction = predict(model, x)
pred_trees <- pred_trees %>%
gather(model, pred_value, 4:5)
# plot
p <- ggplot(data = pred_trees, aes(date)) +
geom_line(aes(y = value, group = tree), alpha = .2) +
geom_line(aes(y = pred_value, col = model)) +
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom"
)
p
}
Now after running:
preproc_opts <- list(unscaled = NULL,
scaled = c("center", "scale"))
models <- map(preproc_opts, function(preproc)
train(
x = x,
y = y,
trControl = trainControl(method = "none"), # since there are no tuning parameters for this model
metric = "RMSE",
method = "treebag",
preProcess = preproc
))
map2(.x = models,
.y = names(models),
.f = ~{predict_from_treebag(.x) + labs(title = .y)})
the results are as expected for both models.
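The same effect can be reproduced on a small scale with rpart alone. This is a sketch on simulated data, not the economics example: the predictor is centered and scaled by hand as a stand-in for caret's preProcess step, and all names (raw, scaled, tree) are made up for the illustration.

```r
library(rpart)

set.seed(42)
raw <- data.frame(x = runif(200, 100, 200))
raw$y <- 0.05 * raw$x + rnorm(200, sd = 0.1)

# Center and scale the predictor by hand
mu <- mean(raw$x)
s  <- sd(raw$x)
scaled <- data.frame(x = (raw$x - mu) / s, y = raw$y)

# The tree only ever sees the scaled predictor
tree <- rpart(y ~ x, data = scaled)

# Raw x lies far above every split point, so all rows land in one leaf
pred_raw <- predict(tree, newdata = raw)
# x transformed the same way as during training
pred_scaled <- predict(tree, newdata = scaled)

length(unique(pred_raw))     # a single constant prediction
length(unique(pred_scaled))  # several distinct leaf values
```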

Why do the coefficient estimates of glmnet vary a lot between models with the same input parameters?

I have been trying to fit a lasso model using cv.glmnet. I fitted four models (three with cv.glmnet and one with caret::train) that differ only in how the predictors are standardized. All four give very different coefficient estimates, and I can't figure out why.
Here is a fully reproducible code:
library("glmnet")
data(iris)
iris <- iris
dat <- iris[iris$Species %in% c("setosa","versicolor"),]
X <- as.matrix(dat[,1:4])
Y <- as.factor(as.character(dat$Species))
set.seed(123)
model1 <- cv.glmnet(x = X,
y = Y,
family = "binomial",
standardize = FALSE,
alpha = 1,
lambda = rev(seq(0,1,length=100)),
nfolds = 3)
set.seed(123)
model2 <- cv.glmnet(x = scale(X, center = T, scale = T),
y = Y,
family = "binomial",
standardize = FALSE,
alpha = 1,
lambda = rev(seq(0,1,length=100)),
nfolds = 3)
set.seed(123)
model3 <- cv.glmnet(x = X,
y = Y,
family = "binomial",
standardize = TRUE,
alpha = 1,
lambda = rev(seq(0,1,length=100)),
nfolds = 3)
##Using caret
library("caret")
lambda.grid <- rev(seq(0,1,length=100)) #set of lambda values for cross-validation
alpha.grid <- 1 #alpha
trainControl <- trainControl(method ="cv",
number=3) #3-fold cross-validation
tuneGrid <- expand.grid(.alpha=alpha.grid, .lambda=lambda.grid) #these are tuning parameters to be passed into the train function below
set.seed(123)
model4 <- train(x = X,
y = Y,
method="glmnet",
family="binomial",
standardize = FALSE,
trControl = trainControl,
tuneGrid = tuneGrid)
c1 <- coef(model1, s=model1$lambda.min)
c2 <- coef(model2, s=model2$lambda.min)
c3 <- coef(model3, s=model3$lambda.min)
c4 <- coef(model4$finalModel, s=model4$finalModel$lambdaOpt)
c1 <- as.matrix(c1)
c2 <- as.matrix(c2)
c3 <- as.matrix(c3)
c4 <- as.matrix(c4)
model2 scales the independent variables (matrix X) beforehand and model3 does so by setting standardize = TRUE. So at least these two models should return identical results, but they do not.
The lambda.min obtained from the four models are:
model1 = 0
model2 = 0
model3 = 0
model4 = 0.6565657
The coefficient estimates between the models differ drastically too. Why would this be occurring?
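Actually there is a small difference between scale(x) with standardize = FALSE and x with standardize = TRUE: scale() divides by the sample standard deviation (denominator N-1), while glmnet standardizes with the 1/N formula. To compare the two we need to apply a correction based on (N-1)/N.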
See here.
If we use gaussian family,
library(glmnet)
X <- matrix(runif(100, 0, 1), ncol=2)
y <- 1 -2*X[,1] + X[,2]
enet <- glmnet(X, y, lambda=0.1,standardize = T,family="gaussian")
coefficients(enet)
coef <- coefficients(enet)
coef[2]*sd(X[,1])/sd(y) #standardized coef
#[1] -0.6895065
enet1 <- glmnet(scale(X)/99*100, y/(99/100*sd(y)),lambda=0.1/(99/100*sd(y)),standardize = F,family="gaussian")
coefficients(enet1)[2]
#[1] -0.6894995
If we use binomial family,
data(iris)
iris <- iris
dat <- iris[iris$Species %in% c("setosa","versicolor"),]
X <- as.matrix(dat[,1:4])
Y <- as.factor(as.character(dat$Species))
set.seed(123)
model1 <- cv.glmnet(x = X,
y = Y,
family = "binomial",
standardize = T,
alpha = 1,
lambda = rev(seq(0,1,length=100)),
nfolds = 3)
coefficients(model1,s=0.03)[3]*sd(X[,2])
#[1] -0.3374946
set.seed(123)
model3 <- cv.glmnet(x = scale(X)/99*100,
y = Y,
family = "binomial",
standardize = F,
alpha = 1,
lambda = rev(seq(0,1,length=100)),
nfolds = 3)
coefficients(model3,s=0.03)[3]
#[1] -0.3355027
These results are nearly the same. Hope it is not too late for this answer.
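The size of the gap between the two standardizations can be checked directly. The sketch below uses a small made-up vector (not data from the question) and assumes that glmnet standardizes with the 1/N variance formula; it compares that standard deviation with the 1/(N-1) version used by sd() and scale().

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # illustrative values only
n <- length(x)

sd_sample <- sd(x)                        # denominator n - 1, as used by scale()
sd_pop    <- sqrt(mean((x - mean(x))^2))  # denominator n

# The two standard deviations differ by a factor of sqrt((n - 1) / n)
ratio <- sd_pop / sd_sample
all.equal(ratio, sqrt((n - 1) / n))
```

For N = 100 this factor is close to 1, which is consistent with the two fits above agreeing closely but not exactly.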
