Why do the coefficient estimates of glmnet vary a lot between models with the same input parameters? - r

I have been trying to fit a lasso model using cv.glmnet. I implemented four different models (three using cv.glmnet and one using caret::train) that differ in how standardization is handled. All four models give very different coefficient estimates, and I can't figure out why.
Here is fully reproducible code:
library("glmnet")
data(iris)
iris <- iris
dat <- iris[iris$Species %in% c("setosa","versicolor"),]
X <- as.matrix(dat[,1:4])
Y <- as.factor(as.character(dat$Species))
set.seed(123)
model1 <- cv.glmnet(x = X,
y = Y,
family = "binomial",
standardize = FALSE,
alpha = 1,
lambda = rev(seq(0,1,length=100)),
nfolds = 3)
set.seed(123)
model2 <- cv.glmnet(x = scale(X, center = T, scale = T),
y = Y,
family = "binomial",
standardize = FALSE,
alpha = 1,
lambda = rev(seq(0,1,length=100)),
nfolds = 3)
set.seed(123)
model3 <- cv.glmnet(x = X,
y = Y,
family = "binomial",
standardize = TRUE,
alpha = 1,
lambda = rev(seq(0,1,length=100)),
nfolds = 3)
##Using caret
library("caret")
lambda.grid <- rev(seq(0,1,length=100)) #set of lambda values for cross-validation
alpha.grid <- 1 #alpha
trainControl <- trainControl(method ="cv",
number=3) #3-fold cross-validation
tuneGrid <- expand.grid(.alpha=alpha.grid, .lambda=lambda.grid) #these are tuning parameters to be passed into the train function below
set.seed(123)
model4 <- train(x = X,
y = Y,
method="glmnet",
family="binomial",
standardize = FALSE,
trControl = trainControl,
tuneGrid = tuneGrid)
c1 <- coef(model1, s=model1$lambda.min)
c2 <- coef(model2, s=model2$lambda.min)
c3 <- coef(model3, s=model3$lambda.min)
c4 <- coef(model4$finalModel, s=model4$finalModel$lambdaOpt)
c1 <- as.matrix(c1)
c2 <- as.matrix(c2)
c3 <- as.matrix(c3)
c4 <- as.matrix(c4)
model2 scales the independent variables (matrix X) beforehand, and model3 does so by setting standardize = TRUE. So at least these two models should return identical results, but they do not.
The lambda.min values obtained from the four models are:
model1 = 0
model2 = 0
model3 = 0
model4 = 0.6565657
The coefficient estimates between the models differ drastically too. Why would this be occurring?

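Actually, there is a small difference between scale(x) with standardize = FALSE and x with standardize = TRUE: scale() computes the standard deviation with an N-1 denominator, whereas glmnet's internal standardization uses N. We therefore need a correction based on the factor (N-1)/N.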
See here.
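To make the two denominators concrete, here is a minimal base-R sketch (the variable names are my own, not from the answer above). It only illustrates that a variance computed with N differs from R's var() by the factor (N-1)/N, and the standard deviation by the square root of that factor; it does not call glmnet itself.

```r
# Same 100 rows as in the question (setosa and versicolor only)
x <- iris$Sepal.Width[iris$Species %in% c("setosa", "versicolor")]
n <- length(x)

var_sample <- var(x)                  # R's default: divides by n - 1
var_pop    <- mean((x - mean(x))^2)   # divides by n

var_ratio <- var_pop / var_sample     # ratio of the two variances
sd_ratio  <- sqrt(var_pop) / sd(x)    # ratio of the two standard deviations

# Both comparisons should print TRUE
print(all.equal(var_ratio, (n - 1) / n))
print(all.equal(sd_ratio, sqrt((n - 1) / n)))
```

With N = 100 the factor is close to 1, which is why the rescaled fits below agree only approximately rather than exactly.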
If we use the gaussian family:
library(glmnet)
X <- matrix(runif(100, 0, 1), ncol=2)
y <- 1 -2*X[,1] + X[,2]
enet <- glmnet(X, y, lambda=0.1,standardize = T,family="gaussian")
coefficients(enet)
coef <- coefficients(enet)
coef[2]*sd(X[,1])/sd(y) #standardized coef
#[1] -0.6895065
enet1 <- glmnet(scale(X)/99*100, y/(99/100*sd(y)),lambda=0.1/(99/100*sd(y)),standardize = F,family="gaussian")
coefficients(enet1)[2]
#[1] -0.6894995
If we use the binomial family:
data(iris)
iris <- iris
dat <- iris[iris$Species %in% c("setosa","versicolor"),]
X <- as.matrix(dat[,1:4])
Y <- as.factor(as.character(dat$Species))
set.seed(123)
model1 <- cv.glmnet(x = X,
y = Y,
family = "binomial",
standardize = T,
alpha = 1,
lambda = rev(seq(0,1,length=100)),
nfolds = 3)
coefficients(model1,s=0.03)[3]*sd(X[,2])
#[1] -0.3374946
set.seed(123)
model3 <- cv.glmnet(x = scale(X)/99*100,
y = Y,
family = "binomial",
standardize = F,
alpha = 1,
lambda = rev(seq(0,1,length=100)),
nfolds = 3)
coefficients(model3,s=0.03)[3]
#[1] -0.3355027
These results are nearly the same. Hope it is not too late for this answer.

Related

Cross validation PCA with different correlation matrix in R

I want to evaluate, in terms of MSE, AIC, and adjusted R-squared, various principal component models based on different correlation coefficients (e.g., Pearson, Kendall) in R. I have written the following function, but I can't find a way to "force" it to perform principal component regression based on the given correlation matrices cor1 and cor2. As a result, I end up with exactly the same results for both. Can someone help me?
library(caret)
set.seed(123)
df <- data.frame(Y = rnorm(100), X1 = rnorm(100), X2 = rnorm(100), X3 = rnorm(100), X4 = rnorm(100), X5 = rnorm(100))
X <- df[,-1]
Y <- df[,1]
# compute Pearson's and Kendall's correlation matrices
cor1 <- cor(X, method = "pearson")
cor2 <- cor(X, method = "kendall")
# define function to compute PCA with cross-validation and return MSE, AIC, and adjusted R-squared
pca_cv_mse_aic_r2 <- function(X, Y, cor_mat, ncomp, nfolds) {
# create empty vectors to store results
mse <- rep(0, ncomp)
aic <- rep(0, ncomp)
adj_r2 <- rep(0, ncomp)
# loop over the number of components
for (i in 1:ncomp) {
# perform PCA with cross-validation
pca <- caret::train(X, Y, method = "pcr", preProc = c("center", "scale"),
tuneLength = nfolds, trControl = trainControl(method = "cv", number = nfolds),
tuneGrid = data.frame(ncomp = i))
# compute MSE, AIC, and adjusted R-squared
pred <- predict(pca, newdata = X)
mse[i] <- mean((pred - Y)^2)
aic[i] <- AIC(lm(Y ~ pred + 1))
adj_r2[i] <- summary(lm(Y ~ pred))$adj.r.squared
}
# return a list of results
return(list(mse = mse, aic = aic, adj_r2 = adj_r2))
}
# compute the MSE, AIC, and adjusted R-squared of PCA models with different correlation matrices and numbers of components
results1 <- pca_cv_mse_aic_r2(X, Y, cor1, 5, 10)
results2 <- pca_cv_mse_aic_r2(X, Y, cor2, 5, 10)

Cross validation in SuperLearner: How to get results (R², RMSE, etc.) for each individual fold?

I am using SuperLearner for repeated V-fold cross-validation. Is there a way to obtain individual results for each of the, say, V = 10 folds?
# Packages
library(tidyverse)
library(SuperLearner)
library(caret)
# Some data
X <- matrix(rnorm(1000 * 50), nrow = 1000, ncol = 50)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(1000)
df <- cbind(Y, X) %>% as.data.frame()
# Cross validation in SuperLearner
model = CV.SuperLearner(Y = df$Y,
X = df %>% select(-Y),
family = gaussian(),
SL.library = c("SL.lm"),
V = 5)
# Look up folds
model$folds
# Look up individual results for each fold...?
If you compare this to caret, it's pretty straightforward there.
# Cross validation in caret
train.control <- trainControl(method = "repeatedcv",
number = 5,
repeats = 5)
model <- train(Y ~ .,
data = df,
method = "lm",
trControl = train.control,
tuneLength = 10)
# Look up results
model$resample
Long story short, where is the model$resample in SuperLearner?

r: coefficients from glmnet and caret are different for the same lambda

I've read a few Q&As about this, but I am still not sure I understand why the coefficients from glmnet and caret models based on the same sample and the same hyper-parameters are slightly different. I would greatly appreciate an explanation!
I am using caret to train a ridge regression:
library(ISLR)
Hitters = na.omit(Hitters)
x = model.matrix(Salary ~ ., Hitters)[, -1] #Dropping the intercept column.
y = Hitters$Salary
set.seed(0)
train = sample(1:nrow(x), 7*nrow(x)/10)
library(caret)
set.seed(0)
train_control = trainControl(method = 'cv', number = 10)
grid = 10 ^ seq(5, -2, length = 100)
tune.grid = expand.grid(lambda = grid, alpha = 0)
ridge.caret = train(x[train, ], y[train],
method = 'glmnet',
trControl = train_control,
tuneGrid = tune.grid)
ridge.caret$bestTune
# alpha is 0 and best lambda is 242.0128
Now, I use the lambda (and alpha) found above to train a ridge regression for the whole data set. At the end, I extract the coefficients:
ridge_full <- train(x, y,
method = 'glmnet',
trControl = trainControl(method = 'none'),
tuneGrid = expand.grid(
lambda = ridge.caret$bestTune$lambda, alpha = 0)
)
coef(ridge_full$finalModel, s = ridge.caret$bestTune$lambda)
Finally, using exactly the same alpha and lambda, I try to fit the same ridge regression using the glmnet package and extract the coefficients:
library(glmnet)
ridge_full2 = glmnet(x, y, alpha = 0, lambda = ridge.caret$bestTune$lambda)
coef(ridge_full2)
The reason is that caret did not use the exact lambda you specified. You can check this with:
ridge_full$finalModel$lambda
The closest values are 261.28915 and 238.07694.
When you do
coef(ridge_full$finalModel, s = ridge.caret$bestTune$lambda)
where s is 242.0128, the coefficients are interpolated from the coefficients actually calculated.
Whereas when you provide lambda to the glmnet call, the model returns the exact coefficients for that lambda, which differ only slightly from the interpolated ones caret returns.
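The interpolation described above can be reproduced by hand. The following is a minimal sketch, assuming glmnet's default behaviour (exact = FALSE), in which coef() interpolates linearly between the two neighbouring values on the fitted lambda path. The mtcars predictors and the choice of path positions 10 and 11 are my own illustration, not part of the original example.

```r
library(glmnet)

# Illustrative data (an assumption, not the Hitters data from the question)
x <- as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")])
y <- mtcars$mpg

fit <- glmnet(x, y, alpha = 0)   # ridge path over glmnet's own lambda sequence
lam <- fit$lambda                # stored in decreasing order

# Pick a value of s halfway between two lambdas that were actually fitted
s <- mean(lam[10:11])

b_left  <- as.matrix(coef(fit))[, 10]   # coefficients at the larger lambda
b_right <- as.matrix(coef(fit))[, 11]   # coefficients at the smaller lambda

# Linear interpolation weight for the left neighbour
w <- (s - lam[11]) / (lam[10] - lam[11])
manual <- w * b_left + (1 - w) * b_right

# Coefficients as returned by coef() for a lambda that is not on the path
interp <- as.matrix(coef(fit, s = s))[, 1]

print(cbind(manual, interp))
```

Refitting with glmnet(x, y, alpha = 0, lambda = s) solves the problem at s directly instead, so its coefficients can differ slightly from both columns.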
Why this happens:
When you specify one alpha and one lambda for a fit on all of the data, caret will actually fit:
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
numLev <- if(is.character(y) | is.factor(y)) length(levels(y)) else NA
theDots <- list(...)
if(all(names(theDots) != "family")) {
if(!is.na(numLev)) {
fam <- ifelse(numLev > 2, "multinomial", "binomial")
} else fam <- "gaussian"
theDots$family <- fam
}
## pass in any model weights
if(!is.null(wts)) theDots$weights <- wts
if(!(class(x)[1] %in% c("matrix", "sparseMatrix")))
x <- Matrix::as.matrix(x)
modelArgs <- c(list(x = x,
y = y,
alpha = param$alpha),
theDots)
out <- do.call(glmnet::glmnet, modelArgs)
if(!is.na(param$lambda[1])) out$lambdaOpt <- param$lambda[1]
out
}
This was taken from here.
In your example this translates to
fit <- glmnet::glmnet(x, y,
alpha = 0)
lambda <- unique(fit$lambda)
These lambda values correspond to ridge_full$finalModel$lambda:
all.equal(lambda, ridge_full$finalModel$lambda)
#output
TRUE

Confidence intervals for predictions from logistic regression

In R predict.lm computes predictions based on the results from linear regression and also offers to compute confidence intervals for these predictions. According to the manual, these intervals are based on the error variance of fitting, but not on the error intervals of the coefficient.
On the other hand, predict.glm, which computes predictions based on logistic and Poisson regression (amongst a few others), doesn't have an option for confidence intervals. And I even have a hard time imagining how such confidence intervals could be computed to provide a meaningful insight for Poisson and logistic regression.
Are there cases in which it is meaningful to provide confidence intervals for such predictions? How can they be interpreted? And what are the assumptions in these cases?
The usual way is to compute a confidence interval on the scale of the linear predictor, where things will be more normal (Gaussian) and then apply the inverse of the link function to map the confidence interval from the linear predictor scale to the response scale.
To do this you need two things:
call predict() with type = "link", and
call predict() with se.fit = TRUE.
The first produces predictions on the scale of the linear predictor; the second returns the standard errors of the predictions. In pseudo code:
## foo <- mtcars[,c("mpg","vs")]; names(foo) <- c("x","y") ## Working example data
mod <- glm(y ~ x, data = foo, family = binomial)
preddata <- with(foo, data.frame(x = seq(min(x), max(x), length = 100)))
preds <- predict(mod, newdata = preddata, type = "link", se.fit = TRUE)
preds is then a list with components fit and se.fit.
The confidence interval on the linear predictor is then
critval <- 1.96 ## approx 95% CI
upr <- preds$fit + (critval * preds$se.fit)
lwr <- preds$fit - (critval * preds$se.fit)
fit <- preds$fit
critval is chosen from a t or z (normal) distribution as required (I forget exactly now which to use for which type of GLM and what the properties are) with the coverage required. The 1.96 is the value of the Gaussian distribution giving 95% coverage:
> qnorm(0.975) ## 0.975 as this is upper tail, 2.5% also in lower tail
[1] 1.959964
Now for fit, upr and lwr we need to apply the inverse of the link function to them.
fit2 <- mod$family$linkinv(fit)
upr2 <- mod$family$linkinv(upr)
lwr2 <- mod$family$linkinv(lwr)
Now you can plot all three and the data.
preddata$lwr <- lwr2
preddata$upr <- upr2
library(ggplot2)
ggplot(data=foo, mapping=aes(x=x,y=y)) + geom_point() +
stat_smooth(method="glm", method.args=list(family=binomial)) +
geom_line(data=preddata, mapping=aes(x=x, y=upr), col="red") +
geom_line(data=preddata, mapping=aes(x=x, y=lwr), col="red")
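As a quick check on why the interval is built on the link scale first, here is a minimal base-R sketch. It reuses the mtcars-based working example from the comment above; the names link_ci and resp_ci are my own. It compares the back-transformed interval with a naive interval built directly on the response scale. The former is guaranteed to stay inside [0, 1]; the latter may not.

```r
# Working example data, as in the commented line of the answer
foo <- data.frame(x = mtcars$mpg, y = mtcars$vs)
mod <- glm(y ~ x, data = foo, family = binomial)
nd  <- data.frame(x = seq(min(foo$x), max(foo$x), length.out = 100))

z <- qnorm(0.975)  # approx 95% coverage

# Interval on the link scale, then mapped through the inverse link
p_link  <- predict(mod, newdata = nd, type = "link", se.fit = TRUE)
link_ci <- cbind(
  lwr = mod$family$linkinv(p_link$fit - z * p_link$se.fit),
  upr = mod$family$linkinv(p_link$fit + z * p_link$se.fit)
)

# Naive interval built directly on the response (probability) scale
p_resp  <- predict(mod, newdata = nd, type = "response", se.fit = TRUE)
resp_ci <- cbind(
  lwr = p_resp$fit - z * p_resp$se.fit,
  upr = p_resp$fit + z * p_resp$se.fit
)

print(range(link_ci))  # always within [0, 1]
print(range(resp_ci))  # not constrained to [0, 1]
```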
I stumbled upon Liu WenSui's method, which uses a bootstrap or simulation approach to solve that problem for Poisson estimates.
Example from the author:
pkgs <- c('doParallel', 'foreach')
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)
data(AutoCollision, package = "insuranceData")
df <- rbind(AutoCollision, AutoCollision)
mdl <- glm(Claim_Count ~ Age + Vehicle_Use, data = df, family = poisson(link = "log"))
new_fake <- df[1:5, 1:2]
boot_pi <- function(model, pdata, n, p) {
odata <- model$data
lp <- (1 - p) / 2
up <- 1 - lp
set.seed(2016)
seeds <- round(runif(n, 1, 1000), 0)
boot_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
set.seed(seeds[i])
bdata <- odata[sample(seq(nrow(odata)), size = nrow(odata), replace = TRUE), ]
bpred <- predict(update(model, data = bdata), type = "response", newdata = pdata)
rpois(length(bpred), lambda = bpred)
}
boot_ci <- t(apply(boot_y, 2, quantile, c(lp, up)))
return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = boot_ci[, 1], upper = boot_ci[, 2]))
}
boot_pi(mdl, new_fake, 1000, 0.95)
sim_pi <- function(model, pdata, n, p) {
odata <- model$data
yhat <- predict(model, type = "response")
lp <- (1 - p) / 2
up <- 1 - lp
set.seed(2016)
seeds <- round(runif(n, 1, 1000), 0)
sim_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
set.seed(seeds[i])
sim_y <- rpois(length(yhat), lambda = yhat)
sdata <- data.frame(y = sim_y, odata[names(model$x)])
refit <- glm(y ~ ., data = sdata, family = poisson)
bpred <- predict(refit, type = "response", newdata = pdata)
rpois(length(bpred),lambda = bpred)
}
sim_ci <- t(apply(sim_y, 2, quantile, c(lp, up)))
return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = sim_ci[, 1], upper = sim_ci[, 2]))
}
sim_pi(mdl, new_fake, 1000, 0.95)

How to graph my multiple linear regression model (caret)?

I have created a multiple linear regression model and would now like to plot it, but I can't seem to figure it out. Any help would be greatly appreciated! I used Boruta to select the feature attributes and then used train() to get the model. When I try to plot model_lm I get the error:
There are no tuning parameters with more than 1 value.
Here is the code I have attempted so far:
rt_train <- rttotal2
rt_train$year <- NULL
#rt_train$box_office <- NULL
#impute NA and address multicollinearity
preproc <- preProcess(rt_train, method = c("knnImpute","center",
"scale"))
rt_proc <- predict(preproc, rt_train)
rt_proc$box_office <- rt_train$box_office
sum(is.na(rt_proc))
titles <- rt_proc$titles
rt_proc$titles <- NULL
#rt_train$interval <- as.factor(rt_train$interval)
dmy <- dummyVars(" ~ .", data = rt_proc,fullRank = T)
rt_transform <- data.frame(predict(dmy, newdata = rt_proc))
index <- createDataPartition(rt_transform$interval, p =.75, list = FALSE)
train_m <- rt_transform[index, ]
rt_test <- rt_transform[-index, ]
str(rt_train)
y_train <- train_m$box_office
y_test <-rt_test$box_office
train_m$box_office <- NULL
rt_test$box_office <- NULL
#selected feature attributes
boruta.train <- Boruta(interval~., train_m, doTrace =1)
#graph to see most important var to interval
lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)
boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
names(lz) <- colnames(boruta.train$ImpHistory)
plot(boruta.train, xlab = "", xaxt = "n")
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)
#get most important attributes
final.boruta <- TentativeRoughFix(boruta.train)
print(final.boruta)
getSelectedAttributes(final.boruta, withTentative = F)
boruta.rt_df <- attStats(final.boruta)
boruta.rt_df
boruta.rt_df <- setDT(boruta.rt_df, keep.rownames = TRUE)[]
predictors <- boruta.rt_df %>%
filter(., decision =="Confirmed") %>%
select(., rn)
predictors <- unlist(predictors)
control <- trainControl(method="repeatedcv",
number=10,
repeats=6)
#look at residuals
#p-value is very small so reject H0 that predictors have no effect so
#we can use rotten tomatoes to predict box_office ranges
train_m$interval <- NULL
model_lm <- train(train_m[,predictors],
y_train, method='lm',
trControl = control, tuneLength = 10)
model_lm #.568
#
plot(model_lm)
plot(model_lm)
z <- varImp(object=model_lm)
z <- setDT(z, keep.rownames = TRUE)
z$model <- NULL
z$calledFrom <- NULL
row.names(z)
plot(varImp(object=model_lm),main="Linear Model Variable Importance")
predictions<-predict.train(object=model_lm,rt_test[,predictors],type="raw")
table(predictions)
#get coeff
interc <- coef(model_lm$finalModel)
slope <- coef(model_lm$finalModel)
ggplot(data = rt_train, aes(y = box_office)) +
geom_point() +
geom_abline(slope = slope, intercept = interc, color = 'red')
This is what some of my input looks like. Thank you!!
Here is an example using the inbuilt data set cars:
data(cars, package = "datasets")
library(caret)
Build the model:
control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 6)
model_lm <- train(dist ~ speed, data = cars, method='lm',
trControl = control, tuneLength = 10)
I will assume you would like to plot the final model.
You can use the caret predict.train function to get the predictions from the model and plot them:
pred <- predict(model_lm, cars)
pred <- data.frame(pred = pred, speed = cars$speed)
Additionally, you can provide the cars data set to geom_point and plot the observations:
library(ggplot2)
ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x=speed, y = dist))
If you would like to obtain the confidence or prediction interval, you can use the predict.lm function on model_lm$finalModel.
Here is an example for the prediction interval:
pred <- predict(model_lm$finalModel, cars, se.fit = TRUE, interval = "prediction")
pred <- data.frame(pred = pred$fit[,1], speed = cars$speed, lwr = pred$fit[,2], upr = pred$fit[,3])
pred_int <- ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x = speed, y = dist)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = speed), alpha = 0.2)
Or the confidence interval:
pred <- predict(model_lm$finalModel, cars, se.fit = TRUE, interval = "confidence")
pred <- data.frame(pred = pred$fit[,1], speed = cars$speed, lwr = pred$fit[,2], upr = pred$fit[,3])
pred_conf <- ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x = speed, y = dist)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = speed), alpha = 0.2)
Plotting them side by side:
library(cowplot)
plot_grid(pred_int, pred_conf)
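For readers who want to see the difference between the two intervals numerically rather than visually, here is a minimal base-R sketch. It fits the same dist ~ speed model directly with lm(), which is an assumption on my part to keep the example free of caret; the names conf_band and pred_band are also my own.

```r
# Same model as above, fitted directly with lm() to avoid the caret dependency
fit <- lm(dist ~ speed, data = cars)

conf_band <- predict(fit, newdata = cars, interval = "confidence")
pred_band <- predict(fit, newdata = cars, interval = "prediction")

# Width of each interval at every observed speed
conf_width <- conf_band[, "upr"] - conf_band[, "lwr"]
pred_width <- pred_band[, "upr"] - pred_band[, "lwr"]

# The prediction interval also accounts for the residual variance,
# so it is wider than the confidence interval at every point
print(all(pred_width > conf_width))
print(head(cbind(speed = cars$speed, conf_width, pred_width)))
```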
To plot the linear dependence on two variables you can use a 3D plot; with more predictors than that, the fitted surface can no longer be drawn directly.
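A 3D plot for two predictors could look like the following minimal base-R sketch. The model mpg ~ wt + hp on mtcars and the 30 x 30 grid are my own illustrative assumptions; persp() from base graphics is used here, though other 3D plotting packages would work as well.

```r
# Illustrative two-predictor linear model (not from the question's data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Regular grid over the observed range of both predictors
wt_seq <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 30)
hp_seq <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 30)
grid   <- expand.grid(wt = wt_seq, hp = hp_seq)

# expand.grid varies wt fastest, so filling the matrix column-wise
# gives z[i, j] = prediction at (wt_seq[i], hp_seq[j])
z <- matrix(predict(fit, newdata = grid), nrow = length(wt_seq))

# Draw the fitted regression plane
persp(wt_seq, hp_seq, z,
      theta = 40, phi = 20, ticktype = "detailed",
      xlab = "wt", ylab = "hp", zlab = "predicted mpg")
```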
