Looping cv.glmnet and getting the 'best' coefficients - r

As noted in the help of cv.glmnet, "the results of cv.glmnet are random, since the folds are selected at random. Users can reduce this randomness by running cv.glmnet many times, and averaging the error curves.".
If I make a loop doing n-times cv.glmnet, how can I extract the 'best' coefficients? I usually take the coefficients using this command:
coe<- coef(cvfit, s = "lambda.min")
If I use the mean of all the "lambda.min" then I don't know how to choose the right cvfit out of the many I generated. Do I have to use the mean of cvfit$cvm or MSE or other things?
Thanks

When you do coef(cvfit, s = "lambda.min"), you are taking the coefficients at the lambda that minimizes the mean cross-validated error (lambda.1se, by contrast, is the largest lambda whose error is within one standard error of that minimum); see this discussion. So you can average the MSEs across different CV runs:
library(glmnet)
library(mlbench)
data(BostonHousing)
X = as.matrix(BostonHousing[,-c(4,14)])
Y = BostonHousing[,14]
nfolds = 5
nreps = 10
res = lapply(1:nreps,function(i){
fit = cv.glmnet(x=X,y=Y,nfolds=nfolds)
data.frame(MSE_mean=fit$cvm,lambda=fit$lambda,se=fit$cvsd)
})
res = do.call(rbind,res)
We can summarize the results. The standard error is approximated here by simply averaging the per-run values; if you want to be precise, you might have to look into the formula for a pooled standard deviation (a pooled-variance sketch is shown after the lambda selection below):
library(dplyr)
summarized_res = res %>%
group_by(lambda) %>%
summarise(MSE=mean(MSE_mean),se=mean(se)) %>%
arrange(desc(lambda))
idx = which.min(summarized_res$MSE)
lambda.min = summarized_res$lambda[idx]
lambda.min
[1] 0.019303
index_1se = with(summarized_res,which(MSE < MSE[idx]+se[idx])[1])
lambda_1se = summarized_res$lambda[index_1se]
lambda_1se
[1] 0.3145908
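If you prefer a pooled standard error over the simple average of the per-run standard errors, here is a minimal sketch, assuming the repeats can be treated as exchangeable estimates of the same error curve (which ignores that they reuse the same data); it pools the variances rather than averaging the standard deviations:
summarized_res_pooled = res %>%
group_by(lambda) %>%
# pool the variances, then take the square root
summarise(MSE = mean(MSE_mean), se = sqrt(mean(se^2))) %>%
arrange(desc(lambda))
The lambda.min / lambda_1se selection above can then be repeated on summarized_res_pooled.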
We can plot this:
library(ggplot2)
ggplot(res, aes(x = log(lambda), y = MSE_mean)) + stat_summary(fun = mean, size = 2, geom = "point") +
# the x-axis is log(lambda), so the reference lines must be placed on the log scale as well
geom_vline(xintercept = log(c(lambda.min, lambda_1se)))

Related

Can glmmLasso be used with the Tweedie distribution?

I have a linear mixed effects model and I am trying to do variable selection. The model is testing the level of forest degradation in 1000 sampled points. Most points have no degradation, and so the dependent variable is highly skewed with many zeros. Therefore, I am using the Tweedie distribution to fit the model. My main question is: can the Tweedie distribution actually be used in the glmmLasso function? My second question is: do I even need to use this distribution in glmmLasso()? Any help is much appreciated!
When I run the function with family = tweedie(var.power=1.2,link.power=0) I get the following error:
Error in logLik.glmmLasso(y = y, yhelp = yhelp, mu = mu, family = family, :
object 'loglik' not found
If I change the link.power from 0 to 1 (which I think is not correct for my model, but just for the sake of figuring out the problem), I get a different error:
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
Here the tweedie() family function comes from the statmod package (rtweedie() is from the tweedie package). A simple example:
library(tweedie)
library(tidyverse)
library(glmmLasso)
library(statmod)
power <- 2
mu <- 1
phi <- seq(2, 8, by=0.1)
set.seed(10000)
y <- rtweedie( 100, mu=mu, power=power, phi=3)
x <- rnorm(100)
z <- c(rep(1, 50), rep(2,50))
df = as.data.frame(cbind(y,x,z))
df$z = as.factor(df$z)
f = y ~ x
varSelect = glmmLasso(fix = f, rnd = list(z=~1), data = df,
lambda = 5, family = tweedie(var.power=1.2,link.power=0))
I created a hacked version of glmmLasso that incorporates the Tweedie distribution as an option and put it on Github. I had to change two aspects of the code:
add a clause to compute the log-likelihood if family$family == "Tweedie"
in a number of places where the code was essentially if (family$family in list_of_families) ..., add "Tweedie" as an option.
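For intuition only, the added clause is conceptually along the lines of the following sketch; this is a hypothetical illustration using tweedie::dtweedie, not the actual glmmLasso internals, and it simply evaluates a Tweedie log-likelihood for the simulated y above at a constant mean:
library(tweedie)
# hypothetical sketch: Tweedie log-likelihood of y at mean 1, dispersion 3, variance power 1.2
loglik <- sum(log(dtweedie(y = y, mu = 1, phi = 3, power = 1.2)))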
remotes::install_github("bbolker/glmmLasso-bmb")
packageVersion("glmmLasso")
## [1] ‘1.6.2.9000’
Your example runs for me now, but I haven't checked at all to see if the results are sensible.

Obtaining Standardized coefficients from "rstanarm" package in R?

I was wondering if it might be possible (and perhaps recommended) to obtain standardized coefficients from stan_glm() in the rstanarm package? (did not find anything specific in the documentation)
Can I just standardize all variables as in normal regression? (see below)
Example:
library("rstanarm")
fit <- stan_glm(wt ~ vs*gear, data = mtcars)
Standardization:
design <- wt ~ vs*gear
vars <- all.vars(design)
stand.vars <- as.data.frame(scale(mtcars[, vars]))
fit <- stan_glm(design, data = stand.vars)
I would not say that it is affirmatively recommended, but I would recommend that you not subtract the sample mean and divide by the sample standard deviation of the outcome because the estimation uncertainty in those two statistics will not be propagated to the posterior distribution.
Standardizing the predictors is more debatable. You can do it, but it makes doing posterior prediction with new data harder because you have to remember to subtract the old means from the new data and divide by the old standard deviations.
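For instance, a minimal sketch of that bookkeeping (new_obs is a hypothetical stand-in for genuinely new data; the means and SDs come from the original training data):
old_means <- colMeans(mtcars[, c("vs", "gear")])
old_sds <- apply(mtcars[, c("vs", "gear")], 2, sd)
new_obs <- mtcars[1:3, c("vs", "gear")] # stand-in for new data
new_obs_std <- scale(new_obs, center = old_means, scale = old_sds)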
The most computationally efficient approach is to leave the variables as they are but specify the non-default argument QR = TRUE, especially if you are not going to modify the default (normal) priors on the coefficients anyway.
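For the example above that is simply:
fit <- stan_glm(wt ~ vs*gear, data = mtcars, QR = TRUE)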
You can then standardize the posterior coefficients after-the-fact if standardized coefficients are of interest. To do so, you can do
X <- model.matrix(fit)
sd_X <- apply(X, MARGIN = 2, FUN = sd)[-1]
sd_Y <- apply(posterior_predict(fit), MARGIN = 1, FUN = sd)
beta <- as.matrix(fit)[ , 2:ncol(X), drop = FALSE]
b <- sweep(sweep(beta, MARGIN = 2, STATS = sd_X, FUN = `*`),
MARGIN = 1, STATS = sd_Y, FUN = `/`)
summary(b)
However, standardizing regression coefficients just gives the illusion of comparability across variables and says nothing about how germane a one standard deviation difference is, particularly for dummy variables. If your question is really whether manipulating this predictor or that predictor is going to make a bigger difference on the outcome variable, then simply simulate those manipulations like
PPD_0 <- posterior_predict(fit)
nd <- model.frame(fit)
nd[ , 2] <- nd[ , 2] + 1 # for example
PPD_1 <- posterior_predict(fit, newdata = nd)
summary(c(PPD_1 - PPD_0))
and repeat that process for other manipulations of interest.
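For instance, to look at a one-unit shift in gear instead (same fitted model, a hypothetical manipulation for illustration):
nd2 <- model.frame(fit)
nd2$gear <- nd2$gear + 1
PPD_2 <- posterior_predict(fit, newdata = nd2)
summary(c(PPD_2 - PPD_0))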

Issue with Car ANOVA function in R?

I am running car version 2.1.4 and trying to use the Anova function to get Wald-based p-values for a power analysis using logistic regression with a success/failure setup. If I run the following simple factorial, the function throws an error due to 0 residual degrees of freedom, even though the sample size is clearly extremely large. What am I doing or thinking about this wrong?
Is the problem with the glm() call since that likewise says zero residual df?
X <- matrix(c(100,66566,73,66593,1201,398799,165,66501),
nrow = 4,ncol = 2,byrow = TRUE)
x_df <- data.frame(premium = c(300,300,500,500),
restrict = c(500,2500,500,2500))
x_df$int <- x_df$premium * x_df$restrict
mod <- glm(X~premium+restrict+int,
data=x_df,family=binomial)
summary(mod)
car::Anova(mod,type="III","Wald")
ADD #1:
It appears that the success/failure syntax doesn't work properly. When I manually expand out the data to ~600,000 rows the fit is the same but the res.df are correct:
X<-matrix(c(100,66566,73,66593,1201,398799,165,66501),nrow = 4,ncol = 2,byrow = TRUE)
x_df<-data.frame(premium=c(300,300,500,500),restrict=c(500,2500,500,2500))
x_df$int<-x_df$premium*x_df$restrict
mod<-glm(X~premium+restrict+premium*restrict, data=x_df,family=binomial)
summary(mod)
Anova(mod,type = "III",test.statistic = "Wald")
y=c(rep(1,100),rep(0,66566),rep(1,73),rep(0,66593),rep(1,1201),rep(0,398799),rep(1,165),rep(0,66501))
premium<-c(rep(300,66666*2),rep(500,1201+398799+165+66501))
restrict<-c(rep(500,66666),rep(2500,66666),rep(500,1201+398799),rep(2500,165+66501))
x<-data.frame(y=y,premium=premium,restrict=restrict)
mod2<-glm(y~premium+restrict+premium*restrict,data=x,family=binomial)
summary(mod2)
Anova(mod2,type = "III",test.statistic = "Wald")
The residual degrees of freedom are (number of observations) - (number of parameters). You have four observations and four parameters. I'm not sure what else to say ...
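You can check that arithmetic directly on the two fits above (the values follow from the n - p formula):
df.residual(mod) # 4 rows of the success/failure matrix - 4 parameters = 0
df.residual(mod2) # 599998 individual observations - 4 parameters = 599994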

R: obtain coefficients&CI from bootstrapping mixed-effect model results

The working data looks like:
set.seed(1234)
df <- data.frame(y = rnorm(1:30),
fac1 = as.factor(sample(c("A","B","C","D","E"),30, replace = T)),
fac2 = as.factor(sample(c("NY","NC","CA"),30,replace = T)),
x = rnorm(1:30))
The mixed model is fitted with lme4 as:
library(lme4)
mixed <- lmer(y ~ x + (1|fac1) + (1|fac2), data = df)
I used bootMer to run the parametric bootstrapping and I can successfully obtain the coefficients (intercept) and SEs for fixed&random effects:
mixed_boot_sum <- function(data){
s <- sigma(data)
c(beta = getME(data, "fixef"), theta = getME(data, "theta"), sigma = s)
}
mixed_boot <- bootMer(mixed, FUN = mixed_boot_sum, nsim = 100, type = "parametric", use.u = FALSE)
My first question is how to obtain the coefficients (slopes) for each individual level of the two random effects from the bootstrapping results mixed_boot?
I have no problem extracting the coefficients (slopes) from the mixed model by using the augment function from the broom package, see below:
library(broom)
mixed.coef <- augment(mixed, df)
However, it seems like broom can't deal with boot class objects, so I can't use the above functions directly on mixed_boot.
I also tried to modify mixed_boot_sum by adding mmList (I thought this would be what I am looking for), but R complains:
Error in bootMer(mixed, FUN = mixed_boot_sum, nsim = 100, type = "parametric", :
bootMer currently only handles functions that return numeric vectors
Furthermore, is it possible to obtain CIs for both fixed & random effects by specifying FUN as well?
Now I am very confused about the correct specification of FUN in order to achieve my needs. Any help regarding my question would be greatly appreciated!
My first question is how to obtain the coefficients (slopes) for each individual level of the two random effects from the bootstrapping results mixed_boot?
I'm not sure what you mean by "coefficients(slope) of each individual level". broom::augment(mixed, df) gives the predictions (residuals, etc.) for every observation. If you want the predicted coefficients at each level I would try
mixed_boot_coefs <- function(fit){
unlist(coef(fit))
}
which for the original model gives
mixed_boot_coefs(mixed)
## fac1.(Intercept)1 fac1.(Intercept)2 fac1.(Intercept)3 fac1.(Intercept)4
## -0.4973925 -0.1210432 -0.3260958 0.2645979
## fac1.(Intercept)5 fac1.x1 fac1.x2 fac1.x3
## -0.6288728 0.2187408 0.2187408 0.2187408
## fac1.x4 fac1.x5 fac2.(Intercept)1 fac2.(Intercept)2
## 0.2187408 0.2187408 -0.2617613 -0.2617613
## ...
If you want the resulting object to be more clearly named you can use:
flatten <- function(cc) setNames(unlist(cc),
outer(rownames(cc),colnames(cc),
function(x,y) paste0(y,x)))
mixed_boot_coefs <- function(fit){
unlist(lapply(coef(fit),flatten))
}
When run through bootMer/confint/boot::boot.ci these functions will give confidence intervals for each of these values (note that all of the slopes facW.xZ are identical across groups because the model assumes random variation in the intercept only). In other words, whatever information you know how to extract from a fitted model (conditional modes/BLUPs [ranef], predicted intercepts and slopes for each level of the grouping variable [coef], parameter estimates [fixef, getME], random-effects variances [VarCorr], predictions under specific conditions [predict] ...) can be used in bootMer's FUN argument, as long as you can flatten its structure into a simple numeric vector.
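For example, a minimal sketch of getting bootstrap percentile intervals for these per-level coefficients (index selects which component of the returned vector you want an interval for):
mixed_boot2 <- bootMer(mixed, FUN = mixed_boot_coefs, nsim = 100, type = "parametric", use.u = FALSE)
boot::boot.ci(mixed_boot2, index = 1, type = "perc") # CI for the first component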

How to simulate an AR(1) process with arima.sim and an estimated model?

I want to do the following two steps:
Based on a given time series, I want to calibrate an AR(1) process, i.e. I want to estimate the parameters.
Based on the estimated parameters, I want to simulate an AR(1) processes.
Here was my approach:
set.seed(123)
#Just generate random AR(1) time series; based on this, I want to estimate the parameters
ts_AR <- arima.sim(n=10000, list(ar=c(0.5)))
#1. Estimate parameters with arima()
model_AR <- arima(ts_AR, order=c(1,0,0))
#Looks actually good
model_AR
Series: ts_AR
ARIMA(1,0,0) with non-zero mean
Coefficients:
ar1 intercept
0.4891 -0.0044
s.e. 0.0087 0.0195
sigma^2 estimated as 0.9974: log likelihood=-14176.35
AIC=28358.69 AICc=28358.69 BIC=28380.32
#2. Simulate based on model
arima.sim(model=model_AR, n = 100)
Error in arima.sim(model = model_AR, n = 100) :
'ar' part of model is not stationary
I'm not the biggest time-series expert, but I'm pretty sure that an AR(1) process with a persistence parameter below one should result in a stationary model. However, the error message tells me something different. So am I doing something stupid here? If so, why, and what should I do to simulate the AR(1) process based on my estimated parameters? Or can't you just pass the output of arima as the model input into arima.sim? Then, however, I don't understand how I get such an error message... I would expect something like "model input cannot be read. It should be something like ..."
It's not the clearest interface in the world, but the model argument is meant to be a list giving the ARMA order, not an actual arima model.
arima.sim(model=as.list(coef(model_AR)), n=100)
This will create a simulated series with AR coefficient .489 as estimated from your starting data. Note that the intercept is ignored.
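If you also want the simulated series to carry the estimated mean, one simple option (an assumption about what you want; arima.sim will not do it for you) is to add the fitted intercept back afterwards:
sim <- arima.sim(model = as.list(coef(model_AR)), n = 100) + coef(model_AR)["intercept"]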
I don't think you are using the right approach since there's uncertainty about your coefficient estimate.
The best way to achieve what you want properly is to incorporate that uncertainty into the generation process; there are probably parametric ways to do that, but I think the bootstrap can be handy here.
Let's generate the AR process first:
set.seed(123)
ts_AR <- arima.sim(n = 10000, list(ar = 0.5))
We'll define two helper functions that will be used in the bootstrap. The first one generates the statistics we need (here the coefficient of the AR process and the actual time series) and the second one implements our resampling scheme (it is based on residuals):
ar_fun <- function(ts) c(ar = coef(arima(ts, order = c(1, 0, 0),
include.mean = FALSE)), ts = ts)
ar_sim <- function(res, n.sim, ran.args) {
rg <- function(n, res) sample(res, n, replace = TRUE)
ts <- ran.args$ts
model <- ran.args$model
arima.sim(model = model, n = n.sim,
rand.gen = rg, res = c(res))
}
Now we can start our simulation
ar_fit <- arima(ts_AR, order = c(1, 0, 0), include.mean = FALSE)
ts_res <- residuals(ar_fit)
ts_res <- ts_res - mean(ts_res)
ar_model <- list(ar = coef(ar_fit))
require(boot)
set.seed(1)
ar_boot <- tsboot(ts_res, ar_fun,
R = 99, sim = "model",
n.sim = 100, orig.t = FALSE,
ran.gen = ar_sim,
ran.args = list(ts = ts_AR, model = ar_model))
If you want to get all the coefficients generated and the associated time series:
coefmat <- apply(ar_boot$t, 1, "[", 1)
seriesmat <- apply(ar_boot$t, 1, "[", -1)
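For example, a rough 95% percentile interval for the bootstrapped AR coefficient:
quantile(coefmat, c(0.025, 0.975))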
You can get more details in the help file of tsboot and in Bootstrap Methods and Their Application, chapter 8.
