Standard error for lme4 random effect predictions in R

I have an experiment with plants having different growth habits (growth_type), genotypes nested within growth types (ge), and blocks also nested within growth types (block). The objective is to test the influence of growth type and genotype on plant performance. Here are sample data and a reproducible example.
data1 <- read.csv(text = "
growth_type,ge,block,performance
dwarf,A,1,128.32
dwarf,A,2,97.01
dwarf,A,3,91.05
dwarf,B,1,108.51
dwarf,B,2,121.11
dwarf,B,3,84.15
dwarf,C,1,132.55
dwarf,C,2,129.45
dwarf,C,3,122.33
tall,D,1,79.68
tall,D,2,122.5
tall,D,3,143.42
tall,E,1,149.29
tall,E,2,162.13
tall,E,3,135.42
tall,F,1,90.45
tall,F,2,127.4
tall,F,3,78.99")
These are the libraries I used:
library(dplyr)
library(lme4)
library(lsmeans)
The first step was fitting a model:
model.fit <- lmer(performance ~ growth_type + (1 | block:growth_type) + (1 | ge:growth_type),
                  data = data1)
From this model, I can extract the fixed effect of growth type using lsmeans:
fixed.effect.estimates <- lsmeans::lsmeans(model.fit, "growth_type")
and this is the output:
What I need to obtain is the same output for the random effect. I am able to get the prediction interval, but I cannot get the standard error. This is what I tried:
# RANDOM EFFECT ESTIMATES
data1$pred.performance <- predict(model.fit,
                                  newdata = data1,
                                  re.form = ~ (1 | ge:growth_type))
pred.ge <- data1 %>%
  distinct(ge, growth_type, pred.performance)
And this is what I've obtained. So far so good.
Then I used the bootMer function to build the prediction interval using bootstrapping.
mySumm <- function(.) {
  predict(., newdata = pred.ge, re.form = ~ (1 | ge:growth_type))
}
#### Collapse bootstrap into median, 95% PI
sumBoot <- function(merBoot) {
  return(
    data.frame(fit = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = .5,   na.rm = TRUE))),
               lwr = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = .025, na.rm = TRUE))),
               upr = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = .975, na.rm = TRUE)))
    )
  )
}
##lme4::bootMer() method 1
PI.boot1.time <- system.time(
  boot1 <- lme4::bootMer(model.fit, mySumm, nsim = 250, use.u = TRUE, type = "parametric")
)
PI.boot1 <- sumBoot(boot1)
cbind(pred.ge, PI.boot1)
This is what I obtained:
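Side note: the same bootstrap replicates can also yield a standard error directly, since each column of merBoot$t holds the replicates for one prediction. A sketch extending sumBoot above (the helper name sumBootSE is mine, not from lme4):

```r
# the column-wise SD of the bootstrap replicates is a bootstrap standard error
sumBootSE <- function(merBoot) {
  data.frame(fit = apply(merBoot$t, 2, median, na.rm = TRUE),
             se  = apply(merBoot$t, 2, sd,     na.rm = TRUE))
}
# cbind(pred.ge, sumBootSE(boot1))
```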
In summary, my questions are:
How can I get the standard errors, as I did for the fixed-effect components?
Why do the random effect estimates from lme4::predict differ from those from lme4::bootMer?
Sorry for the long explanation.

Related

Storing model estimates after running several models in a for loop in R

I want to get the model estimates that I am creating in my for loop and save them all in a data frame in R. The first part of the code simulates a dataset similar to mine.
library(readxl)
library(dplyr)
library(drc)
library(purrr)
library(readr)
library(ggplot2)
library(broom)
library(broom.mixed)
library(brms)
library(lme4)
########## Simulation #########################################################
#number of assays
nassay= 12
#number of plates
nplate= 3
#fixed mean from above model
mu=0.94587
### mean SD
musd=0.04943
#standard deviation of assay from initial model
sda=0.06260
#standard deviation of residual between plates
sd= 0.07793
set.seed(16)
(assay = rep(LETTERS[1:nassay], each = nplate))
( plate =1:(nassay*nplate) )
plate2 =rep(c(1, 2, 3), times = 12)
### simulate assay level effect
( assayeff = rnorm(nassay, 0, sda) )
#### each assay has 3 plates, so the assay effect must be repeated for each plate
( assayeff = rep(assayeff, each = nplate) )
#### every plate measurement has an effect on potency, so we draw for every observation based on the residual sd
plateeff= (rnorm(nassay*nplate, 0, sd))
###### simulate different means
(musims= rnorm(nassay*nplate, mu, musd))
( dat = data.frame(assay, assayeff, plate, plateeff,musims) )
sim_dat <- dat
#### now combine all estimates to get rel potency
( dat$relpot = with(dat, mu + assayeff + plateeff ) )
sim_dat <- dat
fit1 = lmer(relpot ~ 1 + (1|assay), data = dat)
fit1
That is the code to simulate the dataset; I then use brms to get posterior estimates and save them to a data frame:
fit<-brms::brm(relpot ~ 1 + (1 | Filename), data = dat,iter=100,warmup=10,seed=355545)
post_dat <- posterior_samples(fit,fixed=TRUE)
summary(fit)
plot(fit)
post_fit_use <- post_dat %>% dplyr::select(b_Intercept, sd_Filename__Intercept, sigma)
post_fit_use <- post_fit_use %>% mutate(assay_var=(sd_Filename__Intercept)^2) %>% mutate(platevar=(sigma)^2)
Now I want to use each of these posterior estimates to create a dataset and run a model
for (i in 1:nrow(post_fit_use)) {
  # fixed mean from above model
  mu <- post_fit_use$b_Intercept[i]
  # standard deviation of assay from initial model
  sda <- post_fit_use$sd_Filename__Intercept[i]
  # standard deviation of residual between plates
  sd <- post_fit_use$sigma[i]
  assay <- rep(LETTERS[1:nassay], each = nplate)
  plate <- 1:(nassay*nplate)
  plate2 <- rep(c(1, 2, 3), times = 12)
  ### simulate assay level effect
  assayeff <- rnorm(nassay, 0, sda)
  #### each assay has 3 plates, so the assay effect must be repeated for each plate
  assayeff <- rep(assayeff, each = nplate)
  #### every plate measurement has an effect on potency, so we draw for every observation based on the residual sd
  plateeff <- rnorm(nassay*nplate, 0, sd)
  ###### simulate different means
  dat <- data.frame(assay, assayeff, plate, plateeff)
  sim_dat <- dat
  #### now combine all estimates to get rel potency
  dat$relpot <- with(dat, mu + assayeff + plateeff)
  sim_dat <- dat
  fit <- lmer(relpot ~ 1 + (1|assay), data = dat)
  rand <- tidy(fit, effects = "ran_pars", scales = "vcov")
  fixed <- tidy(fit, effects = "fixed")
}
My issue is that I want to save each model's estimates into a data frame, but when I run my loop I only get the results of the last iteration; I am unsure how to save each one.
The code above shows what I tried, and only the last model gets saved, not all of them. Note that when you run rand <- tidy(fit, effects = "ran_pars", scales = "vcov") and fixed <- tidy(fit, effects = "fixed") you get a data frame with 1 row and 5 variables for fixed, and a data frame with 2 rows and 5 variables for rand. That is for one model.
I second Paul's suggestion. You can store the results from tidy() in a list, then transform the list into a data frame. But you will have to use double brackets to index the list (i.e. rands[[i]] <- tidy(fit)). Try something like:
library(broom)
rands <- list()
for (i in 2:ncol(mtcars)) {
  mod <- lm(mtcars[, 1] ~ mtcars[, i])
  rands[[i]] <- tidy(mod)
}
df <- do.call(rbind, rands)
df
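To keep track of which model produced which rows, an iteration column can be added before binding. Here is a base-R variant of the same pattern (the iter column, and the plain data.frame in place of tidy(), are my additions for illustration):

```r
rands <- list()
for (i in 2:ncol(mtcars)) {
  mod <- lm(mtcars[, 1] ~ mtcars[, i])
  # record the iteration alongside the estimates before storing
  rands[[i]] <- data.frame(iter     = i,
                           term     = names(coef(mod)),
                           estimate = unname(coef(mod)))
}
df <- do.call(rbind, rands)
# df now has one row per coefficient per model, tagged by the iteration it came from
```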

Does caret::train() in r have a standardized output across different fit methods/models?

I'm working with the train() function from the caret package to fit multiple regression and ML models and to test their fit. I'd like to write a function that iterates through all model types and enters the best fit into a data frame. The biggest issue is that caret doesn't provide all of the model fit statistics I'd like, so they need to be derived from the raw output. Based on my exploration, there doesn't seem to be a standardized way caret outputs each model's fit.
Another post (sorry, I don't have a link) created this function, which pulls from fit$results and fit$bestTune to get the pre-calculated RMSE, R^2, etc.:
get_best_result <- function(caret_fit) {
  best <- which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
  best_result <- caret_fit$results[best, ]
  rownames(best_result) <- NULL
  best_result
}
One example of another fit statistic I need to calculate from the raw output is BIC. The two functions below do that. The residuals (y_actual - y_predicted) are needed, along with the number of x variables (k) and the number of rows used in the prediction (n). k and n must be derived from the output, not the original dataset, because the models may drop x variables (feature selection) or rows (omitting NAs) depending on their algorithms.
calculate_MSE <- function(residuals) {
  # residuals can be replaced with y_actual - y_predicted
  mse <- mean(residuals^2)
  return(mse)
}
calculate_BIC <- function(n, mse, k) {
  BIC <- n * log(mse) + k * log(n)
  return(BIC)
}
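As a quick illustration of the two helpers, here is a hypothetical usage with a plain lm fit on mtcars (note this BIC differs from stats::BIC() by an additive constant, so use it to compare models rather than read the value in isolation):

```r
# hypothetical worked example: MSE and BIC from a plain lm fit
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
res    <- residuals(fit_lm)
mse    <- mean(res^2)                # calculate_MSE(res)
n      <- length(res)                # rows actually used by the fit
k      <- length(coef(fit_lm))       # number of estimated terms, intercept included
bic    <- n * log(mse) + k * log(n)  # calculate_BIC(n, mse, k)
```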
The real question is: is there a standardized output of caret::train() for the x variables, or for y_actual, y_predicted, or residuals?
I tried fit$finalModel$model and other methods but to no avail.
Here is a reproducible example along with the function I'm using. Please consider the functions above a part of this reproducible example.
library(caret)
library(rlist)
library(data.table)
# data
df <- data.frame(y1 = rnorm(50, 0, 1),
y2 = rnorm(50, .25, 1.5),
x1 = rnorm(50, .4, .9),
x2 = rnorm(50, 0, 1.1),
x3 = rnorm(50, 1, .75))
missing_index <- sample(1:50, 7, replace = F)
df[missing_index,] <- NA
# function to fit models and pull results
fitModels <- function(df, Ys, Xs, models){
  # empty list
  results <- list()
  # number of for loops
  loops_counter <- 0
  # for every y
  for (y in 1:length(Ys)) {
    # for every model
    for (m in 1:length(models)) {
      # track loops
      loops_counter <- loops_counter + 1
      # fit the model
      set.seed(1) # seed for reproducibility
      fit <- tryCatch(train(as.formula(paste(Ys[y], paste(Xs, collapse = ' + '),
                                             sep = ' ~ ')),
                            data = df,
                            method = models[m],
                            na.action = na.omit,
                            tuneLength = 10),
                      error = function(e) {return(NA)})
      # pull results
      results[[loops_counter]] <- c(Y = Ys[y],
                                    model = models[m],
                                    sample_size = nrow(fit$finalModel$model),
                                    RMSE = get_best_result(fit)[[2]],
                                    R2 = get_best_result(fit)[[3]],
                                    MAE = get_best_result(fit)[[4]],
                                    BIC = calculate_BIC(n = nrow(fit$finalModel$model),
                                                        mse = calculate_MSE(fit$finalModel$residuals),
                                                        k = length(fit$finalModel$xNames)))
    }
  }
  # list bind
  results_df <- list.rbind(results)
  return(results_df)
}
linear_models <- c('lm', 'glmnet', 'ridge', 'lars', 'enet')
fits <- fitModels(df, c('y1', 'y2'), c('x1', 'x2', 'x3'), linear_models)
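On the y_actual / residuals part of the question: predict() called on the train object itself is standardized across methods, so residuals can be derived without digging into finalModel. A hedged sketch, assuming a successfully fitted fit, that NA rows are dropped the same way na.omit did during training, and with y1 standing in for whichever response was used:

```r
# model-agnostic predictions and residuals from a caret::train object
complete <- df[complete.cases(df), ]           # mirror na.action = na.omit
y_hat    <- predict(fit, newdata = complete)   # same call for every method
resids   <- complete$y1 - y_hat                # y_actual - y_predicted
n        <- length(resids)                     # rows actually used
```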

Hierarchical Dirichlet regression (jags)... overfitting

Good morning. I need the community's help to understand some problems that occurred while writing this model.
I aim to model cause-of-death proportions using as predictors "log_GDP" (gross domestic product on a log scale) and "log_h" (hospital beds per 1,000 people on a log scale):
y: 3 columns that are observed proportions of deaths over the years.
x1: "log_GDP" (Gross domestic product in log scale)
x2: "log_h" (hospital beds per 1,000 people in log scale)
As you can see from the estimation results in the last plot, I got a high noise level. When I worked with just one covariate, i.e. log_GDP, I obtained smoother results.
Here are the simulated data:
library(reshape2)
library(tidyverse)
library(ggplot2)
library(runjags)
CIRC <- c(0.3685287, 0.3675516, 0.3567829, 0.3517274, 0.3448940, 0.3391031, 0.3320184, 0.3268640,
0.3227445, 0.3156360, 0.3138515,0.3084506, 0.3053657, 0.3061224, 0.3051044)
NEOP <- c(0.3602199, 0.3567355, 0.3599409, 0.3591258, 0.3544591, 0.3566269, 0.3510974, 0.3536156,
0.3532980, 0.3460948, 0.3476183, 0.3475634, 0.3426035, 0.3352433, 0.3266048)
OTHER <-c(0.2712514, 0.2757129, 0.2832762, 0.2891468, 0.3006468, 0.3042701, 0.3168842, 0.3195204,
0.3239575, 0.3382691, 0.3385302, 0.3439860, 0.3520308, 0.3586342, 0.3682908)
log_h <- c(1.280934, 1.249902, 1.244155, 1.220830, 1.202972, 1.181727, 1.163151, 1.156881, 1.144223,
1.141033, 1.124930, 1.115142, 1.088562, 1.075002, 1.061257)
log_GDP <- c(29.89597, 29.95853, 29.99016, 30.02312, 30.06973, 30.13358, 30.19878, 30.25675, 30.30184,
30.31974, 30.30164, 30.33854, 30.37460, 30.41585, 30.45150)
D <- data.frame(CIRC=CIRC, NEOP=NEOP, OTHER=OTHER,
log_h=log_h, log_GDP=log_GDP)
cause.y <- as.matrix((data.frame(D[,1],D[,2],D[,3])))
cause.y <- cause.y/rowSums(cause.y)
mat.x<- D$log_GDP
mat.x2 <- D$log_h
n <- 15
JAGS model:
dirlichet.model = "
model {
  # set up priors for each species
  for(j in 1:N.spp){
    m0[j] ~ dnorm(0, 1.0E-3)  # intercept prior
    m1[j] ~ dnorm(0, 1.0E-3)  # mat.x prior
    m2[j] ~ dnorm(0, 1.0E-3)  # mat.x2 prior
  }
  # implement dirichlet
  for(i in 1:N){
    y[i,1:N.spp] ~ ddirch(a0[i,1:N.spp])
    for(j in 1:N.spp){
      log(a0[i,j]) <- m0[j] + m1[j] * mat.x[i] + m2[j] * mat.x2[i]  # m0 = intercept; m1 = coeff log_GDP; m2 = coeff log_h
    }
  }
} # close model
"
jags.data <- list(y = cause.y,mat.x= mat.x,mat.x2= mat.x2, N = nrow(cause.y), N.spp = ncol(cause.y))
jags.out <- run.jags(dirlichet.model,
data=jags.data,
adapt = 5000,
burnin = 5000,
sample = 10000,
n.chains=3,
monitor=c('m0','m1','m2'))
out <- summary(jags.out)
head(out)
Gather the coefficients and estimate the proportions:
coeff <- out[c(1,2,3,4,5,6,7,8,9),4]
coef1 <- out[c(1,4,7),4] #coeff (interc and slope) caus 1
coef2 <- out[c(2,5,8),4] #coeff (interc and slope) caus 2
coef3 <- out[c(3,6,9),4] #coeff (interc and slope) caus 3
pred <- as.matrix(cbind(exp(coef1[1]+coef1[2]*mat.x+coef1[3]*mat.x2),
exp(coef2[1]+coef2[2]*mat.x+coef2[3]*mat.x2),
exp(coef3[1]+coef3[2]*mat.x+coef3[3]*mat.x2)))
pred <- pred / rowSums(pred)
Build a data frame of predicted and observed values:
Obs <- data.frame(Circ=cause.y[,1],
Neop=cause.y[,2],
Other=cause.y[,3],
log_GDP=mat.x,
log_h=mat.x2)
Obs$model <- "Obs"
Pred <- data.frame(Circ=pred[,1],
Neop=pred[,2],
Other=pred[,3],
log_GDP=mat.x,
log_h=mat.x2)
Pred$model <- "Pred"
tot60<-as.data.frame(rbind(Obs,Pred))
tot <- melt(tot60,id=c("log_GDP","log_h","model"))
tot$variable <- as.factor(tot$variable)
Plot
tot %>%filter(model=="Obs") %>% ggplot(aes(log_GDP,value))+geom_point()+
geom_line(data = tot %>%
filter(model=="Pred"))+facet_wrap(.~variable,scales = "free")
The reason for the non-smoothness is that you are calculating Pr(y=m|X) = f(x1, x2) - that is, the predicted probability is a function of x1 and x2. You are then plotting Pr(y=m|X) as a function of a single x variable - the log of GDP. That result will almost certainly not be smooth. The log_GDP and log_h variables are highly negatively correlated, which is why the result is not much more variable than it is.
In my run of the model, the average coefficient for log_GDP is actually positive for NEOP and Other, suggesting that the result you see in the plot is quite misleading. If you were to plot these in two dimensions, you would see that the result is, again, smooth.
mx1 <- seq(min(mat.x), max(mat.x), length=25)
mx2 <- seq(min(mat.x2), max(mat.x2), length=25)
eg <- expand.grid(mx1 = mx1, mx2 = mx2)
pred <- as.matrix(cbind(exp(coef1[1]+coef1[2]*eg$mx1 + coef1[3]*eg$mx2),
exp(coef2[1]+coef2[2]*eg$mx1 + coef2[3]*eg$mx2),
exp(coef3[1]+coef3[2]*eg$mx1 + coef3[3]*eg$mx2)))
pred <- pred / rowSums(pred)
Pred <- data.frame(Circ=pred[,1],
                   Neop=pred[,2],
                   Other=pred[,3],
                   log_GDP=eg$mx1,
                   log_h=eg$mx2)
lattice::wireframe(Neop ~ log_GDP + log_h, data=Pred, drape=TRUE)
A couple of other things to watch out for.
Usually in hierarchical Bayesian models, the parameters of your coefficients would themselves be distributions with hyperparameters. This enables shrinkage of the coefficients toward the global mean, which is a hallmark of hierarchical models.
I'm not sure whether this is what your data really look like, but the correlation between the two independent variables is going to make it difficult for the model to converge. You could try using a multivariate normal distribution for the coefficients - that might help.
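To illustrate the first point, the species-level coefficients could themselves be drawn from shared hyperdistributions. A hedged sketch of how the prior block of the JAGS model might be rewritten (the hyperparameter names mu.m, tau.m and the dgamma hyperpriors are my choices, not from the original model):

```
# hierarchical priors: each species coefficient shrinks toward a global mean
for(j in 1:N.spp){
  m0[j] ~ dnorm(mu.m[1], tau.m[1])
  m1[j] ~ dnorm(mu.m[2], tau.m[2])
  m2[j] ~ dnorm(mu.m[3], tau.m[3])
}
# hyperpriors on the global means and precisions
for(k in 1:3){
  mu.m[k]  ~ dnorm(0, 1.0E-3)
  tau.m[k] ~ dgamma(0.1, 0.1)
}
```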

R: can't get an lme{nlme} to fit when using self-constructed interaction variables

I'm trying to fit an lme with self-constructed interaction variables, which I need for post-hoc analyses.
library(nlme)
# construct fake dataset
obsr <- 100
dist <- rep(rnorm(36), times=obsr)
meth <- dist+rnorm(length(dist), mean=0, sd=0.5); rm(dist)
meth <- meth/dist(range(meth)); meth <- meth-min(meth)
main <- data.frame(meth = meth,
                   cpgl = as.factor(rep(1:36, times=obsr)),
                   pbid = as.factor(rep(1:obsr, each=36)),
                   agem = rep(rnorm(obsr, mean=30, sd=10), each=36),
                   trma = as.factor(rep(sample(c(TRUE, FALSE), size=obsr, replace=TRUE), each=36)),
                   depr = as.factor(rep(sample(c(TRUE, FALSE), size=obsr, replace=TRUE), each=36)))
# check if all factor combinations are present
# TRUE for my real dataset; Naturally TRUE for the fake dataset
with(main, all(table(depr, trma, cpgl) >= 1))
# construct interaction variables
main$depr_trma <- interaction(main$depr, main$trma, sep=":", drop=TRUE)
main$depr_cpgl <- interaction(main$depr, main$cpgl, sep=":", drop=TRUE)
main$trma_cpgl <- interaction(main$trma, main$cpgl, sep=":", drop=TRUE)
main$depr_trma_cpgl <- interaction(main$depr, main$trma, main$cpgl, sep=":", drop=TRUE)
# model WITHOUT preconstructed interaction variables
form1 <- list(fixd = meth ~ agem + depr + trma + depr*trma + cpgl +
depr*cpgl +trma*cpgl + depr*trma*cpgl,
rndm = ~ 1 | pbid,
corr = ~ cpgl | pbid)
modl1 <- nlme::lme(fixed=form1[["fixd"]],
random=form1[["rndm"]],
correlation=corCompSymm(form=form1[["corr"]]),
data=main)
# model WITH preconstructed interaction variables
form2 <- list(fixd = meth ~ agem + depr + trma + depr_trma + cpgl +
depr_cpgl + trma_cpgl + depr_trma_cpgl,
rndm = ~ 1 | pbid,
corr = ~ cpgl | pbid)
modl2 <- nlme::lme(fixed=form2[["fixd"]],
random=form2[["rndm"]],
correlation=corCompSymm(form=form2[["corr"]]),
data=main)
The first model fits without any problems whereas the second model gives me following error:
Error in MEEM(object, conLin, control$niterEM) :
Singularity in backsolve at level 0, block 1
Nothing I have found out about this error so far has helped me solve the problem. However, the solution is probably pretty easy.
Can someone help me? Thanks in advance!
EDIT 1:
When I run:
modl3 <- lm(form1[["fixd"]], data=main)
modl4 <- lm(form2[["fixd"]], data=main)
The summaries reveal that modl4 (with the self-constructed interaction variables), in contrast to modl3, has many more predictors. All those that are in modl4 but not in modl3 show NA as coefficients. The problem therefore definitely lies in the way I create the interaction variables...
EDIT 2:
In the meantime I created the interaction variables "by hand" (mainly with paste() and grepl()) - it seems to work now. However, I would still be interested in how I could have done it using the interaction() function.
I should have constructed only the largest of the interaction variables (combining all 3 simple variables).
If I do so, the model fits. The likelihoods are then very close to each other and the number of coefficients matches exactly.
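For reference, one reading of that fix still uses interaction(): construct only the three-way variable and leave the two-way terms to the formula. A sketch against the fake data from the question (modl5 is a hypothetical name, and I have only the edit's report that this specification fits):

```r
# only the full three-way interaction variable is pre-constructed;
# the two-way interactions stay as ordinary formula terms
main$depr_trma_cpgl <- interaction(main$depr, main$trma, main$cpgl,
                                   sep = ":", drop = TRUE)
modl5 <- nlme::lme(fixed = meth ~ agem + depr*trma + depr*cpgl + trma*cpgl +
                           depr_trma_cpgl,
                   random = ~ 1 | pbid,
                   correlation = corCompSymm(form = ~ cpgl | pbid),
                   data = main)
```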

Multiply coefficients with standard deviation

In R, the stargazer package makes it possible to apply functions to the coefficients, standard errors, etc.:
library(foreign)
library(MASS)
library(stargazer)
dat <- read.dta("http://www.ats.ucla.edu/stat/stata/dae/nb_data.dta")
dat <- within(dat, {
prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
id <- factor(id)
})
m1 <- glm.nb(daysabs ~ math + prog, data = dat)
transform_coef <- function(x) (exp(x) - 1)
stargazer(m1, apply.coef=transform_coef)
How can I apply a function where the factor I multiply by depends on the variable - for example, the standard deviation of that variable?
This may not be exactly what you hoped for, but you can transform the coefficients, and give stargazer a custom list of coefficients. For example, if you would like to report the coefficient times the standard deviation of each variable, the following extension of your example could work:
library(foreign)
library(stargazer)
library(MASS)
dat <- read.dta("http://www.ats.ucla.edu/stat/stata/dae/nb_data.dta")
dat <- within(dat, {
prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
id <- factor(id)
})
m1 <- glm.nb(daysabs ~ math + prog, data = dat)
# Store coefficients (and other coefficient stats)
s1 <- summary(m1)$coefficients
# Calculate standard deviations (using zero for the constant)
math.sd <- sd(dat$math)
acad.sd <- sd(as.numeric(dat$prog == "Academic"))
voc.sd <- sd(as.numeric(dat$prog == "Vocational"))
int.sd <- 0
# Append standard deviations to stored coefficients
StdDev <- c(int.sd, math.sd, acad.sd, voc.sd)
s1 <- cbind(s1, StdDev)
# Store custom list
new.coef <- s1[ , "Estimate"] * s1[ , "StdDev"]
# Output
stargazer(m1, coef = list(new.coef))
You may want to consider a couple of issues beyond your original question about outputting coefficients in stargazer: should you report the intercept when multiplying by the standard deviation, and will your standard errors and inference be the same under this transformation?
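The standard deviations can also be pulled programmatically from the design matrix, which avoids listing dummies by hand and extends to factors with any number of levels. A sketch, using a plain lm as a stand-in for m1 (the zero for the intercept mirrors int.sd above; any model with model.matrix() and coef() methods works the same way):

```r
# stand-in model (the thread's m1 is a glm.nb; the pattern is identical)
m1 <- lm(mpg ~ wt + factor(cyl), data = mtcars)
# column-wise SDs of the design matrix line up with the coefficient vector
sds <- apply(model.matrix(m1), 2, sd)
sds["(Intercept)"] <- 0          # don't scale the intercept
new.coef <- coef(m1) * sds
# stargazer(m1, coef = list(new.coef))
```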
