Issue with weighted data in lmer() and model averaging using glmulti() in R

I am having a hard time fitting linear mixed models (LMMs) to weighted data (that is, the weights differ across groups). Furthermore, it seems that the weights are not kept when using glmulti(). Below is a reproducible example:
require(lme4)
require(glmulti)
data(cake)
cake$wght <- as.numeric(cake$replicate)
fm1 <- lmer(angle ~ recipe + temperature + (1 | replicate), cake, REML = FALSE)
print(VarCorr(fm1), comp = c("Variance", "Std.Dev."))
In this case, the model's residual variance equals 22.36.
fm2 <- lmer(angle ~ recipe + temperature + (1|replicate), cake, weights=wght,REML= FALSE)
print(VarCorr(fm2),comp=c("Variance","Std.Dev."))
And here the residual variance is now 155.
For linear models, the residual variance remains unchanged whatever the weights, but that does not seem to be the case here.
The second issue occurs when performing model averaging with weighted data. It seems that glmulti() does not account for the specified weights in the following example:
wlmer.glmulti <- function (formula, data, random = "", weights, ...) {
  lmer(paste(deparse(formula), random), data = data, weights)
}
# (watch out, this doesn't converge!!)
LMM <- glmulti(angle ~ recipe + temperature, data = cake,
               random = "+ (1|replicate)", fitfunc = wlmer.glmulti,
               weights = cake$wght, report = TRUE, level = 1,
               crit = "aic", method = "g")
summary(LMM@objects[[1]]) # is similar to fm1, i.e. the weights were ignored
Any suggestions are most welcome. Thanks!

A short answer to my own post above.
1/ Dealing with weights in linear mixed models:
To avoid a large inflation of the residual variance due to the weights, one can "simply" ensure that all the weights sum to 1.
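A minimal sketch of that rescaling, using the cake example from above (wght_scaled and fm3 are names introduced here for illustration):
# rescale the weights so that they sum to 1
cake$wght_scaled <- cake$wght / sum(cake$wght)
fm3 <- lmer(angle ~ recipe + temperature + (1 | replicate), cake,
            weights = wght_scaled, REML = FALSE)
print(VarCorr(fm3), comp = c("Variance", "Std.Dev."))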
2/ Integrating weights into the model averaging procedure (glmulti):
Directly insert the weights into the wrapper (thanks to Vincent Calcagno for the tip).
From the previous example:
wlmer.glmulti <- function (formula, data, random = "", weights, ...) {
  lmer(paste(deparse(formula), random), data = data, weights)  # broken: 'weights' is passed positionally, so lmer never receives it as a weights argument
}
Updated, working alternative:
wlmer.glmulti <- function (formula, data, random = "", weights, ...) {
  lmer(paste(deparse(formula), random), data = data, weights = data$wght)  # pass the weights by name, hard-coded from the data
}
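The glmulti() call is then the same as before, with the weights picked up inside the wrapper (a sketch; LMM2 is just an illustrative name):
LMM2 <- glmulti(angle ~ recipe + temperature, data = cake,
                random = "+ (1|replicate)", fitfunc = wlmer.glmulti,
                report = TRUE, level = 1, crit = "aic", method = "g")
summary(LMM2@objects[[1]]) # should now match the weighted fit (fm2)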
Hope this will be useful at some point ;)

Related

Quasi-Poisson mixed-effect model on overdispersed count data from multiple imputed datasets in R

I'm dealing with a problem that has three parts, each of which I can solve separately, but now I need to solve them together:
an extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
the necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm isn't able to include random effects (or I haven't figured out how) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the standard errors, z statistics, and p-values as recommended here and discussed here. So I basically turn a Poisson mixed-effect regression into a quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure for analyzing multiply imputed datasets: 1. imputation, 2. model fitting, 3. pooling the results (I'm using the mice library). I can do these steps for a Poisson regression, but not for a quasi-Poisson mixed-effect regression. Is it even possible to (A) pool across models based on a quasi-distribution, and (B) get residuals from a pooled object (class "mipo")? I'm not sure. I'm also not sure how to interpret the pooled results for mixed models (I miss the random effects in the pooled output, although I've found this page, which I'm currently trying to work through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example data are available here (repre_d_v1 and repre_all_data are stored there), and below is the crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load it
## ===========================================================================================
# quasi-Poisson mixed model from a single data set (this is OK)
# first run a Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1 | Country) + (1 | Participant_ID),
                    family = "poisson",
                    data = repre_d_v1)
# I know there are some warnings, but that's because I share only a modified subset of the data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
  phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
  qctab = within(as.data.frame(ctab), {
    `Std. Error` = `Std. Error` * sqrt(phi)
    `z value` = Estimate / `Std. Error`
    `Pr(>|z|)` = 2 * pnorm(abs(`z value`), lower.tail = FALSE)
  })
  return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
                     expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1 | Country) + (1 | Participant_ID),
                                  family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite a similar structure to coef(summary(someGLM))
# but I don't see where the random effects are?
# and more importantly, I wanted a quasi-Poisson model, not just a Poisson model...
# ...but here it is not possible to use the quasi_table function (defined earlier)...
# ...because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not about whether it makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying the mean dispersion to the pooled summary table, using a variant of the machinery you posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
                 function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
                 FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
## adjust
qctab <- within(as.data.frame(ss), {
  std.error <- std.error * sqrt(phi_mean)
  statistic <- estimate / std.error
  p.value <- 2 * pnorm(abs(statistic), lower.tail = FALSE)
})
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...

Estimating risk ratio instead of odds ratio in mixed effect logistic regression in `R`

glmer is used to estimate effects on the logit scale of y when the data are clustered. In the following model
fit1 = glmer(y ~ treat + x + ( 1 | cluster), family = binomial(link = "logit"))
the exp of the coefficient of treat is the odds ratio for a binary 0-1 treatment variable, x is a covariate, and cluster is a clustering indicator across which we model a random intercept. A standard approach in GLMs for estimating risk ratios is to use a log link instead, i.e. family = binomial(link = "log"). However, using this in glmer I get the error
Error in (function (fr, X, reTrms, family, nAGQ = 1L, verbose = 0L, maxit = 100L, :
(maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
after calling
fit1 = glmer(y ~ treat + x + ( 1 | cluster), family = binomial(link = "log"))
A web search revealed that other people have had similar issues with the Gamma family.
This seems to be a general problem, as the reproducible example below demonstrates. My question is thus: how can I estimate risk ratios using a mixed-effect model like glmer?
Reproducible Example
The following code simulates data that replicates the problem.
n = 1000                            # sample size
m = 50                              # number of clusters
J = sample(1:m, n, replace = TRUE)  # simulate cluster membership
x = rnorm(n)                        # simulate covariate
treat = rbinom(n, 1, 0.5)           # simulate random treatment
u = rnorm(m)                        # simulate random intercepts
lt = x + treat + u[J]               # compute linear term of logistic mixed effect model
p = 1/(1 + exp(-lt))                # use logit link to transform to probabilities
y = rbinom(n, 1, p)                 # draw binomial outcomes
d = data.frame(y, x, treat)
# first fit the logistic model with glmer
fit1 = glmer(y ~ treat + x + (1 | as.factor(J)),
             family = binomial(link = "logit"), data = d)
summary(fit1)
# now try the log link
fit2 = glmer(y ~ treat + x + (1 | as.factor(J)),
             family = binomial(link = "log"), data = d)
This error is returned because your model produces fitted values > 1:
PIRLS step-halvings failed to reduce deviance in pwrssUpdate
...
When using lme4 to fit GLMMs with link functions that do not automatically constrain the response to the allowable range of the distributional family (e.g. binomial models with a log link, where the estimated probability can be > 1, or inverse-Gamma models, where the estimated mean can be negative), it is not unusual to get this error. It occurs because lme4 doesn't do anything to constrain the predicted values, so NaN values pop up and aren't handled gracefully. If possible, switch to a link function that constrains the response (e.g. the logit link for binomial models or the log link for Gamma models).
Unfortunately, the suggested workaround is to use a different link function.
The following paper surveys a number of alternative model choices for calculating [adjusted] relative risks:
Model choices to obtain adjusted risk difference estimates from a binomial regression model with convergence problems: An assessment of methods of adjusted risk difference estimation (2016)
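One commonly used alternative (a sketch of my own, not from the answer above) is "modified Poisson" regression: fit a Poisson model with a log link to the binary outcome, so that the exponentiated coefficients estimate risk ratios rather than odds ratios. In the mixed-model setting, that means a Poisson GLMM, reusing d and J from the simulation above:
# sketch: Poisson GLMM with log link on the binary outcome ("modified Poisson" approach)
fit3 = glmer(y ~ treat + x + (1 | as.factor(J)),
             family = poisson(link = "log"), data = d)
exp(fixef(fit3))  # exponentiated coefficients approximate risk ratios
Note that with a binary outcome the Poisson standard errors tend to be conservative; in the single-level case they are usually corrected with robust (sandwich) estimators, and whether that correction carries over cleanly to GLMMs is something to check against the paper above.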

Permutation test error for likelihood ratio test of mixed model in R: permlmer, lmer, lme4, predictmeans

I would like to test the main effect of a categorical variable using a permutation test on a likelihood ratio test. I have a continuous outcome, a dichotomous grouping predictor, and a categorical time predictor (Day, 5 levels).
The data are temporarily available in .rda format via this Drive link.
library(lme4)
lmer1 <- lmer(outcome ~ Group*Day + (1 | ID), data = data, REML = FALSE, na.action = na.exclude)
lmer2 <- lmer(outcome ~ Group + (1 | ID), data = data, REML = FALSE, na.action = na.exclude)
library(predictmeans)
permlmer(lmer2, lmer1)
However, this code gives me the following error:
Error in density.default(c(lrtest1, lrtest), kernel = "epanechnikov") :
need at least 2 points to select a bandwidth automatically
The following code does work, but I believe it does not give me exactly the outcome of a permuted LR test:
library(nlme)
lme1 <- lme(outcome ~ Genotype*Day,
            random = ~1 | ID,
            data = data,
            na.action = na.exclude)
library(pgirmess)
PermTest(lme1)
Can anyone point out why I get the "epanechnikov" error when using the permlmer function?
Thank you!
The issue is with NA values: remove all rows with missing data from your dataset and rerun the models. I had the same problem, and that solved it.
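A minimal sketch of that fix, assuming the same data frame and models as in the question:
# drop incomplete rows so permlmer() sees complete data only
# (na.omit() drops rows with an NA in any column; subset to the model variables first if that is too aggressive)
data_cc <- na.omit(data)
lmer1 <- lmer(outcome ~ Group*Day + (1 | ID), data = data_cc, REML = FALSE)
lmer2 <- lmer(outcome ~ Group + (1 | ID), data = data_cc, REML = FALSE)
permlmer(lmer2, lmer1)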

Lagged Residual as Independent Variable in R

I am building a factor model to estimate future equity returns. I'd like to include an autoregressive residual term in this model: yesterday's error (the difference between yesterday's predicted return and actual return) should enter the regression as an independent variable. What is this type of autoregressive model called? I've searched various time-series econometrics texts and have not found this particular model described. My current solution in R is to rerun the regression at every discrete time step t and manually include yesterday's residual, but I'm curious whether there is a more efficient method or a package that does this.
Below is some sample code without the residual term included:
Data:
# fake data
set.seed(333)
df <- data.frame(seq(as.Date("2017/1/1"), as.Date("2017/2/19"), "days"),
                 matrix(runif(50*506), nrow = 50, ncol = 506))
names(df) <- c("Date", paste0("var", 1:503), "mktrf", "smb", "hml")
Then I store the variables I need for the regressions:
1. All the independent variables (the factors):
x = df[, 505:507]
2. All the dependent variables (the returns):
y <- df[, 2:504]
3. Fit all the models:
list_models_AR = lapply(y, function(y)
  with(x, lm(y ~ mktrf + smb + hml, na.action = na.exclude)))
It's an ARIMA(0, 0, 1) model with exogenous regressors, i.e. a regression with MA(1) errors.
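A sketch of fitting that specification directly to one of the response series with base R's arima() (variable names taken from the fake data above):
# regression with MA(1) errors: yesterday's shock enters through the MA term,
# so there is no need to refit the regression at every time step
fit <- arima(df$var1, order = c(0, 0, 1),
             xreg = as.matrix(df[, c("mktrf", "smb", "hml")]))
fit
The same model can be fit with forecast::Arima(), or with the ARMA error orders chosen automatically via forecast::auto.arima(..., xreg = ...).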

Estimating variance attributed to a fixed effect

Disregarding how "important" it is, I am interested in estimating how much of the variance is attributable to a single fixed effect (be it a main effect or an interaction term).
As a quick thought, I imagined that constructing a linear model for the predicted values of the mixed model (without the random effect) and assessing the ANOVA table would provide an estimate (yes, the residual variance will then be zero, but we know(?) it from the mixed model). However, from playing around, apparently not.
Where is the flaw in my reasoning? Or did I do something wrong along the way? Is there an alternative method?
Disclaimer: I know some people have suggested looking at the change in residual variance when removing/adding fixed effects, but as this does not take into account the correlation between fixed and random effects, I am not interested in that approach.
data(Orthodont, package = "nlme")
Orthodont = na.omit(Orthodont)
# fitting a linear mixed model
library(lme4)
mod = lmer(distance ~ age*Sex + (1 | Subject), data = Orthodont)
# predicting across all observed values,
pred.frame = expand.grid(age = seq(min(Orthodont$age, na.rm = TRUE),
                                   max(Orthodont$age, na.rm = TRUE)),
                         Sex = unique(Orthodont$Sex))
# but not including random effects
pred.frame$fit = predict(mod, newdata = pred.frame, re.form = NA)
anova(lm(fit ~ age*Sex, data = pred.frame))
library(data.table)
Orthodont = data.table(Orthodont)
# to test the validity of the approach:
# estimate a linear model using one random observation
# per individual, repeat, and look at the mean sums of squares
tmp = sapply(1:500, function(x) {
  print(x)
  as.matrix(anova(lm(distance ~ age*Sex,
                     data = Orthodont[, .SD[sample(1:.N, 1)], "Subject"])))[, 2]
})
# these are clearly not similar
prop.table(as.table(rowMeans(tmp)[-4]))
       age        Sex    age:Sex
0.60895615 0.31874622 0.07229763

> prop.table(as.table(anova(lm(fit ~ age*Sex, data = pred.frame))[1:3, 2]))
         A          B          C
0.52597575 0.44342996 0.03059429
(A, B, and C are the age, Sex, and age:Sex rows of the ANOVA table.)
