Estimating variance attributed to a fixed effect - r

Disregarding how "important" it is, I am interested in estimating how much of the variance is attributed to a single fixed effect (whether a main effect or an interaction term).
As a quick thought, I imagined that fitting a linear model to the predicted values of the mixed model (without the random effect) and assessing the ANOVA table would provide an estimate (yes, the residual variance will then be zero, but we know(?) this from the mixed model). However, from playing around, apparently it does not.
Where is the flaw in my reasoning? Or did I do something wrong along the way? Is there an alternative method?
Disclaimer: I know some people have suggested looking at the change in residual variance when removing/adding fixed effects, but as this does not take into account the correlation between fixed and random effects, I am not interested in that approach.
data(Orthodont,package="nlme")
Orthodont = na.omit(Orthodont)
#Fitting a linear mixed model
library(lme4)
mod = lmer(distance ~ age*Sex + (1|Subject) , data=Orthodont)
# Predicting across all observed values,
pred.frame = expand.grid(age = seq(min(Orthodont$age, na.rm = TRUE), max(Orthodont$age, na.rm = TRUE)),
                         Sex = unique(Orthodont$Sex))
# But not including random effects
pred.frame$fit = predict(mod, newdata = pred.frame, re.form=NA)
anova(lm(fit~age*Sex, data = pred.frame))
library(data.table)
Orthodont = data.table(Orthodont)
# to test the validity of the approach
# by estimating a linear model using a random observation
# per individual and look at the means
tmp = sapply(1:500, function(x){
print(x)
as.matrix(anova(lm(distance~age*Sex, data =Orthodont[,.SD[sample(1:.N,1)],"Subject"])))[,2]
}
)
# These are clearly not similar
> prop.table(as.table(rowMeans(tmp)[-4]))
       age        Sex    age:Sex
0.60895615 0.31874622 0.07229763
> prop.table(as.table(anova(lm(fit~age*Sex, data = pred.frame))[1:3,2]))
         A          B          C
0.52597575 0.44342996 0.03059429
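One possible alternative, offered only as a sketch under stated assumptions and not as an established method: split the variance of the fixed-effect linear predictor into per-term pieces, using the identity var(eta) = sum over terms of cov(eta_term, eta). The names eta_terms, term_labels and contrib are illustrative. The split depends on the contrast coding and on the observed age values, and individual pieces can be negative, so treat the numbers with care.

```r
library(lme4)
data(Orthodont, package = "nlme")

mod <- lmer(distance ~ age * Sex + (1 | Subject), data = Orthodont)

# Fixed-effects design matrix rebuilt with base R; the "assign" attribute
# maps each column to a model term (0 = intercept, 1 = age, 2 = Sex, 3 = age:Sex)
X    <- model.matrix(~ age * Sex, data = Orthodont)
beta <- fixef(mod)
stopifnot(identical(colnames(X), names(beta)))
asgn <- attr(X, "assign")
term_labels <- c("age", "Sex", "age:Sex")

# Contribution of each term to the linear predictor (intercept left out,
# since a constant carries no variance)
eta_terms <- sapply(1:3, function(k) {
  drop(X[, asgn == k, drop = FALSE] %*% beta[asgn == k])
})
colnames(eta_terms) <- term_labels
eta <- rowSums(eta_terms)

# cov(eta_term, eta) sums to var(eta) over the terms
contrib <- apply(eta_terms, 2, function(col) cov(col, eta))
round(contrib / sum(contrib), 3)
```

Unlike the prediction-grid approach, this uses the covariate values as they occur in the data, so unbalanced designs are weighted as observed.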

Related

Quasi-Poisson mixed-effect model on overdispersed count data from multiple imputed datasets in R

I'm dealing with problems of three parts that I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm isn't able to include random effects properly (or I haven't figured it out) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the std. errors, z statistics and p-values as recommended here and discussed here. So I basically turn Poisson mixed-effect regression into quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure for analyzing multiple imputed datasets: 1. imputation, 2. model fitting, 3. pooling results (I'm using the mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution, B) get residuals from a pooled object (class "mipo")? I'm not sure. I'm also not sure how to interpret the pooled results for mixed models (the random effects are missing from the pooled output, although I've found this page which I'm currently trying to go through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example of data is here (repre_d_v1 and repre_all_data are stored in there) and below is a crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson",
data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
qctab = within(as.data.frame(ctab),
{`Std. Error` = `Std. Error`*sqrt(phi)
`z value` = Estimate/`Std. Error`
`Pr(>|z|)` = 2*pnorm(abs(`z value`), lower.tail = FALSE)
})
return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite similar structure as coef(summary(someGLM))
# but I don't see where are the random effects?
# and more importantly, I wanted a quasi-Poisson model, not just Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not whether this makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying it to the summary table, using a variant of the machinery that you posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
## adjust
qctab <- within(as.data.frame(ss),
{ std.error <- std.error*sqrt(phi_mean)
statistic <- estimate/std.error
p.value <- 2*pnorm(abs(statistic), lower.tail = FALSE)
})
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...
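As a sanity check on the adjustment itself, here is a minimal base-R sketch. It uses simulated data and no random effects, so it is only an analogy to the glmer case. It shows that scaling Poisson standard errors by sqrt(phi) reproduces what glm() reports for family = quasipoisson. All data and variable names are made up for illustration.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
# Overdispersed counts: negative binomial with mean exp(0.5 + 0.3 * x)
y <- rnbinom(n, mu = exp(0.5 + 0.3 * x), size = 1)

fit_pois  <- glm(y ~ x, family = poisson)
fit_quasi <- glm(y ~ x, family = quasipoisson)

# Pearson-based dispersion estimate, as in quasi_table()
phi <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

se_pois     <- coef(summary(fit_pois))[, "Std. Error"]
se_adjusted <- se_pois * sqrt(phi)
se_quasi    <- coef(summary(fit_quasi))[, "Std. Error"]

# The last two columns should agree
cbind(se_pois, se_adjusted, se_quasi)
```

One difference to keep in mind: summary() on a quasipoisson glm reports t statistics, whereas the hand adjustment above uses the normal distribution for the p-values.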

lme4: How to specify random slopes while constraining all correlations to 0?

Due to an interesting turn of events, I'm trying to use the lme4 package in R to fit a model in which the random slopes are not allowed to correlate with each other or the random intercept. Effectively, I want to estimate the variance parameter for each random slope, but none of the correlations/covariances. From the reading I've done so far, I think what I want is effectively a diagonal variance/covariance structure for the random effects.
An answer to a similar question here provides a workaround to specify a model where slopes are correlated with intercepts, but not with each other. [Struck through in the original post: I also know the || syntax in lme4 makes slopes that are correlated with each other, but not with the intercepts.] Neither of these seems to fully accomplish what I'm looking to do.
Borrowing the example from the earlier post, if my model is:
m1 <- lmer (Y ~ A + B + (1+A+B|Subject), data=mydata)
is there a way to specify the model such that I estimate variance parameters for A and B while constraining all three correlations to 0? I would like to achieve a result that looks something like this:
VarCorr(m1)
##  Groups   Name        Std.Dev. Corr
##  Subject  (Intercept) 1.41450
##           A           1.49374  0.000
##           B           2.47895  0.000 0.000
##  Residual             0.96617
I'd prefer a solution that could achieve this for an arbitrary number of random slopes. For example, if I were to add a random effect for a third variable C, there would be 6 correlation parameters to fix at 0 rather than 3. However, anything that could get me started in the right direction would be extremely helpful.
Edit:
On asking this question, I misunderstood what the || syntax does in lme4. Struck through the incorrect statement above to avoid misleading anyone in the future.
This is exactly what the double-bar notation does. However, note that the || in lme4 does not work as one might expect for factor variables. It does work 'properly' in glmmTMB, and the afex::mixed() function is a wrapper for [g]lmer which does implement a fully functional version of ||. (I have meant to import this into lme4 for years but just haven't gotten around to it yet ...)
simulated example
library(lme4)
set.seed(101)
dd <- data.frame(A = runif(500), B = runif(500),
Subject = factor(rep(1:25, 20)))
dd$Y <- simulate(~ A + B + (1 + A + B|Subject),
newdata = dd,
family = gaussian,
newparams = list(beta = rep(1,3), theta = rep(1,6), sigma = 1))[[1]]
solution
summary(m <- lmer (Y ~ A + B + (1+A+B||Subject), data=dd))
The correlations aren't listed because they are structurally absent (internally, the random effects term is expanded to (1|Subject) + (0 + A|Subject) + (0+B|Subject), which is also why the groups are listed as Subject, Subject.1, Subject.2).
Random effects:
 Groups    Name        Variance Std.Dev.
 Subject   (Intercept) 0.8744   0.9351
 Subject.1 A           2.0016   1.4148
 Subject.2 B           2.8718   1.6946
 Residual              0.9456   0.9724
Number of obs: 500, groups:  Subject, 25
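To check the expansion described above, here is a small sketch that fits the double-bar model and the hand-expanded model and compares their log-likelihoods. It assumes numeric covariates only (the caveat about factors still applies). The data are simulated and the names d, u, m_bar and m_exp are illustrative.

```r
library(lme4)

set.seed(1)
d <- data.frame(A = runif(300), B = runif(300),
                Subject = factor(rep(1:30, 10)))
# Independent random intercept and slopes per subject (one column each)
u <- matrix(rnorm(30 * 3), nrow = 30, ncol = 3)
s <- as.integer(d$Subject)
d$Y <- 1 + d$A + d$B +
  u[s, 1] + u[s, 2] * d$A + u[s, 3] * d$B + rnorm(300)

# Double-bar shorthand
m_bar <- lmer(Y ~ A + B + (1 + A + B || Subject), data = d)
# Hand-expanded equivalent
m_exp <- lmer(Y ~ A + B + (1 | Subject) + (0 + A | Subject) + (0 + B | Subject),
              data = d)

# Both parameterisations should give the same fit
all.equal(as.numeric(logLik(m_bar)), as.numeric(logLik(m_exp)))
VarCorr(m_bar)
```

Because the expanded form is just a sum of terms, it extends to any number of numeric slopes by adding one (0 + x | Subject) term per slope.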

Complete separation - how to impose 'zero-mean Normal priors on fixed effects' in bglmer

My glmer model, which contains two predictors and an interaction term, suffers from complete separation. Following Ben Bolker's recommendations here and here, I then fit the model with bglmer, imposing zero-mean Normal priors on the fixed effects. My code is as follows:
bglmer(Binary_outcome ~ (1|Subject) + Factor1 + Factor2 + Factor1:Factor2,
mydata,
control=glmerControl(optimizer="bobyqa"),
family = binomial,
fixef.prior = normal(sd = c(3, 3, 3)))
Both Factor1 and Factor2 are factor variables, with four levels each. For my code, I followed the example here. As far as I understand, I now put zero-mean Normal priors with SD of 3 on all elements of my fixed effects structure.
The code seems to have worked, but I am completely uncertain whether what I did is in fact correct. Is 3 SD the general recommendation to help with complete separation? And how would I specify fixef.priors that only go on the interaction term? (The complete separation relates to a specific combination of Factor1 and Factor2, not Factor1 or Factor2 in general). Or do I have to put fixed effect priors on all three elements anyways if the interaction is concerned?
When in doubt, experiment. tl;dr it's probably fine to penalize everything, but you can if you like penalize the interactions and not penalize the main effects (or almost; you have to give a finite sd but it can be very large). sd=3 is reasonable; sd=2.5 is the default for parameters other than the intercept. Unless you're fitting a predictive model and want to go to the trouble of doing cross-validation to pick the optimal strength of penalization, there's not really an automatic way to decide; you just have to pick a sd that makes all your parameters "reasonable" without squashing your well-determined parameters too much toward zero.
I simulated data a bit like yours (2 factors with 4 levels each, binary response, one random intercept term) and tried different penalization scenarios. I set up the true parameters so that most parameters were reasonable (beta=1) but there was one large main effect and one large interaction parameter (beta=12).
no penalization (glmer)
penalization with sd=3
penalization of all terms with sd=1
very weak penalization of main effect terms (sd=1e4) and medium penalization of interactions (sd=2) ["mixed"]
from help("bmerDist-class"):
When specifying standard
deviations, a vector of length less than the number of fixed
effects will have its tail repeated, while the first element
is assumed to apply only to the intercept term. So in the
default of ‘c(10, 2.5)’, the intercept receives a standard
deviation of 10 and the various slopes are all given a
standard deviation of 2.5
library(lme4)
library(blme)
library(broom.mixed)
library(dotwhisker)
library(colorspace)
dd <- expand.grid(Subject=factor(1:10),
Factor1=letters[1:4],
Factor2=LETTERS[1:4],
rep=1:20
)
bvec <- rep(1,16)
bvec[c(2,10)] <- 12
form <- response ~ (1|Subject) + Factor1 + Factor2 + Factor1:Factor2
set.seed(101)
dd$response <- simulate(form[-2],
newdata=dd,
newparams=list(beta=bvec, theta=1),
family=binomial,
weights=rep(1,nrow(dd)))[[1]]
b0 <- glmer(form,
dd,
family = binomial)
bfun <- function(sd) {
bglmer(form, dd, family = binomial, fixef.prior = normal(sd = sd))
}
b1 <- bfun(3)
b2 <- bfun(1)
## seven parameters (intercept + 3 + 3 main effects) first, then nine interaction parameters
b3 <- bfun(rep(c(1e4,2),c(7,9)))
theme_set(theme_bw())
dwplot(list(unpenalized=b0,sd3=b1,sd1=b2,mixed=b3),effect="fixed") +
coord_cartesian(xlim=c(-5,5))+
geom_vline(xintercept=c(0,1),lty=2,colour="darkgray") +
scale_colour_discrete_qualitative(guide=guide_legend(reverse=TRUE))
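To see which positions of the sd vector line up with which coefficients, one option is to inspect the fixed-effects design matrix with base R. This is a sketch assuming default treatment contrasts and the same two four-level factors. The name sd_vec is illustrative, and the values 1e4 and 2 are simply the ones from the "mixed" scenario.

```r
# Same factor layout as in the simulation, without subjects or replicates
dd_small <- expand.grid(Factor1 = letters[1:4], Factor2 = LETTERS[1:4])
X <- model.matrix(~ Factor1 + Factor2 + Factor1:Factor2, data = dd_small)

# "assign" maps each column to a term:
# 0 = intercept, 1 = Factor1, 2 = Factor2, 3 = Factor1:Factor2
asgn <- attr(X, "assign")
table(asgn)

# Very weak prior on intercept and main effects, medium prior on interactions
sd_vec <- ifelse(asgn == 3, 2, 1e4)
names(sd_vec) <- colnames(X)
sd_vec
```

Building the vector from the assign attribute avoids counting coefficients by hand when the number of factor levels changes.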

IV regression computation

For my thesis I am doing an Instrumental Variables (IV) regression and I was wondering if I did it the right way. Couple of issues I have:
Comparing the linear model with the IV models, the sign of the effect changes (positive to negative or the other way round).
Using Two Stage Least Squares (2SLS) with ivreg (from the AER package) gives negative R² values, so I decided to manually compute the 2SLS estimates. These give the same estimates as the ivreg code but now with statistically significant results.
I have limited data and therefore I did not expect any significant results as I already did some non-parametric tests and the means of the different groups were not significantly different.
I am researching the effect of policies of organizations on a given budget.
The organization performs well if the budget residual is positive, meaning its costs are lower than budgeted.
The variable is a percentage, either positive or negative.
There is non-random selection into treatment, as organizations can determine their own policy.
Furthermore, the policy factors are mostly dummy variables: 19 variables are binary, 2 are categorical and 1 is ratio.
My IV is any number between 0 and 1.
This is what I did:
1. I estimate a simple Ordinary Least Squares model to see what it would do (I know the results don't mean anything).
library(jtools)  # provides summ()
lm1 <- lm(budget ~ policy1, data=df)
lm2 <- lm(budget ~ policy2, data=df)
summ(lm1)
summ(lm2)
2. Then I performed an IV with the ivreg code, though the R² became negative which I thought was weird.
library(AER)  # provides ivreg()
ivreg1 <- ivreg(budget ~ policy1 | iv, data=df)
ivreg2 <- ivreg(budget ~ policy2 | iv, data=df)
library(stargazer)
stargazer(ivreg1, ivreg2, dep.var.labels=c("Budget"), covariate.labels = c("policy 1", "policy2") , align=TRUE, column.sep.width = "-15pt", font.size = "small", type="text")
3. So I tried to do the 2SLS in steps myself.
Instead of fitted.values(reg1) I also used predict(reg1). This gives the same output.
attach(df)
reg1<- lm(policy1~iv)
policy1.hat <- fitted.values(reg1)
reg2 <- lm(policy2~iv)
policy2.hat <- fitted.values(reg2)
ivreg3 <- lm(budget~policy1.hat)
ivreg4 <- lm(budget~policy2.hat)
stargazer(ivreg3, ivreg4, dep.var.labels=c("Budget"), covariate.labels = c("policy 1", "policy2"), align=TRUE, column.sep.width = "-15pt", font.size = "small", type="text")
detach(df)
With this step I got a positive adjusted R² but I noticed that the policy factors are now significant and that the sign compared to the lm model changes.
Question:
Am I computing the IV regression wrong?
Example data (not real numbers due to anonymity of data):
df <- data.frame(
budget = c(4,2.8,9.1,15.5,10.1,12.9,4.3,
-1.9,-4.9,-1.3,14.1,8.6,7.8,-5.8,3.8,7.2,5.2,-5.3,8.6,
3.5,-1.2,-15.7,1.6,6.9,12.6,10.4,4.5,-8.3,-15.3,
9.8,21.5),
iv = c(0.52,0.43,0.41,0.44,0.41,0.4,0.39,
0.43,0.38,0.37,0.34,0.42,0.4,0.36,0.35,0.41,0.39,
0.35,0.31,0.43,0.36,0.51,0.35,0.34,0.37,0.37,0.39,
0.46,0.44,0.36,0.37),
policy1 = c(1L,1L,1L,1L,1L,1L,0L,1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L,1L,1L),
policy2 = c(1L,1L,1L,1L,1L,1L,1L,0L,0L,1L,
0L,1L,0L,1L,1L,1L,1L,0L,1L,1L,1L,1L,1L,1L,1L,
0L,1L,1L,0L,1L,0L)
)
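A general point about the manual two-step approach, sketched here with simulated data (not the data above): running lm() on the first-stage fitted values gives the same point estimate as 2SLS, but its standard errors come from residuals computed with the fitted regressor, not the actual one. The sketch below rescales the naive standard error using residuals based on the actual regressor. All names are illustrative, and it says nothing about whether the instrument in the question is valid.

```r
set.seed(42)
n <- 200
z <- rnorm(n)                      # instrument
u <- rnorm(n)                      # unobserved confounder
x <- 0.8 * z + 0.5 * u + rnorm(n)  # endogenous regressor
y <- 1 + 2 * x + u + rnorm(n)

# Manual two-stage procedure
stage1 <- lm(x ~ z)
x_hat  <- fitted(stage1)
stage2 <- lm(y ~ x_hat)
b      <- coef(stage2)

# Naive SE: based on residuals y - b0 - b1 * x_hat
se_naive     <- coef(summary(stage2))["x_hat", "Std. Error"]
sigma2_naive <- summary(stage2)$sigma^2

# 2SLS residuals use the actual regressor x, not x_hat
res_iv    <- y - (b[1] + b[2] * x)
sigma2_iv <- sum(res_iv^2) / (n - 2)
se_iv     <- se_naive * sqrt(sigma2_iv / sigma2_naive)

# R-squared computed from the 2SLS residuals; nothing forces it to be >= 0
r2_iv <- 1 - sum(res_iv^2) / sum((y - mean(y))^2)

c(estimate = unname(b[2]), se_naive = se_naive, se_iv = se_iv, r2_iv = r2_iv)
```

Because the 2SLS residuals are not least-squares residuals, their sum of squares can exceed the total sum of squares, which is one way a negative R² can arise.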

Issue with weighted data in lmer() and model averaging using glmulti()

I am having a hard time fitting Linear Mixed Models (LMM) to data with weights (I mean the weight of the different groups differs). Furthermore, it seems that weights are not kept while using glmulti(). Below is a reproducible example:
require(lme4)
require (glmulti)
data(cake)
cake$wght <- as.numeric(cake$replicate)
fm1 <- lmer(angle ~ recipe + temperature + (1|replicate), cake, REML= FALSE)
print(VarCorr(fm1),comp=c("Variance","Std.Dev."))
In this case residual variance of the random effect equals 22.36.
fm2 <- lmer(angle ~ recipe + temperature + (1|replicate), cake, weights=wght,REML= FALSE)
print(VarCorr(fm2),comp=c("Variance","Std.Dev."))
And here, the residual variance of the random effect is now 155.
For linear models, residual variance remains unchanged whatever the weights, while it seems not to be the case here.
The second issue occurs when performing a model averaging with weighted data. It seems that glmulti() does not account for specified weights in the following examples:
wlmer.glmulti <- function (formula, data, random = "", weights ,...) {
lmer(paste(deparse(formula), random), data = data, weights)}
#(watch out doesn't converge!!)
LMM <- glmulti(angle ~ recipe + temperature , data=cake, random="+ (1|replicate)", fitfunc = wlmer.glmulti, weights=cake$wght,report=T, level = 1,crit="aic",method="g")
summary(LMM@objects[[1]]) # is similar to fm1
Any suggestion is most welcome. Thanks
A short answer to my own post above.
1/ Dealing with weights in Linear Mixed Models:
To avoid large inflation of the residual variances of the RE due to the weights, one can "simply" ensure that the sum of all weights equals 1.
2/ Integrating weights into model averaging procedure (glmulti)
Directly insert the weights into the wrapper (thanks to Vincent Calcagno for the tip).
From previous example:
wlmer.glmulti <- function (formula, data, random = "", weights ,...) {
lmer(paste(deparse(formula), random), data = data, weights)}
Updated working alternative
wlmer.glmulti <- function (formula, data, random = "", weights ,...) {
# pass the weights by name; an unnamed third argument would be matched to REML
lmer(paste(deparse(formula), random), data = data, weights = data$wght)}
Hope this will be useful at some point ;)
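To illustrate how the scale of the weights enters the reported residual variance, here is a minimal base-R sketch. It uses lm() as a stand-in for lmer(), which is a simplifying assumption, and the data are simulated. Weights are treated as precision weights, so the error variance of observation i is sigma^2 / w_i.

```r
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
w <- runif(n, 1, 15)   # arbitrary positive weights

fit_raw  <- lm(y ~ x, weights = w)           # weights as given
fit_norm <- lm(y ~ x, weights = w / sum(w))  # weights rescaled to sum to 1

# Coefficients do not depend on the overall scale of the weights
rbind(raw = coef(fit_raw), normalised = coef(fit_norm))

# The reported residual variance does: it scales with the weights
s2_raw  <- summary(fit_raw)$sigma^2
s2_norm <- summary(fit_norm)$sigma^2
c(s2_raw = s2_raw, s2_norm = s2_norm, ratio = s2_raw / s2_norm, sum_w = sum(w))
```

In this sketch the ratio of the two residual variances equals sum(w), so the reported value is only meaningful relative to the weight scale chosen.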
