comparing coefficients between lmList and lmer - r

Can anyone tell me why the slope coefficients extracted from an lmer model with a random slope deviate from those of an lmList model fitted to the same dataset?
Thanks...

After some digging I found the answer in Doug Bates' book on lme4. Paraphrasing: when the individual linear fit at the subject level is poor, the linear mixed-effects model coefficient tends to exhibit what is called “shrinkage” (see http://lme4.r-forge.r-project.org/lMMwR/lrgprt.pdf) towards the population-level value (i.e. the fixed effect). In this case the uncertainty in the site-level coefficient is large (i.e. our confidence in its precise value is low), so in order to balance fidelity to the data, measured by the residual sum of squares, with simplicity of the model, the mixed-effects model smooths out the between-subject differences in the predictions by bringing them closer to a common set of predictions, but not at the expense of dramatically increasing the sum of squared residuals.

Note that the "shrinkage" might be a good thing assuming some degree of similarity among your subjects (or observational units), for example if you assume they are drawn from the same population, because it makes the model more robust to outliers at the individual level.
You can quantify the increase in the sum of squared residuals by computing an overall coefficient of determination for the mixed-effects model and the within-subject fits. I am doing it here for the sleepstudy dataset contained in the lme4 package.
> library(lme4)
> mm <- lmer(Reaction ~ Days + (Days|Subject), data = sleepstudy) # mixed-effects
> ws <- lmList(Reaction ~ Days |Subject, data = sleepstudy) # within-subject
>
> # coefficient of determination for mixed-effects model
> summary(lm(sleepstudy$Reaction ~ predict(mm)))$r.squared
[1] 0.8271702
>
> # coefficient of determination for within subjects fit
> require(nlme)
> summary(lm(sleepstudy$Reaction ~ predict(ws)))$r.squared
[1] 0.8339452
The decrease in the proportion of variability explained by the mixed-effects model with respect to the within-subject fits is quite small: 0.8339452 - 0.8271702 = 0.006775.
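To see the shrinkage directly, one can put the per-subject slopes from both fits side by side. The following is a minimal sketch on the same sleepstudy data; the names ws_coef, slopes and pop_slope are introduced here for illustration only. It compares the spread of the slopes rather than each subject individually, since a single subject's slope is not guaranteed to move towards the population value when intercepts and slopes are shrunk jointly.

```r
library(lme4)

# Same two fits as above: mixed-effects model and separate per-subject regressions
mm <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
ws <- lmList(Reaction ~ Days | Subject, data = sleepstudy)

# Per-subject coefficients from the separate fits (one row per subject)
ws_coef <- coef(ws)

# Line up the two sets of slopes by subject label
slopes <- data.frame(
  within_subject = ws_coef[, "Days"],
  mixed_model    = coef(mm)$Subject[rownames(ws_coef), "Days"],
  row.names      = rownames(ws_coef)
)

# Population-level (fixed-effect) slope that the mixed-model slopes are pulled towards
pop_slope <- fixef(mm)["Days"]

print(slopes)
print(pop_slope)

# Shrinkage shows up as a smaller spread of the mixed-model slopes
print(c(sd_within = sd(slopes$within_subject), sd_mixed = sd(slopes$mixed_model)))
```

Subjects whose individual fit is poor or whose slope is extreme should show the largest gap between the two columns.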

Related

How can I get predictions with CI from lmerTest models?

We are currently working with plant phenology.
We built a linear mixed model for each species present in the study area.
We set Days From Snowmelt (the number of days from snowmelt to the visit day during the summer) as the response variable and Mean phenology as the predictor. Mean phenology is the mean phenological state of each plot (there are 3 plots in each locality), computed from the 12 subplots into which each plot is divided; states run from 1 to 6, and the higher the number, the more advanced the cycle. Year and plot nested within locality are set as random factors.
Once the model is built and revised, we want to predict the days from snowmelt for each species to achieve the phenological phases of interest, which happen to have a mean of 2, 3, 4, and 5. (corresponding to vegetative, flowering, fruit development and dispersion, respectively)
I have tried the function predict() but I get no heterogeneity between phases for each species, the progression seems to be linear (as shown in the image file).
Could this be just because it is a linear model, so it will only give linear responses? Are there any other ways to get predictions from these kinds of models and show their CI?
How can I get predictions with CI from lmerTest models?
I think you probably mean prediction intervals. You can use the predictInterval function in the merTools package. For example:
library(lmerTest); library(merTools)
fm1 <- lmer(Reaction ~ Days + (Days|Subject), data = sleepstudy)
head(predictInterval(fm1, level = 0.95, seed = 123, n.sims = 100))
Could this be just because it is a linear model, so it will only give linear responses?
Yes! If you fit a linear model, then the predictions will be linear. Of course, you can model nonlinearity with a linear model in several ways, including transformation(s), nonlinear terms (the model is still linear in the parameters) and splines.
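As an illustration of the spline option, here is a minimal sketch using the sleepstudy data rather than the phenology data, which is not available here. The choice of a natural spline with 3 degrees of freedom and a random intercept only is an assumption made to keep the example small, not a recommendation for the phenology model.

```r
library(lme4)
library(splines)

# Straight-line fixed effect: population-level predictions fall on a line
fm_lin <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Natural cubic spline in Days: still linear in the parameters,
# but the predicted curve is allowed to bend
fm_spl <- lmer(Reaction ~ ns(Days, df = 3) + (1 | Subject), data = sleepstudy)

# Population-level predictions (re.form = NA drops the random effects)
new_days <- data.frame(Days = 0:9)
pred_lin <- predict(fm_lin, newdata = new_days, re.form = NA)
pred_spl <- predict(fm_spl, newdata = new_days, re.form = NA)

# Equal steps between consecutive days for the linear fit, unequal for the spline fit
print(cbind(new_days, linear = pred_lin, spline = pred_spl))
```

Whether the extra flexibility is warranted can be judged by comparing the two fits, for example with anova() after refitting with maximum likelihood.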

Longitudinal analysis using sampling weights in R

I have longitudinal data from two surveys and I want to do a pre-post analysis. Normally, I would use survey::svyglm() or svyVGAM::svy_vglm (for multinomial family) to include sampling weights, but these functions don't account for the random effects. On the other hand, lme4::lmer accounts for the repeated measures, but not the sampling weights.
For continuous outcomes, I understand that I can do
w_data_wide <- svydesign(ids = ~1, data = data_wide, weights = data_wide$weight)
svyglm((post-pre) ~ group, w_data_wide)
and get the same estimates that I would get if I could use lmer(outcome ~ group*time + (1|id), data_long) with weights [please correct me if I'm wrong].
However, for categorical variables, I don't know how to do the analyses. WeMix::mix() has a parameter weights, but I'm not sure if it treats them as sampling weights. Still, this function can't support multinomial family.
So, to summarize: can you enlighten me on how to do a pre-post test analysis of categorical outcomes with 2 or more levels? Any tips about packages/functions in R and how to use/write them would be appreciated.
I give below some data sets with binomial and multinomial outcomes:
library(data.table)
set.seed(1)
data_long <- data.table(
id=rep(1:5,2),
time=c(rep("Pre",5),rep("Post",5)),
outcome1=sample(c("Yes","No"),10,replace=T),
outcome2=sample(c("Low","Medium","High"),10,replace=T),
outcome3=rnorm(10),
group=rep(sample(c("Man","Woman"),5,replace=T),2),
weight=rep(c(1,0.5,1.5,0.75,1.25),2)
)
data_wide <- dcast(data_long, id~time, value.var = c('outcome1','outcome2','outcome3','group','weight'))[, `:=` (weight_Post = NULL, group_Post = NULL)]
EDIT
As I said below in the comments, I've been using lmer and glmer with the variables used to calculate the weights as predictors. However, glmer runs into a lot of problems (convergence, large eigenvalues...), so I took another look at Thomas Lumley's answer in this post and others (https://stat.ethz.ch/pipermail/r-help/2012-June/315529.html | https://stats.stackexchange.com/questions/89204/fitting-multilevel-models-to-complex-survey-data-in-r).
So, my question is now whether I can use participants' id as clusters in svydesign
library(survey)
w_data_long_cluster <- svydesign(ids = ~id, data = data_long, weights = data_long$weight)
summary(svyglm(factor(outcome1) ~ group*time, w_data_long_cluster, family="quasibinomial"))
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.875e+01  1.000e+00  18.746   0.0339 *
groupWoman         -1.903e+01  1.536e+00 -12.394   0.0513 .
timePre             5.443e-09  5.443e-09   1.000   0.5000
groupWoman:timePre  2.877e-01  1.143e+00   0.252   0.8431
and still interpret groupWoman:timePre as differences in the average rate of change/improvement in the outcome over time between sex groups, as if I was using mixed models with participants as random effects.
Thank you once again!
A linear model with svyglm does not give the same parameter estimates as lme4::lmer. It does estimate the same parameters as lme4::lmer if the model is correctly specified, though.
Generalised linear models with svyglm or svy_vglm don't estimate the same parameters as lme4::glmer, as you note. However, they do estimate perfectly good regression parameters, and if you aren't specifically interested in the variance components or in estimating the realised random effects (BLUPs) I would recommend just using svyglm or svy_vglm.
Another option if you have non-survey software for random effects versions of the models is to use that. If you scale the weights to sum to the sample size and if all the clustering in the design is modelled by random effects in the model, you will get at least a reasonable approximation to valid inference. That's what I've seen recommended for Bayesian survey modelling, for example.
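The rescaling step mentioned above is simple arithmetic. A minimal sketch, assuming a hypothetical vector of raw sampling weights (raw_w is made up for illustration and is not taken from the question's data):

```r
# Hypothetical raw sampling weights for 10 respondents (illustrative values only)
raw_w <- c(120, 60, 180, 90, 150, 120, 60, 180, 90, 150)

# Rescale so that the weights sum to the sample size;
# relative weights between respondents are unchanged
scaled_w <- raw_w * length(raw_w) / sum(raw_w)

print(scaled_w)
print(sum(scaled_w))
```

The scaled weights could then be supplied to the random-effects software of choice; how that software interprets a weights argument should be checked in its documentation.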

Test of second differences for average marginal effects in logistic regression

I have a question similar to the one here: Testing the difference between marginal effects calculated across factors. I used the same code to generate average marginal effects for two groups. The difference is that I am running a logistic rather than linear regression model. My average marginal effects are on the probability scale, so emmeans will not provide the correct contrast. Does anyone have any suggestions for how to test whether there is a significant difference in the average marginal effects between group 1 and group 2?
Thank you so much,
Ilana
It is a bit unclear what the issue really is, but I'll try. I'm supposing your logistic regression model was fitted using, say, glm:
mod <- glm(cbind(heads, tails) ~ treat, data = mydata, family = binomial())
If you then do
emm <- emmeans(mod, "treat")
emm ### marginal means
pairs(emm) ### differences
Your results will be presented on the logit scale.
If you want them on the probability scale, you can do
summary(emm, type = "response")
summary(pairs(emm), type = "response")
However, the latter will back-transform the differences of logits, thereby producing odds ratios.
If you actually want differences of probabilities rather than ratios of odds, use regrid(), which will construct a new grid of values after back-transforming (and hence it will forget the logit transformation):
pairs(regrid(emm))
It seems possible that two or more factors are present and you want contrasts of contrasts on the probability scale. In that case, extend this idea by calling regrid() on the table of EMMs to put everything on the probability scale, then follow the analogous procedure used in the linked article.
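The distinction between back-transforming a difference of logits and taking a difference of probabilities can be checked with plain arithmetic. A small sketch in base R, using two made-up group probabilities (p1 and p2 are hypothetical and not from any fitted model):

```r
# Hypothetical predicted probabilities for two groups
p1 <- 0.2
p2 <- 0.5

# Difference on the logit scale, which is what pairs(emm) reports by default
logit_diff <- qlogis(p2) - qlogis(p1)

# Back-transforming that difference gives an odds ratio
odds_ratio <- exp(logit_diff)

# Difference on the probability scale, which is what regrid() targets
prob_diff <- p2 - p1

print(c(logit_diff = logit_diff, odds_ratio = odds_ratio, prob_diff = prob_diff))
```

The two quantities answer different questions, so a test on one scale need not agree with a test on the other.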

Using Zero-inflation regression and Zero-inflation negative binomial regression for trend

I am using zero-inflated Poisson (zip) and zero-inflated negative binomial (zinb) regressions to detect temporal trends in count data (deaths per year for 30 years reported at 6 hospitals) that have many zeros and overdispersion.
I have written some code using the pscl package, and my goal is to compare trends among hospitals.
Counts<- read.csv("data.csv", header = T)
Years= Counts$X
Ho1= Counts$Ho1
Ho2= Counts$Ho2
Ho3= Counts$Ho3
... .........
... ..........
require(pscl)
zip1 <- zeroinfl(Ho1 ~ Years, dist = "poisson")
zinb4 <- zeroinfl(Ho4 ~ Years, dist = "negbin")
But when I plot some of the data it shows slightly increasing trends whereas the zip and zinb show negative trends
Here is an example:
zip result:
zip1
Call:
zeroinfl(formula = Ho1 ~ Years, dist = "poisson")
Count model coefficients (poisson with log link):
(Intercept)        Years
  -4.836815     0.002837
Zero-inflation model coefficients (binomial with logit link):
(Intercept)        Years
   467.2323      -0.2353
For this model the trend (slope) is -0.235, whereas ordinary least squares (OLS) gives a trend of 0.043.
My understanding is that the zip and OLS estimates should differ only slightly.
So I was thinking maybe my code is not correct or I am missing something.
I would appreciate any thoughts and suggestions.
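The -0.235 is the slope of the zero-inflation component (the log-odds of an excess zero), not of the counts; the slope of the count component is 0.002837, which is positive. With increasing Years you get increasing counts (= higher responses and fewer zeros) and decreasing zero inflation (= higher responses and fewer zeros). Thus, the effects in both components of the model appear to be in sync and conform with your OLS results.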
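To make this concrete, the two components can be combined into the expected count, E[Y] = (1 - p_zero) * mu, using the coefficients printed in the question. This is a minimal sketch in base R; the range of years is an assumption (the question does not state which 30 years were observed), and the variable names are introduced here for illustration.

```r
# Assumed 30-year window (hypothetical; the actual years are not given in the question)
years <- 1985:2014

# Coefficients copied from the printed zip1 output
b_count <- c(intercept = -4.836815, slope = 0.002837)   # log link
b_zero  <- c(intercept = 467.2323,  slope = -0.2353)    # logit link

# Mean of the Poisson component
mu <- exp(b_count["intercept"] + b_count["slope"] * years)

# Probability of an excess (structural) zero
p_zero <- plogis(b_zero["intercept"] + b_zero["slope"] * years)

# Expected count under the zero-inflated Poisson model
expected <- (1 - p_zero) * mu

print(data.frame(years, mu, p_zero, expected))
```

Because mu increases and p_zero decreases with Years, the expected count increases over time, which is the direction the OLS slope also indicates.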

How to get individual coefficients and residuals in panel data using fixed effects

I have panel data including income for individuals over years, and I am interested in the income trends of individuals, i.e. individual coefficients for income over years, and residuals for each individual for each year (the unexpected changes in income according to my model). However, I have a lot of observations with missing income data for one or more years, so with a linear regression I lose the majority of my observations. The data structure is like this:
caseid<-c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4)
years<-c(1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008,
1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008)
income<-c(1100,NA,NA,NA,NA,1300,1500,1900,2000,NA,2200,NA,
NA,NA,NA,NA,NA,NA, 2300,2500,2000,1800,NA, 1900)
df<-data.frame(caseid, years, income)
I decided to use a random effects model, which I think will still predict income for missing years by using a maximum likelihood approach. However, since the Hausman test gives a significant result, I decided to use a fixed effects model instead. I ran the code below, using the plm package:
inc.fe<-plm(income~years, data=df, model="within", effect="individual")
However, I get coefficients only for years and not for individuals; and I cannot get residuals.
To maybe give an idea, the code in Stata should be
xtset caseid
xtreg income year, fe
predict resid, e
Then I tried to run the pvcm function from the same library, which is a function for variable coefficients.
inc.wi<-pvcm(Income~Year, data=ldf, model="within", effect="individual")
However, I get the following error message:
"Error in FUN(X[[i]], ...) : insufficient number of observations".
How can I get individual coefficients and residuals with pvcm by resolving this error or by using some other function?
My original long form data has 202976 observations and 15 years.
Does the fixef function from package plm give you what you are looking for?
Continuing your example:
fixef(inc.fe)
Residuals are extracted by:
residuals(inc.fe)
You have a random effects model with random slopes and intercepts. This is also known as a random coefficients regression model. The missingness is the tricky part, which (I'm guessing) you'll have to write custom code to solve after you choose how you wish to do so.
But you haven't clearly/properly specified your model (at least in your question) as far as I can tell. Let's define some terms:
Let Y_it = income for individual i (i = 1,..., N) in year t (t = 1,...,T). As I read your question, you have not specified which of the two models below you wish to have:
M1: random intercepts, global slope, random slopes
Y_it ~ N(\mu_i + B T + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, \tau_1^2)
M2: random intercepts, random slopes
Y_it ~ N(\mu_i + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, \tau_1^2)
Also, your example data is nonsensical (see below). As you can see, you don't have enough observations to estimate all parameters. I'm not familiar with library(plm) but the above models (without missingness) can be estimated in lme4 easily. Without a realistic example dataset, I won't bother providing code.
R> table(df$caseid, is.na(df$income))
    FALSE TRUE
  1     2    4
  2     4    2
  3     0    6
  4     5    1
Given that you do have missingness, you should be able to produce estimates for either hierarchical model via the typical methods, such as EM. But I do think you'll have to write the code to do the estimation yourself.
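For comparison, a much simpler route than either hierarchical model is to fit a separate least-squares line per individual and skip individuals with too few observed incomes. The sketch below does this in base R on the example data from the question. It is not the estimation approach described above (no pooling, no EM); the threshold of 3 observations and the names n_obs, keep, fits and coefs are assumptions made for illustration.

```r
# Example data from the question
caseid <- rep(1:4, each = 6)
years  <- rep(seq(1998, 2008, by = 2), times = 4)
income <- c(1100, NA, NA, NA, NA, 1300,
            1500, 1900, 2000, NA, 2200, NA,
            NA, NA, NA, NA, NA, NA,
            2300, 2500, 2000, 1800, NA, 1900)
df <- data.frame(caseid, years, income)

# Number of non-missing incomes per individual
n_obs <- tapply(!is.na(df$income), df$caseid, sum)

# Keep individuals with at least 3 observed years (assumed threshold:
# 2 points give a perfect fit with no residual information, 0 or 1 give no line)
keep <- names(n_obs)[n_obs >= 3]
sub  <- df[df$caseid %in% keep, ]

# One lm per remaining individual; rows with missing income are dropped by lm
fits <- lapply(split(sub, sub$caseid), function(d) lm(income ~ years, data = d))

# Individual intercepts and slopes, one row per individual
coefs <- t(sapply(fits, coef))
print(coefs)

# Residuals for each individual's observed years
print(lapply(fits, residuals))
```

Individuals 1 and 3 are excluded here because they have 2 and 0 observed incomes respectively, which is consistent with the table above.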
