R: mixed model with heteroscedastic data -> only lm function works?

This question asks the same question, but hasn't been answered. My question relates to how to specify the model with the lm() function and is therefore a programming (not statistical) question.
I have a mixed design (2 repeated and 1 independent predictors). Participants were first primed into group A or B (this is the independent predictor) and then they rated how much they liked 4 different statements (these are the two repeated predictors).
There are many great online resources on how to model this kind of data. However, my data are heteroscedastic, so I would like to use heteroscedasticity-consistent covariance matrices. This paper explains it well. The sandwich and lmtest packages are great. Here is a good explanation of how to do it for an independent design in R with lm(y ~ x).
It seems that I have to use lm, or else it won't work?
Here is the code for a regression model assuming that all variances are equal (which they are not as Levene's test comes back significant).
fit3 <- nlme::lme(DV ~ repeatedIV1*repeatedIV2*independentIV1, random = ~1|participants, data = df)  # works fine
Here is the code for an independent model correcting for heteroscedasticity, which works.
fit3 <- lm(DV ~ independentIV1, data = df)
library(sandwich)
vcovHC(fit3, type = "HC4")  # heteroscedasticity-consistent covariance matrix
library(lmtest)
coeftest(fit3, vcov. = vcovHC, type = "HC4")  # coefficient tests based on the HC4 covariance
So my question really is, how to specify my model with lm?
Alternative approaches in R for fitting my model while accounting for heteroscedasticity are welcome too!
Thanks a lot!!!

My impression is that your problems come from mixing various approaches for various aspects (repeated measurements/correlation vs. heteroscedasticity) that cannot be mixed so easily. Instead of using random effects you might also consider fixed effects, or instead of only adjusting the inference for heteroscedasticity you might consider a Gaussian model and model both mean and variance, etc. For me, it's hard to say what is the best route forward here. Hence, I only comment on some aspects regarding the sandwich package:
The sandwich package is not limited to lm/glm only but is in principle object-oriented, see vignette("sandwich-OOP", package = "sandwich") (also published as doi:10.18637/jss.v016.i09).
There are suitable methods for a wide variety of packages/models but not for nlme or lme4. The reason is that it's not so obvious for which mixed-effects models the usual sandwich trick actually works. (Disclaimer: But I'm no expert in mixed-effects modeling.)
However, for lme4 there is a relatively new package called merDeriv (https://CRAN.R-project.org/package=merDeriv) that supplies estfun and bread methods so that sandwich covariances can be computed for lmer output etc. There is also a working paper associated with that package: https://arxiv.org/abs/1612.04911
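One concrete way to "model both mean and variance" while keeping a random intercept is the weights argument of nlme::lme(). The following is only a minimal sketch, not the asker's model: it uses the Orthodont data shipped with nlme as a stand-in, and it assumes the residual variance differs between the levels of a grouping factor (here Sex, playing the role of independentIV1).

```r
library(nlme)

# Stand-in data: Orthodont ships with nlme (distance measured repeatedly per Subject).
# Homoscedastic random-intercept model, analogous to the lme() call in the question.
fit_hom <- lme(distance ~ age * Sex, random = ~ 1 | Subject, data = Orthodont)

# Same mean structure, but a separate residual standard deviation per level of Sex.
fit_het <- update(fit_hom, weights = varIdent(form = ~ 1 | Sex))

# Likelihood-ratio comparison of the two variance structures
# (both fits share the same fixed effects, so the REML fits are comparable).
anova(fit_hom, fit_het)
```

Whether a per-group variance is the right structure for the real data is a modeling question; varIdent is just one of the variance functions nlme provides.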

Related

Linear mixed model in replicated crossover design

I am struggling with how to fit the model for a replicated crossover design using the REML method. The model suggested by the FDA is as above. Can someone help me code it in R? This is my code, and I wonder whether it is right or wrong:
samplePK.lmer <- lmer(ykir2 ~ 1 + Treatment:Sequence:Replication +
                        (1 | Subject:Sequence:Treatment), data = samplePK, REML = TRUE)
I would say the formula should be
response ~ trt + trt:seq:rep + (trt|subj:seq)
The key difference from your specification is that (trt|subj:seq) is fitting something like a randomized-block or random-slopes model, where we are allowing for the variation of the treatment effect across subject/sequence combinations.
There are a few issues here that I ran into/noticed when trying to fit this model to simulated data:
if we are fitting this with "modern" mixed-model approaches, there will be some parameters aliased between the treatment effect (trt) and the trt:seq:rep term. In a method-of-moments approach this doesn't matter so much (because you never explicitly estimate the parameters), but it leads to complaints about rank-deficient models (which are ignorable if you know what you're doing ...).
it seems wrong that the random effect delta_{ij} is given as having a mean of {mu_R,mu_T}; this is redundant with the fixed effect mu_k
Obviously I could have something wrong or misunderstood something about the original model specification.
I might suggest that you try follow-up questions on the r-sig-mixed-models@r-project.org mailing list, where there is a wide readership with broad expertise (i.e., more expertise on the subject of mixed models specifically than here on StackOverflow).
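For anyone who wants to try the suggested formula, here is a rough sketch on simulated data. Everything about the data is an assumption made for illustration (24 subjects, sequences TRTR and RTRT, the effect sizes, and the column names response, trt, seq, rep and subj); it is not the asker's samplePK data and not a validated implementation of the FDA model.

```r
library(lme4)

set.seed(1)
# Hypothetical replicated crossover: 24 subjects, two sequences, four periods.
n_subj <- 24
design <- expand.grid(period = 1:4, subj = factor(seq_len(n_subj)))
design$seq <- factor(ifelse(as.integer(design$subj) <= n_subj / 2, "TRTR", "RTRT"))
# Treatment received in each period, read off the sequence label.
design$trt <- factor(substring(as.character(design$seq), design$period, design$period))
# rep = first or second administration of the same treatment within a subject.
design$rep <- factor(ifelse(design$period <= 2, 1, 2))

# Simulated response: treatment shift, subject intercept,
# subject-by-treatment deviation, and residual noise.
subj_eff <- rnorm(n_subj, sd = 0.3)
subj_trt <- rnorm(n_subj, sd = 0.15)
is_T <- design$trt == "T"
design$response <- 4 + 0.1 * is_T +
  subj_eff[as.integer(design$subj)] +
  subj_trt[as.integer(design$subj)] * is_T +
  rnorm(nrow(design), sd = 0.2)

# The fixed-effect part is rank deficient (trt is aliased with trt:seq:rep);
# lmer reports this and drops the aliased columns.
fit <- lmer(response ~ trt + trt:seq:rep + (trt | subj:seq),
            data = design, REML = TRUE)
fixef(fit)
VarCorr(fit)
```

The rank-deficiency message corresponds to the aliasing between trt and trt:seq:rep mentioned above. A singular-fit message can also appear with small simulated samples.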

How to train a multiple linear regression model to find the best combination of variables?

I want to run a linear regression model with a large number of variables and I want an R function to iterate on good combinations of these variables and give me the best combination.
The glmulti package will do this fairly effectively:
Automated model selection and model-averaging. Provides a wrapper for glm and other functions, automatically generating all possible models (under constraints set by the user) with the specified response and explanatory variables, and finding the best models in terms of some Information Criterion (AIC, AICc or BIC). Can handle very large numbers of candidate models. Features a Genetic Algorithm to find the best models when an exhaustive screening of the candidates is not feasible.
Unsolicited advice follows:
HOWEVER. Please be aware that while this approach can find the model that minimizes within-sample error (the goodness of fit on your actual data), it has two major problems that should make you think twice about using it.
this type of data-driven model selection will almost always destroy your ability to make reliable inferences (compute p-values, confidence intervals, etc.). See this CrossValidated question.
it may overfit your data (although using the information criteria listed in the package description will help with this)
There are a number of different ways to characterize a "best" model, but AIC is a common one, and base R offers step(), and package MASS offers stepAIC().
summary(lm1 <- lm(Fertility ~ ., data = swiss))
slm1 <- step(lm1)
summary(slm1)
slm1$anova

Does first-difference work on unbalanced data with holes in plm?

On page 423 in Computational Laboratory for Economics, it is stated that "the argument model="fd" doesn't work correctly, with the current version (1.3-1) of plm, on unbalanced data with holes." Has this been fixed in the newer versions of plm?
As a workaround, the author used diff to obtain the first differences and fitted a model="pooling" on the differenced data. Can someone explain how the diff function works on unbalanced data with holes?
Also, on page 68 in the plm documentation (version 1.6-5), it is stated that
"plm is a general function for the estimation of linear panel models. It supports the following estimation methods: pooled OLS (model = "pooling"), fixed effects ("within"), random effects ("random"), first–differences ("fd"), and between ("between"). It supports unbalanced panels and two–way effects (although not with all methods)."

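As a partial illustration of the diff question above, the sketch below shows what base R's diff() does on a plain numeric vector: it subtracts adjacent elements and knows nothing about the time index, so a hole is silently treated as a one-period step. The data frame and its column names are made up for the example. How plm's own methods for panel data handle holes is not shown here and should be checked against the documentation of the installed version.

```r
# Hypothetical unbalanced panel: id 1 is missing year 2003 (a "hole").
panel <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2, 2),
  year = c(2001, 2002, 2004, 2001, 2002, 2003, 2004),
  y    = c(10, 12, 18, 5, 6, 8, 9)
)
panel <- panel[order(panel$id, panel$year), ]

# Row-wise differences within each id: diff() only sees adjacent rows,
# so the 2002 -> 2004 change for id 1 is treated like a one-period difference.
lag_diff <- function(v) c(NA, diff(v))
panel$dy_naive <- ave(panel$y, panel$id, FUN = lag_diff)

# Gap-aware version: keep a difference only when the two rows are one year apart.
panel$dyear <- ave(panel$year, panel$id, FUN = lag_diff)
panel$dy <- ifelse(panel$dyear == 1, panel$dy_naive, NA)

panel
```

Rows with a non-missing dy could then be passed to a pooled regression, which is the spirit of the workaround described in the question.
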
Repeated measures ANOVA

What is the generic code to perform a repeated measures ANOVA?
I am currently using the code:
summary(aov(Chlo~Site,data=alldata))
There are three different sites (Site) and four experimental variables that I am testing individually (Chlo, SST, DAC and PAR). I am also assessing any differences in these variables by year (between 2003 and 2012):
summary(aov(Chlo~Year,data=year))
Any help would be appreciated!
In general you should avoid performing multiple calls with aov and rather use a mixed effects linear model.
You can find several examples in this post by Paul Gribble
I often use the nlme package, such as:
require(nlme)
model <- lme(dv ~ myfactor, random = ~1|subject/myfactor, data=mydata)
Depending on your design you may run into more complex situations; I would suggest having a look at the very clear book by Julian Faraway, "Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models".
Also, you may want to ask on CrossValidated if you have more specific statistical questions.
The trick with the aov function is that you just need to add an Error term. As one of the guides says: the Error term must reflect that we have "treatments nested within subjects".
So, in your case, if Site is a repeated measure you should use:
summary(aov(Chlo ~ Site + Error(subject/Site), data=alldata))
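To see what the Error term does, here is a small self-contained sketch on simulated data. The subject column and all values are assumptions for illustration; the question does not say what the repeated unit in alldata is.

```r
set.seed(42)
# Hypothetical long-format data: 10 subjects, each measured at all 3 sites.
d <- expand.grid(subject = factor(1:10), Site = factor(c("A", "B", "C")))
subject_eff <- rnorm(10)
d$Chlo <- 2 + 0.5 * as.integer(d$Site) +
  subject_eff[as.integer(d$subject)] + rnorm(nrow(d), sd = 0.3)

# Site is tested against the subject-by-Site stratum, not the pooled residual.
fit <- aov(Chlo ~ Site + Error(subject / Site), data = d)
summary(fit)
```

The summary is split by error stratum, and the test for Site appears in the subject:Site stratum.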

Can I perform all subsets variable selection for a Cox Proportional Hazards Model in R?

I am trying to use a function similar to, if not actually, regsubsets in the leaps package in R to select the top Cox proportional hazards models for my data. Is this possible, and if so, does such a function already exist?
I'm guessing you're already familiar with the following... If you're using AIC as your criterion for 'top model' then this would be a reasonable starting point:
library(survival)
data(colon)
c1 <- coxph(Surv(time = time, event = status) ~
              as.factor(extent) + age + sex, data = colon)
step(c1)
Be careful with this if you have missing values (NA). There may of course be a better model which is not found by this method, but with a small number of potential predictors you're unlikely to miss it. Caveats as above (thanks @DWin) about using numerical methods where an informed opinion may be more reliable.
