Is using 'nAGQ = 0' in glmer an issue?

I have seen a couple of similar questions to my own without sufficient answers, so posting this here. I am running a generalised linear mixed effect regression model in R using the glmer() function in the lme4 package. My code is as follows:
model <- glmer(Response ~ step * type + duration + StemArousal + StemValency + stemBNCfreq + (1 | ParticipantID) + (1 | word), family = binomial, data = demodata, nAGQ = 0)
Response is a binary numeric variable (0 and 1). step and type are both categorical variables (step has 7 levels, type has 3), and the other four predictors are numeric. Both random-effect grouping variables are categorical: I have 389 participants and 20 words.
Currently, I also include the argument 'nAGQ = 0', which I found via another post (https://stats.stackexchange.com/questions/77313/why-cant-i-match-glmer-family-binomial-output-with-manual-implementation-of-g). If I don't do this, the model does not converge.
I found an explanation elsewhere of the difference between 'nAGQ = 1' and 'nAGQ = 0' (https://stat.ethz.ch/pipermail/r-sig-mixed-models/2017q3/025942.html), which suggests that what I have done is less precise, since the fixed effects are estimated with less interaction with the random effects.
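If a comparison is feasible (for example on a simplified model that converges both ways), one sanity check is to refit under both settings and compare the estimates directly. A minimal sketch, reusing the model object from above:

# nAGQ = 0 estimates the fixed effects inside the penalized least-squares
# step (fast but less accurate); nAGQ = 1 uses the full Laplace approximation.
m0 <- update(model, nAGQ = 0)
m1 <- update(model, nAGQ = 1)   # may fail to converge, as described above
cbind(nAGQ0 = fixef(m0), nAGQ1 = fixef(m1))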
Is there a general consensus about the acceptability of this approach, and does anyone have a reliable source that discusses it?
Apologies if this sounds like a cross-post/repeat question; it's just that nothing clear seemed to have been resolved.

Related

lme4 glmer - problems with allFit: mismatch between summary(allFit(model)) and summary(model)

I'm writing to ask a question about the allFit function of the lme4 package.
I have fit a GLMM with the following structure:
MM <- glmer(y ~ x1 + x2 + x3 + x4 + (1 | subject) + (0 + x1 + x2 | school), data = data, family = poisson(), offset = log(n))
I can't share the original data, but suppose:
x1, x2, x3, x4: explanatory variables
y: response variable
subject: represents each row of the data frame
school: represents groups of rows of the data frame
n: sample size
Therefore, I have a GLMM model, with a random intercept and two random slopes that are also correlated with each other, but without correlation with the intercept.
When I perform simulations, on some occasions, convergence warnings are given.
I have reviewed all the available documentation and related questions. Specifically, following the latest lme4 documentation (page 16), I used the allFit function to investigate these warnings further.
The results show that no optimizer, or only occasionally L-BFGS-B, reports problems, and that all the parameter estimates, for both the fixed and random effects, are practically the same.
However, I don't understand why, when these models have convergence warnings, the results I get from the variance-covariance matrix and from summary on the model are completely different from those returned by summary applied to the object resulting from allFit:
beta <- fixef(MM) differs noticeably from summary(allFit(MM))$fixef
var <- as.data.frame(VarCorr(MM))$sdcor differs noticeably from summary(allFit(MM))$sdcor
The values returned by summary(allFit) are the ones consistent with what I would expect from the sample.
I have verified that when the model shows no convergence problems, the results of fixef(MM) and VarCorr(MM) exactly match those returned by summary(allFit(MM)).
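For reference, a minimal sketch of that comparison; the component names below follow the summary method for allFit objects in recent lme4 versions:

library(lme4)
aa <- allFit(MM)   # refit MM with every available optimizer
ss <- summary(aa)
ss$which.OK        # optimizers that completed without error
ss$fixef           # fixed-effect estimates, one row per optimizer
ss$sdcor           # random-effect SDs and correlations per optimizer
fixef(MM)          # compare with the original single fit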
I have run the test with the latest available version, 1.1.29, and with 1.1.28, and the same thing happens in both.
I'm sorry I can't provide the dataset, and I apologize in advance if this has already been asked, because I've searched a lot but didn't find this bug.

Fitting random factors for a linear model using lme4

I have 4 random factors and want to fit a linear mixed model with lme4, but I have struggled to specify the model.
Assume A is nested within B (2 levels), which in turn is nested within each of xx preceptors (P). All responded to xx questions (M).
I want to fit my model to get variances for each factor and their interactions.
I have used the following code to fit the model, but I was unsuccessful.
lme4::lmer(value ~ A +
             (1 + A | B) +
             (1 + P | A),
           (1 + P | M),
           data = myData, na.action = na.exclude)
I also read some interesting material here, but I still struggle to fit the model. Any help?
At a guess, if the nesting structure is ( P (teachers) / B (occasions) / A (participants) ), meaning that the occasions for one teacher are assumed to be completely independent of the occasions for any other teacher, and that participants in turn are never shared across occasions or teachers, but questions (M) are shared across all teachers and occasions and participants:
value ~ 1 + (1 | P/B/A) + (1 | M)
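The nesting shorthand expands to explicit interaction terms, so (under the assumptions above) these two calls specify the same model; a sketch, assuming myData from the question:

library(lme4)
# (1 | P/B/A) is shorthand for an intercept at every level of the nesting
m_short <- lmer(value ~ 1 + (1 | P/B/A) + (1 | M), data = myData)
m_long  <- lmer(value ~ 1 + (1 | P) + (1 | P:B) + (1 | P:B:A) + (1 | M),
                data = myData)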
Some potential issues:
as you hint in the comments, it may not be practical to fit random effects for factors with small numbers of levels (say, < 5); this is likely to lead to the dreaded "singular model" message (see the GLMM FAQ for more detail).
if all of the questions (M) are answered by every participant, then in principle it's possible to fit a model that takes account of the among-question correlation within participants: the maximal model would be ~ 1 + (M | P / B / A) (which would look for among-question correlations at the level of teacher, occasion within teacher, and participant within occasion within teacher). However, this is very unlikely to work in practice (especially if each participant answers each question only once, in which case the teacher:occasion:participant:question variance will be confounded with the residual variance in a linear model). In this case, you will get an error about "probably unidentifiable": see e.g. this question for more explanation/detail.

lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to?

I am trying to run a mixed-effects model that predicts F2_difference with the rest of the columns as predictors, but I get an error message that says
fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients.
From this link, Fixed-effects model is rank deficient, I think I should use findLinearCombos in the R package caret. However, when I try findLinearCombos(data.df), it gives me the error message
Error in qr.default(object) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In qr.default(object) : NAs introduced by coercion
My data does not have any NAs, so what could be causing this? (Sorry if the answer is very obvious; I am new to R.)
All of my data are factors except the numerical value that I am trying to predict. Here is a small sample of my data.
sex <- c("f", "m", "f", "m")
nasal <- c("TRUE", "TRUE", "FALSE", "FALSE")
vowelLabel <- c("a", "e", "i", "o")
speaker <- c("Jim", "John", "Ben", "Sally")
word_1 <- c("going", "back", "bag", "back")
type <- c("coronal", "coronal", "labial", "velar")
F2_difference <- c(345.6, -765.8, 800, 900.5)
data.df <- data.frame(sex, nasal, vowelLabel, speaker,
                      word_1, type, F2_difference,
                      stringsAsFactors = TRUE)
Edit:
Here is some more code, if it helps.
formula <- F2_difference ~ sex + nasal + type + vowelLabel +
           type*vowelLabel + nasal*type +
           (1 | speaker) + (1 | word_1)
lmer(formula, REML = FALSE, data = data.df)
Editor edit:
The OP did not provide enough test data to allow an actual run of the model in lmer for the reader, but this is not too big an issue. This is still a very good post!
You are slightly over-concerned with the warning message:
fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients.
It is a warning, not an error. There is neither a misuse of lmer nor an ill-specification of the model formula, so you will still obtain an estimated model. But to answer your question, I shall strive to explain it.
During execution of lmer, your model formula is broken into a fixed effect formula and a random effect formula, and for each a model matrix is constructed. Construction for the fixed one is via the standard model matrix constructor model.matrix; construction for the random one is complicated but not related to your question, so I just skip it.
For your model, you can check what the fixed effect model matrix looks like by:
fix.formula <- F2_difference ~ sex + nasal + type + vowelLabel +
               type*vowelLabel + nasal*type
X <- model.matrix(fix.formula, data.df)
All your variables are factors, so X will be binary. Though model.matrix applies contrasts for each factor and their interactions, it is still possible that X does not end up with full column rank, as one column may be a linear combination of some others (either exactly or numerically). In your case, some levels of one factor may be nested within some levels of another.
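As an aside, this is most likely why findLinearCombos failed in the question: it expects a numeric matrix, so it should be applied to X rather than to the raw data frame of factors. A sketch, assuming the caret package is installed:

library(caret)
X <- model.matrix(fix.formula, data.df)
combos <- findLinearCombos(X)
combos$linearCombos   # groups of linearly dependent columns
combos$remove         # column indices caret suggests removing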
Rank deficiency can arise in many different ways. The other answer shares a CrossValidated answer offering substantial discussion, on which I will make some comments.
For case 1, people can actually do feature selection via, say, the LASSO.
Cases 2 and 3 are related to the data collection process. A good design of experiment is the best way to prevent rank-deficiency, but for many people who build models the data are already there and no improvement (like getting more data) is possible. However, I would like to stress that even for a dataset without rank-deficiency, we can still run into this problem if we don't use the data carefully. For example, cross-validation is a good method for model comparison; to do it we need to split the complete dataset into a training set and a test set, but without care we may get a rank-deficient model from the training set.
Case 4 is a big problem that may be completely out of our control. Perhaps a natural choice is to reduce model complexity, but an alternative is to try penalized regression.
Case 5 is a numerical concern leading to numerical rank-deficiency, and this is a good example.
Cases 6 and 7 reflect the fact that numerical computations are performed in finite precision. Usually these won't be an issue if case 5 is dealt with properly.
So, sometimes we can work around the deficiency, but it is not always possible. Thus, any well-written model-fitting routine, like lm, glm, or mgcv::gam, will apply a QR decomposition to X and use only its full-rank subspace, i.e., a maximal subset of X's columns spanning a full-rank space, for estimation, fixing the coefficients associated with the remaining columns at 0 or NA. The warning you got implies exactly this: there are originally ncol(X) coefficients to estimate, but due to the deficiency only ncol(X) - 7 will be estimated, with the rest being 0 or NA. Such a numerical workaround ensures that a least-squares solution can be obtained in the most stable manner.
To better digest this issue, you can use lm to fit a linear model with fix.formula.
fix.fit <- lm(fix.formula, data.df, method = "qr", singular.ok = TRUE)
method = "qr" and singular.ok = TRUE are the defaults, so we don't actually need to set them. But if we specify singular.ok = FALSE, lm will stop and complain about rank-deficiency.
lm(fix.formula, data.df, method = "qr", singular.ok = FALSE)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# singular fit encountered
You can then check the returned values in fix.fit.
coef <- fix.fit$coef
p <- length(coef)
no.NA <- sum(is.na(coef))
rank <- fix.fit$rank
It is guaranteed that p = ncol(X), but you should see no.NA = 7 and rank + no.NA = p.
Exactly the same thing happens inside lmer; the difference is that lm does not report the deficiency while lmer does. This is in fact informative, as all too often I see people asking why lm returns NA for some coefficients.
Update 1 (2016-05-07):
Let me see if I have this right: The short version is that one of my predictor variables is correlated with another, but I shouldn't worry about it. It is appropriate to use factors, correct? And I can still compare models with anova or by looking at the BIC?
Don't worry about the use of summary or anova. These methods are written so that the correct number of parameters (degrees of freedom) is used to produce valid summary statistics.
Update 2 (2016-11-06):
Let's also hear what the package author of lme4 would say: rank deficiency warning mixed model lmer. Ben Bolker has mentioned caret::findLinearCombos there, too, particularly because the OP there wants to address the deficiency issue himself.
Update 3 (2018-07-27):
Rank-deficiency is not a problem for valid model estimation and comparison, but it could be a hazard in prediction. I recently composed a detailed answer with simulated examples on CrossValidated: R lm, Could anyone give me an example of the misleading case on "prediction from a rank-deficient"? So yes, in theory we should avoid rank-deficient estimation. But in reality there is no so-called "true model": we try to learn it from data, and we can never compare an estimated model with the "truth"; the best bet is to choose the best from a number of models we have built. So if the "best" model ends up rank-deficient, we can be skeptical about it, but probably there is nothing we can do immediately.
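To see the hazard concretely, a minimal sketch using the lm fit from above (the toy data are far too small for a meaningful fit, but enough to trigger the behavior):

fix.fit <- lm(fix.formula, data.df)
predict(fix.fit, newdata = data.df)
# warns that prediction from a rank-deficient fit may be misleading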
This response does an excellent job of explaining what rank deficiency is, and what the possible causes may be.
Viz:
1. Too little data: you cannot uniquely estimate n parameters from fewer than n data points.
2. Too many points are replicates.
3. Information in the wrong places.
4. Complicated model (too many variables).
5. Units and scaling.
6. Variation in numbers: 12.001 vs. 12.005, and 44566 vs. 44555.
7. Data precision: even double-precision variables have limits.

Error message: Error in fn(x, ...) : Downdated VtV is not positive definite

I'm trying to use the lmer function to create a minimum adequate model. My model is Mated ~ Size * Attempts * Status + (random factor).
as.logical(Mated)
as.numeric(Size)
as.factor(Attempts)
as.factor(Status)
(These have all worked on previous models)
So after all that I try running my model:
Model1 <- lmer(Mated ~ Size*Status*Attempts + (1|FemaleID), data = mydata)
And it runs without fault. It's only when I try to apply this update that it goes wrong:
Model2<-update(Model1, REML=FALSE)
Here is the error message supplied:
Error in fn(x, ...) : Downdated VtV is not positive definite
If I make a third model without the three-way interaction and run an ANOVA between that and Model1, it says the two are significantly different.
Model3 <- update(Model1, ~ . - Size:Status:Attempts)
anova(Model1, Model3)
What am I doing wrong? Is the three-way interaction really significant, or have I made some mistake?
Thank you
If Mated is binary, then you should probably be using glmer with a logit or probit link function instead, something like:
model <- glmer(Mated ~ Size * Status * Attempts + (1|FemaleID),
               data = mydata, family = binomial)
It would help if you could let us know what your data looks like (head(mydata) might be fine, or see here for how to make a reproducible example).
Also, I would avoid making Mated logical (see this question and answer for how it can make your life more difficult). Instead, as.factor(Mated) will explicitly make your response variable discrete.
After that, you can compare your full and reduced models with anova().
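Putting those steps together, a minimal sketch of the suggested workflow, assuming mydata from the question (both models are fit by maximum likelihood, so anova() gives a likelihood-ratio test):

library(lme4)
mydata$Mated <- as.factor(mydata$Mated)   # discrete response, not logical
model_full <- glmer(Mated ~ Size * Status * Attempts + (1 | FemaleID),
                    data = mydata, family = binomial)
model_red  <- update(model_full, . ~ . - Size:Status:Attempts)
anova(model_full, model_red)   # tests the three-way interaction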

adding interaction to GEE - model matrix is rank deficient

I am trying to run a GEE model using geepack. I have done this successfully, using the call below.
Call:
geeglm(formula = pdc1 ~ country + post + time_post + TIME + age + sex +
         country*time_post + country*post,
       family = gaussian("identity"), data = lipid_data,
       id = id, waves = ID, corstr = "ar1", std.err = "san.se")
where:
pdc1=numeric
country=factor
post=factor
time_post=numeric
TIME=numeric
I'm trying to run the exact same model on different data, which are in the exact same format as above. I can run the model with main effects, but not with the interactions. This is the error I get:
Error in geeglm(pdc1 ~ STATE + post + time_post + TIME + STATE * post, :
Model matrix is rank deficient; geeglm can not proceed
I have tried recoding STATE as a numeric variable (and post as well), but this does not prove fruitful. I don't understand what's going on; the variables hold the exact same data as in the first model and are coded the same way. Does anyone know what could be going on here?
I recently solved a similar problem: it was related to one of the covariates being constant across the dataset. For example, STATE might be the same state for every observation, which may be causing the "rank deficient" error. This might also explain why the error is specific to your new dataset.
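A quick way to check for this; a sketch, where newdata stands in for the second dataset that fails:

# A variable with only one distinct non-missing value is constant and
# makes the model matrix rank deficient once its interactions enter.
vars <- c("STATE", "post", "time_post", "TIME")
sapply(newdata[vars], function(x) length(unique(na.omit(x))))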
I am not sure if you encountered the same problem as mine. I was fitting a geeglm model in R and ran into the same error.
My model is something like:
geeglm(outCC$counts ~ outCC$partage:as.factor(outCC$contage) - 1,
       family = poisson(link = "log"), id = outCC$partid,
       data = outCC, subset = sel, weights = outCC$partwe)
This model runs without any problem.
However, when I tried to add another binary covariate, setting (Urban/Rural), to the model:
geeglm(outCC$counts ~ outCC$partage:as.factor(outCC$contage) + as.factor(setting) - 1,
       family = poisson(link = "log"), id = outCC$partid,
       data = outCC, subset = sel, weights = outCC$partwe)
R gave me the error:
Model matrix is rank deficient; geeglm can not proceed
Then I decided to create a new binary variable, setting_new (0/1), and put this new variable in the model:
geeglm(outCC$counts ~ outCC$partage:as.factor(outCC$contage) + setting_new - 1,
       family = poisson(link = "log"), id = outCC$partid,
       data = outCC, subset = sel, weights = outCC$partwe)
And now the problem is solved.
I got the same problem when trying to add another categorical variable to the above model, with the syntax:
as.factor(Q12_travel)
and the model could not run until I created new dummy variables (0/1) and put them back into the model.
You may have a different problem, but I suggest trying this approach to see if it solves yours.
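For concreteness, a minimal sketch of the dummy-variable workaround described above, assuming setting has the two levels "Rural" and "Urban":

# Hypothetical recoding: an explicit 0/1 dummy instead of as.factor(setting)
outCC$setting_new <- as.integer(outCC$setting == "Urban")   # 0 = Rural, 1 = Urban
# then refit with "+ setting_new" in the formula, as in the call above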
