LMEM: Chi-square = 0 , prob = 1 - what's wrong with my code? - r

I'm running an LMEM (linear mixed effects model) on some data and comparing the models (in pairs) with the anova function. However, on a particular subset of the data, I'm getting nonsense results.
This is my full model:
m3_full <- lmer(totfix ~ psource + cond + psource:cond +
(1 + cond | subj) + (1 + psource + cond | object), data, REML=FALSE)
And this is the model I'm comparing it to: (basically dropping out one of the main effects)
m3_psource <- lmer (totfix ~ psource + cond + psource:cond -
psource + (1 + cond | subj) + (1 + psource + cond | object),
data, REML=FALSE)
Running the anova() function (anova(m3_full, m3_psource)) returns Chisq = 0, Pr(>Chisq) = 1.
I'm doing the same for a few other LMEMs and everything seems fine; it's just this particular response variable that gives me the weird chi-square and probability values. Does anyone have an idea why, and how I can fix it? Any help will be much appreciated!

This is not really a mixed-model-specific question: rather, it has to do with the way that R constructs model matrices from formulas (and, possibly, with the logic of your model comparison).
Let's narrow it down to the comparison between
form1 <- ~ psource + cond + psource:cond
and
form2 <- ~ psource + cond + psource:cond - psource
(which is equivalent to ~cond + psource:cond). These two formulas give equivalent model matrices, i.e. model matrices with the same number of columns, spanning the same design space, and giving the same overall goodness of fit.
Making up a minimal data set to explore:
dd <- expand.grid(psource=c("A","B"),cond=c("a","b"))
What constructed variables do we get with each formula?
colnames(model.matrix(form1,data=dd))
## [1] "(Intercept)" "psourceB" "condb" "psourceB:condb"
colnames(model.matrix(form2,data=dd))
## [1] "(Intercept)" "condb" "psourceB:conda" "psourceB:condb"
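We get the same number of columns either way: R compensates for the missing psource main effect by adding a psourceB:conda column, so the two models are reparameterizations of each other. That is why the likelihood ratio test reports Chisq = 0 (on 0 degrees of freedom) and p = 1.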
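One way to check this is sketched below with plain lm() and a made-up response. The replication factor, the seed and the y column are assumptions for illustration only, not part of the original data.

```r
# Made-up balanced data: 5 replicates of each psource x cond cell
set.seed(1)
dd <- expand.grid(psource = c("A", "B"), cond = c("a", "b"))
dd <- dd[rep(seq_len(nrow(dd)), each = 5), ]
rownames(dd) <- NULL
dd$y <- rnorm(nrow(dd))  # arbitrary response

X1 <- model.matrix(~ psource + cond + psource:cond, data = dd)
X2 <- model.matrix(~ cond + psource:cond, data = dd)

# If both matrices span the same space, stacking them adds no rank
rank_full     <- qr(X1)$rank
rank_reduced  <- qr(X2)$rank
rank_combined <- qr(cbind(X1, X2))$rank

# Same column space implies identical fitted values
fit1 <- lm(y ~ psource + cond + psource:cond, data = dd)
fit2 <- lm(y ~ cond + psource:cond, data = dd)
same_fit <- isTRUE(all.equal(fitted(fit1), fitted(fit2)))

c(rank_full, rank_reduced, rank_combined)  # all equal to 4
same_fit                                   # TRUE
```

The same reasoning carries over to the fixed-effects part of an lmer fit: two formulas with identical column spaces cannot be distinguished by a likelihood ratio test.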
There are two possible responses to this problem.
There is one school of thought (typified by Nelder, Venables, etc.: e.g. see Venables' famous (?) but unpublished exegeses on linear models, section 5, or Wikipedia on the principle of marginality) that says that it doesn't make sense to try to test main effects in the presence of interaction terms, which is what you're trying to do.
There are occasional situations (e.g. in a before-after-control-impact design where the 'before' difference between control and impact is known to be zero due to experimental protocol) where you really do want to do this comparison. In this case, you have to make up your own dummy variables and add them to your data, e.g.
## set up model matrix and drop intercept and "psourceB" column
dummies <- model.matrix(form1,data=dd)[,-(1:2)]
## d='dummy': avoid colons in column names
colnames(dummies) <- c("d_cond","d_source_by_cond")
colnames(model.matrix(~d_cond+d_source_by_cond,data.frame(dd,dummies)))
## [1] "(Intercept)" "d_cond" "d_source_by_cond"
This is a nuisance. My guess at the reason for this being difficult is that the original authors of R and S before it were from school of thought #1, and figured that generally when people were trying to do this it was a mistake; they didn't make it impossible, but they didn't go out of their way to make it easy.
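To confirm that the hand-built dummies really do remove one column from the design, here is a minimal sketch using lm() on made-up data (the response y, the seed and the replication are assumptions for illustration). With lmer() the same dummy columns would go into the fixed-effects part of the formula.

```r
set.seed(2)
dd <- expand.grid(psource = c("A", "B"), cond = c("a", "b"))
dd <- dd[rep(seq_len(nrow(dd)), each = 10), ]
rownames(dd) <- NULL
dd$y <- rnorm(nrow(dd))  # arbitrary response

# Build the dummies as in the answer: drop intercept and "psourceB"
dummies <- model.matrix(~ psource + cond + psource:cond, data = dd)[, -(1:2)]
colnames(dummies) <- c("d_cond", "d_source_by_cond")
dd2 <- data.frame(dd, dummies)

full    <- lm(y ~ psource + cond + psource:cond, data = dd2)
reduced <- lm(y ~ d_cond + d_source_by_cond, data = dd2)

n_coef_full    <- length(coef(full))     # 4
n_coef_reduced <- length(coef(reduced))  # 3

# The comparison now has one degree of freedom instead of zero
cmp <- anova(reduced, full)
df_diff <- cmp$Df[2]                     # 1
```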

Related

R Syntax Simple Slopes MEM

Question regarding the syntax of a mixed effects model on R.
I have run the following code to examine the simple slope to determine the effect of one of my variables (variability) within another one of my variables (ambiguity):
lmer.E1.v2 <- lmer(logRT ~ Variability.c / Ambiguity.c + (Variability.c + Ambiguity.c|ID),
data=data %>% filter(Experiment == "E1"),
control=lmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)))
summary(lmer.E1.v2)
When I reverse these two variables, so that the code looks like this:
lmer.E1.v2 <- lmer(logRT ~ Ambiguity.c / Variability.c + (Ambiguity.c + Variability.c|ID),
data=data %>% filter(Experiment == "E1"),
control=lmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)))
summary(lmer.E1.v2)
... I get different output from the first block of code than from the second. What is the difference in interpretation when I reverse the order of my two variables in the syntax?
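The primary issue is that the / operator is not commutative (i.e. a/b != b/a): a/b expands to a + a:b, while b/a expands to b + a:b. If a and b are factors, you should get the same overall fit (predictions, likelihood, etc.), at least up to some degree of numeric fuzz, but the model parameterization will be different. If they are numeric (e.g. centered codes, as the .c suffix may indicate), the two formulas drop different main effects and are genuinely different models, so the fits can differ as well.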
There do exist cases where (a+b|g) gives different answers from (b+a|g) (see here), but this is unusual.
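The expansion of / can be inspected directly with terms(), without fitting anything. The sketch below also compares fitted values under both orderings, once with factors and once with numeric codes; the data, variable names and effect size are made up for illustration, and lm() stands in for lmer().

```r
# What each ordering expands to
labs_ab <- attr(terms(~ a / b), "term.labels")
labs_ba <- attr(terms(~ b / a), "term.labels")
labs_ab  # "a" "a:b"
labs_ba  # the main effect is now "b"; the second term is still the a-by-b interaction

# Made-up data: two 2-level factors plus +/- 0.5 numeric codes for each
set.seed(3)
d <- expand.grid(a = c("lo", "hi"), b = c("x", "y"))
d <- d[rep(seq_len(nrow(d)), each = 10), ]
rownames(d) <- NULL
d$a_num <- ifelse(d$a == "hi", 0.5, -0.5)
d$b_num <- ifelse(d$b == "y", 0.5, -0.5)
d$resp  <- 2 * d$a_num + rnorm(nrow(d))

# Factors: both orderings span all four cell means, so the fits agree
same_fit_factors <- isTRUE(all.equal(
  fitted(lm(resp ~ a / b, data = d)),
  fitted(lm(resp ~ b / a, data = d))))

# Numeric codes: a + a:b and b + a:b omit different main effects
same_fit_numeric <- isTRUE(all.equal(
  fitted(lm(resp ~ a_num / b_num, data = d)),
  fitted(lm(resp ~ b_num / a_num, data = d))))

same_fit_factors  # TRUE
same_fit_numeric  # FALSE
```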

Combining grep() family of functions with a conditional if statement

I'm using LASSO as a variable selection method for my analysis, but there's one particular variable that I wish to ensure is contained in the final formula. I have automated the entire process to return the variables that LASSO selects and spits them into a character string formula e.g. formula = y~x1+x2+x3+... However there is one variable in particular I would like to keep in the formula even if LASSO does not select it. Now I could easily manually add this variable to the formula after the fact, but in the interest of improving my R skills I'm trying to automate the entire process.
My thoughts to achieve my goal so far was nesting the grep() function inside an ifelse() statement e.g. ifelse(grep("variable I'm concerned with",formula)!=1, formula=formula,formula=paste0(formula,'variable I'm concerned with',collapse="+")) but this has not done the trick.
Am I on the right track or can anyone think of alternative routes to take?
According to the glmnet documentation:
penalty.factor - Separate penalty factors can be applied to each
coefficient. This is a number that multiplies lambda to allow
differential shrinkage. Can be 0 for some variables, which implies no
shrinkage, and that variable is always included in the model. Default
is 1 for all variables (and implicitly infinity for variables listed
in exclude). Note: the penalty factors are internally rescaled to sum
to nvars, and the lambda sequence will reflect this change.
So apply this as an argument to glmnet using a penalty factor of 0 for your "key coefficient" and 1 elsewhere.
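A minimal sketch of that suggestion, assuming the glmnet package is installed. The simulated data, the seed and the choice of "x5" as the must-keep variable are all hypothetical.

```r
library(glmnet)

# Simulated data: only x1 truly affects y; x5 is the variable we insist on keeping
set.seed(4)
n <- 100
p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 2 * x[, "x1"] + rnorm(n)

must_keep <- "x5"  # hypothetical "key" variable

# Penalty factor 0 for the key variable, 1 for everything else
pf <- ifelse(colnames(x) == must_keep, 0, 1)

fit <- glmnet(x, y, penalty.factor = pf)

# Rows are coefficients, columns are values of lambda along the path
coefs <- as.matrix(coef(fit))
always_in <- all(coefs[must_keep, ] != 0)
always_in
```

Because the key variable is unpenalized, its coefficient should stay in the model along the whole lambda path, so no post-hoc editing of the formula is needed.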
A formula is not a character object, but you might want to explore terms.formula if your goal is to edit formulas directly based on character output. terms objects are a powerful way of subsetting and selecting variables. But you really need to explore them, because the formula language was not meant to be automated easily; rather, it was meant to be a convenient and readable way to specify model fits (look at how difficult SAS is by comparison).
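f <- y ~ x1 + x2
tt <- terms(f) ## named 'tt' rather than 't' to avoid masking base::t()
## drop 'x2'
i.x2 <- match('x2', attr(tt, 'term.labels'))
tt <- tt[-i.x2] ## drop the variable; the `[` method for "terms" objects takes a single index
## tt is still a "terms" object, but `lm` and related functions accept it in place of a "formula" object.
lm(tt) ## assumes y and x1 exist in the calling environment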
Currently, you are trying to treat a character string as a formula object, which will not work because the two are different types. Instead, consider stats::update, which will not duplicate a term that is already in the formula:
lasso_formula <- as.formula("y ~ x1 + x2 + x3")
# EXISTING TERM
lasso_formula <- update(lasso_formula, ~ . + x3)
lasso_formula
# y ~ x1 + x2 + x3
# NEEDED VARIABLE
lasso_formula <- update(lasso_formula, ~ . + myTerm)
lasso_formula
# y ~ x1 + x2 + x3 + myTerm
If the formula is a character string, be sure to use grepl (not grep) in ifelse. And do not assign with = inside ifelse: it is a function that returns a value, not to be confused with if...else:
lasso_formula <- "y ~ x1 + x2 + x3"
lasso_formula <- ifelse(grepl("myterm", lasso_formula),
lasso_formula,
paste(lasso_formula, "+ myterm"))
lasso_formula
# [1] "y ~ x1 + x2 + x3 + myterm"
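One caveat with string matching is that grepl() looks for substrings, so a search for "x1" also matches "x10". A slightly more robust sketch checks the term labels of the formula instead; the helper name ensure_term is hypothetical and not part of any package.

```r
# Hypothetical helper: add `term` to formula `f` only if it is not already a term
ensure_term <- function(f, term) {
  labels <- attr(terms(f), "term.labels")
  if (term %in% labels) f else update(f, paste("~ . +", term))
}

f <- y ~ x1 + x2 + x3
f_same  <- ensure_term(f, "x3")      # already present: returned unchanged
f_added <- ensure_term(f, "myterm")  # missing: appended

f_same   # y ~ x1 + x2 + x3
f_added  # y ~ x1 + x2 + x3 + myterm

# Why substring matching can mislead: "x1" is found inside "x10"
false_hit <- grepl("x1", "y ~ x10 + x2")
false_hit  # TRUE
```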

Trying to fit a response surface with R's formula notation

I have a set of a couple of dozen numeric variables and am trying to figure out how to compactly express a quadratic form in those variables. I also want to include the variables themselves. The idea here is that we are fitting a response surface, rather than interacting a group of treatments, as the standard R formula notation seems to assume. I am trying to get appropriate expressions turned into an R formula, suitable for estimation by different techniques, with different data sets, or over different periods.
If there is an explicit statement of how R's formula notation works, anywhere, I have not been able to find it. There is an ancient paper from which R supposedly copied the notation, but it is by no means identical to current R usage. Every other description I have found just gives examples, that do not cover every case -- not even close to every case.
So, just as an example, here I try to construct a quadratic form in three variables, without writing out all the pairs by hand with an I() around each pair.
library(tidyverse)
A <- B <- C <- 1:10
LHS <- 1:10 * 600
tb <- tibble(LHS, A, B, C)
my_eq <- as.formula(LHS ~ I(A + B + C)*I(A + B + C))
I have not found any way to tell if I have succeeded. Neither my_eq nor terms(my_eq) seems at all enlightening.
For example, can one predict whether
identical(as.formula(LHS ~ I(A + B + C)*I(A + B + C)), as.formula(LHS ~ I((A + B + C)*(A + B + C))))
is true or false? I can not even guess. Or to take an even simpler case, is ~ A * I(A) equal to A, I(A^2), or something else? And how would you know?
To restate my question, I would like either a full statement of how R's formula notation works, adequate to cover every case and predict what each would mean, or, failing that, a straightforward way of producing an expansion of any existing formula into all the atomic terms for which coefficients will be estimated.
This may not answer your question, but I'll post this anyway since I think it may help a little.
The I function inhibits the interpretation of operators such as "+", so your formula is probably not going to do what you expect it to do. For example, the results of lm(my_eq) will be the same as the results of doing the following:
D <- A + B + C
lm(LHS ~ D * D)
And then you may as well just do lm(LHS ~ D).
For your question, I believe John Maindonald wrote a good book that explains R formulas for many situations. But it's in my office and today is a Sunday.
Edit: For the expansion, I believe you have to fit the model and then look at the call or the terms:
> my_eq <- as.formula(LHS ~ (A + B + C) * (A + B + C))
> my_formula <- lm(my_eq)
> attr(terms(my_formula), "term.labels")
[1] "A" "B" "C" "A:B" "A:C" "B:C"
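As far as I can tell, the expansion does not actually require a fit: terms() can be applied to the formula itself, and its "term.labels" attribute lists the atomic terms. The sketch below uses a hypothetical helper expand_terms and builds a full quadratic surface (linear terms, cross products and squares) from a vector of variable names, which is one possible way to avoid writing the pairs by hand.

```r
# Hypothetical helper: list the terms a formula expands to, without fitting
expand_terms <- function(f) attr(terms(f), "term.labels")

# The question's small case
simple_terms <- expand_terms(~ A * I(A))
simple_terms  # "A" "I(A)" "A:I(A)"

# Quadratic response surface built from variable names:
# (A + B + C)^2 gives main effects and pairwise products, I(x^2) adds the squares
vars <- c("A", "B", "C")
rhs <- paste0("(", paste(vars, collapse = " + "), ")^2 + ",
              paste0("I(", vars, "^2)", collapse = " + "))
quad <- as.formula(paste("LHS ~", rhs))
quad_terms <- expand_terms(quad)
quad_terms  # 9 labels: 3 linear terms, 3 squares, 3 cross products
```

The squares need I() because ^ has a special meaning in formulas: without it, A^2 collapses back to A.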

Grouping Variables in Multilevel Linear Models

I am trying to learn hierarchical models in R and I have generated some sample data for myself. I am having trouble with the correct syntax for coding a multilevel regression problem.
I generated some data for salaries in a Business school. I made the salaries depend linearly on the number of years of employment and the total number of publications by the faculty member. The faculty are in various departments and I made the base salary(intercept) different for each department and also the yearly hike(slopes) different for each department. This way, I have the intercept (base salary) and slope(w.r.t experience in number of years) of the salary depend on the nested level (department) and slope w.r.t another explanatory variable (Publications) not depend on the nested level. What would be the correct syntax to model this in R?
here's my data
Data <- data.frame(Sl_No = 1:40,
                   Dept = as.factor(sample(c("Mark","IT","Fin"), 40, replace = TRUE)),
                   Years = round(runif(40, 1, 10)))
pubs <- round(Data$Years * runif(40, 1, 3))
Data$Pubs <- pubs
lookup_table <- data.frame(Dept = c("Mark","IT","Fin","Strat","Ops"),
                           base = c(100000,140000,150000,150000,120000),
                           slope = c(6000,5000,3000,2000,4000))
Data <- merge(Data, lookup_table, by = 'Dept')
## store the response in the data frame so that the model formulas below can find it
Data$Salary <- Data$base + Data$slope*Data$Years + Data$Pubs*10000 + rnorm(nrow(Data))*10000
Data$base <- NULL
Data$slope <- NULL
I have tried the following:
1)
multilevel_model<-lmer(Salary~1|Dept+Pubs+Years|Dept, data = Data)
Error in model.matrix.default(eval(substitute(~foo, list(foo = x[[2]]))), :
model frame and formula mismatch in model.matrix()
2)
multilevel_model<-lmer(`Salary`~ Dept + `Pubs`+`Years`|Dept , data = Data)
boundary (singular) fit: see ?isSingular
I want to see the estimates of the salary intercept and yearly hike by Dept and the estimate of the effect of publication as a standalone (pooled). Right now I am not getting the code to work at all.
I know the base salary and the yearly hike by dept and the effect of a publication (since I generated it).
Dept base Slope
Fin 150000 3000
Mark 100000 6000
Ops 120000 4000
IT 140000 5000
Strat 150000 2000
Every publication increases the salary by 10,000.
ANSWER:
Thanks to @Ben's answer here, I think the correct model is
multilevel_model<-lmer(Salary~(1|Dept)+ Pubs +(0+Years|Dept), data = Data)
This gives me the following fixed effects by running
summary(multilevel_model)
Fixed effects:
Estimate Std. Error t value
(Intercept) 131667.4 10461.0 12.59
Pubs 10235.0 550.8 18.58
Correlation of Fixed Effects:
Pubs -0.081
The Department level coefficients are as follows:
coef(multilevel_model)
$Dept
Years (Intercept) Pubs
Fin 3072.5133 148757.6 10235.02
IT 5156.6774 136710.7 10235.02
Mark 5435.8301 102858.3 10235.02
Ops 3433.1433 118287.1 10235.02
Strat 963.9366 151723.1 10235.02
These are pretty good estimates of the original values. Now I need to learn to assess "how good" they are. :)
(1)
multilevel_model<-lmer(`Total Salary`~ 1|Dept +
`Publications`+`Years of Exp`|Dept , data = sample_data)
I can't immediately diagnose why this gives an error, but parentheses are generally recommended around random-effect terms because the | operator has low precedence in formulas (lower than +), so an unparenthesized bar swallows everything on either side of it. Thus the right-hand side (RHS) formula
~ (1|Dept) + (`Publications`+`Years of Exp`|Dept)
might work, except that it would be problematic because both terms contain the same intercept term: if you wanted to do this you'd probably need
~ (1|Dept) + (0+`Publications`+`Years of Exp`|Dept)
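How R groups an unparenthesized bar can be checked by walking the parsed formula, since a formula is just a call object. This is a small base-R sketch; the formulas use the Salary, Pubs and Years names from the question and nothing is fitted.

```r
# Formula from attempt (1), without parentheses
f_bare <- Salary ~ 1 | Dept + Pubs + Years | Dept
rhs <- f_bare[[3]]                    # right-hand side of the ~

top_op    <- as.character(rhs[[1]])   # outermost operator
left_arg  <- deparse(rhs[[2]])        # everything left of the last bar
right_arg <- deparse(rhs[[3]])        # everything right of the last bar

top_op     # "|"
left_arg   # "1 | Dept + Pubs + Years"
right_arg  # "Dept"

# With parentheses, the outermost operator is +, and each bar stays inside its own term
f_paren <- Salary ~ (1 | Dept) + Pubs + (0 + Years | Dept)
top_op_paren <- as.character(f_paren[[3]][[1]])
top_op_paren  # "+"
```

In the unparenthesized version, Dept + Pubs + Years is grouped first and the whole right-hand side becomes one nested bar expression, which is not what was intended.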
(2)
~ Dept + `Publications`+`Years of Exp`|Dept
It doesn't really make any sense to put the same variable (Dept) on both the left- and right-hand sides of the bar.
You should probably use
~ pubs + years_exp + (1 + years_exp|Dept)
Since in principle the effect of publication could vary across departments, the maximal model would be
~ pubs + years_exp + (1 + pubs + years_exp|Dept)
It rarely makes sense to include a random effect without its corresponding fixed effect.
Note that you may get singular fits even if you have the right model; see the ?isSingular man page.
If the 18 observations listed above represent your whole data set, it's very likely too small to fit the maximal model successfully. A rule of thumb is that you need 10-20 observations per parameter estimated, and the maximal model has (intercept + 2 fixed-effect params + (3*4)/2=6 random-effect parameters) = 9 parameters. (Since it's simulated, you can easily simulate a big data set ...)
I'd recommend renaming variables in your data frame so you don't have to fuss with backtick-protecting variable names with spaces in them ...
The GLMM FAQ has more on model specification

Using a matrix of predictors with lmer()

I am attempting to fit a model with a large number of predictors, such that it would be tedious to enumerate them in a model formula. This is straightforward to do with lm():
indicatorMatrix <- data.frame(matrix(rbinom(26000, 1, 1/3), ncol = 26))
colnames(indicatorMatrix) <- LETTERS
someDV <- rnorm(nrow(indicatorMatrix))
head(indicatorMatrix)
# One method, enumerating variables by name:
olsModel1 <- lm(someDV ~ A + B + C + D, # ...etc.
data = indicatorMatrix)
# Preferred method, including the matrix of predictors:
olsModel2 <- lm(someDV ~ as.matrix(indicatorMatrix))
summary(olsModel2)
Since I have a very large number of predictors (more than the 26 in this invented example), I don't want to list them individually as in the first example (someDV ~ A + B + C + D...), and I can avoid this by just including the predictors as.matrix.
However, I want to fit a mixed effects model, like this:
library(lme4)
meModel1 <- lmer(someDV ~ (1 | A) + (1 | B) + (1 | C), # ...etc.
data = indicatorMatrix)
summary(meModel1)
Except that I want to include a large number of random effects terms. Rather than having to type (1 | A) ... (1 | ZZZ), I would like to include each predictor in a manner analogous to the matrix approach used for olsModel2 above. The following, obviously, does not work:
meModel2 <- lmer(someDV ~ (1 | as.matrix(indicatorMatrix)))
Do you have any suggestions for how I can best replicate the matrix-predictor approach for random effects with lmer()? I am very willing to consider "pragmatic" solutions (i.e. hacks), so long as they are "programmatic," and don't require me to copy & paste, etc. etc.
Thanks in advance for your time.
I think that constructing the formula as a string and then using as.formula, something along the lines of
restring1 <- paste0("(1 | ",colnames(indicatorMatrix),")",collapse="+")
form <- as.formula(paste0("someDV ~",restring1))
meModel1 <- lmer(form, data = data.frame(someDV,indicatorMatrix))
should work (it runs without complaining on my system, anyway ...)
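A variant of the same idea, sketched with base R's reformulate(), which assembles the formula from a character vector of term labels and so avoids pasting the "~" by hand. Only four made-up predictor names are used here to keep the output short; with the real data you would presumably pass colnames(indicatorMatrix).

```r
# Hypothetical subset of predictor names
predictor_names <- LETTERS[1:4]

# One random-intercept term per predictor
re_terms <- paste0("(1 | ", predictor_names, ")")

form <- reformulate(re_terms, response = "someDV")
form
# someDV ~ (1 | A) + (1 | B) + (1 | C) + (1 | D)
```

The resulting form can then be passed to lmer() exactly as in the as.formula() version.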

Resources