lme4 translate formula to code in 3-level model - r

I have been provided with the following formulas and need to find the correct lme4 code. I find this rather challenging and could not find a good example to follow... perhaps you can help?
I have two patient groups: group1 and group2. Both groups came to the lab 4 times for testing (4 VISITs) and during each of these visits, their memory was tested 4 times (4 RECALLs). The memory performance should be predicted from Age, Sex and two sleep parameters, which were assessed during each VISIT.
Thus Level 1 should be the RECALL (index i), Level 2 the VISIT (index j) and Level 3 the subject level (index k).
Level 1:
$\text{MEMSCORE}_{ijk} = \beta_{0jk} + \beta_{1jk}\,\text{RECALL}_{ijk} + R_{ijk}$
Level 2:
$\beta_{0jk} = \gamma_{00k} + \gamma_{01k}\,\text{VISIT}_{jk} + U_{0jk}$
$\beta_{1jk} = \gamma_{10k} + \gamma_{11k}\,\text{VISIT}_{jk} + U_{1jk}$
Level 3:
$\gamma_{00k} = \delta_{000} + \delta_{001}\,\text{SLEEPPARAM} + V_{0k}$
$\gamma_{10k} = \delta_{100} + \delta_{101}\,\text{SLEEPPARAM} + V_{1k}$
Thanks so much for your thoughts!

Something like this should work:
lmer(memscore ~ age + sex + sleep1 + sleep2 + (1 | visit) + (1 + sleep1 + sleep2 | subject), data = mydata)
By adding sleep1 and sleep2 in (1 + sleep1 + sleep2 | subject) you allow the effects of the two sleep parameters to vary by participant (random slopes), along with a random intercept per subject. (1 | visit) allows a random intercept for each visit (random intercepts model data where different visits had a higher or lower mean memscore), but no random slopes; I don't think you want random slopes for the sleep parameters by visit. Were they only measured once per visit? If so, there would be no slope variation to model, I believe.
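For completeness, here is a sketch of fitting and inspecting that model (assuming mydata contains columns memscore, age, sex, sleep1, sleep2, visit, and subject):

library(lme4)
fit <- lmer(memscore ~ age + sex + sleep1 + sleep2 +
              (1 | visit) + (1 + sleep1 + sleep2 | subject),
            data = mydata)
summary(fit)   # fixed effects and variance components
VarCorr(fit)   # random-effect standard deviations and correlations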
Hope that helps! I found this book very useful:
Snijders, T. A. B., & Bosker, R. J. (2012). Multilevel analysis: An introduction to basic and advanced multilevel modeling (2nd ed.). Sage.

Related

Permutation test / create a function for event dummy for different dates

This question will be difficult to formulate, but essentially I'm doing an event study for my bachelor thesis and would like to examine how significant the coefficient for my interaction effect is.
What I find difficult is that I want to run a fixed-effects regression in which the event window for the interaction effect, treatment * event, changes dates. I want to create a function that makes the event dummy take on different dates (it should span 7 consecutive days) and lets the start date of those seven days vary.
The equation looks something like this:
return = intercept + ROA + Tangibility + ... + Event + Treatment group + Event * Treatment group
Then I would like to run the fixed-effects regression with these different dummies and extract the estimates for the event dummy. This would enable me to compare the coefficient estimate obtained for the actual event date with those for the "fake" event dates.
Thank you in advance if you decide to help me :)
I have tried to find different functions and I have only found permutation tests that test the means of the two groups.
I've tried to make a function and have this:
library(plm)

set.seed(1000)
N <- 10^3 - 1                         # number of placebo permutations
resultX <- numeric(N)
days <- sort(unique(merged_df$day))   # all days in the sample

for (i in 1:N) {
  ## draw a random start so that a window of 7 consecutive days fits in the sample
  start <- sample(length(days) - 6, size = 1)
  window <- days[start:(start + 6)]
  merged_df$eventdummy <- as.integer(merged_df$day %in% window)
  ## sse * eventdummy expands to both main effects plus the interaction;
  ## a time-invariant treatment dummy is absorbed by the entity fixed effects
  fit <- plm(ret ~ sse * eventdummy + roa + leverage + mtb + tangibility,
             data = merged_df,
             index = c("entity"),
             model = "within")
  ## keep only the interaction estimate for this placebo window
  resultX[i] <- coef(fit)["sse:eventdummy"]
}

lme4: Handling lmer "convergence code: 0"

I am currently fitting a multilevel model with 32 countries (country variable "CNTRY3"). The dependent variable is the willingness to pay for environmental protection, "WTP" (scale 1-5, centered). In the third step (random intercept, random slope) of the multilevel analysis I have included four random slopes (I would like these to vary based on my theory):
RINC_ALL_z -> income per person (z-standardized)
social_trust_cen (scale 1-5, centered) -> trust in other people
political_trust_cen (scale 1-5, centered) -> trust in politics
EC_cen (scale 1-5, centered) -> environmental awareness
modell.3a <- lmer(WTP ~ RINC_ALL_z + social_trust_cen + political_trust_cen +
Men + lowest_degree + middle_degree + requirement_university +
uncompleted_university + university_degree + AGE_cen + urban +
EC_cen + (RINC_ALL_z + social_trust_cen + political_trust_cen +
EC_cen|CNTRY3), data=ISSP2010_1)
Then the following warning appears:
convergence code: 0 Model failed to converge with max|grad| =
0.00527884 (tol = 0.002, component 1)
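(From ?convergence I understand that the max|grad| in the warning is a scaled gradient, which can be recomputed from the fitted model, roughly like this:)

derivs <- modell.3a@optinfo$derivs
max(abs(solve(derivs$Hessian, derivs$gradient)))   # compare against tol = 0.002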
I was able to find out that such "convergence warnings" can be bypassed by fitting the model with a different optimizer, for example "bobyqa". And indeed, if I fit the model like this, no convergence warning appears any more:
modell.3b <- lmer(WTP ~ RINC_ALL_z + social_trust_cen + political_trust_cen +
Men + lowest_degree + middle_degree + requirement_university +
uncompleted_university + university_degree + AGE_cen + urban +
EC_cen + (RINC_ALL_z + social_trust_cen + political_trust_cen +
EC_cen|CNTRY3), data=ISSP2010_1, control = lmerControl(optimizer = "bobyqa",
optCtrl=list(maxfun=1e5)))
I then read in a very interesting article that when you use one optimizer, you should compare it with all the other available optimizers to find out whether the choice of optimizer influences the parameters of the regression. No sooner said than done (with the variable environmental awareness): I replicated the graphs from the article, but unfortunately the optimizers do not line up in one column (log-likelihood / t-value) as they do in the article. I have attached the two pictures here. My interpretation of the log-likelihood and t-value comparisons would be that the majority (5) of the optimizers sit in one column (including the "bobyqa" I used) and only 2 optimizers differ from the majority, so my optimizer should not influence the parameters. Is that right?
[Figure: log-likelihood comparison across optimizers]
[Figure: t-value comparison across optimizers]
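For reference, this is roughly how I ran the comparison (a sketch using lme4's allFit() helper; the summary components are as documented in ?allFit):

library(lme4)
af <- allFit(modell.3a)   # refit the same model with every available optimizer
ss <- summary(af)
ss$which.OK               # which optimizers completed without an error
ss$llik                   # log-likelihood per optimizer
ss$fixef                  # fixed-effect estimates, one row per optimizer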
My first question: what does such an optimizer actually do? (One reads everywhere that you should switch optimizers to avoid convergence issues, but not what they are.)
My second question: would you agree with this interpretation of the two diagrams?
I would be very happy about an answer, I have been thinking about this for several days... :-(
Many greetings
Joern

Grouping Variables in Multilevel Linear Models

I am trying to learn hierarchical models in R and have generated some sample data for myself. I am having trouble with the correct syntax for coding a multilevel regression problem.
I generated some data for salaries in a business school. I made the salaries depend linearly on the number of years of employment and the total number of publications of the faculty member. The faculty are in various departments, and I made the base salary (intercept) and the yearly raise (slope) different for each department. This way, the intercept (base salary) and the slope (w.r.t. experience in years) of the salary depend on the nested level (department), while the slope w.r.t. the other explanatory variable (publications) does not. What would be the correct syntax to model this in R?
Here's my data:
set.seed(123)  # make the simulated data reproducible
Data <- data.frame(Sl_No = 1:40,
                   Dept  = as.factor(sample(c("Mark", "IT", "Fin"), 40, replace = TRUE)),
                   Years = round(runif(40, 1, 10)))
Data$Pubs <- round(Data$Years * runif(40, 1, 3))
lookup_table <- data.frame(Dept  = c("Mark", "IT", "Fin", "Strat", "Ops"),
                           base  = c(100000, 140000, 150000, 150000, 120000),
                           slope = c(6000, 5000, 3000, 2000, 4000))
Data <- merge(Data, lookup_table, by = "Dept")
Data$Salary <- Data$base + Data$slope * Data$Years + Data$Pubs * 10000 +
  rnorm(nrow(Data)) * 10000
Data$base  <- NULL
Data$slope <- NULL
I have tried the following:
1)
multilevel_model<-lmer(Salary~1|Dept+Pubs+Years|Dept, data = Data)
Error in model.matrix.default(eval(substitute(~foo, list(foo = x[[2]]))), :
model frame and formula mismatch in model.matrix()
2)
multilevel_model<-lmer(`Salary`~ Dept + `Pubs`+`Years`|Dept , data = Data)
boundary (singular) fit: see ?isSingular
I want to see the estimates of the salary intercept and yearly hike by Dept and the estimate of the effect of publication as a standalone (pooled). Right now I am not getting the code to work at all.
I know the base salary and the yearly hike by dept and the effect of a publication (since I generated it).
Dept   base    slope
Fin    150000   3000
Mark   100000   6000
Ops    120000   4000
IT     140000   5000
Strat  150000   2000
Every publication increases the salary by 10,000.
ANSWER:
Thanks to @Ben's answer here, I think the correct model is
multilevel_model <- lmer(Salary ~ (1 | Dept) + Pubs + (0 + Years | Dept), data = Data)
This gives me the following fixed effects by running
summary(multilevel_model)
Fixed effects:
            Estimate Std. Error t value
(Intercept) 131667.4    10461.0   12.59
Pubs         10235.0      550.8   18.58

Correlation of Fixed Effects:
     (Intr)
Pubs -0.081
The Department level coefficients are as follows:
coef(multilevel_model)
$Dept
          Years (Intercept)     Pubs
Fin   3072.5133    148757.6 10235.02
IT    5156.6774    136710.7 10235.02
Mark  5435.8301    102858.3 10235.02
Ops   3433.1433    118287.1 10235.02
Strat  963.9366    151723.1 10235.02
These are pretty good estimates of the original values. Now I need to learn to assess "how good" they are. :)
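As a first stab at assessing them, profile confidence intervals might help (a sketch; profiling can be slow and may warn for small samples):

confint(multilevel_model, method = "profile")   # CIs for fixed effects and random-effect SDs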
(1)
multilevel_model<-lmer(`Total Salary`~ 1|Dept +
`Publications`+`Years of Exp`|Dept , data = sample_data)
I can't immediately diagnose why this gives a syntax error, but parentheses are generally recommended around random-effect terms, because the | operator has lower precedence than + in formulas. Thus the right-hand-side (RHS) formula
~ (1|Dept) + (`Publications`+`Years of Exp`|Dept)
might work, except that it would be problematic because both terms contain the same intercept term: if you wanted to do this you'd probably need
~ (1|Dept) + (0+`Publications`+`Years of Exp`|Dept)
(2)
~ Dept + `Publications`+`Years of Exp`|Dept
It doesn't really make any sense to put the same variable (Dept) on both the left- and right-hand sides of the bar.
You should probably use
~ pubs + years_exp + (1 + years_exp|Dept)
Since in principle the effect of publication could vary across departments, the maximal model would be
~ pubs + years_exp + (1 + pubs + years_exp|Dept)
It rarely makes sense to include a random effect without its corresponding fixed effect.
Note that you may get singular fits even if you have the right model; see the ?isSingular man page.
If the 18 observations listed above represent your whole data set, it's very likely too small to fit the maximal model successfully. A rule of thumb is that you need 10-20 observations per parameter estimated, and the maximal model has (intercept + 2 fixed-effect params + (3*4)/2 = 6 random-effect params) = 9 parameters. (Since the data are simulated, you can easily simulate a bigger data set ...)
I'd recommend renaming variables in your data frame so you don't have to fuss with backtick-protecting variable names with spaces in them ...
The GLMM FAQ has more on model specification
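Putting this together with the simulated data above, a minimal sketch (assuming Data$Salary has been assigned as in the question and lme4 is loaded):

library(lme4)
m_max <- lmer(Salary ~ Pubs + Years + (1 + Pubs + Years | Dept), data = Data)
isSingular(m_max)   # TRUE flags a boundary fit, e.g. a zero variance component
VarCorr(m_max)      # shows which variance components (if any) collapsed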

Why is variable importance not reflected in the variables actually used in tree construction?

I generated an (unpruned) classification tree in R with the following code:
fit <- rpart(line ~ CountryCode + OrderType + Bon + SupportCode + prev_AnLP +
               prev_TXLP + prev_ProfLP + prev_EVProfLP + prev_SplLP + Age +
               Sex + Unknown.Position + Inc + Can + Pre + Mol,
             data = train.set, method = "class",
             control = rpart.control(minsplit = 5, cp = 0.001))
printcp(fit) shows:
Variables actually used in tree construction:
Age
CountryCode
SupportCode
OrderType
prev_AnLP
prev_EVProfLP
prev_ProfLP
prev_TXLP
prev_SplLP
Those are the same variables I can see at each node in the classification tree, so they are correct.
What I do not understand is the result of summary(fit):
Variable importance:
prev_EVProfLP  29
prev_AnLP      19
prev_TXLP      16
prev_SplLP     15
prev_ProfLP     9
CountryCode     7
OrderType       2
Pre             1
Mol             1
From the summary(fit) results it seems that the variables Pre and Mol are more important than SupportCode and Age, but in the tree Pre and Mol are not used to split the data, while SupportCode and Age are used (just before two leaves, actually... but still used!).
Why?
The importance of an attribute is based on the sum of the improvements in all nodes in which the attribute appears as a splitter (weighted by the fraction of the training data in each node split). Surrogates are also included in the importance calculations, which means that even a variable that never splits a node may be assigned a large importance score. This allows the variable importance rankings to reveal variable masking and nonlinear correlation among the attributes. Importance scores may optionally be confined to splitters; comparing the splitters-only and the full (splitters and surrogates) importance rankings is a useful diagnostic.
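To see this in practice, you can compare the importance scores with and without surrogates; setting maxsurrogate = 0 in rpart.control() confines the calculation to primary splitters (a sketch on a built-in data set):

library(rpart)
fit1 <- rpart(Species ~ ., data = iris, method = "class")
fit1$variable.importance   # includes surrogate contributions (the default)
fit2 <- rpart(Species ~ ., data = iris, method = "class",
              control = rpart.control(maxsurrogate = 0))
fit2$variable.importance   # primary splitters only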
Also see chapter 10 of the book 'The Top Ten Algorithms in Data Mining' for more information:
https://www.researchgate.net/profile/Dan_Steinberg2/publication/265031802_Chapter_10_CART_Classification_and_Regression_Trees/links/567dcf8408ae051f9ae493fe/Chapter-10-CART-Classification-and-Regression-Trees.pdf

LMEM: Chi-square = 0 , prob = 1 - what's wrong with my code?

I'm running LMEMs (linear mixed-effects models) on some data and comparing the models (in pairs) with the anova() function. However, on a particular subset of the data I'm getting nonsense results.
This is my full model:
m3_full <- lmer(totfix ~ psource + cond + psource:cond +
                  (1 + cond | subj) + (1 + psource + cond | object),
                data, REML = FALSE)
And this is the model I'm comparing it to (basically dropping one of the main effects):
m3_psource <- lmer(totfix ~ psource + cond + psource:cond - psource +
                     (1 + cond | subj) + (1 + psource + cond | object),
                   data, REML = FALSE)
Running anova(m3_full, m3_psource) returns Chisq = 0, Pr(>Chisq) = 1.
I'm doing the same for a few other LMEMs and everything seems fine, it's just this particular response value that gives me the weird chi-square and probability values. Anyone has an idea why and how I can fix it? Any help will be much appreciated!
This is not really a mixed-model-specific question: rather, it has to do with the way that R constructs model matrices from formulas (and, possibly, with the logic of your model comparison).
Let's narrow it down to the comparison between
form1 <- ~ psource + cond + psource:cond
and
form2 <- ~ psource + cond + psource:cond - psource
(which is equivalent to ~cond + psource:cond). These two formulas give equivalent model matrices, i.e. model matrices with the same number of columns, spanning the same design space, and giving the same overall goodness of fit.
Making up a minimal data set to explore:
dd <- expand.grid(psource=c("A","B"),cond=c("a","b"))
What constructed variables do we get with each formula?
colnames(model.matrix(form1,data=dd))
## [1] "(Intercept)" "psourceB" "condb" "psourceB:condb"
colnames(model.matrix(form2,data=dd))
## [1] "(Intercept)" "condb" "psourceB:conda" "psourceB:condb"
We get the same number of contrasts.
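A quick check confirms that the two design matrices really do span the same space:

X1 <- model.matrix(form1, data = dd)
X2 <- model.matrix(form2, data = dd)
ncol(X1) == ncol(X2)                   # same number of columns
qr(cbind(X1, X2))$rank == qr(X1)$rank  # same column space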
There are two possible responses to this problem.
There is one school of thought (typified by Nelder, Venables, etc.: e.g. see Venables' famous (?) but unpublished exegeses on linear models, section 5, or Wikipedia on the principle of marginality) that says that it doesn't make sense to try to test main effects in the presence of interaction terms, which is what you're trying to do.
There are occasional situations (e.g in a before-after-control-impact design where the 'before' difference between control and impact is known to be zero due to experimental protocol) where you really do want to do this comparison. In this case, you have to make up your own dummy variables and add them to your data, e.g.
## set up model matrix and drop intercept and "psourceB" column
dummies <- model.matrix(form1,data=dd)[,-(1:2)]
## d='dummy': avoid colons in column names
colnames(dummies) <- c("d_cond","d_source_by_cond")
colnames(model.matrix(~d_cond+d_source_by_cond,data.frame(dd,dummies)))
## [1] "(Intercept)" "d_cond" "d_source_by_cond"
This is a nuisance. My guess at the reason for its being difficult is that the original authors of R (and S before it) were from school of thought #1 and figured that when people tried to do this it was generally a mistake; they didn't make it impossible, but they didn't go out of their way to make it easy.
