Diagnostic plots fail with LMMs - r

I've been working on the following problem recently: We sent 18 people, 9 each, several times to two different clubs "N" and "O". These people arrived at the club either between 8 and 10 am (10) or between 10 and 12 pm (12). Each club consists of four sectors with ascending price classes. At the end of each test run, the subjects filled out a questionnaire reflecting a score for their satisfaction depending on the different parameters. The aim of the study is to find out how satisfaction can be modelled as a function of the club. You can download the data as csv for one week with this link (without spaces): https: // we.tl/t-I0UXKYclUk
After some try and error, I fitted the following model using the lme4 package in R (the other models were singular, had too strong internal correlations or higher AIC/BIC):
mod <- lmer(Score ~ Club + (1|Sector:Subject) + (1|Subject), data = dl)
Now I wanted to create some diagnostic plots as indicated here.
plot(resid(mod), dl$Score)
plot(mod, col=dl$Club)
library(lattice)
qqmath(mod, id=0.05)
Unfortunately, it turns out that there are still patterns in the residuals that can be attributed to the club but are not captured by the model. I have already tried to incorporate the club into the random effects, but this leads to singularities. Does anyone have a suggestion on how I can deal with these patterns in the residuals? Thank you!

Related

Problem with heteroscedastic residuals using lme with varIdent

I´m having problems when I try to fix heteroscedasticity in mixed models with lme.
My experimental design consists of 12 independent enclosures (Encl) with populations of lizards (Subject_ID). We applied 2 crossed treatments (Lag: 3 levels and Var: 2 levels). And we repeated the experiment two years (Year), so individuals that survived the first year, were measured again the next year. We analyse the snout vent length (SVL) in mm. Sex (males and females). This was my model:
ctrl <- lmeControl(maxIter=200, msMaxIter=200, msVervose=TRUE, opt="optim")
options (contrast=c(factor="contr.sum", ordered="contr.poly"))
model.SVL <- lme(SVL~Lag*Var*Sex*Year, random=list(~1|Subject_ID, ~1|Encl), control=ctrl, data=data)
enter image description here
enter image description here
enter image description here
The model showed heteroscedasticity in several triple interactions using bartlett.test, so I corrected it with varIdent. However, heteroscedasticity was not fixed, and now, qqplot indicates leptokurtic distribution.
model.SVL2 <- lme(SVL~Lag*Var*Sex*Year, random=list(~1|Subject_ID, ~1|Encl), control=ctrl, weights=varIdent (form=~1|Lag*Var*Sex*Year), data=data)
What could be the problem?
I think the problem is using varIdent when I include Subject_ID as a random factor. If I remove it, this doesn't happen. Maybe it is because many individuals do not survive two years, and it is a random factor with many levels but few replications

How do we run a linear regression with the given data?

We have a large data set with 26 brands, sold in 93 stores, during 399 weeks. The brands are still divided into sub brands (f.ex.: brand = Colgate, but sub brands(556) still exist: Colgate premium white/ Colgate extra etc.)
We calculated for each Subbrand a brandshared price on a weekly store level:
Calculation: (move per ounce for each subbrand and every single store weekly) DIVIDED BY (sum for move per ounce over the subbrands refering to one brand for every single store weekly)* (log price per ounce for each sub brand each week on storelevel)
Everything worked! We created a data frame with all the detailed calculation (data = tooth4) Our final interest is to run a linear regression to predict the influence of price on the move variable
--> the problem now is that the sale variable (a dummy, which says if there is a promotion in a specific week for a specific sub brand in a specific store ) is on subbrandlevel
--> we tried to run a regression on sub brand level (variable = descrip) but it doesn't work due to big data
lm(formula = logmove_ounce ~ log_wei_price_ounce + descrip - 1 *
(log_wei_price_ounce) + sale - 1, data = tooth4)
logmove_ounce = log of weekly subbrand based move on store level
log_wei_price_ounce = weighted subbrand based price for each store for each week
sale-1 = fixed effect for promotion
descrip-1 = fixed effect for subbrand
Does anyone have a solution how to run a regression only on brand level but include the promotion variable ?
We got a hint that we could calculate a shared value of promotion for each brand on each store ? But how?
Another question, assuming my regression is right/ partly right -- how can I weight the results to get the results only on store level not weekly storelevel?
Thank you in advance !!!
We got a hint that we could calculate a shared value of promotion for each brand on each store ? But how?
This is variously called a multilevel model, a nested model, hierarchical model, mixed model, or random-effect model which are all the same mathematical model. It is widely used to analyze the kind of longitudinal panel data you describe. A serious book on the subject is Gelman.
The most common approach in R is to use the lmer() function from the lme4 package. If you're using lme4 on uncomfortably large data, you should read their performance tips.
lmer() models accept a slightly different formula syntax, which I'll describe only briefly so that you can see how it can solve the problems you're having.
For example, let's assume we're modeling future salary as a function of the GPA and IQ of certain students. We know that students come from certain schools, so all students which go to the same school are part of a group, and schools are again grouped into counties, states. Furthermore, students graduate in different years which may have an effect. This is a generic example, but I chose it because it shares many of the same characteristics as your own longitudinal panel data.
We can use the generalized formula syntax to specify groups with a varying intercept:
lmer(salary ~ gpa + iq + (1|school), data=df)
A nested hierarchy of such groups:
lmer(salary ~ gpa + iq + (1|state/county/school), data=df)
Or group-varying slopes to capture changes overtime:
lmer(salary ~ gpa + iq + (1 + year|school), data=df)
You'll have to make your own decisions about how to model your data, but lme4::lmer() will give you a larger toolbox than lm() for dealing with groups and levels. I'd recommend asking on https://stats.stackexchange.com/ if you have questions about the modeling side.

R: lmer coding for a (random) discontinuous time for all subjects with multiple treatments

I have a set of data that came from a psychological experiment where subjects were randomly assigned to one of four treatment conditions and their wellbeing w measured on six different occasions. The exact day of measurement on each occasion differs slightly from subject to subject. The first measurement occasion for all subjects is day zero.
I analyse this with lmer :
model.a <- lmer(w ~ day * treatment + (day | subject),
REML=FALSE,
data=exper.data)
Following a simple visual inspection of the change-trajectories of subjects, I'd now like to include (and examine the effect of including) the possibility that the slope of the line for each subject changes at a point mid-way between measurement occasion 3 and 4.
I'm familiar with modeling the alteration in slope by including an additional time-variable in the lmer specification. The approach is described in chapter 6 ('Modeling non-linear change') of the book Applied Longitudinal Data Analysis by Singer and Willett (2005). Following their advice, for each measurement, for each subject, there is now an additional variable called latter.day. For measurements up to measurement 3, the value of latter.day is zero; for later measurements, latter.day encodes the number of days after day 40 (which is the point at which I'd like to include the possible slope-change).
What I cannot see is how to adjust the lmer coding of the examples in the Singer and Willett cases to suit my own problem ... which includes the same point-of-slope-change for all subjects as well as a between-subjects factor (treatment). I'd appreciate help on how to write the specification for lmer.

How do I code a Mixed effects model for abalone growth in Aquaculture nutrition with nested individuals

I am a biologist working in aquaculture nutrition research and until recently I haven't paid much attention to the power of statistics. The usual method of analysis had been to run ANOVA on final weights of animals given various treatments and boom, you have a result. I have tried to improve my results by designing an experiment that could track individuals growth over time but I am having a really hard time trying to understand which model to use for the data I have.
For simplified explanation of my experiment: I have 900 abalone/snails which were sourced from a single cohort (spawned/born at the same time). I have individually marked each abalone (id) and recorded a length and weight at Time 0. The animals were then randomly assigned 1 of 6 treatment diets (n=30 abalone per treatment) each replicated n=5 times (n=150 abalone / replicate). Each replicate looks like a randomized block design where each treatment is only replicate once within each block and each is assigned to independent tank with n=30 abalone/tank (n treatment). Abalone were fed a known amount of feed for 90 days before being weighed and measured again (Time 1). They are back in their homes for another 90 days before the concluding the experiment.
From my understanding:
fixed effects - Time, Treatment
nested random effects - replicate, id
My raw data entered is in Long format with each row being a unique animal and columns for Time (0 or 1), Replicate (1-5), Treatment (1-6), Sex (M or F) Animal ID (1-900), Length (mm), Weight (g), Condition Factor (Weight/Length^2.99*5655)
I have used columns from my raw data and converted them to factors and vectors before using the new variables to create a data frame.
id<-as.factor(data.long[,5])
time<-as.factor(data.long[,1])
replicate<-as.factor(data.long[,2])
treatment<-data.long[,3]
weight<-as.vector(data.long[,7])
length<-as.vector(data.long[,6])
cf<-as.vector(data.long[,10])
My data frame is currently in the following structure:
df1<-data.frame(time,replicate,treatment,id,weight,length,cf)
I am struggling to understand how to nest my individual abalone within replicates. I can convert the weight data to change from initial but I think the package nlme already accounts this change when coded correctly. I could also create another measure of Specific Growth Rate for each animal at Time 1 but this would not allow the Time factor to be used.
lme(weight ~ time*treatment, random=~1 | id, method="ML", data=df1))
I would like to structure a mixed effects model so that my code takes into account the individual animal variability to detect statistical differences in their weight at Time 1 between treatments.

How to add level2 predictors in multilevel regression (package nlme)

I have a question concerning multi level regression models in R, specifically how to add predictors for my level 2 "measure".
Please consider the following example (this is not a real dataset, so the values might not make much sense in reality):
date id count bmi poll
2012-08-05 1 3 20.5 1500
2012-08-06 1 2 20.5 1400
2012-08-05 2 0 23 1500
2012-08-06 2 3 23 1400
The data contains
different persons ("id"...so it's two persons)
the body mass index of each person ("bmi", so it doesn't vary within an id)
the number of heart problems each person has on a specific day ("count). So person 1 had three problems on August the 5th, whereas person 2 had no difficulties/problems on that day
the amount of pollutants (like Ozon or sulfit dioxide) which have been measured on that given day
My general research question is, if the amount of pollutants effects the numer of heart problems in the population.
In a first step, this could be a simple linear regression:
lm(count ~ poll)
However, my data for each day is so to say clustered within persons. I have two measures from person 1 and two measures from person 2.
So my basic idea was to set up a multilevel model with persons (id) as my level 2 variable.
I used the nlme package for this analysis:
lme(fixed=count ~ poll, random = ~poll|id, ...)
No problems so far.
However, the true influence on level 2 might not only come from the fact that I have different persons. Rather it would be much more likely that the effect WITHIN a person might come from his or her bmi (and many other person related variables, like age, amount of smoking and so on).
To make a longstory short:
How can I specify such level two predictors in the lme function?
Or in other words: How can I setup a model, where the relationship between heart problems and pollution is different/clustered/moderated by the body mass index of a person (and as I said maybe additionally by this person's amount of smoking or age)
Unfortunately, I don't have a clue, how to tell R, what I want. I know oif other software (one of them called HLM), which is capable of doing waht I want, but I'm quite sure that R can this as well...
So, many thanks for any help!
deschen
Short answer: you do not have to, as long as you correctly specify random effects. The lme function automatically detects which variables are level 1 or 2. Consider this example using Oxboys where each subject was measured 9 times. For the time being, let me use lmer in the lme4 package.
library(nlme)
library(dplyr)
library(lme4)
library(lmerTest)
Oxboys %>% #1
filter(as.numeric(Subject)<25) %>% #2
mutate(Group=rep(LETTERS[1:3], each=72)) %>% #3
lmer(height ~ Occasion*Group + (1|Subject), data=.) %>% #4
anova() #5
Here I am picking 24 subjects (#2) and arranging them into 3 groups (#3) to make this data balanced. Now the design of this study is a split-plot design with a repeated-measures factor (Occasion) with q=9 levels and a between-subject factor (Group) with p=3 levels. Each group has n=8 subjects. Occasion is a level-1 variable while Group is level 2.
In #4, I did not specify which variable is level 1 or 2, but lmer gives you correct output. How do I know it is correct? Let us check the multi-level model's degrees of freedom for the fixed effects. If your data is balanced, the Kenward–Roger approximation used in the lmerTest will give you exact dfs and F/t-ratios according to this article. That is, in this example dfs for the test of Group, Occasion, and their interaction should be p-1=2, q-1=8, and (p-1)*(q-1)=16, respectively. The df for the Subject error term is (n-1)p = 21 and the df for the Subject:Occasion error term is p(n-1)(q-1)=168. In fact, these are the "exact" values we get from the anova output (#5).
I do not know what algorithm lme uses for approximating dfs, but lme does give you the same dfs. So I am assuming that it is accurate.

Resources