I have doubts about how to specify the random effects structure in my mixed models.
My experimental design consists of 12 independent enclosures (Encl) with populations of lizards (Subject_ID). We applied 2 crossed treatments (Lag: 3 levels; Var: 2 levels) and repeated the experiment in two years (Year), so individuals that survived the first year were measured again the next year. The response is snout-vent length (SVL) in mm, and Sex (male or female) is also recorded. Individuals were redistributed to different enclosures and treatments in the second year, so I coded the Encl:Year combination in a new column (Encl_Year).
This was my model:
library(nlme)
ctrl <- lmeControl(maxIter=200, msMaxIter=200, msVerbose=TRUE, opt="optim")
# the option is called "contrasts"; its two entries apply to unordered and ordered factors
options(contrasts=c(unordered="contr.sum", ordered="contr.poly"))
model.SVL <- lme(SVL~Lag*Var*Sex*Year, random=list(~1|Subject_ID, ~1|Encl_Year), weights=varIdent(form=~1|Lag*Var*Sex), control=ctrl, data=data)
But I don't know how to define the random effects correctly. It is not a fully crossed random-effects model, because not every individual (Subject_ID) occurs in every enclosure-year (Encl_Year), but it is not nested either, because some individuals appear in different enclosures. What would be the most correct way to write the model?
Depending on the order:
random=list(~1|Subject_ID, ~1|Encl_Year)
or
random=list(~1|Encl_Year, ~1|Subject_ID)
the results change quite a lot. I also tried a crossed random-effects model:
data$Dummy <- factor(1)
data <- groupedData(SVL ~ 1 | Dummy, data)
model.SVL <- lme(SVL~Lag*Var*Sex*Year, random=pdBlocked(list(pdIdent(~ 0 + Subject_ID), pdIdent(~ 0 + Encl_Year))), control=ctrl, weights=varIdent(form=~1|Lag*Var*Sex), data=data)
I should add that I use the lme function because there is heteroscedasticity in the residuals, which I corrected with the varIdent function.
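The order dependence described above is expected: lme reads a list of grouping formulas as nested, from the outermost level to the innermost, so swapping the two entries changes which factor is treated as the outer one. A minimal base R sketch of how to check what each ordering implies, using a hypothetical toy data set (6 lizards, each measured once per year and moved to a different enclosure-year unit in the second year; the labels are made up for illustration):

```r
# Hypothetical toy data: 6 lizards, two years, reshuffled between enclosures.
toy <- data.frame(
  Subject_ID = factor(rep(1:6, times = 2)),
  Encl_Year  = factor(c("E1_Y1", "E1_Y1", "E2_Y1", "E2_Y1", "E3_Y1", "E3_Y1",
                        "E2_Y2", "E3_Y2", "E1_Y2", "E3_Y2", "E1_Y2", "E2_Y2"))
)

# Cross-tabulation: a subject with counts in more than one column appears in
# several enclosure-year units, so Subject_ID is not nested in Encl_Year.
xt <- table(toy$Subject_ID, toy$Encl_Year)
units_per_subject <- rowSums(xt > 0)

# Number of groups implied by nesting in either order. The inner level is
# always the Subject_ID-by-Encl_Year combination; only the outer level changes.
n_outer_subject <- nlevels(toy$Subject_ID)  # list(~1|Subject_ID, ~1|Encl_Year)
n_outer_encl    <- nlevels(toy$Encl_Year)   # list(~1|Encl_Year, ~1|Subject_ID)
n_inner <- nlevels(interaction(toy$Subject_ID, toy$Encl_Year, drop = TRUE))

print(units_per_subject)
print(c(outer_subject = n_outer_subject, outer_encl = n_outer_encl,
        inner = n_inner, rows = nrow(toy)))
```

In this toy layout the inner level has as many groups as there are rows, so the inner random intercept cannot be separated from the residual. If the real data also have one SVL measurement per individual per year, the same would apply to either nested ordering, which may be part of why the two fits disagree.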
I'm having problems when I try to fix heteroscedasticity in mixed models with lme.
My experimental design consists of 12 independent enclosures (Encl) with populations of lizards (Subject_ID). We applied 2 crossed treatments (Lag: 3 levels; Var: 2 levels) and repeated the experiment in two years (Year), so individuals that survived the first year were measured again the next year. The response is snout-vent length (SVL) in mm, and Sex (male or female) is also recorded. This was my model:
library(nlme)
ctrl <- lmeControl(maxIter=200, msMaxIter=200, msVerbose=TRUE, opt="optim")
# the option is called "contrasts"; its two entries apply to unordered and ordered factors
options(contrasts=c(unordered="contr.sum", ordered="contr.poly"))
model.SVL <- lme(SVL~Lag*Var*Sex*Year, random=list(~1|Subject_ID, ~1|Encl), control=ctrl, data=data)
The model showed heteroscedasticity across several three-way interactions (checked with bartlett.test), so I tried to correct it with varIdent. However, the heteroscedasticity was not fixed, and the Q-Q plot now indicates a leptokurtic distribution.
model.SVL2 <- lme(SVL~Lag*Var*Sex*Year, random=list(~1|Subject_ID, ~1|Encl), control=ctrl, weights=varIdent(form=~1|Lag*Var*Sex*Year), data=data)
What could be the problem?
I think the problem is using varIdent when I include Subject_ID as a random factor. If I remove it, this doesn't happen. Maybe it is because many individuals do not survive two years, so it is a random factor with many levels but few replicates per level.
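One thing worth ruling out, assuming the diagnostics were run on the default residuals: resid() on an lme fit returns raw (response) residuals, and those keep the unequal variances by construction even when varIdent has modelled them well. The check should be made on Pearson or normalized residuals instead. A minimal sketch with the Orthodont data that ships with nlme, standing in for the lizard data (an assumption for illustration only):

```r
library(nlme)

# Homoscedastic fit, then the same model with a separate residual variance per Sex.
m_hom <- lme(distance ~ age * Sex, random = ~ 1 | Subject, data = Orthodont)
m_het <- update(m_hom, weights = varIdent(form = ~ 1 | Sex))

# Both fits share the same fixed effects, so the REML fits can be compared.
cmp <- anova(m_hom, m_het)
print(cmp)

# Raw residuals are on the response scale; Pearson residuals are divided by
# the estimated standard deviation of their variance stratum.
r_raw <- resid(m_het, type = "response")
r_std <- resid(m_het, type = "pearson")

sd_raw <- tapply(r_raw, Orthodont$Sex, sd)
sd_std <- tapply(r_std, Orthodont$Sex, sd)
print(rbind(raw = sd_raw, pearson = sd_std))
```

If the spread still differs between strata on the Pearson scale, the variance function itself may be the issue. With Lag*Var*Sex*Year there are 24 strata, and some of them may contain too few observations to estimate a variance reliably.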
I have some high-dimensional repeated-measures data, and I am interested in fitting a random forest model to investigate the suitability and predictive utility of such models. Specifically, I am trying to implement the methods in the LongituRF package. The methods behind this package are detailed here:
Capitaine, L., et al. Random forests for high-dimensional longitudinal data. Stat Methods Med Res (2020) doi:10.1177/0962280220946080.
Conveniently the authors provide some useful data generating functions for testing. So we have
install.packages("LongituRF")
library(LongituRF)
Let's generate some data with DataLongGenerator() which takes as arguments n=sample size, p=number of predictors and G=number of predictors with temporal behavior.
my_data <- DataLongGenerator(n=50,p=6,G=6)
my_data is a list of what you'd expect: Y (response vector),
X (matrix of fixed-effects predictors), Z (matrix of random-effects predictors),
id (vector of sample identifiers) and time (vector of time measurements). To fit a random forest model, simply run
model <- REEMforest(X=my_data$X,Y=my_data$Y,Z=my_data$Z,time=my_data$time,
id=my_data$id,sto="BM",mtry=2)
# takes about 50 secs here, so bear with me
So far so good. Now I'm clear about all the parameters here except for Z. What is Z when I go to fit this model on my actual data?
Looking at my_data$Z.
dim(my_data$Z)
[1] 471 2
head(my_data$Z)
[,1] [,2]
[1,] 1 1.1128914
[2,] 1 1.0349287
[3,] 1 0.7308948
[4,] 1 1.0976203
[5,] 1 1.3739856
[6,] 1 0.6840415
Each row of Z looks like an intercept term (i.e. 1) followed by a value drawn from a uniform distribution with runif().
The documentation of REEMforest() indicates that "Z [matrix]: A Nxq matrix containing the q predictor of the random effects." How is this matrix to be specified when using actual data?
My understanding is that Z is traditionally just a one-hot (binary) encoding of the grouping variables (e.g. as described here), so shouldn't Z from DataLongGenerator() be an n x G (471 x 6) sparse matrix?
Clarity on how to specify the Z parameter with actual data would be appreciated.
EDIT
My specific example is as follows. I have a response variable (Y). Samples (identified by id) were randomly assigned to an intervention (I: intervention or no intervention). There is a high-dimensional set of features (X). Features and response were measured at two timepoints (Time: baseline and endpoint). I am interested in predicting Y using X and I. I am also interested in extracting which features were most important for predicting Y (the same way Capitaine et al. did with HIV in their paper).
I will call REEMforest() as follows
REEMforest(X=cbind(X,I), Y=Y, time=Time, id=id)
What should I use for Z?
When the function DataLongGenerator() creates Z, it fills a matrix with random uniform data. The actual code is
Z <- as.matrix(cbind(rep(1, length(f)), 2 * runif(length(f))))
Where f represents the length of the matrices that represent each of the elements. In your example, you used 6 groups of 50 participants with 6 fixed effects. That led to a length of 472.
From what I can gather, since this function is designed to simulate longitudinal data, this is a simulation of random effects on that data. If you were working with real data, I think it would be a lot easier to understand.
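To make the shape of Z concrete, here is a minimal base R sketch of how a matrix with the same layout could be built from observed data. This is an assumption about how the documented "N x q matrix containing the q predictors of the random effects" maps onto a real design, not something taken from the package source: a column of ones corresponds to a random intercept, and a second column holds a covariate that is given a random coefficient (here the observed time). The vectors id and Time below are hypothetical.

```r
# Hypothetical long-format design: 4 samples measured at baseline (0) and endpoint (1).
id   <- rep(1:4, each = 2)
Time <- rep(c(0, 1), times = 4)

# Random intercept only: a single column of ones (q = 1).
Z_intercept <- matrix(1, nrow = length(id), ncol = 1)

# Random intercept plus a random coefficient on time (q = 2): same layout as the
# simulated Z, with the observed time in place of the uniform draws.
Z_slope <- cbind(1, Time)

print(dim(Z_intercept))
print(Z_slope)
```

With only two timepoints per sample, the intercept-only version is probably the safer starting point, since a per-sample slope is hard to separate from noise.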
While this example doesn't use RE-EM forests, I thought it was pretty clear, because it uses tangible elements as an example. You can read about random effects in section 1.2.2 Fixed v. Random Effects. https://ademos.people.uic.edu/Chapter17.html#32_fixed_effects
Look at section 3.2 to see examples of random effects that you could intentionally model if you were working with real data.
Another example: You're running a cancer drug trial. You've collected patient demographics on a weekly basis: weight, temperature, and a CBC panel and different groups of drug administration: 1 unit per day, 2 units per day, and 3 units per day.
In traditional regression, you'd model these variables to determine how accurately the model identifies the outcome. The fixed effects are the explainable variance, or R². So if R² is .86 (86%), then 14% is unexplained. It could be an interaction causing the noise, the unexplained variance between perfect and what the model determined was the outcome.
Let's say the patients who had really low white blood cell counts and were overweight responded far better to the treatment. Or perhaps the patients with red hair responded better; that's not in your data. In terms of longitudinal data, let's say that the relationship (the interaction relationship) only appears after some measure of time passes.
You can try to model different relationships to evaluate the random interactions in the data. I think you'd be better off with one of the many ways to evaluate interactions systematically than a random attempt to identify random effects, though.
EDITED I started to write this in the comments with #JustGettinStarted, but it was too much.
Without the background - the easiest way to achieve this would be to run something like REEMtree::REEMtree(), setting the random effects argument to random = ~1 | time / id. After it runs, extract the random effects it has calculated. You can do it like this:
library(dplyr)  # mutate(), arrange(), left_join()
library(tidyr)  # separate()
# fit is the object returned by REEMtree::REEMtree()
data2 <- data %>% mutate(oOrder = row_number()) %>% # identify original order of the data
  arrange(time, id) %>%
  mutate(zOrder = row_number()) # because the random effects will be in order by time then id
extRE <- data.frame(time = attributes(fit$RandomEffects[2][["id"]])[["row.names"]]) %>%
  separate(col = time,
           into = c("time", "id"),
           sep = "\\/") %>%
  mutate(Z = fit$RandomEffects[[2]] %>% unlist(),
         id = as.integer(id),
         time = time) # set data type to match dataset for time
data2 <- data2 %>% left_join(extRE) %>% arrange(oOrder) # return to original order
Z <- cbind(rep(1, times = nrow(data2)), data2$Z)
Alternatively, I suggest that you start with the random generation of random effects. The random-effects you start with are just a jumping-off point. The random effects at the end will be different.
No matter how many ways I tried to use LongituRF::REEMforest() with real data, I ran into errors. I had an uninvertible matrix failure every time.
I noticed that the data generated by DataLongGenerator() comes ordered by id, then time. I tried to order the data (and Z) that way, but it didn't help. When I extracted all the functionality out of the package LongituRF, I used the MERF (mixed-effects random forest) function with no problems. Even in the research paper, that method was solid. Just thought it was worth mentioning.
I have a set of data from a psychological experiment in which subjects were randomly assigned to one of four treatment conditions and their wellbeing (w) was measured on six different occasions. The exact day of measurement on each occasion differs slightly from subject to subject. The first measurement occasion for all subjects is day zero.
I analyse this with lmer :
model.a <- lmer(w ~ day * treatment + (day | subject),
REML=FALSE,
data=exper.data)
Following a simple visual inspection of the change-trajectories of subjects, I'd now like to include (and examine the effect of including) the possibility that the slope of the line for each subject changes at a point mid-way between measurement occasion 3 and 4.
I'm familiar with modeling the alteration in slope by including an additional time-variable in the lmer specification. The approach is described in chapter 6 ('Modeling non-linear change') of the book Applied Longitudinal Data Analysis by Singer and Willett (2003). Following their advice, for each measurement, for each subject, there is now an additional variable called latter.day. For measurements up to measurement 3, the value of latter.day is zero; for later measurements, latter.day encodes the number of days after day 40 (which is the point at which I'd like to include the possible slope-change).
What I cannot see is how to adjust the lmer coding of the examples in the Singer and Willett cases to suit my own problem ... which includes the same point-of-slope-change for all subjects as well as a between-subjects factor (treatment). I'd appreciate help on how to write the specification for lmer.
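One candidate specification, sketched under assumptions rather than offered as the definitive model: build latter.day as described above and let treatment interact with both time variables, so each condition gets its own slope before and after day 40. The measurement days below are hypothetical, and the formula is only constructed here, not fitted.

```r
# Hypothetical measurement days for one subject (six occasions).
day <- c(0, 14, 29, 51, 66, 80)
change_point <- 40

# Zero up to day 40, then the number of days elapsed since day 40.
latter.day <- pmax(0, day - change_point)
print(latter.day)

# Candidate lmer formula: treatment may shift the intercept and both slopes,
# and each subject gets its own intercept, early slope and change in slope.
f <- w ~ (day + latter.day) * treatment + (day + latter.day | subject)
print(f)
```

The formula would go into lmer() in place of the one in model.a, keeping REML=FALSE so the two models can be compared with anova(). With only six occasions per subject the three-term random part may be too rich to estimate; falling back to (day | subject) while keeping latter.day in the fixed part is a reasonable simplification.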
I have a question concerning multi level regression models in R, specifically how to add predictors for my level 2 "measure".
Please consider the following example (this is not a real dataset, so the values might not make much sense in reality):
date        id  count  bmi   poll
2012-08-05  1   3      20.5  1500
2012-08-06  1   2      20.5  1400
2012-08-05  2   0      23    1500
2012-08-06  2   3      23    1400
The data contain:
different persons ("id"), so here it's two persons
the body mass index of each person ("bmi"), which doesn't vary within an id
the number of heart problems each person has on a specific day ("count"). So person 1 had three problems on August 5th, whereas person 2 had no problems on that day
the amount of pollutants (like ozone or sulfur dioxide) measured on that given day ("poll")
My general research question is whether the amount of pollutants affects the number of heart problems in the population.
In a first step, this could be a simple linear regression:
lm(count ~ poll)
However, my data for each day are, so to speak, clustered within persons. I have two measures from person 1 and two measures from person 2.
So my basic idea was to set up a multilevel model with persons (id) as my level 2 variable.
I used the nlme package for this analysis:
lme(fixed=count ~ poll, random = ~poll|id, ...)
No problems so far.
However, the true influence on level 2 might not only come from the fact that I have different persons. Rather, it is much more likely that the effect WITHIN a person comes from his or her bmi (and many other person-related variables, like age, amount of smoking and so on).
To make a long story short:
How can I specify such level-two predictors in the lme function?
Or in other words: how can I set up a model where the relationship between heart problems and pollution is different/clustered/moderated by the body mass index of a person (and, as I said, maybe additionally by this person's amount of smoking or age)?
Unfortunately, I don't have a clue how to tell R what I want. I know of other software (one of them called HLM) that is capable of doing what I want, but I'm quite sure that R can do this as well...
So, many thanks for any help!
deschen
Short answer: you do not have to, as long as you correctly specify random effects. The lme function automatically detects which variables are level 1 or 2. Consider this example using Oxboys where each subject was measured 9 times. For the time being, let me use lmer in the lme4 package.
library(nlme)
library(dplyr)
library(lme4)
library(lmerTest)
Oxboys %>% #1
filter(as.numeric(Subject)<25) %>% #2
mutate(Group=rep(LETTERS[1:3], each=72)) %>% #3
lmer(height ~ Occasion*Group + (1|Subject), data=.) %>% #4
anova() #5
Here I am picking 24 subjects (#2) and arranging them into 3 groups (#3) to make this data balanced. Now the design of this study is a split-plot design with a repeated-measures factor (Occasion) with q=9 levels and a between-subject factor (Group) with p=3 levels. Each group has n=8 subjects. Occasion is a level-1 variable while Group is level 2.
In #4, I did not specify which variable is level 1 or 2, but lmer gives the correct output. How do I know it is correct? Let us check the multilevel model's degrees of freedom for the fixed effects. If your data are balanced, the Kenward–Roger approximation available in lmerTest (whose default is the Satterthwaite method) will give you exact dfs and F/t-ratios according to this article. That is, in this example the dfs for the tests of Group, Occasion, and their interaction should be p-1=2, q-1=8, and (p-1)*(q-1)=16, respectively. The df for the Subject error term is (n-1)p = 21 and the df for the Subject:Occasion error term is p(n-1)(q-1)=168. In fact, these are exactly the values we get from the anova output (#5).
I do not know what algorithm lme uses for approximating dfs, but lme does give you the same dfs. So I am assuming that it is accurate.
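The lme side of this can be checked directly. A minimal sketch, assuming only the nlme package and the same artificial grouping as in steps #2 and #3 (written in base R, so dplyr is not needed):

```r
library(nlme)

# Same subset and made-up grouping as in #2 and #3.
ox <- as.data.frame(Oxboys)
ox <- droplevels(ox[as.numeric(ox$Subject) < 25, ])
ox$Group <- factor(rep(LETTERS[1:3], each = 72))  # 8 subjects x 9 occasions per group

# Group is constant within Subject (level 2); Occasion varies within Subject (level 1).
# Neither level is declared anywhere in the call.
fit_lme <- lme(height ~ Occasion * Group, random = ~ 1 | Subject, data = ox)

# The denDF column holds the denominator degrees of freedom used for each term.
tab <- anova(fit_lme)
print(tab)
```

The denDF column should show 21 for Group and 168 for Occasion and the interaction, matching the error dfs quoted above. lme arrives at them by checking whether each term varies within the grouping factor, which is how it places Group at the subject level without being told.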
The design of the experiment involves 10 participants. All of them go through conditions A, B, C, D for treatment; however, participants 1-5 also go through conditions E, F and participants 6-10 go through conditions G, H.
I'm using the nlme package with the lme function to deal with missing data and prevent list-wise deletion of participants (measured variable = DV, fixed effect = condition, random effect = participant). When everything is just crossed, this is what I have:
lme(DV~cond, random =~1|ppt, data = OutcomeData, method = "ML", na.action = na.exclude)
What is the statistical setup when the first part (conditions A, B, C, D) is crossed whereas the second part (E, F and G, H) is nested? Any help or guidance would be greatly appreciated! Thanks.
I think your design can be considered a planned "missing" design, where a portion of subjects are not exposed to certain conditions in a planned way (see Enders, 2010). If these values are "missing completely at random" you can treat your data as obtained from a one-way repeated-measures design with missing values in conditions E-H.
I suggest you include a variable "block" that distinguishes subjects going through conditions A-D plus E and F from the other subjects. Then you can specify your model as
summary(m1 <- lme(DV ~ cond, random=~1|block/ppt, data=OutcomeData, method = "REML"))
If you randomize the subjects into 2 blocks properly, there should not be significant variability associated with the blocks. You can test this by fitting another model without the block random effect and compare the 2 models like this:
summary(m0 <- lme(DV ~ cond, random=~1|ppt, data=OutcomeData, method = "REML"))
anova(m0, m1)
I use method = "REML" because we are comparing nested models that differ only in their random effects. To estimate the fixed effects, you can refit the better-fitting model (hopefully m0) with method = "ML".
If you have not collected data yet, I strongly encourage you to randomly assign the subjects to the 2 blocks. Assigning subjects 1-5 to block 1 (i.e., going through conditions E and F) and subjects 6-10 to the other block can introduce confounding variables (e.g., time, technicians getting used to the procedure).
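The random assignment can be done in a couple of lines. A minimal base R sketch, in which the block labels "EF" and "GH" and the seed are arbitrary choices for illustration:

```r
set.seed(42)  # arbitrary seed, only so the allocation is reproducible

ppt <- 1:10

# Shuffle five labels of each block across the ten participants.
block <- sample(rep(c("EF", "GH"), each = 5))

assignment <- data.frame(ppt = ppt, block = factor(block))
print(assignment)
print(table(assignment$block))
```

Merging assignment into OutcomeData by ppt would supply the block column used in the models above.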