Partially nested/blocked experimental design in R

The design of the experiment involves 10 participants. All of them go through conditions A, B, C, D for treatment; however, participants 1-5 also go through conditions E and F, while participants 6-10 go through conditions G and H.
I'm using the lme function from the nlme package to deal with missing data and prevent list-wise deletion of participants (measured variable = DV, fixed effect = condition, random effect = participant). When everything is just crossed, this is what I have:
lme(DV~cond, random =~1|ppt, data = OutcomeData, method = "ML", na.action = na.exclude)
What is the statistical setup for when the first part (conditions A, B, C, D) is crossed whereas the second part (E, F and G, H) is nested? Any help or guidance would be greatly appreciated! Thanks.

I think your design can be considered a planned "missing" design, where a portion of subjects are not exposed to certain conditions in a planned way (see Enders, 2010). If these values are "missing completely at random" you can treat your data as obtained from a one-way repeated-measures design with missing values in conditions E-H.
I suggest you include a variable "block" that distinguishes subjects going through conditions A-D plus E and F from the other subjects. Then you can specify your model as
summary(m1 <- lme(DV ~ cond, random=~1|block/ppt, data=OutcomeData, method = "REML"))
If you randomize the subjects into the 2 blocks properly, there should not be significant variability associated with the blocks. You can test this by fitting another model without the block random effect and comparing the 2 models like this:
summary(m0 <- lme(DV ~ cond, random=~1|ppt, data=OutcomeData, method = "REML"))
anova(m0, m1)
We use method = "REML" because we are comparing nested models that differ in their random effects. To estimate the fixed effect, you can refit the better-fitting model (hopefully m0) with method = "ML".
If you have not collected data yet, I strongly encourage you to randomly assign the subjects to the 2 blocks. Assigning subjects 1-5 to block 1 (i.e., going through conditions E and F) and subjects 6-10 to the other block can introduce confounding variables (e.g., time, technicians getting used to the procedure).
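A minimal sketch of such a random assignment in base R (the seed here is just a placeholder for reproducibility):

```r
set.seed(123)                     # placeholder seed so the assignment is reproducible
ppt <- 1:10
block1 <- sort(sample(ppt, 5))    # these participants get conditions E and F
block2 <- setdiff(ppt, block1)    # the rest get conditions G and H
block1
block2
```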

Related

Implementing Longitudinal Random Forest with LongituRF package in R

I have some high-dimensional repeated-measures data, and I am interested in fitting a random forest model to investigate the suitability and predictive utility of such models. Specifically, I am trying to implement the methods in the LongituRF package. The methods behind this package are detailed here:
Capitaine, L., et al. Random forests for high-dimensional longitudinal data. Stat Methods Med Res (2020) doi:10.1177/0962280220946080.
Conveniently the authors provide some useful data generating functions for testing. So we have
install.packages("LongituRF")
library(LongituRF)
Let's generate some data with DataLongGenerator(), which takes as arguments n = sample size, p = number of predictors, and G = number of predictors with temporal behavior.
my_data <- DataLongGenerator(n=50,p=6,G=6)
my_data is a list of what you'd expect: Y (response vector),
X (matrix of fixed-effects predictors), Z (matrix of random-effects predictors),
id (vector of sample identifiers), and time (vector of time measurements). To fit a random forest model, simply run
model <- REEMforest(X = my_data$X, Y = my_data$Y, Z = my_data$Z,
                    time = my_data$time, id = my_data$id, sto = "BM", mtry = 2)
This takes about 50 seconds here, so bear with me.
So far so good. Now I'm clear about all the parameters here except for Z. What is Z when I go to fit this model on my actual data?
Looking at my_data$Z.
dim(my_data$Z)
[1] 471 2
head(my_data$Z)
[,1] [,2]
[1,] 1 1.1128914
[2,] 1 1.0349287
[3,] 1 0.7308948
[4,] 1 1.0976203
[5,] 1 1.3739856
[6,] 1 0.6840415
Each row looks like an intercept term (i.e. 1) plus a value drawn from a uniform distribution via runif().
The documentation of REEMforest() indicates that "Z [matrix]: A Nxq matrix containing the q predictor of the random effects." How is this matrix to be specified when using actual data?
My understanding is that traditionally Z is simply a one-hot (binary) encoding of the group variables (e.g. as described here), so Z from DataLongGenerator() should be an n x G (471 x 6) sparse matrix, no?
Clarity on how to specify the Z parameter with actual data would be appreciated.
EDIT
My specific example is as follows: I have a response variable (Y). Samples (identified with id) were randomly assigned to an intervention (I: intervention or no intervention). I have a high-dimensional set of features (X). Features and response were measured at two timepoints (Time: baseline and endpoint). I am interested in predicting Y using X and I. I am also interested in extracting which features were most important to predicting Y (the same way Capitaine et al. did with HIV in their paper).
I will call REEMforest() as follows
REEMforest(X=cbind(X,I), Y=Y, time=Time, id=id)
What should I use for Z?
When the function DataLongGenerator() creates Z, it's random uniform data in a matrix. The actual coding is
Z <- as.matrix(cbind(rep(1, length(f)), 2 * runif(length(f))))
Where f represents the length of the matrices that represent each of the elements. In your example, you used n = 50 participants with p = 6 predictors and G = 6 temporal predictors, which led to a total length of 471 in the run above (the number of measurements per participant is random, so this varies by run).
From what I can gather, since this function is designed to simulate longitudinal data, this is a simulation of random effects on that data. If you were working with real data, I think it would be a lot easier to understand.
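For instance, with real data a common choice (an assumption on my part, not something the package documentation mandates) is a random intercept plus a random slope on time, giving q = 2 columns shaped just like the simulated Z:

```r
# Hypothetical measurement times, two per subject, for illustration only
Time <- c(1, 2, 1, 2, 1, 2)
Z <- cbind(intercept = 1, time = Time)  # N x 2: a column of 1s plus the time values
dim(Z)                                  # [1] 6 2
```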
While this example doesn't use RE-EM forests, I thought it was pretty clear, because it uses tangible elements as an example. You can read about random effects in section 1.2.2 Fixed v. Random Effects. https://ademos.people.uic.edu/Chapter17.html#32_fixed_effects
Look at section 3.2 to see examples of random effects that you could intentionally model if you were working with real data.
Another example: you're running a cancer drug trial. You've collected patient demographics on a weekly basis (weight, temperature, and a CBC panel) and have different groups of drug administration: 1 unit per day, 2 units per day, and 3 units per day.
In traditional regression, you'd model these variables to determine how accurately the model identifies the outcome. The fixed effects make up the explained variance, or R². So if you have an R² of .86, or 86%, then 14% is unexplained. It could be an interaction causing the noise: the unexplained variance between a perfect fit and what the model determined was the outcome.
Let's say the patients with really low white blood cell counts who were overweight responded far better to the treatment. Or perhaps the patients with red hair responded better, and that's not in your data. In terms of longitudinal data, let's say that the (interaction) relationship only appears after some measure of time passes.
You can try to model different relationships to evaluate the random interactions in the data. I think you'd be better off with one of the many ways to evaluate interactions systematically than a random attempt to identify random effects, though.
EDITED I started to write this in the comments with #JustGettinStarted, but it was too much.
Without the background, the easiest way to achieve this would be to run something like REEMtree::REEMtree(), setting the random effects argument to random = ~1 | time/id. After it runs, extract the random effects it has calculated. You can do it like this:
library(dplyr)
library(tidyr)

data2 <- data %>%
  mutate(oOrder = row_number()) %>%  # identify original order of the data
  arrange(time, id) %>%
  mutate(zOrder = row_number())      # the random effects come ordered by time, then id

extRE <- data.frame(time = attributes(fit$RandomEffects[2][["id"]])[["row.names"]]) %>%
  separate(col = time,
           into = c("time", "id"),
           sep = "\\/") %>%
  mutate(Z = fit$RandomEffects[[2]] %>% unlist(),
         id = as.integer(id),
         time = time)                # set data type to match the dataset for time

data2 <- data2 %>% left_join(extRE) %>% arrange(oOrder)  # return to original order

Z <- cbind(rep(1, times = nrow(data2)), data2$Z)
Alternatively, I suggest that you start with the random generation of random effects. The random effects you start with are just a jumping-off point; the random effects at the end will be different.
No matter how many ways I tried to use LongituRF::REEMforest() with real data, I ran into errors. I had an uninvertible matrix failure every time.
I noticed that the data generated by DataLongGenerator() comes in order by id, then time. I tried ordering the data (and Z) that way, but it didn't help. When I extracted all the functionality out of the LongituRF package, I used the MERF (mixed-effects random forest) function with no problems. Even in the research paper, that method was solid. Just thought it was worth mentioning.

Treatment of categorical variables in rpart

I wonder how rpart treats categorical variables. There are several references suggesting that for unordered factors it looks through all combinations. Actually, even the vignette, at the end of section 6.2, states
(F)or a categorical predictor with m levels, all 2^m−1 different possible
splits are tested.
However, given my experience with the code, I find it difficult to believe. The vignette shows supporting evidence that running
rpart(Reliability ~ ., data=car90)
takes a really long, long time. In my case, however, it runs in seconds, despite having an unordered factor variable with 30 levels.
To demonstrate the issue further, I have created several variables with 52 levels, meaning that 2^51 - 1 ≈ 2.25 × 10^15 splits would need to be checked if all possibilities were explored. This code runs in about a minute, IMHO proving that all combinations are not checked.
library(rpart)

NROW <- 50000
NVAR <- 20
rand_letters <- data.frame(replicate(NVAR, as.factor(c(
  letters[sample.int(26, floor(NROW/2), replace = TRUE)],
  LETTERS[sample.int(26, ceiling(NROW/2), replace = TRUE)]))))
rand_letters$target <- rbinom(n = NROW, size = 1, prob = 0.1)
system.time({
  tree_letter <- rpart(target ~ ., data = rand_letters, cp = 0.0003)
})
tree_letter
What combinations of categorical variables are ACTUALLY checked in rpart?
I know it is an old question but I found this link that might answer some of it.
Bottom line is that rpart seems to be applying a simple algorithm:
First, sort the conditional means, p_i = E(Y|X = x_i)
Then compute Gini indices based on groups obtained from that ordering.
Pick the two groups giving the maximum of these Gini indices.
So it should not be nearly as computationally expensive.
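A rough sketch of that shortcut for a binary outcome, on invented data (this is my reading of the linked algorithm, not rpart's actual source code):

```r
set.seed(1)
x <- factor(sample(letters[1:6], 300, replace = TRUE))
y <- rbinom(300, 1, ifelse(x %in% c("a", "b"), 0.7, 0.2))  # "a", "b" respond more often

lvl_means <- tapply(y, x, mean)   # conditional means p_i = E(Y | X = x_i)
ord <- names(sort(lvl_means))     # levels sorted by conditional mean

# Only m - 1 splits along this ordering need checking, not 2^(m-1) - 1
gini <- function(p) 2 * p * (1 - p)
impurity <- sapply(seq_len(length(ord) - 1), function(k) {
  left <- x %in% ord[1:k]
  mean(left) * gini(mean(y[left])) + mean(!left) * gini(mean(y[!left]))
})
ord[seq_len(which.min(impurity))]  # levels sent to the left child
```

Minimizing the weighted child impurity here is equivalent to picking the ordered grouping with maximum Gini gain, so the cost is linear in the number of levels rather than exponential.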
However, I personally have a case with a single categorical variable, whose categories are US states, where rpart takes an extremely long time when trying to use it to produce a classification tree. Creating dummy variables and running rpart with the 51 variables (1 for each state) works fine.

How do I perform a TukeyHSD like test on a GLM in R?

I'm trying to analyse a glm I created in R, and what I'd like to do is get a pairwise comparison showing which of my factor levels are significantly different from each other, similar to the TukeyHSD test for ANOVAs. However, I have been told that TukeyHSD does not work for GLMs. After doing some research I've found a couple of options, and I'm not sure which one is correct or applicable: the glht or contrast commands.
Here is the code for the GLM.
glm.mod <- glm(as.numeric(Ostra..Avg.body.size) ~ as.factor(Macrophytes)*as.factor(Leaves)*as.factor(MacrophyteintLeaves), family = gaussian, data = main)
The body size variable is what I expect to change based on my factors: Macrophytes (the presence/absence of a species of macrophyte, with the options Without for none, or C or E for different species) and Leaves (three options: without, q1, q2).
Here is an example of what my data looks like (with made up values)
Macrophyte  Leaves   Animals  CODE  Ostra. Avg body size
Without     Q1       N        1     11000
E           Q2       Y        2     11853
C           without  N        3     13422
Without     Q1       Y        4     13838
How would I get an output that shows me if there is an effect, for example
without Macrophytes, Q1 leaf - without Macrophytes, Q2 leaf (then a value, such as a p-value, denoting whether they are significantly different from each other)?
Any help would be greatly appreciated, and thanks in advance. If there is any important info I have missed, please tell me.
With a numeric response variable and one (or several) categorical predictors, I would typically use the following to get pairwise comparisons of significance for each main effect (for example, Macrophytes alone; Leaves alone) and interaction effects:
TukeyHSD(aov(as.numeric(Ostra..Avg.body.size) ~ as.factor(Macrophytes)*as.factor(Leaves)*as.factor(MacrophyteintLeaves), data = main))
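If you specifically want the glht route on the GLM itself, a sketch with invented data might look like this (assuming the multcomp package is installed; the data, means, and level names here are all made up):

```r
library(multcomp)

set.seed(42)
toy <- data.frame(
  Macrophytes = factor(rep(c("Without", "C", "E"), each = 20)),
  size = c(rnorm(20, 110, 5), rnorm(20, 120, 5), rnorm(20, 130, 5))
)
fit <- glm(size ~ Macrophytes, family = gaussian, data = toy)

# Tukey-style pairwise comparisons of the Macrophytes levels, with adjusted p-values
cmp <- glht(fit, linfct = mcp(Macrophytes = "Tukey"))
summary(cmp)
```

With three levels this produces the three pairwise contrasts; note that mcp() only handles main-effect factors directly, so interactions need to be folded into a single factor first.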

Model with Matched pairs and repeated measures

I will delete this if it is too loosely programming-related, but my search has turned up NULL, so I'm hoping someone can help.
I have a case/control matched-pairs design with repeated measurements, and I am looking for a model/function/package in R.
I have 2 measures at time = 1 and 2 measures at time = 2. I have case/control status as Group (2 levels) and matched-pair id as match_id, and I want to estimate the effect of Group, time, and their interaction on speed, a continuous variable.
I wanted to do something like this:
(reg_id is the actual participant ID)
speed_model <- geese(speed ~ time*Group, id = c(reg_id, match_id),
                     data = dataforGEE, corstr = "exchangeable", family = gaussian)
Where I want to model the autocorrelation within a person via reg_id, but also within the matched pairs via match_id
But I get:
Error in model.frame.default(formula = speed ~ time * Group, data = dataFullGEE, :
variable lengths differ (found for '(id)')
Can geese, or GEE in general, not handle clustering around 2 sets of ids? Is there a way to even do this? I'm sure there is.
Thank you for any help you can provide.
This is definitely a better question for Cross Validated, but since you have exactly 2 observations per subject, I would consider the ANCOVA model:
geese(speed_at_time_2 ~ speed_at_time_1*Group, id = c(match_id),
data=dataforGEE, corstr="exchangeable", family=gaussian)
Regarding the use of ANCOVA, you might find this reference useful.
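To set up that ANCOVA model, the long data needs one row per subject with both time points as columns. A base-R sketch on invented data (all column names are assumptions taken from the description above):

```r
# Toy long-format data: 2 matched pairs, each a case and a control, 2 times each
long <- data.frame(
  reg_id   = rep(1:4, each = 2),
  match_id = rep(c(1, 1, 2, 2), each = 2),
  Group    = rep(rep(c("Case", "Control"), each = 2), times = 2),
  time     = rep(1:2, times = 4),
  speed    = rnorm(8, 10)
)

# One row per participant, with speed.1 and speed.2 columns for the ANCOVA model
wide <- reshape(long, direction = "wide",
                idvar = c("reg_id", "match_id", "Group"),
                timevar = "time", v.names = "speed")
names(wide)
```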

Multiple comparisions using glht with repeated measure anova

I'm using the following code to try to get at post-hoc comparisons for my cell means:
result.lme3 <- lme(Response ~ Pressure*Treatment*Gender*Group, data = mydata, random = ~1|Subject/Pressure/Treatment)
aov.result<-aov(result.lme3, mydata)
TukeyHSD(aov.result, "Pressure:Treatment:Gender:Group")
This gives me a result, but most of the adjusted p-values are incredibly small - so I'm not convinced the result is correct.
Alternatively I'm trying this:
summary(glht(result.lme3, linfct = mcp(???? = "Tukey")))
I don't know how to get the Pressure:Treatment:Gender:Group in the glht code.
Help is appreciated - even if it is just a link to a question I didn't find previously.
I have 504 observations, Pressure has 4 levels and is repeated in each subject, Treatment has 2 levels and is repeated in each subject, Group has 3 levels, and Gender is obvious.
Thanks
I solved a similar problem by creating an interaction dummy variable using the interaction() function, which contains all combinations of the levels of your 4 variables.
I ran many tests; the estimates shown for the various levels of this variable give the joint effect of the active levels plus the interaction effect.
For example if:
temperature ~ interaction(infection(y/n), acetaminophen(y/n))
(I put the possible levels in the parentheses for clarity.) The interaction variable will have a level like "infection.y:acetaminophen.y", which shows the effect on temperature of infection, acetaminophen, and the interaction of the two, in comparison with the intercept (where both variables are n).
Instead if the model was:
temperature ~ infection(y/n) * acetaminophen(y/n)
to get the same coefficient for the case when both vars are y, you would have had to add the two simple effects plus the interaction effect. The result is the same, but I prefer using interaction() since it is cleaner and more elegant.
Then in glht you use:
summary(glht(model, linfct = mcp(interaction_var = 'Tukey')))
to achieve your post-hoc comparisons, where interaction_var <- interaction(infection, acetaminophen).
TO BE NOTED: I never tested this methodology with nested and mixed models, so beware!
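On invented data, the interaction() trick looks roughly like this (assuming the multcomp package; the infection/acetaminophen variables follow the example above, with made-up values):

```r
library(multcomp)

set.seed(7)
d <- data.frame(
  infection     = factor(rep(c("n", "y"), each = 40)),
  acetaminophen = factor(rep(c("n", "y"), times = 40)),
  temperature   = rnorm(80, mean = 37, sd = 0.5)
)
d$combo <- interaction(d$infection, d$acetaminophen)  # 4 levels: n.n, y.n, n.y, y.y

m <- lm(temperature ~ combo, data = d)
# All choose(4, 2) = 6 pairwise comparisons of the combined levels, Tukey-adjusted
summary(glht(m, linfct = mcp(combo = "Tukey")))
```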
