GAMM4 smoothing spline for time variable - r

I am constructing a GAMM (for the first time) to compare longitudinal slopes of cognitive performance in a Bipolar Disorder (BD) sample with those in a healthy control (HC) sample. The study design is referred to as an "accelerated longitudinal study", where participants spanning a large range of ages (25-60) are followed for 2 years (HC group) or 4 years (BD group).
Hypothesis (1) The BD group’s yearly rate of change on processing speed will be higher overall than the healthy control group, suggesting a more rapid cognitive decline in BD than seen in HC.
Here is my R code formula, which I think is a bit off:
RUN2 <- gamm4(BACS_SC_R ~ group + s(VISITMONTH, bs = "cc") +
s(VISITMONTH, bs = "cc", by=group), random=~(1|SUBNUM), data=Df, REML = TRUE)
The visitmonth variable is coded as "months from first visit." Visit 1 would equal 0, and the following visits (3 per year) are coded as months elapsed from visit 1. Is a cyclic smooth correct in this case?
I plan on adding additional variables (i.e peripheral inflammation) to the model to predict individual slopes of cognitive trajectories in BD.
If you have any other suggestions, it would be greatly appreciated. Thank you!

If VISITMONTH runs over years (i.e. for a BD observation we would have VISITMONTH in {0, 1, 2, ..., 48} for the four years), then no, you don't want a cyclic smooth unless there is some 4-year periodicity that would mean 0 and 48 should be constrained to be the same.
The default thin plate spline bs = 'tp' should suffice.
I'm also assuming that there are many possible values for VISITMONTH as not everyone was followed up at the same monthly intervals? Otherwise you're not going to have many degrees of freedom available for the temporal smooth.
Is group coded as an ordered factor here? If so that's great; the by smooth will encode the difference between the reference level (be sure to set HC as the reference level) and the other level so you can see directly in the summary a test for a difference of the BD group.
It's not clear how you are dealing with the fact that the HC group is followed up over fewer months than the BD group. It looks like the model has VISITMONTH representing the full time of the study, not just a within-year term. So how do you intend to compare the BD group with the HC group for the 2 years in which the HC group is not observed?
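To make the suggestions above concrete, here is a minimal sketch on simulated data. It is not the asker's data or a definitive model: the data frame `d`, the effect sizes, and the visit schedule are all made up for illustration. It also swaps in mgcv::gam() with a random-intercept smooth, s(SUBNUM, bs = "re"), in place of gamm4(), because mgcv ships with R; the smooth terms are written the same way in both.

```r
library(mgcv)  # recommended package that ships with R

set.seed(1)

# Hypothetical data: 20 HC subjects followed for 24 months and 20 BD
# subjects followed for 48 months, each with irregular visit times.
n_sub <- 40
d <- do.call(rbind, lapply(seq_len(n_sub), function(i) {
  grp <- if (i <= 20) "HC" else "BD"
  max_month <- if (grp == "HC") 24 else 48
  data.frame(SUBNUM = i, group = grp,
             VISITMONTH = c(0, sort(runif(5, 1, max_month))))
}))
d$SUBNUM <- factor(d$SUBNUM)

# Ordered factor with HC as the reference level, so the `by` smooth
# estimates the BD-minus-HC difference in trajectory.
d$group <- factor(d$group, levels = c("HC", "BD"), ordered = TRUE)

# Made-up outcome: subject-specific intercepts, steeper decline in BD.
subj_eff <- rnorm(n_sub, sd = 2)
d$BACS_SC_R <- 50 + subj_eff[as.integer(d$SUBNUM)] -
  0.05 * d$VISITMONTH -
  0.10 * d$VISITMONTH * (d$group == "BD") +
  rnorm(nrow(d))

# Thin plate smooths (bs = "tp") instead of cyclic ones.
m <- gam(BACS_SC_R ~ group +
           s(VISITMONTH, bs = "tp") +
           s(VISITMONTH, bs = "tp", by = group) +
           s(SUBNUM, bs = "re"),
         data = d, method = "REML")

# One row per smooth; the difference smooth is labelled with the BD level.
print(summary(m)$s.table)
```

With gamm4 the corresponding call would presumably drop the s(SUBNUM, bs = "re") term and keep random = ~(1 | SUBNUM) as in the question. Note that the simulated HC group stops at 24 months, so any BD-versus-HC difference beyond that point is extrapolation for the HC curve.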

Related

Should I use Friedman test or Mixed Model for my data in R? Nested or not?

My response variable is the proportion of range exposed to extreme events for terrestrial mammal species in the future. More precisely, it is the Difference of Proportion of Range Exposed (DPRE) between the historical period and future greenhouse gas emission scenarios (a measure of the increase or decrease in the percentage of range exposed). This means my response variable goes from -1 to 1, where +1 implies that the range will experience a +100% increase in the proportion of exposure: from 0% in the historical period to 100% in the future scenario.
As said, I am analyzing these differences for all terrestrial mammals (5311 species), across different scenarios and for two time periods: near future (means of 2021-2040) and far future (means of 2081-2100).
So, my explanatory variables are:
3 scenarios of greenhouse gas emissions (Representative Concentration Pathways: RCP2.6, RCP4.5 and RCP8.5);
Time Periods (Near Future and Far Future): NF and FF;
Species: 5311 individuals.
I am not an expert in statistics, so I'm not sure which of the two suggestions I received I should use:
Friedman test with Species as blocks (but in which I should somehow do a nested model, with RCPs as groups, nested within TimePeriods; or a sort of two way Friedman, with RCP and TimePeriod as the two different factors).
Linear Mixed Models with RCP*TimePeriod as fixed effects, and (TimePeriod | Species ) as random effects.
I ran t-tests, and all distributions turned out to be non-normal, which is why it was suggested that I use Friedman instead of ANOVA. I also ran pairwise Wilcoxon rank-sum tests and found significant differences between NF and FF for all RCPs.
I should say that I ran 3 Wilcoxon tests, one for every RCP, so maybe a third option would be to create 3 different models, one for every RCP, but this would also depart from the standard "repeated measures" analysis of the Friedman test.
Last consideration: I have to run another model, where the response variable is the Difference of Proportion of Subrange Exposed. In this case, the other explanatory variables are maintained, but the analysis is not global: it takes into consideration the differences that could be present across 14 IUCN Biomes. So every analysis is made across RCPs, for NF and FF, and for all Biomes. Should I create and run 14 (Biomes) x 3 (RCPs) x 2 (Time Periods) = 84 models in this case? Or a sort of doubly nested (Time Periods and Biomes) model?
If necessary I can provide the large dataframe.

Implementing Longitudinal Random Forest with LongituRF package in R

I have some high dimensional repeated measures data, and I am interested in fitting a random forest model to investigate the suitability and predictive utility of such models. Specifically, I am trying to implement the methods in the LongituRF package. The methods behind this package are detailed here:
Capitaine, L., et al. Random forests for high-dimensional longitudinal data. Stat Methods Med Res (2020) doi:10.1177/0962280220946080.
Conveniently the authors provide some useful data generating functions for testing. So we have
install.packages("LongituRF")
library(LongituRF)
Let's generate some data with DataLongGenerator() which takes as arguments n=sample size, p=number of predictors and G=number of predictors with temporal behavior.
my_data <- DataLongGenerator(n=50,p=6,G=6)
my_data is a list of what you'd expect: Y (response vector),
X (matrix of fixed effects predictors), Z (matrix of random-effects predictors),
id (vector of sample identifiers) and time (vector of time measurements). To fit a random forest model, simply call
model <- REEMforest(X=my_data$X,Y=my_data$Y,Z=my_data$Z,time=my_data$time,
id=my_data$id,sto="BM",mtry=2)
(This takes about 50 seconds here, so bear with me.)
So far so good. Now I'm clear about all the parameters here except for Z. What is Z when I go to fit this model on my actual data?
Looking at my_data$Z.
dim(my_data$Z)
[1] 471 2
head(my_data$Z)
[,1] [,2]
[1,] 1 1.1128914
[2,] 1 1.0349287
[3,] 1 0.7308948
[4,] 1 1.0976203
[5,] 1 1.3739856
[6,] 1 0.6840415
Each row of Z looks like an intercept term (i.e. 1) followed by a value drawn from a uniform distribution with runif().
The documentation of REEMforest() indicates that "Z [matrix]: A Nxq matrix containing the q predictor of the random effects." How is this matrix to be specified when using actual data?
My understanding is that traditionally Z is simply one-hot (binary) encoding of the group variables (e.g. as described here), so Z from the DataLongGenerator() should be nxG (471x6) sparse matrix no?
Clarity on how to specify the Z parameter with actual data would be appreciated.
EDIT
My specific example is as follows: I have a response variable (Y). Samples (identified with id) were randomly assigned to intervention (I, intervention or no intervention). There is a high dimensional set of features (X). Features and response were measured at two timepoints (Time, baseline and endpoint). I am interested in predicting Y using X and I. I am also interested in extracting which features were most important for predicting Y (the same way Capitaine et al. did with HIV in their paper).
I will call REEMforest() as follows
REEMforest(X=cbind(X,I), Y=Y, time=Time, id=id)
What should I use for Z?
When the function DataLongGenerator() creates Z, it's a random uniform data in a matrix. The actual coding is
Z <- as.matrix(cbind(rep(1, length(f)), 2 * runif(length(f))))
Where length(f) is the number of rows in each of the generated elements. In your example, you generated 50 participants with 6 predictors, all 6 of which have temporal behavior. That led to a length of 471, the row count shown by dim(my_data$Z) in your output.
From what I can gather, since this function is designed to simulate longitudinal data, this is a simulation of random effects on that data. If you were working with real data, I think it would be a lot easier to understand.
While this example doesn't use RE-EM forests, I thought it was pretty clear, because it uses tangible elements as an example. You can read about random effects in section 1.2.2 Fixed v. Random Effects. https://ademos.people.uic.edu/Chapter17.html#32_fixed_effects
Look at section 3.2 to see examples of random effects that you could intentionally model if you were working with real data.
Another example: You're running a cancer drug trial. You've collected patient demographics on a weekly basis: weight, temperature, and a CBC panel and different groups of drug administration: 1 unit per day, 2 units per day, and 3 units per day.
In traditional regression, you'd model these variables to determine how accurately the model identifies the outcome. The fixed effects account for the explained variance, or R². So if R² is .86 (86%), then 14% is unexplained. That unexplained variance, the gap between a perfect fit and what the model predicted, could be noise caused by an interaction.
Let's say the patients with really low white blood cell counts and were overweight responded far better to the treatment. Or perhaps the patients with red hair responded better; that's not in your data. In terms of longitudinal data, let's say that the relationship (the interaction relationship) only appears after some measure of time passes.
You can try to model different relationships to evaluate the random interactions in the data. I think you'd be better off with one of the many ways to evaluate interactions systematically than a random attempt to identify random effects, though.
EDITED I started to write this in the comments with #JustGettinStarted, but it was too much.
Without the background: the easiest way to achieve this would be to run something like REEMtree::REEMtree(), setting the random effects argument to random = ~1 | time / id. After it runs, extract the random effects it has calculated. You can do it like this:
library(dplyr) # mutate(), arrange(), left_join(), row_number()
library(tidyr) # separate()
# `data` is the original data set; `fit` is the model returned by REEMtree::REEMtree()
data2 <- data %>% mutate(oOrder = row_number()) %>% # identify original order of the data
  arrange(time, id) %>%
  mutate(zOrder = row_number()) # because the random effects will be in order by time then id
extRE <- data.frame(time = attributes(fit$RandomEffects[2][["id"]])[["row.names"]]) %>%
  separate(col = time,
           into = c("time", "id"),
           sep = "\\/") %>%
  mutate(Z = fit$RandomEffects[[2]] %>% unlist(),
         id = as.integer(id),
         time = time) # set data type to match dataset for time
data2 <- data2 %>% left_join(extRE) %>% arrange(oOrder) # return to original order
Z <- cbind(rep(1, times = nrow(data2)), data2$Z) # intercept column plus extracted random effects
Alternatively, I suggest that you start with the random generation of random effects. The random-effects you start with are just a jumping-off point. The random effects at the end will be different.
No matter how many ways I tried to use LongituRF::REEMforest() with real data, I ran into errors. I had an uninvertible matrix failure every time.
I noticed that the data generated by DataLongGenerator() comes in order by id, then time. I tried to order the data (and Z) that way, but it didn't help. When I extracted all the functionality out of the package LongituRF, I used the MERF (mixed-effects random forest) function with no problems. Even in the research paper, that method was solid. Just thought it was worth mentioning.
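For what it's worth, one common reading of "a Nxq matrix containing the q predictor of the random effects" is the usual mixed-model design matrix: a column of 1s for a random intercept per id, plus one column for each covariate that gets a random slope. The sketch below only builds such matrices in base R on made-up data shaped like the two-timepoint design in the question; whether this is what the package authors intend is an assumption, and the commented-out model call is hypothetical and not run.

```r
# Hypothetical long-format design: 5 subjects, each measured at
# baseline (0) and endpoint (1). These values are made up.
id   <- rep(1:5, each = 2)
Time <- rep(c(0, 1), times = 5)

# Random intercept only (q = 1): one column of 1s, one row per observation.
Z_int <- matrix(1, nrow = length(id), ncol = 1)

# Random intercept plus random slope on time (q = 2). With only two
# timepoints per subject a random slope may be hard to estimate.
Z_slope <- cbind(1, Time)

print(dim(Z_int))
print(dim(Z_slope))

# Hypothetical call, not run here (needs LongituRF plus real X and Y):
# fit <- LongituRF::MERF(X = X, Y = Y, Z = Z_int, id = id,
#                        time = Time, sto = "none", mtry = 2)
```

The rows of Z have to line up with the rows of X, Y, id and time, so it is safest to build Z from the same data frame after any sorting.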

How would I devise code to get both within subject and between subject comparisons when attempting to carry out a repeated measures ANOVA?

I understand I can use lmer but I would like to undertake a repeated measures anova in order to carry out both a within group and a between group analysis.
So I am trying to compare the difference in metabolite levels between three groups ( control, disease 1 and disease 2) over time ( measurements collected at two timepoints), and to also make a within group comparison, comparing time point 1 with time point 2.
Important to note - these are subjects sending in samples not timed trial visits where samples would have been taken on the same day or thereabouts. For instance time point 1 for one subject could be 1995, time point 1 for another subject 1996, the difference between timepoint 1 and timepoint 2 is also not consistent. There is an average of around 5 years, however max is 15, min is .5 years.
I have 43, 45, and 42 subjects respectively in each group. My response variable would be, say, metabolite 1, and the predictor would be Group. I also have covariates I would like to account for, such as age, BMI, and gender. I would also need to account for family ID (which I have as a random effect in my lmer model). (My Time column uses 0 to mark time point 1 and 1 to mark time point 2.) I understand I must separate the within-subjects and between-subjects parts of the command; however, I am unsure how to do this. From my understanding so far:
If I am using the anova_test, my formula that needs to be specified for between subjects would be;
Metabolite1 ~ Group*Time
Whilst for within subjects ( seeing whether there is any difference within each group at TP1 vs TP2), I am unsure how I would specify this ( the below is not correct).
Metabolite1 ~ Time + Error(ID/Time)
The question is, how do I combine this altogether to specify the between and within subject comparisons I would like and accounting for the covariates such as gender, age and BMI? I am assuming if I specify covariates it will become an ANCOVA not an ANOVA?
Some example code that I found that had both a between and within subject comparison design (termed mixed anova).
aov1 <- aov(Recall~(Task*Valence*Gender*Dosage)+Error(Subject/(Task*Valence))+(Gender*Dosage),ex5)
Where he specifies that the within subject comparison is within the Error term. Also explained here https://rpkgs.datanovia.com/rstatix/reference/anova_test.html
However, mine, which I realise is very wrong currently ( is missing a correct within subject comparison).
repmes <- anova_test(data = mets, Metabolite1 ~ Group*Time + Error(ID/Time),
                     covariate = c("Age", "BMI", "Gender", "FamilyID"))
I ultimately would like to determine from this with appropriate post hoc tests ( if p < 0.05) whether there are any significant differences in Metabolite 1 expression between groups between the two time points (i.e over time), and whether there are any significant differences between subjects comparing TP1 with TP2. Please can anybody help.
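As a rough illustration of how the between-subject and within-subject parts can sit in one formula, here is a sketch using base R's aov() on simulated data. Everything in it is made up (group sizes, effect sizes, the data frame itself), it leaves out the covariates and family ID, and it ignores the irregular timing described above, so it shows the formula structure only.

```r
set.seed(42)

# Hypothetical balanced data: 3 groups x 10 subjects x 2 timepoints.
n_per_group <- 10
groups <- c("control", "disease1", "disease2")
mets <- expand.grid(Time = factor(c(0, 1)),
                    ID   = factor(seq_len(3 * n_per_group)))
mets$Group <- factor(groups[(as.integer(mets$ID) - 1) %/% n_per_group + 1])

# Made-up outcome: subject-specific level plus a small time effect.
subj <- rnorm(3 * n_per_group)
mets$Metabolite1 <- 10 + subj[as.integer(mets$ID)] +
  0.5 * (mets$Time == "1") + rnorm(nrow(mets), sd = 0.5)

# Group is between-subject; Time and Group:Time are within-subject.
# Error(ID / Time) tells aov() which error stratum each term belongs to.
fit <- aov(Metabolite1 ~ Group * Time + Error(ID / Time), data = mets)
print(summary(fit))
```

In the printed summary, the "Error: ID" stratum carries the between-group test and the "Error: ID:Time" stratum carries Time and the Group:Time interaction. Adding covariates to the right-hand side does turn this into an ANCOVA-type model.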

Reduce range of function for functional PCA in R - Functional Data Analysis

I have discrete measurements of river flow spanning 22 years. As river flow is naturally continuous, I have attempted to fit a function to the data.
library(fda) # package names are case-sensitive; the package is 'fda'
set.seed(1)
### 3 years of flow data
base = c(1,1,1,1,1,2,2,1,2,2,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,6,5,5,4,4,4,3,4,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1)
year1 = sapply(base, function(x){x + runif(1)})
year2 = sapply(base, function(x){x + runif(1)})
year3 = sapply(base, function(x){x + runif(1)})
flow.mat = matrix(c(year1, year2, year3), ncol = 3)
Whilst Fourier basis systems are recommended for periodic data, the true data does not exhibit a strongly repeating pattern (ignore the data simulation for this assumption). It also contains important extreme values. Therefore, I attempted to fit B-spline basis systems to the data.
sp.basis=create.bspline.basis(c(1,length(base)), norder=6, nbasis=15)
sb.fd=smooth.basis(1:length(base), flow.mat, sp.basis)$fd
Ultimately, I intend to use the flow data as a covariate in a regression model with a monthly interval. This poses an issue because I fitted annual functions to the data, as this provided an improved fit for monthly data given the data's lack of temporal independence.
Therefore, I was wondering if it was possible for me to subset the generated functions, selecting a month at a time.
I suspect this is not possible, therefore, is it possible to run a fPCA on subsetted data, as I intend on using the fPCA scores as the covariate in the model?
So far I have been completely unsuccessful in running a subsetted fPCA. Instead, I have been obtaining annual scores via the following:
pca.flow=pca.fd(sb.fd, 2)
Without getting into much sophistication, I just plotted your data and made a polynomial fit. I used a degree-4 polynomial because the data form a wave with 3 ups and downs (4 is one more than the number of extrema of the fitting curve). As a matter of fact, degree 5 or more did not give a significant improvement.
What about doing the same for your 22-year time series?
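A minimal base R sketch of that polynomial fit, reusing the `base` vector from the question for a single simulated year. The index variable `idx` and the comparison via anova() are my own choices for illustration, not necessarily what was done for the plot described above.

```r
set.seed(1)

# One simulated year of flow data, built as in the question.
base <- c(1,1,1,1,1,2,2,1,2,2,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,6,5,5,
          4,4,4,3,4,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1)
year1 <- base + runif(length(base))
idx   <- seq_along(base)  # position of each measurement within the year

# Orthogonal polynomial fits of degree 4 and degree 5.
fit4 <- lm(year1 ~ poly(idx, 4))
fit5 <- lm(year1 ~ poly(idx, 5))

# F-test of whether the extra degree improves the fit.
print(anova(fit4, fit5))

# Fitted curve from the degree-4 model, one value per measurement.
flow_hat <- fitted(fit4)
```

The same two calls could be repeated per year (or per column of flow.mat) to get one fitted curve for each of the 22 years.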

How to add level2 predictors in multilevel regression (package nlme)

I have a question concerning multi level regression models in R, specifically how to add predictors for my level 2 "measure".
Please consider the following example (this is not a real dataset, so the values might not make much sense in reality):
date id count bmi poll
2012-08-05 1 3 20.5 1500
2012-08-06 1 2 20.5 1400
2012-08-05 2 0 23 1500
2012-08-06 2 3 23 1400
The data contains
different persons ("id"...so it's two persons)
the body mass index of each person ("bmi", so it doesn't vary within an id)
the number of heart problems each person has on a specific day ("count"). So person 1 had three problems on August the 5th, whereas person 2 had no difficulties/problems on that day
the amount of pollutants (like ozone or sulfur dioxide) which have been measured on that given day
My general research question is whether the amount of pollutants affects the number of heart problems in the population.
In a first step, this could be a simple linear regression:
lm(count ~ poll)
However, my data for each day is so to say clustered within persons. I have two measures from person 1 and two measures from person 2.
So my basic idea was to set up a multilevel model with persons (id) as my level 2 variable.
I used the nlme package for this analysis:
lme(fixed=count ~ poll, random = ~poll|id, ...)
No problems so far.
However, the true influence on level 2 might not only come from the fact that I have different persons. Rather it would be much more likely that the effect WITHIN a person might come from his or her bmi (and many other person related variables, like age, amount of smoking and so on).
To make a long story short:
How can I specify such level two predictors in the lme function?
Or in other words: How can I setup a model, where the relationship between heart problems and pollution is different/clustered/moderated by the body mass index of a person (and as I said maybe additionally by this person's amount of smoking or age)
Unfortunately, I don't have a clue how to tell R what I want. I know of other software (one of them called HLM) which is capable of doing what I want, but I'm quite sure that R can do this as well...
So, many thanks for any help!
deschen
Short answer: you do not have to, as long as you correctly specify random effects. The lme function automatically detects which variables are level 1 or 2. Consider this example using Oxboys where each subject was measured 9 times. For the time being, let me use lmer in the lme4 package.
library(nlme)
library(dplyr)
library(lme4)
library(lmerTest)
Oxboys %>% #1
filter(as.numeric(Subject)<25) %>% #2
mutate(Group=rep(LETTERS[1:3], each=72)) %>% #3
lmer(height ~ Occasion*Group + (1|Subject), data=.) %>% #4
anova() #5
Here I am picking 24 subjects (#2) and arranging them into 3 groups (#3) to make this data balanced. Now the design of this study is a split-plot design with a repeated-measures factor (Occasion) with q=9 levels and a between-subject factor (Group) with p=3 levels. Each group has n=8 subjects. Occasion is a level-1 variable while Group is level 2.
In #4, I did not specify which variable is level 1 or 2, but lmer gives you correct output. How do I know it is correct? Let us check the multi-level model's degrees of freedom for the fixed effects. If your data is balanced, the Kenward–Roger approximation available in lmerTest (requested with ddf = "Kenward-Roger"; the default is Satterthwaite) will give you exact dfs and F/t-ratios according to this article. That is, in this example the dfs for the tests of Group, Occasion, and their interaction should be p-1=2, q-1=8, and (p-1)*(q-1)=16, respectively. The df for the Subject error term is (n-1)p = 21 and the df for the Subject:Occasion error term is p(n-1)(q-1)=168. In fact, these are the "exact" values we get from the anova output (#5).
I do not know what algorithm lme uses for approximating dfs, but lme does give you the same dfs. So I am assuming that it is accurate.
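To tie this back to the original question, here is a minimal sketch of how a level-2 predictor such as bmi, and its moderation of the pollution effect, might be written in lme(). The data are simulated and the names `poll_c` and `bmi_c` (centred versions of poll and bmi) are my own. count is treated as continuous for simplicity, which a real count outcome may not justify, and the random part is kept to a random intercept.

```r
library(nlme)  # recommended package that ships with R

set.seed(123)

# Hypothetical data: 30 persons, each observed on the same 20 days.
n_id <- 30
n_day <- 20
d <- data.frame(id  = factor(rep(seq_len(n_id), each = n_day)),
                day = rep(seq_len(n_day), times = n_id))

# bmi varies only between persons (level 2); poll varies by day (level 1).
bmi_by_id   <- rnorm(n_id, mean = 24, sd = 3)
poll_by_day <- rnorm(n_day, mean = 1450, sd = 50)
d$bmi  <- bmi_by_id[as.integer(d$id)]
d$poll <- poll_by_day[d$day]

# Centre (and rescale poll) so coefficients are on a comparable scale.
d$bmi_c  <- d$bmi - mean(d$bmi)
d$poll_c <- (d$poll - mean(d$poll)) / 100

# Made-up outcome with a person-level intercept and a poll x bmi interaction.
u <- rnorm(n_id, sd = 0.5)
d$count <- 2 + 0.5 * d$poll_c + 0.1 * d$bmi_c +
  0.2 * d$poll_c * d$bmi_c + u[as.integer(d$id)] +
  rnorm(nrow(d), sd = 0.5)

# poll_c * bmi_c expands to both main effects plus the cross-level
# interaction poll_c:bmi_c, i.e. "the poll effect depends on bmi".
m <- lme(count ~ poll_c * bmi_c, random = ~ 1 | id, data = d)
print(summary(m)$tTable)
```

The level-2 predictor simply goes into the fixed formula; the poll_c:bmi_c row of the table is the moderation effect. Replacing random = ~ 1 | id with random = ~ poll_c | id would additionally let the pollution slope vary between persons, as in the model from the question.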
