How to check and control for autocorrelation in a mixed-effects model of longitudinal data in R?

I have behavioral data for many groups of birds over 10 days of observation. I wanted to investigate whether there is a temporal pattern in some behaviors (e.g. does mate competition increase over time?), and I was told that I had to account for the autocorrelation of the data, since behavior is unlikely to be independent from day to day.
However I was wondering about two things:
Since I'm not interested in the differences in y among days but in the trend of y over days, do I still need to correct for autocorrelation?
If yes, how do I control for the autocorrelation so that I'm left with only the signal (and noise, of course)?
For the second question, keep in mind that I will be analyzing the effect of time on behavior using mixed models in R (since there is pseudo-replication that calls for random effects), but I have not found any straightforward method of correcting for autocorrelation in the data when modeling the responses.

(1) Yes, you should check for/account for autocorrelation.
The first example here shows how to estimate trends in a mixed model while accounting for autocorrelation.
You can fit these models with lme from the nlme package. Here's a mixed model without autocorrelation included:
library(nlme)
cmod_lme <- lme(GS.NEE ~ cYear,
                data = mc2, method = "REML",
                random = ~ 1 + cYear | Site)
and you can explore the autocorrelation by using plot(ACF(cmod_lme)).
(2) Add correlation to the model, something like this:
cmod_lme_acor <- update(cmod_lme,
                        correlation = corAR1(form = ~ cYear | Site))
As @JeffreyGirard notes, to check the ACF after updating the model to include the correlation argument, you will need to use plot(ACF(cmod_lme_acor, resType = "normalized")).
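To see whether the AR1 term is actually supported, you can also compare the two fits directly. A minimal follow-up sketch, assuming the mc2 data from the linked example:
anova(cmod_lme, cmod_lme_acor)  # AIC/BIC and likelihood-ratio comparison of the two fits
plot(ACF(cmod_lme_acor, resType = "normalized"), alpha = 0.05)  # residual ACF with 95% bands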

Related

Extracting linear term from a polynomial predictor in a GLM

I am relatively new to both R and Stack Overflow, so please bear with me. I am currently using GLMs to model ecological count data under a negative binomial distribution in brms. Here is my general model structure, which I have chosen based on fit, convergence, low LOOIC when compared to other models, etc.:
My goal is to characterize population trends of study organisms over the study period. I have created marginal effects plots by using the model to predict on a new dataset where all covariates are constant except year (shaded areas are 80% and 95% credible intervals for posterior predicted means):
I am now hoping to extract trend magnitudes that I can report and compare across species (i.e. say a certain species declined or increased by x% (+/- y%) per year). Because I use poly() in the model, my understanding is that R uses orthogonal polynomials, and the resulting polynomial coefficients are not easily interpretable. I have tried generating raw polynomials (setting raw=TRUE in poly()), which I thought would produce the same fit and have directly interpretable coefficients. However, the resulting models don't really run (after 5 hours neither chain gets through even a single iteration, whereas the same model with raw=FALSE only takes a few minutes to run). Very simplified versions of the model (e.g. count ~ poly(year, 2, raw=TRUE)) do run, but take several orders of magnitude longer than setting raw=FALSE, and the resulting model also predicts different counts than the model with orthogonal polynomials. My questions are (1) what is going on here? and (2) more broadly, how can I feasibly extract the linear term of the quartic polynomial describing response to year, or otherwise get at a value corresponding to population trend?
I feel like this should be relatively simple, and I apologize if I'm overlooking something obvious. Please let me know if there is further code that I should share for more clarity; I didn't want to make the initial post crazy long, but I'm happy to show specific predictions from different models or anything else. Thank you for any help.
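A small reproducible check (made-up data, plain lm() rather than brms) of why raw = TRUE changes the coefficients but not the fit: orthogonal and raw polynomials describe the same fitted curve, and raw powers of a variable like year are nearly collinear, which is also the usual reason MCMC samplers struggle with raw = TRUE:
set.seed(1)
year  <- 2000:2020
count <- rpois(length(year), lambda = exp(2 + 0.05 * (year - 2010)))
m_orth <- lm(count ~ poly(year, 2))              # orthogonal polynomial (default)
m_raw  <- lm(count ~ poly(year, 2, raw = TRUE))  # raw powers of year
all.equal(fitted(m_orth), fitted(m_raw))         # TRUE: identical fitted values
cor(poly(year, 2, raw = TRUE))                   # year and year^2 are almost perfectly correlated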

GLS / GLM nested design with autocorrelation over time

Still fairly new to GLMs and a bit confused about how to set up my model.
About my project:
I sampled the microbiome (and measured a diversity index value = Shannon) from the root system of a sample of 9 trees (=tree1_cat).
In each tree I sampled fine and thick roots (=rootpart), and each tree was sampled four times (=days) over the course of one season. Thus I have a nested design but have to keep time in mind for autocorrelation. Also, not all values are present, so I have a few missing values. So far I have tried and tested the following:
Model <- gls(Shannon ~ tree1_cat/rootpart + tree1_cat + days,
             na.action = na.omit, data = psL.meta,
             correlation = corAR1(form = ~ 1 | days),
             weights = varIdent(form = ~ 1 | days))
Furthermore, I've tried to get more insight and used anova(Model) to get the p-values of those factors. Am I allowed to use those p-values? Also, I've used emmeans(Model, specs = pairwise ~ rootpart) for pairwise comparisons, but since rootpart was entered as a nested factor it only gives me the paired interactions.
It all works, but I am not sure whether this is the right model! Any help would be highly appreciated!
It would be helpful to know your scientific question, but let's suppose you're interested in differences in Shannon diversity between fine and thick roots and in time trends. A model you could use would be:
library(lmerTest)
lmer(Shannon ~ rootpart*days + (rootpart*days|tree1_cat), data = ...)
The fixed-effect component rootpart*days can be expanded into 1 + rootpart + days + rootpart:days (where 1 signifies the intercept):
intercept: Shannon diversity (SD) in fine roots on day 0 (hopefully that's the beginning of the season)
rootpart: difference between fine and thick roots on day 0
days: change per day in SD in fine roots (slope)
rootpart:days: difference in slope between thick roots and fine roots
The random-effect component (rootpart*days|tree1_cat) measures how all four of these effects vary across trees, and their correlations (e.g. do trees with a larger-than-average difference between fine and thick roots on day 0 also have a larger-than-average change over time in fine-root SD?).
This 'maximal' random-effects model is almost certainly too complex for your data; a rough rule of thumb says you should have 10-20 data points per parameter estimated, and the fixed-effect model alone takes 4 parameters. A full model with 4 random effects requires estimating a 4×4 covariance matrix, which has (4*5)/2 = 10 parameters all by itself. I might just try (1 + days | tree1_cat) (random slopes) or (rootpart | tree1_cat) (among-tree variation in the fine vs. thick difference), with a bias towards allowing for variation in the effect that is your primary interest (e.g. if your primary question is about fine vs. thick, go with (rootpart | tree1_cat)); see the sketches below.
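For concreteness, the two simplified alternatives would look like this (data = ... is a placeholder, as above):
library(lmerTest)
lmer(Shannon ~ rootpart * days + (1 + days | tree1_cat), data = ...)  # random time slopes per tree
lmer(Shannon ~ rootpart * days + (rootpart | tree1_cat), data = ...)  # per-tree fine vs. thick differences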
I probably wouldn't worry about autocorrelation at all, nor about heteroscedasticity by day (your varIdent(~ 1 | days) term), unless those patterns are very strongly evident in the data.
If you want to allow for autocorrelation, you'll need to fit the model with nlme::lme or glmmTMB (lmer still doesn't have machinery for autocorrelation models); something like:
library(nlme)
lme(Shannon ~ rootpart * days,
    random = ~ days | tree1_cat,
    data = ...,
    correlation = corCAR1(form = ~ days | tree1_cat)
)
You need to use corCAR1 (continuous-time autoregressive order-1) rather than the more common corAR1 because your data are unevenly sampled. Be aware that lme is more finicky and worse at dealing with singular models, so you may discover you need to simplify your model before you can actually get it to fit.
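If you go the glmmTMB route instead, a rough sketch follows; note the assumption here that glmmTMB's ar1() structure treats time as an evenly spaced factor (its ou() structure, with numFactor(), is the continuous-time analogue closer to corCAR1):
library(glmmTMB)
glmmTMB(Shannon ~ rootpart * days +
          (days | tree1_cat) +                # random intercept and time slope per tree
          ar1(factor(days) + 0 | tree1_cat),  # AR1 residual correlation within tree
        data = ...)                           # placeholder, as above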

Use glm to predict on fresh data

I'm relatively new to glm, so please bear with me.
I have created a glm (logistic regression) to predict whether an individual CONTINUES studies ("0") or does NOTCONTINUE ("1"). I am interested in predicting the latter. The glm uses seven factors in the dataset, the confusion matrices are very good for what I need, and combining seven years' worth of data has also been done. Straightforward.
However, I now need to apply the model to the current year's data, which of course does not have the NOTCONTINUE column in it. Let's say the glm model is "CombinedYears" and the new data is "Data2020".
How can I use the glm model to get predictions of who will ("0") or will not ("1") continue their studies? Do I need to insert a NOTCONTINUE column into the latest file? I have tried this structure:
Predict2020 <- predict(CombinedYears, newdata = Data2020, type = 'response')
but the output only holds values < 0.5.
Any help gratefully appreciated. Thank you in advance.
You mentioned that you have already created a prediction model to predict whether a particular student will continue their studies or not. You used the glm function and your model is named CombinedYears.
Now, what you have to know is that your problem is binary classification, and you used logistic regression for it. The output of your model, when you apply it to new data or even to the same data used to fit the model, is a set of probabilities: values between zero and one. In the development phase of your model, you need to determine a cutoff threshold for these probabilities, which you can then use when predicting on new data. For example, you may pick 0.5 as a cutoff, so that every probability above it is classified as NOTCONTINUE and everything below it as CONTINUE. However, the best threshold can also be determined from your data by maximizing both specificity and sensitivity. This is done by examining the receiver operating characteristic (ROC) curve and the area under it (AUC). There are many packages that can do this for you, such as the pROC and AUC packages in R. The same packages can determine the best cutoff as well.
What you have to do is the following:
Determine the cutoff threshold from the ROC curve:
library(pROC)
roc_object <- roc(your_fit_data$NOTCONTINUE ~ fitted(CombinedYears))
coords(roc_object, "best", ret = "threshold", transpose = FALSE)
Use your model to predict on the new year's data (as you did):
Predict2020 <- predict(CombinedYears, newdata = Data2020, type = 'response')
Now the content of Predict2020 is just a probability for each student. Use the cutoff you obtained in step (1) to classify the students accordingly.
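A minimal sketch of that last step (the 0.42 here is a hypothetical value standing in for whatever coords() returned above):
best_threshold <- 0.42  # hypothetical; use the threshold from coords()
Class2020 <- ifelse(Predict2020 >= best_threshold, 1, 0)  # 1 = NOTCONTINUE, 0 = CONTINUE
table(Class2020)  # counts per predicted class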

Consider autocorrelation in Linear Quantile Mixed Models (LQMM)

(I am using R and the lqmm package)
I was wondering how to account for autocorrelation in a linear quantile mixed model (LQMM).
I have a data frame that looks like this:
df1 <- data.frame(Time = seq(as.POSIXct("2017-11-13 00:00:00", tz = "UTC"),
                             as.POSIXct("2017-11-13 00:01:59", tz = "UTC"), "sec"),
                  HeartRate = rnorm(120, mean = 60, sd = 10),
                  Treatment = rep("TreatmentA", 120),
                  AnimalID = rep("ID01", 120),
                  Experiment = rep("Exp01", 120))
df2 <- data.frame(Time = seq(as.POSIXct("2017-08-11 00:00:00", tz = "UTC"),
                             as.POSIXct("2017-08-11 00:01:59", tz = "UTC"), "sec"),
                  HeartRate = rnorm(120, mean = 62, sd = 14),
                  Treatment = rep("TreatmentB", 120),
                  AnimalID = rep("ID02", 120),
                  Experiment = rep("Exp02", 120))
df <- rbind(df1, df2)
head(df)
With:
The heart rate (HeartRate) is measured every second on some animals (AnimalID). These measurements are taken during experiments (Experiment) with different possible treatments (Treatment). Each animal (AnimalID) was observed over multiple experiments with different treatments. I wish to look at the effect of the variable Treatment on the 90th percentile of heart rate, including Experiment as a random effect and accounting for the autocorrelation (as heart rates are taken every second). (If there is a way to include AnimalID as a random effect as well, it would be even better.)
Model for now:
library(lqmm)
model <- lqmm(fixed = HeartRate ~ Treatment, random = ~ 1,
              group = Experiment, tau = 0.9, data = df)
Thank you very much in advance for your help.
Let me know if you need more information.
For resources on thinking about this type of problem, you might look at chapters 17 and 19 of Koenker et al. (2018), Handbook of Quantile Regression (CRC Press). Neither chapter has nice R code to work from, but they discuss different approaches to the kind of data you're working with. lqmm does use nlme machinery, so there may be a way to customize the covariance matrices for the random effects, but I suspect it would be easiest either to ask the package author for help or to do a deep dive into the package code to figure out how to do that.
Another resource is the quantile regression model for mixed effects that accounts for autocorrelation in 'Quantile regression for mixed models with an application to examine blood pressure trends in China' by Smith et al. (2015). They model a bivariate response with a copula, but you could do a simplified version with a univariate response. I think their model at this point only incorporates a lag-1 correlation structure within subjects/clusters. The code for that model does not seem to be available online either, though.

Evaluate forecasts in terms of p-values and Pearson correlation

I am using R to do some evaluation of two different forecasting models. The basic idea of the evaluation is to compare the Pearson correlation and its corresponding p-value using cor.test(). The graph below shows the final result: the correlation coefficient and its p-value.
We suggest that the model with the lower correlation coefficient and a correspondingly lower p-value (less than 0.05) is better (or, alternatively, a higher correlation coefficient but with a pretty high corresponding p-value).
So, in this case, overall, we would say that model 1 is better than model 2.
But the question here is: is there any other specific statistical method to quantify the comparison?
Thanks a lot!
Assuming you're working with time series data, since you called out a "forecast", I think what you're really looking for is backtesting of your forecast model. From Ruey S. Tsay's "An Introduction to Analysis of Financial Data with R", you might want to take a look at his backtest.R function.
backtest(m1, rt, orig, h, xre = NULL, fixed = NULL, inc.mean = TRUE)
# m1:       a time-series model object
# rt:       the time series
# orig:     the starting forecast origin
# h:        forecast horizon
# xre:      the independent variables
# fixed:    parameter constraint
# inc.mean: flag for the constant term of the model
Backtesting allows you to see how well your models perform on past data, and Tsay's backtest.R provides RMSE and mean absolute error (MAE), which give you another perspective beyond correlation. A caution: depending on the size of your data and the complexity of your model, this can be a very slow-running test.
To compare models you'll normally look at RMSE, which is essentially the standard deviation of your model's errors. The two models' RMSE values are directly comparable, and smaller is better (see the sketch below).
An even better alternative is to set up training, testing, and validation sets before you build your models. If you train two models on the same training/test data, you can compare them against your validation set (which has never been seen by your models) to get a more accurate measurement of each model's performance.
One final alternative: if you have a "cost" associated with an inaccurate forecast, apply those costs to your predictions and add them up. If one model performs poorly on a more expensive segment of the data, you may want to avoid using it.
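A minimal sketch of the RMSE/MAE comparison on a holdout set (actual, pred1, and pred2 are hypothetical vectors: the validation observations and each model's forecasts for the same period):
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))   # root mean squared error
mae  <- function(actual, pred) mean(abs(actual - pred))        # mean absolute error
c(model1 = rmse(actual, pred1), model2 = rmse(actual, pred2))  # smaller is better
c(model1 = mae(actual, pred1),  model2 = mae(actual, pred2))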
As a side note, your interpretation of a p-value as "lower is better" isn't quite right.
P-values address only one question: how likely are your data, assuming the null hypothesis is true? They do not measure support for the alternative hypothesis.
