Appropriate statistical test to analyze differences in slopes of time series data - r

I have created the following plot based on air quality data over three years of observation, and would like to know if these slopes are different across the two time periods (March-June 2018-2019 average vs. March-June 2020):
A snapshot of my data frame is shown here:
The figure is made using the following code:
Lockdown_Period_plot_weekday <- ggplot(COVID_NO2_weekday_avgs_Rathmines, aes(x = Date_1, y = avg_daily_Rath_NO2, color = Period, shape = Period)) +
geom_smooth(method="lm", se = FALSE) +
geom_point(size=2) +
theme_bw() +
labs(x = 'Date',
y = 'Daily Avg [NO2] µg/m^3',
title = 'Weekday NO2 Trends During Lockdown',
subtitle = 'Rathmines AQ Station')
I know that I need to remove the effect of serial correlation first (as the independent variable is a time series), but I'm not exactly sure how to do this. Should I use the date column to do so? Or should I use the dummy column Date_2 to do this? This column is just a concatenation of Month.Date to create a series of x values that are numerical and continuous.
I used the gls() function to do this, and believe I have designated the date column as my serial correlation.
My attempt is displayed here:
library(nlme)
m <- gls(avg_daily_Rath_NO2 ~ Period,
data=COVID_NO2_weekday_avgs_Rathmines,
correlation=corARMA(p=1, q=0, form=~date))
summary(m)
Output:
Generalized least squares fit by REML
Model: avg_daily_Rath_NO2 ~ Period
Data: COVID_NO2_weekday_avgs_Rathmines
Correlation Structure: ARMA(1,0)
Formula: ~date
Parameter estimate(s):
Phi1
0.6066636
Coefficients:
Correlation:
(Intr)
PeriodMarch-June 2020 -0.569
Standardized residuals:
Min Q1 Med Q3
-1.8573362 -0.6487672 -0.1588551 0.5597100
Max
3.4017470
Residual standard error: 10.46725
Degrees of freedom: 256 total; 254 residual
I am a tad rusty when it comes to linear regression outputs, and am not sure how to interpret this one.
Additionally, I would like to check that my model is correctly structured to achieve my desired output.
Any help with this would be appreciated.
-TL;DR-
I want to run an ANCOVA on two lines to find out if the slopes differ across the Period variable.
I would like to remove the effect of serial correlation since the independent variable is a time series.
What is the most effective way to accomplish this?
More information can be provided if necessary.

Related

What is actually occurring in this multiple linear regression analysis done in R?

I am working on a project with data analysis in R.
What I am seeking to do is determine if a dataset can be described with a linear regression model, and then trying to test if certain subgroups of that dataset have a stronger correlation than the whole dataset. More specifically, I am comparing a dataset where students recorded their pulse and time estimations, and checking to see if there is a stronger correlation in a subgroup of the data where students were not found to have a daily rhythm to either variable vs. a subgroup where students were calculated to have a daily rhythm in both time estimation and heart rate. The values I am using are their daily averages for both time estimation and heart rate.
I ran a linear model of the whole dataset:
> summary(ptmod1)
Call:
lm(formula = avg.time ~ avg.pulse, data = pulsetime)
Residuals:
Min 1Q Median 3Q Max
-11.7310 -1.6725 -0.0162 2.0134 9.8548
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.82047 2.99244 22.998 <2e-16 ***
avg.pulse -0.10449 0.04115 -2.539 0.0125 *
and also attempted to run a linear regression of each subgroup
> summary(ptmod2)
Call:
lm(formula = avg.time ~ avg.pulse + Group, data = pulsetime)
Residuals:
Min 1Q Median 3Q Max
-12.9884 -1.7723 -0.1873 2.4900 8.7424
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.45350 2.92287 23.420 < 2e-16 ***
avg.pulse -0.08566 0.03985 -2.149 0.03388 *
GroupOne -1.22325 0.91444 -1.338 0.18386
GroupThree 0.11062 0.97666 0.113 0.91003
GroupTwo -3.09096 0.95446 -3.238 0.00161 **
However, I wanted to make sure that what I was seeing was correct, because I did not really expect so many of the groups to have significant coefficients. So I cut the groups up into their own .csv files and generated linear models for each of them individually. Cutting them up into their own files also made it easier to run a Chow test as a post-hoc analysis. When I ran regressions on them again, I got quite different coefficients.
For example, here is the summary for Group One:
> summary(mod1)
Call:
lm(formula = avg.time ~ avg.pulse, data = group1)
Residuals:
Min 1Q Median 3Q Max
-7.1048 -1.6529 -0.7279 1.4063 5.6574
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.41445 4.15917 16.209 8.99e-15 ***
avg.pulse -0.08916 0.05657 -1.576 0.128
This makes me question what exactly my results from the summary of ptmod2 actually mean? I was uncertain of how to set up the R code for the linear model sorted by individual subgroups, so my code for it was
> ptmod2<-lm(avg.time~avg.pulse+Group, data=pulsetime)
In my spreadsheet file, I have three columns: avg.pulse, avg.time, and Group. "Group" is a column of the words "One", "Two", "Three", and "Four" assigned based on subgroup.
Did the summary for ptmod2 fit a linear regression across the whole dataset? I am really not sure what happened.
Thank you so much for any insight you can provide. Perhaps my code for comparing regressions by group was incorrect.
This is somewhat of a split between a programming and a statistics question. It may be better suited to Cross Validated. However, the question is simple enough to address here.
Your question can be split into the following sub-questions:
Am I fitting a model on the full (whole) dataset in ptmod2?
How do I estimate multiple models across grouped datasets?
What is the correct way to analyse the coefficients of such a situation?
Am I fitting a model on the full (whole) dataset in ptmod2?
The long and short is "yes". In R and statistics, when you add a "group" variable to your model, this is not equivalent to splitting your dataset into multiple groups. Instead it adds indicator variables (0 or 1) for the specific groups, with one group left out as a reference level. So in your case you have 4 groups, 1 through 4, and you are adding an indicator for whether someone is in group 1, group 2, group 3 or (reference level) group 4. Each coefficient measures how much the intercept differs between that group and the reference group. I.e., these variables have the interpretation:
If the models share a common slope for avg.pulse, is there a significant difference in avg.time explained by the specific group?
The reason why you see only 3 groups and not 4, is that the fourth group is explained by setting all the other groups equal to FALSE. Eg. if you are not in group 1, 2 or 3 you are part of group 4. So the "effect" of being in group 4, is the effect of not being in group 1, 2 or 3 (in this case).
A method for studying this, that it seems many of my students found helpful, is to study a small version of the model.matrix for example:
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
model.matrix(mpg ~ hp + cyl, data = mtcars)
where you can clearly see that there is a column for the (Intercept), one for hp, and 2 columns for cyl6 and cyl8 (no column for cyl4, which is the reference). Matching the rows of cyl6 and cyl8 to the values in mtcars illustrates that a 1 in the cyl6 column indicates that cyl == 6.
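The same idea can be sketched on a toy version of the data described in the question. The data frame `toy` and its values are made up for illustration; only the column names and the four group labels come from the question. It shows why "Four" ends up as the reference level (factor levels are sorted alphabetically by default) and how `relevel()` can change that.

```r
# Hypothetical toy data: values are invented, column names follow the question
toy <- data.frame(
  avg.pulse = c(70, 72, 68, 75, 71, 69, 74, 73),
  avg.time  = c(61, 58, 62, 60, 63, 57, 61, 59),
  Group     = rep(c("One", "Two", "Three", "Four"), times = 2)
)
toy$Group <- factor(toy$Group)

# Levels are sorted alphabetically, so "Four" comes first and becomes the reference
levels(toy$Group)

# One indicator column per non-reference group
X <- model.matrix(avg.time ~ avg.pulse + Group, data = toy)
colnames(X)

# Choosing a different reference level, here "One"
toy$Group2 <- relevel(toy$Group, ref = "One")
X2 <- model.matrix(avg.time ~ avg.pulse + Group2, data = toy)
colnames(X2)
```

This matches the summary of ptmod2 in the question, where only GroupOne, GroupThree and GroupTwo appear as coefficients.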
How do I estimate multiple models across grouped datasets?
There are multiple methods for doing this, depending on the question being asked. In your case you seem interested in the question "Is there a significant difference in the effect of avg.pulse between the groups?". I.e., you want to estimate the avg.pulse coefficient for each group. One way is to do as you did later, and estimate a separate model for each group
groups <- split(pulsetime, pulsetime$Group)
models <- lapply(groups, function(df)lm(avg.time ~ avg.pulse, data = df))
lapply(models, summary)
which gives the estimate. The problem here is "how to compare these". There are methods for doing so, by comparing the covariance between the parameters between each model, which is called "multivariate statistical analysis" or multiple regression models. This is overly complicated however, as the models share a common outcome.
A much simpler method is to incorporate the different estimates by adding the "extra" effect for each group using indicator variables. This works similarly to adding the group variable, but instead of adding it alone (indicating the effect of being in group X), we multiply it with the variable in question using one of
# Let each group have their own `avg.pulse` variable
ptmod2_1 <- lm(formula = avg.time ~ avg.pulse : Group, data = pulsetime)
# Let each group have their own `avg.pulse` variable and account for the effect of `Group`
ptmod2_2 <- lm(formula = avg.time ~ avg.pulse * Group, data = pulsetime)
In the former you'll see avg.pulse:GroupX for all 4 groups, meaning these are the "effect of avg.pulse in group X", while in the latter you'll once again have a reference level. Note a stark difference between the 2 methods: in the former all groups share the same intercept, while in the latter each group can have a different intercept.
In general statistics the latter is the preferred method, unless you have a very good reason not to expect each group to have a different average. It is very similar to the rule of thumb: "Don't test your intercept, unless you have a very good reason, and even then you probably shouldn't". It makes a lot of logical sense to follow those rules (though it can take a few days of reflection to realize why).
What is the correct way to analyse the coefficients of such a situation?
If you've stuck with one of the 2 latter methods, the analysis is similar to a normal regression analysis. Coefficients can be tested using t-tests, anova and so on (using summary/drop1 and anova), and if you have a reason you can test group merging using standard test as well (although if they are insignificant there is rarely a reason to merge them either way). The whole trick becomes "how do I interpret the coefficients".
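As a minimal sketch of such a test, the common-slope model can be compared with the interaction model using `anova()`. This uses mtcars as a stand-in for the question's data, and the names `fit_common`, `fit_inter` and `cmp` are only illustrative.

```r
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)

# Common slope for hp, separate intercept per cyl group
fit_common <- lm(mpg ~ hp + cyl, data = mtcars)

# Separate slope and separate intercept per cyl group
fit_inter <- lm(mpg ~ hp * cyl, data = mtcars)

# F-test of the two extra slope terms (hp:cyl6 and hp:cyl8) taken together
cmp <- anova(fit_common, fit_inter)
cmp
```

A small p-value in the second row would suggest that at least one group has a slope that differs from the reference group's slope.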
For method 1 it is glaringly obvious: "Group 1 has an effect of avg.pulse of so much" and so on. For method 2 it is slightly more subtle. The effect of avg.pulse in group 1 is avg.pulse + avg.pulse:GroupOne, because avg.pulse does not disappear when you change group. It is the slope of the reference level, and every other effect is the additional effect on avg.pulse of going from the reference group to group X. Visually, your slope changes in the graph, becoming steeper (flatter) if the coefficient is positive (negative).
I've given a visualization below using the mtcars dataset, with mpg as the outcome, hp as the numeric variable and cyl (as a factor) as the grouping variable. Confidence intervals are removed as they are not important for the illustration. The important part is to note how different the 2 models are (cyl == 4 is positive in one, negative in the other!). This further supports the idea that method 2 is often "more correct" than method 1.
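The arithmetic for method 2 can be checked directly. The sketch below, again on mtcars rather than the question's data, adds the reference slope to each interaction term and compares the result with the slopes from separate per-group fits; the variable names are only for illustration.

```r
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)

fit_12 <- lm(mpg ~ hp * cyl, data = mtcars)
b <- coef(fit_12)

# Slope per group = reference slope + that group's interaction term
slopes <- c(
  "4" = unname(b["hp"]),                  # reference level
  "6" = unname(b["hp"] + b["hp:cyl6"]),
  "8" = unname(b["hp"] + b["hp:cyl8"])
)

# Slopes from fitting each group on its own
split_slopes <- sapply(
  split(mtcars, mtcars$cyl),
  function(df) coef(lm(mpg ~ hp, data = df))[["hp"]]
)

rbind(interaction_model = slopes, separate_fits = split_slopes)
```

The point estimates agree because the full interaction model lets every group have its own intercept and slope. What the single model adds is a common residual variance and a direct test of the slope differences.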
Code for reproducibility
Below is the code I've used for my illustrations and examples
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
model.matrix(mpg ~ hp + cyl, data = mtcars)
#split-fit across groups and run summary
groups <- split(mtcars, mtcars$cyl)
models <- lapply(groups, function(df)lm(mpg ~ hp, data = df))
lapply(models, summary)
#Fit using interaction effects
fit_11 <- lm(mpg ~ hp:cyl , mtcars)
fit_12 <- lm(mpg ~ hp*cyl , mtcars)
summary(fit_11)
summary(fit_12)
#Illustrate interaction effects
library(sjPlot)
library(sjmisc)
library(ggplot2)
library(patchwork)
theme_set(theme_sjplot())
p1 <- plot_model(fit_11, type = "pred", terms = c("hp","cyl"), ci.lvl = 0) + ggtitle("Same intercept different slope") +
geom_point(aes(x = hp, y = mpg, col = cyl, fill = cyl), mtcars, inherit.aes = FALSE)
p2 <- plot_model(fit_12, type = "pred", terms = c("hp", "cyl"), ci.lvl = 0) + ggtitle("Different intercept and slope") +
geom_point(aes(x = hp, y = mpg, col = cyl, fill = cyl), mtcars, inherit.aes = FALSE)
p1 / p2

How do I plot predicted probabilities for a Logit regression with fixed effects in R?

I am a complete newbie to R.
I have the following logit equation I am estimating:
allAM <- glm (AM ~ VS + Prom + LS_Exp + Sex + Age + Age2 + Jpart + X2004LS + X2009LS + X2014LS + factor(State), family = binomial(link = "logit"), data = mydata)
AM is a standard binary (happened/didn’t happen). The three “X****LS” variables are dummies indicating different sessions of congress and “factor(State)” is used to generate fixed effects/dummies for each state.
VS is the key independent variable of interest and I want to generate the predicted probability that AM=1 for each value of VS between 0 and 60, holding everything else at its mean.
I am running into trouble, however, generating and plotting the predicted probabilities because “State” is a factor. I want to be able to show the average effects, not 50 different charts/effects for each state.
Per (Hanmer and Kalkan 2013) http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2012.00602.x/abstract I was advised to do the following to plot the predicted probabilities:
pred.seq <- seq(from=0, to=60, by=0.01)
pred.out <- c()
for(i in 1:length(pred.seq)){
mydata.c <- mydata
mydata.c$VS <- pred.seq[i]
pred.out[i] <- mean(predict(allAM, newdata=mydata.c, type="response"))
}
plot(pred.out ~ pred.seq, type="l")
This approach seems to work, though I don’t really understand it.
I want to add the upper and lower 95% confidence intervals to the plot, but when I attempt to do it by hand the way I know how:
lower <- pred.out$fit - (1.96*pred.out$se.fit)
upper <- pred.out$fit + (1.96*pred.out$se.fit)
I get the following error:
Error in pred.out$fit : $ operator is invalid for atomic vectors
Can anyone advise how I can plot the confidence intervals and how I can specify different levels of VS so that I can report some specific predicted probabilities?

Analyze longitudinal data with a mixed effects model in R

I try to analyze some simulated longitudinal data in R using a mixed-effects model (lme4 package).
Simulated data: 25 subjects have to perform 2 tasks at 5 consecutive time points.
#Simulate longitudinal data
N <- 25
t <- 5
x <- rep(1:t,N)
#task1
beta1 <- 4
e1 <- rnorm(N*t, mean = 0, sd = 1.5)
y1 <- 1 + x * beta1 + e1
#task2
beta2 <- 1.5
e2 <- rnorm(N*t, mean = 0, sd = 1)
y2 <- 1 + x * beta2 + e2
data1 <- data.frame(id=factor(rep(1:N, each=t)), day = x, y = y1, task=rep(c("task1"),length(y1)))
data2 <- data.frame(id=factor(rep(1:N, each=t)), day = x, y = y2, task=rep(c("task2"),length(y2)))
data <- rbind(data1, data2)
Question1: How to analyze how a subject learns each task?
library(lme4)
m1 <- lmer(y ~ day + (1 | id), data=data1)
summary(m1)
...
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.2757 0.3561 123.0000 3.582 0.000489 ***
day 3.9299 0.1074 123.0000 36.603 < 2e-16 ***
With ranef(m1) I get the random intercept for each subject, which I think reflects the baseline value for each subject at day = 1. But I don't understand how I can tell how an individual learns a task, or whether subjects differ in the way how they learn the task.
Question2: How can I analyze whether the way subjects learn differ between task1 and task2.
I expanded on your example to answer your questions briefly, but I can recommend reading chapter 15 of Snijders & Bosker (2012) or the book by Singer & Willett (2003) for a better explanation. Day is treated as a continuous variable in your model. Seeing as you have panel data (i.e. everyone is measured on the same days) and day has no meaning apart from indicating the different measurement occasions, it may be better to treat day as a factor (i.e. use dummy variables).
However, for now I will continue with your example
Your first model (I think you want data instead of data1) gives a fixed linear slope (i.e. average slope, no difference between the tasks, no difference between individuals). The fixed intercept is the performance when day is 0, which has no meaning, so you may want to consider centering day for a better interpretation (or indeed use dummies). The random effect gives the individual deviation from this intercept, which has an estimated variance of 0.00 in your example, so individuals hardly differ from each other in their starting position.
m1 <- lmer(y ~ day + (1 | id), data=data)
summary(m1)
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.00 0.000
Residual 18.54 4.306
Number of obs: 250, groups: id, 25
We can extend this model by adding an interaction with task. Meaning that the fixed slope is different for task1 and task2 which answers question 2 I believe (you can also use update() to update your model)
m2 <- lmer(y ~ day*task + (1|id), data = data)
summary(m2)
The effect of day in this model is the fixed slope of your reference category (task1) and the interaction is the difference between the slope of task1 and task2. The fixed effect of task is the difference in intercept.
Model fit can be assessed with a deviance test; read Snijders & Bosker (2012) for an explanation of ML and REML estimates.
anova(m1,m2)
To add a random effect for the growth of individuals we can update the model again, which answers question 1
m3 <- lmer(y ~ day*task + (day|id), data = data)
summary(m3)
ranef(m3)
The random effects indicate the individual deviations in slope and intercept. A summary of the distribution of your random effects is included in the model summary (same as for m1).
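To make this concrete, here is a minimal sketch that re-simulates data in the same way as the question (with a seed added for reproducibility, and with `dat` and `n_days` as renamed stand-ins for the question's `data` and `t`), then reads off the average slope per task and the slope per subject. Because the simulation contains no real between-subject variation, lmer() may report a singular fit, which is expected here.

```r
library(lme4)

# Simulate as in the question: 25 subjects, 5 days, 2 tasks
set.seed(42)
N <- 25
n_days <- 5
day <- rep(1:n_days, N)
id <- factor(rep(1:N, each = n_days))
data1 <- data.frame(id = id, day = day, task = "task1",
                    y = 1 + 4 * day + rnorm(N * n_days, sd = 1.5))
data2 <- data.frame(id = id, day = day, task = "task2",
                    y = 1 + 1.5 * day + rnorm(N * n_days, sd = 1))
dat <- rbind(data1, data2)

m3 <- lmer(y ~ day * task + (day | id), data = dat)

# Average (fixed) slope per task: task1 is the reference category
fe <- fixef(m3)
slope_task1 <- fe[["day"]]
slope_task2 <- fe[["day"]] + fe[["day:tasktask2"]]
c(task1 = slope_task1, task2 = slope_task2)

# Per-subject coefficients = fixed effects + that subject's random deviations
per_subject <- coef(m3)$id
head(per_subject[, c("(Intercept)", "day")])
```

The `day` column of `per_subject` is each subject's estimated learning rate on the reference task, while `ranef(m3)` gives only the deviations from the average.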
Finally I think you could add a random effect on the day-task interaction to assess whether individuals differ in their performance growth on task1 and task2. But this depends very much on your data and the performance of the previous models.
m4 <- lmer(y ~ day*task + (day*task|id), data = data)
summary(m4)
ranef(m4)
Hope this helps. The books I recommended certainly should. Both provide excellent examples and explanation of theory (no R examples unfortunately). If you decide on a fixed occasion model (effect of day expressed by dummies) the nlme package provides excellent options to control the covariance structure of random effects. Good documentation of the package is provided by Pinheiro & Bates (2000).

glm model dataset summarisation

first post, so go easy.
In the insurance world of GLMing, the classic approach is to model claims frequency and average severity. With that in mind, I built a couple of models to experiment for myself and now have a question.
Could somebody please explain how GLM handles varying levels of summarisation of a dataset, particularly with regard to error estimates?
Consider the example below. The data exhibits strong severity trends for both variables:
- A has more expensive claims than B
- Ford > Kia > Vaux > Jag
I fitted a model to unsummarised and a summarised version of the dataset, and accordingly GLM fitted the same parameters in both cases
However, GLM indicates a well fitted model to the unsummarised data. But when I summarise and use a weighted mean, i.e. average severity, the model fits poorly. Maybe this is as you would expect; after all, the unsummarised data has more points to model with. Also, it appears the weighted mean is used to indicate RELATIVE strength, so here, specifying the weighted mean is pointless, since they are all the same weights.
But more fundamentally, can I not model average severity with GLM? I mean, I know the result of fitting a GLM to an unsummarised dataset will be an average severity, but I was hoping to fit a model to already summarised data. It appears that modelling on aggregated datasets will not give a true indication of the model fit.
Apologies if this is a stupid question, I'm not a statistician, so don't fully understand the Hessian Matrix.
Please see code below:
library(boot)
library(reshape)
library(plyr)  # ddply(), .() and summarise() used below come from plyr
dataset <- data.frame(
Person = rep(c("A", "B"), each=200),
Car = rep(c("Ford", "Kia", "Vaux", "Jag"), 2, each=50),
Amount = c(rgamma(50, 200), rgamma(50, 180), rgamma(50, 160), rgamma(50, 140),
rgamma(50, 100), rgamma(50, 80), rgamma(50, 60), rgamma(50, 40))
)
Agg1 <- ddply(dataset, .(Person, Car), summarise, mean=mean(Amount), length=length(Amount))
m1 <- glm(Amount ~ Person + Car, data = dataset, family = Gamma(link="log"))
m2 <- glm(mean ~ Person + Car, data = Agg1, family = Gamma(link="log"), weights=length)
summary(m1)
summary(m2)
Thanks,
Nick
Bottom line is that both models are identical - the reason the aggregated model "fits poorly" is entirely due to the reduction in degrees of freedom due to aggregation.
Before getting into why the models are identical, I should point out that this does not necessarily mean that either model is a good fit. You should run diagnostics on both, especially using:
par(mfrow=c(2,2))
plot(m1)
When you do this, you'll see that the residuals are normally distributed (which is essential), but that they follow a pattern (-, +, -), which is disturbing. I would want to understand that before declaring that this is a good model. [Admittedly, this is made up data, but the principles apply nevertheless.]
Comparing the aggregated to base models, look at the values of the coefficients.
coef.m1 <- summary(m1)$coefficients
coef.m2 <- summary(m2)$coefficients
cbind(coef.m1[,1],coef.m2[,1])
# [,1] [,2]
# (Intercept) 5.4096980 5.4096976
# PersonB -0.9249371 -0.9249366
# CarJag -0.6144606 -0.6144602
# CarKia -0.1786556 -0.1786555
# CarVaux -0.3597925 -0.3597923
The reason you think the aggregated model is "worse" is because of the p-values, but these depend on t = coeff/se . The ratio of se in m1 vs. m2 is the same for all coefficients:
coef.m2[,2]/coef.m1[,2]
# (Intercept) PersonB CarJag CarKia CarVaux
# 7.836171 7.836171 7.836171 7.836171 7.836171
Since
se ~ sd / √df
the ratio of se for the two models should be approximately
se_m2 / se_m1 = √( (n_m1 - 1) / (n_m2 - 1) )
sqrt((nrow(dataset)-1)/(nrow(Agg1)-1))
# [1] 7.549834
Frankly I'm puzzled why the ratio is not exactly equal to 7.55.
Put another way, glm(...) has no way of knowing that you aggregated your data. It thinks you are trying to fit a model with 4 parameters and an intercept to 8 data points.
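One way to see where the constant ratio of standard errors comes from is to compare the estimated dispersion of the two fits. The sketch below re-creates the question's data with a seed and uses base `aggregate()` instead of `ddply()` so it needs no extra packages; `agg`, `se_ratio` and `disp_ratio` are illustrative names. For a Gamma family with a log link the working weights reduce to the prior weights, so both fits share the same unscaled covariance matrix and the standard errors differ only through the dispersion estimate.

```r
set.seed(1)
dataset <- data.frame(
  Person = rep(c("A", "B"), each = 200),
  Car    = rep(c("Ford", "Kia", "Vaux", "Jag"), times = 2, each = 50),
  Amount = c(rgamma(50, 200), rgamma(50, 180), rgamma(50, 160), rgamma(50, 140),
             rgamma(50, 100), rgamma(50, 80),  rgamma(50, 60),  rgamma(50, 40))
)

# Aggregate to one row per Person/Car cell: cell mean and cell size
agg <- aggregate(Amount ~ Person + Car, data = dataset, FUN = mean)
names(agg)[names(agg) == "Amount"] <- "mean"
agg$n <- aggregate(Amount ~ Person + Car, data = dataset, FUN = length)$Amount

m1 <- glm(Amount ~ Person + Car, data = dataset, family = Gamma(link = "log"))
m2 <- glm(mean ~ Person + Car, data = agg, family = Gamma(link = "log"), weights = n)

# Same coefficients, very different residual degrees of freedom
all.equal(coef(m1), coef(m2), tolerance = 1e-5)
c(unaggregated = df.residual(m1), aggregated = df.residual(m2))

# Ratio of standard errors versus square root of the dispersion ratio
se_ratio <- summary(m2)$coefficients[, "Std. Error"] /
  summary(m1)$coefficients[, "Std. Error"]
disp_ratio <- sqrt(summary(m2)$dispersion / summary(m1)$dispersion)
se_ratio
disp_ratio
```

With only 3 residual degrees of freedom, the dispersion estimate of the aggregated model is very noisy, which is why its p-values look so much worse even though the fitted coefficients are the same.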

Converting Repeated Measures mixed model formula from SAS to R

There are several questions and posts about mixed models for more complex experimental designs, so I thought this simpler model would help other beginners in this process as well as me.
So, my question is: I would like to formulate a repeated measures ANCOVA in R from a SAS proc mixed procedure:
proc mixed data=df1;
FitStatistics=akaike
class GROUP person day;
model Y = GROUP X1 / solution alpha=.1 cl;
repeated / type=cs subject=person group=GROUP;
lsmeans GROUP;
run;
Here is the SAS output using the data created in R (below):
Effect     panel  Estimate   Std Error  DF  t Value  Pr > |t|  Alpha  Lower     Upper
Intercept         -9.8693    251.04     7   -0.04    0.9697    0.1    -485.49   465.75
panel      1      -247.17    112.86     7   -2.19    0.0647    0.1    -460.99   -33.3510
panel      2      0          .          .   .        .         .      .         .
X1                20.4125    10.0228    7   2.04     0.0811    0.1    1.4235    39.4016
Below is how I formulated the model in R using 'nlme' package, but am not getting similar coefficient estimates:
## create reproducible example fake panel data set:
set.seed(94); subject.id = abs(round(rnorm(10)*10000,0))
set.seed(99); sds = rnorm(10,15,5);means = 1:10*runif(10,7,13);trends = runif(10,0.5,2.5)
this = NULL; set.seed(98)
for(i in 1:10) { this = c(this,rnorm(6, mean = means[i], sd = sds[i])*trends[i]*1:6)}
set.seed(97)
that = sort(rep(rnorm(10,mean = 20, sd = 3),6))
df1 = data.frame(day = rep(1:6,10), GROUP = c(rep('TEST',30),rep('CONTROL',30)),
Y = this,
X1 = that,
person = sort(rep(subject.id,6)))
## use package nlme
require(nlme)
## run repeated measures mixed model using compound symmetry covariance structure:
summary(lme(Y ~ GROUP + X1, random = ~ +1 | person,
correlation=corCompSymm(form=~day|person), na.action = na.exclude,
data = df1,method='REML'))
Now, the output from R, which I now realize is similar to the output from lm():
Value Std.Error DF t-value p-value
(Intercept) -626.1622 527.9890 50 -1.1859379 0.2413
GROUPTEST -101.3647 156.2940 7 -0.6485518 0.5373
X1 47.0919 22.6698 7 2.0772934 0.0764
I believe I'm close as to the specification, but not sure what piece I'm missing to make the results match (within reason..). Any help would be appreciated!
UPDATE: Using the code in the answer below, the R output becomes:
> summary(model2)
Scroll to bottom for the parameter estimates -- look! identical to SAS.
Linear mixed-effects model fit by REML
Data: df1
AIC BIC logLik
776.942 793.2864 -380.471
Random effects:
Formula: ~GROUP - 1 | person
Structure: Diagonal
GROUPCONTROL GROUPTEST Residual
StdDev: 184.692 14.56864 93.28885
Correlation Structure: Compound symmetry
Formula: ~day | person
Parameter estimate(s):
Rho
-0.009929987
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | GROUP
Parameter estimates:
TEST CONTROL
1.000000 3.068837
Fixed effects: Y ~ GROUP + X1
Value Std.Error DF t-value p-value
(Intercept) -9.8706 251.04678 50 -0.0393178 0.9688
GROUPTEST -247.1712 112.85945 7 -2.1900795 0.0647
X1 20.4126 10.02292 7 2.0365914 0.0811
Please try below:
model1 <- lme(
Y ~ GROUP + X1,
random = ~ GROUP | person,
correlation = corCompSymm(form = ~ day | person),
na.action = na.exclude, data = df1, method = "REML"
)
summary(model1)
I think the random = ~ groupvar | subjvar option in R's lme gives a result similar to the repeated / subject = subjvar group = groupvar option in SAS/MIXED in this case.
Edit:
SAS/MIXED
R (a revised model2)
model2 <- lme(
Y ~ GROUP + X1,
random = list(person = pdDiag(form = ~ GROUP - 1)),
correlation = corCompSymm(form = ~ day | person),
weights = varIdent(form = ~ 1 | GROUP),
na.action = na.exclude, data = df1, method = "REML"
)
summary(model2)
So, I think these covariance structures are very similar (σ_{g1} = τ_g^2 + σ_1).
Edit 2:
Covariance parameter estimates (SAS/MIXED):
Cov Parm  Subject  Group          Estimate
Variance  person   GROUP TEST     8789.23
CS        person   GROUP TEST     125.79
Variance  person   GROUP CONTROL  82775
CS        person   GROUP CONTROL  33297
So
TEST group diagonal element
= 125.79 + 8789.23
= 8915.02
CONTROL group diagonal element
= 33297 + 82775
= 116072
where diagonal element = σ_{k1} + σ_{k2}.
Covariance parameter estimates (R lme):
Random effects:
Formula: ~GROUP - 1 | person
Structure: Diagonal
GROUP1TEST GROUP2CONTROL Residual
StdDev: 14.56864 184.692 93.28885
Correlation Structure: Compound symmetry
Formula: ~day | person
Parameter estimate(s):
Rho
-0.009929987
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | GROUP
Parameter estimates:
1TEST 2CONTROL
1.000000 3.068837
So
TEST group diagonal element
= 14.56864^2 + (3.068837^0.5 * 93.28885 * -0.009929987) + 93.28885^2
= 8913.432
CONTROL group diagonal element
= 184.692^2 + (3.068837^0.5 * 93.28885 * -0.009929987) + (3.068837 * 93.28885)^2
= 116070.5
where diagonal element = τ_g^2 + σ_1 + σ_g^2.
Oooh, this is going to be a tricky one, and if it's even possible using standard nlme functions, is going to take some serious study of Pinheiro/Bates.
Before you spend the time doing that though, you should make absolutely sure that this is the exact model you need. Perhaps there's something else that might fit the story of your data better. Or maybe there's something R can do more easily that is just as good, but not quite the same.
First, here's my take on what you're doing in SAS with this line:
repeated / type=cs subject=person group=GROUP;
This type=cs subject=person is inducing correlation between all the measurements on the same person, and that correlation is the same for all pairs of days. The group=GROUP is allowing the correlation for each group to be different.
In contrast, here's my take on what your R code is doing:
random = ~ +1 | person,
correlation=corCompSymm(form=~day|person)
This code is actually adding almost the same effect in two different ways; the random line is adding a random effect for each person, and the correlation line is inducing correlation between all the measurements on the same person. However, these two things are almost identical; if the correlation is positive, you get the exact same result by including either of them. I'm not sure what happens when you include both, but I do know that only one is necessary. Regardless, this code has the same correlation for all individuals, it's not allowing each group to have their own correlation.
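The near-equivalence of the two specifications can be checked on a small example. The sketch below uses the Orthodont data that ships with nlme instead of the question's df1 (an assumption made purely for illustration) and fits a random-intercept model with lme() and a compound-symmetry model with gls(). When the within-subject correlation is positive, both describe the same marginal model.

```r
library(nlme)

# Random intercept per subject
m_re <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

# No random effect, but compound-symmetric correlation within subject
m_cs <- gls(distance ~ age,
            correlation = corCompSymm(form = ~ 1 | Subject),
            data = Orthodont)

# Same fixed effects and (up to optimizer tolerance) same REML log-likelihood
rbind(random_intercept = fixef(m_re), compound_symmetry = coef(m_cs))
ll <- c(random_intercept  = as.numeric(logLik(m_re)),
        compound_symmetry = as.numeric(logLik(m_cs)))
ll
```

Including both pieces in one model, as in the question's code, asks the optimizer to separate two parameters that describe almost the same thing, which is a plausible reason for unstable estimates.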
To let each group have their own correlation, I think you have to build a more complicated correlation structure up out of two different pieces; I've never done this but I'm pretty sure I remember Pinheiro/Bates doing it.
You might consider instead adding a random effect for person and then letting the variance be different for the different groups with weights=varIdent(form=~1|group) (from memory, check my syntax, please). This won't quite be the same but tells a similar story. The story in SAS is that the measurements on some individuals are more correlated than the measurements on other individuals. Thinking about what that means, the measurements for individuals with higher correlation will be closer together than the measurements for individuals with lower correlation. In contrast, the story in R is that the variability of measurements within individuals varies; thinking about that, measurements with higher variability will have lower correlation. So they do tell similar stories, but come at it from opposite sides.
It is even possible (but I would be surprised) that these two models end up being different parameterizations of the same thing. My intuition is that the overall measurement variability will be different in some way. But even if they aren't the same thing, it would be worth writing out the parameterizations just to be sure you understand them and to make sure that they are appropriately describing the story of your data.
