I'm studying the relationship between expenditure per student and performance on PISA (a standardized test). I know that this regression can't give me a ceteris paribus relationship, but that is the point of my exercise: I have to explain why it will not work.
I was running the regression in R with the basic code:
lm1=lm(a~b)
but the problem is that R reports 32 coefficients, which is the number of units in my population, while I should only get the slope and the intercept, given that this is a simple regression.
This is the output that R gives me:
Call:
lm(formula = a ~ b)
Coefficients:
(Intercept) b10167.3 b10467.8 b10766.4 b10863.4 b10960.1 b11.688.4 b11028.1 b11052 b11207.3 b11855.9 b12424.3 b13930.8
522.9936 5.9561 0.3401 -20.6884 -14.8603 -15.0777 -3.5752 -23.0459 -27.1021 -42.2692 -20.4485 -35.3906 -30.7468
b14353.3 b2.997.9 b20450.9 b3714.8 b4996.3 b5291.6 b5851.7 b6190.7 b6663.3 b6725.3 b6747.2 b7074.9 b8189.1
-18.4412 -107.2872 -39.6793 -98.2315 -80.2505 -36.2202 -48.6179 -64.2414 1.3887 -19.0389 -59.9734 -32.0751 -31.5962
b8406.2 b8533.5 b8671.1 b8996.3 b9265.7 b9897.2
-13.4219 -26.0155 -13.9045 -37.9996 -17.0271 -27.2954
As you can see, there are 32 coefficients while I should receive only two. It seems that R is reading each unit of the population as a variable, but the dataset is, as always, set with variables in row. I can't figure out what the problem is.
It's not a problem with the lm function. It appears that R is treating $b$ as a categorical variable.
I have made a small data set with 5 observations, $a$ (numeric variable) and $b$ (categorical variable).
When I fit the model, I get output similar to yours (5 estimated coefficients).
data = data.frame(a = 1:5, b = as.factor(rnorm(5)))
lm(a~b, data)
Call:
lm(formula = a ~ b, data = data)
Coefficients:
(Intercept) b-0.16380292500502 b0.213340249988902 b0.423891299272316 b0.63738307939327
4 -3 -1 1 -2
To correct this you need to convert $b$ into a numerical vector.
data$b = as.numeric(as.character(data$b))
lm(a~b, data)
Call:
lm(formula = a ~ b, data = data)
Coefficients:
(Intercept) b
2.9580 0.2772
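Before converting, it can help to find out why the column was not read as numeric in the first place. Two of the coefficient names in the question's output, b11.688.4 and b2.997.9, contain two dots, so those entries cannot be parsed as numbers, and entries like these would be enough to make R read the whole column as text or a factor. The following is a minimal sketch with made-up values (the data frame `dat` and the scores in `a` are assumptions for illustration) that lists the entries that fail conversion:

```r
# Hypothetical values; "11.688.4" and "2.997.9" mimic the malformed
# entries visible in the coefficient names of the question's output.
raw <- c("10167.3", "11.688.4", "2.997.9", "8406.2")
dat <- data.frame(a = c(510, 495, 520, 480), b = factor(raw))

str(dat)   # b is listed as a Factor, not as num

# Convert via character; entries that are not valid numbers become NA
b_num <- suppressWarnings(as.numeric(as.character(dat$b)))
bad   <- as.character(dat$b)[is.na(b_num)]
print(bad) # the entries that need fixing in the source file
```

Note that such entries turn into NA after conversion, and lm() drops those rows by default, so it is probably worth correcting them in the source data rather than converting blindly.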
I am working on a project with data analysis in R.
What I am seeking to do is determine if a dataset can be described with a linear regression model, and then trying to test if certain subgroups of that dataset have a stronger correlation than the whole dataset. More specifically, I am comparing a dataset where students recorded their pulse and time estimations, and checking to see if there is a stronger correlation in a subgroup of the data where students were not found to have a daily rhythm to either variable vs. a subgroup where students were calculated to have a daily rhythm in both time estimation and heart rate. The values I am using are their daily averages for both time estimation and heart rate.
I ran a linear model of the whole dataset:
> summary(ptmod1)
Call:
lm(formula = avg.time ~ avg.pulse, data = pulsetime)
Residuals:
Min 1Q Median 3Q Max
-11.7310 -1.6725 -0.0162 2.0134 9.8548
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.82047 2.99244 22.998 <2e-16 ***
avg.pulse -0.10449 0.04115 -2.539 0.0125 *
and also attempted to run a linear regression of each subgroup
> summary(ptmod2)
Call:
lm(formula = avg.time ~ avg.pulse + Group, data = pulsetime)
Residuals:
Min 1Q Median 3Q Max
-12.9884 -1.7723 -0.1873 2.4900 8.7424
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.45350 2.92287 23.420 < 2e-16 ***
avg.pulse -0.08566 0.03985 -2.149 0.03388 *
GroupOne -1.22325 0.91444 -1.338 0.18386
GroupThree 0.11062 0.97666 0.113 0.91003
GroupTwo -3.09096 0.95446 -3.238 0.00161 **
However, I wanted to make sure that what I was seeing was correct, because I did not really expect so many of the groups to have significant coefficients. So I cut the groups up into their own .csv files and generated linear models for each of them individually. Cutting them up into their own files also made it easier to run a Chow test as a post-hoc analysis. When I ran regressions on them again, I got quite different coefficients.
For example, here is the summary for Group One:
> summary(mod1)
Call:
lm(formula = avg.time ~ avg.pulse, data = group1)
Residuals:
Min 1Q Median 3Q Max
-7.1048 -1.6529 -0.7279 1.4063 5.6574
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.41445 4.15917 16.209 8.99e-15 ***
avg.pulse -0.08916 0.05657 -1.576 0.128
This makes me question what exactly my results from the summary of ptmod2 actually mean? I was uncertain of how to set up the R code for the linear model sorted by individual subgroups, so my code for it was
> ptmod2<-lm(avg.time~avg.pulse+Group, data=pulsetime)
In my spreadsheet file, I have three columns: avg.pulse, avg.time, and Group. "Group" is a column of the words "One", "Two", "Three", and "Four" assigned based on subgroup.
Did the summary for ptmod2 fit a linear regression across the whole dataset? I am really not sure what happened.
Thank you so much for any insight you can provide. Perhaps my code for comparing regressions by group was incorrect.
This is somewhat of a split between a programming and a statistics question. It is maybe better suited for Cross Validated. However, the question is simple enough to answer here.
Your question can be split into the following sub-questions:
Am I fitting a model on the full (whole) dataset in ptmod2?
How do I estimate multiple models across grouped datasets?
What is the correct way to analyse the coefficients of such a situation?
Am I fitting a model on the full (whole) dataset in ptmod2?
The long and short is "yes". In R and statistics, when you add a "group" variable to your model, this is not equivalent to splitting your dataset into multiple groups. Instead it adds indicator variables (0 or 1) for the specific groups, relative to a reference level. So in your case you have 4 groups, 1 through 4, and you are adding an indicator for whether someone is in group 1, group 2, group 3 or (reference level) group 4. This is a measure of how much the intercept differs between the groups. E.g. these variables have the interpretation:
If the models share a common slope for avg.pulse, is there a significant difference in avg.time explained by the specific group?
The reason why you see only 3 groups and not 4 is that the fourth group is described by setting all the other indicators to FALSE. E.g. if you are not in group 1, 2 or 3, you are part of group 4. So the "effect" of being in group 4 is the effect of not being in group 1, 2 or 3 (in this case).
A method for studying this, that it seems many of my students found helpful, is to study a small version of the model.matrix for example:
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
model.matrix(mpg ~ hp + cyl, data = mtcars)
where you can clearly see that there is a column for the (Intercept), one for hp, and 2 columns for cyl6 and cyl8 (no column for cyl4, which is the reference). Matching the rows of cyl6 and cyl8 to the values in mtcars illustrates that a 1 in the cyl6 column indicates that cyl == 6.
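The indicator coding can also be checked directly instead of by eye. The sketch below cross-tabulates the cyl6 column against the original factor and then uses relevel() to pick a different reference level; the helper column name `cyl8ref` is made up for this illustration.

```r
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)

mm <- model.matrix(mpg ~ hp + cyl, data = mtcars)
print(colnames(mm))  # intercept, hp and one indicator per non-reference level

# The cyl6 indicator is 1 exactly for the rows where cyl == 6
print(table(cyl = mtcars$cyl, cyl6 = mm[, "cyl6"]))

# Choosing another reference level changes which indicator is left out
mtcars$cyl8ref <- relevel(mtcars$cyl, ref = "8")
print(colnames(model.matrix(mpg ~ hp + cyl8ref, data = mtcars)))
```

Changing the reference level does not change the fitted values, only which group the other coefficients are compared against.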
How do I estimate multiple models across grouped datasets?
There are multiple methods for doing this, depending on the question being asked. In your case you seem interested in the question "Is there a significant difference in the effect of avg.pulse between the groups?", i.e. you want to estimate the avg.pulse coefficient for each group. One option is to do as you did later, and fit a separate model for each group
groups <- split(pulsetime, pulsetime$Group)
models <- lapply(groups, function(df)lm(avg.time ~ avg.pulse, data = df))
lapply(models, summary)
which gives the estimates. The problem here is "how to compare these". There are methods for doing so, by comparing the covariance between the parameters of each model, which is called "multivariate statistical analysis" or multiple regression models. This is overly complicated however, as the models share a common outcome.
A much simpler method is to incorporate the different estimates by adding the "extra" effect for each group using indicator variables. This works similarly to adding the group variable, but instead of adding it alone (indicating the effect of being in group X), we multiply it with the variable in question (an interaction) using one of
# Let each group have their own `avg.pulse` variable
ptmod2_1 <- lm(formula = avg.time ~ avg.pulse : Group, data = pulsetime)
# Let each group have their own `avg.pulse` variable and account for the effect of `Group`
ptmod2_2 <- lm(formula = avg.time ~ avg.pulse * Group, data = pulsetime)
In the former you'll see avg.pulse:GroupX for all 4 groups, meaning these are the "effect of avg.pulse in group X", while in the latter you'll once again have a reference level. Note that a stark difference between the 2 methods is that in the former all groups have the same intercept, while in the latter all groups can have a different intercept.
In general statistics the latter is the preferred method, unless you have a very good reason not to expect each group to have a different average. It is very similar to the rule of thumb: "Don't test your intercept, unless you have a very good reason, and even then you probably shouldn't". Basically because it makes a lot of logical sense to follow those rules (though it can take a few days of reflecting to realize why).
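The question "do the slopes differ between groups?" can be put to a single test by comparing a common-slope model against the interaction model with an F-test. This is a minimal sketch on mtcars (standing in for the pulsetime data, which is not available here); the object names `fit_common` and `fit_inter` are made up for the illustration.

```r
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)

# Common slope for hp, group-specific intercepts
fit_common <- lm(mpg ~ hp + cyl, data = mtcars)
# Group-specific slopes and intercepts
fit_inter  <- lm(mpg ~ hp * cyl, data = mtcars)

# F-test of the two extra interaction coefficients
cmp <- anova(fit_common, fit_inter)
print(cmp)
```

A small p-value in the second row would suggest that at least one group has a different slope; a large one gives no evidence against a common slope.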
What is the correct way to analyse the coefficients of such a situation?
If you've stuck with one of the 2 latter methods, the analysis is similar to a normal regression analysis. Coefficients can be tested using t-tests, anova and so on (using summary/drop1 and anova), and if you have a reason you can test group merging using standard test as well (although if they are insignificant there is rarely a reason to merge them either way). The whole trick becomes "how do I interpret the coefficients".
For method 1 it is glaringly obvious: "Group 1 has an effect of avg.pulse of so much" and so on. For method 2 it is slightly more subtle. The effect of avg.pulse in group 1 is avg.pulse + avg.pulse:GroupOne, because avg.pulse does **not** disappear when you change group. It is the slope of the reference level, and every other effect is the **additional** effect on avg.pulse of going from the reference group to group X. Visually, your slope changes in the graph, increasing (decreasing) if the coefficient is positive (negative).
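To make the "reference slope plus additional effect" reading concrete, the per-group slopes can be rebuilt from the interaction model and compared with the slopes from separate per-group fits. A small sketch on mtcars, with hp playing the role of avg.pulse and cyl the role of Group (variable names such as `slopes_inter` are made up here):

```r
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)

fit_inter <- lm(mpg ~ hp * cyl, data = mtcars)
b <- coef(fit_inter)

# Slope per group = reference slope + that group's interaction coefficient
slopes_inter <- c("4" = unname(b["hp"]),
                  "6" = unname(b["hp"] + b["hp:cyl6"]),
                  "8" = unname(b["hp"] + b["hp:cyl8"]))

# Slopes from fitting each group separately
slopes_split <- sapply(split(mtcars, mtcars$cyl),
                       function(df) coef(lm(mpg ~ hp, data = df))[["hp"]])

print(rbind(slopes_inter, slopes_split))
```

With a full interaction (`*`) the two rows agree, because every group gets its own intercept and slope. The standard errors will generally differ, since the interaction model pools the residual variance across groups.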
I've given a visualization below using the mtcars dataset with using mpg for outcome, hp for numeric variable and cyl (as a factor) as a grouping variable. Confidence intervals are removed as they are not important for the illustration. The important part is to note how different the 2 models are (cyl == 4 is positive in one, negative in the other!). This further goes along the idea why method 2 is often "more correct" than the prior.
Code for reproducibility
Below is the code I've used for my illustrations and examples
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
model.matrix(mpg ~ hp + cyl, data = mtcars)
#split-fit across groups and run summary
groups <- split(mtcars, mtcars$cyl)
models <- lapply(groups, function(df)lm(mpg ~ hp, data = df))
lapply(models, summary)
#Fit using interaction effects
fit_11 <- lm(mpg ~ hp:cyl , mtcars)
fit_12 <- lm(mpg ~ hp*cyl , mtcars)
summary(fit_11)
summary(fit_12)
#Illustrate interaction effects
library(sjPlot)
library(sjmisc)
library(ggplot2)
library(patchwork)
theme_set(theme_sjplot())
p1 <- plot_model(fit_11, type = "pred", terms = c("hp","cyl"), ci.lvl = 0) + ggtitle("Same intercept different slope") +
geom_point(aes(x = hp, y = mpg, col = cyl, fill = cyl), mtcars, inherit.aes = FALSE)
p2 <- plot_model(fit_12, type = "pred", terms = c("hp", "cyl"), ci.lvl = 0) + ggtitle("Different intercept and slope") +
geom_point(aes(x = hp, y = mpg, col = cyl, fill = cyl), mtcars, inherit.aes = FALSE)
p1 / p2
Using R, I want to statistically compare two coefficients from the same regression. In the Stata software, there is the test B1 = B2. What is the equivalent in R? I checked several posts, but none of them answered this question.
https://stats.stackexchange.com/questions/33013/what-test-can-i-use-to-compare-slopes-from-two-or-more-regression-models
SPSS: Comparing regression coefficient from multiple models
Comparing regression models in R
Here are some simulated data.
library('MASS')
mu <- c(0,0,0)
Sigma <- matrix(.5, nrow=3, ncol=3) + diag(3)*0.3
# as.data.frame() is called directly, so no pipe package has to be loaded
MyData <- as.data.frame(mvrnorm(n=10000, mu=mu, Sigma=Sigma))
names(MyData) = c('v1', 'v2', 'y')
MyModel = lm(y ~ v1 * v2, data = MyData)
summary(MyModel)
I want to compare the estimate of v1 to that of v2, so that if v1 and v2 are manipulated, I can say something like "the influence of v1 on y is significantly higher than the influence of v2 on y".
You can try multcomp. If you look at the coefficients of your model:
coefficients(MyModel)
(Intercept) v1 v2 v1:v2
0.006961219 0.373547048 0.394760005 -0.012167754
You want to find the difference between the 2nd and 3rd term, so your contrast matrix is:
# yes it looks a bit weird at first
ctr = rbind("v1-v2"=c(0,1,-1,0))
And we can apply this using glht:
library(multcomp)
summary(glht(MyModel,ctr))
Simultaneous Tests for General Linear Hypotheses
Fit: lm(formula = y ~ v1 * v2, data = MyData)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
v1-v2 == 0 -0.02121 0.01640 -1.294 0.196
(Adjusted p values reported -- single-step method)
This works for most general linear models. In the summary output, you get the significance of each term based on the effect / standard error, and the glht function does something similar. One exception I can think of, for logistic regression, is when you have complete separation.
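For a single contrast, the same kind of test can be computed by hand from the coefficient vector and its covariance matrix, which shows what "effect / standard error" means here. The sketch below uses only base R and its own small simulated data set (the seed, sample size and true coefficients are assumptions for illustration); it is meant to illustrate the calculation, not to replace glht.

```r
set.seed(1)
n  <- 500
v1 <- rnorm(n)
v2 <- rnorm(n)
y  <- 0.4 * v1 + 0.4 * v2 + rnorm(n)
fit <- lm(y ~ v1 * v2)

b <- coef(fit)
V <- vcov(fit)

# Estimate and standard error of the difference b_v1 - b_v2:
# Var(b1 - b2) = Var(b1) + Var(b2) - 2 * Cov(b1, b2)
est <- unname(b["v1"] - b["v2"])
se  <- sqrt(V["v1", "v1"] + V["v2", "v2"] - 2 * V["v1", "v2"])

t_stat <- est / se
p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)
print(c(estimate = est, std.error = se, t = t_stat, p = p_val))
```

Keep in mind that comparing the sizes of two coefficients is only meaningful when the predictors are on comparable scales, and that the model here also contains a v1:v2 interaction, so each coefficient is the effect of one variable when the other is 0.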
I am using the phylolm package to fit a model with a binary response variable (0/1), a continuous predictor, and a categorical predictor having more than 3 levels. If I treat the categorical predictor as continuous, i.e., 0, 1, 2, 3, the model runs well and I can use summary(model) to obtain the model output. However, treating categorical levels as continuous does not fit reality, so I think it is right to treat them as categories. Fitted this way, the model works, but I have trouble obtaining the output I need: summary(model) gives the results for each category compared to the first. I would like an "anova"-type table summarizing the significance of each variable, but the anova function does not work for this kind of model. Is there any way to obtain such results for this model?
Some example scripts:
require(phylolm)
set.seed(123456)
# Simulate a tree of 50 species
tre = rtree(50)
# Simulate a continuous trait
conTrait = rTrait(n=1,phy=tre)
# Make a design matrix for the binary trait simulation
X = cbind(rep(1,50),conTrait)
# Simulate a binary trait
binTrait = rbinTrait(n=1,phy=tre, beta=c(-1,0.5), alpha=1 ,X=X)
# Simulate a random categorical trait
catTrait <- as.factor(sample(c("A","B","C"), size=length(tre$tip.label), replace=TRUE))
# Create data frame
dat = data.frame(binTrait = binTrait, conTrait = conTrait, catTrait = catTrait)
### run the model
fit = phyloglm(binTrait ~ conTrait*catTrait, phy=tre, data=dat)
##model output
summary(fit)
Call:
phyloglm(formula = binTrait ~ conTrait * catTrait, data = dat,
phy = tre)
AIC logLik Pen.logLik
52.07 -19.04 -17.28
Method: logistic_MPLE
Mean tip height: 3.596271
Parameter estimate(s):
alpha: 1.437638
Coefficients:
Estimate StdErr z.value p.value
(Intercept) -0.61804 0.83270 -0.7422 0.4580
conTrait 1.52295 1.16256 1.3100 0.1902
catTraitB 0.92563 0.98812 0.9368 0.3489
catTraitC -0.24900 1.01255 -0.2459 0.8057
conTrait:catTraitB 0.49031 1.41858 0.3456 0.7296
conTrait:catTraitC -0.74376 1.29850 -0.5728 0.5668
Note: Wald-type p-values for coefficients, conditional on alpha=1.437638
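One generic way to get a single p-value per term, rather than per category, is a joint Wald chi-square test on all coefficients that belong to that term. The sketch below is a base-R illustration on an ordinary glm, not on a phyloglm fit: the helper `wald_joint` and the simulated data are made up here, and whether coef(fit) and vcov(fit) return matching pieces for a phyloglm object is an assumption that should be checked against the phylolm documentation. Such a test would also be conditional on the estimated alpha, like the Wald p-values in the output above.

```r
# Joint Wald test: H0 says all coefficients in positions idx are zero
wald_joint <- function(beta, V, idx) {
  b <- beta[idx]
  W <- as.numeric(t(b) %*% solve(V[idx, idx, drop = FALSE]) %*% b)
  c(chisq = W, df = length(idx),
    p.value = pchisq(W, df = length(idx), lower.tail = FALSE))
}

# Illustration on a plain logistic regression (simulated data)
set.seed(42)
n  <- 200
x  <- rnorm(n)
g  <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
yb <- rbinom(n, 1, plogis(-0.5 + 0.8 * x))
fit <- glm(yb ~ x * g, family = binomial)

nm <- names(coef(fit))
idx_main <- grep("^g[BC]$", nm)  # both main-effect dummies of g
idx_int  <- grep(":", nm)        # both interaction terms

print(wald_joint(coef(fit), vcov(fit), idx_main))
print(wald_joint(coef(fit), vcov(fit), idx_int))
```

Each call gives one chi-square statistic with as many degrees of freedom as the term has coefficients, which is the kind of per-variable summary an anova-type table provides.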
I am trying to analyze some simulated longitudinal data in R using a mixed-effects model (lme4 package).
Simulated data: 25 subjects have to perform 2 tasks at 5 consecutive time points.
#Simulate longitudinal data
N <- 25
t <- 5
x <- rep(1:t,N)
#task1
beta1 <- 4
e1 <- rnorm(N*t, mean = 0, sd = 1.5)
y1 <- 1 + x * beta1 + e1
#task2
beta2 <- 1.5
e2 <- rnorm(N*t, mean = 0, sd = 1)
y2 <- 1 + x * beta2 + e2
data1 <- data.frame(id=factor(rep(1:N, each=t)), day = x, y = y1, task=rep(c("task1"),length(y1)))
data2 <- data.frame(id=factor(rep(1:N, each=t)), day = x, y = y2, task=rep(c("task2"),length(y2)))
data <- rbind(data1, data2)
Question1: How to analyze how a subject learns each task?
library(lme4)
m1 <- lmer(y ~ day + (1 | id), data=data1)
summary(m1)
...
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.2757 0.3561 123.0000 3.582 0.000489 ***
day 3.9299 0.1074 123.0000 36.603 < 2e-16 ***
With ranef(m1) I get the random intercept for each subject, which I think reflects the baseline value for each subject at day = 1. But I don't understand how I can tell how an individual learns a task, or whether subjects differ in the way they learn the task.
Question2: How can I analyze whether the way subjects learn differs between task1 and task2?
I expanded on your example to answer your questions briefly, but I can recommend reading chapter 15 of Snijders & Bosker (2012) or the book by Singer & Willett (2003) for a better explanation. Day is treated as a continuous variable in your model. Seeing as you have panel data (i.e. everyone is measured on the same days) and day has no meaning apart from indicating the different measurement occasions, it may be better to treat day as a factor (i.e. use dummy variables).
However, for now I will continue with your example.
Your first model (I think you want data instead of data1) gives a fixed linear slope (i.e. average slope, no difference between the tasks, no difference between individuals). The fixed intercept is the performance when day is 0, which has no meaning, so you may want to consider centering day for a better interpretation (or indeed use dummies). The random effect gives the individual deviation from this intercept, which has an estimated variance of 0.00 in your example, so individuals hardly differ from each other in their starting position.
m1 <- lmer(y ~ day + (1 | id), data=data)
summary(m1)
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.00 0.000
Residual 18.54 4.306
Number of obs: 250, groups: id, 25
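The point about the intercept at day 0 can be illustrated by shifting day so that 0 corresponds to the first measurement. The sketch below uses a plain lm() on simulated data to stay free of extra packages (the seed and the variable `day_c` are made up for the illustration); the same recoding can be used inside the lmer() formula.

```r
set.seed(10)
day <- rep(1:5, times = 25)
y   <- 1 + 4 * day + rnorm(length(day), sd = 1.5)

fit_raw <- lm(y ~ day)     # intercept = predicted value at day 0

day_c <- day - 1           # 0 now means "first measurement day"
fit_c <- lm(y ~ day_c)     # intercept = predicted value at day 1

print(rbind(raw = coef(fit_raw), shifted = coef(fit_c)))
```

The slope is unchanged by the shift; only the intercept moves, by exactly one slope unit, so it now describes performance at the first occasion instead of at a day that was never observed.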
We can extend this model by adding an interaction with task. Meaning that the fixed slope is different for task1 and task2 which answers question 2 I believe (you can also use update() to update your model)
m2 <- lmer(y ~ day*task + (1|id), data = data)
summary(m2)
The effect of day in this model is the fixed slope of your reference category (task1) and the interaction is the difference between the slope of task1 and task2. The fixed effect of task is the difference in intercept.
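Following this reading of the coefficients, the slope for each task can be assembled from the fixed effects. This is a minimal sketch that re-simulates data along the lines of the question (the seed, the object names and the explicit factor() call are assumptions for the illustration):

```r
library(lme4)

set.seed(2024)
N <- 25; n_days <- 5
day <- rep(1:n_days, N)
id  <- factor(rep(1:N, each = n_days))

d1 <- data.frame(id, day, y = 1 + 4.0 * day + rnorm(N * n_days, sd = 1.5),
                 task = "task1")
d2 <- data.frame(id, day, y = 1 + 1.5 * day + rnorm(N * n_days, sd = 1.0),
                 task = "task2")
dat <- rbind(d1, d2)
dat$task <- factor(dat$task)   # task1 becomes the reference level

m2 <- lmer(y ~ day * task + (1 | id), data = dat)
fe <- fixef(m2)

slope_task1 <- unname(fe["day"])                        # reference slope
slope_task2 <- unname(fe["day"] + fe["day:tasktask2"])  # reference + difference
print(c(task1 = slope_task1, task2 = slope_task2))
```

Because the simulated subjects do not actually differ from each other, lmer may report a singular fit here; the fixed effects can still be read as described.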
Model fit can be assessed with a deviance test; read Snijders & Bosker (2012) for an explanation of ML and REML estimates.
anova(m1,m2)
To add a random effect for the growth of individuals we can update the model again, which answers question 1
m3 <- lmer(y ~ day*task + (day|id), data = data)
summary(m3)
ranef(m3)
The random effects indicate the individual deviations in slope and intercept. A summary of the distribution of your random effects is included in the model summary (same as for m1).
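To see how each individual learns, the fixed slope and the individual deviations can be combined into one slope per subject with coef(). The sketch below simulates data in which subjects really do differ in intercept and slope (the standard deviations 2 and 0.8, the seed and the object names are assumptions for the illustration), since the data in the question contain no such differences.

```r
library(lme4)

set.seed(7)
N <- 25; n_days <- 5
id  <- factor(rep(1:N, each = n_days))
day <- rep(1:n_days, N)

u0 <- rnorm(N, sd = 2)     # subject-specific shifts in the intercept
u1 <- rnorm(N, sd = 0.8)   # subject-specific shifts in the slope
subj <- as.integer(id)
y <- (1 + u0[subj]) + (4 + u1[subj]) * day + rnorm(N * n_days, sd = 1)
dat <- data.frame(id, day, y)

m <- lmer(y ~ day + (day | id), data = dat)

# One row per subject: fixed effect + that subject's random deviation
per_subject <- coef(m)$id
print(head(per_subject))
print(range(per_subject$day))   # spread of the individual learning slopes
```

The `day` column of `per_subject` is each subject's estimated learning rate, and the random-slope variance in summary(m) says how much these rates vary across subjects.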
Finally I think you could add a random effect on the day-task interaction to assess whether individuals differ in their performance growth on task1 and task2. But this depends very much on your data and the performance of the previous models.
m4 <- lmer(y ~ day*task + (day*task|id), data = data)
summary(m4)
ranef(m4)
Hope this helps. The books I recommended certainly should. Both provide excellent examples and explanation of theory (no R examples unfortunately). If you decide on a fixed occasion model (effect of day expressed by dummies) the nlme package provides excellent options to control the covariance structure of random effects. Good documentation of the package is provided by Pinheiro & Bates (2000).
There are several questions and posts about mixed models for more complex experimental designs, so I thought this simpler model would help other beginners in this process as well as me.
So, my question is how to formulate in R the repeated measures ANCOVA that I have from this SAS proc mixed call:
proc mixed data=df1;
FitStatistics=akaike
class GROUP person day;
model Y = GROUP X1 / solution alpha=.1 cl;
repeated / type=cs subject=person group=GROUP;
lsmeans GROUP;
run;
Here is the SAS output using the data created in R (below):
Effect     panel  Estimate  Error    DF  t Value  Pr > |t|  Alpha  Lower    Upper
Intercept         -9.8693   251.04   7   -0.04    0.9697    0.1    -485.49  465.75
panel      1      -247.17   112.86   7   -2.19    0.0647    0.1    -460.99  -33.3510
panel      2      0         .        .   .        .         .      .        .
X1                20.4125   10.0228  7   2.04     0.0811    0.1    1.4235   39.4016
Below is how I formulated the model in R using 'nlme' package, but am not getting similar coefficient estimates:
## create reproducible example fake panel data set:
set.seed(94); subject.id = abs(round(rnorm(10)*10000,0))
set.seed(99); sds = rnorm(10,15,5);means = 1:10*runif(10,7,13);trends = runif(10,0.5,2.5)
this = NULL; set.seed(98)
for(i in 1:10) { this = c(this,rnorm(6, mean = means[i], sd = sds[i])*trends[i]*1:6)}
set.seed(97)
that = sort(rep(rnorm(10,mean = 20, sd = 3),6))
df1 = data.frame(day = rep(1:6,10), GROUP = c(rep('TEST',30),rep('CONTROL',30)),
Y = this,
X1 = that,
person = sort(rep(subject.id,6)))
## use package nlme
require(nlme)
## run repeated measures mixed model using compound symmetry covariance structure:
summary(lme(Y ~ GROUP + X1, random = ~ +1 | person,
correlation=corCompSymm(form=~day|person), na.action = na.exclude,
data = df1,method='REML'))
Now, the output from R, which I now realize is similar to the output from lm():
Value Std.Error DF t-value p-value
(Intercept) -626.1622 527.9890 50 -1.1859379 0.2413
GROUPTEST -101.3647 156.2940 7 -0.6485518 0.5373
X1 47.0919 22.6698 7 2.0772934 0.0764
I believe I'm close as to the specification, but not sure what piece I'm missing to make the results match (within reason..). Any help would be appreciated!
UPDATE: Using the code in the answer below, the R output becomes:
> summary(model2)
Scroll to bottom for the parameter estimates -- look! identical to SAS.
Linear mixed-effects model fit by REML
Data: df1
AIC BIC logLik
776.942 793.2864 -380.471
Random effects:
Formula: ~GROUP - 1 | person
Structure: Diagonal
GROUPCONTROL GROUPTEST Residual
StdDev: 184.692 14.56864 93.28885
Correlation Structure: Compound symmetry
Formula: ~day | person
Parameter estimate(s):
Rho
-0.009929987
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | GROUP
Parameter estimates:
TEST CONTROL
1.000000 3.068837
Fixed effects: Y ~ GROUP + X1
Value Std.Error DF t-value p-value
(Intercept) -9.8706 251.04678 50 -0.0393178 0.9688
GROUPTEST -247.1712 112.85945 7 -2.1900795 0.0647
X1 20.4126 10.02292 7 2.0365914 0.0811
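The SAS call requests `alpha=.1 cl`, i.e. 90% confidence limits for the fixed effects, which summary() in nlme does not print. nlme's intervals() can produce them. A minimal sketch, using the Orthodont data that ships with nlme as a stand-in (an assumption for the illustration, not the df1 model from this question):

```r
library(nlme)

# Stand-in model on a data set shipped with nlme
fit <- lme(distance ~ age + Sex, random = ~ 1 | Subject,
           data = Orthodont, method = "REML")

# 90% confidence limits for the fixed effects only
ci <- intervals(fit, level = 0.90, which = "fixed")
print(ci$fixed)
```

The same call on the fitted lme object for df1 should give limits that can be set next to the Lower and Upper columns of the SAS output.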
Please try below:
model1 <- lme(
Y ~ GROUP + X1,
random = ~ GROUP | person,
correlation = corCompSymm(form = ~ day | person),
na.action = na.exclude, data = df1, method = "REML"
)
summary(model1)
I think the random = ~ groupvar | subjvar option in R's lme gives a result similar to the repeated / subject = subjvar group = groupvar option in SAS/MIXED in this case.
Edit:
SAS/MIXED
R (a revised model2)
model2 <- lme(
Y ~ GROUP + X1,
random = list(person = pdDiag(form = ~ GROUP - 1)),
correlation = corCompSymm(form = ~ day | person),
weights = varIdent(form = ~ 1 | GROUP),
na.action = na.exclude, data = df1, method = "REML"
)
summary(model2)
So, I think these covariance structures are very similar ($\sigma_{g1} = \tau_g^2 + \sigma_1$).
Edit 2:
Covariance parameter estimates (SAS/MIXED):
Variance person GROUP TEST 8789.23
CS person GROUP TEST 125.79
Variance person GROUP CONTROL 82775
CS person GROUP CONTROL 33297
So
TEST group diagonal element
= 125.79 + 8789.23
= 8915.02
CONTROL group diagonal element
= 33297 + 82775
= 116072
where diagonal element = $\sigma_{k1} + \sigma_{k2}$.
Covariance parameter estimates (R lme):
Random effects:
Formula: ~GROUP - 1 | person
Structure: Diagonal
GROUP1TEST GROUP2CONTROL Residual
StdDev: 14.56864 184.692 93.28885
Correlation Structure: Compound symmetry
Formula: ~day | person
Parameter estimate(s):
Rho
-0.009929987
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | GROUP
Parameter estimates:
1TEST 2CONTROL
1.000000 3.068837
So
TEST group diagonal element
= 14.56864^2 + (3.068837^0.5 * 93.28885 * -0.009929987) + 93.28885^2
= 8913.432
CONTROL group diagonal element
= 184.692^2 + (3.068837^0.5 * 93.28885 * -0.009929987) + (3.068837 * 93.28885)^2
= 116070.5
where diagonal element = $\tau_g^2 + \sigma_1 + \sigma_g^2$.
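The arithmetic above can be reproduced in a few lines of R, which also makes the size of the remaining gap between the two programs explicit. The sketch only re-evaluates the expressions exactly as they are written in this answer, with the numbers copied from the two outputs; the variable names are made up.

```r
# SAS/MIXED: CS + Variance per group
sas_test <- 125.79 + 8789.23
sas_ctrl <- 33297 + 82775

# R lme estimates copied from the output above
tau_test <- 14.56864      # random-effect StdDev, TEST
tau_ctrl <- 184.692       # random-effect StdDev, CONTROL
sigma    <- 93.28885      # residual StdDev
rho      <- -0.009929987  # compound symmetry Rho
ratio    <- 3.068837      # varIdent parameter for CONTROL

# Same expressions as written in the answer
r_test <- tau_test^2 + sqrt(ratio) * sigma * rho + sigma^2
r_ctrl <- tau_ctrl^2 + sqrt(ratio) * sigma * rho + (ratio * sigma)^2

print(round(c(sas_test = sas_test, r_test = r_test,
              sas_ctrl = sas_ctrl, r_ctrl = r_ctrl), 2))

# Relative differences in percent
print(round(100 * c(test = (r_test - sas_test) / sas_test,
                    control = (r_ctrl - sas_ctrl) / sas_ctrl), 4))
```

Both relative differences come out at well under a tenth of a percent, which supports the reading that the two fits describe nearly the same diagonal elements.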
Oooh, this is going to be a tricky one, and if it's even possible using standard nlme functions, is going to take some serious study of Pinheiro/Bates.
Before you spend the time doing that though, you should make absolutely sure that this is the exact model you need. Perhaps there's something else that might fit the story of your data better. Or maybe there's something R can do more easily that is just as good, but not quite the same.
First, here's my take on what you're doing in SAS with this line:
repeated / type=cs subject=person group=GROUP;
This type=cs subject=person is inducing correlation between all the measurements on the same person, and that correlation is the same for all pairs of days. The group=GROUP is allowing the correlation for each group to be different.
In contrast, here's my take on what your R code is doing:
random = ~ +1 | person,
correlation=corCompSymm(form=~day|person)
This code is actually adding almost the same effect in two different ways; the random line is adding a random effect for each person, and the correlation line is inducing correlation between all the measurements on the same person. However, these two things are almost identical; if the correlation is positive, you get the exact same result by including either of them. I'm not sure what happens when you include both, but I do know that only one is necessary. Regardless, this code has the same correlation for all individuals, it's not allowing each group to have their own correlation.
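The near-equivalence of a random intercept and a compound symmetry correlation can be checked numerically: a random intercept with variance tau2 plus independent errors with variance sigma2 implies the same positive correlation tau2 / (tau2 + sigma2) between every pair of measurements on a person. A small base-R sketch with made-up variance values:

```r
tau2   <- 4   # hypothetical random-intercept variance
sigma2 <- 1   # hypothetical residual variance
n_obs  <- 6   # measurements per person, as in df1

# Implied covariance matrix of one person's measurements:
# tau2 everywhere, plus sigma2 on the diagonal
V <- matrix(tau2, n_obs, n_obs) + diag(sigma2, n_obs)

R   <- cov2cor(V)
rho <- tau2 / (tau2 + sigma2)

print(round(R, 3))  # every off-diagonal entry equals rho
print(rho)
```

Because tau2 cannot be negative, this route can only produce a non-negative correlation, whereas a compound symmetry correlation structure can also take negative values, which is why the two are equivalent only when the correlation is positive.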
To let each group have their own correlation, I think you have to build a more complicated correlation structure up out of two different pieces; I've never done this but I'm pretty sure I remember Pinheiro/Bates doing it.
You might consider instead adding a random effect for person and then letting the variance be different for the different groups with weights=varIdent(form=~1|group) (from memory, check my syntax, please). This won't quite be the same but tells a similar story. The story in SAS is that the measurements on some individuals are more correlated than the measurements on other individuals. Thinking about what that means, the measurements for individuals with higher correlation will be closer together than the measurements for individuals with lower correlation. In contrast, the story in R is that the variability of measurements within individuals varies; thinking about that, measurements with higher variability will have lower correlation. So they do tell similar stories, but come at it from opposite sides.
It is even possible (but I would be surprised) that these two models end up being different parameterizations of the same thing. My intuition is that the overall measurement variability will be different in some way. But even if they aren't the same thing, it would be worth writing out the parameterizations just to be sure you understand them and to make sure that they are appropriately describing the story of your data.