Repeated measures ANOVA - different results for SPSS versus R

I am trying to run a repeated-measures ANOVA in R and compare it to the SPSS output, and the results differ a lot! Maybe I am making a mistake somewhere, but I cannot figure it out.
So some sample data:
id is the subject. Every subject makes one rating for three items (res_1, res_2 and res_3). I want to compare an overall effect of item.
id<-c(1,2,3,4,5,6)
res_1<-c(1,1,1,2,2,1)
res_2<-c(4,5,2,4,4,3)
res_3<-c(4,5,6,3,6,6)
## wide format for spss
table<-as.data.frame(cbind(id, res_1, res_2, res_3))
## reshape to long format
library(reshape2)
table<-melt(table, id.vars="id")
colnames(table)<-c("id", "item", "rating")
aov.out = aov(rating ~ item+ Error(id/item), data=table)
summary(aov.out)
And here is my SPSS code (from wide format data)
GLM item_1 item_2 item_3
/WSFACTOR=factor1 3 Polynomial
/METHOD=SSTYPE(3)
/PRINT=DESCRIPTIVE
/CRITERIA=ALPHA(.05)
/WSDESIGN=factor1.
The results I get from
R: p value 0.0526 (error:within)
and SPSS: p value 0.003 (test of within subject effect)
Does anyone have a suggestion that may explain the difference?
If I do a non-parametric Friedman test, I get the same results in SPSS and R.
Actually, when looking at my data, the summary(aov.out) is the same as SPSS's "test of within subjects contrast" (but I learned to look at the test of within subjects effect).
Thanks!

There's a lot of stuff out there; I am a bit surprised that your google for 'spss versus R anova' did not bring you to links explaining about the difference in sums-of-squares between SPSS (type-III) and R (type-I) as well as difference in how contrasts are handled.
These are the top two results that I found:
http://myowelt.blogspot.ca/2008/05/obtaining-same-anova-results-in-r-as-in.html and
https://stats.stackexchange.com/questions/40958/testing-anova-hypothesis-with-contrasts-in-r-and-spss
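
One more thing worth checking in the R code from the question, separate from the sums-of-squares issue: `id` is created as a numeric vector, so `aov()` treats the subject as a single numeric covariate in the `Error()` term rather than as six separate subjects. The sketch below is an assumption about the cause, not something the answer above confirms. It rebuilds the question's data in long format with base R only and declares `id` as a factor; the names `long` and `aov.rm` are made up for illustration.

```r
# Data from the question, built directly in long format (base R only)
res_1 <- c(1, 1, 1, 2, 2, 1)
res_2 <- c(4, 5, 2, 4, 4, 3)
res_3 <- c(4, 5, 6, 3, 6, 6)

long <- data.frame(
  id     = factor(rep(1:6, times = 3)),   # subject as a factor, not a number
  item   = factor(rep(c("res_1", "res_2", "res_3"), each = 6)),
  rating = c(res_1, res_2, res_3)
)

# One-way repeated-measures ANOVA: item is tested against the id:item stratum
aov.rm <- aov(rating ~ item + Error(id/item), data = long)
summary(aov.rm)

# Residual degrees of freedom for the within-subject test: (6 - 1) * (3 - 1)
aov.rm[["id:item"]]$df.residual
```

If the residual degrees of freedom for the item test are not (subjects - 1) * (items - 1), the subject variable is probably not being treated as a factor.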

Related

R - Mixed Design ANOVA post hoc test

I have the following data structure (with example values):
id var1 var2 value
1 true tr 1.34
2 true ct 4.89
3 false mm 2.38
4 true tr 1.28
The data is saved in 'longData'. So 'var1' is between subject variable that can be true or false, 'var2' is a within subject factor with 3 levels (tr, ct, mm) and 'value' is a numeric value.
I've made a mixed design ANOVA like this:
anovaResult = ezANOVA(data=longData,
dv=.("value"),
wid=.("id"),
within=.("var2"),
between=.("var1"),
type=3)
The result showed a significant interaction between var1 and var2. Now I would like to examine this interaction further, but I don't know how. I've heard about the emmeans package (estimated marginal means seem to be the statistic of choice here; since I am new to statistics, feel free to advise me otherwise) but could not get the command to work. This is probably because I am new to R and do not fully understand the syntax.
Can anyone provide me with a working example of how to test the interaction between the two factors? I would not say no to an explanation of how to interpret the results as well.
I know this is a lot to ask, but I cannot figure it out for myself and have to present results soon, without much time to learn statistics and R.
Thank you.
It would help to provide an example dataset.
However, you can run a Tukey test:
mod1<-aov(value~Factor1*Factor2, df)
TukeyHSD(mod1)
Or to run emmeans on an ANOVA with an interaction:
mod1<-aov(value~Factor1*Factor2, df)
library(emmeans)
emmeans(mod1, pairwise~Factor1*Factor2)
Or to fit a mixed model, which is what you seem to be doing:
library(lme4) # provides lmer()
library(car) # provides Anova()
mod1<-lmer(value~Factor1*Factor2+(1|subject), df)
Anova(mod1)
summary(mod1)
emmeans(mod1, pairwise~Factor1*Factor2)
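
If installing extra packages is a hurdle, a rough base-R way to look into the interaction is to compare the levels of the within-subject factor separately inside each between-subject group (simple effects). This is an alternative sketch, not the emmeans approach from the answer. The data are simulated: only the column names `id`, `var1`, `var2` and `value` follow the question, and `simple_effects` is a made-up name.

```r
set.seed(42)  # made-up data, for illustration only

n_subj  <- 20
levels2 <- c("tr", "ct", "mm")
longData <- expand.grid(var2 = levels2, id = seq_len(n_subj))
longData$id    <- factor(longData$id)
longData$var2  <- factor(longData$var2, levels = levels2)
# First half of the subjects in group "true", second half in group "false"
longData$var1  <- factor(ifelse(as.integer(longData$id) <= n_subj / 2, "true", "false"))
longData$value <- rnorm(nrow(longData), mean = 3)

# Simple effects: paired comparisons of the var2 levels within each var1 group
simple_effects <- lapply(split(longData, longData$var1), function(d) {
  d <- d[order(d$var2, d$id), ]  # same subject order inside every var2 level
  pairwise.t.test(d$value, d$var2, paired = TRUE, p.adjust.method = "holm")
})
simple_effects
```

Each list element holds Holm-adjusted p-values for the three pairwise var2 comparisons in one var1 group. If var2 differences show up in one group and not in the other, that is the kind of pattern a significant interaction reflects.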

Delayed entry survival model. R vs Stata differences

I have some code in Stata that I'm trying to redo in R. I'm working on a delayed entry survival model and I want to limit the follow-up to 5 years. In Stata this is very easy and can be done as follows for example:
stset end, fail(failure) id(ID) origin(start) enter(entry) exit(time 5)
stcox var1
However, I'm having trouble recreating this in R. I've made a toy example limiting follow-up to 1000 days - here is the setup:
library(survival); library(foreign); library(rstpm2)
data(brcancer)
brcancer$start <- 0
# Make delayed entry time
brcancer$entry <- brcancer$rectime / 2
# Write to dta file for Stata
write.dta(brcancer, "brcancer.dta")
OK, so now we've set up an identical dataset for use in both R and Stata. Here is the Stata code and model result:
use "brcancer.dta", clear
stset rectime, fail(censrec) origin(start) enter(entry) exit(time 1000)
stcox hormon
And here is the R code and results:
# Limit follow-up to 1000 days
brcancer$limit <- ifelse(brcancer$rectime <1000, brcancer$rectime, 1000)
# Cox model
mod1 <- coxph(Surv(time=entry, time2= limit, event = censrec) ~ hormon, data=brcancer, ties = "breslow")
summary(mod1)
As you can see, the R estimates and Stata estimates differ slightly, and I cannot figure out why. Have I set up the R model incorrectly to match Stata, or is there another reason the results differ?
Since the methods match on an available dataset after recoding the deaths that occur after the termination date as censored, I'm posting the relevant sections of my comment as an answer.
I also think that you should have changed any deaths at times greater than 1000 to be considered censored. (Notice that the number of events is quite different in the two sets of results.)
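
The recoding described above can be sketched as follows. To keep the example self-contained it uses simulated data in place of brcancer, so the numbers are made up. The column names follow the question, while `event` and `d1000` are names introduced here for illustration.

```r
library(survival)

set.seed(1)  # simulated stand-in for brcancer; the values are made up
n <- 300
d <- data.frame(
  rectime = round(rexp(n, rate = 1 / 1200)) + 1,  # follow-up time in days
  censrec = rbinom(n, 1, 0.6),                    # 1 = event, 0 = censored
  hormon  = rbinom(n, 1, 0.5)
)
d$entry <- d$rectime / 2                          # delayed entry, as in the question

# Administrative censoring at 1000 days: truncate the time AND the event flag
d$limit <- pmin(d$rectime, 1000)
d$event <- ifelse(d$rectime > 1000, 0, d$censrec)

# Subjects who enter at or after day 1000 contribute no time at risk
d1000 <- subset(d, entry < limit)

mod <- coxph(Surv(entry, limit, event) ~ hormon, data = d1000, ties = "breslow")
summary(mod)
```

The key difference from the code in the question is the `event` column: truncating only the time leaves events that happened after day 1000 counted as if they happened at day 1000.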

how to find differentially methylated regions (for example with probe lasso in Champ) based on regression continuous variable ~ beta (with CpGassoc)

I ran 450K Illumina methylation chips on human samples, and want to search for the association between a continuous variable and beta, adjusted for other covariates. For this, I used the CpGassoc package in R. I would also like to search for differentially methylated regions (DMRs) based on the significant CpG sites. However, the probe lasso function in the Champ package, and also other packages for 450K DMR analyses, always assume 2 groups for which DMRs need to be found. I do not have 2 groups, but this continuous variable. Is there a way to load my output from CpGassoc into the probe lasso function from Champ? Or into another bump hunter package? I'm an MD, not a bioinformatician, thus comb-p, etc. would not be possible for me.
Thank you very much for your help.
Kind regards,
Line
I have not worked with methylation data before, so take what I say with a grain of salt. Also, don't use acronyms without describing them; I'm guessing most people on this site don't know what a DMR is.
You could use the lasso from the glmnet package on your data. Suppose your continuous variable is age, and meth.dt is your methylation data.table with one column per site (the amount of methylation) and one row per subject. I'm not sure whether a Poisson family is appropriate here; I know RNA-seq data is treated as Poisson. I also can't get too specific, but the following code should work after adjusting it to your number of columns:
#load libraries
library(data.table)
library(glmnet)
#read in data
meth.dt <- fread("/data")
#lasso
AgeLasso <- glmnet(as.matrix(meth.dt[, 1:70999, with = FALSE]), meth.dt$Age, family = "poisson")
cv.AgeLasso <- cv.glmnet(as.matrix(meth.dt[, 1:70999, with = FALSE]), meth.dt$Age, family = "poisson")
# keep only the non-zero coefficients at lambda.1se
lassoCoef <- coef(cv.AgeLasso, s = "lambda.1se")[, 1]
coefTranscripts <- lassoCoef[lassoCoef != 0]
This will give you the methylation sites that are the best predictors of your continuous variable using a parsimonious model. For additional info about glmnet see http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
You might also want to ask the people over at Cross Validated. They may have some better answers. http://stats.stackexchange.com
What is your continuous variable just out of curiosity?
Let me know how you ended up solving it if you don't use this method.

coxph stratified by year

I think this should be something very easy, but I can't quite get my head around it.
I have the following code:
library(survival)
cox <- coxph(Surv(SURV, DEAD)~YEAR, data)
summary(cox)
but I would like to have the result split down into the individual years.
Here's what the SPSS syntax and solution would look like:
COXREG surv /STATUS=dead(1) /CONTRAST (year)=Indicator(1)
/METHOD=ENTER year /PRINT=CI(95)
/CRITERIA=PIN(.05) POUT(.10) ITERATE(20).
EXECUTE.
and the same thing in Stata:
xi: stcox i.year
Here's the output of
str(data)
You did not show us str(data) or how to construct a reproducible example that gave "data". I suspect that "YEAR" will turn out to be a numeric vector. If it had been a factor variable you would have seen n-1 coefficients, one for each year relative to the reference year (a Cox model has no intercept term; the baseline hazard takes its place). You told the SPSS engine that "year" was an "INDICATOR" but you didn't offer the same courtesy to the R engine.
Try this:
data$YEAR.ind <- factor(data$YEAR) # equivalent of SPSS INDICATOR
# or SAS /CLASS
cox.mdl <- coxph(Surv(SURV, DEAD) ~ YEAR.ind, data) # use the factor, not the numeric YEAR
as.matrix(coef(cox.mdl))
summary(cox.mdl)
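
To see the difference between a numeric and a factor year, here is a small sketch on simulated data. The values are made up; only the column names SURV, DEAD and YEAR mimic the question, and `cox.num` and `cox.fac` are names introduced for illustration.

```r
library(survival)

set.seed(2)  # simulated data; SURV, DEAD and YEAR mimic the question's columns
n <- 400
data <- data.frame(
  SURV = rexp(n, rate = 0.1),
  DEAD = rbinom(n, 1, 0.7),
  YEAR = sample(2001:2004, n, replace = TRUE)
)

# Numeric YEAR: a single log-linear trend coefficient
cox.num <- coxph(Surv(SURV, DEAD) ~ YEAR, data = data)

# Factor YEAR: one coefficient per year, relative to the first year (2001)
cox.fac <- coxph(Surv(SURV, DEAD) ~ factor(YEAR), data = data)

length(coef(cox.num))  # 1
length(coef(cox.fac))  # 3, i.e. number of years minus one
exp(coef(cox.fac))     # hazard ratios versus the reference year
```

The factor version corresponds to what `i.year` in Stata and `Indicator(1)` in SPSS request, assuming the first year is the reference category in all three programs.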
R often splits computing and display of results to allow more freedom. I assume you need the predict function of coxph (?predict.coxph).
There are examples at the bottom of the documentation page, most likely you want
predict(cox, type="terms")

How to structure stratified data for Poisson regression

I'm trying to use R to conduct Poisson regression on some data that I have. The current structure of the data is as follows:
Data is stratified based on three occupations. There are four levels of income in the data. Within each stratum, for each level of income there is
the number of workplace accidents that have occurred, and
the total man months observed.
Here's an example of the setup. The number in parentheses is the total man months observed and the number not in parentheses is the number of workplace accidents.
My question is how do I set up this data and perform a Poisson regression on the effect of income level on the occurrence of workplace accidents? Ideally I would like to adjust for occupation and find out the effect of only income, but as a starting point, I'm not sure how to set it up as a Poisson regression problem at all. I thought about doing something like dividing the number of injuries by the months of observation, but then that gives non-integer values so I assume that's not the right thing to do.
To reiterate, predictor: income level; response variable: workplace accidents.
BTW, it would be very easy to separate the parentheses numbers and put them into their own column, if that would make sense to do.
I'd really appreciate any suggestions on how to set this up. I am sure other statisticians are working with similarly structured data and might like to gain some insight as well. Thanks so much!
@thelatemail might be correct in thinking this to be better suited for stats.stackexchange.com, but here is some R code. That data is in wide format and you need to restructure it to long format. (And you will not want to include the totals columns.) After converting the first four columns to a long format where you have 'occupation' and 'level' as factor-class variables, and accident 'counts' and exposure 'months' as numeric columns, you could use this call to glm:
fit <- glm( counts ~ level + occupation + offset(log(months)), data=dfrm, family="poisson")
The offset needs to be log()-ed because the default link function for the Poisson family is the log, so the model is linear on the scale of log(expected counts).
(You cannot really expect us to redo that data entry task, now can you?)
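
For readers who want to see the long format concretely, here is a minimal sketch. All counts and man-months are hypothetical placeholders, since the question's table is not reproduced here, and the occupation labels are made up. The column names follow the description in the answer above.

```r
# Hypothetical counts and exposure; the question's actual table is not reproduced here
dfrm <- expand.grid(
  level      = factor(1:4),                          # four income levels
  occupation = factor(c("occ_A", "occ_B", "occ_C"))  # three occupation strata
)
dfrm$counts <- c(12, 9, 7, 4, 20, 15, 11, 8, 6, 5, 3, 2)  # workplace accidents
dfrm$months <- c(900, 850, 800, 700, 1500, 1400, 1300, 1200,
                 500, 480, 450, 400)                       # man-months observed

# Poisson regression of accident counts with log(exposure) as the offset
fit <- glm(counts ~ level + occupation + offset(log(months)),
           data = dfrm, family = poisson)

# Rate ratios for each income level versus level 1, adjusted for occupation
exp(coef(fit))[c("level2", "level3", "level4")]
```

Using the offset keeps the response as integer counts while still modelling accidents per man-month, which avoids the non-integer rates the question was worried about.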

Resources