ANOVA difference in SPSS and R

I'm quite new to R, but I've recently tried to run a two-way repeated measures ANOVA to replicate results that my supervisor obtained in SPSS.
I've struggled for days and read dozens of articles to understand what is going on in R, but I still don't get the same results.
> mod <- lm(Y ~ A*B)
> Anova(mod, type = "III")
Anova Table (Type III tests)

Response: Y
            Sum Sq  Df F value    Pr(>F)    
(Intercept)  0.000   1  0.0000   1.00000    
A            2.403   5  8.6516 4.991e-08 ***
B            0.403   2  3.6251   0.02702 *  
A:B          1.220  10  2.1962   0.01615 *  
Residuals   51.987 936                      
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
My data come from a balanced design, and I used type III SS since that is what SPSS uses as well. The sums of squares, the Df, and the linear model are the same as in SPSS; only the F and p values differ. So it should not be a sum-of-squares mistake.
Results in SPSS are:

         F     Sig.
A     7.831    .000
B     2.681    .073
A:B   2.247    .014
I'm a little bit lost. Could it be a problem related to the contrasts?
Lucas
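For reference, when car::Anova(type = "III") disagrees with SPSS, two things are commonly checked: the contrast coding (type III tests are only meaningful with sum-to-zero contrasts, not R's default treatment contrasts) and, for a repeated measures design, whether the model contains a subject error stratum at all, since lm() pools everything into a single residual term. A minimal sketch of both checks, assuming a data frame dat with columns Y, A, B and a subject identifier subj (hypothetical names):

library(car)
# 1) sum-to-zero contrasts, so that type III tests are well defined
options(contrasts = c("contr.sum", "contr.poly"))
mod <- lm(Y ~ A*B, data = dat)
Anova(mod, type = "III")

# 2) a repeated measures analysis comparable to SPSS GLM Repeated Measures
#    needs within-subject error strata, e.g. via aov() with an Error() term
mod_rm <- aov(Y ~ A*B + Error(subj/(A*B)), data = dat)
summary(mod_rm)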

Related

How to run a 2x2 mixed ANOVA with unequal sample sizes in R?

I am trying to run a 2x2 mixed ANOVA with unequal sample sizes in R. The data were collected from 30 individuals under two different conditions (i.e., 2 levels of the within-subject factor), and the individuals were allocated to groups by a k-means clustering analysis (group 1: n = 11, group 2: n = 19). Here is a sample of my code and the output:
summary(aov(JH ~ Box + Group + Box:Group + Error(P/Box), data = d3))

Error: P
          Df Sum Sq Mean Sq F value Pr(>F)
Group      1  0.027 0.02715    1.56   0.22
Box:Group  1  0.001 0.00078    0.04   0.83
Residuals 27  0.470 0.01741

Error: P:Box
          Df   Sum Sq  Mean Sq F value Pr(>F)
Box        1 0.000000 1.00e-07    0.00   0.97
Group      1 0.000022 2.17e-05    0.24   0.63
Box:Group  1 0.000032 3.24e-05    0.35   0.56
Residuals 27 0.002488 9.21e-05
Unlike the output I got when running a 2x2 mixed ANOVA with equal sample sizes on another dataset (attached below), here an extra Box:Group interaction term appears under Error: P.
summary(aov(HipLoadingK ~ Box*Sex + Error(P/Box), data = TW))

Error: P
          Df  Sum Sq  Mean Sq F value  Pr(>F)   
Sex        1 0.02578 0.025779   8.038 0.00841 **
Residuals 28 0.08980 0.003207                   
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Error: P:Box
          Df   Sum Sq  Mean Sq F value Pr(>F)  
Box        1 0.003454 0.003454   4.271 0.0481 *
Box:Sex    1 0.000066 0.000066   0.082 0.7765  
Residuals 28 0.022645 0.000809
Is this result fine as long as I interpret the appropriate lines for the between- and within-subject main effects and the interaction effect? Or should I edit my code to get the correct output? If so, please let me know what should be added or changed in my code.
When I changed the dataset from wide to long format, I had accidentally assigned the group labels in a different order for each participant. For instance, P01 was in group A under condition X, but in group B under condition Y. After I fixed that, the code worked well.
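A quick sanity check for that kind of reshaping mistake is to verify that each participant carries exactly one group label across all rows of the long-format data. A minimal sketch, assuming the long data frame d3 with columns P (participant) and Group as in the code above:

# number of distinct Group labels per participant; anything above 1 flags a mislabeled row
labels_per_p <- aggregate(Group ~ P, data = d3, FUN = function(g) length(unique(g)))
subset(labels_per_p, Group > 1)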

How to set up a two-way repeated measures ANCOVA using R?

I have not seen an example of this particular test done in R in any of the forums I have been scouring. I have seen an example of a one-way ANCOVA here and of a two-way repeated measures ANOVA here. I present my version below, after jumping through several hoops, and I am not sure it is correct.
I am trying to set up a statistical test for results from a clinical trial. A simplified version of the study is as follows: I divided my subjects into two groups (Condition A and B), with n = 3 subjects per group, and the dependent variable (DV) was measured in these subjects at 4 time points. I aim to conduct a two-way repeated measures ANCOVA to ask the question: is the response variable (DV) affected by the Condition-by-timepoint interaction, with bodywt as a covariate?
Note: I have to stick to the "traditional" aov function and not lme or lmer, since my reports will be compared to SPSS-generated output; see the discussion here that aov and lme are not the same.
My example code:
SubjectID <- rep(1:6, each = 4)
Condition <- c(rep('A', 12), rep('B', 12))
# one body weight per subject, constant across the 4 time points (6 subjects x 4 time points = 24 rows)
bodywt    <- rep(16:21, each = 4)
timepoint <- rep(1:4, 6)
DV <- c(42,51,63,57,46,60,63,61,41,55,62,57,73,56,53,58,50,60,56,54,52,57,54,54)
data <- data.frame(SubjectID = SubjectID,
                   Condition = Condition,
                   bodywt    = bodywt,
                   timepoint = timepoint,
                   DV        = DV)
# Setting up my 2-way ANCOVA with repeated measures
stat_result <- aov(DV ~ Condition*timepoint + bodywt:Condition + Error(SubjectID), data)
summary(stat_result)
# Error: SubjectID
#           Df Sum Sq Mean Sq
# Condition  1 0.8036  0.8036
#
# Error: Within
#                     Df Sum Sq Mean Sq F value  Pr(>F)   
# Condition            1   41.8    41.8   1.341 0.26282   
# timepoint            1  126.1   126.1   4.045 0.06042 . 
# Condition:timepoint  1  323.4   323.4  10.377 0.00501 **
# Condition:bodywt     2   81.7    40.9   1.311 0.29540   
# Residuals           17  529.8    31.2                   
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I know that I am on the right track because when I change how the covariate bodywt enters the model (below), the F value and Pr(>F) for Condition:timepoint are predictably affected.
stat_result <- aov(DV ~ Condition*timepoint + bodywt + Error(SubjectID), data)
summary(stat_result)
# Error: SubjectID
#           Df Sum Sq Mean Sq
# Condition  1 0.8036  0.8036
#
# Error: Within
#                     Df Sum Sq Mean Sq F value  Pr(>F)   
# Condition            1   41.8    41.8   1.231 0.28175   
# timepoint            1  126.1   126.1   3.714 0.06989 . 
# bodywt               1    0.5     0.5   0.015 0.90435   
# Condition:timepoint  1  323.4   323.4   9.527 0.00636 **
# Residuals           18  611.0    33.9                   
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
My questions are:
Is my code the correct way to set up a two-way repeated measures ANCOVA?
(Bonus question) What would be the best post hoc test to compare means between the two groups? Again, I would prefer to stay away from lme or lmer.
Thanks in advance
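For the bonus question, one option that stays within base R is a pairwise comparison of the cell means with a multiplicity correction; this is only a hedged sketch, since it treats the observations as independent and ignores the repeated measures structure. The emmeans package is another possibility, as it accepts aov fits directly.

# pairwise comparisons of the Condition-by-timepoint cell means (Holm-adjusted);
# note this ignores the within-subject correlation
with(data, pairwise.t.test(DV, interaction(Condition, timepoint), p.adjust.method = "holm"))

# alternatively, if emmeans is acceptable (timepoint is numeric here, so specific
# time points may need to be supplied via the 'at' argument):
# library(emmeans)
# emmeans(stat_result, pairwise ~ Condition | timepoint)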

Proportion of variance of outcome explained by each variable in a linear regression

Using the example data set below, I want to calculate the proportion of variance in science explained by each independent variable in a linear regression model. How can I achieve that in R?
hsb2 <- read.table('http://www.ats.ucla.edu/stat/r/modules/hsb2.csv', header=T, sep=",")
m1<-lm(science ~ math+female+ socst+ read, data =hsb2)
One way is to use the anova() function from the stats package.
It gives you the sequential sum of squares attributed to each variable plus the residual sum of squares, which together add up to the total sum of squares (i.e., the total variation to be partitioned):
anova(m1)
Analysis of Variance Table

Response: science
           Df Sum Sq Mean Sq  F value    Pr(>F)    
math        1 7760.6  7760.6 151.8810 < 2.2e-16 ***
female      1  233.0   233.0   4.5599  0.033977 *  
socst       1  465.6   465.6   9.1128  0.002878 ** 
read        1 1084.5  1084.5  21.2254 7.363e-06 ***
Residuals 195 9963.8    51.1                       
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
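To turn those sums of squares into proportions of the variance in science, divide each term's Sum Sq by the total. A minimal sketch built on the table above; note these are sequential (Type I) sums of squares, so the split depends on the order of terms in the formula:

a  <- anova(m1)
ss <- a[["Sum Sq"]]
# proportion of the total sum of squares attributable to each term (including residuals)
setNames(round(ss / sum(ss), 3), rownames(a))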

Coercing an ANOVA table into a table object in R

I am running a lot of ANOVA tables and would ultimately like to run them through the xtable function in the xtable package for export to LaTeX. However, I need to supply xtable with a table-like object; it will not accept an ANOVA object. I basically want to turn the ANOVA table into such an object. Here is some reproducible code:
utils::data(npk, package="MASS")
npk.aovE <- aov(yield ~ N*P*K + Error(block), npk)
summary(npk.aovE) ## THIS IS THE TABLE I WANT AS A TABLE OBJECT
I have tried all the usual suspects (as.table, print and xtable(summary(npk.aovE))) with no success. Any help would be greatly appreciated.
Generally what people want is the matrix obtained with:
coef( summary(npk.aovE) ) # which returns NULL
As the help page says: "Function coef will extract the matrix of coefficients with standard errors, t-statistics and p-values." Unfortunately, theory and practice do not always agree. That summary object is actually a list of two components (one per error stratum), each wrapping a data frame, and its behavior is described in ?summary.aovlist:
> summary(npk.aovE)[[2]]
          Df Sum Sq Mean Sq F value  Pr(>F)   
N          1 189.28  189.28  12.259 0.00437 **
P          1   8.40    8.40   0.544 0.47490   
K          1  95.20   95.20   6.166 0.02880 * 
N:P        1  21.28   21.28   1.378 0.26317   
N:K        1  33.14   33.14   2.146 0.16865   
P:K        1   0.48    0.48   0.031 0.86275   
Residuals 12 185.29   15.44                   
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> summary(npk.aovE)[[1]]
          Df Sum Sq Mean Sq F value Pr(>F)
N:P:K      1   37.0   37.00   0.483  0.525
Residuals  4  306.3   76.57
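Given that structure, one way to get LaTeX out is to extract the underlying anova data frame from each stratum and pass it to xtable, which has a method for anova objects; a sketch, assuming the npk.aovE fit above:

library(xtable)
s <- summary(npk.aovE)
# each element of s is a summary.aov whose first component is an anova data frame
xtable(s[[1]][[1]])   # Error: block stratum
xtable(s[[2]][[1]])   # Error: Within stratum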

Conducting GAM-GEE in the gamm4 R package?

I am trying to analyze some visual transect data of organisms to generate a habitat distribution model. Once organisms are sighted, they are followed and point data are collected at a given time interval. Because of the autocorrelation among these "follows," I wish to use a GAM-GEE approach similar to that of Pirotta et al. 2011, using the packages 'yags' and 'splines' (http://www.int-res.com/abstracts/meps/v436/p257-272/). Their R scripts are shown here (http://www.int-res.com/articles/suppl/m436p257_supp/m436p257_supp1-code.r). I have used this code with limited success and repeated issues of models failing to converge.
Below is the structure of my data:
> str(dat2)
'data.frame': 10792 obs. of 4 variables:
$ dist_slag : num 26475 26340 25886 25400 24934 ...
$ Depth : num -10.1 -10.5 -16.6 -22.2 -29.7 ...
$ dolphin_presence: int 0 0 0 0 0 0 0 0 0 0 ...
$ block : int 1 1 1 1 1 1 1 1 1 1 ...
> head(dat2)
dist_slag Depth dolphin_presence block
1 26475.47 -10.0934 0 1
2 26340.47 -10.4870 0 1
3 25886.33 -16.5752 0 1
4 25399.88 -22.2474 0 1
5 24934.29 -29.6797 0 1
6 24519.90 -26.2370 0 1
Here is the summary of my block variable (block indexes the groups within which autocorrelation is expected to exist):
> summary(dat2$block)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 39.00 76.00 73.52 111.00 148.00
However, I would like to use the package 'gamm4', as I am more familiar with Professor Simon Wood's packages and functions, and gamm4 appears to be the most appropriate here. It is important to note that the models have a binary response (organism presence or absence along a transect), which is why I think gamm4 is more appropriate than gamm. The gamm help provides the following example for autocorrelation within factors:
## more complicated autocorrelation example - AR errors
## only within groups defined by `fac'
e <- rnorm(n,0,sig)
for (i in 2:n) e[i] <- 0.6*e[i-1]*(fac[i-1]==fac[i]) + e[i]
y <- f + e
b <- gamm(y~s(x,k=20),correlation=corAR1(form=~1|fac))
Following this example, here is the code I used for my dataset:
b <- gamm4(dolphin_presence~s(dist_slag)+s(Depth),random=(form=~1|block), family=binomial(),data=dat)
However, on examining the output (summary(b$gam) and especially summary(b$mer)), I am either unsure how to interpret the results or do not believe that the autocorrelation within groups is being taken into consideration.
> summary(b$gam)
Family: binomial
Link function: logit
Formula:
dolphin_presence ~ s(dist_slag) + s(Depth)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.968 5.145 -2.715 0.00663 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(dist_slag) 4.943 4.943 70.67 6.85e-14 ***
s(Depth) 6.869 6.869 115.59 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) =  0.317   glmer.ML score = 10504   Scale est. = 1   n = 10792
>
> summary(b$mer)
Generalized linear mixed model fit by the Laplace approximation
AIC BIC logLik deviance
10514 10551 -5252 10504
Random effects:
Groups Name Variance Std.Dev.
Xr s(dist_slag) 1611344 1269.39
Xr.0 s(Depth) 98622 314.04
Number of obs: 10792, groups: Xr, 8; Xr.0, 8
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
X(Intercept) -13.968 5.145 -2.715 0.00663 **
Xs(dist_slag)Fx1 -35.871 33.944 -1.057 0.29063
Xs(Depth)Fx1 3.971 3.740 1.062 0.28823
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
X(Int) X(_)F1
Xs(dst_s)F1 0.654
Xs(Dpth)Fx1 -0.030 0.000
>
How do I ensure that autocorrelation is indeed being accounted for within each unique value of the "block" variable? What is the simplest way to interpret the output of summary(b$mer)?
The results do differ from a normal gam (package mgcv) using the same variables and parameters without the "correlation=..." term, indicating that something different is occurring.
However, when I use a different variable for the correlation term (season), I get the SAME output:
> dat2 <- data.frame(dist_slag = dat$dist_slag, Depth = dat$Depth, dolphin_presence = dat$dolphin_presence,
+ block = dat$block, season=dat$season)
> head(dat2)
dist_slag Depth dolphin_presence block season
1 26475.47 -10.0934 0 1 F
2 26340.47 -10.4870 0 1 F
3 25886.33 -16.5752 0 1 F
4 25399.88 -22.2474 0 1 F
5 24934.29 -29.6797 0 1 F
6 24519.90 -26.2370 0 1 F
> summary(dat2$season)
F S
3224 7568
> b <- gamm4(dolphin_presence~s(dist_slag)+s(Depth),correlation=corAR1(1, form=~1 | season), family=binomial(),data=dat2)
> summary(b$gam)
Family: binomial
Link function: logit
Formula:
dolphin_presence ~ s(dist_slag) + s(Depth)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.968 5.145 -2.715 0.00663 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(dist_slag) 4.943 4.943 70.67 6.85e-14 ***
s(Depth) 6.869 6.869 115.59 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) =  0.317   glmer.ML score = 10504   Scale est. = 1   n = 10792
> summary(b$mer)
Generalized linear mixed model fit by the Laplace approximation
AIC BIC logLik deviance
10514 10551 -5252 10504
Random effects:
Groups Name Variance Std.Dev.
Xr s(dist_slag) 1611344 1269.39
Xr.0 s(Depth) 98622 314.04
Number of obs: 10792, groups: Xr, 8; Xr.0, 8
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
X(Intercept) -13.968 5.145 -2.715 0.00663 **
Xs(dist_slag)Fx1 -35.871 33.944 -1.057 0.29063
Xs(Depth)Fx1 3.971 3.740 1.062 0.28823
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
X(Int) X(_)F1
Xs(dst_s)F1 0.654
Xs(Dpth)Fx1 -0.030 0.000
>
I just want to make sure it is correctly allowing for correlation within each value for the "block" variable. How do I formulate the model to say that autocorrelation can exist within each single value for block, but assume independence among blocks?
On another note, I am also receiving the following warning message after model completion for larger models (with many more variables than 2):
Warning message:
In mer_finalize(ans) : false convergence (8)
gamm4 is built on top of lme4, which does not allow for a correlation parameter (in contrast to the nlme package, which underlies mgcv::gamm). mgcv::gamm does handle binary data, although it uses PQL, which is generally less accurate than the Laplace/GHQ approximations used in gamm4/lme4. It is unfortunate (!!) that you're not getting a warning telling you that the correlation argument is being ignored (when I try a simple example using a correlation argument with lme4, I do get a warning, but it's possible that the extra argument is getting swallowed somewhere inside gamm4).
Your desired autocorrelation structure ("autocorrelation can exist within each single value for block, but assume independence among blocks") is exactly the way correlation structures are coded in nlme (and hence in mgcv::gamm).
I would use mgcv::gamm, and would suggest that, if at all possible, you try it out on some simulated data with known structure (or use the data set provided in the supplementary material above and see if you can reproduce their qualitative conclusions with your alternative approach).
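For concreteness, a hedged sketch of that mgcv::gamm call with an AR(1) structure nested within block, using the variable names from the question (the model fits by PQL for the binomial response, and rows within each block are assumed to be in time order; otherwise an ordering covariate would go on the left of the | in form):

library(mgcv)   # gamm() uses nlme, which provides corAR1
b2 <- gamm(dolphin_presence ~ s(dist_slag) + s(Depth),
           family = binomial,
           correlation = corAR1(form = ~ 1 | block),
           data = dat2)
summary(b2$gam)   # smooth terms
summary(b2$lme)   # the estimated AR(1) parameter appears in the correlation structure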
Stack Overflow is nice, but there is probably more mixed-model expertise on r-sig-mixed-models@r-project.org.
