Conducting GAM-GEE in the gamm4 R package?
I am trying to analyze visual transect data of organisms in order to build a habitat distribution model. Once an organism is sighted, it is followed and point data are collected at a fixed time interval. Because of the autocorrelation within these "follows," I wish to use a GAM-GEE approach similar to that of Pirotta et al. 2011, which uses the packages 'yags' and 'splines' (http://www.int-res.com/abstracts/meps/v436/p257-272/). Their R scripts are available here (http://www.int-res.com/articles/suppl/m436p257_supp/m436p257_supp1-code.r). I have used this code with limited success and multiple issues of models failing to converge.
Below is the structure of my data:
> str(dat2)
'data.frame': 10792 obs. of 4 variables:
$ dist_slag : num 26475 26340 25886 25400 24934 ...
$ Depth : num -10.1 -10.5 -16.6 -22.2 -29.7 ...
$ dolphin_presence: int 0 0 0 0 0 0 0 0 0 0 ...
$ block : int 1 1 1 1 1 1 1 1 1 1 ...
> head(dat2)
dist_slag Depth dolphin_presence block
1 26475.47 -10.0934 0 1
2 26340.47 -10.4870 0 1
3 25886.33 -16.5752 0 1
4 25399.88 -22.2474 0 1
5 24934.29 -29.6797 0 1
6 24519.90 -26.2370 0 1
Here is the summary of my block variable (the grouping variable within which autocorrelation exists; there are 148 blocks):
> summary(dat2$block)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 39.00 76.00 73.52 111.00 148.00
However, I would like to use the package 'gamm4', as I am more familiar with Professor Simon Wood's packages and functions, and it appears gamm4 might be the most appropriate tool. It is important to note that the models have a binary response (organism presence or absence along a transect), which is why I think gamm4 is more appropriate than gamm. The gamm help provides the following example for autocorrelation within factors:
## more complicated autocorrelation example - AR errors
## only within groups defined by `fac'
## (x, f, fac, n and sig are set up earlier in the help example)
e <- rnorm(n, 0, sig)
for (i in 2:n) e[i] <- 0.6*e[i-1]*(fac[i-1]==fac[i]) + e[i]
y <- f + e
b <- gamm(y ~ s(x, k=20), correlation = corAR1(form = ~ 1 | fac))
Following this example, here is the code I used for my dataset:
b <- gamm4(dolphin_presence~s(dist_slag)+s(Depth),random=(form=~1|block), family=binomial(),data=dat)
However, examining the output (summary(b$gam) and, especially, summary(b$mer)), I am either unsure how to interpret the results, or I do not believe that the autocorrelation within groups is being taken into account.
> summary(b$gam)
Family: binomial
Link function: logit
Formula:
dolphin_presence ~ s(dist_slag) + s(Depth)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.968 5.145 -2.715 0.00663 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(dist_slag) 4.943 4.943 70.67 6.85e-14 ***
s(Depth) 6.869 6.869 115.59 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) =  0.317   glmer.ML score = 10504   Scale est. = 1   n = 10792
>
> summary(b$mer)
Generalized linear mixed model fit by the Laplace approximation
AIC BIC logLik deviance
10514 10551 -5252 10504
Random effects:
Groups Name Variance Std.Dev.
Xr s(dist_slag) 1611344 1269.39
Xr.0 s(Depth) 98622 314.04
Number of obs: 10792, groups: Xr, 8; Xr.0, 8
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
X(Intercept) -13.968 5.145 -2.715 0.00663 **
Xs(dist_slag)Fx1 -35.871 33.944 -1.057 0.29063
Xs(Depth)Fx1 3.971 3.740 1.062 0.28823
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
X(Int) X(_)F1
Xs(dst_s)F1 0.654
Xs(Dpth)Fx1 -0.030 0.000
>
How do I ensure that autocorrelation is indeed being accounted for within each unique value of the "block" variable? And what is the simplest way to interpret the output of summary(b$mer)?
The results do differ from a normal gam (package mgcv) using the same variables and parameters without the "correlation=..." term, indicating that something different is occurring.
However, when I use a different variable for the correlation term (season), I get the SAME output:
> dat2 <- data.frame(dist_slag = dat$dist_slag, Depth = dat$Depth, dolphin_presence = dat$dolphin_presence,
+ block = dat$block, season=dat$season)
> head(dat2)
dist_slag Depth dolphin_presence block season
1 26475.47 -10.0934 0 1 F
2 26340.47 -10.4870 0 1 F
3 25886.33 -16.5752 0 1 F
4 25399.88 -22.2474 0 1 F
5 24934.29 -29.6797 0 1 F
6 24519.90 -26.2370 0 1 F
> summary(dat2$season)
F S
3224 7568
> b <- gamm4(dolphin_presence~s(dist_slag)+s(Depth),correlation=corAR1(1, form=~1 | season), family=binomial(),data=dat2)
> summary(b$gam)
Family: binomial
Link function: logit
Formula:
dolphin_presence ~ s(dist_slag) + s(Depth)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.968 5.145 -2.715 0.00663 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(dist_slag) 4.943 4.943 70.67 6.85e-14 ***
s(Depth) 6.869 6.869 115.59 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) =  0.317   glmer.ML score = 10504   Scale est. = 1   n = 10792
> summary(b$mer)
Generalized linear mixed model fit by the Laplace approximation
AIC BIC logLik deviance
10514 10551 -5252 10504
Random effects:
Groups Name Variance Std.Dev.
Xr s(dist_slag) 1611344 1269.39
Xr.0 s(Depth) 98622 314.04
Number of obs: 10792, groups: Xr, 8; Xr.0, 8
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
X(Intercept) -13.968 5.145 -2.715 0.00663 **
Xs(dist_slag)Fx1 -35.871 33.944 -1.057 0.29063
Xs(Depth)Fx1 3.971 3.740 1.062 0.28823
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
X(Int) X(_)F1
Xs(dst_s)F1 0.654
Xs(Dpth)Fx1 -0.030 0.000
>
I just want to make sure the model correctly allows for correlation within each value of the "block" variable. How do I formulate the model to say that autocorrelation can exist within each single value of block, but assume independence among blocks?
On another note, for larger models (with many more than two variables) I am also receiving the following warning message after the model completes:
Warning message:
In mer_finalize(ans) : false convergence (8)
gamm4 is built on top of lme4, which does not allow for a correlation parameter (in contrast to the nlme package, which underlies mgcv::gamm). mgcv::gamm does handle binary data, although it uses PQL, which is generally less accurate than the Laplace/GHQ approximations used in gamm4/lme4. It is unfortunate (!!) that you're not getting a warning telling you that the correlation argument is being ignored (when I try a simple example using a correlation argument with lme4 I do get a warning, but it's possible that the extra argument is getting swallowed somewhere inside gamm4).
Your desired autocorrelation structure ("autocorrelation can exist within each single value for block, but assume independence among blocks") is exactly the way correlation structures are coded in nlme (and hence in mgcv::gamm).
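A minimal sketch of that formulation with mgcv::gamm, using the column names from your question (untested against your data; note that with a binomial family gamm fits via PQL):

```r
library(mgcv)   # gamm(); attaches nlme, which supplies corAR1()

## AR(1) errors within each level of 'block', independence between blocks.
## The grouping factor on the right of `form` defines the units within
## which correlation is modelled; observations in different groups are
## treated as independent.
b <- gamm(dolphin_presence ~ s(dist_slag) + s(Depth),
          correlation = corAR1(form = ~ 1 | block),
          family = binomial, data = dat2)

summary(b$gam)   # smooth terms
summary(b$lme)   # the estimated AR(1) parameter (Phi) appears here
```

Note that gamm returns a `$lme` component (not `$mer` as in gamm4), and the estimated Phi in `summary(b$lme)` is how you confirm the correlation structure was actually fitted.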
I would use mgcv::gamm, and would suggest that if at all possible you try it out on some simulated data with known structure (or use the data set provided in the supplementary material above and see if you can reproduce their qualitative conclusions with your alternative approach).
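A rough sketch of that kind of check, simulating binary responses with known AR(1) noise inside blocks (all names and parameter values here are made up for illustration, and Phi recovery under PQL is only approximate):

```r
library(mgcv)

set.seed(1)
n.block <- 50; n.per <- 40
block <- factor(rep(1:n.block, each = n.per))
x <- runif(n.block * n.per)

## AR(1) noise (ar = 0.6) that restarts at each block boundary,
## so blocks are independent of one another
e <- as.vector(replicate(n.block, arima.sim(list(ar = 0.6), n = n.per)))

## binary response driven by a known smooth plus the correlated noise
eta <- 2 * sin(2 * pi * x) + e
y <- rbinom(length(eta), 1, plogis(eta))
sim <- data.frame(y, x, block)

m <- gamm(y ~ s(x), correlation = corAR1(form = ~ 1 | block),
          family = binomial, data = sim)
summary(m$lme)   # the estimated Phi should be clearly positive if the
                 # within-block autocorrelation is being picked up
```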
StackOverflow is nice, but there is probably more mixed-model expertise at r-sig-mixed-models@r-project.org