How does GAM regression (mgcv) deal with repeated values? (R)

I am trying to explore regressions between abundances and 3 variables. My data (test.gam) looks like this:
# A tibble: 6 x 5
Site Abundance SPM isotherm SiOH4
<chr> <dbl> <dbl> <dbl> <dbl>
1 cycle1 0.769 5960367. 102. 18.2
2 cycle1 0.632 6496360. 97.5 18.2
3 cycle1 0.983 5328652. 105 18.2
4 cycle1 1 6212034. 110 18.2
5 cycle1 0.821 5468987. 105 18.2
6 cycle1 0.734 5280549. 112. 18.2
For one of these variables (SiOH4) I have only one value per Site, while for the two other variables I have a single value for each station (each row being a station).
To plot the relation between abundances and SiOH4 I would simply compute a mean value per Site. The relation shows a constant increase of abundance with SiOH4 levels (Plot1).
Now I tried running a GAM on this data using the following code:
mod_gam1 <- gam(Abundance ~ s(isotherm, bs = "cr", k = 5) + SPM + s(SiOH4, bs = "cr", k = 5),
                data = test.gam, family = gaussian(link = log), gamma = 1.4)
giving me these results:
Family: gaussian
Link function: log
Formula:
Abundance ~ s(isotherm, bs = "cr", k = 5) + SPM + s(SiOH4, bs = "cr",
k = 5)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.182e-01 8.244e-02 -9.925 < 2e-16 ***
SPM -4.356e-08 1.153e-08 -3.778 0.000219 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(isotherm) 2.019 2.485 10.407 1.46e-05 ***
s(SiOH4) 3.861 3.986 9.823 1.01e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.492 Deviance explained = 51.2%
GCV = 0.044202 Scale est. = 0.040674 n = 177
I am quite happy with these results, but when checking the model with gam.check I find that k is too low:
Method: GCV Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [-8.801477e-14,5.555545e-13]
(score 0.04420205 & scale 0.04067442).
Hessian positive definite, eigenvalue range [6.631202e-05,7.084933e-05].
Model rank = 10 / 10
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(isotherm) 4.00 2.02 0.85 0.01 **
s(SiOH4) 4.00 3.86 0.59 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that I deliberately set k to 5; otherwise the pattern is overfitted.
I thought this might be due to the fact that many of the values in SiOH4 are repeated, so I modified my data to keep only the first SiOH4 value of each Site (replacing all other rows with NA), like this:
# A tibble: 6 x 5
# Groups: Site [1]
Site Abundance SPM isotherm SiOH4
<chr> <dbl> <dbl> <dbl> <dbl>
1 cycle1 0.769 5960367. 102. 18.2
2 cycle1 0.632 6496360. 97.5 NA
3 cycle1 0.983 5328652. 105 NA
4 cycle1 1 6212034. 110 NA
5 cycle1 0.821 5468987. 105 NA
6 cycle1 0.734 5280549. 112. NA
I hoped this would prevent the repeated levels, but this way I also lose most of my rows once na.omit is applied. However, running the same GAM on these data, gam.check no longer reports that k is too low.
So do I need to keep the repeated values and ignore the warning from gam.check, or is there a way to keep all rows even though NAs exist?
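For reference, the manipulation described above (keeping only the first SiOH4 value per Site and setting the rest to NA) can be written with dplyr. This is only an illustrative sketch, assuming the tibble is called test.gam; test.gam_first and mod_gam2 are hypothetical names:
library(dplyr)
library(mgcv)
test.gam_first <- test.gam %>%
  group_by(Site) %>%
  mutate(SiOH4 = replace(SiOH4, row_number() > 1, NA)) %>%  # keep only the first SiOH4 per Site
  ungroup()
# gam()'s default na.action (na.omit) then drops the rows where SiOH4 is NA,
# which is why most rows are lost when refitting the same model:
mod_gam2 <- gam(Abundance ~ s(isotherm, bs = "cr", k = 5) + SPM + s(SiOH4, bs = "cr", k = 5),
                data = test.gam_first, family = gaussian(link = log), gamma = 1.4)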

Related

How to use ANOVA (function = aov) for a fractional factorial designed experiment with 6 independent variables in R

I have done an incomplete factorial design (fractional factorial design) experiment on different fertilizer applications.
Here is the format of the data: Excerpt of data
I want to do an ANOVA in R using the function aov. I have 450 data points in total; 'Location' has 5 levels, N has 3, and F1, F2, F3 and F4 have two each.
Here is the code that I am using:
ANOVA1<-aov(PlantWeight~Location*N*F1*F2*F3*F4, data = data)
summary(ANOVA1)
The independent variables F1, F2, F3 and F4 are not applied in a factorial manner: each sample has either one of F1, F2, F3, F4 or nothing applied. In the cases where no F1, F2, F3 or F4 fertiliser was applied, the value 0 has been put in every column; this is the control to which each of F1, F2, F3 and F4 will be compared. If F1 has been applied, then the F1 column reads 1 and the F2, F3 and F4 columns read NA.
When I try to run this ANOVA I get this error message:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
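For context, this error typically appears when a factor in the model formula ends up with fewer than two levels among the rows actually used for fitting (for example after rows containing NA are dropped). A minimal illustrative sketch with made-up data, not the questioner's:
set.seed(1)
PlantWeight <- rnorm(6)
F1 <- factor(c(1, 1, 1, NA, NA, NA))  # after the NA rows are dropped, only one level remains
aov(PlantWeight ~ F1)
# should fail with "contrasts can be applied only to factors with 2 or more levels"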
Another approach I tried was to put an 'x' instead of 'NA'. This has issues because it treats x as a real factor level when it is not. It seemed to work fine, except that it would always ignore F4.
ANOVA2<-aov(PlantWeight~((F1*Location*N)+(F2*Location*N)+(F3*Location*N)+
(F4*Location*N)), data = data)
summary(ANOVA2)
Results:
Df Sum Sq Mean Sq F value Pr(>F)
F1 2 10.3 5.13 5.742 0.00351 **
Location 6 798.6 133.11 149.027 < 2e-16 ***
N 2 579.6 289.82 324.485 < 2e-16 ***
F2 1 0.3 0.33 0.364 0.54667
F3 1 0.4 0.44 0.489 0.48466
F1:Location 10 26.5 2.65 2.962 0.00135 **
F1:N 4 6.6 1.66 1.857 0.11737
Location:N 10 113.5 11.35 12.707 < 2e-16 ***
Location:F2 5 6.5 1.30 1.461 0.20188
N:F2 2 2.7 1.37 1.537 0.21641
Location:F3 5 33.6 6.72 7.529 9.73e-07 ***
N:F3 2 2.5 1.23 1.375 0.25409
F1:Location:N 20 12.4 0.62 0.696 0.83029
F2:Location:N 10 18.9 1.89 2.113 0.02284 *
F3:Location:N 10 26.8 2.68 3.001 0.00118 **
Residuals 359 320.6 0.89
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Any help on how to approach this would be wonderful!

Why is my summary in R only including some of my variables?

I am trying to see if there is a relationship between the number of bat calls and the stage of the pup-rearing season. The Pup variable has three categories: "Pre", "Middle", and "Post". When I ask for the summary, it only includes the p-values for the Pre and Post stages. I created a sample data set below. With the sample data set I just get an error; with my actual data set I get the output described above.
SAMPLE DATA SET:
Calls<- c("55","60","180","160","110","50")
Pup<-c("Pre","Middle","Post","Post","Middle","Pre")
q<-data.frame(Calls, Pup)
q
q1<-lm(Calls~Pup, data=q)
summary(q1)
OUTPUT AND ERROR MESSAGE FROM SAMPLE:
> Calls Pup
1 55 Pre
2 60 Middle
3 180 Post
4 160 Post
5 110 Middle
6 50 Pre
Error in as.character.factor(x) : malformed factor
In addition: Warning message:
In Ops.factor(r, 2) : ‘^’ not meaningful for factors
ACTUAL INPUT FOR MY ANALYSIS:
> pupint <- lm(Calls ~ Pup, data = park2)
summary(pupint)
THIS IS THE OUTPUT I GET FROM MY ACTUAL DATA SET:
Residuals:
Min 1Q Median 3Q Max
-66.40 -37.63 -26.02 -5.39 299.93
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.54 35.82 1.858 0.0734 .
PupPost -51.98 48.50 -1.072 0.2927
PupPre -26.47 39.86 -0.664 0.5118
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 80.1 on 29 degrees of freedom
Multiple R-squared: 0.03822, Adjusted R-squared: -0.02811
F-statistic: 0.5762 on 2 and 29 DF, p-value: 0.5683
Overall, I am just wondering why the above output isn't showing "Middle". Sorry the sample data set didn't work out the same way, but maybe that error message will help in understanding the problem.
For R to handle the dummy coding correctly, you have to indicate that Pup is a qualitative (categorical) variable by using factor:
> Pup <- factor(Pup)
> q<-data.frame(Calls, Pup)
> q1<-lm(Calls~Pup, data=q)
> summary(q1)
Call:
lm(formula = Calls ~ Pup, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.00 15.61 5.444 0.0122 *
PupPost 85.00 22.08 3.850 0.0309 *
PupPre -32.50 22.08 -1.472 0.2374
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9097, Adjusted R-squared: 0.8494
F-statistic: 15.1 on 2 and 3 DF, p-value: 0.02716
If you want R to show all categories of the dummy variable, then you must remove the intercept from the regression; otherwise you will fall into the dummy variable trap.
summary(lm(Calls~Pup-1, data=q))
Call:
lm(formula = Calls ~ Pup - 1, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
PupMiddle 85.00 15.61 5.444 0.01217 *
PupPost 170.00 15.61 10.889 0.00166 **
PupPre 52.50 15.61 3.363 0.04365 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.9631
F-statistic: 53.17 on 3 and 3 DF, p-value: 0.004234
If you include a categorical variable like Pup in a regression, then by default R creates a dummy variable for each value of that variable except one (the reference level). You can get a coefficient for PupMiddle if you instead omit the intercept, like this:
q1<-lm(Calls~Pup - 1, data=q)
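As an aside, the "malformed factor" error from the sample data set most likely comes from Calls being created as a character vector, so the response itself ends up as a factor inside the data frame. A minimal sketch of a fix for the sample data (illustrative only, reusing the objects from the question):
Calls <- as.numeric(c("55", "60", "180", "160", "110", "50"))  # numeric response
Pup   <- factor(c("Pre", "Middle", "Post", "Post", "Middle", "Pre"))
q     <- data.frame(Calls, Pup)
summary(lm(Calls ~ Pup, data = q))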

Confidence intervals for hurdle or zero inflated mixed models

I want to calculate CIs for mixed models, a zero-inflated negative binomial model and a hurdle model. My code for the hurdle model looks like this (x1, x2 continuous, x3 categorical):
m1 <- glmmTMB(count~x1+x2+x3+(1|year/class),
data = bd, zi = ~x2+x3+(1|year/class), family = truncated_nbinom2,
)
I used confint, and I got these results:
ci <- confint(m1,parm="beta_")
ci
2.5 % 97.5 % Estimate
cond.(Intercept) 1.816255e-01 0.448860094 0.285524861
cond.x1 9.045278e-01 0.972083366 0.937697401
cond.x2 1.505770e+01 26.817439186 20.094998772
cond.x3high 1.190972e+00 1.492335046 1.333164894
cond.x3low 1.028147e+00 1.215828654 1.118056377
cond.x3reg 1.135515e+00 1.385833853 1.254445909
class:year.cond.Std.Dev.(Intercept)2.256324e+00 2.662976154 2.441845815
year.cond.Std.Dev.(Intercept) 1.051889e+00 1.523719169 1.157153015
zi.(Intercept) 1.234418e-04 0.001309705 0.000402085
zi.x2 2.868578e-02 0.166378014 0.069084606
zi.x3high 8.972025e-01 1.805832900 1.272869874
Am I calculating the intervals correctly? Why is only one category of x3 shown for the zi part?
If possible, I would also like to know if it's possible to plot these CIs.
Thanks!
Data looks like this:
class id year count x1 x2 x3
956 5 3002 2002 3 15.6 47.9 high
957 5 4004 2002 3 14.3 47.9 low
958 5 6021 2002 3 14.2 47.9 high
959 4 2030 2002 3 10.5 46.3 high
960 4 2031 2002 3 15.3 46.3 high
961 4 2034 2002 3 15.2 46.3 reg
with x1 and x2 continuous and x3 a three-level categorical variable (factor).
Summary of the model:
summary(m1)
'giveCsparse' has been deprecated; setting 'repr = "T"' for you'giveCsparse' has been deprecated; setting 'repr = "T"' for you'giveCsparse' has been deprecated; setting 'repr = "T"' for you
Family: truncated_nbinom2 ( log )
Formula: count ~ x1 + x2 + x3 + (1 | year/class)
Zero inflation: ~x2 + x3 + (1 | year/class)
Data: bd
AIC BIC logLik deviance df.resid
37359.7 37479.7 -18663.8 37327.7 13323
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
class:year(Intercept) 0.79701 0.8928
year (Intercept) 0.02131 0.1460
Number of obs: 13339, groups: class:year, 345; year, 15
Zero-inflation model:
Groups Name Variance Std.Dev.
dpto:year (Intercept) 1.024e+02 1.012e+01
year (Intercept) 7.842e-07 8.856e-04
Number of obs: 13339, groups: class:year, 345; year, 15
Overdispersion parameter for truncated_nbinom2 family (): 1.02
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.25343 0.23081 -5.431 5.62e-08 ***
x1 -0.06433 0.01837 -3.501 0.000464 ***
x2 3.00047 0.14724 20.378 < 2e-16 ***
x3high 0.28756 0.05755 4.997 5.82e-07 ***
x3low 0.11159 0.04277 2.609 0.009083 **
x3reg 0.22669 0.05082 4.461 8.17e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Zero-inflation model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.8188 0.6025 -12.977 < 2e-16 ***
x2 -2.6724 0.4484 -5.959 2.53e-09 ***
x3high 0.2413 0.1784 1.352 0.17635
x3low -0.1325 0.1134 -1.169 0.24258
x3reg -0.3806 0.1436 -2.651 0.00802 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
CI with broom.mixed
> broom.mixed::tidy(m1, effects="fixed", conf.int=TRUE)
# A tibble: 12 x 9
effect component term estimate std.error statistic p.value conf.low conf.high
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 fixed cond (Intercept) -1.25 0.231 -5.43 5.62e- 8 -1.71 -0.801
2 fixed cond x1 -0.0643 0.0184 -3.50 4.64e- 4 -0.100 -0.0283
3 fixed cond x2 3.00 0.147 20.4 2.60e-92 2.71 3.29
4 fixed cond x3high 0.288 0.0575 5.00 5.82e- 7 0.175 0.400
5 fixed cond x3low 0.112 0.0428 2.61 9.08e- 3 0.0278 0.195
6 fixed cond x3reg 0.227 0.0508 4.46 8.17e- 6 0.127 0.326
7 fixed zi (Intercept) -9.88 1.32 -7.49 7.04e-14 -12.5 -7.30
8 fixed zi x1 0.214 0.120 1.79 7.38e- 2 -0.0206 0.448
9 fixed zi x2 -2.69 0.449 -6.00 2.01e- 9 -3.57 -1.81
10 fixed zi x3high 0.232 0.178 1.30 1.93e- 1 -0.117 0.582
11 fixed zi x3low -0.135 0.113 -1.19 2.36e- 1 -0.357 0.0878
12 fixed zi x4reg -0.382 0.144 -2.66 7.74e- 3 -0.664 -0.101
tl;dr as far as I can tell this is a bug in confint.glmmTMB (and probably in the internal function glmmTMB:::getParms). In the meantime, broom.mixed::tidy(m1, effects="fixed") should do what you want. (There's now a fix in progress in the development version on GitHub, should make it to CRAN sometime? soon ...)
Reproducible example:
set up data
set.seed(101)
n <- 1e3
bd <- data.frame(
year=factor(sample(2002:2018, size=n, replace=TRUE)),
class=factor(sample(1:20, size=n, replace=TRUE)),
x1 = rnorm(n),
x2 = rnorm(n),
x3 = factor(sample(c("low","reg","high"), size=n, replace=TRUE),
levels=c("low","reg","high")),
count = rnbinom(n, mu = 3, size=1))
fit
library(glmmTMB)
m1 <- glmmTMB(count~x1+x2+x3+(1|year/class),
data = bd, zi = ~x2+x3+(1|year/class), family = truncated_nbinom2,
)
confidence intervals
confint(m1, "beta_") ## wrong/ incomplete
broom.mixed::tidy(m1, effects="fixed", conf.int=TRUE) ## correct
You may want to think about which kind of confidence intervals you want:
Wald CIs (default) are much faster to compute and are generally OK as long as (1) your data set is large and (2) you aren't estimating any parameters on the log/logit scale that are near the boundaries
likelihood profile CIs are more accurate but much slower
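On the plotting question: one common approach (just a sketch, assuming the tidy output is stored in an object called tt) is to draw the fixed-effect estimates and their intervals with ggplot2:
library(ggplot2)
tt <- broom.mixed::tidy(m1, effects = "fixed", conf.int = TRUE)
ggplot(tt, aes(x = estimate, y = term, colour = component)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +  # point estimate plus CI
  geom_vline(xintercept = 0, linetype = 2)                   # reference line at zero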

emmeans pairwise analysis for multilevel repeated measures ANCOVA

I'm building a repeated-measures ANCOVA in a multilevel framework using the aov function. I have one continuous response variable, two factor predictors, and 3 continuous covariates. My script for the model is below:
ModelDV <- aov(DV ~ IV1 + IV2 + IV1*IV2 + CV1 + CV2 + CV3 + Error(PartID/(IV1 + IV2 + IV1:IV2)), data)
A snippet of my data set shows how it is formatted:
PartID DV IV1 IV2 CV1 CV2 CV3
1 56 CondA1 CondB1 Continuous values
2 45 CondA2 CondB2 -
3 32 CondA3 CondB1 -
4 21 CondA4 CondB2 -
1 10 CondA1 CondB1 -
2 19 CondA2 CondB2 -
3 35 CondA3 CondB1 -
4 45 CondA4 CondB2 -
My conditions are embedded in the error term along with the participant ID since this is a fully within-subjects repeated measures model.
I am attempting to conduct a pairwise analysis on these values. My output provides omnibus F-tests:
Error: PartID
Df Sum Sq Mean Sq F value Pr(>F)
CV1 1 348 348 0.442 0.5308
CV2 1 9 9 0.011 0.9193
CV3 1 3989 3989 5.063 0.0654 .
Residuals 6 4727 788
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Error: PartID:IV1
Df Sum Sq Mean Sq F value Pr(>F)
IV1 1 6222 6222 17.41 0.0024 **
Residuals 9 3217 357
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Error: PartID:IV2
Df Sum Sq Mean Sq F value Pr(>F)
IV2 2 6215 3107.7 16.18 9.51e-05 ***
Residuals 18 3457 192.1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Error: PartID:IV1:IV2
Df Sum Sq Mean Sq F value Pr(>F)
IV1:IV2 2 575.2 287.6 1.764 0.2
Residuals 18 2934.4 163.0
When calculating emmeans via:
emm<-emmeans(Model, ~ IV1)
pairs(emm)
I get a sensible output.
However, when using this for the covariates:
emm<-emmeans(Model, ~ CV1)
pairs(emm)
I get the following output:
contrast estimate SE df z.ratio p.value
(nothing) nonEst NA NA NA NA
Results are averaged over the levels of: IV1, IV2
What am I doing wrong here that a pairwise comparison is not working for the covariate?
The short answer is: because you have made them covariates in order to control for them, not to treat them as part of the explanation in your model. You could of course do pairwise comparisons for the covariates outside the model, but not inside the model framework. I wrote a longer blog post using these tools here...
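To see why pairs() returns "(nothing) nonEst" for a covariate, it can help to inspect the reference grid: emmeans reduces each continuous covariate to a single reference value (by default its mean), so there is only one "level" and no contrasts to form. A small illustrative sketch, assuming the fitted aov model object from the question (called ModelDV above):
library(emmeans)
rg <- ref_grid(ModelDV)
rg  # shows CV1, CV2 and CV3 each held at a single reference value (their mean)
# Because each covariate contributes only one reference value,
# emmeans(ModelDV, ~ CV1) yields a single estimate and pairs() has nothing to compare.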

Conducting a GAM-GEE in the gamm4 R package?

I am trying to analyze some visual transect data of organisms to generate a habitat distribution model. Once organisms are sighted, they are followed as point data is collected at a given time interval. Because of the autocorrelation among these "follows," I wish to utilize a GAM-GEE approach similar to that of Pirotta et al. 2011, using packages 'yags' and 'splines' (http://www.int-res.com/abstracts/meps/v436/p257-272/). Their R scripts are shown here (http://www.int-res.com/articles/suppl/m436p257_supp/m436p257_supp1-code.r). I have used this code with limited success and multiple issues of models failing to converge.
Below is the structure of my data:
> str(dat2)
'data.frame': 10792 obs. of 4 variables:
$ dist_slag : num 26475 26340 25886 25400 24934 ...
$ Depth : num -10.1 -10.5 -16.6 -22.2 -29.7 ...
$ dolphin_presence: int 0 0 0 0 0 0 0 0 0 0 ...
$ block : int 1 1 1 1 1 1 1 1 1 1 ...
> head(dat2)
dist_slag Depth dolphin_presence block
1 26475.47 -10.0934 0 1
2 26340.47 -10.4870 0 1
3 25886.33 -16.5752 0 1
4 25399.88 -22.2474 0 1
5 24934.29 -29.6797 0 1
6 24519.90 -26.2370 0 1
Here is the summary of my block variable (indicating the number of groups within which autocorrelation exists):
> summary(dat2$block)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 39.00 76.00 73.52 111.00 148.00
However, I would like to use the package 'gamm4', as I am more familiar with Professor Simon Wood's packages and functions, and it appears gamm4 might be the most appropriate. It is important to note that the models have a binary response (organism presence or absence along a transect), which is why I think gamm4 is more appropriate than gamm. The gamm help provides the following example for autocorrelation within factors:
## more complicated autocorrelation example - AR errors
## only within groups defined by `fac'
e <- rnorm(n,0,sig)
for (i in 2:n) e[i] <- 0.6*e[i-1]*(fac[i-1]==fac[i]) + e[i]
y <- f + e
b <- gamm(y~s(x,k=20),correlation=corAR1(form=~1|fac))
Following this example, here is the code I used for my dataset:
b <- gamm4(dolphin_presence~s(dist_slag)+s(Depth),random=(form=~1|block), family=binomial(),data=dat)
However, examining the output (summary(b$gam) and especially summary(b$mer)), I am either unsure how to interpret the results or do not believe that the autocorrelation within groups is being taken into account.
> summary(b$gam)
Family: binomial
Link function: logit
Formula:
dolphin_presence ~ s(dist_slag) + s(Depth)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.968 5.145 -2.715 0.00663 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(dist_slag) 4.943 4.943 70.67 6.85e-14 ***
s(Depth) 6.869 6.869 115.59 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.317glmer.ML score = 10504 Scale est. = 1 n = 10792
>
> summary(b$mer)
Generalized linear mixed model fit by the Laplace approximation
AIC BIC logLik deviance
10514 10551 -5252 10504
Random effects:
Groups Name Variance Std.Dev.
Xr s(dist_slag) 1611344 1269.39
Xr.0 s(Depth) 98622 314.04
Number of obs: 10792, groups: Xr, 8; Xr.0, 8
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
X(Intercept) -13.968 5.145 -2.715 0.00663 **
Xs(dist_slag)Fx1 -35.871 33.944 -1.057 0.29063
Xs(Depth)Fx1 3.971 3.740 1.062 0.28823
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
X(Int) X(_)F1
Xs(dst_s)F1 0.654
Xs(Dpth)Fx1 -0.030 0.000
>
How do I ensure that autocorrelation is indeed being accounted for within each unique value of the "block" variable? And what is the simplest way to interpret the output of summary(b$mer)?
The results do differ from a normal gam (package mgcv) using the same variables and parameters without the "correlation=..." term, indicating that something different is occurring.
However, when I use a different variable for the correlation term (season), I get the SAME output:
> dat2 <- data.frame(dist_slag = dat$dist_slag, Depth = dat$Depth, dolphin_presence = dat$dolphin_presence,
+ block = dat$block, season=dat$season)
> head(dat2)
dist_slag Depth dolphin_presence block season
1 26475.47 -10.0934 0 1 F
2 26340.47 -10.4870 0 1 F
3 25886.33 -16.5752 0 1 F
4 25399.88 -22.2474 0 1 F
5 24934.29 -29.6797 0 1 F
6 24519.90 -26.2370 0 1 F
> summary(dat2$season)
F S
3224 7568
> b <- gamm4(dolphin_presence~s(dist_slag)+s(Depth),correlation=corAR1(1, form=~1 | season), family=binomial(),data=dat2)
> summary(b$gam)
Family: binomial
Link function: logit
Formula:
dolphin_presence ~ s(dist_slag) + s(Depth)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.968 5.145 -2.715 0.00663 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(dist_slag) 4.943 4.943 70.67 6.85e-14 ***
s(Depth) 6.869 6.869 115.59 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.317glmer.ML score = 10504 Scale est. = 1 n = 10792
> summary(b$mer)
Generalized linear mixed model fit by the Laplace approximation
AIC BIC logLik deviance
10514 10551 -5252 10504
Random effects:
Groups Name Variance Std.Dev.
Xr s(dist_slag) 1611344 1269.39
Xr.0 s(Depth) 98622 314.04
Number of obs: 10792, groups: Xr, 8; Xr.0, 8
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
X(Intercept) -13.968 5.145 -2.715 0.00663 **
Xs(dist_slag)Fx1 -35.871 33.944 -1.057 0.29063
Xs(Depth)Fx1 3.971 3.740 1.062 0.28823
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
X(Int) X(_)F1
Xs(dst_s)F1 0.654
Xs(Dpth)Fx1 -0.030 0.000
>
I just want to make sure it is correctly allowing for correlation within each value for the "block" variable. How do I formulate the model to say that autocorrelation can exist within each single value for block, but assume independence among blocks?
On another note, I am also receiving the following warning message after model completion for larger models (with many more than 2 variables):
Warning message:
In mer_finalize(ans) : false convergence (8)
gamm4 is built on top of lme4, which does not allow for a correlation parameter (in contrast to the nlme package, which underlies mgcv::gamm). mgcv::gamm does handle binary data, although it uses PQL, which is generally less accurate than the Laplace/GHQ approximations used in gamm4/lme4. It is unfortunate (!!) that you're not getting a warning telling you that the correlation argument is being ignored (when I try a simple example using a correlation argument with lme4 I do get a warning, but it's possible that the extra argument is getting swallowed somewhere inside gamm4).
Your desired autocorrelation structure ("autocorrelation can exist within each single value for block, but assume independence among blocks") is exactly the way correlation structures are coded in nlme (and hence in mgcv::gamm).
I would use mgcv::gamm, and would suggest that, if at all possible, you try it out on some simulated data with known structure (or use the data set provided in the supplementary material above and see if you can reproduce their qualitative conclusions with your alternative approach).
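As a sketch of what that formulation might look like for the data above (variable names taken from the question; treat this as illustrative rather than a tested model):
library(mgcv)  # mgcv loads nlme, which provides corAR1
b2 <- gamm(dolphin_presence ~ s(dist_slag) + s(Depth),
           family = binomial,
           correlation = corAR1(form = ~ 1 | block),  # AR(1) errors within each block,
           data = dat2)                               # independence between blocks
summary(b2$gam)  # smooth terms
summary(b2$lme)  # underlying lme fit, including the estimated AR(1) parameter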
StackOverflow is nice, but there is probably more mixed-model expertise on the r-sig-mixed-models@r-project.org mailing list.
