lmer and lm - use all factor levels as 'control' group

When there is a categorical variable (unordered factor) in a linear model or lmer formula, the function uses the first factor level as the 'control' group for contrasts. In my case I have a categorical variable with several levels, and I would like each level to serve as the 'control' base group in turn. Is there a function that automates this process and creates a nice matrix with p-values for all combinations? Here's some sample code using the diamonds dataset.
library(lme4); library(lmerTest) #lmer() comes from lme4; there is no 'lmer' package
library(ggplot2) #for the diamonds dataset
#creating unordered factor
diamonds$color <- factor(sample(c('red','white','blue','green','black'), nrow(diamonds), replace = TRUE))
#lmer formula with factor in fixed effects
mod <- lmer(carat ~ color + (1 | clarity), data = diamonds)
summary(mod,corr=F)
As shown in the summary below, 'black' is used as the control, so I would like all the other colors to be used as the control as well.
Linear mixed model fit by REML. t-tests use Satterthwaite's method [lmerModLmerTest]
Formula: carat ~ color + (1 | clarity)
   Data: diamonds

REML criterion at convergence: 64684

Scaled residuals:
    Min      1Q  Median      3Q     Max
 -2.228  -0.740  -0.224   0.540   8.471

Random effects:
 Groups   Name        Variance Std.Dev.
 clarity  (Intercept) 0.0763   0.276
 Residual             0.1939   0.440
Number of obs: 53940, groups:  clarity, 8

Fixed effects:
             Estimate Std. Error           df t value Pr(>|t|)
(Intercept)  0.786709   0.097774     7.005805    8.05 0.000087 ***
colorblue   -0.000479   0.005989 53927.996020   -0.08     0.94
colorgreen   0.007455   0.005998 53927.990722    1.24     0.21
colorred     0.000746   0.005986 53927.988909    0.12     0.90
colorwhite   0.000449   0.005971 53927.993708    0.08     0.94
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
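For a single alternative base group, relevel() changes which level is treated as the reference and the model can be refit (a minimal sketch; repeating this for every level is the manual process the question hopes to automate):
#sketch: make 'red' the reference level instead of 'black'
diamonds$color <- relevel(diamonds$color, ref = 'red')
mod_red <- lmer(carat ~ color + (1 | clarity), data = diamonds)
summary(mod_red, corr = FALSE)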

I could imagine wanting to do this for one of two reasons. The first would be to get the predicted value of the outcome at each level of the unordered factor (controlling for everything else in the model). The other would be to calculate all of the pairwise differences across the levels of the factor. If either of these is your goal, there are better ways to do it than refitting the model with each level as the base; both are shown below. Let's start by using the diamonds data and the existing color variable, but making it an unordered factor.
library(lme4)
library(lmerTest)
library(multcomp)
library(ggeffects)
#creating unordered factor
data(diamonds, package="ggplot2")
diamonds$color <- as.factor(as.character(diamonds$color))
Now, we can run the model:
#lmer formula with factor in fixed effects
mod <- lmer(carat ~ color + (1 | clarity), data = diamonds)
The function glht in the multcomp package tests pairwise differences among factor levels. Here is the output.
summary(glht(mod, linfct = mcp(color="Tukey")))
#>
#> Simultaneous Tests for General Linear Hypotheses
#>
#> Multiple Comparisons of Means: Tukey Contrasts
#>
#>
#> Fit: lmer(formula = carat ~ color + (1 | clarity), data = diamonds)
#>
#> Linear Hypotheses:
#> Estimate Std. Error z value Pr(>|z|)
#> E - D == 0 0.025497 0.006592 3.868 0.00216 **
#> F - D == 0 0.116241 0.006643 17.497 < 0.001 ***
#> G - D == 0 0.181010 0.006476 27.953 < 0.001 ***
#> H - D == 0 0.271558 0.006837 39.721 < 0.001 ***
#> I - D == 0 0.392373 0.007607 51.577 < 0.001 ***
#> J - D == 0 0.511159 0.009363 54.592 < 0.001 ***
#> F - E == 0 0.090744 0.005997 15.130 < 0.001 ***
#> G - E == 0 0.155513 0.005789 26.863 < 0.001 ***
#> H - E == 0 0.246061 0.006224 39.536 < 0.001 ***
#> I - E == 0 0.366876 0.007059 51.975 < 0.001 ***
#> J - E == 0 0.485662 0.008931 54.380 < 0.001 ***
#> G - F == 0 0.064768 0.005807 11.154 < 0.001 ***
#> H - F == 0 0.155317 0.006258 24.819 < 0.001 ***
#> I - F == 0 0.276132 0.007091 38.939 < 0.001 ***
#> J - F == 0 0.394918 0.008962 44.065 < 0.001 ***
#> H - G == 0 0.090548 0.006056 14.952 < 0.001 ***
#> I - G == 0 0.211363 0.006910 30.587 < 0.001 ***
#> J - G == 0 0.330150 0.008827 37.404 < 0.001 ***
#> I - H == 0 0.120815 0.007276 16.606 < 0.001 ***
#> J - H == 0 0.239602 0.009107 26.311 < 0.001 ***
#> J - I == 0 0.118787 0.009690 12.259 < 0.001 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> (Adjusted p values reported -- single-step method)
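If you specifically want the adjusted p-values as a named vector (e.g. to arrange into a matrix of all combinations), they are stored in the test component of the glht summary; a sketch:
gs <- summary(glht(mod, linfct = mcp(color = "Tukey")))
#adjusted p-values, named by comparison (e.g. "E - D")
pvals <- setNames(as.numeric(gs$test$pvalues), names(gs$test$coefficients))
pvals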
If you wanted all the predicted values of carat for the different values of color, you could use ggpredict() from the ggeffects package:
g <- ggpredict(mod, terms = "color")
plot(g)
Plotting the g object produces the plot, but printing it will show the values and confidence intervals.
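If you want the numbers rather than the plot, the ggeffects object can be printed or coerced to a data frame (a sketch):
print(g)         #predicted carat for each color, with confidence intervals
as.data.frame(g) #the same values as a plain data frame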
Created on 2023-02-01 by the reprex package (v2.0.1)

Related

Plotting mean and standard error of mean from linear regression

I've run a multiple linear regression where pred_acc is the continuous dependent variable and emotion_pred and emotion_target are two dummy-coded independent variables (0 and 1). Furthermore, I am interested in the interaction between the two independent variables.
model <- lm(pred_acc ~ emotion_pred * emotion_target, data = data_almost_final)
summary(model)
Residuals:
Min 1Q Median 3Q Max
-0.66049 -0.19522 0.01235 0.19213 0.67284
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.97222 0.06737 14.432 < 2e-16 ***
emotion_pred 0.45988 0.09527 4.827 8.19e-06 ***
emotion_target 0.24383 0.09527 2.559 0.012719 *
emotion_pred:emotion_target -0.47840 0.13474 -3.551 0.000703 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2858 on 68 degrees of freedom
(1224 observations deleted due to missingness)
Multiple R-squared: 0.2555, Adjusted R-squared: 0.2227
F-statistic: 7.781 on 3 and 68 DF, p-value: 0.0001536
In case some context is needed: I did a survey where couples had to predict their partner's preferences. The predicting individual was either in emotion state 0 or 1 (emotion_pred) and the target individual was either in emotion state 0 or 1 (emotion_target). Accordingly, there are four combinations.
Now I want to plot the regression with the means of each combination of the independent variables (0,1; 1,0; 1,1; 0,0) and add error bars with the standard error of the means. I have no idea how to do this. Can anyone help?
Here's an extraction from my data:
pred_acc emotion_pred emotion_target
1 1.0000000 1 0
2 1.2222222 0 1
3 0.7777778 0 0
4 1.1111111 1 1
5 1.3888889 1 1
Here is a sketch of how I want it to look.
Using emmip from the emmeans library:
library(emmeans)
model <- lm(pred_acc ~ emotion_pred * emotion_target, data = d2)
emmip(model, emotion_pred ~ emotion_target, CIs = TRUE, style = "factor")
If you want more control over the image or just to get the values you can use the emmeans function directly:
> emmeans(model , ~ emotion_pred * emotion_target )
emotion_pred emotion_target emmean SE df lower.CL upper.CL
0 0 0.778 0.196 1 -1.718 3.27
1 0 1.000 0.196 1 -1.496 3.50
0 1 1.222 0.196 1 -1.274 3.72
1 1 1.250 0.139 1 -0.515 3.01
Then you can use ggplot on this dataframe to make whatever graph you like.
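For example, a minimal ggplot sketch built on that data frame (the emmean and SE columns come from coercing the emmeans result; error bars of ± 1 SE are an assumption about what you want, rather than the confidence limits emmeans reports):
library(ggplot2)
emm <- as.data.frame(emmeans(model, ~ emotion_pred * emotion_target))
ggplot(emm, aes(x = factor(emotion_target), y = emmean,
                colour = factor(emotion_pred),
                group  = factor(emotion_pred))) +
  geom_point(position = position_dodge(width = 0.3)) +
  geom_errorbar(aes(ymin = emmean - SE, ymax = emmean + SE),
                width = 0.15, position = position_dodge(width = 0.3)) +
  labs(x = "emotion_target", y = "mean pred_acc", colour = "emotion_pred")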

Coeftest function in R--Variable not reported in output

I have a linear regression that I am running in R, and I am calculating clustered standard errors. I get the output of coeftest(), but in some cases it doesn't report anything for a variable, and I don't get an error. Does this mean the coefficient couldn't be calculated, or does coeftest() not report variables that are insignificant? I can't seem to find the answer in any of the R documentation.
Here is the output from R:
lm1 <- lm(PeaceA ~ Soc_Edu + Pol_Constitution + mediation + gdp + enrollratio + infantmortality , data=qsi.surv)
coeftest(lm1, vcov = vcovHC(lm1, type = "HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.05780946 0.20574444 -5.1414 4.973e-06 ***
Soc_Edu -1.00735592 0.11756507 -8.5685 3.088e-11 ***
mediation 0.65682159 0.06291926 10.4391 6.087e-14 ***
gdp 0.00041894 0.00010205 4.1052 0.000156 ***
enrollratio 0.00852143 0.00177600 4.7981 1.598e-05 ***
infantmortality 0.00455383 0.00079536 5.7255 6.566e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Notice that there is nothing reported for the variable Pol_Constitution.
I assume you mean the functions coeftest() from package lmtest and vcovHC() from package sandwich. In this combination, coefficients for linearly dependent columns are silently dropped from coeftest's output. Thus, I assume your variable/column Pol_Constitution suffers from linear dependence.
Below is an example which demonstrates the behaviour with a linearly dependent column. See how the estimated coefficient for I(2 * cyl) is NA in a simple summary() and in coeftest(), but silently dropped when the latter is combined with vcovHC().
library(lmtest)
library(sandwich)
data(mtcars)
summary(mod <- lm(mpg ~ cyl + I(2*cyl), data = mtcars))
#> [...]
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
#> cyl -2.8758 0.3224 -8.92 6.11e-10 ***
#> I(2 * cyl) NA NA NA NA
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> [...]
coeftest(mod)
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 37.88458 2.07384 18.2678 < 2.2e-16 ***
#> cyl -2.87579 0.32241 -8.9197 6.113e-10 ***
#> I(2 * cyl) NA NA NA NA
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(mod, vcov. = vcovHC(mod))
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 37.88458 2.74154 13.8187 1.519e-14 ***
#> cyl -2.87579 0.38869 -7.3987 3.040e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
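Two quick ways to confirm which column was dropped and why (a sketch using the same model):
alias(mod)                          #reports I(2 * cyl) as an exact multiple of cyl
names(coef(mod))[is.na(coef(mod))]  #coefficients set to NA by the singular fit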

How do I use the glm() function?

I'm trying to fit a generalized linear model (GLM) to my data using R. I have a continuous variable Y and two categorical factors, A and B. Each factor is coded as 0 or 1, for presence or absence.
Even though I see a clear interaction between A and B just by looking at the data, the GLM says the p-value >> 0.05. Am I doing something wrong?
First of all I create the data frame for the GLM, which consists of a dependent variable Y and two factors, A and B. These are two-level factors (0 and 1). There are 3 replicates per combination.
A<-c(0,0,0,1,1,1,0,0,0,1,1,1)
B<-c(0,0,0,0,0,0,1,1,1,1,1,1)
Y<-c(0.90,0.87,0.93,0.85,0.98,0.96,0.56,0.58,0.59,0.02,0.03,0.04)
my_data<-data.frame(A,B,Y)
Let's see how it looks:
my_data
## A B Y
## 1 0 0 0.90
## 2 0 0 0.87
## 3 0 0 0.93
## 4 1 0 0.85
## 5 1 0 0.98
## 6 1 0 0.96
## 7 0 1 0.56
## 8 0 1 0.58
## 9 0 1 0.59
## 10 1 1 0.02
## 11 1 1 0.03
## 12 1 1 0.04
As we can see just by looking at the data, there is a clear interaction between factor A and factor B, as the value of Y dramatically decreases when both A and B are present (that is, A=1 and B=1). However, using the glm function I get no significant interaction between A and B, as the p-value >> 0.05.
attach(my_data)
## The following objects are masked _by_ .GlobalEnv:
##
## A, B, Y
my_glm<-glm(Y~A+B+A*B,data=my_data,family=binomial)
## Warning: non-integer #successes in a binomial glm!
summary(my_glm)
##
## Call:
## glm(formula = Y ~ A + B + A * B, family = binomial, data = my_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.275191 -0.040838 0.003374 0.068165 0.229196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1972 1.9245 1.142 0.254
## A 0.3895 2.9705 0.131 0.896
## B -1.8881 2.2515 -0.839 0.402
## A:B -4.1747 4.6523 -0.897 0.370
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.86365 on 11 degrees of freedom
## Residual deviance: 0.17364 on 8 degrees of freedom
## AIC: 12.553
##
## Number of Fisher Scoring iterations: 6
While you state that Y is continuous, the data shows that Y is actually a fraction, which is probably why you tried to apply a GLM in the first place.
Modeling fractions (i.e. continuous values bounded by 0 and 1) can be done with logistic regression if certain assumptions are fulfilled. See the following Cross Validated post for details: https://stats.stackexchange.com/questions/26762/how-to-do-logistic-regression-in-r-when-outcome-is-fractional. However, from the data description it is not clear that those assumptions are fulfilled.
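For completeness, the fractional-logit approach discussed in that post amounts to swapping in the quasibinomial family (a sketch: the mean model is the same, but the "non-integer #successes" warning goes away and the standard errors are rescaled):
my_glm2 <- glm(Y ~ A * B, data = my_data, family = quasibinomial)
summary(my_glm2)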
Alternatives for modeling fractions are beta regression and fractional response models.
See below how to apply those methods to your data. The results of both methods are consistent in terms of signs and significance.
# Beta regression
install.packages("betareg")
library("betareg")
result.betareg <-betareg(Y~A+B+A*B,data=my_data)
summary(result.betareg)
# Call:
# betareg(formula = Y ~ A + B + A * B, data = my_data)
#
# Standardized weighted residuals 2:
# Min 1Q Median 3Q Max
# -2.7073 -0.4227 0.0682 0.5574 2.1586
#
# Coefficients (mean model with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.1666 0.2192 9.885 < 2e-16 ***
# A 0.6471 0.3541 1.828 0.0676 .
# B -1.8617 0.2583 -7.206 5.76e-13 ***
# A:B -4.2632 0.5156 -8.268 < 2e-16 ***
#
# Phi coefficients (precision model with identity link):
# Estimate Std. Error z value Pr(>|z|)
# (phi) 71.57 29.50 2.426 0.0153 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Type of estimator: ML (maximum likelihood)
# Log-likelihood: 24.56 on 5 Df
# Pseudo R-squared: 0.9626
# Number of iterations: 62 (BFGS) + 2 (Fisher scoring)
# ----------------------------------------------------------
# Fractional response model
install.packages("frm")
library("frm")
frm(Y,cbind(A, B, AB=A*B),linkfrac="logit")
# *** Fractional logit regression model ***
# Estimate Std. Error t value Pr(>|t|)
# INTERCEPT 2.197225 0.157135 13.983 0.000 ***
# A 0.389465 0.530684 0.734 0.463
# B -1.888120 0.159879 -11.810 0.000 ***
# AB -4.174668 0.555642 -7.513 0.000 ***
#
# Note: robust standard errors
#
# Number of observations: 12
# R-squared: 0.992
family=binomial implies logit (logistic) regression, which models a binary outcome.
From Quick-R
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome
from a set of continuous predictor variables. It is frequently
preferred over discriminant function analysis because of its less
restrictive assumptions.
The data shows an interaction. Try fitting a different model; logistic regression is not appropriate here.
with(my_data, interaction.plot(A, B, Y, fixed = TRUE, col = 2:3, type = "l"))
An analysis of variance shows clear significance for both factors and the interaction.
fit <- aov(Y~(A*B),data=my_data)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
A 1 0.2002 0.2002 130.6 3.11e-06 ***
B 1 1.1224 1.1224 732.0 3.75e-09 ***
A:B 1 0.2494 0.2494 162.7 1.35e-06 ***
Residuals 8 0.0123 0.0015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Levene's test using the squared residuals

How can Levene's test be performed using the squared residuals rather than the absolute?
I have tried levene.test in the lawstat package, and leveneTest in the car package, which both use the absolute residuals.
The purpose is to reproduce SAS output, which uses squared residuals by default.
iris.lm <- lm(Petal.Width ~ Species, data = iris)
anova(lm(residuals(iris.lm)^2 ~ iris$Species))
## Analysis of Variance Table
##
## Response: residuals(iris.lm)^2
## Df Sum Sq Mean Sq F value Pr(>F)
## iris$Species 2 0.100 0.0500 14.8 1.4e-06 ***
## Residuals 147 0.497 0.0034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Perhaps it helps to understand how this works.
As is pointed out here, Levene's test is just an ANOVA of the distance between each observation and the centre of its group. Different implementations of Levene's test differ by their definitions of "distance" and "centre".
“Distance” can mean the absolute differences, or the squared differences.
“Centre” can mean the mean or the median.
SAS uses squared differences by default, and the mean. leveneTest in the car package of R uses the absolute differences only, and the median by default. So does levene.test in the lawstat package.
All four possible combinations can be done by hand as follows.
require(plyr)
x <- ddply(iris, .(Species), summarize
, abs.mean = abs(Petal.Width - mean(Petal.Width))
, abs.median = abs(Petal.Width - median(Petal.Width))
, squared.mean = (Petal.Width - mean(Petal.Width))^2
, squared.median = (Petal.Width - median(Petal.Width))^2)
anova(lm(abs.mean ~ Species, data = x)) # Levene's test
## Analysis of Variance Table
##
## Response: abs.mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 0.53 0.2648 19.6 2.7e-08 ***
## Residuals 147 1.98 0.0135
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(abs.median ~ Species, data = x)) # Brown-Forsythe test
## Analysis of Variance Table
##
## Response: abs.median
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 0.642 0.321 19.9 2.3e-08 ***
## Residuals 147 2.373 0.016
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(squared.mean ~ Species, data = x)) # default SAS Levene's Test
## Analysis of Variance Table
##
## Response: squared.mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 0.100 0.0500 14.8 1.4e-06 ***
## Residuals 147 0.497 0.0034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(squared.median ~ Species, data = x)) # Who-Knows-Whose Test
## Analysis of Variance Table
##
## Response: squared.median
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 0.096 0.0478 13.6 3.7e-06 ***
## Residuals 147 0.515 0.0035
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
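If you would rather not depend on plyr, the same four distance columns can be built in base R with ave() (a sketch; the resulting ANOVAs match the ones above):
grp.mean   <- ave(iris$Petal.Width, iris$Species, FUN = mean)
grp.median <- ave(iris$Petal.Width, iris$Species, FUN = median)
x2 <- data.frame(Species        = iris$Species,
                 abs.mean       = abs(iris$Petal.Width - grp.mean),
                 abs.median     = abs(iris$Petal.Width - grp.median),
                 squared.mean   = (iris$Petal.Width - grp.mean)^2,
                 squared.median = (iris$Petal.Width - grp.median)^2)
anova(lm(abs.mean ~ Species, data = x2)) #Levene's test, as above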
To show that the first two above reproduce leveneTest:
require(car)
leveneTest(Petal.Width ~ Species, data = iris, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 2 19.6 2.7e-08 ***
## 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveneTest(Petal.Width ~ Species, data = iris, center = median)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 19.9 2.3e-08 ***
## 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since one often has a linear model to hand with the residuals (mean) ready to go, it's often more convenient to do
iris.lm <- lm(Petal.Width ~ Species, data = iris)
anova(lm(residuals(iris.lm)^2 ~ iris$Species))
## Analysis of Variance Table
##
## Response: residuals(iris.lm)^2
## Df Sum Sq Mean Sq F value Pr(>F)
## iris$Species 2 0.100 0.0500 14.8 1.4e-06 ***
## Residuals 147 0.497 0.0034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hence the answer at the top.

Conducting GAM-GEE in the gamm4 R package?

I am trying to analyze some visual transect data of organisms to generate a habitat distribution model. Once organisms are sighted, they are followed as point data is collected at a given time interval. Because of the autocorrelation among these "follows," I wish to use a GAM-GEE approach similar to that of Pirotta et al. 2011, using the packages 'yags' and 'splines' (http://www.int-res.com/abstracts/meps/v436/p257-272/). Their R scripts are shown here (http://www.int-res.com/articles/suppl/m436p257_supp/m436p257_supp1-code.r). I have used this code with limited success, with multiple issues of models failing to converge.
Below is the structure of my data:
> str(dat2)
'data.frame': 10792 obs. of 4 variables:
$ dist_slag : num 26475 26340 25886 25400 24934 ...
$ Depth : num -10.1 -10.5 -16.6 -22.2 -29.7 ...
$ dolphin_presence: int 0 0 0 0 0 0 0 0 0 0 ...
$ block : int 1 1 1 1 1 1 1 1 1 1 ...
> head(dat2)
dist_slag Depth dolphin_presence block
1 26475.47 -10.0934 0 1
2 26340.47 -10.4870 0 1
3 25886.33 -16.5752 0 1
4 25399.88 -22.2474 0 1
5 24934.29 -29.6797 0 1
6 24519.90 -26.2370 0 1
Here is the summary of my block variable (indicating the number of groups within which autocorrelation exists):
> summary(dat2$block)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 39.00 76.00 73.52 111.00 148.00
However, I would like to use the package 'gamm4', as I am more familiar with Professor Simon Wood's packages and functions, and it appears gamm4 might be the most appropriate. It is important to note that the models have a binary response (organism presence or absence along a transect), which is why I think gamm4 is more appropriate than gamm. In the gamm help, the following example is provided for autocorrelation within factors:
## more complicated autocorrelation example - AR errors
## only within groups defined by `fac'
e <- rnorm(n,0,sig)
for (i in 2:n) e[i] <- 0.6*e[i-1]*(fac[i-1]==fac[i]) + e[i]
y <- f + e
b <- gamm(y~s(x,k=20),correlation=corAR1(form=~1|fac))
Following this example, the following is the code I used for my dataset
b <- gamm4(dolphin_presence~s(dist_slag)+s(Depth),random=(form=~1|block), family=binomial(),data=dat)
However, by examining the output (summary(b$gam) and especially summary(b$mer)), I am either unsure how to interpret the results or do not believe that the autocorrelation within groups is being taken into consideration.
> summary(b$gam)
Family: binomial
Link function: logit
Formula:
dolphin_presence ~ s(dist_slag) + s(Depth)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.968 5.145 -2.715 0.00663 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(dist_slag) 4.943 4.943 70.67 6.85e-14 ***
s(Depth) 6.869 6.869 115.59 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.317   glmer.ML score = 10504   Scale est. = 1   n = 10792
>
> summary(b$mer)
Generalized linear mixed model fit by the Laplace approximation
AIC BIC logLik deviance
10514 10551 -5252 10504
Random effects:
Groups Name Variance Std.Dev.
Xr s(dist_slag) 1611344 1269.39
Xr.0 s(Depth) 98622 314.04
Number of obs: 10792, groups: Xr, 8; Xr.0, 8
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
X(Intercept) -13.968 5.145 -2.715 0.00663 **
Xs(dist_slag)Fx1 -35.871 33.944 -1.057 0.29063
Xs(Depth)Fx1 3.971 3.740 1.062 0.28823
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
X(Int) X(_)F1
Xs(dst_s)F1 0.654
Xs(Dpth)Fx1 -0.030 0.000
>
How do I ensure that autocorrelation is indeed being accounted for within each unique value of the "block" variable? What is the simplest way to interpret the output for "summary(b$mer)"?
The results do differ from a normal gam (package mgcv) using the same variables and parameters without the "correlation=..." term, indicating that something different is occurring.
However, when I use a different variable for the correlation term (season), I get the SAME output:
> dat2 <- data.frame(dist_slag = dat$dist_slag, Depth = dat$Depth, dolphin_presence = dat$dolphin_presence,
+ block = dat$block, season=dat$season)
> head(dat2)
dist_slag Depth dolphin_presence block season
1 26475.47 -10.0934 0 1 F
2 26340.47 -10.4870 0 1 F
3 25886.33 -16.5752 0 1 F
4 25399.88 -22.2474 0 1 F
5 24934.29 -29.6797 0 1 F
6 24519.90 -26.2370 0 1 F
> summary(dat2$season)
F S
3224 7568
> b <- gamm4(dolphin_presence~s(dist_slag)+s(Depth),correlation=corAR1(1, form=~1 | season), family=binomial(),data=dat2)
> summary(b$gam)
Family: binomial
Link function: logit
Formula:
dolphin_presence ~ s(dist_slag) + s(Depth)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.968 5.145 -2.715 0.00663 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(dist_slag) 4.943 4.943 70.67 6.85e-14 ***
s(Depth) 6.869 6.869 115.59 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.317   glmer.ML score = 10504   Scale est. = 1   n = 10792
> summary(b$mer)
Generalized linear mixed model fit by the Laplace approximation
AIC BIC logLik deviance
10514 10551 -5252 10504
Random effects:
Groups Name Variance Std.Dev.
Xr s(dist_slag) 1611344 1269.39
Xr.0 s(Depth) 98622 314.04
Number of obs: 10792, groups: Xr, 8; Xr.0, 8
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
X(Intercept) -13.968 5.145 -2.715 0.00663 **
Xs(dist_slag)Fx1 -35.871 33.944 -1.057 0.29063
Xs(Depth)Fx1 3.971 3.740 1.062 0.28823
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
X(Int) X(_)F1
Xs(dst_s)F1 0.654
Xs(Dpth)Fx1 -0.030 0.000
>
I just want to make sure it is correctly allowing for correlation within each value for the "block" variable. How do I formulate the model to say that autocorrelation can exist within each single value for block, but assume independence among blocks?
On another note, I am also receiving the following warning message after model completion for larger models (with many more variables than 2):
Warning message:
In mer_finalize(ans) : false convergence (8)
gamm4 is built on top of lme4, which does not allow for a correlation parameter (in contrast to the nlme package, which underlies mgcv::gamm). mgcv::gamm does handle binary data, although it uses PQL, which is generally less accurate than the Laplace/GHQ approximations used in gamm4/lme4. It is unfortunate (!!) that you're not getting a warning telling you that the correlation argument is being ignored (when I try a simple example using a correlation argument with lme4, I do get a warning, but it's possible that the extra argument is getting swallowed somewhere inside gamm4).
Your desired autocorrelation structure ("autocorrelation can exist within each single value for block, but assume independence among blocks") is exactly the way correlation structures are coded in nlme (and hence in mgcv::gamm).
I would use mgcv::gamm, and would suggest that if at all possible you try it out on some simulated data with known structure (or use the data set provided in the supplementary material above and see if you can reproduce their qualitative conclusions with your alternative approach).
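A sketch of what that call could look like for your data (an assumption about your setup, not a tested fit: corAR1(form = ~ 1 | block) imposes AR(1) errors within each block and independence across blocks, and the binary response is fitted via PQL):
library(mgcv)
b2 <- gamm(dolphin_presence ~ s(dist_slag) + s(Depth),
           correlation = corAR1(form = ~ 1 | block),
           family = binomial, data = dat2)
summary(b2$gam)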
Stack Overflow is nice, but there is probably more mixed model expertise at r-sig-mixed-models@r-project.org.
