Generalized least squares results interpretation in R

I checked my linear regression model (WMAN = species, WDNE = sea surface temperature) and found autocorrelation, so I am instead trying generalized least squares with the following script:
library(nlme)
modelwa <- gls(WMAN ~ WDNE, data = dat,
               correlation = corAR1(form = ~ MONTH),
               na.action = na.omit)
summary(modelwa)
I compared both models:
> library(MuMIn)
> model.sel(modelw,modelwa)
Model selection table
        (Intrc)   WDNE class na.action  correlation df   logLik   AICc delta weight
modelwa   31.50 0.1874   gls   na.omit crAR1(MONTH)  4 -610.461 1229.2  0.00      1
modelw    11.31 0.7974    lm   na.excl               3 -658.741 1323.7 94.44      0
Abbreviations:
na.action: na.excl = ‘na.exclude’
correlation: crAR1(MONTH) = ‘corAR1(~MONTH)’
Models ranked by AICc(x)
I believe the results suggest I should use gls, as the AICc is lower.
My problem is that I have been reporting the F-value, R², and p-value, but the gls output does not include these.
I would be very grateful if someone could assist me in interpreting these results.
> summary(modelwa)
Generalized least squares fit by REML
Model: WMAN ~ WDNE
Data: mp2017.dat
AIC BIC logLik
1228.923 1240.661 -610.4614
Correlation Structure: ARMA(1,0)
Formula: ~MONTH
Parameter estimate(s):
Phi1
0.4809973
Coefficients:
Value Std.Error t-value p-value
(Intercept) 31.496911 8.052339 3.911524 0.0001
WDNE 0.187419 0.091495 2.048401 0.0424
Correlation:
(Intr)
WDNE -0.339
Standardized residuals:
Min Q1 Med Q3 Max
-2.023362 -1.606329 -1.210127 1.427247 3.567186
Residual standard error: 18.85341
Degrees of freedom: 141 total; 139 residual
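On the F-value/R²/p-value question: for a gls fit, anova() reports F- and p-values for the fixed effects, and while gls has no exact R² analogue, a rough pseudo-R² can be computed from the fitted values. A minimal sketch, assuming the modelwa fit above:
## F-tests and p-values for the fixed effects of the gls fit
anova(modelwa)
## a rough pseudo-R^2: squared correlation between observed and fitted values
## (observed values recovered as fitted + residuals, so NA handling matches the fit)
cor(fitted(modelwa) + resid(modelwa), fitted(modelwa))^2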

I have now overcome the problem of autocorrelation, so I can use lm().
Add the lag-1 residuals as an X variable to the original model. This can be done with the slide function in the DataCombine package.
library(DataCombine)
data("economics", package = "ggplot2")    ## example data from ggplot2
lmMod <- lm(pce ~ pop, data = economics)  ## the original model implied by the snippet
econ_data <- data.frame(economics, resid_mod1 = lmMod$residuals)
econ_data_1 <- slide(econ_data, Var = "resid_mod1",
                     NewVar = "lag1", slideBy = -1)
econ_data_2 <- na.omit(econ_data_1)
lmMod2 <- lm(pce ~ pop + lag1, data = econ_data_2)
This script can be found here
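To confirm that the lag-residual trick removed the autocorrelation, one can inspect the residuals of the refitted model; a minimal sketch using base R's acf() on lmMod2 from above:
## the lag-1 spike in the ACF of the new model's residuals should now be gone
acf(residuals(lmMod2), lag.max = 10)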

Related

Equivalence of a mixed model fitted by lme and lmer

I have fitted a mixed-effects model using both of the functions widely used for this in R: the lme function from the nlme package and the lmer function from the lme4 package.
To refit the lme model in lme4 with the same reparameterization, I used the information from this topic (it is only possible to do this in lme4 in a hacky way): Heteroscedastic model of mixed effects via lmer function
I apologize for hosting the data at an external link; however, I couldn't find a built-in R dataset with variables that match my problem.
Data: https://drive.google.com/file/d/1jKFhs4MGaVxh-OPErvLDfMNmQBouywoM/view?usp=sharing
The fitted models were:
library(nlme)
library(lme4)
ModLME <- lme(Var1 ~ I(Var2) + I(Var2^2),
              random = ~ 1 | Var3,
              weights = varIdent(form = ~ 1 | Var4),
              data = Dataone, method = "REML")
ModLMER <- lmer(Var1 ~ I(Var2) + I(Var2^2) + (1 | Var3) + (0 + dummy(Var4, "1") | Var5),
                data = Dataone, REML = TRUE,
                control = lmerControl(check.nobs.vs.nlev = "ignore",
                                      check.nobs.vs.nRE = "ignore"))
These models are equivalent; see:
all.equal(REMLcrit(ModLMER), c(-2*logLik(ModLME)))
[1] TRUE
all.equal(fixef(ModLME), fixef(ModLMER), tolerance=1e-7)
[1] TRUE
> summary(ModLME)
Linear mixed-effects model fit by REML
Data: Dataone
AIC BIC logLik
-209.1431 -193.6948 110.5715
Random effects:
Formula: ~1 | Var3
(Intercept) Residual
StdDev: 0.05789852 0.03636468
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | Var4
Parameter estimates:
0 1
1.000000 5.641709
Fixed effects: Var1 ~ I(Var2) + I(Var2^2)
Value Std.Error DF t-value p-value
(Intercept) 0.9538547 0.01699642 97 56.12093 0
I(Var2) -0.5009804 0.09336479 97 -5.36584 0
I(Var2^2) -0.4280151 0.10038257 97 -4.26384 0
summary(ModLMER)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [lmerModLmerTest]
Formula: Var1 ~ I(Var2) + I(Var2^2) + (1 | Var3) + (0 + dummy(Var4, "1") |
Var5)
Data: Dataone
Control: lmerControl(check.nobs.vs.nlev = "ignore", check.nobs.vs.nRE = "ignore")
REML criterion at convergence: -221.1
Scaled residuals:
Min 1Q Median 3Q Max
-4.1151 -0.5891 0.0374 0.5229 2.1880
Random effects:
Groups Name Variance Std.Dev.
Var3 (Intercept) 6.466e-12 2.543e-06
Var5 dummy(Var4, "1") 4.077e-02 2.019e-01
Residual 4.675e-03 6.837e-02
Number of obs: 100, groups: Var3, 100; Var5, 100
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.95385 0.01700 95.02863 56.121 < 2e-16 ***
I(Var2) -0.50098 0.09336 92.94048 -5.366 5.88e-07 ***
I(Var2^2) -0.42802 0.10038 91.64017 -4.264 4.88e-05 ***
However, the residuals of these models are not similar. Note that in the model fitted by lmer, a residual pattern mysteriously appears: a set of points lying close to a straight line. How could I solve this problem so that the residuals are identical? I believe the problem is in the lme4 model.
aa=plot(ModLME, main="LME")
bb=plot(ModLMER, main="LMER")
gridExtra::grid.arrange(aa,bb,ncol=2)
I can tell you what's going on and what should in principle fix it, but at the moment the fix doesn't work ...
The residuals being plotted take all of the random effects into account, which in the case of the lmer fit includes the individual-level random effects (the (0+dummy(Var4,"1")|Var5) term), which leads to weird residuals for the Var4==1 group. To illustrate this:
plot(ModLMER, col = Dataone$Var4+1)
i.e., you can see that the weird residuals are exactly the ones in red == those for which Var4==1.
In theory we should be able to get the same residuals via:
res <- Dataone$Var1 - predict(ModLMER, re.form = ~(1|Var3))
i.e., ignore the group-specific observation-level random effect term. However, it looks like there is a bug at the moment ("contrasts can be applied only to factors with 2 or more levels").
An extremely hacky solution is to construct the random-effect predictions without the observation-level term yourself:
## fixed-effect predictions
p0 <- predict(ModLMER, re.form = NA)
## construct RE prediction, Var3 term only:
Z <- getME(ModLMER, "Z")
b <- drop(getME(ModLMER, "b"))
## zero out observation-level components
b[101:200] <- 0
## add RE predictions to fixed predictions
p1 <- drop(p0 + Z %*% b)
## plot fitted vs residual
plot(p1, Dataone$Var1 - p1)
For what it's worth, this also works:
library(glmmTMB)
ModGLMMTMB <- glmmTMB(Var1 ~ I(Var2) + I(Var2^2) + (1 | Var3),
                      dispformula = ~ factor(Var4),
                      REML = TRUE,
                      data = Dataone)
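If the glmmTMB fit really is equivalent (dispformula = ~ factor(Var4) plays the role of varIdent here), its log-likelihood should match the lme fit, up to each package's conventions; a quick check, assuming the fits above:
## the two REML log-likelihoods should agree if the fits are equivalent
logLik(ModGLMMTMB)
logLik(ModLME)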

How do I find the p-value for my random effect in my linear mixed effect model?

I am running the following line of code in R:
model = lme(divedepth ~ oarea, random=~1|deployid, data=GDataTimes, method="REML")
summary(model)
and I am seeing this result:
Linear mixed-effects model fit by REML
Data: GDataTimes
AIC BIC logLik
2512718 2512791 -1256352
Random effects:
Formula: ~1 | deployid
(Intercept) Residual
StdDev: 9.426598 63.50004
Fixed effects: divedepth ~ oarea
Value Std.Error DF t-value p-value
(Intercept) 25.549003 3.171766 225541 8.055135 0.0000
oarea2 12.619669 0.828729 225541 15.227734 0.0000
oarea3 1.095290 0.979873 225541 1.117787 0.2637
oarea4 0.852045 0.492100 225541 1.731447 0.0834
oarea5 2.441955 0.587300 225541 4.157933 0.0000
[snip]
Number of Observations: 225554
Number of Groups: 9
However, I cannot find the p-value for the random variable: deployID. How can I see this value?
As stated in the comments, there is stuff about significance tests of random effects in the GLMM FAQ. You should definitely consider:
why you are really interested in the p-value (it's not never of interest, but it's an unusual case)
the fact that the likelihood ratio test is extremely conservative for testing variance parameters (in this case it gives a p-value that's 2x too large)
Here's an example that shows that the lme() fit and the corresponding lm() model without the random effect have commensurate log-likelihoods (i.e., they're computed in a comparable way) and can be compared with anova():
Load packages and simulate data (with zero random effect variance)
library(lme4)
library(nlme)
set.seed(101)
dd <- data.frame(x = rnorm(120), f = factor(rep(1:3, 40)))
dd$y <- simulate(~ x + (1 | f),
                 newdata = dd,
                 newparams = list(beta = rep(1, 2),
                                  theta = 0,
                                  sigma = 1))[[1]]
Fit models (note that you cannot compare a model fitted with REML to a model without random effects).
m1 <- lme(y ~ x, random = ~ 1 | f, data = dd, method = "ML")
m0 <- lm(y ~ x, data = dd)
Test:
anova(m1, m0)
## Model df AIC BIC logLik Test L.Ratio p-value
## m1 1 4 328.4261 339.5761 -160.2131
## m0 2 3 326.4261 334.7886 -160.2131 1 vs 2 6.622332e-08 0.9998
Here the test correctly identifies that the two models are identical and gives a p-value of 1.
If you use lme4::lmer instead of lme you have some other, more accurate (but slower) options: simulation-based tests via the RLRsim package or pbkrtest::PBmodcomp. See the GLMM FAQ.
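As a minimal sketch of the RLRsim route, assuming the simulated data dd from above (exactRLRT simulates the exact null distribution of the restricted LRT for a single variance component, so the model must be fitted with REML):
library(RLRsim)
## refit with lmer under REML; exactRLRT tests the (1 | f) variance component
m1_lmer <- lmer(y ~ x + (1 | f), data = dd, REML = TRUE)
exactRLRT(m1_lmer)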

How do I grab the AR1 estimate and its SE from the gls function in R?

I am attempting to get the lag one autocorrelation estimates from the gls function (package {nlme}) with its SE. This is being done on a non-stationary univariate time series. Here is the output:
Generalized least squares fit by REML
Model: y ~ year
Data: tempdata
AIC BIC logLik
51.28921 54.37957 -21.64461
Correlation Structure: AR(1)
Formula: ~1
Parameter estimate(s):
Phi
0.9699799
Coefficients:
Value Std.Error t-value p-value
(Intercept) -1.1952639 3.318268 -0.3602072 0.7234
year -0.2055264 0.183759 -1.1184567 0.2799
Correlation:
(Intr)
year -0.36
Standardized residuals:
Min Q1 Med Q3 Max
-0.12504485 -0.06476076 0.13948378 0.51581993 0.66030397
Residual standard error: 3.473776
Degrees of freedom: 18 total; 16 residual
The phi coefficient seemed promising since it was under the correlation structure in the output
Correlation Structure: AR(1)
Formula: ~1
Parameter estimate(s):
Phi
0.9699799
but it regularly goes over one, which is not possible for correlation. Then there is the
Correlation:
(Intr)
Yearctr -0.36
but I was advised that this was likely not a correct estimate for the data (there were multiple test sites so this is just one of the unexpected estimates). Is there a function that outputs an AR1 estimate and its SE (other than arima)?
sample of autocorrelated data:
library(nlme)
set.seed(29)
y <- diffinv(rnorm(500))  ## integrated white noise, hence non-stationary
x <- 1:length(y)
gls(y ~ x, correlation = corAR1(form = ~ 1))
Note: I am comparing the function arima() to gls() (or another method) to compare AR1 estimates and SE's. I am doing this under adviser request.
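One possible route (a sketch, not a definitive recipe): the Phi estimate can be pulled out of the fitted corStruct, and intervals() reports a 95% confidence interval for Phi rather than an SE; an approximate SE can be backed out of the interval width if one is required. Using the simulated data above:
fit <- gls(y ~ x, correlation = corAR1(form = ~ 1))
## Phi point estimate on the correlation scale
coef(fit$modelStruct$corStruct, unconstrained = FALSE)
## approximate 95% CI for Phi (and the residual standard error)
intervals(fit, which = "var-cov")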

Interpreting the output of summary(glmer(...)) in R

I'm an R noob, I hope you can help me:
I'm trying to analyse a dataset in R, but I'm not sure how to interpret the output of summary(glmer(...)) and the documentation isn't a big help:
> data_chosen_stim<-glmer(open_chosen_stim~closed_chosen_stim+day+(1|ID),family=binomial,data=chosenMovement)
> summary(data_chosen_stim)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: open_chosen_stim ~ closed_chosen_stim + day + (1 | ID)
Data: chosenMovement
AIC BIC logLik deviance df.resid
96.7 105.5 -44.4 88.7 62
Scaled residuals:
Min 1Q Median 3Q Max
-1.4062 -1.0749 0.7111 0.8787 1.0223
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 0 0
Number of obs: 66, groups: ID, 35
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.4511 0.8715 0.518 0.605
closed_chosen_stim2 0.4783 0.5047 0.948 0.343
day -0.2476 0.5060 -0.489 0.625
Correlation of Fixed Effects:
(Intr) cls__2
clsd_chsn_2 -0.347
day -0.916 0.077
I understand the GLM behind it, but I can't see the weights of the independent variables and their error bounds.
update: weights.merMod already has a type argument ...
I think what you're looking for is weights(object, type = "working").
I believe these are the diagonal elements of W in your notation?
Here's a trivial example that matches up the results of glm and glmer (since the random effect is bogus and gets an estimated variance of zero, the fixed effects, weights, etc. converge to the same values).
Note that the weights() accessor returns the prior weights by default (these are all equal to 1 for the example below).
Example (from ?glm):
d.AD <- data.frame(treatment = gl(3, 3),
                   outcome = gl(3, 1, 9),
                   counts = c(18, 17, 15, 20, 10, 20, 25, 13, 12))
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson(),
               data = d.AD)
library(lme4)
d.AD$f <- 1  ## dummy grouping variable
glmer.D93 <- glmer(counts ~ outcome + treatment + (1 | f),
                   family = poisson(),
                   data = d.AD,
                   control = glmerControl(check.nlev.gtr.1 = "ignore"))
Fixed effects and weights are the same:
all.equal(fixef(glmer.D93), coef(glm.D93))  ## TRUE
all.equal(unname(weights(glm.D93, type = "working")),
          weights(glmer.D93, type = "working"),
          tol = 1e-7)  ## TRUE
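One way to see the diagonal-of-W point concretely: for a log-link Poisson fit, the IRLS working weights equal the fitted means, since W = (dmu/deta)^2 / V(mu) = mu^2 / mu = mu. A quick check with the glm fit above:
## for a log-link Poisson model the working weights equal the fitted means
all.equal(unname(weights(glm.D93, type = "working")),
          unname(fitted(glm.D93)))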

The best model, according to both AIC and BIC, contains only a non significant term

I'd like to know how it is possible that the best model (m6), based on AIC and BIC values, can have a non-significant term, whereas the second-best model (m5) has a significant term.
I have the following list of competing models:
library(nlme)
library(ape)  ## for corMartins
m1 <- gls(Area_Km2 ~ size + cent_Latitude + PCptail + pwingbilltar,
          correlation = corMartins(1, phy = ctree), data = c)
m2 <- gls(Area_Km2 ~ size + cent_Latitude + PCptail,
          correlation = corMartins(1, phy = ctree), data = c)
m3 <- gls(Area_Km2 ~ size + cent_Latitude, correlation = corMartins(1, phy = ctree), data = c)
m4 <- gls(Area_Km2 ~ size, correlation = corMartins(1, phy = ctree), data = c)
m5 <- gls(Area_Km2 ~ PCptail, correlation = corMartins(1, phy = ctree), data = c)
m6 <- gls(Area_Km2 ~ cent_Latitude, correlation = corMartins(1, phy = ctree), data = c)
m7 <- gls(Area_Km2 ~ pwingbilltar, correlation = corMartins(1, phy = ctree), data = c)
Here is the model comparison:
   Model df      AIC      BIC    logLik   Test  L.Ratio p-value
m1     1  7 147.2775 157.9620 -66.63873
m2     2  6 139.4866 148.8187 -63.74331 1 vs 2 5.790838  0.0161
m3     3  5 133.3334 141.2510 -61.66672 2 vs 3 4.153191  0.0416
m4     4  4 130.7749 137.2186 -61.38746 3 vs 4 0.558517  0.4549
m5     5  4 127.0635 133.5072 -59.53175
m6     6  4 125.1006 131.5443 -58.55029
m7     7  4 132.4542 138.8979 -62.22711
Here is m6:
Generalized least squares fit by REML
Model: Area_Km2 ~ cent_Latitude
Data: c
AIC BIC logLik
125.1006 131.5442 -58.55029
Correlation Structure: corMartins
Formula: ~1
Parameter estimate(s):
alpha
1
Coefficients:
Value Std.Error t-value p-value
(Intercept) 0.4940905 0.1730082 2.8558795 0.0070
cent_Latitude -0.1592109 0.1726268 -0.9222837 0.3624
Correlation:
(Intr)
cent_Latitude -0.158
Standardized residuals:
Min Q1 Med Q3 Max
-1.3270048 -0.7088524 -0.2828898 0.4672255 2.2203523
Residual standard error: 1.066911
Degrees of freedom: 39 total; 37 residual
Here is m5:
Generalized least squares fit by REML
Model: Area_Km2 ~ PCptail
Data: c
AIC BIC logLik
127.0635 133.5072 -59.53175
Correlation Structure: corMartins
Formula: ~1
Parameter estimate(s):
alpha
1
Coefficients:
Value Std.Error t-value p-value
(Intercept) 0.19752329 0.20158500 0.9798512 0.3335
PCptail 0.01925621 0.00851536 2.2613499 0.0297
Correlation:
(Intr)
PCptail -0.595
Standardized residuals:
Min Q1 Med Q3 Max
-1.3416127 -0.6677304 -0.2467510 0.3198370 2.3339127
Residual standard error: 1.01147
Degrees of freedom: 39 total; 37 residual
There are at least two things going on here. First, it is not meaningful to assert that the model with the lowest AIC is the "best" model. For a set of models, the relative likelihood of model i compared with the minimum-AIC model is given by (see here, and references cited therein):
L = exp[ (AICmin - AICi) / 2 ]
So, comparing m5 and m6:
L = exp[ (125.1006 - 127.0635) / 2 ] = 0.374
That is, there is a 37% relative likelihood that m5 is in fact the better model. The point is that there is no meaningful difference between an AIC of 125.1 and an AIC of 127.1, so you cannot say m6 is "best". Both models do about equally well.
So why is cent_Latitude insignificant in m6? A p-value > 0.05 means that the estimated effect of cent_Latitude on the response is small compared to the error in the response. This could happen because the true effect size is 0, or because the effect size, combined with the range of cent_Latitude, produces an effect on the response that is small compared to the error. You can see this in the following, which uses made-up data to reproduce the effect you are seeing with your real data.
Suppose the response variable actually depends on both cent_Latitude and PCptail. In m6, variability in the response due to PCptail is counted toward the "error" in the model, reducing the calculated significance of cent_Latitude. Conversely, in m5 the variability due to cent_Latitude is counted toward the error, reducing the significance of PCptail. If the effect sizes are just right relative to the true error, you get exactly this pattern, as shown below. Note that this is one of the reasons it is not recommended to compare non-nested models using a single statistic (like AIC, or RSQ, or even F).
library(nlme)
set.seed(1)
# for simplicity, use uncorrelated predictors
# (sample() with no size argument permutes the full sequence of 101 values)
c <- data.frame(PCptail = sample(seq(0, 10, .1)),
                cent_Latitude = sample(seq(0, 1, .01)))
# response depends on both predictors
c$Area <- 1 + .01 * c$PCptail + .1 * c$cent_Latitude + rnorm(nrow(c), 0, 1)
m6 <- gls(Area ~ cent_Latitude, data = c)
m5 <- gls(Area ~ PCptail, data = c)
summary(m6)
# Generalized least squares fit by REML
# Model: Area ~ cent_Latitude
# Data: c
# AIC BIC logLik
# 288.5311 296.3165 -141.2656
#
# Coefficients:
# Value Std.Error t-value p-value
# (Intercept) 1.1835936 0.1924341 6.150645 0.0000
# cent_Latitude -0.1882202 0.3324754 -0.566118 0.5726
# ...
summary(m5)
# Generalized least squares fit by REML
# Model: Area ~ PCptail
# Data: c
# AIC BIC logLik
# 289.2713 297.0566 -141.6356
#
# Coefficients:
# Value Std.Error t-value p-value
# (Intercept) 0.7524261 0.18871413 3.987121 0.0001
# PCptail 0.0674115 0.03260484 2.067530 0.0413
# ...
So how do you deal with this? Well, have you looked at the residual plots for all these models? The Q-Q plots? Leverage plots? Generally, all other things being equal, I would pick the model with the more significant parameters, assuming the residuals are random and normally distributed and none of the data points has unusually high leverage.
You are fitting the models using method = "REML", i.e. the restricted likelihood. REML likelihoods are not comparable across models with different fixed-effect structures, so it does not follow that REML-based AIC/BIC will rank these models the same way ML would. Set method = "ML" and see if that fixes the "problem" with AIC/BIC.
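A minimal sketch of that suggestion, assuming the m5 and m6 fits from above:
## refit with ML so the likelihoods are comparable across fixed-effect structures
m5ML <- update(m5, method = "ML")
m6ML <- update(m6, method = "ML")
AIC(m5ML, m6ML)
BIC(m5ML, m6ML)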
