I've been analyzing data for a paper and have now obtained results from multiple linear regression. However, the summaries R provides are not really fit for publication in the final paper. Also, I have specified one variable in several different ways to showcase the robustness of the results.
How can I create a nice, exportable table in R that contains variable names (ideally with the option to rename the variables more informatively), estimates, standard errors, robust standard errors, p-values and ideally also the significance indicators? For illustration:
I have summary outputs like this:
Residuals:
Min 1Q Median 3Q Max
-50.868 -4.644 1.583 7.054 20.490
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.710e+01 1.848e+01 2.549 0.0136 *
Var1 -8.588e-01 2.201e+00 -0.390 0.6979
Var2 2.486e+00 1.055e+00 2.357 0.0220 *
log(specification1) 3.376e+00 2.152e+00 1.569 0.1223
Var4 -3.651e-04 2.797e-04 -1.305 0.1971
Var5 4.809e+00 2.654e+00 1.812 0.0753 .
Var6 -8.706e+00 6.972e+00 -1.249 0.2170
Var7 -8.172e+00 5.755e+00 -1.420 0.1612
Var8 -3.276e+00 7.067e+00 -0.463 0.6448
Var9 -1.477e+01 7.849e+00 -1.882 0.0650 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
and
Residuals:
Min 1Q Median 3Q Max
-48.881 -5.699 0.956 8.947 17.888
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.258e+01 1.750e+01 2.405 0.0195 *
Var1 4.298e-01 2.120e+00 0.200 0.8421
Var2 5.179e+00 1.027e+00 2.122 0.0271 *
log(specification 2) 2.050e+00 9.435e-01 2.173 0.0338 *
Var4 -1.420e-04 2.261e-04 -1.513 0.1356
Var5 4.584e+00 2.511e+00 1.826 0.0730 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
and I would like to get to a table looking something like this:
Model1 Model2
Intercept Estimate Std.Error p robust_Std.Error robust_p Estimate Std.Error p robust ...
Var1
Var2
Var3
Var4
Var5
Var6
Var7
Var8
Var9
where the columns of course contain the values of the estimates. Is there a function or package that does this nicely?
Thanks in advance
I suggest using the broom package, like this:
fit1 <- lm(mpg ~ ., mtcars)
broom::tidy(fit1)
# # A tibble: 11 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 12.3 18.7 0.657 0.518
# 2 cyl -0.111 1.05 -0.107 0.916
# 3 disp 0.0133 0.0179 0.747 0.463
# 4 hp -0.0215 0.0218 -0.987 0.335
# 5 drat 0.787 1.64 0.481 0.635
# 6 wt -3.72 1.89 -1.96 0.0633
# 7 qsec 0.821 0.731 1.12 0.274
# 8 vs 0.318 2.10 0.151 0.881
# 9 am 2.52 2.06 1.23 0.234
# 10 gear 0.655 1.49 0.439 0.665
# 11 carb -0.199 0.829 -0.241 0.812
This extracts a tibble from the output of the lm function.
If you have more than one model and want to combine all the tibbles on their common terms, you can do it this way:
Create a list x of your models.
fit1 <- lm(mpg ~ cyl + disp + gear, mtcars)
fit2 <- lm(mpg ~ cyl + hp + drat, mtcars)
x <- list(fit1, fit2)
You can use this solution:
library(purrr)
library(dplyr)
library(stringr)
# set names for the list
names(x) <- paste0("Model", seq_along(x))
# tidy them up
x <- map(x, broom::tidy)
# set the list names at the beginning of each column
x <- imap(x, ~set_names(.x, paste(.y, names(.x), sep = "_")))
# rename each term column as "term"
x <- map(x, ~rename_with(.x, str_replace, pattern = ".*term", replacement = "term"))
# join them all together
reduce(x, full_join, by = "term")
It returns the output you asked for:
# A tibble: 6 x 9
term Model1_estimate Model1_std.error Model1_statistic Model1_p.value Model2_estimate Model2_std.error Model2_statistic Model2_p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 34.0 4.76 7.13 0.0000000925 22.5 7.99 2.82 0.00880
2 cyl -1.59 0.724 -2.20 0.0366 -1.36 0.735 -1.85 0.0747
3 disp -0.0200 0.0109 -1.83 0.0774 NA NA NA NA
4 gear 0.158 0.910 0.174 0.863 NA NA NA NA
5 hp NA NA NA NA -0.0288 0.0153 -1.88 0.0704
6 drat NA NA NA NA 2.84 1.52 1.87 0.0725
The same code works unchanged if your list has more than two models.
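The question also asks for robust standard errors, renamed variables and significance stars, which the tidy() output above does not include. Here is a minimal base-R sketch of one way to add them. It assumes HC1 (heteroskedasticity-consistent) standard errors are what is wanted; the mtcars model and the display labels are stand-ins, not the asker's variables.

```r
# Sketch: classical and HC1 robust standard errors plus significance stars,
# computed with base R only (mtcars stands in for the real data).
fit <- lm(mpg ~ cyl + disp + gear, data = mtcars)

X <- model.matrix(fit)
e <- residuals(fit)
n <- nrow(X)
k <- ncol(X)

# Sandwich estimator (X'X)^-1 X' diag(e^2) X (X'X)^-1, scaled by n / (n - k) for HC1
bread <- solve(crossprod(X))
meat <- crossprod(X * e)          # X * e multiplies row i of X by residual i
vcov_hc1 <- bread %*% meat %*% bread * n / (n - k)

est <- coef(fit)
se <- sqrt(diag(vcov(fit)))
robust_se <- sqrt(diag(vcov_hc1))
p <- 2 * pt(abs(est / se), df = n - k, lower.tail = FALSE)
robust_p <- 2 * pt(abs(est / robust_se), df = n - k, lower.tail = FALSE)

# Same cut-offs as the "Signif. codes" legend printed by summary()
stars <- function(p) {
  as.character(cut(p, breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                   labels = c("***", "**", "*", ".", ""),
                   include.lowest = TRUE))
}

# Hypothetical display names for the table
labels <- c("(Intercept)" = "Intercept", cyl = "Cylinders",
            disp = "Displacement", gear = "Gears")

tab <- data.frame(
  term = unname(labels[names(est)]),
  estimate = unname(est),
  std.error = unname(se),
  p = unname(p),
  robust_std.error = unname(robust_se),
  robust_p = unname(robust_p),
  signif = stars(robust_p)
)

# Export; the file location is only an example
write.csv(tab, file.path(tempdir(), "regression_table.csv"), row.names = FALSE)
tab
```

If you would rather not compute the estimator by hand, the sandwich and lmtest packages provide the same kind of robust covariance and coefficient tests. Columns built this way can be prefixed per model and joined on term exactly as in the code above.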
I have fitted a linear mixed model for a split-plot design to assess the effects of water (W), nitrogen (N) and phosphorus (P) on BWC (biomass-weighted 2C-value, obtained by summing the product of each species' 2C-value (DNA content) and its biomass fraction (species subplot biomass / total subplot biomass)):
model1.1<-lmer(log(BWC)~W*N*P+(1|year)+(1|W:Block),data=BWC)
There are two levels (0, 1) for each of W, N and P. I would like to report my results with a boxplot alongside the output of the linear mixed model. However, I'm confused by the model output.
The estimate (slope) for the W:N:P term in the model is negative. Does that mean the WNP treatment decreases BWC compared with the control plot? Yet the boxplot shows that BWC was highest under the WNP treatment.
There is also a discrepancy between summary() and anova(), for example in the significance of the N and P effects. The estimate for N is -4.0911, which would mean that N addition decreased BWC, but the N effect is not significant in anova(). How should I report treatment effects like N?
Many thanks for any comments.
Boxplot of WNP treatment on BWC:
https://i.stack.imgur.com/cKOFt.png
(Sorry for the link; it seems I need at least 10 reputation to post images.)
The summary() and anova() output:
> summary(model1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: BWC ~ W * N * P + (1 | year) + (1 | W:Block)
Data: BWC
REML criterion at convergence: 2969.1
Scaled residuals:
Min 1Q Median 3Q Max
-2.93847 -0.71228 -0.07573 0.68191 2.92589
Random effects:
Groups Name Variance Std.Dev.
W:Block (Intercept) 0.9169 0.9575
year (Intercept) 0.8346 0.9136
Residual 18.2966 4.2774
Number of obs: 515, groups: W:Block, 14; year, 10
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 10.8498 0.6985 46.5200 15.532 < 2e-16 ***
W1 2.0844 0.8969 45.9613 2.324 0.02460 *
N1 -4.0911 0.7364 486.0288 -5.556 4.56e-08 ***
P1 -2.0460 0.7600 490.1120 -2.692 0.00734 **
W1:N1 4.6738 1.0394 485.9800 4.497 8.65e-06 ***
W1:P1 0.9695 1.0687 485.9809 0.907 0.36478
N1:P1 5.7550 1.0687 485.9773 5.385 1.13e-07 ***
W1:N1:P1 -3.3306 1.5100 485.9541 -2.206 0.02788 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) W1 N1 P1 W1:N1 W1:P1 N1:P1
W1 -0.645
N1 -0.531 0.414
P1 -0.515 0.401 0.488
W1:N1 0.376 -0.582 -0.708 -0.346
W1:P1 0.366 -0.566 -0.347 -0.706 0.488
N1:P1 0.366 -0.285 -0.689 -0.706 0.488 0.502
W1:N1:P1 -0.259 0.400 0.488 0.499 -0.688 -0.708 -0.708
> anova(model1)
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
W 750.15 750.15 1 11.90 40.9995 3.519e-05 ***
N 10.84 10.84 1 485.95 0.5926 0.44177
P 29.14 29.14 1 494.92 1.5926 0.20755
W:N 290.51 290.51 1 485.95 15.8778 7.793e-05 ***
W:P 15.54 15.54 1 485.96 0.8493 0.35721
N:P 536.85 536.85 1 485.95 29.3415 9.562e-08 ***
W:N:P 89.01 89.01 1 485.95 4.8648 0.02788 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> emmeans::emmeans(model1,pairwise~N*P*W)
$emmeans
N P W emmean SE df lower.CL upper.CL
0 0 0 10.85 0.699 46.9 9.44 12.26
1 0 0 6.76 0.696 46.2 5.36 8.16
0 1 0 8.80 0.721 52.1 7.36 10.25
1 1 0 10.47 0.721 52.1 9.02 11.91
0 0 1 12.93 0.696 46.2 11.53 14.33
1 0 1 13.52 0.696 46.2 12.12 14.92
0 1 1 11.86 0.721 52.1 10.41 13.30
1 1 1 14.86 0.721 52.1 13.42 16.31
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
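One way to read these tables, assuming the default treatment (dummy) coding that the W1/N1/P1 labels suggest: each coefficient in summary() is an effect at the reference level of the other factors, and a cell mean is the sum of every coefficient that applies to that cell. The sketch below only redoes that arithmetic with the fixed-effect estimates printed above; `cell_mean` is a helper made up for illustration.

```r
# Fixed-effect estimates copied from the summary() output above
b <- c("(Intercept)" = 10.8498, W1 = 2.0844, N1 = -4.0911, P1 = -2.0460,
       "W1:N1" = 4.6738, "W1:P1" = 0.9695, "N1:P1" = 5.7550,
       "W1:N1:P1" = -3.3306)

# Hypothetical helper: predicted mean for one treatment combination (0/1 inputs)
cell_mean <- function(W, N, P) {
  x <- c(1, W, N, P, W * N, W * P, N * P, W * N * P)
  sum(b * x)
}

control <- cell_mean(0, 0, 0)   # intercept only
n_only  <- cell_mean(0, 1, 0)   # intercept + N1
wnp     <- cell_mean(1, 1, 1)   # all eight coefficients added up

# N effect averaged over the levels of W and P
grid <- expand.grid(W = 0:1, N = 0:1, P = 0:1)
grid$mean <- mapply(cell_mean, grid$W, grid$N, grid$P)
avg_N <- mean(grid$mean[grid$N == 1]) - mean(grid$mean[grid$N == 0])

round(c(control = control, n_only = n_only, wnp = wnp, avg_N = avg_N), 2)
```

The cell means reproduce the emmeans column (10.85 for the control, 6.76 for N alone, 14.86 for WNP), so the negative W1:N1:P1 estimate is only one of eight pieces of the WNP mean and does not by itself say that WNP lowers BWC. Under the same reading, N1 = -4.09 is the N effect when W and P are both absent, whereas anova() would be testing N averaged over W and P, where it is close to zero (about 0.29). That would explain why the two p-values for N differ.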
I am trying to predict and plot models with species presence as the response. However, I've run into the following problem: the ggpredict outputs differ wildly between glmer and glmmTMB fits of the same data, even though the coefficient estimates and AIC are very similar. These are simplified models that include only date (centered and scaled), which seems to be the most problematic term to predict.
yntest<- glmer(MYOSOD.P~ jdate.z + I(jdate.z^2) + I(jdate.z^3) +
(1|area/SiteID), family = binomial, data = sodpYN)
> summary(yntest)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: MYOSOD.P ~ jdate.z + I(jdate.z^2) + I(jdate.z^3) + (1 | area/SiteID)
Data: sodpYN
AIC BIC logLik deviance df.resid
1260.8 1295.1 -624.4 1248.8 2246
Scaled residuals:
Min 1Q Median 3Q Max
-2.0997 -0.3218 -0.2013 -0.1238 9.4445
Random effects:
Groups Name Variance Std.Dev.
SiteID:area (Intercept) 1.6452 1.2827
area (Intercept) 0.6242 0.7901
Number of obs: 2252, groups: SiteID:area, 27; area, 9
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.96778 0.39190 -7.573 3.65e-14 ***
jdate.z -0.72258 0.17915 -4.033 5.50e-05 ***
I(jdate.z^2) 0.10091 0.08068 1.251 0.21102
I(jdate.z^3) 0.25025 0.08506 2.942 0.00326 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) jdat.z I(.^2)
jdate.z 0.078
I(jdat.z^2) -0.222 -0.154
I(jdat.z^3) -0.071 -0.910 0.199
The glmmTMB model + summary:
Tyntest<- glmmTMB(MYOSOD.P ~ jdate.z + I(jdate.z^2) + I(jdate.z^3) +
(1|area/SiteID), family = binomial("logit"), data = sodpYN)
> summary(Tyntest)
Family: binomial ( logit )
Formula: MYOSOD.P ~ jdate.z + I(jdate.z^2) + I(jdate.z^3) + (1 | area/SiteID)
Data: sodpYN
AIC BIC logLik deviance df.resid
1260.8 1295.1 -624.4 1248.8 2246
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
SiteID:area (Intercept) 1.6490 1.2841
area (Intercept) 0.6253 0.7908
Number of obs: 2252, groups: SiteID:area, 27; area, 9
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.96965 0.39638 -7.492 6.78e-14 ***
jdate.z -0.72285 0.18250 -3.961 7.47e-05 ***
I(jdate.z^2) 0.10096 0.08221 1.228 0.21941
I(jdate.z^3) 0.25034 0.08662 2.890 0.00385 **
---
The ggpredict outputs:
testg<-ggpredict(yntest, terms ="jdate.z[all]")
> testg
# Predicted probabilities of MYOSOD.P
# x = jdate.z
x predicted std.error conf.low conf.high
-1.95 0.046 0.532 0.017 0.120
-1.51 0.075 0.405 0.036 0.153
-1.03 0.084 0.391 0.041 0.165
-0.58 0.072 0.391 0.035 0.142
-0.14 0.054 0.390 0.026 0.109
0.35 0.039 0.399 0.018 0.082
0.79 0.034 0.404 0.016 0.072
1.72 0.067 0.471 0.028 0.152
Adjusted for:
* SiteID = 0 (population-level)
* area = 0 (population-level)
Standard errors are on link-scale (untransformed).
testgTMB<- ggpredict(Tyntest, "jdate.z[all]")
> testgTMB
# Predicted probabilities of MYOSOD.P
# x = jdate.z
x predicted std.error conf.low conf.high
-1.95 0.444 0.826 0.137 0.801
-1.51 0.254 0.612 0.093 0.531
-1.03 0.136 0.464 0.059 0.280
-0.58 0.081 0.404 0.038 0.163
-0.14 0.054 0.395 0.026 0.110
0.35 0.040 0.402 0.019 0.084
0.79 0.035 0.406 0.016 0.074
1.72 0.040 0.444 0.017 0.091
Adjusted for:
* SiteID = NA (population-level)
* area = NA (population-level)
Standard errors are on link-scale (untransformed).
The predictions are completely different and I have no idea why.
I tried both the CRAN release of the ggeffects package and the development version in case that changed anything. It did not. I am using the most up-to-date version of glmmTMB.
This is my first time asking a question here, so please let me know if I should provide more information to help explain the problem.
I checked, and the issue is the same when using predict instead of ggpredict, which would imply that it is a glmmTMB issue?
GLMER:
dayplotg<-expand.grid(jdate.z=seq(min(sodp$jdate.z), max(sodp$jdate.z), length=92))
Dfitg<-predict(yntest, re.form=NA, newdata=dayplotg, type='response')
dayplotg<-data.frame(dayplotg, Dfitg)
head(dayplotg)
> head(dayplotg)
jdate.z Dfitg
1 -1.953206 0.04581691
2 -1.912873 0.04889584
3 -1.872540 0.05195598
4 -1.832207 0.05497553
5 -1.791875 0.05793307
6 -1.751542 0.06080781
glmmTMB:
dayplot<-expand.grid(jdate.z=seq(min(sodp$jdate.z), max(sodp$jdate.z), length=92),
SiteID=NA,
area=NA)
Dfit<-predict(Tyntest, newdata=dayplot, type='response')
head(Dfit)
dayplot<-data.frame(dayplot, Dfit)
head(dayplot)
> head(dayplot)
jdate.z SiteID area Dfit
1 -1.953206 NA NA 0.4458236
2 -1.912873 NA NA 0.4251926
3 -1.872540 NA NA 0.4050944
4 -1.832207 NA NA 0.3855801
5 -1.791875 NA NA 0.3666922
6 -1.751542 NA NA 0.3484646
I contacted the ggpredict developer and figured out that if I use poly(jdate.z, 3) rather than jdate.z + I(jdate.z^2) + I(jdate.z^3) in the glmmTMB model, the glmer and glmmTMB predictions are the same.
I'll leave this post up, even though I was able to answer my own question, in case someone else runs into this later.
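For anyone wondering whether swapping in poly() changes the model: it should not, because poly(x, 3) spans the same cubic as x + I(x^2) + I(x^3). The sketch below illustrates this with a plain glm on simulated data (the variable names and the simulated coefficients are made up). It does not reproduce the glmmTMB discrepancy; it only shows that the two formulas describe the same fitted curve.

```r
set.seed(1)

# Simulated stand-in for a centered and scaled date variable
x <- as.numeric(scale(runif(300, 100, 250)))
eta <- -1 - 0.7 * x + 0.1 * x^2 + 0.25 * x^3
d <- data.frame(x = x, y = rbinom(length(x), 1, plogis(eta)))

# Same cubic, two parameterizations
m_raw  <- glm(y ~ x + I(x^2) + I(x^3), family = binomial, data = d)
m_poly <- glm(y ~ poly(x, 3), family = binomial, data = d)

# Predict on a grid, as in the predict() calls above
newd <- data.frame(x = seq(min(d$x), max(d$x), length.out = 92))
p_raw  <- predict(m_raw,  newdata = newd, type = "response")
p_poly <- predict(m_poly, newdata = newd, type = "response")

# Largest difference between the two sets of predictions
max(abs(p_raw - p_poly))
```

The coefficients of the two fits differ because poly() uses an orthogonal basis, but the predictions agree up to numerical tolerance.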
So I'm an R novice attempting a GLMM and post hoc analysis... help! I've collected binary data on 9 damselflies under 6 light levels: 1 = response to movement of the optomotor drum, 0 = no response. My data was imported into R with the headings 'Animal_ID, light_intensity, response'. Animal ID (1-9) is repeated for each light intensity (3.36-0.61) (see below).
Using the following code (lme4 package), I've performed a GLMM and found light level to have a significant effect on response:
d = data.frame(id = data[,1], var = data$Light_Intensity, Response = data$Response)
model <- glmer(Response~var+(1|id),family="binomial",data=d)
summary(model)
Returns
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [glmerMod]
Family: binomial ( logit )
Formula: Response ~ var + (1 | Animal_ID)
Data: d
AIC BIC logLik deviance df.resid
66 72 -30 60 51
Scaled residuals:
Min 1Q Median 3Q Max
-3.7704 -0.6050 0.3276 0.5195 1.2463
Random effects:
Groups Name Variance Std.Dev.
Animal_ID (Intercept) 1.645 1.283
Number of obs: 54, groups: Animal_ID, 9
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7406 1.0507 -1.657 0.0976 .
var 1.1114 0.4339 2.561 0.0104 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
var -0.846
Then running:
m1 <- update(model, ~.-var)
anova(model, m1, test = 'Chisq')
Returns
Data: d
Models:
m1: Response ~ (1 | Animal_ID)
model: Response ~ var + (1 | Animal_ID)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
m1 2 72.555 76.533 -34.278 68.555
model 3 66.017 71.983 -30.008 60.017 8.5388 1 0.003477 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I've installed the multcomp and lsmeans packages in an attempt at performing a Tukey post hoc to see where the difference is, but have run into difficulties with both.
Running:
summary(glht(m1,linfct=mcp("Animal_ID"="Tukey")))
Returns:
"Error in mcp2matrix(model, linfct = linfct) :
Variable(s) ‘Animal_ID’ have been specified in ‘linfct’ but cannot be found in ‘model’! "
Running:
lsmeans(model,pairwise~Animal_ID,adjust="tukey")
Returns:
"Error in lsmeans.character.ref.grid(object = new("ref.grid", model.info = list( :
No variable named Animal_ID in the reference grid"
I'm aware that I'm probably being very stupid here, but any help would be very much appreciated. My confusion is snowballing.
Also, does anyone have any suggestions as to how I might best visualize my results (and how to do this)?
Thank you very much in advance!
UPDATE:
New code-
Light <- c("3.36","3.36","3.36","3.36","3.36","3.36","3.36","3.36","3.36","2.98","2.98","2.98","2.98","2.98","2.98","2.98","2.98","2.98","2.73","2.73","2.73","2.73","2.73","2.73","2.73","2.73","2.73","2.15","2.15","2.15","2.15","2.15","2.15","2.15","2.15","2.15","1.72","1.72","1.72","1.72","1.72","1.72","1.72","1.72","1.72","0.61","0.61","0.61","0.61","0.61","0.61","0.61","0.61","0.61")
Subject <- c("1","2","3","4","5","6","7","8","9","1","2","3","4","5","6","7","8","9","1","2","3","4","5","6","7","8","9","1","2","3","4","5","6","7","8","9","1","2","3","4","5","6","7","8","9","1","2","3","4","5","6","7","8","9")
Value <- c("1","0","1","0","1","1","1","0","1","1","0","1","1","1","1","1","1","1","0","1","1","1","1","1","1","0","1","0","0","1","1","1","1","1","1","1","0","0","0","1","0","0","1","0","1","0","0","0","1","1","0","1","0","0")
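# Convert explicitly: since R 4.0, data.frame() keeps strings as character,
# which glmer cannot use as a binomial response
data <- data.frame(Light = factor(Light), Subject = factor(Subject), Value = as.integer(Value))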
library(lme4)
model <- glmer(Value~Light+(1|Subject),family="binomial",data=data)
summary(model)
Returns:
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [glmerMod]
Family: binomial ( logit )
Formula: Value ~ Light + (1 | Subject)
Data: data
AIC BIC logLik deviance df.resid
67.5 81.4 -26.7 53.5 47
Scaled residuals:
Min 1Q Median 3Q Max
-2.6564 -0.4884 0.2193 0.3836 1.2418
Random effects:
Groups Name Variance Std.Dev.
Subject (Intercept) 2.687 1.639
Number of obs: 54, groups: Subject, 9
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.070e+00 1.053e+00 -1.016 0.3096
Light1.72 -7.934e-06 1.227e+00 0.000 1.0000
Light2.15 2.931e+00 1.438e+00 2.038 0.0416 *
Light2.73 2.931e+00 1.438e+00 2.038 0.0416 *
Light2.98 4.049e+00 1.699e+00 2.383 0.0172 *
Light3.36 2.111e+00 1.308e+00 1.613 0.1067
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) Lg1.72 Lg2.15 Lg2.73 Lg2.98
Light1.72 -0.582
Light2.15 -0.595 0.426
Light2.73 -0.595 0.426 0.555
Light2.98 -0.534 0.361 0.523 0.523
Light3.36 -0.623 0.469 0.553 0.553 0.508
Then running:
m1 <- update(model, ~.-Light)
anova(model, m1, test= 'Chisq')
Returns:
Data: data
Models:
m1: Value ~ (1 | Subject)
model: Value ~ Light + (1 | Subject)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
m1 2 72.555 76.533 -34.278 68.555
model 7 67.470 81.393 -26.735 53.470 15.086 5 0.01 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Finally, running:
library(lsmeans)
lsmeans(model,list(pairwise~Light),adjust="tukey")
Returns (it actually works now!):
$`lsmeans of Light`
Light lsmean SE df asymp.LCL asymp.UCL
0.61 -1.070208 1.053277 NA -3.1345922 0.9941771
1.72 -1.070216 1.053277 NA -3.1345997 0.9941687
2.15 1.860339 1.172361 NA -0.4374459 4.1581244
2.73 1.860332 1.172360 NA -0.4374511 4.1581149
2.98 2.978658 1.443987 NA 0.1484964 5.8088196
3.36 1.040537 1.050317 NA -1.0180467 3.0991215
Results are given on the logit (not the response) scale.
Confidence level used: 0.95
$`pairwise differences of contrast`
contrast estimate SE df z.ratio p.value
0.61 - 1.72 7.933829e-06 1.226607 NA 0.000 1.0000
0.61 - 2.15 -2.930547e+00 1.438239 NA -2.038 0.3209
0.61 - 2.73 -2.930539e+00 1.438237 NA -2.038 0.3209
0.61 - 2.98 -4.048866e+00 1.699175 NA -2.383 0.1622
0.61 - 3.36 -2.110745e+00 1.308395 NA -1.613 0.5897
1.72 - 2.15 -2.930555e+00 1.438239 NA -2.038 0.3209
1.72 - 2.73 -2.930547e+00 1.438238 NA -2.038 0.3209
1.72 - 2.98 -4.048874e+00 1.699175 NA -2.383 0.1622
1.72 - 3.36 -2.110753e+00 1.308395 NA -1.613 0.5897
2.15 - 2.73 7.347728e-06 1.357365 NA 0.000 1.0000
2.15 - 2.98 -1.118319e+00 1.548539 NA -0.722 0.9793
2.15 - 3.36 8.198019e-01 1.302947 NA 0.629 0.9889
2.73 - 2.98 -1.118326e+00 1.548538 NA -0.722 0.9793
2.73 - 3.36 8.197945e-01 1.302947 NA 0.629 0.9889
2.98 - 3.36 1.938121e+00 1.529202 NA 1.267 0.8029
Results are given on the log odds ratio (not the response) scale.
P value adjustment: tukey method for comparing a family of 6 estimates
Your model specifies Animal_ID as a random effect. The glht and lsmeans functions work only for fixed-effect comparisons.
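On the visualization question, one simple option is to plot the observed proportion of animals responding at each light level. This is a base-R sketch using the data from the update, typed in as numbers rather than strings; the plot styling is just one choice among many.

```r
# Data from the update above: 9 animals at each of 6 light levels
Light <- rep(c(3.36, 2.98, 2.73, 2.15, 1.72, 0.61), each = 9)
Value <- c(1,0,1,0,1,1,1,0,1,
           1,0,1,1,1,1,1,1,1,
           0,1,1,1,1,1,1,0,1,
           0,0,1,1,1,1,1,1,1,
           0,0,0,1,0,0,1,0,1,
           0,0,0,1,1,0,1,0,0)

# Proportion responding per light level (levels are sorted in increasing order)
prop <- tapply(Value, Light, mean)

plot(as.numeric(names(prop)), prop, type = "b", ylim = c(0, 1),
     xlab = "Light intensity", ylab = "Proportion responding")
```

These raw proportions ignore the animal-level random effect, so they describe the data rather than the fitted model. Model-based estimates on the probability scale could be overlaid from the lsmeans output by back-transforming the logit values with plogis().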
I am trying to explore regressions between abundances and 3 variables. My data (test.gam) looks like this:
# A tibble: 6 x 5
Site Abundance SPM isotherm SiOH4
<chr> <dbl> <dbl> <dbl> <dbl>
1 cycle1 0.769 5960367. 102. 18.2
2 cycle1 0.632 6496360. 97.5 18.2
3 cycle1 0.983 5328652. 105 18.2
4 cycle1 1 6212034. 110 18.2
5 cycle1 0.821 5468987. 105 18.2
6 cycle1 0.734 5280549. 112. 18.2
For one of these variables (SiOH4), I have only one value per Site, while for the 2 other variables I have a single value for each station (each row being a station).
To plot the relation between abundances and SiOH4, I would simply compute a mean value for each Site. The relation shows that abundances increase steadily with SiOH4 levels: Plot1.
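For reference, the per-Site averaging can be done along these lines. This sketch uses only the six cycle1 rows printed above, so it returns a single row; with the full data there would be one row per Site.

```r
# The six rows shown above (one Site only)
test.gam <- data.frame(
  Site = rep("cycle1", 6),
  Abundance = c(0.769, 0.632, 0.983, 1, 0.821, 0.734),
  SiOH4 = rep(18.2, 6)
)

# One mean Abundance (and the single SiOH4 value) per Site
site_means <- aggregate(cbind(Abundance, SiOH4) ~ Site, data = test.gam, FUN = mean)
site_means
```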
Now I tried running a GAM on this data using the following code:
mod_gam1 <- gam(Abundance ~ s(isotherm, bs = "cr", k = 5)
+ SPM + s(SiOH4, bs = "cr", k = 5), data = test.gam, family = gaussian(link = log), gamma = 1.4)
giving me these results:
Family: gaussian
Link function: log
Formula:
Abundance ~ s(isotherm, bs = "cr", k = 5) + SPM + s(SiOH4, bs = "cr",
k = 5)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.182e-01 8.244e-02 -9.925 < 2e-16 ***
SPM -4.356e-08 1.153e-08 -3.778 0.000219 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(isotherm) 2.019 2.485 10.407 1.46e-05 ***
s(SiOH4) 3.861 3.986 9.823 1.01e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.492 Deviance explained = 51.2%
GCV = 0.044202 Scale est. = 0.040674 n = 177
I am quite happy with these results, but when checking with gam.check, I find that k is too low:
Method: GCV Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [-8.801477e-14,5.555545e-13]
(score 0.04420205 & scale 0.04067442).
Hessian positive definite, eigenvalue range [6.631202e-05,7.084933e-05].
Model rank = 10 / 10
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(isotherm) 4.00 2.02 0.85 0.01 **
s(SiOH4) 4.00 3.86 0.59 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that I deliberately set k to 5; otherwise the pattern is overfitted.
I thought that this might be due to the fact that many of the values in SiOH4 are repeated. So I modified my data to keep only the first SiOH4 value for each Site (replacing the values in all other rows with NA), like this:
# A tibble: 6 x 5
# Groups: Site [1]
Site Abundance SPM isotherm SiOH4
<chr> <dbl> <dbl> <dbl> <dbl>
1 cycle1 0.769 5960367. 102. 18.2
2 cycle1 0.632 6496360. 97.5 NA
3 cycle1 0.983 5328652. 105 NA
4 cycle1 1 6212034. 110 NA
5 cycle1 0.821 5468987. 105 NA
6 cycle1 0.734 5280549. 112. NA
I hoped this would prevent the repeated values. But this way I am also losing most of my rows, since the na.omit option is on. On the other hand, when I run the same GAM on this data, gam.check no longer reports k as too low.
So do I need to keep the repeated values and ignore the warning from gam.check, or is there a way to keep all rows even if NAs exist?