I am very new to R and I want to know whether random effects of the different areas where my research took place could be biasing my results and producing a false positive effect of my conditions. My data come from natural observations made under four different conditions across 9 areas. It is a between-subjects design: each row is an observation of a different subject.
Condition =factor (4 levels)
Area = random effect (9 levels)
GeneralDisplayingBehaviours = DV
ObsTime = how many minutes each observation lasted
This is what my model looks like:
data2$Condition<-as.factor(data2$Condition)
data2$Area<-as.factor(data2$Area)
data2$Condition<-relevel(data2$Condition,ref="22")
data2$Area<-relevel(data2$Area,ref="1")
mod<-glmmadmb(GeneralDisplayingBehaviours~Condition+ObsTime+(1|Area), data=data2, family="nbinom", zeroInflation=TRUE)
This is the output:
Call:
glmmadmb(formula = GeneralDisplayingBehaviours ~ Condition +
ObsTime + (1 | Area), data = data2, family = "nbinom", zeroInflation = TRUE)
AIC: 2990.1
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.6233 0.5019 3.23 0.0012 **
Condition12 1.3291 0.1330 9.99 <2e-16 ***
Condition11 1.2965 0.1294 10.02 <2e-16 ***
Condition21 0.0715 0.1351 0.53 0.5966
ObsTime 0.0829 0.0341 2.43 0.0151 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Number of observations: total=360, Area=9
Random effect variance(s):
Group=Area
Variance StdDev
(Intercept) 8.901e-09 9.434e-05
Negative binomial dispersion parameter: 1.7376 (std. err.: 0.16112)
Zero-inflation: 0.16855 (std. err.: 0.02051 )
Log-likelihood: -1487.06
I would then go on to change the condition ref for each level, and I think I would have to do the same with area.
How would I interpret these results and report them?
What does this tell me about my random effect of area? Does it impact the DV?
Are the conditions significantly different in terms of the DV, and which conditions are significantly different for which?
For the last question, I have a feeling I would need to do multiple comparisons, so how would I do this in glmmADMB?
Thank you
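As a reading aid for the coefficient table above, here is a minimal sketch that back-transforms the printed Condition estimates. It assumes the model uses a log link (to my knowledge the default for family="nbinom" in glmmADMB, but worth checking), in which case each estimate is a log rate ratio against the reference level 22. The numbers are copied from the output; the names est, se and rr are only illustrative.

```r
# Estimates and standard errors copied from the summary above
est <- c(Condition12 = 1.3291, Condition11 = 1.2965, Condition21 = 0.0715)
se  <- c(Condition12 = 0.1330, Condition11 = 0.1294, Condition21 = 0.1351)

# Assumption: log link, so exp() turns each estimate into a rate ratio
# relative to the reference level (Condition 22)
z  <- qnorm(0.975)  # multiplier for an approximate 95% Wald interval
rr <- cbind(
  rate_ratio = exp(est),
  lower      = exp(est - z * se),
  upper      = exp(est + z * se)
)
print(round(rr, 2))
```

A Wald interval that contains 1 corresponds to a non-significant z test in the table. These are comparisons against level 22 only; comparisons among the other levels are not covered by this sketch.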
This question already has answers here:
Interpreting Estimate of categorical variable coefficient in lm() summary() in R
(1 answer)
Interpretation of .L, .Q., .C, .4… for logistic regression
(1 answer)
Interpretation of ordered and non-ordered factors, vs. numerical predictors in model summary
(1 answer)
Closed 7 months ago.
How can I get mgcv::gam to display my actual treatment names in the parametric coefficient names? I just have random letters on them.
Family: quasipoisson
Link function: log
Formula:
weekly_eggs ~ food + s(week) + s(week, by = Ofood) + s(id,
bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5624 0.2031 -2.769 0.00581 **
food.L 0.1011 0.4086 0.247 0.80473
food.Q 0.5398 0.4076 1.324 0.18594
food.C 0.2136 0.4053 0.527 0.59838
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(week) 6.403 6.403 49.376 < 2e-16 ***
s(week):Ofoodlow 1.000 1.000 0.001 0.976359
s(week):Ofoodmed 1.000 1.000 14.360 0.000167 ***
s(week):Ofoodhigh 2.078 2.078 1.656 0.141125
s(id) 43.669 49.000 12.984 < 2e-16 ***
Confusingly, the smooth terms actually have the treatment names. Before, this problem with the parametric coefficients was fixed by adding an "O" before naming the treatment column, but that seems to not work.
I set all groups to factors before gam, and also created a new column with the O in front of the treatment column which somehow makes it easier to make the summed smooth curves later.
Any advice...? Still unfamiliar with a lot of the gam function. I know this isn't reproducible code, but wasn't sure how to do that with this situation.
Code:
df_P_egg_gamm$id = as.factor(df_P_egg_gamm$id)
df_P_egg_gamm$food = as.factor(df_P_egg_gamm$food)
df_P_egg_gamm$Ofood = factor(df_N_P_egg_gamm$food, ordered=T)
egg_gamm_P = gamm(weekly_eggs ~ food + s(week) + s(week, by=Ofood) + s(id, bs="re"), family="quasipoisson",
correlation=corCAR1(form=~week|id), data=df_P_egg_gamm)
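The .L, .Q and .C suffixes in the parametric coefficients are the names R gives to polynomial contrasts, which it uses by default for ordered factors. This suggests (an inference from the printed names, not from the data) that food was an ordered factor when the model was fitted. A minimal base R sketch of the naming difference, using hypothetical level names:

```r
# Hypothetical treatment levels, for illustration only
lv <- c("none", "low", "med", "high")
food_unordered <- factor(c("none", "low", "med", "high", "low", "med"), levels = lv)
food_ordered   <- factor(food_unordered, levels = lv, ordered = TRUE)

# Unordered factor: treatment contrasts, columns are named after the levels
nm_unordered <- colnames(model.matrix(~ food_unordered))
print(nm_unordered)

# Ordered factor: polynomial contrasts, columns are named .L, .Q, .C
nm_ordered <- colnames(model.matrix(~ food_ordered))
print(nm_ordered)
```

If the same pattern holds in the real data, checking is.ordered() on the food column just before fitting would confirm it.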
I want to see if four predictors ("OA_statusclosed" "OA_statusgreen" "OA_statushybrid" "OA_statusbronze") have an effect on "logAlt". I have chosen to fit an lmer model in R to account for random intercepts and slopes by the variable "Journal".
I want to interpret the output in terms of OA status rank (from highest to lowest status: green, hybrid, bronze, closed). To do this, I have contrast coded the four levels as follows (following the order of the levels in my df: hybrid, closed, green, bronze):
contrasts(df$OA_status.c) <- c(0.25, -0.5, 0.5, -0.25)
contrasts(df$OA_status.c)
I ran this analysis:
M3 <- lmer(logAlt ~ OA_status + (1|Journal),
data = df,
control=lmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)))
And got this summary(M3):
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: logAlt ~ OA_status + (1 | Journal)
Data: df
Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
REML criterion at convergence: 20873.7
Scaled residuals:
Min 1Q Median 3Q Max
-3.1272 -0.6719 0.0602 0.6618 4.4344
Random effects:
Groups Name Variance Std.Dev.
Journal (Intercept) 0.08848 0.2975
Residual 1.49272 1.2218
Number of obs: 6435, groups: Journal, 7
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.03867 0.15059 18.27727 13.538 5.71e-11 ***
OA_statusclosed -0.97648 0.09915 6428.62227 -9.848 < 2e-16 ***
OA_statusgreen -0.74956 0.10320 6429.65387 -7.263 4.22e-13 ***
OA_statushybrid 0.04621 0.12590 6427.44114 0.367 0.714
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) OA_sttsc OA_sttsg
OA_sttsclsd -0.640
OA_statsgrn -0.613 0.934
OA_sttshybr -0.501 0.763 0.744
I interpret this to mean that, for example, OA_statusclosed is associated with a logAlt value that is on average 0.98 lower than the reference level, and that OA_statusclosed is a significant predictor.
I have two questions:
Am I approaching contrast coding correctly? That is, am I making "OA_statusgreen" my reference level (which is what I think I need to do)?
Am I interpreting the output correctly?
Thanks in advance!
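One thing visible in the output above is that the model formula uses OA_status, while the contrasts were set on OA_status.c, and that the summary lists closed, green and hybrid but not bronze. That pattern is what default treatment coding with bronze as the reference level would produce. A minimal base R sketch of how the reference level is chosen and how it could be changed; the vector oa is a hypothetical stand-in for the real column:

```r
# Hypothetical stand-in for the OA_status column, using the four level
# names from the question
oa <- factor(c("hybrid", "closed", "green", "bronze"))

# factor() sorts levels alphabetically; the first level is the reference
print(levels(oa))

# Default treatment contrasts: the reference level is the all-zero row
print(contrasts(oa))

# To make "green" the reference level instead
oa_green <- relevel(oa, ref = "green")
print(levels(oa_green))
```

With green as the reference, each coefficient would be the difference between that level and green, assuming the default treatment contrasts are left in place.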
I am trying to determine the influence of five categorical and one continuous independent variable on some ecological count data using a GLM in R. Here is an example of what the data that I am using looks like:
No.  DateNum  Tunnel  Trt  Row  Time  AvgTemp  sqTotal.L
1    44382    1       A    3    AM    30.0     1.41
2    44384    3       C    2    PM    21.0     2.23
3    44384    7       D    3    AM    24.0     3.65
4    44400    4       B    1    AM    27.5     2.78
The fixed effects DateNum, Tunnel and Row are coded as ordered factors, Trt and Time are unordered factors, and AvgTemp is coded as numeric. 'sqTotal.L' is the square-root-transformed count data, which is a normally distributed response variable. I have decided to use a GLM instead of ANOVA because the experimental design is not balanced and there are different numbers of samples from the different experimental plots.
When I run the following code for a GLM and then the drop1() function on the resulting model, the effect of my continuous fixed effect (AvgTemp) appears not to be incorporated into the result of the drop1() function:
> rowfvs.glm <- glm(sqTotal.L ~ AvgTemp + Row + DateNum + Time + Tunnel + Trt,
+ family = gaussian, data = rowfvs2)
> summary(rowfvs.glm)
Call:
glm(formula = sqTotal.L ~ AvgTemp + Row + DateNum + Time + Tunnel +
Trt, family = gaussian, data = rowfvs2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.63548 -0.38868 0.06587 0.41777 1.31886
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.89037 5.98768 1.485 0.1492
AvgTemp -0.28191 0.24566 -1.148 0.2612
Row.L -0.46085 0.24735 -1.863 0.0734 .
Row.Q 0.08047 0.25153 0.320 0.7515
DateNum.L -1.17448 0.85015 -1.382 0.1785
DateNum.Q 0.57857 0.64731 0.894 0.3793
DateNum.C -2.17331 2.15684 -1.008 0.3226
DateNum^4 -0.76025 1.09723 -0.693 0.4943
DateNum^5 -1.62269 0.68388 -2.373 0.0250 *
DateNum^6 0.63799 0.70822 0.901 0.3756
DateNum^7 NA NA NA NA
TimePM -0.31436 0.87881 -0.358 0.7233
Tunnel.L 1.38420 0.62199 2.225 0.0346 *
Tunnel.Q -0.03521 0.56561 -0.062 0.9508
Tunnel.C 0.81639 0.54880 1.488 0.1484
Tunnel^4 0.24029 0.61180 0.393 0.6976
Tunnel^5 0.30665 0.51849 0.591 0.5592
Tunnel^6 0.67603 0.53728 1.258 0.2191
TrtB 0.10067 0.40771 0.247 0.8068
TrtC 0.31278 0.41048 0.762 0.4527
TrtD -0.49857 0.46461 -1.073 0.2927
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.7583716)
Null deviance: 50.340 on 46 degrees of freedom
Residual deviance: 20.476 on 27 degrees of freedom
AIC: 136.33
Number of Fisher Scoring iterations: 2
> drop1(rowfvs.glm, test = "Chi")
Single term deletions
Model:
sqTotal.L ~ AvgTemp + Row + DateNum + Time + Tunnel + Trt
Df Deviance AIC scaled dev. Pr(>Chi)
<none> 20.476 136.33
AvgTemp 0 20.476 136.33 0.0000
Row 2 23.128 138.05 5.7249 0.05713 .
DateNum 6 25.517 134.67 10.3447 0.11087
Time 1 20.573 134.55 0.2222 0.63736
Tunnel 6 27.525 138.23 13.9039 0.03073 *
Trt 3 23.201 136.20 5.8725 0.11798
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
By contrast, when I try running the anova() function on the model, I do get an analysis of the influence of AvgTemp on sqTotal.L:
> anova(rowfvs.glm, test = "Chi")
Analysis of Deviance Table
Model: gaussian, link: identity
Response: sqTotal.L
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 46 50.340
AvgTemp 1 0.7885 45 49.551 0.3078945
Row 2 1.0141 43 48.537 0.5124277
DateNum 6 17.6585 37 30.879 0.0007065 ***
Time 1 0.3552 36 30.523 0.4937536
Tunnel 6 7.3222 30 23.201 0.1399428
Trt 3 2.7251 27 20.476 0.3088504
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So, my questions are why isn't the drop1() function taking AvgTemp into account, and is it sufficient to use the results from the anova() function in my report of the results, or do I need to figure out how to get the drop1() function to incorporate this continuous predictor?
This is a bit of a guess because we don't have your data, but: I believe the answer is related to the multicollinearity in your design matrix (as signalled by the message "1 not defined because of singularities" and the presence of an NA estimate for the DateNum^7 parameter).
When you have collinear (perfectly correlated) columns in your design matrix, it can be a bit unpredictable how they get dealt with. lm() (and likewise glm()) picks one of the columns to discard: in this case it's DateNum^7. However, assuming that AvgTemp is also in the set of collinear columns, if you drop AvgTemp from the model then when the model gets refitted lm() will not drop DateNum^7 (because it doesn't need to any more), but you will still get the same goodness of fit (AIC/log-likelihood/etc.), because you dropped a variable that is redundant.
You should be able to explore this possibility via caret::findLinearCombos(model.matrix(rowfvs.glm)), although careful thought about your observational/experimental design might also enlighten you as to why these variables are collinear ...
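The mechanism described above can be reproduced on made-up data. The sketch below is only an illustration under an assumed design in which every date has exactly one temperature (the actual cause in your data may differ), and it uses the base R qr() rank as a stand-in for caret::findLinearCombos(). All names (d, Date, Temp, y) are hypothetical.

```r
set.seed(1)

# Hypothetical data: each date has exactly one temperature, so Temp is a
# linear combination of the intercept and the Date dummy columns
d <- data.frame(Date = factor(rep(c("d1", "d2", "d3", "d4"), each = 6)))
temp_by_date <- c(d1 = 30, d2 = 21, d3 = 24, d4 = 27.5)
d$Temp <- unname(temp_by_date[as.character(d$Date)])
d$y <- rnorm(nrow(d), mean = 0.1 * d$Temp)

fit <- glm(y ~ Temp + Date, family = gaussian, data = d)

# The design matrix has more columns than its rank: one column is redundant
X <- model.matrix(fit)
print(c(columns = ncol(X), rank = qr(X)$rank))

# glm() reports the redundant column as an NA coefficient
print(coef(fit))

# Dropping Temp leaves the fit unchanged, so drop1() reports 0 Df for it
res <- drop1(fit, test = "Chi")
print(res)
```

In this toy case the Temp row of the drop1() table has 0 Df and an unchanged deviance, mirroring the AvgTemp row in the question.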
I have tried to build an ordinal logistic regression using one ordered categorical dependent variable and three categorical independent variables (N = 43097). While all coefficients are significant, I have doubts about meeting the parallel regression assumption. Although the probability values of all variables and of the whole model in the Brant test are exactly zero (I understand they are supposed to be more than 0.05), the test still displays "H0: Parallel Regression Assumption holds". I am confused here. Does this model meet the parallel regression assumption?
library(MASS)
table(hh18_u_r$cat_ci_score) # Dependent variable
Extremely Vulnerable Moderate Vulnerable Pandemic Prepared
6143 16341 20613
# Ordinal logistic regression
olr_2 <- polr(cat_ci_score ~ r1_gender + r2_merginalised + r9_religion, data = hh18_u_r, Hess=TRUE)
summary(olr_2)
Call:
polr(formula = cat_ci_score ~ r1_gender + r2_merginalised + r9_religion,
data = hh18_u_r, Hess = TRUE)
Coefficients:
Value Std. Error t value
r1_genderMale 0.3983 0.02607 15.278
r2_merginalisedOthers 0.6641 0.01953 34.014
r9_religionHinduism -0.2432 0.03069 -7.926
r9_religionIslam -0.5425 0.03727 -14.556
Intercepts:
Value Std. Error t value
Extremely Vulnerable|Moderate Vulnerable -1.5142 0.0368 -41.1598
Moderate Vulnerable|Pandemic Prepared 0.4170 0.0359 11.6260
Residual Deviance: 84438.43
AIC: 84450.43
## significance of coefficients and intercepts
summary_table_2 <- coef(summary(olr_2))
pval_2 <- pnorm(abs(summary_table_2[, "t value"]), lower.tail = FALSE)* 2
summary_table_2 <- cbind(summary_table_2, pval_2)
summary_table_2
Value Std. Error t value pval_2
r1_genderMale 0.3982719 0.02606904 15.277583 1.481954e-52
r2_merginalisedOthers 0.6641311 0.01952501 34.014386 2.848250e-250
r9_religionHinduism -0.2432085 0.03068613 -7.925682 2.323144e-15
r9_religionIslam -0.5424992 0.03726868 -14.556436 6.908533e-48
Extremely Vulnerable|Moderate Vulnerable -1.5141502 0.03678710 -41.159819 0.000000e+00
Moderate Vulnerable|Pandemic Prepared 0.4169645 0.03586470 11.626042 3.382922e-31
#Test of parallel regression assumption
library(brant)
brant(olr_2) # Probability supposed to be more than 0.05 as I understand
----------------------------------------------------
Test for X2 df probability
----------------------------------------------------
Omnibus 168.91 4 0
r1_genderMale 12.99 1 0
r2_merginalisedOthers 41.18 1 0
r9_religionHinduism 86.16 1 0
r9_religionIslam 25.13 1 0
----------------------------------------------------
H0: Parallel Regression Assumption holds
# Similar test of parallel regression assumption using car package
library(car)
car::poTest(olr_2)
Tests for Proportional Odds
polr(formula = cat_ci_score ~ r1_gender + r2_merginalised + r9_religion,
data = hh18_u_r, Hess = TRUE)
b[polr] b[>Extremely Vulnerable] b[>Moderate Vulnerable] Chisquare df Pr(>Chisq)
Overall 168.9 4 < 2e-16 ***
r1_genderMale 0.398 0.305 0.442 13.0 1 0.00031 ***
r2_merginalisedOthers 0.664 0.513 0.700 41.2 1 1.4e-10 ***
r9_religionHinduism -0.243 -0.662 -0.147 86.2 1 < 2e-16 ***
r9_religionIslam -0.542 -0.822 -0.504 25.1 1 5.4e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Could you please advise whether this model satisfies the parallel regression assumption? Thank you.
The last line only states the null hypothesis (H0), namely that the parallel regression assumption holds. Your p-values are statistically significant (they are displayed as 0), which means you reject H0 for the omnibus test and for every predictor. That line is not telling you the assumption was met; it is just a reminder of what the null is.
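To see that the zeros in the brant() table are very small p-values rather than exact zeros, the p-values can be recomputed from the printed X2 and df columns. This is a small base R check; that the table rounds to two decimals is my assumption about how it is printed.

```r
# X2 and df values copied from the brant() output above
x2 <- c(Omnibus = 168.91, r1_genderMale = 12.99, r2_merginalisedOthers = 41.18,
        r9_religionHinduism = 86.16, r9_religionIslam = 25.13)
dof <- c(4, 1, 1, 1, 1)

# Upper-tail chi-square probabilities
p <- pchisq(x2, df = dof, lower.tail = FALSE)

print(signif(p, 2))   # very small, but not exactly zero
print(round(p, 2))    # rounded to two decimals they all display as 0
print(all(p < 0.05))  # every test rejects H0 at the 5% level
```

The recomputed values agree with the Pr(>Chisq) column of the car::poTest() output shown in the question.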
The full model for growth of plants is as follows:
lmer(log(growth) ~ nutrition + fertilizer + season + (1|block))
where nutrition(nitrogen/phosphorus), fertilizer(none/added), season(dry/wet)
The summary of the model is as follows:
REML criterion at convergence: 71.9
Scaled residuals:
Min 1Q Median 3Q Max
-1.82579 -0.59620 0.04897 0.62629 1.54639
Random effects:
Groups Name Variance Std.Dev.
block (Intercept) 0.06008 0.2451
Residual 0.48633 0.6974
Number of obs: 32, groups: tank, 16
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 3.5522 0.2684 19.6610 13.233 3.02e-11 ***
nutritionP 0.2871 0.2753 13.0000 1.043 0.31601
fertlizeradded -0.3513 0.2753 13.0000 -1.276 0.22436
seasonwet 1.0026 0.2466 15.0000 4.066 0.00101 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Plant growth here depends only on season, and the increase in growth is 1.0026 on the log scale. How do I interpret this on the scale of the original data, if I want to know what the increase in actual plant height was? Is it simply exp(1.0026) ~ 3 cm, or is there another way to interpret this?
exp(1.0026) is about 2.73 (roughly 3), but this value represents a proportional change, not an absolute change in cm. Growth is about 2.7 times higher in the wet season than in the dry season, all other things being equal.
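The back-transformation can be written out in a few lines of R. The estimates are copied from the summary above; the variable names are illustrative, and the back-transformed values are for the reference levels of nutrition and fertilizer, on the scale of exp(mean of log growth) rather than the arithmetic mean.

```r
b0    <- 3.5522  # intercept from the summary (dry season, reference levels)
b_wet <- 1.0026  # seasonwet estimate from the summary

ratio <- exp(b_wet)         # multiplicative effect of wet vs dry season
pct   <- (ratio - 1) * 100  # the same effect as a percentage increase

# The absolute difference depends on the baseline, so it is not a single
# fixed number of cm
dry <- exp(b0)
wet <- exp(b0 + b_wet)

print(round(c(ratio = ratio, percent_increase = pct,
              dry = dry, wet = wet, difference = wet - dry), 2))
```

The ratio is the same whatever the baseline, while the difference in original units changes with the levels of the other predictors.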