I have discrete count data (trap_catch) for two groups withing the variable in_tree (1 = trap in a tree, or 0 = trap not in a tree), and I want to see if counts were different between these two groups. The data is overdispersed and there are many zeroes, so I have come to the conclusion that I need a hurdle model. Is this OK?
trap_id trap_catch in_tree
1 0 0
2 10 1
3 0 0
4 0 1
5 9 1
6 3 0
Here is an example of how the data is set up. My code is as follows:
mod.hurdle <- hurdle(trap_catch~in_tree, data=data,dist="negbin")
summary(mod.hurdle)
The results I get are as follows and seem so different to any examples I have read:
Pearson residuals:
Min 1Q Median 3Q Max
-0.8986 -0.6635 -0.2080 0.2474 6.8513
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.2582 0.1285 9.793 < 2e-16 ***
in_tree 1.3722 0.3100 4.426 9.58e-06 ***
Log(theta) -0.2056 0.2674 -0.769 0.442
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.5647 0.1944 8.049 8.32e-16 ***
in_tree 16.0014 1684.1379 0.010 0.992
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 0.8142
Number of iterations in BFGS optimization: 8
Log-likelihood: -513.7 on 5 Df
I am confused as to how to interpret these results.
I apologise in advance for my lack of understanding - I am very new to this type of analysis.
Related
I've run a multiple linear regression where pred_acc is the dependent continuous variable and emotion_pred and emotion_target are two dummy coded independent variables with 0 and 1. Furthermore I am interested in the interaction between the two independent variables.
model <- lm(predic_acc ~ emotion_pred * emotion_target, data = data_almost_final)
summary(model)
Residuals:
Min 1Q Median 3Q Max
-0.66049 -0.19522 0.01235 0.19213 0.67284
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.97222 0.06737 14.432 < 2e-16 ***
emotion_pred 0.45988 0.09527 4.827 8.19e-06 ***
emotion_target 0.24383 0.09527 2.559 0.012719 *
emotion_pred:emotion_target -0.47840 0.13474 -3.551 0.000703 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2858 on 68 degrees of freedom
(1224 Beobachtungen als fehlend gelöscht)
Multiple R-squared: 0.2555, Adjusted R-squared: 0.2227
F-statistic: 7.781 on 3 and 68 DF, p-value: 0.0001536
In case some context is needed: I did a survey where couples had to predict their partners preferences. The predictor individual was either in emotion state 0 or 1 (emotion_pred) and the target individual was either in emotion state 0 or 1 (emotion_target). Accordingly, there are four combinations.
Now I want to plot the regression with the means of each combination of the independent variables (0,1; 1,0; 1,1; 0,0) and add an error bar with the standard error of the means. I have literally no idea at all how to do this. Anyone can help me with this?
Here's an extraction from my data:
pred_acc emotion_pred emotion_target
1 1.0000000 1 0
2 1.2222222 0 1
3 0.7777778 0 0
4 1.1111111 1 1
5 1.3888889 1 1
Sketch of how I want it to look like
Using emmip from the emmeans library:
model <- lm(data=d2, pred_acc ~ emotion_pred*emotion_target)
emmip(model, emotion_pred ~ emotion_target, CIs = TRUE, style = "factor")
If you want more control over the image or just to get the values you can use the emmeans function directly:
> emmeans(model , ~ emotion_pred * emotion_target )
emotion_pred emotion_target emmean SE df lower.CL upper.CL
0 0 0.778 0.196 1 -1.718 3.27
1 0 1.000 0.196 1 -1.496 3.50
0 1 1.222 0.196 1 -1.274 3.72
1 1 1.250 0.139 1 -0.515 3.01
Then you can use ggplot on this dataframe to make whatever graph you like.
I am trying to run a glm that looks at the effects of food type, habitat, and starvation period on food preference in Ants, however I simply want to look at food type as a single factor, even though I provide the ants with 5 foods. I have used as.factor on the food variable, but it still doesn't seem to work! I want a single p-value for how food affects individuals. Am I missing something?
NumofAnts FoodType Trial SiteType
1 0 Pink 1 natural
2 4 Pink 1 natural
3 5 Pink 1 natural
4 4 Pink 1 natural
5 8 Pink 1 natural
6 5 Pink 1 natural
fit<-glm(NumofAnts~as.factor(FoodType) + Trial + SiteType,
family=poisson(link=log), data=stacked1)
glm(formula = NumofAnts ~ as.factor(FoodType) + Trial + SiteType,
family = poisson(link = log), data = stacked1)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.5644 -2.2495 -1.0023 0.8588 8.8051
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.46177 0.08031 18.202 < 2e-16 ***
as.factor(FoodType)Blue -0.66665 0.06824 -9.769 < 2e-16 ***
as.factor(FoodType)Green -0.29987 0.06093 -4.922 8.57e-07 ***
as.factor(FoodType)Yellow -0.28086 0.06060 -4.635 3.57e-06 ***
as.factor(FoodType)Red -0.92502 0.07459 -12.401 < 2e-16 ***
Trial 0.19355 0.04327 4.473 7.73e-06 ***
SiteTypeurban -0.19730 0.04328 -4.558 5.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A glm will estimate one coefficient (so one p-value) when the variable is numeric. But when the variable is categoric (like food on your case) it will calculate one coefficient for each level (except one) of your variable. In your case, food has 5 levels so 4 coefficients are estimated (so 4 p-values).
What formula is used to calculate the value of Pr(>|t|) that is output when linear regression is performed by R?
I understand that the value of Pr (> | t |) is a p-value, but I do not understand how the value is calculated.
For example, although the value of Pr (> | t |) of x1 is displayed as 0.021 in the output result below, I want to know how this value was calculated
x1 <- c(10,20,30,40,50,60,70,80,90,100)
x2 <- c(20,30,60,70,100,110,140,150,180,190)
y <- c(100,120,150,180,210,220,250,280,310,330)
summary(lm(y ~ x1+x2))
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-6 -2 0 2 6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.0000 3.4226 21.621 1.14e-07 ***
x1 1.8000 0.6071 2.965 0.021 *
x2 0.4000 0.3071 1.303 0.234
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.781 on 7 degrees of freedom
Multiple R-squared: 0.9971, Adjusted R-squared: 0.9963
F-statistic: 1209 on 2 and 7 DF, p-value: 1.291e-09
Basically, the values in the column t-value are obtained by dividing the coefficient estimate (which is in the Estimate column) by the standard error.
For example in your case in the second row we get that:
tval = 1.8000 / 0.6071 = 2.965
The column you are interested in is the p-value. It is the probability that the absolute value of t-distribution is greater than 2.965. Using the symmetry of the t-distribution this probability is:
2 * pt(abs(tval), rdf, lower.tail = FALSE)
Here rdf denotes the residual degrees of freedom, which in our case is equal to 7:
rdf = number of observations minus total number of coefficient = 10 - 3 = 7
And a simple check shows that this is indeed what R does:
2 * pt(2.965, 7, lower.tail = FALSE)
[1] 0.02095584
Why are these GLMMs so different?
Both are made with lme4, both use the same data, but one is framed in terms of successes and trials (m1bin) while one just uses the raw accuracy data (m1). Have I been completely mistaken thinking that lme4 figures out the binomial structure from the raw data this whole time? (BRMS does it just fine.) I'm scared, now, that some of my analyses will change.
d:
uniqueid dim incorrectlabel accuracy
1 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 incidental marginal 0
2 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 incidental extreme 1
3 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 relevant marginal 1
4 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 incidental marginal 1
5 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 relevant marginal 0
6 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 incidental marginal 0
dbin:
uniqueid dim incorrectlabel right count
<fctr> <fctr> <fctr> <int> <int>
1 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 incidental extreme 3 3
2 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 incidental marginal 1 5
3 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 relevant extreme 3 4
4 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 relevant marginal 3 4
5 A16HSMUJ7C7QA7:3DY46V3X3PI4B0HROD2HN770M46557 incidental extreme 3 4
6 A16HSMUJ7C7QA7:3DY46V3X3PI4B0HROD2HN770M46557 incidental marginal 2 4
> summary(m1bin)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: cbind(right, count) ~ dim * incorrectlabel + (1 | uniqueid)
Data: dbin
AIC BIC logLik deviance df.resid
398.2 413.5 -194.1 388.2 151
Scaled residuals:
Min 1Q Median 3Q Max
-1.50329 -0.53743 0.08671 0.38922 1.28887
Random effects:
Groups Name Variance Std.Dev.
uniqueid (Intercept) 0 0
Number of obs: 156, groups: uniqueid, 39
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.48460 0.13788 -3.515 0.00044 ***
dimrelevant -0.13021 0.20029 -0.650 0.51562
incorrectlabelmarginal -0.15266 0.18875 -0.809 0.41863
dimrelevant:incorrectlabelmarginal -0.02664 0.27365 -0.097 0.92244
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) dmrlvn incrrc
dimrelevant -0.688
incrrctlblm -0.730 0.503
dmrlvnt:ncr 0.504 -0.732 -0.690
> summary(m1)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: accuracy ~ dim * incorrectlabel + (1 | uniqueid)
Data: d
AIC BIC logLik deviance df.resid
864.0 886.2 -427.0 854.0 619
Scaled residuals:
Min 1Q Median 3Q Max
-1.3532 -1.0336 0.7524 0.9350 1.1514
Random effects:
Groups Name Variance Std.Dev.
uniqueid (Intercept) 0.04163 0.204
Number of obs: 624, groups: uniqueid, 39
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.140946 0.088242 1.597 0.1102
dim1 0.155923 0.081987 1.902 0.0572 .
incorrectlabel1 0.180156 0.081994 2.197 0.0280 *
dim1:incorrectlabel1 0.001397 0.082042 0.017 0.9864
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) dim1 incrr1
dim1 0.010
incrrctlbl1 0.128 0.006
dm1:ncrrct1 0.005 0.138 0.010
I figured they'd be the same. Modeling both in BRMS gives the same models with the same estimates.
They should be the same (up to small numerical differences: see below), except for the log-likelihoods and metric based on them (although differences among a series of models in log-likelihoods/AIC/etc. should be the same). I think your problem is using cbind(right, count) rather than cbind(right, count-right): from ?glm,
For binomial ... families the response can also be specified as ... a two-column matrix with the columns giving the numbers of successes and failures.
(emphasis added to point out this is not number of successes and total, but successes and failures).
Here's an example with one of the built-in data sets, comparing fits to an aggregated and a disaggregated data set:
library(lme4)
library(dplyr)
## disaggregate
cbpp_disagg <- cbpp %>% mutate(obs=seq(nrow(cbpp))) %>%
group_by(obs,herd,period,incidence) %>%
do(data.frame(disease=rep(c(0,1),c(.$size-.$incidence,.$incidence))))
nrow(cbpp_disagg) == sum(cbpp$size) ## check
g1 <- glmer(cbind(incidence,size-incidence)~period+(1|herd),
family=binomial,cbpp)
g2 <- glmer(disease~period+(1|herd),
family=binomial,cbpp_disagg)
## compare results
all.equal(fixef(g1),fixef(g2),tol=1e-5)
all.equal(VarCorr(g1),VarCorr(g2),tol=1e-6)
I'm trying to fit a general linear model (GLM) on my data using R. I have a Y continuous variable and two categorical factors, A and B. Each factor is coded as 0 or 1, for presence or absence.
Even if just looking at the data I see a clear interaction between A and B, the GLM says that p-value>>>0.05. Am I doing something wrong?
First of all I create the data frame including my data for the GLM, which consists on a Y dependent variable and two factors, A and B. These are two level factors (0 and 1). There are 3 replicates per combination.
A<-c(0,0,0,1,1,1,0,0,0,1,1,1)
B<-c(0,0,0,0,0,0,1,1,1,1,1,1)
Y<-c(0.90,0.87,0.93,0.85,0.98,0.96,0.56,0.58,0.59,0.02,0.03,0.04)
my_data<-data.frame(A,B,Y)
Let’s see how it looks like:
my_data
## A B Y
## 1 0 0 0.90
## 2 0 0 0.87
## 3 0 0 0.93
## 4 1 0 0.85
## 5 1 0 0.98
## 6 1 0 0.96
## 7 0 1 0.56
## 8 0 1 0.58
## 9 0 1 0.59
## 10 1 1 0.02
## 11 1 1 0.03
## 12 1 1 0.04
As we can see just looking on the data, there is a clear interaction between factor A and factor B, as the value of Y dramatically decreases when A and B are present (that is A=1 and B=1). However, using the glm function I get no significant interaction between A and B, as p-value>>>0.05
attach(my_data)
## The following objects are masked _by_ .GlobalEnv:
##
## A, B, Y
my_glm<-glm(Y~A+B+A*B,data=my_data,family=binomial)
## Warning: non-integer #successes in a binomial glm!
summary(my_glm)
##
## Call:
## glm(formula = Y ~ A + B + A * B, family = binomial, data = my_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.275191 -0.040838 0.003374 0.068165 0.229196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1972 1.9245 1.142 0.254
## A 0.3895 2.9705 0.131 0.896
## B -1.8881 2.2515 -0.839 0.402
## A:B -4.1747 4.6523 -0.897 0.370
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.86365 on 11 degrees of freedom
## Residual deviance: 0.17364 on 8 degrees of freedom
## AIC: 12.553
##
## Number of Fisher Scoring iterations: 6
While you state Y is continuous, the data shows that Y is rather a fraction. Hence, probably the reason you tried to apply GLM in the first place.
To model fractions (i.e. continuous values bounded by 0 and 1) can be done with logistic regression if certain assumptions are fullfilled. See the following cross-validated post for details: https://stats.stackexchange.com/questions/26762/how-to-do-logistic-regression-in-r-when-outcome-is-fractional. However, from the data description it is not clear that those assumptions are fullfilled.
An alternative to model fractions are beta regression or fractional repsonse models.
See below how to apply those methods to your data. The results of both methods are consistent in terms of signs and significance.
# Beta regression
install.packages("betareg")
library("betareg")
result.betareg <-betareg(Y~A+B+A*B,data=my_data)
summary(result.betareg)
# Call:
# betareg(formula = Y ~ A + B + A * B, data = my_data)
#
# Standardized weighted residuals 2:
# Min 1Q Median 3Q Max
# -2.7073 -0.4227 0.0682 0.5574 2.1586
#
# Coefficients (mean model with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.1666 0.2192 9.885 < 2e-16 ***
# A 0.6471 0.3541 1.828 0.0676 .
# B -1.8617 0.2583 -7.206 5.76e-13 ***
# A:B -4.2632 0.5156 -8.268 < 2e-16 ***
#
# Phi coefficients (precision model with identity link):
# Estimate Std. Error z value Pr(>|z|)
# (phi) 71.57 29.50 2.426 0.0153 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Type of estimator: ML (maximum likelihood)
# Log-likelihood: 24.56 on 5 Df
# Pseudo R-squared: 0.9626
# Number of iterations: 62 (BFGS) + 2 (Fisher scoring)
# ----------------------------------------------------------
# Fractional response model
install.packages("frm")
library("frm")
frm(Y,cbind(A, B, AB=A*B),linkfrac="logit")
*** Fractional logit regression model ***
# Estimate Std. Error t value Pr(>|t|)
# INTERCEPT 2.197225 0.157135 13.983 0.000 ***
# A 0.389465 0.530684 0.734 0.463
# B -1.888120 0.159879 -11.810 0.000 ***
# AB -4.174668 0.555642 -7.513 0.000 ***
#
# Note: robust standard errors
#
# Number of observations: 12
# R-squared: 0.992
The family=binomial implies Logit (Logistic) Regression, which is itself produces a binary result.
From Quick-R
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome
from a set of continuous predictor variables. It is frequently
preferred over discriminant function analysis because of its less
restrictive assumptions.
The data shows an interaction. Try to fit a different model, logistic is not appropriate.
with(my_data, interaction.plot(A, B, Y, fixed = TRUE, col = 2:3, type = "l"))
An analysis of variance shows clear significance for all factors and interaction.
fit <- aov(Y~(A*B),data=my_data)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
A 1 0.2002 0.2002 130.6 3.11e-06 ***
B 1 1.1224 1.1224 732.0 3.75e-09 ***
A:B 1 0.2494 0.2494 162.7 1.35e-06 ***
Residuals 8 0.0123 0.0015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1