Poisson model for non-integer

I have a GLM with (quasi)poisson family.
My dataset has 3 variables:
rate_data
rate_benchmark
X
So fitting the model:
model <- glm(formula = rate_data ~ offset(log(rate_benchmark)) + X - 1, family = quasipoisson, data = data)  # or family = poisson
model_null <- glm(formula = rate_data ~ offset(log(rate_benchmark)) - 1, family = quasipoisson, data = data)  # or family = poisson
When using "poisson" it gives me warnings about non-integer values, which it doesn't give me for quasipoisson. However, when testing whether my beta is zero with anova(model_null, model, test = "LRT"), the two families give completely different deviances (and hence different p-values).
Which model am I supposed to use? My first thought was using quasipoisson, but no warnings does not necessarily mean it is correct.

The Poisson and quasi-Poisson models differ in their assumptions about the form of the function relating the mean and variance of each observation. The Poisson assumes the variance equals the mean; the quasi-Poisson assumes that $\sigma^2 = \theta\mu$, which reduces to the Poisson when $\theta=1$. Consequently, the deviance and p-values will, as you have observed, be different between the two models.
You can in fact run a Poisson regression on non-integer data, at least in R; you'll still get the "right" coefficient estimates etc. The warnings are there as warnings; they don't represent an algorithm failure. Here's an example:
z <- 1 + 2*runif(100)
x <- rgamma(100,2,sqrt(z + z*z/2))
summary(glm(x~z, family=poisson))
... blah blah blah ...
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.56390    0.17214   3.276  0.00105 **
z            0.43119    0.08368   5.153 2.57e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
... blah blah blah ...
There were 50 or more warnings (use warnings() to see the first 50)
Now we'll compare to a pure "quasi" model with the same link function and relationship between mean and variance; the "quasi" model makes no assumptions about integer values for the target variable:
summary(glm(x~z, family=quasi(link="log", variance="mu")))
... stuff ...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.5639     0.2621   2.151  0.03392 *
z             0.4312     0.1274   3.384  0.00103 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasi family taken to be 2.318851)
Note that the parameter estimates are exactly the same, but the standard errors are different; this is due to the different calculations of variance, as reflected by the different dispersion parameters.
Now for the quasi-Poisson model, which will, again, give us the same parameter estimates as the Poisson model, but with different standard errors:
summary(glm(x~z, family=quasipoisson))
... more stuff ...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.5639     0.2621   2.151  0.03392 *
z             0.4312     0.1274   3.384  0.00103 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 2.31885)
Since the mean-variance relationship and link functions are the same as in the "quasi" model, the model results are the same also.
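The link between the two sets of standard errors can be checked directly. The following is a minimal sketch on freshly simulated data (the seed and the names fit_pois, fit_quasi, and phi are illustrative and not part of the original answer, so the numbers will differ from the output above). It shows that the coefficients agree and that the quasi-Poisson standard errors are the Poisson ones multiplied by the square root of the estimated dispersion:

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility
z <- 1 + 2 * runif(100)
x <- rgamma(100, shape = 2, rate = sqrt(z + z * z / 2))

# The Poisson fit warns about non-integer responses; the warnings are harmless here
fit_pois  <- suppressWarnings(glm(x ~ z, family = poisson))
fit_quasi <- glm(x ~ z, family = quasipoisson)

# Estimated dispersion from the quasi-Poisson fit
phi <- summary(fit_quasi)$dispersion

se_pois  <- coef(summary(fit_pois))[, "Std. Error"]
se_quasi <- coef(summary(fit_quasi))[, "Std. Error"]

# Same point estimates; standard errors differ by a factor of sqrt(phi)
print(all.equal(coef(fit_pois), coef(fit_quasi)))
print(all.equal(se_quasi, se_pois * sqrt(phi)))
```

With the figures quoted above, 0.17214 * sqrt(2.318851) is about 0.2621, which matches the quasi-Poisson standard error of the intercept.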

The Poisson distribution deals with counts -- the actual number of objects you counted in a defined volume, or the actual number of events you counted in a defined period of time.
If you normalized to a rate, the distribution is not Poisson.
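If the underlying counts and exposures are available, one common way to keep the model on the count scale is to put the exposure in as an offset. The sketch below uses simulated data (count, exposure, and x are hypothetical names, not the asker's variables). It compares that approach with fitting the rate directly using the exposure as prior weights, which yields the same coefficient estimates but triggers the non-integer warnings:

```r
set.seed(42)  # arbitrary seed
n <- 200
exposure <- runif(n, 10, 100)                 # hypothetical exposure (e.g. time at risk)
x <- rnorm(n)
count <- rpois(n, lambda = exposure * exp(-1 + 0.5 * x))

# Count model with log(exposure) as an offset: the response stays integer-valued
fit_count <- glm(count ~ x + offset(log(exposure)), family = poisson)

# Rate model weighted by exposure: non-integer response, hence the warnings
rate <- count / exposure
fit_rate <- suppressWarnings(glm(rate ~ x, weights = exposure, family = poisson))

# The two fits solve the same estimating equations
print(all.equal(coef(fit_count), coef(fit_rate), tolerance = 1e-5))
```

Whether this applies to the question depends on whether rate_data was derived from a count and a known exposure; if only the rates are available, the sketch does not carry over directly.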


What is ga() in the gamlss package doing?

I have been looking into the gamlss package for fitting semiparametric models and came across something strange in the ga() function. Even if the model is specified as having a gamma distribution, fitted using REML, the output for the model is Gaussian, fitted using GCV.
Example:
library(mgcv)
library(gamlss)
library(gamlss.add)
data(rent)
ga3 <- gam(R~s(Fl)+s(A), method="REML", data=rent, family=Gamma(log))
gn3 <- gamlss(R~ga(~s(Fl)+s(A), method="REML"), data=rent, family=GA)
Model summary for the GAM:
summary(ga3)
Family: Gamma
Link function: log
Formula:
R ~ s(Fl) + s(A)
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.667996   0.008646   771.2   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
        edf Ref.df      F p-value
s(Fl) 1.263  1.482 442.53  <2e-16 ***
s(A)  4.051  4.814  36.34  <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.302 Deviance explained = 28.8%
-REML = 13979 Scale est. = 0.1472 n = 1969
Model summary for the GAMLSS:
summary(getSmo(gn3))
Family: gaussian
Link function: identity
Formula:
Y.var ~ s(Fl) + s(A)
Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.306e-13  8.646e-03       0        1
Approximate significance of smooth terms:
        edf Ref.df      F p-value
s(Fl) 1.269  1.492 440.14  <2e-16 ***
s(A)  3.747  4.469  38.83  <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.294 Deviance explained = 29.6%
GCV = 0.97441 Scale est. = 0.97144 n = 1969
Question:
Why does the model output report the wrong distribution and fitting method? Or am I missing something, and this is in fact correct?
When you use the ga() function, gamlss calls the gam() function from mgcv in the background without specifying the family. As a result, the smoothers are fitted assuming a normal distribution, which is why the fitted smoother reports family: gaussian and link function: identity. Note also that the scale estimate returned when using ga() relates to that normal distribution.
Yes, when using the ga() function, each gamlss iteration calls the gam() function from mgcv in the background. On each iteration it uses the correct local working variable and local weights for a gamma distribution.
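The idea of fitting a non-Gaussian model through repeated Gaussian fits on a working variable can be illustrated with plain iteratively reweighted least squares. This is a generic sketch for a gamma model with log link on simulated data, not the actual gamlss or gamlss.add internals (whose working variable and weights are defined by their own algorithms). All names here are hypothetical:

```r
set.seed(1)  # arbitrary seed
n <- 500
x <- runif(n)
mu_true <- exp(1 + 2 * x)
y <- rgamma(n, shape = 5, rate = 5 / mu_true)   # gamma response with mean mu_true

eta <- log(y)                       # starting values for the linear predictor
for (i in 1:25) {
  mu <- exp(eta)
  z  <- eta + (y - mu) / mu         # working response for the log link
  w  <- rep(1, n)                   # working weights: (dmu/deta)^2 / V(mu) = mu^2 / mu^2
  fit <- lm(z ~ x, weights = w)     # each step is a Gaussian, identity-link fit
  eta <- fitted(fit)
}

# The inner fit is an ordinary "lm", yet its coefficients match the gamma GLM
fit_glm <- glm(y ~ x, family = Gamma(link = "log"))
print(rbind(irls = coef(fit), glm = coef(fit_glm)))
```

Inspecting only the inner fit would show a Gaussian model with identity link, even though the overall procedure maximises a gamma likelihood. That is the same pattern as the getSmo() output in the question.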

Binary Logistic Regression output

I'm an undergrad student and am currently struggling with R. I've been trying to teach myself for weeks, but I'm not a natural, so I thought I'd seek some support.
I'm currently trying to analyse the interaction of my variables on recall of a target using logistic regression, as specified by my tutor. I have a 2 (condition: isolate vs. control) by 2 (list type: similarity vs. difference) study, and my dependent variable is binary recall (yes or no). I've tried to clean my data and run the code:
Call:
glm(formula = Target ~ Condition * List, family = "binomial",
data = pro)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.8297  -0.3288   0.6444   0.6876   2.4267
Coefficients:
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)                1.4663     0.6405   2.289 0.022061 *
Conditionisolate          -1.1097     0.8082  -1.373 0.169727
Listsim                   -4.3567     1.2107  -3.599 0.000320 ***
Conditionisolate:Listsim   5.3218     1.4231   3.740 0.000184 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 97.736 on 70 degrees of freedom
Residual deviance: 65.869 on 67 degrees of freedom
AIC: 73.869
That's my output. It completely ignores the difference and control conditions. I know I'm doing something wrong and I'm feeling quite exasperated by it. Can anyone help me?
In the model output, R is treating control and difference as the baseline levels of your two variables. The outcome associated with them is wrapped up in the intercept. For other combinations of variable levels, the coefficients show how those differ from that baseline.
Control/Difference: just use the intercept
Control/Similarity: intercept + Listsim
Isolate/Difference: intercept + Conditionisolate
Isolate/Similarity: intercept + Listsim + Conditionisolate + Conditionisolate:Listsim
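The sums above are on the log-odds scale. A short sketch, using the estimates printed in the question's output (the vector and element names are made up for illustration), shows how to turn them into predicted recall probabilities with plogis(), the inverse logit:

```r
# Coefficient estimates copied from the summary output in the question
b <- c(intercept = 1.4663, isolate = -1.1097, sim = -4.3567, isolate_sim = 5.3218)

# Linear predictor (log-odds of recall) for each of the four cells
log_odds <- c(
  control_difference = b[["intercept"]],
  control_similarity = b[["intercept"]] + b[["sim"]],
  isolate_difference = b[["intercept"]] + b[["isolate"]],
  isolate_similarity = b[["intercept"]] + b[["sim"]] + b[["isolate"]] + b[["isolate_sim"]]
)

# Inverse logit converts log-odds to probabilities
prob <- plogis(log_odds)
print(round(prob, 3))
```

With the fitted model object itself, predict(model, newdata, type = "response") on a data frame of the four condition/list combinations would give the same cell probabilities without copying numbers by hand.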

R code to test the difference between coefficients of regressors from one panel regression

I am trying to compare two regression coefficients from the same panel regression, estimated over two different time periods, to check whether their difference is statistically significant. I first run my panel regression on observations from 2007-2009 and obtain an estimate of the coefficient I am interested in. I then want to compare it with the estimate of the same coefficient obtained from the same panel model applied over the period 2010-2017.
Based on "R code to test the difference between coefficients of regressors from one regression", I tried to compute a likelihood ratio test. The linked discussion uses a simple linear equation. If I use the same R commands as described in the answer there, I get results based on a chi-squared distribution, and I don't understand whether or how I can interpret them.
In R, I did the following:
linearHypothesis(reg.pannel.recession.fe, "Exp_Fri=0.311576")
where reg.pannel.recession.fe is the panel regression over the period 2007-2009, Exp_Fri is the coefficient from this regression that I want to compare, and 0.311576 is the coefficient estimated over the period 2010-2017.
I get the following results using linearHypothesis():
How can I interpret that? Should I use another function, since these are plm objects?
Thank you very much for your help.
You get an F test in that example because, as stated in the vignette:
The method for "lm" objects calls the default method, but it changes the default test to "F" [...]
You can also set the test to F, but basically linearHypothesis works whenever the standard error of the coefficient can be estimated from the variance-covariance matrix, as also said in the vignette:
The default method will work with any model
object for which the coefficient vector can be retrieved by ‘coef’
and the coefficient-covariance matrix by ‘vcov’ (otherwise the
argument ‘vcov.’ has to be set explicitly)
So using an example from the package:
library(plm)
library(car)  # provides linearHypothesis()
data(Grunfeld)
wi <- plm(inv ~ value + capital,
          data = Grunfeld, model = "within", effect = "twoways")
linearHypothesis(wi, "capital=0.3", test = "F")
Linear hypothesis test
Hypothesis:
capital = 0.3
Model 1: restricted model
Model 2: inv ~ value + capital
  Res.Df Df      F  Pr(>F)
1    170
2    169  1 6.4986 0.01169 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
linearHypothesis(wi,"capital=0.3")
Linear hypothesis test
Hypothesis:
capital = 0.3
Model 1: restricted model
Model 2: inv ~ value + capital
  Res.Df Df  Chisq Pr(>Chisq)
1    170
2    169  1 6.4986     0.0108 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
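And you can also run the t-test by hand:
tested_value <- 0.3
BETA <- coefficients(wi)["capital"]
SE <- coefficients(summary(wi))["capital", 2]  # column 2 holds the standard error
tstat <- (BETA - tested_value) / SE
# two-sided p-value; abs() keeps it valid when the estimate is below the tested value
pvalue <- as.numeric(2 * pt(-abs(tstat), wi$df.residual))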
pvalue
[1] 0.01168515
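The three results above are the same test seen from different angles. As a short worked check, writing $\beta_0 = 0.3$ for the tested value and $\nu = 169$ for the residual degrees of freedom reported in the output:

$$
t = \frac{\hat\beta - \beta_0}{\operatorname{SE}(\hat\beta)}, \qquad
p = 2\,P\left(T_{\nu} > |t|\right), \qquad
F = t^2 \sim F_{1,\nu}.
$$

With a single restriction, $F = 6.4986$ corresponds to $|t| = \sqrt{6.4986} \approx 2.549$, which is why the F test and the hand-computed t-test both give $p \approx 0.0117$. The chi-squared version uses the same statistic but refers it to a $\chi^2_1$ distribution instead of $F_{1,169}$, giving the slightly smaller $p = 0.0108$.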

Using glmrob function in R with proportional/percentage Dependent Variable

I have been using the glmrob function in R to determine whether belonging to a certain strategy (3 categories) increases the percentage (0 - 1) of successful reports made. However, my results show coefficients of 2.1+, which I believe should be impossible since the DV can only range from 0 to 1.
I have run the function with tcc = 1.345 and tcc = 3.5 with no differences in the results. I'm not sure if there is something wrong with my code, since I have followed what I could find on Stack Exchange and Stack Overflow.
rob5 <- glmrob(Num_ofSuccess/Total_No_Reports ~ RGamDesStrat3,
weights = Total_No_Reports, family = binomial, data = MyMasterThesisDataRSTUDIOV3,
method = "Mqle", control = glmrobMqle.control(tcc = 3.5))
Example of my code.
Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)          0.5108     0.3266   1.564    0.118
RGamDesStrat3(0,1]   2.2088     0.4744   4.656 3.22e-06 ***
RGamDesStrat3(1,2]   2.1638     0.4346   4.979 6.40e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Robustness weights w.r * w.x:
All 40 weights are ~= 1.
However, I would have expected coefficients of around 0.3 or 0.4 for both predictors.
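One thing worth checking is the scale of the coefficients. Assuming the default logit link for family = binomial (the call above does not set a link), the estimates are log-odds rather than proportions, so they are not bounded by 0 and 1. A minimal sketch, using the estimates from the output above with made-up element names, converts them to fitted proportions:

```r
# Estimates copied from the glmrob output above
b0 <- 0.5108   # intercept: log-odds of success in the reference category
b1 <- 2.2088   # RGamDesStrat3(0,1]: difference in log-odds from the reference
b2 <- 2.1638   # RGamDesStrat3(1,2]: difference in log-odds from the reference

# plogis() is the inverse logit; results are proportions between 0 and 1
p <- plogis(c(reference = b0, level_0_1 = b0 + b1, level_1_2 = b0 + b2))
print(round(p, 3))

# Odds ratios relative to the reference category
print(round(exp(c(level_0_1 = b1, level_1_2 = b2)), 2))
```

On this reading, a coefficient of about 2.2 is a large shift in log-odds, not a proportion above 1.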

Why step() returns weird results from backward elimination for full model using lmerTest

I am confused about why the results of running step(model) in lmerTest look abnormal.
library(lmerTest)  # loads lme4 and provides step() for lmer fits
m0 <- lmer(seed ~ connection*age + (1|unit), data = test)
step(m0)
Note: both "connection" and "age" have been converted with as.factor().
Random effects:
     Chi.sq Chi.DF elim.num p.value
unit   0.25      1        1  0.6194
Fixed effects:
Analysis of Variance Table
Response: y
               Df  Sum Sq  Mean Sq F value  Pr(>F)
connection      1 0.01746 0.017457  1.5214 0.22142
age             1 0.07664 0.076643  6.6794 0.01178 *
connection:age  1 0.04397 0.043967  3.8317 0.05417 .
Residuals      72 0.82617 0.011475
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Least squares means:
Estimate Standard Error DF t-value Lower CI Upper CI p-value
Final model:
Call:
lm(formula = fo, data = mm, contrasts = l.lmerTest.private.contrast)
Coefficients:
     (Intercept)       connectionD              ageB  connectionD:ageB
        -0.84868          -0.07852           0.01281           0.09634
Why does it not show me the final model?
What happened here is that the random effect was eliminated as non-significant (NS) according to the LR test. The anova method was then applied to the resulting fixed-effects model, an "lm" object, and no elimination of NS fixed effects was done. You are right that the output differs from that for "lmer" objects and that there are no (differences of) least squares means. If you want the latter, you may try the lsmeans package. For backward elimination of NS effects in the final model, you may use the stats::step function.
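As a rough illustration of that last suggestion, the sketch below refits the fixed-effects part as an ordinary lm and passes it to stats::step. The data frame is simulated, since the original test data are not shown; its factor levels and values are placeholders, so the selected model carries no meaning. Note that stats::step selects terms by AIC rather than by p-values, so its result need not agree with elimination based on significance tests.

```r
set.seed(1)  # arbitrary seed
# Hypothetical stand-in for the asker's data: two two-level factors, 76 rows
test <- data.frame(
  connection = factor(rep(c("C", "D"), each = 38)),
  age        = factor(rep(c("A", "B"), times = 38))
)
test$seed <- rnorm(nrow(test))  # placeholder response

# Fixed-effects model left after the random effect has been dropped
m_lm <- lm(seed ~ connection * age, data = test)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
m_final <- stats::step(m_lm, direction = "backward", trace = 0)
print(formula(m_final))
```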
