model checking and test of overdispersion for glmer - r

I am testing differences in the number of pollen grains loading on plant stigmas in different habitats and stigma types.
My sample design comprises two habitats, with 10 sites per habitat.
In each site, I have up to 3 stigma types (wet, dry and semidry), and for each stigma type I have a different number of plant species, with a different number of individuals per plant species (code).
So I ended up with a nested design as follows: habitat/site/stigmatype/stigmaspecies/code
As it is a descriptive study, stigmatype, stigmaspecies and code vary between sites.
My response variable (n) is the number of pollen grains (log10+1) per stigma per plant, averaged because I collected 3 stigmas per plant.
The data don't fit a Poisson distribution because (i) they are not integers, and (ii) the variance is much higher than the mean (ratio = 911.0756). So I fitted the model with a negative.binomial family.
After model selection, I have:
m4a <- glmer(n ~ habitat * stigmatype + (1 | stigmaspecies/code),
             family = negative.binomial(2))
> summary(m4a)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: Negative Binomial(2) ( log )
Formula: n ~ habitat * stigmatype + (1 | stigmaspecies/code)
AIC BIC logLik deviance
993.9713 1030.6079 -487.9856 975.9713
Random effects:
Groups Name Variance Std.Dev.
code:stigmaspecies (Intercept) 1.034e-12 1.017e-06
stigmaspecies (Intercept) 4.144e-02 2.036e-01
Residual 2.515e-01 5.015e-01
Number of obs: 433, groups: code:stigmaspecies, 433; stigmaspecies, 41
Fixed effects:
Estimate Std. Error t value Pr(>|z|)
(Intercept) -0.31641 0.08896 -3.557 0.000375 ***
habitatnon-invaded -0.67714 0.10060 -6.731 1.68e-11 ***
stigmatypesemidry -0.24193 0.15975 -1.514 0.129905
stigmatypewet -0.07195 0.18665 -0.385 0.699885
habitatnon-invaded:stigmatypesemidry 0.60479 0.22310 2.711 0.006712 **
habitatnon-invaded:stigmatypewet 0.16653 0.34119 0.488 0.625491
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) hbttn- stgmtyps stgmtypw hbttnn-nvdd:stgmtyps
hbttnn-nvdd -0.335
stgmtypsmdr -0.557 0.186
stigmatypwt -0.477 0.160 0.265
hbttnn-nvdd:stgmtyps 0.151 -0.451 -0.458 -0.072
hbttnn-nvdd:stgmtypw 0.099 -0.295 -0.055 -0.403 0.133
Two questions:
How do I check for overdispersion from this output?
What is the best way to go through model validation here?
I have been using:
qqnorm(resid(m4a))
hist(resid(m4a))
plot(fitted(m4a),resid(m4a))
qqnorm() and hist() seem ok, but there is a tendency toward heteroscedasticity in the 3rd graph. And here is my final question:
Can I go through model validation with these graphs in glmer? Or is there a better way to do it? If not, how much should I worry about the 3rd graph?

A simple way to check for overdispersion in glmer is:
library("blmeco")
dispersion_glmer(your_model) # the resulting value shouldn't be over 1.4
To solve overdispersion I usually add an observation-level random factor.
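For example, a minimal sketch (your_data, y, x and group are placeholder names, not from the question):
your_data$obs <- factor(seq_len(nrow(your_data))) # one random-effect level per observation
m_olre <- glmer(y ~ x + (1|group) + (1|obs), data = your_data, family = poisson)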
For model validation I usually start from these plots... but then it depends on your specific model:
par(mfrow = c(2, 2))
qqnorm(resid(your_model), main = "normal qq-plot, residuals")
qqline(resid(your_model))
qqnorm(ranef(your_model)$id[,1]) # qq-plot of the random effects ('id' = your grouping factor)
qqline(ranef(your_model)$id[,1])
plot(fitted(your_model), resid(your_model)) # residuals vs fitted
abline(h = 0)
your_data$fitted <- fitted(your_model) # fitted vs observed ('total' = your observed response)
plot(your_data$fitted, jitter(your_data$total, 0.1))
abline(0, 1)
hope this helps a little....
cheers

Just an addition to Q1 for those who might find this by googling: the blmeco dispersion_glmer function appears to be outdated. It is better to use @Ben Bolker's function for this purpose:
overdisp_fun <- function(model) {
  rdf <- df.residual(model)
  rp <- residuals(model, type = "pearson")
  Pearson.chisq <- sum(rp^2)
  prat <- Pearson.chisq/rdf
  pval <- pchisq(Pearson.chisq, df = rdf, lower.tail = FALSE)
  c(chisq = Pearson.chisq, ratio = prat, rdf = rdf, p = pval)
}
Source: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion.
With this highlighted caveat:
Do PLEASE note the usual, and extra, caveats noted here: this is an APPROXIMATE estimate of an overdispersion parameter.
PS. Why outdated?
The lme4 package includes the residuals function these days, and Pearson residuals are supposedly more robust for this type of calculation than the deviance residuals. blmeco::dispersion_glmer sums the squared deviance residuals together with the squared u values, divides by the number of residuals, and takes the square root of that value (the function):
dispersion_glmer <- function(modelglmer) {
  n <- length(resid(modelglmer))
  return(sqrt(sum(c(resid(modelglmer), modelglmer@u)^2)/n))
}
The blmeco solution gives considerably higher deviance/df ratios than Bolker's function. Since Ben is one of the authors of the lme4 package, I would trust his solution more, although I am not qualified to explain the statistical reasoning.
x <- InsectSprays
x$id <- rownames(x)
mod <- lme4::glmer(count ~ spray + (1|id), data = x, family = poisson)
blmeco::dispersion_glmer(mod)
# [1] 1.012649
overdisp_fun(mod)
# chisq ratio rdf p
# 55.7160734 0.8571704 65.0000000 0.7873823

How to extract the actual values of parameters and their standard error instead of the marginal effect estimates in lmer from package lme4 in R?

I have this fake dataset that describes the effect of air temperature on the growth of two plant species (a and b).
data1 <- read.csv(text = "
year,block,specie,temperature,growth
2019,1,a,0,7.217496163
2019,1,a,1,2.809792001
2019,1,a,2,16.09505635
2019,1,a,3,24.52673264
2019,1,a,4,49.98455022
2019,1,a,5,35.78568291
2019,2,a,0,8.332533323
2019,2,a,1,16.5997836
2019,2,a,2,11.95833966
2019,2,a,3,34.4
2019,2,a,4,54.19081002
2019,2,a,5,41.1291734
2019,1,b,0,14.07939683
2019,1,b,1,13.73257973
2019,1,b,2,31.33076651
2019,1,b,3,44.81995622
2019,1,b,4,79.27999184
2019,1,b,5,75.0527336
2019,2,b,0,14.18896232
2019,2,b,1,29.00692747
2019,2,b,2,27.83736734
2019,2,b,3,61.46006916
2019,2,b,4,93.91100024
2019,2,b,5,92.47922985
2020,1,a,0,4.117536842
2020,1,a,1,12.70711508
2020,1,a,2,16.09570046
2020,1,a,3,29.49417491
2020,1,a,4,35.94571498
2020,1,a,5,50.74477018
2020,2,a,0,3.490585144
2020,2,a,1,3.817105315
2020,2,a,2,22.43112718
2020,2,a,3,14.4
2020,2,a,4,46.84223604
2020,2,a,5,39.10398717
2020,1,b,0,10.17712428
2020,1,b,1,22.04514586
2020,1,b,2,30.37221799
2020,1,b,3,51.80333619
2020,1,b,4,76.22765452
2020,1,b,5,78.37284714
2020,2,b,0,7.308139613
2020,2,b,1,22.03241605
2020,2,b,2,45.88385871
2020,2,b,3,30.43669633
2020,2,b,4,76.12904988
2020,2,b,5,85.9324324
")
The experiment was conducted over two years in a block design (blocks nested within years). The goal is to report how much growth is affected per unit of change in temperature. There is also a need to provide a measure of uncertainty (standard error) for this estimate. The same needs to be done for the growth recorded at zero degrees of temperature.
library(lme4)
library(lmerTest)
library(lsmeans)
test.model.1 <- lmer(growth ~ specie + temperature + specie*temperature +
                       (1|year) + (1|year:block),
                     data = data1,
                     REML = TRUE,
                     control = lmerControl(check.nobs.vs.nlev = "ignore",
                                           check.nobs.vs.rankZ = "ignore",
                                           check.nobs.vs.nRE = "ignore"))
summary(test.model.1)
The summary give me this output for the fixed effect:
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: growth ~ specie + temperature + specie * temperature + (1 | year) +
(1 | year:block)
Data: data1
Control: lmerControl(check.nobs.vs.nlev = "ignore", check.nobs.vs.rankZ = "ignore",
check.nobs.vs.nRE = "ignore")
REML criterion at convergence: 331.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.6408 -0.7637 0.1516 0.5248 2.4809
Random effects:
Groups Name Variance Std.Dev.
year:block (Intercept) 6.231 2.496
year (Intercept) 0.000 0.000
Residual 74.117 8.609
Number of obs: 48, groups: year:block, 4; year, 2
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.699 3.356 26.256 0.804 0.428
specieb 4.433 4.406 41.000 1.006 0.320
temperature 8.624 1.029 41.000 8.381 2.0e-10 ***
specieb:temperature 7.088 1.455 41.000 4.871 1.7e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) specib tmprtr
specieb -0.656
temperature -0.767 0.584
spcb:tmprtr 0.542 -0.826 -0.707
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
From this I can get the growth at 0 degrees of temperature for specie "a" (2.699) and for specie "b" (2.699 + 4.433 = 7.132). Also, the rate of change in growth per unit change in temperature is 8.624 for specie "a" and 8.624 + 7.088 = 15.712 for specie "b". The problem I have is that the standard error reported in summary() is for the marginal estimate, not for the actual value of the parameter. For instance, the standard error for 4.433 (specieb) is 4.406, but that is not the standard error for the actual growth at 0 degrees for specie "b", which is 7.132. What I am looking for is the standard error of, let's say, 7.132. Also, it would be nice to have all the calculations I did by hand performed automatically.
I made some attempts with emmeans() from the lsmeans package, but I didn't succeed.
emmeans(test.model.1, growth ~ specie*temperature)
Error:
Error in contrast.emmGrid(object = new("emmGrid", model.info = list(call = lmer(formula = growth ~ :
Contrast function 'growth.emmc' not found
I think your main problem is that you don't put the response variable on the left side of the formula you give to emmeans (the package assumes that you're going to use the same response variable as in the original model!). The left-hand side of the formula is reserved for specifying contrasts, e.g. pairwise ~ ...; see help("contrast-methods", package = "emmeans").
I think you might be looking for:
emmeans(test.model.1, ~specie, at = list(temperature=0))
NOTE: Results may be misleading due to involvement in interactions
specie emmean SE df lower.CL upper.CL
a 2.70 3.36 11.3 -4.665 10.1
b 7.13 3.36 11.3 -0.232 14.5
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
If you don't specify the value of temperature, then emmeans uses (I think) the overall average temperature.
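If you want to see exactly which reference values emmeans will use, you can print the reference grid (a quick check with the model above):
library(emmeans)
ref_grid(test.model.1) # shows the grid, including the temperature value that will be used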
For slopes, you want emtrends:
emtrends(test.model.1, ~specie, var = "temperature")
specie temperature.trend SE df lower.CL upper.CL
a 8.62 1.03 41 6.55 10.7
b 15.71 1.03 41 13.63 17.8
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
I highly recommend the extensive and clearly written vignettes for the emmeans package. Since emmeans has so many capabilities it may take a little while to find the answers to your precise questions, but the effort will be repaid in the long term.
As a small picky point, I would say that what summary() gives you are the "actual" parameters that R uses internally, and what emmeans() gives you are the marginal means (as suggested by the name of the package: estimated marginal means ...)
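For reference, the standard error of a combination such as 2.699 + 4.433 can also be computed by hand from the fixed-effect variance-covariance matrix; a minimal sketch with the model above (this is a plain Wald SE, while emmeans applies a Kenward-Roger adjustment by default, so small differences are possible):
b <- fixef(test.model.1) # fixed-effect estimates
V <- vcov(test.model.1)  # their variance-covariance matrix
L <- c(1, 1, 0, 0)       # intercept + specieb: growth of specie "b" at temperature 0
est <- sum(L * b)        # 2.699 + 4.433 = 7.132
se <- sqrt(as.numeric(t(L) %*% V %*% L))
c(estimate = est, SE = se)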

simr: powerSim gives 100% for all effect sizes

I have carried out a binomial GLMM to determine how the latitude and native status (native/non-native) of a set of plant species affect herbivory damage. I am now trying to determine the statistical power of my model when I change the effect sizes. My model looks like this:
latglmm <- glmer(cbind(chewing,total.cells-chewing) ~ scale(latitude) * native.status + scale(sample.day.of.year) + (1|genus) + (1|species) + (1|catalogue.number), family=binomial, data=mna)
where cbind(chewing, total.cells-chewing) gives me a proportion (of leaves with herbivory damage), native.status is either "native" or "non-native", and catalogue.number acts as an observation-level random effect to deal with overdispersion. There are 10 genera, each with at least 1 native and 1 non-native species, making 26 species in total. The model summary is:
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: cbind(chewing, total.cells - chewing) ~ scale(latitude) * native.status +
scale(sample.day.of.year) + (1 | genus) + (1 | species) + (1 | catalogue.number)
Data: mna
AIC BIC logLik deviance df.resid
3986.7 4023.3 -1985.4 3970.7 706
Scaled residuals:
Min 1Q Median 3Q Max
-1.3240 -0.4511 -0.0250 0.1992 1.0765
Random effects:
Groups Name Variance Std.Dev.
catalogue.number (Intercept) 1.26417 1.1244
species (Intercept) 0.08207 0.2865
genus.ID (Intercept) 0.33431 0.5782
Number of obs: 714, groups: catalogue.number, 713; species, 26; genus.ID, 10
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.61310 0.20849 -12.534 < 2e-16 ***
scale(latitude) -0.17283 0.06370 -2.713 0.00666 **
native.statusnon-native 0.11434 0.15554 0.735 0.46226
scale(sample.day.of.year) 0.28521 0.05224 5.460 4.77e-08 ***
scale(latitude):native.statusnon-native -0.02986 0.09916 -0.301 0.76327
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) scallt ntv.s- scaldy
scalelat 0.012
ntv.sttsnn- -0.304 -0.014
scaledoy 0.018 -0.085 -0.027
scllt:ntv.- -0.011 -0.634 0.006 -0.035
I should add that the actual model I have been using is a glmmTMB model, as lme4 still had some overdispersion even with the observation-level random effect; but glmmTMB is not compatible with simr, so I am using lme4 (the results are very similar for both). I want to see what happens to the model power when I increase or decrease the effect sizes of latitude and native status, but when I run fixef(latglmm1)["scale(latitude)"] <- -1 and fixef(latglmm1)["native.statusnon-native"] <- -1 and try this:
powerSim(latglmm, fcompare(~ scale(latitude) + native.status))
I get the following output:
Power for model comparison, (95% confidence interval):
100.0% (69.15, 100.0)
Test: Likelihood ratio
Comparison to ~scale(latitude) + native.status + [re]
Based on 10 simulations, (0 warnings, 0 errors)
alpha = 0.05, nrow = 1428
Time elapsed: 0 h 1 m 5 s
The output is the same (100% power) no matter what I change fixef() to. Based on other similar questions online I have ensured that my data has no NA values, and according to my powerSim output there are no warnings or errors to address. I am completely lost as to why this isn't working, so any help would be greatly appreciated!
Alternatively, if anyone has any recommendations for other methods to carry out similar analyses, I would love to hear them. What I really want is a p-value for each effect size I input, but statistical power would be very valuable too.
Thank you!
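(Addendum for reference: in the usual simr workflow, the modified effect sizes are set on the same object that is then passed to powerSim(), and nsim is raised well above 10 for a stable estimate. A minimal sketch, assuming latglmm is the fitted glmer model:)
library(simr)
latglmm_sim <- latglmm
fixef(latglmm_sim)["scale(latitude)"] <- -1 # effect size to simulate
fixef(latglmm_sim)["native.statusnon-native"] <- -1
powerSim(latglmm_sim, fcompare(~ scale(latitude) + native.status), nsim = 100) # run on the modified object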

Multivariable regression interaction term with categorical variables

I am kind of new to R and am working on a glm model. I wanted to look at the interaction effect of BMI groups and patient groups (4 groups) on mortality (binary) in a subgroup analysis. I have the following code:
model <- glm(death~patient.group*bmi.group, data = data, family = "binomial")
summary(model)
and I get the following:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4798903 0.0361911 -96.153 < 2e-16 ***
patient.group2 0.0067614 0.0507124 0.133 0.894
patient.group3 0.0142658 0.0503444 0.283 0.777
patient.group4 0.0212416 0.0497523 0.427 0.669
bmi.group2 0.1009282 0.0478828 2.108 0.035 *
bmi.group3 0.2397047 0.0552043 4.342 1.41e-05 ***
patient.group2:bmi.group2 -0.0488768 0.0676473 -0.723 0.470
patient.group3:bmi.group2 -0.0461319 0.0672853 -0.686 0.493
patient.group4:bmi.group2 -0.1014986 0.0672675 -1.509 0.131
patient.group2:bmi.group3 -0.0806240 0.0791977 -1.018 0.309
patient.group3:bmi.group3 -0.0008951 0.0785683 -0.011 0.991
patient.group4:bmi.group3 -0.0546519 0.0795683 -0.687 0.492
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So, as displayed, I get a p-value for each patient.group:bmi.group term. My question is: is there a way to get a single p-value for the patient.group:bmi.group interaction as a whole, instead of one for each subgroup? I have tried to look for answers online but I still could not find the answer :(
Many thanks in advance.
It depends on whether you regard your patient and BMI groups as factors or continuous covariates. If they are covariates, @jay.sf's suggestion is appropriate. It fits a single degree of freedom term for the interaction between the linear effect of patient group and the linear effect of BMI group.
But this depends on both the ordering and definition of the groups. It assumes, for example, that the "difference" between patient groups 1 and 2 is the same as that between patient groups 2 and 3 and so on. Is the ordering of patient groups such that, in some way, group 1 < group 2 < group 3 < group 4? Similarly for BMI. This model would also assume that a change of 1 unit on the patient scale was "the same" as a change of one unit on the BMI scale. I don't know if these are reasonable assumptions.
It would be more usual to consider both patient group and BMI group as factors. This assumes no ordering in groups, nor that the difference between any two groups was equal to that between any other two. In this case, jay.sf's suggestion would give a misleading answer.
To illustrate my point...
First, generate some artificial data as you haven't provided any:
library(tidyverse) # for tibble(), expand(), mutate(), rbernoulli()
data <- tibble() %>%
  expand(patient.group = 1:4, bmi.group = 1:3, rep = 1:5) %>%
  mutate(
    z = -0.25 * patient.group + 0.75 * bmi.group,
    death = rbernoulli(nrow(.), exp(z)/exp(1 + z))
  ) %>%
  select(-z)
Fit a simple continuous covariate model with interaction, as per jay.sf's suggestion:
covariateModel <- glm(death~patient.group * bmi.group, data = data, family = "binomial")
summary(covariateModel)
Giving, in part
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6962 1.8207 -1.481 0.139
patient.group 0.7407 0.6472 1.144 0.252
bmi.group 1.2697 0.8340 1.523 0.128
patient.group:bmi.group -0.3807 0.2984 -1.276 0.202
Here, the p value for the patient.group:bmi.group interaction is a Wald test based on a single degree of freedom z test.
A slightly more complicated approach is necessary to fit the factor model with interaction and obtain a test for the "overall" interaction effect.
mainEffectModel <- glm(death~as.factor(patient.group) + as.factor(bmi.group), data = data, family = "binomial")
interactionModel <- glm(death~as.factor(patient.group) * as.factor(bmi.group), data = data, family = "binomial")
anova(mainEffectModel, interactionModel, test="Chisq")
Giving
Analysis of Deviance Table
Model 1: death ~ as.factor(patient.group) + as.factor(bmi.group)
Model 2: death ~ as.factor(patient.group) * as.factor(bmi.group)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 54 81.159
2 48 70.579 6 10.58 0.1023
Here, the change in deviance is a likelihood ratio test and is distributed as a chi-squared statistic on (4-1) x (3-1) = 6 degrees of freedom.
The two approaches give similar answers using my particular dataset, but they may not always do so. Both are statistically correct, but which one is most appropriate depends on your particular situation. We don't have enough information to comment.
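The same overall test can also be obtained in one step with drop1(), which performs the likelihood ratio test for dropping the highest-order term (a sketch using the interactionModel fitted above):
drop1(interactionModel, test = "Chisq") # LRT for removing the interaction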
This excellent post provides more context.

glht() and lsmeans() can't find contrast in lmer() model

I have the following situation:
my fixed-effect model finds a main effect of Relation_PenultimateLast in the group of participants called 'composers'. I therefore want to find which levels of Relation_PenultimateLast differ statistically from the others.
f.e.model.composers = lmer(Score ~ Relation_PenultimateLast + (1|TrajectoryType) + (1|StimulusType) + (1|Relation_FirstLast) + (1|LastPosition), data=datasheet.complete.composers)
summary(f.e.model.composers)
Random effects:
Groups Name Variance Std.Dev.
TrajectoryType (Intercept) 0.005457 0.07387
LastPosition (Intercept) 0.036705 0.19159
Relation_FirstLast (Intercept) 0.004298 0.06556
StimulusType (Intercept) 0.019197 0.13855
Residual 1.318116 1.14809
Number of obs: 2200, groups:
TrajectoryType, 25; LastPosition, 8; Relation_FirstLast, 4; StimulusType, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.90933 0.12476 14.84800 23.320 4.15e-13 ***
Relation_PenultimateLast 0.09987 0.02493 22.43100 4.006 0.000577 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I have to make a Tukey comparison of my lmer() model.
Now, I found two methods for the comparison among Relation_PenultimateLast levels (from here: https://stats.stackexchange.com/questions/237512/how-to-perform-post-hoc-test-on-lmer-model):
summary(glht(f.e.model.composers, linfct = mcp(Relation_PenultimateLast = "Tukey")), test = adjusted("holm"))
and
lsmeans(f.e.model.composers, list(pairwise ~ Relation_PenultimateLast), adjust = "holm")
These do not work.
The former reports:
Variable(s) ‘Relation_PenultimateLast’ of class ‘integer’ is/are not contained as a factor in ‘model’
The latter:
Relation_PenultimateLast lsmean SE df lower.CL upper.CL
2.6 3.168989 0.1063552 8.5 2.926218 3.41176
Degrees-of-freedom method: satterthwaite
Confidence level used: 0.95
$` of contrast`
contrast estimate SE df z.ratio p.value
(nothing) nonEst NA NA NA NA
Can somebody help me understand why I have this result?
First, it's important to realize that the model you have fitted is inappropriate. It uses Relation_PenultimateLast as a numeric predictor; thus it fits a linear trend to its values 1, 2, 3, and 4, rather than separate estimates for each level of this as a factor. I also wonder, given the plot you show, why Test is not in the model; it looks like it should be (again as a factor, not a numeric predictor). I suggest that you get some statistical consulting help to check that you are using appropriate models in your research. Perhaps you could give a graduate student in statistics some grounding in practical applications -- a win-win proposition.
To model Relation_PenultimateLast as a factor, one way is to replace it in the model formula with factor(Relation_PenultimateLast). That will work for lsmeans() but not glht(). A better way is probably to change it in the dataset:
datasheet.complete.composers = transform(datasheet.complete.composers,
Relation_PenultimateLast = factor(Relation_PenultimateLast))
f.e.model.composers = lmer(...) ### (as before, assuming Test isn't needed)
(BTW, you must be a heck of a better typist than I am; I'd use shorter names, though I do applaud using informative ones.)
(Note: is f.e.model.composers supposed to suggest a fixed-effects model? It isn't one; it is a mixed model. Again, a consultant...)
The lsmeans package is destined to be deprecated, so I suggest you use its continuation, the emmeans package:
library(emmeans)
emmeans(f.e.model.composers, pairwise ~ Relation_PenultimateLast)
I suggest using the default "tukey" adjustment rather than Holm for this application.
If indeed Test should be in the model, then it looks like you need to include the interaction; so it'd go something like this:
model.composers = lmer(Score ~ Relation_PenultimateLast * factor(Test) + ...)
### A plot like the one shown, but based on the model predictions:
emmip(model.composers, Relation_PenultimateLast ~ Test)
### Estimates and comparisons of Relation_PenultimateLast for each Test:
emmeans(model.composers, pairwise ~ Relation_PenultimateLast | Test)

How to obtain Poisson's distribution "lambda" from R glm() coefficients

My R-script produces glm() coeffs below.
What is Poisson's lambda, then? It should be ~3.0 since that's what I used to create the distribution.
Call:
glm(formula = h_counts ~ ., family = poisson(link = log), data = pois_ideal_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-22.726 -12.726 -8.624 6.405 18.515
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.222532 0.015100 544.53 <2e-16 ***
h_mids -0.363560 0.004393 -82.75 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 11451.0 on 10 degrees of freedom
Residual deviance: 1975.5 on 9 degrees of freedom
AIC: 2059
Number of Fisher Scoring iterations: 5
random_pois = rpois(10000,3)
h=hist(random_pois, breaks = 10)
mean(random_pois) #verifying that the mean is close to 3.
h_mids = h$mids
h_counts = h$counts
pois_ideal_data <- data.frame(h_mids, h_counts)
pois_ideal_model <- glm(h_counts ~ ., pois_ideal_data, family=poisson(link=log))
summary_ideal=summary(pois_ideal_model)
summary_ideal
What are you doing here???!!! You used a glm to fit a distribution???
Well, it is not impossible to do so, but it is done via this:
set.seed(0)
x <- rpois(10000,3)
fit <- glm(x ~ 1, family = poisson())
i.e., we fit data with an intercept-only regression model.
fit$fitted[1]
# 3.005
This is the same as:
mean(x)
# 3.005
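Equivalently, since the default poisson() family uses a log link, lambda is just the exponentiated intercept:
unname(exp(coef(fit)))
# 3.005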
It looks like you're trying to do a Poisson fit to aggregated or binned data; that's not what glm does. I took a quick look for canned ways of fitting distributions to binned data but couldn't find one; it looks like earlier versions of the bda package might have offered this, but not now.
At root, what you need to do is set up a negative log-likelihood function based on the sum of (# counts)*log(prob(count|lambda)) and minimize it using optim(); the solution given below using the bbmle package is a little more complex up front, but gives you added benefits like easily computing confidence intervals etc.
Set up data:
set.seed(101)
random_pois <- rpois(10000,3)
tt <- table(random_pois)
dd <- data.frame(counts = unname(c(tt)),
                 val = as.numeric(names(tt)))
Here I'm using table rather than hist because histograms on discrete data are fussy (having integer cutpoints often makes things confusing because you have to be careful about right- vs left-closure)
Set up density function for binned Poisson data (to work with bbmle's formula interface, the first argument must be called x, and it must have a log argument).
dpoisbin <- function(x, val, lambda, log = FALSE) {
  probs <- dpois(val, lambda, log = TRUE)
  r <- sum(x * probs)
  if (log) r else exp(r)
}
Fit lambda (log link helps prevent numerical problems/warnings from negative lambda values):
library(bbmle)
m1 <- mle2(counts ~ dpoisbin(val, exp(loglambda)),
           data = dd,
           start = list(loglambda = 0))
all.equal(unname(exp(coef(m1))),mean(random_pois),tol=1e-6) ## TRUE
exp(confint(m1))
## 2.5 % 97.5 %
## 2.972047 3.040009
