anova.rq() in the quantreg package in R

I'm interested in comparing estimates from different quantiles (same outcome, same covariates) using the anova.rqlist function, called via anova() in the quantreg package in R. However, the math in the function is beyond my rudimentary expertise. Let's say I fit three models at different quantiles:
library(quantreg)
data(Mammals) # data in quantreg to be used as a useful example
fit1 <- rq(weight ~ speed + hoppers + specials, tau = .25, data = Mammals)
fit2 <- rq(weight ~ speed + hoppers + specials, tau = .5, data = Mammals)
fit3 <- rq(weight ~ speed + hoppers + specials, tau = .75, data = Mammals)
Then I compare them using:
anova(fit1, fit2, fit3, test="Wald", joint=FALSE)
My question is: which of these models is being used as the basis of the comparison?
My understanding of the Wald test, from the Wikipedia entry, is that the test statistic has the form
W = (θ^ − θ0)² / Var(θ^)
where θ^ is the estimate of the parameter of interest θ that is compared with the proposed value θ0.
So my question is what is the anova function in quantreg choosing as the θ0?
Based on the p-value returned from the anova, my best guess is that it is choosing the lowest quantile specified (i.e. tau = 0.25). Is there a way to specify the median (tau = 0.5), or better yet the mean estimate obtained using lm(y ~ x1 + x2 + x3, data)?
anova(fit1, fit2, fit3, joint=FALSE)
actually produces
Quantile Regression Analysis of Deviance Table
Model: weight ~ speed + hoppers + specials
Tests of Equality of Distinct Slopes: tau in { 0.25 0.5 0.75 }
Df Resid Df F value Pr(>F)
speed 2 319 1.0379 0.35539
hoppersTRUE 2 319 4.4161 0.01283 *
specialsTRUE 2 319 1.7290 0.17911
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
while
anova(fit3, fit1, fit2, joint=FALSE)
produces the exact same result
Quantile Regression Analysis of Deviance Table
Model: weight ~ speed + hoppers + specials
Tests of Equality of Distinct Slopes: tau in { 0.5 0.25 0.75 }
Df Resid Df F value Pr(>F)
speed 2 319 1.0379 0.35539
hoppersTRUE 2 319 4.4161 0.01283 *
specialsTRUE 2 319 1.7290 0.17911
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The order of the models is clearly being changed in the anova, but how is it that the F value and Pr(>F) are identical in both tests?

All the quantiles you input are used; no single model serves as the reference.
I suggest you read this post and the related answer to understand what your "theta.0" is.
I believe what you are trying to do is to test whether the regression lines are parallel, in other words, whether the effects of the predictor variables are uniform across quantiles.
You can use anova() from the quantreg package to answer this question. You should indeed use a separate fit for each quantile.
When you use joint=FALSE as you did, you get coefficient-wise comparisons: one line and one p-value per coefficient. In your output, for instance, the hoppersTRUE line tells you that the effect of hoppers is not uniform across quantiles.
You can do an overall test of equality of the entire sets of coefficients if you do not use joint=FALSE; that gives you a "Joint Test of Equality of Slopes" and therefore only one p-value.
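For example, simply dropping joint=FALSE gives the joint test:
anova(fit1, fit2, fit3)  # Joint Test of Equality of Slopes: a single p-value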
EDIT:
I think theta.0 is the average slope across all 'tau' values, or the actual estimate from 'lm()', rather than the slope of any one of the models. My reasoning is that 'anova.rq()' gives no special role to the lowest 'tau' or even the median 'tau'.
There are several ways to test this. Either do the calculations by hand with theta.0 equal to the average value, or compare many combinations of models: you could find a situation where some of your models are close to the model with the lowest 'tau' but not to the 'lm()' estimate. If theta.0 were the slope of the model with the lowest 'tau', your Pr(>F) would then be high, whereas in the other case it would be low.
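To illustrate the by-hand idea, here is a rough Wald comparison of the 'speed' slope at tau = 0.25 vs. tau = 0.75 (a sketch only: it uses bootstrap standard errors and ignores the covariance between the two quantile estimates, which anova.rq() does account for):
s1 <- summary(fit1, se = "boot")$coefficients  # tau = 0.25
s3 <- summary(fit3, se = "boot")$coefficients  # tau = 0.75
b  <- s1["speed", "Value"] - s3["speed", "Value"]
se <- sqrt(s1["speed", "Std. Error"]^2 + s3["speed", "Std. Error"]^2)
W  <- (b / se)^2                        # Wald statistic, approx. chi-squared(1) under H0
pchisq(W, df = 1, lower.tail = FALSE)   # p-value for equality of the two slopes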
This question should maybe have been asked on cross-validated.

Related

R code to test the difference between coefficients of regressors from one panel regression

I am trying to compare two regression coefficients from the same panel regression used over two different time periods, in order to confirm the statistical significance of the difference. Running my panel regression first with observations over 2007-2009, I get an estimate of the coefficient I am interested in, to be compared with the estimate of the same coefficient obtained from the same panel model applied over 2010-2017.
Based on R code to test the difference between coefficients of regressors from one regression, I tried to compute a likelihood ratio test. The linked discussion uses a simple linear equation. If I use the same commands in R as described in the answer, I get results based on a chi-squared distribution, and I don't understand whether and how I can interpret them.
In R, I did the following:
linearHypothesis(reg.pannel.recession.fe, "Exp_Fri=0.311576")
where reg.pannel.recession.fe is the panel regression over 2007-2009, Exp_Fri is the coefficient of this regression I want to compare, and 0.311576 is the estimated coefficient over 2010-2017.
I get the following results using linearHypothesis():
How can I interpret that? Should I use another function, since these are plm objects?
Thank you very much for your help.
You get an F test in that example because, as stated in the vignette:
The method for "lm" objects calls the default method, but it changes the default test to "F" [...]
You can also set the test to F explicitly, but basically linearHypothesis() works whenever the standard error of the coefficient can be estimated from the variance-covariance matrix, as also stated in the vignette:
The default method will work with any model object for which the coefficient vector can be retrieved by 'coef' and the coefficient-covariance matrix by 'vcov' (otherwise the argument 'vcov.' has to be set explicitly)
So using an example from the package:
library(plm)
data(Grunfeld)
wi <- plm(inv ~ value + capital,
          data = Grunfeld, model = "within", effect = "twoways")
linearHypothesis(wi,"capital=0.3",test="F")
Linear hypothesis test
Hypothesis:
capital = 0.3
Model 1: restricted model
Model 2: inv ~ value + capital
Res.Df Df F Pr(>F)
1 170
2 169 1 6.4986 0.01169 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
linearHypothesis(wi,"capital=0.3")
Linear hypothesis test
Hypothesis:
capital = 0.3
Model 1: restricted model
Model 2: inv ~ value + capital
Res.Df Df Chisq Pr(>Chisq)
1 170
2 169 1 6.4986 0.0108 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
And you can also use a t-test:
tested_value <- 0.3
BETA <- coefficients(wi)["capital"]
SE <- coefficients(summary(wi))["capital", 2]
tstat <- (BETA - tested_value) / SE
# abs() makes the two-sided p-value correct regardless of the sign of tstat
pvalue <- as.numeric(2 * pt(-abs(tstat), wi$df.residual))
pvalue
[1] 0.01168515
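As the vignette quote above suggests, you can also supply a different coefficient covariance matrix through the vcov. argument. A sketch, assuming you want heteroskedasticity-robust inference using plm's vcovHC():
# same hypothesis, but with a robust (Arellano) covariance matrix
linearHypothesis(wi, "capital = 0.3", vcov. = vcovHC(wi, method = "arellano"), test = "F")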

One-way ANOVA for stratified samples in R

I have a stratified sample with three groups ("a", "b", "c") that were drawn from a larger population N. All groups have 30 observations, but their proportions in N are unequal; hence their sampling weights differ.
I use the survey package in R to calculate summary statistics and linear regression models and would like to know how to calculate a one-way ANOVA correcting for the survey design (if necessary).
My assumption, and please correct me if I'm wrong, is that the standard error should normally be higher for a stratum whose weight is smaller; hence a simple ANOVA that does not account for the survey design would not be reliable.
Here is an example. Any help would be appreciated.
## Oneway- ANOVA tests in R for surveys with stratified sampling-design
library("survey")
# create test data
test.df <- data.frame(
  id = 1:90,
  variable = c(rnorm(n = 30, mean = 150, sd = 10),
               rnorm(n = 30, mean = 150, sd = 10),
               rnorm(n = 30, mean = 140, sd = 10)),
  groups = c(rep("a", 30),
             rep("b", 30),
             rep("c", 30)),
  weights = c(rep(1, 30),     # undersampled
              rep(1, 30),
              rep(100, 30)))  # oversampled
# correct for survey design
test.df.survey <- svydesign(id = ~id,
                            strata = ~groups,
                            weights = ~weights,
                            data = test.df)
## descriptive statistics
# boxplot
svyboxplot(~variable~groups,test.df.survey)
# means
svyby(~variable,~groups,test.df.survey,svymean)
# variances
svyby(~variable,~groups,test.df.survey,svyvar)
### ANOVA ###
## One-way ANOVA without correcting for survey design
summary(aov(formula = variable~groups,data = test.df))
Hmm, this is an interesting question; as far as I know, it'd be difficult to account for weights in a one-way ANOVA. So I decided to show you the way I'd approach this problem.
I'm going to use a two-way ANOVA and then some post hoc tests.
First of all, let's build a linear model based on your data and check what it looks like.
library(car)
library(agricolae)
model.lm = lm(variable ~ groups * weights, data = test.df)
shapiro.test(resid(model.lm))
Shapiro-Wilk normality test
data: resid(model.lm)
W = 0.98238, p-value = 0.263
leveneTest(variable ~ groups * factor(weights), data = test.df)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2  2.6422 0.07692 .
      87
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The distribution is close to normal, but the variances differ between groups, so variance homogeneity, which a parametric test like ANOVA assumes, is questionable. However, let's perform the test anyway.
Several plots to check that our data fits to this test:
hist(resid(model.lm))
plot(model.lm)
Here is the interpretation of the plots; they don't actually look bad.
Let's run two-way anova:
anova(model.lm)
Analysis of Variance Table
Response: variable
Df Sum Sq Mean Sq F value Pr(>F)
groups 2 2267.8 1133.88 9.9566 0.0001277 ***
Residuals 87 9907.8 113.88
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As you can see, the results are very close to yours. Now for a post hoc test:
(result.hsd = HSD.test(model.lm, list('groups', 'weights')))
$statistics
MSerror Df Mean CV MSD
113.8831 87 147.8164 7.2195 6.570186
$parameters
test name.t ntr StudentizedRange alpha
Tukey groups:weights 3 3.372163 0.05
$means
variable std r Min Max Q25 Q50 Q75
a:1 150.8601 11.571185 30 113.3240 173.0429 145.2710 151.9689 157.8051
b:1 151.8486 8.330029 30 137.1907 176.9833 147.8404 150.3161 154.7321
c:100 140.7404 11.762979 30 118.0823 163.9753 131.6112 141.1810 147.8231
$comparison
NULL
$groups
variable groups
b:1 151.8486 a
a:1 150.8601 a
c:100 140.7404 b
attr(,"class")
[1] "group"
And maybe some different way:
aov_cont<- aov(test.df$variable ~ test.df$groups * test.df$weights)
summary(aov_cont)
Df Sum Sq Mean Sq F value Pr(>F)
test.df$groups 2 2268 1133.9 9.957 0.000128 ***
Residuals 87 9908 113.9
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(TukeyHSD(aov_cont))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = test.df$variable ~ test.df$groups * test.df$weights)
$`test.df$groups`
diff lwr upr p adj
b-a 0.9884608 -5.581725 7.558647 0.9315792
c-a -10.1197048 -16.689891 -3.549519 0.0011934
c-b -11.1081657 -17.678352 -4.537980 0.0003461
Summarizing, the results are very close to yours. Personally, I'd run a two-way ANOVA with the (*) symbol, or with (+) when you are sure that your variables are independent (an additive model).
Group c, with the bigger weight, differs substantially from groups a and b.
According to the main statistician of our institute, there is no easy implementation of this kind of analysis in any common modeling environment. The reason is that ANOVA and ANCOVA are linear models that were not further developed after the emergence of general linear models (later generalized linear models, GLMs) in the 70s.
A normal linear regression model yields practically the same results as an ANOVA but is much more flexible regarding variable choice. Since weighting methods exist for GLMs (see the survey package in R), there is no real need to develop methods to weight for stratified sampling designs in ANOVA... simply use a GLM instead.
summary(svyglm(variable~groups,test.df.survey))
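If you also want a single ANOVA-style test of the overall group effect under the survey design, one option is a design-based Wald test on the model term (a sketch using survey::regTermTest()):
m <- svyglm(variable ~ groups, design = test.df.survey)
regTermTest(m, ~groups)  # tests that all 'groups' coefficients are jointly zero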

glht() and lsmeans() can't find contrast in lmer() model

I have the following situation:
My fixed-effects model finds a main effect of Relation_PenultimateLast in the group of participants called 'composers'. I therefore want to find which levels of Relation_PenultimateLast differ statistically from the others.
f.e.model.composers = lmer(Score ~ Relation_PenultimateLast + (1|TrajectoryType) + (1|StimulusType) + (1|Relation_FirstLast) + (1|LastPosition), data=datasheet.complete.composers)
summary(f.e.model.composers)
Random effects:
Groups Name Variance Std.Dev.
TrajectoryType (Intercept) 0.005457 0.07387
LastPosition (Intercept) 0.036705 0.19159
Relation_FirstLast (Intercept) 0.004298 0.06556
StimulusType (Intercept) 0.019197 0.13855
Residual 1.318116 1.14809
Number of obs: 2200, groups:
TrajectoryType, 25; LastPosition, 8; Relation_FirstLast, 4; StimulusType, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.90933 0.12476 14.84800 23.320 4.15e-13 ***
Relation_PenultimateLast 0.09987 0.02493 22.43100 4.006 0.000577 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I have to make a Tukey comparison of my lmer() model.
Now, I find two methods for the comparison among Relation_PenultimateLast levels (I have found them in here: https://stats.stackexchange.com/questions/237512/how-to-perform-post-hoc-test-on-lmer-model):
summary(glht(f.e.model.composers, linfct = mcp(Relation_PenultimateLast = "Tukey")), test = adjusted("holm"))
and
lsmeans(f.e.model.composers, list(pairwise ~ Relation_PenultimateLast), adjust = "holm")
These do not work.
The former reports:
Variable(s) ‘Relation_PenultimateLast’ of class ‘integer’ is/are not contained as a factor in ‘model’
The latter:
Relation_PenultimateLast lsmean SE df lower.CL upper.CL
2.6 3.168989 0.1063552 8.5 2.926218 3.41176
Degrees-of-freedom method: satterthwaite
Confidence level used: 0.95
$` of contrast`
contrast estimate SE df z.ratio p.value
(nothing) nonEst NA NA NA NA
Can somebody help me understand why I have this result?
First, it's important to realize that the model you have fitted is inappropriate. It uses Relation_PenultimateLast as a numeric predictor; thus it fits a linear trend to its values 1, 2, 3, and 4, rather than separate estimates for each level of this as a factor. I also wonder, given the plot you show, why Test is not in the model; it looks like it should be (again as a factor, not a numeric predictor). I suggest that you get some statistical consulting help to check that you are using appropriate models in your research. Perhaps you could give a graduate student in statistics some grounding in practical applications -- a win-win proposition.
To model Relation_PenultimateLast as a factor, one way is to replace it in the model formula with factor(Relation_PenultimateLast). That will work for lsmeans() but not glht(). A better way is probably to change it in the dataset:
datasheet.complete.composers = transform(datasheet.complete.composers,
Relation_PenultimateLast = factor(Relation_PenultimateLast))
f.e.model.composers = lmer(...) ### (as before, assuming Test isn't needed)
(BTW, you must be a heck of a better typist than I am; I'd use shorter names, though I do applaud using informative ones.)
(Note: is f.e.model.composers supposed to suggest a fixed-effects model? It isn't one; it is a mixed model. Again, a consultant...)
The lsmeans package is destined to be deprecated, so I suggest you use its continuation, the emmeans package:
library(emmeans)
emmeans(f.e.model.composers, pairwise ~ Relation_PenultimateLast)
I suggest using the default "tukey" adjustment rather than Holm for this application.
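If you do want Holm for comparability with the glht() call, emmeans() accepts it via the adjust argument:
emmeans(f.e.model.composers, pairwise ~ Relation_PenultimateLast, adjust = "holm")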
If indeed Test should be in the model, then it looks like you need to include the interaction; so it'd go something like this:
model.composers = lmer(Score ~ Relation_PenultimateLast * factor(Test) + ...)
### A plot like the one shown, but based on the model predictions:
emmip(model.composers, Relation_PenultimateLast ~ Test)
### Estimates and comparisons of Relation_PenultimateLast for each Test:
emmeans(model.composers, pairwise ~ Relation_PenultimateLast | Test)

How to get probability from GLM output

I'm extremely stuck at the moment, trying to figure out how to calculate the probability from my glm() output in R. I know the data are very insignificant, but I would really love to be shown how to get the probability from an output like this. I was thinking of trying inv.logit(), but didn't know what variables to put within the brackets.
The data are from an occupancy study. I'm assessing the success of a hair-trap method versus a camera trap in detecting three species (red squirrel, pine marten and invasive grey squirrel). I wanted to see what affected detection (or non-detection) of the various species. One hypothesis was that the detection of another focal species at the site would affect the detectability of the red squirrel. Given that the pine marten is a predator of the red squirrel and the grey squirrel is a competitor, the presence of those two species at a site might affect the detectability of the red squirrel.
Would this show the probability? inv.logit(-1.1455 - 0.1322 * NonRSevents_before1stRS)
glm(formula = RS_sticky ~ NonRSevents_before1stRS, family = binomial(link = "logit"), data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7432 -0.7432 -0.7222 -0.3739 2.0361
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1455 0.4677 -2.449 0.0143 *
NonRSevents_before1stRS -0.1322 0.1658 -0.797 0.4255
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.575 on 33 degrees of freedom
Residual deviance: 33.736 on 32 degrees of freedom
(1 observation deleted due to missingness)
AIC: 37.736
Number of Fisher Scoring iterations: 5
If you want to predict the probability of response for a specified set of values of the predictor variable:
pframe <- data.frame(NonRSevents_before1stRS=4)
predict(fitted_model, newdata=pframe, type="response")
where fitted_model is the result of your glm() fit, which you stored in a variable. You may not be familiar with the R approach to statistical analysis, which is to store the fitted model as an object/in a variable, then apply different methods to it (summary(), plot(), predict(), residuals(), ...)
(This is obviously only a made-up example: I don't know if 4 is a reasonable value for the NonRSevents_before1stRS variable.)
You can specify several different values to predict for at the same time: data.frame(NonRSevents_before1stRS=c(4,5,6,7,8)).
If you have multiple predictors, you have to specify a value for every predictor for every prediction, e.g. data.frame(x=4:8, y=mean(orig_data$y), ...).
If you want the predicted probabilities for the observations in your original data set, just predict(fitted_model, type="response")
You're correct that inv.logit() (from a bunch of different packages, don't know which you're using) or plogis() (from base R, essentially the same) will translate from the logit or log-odds scale to the probability scale, so
plogis(predict(fitted_model))
would also work (predict provides predictions on the link-function [in this case logit/log-odds] scale by default).
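To connect this to the output in the question: with the fitted coefficients, the predicted probability at, say, NonRSevents_before1stRS = 4 (an arbitrary illustrative value) is
plogis(-1.1455 - 0.1322 * 4)  # ~0.158
which is exactly the inv.logit() calculation proposed in the question.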
A logistic regression models the log odds of the response, and its coefficients are log odds ratios. We'll illustrate how to interpret the coefficients with the space shuttle autolander data from the MASS package.
After loading the data, we'll create a binary dependent variable where:
1 = autolander used,
0 = autolander not used.
We will also create a binary independent variable for shuttle stability:
1 = stable positioning
0 = unstable positioning.
Then, we'll run glm() with family=binomial(link="logit"). Since the coefficients are log odds ratios, we'll exponentiate them to turn them back into odds ratios.
library(MASS)
str(shuttle)
shuttle$stable <- 0
shuttle[shuttle$stability =="stab","stable"] <- 1
shuttle$auto <- 0
shuttle[shuttle$use =="auto","auto"] <- 1
fit <- glm(use ~ factor(stable),family=binomial(link = "logit"),data=shuttle) # specifies base as unstable
summary(fit)
exp(fit$coefficients)
...and the output:
> fit <- glm(use ~ factor(stable),family=binomial(link = "logit"),data=shuttle) # specifies base as unstable
>
> summary(fit)
Call:
glm(formula = use ~ factor(stable), family = binomial(link = "logit"),
data = shuttle)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1774 -1.0118 -0.9566 1.1774 1.4155
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.747e-15 1.768e-01 0.000 1.0000
factor(stable)1 -5.443e-01 2.547e-01 -2.137 0.0326 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 350.36 on 255 degrees of freedom
Residual deviance: 345.75 on 254 degrees of freedom
AIC: 349.75
Number of Fisher Scoring iterations: 4
> exp(fit$coefficients)
(Intercept) factor(stable)1
1.0000000 0.5802469
>
The intercept of 0 is the log odds for unstable, and the coefficient of -0.5443 is the log odds ratio for stable relative to unstable. After exponentiating the coefficients, we observe that the odds of autolander use under the condition of an unstable shuttle are 1.0, and are multiplied by 0.58 if the shuttle is stable. This means that the autolander is less likely to be used if the shuttle has stable positioning.
Calculating probability of autolander use
We can do this in two ways. First, the manual approach: exponentiate the coefficients and convert the odds to probabilities using the following equation.
p = odds / (1 + odds)
With the shuttle autolander data it works as follows.
# convert intercept to probability
odds_i <- exp(fit$coefficients[1])
odds_i / (1 + odds_i)
# convert stable="stable" to probability
odds_p <- exp(fit$coefficients[1]) * exp(fit$coefficients[2])
odds_p / (1 + odds_p)
...and the output:
> # convert intercept to probability
> odds_i <- exp(fit$coefficients[1])
> odds_i / (1 + odds_i)
(Intercept)
0.5
> # convert stable="stable" to probability
> odds_p <- exp(fit$coefficients[1]) * exp(fit$coefficients[2])
> odds_p / (1 + odds_p)
(Intercept)
0.3671875
>
The probability of autolander use when a shuttle is unstable is 0.5, and decreases to 0.37 when the shuttle is stable.
The second approach to generate probabilities is to use the predict() function.
# convert to probabilities with the predict() function
predict(fit,data.frame(stable="0"),type="response")
predict(fit,data.frame(stable="1"),type="response")
Note that the output matches the manually calculated probabilities.
> # convert to probabilities with the predict() function
> predict(fit,data.frame(stable="0"),type="response")
1
0.5
> predict(fit,data.frame(stable="1"),type="response")
1
0.3671875
>
Applying this to the OP data
We can apply these steps to the glm() output from the OP as follows.
coefficients <- c(-1.1455,-0.1322)
exp(coefficients)
odds_i <- exp(coefficients[1])
odds_i / (1 + odds_i)
# convert nonRSEvents = 1 to probability
odds_p <- exp(coefficients[1]) * exp(coefficients[2])
odds_p / (1 + odds_p)
# simulate up to 10 nonRSEvents prior to RS
coef_df <- data.frame(nonRSEvents=0:10,
intercept=rep(-1.1455,11),
nonRSEventSlope=rep(-0.1322,11))
coef_df$nonRSEventValue <- coef_df$nonRSEventSlope *
coef_df$nonRSEvents
coef_df$intercept_exp <- exp(coef_df$intercept)
coef_df$slope_exp <- exp(coef_df$nonRSEventValue)
coef_df$odds <- coef_df$intercept_exp * coef_df$slope_exp
coef_df$probability <- coef_df$odds / (1 + coef_df$odds)
# print the odds & probabilities by number of nonRSEvents
coef_df[,c(1,7:8)]
...and the final output.
> coef_df[,c(1,7:8)]
nonRSEvents odds probability
1 0 0.31806 0.24131
2 1 0.27868 0.21794
3 2 0.24417 0.19625
4 3 0.21393 0.17623
5 4 0.18744 0.15785
6 5 0.16423 0.14106
7 6 0.14389 0.12579
8 7 0.12607 0.11196
9 8 0.11046 0.09947
10 9 0.09678 0.08824
11 10 0.08480 0.07817
>

How to obtain Poisson's distribution "lambda" from R glm() coefficients

My R script produces the glm() coefficients below.
What is the Poisson lambda, then? It should be ~3.0, since that's what I used to create the distribution.
Call:
glm(formula = h_counts ~ ., family = poisson(link = log), data = pois_ideal_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-22.726 -12.726 -8.624 6.405 18.515
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.222532 0.015100 544.53 <2e-16 ***
h_mids -0.363560 0.004393 -82.75 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 11451.0 on 10 degrees of freedom
Residual deviance: 1975.5 on 9 degrees of freedom
AIC: 2059
Number of Fisher Scoring iterations: 5
Here is the script that produced the output above:
random_pois = rpois(10000,3)
h=hist(random_pois, breaks = 10)
mean(random_pois) #verifying that the mean is close to 3.
h_mids = h$mids
h_counts = h$counts
pois_ideal_data <- data.frame(h_mids, h_counts)
pois_ideal_model <- glm(h_counts ~ ., pois_ideal_data, family=poisson(link=log))
summary_ideal=summary(pois_ideal_model)
summary_ideal
What are you doing here?! You used a glm to fit a distribution?
Well, it is not impossible to do so, but it is done like this:
set.seed(0)
x <- rpois(10000,3)
fit <- glm(x ~ 1, family = poisson())
i.e., we fit data with an intercept-only regression model.
fit$fitted[1]
# 3.005
This is the same as:
mean(x)
# 3.005
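Equivalently, since poisson() uses a log link, exponentiating the intercept recovers lambda directly:
exp(coef(fit))  # same as mean(x): the intercept is log(lambda)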
It looks like you're trying to do a Poisson fit to aggregated or binned data; that's not what glm does. I took a quick look for canned ways of fitting distributions to binned data but couldn't find one; it looks like earlier versions of the bda package might have offered this, but not now.
At root, what you need to do is set up a negative log-likelihood function that computes -sum(counts * log(prob(count | lambda))) and minimize it using optim(); the solution given below using the bbmle package is a little more complex up front, but gives you added benefits like easily computing confidence intervals, etc.
Set up data:
set.seed(101)
random_pois <- rpois(10000,3)
tt <- table(random_pois)
dd <- data.frame(counts = unname(c(tt)),
                 val = as.numeric(names(tt)))
Here I'm using table() rather than hist() because histograms on discrete data are fussy (having integer cutpoints often makes things confusing, because you have to be careful about right- vs. left-closure).
Set up a density function for binned Poisson data (to work with bbmle's formula interface, the first argument must be called x, and it must have a log argument):
dpoisbin <- function(x, val, lambda, log = FALSE) {
  # x: observed counts per bin; val: the corresponding Poisson values
  probs <- dpois(val, lambda, log = TRUE)
  r <- sum(x * probs)   # log-likelihood of the binned data
  if (log) r else exp(r)
}
Fit lambda (log link helps prevent numerical problems/warnings from negative lambda values):
library(bbmle)
m1 <- mle2(counts ~ dpoisbin(val, exp(loglambda)),
           data = dd,
           start = list(loglambda = 0))
all.equal(unname(exp(coef(m1))),mean(random_pois),tol=1e-6) ## TRUE
exp(confint(m1))
## 2.5 % 97.5 %
## 2.972047 3.040009
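For reference, the same negative log-likelihood can be minimized directly with optim(), without bbmle (a minimal sketch; you lose conveniences such as the confint() method):
# negative log-likelihood of the binned data, parameterized by log(lambda)
nll <- function(loglambda) -sum(dd$counts * dpois(dd$val, exp(loglambda), log = TRUE))
opt <- optim(0, nll, method = "BFGS")
exp(opt$par)  # ~3, essentially mean(random_pois)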
