Bayesian Question: Exponential Prior and Poisson Likelihood: Posterior?

I need some assistance with a particular question and confirmation of my understanding.
The belief is that absences in a company follow a Poisson(λ) distribution.
It is additionally believed that the probability that the value of λ is less than 5 is 0.75, so an exponential distribution is chosen as the prior for λ. You take a random sample of 50 students and find out the number of absences that each has had over the past semester.
The data are summarised below; note that 0 and 1 are binned together.
Number of absences:  ≤1    2    3    4    5    6    7    8    9   10
Frequency:           18   13    8    3    4    3    0    0    0    1
Therefore, in order to calculate the posterior distribution, my understanding is that it is prior × likelihood, which in this case is an Exponential(1/2.56) prior and a Poisson likelihood, with the belief that the probability of being less than 5 is 0.75 incorporated via
-ln(1 - 0.75)/(1/2.56) = 3.5489.
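For reference, a quick check of that quantile calculation in R (this assumes the Exponential(1/2.56) prior described above):
# 75th percentile of an Exponential distribution with rate 1/2.56
rate <- 1/2.56
-log(1 - 0.75)/rate        # 3.548903, matching the hand calculation above
qexp(0.75, rate = rate)    # the same value via the built-in quantile function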
Furthermore, a similar thread calculated the posterior to be a Gamma(sum(xi) + 1, n + θ), where θ denotes the rate of the exponential prior.
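For reference, the conjugacy behind that result (standard algebra, using θ for the prior's rate to keep it distinct from the Poisson mean λ):

$$p(\lambda \mid x) \propto \theta e^{-\theta\lambda}\prod_{i=1}^{n}\frac{\lambda^{x_i}e^{-\lambda}}{x_i!} \propto \lambda^{\sum_i x_i}\, e^{-(n+\theta)\lambda},$$

which is the kernel of a Gamma(∑ xi + 1, n + θ) distribution in (shape, rate) form.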
With those assumptions, I have some code to visualise this:
x <- seq(from = 0, to = 10, by = 1)
plot(x, dexp(x, rate = 0.390625), type = "l", col = "red")   # exponential prior with rate 1/2.56
lines(x, dpois(x, 3.54890), col = "blue")                    # Poisson pmf with mean 3.5489
lines(x, dgamma(x, 128 + 1, 50 + 3.54890), col = "green")    # proposed Gamma posterior
Any help or clarification surrounding this would be greatly appreciated.

Related

Random sampling with normal distribution from existing data in R

I have a large dataset of individuals who have rated some items (x1:x10). For each individual, the ratings have been combined into a total score (ranging 0-5). Now, I would like to draw two subsamples with the same sample size, in which the total score has a specific mean (1.5 and 3) and follows a normal distribution. Individuals might be part of both subsamples.
My guess is that sampling from a vector (the total score) with the outlined specifications would work. Unfortunately, I have only found ways to draw random samples from a vector, not a way to sample around a specific mean.
EDIT:
As pointed out, a normal distribution would not be possible. Instead, I am looking for a way to sample from a binomial distribution (directly from the data, without the workaround of creating a similar distribution and matching).
You can't have normally distributed data on a discrete scale with hard limits. A sample drawn from a normal distribution with a mean between 0 and 5 would be symmetrical around the mean, would take continuous rather than discrete values, and would have a non-zero probability of containing values of less than zero and more than 5.
You want your sample to contain discrete values between zero and five and to have a central tendency around the mean. To emulate scores with a particular mean you need to sample from the binomial distribution using rbinom.
# Draw n scores on a 0-5 scale whose expected mean is m
get_n_samples_averaging_m <- function(n, m) {
  rbinom(n, 5, m/5)
}
Now you can do
samp <- get_n_samples_averaging_m(40, 1.5)
print(samp)
# [1] 1 3 2 1 3 2 2 1 1 1 1 2 0 3 0 0 2 2 2 3 1 1 1 1 1 2 1 2 0 1 4 2 0 2 1 3 2 0 2 1
mean(samp)
# [1] 1.5
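As a quick illustrative check (not in the original answer): a Binomial(5, m/5) draw has mean m, so with a large n the sample mean settles near the target.
# With many draws the sample mean converges to the requested m
mean(get_n_samples_averaging_m(1e5, 1.5))   # approximately 1.5
mean(get_n_samples_averaging_m(1e5, 3))     # approximately 3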

test what factors determine modelling performance

I am analyzing the performance (estimation error) of a range of forecasting methods (population dynamics models).
To do so, I applied each method to a range of population dynamics (model fits to different species), starting from 2 different years (performance might depend on how the species was doing at the time).
This gives an output similar to this:
df1 <- expand.grid(species = letters[1:3],
                   time = 1:2,
                   method = LETTERS[11:13])
df1$error <- rnorm(nrow(df1), 0, 10)
species time method error
1 a 1 K 3.93566373
2 b 1 K -26.95329369
3 c 1 K 3.68377329
4 a 2 K -21.68417747
5 b 2 K 6.59804377
6 c 2 K -4.53913733
7 a 1 L -6.94936825
8 b 1 L -0.06846303
...
I am not really interested in the modelling performance for each species, but rather in the underlying factors.
Therefore, for each species I calculated a bunch of characteristics from the time at which I start the forecast:
df2 <- data.frame(species=rep(letters[1:3],2),
time=rep(1:2,each=3),
char1=c(rnorm(3,c(1:3),0.1),rnorm(3,c(1:3),0.1)),
char2=c(rnorm(3,c(100,200,300),20),rnorm(3,c(100,200,300),20)),
char3=c(rnorm(3,c(20,26,40),0.1),rnorm(3,c(20,26,40),0.1)))
species time char1 char2 char3
1 a 1 1.0573675 127.38703 20.06343
2 b 1 2.0849696 215.25302 26.04236
3 c 1 3.1334384 308.42294 39.97982
4 a 2 0.9499281 82.63552 19.99231
5 b 2 2.0510098 214.59121 26.06874
6 c 2 3.0868793 310.00532 40.01716
Note that most (but not all) characteristics are similar for the same species/population.
They are not significantly correlated (R2), but some are clearly dependent (e.g. standard deviation and temporal autocorrelation in the historical dynamics).
Some of the forecasting methods are more similar to others and their performance is hence not completely independent either.
How do I know which characteristics have the most influence on projection performance, in general and by method?
For instance, char1 might largely determine performance of all projection methods, but could also especially influence method K.
The idea is to find what method might best be used given a set of population characteristics.
I looked into several statistical techniques, but usually the assumptions are violated.
My best idea so far is to use a type of linear regression for each method separately (error ~ char1 + char2 + (1 | time)) and make a table of, for instance, the deviance explained by each characteristic (see the sketch after the table below).
method char1 char2 char3
K 5% 10% 3%
L 3% 20% 9%
M 0% 55% 6%
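For what it's worth, here is a minimal sketch of that per-method idea, assuming lme4 and the df1/df2 objects above; on such a tiny toy dataset the fits are only illustrative and may warn about singular fits:
library(lme4)
# Combine the errors with the population characteristics
dat <- merge(df1, df2, by = c("species", "time"))
# One mixed model per forecasting method: error ~ characteristics + (1 | time)
fits <- lapply(split(dat, dat$method), function(d)
  lmer(error ~ char1 + char2 + char3 + (1 | time), data = d))
lapply(fits, summary)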
However, I feel highly unsure about whether this is the right approach. What other statistical methods could I consider to answer my questions?
Thank you

two phase sample weights

I am using the survey package to analyze data from a two-phase survey. The first phase included ~5000 cases, and the second ~2700. Weights were calculated beforehand to adjust for several variables (ethnicity, sex, etc.) AND (as far as I understand) for the decrease in sample size when performing phase 2.
I'm interested in proportions of binary variables, e.g. suicides in sick vs. healthy individuals.
An example of a simple output I receive in the overall sample:
table(df$schiz, df$suicide)
       0    1
  0 4857    8
  1   24    0
An example of a simple output I receive in the second phase sample only:
table(df2$schiz, df2$suicide)
       0    1
  0 2685    5
  1   24    0
And within the phase two sample with the weights included:
dfw <- svydesign(ids = ~1, data = df2, weights = df2$weights)
svytable(~schiz + suicide, design = dfw)
     suicide
schiz       0    1
    0 2701.51 2.67
    1   18.93 0.00
My question is: Shouldn't the weights correct for the decrease in N when moving from phase 1 to phase 2? i.e. shouldn't the total N of the second table after correction be ~5000 cases, and not ~2700 as it is now?
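For reference (a sketch, assuming the dfw design and weights column above): the cells of svytable are simply sums of the weights, so you can check directly what total N the weights imply.
# The weighted table's grand total equals the sum of the weights in the design
sum(df2$weights)
sum(svytable(~schiz + suicide, design = dfw))   # the same value, up to any missing values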

R: how to estimate a fixed effects model with weights

I would like to run a fixed-effects model using OLS with weighted data.
Since there can be some confusion, I mean to say that I used "fixed effects" here in the sense that economists usually imply, i.e. a "within model", or in other words individual-specific effects. What I actually have is "multilevel" data, i.e. observations of individuals, and I would like to control for their region of origin (and have corresponding clustered standard errors).
Sample data:
library(multilevel)
data(bhr2000)
weight <- runif(length(bhr2000$GRP),min=1,max=10)
bhr2000 <- data.frame(bhr2000,weight)
head(bhr2000)
GRP AF06 AF07 AP12 AP17 AP33 AP34 AS14 AS15 AS16 AS17 AS28 HRS RELIG weight
1 1 2 2 2 4 3 3 3 3 5 5 3 12 2 6.647987
2 1 3 3 3 1 4 3 3 4 3 3 3 11 1 6.851675
3 1 4 4 4 4 3 4 4 4 2 3 4 12 3 8.202567
4 1 3 4 4 4 3 3 3 3 3 3 4 9 3 1.872407
5 1 3 4 4 4 4 4 3 4 2 4 4 9 3 4.526455
6 1 3 3 3 3 4 4 3 3 3 3 4 8 1 8.236978
The kind of model I would like to estimate is:
AF06_ij = beta_0 + beta_1 AP34_ij + alpha_1 * (GRP == 1) + alpha_2 * (GRP==2) +... + e_ij
where i refers to specific individuals and j refers to the group they belong to.
Moreover, I would like observations to be weighted by weight (sampling weights).
However, I would like to get "clustered standard errors", to reflect possible GRP-specific heteroskedasticity. In other words, E(e_ij)=0 but Var(e_ij)=sigma_j^2 where the sigma_j can be different for each GRP j.
If I understood correctly, nlme and lme4 can only estimate random-effects models (or so-called mixed models), but not fixed-effects models in the "within" sense.
I tried the package plm, which looked ideal for what I wanted to do, but it does not allow for weights. Any other idea?
I think this is more of a Stack Exchange question, but setting aside fixed effects with model weights: you shouldn't be using OLS for an ordered categorical response variable. This is an ordered logistic modelling type of analysis, so below I use the data you have provided to fit one of those.
Just to be clear, we have an ordered categorical response, "AF06", and two predictors. The first, "AP34", is also an ordered categorical variable; the second, "GRP", is your fixed effect. Generally, you can create a group fixed effect by coercing the variable in question to a factor on the RHS. (I'm really trying to stay away from statistical theory because this isn't the place for it, so I might be inaccurate in some of the things I'm saying.)
The code below fits an ordered logistic model using the polr (proportional odds logistic regression) function. I've tried to interpret what you were going for in terms of model specification, but at the end of the day OLS is not the right way forward. The call to coefplot will have a very crowded y-axis; I just wanted to present a very rudimentary start at how you might interpret this, and I'd try to visualise it in a more refined way. As for interpretation, you will need to work on that, but I think this is generally the right method. The best resource I can think of is chapters 5 and 6 of "Data Analysis Using Regression and Multilevel/Hierarchical Models" by Gelman and Hill. It's such a good resource that I'd really recommend you read the whole thing and try to master it if you're interested in this type of analysis going forward.
library(multilevel) # To get the data
library(MASS)       # To get the polr modelling function
library(arm)        # To get the tools, insight and expertise of Andrew Gelman and his team
# The data
data(bhr2000)
weight <- runif(length(bhr2000$GRP), min = 1, max = 10)
bhr2000 <- data.frame(bhr2000, weight)
head(bhr2000)
# The model
m <- polr(factor(AF06) ~ AP34 + factor(GRP), weights = weight, data = bhr2000,
          Hess = TRUE, method = "logistic")
summary(m)
coefplot(m, cex.var = .6) # from the arm package
Check out the lfe package: it does econ-style fixed effects and you can specify clustering.
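Following up on that suggestion, here is a minimal sketch (an illustration only, assuming the bhr2000 data with the weight column created above) of a weighted within-GRP regression with GRP-clustered standard errors via lfe::felm, whose formula parts are outcome ~ covariates | fixed effects | instruments | cluster:
library(lfe)
# Weighted "within" regression: GRP fixed effects, standard errors clustered by GRP
fit <- felm(AF06 ~ AP34 | GRP | 0 | GRP,
            data = bhr2000, weights = bhr2000$weight)
summary(fit)
Note that this treats AF06 as numeric, which is exactly the OLS simplification the answer above cautions against.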

No zeros predicted from zeroinfl object in R?

I created a zero-inflated negative binomial model and want to investigate how many of the zeros were partitioned out to sampling zeros versus structural zeros. How do I implement this in R? The example code on the zeroinfl help page is not clear to me.
data("bioChemists", package = "pscl")
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
table(round(predict(fm_zinb2, type="zero")))
  0   1
891  24
table(round(bioChemists$art))
  0   1   2   3   4   5   6   7   8   9  10  11  12  16  19
275 246 178  84  67  27  17  12   1   2   1   1   2   1   1
What is this telling me?
When I do the same for my data, I get a readout that just has the sample size listed under 1. Thanks.
The details are in the paper by Zeileis (2008) available at https://www.jstatsoft.org/article/view/v027i08/v27i08.pdf
It's a little bit of work (a couple of years and your question was still unanswered) to gather together all the explanations of what the predict function does for each model in the pscl library, and it's buried (pp 19,23) in the mathematical expression of the likelihood function (equations 7, 8). I have interpreted your question to mean that you want/need to know how to use different types of predict:
What is the expected count? (type="response")
What is the (conditional) expected probability of an excess zero? (type="zero")
What is the (marginal) expected probability of any count? (type="prob")
And finally how many predicted zeros are excess (eg sampling) rather than regression based (ie structural)?
To read in the data that comes with the pscl package:
data("bioChemists", package = "pscl")
Then fit a zero-inflated negative binomial model:
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
If you wish to predict the expected values, then you use
predict(fm_zinb2, type="response")[29:31]
29 30 31
0.5213736 1.7774268 0.5136430
So under this model, the expected number of articles published in the last 3 years of a PhD is one half for biochemists 29 and 31 and nearly 2 for biochemist 30.
But I believe that you are after the probability of an excess zero (in the point mass at zero). This command does that and prints off the values for items in row 29 to 31 (yes I went fishing!):
predict(fm_zinb2, type="zero")[29:31]
It produces this output:
29 30 31
0.58120120 0.01182628 0.58761308
So the probability that the 29th item is an excess zero (which you refer to as a sampling zero, i.e. a non-structural zero and hence not explained by the covariates) is 58%, for the 30th it is 1.2%, and for the 31st it is 59%. So that's two biochemists who are predicted to have zero publications in excess of what can be explained by the negative binomial regression on the various covariates.
And you have tabulated these predicted probabilities across the whole dataset
table(round(predict(fm_zinb2, type="zero")))
0 1
891 24
So your output tells you that only 24 biochemists were likely to be an excess zero, ie with a predicted probability of an excess zero that was over one-half (due to rounding).
It would perhaps be easier to interpret if you tabulated into bins of 10 points on the percentage scale
table(cut(predict(fm_zinb2, type="zero"), breaks=seq(from=0,to=1,by=0.1)))
to give
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
751 73 34 23 10 22
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
2 0 0 0
So you can see that 751 biochemists were unlikely to be an excess zero, but 22 biochemists have a chance of between 50-60% of being an excess zero, and only 2 have a higher chance (60-70%). No one was extremely likely to be an excess zero.
Graphically, this can be shown in a histogram
hist(predict(fm_zinb2, type="zero"), col="slateblue", breaks=seq(0,0.7,by=.02))
You tabulated the actual number of counts per biochemist (no rounding necessary, as these are counts):
table(bioChemists$art)
0 1 2 3 4 5 6 7 8 9 10 11 12 16 19
275 246 178 84 67 27 17 12 1 2 1 1 2 1 1
Who is the special biochemist with 19 publications?
most_pubs <- max(bioChemists$art)
most_pubs
extreme_biochemist <- bioChemists$art==most_pubs
which(extreme_biochemist)
You can obtain the estimated probability that each biochemist has any number of pubs, exactly 0 and up to the maximum, here an unbelievable 19!
preds <- predict(fm_zinb2, type="prob")
preds[extreme_biochemist,]
and you can look at this for our one special biochemist, who had 19 publications (using base R plotting here, but ggplot is more beautiful)
expected <- predict(fm_zinb2, type="response")[extreme_biochemist]
# barplot returns the midpoints for counts 0 up to 19
midpoints<-barplot(preds[extreme_biochemist,],
xlab="Predicted #pubs", ylab="Relative chance among biochemists")
# add 1 because the first count is 0
abline(v=midpoints[19+1],col="red",lwd=3)
abline(v=midpoints[round(expected)+1],col="yellow",lwd=3)
and this shows that although we expect 4.73 publications for biochemist 915, under this model more likelihood is given to 2-3 pubs, nowhere near the actual 19 pubs (red line).
Getting back to the question, for biochemist 29,
the probability of an excess zero is
pzero <- predict(fm_zinb2, type="zero")
pzero[29]
29
0.5812012
The probability of a zero, overall (marginally) is
preds[29,1]
[1] 0.7320871
So the proportion of predicted probability of a zero that is excess versus structural (ie explained by the regression) is:
pzero[29]/preds[29,1]
29
0.7938962
Or the additional probability of a zero, beyond the chance of an excess zero is:
preds[29,1] - pzero[29]
29
0.1508859
The actual number of publications for biochemist 29 is
bioChemists$art[29]
[1] 0
So the reason this biochemist is predicted to have zero publications is explained only to a small degree by the regression (about 20%) and is mostly excess (about 80%).
And overall, we see that for most biochemists, this is not the case. Our biochemist 29 is unusual, since their chance of zero pubs is mostly excess, ie inexplicable by the regression. We can see this via:
hist(pzero/preds[,1], col="blue", xlab="Proportion of predicted probability of zero that is excess")
which gives you a histogram of that excess proportion across all biochemists.
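As a closing aggregate view (not part of the original answer, just a sketch using the preds and pzero objects defined above), you can also tally how many zeros the fitted model attributes to each component:
# Expected number of zero-publication biochemists under the fitted model
sum(preds[, 1])
# Expected number of those zeros coming from the zero-inflation (excess) component
sum(pzero)
# Expected number of zeros coming from the negative binomial (count) component
sum(preds[, 1] - pzero)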
