hurdle model prediction - count vs response - r

I'm working on a hurdle model and ran into a question I can't quite figure out. It was my understanding that the overall response prediction of the hurdle is the multiplication of the count prediction by the probability prediction. I.e., the overall response has to be smaller or equal to the count prediction. However, in my data, the response prediction is higher than the count prediction, and I can't figure out why.
Here's a similar result for a toy model (code adapted from here):
library("pscl")
data("RecreationDemand", package = "AER")
## model
m <- hurdle(trips ~ quality | ski, data = RecreationDemand, dist = "negbin")
nd <- data.frame(quality = 0:5, ski = "no")
predict(m, newdata = nd, type = "count")
predict(m, newdata = nd, type = "response")
Why is it that the counts are higher than the responses?
added comparison to glm.nb
Also - I was under the impression that the count part of the hurdle model should give identical predictions to a count-model of only positive values. When I try that, I get completely different values. What am I missing??
library(MASS)
m.nb <- glm.nb(trips ~ quality, data = RecreationDemand[RecreationDemand$trips > 0,])
predict(m, newdata = nd, type = "count") ## hurdle
predict(m.nb, newdata = nd, type = "response") ## positive counts only

The last question is the easiest to answer. The "count" part of the hurdle modle is not simply a standard count model (including a positive probability for zeros) but a zero-truncated count model (where zeros cannot occur).
Using the countreg package from R-Forge you can fit the model you attempted to fit with glm.nb in your example. (Alternatively, VGAM or gamlss could also be used to fit the same model.)
library("countreg")
m.truncnb <- zerotrunc(trips ~ quality, data = RecreationDemand,
subset = trips > 0, dist = "negbin")
cbind(hurdle = coef(m, model = "count"), zerotrunc = coef(m.truncnb), negbin = coef(m.nb))
## hurdle zerotrunc negbin
## (Intercept) 0.08676189 0.08674119 1.75391028
## quality 0.02482553 0.02483015 0.01671314
Up to small numerical differences the first two models are exactly equivalent. The non-truncated model, however, has to compensate the lack of zeros by increasing the intercept and dampening the slope parameter, which is clearly not appropriate here.
As for the predictions, one can distinguish three quantities:
The expectation from the untruncated count part, i.e., simply exp(x'b).
The conditional/truncated expectation from the count part, i.e., accounting for the zero trunctation: exp(x'b)/(1 - f(0)) where f(0) is the probability for 0 in that count part.
The overall expectation for the complete hurdle model, i.e., the probability for crossing the hurdle times the conditional expectation from 2.: exp(x'b)/(1 - f(0)) * (1 - g(0)) where g(0) is the probability for 0 in the zero hurdle part of the model.
See also Section 2.2 and Appendix C in vignette("countreg", package = "pscl") for more details and formulas. predict(..., type = "count") computes item 1 from above where predict(..., type = "response") computes item 3 for a hurdle model and item 2 for a zerotrunc model.

Related

Why are the predictions from poisson lasso regression model in glmnet not integers?

I am conducting a lasso regression modeling predictors of a count outcome in glmnet.
I am wondering what to make of the predictions from this model.
Here is some toy data. It's not very good because I don't know how to simulate multivariate data but I'm mainly interested in whether I'm getting the syntax right.
set.seed(123)
df <- data.frame(count = rpois(500, lambda = 3),
pred1 = rnorm(500),
pred2 = rnorm(500),
pred3 = rnorm(500),
pred4 = rnorm(500),
pred5 = rnorm(500),
pred6 = rnorm(500),
pred7 = rnorm(500),
pred8 = rnorm(500),
pred9 = rnorm(500),
pred10 = rnorm(500))
Now run the model
x <- model.matrix(count ~ ., df)[,-1]
y <- df$count
cvg <- cv.glmnet(x,y,family = "poisson")
now when I generate predicted outcomes
yTest <- predict(cvg, newx = x, family = "poisson", type = "link")
This is the output
# 1 1.094604
# 2 1.094604
# 3 1.094604
# 4 1.094604
# 5 1.094604
# 6 1.094604
# ... ........
Now obviously the model predictions are all the same and all terrible (unsurprising given the absence of any association between the predictors and the outcome), but the thing I am wondering is why they are not integers (with my real data I have the same problem).
I have several questions.
So my questions are:
Am I specifying the correct arguments in the glmnet.predict() function? In the help for the predict function it states that specifying type = "link" gives "the linear predictors" for poisson models, whereas specifying type = "response" gives the "fitted mean" for poisson models (in the case of my dumb example it generates 500 values of 2.988).
Shouldn't the predicted outcomes match the form of the data itself, i.e. be integers?
If I am specifying the correct arguments in the predict() function, how do I use the non-integer predictions Do I round them to the nearest integer, or just leave them alone?
Shouldn't the predicted outcomes match the form of the data itself,
i.e. be integers?
When you use a regression model you are associating a (conditional) probability distribution, indexed by parameters (in the Poisson case, the lambda parameter, which represents the mean) to each predictor configuration. A prediction of the response minimizes some expected loss function conditional to the predictor values so it depends on what loss function you are using.
If you consider a 0-1 loss, then yes, the predicted values should be an integer: the mode of the distribution, its most probable value, which in the case of a Possion distribution is the floor of lambda if it is not an integer (https://en.wikipedia.org/wiki/Poisson_distribution).
If you consider a squared loss (y - y_prediction)^2 then your prediction is the conditional expectation (see https://en.wikipedia.org/wiki/Minimum_mean_square_error#Properties), which is not necessarily an integer, just like the result you are getting.
glmnet uses squared loss, but you can easily predict an integer value (the one that minimizes the 0-1 loss) by applying the floor() function to the predicted values output by glmnet.

IV regression computation

For my thesis I am doing an Instrumental Variables (IV) regression and I was wondering if I did it the right way. Couple of issues I have:
Comparing the linear model with the IV models, the sign of the effect changes (positive to negative or the other way round).
Using Two Stage Least Squares (2SLS) with ivreg (from the AER package) gives negative R² values, so I decided to manually compute the 2SLS estimates. These give the same estimates as the ivreg code but now with statistically significant results.
I have limited data and therefore I did not expect any significant results as I already did some non-parametric tests and the means of the different groups were not significantly different.
I am researching the effect of policies of organizations on a given budget.
The organization performs well if the budget residual is positive, so they have less costs than budgeted.
The variable is a percentage, either positive or negative.
There is non random selection into treatment as organizations can determine their own policy.
Furthermore, the policy factors are mostly dummy variables, 19 variables are binary and 2 are categorical and 1 is ratio.
My IV is any number between 0 and 1.
This is what I did:
1. I estimate a simple Ordinary Least Squares model to see what it would do (I know the results don't mean anything).
lm1 <- lm(budget ~ policy1, data=df)
lm2 <- lm(budget ~ policy2, data=df)
summ(lm1)
summ(lm2)
2. Then I performed an IV with the ivreg code, though the R² became negative which I thought was weird.
ivreg1 <- ivreg(budget ~ policy1| iv, data=df)
ivreg2 <- ivreg(budget ~ policy2 | iv, data=df)
library(stargazer)
stargazer(ivreg1, ivreg2, dep.var.labels=c("Budget"), covariate.labels = c("policy 1", "policy2") , align=TRUE, column.sep.width = "-15pt", font.size = "small", type="text")
3. So I tried to do the 2SLS in steps myself.
Instead of fitted.values(reg1) I also used predict(reg1). This gives the same output.
attach(df)
reg1<- lm(policy1~iv)
policy1.hat <- fitted.values(reg1)
reg2 <- lm(policy2~iv)
policy2.hat <- fitted.values(reg2)
ivreg3 <- lm(budget~policy1.hat)
ivreg4 <- lm(budget~policy2.hat)
stargazer(ivreg1, ivreg2, dep.var.labels=c("Budget"), covariate.labels = c("policy 1", "policy2"), align=TRUE, column.sep.width = "-15pt", font.size = "small", type="text")
detach(df)
With this step I got a positive adjusted R² but I noticed that the policy factors are now significant and that the sign compared to the lm model changes.
Question:
Am I computing the IV regression wrong?
Example data (not real numbers due to anonymity of data):
df <- data.frame(
budget = c(4,2.8,9.1,15.5,10.1,12.9,4.3,
-1.9,-4.9,-1.3,14.1,8.6,7.8,-5.8,3.8,7.2,5.2,-5.3,8.6,
3.5,-1.2,-15.7,1.6,6.9,12.6,10.4,4.5,-8.3,-15.3,
9.8,21.5),
iv = c(0.52,0.43,0.41,0.44,0.41,0.4,0.39,
0.43,0.38,0.37,0.34,0.42,0.4,0.36,0.35,0.41,0.39,
0.35,0.31,0.43,0.36,0.51,0.35,0.34,0.37,0.37,0.39,
0.46,0.44,0.36,0.37),
policy1 = c(1L,1L,1L,1L,1L,1L,0L,1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L,1L,1L),
policy2 = c(1L,1L,1L,1L,1L,1L,1L,0L,0L,1L,
0L,1L,0L,1L,1L,1L,1L,0L,1L,1L,1L,1L,1L,1L,1L,
0L,1L,1L,0L,1L,0L)
)

Fit binomial GLM on probabilities (i.e. using logistic regression for regression not classification)

I want to use a logistic regression to actually perform regression and not classification.
My response variable is numeric between 0 and 1 and not categorical. This response variable is not related to any kind of binomial process. In particular, there is no "success", no "number of trials", etc. It is simply a real variable taking values between 0 and 1 depending on circumstances.
Here is a minimal example to illustrate what I want to achieve
dummy_data <- data.frame(a=1:10,
b=factor(letters[1:10]),
resp = runif(10))
fit <- glm(formula = resp ~ a + b,
family = "binomial",
data = dummy_data)
This code gives a warning then fails because I am trying to fit the "wrong kind" of data:
In eval(family$initialize) : non-integer #successes in a binomial glm!
Yet I think there must be a way since the help of family says:
For the binomial and quasibinomial families the response can be
specified in one of three ways: [...] (2) As a numerical vector with
values between 0 and 1, interpreted as the proportion of successful
cases (with the total number of cases given by the weights).
Somehow the same code works using "quasibinomial" as the family which makes me think there may be a way to make it work with a binomial glm.
I understand the likelihood is derived with the assumption that $y_i$ is in ${0, 1}$ but, looking at the maths, it seems like the log-likelihood still makes sense with $y_i$ in $[0, 1]$. Am I wrong?
This is because you are using the binomial family and giving the wrong output. Since the family chosen is binomial, this means that the outcome has to be either 0 or 1, not the probability value.
This code works fine, because the response is either 0 or 1.
dummy_data <- data.frame(a=1:10,
b=factor(letters[1:10]),
resp = sample(c(0,1),10,replace=T,prob=c(.5,.5)) )
fit <- glm(formula = resp ~ a + b,
family = binomial(),
data = dummy_data)
If you want to model the probability directly you should include an additional column with the total number of cases. In this case the probability you want to model is interpreted as the success rate given the number of case in the weights column.
dummy_data <- data.frame(a=1:10,
b=factor(letters[1:10]),
resp = runif(10),w=round(runif(10,1,11)))
fit <- glm(formula = resp ~ a + b,
family = binomial(),
data = dummy_data, weights = w)
You will still get the warning message, but you can ignore it, given these conditions:
resp is the proportion of 1's in n trials.
for each value in resp, the corresponding value in w is the number of trials.
From the discussion at Warning: non-integer #successes in a binomial glm! (survey packages), I think we can solve it by another family function ?quasibinomial().
dummy_data <- data.frame(a=1:10,
b=factor(letters[1:10]),
resp = runif(10),w=round(runif(10,1,11)))
fit2 <- glm(formula = resp ~ a + b,
family = quasibinomial(),
data = dummy_data, weights = w)

rstanarm for Bayesian hierarchical modeling of binomial experiments

Suppose there are three binomial experiments conducted chronologically. For each experiment, I know the #of trials as well as the #of successes. To use the first two older experiments as prior for the third experiment, I want to "fit a Bayesian hierarchical model on the two older experiments and use the posterior form that as prior for the third experiment".
Given my available data (below), my question is: is my rstanarm code below capturing what I described above?
Study1_trial = 70
Study1_succs = 27
#==================
Study2_trial = 84
Study2_succs = 31
#==================
Study3_trial = 100
Study3_succs = 55
What I have tried in package rstanarm:
library("rstanarm")
data <- data.frame(n = c(70, 84, 100), y = c(27, 31, 55));
mod <- stan_glm(cbind(y, n - y) ~ 1, prior = NULL, data = data, family = binomial(link = 'logit'))
## can I use a beta(1.2, 1.2) as prior for the first experiment?
TL;DR: If you were directly predicting the probability of success, the model would be a Bernoulli likelihood with parameter theta (the probability of success) that could take on values between zero and one. You could use a Beta prior for theta in this case. But with a logistic regression model, you're actually modeling the log odds of success, which can take on any value from -Inf to Inf, so a prior with a normal distribution (or some other prior that can take on any real value within some range determined by the available prior information) would be more appropriate.
For a model where the only parameter is the intercept, the prior is the probability distribution for the log odds of success. Mathematically, the model is:
log(p/(1-p)) =  a
Where p is the probability of success and a, the parameter you're estimating, is the intercept, which can be any real number. If the odds of success are 1:1 (that is, p = 0.5) then a = 0. If the odds are greater than 1:1 then a is positive. If the odds are less than 1:1 then a is negative.
Since we want a prior for a, we need a probability distribution that can take on any real value. If we didn't know anything about the odds of success, we might use a very weakly informative prior like a normal distribution with, say, mean=0 and sd=10 (this is the rstanarm default), meaning that one standard deviation would encompass odds of success ranging from about 22000:1 to 1:22000! So this prior is essentially flat.
If we take your first two studies to construct the prior, we can use the probability density based on those studies and then transform it to the log odds scale:
# Possible outcomes (that is, the possible number of successes)
s = 0:(70+84)
# Probability density over all possible outcomes
dens = dbinom(s, 70+84, (27+31)/(70+84))
Assuming we'll use a normal distribution for the prior, we want the most likely probability of success (which will be the mean for the prior) and the standard deviation of the mean.
# Prior parameters
pp = s[which.max(dens)]/(70+84) # most likely probability
psd = sum(dens * (s/max(s) - pp)^2)^0.5 # standard deviation
# Convert prior to log odds scale
pp_logodds = log(pp/(1-pp))
psd_logodds = log(pp/(1-pp)) - log((pp-psd)/(1 - (pp-psd)))
c(pp_logodds, psd_logodds)
[1] -0.5039052 0.1702006
You could generate essentially the same prior by running stan_glm on the first two studies with the default (flat) prior:
prior = stan_glm(cbind(y, n-y) ~ 1,
data = data[1:2,],
family = binomial(link = 'logit'))
c(coef(prior), se(prior))
[1] -0.5090579 0.1664091
Now let's fit the model using data from Study 3 using the default prior and the prior we just generated. I've switched to a standard data frame, since stan_glm seems to fail when the data frame has only one row (as in data = data[3, ]).
# Default weakly informative prior
mod1 <- stan_glm(y ~ 1,
data = data.frame(y=rep(0:1, c(45,55))),
family = binomial(link = 'logit'))
# Prior based on studies 1 & 2
mod2 <- stan_glm(y ~ 1,
data = data.frame(y=rep(0:1, c(45,55))),
prior_intercept = normal(location=pp_logodds, scale=psd_logodds),
family = binomial(link = 'logit'))
For comparison, let's also generate a model with all three studies and the default flat prior. We would expect this model to give virtually the same results as mod2:
mod3 <- stan_glm(cbind(y, n - y) ~ 1,
data = data,
family = binomial(link = 'logit'))
Now let's compare the three models:
library(tidyverse)
list(`Study 3, Flat Prior`=mod1,
`Study 3, Prior from Studies 1 & 2`=mod2,
`All Studies, Flat Prior`=mod3) %>%
map_df(~data.frame(log_odds=coef(.x),
p_success=predict(.x, type="response")[1]),
.id="Model")
Model log_odds p_success
1 Study 3, Flat Prior 0.2008133 0.5500353
2 Study 3, Prior from Studies 1 & 2 -0.2115362 0.4473123
3 All Studies, Flat Prior -0.2206890 0.4450506
For Study 3 with the flat prior (row 1), the predicted probability of success is 0.55, as expected, since that's what the data says and the prior provides no additional information.
For Study 3 with a prior based on studies 1 and 2, the probability of success is 0.45. The lower probability of success is due to the lower probability of success in Studies 1 and 2 adding additional information. In fact, the probability of success from mod2 is exactly what you'd calculate directly from the data: with(data, sum(y)/sum(n)). mod3 puts all the information into the likelihood instead of splitting it between the prior and the likelihood, but is otherwise essentially the same as mod2.
Answer to (now deleted) comment: If all you know is the number of trials and successes and you think that a binomial probability is a reasonable model for how the data were generated, then it doesn't matter how you split up the data into "prior" and "likelihood" or whether you shuffle the order of the data. The resulting model fit will be the same.

Error when trying to run fixed effects logistic regression

not sure where can I get help, since this exact post was considered off-topic on StackExchange.
I want to run some regressions based on a balanced panel with electoral data from Brazil focusing on 2 time periods. I want to understand if after a change in legislation that prohibited firm donations to candidates, those individuals that depended most on these resources had a lower probability of getting elected.
I have already ran a regression like this on R:
model_continuous <- plm(percentage_of_votes ~ time +
treatment + time*treatment, data = dataset, model = 'fd')
On this model I have used a continuous variable (% of votes) as my dependent variable. My treatment units or those that in time = 0 had no campaign contributions coming from corporations.
Now I want to change my dependent variable so that it is a binary variable indicating if the candidate was elected on that year. All of my units were elected on time = 0. How can I estimate a logit or probit model using fixed effects? I have tried using the pglm package in R.
model_binary <- pglm(dummy_elected ~ time + treatment + time*treatment,
data = dataset,
effects = 'twoways',
model = 'within',
family = 'binomial',
start = NULL)
However, I got this error:
Error in maxRoutine(fn = logLik, grad = grad, hess = hess, start = start, :
argument "start" is missing, with no default
Why is that happening? What is wrong with my model? Is it conceptually correct?
I want the second regression to be as similar as possible to the first one.
I have read that clogit function from the survival package could do the job, but I dont know how to do it.
Edit:
this is what a sample dataset could look like:
dataset <- data.frame(individual = c(1,1,2,2,3,3,4,4,5,5),
time = c(0,1,0,1,0,1,0,1,0,1),
treatment = c(0,0,1,1,0,0,1,1,0,0),
corporate = c(0,0,0.1,0,0,0,0.5,0,0,0))
Based on the comments, I believe the logistic regression reduces to treatment and dummy_elected. Accordingly I have fabricated the following dataset:
dataset <- data.frame("treatment" = c(rep(1,1000),rep(0,1000)),
"dummy_elected" = c(rep(1, 700), rep(0, 300), rep(1, 500), rep(0, 500)))
I then ran the GLM model:
library(MASS)
model_binary <- glm(dummy_elected ~ treatment, family = binomial(), data = dataset)
summary(model_binary)
Note that the treatment coefficient is significant and the coefficients are given. The resulting probabilities are thus
Probability(dummy_elected) = 1 => 1 / (1 + Exp(-(1.37674342264577E-16 + 0.847297860386033 * :treatment)))
Probability(dummy_elected) = 0 => 1 - 1 / (1 + Exp(-(1.37674342264577E-16 + 0.847297860386033 * :treatment)))
Note that these probabilities are consistent with the frequencies I generated the data.
So for each row, take the max probability across the two equations above and that's the value for dummy_elected.

Resources