How to quickly get the decision boundary for logistic regression in R

We know how to plot decision boundaries for logistic regression and other classifier methods; however, I am not interested in a plot, but rather in the exact value of the independent variable at which the predicted probability is .50.
For example:
train=data.frame(1:20)
train$response=rep(1:0,10)
model=glm(response ~ poly(X1.20, 2), data=train, family=binomial)
train$X1.20[1]=10.5
predict(model, train[1,], type="response")
This leaves me with a decision boundary of 10.5, which I found through trial and error with the predict() function: a value of 10.5 for the independent variable gives a predicted response of exactly .50. Is there an automated way to find the value that gives a response of exactly .50?

You can use the fact that a predicted value of zero on the link (logit) scale corresponds to a response probability of 0.5, so you just need to find the value of x that makes the link-scale prediction as close to zero as possible. Below, deviationFromZero() measures how far the model's link-scale prediction is from zero for any given value of x.
df <- data.frame(x = 1:20, response = rep(1:0, 10))
model <- glm(response ~ poly(x, 2), data = df, family = binomial)
deviationFromZero <- function(y) abs(predict(model, data.frame(x = y)))
boundary <- optimize(f = deviationFromZero, interval = range(df$x))
boundary
$minimum
[1] 10.5

$objective
           1 
1.926772e-16 
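Equivalently, since the boundary is the point where the link-scale prediction crosses zero, a root finder can be used instead of minimizing the absolute deviation. A minimal sketch, which assumes the link-scale prediction actually changes sign somewhere within range(df$x) (it may not in a degenerate toy example like this one):
# Root-finding alternative: locate where the link-scale prediction equals zero.
linkPred <- function(y) as.numeric(predict(model, data.frame(x = y)))
uniroot(linkPred, interval = range(df$x))$root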

Related

GLMM with beta distribution and lots of zeros in y variable

I am trying to run a GLMM with a beta distribution using the glmmTMB() function (package glmmTMB). My response variable has a lot of 0 observations, so I get this error when running the model:
Error in eval(family$initialize) : y values must be 0 < y < 1
I have attached what my response variable looks like, both raw and normalized (see image).
Zero values cannot occur in data that are truly Beta-distributed (the probability density of y==0 is either zero or infinite unless the first shape parameter is exactly 1.0). You can fit a zero-inflated Beta response by specifying ziformula. For example:
## simulate data
set.seed(101)
y <- rbeta(1000, shape1 = 1, shape2 = 5)
y[sample(1000, replace = FALSE, size = 100)] <- 0
dd <- data.frame(y)
## fit the model
library(glmmTMB)
glmmTMB(y ~ 1, ziformula = ~1, data = dd, family = beta_family)
This example doesn't have a random-effects component, but that doesn't change anything important.
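As a quick sanity check, the estimated zero-inflation probability can be recovered from the fitted model by back-transforming the zero-inflation intercept from the logit scale; it should come out close to the 10% of zeros simulated above. A minimal sketch:
fit <- glmmTMB(y ~ 1, ziformula = ~1, data = dd, family = beta_family)
plogis(fixef(fit)$zi)  # zero-inflation intercept back-transformed; expect roughly 0.10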

Is there a difference between gamma hurdle (two-part) models and zero-inflated gamma models?

I have semicontinuous data (many exact zeros and continuous positive outcomes) that I am trying to model. I have largely learned about modeling data with substantial zero mass from Zuur and Ieno's Beginner's Guide to Zero-Inflated Models in R, which distinguishes between zero-inflated gamma models and what they call "zero-altered" gamma models, which they describe as hurdle models combining a binomial component for the zeros and a gamma component for the positive continuous outcomes.
I have been exploring the ziGamma option in the glmmTMB package and comparing the resulting coefficients to a hurdle model that I built following the instructions in Zuur's book (pages 128-129), and they do not coincide. I'm having trouble understanding why, since the gamma distribution cannot take the value zero, so I would think every zero-inflated gamma model is technically a hurdle model. Can anyone illuminate this for me? See more comments about the models below the code.
library(tidyverse)
library(boot)
library(glmmTMB)
library(parameters)
### DATA
id <- rep(1:75000)
age <- sample(18:88, 75000, replace = TRUE)
gender <- sample(0:1, 75000, replace = TRUE)
cost <- c(rep(0, 30000), rgamma(n = 37500, shape = 5000, rate = 1),
sample(1:1000000, 7500, replace = TRUE))
disease <- sample(0:1, 75000, replace = TRUE)
time <- sample(30:3287, 75000, replace = TRUE)
df <- data.frame(cbind(id, disease, age, gender, cost, time))
# create binary variable for non-zero costs
df <- df %>% mutate(cost_binary = ifelse(cost > 0, 1, 0))
### HURDLE MODEL (MY VERSION)
# gamma component
hurdle_gamma <- glm(cost ~ disease + gender + age + offset(log(time)),
data = subset(df, cost > 0),
family = Gamma(link = "log"))
model_parameters(hurdle_gamma, exponentiate = T)
# binomial component
hurdle_binomial <- glm(cost_binary ~ disease + gender + age + time,
data = df, family = "binomial")
model_parameters(hurdle_binomial, exponentiate = T)
# predicted probability of use
df$prob_use <- predict(hurdle_binomial, type = "response")
# predicted mean cost for people with any cost
df_bin <- subset(df, cost_binary == 1)
df_bin$cost_gamma <- predict(hurdle_gamma, type = "response")
# combine data frames
df2 <- left_join(df, select(df_bin, c(id, cost_gamma)), by = "id")
# replace NA with 0
df2$cost_gamma <- ifelse(is.na(df2$cost_gamma), 0, df2$cost_gamma)
# calculate predicted cost for everyone
df2 <- df2 %>% mutate(cost_pred = prob_use * cost_gamma)
# mean predicted cost
mean(df2$cost_pred)
### glmmTMB with ziGamma
zigamma_model <- glmmTMB(cost ~ disease + gender + age + offset(log(time)),
family = ziGamma(link = "log"),
ziformula = ~ disease + gender + age + time,
data = df)
model_parameters(zigamma_model, exponentiate = T)
df <- df %>% predict(zigamma_model, new data = df, type = "response") # doesn't work
# "no applicable method for "predict" applied to an object of class "data.frame"
The coefficients from the gamma component of my hurdle model and from the conditional (fixed-effects) component of the zigamma model are the same, but the SEs are different, which in my actual data has substantial implications for the significance of my predictor of interest. The coefficients of the zero-inflation component are also different from those of my binomial model, and I noticed that its z values are the negatives of those in my binomial model. I assume this has to do with my binomial model modeling the probability of presence (1 is a success) while glmmTMB presumably models the probability of absence (0 is a success)?
In sum, can anyone point out what I am doing wrong with the glmmTMB ziGamma model?
The glmmTMB package can do this:
glmmTMB(formula, family=ziGamma(link="log"), ziformula=~1, data= ...)
ought to do it. Maybe something in VGAM as well?
To answer the questions about coefficients and standard errors:
the change in sign of the binomial coefficients is exactly what you suspected (the difference between estimating the probability of 0 [glmmTMB] vs the probability of not-zero [your/Zuur's code])
The standard errors on the binomial part of the model are close but not identical: using broom.mixed::tidy,
library(broom.mixed)
round(1 - abs(tidy(zigamma_model, component = "zi")$statistic) /
          abs(tidy(hurdle_binomial)$statistic), 3)
## [1] 0.057 0.001 0.000 0.000 0.295
about 6% for the intercept, up to about 30% for the effect of time ...
The nearly twofold difference in the standard errors of the conditional (cost > 0) component definitely puzzles me; it holds up if we simply fit the Gamma/log-link model in glmmTMB vs. glm(). It's hard to know how to check which is right, or what the gold standard should be for this case. I would distrust the Wald p-values here and get p-values from a likelihood ratio test instead (via drop1(); see the sketch below).
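A minimal sketch of that check, assuming drop1() can refit the glmmTMB model (it refits once per fixed-effect term, so it can take a while on 75,000 rows):
drop1(zigamma_model, test = "Chisq")  # likelihood-ratio tests for single-term deletions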
In this case the model is badly misspecified (i.e. the cost is uniformly distributed, nothing like Gamma); I wonder if that could be making things harder/worse?
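One way to probe that, as a sketch (the cost_sim variable and the Gamma parameters below are purely illustrative, not part of the original example): re-simulate the positive costs from an actual Gamma distribution and see whether the glm() and glmmTMB() standard errors for the conditional part come back into agreement.
set.seed(123)
df$cost_sim <- ifelse(df$cost_binary == 1,
                      rgamma(nrow(df), shape = 2, rate = 1/5000), 0)
glm_sim <- glm(cost_sim ~ disease + gender + age + offset(log(time)),
               data = subset(df, cost_sim > 0), family = Gamma(link = "log"))
tmb_sim <- glmmTMB(cost_sim ~ disease + gender + age + offset(log(time)),
                   family = ziGamma(link = "log"),
                   ziformula = ~ disease + gender + age + time, data = df)
summary(glm_sim)$coefficients[, "Std. Error"]       # hurdle-style gamma SEs
summary(tmb_sim)$coefficients$cond[, "Std. Error"]  # zero-inflated gamma conditional SEs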

Why are the predictions from poisson lasso regression model in glmnet not integers?

I am conducting a lasso regression modeling predictors of a count outcome in glmnet.
I am wondering what to make of the predictions from this model.
Here is some toy data. It's not very good because I don't know how to simulate multivariate data, but I'm mainly interested in whether I'm getting the syntax right.
set.seed(123)
df <- data.frame(count = rpois(500, lambda = 3),
pred1 = rnorm(500),
pred2 = rnorm(500),
pred3 = rnorm(500),
pred4 = rnorm(500),
pred5 = rnorm(500),
pred6 = rnorm(500),
pred7 = rnorm(500),
pred8 = rnorm(500),
pred9 = rnorm(500),
pred10 = rnorm(500))
Now load glmnet and run the model:
library(glmnet)
x <- model.matrix(count ~ ., df)[,-1]
y <- df$count
cvg <- cv.glmnet(x, y, family = "poisson")
Now, when I generate predicted outcomes:
yTest <- predict(cvg, newx = x, family = "poisson", type = "link")
This is the output
# 1 1.094604
# 2 1.094604
# 3 1.094604
# 4 1.094604
# 5 1.094604
# 6 1.094604
# ... ........
Now obviously the model predictions are all the same and all terrible (unsurprising given the absence of any association between the predictors and the outcome), but the thing I am wondering is why they are not integers (with my real data I have the same problem).
So my questions are:
Am I specifying the correct arguments in the glmnet.predict() function? In the help for the predict function it states that specifying type = "link" gives "the linear predictors" for poisson models, whereas specifying type = "response" gives the "fitted mean" for poisson models (in the case of my dumb example it generates 500 values of 2.988).
Shouldn't the predicted outcomes match the form of the data itself, i.e. be integers?
If I am specifying the correct arguments in the predict() function, how do I use the non-integer predictions? Do I round them to the nearest integer, or just leave them alone?
Shouldn't the predicted outcomes match the form of the data itself, i.e. be integers?
When you use a regression model, you associate with each predictor configuration a (conditional) probability distribution indexed by parameters (in the Poisson case, the lambda parameter, which is the mean). A prediction of the response minimizes some expected loss function conditional on the predictor values, so it depends on which loss function you are using.
If you consider a 0-1 loss, then yes, the predicted value should be an integer: the mode of the distribution, its most probable value, which for a Poisson distribution is the floor of lambda when lambda is not an integer (https://en.wikipedia.org/wiki/Poisson_distribution).
If you consider a squared loss (y - y_prediction)^2 then your prediction is the conditional expectation (see https://en.wikipedia.org/wiki/Minimum_mean_square_error#Properties), which is not necessarily an integer, just like the result you are getting.
glmnet's type = "response" predictions are these conditional means (squared-loss predictions), but you can easily obtain an integer prediction (the one that minimizes the 0-1 loss) by applying floor() to them.
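In terms of the code above, the type = "link" values are on the log scale, exponentiating them gives the type = "response" fitted means, and flooring those gives an integer (modal) prediction. A minimal sketch:
exp(1.094604)                                        # ~2.988, the fitted mean quoted above
mu_hat <- predict(cvg, newx = x, type = "response")  # conditional means, not integers
head(cbind(mu_hat, floor(mu_hat)))                   # floor() gives the Poisson mode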

How to check for overdispersion in a GAM with negative binomial distribution?

I fit a Generalized Additive Model in the negative binomial family using gam() from the mgcv package. I have a data frame containing my dependent variable y, an independent variable x, a factor fac, and a random effect ran. I fit the following model:
gam1 <- gam(y ~ fac + s(x) + s(ran, bs = 're'), data = dt, family = nb())
I have read in the Negative Binomial Regression book that it is still possible for such a model to be overdispersed. I have found code to check for overdispersion in a glm, but I am failing to find the equivalent for a gam. I have also encountered suggestions to just check the QQ plot and the standardized residuals vs. predicted values, but I cannot decide from my plots whether the data are still overdispersed. Therefore, I am looking for an equation that would solve my problem.
A good way to check how well the model compares with the observed data (and hence check for overdispersion in the data relative to the conditional distribution implied by the model) is via a rootogram.
I have a blog post showing how to do this for glm() models using the countreg package, but this works for GAMs too.
The salient parts of the post applied to a GAM version of the model are:
library("coenocliner")
library('mgcv')
## parameters for simulating
set.seed(1)
locs <- runif(100, min = 1, max = 10) # environmental locations
A0 <- 90 # maximal abundance
mu <- 3 # position on gradient of optima
alpha <- 1.5 # parameter of beta response
gamma <- 4 # parameter of beta response
r <- 6 # range on gradient species is present
pars <- list(m = mu, r = r, alpha = alpha, gamma = gamma, A0 = A0)
nb.alpha <- 1.5 # overdispersion parameter 1/theta
zprobs <- 0.3 # prob(y == 0) in binomial model
## simulate some negative binomial data from this response model
nb <- coenocline(locs, responseModel = "beta", params = pars,
countModel = "negbin",
countParams = list(alpha = nb.alpha))
df <- setNames(cbind.data.frame(locs, nb), c("x", "yNegBin"))
OK, so we have a sample of data drawn from a negative binomial sampling distribution and we will now fit two models to these data:
A Poisson GAM
m_pois <- gam(yNegBin ~ s(x), data = df, family = poisson())
A negative binomial GAM
m_nb <- gam(yNegBin ~ s(x), data = df, family = nb())
The countreg package is not yet on CRAN but it can be installed from R-Forge:
install.packages("countreg", repos="http://R-Forge.R-project.org")
Then load the packages and compute the rootograms:
library("countreg")
library("ggplot2")
root_pois <- rootogram(m_pois, style = "hanging", plot = FALSE)
root_nb <- rootogram(m_nb, style = "hanging", plot = FALSE)
Now plot the rootograms for each model:
autoplot(root_pois)
autoplot(root_nb)
This is what we get (after plotting both using cowplot::plot_grid() to arrange the two rootograms on the same plot)
We can see that the negative binomial GAM does a bit better here than the Poisson GAM for these data: the bottoms of the bars are closer to zero throughout the range of the observed counts.
The countreg package has details on how you can add an uncertainty band around the zero line as a form of goodness-of-fit test.
You can also compute the Pearson estimate for the dispersion parameter using the Pearson residuals of each model:
sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
## [1] 28.61546
sum(residuals(m_nb, type = "pearson")^2) / df.residual(m_nb)
## [1] 0.5918471
If the model accounts for the dispersion in the data, these values should be close to 1; here we see substantial overdispersion in the Poisson GAM and some underdispersion in the negative binomial GAM.
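If you want this as a single reusable "equation", here is a small helper that wraps the Pearson statistic above (a sketch; it assumes only that the model has the usual residuals() and df.residual() methods, which mgcv GAMs do):
# Pearson estimate of the dispersion parameter for a fitted model
dispersion <- function(model) {
  sum(residuals(model, type = "pearson")^2) / df.residual(model)
}
dispersion(m_pois)  # ~28.6, heavily overdispersed
dispersion(m_nb)    # ~0.59, slightly underdispersed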

How to obtain prediction intervals for linear regression in R

This question probably stems from the fact that I don't fully understand what the predict() function is doing, but I'm wondering if there is a way to access the underlying prediction data so that I can get prediction intervals for a given unobserved value. Here's what I mean:
x <- rnorm(100,10)
y <- x+rnorm(100,5)
And making a linear model:
mod1 <- lm(y ~ x)
If I want the confidence intervals for the model estimates, I can do:
confint(mod1)
and get
                 2.5 %    97.5 %
(Intercept) -8.1864342 29.254714
x            0.7578651  1.132339
If I wanted to, I could plug these lower and upper bound estimates into a prediction equation to get a lower and upper confidence interval for some input of x.
What if I want to do the same, but with a prediction interval? Using
predict(mod1, interval = "prediction")
looks like it fits the model to the existing data with lower and upper bounds, but doesn't tell me which parameters those lower and upper bounds are based on so that I could use them for an unobserved value.
(I know I can technically put a value into the predict() command, but I just want the underlying parameters so that I don't necessarily have to do the prediction in R)
The predict function accepts a newdata argument that computes the interval for unobserved values. Here is an example:
x <- rnorm(100, 10)
y <- x + rnorm(100, 5)
d <- data.frame(x = x, y = y)
mod <- lm(y ~ x, data = d)
d2 <- data.frame(x = c(0.3, 0.6, 0.2))
predict(mod, newdata = d2, interval = 'prediction')
I don't know what you mean by underlying parameters. The prediction interval depends on more than the coefficient estimates and their confidence limits: it also involves the residual standard error, the sample size, and the spread of the observed x values, so you cannot reduce it to a couple of numbers that you plug into the prediction equation.
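For the single-predictor case, though, the formula can be written out from the fitted model's components. A sketch that should reproduce predict(mod, newdata = data.frame(x = 0.3), interval = "prediction"), where the new value x0 = 0.3 is purely illustrative:
x0   <- 0.3                               # illustrative new predictor value
beta <- coef(mod)                         # intercept and slope
s    <- summary(mod)$sigma                # residual standard error
n    <- nrow(d)
se_pred <- s * sqrt(1 + 1/n + (x0 - mean(d$x))^2 / sum((d$x - mean(d$x))^2))
fit  <- unname(beta[1] + beta[2] * x0)    # point prediction
tval <- qt(0.975, df = n - 2)             # 95% interval
c(fit = fit, lwr = fit - tval * se_pred, upr = fit + tval * se_pred)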
