mgcv: obtain predictive distribution of response given new data (negative binomial example) - r

In GAM (and GLM, for that matter), we're fitting a conditional likelihood model. So after fitting the model, for a new input x and response y, I should be able to compute the predictive probability or density of a specific value of y given x. I might want to do this to compare the fit of various models on validation data, for example. Is there a convenient way to do this with a fitted GAM in mgcv? Otherwise, how do I figure out the exact form of the density that is used so I can plug in the parameters appropriately?
As a specific example, consider a negative binomial GAM:
## From ?negbin
library(mgcv)
set.seed(3)
n<-400
dat <- gamSim(1,n=n)
g <- exp(dat$f/5)
## negative binomial data...
dat$y <- rnbinom(g,size=3,mu=g)
## fit with theta estimation...
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=nb(),data=dat)
And now I want to compute the predictive probability of, say, y=7, given x=(.1,.2,.3,.4).

Yes. mgcv performs (empirical) Bayesian estimation, so you can obtain a predictive distribution. Here is how to do it for your example.
# prediction on the link (with standard error)
l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), se.fit = TRUE)
# By the asymptotic normality of GLM estimates, the link value is approximately normally distributed
# for negative binomial with `log` link, the mean `mu` on the response scale is therefore log-normal
p.mu <- function (mu) dlnorm(mu, l[[1]], l[[2]])
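# joint density of `y` and `mu`
# (`size = 3` is the true value used in the simulation; the estimate from the fit
# can be extracted with `b$family$getTheta(TRUE)`)
p.y.mu <- function (y, mu) dnbinom(y, size = 3, mu = mu) * p.mu(mu)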
# marginal probability (not density as negative binomial is discrete) of `y` (integrating out `mu`)
# I have carefully written this function so it can take vector input
p.y <- function (y) {
scalar.p.y <- function (scalar.y) integrate(p.y.mu, lower = 0, upper = Inf, y = scalar.y)[[1]]
sapply(y, scalar.p.y)
}
Now since you want probability of y = 7, conditional on specified new data, use
p.y(7)
# 0.07810065
In general, this approach by numerical integration is not easy. For example, if another link function such as sqrt() is used for the negative binomial, the distribution of the mean on the response scale is no longer log-normal and has to be derived separately (though this is not difficult).
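One way to avoid deriving the response-scale density for each link is to integrate over the link scale instead, where the normal approximation applies directly. The following is only a sketch in base R: `eta_hat`, `eta_se` and `theta` are hypothetical stand-ins for `l$fit`, `l$se.fit` and the negative binomial size, and `linkinv` would be replaced by the fitted family's inverse link (for example `b$family$linkinv`).

```r
# Hypothetical stand-ins for the output of predict(..., se.fit = TRUE)
eta_hat <- 1.2   # prediction on the link scale (assumed value)
eta_se  <- 0.1   # its standard error (assumed value)
theta   <- 3     # negative binomial size, treated as known
linkinv <- exp   # inverse link; swap in the fitted family's inverse link

# Pr(y) = integral of Pr(y | mu = linkinv(eta)) * N(eta; eta_hat, eta_se) d eta
pred_pmf <- function(y) {
  sapply(y, function(yi) {
    integrand <- function(eta) {
      dnbinom(yi, size = theta, mu = linkinv(eta)) * dnorm(eta, eta_hat, eta_se)
    }
    # +/- 8 standard errors covers essentially all of the normal mass
    integrate(integrand,
              lower = eta_hat - 8 * eta_se,
              upper = eta_hat + 8 * eta_se,
              rel.tol = 1e-6)$value
  })
}

ys  <- 0:100
pmf <- pred_pmf(ys)
total_mass <- sum(pmf)   # sanity check: should be very close to 1

# Compare the spread with the plug-in distribution that ignores uncertainty in mu
plug_in  <- dnbinom(ys, size = theta, mu = linkinv(eta_hat))
pred_var <- sum(ys^2 * pmf) - sum(ys * pmf)^2
plug_var <- sum(ys^2 * plug_in) - sum(ys * plug_in)^2
```

Because the integration variable is the linear predictor, only `linkinv` changes when the link changes. The predictive variance comes out larger than the plug-in variance, since it also carries the uncertainty in the estimated mean.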
Now I offer a sampling-based (Monte Carlo) approach. This is the closest to the usual Bayesian procedure.
N <- 1000 # sample size
set.seed(0)
## draw N samples from posterior of `mu`
sample.mu <- b$family$linkinv(rnorm(N, l[[1]], l[[2]]))
## draw N samples from likelihood `Pr(y|mu)`
sample.y <- rnbinom(N, size = 3, mu = sample.mu)
## Monte Carlo estimation for `Pr(y = 7)`
mean(sample.y == 7)
# 0.076
Remark 1
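Note that, being empirical Bayes, all of the above methods are conditional on the estimated smoothing parameters. If you want something closer to "full Bayes", set unconditional = TRUE in predict(), so that the standard errors also account for smoothing parameter uncertainty (when that correction is available).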
Remark 2
Some people may assume the solution is as simple as this:
mu <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), type = "response")
dnbinom(7, size = 3, mu = mu)
Such a result is conditional on the regression coefficients (assumed fixed, without uncertainty), so mu becomes fixed rather than random. This is not a predictive distribution. A predictive distribution integrates out the uncertainty of model estimation.

Related

How to check for overdispersion in a GAM with negative binomial distribution?

I fit a Generalized Additive Model in the Negative Binomial family using gam from the mgcv package. I have a data frame containing my dependent variable y, an independent variable x, a factor fac and a random variable ran. I fit the following model
gam1 <- gam(y ~ fac + s(x) + s(ran, bs = 're'), data = dt, family = "nb")
I have read in the Negative Binomial Regression book that it is still possible for the model to be overdispersed. I have found code to check for overdispersion in a glm, but I am failing to find it for a gam. I have also encountered suggestions to just check the QQ plot and the standardised residuals vs. predicted values, but I cannot decide from my plots whether the data are still overdispersed. Therefore, I am looking for an equation that would solve my problem.
A good way to check how well the model compares with the observed data (and hence check for overdispersion in the data relative to the conditional distribution implied by the model) is via a rootogram.
I have a blog post showing how to do this for glm() models using the countreg package, but this works for GAMs too.
The salient parts of the post applied to a GAM version of the model are:
library("coenocliner")
library('mgcv')
## parameters for simulating
set.seed(1)
locs <- runif(100, min = 1, max = 10) # environmental locations
A0 <- 90 # maximal abundance
mu <- 3 # position on gradient of optima
alpha <- 1.5 # parameter of beta response
gamma <- 4 # parameter of beta response
r <- 6 # range on gradient species is present
pars <- list(m = mu, r = r, alpha = alpha, gamma = gamma, A0 = A0)
nb.alpha <- 1.5 # overdispersion parameter 1/theta
zprobs <- 0.3 # prob(y == 0) in binomial model
## simulate some negative binomial data from this response model
nb <- coenocline(locs, responseModel = "beta", params = pars,
countModel = "negbin",
countParams = list(alpha = nb.alpha))
df <- setNames(cbind.data.frame(locs, nb), c("x", "yNegBin"))
OK, so we have a sample of data drawn from a negative binomial sampling distribution and we will now fit two models to these data:
A Poisson GAM
m_pois <- gam(yNegBin ~ s(x), data = df, family = poisson())
A negative binomial GAM
m_nb <- gam(yNegBin ~ s(x), data = df, family = nb())
The countreg package is not yet on CRAN but it can be installed from R-Forge:
install.packages("countreg", repos="http://R-Forge.R-project.org")
Then load the packages and plot the rootograms:
library("countreg")
library("ggplot2")
root_pois <- rootogram(m_pois, style = "hanging", plot = FALSE)
root_nb <- rootogram(m_nb, style = "hanging", plot = FALSE)
Now plot the rootograms for each model:
autoplot(root_pois)
autoplot(root_nb)
This is what we get (after plotting both using cowplot::plot_grid() to arrange the two rootograms on the same plot)
We can see that the negative binomial model does a bit better here than the Poisson GAM for these data: the bottoms of the bars are closer to zero throughout the range of the observed counts.
The countreg package has details on how you can add an uncertainty band around the zero line as a form of goodness-of-fit test.
You can also compute the Pearson estimate for the dispersion parameter using the Pearson residuals of each model:
sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
# [1] 28.61546
sum(residuals(m_nb, type = "pearson")^2) / df.residual(m_nb)
# [1] 0.5918471
For a well-specified model these values should be close to 1; we see substantial overdispersion in the Poisson GAM, and some underdispersion in the negative binomial GAM.
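To get a feel for how this Pearson statistic behaves, here is a small sketch using only base R and glm() rather than mgcv. The simulated data, the helper name `pearson_dispersion` and all parameter values are illustrative assumptions, not part of the example above.

```r
set.seed(1)
n  <- 500
x  <- runif(n)
mu <- exp(1 + x)                            # true mean on the response scale

y_nb   <- rnbinom(n, size = 1.5, mu = mu)   # overdispersed counts
y_pois <- rpois(n, mu)                      # counts that really are Poisson

# Pearson estimate of the dispersion parameter (hypothetical helper)
pearson_dispersion <- function(model) {
  sum(residuals(model, type = "pearson")^2) / df.residual(model)
}

# Poisson model fitted to negative binomial data: expect a value well above 1
disp_nb <- pearson_dispersion(glm(y_nb ~ x, family = poisson()))

# Poisson model fitted to Poisson data: expect a value near 1
disp_pois <- pearson_dispersion(glm(y_pois ~ x, family = poisson()))
```

The same helper should work on a fitted gam object, since the lines above call residuals(..., type = "pearson") and df.residual() on the GAMs in exactly the same way.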

bivariate Probit/logit R : how to find ALL coefficients and marginal effects with the "zeligverse" package

I am running a bivariate logit model in R with the zeligverse package. I want to calculate the impact of my independent variables on P(Y1=1), P(Y2=1), P(Y1=1,Y2=0), P(Y1=1,Y2=1), P(Y1=0,Y2=1), P(Y1=0,Y2=0), P(Y1=1|Y2=0) and all the other conditional probabilities (Y1 and Y2 are my dependent variables; both equal 0 or 1). I also want all the marginal effects associated with these probabilities for each independent variable.
Do you know how to find those in this package (or in another package if it works better)?
Not sure this is what you are looking for (feel free to mark me down if not). The Zelig packages do seem to be a good choice for your specific question.
library(Zelig)
## Let X_i be independent variable
## Assume you are working with a univariate target variable Y where Y \in {0, 1}
set.seed(123)
m <- 100
df <- data.frame(
Y = rbinom(m, 1, 0.5),
X1 = rbinom(m, 1, 0.95),
X2 = rbinom(m, 1, 0.95)
)
## Fit model once:
fit <- zelig(
Y ~ .,
model = "logit",
data = df,
cite = FALSE
)
summary(fit)
## Let's focus on the binomial predictor 2
x.out1 <- setx(fit, X2=1)
## Run estimation based on a posterior distribution:
postFit <- Zelig::sim(fit, x=x.out1)
summary(postFit)
# plot(postFit)

Manual Perceptron example in R - are the results acceptable?

I am trying to get a perceptron algorithm for classification working but I think something is missing. This is the decision boundary achieved with logistic regression:
The red dots got into college, after performing better on tests 1 and 2.
This is the data, and this is the code for the logistic regression in R:
dat = read.csv("perceptron.txt", header=F)
colnames(dat) = c("test1","test2","y")
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=dat)
fit = glm(y ~ test1 + test2, family = "binomial", data = dat)
coefs = coef(fit)
(x = c(min(dat[,1])-2, max(dat[,1])+2))
(y = c((-1/coefs[3]) * (coefs[2] * x + coefs[1])))
lines(x, y)
The code for the "manual" implementation of the perceptron is as follows:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header=F)
dat[,1:2] = apply(dat[,1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1,nrow(dat)), dat) # introducing the "bias" column
colnames(data) = c("bias","test1","test2","y")
data$y[data$y==0] = -1 # Turning 0/1 dependent variable into -1/1.
data = as.matrix(data) # Turning data.frame into matrix to avoid mmult problems.
# PERCEPTRON:
set.seed(62416)
no.iter = 1000 # Number of loops
theta = rnorm(ncol(data) - 1) # Starting a random vector of coefficients.
theta = theta/sqrt(sum(theta^2)) # Normalizing the vector.
h = theta %*% t(data[,1:3]) # Performing the first f(theta^T X)
for (i in 1:no.iter){ # We will recalculate 1,000 times
for (j in 1:nrow(data)){ # Each time we go through each example.
if(h[j] * data[j, 4] < 0){ # If the hypothesis disagrees with the sign of y,
theta = theta + (sign(data[j,4]) * data[j, 1:3]) # We + or - the example from theta.
}
else
theta = theta # Else we let it be.
}
h = theta %*% t(data[,1:3]) # Calculating h() after iteration.
}
theta # Final coefficients
mean(sign(h) == data[,4]) # Accuracy
With this, I get the following coefficients:
bias test1 test2
9.131054 19.095881 20.736352
and an accuracy of 88%, consistent with that calculated with the glm() logistic regression function: mean(sign(predict(fit))==data[,4]) gives 89%. Logically, there is no way of linearly classifying all of the points, as is obvious from the plot above. In fact, iterating only 10 times and plotting the accuracy, ~90% is reached after just 1 iteration:
Being in line with the training classification performance of logistic regression, it is likely that the code is not conceptually wrong.
QUESTIONS: Is it OK to get coefficients so different from the logistic regression:
(Intercept) test1 test2
1.718449 4.012903 3.743903
This is really more of a CrossValidated question than a StackOverflow question, but I'll go ahead and answer.
Yes, it's normal and expected to get very different coefficients because you can't directly compare the magnitude of the coefficients between these 2 techniques.
With the logit (logistic) model you're fitting a binomial likelihood with a logit link, so the coefficients live on the log-odds scale and are only meaningful in this context. You've also got an intercept term in the logit.
None of this is true for the perceptron model. The interpretation of the coefficients is thus totally different.
Now, that's not saying anything about which model is better. There aren't comparable performance metrics in your question that would allow us to determine that. To determine that you should do cross-validation or at least use a holdout sample.
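One way to make the two sets of coefficients more comparable is to look at their direction rather than their magnitude, because a linear classifier's decision boundary does not change when the weights are multiplied by a positive constant. The sketch below uses the coefficient values quoted in the question. The random matrix `X` is made-up data used only to illustrate the scale invariance, and it assumes both models were fitted on the same (scaled) inputs.

```r
# Coefficients as reported in the question
perceptron <- c(bias = 9.131054, test1 = 19.095881, test2 = 20.736352)
logistic   <- c(bias = 1.718449, test1 = 4.012903, test2 = 3.743903)

# Rescale each weight vector to unit length and compare directions
unit   <- function(v) v / sqrt(sum(v^2))
cosine <- sum(unit(perceptron) * unit(logistic))  # 1 means identical direction

# Scale invariance: dividing the weights by 100 leaves every predicted sign unchanged
set.seed(1)
X <- cbind(1, matrix(rnorm(200), ncol = 2))       # made-up inputs with a bias column
same_labels <- all(sign(X %*% perceptron) == sign(X %*% (perceptron / 100)))
```

A cosine close to 1 would indicate that the two fits describe nearly the same boundary even though the raw magnitudes differ by a factor of about five.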

How to obtain prediction intervals for linear regression in R

This question probably stems from the fact that I don't fully understand what the predict() function is doing, but I'm wondering if there is a way to access the underlying prediction data so that I can get prediction intervals for a given unobserved value. Here's what I mean:
x <- rnorm(100,10)
y <- x+rnorm(100,5)
And making a linear model:
mod1 <- lm(y ~ x)
If I want the confidence intervals for the model estimates, I can do:
confint(mod1)
and get
                 2.5 %    97.5 %
(Intercept) -8.1864342 29.254714
x            0.7578651  1.132339
If I wanted to, I could plug these lower and upper bound estimates into a prediction equation to get a lower and upper confidence interval for some input of x.
What if I want to do the same, but with a prediction interval? Using
predict(mod1, interval = "prediction")
looks like it fits the model to the existing data with lower and upper bounds, but doesn't tell me which parameters those lower and upper bounds are based on so that I could use them for an unobserved value.
(I know I can technically put a value into the predict() command, but I just want the underlying parameters so that I don't necessarily have to do the prediction in R)
The predict function accepts a newdata argument that computes the interval for unobserved values. Here is an example
x <- rnorm(100, 10)
y <- x + rnorm(100, 5)
d <- data.frame(x = x, y = y)
mod <- lm(y ~ x, data = d)
d2 <- data.frame(x = c(0.3, 0.6, 0.2))
predict(mod, newdata = d2, interval = 'prediction')
I don't know what you mean by underlying parameters. The computation of prediction intervals involves a complex formula and you cannot reduce it to a few simple parameters.
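For readers who do want to reproduce the interval outside predict(), the standard textbook formula for a linear model is fit ± t × sqrt(s² + x0' V x0), where s is the residual standard error and V is the coefficient covariance matrix. The sketch below checks that formula against predict() on simulated data. The object names (`X0`, `manual`, `builtin`) are illustrative, and the seed is an assumption added for reproducibility.

```r
set.seed(1)
x <- rnorm(100, 10)
y <- x + rnorm(100, 5)
mod <- lm(y ~ x)

new_x <- c(0.3, 0.6, 0.2)
X0    <- cbind(1, new_x)                  # design rows for the new points
fit0  <- drop(X0 %*% coef(mod))           # point predictions
s     <- sigma(mod)                       # residual standard error

# Variance of the estimated mean at each new point: x0' vcov(mod) x0
var_mean <- rowSums((X0 %*% vcov(mod)) * X0)

# 95% prediction interval: adds the residual variance s^2 to the mean's variance
t_crit <- qt(0.975, df.residual(mod))
half   <- t_crit * sqrt(s^2 + var_mean)
manual <- cbind(fit = fit0, lwr = fit0 - half, upr = fit0 + half)

builtin <- predict(mod, newdata = data.frame(x = new_x), interval = "prediction")
matches <- isTRUE(all.equal(unname(manual), unname(builtin)))
```

So the ingredients needed outside R are the coefficients, their covariance matrix, the residual standard error and the residual degrees of freedom; the interval width still depends on the new x value through the x0' V x0 term.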

calculate vector valued Hessian in R

I want to calculate a variance-covariance matrix of parameters. The parameters are obtained by a non-linear least squares fit.
library(minpack.lm)
library(numDeriv)
variables
t <- seq(0.1,20,0.3)
a <- 20
b <- 14
c <- 0.4
jitter <- rnorm(length(t),0,0.5)
Hobs <- a+b*exp(-c*t)+jitter
function def
Hhat <- function(parList, t) {parList$a + parList$b*exp(-parList$c*t)}
Hhatde <- function(par, t) {par[1] + par[2]*exp(-par[3]*t)}
residFun <- function(par, t, observed) observed - Hhat(par,t)
initial conditions
parStart = list(a = 20, b = 10 ,c = 0.5)
nls.lm
library(minpack.lm)
out1 <- nls.lm(par = parStart, fn = residFun, observed = Hobs,
t = t, control = nls.lm.control(nprint=0))
I wish to calculate manually what is given back via vcov(out1)
I tried it with the following, but sigma and vcov(out1) don't seem to be the same:
J <- jacobian(Hhatde, c(19.9508523,14.6586555,0.4066367 ), method="Richardson",
method.args=list(),t=t)
sigma <- solve((t(J)%*%J))
vcov(out1)
Now, trying to do it with the Hessian, I can't get it working; see the error message below.
hessian
H <- hessian(Hhatde, x = c(19.9508523,14.6586555,0.4066367 ), method="complex", method.args=list(),t=t)
Error in hessian.default(Hhatde, x = c(19.9508523, 14.6586555, 0.4066367), :
Richardson method for hessian assumes a scalar valued function.
How do I get my hessian() call to work?
I am not very strong on the math here, hence the trial and error approach.
vcov(out1) returns an estimate of the scaled variance-covariance matrix for the parameters in your model. The inverse of the cross product of the gradient, solve(crossprod(J)) returns an estimate of the unscaled variance-covariance matrix. The scaling factor is the estimated variance of the errors. So to calculate the scaled variance-covariance matrix (with some rounding error) using the gradient and the residuals from your model:
df = length(Hobs) - length(out1$par) # residual degrees of freedom
se_var = sum(out1$fvec^2) / df # estimated error variance
var_cov = se_var * solve(crossprod(J)) # scaled variance-covariance
print(var_cov)
print(vcov(out1))
To brush up on non-linear regression and non-linear least squares, you might wish to check out Seber & Wild's Nonlinear regression, or Bates & Watts' Nonlinear regression analysis and its applications. John Fox also has a short online appendix that you may find helpful.
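The same calculation can be sketched with base R only, which also shows one way around the "scalar valued function" error: a Hessian is defined for a scalar function, so it has to be taken of the residual sum of squares rather than of the vector of fitted values. This is an illustration under assumptions, not the code from the question: it uses stats::nls() instead of nls.lm(), an analytic Jacobian instead of numDeriv, stats::optimHess() instead of numDeriv::hessian(), and the names `tt`, `k` and `rss` are made up for the example.

```r
set.seed(42)
tt   <- seq(0.1, 20, 0.3)
Hobs <- 20 + 14 * exp(-0.4 * tt) + rnorm(length(tt), 0, 0.5)

fit <- nls(Hobs ~ a + b * exp(-k * tt), start = list(a = 20, b = 10, k = 0.5))
p   <- coef(fit)

# Analytic Jacobian of the model function with respect to (a, b, k)
J <- cbind(a = rep(1, length(tt)),
           b = exp(-p[["k"]] * tt),
           k = -p[["b"]] * tt * exp(-p[["k"]] * tt))

# Scaled variance-covariance matrix: estimated error variance times (J'J)^-1
resid_var   <- sum(resid(fit)^2) / df.residual(fit)
manual_vcov <- resid_var * solve(crossprod(J))
agrees <- isTRUE(all.equal(unname(manual_vcov), unname(vcov(fit)), tolerance = 1e-4))

# Hessian route: differentiate the scalar residual sum of squares.
# Near the optimum H is roughly 2 * J'J, so this is only an approximation.
rss <- function(par) sum((Hobs - (par[1] + par[2] * exp(-par[3] * tt)))^2)
H   <- optimHess(p, rss)
approx_vcov <- 2 * resid_var * solve(H)
```

The Jacobian-based matrix should agree with vcov(fit) up to numerical error, while the Hessian-based one differs slightly because the exact Hessian includes second-derivative terms that the Gauss-Newton approximation drops.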
