How to obtain prediction intervals for linear regression in R

This question probably stems from the fact that I don't fully understand what the predict() function is doing, but I'm wondering if there is a way to access the underlying prediction data so that I can get prediction intervals for a given unobserved value. Here's what I mean:
x <- rnorm(100,10)
y <- x+rnorm(100,5)
And making a linear model:
mod1 <- lm(y ~ x)
If I want the confidence intervals for the model estimates, I can do:
confint(mod1)
and get
                 2.5 %    97.5 %
(Intercept) -8.1864342 29.254714
x            0.7578651  1.132339
If I wanted to, I could plug these lower and upper bound estimates into a prediction equation to get a lower and upper confidence interval for some input of x.
What if I want to do the same, but with a prediction interval? Using
predict(mod1, interval = "prediction")
looks like it fits the model to the existing data and returns lower and upper bounds, but it doesn't tell me which parameters those bounds are based on, so I can't apply them to an unobserved value.
(I know I can technically put a value into the predict() command, but I just want the underlying parameters so that I don't necessarily have to do the prediction in R)

The predict() function accepts a newdata argument, which lets you compute the interval for unobserved values. Here is an example:
x <- rnorm(100, 10)
y <- x + rnorm(100, 5)
d <- data.frame(x = x, y = y)
mod <- lm(y ~ x, data = d)
d2 <- data.frame(x = c(0.3, 0.6, 0.2))
predict(mod, newdata = d2, interval = 'prediction')
I don't know what you mean by underlying parameters. The computation of a prediction interval involves more than the coefficient estimates -- it also uses the residual standard error and the position of the new point relative to the observed data -- so you cannot reduce it to a few simple parameters.
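If it helps to see what goes into that computation, here is a sketch for the simple one-predictor case, reusing x and mod from above (x0 is a hypothetical new value I introduce for illustration). The interval depends on the coefficient estimates, the residual standard error, the sample size, and how far x0 sits from the observed x values, which is why there is no small set of reusable bounds:
x0    <- 0.3                               # a hypothetical new value of x
s     <- summary(mod)$sigma                # residual standard error
n     <- length(x)
fit0  <- coef(mod)[1] + coef(mod)[2] * x0  # point prediction
se.pr <- s * sqrt(1 + 1/n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
fit0 + c(lwr = -1, upr = 1) * qt(0.975, df = n - 2) * se.pr
This should reproduce the row returned by predict(mod, data.frame(x = 0.3), interval = "prediction").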

Related

How to compute margin of error for prediction intervals of a multiple linear regression in R

I am working on a project where we use R to compute a multiple linear regression to come up with some estimates. I found out how to compute prediction intervals, like in this example:
x <- rnorm(100, 10)
y <- x + rnorm(100, 5)
d <- data.frame(x = x, y = y)
mod <- lm(y ~ x, data = d)
d2 <- data.frame(x = c(0.3, 0.6, 0.2))
predict(mod, newdata = d2, interval = 'prediction')
Now I receive a prediction together with the lower and upper bounds of the prediction interval:
       fit      lwr      upr
1 6.149834 3.630532 8.669137
2 6.425235 3.937989 8.912481
3 6.058034 3.527913 8.588155
However, I am wondering if there is a way to compute the likelihood of the new observation being in a given prediction interval (i.e. prediction +/- 1). In other words, I want to turn the computation “around”. Instead of asking “What are the upper and lower bounds of the 95% prediction interval”, I am asking “What is the likelihood of the new estimate being between a given upper and lower bound around the estimate?”.
To continue the example from above:
       fit      lwr      upr likelihood
1 6.149834 5.149834 7.149834        ???
2 6.425235 5.425235 7.425235        ???
3 6.058034 5.058034 7.058034        ???
Does anyone have an idea how to compute this? Is there a predefined formula in R?
Thank you very much for your help!
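A hedged sketch of one standard way to approximate this (not an answer from the thread), reusing mod and d2 from the example above: under the usual linear-model assumptions, the prediction error y_new - fit is approximately t-distributed with the residual degrees of freedom, scaled by sqrt(se.fit^2 + residual.scale^2), so the probability of a new observation landing within fit +/- 1 can be approximated as follows.
pr      <- predict(mod, newdata = d2, se.fit = TRUE)
se.pred <- sqrt(pr$se.fit^2 + pr$residual.scale^2)   # sd of (y_new - fit)
half    <- 1                                         # half-width of the band around fit
pt(half / se.pred, df = pr$df) - pt(-half / se.pred, df = pr$df)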

my location-scale estimator function not working with polynomial mean

I'm building my own maximum likelihood estimator that estimates the parameters associated with the mean and standard deviation. On simulated data my function works when the true mean is a linear function and the standard deviation is constant. However, if the mean structure is polynomial my function cannot recover the true parameters. Can anybody point me to a solution?
I'm aware there are plenty of existing functions for estimating means and SDs. I'm not interested in them, I'm interested in why my function is not working.
Below is a reproducible example where my model does not recover the true standard deviation (true sd = 1.648, mysd = 4.184123)
*Edit: added library()
library(tidyverse)
my_poly_loglik <- function(pars,        # parameters
                           outcome,     # outcome variable
                           poly_mean) { # data frame of polynomials
  # modelling the mean - adding intercept column
  mean_mdl = cbind(1, poly_mean) %*% pars[1:(ncol(poly_mean) + 1)]
  # modelling the standard deviation on the exponential scale
  sd_mdl = exp(pars[length(pars)])
  # computing the log likelihood
  sum_log_likelihood <- sum(dnorm(outcome,
                                  mean = mean_mdl,
                                  sd = sd_mdl,
                                  log = TRUE),
                            na.rm = TRUE)
  # since optim() is minimizing, we return the negative log likelihood
  return(-sum_log_likelihood)
}
#Generate data
set.seed(103)
n <- 100000 #100k obs
z <- runif(n, min = 0.1, max = 40) #independent variable sampled uniformly
mean <- 10 + 0.2 * z + 0.4 * z^2 #mean structure
sd = exp(0.5) #constant SD
y <- rnorm(n,mean, sd)
#Visualizing simulated data
#plot(z,mean)
#plot(z,sd)
#plot(z,y)
mydf = data.frame(z,y)
#Defining polynomials
polymean = cbind(z, z^2)
#Initial values. 2 extra for mean_intercept and SD
pars = rep(0, ncol(polymean) + 2)
#Optimising my likelihood function
optim_res <- optim(pars,
                   fn = my_poly_loglik,
                   outcome = mydf$y,
                   poly_mean = polymean)
if (optim_res$convergence != 0) stop("optim_res value is not 0!")
#comparing my function to the real parameter
plot_df = data.frame("mymean" = optim_res$par[1] + (polymean %*% optim_res$par[2:3]),
                     "truemean" = mean,
                     "z" = z)
# my mean (black) and true mean (red)
plot_df %>%
  ggplot(aes(x = z, y = mymean)) +
  geom_line() +
  geom_line(aes(y = truemean), color = "red")
#Works!
#my SD and true SD - PROBLEM!
sd #true sd
exp(optim_res$par[length(optim_res$par)]) #my sd
This is not a complete solution, but it might help others find the correct answer.
The code looks good overall, and the issue emerges only when the z values span a wide range. In fact, scaling them, or generating data from a considerably narrower range, leads to the correct solution. Furthermore, checking the Hessian shows that the covariance matrix of the estimates is not positive semidefinite, and slightly reducing the range results in correlations of the mean parameters close to 1. (This is a bit puzzling, since a normal linear model with the same parametrization does not suffer from the same issue -- I know it does not optimize the likelihood directly, but it is still a bit unintuitive to me.)
So, a temporary solution might be rescaling the predictors or using an orthogonal parametrization, as sketched below. But that does not really explain the core of the issue.
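A minimal sketch of that rescaling idea, reusing the question's objects (z, mydf, my_poly_loglik). The variable names and the maxit setting are my own additions, and the numerical result is not verified here:
# Scale z before building the polynomial basis (or use an orthogonal basis
# via poly(z, 2)) so the design columns are far less correlated.
z_s        <- z / max(z)
polymean_s <- cbind(z_s, z_s^2)
pars_s     <- rep(0, ncol(polymean_s) + 2)
optim_res_s <- optim(pars_s,
                     fn = my_poly_loglik,
                     outcome = mydf$y,
                     poly_mean = polymean_s,
                     control = list(maxit = 1e4))
exp(optim_res_s$par[length(optim_res_s$par)])  # expected to land near exp(0.5)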

Identifying lead/lags using multivariate regression analysis

I have three time-series variables (x, y, z), each measured in 3 replicates. x and z are the independent variables and y is the dependent variable; t is the time variable. All three variables follow diel variation: they increase during the day and decrease during the night. An example with a simulated dataset is below.
library(nlme)
library(tidyverse)
n <- 100
t <- seq(0,4*pi,,100)
a <- 3
b <- 2
c.unif <- runif(n)
amp <- 2
datalist = list()
for (i in 1:3) {
  y <- 3*sin(b*t) + rnorm(n)*2
  x <- 2*sin(b*t + 2.5) + rnorm(n)*2
  z <- 4*sin(b*t - 2.5) + rnorm(n)*2
  data = as_tibble(cbind(y, x, z)) %>% mutate(t = 1:100) %>% mutate(replicate = i)
  datalist[[i]] <- data
}
df <- do.call(rbind, datalist)
ggplot(df) +
  geom_line(aes(t, x), color = 'red') +
  geom_line(aes(t, y), color = 'blue') +
  geom_line(aes(t, z), color = 'green') +
  facet_wrap(~replicate, nrow = 1) +
  theme_bw()
I can identify the lead/lag of y with respect to x and z individually. This can be done with the ccf() function in R. For example:
ccf(x,y)
ccf(z,y)
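As an illustrative aside (not part of the original post), the lag at which the cross-correlation peaks can be read directly off the object that ccf() returns:
cc <- ccf(x, y, plot = FALSE)
cc$lag[which.max(abs(cc$acf))]  # lag with the largest absolute cross-correlation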
But I would like to do it with a multivariate regression approach. For example, the lme() function from the nlme package indicates that x and z negatively affect y:
lme = lme(data = df, y ~ x + z, random = ~ 1 | replicate, correlation = corCAR1(form = ~ t | replicate))
It is impossible (in actual data) that x and z can negatively affect y.
I need the time lead/lag, and I would also like to get the standardized coefficients (t-values, to compare effect sizes), both from the same model.
Is there any multivariate model available that can give me the lead/lag and also the regression coefficients?
We might be considering the "statistical significance of the Cramer-Rao estimation of a lower bound". In order to find Xbeta - Xinfinity, taking the expectation of Xbeta and an assumed mean neu will yield a variable, neu^2, which can replace Xinfinity. Using the F-test / likelihood ratio, the degrees of freedom are p2 - p1 = n - p2.
Put it this way: the estimates are n = (-2 neu^2 / neu^2 + n), phi_t = y / Xbeta, and Xbeta = (y - beta0) / a.
The point estimate, Xbeta, is derived from y = a Xbeta + b. The time lead/lag is phi_t and the standardized coefficient is n. The regression generates the lower bound Xbeta, where t = beta.
Spectral analysis of the linear distribution indicates a point estimate beta0 = 0.27, which is a significant peak of variability. Scaling Xbeta by beta0 would be an appropriate idea.

mgcv: obtain predictive distribution of response given new data (negative binomial example)

In GAM (and GLM, for that matter), we're fitting a conditional likelihood model. So after fitting the model, for a new input x and response y, I should be able to compute the predictive probability or density of a specific value of y given x. I might want to do this to compare the fit of various models on validation data, for example. Is there a convenient way to do this with a fitted GAM in mgcv? Otherwise, how do I figure out the exact form of the density that is used so I can plug in the parameters appropriately?
As a specific example, consider a negative binomial GAM:
## From ?negbin
library(mgcv)
set.seed(3)
n<-400
dat <- gamSim(1,n=n)
g <- exp(dat$f/5)
## negative binomial data...
dat$y <- rnbinom(g,size=3,mu=g)
## fit with theta estimation...
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=nb(),data=dat)
And now I want to compute the predictive probability of, say, y=7, given x=(.1,.2,.3,.4).
Yes. mgcv is doing (empirical) Bayesian estimation, so you can obtain a predictive distribution. For your example, here is how.
# prediction on the link (with standard error)
l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), se.fit = TRUE)
# by large-sample GLM theory, the estimated link value is approximately normally distributed;
# with a `log` link, the mean response `mu` is therefore log-normal
p.mu <- function (mu) dlnorm(mu, l[[1]], l[[2]])
# joint density of `y` and `mu`
p.y.mu <- function (y, mu) dnbinom(y, size = 3, mu = mu) * p.mu(mu)
# marginal probability (not density as negative binomial is discrete) of `y` (integrating out `mu`)
# I have carefully written this function so it can take vector input
p.y <- function (y) {
scalar.p.y <- function (scalar.y) integrate(p.y.mu, lower = 0, upper = Inf, y = scalar.y)[[1]]
sapply(y, scalar.p.y)
}
Now since you want probability of y = 7, conditional on specified new data, use
p.y(7)
# 0.07810065
In general, this numerical-integration approach is not easy. For example, if another link function like sqrt() is used for the negative binomial, the distribution of the mean response is not as straightforward (though also not difficult to derive).
Now I offer a sampling-based, or Monte Carlo, approach. This is most similar to a Bayesian procedure.
N <- 1000  # sample size
set.seed(0)
## draw N samples from the posterior of `mu`
sample.mu <- b$family$linkinv(rnorm(N, l[[1]], l[[2]]))
## draw N samples from the likelihood `Pr(y | mu)`
sample.y <- rnbinom(N, size = 3, mu = sample.mu)
## Monte Carlo estimation for `Pr(y = 7)`
mean(sample.y == 7)
# 0.076
Remark 1
Note that, being empirical Bayes, all of the above methods are conditional on the estimated smoothing parameters. If you want something closer to a "full Bayes" treatment, set unconditional = TRUE in predict(), as in the small example below, so that smoothing-parameter uncertainty is also reflected in the standard errors.
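An illustrative call, reusing the newdata from above (the corrected standard errors require that the smoothing-parameter-uncertainty-corrected covariance matrix is available, e.g. with REML smoothness selection; otherwise predict() falls back to the conditional one):
l.unc <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4),
                 se.fit = TRUE, unconditional = TRUE)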
Remark 2
Perhaps some people assume the solution is as simple as this:
mu <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), type = "response")
dnbinom(7, size = 3, mu = mu)
Such a result is conditional on the estimated regression coefficients (treated as fixed, without uncertainty), so mu is fixed rather than random. This is not a predictive distribution. A predictive distribution would integrate out the uncertainty of the model estimates.

R How to quickly get decision boundary for logistic regression

We know how to plot decision boundaries for logistic regression and other classifier methods; however, I am not interested in a plot, but rather in the exact value at which the binomial prediction is .50.
For example:
train=data.frame(1:20)
train$response=rep(1:0,10)
model=glm(response ~ poly(X1.20, 2), data=train, family=binomial)
train$X1.20[1]=10.5
predict(model, train[1,], type="response")
This leaves me with a decision boundary of 10.5, which I can find through trial and error with the predict() function: a value of 10.5 for the independent variable gives a response of exactly .50. Is there an automated way to find the value that gives a response of .50?
You should use the fact that a predicted value of zero on the link scale of the logit model implies a response probability of 0.5. So you can just find the value of x that makes the predicted link value as close to zero as possible. Here, deviationFromZero() measures how far the model's predicted link value is from zero for a given value of x.
df <- data.frame(x = 1:20, response = rep(1:0, 10))
model <- glm(response ~ poly(x, 2), data = df, family = binomial)
deviationFromZero <- function(y) abs(predict(model, data.frame(x = y)))
boundary <- optimize(f = deviationFromZero, interval = range(df$x))
boundary
$minimum
[1] 10.5

$objective
           1
1.926772e-16
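A quick illustrative check (reusing model and boundary from above): evaluating the fitted response at the located boundary should give essentially 0.5.
predict(model, data.frame(x = boundary$minimum), type = "response")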
