Predict.lm() in R - how to get nonconstant prediction bands around fitted values

So I am currently trying to draw the confidence interval for a linear model. I found out I should use predict.lm() for this, but I have a few problems really understanding the function, and I do not like using functions without knowing what's happening. I found several how-tos on this subject, but only with the corresponding R code, no real explanation.
This is the function itself:
## S3 method for class 'lm'
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
        interval = c("none", "confidence", "prediction"),
        level = 0.95, type = c("response", "terms"),
        terms = NULL, na.action = na.pass,
        pred.var = res.var/weights, weights = 1, ...)
Now, what I have trouble understanding:
1) newdata
An optional data frame in which to look for variables
with which to predict. If omitted, the fitted values are used.
Everyone seems to use newdata for this, but I cannot quite understand why. For calculating the confidence interval I obviously need the data which this interval is for (like the number of observations, the mean of x, etc.), so that cannot be what is meant by it. But then: what does it mean?
2) interval
Type of interval calculation.
Okay, but what is "none" for?
3a) type
Type of prediction (response or model term).
3b) terms
If type="terms", which terms (default is all terms)
3a: Can I use that to get the confidence interval for one specific variable in my model? And if so, what is 3b for then? If I can specify the term in 3a, it wouldn't make sense to do it again in 3b. So I guess I'm wrong again, but I cannot figure out why.
I guess some of you might think: why not just try it out? And I would (even if that would maybe not solve everything here), but right now I don't know how. As I do not know what newdata is for, I don't know how to use it, and when I try, I do not get the right confidence interval. Somehow it is very important how you choose that data, but I just don't understand how!
EDIT: I want to add that my intention is to understand how predict.lm works. By that I mean I don't understand whether it works the way I think it does. That is, does it calculate y-hat (the predicted values) and then add/subtract the upr/lwr bounds of the interval for each of them, producing several data points (which then look like a confidence line)? Then I would understand why it is necessary for newdata to have the same length as the data used to fit the linear model.

Make up some data:
d <- data.frame(x=c(1,4,5,7),
                y=c(0.8,4.2,4.7,8))
Fit the model:
lm1 <- lm(y~x,data=d)
Confidence and prediction intervals with the original x values:
p_conf1 <- predict(lm1,interval="confidence")
p_pred1 <- predict(lm1,interval="prediction")
Conf. and pred. intervals with new x values (extrapolation and more finely/evenly spaced than original data):
nd <- data.frame(x=seq(0,8,length=51))
p_conf2 <- predict(lm1,interval="confidence",newdata=nd)
p_pred2 <- predict(lm1,interval="prediction",newdata=nd)
Plotting everything together:
par(las=1,bty="l") ## cosmetics
plot(y~x,data=d,ylim=c(-5,12),xlim=c(0,8)) ## data
abline(lm1) ## fit
matlines(d$x,p_conf1[,c("lwr","upr")],col=2,lty=1,type="b",pch="+")
matlines(d$x,p_pred1[,c("lwr","upr")],col=2,lty=2,type="b",pch=1)
matlines(nd$x,p_conf2[,c("lwr","upr")],col=4,lty=1,type="b",pch="+")
matlines(nd$x,p_pred2[,c("lwr","upr")],col=4,lty=2,type="b",pch=1)
Using new data allows for extrapolation beyond the original data; also, if the original data are sparsely or unevenly spaced, the prediction intervals (which are not straight lines) may not be well approximated by linear interpolation between the original x values ...
I'm not quite sure what you mean by the "confidence interval for one specific variable in my model"; if you want confidence intervals on a parameter, then you should use confint. If you want predictions for the changes based only on some of the parameters changing (ignoring the uncertainty due to the other parameters), then you do indeed want to use type="terms".
interval="none" (the default) just tells R not to bother computing any confidence or prediction intervals, and to return just the predicted values.

Related

Generate beta-binomial distribution from existing vector

Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution, discrete values with a restricted range.
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions which appear to be able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to be able to understand and implement the instructions provided on the function help pages, at least not in a way that has been helpful for my intended purpose yet.
We can look at the distribution of your variable; the y-axis is the probability:
x1 = set1$numbers*2
h = hist(x1,breaks=seq(0,10))
bp = barplot(h$counts/length(x1),names.arg=(h$mids+0.5)/2,ylim=c(0,0.35))
You can try to fit it, but you have too few data points to estimate the 3 parameters needed for a beta-binomial. Hence I fix the probability so that the mean is the mean of your scores; looking at the distribution above, that seems OK:
library(bbmle)
library(emdbook)
library(MASS)
mtmp <- function(prob, size, theta) {
  -sum(dbetabinom(x1, prob, size, theta, log = TRUE))
}
m0 <- mle2(mtmp, start = list(theta = 100),
           data = list(size = 10, prob = mean(x1)/10), control = list(maxit = 1000))
THETA=coef(m0)[1]
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1], dbetabinom(1:10, size = 10, prob = mean(x1)/10, theta = THETA),
      col = "blue", lwd = 2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually OK with using a normal approximation, and in this case it will be:
normal_fit$estimate
    mean       sd
6.560000 1.134196
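Since x1 is the score times 2, a rough 95% interval back on the original 0-5 scale would then be, under this normal approximation:
qnorm(c(0.025, 0.975), mean = MEAN, sd = SD) / 2  # divide by 2 to undo x1 = numbers*2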

How to change the y-axis for a multivariate GAM model from smoothed to actual values?

I am using multivariate GAM models to learn more about fog trends in multiple regions. Fog is determined by visibility going below a certain threshold (< 400 meters). Our GAM model is used to determine the response of visibility to a range of meteorological variables.
However, my challenge right now is that I'd really like the y-axis to be the actual visibility observations rather than the centered smooth. It is interesting to see how visibility is impacted by the covariates relative to the mean visibility in that location, but it's difficult to compare this across multiple locations where the mean visibility is different (and thus the 0 point at which visibility is enhanced or diminished has little comparable meaning).
In order to compare the results of multiple locations, I'm trying to make the y-axis actual visibility observations, and then I'll put a line at the visibility threshold we're interested in (400 m) to evaluate what the predictor variable values are like below that threshold (e.g. what temperatures are associated with visibility below 400 m).
I'm still a beginner when it comes to GAMs and R in general, but I've figured out a few helpful pieces so far.
Helpful things so far:
Attempt 1. how to extract gam fit for each variable in model
Extracting data used to make a smooth plot in mgcv
Attempt 2. how to use predict function to reconstruct a univariable model
http://zevross.com/blog/2014/09/15/recreate-the-gam-partial-regression-smooth-plots-from-r-package-mgcv-with-a-little-style/
Attempt 3. how to get some semblance of a y-axis that looks like visibility observations using "fitted" -- though I don't think this is the correct approach since I'm not taking the intercept into account
http://gsp.humboldt.edu/OLM/R/05_03_GAM.html
simulated data
install.packages("mgcv") #for gam package
require(mgcv)
install.packages("pspline")
require(pspline)
#simulated GAM data for example
dataSet <- gamSim(eg=1,n=400,dist="normal",scale=2)
visibility <- dataSet[[1]]
temperature <- dataSet[[2]]
dewpoint <- dataSet[[3]]
windspeed <- dataSet[[4]]
#Univariable GAM model
gamobj <- gam(visibility ~ s(dewpoint))
plot(gamobj, scale=0, page=1, shade = TRUE, all.terms=TRUE, cex.axis=1.5, cex.lab=1.5, main="Univariable Model: Dew Point")
summary(gamobj)
AIC(gamobj)
abline(h=0)
Univariable Model of Dew Point
https://imgur.com/1uzP34F
ATTEMPT 2 -- predict function with univariable model, but didn't change y-axis
#dummy var that spans length of original covariate
maxDP <-max(dewpoint)
minDP <-min(dewpoint)
DPtrial.seq <-seq(minDP,maxDP,length=3071)
DPtrial.seq <-data.frame(dewpoint=DPtrial.seq)
#predict only the DP term
preds <- predict(gamobj, type = "terms", newdata = DPtrial.seq, se.fit = TRUE)
#determine confidence intervals
DPplot <- DPtrial.seq$dewpoint
fit <- preds$fit
fit.up95  <- fit + 1.96*preds$se.fit
fit.low95 <- fit - 1.96*preds$se.fit
#plot
plot(DPplot, fit, lwd = 3,
     main = "Reconstructed Dew Point Covariate Plot")
#plot confidence intervals
polygon(c(DPplot, rev(DPplot)),
        c(fit.low95, rev(fit.up95)), col = "grey",
        border = NA)
lines(DPplot, fit, lwd=2)
rug(dewpoint)
Reconstructed Dew Point Covariate Plot
https://imgur.com/VS8QEcp
ATTEMPT 3 -- changed y-axis using "fitted" but without taking intercept into account
plot(dewpoint,fitted(gamobj), main="Fitted Response of Y (Visibility) Plotted Against Dew Point")
abline(h=mean(visibility))
rug(dewpoint)
Fitted Response of Y Plotted Against Dew Point https://imgur.com/RO0q6Vw
Ultimately, I want a horizontal line where I can investigate the predictor variable relative to 400 meters, rather than just the mean of the response variable. This way, it will be comparable across multiple sites where the mean visibility is different. Most importantly, it needs to be for multiple covariates!
Gavin Simpson has explained the method in a couple of posts but unfortunately, I really don't understand how I would hold the mean of the other covariates constant as I use the predict function:
Changing the Y axis of default plot.gam graphs
Any deeper explanation into the method for doing this would be super helpful!!!
I'm not sure how helpful this will be, as your Q is a little more open-ended than we'd typically like on SO, but here goes.
Firstly, I think it would help to think about modelling the response variable, which I assume is currently visibility. This is going to be a continuous variable, bounded at 0 (perhaps the data never reach zero?) which suggests modelling the data as conditionally distributed either
gamma (family = Gamma(link = 'log')) for visibility that never takes a value of zero.
Tweedie (family = tw()) for data that do have zeroes.
An alternative approach would be to model the occurrence of fog; if this is defined as an event <400m visibility then you could turn all your observations into 0/1 values for being a fog event or otherwise. Then you'd model the data as conditionally distributed Bernoulli, using family = binomial().
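For illustration, a minimal sketch of the Bernoulli route (the data frame df and the variable names here stand in for your actual data):
library(mgcv)
## hypothetical data frame `df` containing visibility and the covariates
df$fog <- as.integer(df$visibility < 400)  # 1 = fog event (< 400 m), 0 = otherwise
m <- gam(fog ~ s(dewpoint) + s(windspeed) + s(temperature),
         family = binomial(), data = df, method = "REML")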
Having decided on a modelling approach, we need to model the response. This should be done using a multiple regression type of approach, with a GAM including multiple predictors. This way you get to estimate the effect of each potential predictor variable on the response while controlling for the effects of the other predictors. If you just do this using a single predictor at a time, say dewpoint, that variable could well "explain" variation in the data that might be due to another predictor, windspeed say, and you wouldn't know it.
Furthermore, there may well be interactions between predictors that you'll want to control for if they exist, which can only be done in a model that contains multiple predictors.
Then, to finally get to the crux of your problem: having fitted the multi-predictor model to "explain" visibility, you will need to predict from the model for sets of likely conditions. To look at how the visibility varies with dewpoint in a model where other predictor variables have effects, you need to fix the other variables at some reasonable values; one option is to set them to their mean (or modal value in the case of any factor predictor variables), or some other value indicative of typical values for that variable. You'll have to use your domain knowledge for this.
If you have interactions in the model, then you'll need to vary the two variables in the interaction, whilst holding all other variables fixed at some values.
Let's assume you don't have interactions and are interested in dewpoint, but the model also includes windspeed. The mean windspeed for the values used to fit the model can be found from the cmX component of the fitted model. Or you could just calculate this from the observed windspeed values, or set it to some known number you want to use. Denote the fitted model by m and the data frame with your data in it by df; then we can create new data to predict at over the range of dewpoint, whilst holding windspeed fixed.
mn.windspd <- m$cmX['windspeed']
## or
mn.windspd <- with(df, mean(windspeed))
## or set it to some value
mn.windspd <- 10 # say
Then you can do
preddata <- with(df,
                 expand.grid(dewpoint = seq(min(dewpoint),
                                            max(dewpoint),
                                            length = 300),
                             windspeed = mn.windspd))
Then you use this to predict from the fitted model:
pred <- predict(m, newdata = preddata, type = "link", se.fit = TRUE)
pred <- as.data.frame(pred)
Now we want to put these predictions back on to the response scale, and we want a confidence interval, so we have to create that first before back-transforming:
ilink <- family(m)$linkinv
pred <- transform(pred,
                  Fitted = ilink(fit),
                  Upper = ilink(fit + (2 * se.fit)),
                  Lower = ilink(fit - (2 * se.fit)),
                  dewpoint = preddata$dewpoint)
Now you can visualise the effect of dewpoint on the response whilst keeping windspeed fixed.
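For instance, a minimal base-graphics sketch of that plot, using the pred data frame built above and marking the 400 m threshold of interest:
## plot the back-transformed fit and approximate 95% interval
with(pred, {
  plot(dewpoint, Fitted, type = "l", ylim = range(Lower, Upper),
       xlab = "Dew point", ylab = "Visibility (m)")
  lines(dewpoint, Upper, lty = 2)  # upper confidence limit
  lines(dewpoint, Lower, lty = 2)  # lower confidence limit
})
abline(h = 400, col = "red")  # the 400 m fog threshold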
In your case, you will have to extend this to keep temperature constant as well, but that is done in the same way:
mn.windspd <- m$cmX['windspeed']
mn.temp <- m$cmX['temperature']
preddata <- with(df,
                 expand.grid(dewpoint = seq(min(dewpoint),
                                            max(dewpoint),
                                            length = 300),
                             windspeed = mn.windspd,
                             temperature = mn.temp))
and then follow the steps above to do the prediction.
For one or two varying variables I have a function data_slice() in my gratia package which will do the above expand.grid() stuff for you, so you don't have to specify the mean values of the other covariates:
preddata <- data_slice(m, 'dewpoint', n = 300)
Technically, this finds the value in the data closest to the median value (for the covariates that are not varying). If you want means, then do
fixdf <- data.frame(windspeed = mn.windspd, temperature = mn.temp)
preddata <- data_slice(m, 'dewpoint', data = fixdf, n = 300)
If you have an interaction, say between dewpoint and windspeed then you need to vary two variables. This is pretty easy again with expand.grid():
mn.temp <- m$cmX['temperature']
preddata <- with(df,
                 expand.grid(dewpoint = seq(min(dewpoint),
                                            max(dewpoint),
                                            length = 100),
                             windspeed = seq(min(windspeed),
                                             max(windspeed),
                                             length = 100),
                             temperature = mn.temp))
This will create a 100 x 100 grid of values of the covariates to predict at, whilst holding temperature constant.
For data_slice() you'd need to do:
fixdf <- data.frame(temperature = mn.temp)
preddata <- data_slice(m, 'dewpoint', 'windspeed',
                       data = fixdf, n = 300)
Extending this to more covariates that you want to vary is also easy following this pattern with expand.grid(); I have yet to implement more than two varying variables in data_slice().

R absolute value of residuals with log transformation

I have a linear model in R of the form
lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
I want to interpret the residuals but get them back on the scale of num_encounters. I have seen residuals.lm(x, type="working") and residuals.lm(x, type="response"), but I'm not sure about the values they return. Do I, for instance, still need to use exp() to get the residual values back on the num_encounters scale? Or are they already on that scale? I want to plot these absolute values afterwards, both in a histogram and in a raster map.
EDIT:
Basically my confusion is that the following code results in 3 different histograms, while I was expecting the first 2 to be identical.
df$predicted <- exp(predict(x, newdata=df))
histogram(df$num_encounters-df$predicted)
histogram(exp(residuals(x, type="response")))
histogram(residuals(x, type="response"))
I want to interpret the residuals but get them back on the scale of num_encounters.
You can easily calculate them:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
In addition to what #Roland suggests, which indeed is correct and works, my confusion was just a matter of basic high-school logarithm algebra.
Indeed the absolute response residuals (on the scale of the original dependent variable) can be calculated as #Roland says with
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
If you want to calculate them from the model residuals, you need to take the logarithm subtraction rule into account:
log(a) - log(b) = log(a/b)
The residual is calculated from the original model. So in my case, the model predicts log(num_encounters). So the residual is log(observed)-log(predicted).
What I was trying to do was
exp(resid) = exp(log(obs)-log(pred)) = exp(log(obs/pred)) = obs/pred
which is clearly not the number I was looking for. To get the absolute response residual from the model response residual, this is what I needed.
obs-obs/exp(resid)
So in R code, this is what you could also do:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
abs_resid <- df$num_encounters - df$num_encounters/exp(residuals(mod, type="response"))
This resulted in the same number as the method described by #Roland, which is of course much easier. But at least I got my brain lined up again.
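A quick check on simulated data (hypothetical names and coefficients, purely for the comparison) confirms the two calculations agree:
set.seed(1)
df <- data.frame(distance = runif(100, 1, 10),
                 sampling_effort = runif(100, 1, 5))
df$num_encounters <- exp(1 + 0.5*log(df$distance) +
                         0.2*df$sampling_effort + rnorm(100, sd = 0.3))
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data = df)
res1 <- df$num_encounters - exp(predict(mod))  # Roland's method
res2 <- df$num_encounters - df$num_encounters/exp(residuals(mod, type = "response"))
all.equal(res1, res2)  # TRUE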

ROCR cutoff value and accuracy plots

I have a continuous independent variable (let's say 'height') and a binary dependent variable (let's say 'gets a job'). I want to see what cutoff value for height best predicts one's ability to get a job. I also want to see how accurate this model is. I assumed a multinomial logistic model. I wanted a ROC curve, so I used the ROCR package in R. This was my code:
library(nnet)   # multinom()
library(ROCR)   # prediction(), performance()
mymodel <- multinom(job ~ height, data = dataset)
pred <- predict(mymodel, dataset, type = 'prob')
roc_pred <- prediction(pred, dataset$job)
roc <- performance(roc_pred, "tpr", "fpr")
plot(roc, colorize = TRUE)
Now, this is my question. When I colorize the plot, it gives me the range of cutoff values used to make the plot. I'm a little confused as to what the cutoff values actually are, though. Are the cutoff values the heights? Or the probability that a certain data point (person) with a certain height is able to get a job? I have a feeling it's the latter, but I am interested in the former. If it is the latter, how do I obtain the cutoff value for the height?
I found a video that explains the cutoffs you see: https://www.youtube.com/watch?v=YdNhNfJ4Vl8
There are many different ways to estimate optimal cutoffs: the Youden index, Sensitivity + Specificity, Distance to Corner, and many others (see this article).
I suggest you use the pROC package to do so:
library(pROC)
roc <- roc(obs, fit, percent = TRUE)  # response first, then predictor
roc.out <- coords(roc, "best", ret = c("threshold", "sens", "spec"), transpose = TRUE)
method "best" uses the Younden index (J- index) The maximum value of the Youden index is 1 (perfect test) and the minimum is 0 when the test has no diagnostic value. The minimum occurs when sensitivity=1−specificity, i.e., represented by the equal line (the diagonal) in the ROC diagram. The vertical distance between the equal line and the ROC curve is the J-index for that particular cutoff. The J-index is represented by the ROC-curve itself.

inverse of 'predict' function

Using predict() one can obtain the predicted value of the dependent variable (y) for a certain value of the independent variable (x) for a given model. Is there any function that predicts x for a given y?
For example:
kalythos <- data.frame(x = c(20, 35, 45, 55, 70),
                       n = rep(50, 5), y = c(6, 17, 26, 37, 44))
kalythos$Ymat <- cbind(kalythos$y, kalythos$n - kalythos$y)
model <- glm(Ymat ~ x, family = binomial, data = kalythos)
If we want to know the predicted value of the model for x=50:
predict(model, data.frame(x=50), type = "response")
I want to know which x makes y=30, for example.
I saw the previous answer was deleted. In your case, given n = 50 and a binomial model, you would calculate x for a given y with:
f <- function(y, m) {
  ## qlogis() is the logit function in base R
  (qlogis(y/50) - coef(m)[["(Intercept)"]]) / coef(m)[["x"]]
}
> f(30, model)
[1] 48.59833
But when doing so, you had better consult a statistician to show you how to calculate the inverse prediction interval. And please take VitoshKa's considerations into account.
Came across this old thread but thought I would add some other info. Package MASS has the function dose.p for logit/probit models. The SE is obtained via the delta method.
> dose.p(model, p = 0.6)
             Dose       SE
p = 0.6: 48.59833 1.944772
Fitting the inverse model (x ~ y) would not make sense here because, as #VitoshKa says, we assume x is fixed and y (the 0/1 response) is random. Besides, if the data weren't grouped you'd have only 2 values of the explanatory variable: 0 and 1. But even though we assume x is fixed, it still makes sense to calculate a confidence interval for the dose x for a given p, contrary to what #VitoshKa says. Just as we can reparameterize the model in terms of ED50, we can do so for ED60 or any other quantile. Parameters are fixed, but we still calculate CIs for them.
The chemCal package has an inverse.predict() function, which works for fits of the form y ~ x and y ~ x - 1.
You just have to rearrange the regression equation, but as the comments above state this may prove tricky and not necessarily have a meaningful interpretation.
However, for the case you presented you can use:
(1/coef(model)[2])*(model$family$linkfun(30/50)-coef(model)[1])
Note I did the division by the x coefficient first to allow the name attribute to be correct.
For just a quick view (without intervals and other considerations) you could use the TkPredict function in the TeachingDemos package. It does not do this directly, but it lets you dynamically change the x value(s) and see the predicted y-value, so it is fairly simple to move x until the desired y is found (for given values of any additional x's). This will also reveal potential problems with multiple x's that would produce the same y.
