`gam` package: extra shift spotted when sketching data on `plot.gam` - r

I try to fit a GAM using the gam package (I know mgcv is more flexible, but I need to use gam here). I now have the problem that the model looks good, but in comparison with the original data it seems to be offset along the y-axis by a constant value, for which I cannot figure out where this comes from.
This code reproduces the problem:
library(gam)
data(gam.data)
x <- gam.data$x
y <- gam.data$y
fit <- gam(y ~ s(x,6))
fit$coefficients
#(Intercept) s(x, 6)
# 1.921819 -2.318771
plot(fit, ylim = range(y))
points(x, y)
points(x, y -1.921819, col=2)
legend("topright", pch=1, col=1:2, legend=c("Original", "Minus intercept"))
Chambers, J. M. and Hastie, T. J. (1993) Statistical Models in S (Chapman & Hall) shows that there should not be an offset, and this is also intuitively correct (the smooth should describe the data).
I noticed something comparable in mgcv, which can be solved by providing the shift parameter with the intercept value of the model (because the smooth is seemingly centred). I thought the same could be true here, so I subtracted the intercept from the original data-points. However, the plot above shows this idea wrong. I don't know where the extra shift comes from. I hope someone here may be able to help me.
(R version. 3.3.1; gam version 1.12)

I think I should first explain various output in the fitted GAM model:
library(gam)
data(gam.data)
x <- gam.data$x
y <- gam.data$y
fit <-gam(y ~ s(x,6), model = FALSE)
## coefficients for parametric part
## this includes intercept and null space of spline
beta <- coef(fit)
## null space of spline smooth (a linear term, just `x`)
nullspace <- fit$smooth.frame[,1]
nullspace - x ## all 0
## smooth space that are penalized
## note, the backfitting procedure guarantees that this is centred
pensmooth <- fit$smooth[,1]
sum(pensmooth) ## centred
# [1] 5.89806e-17
## estimated smooth function (null space + penalized space)
smooth <- nullspace * beta[2] + pensmooth
## centred smooth function (this is what `plot.gam` is going to plot)
c0 <- mean(smooth)
censmooth <- smooth - c0
## additive predictors (this is just fitted values in Gaussian case)
addpred <- beta[1] + smooth
You can first verify that addpred is what fit$additive.predictors gives, and since we are fitting additive models with Gaussian response, this is also as same as fit$fitted.values.
What plot.gam does, is to plot censmooth:
plot.gam(fit, col = 4, ylim = c(-1.5,1.5))
points(x, censmooth, col = "gray")
Remember, there is
addpred = beta[0] + censmooth + c0
If you want to shift original data y to match this plot, you not only need to subtract intercept (beta[0]), but also c0 from y:
points(x, y - beta[1] - c0)

Related

How to extrapolate loess result? [duplicate]

I am struggling with "out-of-sample" prediction using loess. I get NA values for new x that are outside the original sample. Can I get these predictions?
x <- c(24,36,48,60,84,120,180)
y <- c(3.94,4.03,4.29,4.30,4.63,4.86,5.02)
lo <- loess(y~x)
x.all <- seq(3, 200, 3)
predict(object = lo, newdata = x.all)
I need to model full yield curve, i.e. interest rates for different maturities.
From the manual page of predict.loess:
When the fit was made using surface = "interpolate" (the default), predict.loess will not extrapolate – so points outside an axis-aligned hypercube enclosing the original data will have missing (NA) predictions and standard errors
If you change the surface parameter to "direct" you can extrapolate values.
For instance, this will work (on a side note: after plotting the prediction, my feeling is that you should increase the span parameter in the loess call a little bit):
lo <- loess(y~x, control=loess.control(surface="direct"))
predict(lo, newdata=x.all)
In addition to nico's answer: I would suggest to fit a gam (which uses penalized regression splines) instead. However, extrapolation is not advisable if you don't have a model based on science.
x <- c(24,36,48,60,84,120,180)
y <- c(3.94,4.03,4.29,4.30,4.63,4.86,5.02)
lo <- loess(y~x, control=loess.control(surface = "direct"))
plot(x.all <- seq(3,200,3),
predict(object = lo,newdata = x.all),
type="l", col="blue")
points(x, y)
library(mgcv)
fit <- gam(y ~ s(x, bs="cr", k=7, fx =FALSE), data = data.frame(x, y))
summary(fit)
lines(x.all, predict(fit, newdata = data.frame(x = x.all)), col="green")

Drawing a 3D decision boundary of logistic regression

I have fitted a logistic regression model that takes 3 variables into account. I would like to make a 3D plot of the datapoints and draw the decision boundary (which I suppose would be a plane here).
I found an online example that applies to the case (so that you can load the data directly)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
I was thinking of using the 3Dscatterplot package, but I am not sure what equation I should write to draw the boundary. Any ideas?
Many thanks,
The decision boundary will be a 3-d plane, which you could plot with any 3-d plotting package in R. I'll use persp by defining an x-y grid and then calculating the corresponding z value with the outer function:
# Use iris dataset for example logistic regression
data(iris)
iris$long <- as.numeric(iris$Sepal.Length > 6)
mod <- glm(long~Sepal.Width+Petal.Length+Petal.Width, data=iris, family="binomial")
# Plot 50% decision boundary; another cutoff can be achieved by changing the intercept term
x <- seq(2, 5, by=.1)
y <- seq(1, 7, by=.1)
z <- outer(x, y, function(x, y) (-coef(mod)[1] - coef(mod)[2]*x - coef(mod)[3]*y) /
coef(mod)[4])
persp(x, y, z, col="lightblue")

How to find "y" values of the already estimated monotone function of the non-monotone regression curve corresponding to the original "x" points?

The title sounds complicated but that is what I am looking for. Focus on the picture.
## data
x <- c(1.009648,1.017896,1.021773,1.043659,1.060277,1.074578,1.075495,1.097086,1.106268,1.110550,1.117795,1.143573,1.166305,1.177850,1.188795,1.198032,1.200526,1.223329,1.235814,1.239068,1.243189,1.260003,1.262732,1.266907,1.269932,1.284472,1.307483,1.323714,1.326705,1.328625,1.372419,1.398703,1.404474,1.414360,1.415909,1.418254,1.430865,1.431476,1.437642,1.438682,1.447056,1.456152,1.457934,1.457993,1.465968,1.478041,1.478076,1.485995,1.486357,1.490379,1.490719)
y <- c(0.5102649,0.0000000,0.6360097,0.0000000,0.8692671,0.0000000,1.0000000,0.0000000,0.4183691,0.8953987,0.3442624,0.0000000,0.7513169,0.0000000,0.0000000,0.0000000,0.0000000,0.1291901,0.4936121,0.7565551,1.0085108,0.0000000,0.0000000,0.1655482,0.0000000,0.1473168,0.0000000,0.0000000,0.0000000,0.1875293,0.4918018,0.0000000,0.0000000,0.8101771,0.6853480,0.0000000,0.0000000,0.0000000,0.0000000,0.4068802,1.1061434,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.6391678)
fit1 <- c(0.5102649100,0.5153380934,0.5177234836,0.5255544980,0.5307668662,0.5068087080,0.5071001179,0.4825657520,0.4832969250,0.4836378194,0.4842147729,0.5004039310,0.4987301366,0.4978800742,0.4978042478,0.4969807064,0.5086987191,0.4989497612,0.4936121200,0.4922210302,0.4904593166,0.4775197108,0.4757040857,0.4729265271,0.4709141776,0.4612406896,0.4459316517,0.4351338346,0.4331439717,0.4318664278,0.3235179189,0.2907908968,0.1665721429,0.1474035158,0.1443999345,0.1398517097,0.1153991839,0.1142140393,0.1022584672,0.1002410843,0.0840033244,0.0663669309,0.0629119398,0.0627979240,0.0473336492,0.0239237481,0.0238556876,0.0084990298,0.0077970954,0.0000000000,-0.0006598571)
fit2 <- c(-0.0006598571,0.0153328298,0.0228511733,0.0652889427,0.0975108758,0.1252414661,0.1270195143,0.1922510501,0.2965234797,0.3018551305,0.3108761043,0.3621749370,0.4184150225,0.4359301495,0.4432114081,0.4493565757,0.4510158144,0.4661865431,0.4744926045,0.4766574718,0.4796937554,0.4834718810,0.4836125426,0.4839450098,0.4841092849,0.4877317306,0.4930561638,0.4964939389,0.4970089201,0.4971376528,0.4990394601,0.5005881678,0.5023814257,0.5052125977,0.5056691690,0.5064254338,0.5115481820,0.5117259449,0.5146054557,0.5149729419,0.5184178197,0.5211542908,0.5216215426,0.5216426533,0.5239797875,0.5273573222,0.5273683002,0.5293994824,0.5295130266,0.5306236672,0.5307303109)
## picture
plot(x, y)
## red regression curve
points(x, fit1, col=2); lines(x, fit1, col=2)
## blue monotonic curve to the regression
points(min(x) + cumsum(c(0, rev(diff(x)))), rev(fit2), col="blue"); lines(min(x) + cumsum(c(0, rev(diff(x)))), rev(fit2), col="blue")
## "x" original point matches with the regression estimated point
## but not with the estimated (fit2=estimate) monotonic curve
abline(v=1.223329, lty=2, col="grey")
Focus on the dashed grey line. The idea is to get y value of the monotonic blue curve corresponding to x original value. The grey line should cross three points (the original one "black", the regression estimate "red", the adjusted regression estimate "blue"). Can we do this?
Methodology:
The object "fit2" is the output of the function rearrangement(). It is always monotonically increasing.
library(Rearrangement)
fit2 <- rearrangement(x=as.data.frame(x), y=fit1)
It sounds like you might be interested in approxfun
fn <- approxfun(x=min(x) + cumsum(c(0, rev(diff(x)))), y=rev(fit2))
fn(1.223329)
It's not very fancy, but it will do a basic linear interpolation between observed points for unobserved x values. The code above will estimate a y value for the value x=1.223329 using the existing points. You can then use the fn function to estimate other points as well.
I found a way to do this without approxfun. The rearrangement function returns a monotonically increasing result. Your fitted curve is decreasing, and you can do a simple trick to get what you want (also what you wanted in your earlier question).
## Apply rearrangement on minus fit1
fit3 <- rearrangement(x=as.data.frame(x), y = - fit1)
## Plot the minus rearranged result
plot(x, y)
lines(x, - fit3, col="green"); points(x, - fit3, col="green")
lines(x, fit1, col="green"); points(x, fit1, col="green")
So, the result is a monotonically decreasing curve, with x values equal to that of your data and your fit.
ind <- which(x == 1.223329)
-fit3[ind]
## 0.4857717
Hope it helps,
alex

How to compute prediction intervals for a circle fit in R

I wish to compute the prediction interval of the radius from a circle fit with the formula > r² = (x-h)²+(y-k)². r- radius of the circle, x,y, are gaussian coordinates, h,k, mark the center of the fitted circle.
# data
x <- c(1,2.2,1,2.5,1.5,0.5,1.7)
y <- c(1,1,3,2.5,4,1.7,0.8)
# using nls.lm from minpack.lm (minimising the sum of squared residuals)
library(minpack.lm)
residFun <- function(par,x,y) {
res <- sqrt((x-par$h)^2+(y-par$k)^2)-par$r
return(res)
}
parStart <- list("h" = 1.5, "k" = 2.5, "r" = 1.7)
out <- nls.lm(par = parStart, x = x, y = y, lower =NULL, upper = NULL, residFun)
The problem is, predict() doesn't work with nls.lm, hence I am trying to compute the circle fit using nlsLM. (I could compute it by hand, but have troubles creating my Designmatrix).`
So this is what I tried next:
dat = list("x" = x,"y" = y)
out1 <- nlsLM(y ~ sqrt(-(x-h)^2+r^2)+k, start = parStart )
which results in:
Error in stats:::nlsModel(formula, mf, start, wts) :
singular gradient matrix at initial parameter estimates
Question 1a: How does nlsLM() work with circle fits? (advantage being that the generic predict() is available.
Question 1b: How do I get the prediction interval for my circle fit?
EXAMPLE from linear regression (this is what I want for the circle regression)
attach(faithful)
eruption.lm = lm(eruptions ~ waiting)
newdata = data.frame(waiting=seq(45,90, length = 272))
# confidence interval
conf <- predict(eruption.lm, newdata, interval="confidence")
# prediction interval
pred <- predict(eruption.lm, newdata, interval="predict")
# plot of the data [1], the regression line [1], confidence interval [2], and prediction interval [3]
plot(eruptions ~ waiting)
lines(conf[,1] ~ newdata$waiting, col = "black") # [1]
lines(conf[,2] ~ newdata$waiting, col = "red") # [2]
lines(conf[,3] ~ newdata$waiting, col = "red") # [2]
lines(pred[,2] ~ newdata$waiting, col = "blue") # [3]
lines(pred[,3] ~ newdata$waiting, col = "blue") # [3]
Kind regards
Summary of Edits:
Edit1: Rearranged formula in nlsLM, but parameter (h,k,r) results are now different in out and out1 ...
Edit2: Added 2 wikipedia links for clarification puprose on terminology used: (c.f. below)
confidence interval
prediction interval
Edit3: Some rephrasing of the question(s)
Edit4: Added a working example for linear regression
I am having a hard time figuring out what you want to do. Let me illustrate what the data looks like and something about the "prediction".
plot(x,y, xlim=range(x)*c(0, 1.5), ylim=range(y)*c(0, 1.5))
lines(out$par$h+c(-1,-1,1,1,-1)*out$par$r, # extremes of x-coord
out$par$k+c(-1,1,1,-1 ,-1)*out$par$r, # extremes of y-coord
col="red")
So what "prediction interval" are we speaking about? ( I do realize that you were thinking of a circle and if you just want to plot a circle on this background that's going to be pretty easy as well.)
lines(out$par$h+cos(seq(-pi,pi, by=0.1))*out$par$r, #center + r*cos(theta)
out$par$k+sin(seq(-pi,pi, by=0.1))*out$par$r, #center + r*sin(theta)
col="red")
I think that this question is not answerable in its current form. Any predict() function that is based on a linear model will require the predicted variable to be a linear function of the input design matrix. r^2 = (x-x0)^2 + (y-y0)^2 is not a linear function of the design matrix (which would be something like [x0 x y0 y], so I don't think you're going to be able to find a linear model fit that will give you confidence intervals. If someone more clever than I am has a way to do it, though, I'd be very interested in hearing about it.
The general way to approach these sorts of problems is to create a hierarchical nonlinear model, where your hyperparameters would be x0 and y0 (your h and k) with uniform distribution over your search space, and then the r^2 would be distributed ~N((x-x0)^2+(y-y0)^2, \sigma). You would then use MCMC sampling or similar to get your posterior confidence intervals.
Here's a solution to find h,k,r using base R's optim function. You essentially create a cost function that is a closure containing the data you wish to optimize over. I had to RSS value, else we would go to -Inf. There is a local optima problem, so you need to run this a few times...
# data
x <- c(1,2.2,1,2.5,1.5,0.5,1.7)
y <- c(1,1,3,2.5,4,1.7,0.8)
residFunArg <- function(xVector,yVector){
function(theta,xVec=xVector,yVec=yVector){
#print(xVec);print(h);print(r);print(k)
sum(sqrt((xVec-theta[1])^2+(yVec-theta[2])^2)-theta[3])^2
}
}
rFun = residFunArg(x,y);
o = optim(f=rFun,par=c(0,0,0))
h = o$par[1]
k = o$par[2]
r = o$par[3]
Run this command in the REPL to observe the local mins:
o=optim(f=tFun,par=runif(3),method="CG");o$par

Fitting logarithmic curve in R

If I have a set of points in R that are linear I can do the following to plot the points, fit a line to them, then display the line:
x=c(61,610,1037,2074,3050,4087,5002,6100,7015)
y=c(0.401244, 0.844381, 1.18922, 1.93864, 2.76673, 3.52449, 4.21855, 5.04368, 5.80071)
plot(x,y)
Estimate = lm(y ~ x)
abline(Estimate)
Now, if I have a set of points that looks like a logarithmic curve fit is more appropriate such as the following:
x=c(61,610,1037,2074,3050,4087,5002,6100,7015)
y=c(0.974206,1.16716,1.19879,1.28192,1.30739,1.32019,1.35494,1.36941,1.37505)
I know I can get the standard regression fit against the log of the x values with the following:
logEstimate = lm(y ~ log(x))
But then how do I transform that logEstimate back to normal scaling and plot the curve against my linear curve from earlier?
Hmmm, I'm not quite sure what you mean by "plot the curve against my linear curve from earlier".
d <- data.frame(x,y) ## need to use data in a data.frame for predict()
logEstimate <- lm(y~log(x),data=d)
Here are three ways to get predicted values:
(1) Use predict:
plot(x,y)
xvec <- seq(0,7000,length=101)
logpred <- predict(logEstimate,newdata=data.frame(x=xvec))
lines(xvec,logpred)
(2) Extract the numeric coefficient values:
coef(logEstimate)
## (Intercept) log(x)
## 0.6183839 0.0856404
curve(0.61838+0.08564*log(x),add=TRUE,col=2)
(3) Use with() magic (you need back-quotes around the parameter estimate names because they contain parentheses)
with(as.list(coef(logEstimate)),
curve(`(Intercept)`+`log(x)`*log(x),add=TRUE,col=4))
Maybe what you want is
est1 <- predict(lm(y~x,data=d),newdata=data.frame(x=xvec))
plot(est1,logpred)
... although I'm not sure why ...
I'm not exactly sure what you mean either... but I guessed a little different. I think you want to fit two models to those points, one linear, and one logged. Then, you want to plot the points, and the functional form of both models. Here is the code for that:
x=c(61,610,1037,2074,3050,4087,5002,6100,7015)
y=c(0.974206,1.16716,1.19879,1.28192,1.30739,1.32019,1.35494,1.36941,1.37505)
Estimate = lm(y ~ x)
logEstimate = lm(y ~ log(x))
plot(x,predict(Estimate),type='l',col='blue')
lines(x,predict(logEstimate),col='red')
points(x,y)
In response to your second question in the comment, linear regression does always return a linear combination of your predictors, but that doesn't necessarily mean that it is a straight line. Think about what your log transformation really means: If you fit,
y = log(x)
that is the same as fitting
exp(y) = x
Which means that as x increases linearly, then y will change exponentially, which is clearly not a 'straight line'. However, if you transformed your x-axis on the log scale, then the displayed line would be straight.

Resources