Differences between glm predictions and geom_smooth() using same glm - r

I am trying to reproduce someone else's work using a probit model. Unfortunately, I don't have much information about their methods, only their starting data and a plot of their model.
When I plot the data in ggplot and fit a line using geom_smooth(method = "glm", ...), I am able to reproduce the prior work. However, when I try to fit (what I think is) an identical model outside of ggplot using glm(), I get different predictions. I feel like I'm making some silly mistake, but I can't quite pin it down.
Here is a reproducible example:
library(tidyverse)
set.seed(123)
df <- tibble(x = c(0.006, 0.014, 0.025, 0.05, 0.15, 0.3, 0.5),
             y = c(0.4, 0.733, 0.875, 1, 1, 1, 1))
probit_model <- glm(y ~ x, data = df, family = quasibinomial(link = "probit"))
df <- df %>%
  add_row(x = 0.001, y = NA) %>% # To underline that these models are different
  mutate(y_pred = predict(probit_model, newdata = ., type = "response"))
df %>%
  ggplot(aes(x, y)) +
  geom_point(size = 4) +
  geom_line(aes(y = y_pred), color = "red", lwd = 1) +
  geom_smooth(formula = y ~ x, color = "blue",
              method = "glm", fullrange = TRUE,
              method.args = list(family = quasibinomial(link = "probit"))) +
  scale_x_log10(limits = c(0.001, 1))
And here is the plot it produces. Note that the blue line and the red line describe different fits. I believe they should be the same (ignoring the piecewise nature of the red line), given they use the same model and data.
I've read quite a few threads in the process of troubleshooting, and many responses suggest that geom_smooth() is not a replacement for modelling. Broadly, I agree. That said, I am explicitly trying to figure out what geom_smooth() is doing here, and then reproduce it outside of ggplot.
My questions are:
Why are these two models different?
How is geom_smooth() calling glm()?
How can I call glm() myself to reproduce the model that geom_smooth() is using?

The models are actually the same. You can see this if you set, say, xlim(0, 0.1) and remove scale_x_log10. Then you'll see the fits coincide.
I think the behavior you're seeing is because scale_x_log10 performs the axis transformation before any statistical summaries (such as geom_smooth). So, when you run scale_x_log10, geom_smooth is effectively fitting the model y ~ log10(x), rather than y ~ x. If you use coord_trans(x="log10") instead of scale_x_log10, you'll also see that the models coincide, since coord_trans does the transformation after any statistical summaries.
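Consequently, to reproduce the blue curve outside of ggplot you can fit the same probit model on log10(x) yourself. A minimal sketch, reusing df and the quasibinomial probit family from the question (the object names here are my own):
probit_log <- glm(y ~ log10(x), data = df,
                  family = quasibinomial(link = "probit"))
# response-scale predictions should now trace the blue geom_smooth() line
df$y_pred_log10 <- predict(probit_log, newdata = df, type = "response")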

Related

How to fit non-linear function to data in ggplot2 using maximum likelihood model in R?

The data set (x.test, y.test) follows an exponential decay. I'm trying to fit a custom non-linear function; the code is attached below. The regular points plot just fine, but I'm unable to get the fit line to work. Any suggestions?
x.test <- runif(50,2,8)
y.test <- 0.5^(x.test)
df <- data.frame(x.test, y.test)
library(ggpmisc)
my.formula <- y ~ lambda/ (1 + aii*x)
ggplot(data = df, aes(x = x.test, y = y.test)) +
  geom_point(shape = 21, fill = "white", color = "red", size = 3) +
  stat_smooth(method = "nls", formula = y.test ~ lambda/(1 + aii*x.test),
              method.args = list(start = c(lambda = 1000, aii = -816.39)),
              se = F, color = "red") +
  geom_smooth(method = "lm", formula = my.formula, col = "red") +
  stat_poly_eq(formula = my.formula,
               aes(label = stringr::str_wrap(paste(..eq.label.., ..rr.label.., sep = "~~~"))),
               parse = TRUE, size = 2.5, col = "red") +
  stat_function(fun = function(x.test){
    y.test ~ lambda/(1 + aii*x.test)
  }, color = "blue")
A few things:
you need to use y and x as the variable names in the formula argument to geom_smooth, regardless of what the names are in your data set
you need better starting values (see below)
there's a GLM trick you can use to fit this model; doesn't always work (can be numerically unstable), but it doesn't need starting values and will work more often than nls()
I don't think lm() and stat_poly_eq() are going to work as expected (or maybe at all) with a nonlinear formula ...
simulate data
(same as your code but using set.seed() - probably not important here but good practice)
set.seed(101)
x.test <- runif(50,2,8)
y.test <- 0.5^(x.test)
df <- data.frame(x.test, y.test)
attempt nls fit with your starting values
It's usually a good idea to troubleshoot by fitting any smoothing terms outside of ggplot2, so you have fewer layers to dig through to find the problems:
nls(y.test ~ lambda/(1+ aii*x.test),
start = list(lambda=1000,aii=-816.39),
data = df)
Error in nls(y.test ~ lambda/(1 + aii * x.test), start = list(lambda = 1000, :
singular gradient
OK, still doesn't work. Let's use glm() to get better starting values: we use an inverse-link GLM:
1/y = b0 + b1*x
y = 1/(b0 + b1*x)
= (1/b0)/(1 + (b1/b0)*x)
So:
g1 <- glm(y.test ~ x.test, family = gaussian(link = "inverse"))
s0 <- with(as.list(coef(g1)), list(lambda = 1/`(Intercept)`, aii = x.test/`(Intercept)`))
This gives lambda = -0.09, aii = -0.638 (with a little bit more work we could probably also figure out how to eyeball these by looking at the starting point and scale of the curve).
ggplot(data = df, aes(x = x.test, y = y.test)) +
  geom_point(shape = 21, fill = "white", color = "red", size = 3) +
  stat_smooth(method = "nls",
              formula = y ~ lambda/(1 + aii*x),
              method.args = list(start = s0),
              se = FALSE, color = "red") +
  stat_smooth(method = "glm",
              formula = y ~ x,
              method.args = list(family = gaussian(link = "inverse")),
              color = "blue", linetype = 2)

Plotting output of GAM model

Edit: following the interactions in the responses below, I believe there may be some issues with the plot() or plot.gam() functions when dealing with gam outputs; see the responses below.
I am running a nonparametric regression model <- gam(y ~ x, bs = "cs", data = data).
My data looks like what follows, where x is in logs. I have 273 observations:
y x
[1,] 0.010234756 10.87952
[2,] 0.009165001 10.98407
[3,] 0.001330975 11.26850
[4,] 0.008000957 10.97803
[5,] 0.008579472 10.94924
[6,] 0.009746714 11.01823
I would like to plot the output of the model, basically the fitted curve. When I do
# graph
plot(model)
or
ggplot(data = data, mapping = aes(x = x, y = y)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))
I get the desired output graphs (apologies for the original labels):
However, the two plotted curves are not exactly the same and I did not manage to find the parameters to tweak to remove the differences. Therefore I would like to plot the curve manually.
Here is my current attempt.
model <- gam(y ~ s(x), bs = "cs", data = data)
names(model)
# summary(model)
model_fit <- as.data.frame(cbind(model$y, model$fitted.values,
                                 model$linear.predictors, data$x,
                                 model$residuals))
names(model_fit) <- c("y", "y_fit", "linear_pred", "x", "res")
### here the plotting
ggplot(model_fit) +
  geom_point(aes(x = x, y = y_fit), size = 0.5, alpha = 0.5) +
  geom_line(aes(x = x, y = y_fit))
However I get the following warning
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
and wrong output graph
I do not seem to be able to fix the last graph (the error seems to be in geom_point()) or add the confidence intervals, nor can I find what to tweak in the first two plots to make them exactly the same.
The difference is likely due to the two fits using different fitting algorithms. The default in gam() is (currently) method = "GCV.Cp", even though the recommended option is method = "REML". stat_smooth() uses method = "REML". GCV-based smoothness selection is known to undersmooth in some circumstances, and that seems to be the case here, with the REML solution being a much smoother curve.
If you change to method = "REML" in your gam() call, the differences should disappear.
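A minimal sketch of that change, assuming the model and data from the question (note that in mgcv the bs argument belongs inside s(), not in the gam() call itself):
library(mgcv)
model <- gam(y ~ s(x, bs = "cs"), data = data, method = "REML")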
That said, you really shouldn't be ripping things out of model objects like that - for a start, $residuals is not what you think it is: those are the working residuals from the PIRLS algorithm, which aren't useful in this context. Use the extractor functions like fitted(), residuals(), etc.
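For instance, a minimal sketch with the extractors (assuming model and data from the question):
model_fit <- data.frame(x = data$x,
                        y = data$y,
                        y_fit = fitted(model), # fitted values on the response scale
                        res = residuals(model, type = "deviance")) # not the working residuals in $residuals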
The easiest way to plot your own version of that drawn by plot.gam() is to capture the object returned by plot.gam() and then use that object to draw what you need.
Via plot.gam()
library(gratia) # provides data_sim()
df <- data_sim("eg1", seed = 2)
m <- gam(y ~ s(x2), data = df, method = "REML")
p_obj <- plot(m, residuals = TRUE)
p_obj <- p_obj[[1]] # just one smooth so select the first component
sm_df <- as.data.frame(p_obj[c("x", "se", "fit")])
data_df <- as.data.frame(p_obj[c("raw", "p.resid")])
## plot
ggplot(sm_df, aes(x = x, y = fit)) +
  geom_rug(data = data_df, mapping = aes(x = raw, y = NULL),
           sides = "b") +
  geom_point(data = data_df, mapping = aes(x = raw, y = p.resid)) +
  geom_ribbon(aes(ymin = fit - se, ymax = fit + se, y = NULL),
              alpha = 0.3) +
  geom_line() +
  labs(x = p_obj$xlab, y = p_obj$ylab)
Which produces
Alternatively, you might look at my {gratia} package or the {mgcViz} package of Matteo Fasiolo as options that will do this all for you.
{gratia} example
For example with {gratia}
library('gratia')
draw(m, residuals = TRUE)
which produces
The solution provided by @Gavin Simpson here partially solves the issue: to make the two curves equal, one needs to add method = "REML" to the gam() call. The two curves then have the same slope.
However, for some reason, when plotting the output of a gam() model using either plot() or plot.gam(), the curve does not properly fit the original data. The same happens when manually building the graph from the elements of the object returned by plot.gam(). I am not sure why this happens. In my case, the fitted curve is shifted downwards, clearly "missing" the data points it is supposed to fit. Below are the code and the corresponding output graph, the latter being the same as what you get from plot() or plot.gam(), with the original data points added.
b <- plot(model_1) # or plot.gam(model_1); capture the returned object
data.plot <- as.data.frame(cbind(b[[1]]$x, b[[1]]$fit, b[[1]]$se))
ggplot(data = data.plot, mapping = aes(x = V1, y = V2)) +
  geom_line(aes(x = V1, y = V2)) +
  geom_line(aes(x = V1, y = V2 + V3), linetype = "dashed") +
  geom_line(aes(x = V1, y = V2 - V3), linetype = "dashed") +
  geom_point(data = df_abs, aes(x = log(prd_l_1999), y = prd_gr), size = 0.5, alpha = 0.5)
Misplaced graphs
Note that the ggplot call produces the plot properly; therefore, my ignorant guess is that this may be an issue with the plotting method.
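(For what it's worth, the downward shift is consistent with plot.gam() drawing the centred partial effect of the smooth, i.e. without the model intercept. A hedged sketch using plot.gam()'s documented shift argument, assuming model_1 from above:)
plot(model_1, shift = coef(model_1)[1]) # add the intercept back so the curve lines up with the data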
Working solution
I am not able to prove that the issue is with the plotting functions, but it turns out that this is the same issue as in this question, and the partial solution provided by the OP fixes the plotting while still using the gam() function. Below is (his) code adapted to my case, with the corresponding output graph. As you can see, the graph is plotted properly and the curve fits the data as it is supposed to. I'd say this corroborates my hypothesis, even though I cannot prove it, as I am not knowledgeable enough.
library(data.table)
model_1 <- gam(prd_gr ~ s(log(prd_l_1999)), bs = "cs", data = df_abs, method = "REML")
preds <- predict(model_1, se.fit = TRUE)
my_data <- data.frame(mu = preds$fit,
                      low = preds$fit - 1.96 * preds$se.fit,
                      high = preds$fit + 1.96 * preds$se.fit)
ggplot() +
  geom_line(data = my_data, aes(x = log(df_abs$prd_l_1999), y = mu), size = 1, col = "blue") +
  geom_smooth(data = my_data, aes(ymin = low, ymax = high, x = log(df_abs$prd_l_1999), y = mu),
              stat = "identity", col = "green")

How to replicate a function 1000 times

So basically, I generated 2 random variables X and Y 1000 times and created a data frame Data = data.frame(X, Y) in order to perform smoothing with a spline function. Now I want to recreate exactly that B = 1000 times and plot the smoothing functions (B = 1, ..., 1000) to compare their variability.
simulation = function(d){
  X = runif(1000, 0, 10)
  Y = rpois(1000, lambda = 2*X + 0.2*X*sin(X))
  Data = matrix(data = c(X, Y), ncol = 2)
  smoothing_sim = lm(Y ~ ns(x = X, df = d), data = Data)
  new_x2 = seq(min(X), max(X), length.out = 100)
  adjusted_sim = predict(object = smoothing_sim, newdata = data.frame(X = new_x2))
  return(data.frame(new_x2, smoothing_sim))
}
simulation2 = replicate(n = 1000, simulation)
I'm not sure whether my method is good or not, and I'm also not sure how to plot the functions following the simulation. Anyone care to comment? Thanks!
If you use ggplot, you can make the smooths right in geom_smooth. As ggplot demands long form, using list columns and tidyr::unnest is a useful substitute for replicate, though there are lots of ways to accomplish the data generation step.
library(tidyverse)
set.seed(47)
# A nice theme with a white background to help make low-opacity objects visible
theme_set(hrbrthemes::theme_ipsum_tw())
df <- tibble(replication = seq(100), # scaled down a little
             x = map(replication, ~runif(1000, 0, 10)),
             y = map(x, ~rpois(1000, lambda = 2*.x + 0.2*.x*sin(.x)))) %>%
  unnest()
# base plot with aesthetics and points
point_plot <- ggplot(df, aes(x, y, group = replication)) +
  geom_point(alpha = 0.01, stroke = 0)
point_plot +
  geom_smooth(method = lm, formula = y ~ splines::ns(x), size = .1, se = FALSE)
Controlling the line's alpha can be really helpful for this sort of plot, but the alpha parameter in geom_smooth controls the opacity of the standard error ribbon. To set the alpha of the line, use geom_line with stat_smooth:
point_plot +
  stat_smooth(geom = 'line', method = lm, formula = y ~ splines::ns(x),
              color = 'blue', alpha = 0.03)
Currently, the smooth isn't doing much more than OLS here. To make it more flexible, set the degrees of freedom:
point_plot +
  stat_smooth(geom = 'line', method = lm, formula = y ~ splines::ns(x, df = 5),
              color = 'blue', alpha = 0.03)
Given the response is Poisson, it may be worth scaling up to Poisson regression with glm. The largest impact here is that when x is small, y doesn't dip all the way to 0:
point_plot +
  stat_smooth(geom = 'line', method = glm, method.args = list(family = 'poisson'),
              formula = y ~ splines::ns(x, df = 5), color = 'blue', alpha = 0.03)
Adjust further as you like.
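If you'd rather stay with the replicate()/lapply() route from the question, here is a hedged sketch of the repairs needed (my own fixes, not part of the answer above): the function must actually be called, lm() needs a data frame rather than a matrix, and the predictions (not the model object) should be returned.
library(splines)
simulation <- function(d) {
  X <- runif(1000, 0, 10)
  Y <- rpois(1000, lambda = 2*X + 0.2*X*sin(X))
  dat <- data.frame(X, Y) # lm() wants a data frame, not a matrix
  fit <- lm(Y ~ ns(X, df = d), data = dat)
  new_x <- seq(min(X), max(X), length.out = 100)
  data.frame(x = new_x, y = predict(fit, newdata = data.frame(X = new_x)))
}
# one smoothed curve per replication, overlaid with low alpha
sims <- dplyr::bind_rows(lapply(seq_len(100), function(i) cbind(rep = i, simulation(d = 5))))
ggplot(sims, aes(x, y, group = rep)) + geom_line(alpha = 0.05)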

ggplot GLM fitted curve without interaction

I want to add the fitted function from a GLM to a ggplot. By default, it automatically creates the plot with the interaction. I am wondering if I can plot the fitted function from the model without the interaction. For example,
dta <- read.csv("http://www.ats.ucla.edu/stat/data/poisson_sim.csv")
dta <- within(dta, {
  prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
  id <- factor(id)
})
plt <- ggplot(dta, aes(math, num_awards, col = prog)) +
  geom_point(size = 2) +
  geom_smooth(method = "glm", se = F,
              method.args = list(family = "poisson"))
print(plt)
gives the plot with interaction,
However, I want the plot from the model,
`num_awards` = ß0 + ß1*`math` + ß2*`prog` + error
I tried to get this this way,
mod <- glm(num_awards ~ math + prog, data = dta, family = "poisson")
fun.gen <- function(awd) exp(mod$coef[1] + mod$coef[2] * awd)
fun.acd <- function(awd) exp(mod$coef[1] + mod$coef[2] * awd + mod$coef[3])
fun.voc <- function(awd) exp(mod$coef[1] + mod$coef[2] * awd + mod$coef[4])
ggplot(dta, aes(math, num_awards, col = prog)) +
  geom_point() +
  stat_function(fun = fun.gen, col = "red") +
  stat_function(fun = fun.acd, col = "green") +
  stat_function(fun = fun.voc, col = "blue") +
  geom_smooth(method = "glm", se = F,
              method.args = list(family = "poisson"), linetype = "dashed")
The output plot is
Is there any simple way in ggplot to do this efficiently?
Ben's idea of plotting the predicted value of the response for specific model terms inspired me to improve the type = "y.pc" option of the sjp.glm function. A new update is on GitHub, with version number 1.9.4-3.
Now you can plot predicted values for specific terms, one which is used along the x-axis, and a second one used as grouping factor:
sjp.glm(mod, type = "y.pc", vars = c("math", "prog"))
which gives you following plot:
The vars argument is needed in case your model has more than two terms, to specify the term for the x-axis-range and the term for the grouping.
You can also facet the groups:
sjp.glm(mod, type = "y.pc", vars = c("math", "prog"), show.ci = T, facet.grid = T)
There's no way that I know of to trick geom_smooth() into doing this, but you can do a little better than you've done. You still have to fit the model yourself and add the lines, but you can use the predict() method to generate the predictions and load them into a data frame with the same structure as the original data ...
mod <- glm(num_awards ~ math + prog, data = dta, family = "poisson")
## generate prediction frame
pframe <- with(dta,
               expand.grid(math = seq(min(math), max(math), length = 51),
                           prog = levels(prog)))
## add predicted values (on response scale) to prediction frame
pframe$num_awards <- predict(mod, newdata = pframe, type = "response")
ggplot(dta, aes(math, num_awards, col = prog)) +
  geom_point() +
  geom_smooth(method = "glm", se = FALSE,
              method.args = list(family = "poisson"), linetype = "dashed") +
  geom_line(data = pframe) ## use prediction data here
## (inherits aesthetics etc. from main ggplot call)
(the only difference here is that the way I've done it the predictions span the full horizontal range for all groups, as if you had specified fullrange=TRUE in geom_smooth()).
In principle it seems as though the sjPlot package should be able to handle this sort of thing, but it looks like the relevant bit of code for doing this plot type is hard-coded to assume a binomial GLM ... oh well.
I'm not sure, but you wrote "without interaction" - maybe you are looking for effect plots? (If not, excuse me that I'm assuming something completely wrong...)
You can, for instance, use the effects package for this.
dta <- read.csv("http://www.ats.ucla.edu/stat/data/poisson_sim.csv")
dta <- within(dta, {
  prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
  id <- factor(id)
})
mod <- glm(num_awards ~ math + prog, data = dta, family = "poisson")
library(effects)
plot(allEffects(mod))
Another option would be the sjPlot package, as Ben suggested - however, the current version on CRAN only supports logistic regression models properly for effect plots. But in the current development version on GitHub I added support for various model families and link functions, so if you like, you can download that snapshot. The sjPlot package uses ggplot instead of lattice (which is used by the effects package, I think):
sjp.glm(mod, type = "eff", show.ci = T)
Or in non-faceted way:
sjp.glm(mod, type = "eff", facet.grid = F, show.ci = T)

A replacement for method = 'loess'

This is where I'm at so far:
I have a data frame df with two columns A and B (both containing real numbers), where B is dependent on A. I plot the columns against each other:
p = ggplot(df, aes(A, B)) + geom_point()
and see that the relationship is non-linear. Adding:
p = p + geom_smooth(method = 'loess', span = 1)
gives a 'good' line of best fit. Given a new value a of A I then use the following method to predict the value of B:
B.loess = loess(B ~ A, span = 1, data = df)
predict(B.loess, newdata = a)
So far, so good. However, I then realise I can't extrapolate using loess (presumably because it is non-parametric?!). Extrapolation seems fairly natural here - the relationship looks like something power-like is going on, e.g.:
x = c(1:10)
y = 2^x
df = data.frame(A = x, B = y)
This is where I come unstuck. Firstly, what methods can I use to plot a line of best fit to this kind of ('power') data without using loess? Pathetic attempts such as:
p = ggplot(df, aes(A, B)) + geom_point() +
  geom_smooth(method = 'lm', formula = log(y) ~ x)
give me errors. Also, assuming I am actually able to plot a line of best fit that I am happy with, I am having trouble using predict the way I did with loess. For example's sake, suppose I am happy with this line of best fit:
p = ggplot(df, aes(A, B)) + geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x)
then if I want to predict what value B would take if A was equal to 11 (theoretically 2^11), the following method does not work:
B.lm = lm(B ~ A)
predict(B.lm, newdata = 11)
Any help much appreciated. Cheers.
First, to answer your last question: you need to provide a data.frame whose column names match the predictors.
B.lm <- lm(B ~ A,data=df)
predict(B.lm, newdata = data.frame(A=11))
1
683.3333
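As for fitting the 'power' data itself, one option (my suggestion, not part of the answer above) is to fit on the log scale and back-transform the predictions; for B = 2^A this is exact, since log(2^A) = A*log(2) is linear in A:
B.log <- lm(log(B) ~ A, data = df)
exp(predict(B.log, newdata = data.frame(A = 11))) # ~2048 = 2^11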
As an alternative to loess you can try a higher-order polynomial regression. In the plot below I compare a cubic polynomial fit (poly(x, 3)) to loess, using latticeExtra (which makes it easier to add the xspline interpolation) with a layer() syntax similar to ggplot2:
library(latticeExtra)
library(grid) # for grid.xspline()
xyplot(A ~ B, data = df, par.settings = ggplot2like(),
       panel = function(x, y, ...){
         panel.xyplot(x, y, ...)
         grid.xspline(x, y, ..., default.units = "native") ## xspline interpolation
       }) +
  layer(panel.smoother(y ~ poly(x, 3), method = "lm"), style = 1) + ## cubic polynomial
  layer(panel.smoother(y ~ x, span = 0.9), style = 2) ### loess
The default surface for loess.control is "interpolate", which, unsurprisingly, doesn't allow extrapolation. The alternative, "direct", allows you to extrapolate, though a question remains as to whether that is meaningful:
predict(loess(hp ~ disp, mtcars), newdata = 1000)
[1] NA
predict(loess(hp ~ disp, mtcars, control = loess.control(surface = "direct")), newdata = 1000)
[1] -785.0545
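Applied to the power data from the question (a hedged sketch reusing df and span = 1 from above; whether the extrapolation is trustworthy is another matter):
B.loess <- loess(B ~ A, span = 1, data = df,
                 control = loess.control(surface = "direct"))
predict(B.loess, newdata = data.frame(A = 11)) # returns a number rather than NA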
