Plotting output of GAM model - r

Edit: following the discussion in the answers below, I believe there may be some issues with the plot() or plot.gam() functions when dealing with gam outputs. See the answers below.
I am running the non-parametric regression model <- gam(y ~ s(x, bs = "cs"), data = data).
My data look as follows, where x is in logs; I have 273 observations:
y x
[1,] 0.010234756 10.87952
[2,] 0.009165001 10.98407
[3,] 0.001330975 11.26850
[4,] 0.008000957 10.97803
[5,] 0.008579472 10.94924
[6,] 0.009746714 11.01823
I would like to plot the output of the model, basically the fitted curve. When I do
# graph
plot(model)
or
ggplot(data = data, mapping = aes(x = x, y = y)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method="gam", formula= y~s(x, bs = "cs") )
I get the desired output graphs (apologies for the original labels):
However, the two plotted curves are not exactly the same and I did not manage to find the parameters to tweak to remove the differences. Therefore I would like to plot the curve manually.
Here is my current attempt.
model <- gam(y ~ s(x, bs = "cs"), data = data)
names(model)
# summary(model)
model_fit <- as.data.frame(cbind(model$y, model$fitted.values,
model$linear.predictors, data$x,
model$residuals))
names(model_fit) <- c("y", "y_fit", "linear_pred", "x", "res")
### here the plotting
ggplot(model_fit) +
geom_point(aes(x = x, y = y_fit), size = 0.5, alpha = 0.5) +
geom_line(aes(x = x, y = y_fit))
However I get the following warning
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
and the wrong output graph.
I cannot seem to fix the last graph (the problem appears to be in geom_point()) or add confidence intervals, nor can I find what to tweak in the first two plots to make them exactly the same.

The difference is likely due to your using different fitting algorithms. The default in gam() is (currently) method = "GCV.Cp", even though the recommended option is method = "REML". stat_smooth() uses method = "REML". GCV-based smoothness selection is known to undersmooth in some circumstances, and that seems to be the case here, with the REML solution being a much smoother curve.
If you change to method = "REML" in your gam() call, the differences should disappear.
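For example, refitting the model from the question with REML smoothness selection:
model <- gam(y ~ s(x, bs = "cs"), data = data, method = "REML")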
That said, you really shouldn't be ripping things out of model objects like that - for a start, $residuals is not what you think it is: those are the working residuals from the PIRLS algorithm, which are not useful in this context. Use the extractor functions like fitted(), residuals(), etc.
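For example, the manual data frame from the question could be rebuilt with the extractor functions (a sketch, assuming the model and data objects from the question):
model_fit <- data.frame(x = data$x,
                        y = data$y,
                        y_fit = fitted(model),  # fitted values on the response scale
                        res = residuals(model, type = "deviance"))  # deviance residuals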
The easiest way to plot your own version of that drawn by plot.gam() is to capture the object returned by plot.gam() and then use that object to draw what you need.
Via plot.gam()
library(mgcv)    # gam()
library(gratia)  # data_sim()
library(ggplot2)
df <- data_sim("eg1", seed = 2)
m <- gam(y ~ s(x2), data = df, method = "REML")
p_obj <- plot(m, residuals = TRUE)
p_obj <- p_obj[[1]] # just one smooth so select the first component
sm_df <- as.data.frame(p_obj[c("x", "se", "fit")])
data_df <- as.data.frame(p_obj[c("raw", "p.resid")])
## plot
ggplot(sm_df, aes(x = x, y = fit)) +
geom_rug(data = data_df, mapping = aes(x = raw, y = NULL),
sides = "b") +
geom_point(data = data_df, mapping = aes(x = raw, y = p.resid)) +
geom_ribbon(aes(ymin = fit - se, ymax = fit + se, y = NULL),
alpha = 0.3) +
geom_line() +
labs(x = p_obj$xlab, y = p_obj$ylab)
Which produces
Alternatively, you might look at my {gratia} package or the {mgcViz} package of Matteo Fasiolo as options that will do this all for you.
{gratia} example
For example with {gratia}
library('gratia')
draw(m, residuals = TRUE)
which produces

The solution provided by @Gavin Simpson above partially solves the issue: to make the two curves equal, one needs to add method = "REML". The two curves then have the same slope.
However, for some reason, when plotting the output of a gam() model using either plot() or plot.gam(), the curve does not fit the original data properly. The same happens when plotting the graph manually from the elements extracted out of the object returned by plot.gam(). I am not sure why this happens. In my case, the fitted curve is shifted downwards, clearly "missing" the data points it is supposed to fit. Below are the code and the corresponding output graph; the latter is the same as what plot() or plot.gam() gives, with the original data points added to the plot.
b <- plot(model_1)  # or plot.gam(model_1); capture the returned object
data.plot <- as.data.frame(cbind(b[[1]]$x, b[[1]]$fit, b[[1]]$se))
names(data.plot) <- c("x", "fit", "se")
ggplot(data = data.plot, mapping = aes(x = x, y = fit)) +
  geom_line(aes(x = x, y = fit)) +
  geom_line(aes(x = x, y = fit + se), linetype = "dashed") +
  geom_line(aes(x = x, y = fit - se), linetype = "dashed") +
  geom_point(data = df_abs, aes(x = log(prd_l_1999), y = prd_gr), size = 0.5, alpha = 0.5)
Misplaced graphs
Note that the ggplot() call plots the data properly, so my guess is that this comes down to how the plotting method works: plot.gam() draws each smooth centred about zero (the partial effect, without the model intercept), which would explain the downward shift.
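If that is the explanation, adding the estimated intercept back onto the extracted smooth should line the curve up with the data; a one-line sketch, assuming the model_1 and data.plot objects from above and a Gaussian model with identity link:
data.plot$fit_response <- coef(model_1)["(Intercept)"] + data.plot$fit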
Working solution
I am not able to prove that the issue is with the plotting functions, but it turns out that this is the same issue as in this question, and the partial solution provided by the OP there fixes the plotting while still using the gam() function. Below is (his) code adapted to my case, with the corresponding output graph. As you can see, the graph is plotted properly and the curve fits the data as it is supposed to. I'd say this corroborates my hypothesis, even though I cannot prove it, as I am not knowledgeable enough.
library(mgcv)
library(ggplot2)
model_1 <- gam(prd_gr ~ s(log(prd_l_1999), bs = "cs"), data = df_abs, method = "REML")
preds <- predict(model_1, se.fit = TRUE)
my_data <- data.frame(x = log(df_abs$prd_l_1999),
                      mu = preds$fit,
                      low = preds$fit - 1.96 * preds$se.fit,
                      high = preds$fit + 1.96 * preds$se.fit)
ggplot(my_data, aes(x = x, y = mu)) +
  geom_line(size = 1, col = "blue") +
  geom_smooth(aes(ymin = low, ymax = high), stat = "identity", col = "green")

Related

Plotting GAM in R: Setting custom x-axis limits?

Is there a way to set the x-axis limits when plotting the predicted fits for GAM models? More specifically, I'm fitting a smoother for each level of a factor using 'by ='; however, each factor level has a different range of values. Plotting the variable in ggplot results in an x-axis that automatically accommodates the different ranges of 'x'; however, after fitting a GAM (mgcv::gam()), the default behavior of plot.gam() appears to be to predict values across a shared x-axis limit.
The dummy data below has some continuous variable for 'x', but in my real data, 'x' is Time (year), and 'group' is sampling location. Because I did not collect data from each site across the same time range, I feel it is inappropriate to show a model fit in these empty years.
library(tidyverse)
library(mgcv)
library(gratia)
theme_set(theme_classic())
## simulate data with a grouping variable of three levels:
d = data.frame(group = rep(c('A','B','C'), each = 100),
               x = c(seq(0, 1, length = 100),
                     seq(.2, 1, length = 100),
                     seq(0, .5, length = 100))) %>%
  mutate(y = sin(2*pi*x) + rnorm(n(), sd = 0.3),  # n() so the noise matches all 300 rows
         group = as.factor(group))
## Look at data
ggplot(d, aes(x = x, y = y, colour = group))+
facet_wrap(~group)+
geom_point()+
geom_smooth()
Here is the raw data with loess smoother in ggplot:
## fit simple GAM with smoother for X
m1 = mgcv::gam(y ~ s(x, by = group), data = d)
## base R plot
par(mfrow = c(2,2), bty = 'l', las = 1, mai = c(.6,.6,.2,.1), mgp = c(2,.5,0))
plot(m1)
## Gavin's neat plotter
gratia::draw(m1)
Here is the predicted GAM fit that spans the same range (0,1) for all three groups:
Can I limit the prediction/plot to actual values of 'x'?
If you install the current development version (>= 0.6.0.9111) from GitHub, {gratia} will now do what you want, sort of. I added some functionality to smooth_estimates() that I had planned to add eventually, but your post kicked it to the top of the ToDo list and motivated me to add it now.
You can use smooth_estimates() to evaluate the smooths at the observed (or any user-supplied) data only and then a bit of ggplot() recreates most of the plot.
remotes::install_github("gavinsimpson/gratia")
library('mgcv')
library('gratia')
library('dplyr')
library('ggplot2')
d <- data.frame(group = rep(c('A','B','C'), each = 100),
                x = c(seq(0, 1, length = 100),
                      seq(.2, 1, length = 100),
                      seq(0, .5, length = 100))) %>%
  mutate(y = sin(2*pi*x) + rnorm(n(), sd = 0.3),
         group = as.factor(group))
m <- gam(y ~ group + s(x, by = group), data = d, method = 'REML')
sm <- smooth_estimates(m, data = d) %>%
add_confint()
ggplot(sm, aes(x = x, y = est, colour = group)) +
geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci, colour = NULL, fill = group),
alpha = 0.2) +
geom_line() +
facet_wrap(~ group)

How do I plot a single numerical covariate using emmeans (or other package) from a model?

After variable selection I usually end up with a model containing a numerical covariate (2nd or 3rd degree). What I want to do is to plot it, preferentially using the emmeans package. Is there a way of doing it?
I can do it using predict:
m1 <- lm(mpg ~ poly(disp,2), data = mtcars)
df <- cbind(disp = mtcars$disp, predict.lm(m1, interval = "confidence"))
df <- as.data.frame(df)
ggplot(data = df, aes(x = disp, y = fit)) +
geom_line() +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = disp, y = fit),alpha = 0.2)
I haven't figured out a way of doing it using either emmip or emtrends.
For illustration purposes, how could I do it using mixed models via lme?
library(nlme)
m1 <- lme(mpg ~ poly(disp, 2), random = ~ 1 | factor(am), data = mtcars)
I suspect that your issue is due to the fact that, by default, covariates are reduced to their means in emmeans. You can use the at or cov.reduce arguments to specify a larger number of values. See the documentation for ref_grid and vignette("basics", "emmeans"), or the index of vignette topics.
Using sjPlot:
plot_model(m1, terms = "disp [all]", type = "pred")
gives the same graphic.
Using emmeans:
library(emmeans)
em1 <- ref_grid(m1, at = list(disp = seq(min(mtcars$disp), max(mtcars$disp), 1)))
emmip(em1, ~disp, CIs = T)
returns a graphic with a small difference in layout. An alternative is to save the result to an object and plot it the way I want:
d1 <- emmip(em1, ~disp, CIs = T, plotit = F)
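The data frame returned by emmip(..., plotit = F) can then be plotted by hand; a minimal sketch, assuming the default column names yvar, LCL and UCL that emmip() uses for the prediction and its confidence limits:
ggplot(d1, aes(x = disp, y = yvar)) +
  geom_ribbon(aes(ymin = LCL, ymax = UCL), alpha = 0.2) +
  geom_line()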

How to replicate a function 1000 times

So basically, I generated 2 random variables X and Y 1000 times and created a data frame Data = data.frame(X, Y) in order to perform smoothing by a spline function. Now I want to recreate exactly that B = 1000 times and plot the smoothing functions (B = 1, ..., 1000) to compare their variability.
library(splines)  # ns()
simulation = function(d){
  X = runif(1000, 0, 10)
  Y = rpois(1000, lambda = 2*X + 0.2*X*sin(X))
  Data = data.frame(X = X, Y = Y)
  smoothing_sim = lm(Y ~ ns(x = X, df = d), data = Data)
  new_x2 = seq(min(X), max(X), length.out = 100)
  adjusted_sim = predict(object = smoothing_sim, newdata = data.frame(X = new_x2))
  return(data.frame(new_x2, adjusted_sim))
}
simulation2 = replicate(n = 1000, simulation(d = 5), simplify = FALSE)  # d = 5 as an example
I'm not sure whether my method is good or not, and I'm also not sure how to plot the functions after the simulation. Anyone care to comment? Thanks!
If you use ggplot, you can make the smooths right in geom_smooth. As ggplot demands long form, using list columns and tidyr::unnest is a useful substitute for replicate, though there are lots of ways to accomplish the data generation step.
library(tidyverse)
set.seed(47)
# A nice theme with a white background to help make low-opacity objects visible
theme_set(hrbrthemes::theme_ipsum_tw())
df <- tibble(replication = seq(100), # scaled down a little
x = map(replication, ~runif(1000, 0, 10)),
y = map(x, ~rpois(1000, lambda = 2*.x + 0.2*.x*sin(.x)))) %>%
unnest()
# base plot with aesthetics and points
point_plot <- ggplot(df, aes(x, y, group = replication)) +
geom_point(alpha = 0.01, stroke = 0)
point_plot +
geom_smooth(method = lm, formula = y ~ splines::ns(x), size = .1, se = FALSE)
Controlling the line's alpha can be really helpful for this sort of plot, but the alpha parameter in geom_smooth controls the opacity of the standard error ribbon. To set the alpha of the line, use geom_line with stat_smooth:
point_plot +
stat_smooth(geom = 'line', method = lm, formula = y ~ splines::ns(x),
color = 'blue', alpha = 0.03)
Currently, the smooth isn't doing much more than OLS here. To make it more flexible, set the degrees of freedom:
point_plot +
stat_smooth(geom = 'line', method = lm, formula = y ~ splines::ns(x, df = 5),
color = 'blue', alpha = 0.03)
Given the response is Poisson, it may be worth scaling up to Poisson regression with glm. The largest impact here is that when x is small, y doesn't dip all the way to 0:
point_plot +
stat_smooth(geom = 'line', method = glm, method.args = list(family = 'poisson'),
formula = y ~ splines::ns(x, df = 5), color = 'blue', alpha = 0.03)
Adjust further as you like.

ggplot GLM fitted curve without interaction

I want to add the fitted function from a GLM to a ggplot. By default, it automatically creates the plot with an interaction. I am wondering if I can plot the fitted function from the model without the interaction. For example,
dta <- read.csv("http://www.ats.ucla.edu/stat/data/poisson_sim.csv")
dta <- within(dta, {
prog <- factor(prog, levels=1:3, labels=c("General", "Academic", "Vocational"))
id <- factor(id)
})
plt <- ggplot(dta, aes(math, num_awards, col = prog)) +
geom_point(size = 2) +
geom_smooth(method = "glm", se = F,
method.args = list(family = "poisson"))
print(plt)
gives the plot with interaction,
However, I want the plot from the model,
`num_awards` = β0 + β1*`math` + β2*`prog` + error
I tried to get this the following way:
mod <- glm(num_awards ~ math + prog, data = dta, family = "poisson")
fun.gen <- function(awd) exp(mod$coef[1] + mod$coef[2] * awd)
fun.acd <- function(awd) exp(mod$coef[1] + mod$coef[2] * awd + mod$coef[3])
fun.voc <- function(awd) exp(mod$coef[1] + mod$coef[2] * awd + mod$coef[4])
ggplot(dta, aes(math, num_awards, col = prog)) +
geom_point() +
stat_function(fun = fun.gen, col = "red") +
stat_function(fun = fun.acd, col = "green") +
stat_function(fun = fun.voc, col = "blue") +
geom_smooth(method = "glm", se = F,
method.args = list(family = "poisson"), linetype = "dashed")
The output plot is
Is there any simple way in ggplot to do this efficiently?
Ben's idea of plotting predicted values of the response for specific model terms inspired me to improve the type = "y.pc" option of the sjp.glm function. A new update is on GitHub, with version number 1.9.4-3.
Now you can plot predicted values for specific terms, one which is used along the x-axis, and a second one used as grouping factor:
sjp.glm(mod, type = "y.pc", vars = c("math", "prog"))
which gives you following plot:
The vars argument is needed in case your model has more than two terms, to specify the term for the x-axis range and the term for the grouping.
You can also facet the groups:
sjp.glm(mod, type = "y.pc", vars = c("math", "prog"), show.ci = T, facet.grid = T)
There's no way that I know of to trick geom_smooth() into doing this, but you can do a little better than you've done. You still have to fit the model yourself and add the lines, but you can use the predict() method to generate the predictions and load them into a data frame with the same structure as the original data ...
mod <- glm(num_awards ~ math + prog, data = dta, family = "poisson")
## generate prediction frame
pframe <- with(dta,
expand.grid(math=seq(min(math),max(math),length=51),
prog=levels(prog)))
## add predicted values (on response scale) to prediction frame
pframe$num_awards <- predict(mod,newdata=pframe,type="response")
ggplot(dta, aes(math, num_awards, col = prog)) +
geom_point() +
geom_smooth(method = "glm", se = FALSE,
method.args = list(family = "poisson"), linetype = "dashed")+
geom_line(data=pframe) ## use prediction data here
## (inherits aesthetics etc. from main ggplot call)
(the only difference here is that the way I've done it the predictions span the full horizontal range for all groups, as if you had specified fullrange=TRUE in geom_smooth()).
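If you did want to clip each group's predictions to its observed range of math instead, one option is to filter the prediction frame before plotting (a sketch using dplyr, not part of the original answer):
library(dplyr)
obs_range <- dta %>%
  group_by(prog) %>%
  summarise(lo = min(math), hi = max(math))
pframe <- pframe %>%
  inner_join(obs_range, by = "prog") %>%
  filter(math >= lo, math <= hi) %>%
  select(-lo, -hi)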
In principle it seems as though the sjPlot package should be able to handle this sort of thing, but it looks like the relevant bit of code for doing this plot type is hard-coded to assume a binomial GLM ... oh well.
I'm not sure, but you wrote "without interaction" - maybe you are looking for effect plots? (If not, excuse me for assuming something completely wrong...)
You can, for instance, use the effects package for this.
dta <- read.csv("http://www.ats.ucla.edu/stat/data/poisson_sim.csv")
dta <- within(dta, {
prog <- factor(prog, levels=1:3, labels=c("General", "Academic", "Vocational"))
id <- factor(id)
})
mod <- glm(num_awards ~ math + prog, data = dta, family = "poisson")
library(effects)
plot(allEffects(mod))
Another option would be the sjPlot package, as Ben suggested - however, the current version on CRAN only properly supports logistic regression models for effect plots. But the current development version on GitHub adds support for various model families and link functions, so if you like, you can download that snapshot. The sjPlot package uses ggplot instead of lattice (which is used by the effects package, I think):
sjp.glm(mod, type = "eff", show.ci = T)
Or in non-faceted way:
sjp.glm(mod, type = "eff", facet.grid = F, show.ci = T)

A replacement for method = 'loess'

This is where I'm at so far:
I have a data frame df with two columns A and B (both containing real numbers), where B is dependent on A. I plot the columns against each other:
p = ggplot(df, aes(A, B)) + geom_point()
and see that the relationship is non-linear. Adding:
p = p + geom_smooth(method = 'loess', span = 1)
gives a 'good' line of best fit. Given a new value a of A I then use the following method to predict the value of B:
B.loess = loess(B ~ A, span = 1, data = df)
predict(B.loess, newdata = a)
So far, so good. However, I then realise I can't extrapolate using loess (presumably because it is non-parametric?!). The extrapolation seems fairly natural - the relationship looks like something of a power type is going on, e.g.:
x = c(1:10)
y = 2^x
df = data.frame(A = x, B = y)
This is where I get unstuck. Firstly, what methods can I use to plot a line of best fit to this kind of ('power') data without using loess? Pathetic attempts such as:
p = ggplot(df, aes(A, B)) + geom_point() +
geom_smooth(method = 'lm', formula = log(y) ~ x)
give me errors. Also, assuming I am actually able to plot a line of best fit that I am happy with, I am having trouble using predict in a similar way to how I did with loess. For example's sake, suppose I am happy with the line of best fit:
p = ggplot(df, aes(A, B)) + geom_point() +
geom_smooth(method = 'lm', formula = y ~ x)
then if I want to predict what value B would take if A was equal to 11 (theoretically 2^11), the following method does not work:
B.lm = lm(B ~ A)
predict(B.lm, newdata = 11)
Any help much appreciated. Cheers.
First, to answer your last question: you need to provide a data.frame whose column names are the predictors.
B.lm <- lm(B ~ A,data=df)
predict(B.lm, newdata = data.frame(A=11))
1
683.3333
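Since the example data are exactly exponential, the log-linear fit attempted in the question can also be done outside ggplot; a minimal sketch (not part of the original answer), fitting log(B) and back-transforming the prediction:
B.loglm <- lm(log(B) ~ A, data = df)
exp(predict(B.loglm, newdata = data.frame(A = 11)))  # ~ 2048 = 2^11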
As an alternative to loess you can try higher-order polynomial regressions. In this plot I compare poly(x, 3) to loess using latticeExtra (which makes it easier to add the xspline interpolation), but with a syntax similar to ggplot2's layers.
library(latticeExtra)  # ggplot2like(), layer(), panel.smoother()
library(grid)          # grid.xspline()
xyplot(B ~ A, data = df, par.settings = ggplot2like(),
       panel = function(x, y, ...){
         panel.xyplot(x, y, ...)
         grid.xspline(x, y, ..., default.units = "native")  ## xspline interpolation
       }) +
  layer(panel.smoother(y ~ poly(x, 3), method = "lm"), style = 1) +  ## cubic polynomial
  layer(panel.smoother(y ~ x, span = 0.9), style = 2)  ### loess
The default surface for loess.control is "interpolate", which, unsurprisingly, doesn't allow extrapolation. The alternative, "direct", allows you to extrapolate, though a question remains as to whether that is meaningful.
predict(loess(hp~disp,mtcars),newdata=1000)
[1] NA
predict(loess(hp~disp,mtcars,control=loess.control(surface="direct")),newdata=1000)
[1] -785.0545
