Plotting GAM in R: Setting custom x-axis limits?

Is there a way to set the x-axis limits when plotting the predicted fits for GAM models? More specifically, I'm fitting a smoother for each level of a factor using 'by = ', but each factor level covers a different range of 'x'. Plotting the raw data in ggplot gives x-axes that automatically accommodate the different ranges of 'x'; after fitting a GAM (mgcv::gam()), however, the default behavior of plot.gam() appears to be to predict across a shared x-axis range.
The dummy data below has some continuous variable for 'x', but in my real data, 'x' is Time (year), and 'group' is sampling location. Because I did not collect data from each site across the same time range, I feel it is inappropriate to show a model fit in these empty years.
library(tidyverse)
library(mgcv)
library(gratia)
theme_set(theme_classic())
## simulate data with a grouping variable of three levels:
d = data.frame(group = rep(c('A','B','C'), each = 100),
               x = c(seq(0,1,length=100),
                     seq(.2,1,length=100),
                     seq(0,.5,length=100))) %>%
  mutate(y = sin(2*pi*x) + rnorm(n(), sd=0.3), # n() draws independent noise for every row
         group = as.factor(group))
## Look at data
ggplot(d, aes(x = x, y = y, colour = group))+
facet_wrap(~group)+
geom_point()+
geom_smooth()
Here is the raw data with loess smoother in ggplot:
## fit simple GAM with smoother for X
m1 = mgcv::gam(y ~ s(x, by = group), data = d)
## base R plot
par(mfrow = c(2,2), bty = 'l', las = 1, mai = c(.6,.6,.2,.1), mgp = c(2,.5,0))
plot(m1)
## Gavin's neat plotter
gratia::draw(m1)
Here is the predicted GAM fit that spans the same range (0,1) for all three groups:
Can I limit the prediction/plot to actual values of 'x'?

If you install the current development version (>= 0.6.0.9111) from GitHub, {gratia} will now do what you want, sort of. I added some functionality to smooth_estimates() that I had planned to add eventually, but your post kicked it to the top of the ToDo list and motivated me to add it now.
You can use smooth_estimates() to evaluate the smooths at the observed (or any user-supplied) data only and then a bit of ggplot() recreates most of the plot.
remotes::install_github("gavinsimpson/gratia")
library('mgcv')
library('gratia')
library('dplyr')
library('ggplot2')
d <- data.frame(group = rep(c('A','B','C'), each = 100),
                x = c(seq(0,1,length=100),
                      seq(.2,1,length=100),
                      seq(0,.5,length=100))) %>%
  mutate(y = sin(2*pi*x) + rnorm(n(), sd=0.3), # n() draws independent noise for every row
         group = as.factor(group))
m <- gam(y ~ group + s(x, by = group), data = d, method = 'REML')
sm <- smooth_estimates(m, data = d) %>%
add_confint()
ggplot(sm, aes(x = x, y = est, colour = group)) +
geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci, colour = NULL, fill = group),
alpha = 0.2) +
geom_line() +
facet_wrap(~ group)
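If installing a development version is not an option, a similar result can be sketched with plain mgcv::predict(). This is only an illustration under a few assumptions: the object names (grid, p) are made up for the example, the band is a rough 1.96 * standard error band, and the predictions include the group intercepts, so they are on the scale of the response rather than the centred smooths that plot.gam() and draw() show.

```r
library(mgcv)

set.seed(1)
d <- data.frame(group = factor(rep(c("A", "B", "C"), each = 100)),
                x = c(seq(0, 1, length.out = 100),
                      seq(0.2, 1, length.out = 100),
                      seq(0, 0.5, length.out = 100)))
d$y <- sin(2 * pi * d$x) + rnorm(nrow(d), sd = 0.3)

m <- gam(y ~ group + s(x, by = group), data = d, method = "REML")

# One prediction grid per group, covering only that group's observed range of x
grid <- do.call(rbind, lapply(split(d, d$group), function(g) {
  data.frame(group = g$group[1],
             x = seq(min(g$x), max(g$x), length.out = 50))
}))

# Predictions (group intercept + smooth) with approximate 95% bands
p <- predict(m, newdata = grid, se.fit = TRUE)
grid$fit   <- as.numeric(p$fit)
grid$lower <- grid$fit - 1.96 * as.numeric(p$se.fit)
grid$upper <- grid$fit + 1.96 * as.numeric(p$se.fit)
```

The resulting grid can be passed to ggplot() with geom_ribbon(), geom_line() and facet_wrap(~ group) in the same way as sm above.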

Related

Regression line, prediction and confidence intervals in R

I would like to ask how to create two statistical graphs from the table:
regression line with prediction interval
regression line with confidence interval
I used this script, but I don't know what to do next:
pred <- lm(dta$Number.of.species ~ dta$Latitude)
pred_interval <- predict(lm(dta$Number.of.species ~ dta$Latitude), level = .99, interval = "confidence")[,2]
conf_interval <- predict(pred, newdata=dta, interval="prediction")[,3]
par(mfrow=c(2,2))
plot(
dta$Latitude,
dta$Number.of.species,
pch = 1,
ylim = c(0, 180),
xlim = c(37, 40)
)
plot(
dta$Latitude,
dta$Number.of.species,
pch = 1,
ylim = c(0, 180),
xlim = c(37, 40)
)
abline(pred)
Thank you for your time.
If you are just learning R, I would make 2 recommendations.
First, I would suggest learning the ggplot2 package, rather than using the base R plotting system. It is generally much easier to build up complex plots with many parts using ggplot().
Second, there are several packages designed to make working with model results easier in R. The most prominent of these are broom and the easystats collection of packages (modelbased, performance, parameters, etc.). Between the two, I would recommend easystats.
I'll demonstrate how to build up the data frame for plotting the model manually and using modelbased.
Manually building data frame
library(ggplot2)
# fit the model
m <- lm(mpg ~ disp, data = mtcars)
# construct prediction and confidence intervals using predict()
m_ci <- predict(m, interval = "confidence") |>
as.data.frame() |>
setNames(c("fit", "ci_lo", "ci_hi"))
m_pi <- predict(m, interval = "prediction") |>
as.data.frame() |>
setNames(c("fit", "pi_lo", "pi_hi"))
#> Warning in predict.lm(m, interval = "prediction"): predictions on current data refer to _future_ responses
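# combine the interval data frames with the data frame used in the model
# (predict() without newdata returns rows in the same order as model.frame(m),
# so cbind() avoids merging on floating-point fitted values)
m_data <- cbind(model.frame(m), m_ci, m_pi[c("pi_lo", "pi_hi")])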
# make a plot using the merged model data frame
ggplot(m_data) + # use m_data in the plot
aes(x = disp) + # put the 'disp' variable on the x axis
geom_point(aes(y = mpg)) + # add points, put the 'mpg' variable on the y axis for these
geom_ribbon(aes(ymin = pi_lo, ymax = pi_hi), fill = "lightblue", alpha = .4) + # add a ribbon for the prediction interval, put the pi_lo/pi_hi values on the y axis for this, color it lightblue and make it semitransparent
geom_ribbon(aes(ymin = ci_lo, ymax = ci_hi), fill = "lightblue", alpha = .4) + # add a ribbon for the confidence interval, put the ci_lo/ci_hi values on the y axis for this, color it lightblue and make it semitransparent
geom_line(aes(y = fit)) + # add a line for the fitted values, put the 'fit' values on the y axis
theme_minimal() # use a white background for the plot
Using the modelbased package to streamline some of the above steps
library(modelbased)
# compute intervals, including fitted values and original model matrix
ci <- estimate_expectation(m) # model fitted values and confidence intervals (uncertainty intervals on the expected values/predicted means)
pi <- estimate_prediction(m) # model fitted values and prediction intervals (uncertainty intervals on the individual predictions)
plot(ci) + # this produces a ggplot with points, fitted line, and confidence ribbon
geom_ribbon(aes(x = disp, ymin = CI_low, ymax = CI_high), data = pi, alpha = .4) + # add a prediction ribbon
theme_minimal() # use a white background
Here is how to modify the color of the ribbon when working with modelbased:
plot(ci, ribbon = list(fill = "lightblue")) +
geom_ribbon(aes(x = disp, ymin = CI_low, ymax = CI_high), data = pi, fill = "lightblue", alpha = .4) +
theme_minimal()
Created on 2021-08-18 by the reprex package (v2.0.0)
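For completeness, the two graphs can also be drawn in base R, which is closer to the script in the question. The following is a minimal sketch, not the only way to do it: mtcars stands in for the asker's dta, the 99% level is taken from the question's level = .99, and grid and pi_band are names invented for the example.

```r
# mtcars stands in for the asker's 'dta' (assumption for illustration)
m <- lm(mpg ~ disp, data = mtcars)

# Evaluate both intervals on an ordered grid so the bounds draw as smooth lines
grid <- data.frame(disp = seq(min(mtcars$disp), max(mtcars$disp), length.out = 100))
ci      <- predict(m, newdata = grid, interval = "confidence", level = 0.99)
pi_band <- predict(m, newdata = grid, interval = "prediction", level = 0.99)

par(mfrow = c(1, 2))

# Left panel: regression line with confidence interval
plot(mpg ~ disp, data = mtcars, main = "99% confidence interval")
matlines(grid$disp, ci, lty = c(1, 2, 2), col = "black")

# Right panel: regression line with prediction interval
plot(mpg ~ disp, data = mtcars, main = "99% prediction interval",
     ylim = range(pi_band, mtcars$mpg))
matlines(grid$disp, pi_band, lty = c(1, 2, 2), col = "black")
```

Each matrix returned by predict() has the columns fit, lwr and upr, so matlines() draws the fitted line solid and the two bounds dashed in one call.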

`data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class ranger

I am working with R. Using a tutorial, I was able to create a statistical model and produce visual plots for some of the outputs:
#load libraries
library(survival)
library(dplyr)
library(ranger)
library(data.table)
library(ggplot2)
#use the built in "lung" data set
#remove missing values (dataset is called "a")
a <- na.omit(lung)
#create id variable
a$ID <- seq_along(a[,1])
#create test set with only the first 3 rows
new <- a[1:3,]
#create a training set by removing first three rows
a <- a[-c(1:3),]
#fit survival model (random survival forest)
r_fit <- ranger(Surv(time,status) ~ age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, data = a, mtry = 4, importance = "permutation", splitrule = "extratrees", verbose = TRUE)
#create new intermediate variables required for the survival curves
death_times <- r_fit$unique.death.times
surv_prob <- data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob, mean)
#use survival model to produce estimated survival curves for the first three observations
pred <- predict(r_fit, new, type = 'response')$survival
pred <- data.table(pred)
colnames(pred) <- as.character(r_fit$unique.death.times)
#plot the results for these 3 patients
plot(r_fit$unique.death.times, pred[1,], type = "l", col = "red")
lines(r_fit$unique.death.times, pred[2,], type = "l", col = "green")
lines(r_fit$unique.death.times, pred[3,], type = "l", col = "blue")
Now, I am trying to convert the above plot into ggplot format (and add 95% confidence intervals):
ggplot(r_fit) + geom_line(aes(x = r_fit$unique.death.times, y = pred[1,], group = 1), color = red) + geom_ribbon(aes(ymin = 0.95 * pred[1,], ymax = - 0.95 * pred[1,]), fill = "red") + geom_line(aes(x = r_fit$unique.death.times, y = pred[2,], group = 1), color = blue) + geom_ribbon(aes(ymin = 0.95 * pred[2,], ymax = - 0.95 * pred[2,]), fill = "blue") + geom_line(aes(x = r_fit$unique.death.times, y = pred[3,], group = 1), color = green) + geom_ribbon(aes(ymin = 0.95 * pred[3,], ymax = - 0.95 * pred[3,]), fill = "green") + theme(axis.text.x = element_text(angle = 90)) + ggtitle("sample graph")
But this produces the following error:
Error: `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class ranger
Run `rlang::last_error()` to see where the error occurred.
What is the reason for this error? Can someone please show me how to fix this problem?
Thanks
As per the ggplot2 documentation, you need to provide a data.frame() or object that can be converted (coerced) to a data.frame(). In this case, if you want to reproduce the plot above in ggplot2, you will need to manually set up the data frame yourself.
Below is an example of how you could set up the data to display the plot in ggplot2.
Data Frame
First we create a data.frame() with the variables that we want to plot. The easiest way to do this is to just group them all in as separate columns. Note that I have used the as.numeric() function to first coerce the predicted values to a vector, because they were previously a data.table row, and if you don't convert them they are maintained as rows.
ggplot_data <- data.frame(unique.death.times = r_fit$unique.death.times,
pred1 = as.numeric(pred[1,]),
pred2 = as.numeric(pred[2,]),
pred3 = as.numeric(pred[3,]))
head(ggplot_data)
## unique.death.times pred1 pred2 pred3
## 1 5 0.9986676 1.0000000 0.9973369
## 2 11 0.9984678 1.0000000 0.9824642
## 3 12 0.9984678 0.9998182 0.9764154
## 4 13 0.9984678 0.9998182 0.9627118
## 5 15 0.9731656 0.9959416 0.9527424
## 6 26 0.9731656 0.9959416 0.9093876
Pivot the data
This format is still not ideal, because in order to plot the data and colour by the correct column (variable), we need to 'pivot' the data. We need to load the tidyr package for this.
library(tidyr)
ggplot_data <- ggplot_data %>%
pivot_longer(cols = !unique.death.times,
names_to = "category", values_to = "predicted.value")
Plotting
Now the data is in a form that makes it really easy to plot in ggplot2.
plot <- ggplot(ggplot_data, aes(x = unique.death.times, y = predicted.value, colour = category)) +
geom_line()
plot
If you really want to match the look of the base plot, you can add theme_classic():
plot + theme_classic()
Additional notes
Note that this doesn't include 95% confidence intervals, so they would have to be calculated separately. Be aware though, that a 95% confidence interval is not just 95% of the y value at a given x value. There are calculations that will give you the correct values of the confidence interval, including functions built into R.
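For a quick view of a trend line with an uncertainty band, you can use the geom_smooth() function in ggplot2. Note that by default it fits its own smoother to the plotted points (a loess curve for a data set of this size) and draws that smoother's confidence band, so the band describes the loess fit, not the uncertainty of the random survival forest.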
plot + theme_classic() + geom_smooth()
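As an illustration of intervals computed by a function built into R, here is a minimal sketch that swaps in a different technique: a Kaplan-Meier estimate from survival::survfit() instead of the random survival forest. It does not give intervals for the ranger predictions; it only shows what a properly computed pointwise 95% band looks like and how to draw it. The names km, km_df and km_plot are made up for the example.

```r
library(survival)
library(ggplot2)

# Kaplan-Meier estimate for the whole lung data set, with pointwise 95% intervals
km <- survfit(Surv(time, status) ~ 1, data = lung, conf.int = 0.95)

# survfit() stores the estimate and its bounds as plain vectors
km_df <- data.frame(time  = km$time,
                    surv  = km$surv,
                    lower = km$lower,
                    upper = km$upper)

# Ribbon for the interval, step line for the estimate
# (the ribbon edges are interpolated linearly between event times)
km_plot <- ggplot(km_df, aes(x = time, y = surv)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  geom_step() +
  theme_classic()
km_plot
```

Note that the bounds are not a fixed fraction of the curve: the band is narrow where many patients are still at risk and widens towards the end of follow-up.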

How to replicate a function 1000 times

So basically, I generated two random variables X and Y 1000 times and created a data frame Data=data.frame(x,y) in order to fit a smoothing spline. Now I want to repeat exactly that B = 1000 times and plot the resulting smoothing functions (b = 1,...,1000) to compare their variability.
simulation= function(d){
X=runif(1000,0,10)
Y=rpois(1000,lambda=2*X+0.2*X*sin(X))
Data=matrix(data=c(X,Y),ncol=2)
smoothing_sim=lm(Y~ns(x=X,df=d),data=Data)
new_x2=seq(min(X),max(X),length.out=100)
adjusted_sim=predict(object=smoothing_sim,newdata=data.frame(X=new_x2))
return(data.frame(new_x2,smoothing_sim))
}
simulation2=replicate(n=1000,simulation)
I'm not sure whether my method is good or not. And I'm also not sure how to plot the functions following the simulation. Anyone care to comment? Thanks!
If you use ggplot, you can make the smooths right in geom_smooth. As ggplot demands long form, using list columns and tidyr::unnest is a useful substitute for replicate, though there are lots of ways to accomplish the data generation step.
library(tidyverse)
set.seed(47)
# A nice theme with a white background to help make low-opacity objects visible
theme_set(hrbrthemes::theme_ipsum_tw())
df <- tibble(replication = seq(100), # scaled down a little
x = map(replication, ~runif(1000, 0, 10)),
y = map(x, ~rpois(1000, lambda = 2*.x + 0.2*.x*sin(.x)))) %>%
unnest(c(x, y)) # tidyr >= 1.0.0 expects the list columns to be named
# base plot with aesthetics and points
point_plot <- ggplot(df, aes(x, y, group = replication)) +
geom_point(alpha = 0.01, stroke = 0)
point_plot +
geom_smooth(method = lm, formula = y ~ splines::ns(x), size = .1, se = FALSE)
Controlling the line's alpha can be really helpful for this sort of plot, but the alpha parameter in geom_smooth controls the opacity of the standard error ribbon. To set the alpha of the line, use geom_line with stat_smooth:
point_plot +
stat_smooth(geom = 'line', method = lm, formula = y ~ splines::ns(x),
color = 'blue', alpha = 0.03)
Currently, the smooth isn't doing much more than OLS here. To make it more flexible, set the degrees of freedom:
point_plot +
stat_smooth(geom = 'line', method = lm, formula = y ~ splines::ns(x, df = 5),
color = 'blue', alpha = 0.03)
Given the response is Poisson, it may be worth scaling up to Poisson regression with glm. The largest impact here is that when x is small, y doesn't dip all the way to 0:
point_plot +
stat_smooth(geom = 'line', method = glm, method.args = list(family = 'poisson'),
formula = y ~ splines::ns(x, df = 5), color = 'blue', alpha = 0.03)
Adjust further as you like.
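If you would rather stay with the replicate() approach from the question, here is a minimal base R sketch of it. It rests on some assumptions that differ from the original code: every replicate is predicted on one shared grid (so the curves line up as columns of a matrix), the function returns only the fitted values, and B is scaled down to 200. The argument names n and grid are invented for the example.

```r
library(splines)

# One simulation: draw data, fit a natural spline with d degrees of freedom,
# and return the fitted curve evaluated on a common grid
simulation <- function(d, n = 1000, grid = seq(0, 10, length.out = 100)) {
  X <- runif(n, 0, 10)
  Y <- rpois(n, lambda = 2 * X + 0.2 * X * sin(X))
  dat <- data.frame(X = X, Y = Y)          # lm() needs a data frame, not a matrix
  fit <- lm(Y ~ ns(X, df = d), data = dat)
  predict(fit, newdata = data.frame(X = grid))
}

set.seed(47)
B <- 200
grid <- seq(0, 10, length.out = 100)

# replicate() needs a call, not just the function name; the result is a
# matrix with one column per replicate
fits <- replicate(B, simulation(d = 5, grid = grid))

# Overlay all fitted curves with a transparent colour to show their variability
matplot(grid, fits, type = "l", lty = 1, col = rgb(0, 0, 1, 0.05),
        xlab = "x", ylab = "fitted y")
```

Pointwise summaries such as apply(fits, 1, sd) then give a numeric view of the variability at each grid point.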

Graph GLM in ggplot2 where x variable is categorical

I need to graph the predicted probabilities of a logit regression in ggplot2. Essentially, I am trying to graph a glm by each treatment condition within the same graph. However, I am getting quite confused about how to do this, seeing that my treat variable (i.e. the x I am interested in) is categorical. This means that when I try to graph the treatment effects using ggplot, I just get a bunch of points at 0, 1, and 2 but no lines.
My question is... How could I graph the logit prediction lines in this case? Thanks in advance!
set.seed(96)
df <- data.frame(
  vote = sample(0:1, 200, replace = TRUE),
  treat = sample(0:3, 200, replace = TRUE))
glm_output <- glm(vote ~ as.factor(treat), data = df, family = binomial(link = "logit"))
# predict.glm() has no 'interval' argument; se.fit = TRUE returns the standard errors
predicted_vote <- predict(glm_output, newdata = df, type = "link", se.fit = TRUE)
df <- cbind(df, data.frame(predicted_vote))
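Since the explanatory variable treat is categorical, it makes more sense to use a boxplot instead, as in the following (predicted_prob is assumed to be a column of predicted probabilities, e.g. plogis(df$fit); it is not created by the code in the question):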
ggplot(df, aes(x = treat, y = predicted_prob)) +
geom_boxplot(aes(fill = factor(treat)), alpha = .2)
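If you want to see the predicted probabilities from the glm across different values of other explanatory variables, you may try the following (this assumes the data also contains columns such as gender and age, which the example df above does not):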
ggplot(df, aes(x = treat, y = predicted_prob)) +
geom_boxplot(aes(fill = factor(treat)), alpha = .2) + facet_wrap(~gender)
# create age groups
df$age_group <- cut(df$age, breaks=seq(0,100,20))
ggplot(df, aes(x = treat, y = predicted_prob)) +
geom_boxplot(aes(fill = factor(treat)), alpha = .2) + facet_grid(age_group~gender)
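With a single categorical predictor there is only one predicted probability per treatment level, so another option is to plot that value with an interval for each level. Below is a minimal sketch of that idea, assuming a rough 1.96 * standard error interval built on the link scale and back-transformed with plogis(); the names fit and newdat are made up for the example.

```r
library(ggplot2)

set.seed(96)
df <- data.frame(vote  = sample(0:1, 200, replace = TRUE),
                 treat = factor(sample(0:3, 200, replace = TRUE)))

fit <- glm(vote ~ treat, data = df, family = binomial(link = "logit"))

# One row per treatment level
newdat <- data.frame(treat = factor(levels(df$treat), levels = levels(df$treat)))

# Predict on the link (logit) scale, build the interval there,
# then transform everything back to probabilities
p <- predict(fit, newdata = newdat, type = "link", se.fit = TRUE)
newdat$prob  <- as.numeric(plogis(p$fit))
newdat$lower <- as.numeric(plogis(p$fit - 1.96 * p$se.fit))
newdat$upper <- as.numeric(plogis(p$fit + 1.96 * p$se.fit))

# Point = predicted probability, vertical bar = approximate 95% interval
ggplot(newdat, aes(x = treat, y = prob)) +
  geom_pointrange(aes(ymin = lower, ymax = upper)) +
  ylim(0, 1)
```

Building the interval on the link scale keeps the bounds inside [0, 1], which a symmetric interval on the probability scale would not guarantee.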

Customize how the smooth confidence interval is computed

I usually plot the loess estimate of a bunch of points, along with its confidence interval, by means of the geom_smooth function.
Now I need to change the method by which the confidence bounds are computed (i.e. I need to change the shape of the blur band). Is there a way to do that in geom_smooth?
Or, how can I emulate it with ggplot2? How can I draw such a blur band?
If you need to plot something that isn't one of the options in geom_smooth, your best bet is to fit the model yourself.
You haven't said what method you need.
But here is an example of fitting the loess with family symmetric and computing the standard errors of that.
library(ggplot2)
d <- data.frame(x = rnorm(100), y = rnorm(100))
# The original plot using the default loess method
p <- ggplot(d, aes(x, y)) + geom_smooth(method = 'loess', se = TRUE)
# Fit loess model with family = 'symmetric'
# Replace the next line with whatever different method you need
loess_smooth <- loess(y ~ x, data = d, family = 'symmetric')
# Predict the model over the range of data you are interested in.
x_grid <- seq(min(d$x), max(d$x), length.out = 1000)
loess_pred <- predict(loess_smooth,
                      newdata = data.frame(x = x_grid),
                      se = TRUE)
# Band of +/- 1 standard error around the fit
loess.df <- data.frame(fit = loess_pred$fit,
                       x = x_grid,
                       upper = loess_pred$fit + loess_pred$se.fit,
                       lower = loess_pred$fit - loess_pred$se.fit)
# plot to compare
p +
  geom_ribbon(data = loess.df, aes(x = x, y = fit, ymax = upper, ymin = lower), alpha = 0.6) +
  geom_line(data = loess.df, aes(x = x, y = fit))
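The band itself can be changed in the same manual way. As one possible variation, here is a minimal sketch that replaces the +/- 1 standard error band with a pointwise 95% band based on the t distribution, using the degrees of freedom that predict() returns for a loess fit. As far as I know this is close to what geom_smooth does internally, but treat that as an assumption; the names grid, crit and band are made up for the example.

```r
library(ggplot2)

set.seed(1)
d <- data.frame(x = rnorm(100), y = rnorm(100))

fit  <- loess(y ~ x, data = d)
grid <- data.frame(x = seq(min(d$x), max(d$x), length.out = 200))
pred <- predict(fit, newdata = grid, se = TRUE)

# Critical value for a pointwise 95% interval, using the residual
# degrees of freedom reported by predict.loess()
crit <- qt(0.975, df = pred$df)

band <- data.frame(x     = grid$x,
                   fit   = as.numeric(pred$fit),
                   lower = as.numeric(pred$fit - crit * pred$se.fit),
                   upper = as.numeric(pred$fit + crit * pred$se.fit))

# Draw the custom band and fit on top of the raw points
ggplot(d, aes(x, y)) +
  geom_point(alpha = 0.4) +
  geom_ribbon(data = band, aes(x = x, ymin = lower, ymax = upper),
              inherit.aes = FALSE, alpha = 0.3) +
  geom_line(data = band, aes(x = x, y = fit), inherit.aes = FALSE)
```

Any other rule for the bounds (a different level, a bootstrap envelope, and so on) can be substituted by changing how lower and upper are computed.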
