How to visualize spline regression with ggplot2? - r

I'm working with the Wage dataset in the ISLR library. My objective is to perform a spline regression with knots at 3 locations (see code below). I can do this regression. That part is fine.
My issue concerns the visualization of the regression curve. Using base R functions, I seem to get the correct curve. But I can't seem to get quite the right curve using the tidyverse. This is what is expected, and what I get with the base functions:
This is what ggplot spits out
It's noticeably different. R gives me the following message when running the ggplot functions:
`geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
What does this mean and how do I fix it?
library(tidyverse)
library(ISLR)
library(splines) # provides bs()
attach(Wage)
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60), degree = 3), data = Wage) #Default is 3
plot(age, wage, col = 'grey', xlab = 'Age', ylab = 'Wages')
points(age.grid, predict(fit, newdata = list(age = age.grid)), col = 'darkgreen', lwd = 2, type = "l")
abline(v = c(25, 40, 60), lty = 2, col = 'darkgreen')
ggplot(data = Wage) +
geom_point(mapping = aes(x = age, y = wage), color = 'grey') +
geom_smooth(mapping = aes(x = age, y = fit$fitted.values), color = 'red')
I also tried
ggplot() +
geom_point(data = Wage, mapping = aes(x = age, y = wage), color = 'grey') +
geom_smooth(mapping = aes(x = age.grid, y = predict(fit, newdata = list(age = age.grid))), color = 'red')
but that looks very similar to the 2nd picture.
Thanks for any help!

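splines::bs() and s(., bs = "cs") from mgcv do very different things; the latter is a penalized regression spline. The message means that geom_smooth(), given no method, chose mgcv::gam() with its default formula and smoothed the values you passed as y, rather than drawing your lm() fit. I would try (untested!)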
geom_smooth(method="lm",
formula= y ~ splines::bs(x, knots = c(25, 40, 60), degree = 3))
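A fuller sketch of that suggestion, assuming simulated data in place of the ISLR Wage data (the data frame `d`, its coefficients, and the name `knot_locs` are made up for illustration). It checks that the curve drawn by geom_smooth(method = "lm", ...) coincides with predict() from the equivalent lm() fit. When no method is given, geom_smooth() picks mgcv::gam() with y ~ s(x, bs = "cs") for larger data sets (1,000 or more observations), which is what the message reports.

```r
library(ggplot2)
library(splines)

# Simulated stand-in for the Wage data (illustrative values only)
set.seed(1)
d <- data.frame(age = sample(18:80, 500, replace = TRUE))
d$wage <- 60 + 2.5 * d$age - 0.03 * d$age^2 + rnorm(500, sd = 10)

knot_locs <- c(25, 40, 60)

# Reference fit, as in the question
fit <- lm(wage ~ bs(age, knots = knot_locs, degree = 3), data = d)

# Let geom_smooth fit the same model; the formula must be written in x and y
p <- ggplot(d, aes(x = age, y = wage)) +
  geom_point(colour = "grey") +
  geom_smooth(method = "lm",
              formula = y ~ splines::bs(x, knots = c(25, 40, 60), degree = 3),
              colour = "red") +
  geom_vline(xintercept = knot_locs, linetype = 2, colour = "darkgreen")

# Compare the curve computed by the smooth layer (layer 2) with predict()
curve_gg <- layer_data(p, 2)
curve_lm <- predict(fit, newdata = data.frame(age = curve_gg$x))
max_diff <- max(abs(curve_gg$y - curve_lm))
```

If the two fits agree, `max_diff` should be zero up to floating-point error, and print(p) should show the same curve as the base R plot.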

Related

Plot binomial GAM in ggplot

I'm trying to visualize a dataset that uses a binomial response variable (proportions). I'm using a gam to examine the trend, but having difficulty getting it to plot with ggplot. How do I get the smooth added to the plot?
Example:
set.seed(42)
df <- data.frame(y1 = sample.int(100),
y2 = sample.int(100),
x = runif(100, 0, 100))
ggplot(data = df,
aes(y = y1/(y1+y2), x = x)) +
geom_point(shape = 1) +
geom_smooth(method = "gam",
method.args = list(family = binomial),
formula = cbind(y1, y2) ~ s(x))
Warning message:
Computation failed in `stat_smooth()`
Caused by error in `cbind()`:
! object 'y1' not found
The formula in geom_smooth has to be in terms of x and y, representing the variables on your x and y axes, so you can't pass in y1 and y2.
The way round this is that rather than attempting to use the cbind type left-hand side of your gam, you can expand the counts into 1s and 0s so that there is only a single y variable. Although this makes for a little extra pre-processing, it allows you to draw your points just as easily using stat = 'summary' inside geom_point and makes your geom_smooth very straightforward:
library(tidyverse)
set.seed(42)
df <- data.frame(y1 = sample.int(100),
y2 = sample.int(100),
x = runif(100, 0, 100))
df %>%
rowwise() %>%
summarize(y = rep(c(1, 0), times = c(y1, y2)), x = x) %>%
ggplot(aes(x, y)) +
geom_point(stat = 'summary', fun = mean, shape = 1) +
geom_smooth(method = "gam",
method.args = list(family = binomial),
formula = y ~ s(x)) +
theme_classic()
Created on 2023-01-20 with reprex v2.0.2
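Expanding the counts into 1s and 0s should not change the fitted model, because the binomial likelihood is the same up to a constant. A small check of that claim, using glm() instead of gam() to keep it to base R (the names `fit_counts`, `long`, and `fit_long` are illustrative, not from the answer):

```r
set.seed(42)
df <- data.frame(y1 = sample.int(100),
                 y2 = sample.int(100),
                 x = runif(100, 0, 100))

# Model on the original successes/failures counts
fit_counts <- glm(cbind(y1, y2) ~ x, family = binomial, data = df)

# Same data expanded to one 0/1 row per trial
long <- data.frame(
  y = unlist(Map(function(a, b) rep(c(1, 0), times = c(a, b)), df$y1, df$y2)),
  x = rep(df$x, times = df$y1 + df$y2)
)
fit_long <- glm(y ~ x, family = binomial, data = long)

# Largest difference between the two sets of coefficients
coef_diff <- max(abs(coef(fit_counts) - coef(fit_long)))
```

The two coefficient vectors are expected to agree up to the convergence tolerance of the fitting algorithm, which is why the expanded data can be handed to geom_smooth with a plain y ~ s(x) formula.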

Adding regression line to plotted matrix

How do I add a regression line to this graph? I tried abline() but it seems to only work with dataframes and I'm working with a matrix.
https://i.stack.imgur.com/fAeyL.png
This is my code for the graph:
plot(Extended[,1], Extended[,14], xlim=c(1877, 2017), ylim=c(-12, 15), pch=19, col = 'blue')
This is in general easier with ggplot.
library(ggplot2)
Ex <- as.data.frame(Extended);
names(Ex) <- paste0('V', 1:ncol(Ex));
ggplot(Ex, aes(x = V1, y = V14)) +
geom_point(size = 4, col = 'blue') +
geom_smooth(method = 'lm') +
coord_cartesian(xlim = c(1877, 2017),
ylim = c(-12, 15))
For base R there are many ways of visualizing the effects. The effects package is a very general way of doing this: call the predictorEffect function and specify which effect to plot.
library(effects)
data(mtcars)
model <- lm(mpg ~ hp, data = mtcars)
plot(predictorEffect('hp', model))
The effects package has some very general implementations that can do quite a lot of things. I'd recommend reading the vignettes to get an idea how the package works.
For a manual base plot version we could do something like:
mtcars[, c('fit', 'lwr', 'upr')] <- predict(model, interval = 'predict')
mtcars <- mtcars[order(mtcars$hp),]
plot(y = mtcars$mpg, x = mtcars$hp,
pch=19, col = 'blue')
lines(y = mtcars$fit, x = mtcars$hp, col = 'green')
lines(y = mtcars$upr, x = mtcars$hp, col = 'red')
lines(y = mtcars$lwr, x = mtcars$hp, col = 'red')
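For the original matrix case, abline() does not itself need a data frame: it accepts a fitted lm object, and lm() can be given matrix columns directly. A minimal sketch, with a made-up matrix `m` standing in for `Extended` (its values are invented for illustration):

```r
# Hypothetical numeric matrix standing in for `Extended`
set.seed(1)
year <- 1877:2017
m <- cbind(year, value = 0.05 * (year - 1947) + rnorm(length(year), sd = 3))

# Fit a simple linear regression on the matrix columns
fit <- lm(m[, 2] ~ m[, 1])

# Scatter plot plus the fitted line; abline() reads intercept and slope from fit
plot(m[, 1], m[, 2], pch = 19, col = "blue", xlab = "x", ylab = "y")
abline(fit, col = "red", lwd = 2)
```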

Weighting using predict function

I have used 'predict' to find a fit line for a linear model (lm) I have created. Because the lm was built on only 2 data points and needs to have a positive slope, I have forced it to go through the origin (0,0). I have also weighted the function by the number of observations underlying each data point.
Question 1: (SOLVED - see comment by @Gregor)
Why does the predicted line lie so much closer to my second data point (B) than my first data point (A), when B has fewer underlying observations? Did I code something wrong here when weighting the model?
Question 2:
I'm plotting a GLM (link = logit) now, but how can I still force this through 0,0? I've tried adding formula = y~0+x in several places, none of which seem to work.
M <- data.frame("rate" = c(0.4643,0.2143), "conc" = c(300,6000), "nr_dead" = c(13,3), "nr_surv" = c(15,11), "region" = c("A","B"))
M$tot_obsv <- (M$nr_dead+M$nr_surv)
M_conc <- M$conc
M_rate <- M$rate
M_tot_obsv <- M$tot_obsv
#**linear model of data, force 0,0 intercept, weighted by nr. of observations of each data point.**
M_lm <- lm(data = M, rate~0+conc, weights = tot_obsv)
#**plot line using "predict" function**
x_conc <-c(600, 6700)
y_rate <- predict(M_lm, list(conc = x_conc), weights = tot_obsv, type = 'response')
plot(x = M$conc, y = M$rate, pch = 16, ylim = c(0, 0.5), xlim = c(0,7000), xlab = "conc", ylab = "death rate")
lines(x_conc, y_rate, col = "red", lwd = 2)
#**EDIT 1:**
library(tidyverse) # for tibble(), %>%, left_join() and ggplot()
M_glm <- glm(cbind(nr_dead, nr_surv) ~ (0+conc), data = M, family = "binomial")
#*plot using 'predict' function*
# the formula argument is now passed on instead of being hard-coded
binomial_smooth <- function(formula = (y ~ 0+x), ...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), formula = formula, ...)
}
tibble(x_conc = c(seq(300, 7000, 1), M$conc), y_rate = predict.glm(M_glm, list(conc = x_conc), type = "response")) %>%
left_join(M, by = c('x_conc' = 'conc')) %>%
ggplot(aes(x = x_conc, y = y_rate)) + xlab("concentration") + ylab("death rate") +
geom_point(aes(y = rate, size = tot_obsv)) + binomial_smooth(formula = (y ~ 0+x)) + theme_bw()
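On Question 2, one point worth checking before changing the plotting code: with a logit link, dropping the intercept fixes the linear predictor at 0 when conc = 0, and plogis(0) is 0.5. So the fitted curve passes through (0, 0.5), not (0, 0). A quick check using the counts from the question (the name `p0` is illustrative):

```r
M <- data.frame(conc = c(300, 6000),
                nr_dead = c(13, 3),
                nr_surv = c(15, 11))

# No-intercept binomial GLM, as in EDIT 1
M_glm <- glm(cbind(nr_dead, nr_surv) ~ 0 + conc, data = M, family = binomial)

# At conc = 0 the linear predictor is 0, so the fitted probability is plogis(0)
p0 <- predict(M_glm, newdata = data.frame(conc = 0), type = "response")
```

A model in which the death rate tends to 0 as the concentration tends to 0 would therefore need a different parameterisation, not just a formula without an intercept.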

R - Manually plot calibration plot

From Clinical Prediction Models by Ewout W. Steyerberg we have the following:
A calibration plot has predictions on the x axis, and the outcome on
the y axis. A line of identity helps for orientation: Perfect
predictions should be on the 45° line. For linear regression, the
calibration plot results in a simple scatter plot. For binary
outcomes, the plot contains only 0 and 1 values for the y axis.
Probabilities are not observed directly. However, smoothing techniques
can be used to estimate the observed probabilities of the outcome ( p
( y = 1)) in relation to the predicted probabilities. The observed 0/1
outcomes are replaced by values between 0 and 1 by combining outcome
values of subjects with similar predicted probabilities, e.g. using
the loess algorithm.
I'm fitting a logistic regression model with a binary outcome. Below is an example code. The calibration curve is going to look weird because the sample is so small. I'm mostly wondering if the methodology is correct.
library(tidyverse)
tibble_ex <- tibble(
event = c(1, 0, 1, 0, 0, 1),
weight = c(100, 200, 110, 210, 220, 105)
)
model <- glm(event ~ weight, family = 'binomial', data = tibble_ex)
tibble_ex <- tibble_ex %>%
mutate(pred = predict(model, type = 'response'))
tibble_ex %>%
arrange(pred) %>%
ggplot(aes(x = pred, y = event)) +
stat_smooth(method = 'glm', method.args = list(family = binomial), se = F) +
geom_abline()
You are missing just the smoothing part of the plot. If you want to use glm to plot the curve, then you have to combine it with splines.
tibble_ex %>%
arrange(pred) %>%
ggplot(aes(x = pred, y = event)) +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
stat_smooth(method = "glm", formula = y ~ splines::ns(x, 1), size = 1) +
geom_abline()
However, I have noticed that Steyerberg and Harrell prefer the use of loess smoothing.
tibble_ex %>%
arrange(pred) %>%
ggplot(aes(x = pred, y = event)) +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
geom_smooth(aes(x = pred, y = event), color = "red", se = F, method = "loess") +
# you can use stat_smooth in place of geom_smooth
geom_abline()
I also want to mention the rms package by Frank Harrell. It has many helpful functions for fitting and validating models, including calibration plots. The code below plots the calibration curve and provides other statistics.
library(rms)
val.prob(fitted(model),tibble_ex$event)
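The loess approach described in the quoted passage can also be sketched by hand in base R. This is only an illustration under assumptions: the data are simulated (six observations are too few for a meaningful smooth), and the names `cal_df`, `cal`, `grid`, and `obs` are made up here.

```r
# Simulated binary outcome with a single predictor (illustrative values only)
set.seed(1)
n <- 500
weight <- runif(n, 100, 220)
event <- rbinom(n, 1, plogis(8 - 0.05 * weight))

# Logistic model and its predicted probabilities
model <- glm(event ~ weight, family = binomial)
cal_df <- data.frame(pred = fitted(model), event = event)

# Smooth the 0/1 outcomes against the predictions with loess
cal <- loess(event ~ pred, data = cal_df)
grid <- seq(min(cal_df$pred), max(cal_df$pred), length.out = 50)
obs <- predict(cal, newdata = data.frame(pred = grid))

# Calibration curve with the 45-degree line of identity for reference
plot(grid, obs, type = "l", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted probability", ylab = "Observed proportion (loess)")
abline(0, 1, lty = 2)
```

The closer the smoothed curve stays to the dashed line, the better calibrated the predictions are over that range.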

How to plot a linear and quadratic model on the same graph?

So I have 2 models for the data set that I am using:
> Bears1Fit1 <- lm(Weight ~ Neck.G)
>
> Bears2Fit2 <- lm(Weight ~ Neck.G + I(Neck.G^2))
I want to plot these two models on the same scatterplot. I have this so far:
> plot(Neck.G, Weight, pch = c(1), main = "Black Bears Data: Weight Vs Neck Girth", xlab = "Neck Girth (inches) ", ylab = "Weight (pounds)")
> abline(Bears1Fit1)
However, I am unsure of how I should put the quadratic model on the same graph as well. I want to be able to have both lines on the same graph.
Here is an example with cars data set:
data(cars)
make models:
model_lm <- lm(speed ~ dist, data = cars)
model_lm2 <- lm(speed ~ dist + I(dist^2), data = cars)
make new data:
new.data <- data.frame(dist = seq(from = min(cars$dist),
to = max(cars$dist), length.out = 200))
predict:
pred_lm <- predict(model_lm, newdata = new.data)
pred_lm2 <- predict(model_lm2, newdata = new.data)
plot:
plot(speed ~ dist, data = cars)
lines(pred_lm ~ new.data$dist, col = "red")
lines(pred_lm2 ~ new.data$dist, col = "blue")
legend("topleft", c("linear", "quadratic"), col = c("red", "blue"), lty = 1)
with ggplot2
library(ggplot2)
put all data in one data frame and convert to long format using melt from reshape2
preds <- data.frame(new.data,
linear = pred_lm,
quadratic = pred_lm2)
preds <- reshape2::melt(preds,
id.vars = 1)
plot
ggplot(data = preds)+
geom_line(aes(x = dist, y = value, color = variable ))+
geom_point(data = cars, aes(x = dist, y = speed))+
theme_bw()
EDIT: another way using just ggplot2 using two geom_smooth layers, one with the default formula y ~ x (so it need not be specified) and one with a quadratic model formula = y ~ x + I(x^2). In order to get a legend we can specify color within the aes call naming the desired entry as we want it to show in the legend.
ggplot(cars,
aes(x = dist, y = speed)) +
geom_point() +
geom_smooth(method = "lm",
aes(color = "linear"),
se = FALSE) +
geom_smooth(method = "lm",
formula = y ~ x + I(x^2),
aes(color = "quadratic"),
se = FALSE) +
theme_bw()
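One detail about the quadratic term is worth checking: the power has to sit inside I(). In a model formula, ^ outside I() is read as a formula operator, not as arithmetic, so I(x)^2 reduces to the term I(x), which duplicates x. A small sketch with the cars data (the names `bad` and `good` are illustrative):

```r
data(cars)

# Power outside I(): the second term is just I(dist), collinear with dist
bad <- lm(speed ~ dist + I(dist)^2, data = cars)

# Power inside I(): a genuine squared term
good <- lm(speed ~ dist + I(dist^2), data = cars)

coef(bad)   # the duplicated term cannot be estimated
coef(good)  # intercept, linear and quadratic coefficients
```

In the first fit the duplicated term is expected to come back as NA, which is a quick way to spot a misplaced parenthesis.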
