I am using the great plotting library bayesplot to visualize posterior probability intervals from models I am estimating with rstanarm. I want to graphically compare draws from different models by getting the posterior intervals for coefficients onto the same plot.
Imagine, for instance, that I have 1000 draws from the posterior for three parameters beta1, beta2, beta3 for two different models:
# load the plotting library
library(bayesplot)
#> This is bayesplot version 1.6.0
#> - Online documentation and vignettes at mc-stan.org/bayesplot
#> - bayesplot theme set to bayesplot::theme_default()
#> * Does _not_ affect other ggplot2 plots
#> * See ?bayesplot_theme_set for details on theme setting
library(ggplot2)
# generate fake posterior draws from model1
fdata <- matrix(rnorm(1000 * 3), ncol = 3)
colnames(fdata) <- c('beta1', 'beta2', 'beta3')
# fake posterior draws from model 2
fdata2 <- matrix(rnorm(1000 * 3, 1, 2), ncol = 3)
colnames(fdata2) <- c('beta1', 'beta2', 'beta3')
Bayesplot makes fantastic visualizations for individual model draws, and it is ggplot2 'under the hood' so I can customize as I please:
# a nice plot of 1
color_scheme_set("orange")
mcmc_intervals(fdata) + theme_minimal() + ggtitle("Model 1")
# a nice plot of 2
color_scheme_set("blue")
mcmc_intervals(fdata2) + ggtitle("Model 2")
But what I would like to achieve is to plot these two models together on the same plot, such that for each coefficient I have two intervals and can distinguish which interval is which by mapping color to the model. However, I can't figure out how to do this. Here are some things that don't work:
# doesn't work
mcmc_intervals(fdata) + mcmc_intervals(fdata2)
#> Error: Don't know how to add mcmc_intervals(fdata2) to a plot
# appears to pool
mcmc_intervals(list(fdata, fdata2))
Any ideas on how I could do this? Or how to do it manually given the matrices of posterior draws?
Created on 2018-10-18 by the reprex package (v0.2.1)
Just so the answer is also posted here, I have expanded on the code at the link from @Manny T (https://github.com/stan-dev/bayesplot/issues/232).
# simulate having posteriors for two different models each with parameters beta[1],..., beta[4]
posterior_1 <- matrix(rnorm(4000), 1000, 4)
posterior_2 <- matrix(rnorm(4000), 1000, 4)
colnames(posterior_1) <- colnames(posterior_2) <- paste0("beta[", 1:4, "]")
# use bayesplot::mcmc_intervals_data() function to get intervals data in format easy to pass to ggplot
library(bayesplot)
combined <- rbind(mcmc_intervals_data(posterior_1), mcmc_intervals_data(posterior_2))
combined$model <- rep(c("Model 1", "Model 2"), each = ncol(posterior_1))
# make the plot using ggplot
library(ggplot2)
theme_set(bayesplot::theme_default())
pos <- position_nudge(y = ifelse(combined$model == "Model 2", 0, 0.1))
ggplot(combined, aes(x = m, y = parameter, color = model)) +
  geom_linerange(aes(xmin = l, xmax = h), position = pos, size = 2) +
  geom_linerange(aes(xmin = ll, xmax = hh), position = pos) +
  geom_point(position = pos, color = "black")
If you are like me, you will want 80% and 90% credible intervals (instead of the default 50% inner and 90% outer intervals), and you might want the coordinates flipped; let's also add a dashed line at 0 (the model estimating no change). You can do that like this:
# use bayesplot::mcmc_intervals_data() function to get intervals data in format easy to pass to ggplot
library(bayesplot)
combined <- rbind(
  mcmc_intervals_data(posterior_1, prob = 0.8, prob_outer = 0.9),
  mcmc_intervals_data(posterior_2, prob = 0.8, prob_outer = 0.9)
)
combined$model <- rep(c("Model 1", "Model 2"), each = ncol(posterior_1))
# make the plot using ggplot
library(ggplot2)
theme_set(bayesplot::theme_default())
pos <- position_nudge(y = ifelse(combined$model == "Model 2", 0, 0.1))
ggplot(combined, aes(x = m, y = parameter, color = model)) +
  geom_linerange(aes(xmin = l, xmax = h), position = pos, size = 2) +
  geom_linerange(aes(xmin = ll, xmax = hh), position = pos) +
  geom_point(position = pos, color = "black") +
  coord_flip() +
  geom_vline(xintercept = 0, linetype = "dashed")
A few things to note on this last one. I added prob_outer = 0.9 even though that is the default, just to show how you might change the outer credible intervals. The dashed line is still specified with geom_vline() and xintercept = even though it appears horizontal after the flip: geoms are expressed in data coordinates and coord_flip() only swaps the display, so the line targets the estimate (m) axis either way. If you don't flip the axes, the identical geom_vline(xintercept = 0) gives the reference line, now drawn vertically; see the sketch below.
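For instance, here is the unflipped variant for comparison (same combined and pos objects as above, just with the flip removed):
# unflipped: parameters on the y axis, estimates on the x axis;
# the zero reference line is still geom_vline(xintercept = 0)
ggplot(combined, aes(x = m, y = parameter, color = model)) +
  geom_linerange(aes(xmin = l, xmax = h), position = pos, size = 2) +
  geom_linerange(aes(xmin = ll, xmax = hh), position = pos) +
  geom_point(position = pos, color = "black") +
  geom_vline(xintercept = 0, linetype = "dashed")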
I asked this question on the bayesplot page on GitHub and got a response (Issue #232).
I blew more time than I'd like to admit writing this, so might as well post it here. Here's a function that incorporates the suggestions from above that (for the moment) works for rstanarm and brms model objects.
compare_posteriors <- function(..., dodge_width = 0.5) {
  dots <- rlang::dots_list(..., .named = TRUE)
  # extract fixed-effect draws from each model, handling rstanarm and brms fits
  draws <- lapply(dots, function(x) {
    if (inherits(x, "stanreg")) {
      posterior::subset_draws(posterior::as_draws(x$stanfit),
        variable = names(fixef(x))
      )
    } else if (inherits(x, "brmsfit")) {
      brm_draws <- posterior::subset_draws(posterior::as_draws(x$fit),
        variable = paste0("b_", rownames(fixef(x)))
      )
      # strip the "b_" prefix so variable names match across packages
      posterior::variables(brm_draws) <- stringr::str_split(posterior::variables(brm_draws), "_", simplify = TRUE)[, 2]
      posterior::rename_variables(brm_draws, `(Intercept)` = Intercept)
    } else {
      stop(paste0(class(x)[1], " objects not supported."))
    }
  })
  # compute interval data per model, then stack with a model id column
  intervals <- lapply(draws, bayesplot::mcmc_intervals_data)
  combined <- dplyr::bind_rows(intervals, .id = "model")
  ggplot(combined, aes(x = m, y = parameter, color = model, group = model)) +
    geom_linerange(aes(xmin = l, xmax = h), size = 2, position = position_dodge(dodge_width)) +
    geom_linerange(aes(xmin = ll, xmax = hh), position = position_dodge(dodge_width)) +
    geom_point(color = "black", position = position_dodge(dodge_width)) +
    geom_vline(xintercept = 0, linetype = "dashed")
}
Usage:
compare_posteriors(mod1, mod2, mod3)
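For illustration, here is a hypothetical call with two small rstanarm fits (mod1 and mod2 are not defined in the original post; this is just a sketch):
library(rstanarm)
# two toy models on built-in data; refresh = 0 silences sampler output
mod1 <- stan_glm(mpg ~ disp, data = mtcars, refresh = 0)
mod2 <- stan_glm(mpg ~ disp + wt, data = mtcars, refresh = 0)
compare_posteriors(mod1, mod2)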
I would like to ask, please, how to create two statistical graphs from my table:
regression line with prediction interval
regression line with confidence interval
I used this script, but I don't know what to do next:
pred <- lm(dta$Number.of.species ~ dta$Latitude)
pred_interval <- predict(lm(dta$Number.of.species ~ dta$Latitude), level = .99, interval = "confidence")[,2]
conf_interval <- predict(pred, newdata=dta, interval="prediction")[,3]
par(mfrow=c(2,2))
plot(
dta$Latitude,
dta$Number.of.species,
pch = 1,
ylim = c(0, 180),
xlim = c(37, 40)
)
plot(
dta$Latitude,
dta$Number.of.species,
pch = 1,
ylim = c(0, 180),
xlim = c(37, 40)
)
abline(pred)
Thank you for your time.
If you are just learning R, I would make 2 recommendations.
First, I would suggest learning the ggplot2 package, rather than using the base R plotting system. It is generally much easier to build up complex plots with many parts using ggplot().
Second, there are several packages designed to make working with model results easier in R. The most prominent of these are broom and the easystats collection of packages (modelbased, performance, parameters, etc.). Between the two, I would recommend easystats.
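For reference, broom can attach intervals in one step; a minimal sketch (assuming a recent broom version, where augment() supports an interval argument for lm objects):
library(broom)
m <- lm(mpg ~ disp, data = mtcars)
# .fitted holds the predictions; .lower/.upper hold the interval bounds
head(augment(m, interval = "confidence"))
head(augment(m, interval = "prediction"))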
I'll demonstrate how to build up the data frame for plotting the model manually and using modelbased.
Manually building data frame
library(ggplot2)
# fit the model
m <- lm(mpg ~ disp, data = mtcars)
# construct prediction and confidence intervals using predict()
m_ci <- predict(m, interval = "confidence") |>
as.data.frame() |>
setNames(c("fit", "ci_lo", "ci_hi"))
m_pi <- predict(m, interval = "prediction") |>
as.data.frame() |>
setNames(c("fit", "pi_lo", "pi_hi"))
#> Warning in predict.lm(m, interval = "prediction"): predictions on current data refer to _future_ responses
# merge the interval data frames with the data frame used in the model
m_data <-
merge(
merge(
model.frame(m), m_ci, by = "row.names"
),
m_pi
)
# make a plot using the merged model data frame
ggplot(m_data) + # use m_data in the plot
aes(x = disp) + # put the 'disp' variable on the x axis
geom_point(aes(y = mpg)) + # add points, put the 'mpg' variable on the y axis for these
geom_ribbon(aes(ymin = pi_lo, ymax = pi_hi), fill = "lightblue", alpha = .4) + # add a ribbon for the prediction interval, put the pi_lo/pi_hi values on the y axis for this, color it lightblue and make it semitransparent
geom_ribbon(aes(ymin = ci_lo, ymax = ci_hi), fill = "lightblue", alpha = .4) + # add a ribbon for the confidence interval, put the ci_lo/ci_hi values on the y axis for this, color it lightblue and make it semitransparent
geom_line(aes(y = fit)) + # add a line for the fitted values, put the 'fit' values on the y axis
theme_minimal() # use a white background for the plot
Using the modelbased package to streamline some of the above steps
library(modelbased)
# compute intervals, including fitted values and original model matrix
ci <- estimate_expectation(m) # model fitted values and confidence intervals (uncertainty intervals on the expected values/predicted means)
pi <- estimate_prediction(m) # model fitted values and prediction intervals (uncertainty intervals on the individual predictions)
plot(ci) + # this produces a ggplot with points, fitted line, and confidence ribbon
geom_ribbon(aes(x = disp, ymin = CI_low, ymax = CI_high), data = pi, alpha = .4) + # add a prediction ribbon
theme_minimal() # use a white background
Here is how to modify the color of the ribbon when working with modelbased:
plot(ci, ribbon = list(fill = "lightblue")) +
geom_ribbon(aes(x = disp, ymin = CI_low, ymax = CI_high), data = pi, fill = "lightblue", alpha = .4) +
theme_minimal()
Created on 2021-08-18 by the reprex package (v2.0.0)
I am running an analysis in R on the effect of canopy cover (OverheadCover) and the number of carcasses placed on the same location (CarcassNumber) on the proportion of carrion eaten by birds (ProportionBirdsScavenging). The interaction effect OverheadCover * CarcassNumber is significant, and I would like to visualise this using ggplot as explained here: https://sebastiansauer.github.io/vis_interaction_effects/. I won't be using method = "lm" as in the example, but method = glmmTMB::glmmTMB. I've added the extra arguments formula = and method.args = to make sure R computes the smooth correctly.
This is how it should look, but I prefer the graph to be made with ggplot because then all my graphs will be in the same style.
glmm_interaction <- glmmTMB(ProportionBirdsScavenging ~ OverheadCover * CarcassNumber + (1|Area), data = data_both, beta_family(link = "logit"), weights = pointWeight_scaled)
plot_model(glmm_interaction, type = "int", ci.lvl = 0.682) # conf. int. of 68.3% -> ± standard error
This is the code I'm trying to run, but I can't get it to work. It keeps giving me errors, like object 'pointWeight_scaled' not found. Does anyone have an idea what I'm doing wrong here?
qplot(x = OverheadCover, y = ProportionBirdsScavenging, color = CarcassNumber, data = data_both) +
geom_smooth(method = glmmTMB::glmmTMB,
formula = ProportionBirdsScavenging ~ OverheadCover * CarcassNumber,
method.args = list(data = data_both, beta_family(link = "logit"), weights = pointWeight_scaled))
I know that it might be easier to just individually run the models and plot them on the same graph. I've done that, and it works. However, my calculated standard errors are larger than the ones in the plot_model(), so I wanted to see how these standard errors look if R does all the work, hence my intention to plot it this way.
This is how it should look, but I prefer the graph to be made with ggplot
The plot returned by plot_model() is a ggplot object, which you can modify as you like. You could also use the ggeffects package, which returns the underlying data that can be used to create the plot. There are many examples in the vignettes, both on how to create your own plots and on how to modify plots returned by plot(), e.g. here or here.
Here is a toy example:
library(ggplot2)
library(ggeffects)
library(lme4)
#> Loading required package: Matrix
set.seed(123)
dat <- data.frame(
outcome = rbinom(n = 500, size = 1, prob = 0.25),
var_binom = as.factor(rbinom(n = 500, size = 1, prob = 0.3)),
var_cont = rnorm(n = 500, mean = 10, sd = 3),
group = sample(letters[1:4], size = 500, replace = TRUE)
)
model <- glmer(
outcome ~ var_binom * poly(var_cont, 2) + (1 | group),
data = dat,
family = binomial(link = "logit")
)
predictions <- ggpredict(model, c("var_cont [all]", "var_binom"))
# plot-function from ggeffects
plot(predictions)
# self made ggplot
ggplot(
predictions,
aes(x = x, y = predicted, ymin = conf.low, ymax = conf.high, colour = group, fill = group)
) +
geom_line() +
geom_ribbon(alpha = .1, colour = NA) +
theme_minimal()
Created on 2020-02-06 by the reprex package (v0.3.0)
Notice that your graphic constructed from Problem 4 shows a quadratic
or curved relationship between log_wage and exp. The next
task is to plot three quadratic functions, one for each race level "black",
"white" and "other". To estimate the quadratic fit, you can use the
following function quad_fit:
```{r}
quad_fit <- function(data_sub) {
  return(lm(log_wage ~ exp + I(exp^2), data = data_sub)$coefficients)
}
quad_fit(salary_data)
```
The above function computes the least squares quadratic fit and
returns the coefficients a1, a2, a3 of
y-hat = a1 + a2*x + a3*x^2
where y-hat = log(wage) and x = exp.
Use ggplot to accomplish this task or use base R graphics for
partial credit. Make sure to include a legend and appropriate labels.
My attempt
blackfit <- quad_fit(salary_data[salary_data$race == "black",])
whitefit <- quad_fit(salary_data[salary_data$race == "white",])
otherfit <- quad_fit(salary_data[salary_data$race == "other",])
yblack <- blackfit[1] + blackfit[2]*salary_data$exp + blackfit[3]*(salary_data$exp)^2
ywhite <- whitefit[1] + whitefit[2]*salary_data$exp + whitefit[3]*(salary_data$exp)^2
yother <- otherfit[1] + otherfit[2]*salary_data$exp + otherfit[3]*(salary_data$exp)^2
soloblack <- salary_data[salary_data$race == "black",]
solowhite <- salary_data[salary_data$race == "white",]
soloother <- salary_data[salary_data$race == "other",]
ggplot(data = soloblack) +
geom_point(aes(x = exp, y = log_wage)) +
stat_smooth(aes(y = log_wage, x = exp), formula = y ~ yblack)
This is only a first attempt, for the data filtered for race == "black".
I am not clear on what the formula should look like, because the quad_fit function seems to already do the calculations for you.
Consider plotting fitted values using the output of quad_fit (as shown by @StefanK here) and use by() to plot across all distinct values of race:
reg_plot <- function(sub) {
  # PREDICTED DATA FOR LINE PLOT
  # quad_fit() returns coefficients (not an lm object), so build the
  # fitted values from the coefficients directly
  q_fit <- quad_fit(sub)
  predicted_df <- data.frame(
    wage_pred = q_fit[1] + q_fit[2] * sub$exp + q_fit[3] * sub$exp^2,
    exp = sub$exp
  )
  # ORIGINAL SCATTER PLOT WITH PREDICTED LINE
  ggplot(data = sub) +
    geom_point(aes(x = exp, y = log_wage, alpha = exp)) +
    labs(x = "Job Experience", y = "Log of Wage",
         title = paste("Wage and Job Experience Plot for",
                       sub$race[[1]], "in Salary Dataset")) +
    geom_line(color = "red", data = predicted_df, aes(x = exp, y = wage_pred))
}
# RUN GRAPHS FOR EACH RACE
by(salary_data, salary_data$race, reg_plot)
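If the assignment calls for a single plot with a legend rather than one plot per race, the quadratic fits can also be drawn directly with stat_smooth() by mapping race to color (a sketch, assuming salary_data has columns exp, log_wage, and race as in the question):
ggplot(salary_data, aes(x = exp, y = log_wage, color = race)) +
  geom_point(alpha = 0.3) +
  # same least squares quadratic fit as quad_fit(), one curve per race
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE) +
  labs(x = "Job Experience", y = "Log of Wage", color = "Race")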
I have made a plot of the polynomial function y = x^2 - 6*x + 9,
using a sequence of points with minor random error added in y. I used these points to fit a spline model to the function, then calculated the derivative of the spline model with R's predict() function, and added both spline curves to the plot.
By the way, the expected derivative function is this: dy / dx = 2*x - 6
The original function I colored blue and the first derivative I colored red. I wish to add a legend to this plot, but I'm finding that difficult since I did not assign the points to named objects: I declared the data frames inside the geom_smooth() calls.
The code I'm using is this:
library(ggplot2)
# Plot the function: f(x) = x^2 - 6x + 9
# with a smooth spline:
# And then the derivative of that function from predicted values of the
# smoothed spline: f ' (x) = 2*x - 6
# Get a large sequence of x-values:
x <- seq(from = -10, to = 10, by = 0.01)
# The y-values are a function of each x value.
y <- x^2 - 6*x + 9 + rnorm(length(x), 0, 0.5)
# Fit the curve to a model which is a smoothed spline.
model <- smooth.spline(x = x, y = y)
# Predict the 1st derivative of this smoothed spline.
f_x <- predict(model, x = seq(from = min(x), to = max(x), by = 1), deriv = 1)
# Plot the smoothed spline of the original function and the derivative with respect to x.
p <- ggplot() +
  theme_bw() +
  geom_smooth(data = data.frame(x = x, y = y), aes(x = x, y = y),
              method = "loess", col = "blue", se = TRUE) +
  geom_smooth(data = data.frame(x = f_x$x, y = f_x$y), aes(x = x, y = y),
              method = "loess", col = "red", se = TRUE)
# Set the bounds of the plot.
p <- p + scale_x_continuous(breaks = scales::pretty_breaks(n = 20), limits = c(-5, 10)) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 20), limits = c(-10, 10))
# Add some axis labels
p <- p + labs(x = "x-axis", y = "y-axis",
              title = "Original Function and predicted derivative function")
p <- p + scale_fill_manual(values = c("blue", "red"),
                           labels = c("Original Function", "Derivative Function with respect to x"))
print(p)
I was hoping that I could add the legend with scale_fill_manual(), but my attempt does not add a legend to the plot. Essentially, the plot I get generally looks like this, minus the messy legend that I added in paint. I would like that legend, thank you.
I did this because I want to show my chemistry instructor that I can accurately measure heat capacity using only the points from differential scanning calorimetry data, since I believe the heat capacity is just the first derivative of heat flow with respect to temperature.
So I tried to make a plot showing the original function overlayed with the 1st derivative function with respect to x, showing that the plot of the first derivative made only from a spline curve fitted to raw data points reliably produces the expected line dy / dx = 2 * x - 6, which it does.
I just want to add that legend.
Creating a data frame with your data and using color within the aesthetics is the most common way of doing this.
df <- rbind(
  data.frame(data = "f(x)", x = x, y = y),
  data.frame(data = "f'(x)", x = f_x$x, y = f_x$y))
p <- ggplot(df, aes(x,y, color=data)) + geom_smooth(method = 'loess')
p <- p + scale_x_continuous(breaks = scales::pretty_breaks(n = 20), limits = c(-5, 10)) + scale_y_continuous(breaks = scales::pretty_breaks(n = 20), limits = c(-10, 10))
p <- p + labs(x = "x-axis", y = "y-axis", title = "Original Function and predicted derivative function")
p <- p + scale_color_manual(name = "Functions", values = c("blue", "red"), labels = c("Original Function", "Derivative Function with respect to x"))
print(p)
I am analyzing data from a wind turbine. Normally this is the sort of thing I would do in Excel, but the quantity of data requires something heavy-duty. I have never used R before, so I am just looking for some pointers.
The data consists of 2 columns WindSpeed and Power, so far I have arrived at importing the data from a CSV file and scatter-plotted the two against each other.
What I would like to do next is to sort the data into ranges; for example all data where WindSpeed is between x and y and then find the average of power generated for each range and graph the curve formed.
From this average I want recalculate the average based on data which falls within one of two standard deviations of the average (basically ignoring outliers).
Any pointers are appreciated.
For those who are interested, I am trying to create a graph similar to this. It's a pretty standard type of graph, but as I said, the sheer quantity of data requires something heavier than Excel.
Since you're no longer in Excel, why not use a modern statistical methodology that doesn't require crude binning of the data and ad hoc methods to remove outliers: local regression, as implemented by loess.
Using a slight modification of csgillespie's sample data:
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
plot(w_sp, power)
x_grid <- seq(0, 100, length = 100)
lines(x_grid, predict(loess(power ~ w_sp), x_grid), col = "red", lwd = 3)
Throw this version, similar in motivation to @hadley's, into the mix. It uses an additive model with an adaptive smoother from package mgcv:
Dummy data first, as used by @hadley:
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
df <- data.frame(power = power, w_sp = w_sp)
Fit the additive model using gam(), using an adaptive smoother and smoothness selection via REML
require(mgcv)
mod <- gam(power ~ s(w_sp, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)
Predict from our model and get the standard errors of the fit, using the latter to generate an approximate 95% confidence interval:
x_grid <- with(df, data.frame(w_sp = seq(min(w_sp), max(w_sp), length = 100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
x_grid <- within(x_grid, upr <- fit + 2 * pred$se.fit)
x_grid <- within(x_grid, lwr <- fit - 2 * pred$se.fit)
Plot everything, including the loess fit for comparison:
plot(power ~ w_sp, data = df, col = "grey")
lines(fit ~ w_sp, data = x_grid, col = "red", lwd = 3)
## upper and lower confidence intervals ~95%
lines(upr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
lines(lwr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
## add loess fit from @hadley's answer
lines(x_grid$w_sp, predict(loess(power ~ w_sp, data = df), x_grid), col = "blue",
lwd = 3)
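If you prefer the ggplot2 route shown further below, a similar adaptive-smoother fit can be drawn with geom_smooth(), which calls mgcv::gam() when method = "gam" (a sketch, not part of the original answer):
library(ggplot2)
ggplot(df, aes(x = w_sp, y = power)) +
  geom_point(colour = "grey") +
  # same adaptive smoother and REML smoothness selection as the gam() call above
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "ad", k = 20),
              method.args = list(method = "REML"))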
First we will create some example data to make the problem concrete:
w_sp = sample(seq(0, 100, 0.01), 1000)
power = 1/(1+exp(-(rnorm(1000, mean=w_sp, sd=5) -40)/5))
Suppose we want to bin the power values between [0,5), [5,10), etc. Then
bin_incr = 5
bins = seq(0, 95, bin_incr)
y_mean = sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x+bin_incr)]))
We have now created the mean values between the ranges of interest. Note: if you wanted the median values, just change mean to median. All that's left to do is to plot them:
plot(w_sp, power)
# plot the bin means at the bin midpoints
points(bins + bin_incr/2, y_mean, col = 3, pch = 16)
To get the average based on data that falls within two standard deviations of the average, we need to create a slightly more complicated function:
noOutliers = function(x, power, w_sp, bin_incr) {
  d = power[w_sp >= x & w_sp < (x + bin_incr)]
  m_d = mean(d)
  # keep only the values within two standard deviations of the bin mean
  d_trim = d[d > (m_d - 2*sd(d)) & d < (m_d + 2*sd(d))]
  return(mean(d_trim))
}
y_no_outliers = sapply(bins, noOutliers, power, w_sp, bin_incr)
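To see the effect of the trimming, the trimmed means can be overlaid on the scatterplot alongside the raw bin means (a small sketch reusing the objects above):
plot(w_sp, power)
points(bins + bin_incr/2, y_mean, col = 3, pch = 16)         # raw bin means
points(bins + bin_incr/2, y_no_outliers, col = 2, pch = 17)  # trimmed means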
Here are some examples of fitted curves (Weibull analysis) for commercial turbines:
http://www.inl.gov/wind/software/
http://www.irec.cmerp.net/papers/WOE/Paper%20ID%20161.pdf
http://www.icaen.uiowa.edu/~ie_155/Lecture/Power_Curve.pdf
I'd also recommend playing around with Hadley's own ggplot2. His website is a great resource: http://had.co.nz/ggplot2/.
# If you haven't already installed ggplot2:
install.packages("ggplot2", dependencies = TRUE)
# Load the ggplot2 package
require(ggplot2)
# csgillespie's example data
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
# Bind the two variables into a data frame, which ggplot prefers
wind <- data.frame(w_sp = w_sp, power = power)
# Take a look at how the first few rows look, just for fun
head(wind)
# Create a simple plot
ggplot(data = wind, aes(x = w_sp, y = power)) + geom_point() + geom_smooth()
# Create a slightly more complicated plot as an example of how to fine tune
# plots in ggplot
p1 <- ggplot(data = wind, aes(x = w_sp, y = power))
p2 <- p1 + geom_point(colour = "darkblue", size = 1, shape = 16)  # shape 16 is a solid dot
p3 <- p2 + geom_smooth(method = "loess", se = TRUE, colour = "purple")
p3 + scale_x_continuous(name = "mph") +
  scale_y_continuous(name = "power") +
  labs(title = "Wind speed and power")  # labs() replaces the long-deprecated opts()