I want to overlay parameter estimates of group intercept and slope from a Bayesian analysis onto a grouped ggplot scatter-plot of actual data. I can overlay the individual lines just fine but I would really like to get a single mean line for each of the groups as well.
Here is some toy data: three groups with differing intercepts and slopes.
# data
x <- rnorm(120, 0, 1)
y <- c(20 + 3*x[1:40] + rnorm(40,0.01), rnorm(40,0.01), 10 + -3*x[81:120] + rnorm(40,0.01))
group = factor(rep(letters[1:3], each = 40))
df <- data.frame(group,x,y)
# fake parameter estimates of intercept and slope
parsDF <- data.frame(int = c(rnorm(10,20,.5), rnorm(10,0,.5), rnorm(10,10,.5)),
slope = c(rnorm(10,3,.3), rnorm(10,0,.3), rnorm(10,-3,.3)),
group = rep(letters[1:3], each = 10))
Now for the plot
ggplot(df, aes(x,y, colour = group)) +
geom_abline(data = parsDF, aes(intercept = int, slope = slope), colour = "gray75") +
geom_point() +
facet_wrap(~group)
I thought maybe I could add a single mean intercept and slope line for each group via stat_summary-type methods, like so:
ggplot(df, aes(x,y, colour = group)) +
geom_abline(data = parsDF, aes(intercept = int, slope = slope), colour = "gray75") +
geom_abline(data = parsDF, aes(intercept = int, slope = slope), stat = "summary", fun.y = "mean", colour = "black", linetype = "dotted") +
geom_point() +
facet_wrap(~group)
But it just ignores those arguments and re-plots the individual lines over the existing ones.
I realise I could just calculate the mean of the intercepts and slopes for each group and brute-force that into the graph somehow but I can't see how to do that without mucking up the faceting by group, other than by creating another dataframe for mean slopes and intercepts and passing that into the plot as well. And I don't want to simply use geom_smooth() because that will use the actual data not my parameter estimates.
Any help much appreciated
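For what it's worth, the "another dataframe" route mentioned above does not have to disturb the faceting, as long as the summary data frame keeps the group column. A minimal sketch, assuming the same toy data as in the question; meanDF is a name introduced here for illustration and the seed is an addition for reproducibility:

```r
library(ggplot2)

set.seed(1)  # added so the sketch is reproducible

# Toy data and fake parameter estimates, as in the question
x <- rnorm(120, 0, 1)
y <- c(20 + 3*x[1:40] + rnorm(40, 0.01), rnorm(40, 0.01), 10 + -3*x[81:120] + rnorm(40, 0.01))
group <- factor(rep(letters[1:3], each = 40))
df <- data.frame(group, x, y)
parsDF <- data.frame(int = c(rnorm(10, 20, .5), rnorm(10, 0, .5), rnorm(10, 10, .5)),
                     slope = c(rnorm(10, 3, .3), rnorm(10, 0, .3), rnorm(10, -3, .3)),
                     group = rep(letters[1:3], each = 10))

# One row per group: mean intercept and mean slope (meanDF is a hypothetical name)
meanDF <- aggregate(cbind(int, slope) ~ group, data = parsDF, FUN = mean)

p <- ggplot(df, aes(x, y, colour = group)) +
  geom_abline(data = parsDF, aes(intercept = int, slope = slope), colour = "gray75") +
  # meanDF still has a group column, so each mean line lands in its own facet
  geom_abline(data = meanDF, aes(intercept = int, slope = slope),
              colour = "black", linetype = "dotted") +
  geom_point() +
  facet_wrap(~group)
p
```

Because geom_abline draws one line per row of whatever data it is given, summarising before plotting is probably the simplest way to get exactly one line per facet.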
Related
How do I plot a log linear model in R?
Currently, I am doing this but am not sure if it's the right/efficient way:
data(food)
model1 <- lm(food_exp~log(income), data = food)
temp_var <- predict(model1, interval="confidence")
new_df <- cbind(food, temp_var)
head(new_df)
ggplot(new_df, aes(x = income, y = food_exp))+
geom_point() +
geom_smooth(aes(y=lwr), color = "red", linetype = "dashed")+
geom_smooth(aes(y=upr), color = "red", linetype = "dashed")+
geom_smooth(aes(y = fit), color = "blue")+
theme_economist()
You can use geom_smooth and put your formula directly in. It should yield the same as your fit (which you can check by also plotting that):
ggplot(new_df, aes(x = income, y = food_exp)) +
  geom_point() +
  geom_point(aes(y = fit), color = "red") + # your original fit
  geom_smooth(method = lm, formula = y ~ log(x)) # ggplot fit
If you don't care about extracting the parameters and just want the plot, you can plot directly in ggplot2.
Some fake data for plotting:
library(tidyverse)
set.seed(454)
income <- VGAM::rpareto(n = 100, scale = 20, shape = 2)*1000
food_exp <- rnorm(100, income*.3+.1, 3)
food <- data.frame(income, food_exp)
Now within ggplot2, use the geom_smooth function and specify that you want a linear model. Additionally, you can directly transform the income in the aes argument:
ggplot(food, aes(x = log(income), y = food_exp))+
geom_point()+
geom_smooth(method = "lm")+
theme_bw()+
labs(
title = "Log Linear Model Food Expense as a Function of Log(income)",
x = "Log(Income)",
y = "Food Expenses"
)
This works for confidence intervals, but to add prediction intervals you'll need to do what you did earlier: fit the model yourself and generate the prediction intervals from it.
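A minimal sketch of that prediction-interval route, assuming made-up data in place of the food dataset (the log-normal income and the coefficients below are invented for illustration only). It fits the model with lm, asks predict for interval = "prediction", and draws the band with geom_ribbon:

```r
library(ggplot2)

set.seed(454)

# Hypothetical stand-in data (base R only); not the `food` dataset from the question
income <- exp(rnorm(100, mean = 10, sd = 0.5))
food_exp <- 50 * log(income) + rnorm(100, sd = 20)
food <- data.frame(income, food_exp)

# Fit the log-linear model and get fit / lwr / upr columns for each row
model1 <- lm(food_exp ~ log(income), data = food)
pred_int <- predict(model1, newdata = food, interval = "prediction")
new_df <- cbind(food, pred_int)

p <- ggplot(new_df, aes(x = income, y = food_exp)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "red", alpha = 0.15) +  # prediction band
  geom_point() +
  geom_line(aes(y = fit), colour = "blue") +                              # fitted curve
  theme_bw()
p
```

Plotting the precomputed fit with geom_line rather than geom_smooth avoids smoothing the fitted values a second time.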
I have a data.frame with observed success/failure outcomes per two groups along with expected probabilities:
library(dplyr)
observed.probability.df <- data.frame(group = c("A","B"), p = c(0.4,0.6))
expected.probability.df <- data.frame(group = c("A","B"), p = qlogis(c(0.45,0.55)))
observed.data.df <- do.call(rbind,lapply(c("A","B"), function(g)
data.frame(group = g, value = c(rep(0,1000*dplyr::filter(observed.probability.df, group != g)$p),rep(1,1000*dplyr::filter(observed.probability.df, group == g)$p)))
)) %>% dplyr::left_join(expected.probability.df)
observed.probability.df$group <- factor(observed.probability.df$group, levels = c("A","B"))
observed.data.df$group <- factor(observed.data.df$group, levels = c("A","B"))
I'm fitting a logistic regression (binomial glm with a logit link function) to these data with the offset term:
fit <- glm(value ~ group + offset(p), data = observed.data.df, family = binomial(link = 'logit'))
Now, I'd like to plot these data as a bar graph using ggplot2's geom_bar, color-coded by group, and to add to that the trend line and shaded standard error area estimated in fit.
I'd use stat_smooth for that but I don't think it can handle the offset term in its formula, so it looks like I need to assemble this figure in an alternative way.
To get the bars and the trend line I used:
slope.est <- function(x, ests) plogis(ests[1] + ests[2] * x)
library(ggplot2)
ggplot(observed.probability.df, aes(x = group, y = p, fill = group)) +
geom_bar(stat = 'identity') +
stat_function(fun = slope.est,args=list(ests=coef(fit)),size=2,color="black") +
scale_x_discrete(name = NULL,labels = levels(observed.probability.df$group), breaks = sort(unique(observed.probability.df$group))) +
theme_minimal() + theme(legend.title = element_blank()) + ylab("Fraction of cells")
So the question is how to add to that the shaded standard error around the trend line?
Using stat_function I am able to shade the entire area from the upper bound of the standard error all the way down to the X-axis:
ggplot(observed.probability.df, aes(x = group, y = p, fill = group)) +
geom_bar(stat = 'identity') +
stat_function(fun = slope.est,args=list(ests=coef(fit)),size=2,color="black") +
stat_function(fun = slope.est,args=list(ests=summary(fit)$coefficients[,1]+summary(fit)$coefficients[,2]),geom='area',fill="gray",alpha=0.25) +
scale_x_discrete(name = NULL,labels = levels(observed.probability.df$group), breaks = sort(unique(observed.probability.df$group))) +
theme_minimal() + theme(legend.title = element_blank()) + ylab("Fraction of cells")
Which is close but not quite there.
Any idea how to subtract from the shaded area above the area that's below the lower bound of the standard error? Perhaps geom_ribbon is the way to go here, but I don't know how to combine it with the slope.est function
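A rough sketch of the geom_ribbon route, for illustration only. It reuses the question's own slope.est function and its "estimate plus or minus one standard error" construction, evaluated on a grid of x positions between the two bars, so it is not a proper confidence band for the fitted model. The data are rebuilt here without dplyr, and ribbon.df is a hypothetical name:

```r
library(ggplot2)

# Same observed and expected values as in the question, built with base R
observed.probability.df <- data.frame(group = factor(c("A", "B")), p = c(0.4, 0.6))
observed.data.df <- data.frame(
  group = factor(rep(c("A", "B"), each = 1000)),
  value = c(rep(c(0, 1), c(600, 400)), rep(c(0, 1), c(400, 600))),
  p     = rep(qlogis(c(0.45, 0.55)), each = 1000)
)
fit <- glm(value ~ group + offset(p), data = observed.data.df,
           family = binomial(link = "logit"))

slope.est <- function(x, ests) plogis(ests[1] + ests[2] * x)
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# Evaluate the line and both bounds on a grid spanning the two bar positions (1 and 2)
ribbon.df <- data.frame(x = seq(1, 2, length.out = 50))
ribbon.df$fit   <- slope.est(ribbon.df$x, est)
ribbon.df$lower <- slope.est(ribbon.df$x, est - se)
ribbon.df$upper <- slope.est(ribbon.df$x, est + se)

p <- ggplot(observed.probability.df, aes(x = group, y = p, fill = group)) +
  geom_bar(stat = "identity") +   # discrete layer first, so the x scale is discrete
  geom_ribbon(data = ribbon.df, aes(x = x, ymin = lower, ymax = upper),
              inherit.aes = FALSE, fill = "gray", alpha = 0.5) +
  geom_line(data = ribbon.df, aes(x = x, y = fit), inherit.aes = FALSE, colour = "black") +
  theme_minimal() + theme(legend.title = element_blank()) + ylab("Fraction of cells")
p
```

Since geom_ribbon takes both ymin and ymax, only the region between the two bounds is shaded, rather than everything down to the x-axis.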
I have a question about ggplot2.
I want to connect each data point to the OLS fit with a vertical line, as in the code listed below.
Can I pass ..y.., the value calculated by stat_smooth, to geom_linerange directly?
I tried stat_smooth(..., geom = "linerange", mapping = aes(ymin = pmin(myy, ..y..), ymax = pmax(myy, ..y..))), but it does not give the result I want.
library(ggplot2)
df <- data.frame(myx = 1:10,
myy = c(1:10) * 5 + 2 * rnorm(10, 0, 1))
lm.fit <- lm("myy~myx", data = df)
pred <- predict(lm.fit)
ggplot(df, aes(myx, myy)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_linerange(mapping = aes(ymin = pmin(myy, pred),
ymax = pmax(myy, pred)))
stat_smooth evaluates the values at n evenly spaced points, with n = 80 by default. These points may not coincide with the original x values in your data frame.
Since you are calculating predicted values anyway, it would probably be more straightforward to add that back to your data frame and plot all geom layers based on that as your data source, for example:
df$pred <- pred
ggplot(df, aes(myx, myy)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_linerange(aes(ymin = myy, ymax = pred))
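If you want to see the evaluation grid that stat_smooth uses, layer_data() returns the data computed for a given layer. A small sketch, assuming an x column stored as double (unlike the integer 1:10 in the question) and a seed added for reproducibility; smooth_data is a name introduced here:

```r
library(ggplot2)

set.seed(1)
# x stored as double for this sketch
df <- data.frame(myx = as.numeric(1:10),
                 myy = (1:10) * 5 + 2 * rnorm(10, 0, 1))

p <- ggplot(df, aes(myx, myy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is the smooth; this is what the stat computed for it
smooth_data <- layer_data(p, 2)
nrow(smooth_data)      # number of points the line was evaluated at
head(smooth_data$x)    # evenly spaced x values, not the original myx values
```

Comparing smooth_data$x with df$myx shows why mapping ..y.. onto the original rows does not line up, and why precomputing pred is the simpler option.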
I am trying to apply a color scale to a loess line based on a 3rd variable (Temperature). I've only been able to get the color to vary based on either the variable in the x or y axis.
set.seed(1938)
a2 <- data.frame(year = seq(0, 100, length.out = 1000),
values = cumsum(rnorm(1000)),
temperature = cumsum(rnorm(1000)))
library(ggplot2)
ggplot(a2, aes(x = year, y = values, color = values)) +
geom_line(size = 0.5) +
geom_smooth(aes(color = ..y..), size = 1.5, se = FALSE, method = 'loess') +
scale_colour_gradient2(low = "blue", mid = "yellow", high = "red",
midpoint = median(a2$values)) +
theme_bw()
This code produces the following plot, but I would like the loess line color to vary based on the temperature variable instead.
I tried using
color = loess(temperature ~ values, a2)
but I got an error of
"Error: Aesthetics must be either length 1 or the same as the data (1000): colour, x, y"
Thank you for any and all help! I appreciate it.
You can't do that when the loess is calculated by geom_smooth, since it only has access to:
"..y.. which is the vector of y-values internally calculated by geom_smooth to create the regression curve"
(quoted from "Is it possible to apply color gradient to geom_smooth with ggplot in R?")
To do this, you should calculate the loess curve manually with loess and then plot it with geom_line:
set.seed(1938)
a2 <- data.frame(year = seq(0,100,length.out=1000),
values = cumsum(rnorm(1000)),
temperature = cumsum(rnorm(1000)))
# Calculate loess curve and add values to data.frame
a2$predict <- predict(loess(values~year, data = a2))
ggplot(a2, aes(x = year, y = values)) +
geom_line(size = 0.5) +
geom_line(aes(y = predict, color = temperature), size = 2) +
  # colour is mapped to temperature, so centre the gradient on its median
  scale_colour_gradient2(low = "blue", mid = "yellow", high = "red",
                         midpoint = median(a2$temperature)) +
theme_bw()
The downside of this is that it won't fill in gaps in your data as nicely as geom_smooth
As per ex18q1 in "R for Data Science" I am trying to find the best model for the data:
sim1a <- tibble(
x = rep(1:10, each = 3),
y = x * 1.5 + 6 + rt(length(x), df = 2)
)
I've applied linear model and am trying to plot the results on a graph using ggplot:
sim1a_mod <- lm(x ~ y, data = sim1a)
ggplot(sim1a, aes(x, y)) +
geom_point(size = 2, colour= "gray") +
geom_abline(intercept = coef(sim1a_mod)[[1]], slope = coef(sim1a_mod)[[2]], colour = "red")
coef(sim1a_mod)[[1]] prints -1.14403
coef(sim1a_mod)[[2]] prints 0.4384473
I create the plot with the data points, but the model is not showing. What am I doing wrong?
The nomenclature for typing formulas for model functions like lm(), glm(), lmer() etc. in R is always DV ~ IV1 + IV2 + ... + IVn where DV is your dependent variable and IVn is your list of independent variables. We typically chart the dependent variable on the y-axis and the independent variable on the x-axis, so in your case you'll need to change your sim1a_mod model to lm(y ~ x, data = sim1a).
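A minimal sketch of the corrected version, assuming the same simulated data as in the question (the seed is an addition so the result is reproducible):

```r
library(tibble)
library(ggplot2)

set.seed(18)  # added for reproducibility; not in the original exercise
sim1a <- tibble(
  x = rep(1:10, each = 3),
  y = x * 1.5 + 6 + rt(length(x), df = 2)
)

# Dependent variable on the left of ~, independent variable on the right
sim1a_mod <- lm(y ~ x, data = sim1a)

p <- ggplot(sim1a, aes(x, y)) +
  geom_point(size = 2, colour = "gray") +
  geom_abline(intercept = coef(sim1a_mod)[[1]],
              slope = coef(sim1a_mod)[[2]],
              colour = "red")
p
```

With y as the response, the intercept and slope are on the same scale as the plotted axes, so the line falls within the default plot limits.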
In your original code, because you were running a different model, your line was being charted, but it was outside of your view. If you attempt to chart again with your original model with the following code you will then see your regression line:
ggplot(sim1a, aes(x, y)) +
geom_point(size = 2, colour= "gray") +
geom_abline(intercept = coef(sim1a_mod)[[1]], slope = coef(sim1a_mod)[[2]], colour = "red") +
scale_x_continuous(limits = c(-30, 30)) + scale_y_continuous(limits = c(-30, 30))