Transfer calculated value from stat_smooth to other geom like linerange - r

I have a question about ggplot2.
I want to connect each data point to its OLS fitted value with a vertical line, as in the code listed below.
Can I pass ..y.., the value calculated by stat_smooth, to geom_linerange directly?
I tried stat_smooth(..., geom = "linerange", mapping = aes(ymin = pmin(myy, ..y..), ymax = pmax(myy, ..y..))), but it does not give the result I want.
library(ggplot2)
df <- data.frame(myx = 1:10,
                 myy = c(1:10) * 5 + 2 * rnorm(10, 0, 1))
lm.fit <- lm(myy ~ myx, data = df)
pred <- predict(lm.fit)
ggplot(df, aes(myx, myy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_linerange(mapping = aes(ymin = pmin(myy, pred),
                               ymax = pmax(myy, pred)))

stat_smooth evaluates the values at n evenly spaced points, with n = 80 by default. These points may not coincide with the original x values in your data frame.
Since you are calculating the predicted values anyway, it is probably more straightforward to add them to your data frame and use that data frame as the data source for every geom layer, for example:
df$pred <- pred
ggplot(df, aes(myx, myy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_linerange(aes(ymin = myy, ymax = pred))
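To see why the values computed by stat_smooth cannot simply be lined up with the original points, it can help to inspect what the smooth layer actually computed. The following is a minimal sketch on toy data shaped like the example above; layer_data() is ggplot2's helper for extracting a layer's computed data, and the names toy, p and smooth_data are made up for illustration.

```r
library(ggplot2)

set.seed(1)
# Toy data with the same structure as the question's df (illustrative only)
toy <- data.frame(myx = 1:10)
toy$myy <- toy$myx * 5 + 2 * rnorm(10)

p <- ggplot(toy, aes(myx, myy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is the smooth; its data holds the grid that stat_smooth evaluated
smooth_data <- layer_data(p, 2)

nrow(smooth_data)     # number of grid points (n = 80 unless changed)
head(smooth_data$x)   # evenly spaced x values, most of them not in toy$myx
```

Because the grid has 80 rows and the data has 10, there is no row-by-row correspondence between ..y.. and myy, which is why computing the predictions yourself is the simpler route.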

Related

ggplot confidence interval not filling the whole dataset for my linear mixed model

My ggplot2 code works fine with my other datasets, but I'm stumped as to why it doesn't work for one particular dataset: the filled confidence band stops at x = 0.10 instead of covering the whole range.
For reproducing the problem:
library(nlme)
library(ggeffects)
library(ggplot2)
SurfaceCoverage <- c(0.02,0.04,0.06,0.08,0.1,0.12,0.02,0.04,0.06,0.08,0.1,0.12)
SpecificSurfaceEnergy <- c(18.0052997,15.9636971,14.2951057,13.0263081,13.0816591,13.3825573,2.9267577,2.2889628,1.8909175,1.0083036,0.5683574,0.1681063)
sample <- c(1,1,1,1,1,1,2,2,2,2,2,2)
highW <- data.frame(sample,SurfaceCoverage,SpecificSurfaceEnergy)
highW$sample <- sub("^", "Wettable", highW$sample)
highW$RelativeHumidity <- "High relative humidity"; highW$group <- "Wettable"
highW$sR <- paste(highW$sample,highW$RelativeHumidity)
dfhighW <- data.frame(
"y"=c(highW$SpecificSurfaceEnergy),
"x"=c(highW$SurfaceCoverage),
"b"=c(highW$sample),
"sR"=c(highW$sR)
)
mixed.lme <- lme(y~log(x),random=~1|b,data=dfhighW)
pred.mmhighW <- ggpredict(mixed.lme, terms = c("x"))
(ggplot(pred.mmhighW) +
geom_line(aes(x = x, y = predicted)) + # slope
geom_ribbon(aes(x = x, ymin = predicted - std.error, ymax = predicted + std.error),
fill = "lightgrey", alpha = 0.5) + # error band
geom_point(data = dfhighW, # adding the raw data (scaled values)
aes(x = x, y = y, shape = b)) +
xlim(0.01,0.2) +
ylim(0,30) +
labs(title = "") +
ylab(bquote('Specific Surface Energy ' (mJ/m^2))) +
xlab(bquote('Surface Coverage ' (n/n[m]) )) +
theme_minimal()
)
Can someone advise me how to fix this? Thanks.
The last part of your ribbon has disappeared because you have excluded it from the plot. The lower edge of your ribbon is the following vector:
pred.mmhighW$predicted - pred.mmhighW$std.error
#> [1] 3.91264018 2.37386628 1.47061258 0.82834206 0.32935718 -0.07886245
Note the final value is a small negative number, but you have set the y axis limits with:
ylim(0, 30)
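So anything below zero is treated as out of bounds and dropped, which removes the last section of the ribbon. If you change the limits to
ylim(-2, 30)
the ribbon is drawn across the whole range.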
I don't know whether this has already been answered elsewhere, but coord_cartesian() and scales::squish() are two solutions to this problem.
coord_cartesian() zooms the viewport without discarding any data, unlike ylim()/scale_*_continuous(limits = ...), which by default replace out-of-bounds values with NA before the geoms are drawn.
scales::squish() is suboptimal if you are "squishing" lines and points, not just edgeless polygons (for fills/polygons, squishing and clipping produce the same result).
gg0 <- (ggplot(pred.mmhighW)
+ geom_ribbon(aes(x = x, ymin = predicted - std.error,
ymax = predicted + std.error),
fill = "lightgrey", alpha = 0.5)
+ theme_minimal()
)
## set lower limit to 5 for a more obvious effect
gg0 + coord_cartesian(ylim = c(5, 30))
gg0 + scale_y_continuous(limits = c(5, 30),
## oob = "out of bounds" behaviour
oob = scales::squish)
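The difference between the default out-of-bounds behaviour and squishing can be seen directly on the ribbon's lower edge. This is a small sketch using the values printed in the first answer; censor() and squish() are both functions from the scales package, and the variable names lower, censored and squished are only for illustration.

```r
library(scales)

# Lower edge of the ribbon, as printed in the first answer
lower <- c(3.91264018, 2.37386628, 1.47061258,
           0.82834206, 0.32935718, -0.07886245)

# Default for continuous scales: out-of-bounds values become NA,
# so the ribbon has nothing to draw at the last x value
censored <- censor(lower, range = c(0, 30))

# squish(): out-of-bounds values are clamped to the nearest limit instead
squished <- squish(lower, range = c(0, 30))

censored
squished
```

Only the last element differs: it is NA after censoring and 0 after squishing, which is why the squished ribbon still reaches the right-hand edge of the data.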

Adding a regression trend line and a shaded standard error area to a ggplot for regression models that geom_smooth does not handle

I have a data.frame with observed success/failure outcomes per two groups along with expected probabilities:
library(dplyr)
observed.probability.df <- data.frame(group = c("A","B"), p = c(0.4,0.6))
expected.probability.df <- data.frame(group = c("A","B"), p = qlogis(c(0.45,0.55)))
observed.data.df <- do.call(rbind,lapply(c("A","B"), function(g)
data.frame(group = g, value = c(rep(0,1000*dplyr::filter(observed.probability.df, group != g)$p),rep(1,1000*dplyr::filter(observed.probability.df, group == g)$p)))
)) %>% dplyr::left_join(expected.probability.df)
observed.probability.df$group <- factor(observed.probability.df$group, levels = c("A","B"))
observed.data.df$group <- factor(observed.data.df$group, levels = c("A","B"))
I'm fitting a logistic regression (binomial glm with a logit link function) to these data with the offset term:
fit <- glm(value ~ group + offset(p), data = observed.data.df, family = binomial(link = 'logit'))
Now, I'd like to plot these data as a bar graph using ggplot2's geom_bar, color-coded by group, and to add to that the trend line and shaded standard error area estimated in fit.
I'd use stat_smooth for that, but I don't think it can handle the offset term in its formula, so it looks like I need to assemble this figure in an alternative way.
To get the bars and the trend line I used:
slope.est <- function(x, ests) plogis(ests[1] + ests[2] * x)
library(ggplot2)
ggplot(observed.probability.df, aes(x = group, y = p, fill = group)) +
geom_bar(stat = 'identity') +
stat_function(fun = slope.est,args=list(ests=coef(fit)),size=2,color="black") +
scale_x_discrete(name = NULL,labels = levels(observed.probability.df$group), breaks = sort(unique(observed.probability.df$group))) +
theme_minimal() + theme(legend.title = element_blank()) + ylab("Fraction of cells")
So the question is how to add to that the shaded standard error around the trend line?
Using stat_function I am able to shade the entire area from the upper bound of the standard error all the way down to the X-axis:
ggplot(observed.probability.df, aes(x = group, y = p, fill = group)) +
geom_bar(stat = 'identity') +
stat_function(fun = slope.est,args=list(ests=coef(fit)),size=2,color="black") +
stat_function(fun = slope.est,args=list(ests=summary(fit)$coefficients[,1]+summary(fit)$coefficients[,2]),geom='area',fill="gray",alpha=0.25) +
scale_x_discrete(name = NULL,labels = levels(observed.probability.df$group), breaks = sort(unique(observed.probability.df$group))) +
theme_minimal() + theme(legend.title = element_blank()) + ylab("Fraction of cells")
Which is close but not quite there.
Any idea how to subtract from the shaded area above the part that lies below the lower bound of the standard error? Perhaps geom_ribbon is the way to go here, but I don't know how to combine it with the slope.est function.
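One general way to get a band rather than an area filled down to the axis is to compute both bounds yourself and hand them to geom_ribbon. The sketch below illustrates the idea on made-up data with a numeric predictor, not on the data or the offset model from the question; toy, toy_fit, grid and p_band are hypothetical names. It relies on predict() for glm objects with type = "link" and se.fit = TRUE, then back-transforms with plogis().

```r
library(ggplot2)

set.seed(42)
# Hypothetical toy data: binary outcome depending on a numeric predictor
toy <- data.frame(x = runif(200, -2, 2))
toy$value <- rbinom(200, size = 1, prob = plogis(0.5 + 1.2 * toy$x))

toy_fit <- glm(value ~ x, data = toy, family = binomial(link = "logit"))

# Predict on the link (logit) scale, where +/- 1 SE is symmetric
grid <- data.frame(x = seq(-2, 2, length.out = 50))
pr <- predict(toy_fit, newdata = grid, type = "link", se.fit = TRUE)

# Back-transform the fit and both bounds to the probability scale
grid$fit   <- plogis(pr$fit)
grid$lower <- plogis(pr$fit - pr$se.fit)
grid$upper <- plogis(pr$fit + pr$se.fit)

# The ribbon spans lower..upper only, so nothing is shaded below the band
p_band <- ggplot(grid, aes(x = x)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "gray", alpha = 0.25) +
  geom_line(aes(y = fit))
```

Whether this carries over to the grouped bar chart with an offset term depends on how the predictions are placed on the discrete x axis, which this sketch does not address.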

Obtaining mean intercept and slopes from geom_abline() using summary methods

I want to overlay parameter estimates of group intercept and slope from a Bayesian analysis onto a grouped ggplot scatter-plot of actual data. I can overlay the individual lines just fine but I would really like to get a single mean line for each of the groups as well.
Here is some toy data. Three groups with differing intercepts and slopes
# data
x <- rnorm(120, 0, 1)
y <- c(20 + 3*x[1:40] + rnorm(40,0.01), rnorm(40,0.01), 10 + -3*x[81:120] + rnorm(40,0.01))
group = factor(rep(letters[1:3], each = 40))
df <- data.frame(group,x,y)
# fake parameter estimates of intercept and slope
parsDF <- data.frame(int = c(rnorm(10,20,.5), rnorm(10,0,.5), rnorm(10,10,.5)),
slope = c(rnorm(10,3,.3), rnorm(10,0,.3), rnorm(10,-3,.3)),
group = rep(letters[1:3], each = 10))
Now for the plot
ggplot(df, aes(x,y, colour = group)) +
geom_abline(data = parsDF, aes(intercept = int, slope = slope), colour = "gray75") +
geom_point() +
facet_wrap(~group)
I thought maybe I could add a single mean intercept and slope line for each group via stat.summary-type methods, like so
ggplot(df, aes(x,y, colour = group)) +
geom_abline(data = parsDF, aes(intercept = int, slope = slope), colour = "gray75") +
geom_abline(data = parsDF, aes(intercept = int, slope = slope), stat = "summary", fun.y = "mean", colour = "black", linetype = "dotted") +
geom_point() +
facet_wrap(~group)
But it just ignores those arguments and re-plots the individual lines over the existing ones.
I realise I could just calculate the mean of the intercepts and slopes for each group and brute-force that into the graph somehow, but I can't see how to do that without mucking up the faceting by group, other than by creating another data frame of mean slopes and intercepts and passing that into the plot as well. I also don't want to simply use geom_smooth(), because that would use the actual data rather than my parameter estimates.
Any help much appreciated
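For what it's worth, the fallback mentioned above (a second data frame of per-group means) does not disturb the faceting as long as that data frame keeps the group column. A minimal sketch, assuming the same toy data as in the question with a seed added for reproducibility; meanPars and p_means are names invented here, and aggregate() is base R.

```r
library(ggplot2)

set.seed(1)
# Toy data and fake parameter estimates, built as in the question
x <- rnorm(120, 0, 1)
y <- c(20 + 3 * x[1:40] + rnorm(40, 0.01), rnorm(40, 0.01),
       10 + -3 * x[81:120] + rnorm(40, 0.01))
df <- data.frame(group = factor(rep(letters[1:3], each = 40)), x, y)
parsDF <- data.frame(int   = c(rnorm(10, 20, .5), rnorm(10, 0, .5), rnorm(10, 10, .5)),
                     slope = c(rnorm(10, 3, .3), rnorm(10, 0, .3), rnorm(10, -3, .3)),
                     group = rep(letters[1:3], each = 10))

# One row per group: mean intercept and mean slope
meanPars <- aggregate(cbind(int, slope) ~ group, data = parsDF, FUN = mean)

# meanPars still has a `group` column, so facet_wrap() puts each mean line
# in the matching panel
p_means <- ggplot(df, aes(x, y, colour = group)) +
  geom_abline(data = parsDF, aes(intercept = int, slope = slope), colour = "gray75") +
  geom_abline(data = meanPars, aes(intercept = int, slope = slope),
              colour = "black", linetype = "dotted") +
  geom_point() +
  facet_wrap(~group)
```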

How to combine stat_ecdf with geom_ribbon?

I am trying to draw an ECDF of some data with a "confidence interval" represented via a shaded region using ggplot2. I am having trouble combining geom_ribbon() with stat_ecdf() to achieve the effect I am after.
Consider the following example data:
set.seed(1)
dat <- data.frame(variable = rlnorm(100) + 2)
dat <- transform(dat, lower = variable - 2, upper = variable + 2)
> head(dat)
  variable     lower    upper
1 2.534484 0.5344838 4.534484
2 3.201587 1.2015872 5.201587
3 2.433602 0.4336018 4.433602
4 6.929713 4.9297132 8.929713
5 3.390284 1.3902836 5.390284
6 2.440225 0.4402254 4.440225
I am able to produce an ECDF of variable using
library("ggplot2")
ggplot(dat, aes(x = variable)) +
geom_step(stat = "ecdf")
However I am unable to use lower and upper as the ymin and ymax aesthetics of geom_ribbon() to superimpose the confidence interval on the plot as another layer. I have tried:
ggplot(dat, aes(x = variable)) +
geom_ribbon(aes(ymin = lower, ymax = upper), stat = "ecdf") +
geom_step(stat = "ecdf")
but this raises the following error
Error: geom_ribbon requires the following missing aesthetics: ymin, ymax
Is there a way to coax geom_ribbon() into working with stat_ecdf() to produce a shaded confidence interval? Or, can anyone suggest an alternative means of adding a shaded polygon defined by lower and upper as a layer to the ECDF plot?
Try this (a bit of a shot in the dark):
ggplot(dat, aes(x = variable)) +
geom_ribbon(aes(x = variable,ymin = ..y..-2,ymax = ..y..+2), stat = "ecdf",alpha=0.2) +
geom_step(stat = "ecdf")
OK, so that's not the same thing as what you are trying to do, but it should explain what's going on. The stat returns a data frame with just the original x and the computed y, so I think that's all you have to work with; i.e., stat_ecdf only computes the cumulative distribution function for a single x at a time.
The only other thing I can think of is the obvious, calculating the lower and upper separately, something like this:
l <- ecdf(dat$lower)
u <- ecdf(dat$upper)
v <- ecdf(dat$variable)
dat$lower1 <- l(dat$variable)
dat$upper1 <- u(dat$variable)
dat$variable1 <- v(dat$variable)
ggplot(dat,aes(x = variable)) +
geom_step(aes(y = variable1)) +
geom_ribbon(aes(ymin = upper1,ymax = lower1),alpha = 0.2)
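The calls above rely on the fact that base R's ecdf() returns a function rather than a vector, which is what lets each curve be evaluated at the same x positions. A small sketch using the example data from the question; only v is taken from the answer, and the comments describe what the returned function computes.

```r
set.seed(1)
dat <- data.frame(variable = rlnorm(100) + 2)

# ecdf() returns a step function; calling it with a value q gives the
# fraction of observations that are <= q
v <- ecdf(dat$variable)

v(min(dat$variable))   # fraction at the smallest observation
v(max(dat$variable))   # fraction at the largest observation
v(0)                   # below every observation, since all values exceed 2
```

Evaluating ecdf(dat$lower) and ecdf(dat$upper) at dat$variable, as in the answer, therefore yields two step curves on the same x grid that can serve as ymin and ymax.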
I'm not sure exactly how you want to reflect the CI, but ggplot_build() lets you get the generated data back from the plot; you can then overplot whatever you like.
This chart shows:
red = original ribbon
blue = takes the original CI vectors and applies to the ecdf curve
green = calculates the ecdf of upper and lower series and plots
g<-ggplot(dat, aes(x = variable)) +
geom_step(stat = "ecdf") +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha=0.5, fill="red")
inside<-ggplot_build(g)
matched<-merge(inside$data[[1]],data.frame(x=dat$variable,dat$lower,dat$upper),by=("x"))
g +
geom_ribbon(data=matched, aes(x = x,
ymin = y + dat.upper-x,
ymax = y - x + dat.lower),
alpha=0.5, fill="blue") +
geom_ribbon(data=matched, aes(x = x,
ymin = ecdf(dat.lower)(x),
ymax = ecdf(dat.upper)(x)),
alpha=0.5, fill="green")

Most succinct way to label/annotate extreme values with ggplot?

I'd like to annotate all y-values greater than a y-threshold using ggplot2.
When you call plot(lm(y~x)) using the base package, the second graph that pops up automatically is Residuals vs Fitted, the third is the Q-Q plot, and the fourth is Scale-Location. Each of these automatically labels your extreme Y values by listing their corresponding X value as an adjacent annotation. I'm looking for something like this.
What's the best way to achieve this base-default behavior using ggplot2?
Updated to use scale_size_area() in place of scale_area().
You might be able to take something from this to suit your needs.
library(ggplot2)
#Some data
df <- data.frame(x = round(runif(100), 2), y = round(runif(100), 2))
m1 <- lm(y ~ x, data = df)
df.fortified = fortify(m1)
names(df.fortified) # Names of the variables containing residuals and derived quantities
# Select extreme values
df.fortified$extreme = ifelse(abs(df.fortified$`.stdresid`) > 1.5, 1, 0)
# Based on examples on page 173 in Wickham's ggplot2 book
plot = ggplot(data = df.fortified, aes(x = x, y = .stdresid)) +
geom_point() +
geom_text(data = df.fortified[df.fortified$extreme == 1, ],
aes(label = x, x = x, y = .stdresid), size = 3, hjust = -.3)
plot
plot1 = ggplot(data = df.fortified, aes(x = .fitted, y = .resid)) +
  geom_point() + geom_smooth(se = FALSE)
plot2 = ggplot(data = df.fortified, aes(x = .fitted, y = .resid, size = .cooksd)) +
  geom_point() + scale_size_area("Cook's distance") +
  # show.legend replaces the older show_guide argument
  geom_smooth(se = FALSE, show.legend = FALSE)
library(gridExtra)
grid.arrange(plot1, plot2)
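If the goal is simply to label every point whose y value exceeds a fixed cutoff, the same subsetting idea works without fitting a model. A minimal sketch on made-up data; toy, y_threshold, extremes and p_labels are hypothetical names, and the cutoff of 1.5 is arbitrary.

```r
library(ggplot2)

set.seed(1)
# Hypothetical data: 50 points with standard-normal y values
toy <- data.frame(x = 1:50, y = rnorm(50))

y_threshold <- 1.5                       # arbitrary cutoff for "extreme"
extremes <- toy[toy$y > y_threshold, ]   # rows to annotate

# All points are drawn; only the rows in `extremes` get a text label
p_labels <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_text(data = extremes, aes(label = x), size = 3, hjust = -0.3)
```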
