Adding line plot with boxplot - r

Sample data
set.seed(123)
par(mfrow = c(1,2))
dat <- data.frame(years = rep(1980:2014, each = 8), x = sample(1000:2000, 35*8 ,replace = T))
boxplot(dat$x ~ dat$years, ylim = c(500, 4000))
I have another dataset that has a single value for some selected years:
ref.dat <- data.frame(years = c(1991:1995, 2001:2008), x = sample(1000:2000, 13, replace = T))
plot(ref.dat$years, ref.dat$x, type = "b")
How can I add the line plot on top of the boxplot?

With ggplot2 you could do this:
library(ggplot2)

ggplot(dat, aes(x = years, y = x)) +
  geom_boxplot(aes(group = years)) +
  geom_line(data = ref.dat, colour = "red") +
  geom_point(data = ref.dat, colour = "red", shape = 1) +
  coord_cartesian(ylim = c(500, 4000)) +
  theme_bw()

The trick here is to figure out the x-axis on the boxplot. You have 35 boxes and they are plotted at the x-coordinates 1, 2, 3, ..., 35 - i.e. year - 1979. With that, you can add the line with lines as usual.
set.seed(123)
dat <- data.frame(years = rep(1980:2014, each = 8),
                  x = sample(1000:2000, 35 * 8, replace = TRUE))
boxplot(dat$x ~ dat$years, ylim = c(500, 2500))
ref.dat <- data.frame(years = c(1991:1995, 2001:2008),
                      x = sample(1000:2000, 13, replace = TRUE))
# Box i sits at x = i, so shift the years by 1979 to get the box positions
lines(ref.dat$years - 1979, ref.dat$x, type = "b", pch = 20)
The points were a bit hard to see, so I changed the point style to pch = 20. I also used a smaller range on the y-axis to leave less blank space.
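The year - 1979 offset assumes one box per consecutive year. As a minimal sketch with the same sample data, the positions can instead be looked up from the labels that boxplot returns, which should also hold if some years are missing from dat. The name pos is introduced here for illustration only.

```r
set.seed(123)
dat <- data.frame(years = rep(1980:2014, each = 8),
                  x = sample(1000:2000, 35 * 8, replace = TRUE))
ref.dat <- data.frame(years = c(1991:1995, 2001:2008),
                      x = sample(1000:2000, 13, replace = TRUE))

# boxplot() invisibly returns its statistics; $names holds the box labels
bp <- boxplot(x ~ years, data = dat, ylim = c(500, 2500))

# Position of each reference year among the boxes (box i is drawn at x = i)
pos <- match(as.character(ref.dat$years), bp$names)

lines(pos, ref.dat$x, type = "b", pch = 20)
```

With consecutive years this gives the same positions as subtracting 1979, but it does not depend on knowing the first year in advance.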

Related

Heatmap using geom_tile in a loop/function then save the output figures

I would like to make heatmaps using the following data:
dt <- data.frame(
h = rep(LETTERS[1:7], 7),
j = c(rep("A", 7), rep("B", 7), rep("C", 7), rep("D", 7), rep("E", 7), rep("F", 7), rep("G", 7)),
Red = runif(7, 0, 1),
Yellow = runif(7, 0, 1),
Green = runif(7, 0, 1),
Blue = runif(7, 0, 1),
Black = runif(7, 0, 1)
)
For each of the heatmaps, the x and y axes stay as the first 2 columns of dt. The values that fill each heatmap come from one of the remaining columns, e.g., Red, Yellow, ...
I borrowed this example to produce the following code:
loop = function(df, x_var, y_var, f_var) {
ggplot(df, aes(x = .data[[x_var]], y = .data[[y_var]], fill = .data[[f_var]])) +
geom_tile(color = "black") +
scale_fill_gradient(low = "white", high = "blue") +
geom_text(aes(label = .data[[f_var]]), color = "black", size = 4) +
coord_fixed() +
theme_minimal() +
labs(x = "",
y = "",
fill = "R", # Want the legend title to be each of the column names that are looped
title = .data[[f_var]])
ggsave(a, file = paste0("heatmap_", f_var,".png"), device = png, width = 15, height = 15, units = "cm")
}
plot_list <- colnames(dt)[-1] %>%
map( ~ loop(df = dt,
x_var = colnames(dt)[1],
y_var = colnames(dt)[2],
f_var = .x))
# view all plots individually (not shown)
plot_list
Problems I encountered when I ran this chunk of code:
Error: Discrete value supplied to continuous scale
The ggsave step didn't work. I would like to save each plot under the name of the corresponding column.
There are a few minor issues with your code. First, you get the error because the loop includes the second column of your dataset, which is categorical (i.e. discrete) and cannot be mapped to a continuous fill scale; drop both axis columns with colnames(dt)[-c(1, 2)]. Second, title = .data[[f_var]] will not work; simply use title = f_var to add the variable name as the title. Finally, ggsave tries to save an object called a that is never defined in your code, so you have to assign your plot to a variable a. Because ggsave is then the last call in the function, I also added return(a) so that the function still returns the plot:
set.seed(123)
library(ggplot2)
library(purrr)
loop = function(df, x_var, y_var, f_var) {
a <- ggplot(df, aes(x = .data[[x_var]], y = .data[[y_var]], fill = .data[[f_var]])) +
geom_tile(color = "black") +
scale_fill_gradient(low = "white", high = "blue") +
geom_text(aes(label = .data[[f_var]]), color = "black", size = 4) +
coord_fixed() +
theme_minimal() +
labs(x = "",
y = "",
fill = f_var, # legend title is the name of the looped column
title = f_var)
ggsave(filename = paste0("heatmap_", f_var, ".png"), plot = a, device = "png", width = 15, height = 15, units = "cm")
return(a)
}
plot_list <- colnames(dt)[-c(1, 2)] %>%
map( ~ loop(df = dt,
x_var = colnames(dt)[1],
y_var = colnames(dt)[2],
f_var = .x))
# view all plots individually (not shown)
plot_list[c(1, 5)]
#> [[1]]
#>
#> [[2]]
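The discrete-scale error comes from looping over a column that cannot be mapped to a continuous fill scale. As a small base-R sketch (fill_vars and out_files are names made up for illustration, and the data is a stand-in with the same layout), the fill columns can be picked by type rather than by position, and the file names built from them:

```r
set.seed(123)
dt <- data.frame(
  h = rep(LETTERS[1:7], 7),
  j = rep(LETTERS[1:7], each = 7),
  Red = runif(49),
  Yellow = runif(49),
  Green = runif(49),
  Blue = runif(49),
  Black = runif(49)
)

# Keep only the numeric columns; h and j are the axes and are skipped
fill_vars <- names(dt)[vapply(dt, is.numeric, logical(1))]

# One output file per fill column
out_files <- paste0("heatmap_", fill_vars, ".png")

fill_vars
out_files
```

Passing fill_vars to map() in place of colnames(dt)[-c(1, 2)] would keep the loop working if the axis columns were moved or more label columns were added.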

Combined scatter and line ggplot with proper legend

I am looking for a clean approach to combined scatter and line plots with ggplot2 that have an appropriate legend. The following works in principle, but produces warnings:
library("ggplot2")
library("dplyr")
## 2 data sets, one for the lines, one for the points
tbl <- tibble(
f = rep(letters[1:2], each = 10),
x = rep(1:10, 2),
y = c(1e-4 * exp(1:10), log(1:10))
)
obs <- tibble(
f = rep("c", 5),
x = seq(2, 10, 2),
y = log(seq(2, 10, 2)) + rnorm(5, sd = 0.1)
)
rbind(tbl, obs) %>%
ggplot(aes(x, y, color = f, linetype = f)) +
geom_line(show.legend = TRUE) +
geom_point(show.legend = TRUE, aes(shape = f), size = 3) +
scale_linetype_manual(values=c("solid", "solid", "blank")) +
scale_shape_manual(values=c(NA, NA, 16))
I would like to get rid of the warnings and write something like:
scale_shape_manual(values=c("none", "none", "circle"))
Is there already a "none" or "empty" shape code? Several past answers have been suggested on SO, but I wonder if there is a recent canonical way.

ggplot color lines with consistent scale adding new color for each new line

I'm using ggplot to plot some data:
## sample data
dat = data.frame(group = rep(letters[1:5], 10),
idx = rep(1:length(letters[1:5]), each = 10))
dat$value = cumsum(cumsum(sample(c(-1, 1), nrow(dat), TRUE)))
ggplot(dat) +
geom_path(aes(x = idx, y = value, color = group, group = group)) +
viridis::scale_color_viridis(option = 'magma', discrete = T)
## add more groups
dat = data.frame(group = rep(letters[1:10], 10),
idx = rep(1:length(letters[1:10]), each = 10))
dat$value = cumsum(cumsum(sample(c(-1, 1), nrow(dat), TRUE)))
## replot
ggplot(dat) +
geom_path(aes(x = idx, y = value, color = group, group = group)) +
viridis::scale_color_viridis(option = 'magma', discrete = T)
My issue is that the minimum and maximum colors are the same in both plots, and the colors in between are adjusted to fit.
Is there any way to use this color scale (or a similar one) but always have, say, the second color be the same, i.e. so that the first five colors are the same in both graphs?

Producing a "fuzzy" RD plot with ggplot2

My question is similar to this but the answers there will not work for me. Basically, I'm trying to produce a regression discontinuity plot with a "fuzzy" design that uses all the data for the treatment and control groups, but only plots the regression line within the "range" of the treatment and control groups.
Below, I've simulated some data and produced the fuzzy RD plot with base graphics. I'm hoping to replicate this plot with ggplot2. Note that the most important part of this is that the light blue regression line is fit using all the blue points, while the peach colored regression line is fit using all the red points, despite only being plotted over the ranges in which individuals were intended to receive treatment. That's the part I'm having a hard time replicating in ggplot.
I'd like to move to ggplot because I'd like to use faceting to produce this same plot across various units in which participants were nested. In the code below, I show a non-example using geom_smooth. When there's no fuzziness within a group, it works fine, but otherwise it fails. If I could get geom_smooth to be limited to only specific ranges, I think I'd be set. Any and all help is appreciated.
Simulate data
library(MASS)
mu <- c(0, 0)
sigma <- matrix(c(1, 0.7, 0.7, 1), ncol = 2)
set.seed(100)
d <- as.data.frame(mvrnorm(1e3, mu, sigma))
# Create treatment variable
d$treat <- ifelse(d$V1 <= 0, 1, 0)
# Introduce fuzziness
d$treat[d$treat == 1][sample(100)] <- 0
d$treat[d$treat == 0][sample(100)] <- 1
# Treatment effect
d$V2[d$treat == 1] <- d$V2[d$treat == 1] + 0.5
# Add grouping factor
d$group <- gl(9, 1e3/9)
Produce regression discontinuity plot with base
library(RColorBrewer)
pal <- brewer.pal(5, "RdBu")
color <- d$treat
color[color == 0] <- pal[1]
color[color == 1] <- pal[5]
plot(V2 ~ V1,
data = d,
col = color,
bty = "n")
abline(v = 0, col = "gray", lwd = 3, lty = 2)
# Fit model
m <- lm(V2 ~ V1 + treat, data = d)
# predicted achievement for treatment group
pred_treat <- predict(m,
newdata = data.frame(V1 = seq(-3, 0, 0.1),
treat = 1))
# predicted achievement for control group
pred_no_treat <- predict(m,
newdata = data.frame(V1 = seq(0, 4, 0.1),
treat = 0))
# Add predicted achievement lines
lines(seq(-3, 0, 0.1), pred_treat, col = pal[4], lwd = 3)
lines(seq(0, 4, 0.1), pred_no_treat, col = pal[2], lwd = 3)
# Add legend
legend("bottomright",
legend = c("Treatment", "Control"),
lty = 1,
lwd = 2,
col = c(pal[4], pal[2]),
box.lwd = 0)
Non-example with ggplot
d$treat <- factor(d$treat, labels = c("Control", "Treatment"))
library(ggplot2)
ggplot(d, aes(V1, V2, group = treat)) +
geom_point(aes(color = treat)) +
geom_smooth(method = "lm", aes(color = treat)) +
facet_wrap(~group)
Notice the regression lines extending past the treatment range for groups 1 and 2.
There's probably a more graceful way to make the lines with geom_smooth, but it can be hacked together with geom_segment. Munge the data.frames outside of the plotting call if you like.
ggplot(d, aes(x = V1, y = V2, color = factor(treat, labels = c('Control', 'Treatment')))) +
geom_point(shape = 21) +
scale_color_brewer(NULL, type = 'qual', palette = 6) +
geom_vline(aes(xintercept = 0), color = 'grey', size = 1, linetype = 'dashed') +
geom_segment(data = data.frame(t(predict(m, data.frame(V1 = c(-3, 0), treat = 1)))),
aes(x = -3, xend = 0, y = X1, yend = X2), color = pal[4], size = 1) +
geom_segment(data = data.frame(t(predict(m, data.frame(V1 = c(0, 4), treat = 0)))),
aes(x = 0, xend = 4, y = X1, yend = X2), color = pal[2], size = 1)
Another option is geom_path:
df <- data.frame(V1 = c(-3, 0, 0, 4), treat = c(1, 1, 0, 0))
df <- cbind(df, V2 = predict(m, df))
ggplot(d, aes(x = V1, y = V2, color = factor(treat, labels = c('Control', 'Treatment')))) +
geom_point(shape = 21) +
geom_vline(aes(xintercept = 0), color = 'grey', size = 1, linetype = 'dashed') +
scale_color_brewer(NULL, type = 'qual', palette = 6) +
geom_path(data = df, size = 1)
For the edit with facets, if I understand what you want correctly, you can fit a model for each group with lapply and predict for each group. Here I recombine the results with dplyr::bind_rows instead of do.call(rbind, ...) so that the .id parameter inserts the group number from the list element name, though there are other ways to do the same thing.
df <- data.frame(V1 = c(-3, 0, 0, 4), treat = c('Treatment', 'Treatment', 'Control', 'Control'))
m_list <- lapply(split(d, d$group), function(x){lm(V2 ~ V1 + treat, data = x)})
df <- dplyr::bind_rows(lapply(m_list, function(x){cbind(df, V2 = predict(x, df))}), .id = 'group')
ggplot(d, aes(x = V1, y = V2, color = treat)) +
geom_point(shape = 21) +
geom_vline(aes(xintercept = 0), color = 'grey', size = 1, linetype = 'dashed') +
geom_path(data = df, size = 1) +
scale_color_brewer(NULL, type = 'qual', palette = 6) +
facet_wrap(~group)
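For comparison, the do.call(rbind, ...) route mentioned above can be sketched in base R only. This is an illustration under simplified assumptions: the data is a stand-in simulated with rnorm rather than MASS::mvrnorm, and the names ends, fits and pred are introduced here.

```r
set.seed(100)
n <- 900
d <- data.frame(V1 = rnorm(n))

# Intended treatment below the cutoff, with 100 assignments flipped (fuzziness)
received <- d$V1 <= 0
flip <- sample(n, 100)
received[flip] <- !received[flip]

d$treat <- factor(ifelse(received, "Treatment", "Control"),
                  levels = c("Control", "Treatment"))
d$V2 <- 0.7 * d$V1 + 0.5 * received + rnorm(n, sd = 0.7)
d$group <- gl(9, n / 9)

# End points of the two line segments to draw in every facet
ends <- data.frame(V1 = c(-3, 0, 0, 4),
                   treat = factor(c("Treatment", "Treatment", "Control", "Control"),
                                  levels = levels(d$treat)))

# One model per group, each fit on all rows of that group
fits <- lapply(split(d, d$group), function(g) lm(V2 ~ V1 + treat, data = g))

# Predict at the end points and tag each block with its group name
pred <- do.call(rbind, lapply(names(fits), function(id) {
  cbind(group = id, ends, V2 = predict(fits[[id]], newdata = ends))
}))
pred$group <- factor(pred$group, levels = levels(d$group))

head(pred)
```

pred has four rows per group and can be passed to geom_path(data = pred, ...) together with facet_wrap(~group), in the same way as df in the code above.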

ggplot2 guide/legend on shape

I have a plot from the following script.
require(ggplot2)
df.shape <- data.frame(
AX = runif(10),
AY = runif(10),
BX = runif(10, 2, 3),
BY = runif(10, 2, 3)
)
p <- ggplot(df.shape)
p <- p + geom_point(aes(x = AX, y = AY, shape = 15)) +
geom_point(aes(x = BX, y = BY, shape = 19)) +
scale_shape_identity() +
guides(shape = guide_legend(override.aes = list(shape = 15, shape = 19)) )
print(p)
This doesn't produce a legend, describing which shape is "A" and which shape is "B". Note that the squares and circles may be close to one another, so I can't generally define the variable based on location. How do I display a "shape" legend?
I would reshape my data in the long format using reshape:
dt <- reshape(df.shape, direction = "long",
              varying = list(c(1, 3), c(2, 4)),
              v.names = c("X", "Y"), times = c("A", "B"))
Then I plot it, mapping shape to time so that ggplot2 draws the legend:
ggplot(dt) +
geom_point(aes(x = X, y = Y, shape = time),size=5) +
scale_shape_manual(values=c(15,19))
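To make the long format concrete, the same table can be built by hand by stacking the A and B columns. This is only an illustrative base-R sketch; long_manual and long_reshape are names introduced here, and the seed is arbitrary.

```r
set.seed(1)
df.shape <- data.frame(
  AX = runif(10),
  AY = runif(10),
  BX = runif(10, 2, 3),
  BY = runif(10, 2, 3)
)

# Stack the two point sets and label each row with the set it came from
long_manual <- rbind(
  data.frame(time = "A", X = df.shape$AX, Y = df.shape$AY),
  data.frame(time = "B", X = df.shape$BX, Y = df.shape$BY)
)

# The reshape() call from the answer, written with column names
long_reshape <- reshape(df.shape, direction = "long",
                        varying = list(c("AX", "BX"), c("AY", "BY")),
                        v.names = c("X", "Y"), times = c("A", "B"))

# Both hold the same 20 points, one row per point
all.equal(long_manual$X, long_reshape$X)
all.equal(long_manual$Y, long_reshape$Y)
```

Either table has a single column (time) that identifies the set, which is what lets aes(shape = time) produce a legend entry for A and for B.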

Resources