Loop violin plots of controls with clinical case data points overlaid (ggplot) - r

New to posting on here. Apologies if I miss including something needed to solve my situation.
I have a matched case-control design where three 'younger' clinical cases have been age-matched to a 'younger' control group, and three 'older' cases have been matched to an 'older' control group. I am plotting the control group distribution in a violin plot and overlaying the corresponding matched cases as data points.
I have a lot of variables and I would like to loop through them to minimise error and increase efficiency. I have had a go at writing the code for the loop but I am not sure what to do with the fact that I have two types of plots (violin and point) and two data frames (controls and cases) involved.
Here is the code I have for the plots:
library(ggplot2)
#fake data
cases <- data.frame(
id = factor(1:6),
strange_stories_ToM_mean = sample(6:8, 6, replace = TRUE),
age = factor(c(rep("young", 3), rep("old", 3)))
)
controls <- data.frame(
id = 7:23,
strange_stories_ToM_mean = sample(c(6,6,7,7,7,7,7,7,7,8,8,8,9,9,9,9,9), 17),
age = c(rep("young", 9), rep("old", 8))
)
#plots
ggplot(data = controls, aes(strange_stories_ToM_mean, age)) +
geom_violin(
trim = FALSE,
alpha = 0.2,
draw_quantiles = c(0.25, 0.5, 0.75),
fill = "gray90"
) +
geom_point(
data = cases,
aes(colour = id, shape = id), # map color/shape to individual cases
size = 5,
show.legend = FALSE
) +
scale_shape_manual(values=c(16, 17, 15, 16, 17, 15)) +
scale_colour_manual(values=c("deeppink1","indianred3", "blueviolet", "springgreen3", "chartreuse2", "darkgreen")) +
scale_size_manual(values=c(5, 4, 5, 5, 4, 5)) +
theme_classic()
ggsave("strange_stories_ToM_mean.svg", width = 8, height = 8, units = "cm")
I looked at using 'for' and created a list to loop through (what I have is below) but I came unstuck at where the list should be incorporated when two data frames are being used and two plots...could lapply be best?
variables <- list() # Create empty list
for (i in seq_len(ncol(FTD_data))) { # Using for-loop to add all columns to list
variables[[i]] <- FTD_data[ , i]
}
names(variables) <- colnames(FTD_data) #rename list elements with variable names from df
for (i in variables) {
# CODE TO PLOT INSERT HERE
}

One approach to achieve your desired result would be to put your plotting code inside a function which takes one argument, the name of the column to plot. The only change I made to your plotting code is to replace the hardcoded strange_stories_ToM_mean by .data[[col]] to tell ggplot I want to plot the data column whose name is stored in col.
Also, instead of a for loop I would recommend using lapply when working with ggplot2:
library(ggplot2)
plot_fun <- function(col) {
ggplot(data = controls, aes(.data[[col]], age)) +
geom_violin(
trim = FALSE,
alpha = 0.2,
draw_quantiles = c(0.25, 0.5, 0.75),
fill = "gray90"
) +
geom_point(
data = cases,
aes(colour = id, shape = id),
size = 5,
show.legend = FALSE
) +
scale_shape_manual(values=c(16, 17, 15, 16, 17, 15)) +
scale_colour_manual(values=c("deeppink1","indianred3", "blueviolet", "springgreen3", "chartreuse2", "darkgreen")) +
scale_size_manual(values=c(5, 4, 5, 5, 4, 5)) +
theme_classic()
}
cols_to_plot <- names(controls)[!names(controls) %in% c("id", "age")]
names(cols_to_plot) <- cols_to_plot
p <- lapply(cols_to_plot, plot_fun)
lapply(cols_to_plot, function(x) ggsave(paste0(x, ".svg"), plot = p[[x]], width = 8, height = 8, units = "cm"))
#> $strange_stories_ToM_mean
#> [1] "strange_stories_ToM_mean.svg"
#>
#> $strange_stories_ToM_median
#> [1] "strange_stories_ToM_median.svg"
p
#> $strange_stories_ToM_mean
#>
#> $strange_stories_ToM_median
DATA
set.seed(123)
cases <- data.frame(
id = factor(1:6),
strange_stories_ToM_mean = sample(6:8, 6, replace = TRUE),
strange_stories_ToM_median = sample(6:8, 6, replace = TRUE),
age = factor(c(rep("young", 3), rep("old", 3)))
)
controls <- data.frame(
id = 7:23,
strange_stories_ToM_mean = sample(c(6,6,7,7,7,7,7,7,7,8,8,8,9,9,9,9,9), 17),
strange_stories_ToM_median = sample(c(6,6,7,7,7,7,7,7,7,8,8,8,9,9,9,9,9), 17),
age = c(rep("young", 9), rep("old", 8))
)
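The .data[[col]] idiom used in plot_fun can be tried in isolation. Below is a minimal sketch, assuming the built-in mtcars data set and a hypothetical helper plot_col (neither is part of the original answer). It shows that a column name held in a string can be mapped inside aes(), and that lapply over a named vector returns a named list of plots:

```r
library(ggplot2)

# Hypothetical helper: scatter any column of mtcars (given by name) against wt
plot_col <- function(col) {
  ggplot(mtcars, aes(x = .data[[col]], y = wt)) +
    geom_point() +
    labs(x = col)  # label the axis with the column name rather than ".data[[col]]"
}

cols <- c("mpg", "hp")
# Naming the vector first means the resulting list is indexed by column name
plots <- lapply(setNames(cols, cols), plot_col)
names(plots)
```

Because the list is named, plots[["mpg"]] retrieves a plot by variable name, which is what makes the ggsave call in the answer work with p[[x]].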

Related

automatically adding brackets or braces to ggplot and plotly objects

I want to automate adding brackets/braces to a ggplot object and then convert it to plotly using ggplotly.
library(tidyverse)
library(plotly)
#devtools::install_github("NicolasH2/ggbrace")
library(ggbrace)
set.seed(10)
mydata <- data.frame(xx = c(1:10), yy = sample(0:10, 10))
my_bracket_data <- data.frame(start = c(1, 4, 6),
end = c(3, 5, 9),
info = c("first", "second", "third"),
y_bottom = rep(11, 3),
y_top = rep(12, 3))
I can do this by using geom_brace but it involves manually typing it out for each brace rather than using the data that is already stored in a data.frame:
plot_geom_brace <- ggplot(data = mydata, aes(xx, yy)) +
geom_line(size = 1.5) +
geom_brace(aes(x = c(1, 3), y = c(11, 12), label = "first"), inherit.data = F, labelsize = 5) +
geom_brace(aes(x = c(4, 5), y = c(11, 12), label = "second"), inherit.data = F, labelsize = 5) +
geom_brace(aes(x = c(6, 9), y = c(11, 12), label = "third"), inherit.data = F, labelsize = 5)
plot_geom_brace
ggplotly(plot_geom_brace)
Is there a way that I can do this without repeatedly writing geom_brace layers for each brace (and instead access the data from my_bracket_data directly)?
As an aside this can be automated using geom_bracket but this is not supported by plotly yet.
library(ggpubr)
plot_geom_bracket <- ggplot(data = mydata, aes(xx, yy)) +
geom_line(size = 1.5) +
geom_bracket(xmin = my_bracket_data$start,
xmax = my_bracket_data$end,
y.position = rep(11, 3),
label = my_bracket_data$info,
inherit.aes = FALSE)
plot_geom_bracket
ggplotly(plot_geom_bracket)
# Warning message:
# In geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]) :
# geom_GeomBracket() has yet to be implemented in plotly.
# If you'd like to see this geom implemented,
# Please open an issue with your example code at
# https://github.com/ropensci/plotly/issues
Any suggestions?
Thanks
One option is to create two data frames from your bracket data. The first, for geom_brace, reshapes the data to a longer format with pivot_longer so that all three braces are drawn from a single layer through aes. The second, for the labels, is a small summary table with the x and y positions per group of info. Here is some reproducible code:
library(tidyverse)
library(plotly)
#devtools::install_github("NicolasH2/ggbrace")
library(ggbrace)
set.seed(10)
mydata <- data.frame(xx = c(1:10), yy = sample(0:10, 10))
my_bracket_data <- data.frame(start = c(1, 4, 6),
end = c(3, 5, 9),
info = c("first", "second", "third"),
y_bottom = rep(11, 3),
y_top = rep(12, 3))
# Data for geom_brace
my_bracket_data_long <- my_bracket_data %>%
pivot_longer(cols = c(start, end), values_to = "x_value", names_to = "x_names") %>%
pivot_longer(cols = c(y_bottom, y_top), values_to = "y_value", names_to = "y_names")
# data for labels braces
my_bracket_data_labels <- my_bracket_data_long %>%
group_by(info) %>%
summarise(x_pos = mean(x_value),
y_pos = mean(y_value) + 1)
# plot
plot_geom_brace <- ggplot(data = mydata, aes(xx, yy)) +
geom_line(size = 1.5) +
geom_brace(data = my_bracket_data_long, aes(x = x_value, y = y_value, group = info)) +
geom_text(data = my_bracket_data_labels, aes(x = x_pos, y = y_pos, group = info, label = info))
ggplotly(plot_geom_brace)
Created on 2023-01-07 with reprex v2.0.2
Special ggplot2 extension libraries like ggpubr usually don't play well with conversion to plotly objects.
If you don't want to type out each geom_brace call, you could loop over the rows of the data frame, build each geom call as a string using paste, and add it to the existing plot object using eval:
m <- data.frame(s = c(1, 4, 6), ## = my_bracket_data
e = c(3, 5, 9),
i = c("first", "second", "third"),
y_b = rep(11, 3),
y_t = rep(12, 3))
p <- ggplot(data = mydata, aes(xx, yy)) + geom_line(size = 1.5)
for (i in seq_len(nrow(m))) {
# build the geom_brace call as text; the label needs its own quotes inside the string
input <- paste0('geom_brace(aes(x = c(', m$s[i], ', ', m$e[i], '), y = c(', m$y_b[i], ', ',
m$y_t[i], '), label = "', m$i[i], '"), inherit.data = F, labelsize = 5)')
p <- p + eval(parse(text = input))
}
p
However, this is more of a hacky solution. That is what R code tends to become when you mix different styles, for example apply functions with tidyr syntax (or in this case ggplot, which could be seen as an ancestor of tidyr), with more programming-style approaches (for, while, functions; yes, you can program in R), and then also want everything converted automatically to JavaScript (i.e. plotly). It's a beautiful mess.
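A way to avoid eval(parse()) altogether is to build a list of layers with lapply and add the whole list to the plot, since ggplot2 accepts a list on the right-hand side of +. The sketch below is only an illustration of that pattern: it uses annotate("segment") as a stand-in for geom_brace, and the yy values are made up. Whether the result converts cleanly with ggplotly when geom_brace is substituted in is an assumption that would need checking.

```r
library(ggplot2)

mydata <- data.frame(xx = 1:10, yy = c(3, 7, 1, 9, 4, 6, 2, 8, 5, 10))
my_bracket_data <- data.frame(start = c(1, 4, 6),
                              end = c(3, 5, 9),
                              info = c("first", "second", "third"),
                              y = 11)

# One layer per row of the bracket data; annotate() stands in for geom_brace()
bracket_layers <- lapply(seq_len(nrow(my_bracket_data)), function(i) {
  row <- my_bracket_data[i, ]
  annotate("segment", x = row$start, xend = row$end, y = row$y, yend = row$y)
})

# Adding a list of layers appends each element in order
p <- ggplot(mydata, aes(xx, yy)) +
  geom_line() +
  bracket_layers
length(p$layers)
```

The plot ends up with one line layer plus one layer per bracket row, without any code being assembled as text.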

Sparse Functional Data Plot

I'm wondering how to reproduce the following figure using R.
The data used in the figure are sparse functional data of bone mineral density. Basically each participant's bone mineral level is observed a few times during the experiment. But the observation times and number of observations for each participant are different.
The figure is from article 'Principal component models for sparse functional data'.
You could reproduce the figure with made-up data like this:
library(ggplot2)
# Create sample data
set.seed(8) # Makes data reproducible
ages <- runif(40, 8, 24)
df <- do.call(rbind, lapply(seq_along(ages), function(x) {
age <- ages[x] + cumsum(runif(sample(2:5, 1), 1, 2))
y <- (tanh((age - 10)/pi - pi/2) + 2.5)/3
y <- y + rnorm(1, 0, 0.1)
y <- y + cumsum(rnorm(length(y), 0, 0.02))
data.frame(ID = x, age = age, BMD = y)
}))
# Draw plot
ggplot(df, aes(x = age, y = BMD)) +
geom_path(aes(group = ID), color = 'gray70', na.rm = TRUE) +
geom_point(color = 'gray70', na.rm = TRUE) +
geom_smooth(color = 'black', se = FALSE, formula = y ~ s(x, bs = "cs"),
method = 'gam', na.rm = TRUE) +
theme_classic(base_size = 16) +
scale_x_continuous(limits = c(8, 28)) +
labs(y = 'Spinal Bone Density', x = 'Age') +
theme(panel.border = element_rect(fill = NA))
Without knowing your own data structure however, it's difficult to say how applicable you will find this to your own use case.
You can do this in ggplot2 as long as you have data in long format and with a grouping variable such as id in my example:
dat <- tibble::tribble(
~id, ~age, ~bone_dens,
1, 10, 0.6,
1, 15, 0.8,
1, 19, 1.12,
2, 11, 0.7,
2, 18, 1.1,
3, 16, 1.1,
3, 18, 1.2,
3, 25, 1.0)
You first plot the dots with geom_point(), then you add the lines that join dots with the same id with geom_line():
dat |>
ggplot(aes(x = age, y = bone_dens)) +
geom_point() +
geom_line(aes(group = id))
Output will look like this - you'll be able to customise it like any other ggplot.
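If the measurements are not yet in long format, they need to be stacked first. A minimal base R sketch, assuming a hypothetical list called records in which each participant has a different number of visits (this structure is an assumption, not something described in the question):

```r
# Hypothetical per-participant records with differing numbers of observations
records <- list(
  list(id = 1, age = c(10, 15, 19), bone_dens = c(0.6, 0.8, 1.12)),
  list(id = 2, age = c(11, 18),     bone_dens = c(0.7, 1.1))
)

# Build one data frame per participant, then stack them row-wise
long <- do.call(rbind, lapply(records, function(r) {
  data.frame(id = r$id, age = r$age, bone_dens = r$bone_dens)
}))
long
```

The resulting data frame has one row per observation and an id column, which is the shape that geom_line(aes(group = id)) expects.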

Colouring nodes using graph and tidygraph in R?

I recently asked this question about how to colour nodes by variable. And the code works great. However, I'm back trying to colour the terminal nodes separately. For example, if I create some data, then turn them into tidygraph objects and plot them using ggraph then I get something like this:
library(tidygraph)
library(ggraph)
library(gridExtra)
pal = colorspace::sequential_hcl(palette = "Purples 3", n = 100)
# create some data for the tbl_graph
nodes <- data.frame(name = c("x4", NA, NA),
label = c("x4", 5, 2),
value = c(10, 5, 2))
nodes1 <- data.frame(name = c("x4", "x2", NA, NA, "x1", NA, NA),
label = c("x4", "x2", 2, 1, "x1", 13, 7),
value = c(10, 8, 2, 1, 10, 13, 7))
edges <- data.frame(from = c(1,1), to = c(2,3))
edges1 <- data.frame(from = c(1, 2, 2, 1, 5, 5),
to = c(2, 3, 4, 5, 6, 7))
# create the tbl_graphs
tg <- tbl_graph(nodes = nodes, edges = edges)
tg_1 <- tbl_graph(nodes = nodes1, edges = edges1)
# put into list
myList <- list(tg, tg_1)
# set colours for variables
nodenames <- unique(na.omit(unlist(lapply(myList, .%>%activate(nodes) %>% pull(name) ))))
nodecolors <- setNames(scales::hue_pal(c(0,360)+15, 100, 64, 0, 1)(length(nodenames)), nodenames)
nodecolors
# plot function
plotFun <- function(List, colors=NULL){
plot <- ggraph(List, "partition") +
geom_node_tile(aes(fill = name), size = 0.25) +
geom_node_label(aes(label = label, color = name)) +
scale_y_reverse() +
theme_void() +
theme(legend.position = "none")
if (!is.null(colors)) {
plot <- plot + scale_fill_manual(values=colors, na.value= 'grey40')
}
plot
}
# create grid of plots
allPlots <- lapply(myList, plotFun, colors=nodecolors)
n <- length(allPlots)
nRow <- floor(sqrt(n))
do.call("grid.arrange", c(allPlots, nrow = nRow))
As you can see, the named nodes are all coloured correctly, but the terminal nodes are coloured grey. I am trying to colour the terminal nodes by the corresponding value in the value column of the data. I have tried altering the scale_fill_manual function, but I can't seem to get it to work.
Any suggestions as to how I could do this?
If I understand correctly, you want to apply a different colour mapping to
the terminal nodes, mapping value to colour rather than name, and using
a different colour scale altogether. ggplot2 doesn’t support that directly,
but you can use e.g. ggnewscale to apply a different scale for the rest
of the plot.
I simplified your example a bit to focus on the new scale application:
library(tidygraph)
library(ggraph)
nodes <- data.frame(
name = c("x4", "x2", NA, NA, "x1", NA, NA),
label = c("x4", "x2", 2, 1, "x1", 13, 7),
value = c(10, 8, 2, 1, 10, 13, 7)
)
edges <- data.frame(
from = c(1, 2, 2, 1, 5, 5),
to = c(2, 3, 4, 5, 6, 7)
)
tg <- tbl_graph(nodes = nodes, edges = edges)
ggraph(tg, "partition") +
geom_node_tile(aes(fill = name)) +
geom_node_label(aes(label = label, color = name)) +
# Apply different colour/fill scales to terminal nodes
ggnewscale::new_scale_fill() +
ggnewscale::new_scale_color() +
geom_node_tile(
data = . %>% filter(is.na(name)),
aes(fill = value)
) +
geom_node_label(
data = . %>% filter(is.na(name)),
aes(label = label, color = value)
)
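The filter above identifies terminal nodes by their missing name. If that convention is not guaranteed in your data, the same set can be derived from the edge list instead. A minimal base R sketch, assuming that a terminal node is one that never appears in the from column (the column name terminal is introduced here for illustration):

```r
nodes <- data.frame(
  name = c("x4", "x2", NA, NA, "x1", NA, NA),
  value = c(10, 8, 2, 1, 10, 13, 7)
)
edges <- data.frame(
  from = c(1, 2, 2, 1, 5, 5),
  to = c(2, 3, 4, 5, 6, 7)
)

# A node is terminal if no edge starts from it
nodes$terminal <- !(seq_len(nrow(nodes)) %in% edges$from)
which(nodes$terminal)
```

For this example the flag agrees with is.na(name), so either condition could be used in the data argument of the second pair of layers.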

Keep ggplot secondary axis scale fixed

I'm making a ggplot with a secondary axis using the sec_axis() function but am having trouble retaining the correct scale.
Below is a reproducible example
# load package
library(ggplot2)
# produce dummy data
data = data.frame(week = 1:5,
count = c(45, 67, 21, 34, 50),
rate = c(3, 6, 2, 5, 3))
# calculate scale (and save as an object called 'scale')
scale = max(data$count)/10
# produce ggplot
p = ggplot(data, aes(x = week)) +
geom_bar(aes(y = count), stat = "identity") +
geom_point(aes(y = rate*scale)) +
scale_y_continuous(sec.axis = sec_axis(~./scale, name = "% positive",
breaks = seq(0, 10, 2)))
# look at ggplot - all looks good
p
# change the value of the scale object
scale = 2
# look at ggplot - you can see the scale has now changed
p
In reality I am producing a series of ggplots within a loop, and within each iteration of the loop the 'scale' object changes.
Question
How do I ensure the scale of my secondary y-axis remains fixed? (even if the value of the 'scale' object changes)
EDIT
I wanted to keep the example as simple as possible (see example above) but on request I'll add an example which includes a loop
# load package
library(ggplot2)
# produce dummy data
data = data.frame(group = c(rep("A", 5), rep("B", 5)),
week = rep(1:5, 2),
count = c(45, 67, 21, 34, 50,
120, 200, 167, 148, 111),
rate = c(3, 6, 2, 5, 3,
15, 17, 20, 11, 12))
# define the groups i want to loop over
group = c("A", "B")
# initialize an empty list (to store figures)
fig_list = list()
for(i in seq_along(group)){
# subset the data
data.sub = data[data$group == group[i], ]
# calculate scale (and save as an object called 'scale')
scale = max(data.sub$count)/20
# produce the plot
p = ggplot(data.sub, aes(x = week)) +
geom_bar(aes(y = count), stat = "identity") +
geom_point(aes(y = rate*scale), size = 4, colour = "dark red") +
scale_y_continuous(sec.axis = sec_axis(~./scale, name = "% positive",
breaks = seq(0, 20, 5))) +
ggtitle(paste("Plot", group[i]))
# print the plot
print(p)
# store the plot in a list
fig_list[[group[i]]] = p
}
I get the following figures when printing within the loop (everything looks good)
However... if I call the figure for group A from the list I created you can see the secondary y-axis scale is now incorrect (it has used the scale created for group B)
fig_list[["A"]]
Thanks for your edit, this makes things clearer. Your problem stems from the way R evaluates objects. The plot stored in fig_list is not an image but a specification of how the plot should be generated. It is only rendered when it is printed (for example by typing fig_list["A"] and hitting enter). Since the value of scale changes throughout the loop, a plot rendered later is incorrect, because it uses the value of scale from the last iteration.
An easy solution is to wrap your code for plotting in a function and use lapply:
make_plot <- function(df) {
scale = max(df$count)/20
ggplot(df, aes(x = week)) +
geom_bar(aes(y = count), stat = "identity") +
geom_point(aes(y = rate*scale), size = 4, colour = "dark red") +
scale_y_continuous(sec.axis = sec_axis(~./scale, name = "% positive",
breaks = seq(0, 20, 5))) +
ggtitle(paste("Plot", unique(df$group)))
}
grouped_data <- split(data, data$group)
fig_list <- lapply(grouped_data, make_plot)
Now when you call the first plot, it is evaluated correctly.
fig_list["A"]
#> $A
This still works even if an object named scale with a bogus value exists in your environment, since R looks up scale in the environment of the function call and not in the global environment.
Created on 2018-09-02 by the reprex package (v0.2.0).
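The lookup behaviour described above can be reproduced without ggplot2. A minimal base R sketch, using hypothetical function names (f_global, make_f, f_fixed) that are not part of the original answer:

```r
scale <- 5

# Refers to whatever `scale` is in the global environment when it is called
f_global <- function(x) x / scale

# Factory: each returned function keeps its own copy of `scale`
make_f <- function(scale) {
  force(scale)  # evaluate the argument now rather than lazily
  function(x) x / scale
}
f_fixed <- make_f(scale)

scale <- 2     # change the global value afterwards

f_global(10)   # uses the new global value: 10 / 2
f_fixed(10)    # still uses the captured value: 10 / 5
```

make_plot works for the same reason as f_fixed: the value of scale lives in the environment created by the function call, so later changes to a global scale do not reach it.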

Using ggplot in R to create a line graph for two different groups

I'm trying to create a line graph depicting different trajectories over time for two groups/conditions. I have two groups for which the data 'eat' was collected at five time points (1,2,3,4,5).
I'd like the lines to connect the mean point for each group at each of five time points, so I'd have two points at Time 1, two points at Time 2, and so on.
Here's a reproducible example:
#Example data
library(tidyverse)
library(ggplot2)
eat <- sample(1:7, size = 30, replace = TRUE)
df <- data.frame(id = rep(c(1, 2, 3, 4, 5, 6), each = 5),
Condition = rep(c(0, 1), each = 15),
time = c(1, 2, 3, 4, 5),
eat = eat
)
df$time <- as.factor(df$time)
df$Condition <- as.factor(df$Condition)
#Create the plot.
library(ggplot2)
ggplot(df, aes(x = time, y = eat, fill = Condition)) + geom_line() +
geom_point(size = 4, shape = 21) +
stat_summary(fun.y = mean, colour = "red", geom = "line")
The problem is, I need my lines to go horizontally (ie to show two different colored lines moving across the x-axis). But this code just connects the dots vertically:
If I don't convert Time to a factor, but only convert Condition to a factor, I get a mess of lines. The same thing happens in my actual data, as well.
I'd like it to look like this aesthetically, with the transparent error envelopes wrapping each line. However, I don't want it to be curvy, I want the lines to be straight, connecting the means at each point.
Here are the lines running in straight segments through the means at each time, with the ribbon showing the mean plus or minus one standard error of the points at that time. One stat_summary makes the mean line with the colour aesthetic, the other makes the area using the inherited fill aesthetic. ggplot2::mean_se is a convenient function that takes a vector and returns a data frame with the mean and +/- some number of standard errors. This is the right format for the fun.data argument to stat_summary, which passes these values to the geom specified. Here, geom_ribbon accepts ymin and ymax values to plot a ribbon across the graph.
library(tidyverse)
set.seed(12345)
eat <- sample(1:7, size = 30, replace = T)
df <- data.frame(
Condition = rep(c(0, 1), each = 15),
time = c(1, 2, 3, 4, 5),
eat = eat
)
df$Condition <- as.factor(df$Condition)
ggplot(df, aes(x = time, y = eat, fill = Condition)) +
geom_point(size = 4, shape = 21, colour = "black") +
stat_summary(geom = "ribbon", fun.data = mean_se, alpha = 0.2) +
stat_summary(
mapping = aes(colour = Condition),
geom = "line",
fun = mean, # fun.y is deprecated since ggplot2 3.3.0
show.legend = FALSE
)
Created on 2018-07-09 by the reprex package (v0.2.0).
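To see what the ribbon is drawing, the summary can be computed by hand. The sketch below defines a hypothetical helper, mean_se_manual, that mirrors what ggplot2::mean_se returns with its default multiplier of 1 (mean, and mean minus/plus one standard error); it is written in base R for illustration and is not the package's own code:

```r
# Hypothetical helper: mean and +/- 1 standard error, in the y/ymin/ymax
# layout that stat_summary(fun.data = ...) expects
mean_se_manual <- function(x) {
  x <- x[!is.na(x)]
  se <- sqrt(var(x) / length(x))  # standard error of the mean
  m <- mean(x)
  data.frame(y = m, ymin = m - se, ymax = m + se)
}

band <- mean_se_manual(c(1, 2, 3, 4, 5))
band  # y = 3, with ymin and ymax about 0.707 below and above
```

stat_summary applies such a function once per group at each x value, which is why the ribbon changes width from one time point to the next.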
Here's my best guess at what you want:
# keep time as numeric
df$time = as.numeric(as.character(df$time))
ggplot(df, aes(x = time, y = eat, group = Condition)) +
geom_smooth(
aes(fill = Condition, linetype = Condition),
method = "lm",
level = 0.65,
color = "black",
size = 0.3
) +
geom_point(aes(color = Condition))
Setting level = 0.65 gives a band of roughly +/- 1 standard error around the linear model fit.
I think this code will get you most of the way there
library(tidyverse)
eat <- sample(1:7, size = 30, replace = TRUE)
tibble(id = rep(c(1, 2, 3, 4, 5, 6), each = 5),
Condition = factor(rep(c(0, 1), each = 15)),
time = factor(rep(c(1, 2, 3, 4, 5), 6)),
eat = eat) %>%
ggplot(aes(x = time, y = eat, fill = Condition, group = Condition)) +
geom_point(size = 4, shape = 21) +
geom_smooth()
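geom_smooth is what you were looking for, I think. Note that by default it fits a smoother (loess for small data sets) rather than a linear model, so the line follows the group means only approximately; pass method = "lm" for a straight-line fit, or use stat_summary to connect the exact means.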
