Lines connecting jittered points - dodging by multiple groups - r

I am trying to connect jittered points between measurements from two different methods (measure) on an x-axis. These measurements are linked to one another by the probands (a), who can be separated into two main groups, patients (pat) and controls (ctr).
My data frame looks like this:
set.seed(1)
df <- data.frame(a = rep(paste0("id", "_", 1:20), each = 2),
value = sample(1:10, 40, rep = TRUE),
measure = rep(c("a", "b"), 20), group = rep(c("pat", "ctr"), each = 2,10))
I tried
library(ggplot2)
ggplot(df,aes(measure, value, fill = group)) +
geom_point(position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0.1,
dodge.width = 0.75), shape = 1) +
geom_line(aes(group = a), position = position_dodge(0.75))
Created on 2020-01-13 by the reprex package (v0.3.0)
I used the fill aesthetic in order to separate the jittered dots of the two groups (pat and ctr). I realised that when I put the group = a aesthetic into the main ggplot call, the groups are not separated as nicely, but the lines seem to link better to the points.
My question: Is there a way to better connect the lines to the (jittered) points, but keeping the separation of the two main groups, ctr and pat?
Thanks a lot.

The big issue you are having is that you are dodging the points by only group but the lines are being dodged by a, as well.
To keep your lines with the axes as is, one option is to manually dodge your data. This takes advantage of factors being integers under the hood, moving one level of group to the right and the other to the left.
# factor() is needed because data.frame() keeps strings as character since R 4.0.0
df = transform(df, dmeasure = ifelse(group == "ctr",
as.numeric(factor(measure)) - .25,
as.numeric(factor(measure)) + .25 ) )
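The "integers under the hood" trick can be checked in isolation. The following is a minimal sketch, not part of the original answer; the vectors measure and group are made-up stand-ins for the columns of df. It wraps the values in factor() first, on the assumption that measure may be a character column (the default for data.frame() since R 4.0.0), in which case as.numeric() alone would return NA.

```r
# Hypothetical stand-ins for df$measure and df$group
measure <- c("a", "b", "a", "b")
group   <- c("pat", "pat", "ctr", "ctr")

# factor() assigns integer codes to the sorted levels: a = 1, b = 2
codes <- as.numeric(factor(measure))

# Shift controls to the left and patients to the right of each axis position
dmeasure <- ifelse(group == "ctr", codes - 0.25, codes + 0.25)
dmeasure
```

Because points and lines are then drawn from the same numeric column, the line ends coincide with the points by construction.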
You can then make a plot with measure as the x axis but then use the "dodged" variable as the x axis variable in geom_point and geom_line.
ggplot(df, aes(x = measure, y = value) ) +
geom_blank() +
geom_point( aes(x = dmeasure), shape = 1 ) +
geom_line( aes(group = a, x = dmeasure) )
If you also want jittering, that can also be added manually to both your x and y variables.
df = transform(df, dmeasure = ifelse(group == "ctr",
jitter(as.numeric(factor(measure)) - .25, .1),
jitter(as.numeric(factor(measure)) + .25, .1) ),
jvalue = jitter(value, amount = .1) )
ggplot(df, aes(x = measure, y = jvalue) ) +
geom_blank() +
geom_point( aes(x = dmeasure), shape = 1 ) +
geom_line( aes(group = a, x = dmeasure) )

This turned out to be an astonishingly common question, and I'd like to add an answer to my own question suggesting what I now think is a much, much better visualisation:
The scatter plot.
I originally intended to show paired data and visually guide the eye between the two comparisons. The problem with this visualisation is evident: Every subject is visualised twice. This leads to a quite crowded graphic. Also, the two dimensions of the data (measurement before, and after) are forced into one dimension (y), and the connection by ID is awkwardly forced onto your x axis.
Plot 1: The scatter plot naturally represents the ID by showing only one point per subject, and shows both dimensions more naturally on x and y. The only step needed is to make your data wider (yes, this is also sometimes necessary; ggplot does not always require long data).
The box plot
Plot 2: As rightly pointed out by user AllanCameron, another option would be to plot the difference of the paired values directly, for example as a boxplot. This is a nice visualisation of the appropriate paired t-test where the mean of the differences is tested against 0. It will require the same data shaping to "wide format". I personally like to show the actual values as well (if there are not too many).
library(tidyr)
library(dplyr)
library(ggplot2)
## first reshape the data wider (one column for each measurement)
df %>%
pivot_wider(names_from = "measure", values_from = "value", names_prefix = "time_" ) %>%
## now use the new columns for your scatter plot
ggplot() +
geom_point(aes(time_a, time_b, color = group)) +
## you can add a line of equality to make it even more intuitive
geom_abline(intercept = 0, slope = 1, lty = 2, linewidth = .2) +
coord_equal()
Box plot to show differences of paired values
df %>%
pivot_wider(names_from = "measure", values_from = "value", names_prefix = "time_" ) %>%
ggplot(aes(x = "", y = time_a - time_b)) +
geom_boxplot() +
# optional, if you want to show the actual values
geom_point(position = position_jitter(width = .1))
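The remark that the boxplot of differences mirrors the paired t-test can be checked numerically. The sketch below is not from the original post: it rebuilds the question's df (without the group column), splits the values by measure in base R instead of pivot_wider(), and compares a paired t-test with a one-sample test of the differences against 0. It relies on the rows being ordered by id, with measure alternating a, b.

```r
# Rebuild the example data from the question
set.seed(1)
df <- data.frame(a = rep(paste0("id_", 1:20), each = 2),
                 value = sample(1:10, 40, replace = TRUE),
                 measure = rep(c("a", "b"), 20))

# One value per subject and measurement; rows are already ordered by id
time_a <- df$value[df$measure == "a"]
time_b <- df$value[df$measure == "b"]

# A paired t-test is a one-sample t-test on the differences
paired     <- t.test(time_a, time_b, paired = TRUE)
one_sample <- t.test(time_a - time_b, mu = 0)

all.equal(paired$p.value, one_sample$p.value)
```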

Related

Horizontal Group Bar Chart - How to scale to 100% and how to specify the order of the layers

So I have the following code which produces:
The issue here is twofold:
1. The group bar chart automatically places the highest value on the top (i.e. for avenue 4 CTP is on top), whereas I would always want FTP to be shown first then CTP to be shown after (so always blue bar then red bar)
2. I need all of the values to scale to 100 or 100% for their respective group (so for CTP avenue 4 would have a huge bar graph but the other avenues should be extremely tiny)
I am new to 'R'/Stack overflow so sorry if anything is wrong/you need more but any help is greatly appreciated.
library(ggplot2)
library(tidyverse)
library(magrittr)
# function to specify decimals
specify_decimal <- function(x, k) trimws(format(round(x, k), nsmall=k))
# sample data
avenues <- c("Avenue1", "Avenue2", "Avenue3", "Avenue4")
flytip_amount <- c(1000, 2000, 1500, 250)
collection_amount <- c(5, 15, 10, 2000)
# create data frame from the sample data
df <- data.frame(avenues, flytip_amount, collection_amount)
# got it working - now to test
df3 <- df
SumFA <- sum(df3$flytip_amount)
df3$FTP <- (df3$flytip_amount/SumFA)*100
df3$FTP <- specify_decimal(df3$FTP, 1)
SumCA <- sum(df3$collection_amount)
df3$CTP <- (df3$collection_amount/SumCA)*100
df3$CTP <- specify_decimal(df3$CTP, 1)
# Now we have percentages remove whole values
df2 <- df3[,c(1,4,5)]
df2 <- df2 %>% pivot_longer(-avenues)
FTGraphPos <- df2$name
ggplot(df2, aes(x = avenues, fill = as.factor(name), y = value)) +
geom_col(position = "dodge", width = 0.75) + coord_flip() +
labs(title = "Flytipping & Collection %", x = "ward_name", y = "Percentageperward") +
geom_text(aes(x= avenues, label = value), vjust = -0.1, position = "identity", size = 5)
I have tried the above and looked at lots of tutorials, but none of them does exactly what I need: keeping the bars of the grouped bar chart in the same order regardless of their values, and scaling each group to 100%.
As Camille notes, to handle ordering of the categories in a plot, you need to set them as factors, and then use functions from the forcats package to handle the order. Here I am using fct_relevel() (note that it will automatically convert character variables to factors).
Your numeric values are in fact stored as character (specify_decimal() returns formatted strings), so they need to be converted back to numeric for the chart to make sense.
To cover point #2, I'm using group_by() to calculate percentages within each name.
I have also fixed the labels so that they are properly dodged along with the bar chart. Also, note that you don't need to call ggplot2 or magrittr if you are calling tidyverse - those packages come along with it already.
df_plot <- df2 |>
mutate(name = fct_relevel(name, "CTP"),
value = as.numeric(value)) |>
group_by(name) |>
mutate(perc = value / sum(value)) |>
ungroup()
ggplot(df_plot, aes(x = value, y = avenues, fill = name)) +
geom_col(position = "dodge", width = 0.75) +
geom_text(aes(label = value), position = position_dodge(width = 0.75), size = 5) +
labs(title = "Flytipping & Collection %", x = "Percentageperward", y = "ward_name") +
guides(fill = guide_legend(reverse = TRUE))
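As a complement, here is a minimal sketch of the within-group percentage step that starts from the raw amounts instead of pre-formatted strings and plots the computed proportion directly. The data frame df_long and the column amount are hypothetical names introduced for illustration; the numbers are the question's sample data. Label formatting uses scales::percent(), which is an alternative to specify_decimal(), not the original poster's approach.

```r
library(dplyr)
library(ggplot2)

# Raw amounts in long format (hypothetical layout of the question's numbers)
df_long <- data.frame(
  avenues = rep(c("Avenue1", "Avenue2", "Avenue3", "Avenue4"), times = 2),
  name    = rep(c("FTP", "CTP"), each = 4),
  amount  = c(1000, 2000, 1500, 250, 5, 15, 10, 2000)
)

# Share of each avenue within its own name (FTP or CTP)
df_perc <- df_long |>
  group_by(name) |>
  mutate(perc = amount / sum(amount)) |>
  ungroup()

# Plot the proportion and format axis and labels as percentages
p <- ggplot(df_perc, aes(x = perc, y = avenues, fill = name)) +
  geom_col(position = "dodge", width = 0.75) +
  geom_text(aes(label = scales::percent(perc, accuracy = 0.1)),
            position = position_dodge(width = 0.75), hjust = -0.1) +
  scale_x_continuous(labels = scales::percent, limits = c(0, 1.1))
```

Keeping perc numeric until the plotting step avoids the character-to-numeric round trip.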

How to add percentage values to strata in alluvial plot with ggalluvial?

I'm looking for the most convenient way for adding rounded percentage labels to strata of an alluvial plot.
There are 50 cases in the following example. Independently of stages 1 or 2, each case belongs to one group of A, B or C. I'd like to display the relative group affiliation during each stage.
library(ggplot2)
library(ggalluvial)
df <- data.frame('id' = rep(1:50,2),
'stage' = c(rep(1,50), rep(2,50)),
'group' = sample(c('A','B','C'), 100, replace = TRUE))
ggplot(df,
aes(x = stage, stratum = group, alluvium = id, fill = group)) +
scale_x_discrete(expand = c(.1, .1)) +
geom_flow() +
geom_stratum(alpha = .5)
Is there a way to add rounded percentage labels (including "%") to the strata (bar segments) without calculating a percentage column in the initial data frame? If I'm not completely mistaken, geom_text doesn't work the same way here as in geom_bar().
The standard ggplot2 solution to this question is to use "calculated aesthetics". These are aesthetic specifications that come not from the data passed to ggplot() but from the output of the statistical transformation (the stat_*()), which is used to render the graphical elements (the geom_*()). The columns of this output (which are rarely seen by the user) are called "computed variables". The documentation on this topic is limited and a bit out of date, using stat() instead of after_stat() to call them. Since ggalluvial did not support computed variables, the answer from @bencekd was correct at the time.
As of today, v0.12.0 is on CRAN with support and documentation for computed variables. In particular, three computed variables are available that correspond to variables with the same names used by stat_bin() or stat_count(): n, count (a weighted version of n), and prop (a within-axis proportion calculated from count). It looks like you'd want to use prop, as illustrated below:
library(ggplot2)
library(scales)
library(ggalluvial)
df <- data.frame('id' = rep(1:50,2),
'stage' = c(rep(1,50), rep(2,50)),
'group' = sample(c('A','B','C'), 100, replace = TRUE))
ggplot(df,
aes(x = stage, stratum = group, alluvium = id, fill = group)) +
scale_x_discrete(expand = c(.1, .1)) +
geom_flow() +
geom_stratum(alpha = .5) +
geom_text(stat = "stratum",
aes(label = percent(after_stat(prop), accuracy = .1)))
Created on 2020-07-14 by the reprex package (v0.3.0)
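To see computed variables outside of ggalluvial, here is a small sketch using plain ggplot2 and stat_count(). The toy data frame is made up for illustration. The label aesthetic is filled from the count column that the stat produces, not from a column in the input data, and layer_data() shows the result.

```r
library(ggplot2)

# Toy data: 2 x A, 1 x B, 3 x C
toy <- data.frame(group = c("A", "A", "B", "C", "C", "C"))

p <- ggplot(toy, aes(x = group)) +
  geom_bar() +
  # 'count' is computed by stat_count(); after_stat() makes it available to aes()
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5)

# Inspect the data of the text layer after the stat has run
layer_data(p, 2)[, c("x", "count", "label")]
```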
Unfortunately I don't think you can do it without calculating the percentage column in the initial data frame yet. But that can be done easily and also gives more flexibility with the labeling:
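library(ggplot2)
library(ggalluvial)
library(dplyr)  # for %>%, group_by(), summarize(), mutate(), arrange()
library(scales) # for percent()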
df <- data.frame('id' = rep(1:50,2),
'stage' = c(rep(1,50), rep(2,50)),
'group' = sample(c('A','B','C'), 100, replace = TRUE))
# the list needs to be reversed, as stratums are displayed reversed in the alluvial by default
stratum_list <- df %>%
group_by(stage, group) %>%
summarize(s = n()) %>%
group_by(stage) %>%
mutate(s = percent(s/sum(s), accuracy=0.1)) %>%
arrange(stage, desc(group)) %>%
.$s
ggplot(df,
aes(x = stage, stratum = group, alluvium = id, fill = group)) +
scale_x_discrete(expand = c(.1, .1)) +
geom_flow() +
geom_stratum(alpha = .5) +
geom_text(stat = "stratum", label=stratum_list)
UPDATE [13/04/2020]
Added stratum_list reversion as Yonghao suggested

Showing number of values outside axis range in boxplot (using ggplot2 in R)

Sometimes you want to limit the axis range of a plot to a region of interest so that certain features (e.g. location of the median & quartiles) are emphasized. Nevertheless, it may be of interest to make it clear how many/what proportion of values lie outside the (truncated) axis range.
I am trying to show this when using ggplot2 in R and am wondering whether there is some built-in way of doing this in ggplot2 (or alternatively some sensible solution some of you may have used). I am not particularly wedded to any one way of displaying this (e.g. jittered points with a different symbol at the edge of the plot, a little bar outside the plot whose fill level shows the proportion outside the range, or some other display that conveys the information).
Below is some example code that creates some mock data and the kind of plot I have in mind (shown below the code), but without any clear indication exactly how much data is outside the y-axis range.
library(ggplot2)
set.seed(seed=123)
group <- rep(c(0,1),each=500)
y <- rcauchy(1000, group, 10)
mockdata <- data.frame(group,y)
ggplot(mockdata, aes(factor(group),y)) + geom_boxplot(aes(fill = factor(group))) + coord_cartesian(ylim = c(-40,40))
You may compute these values in advance and display them via e.g. geom_text:
library(dplyr)
upper_lim <- 40
lower_lim <- -40
mockdata$upper_cut <- mockdata$y > upper_lim
mockdata$lower_cut <- mockdata$y < lower_lim
mockdata$group <- as.factor(mockdata$group)
mockpts <- mockdata %>%
group_by(group) %>%
summarise(upper_count = sum(upper_cut),
lower_count = sum(lower_cut))
ggplot(mockdata, aes(group, y)) +
geom_boxplot(aes(fill = group)) +
coord_cartesian(ylim = c(lower_lim, upper_lim)) +
geom_text(y = lower_lim, data = mockpts,
aes(label = lower_count, x = group), hjust = 1.5) +
geom_text(y = upper_lim, data = mockpts,
aes(label = upper_count, x = group), hjust = 1.5)
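Since the question also asks about proportions, the same idea can be sketched with the share of values outside the range instead of raw counts. This is an illustrative variant, not part of the original answer; the column names below, above and the label columns are made up. It uses only base R: the mean of a logical vector is the proportion of TRUE values.

```r
# Mock data as in the question
set.seed(123)
grp <- rep(c(0, 1), each = 500)
mockdata <- data.frame(group = factor(grp), y = rcauchy(1000, grp, 10))

lower_lim <- -40
upper_lim <- 40

# Flag values outside the displayed range
mockdata$below <- mockdata$y < lower_lim
mockdata$above <- mockdata$y > upper_lim

# Mean of a logical = proportion of TRUE, computed per group
out_prop <- aggregate(cbind(below, above) ~ group, data = mockdata, FUN = mean)

# Ready-made percentage labels, e.g. for geom_text(aes(label = label_above))
out_prop$label_below <- sprintf("%.1f%%", 100 * out_prop$below)
out_prop$label_above <- sprintf("%.1f%%", 100 * out_prop$above)
out_prop
```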

ggplot2: how to add sample numbers to density plot?

I am trying to generate a (grouped) density plot labelled with sample sizes.
Sample data:
set.seed(100)
df <- data.frame(ab.class = c(rep("A", 200), rep("B", 200)),
val = c(rnorm(200, 0, 1), rnorm(200, 1, 1)))
The unlabelled density plot is generated and looks as follows:
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
What I want to do is add text labels somewhere near the peak of each density, showing the number of samples in each group. However, I cannot find the right combination of options to summarise the data in this way.
I tried to adapt the code suggested in this answer to a similar question on boxplots: https://stackoverflow.com/a/15720769/1836013
n_fun <- function(x){
return(data.frame(y = max(x), label = paste0("n = ",length(x))))
}
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4) +
stat_summary(geom = "text", fun.data = n_fun)
However, this fails with Error: stat_summary requires the following missing aesthetics: y.
I also tried adding y = ..density.. within aes() for each of the geom_density() and stat_summary() layers, and in the ggplot() object itself... none of which solved the problem.
I know this could be achieved by manually adding labels for each group, but I was hoping for a solution that generalises, and e.g. allows the label colour to be set via aes() to match the densities.
Where am I going wrong?
The y returned by fun.data is not the y aesthetic. stat_summary complains that it cannot find y, which should be specified either globally, in ggplot(df, aes(x = val, group = ab.class, y = ...)), or in the layer, in stat_summary(aes(y = ...)), if no global y is available. fun.data computes where to display the point/text/... at each x, based on the y values supplied through aes(). (I am not sure whether I have made this clear; I am not a native English speaker.)
Even if you specify y through aes(), you won't get the desired result, because stat_summary computes a y at each x.
However, you can add text to desired positions by geom_text or annotate:
# save the plot as p
p <- ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
# build the data displayed on the plot.
p.data <- ggplot_build(p)$data[[1]]
# Note that column 'scaled' is used for plotting
# so we extract the max density row for each group
p.text <- lapply(split(p.data, f = p.data$group), function(df){
df[which.max(df$scaled), ]
})
p.text <- do.call(rbind, p.text) # we can also get p.text with dplyr.
# now add the text layer to the plot
p + annotate('text', x = p.text$x, y = p.text$y,
label = sprintf('n = %d', p.text$n), vjust = 0)
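The comment in the code notes that p.text can also be obtained with dplyr. A possible version is sketched below; it is an assumption about what the author had in mind, not their code. It uses layer_data() rather than ggplot_build(p)$data[[1]] to fetch the computed density data, and slice_max() to keep the row with the highest scaled density per group.

```r
library(ggplot2)
library(dplyr)

# Sample data from the question
set.seed(100)
df <- data.frame(ab.class = c(rep("A", 200), rep("B", 200)),
                 val = c(rnorm(200, 0, 1), rnorm(200, 1, 1)))

p <- ggplot(df, aes(x = val, group = ab.class)) +
  geom_density(aes(fill = ab.class), alpha = 0.4)

# Computed density data of the first layer
p.data <- layer_data(p, 1)

# One row per group: the position of the density peak
p.text <- p.data |>
  group_by(group) |>
  slice_max(scaled, n = 1, with_ties = FALSE) |>
  ungroup()

# 'n' is the number of observations used by stat_density() in each group
p2 <- p + annotate("text", x = p.text$x, y = p.text$y,
                   label = sprintf("n = %d", as.integer(p.text$n)), vjust = 0)
```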

How to plot the mean of a single factor in a barplot with

I'm having trouble creating a figure with ggplot2.
In this plot, I'm using geom_bar to plot three factors. I mean, for each "time" and "dose" I'm plotting two bars (two genotypes).
To be more specific, this is what I mean:
This is my code so far (I actually changed some settings, but I'm presenting just what is needed):
ggplot(data=data, aes(x=interaction(dose,time), y=b, fill=factor(genotype)))+
geom_bar(stat="identity", position="dodge")+
scale_fill_grey(start=0.3, end=0.6, name="Genotype")
Question: I intend to add the mean of each time as a point, placed in the middle of the bars belonging to that time. How can I proceed?
I tried to add these points using geom_dotplot and geom_point but I did not succeed.
library(dplyr)
# name the summary column so it can be referenced later
time_data = data %>% group_by(time) %>% summarize(mean_b = mean(b))
data <- inner_join(data, time_data, by = "time")
This gives you data with the means attached. Now make the plot:
ggplot(data=data, aes(x=interaction(dose,time), y=b,fill=factor(genotype)))+
geom_bar(stat="identity", position="dodge")+
scale_fill_grey(start=0.3, end=0.6, name="Genotype")+
geom_text(aes(label = round(mean_b, 1)),vjust = 0)
You might need to fiddle around with the hjust and vjust arguments in the geom_text call, and maybe with the aes() one too; I didn't run the code, so I don't know.
It generally helps if you can give a reproducible example. Here, I made some of my own data.
library(ggplot2)
library(dplyr)
sampleData <-
data.frame(
dose = 1:3
, time = rep(1:3, each = 3)
, genotype = rep(c("AA","aa"), each = 9)
, b = rnorm(18, 20, 5)
)
You need to calculate the means somewhere, and I chose to do that on the fly. Note that, instead of using points, I used a line to show that the mean is for all of those values. I also sorted somewhat differently, and used facet_wrap to cluster things together. Points would be a fair bit harder to place, particularly when using position_dodge, but you could likely modify this code to accomplish that.
ggplot(
sampleData
, aes(x = dose
, y = b
, fill = genotype)
) +
geom_bar(position = "dodge", stat = "identity") +
geom_hline(data =
sampleData %>%
group_by(time) %>%
summarise(meanB = mean(b)
, dose = NA, genotype = NA)
, aes(yintercept = meanB)
, col = "black"
) +
facet_wrap(~time)
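If points are preferred over a line, one possible modification of the faceted plot is sketched below. It is not from the original answer: the means are computed with base aggregate(), timeMeans and meanB are illustrative names, and the point is placed at x = 2 on the assumption that the doses are 1, 2, 3, so that 2 is the horizontal centre of each facet.

```r
library(ggplot2)

# Sample data shaped like the answer's sampleData (seed added for reproducibility)
set.seed(42)
sampleData <- data.frame(
  dose = 1:3,
  time = rep(1:3, each = 3),
  genotype = rep(c("AA", "aa"), each = 9),
  b = rnorm(18, 20, 5)
)

# Mean of b for each time
timeMeans <- aggregate(b ~ time, data = sampleData, FUN = mean)
names(timeMeans)[2] <- "meanB"

p <- ggplot(sampleData, aes(x = dose, y = b, fill = genotype)) +
  geom_col(position = "dodge") +
  # one point per facet, centred over the middle dose
  geom_point(data = timeMeans, aes(x = 2, y = meanB),
             inherit.aes = FALSE, size = 3) +
  facet_wrap(~time)
```

Setting inherit.aes = FALSE keeps the point layer from looking for dose and genotype columns in timeMeans.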
