I try to establish R as data visualisation tool in my company. A typical graph type used in my department are waterfall charts (https://en.wikipedia.org/wiki/Waterfall_chart).
In R, there are some packages and hints for ggplot to generate a waterfall chart (https://learnr.wordpress.com/2010/05/10/ggplot2-waterfall-charts/), which I used already.
Unfortunately, a common feature for the used waterfall charts are annotations with arrows to indicate the percentage changes within the steps.
See an example below:
Or here in this video (https://www.youtube.com/watch?v=WMHf7uFR6Rk)
The software used to produce such kind of plots is think cell (https://www.think-cell.com/), which is an add-on to Excel and Powerpoint.
The problem I have is that I don't know how to start to tackle the topic. My first thoughts are going in this direction:
Use geom_segment for generating the arrows and boxes
Use ggplot's annotate funktion to place the text at the arrows or in the boxes
Calculate the positions automatically based on the data provided to the waterfall chart.
May I ask you, if you have additional thoughts/ideas to implement such graphs in ggplot?
Best Regards Markus
Here's an example of the approach I would take.
Step 1. Pick which elements should be added, and add them one at a time.
Let's say we're starting with this simple chart:
df <- data.frame(x = c(2007, 2008, 2009),
y = c(100, 120, 140))
ggplot(df, aes(x, y, label = y)) +
geom_col() +
geom_text(vjust = -0.5)
First of all, we need some extra vertical space:
ggplot(df, aes(x, y, label = y)) +
geom_col() +
geom_text(vjust = -0.5) +
scale_y_continuous(expand = expand_scale(add = c(10, 50))) # Add 50 y padding
Now, I incrementally add layers until it looks like I want:
# Semi-manual proof of concept
ggplot(df, aes(x, y, label = y)) +
geom_col() +
geom_text(vjust = -0.5) +
scale_y_continuous(expand = expand_scale(add = c(10, 50))) + # Add 50 y padding
# Line with arrow
geom_segment(aes(x = df$x[3], y = df$y[3] + 50,
xend = df$x[3], yend = df$y[3] + 50),
arrow = arrow(length = unit(0.02, "npc"), type = "closed")) +
# Background box
geom_tile(aes(x = mean(c(df$x[3], df$x[3])),
y = mean(c(df$y[3], df$y[3])) + 50, width = 1, height = 40),
fill = "white", color = "black", size = 0.5) +
# Text
geom_text(aes(x = mean(c(df$x[3], df$x[3])),
y = mean(c(df$y[3], df$y[3])) + 50,
label = paste0("CAGR\n",
df$x[3], "-", df$x[3], "\n",
scales::percent((df$y[3] / df$y[3]) ^ (1/(df$x[3]-df$x[3])) - 1))))
Step 2. Make it into a function
Now I move the CAGR-related layers into a function, replacing most of the constants with function parameters.
add_CAGR <- function(df, first_val_pos, second_val_pos,
y_offset, box_width = 1, box_height) {
list(
# Line with arrow
geom_segment(aes(x = df$x[first_val_pos],
xend = df$x[second_val_pos],
y = df$y[first_val_pos] + y_offset,
yend = df$y[second_val_pos] + y_offset),
arrow = arrow(length = unit(0.02, "npc"), type = "closed")),
# Background box
geom_tile(aes(x = mean(c(df$x[first_val_pos], df$x[second_val_pos])),
y = mean(c(df$y[first_val_pos], df$y[second_val_pos])) + y_offset,
width = box_width, height = box_height),
fill = "white", color = "black", size = 0.5),
# Text
geom_text(aes(x = mean(c(df$x[first_val_pos], df$x[second_val_pos])),
y = mean(c(df$y[first_val_pos], df$y[second_val_pos])) + y_offset,
label = paste0("CAGR\n",
df$x[first_val_pos], "-", df$x[second_val_pos], "\n",
scales::percent((df$y[second_val_pos] / df$y[1]) ^
(1/(df$x[second_val_pos]-df$x[first_val_pos])) - 1))),
lineheight = 0.8)
)
}
Step 3: Use in plot
ggplot(df, aes(x, y, label = y)) +
geom_col() +
geom_text(vjust = -0.5) +
scale_y_continuous(expand = expand_scale(add = c(0, 50))) + # Add 50 y padding
add_CAGR(df, first_val_pos = 1, second_val_pos = 3,
y_offset = 50,
box_width = 0.7, box_height = 40)
Or the same thing just between the first two bars:
ggplot(df, aes(x, y, label = y)) +
geom_col() +
geom_text(vjust = -0.5) +
scale_y_continuous(expand = expand_scale(add = c(0, 50))) + # Add 50 y padding
add_CAGR(df, first_val_pos = 1, second_val_pos = 2,
y_offset = 50,
box_width = 0.7, box_height = 40)
I have created boxplots using ggplot2 with this code.
plotgraph <- function(x, y, colour, min, max)
{
plot1 <- ggplot(dims, aes(x = x, y = y, fill = Region)) +
geom_boxplot()
#plot1 <- plot1 + scale_x_discrete(name = "Blog Type")
plot1 <- plot1 + labs(color='Region') + geom_hline(yintercept = 0, alpha = 0.4)
plot1 <- plot1 + scale_y_continuous(breaks=c(seq(min,max,5)), limits = c(min, max))
plot1 <- plot1 + labs(x="Blog Type", y="Dimension Score") + scale_fill_grey(start = 0.3, end = 0.7) + theme_grey()
plot1 <- plot1 + theme(legend.justification = c(1, 1), legend.position = c(1, 1))
return(plot1)
}
plot1 <- plotgraph (Blog, Dim1, Region, -30, 25)
A part of data I use is reproduced here.
Blog,Region,Dim1,Dim2,Dim3,Dim4
BlogsInd.,PK,-4.75,13.47,8.47,-1.29
BlogsInd.,PK,-5.69,6.08,1.51,-1.65
BlogsInd.,PK,-0.27,6.09,0.03,1.65
BlogsInd.,PK,-2.76,7.35,5.62,3.13
BlogsInd.,PK,-8.24,12.75,3.71,3.78
BlogsInd.,PK,-12.51,9.95,2.01,0.21
BlogsInd.,PK,-1.28,7.46,7.56,2.16
BlogsInd.,PK,0.95,13.63,3.01,3.35
BlogsNews,PK,-5.96,12.3,6.5,1.49
BlogsNews,PK,-8.81,7.47,4.76,1.98
BlogsNews,PK,-8.46,8.24,-1.07,5.09
BlogsNews,PK,-6.15,0.9,-3.09,4.94
BlogsNews,PK,-13.98,10.6,4.75,1.26
BlogsNews,PK,-16.43,14.49,4.08,9.91
BlogsNews,PK,-4.09,9.88,-2.79,5.58
BlogsNews,PK,-11.06,16.21,4.27,8.66
BlogsNews,PK,-9.04,6.63,-0.18,5.95
BlogsNews,PK,-8.56,7.7,0.71,4.69
BlogsNews,PK,-8.13,7.26,-1.13,0.26
BlogsNews,PK,-14.46,-1.34,-1.17,14.57
BlogsNews,PK,-4.21,2.18,3.79,1.26
BlogsNews,PK,-4.96,-2.99,3.39,2.47
BlogsNews,PK,-5.48,0.65,5.31,6.08
BlogsNews,PK,-4.53,-2.95,-7.79,-0.81
BlogsNews,PK,6.31,-9.89,-5.78,-5.13
BlogsTech,PK,-11.16,8.72,-5.53,8.86
BlogsTech,PK,-1.27,5.56,-3.92,-2.72
BlogsTech,PK,-11.49,0.26,-1.48,7.09
BlogsTech,PK,-0.9,-1.2,-2.03,-7.02
BlogsTech,PK,-12.27,-0.07,5.04,8.8
BlogsTech,PK,6.85,1.27,-11.95,-10.79
BlogsTech,PK,-5.21,-0.89,-6,-2.4
BlogsTech,PK,-1.06,-4.8,-8.62,-2.42
BlogsTech,PK,-2.6,-4.58,-2.07,-3.25
BlogsTech,PK,-0.95,2,-2.2,-3.46
BlogsTech,PK,-0.82,7.94,-4.95,-5.63
BlogsTech,PK,-7.65,-5.59,-3.28,-0.54
BlogsTech,PK,0.64,-1.65,-2.36,-2.68
BlogsTech,PK,-2.25,-3,-3.92,-4.87
BlogsTech,PK,-1.58,-1.42,-0.38,-5.15
Columns,PK,-5.73,3.26,0.81,-0.55
Columns,PK,0.37,-0.37,-0.28,-1.56
Columns,PK,-5.46,-4.28,2.61,1.29
Columns,PK,-3.48,2.38,12.87,3.73
Columns,PK,0.88,-2.24,-1.74,3.65
Columns,PK,-2.11,4.51,8.95,2.47
Columns,PK,-10.13,10.73,9.47,-0.47
Columns,PK,-2.08,1.04,0.11,0.6
Columns,PK,-4.33,5.65,2,-0.77
Columns,PK,1.09,-0.24,-0.92,-0.17
Columns,PK,-4.23,-4.01,-2.32,6.26
Columns,PK,-1.46,-1.53,9.83,5.73
Columns,PK,9.37,-1.32,1.27,-4.12
Columns,PK,5.84,-2.42,-5.21,1.07
Columns,PK,8.21,-9.36,-5.87,-3.21
Columns,PK,7.34,-7.3,-2.94,-5.86
Columns,PK,1.83,-2.77,1.47,-4.02
BlogsInd.,PK,14.39,-0.55,-5.42,-4.7
BlogsInd.,US,22.02,-1.39,2.5,-3.12
BlogsInd.,US,4.83,-3.58,5.34,9.22
BlogsInd.,US,-3.24,2.83,-5.3,-2.07
BlogsInd.,US,-5.69,15.17,-14.27,-1.62
BlogsInd.,US,-22.92,4.1,5.79,-3.88
BlogsNews,US,0.41,-2.03,-6.5,2.81
BlogsNews,US,-4.42,8.49,-8.04,2.04
BlogsNews,US,-10.72,-4.3,3.75,11.74
BlogsNews,US,-11.29,2.01,0.67,8.9
BlogsNews,US,-2.89,0.08,-1.59,7.06
BlogsNews,US,-7.59,8.51,3.02,12.33
BlogsNews,US,-7.45,23.51,2.79,0.48
BlogsNews,US,-12.49,15.79,-9.86,18.29
BlogsTech,US,-11.59,6.38,11.79,-7.28
BlogsTech,US,-4.6,4.12,7.46,3.36
BlogsTech,US,-22.83,2.54,10.7,5.09
BlogsTech,US,-4.83,3.37,-8.12,-0.9
BlogsTech,US,-14.76,29.21,6.23,9.33
Columns,US,-15.93,12.85,19.47,-0.88
Columns,US,-2.78,-1.52,8.16,0.24
Columns,US,-16.39,13.08,11.07,7.56
Even though I have tried to add detailed scale on y-axis, it is hard for me to pinpoint exact median score for each boxplot. So I need to print median value within each boxplot. There was another answer available (for faceted boxplot) which does not work for me as the printed values are not within the boxes but jammed together in the middle. It will be great to be able to print them within (middle and above the median line of) boxplots.
Thanks for your help.
Edit: I make a grouped graph as below.
Add
library(dplyr)
dims=dims%>%
group_by(Blog,Region)%>%
mutate(med=median(Dim1))
plotgraph <- function(x, y, colour, min, max)
{
plot1 <- ggplot(dims, aes(x = x, y = y, fill = Region)) +
geom_boxplot()+
labs(color='Region') +
geom_hline(yintercept = 0, alpha = 0.4)+
scale_y_continuous(breaks=c(seq(min,max,5)), limits = c(min, max))+
labs(x="Blog Type", y="Dimension Score") + scale_fill_grey(start = 0.3, end = 0.7) +
theme_grey()+
theme(legend.justification = c(1, 1), legend.position = c(1, 1))+
geom_text(aes(y = med,x=x, label = round(med,2)),position=position_dodge(width = 0.8),size = 3, vjust = -0.5,colour="blue")
return(plot1)
}
plot1 <- plotgraph (Blog, Dim1, Region, -30, 25)
Which gives (the text colour can be tweaked to something less tacky):
Note: You should consider using non-standard evaluation in your function rather than having it require the use of attach()
Edit:
One liner, not as clean I wanted it to be since I ran into problems with dplyr not properly aggregating the data even though it says the grouping was performed.
This function assume the dataframe is always called dims
library(ggplot2)
library(reshape2)
plotgraph <- function(x, y, colour, min, max)
{
plot1 <- ggplot(dims, aes_string(x = x, y = y, fill = colour)) +
geom_boxplot()+
labs(color=colour) +
geom_hline(yintercept = 0, alpha = 0.4)+
scale_y_continuous(breaks=c(seq(min,max,5)), limits = c(min, max))+
labs(x="Blog Type", y="Dimension Score") +
scale_fill_grey(start = 0.3, end = 0.7) +
theme_grey()+
theme(legend.justification = c(1, 1), legend.position = c(1, 1))+
geom_text(data= melt(with(dims, tapply(eval(parse(text=y)),list(eval(parse(text=x)),eval(parse(text=colour))), median)),varnames=c("Blog","Region"),value.name="med"),
aes_string(y = "med",x=x, label = "med"),position=position_dodge(width = 0.8),size = 3, vjust = -0.5,colour="blue")
return(plot1)
}
plot1 <- plotgraph ("Blog", "Dim1", "Region", -30, 25)
Assuming that Blog is your dataframe, the following should work:
min <- -30
max <- 25
meds <- aggregate(Dim1~Region, Blog, median)
plot1 <- ggplot(Blog, aes(x = Region, y = Dim1, fill = Region)) +
geom_boxplot()
plot1 <- plot1 + labs(color='Region') + geom_hline(yintercept = 0, alpha = 0.4)
plot1 <- plot1 + scale_y_continuous(breaks=c(seq(min,max,5)), limits = c(min, max))
plot1 <- plot1 + labs(x="Blog Type", y="Dimension Score") + scale_fill_grey(start = 0.3, end = 0.7) + theme_grey()
plot1 + theme(legend.justification = c(1, 1), legend.position = c(1, 1)) +
geom_text(data = meds, aes(y = Dim1, label = round(Dim1,2)),size = 5, vjust = -0.5, color='white')