How to improve speed of ggplot bar chart when plotting >1000 points?

I'm producing a bar chart for 1200 observations using ggplot2. Each of these observations has an error bar. There's also an average shown (using geom_line) for these observations overall.
The running time is slow (about 2 seconds) compared with fewer observations (e.g. with 500 it takes under 1 second). Also, every observation must be a separate bar.
I realise it doesn't sound like much time, but it adds up for what I need to do: producing over 100 of these plots and knitting them into an Rmd file.
Below is a piece of code I've created to replicate the issue - this is using ggplot2 inbuilt diamonds dataset.
library(dplyr)
library(ggplot2)
diamonds1 <- as.data.frame(mutate(diamonds, upper = x + 1.2, lower = x - 0.4))
diamonds2 <- diamonds1 %>%
group_by(cut) %>%
summarize(Mean = mean(x, na.rm=TRUE))
ChosenColorClarity <- "VVS28451"
diamonds3 <- left_join(diamonds1 ,diamonds2, by = c("cut" = "cut") ) %>%
filter(cut == "Very Good") %>%
mutate(ID = paste0(clarity,row_number() )) %>%
mutate(CutType = case_when(ID==ChosenColorClarity ~ ID,
color == "F" & ID != ChosenColorClarity ~ " Same Color",
TRUE ~ " Other Color"),
CutLabel = ifelse(ID == ChosenColorClarity, "Your Cut", ""))
diamonds4 <- diamonds3[order(-xtfrm(diamonds3$CutLabel)),]
diamonds4 <- diamonds4[1:1255,]
diamonds4$Xval <- as.numeric(reorder(diamonds4$ID, diamonds4$x))
DiamondCutChart = diamonds4 %>%
ggplot(aes(x = Xval,
y = x)) +
geom_bar(aes(fill=CutType), stat = "identity", width = 1) +
geom_errorbar(aes(ymin = lower, ymax = upper)) +
geom_text(aes(label = CutLabel),
position = position_stack(vjust = 0.5),
size = 2.7, angle = 90, fontface = "bold") +
geom_line(aes(y = diamonds4$Mean), group = 1, linetype=2, colour = "#0000ff") +
scale_fill_manual(values = c("#32572C", "#41B1B1", "#db03fc")) +
annotate("text", x = 1, y = diamonds4$Mean, hjust =0, vjust = -0.5,
size = 3.2, colour = "#0000ff",
label=paste0("Mean ",diamonds4$Mean)) +
theme_classic()+
theme(axis.title.x=element_blank(),
axis.title.y=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
legend.position = "top") +
labs(fill = "")
StartTime = Sys.time()
DiamondCutChart
EndTime = Sys.time()
EndTime - StartTime
When running this, it takes around 2 seconds. I need this to be less than 1 second to be able to produce multiple plots and rmarkdown outputs in less overall time.
How can I reduce the time it takes to plot the graph from the piece of code?
Any help or pointing in the right direction is greatly appreciated.

I'm assuming for now that you're aiming for raw speed and a visualization that still shows the desired data. I'm not sure you need geom_bar() if only one bar is a different color. If your real-world scenario has 7 different colors mixed randomly among the 1255 bars, this workaround won't work for you. :) Hopefully it's helpful anyway!
geom_ribbon() is much faster to render than geom_bar(). With 1255 positions I didn't fiddle with its options, but I understand there are step variants that make it look like bars when zoomed in. YMMV.
It is so much faster that I decided to use it twice: once to render the "bars" and once to render the "error bars". For geom_ribbon() to work (for me) I created a numeric column, Xval, for the x-axis values; see below.
The geom_text() step is really only printing one label, so subsetting the data for that layer saves a lot of rendering time. You can adjust as needed.
The same goes for the annotate() step: it prints and re-prints the same label 1255 times, which takes a lot of time. Obviously you don't need that. :)
Each of the three changes above saves about 0.6 to 0.7 seconds. Maybe you can mix and match with other geoms as needed.
The final result (on my system) was 0.2 seconds.
diamonds4$Xval <- as.numeric(reorder(diamonds4$ID, diamonds4$x))
DiamondCutChartNew <- diamonds4 %>%
ggplot(aes(x = Xval, y = x)) +
geom_ribbon(aes(ymin = 0, ymax = x), fill="#32572C") +
geom_col(data = subset(diamonds4, nchar(CutLabel) > 0),
aes(x = Xval, y = x),
fill = "#41B1B1") +
geom_ribbon(data = diamonds4,
aes(ymin = lower, ymax = upper), fill="#FF000077") +
geom_line(aes(y = x)) +
geom_text(data = subset(diamonds4, nchar(CutLabel) > 0),
aes(label = CutLabel),
position = position_stack(vjust = 0.5),
size = 2.7, angle = 90, fontface = "bold") +
geom_line(aes(x = Xval, y = Mean), group = 1, linetype = 2, colour = "#0000ff") +
annotate("text", x = 1, y = head(diamonds4$Mean, 1), hjust = 0, vjust = -0.5,
size = 3.2, colour = "#0000ff",
label=paste0("Mean ", head(diamonds4$Mean, 1))) +
theme_classic() +
theme(axis.title.x=element_blank(),
axis.title.y=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
legend.position = "top") +
labs(fill = "")
{StartTime = Sys.time()
print(DiamondCutChartNew)
EndTime = Sys.time()
EndTime - StartTime}
Original result (for me):
Time difference of 2.05 secs
The new result:
Time difference of 0.229 secs
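To see why the original annotate() call is costly, here is a minimal base-R sketch. The vector mean_col is a hypothetical stand-in for diamonds4$Mean (the value 5.74 is made up), assuming, as in the question, that every row carries the same per-cut mean:

```r
# Hypothetical stand-in for diamonds4$Mean: the same mean repeated on every row
mean_col <- rep(5.74, 1255)

# What the original annotate() receives: one label per row
labels_all <- paste0("Mean ", mean_col)
length(labels_all)          # number of text labels that get drawn
length(unique(labels_all))  # number of distinct labels among them

# What the faster version passes instead: a single label
label_once <- paste0("Mean ", head(mean_col, 1))
label_once
```

Passing a length-one y and label to annotate() draws the text once instead of stacking identical copies on top of each other.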

Pasting the ProfVis run for this question:
https://rstudio.github.io/profvis/
install.packages("profvis")
library(profvis)
profvis(expr = {
DiamondCutChart <- diamonds4 %>%
ggplot(aes(x = reorder(ID, x),
y = x)) +
geom_bar(aes(fill=CutType), stat = "identity", width = 1) +
geom_errorbar(aes(ymin = lower, ymax = upper)) +
geom_text(aes(label = CutLabel),
position = position_stack(vjust = 0.5),
size = 2.7, angle = 90, fontface = "bold") +
geom_line(aes(y = Mean), group = 1, linetype=2, colour = "#0000ff") +
scale_fill_manual(values = c("#32572C", "#41B1B1", "#db03fc")) +
annotate("text", x = 1, y = diamonds4$Mean, hjust =0, vjust = -0.5,
size = 3.2, colour = "#0000ff",
label=paste0("Mean ",diamonds4$Mean)) +
theme_classic()+
theme(axis.title.x=element_blank(),
axis.title.y=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
legend.position = "top") +
labs(fill = "")
print(DiamondCutChart)
},
interval = 0.005
)
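If a full profile is more than you need, a rough timing helper may be enough when comparing plot variants. This is only a sketch: time_render() is a hypothetical helper name, and it assumes that rendering to a null PDF device is representative of the cost when knitting. Building a ggplot object is cheap; most of the time is spent when it is printed, so the helper times print().

```r
library(ggplot2)

# Hypothetical helper: time the full render of a ggplot object.
# pdf(NULL) opens a device that writes no file, so nothing is left on disk.
time_render <- function(plot) {
  pdf(NULL)
  on.exit(dev.off())
  unname(system.time(print(plot))["elapsed"])
}

# Small example plot using a built-in dataset
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
elapsed <- time_render(p)
elapsed  # elapsed seconds; varies by machine
```

Calling time_render(DiamondCutChart) and time_render(DiamondCutChartNew) in the same session should give a like-for-like comparison.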

Related

Indicating a range for the Y axis in a bar chart

For my data the average normally lies between 8,000 and 10,000, and I want to indicate this range on my bar chart below.
I want to show two red lines at y=10,000 and y=8,000 and, if possible, shade the area between them.
Bar chart attached
Monthly_accidents2 %>%
ggplot(aes(x=Month,y=Traffic_Accidents))+
geom_bar(stat ="identity",fill = "#97B3C6")+
geom_text(aes(label = Traffic_Accidents), vjust = 0.5, colour = "white")+
ylim(0,12000)+
#coord_flip()+
theme_dark()+
labs(x=NULL,
y="Number of traffic accidents",
title = " Traffic Accidents throughout the year")
Thanks for any possible help in advance.
I tried creating a data set and adding the two lines but it didn't work.
For adding the two lines, 'geom_hline' makes it very straightforward. And for the shaded area, you can use 'geom_rect', though I'm guessing your 'Month' variable is factor so it requires a little bit of faffing to convert it to numeric first then adjust so the shaded area covers all of your bars :)
You didn't provide any sample data so I haven't run this but it should work.
Monthly_accidents2 %>%
ggplot(aes(x=Month,y=Traffic_Accidents))+
geom_bar(stat ="identity",fill = "#97B3C6")+
geom_text(aes(label = Traffic_Accidents), vjust = 0.5, colour = "white")+
ylim(0,12000)+
geom_hline(yintercept = c(8000, 10000), colour = 'red')+
geom_rect(aes(xmin = min(as.integer(Monthly_accidents2$Month)) - 0.5,
xmax = max(as.integer(Monthly_accidents2$Month)) + 0.5,
ymin = 8000, ymax = 10000), alpha = 0.2, fill = 'darkred')+
#coord_flip()+
theme_dark()+
labs(x=NULL,
y="Number of traffic accidents",
title = " Traffic Accidents throughout the year")
One option to achieve your desired result would be to use geom_hline to add some horizontal lines and annotate to add a shaded rectangle:
Using some fake example data:
Monthly_accidents2 <- data.frame(
Month = factor(month.abb, month.abb),
Traffic_Accidents = 1000 * seq_len(12)
)
library(ggplot2)
base <- ggplot(Monthly_accidents2, aes(x = Month, y = Traffic_Accidents)) +
geom_col(fill = "#97B3C6") +
geom_text(aes(label = Traffic_Accidents), vjust = 0.5, colour = "white") +
ylim(0, 12000) +
theme_dark() +
labs(
x = NULL,
y = "Number of traffic accidents",
title = "Traffic Accidents throughout the year"
) +
theme(plot.title = element_text(hjust = .5))
base +
geom_hline(yintercept = c(8000, 10000), color = "red") +
annotate(geom = "rect", ymin = 8000, ymax = 10000, xmin = -Inf, xmax = Inf, fill = "red", alpha = .2)
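For the lines, add the following to your plot. Note that yintercept and color are set outside aes(), so "red" is used as a literal colour rather than mapped to a legend entry:
+
geom_hline(yintercept = c(8000, 10000), color = "red")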
edit:
See stefan's more complete answer.

ggplot2 with side by side and proportional fill

I have data that looks like this:
My goal is to have a barplot grid as follows: Each plot will be specific to 1 race_ethnicity group. The x-axis in each plot will be the different age_bin groups. For each age_bin, there will be two bars: 1 for men, and 1 for women. For each bar, I want it to be filled with the proportion of Likely/(Unlikely + Likely). Preferably, each bar would have a height of 1 and a line cut through it so Likely% of that bar is one color with a label. This is what I currently have:
I am running into issues with 1) using a predefined proportion as the fill, and 2) having two different "fills" (one for biological sex, one for the predefined proportion).
Thanks to anyone who can help with this. My code is currently the following:
ggplot(data=who_votes_data, aes(x=age_bin,y=1, fill=gender)) +
geom_bar(stat='identity',aes(fill = gender), position = position_dodge2()) +
facet_wrap(~race_ethnicity, nrow = 2, scales = "free") +
geom_text(aes(label=paste0(sprintf("%1.1f", prop*100),"%"), y=prop),
colour="white") +
labs(x = expression("Age Group"), y= ("Proportion of Likely Voters"),
title = "Proportion of Likely Voters Across Age Groups, Race/Ethnicity, and Sex",
caption="Figure 1") + theme(plot.caption = element_text(hjust = 0.5, vjust = -0.5, size = 18))
https://docs.google.com/spreadsheets/d/1a7433iwXNSwcuXDJOvqsxNDN6oaYULVlyw22E41JROU/edit?usp=sharing
Updated Code:
library(tidyverse)
library(ggplot2)
df<- read.csv("samplevotes.csv")
df %>%
group_by(race_ethnicity, age_bin, gender) %>%
summarise(Likely = sum(Likely),
Unlikely = sum(Unlikely),
proportion = Likely/(Likely+Unlikely)) %>% ungroup() %>%
ggplot(aes(x = age_bin, y = proportion, fill = gender)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~race_ethnicity, nrow = 2) +
geom_text(aes(label=paste0(sprintf("%1.1f", proportion*100),"%"), y=proportion), position = position_dodge(width = 1), colour="Black", size = 2.2) +
labs(x = expression("Age Group"), y= ("Proportion of Likely Voters"), title = "Proportion of Likely Voters Across Age Groups, Race/Ethnicity, and Sex", caption="Figure 1") +
theme(plot.caption = element_text(hjust = 0.5, vjust = -0.5, size = 18))
Here is the code I would use. I did make some changes based on the way the data was combined.
df %>%
group_by(race_ethnicity, age_bin, gender) %>%
summarise(Likely = sum(Likely),
Unlikely = sum(Unlikely),
proportion = Likely/(Likely+Unlikely)) %>% ungroup() %>%
ggplot(aes(x = age_bin, y = proportion, fill = gender)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~race_ethnicity, nrow = 2) +
geom_text(aes(label=paste0(sprintf("%1.1f", proportion*100),"%"), y=proportion), position = position_dodge(width = 1), colour="Black", size = 2.2) +
labs(x = expression("Age Group"), y= ("Proportion of Likely Voters"), title = "Proportion of Likely Voters Across Age Groups, Race/Ethnicity, and Sex", caption="Figure 1") +
theme(plot.caption = element_text(hjust = 0.5, vjust = -0.5, size = 18))
Here is what it looks like
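Since the linked spreadsheet may not be available, here is a small sketch of the aggregation step on made-up data. The data frame votes and its numbers are invented for illustration; only the column names follow the question. It shows why the counts are summed per group before dividing:

```r
library(dplyr)

# Made-up data: the first group is split across two rows on purpose
votes <- data.frame(
  race_ethnicity = "A",
  age_bin  = c("18-29", "18-29", "18-29", "30-44", "30-44"),
  gender   = c("F", "F", "M", "F", "M"),
  Likely   = c(20, 10, 10, 45, 20),
  Unlikely = c(5, 5, 30, 15, 20)
)

# Sum counts within each group first, then compute Likely / (Likely + Unlikely)
props <- votes %>%
  group_by(race_ethnicity, age_bin, gender) %>%
  summarise(Likely = sum(Likely), Unlikely = sum(Unlikely), .groups = "drop") %>%
  mutate(proportion = Likely / (Likely + Unlikely))

props
```

Averaging row-level proportions instead would weight small and large rows equally, which is usually not what is wanted here.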

R Windrose percent label on figure

I am using the windrose function posted here: Wind rose with ggplot (R)?
I need to have the percents on the figure showing on the individual lines (rather than on the left side), but so far I have not been able to figure out how. (see figure below for depiction of goal)
Here is the code that makes the figure:
p.windrose <- ggplot(data = data,
aes(x = dir.binned,y = (..count..)/sum(..count..),
fill = spd.binned)) +
geom_bar()+
scale_y_continuous(breaks = ybreaks.prct,labels=percent)+
ylab("")+
scale_x_discrete(drop = FALSE,
labels = waiver()) +
xlab("")+
coord_polar(start = -((dirres/2)/360) * 2*pi) +
scale_fill_manual(name = "Wind Speed (m/s)",
values = spd.colors,
drop = FALSE)+
theme_bw(base_size = 12, base_family = "Helvetica")
I marked up the figure I have so far with what I am trying to do! It'd be neat if the labels either auto-picked the location with the least wind in that direction, or if it had a tag for the placement so that it could be changed.
I tried using geom_text, but I get an error saying that "aesthetics must be valid data columns".
Thanks for your help!
One of the things you could do is to make an extra data.frame that you use for the labels. Since the data isn't available from your question, I'll illustrate with mock data below:
library(ggplot2)
# Mock data
df <- data.frame(
x = 1:360,
y = runif(360, 0, 0.20)
)
labels <- data.frame(
x = 90,
y = scales::extended_breaks()(range(df$y))
)
ggplot(data = df,
aes(x = as.factor(x), y = y)) +
geom_point() +
geom_text(data = labels,
aes(label = scales::percent(y, 1))) +
scale_x_discrete(breaks = seq(0, 1, length.out = 9) * 360) +
coord_polar() +
theme(axis.ticks.y = element_blank(), # Disables default y-axis
axis.text.y = element_blank())
@teunbrand's answer got me very close! I wanted to add the code I used to get everything just right in case anyone in the future has a similar problem.
# Create the labels:
x_location <- pi # x location of the labels
# Get the percentage
T_data <- data %>%
dplyr::group_by(dir.binned) %>%
dplyr::summarise(count= n()) %>%
dplyr::mutate(y = count/sum(count))
labels <- data.frame(x = x_location,
y = scales::extended_breaks()(range(T_data$y)))
# Create figure
p.windrose <- ggplot() +
geom_bar(data = data,
aes(x = dir.binned, y = (..count..)/sum(..count..),
fill = spd.binned))+
geom_text(data = labels,
aes(x=x, y=y, label = scales::percent(y, 1))) +
scale_y_continuous(breaks = waiver(),labels=NULL)+
scale_x_discrete(drop = FALSE,
labels = waiver()) +
ylab("")+xlab("")+
coord_polar(start = -((dirres/2)/360) * 2*pi) +
scale_fill_manual(name = "Wind Speed (m/s)",
values = spd.colors,
drop = FALSE)+
theme_bw(base_size = 12, base_family = "Helvetica") +
theme(axis.ticks.y = element_blank(), # Disables default y-axis
axis.text.y = element_blank())
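The label data frame is the key piece in both answers, so here it is in isolation. This sketch uses made-up shares per direction bin (dir_share) and an arbitrary x position; both are assumptions for illustration, not values from the original data:

```r
library(scales)

set.seed(1)
# Made-up share of observations in each of 16 direction bins
dir_share <- runif(16, 0, 0.2)

# Pick "nice" break positions covering the range of the shares
breaks <- extended_breaks()(range(dir_share))

# One row per break: where to draw the label and what it should say
labels <- data.frame(
  x = 8,  # arbitrary position along the direction axis
  y = breaks,
  label = percent(breaks, accuracy = 1)
)

labels
```

Changing x moves the whole column of percent labels to a different direction, which covers the "tag for the placement" wish in the question.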

How to change the legend of a ggplot

When I use ggplot and try to change the legend name from "value" to "Work schedules", it doesn't change. I also want the scale to read 0 = did not work, 1 = worked. Do you know what could be wrong with my code?
plot <- ggplot(df3, aes(x = time, y = index, fill = value)) +
geom_raster() +
facet_grid(~ day) +
theme(panel.spacing = unit(1, "mm"),
axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x="Hours", y ="Identification Number") +
scale_x_continuous(breaks = c(9,17), name= "Time") +
scale_y_continuous()
plot + annotate("rect", fill = "red", alpha = 0.5, xmin = c(9), xmax = c(17), ymin = -Inf, ymax = Inf) +
ylab ("Identification number") +
theme_bw()
@Jordo82 has the right answer for naming the legend. As for changing the scale from continuous to discrete, take a look at your variable "value" and check the range of values it takes. If the variable is a double, you may need to use dplyr::mutate() to create ranges. If the values are indeed discrete, try dplyr::mutate(value = as.factor(value)), or recode them directly (assuming 0 means "did not work" and 1 means "worked"):
df3 <- df3 %>% dplyr::mutate(value = ifelse(value == 0, "Not Worked", "Worked"))
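Putting the two pieces together, a minimal sketch on made-up data might look like the following. The df3 here is invented, and labs(fill = ...) is one common way to title a fill legend; it is not necessarily what the referenced answer used:

```r
library(ggplot2)

# Made-up stand-in for df3: 0 = did not work, 1 = worked
df3 <- data.frame(
  time  = rep(8:11, times = 2),
  index = rep(1:2, each = 4),
  value = c(0, 1, 1, 0, 0, 0, 1, 1)
)

# Turn the 0/1 codes into a labelled factor so the legend is discrete
df3$value <- factor(df3$value, levels = c(0, 1),
                    labels = c("Did not work", "Worked"))

# labs(fill = ...) sets the legend title for the fill aesthetic
p <- ggplot(df3, aes(x = time, y = index, fill = value)) +
  geom_raster() +
  labs(fill = "Work schedules")
```

Printing p should show a two-entry legend titled "Work schedules" instead of a continuous colour bar.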

How to separately label and scale double y-axis in ggplot2?

I have a test dataset like this:
df_test <- data.frame(
proj_manager = c('Emma','Emma','Emma','Emma','Emma','Alice','Alice'),
proj_ID = c(1, 2, 3, 4, 5, 6, 7),
stage = c('B','B','B','A','C','A','C'),
value = c(15,15,20,20,20,70,5)
)
Preparation for viz:
library(dplyr)
library(ggplot2)
input <- select(df_test, proj_manager, proj_ID, stage, value) %>%
filter(proj_manager=='Emma') %>%
do({
proj_value_by_manager = sum(distinct(., proj_ID, value)$value);
mutate(., proj_value_by_manager = proj_value_by_manager)
}) %>%
group_by(stage) %>%
do({
sum_value_byStage = sum(distinct(.,proj_ID,value)$value);
mutate(.,sum_value_byStage= sum_value_byStage)
}) %>%
mutate(count_proj = length(unique(proj_ID)))
commapos <- function(x, ...) {
format(abs(x), big.mark = ",", trim = TRUE,
scientific = FALSE, ...) }
Visualization:
ggplot (input, aes(x=stage, y = count_proj)) +
geom_bar(stat = 'identity')+
geom_bar(aes(y=-proj_value_by_manager),
stat = "identity", fill = "Blue") +
scale_y_continuous(labels = commapos)+
coord_flip() +
ylab('') +
geom_text(aes(label= sum_value_byStage), hjust = 5) +
geom_text(aes(label= count_proj), hjust = -1) +
labs(title = "Emma: 4 projects| $90M Values \n \n Commitment|Projects") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_hline(yintercept = 0, linetype =1)
My questions are:
Why are the y-values not showing up right? E.g. C is labeled 20, but the bar nearly reaches 100 on the scale.
How can I adjust the position of the labels so that each sits on top of its bar?
How can I re-scale the y-axis so that both the very short "count of projects" bars and the long "project value" bars are displayed well?
Thank you all for the help!
I think your issues are coming from the fact that:
(1) Your dataset has duplicated values. This causes geom_bar to add all of them together. For example there are 3 obs for B where proj_value_by_manager = 90 which is why the blue bar extends to 270 for that group (they all get added).
(2) in your second geom_bar you use y = -proj_value_by_manager but in the geom_text to label this you use sum_value_byStage. That's why the blue bar for A is extending to 90 (since proj_value_by_manager is 90) but the label reads 20.
To get what I believe is the chart you want, you could do:
#Q1: De-duplicated dataset so geom_bar doesn't erroneously add rows together
input2 <- input[!duplicated(input[,-c(2,4)]),]
ggplot (input2, aes(x=stage, y = count_proj)) +
geom_bar(stat = 'identity')+
geom_bar(aes(y=-sum_value_byStage), #Q1: changed so this y-value matches your label
stat = "identity", fill = "Blue") +
scale_y_continuous(labels = commapos)+
coord_flip() +
ylab('') +
geom_text(aes(label= sum_value_byStage, y = -sum_value_byStage), hjust = 1) + #Q2: Added in y-value for label and hjust so it will be on top
geom_text(aes(label= count_proj), hjust = -1) +
labs(title = "Emma: 4 projects| $90M Values \n \n Commitment|Projects") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_hline(yintercept = 0, linetype =1)
For your last question, there is no good way to display both of these. One option would be to rescale the small data and still label it with a 1 or 3. However, I didn't do this because once you scale down the blue bars the other bars look OK to me.
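The stacking effect from point (1) can be checked without plotting. This base-R sketch rebuilds a simplified version of input for Emma by hand (the columns are typed in rather than computed, so treat it as an illustration) and sums the blue-bar variable per stage, which is effectively what geom_bar(stat = 'identity') does when several rows share an x value:

```r
# Simplified hand-built version of `input` for Emma
input <- data.frame(
  stage = c("B", "B", "B", "A", "C"),
  proj_ID = 1:5,
  value = c(15, 15, 20, 20, 20),
  proj_value_by_manager = 90,
  sum_value_byStage = c(50, 50, 50, 20, 20),
  count_proj = c(3, 3, 3, 1, 1)
)

# Total bar length per stage when every row is drawn and stacked
stacked <- tapply(input$proj_value_by_manager, input$stage, sum)
stacked

# Keep one row per stage so each bar is drawn exactly once
input2 <- input[!duplicated(input[, c("stage", "sum_value_byStage", "count_proj")]), ]
input2
```

With one row per stage, the bar lengths match the numbers used for the labels.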
