I have a data frame that simulates the NFL season with 2 columns: team and rank. I am trying to use ggridges to make a distribution plot of the frequency of each team at each rank from 1-10. I can get the plot working, but I'd like to display the count of each team/rank in each bin. I have been unsuccessful so far.
ggplot(results,
aes(x=rank, y=team, group = team)) +
geom_density_ridges2(aes(fill=team), stat='binline', binwidth=1, scale = 0.9, draw_baseline=T) +
scale_x_continuous(limits = c(0,11), breaks = seq(1,10,1)) +
theme_ridges() +
theme(legend.position = "none") +
scale_fill_manual(values = c("#4F2E84", "#FB4F14", "#7C1415", "#A71930", "#00143F", "#0C264C", "#192E6C", "#136677", "#203731"), name = NULL)
Which creates this plot:
I tried adding in this line to get the count added to each bin, but it did not work.
geom_text(stat='bin', aes(y = team + 0.95*stat(count/max(count)),
label = ifelse(stat(count) > 0, stat(count), ""))) +
Not the exact dataset but this should be enough to at least run the original plot:
results = data.frame(team = rep(c('Jets', 'Giants', 'Washington', 'Falcons', 'Bengals', 'Jaguars', 'Texans', 'Cowboys', 'Vikings'), 1000), rank = sample(1:20,9000,replace = T))
How about calculating the count for each bin, joining to the original data and using the new variable n as the label?
library(dplyr) # for count, left_join
library(ggplot2)
library(ggridges) # for geom_density_ridges2, theme_ridges
results %>%
count(team, rank) %>%
left_join(results) %>%
ggplot(aes(rank, team, group = team)) +
geom_density_ridges2(aes(fill = team),
stat = 'binline',
binwidth = 1,
scale = 0.9,
draw_baseline = TRUE) +
scale_x_continuous(limits = c(0, 11),
breaks = seq(1, 10, 1)) +
theme_ridges() +
theme(legend.position = "none") +
scale_fill_manual(values = c("#4F2E84", "#FB4F14", "#7C1415", "#A71930", "#00143F",
"#0C264C", "#192E6C", "#136677", "#203731"), name = NULL) +
geom_text(aes(label = n),
color = "white",
nudge_y = 0.2)
Result:
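One thing to watch with the join: after left_join(results) the data has one row per simulated season again, so geom_text draws the same label many times on top of itself. A minimal sketch of a variant, assuming a smaller stand-in for the results data, keeps the raw rows for the ridgeline layer and hands geom_text its own one-row-per-bin table (bin_counts is a name introduced here for illustration):

```r
library(dplyr)
library(ggplot2)
library(ggridges)

# Stand-in for the simulated season data (smaller than the original example)
set.seed(1)
results <- data.frame(
  team = rep(c("Jets", "Giants", "Washington"), 100),
  rank = sample(1:20, 300, replace = TRUE)
)

# One row per team/rank bin, restricted to the ranks shown on the x axis
bin_counts <- results %>%
  filter(rank <= 10) %>%
  count(team, rank)

# Ridgelines use the raw rows; the labels use the summarised table
p <- ggplot(results, aes(rank, team, group = team)) +
  geom_density_ridges2(aes(fill = team), stat = "binline", binwidth = 1,
                       scale = 0.9, draw_baseline = TRUE) +
  geom_text(data = bin_counts, aes(label = n), nudge_y = 0.2, size = 3) +
  scale_x_continuous(limits = c(0, 11), breaks = 1:10) +
  theme_ridges() +
  theme(legend.position = "none")
```

Printing p draws the plot. Each label is drawn exactly once, so the text stays crisp and the plot renders faster on large simulations.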
Neilfws' answer is great, but I've always found the ridgeline geoms difficult to work with in circumstances like this, so I usually recreate them with geom_rect:
library(dplyr)
results %>%
count(team, rank) %>%
filter(rank<=10) %>%
mutate(team=factor(team)) %>%
ggplot() +
geom_rect(aes(xmin=rank-0.5, xmax=rank+0.5, ymin=team, fill=team,
ymax=as.numeric(team)+n*0.75/max(n))) +
geom_text(aes(x=rank, y=as.numeric(team)-0.1, label=n)) +
theme_ridges() +
theme(legend.position = "none") +
scale_fill_manual(values = c("#4F2E84", "#FB4F14", "#7C1415", "#A71930",
"#00143F", "#0C264C", "#192E6C", "#136677",
"#203731"), name = NULL) +
ylab("team")
I especially like the level of fine control I get from geom_rect rather than ridgelines. But you do lose out on the nice bounding line drawn around each ridgeline, so if that's important then go with the other answer.
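The bar heights in the geom_rect version come from the expression n*0.75/max(n): the fullest bin reaches 0.75 of the gap to the next team's baseline, and every other bin is scaled relative to it. A small worked example with made-up counts (the team names and numbers below are illustrative only):

```r
# Three teams sit at baselines 1, 2, 3 on the y axis
team <- factor(c("Bengals", "Cowboys", "Jets"))
n    <- c(40, 100, 25)            # made-up bin counts

ymin   <- as.numeric(team)        # baseline of each ridge
ymax   <- ymin + n * 0.75 / max(n)
height <- ymax - ymin             # 0.3000 0.7500 0.1875
```

Changing 0.75 to a larger value lets the tallest bar get closer to (or overlap) the ridge above it, which plays the same role as the scale argument in ggridges.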
In order to highlight the moving average in my ggplot visualization, I want to give it a different color (in this case grey or black for both MA lines). When it comes to a graph representing two time series, I struggle to find the best solution. Maybe I need to take a different approach.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidyquant))
V <- 365 # number of daily observations
data <- data.frame(var1 = rnorm(V),
var2 = rnorm(V) + 12,
date = ymd("2013-01-01") + days(0:(V - 1))
)
data_melted <- reshape2::melt(data, id.var='date')
data_melted %>%
ggplot() +
geom_line(mapping = aes(x= date, y=value, col=variable)) +
scale_color_manual(values=c("#CC6666", "steelblue")) +
geom_ma(ma_fun = SMA, n = 30, mapping = aes(x= date, y=value, col=variable)) +
theme(axis.text.x = element_text(angle = 50, vjust = 0.5)) +
scale_x_date(date_breaks = "1 month")
I think you can get what you want by not mapping variable to color in aes() for the MA part. I instead include group = variable to tell ggplot2 that the two MA's should be separate series, but no difference in their color based on that.
data_melted %>%
ggplot() +
geom_line(mapping = aes(x= date, y=value, col=variable)) +
scale_color_manual(values=c("#CC6666", "steelblue")) +
tidyquant::geom_ma(ma_fun = SMA, n = 30, mapping = aes(x= date, y=value, group = variable), color = "black") +
theme(axis.text.x = element_text(angle = 50, vjust = 0.5)) +
scale_x_date(date_breaks = "1 month")
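The same idea (group without colour) can be sketched without tidyquant. This version swaps geom_ma for a trailing 30-day mean computed up front with stats::filter, which is an assumption made here to keep the example dependency-light, not what the answer above uses:

```r
library(ggplot2)

set.seed(42)
n_days <- 120
long <- data.frame(
  date     = rep(as.Date("2013-01-01") + 0:(n_days - 1), 2),
  variable = rep(c("var1", "var2"), each = n_days),
  value    = c(rnorm(n_days), rnorm(n_days) + 12)
)

# Trailing 30-day simple moving average, computed separately per series
long$ma <- ave(long$value, long$variable, FUN = function(v) {
  as.numeric(stats::filter(v, rep(1 / 30, 30), sides = 1))
})

p <- ggplot(long, aes(date, value)) +
  geom_line(aes(colour = variable)) +
  # group keeps the two averages as separate lines; colour is set, not mapped
  geom_line(aes(y = ma, group = variable), colour = "black", na.rm = TRUE) +
  scale_colour_manual(values = c("#CC6666", "steelblue"))
```

Because colour sits outside aes() in the second layer, the averages stay out of the legend and both are drawn in black.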
If you want different colors, the natural way to do this in ggplot would be to give the moving averages their own values to be mapped to color.
...
scale_color_manual(values=c("#CC6666", "#996666", "steelblue", "slateblue")) +
tidyquant::geom_ma(ma_fun = SMA, n = 30, mapping = aes(x= date, y=value, col=paste(variable, "MA"))) +
...
I would consider looking at the tsibble library for time series data.
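library(tsibble)
library(fabletools) # provides the autoplot() method for tsibbles
data_melted <- as_tsibble(data_melted, key = 'variable', index = 'date')
# Compute the centred 5-point moving average within each series,
# so the window never mixes values from var1 and var2
data_melted <- data_melted %>%
group_by_key() %>%
mutate(
`5-MA` = slider::slide_dbl(value, mean,
.before = 2, .after = 2, .complete = TRUE)
) %>%
ungroup()
data_melted %>%
filter(variable == "var1") %>%
autoplot(value) +
geom_line(aes(y = `5-MA`), colour = "#D55E00") +
labs(y = "y",
title = "title") +
guides(colour = guide_legend(title = "series"))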
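To see what the .before, .after and .complete arguments do, here is a small sketch on a toy vector (the numbers are made up). With two values before and two after, the window is centred and five wide, and .complete = TRUE returns NA wherever a full window is not available:

```r
x <- c(2, 4, 6, 8, 10, 12, 14)

ma5 <- slider::slide_dbl(x, mean, .before = 2, .after = 2, .complete = TRUE)
ma5
# NA NA  6  8 10 NA NA
```

This is why the first and last two days of each series have no moving-average value in the plot.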
I have data that looks like this:
My goal is to have a barplot grid as follows: Each plot will be specific to 1 race_ethnicity group. The x-axis in each plot will be the different age_bin groups. For each age_bin, there will be two bars: 1 for men, and 1 for women. For each bar, I want it to be filled with the proportion of Likely/(Unlikely + Likely). Preferably, each bar would have a height of 1 and a line cut through it so Likely% of that bar is one color with a label. This is what I currently have:
I am running into issues with 1) using a predefined proportion as the fill, and 2) having two different "fills" (one for biological sex, one for the predefined proportion).
Thanks to anyone who can help with this. My code is currently the following:
ggplot(data=who_votes_data, aes(x=age_bin,y=1, fill=gender)) +
geom_bar(stat='identity',aes(fill = gender), position = position_dodge2()) +
facet_wrap(~race_ethnicity, nrow = 2, scales = "free") +
geom_text(aes(label=paste0(sprintf("%1.1f", prop*100),"%"), y=prop),
colour="white") +
labs(x = expression("Age Group"), y= ("Proportion of Likely Voters"),
title = "Proportion of Likely Voters Across Age Groups, Race/Ethnicity, and Sex",
caption="Figure 1") + theme(plot.caption = element_text(hjust = 0.5, vjust = -0.5, size = 18))
https://docs.google.com/spreadsheets/d/1a7433iwXNSwcuXDJOvqsxNDN6oaYULVlyw22E41JROU/edit?usp=sharing
Updated Code:
library(tidyverse)
library(ggplot2)
df<- read.csv("samplevotes.csv")
df %>%
group_by(race_ethnicity, age_bin, gender) %>%
summarise(Likely = sum(Likely),
Unlikely = sum(Unlikely),
proportion = Likely/(Likely+Unlikely)) %>% ungroup() %>%
ggplot(aes(x = age_bin, y = proportion, fill = gender)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~race_ethnicity, nrow = 2) +
geom_text(aes(label=paste0(sprintf("%1.1f", proportion*100),"%"), y=proportion), position = position_dodge(width = 0.9), colour="Black", size = 2.2) + # width 0.9 matches the default bar dodge so labels line up with their bars
labs(x = expression("Age Group"), y= ("Proportion of Likely Voters"), title = "Proportion of Likely Voters Across Age Groups, Race/Ethnicity, and Sex", caption="Figure 1") +
theme(plot.caption = element_text(hjust = 0.5, vjust = -0.5, size = 18))
Here is the code I would use. I did make some changes based on the way the data was combined.
df %>%
group_by(race_ethnicity, age_bin, gender) %>%
summarise(Likely = sum(Likely),
Unlikely = sum(Unlikely),
proportion = Likely/(Likely+Unlikely)) %>% ungroup() %>%
ggplot(aes(x = age_bin, y = proportion, fill = gender)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~race_ethnicity, nrow = 2) +
geom_text(aes(label=paste0(sprintf("%1.1f", proportion*100),"%"), y=proportion), position = position_dodge(width = 0.9), colour="Black", size = 2.2) + # width 0.9 matches the default bar dodge so labels line up with their bars
labs(x = expression("Age Group"), y= ("Proportion of Likely Voters"), title = "Proportion of Likely Voters Across Age Groups, Race/Ethnicity, and Sex", caption="Figure 1") +
theme(plot.caption = element_text(hjust = 0.5, vjust = -0.5, size = 18))
Here is what it looks like
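The summarise step is the part that turns raw counts into the plotted proportion. A minimal sketch on made-up rows (the group label, age bin and counts are illustrative, not taken from the linked spreadsheet) shows the arithmetic:

```r
library(dplyr)

# Two rows for men and one for women in a single race/age cell
votes <- data.frame(
  race_ethnicity = "Group 1",
  age_bin        = "18-29",
  gender         = c("Men", "Men", "Women"),
  Likely         = c(30, 20, 45),
  Unlikely       = c(10, 40, 15)
)

props <- votes %>%
  group_by(race_ethnicity, age_bin, gender) %>%
  summarise(likely     = sum(Likely),
            unlikely   = sum(Unlikely),
            proportion = likely / (likely + unlikely),
            .groups    = "drop")

props$proportion
# 0.50 0.75
```

Men: (30 + 20) / (50 + 50) = 0.50. Women: 45 / 60 = 0.75. The sketch uses lower-case names for the sums so the totals are not confused with the original columns.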
I would like to modify an existing Sankey plot using ggplot2 and ggalluvial to make it more appealing.
My example is from https://corybrunson.github.io/ggalluvial/articles/ggalluvial.html
library(ggplot2)
library(ggalluvial)
data(vaccinations)
levels(vaccinations$response) <- rev(levels(vaccinations$response))
ggplot(vaccinations,
aes(x = survey, stratum = response, alluvium = subject,
y = freq,
fill = response, label = response)) +
scale_x_discrete(expand = c(.1, .1)) +
geom_flow() +
geom_stratum(alpha = .5) +
geom_text(stat = "stratum", size = 3) +
theme(legend.position = "none") +
ggtitle("vaccination survey responses at three points in time")
Created on 2020-10-01 by the reprex package (v0.3.0)
Now, I would like to change this plot so that it looks similar to a plot from https://sciolisticramblings.wordpress.com/2018/11/23/sankey-charts-the-new-pie-chart/, i.e. 1. change absolute to relative values (percentages), 2. add percentage labels and 3. apply a partial fill (e.g. "missing" and "never").
My approach:
I think I could change the axis to percentage with something like: scale_y_continuous(label = scales::percent_format(scale = 100))
However, I am not sure about step 2. and 3.
This could be achieved like so:
Changing to percentages could be achieved by adding a new column to your df with the percentage shares by survey, which can then be mapped on y instead of freq.
To get nice percentage labels you can make use of scale_y_continuous(label = scales::percent_format())
For the partial filling you can map e.g. response %in% c("Missing", "Never") on fill (which gives TRUE for "Missing" and "Never") and set the fill colors via scale_fill_manual
The percentages of each stratum can be added to the label via label = paste0(..stratum.., "\n", scales::percent(..count.., accuracy = .1)) in geom_text where I make use of the variables ..stratum.. and ..count.. computed by stat_stratum.
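The first step, percentage shares by survey, can be checked in isolation. A minimal sketch on a made-up frequency table (the wave names and counts are hypothetical, not the vaccinations data):

```r
library(dplyr)

responses <- data.frame(
  survey   = rep(c("wave 1", "wave 2"), each = 3),
  response = rep(c("Always", "Sometimes", "Never"), times = 2),
  freq     = c(50, 30, 20, 10, 30, 60)
)

# Share of each row within its own survey
shares <- responses %>%
  group_by(survey) %>%
  mutate(pct = freq / sum(freq)) %>%
  ungroup()

# The shares add up to 1 within each survey
totals <- tapply(shares$pct, shares$survey, sum)
```

Because every survey column then sums to 1, all strata stacks have the same height and the y axis can be read as a percentage.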
library(ggplot2)
library(ggalluvial)
library(dplyr)
data(vaccinations)
levels(vaccinations$response) <- rev(levels(vaccinations$response))
vaccinations <- vaccinations %>%
group_by(survey) %>%
mutate(pct = freq / sum(freq))
ggplot(vaccinations,
aes(x = survey, stratum = response, alluvium = subject,
y = pct,
fill = response %in% c("Missing", "Never"),
label = response)) +
scale_x_discrete(expand = c(.1, .1)) +
scale_y_continuous(label = scales::percent_format()) +
scale_fill_manual(values = c(`TRUE` = "cadetblue1", `FALSE` = "grey50")) +
geom_flow() +
geom_stratum(alpha = .5) +
geom_text(aes(label = paste0(..stratum.., "\n", scales::percent(..count.., accuracy = .1))), stat = "stratum", size = 3) +
theme(legend.position = "none") +
ggtitle("vaccination survey responses at three points in time")
Here is a dataframe
DF <- data.frame(SchoolYear = c("2015-2016", "2016-2017"),
Value = sample(c('Agree', 'Disagree', 'Strongly agree', 'Strongly disagree'), 50, replace = TRUE))
I have created this graph.
ggplot(DF, aes(x = Value, fill = SchoolYear)) +
geom_bar(position = 'dodge', aes(y = (..count..)/sum(..count..))) +
geom_text(aes(y = ((..count..)/sum(..count..)), label = scales::percent((..count..)/sum(..count..))),
stat = "count", vjust = -0.25, size = 2, position = position_dodge(width = 0.9)) +
scale_y_continuous(labels = scales::percent) +
ylab("Percent") + xlab("Response") +
theme(axis.text.x = element_text(angle = 75, hjust = 1))
Is there a way to make the data for each school year add up to 100%, but not have the data stacked, in the graph?
I know this is similar to the question "Create stacked barplot where each stack is scaled to sum to 100%", but I don't want the graph to be stacked, and I can't figure out how to apply that solution to this situation. I would also prefer not to summarize the data before graphing, as I have to make this graph many times with different data each time.
I'm not sure how to create the plot that you want without transforming the data. But if you want to re-use the same code for multiple datasets, you can write a function to transform your data and generate the plot at the same time:
plot.fun <- function (original.data) {
newDF <- reshape2::melt(apply(table(original.data), 1, prop.table))
Plot <- ggplot(newDF, aes(x=Value, y=value)) +
geom_bar(aes(fill=SchoolYear), stat="identity", position="dodge") +
geom_text(aes(group=SchoolYear, label=scales::percent(value)), stat="identity", vjust=-0.25, size=2, position=position_dodge(width=0.85)) +
scale_y_continuous(labels=scales::percent) +
ylab("Percent") + xlab("Response") +
theme(axis.text.x = element_text(angle = 75, hjust = 1))
return (Plot)
}
plot.fun(DF)
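The transformation inside plot.fun can also be written with prop.table's margin argument, which may be easier to read. A minimal sketch in base R, assuming the same two columns as DF (Value is made a factor here so that every response level appears even if it is never sampled):

```r
set.seed(1)
answers <- c("Agree", "Disagree", "Strongly agree", "Strongly disagree")
DF <- data.frame(
  SchoolYear = rep(c("2015-2016", "2016-2017"), 25),
  Value      = factor(sample(answers, 50, replace = TRUE), levels = answers)
)

# Row-wise proportions: each school year sums to 1
shares <- prop.table(table(SchoolYear = DF$SchoolYear, Value = DF$Value),
                     margin = 1)

# Long format, ready for geom_bar(stat = "identity", position = "dodge")
long <- as.data.frame(shares, responseName = "share")
```

Here long has the columns SchoolYear, Value and share, so it can replace newDF in the function above with y = share.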
Big Disclaimer: I would highly recommend you summarize your data beforehand and not try to do these calculations within ggplot. That is not what ggplot is meant to do. Doing so not only complicates your code unnecessarily, but can easily introduce bugs and unintended results.
Given that, it appears that what you want is doable (without summarizing first). A very hacky way to get what you want by doing the calculations within ggplot would be:
#Store factor values
fac <- unique(DF$SchoolYear)
ggplot(DF, aes(x = Value, fill = SchoolYear)) +
geom_bar(position = 'dodge', aes(y = (..count..)/stats::ave(..count.., get("fac", globalenv()), FUN = sum))) +
geom_text(aes(y = (..count..)/stats::ave(..count.., get("fac", globalenv()), FUN = sum), label = scales::percent((..count..)/stats::ave(..count.., get("fac", globalenv()), FUN = sum))),
stat = "count", vjust = -0.25, size = 2, position = position_dodge(width = 0.9)) +
scale_y_continuous(labels = scales::percent) +
ylab("Percent") + xlab("Response") +
theme(axis.text.x = element_text(angle = 75, hjust = 1))
This takes the ..count.. variable and divides it by the sum within its respective group using stats::ave. Note this can be messed up extremely easily.
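For readers unfamiliar with stats::ave, here is what it does outside of ggplot, on made-up counts: it applies FUN within each group and returns a vector as long as the input, so dividing by it gives within-group shares.

```r
count <- c(3, 1, 2, 4)
year  <- c("2015-2016", "2015-2016", "2016-2017", "2016-2017")

group_total <- stats::ave(count, year, FUN = sum)   # 4 4 6 6
share       <- count / group_total                  # 0.75 0.25 0.333 0.667
```

The result is only right when the grouping vector lines up row by row with the counts, which is exactly the part that is fragile inside aes().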
Finally, we check that the plot is in fact giving us what we want.
#Check to see we have the correct values
library(data.table)
d2 <- as.data.table(DF)[, .(count = .N), by = .(SchoolYear, Value)][, percent := count/sum(count), by = SchoolYear]
d2
I have a test dataset like this:
df_test <- data.frame(
proj_manager = c('Emma','Emma','Emma','Emma','Emma','Alice','Alice'),
proj_ID = c(1, 2, 3, 4, 5, 6, 7),
stage = c('B','B','B','A','C','A','C'),
value = c(15,15,20,20,20,70,5)
)
Preparation for viz:
input <- select(df_test, proj_manager, proj_ID, stage, value) %>%
filter(proj_manager=='Emma') %>%
do({
proj_value_by_manager = sum(distinct(., proj_ID, value)$value);
mutate(., proj_value_by_manager = proj_value_by_manager)
}) %>%
group_by(stage) %>%
do({
sum_value_byStage = sum(distinct(.,proj_ID,value)$value);
mutate(.,sum_value_byStage= sum_value_byStage)
}) %>%
mutate(count_proj = length(unique(proj_ID)))
commapos <- function(x, ...) {
format(abs(x), big.mark = ",", trim = TRUE,
scientific = FALSE, ...) }
Visualization:
ggplot (input, aes(x=stage, y = count_proj)) +
geom_bar(stat = 'identity')+
geom_bar(aes(y=-proj_value_by_manager),
stat = "identity", fill = "Blue") +
scale_y_continuous(labels = commapos)+
coord_flip() +
ylab('') +
geom_text(aes(label= sum_value_byStage), hjust = 5) +
geom_text(aes(label= count_proj), hjust = -1) +
labs(title = "Emma: 4 projects| $90M Values \n \n Commitment|Projects") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_hline(yintercept = 0, linetype =1)
My questions are:
Why are the y-values not showing up right? E.g. C is labeled 20, but the bar nearly reaches 100 on the scale.
How do I adjust the position of the labels so that each sits on top of its bar?
How do I rescale the y axis so that both the very short 'count of projects' bar and the long 'project value' bar are displayed well?
Thank you all for the help!
I think your issues are coming from the fact that:
(1) Your dataset has duplicated values. This causes geom_bar to add all of them together. For example there are 3 obs for B where proj_value_by_manager = 90 which is why the blue bar extends to 270 for that group (they all get added).
(2) in your second geom_bar you use y = -proj_value_by_manager but in the geom_text to label this you use sum_value_byStage. That's why the blue bar for A is extending to 90 (since proj_value_by_manager is 90) but the label reads 20.
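Both points can be verified with a few lines of base R on the question's df_test. This sketch reproduces what geom_bar(stat = "identity") does with repeated rows, namely stack (sum) them per stage:

```r
df_test <- data.frame(
  proj_manager = c('Emma','Emma','Emma','Emma','Emma','Alice','Alice'),
  proj_ID = c(1, 2, 3, 4, 5, 6, 7),
  stage = c('B','B','B','A','C','A','C'),
  value = c(15,15,20,20,20,70,5)
)

emma <- df_test[df_test$proj_manager == "Emma", ]
emma$proj_value_by_manager <- sum(emma$value)   # 90, repeated on every row

# What the blue bars show: the manager total added up once per row
stacked <- tapply(emma$proj_value_by_manager, emma$stage, sum)
# A: 90, B: 270, C: 90

# What the labels show: the value summed within each stage
by_stage <- tapply(emma$value, emma$stage, sum)
# A: 20, B: 50, C: 20
```

The mismatch between stacked and by_stage is the gap between bar length and label in the original plot.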
To get the chart I believe you want, you could do:
#Q1: De-duplicated dataset, so geom_bar doesn't erroneously add repeated rows together
input2 <- input[!duplicated(input[,-c(2,4)]),]
ggplot (input2, aes(x=stage, y = count_proj)) +
geom_bar(stat = 'identity')+
geom_bar(aes(y=-sum_value_byStage), #Q1: changed so this y-value matches your label
stat = "identity", fill = "Blue") +
scale_y_continuous(labels = commapos)+
coord_flip() +
ylab('') +
geom_text(aes(label= sum_value_byStage, y = -sum_value_byStage), hjust = 1) + #Q2: Added in y-value for label and hjust so it will be on top
geom_text(aes(label= count_proj), hjust = -1) +
labs(title = "Emma: 4 projects| $90M Values \n \n Commitment|Projects") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_hline(yintercept = 0, linetype =1)
For your last question, there is no good way to display both of these. One option would be to rescale the small data and still label it with a 1 or 3. However, I didn't do this because once you scale down the blue bars the other bars look OK to me.