I've got a question regarding an edge case with ggplot2 in R.
They don't like you adding multiple legends, but I think this is a valid use case.
I've got a large economic dataset with the following variables.
year = year of observation
input_type = *labor* or *supply chain*
input_desc = specific type of labor or supply (e.g. plumbers OR building supplies, respectively)
value = percentage of industry spending
And I'm building an area chart covering approximately 15 years. There are 39 different input descriptions, so I'd like the user to see the two major components (internal employee spending OR outsourcing/supply spending) in two major color families (say green and blue), but ggplot won't let me group my colors that way.
Here are a few things I tried.
Junk code to reproduce
spec_trend_pie<- data.frame("year"=c(2006,2006,2006,2006,2007,2007,2007,2007,2008,2008,2008,2008),
"input_type" = c("labor", "labor", "supply", "supply", "labor", "labor","supply","supply","labor","labor","supply","supply"),
"input_desc" = c("plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck"),
"value" = c(1,2,3,4,4,3,2,1,1,2,3,4))
spec_broad <- ggplot(data = spec_trend_pie, aes(y = value, x = year, group = input_type, fill = input_desc)) + geom_area()
Which gave me
Error in f(...) : Aesthetics can not vary with a ribbon
And then I tried this
sff4 <- ggplot() +
geom_area(data=subset(spec_trend_pie, input_type="labor"), aes(y=value, x=variable, group=input_type, fill= input_desc)) +
geom_area(data=subset(spec_trend_pie, input_type="supply_chain"), aes(y=value, x=variable, group=input_type, fill= input_desc))
Which gave me this image...so closer...but not quite there.
To give you an idea of what is desired, here's an example of something I was able to do in GoogleSheets a long time ago.
It's a bit of a hack but forcats might help you out. I did a similar post earlier this week:
How to factor sub group by category?
First some base data
library(tidyverse) # tibble, dplyr and forcats are all used below
set.seed(123)
raw_data <-
tibble(
x = rep(1:20, each = 6),
rand = sample(1:120, 120) * (x/20),
group = rep(letters[1:6], times = 20),
cat = ifelse(group %in% letters[1:3], "group 1", "group 2")
) %>%
group_by(group) %>%
mutate(y = cumsum(rand)) %>%
ungroup()
Now, use factor levels to create gradients within colors
df <-
raw_data %>%
# create factors for group and category
mutate(
group = fct_reorder(group, y, max),
cat = fct_reorder(cat, y, max) # ordering in the stack
) %>%
arrange(cat, group) %>%
mutate(
group = fct_inorder(group), # takes the category into account first
group_fct = as.integer(group), # factor as integer
hue = as.integer(cat)*(360/n_distinct(cat)), # base hue values
light_base = 1-(group_fct)/(n_distinct(group)+2), # fraction of full lightness; falls as the group rank rises
light = floor(light_base * 100) # new L value for hcl()
) %>%
mutate(hex = hcl(h = hue, l = light))
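The hue and lightness arithmetic above can be checked by hand. Here is a small base-R sketch that works it through for six groups in two categories (the same shape as raw_data). The names n_groups, n_cats, k, light and hue are illustrative and not part of the pipeline.

```r
# Worked example of the hue/lightness arithmetic, assuming 6 groups in 2 categories
n_groups <- 6
n_cats <- 2

k <- seq_len(n_groups)                      # group rank, like as.integer(group)
light <- floor((1 - k / (n_groups + 2)) * 100)
light
# 87 75 62 50 37 25: lighter for low ranks, darker for high ranks

# one base hue per category, repeated for the three groups inside it
hue <- rep(seq_len(n_cats) * (360 / n_cats), each = n_groups / n_cats)
hue
# 180 180 180 360 360 360

hex <- hcl(h = hue, l = light)              # chroma is left at hcl()'s default
```

The + 2 in the denominator keeps the darkest group above l = 0, so no shade collapses to black.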
Create a lookup table for scale_fill_manual()
area_colors <-
df %>%
distinct(group, hex)
Lastly, make your plot
ggplot(df, aes(x, y, fill = group)) +
geom_area(position = "stack") +
scale_fill_manual(
values = area_colors$hex,
labels = area_colors$group
)
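The same idea can be applied to the toy data from the question. What follows is only a sketch under a few assumptions: the green and blue endpoint colors are arbitrary picks, shades() is a hypothetical helper built on grDevices::colorRampPalette(), and the legend stays a single legend whose entries are grouped by color family.

```r
library(ggplot2)

spec_trend_pie <- data.frame(
  year = rep(2006:2008, each = 4),
  input_type = rep(c("labor", "labor", "supply", "supply"), times = 3),
  input_desc = rep(c("plumber", "manager", "pipe", "truck"), times = 3),
  value = c(1, 2, 3, 4, 4, 3, 2, 1, 1, 2, 3, 4)
)

# one row per description, sorted so each input type forms a contiguous block
lookup <- unique(spec_trend_pie[, c("input_type", "input_desc")])
lookup <- lookup[order(lookup$input_type, lookup$input_desc), ]

# hypothetical helper: n shades running from a light to a dark color
shades <- function(light, dark, n) grDevices::colorRampPalette(c(light, dark))(n)

n_by_type <- c(table(lookup$input_type))    # named counts: labor, supply
lookup$hex <- c(
  shades("#C7E9C0", "#006D2C", n_by_type[["labor"]]),   # greens for labor
  shades("#C6DBEF", "#08519C", n_by_type[["supply"]])   # blues for supply
)
pal <- setNames(lookup$hex, lookup$input_desc)

# factor levels follow the lookup, so stack and legend keep the types together
spec_trend_pie$input_desc <- factor(spec_trend_pie$input_desc,
                                    levels = lookup$input_desc)

p <- ggplot(spec_trend_pie, aes(x = year, y = value, fill = input_desc)) +
  geom_area(position = "stack") +
  scale_fill_manual(values = pal)
```

Passing a named vector to values ties each color to its description by name rather than by position, which is a little more robust than relying on row order.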
I simulated a dataset, did some data manipulation (in a very clumsy way) and made the following plot.
Simulate Data:
# Step 1 : Simulate Data
set.seed(123)
Hospital_Visits = sample.int(20, 5000, replace = TRUE)
Weight = rnorm(5000, 90, 10)
disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease <- as.factor(disease)
my_data = data.frame(Weight, Hospital_Visits, Disease)
my_data$hospital_ntile <- cut(my_data$Hospital_Visits, breaks = c(0, 5, 10, Inf), labels = c("Less than 5", "5 to 10", "More than 10"), include.lowest = TRUE)
Data Manipulation:
# Step 2: Data Manipulation:
my_data$weight_ntile <- cut(my_data$Weight, breaks = seq(min(my_data$Weight), max(my_data$Weight), by = (max(my_data$Weight) - min(my_data$Weight)) / 10), include.lowest = TRUE)
# Create a dataset for rows where hospital_ntile = 'Less than 5'
df1 <- subset(my_data, hospital_ntile == "Less than 5")
# Create a dataset for rows where hospital_ntile = '5 to 10'
df2 <- subset(my_data, hospital_ntile == "5 to 10")
# Create a dataset for rows where hospital_ntile = 'More than 10'
df3 <- subset(my_data, hospital_ntile == "More than 10")
avg_disease_rate_df1 <- tapply(df1$Disease == "Yes", df1$weight_ntile, mean)
avg_disease_rate_df2 <- tapply(df2$Disease == "Yes", df2$weight_ntile, mean)
avg_disease_rate_df3 <- tapply(df3$Disease == "Yes", df3$weight_ntile, mean)
avg_disease_rate_df1[is.na(avg_disease_rate_df1)] <- 0
avg_disease_rate_df2[is.na(avg_disease_rate_df2)] <- 0
avg_disease_rate_df3[is.na(avg_disease_rate_df3)] <- 0
#transform into dataset
names = names(avg_disease_rate_df1)
rate_1 = as.numeric(avg_disease_rate_df1)
rate_2 = as.numeric(avg_disease_rate_df2)
rate_3 = as.numeric(avg_disease_rate_df3)
# stack data
d1 = data.frame(class = "Less than 5", names = names, rate = rate_1)
d2 = data.frame(class = "5 to 10", names = names, rate = rate_2)
d3 = data.frame(class = "More than 10", names = names, rate = rate_3)
plot_data = rbind(d1, d2, d3)
Make Plot:
library(ggplot2)
ggplot(plot_data, aes(x=names, y=rate, group = class, color=class)) + geom_point() + geom_line() + theme_bw()
For some reason, the categories on the x-axis are not in order: they seem to appear randomly, and I would like them sorted from smallest to largest.
I consulted some references which showed how to change this manually - but is there some option within ggplot2 that allows for this ordering to be automatically corrected?
Thanks!
EDIT - you can either do this at the 1) plotting step or at the 2) data manipulation step.
1) - Option for Doing it at the Plotting Step
I think the easiest way to do this is to turn your x-axis variable into a factor and order it.
Right now it's a character
str(plot_data)
'data.frame': 30 obs. of 3 variables:
$ class: chr "Less than 5" "Less than 5" "Less than 5" "Less than 5" ...
$ names: chr "[52.6,59.9]" "(59.9,67.2]" "(67.2,74.5]" "(74.5,81.8]" ...
$ rate : num 0.6 0.1 0.339 0.399 0.438 ...
So you can make it into a factor and then check the levels:
plot_data$names <- as.factor(plot_data$names)
levels(plot_data$names)
which shows them in alphabetical rather than numeric order:
[1] "(104,111]" "(111,118]" "(118,125]" "(59.9,67.2]" "(67.2,74.5]" "(74.5,81.8]" "(81.8,89.1]" "(89.1,96.3]" "(96.3,104]"
[10] "[52.6,59.9]"
You can re-level them then with the forcats library (there are other options but I like this one):
plot_data$names <- forcats::fct_relevel(plot_data$names,
c("[52.6,59.9]", "(59.9,67.2]", "(67.2,74.5]",
"(74.5,81.8]", "(81.8,89.1]", "(89.1,96.3]",
"(96.3,104]", "(104,111]", "(111,118]" ))
And then your plot will look like this:
ggplot(plot_data, aes(x=names, y=rate, group = class, color=class)) +
geom_point() +
geom_line() +
theme_bw()
2) - Option for Doing it at the Data Manipulation Step
When you create weight_ntile with cut(), you can request an ordered factor by passing ordered_result = TRUE. The rest of the original data manipulation then discards that ordering, because names() on the tapply() result returns plain character labels. Instead, you could do everything in one pipeline with dplyr and tidyr. Here's a way to do that:
my_data$weight_ntile <- cut(my_data$Weight,
breaks = seq(min(my_data$Weight),
max(my_data$Weight),
by = (max(my_data$Weight) - min(my_data$Weight)) / 10),
include.lowest = TRUE,
ordered_result = TRUE)
plot_data <- my_data %>%
# group by your variables of interest
dplyr::group_by(hospital_ntile, weight_ntile, Disease) %>%
# find number of obs for each group
dplyr::summarize(
count = n()
) %>%
# move these from long to wide for easier calculation
tidyr::pivot_wider(
names_from = "Disease", values_from = "count"
) %>%
dplyr::rowwise() %>%
# make the rate calculation, na.rm = TRUE is important for the edge case of
# only one observation (gives the 1.00 rate for 52.6 and 5 to 10)
dplyr::summarize(
rate = Yes / sum(Yes, No, na.rm = TRUE)
) %>%
# need to ungroup to use complete()
ungroup() %>%
# use complete to fill in what would be NAs, then replace with zeros
# instead of NA's (that's the fill argument), this replaces your
# `avg_disease_rate_df1[is.na(avg_disease_rate_df1)] <- 0` etc step
tidyr::complete(., hospital_ntile, weight_ntile, fill = list(rate = 0)) %>%
# rename to what you use for the plotting step
dplyr::rename(
class = hospital_ntile, names = weight_ntile
)
And then the plot can be made the same way:
ggplot(plot_data, aes(x=names, y=rate, group = class, color=class)) +
geom_point() +
geom_line() +
theme_bw()
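To see why cut() preserves the bin order while a round trip through character labels loses it, here is a small base-R sketch. The weights and the breaks are made up for illustration.

```r
# Hypothetical weights, chosen only to land in different bins
w <- c(55, 61, 72, 104, 118)

bins <- cut(w, breaks = seq(50, 120, by = 10),
            include.lowest = TRUE, ordered_result = TRUE)

is.ordered(bins)              # TRUE
levels(droplevels(bins))      # interval order, lowest bin first

# Converting to character and back re-sorts the labels as text
as_text <- factor(as.character(bins))
levels(as_text)               # "(100,110]" now sorts ahead of "(60,70]"
```

Anything that turns the factor into plain strings (names(), as.character(), writing to CSV and reading back) has the same effect, so the ordering has to be kept as a factor all the way to the plot.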
Here's a shorter workflow that keeps the ordering by using ordered_result = TRUE in your cut() steps and relying on a simpler dplyr::group_by() grouped aggregation:
my_data$hospital_ntile <- cut(my_data$Hospital_Visits, breaks = c(0, 5, 10, Inf),
labels = c("Less than 5", "5 to 10", "More than 10"),
ordered_result = TRUE, include.lowest = TRUE)
my_data$weight_ntile <- cut(my_data$Weight,
breaks = seq(min(my_data$Weight),
max(my_data$Weight),
by = (max(my_data$Weight) - min(my_data$Weight)) / 10),
include.lowest = TRUE, ordered_result = TRUE)
my_data |>
dplyr::group_by(hospital_ntile, weight_ntile) |>
dplyr::summarize(avg_disease_rate = mean(Disease == "Yes"), .groups = "drop") |>
tidyr::complete(hospital_ntile, weight_ntile, fill = list(avg_disease_rate = 0)) |>
ggplot(aes(weight_ntile, avg_disease_rate, color = hospital_ntile, group = hospital_ntile)) +
geom_line()
Starting from your plot_data, we can extract the start and end into separate columns, and then make the names be an ordered factor in order of the parsed numeric value of start.
The reason your original plot isn't showing up as you expect is that the ranges are stored as text (i.e. character data), and ggplot2 places character values on a discrete axis in alphabetical order.
plot_data |>
tidyr::separate(names, c("start", "end"), sep = ",", remove = FALSE) |>
dplyr::mutate(names = forcats::fct_reorder(names, readr::parse_number(start))) |>
ggplot(aes(x=names, y=rate, group = class, color=class)) +
geom_point() + geom_line() + theme_bw()
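If you would rather not pull in tidyr and readr for this, the same reordering can be sketched in base R. This assumes every label looks like "(a,b]" or "[a,b]", so the lower bound is whatever sits between the opening bracket and the comma. The four labels below are a subset used for illustration.

```r
labels <- c("(104,111]", "(59.9,67.2]", "[52.6,59.9]", "(96.3,104]")

# drop the opening bracket, then everything from the comma onwards
lower <- as.numeric(sub(",.*", "", substring(labels, 2)))

# use the numeric lower bound to define the factor level order
ordered_labels <- factor(labels, levels = labels[order(lower)])
levels(ordered_labels)
# "[52.6,59.9]" "(59.9,67.2]" "(96.3,104]" "(104,111]"
```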
You can also simply bin your data using round (Plot 1). Or create custom bins with plyr::round_any.
However, I think what you're effectively asking for is a moving average which is commonly used in time series. And here lies the crux of the problem: By visualising averages over ranges of your weight (your x) you are suggesting a dimension of your data that simply doesn't exist (a relation/dependency within your weights).
Plot 2 shows an option of showing the moving average of your aggregated rates based on your rounded weight bins (using tidyquant::geom_ma). Although it "averages averages", I think it is a cleaner way to address your approach.
Plot 3 is a suggestion for a different visualisation altogether. It does not require any data manipulation and will therefore be the most accurate representation of your data. The story is getting a different twist though.
library(tidyverse)
df_rate <- my_data %>%
## count yes/no disease for rounded weights
mutate(weight = round(Weight)) %>%
count(hospital_ntile, weight, Disease) %>%
## calculate rate per weight, this does not take into account number of visits!
group_by(weight) %>%
mutate(rate = sum(n[Disease == "Yes"]) / sum(n)) # weight by the counts from count(), not by the number of rows
## plot the rate directly
ggplot(df_rate, aes(x = weight, y = rate, color = hospital_ntile)) +
geom_line()
## arguably better to use a moving average
ggplot(df_rate, aes(x = weight, y = rate, color = hospital_ntile)) +
tidyquant::geom_ma(n = 10, lty = 1)
# however, with the above visualisation you're suggesting a relation that does not exist.
## What you (possibly) actually want to show is how weight is distributed among people with and without the disease. (?)
## Better then do a scatter plot, or show the distribution in a violin plot
ggplot(my_data, aes(x = hospital_ntile, y = Weight)) +
ggforce::geom_sina(alpha = .5, stroke = 0, size = 1) +
labs(x = "Hospital visits") +
facet_grid(~Disease)
Please help!
I have case data I need to prepare for a report soon and just cannot get the graphs to display properly.
From a dataset with CollectionDate as the "record" of cases (i.e. multiple rows with the same date means more cases that day), I want to display Number of positive cases/total (positive + negative) cases for that day as a percent on the y-axis, with collection dates along the x-axis. Then I want to break down by region. Goal is to look like this but in terms of daily positives/# of tests rather than just positives vs negatives. I also want to add a horizontal line on every graph at 20%.
I have tried manipulating it before, in and after ggplot:
ggplot(df_final, aes(x =CollectionDate, fill = TestResult)) +
geom_bar(aes(y=..prop..)) +
scale_y_continuous(labels=percent_format())
Which is, again, close. But the percents are wrong because they are just taking the proportion of that day against counts of all days instead of per day.
Then I tried using tally() in the following command to count per region and aggregate:
df_final %>%
group_by(CollectionDate, Region, as.factor(TestResult)) %>%
filter(TestResult == "Positive") %>%
tally()
and I still cannot get the graphs right.
Suggestions?
A quick look at my data:
head(df_final)
Well, I have to say that I am not 100% sure that I got what you want, but anyway, this can be helpful.
The data: Since you are new here, I should let you know that providing a simple, reproducible version of your data makes it easier for the rest of us to answer. To do this you can simulate a data frame or any other object, or use the dput() function on it.
library(ggplot2)
library(dplyr)
data <- data.frame(
# date
CollectionDate = sample(
seq(as.Date("2020-01-01"), by = "day", length.out = 15),
size = 120, replace = TRUE),
# result
TestResult = sample(c("Positive", "Negative"), size = 120, replace = TRUE),
# region
Region = sample(c("Region 1", "Region2"), size = 120, replace = TRUE)
)
With this data, you can do as follows to get the plots you want.
# General plot, positive cases proportion
data %>%
count(CollectionDate, TestResult, name = "cases") %>%
group_by(CollectionDate) %>%
summarise(positive_pro = sum(cases[TestResult == "Positive"])/sum(cases)) %>%
ggplot(aes(x = CollectionDate, y = positive_pro)) +
geom_col() +
geom_hline(yintercept = 0.2)
# positive proportion by day within region
data %>%
count(CollectionDate, TestResult, Region, name = "cases") %>%
group_by(CollectionDate, Region) %>%
summarise(
positive_pro = sum(cases[TestResult == "Positive"])/sum(cases)
) %>%
ggplot(aes(x = CollectionDate, y = positive_pro)) +
geom_col() +
# horizontal line at 20%
geom_hline(yintercept = 0.2) +
facet_wrap(~Region)
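As a quick sanity check of the per-day calculation, the share of positives on a day is just the mean of a logical "is positive" flag within that day. A minimal base-R sketch on made-up data (toy, positive and daily are illustrative names):

```r
# Six hypothetical tests over two days
toy <- data.frame(
  CollectionDate = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01",
                             "2020-01-01", "2020-01-02", "2020-01-02")),
  TestResult = c("Positive", "Negative", "Negative",
                 "Negative", "Positive", "Positive")
)

# TRUE/FALSE flag; its mean within a day is positives / total tests that day
toy$positive <- toy$TestResult == "Positive"
daily <- aggregate(positive ~ CollectionDate, data = toy, FUN = mean)
daily$positive
# 0.25 1.00
```

Day one has one positive out of four tests and day two has two out of two, so the denominators are per day rather than across all days.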
I can get you halfway there (refer to the comments in the code for clarifications). This code is for the counts per day per region (plotted separately for each region). I think you can tweak things further to calculate the counts per day per county too; and whole state should be a cakewalk. I wish you good luck with your report.
rm(list = ls())
library(dplyr)
library(magrittr)
library(ggplot2)
library(scales)
library(tidyr) #Needed for the spread() function
#Dummy data
set.seed(1984)
sdate <- as.Date('2000-03-09')
edate <- as.Date('2000-05-18')
dateslist <- as.Date(sample(as.numeric(sdate): as.numeric(edate), 10000, replace = TRUE), origin = '1970-01-01')
df_final <- data.frame(Region = rep_len(1:9, 10000),
CollectionDate = dateslist,
TestResult = sample(c("Positive", "Negative"), 10000, replace = TRUE))
#First tally the positive and negative cases
#by Region, CollectionDate, TestResult in that order
df_final %<>%
group_by(Region, CollectionDate, TestResult) %>%
tally()
#Then
#First spread the counts (in n)
#That is, create separate columns for Negative and Positive cases
#for each Region-CollectionDate combination
#Then calculate their proportions (as shown)
#Now you have Negative and Positive
#percentages by CollectionDate by Region
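df_final %<>%
  spread(key = TestResult, value = n) %>%
  #Compute the total first: mutate() evaluates its arguments in order,
  #so overwriting Negative before using it would corrupt Positive
  mutate(Total = Negative + Positive,
         Negative = Negative / Total,
         Positive = Positive / Total)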
#Plotting this now
#Since the percentages are available already
#Use geom_col() instead of geom_bar()
df_final %>% ggplot() +
geom_col(aes(x = CollectionDate, y = Positive, fill = "Positive"),
position = "identity", alpha = 0.4) +
geom_col(aes(x = CollectionDate, y = Negative, fill = "Negative"),
position = "identity", alpha = 0.4) +
facet_wrap(~ Region, nrow = 3, ncol = 3)
This yields:
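A note on the proportion step above: dplyr::mutate() evaluates its expressions in order, so a later expression sees any column that an earlier one has already overwritten. The sketch below uses a single made-up row (30 negatives, 10 positives) to show the difference between reusing an overwritten column and computing the total first.

```r
library(dplyr)

counts <- data.frame(Negative = 30, Positive = 10)

# Negative is already a proportion (0.75) when the second expression runs
overwritten <- counts %>%
  mutate(Negative = Negative / (Negative + Positive),
         Positive = Positive / (Negative + Positive))

# Storing the denominator first keeps both proportions on the same base
with_total <- counts %>%
  mutate(Total = Negative + Positive,
         Negative = Negative / Total,
         Positive = Positive / Total)

overwritten$Positive   # 10 / (0.75 + 10), roughly 0.93
with_total$Positive    # 0.25
```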
I am using ggforce to create a plot like this.
My goal is to facet this type of plot.
For background on how the chart was made, check out update 3 on this question. The only modification that I have made was adding a geom_segment between the x axis and the Y value positions.
The reason why I believe faceting this graph is either difficult, or even impossible, is because continuous value x coordinates are used to determine where the geom_arc_bar is positioned in space.
My only idea for getting this to work has been supplying each "characteristic" that I want to facet with a set of x coordinates (1,2,3). Initially, as I will demonstrate in my code, I worked with a set of highly curated data. Ideally, I would like to scale this to a dataset with many variables.
In the example graph that I have provided, the Y value is from table8, filtered for rows with "DFT". The area of the half-circles is proportional to the values of DDFS and FDFS from table9. Ideally, I would like to be able to create a function allowing for the easy creation of these graphs, with perhaps 3 parameters, the data for the y value, and for both half circles.
Here is my data.
Here is the code that I have written thus far.
For making a single plot
#Filter desired Age and Measurement
table9 %>%
filter(Age == "6-11" & Measurement != 'DFS' ) %>%
select( SurveyYear, Total , Measurement ) %>%
arrange(SurveyYear) %>%
dplyr::rename(Percent = Total) -> table9
#Do the same for table 8.
table8 %>%
filter(Age == "6-11" & Measurement != "DS" & Measurement != "FS") %>%
select(SurveyYear, Total) %>%
dplyr::rename(Y = Total)-> table8
table8 <- table8 %>%
bind_rows(table8) %>%
arrange(Y) %>%
add_column(start = rep(c(-pi/2, pi/2), 3), x = c(1,1,2,2,3,3))
table8_9 <- bind_cols(table8,table9) %>%
select(-SurveyYear1)
#Create the plot
ggplot(table8_9) + geom_segment( aes(x=x, xend=x, y=0, yend=Y), size = 0.5, linetype="solid") +
geom_arc_bar(aes(x0 = x, y0 = Y, r0 = 0, r = sqrt((Percent*2)/pi)/20,
start = start, end = start + pi, fill = Measurement),
color = "black") + guides(fill = guide_legend(title = "Type", reverse = T)) +
guides(fill = guide_legend(title = "Measurement", reverse = F)) +
xlab("Survey Year") + ylab("Mean dfs") + coord_fixed() + theme_pubr() +
scale_y_continuous(expand = c(0, 0), limits = c(0, 5.5)) +
scale_x_continuous(breaks = 1:3, labels = paste0(c("1988-1994", "1999-2004", "2011-2014"))) +
scale_fill_discrete(labels = c("ds/dfs", "fs/dfs")) -> lolliPlot
lolliPlot
Attempt at many plots
#Filter for "DFS"
table8 <- table8 %>%
filter(Measurement=="DFS")
#Duplicate DF vertically, and add column specifying the start point for the arcs.
table8 <- table8 %>%
bind_rows(table8) %>%
add_column(start = rep(c(-pi/2, pi/2), length(.$SurveyYear)/2), x = rep(x = c(1,2,3),length(.$SurveyYear)/3)) %>%
arrange(Age, x)
#Bind the two tables together, removing all of the characteristic columns from table 8.
table8_9 <- bind_cols(table8,table9) %>%
select(-Age1, -SurveyYear1, -Measurement) %>%
gather(key = Variable, value = Y, -x,-start,-Age, -SurveyYear, -Measurement1, -Total1, -Male1, -Female1, -'White, non-Hispanic1', -'Black, non-hispanic1', -'Mexican American1', -'Less than 100% FPG1', -'100-199% FPG1', -'Greater than 200% FPG1')
This is where I get stuck. I can't figure out a way to format the data so that I can facet the graph. If anybody has any ideas or advice, I would greatly appreciate it.
I'm trying to generate a stacked line/area graph using ggplot and geom_area. As far as I can tell, my data is loaded into R correctly. Every time I generate the plot, the graph is empty (even though the axes look correct, except that the months are in alphabetical order).
I've tried utilizing the data.frame function to define my variables but was unable to generate my plot. I've also looked around Stack Overflow and other websites, but no one seems to have the issue of no errors but still an empty plot.
Here's my data set:
Here's the code I'm using currently:
ggplot(OHV, aes(x=Month)) +
geom_area(aes(y=A+B+Unknown, fill="A")) +
geom_area(aes(y=B, fill="B")) +
geom_area(aes(y=Unknown, fill="Unknown"))
Here's the output at the end:
I have zero error messages, simply just no data being plotted on my graph.
Your dates are being interpreted as a factor. You must transform them.
library(tidyverse)
set.seed(1)
df <- data.frame(Month = seq(lubridate::ymd('2018-01-01'),
lubridate::ymd('2018-12-01'), by = '1 month'),
Unknown = sample(17, replace = TRUE, size = 12),
V1 = floor(runif(12, min = 35, max = 127)),
V2 = floor(runif(12, min = 75, max = 275)))
df <- df %>%
dplyr::mutate(Month = format(Month, '%b')) %>%
tidyr::gather(key = "Variable", value = "Value", -Month)
ggplot2::ggplot(df) +
geom_area(aes(x = Month, y = Value, fill = Variable),
position = 'stack')
Note that I used tidyr::gather to be able to stack the areas in an easier way.
Now, assuming your year of analysis is 2018, you need to convert the month column of your data frame into something R treats as continuous, such as a Date.
df2 <- df %>%
dplyr::mutate(Month = paste0("2018-", Month, "-01"),
Month = lubridate::parse_date_time(Month,"y-b-d"),
Month = as.Date(Month))
library(scales)
ggplot2::ggplot(df2) +
geom_area(aes(x = Month, y = Value, fill = Variable),
position = 'stack') +
scale_x_date(labels = scales::date_format("%b"))
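If there is no meaningful year to attach, one alternative (a sketch, not what the code above does) is to map the month abbreviations to their position in R's built-in month.abb vector and plot that number. The ohv data frame and its values here are made up, and the approach assumes English three-letter month names.

```r
library(ggplot2)

# Hypothetical monthly totals with no year attached
ohv <- data.frame(
  Month = c("Jan", "Feb", "Mar", "Apr"),
  Value = c(10, 14, 9, 12)
)

# month.abb is R's built-in vector of English month abbreviations
ohv$Month <- factor(ohv$Month, levels = month.abb)
ohv$month_num <- as.integer(ohv$Month)   # 1 2 3 4: a continuous x for geom_area()

p <- ggplot(ohv, aes(x = month_num, y = Value)) +
  geom_area() +
  # label the numeric positions with the month names again
  scale_x_continuous(breaks = 1:12, labels = month.abb)
```

This fixes both symptoms at once: the x values are continuous, so the area is drawn, and the calendar order comes from the factor levels rather than from alphabetical sorting.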
I have a data set similar to the one below where I have a lot of data for certain groups and then only single observations for other groups. I would like my single observations to show up as points but the other groups with multiple observations to show up as lines (no points). My code is below:
EDIT: I'm attempting to find a way to do this without using multiple datasets in the geom_* calls because of the issues it causes with the legend. There was an answer that has since been deleted that was able to handle the legend but didn't get rid of the points on the lines. I would potentially like a single legend with points only showing up if they are a single observation.
library(tidyverse)
dat <- tibble(x = runif(10, 0, 5),
y = runif(10, 0, 20),
group = c(rep("Group1", 4),
rep("Group2", 4),
"Single Point 1",
"Single Point 2")
)
dat %>%
ggplot(aes(x = x, y = y, color = group)) +
geom_point() +
geom_line()
Created on 2019-04-02 by the reprex package (v0.2.1)
Only plot the data with 1 point in geom_point() and the data with >1 point in geom_line(). These can be precalculated in mutate().
dat = dat %>%
group_by(group) %>%
mutate(n = n() )
dat %>%
ggplot(aes(x = x, y = y, color = group)) +
geom_point(data = filter(dat, n == 1) ) +
geom_line(data = filter(dat, n > 1) )
Having the legend match this is trickier. This is the sort of thing that the override.aes argument in guide_legend() can be useful for.
In your case I would separately calculate the number of observations in each group first, since that is what the line vs point is based on.
sumdat = dat %>%
group_by(group) %>%
summarise(n = n() )
The result is in the same order as the factor levels in the legend, which is why this works.
Now we need to remove lines and keep points whenever the group has only a single observation. A linetype of 0 stands for a blank line and a shape of NA stands for no shape. I use an ifelse() statement for linetype and shape in override.aes, based on the number of observations per group.
dat %>%
ggplot(aes(x = x, y = y, color = group)) +
geom_point(data = filter(dat, n == 1) ) +
geom_line(data = filter(dat, n > 1) ) +
guides(color = guide_legend(override.aes = list(linetype = ifelse(sumdat$n == 1, 0, 1),
shape = ifelse(sumdat$n == 1, 19, NA) ) ) )
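Because override.aes is matched to the legend entries by position, it can help to inspect the override vectors with the group names still attached. A base-R sketch using the same group labels as the example data (n_per_group, legend_linetype and legend_shape are illustrative names):

```r
group <- c(rep("Group1", 4), rep("Group2", 4),
           "Single Point 1", "Single Point 2")

# c() turns the table into a plain named vector, sorted by group name
n_per_group <- c(table(group))

legend_linetype <- ifelse(n_per_group == 1, 0, 1)   # 0 = blank line
legend_shape <- ifelse(n_per_group == 1, 19, NA)    # NA = no point

legend_linetype
legend_shape
```

The names on both vectors should read Group1, Group2, Single Point 1, Single Point 2, which is the order a character grouping variable gets in the legend. If group were a factor with custom levels, the vectors would need to follow those levels instead.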