Multiple time series with ggplot2 - r

I need to make some plots for work and I've been learning to use ggplot2, but I can't quite figure out how to get it to work with the dataset I'm using. I can't post my actual data here, but can give a brief example of what it is like. I have two main dataframes; one contains quarterly total revenue for a variety of companies and the other contains quarterly revenue for various segments within each company. For example:
Quarter, CompA, CompB, CompC...
2011.0, 1, 2, 3...
2011.25, 2, 3, 4...
2011.5, 3, 4, 5...
2011.75, 4, 5, 6...
2012.0, 5, 6, 7...
and
Quarter, CompA_Footwear, CompA_Apparel, CompB_Wholesale...
2011.0, 1, 2, 3...
2011.25, 2, 3, 4...
2011.5, 3, 4, 5...
2011.75, 4, 5, 6...
2012.0, 5, 6, 7...
The script I've been building loops through each company in the first table and uses select() to grab all of the columns in the second table, so for the purposes of this question, forget about the other companies and assume that the first table is just CompA and the second table is all of the different CompA segments.
What I'm trying to do is for each segment, create a line plot that has both the total company revenue and the segment revenue charted over time. Something like this is what it would look like. Ideally, I'd like to be able to use a facet_wrap() or something to be able to make all the different graphs for each segment at once, but that's not absolutely necessary. To clarify, each individual graph should only have two lines: the overall company and one specific segment.
I'm fine with having to restructure my data in any way necessary. Does anyone know how I can get this to work?

I think the below should work. Note that you need to move data around a fair bit.
# Load packages
library(dplyr)
library(ggplot2)
library(reshape2)
library(tidyr)
Make a reproducible data set:
# Create companies
# Could pull this from column names in your data
companies <- paste0("Comp",LETTERS[1:4])
set.seed(12345)
sepData <-
lapply(companies, function(thisComp){
nDiv <- sample(3:6,1)
temp <-
sapply(1:nDiv,function(idx){
round(rnorm(24, rnorm(1,100,25), 6))
}) %>%
as.data.frame() %>%
setNames(paste(thisComp,sample(letters,nDiv), sep = "_"))
}) %>%
bind_cols()
sepData$Quarter <-
rep(2010:2015
, each = 4) +
(0:3)/4
meltedSep <-
melt(sepData, id.vars = "Quarter"
, value.name = "Revenue") %>%
separate(variable
, c("Company","Division")
, sep = "_") %>%
mutate(Division = factor(Division
, levels = c(sort(unique(Division))
, "Total")))
fullCompany <-
meltedSep %>%
group_by(Company, Quarter) %>%
summarise(Revenue = sum(Revenue)) %>%
mutate(Division = factor("Total"
, levels = levels(meltedSep$Division)))
The plot you say you want is here. Note that you need to set Divison = NULL to prevent the total from showing up in its own facet:
theme_set(theme_minimal())
catch <- lapply(companies, function(thisCompany){
tempPlot <-
meltedSep %>%
filter(Company == thisCompany) %>%
ggplot(aes(y = Revenue
, x = Quarter)) +
geom_line(aes(col = "Division")) +
facet_wrap(~Division) +
geom_line(aes(col = "Total")
, fullCompany %>%
filter(Company == thisCompany) %>%
mutate(Division = NULL)
) +
ggtitle(thisCompany) +
scale_color_manual(values = c(Division = "darkblue"
, Total = "green3"))
print(tempPlot)
})
Example of the output:
Note, however, that that looks sort of terrible. The difference between the "Total" and any one division is always going to be huge. Instead, you may want to just plot all the divisions on one plot:
allData <-
bind_rows(meltedSep, fullCompany)
catch <- lapply(companies, function(thisCompany){
tempPlot <-
allData %>%
filter(Company == thisCompany) %>%
ggplot(aes(y = Revenue
, x = Quarter
, col = Division)) +
geom_line() +
ggtitle(thisCompany)
# I would add manual colors here, assigned so that, e.g. "Clothes" is always the same
print(tempPlot)
})
Example:
The difference between Total and each is still large, but at least you can compare the divisions.
If it were mine to make though, I would probably make two plots. One with each division from each company (faceted) and one with the totals:
meltedSep %>%
ggplot(aes(y = Revenue
, x = Quarter
, col = Division)) +
geom_line() +
facet_wrap(~Company)
fullCompany %>%
ggplot(aes(y = Revenue
, x = Quarter
, col = Company)) +
geom_line()

There are two other ways I can think to do it using facet_wrap() that are a little more bare-bones:
using annotate() in ggplot2 (simple approach)
doubling your data frames for each company (still relatively simple, just more prone to errors)
Either way, let's recreate your two data frames so that we can reproduce your example:
First create the "total company revenue" data frame:
Quarter <- seq(2011, 2012, by = .25)
CompA <- as.integer(runif(5, 5, 15))
CompB <- as.integer(runif(5, 6, 16))
CompC <- as.integer(runif(5, 7, 17))
df1 <- data.frame(Quarter, CompA, CompB, CompC)
Next, the "segment revenue" data frame of Company A:
CompA_Footwear <- as.integer(runif(5, 0, 5))
CompA_Apparel <- as.integer(runif(5,1 , 6))
CompA_Wholesale <- as.integer(runif(5, 2, 7))
df2 <- data.frame(Quarter, CompA_Footwear, CompA_Apparel, CompA_Wholesale)
Now we will re-arrage your data to be something more recognizable for ggplot2 using melt() from reshape2
require(reshape2)
melt.df1 <- melt(df1, id = "Quarter")
melt.df2 <- melt(df2, id = "Quarter")
df <- rbind(melt.df1, melt.df2)
We are mostly ready to graph now. For sake of example, I'll only focus on "Company A"
Using annotate()
Subset the data so that it only contains "segment revenue" for Company A
CompA.df2 <- df[grep("CompA_", df$variable),]
This assumes all your segment revenue is coded starting with "CompA_*". You will have to subset according to your data.
Now plot:
require(ggplot2)
ggplot(data = CompA.df2, aes(x = Quarter, y = value,
group = variable, colour = variable)) +
geom_line() +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
facet_wrap(~variable) + # Facets by segment
# Next, adds the total revenue data as an annotation
annotate(geom = "line", x = Quarter, y = df1$CompA) +
annotate(geom = "point", x = Quarter, y = df1$CompA)
Basically, we are just annotating the graph with a line and points from our original "total company revenue" data frame for Company A. The major downside to this is the lack of a legend.
The second approach will produce a legend for all values
Duplicating your data
The way facet_wrap() works, we need to define the same facet variables for each of the intended plotted lines on each facet. So we are going to replicate our total revenue for each "segment revenue" level, and group each of these pairs together.
Using the same data frames as above, we are going to separate out the Total Company A Revenue and the Segment Revenue of Company A
CompA.df1 <- df[which(df$variable == "CompA"),] # Total Company A Revenue
CompA.df2 <- droplevels(df[grep("CompA_", df$variable),]) # Segment Revenue of Company A
Now repeat the total revenue data frame for Company A based on how many levels we have for the "Segment Revenue"
rep.CompA.df1 <- CompA.df1[rep(seq_len(nrow(CompA.df1)), nlevels(CompA.df2$variable)), ]
This might be prone to errors if you have NA's or NaN's
Now merge the repeated data frame, and add a facet variable (facet.var here) to pair these together.
CompA.df3 <- rbind(rep.CompA.df1, CompA.df2)
CompA.df3$facet.var <- rep(CompA.df2$variable,2)
Now you are ready to graph. You can still define group = variable, but this time we will set facet_wrap() to our newly created facet.var
require(ggplot2)
ggplot(data = CompA.df3, aes(x = Quarter, y = value,
group = variable, colour = variable)) +
geom_line() +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
facet_wrap(~facet.var)
As you can see, we now have our "Total Revenue" added to the legend:
That plot's a real beaut

Related

Plotting multiple line graphs in R

I am trying to create a plot in R that shows post-surgical outcomes over time. Each row in the dataframe has up to 8 measurements at different time points post-surgery (with some missing values), and for each row, I want to create a line graph that shows the change in the measurement over time. Here is an example dataframe:
dat <- data.frame(Preop=c(-2,0.5,-0.25,1.5), PO_1M=c(-1.5,0.2,-0.1,1.0), PO_6M=c(-1.2,0.1,-0.05,0.5), PO_1Y=c(-1.0,0.05,0,0.25))
dat
I have tried the following code to rearrange the data and get a plot with points over time, but I want to change this so that each row is maintained and can create a line graph.
library(tidyverse)
dat2<-dat %>% tidyr::pivot_longer(cols=Preop:PO_1Y)
dat2$nummonths<-ifelse(dat2$name=='Preop',0,
ifelse(dat2$name=='PO_1M',1,
ifelse(dat2$name=='PO_6M',6,
ifelse(dat2$name=='PO_1Y',12,NA))))
ggplot(dat2, aes(nummonths,value))+geom_point()
I want the graph to look something like this:
Currently, I have the points plotted, but I do not know how to connect these points to create a line graph. Thanks so much for any help!
You need an id of sorts in the data in order to group by it and plot it accordingly. Here's a dplyr suggestion to get you started:
library(dplyr)
# library(tidyr) # pivot_longer
library(ggplot2)
dat %>%
mutate(id = factor(row_number())) %>%
tidyr::pivot_longer(cols=Preop:PO_1Y) %>%
mutate(NumMonths = case_when(name == "Preop" ~ 0, name == "PO_1M" ~ 1, name == "PO_6M" ~ 6, name == "PO_1Y" ~ 12, TRUE ~ NA_real_)) %>%
ggplot(aes(NumMonths, value)) + geom_path(aes(group = id, color = id))
An alternative (in place of case_when) is to define a lookup table of Months that maps names to number of months, and then you can easily use this to add some context to your plot:
Months <- tibble(
name = c("Preop", "PO_1M", "PO_6M", "PO_1Y"),
NumMonths = c(0, 1, 6, 12)
)
dat %>%
mutate(id = factor(row_number())) %>%
tidyr::pivot_longer(cols=Preop:PO_1Y) %>%
left_join(., Months, by = "name") %>%
ggplot(aes(NumMonths, value)) +
geom_text(aes(y = -Inf, label = name), data = Months, hjust = 0, vjust = 0, angle = 90) +
geom_path(aes(group = id, color = id)) +
geom_point(aes(group = id, color = id))
While it can arguably be improved aesthetically, I think the structure of it should be clear enough.

How to plot % positive cases (y-axis) by collection date (x-axis) and by other factors (R)?

Please help!
I have case data I need to prepare for a report soon and just cannot get the graphs to display properly.
From a dataset with CollectionDate as the "record" of cases (i.e. multiple rows with the same date means more cases that day), I want to display Number of positive cases/total (positive + negative) cases for that day as a percent on the y-axis, with collection dates along the x-axis. Then I want to break down by region. Goal is to look like this but in terms of daily positives/# of tests rather than just positives vs negatives. I also want to add a horizontal line on every graph at 20%.
I have tried manipulating it before, in and after ggplot:
ggplot(df_final, aes(x =CollectionDate, fill = TestResult)) +
geom_bar(aes(y=..prop..)) +
scale_y_continuous(labels=percent_format())
Which is, again, close. But the percents are wrong because they are just taking the proportion of that day against counts of all days instead of per day.
Then I tried using tally()in the following command to try and count per region and aggregate:
df_final %>%
group_by(CollectionDate, Region, as.factor(TestResult)) %>%
filter(TestResult == "Positive") %>%
tally()
and I still cannot get the graphs right.
Suggestions?
A quick look at my data:
head(df_final)
Well, I have to say that I am not 100% sure that I got what you want, but anyway, this can be helpful.
The data: Since you are new here, I have to let you know that using a simple and reproducible version of your data will make it easier to the rest of us to answer. To do this you can simulate a data frame o any other objec, or use dput function on it.
library(ggplot2)
library(dplyr)
data <- data.frame(
# date
CollectionDate = sample(
seq(as.Date("2020-01-01"), by = "day", length.out = 15),
size = 120, replace = TRUE),
# result
TestResult = sample(c("Positive", "Negative"), size = 120, replace = TRUE),
# region
Region = sample(c("Region 1", "Region2"), size = 120, replace = TRUE)
)
With this data, you can do ass follow to get the plots you want.
# General plot, positive cases proportion
data %>%
count(CollectionDate, TestResult, name = "cases") %>%
group_by(CollectionDate) %>%
summarise(positive_pro = sum(cases[TestResult == "Positive"])/sum(cases)) %>%
ggplot(aes(x = CollectionDate, y = positive_pro)) +
geom_col() +
geom_hline(yintercept = 0.2)
# positive proportion by day within region
data %>%
count(CollectionDate, TestResult, Region, name = "cases") %>%
group_by(CollectionDate, Region) %>%
summarise(
positive_pro = sum(cases[TestResult == "Positive"])/sum(cases)
) %>%
ggplot(aes(x = CollectionDate, y = positive_pro)) +
geom_col() +
# horizontal line at 20%
geom_hline(yintercept = 0.2) +
facet_wrap(~Region)
I can get you halfway there (refer to the comments in the code for clarifications). This code is for the counts per day per region (plotted separately for each region). I think you can tweak things further to calculate the counts per day per county too; and whole state should be a cakewalk. I wish you good luck with your report.
rm(list = ls())
library(dplyr)
library(magrittr)
library(ggplot2)
library(scales)
library(tidyr) #Needed for the spread() function
#Dummy data
set.seed(1984)
sdate <- as.Date('2000-03-09')
edate <- as.Date('2000-05-18')
dateslist <- as.Date(sample(as.numeric(sdate): as.numeric(edate), 10000, replace = TRUE), origin = '1970-01-01')
df_final <- data.frame(Region = rep_len(1:9, 10000),
CollectionDate = dateslist,
TestResult = sample(c("Positive", "Negative"), 10000, replace = TRUE))
#First tally the positve and negative cases
#by Region, CollectionDate, TestResult in that order
df_final %<>%
group_by(Region, CollectionDate, TestResult) %>%
tally()
#Then
#First spread the counts (in n)
#That is, create separate columns for Negative and Positive cases
#for each Region-CollectionDate combination
#Then calculate their proportions (as shown)
#Now you have Negative and Positive
#percentages by CollectionDate by Region
df_final %<>%
spread(key = TestResult, value = n) %>%
mutate(Negative = Negative/(Negative + Positive),
Positive = Positive/(Negative + Positive))
#Plotting this now
#Since the percentages are available already
#Use geom_col() instead of geom_bar()
df_final %>% ggplot() +
geom_col(aes(x = CollectionDate, y = Positive, fill = "Positive"),
position = "identity", alpha = 0.4) +
geom_col(aes(x = CollectionDate, y = Negative, fill = "Negative"),
position = "identity", alpha = 0.4) +
facet_wrap(~ Region, nrow = 3, ncol = 3)
This yields:

ggplot2 - Two color series in area chart

I've got a question regarding an edge case with ggplot2 in R.
They don't like you adding multiple legends, but I think this is a valid use case.
I've got a large economic dataset with the following variables.
year = year of observation
input_type = *labor* or *supply chain*
input_desc = specific type of labor (eg. plumbers OR building supplies respectively)
value = percentage of industry spending
And I'm building an area chart over approximately 15 years. There are 39 different input descriptions and so I'd like the user to see the two major components (internal employee spending OR outsourcing/supply spending)in two major color brackets (say green and blue), but ggplot won't let me group my colors in that way.
Here are a few things I tried.
Junk code to reproduce
spec_trend_pie<- data.frame("year"=c(2006,2006,2006,2006,2007,2007,2007,2007,2008,2008,2008,2008),
"input_type" = c("labor", "labor", "supply", "supply", "labor", "labor","supply","supply","labor","labor","supply","supply"),
"input_desc" = c("plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck"),
"value" = c(1,2,3,4,4,3,2,1,1,2,3,4))
spec_broad <- ggplot(data = spec_trend_pie, aes(y = value, x = year, group = input_type, fill = input_desc)) + geom_area()
Which gave me
Error in f(...) : Aesthetics can not vary with a ribbon
And then I tried this
sff4 <- ggplot() +
geom_area(data=subset(spec_trend_pie, input_type="labor"), aes(y=value, x=variable, group=input_type, fill= input_desc)) +
geom_area(data=subset(spec_trend_pie, input_type="supply_chain"), aes(y=value, x=variable, group=input_type, fill= input_desc))
Which gave me this image...so closer...but not quite there.
To give you an idea of what is desired, here's an example of something I was able to do in GoogleSheets a long time ago.
It's a bit of a hack but forcats might help you out. I did a similar post earlier this week:
How to factor sub group by category?
First some base data
set.seed(123)
raw_data <-
tibble(
x = rep(1:20, each = 6),
rand = sample(1:120, 120) * (x/20),
group = rep(letters[1:6], times = 20),
cat = ifelse(group %in% letters[1:3], "group 1", "group 2")
) %>%
group_by(group) %>%
mutate(y = cumsum(rand)) %>%
ungroup()
Now, use factor levels to create gradients within colors
df <-
raw_data %>%
# create factors for group and category
mutate(
group = fct_reorder(group, y, max),
cat = fct_reorder(cat, y, max) # ordering in the stack
) %>%
arrange(cat, group) %>%
mutate(
group = fct_inorder(group), # takes the category into account first
group_fct = as.integer(group), # factor as integer
hue = as.integer(cat)*(360/n_distinct(cat)), # base hue values
light_base = 1-(group_fct)/(n_distinct(group)+2), # trust me
light = floor(light_base * 100) # new L value for hcl()
) %>%
mutate(hex = hcl(h = hue, l = light))
Create a lookup table for scale_fill_manual()
area_colors <-
df %>%
distinct(group, hex)
Lastly, make your plot
ggplot(df, aes(x, y, fill = group)) +
geom_area(position = "stack") +
scale_fill_manual(
values = area_colors$hex,
labels = area_colors$group
)

Group observations in ggplot2 with long form data

I have data in long form that looks like this:
id <- rep(seq(1:16), each = 3)
trial <- rep(seq(1:3), times = 16)
repeatedMeasure <- round(rnorm(48, mean = 3, sd = 2))
measuredOnce <- rep(10:14, times = c(9,6,6,12,15))
con1 <- rep(c('hi', 'lo'), each = 6, times = 4)
con2 <- rep(c('up', 'down'), each = 3, times = 8)
dat <- as.data.frame(cbind(id, trial, con1, con2, repeatedMeasure, measuredOnce))
dat$measuredOnce <- as.character(dat$measuredOnce)
dat$measuredOnce <- as.numeric(dat$measuredOnce)
Participants complete multiple trials. There is a unique measurement for each trial in the 'repeatedMeasures' variable. However, they are only measured once for the variable titled 'measuredOnce'. I want to produce a bar plot of the measuredOnce variable - something like this:
ggplot(data = dat) +
aes(x = measuredOnce) +
geom_bar() +
facet_wrap(~con1*con2)
However, I want to specify that the measurements for measuredOnce are grouped by id, so that the number of observations (and hence the height of the bar) is divided by three.
I know I could produce what I want by using spread() or taking every third row, but would like to work with the same (long) data frame.
Edit: plot using code above with group = id and fill = id added to aesthetics.
Edit 2: What I am looking for is something that looks like the plot produced by this code
dat %>%
spread(key = trial, value = repeatedMeasure) %>%
ggplot() +
aes(x = measuredOnce) +
geom_bar() +
facet_wrap(~con1*con2)
but without creating a new data frame using spread().

ggplot2: creating different graph panels per id

I am trying to create a time series plot for each individual (ID) I have in my dataset.
Example data:
ID <- rep(c(2:5), each = 9, times = 4)
Attitude <- rep(c('A1', 'A2','A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9'), 16)
Answer <- rep(1:5, length.out = 144)
time <- as.character(rep(c(0, 1, 3, 4), each = 9, times = 4))
first_answer <- rep(1:5, length.out = 144)
df <- data.frame(ID, Attitude, Answer, time, first_answer)
df$time <- as.character(df$time)
The function code I am currently using:
library(dplyr)
spaghetti_plot <- function(input, MV, item_level){
MV <- enquo(MV)
titles <- enquo(item_level)
input %>%
filter(!!(MV) == item_level) %>%
mutate(first_answer = first_answer) %>%
ggplot(.,aes( x = time, y = jitter(Answer), group = ID)) +
geom_line(aes(colour = first_answer)) +
labs(title = titles ,x = 'Time', y = 'Answer', colour = 'Answer given at time 0')
}
This gives me a graph where I have a line for each individual, i.e. one plot for all individuals (equal to number of ID). Instead of this, I would like to have 1 plot with # panels = ID. For example, if I have data of 10 individuals, I would like to have a graph with 10 panels.
I tried using facet_wrap and facet_panel to get the job done, but I haven't found a proper solution yet.
EDIT using facet_wrap(~ID) gives
The result that I am after would look something like this:
Which was originally made in SAS.
EDIT2 Solution is in the comments.
The data from your reproducible example are a bit weird because you have only one value per ID, but I believe this is the code you are looking for:
library(ggplot2)
ggplot(df,aes(x = time, y = Answer)) +
geom_line()+
facet_grid(. ~ ID)
If you have too many facets the data may not show up, try to increase the size of the plot window or export the image directly with ggsave. If you find the right parameters for ggsave all the plots should be visible on the saved image.

Resources