I have two issues in this plot
1- I have this dataset and would like a line plot with multiple lines, each representing the total number of injuries from one cause over 3 months. So the x-axis should be divided into blocks of 3 months, and the data shown in the figure should reflect the frequency of each injury cause within each three-month block.
x-axis should have year and month (date_labels="%Y %b")
2- Road traffic accidents are more frequent than other causes of injuries; thus, I would like to have two y-axis scales (one on the left as usual and one on the right). The one on the right will have a scale that fits the number of "traffic accidents" (from 1 to 150 by 10) and the one on the left (from 1 to 80 by 10) will indicate the number of other causes of injuries.
Traffic accidents should be linked to the right scale and the other causes should be linked to the left scale.
Date Injury.Cause n
1 2019-03-25 Falls from height 1
2 2019-03-25 Falls on level ground 3
3 2019-03-25 Road traffic accidents 3
4 2019-03-26 Road traffic accidents 5
5 2019-03-27 Falls on level ground 3
6 2019-03-27 Road traffic accidents 3
7 2019-03-28 Falls on level ground 2
8 2019-03-28 Road traffic accidents 3
9 2019-03-29 Falls on level ground 4
10 2019-03-29 Road traffic accidents 9
11 2019-03-30 Falls from height 2
12 2019-03-30 Falls on level ground 2
13 2019-03-30 Road traffic accidents 7
14 2019-03-31 Falls on level ground 1
15 2019-03-31 Road traffic accidents 1
16 2019-04-01 Falls on level ground 3
17 2019-04-02 Assaults related injuries 1
18 2019-04-02 Falls on level ground 1
19 2019-04-02 Road traffic accidents 2
20 2019-04-03 Falls from height 2
I tried this code
ggplot(df) + aes(Date, n, color = Injury.Cause) + geom_line()+
scale_x_date(date_breaks = "3 months", breaks = "3 months" , date_labels="%Y %b",limits = c(min(df$Date), max(df$Date)))
and I got a really crowded figure because it used number of cases/day (below)
thank you in advance
Ram
Building on the other answer, here's a complete response including the other y-axis. Note that the first (long) step involves creating test data, so in future it would be helpful if you can provide the input data. This code should work with your data, but I can't guarantee it.
Set Up Test Data
This creates a dataframe called df with random numbers for daily accident rates. The daily rate for Road Traffic Accidents is about three times higher (sampled from 10:20 rather than 0:10).
# load packages
library(lubridate)
library(dplyr)
library(magrittr)
library(ggplot2)
# creating dummy data
# set up low-value categories
categories <- c("Falls on level ground", "Falls from height", "Assaults Related Injuries")
# set up all the dates
dates <- seq.Date(from = lubridate::ymd("2019-01-01"),
to = lubridate::ymd("2021-06-30"),
by = "day")
# create a data frame with injury stats for each day from 2019-01-01 to 2021-06-30
# Road traffic accidents are much higher, ranging from 10 to 20 per day
df <- tibble(Date = rep(dates, times = length(categories)),
# each = ... pairs every date with every category exactly once
Injury.Cause = rep(categories, each = length(dates)),
n = sample(x = 0:10,
size = length(categories) * length(dates),
replace = TRUE)) %>%
bind_rows(tibble(Date = dates,
Injury.Cause = "Road Traffic Accidents",
n = sample(10:20,
size = length(dates),
replace = TRUE)))
Creating Quarterly Sums
This code rounds each date down to beginning of the quarter and then creates quarterly sums. One advantage of this approach over the other answer is that these can then be plotted using a date x-axis.
# now group data by quarter by rounding dates down to the nearest quarter,
# then summing results.
df_rounded <- df %>%
mutate(quarter_floor = lubridate::floor_date(Date, unit = "quarter")) %>%
group_by(quarter_floor, Injury.Cause) %>%
summarise(n_quarter = sum(n))
# make a single plot showing all values on one axis
ggplot() +
geom_line(data = df_rounded,
mapping = aes(x = quarter_floor,
y = n_quarter,
color = Injury.Cause))
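The question also asked for year-and-month labels on the x-axis. A minimal sketch of the same quarterly rounding applied to six rows copied from the question's sample data, with the labels set through scale_x_date(). The object names sample_df, quarterly and p are made up for this illustration:

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# six rows taken from the sample data in the question
sample_df <- data.frame(
  Date = as.Date(c("2019-03-29", "2019-03-29", "2019-03-30",
                   "2019-03-30", "2019-04-01", "2019-04-02")),
  Injury.Cause = c("Falls on level ground", "Road traffic accidents",
                   "Falls on level ground", "Road traffic accidents",
                   "Falls on level ground", "Road traffic accidents"),
  n = c(4, 9, 2, 7, 3, 2)
)

# round each date down to the start of its quarter, then sum per cause
quarterly <- sample_df %>%
  mutate(quarter_floor = floor_date(Date, unit = "quarter")) %>%
  group_by(quarter_floor, Injury.Cause) %>%
  summarise(n_quarter = sum(n), .groups = "drop")

# one tick per quarter, labelled like "2019 Jan"
p <- ggplot(quarterly, aes(quarter_floor, n_quarter, color = Injury.Cause)) +
  geom_line() +
  geom_point() +
  scale_x_date(date_breaks = "3 months", date_labels = "%Y %b")
```

Printing p draws the plot. With only two quarters in these six rows each line is a single segment, so the points are added to keep short series visible.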
Plotting With a Second Y-Axis
Putting two y-axes on the same plot is confusing and isn't often recommended, but here's how to do it. This solution creates two separate data frames, one for the left-hand axis and one for the right-hand axis.
The trick is that the y-values in the second data frame are scaled so that their range matches the range of the values on the left-hand axis, and then the second axis is scaled to show the original range.
# set up a plot with two separately scaled axes.
# set up the first dataframe with the low-valued entries
df_for_plot_1 <- df %>%
mutate(quarter_floor = lubridate::floor_date(Date, unit = "quarter")) %>%
group_by(quarter_floor, Injury.Cause) %>%
summarise(n_quarter = sum(n)) %>%
filter(Injury.Cause != "Road Traffic Accidents")
# set up a second dataframe with only the high-value accidents.
# create scaled values so that the largest road accident sum equals the largest
# of the other accident sums.
df_for_plot_2 <- df %>%
mutate(quarter_floor = lubridate::floor_date(Date, unit = "quarter")) %>%
group_by(quarter_floor, Injury.Cause) %>%
summarise(n_quarter = sum(n)) %>%
ungroup() %>%
filter(Injury.Cause == "Road Traffic Accidents") %>%
mutate(n_quarter_scaled = n_quarter / (max(n_quarter) / max(df_for_plot_1$n_quarter)) )
# now plot both dataframes on the same plot, with a second y-axis that maps
# the scaled road-accident values back to the original values
ggplot() +
geom_line(data = df_for_plot_1,
mapping = aes(x = quarter_floor,
y = n_quarter,
color = Injury.Cause))+
geom_line(data = df_for_plot_2,
mapping = aes(x = quarter_floor,
y = n_quarter_scaled,
color = Injury.Cause))+
scale_x_date() +
scale_y_continuous(sec.axis = sec_axis(trans = ~ . *(max(df_for_plot_2$n_quarter) / max(df_for_plot_1$n_quarter))))
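To see why the two scaling steps cancel out, here is a small worked example in base R. The numbers are made up for illustration and are not from the question's data; k plays the role of max(df_for_plot_2$n_quarter) / max(df_for_plot_1$n_quarter).

```r
# hypothetical quarterly sums, for illustration only
left_values  <- c(12, 40, 80)    # series drawn against the left axis
right_values <- c(30, 90, 150)   # series that needs the right axis

# scale factor: how many right-axis units correspond to one left-axis unit
k <- max(right_values) / max(left_values)

# values actually drawn on the plot (in left-axis units)
scaled <- right_values / k

# what the secondary axis shows: the drawn values multiplied back by k
restored <- scaled * k

print(scaled)
print(restored)
```

The largest scaled value coincides with the largest left-axis value, and multiplying by k recovers the original numbers, which is what the transformation passed to sec_axis() does for the right-hand labels.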
I did not understand how you want to show data for each month and also summarise it for each 3 months, so I did both, but separately.
Summarise (3 months)
In the lubridate package we have the function quarter(), which divides the year into four quarters:
df %>%
group_by(quarter = quarter(Date)) %>%
summarise(total = sum(n,na.rm = TRUE))
# A tibble: 2 x 2
quarter total
<int> <int>
1 1 49
2 2 9
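Note that quarter() on its own returns only the quarter number (1 to 4), so with data spanning several years the same quarter from different years is pooled together. A minimal sketch of grouping by year and quarter instead, assuming a made-up three-row data frame called two_years:

```r
library(dplyr)
library(lubridate)

# hypothetical data covering the first quarter of two different years
two_years <- data.frame(
  Date = as.Date(c("2019-03-25", "2019-04-02", "2020-03-10")),
  n = c(3, 2, 5)
)

# quarter number only: Q1 2019 and Q1 2020 end up in the same group
by_quarter_only <- two_years %>%
  group_by(quarter = quarter(Date)) %>%
  summarise(total = sum(n))

# year plus quarter: each calendar quarter gets its own row
by_year_quarter <- two_years %>%
  group_by(year = year(Date), quarter = quarter(Date)) %>%
  summarise(total = sum(n), .groups = "drop")
```

by_quarter_only has two rows, while by_year_quarter has three, one per calendar quarter that appears in the data.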
Two y axis
Creating an axis with a different scale is very hard, and not recommended in ggplot2, but you can create a secondary axis that is a transformation of the main axis scale:
df %>%
ggplot() +
aes(Date, n, color = Injury.Cause, group = Injury.Cause) +
geom_line() +
scale_y_continuous(
breaks = seq(0,150,10),
limits = c(0,150),
sec.axis = sec_axis(trans = ~.,breaks = seq(0,80,10))
)
I have a dataset that has around 2000 rows.
Each row is a hospital encounter for ICU Admissions. This is data collected over 5 years
The variables of interest are: Encounter Number, Diagnosis Category, Admit Date, Discharge Date
What I want to do is try and plot the ICU occupancy for each day over these 5 years.
Example:
Encounter Number : 786786
Diagnosis Category : Tuberculosis
Admit Date : 2022-01-20
Discharge Date : 2022-01-30
Therefore this patient stayed in the ICU for 10 days starting from 01.20 to 01.30.
There will be other encounters for another diagnosis -
Encounter Number : 786786
Diagnosis Category : Cancer
Admit Date : 2022-01-21
Discharge Date : 2022-01-28
The end goal is to plot the ICU occupancy for EACH date from the EARLIEST Admit Date to the LATEST Discharge Date (x-axis) by Diagnosis Category.
For each date on the x-axis for the 5 year time period, there will be a bar for the diagnosis category.
How can I go about doing this?
Thanks (:
I have encountered this problem myself many times before. The algorithm to count the occupancy is essentially just creating a vector of the days you want to plot, then for each day, counting how many people were admitted on or before that day and discharged on or after that day.
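As a small worked illustration of that counting rule, here it is in base R on the two example encounters from the question (admitted 2022-01-20 and 2022-01-21, discharged 2022-01-30 and 2022-01-28). The names stays, example_days and occupancy are chosen just for this sketch:

```r
# the two example encounters from the question
stays <- data.frame(
  admit     = as.Date(c("2022-01-20", "2022-01-21")),
  discharge = as.Date(c("2022-01-30", "2022-01-28"))
)

# every day from the earliest admission to the latest discharge
example_days <- seq(min(stays$admit), max(stays$discharge), by = "day")

# a patient occupies a bed on day d if admit <= d <= discharge
occupancy <- sapply(example_days, function(d) {
  sum(stays$admit <= d & stays$discharge >= d)
})
names(occupancy) <- format(example_days)

print(occupancy)
```

The count is 1 on 2022-01-20, 2 from 2022-01-21 to 2022-01-28 while both stays overlap, and 1 again on the last two days.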
We need some realistic data. Given that you have 2000 admissions over the 5 years, and given mean ICU length of stay is typically 3.5 days with a gamma or lognormal type distribution, we can create some reasonable simulated data like this:
# Make data reproducible
set.seed(1)
df <- data.frame(Admit_date = sample(seq(as.POSIXct("2015-01-01"),
as.POSIXct("2020-01-01"), "day"),
2000, TRUE),
Diagnosis_category = sample(c("Respiratory",
"Infective",
"Post-op",
"Trauma"), 2000, TRUE),
Encounter_number = 56789123 + 1:2000)
df$Discharge_date <- df$Admit_date + 86400 * rgamma(2000, shape = 2, scale = 1.75)
df$Discharge_date <- as.Date(df$Discharge_date)
df$Admit_date <- as.Date(df$Admit_date)
df <- df[order(df$Admit_date), c(3, 1, 4, 2)]
rownames(df) <- NULL
head(df)
#> Encounter_number Admit_date Discharge_date Diagnosis_category
#> 1 56790418 2015-01-01 2015-01-02 Post-op
#> 2 56789614 2015-01-05 2015-01-10 Post-op
#> 3 56790100 2015-01-05 2015-01-12 Post-op
#> 4 56790644 2015-01-07 2015-01-07 Trauma
#> 5 56789943 2015-01-08 2015-01-09 Respiratory
#> 6 56790066 2015-01-08 2015-01-13 Trauma
Assuming this is similar to your own data, we can now count the occupancy for each day like this:
library(tidyverse)
# Create vector of all dates you wish to plot
days <- seq(as.Date("2015-01-01"), as.Date("2020-01-01"), "day")
plot_df <- df %>%
group_by(Diagnosis_category) %>%
# reframe() (dplyr >= 1.1.0) allows several rows per group; with older dplyr use summarize() here
reframe(date = days, count = sapply(days, function(x) {
sum(Admit_date <= x & Discharge_date >= x)
}))
Now we are ready to plot. In my example, we only have 4 diagnostic categories, and trying to plot over 1600 columns on a single panel is already challenging. If you try to put all your diagnostic categories over 5 years in a single panel, you will get a total mess. This is made worse by the fact that you will only ever have a handful of patients in each diagnostic category (other than during Covid peaks), so the plot will only have a few discrete steps in it. I think it would be best to use facets in this case:
ggplot(plot_df, aes(date, count, fill = Diagnosis_category,
color = Diagnosis_category)) +
geom_col() +
facet_wrap(.~Diagnosis_category) +
theme_minimal(base_size = 16) +
theme(legend.position = "none")
Unless there is a specific point you wish your data to make with this kind of plot (like massive occupancy spikes during Covid surges), you might want to think of a different summary measure. You could try grouping plot_df by diagnostic category and month, then calculating average monthly occupancy.
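A minimal sketch of that monthly summary, assuming a data frame shaped like plot_df (columns Diagnosis_category, date and count). The four rows below are made up so the example runs on its own, and monthly_df and mean_occupancy are names chosen for illustration:

```r
library(dplyr)
library(lubridate)

# hypothetical stand-in with the same columns as plot_df
plot_df <- data.frame(
  Diagnosis_category = "Trauma",
  date  = as.Date(c("2015-01-30", "2015-01-31", "2015-02-01", "2015-02-02")),
  count = c(2, 4, 1, 3)
)

# round each day down to the first of its month, then average the daily counts
monthly_df <- plot_df %>%
  group_by(Diagnosis_category, month = floor_date(date, "month")) %>%
  summarise(mean_occupancy = mean(count), .groups = "drop")
```

monthly_df then has one row per category and month, which can be drawn with geom_line() or geom_col() using month on the x-axis.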
My data consists of 25 sectors on a time series. For each sector I want to plot the number of workers (series 1) and the average pay (series 2) in a line graph, with the primary y axis for the number of workers and the secondary y axis for the average pay, and then arrange the graphs on a grid.
example data:
period  avg_wage  number_of_workers  sector
1990    2000      5000               construction
1991    2020      4970               construction
1992    2050      5050               construction
1990    1000      120                IT
1991    1100      400                IT
1992    1080      500                IT
1990    10000     900                hospital staff
1991    10200     980                hospital staff
1992    10400     1200               hospital staff
I tried to use facet_wrap() for the grid and scale_y_continuous(sec.axis...) as follows:
#fake sample data for reference
dfa=data.frame(order=seq(1,100),workers=rnorm(1000,7),pay=rnorm(1000,3000,500),type="a") #1st sector
dfb=data.frame(order=seq(1,100),workers=rnorm(1000,25),pay=rnorm(1000,1000,500),type="b") #2nd sector
dfc=data.frame(order=seq(1,100),workers=rnorm(1000,400),pay=rnorm(1000,5000,500),type="c") #3rd sector
df=rbind(dfa,dfb,dfc)
colnames(df)=c(
"order", #shared x axis/time value
"workers", #time series 1 (y values for left side y axis)
"pay", #time series 2 (y values for right side y axis)
"type" #different graphs to put on the grid
)
ggploting the data:
df=df %>% group_by(l=type) %>% mutate(coeff=max(pay)/max(workers)) %>% ungroup() #creating a coefficient to scale the secondary axis
plot=ggplot(data=df,aes(x=order))+
geom_line(aes(y=workers),linetype="dashed",color="red")+
geom_line(aes(y=pay/coeff)) +
scale_y_continuous(sec.axis=sec_axis(~.*coeff2,name="wage"))+
facet_wrap(~type,scale="free")
But unfortunately this doesn't work, since you can't use data in the function sec_axis() (this example doesn't even run).
Another approach I tried is using a for loop and grid.arrange():
plots=list()
for (i in (unique(df$type)))
{
singlesector=df[df$type==i,]
axiscoeff=df$coeff[1]
plot=ggplot(data=singlesector,aes(x=order))+
geom_line(aes(y=workers),linetype="dashed",color="red")+
geom_line(aes(y=pay/coeff)) + labs(title=i)+
scale_y_continuous(sec.axis=sec_axis(~.*axiscoeff,name="wage"))
plots[[i]]=plot
}
grid.arrange(grobs=plots)
But this also doesn't work because ggplot doesn't save the various values of the variable axiscoeff so it applies the first value to all of the graphs.
see result (the axis on the right are messed up and don't conform to the red line's data):
Is there any way to do what I want to do?
I thought about saving all of the plots separately as png files and then joining them in some other way, but that seems like an extreme solution which would take too much time to figure out.
As far as I can tell, the issue is the way you (re)scale your data: using max(pay) / max(workers) maps the maximum value of pay onto the maximum value of workers, but it does not take account of the different range or spread of the two variables.
Instead you could use scales::rescale to rescale your data such that the range of pay is mapped on the range of workers.
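To make the difference concrete, here is a small sketch with made-up numbers (the pay and workers vectors are hypothetical). It compares dividing by the max ratio with scales::rescale(x, to, from), which maps the interval given in from linearly onto the interval given in to:

```r
library(scales)

# hypothetical values for a single sector
pay     <- c(1000, 1500, 2000)
workers <- c(5, 6, 9)

# max-ratio scaling: only the maxima are guaranteed to line up
ratio_scaled <- pay / (max(pay) / max(workers))

# range rescaling: both the minimum and the maximum of pay line up with workers
pay_on_workers_axis <- rescale(pay, to = range(workers), from = range(pay))

# the inverse mapping, as needed for the secondary axis labels
labels_back <- rescale(pay_on_workers_axis, to = range(pay), from = range(workers))

print(ratio_scaled)
print(pay_on_workers_axis)
print(labels_back)
```

With the ratio approach the smallest pay value lands below the smallest workers value, so the pay line does not span the same vertical extent as the workers line. With rescale() both ends match, and the inverse call recovers the original pay values.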
Besides that, I took a different approach to glue the plots together, which makes use of patchwork. To this end I put the plotting code in a function, split the data by type, used lapply to loop over the split data, and finally glued the plots together using patchwork::wrap_plots.
Note: As your example data included multiple values per order/type I slightly changed it to get rid of the zig-zag lines.
library(dplyr)
library(ggplot2)
library(patchwork)
library(scales)
df %>%
split(.$type) %>%
lapply(function(df) {
range_pay <- range(df$pay)
range_workers <- range(df$workers)
ggplot(data = df, aes(x = order)) +
geom_line(aes(y = workers), linetype = "dashed", color = "red") +
geom_line(aes(y = rescale(pay, range_workers, range_pay))) +
scale_y_continuous(sec.axis = sec_axis(~ rescale(.x, range_pay, range_workers), name = "wage")) +
facet_wrap(~type)
}) %>%
wrap_plots(ncol = 1)
DATA
set.seed(123)
dfa <- data.frame(order = 1:100, workers = rnorm(100, 7), pay = rnorm(100, 3000, 500), type = "a") # 1st sector
dfb <- data.frame(order = 1:100, workers = rnorm(100, 25), pay = rnorm(100, 1000, 500), type = "b") # 2nd sector
dfc <- data.frame(order = 1:100, workers = rnorm(100, 400), pay = rnorm(100, 5000, 500), type = "c") # 3rd sector
df <- rbind(dfa, dfb, dfc)
names(df) <- c("order", "workers", "pay", "type")
I want to create a time series plot showing how two variables have changed over time, coloured by their region.
I have 2 regions, England and Wales and for each I have calculated the total_tax and the total_income.
I want to plot these on a ggplot over the years, using the years variable.
How would I do this and colour the regions separately?
I have the year variable, which I will put on the x axis, and I want to plot both incometax and taxpaid on the graph to show how they have both changed over time.
How would I add a 3rd axis to plot how these two variables have changed over time?
I have tried this code but it has not worked the way I wanted it to do.
ggplot(tax_data, filter %>% aes(x=date)) +
geom_line(aes(y=incometax, color=region)) +
geom_line(aes(y=taxpaid, color=region))+
ggplot is a bit hard to grasp at the beginning. I guess you're trying to achieve something like the following:
Assuming your data is in a format with a column for each date, incometax and taxpaid - I'm creating here an example:
library(tidyverse)
dataset <- tibble(date = seq(from = as.Date("2015-01-01"), to = as.Date("2019-12-31"), by = "month"),
incometax = rnorm(60, 100, 10),
taxpaid = rnorm(60, 60, 5))
Now, for plotting a line for each incometax and taxpaid we need to shape or "tidy" the data (see here for details):
dataset <- dataset %>% pivot_longer(cols = c(incometax, taxpaid))
Now you have three columns like this - we've turned the former column names into the variable name:
# A tibble: 6 x 3
date name value
<date> <chr> <dbl>
1 2015-01-01 incometax 106.
2 2015-01-01 taxpaid 56.9
3 2015-02-01 incometax 112.
4 2015-02-01 taxpaid 65.0
5 2015-03-01 incometax 95.8
6 2015-03-01 taxpaid 64.6
This now has the right format for ggplot, and you can map name to the colour of the lines:
ggplot(dataset, aes(x = date, y = value, colour = name)) + geom_line()
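The question also mentions two regions. One possible way to show both, assuming the data additionally has a region column (the tax_data values below are made up for illustration), is to map region to colour and the variable name to linetype:

```r
library(tidyr)
library(ggplot2)

# hypothetical data: three yearly values per region
tax_data <- data.frame(
  date      = rep(as.Date(c("2015-01-01", "2016-01-01", "2017-01-01")), times = 2),
  region    = rep(c("England", "Wales"), each = 3),
  incometax = c(100, 105, 110, 40, 42, 45),
  taxpaid   = c(60, 62, 66, 25, 26, 28)
)

# one row per date, region and variable
tax_long <- pivot_longer(tax_data, cols = c(incometax, taxpaid))

# colour distinguishes regions, linetype distinguishes the two variables
p <- ggplot(tax_long, aes(x = date, y = value,
                          colour = region, linetype = name)) +
  geom_line()
```

Printing p gives four lines, one per region and variable combination, without needing an additional axis.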
I have the following dataset
head(Data)
Fecha PriceStats
1 01-2002 45.2071
2 02-2002 46.6268
3 03-2002 48.4712
4 04-2002 53.5067
5 05-2002 55.6527
6 06-2002 57.6684
There's a total of 176 observations.
Every row corresponds to a different month.
I would like to create a graph with the 12 months of the year on the x-axis, where every year of the dataset (containing 12 months each) corresponds to a series in the graph, so I can plot all the different years overlapping (in this case there would be 15 series).
Do I have to set levels on the dataset or ggplot can do that directly?
This should do it:
library(ggplot2)
library(lubridate)
Data <- data.frame(date = seq(ymd('2014/01/01'), ymd('2016/12/01'), by = "month"),
n = sample(1:50, 36))
Data$month <- month(Data$date)
Data$year <- year(Data$date)
ggplot(Data, aes(x = month, y = n, group = year)) +
geom_line(aes(colour = as.factor(year)))
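The data in the question stores the month as text such as "01-2002" in a column called Fecha. A minimal sketch of turning that into month and year columns with base R, using the first three rows shown in the question. Prepending a day of "01" is an assumption made only so that as.Date() can parse the string:

```r
library(ggplot2)

# first three rows of the data shown in the question
Data <- data.frame(
  Fecha      = c("01-2002", "02-2002", "03-2002"),
  PriceStats = c(45.2071, 46.6268, 48.4712)
)

# add a dummy day so the text can be parsed as a date
Data$date  <- as.Date(paste0("01-", Data$Fecha), format = "%d-%m-%Y")
Data$month <- as.integer(format(Data$date, "%m"))
Data$year  <- factor(format(Data$date, "%Y"))

# one line per year, months on the x-axis
p <- ggplot(Data, aes(x = month, y = PriceStats, colour = year, group = year)) +
  geom_line() +
  scale_x_continuous(breaks = 1:12, labels = month.abb)
```

Because year is a factor, ggplot2 gives each year its own colour without any levels having to be set by hand.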
Let's say I have data consisting of the time I leave the house and the number of minutes it takes me to get to work. I'll have some repeated values:
08:00, 20
08:04, 25
08:30, 40
08:20, 23
08:04, 22
And some numbers will repeat (like 08:04). What I want to do is run a scatter plot that is correctly scaled on the x-axis but allows these multiple values per entry, so that I can view the trend.
Is a time-series even what I want to be using? I've been able to plot a time series graph that has one value per time, and I've gotten multiple values plotted but without the time-series scaling. Can anyone suggest a good approach? Preference for ggplot2 but I'll take standard R plotting if it's easier.
First let's prepare some more data
set.seed(123)
df <- data.frame(Time = paste0("08:", sample(35:55, 40, replace = TRUE)),
Length = sample(20:50, 40, replace = TRUE),
stringsAsFactors = FALSE)
df <- df[order(df$Time), ]
df$Attempt <- unlist(sapply(rle(df$Time)$lengths, function(i) 1:i))
df$Time <- as.POSIXct(df$Time, format = "%H:%M") # Fixing y axis
head(df)
Time Length Attempt
6 08:35 24 1
18 08:35 43 2
35 08:35 34 3
15 08:37 37 1
30 08:38 33 1
38 08:39 38 1
As I understand, you want to preserve the order of observations of the same leaving house time. At first I ignored that and got a scatter plot like this:
ggplot(data = df, aes(x = Length, y = Time)) +
geom_point(aes(size = Length, colour = Length)) +
geom_path(aes(group = Time, colour = Length), alpha = I(1/3)) +
scale_size(range = c(2, 7)) + theme(legend.position = 'none')
But considering three dimensions (Time, Length and Attempt), a scatter plot can no longer show us all the information. I hope I understood you correctly and this is what you are looking for:
ggplot(data = df, aes(y = Time, x = Attempt)) + geom_tile(aes(fill = Length))
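If the goal is simply a scatter plot with the leaving time on a properly scaled x-axis, repeated times are not a problem for geom_point(). A minimal sketch on the five rows from the question; the names commute, left and left_time are made up here, and the straight-line smoother is just one possible way to show a trend:

```r
library(ggplot2)

# the five observations from the question
commute <- data.frame(
  left    = c("08:00", "08:04", "08:30", "08:20", "08:04"),
  minutes = c(20, 25, 40, 23, 22)
)

# parse the clock times; today's date is attached but never shown on the axis
commute$left_time <- as.POSIXct(commute$left, format = "%H:%M", tz = "UTC")

p <- ggplot(commute, aes(x = left_time, y = minutes)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_datetime(date_labels = "%H:%M")
```

The two 08:04 observations are drawn at the same x position with different y values, and the gaps between times on the axis are proportional to the actual minutes between them.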