I want to create a time series plot showing how two variables have changed over time, coloured by their region.
I have 2 regions, England and Wales and for each I have calculated the total_tax and the total_income.
I want to plot these on a ggplot over the years, using the years variable.
How would I do this and colour the regions separately?
I have the year variable, which I will put on the x axis; then I want to plot both incometax and taxpaid on the graph and show how they have both changed over time.
How would I add a 3rd axis to the plot to show how these two variables have changed over time?
I have tried this code but it has not worked the way I wanted it to do.
ggplot(tax_data, filter %>% aes(x=date)) +
geom_line(aes(y=incometax, color=region)) +
geom_line(aes(y=taxpaid, color=region))+
ggplot is a bit hard to grasp at first. I guess you're trying to achieve something like the following:
Assuming your data has one column each for date, incometax and taxpaid, here is an example dataset:
library(tidyverse)
dataset <- tibble(date = seq(from = as.Date("2015-01-01"), to = as.Date("2019-12-31"), by = "month"),
incometax = rnorm(60, 100, 10),
taxpaid = rnorm(60, 60, 5))
Now, for plotting a line for each incometax and taxpaid we need to shape or "tidy" the data (see here for details):
dataset <- dataset %>% pivot_longer(cols = c(incometax, taxpaid))
Now you have three columns like this - we've turned the former column names into the variable name:
# A tibble: 6 x 3
date name value
<date> <chr> <dbl>
1 2015-01-01 incometax 106.
2 2015-01-01 taxpaid 56.9
3 2015-02-01 incometax 112.
4 2015-02-01 taxpaid 65.0
5 2015-03-01 incometax 95.8
6 2015-03-01 taxpaid 64.6
This is now in the right format for ggplot, and you can map name to the colour of the lines:
ggplot(dataset, aes(x = date, y = value, colour = name)) + geom_line()
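The question also asks for the two regions to be coloured separately. A minimal sketch of one way to do that, assuming a data frame with year, region, incometax and taxpaid columns (the tax_data values below are made up for illustration): colour distinguishes the region and line type distinguishes the variable.

```r
library(tidyr)
library(ggplot2)

# Hypothetical data: one row per year and region (values are made up)
tax_data <- data.frame(
  year      = rep(2015:2019, times = 2),
  region    = rep(c("England", "Wales"), each = 5),
  incometax = c(100, 104, 109, 113, 118, 20, 21, 23, 24, 26),
  taxpaid   = c(60, 62, 65, 69, 72, 12, 13, 13, 15, 16)
)

# Long format: one row per year, region and variable
tax_long <- pivot_longer(tax_data,
                         cols = c(incometax, taxpaid),
                         names_to = "variable",
                         values_to = "value")

# Colour by region, line type by variable;
# the interaction gives one line per region/variable pair
p <- ggplot(tax_long,
            aes(x = year, y = value,
                colour = region, linetype = variable,
                group = interaction(region, variable))) +
  geom_line()
p
```

This keeps a single y axis, so no extra axis is needed to show both variables.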
I have two issues in this plot
1- I have this dataset and would like a line plot with multiple lines, each representing the total number of injuries per cause over 3 months. So the x-axis should be separated into blocks of 3 months, and the data shown in the figure should reflect the frequency of each injury cause in each three-month block.
x-axis should have year and month (date_labels="%Y %b")
2- Road traffic accidents are more frequent than other causes of injuries; thus, I would like to have two y-axis scales (one on the left as usual and one on the right). The one on the right will have a scale that fits the number of "traffic accidents" (from 1 to 150 by 10) and the other one (from 1 to 80 by 10) will indicate the number of other causes of injuries.
Traffic accidents should be linked to the right scale and the other causes should be linked to the left scale.
Date Injury.Cause n
1 2019-03-25 Falls from height 1
2 2019-03-25 Falls on level ground 3
3 2019-03-25 Road traffic accidents 3
4 2019-03-26 Road traffic accidents 5
5 2019-03-27 Falls on level ground 3
6 2019-03-27 Road traffic accidents 3
7 2019-03-28 Falls on level ground 2
8 2019-03-28 Road traffic accidents 3
9 2019-03-29 Falls on level ground 4
10 2019-03-29 Road traffic accidents 9
11 2019-03-30 Falls from height 2
12 2019-03-30 Falls on level ground 2
13 2019-03-30 Road traffic accidents 7
14 2019-03-31 Falls on level ground 1
15 2019-03-31 Road traffic accidents 1
16 2019-04-01 Falls on level ground 3
17 2019-04-02 Assaults related injuries 1
18 2019-04-02 Falls on level ground 1
19 2019-04-02 Road traffic accidents 2
20 2019-04-03 Falls from height 2
I tried this code
ggplot(df) + aes(Date, n, color = Injury.Cause) + geom_line()+
scale_x_date(date_breaks = "3 months", breaks = "3 months" , date_labels="%Y %b",limits = c(min(df$Date), max(df$Date)))
and I got a really crowded figure because it used number of cases/day (below)
thank you in advance
Ram
Building on the other answer, here's a complete response including the other y-axis. Note that the first (long) step involves creating test data, so in future it would be helpful if you can provide the input data. This code should work with your data, but I can't guarantee it.
Set Up Test Data
This creates a data frame called df with random numbers for daily injury counts. The daily count for road traffic accidents is about three times higher on average (10-20 per day versus 0-10).
# load packages
library(lubridate)
library(dplyr)
library(magrittr)
library(ggplot2)
# creating dummy data
# set up low-value categories
categories <- c("Falls on level ground", "Falls from height", "Assaults Related Injuries")
# set up all the dates
dates <- seq.Date(from = lubridate::ymd("2019-01-01"),
to = lubridate::ymd("2021-06-30"),
by = "day")
# create a data frame with injury stats for each day from 2019-01-01 to 2021-06-30
# Road traffic accidents are higher, ranging from 10-20 per day
df <- tibble(Date = rep(dates, times = length(categories)),
# use each = so that every category is paired with every date
Injury.Cause = rep(categories, each = length(dates)),
n = sample(x = 0:10,
size = length(categories) * length(dates),
replace = TRUE)) %>%
bind_rows(tibble(Date = dates,
Injury.Cause = "Road Traffic Accidents",
n = sample(10:20,
size = length(dates),
replace = TRUE)))
Creating Quarterly Sums
This code rounds each date down to the beginning of its quarter and then creates quarterly sums. One advantage of this approach over the other answer is that these can then be plotted using a date x-axis.
# now group data by quarter by rounding dates down to the nearest quarter,
# then summing results.
df_rounded <- df %>%
mutate(quarter_floor = lubridate::floor_date(Date, unit = "quarter")) %>%
group_by(quarter_floor, Injury.Cause) %>%
summarise(n_quarter = sum(n))
# make a single plot showing all values on one axis
ggplot() +
geom_line(data = df_rounded,
mapping = aes(x = quarter_floor,
y = n_quarter,
color = Injury.Cause))
Plotting With a Second Y-Axis
Putting two y-axes on the same plot is confusing and isn't often recommended, but here's how to do it. This solution creates two separate data frames, one for the left-hand axis and one for the right-hand axis.
The trick is that the y-values in the second data frame are rescaled so that their range matches the values plotted against the left-hand axis, and then the second axis is scaled to show the original range.
# set up a plot with two separately scaled axes.
# set up the first dataframe with the low-valued entries
df_for_plot_1 <- df %>%
mutate(quarter_floor = lubridate::floor_date(Date, unit = "quarter")) %>%
group_by(quarter_floor, Injury.Cause) %>%
summarise(n_quarter = sum(n)) %>%
filter(Injury.Cause != "Road Traffic Accidents")
# set up a second dataframe with only the high-value accidents.
# create scaled values so that the largest road accident sum equals the largest
# of the other accident sums.
df_for_plot_2 <- df %>%
mutate(quarter_floor = lubridate::floor_date(Date, unit = "quarter")) %>%
group_by(quarter_floor, Injury.Cause) %>%
summarise(n_quarter = sum(n)) %>%
ungroup() %>%
filter(Injury.Cause == "Road Traffic Accidents") %>%
mutate(n_quarter_scaled = n_quarter / (max(n_quarter) / max(df_for_plot_1$n_quarter)) )
# now plot both dataframes on the same plot, with a second y-axis that maps
# the scaled road-accident values back to the original values
ggplot() +
geom_line(data = df_for_plot_1,
mapping = aes(x = quarter_floor,
y = n_quarter,
color = Injury.Cause))+
geom_line(data = df_for_plot_2,
mapping = aes(x = quarter_floor,
y = n_quarter_scaled,
color = Injury.Cause))+
scale_x_date() +
scale_y_continuous(sec.axis = sec_axis(trans = ~ . *(max(df_for_plot_2$n_quarter) / max(df_for_plot_1$n_quarter))))
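The scaling round trip can be checked on its own. A minimal sketch in base R with made-up quarterly sums (the vectors left and right are illustrative, not taken from the data above):

```r
# Made-up quarterly sums (illustrative only)
left  <- c(420, 455, 390)      # other causes, read off the left axis
right <- c(1350, 1400, 1290)   # road traffic accidents, read off the right axis

# Same ratio as in the plotting code: largest right value over largest left value
ratio <- max(right) / max(left)

# Values actually drawn, expressed on the left-axis scale
right_scaled <- right / ratio

# What the secondary axis labels show: the inverse of the scaling
right_restored <- right_scaled * ratio

max(right_scaled)   # equals max(left), so both sets of lines share the panel height
right_restored      # equals the original right values
```

The same ratio has to be used in both places; if the data frame is scaled with one factor and sec_axis with another, the right-hand labels no longer describe the drawn line.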
I did not understand how you want to show data for each month and also summarise every 3 months, so I did both, separately.
Summarise (3 months)
The lubridate package has the function quarter(), which divides the year into quarters.
df %>%
group_by(quarter = quarter(Date)) %>%
summarise(total = sum(n,na.rm = TRUE))
# A tibble: 2 x 2
quarter total
<int> <int>
1 1 49
2 2 9
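Note that grouping by the quarter number alone pools the same quarter from different years. A minimal sketch in base R, with made-up dates and counts, of keeping the year in the grouping key:

```r
# Made-up daily counts spanning two years (illustrative only)
d <- as.Date(c("2019-03-25", "2019-04-02", "2020-03-25"))
n <- c(3, 2, 5)

# quarters() is base R and returns "Q1".."Q4";
# pasting the year in front keeps 2019 Q1 and 2020 Q1 apart
year_quarter <- paste(format(d, "%Y"), quarters(d))

# Sum the counts within each year-quarter
totals <- tapply(n, year_quarter, sum)
totals
```

With data that covers a single year, as in the sample above, both groupings give the same totals.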
Two y axis
Creating an axis with a different scale is very hard, and not recommended, in ggplot2, but you can create a secondary axis using a function of the main axis scale:
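df %>%
  ggplot() +
  aes(Date, n, color = Injury.Cause, group = Injury.Cause) +
  geom_line() +  # a geom is needed, otherwise only empty axes are drawn
  scale_y_continuous(
    breaks = seq(0, 150, 10),
    limits = c(0, 150),
    # ~ . is the identity transformation: the right axis repeats the left scale
    sec.axis = sec_axis(trans = ~ ., breaks = seq(0, 80, 10))
  )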
I have a dataframe as below (very simple structure) and I want to draw a column chart to show the amount for each date. The issue is that the date has duplicate entries (e.g., 2020-01-15).
# A tibble: 5 x 2
date amount
<date> <dbl>
1 2020-01-02 4000
2 2020-01-06 2568.
3 2020-01-15 2615.
4 2020-01-15 2565
5 2020-01-16 2640
When I try doing the following it somehow groups the similar dates together and draws a stacked column chart which is NOT what I want.
df %>%
ggplot(aes(x= factor(date), y=amount)) +
geom_col()
scale_x_discrete( labels = df$date ) #this creates discrete x-axis labels but the values are still stacked. So it just messes things up.
There's no issue if I'm using geom_line(), but I want to see a bar for each date. Any idea how to do this?
Try:
df %>%
ggplot(aes(date, amount)) +
geom_col(position = position_dodge2()) +
scale_x_date(breaks = unique(df$date))
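If a date axis is not required, another option is to give every row its own x position and use the dates only as labels. This is a sketch of an alternative, not the approach above; the row_id column is a helper introduced here, and the amounts are approximations of the values shown in the question.

```r
library(ggplot2)

# Data as in the question (amounts approximated from the printed tibble)
df <- data.frame(
  date   = as.Date(c("2020-01-02", "2020-01-06", "2020-01-15",
                     "2020-01-15", "2020-01-16")),
  amount = c(4000, 2568, 2615, 2565, 2640)
)

# One factor level per row, so duplicated dates do not share an x position
df$row_id <- factor(seq_len(nrow(df)))

p <- ggplot(df, aes(x = row_id, y = amount)) +
  geom_col() +
  # label each bar with its date instead of the row number
  scale_x_discrete(labels = format(df$date))
p
```

The trade-off is that the bars become evenly spaced, so the gaps between dates are no longer visible on the axis.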
I am trying to show different growing season lengths by displaying crop planting and harvest dates at multiple regions.
My final goal is a graph that looks like this:
which was taken from an answer to this question. Note that the dates are in julian days (day of year).
My first attempt to reproduce a similar plot is:
library(data.table)
library(ggplot2)
mydat <- "Region\tCrop\tPlanting.Begin\tPlanting.End\tHarvest.Begin\tHarvest.End\nCenter-West\tSoybean\t245\t275\t1\t92\nCenter-West\tCorn\t245\t336\t32\t153\nSouth\tSoybean\t245\t1\t1\t122\nSouth\tCorn\t183\t336\t1\t153\nSoutheast\tSoybean\t275\t336\t1\t122\nSoutheast\tCorn\t214\t336\t32\t122"
# read data as data table
mydat <- setDT(read.table(textConnection(mydat), sep = "\t", header=T))
# melt data table
m <- melt(mydat, id.vars=c("Region","Crop"), variable.name="Period", value.name="value")
# plot stacked bars
ggplot(m, aes(x=Crop, y=value, fill=Period, colour=Period)) +
geom_bar(stat="identity") +
facet_wrap(~Region, nrow=3) +
coord_flip() +
theme_bw(base_size=18) +
scale_colour_manual(values = c("Planting.Begin" = "black", "Planting.End" = "black",
"Harvest.Begin" = "black", "Harvest.End" = "black"), guide = "none")
However, there are a few issues with this plot:
Because the bars are stacked, the values on the x-axis are aggregated and end up too high - out of the 1-365 scale that represents day of year.
I need to combine Planting.Begin and Planting.End in the same color, and do the same to Harvest.Begin and Harvest.End.
Also, a "void" (or a completely uncolored bar) needs to be created between Planting.Begin and Harvest.End.
Perhaps the graph could be achieved with geom_rect or geom_segment, but I really want to stick to geom_bar since it's more customizable (for example, it accepts scale_colour_manual in order to add black borders to the bars).
Any hints on how to create such graph?
I don't think this is something you can do with geom_bar or geom_col. A more general approach would be to use geom_rect to draw rectangles. To do this, we need to reshape the data a bit:
library(magrittr)  # provides the %>% pipe used below
plotdata <- mydat %>%
dplyr::mutate(Crop = factor(Crop)) %>%
tidyr::pivot_longer(Planting.Begin:Harvest.End, names_to="period") %>%
tidyr::separate(period, c("Type","Event")) %>%
tidyr::pivot_wider(names_from=Event, values_from=value)
# Region Crop Type Begin End
# <chr> <fct> <chr> <int> <int>
# 1 Center-West Soybean Planting 245 275
# 2 Center-West Soybean Harvest 1 92
# 3 Center-West Corn Planting 245 336
# 4 Center-West Corn Harvest 32 153
# 5 South Soybean Planting 245 1
# ...
We've used tidyr to reshape the data so we have one row per rectangle that we want to draw, and we've also made Crop a factor. We can then plot it like this:
ggplot(plotdata) +
aes(ymin=as.numeric(Crop)-.45, ymax=as.numeric(Crop)+.45, xmin=Begin, xmax=End, fill=Type) +
geom_rect(color="black") +
facet_wrap(~Region, nrow=3) +
theme_bw(base_size=18) +
scale_y_continuous(breaks=seq_along(levels(plotdata$Crop)), labels=levels(plotdata$Crop))
The part that's a bit messy here is that we want a discrete scale for y but geom_rect needs numeric values. Since Crop is now a factor, we use the factor's integer codes to create the ymin and ymax positions. Then we need to relabel the y axis with the names of the factor levels.
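The factor-to-number step can be seen in isolation. A small base R sketch, using a toy crop vector rather than the full plotdata:

```r
# Toy factor with the two crops from the question
crop <- factor(c("Soybean", "Corn", "Soybean"))

# Levels are sorted alphabetically by default
levels(crop)       # "Corn" "Soybean"

# as.numeric() returns each value's position among the levels
codes <- as.numeric(crop)
codes              # 2 1 2

# Rectangle edges centred on each integer code, as in the aes() call
ymin <- codes - 0.45
ymax <- codes + 0.45

# Axis breaks 1, 2 are then relabelled with the level names
breaks <- seq_along(levels(crop))
```

Because the codes follow the level order, reordering the factor levels is how you would change the vertical order of the bars.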
If you also wanted to get the month names on the x axis you could do something like
dateticks <- seq.Date(as.Date("2020-01-01"), as.Date("2020-12-01"),by="month")
# then add this to your plot
... +
scale_x_continuous(breaks=lubridate::yday(dateticks),
labels=lubridate::month(dateticks, label=TRUE, abbr=TRUE))
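To see which break positions that produces, the day-of-year values can be computed directly. A minimal sketch in base R, using format() with "%j" in place of lubridate::yday():

```r
# First day of each month in 2020, as in the snippet above
dateticks <- seq.Date(as.Date("2020-01-01"), as.Date("2020-12-01"), by = "month")

# "%j" formats a date as its day of the year (001-366)
doy <- as.integer(format(dateticks, "%j"))
doy
# 1 32 61 92 122 153 183 214 245 275 306 336

# Pair each break with its abbreviated month name (month.abb is built into R)
setNames(doy, month.abb)
```

2020 is a leap year, so the breaks from March onward are one day later than they would be in a common year. Several of the planting and harvest values in the question's data (for example 92, 153 and 245) coincide with these leap-year month starts.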
I have a dataframe that only lists months from October to April. When I plot this data on a line graph, it includes the unused months as well. I would only like to show the months that are listed in the data so there is no unused space on the plot.
The code I am using for the plot
ggplot(df,aes(GAME_DATE,DEF_RATING)) +
geom_line(aes(y=rollmean(df$DEF_RATING,9, na.pad = TRUE))) +
geom_line(aes(y=rollmean(df$OFF_RATING,9,na.pad = TRUE)),color='steelblue')
Sample data
GAME_DATE OFF_RATING DEF_RATING
<dttm> <dbl> <dbl>
1 2017-04-12 106.1 113.1
2 2017-04-10 107.1 100.8
3 2017-04-08 104.4 105.1
4 2017-04-06 116.1 105.9
5 2017-04-04 116.9 116.0
ggplot2 doesn't allow broken axes because such axes can be misleading. However, if you still want to proceed with this, you can simulate a broken axis with faceting. To do this, create a grouping variable to mark each "island" where data is present with a unique group code and then facet by those group codes.
Also, the data should be converted to long format before plotting, so that you can get two separate colored lines with a single call to geom_line. Mapping a column to color inside aes will also automatically generate a legend.
Here's an example with fake data:
library(tidyverse)
# Fake data
set.seed(2)
dat = data.frame(x=1950:2000,
y1=c(cumsum(rnorm(30)), rep(NA,10), cumsum(rnorm(11))),
y2=c(cumsum(rnorm(30)), rep(NA,10), cumsum(rnorm(11))))
dat %>%
# Convert to long format
gather(key, value, y1:y2) %>%
# Add the grouping variable
group_by(key) %>%
mutate(group=c(0, cumsum(diff(as.integer(is.na(value)))!=0))) %>%
# Remove missing values
filter(!is.na(value)) %>%
ggplot(aes(x, value, colour=key)) +
geom_line() +
scale_x_continuous(breaks=seq(1950,2000,10), expand=c(0,0.1)) +
facet_grid(. ~ group, scales="free_x", space="free_x") +
theme(strip.background=element_blank(),
strip.text=element_blank())
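The grouping expression in the mutate() call is the least obvious part. Here is how it behaves on a toy vector with one gap (base R only, values made up):

```r
# Toy series with one gap (NA values) in the middle
value <- c(1.2, 0.7, NA, NA, 2.1, 1.9)

# is.na() flags the gap; diff() is non-zero wherever data turns into gap
# or gap turns into data; cumsum() counts those borders so far
group <- c(0, cumsum(diff(as.integer(is.na(value))) != 0))
group                  # 0 0 1 1 2 2

# After dropping the gap, the two islands keep distinct codes
islands <- group[!is.na(value)]
islands                # 0 0 2 2
```

The codes themselves are arbitrary; all that matters for faceting is that each stretch of data gets a code that differs from its neighbours.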
You can try limiting the x-axis breaks to the dates present in your data. Since GAME_DATE is a date-time (<dttm>) column, use scale_x_datetime() rather than scale_x_date():
library(zoo)  # provides rollmean()
ggplot(df,aes(GAME_DATE,DEF_RATING)) +
geom_line(aes(y=rollmean(df$DEF_RATING,9, na.pad = TRUE))) +
geom_line(aes(y=rollmean(df$OFF_RATING,9,na.pad = TRUE)),color='steelblue') +
scale_x_datetime(date_labels="%b",breaks=seq(min(df$GAME_DATE),max(df$GAME_DATE), "1 month"))
head(bktst.plotdata)
date method product type actuals forecast residual Percent_error month
1 2012-12-31 bauwd CUSTM NET 194727.51 -8192.00 -202919.51 -104.21 Dec12
2 2013-01-31 bauwd CUSTM NET 470416.27 1272.01 -469144.26 -99.73 Jan13
3 2013-02-28 bauwd CUSTM NET 190943.57 -1892.45 -192836.02 -100.99 Feb13
4 2013-03-31 bauwd CUSTM NET -42908.91 2560.05 45468.96 -105.97 Mar13
5 2013-04-30 bauwd CUSTM NET -102401.68 358807.48 461209.16 -450.39 Apr13
6 2013-05-31 bauwd CUSTM NET -134869.73 337325.33 472195.06 -350.11 May13
I have been trying to plot my backtest results using ggplot2. A sample dataset is given above. I have dates ranging from Dec 2012 to Jul 2013, with 3 levels in 'method', 5 levels in 'product' and 2 levels in 'type'.
I tried this code. The trouble is that R is not reading the x-axis correctly: on the x-axis I am getting 'Jan, Feb, Mar, Apr, May, Jun, Jul, Aug', whereas I expect R to plot Dec to Jul.
month.plot1 <- ggplot(data=bktst.plotdata, aes(x= date, y=Percent_error, colour=method))
facet4 <- facet_grid(product~type,scales="free_y")
title3 <- ggtitle("Percent Error - Month-over-Month")
xaxis2 <- xlab("Date")
yaxis3 <- ylab("Error (%)")
month.plot1+geom_line(stat="identity", size=1, position="identity")+facet4+title3+xaxis2+yaxis3
# Tried changing the code to this still not getting the X-axis right
month.plot1 <- ggplot(data=bktst.plotdata, aes(x= format(date,'%b%y'), y=Percent_error, colour=method))
month.plot1+geom_line(stat="identity", size=1, position="identity")+facet4+title3+xaxis2+yaxis3
Well, it looks like you are plotting the last day of each month, so it actually makes sense to me that December 31 is plotted very very close to January. If you look at the plotted points (with geom_point) you can see that each point is just to the left of the closest month axis.
It sounds like you want to plot years and months instead of actual dates. There are a variety of ways you might do this, but one thing you could do is change the day part of the date to the first of the month instead of the last of the month. Here I show how you could do this using some functions from the lubridate package along with paste (I have assumed your variable date is already a Date object).
require(lubridate)
bktst.plotdata$date2 = as.Date(with(bktst.plotdata,
paste(year(date), month(date), "01", sep = "-")))
Then the plot axes start at December. You can change the format of the x axis if you load the scales package.
require(scales)
ggplot(data=bktst.plotdata, aes(x = date2, y=Percent_error, colour=method)) +
facet_grid(product~type,scales="free_y") +
ggtitle("Percent Error - Month-over-Month") +
xlab("Date") + ylab("Error (%)") +
geom_line() +
scale_x_date(labels=date_format(format = "%m-%Y"))
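The first-of-month conversion can also be done without paste. A minimal base R sketch on three of the sample dates, assuming date is a Date vector as above:

```r
# Month-end dates from the sample data
d <- as.Date(c("2012-12-31", "2013-01-31", "2013-02-28"))

# Format each date with the day fixed to "01", then parse it back to Date
first_of_month <- as.Date(format(d, "%Y-%m-01"))
first_of_month   # "2012-12-01" "2013-01-01" "2013-02-01"
```

lubridate::floor_date(d, unit = "month") gives the same result if lubridate is already loaded.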