How do I group 90th percentiles by two columns - r

I'm working with a large data frame (7191 obs. of 19 variables). The columns are Month, Day, Year, and Site1 through Site16, where Month is one of June, July, August, September, or October.
Here is the beginning of my data, which I believe has only numbers in the columns Site1-Site16; I'm currently double-checking that.
dput(head(No_PS_for_Calculations))
structure(list(Month = c("June", "June", "June", "June", "June",
"June"), Day = c(1, 2, 3, 4, 5, 6), Year = c(1970, 1970, 1970,
1970, 1970, 1970), Site1 = c("11.531", "12.298", "12.732", "12.619",
"12.5", "13.201"), Site2 = c("11.185", "11.439", "12.17", "12.432",
"12.337", "12.492"), Site3 = c("11.147", "11.496", "11.645",
"12.208", "12.644", "12.971"), Site4 = c("11.393", "11.707",
"11.961", "12.135", "12.809", "13.041"), Site5 = c("11.797",
"11.925", "12.34", "12.525", "13.01", "13.548"), Site6 = c("11.853",
"11.974", "12.16", "12.481", "12.459", "12.838"), Site7 = c("12.319",
"12.46", "12.476", "12.729", "13.026", "13.032"), Site8 = c("12.557",
"12.643", "12.789", "12.975", "13.202", "13.339"), Site9 = c("12.774",
"13.337", "13.896", "13.897", "13.819", "14.054"), Site10 = c("12.819",
"13.202", "13.783", "14.298", "14.284", "14.309"), Site11 = c("13.151",
"13.556", "13.833", "14.08", "14.244", "14.841"), Site12 = c("13.61",
"13.57", "14.111", "14.073", "14.331", "14.849"), Site13 = c("13.802",
"13.872", "14.244", "14.249", "14.255", "14.818"), Site14 = c("14.138",
"14.275", "14.332", "14.522", "14.244", "14.927"), Site15 = c("14.138",
"14.616", "14.766", "14.697", "14.61", "14.694"), Site16 = c("14.208",
"14.627", "14.928", "14.829", "14.69", "14.762")), row.names = 151:156, class = "data.frame")
For my analysis I am interested in finding the 90th percentile for each month in each year. For example, for 1970 I need the 90th percentile for June, July, August, September, and October. I've tried a few different ways but keep getting stuck in the same spot, so I thought I'd ask for help.
result <- No_PS_for_Calculations %>%
group_by(Year, Month) %>%
summarise(across(Site1:`Site16`, quantile, probs = .9, .names = 'percent90_{col}'))
data.frame(result)
Which results in the following error:
Error: Problem with `summarise()` input `..1`.
i `..1 = across(Site1:Site16, quantile, probs = 0.9, .names = "percent90_{col}")`.
x non-numeric argument to binary operator
i The error occurred in group 1: Year = 1970, Month = "August".
I've been able to find the percentile grouped by month, but now need to include year for further analysis.
What is the best way to get the 90th percentiles presented by year and then month?
Thanks for the help!

It seems likely that you have something non-numeric in a column between Site1 and Site16. Some fake data:
set.seed(42)
No_PS_for_Calculations <- data.frame(
Year = rep(2020:2021, each = 3),
Month = rep(c("Aug", "Sep", "Oct"), times = 2),
Site1 = runif(6),
Quux = sprintf("%0.03f", runif(6)),
Site16 = runif(6)
)
No_PS_for_Calculations
# Year Month Site1 Quux Site16
# 1 2020 Aug 0.9148060 0.737 0.9346722
# 2 2020 Sep 0.9370754 0.135 0.2554288
# 3 2020 Oct 0.2861395 0.657 0.4622928
# 4 2021 Aug 0.8304476 0.705 0.9400145
# 5 2021 Sep 0.6417455 0.458 0.9782264
# 6 2021 Oct 0.5190959 0.719 0.1174874
No_PS_for_Calculations %>%
group_by(Year, Month) %>%
summarise(across(Site1:`Site16`, quantile, probs = .9, .names = 'percent90_{col}'))
# Error: Problem with `summarise()` input `..1`.
# x non-numeric argument to binary operator
# i Input `..1` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
# i The error occurred in group 1: Year = 2020, Month = "Aug".
If the non-numeric data ("Quux" column here) is not meant to be summarized, then you can select the columns you need to avoid any confusion:
No_PS_for_Calculations %>%
select(Year, Month, starts_with("Site")) %>%
group_by(Year, Month) %>%
summarise(across(Site1:`Site16`, quantile, probs = .9, .names = 'percent90_{col}'))
# # A tibble: 6 x 4
# # Groups: Year [2]
# Year Month percent90_Site1 percent90_Site16
# <int> <chr> <dbl> <dbl>
# 1 2020 Aug 0.915 0.737
# 2 2020 Oct 0.286 0.657
# 3 2020 Sep 0.937 0.135
# 4 2021 Aug 0.830 0.705
# 5 2021 Oct 0.519 0.719
# 6 2021 Sep 0.642 0.458
Another cause might be that a legitimate Site column is non-numeric, in which case you need to determine whether you can easily convert it to numeric. For instance, if "Quux" here is instead named "Site2"
names(No_PS_for_Calculations)[4] <- "Site2"
then we can try to convert it inline:
No_PS_for_Calculations %>%
mutate(Site2 = as.numeric(Site2)) %>%
group_by(Year, Month) %>%
summarise(across(Site1:`Site16`, quantile, probs = .9, .names = 'percent90_{col}'))
# # A tibble: 6 x 5
# # Groups: Year [2]
# Year Month percent90_Site1 percent90_Site2 percent90_Site16
# <int> <chr> <dbl> <dbl> <dbl>
# 1 2020 Aug 0.915 0.737 0.935
# 2 2020 Oct 0.286 0.657 0.462
# 3 2020 Sep 0.937 0.135 0.255
# 4 2021 Aug 0.830 0.705 0.940
# 5 2021 Oct 0.519 0.719 0.117
# 6 2021 Sep 0.642 0.458 0.978
Of course, if there are non-numeric characters in there, as.numeric will produce NAs, which are easily fixed with filters, cleaning steps, or similar.
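In fact, the question's own dput shows every Site column stored as character (the values are quoted), so converting all of them before summarising should resolve the original error. A sketch, assuming dplyr >= 1.0, with fake data in the question's shape:

```r
library(dplyr)

# Fake data in the question's shape: Site values stored as character strings
sites <- setNames(lapply(1:16, function(i) sprintf("%.1f", i + 0:5)),
                  paste0("Site", 1:16))
dat <- data.frame(Year = 1970, Month = rep(c("June", "July"), each = 3),
                  sites, stringsAsFactors = FALSE)

result <- dat %>%
  mutate(across(Site1:Site16, as.numeric)) %>%   # quoted numbers -> doubles
  group_by(Year, Month) %>%
  summarise(across(Site1:Site16, ~ quantile(.x, probs = 0.9),
                   .names = "percent90_{col}"),
            .groups = "drop")
result
```

The lambda form `~ quantile(.x, probs = 0.9)` also sidesteps the deprecation warning that newer dplyr versions emit when extra arguments are passed through `across()` directly.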

Related

Reordering rows alphabetically with specific exception(s) in R

This is my first Stack Overflow question so bear with me please. I'm trying to create dataframes that are ordered alphabetically based on a "Variable" field, with exceptions made for rows of particular values (e.g. "Avg. Temp" at the top of the dataframe and "Intercept" at the bottom of the dataframe). The starting dataframe might look like this, for example:
Variable Model 1 Estimate
Year=2009 0.026
Year=2010 -0.04
Year=2011 -0.135***
Age 0.033***
Avg Temp. -0.001***
Intercept -3.772***
Sex -0.073***
Year=2008 0.084***
Year=2012 -0.237***
Year=2013 -0.326***
Year=2014 -0.431***
Year=2015 -0.589***
And I want to reorder it as such:
Variable Model 1 Estimate
Avg Temp. -0.001***
Age 0.033***
Sex -0.073***
Year=2008 0.084***
Year=2009 0.026
Year=2010 -0.04
Year=2011 -0.135***
Year=2012 -0.237***
Year=2013 -0.326***
Year=2014 -0.431***
Year=2015 -0.589***
Intercept -3.772***
Appreciate any help on this.
You can use the fct_relevel() function from {forcats}. The first call puts Avg Temp., Age, and Sex at the beginning (after = 0 by default). The second call puts Intercept at the end (n() refers to the number of rows in the data frame).
library(tidyverse)
df <-
tribble(~Variable, ~Model,
"Year=2009", 0.026,
"Year=2010", -0.04,
"Year=2011", -0.135,
"Age", 0.033,
"Avg Temp.", -0.001,
"Intercept", -3.772,
"Sex", -0.073,
"Year=2008", 0.084,
"Year=2012", -0.237,
"Year=2013", -0.326,
"Year=2014", -0.431,
"Year=2015", -0.589)
df %>%
mutate(Variable = as.factor(Variable),
Variable = fct_relevel(Variable, "Avg Temp.", "Age", "Sex"),
Variable = fct_relevel(Variable, "Intercept", after = n())) %>%
arrange(Variable)
# A tibble: 12 × 2
Variable Model
<fct> <dbl>
1 Avg Temp. -0.001
2 Age 0.033
3 Sex -0.073
4 Year=2008 0.084
5 Year=2009 0.026
6 Year=2010 -0.04
7 Year=2011 -0.135
8 Year=2012 -0.237
9 Year=2013 -0.326
10 Year=2014 -0.431
11 Year=2015 -0.589
12 Intercept -3.77
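If you'd rather not create factor columns at all, the same ordering can be expressed directly inside arrange() with match(); a sketch on a subset of the data above:

```r
library(dplyr)

df <- tibble::tribble(~Variable, ~Model,
                      "Year=2009", 0.026,
                      "Age", 0.033,
                      "Avg Temp.", -0.001,
                      "Intercept", -3.772,
                      "Sex", -0.073,
                      "Year=2008", 0.084)

res <- df %>%
  arrange(match(Variable, c("Avg Temp.", "Age", "Sex")),  # named rows first; NAs (everything else) sort after them
          Variable == "Intercept",                        # FALSE sorts before TRUE, pushing Intercept to the end
          Variable)                                       # alphabetical for the rest
res
```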
Another option, in case the dataframes contain a variety of different variable names besides year and intercept, is something like this:
library(tidyverse)
# Sample data
df <- tribble(
~variable, ~model_1_estimate,
"Year=2009", "0.026",
"Year=2010", "-0.04",
"Year=2011", "-0.135***",
"Age", "0.033***",
"Avg Temp.", "-0.001***",
"Intercept", "-3.772***",
"Sex", "-0.073***",
"Year=2008", "0.084***",
"Year=2012", "-0.237***",
"Year=2013", "-0.326***",
"Year=2014", "-0.431***",
"Year=2015", "-0.589***"
)
# Possible solution
df |>
separate(variable, c("term", "year"), sep = "=") |>
mutate(intercept = if_else(term == "Intercept", 1, 0)) |>
arrange(intercept, term, year) |>
select(-intercept)
#> # A tibble: 12 × 3
#> term year model_1_estimate
#> <chr> <chr> <chr>
#> 1 Age <NA> 0.033***
#> 2 Avg Temp. <NA> -0.001***
#> 3 Sex <NA> -0.073***
#> 4 Year 2008 0.084***
#> 5 Year 2009 0.026
#> 6 Year 2010 -0.04
#> 7 Year 2011 -0.135***
#> 8 Year 2012 -0.237***
#> 9 Year 2013 -0.326***
#> 10 Year 2014 -0.431***
#> 11 Year 2015 -0.589***
#> 12 Intercept <NA> -3.772***
Created on 2022-06-28 by the reprex package (v2.0.1)

Filtering twice with multiple variables and counting rows

I have this data frame that is grouped by id_station, id_parameter, "zona", and its date.
id_station id_parameter zona year month day mediaDiaria sdDiaria Count
1 AJM CO SO 2019 1 1 0.281 0.181 21
2 AJM CO SO 2019 1 2 0.367 0.230 24
3 AJM CO SO 2019 1 3 0.371 0.160 24
4 AJM CO SO 2019 1 4 0.312 0.185 24
5 AJM CO SO 2019 1 5 0.296 0.168 24
6 AJM CO SO 2019 1 6 0.225 0.142 24
7 AJM CO SO 2019 1 7 0.281 0.0873 21
8 AJM CO SO 2019 1 8 0.388 0.236 24
9 AJM CO SO 2019 1 9 0.421 0.265 24
10 AJM CO SO 2019 1 10 0.225 0.103 24
What I want to do is to filter March 1st, 2019 to February 29, 2020. I would treat this as "Year 1." Afterwards, I want to count the number of rows in Count, in Year 1 and per id_station, to eliminate all rows from stations that have less than 275 rows (days) with Count > 18.
I have tried the following with filter:
Year1in2019CO <- datosCO %>%
filter(year == 2019, month %in% c(3:12)) %>%
group_by(id_station, id_parameter, zona, year, month, day) %>%
summarise(mediaDiaria = mean(valor, na.rm = TRUE), sdDiaria = sd(valor, na.rm = TRUE),
Count = sum(!is.na(valor)))
Year1in2020CO <- datosCO %>%
filter(year == 2020, month %in% c(1:2)) %>%
group_by(id_station, id_parameter, zona, year, month, day) %>%
summarise(mediaDiaria = mean(valor, na.rm = TRUE), sdDiaria = sd(valor, na.rm = TRUE),
Count = sum(!is.na(valor)))
Year1CO <- bind_rows(Year1in2019CO, Year1in2020CO)
It does the job. But is there a way to do this while only creating one data frame, instead of 3?
And I have tried the following for the counting rows part:
YEAR1dfCO_2 <- Year1CO %>%
group_by(id_station) %>%
summarise(dws = sum(Count > 18))
And while it does give me what I need, I do not know how to eliminate all data from stations with less than 275 rows in Count (being > 18) in Year 1 from the original dataset (Year1CO).
Can you please help me?
This might work. First, filter year 1 rows using a constructed date, then remove the stations based on the condition you described.
library(tidyverse)
yr1 <- datosCO %>%
mutate(d = as.Date(paste(year, month, day, sep = "-"))) %>%
filter(between(d, as.Date("2019-03-01"), as.Date("2020-02-29"))) %>%
group_by(id_station, id_parameter, zona, d) %>%
summarise(mediaDiaria = mean(valor, na.rm = TRUE),
sdDiaria = sd(valor, na.rm = TRUE),
Count = sum(!is.na(valor)))
yr1 %>%
group_by(id_station) %>%
filter(sum(Count > 18) >= 275) %>% # keep only stations with at least 275 days where Count > 18
ungroup()
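If you want to keep the per-station counts around for inspection (as in the YEAR1dfCO_2 step), an equivalent two-step version uses semi_join(); a sketch with fake data carrying only the columns involved:

```r
library(dplyr)

# Fake daily rows: station A has 300 qualifying days, station B only 100
yr1 <- data.frame(id_station = rep(c("A", "B"), each = 300),
                  Count = c(rep(24, 300), rep(c(24, 10, 10), 100)))

# Count qualifying days per station, then keep stations meeting the threshold
dws_tbl <- yr1 %>%
  group_by(id_station) %>%
  summarise(dws = sum(Count > 18), .groups = "drop")

yr1_kept <- yr1 %>%
  semi_join(filter(dws_tbl, dws >= 275), by = "id_station")
```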

How to perform multiple regressions grouped by 2 factors and create a file containing N and R-squared?

I am running into some problems again and hope that someone can help me. I am doing research on the effect of ELI on ROS for firms, and whether the pandemic has an effect on this. For this research, my thesis supervisor has asked me to do a regression analysis per year, grouped by industry (NAICS), and I am at a loss as to how to do this. I have firms in 46 different industries (NAICS) and 11 years of firm data per firm (2010-2020). Now I would like to run the regression ROS ~ ELI + ELI*Pandemic for all industries for each year, and then capture the resulting N (number of firms per industry) and R-squared in one file. The image below is an example of what I am trying to achieve:
I hope that someone can help me because I am at an absolute loss and I can't seem to find a similar question/answer on SO.
Here is the dput(head()) as an example. NAICS is the industry.
df <- structure(list(NAICS = c(315, 315, 315, 315, 315, 315),
Year = c(2010, 2011, 2012, 2013, 2014, 2015),
Firm = c("A", "A", "A", "A", "A", "A"),
ROS = c(0.17, 0.19, 0.29, 0.3, 0.29, 0.25),
ELI = c(0.856264428748774, 0.723379402777553, 0.958341156943977, 0.680567730897854, 0.790480861209701, 0.827279134948296),
Pandemic = c(0, 0, 0, 0, 0, 0)),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame"))
Update 2
I have made the necessary modifications to my solution after receiving the original data set, and I don't think there will be any other problems.
library(dplyr)
library(tidyr)
library(broom)
library(purrr)
df %>%
group_by(NAICS, Year) %>%
add_count(name = "N") %>%
nest(data = !c(NAICS, Year, N)) %>%
mutate(models = map(data, ~ lm(ROS ~ ELI + ELI * Pandemic, data = .)),
glance = map(models, ~ glance(.x)),
tidied = map(models, ~ tidy(.x))) %>%
unnest(glance) %>%
select(NAICS:N, r.squared, tidied) %>%
unnest(tidied)
# A tibble: 2,024 x 9
# Groups: NAICS, Year [506]
NAICS Year N r.squared term estimate std.error statistic p.value
<dbl> <dbl> <int> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 315 2010 12 0.122 (Intercept) 0.0959 0.0123 7.83 0.0000143
2 315 2010 12 0.122 ELI 0.0189 0.0160 1.18 0.266
3 315 2010 12 0.122 Pandemic NA NA NA NA
4 315 2010 12 0.122 ELI:Pandemic NA NA NA NA
5 315 2011 12 0.129 (Intercept) 0.0999 0.0115 8.70 0.00000559
6 315 2011 12 0.129 ELI 0.0161 0.0132 1.22 0.251
7 315 2011 12 0.129 Pandemic NA NA NA NA
8 315 2011 12 0.129 ELI:Pandemic NA NA NA NA
9 315 2012 13 0.594 (Intercept) -0.486 0.606 -0.802 0.439
10 315 2012 13 0.594 ELI 2.11 0.526 4.01 0.00205
# ... with 2,014 more rows
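Since the stated goal is one file containing just N and R-squared per industry and year, the glance() step alone is enough; a sketch with made-up data (the file name is an assumption):

```r
library(dplyr)
library(tidyr)
library(broom)
library(purrr)

# Made-up data: two industries, two years, five firms each
set.seed(1)
df <- expand.grid(NAICS = c(315, 316), Year = c(2019, 2020), Firm = LETTERS[1:5])
df$ROS <- runif(nrow(df))
df$ELI <- runif(nrow(df))
df$Pandemic <- as.numeric(df$Year == 2020)

summary_tbl <- df %>%
  group_by(NAICS, Year) %>%
  add_count(name = "N") %>%
  nest(data = !c(NAICS, Year, N)) %>%
  mutate(models = map(data, ~ lm(ROS ~ ELI + ELI * Pandemic, data = .x)),
         glance = map(models, glance)) %>%
  unnest(glance) %>%
  select(NAICS, Year, N, r.squared)

# One row per industry-year, written to a single CSV
write.csv(summary_tbl, "n_and_rsquared_by_industry_year.csv", row.names = FALSE)
```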

Weekly average of daily panel data in R

I have a large panel dataset of roughly 4 million daily observations (overview of my dataset).
The variable symbol identifies the 952 different stocks contained in the data set, and the other variables are stock-related daily measures. I want to calculate the weekly averages of the variables rv, rskew, rkurt, and rsj for each of the 952 stocks included in symbol.
I tried to group the dataset with group_by(symbol), but then I did not manage to aggregate the daily observations in the right way.
I am not very experienced with R and would highly appreciate some help here.
This is simple with the lubridate and dplyr packages:
library(dplyr)
library(lubridate)
set.seed(123)
df <- data.frame(date = seq.Date(ymd('2020-07-01'), ymd('2020-07-31'), by='day'),
symbol = 'a',
x = runif(31),
y = runif(31),
z = runif(31)
)
df <- df %>%
mutate(year = year(date),
week = week(date),
) %>%
group_by(year, week, symbol) %>%
summarise(x = mean(x),
y = mean(y),
z = mean(z)
)
> df
# A tibble: 5 x 6
# Groups: year, week [5]
year week symbol x y z
<dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 2020 27 a 0.555 0.552 0.620
2 2020 28 a 0.652 0.292 0.461
3 2020 29 a 0.495 0.350 0.398
4 2020 30 a 0.690 0.495 0.609
5 2020 31 a 0.466 0.378 0.376
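One caveat: lubridate::week() numbers weeks within each calendar year, so a week spanning New Year gets split into two groups. If calendar weeks should stay intact, floor_date() assigns each observation its week's start date instead; a sketch on the same fake data (across() assumes dplyr >= 1.0):

```r
library(dplyr)
library(lubridate)

set.seed(123)
df <- data.frame(date = seq.Date(ymd('2020-07-01'), ymd('2020-07-31'), by = 'day'),
                 symbol = 'a',
                 x = runif(31),
                 y = runif(31),
                 z = runif(31))

weekly <- df %>%
  mutate(week_start = floor_date(date, unit = "week", week_start = 1)) %>%  # Monday-based weeks
  group_by(symbol, week_start) %>%
  summarise(across(c(x, y, z), mean), .groups = "drop")
weekly
```

Because week_start is a real Date, weeks that cross a year boundary stay in one group, and the column sorts and plots naturally.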

generate seasonal plot, but with fiscal year start/end dates

Hello! Is there a way to index a chart to start and end at specific points (which may be out of numeric order)?
I have data that begins October 1st and ends September 30th the following year. The series repeats through multiple past years, and I want to build a daily seasonality chart. The challenge is that the X axis does not run from low to high; it runs 10-11-12-1-2-3-4-5-6-7-8-9.
Question 1:
Can you order the index by month 10-11-12-1-2-3-4-5-6-7-8-9, while staying compatible with %m-%d formatting? The real problem is in daily format, but for the sake of brevity I am only using months.
The result should look something like this... sorry, I had to use Excel...
Question 2:
Can we remove the connected chart lines, or will the solution to question 1 naturally fix this? Examples are in the attempts below.
Question 3:
Can the final formatting of the solution allow taking a moving average, or other mutations of the initial data? The table in attempt #2 would allow taking the average of each month by year. Since July 2017 is 6 and July 2018 is 12, we would plot a 9 in the chart, etc., for the entire plot.
Question 4:
Is there an xts equivalent to solve this problem?
THANK YOU, THANK YOU, THANK YOU!
library(ggplot2)
library(plotly)
library(tidyr)
library(reshape2)
Date <- seq(as.Date("2016-10-1"), as.Date("2018-09-01"), by="month")
values <- c(2,3,4,3,4,5,6,4,5,6,7,8,9,10,8,9,10,11,12,13,11,12,13,14)
YearEnd <-c(2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,
2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018)
df <- data.frame(Date,values,YearEnd)
## PLOT THE TIMESERIES
plot_ly(df, x = ~Date, y = ~values, type = "scatter", mode = "lines")
## PLOT THE DATA BY MONTH: attempt 1
df$Month <- format(df$Date, format="%m")
df2 <- df %>%
select(values, Month, YearEnd)
plot_ly(df2, x = ~Month, y = ~values, type = "scatter", mode = "lines",
connectgaps = FALSE)
## Plot starts on the 10th month, which is good, but the index is
## in standard order, not 10-11-12-1-2-3-4-5-6-7-8-9
## It also still connects the gaps, bad.
## CREATE A PIVOTTABLE: attempt 2
table <- spread(df2,YearEnd, values)
df3 <- melt(table , id.vars = 'Month', variable.name = 'series')
plot_ly(df3, x = ~Month, y = ~values, type = "scatter", mode = "lines",
connectgaps = FALSE)
## now the data are in the right order, but the index is still wrong
## I also do not understand how plotly is ordering it correctly, as 2
## is not the starting point in January.
You just need to set the desired levels for Month inside factor():
library(magrittr)
library(tidyverse)
library(lubridate)
library(plotly)
Date <- seq(as.Date("2016-10-1"), as.Date("2018-09-01"), by = "month")
values <- c(2, 3, 4, 3, 4, 5, 6, 4, 5, 6, 7, 8, 9, 10, 8, 9, 10, 11, 12, 13, 11, 12, 13, 14)
YearEnd <- c(
2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018
)
df <- data.frame(Date, values, YearEnd)
# to fiscal year order
df %<>%
mutate(
Month = month(Date),
YearEnd = factor(YearEnd)) %>%
mutate(Month = factor(Month,
levels = c(10:12, 1:9),
labels = c(month.abb[10:12], month.abb[1:9])))
df
#> Date values YearEnd Month
#> 1 2016-10-01 2 2017 Oct
#> 2 2016-11-01 3 2017 Nov
#> 3 2016-12-01 4 2017 Dec
#> 4 2017-01-01 3 2017 Jan
#> 5 2017-02-01 4 2017 Feb
#> 6 2017-03-01 5 2017 Mar
#> 7 2017-04-01 6 2017 Apr
#> 8 2017-05-01 4 2017 May
#> 9 2017-06-01 5 2017 Jun
#> 10 2017-07-01 6 2017 Jul
#> 11 2017-08-01 7 2017 Aug
#> 12 2017-09-01 8 2017 Sep
...
p1 <- ggplot(df, aes(
x = Month, y = values,
color = YearEnd,
group = YearEnd)) +
geom_line() +
theme_classic(base_size = 12)
ggplotly(p1)
Edit: to plot by Julian day, we use a similar method to the 3rd one from this answer
# Generate random data
set.seed(2018)
date = seq(from = as.Date("2016-10-01"), to = as.Date("2018-09-30"),
by = "days")
values = c(rnorm(length(date)/2, 8, 1.5), rnorm(length(date)/2, 16, 2))
dat <- data.frame(date, values)
df <- dat %>%
tbl_df() %>%
mutate(jday = factor(yday(date)),
Month = month(date),
Year = year(date),
# only create label for the 1st day of the month
myLabel = case_when(day(date) == 1L ~ format(date, "%b-%d"),
TRUE ~ NA_character_)) %>%
# create fiscal year column
mutate(fcyear = case_when(Month > 9 ~ as.factor(Year + 1),
TRUE ~ as.factor(Year))) %>%
mutate(Month = factor(Month,
levels = c(10:12, 1:9),
labels = c(month.abb[10:12], month.abb[1:9])))
df
#> # A tibble: 730 x 7
#> date values jday Month Year myLabel fcyear
#> <date> <dbl> <fct> <fct> <dbl> <chr> <fct>
#> 1 2016-10-01 7.37 275 Oct 2016 Oct-01 2017
#> 2 2016-10-02 5.68 276 Oct 2016 <NA> 2017
#> 3 2016-10-03 7.90 277 Oct 2016 <NA> 2017
#> 4 2016-10-04 8.41 278 Oct 2016 <NA> 2017
#> 5 2016-10-05 10.6 279 Oct 2016 <NA> 2017
#> 6 2016-10-06 7.60 280 Oct 2016 <NA> 2017
#> 7 2016-10-07 11.1 281 Oct 2016 <NA> 2017
#> 8 2016-10-08 9.30 282 Oct 2016 <NA> 2017
#> 9 2016-10-09 7.08 283 Oct 2016 <NA> 2017
#> 10 2016-10-10 8.96 284 Oct 2016 <NA> 2017
#> # ... with 720 more rows
# Create a row number for plotting to make sure ggplot plot in
# the exact order of a fiscal year
df1 <- df %>%
group_by(fcyear) %>%
mutate(order = row_number()) %>%
ungroup()
df1
#> # A tibble: 730 x 8
#> date values jday Month Year myLabel fcyear order
#> <date> <dbl> <fct> <fct> <dbl> <chr> <fct> <int>
#> 1 2016-10-01 7.37 275 Oct 2016 Oct-01 2017 1
#> 2 2016-10-02 5.68 276 Oct 2016 <NA> 2017 2
#> 3 2016-10-03 7.90 277 Oct 2016 <NA> 2017 3
#> 4 2016-10-04 8.41 278 Oct 2016 <NA> 2017 4
#> 5 2016-10-05 10.6 279 Oct 2016 <NA> 2017 5
#> 6 2016-10-06 7.60 280 Oct 2016 <NA> 2017 6
#> 7 2016-10-07 11.1 281 Oct 2016 <NA> 2017 7
#> 8 2016-10-08 9.30 282 Oct 2016 <NA> 2017 8
#> 9 2016-10-09 7.08 283 Oct 2016 <NA> 2017 9
#> 10 2016-10-10 8.96 284 Oct 2016 <NA> 2017 10
#> # ... with 720 more rows
# plot with `order` as x-axis
p2 <- ggplot(df1,
aes(x = order, y = values,
color = fcyear,
group = fcyear)) +
geom_line() +
theme_classic(base_size = 12) +
xlab(NULL)
# now replace `order` label with `myLabel` created above
x_break <- df1$order[!is.na(df1$myLabel)][1:12]
x_label <- df1$myLabel[x_break]
x_label
#> [1] "Oct-01" "Nov-01" "Dec-01" "Jan-01" "Feb-01" "Mar-01" "Apr-01"
#> [8] "May-01" "Jun-01" "Jul-01" "Aug-01" "Sep-01"
p3 <- p2 +
scale_x_continuous(
breaks = x_break,
labels = x_label) +
theme(axis.text.x = element_text(angle = 90)) +
scale_color_brewer("Fiscal Year", palette = "Dark2") +
xlab(NULL)
p3
ggplotly(p3)
Created on 2018-09-09 by the reprex package (v0.2.0.9000).
Consider this an appendix to Tung's excellent answer. Here I've made it obvious how to alter the code for different start and end months of financial (or production) years, which vary by country (and industry), via the parameter EndMonth. I've also added an annual average, which seems like a pretty obvious thing to want as well (though outside the OP's request).
library(tidyverse)
library(lubridate)
## Generate random data
set.seed(2018)
date = seq(from = as.Date("2016-06-01"), to = as.Date("2016-06-01")+729,
by = "days") # about 2 years, but even number of days
values = c(rnorm(length(date)/2, 8, 1.5), rnorm(length(date)/2, 16, 2))
dat <- data.frame(date, values)
EndMonth <- 5 #i.e. if last month of financial year is May, use 5 for 5th month of calendar year
df <- dat %>%
tbl_df() %>%
mutate(jday = factor(yday(date)),
Month = month(date),
Year = year(date),
# only create label for the 1st day of the month
myLabel = case_when(day(date) == 1L ~ format(date, "%b%e"),
TRUE ~ NA_character_)) %>%
# create fiscal year column
mutate(fcyear = case_when(Month > EndMonth ~ as.factor(Year + 1),
TRUE ~ as.factor(Year))) %>%
mutate(Month = factor(Month,
levels = c((EndMonth+1):12, 1:(EndMonth)),
labels = c(month.abb[(EndMonth+1):12], month.abb[1:EndMonth])))
df
#make 2 (or n) year average
df_mean <- df %>%
group_by(jday) %>%
mutate(values = mean(values, na.rm=TRUE)) %>%
filter(fcyear %in% c("2017")) %>% #note hard code for first fcyear in dataset
mutate(fcyear = "Average")
#Add average to data frame
df <- bind_rows(df, df_mean)
# Create a row number for plotting to make sure ggplot plot in
# the exact order of a fiscal year
df1 <- df %>%
group_by(fcyear) %>%
mutate(order = row_number()) %>%
ungroup()
df1
# plot with `order` as x-axis
p2 <- ggplot(df1,
aes(x = order, y = values,
color = fcyear,
group = fcyear)) +
geom_line() +
theme_classic(base_size = 12) +
xlab(NULL)
p2
# now replace `order` label with `myLabel` created above
x_break <- df1$order[!is.na(df1$myLabel)][1:12]
x_label <- df1$myLabel[x_break]
x_label
p3 <- p2 +
scale_x_continuous(
breaks = x_break,
labels = x_label) +
theme(axis.text.x = element_text(angle = 90)) +
scale_color_brewer("Fiscal Year", palette = "Dark2") +
xlab(NULL)
p3
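For the moving-average part of question 3, once the data carry per-day values you can smooth the series before (or after) plotting; a base-R sketch of a 7-day centered moving average using stats::filter, on the same simulated daily series:

```r
# Same simulated daily data as above
set.seed(2018)
date <- seq(from = as.Date("2016-10-01"), to = as.Date("2018-09-30"), by = "days")
values <- c(rnorm(length(date) / 2, 8, 1.5), rnorm(length(date) / 2, 16, 2))
dat <- data.frame(date, values)

# 7-day centered moving average; the first/last 3 days have no full window and stay NA
dat$ma7 <- as.numeric(stats::filter(dat$values, rep(1 / 7, 7), sides = 2))
```

Plotting ma7 against the same `order` axis overlays the smoothed line; grouping by fcyear first (as in df1) keeps the average from bleeding across fiscal-year boundaries.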
