Calculating which quarters occur in a timeframe - r

My dataset has monthly reporting which needs to be summed to return both the quarterly value and the 12 month rolling rate. I have successfully created a column specifying which quarter each row is from by using df$Quarter <- quarter(df$Month, fiscal_start = 4, with_year = T). This returns values such as 2022.1, which I then use as part of my group_by to sum all values in that quarter. However, I now need to create a row for each area which returns the 4 quarter sum as of the latest quarter whenever I update the dataset, which will be done quarterly.
If this were my data, I would want it to end up something like the second table:
|Area|Quarter|Measure_1|
|----|-------|---------|
|Area_a|2022.1|5|
|Area_a|2021.4|1|
|Area_a|2021.3|2|
|Area_a|2021.2|6|
|Area_b|2022.1|9|
|Area_b|2021.4|7|
|Area_b|2021.3|2|
|Area_b|2021.2|1|
It doesn't need to be exactly like this but this is the rough idea of what I want to happen
|Area|Quarter|Measure_1|Timeframe|
|----|-------|---------|---------|
|Area_a|2022.1|5|Quarterly|
|Area_a|2021.4|1|Quarterly|
|Area_a|2021.3|2|Quarterly|
|Area_a|2021.2|6|Quarterly|
|Area_a|2022.1|14|12 month rolling|
|Area_b|2022.1|9|Quarterly|
|Area_b|2021.4|7|Quarterly|
|Area_b|2021.3|2|Quarterly|
|Area_b|2021.2|1|Quarterly|
|Area_b|2022.1|19|12 month rolling|

The following code produces the required results from your sample data. If your real data covers multiple years for each Area, then you would have to calculate a year variable and then include it along with Area in the group_by().
library(dplyr)
want <- df %>%
  group_by(Area) %>%
  # use summarise to calculate totals
  # Quarter variable will be used to sort the output, new_Quarter will ensure
  # that the total row has the maximum Quarter value for that area
  summarise(new_Quarter = max(Quarter), Quarter = min(Quarter) - 0.05, Measure_1 = sum(Measure_1)) %>%
  bind_rows(df) %>% # combine the totals with the original data
  arrange(Area, -Quarter) %>% # sort so that total rows come after the quarters
  mutate(Quarter = if_else(is.na(new_Quarter), Quarter, new_Quarter), # assign maximum Quarter to total row
         Timeframe = if_else(is.na(new_Quarter), 'Quarterly', '12 month rolling')) %>% # add label
  select(-new_Quarter) # remove temporary variable
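If the data eventually holds more than four quarters per Area, one alternative is to keep only the latest four quarters before summing. The sketch below is a variation on the code above rather than the same method. It assumes Quarter is numeric in year.quarter form (so larger means later) and that each Area has one row per quarter; the names rolling and want_alt are made up for illustration.

```r
library(dplyr)

# sample data from the question
df <- data.frame(
  Area = rep(c("Area_a", "Area_b"), each = 4),
  Quarter = rep(c(2022.1, 2021.4, 2021.3, 2021.2), times = 2),
  Measure_1 = c(5, 1, 2, 6, 9, 7, 2, 1)
)

# 4 quarter total per Area, labelled with the latest quarter
rolling <- df %>%
  group_by(Area) %>%
  slice_max(Quarter, n = 4) %>% # keep the latest four quarters only
  summarise(Quarter = max(Quarter), Measure_1 = sum(Measure_1), .groups = "drop") %>%
  mutate(Timeframe = "12 month rolling")

# stack the quarterly rows and the totals, totals last within each Area
want_alt <- df %>%
  mutate(Timeframe = "Quarterly") %>%
  bind_rows(rolling) %>%
  arrange(Area, Timeframe != "Quarterly", desc(Quarter))
want_alt
```

Because the totals are built separately and labelled before stacking, no temporary sort key is needed, at the cost of one extra intermediate object.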

Related

How can I calculate the number of nights PER MONTH between two dates in R even if they are across two months?

I have hotel booking data with an arrival and a departure date. I have successfully counted the days in between using difftime, but I would now like to know the number of days per month. If both arrival and departure date are within one month (like arrival on September 1st and departure on September 10th), that's not a problem of course. But what do I do with bookings that span months, like arrival on September 25th and departure on October 4th, or even years? In this case I would like to calculate how many days fall in September and how many days fall in October.
The overall goal is to calculate booked days per month / year.
Since you included no sample data (may I suggest you do so in future questions), I made some up to replicate what you want:
library(lubridate)
library(tidyverse)
# creating sample data
bookings <- tibble(
  pax = c("Jane", "John"),
  arrival = as.Date(c("2020-12-20", "2021-01-25")),
  departure = as.Date(c("2021-01-04", "2021-02-02"))
)
# creating a column with all booked dates to group_by and summarize
bookings <- bookings %>%
  rowwise() %>%
  mutate(booked_dates = list(seq(arrival, departure, by = "days"))) %>% # list-column holding each guest's occupied dates (arrival and departure days both included)
  unnest(cols = booked_dates) %>% # this flattens the list-column into a regular one
  mutate( # extracting the year and month
    year = year(booked_dates),
    month = month(booked_dates, label = TRUE)
  ) %>%
  group_by(year, month) %>% # grouping and summarizing
  summarise(n_days = n())
Then you have the desired output:
bookings
# A tibble: 3 × 3
# Groups: year [2]
year month n_days
<dbl> <ord> <int>
1 2020 Dec 12
2 2021 Jan 11
3 2021 Feb 2
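The counts above include every date from arrival to departure. If nights are wanted instead (the departure day is not slept in), the sequence can stop one day earlier. A minimal base R sketch, assuming every booking has at least one night (departure later than arrival); night_list, nights and per_month are illustrative names:

```r
bookings <- data.frame(
  pax = c("Jane", "John"),
  arrival = as.Date(c("2020-12-20", "2021-01-25")),
  departure = as.Date(c("2021-01-04", "2021-02-02"))
)

# one date per night: arrival up to the day before departure
night_list <- lapply(seq_len(nrow(bookings)), function(i) {
  seq(bookings$arrival[i], bookings$departure[i] - 1, by = "day")
})
nights <- do.call(c, night_list)

# nights per year-month: 12 in 2020-12, 10 in 2021-01, 1 in 2021-02
per_month <- table(format(nights, "%Y-%m"))
per_month

# sanity check: the monthly counts add up to the total length of stay
sum(per_month) == sum(as.integer(bookings$departure - bookings$arrival))
```

The final comparison ties the result back to the difftime-style total mentioned in the question, which is a quick way to confirm no night was counted twice or dropped.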

Question for calculating the mean date only with month and day

I have the following dataset, and I would like to have the average date (Month and day) for each (phenology) pheno and station across years. It seems I can directly use the mean function to calculate the mean for the date format objects. However, if I convert the month day to date, with function as.Date, then the year is added, and the average date is not independent of years. How can I directly calculate the mean date only based on Month and day?
You cannot compute a "mean month + day" independent of the year, since not every year has the same number of days. So you need to choose a fixed year for your computations.
Then you can:
Create "dummy" date objects which have the correct month and day, but the previously selected year.
Compute the mean of those dummies.
Extract month and day from the result (remove the year).
You can use the yday function from the lubridate package to convert each date into the day of the year for that year, then average the day of the year for each Pheno. The conversion of the day of the year to a month and day depends upon whether you want the date in a leap year or a non-leap year. I report both dates.
The code looks like:
library(tidyverse)
library(lubridate)
#
# calculate average day of year
#
average_doy <- df %>%
  mutate(day_of_year = yday(as.Date(paste(Year, Month, Day, sep = "-")))) %>%
  group_by(Pheno) %>%
  summarize(avg_doy = round(mean(day_of_year), 0)) # round the mean to a whole day
# set base years
non_leap_year <- 2003
leap_year <- 2004
#
# convert day of year to average day using base years
#
averages <- average_doy %>%
  mutate(avg_non_leap_year_mon_day = paste(avg_doy, non_leap_year, sep = "_") %>%
           as.Date(format = "%j_%Y") %>%
           str_remove(paste0(non_leap_year, "-")),
         avg_leap_year_mon_day = paste(avg_doy, leap_year, sep = "_") %>%
           as.Date(format = "%j_%Y") %>%
           str_remove(paste0(leap_year, "-")))
Using the first seven rows of your data, this gives
# A tibble: 3 x 4
Pheno avg_doy avg_non_leap_year_mon_day avg_leap_year_mon_day
<chr> <dbl> <chr> <chr>
1 Dormant 348 12-14 12-13
2 Tillering 343 12-09 12-08
3 Turning green 48 02-17 02-17
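The day-of-year idea can also be sketched in base R without lubridate. The dates below are made up for one phenology stage (they are not the question's data), and to_month_day is a hypothetical helper name:

```r
# made-up observation dates for one phenology stage
obs <- as.Date(c("2001-12-10", "2002-12-14", "2004-12-12"))

# day of year for each observation (344, 348, 347), then the rounded mean
doy <- as.integer(format(obs, "%j"))
avg_doy <- round(mean(doy))

# map a day of year back to month-day in a chosen base year
to_month_day <- function(day, base_year) {
  format(as.Date(day - 1, origin = paste0(base_year, "-01-01")), "%m-%d")
}

to_month_day(avg_doy, 2003) # non-leap base year: "12-12"
to_month_day(avg_doy, 2004) # leap base year: "12-11"
```

One caveat with averaging day-of-year values: a stage whose observations straddle New Year (some in late December, some in early January) would average to somewhere mid-year, so such cases would likely need the days shifted before averaging.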

Calculate average and std same day last 3 weeks in R [duplicate]

I have a data frame like below (sample data). I want to add two columns for each day to show average and std sales of same day in the last 3 weeks. What I mean by this is the same 3 previous days (last 3 Tuesdays, last 3 Wednesdays, etc.)
df <- data.frame(
stringsAsFactors = FALSE,
date = c("3/28/2019","3/27/2019",
"3/26/2019","3/25/2019","3/24/2019","3/23/2019",
"3/22/2019","3/21/2019","3/20/2019","3/19/2019","3/18/2019",
"3/17/2019","3/16/2019","3/15/2019","3/14/2019",
"3/13/2019","3/12/2020","3/11/2020","3/10/2020","3/9/2021",
"3/8/2021","3/7/2021","3/6/2022","3/5/2022",
"3/4/2022","3/3/2023"),
weekday = c(4L,3L,2L,1L,7L,6L,5L,4L,
3L,2L,1L,7L,6L,5L,4L,3L,2L,1L,7L,6L,5L,4L,
3L,2L,1L,7L),
store_id = c(344L,344L,344L,344L,344L,
344L,344L,344L,344L,344L,344L,344L,344L,344L,344L,
344L,344L,344L,344L,344L,344L,344L,344L,344L,
344L,344L),
store_sales = c(1312005L,1369065L,1354185L,
1339183L,973780L,1112763L,1378349L,1331890L,1357713L,
1366399L,1303573L,936919L,1099826L,1406752L,
1318841L,1321099L,1387767L,1281097L,873449L,1003667L,
1387767L,1281097L,873449L,1003667L,1331636L,1303804L)
)
For example, for 3/28/2019 take the average sales of 3/21/2019, 3/14/2019 and 3/7/2021, like this:
date weekday store_id store_sales avg_sameday3
3/28/2019 4 344 1312005 1310609
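We can group by weekday and store_id and calculate the rolling mean of the 3 rows that follow each entry using zoo::rollapply. The sample data is ordered so that, within a group, the rows after the current one are the previous observations.
library(dplyr)
df %>%
  arrange(weekday) %>%
  group_by(store_id, weekday) %>%
  mutate(store_sales_avg = zoo::rollapply(store_sales, 4,
                                          function(x) mean(x[-1]), align = "left", partial = TRUE))
Note that I have used a window size of 4 and removed the first entry from the mean calculation so that it does not consider the current value. The window is left-aligned so that it starts at the current row and extends over the previous observations. With partial = TRUE it takes the mean even when fewer than 4 values remain; for the last row of each group nothing remains after dropping the current value, so the result is NaN.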
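To check by hand what the window should return for one weekday, here is a small base R sketch without zoo. It uses the four weekday-4 sales values from the sample data, newest first, and also computes the standard deviation the question asks for; prev3 is a hypothetical helper name.

```r
# sales for store 344 on weekday 4, newest first (3/28, 3/21, 3/14, 3/7)
sales <- c(1312005, 1331890, 1318841, 1281097)

# apply a summary function f to the (up to) three rows that follow row i
prev3 <- function(x, i, f) {
  idx <- i + 1:3
  idx <- idx[idx <= length(x)]
  if (length(idx) == 0) NA_real_ else f(x[idx])
}

avg_sameday3 <- sapply(seq_along(sales), function(i) prev3(sales, i, mean))
sd_sameday3  <- sapply(seq_along(sales), function(i) prev3(sales, i, sd))

round(avg_sameday3[1]) # 1310609, the value expected in the question
```

Rows with fewer than three earlier observations use whatever is available, and sd returns NA when only one earlier value exists.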

Tidy a column in a data frame by using mean function [duplicate]

I have the following unemployment data per year and quarter. My data frame runs up to 2018, but I will use only 2 years as an example.
Year Unemployement
1997Q3 1914
1997Q4 1697
1998Q1 1702
1998Q2 1645
1998Q3 1742
1998Q4 1605
What code can I use to tidy the Year column and obtain the following data, mainly to get the unemployment number as the mean of the values for each year: 1997 and 1998 (plus the other years that I have in my data frame)? In the final version, I would like to have only one unemployment value per year, which should be the average of all quarters.
Year Unemployement
1997 1805.50
1998 1673.50
Thank you!
## Data entry
library(tidyverse)
df <- tribble(
  ~Year, ~Quarter, ~Unemployement,
  1997, "Q3", 1914,
  1997, "Q4", 1697,
  1998, "Q1", 1702,
  1998, "Q2", 1645,
  1998, "Q3", 1742,
  1998, "Q4", 1605
)
## Solution
df %>%
  group_by(Year) %>%
  summarise(mean_year = mean(Unemployement))
# A tibble: 2 x 2
Year mean_year
<dbl> <dbl>
1 1997 1806.
2 1998 1674.
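## 2nd Version (first separate the Year-column)
## Assumes the raw layout from the question, where Year is a string such as "1997Q3";
## df_raw is a hypothetical name for that data frame
df_raw %>%
  separate(Year, c("Year", "Quarter"), sep = 4) %>% # split after the 4th character
  group_by(Year) %>%
  summarise(mean_year = mean(Unemployement))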
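For the raw layout shown in the question (a single Year column holding strings like 1997Q3), the same yearly means can be sketched in base R. The data frame name raw and the result name yearly are illustrative; the column name keeps the question's spelling.

```r
# raw layout from the question: year and quarter fused into one string
raw <- data.frame(
  Year = c("1997Q3", "1997Q4", "1998Q1", "1998Q2", "1998Q3", "1998Q4"),
  Unemployement = c(1914, 1697, 1702, 1645, 1742, 1605)
)

# split the string by position: characters 1-4 are the year, 5-6 the quarter
raw$Quarter <- substr(raw$Year, 5, 6)
raw$Year <- as.integer(substr(raw$Year, 1, 4))

# one mean per year: 1805.5 for 1997 and 1673.5 for 1998
yearly <- aggregate(Unemployement ~ Year, data = raw, FUN = mean)
yearly
```

Splitting by position relies on every value having a four-digit year followed by the quarter, which holds for the sample shown.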

Longest consecutive period above threshold using rle and for loop

I have four years of streamflow data for one month and I'm trying to figure out how to extract the longest consecutive period at or above a certain threshold for each of the four years. In the example below, the threshold is 4. I want to try to accomplish this using a for loop or possibly one of the apply functions, but I'm not sure how to go about it.
Here's my example dataframe:
year <- c(rep(2009,31), rep(2010, 31), rep(2011, 31), rep(2012, 31))
day<-c(rep(seq(1:31),4))
discharge <- c(4,4,4,5,6,5,4,8,4,5,3,8,8,8,8,8,8,8,1,2,2,8,8,8,8,8,8,8,8,8,4,4,4,5,6,3,1,1,3,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,10,3,3,3,3,3,3,1,1,3,8,8,8,8,8,8,8,8,8,1,2,2,8,8,3,8,8,8,8,8,8,4,4,4,5,6,3,1,1,3,3,3,3,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,3)
df<-data.frame(cbind(year, day, discharge))
df$threshold<-ifelse(discharge>=4,1,0)
In this example, the threshold column is coded as 1 if the discharge is at or above the threshold and 0 if not. I'm able to partially get my desired output for one year (2009 in the example below), with the following code:
rl2009 <- with(subset(df, year == 2009), rle(threshold))
cs2009 <- cumsum(rl2009$lengths)
index2009 <- cbind(cs2009[rl2009$values == 1] - rl2009$lengths[rl2009$values == 1] + 1,
                   cs2009[rl2009$values == 1])
df2009 <- data.frame(index2009)
df2009 # output all periods when flow is above threshold
df2009$X3 <- df2009$X2 - df2009$X1 + 1
max2009 <- df2009[which.max(df2009$X3), ]
max2009 # output the first and longest period when flow is above threshold
For 2009, there are three time periods when the discharge equals or exceeds 4, but the period from day 1 to day 10 is chosen because it is the first of the longest period above the threshold. X1 represents the start of the time period, X2 the end of the time period, and X3 the number of days in the time period. If there is more than one period with the same number of days, I want to select the first of such periods.
My desired output for all four years is below:
year X1 X2 X3
2009 1 10 10
2010 9 31 23
2011 10 18 9
2012 12 30 19
The actual data includes many more years and many streams, so it's not feasible to do this for each year individually. If anyone has any thoughts on how to achieve this, it'd be greatly appreciated. Thanks.
Simply generalize your process into a function such as threshold_find, and pass it the data frame subset for each year, which by() can handle.
As the object-oriented wrapper to tapply, by slices a data frame by one or more factors (here, year) and returns a list of whatever object the function outputs, here the one-row data frame for the longest period. At the end, do.call() row binds all data frames in the by list into one data frame.
threshold_find <- function(df) {
  rl <- with(df, rle(threshold))
  cs <- cumsum(rl$lengths)
  # start and end positions of every run at or above the threshold
  index <- cbind(cs[rl$values == 1] - rl$lengths[rl$values == 1] + 1,
                 cs[rl$values == 1])
  df <- data.frame(index)
  df$X3 <- df$X2 - df$X1 + 1
  longest <- df[which.max(df$X3), ] # which.max returns the first of tied maxima
  longest
}
finaldf <- do.call(rbind, by(df, df$year, FUN = threshold_find))
finaldf
# X1 X2 X3
# 2009 1 10 10
# 2010 9 31 23
# 2011 10 18 9
# 2012 12 30 19
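The core of the function is turning run lengths into start and end positions. A toy sketch on a short made-up vector (not the question's data) shows the arithmetic step by step; the column names start, end and days are chosen here for readability:

```r
# made-up discharge values, threshold of 4
discharge <- c(4, 5, 3, 8, 8, 8, 1)
runs <- rle(discharge >= 4) # lengths 2, 1, 3, 1; values TRUE, FALSE, TRUE, FALSE

ends   <- cumsum(runs$lengths)    # 2, 3, 6, 7
starts <- ends - runs$lengths + 1 # 1, 3, 4, 7

# keep only the runs at or above the threshold
above <- data.frame(start = starts, end = ends, days = runs$lengths)[runs$values, ]

# first of the longest runs: start 4, end 6, 3 days
longest <- above[which.max(above$days), ]
longest
```

If no value reaches the threshold, above has zero rows and so does longest, which is worth handling before row binding results across many years or streams.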
