Aggregating daily data to weekly, ending today

I'm currently building some charts of COVID-related data. My script downloads the most recent data and goes from there. I wind up with data frames that look like this:
head(NMdata)
Date state positiveIncrease totalTestResultsIncrease
1 2020-05-19 NM 158 4367
2 2020-05-18 NM 81 4669
3 2020-05-17 NM 195 4126
4 2020-05-16 NM 159 4857
5 2020-05-15 NM 139 4590
6 2020-05-14 NM 152 4722
I've been aggregating to weekly data using the tq_transmute function from tidyquant.
NMweeklyPos <- NMdata %>% tq_transmute(select = positiveIncrease, mutate_fun = apply.weekly, FUN=sum)
This works, but it aggregates on week of the year, with weeks starting on Sunday.
head(NMweeklyPos)
Date positiveIncrease
<dttm> <int>
1 2020-03-08 00:00:00 0
2 2020-03-15 00:00:00 13
3 2020-03-22 00:00:00 44
4 2020-03-29 00:00:00 180
5 2020-04-05 00:00:00 306
6 2020-04-12 00:00:00 631
So for instance if I ran it today (which happens to be a Wednesday) my last entry is a partial week with Monday, Tuesday, Wednesday.
tail(NMweeklyPos)
Date positiveIncrease
<dttm> <int>
1 2020-04-19 00:00:00 624
2 2020-04-26 00:00:00 862
3 2020-05-03 00:00:00 1072
4 2020-05-10 00:00:00 1046
5 2020-05-17 00:00:00 1079
6 2020-05-19 00:00:00 239
For purposes of my chart this winds up being a small value, and so I have been discarding the partial weeks at the end, but that means I'm throwing out the most recent data.
I would prefer to throw out a partial week at the start of the dataset and have the aggregation automatically use weeks that end on whatever day the script is run. So if I ran it today (Wednesday) it would aggregate on weeks ending Wednesday, so the most current data is included, and I could drop the partial week from the beginning of the data. Tomorrow it would choose weeks ending Thursday, and so on. I don't want to hardcode the week-end day and change it each time.
How can I go about achieving that?

Using lubridate, the code below finds what day of the week the script is run on and passes that to floor_date() as week_start. The convention mismatch works in your favor here: wday() numbers days from Sunday = 1, while week_start follows ISO numbering (Monday = 1), so each week starts one day after the current weekday; in other words, every week ends on the day the script is run.
Hope this helps!
library(lubridate)
library(dplyr)

end <- as.Date("2020-04-14")
data <- data.frame(
  date = seq.Date(as.Date("2020-01-01"), end, by = "day"),
  val = 1
)

# get today's day-of-week number (Sunday = 1 in wday()'s default numbering)
weekday <- wday(end)

# floor_date()'s week_start uses ISO numbering (Monday = 1), while wday()
# numbers days from Sunday = 1, so passing wday()'s value starts each week
# one day after end's weekday. That makes every week end on end's weekday.
data %>%
  mutate(week = floor_date(date, "week", week_start = weekday)) %>%
  group_by(week) %>%
  summarise(total = sum(val))
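Applied to your data, the same pattern might look like this (a minimal sketch, assuming NMdata as shown above with Date already a Date column; the final filter drops the first, typically partial, week):
library(lubridate)
library(dplyr)

NMweeklyPos <- NMdata %>%
  mutate(week = floor_date(Date, "week", week_start = wday(Sys.Date()))) %>%
  group_by(week) %>%
  summarise(positiveIncrease = sum(positiveIncrease)) %>%
  filter(week > min(week)) # drop the leading (possibly partial) week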

Related

How to group by a time window in R?

I want to find the highest average departure delay over time windows of length one week in the flights dataset of the nycflights13 package.
I've used
seq(min(flights$time_hour), max(flights$time_hour), by = "week")
to generate dates spaced one week apart. But I don't know how to group by these dates to find the average departure delay in each period. How can I do this using the tidyverse?
Thank you for your help in advance.
We can use {lubridate} to round each date down to the start of its week. Two wrinkles to think about:
To count weeks beginning with Jan 1, you'll need to specify the week_start arg. Otherwise lubridate will count from the previous Sunday, which in this case is 12/30/2012.
You also need to deal with incomplete weeks. In this case, the last week of the year only contains one day. I chose to drop weeks with < 7 days for this demo.
library(tidyverse)
library(lubridate)
library(nycflights13)
data(flights)
# what weekday was the first of the year?
weekdays(min(flights$time_hour))
#> [1] "Tuesday"
# Tuesday = day #2 so we'll pass `2` to `week_start`
flights %>%
  group_by(week = floor_date(time_hour, unit = "week", week_start = 2)) %>%
  filter(n_distinct(day) == 7) %>% # drop incomplete weeks
  summarize(dep_delay_avg = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(dep_delay_avg))
#> # A tibble: 52 x 2
#> week dep_delay_avg
#> <dttm> <dbl>
#> 1 2013-06-25 00:00:00 40.6 # week of June 25 had longest delays
#> 2 2013-07-09 00:00:00 24.4
#> 3 2013-12-17 00:00:00 24.0
#> 4 2013-07-23 00:00:00 21.8
#> 5 2013-03-05 00:00:00 21.7
#> 6 2013-04-16 00:00:00 21.6
#> 7 2013-07-16 00:00:00 20.4
#> 8 2013-07-02 00:00:00 20.1
#> 9 2013-12-03 00:00:00 19.9
#> 10 2013-05-21 00:00:00 19.2
#> # ... with 42 more rows
Created on 2022-03-06 by the reprex package (v2.0.1)
Edit: as requested by OP, here is a solution using only core {tidyverse} packages, without {lubridate}:
library(tidyverse)
library(nycflights13)
data(flights)
flights %>%
  group_by(week = as.POSIXlt(time_hour)$yday %/% 7) %>%
  filter(n_distinct(day) == 7) %>% # drop incomplete weeks
  summarize(
    week = as.Date(min(time_hour)),
    dep_delay_avg = mean(dep_delay, na.rm = TRUE)
  ) %>%
  arrange(desc(dep_delay_avg))
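To see what that base R grouping expression is doing: as.POSIXlt()$yday is the zero-based day of the year, so integer division by 7 buckets dates into consecutive 7-day blocks counted from Jan 1. A quick illustration with hand-picked dates:
d <- as.Date(c("2013-01-01", "2013-01-07", "2013-01-08", "2013-12-31"))
as.POSIXlt(d)$yday       # 0 6 7 364
as.POSIXlt(d)$yday %/% 7 # 0 0 1 52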

Selecting the data frame row with the earliest time value for a set period

I have a df in R with numerous records in the format below, with arrival_time values covering a 12-hour period.
   id  arrival_time         wait_time_value
    1  2020-02-20 12:02:00               10
    2  2020-02-20 12:04:00                5
99900  2020-02-20 23:47:00                8
10000  2020-02-20 23:59:00               21
I would like to create a new df that has a row for each 15 minute slot of the arrival time period and the wait_time_value of the record with the earliest arrival time in that slot. So, in the above example, the first and last row of the new df would look like:
id  period_start         wait_time_value
 1  2020-02-20 12:00:00               10
48  2020-02-20 23:45:00                8
I have used the code below to get the mean wait time for all records in each 15-minute range, but I'm not sure how to select the value of the earliest record instead.
df$period_start <- align.time(df$arrival_time - 899, n = 60*15) # align.time() is from the xts package
avgwait_df <- aggregate(wait_time_value ~ period_start, df, mean)
In pandas, you can use DataFrame.resample with GroupBy.first, drop the NaN rows, and convert back to a DataFrame:
df['arrival_time'] = pd.to_datetime(df['arrival_time'])
df = (df.resample('15Min', on='arrival_time')['wait_time_value']
        .first()
        .dropna()
        .reset_index(name='wait_time_value'))
print(df)
arrival_time wait_time_value
0 2020-02-20 12:00:00 10.0
1 2020-02-20 23:45:00 8.0
Using dplyr, take the wait time of the earliest arrival in each slot. Note that min(wait_time_value) would return the smallest wait time in the slot, not the first arrival's value:
df %>%
  group_by(period_start) %>%
  summarise(wait_time_value = wait_time_value[which.min(arrival_time)])
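If you would rather stay in the tidyverse end to end, here is a minimal sketch (assuming arrival_time is already POSIXct; lubridate's floor_date() stands in for xts::align.time(), and slice_min() keeps the earliest row per slot):
library(dplyr)
library(lubridate)

df %>%
  mutate(period_start = floor_date(arrival_time, "15 minutes")) %>%
  group_by(period_start) %>%
  slice_min(arrival_time, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(id, period_start, wait_time_value)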

How to find third Sunday date for all months between a date range in R and their respective values

For a particular date range, for example between 2020-01-29 and 2021-05-02, I want to find the date of every 3rd Sunday of every month, along with its associated value, in a data.frame.
Additionally, if there is a 5th Monday in any month, I want its date and corresponding value in a separate data.frame.
Please note that the search needs to stay within the given date range, not cover the whole data.frame.
## create a data frame of dates and values
dates_seq <- seq(as.Date("2019/12/28"), by = "day", length.out = 1000)
values <- 1:1000
df <- data.frame(dates_seq, values)
To summarize, I want the third Sunday's date and corresponding value for every month, and the fifth Monday's date and value for any month that has one.
Here's a base R approach:
# Get dates between 2020-01-29 and 2021-05-02
temp <- subset(df, dates_seq >= as.Date('2020-01-29') &
                   dates_seq <= as.Date('2021-05-02'))
# Add weekday name
temp$week_day <- weekdays(temp$dates_seq)
# Add each weekday's occurrence number within its month
temp$week_number <- ave(temp$week_day, temp$week_day,
                        format(temp$dates_seq, "%Y-%m"), FUN = seq_along)
# Subset 3rd Sundays and 5th Mondays
subset(temp, week_number == 3 & week_day == 'Sunday' |
             week_number == 5 & week_day == 'Monday')
# dates_seq values week_day week_number
#51 2020-02-16 51 Sunday 3
#79 2020-03-15 79 Sunday 3
#94 2020-03-30 94 Monday 5
#114 2020-04-19 114 Sunday 3
#142 2020-05-17 142 Sunday 3
#177 2020-06-21 177 Sunday 3
#185 2020-06-29 185 Monday 5
#205 2020-07-19 205 Sunday 3
#233 2020-08-16 233 Sunday 3
#248 2020-08-31 248 Monday 5
#268 2020-09-20 268 Sunday 3
#296 2020-10-18 296 Sunday 3
#324 2020-11-15 324 Sunday 3
#339 2020-11-30 339 Monday 5
#359 2020-12-20 359 Sunday 3
#387 2021-01-17 387 Sunday 3
#422 2021-02-21 422 Sunday 3
#450 2021-03-21 450 Sunday 3
#458 2021-03-29 458 Monday 5
#478 2021-04-18 478 Sunday 3
Since lubridate's wday() labels Sundays as day 1 of the week by default, this code will give you a data frame containing all third Sundays:
library(dplyr)
library(lubridate)

df <- df %>%
  mutate(dates_seq = as.Date(dates_seq)) %>%
  mutate(year = year(dates_seq),
         month = month(dates_seq),
         day = wday(dates_seq)) %>%
  filter(day == 1) %>%      # Sundays
  group_by(year, month) %>%
  slice(3)                  # the third Sunday of each month
You could do a match with the original data frame to find the row index.
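The same pattern extends to the fifth Mondays; a sketch under the same assumptions, starting again from the original df since the code above overwrote it (in wday()'s default numbering Monday is day 2, and slice(5) simply returns no row for a month without a fifth Monday):
df_monday5 <- df %>%
  mutate(dates_seq = as.Date(dates_seq),
         year = year(dates_seq),
         month = month(dates_seq),
         day = wday(dates_seq)) %>%
  filter(day == 2) %>%      # Mondays
  group_by(year, month) %>%
  slice(5)                  # empty for months with only four Mondays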

How to subset data by specific hours of interest?

I have a dataset of temperature values taken at specific datetimes across five locations. For whatever reason, sometimes the readings are hourly and sometimes every four hours. Another issue is that when the clocks changed for daylight saving, the readings shifted by one hour. I am interested in the readings taken every four hours and would like to subset these by day and night to ultimately get daily and nightly mean temperatures.
To summarise, the readings I am interested in are either:
0800, 1200, 1600 = day
2000, 0000, 0400 = night
Recordings between 0800-1600 and 2000-0400 each day should be averaged.
During daylight saving, the equivalent times are:
0900, 1300, 1700 = day
2100, 0100, 0500 = night
Recordings between 0900-1700 and 2100-0500 each day should be averaged.
In the process, I am hoping to subset by site.
There are also some NA values or blank cells which should be ignored.
So far, I tried to subset by one hour of interest just to see if it worked, but haven't got any further than that. Any tips on how to subset by a series of times of interest? Thanks!
temperature <- read.csv("SeaTemperatureData.csv",
                        stringsAsFactors = FALSE)
# remove the last column, which contains comments and is not needed
temperature <- subset(temperature, select = -c(X))
temperature$Date.Time <- as.POSIXct(temperature$Date.Time,
                                    format = "%d/%m/%Y %H:%M",
                                    tz = "Pacific/Auckland")
# subset data by time; we only want temperatures recorded at certain times
temperature.goat <- subset(temperature, Date.Time == c('01:00:00'),
                           select = c("Goat.Island"))
Date.Time Goat.Island Tawharanui Kawau Tiritiri Noises
1 2019-06-10 16:00:00 16.820 16.892 16.749 16.677 15.819
2 2019-06-10 20:00:00 16.773 16.844 16.582 16.654 15.796
3 2019-06-11 00:00:00 16.749 16.820 16.749 16.606 15.819
4 2019-06-11 04:00:00 16.487 16.796 16.654 16.558 15.796
5 2019-06-11 08:00:00 16.582 16.749 16.487 16.463 15.867
6 2019-06-11 12:00:00 16.630 16.773 16.725 16.654 15.867
One possible solution is to extract the hour from your DateTime variable, then filter for the particular hours of interest.
Here is a fake example over 4 days:
library(lubridate)
df <- data.frame(
  DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
  Value = sample(1:100, 97, replace = TRUE)
)
DateTime Value
1 2020-02-01 00:00:00 99
2 2020-02-01 01:00:00 51
3 2020-02-01 02:00:00 44
4 2020-02-01 03:00:00 49
5 2020-02-01 04:00:00 60
6 2020-02-01 05:00:00 56
Now, you can extract hours with lubridate's hour() function and subset for the desired hour:
library(lubridate)
subset(df, hour(DateTime) == 5)
DateTime Value
6 2020-02-01 05:00:00 56
30 2020-02-02 05:00:00 31
54 2020-02-03 05:00:00 65
78 2020-02-04 05:00:00 80
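To pull several hours at once, for example your non-daylight-saving daytime readings, the same idea works with %in% (a sketch against the fake df above):
# keep only rows recorded at 08:00, 12:00 or 16:00
subset(df, hour(DateTime) %in% c(8, 12, 16))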
EDIT: Getting the mean of each site per subset of hours
Per OP's request in the comments, the goal is to calculate the mean of the values for the various sites over different periods of time.
Basically, you want two periods per day, one from 8:00 to 17:00 and the other from 18:00 to 7:00.
Here is a more elaborate example based on the previous one:
df <- data.frame(
  DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
  Site1 = sample(1:100, 97, replace = TRUE),
  Site2 = sample(1:100, 97, replace = TRUE)
)
DateTime Site1 Site2
1 2020-02-01 00:00:00 100 6
2 2020-02-01 01:00:00 9 49
3 2020-02-01 02:00:00 86 12
4 2020-02-01 03:00:00 34 55
5 2020-02-01 04:00:00 76 29
6 2020-02-01 05:00:00 41 1
....
So, now you can label each time point as day ("Daily") or night, group by this category for each day, and calculate the mean of each individual site using summarise_at:
library(lubridate)
library(dplyr)

df %>%
  mutate(Date = date(DateTime),
         Hour = hour(DateTime),
         Category = ifelse(between(hour(DateTime), 8, 17), "Daily", "Night")) %>%
  group_by(Date, Category) %>%
  summarise_at(vars(c(Site1, Site2)), ~ mean(., na.rm = TRUE))
# A tibble: 9 x 4
# Groups: Date [5]
Date Category Site1 Site2
<date> <chr> <dbl> <dbl>
1 2020-02-01 Daily 56.9 63.1
2 2020-02-01 Night 58.9 46.6
3 2020-02-02 Daily 54.5 47.6
4 2020-02-02 Night 36.9 41.7
5 2020-02-03 Daily 42.3 56.9
6 2020-02-03 Night 44.1 55.9
7 2020-02-04 Daily 54.3 50.4
8 2020-02-04 Night 54.8 34.3
9 2020-02-05 Night 75 16
Does it answer your question?

How to divide monthly totals by the seasonal monthly ratio in R

I am trying to de-seasonalize my data by dividing each monthly total by the average seasonality ratio for that month. I have two data frames: avgseasonality, which has 12 rows (the average seasonality ratio per month), and df1, which holds the monthly order totals. The problem is that the seasonality ratios cover only 12 rows, while the order-total data frame has 157 rows.
deseasonlize <- transform(avgseasonalityratio, deseasonlizedtotal =
df1$OrderTotal / avgseasonality$seasonalityratio)
This runs, but it does not pair the months appropriately: the shorter vector is simply recycled in order, so the first ratio (April) is applied to the first order total (December).
> avgseasonality
Month seasonalityratio
1 April 1.0132557
2 August 1.0054602
3 December 0.8316988
4 February 0.9813396
5 January 0.8357475
6 July 1.1181648
7 June 1.0439899
8 March 1.1772450
9 May 1.0430667
10 November 0.9841149
11 October 0.9595041
12 September 0.8312318
> df1
# A tibble: 157 x 3
DateEntLabel OrderTotal `d$Month`
<dttm> <dbl> <chr>
1 2005-12-01 00:00:00 512758. December
2 2006-01-01 00:00:00 227449. January
3 2006-02-01 00:00:00 155652. February
4 2006-03-01 00:00:00 172923. March
5 2006-04-01 00:00:00 183854. April
6 2006-05-01 00:00:00 239689. May
7 2006-06-01 00:00:00 237638. June
8 2006-07-01 00:00:00 538688. July
9 2006-08-01 00:00:00 197673. August
10 2006-09-01 00:00:00 144534. September
# ... with 147 more rows
I need the order total and ratio matched by month. For example, for December the calculation would be 512758 / 0.8316988 = 616518.86. The results should go in a new column aligned with the corresponding month and order total. Any help is greatly appreciated!
The easiest way would be to merge() your data first, then do the operation. You can use base R's merge() function, though I will show the tidyverse left_join() function here. I see that one of your columns has the strange name d$Month; renaming it to Month will simplify the merge!
Reproducible example:
library(tidyverse)
df_1 <- data.frame(Month = c("Jan", "Feb"), seasonalityratio = c(1, 2))
df_2 <- data.frame(Month = rep(c("Jan", "Feb"), each = 2), OrderTotal = 1:4)

df_1 %>%
  left_join(df_2, by = "Month") %>%
  mutate(eseasonlizedtotal = OrderTotal / seasonalityratio)
#> Month seasonalityratio OrderTotal eseasonlizedtotal
#> 1 Jan 1 1 1.0
#> 2 Jan 1 2 2.0
#> 3 Feb 2 3 1.5
#> 4 Feb 2 4 2.0
Created on 2019-01-30 by the reprex package (v0.2.1)
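Applied to your data, the same pattern would look roughly like this (a sketch assuming the column names shown above; the d$Month column is renamed first so the join key matches):
library(dplyr)

df1 %>%
  rename(Month = `d$Month`) %>%
  left_join(avgseasonality, by = "Month") %>%
  mutate(deseasonalizedtotal = OrderTotal / seasonalityratio)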
