Aggregate daily data into weeks - r

I have data resembling the following structure, where the when variable denotes the day of measurement:
## Generate data.
set.seed(1986)
n <- 1000
y <- rnorm(n)
when <- as.POSIXct(strftime(seq(as.POSIXct("2021-11-01 23:00:00 UTC", tryFormats = "%Y-%m-%d"),
as.POSIXct("2022-11-01 23:00:00 UTC", tryFormats = "%Y-%m-%d"),
length.out = n), format = "%Y-%m-%d"))
dta <- data.frame(y, when)
head(dta)
#> y when
#> 1 -0.04625141 2021-11-01
#> 2 0.28000082 2021-11-01
#> 3 0.25317063 2021-11-01
#> 4 -0.96411077 2021-11-02
#> 5 0.49222664 2021-11-02
#> 6 -0.69874551 2021-11-02
I need to compute averages of y over time. For instance, the following computes daily averages:
## Compute daily averages of y.
library(dplyr)
daily_avg <- dta %>%
group_by(when) %>%
summarise(daily_mean = mean(y)) %>%
ungroup()
daily_avg
#> # A tibble: 366 × 2
#> when daily_mean
#> <dttm> <dbl>
#> 1 2021-11-01 00:00:00 0.162
#> 2 2021-11-02 00:00:00 -0.390
#> 3 2021-11-03 00:00:00 -0.485
#> 4 2021-11-04 00:00:00 -0.152
#> 5 2021-11-05 00:00:00 0.425
#> 6 2021-11-06 00:00:00 0.726
#> 7 2021-11-07 00:00:00 0.855
#> 8 2021-11-08 00:00:00 0.0608
#> 9 2021-11-09 00:00:00 -0.995
#> 10 2021-11-10 00:00:00 0.395
#> # … with 356 more rows
I am having a hard time computing weekly averages. Here is what I have tried so far:
## Fail - compute weekly averages of y.
library(lubridate)
dta$week <- week(dta$when) # This is wrong.
dta[165: 171, ]
#> y when week
#> 165 0.9758333 2021-12-30 52
#> 166 -0.8630091 2021-12-31 53
#> 167 0.3054031 2021-12-31 53
#> 168 1.2814421 2022-01-01 1
#> 169 0.1025440 2022-01-01 1
#> 170 1.3665411 2022-01-01 1
#> 171 -0.5373058 2022-01-02 1
Using the week function from the lubridate package ignores the fact that my data span multiple years. So, if I were to use code similar to what I used for the daily averages, I would aggregate observations that belong to different years but share the same week number. How can I solve this?

You can use %V (from ?strptime) for weeks, combining it with the year.
dta %>%
group_by(week = format(when, format = "%Y-%V")) %>%
summarize(daily_mean = mean(y)) %>%
ungroup()
# # A tibble: 54 x 2
# week daily_mean
# <chr> <dbl>
# 1 2021-44 0.179
# 2 2021-45 0.0477
# 3 2021-46 0.0340
# 4 2021-47 0.356
# 5 2021-48 0.0544
# 6 2021-49 -0.0948
# 7 2021-50 -0.0419
# 8 2021-51 0.209
# 9 2021-52 0.251
# 10 2022-01 -0.197
# # ... with 44 more rows
There are different variants of "week", depending on your preference.
%V
Week of the year as decimal number (01–53) as defined in ISO 8601.
If the week (starting on Monday) containing 1 January has four or more
days in the new year, then it is considered week 1. Otherwise, it is
the last week of the previous year, and the next week is week 1.
(Accepted but ignored on input.)
%W
Week of the year as decimal number (00–53) using Monday as the first
day of week (and typically with the first Monday of the year as day 1
of week 1). The UK convention.
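One caveat when pairing %V with the year: the ISO week belongs to the ISO week-based year (%G in ?strptime), which can differ from the calendar year %Y in the first days of January or the last days of December. The sketch below uses a few hand-picked dates (not the question's data) to show the difference. It assumes a platform where %G and %V are supported for output.

```r
# Four dates around the 2021/2022 boundary, chosen for illustration.
# 2022-01-01 was a Saturday and still falls in ISO week 52 of 2021.
d <- as.Date(c("2021-12-31", "2022-01-01", "2022-01-02", "2022-01-03"))

# Calendar year + ISO week: the first two January days get a label
# ("2022-52") that does not match the ISO week they belong to.
cal_label <- format(d, "%Y-%V")

# ISO week-based year + ISO week: all days of one ISO week share a label.
iso_label <- format(d, "%G-%V")

data.frame(d, cal_label, iso_label)
```

Grouping on a label built with "%G-%V" instead of "%Y-%V" keeps each ISO week together in a single group.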

You can extract year and week from the dates and group by both:
dta %>%
mutate(year = year(when),
week = week(when)) %>%
group_by(year, week) %>%
summarise(y_mean = mean(y)) %>%
ungroup()
# # A tibble: 54 x 3
# # Groups: year, week [54]
# year week y_mean
# <dbl> <dbl> <dbl>
# 1 2021 44 -0.222
# 2 2021 45 0.234
# 3 2021 46 0.0953
# 4 2021 47 0.206
# 5 2021 48 0.192
# 6 2021 49 -0.0831
# 7 2021 50 0.0282
# 8 2021 51 0.196
# 9 2021 52 0.132
# 10 2021 53 -0.279
# # ... with 44 more rows
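Keep in mind that, as described in the lubridate documentation, week() counts complete seven-day periods since 1 January, plus one. Week 53 therefore contains only one or two days, and weeks do not start on a fixed weekday. If ISO 8601 weeks are preferred, isoyear() and isoweek() could be used as the grouping columns instead. A small sketch with hand-picked dates:

```r
library(lubridate)

# Three dates around the 2021/2022 year boundary (chosen for illustration).
d <- as.Date(c("2021-12-30", "2021-12-31", "2022-01-01"))

# week(): seven-day blocks counted from 1 January of each calendar year.
wk <- week(d)

# isoweek()/isoyear(): ISO 8601 weeks (Monday start) and their week-based year.
iso_wk <- isoweek(d)
iso_yr <- isoyear(d)

data.frame(d, wk, iso_wk, iso_yr)
```

With week(), 2021-12-31 forms a separate week 53 and 2022-01-01 starts week 1, whereas the ISO functions place all three dates in week 52 of 2021.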

Related

How to separate daily data into weekly or monthly data in R

I have daily discharge data from a local stream near me. I am trying to sum and take the average of the daily data into weekly or monthly chunks so I can plot discharge_m3d(discharge) and Qs_sum(depletion) by weekly and monthly timeframes. Does anyone know how I can do this? I attached a figure of how my data frame looks.
People often use floor_date() from lubridate for these purposes. You can floor to a unit of month or week and then group by the resulting date column. Then you can use summarize() to compute the monthly or weekly sums/averages. From there you can use your plotting library of choice to visualize the result (like ggplot2, not shown).
This works even if you have more than one year of data (i.e. where the month or week number might repeat).
library(dplyr)
library(lubridate)
set.seed(123)
df <- tibble(
date = seq(
from = as.Date("2014-03-01"),
to = as.Date("2016-12-31"),
by = 1
),
Qs_sum = runif(length(date)),
discharge_m3d = runif(length(date))
)
df
#> # A tibble: 1,037 × 3
#> date Qs_sum discharge_m3d
#> <date> <dbl> <dbl>
#> 1 2014-03-01 0.288 0.560
#> 2 2014-03-02 0.788 0.427
#> 3 2014-03-03 0.409 0.448
#> 4 2014-03-04 0.883 0.833
#> 5 2014-03-05 0.940 0.720
#> 6 2014-03-06 0.0456 0.457
#> 7 2014-03-07 0.528 0.521
#> 8 2014-03-08 0.892 0.242
#> 9 2014-03-09 0.551 0.0759
#> 10 2014-03-10 0.457 0.391
#> # … with 1,027 more rows
df %>%
mutate(date = floor_date(date, unit = "month")) %>%
group_by(date) %>%
summarise(
n = n(),
qs_total = sum(Qs_sum),
qs_average = mean(Qs_sum),
discharge_total = sum(discharge_m3d),
discharge_average = mean(discharge_m3d),
.groups = "drop"
)
#> # A tibble: 34 × 6
#> date n qs_total qs_average discharge_total discharge_average
#> <date> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2014-03-01 31 18.1 0.585 15.3 0.494
#> 2 2014-04-01 30 12.9 0.429 15.2 0.507
#> 3 2014-05-01 31 15.5 0.500 15.3 0.493
#> 4 2014-06-01 30 15.8 0.525 16.3 0.542
#> 5 2014-07-01 31 15.1 0.487 13.9 0.449
#> 6 2014-08-01 31 14.8 0.478 16.2 0.522
#> 7 2014-09-01 30 15.3 0.511 13.1 0.436
#> 8 2014-10-01 31 15.6 0.504 14.7 0.475
#> 9 2014-11-01 30 16.0 0.532 15.1 0.502
#> 10 2014-12-01 31 14.2 0.458 15.5 0.502
#> # … with 24 more rows
# Specify that the "start of the week" is Sunday.
# So groups are made of data from [Sunday -> Saturday]
sunday <- 7L
df %>%
mutate(date = floor_date(date, unit = "week", week_start = sunday)) %>%
group_by(date) %>%
summarise(
n = n(),
qs_total = sum(Qs_sum),
qs_average = mean(Qs_sum),
discharge_total = sum(discharge_m3d),
discharge_average = mean(discharge_m3d),
.groups = "drop"
)
#> # A tibble: 149 × 6
#> date n qs_total qs_average discharge_total discharge_average
#> <date> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2014-02-23 1 0.288 0.288 0.560 0.560
#> 2 2014-03-02 7 4.49 0.641 3.65 0.521
#> 3 2014-03-09 7 3.77 0.539 3.88 0.554
#> 4 2014-03-16 7 4.05 0.579 3.45 0.493
#> 5 2014-03-23 7 4.43 0.632 3.08 0.440
#> 6 2014-03-30 7 4.00 0.572 4.74 0.677
#> 7 2014-04-06 7 2.50 0.357 3.15 0.449
#> 8 2014-04-13 7 2.48 0.355 2.44 0.349
#> 9 2014-04-20 7 2.30 0.329 2.45 0.349
#> 10 2014-04-27 7 3.44 0.492 4.40 0.629
#> # … with 139 more rows
Created on 2022-04-13 by the reprex package (v2.0.1)
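To make the effect of week_start concrete, here is a minimal sketch on four hand-picked dates from March 2014 (2014-03-01 was a Saturday). It only illustrates how floor_date() maps each date to the first day of its week and is not part of the original answer.

```r
library(lubridate)

# Saturday, Sunday, the following Saturday, the following Sunday.
d <- as.Date(c("2014-03-01", "2014-03-02", "2014-03-08", "2014-03-09"))

# Weeks starting on Sunday (week_start = 7): each date maps to the Sunday
# on or before it.
sunday_start <- floor_date(d, unit = "week", week_start = 7)

# Weeks starting on Monday (week_start = 1): each date maps to the Monday
# on or before it.
monday_start <- floor_date(d, unit = "week", week_start = 1)

data.frame(d, sunday_start, monday_start)
```

Because the floored value is a full date rather than a week number, weeks from different years never collapse into the same group.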
One way to approach this is using the lubridate and dplyr packages in the tidyverse. I assume here that your dates are year-month-day, which they appear to be, and that you only have one calendar year, or at least no repeated months/weeks across two years.
monthly_discharge <- discharge %>%
filter(variable == "discharge") %>% # First select just the rows that represent discharge (not clear if that's necessary here)
mutate(date = ymd(date), # convert date to a lubridate date object
month = month(date), # extract the numbered month from the date
week = week(date)) %>% # extract the numbered week in a year from the date
group_by(month, stream) %>% # group your data by month and stream
summarize(discharge_summary = mean(discharge_m3d)) # summarize your data so that each month has a single row with a single (mean) discharge value
# you can include multiple summary variables within the summarize function
This should produce a data frame with one row per month for each stream and a summary value for discharge. You could summarize by week by changing the month label in group_by to week.
Make use of the functions week(), month() and year() from the package lubridate to get the corresponding values for your date column. Afterwards we can find the means per week, month or year. For illustration, I added a row with year 2015, since there was only year 2014 in your sample data. Furthermore, for plotting reasons, I added a column "Year_Month" that shows the abbreviated month followed by year (x axis of the plot).
library(dplyr)
library(lubridate)
data <- data %>% mutate(Week = week(date), Month = month(date), Year = year(date)) %>%
group_by(Year, Week) %>%
mutate(mean_Week_Qs = mean(Qs_sum)) %>%
ungroup() %>%
group_by(Year, Month) %>%
mutate(mean_Month_Qs = mean(Qs_sum)) %>%
ungroup() %>%
group_by(Year) %>%
mutate(mean_Year_Qs = mean(Qs_sum)) %>%
ungroup() %>%
mutate(Year_Month = paste0(lubridate::month(date, label = TRUE), " ", Year)) %>%
ungroup()
> data
# A tibble: 12 x 10
date discharge_m3d Qs_sum Week Month Year mean_Week_Qs mean_Month_Qs mean_Year_Qs Year_Month
<date> <dbl> <dbl> <int> <int> <int> <dbl> <dbl> <dbl> <chr>
1 2014-03-01 797 0 9 3 2014 0.0409 0.629 0.629 Mar 2014
2 2014-03-02 826 0.00833 9 3 2014 0.0409 0.629 0.629 Mar 2014
3 2014-03-03 3760 0.114 9 3 2014 0.0409 0.629 0.629 Mar 2014
4 2014-03-04 4330 0.292 10 3 2014 0.785 0.629 0.629 Mar 2014
5 2014-03-05 2600 0.480 10 3 2014 0.785 0.629 0.629 Mar 2014
6 2014-03-06 4620 0.656 10 3 2014 0.785 0.629 0.629 Mar 2014
7 2014-03-07 2510 0.816 10 3 2014 0.785 0.629 0.629 Mar 2014
8 2014-03-08 1620 0.959 10 3 2014 0.785 0.629 0.629 Mar 2014
9 2014-03-09 2270 1.09 10 3 2014 0.785 0.629 0.629 Mar 2014
10 2014-03-10 5650 1.20 10 3 2014 0.785 0.629 0.629 Mar 2014
11 2014-03-11 2530 1.31 11 3 2014 1.31 0.629 0.629 Mar 2014
12 2015-03-06 1470 1.52 10 3 2015 1.52 1.52 1.52 Mar 2015
Now we can plot, for example Qs_sum per year and month, and add the mean as a red dot:
library(ggplot2)
ggplot(data, aes(Year_Month, Qs_sum)) +
theme_classic() +
geom_point(size = 2) +
geom_point(aes(Year_Month, mean_Month_Qs), color = "red", size = 5, alpha = 0.6)
To summarize the results by weekly or monthly averages, you can do as follows, using distinct():
data %>% distinct(Year, Week, mean_Week_Qs)
# A tibble: 4 x 3
Week Year mean_Week_Qs
<int> <int> <dbl>
1 9 2014 0.0409
2 10 2014 0.785
3 11 2014 1.31
4 10 2015 1.52
data %>% distinct(Year, Month, mean_Month_Qs)
# A tibble: 2 x 3
Month Year mean_Month_Qs
<int> <int> <dbl>
1 3 2014 0.629
2 3 2015 1.52
This can only be done after the mutate() and mean() commands above. If you want to get directly to summarized results, you can use summarize() directly on the initial dataframe:
data %>% group_by(Year, Week) %>% summarise(Week_Avg = mean(Qs_sum))
# A tibble: 4 x 3
# Groups: Year [2]
Year Week Week_Avg
<int> <int> <dbl>
1 2014 9 0.0409
2 2014 10 0.785
3 2014 11 1.31
4 2015 10 1.52
data %>% group_by(Year, Month) %>% summarise(Month_Avg = mean(Qs_sum))
# A tibble: 2 x 3
# Groups: Year [2]
Year Month Month_Avg
<int> <int> <dbl>
1 2014 3 0.629
2 2015 3 1.52
Note that for plotting, mutate() is preferred, since it preserves the individual observations (black in the plot above). If we used summarise() instead, we would be left with only the red points.
Data
data <- structure(list(date = structure(16130:16140, class = "Date"),
discharge_m3d = c(797, 826, 3760, 4330, 2600, 4620, 2510,
1620, 2270, 5650, 2530), Qs_sum = c(0, 0.00833424, 0.114224781,
0.291812109, 0.479780482, 0.656321971, 0.816140731, 0.959334606,
1.087579095, 1.20284046, 1.30695595), Week = c(9L, 9L, 9L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L), Month = c(3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L)), row.names = c(NA, -11L
), class = c("tbl_df", "tbl", "data.frame"))

Filter in dplyr interval of dates

I have the following simulated dataset in R:
library(tidyverse)
A = seq(from = as.Date("2021/1/1"),to=as.Date("2022/1/1"), length.out = 252)
length(A)
x = rnorm(252)
d = tibble(A,x);d
that looks like :
# A tibble: 252 × 2
A x
<date> <dbl>
1 2021-01-01 0.445
2 2021-01-02 -0.793
3 2021-01-03 -0.367
4 2021-01-05 1.64
5 2021-01-06 -1.15
6 2021-01-08 0.276
7 2021-01-09 1.09
8 2021-01-11 0.443
9 2021-01-12 -0.378
10 2021-01-14 0.203
# … with 242 more rows
This is one year of 252 trading days. Let's say I have a date of interest, which is:
start = as.Date("2021-05-23"); start
I want to filter the dataset so that the result is a new dataset starting from this date and containing the next 20 index dates (trading days, NOT calendar days), and then count the rows that the new dataset contains.
For example from the starting date and after I have :
d1=d%>%
dplyr::filter(A>start)%>%
dplyr::summarise(n())
d1
# A tibble: 1 × 1
`n()`
<int>
1 98
but I want only the next 20 trading days after the starting date. How can I do that? Any help?
Perhaps a brute-force attempt:
d %>%
filter(between(A, start, max(head(sort(A[A > start]), 20))))
# # A tibble: 20 x 2
# A x
# <date> <dbl>
# 1 2021-05-23 -0.185
# 2 2021-05-24 0.102
# 3 2021-05-26 0.429
# 4 2021-05-27 -1.21
# 5 2021-05-29 0.260
# 6 2021-05-30 0.479
# 7 2021-06-01 -0.623
# 8 2021-06-02 0.982
# 9 2021-06-04 -0.0533
# 10 2021-06-05 1.08
# 11 2021-06-07 -1.96
# 12 2021-06-08 -0.613
# 13 2021-06-09 -0.267
# 14 2021-06-11 -0.284
# 15 2021-06-12 0.0851
# 16 2021-06-14 0.355
# 17 2021-06-15 -0.635
# 18 2021-06-17 -0.606
# 19 2021-06-18 -0.485
# 20 2021-06-20 0.255
If you have duplicate dates, you may prefer to use head(sort(unique(A[A > start])),20), depending on what "20 index dates" means.
And to find the number of indices, you can summarise or count as needed.
You could first sort by the date, filter for days later than the given date, and then take the top 20 records.
d1 = d %>%
arrange(A) %>%
filter(A > start) %>%
head(20)
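As a variation on the sort-filter-head idea, here is a minimal sketch using dplyr's slice_min(), which picks the 20 earliest dates after start without an explicit arrange(). The seed and the name next20 are assumptions added for reproducibility and illustration.

```r
library(dplyr)

set.seed(1)  # assumed seed, only so the example is reproducible

# Same construction as in the question: 252 "trading days" over one year.
A <- seq(from = as.Date("2021-01-01"), to = as.Date("2022-01-01"), length.out = 252)
d <- tibble(A, x = rnorm(252))
start <- as.Date("2021-05-23")

# Keep rows strictly after the start date, then take the 20 earliest dates.
# with_ties = FALSE guarantees at most 20 rows even if dates repeat.
next20 <- d %>%
  filter(A > start) %>%
  slice_min(A, n = 20, with_ties = FALSE)

nrow(next20)
```

Whether the start date itself should be included (A >= start) depends on what "the next 20 trading days" is meant to cover.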

find yearly cumsum of a variable in R?

In my code below, I would like to find the cumsum for each year. Right now, variable A is being summed over the entire duration. Any help would be appreciated.
library(dplyr)
library(lubridate)
set.seed(50)
DF <- data.frame(date = seq(as.Date("2001-01-01"), to= as.Date("2003-12-31"), by="day"),
A = runif(1095, 0,10))
DF1 <- DF %>%
mutate(Year = year(date), Month = month(date), JDay = yday(date)) %>%
filter(between(Month,5,10)) %>%
group_by(Year, JDay) %>%
mutate(Precipitation = cumsum(A))
Just remove JDay from the grouping variables:
DF1 <- DF %>%
mutate(Year = year(date), Month = month(date), JDay = yday(date)) %>%
filter(between(Month,5,10)) %>%
group_by(Year) %>%
mutate(Precipitation = cumsum(A)) %>%
ungroup()
It seems the issue here is with your grouping clause. Specifically, as there are as many distinct combinations of Year and JDay in your data as there are rows in DF, the subsequent cumsum operation inside mutate will simply return the same value as the input column, A. I believe the following should give you what you're after:
library(dplyr)
library(lubridate)
set.seed(50)
DF <- data.frame(date = seq(as.Date("2001-01-01"), to= as.Date("2003-12-31"), by="day"),
A = runif(1095, 0,10))
DF1 <- DF %>%
mutate(Year = year(date), Month = month(date), JDay = yday(date)) %>%
filter(between(Month,5,10)) %>%
arrange(Year, JDay) %>%
group_by(Year) %>%
mutate(Precipitation = cumsum(A)) %>%
ungroup()
# illustrate that Precipitation does indeed give the cumulative value of A for
# each year by printing the first 5 observations for each year in DF1
DF1 %>%
group_by(Year) %>%
slice(1:5)
#> # A tibble: 15 x 6
#> # Groups: Year [3]
#> date A Year Month JDay Precipitation
#> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2001-05-01 6.25 2001 5 121 6.25
#> 2 2001-05-02 0.188 2001 5 122 6.43
#> 3 2001-05-03 5.37 2001 5 123 11.8
#> 4 2001-05-04 5.55 2001 5 124 17.4
#> 5 2001-05-05 5.15 2001 5 125 22.5
#> 6 2002-05-01 2.95 2002 5 121 2.95
#> 7 2002-05-02 6.75 2002 5 122 9.71
#> 8 2002-05-03 7.77 2002 5 123 17.5
#> 9 2002-05-04 8.13 2002 5 124 25.6
#> 10 2002-05-05 5.58 2002 5 125 31.2
#> 11 2003-05-01 9.98 2003 5 121 9.98
#> 12 2003-05-02 8.24 2003 5 122 18.2
#> 13 2003-05-03 6.13 2003 5 123 24.4
#> 14 2003-05-04 5.22 2003 5 124 29.6
#> 15 2003-05-05 9.81 2003 5 125 39.4
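The grouping effect described above can be seen on a toy data frame (the values are made up for illustration): when every Year/JDay combination is unique, each group has a single row and cumsum() just returns A.

```r
library(dplyr)

# Two years with two days each; values chosen by hand.
toy <- data.frame(
  Year = c(2001, 2001, 2002, 2002),
  JDay = c(1, 2, 1, 2),
  A    = c(1, 2, 3, 4)
)

# Grouping by Year and JDay: every group has one row, so cs equals A.
by_year_day <- toy %>%
  group_by(Year, JDay) %>%
  mutate(cs = cumsum(A)) %>%
  ungroup()

# Grouping by Year only: cs accumulates within each year and restarts.
by_year <- toy %>%
  group_by(Year) %>%
  mutate(cs = cumsum(A)) %>%
  ungroup()

by_year_day$cs  # 1 2 3 4
by_year$cs      # 1 3 3 7
```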
Here is a data.table solution for that. If you want the cumsum for each year, but show only the interval from month 5 to 10, this would be data.table code for that:
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:lubridate':
#>
#> hour, isoweek, mday, minute, month, quarter, second, wday, week,
#> yday, year
set.seed(50)
DF <- data.frame(date = seq(as.Date("2001-01-01"), to= as.Date("2003-12-31"), by="day"),
A = runif(1095, 0,10))
data.table(DF)[, `:=` (Year = year(date), Month = month(date), JDay = yday(date))][, Precipitation := cumsum(A), by=Year][between(Month, 5, 10)][]
#> date A Year Month JDay Precipitation
#> 1: 2001-05-01 6.2465000 2001 5 121 568.9538
#> 2: 2001-05-02 0.1877191 2001 5 122 569.1416
#> 3: 2001-05-03 5.3717570 2001 5 123 574.5133
#> 4: 2001-05-04 5.5457454 2001 5 124 580.0591
#> 5: 2001-05-05 5.1508288 2001 5 125 585.2099
#> ---
#> 548: 2003-10-27 0.1979292 2003 10 300 1479.8115
#> 549: 2003-10-28 6.7286553 2003 10 301 1486.5402
#> 550: 2003-10-29 8.7215420 2003 10 302 1495.2617
#> 551: 2003-10-30 8.2572257 2003 10 303 1503.5190
#> 552: 2003-10-31 9.6567923 2003 10 304 1513.1757
If you want the cumsum for the months 5-10 only, you would put the filter before calculating the cumsum:
data.table(DF)[, `:=` (Year = year(date), Month = month(date), JDay = yday(date))][between(Month, 5, 10)][, Precipitation := cumsum(A), by=Year][]
#> date A Year Month JDay Precipitation
#> 1: 2001-05-01 6.2465000 2001 5 121 6.246500
#> 2: 2001-05-02 0.1877191 2001 5 122 6.434219
#> 3: 2001-05-03 5.3717570 2001 5 123 11.805976
#> 4: 2001-05-04 5.5457454 2001 5 124 17.351722
#> 5: 2001-05-05 5.1508288 2001 5 125 22.502550
#> ---
#> 548: 2003-10-27 0.1979292 2003 10 300 916.597973
#> 549: 2003-10-28 6.7286553 2003 10 301 923.326629
#> 550: 2003-10-29 8.7215420 2003 10 302 932.048171
#> 551: 2003-10-30 8.2572257 2003 10 303 940.305396
#> 552: 2003-10-31 9.6567923 2003 10 304 949.962189

keep most recent observations when there are duplicates in R

I have the following data.
date var1 level score_1 score_2
2020-02-19 12:10:52.166661 dog n1 1 3
2020-02-19 12:17:25.087898 dog n1 3 6
2020-02-19 12:34:27.624939 dog n2 4 3
2020-02-19 12:35:50.522116 cat n1 2 0
2020-02-19 12:38:49.547181 cat n2 3 4
There should be just one observation for any combination of var1 and level. I want to eliminate duplicates and keep only the most recent records. In the example above, the first row should be eliminated, as dog-n1 from row 2 is more recent. However, I want to keep row 3 even though var1 is also equal to "dog", because level is different.
So, this is what I want to obtain:
date var1 level score_1 score_2
2020-02-19 12:17:25.087898 dog n1 3 6
2020-02-19 12:34:27.624939 dog n2 4 3
2020-02-19 12:35:50.522116 cat n1 2 0
2020-02-19 12:38:49.547181 cat n2 3 4
Using tidyverse:
df %>%
group_by(var1, level) %>%
filter(date == max(date)) %>%
ungroup()
In base R, use duplicated. Looks like your data is already sorted by date, so you can use
df[!duplicated(df[c("var1", "level")], fromLast = TRUE), ]
(By default, duplicated will give FALSE for the first occurrence of anything, and TRUE for every other occurrence. Setting fromLast = TRUE reverses the direction, so the last occurrence is kept.)
If you're not sure your data is already sorted, sort it first!
df = df[order(df$var1, df$level, df$date), ]
You can also use a data.table approach as follows:
library(data.table)
setDT(df)[, .SD[which.max(date)], .(var1, level)]
Another tidyverse answer, using dplyr::slice_max().
To demonstrate with a reproducible example, here is flights data from nycflights13 package:
library(nycflights13) # for the data
library(dplyr, warn.conflicts = FALSE)
my_flights <- # a subset of 3 columns
flights |>
select(carrier, dest, time_hour)
my_flights # preview of the subset data
#> # A tibble: 336,776 × 3
#> carrier dest time_hour
#> <chr> <chr> <dttm>
#> 1 UA IAH 2013-01-01 05:00:00
#> 2 UA IAH 2013-01-01 05:00:00
#> 3 AA MIA 2013-01-01 05:00:00
#> 4 B6 BQN 2013-01-01 05:00:00
#> 5 DL ATL 2013-01-01 06:00:00
#> 6 UA ORD 2013-01-01 05:00:00
#> 7 B6 FLL 2013-01-01 06:00:00
#> 8 EV IAD 2013-01-01 06:00:00
#> 9 B6 MCO 2013-01-01 06:00:00
#> 10 AA ORD 2013-01-01 06:00:00
#> # … with 336,766 more rows
Grouping by carrier & dest, we can see many rows for each group.
my_flights |>
count(carrier, dest)
#> # A tibble: 314 × 3
#> carrier dest n
#> <chr> <chr> <int>
#> 1 9E ATL 59
#> 2 9E AUS 2
#> 3 9E AVL 10
#> 4 9E BGR 1
#> 5 9E BNA 474
#> 6 9E BOS 914
#> 7 9E BTV 2
#> 8 9E BUF 833
#> 9 9E BWI 856
#> 10 9E CAE 3
#> # … with 304 more rows
So if we want to deduplicate those in-group rows by taking the most recent time_hour value, we could utilize slice_max():
my_flights |>
group_by(carrier, dest) |>
slice_max(time_hour)
#> # A tibble: 329 × 3
#> # Groups: carrier, dest [314]
#> carrier dest time_hour
#> <chr> <chr> <dttm>
#> 1 9E ATL 2013-05-04 07:00:00
#> 2 9E AUS 2013-02-03 16:00:00
#> 3 9E AVL 2013-07-13 11:00:00
#> 4 9E BGR 2013-10-17 21:00:00
#> 5 9E BNA 2013-12-31 15:00:00
#> 6 9E BOS 2013-12-31 14:00:00
#> 7 9E BTV 2013-09-01 12:00:00
#> 8 9E BUF 2013-12-31 18:00:00
#> 9 9E BWI 2013-12-31 19:00:00
#> 10 9E CAE 2013-12-31 09:00:00
#> # … with 319 more rows
By the same token, we could have used slice_min() to get the rows with the earliest time_hour value.
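Note that the result above has 329 rows for 314 groups, because slice_max() keeps ties by default. If exactly one row per group is needed, with_ties = FALSE can be passed. The sketch below applies this to the question's five rows, re-typed by hand. Truncating the timestamps to whole seconds and parsing them as UTC are assumptions made for brevity.

```r
library(dplyr)

# The question's data, with fractional seconds dropped (assumption).
df <- data.frame(
  date = as.POSIXct(c("2020-02-19 12:10:52", "2020-02-19 12:17:25",
                      "2020-02-19 12:34:27", "2020-02-19 12:35:50",
                      "2020-02-19 12:38:49"), tz = "UTC"),
  var1    = c("dog", "dog", "dog", "cat", "cat"),
  level   = c("n1", "n1", "n2", "n1", "n2"),
  score_1 = c(1, 3, 4, 2, 3),
  score_2 = c(3, 6, 3, 0, 4)
)

# One row per (var1, level): the most recent date, never more than one row.
latest <- df %>%
  group_by(var1, level) %>%
  slice_max(date, n = 1, with_ties = FALSE) %>%
  ungroup()

latest
```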

Create holiday dummy variable in weekly data based on the Date column where Date represents end of week

DF has end of week dates.
df <- data.frame(Date=seq(as.Date("2014-01-03"), as.Date("2020-12-25"), by="week"))
df$week <- seq(nrow(df))
df <- df[, c("week", "Date")]
head(df)
#> week Date
#> 1 1 2014-01-03
#> 2 2 2014-01-10
#> 3 3 2014-01-17
#> 4 4 2014-01-24
#> 5 5 2014-01-31
#> 6 6 2014-02-07
tail(df)
#> week Date
#> 360 360 2020-11-20
#> 361 361 2020-11-27
#> 362 362 2020-12-04
#> 363 363 2020-12-11
#> 364 364 2020-12-18
#> 365 365 2020-12-25
I need a New Year dummy for the respective week. For example, 2018-01-05 should have the value 1 for the New_Year dummy.
You could use lag() together with lubridate::year() to track the change in year:
library(lubridate)
library(dplyr) # for lag()
df$NewYear <- ifelse(is.na(lag(df$Date)) | year(lag(df$Date))!=year(df$Date), 1, 0)
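As a cross-check, here is a sketch of an alternative that does not depend on the previous row. It assumes each Date marks the last day of a seven-day week, so the week contains 1 January exactly when Date falls between 1 and 7 January. The column name NewYear_alt is hypothetical.

```r
library(lubridate)
library(dplyr)  # for lag()

# Weekly end-of-week dates, as in the question.
df <- data.frame(Date = seq(as.Date("2014-01-03"), as.Date("2020-12-25"), by = "week"))

# Approach from the answer: flag the first row and every change of year.
df$NewYear <- ifelse(is.na(lag(df$Date)) | year(lag(df$Date)) != year(df$Date), 1, 0)

# Alternative (assumption: Date is the LAST day of a 7-day week):
# the week includes 1 January if and only if Date is 1-7 January.
df$NewYear_alt <- as.integer(month(df$Date) == 1 & mday(df$Date) <= 7)

# Rows flagged as New Year weeks.
df[df$NewYear == 1, ]
```

On this data both columns flag the same seven weeks, one per year from 2014 to 2020. The alternative would also work if some weeks were missing from the data.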
