Compare one row of a column against all others in group - r

I am trying to calculate the number of days over which all objects in a group overlap with each member of the group. To do this I want to compare each row of one column in a group to every other row of that column in the same group. However, I have not been able to come up with a simple solution: most of my attempts have used the map variants from purrr, and aside from that I have gone down some nested-loop (:-/) and nested-apply rabbit holes; but I suspect there is a very simple way to accomplish this comparison.
Essentially I want the sum of the intersect of each interval in a group to one row of the group.
Input data: (format with intervals)
ID Group year interval_obs
1 A 2020 2020-04-29 UTC--2020-05-19 UTC
2 A 2020 2020-05-04 UTC--2020-05-29 UTC
3 A 2020 2020-05-09 UTC--2020-05-24 UTC
4 A 2020 2020-04-24 UTC--2020-04-28 UTC
5 A 2020 2020-05-30 UTC--2020-06-03 UTC
6 B 2020 2019-12-31 UTC--2020-01-20 UTC
7 B 2020 2020-01-10 UTC--2020-01-30 UTC
8 B 2020 2020-01-20 UTC--2020-02-09 UTC
9 B 2020 2020-01-15 UTC--2020-02-04 UTC
Input data (more human readable?) - where each start/end is the Day of Year (doy)
ID Group Year start end
1 A 2020 120 140
2 A 2020 125 150
3 A 2020 130 145
4 A 2020 115 119
5 A 2020 151 155
6 B 2020 0 20
7 B 2020 10 30
8 B 2020 20 40
9 B 2020 15 35
Desired Results:
ID total_overlap
1 25
2 30
3 25
4 0
5 0
6 15
7 35
8 25
9 35
Note the desired total_overlap is in days: for each row, it is the sum of all days its interval overlaps the 4 other observations in group A. Group B has 4 records to show that group sizes vary.
Example data for the problem:
data <- structure(list(
  ID = 1:9,
  group = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  year = c(2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L),
  start = c(120L, 125L, 130L, 115L, 151L, 0L, 10L, 20L, 15L),
  end = c(140L, 150L, 145L, 119L, 155L, 20L, 30L, 40L, 35L)),
  class = "data.frame",
  row.names = c(NA, -9L))
library(dplyr)
library(lubridate)

data <- data %>%
  group_by(group, year) %>% # real dataset has several combos - both vars left as a reminder
  mutate(across(c(start, end), ~ as_date(., origin = paste0(year - 1, "-12-31")))) %>% # the year-1 origin handles leap years etc.
  mutate(interval_obs = interval(ymd(start), ymd(end))) %>%
  dplyr::select(-start, -end)
output <- data %>% map(.x = .$interval_obs, # this code at least runs
.f = ~{results = sum(as.numeric(intersect(.x, .y$interval_obs)))})
The little chunk above is one of the many ways I have approached this (map2, map_df, etc.), and while it does not work, I imagine a solution is in that ballpark. Note that my example output has two features: 1) units are converted to days, 2) the 'self-intersection' is subtracted out. Do not worry about those features; I have ways to do both of them. I just did not include them because they may obfuscate the problem. However, if it helps...
mutate(self_intersection = as.numeric(intersect(interval_obs, interval_obs2))) %>%
  mutate(results = results - self_intersection) %>%
  mutate(total_overlap = as.numeric(results) / 86400)
I have been trying to keep data in lubridate or another date format so that different temporal resolutions could be easily accommodated in the future (e.g. hours, minutes)
edit 2 - example of calculating overlap for Group A
(data reproduced here)
ID Group Year start end
1 A 2020 120 140
2 A 2020 125 150
3 A 2020 130 145
4 A 2020 115 119
5 A 2020 151 155
For ID 1, the numbers after 'comparison' refer to IDs.
comparison 1 - 2. End1 - Start2 = 15 days
comparison 1 - 3. End1 - Start3 = 10 days
comparison 1 - 4. NO OVERLAP = 0 days
comparison 1 - 5. NO OVERLAP = 0 days
total_overlap 25 days
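For what it's worth, the worked example above can be cross-checked in base R: the overlap of two closed day ranges [s1, e1] and [s2, e2] is max(0, min(e1, e2) - max(s1, s2)). A minimal sketch using the doy columns directly (a hypothetical helper, not the asker's lubridate pipeline) reproduces the desired totals:

```r
# Base-R cross-check of the desired totals, using day-of-year start/end directly.
start <- c(120, 125, 130, 115, 151, 0, 10, 20, 15)
end   <- c(140, 150, 145, 119, 155, 20, 30, 40, 35)
group <- rep(c("A", "B"), c(5, 4))

total_overlap <- sapply(seq_along(start), function(i) {
  j <- setdiff(which(group == group[i]), i)  # the other members of i's group
  # overlap in days with each other member, clamped at 0, then summed
  sum(pmax(0, pmin(end[i], end[j]) - pmax(start[i], start[j])))
})
total_overlap
#> [1] 25 30 25  0  0 15 35 25 35
```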

Is this what you are looking for?
The total overlap in the third line is off from your desired output, but that may be a typo?
library(tidyverse)
library(lubridate)
data |>
  group_by(group) |>
  mutate(total_overlap = map_dbl(interval_obs,
                                 \(x) x |>
                                   intersect(interval_obs) |>
                                   int_length() |>
                                   sum(na.rm = T) - int_length(x)) / 86400)
#> # A tibble: 9 × 5
#> # Groups: group [2]
#> ID group year interval_obs total_overlap
#> <int> <chr> <int> <Interval> <dbl>
#> 1 1 A 2020 2020-04-29 UTC--2020-05-19 UTC 25
#> 2 2 A 2020 2020-05-04 UTC--2020-05-29 UTC 30
#> 3 3 A 2020 2020-05-09 UTC--2020-05-24 UTC 25
#> 4 4 A 2020 2020-04-24 UTC--2020-04-28 UTC 0
#> 5 5 A 2020 2020-05-30 UTC--2020-06-03 UTC 0
#> 6 6 B 2020 2019-12-31 UTC--2020-01-20 UTC 15
#> 7 7 B 2020 2020-01-10 UTC--2020-01-30 UTC 35
#> 8 8 B 2020 2020-01-20 UTC--2020-02-09 UTC 25
#> 9 9 B 2020 2020-01-15 UTC--2020-02-04 UTC 35
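A note on the / 86400 step: lubridate's int_length() returns the interval length in seconds, so dividing by 86400 (seconds per day) converts it to days. A small check:

```r
library(lubridate)

# int_length() reports seconds, so a 20-day interval gives 20 * 86400:
i <- interval(ymd("2020-04-29"), ymd("2020-05-19"))
int_length(i)          # 1728000 seconds
int_length(i) / 86400  # 20 days
```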

Related

Repeatedly count events before a certain date in R

I have a data set with a list of event dates and a list of sample dates. Events and samples are grouped by unit. For each sample date, I want to count the number of events that came before that sample date, and the number of different months in which those events occurred, grouped by unit. A couple of complications: sometimes the event date happens after the sample date in the same year, and sometimes there are sample dates but no event in a particular year.
Example data (my actual dataset has ~6000 observations):
data<-read.table(header=T, text="
unit eventdate eventmonth sampledate year
a 1996-06-01 06 1996-08-01 1996
a 1997-09-03 09 1997-08-02 1997
a 1998-05-15 05 1998-08-03 1998
a NA NA 1999-08-02 1999
b 1996-05-31 05 1996-08-01 1996
b 1997-05-31 05 1997-08-02 1997
b 1998-05-15 05 1998-08-03 1998
b 1999-05-16 05 1999-08-02 1999")
Output data should look something like this:
year unit numevent nummonth
1996 a 1 1
1997 a 1 1
1998 a 3 3
1999 a 3 3
1996 b 1 1
1997 b 2 1
1998 b 3 1
1999 b 4 1
Note that in 1997 in unit a, the event is not counted because it happened after the sample date.
For smaller datasets, I have manually subset the data by each sample date and counted events/unique months (and then merged the datasets back together), but I can't do that with ~6000 observations.
numevent.1996 <- ddply(data[data$eventdate < '1996-08-01', ], .(unit),
                       summarize, numevent = length(eventdate),
                       nummth = length(unique(eventmonth)), year = 1996)
This might work:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data<-read.table(header=T, text="
unit eventdate eventmonth sampledate year
a 1996-06-01 06 1996-08-01 1996
a 1997-09-03 09 1997-08-02 1997
a 1998-05-15 05 1998-08-03 1998
a NA NA 1999-08-02 1999
b 1996-05-31 05 1996-08-01 1996
b 1997-05-31 05 1997-08-02 1997
b 1998-05-15 05 1998-08-03 1998
b 1999-05-16 05 1999-08-02 1999")
data <- data %>%
  mutate(eventdate = lubridate::ymd(eventdate),
         sampledate = lubridate::ymd(sampledate))

data %>%
  group_by(unit, year, eventmonth) %>%
  summarise(numevent = sum(sampledate >= eventdate)) %>%
  group_by(unit, year) %>%
  summarise(nummonth = sum(numevent > 0),
            numevent = sum(numevent))
#> `summarise()` has grouped output by 'unit', 'year'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'unit'. You can override using the
#> `.groups` argument.
#> # A tibble: 8 × 4
#> # Groups: unit [2]
#> unit year nummonth numevent
#> <chr> <int> <int> <int>
#> 1 a 1996 1 1
#> 2 a 1997 0 0
#> 3 a 1998 1 1
#> 4 a 1999 NA NA
#> 5 b 1996 1 1
#> 6 b 1997 1 1
#> 7 b 1998 1 1
#> 8 b 1999 1 1
Created on 2023-01-08 by the reprex package (v2.0.1)
Note: I don't think the data you've included actually produces the output you proposed, as the proposed output counts 18 events that meet the condition while there are only 8 rows in the sample data provided.
Try this?
data %>%
  group_by(unit) %>%
  mutate(
    numevent = sapply(sampledate, function(z) sum(eventdate < z, na.rm = TRUE)),
    nummonth = sapply(sampledate, function(z) length(unique(na.omit(eventmonth[eventdate < z]))))
  ) %>%
  ungroup()
# # A tibble: 8 × 7
# unit eventdate eventmonth sampledate year numevent nummonth
# <chr> <date> <int> <date> <int> <int> <int>
# 1 a 1996-06-01 6 1996-08-01 1996 1 1
# 2 a 1997-09-03 9 1997-08-02 1997 1 1
# 3 a 1998-05-15 5 1998-08-03 1998 3 3
# 4 a NA NA 1999-08-02 1999 3 3
# 5 b 1996-05-31 5 1996-08-01 1996 1 1
# 6 b 1997-05-31 5 1997-08-02 1997 2 1
# 7 b 1998-05-15 5 1998-08-03 1998 3 1
# 8 b 1999-05-16 5 1999-08-02 1999 4 1
Data
data <- structure(list(unit = c("a", "a", "a", "a", "b", "b", "b", "b"), eventdate = structure(c(9648, 10107, 10361, NA, 9647, 10012, 10361, 10727), class = "Date"), eventmonth = c(6L, 9L, 5L, NA, 5L, 5L, 5L, 5L), sampledate = structure(c(9709, 10075, 10441, 10805, 9709, 10075, 10441, 10805), class = "Date"), year = c(1996L, 1997L, 1998L, 1999L, 1996L, 1997L, 1998L, 1999L)), class = "data.frame", row.names = c(NA, -8L))

Creating subset of dataset based on multiple condition in r

I want to extract the past 3 weeks' data for each household_id and channel combination. These past 3 weeks are determined by mala_fide_week and mala_fide_year: for each household_id and channel combination, the records kept must fall before that week.
Below is the dataset:
For example, Household_id 100 with channel A: the mala_fide_week is 42 and the mala_fide_year is 2021, so the past three records are those before week 42 of 2021, computed from the week and year columns.
For the Household_id 100 and channel B combination, there are only two records earlier than mala_fide_week and mala_fide_year.
For Household_id 101 and channel C, two years are involved, 2019 and 2020.
The final dataset will be as below.
Household_id 102 is not considered, as its week and year are greater than mala_fide_week and mala_fide_year.
I have tried multiple options but am not getting through. Any help is much appreciated!
sample dataset:
data <- data.frame(Household_id = c(100, 100, 100, 100, 100, 100, 101, 101, 101, 101, 102, 102),
                   channel = c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "D", "D"),
                   duration = c(12, 34, 567, 67, 34, 67, 98, 23, 56, 89, 73, 76),
                   mala_fide_week = c(42, 42, 42, 42, 42, 42, 5, 5, 5, 5, 30, 30),
                   mala_fide_year = c(2021, 2021, 2021, 2021, 2021, 2021, 2020, 2020, 2020, 2020, 2021, 2021),
                   week = c(36, 37, 38, 39, 22, 23, 51, 52, 1, 2, 38, 39),
                   year = c(2021, 2021, 2021, 2021, 2020, 2020, 2019, 2019, 2020, 2020, 2021, 2021))
I think you first need to obtain the absolute number of weeks (week + year * 52), then filter accordingly. slice_tail() gets the last three rows of each group.
library(dplyr)
data |>
  filter(week + 52 * year <= mala_fide_week + 52 * mala_fide_year) |>
  group_by(Household_id, channel) |>
  arrange(year, week, .by_group = TRUE) |>
  slice_tail(n = 3)
# A tibble: 8 x 7
# Groups: Household_id, channel [3]
Household_id channel duration mala_fide_week mala_fide_year week year
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 A 34 42 2021 37 2021
2 100 A 567 42 2021 38 2021
3 100 A 67 42 2021 39 2021
4 100 B 34 42 2021 22 2020
5 100 B 67 42 2021 23 2020
6 101 C 23 5 2020 52 2019
7 101 C 56 5 2020 1 2020
8 101 C 89 5 2020 2 2020
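The week + 52 * year expression works because it maps each (year, week) pair to a single monotonically increasing index, so ordinary <= comparisons order records correctly across year boundaries. One caveat worth knowing: ISO calendars occasionally contain a week 53, which this linearization would tie with week 1 of the next year. A quick sketch (abs_week is a made-up helper name, not part of the answer's code):

```r
# Linearize (year, week) into one comparable index.
abs_week <- function(week, year) week + 52 * year

abs_week(52, 2019) < abs_week(1, 2020)   # TRUE: late 2019 orders before early 2020
abs_week(38, 2021) <= abs_week(42, 2021) # TRUE: within-year order is preserved
abs_week(53, 2019) == abs_week(1, 2020)  # TRUE: the week-53 edge case ties
```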

How can I compute a variable lag value to create a moving average over different time spans?

My problem is quite simple but I still cannot find an easy solution that doesn't require me to create a lot of unnecessary filler rows.
Given this dataset:
df <- structure(
list(
date = c(
2015.16666666667,
2015.33333333333,
2015.83333333333,
2016,
2016.08333333333,
2016.25,
2016.33333333333,
2016.41666666667,
2016.5,
2016.66666666667
),
Age = c(
1,
2.99999999999818,
8.99999999999818,
10.9999999999991,
11.9999999999982,
13.9999999999991,
14.9999999999982,
16,
16.9999999999991,
19
),
year = c(
2015L,
2015L,
2015L,
2015L,
2016L,
2016L,
2016L,
2016L,
2016L,
2016L
),
month = c(2L, 4L,
10L, 12L, 1L, 3L, 4L, 5L, 6L, 8L),
r_Total = c(
481.02,
666.36,
851.7,
1633.74,
2155.1,
2613.74,
3105.44,
4429.52,
5170.88,
5170.88
)
),
row.names = c(NA,-10L),
class = c("tbl_df", "tbl", "data.frame")
)
I want to compute a moving average of r_Total for the last 12 months. However, the data has no rows for months where r_Total was 0, so my usual solution does not work:
library(dplyr)
df %>%
  mutate(cummulative_sum = cumsum(r_Total),
         moving_average = (cummulative_sum - lag(cummulative_sum, 12)) / 12)
This computes a moving average over the last 12 values, but crucially not over the last 12 months!
lag(), which I use here, simply looks 12 positions back in the ordered vector. What I need instead is a function that gives me the value from the row where Age equals the current Age - 12 (Age being months since inception of the value).
So what can I do?
The slider package is great for when you need to use another column to define the time window.
library(slider)
df %>%
  mutate(avg_12mo = slide_index_dbl(r_Total, Age, mean, .before = 11),
         sum_12mo = slide_index_dbl(r_Total, Age, sum, .before = 11))
# A tibble: 10 x 7
date Age year month r_Total avg_12mo sum_12mo
<dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>
1 2015. 1 2015 2 481. 481. 481.
2 2015. 3.00 2015 4 666. 574. 1147.
3 2016. 9.00 2015 10 852. 666. 1999.
4 2016 11.0 2015 12 1634. 908. 3633.
5 2016. 12.0 2016 1 2155. 1158. 5788.
6 2016. 14.0 2016 3 2614. 1814. 7254.
7 2016. 15.0 2016 4 3105. 2072. 10360.
8 2016. 16 2016 5 4430. 2465. 14789.
9 2016. 17.0 2016 6 5171. 2851. 19960.
10 2017. 19 2016 8 5171. 3141. 25131
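To make the .before = 11 argument concrete: for each row, slide_index_dbl() keeps the rows whose index falls in [index - 11, index], i.e. a 12-month window including the current month. A toy example (the values and indices here are made up):

```r
library(slider)

# Three observations at "months since inception" 1, 2 and 14.
x   <- c(10, 20, 30)
idx <- c(1, 2, 14)

# .before = 11 keeps rows whose index lies in [index - 11, index]:
slide_index_dbl(x, idx, sum, .before = 11)
#> [1] 10 30 30
```

The third window is [3, 14], so months 1 and 2 have aged out and only the value 30 remains, unlike a position-based lag which would always look a fixed number of rows back.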
I too have been struggling with some of these moving-window issues. In cases where the step size or window size is not a constant number of rows, tidyverse approaches can become cumbersome.
In these cases, although I sometimes try too hard to fit everything into a piped fashion, ordinary loops can be easier to work with.
First attempt with a loop. Instead of treating the data.frame as the main input, our main input is actually the dates at which we want to look back 12 months (which here happen to be derived from df, but you could instead go by a calendar year or whatnot).
Remember, when using loops to build a result set, use lists or pre-allocated result vectors instead of growing a data.frame or vector by repeatedly appending.
df$yearmonth <- with(df, year + (month - 1) / 12)
df$cummulative <- NA_real_
for (i in seq_along(df$Age)) {
  df$cummulative[i] <- df %>%
    filter(between(Age, df$Age[i] - 11, df$Age[i])) %>% ## any rows within the past 12 months
    pull(r_Total) %>%
    sum()
}
> df
# A tibble: 10 x 6
date Age year month r_Total cummulative
<dbl> <dbl> <int> <int> <dbl> <dbl>
1 2015. 1 2015 2 481. 481.
2 2015. 3.00 2015 4 666. 1147.
3 2016. 9.00 2015 10 852. 1999.
4 2016 11.0 2015 12 1634. 3633.
5 2016. 12.0 2016 1 2155. 5788.
6 2016. 14.0 2016 3 2614. 7254.
7 2016. 15.0 2016 4 3105. 10360.
8 2016. 16 2016 5 4430. 14789.
9 2016. 17.0 2016 6 5171. 19960.
10 2017. 19 2016 8 5171. 25131
It is, however, unclear what you are averaging over; is the cumulative r_Total always divided by 12 months, even when its sum covers only 2 rows?

Calculate number of negative values between two dates

I have a data frame of SPEI values. I want to calculate two statistics (explained below) at an interval of 20 years, i.e. 2021-2040, 2041-2060, 2061-2080, 2081-2100, and also for each year, i.e. 2021, 2022, 2023, etc. up to 2100. The first column contains the Date (month-year).
The statistics are:
Drought frequency: the number of times SPEI < 0 in the specified period (20 years and 1 year, respectively).
Drought duration: the number of months between a drought's start month (included) and end month (not included) within the specified period. I am assuming a drought event starts when SPEI < 0.
I was wondering if there's a way to do that in R? It seems like an easy problem, but I don't know how to do it. Please help me out; Excel is taking too long. Thanks.
> head(test, 20)
Date spei-3
1 2021-01-01 NA
2 2021-02-01 NA
3 2021-03-01 -0.52133737
4 2021-04-01 -0.60047887
5 2021-05-01 0.56838399
6 2021-06-01 0.02285012
7 2021-07-01 0.26288462
8 2021-08-01 -0.14314685
9 2021-09-01 -0.73132256
10 2021-10-01 -1.23389220
11 2021-11-01 -1.15874943
12 2021-12-01 0.27954143
13 2022-01-01 1.14606657
14 2022-02-01 0.66872986
15 2022-03-01 -1.13758050
16 2022-04-01 -0.27861017
17 2022-05-01 0.99992395
18 2022-06-01 0.61024314
19 2022-07-01 -0.47450485
20 2022-08-01 -1.06682997
Edit:
I would very much like to add some code, but I don't know where to start.
test = "E:/drought.xlsx"
#Extract year and month and add it as a column
test$Year = format(test$Date,"%Y")
test$Month = format(test$Date,"%B")
I don't know how to go on from here. I found that cumsum can help, but how do I select one year and then apply cumsum to it? I am not withholding code on purpose; I just don't know where or how to begin.
There are a couple of questions in the OP's post, so I will go through them step by step. You'll need dplyr and lubridate for this workflow.
First, we create some fake data to use:
library(lubridate)
library(dplyr)
#create example data
dd <- data.frame(Date = seq.Date(as.Date("2021-01-01"), as.Date("2100-12-01"), by = "month"),
                 spei = rnorm(960, 0, 2))
That will look like this, similar to what you have above (the extra columns shown here are created in the next step):
> head(dd)
Date spei year year_20 drought
1 2021-01-01 -6.85689789 2021 2021_2040 1
2 2021-02-01 -0.09292459 2021 2021_2040 1
3 2021-03-01 0.13715922 2021 2021_2040 0
4 2021-04-01 2.26805601 2021 2021_2040 0
5 2021-05-01 -0.47325008 2021 2021_2040 1
6 2021-06-01 0.37034138 2021 2021_2040 0
Then we can use lubridate and cut to create our yearly and 20-year variables to group by later and create a column drought signifying if spei was negative.
#create columns to group on, by year and by 20-year period
dd <- dd %>%
  mutate(year = year(Date),
         year_20 = cut(year, breaks = c(2020, 2040, 2060, 2080, 2100), include.lowest = T,
                       labels = c("2021_2040", "2041_2060", "2061_2080", "2081_2100"))) %>%
  #column signifying if that month was a drought
  mutate(drought = ifelse(spei < 0, 1, 0))
Once we have that, we just use the group_by function to get frequency (or number of months with a drought) by year or 20-year period
#by year
dd %>%
  group_by(year) %>%
  summarise(year_freq = sum(drought)) %>%
  ungroup()
# A tibble: 80 x 2
year year_freq
<dbl> <dbl>
1 2021 6
2 2022 4
3 2023 7
4 2024 6
5 2025 6
6 2026 7
#by 20-year group
dd %>%
  group_by(year_20) %>%
  summarise(year20_freq = sum(drought)) %>%
  ungroup()
# A tibble: 4 x 2
year_20 year20_freq
<fct> <dbl>
1 2021_2040 125
2 2041_2060 121
3 2061_2080 121
4 2081_2100 132
Calculating drought duration is a bit more complicated. It involves
identifying the first month of each drought
calculating the length of each drought
combining information from 1 and 2 together
We can use lag to identify when a month changed from "no drought" to "drought". In this case we want an index of where the value in row i is different from that in row i-1
# find index of where values change.
change.ind <- dd$drought != lag(dd$drought)
#use index to find drought start
drought.start <- dd[change.ind & dd$drought == 1,]
This results in a subset of the initial dataset, but only with the rows with the first month of a drought. Then we can use rle to calculate the length of the drought. rle will calculate the length of every run of numbers, so we will have to subset to only those runs where the value==1 (drought)
#calculate drought lengths
drought.lengths <- rle(dd$drought)
# we only want droughts (values = 1)
drought.lengths <- drought.lengths$lengths[drought.lengths$values==1]
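If rle() is unfamiliar, here is a minimal illustration of how it encodes runs of repeated values, and how subsetting on values == 1 isolates the drought runs:

```r
# rle() encodes a vector as runs of repeated values; runs where
# values == 1 are the droughts, and their lengths are the durations.
r <- rle(c(0, 1, 1, 0, 1, 1, 1))
r$lengths                 # 1 2 1 3
r$values                  # 0 1 0 1
r$lengths[r$values == 1]  # 2 3 -> two droughts, lasting 2 and 3 months
```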
Now we can combine these two pieces of information together. The first row is an NA because there is no value at i-1 to compare the lag to. It can be dropped, unless you want to include that data.
drought.dur <- cbind(drought.start, drought_length = drought.lengths)
head(drought.dur)
Date spei year year_20 drought drought_length
NA <NA> NA NA <NA> NA 2
5 2021-05-01 -0.47325008 2021 2021_2040 1 1
9 2021-09-01 -2.04564549 2021 2021_2040 1 1
11 2021-11-01 -1.04293866 2021 2021_2040 1 2
14 2022-02-01 -0.83759671 2022 2021_2040 1 1
17 2022-05-01 -0.07784316 2022 2021_2040 1 1

Subsetting data by date range across years in R

I have a long-term sightings data set of identified individuals (~16,000 records from 1979-2019) and I would like to subset the same date range (YYYY-09-01 to YYYY(+1)-08-31) across years in R. I have successfully done so for each "year" (and obtained the unique IDs) using:
library(dplyr)
library(lubridate)
year79 <- data %>%
  select(ID, Sex, AgeClass, Age, Date, Month, Year) %>%
  filter(Date >= as.Date("1978-09-01") & Date <= as.Date("1979-08-31")) %>%
  filter(!duplicated(ID))

year80 <- data %>%
  select(ID, Sex, AgeClass, Age, Date, Month, Year) %>%
  filter(Date >= as.Date("1979-09-01") & Date <= as.Date("1980-08-31")) %>%
  filter(!duplicated(ID))
I would like to clean up the code and ideally not need to specify each range explicitly (just have it iterate through). I am new to R and stuck on how to do this. Any suggestions?
FYI "Month" and "Year" are included for producing a table via melt and cast later on.
example data:
ID Year Month Day Date AgeClass Age Sex
1 1034 1979 4 17 1979-04-17 U 3 F
2 1127 1979 5 3 1979-05-03 A 13 F
3 1222 1979 5 3 1979-05-03 U 0 F
4 1303 1979 6 16 1979-06-16 U 0 F
5 1153 1980 4 16 1980-04-16 C 0 F
6 1014 1980 4 16 1980-04-16 U 6 F
ID Year Month Day Date AgeClass Age Sex
16428 2503 2019 5 8 2019-05-08 U NA F
16429 3760 2019 5 8 2019-05-08 A 12 F
16430 4080 2019 5 9 2019-05-09 A 9 F
16431 4095 2019 5 9 2019-05-09 A 9 U
16432 1204 2019 5 11 2019-05-11 A 37 F
16433 1204 2019 5 11 2019-05-11 A NA F
#> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Every year has 122 days from Sept 1 to Dec 31 inclusive, so you could add a variable marking the "fiscal year" for each row:
set.seed(42)
library(dplyr)
library(lubridate) # for year()

my_data <- tibble(ID = 1:6,
                  Date = as.Date("1978-09-01") + c(-1, 0, 1, 364, 365, 366))
my_data

# There are 122 days from each Aug 31 (last day of the FY) to the end of the CY:
# lubridate::ymd(19781231) - lubridate::ymd(19780831)
my_data %>%
  mutate(FY = year(Date + 122))
## A tibble: 6 x 3
# ID Date FY
# <int> <date> <dbl>
#1 1 1978-08-31 1978
#2 2 1978-09-01 1979
#3 3 1978-09-02 1979
#4 4 1979-08-31 1979
#5 5 1979-09-01 1980
#6 6 1979-09-02 1980
You could keep the data in one table and do subsequent analysis using group_by(FY), or use %>% split(.$FY) to put each FY into its own element of a list. From my limited experience, I think it's generally an anti-pattern to create separate data frames for annual subsets of your data, as that makes your code harder to maintain, troubleshoot, and modify.
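As a hedged sketch of the group_by(FY) route (the sightings tibble, its IDs, and dates below are made-up stand-ins, not your real data), getting unique individuals per fiscal year in one pass might look like:

```r
library(dplyr)
library(lubridate)

# Stand-in data: individual 1034 seen once, 1127 seen before and after Sep 1.
sightings <- tibble(ID   = c(1034, 1127, 1127),
                    Date = as.Date(c("1979-04-17", "1979-05-03", "1979-09-15")))

sightings %>%
  mutate(FY = year(Date + 122)) %>%  # Sep 1 rolls over into the next FY
  distinct(FY, ID) %>%               # one row per individual per FY
  count(FY, name = "n_individuals")
# FY 1979: 2 individuals; FY 1980: 1 individual
```

This keeps everything in one table, so later steps (age structure, sex ratios, etc.) stay a group_by away rather than requiring forty separate data frames.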