Turn loop-based code into vectorised code in R? - r

I've got this dataset and want to perform some calculations based on certain conditions:
library(tidyverse)
library(lubridate)
filas <- structure(list(Año = c(rep(2020,4),rep(2021,4),2022),
Mes = c(2:5,3:4,9,11,1),
Id = c(rep(1,7),2,2)),
row.names = c(NA, -9L),
class = c("tbl_df", "tbl", "data.frame")) %>%
mutate(fecha = make_date(Año,Mes,1),
meses_imp = make_date(2999,1,1))
Año   Mes  Id  fecha       meses_imp
2020  2    1   2020-02-01  2999-01-01
2020  3    1   2020-03-01  2999-01-01
2020  4    1   2020-04-01  2999-01-01
2020  5    1   2020-05-01  2999-01-01
2021  3    1   2021-03-01  2999-01-01
2021  4    1   2021-04-01  2999-01-01
2021  9    1   2021-09-01  2999-01-01
2021  11   2   2021-11-01  2999-01-01
2022  1    2   2022-01-01  2999-01-01
I need to add rows for each "Id" wherever there are gaps ("holes") between two consecutive months, and then count the added rows. I've achieved this with a "while" loop:
i <- 2
while(!is.na(filas[i,]$Id)) {
if (as.double(difftime(filas[i,]$fecha,filas[i-1,]$fecha)) > 31 &
filas[i,]$Id == filas[i-1,]$Id) {
filas <- add_row(filas,
Id = filas[i,]$Id,
fecha = filas[i-1,]$fecha + months(1),
meses_imp = pmin(filas[i-1,]$fecha,
filas[i-1,]$meses_imp),
.after = i-1)}
i=i+1}
filas2 <- filas %>%
group_by(Id,meses_imp) %>%
summarise(cant_meses_imp = n()) %>%
ungroup() %>%
filter(meses_imp != "2999-01-01")
filas <- left_join(filas,
filas2,
by=c("Id","meses_imp"))
Año   Mes  Id  fecha       meses_imp   cant_meses_imp
2020  2    1   2020-02-01  2999-01-01  NA
2020  3    1   2020-03-01  2999-01-01  NA
2020  4    1   2020-04-01  2999-01-01  NA
2020  5    1   2020-05-01  2999-01-01  NA
NA    NA   1   2020-06-01  2020-05-01  9
NA    NA   1   2020-07-01  2020-05-01  9
NA    NA   1   2020-08-01  2020-05-01  9
NA    NA   1   2020-09-01  2020-05-01  9
NA    NA   1   2020-10-01  2020-05-01  9
NA    NA   1   2020-11-01  2020-05-01  9
NA    NA   1   2020-12-01  2020-05-01  9
NA    NA   1   2021-01-01  2020-05-01  9
NA    NA   1   2021-02-01  2020-05-01  9
2021  3    1   2021-03-01  2999-01-01  NA
2021  4    1   2021-04-01  2999-01-01  NA
NA    NA   1   2021-05-01  2021-04-01  4
NA    NA   1   2021-06-01  2021-04-01  4
NA    NA   1   2021-07-01  2021-04-01  4
NA    NA   1   2021-08-01  2021-04-01  4
2021  9    1   2021-09-01  2999-01-01  NA
2021  11   2   2021-11-01  2999-01-01  NA
NA    NA   2   2021-12-01  2021-11-01  1
2022  1    2   2022-01-01  2999-01-01  NA
Since I'd like to apply this to a much larger dataset (~300k rows), how could I rewrite it in a vectorised way so that it's more efficient (and maybe more elegant)?
Thanks!

You can use the following code, which relies on the padr and zoo packages.
The idea is to:
1. Add missing dates with the padr::pad() function.
2. Remove unwanted rows (non-integer Id values).
3. Create na and grp columns to identify the rows added in step 1.
4. Group by grp and create a column cant_meses_imp that counts the number of consecutive NA rows in each group.
5. Select only the desired columns.
library(dplyr)
library(padr)
library(zoo)
filas %>%
pad(by = "fecha") %>% # add missing dates
mutate(Id = na.approx(Id)) %>% # interpolate NA values in Id column
subset(Id%%1 == 0) %>% # Keep only integer Id values
# This part is for generating the cant_meses_imp column
mutate(na = ifelse(is.na(Mes), 1, 0),
grp = rle(na)$lengths %>% {rep(seq(length(.)), .)}) %>%
group_by(grp) %>%
mutate(cant_meses_imp = ifelse(na == 0, NA, n())) %>%
ungroup() %>%
select(-c(na, grp))
The code does not reproduce the fecha column exactly, as there are no guidelines for its values.
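For comparison, here is a minimal sketch of another vectorised route, using tidyr::complete() and tidyr::fill() instead of padr. It is not the answer's method. It assumes that NA (rather than the 2999-01-01 sentinel) marks the original rows, and the helper columns original and ultima are names made up for this illustration. Only the Id and fecha columns of the question's data are rebuilt here.

```r
library(dplyr)
library(tidyr)

# Id and fecha as in the question's example data
filas <- tibble(
  Id = c(rep(1, 7), 2, 2),
  fecha = as.Date(c("2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01",
                    "2021-03-01", "2021-04-01", "2021-09-01",
                    "2021-11-01", "2022-01-01"))
)

res <- filas %>%
  mutate(original = TRUE) %>%                      # flag the rows that already exist
  group_by(Id) %>%
  complete(fecha = seq(min(fecha), max(fecha), by = "month")) %>%  # add missing months per Id
  mutate(ultima = if_else(is.na(original), as.Date(NA), fecha)) %>%
  fill(ultima) %>%                                 # last existing month before each added row
  ungroup() %>%
  mutate(meses_imp = if_else(is.na(original), ultima, as.Date(NA))) %>%
  group_by(Id, meses_imp) %>%                      # one group per gap
  mutate(cant_meses_imp = if_else(is.na(meses_imp), NA_integer_, n())) %>%
  ungroup() %>%
  select(-original, -ultima)
```

With the example data this yields 23 rows, 14 of them added, and cant_meses_imp values of 9, 4 and 1 for the three gaps, which matches the counts in the question's expected output.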

Related

Why does grepl work but not str_detect for mutate depending on row value?

I have been trying to wrap my head around this.
I need to create a corrected column based on detecting a specific comment in a separate "error" column of my data. I can work around this with grepl, but I am struggling to get str_detect to work as well (it is usually faster for big datasets).
Here is an example database:
test <- tibble(
id = seq(1:30),
date = sample(seq(as.Date('2000/01/01'), as.Date('2018/01/01'), by="day"), 30),
error = c(rep(NA, 3), "wrong date! Correct date = 01.03.2022",
rep(NA, 5), "wrong date! Correct date = 01.05.2021",
rep(NA, 5), "wrong date! Correct date = 01.03.2022",
rep(NA, 7), "wrong date! Correct date = 01.05.2021",
rep(NA, 2), "date already corrected on 01.05.2021",
NA, "date already corrected on 01.03.2022", NA))
I first tried to create a new "date_corr" column with str_detect:
test %>%
mutate(date_corr=if_else(str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
date_corr=if_else(str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
This yields:
# A tibble: 30 × 4
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA NA
2 2 2004-06-30 NA NA
3 3 2015-09-25 NA NA
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA NA
6 6 2004-08-02 NA NA
7 7 2001-10-15 NA NA
8 8 2007-07-21 NA NA
9 9 2014-04-19 NA NA
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
Adding rowwise makes no difference:
test %>%
rowwise() %>%
mutate(date_corr=if_else(str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
date_corr=if_else(str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
# A tibble: 30 × 4
# Rowwise:
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA NA
2 2 2004-06-30 NA NA
3 3 2015-09-25 NA NA
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA NA
6 6 2004-08-02 NA NA
7 7 2001-10-15 NA NA
8 8 2007-07-21 NA NA
9 9 2014-04-19 NA NA
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
However, with grepl I get the desired outcome, regardless of rowwise:
test %>%
mutate(date_corr=if_else(grepl("date \\= 01\\.03\\.2022$", error), as.Date('2022/03/01'), date),
date_corr=if_else(grepl("date \\= 01\\.05\\.2021$", error), as.Date('2021/05/01'), date_corr))
# A tibble: 30 × 4
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA 2010-04-28
2 2 2004-06-30 NA 2004-06-30
3 3 2015-09-25 NA 2015-09-25
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA 2008-07-16
6 6 2004-08-02 NA 2004-08-02
7 7 2001-10-15 NA 2001-10-15
8 8 2007-07-21 NA 2007-07-21
9 9 2014-04-19 NA 2014-04-19
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
test %>%
rowwise() %>%
mutate(date_corr=if_else(grepl("date \\= 01\\.03\\.2022$", error), as.Date('2022/03/01'), date),
date_corr=if_else(grepl("date \\= 01\\.05\\.2021$", error), as.Date('2021/05/01'), date_corr))
# A tibble: 30 × 4
# Rowwise:
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA 2010-04-28
2 2 2004-06-30 NA 2004-06-30
3 3 2015-09-25 NA 2015-09-25
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA 2008-07-16
6 6 2004-08-02 NA 2004-08-02
7 7 2001-10-15 NA 2001-10-15
8 8 2007-07-21 NA 2007-07-21
9 9 2014-04-19 NA 2014-04-19
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
What am I missing here?
The difference is how they handle NA values:
str_detect(NA, "missing")
# [1] NA
grepl("missing", NA)
# [1] FALSE
Note also that if the condition passed to if_else is NA, the result is NA as well:
if_else(NA, 1, 2)
# [1] NA
str_detect preserved the NA value. It's not clear what the "right" value should be, but if you want str_detect to give the same result as grepl, you can be explicit about not changing rows where error is NA:
test %>%
mutate(date_corr=if_else(!is.na(error) & str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
date_corr=if_else(!is.na(error) & str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
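As an alternative to the !is.na(error) guard, dplyr's if_else() has a missing argument that supplies the value to use where the condition is NA. The sketch below is a minimal illustration on two made-up rows, not the full test tibble.

```r
library(dplyr)
library(stringr)

# Two illustrative rows: one without an error note, one with a correction
error <- c(NA, "wrong date! Correct date = 01.03.2022")
date  <- as.Date(c("2010-04-28", "2005-08-21"))

date_corr <- if_else(
  str_detect(error, "date = 01\\.03\\.2022$"),  # NA where error is NA
  as.Date("2022-03-01"),
  date,
  missing = date                                # keep the original date when the condition is NA
)
```

Here the first element keeps its original date and the second is replaced, which is the same behaviour the grepl version gives.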

Identify the number of active days in an isoweek given a date range

My data are as follows:
df <- read_table("begin.date end.date
2019-07-22 2019-07-29
2019-07-29 2019-08-03
2019-08-25 2019-08-30
2019-08-30 2019-09-24
2019-09-30 2019-10-05")
I would like to assign two new columns:
isoweek_id = every isoweek in the year (so there will be one row for every week in the year)
data_days = the number of days data collection occurred within that isoweek given the begin.date and end.date, which represent date ranges when data collection occurred.
We might, therefore, have weeks in which the number of days of data collection is 0 if, for example, a temporal gap in data collection spanned more than one isoweek. (Note: my real data contain several years' worth of data collection.)
My desired output would look something like this:
begin.date end.date isoweek_id data_days
NA NA 29 0
2019-07-22 2019-07-29 30 7
2019-07-29 2019-08-03 31 6
NA NA 32 0
NA NA 33 0
2019-08-25 2019-08-30 34 1
2019-08-25 2019-08-30 35 5
2019-08-30 2019-09-24 36 7
2019-08-30 2019-09-24 37 7
2019-08-30 2019-09-24 38 7
2019-08-30 2019-09-24 39 2
2019-09-30 2019-10-05 40 6
NA NA 41 0
NA NA 42 0
NA NA 43 0
You can look at which isoweeks span which dates as follows:
library(ISOweek)
w <- paste("2019-W35", 1:7, sep = "-")
data.frame(weekdate = w, date = ISOweek2date(w))
Thank you in advance!
I hope this does the job:
library(dplyr)
library(lubridate)
library(tidyr)
df %>%
dplyr::arrange(begin.date) %>%
# unnest day sequence from start to end into df https://stackoverflow.com/questions/50997084/create-dataframe-of-rows-of-sequence-of-years-from-rows-with-start-end-dates
dplyr::group_by(rn = dplyr::row_number()) %>%
dplyr::mutate(dates = list(seq.Date(from = begin.date, to = end.date, by = "days"))) %>%
tidyr::unnest(dates) %>%
dplyr::ungroup() %>%
# right join list of all dates with iso week and year
dplyr::right_join(dplyr::tibble(dates = seq.Date(from = min(df$begin.date), max(df$end.date), by = "days")) %>%
dplyr::mutate(year = lubridate::year(dates),
iso_week = lubridate::isoweek(dates)),
by = "dates") %>%
# where rn is NA, fill it with a number that is larger than all existing rns
dplyr::mutate(rn = ifelse(is.na(rn), nrow(df) + 1, rn)) %>%
# summarize data
dplyr::group_by(year, iso_week, rn) %>%
dplyr::summarize(bdate = min(begin.date, na.rm = TRUE),
edate = min(end.date, na.rm = TRUE),
days = sum(ifelse(is.na(begin.date), 0, 1))) %>%
dplyr::ungroup() %>%
# get lowest sequential numbering per week since we can have duplicates like the example shows
dplyr::group_by(year, iso_week) %>%
dplyr::slice_min(order_by = rn, n = 1) %>%
dplyr::ungroup() # you might want to remove and/or rename columns
# A tibble: 11 x 6
year iso_week rn bdate edate days
<dbl> <dbl> <int> <date> <date> <dbl>
1 2019 30 1 2019-07-22 2019-07-29 7
2 2019 31 1 2019-07-22 2019-07-29 1
3 2019 32 6 NA NA 0
4 2019 33 6 NA NA 0
5 2019 34 3 2019-08-25 2019-08-30 1
6 2019 35 3 2019-08-25 2019-08-30 5
7 2019 36 4 2019-08-30 2019-09-24 7
8 2019 37 4 2019-08-30 2019-09-24 7
9 2019 38 4 2019-08-30 2019-09-24 7
10 2019 39 4 2019-08-30 2019-09-24 2
11 2019 40 5 2019-09-30 2019-10-05 6
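If only the number of covered days per ISO week is needed, a base R sketch along the following lines may be enough. It is a simplification, not the answer's method: it counts each calendar day once per week regardless of which range it came from, so week 35 comes out as 7 rather than the 5 shown in the question's desired output. It also only covers weeks between the first begin.date and the last end.date. The variable names are illustrative.

```r
# Date ranges from the question
begin <- as.Date(c("2019-07-22", "2019-07-29", "2019-08-25", "2019-08-30", "2019-09-30"))
end   <- as.Date(c("2019-07-29", "2019-08-03", "2019-08-30", "2019-09-24", "2019-10-05"))

# Expand every range to its days; a day shared by two ranges is kept once
covered_days <- unique(do.call(c, Map(seq, begin, end, by = "day")))

# Every calendar day in the observed span, labelled with ISO year and ISO week
all_days <- seq(min(begin), max(end), by = "day")
iso_week <- format(all_days, "%G-W%V")

# Number of covered days per ISO week (0 for weeks without data collection)
data_days <- vapply(split(all_days %in% covered_days, iso_week), sum, integer(1))
```

The result is a named integer vector with one entry per ISO week, for example "2019-W30" with 7 days and "2019-W32" with 0.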

r - Generating cumulative sum, total sum, and unique identifiers between start and end dates

What I want to do
I have a dataset of protest events in the United States. Some events are stand-alone events, while others persist day-after-day (a "multi-day event"). My dataset is structured at the daily level, so a three-day multi-day event is spread out over three rows.
I want to accomplish the following:
Create a cumulative sum of the number of days thus far in any given multi-day event. Specifically, I want to count the number of days between the "First day" and "Last day" of any linked event.
Store the total number of days of each multi-day event as a variable.
"Name" each multi-day event by concatenating the state in which the protest occurred and a sequential identity number starting at 1 in each state and extending upwards.
Data
Here's a reproducible example:
# Library
library(tidyverse) # Brings in dplyr
# Generate the data set of protests
test <- data.frame(state = c(rep("Washington", 10), rep("Idaho", 10)),
date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15",
"2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
"2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
"2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))
# Now create some lags and leads
test <- test %>%
group_by(state) %>%
mutate(date_lag = lag(date),
date_lead = lead(date),
days_last = date - date_lag,
days_next = date_lead - date,
link_last = if_else(days_last <= 1, 1, 0),
link_next = if_else(days_next <= 1, 1, 0),
sequence = if_else(link_last == 0 & link_next == 1, "First day",
if_else(is.na(link_last) == TRUE & link_next == 1, "First day",
if_else(link_last == 1 & link_next == 1, "Ongoing",
if_else(link_last == 1 & link_next == 0, "Last day",
if_else(link_last == 1 & is.na(link_next)==TRUE, "Last day", "Not linked"))))))
This generates the following dataframe:
state date date_lag date_lead days_last days_next link_last link_next sequence
<chr> <date> <date> <date> <drtn> <drtn> <dbl> <dbl> <chr>
1 Washington 2021-01-01 NA 2021-01-03 NA days 2 days NA 0 NA
2 Washington 2021-01-03 2021-01-01 2021-01-04 2 days 1 days 0 1 First day
3 Washington 2021-01-04 2021-01-03 2021-01-10 1 days 6 days 1 0 Last day
4 Washington 2021-01-10 2021-01-04 2021-01-15 6 days 5 days 0 0 Not linked
5 Washington 2021-01-15 2021-01-10 2021-01-16 5 days 1 days 0 1 First day
6 Washington 2021-01-16 2021-01-15 2021-01-17 1 days 1 days 1 1 Ongoing
7 Washington 2021-01-17 2021-01-16 2021-01-18 1 days 1 days 1 1 Ongoing
8 Washington 2021-01-18 2021-01-17 2021-01-19 1 days 1 days 1 1 Ongoing
9 Washington 2021-01-19 2021-01-18 2021-01-28 1 days 9 days 1 0 Last day
10 Washington 2021-01-28 2021-01-19 NA 9 days NA days 0 NA NA
11 Idaho 2021-01-12 NA 2021-01-13 NA days 1 days NA 1 NA
12 Idaho 2021-01-13 2021-01-12 2021-01-14 1 days 1 days 1 1 Ongoing
13 Idaho 2021-01-14 2021-01-13 2021-02-01 1 days 18 days 1 0 Last day
14 Idaho 2021-02-01 2021-01-14 2021-02-03 18 days 2 days 0 0 Not linked
15 Idaho 2021-02-03 2021-02-01 2021-02-04 2 days 1 days 0 1 First day
16 Idaho 2021-02-04 2021-02-03 2021-02-05 1 days 1 days 1 1 Ongoing
17 Idaho 2021-02-05 2021-02-04 2021-02-08 1 days 3 days 1 0 Last day
18 Idaho 2021-02-08 2021-02-05 2021-02-10 3 days 2 days 0 0 Not linked
19 Idaho 2021-02-10 2021-02-08 2021-02-14 2 days 4 days 0 0 Not linked
20 Idaho 2021-02-14 2021-02-10 NA 4 days NA days 0 NA NA
What I want to create:
state date date_lag date_lead days_last days_next link_last link_next sequence cumulative duration name
<chr> <date> <date> <date> <drtn> <drtn> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 Washington 2021-01-01 NA 2021-01-03 NA days 2 days NA 0 NA NA 0 NA
2 Washington 2021-01-03 2021-01-01 2021-01-04 2 days 1 days 0 1 First day 1 2 Washington.1
3 Washington 2021-01-04 2021-01-03 2021-01-10 1 days 6 days 1 0 Last day 2 2 Washington.1
4 Washington 2021-01-10 2021-01-04 2021-01-15 6 days 5 days 0 0 Not linked NA 0 NA
5 Washington 2021-01-15 2021-01-10 2021-01-16 5 days 1 days 0 1 First day 1 5 Washington.2
6 Washington 2021-01-16 2021-01-15 2021-01-17 1 days 1 days 1 1 Ongoing 2 5 Washington.2
7 Washington 2021-01-17 2021-01-16 2021-01-18 1 days 1 days 1 1 Ongoing 3 5 Washington.2
8 Washington 2021-01-18 2021-01-17 2021-01-19 1 days 1 days 1 1 Ongoing 4 5 Washington.2
9 Washington 2021-01-19 2021-01-18 2021-01-28 1 days 9 days 1 0 Last day 5 5 Washington.2
10 Washington 2021-01-28 2021-01-19 NA 9 days NA days 0 NA NA NA NA NA
11 Idaho 2021-01-12 NA 2021-01-13 NA days 1 days NA 1 NA 1 3 Idaho.1
12 Idaho 2021-01-13 2021-01-12 2021-01-14 1 days 1 days 1 1 Ongoing 2 3 Idaho.1
13 Idaho 2021-01-14 2021-01-13 2021-02-01 1 days 18 days 1 0 Last day 3 3 Idaho.1
14 Idaho 2021-02-01 2021-01-14 2021-02-03 18 days 2 days 0 0 Not linked NA NA NA
15 Idaho 2021-02-03 2021-02-01 2021-02-04 2 days 1 days 0 1 First day 1 3 Idaho.2
16 Idaho 2021-02-04 2021-02-03 2021-02-05 1 days 1 days 1 1 Ongoing 2 3 Idaho.2
17 Idaho 2021-02-05 2021-02-04 2021-02-08 1 days 3 days 1 0 Last day 3 3 Idaho.2
18 Idaho 2021-02-08 2021-02-05 2021-02-10 3 days 2 days 0 0 Not linked NA NA NA
19 Idaho 2021-02-10 2021-02-08 2021-02-14 2 days 4 days 0 0 Not linked NA NA NA
20 Idaho 2021-02-14 2021-02-10 NA 4 days NA days 0 NA NA NA NA NA
Side question: Why is test$sequence[11] an NA and not "First day"?
I'm not sure these are the specific numbers you're looking for, but this represents what seems to me a simpler and more idiomatic tidyverse approach:
test %>%
group_by(state) %>%
mutate(days_last = as.numeric(date - lag(date)),
new_section = 1*(is.na(days_last) | days_last > 1), # EDIT
section = cumsum(new_section),
name = paste(state,section, sep = ".")) %>%
group_by(name) %>%
mutate(duration = as.numeric(max(date) - min(date) + 1),
sequence = case_when(duration == 1 ~ "Unlinked",
row_number() == 1 ~ "First Day",
row_number() == n() ~ "Last Day",
TRUE ~ "Ongoing")) %>%
ungroup()
Here, I mark any gap of more than one day as a new event, take the cumulative sum, and use that to define the duration of each event.
# A tibble: 20 x 8
state date days_last new_section section name duration sequence
<chr> <date> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 Washington 2021-01-01 NA 1 1 Washington.1 1 Unlinked
2 Washington 2021-01-03 2 1 2 Washington.2 2 First Day
3 Washington 2021-01-04 1 0 2 Washington.2 2 Last Day
4 Washington 2021-01-10 6 1 3 Washington.3 1 Unlinked
5 Washington 2021-01-15 5 1 4 Washington.4 5 First Day
6 Washington 2021-01-16 1 0 4 Washington.4 5 Ongoing
7 Washington 2021-01-17 1 0 4 Washington.4 5 Ongoing
8 Washington 2021-01-18 1 0 4 Washington.4 5 Ongoing
9 Washington 2021-01-19 1 0 4 Washington.4 5 Last Day
10 Washington 2021-01-28 9 1 5 Washington.5 1 Unlinked
11 Idaho 2021-01-12 NA 1 1 Idaho.1 3 First Day
12 Idaho 2021-01-13 1 0 1 Idaho.1 3 Ongoing
13 Idaho 2021-01-14 1 0 1 Idaho.1 3 Last Day
14 Idaho 2021-02-01 18 1 2 Idaho.2 1 Unlinked
15 Idaho 2021-02-03 2 1 3 Idaho.3 3 First Day
16 Idaho 2021-02-04 1 0 3 Idaho.3 3 Ongoing
17 Idaho 2021-02-05 1 0 3 Idaho.3 3 Last Day
18 Idaho 2021-02-08 3 1 4 Idaho.4 1 Unlinked
19 Idaho 2021-02-10 2 1 5 Idaho.5 1 Unlinked
20 Idaho 2021-02-14 4 1 6 Idaho.6 1 Unlinked
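The core of this approach (flag the start of each event, then take a cumulative sum to get an event id) can be sketched in base R on a single state's dates. This is a minimal illustration using the first six Washington dates; the variable names are made up for the example.

```r
# First six Washington dates from the question
date <- as.Date(c("2021-01-01", "2021-01-03", "2021-01-04",
                  "2021-01-10", "2021-01-15", "2021-01-16"))

gap       <- as.numeric(diff(date))   # days since the previous row
new_event <- c(TRUE, gap > 1)         # the first row always starts an event
event_id  <- cumsum(new_event)        # running event number

# Position within each event and total length of each event
cumulative <- ave(seq_along(date), event_id, FUN = seq_along)
duration   <- ave(seq_along(date), event_id, FUN = length)
```

For these dates, event_id is 1, 2, 2, 3, 4, 4, so the second and third rows form one two-day event, as in the tibble above.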
I think creating specific functions to do the counting is easier than trying to do everything in a single pipe.
I left all the intermediate steps and the intermediate columns in the output so you can see what each step is doing. It's very likely you won't need to keep all these columns and you probably can simplify the steps once you understand the approach.
library(tidyverse) # Brings in dplyr
# Generate the data set of protests
test <- data.frame(state = c(rep("Washington", 10), rep("Idaho", 10)),
date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15",
"2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
"2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
"2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))
event_count <- function(v){
cnt <- 0
result <- integer(length(v))
for(idx in seq_along(v)) {
if(v[idx]) {
cnt <- 0
} else {
cnt <- cnt + 1
}
result[idx] <- cnt
}
result
}
need_name <- function(cnt) {
result <- logical(length(cnt))
for(idx in seq_along(cnt)){
if(cnt[idx] == 0){
if(idx == length(cnt)){
result[idx] <- FALSE
break
}
result[idx] <- (cnt[idx + 1] != 0)
} else{
result[idx] <- TRUE
}
}
result
}
running_count <- function(v) {
cnt <- 0
flag <- FALSE
result <- integer(length(v))
for(idx in seq_along(v)){
if(v[idx]) {
if(!flag) {
cnt <- cnt + 1
flag <- !flag
}
result[idx] <- cnt
} else{
result[idx] <- 0
flag <- FALSE
}
}
result
}
test %>%
group_by(state) %>%
arrange(date, .by_group = TRUE) %>%
mutate(
duration = date - lag(date), # --- Compute durations
is_first = duration != 1 # --- Check if it is the first day of a protest
) %>%
replace_na(list(is_first = TRUE)) %>% # --- No more NAs
ungroup %>%
mutate(
cnt = event_count(is_first), # --- How many days does this event have?
need_name = need_name(cnt) # --- Should we name this event?
) %>%
group_by(state) %>%
mutate(
name_number = running_count(need_name) # --- What's the event count?
) %>%
mutate(
name = paste0(state, ".", name_number), # ---- Create names
cumulative = cnt + 1 # --- Start counting from one instead of zero
) %>%
group_by(name) %>%
mutate(
duration = max(duration) # --- Calc total duration
) %>%
ungroup() %>%
mutate( # --- Adding the NAs back
name = if_else(name_number == 0, NA_character_, name),
duration = if_else(name_number == 0, NA_integer_, as.integer(duration)),
cumulative = if_else(name_number == 0, NA_integer_, as.integer(cumulative)),
)
data.table::rleid is useful here to create run ids based on the condition days_last == 1 or days_next == 1 (i.e. sequential dates). If you want different event lengths, you can edit that condition.
library(dplyr)
library(data.table)
test %>%
dplyr::group_by(state) %>%
dplyr::mutate(days_last = c(NA, diff(date)),
days_next = as.numeric(lead(date) - date),
name = paste0(state, ".", data.table::rleid(days_last == 1 | days_next == 1))) %>%
dplyr::group_by(name) %>%
dplyr::mutate(sequence = case_when(
n() == 1 ~ "Not Linked",
row_number() == 1 ~ "First day",
n() == row_number() ~ "Last day",
T ~ "Ongoing"),
duration = n(),
cumulative = seq_along(name)) %>%
dplyr::ungroup()
Output
state date days_last days_next name sequence duration cumulative
<chr> <date> <dbl> <dbl> <chr> <chr> <int> <int>
1 Washington 2021-01-01 NA 2 Washington.1 Not Linked 1 1
2 Washington 2021-01-03 2 1 Washington.2 First day 2 1
3 Washington 2021-01-04 1 6 Washington.2 Last day 2 2
4 Washington 2021-01-10 6 5 Washington.3 Not Linked 1 1
5 Washington 2021-01-15 5 1 Washington.4 First day 5 1
6 Washington 2021-01-16 1 1 Washington.4 Ongoing 5 2
7 Washington 2021-01-17 1 1 Washington.4 Ongoing 5 3
8 Washington 2021-01-18 1 1 Washington.4 Ongoing 5 4
9 Washington 2021-01-19 1 9 Washington.4 Last day 5 5
10 Washington 2021-01-28 9 NA Washington.5 Not Linked 1 1
11 Idaho 2021-01-12 NA 1 Idaho.1 First day 3 1
12 Idaho 2021-01-13 1 1 Idaho.1 Ongoing 3 2
13 Idaho 2021-01-14 1 18 Idaho.1 Last day 3 3
14 Idaho 2021-02-01 18 2 Idaho.2 Not Linked 1 1
15 Idaho 2021-02-03 2 1 Idaho.3 First day 3 1
16 Idaho 2021-02-04 1 1 Idaho.3 Ongoing 3 2
17 Idaho 2021-02-05 1 3 Idaho.3 Last day 3 3
18 Idaho 2021-02-08 3 2 Idaho.4 First day 2 1
19 Idaho 2021-02-10 2 4 Idaho.4 Last day 2 2
20 Idaho 2021-02-14 4 NA Idaho.5 Not Linked 1 1
If need be, you can use the NA in the days_last column to set the values in other columns to NA.
Side question: Why is test$sequence[11] an NA and not "First day"?
Generally, NA propagates in R: if NA is part of an evaluation, the result is normally NA. When you define sequence, your first if_else condition is link_last == 0 & link_next == 1. On row 11, link_last is NA and link_next is 1, so what you're evaluating is:
NA == 0 & 1 == 1
[1] NA
Instead, your nested condition should come first. As your if_else is currently written, the outer condition is already NA for this row, so the result of the nested condition is never used:
is.na(NA) & 1 == 1
[1] TRUE
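One way to put the is.na() checks first is dplyr::case_when(), which evaluates its conditions in order and treats an NA condition as not matched. The sketch below is illustrative only: it uses five made-up link_last and link_next pairs, not the full test data.

```r
library(dplyr)

# Illustrative pairs; the first mimics row 11 (link_last is NA, link_next is 1)
link_last <- c(NA, 0, 1, 1, 0)
link_next <- c(1, 1, 1, 0, 0)

sequence <- case_when(
  is.na(link_last) & link_next == 1                    ~ "First day",  # NA check comes first
  link_last == 0 & link_next == 1                      ~ "First day",
  link_last == 1 & link_next == 1                      ~ "Ongoing",
  link_last == 1 & (link_next == 0 | is.na(link_next)) ~ "Last day",
  TRUE                                                 ~ "Not linked"
)
```

With this ordering the first element becomes "First day" instead of NA.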
Here is a data.table approach.
library(data.table)
# Convert from data.frame to data.table
setDT(test)
# Subset the variables.
test2 <- test[, .(state, date, days_last = as.numeric(days_last),
days_next = as.numeric(days_next), sequence)]
# Code
test2[, name := paste0(state, '.', rleid(days_last == 1 | days_next == 1)),
by = state][
, ':='(duration = .N,
cumulative = seq(1:.N)),
by = name
][, c('days_next', 'days_last'):=NULL] # Removing these variables. Feel free to add back!
# Reorder the variables
test2 <- setcolorder(test2, c('state', 'name', 'date',
'sequence', 'duration',
'cumulative'))
# Print first 15 rows
print(test2[1:15,])
#> state name date sequence duration cumulative
#> 1: Washington Washington.1 2021-01-01 <NA> 1 1
#> 2: Washington Washington.2 2021-01-03 First day 2 1
#> 3: Washington Washington.2 2021-01-04 Last day 2 2
#> 4: Washington Washington.3 2021-01-10 Not linked 1 1
#> 5: Washington Washington.4 2021-01-15 First day 5 1
#> 6: Washington Washington.4 2021-01-16 Ongoing 5 2
#> 7: Washington Washington.4 2021-01-17 Ongoing 5 3
#> 8: Washington Washington.4 2021-01-18 Ongoing 5 4
#> 9: Washington Washington.4 2021-01-19 Last day 5 5
#> 10: Washington Washington.5 2021-01-28 <NA> 1 1
#> 11: Idaho Idaho.1 2021-01-12 <NA> 3 1
#> 12: Idaho Idaho.1 2021-01-13 Ongoing 3 2
#> 13: Idaho Idaho.1 2021-01-14 Last day 3 3
#> 14: Idaho Idaho.2 2021-02-01 Not linked 1 1
#> 15: Idaho Idaho.3 2021-02-03 First day 3 1
Created on 2021-03-16 by the reprex package (v0.3.0)

How to show missing dates in case of application of rolling function

Suppose I have a data df of some insurance policies.
library(tidyverse)
library(lubridate)
#Example data
d <- as.Date("2020-01-01", format = "%Y-%m-%d")
set.seed(50)
df <- data.frame(id = 1:10,
activation_dt = round(runif(10)*100,0) +d,
expiry_dt = d+round(runif(10)*100,0)+c(rep(180,5), rep(240,5)))
> df
id activation_dt expiry_dt
1 1 2020-03-12 2020-08-07
2 2 2020-02-14 2020-07-26
3 3 2020-01-21 2020-09-01
4 4 2020-03-18 2020-07-07
5 5 2020-02-21 2020-07-27
6 6 2020-01-05 2020-11-04
7 7 2020-03-11 2020-11-20
8 8 2020-03-06 2020-10-03
9 9 2020-01-05 2020-09-04
10 10 2020-01-12 2020-09-14
I want to see how many policies were active during each month, which I have done as follows.
# Getting required result
df %>% arrange(activation_dt) %>%
pivot_longer(cols = c(activation_dt, expiry_dt),
names_to = "event",
values_to = "event_date") %>%
mutate(dummy = ifelse(event == "activation_dt", 1, -1)) %>%
mutate(dummy2 = floor_date(event_date, "month")) %>%
arrange(dummy2) %>% group_by(dummy2) %>%
summarise(dummy=sum(dummy)) %>%
mutate(dummy = cumsum(dummy)) %>%
select(dummy2, dummy)
# A tibble: 8 x 2
dummy2 dummy
<date> <dbl>
1 2020-01-01 4
2 2020-02-01 6
3 2020-03-01 10
4 2020-07-01 7
5 2020-08-01 6
6 2020-09-01 3
7 2020-10-01 2
8 2020-11-01 0
My problem now is how to deal with the missing months, e.g. April 2020 to June 2020.
A data.table solution:
1. Generate the sequence of months.
2. Use non-equi joins to find the policies active in each month and count them.
library(lubridate)
library(data.table)
setDT(df)
months <- seq(lubridate::floor_date(min(df$activation_dt),'month'),lubridate::floor_date(max(df$expiry_dt),'month'),by='month') # mindat was undefined; the earliest activation date is used as the start
months <- data.table(months)
df[,c("activation_dt_month","expiry_dt_month"):=.(lubridate::floor_date(activation_dt,'month'),
lubridate::floor_date(expiry_dt,'month'))]
df[months, .(months),on = .(activation_dt_month<=months,expiry_dt_month>=months)][,.(nb=.N),by=months]
months nb
1: 2020-01-01 4
2: 2020-02-01 6
3: 2020-03-01 10
4: 2020-04-01 10
5: 2020-05-01 10
6: 2020-06-01 10
7: 2020-07-01 10
8: 2020-08-01 7
9: 2020-09-01 6
10: 2020-10-01 3
11: 2020-11-01 2
Here is an alternative tidyverse/lubridate solution in case you are interested. The data.table version will be faster, but this should give you the correct results with gaps in months.
First use map2 to create a sequence of months between activation and expiration for each row of data. This will allow you to group by month/year to count number of active policies for each month.
library(tidyverse)
library(lubridate)
df %>%
mutate(month = map2(floor_date(activation_dt, "month"),
floor_date(expiry_dt, "month"),
seq.Date,
by = "month")) %>%
unnest(month) %>%
transmute(month_year = substr(month, 1, 7)) %>%
group_by(month_year) %>%
summarise(count = n())
Output
month_year count
<chr> <int>
1 2020-01 4
2 2020-02 6
3 2020-03 10
4 2020-04 10
5 2020-05 10
6 2020-06 10
7 2020-07 10
8 2020-08 7
9 2020-09 6
10 2020-10 3
11 2020-11 2
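The same month-by-month count can also be sketched in base R without any joins. This is a minimal illustration, not either answer's method: the data frame is rebuilt from the printed df in the question, first_of_month is a hypothetical helper, and a policy counts as active in every month from its activation month through its expiry month.

```r
# Data as printed in the question
df <- data.frame(
  id = 1:10,
  activation_dt = as.Date(c("2020-03-12", "2020-02-14", "2020-01-21", "2020-03-18", "2020-02-21",
                            "2020-01-05", "2020-03-11", "2020-03-06", "2020-01-05", "2020-01-12")),
  expiry_dt     = as.Date(c("2020-08-07", "2020-07-26", "2020-09-01", "2020-07-07", "2020-07-27",
                            "2020-11-04", "2020-11-20", "2020-10-03", "2020-09-04", "2020-09-14"))
)

# Hypothetical helper: first day of the month of each date
first_of_month <- function(x) as.Date(format(x, "%Y-%m-01"))

act_m <- first_of_month(df$activation_dt)
exp_m <- first_of_month(df$expiry_dt)

# Every month in the observed span, including months without any event
months <- seq(min(act_m), max(exp_m), by = "month")

# For each month, count the policies whose active span contains it
active <- vapply(seq_along(months),
                 function(i) sum(act_m <= months[i] & exp_m >= months[i]),
                 integer(1))

result <- data.frame(month = months, active = active)
```

On the example data this gives the same counts as the two outputs above. For large inputs the non-equi join is likely to scale better, since this sketch compares every policy against every month.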

R inserting rows between dates by group based on second column

I have a df that looks like this
ID FINAL_DT START_DT
23 NA 2020-03-20
25 NA 2020-04-10
29 2020-02-02 2020-01-23
30 NA 2020-01-02
What I would like to do is, for each ID, add a row for every month starting from START_DT and ending at whichever comes first, FINAL_DT or the current date. The expected output would be the following:
ID FINAL_DT START_DT ACTIVE_MONTH
23 NA 2020-03-20 2020-03
23 NA NA 2020-04
23 NA NA 2020-05
25 NA 2020-04-10 2020-04
25 NA NA 2020-05
29 2020-02-02 2020-01-23 2020-01
29 2020-02-02 NA 2020-02
30 NA 2020-01-02 2020-01
30 NA NA 2020-02
30 NA NA 2020-03
30 NA NA 2020-04
30 NA NA 2020-05
I have the following code, which works but does not account for FINAL_DT:
current_date = as.Date(Sys.Date())
enroll <- enroll %>%
group_by(ID) %>%
complete(START_DT = seq(START_DT, current_date, by = "month"))
I have tried the following but get an error, I believe due to the NAs:
current_date = as.Date(Sys.Date())
enroll <- enroll %>%
group_by(ID) %>%
complete(START_DT = seq(START_DT, min(FINAL_DT,current_date), by = "month"))
The day of the month also does not matter; I am not sure whether it would be easier to drop it before or after.
Here is another approach. You can use floor_date to get the first day of the month to use in your sequence of months. Then, you can include the full sequence to today's date, and filter based on FINAL_DT. You can use as.yearmon from zoo if you'd like a month/year object for month.
library(zoo)
library(tidyr)
library(dplyr)
library(lubridate)
current_date = as.Date(Sys.Date())
enroll %>%
mutate(ACTIVE_MONTH = floor_date(START_DT, unit = "month")) %>%
group_by(ID) %>%
complete(ACTIVE_MONTH = seq.Date(floor_date(START_DT, unit = "month"), current_date, by = "month")) %>%
filter(ACTIVE_MONTH <= first(FINAL_DT) | is.na(first(FINAL_DT))) %>%
ungroup() %>%
mutate(ACTIVE_MONTH = as.yearmon(ACTIVE_MONTH))
Output
# A tibble: 12 x 4
ID ACTIVE_MONTH FINAL_DT START_DT
<dbl> <yearmon> <date> <date>
1 23 Mar 2020 NA 2020-03-20
2 23 Apr 2020 NA NA
3 23 May 2020 NA NA
4 25 Apr 2020 NA 2020-04-10
5 25 May 2020 NA NA
6 29 Jan 2020 2020-02-02 2020-01-23
7 29 Feb 2020 NA NA
8 30 Jan 2020 NA 2020-01-02
9 30 Feb 2020 NA NA
10 30 Mar 2020 NA NA
11 30 Apr 2020 NA NA
12 30 May 2020 NA NA
Here is an approach that returns rows for each MONTH with the help of lubridate.
library(dplyr)
library(tidyr)
library(lubridate)
current_date = as.Date(Sys.Date())
enroll %>%
mutate(MONTH = month(START_DT)) %>%
group_by(ID) %>%
complete(MONTH = seq(MONTH, min(month(FINAL_DT)[!is.na(FINAL_DT)],month(current_date))))
# A tibble: 12 x 4
# Groups: ID [4]
# ID MONTH FINAL_DT START_DT
# <int> <dbl> <fct> <fct>
# 1 23 3 NA 2020-03-20
# 2 23 4 NA NA
# 3 23 5 NA NA
# 4 25 4 NA 2020-04-10
# 5 25 5 NA NA
# 6 29 1 2020-02-02 2020-01-23
# 7 29 2 NA NA
# 8 30 1 NA 2020-01-02
# 9 30 2 NA NA
#10 30 3 NA NA
#11 30 4 NA NA
#12 30 5 NA NA
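The attempt in the question most likely fails because min(FINAL_DT, current_date) is NA whenever FINAL_DT is NA, and seq() cannot use NA as an end point. A minimal base R sketch of the end-point logic, using na.rm = TRUE, is shown below. The functions first_of_month and active_months are hypothetical helpers written for this illustration, and a fixed "today" is used so the result is reproducible.

```r
# Hypothetical helper: first day of the month of a date
first_of_month <- function(x) as.Date(format(x, "%Y-%m-01"))

# Hypothetical helper: months from start_dt up to final_dt or today, whichever comes first
active_months <- function(start_dt, final_dt, today = Sys.Date()) {
  last_day <- min(final_dt, today, na.rm = TRUE)  # ignores a missing FINAL_DT
  format(seq(first_of_month(start_dt), last_day, by = "month"), "%Y-%m")
}

today <- as.Date("2020-05-15")  # fixed date, assumed for the example

m23 <- active_months(as.Date("2020-03-20"), as.Date(NA), today)            # ID 23: no FINAL_DT
m29 <- active_months(as.Date("2020-01-23"), as.Date("2020-02-02"), today)  # ID 29: FINAL_DT set
```

With these inputs, m23 covers 2020-03 to 2020-05 and m29 covers 2020-01 and 2020-02, matching the expected output in the question for those two IDs.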
