Combining rows based on value, creating new columns as needed - r

I have a dataframe like this:
data <- data.frame(Site= c("a","a","a","b","b","c","c","c"),
Start=c("2017-11-29","2018-09-24","2018-05-01","2018-09-23","2019-10-06","2020-09-07","2018-09-17","2019-10-08"),
End=c("2018-09-26","2019-09-11","2018-09-23","2019-06-28","2020-09-07","2021-08-26","2019-10-08","2020-09-02"))
Site Start End
1 a 2017-11-29 2018-09-26
2 a 2018-09-24 2019-09-11
3 a 2018-05-01 2018-09-23
4 b 2018-09-23 2019-06-28
5 b 2019-10-06 2020-09-07
6 c 2020-09-07 2021-08-26
7 c 2018-09-17 2019-10-08
8 c 2019-10-08 2020-09-02
I would like to combine rows with similar Sites, to look like this:
Site Start End Start2 End2 Start3 End3
1 a 2017-11-29 2018-09-26 2018-09-24 2019-09-11 2018-05-01 2018-09-23
2 b 2018-09-23 2019-06-28 2019-10-06 2020-09-07 NA NA
3 c 2020-09-07 2021-08-26 2018-09-17 2019-10-08 2019-10-08 2020-09-02
Thanks!

data <- data.frame(Site= c("a","a","a","b","b","c","c","c"),
Start=c("2017-11-29","2018-09-24","2018-05-01","2018-09-23","2019-10-06","2020-09-07","2018-09-17","2019-10-08"),
End=c("2018-09-26","2019-09-11","2018-09-23","2019-06-28","2020-09-07","2021-08-26","2019-10-08","2020-09-02"))
library(tidyr)
library(dplyr)
data %>% group_by(Site) %>% mutate(id = 1:n()) %>%
pivot_wider(id_cols = Site, names_from = id, values_from = c(Start, End) )
#> # A tibble: 3 × 7
#> # Groups: Site [3]
#> Site Start_1 Start_2 Start_3 End_1 End_2 End_3
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a 2017-11-29 2018-09-24 2018-05-01 2018-09-26 2019-09-11 2018-09-23
#> 2 b 2018-09-23 2019-10-06 <NA> 2019-06-28 2020-09-07 <NA>
#> 3 c 2020-09-07 2018-09-17 2019-10-08 2021-08-26 2019-10-08 2020-09-02
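One caveat with this output: the pivoted columns are still character, because the input columns were character strings. If genuine Date columns are needed, an across() step after the pivot converts them all at once (a sketch reusing the example data; wide is just an illustrative name):

```r
library(tidyr)
library(dplyr)

data <- data.frame(Site = c("a","a","a","b","b","c","c","c"),
                   Start = c("2017-11-29","2018-09-24","2018-05-01","2018-09-23",
                             "2019-10-06","2020-09-07","2018-09-17","2019-10-08"),
                   End = c("2018-09-26","2019-09-11","2018-09-23","2019-06-28",
                           "2020-09-07","2021-08-26","2019-10-08","2020-09-02"))

wide <- data %>%
  group_by(Site) %>%
  mutate(id = 1:n()) %>%
  pivot_wider(id_cols = Site, names_from = id, values_from = c(Start, End)) %>%
  ungroup() %>%
  # every pivoted column except Site is character; parse them all as dates
  mutate(across(-Site, as.Date))
```

The NAs for site b survive the conversion as NA dates, so no special handling is needed.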

Related

Rolling Window based on a fixed time interval

I'm trying to compute a rolling count over a fixed time window. Suppose the window is 48 hours. For each observation, I would like every data point that falls between the date of that observation and 48 hours before it. For example, if the datetime of the current observation is 05-07-2022 14:15:28, I would like, for that position, a count of every occurrence between that date and 03-07-2022 14:15:28. Seconds are not fundamental to the analysis.
library(tidyverse)
library(lubridate)
df = tibble(id = 1:7,
date_time = ymd_hm('2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
'2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'))
# A tibble: 7 × 2
id date_time
<int> <dttm>
1 1 2022-05-07 15:00:00
2 2 2022-05-09 13:45:00
3 3 2022-05-09 13:51:00
4 4 2022-05-09 17:00:00
5 5 2022-05-10 15:25:00
6 6 2022-05-10 17:18:00
7 7 2022-05-11 14:00:00
With the example window of 48 hours, that would yield:
# A tibble: 7 × 4
id date_time lag_48hours count
<int> <dttm> <dttm> <dbl>
1 1 2022-05-07 15:00:00 2022-05-05 15:00:00 1
2 2 2022-05-09 13:45:00 2022-05-07 13:45:00 2
3 3 2022-05-09 13:51:00 2022-05-07 13:51:00 3
4 4 2022-05-09 17:00:00 2022-05-07 17:00:00 3
5 5 2022-05-10 15:25:00 2022-05-08 15:25:00 4
6 6 2022-05-10 17:18:00 2022-05-08 17:18:00 5
7 7 2022-05-11 14:00:00 2022-05-09 14:00:00 4
I added the lag column for illustration purposes. Any idea how to obtain the count column? I need to be able to adjust the window (48 hours in this example).
I'd encourage you to use slider, which allows you to do rolling window analysis using an irregular index.
library(tidyverse)
library(lubridate)
library(slider)
df = tibble(
id = 1:7,
date_time = ymd_hm(
'2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
'2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'
)
)
df %>%
mutate(
count = slide_index_int(
.x = id,
.i = date_time,
.f = length,
.before = dhours(48)
)
)
#> # A tibble: 7 × 3
#> id date_time count
#> <int> <dttm> <int>
#> 1 1 2022-05-07 15:00:00 1
#> 2 2 2022-05-09 13:45:00 2
#> 3 3 2022-05-09 13:51:00 3
#> 4 4 2022-05-09 17:00:00 3
#> 5 5 2022-05-10 15:25:00 4
#> 6 6 2022-05-10 17:18:00 5
#> 7 7 2022-05-11 14:00:00 4
How about this...
df %>%
mutate(count48 = map_int(date_time,
~sum(date_time <= . & date_time > . - 48 * 60 * 60)))
# A tibble: 7 × 3
id date_time count48
<int> <dttm> <int>
1 1 2022-05-07 15:00:00 1
2 2 2022-05-09 13:45:00 2
3 3 2022-05-09 13:51:00 3
4 4 2022-05-09 17:00:00 3
5 5 2022-05-10 15:25:00 4
6 6 2022-05-10 17:18:00 5
7 7 2022-05-11 14:00:00 4
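Either answer generalizes to an adjustable window by wrapping it in a small function. A sketch using slider (count_window is a hypothetical helper name; it assumes the id/date_time column layout above):

```r
library(dplyr)
library(lubridate)
library(slider)

# Hypothetical helper: trailing count over a window of `hours` hours,
# ending at (and including) each observation.
count_window <- function(df, hours) {
  df %>%
    mutate(count = slide_index_int(
      .x = id,
      .i = date_time,
      .f = length,
      .before = dhours(hours)
    ))
}

df <- tibble(id = 1:7,
             date_time = ymd_hm('2022-05-07 15:00', '2022-05-09 13:45',
                                '2022-05-09 13:51', '2022-05-09 17:00',
                                '2022-05-10 15:25', '2022-05-10 17:18',
                                '2022-05-11 14:00'))

count_window(df, 48)$count
#> [1] 1 2 3 3 4 5 4
```

Changing the window is then a single argument, e.g. count_window(df, 24).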

R: Create new variable based on date in other variable

I have a data frame that looks somewhat like this:
a = c(seq(as.Date("2020-08-01"), as.Date("2020-11-01"), by="months"), seq(as.Date("2021-08-01"), as.Date("2021-11-01"), by="months"),
seq(as.Date("2022-08-01"), as.Date("2022-11-01"), by="months"))
b = rep(LETTERS[1:3], each = 4)
df = tibble(ID = b, Date = a)
> df
ID Date
<chr> <date>
1 A 2020-08-01
2 A 2020-09-01
3 A 2020-10-01
4 A 2020-11-01
5 B 2021-08-01
6 B 2021-09-01
7 B 2021-10-01
8 B 2021-11-01
9 C 2022-08-01
10 C 2022-09-01
11 C 2022-10-01
12 C 2022-11-01
And I want to create a new variable that replaces Date with the smallest value in Date for each ID, the resulting data frame should look like this:
c = c(rep(as.Date("2020-08-01"), each = 4), rep(as.Date("2021-08-01"), each = 4), rep(as.Date("2022-08-01"), each = 4))
df$NewDate = c
> df
# A tibble: 12 × 3
ID Date NewDate
<chr> <date> <date>
1 A 2020-08-01 2020-08-01
2 A 2020-09-01 2020-08-01
3 A 2020-10-01 2020-08-01
4 A 2020-11-01 2020-08-01
5 B 2021-08-01 2021-08-01
6 B 2021-09-01 2021-08-01
7 B 2021-10-01 2021-08-01
8 B 2021-11-01 2021-08-01
9 C 2022-08-01 2022-08-01
10 C 2022-09-01 2022-08-01
11 C 2022-10-01 2022-08-01
12 C 2022-11-01 2022-08-01
Can someone please help me do it? Thank you very much in advance.
First group, then mutate with min():
library(dplyr)
df %>%
group_by(ID) %>%
mutate(NewDate = min(Date)) %>%
ungroup()
#> # A tibble: 12 × 3
#> ID Date NewDate
#> <chr> <date> <date>
#> 1 A 2020-08-01 2020-08-01
#> 2 A 2020-09-01 2020-08-01
#> 3 A 2020-10-01 2020-08-01
#> 4 A 2020-11-01 2020-08-01
#> 5 B 2021-08-01 2021-08-01
#> 6 B 2021-09-01 2021-08-01
#> 7 B 2021-10-01 2021-08-01
#> 8 B 2021-11-01 2021-08-01
#> 9 C 2022-08-01 2022-08-01
#> 10 C 2022-09-01 2022-08-01
#> 11 C 2022-10-01 2022-08-01
#> 12 C 2022-11-01 2022-08-01
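For completeness, the same result is available in base R with ave(), which applies a function within groups and recycles the result back over each group (a sketch on the df built above; note that ave() preserves the Date class here):

```r
# Rebuild the example data
a <- c(seq(as.Date("2020-08-01"), as.Date("2020-11-01"), by = "months"),
       seq(as.Date("2021-08-01"), as.Date("2021-11-01"), by = "months"),
       seq(as.Date("2022-08-01"), as.Date("2022-11-01"), by = "months"))
df <- data.frame(ID = rep(LETTERS[1:3], each = 4), Date = a)

# minimum Date per ID, recycled across each group
df$NewDate <- ave(df$Date, df$ID, FUN = min)
```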

Create new column based on cummulative/rolling values in grouping column

Edit: Unfortunately, I simplified my needs and data too much. I will update the question below.
I have a df similar to the one in the code below. I need to create a new column called first_fund_date that is equal to the value of fund.date where sigma == 0, carried forward until the next time sigma == 0. In the example df below, first_fund_date should be a vector whose first observation equals "2019/05/22", whose following 3 observations equal "2020/09/05", and whose final 4 equal "2019/11/30".
set.seed(111)
df <- data.frame(id = c(1,1,3,4,5,6,2,7),
fund.date = sample(seq(as.Date('2018/01/01'),
as.Date('2021/01/01'), by="day"), 8),
sigma = c(0,0,1,2,0,1,2,3))
%>% mutate(first_fund_date = ??? )
I also need to create a column called last_funding_date that is equal to fund.date at the rolling maximum of sigma. The first 4 observations should be "2020/03/03" and the last 4 should be "2020/12/04".
library(dplyr)
df %>%
mutate(first_fund_date = fund.date[sigma==0],
last_funding_date = fund.date[sigma==max(sigma)])
id fund.date sigma first_fund_date last_funding_date
1 1 2019-05-22 1 2020-09-05 2018-03-10
2 2 2020-09-05 0 2020-09-05 2018-03-10
3 3 2018-06-24 1 2020-09-05 2018-03-10
4 4 2020-03-03 2 2020-09-05 2018-03-10
5 5 2019-11-30 3 2020-09-05 2018-03-10
6 6 2018-03-10 4 2020-09-05 2018-03-10
The key here is to create index variables to group_by with cumsum(sigma == 0) and cumsum(sigma < lag(sigma)).
library(dplyr)
df %>%
group_by(index = cumsum(sigma==0))%>%
mutate(first_fund.date = first(fund.date))%>%
group_by(index_2 = cumsum(sigma < lag(sigma, default = Inf)))%>%
mutate(last_fund.date = last(fund.date))%>%
ungroup()%>%
select(-contains('index'))
# A tibble: 8 × 5
id fund.date sigma first_fund.date last_fund.date
<dbl> <date> <dbl> <date> <date>
1 1 2019-05-22 0 2019-05-22 2020-03-03
2 1 2020-09-05 0 2020-09-05 2020-03-03
3 3 2018-06-24 1 2020-09-05 2020-03-03
4 4 2020-03-03 2 2020-09-05 2020-03-03
5 5 2019-11-30 0 2019-11-30 2020-12-04
6 6 2018-03-10 1 2019-11-30 2020-12-04
7 2 2018-11-01 2 2019-11-30 2020-12-04
8 7 2020-12-04 3 2019-11-30 2020-12-04
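To see why these groupings work, it helps to inspect the two helper indices on the sigma vector alone (a minimal sketch; the commented results were checked against this vector):

```r
library(dplyr)  # for lag()

sigma <- c(0, 0, 1, 2, 0, 1, 2, 3)

# every 0 opens a new run that shares one first_fund.date
cumsum(sigma == 0)
#> [1] 1 2 2 2 3 3 3 3

# every drop in sigma opens a new run that shares one last_fund.date
cumsum(sigma < lag(sigma, default = Inf))
#> [1] 1 1 1 1 2 2 2 2
```

The default = Inf makes the first comparison TRUE, so the first row always starts group 1.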

How to deduplicate date sequences across non-consecutive rows in R?

I want to flag the first date in every window of at least 31 days for each id in my data.
Data:
library(tidyverse)
library(lubridate)
library(tibbletime)
D1 <- tibble(id = c(12,12,12,12,12,12,10,10,10,10),
index_date=c("2019-01-01","2019-01-07","2019-01-21","2019-02-02",
"2019-02-09","2019-03-06","2019-01-05","2019-02-01","2019-02-02","2019-02-08"))
D1
# A tibble: 10 x 2
id index_date
<dbl> <chr>
1 12 2019-01-01
2 12 2019-01-07
3 12 2019-01-21
4 12 2019-02-02
5 12 2019-02-09
6 12 2019-03-06
7 10 2019-01-05
8 10 2019-02-01
9 10 2019-02-02
10 10 2019-02-08
The desired rows to flag are rows 1, 4, 6, 7, and 10; these rows represent either the first index_date for a given id or the first index_date after a 31-day skip period from the previously flagged index_date for that given id.
Code:
temp <- D1 %>%
mutate(index_date = ymd(index_date)) %>%
arrange(id, index_date) %>%
as_tbl_time(index_date) %>%
group_by(id) %>%
mutate(keyed_to_index_date =
collapse_index(index_date, period = '31 d', side = "start"),
keep = index_date == keyed_to_index_date)
temp %>% arrange(desc(id), index_date)
Result:
id index_date keyed_to_index_date keep
<dbl> <date> <date> <lgl>
1 12 2019-01-01 2019-01-01 TRUE
2 12 2019-01-07 2019-01-01 FALSE
3 12 2019-01-21 2019-01-01 FALSE
4 12 2019-02-02 2019-02-02 TRUE
5 12 2019-02-09 2019-02-02 FALSE
6 12 2019-03-06 2019-03-06 TRUE
7 10 2019-01-05 2019-01-05 TRUE
8 10 2019-02-01 2019-02-01 TRUE
9 10 2019-02-02 2019-02-01 FALSE
10 10 2019-02-08 2019-02-01 FALSE
Why does this code flag row 8 (which has an index_date less than 31 days after the previously flagged index_date for that id) and not row 10, and how do I fix this problem?
UPDATE: Adding the option start_date = first(index_date) to collapse_index(), as suggested by @mnaR99, successfully flagged the correct rows in the original example. However, when I applied the same principle to new data, I ran into a problem:
Data:
D2 <- tibble(id = c("A","A","A","B","B","B","B","B","C","C","C"),
index_date = c("2019-03-04","2019-03-05","2019-03-06",
"2019-03-01","2019-03-02","2019-03-04","2019-03-05","2019-03-06",
"2019-03-03","2019-03-04","2019-03-05"))
D2
id index_date
<chr> <chr>
1 A 2019-03-04
2 A 2019-03-05
3 A 2019-03-06
4 B 2019-03-01
5 B 2019-03-02
6 B 2019-03-04
7 B 2019-03-05
8 B 2019-03-06
9 C 2019-03-03
10 C 2019-03-04
11 C 2019-03-05
I now want to apply a 2-day window in the same manner as I previously applied a 31-day window (that is, consecutive calendar days should not both be flagged). The desired rows to flag are rows 1, 3, 4, 6, 8, 9, and 11, because these rows are either the first index_date for a particular id or the first after a two-day skip.
Code:
t3 <- D2 %>%
mutate(index_date = ymd(index_date)) %>%
arrange(id, index_date) %>%
as_tbl_time(index_date) %>%
group_by(id) %>%
mutate(keyed_to_index_date =
collapse_index(index_date,
period = '2 d',
side = "start",
start_date = first(index_date)),
keep = index_date == keyed_to_index_date) %>%
arrange(id, index_date)
Result:
> t3
# A time tibble: 11 x 4
# Index: index_date
# Groups: id [3]
id index_date keyed_to_index_date keep
<chr> <date> <date> <lgl>
1 A 2019-03-04 2019-03-04 TRUE
2 A 2019-03-05 2019-03-04 FALSE
3 A 2019-03-06 2019-03-06 TRUE
4 B 2019-03-01 2019-03-01 TRUE
5 B 2019-03-02 2019-03-01 FALSE
6 B 2019-03-04 2019-03-04 TRUE
7 B 2019-03-05 2019-03-05 TRUE
8 B 2019-03-06 2019-03-05 FALSE
9 C 2019-03-03 2019-03-03 TRUE
10 C 2019-03-04 2019-03-03 FALSE
11 C 2019-03-05 2019-03-05 TRUE
Row 7 is incorrectly flagged as TRUE, and Row 8 is incorrectly flagged as FALSE.
When I apply the purrr solution suggested by @tmfmnk, I get the correct result.
Code:
t4 <-
D2 %>%
group_by(id) %>%
mutate(index_date = ymd(index_date),
keep = row_number() == 1 |
accumulate(c(0, diff(index_date)), ~ if_else(.x >= 2,
.y,
.x + .y)
) >= 2
)
Result:
> t4
# A tibble: 11 x 3
# Groups: id [3]
id index_date keep
<chr> <date> <lgl>
1 A 2019-03-04 TRUE
2 A 2019-03-05 FALSE
3 A 2019-03-06 TRUE
4 B 2019-03-01 TRUE
5 B 2019-03-02 FALSE
6 B 2019-03-04 TRUE
7 B 2019-03-05 FALSE
8 B 2019-03-06 TRUE
9 C 2019-03-03 TRUE
10 C 2019-03-04 FALSE
11 C 2019-03-05 TRUE
What is wrong with the tibbletime approach in this example?
One option utilizing dplyr, lubridate and purrr could be:
D1 %>%
group_by(id) %>%
mutate(index_date = ymd(index_date),
keep = row_number() == 1 | accumulate(c(0, diff(index_date)), ~ if_else(.x >= 31, .y, .x + .y)) >= 31)
id index_date keep
<dbl> <date> <lgl>
1 12 2019-01-01 TRUE
2 12 2019-01-07 FALSE
3 12 2019-01-21 FALSE
4 12 2019-02-02 TRUE
5 12 2019-02-09 FALSE
6 12 2019-03-06 TRUE
7 10 2019-01-05 TRUE
8 10 2019-02-01 FALSE
9 10 2019-02-02 FALSE
10 10 2019-02-08 TRUE
You just need to add the start_date argument to collapse_index:
D1 %>%
mutate(index_date = ymd(index_date)) %>%
arrange(id, index_date) %>%
as_tbl_time(index_date) %>%
group_by(id) %>%
mutate(keyed_to_index_date =
collapse_index(index_date, period = '31 d', side = "start", start_date = first(index_date)),
keep = index_date == keyed_to_index_date) %>%
arrange(desc(id), index_date)
#> # A time tibble: 10 x 4
#> # Index: index_date
#> # Groups: id [2]
#> id index_date keyed_to_index_date keep
#> <dbl> <date> <date> <lgl>
#> 1 12 2019-01-01 2019-01-01 TRUE
#> 2 12 2019-01-07 2019-01-01 FALSE
#> 3 12 2019-01-21 2019-01-01 FALSE
#> 4 12 2019-02-02 2019-02-02 TRUE
#> 5 12 2019-02-09 2019-02-02 FALSE
#> 6 12 2019-03-06 2019-03-06 TRUE
#> 7 10 2019-01-05 2019-01-05 TRUE
#> 8 10 2019-02-01 2019-01-05 FALSE
#> 9 10 2019-02-02 2019-01-05 FALSE
#> 10 10 2019-02-08 2019-02-08 TRUE
Created on 2020-09-11 by the reprex package (v0.3.0)
You can use accumulate() from purrr.
D1 %>%
group_by(id) %>%
mutate(index_date = ymd(index_date),
keep = index_date == accumulate(index_date, ~ if(.y - .x >= 31) .y else .x))
# id index_date keep
# <dbl> <date> <lgl>
# 1 12 2019-01-01 TRUE
# 2 12 2019-01-07 FALSE
# 3 12 2019-01-21 FALSE
# 4 12 2019-02-02 TRUE
# 5 12 2019-02-09 FALSE
# 6 12 2019-03-06 TRUE
# 7 10 2019-01-05 TRUE
# 8 10 2019-02-01 FALSE
# 9 10 2019-02-02 FALSE
# 10 10 2019-02-08 TRUE
The iteration rule is as follows:
1. 2019-01-07 - 2019-01-01 = 6 < 31 then return 2019-01-01
2. 2019-01-21 - 2019-01-01 = 20 < 31 then return 2019-01-01
3. 2019-02-02 - 2019-01-01 = 32 >= 31 then return (2019-02-02)*
4. 2019-02-09 - (2019-02-02)* = 7 < 31 then return 2019-02-02
5. etc.
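The same rule can be verified on the bare date vector, outside the pipe (a sketch; anchors is just an illustrative name for the carried-forward flagged date):

```r
library(purrr)
library(lubridate)

dates <- ymd(c("2019-01-01", "2019-01-07", "2019-01-21",
               "2019-02-02", "2019-02-09", "2019-03-06"))

# carry the last flagged date forward; restart whenever the gap
# since that date reaches 31 days
anchors <- accumulate(dates, ~ if (.y - .x >= 31) .y else .x)
dates == anchors
#> [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE
```

This reproduces the keep column for id 12 above.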

R's padr package claiming the "datetime variable does not vary" when it does vary

library(tidyverse)
library(lubridate)
library(padr)
df
#> # A tibble: 828 x 5
#> Scar_Id Code Type Value YrMo
#> <chr> <chr> <chr> <date> <date>
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
#> # ... with 818 more rows
I have an R data frame named df, shown above; concentrate on rows 5 and 6. I can usually use the padr package to pad the months in between rows such as 5 and 6. The pad() function adds rows at the interval the user specifies, best shown as the added rows marked "X" below.
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> X 262-179 NA NA NA 2019-09-01
#> X 262-179 NA NA NA 2019-10-01
#> X 262-179 NA NA NA 2019-11-01
#> X 262-179 NA NA NA 2019-12-01
#> X 262-179 NA NA NA 2020-01-01
#> X 262-179 NA NA NA 2020-02-01
#> X 262-179 NA NA NA 2020-03-01
#> X 262-179 NA NA NA 2020-04-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
To get there I usually issue a command such as the one shown below, and it works fine in padr. But it doesn't work in this specific example; instead it yields the warning shown beneath the output.
df %>% pad(group = "Scar_Id", by = "YrMo", interval = "month")
#> # A tibble: 828 x 5
#> Scar_Id Code Type Value YrMo
#> <chr> <chr> <chr> <date> <date>
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
#> # ... with 818 more rows
#> Warning message:
#> datetime variable does not vary for 537 of the groups, no padding applied on this / these group(s)
Why does it claim that "the datetime variable does not vary" for rows 5 and 6, when the datetime does indeed vary? The YrMo value for row 5 is "2019-08-01" and for row 6 is "2020-05-01"; stating the obvious, "2019-08-01" differs from "2020-05-01".
Any ideas what went wrong? I tried to create a reproducible example and could not. The basic examples I created all work as expected (as I describe). Hopefully these clues can help somebody determine what is going on.
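One plausible reading of the warning, worth checking first: pad() pads within each group, and the warning counts groups whose YrMo takes only a single distinct value (for example, a Scar_Id with a single row), which get no padding, even if other groups such as 262-179 do vary. A diagnostic sketch on toy data with the same column layout, since the real df isn't reproducible here (999-179 is an invented single-row group):

```r
library(dplyr)

toy <- tibble(Scar_Id = c("0070-179", "0070-179", "262-179", "262-179", "999-179"),
              YrMo = as.Date(c("2020-04-01", "2020-05-01",
                               "2019-08-01", "2020-05-01",
                               "2020-06-01")))

# groups where the datetime variable "does not vary" -- the ones
# padr would refuse to pad; only 999-179 is returned here
toy %>%
  group_by(Scar_Id) %>%
  summarise(n_yrmo = n_distinct(YrMo)) %>%
  filter(n_yrmo == 1)
```

If the equivalent count on the real df comes to 537, the warning is consistent with the data and the varying groups should still be padded.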
