R: Create new variable based on date in other variable

I have a data frame that looks somewhat like this:
a = c(seq(as.Date("2020-08-01"), as.Date("2020-11-01"), by="months"), seq(as.Date("2021-08-01"), as.Date("2021-11-01"), by="months"),
seq(as.Date("2022-08-01"), as.Date("2022-11-01"), by="months"))
b = rep(LETTERS[1:3], each = 4)
df = tibble::tibble(ID = b, Date = a)  # data_frame() is deprecated in favour of tibble()
> df
ID Date
<chr> <date>
1 A 2020-08-01
2 A 2020-09-01
3 A 2020-10-01
4 A 2020-11-01
5 B 2021-08-01
6 B 2021-09-01
7 B 2021-10-01
8 B 2021-11-01
9 C 2022-08-01
10 C 2022-09-01
11 C 2022-10-01
12 C 2022-11-01
I want to create a new variable that holds the smallest value of Date for each ID. The resulting data frame should look like this:
c = c(rep(as.Date("2020-08-01"), each = 4), rep(as.Date("2021-08-01"), each = 4), rep(as.Date("2022-08-01"), each = 4))
df$NewDate = c
> df
# A tibble: 12 × 3
ID Date NewDate
<chr> <date> <date>
1 A 2020-08-01 2020-08-01
2 A 2020-09-01 2020-08-01
3 A 2020-10-01 2020-08-01
4 A 2020-11-01 2020-08-01
5 B 2021-08-01 2021-08-01
6 B 2021-09-01 2021-08-01
7 B 2021-10-01 2021-08-01
8 B 2021-11-01 2021-08-01
9 C 2022-08-01 2022-08-01
10 C 2022-09-01 2022-08-01
11 C 2022-10-01 2022-08-01
12 C 2022-11-01 2022-08-01
Can someone please help me do it? Thank you very much in advance.

First group by ID, then use mutate() with min():
library(dplyr)
df %>%
group_by(ID) %>%
mutate(NewDate = min(Date)) %>%
ungroup()
#> # A tibble: 12 × 3
#> ID Date NewDate
#> <chr> <date> <date>
#> 1 A 2020-08-01 2020-08-01
#> 2 A 2020-09-01 2020-08-01
#> 3 A 2020-10-01 2020-08-01
#> 4 A 2020-11-01 2020-08-01
#> 5 B 2021-08-01 2021-08-01
#> 6 B 2021-09-01 2021-08-01
#> 7 B 2021-10-01 2021-08-01
#> 8 B 2021-11-01 2021-08-01
#> 9 C 2022-08-01 2022-08-01
#> 10 C 2022-09-01 2022-08-01
#> 11 C 2022-10-01 2022-08-01
#> 12 C 2022-11-01 2022-08-01
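If you would rather not load dplyr, here is a minimal base R sketch of the same idea, assuming the data sit in a plain data.frame rather than a tibble. ave() applies min() within each ID and recycles the result across the rows of that group.

```r
# Rebuild the example data as a plain data.frame
a <- c(
  seq(as.Date("2020-08-01"), as.Date("2020-11-01"), by = "month"),
  seq(as.Date("2021-08-01"), as.Date("2021-11-01"), by = "month"),
  seq(as.Date("2022-08-01"), as.Date("2022-11-01"), by = "month")
)
df <- data.frame(ID = rep(LETTERS[1:3], each = 4), Date = a)

# Earliest Date within each ID, repeated for every row of that ID
df$NewDate <- ave(df$Date, df$ID, FUN = min)
df
```

The grouped mutate() above reads more naturally inside a longer pipeline; ave() is mainly handy for a one-off column.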

Related

Rolling Window based on a fixed time interval

I'm trying to calculate a rolling window over a fixed time interval. Suppose that the interval is 48 hours. I would like to get every data point that falls between the date of the current observation and 48 hours before that observation. For example, if the datetime of the current observation is 05-07-2022 14:15:28, for that position, I would like a count of every occurrence between that date and 03-07-2022 14:15:28. Seconds are not fundamental to the analysis.
library(tidyverse)
library(lubridate)
df = tibble(id = 1:7,
date_time = ymd_hm('2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
'2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'))
# A tibble: 7 × 2
id date_time
<int> <dttm>
1 1 2022-05-07 15:00:00
2 2 2022-05-09 13:45:00
3 3 2022-05-09 13:51:00
4 4 2022-05-09 17:00:00
5 5 2022-05-10 15:25:00
6 6 2022-05-10 17:18:00
7 7 2022-05-11 14:00:00
With the example window of 48 hours, that would yield:
# A tibble: 7 × 4
id date_time lag_48hours count
<int> <dttm> <dttm> <dbl>
1 1 2022-05-07 15:00:00 2022-05-05 15:00:00 1
2 2 2022-05-09 13:45:00 2022-05-07 13:45:00 2
3 3 2022-05-09 13:51:00 2022-05-07 13:51:00 3
4 4 2022-05-09 17:00:00 2022-05-07 17:00:00 3
5 5 2022-05-10 15:25:00 2022-05-08 15:25:00 4
6 6 2022-05-10 17:18:00 2022-05-08 17:18:00 5
7 7 2022-05-11 14:00:00 2022-05-09 14:00:00 4
I added the lag column for illustration purposes. Any idea how to obtain the count column? I need to be able to adjust the window (48 hours in this example).
I'd encourage you to use slider, which allows you to do rolling window analysis using an irregular index.
library(tidyverse)
library(lubridate)
library(slider)
df = tibble(
id = 1:7,
date_time = ymd_hm(
'2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
'2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'
)
)
df %>%
mutate(
count = slide_index_int(
.x = id,
.i = date_time,
.f = length,
.before = dhours(48)
)
)
#> # A tibble: 7 × 3
#> id date_time count
#> <int> <dttm> <int>
#> 1 1 2022-05-07 15:00:00 1
#> 2 2 2022-05-09 13:45:00 2
#> 3 3 2022-05-09 13:51:00 3
#> 4 4 2022-05-09 17:00:00 3
#> 5 5 2022-05-10 15:25:00 4
#> 6 6 2022-05-10 17:18:00 5
#> 7 7 2022-05-11 14:00:00 4
How about this...
df %>%
mutate(count48 = map_int(date_time,
~sum(date_time <= . & date_time > . - 48 * 60 * 60)))
# A tibble: 7 × 3
id date_time count48
<int> <dttm> <int>
1 1 2022-05-07 15:00:00 1
2 2 2022-05-09 13:45:00 2
3 3 2022-05-09 13:51:00 3
4 4 2022-05-09 17:00:00 3
5 5 2022-05-10 15:25:00 4
6 6 2022-05-10 17:18:00 5
7 7 2022-05-11 14:00:00 4
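Since the question asks for an adjustable window, the counting logic behind both answers can be wrapped in a small function. This is only a base R sketch: count_in_window() and its hours argument are names made up here, and the window start is treated as inclusive (the map_int() version above uses a strict > instead).

```r
# Hypothetical helper: for each time, count observations in [time - hours, time]
count_in_window <- function(times, hours = 48) {
  window_secs <- hours * 3600
  vapply(seq_along(times), function(i) {
    sum(times <= times[i] & times >= times[i] - window_secs)
  }, integer(1))
}

date_time <- as.POSIXct(
  c("2022-05-07 15:00:00", "2022-05-09 13:45:00", "2022-05-09 13:51:00",
    "2022-05-09 17:00:00", "2022-05-10 15:25:00", "2022-05-10 17:18:00",
    "2022-05-11 14:00:00"),
  tz = "UTC"
)

count_in_window(date_time, hours = 48)
count_in_window(date_time, hours = 24)  # same data, narrower window
```

This compares every observation with every other one, so it is quadratic in the number of rows; for large data the slider approach is likely the better choice.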

Combining rows based on value, creating new columns as needed

I have a dataframe like this:
data <- data.frame(Site= c("a","a","a","b","b","c","c","c"),
Start=c("2017-11-29","2018-09-24","2018-05-01","2018-09-23","2019-10-06","2020-09-07","2018-09-17","2019-10-08"),
End=c("2018-09-26","2019-09-11","2018-09-23","2019-06-28","2020-09-07","2021-08-26","2019-10-08","2020-09-02"))
Site Start End
1 a 2017-11-29 2018-09-26
2 a 2018-09-24 2019-09-11
3 a 2018-05-01 2018-09-23
4 b 2018-09-23 2019-06-28
5 b 2019-10-06 2020-09-07
6 c 2020-09-07 2021-08-26
7 c 2018-09-17 2019-10-08
8 c 2019-10-08 2020-09-02
I would like to combine rows with similar Sites, to look like this:
Site Start End Start2 End2 Start3 End3
1 a 2017-11-29 2018-09-26 2018-09-24 2019-09-11 2018-05-01 2018-09-23
2 b 2018-09-23 2019-06-28 2019-10-06 2020-09-07 NA NA
3 c 2020-09-07 2021-08-26 2018-09-17 2019-10-08 2019-10-08 2020-09-02
Thanks!
data <- data.frame(Site= c("a","a","a","b","b","c","c","c"),
Start=c("2017-11-29","2018-09-24","2018-05-01","2018-09-23","2019-10-06","2020-09-07","2018-09-17","2019-10-08"),
End=c("2018-09-26","2019-09-11","2018-09-23","2019-06-28","2020-09-07","2021-08-26","2019-10-08","2020-09-02"))
library(tidyr)
library(dplyr)
data %>% group_by(Site) %>% mutate(id = 1:n()) %>%
pivot_wider(id_cols = Site, names_from = id, values_from = c(Start, End) )
#> # A tibble: 3 × 7
#> # Groups: Site [3]
#> Site Start_1 Start_2 Start_3 End_1 End_2 End_3
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a 2017-11-29 2018-09-24 2018-05-01 2018-09-26 2019-09-11 2018-09-23
#> 2 b 2018-09-23 2019-10-06 <NA> 2019-06-28 2020-09-07 <NA>
#> 3 c 2020-09-07 2018-09-17 2019-10-08 2021-08-26 2019-10-08 2020-09-02
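If the interleaved column order from the question (a Start next to its End) matters, base R's reshape() is one way to get it. This is a sketch rather than a drop-in replacement for the answer above; the helper column id is introduced here only to number the rows within each Site.

```r
data <- data.frame(
  Site  = c("a", "a", "a", "b", "b", "c", "c", "c"),
  Start = c("2017-11-29", "2018-09-24", "2018-05-01", "2018-09-23",
            "2019-10-06", "2020-09-07", "2018-09-17", "2019-10-08"),
  End   = c("2018-09-26", "2019-09-11", "2018-09-23", "2019-06-28",
            "2020-09-07", "2021-08-26", "2019-10-08", "2020-09-02")
)

# Running index within each Site: 1, 2, 3, ...
data$id <- ave(seq_along(data$Site), data$Site, FUN = seq_along)

# Long to wide: one row per Site, one Start/End pair per index
wide <- reshape(data, idvar = "Site", timevar = "id",
                direction = "wide", sep = "")
wide
```

Sites with fewer rows than the maximum (here b) get NA in the unused columns.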

Combine data by several condition in R

I want to merge two data sets according to two conditions:
by the same ID (only IDs in the first data set are retained)
if date_mid (from dat2) is between date_begin and date_end (both from dat1), paste the result (from dat2); if not, record it as NA
Also, I want to drop rows if the ID in the combined data already has a result (either healthy or sick). In the example below I want to drop the 3rd and 12th rows.
First data (dat1):
dat1 <- tibble(ID = c(paste0(rep("A"), 1:10), "A2", "A10"),
date_begin = seq(as.Date("2020/1/1"), by = "month", length.out = 12),
date_end = date_begin + 365)
dat1
# A tibble: 12 x 3
ID date_begin date_end
<chr> <date> <date>
1 A1 2020-01-01 2020-12-31
2 A2 2020-02-01 2021-01-31
3 A3 2020-03-01 2021-03-01
4 A4 2020-04-01 2021-04-01
5 A5 2020-05-01 2021-05-01
6 A6 2020-06-01 2021-06-01
7 A7 2020-07-01 2021-07-01
8 A8 2020-08-01 2021-08-01
9 A9 2020-09-01 2021-09-01
10 A10 2020-10-01 2021-10-01
11 A2 2020-11-01 2021-11-01
12 A10 2020-12-01 2021-12-01
Second data (dat2):
dat2 <- tibble(ID = c(paste0(rep("A"), 1:4), paste0(rep("A"), 9:15), "A2"),
date_mid = seq(as.Date("2020/1/1"), by = "month", length.out = 12) + 100,
result = rep(c("healthy", "sick"), length = 12))
dat2
# A tibble: 12 x 3
ID date_mid result
<chr> <date> <chr>
1 A1 2020-04-10 healthy
2 A2 2020-05-11 sick
3 A3 2020-06-09 healthy
4 A4 2020-07-10 sick
5 A9 2020-08-09 healthy
6 A10 2020-09-09 sick
7 A11 2020-10-09 healthy
8 A12 2020-11-09 sick
9 A13 2020-12-10 healthy
10 A14 2021-01-09 sick
11 A15 2021-02-09 healthy
12 A2 2021-03-11 sick
I have tried left_join as below:
left_join(dat1, dat2, by = "ID") %>%
mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA))
# A tibble: 14 x 5
ID date_begin date_end date_mid result
<chr> <date> <date> <date> <chr>
1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
2 A2 2020-02-01 2021-01-31 2020-05-11 sick
3 A2 2020-02-01 2021-01-31 2021-03-11 NA
4 A3 2020-03-01 2021-03-01 2020-06-09 healthy
5 A4 2020-04-01 2021-04-01 2020-07-10 sick
6 A5 2020-05-01 2021-05-01 NA NA
7 A6 2020-06-01 2021-06-01 NA NA
8 A7 2020-07-01 2021-07-01 NA NA
9 A8 2020-08-01 2021-08-01 NA NA
10 A9 2020-09-01 2021-09-01 2020-08-09 NA
11 A10 2020-10-01 2021-10-01 2020-09-09 NA
12 A2 2020-11-01 2021-11-01 2020-05-11 NA
13 A2 2020-11-01 2021-11-01 2021-03-11 sick
14 A10 2020-12-01 2021-12-01 2020-09-09 NA
As I mentioned, I want to drop the 3rd and 12th rows of ID A2, since A2 already has a result of either healthy or sick in the 2nd and 13th rows.
The exact result that I want is something like this (only 2 rows of A2):
# A tibble: 12 x 5
ID date_begin date_end date_mid result
<chr> <date> <date> <date> <chr>
1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
2 A2 2020-02-01 2021-01-31 2020-05-11 sick
3 A3 2020-03-01 2021-03-01 2020-06-09 healthy
4 A4 2020-04-01 2021-04-01 2020-07-10 sick
5 A5 2020-05-01 2021-05-01 NA NA
6 A6 2020-06-01 2021-06-01 NA NA
7 A7 2020-07-01 2021-07-01 NA NA
8 A8 2020-08-01 2021-08-01 NA NA
9 A9 2020-09-01 2021-09-01 2020-08-09 NA
10 A10 2020-10-01 2021-10-01 2020-09-09 NA
11 A2 2020-11-01 2021-11-01 2021-03-11 sick
12 A10 2020-12-01 2021-12-01 2020-09-09 NA
Any pointer is appreciated, thanks.
If there is more than one row for an ID in the result after joining, keep only the non-NA rows. This can be written in dplyr as:
library(dplyr)
library(lubridate)
left_join(dat1, dat2, by = "ID") %>%
mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA)) %>%
group_by(ID, date_begin, date_end) %>%
filter(if(n() > 1) !is.na(result) else TRUE) %>%
ungroup()
# ID date_begin date_end date_mid result
# <chr> <date> <date> <date> <chr>
# 1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
# 2 A2 2020-02-01 2021-01-31 2020-05-11 sick
# 3 A3 2020-03-01 2021-03-01 2020-06-09 healthy
# 4 A4 2020-04-01 2021-04-01 2020-07-10 sick
# 5 A5 2020-05-01 2021-05-01 NA NA
# 6 A6 2020-06-01 2021-06-01 NA NA
# 7 A7 2020-07-01 2021-07-01 NA NA
# 8 A8 2020-08-01 2021-08-01 NA NA
# 9 A9 2020-09-01 2021-09-01 2020-08-09 NA
#10 A10 2020-10-01 2021-10-01 2020-09-09 NA
#11 A2 2020-11-01 2021-11-01 2021-03-11 sick
#12 A10 2020-12-01 2021-12-01 2020-09-09 NA
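For comparison, the same join-then-filter logic can be sketched in base R with merge() and ave(). The variable names joined, in_range, key and n_in_group are introduced here for illustration, and merge() sorts by ID, so the row order differs from the dplyr output.

```r
begin <- seq(as.Date("2020-01-01"), by = "month", length.out = 12)
dat1 <- data.frame(ID = c(paste0("A", 1:10), "A2", "A10"),
                   date_begin = begin,
                   date_end = begin + 365)
dat2 <- data.frame(ID = c(paste0("A", 1:4), paste0("A", 9:15), "A2"),
                   date_mid = begin + 100,
                   result = rep(c("healthy", "sick"), length.out = 12))

# Left join on ID (every dat1 row is kept)
joined <- merge(dat1, dat2, by = "ID", all.x = TRUE)

# Blank out results whose date_mid falls outside [date_begin, date_end]
in_range <- !is.na(joined$date_mid) &
  joined$date_mid >= joined$date_begin &
  joined$date_mid <= joined$date_end
joined$result[!in_range] <- NA

# Within each original dat1 row, keep the non-NA result if there are duplicates
key <- paste(joined$ID, joined$date_begin)
n_in_group <- ave(seq_along(key), key, FUN = length)
final <- joined[n_in_group == 1 | !is.na(joined$result), ]
final
```

One thing to keep in mind with either version: a group that has several joined rows and no match at all would be dropped entirely, which does not happen in this example but may in other data.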

r - filter by date with group by condition

In R, using dplyr, I want to keep only the rows whose date is on or after a cutoff date that differs for each group.
Below gives me the results, but I am wondering if there is a more elegant way to get the same thing. Is it possible to filter without using mutate?
max_dates <- data.frame(col_1 = c('a', 'b', 'c'), max_date = c('2021-08-23', '2021-07-19', '2021-07-02'))
df <- data.frame(col_1 = c(rep('a', 10), rep('b', 10), rep('c', 10)),
date = rep(seq(as.Date('2021-07-01'), by = 'week', length.out = 10), 3))
desired_df <- df %>%
left_join(max_dates, by = 'col_1') %>%
mutate(greater_than = ifelse(date >= max_date, T, F)) %>%
filter(greater_than)
You don't need the mutate() step; move the condition directly into filter():
library(dplyr)
df %>%
left_join(max_dates, by = 'col_1') %>%
filter(date >= max_date)
#> col_1 date max_date
#> 1 a 2021-08-26 2021-08-23
#> 2 a 2021-09-02 2021-08-23
#> 3 b 2021-07-22 2021-07-19
#> 4 b 2021-07-29 2021-07-19
#> 5 b 2021-08-05 2021-07-19
#> 6 b 2021-08-12 2021-07-19
#> 7 b 2021-08-19 2021-07-19
#> 8 b 2021-08-26 2021-07-19
#> 9 b 2021-09-02 2021-07-19
#> 10 c 2021-07-08 2021-07-02
#> 11 c 2021-07-15 2021-07-02
#> 12 c 2021-07-22 2021-07-02
#> 13 c 2021-07-29 2021-07-02
#> 14 c 2021-08-05 2021-07-02
#> 15 c 2021-08-12 2021-07-02
#> 16 c 2021-08-19 2021-07-02
#> 17 c 2021-08-26 2021-07-02
#> 18 c 2021-09-02 2021-07-02
Created on 2021-08-31 by the reprex package (v2.0.0)
We may use a non-equi join:
library(data.table)
setDT(df)[, date1 := date][max_dates, on = .(col_1, date1 >= max_date)]
-output
col_1 date date1
1: a 2021-08-26 2021-08-23
2: a 2021-09-02 2021-08-23
3: b 2021-07-22 2021-07-19
4: b 2021-07-29 2021-07-19
5: b 2021-08-05 2021-07-19
6: b 2021-08-12 2021-07-19
7: b 2021-08-19 2021-07-19
8: b 2021-08-26 2021-07-19
9: b 2021-09-02 2021-07-19
10: c 2021-07-08 2021-07-02
11: c 2021-07-15 2021-07-02
12: c 2021-07-22 2021-07-02
13: c 2021-07-29 2021-07-02
14: c 2021-08-05 2021-07-02
15: c 2021-08-12 2021-07-02
16: c 2021-08-19 2021-07-02
17: c 2021-08-26 2021-07-02
18: c 2021-09-02 2021-07-02
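Both answers above go through a join. As a rough base R sketch, the per-group cutoff can also be looked up with match() and used directly as a row filter, so no helper column ends up in the result. The names cutoff and result are introduced here for illustration.

```r
max_dates <- data.frame(col_1 = c("a", "b", "c"),
                        max_date = c("2021-08-23", "2021-07-19", "2021-07-02"))
df <- data.frame(col_1 = rep(c("a", "b", "c"), each = 10),
                 date = rep(seq(as.Date("2021-07-01"), by = "week",
                                length.out = 10), 3))

# Cutoff date for every row of df, found via its group label
cutoff <- as.Date(max_dates$max_date[match(df$col_1, max_dates$col_1)])

# Keep rows on or after their group's cutoff
result <- df[df$date >= cutoff, ]
result
```

Rows whose group is missing from max_dates would get an NA cutoff, so they would need separate handling.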

R's padr package claiming the "datetime variable does not vary" when it does vary

library(tidyverse)
library(lubridate)
library(padr)
df
#> # A tibble: 828 x 5
#> Scar_Id Code Type Value YrMo
#> <chr> <chr> <chr> <date> <date>
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
#> # ... with 818 more rows
I have an R data frame named df, shown above. I want to concentrate on rows 5 and 6. I can usually use the padr package to pad the months in between rows 5 and 6. padr's pad() function adds rows at the interval the user specifies, as illustrated by the added rows marked "X" below.
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> X 262-179 NA NA NA 2019-09-01
#> X 262-179 NA NA NA 2019-10-01
#> X 262-179 NA NA NA 2019-11-01
#> X 262-179 NA NA NA 2019-12-01
#> X 262-179 NA NA NA 2020-01-01
#> X 262-179 NA NA NA 2020-02-01
#> X 262-179 NA NA NA 2020-03-01
#> X 262-179 NA NA NA 2020-04-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
To get there I usually issue a command, such as is shown below, and it works fine in padr. But it doesn't work in my specific example, and instead yields the warning shown below.
df %>% pad(group = "Scar_Id", by = "YrMo", interval = "month")
#> # A tibble: 828 x 5
#> Scar_Id Code Type Value YrMo
#> <chr> <chr> <chr> <date> <date>
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
#> # ... with 818 more rows
#> Warning message:
#> datetime variable does not vary for 537 of the groups, no padding applied on this / these group(s)
Why does it claim that "the datetime variable does not vary" for rows 5 and 6 when the datetime does indeed vary? The YrMo value for row 5 is "2019-08-01" and the YrMo value for row 6 is "2020-05-01", and these are clearly different.
Any ideas what went wrong? I tried to create a reproducible example and could not. The basic examples I created all work as expected (as I describe). Hopefully these clues can help somebody determine what is going on.
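The warning refers to groups, not to individual rows, so one way to investigate is to count the distinct YrMo values per Scar_Id and see which groups really are constant. The sketch below uses a made-up toy data frame (toy is not the asker's df, and the ID "999-000" is hypothetical); the same two lines could be run on the real data.

```r
# Hypothetical toy data: one group that varies, one that does not
toy <- data.frame(
  Scar_Id = c("262-179", "262-179", "999-000", "999-000"),
  YrMo = as.Date(c("2019-08-01", "2020-05-01", "2020-05-01", "2020-05-01"))
)

# Number of distinct YrMo values in each Scar_Id group
n_distinct_months <- vapply(split(toy$YrMo, toy$Scar_Id),
                            function(x) length(unique(x)), integer(1))

# Groups with a single distinct value are the ones the warning describes
constant_groups <- names(n_distinct_months)[n_distinct_months == 1]
constant_groups
```

If the group containing rows 5 and 6 does not show up in such a list for the real data, the cause lies elsewhere, for example in how the grouping values are stored.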
