Find if a date overlaps between multiple pairs of vectorised dates - r

I'm trying to find whether a date falls between any of several pairs of dates that are stored wide in my dataset. The number of pairs shown here is just an example; the eventual number may be larger or smaller. I'm not sure this is the most sensible layout, but working longwise didn't seem to work. This is also a very common way to work with overlapping dates and date pairs in SPSS, where you can have multiple variables numbered as the dates are here and it works through each numbered 'set' to give you a response.
Here is an example dataset:
  person   key_date 1_end_date 2_end_date 3_end_date 4_end_date 1_start_date 2_start_date 3_start_date 4_start_date
1      1 2019-09-30 2019-05-23 2019-09-30 2016-07-22       <NA>   2019-05-23   2019-09-30   2016-07-22         <NA>
2      2 2019-06-07 2019-05-16 2019-06-07       <NA>       <NA>   2019-05-16         <NA>         <NA>         <NA>
3      3 2020-03-09 2016-06-02 2019-08-09 2020-05-27 2020-02-12   2016-06-02   2019-08-09   2020-05-27   2020-03-09
test <- structure(list(person = 1:3, key_date = structure(c(18169, 18054,18330), class = "Date"), `1_end_date` = structure(c(18039, 18032,16954), class = "Date"), `2_end_date` = structure(c(18169, 18054,18117), class = "Date"), `3_end_date` = structure(c(17004, NA,18409), class = "Date"), `4_end_date` = structure(c(NA, NA, 18304), class = "Date"), `1_start_date` = structure(c(18039, 18032,16954), class = "Date"), `2_start_date` = structure(c(18169,NA, 18117), class = "Date"), `3_start_date` = structure(c(17004,NA, 18409), class = "Date"), `4_start_date` = structure(c(NA,NA, 18330), class = "Date")), row.names = c(NA, 3L), class = "data.frame")
The expected output would be just a binary flag to indicate that the key_date exists between any pair of start_date and end_date. In the example given, that would mean person 1 and 3. Any ideas how to do this? Is this really inefficient?

tidyverse approach
library(tidyverse)
result <- test %>%
  # for each n_end_date column, test key_date against the matching n_start_date
  mutate(across(ends_with("end_date"),
                ~ key_date <= . & key_date >= get(str_replace(cur_column(), "end", "start")),
                .names = '{.col}_flag')) %>%
  rowwise() %>%
  # count the pairs that contain key_date; NA comparisons are dropped
  mutate(Flag1 = sum(c_across(ends_with("flag")), na.rm = TRUE)) %>%
  ungroup() %>%
  select(-ends_with("flag"))
> result$Flag1
[1] 1 0 0
Complete output will look like
> result
# A tibble: 3 x 11
  person key_date   `1_end_date` `2_end_date` `3_end_date` `4_end_date` `1_start_date` `2_start_date` `3_start_date` `4_start_date` Flag1
   <int> <date>     <date>       <date>       <date>       <date>       <date>         <date>         <date>         <date>         <dbl>
1      1 2019-09-30 2019-05-23   2019-09-30   2016-07-22   NA           2019-05-23     2019-09-30     2016-07-22     NA                 1
2      2 2019-06-07 2019-05-16   2019-06-07   NA           NA           2019-05-16     NA             NA             NA                 0
3      3 2020-03-09 2016-06-02   2019-08-09   2020-05-27   2020-02-12   2016-06-02     2019-08-09     2020-05-27     2020-03-09         0
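A base R alternative that avoids get() could pair the columns by name and combine the per-pair checks. This is only a sketch: the `toy` data frame and the `flag` column are made up for illustration (two date pairs instead of four), and it assumes every `n_end_date` column has a matching `n_start_date` column.

```r
# Hypothetical cut-down version of the question's data (two date pairs).
toy <- data.frame(
  person = 1:3,
  key_date = as.Date(c("2019-09-30", "2019-06-07", "2020-03-09")),
  `1_start_date` = as.Date(c("2019-05-23", "2019-05-16", "2016-06-02")),
  `1_end_date`   = as.Date(c("2019-05-23", "2019-05-16", "2016-06-02")),
  `2_start_date` = as.Date(c("2019-09-30", NA, "2019-08-09")),
  `2_end_date`   = as.Date(c("2019-09-30", "2019-06-07", "2019-08-09")),
  check.names = FALSE
)

# Derive each start column name from its end column name.
end_cols   <- grep("_end_date$", names(toy), value = TRUE)
start_cols <- sub("_end_date$", "_start_date", end_cols)

# One logical vector per pair; a missing start or end counts as "not inside".
hits <- Map(function(s, e) {
  inside <- toy$key_date >= toy[[s]] & toy$key_date <= toy[[e]]
  !is.na(inside) & inside
}, start_cols, end_cols)

# TRUE if key_date lies inside at least one pair, stored as 1/0.
toy$flag <- as.integer(Reduce(`|`, hits))
toy$flag
```

Because the pairing is done by column name, the same code should work unchanged if more numbered pairs are added.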


Create new column using tidy evaluation on the left and right of mutate in R

I know there are many questions regarding tidy evaluation in R. However, I couldn't figure out a solution to this seemingly easy problem.
I have this data.frame
structure(list(Date = c("25.02.2020", "26.02.2020", "27.02.2020",
"28.02.2020", "02.03.2020", "03.03.2020", "04.03.2020", "05.03.2020",
"06.03.2020", "09.03.2020", "10.03.2020", "11.03.2020", "12.03.2020",
"13.03.2020", "16.03.2020", "17.03.2020", "18.03.2020", "19.03.2020",
"20.03.2020", "23.03.2020", "24.03.2020", "25.03.2020", "26.03.2020",
"27.03.2020", "30.03.2020", "31.03.2020", "01.04.2020", "02.04.2020",
"03.04.2020", "06.04.2020"), change_AAPL_stocks = c("1", "0,95",
"0,93", "0,85", "0,94", "1,01", "0,99", "0,98", "0,94", "0,88",
"0,92", "0,92", "0,85", "0,88", "0,8", "0,82", "0,8", "0,82",
"0,82", "0,76", "0,79", "0,83", "0,82", "0,84", "0,83", "0,85",
"0,82", "0,8", "0,81", "0,83"), change_AMZN_stocks = c("1", "0,97",
"0,95", "0,9", "0,94", "0,97", "0,96", "0,95", "0,93", "0,88",
"0,92", "0,92", "0,85", "0,87", "0,81", "0,88", "0,86", "0,92",
"0,95", "0,9", "0,96", "0,95", "0,94", "0,95", "0,95", "0,97",
"0,95", "0,94", "0,94", "0,96")), row.names = c(NA, -30L), class = c("tbl_df",
"tbl", "data.frame"))
And I have these variables
date_col = "Date"
date_format = "%d.%m.%Y"
value_col = "change_AAPL_stocks"
And I'd like to write a function that can take arbitrary date_col and date_format values.
The code at the moment looks like this:
df %>%
select(date_col, value_col) %>%
mutate(
{{date_col}} := as.Date({date_col}, format=date_format)
)
This creates (overwrites) the column named Date. However, the as.Date(...) call does not work, and I am not entirely sure what to do about that.
We can use .data to subset
library(dplyr)
df %>%
  select(all_of(date_col)) %>%
  mutate(!! date_col := as.Date(.data[[date_col]], format = date_format))
-output
# A tibble: 30 × 1
Date
<date>
1 2020-02-25
2 2020-02-26
3 2020-02-27
4 2020-02-28
5 2020-03-02
6 2020-03-03
7 2020-03-04
8 2020-03-05
9 2020-03-06
10 2020-03-09
# … with 20 more rows
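Since the question asks for a function that takes arbitrary date_col and date_format values, the .data approach could be wrapped up roughly as follows. This is a sketch, not part of the original answer: the function name `parse_date_col` and the two-row `toy` tibble are hypothetical, and the glue-style name on the left of := assumes a reasonably recent dplyr (1.0 or later).

```r
library(dplyr)

# Hypothetical helper: parse the column named by `date_col` in place.
parse_date_col <- function(data, date_col, date_format) {
  data %>%
    mutate("{date_col}" := as.Date(.data[[date_col]], format = date_format))
}

# Small made-up sample in the same shape as the question's data.
toy <- tibble(
  Date = c("25.02.2020", "26.02.2020"),
  change_AAPL_stocks = c("1", "0,95")
)

out <- parse_date_col(toy, "Date", "%d.%m.%Y")
out
```

Passing the column name as a string keeps the function simple; the other columns are left untouched.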
Use sym and unquote it with !! on the rhs.
df %>%
  select(all_of(date_col)) %>%
  mutate(
    {{date_col}} := as.Date(!!sym(date_col), format=date_format)
  )
# A tibble: 30 x 1
Date
<date>
1 2020-02-25
2 2020-02-26
3 2020-02-27
4 2020-02-28
5 2020-03-02
6 2020-03-03
7 2020-03-04
8 2020-03-05
9 2020-03-06
10 2020-03-09
# ... with 20 more rows
You can generalize the function to take symbol or character as input as follows:
f <- function(data, date_col){
  # date_format is taken from the calling environment
  if(rlang::is_symbol(date_col)){
    rhs <- enquo(date_col)
    date_col <- as.character(date_col)
  }else{
    rhs <- sym(date_col)
  }
  data %>%
    select(all_of(date_col)) %>%
    mutate(
      {{date_col}} := as.Date(!!rhs, format=date_format)
    )
}
df %>%
f('Date')
# A tibble: 30 x 1
Date
<date>
1 2020-02-25
2 2020-02-26
3 2020-02-27
4 2020-02-28
5 2020-03-02
6 2020-03-03
7 2020-03-04
8 2020-03-05
9 2020-03-06
10 2020-03-09
# ... with 20 more rows
df %>%
f(date_col)
# A tibble: 30 x 1
Date
<date>
1 2020-02-25
2 2020-02-26
3 2020-02-27
4 2020-02-28
5 2020-03-02
6 2020-03-03
7 2020-03-04
8 2020-03-05
9 2020-03-06
10 2020-03-09
# ... with 20 more rows

Creating a statement to check multiple dates between a start and end date

I have a dataframe like this in R:
Start date  End date    Date 1      Date 2      Date 3      Date 4
11/12/2018  29/11/2019  08/03/2021  NA          NA          NA
07/03/2018  24/04/2019  08/03/2021  12/09/2016  NA          NA
04/06/2018  23/04/2019  08/03/2021  02/10/2017  05/10/2018  NA
26/07/2018  29/08/2019  08/03/2021  03/08/2015  02/10/2017  23/01/2017
I want to create a new column in R that says: if Date 1, Date 2, Date 3 or Date 4 is between Start date and End date, return 1, otherwise 0, as in the table below:
Start date  End date    Date 1      Date 2      Date 3      Date 4      Change
11/12/2018  29/11/2019  08/03/2021  NA          NA          NA          0
07/03/2018  24/04/2019  08/03/2021  12/09/2016  NA          NA          0
04/06/2018  23/04/2019  08/03/2021  02/10/2017  05/10/2018  NA          1
26/07/2018  29/08/2019  08/03/2021  03/08/2015  02/10/2017  23/01/2017  0
Does anyone have a suggestion on how to solve this? Thank you :)
It'll make it much easier for people to help you if you can post code / data which we can run directly. The easiest way to do this is to use a handy R function called dput, which generates instructions to exactly recreate any R object. So you might run dput(MY_DATA), or if your data is much larger than needed to demonstrate your question, dput(head(MY_DATA)) to get the first six rows, and paste the output of that into your question.
Here's code to generate your example data:
my_data <- data.frame(
stringsAsFactors = FALSE,
Start.date = c("11/12/2018", "07/03/2018", "04/06/2018", "26/07/2018"),
End.date = c("29/11/2019", "24/04/2019", "23/04/2019", "29/08/2019"),
Date.1 = c("08/03/2021", "08/03/2021", "08/03/2021", "08/03/2021"),
Date.2 = c(NA, "12/09/2016", "02/10/2017", "03/08/2015"),
Date.3 = c(NA, NA, "05/10/2018", "02/10/2017"),
Date.4 = c(NA, NA, NA, "23/01/2017")
)
Here's a tidyverse approach that first converts your day/month/year strings into R's Date type using lubridate::dmy, then compares each of Date.1 through Date.4 against your start and end dates, and finally shows whether there are any 1's (dates within range).
library(dplyr); library(lubridate)
my_data %>%
  mutate(across(.cols = everything(), .fns = ~dmy(.x))) %>%
  mutate(across(.cols = starts_with("Date"),
                .fns = ~coalesce(.x >= Start.date & .x <= End.date, FALSE)*1)) %>%
  mutate(Change = pmax(Date.1, Date.2, Date.3, Date.4))
coalesce(..., FALSE) used here to treat NA like FALSE.
(...)*1 to convert TRUE/FALSE to 1/0.
pmax(...) to grab the largest of the 1/0's, i.e. "are there any 1's?"
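The three idioms above can be seen in isolation on a few made-up values (the vectors here are purely illustrative and not taken from the question's data):

```r
library(dplyr)

# A comparison against a missing date yields NA rather than FALSE.
in_range <- c(TRUE, NA, FALSE)

# coalesce() replaces the NA with FALSE; multiplying by 1 gives 1/0.
as_flag <- coalesce(in_range, FALSE) * 1
as_flag

# pmax() takes the element-wise maximum across several 1/0 vectors,
# so a 1 in any of them produces a 1 in the result.
any_flag <- pmax(c(0, 1, 0), c(0, 0, 1))
any_flag
```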
Edit: alternative to leave Date columns intact:
my_data %>%
  mutate(across(.cols = everything(), .fns = ~dmy(.x))) %>%
  mutate(across(.cols = starts_with("Date"),
                .names = "Check_{.col}",
                .fns = ~coalesce(.x >= Start.date & .x <= End.date, FALSE)*1)) %>%
  rowwise() %>%
  mutate(Change = max(c_across(starts_with("Check")))) %>%
  ungroup() %>%
  select(-starts_with("Check"))
Start.date End.date Date.1 Date.2 Date.3 Date.4 Change
<date> <date> <date> <date> <date> <date> <dbl>
1 2018-12-11 2019-11-29 2021-03-08 NA NA NA 0
2 2018-03-07 2019-04-24 2021-03-08 2016-09-12 NA NA 0
3 2018-06-04 2019-04-23 2021-03-08 2017-10-02 2018-10-05 NA 1
4 2018-07-26 2019-08-29 2021-03-08 2015-08-03 2017-10-02 2017-01-23 0
library(tidyverse)
library(lubridate)
df <- read.table(textConnection("start_date;end_date;date_1;date_2;date_3;date_4
11/12/2018;29/11/2019;08/03/2021;NA;NA;NA
07/03/2018;24/04/2019;08/03/2021;12/09/2016;NA;NA
04/06/2018;23/04/2019;08/03/2021;02/10/2017;05/10/2018;NA
26/07/2018;29/08/2019;08/03/2021;03/08/2015;02/10/2017;23/01/2017"),
sep=";",
header = TRUE)
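df %>%
  mutate(
    across(everything(), lubridate::dmy),
    # strict inequalities: a date equal to start_date or end_date does not count
    change = ((date_1 > start_date & date_1 < end_date) |
                (date_2 > start_date & date_2 < end_date) |
                (date_3 > start_date & date_3 < end_date) |
                (date_4 > start_date & date_4 < end_date)
    ) %>%
      coalesce(FALSE) %>%
      as.integer()
  )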
#> start_date end_date date_1 date_2 date_3 date_4 change
#> 1 2018-12-11 2019-11-29 2021-03-08 <NA> <NA> <NA> 0
#> 2 2018-03-07 2019-04-24 2021-03-08 2016-09-12 <NA> <NA> 0
#> 3 2018-06-04 2019-04-23 2021-03-08 2017-10-02 2018-10-05 <NA> 1
#> 4 2018-07-26 2019-08-29 2021-03-08 2015-08-03 2017-10-02 2017-01-23 0
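If the number of date columns may vary, the per-column conditions could be collapsed with if_any() instead of being written out one by one. The sketch below is an assumption-laden variant rather than either answer's method: it needs dplyr 1.0.4 or later for if_any(), parses the dates with base as.Date() instead of lubridate, uses inclusive bounds, and the names `toy` and `checked` are made up.

```r
library(dplyr)

# The question's example data, with the column names used in the second answer.
toy <- data.frame(
  start_date = c("11/12/2018", "07/03/2018", "04/06/2018", "26/07/2018"),
  end_date   = c("29/11/2019", "24/04/2019", "23/04/2019", "29/08/2019"),
  date_1     = c("08/03/2021", "08/03/2021", "08/03/2021", "08/03/2021"),
  date_2     = c(NA, "12/09/2016", "02/10/2017", "03/08/2015"),
  date_3     = c(NA, NA, "05/10/2018", "02/10/2017"),
  date_4     = c(NA, NA, NA, "23/01/2017")
)

checked <- toy %>%
  # parse every column as a day/month/year date
  mutate(across(everything(), ~ as.Date(.x, format = "%d/%m/%Y"))) %>%
  # 1 if any date_* column falls within [start_date, end_date]; NA counts as FALSE
  mutate(change = as.integer(
    if_any(starts_with("date_"),
           ~ coalesce(.x >= start_date & .x <= end_date, FALSE))
  ))

checked$change
```

Only the third row has a date (date_3) inside its range, so change is 1 there and 0 elsewhere, in line with the outputs shown above.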

full_join but with condition on matching

I am going to try to make this as simple as possible, I would like a dplyr solution if possible:
Let's say I have a data frame called f1 with 2 columns: reference is the reference number of an event and date_begin is the begin date of the event:
f1
reference date_begin
1 01100144609598 2020-08-15
2 01100144692499 2020-08-12
3 01100144609598 2020-08-09
4 01100434045112 2020-08-26
5 01100434067379 2020-08-24
6 01100723546188 2020-08-16
I also have another data frame called f2 with 2 columns: reference is the reference number of an event and date_end is the end date of the event:
reference date_end
1 01100144609598 2020-09-06
2 01100144692499 2020-08-10
3 01100434121179 2020-08-25
4 01100578756185 2020-08-17
5 01100578757962 2020-08-31
6 01100578846401 2020-08-16
I want to use a full_join by reference, subject to these conditions:
If there is an end date before a begin date, I want an NA instead of the begin date
The end date has to be later than the begin date
If there are 2 end dates for the same reference that are later than a begin date, take the earliest end date
If there is a begin date with no end date, the end date should be NA
So in this reproducible example I should have an f3 that looks something like the following :
reference date_begin date_end
1 01100144609598 2020-08-15 2020-09-06
2 01100144692499 NA 2020-08-10
3 01100144692499 2020-08-12 NA
4 01100434121179 NA 2020-08-25
5 01100578756185 NA 2020-08-17
6 01100578757962 NA 2020-08-31
7 01100578846401 NA 2020-08-16
8 01100144609598 2020-08-09 NA
9 01100434045112 2020-08-26 NA
10 01100434067379 2020-08-24 NA
11 01100723546188 2020-08-16 NA
As Chuck P mentions, the conditions make this a bit complicated. Rather than use full_join, I've first combined f1 and f2 and then transformed to "long" format. We can then group by reference and sort by date to set up to use case_when to apply the conditions stated in the post or other conditions as needed. The result is then transformed back to "wide" format to present as shown in the post. The code is
library(tidyverse)
#
# combine f1 and f2 and pivot to long format
#
all <- bind_rows(f1,f2) %>%
  pivot_longer(cols = c(date_begin, date_end),
               names_to = "type", values_to = "date",
               values_drop_na = TRUE)
#
# group by reference, sort by date, and then use
# case_when function to pair begin and end dates
#
all <- all %>% group_by(reference) %>%
  arrange(date) %>%
  mutate(index = 1:n(),
         index = case_when(
           type == "date_end" & lag(type, n = 1) == "date_begin" ~ lag(index),
           TRUE ~ index))
#
# pivot back to wide format to format results as shown in post
#
result <- all %>% pivot_wider(names_from =type, values_from = date) %>% mutate(index = NULL)
The result is
> result
# A tibble: 11 x 3
# Groups: reference [9]
reference date_begin date_end
<chr> <date> <date>
1 01100144609598 2020-08-09 NA
2 01100144692499 NA 2020-08-10
3 01100144692499 2020-08-12 NA
4 01100144609598 2020-08-15 2020-09-06
5 01100723546188 2020-08-16 NA
6 01100578846401 NA 2020-08-16
7 01100578756185 NA 2020-08-17
8 01100434067379 2020-08-24 NA
9 01100434121179 NA 2020-08-25
10 01100434045112 2020-08-26 NA
11 01100578757962 NA 2020-08-31
where the results are sorted by date.
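To see how the lag-based pairing behaves, it may help to trace it on a single reference. The sketch below uses a made-up reference "A" with two begin dates and one end date (the dates are borrowed from reference 01100144609598 above); `events` and `paired` are hypothetical names, and if_else() stands in for case_when().

```r
library(dplyr)
library(tidyr)

# Long-format events for one hypothetical reference.
events <- tibble(
  reference = "A",
  type = c("date_begin", "date_begin", "date_end"),
  date = as.Date(c("2020-08-09", "2020-08-15", "2020-09-06"))
)

paired <- events %>%
  group_by(reference) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    index = row_number(),
    # an end date directly after a begin date shares that begin date's index
    index = if_else(type == "date_end" & lag(type, default = "") == "date_begin",
                    lag(index), index)
  ) %>%
  ungroup() %>%
  # rows sharing an index collapse into one begin/end pair
  pivot_wider(names_from = type, values_from = date) %>%
  select(-index)

paired
```

The end date attaches to the closest preceding begin date (2020-08-15), and the earlier begin date is left with an NA end date.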
This is more complex than it first appears because of the conditional logic. I broke it down into three steps that occur after we do the initial full_join to make f3
library(dplyr)
library(tidyr)
library(purrr)
f3 <- full_join(f1, f2)
#> Joining, by = "reference"
f3
#> reference date_begin date_end
#> 1 01100144609598 2020-08-15 2020-09-06
#> 2 01100144692499 2020-08-12 2020-08-10
#> 3 01100144609598 2020-08-09 2020-09-06
#> 4 01100434045112 2020-08-26 <NA>
#> 5 01100434067379 2020-08-24 <NA>
#> 6 01100723546188 2020-08-16 <NA>
#> 7 01100434121179 <NA> 2020-08-25
#> 8 01100578756185 <NA> 2020-08-17
#> 9 01100578757962 <NA> 2020-08-31
#> 10 01100578846401 <NA> 2020-08-16
Step 1: set aside the rows where we don't have to do anything because either the begin date or the end date is NA.
nothing_to_do <-
  f3 %>% filter(is.na(date_begin) | is.na(date_end))
Step 2: identify rows where the begin date is after the end date, like "01100144692499". For these we actually have to add a row and then adjust the rows.
end_before_beginning <-
  f3 %>% filter(date_begin > date_end) %>%
  group_by(reference) %>%
  do(
    add_row(.,
            reference = .$reference,
            date_begin = .$date_begin,
            .after = 1)
  ) %>%
  ungroup() %>%
  mutate(date_begin =
           case_when(
             !is.na(date_end) ~ as.Date(NA_character_),
             TRUE ~ date_begin
           ))
Step 3: identify rows with multiple beginnings for the same ending, like "01100144609598", where we have to keep the end date only on the row with the shortest time span.
multiple_beginnings <-
  f3 %>%
  group_by(reference, date_end) %>%
  mutate(instances = n(),
         date_diff = date_end - date_begin) %>%
  filter(instances > 1) %>%
  mutate(date_end =
           case_when(
             date_diff != min(date_diff) ~ as.Date(NA_character_),
             TRUE ~ date_end
           )) %>%
  select(-instances, -date_diff)
Glue them all together
final_answer <-
  list(nothing_to_do, end_before_beginning, multiple_beginnings) %>%
  reduce(full_join)
#> Joining, by = c("reference", "date_begin", "date_end")
#> Joining, by = c("reference", "date_begin", "date_end")
final_answer
#> reference date_begin date_end
#> 1 01100434045112 2020-08-26 <NA>
#> 2 01100434067379 2020-08-24 <NA>
#> 3 01100723546188 2020-08-16 <NA>
#> 4 01100434121179 <NA> 2020-08-25
#> 5 01100578756185 <NA> 2020-08-17
#> 6 01100578757962 <NA> 2020-08-31
#> 7 01100578846401 <NA> 2020-08-16
#> 8 01100144692499 <NA> 2020-08-10
#> 9 01100144692499 2020-08-12 <NA>
#> 10 01100144609598 2020-08-15 2020-09-06
#> 11 01100144609598 2020-08-09 <NA>
Your data...
f1 <- structure(list(reference = c("01100144609598", "01100144692499",
"01100144609598", "01100434045112", "01100434067379", "01100723546188"),
date_begin = structure(c(18489, 18486, 18483, 18500, 18498,
18490), class = "Date")), row.names = c(NA, -6L), class = "data.frame")
f2 <- structure(list(reference = c("01100144609598", "01100144692499",
"01100434121179", "01100578756185", "01100578757962", "01100578846401"),
date_end = structure(c(18511, 18484, 18499, 18491, 18505,
18490), class = "Date")), row.names = c(NA, -6L), class = "data.frame")

R: Reshape every 2 rows of data into 1 row based on factor of a single column

I have the below data:
TimeStamp Fab23.A start.end
1 2020-03-02 20:44:00 27.54236 start
2 2020-03-02 20:50:00 186.08670 end
3 2020-03-03 18:12:00 37.33132 start
4 2020-03-03 18:16:00 189.78060 end
5 2020-03-04 17:48:00 33.78360 start
6 2020-03-04 17:52:00 190.08100 end
.
.
.
I'm trying to reshape them to the below format based on the categorical value in the last column:
start.TimeStamp end.TimeStamp start.Fab23.A end.Fab23.A
1 2020-03-02 20:44:00 2020-03-02 20:50:00 27.54236 186.08670
2 2020-03-03 18:12:00 2020-03-03 18:16:00 37.33132 189.78060
3 2020-03-04 17:48:00 2020-03-04 17:52:00 33.78360 190.08100
.
.
.
I have tried reshape and melt function but to no avail.
Will appreciate any advice.
You can use:
library(dplyr)
df %>%
  group_by(start.end) %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_wider(names_from = start.end,
                     values_from = c(TimeStamp, Fab23.A)) %>%
  select(-row)
# A tibble: 3 x 4
# TimeStamp_start TimeStamp_end Fab23.A_start Fab23.A_end
# <chr> <chr> <dbl> <dbl>
#1 2020-03-0220:44:00 2020-03-0220:50:00 27.5 186.
#2 2020-03-0318:12:00 2020-03-0318:16:00 37.3 190.
#3 2020-03-0417:48:00 2020-03-0417:52:00 33.8 190.
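The row_number() per group is what lines up the n-th start with the n-th end. The same idea can be sketched in base R, under the assumption that the rows strictly alternate start/end so that both subsets have equal length; the `toy` data frame and the names `starts`, `ends` and `wide` are made up for the example.

```r
# First two start/end pairs from the question, as a hypothetical sample.
toy <- data.frame(
  TimeStamp = c("2020-03-02 20:44:00", "2020-03-02 20:50:00",
                "2020-03-03 18:12:00", "2020-03-03 18:16:00"),
  Fab23.A   = c(27.54236, 186.0867, 37.33132, 189.7806),
  start.end = c("start", "end", "start", "end")
)

# Split by the marker column; position within each subset pairs the rows.
starts <- toy[toy$start.end == "start", c("TimeStamp", "Fab23.A")]
ends   <- toy[toy$start.end == "end",   c("TimeStamp", "Fab23.A")]
stopifnot(nrow(starts) == nrow(ends))  # every start needs a matching end

wide <- data.frame(
  start.TimeStamp = starts$TimeStamp,
  end.TimeStamp   = ends$TimeStamp,
  start.Fab23.A   = starts$Fab23.A,
  end.Fab23.A     = ends$Fab23.A
)
wide
```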
Or using data.table :
library(data.table)
dcast(setDT(df), rowid(start.end)~start.end,value.var = c("TimeStamp", "Fab23.A"))
data
df <- structure(list(TimeStamp = c("2020-03-0220:44:00", "2020-03-0220:50:00",
"2020-03-0318:12:00", "2020-03-0318:16:00", "2020-03-0417:48:00",
"2020-03-0417:52:00"), Fab23.A = c(27.54236, 186.0867, 37.33132,
189.7806, 33.7836, 190.081), start.end = c("start", "end", "start",
"end", "start", "end")), class = "data.frame", row.names = c(NA, -6L))

Convert character column to date column in lubridate [duplicate]

I have a dataframe looking like:
# A tibble: 10 x 4
id incoming_date expiry_date end_date
<dbl> <chr> <chr> <chr>
1 1 11.17.18 10.1.19 03.1.19
2 11 03.4.19 NA 03.20.19
3 2 03.17.19 02.1.20 05.7.19
4 2 05.7.19 NA 06.15.19
5 4 06.11.19 05.1.21 06.22.19
6 1 06.12.19 04.1.21 NA
7 2 06.12.19 04.1.21 NA
8 13 11.16.18 06.1.19 02.20.19
9 7 02.19.19 12.1.21 3.23.19
10 1 03.19.19 01.1.21 09.10.19
reproducible:
library(tidyverse)
library(lubridate)
df <- as_tibble(structure(list(id = c(1, 11, 2, 2, 4, 1, 2, 13, 7, 1), incoming_date = c("11.17.18",
"03.4.19", "03.17.19", "05.7.19", "06.11.19", "06.12.19", "06.12.19",
"11.16.18", "02.19.19", "03.19.19"), expiry_date = c("10.1.19",
NA, "02.1.20", NA, "05.1.21", "04.1.21", "04.1.21", "06.1.19",
"12.1.21", "01.1.21"), end_date = c("03.1.19", "03.20.19", "05.7.19",
"06.15.19", "06.22.19", NA, NA, "02.20.19", "3.23.19", "09.10.19"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
)))
I want to convert the columns from character to date format, as follows:
id incoming_date expiry_date end_date
<dbl> <date> <date> <date>
1 1 2018-11-17 2019-10-01 2019-03-01
2 11 2019-03-04 NA 2019-03-20
3 2 2019-03-17 2020-02-01 2019-05-07
4 2 2019-05-07 NA 2019-06-15
5 4 2019-06-11 2021-05-01 2019-06-22
6 1 2019-06-12 2021-04-01 NA
7 2 2019-06-12 2021-04-01 NA
8 13 2018-11-16 2019-06-01 2019-02-20
9 7 2019-02-19 2021-12-01 2019-03-23
10 1 2019-03-19 2021-01-01 2019-09-10
What I tried to do - starting from the incoming_date column - is:
df %>%
mutate(incoming_date_parsed = strptime(as.POSIXct(incoming_date), "%m.%d.%Y"))
throwing the error:
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
In logical order, what I would do is:
Convert the column from character to datetime
Parse the date following in the format YYYY-MM-dd
You're really close!
There's no reason to convert to POSIXct first and then to a date in your example.
And you would use a lowercase 'y' (%y) for two-digit years.
df <- df %>%
  mutate(incoming_date_parsed = as.Date(incoming_date, format = "%m.%d.%y"))
> head(df$incoming_date_parsed)
[1] "2018-11-17" "2019-03-04" "2019-03-17" "2019-05-07" "2019-06-11" "2019-06-12"
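The question asks for all three date columns to be converted, so the same format string could be applied across them in one step. This is a sketch on a three-row made-up subset of the data (`toy` and `parsed` are hypothetical names); it assumes every column ending in _date uses the same month.day.two-digit-year layout.

```r
library(dplyr)

# A few rows in the question's format, including an NA and a one-digit month.
toy <- tibble(
  id = c(1, 11, 7),
  incoming_date = c("11.17.18", "03.4.19", "02.19.19"),
  expiry_date   = c("10.1.19", NA, "12.1.21"),
  end_date      = c("03.1.19", "03.20.19", "3.23.19")
)

# Overwrite each *_date column with its parsed Date version.
parsed <- toy %>%
  mutate(across(ends_with("_date"), ~ as.Date(.x, format = "%m.%d.%y")))

parsed
```

Missing strings stay NA after parsing, so no special handling is needed for them.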
