Filling NA data with data from another row with matching - r

Here is example data:
df = data.frame(id = (1:5),
type= c("a_type","a_type","b_type","b_type", "c_type"),
start_date= lubridate::dmy(c("01/01/2014", "30/04/2014", "30/04/2015", "10/05/2015", "30/03/2016")),
fail_date = lubridate::dmy(c("30/04/2015", rep(NA,4))))
> df
id type start_date fail_date
1 1 a_type 2014-01-01 2015-04-30
2 2 a_type 2014-04-30 <NA>
3 3 b_type 2015-04-30 <NA>
4 4 b_type 2015-05-10 <NA>
5 5 c_type 2016-03-30 <NA>
I want to fill in fail_date of "a_type" where they are NA. This needs to be the start_date of the next type that is "c_type" or "a_type". So output should be:
> df1
id type start_date fail_date
1 1 a_type 2014-01-01 2015-04-30
2 2 a_type 2014-04-30 2016-03-30
3 3 b_type 2015-04-30 <NA>
4 4 b_type 2015-05-10 <NA>
5 5 c_type 2016-03-30 <NA>
I have started with this code, but don't know how to specify the start_date of the next entry that is type "c_type" or "a_type":
df1= df%>%
mutate(fail_date = case_when(is.na(fail_date) & type =="a_type" ~ ??? ,
TRUE ~ fail_date))
I do not want to just filter out the "b_type"
#######Edit##############
Following the answer given by @Allan Cameron, I have tried this on my data but have come across an issue.
If the last type is "b_type" the code won't work, as cumsum(rev(df$type) != 'b_type') starts with a 0. Here is an example with amended data:
df = data.frame(id = (1:6),
type= c("a_type","a_type","b_type","b_type", "c_type","b_type"),
start_date= lubridate::dmy(c("01/12/2013","01/01/2014", "30/04/2014", "30/04/2015", "10/05/2015", "30/03/2016")),
fail_date =rep(NA,6))
df1= df%>%
arrange(start_date) %>%
mutate(next_start = lead(rev(rev(start_date[type != 'b_type'])[
cumsum(rev(type) != 'b_type')])),
fail_date = if_else(type == 'a_type' &
is.na(fail_date) ,
next_start,
lubridate::dmy(fail_date)))
Error: Problem with `mutate()` column `next_start`.
i `next_start = lead(...)`.
i `next_start` must be size 6 or 1, not 5.
Here are some of the individual elements which have helped me to understand the error, but I don't know how to overcome this:
df1= df%>%
mutate( rev = rev(type),
qa = rev(type) != 'b_type',
qw= cumsum(rev(type)!='b_type')
)
df1
id type start_date fail_date rev qa qw
1 1 a_type 2013-12-01 NA b_type FALSE 0
2 2 a_type 2014-01-01 NA c_type TRUE 1
3 3 b_type 2014-04-30 NA b_type FALSE 1
4 4 b_type 2015-04-30 NA b_type FALSE 1
5 5 c_type 2015-05-10 NA a_type TRUE 2
6 6 b_type 2016-03-30 NA a_type TRUE 3
Because the first entry in qw is 0, rev(rev(start_date[type != 'b_type'])[ cumsum(rev(type) != 'b_type')]) does not produce a result.

This is tricky. You can do it all within the pipe, but the hard part is back-propagating the next available start date, which requires a combination of lead, rev and cumsum to create a temporary column. Choosing whether to insert a value from the temporary column just comes down to specifying your logic inside an if_else:
df %>%
arrange(start_date) %>%
mutate(next_start = lead(rev(rev(start_date[type != 'b_type'])[
cumsum(rev(type) != 'b_type')])),
fail_date = if_else(type == 'a_type' &
is.na(fail_date) &
lead(type, default = last(type)) != 'a_type',
next_start,
fail_date)) %>%
select(-next_start)
#> id type start_date fail_date
#> 1 1 a_type 2014-01-01 2015-04-30
#> 2 2 a_type 2014-04-30 2016-03-30
#> 3 3 b_type 2015-04-30 <NA>
#> 4 4 b_type 2015-05-10 <NA>
#> 5 5 c_type 2016-03-30 <NA>
If you have multiple types (rather than just 3) you might need to change the instances of type != 'b_type' to type %in% allowed_types, where allowed_types is a pre-defined vector, as shown below:
allowed_types <- c('a_type', 'c_type')
df %>%
arrange(start_date) %>%
mutate(next_start = lead(rev(rev(start_date[type %in% allowed_types])[
cumsum(rev(type %in% allowed_types))])),
fail_date = if_else(type == 'a_type' &
is.na(fail_date) &
lead(type, default = last(type)) != 'a_type',
next_start,
fail_date)) %>%
select(-next_start)
This generates the same result for your example data but is more generalizable.
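For the amended data in the question's edit, where the last row is a "b_type", one possible way around the size error is to avoid the rev/cumsum indexing altogether and back-fill with tidyr::fill(). This is only a sketch under a few assumptions: the helper columns candidate and next_start are made-up names, and the rule applied is the one stated in the question (every NA fail_date on an "a_type" row gets the start_date of the next "a_type" or "c_type" row), without the extra lead(type) check used above.

```r
library(dplyr)
library(tidyr)

# Amended example data from the question (dates written in ISO format)
df <- data.frame(
  id = 1:6,
  type = c("a_type", "a_type", "b_type", "b_type", "c_type", "b_type"),
  start_date = as.Date(c("2013-12-01", "2014-01-01", "2014-04-30",
                         "2015-04-30", "2015-05-10", "2016-03-30")),
  fail_date = as.Date(rep(NA, 6))
)

allowed_types <- c("a_type", "c_type")

res <- df %>%
  arrange(start_date) %>%
  mutate(
    # keep the start date only on rows whose type may end an a_type
    candidate = if_else(type %in% allowed_types, start_date, as.Date(NA)),
    # shift up by one row so a row never points at itself
    next_start = lead(candidate)
  ) %>%
  # propagate the next available candidate backwards over the gaps
  fill(next_start, .direction = "up") %>%
  mutate(fail_date = if_else(type == "a_type" & is.na(fail_date),
                             next_start, fail_date)) %>%
  select(-candidate, -next_start)

res
```

Because fill() simply leaves trailing NAs in place, a final "b_type" row no longer changes the length of the intermediate vector.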

Related

mutate variable based on other columns with similar names

I have a df here (the desired output, my starting df does not have the Flag variable):
df <- data.frame(
Person = c('1','2','3'),
Date = as.Date(c('2010-09-30', '2012-11-20', '2015-03-11')),
Treatment_1 = as.Date(c('2010-09-30', '2012-11-21', '2015-03-22')),
Treatment_2 = as.Date(c('2011-09-30', 'NA', '2011-03-22')),
Treatment_3 = as.Date(c('2012-09-30', '2015-11-21', '2015-06-22')),
Surgery_1 = as.Date(c(NA, '2016-11-21', '2015-03-12')),
Surgery_2 = as.Date(c(NA, '2017-11-21', '2019-03-12')),
Surgery_3 = as.Date(c(NA, '2018-11-21', '2013-03-12')),
Flag = c('', 'Y', '')
)
and I want to derive the Flag variable based on these conditions:
For any column that starts with Treatment, set Flag to "" if Date = Treatment
For any column that starts with Surgery, set Flag to "" if Date = Surgery OR Date = Surgery +1 OR Date = Surgery - 1 (basically if the Surgery date is on the day, one day before, or one day after the Date variable, set Flag to "").
else set Flag = "Y"
I've looked into mutate_at but that rewrites the variables and assigns values of True/False.
This is wrong but this is my attempt:
df2 <- df %>%
mutate(Flag = case_when(
vars(starts_with("Treatment"), Date == . ) ~ '',
vars(starts_with("Surgery"), Date == . | Date == . - 1 | Date == . + 1) ~ '',
TRUE ~ 'Y')
)
UPDATE 2022-Aug-22
When I change a cell with the same date as the one in row 2:
df <- data.frame(
Person = c('1','2','3'),
Date = as.Date(c('2010-09-30', '2012-11-20', '2015-03-11')),
Treatment_1 = as.Date(c('2010-09-30', '2012-11-21', '2015-03-22')),
Treatment_2 = as.Date(c('2011-09-30', 'NA', '2011-03-22')),
Treatment_3 = as.Date(c('2012-09-30', '2015-11-21', '2015-06-22')),
Surgery_1 = as.Date(c(NA, '2016-11-21', '2015-03-12')),
Surgery_2 = as.Date(c(NA, '2017-11-21', '2019-03-12')),
Surgery_3 = as.Date(c(NA, '2018-11-21', '2012-11-20')),
Flag = c('', 'Y', '')
)
and then re-run the base R solution, the Flag in the second row is no longer "Y", but it should be, because that row doesn't meet any of the above conditions.
We can use rowwise and c_across along with any for each condition in case_when. Then, we can make a list for the Date (and +1, -1 days) for Surgery to match.
library(tidyverse)
df %>%
rowwise() %>%
mutate(Flag = case_when(
any(c_across(starts_with("Treatment")) == Date) ~ "",
any(c_across(starts_with("Surgery")) %in% c(Date, (Date +1), (Date-1))) ~ "",
TRUE ~ "Y"
))
Output
Person Date Treatment_1 Treatment_2 Treatment_3 Surgery_1 Surgery_2 Surgery_3 Flag
<chr> <date> <date> <date> <date> <date> <date> <date> <chr>
1 1 2010-09-30 2010-09-30 2011-09-30 2012-09-30 NA NA NA ""
2 2 2012-11-20 2012-11-21 NA 2015-11-21 2016-11-21 2017-11-21 2018-11-21 "Y"
3 3 2015-03-11 2015-03-22 2011-03-22 2015-06-22 2015-03-12 2019-03-12 2013-03-12 ""
Update
Here is a possible base R solution that is a lot quicker than tidyverse. This could be done in one line of code, but I decided that readability is better. First, I duplicate the Surgery columns so that we have +1 day and -1 day, and then convert these columns to character. Then, I subset the Treatment columns and convert them to character. I convert to character so that everything can be held in one plain matrix and compared with ==, since a matrix does not keep the Date class. Then, I bind the date, treatment, and surgery columns together (a). Then, row by row with apply, I check whether the Date is in any of the columns, returning "" if it is and "Y" if not. Finally, I bind the result back to the original dataframe (minus Flag from your original dataframe).
dup_names <- colnames(df)[startsWith(colnames(df), "Surgery")]
surgery <-
cbind(df[dup_names], setNames(df[dup_names] + 1, paste0(dup_names, "_range1")))
surgery <-
sapply(cbind(surgery, setNames(df[dup_names] - 1, paste0(
dup_names, "_range2"
))), as.character)
treatment <-
sapply(df[startsWith(colnames(df), "Treatment")], as.character)
a <- cbind(Date = as.character(df$Date), treatment, surgery)
cbind(subset(df, select = -Flag),
Flag = ifelse(apply(a[,1]==a[,2:ncol(a)], 1, any, na.rm = TRUE), "", "Y"))
Here is an alternative using across approach:
library(tidyverse)
df %>%
mutate(across(starts_with("Treatment"), ~as.numeric(. == Date), .names ="new_{.col}"),
across(starts_with("Surgery"), ~as.numeric(abs(. - Date) <= 1), .names ="new_{.col}")) %>%
# compare row by row (== rather than %in%) and count any match, ignoring NAs
mutate(Flag = ifelse(rowSums(select(., contains('new')), na.rm = TRUE) >= 1, "", "Y"), .keep="used") %>%
bind_cols(df)
Flag Person Date Treatment_1 Treatment_2 Treatment_3 Surgery_1 Surgery_2 Surgery_3
1 1 2010-09-30 2010-09-30 2011-09-30 2012-09-30 <NA> <NA> <NA>
2 Y 2 2012-11-20 2012-11-21 <NA> 2015-11-21 2016-11-21 2017-11-21 2018-11-21
3 3 2015-03-11 2015-03-22 2011-03-22 2015-06-22 2015-03-12 2019-03-12 2013-03-12
Updated to add data.table approach
If you want a data.table approach, here it is:
library(data.table)
library(stringr)
setDT(df) # convert df to a data.table by reference
df[melt(df, id=c(1,2))[,flag:=fifelse(
(str_starts(variable,"T") & value==Date) |
(str_starts(variable,"S") & abs(value-Date)<=1),"", "Y")][
, .(flag=min(flag,na.rm=TRUE)), Person], on=.(Person)]
Output
Person Date Treatment_1 Treatment_2 Treatment_3 Surgery_1 Surgery_2 Surgery_3 flag
1: 1 2010-09-30 2010-09-30 2011-09-30 2012-09-30 <NA> <NA> <NA>
2: 2 2012-11-20 2012-11-21 <NA> 2015-11-21 2016-11-21 2017-11-21 2018-11-21 Y
3: 3 2015-03-11 2015-03-22 2011-03-22 2015-06-22 2015-03-12 2019-03-12 2013-03-12
I like Andrew's approach, but I was working on this when his answer came in, so here it is in case you are interested
df %>% inner_join(
pivot_longer(df, cols=Treatment_1:Surgery_3) %>%
mutate(flag=case_when(
(str_starts(name,"T") & value==Date) | (str_starts(name,"S") & abs(value-Date)<=1) ~ "",
TRUE ~"Y")) %>%
group_by(Person) %>%
summarize(flag = min(flag))
)
Output:
Person Date Treatment_1 Treatment_2 Treatment_3 Surgery_1 Surgery_2 Surgery_3 flag
1 1 2010-09-30 2010-09-30 2011-09-30 2012-09-30 <NA> <NA> <NA>
2 2 2012-11-20 2012-11-21 <NA> 2015-11-21 2016-11-21 2017-11-21 2018-11-21 Y
3 3 2015-03-11 2015-03-22 2011-03-22 2015-06-22 2015-03-12 2019-03-12 2013-03-12
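A further variant that stays row-wise without rowwise() is dplyr's if_any(), which evaluates the condition column by column and combines the results per row. This is a minimal sketch, assuming dplyr 1.0.4 or later and the updated data from the question with the Flag column left out; a row with only FALSE and NA comparisons yields NA, which case_when() treats as "no match".

```r
library(dplyr)

# Updated example data from the question, without the Flag column
df <- data.frame(
  Person = c("1", "2", "3"),
  Date = as.Date(c("2010-09-30", "2012-11-20", "2015-03-11")),
  Treatment_1 = as.Date(c("2010-09-30", "2012-11-21", "2015-03-22")),
  Treatment_2 = as.Date(c("2011-09-30", NA, "2011-03-22")),
  Treatment_3 = as.Date(c("2012-09-30", "2015-11-21", "2015-06-22")),
  Surgery_1 = as.Date(c(NA, "2016-11-21", "2015-03-12")),
  Surgery_2 = as.Date(c(NA, "2017-11-21", "2019-03-12")),
  Surgery_3 = as.Date(c(NA, "2018-11-21", "2012-11-20"))
)

res <- df %>%
  mutate(Flag = case_when(
    # any Treatment date equal to Date in the same row
    if_any(starts_with("Treatment"), ~ .x == Date) ~ "",
    # any Surgery date within one day of Date in the same row
    if_any(starts_with("Surgery"), ~ abs(as.numeric(.x - Date)) <= 1) ~ "",
    TRUE ~ "Y"
  ))

res$Flag
```

Each comparison is made against the Date of the same row, so the 2012-11-20 value in row 3 of Surgery_3 does not affect the flag of row 2.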

Imputing date based on next(or previous) available date grouped by another column

I have a dataframe that looks like this:
CYCLE date_cycle Randomization_Date COUPLEID
1 0 2016-02-16 10892
2 1 2016-08-17 2016-02-19 10894
3 1 2016-08-14 2016-02-26 10899
4 1 2016-02-26 10900
5 2 2016-03--- 2016-02-26 10900
6 3 2016-07-19 2016-02-26 10900
7 4 2016-11-15 2016-02-26 10900
8 1 2016-02-27 10901
9 2 2016-02--- 2016-02-27 10901
10 1 2016-03-27 2016-03-03 10902
11 2 2016-04-21 2016-03-03 10902
12 1 2016-03-03 10903
13 2 2016-03--- 2016-03-03 10903
14 0 2016-03-03 10904
15 1 2016-03-03 10905
16 2 2016-03-03 10905
17 3 2016-03-03 10905
18 4 2016-04-14 2016-03-03 10905
19 5 2016-05--- 2016-03-03 10905
20 6 2016-06--- 2016-03-03 10905
The goal is to fill in the missing day for a given ID using either an earlier or later date and add/subtract 28 from that.
The date_cycle variable was originally in the dataframe as a character type.
I have tried to code it as follows:
mutate(rowwise(df),
newdate = case_when( str_count(date1, pattern = "\\W") >2 ~ lag(as.Date.character(date1, "%Y-%m-%d"),1) + days(28)))
But I need to incorporate it by ID by CYCLE.
An example of my data could be made like this:
data.frame(stringsAsFactors = FALSE,
CYCLE = c(0,1,1,1,2,3,4,1,2,1,2,1,2,0,1,2,3,4,5,6),
date_cycle = c(NA,"2016-08-17","2016-08-14",NA,"2016-03---","2016-07-19",
"2016-11-15",NA,"2016-02---","2016-03-27","2016-04-21",NA,
"2016-03---",NA,NA,NA,NA,"2016-04-14","2016-05---","2016-06---"),
Randomization_Date = c("2016-02-16","2016-02-19",
"2016-02-26","2016-02-26",
"2016-02-26","2016-02-26",
"2016-02-26",
"2016-02-27","2016-02-27",
"2016-03-03",
"2016-03-03","2016-03-03",
"2016-03-03","2016-03-03",
"2016-03-03",
"2016-03-03","2016-03-03",
"2016-03-03",
"2016-03-03","2016-03-03"),
COUPLEID = c(10892,10894,10899,10900,
10900,10900,10900,10901,10901,
10902,10902,10903,10903,10904,
10905,10905,10905,10905,10905,10905)
)
The output I am after would look like:
COUPLEID CYCLE date_cycle new_date_cycle
a 1 2014-03-27 2014-03-27
a 1 2014-04--- 2014-04-24
b 1 2014-03-24 2014-03-24
b 2 2014-04-21
b 3 2014-05--- 2014-05-19
c 1 2014-04--- 2014-04-02
c 2 2014-04-30 2014-04-30
I have also started to make a long conditional, but I wanted to ask here and see if anyone knew of a more straightforward way to do it, instead of explicitly writing out all of the possible conditions.
mutate(rowwise(df),
newdate = case_when(
grp == 1 & str_count(date1, pattern = "\\W") >2 & !is.na(lead(date1,1)) ~ lead(date1,1) - days(28),
grp == 2 & str_count(date1, pattern = "\\W") >2 & !is.na(lead(date1,1)) ~ lead(date1,1) - days(28),
grp == 3 & str_count(date1, pattern = "\\W") >2 & ...)))
Function to fill dates forward and backwards
filldates <- function(dates) {
m = which(!is.na(dates))
if(length(m)>0 && length(m)!=length(dates)) {
if(m[1]>1) for(i in seq(m[1],1,-1)) if(is.na(dates[i])) dates[i]=dates[i+1]-28
if(sum(is.na(dates))>0) for(i in seq_along(dates)) if(is.na(dates[i])) dates[i] = dates[i-1]+28
}
return(dates)
}
Usage:
data %>%
arrange(ID, grp) %>%
group_by(ID) %>%
mutate(date2=filldates(as.Date(date1,"%Y-%m-%d")))
Output:
ID grp date1 date2
<chr> <dbl> <chr> <date>
1 a 1 2014-03-27 2014-03-27
2 a 2 2014-04--- 2014-04-24
3 b 1 2014-03-24 2014-03-24
4 b 2 2014-04--- 2014-04-21
5 b 3 2014-05--- 2014-05-19
6 c 1 2014-03--- 2014-04-02
7 c 2 2014-04-30 2014-04-30
An option using purrr::accumulate().
library(tidyverse)
center <- df %>%
group_by(ID) %>%
mutate(helpDate = ymd(str_replace(date1, '---', '-01')),
refDate = max(ymd(date1), na.rm = T))
backward <- center %>%
filter(refDate == max(helpDate)) %>%
mutate(date2 = accumulate(refDate, ~ . - days(28), .dir = 'backward'))
forward <- center %>%
filter(refDate == min(helpDate)) %>%
mutate(date2 = accumulate(refDate, ~ . + days(28)))
bind_rows(forward, backward) %>%
ungroup() %>%
mutate(date2 = as_date(date2)) %>%
select(-c('helpDate', 'refDate'))
# # A tibble: 7 x 4
# ID grp date1 date2
# <chr> <int> <chr> <date>
# 1 a 1 2014-03-27 2014-03-27
# 2 a 2 2014-04--- 2014-04-24
# 3 b 1 2014-03-24 2014-03-24
# 4 b 2 2014-04--- 2014-04-21
# 5 b 3 2014-05--- 2014-05-19
# 6 c 1 2014-03--- 2014-04-02
# 7 c 2 2014-04-30 2014-04-30
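If each ID can be anchored on a single known date, the same result can be sketched without a loop by offsetting from that date in steps of 28 days. This is a simplification rather than a drop-in replacement: it assumes cycles are exactly 28 days apart and uses only the first complete date per ID as the reference, whereas the loop above fills from the nearest neighbour. The column names parsed, anchor and date2 are made up for the example.

```r
library(dplyr)

# Small example in the ID / grp / date1 layout used in the answers above
df <- data.frame(
  ID = c("a", "a", "b", "b", "b", "c", "c"),
  grp = c(1, 2, 1, 2, 3, 1, 2),
  date1 = c("2014-03-27", "2014-04---", "2014-03-24", "2014-04---",
            "2014-05---", "2014-03---", "2014-04-30"),
  stringsAsFactors = FALSE
)

res <- df %>%
  # keep only complete yyyy-mm-dd strings, everything else becomes NA
  mutate(parsed = as.Date(if_else(grepl("^\\d{4}-\\d{2}-\\d{2}$", date1),
                                  date1, NA_character_))) %>%
  group_by(ID) %>%
  mutate(
    # position of the first complete date within the ID (NA if there is none)
    anchor = which(!is.na(parsed))[1],
    # shift the anchor date by 28 days per cycle of distance
    date2 = if_else(is.na(parsed),
                    parsed[anchor] + 28 * (grp - grp[anchor]),
                    parsed)
  ) %>%
  ungroup() %>%
  select(-parsed, -anchor)

res
```

An ID with no complete date at all keeps NA in date2.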

Extracting a date from a column and adding the year if missing in R

I am trying to extract dates from text and create a new column in a dataset. Dates are entered in different formats in column A1 (either mm-dd-yy or mm-dd). I need to find a way to identify the date in column A1 and then add the year if it is missing. Thus far, I have been able to extract the date regardless of the format; however, when I use as.Date on the new column A2, the date with mm-dd format becomes <NA>. I am aware that there might not be a direct solution for this situation, but a workaround (generalizable to a larger data set) would be great. The year would go from September 2019 to August 2020. Additionally, I am not sure why the format I use within the as.Date function is unable to control how the date gets displayed. This latter issue is not that important, but I am surprised by the behavior of the R function. A solution in tidyverse would be much appreciated.
library(tidyverse)
library(stringr)
db <- data.frame(A1 = c("review 11/18", "begins 12/4/19", "3/5/20", NA, "deadline 09/5/19", "9/3"))
db %>% mutate(A2 = str_extract(A1, "[0-9/0-9]+"))
# A1 A2
#1 review 11/18 11/18
#2 begins 12/4/19 12/4/19
#3 3/5/20 3/5/20
#4 <NA> <NA>
#5 deadline 09/5/19 09/5/19
#6 9/3 9/3
db %>% mutate(A2 = str_extract(A1, "[0-9/0-9]+")) %>%
mutate(A2 = A2 %>% as.Date(., "%m/%d/%y"))
# A1 A2
# 1 review 11/18 <NA>
# 2 begins 12/4/19 2019-12-04
# 3 3/5/20 2020-03-05
# 4 <NA> <NA>
# 5 deadline 09/5/19 2019-09-05
# 6 9/3 <NA>
Perhaps:
library(tidyverse)
db <- data.frame(A1 = c("review 11/18", "begins 12/4/19", "3/5/20", NA, "deadline 09/5/19", "9/3"))
# the year runs from September 2019 to August 2020
(db <-
db %>%
mutate(A2 = str_extract(A1, '[\\d\\d/]+'),
A2 = if_else(str_count(A2, '/') == 1 & as.numeric(str_extract(A2, '\\d+')) > 8, paste0(A2, '/19'), A2),
A2 = if_else(str_count(A2, '/') == 1 & as.numeric(str_extract(A2, '\\d+')) <= 8, paste0(A2, '/20'), A2),
A2 = as.Date(A2, "%m/%d/%y")) )
#> A1 A2
#> 1 review 11/18 2019-11-18
#> 2 begins 12/4/19 2019-12-04
#> 3 3/5/20 2020-03-05
#> 4 <NA> <NA>
#> 5 deadline 09/5/19 2019-09-05
#> 6 9/3 2019-09-03
Created on 2021-11-21 by the reprex package (v2.0.1)
Well, this is not a beautiful, concise, or tidyverse solution, but it does work and should be flexible in its modularity.
library(tidyverse)
db <- data.frame(A1 = c("review 11/18", "begins 12/4/19", "3/5/20", NA, "deadline 09/5/19", "9/3"))
db <- db %>% mutate(A2 = str_extract(A1, "[0-9/]+"))
parts <- str_split(db$A2, "/", n = 3)
test1 <- lengths(parts) # 2 components means the year is missing
test2 <- sapply(parts, function(x) as.numeric(x[1])) # first component is the month
needs_year <- !is.na(db$A2) & test1 == 2
# September to December belong to 2019, January to August to 2020
db$A2 <- ifelse(test = needs_year & test2 >= 9, yes = paste0(db$A2, "/19"), no = db$A2)
db$A2 <- ifelse(test = needs_year & test2 < 9, yes = paste0(db$A2, "/20"), no = db$A2)
db <- db %>% mutate(A2 = A2 %>% as.Date(., "%m/%d/%y"))
db
A1 A2
1 review 11/18 2019-11-18
2 begins 12/4/19 2019-12-04
3 3/5/20 2020-03-05
4 <NA> <NA>
5 deadline 09/5/19 2019-09-05
6 9/3 2019-09-03
I like the rematch2 package for many regex scenarios.
The first pattern tries to match the full m/d/y values. The second pattern tries to match the partial m/d values (furthermore, it separates the month from the day, so it can determine if it should be 2019 or 2020).
Once those pieces are isolated, the rest is just a sequence of small steps.
db |>
rematch2::bind_re_match(from = A1, "^.*?(?<mdy>\\d{1,2}/\\d{1,2}/\\d{2})$") |>
rematch2::bind_re_match(from = A1, "^.*?(?<md_m>\\d{1,2})/(?<md_d>\\d{1,2})$") |>
dplyr::mutate(
md_m = as.integer(md_m),
md_y = dplyr::if_else(9L <= md_m, "19", "20"), # It's 2019 if the month is Sept or later
md = sprintf("%i/%s/%s", md_m, md_d, md_y), # Assemble components
md = as.Date(md , "%m/%d/%y"), # Convert data type
mdy = as.Date(mdy, "%m/%d/%y"), # Convert data type
date = dplyr::coalesce(mdy, md), # Prefer the mdy if it's not missing
)
Output:
A1 mdy md_m md_d md_y md date
1 review 11/18 <NA> 11 18 19 2019-11-18 2019-11-18
2 begins 12/4/19 2019-12-04 4 19 20 2020-04-19 2019-12-04
3 3/5/20 2020-03-05 5 20 20 2020-05-20 2020-03-05
4 <NA> <NA> NA <NA> <NA> <NA> <NA>
5 deadline 09/5/19 2019-09-05 5 19 20 2020-05-19 2019-09-05
6 9/3 <NA> 9 3 19 2019-09-03 2019-09-03
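On the side question of why the format passed to as.Date does not control how the date is displayed: that argument only describes how the input string is parsed. A Date is stored as a number of days and prints as YYYY-MM-DD by default; to show it differently, format() turns it back into a character string. A small sketch:

```r
# The format argument tells as.Date how to read the input string
x <- as.Date("12/4/19", format = "%m/%d/%y")

class(x)   # "Date"
print(x)   # printed in ISO form, whatever the input looked like

# format() controls the display and returns a character vector
shown <- format(x, "%m/%d/%y")
shown
```

Note that the formatted value is no longer a Date, so it is best kept for display only.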

Creating a statement to check multiple dates between a start and end date

I have a dataframe like this in R:
Start date  End date    Date 1      Date 2      Date 3      Date 4
11/12/2018  29/11/2019  08/03/2021  NA          NA          NA
07/03/2018  24/04/2019  08/03/2021  12/09/2016  NA          NA
04/06/2018  23/04/2019  08/03/2021  02/10/2017  05/10/2018  NA
26/07/2018  29/08/2019  08/03/2021  03/08/2015  02/10/2017  23/01/2017
I want to create a new column in R that says: If Date 1, Date 2, Date 3 or Date 4 is between Start Date and End date, it should return 1, 0 otherwise, as the table below:
Start date  End date    Date 1      Date 2      Date 3      Date 4      Change
11/12/2018  29/11/2019  08/03/2021  NA          NA          NA          0
07/03/2018  24/04/2019  08/03/2021  12/09/2016  NA          NA          0
04/06/2018  23/04/2019  08/03/2021  02/10/2017  05/10/2018  NA          1
26/07/2018  29/08/2019  08/03/2021  03/08/2015  02/10/2017  23/01/2017  0
Does anyone have a suggestion on how to solve this? Thank you :)
It'll make it much easier for people to help you if you can post code / data which we can run directly. The easiest way to do this is to use a handy R function called dput, which generates instructions to exactly recreate any R object. So you might run dput(MY_DATA), or if your data is much larger than needed to demonstrate your question, dput(head(MY_DATA)) to get the first six rows, and paste the output of that into your question. </PSA>
Here's code to generate your example data:
my_data <- data.frame(
stringsAsFactors = FALSE,
Start.date = c("11/12/2018", "07/03/2018", "04/06/2018", "26/07/2018"),
End.date = c("29/11/2019", "24/04/2019", "23/04/2019", "29/08/2019"),
Date.1 = c("08/03/2021", "08/03/2021", "08/03/2021", "08/03/2021"),
Date.2 = c(NA, "12/09/2016", "02/10/2017", "03/08/2015"),
Date.3 = c(NA, NA, "05/10/2018", "02/10/2017"),
Date.4 = c(NA, NA, NA, "23/01/2017")
)
Here's a tidyverse approach to first convert your day/month/year dates into R's Date type using lubridate::dmy, then to compare each of Date.1 through Date.4 against your start and end dates, and finally to show whether there are any 1's (within range).
library(dplyr); library(lubridate)
my_data %>%
mutate(across(.fns = ~dmy(.x))) %>%
mutate(across(.cols = starts_with("Date"),
.fns = ~coalesce(.x >= Start.date & .x <= End.date, FALSE)*1)) %>%
mutate(Change = pmax(Date.1, Date.2, Date.3, Date.4))
coalesce(..., FALSE) used here to treat NA like FALSE.
(...)*1 to convert TRUE/FALSE to 1/0.
pmax(...) to grab the largest of the 1/0's, i.e. "are there any 1's?"
Edit: alternative to leave Date columns intact:
my_data %>%
mutate(across(.fns = ~dmy(.x))) %>%
mutate(across(.cols = starts_with("Date"),
.names = "Check_{.col}",
.fns = ~coalesce(.x >= Start.date & .x <= End.date, FALSE)*1)) %>%
rowwise() %>%
mutate(Change = max(c_across(starts_with("Check")))) %>%
select(-starts_with("Check"))
Start.date End.date Date.1 Date.2 Date.3 Date.4 Change
<date> <date> <date> <date> <date> <date> <dbl>
1 2018-12-11 2019-11-29 2021-03-08 NA NA NA 0
2 2018-03-07 2019-04-24 2021-03-08 2016-09-12 NA NA 0
3 2018-06-04 2019-04-23 2021-03-08 2017-10-02 2018-10-05 NA 1
4 2018-07-26 2019-08-29 2021-03-08 2015-08-03 2017-10-02 2017-01-23 0
library(tidyverse)
library(lubridate)
df <- read.table(textConnection("start_date;end_date;date_1;date_2;date_3;date_4
11/12/2018;29/11/2019;08/03/2021;NA;NA;NA
07/03/2018;24/04/2019;08/03/2021;12/09/2016;NA;NA
04/06/2018;23/04/2019;08/03/2021;02/10/2017;05/10/2018;NA
26/07/2018;29/08/2019;08/03/2021;03/08/2015;02/10/2017;23/01/2017"),
sep=";",
header = TRUE)
df %>%
mutate(
across(everything(), lubridate::dmy),
change = ((date_1 > start_date & date_1 < end_date) |
(date_2 > start_date & date_2 < end_date) |
(date_3 > start_date & date_3 < end_date) |
(date_4 > start_date & date_4 < end_date)
) %>%
coalesce(FALSE) %>%
as.integer()
)
#> start_date end_date date_1 date_2 date_3 date_4 change
#> 1 2018-12-11 2019-11-29 2021-03-08 <NA> <NA> <NA> 0
#> 2 2018-03-07 2019-04-24 2021-03-08 2016-09-12 <NA> <NA> 0
#> 3 2018-06-04 2019-04-23 2021-03-08 2017-10-02 2018-10-05 <NA> 1
#> 4 2018-07-26 2019-08-29 2021-03-08 2015-08-03 2017-10-02 2017-01-23 0
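For comparison, the same check can be sketched in base R without any packages. This assumes the my_data layout generated above (character dates in day/month/year order) and treats the interval as inclusive of the start and end dates; the names dates and in_range are made up for the example.

```r
my_data <- data.frame(
  stringsAsFactors = FALSE,
  Start.date = c("11/12/2018", "07/03/2018", "04/06/2018", "26/07/2018"),
  End.date = c("29/11/2019", "24/04/2019", "23/04/2019", "29/08/2019"),
  Date.1 = c("08/03/2021", "08/03/2021", "08/03/2021", "08/03/2021"),
  Date.2 = c(NA, "12/09/2016", "02/10/2017", "03/08/2015"),
  Date.3 = c(NA, NA, "05/10/2018", "02/10/2017"),
  Date.4 = c(NA, NA, NA, "23/01/2017")
)

# Parse every column as a Date (day/month/year); result is a list of Date vectors
dates <- lapply(my_data, as.Date, format = "%d/%m/%Y")

# One logical column per Date.* column: is the value inside [start, end]?
in_range <- sapply(dates[grep("^Date", names(dates))],
                   function(d) d >= dates$Start.date & d <= dates$End.date)

# A row gets 1 if any of its dates is in range; NA comparisons are ignored
my_data$Change <- as.integer(rowSums(in_range, na.rm = TRUE) > 0)

my_data$Change
```

The original character columns are left untouched; only the Change column is added.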

Creating column of 0 and 1 based on inequalities of three date columns

I would like to create a column of 0s and 1s based on inequalities of three columns of dates.
The idea is the following. If event_date is before death_date or study_over, the column event should be == 1; if event_date occurs after death_date or study_over, event should be == 0. Both event_date and death_date may contain NAs.
set.seed(1337)
rand_dates <- Sys.Date() - 365:1
df <-
data.frame(
event_date = sample(rand_dates, 20),
death_date = sample(rand_dates, 20),
study_over = sample(rand_dates, 20)
)
My attempt was the following
eventR <-
function(x, y, z){
if(is.na(y)){
ifelse(x <= z, 1, 0)
} else if(y <= z){
ifelse(x < y, 1, 0)
} else {
ifelse(x <= z, 1, 0)
}
}
I use it in the following manner
library(dplyr)
df[c(3, 5, 7), "event_date"] <- NA #there are some NA in .$event_date
df[c(3, 4, 6), "death_date"] <- NA #there are some NA in .$death_date
df %>%
mutate(event = sapply(.$event_date, eventR, y = .$death_date, z = .$study_over))
##Error: wrong result size (400), expected 20 or 1
##In addition: There were 40 warnings (use warnings() to see them)
I can't figure out how to do this. Any suggestions?
This would seem to construct a binary column (with NA's where needed) where 1 indicates "event_date is before death_date or study_over" and 0 is used elsewhere. As already pointed out your specification does not cover all cases:
df$event <- with(df, as.numeric( event_date < pmax( death_date , study_over) ) )
df
Can use pmap_dbl() from the purrr package instead of sapply...
library(dplyr)
library(purrr)
df %>% mutate(event = pmap_dbl(list(event_date, death_date, study_over), eventR))
event_date death_date study_over event
1 2016-10-20 2017-01-27 2016-12-16 1
2 2016-10-15 2016-12-12 2017-01-20 1
3 <NA> <NA> 2016-10-09 NA
4 2016-09-04 <NA> 2016-11-17 1
5 <NA> 2016-10-13 2016-06-09 NA
6 2016-07-21 <NA> 2016-04-26 0
7 <NA> 2017-02-21 2016-07-12 NA
8 2016-07-02 2017-02-08 2016-08-24 1
9 2016-06-19 2016-09-07 2016-04-11 0
10 2016-05-14 2017-03-13 2016-08-03 1
11 2017-03-06 2017-02-05 2017-02-28 0
12 2017-03-10 2016-04-28 2016-11-30 0
13 2017-01-10 2016-12-10 2016-10-27 0
14 2016-05-31 2016-06-12 2016-08-13 1
15 2017-03-03 2016-12-25 2016-12-20 0
16 2016-04-01 2016-11-03 2016-06-30 1
17 2017-02-26 2017-02-25 2016-05-12 0
18 2017-02-08 2016-12-08 2016-10-14 0
19 2016-07-19 2016-07-03 2016-09-22 0
20 2016-06-17 2016-06-06 2016-11-09 0
You might also be interested in the dplyr function, case_when() for handling many if else statements.
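As a rough illustration of that suggestion, the branches of eventR can be written as a single vectorised case_when() call, which avoids sapply() and pmap_dbl() altogether. This is a sketch on a small hand-made data frame rather than the random data above, and it assumes the same branch order as eventR.

```r
library(dplyr)

# Hand-made example covering each branch, including NAs
df <- data.frame(
  event_date = as.Date(c("2016-10-20", NA, "2016-09-04", "2016-07-21", "2017-03-06")),
  death_date = as.Date(c("2017-01-27", "2016-10-13", NA, NA, "2017-02-05")),
  study_over = as.Date(c("2016-12-16", "2016-06-09", "2016-11-17", "2016-04-26", "2017-02-28"))
)

res <- df %>%
  mutate(event = case_when(
    # no event date: the outcome is unknown
    is.na(event_date) ~ NA_real_,
    # no death date: compare against the end of the study
    is.na(death_date) ~ as.numeric(event_date <= study_over),
    # death before the study ends: the event must precede death
    death_date <= study_over ~ as.numeric(event_date < death_date),
    # otherwise the end of the study is the cut-off
    TRUE ~ as.numeric(event_date <= study_over)
  ))

res$event
```

The conditions are checked in order for every row, so the first one that is TRUE decides the value.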
