Subtracting dates across different rows of a data frame - r

I have a big data frame (dim: 12867779x5) which looks like this:
id   group  date1       date2       icf
id1  2      2020-03-17  2019-06-05
id1  3      2020-04-03  2019-05-09
id2  2      2020-04-10  2019-07-04
id2  3      2021-04-01  2020-06-01
id3  1      2020-04-13  2019-07-07
id3  2      2021-04-10  2020-06-01
id3  3      2020-04-10  2019-07-04
id3  4      2021-04-13  2020-06-01
Desired output:
id   group  date1       date2       icf
id1  3      2020-04-03  2019-05-09  0
id2  2      2020-04-10  2019-07-04  52
id2  3      2021-04-01  2020-06-01  0
id3  1      2020-04-13  2019-07-07  49
id3  2      2021-04-10  2020-06-01  -646
id3  3      2020-04-10  2019-07-04  52
id3  4      2021-04-13  2020-06-01  0
To calculate icf I need to check whether the ids in row i and row i+1 are the same. If yes, icf = date2(i+1) - date1(i).
I wrote this function to calculate icf, but it's too slow. I'm looking for ways to speed it up; I was thinking about using the apply function, but I have no idea how to rewrite this icfCalculation function.
icfCalculation <- function(dataFrame){
  nr <- nrow(dataFrame) - 1
  for (i in 1:nr) {
    if (dataFrame[i, 1] == dataFrame[i + 1, 1]) {
      dataFrame[i, 5] <- dataFrame[i + 1, 4] - dataFrame[i, 3]
    } else {
      dataFrame[i, 5] <- 0
    }
  }
  return(dataFrame)
}

Thanks for putting the expected output. This is not the same as what you have put - but it does give the same results as your function, and should be significantly quicker thanks to the data.table internal optimisations:
library(data.table)
# Read in data
dat <- read.table(text = "id group date1 date2
id1 2 2020-03-17 2019-06-05
id1 3 2020-04-03 2019-05-09
id2 2 2020-04-10 2019-07-04
id2 3 2021-04-1 2020-06-01
id3 1 2020-04-13 2019-07-07
id3 2 2021-04-10 2020-06-01
id3 3 2020-04-10 2019-07-04
id3 4 2021-04-13 2020-06-01",
header = TRUE,
colClasses = c("character", "character", "Date", "Date")
)
# Make it a data.table
setDT(dat)
dat[, icf := fifelse(
  id == shift(id, type = "lead"),
  as.integer(shift(date2, type = "lead") - date1),
  0
)]
dat
# id group date1 date2 icf
# 1: id1 2 2020-03-17 2019-06-05 -313
# 2: id1 3 2020-04-03 2019-05-09 0
# 3: id2 2 2020-04-10 2019-07-04 52
# 4: id2 3 2021-04-01 2020-06-01 0
# 5: id3 1 2020-04-13 2019-07-07 49
# 6: id3 2 2021-04-10 2020-06-01 -646
# 7: id3 3 2020-04-10 2019-07-04 52
# 8: id3 4 2021-04-13 2020-06-01 NA
If you want the last NA to be 0, just add dat$icf[nrow(dat)] <- 0.
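Alternatively, fifelse() has an na argument that covers the trailing NA directly, so no extra assignment is needed. A minimal self-contained sketch (the three-row dat here is illustrative, not the OP's data):

```r
library(data.table)

dat <- data.table(id = c("id1", "id1", "id2"),
                  date1 = as.Date(c("2020-03-17", "2020-04-03", "2020-04-10")),
                  date2 = as.Date(c("2019-06-05", "2019-05-09", "2019-07-04")))

# na = 0L handles the last row, where shift(..., type = "lead") yields NA
dat[, icf := fifelse(
  id == shift(id, type = "lead"),
  as.integer(shift(date2, type = "lead") - date1),
  0L,
  na = 0L
)]
dat$icf
# -313 0 0
```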

library(dplyr)
library(tidyr)
df %>%
  mutate(icf = replace_na(ifelse(id == lead(id), lead(date2) - date1, 0), 0))
Rather than use tidyr::replace_na you could also specify the default argument of lead.
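One way to read that suggestion, as a sketch: give lead(id) a default that can never match an id, so the last row of each data frame falls through to 0 on its own. The default = "" value here is an arbitrary non-matching string, not something from the question.

```r
library(dplyr)

df <- data.frame(
  id = c("id1", "id1", "id2"),
  date1 = as.Date(c("2020-03-17", "2020-04-03", "2020-04-10")),
  date2 = as.Date(c("2019-06-05", "2019-05-09", "2019-07-04"))
)

# default = "" makes the last comparison FALSE, so no replace_na() is needed
df %>%
  mutate(icf = ifelse(id == lead(id, default = ""),
                      as.numeric(lead(date2) - date1),
                      0))
```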
Base R
A base R approach would be something like:
df$icf <- with(df, ifelse(id == c(id[2:nrow(df)], NA), c(date2[2:nrow(df)], NA) - date1, 0))
Output
id group date1 date2 icf
1 id1 2 2020-03-17 2019-06-05 -313
2 id1 3 2020-04-03 2019-05-09 0
3 id2 2 2020-04-10 2019-07-04 52
4 id2 3 2021-04-01 2020-06-01 0
5 id3 1 2020-04-13 2019-07-07 49
6 id3 2 2021-04-10 2020-06-01 -646
7 id3 3 2020-04-10 2019-07-04 52
8 id3 4 2021-04-13 2020-06-01 0

Related

How to create a new column that counts the number of occurrences of a value in another column and orders them by date

I have a 2 column data frame with "date" and "ID" headings. Some IDs are listed more than once. I want to create a new column "Attempt" that denotes the number of attempts that each ID has taken, ordered by the date of occurrence.
Here is my sample data:
ID <- c(1,2,5,8,4,9,1,11,15,32,54,1,4,2,14)
Date <- c("2021-04-12", "2021-04-12", "2021-04-13", "2021-04-14", "2021-04-19",
"2021-04-19", "2021-04-20", "2021-04-21", "2021-04-22", "2021-04-28",
"2021-04-28", "2021-04-29", "2021-04-29", "2021-05-06", "2021-05-07")
Data <- data.frame(ID, Date)
Data$Date <- as.Date(Data$Date, format="%Y-%m-%d")
I tried various iterations of duplicated(). I can remove all duplicates or make every instance of a duplicated value "2" or "3" for example, but I want each occurrence to be ordered based on the date of the attempt taken.
Here is my expected result column to be added onto the original data frame:
Attempt <- c(1,1,1,1,1,1,2,1,1,1,1,3,2,2,1)
Data %>%
  group_by(ID) %>%
  mutate(Attempt = row_number())
ID Date Attempt
1 1 2021-04-12 1
2 2 2021-04-12 1
3 5 2021-04-13 1
4 8 2021-04-14 1
5 4 2021-04-19 1
6 9 2021-04-19 1
7 1 2021-04-20 2
8 11 2021-04-21 1
9 15 2021-04-22 1
10 32 2021-04-28 1
11 54 2021-04-28 1
12 1 2021-04-29 3
13 4 2021-04-29 2
14 2 2021-05-06 2
15 14 2021-05-07 1
If you have a recent version of dplyr (>= 1.1.0) use
Data %>%
  mutate(Attempt = row_number(), .by = ID)
Using data.table
library(data.table)
setDT(Data)[, Attempt := rowid(ID)]
-output
> Data
ID Date Attempt
1: 1 2021-04-12 1
2: 2 2021-04-12 1
3: 5 2021-04-13 1
4: 8 2021-04-14 1
5: 4 2021-04-19 1
6: 9 2021-04-19 1
7: 1 2021-04-20 2
8: 11 2021-04-21 1
9: 15 2021-04-22 1
10: 32 2021-04-28 1
11: 54 2021-04-28 1
12: 1 2021-04-29 3
13: 4 2021-04-29 2
14: 2 2021-05-06 2
15: 14 2021-05-07 1
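For completeness, a base R sketch of the same idea (not from the answers above): ave() applies seq_along() within each ID group, which numbers the attempts in row order.

```r
ID <- c(1, 2, 5, 8, 4, 9, 1, 11, 15, 32, 54, 1, 4, 2, 14)
Date <- c("2021-04-12", "2021-04-12", "2021-04-13", "2021-04-14", "2021-04-19",
          "2021-04-19", "2021-04-20", "2021-04-21", "2021-04-22", "2021-04-28",
          "2021-04-28", "2021-04-29", "2021-04-29", "2021-05-06", "2021-05-07")
Data <- data.frame(ID, Date)

# seq_along() within each ID gives the running attempt count; this assumes
# the rows are already sorted by Date, as in the sample data
Data$Attempt <- ave(seq_along(Data$ID), Data$ID, FUN = seq_along)
Data$Attempt
# 1 1 1 1 1 1 2 1 1 1 1 3 2 2 1
```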

Converting variable with 5 digit numbers and dates into date values

I have the following data, which contains some date values stored as 5-digit character values. When I try to convert the column to dates, the values that are already proper dates change to NA.
dt <- data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2),
                 Registrationdate = c('2019-01-09','2019-01-09','2019-01-09','2019-01-09','2019-01-09',
                                      '2019-01-09',"44105","44105","44105","44105","44105"))
Expected value
id Registrationdate
1 1 2019-01-09
2 1 2019-01-09
3 1 2019-01-09
4 1 2019-01-09
5 1 2019-01-09
6 1 2019-01-09
7 2 2020-10-01
8 2 2020-10-01
9 2 2020-10-01
10 2 2020-10-01
11 2 2020-10-01
I tried using
library(openxlsx)
dt$Registrationdate <- convertToDate(dt$Registrationdate, origin = "1900-01-01")
But I got
1 1 <NA>
2 1 <NA>
3 1 <NA>
4 1 <NA>
5 1 <NA>
6 1 <NA>
7 2 2020-10-01
8 2 2020-10-01
9 2 2020-10-01
10 2 2020-10-01
11 2 2020-10-01
Here's one approach using a mix of dplyr and base R:
library(dplyr, warn = FALSE)
dt |>
  mutate(Registrationdate = if_else(grepl("-", Registrationdate),
                                    as.Date(Registrationdate),
                                    openxlsx::convertToDate(Registrationdate, origin = "1900-01-01")))
#> Warning in openxlsx::convertToDate(Registrationdate, origin = "1900-01-01"): NAs
#> introduced by coercion
#> id Registrationdate
#> 1 1 2019-01-09
#> 2 1 2019-01-09
#> 3 1 2019-01-09
#> 4 1 2019-01-09
#> 5 1 2019-01-09
#> 6 1 2019-01-09
#> 7 2 2020-10-01
#> 8 2 2020-10-01
#> 9 2 2020-10-01
#> 10 2 2020-10-01
#> 11 2 2020-10-01
Created on 2022-10-15 with reprex v2.0.2
library(janitor)
dt$Registrationdate <- convert_to_date(dt$Registrationdate)
id Registrationdate
1 1 2019-01-09
2 1 2019-01-09
3 1 2019-01-09
4 1 2019-01-09
5 1 2019-01-09
6 1 2019-01-09
7 2 2020-10-01
8 2 2020-10-01
9 2 2020-10-01
10 2 2020-10-01
11 2 2020-10-01
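For reference, the same idea can be sketched in base R without extra packages. The key fact is that Excel serial dates use the origin 1899-12-30 (which is what openxlsx and janitor correct for internally); as.numeric() on the ISO strings emits a coercion warning here, just as in the dplyr answer above.

```r
dt <- data.frame(id = c(1, 2),
                 Registrationdate = c("2019-01-09", "44105"))

# 5-digit serial numbers get the Excel origin; ISO strings are parsed directly
dt$Registrationdate <- as.Date(ifelse(
  grepl("^\\d{5}$", dt$Registrationdate),
  format(as.Date(as.numeric(dt$Registrationdate), origin = "1899-12-30")),
  dt$Registrationdate
))
dt$Registrationdate
# "2019-01-09" "2020-10-01"
```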
Another option is to import the columns in the expected format. An example with openxlsx2 is shown below. The top half creates a file that reproduces the behavior you see with openxlsx. This happens because some of the rows in the Registrationdate column are formatted as dates and some as strings, a fairly common error introduced by whoever generated the xlsx input.
With openxlsx2 you can define the type of each column you want to import. The option was inspired by readxl (iirc).
library(openxlsx2)
## prepare data
date_as_string <- data.frame(
  id = rep(1, 6),
  Registrationdate = rep('2019-01-09', 6)
)
date_as_date <- data.frame(
  id = rep(2, 5),
  Registrationdate = rep(as.Date('2019-01-10'), 5)
)
options(openxlsx2.dateFormat = "yyyy-mm-dd")
wb <- wb_workbook()$
  add_worksheet()$
  add_data(x = date_as_string)$
  add_data(x = date_as_date, colNames = FALSE, startRow = 7)
#wb$open()
## read data as date
dt <- wb_to_df(wb, types = c(id = 1, Registrationdate = 2))
## check that Registrationdate is actually a Date column
str(dt$Registrationdate)
#> Date[1:10], format: "2019-01-09" "2019-01-09" "2019-01-09" "2019-01-09" "2019-01-09" ...

Select rows based on multiple conditions from two independent databases

I have two independent datasets; one contains an event date. Each ID has only one "Eventdate". As follows:
data1 <- data.frame("ID" = c(1,2,3,4,5,6), "Eventdate" = c("2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01", "2019-05-01", "2019-06-01"))
data1
ID Eventdate
1 1 2019-01-01
2 2 2019-02-01
3 3 2019-03-01
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
In the other dataset, one ID can have multiple event names (Eventcode), each with its own event date (Eventdate). As follows:
data2 <- data.frame("ID" = c(1,1,2,3,3,3,4,4,7), "Eventcode"=c(201,202,201,204,205,206,209,208,203),"Eventdate" = c("2019-01-01", "2019-01-01", "2019-02-11", "2019-02-15", "2019-03-01", "2019-03-15", "2019-03-10", "2019-03-20", "2019-06-02"))
data2
ID Eventcode Eventdate
1 1 201 2019-01-01
2 1 202 2019-01-01
3 2 201 2019-02-11
4 3 204 2019-02-15
5 3 205 2019-03-01
6 3 206 2019-03-15
7 4 209 2019-03-10
8 4 208 2019-03-20
9 7 203 2019-06-02
The two datasets are linked by ID. The IDs in the two datasets are not all the same.
I would like to select cases in data2 with conditions:
Match by ID
Eventdate in data2 >= Eventdate in data1.
If one ID has multiple Eventdates in data2, select the earliest one.
If one ID has multiple Eventcodes at one Eventdate in data2, just randomly select one.
Then merge the selected data2 into data1.
Expected results as follows:
data1
ID Eventdate Eventdate.data2 Eventcode
1 1 2019-01-01 2019-01-01 201
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
or
data1
ID Eventdate Eventdate.data2 Eventcode
1 1 2019-01-01 2019-01-01 202
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
Thank you very very much!
You can try this approach:
library(dplyr)
left_join(data1, data2, by = 'ID') %>%
  group_by(ID, Eventdate.x) %>%
  summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
            Eventcode = {
              inds <- Eventdate.y >= Eventdate.x
              val <- sum(inds, na.rm = TRUE)
              if (val == 1) Eventcode[inds]
              else if (val > 1) sample(Eventcode[inds], 1)
              else NA_real_
            })
# ID Eventdate.x Eventdate Eventcode
# <dbl> <chr> <chr> <dbl>
#1 1 2019-01-01 2019-01-01 201
#2 2 2019-02-01 2019-02-11 201
#3 3 2019-03-01 2019-03-01 205
#4 4 2019-04-01 NA NA
#5 5 2019-05-01 NA NA
#6 6 2019-06-01 NA NA
The complicated logic for Eventcode is there for randomness; if you are OK selecting the 1st value, as is done for Eventdate, you can simplify it to:
left_join(data1, data2, by = 'ID') %>%
  group_by(ID, Eventdate.x) %>%
  summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
            Eventcode = Eventcode[Eventdate.y >= Eventdate.x][1])
Does this work:
library(dplyr)
data1 %>%
  rename(Eventdate_dat1 = Eventdate) %>%
  left_join(data2, by = 'ID') %>%
  group_by(ID) %>%
  filter(Eventdate >= Eventdate_dat1) %>%
  mutate(Eventdate = case_when(length(unique(Eventdate)) > 1 ~ min(Eventdate), TRUE ~ Eventdate),
         Eventcode = case_when(length(unique(Eventcode)) > 1 ~ min(Eventcode), TRUE ~ Eventcode)) %>%
  distinct() %>%
  right_join(data1, by = 'ID') %>%
  select(ID, 'Eventdate' = Eventdate.y, 'Eventdate.data2' = Eventdate.x, Eventcode)
# A tibble: 6 x 4
# Groups: ID [6]
ID Eventdate Eventdate.data2 Eventcode
<dbl> <chr> <chr> <dbl>
1 1 2019-01-01 2019-01-01 201
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01 NA NA
5 5 2019-05-01 NA NA
6 6 2019-06-01 NA NA
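The same logic can be sketched in data.table with a non-equi join and mult = "first". This assumes data2 is keyed (sorted) by ID and Eventdate so that "first" means earliest, and it picks the first Eventcode on ties rather than a random one; the column name Eventdate.data2 just mirrors the OP's expected output.

```r
library(data.table)

d1 <- data.table(ID = c(1, 2, 3, 4, 5, 6),
                 Eventdate = as.Date(c("2019-01-01", "2019-02-01", "2019-03-01",
                                       "2019-04-01", "2019-05-01", "2019-06-01")))
d2 <- data.table(ID = c(1, 1, 2, 3, 3, 3, 4, 4, 7),
                 Eventcode = c(201, 202, 201, 204, 205, 206, 209, 208, 203),
                 Eventdate = as.Date(c("2019-01-01", "2019-01-01", "2019-02-11",
                                       "2019-02-15", "2019-03-01", "2019-03-15",
                                       "2019-03-10", "2019-03-20", "2019-06-02")))

setkey(d2, ID, Eventdate)  # sort so the first match is the earliest date
res <- d2[d1,
          on = .(ID, Eventdate >= Eventdate),  # d2's date >= d1's date
          mult = "first",
          .(ID, Eventdate = i.Eventdate, Eventdate.data2 = x.Eventdate, Eventcode)]
```

IDs with no qualifying observation (4, 5, 6) come back with NA, matching the blank cells in the expected output.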

R data manipulation - Rank condition in data.table / dplyr

In R I am conducting analyses on df1 but I also need to pull data from the more detailed records / observations in df2 and attach to df1 based on certain conditions.
This is sample data comparable to my own:
df1 <- data.frame(id=c(1,2,3,3,3,4,4,5),
location=c("a", "a" , "a", "b" , "b" , "a", "a" ,"a" ),
actiontime=c("2020-03-10" , "2020-02-17" , "2020-04-22" , "2020-04-19" , "2020-04-20" , "2020-04-22" , "2020-03-02" , "2020-05-07" ) )
df2 <- data.frame(id=c(1,1,1, 2,2,2, 3,3,3,3,3,3,3,3,3,3,3, 4,4,4,4,4, 5,5,5) ,
observation=c( "2020-03-09 01:00" , "2020-03-09 10:00" , "2020-03-10 05:00", "2020-02-15 08:00" , "2020-02-16 09:00" , "2020-02-17 08:00", "2020-04-16 14:30", "2020-04-16 07:30" , "2020-04-17 15:00" , "2020-04-25 07:20" , "2020-04-18 10:00" , "2020-04-19 10:30", "2020-04-20 12:00", "2020-04-21 12:00" , "2020-04-22 09:30" , "2020-04-24 23:00", "2020-04-23 17:30", "2020-03-01 08:00" , "2020-03-02 08:00" , "2020-03-03 08:00" , "2020-03-15 16:45" , "2020-03-16 08:00" , "2020-05-05 13:45" , "2020-05-06 08:00" , "2020-05-07 11:00") ,
var1=round(runif(25, min=10, max=60),0) ,
var2=c("Red" , "Blue" , "Yellow" , NA , "Yellow" , "Blue" , "Red" , "Yellow" , NA , NA , "Yellow" , NA , NA , NA , NA , NA , "Blue", NA , "Blue" , "Yellow" , NA , "Blue" , "Yellow" , "Red" , "Blue") )
For example, how can I do the following procedures (preferably with data.table, but a dplyr demonstration is also welcome):
Q1. Suppose I define the following rank: Blue > Red > Yellow. How can I then get the highest-ranked color in df2$var2 among the observations with the same id (if any), attached as a new variable to df1 by the respective id?
Q2. In addition to the rank as in Q1, how do I add a condition to only select var2 if the observation happens a day before actiontime in df1?
Q3. And to learn even more - how can the data that was pulled out of df2 and joined to df1 in Q1 instead be recorded on the row with the earliest observation per id in df2 - meaning just working on df2, not involving df1 (and the join)?
The output for Q3 would be something like this:
id observation var1 var2 color
1 1 2020-03-09 01:00 37 Red Blue
2 1 2020-03-09 10:00 35 Blue <NA>
3 1 2020-03-10 05:00 27 Yellow <NA>
4 2 2020-02-15 08:00 21 <NA> Yellow
5 2 2020-02-16 09:00 37 Yellow <NA>
6 2 2020-02-17 08:00 38 Blue <NA>
7 3 2020-04-16 14:30 56 Red <NA>
8 3 2020-04-16 07:30 35 Yellow Red
9 3 2020-04-17 15:00 40 <NA> <NA>
10 3 2020-04-25 07:20 20 <NA> <NA>
11 3 2020-04-18 10:00 49 <NA> <NA>
12 3 2020-04-19 10:30 58 <NA> <NA>
13 3 2020-04-20 12:00 37 <NA> <NA>
14 3 2020-04-21 12:00 25 <NA> <NA>
15 3 2020-04-22 09:30 16 <NA> <NA>
16 3 2020-04-24 23:00 52 <NA> <NA>
17 3 2020-04-23 17:30 46 Blue <NA>
18 4 2020-03-01 08:00 16 <NA> Blue
19 4 2020-03-02 08:00 14 Blue <NA>
20 4 2020-03-03 08:00 21 Yellow <NA>
21 4 2020-03-15 16:45 52 <NA> <NA>
22 4 2020-03-16 08:00 40 Blue <NA>
23 5 2020-05-05 13:45 13 Yellow Red
24 5 2020-05-06 08:00 12 Red <NA>
25 5 2020-05-07 11:00 11 Blue <NA>
There are 3 questions in one; I will try to answer them one by one.
Question 1
If I understand correctly, the OP wants to identify the highest ranked color in var2 per id and wants to copy the color to a new column in df1 for the matching ids.
This can be solved by turning var2 into an ordered factor, aggregating df2 by id, and adding the result to df1 by an update join:
library(data.table)
setDT(df1)
setDT(df2)
df2[, var2 := ordered(var2, levels = c("Blue", "Red", "Yellow", NA), exclude = NULL)]
str(df2)
Classes ‘data.table’ and 'data.frame': 25 obs. of 5 variables:
$ id : num 1 1 1 2 2 2 3 3 3 3 ...
$ observation: chr "2020-03-09 01:00" "2020-03-09 10:00" "2020-03-10 05:00" "2020-02-15 08:00" ...
$ var1 : num 15 58 12 35 11 25 24 54 14 15 ...
$ var2 : Ord.factor w/ 4 levels "Blue"<"Red"<"Yellow"<..: 2 1 3 4 3 1 2 3 4 4 ...
$ action_day : IDate, format: "2020-03-10" "2020-03-10" "2020-03-11" "2020-02-16" ...
- attr(*, ".internal.selfref")=<externalptr>
So, we can find the highest ranked color per id by using min()
df2[, min(var2, na.rm = TRUE), by = id]
id V1
1: 1 Blue
2: 2 Blue
3: 3 Blue
4: 4 Blue
5: 5 Blue
which is rather trivial because all id groups include Blue in var2.
This can be appended to df1 by an update join
df1[df2[, min(var2, na.rm = TRUE), by = id], on = .(id), color := V1][]
id location actiontime color
1: 1 a 2020-03-10 Blue
2: 2 a 2020-02-17 Blue
3: 3 a 2020-04-22 Blue
4: 3 b 2020-04-19 Blue
5: 3 b 2020-04-20 Blue
6: 4 a 2020-04-22 Blue
7: 4 a 2020-03-02 Blue
8: 5 a 2020-05-07 Blue
Question 2
If I understand correctly, the OP wants to filter df2 so that only those rows are kept where the date of the observation in df2 is exactly one day before an actiontime in df1 (for the same id). This intermediate result is then processed in the same way as df2 in Question 1, above.
The filtering is accomplished by a join operation but requires coercing the character date actiontime and the character date-time observation to a numeric date type for date calculation.
df1[, actiontime := as.IDate(actiontime)]
df2[, action_day := as.IDate(observation) + 1L]
keep_df2_rows <- df2[df1, on = .(id, action_day = actiontime), nomatch = NULL, which = TRUE]
keep_df2_rows
[1] 1 2 5 14 11 12 18 24
keep_df2_rows contains the row numbers of those rows of df2 which fulfil the condition that the observation has happened exactly one day before an actiontime in df1 (for the same id).
Now, we can use the code of question 1 but use keep_df2_rows to filter df2:
df1[df2[keep_df2_rows, min(var2, na.rm = TRUE), by = id]
, on = .(id), color := V1][]
id location actiontime color
1: 1 a 2020-03-10 Blue
2: 2 a 2020-02-17 Yellow
3: 3 a 2020-04-22 Yellow
4: 3 b 2020-04-19 Yellow
5: 3 b 2020-04-20 Yellow
6: 4 a 2020-04-22 <NA>
7: 4 a 2020-03-02 <NA>
8: 5 a 2020-05-07 Red
Question 3
If I understand correctly, the final goal of the OP is to add the color column to df2 instead of df1, with the additional requirement that only the row with the earliest observation within an id is to be updated.
This can be accomplished by an update join with a look-up table lut which contains the colors by id, as described above, and the earliest observation by id:
library(data.table)
setDT(df2)[, var2 := ordered(var2, levels = c("Blue", "Red", "Yellow"))]
setDT(df1)[, actiontime := as.IDate(actiontime)]
df2[, action_day := as.IDate(observation) + 1L]
keep_df2_rows <- df2[df1, on = .(id, action_day = actiontime), nomatch = NULL, which = TRUE]
agg1 <- df2[keep_df2_rows][!is.na(var2), min(var2), by = id]
agg2 <- df2[, .(observation = min(observation)), by = id]
lut <- merge(agg1, agg2, by = "id")
df2[lut, on = .(id, observation), color := as.character(V1)][]
id observation var1 var2 action_day color
1: 1 2020-03-09 01:00 23 Red 2020-03-10 Blue
2: 1 2020-03-09 10:00 29 Blue 2020-03-10 <NA>
3: 1 2020-03-10 05:00 39 Yellow 2020-03-11 <NA>
4: 2 2020-02-15 08:00 55 <NA> 2020-02-16 Yellow
5: 2 2020-02-16 09:00 20 Yellow 2020-02-17 <NA>
6: 2 2020-02-17 08:00 55 Blue 2020-02-18 <NA>
7: 3 2020-04-16 14:30 57 Red 2020-04-17 <NA>
8: 3 2020-04-16 07:30 43 Yellow 2020-04-17 Yellow
9: 3 2020-04-17 15:00 41 <NA> 2020-04-18 <NA>
10: 3 2020-04-25 07:20 13 <NA> 2020-04-26 <NA>
11: 3 2020-04-18 10:00 20 Yellow 2020-04-19 <NA>
12: 3 2020-04-19 10:30 19 <NA> 2020-04-20 <NA>
13: 3 2020-04-20 12:00 44 <NA> 2020-04-21 <NA>
14: 3 2020-04-21 12:00 29 <NA> 2020-04-22 <NA>
15: 3 2020-04-22 09:30 48 <NA> 2020-04-23 <NA>
16: 3 2020-04-24 23:00 35 <NA> 2020-04-25 <NA>
17: 3 2020-04-23 17:30 46 Blue 2020-04-24 <NA>
18: 4 2020-03-01 08:00 60 <NA> 2020-03-02 <NA>
19: 4 2020-03-02 08:00 29 Blue 2020-03-03 <NA>
20: 4 2020-03-03 08:00 49 Yellow 2020-03-04 <NA>
21: 4 2020-03-15 16:45 57 <NA> 2020-03-16 <NA>
22: 4 2020-03-16 08:00 21 Blue 2020-03-17 <NA>
23: 5 2020-05-05 13:45 43 Yellow 2020-05-06 Red
24: 5 2020-05-06 08:00 16 Red 2020-05-07 <NA>
25: 5 2020-05-07 11:00 23 Blue 2020-05-08 <NA>
id observation var1 var2 action_day color
Note that the result differs from the example table posted by the OP because OP's definition of df2 is different to the example table.
Also note that I had to modify the computation of agg1 because of an unexpected behaviour of min(var2, na.rm = TRUE) when an id group consists only of NA. (To reproduce the issue, try min(ordered(NA), na.rm = TRUE) vs min(ordered(NA)))

Add a column to data frame n days before an observation

I need a more efficient way to add a marker that flags the observations registered in the 3 days before a specific date. The problem is that these dates are not necessarily consecutive, i.e. some days can be missing, yet I need the marker to ignore the missing days. The example below illustrates the problem and what I need more clearly:
library(tidyverse)
library(lubridate)
df <- data.frame("Date" = c(as_date(0:9)), ID = rep(paste0("ID", 1:3), each = 10))
df <- df[-c(5, 13, 24),]
date_before <- "1970-01-07"
df[, "three_days_before"] <- 0
for(i in df$ID){
  cond <- df[, "ID"] == i &
          df[, "Date"] == date_before
  before_n <- (which(cond) - 3):(which(cond) - 1)
  df[before_n, "three_days_before"] <- 1
}
df
The loop gives me what I need (three days are marked each time regardless their inclusion in the data.frame), yet it takes quite a long time to calculate on a larger data set. Can someone recommend a better way?
1) Apply a rolling window separately for each ID. The rolling window function checks whether any of the next 3 elements of Date equal date_before.
(Specifying a width of list(1:3) says to use offsets 1, 2 and 3 which means the next 3 ahead.) Note that there are no next 3 elements for the last value so we use fill to fill it in. We add 0 to convert from logical to numeric. This solution involves only two lines of code and no explicit looping.
library(zoo)
roll <- function(x) rollapply(x, list(1:3), FUN = any, partial = TRUE, fill = FALSE)
transform(df, before = ave(Date == date_before, ID, FUN = roll) + 0)
giving:
Date ID before
1 1970-01-01 ID1 0
2 1970-01-02 ID1 0
3 1970-01-03 ID1 1
4 1970-01-04 ID1 1
6 1970-01-06 ID1 1
7 1970-01-07 ID1 0
8 1970-01-08 ID1 0
9 1970-01-09 ID1 0
10 1970-01-10 ID1 0
11 1970-01-01 ID2 0
12 1970-01-02 ID2 0
14 1970-01-04 ID2 1
15 1970-01-05 ID2 1
16 1970-01-06 ID2 1
17 1970-01-07 ID2 0
18 1970-01-08 ID2 0
19 1970-01-09 ID2 0
20 1970-01-10 ID2 0
21 1970-01-01 ID3 0
22 1970-01-02 ID3 0
23 1970-01-03 ID3 1
25 1970-01-05 ID3 1
26 1970-01-06 ID3 1
27 1970-01-07 ID3 0
28 1970-01-08 ID3 0
29 1970-01-09 ID3 0
30 1970-01-10 ID3 0
2) This could also be expressed as a pipeline where roll is from above:
library(dplyr)
library(zoo)
df %>%
  group_by(ID) %>%
  mutate(before = roll(Date == date_before)) %>%
  ungroup
Here is a tidyverse solution using difftime and cumsum:
library(tidyverse);
df %>%
  group_by(ID) %>%
  mutate(
    is_before = difftime(as_date(date_before), Date) >= 0,
    three_days_before = as.numeric((max(cumsum(is_before)) - cumsum(is_before)) %in% 1:3)) %>%
  select(-is_before) %>%
  as.data.frame()
# Date ID three_days_before
#1 1970-01-01 ID1 0
#2 1970-01-02 ID1 0
#3 1970-01-03 ID1 1
#4 1970-01-04 ID1 1
#5 1970-01-06 ID1 1
#6 1970-01-07 ID1 0
#7 1970-01-08 ID1 0
#8 1970-01-09 ID1 0
#9 1970-01-10 ID1 0
#10 1970-01-01 ID2 0
#11 1970-01-02 ID2 0
#12 1970-01-04 ID2 1
#13 1970-01-05 ID2 1
#14 1970-01-06 ID2 1
#15 1970-01-07 ID2 0
#16 1970-01-08 ID2 0
#17 1970-01-09 ID2 0
#18 1970-01-10 ID2 0
#19 1970-01-01 ID3 0
#20 1970-01-02 ID3 0
#21 1970-01-03 ID3 1
#22 1970-01-05 ID3 1
#23 1970-01-06 ID3 1
#24 1970-01-07 ID3 0
#25 1970-01-08 ID3 0
#26 1970-01-09 ID3 0
#27 1970-01-10 ID3 0
Explanation: we group entries by ID; is_before flags entries at or before date_before; we then flag the three rows immediately before date_before with (max(cumsum(is_before)) - cumsum(is_before)) %in% 1:3.
Sample data
library(lubridate);
df <- data.frame("Date" = c(as_date(0:9)), ID = rep(paste0("ID", 1:3), each = 10))
df <- df[-c(5, 13, 24),]
date_before <- "1970-01-07"
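A data.table sketch of the same marker, assuming (as in the example) that each ID has at most one row equal to date_before: flag the target date, then OR together the next three leads within each ID. shift() with n = 1:3 returns a list of three shifted vectors, which Reduce() combines.

```r
library(data.table)
library(lubridate)

df <- data.frame(Date = as_date(0:9), ID = rep(paste0("ID", 1:3), each = 10))
df <- df[-c(5, 13, 24), ]
date_before <- as.Date("1970-01-07")

setDT(df)
# a row is marked when the flagged date appears 1, 2 or 3 rows further down
df[, three_days_before := as.integer(
     Reduce(`|`, shift(Date == date_before, 1:3, type = "lead", fill = FALSE))
   ), by = ID]
```

Like the zoo and dplyr answers, this counts rows rather than calendar days, so missing dates are ignored automatically.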
