How to merge/join dataframes in R conditionally on dynamic time intervals

In R I have two data frames representing COVID-19 patients, and I would like to merge them into one table so that I can perform the desired analyses.
df1 contains metadata for each hospital stay.
df2 contains observational data for different time points during each stay, usually one observation per day, but on some days there is more than one.
Some cases have multiple stays, and I find it difficult to merge these datasets so that the observational data is assigned to its respective metadata from the corresponding stay: there is no variable other than the dates to indicate which stay an observation belongs to.
Sample data can be generated with this code:
df1 <- data.frame(id=c(1,2,3,3,3,4,4,5), in_date=c("2020-03-09", "2020-02-15" , "2020-04-16" , "2020-04-19", "2020-04-24", "2020-03-01" , "2020-03-15" , "2020-05-05") , location=c("a", "a" , "a", "b" , "b" , "a", "a" ,"a" ) )
df2 <- data.frame(id=c(1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,5,5,5) , obs_time=c(
"2020-03-09 01:00" , "2020-03-09 10:00" , "2020-03-10 05:00",
"2020-02-15 08:00" , "2020-02-16 09:00" , "2020-02-17 08:00",
"2020-04-16 14:30", "2020-04-16 07:30" , "2020-04-17 15:00" , "2020-04-25 07:20" ,
"2020-04-18 10:00" , "2020-04-19 10:30", "2020-04-20 12:00", "2020-04-21 12:00" ,
"2020-04-22 09:30" , "2020-04-24 23:00", "2020-04-23 17:30",
"2020-03-01 08:00" , "2020-03-02 08:00" , "2020-03-03 08:00" , "2020-03-15 16:45" ,
"2020-03-16 08:00" , "2020-05-05 13:45" , "2020-05-06 08:00" , "2020-05-07 11:00") ,
temp_celsius=runif(25, min=35.8, max=42.0) )
The lubridate ymd_hm() and ymd() functions were used to convert the date strings into POSIXct/Date variables.
Be aware that the data is not completely sorted. Case id 3 has three stays, but they are all consecutive with no days in between, and on the first day there are two observations. Case 4 has two stays with days in between.
When merging the two data frames I need to assign the observations to their respective stays, so that I can make plots with the admission time of that stay as the starting point (time zero). For example, plot the development of temperature along the x axis for all cases, grouped by location, where the first obs_time of each stay is time zero.
Although I have found other related threads, they do not address the same problem.
The desired output would be like this:
id obs_time temp_celsius stay_id stay_day location
1 1 2020-03-09 01:00 40.53805 1 1 a
2 1 2020-03-09 10:00 37.54832 1 1 a
3 1 2020-03-10 05:00 38.78600 1 2 a
4 2 2020-02-15 08:00 36.19048 1 1 a
5 2 2020-02-16 09:00 37.74323 1 2 a
6 2 2020-02-17 08:00 41.83050 1 3 a
7 3 2020-04-16 14:30 39.82978 1 1 a
8 3 2020-04-16 07:30 39.84554 1 1 a
9 3 2020-04-17 15:00 38.31164 1 2 a
10 3 2020-04-25 07:20 36.37992 3 2 b
11 3 2020-04-18 10:00 38.65261 1 3 a
12 3 2020-04-19 10:30 38.94991 2 1 b
13 3 2020-04-20 12:00 36.84384 2 2 b
14 3 2020-04-21 12:00 35.81786 2 3 b
15 3 2020-04-22 09:30 39.20979 2 4 b
16 3 2020-04-24 23:00 41.39876 3 1 b
17 3 2020-04-23 17:30 37.68251 2 5 b
18 4 2020-03-01 08:00 41.55690 1 1 a
19 4 2020-03-02 08:00 38.53060 1 2 a
20 4 2020-03-03 08:00 39.99385 1 3 a
21 4 2020-03-15 16:45 38.29500 2 1 a
22 4 2020-03-16 08:00 41.20947 2 2 a
23 5 2020-05-05 13:45 36.43556 1 1 a
24 5 2020-05-06 08:00 41.06712 1 2 a
25 5 2020-05-07 11:00 36.76612 1 3 a
I hope someone can help me with this issue.

I think this just about covers it. Rolling joins from data.table and a little manipulation should get you there.
set.seed(1)
library(data.table)
df1 <- data.frame(id=c(1,2,3,3,3,4,4,5),
in_date=c("2020-03-09", "2020-02-15" , "2020-04-16" , "2020-04-19", "2020-04-24", "2020-03-01" , "2020-03-15" , "2020-05-05") ,
location=c("a", "a" , "a", "b" , "b" , "a", "a" ,"a" ) )
df2 <- data.frame(id=c(1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,5,5,5) , obs_time=c(
"2020-03-09 01:00" , "2020-03-09 10:00" , "2020-03-10 05:00",
"2020-02-15 08:00" , "2020-02-16 09:00" , "2020-02-17 08:00",
"2020-04-16 14:30", "2020-04-16 07:30" , "2020-04-17 15:00" , "2020-04-25 07:20" ,
"2020-04-18 10:00" , "2020-04-19 10:30", "2020-04-20 12:00", "2020-04-21 12:00" ,
"2020-04-22 09:30" , "2020-04-24 23:00", "2020-04-23 17:30",
"2020-03-01 08:00" , "2020-03-02 08:00" , "2020-03-03 08:00" , "2020-03-15 16:45" ,
"2020-03-16 08:00" , "2020-05-05 13:45" , "2020-05-06 08:00" , "2020-05-07 11:00") ,
temp_celsius=runif(25, min=35.8, max=42.0))
setDT(df1)
setDT(df2)
df1[, c('in_date') := .(lubridate::ymd(in_date))]
df1[, stay_id := 1:.N, by = id]
df2[, obs_time := lubridate::ymd_hm(obs_time)]
df2[, obs_date := as.Date(obs_time)]
df1[df2, .(id, in_date, obs_time, temp_celsius, stay_id, location),
    on = c('id', 'in_date' = 'obs_date'), roll = Inf][
  , stay_day := as.numeric(difftime(as.Date(obs_time), min(as.Date(in_date)),
                                    units = 'day')) + 1,
  by = .(id, stay_id)][
  , in_date := NULL][]
#> id obs_time temp_celsius stay_id location stay_day
#> 1: 1 2020-03-09 01:00:00 37.44615 1 a 1
#> 2: 1 2020-03-09 10:00:00 38.10717 1 a 1
#> 3: 1 2020-03-10 05:00:00 39.35169 1 a 2
#> 4: 2 2020-02-15 08:00:00 41.43089 1 a 1
#> 5: 2 2020-02-16 09:00:00 37.05043 1 a 2
#> 6: 2 2020-02-17 08:00:00 41.37002 1 a 3
#> 7: 3 2020-04-16 14:30:00 41.65699 1 a 1
#> 8: 3 2020-04-16 07:30:00 39.89695 1 a 1
#> 9: 3 2020-04-17 15:00:00 39.70051 1 a 2
#> 10: 3 2020-04-25 07:20:00 36.18307 3 b 2
#> 11: 3 2020-04-18 10:00:00 37.07704 1 a 3
#> 12: 3 2020-04-19 10:30:00 36.89465 2 b 1
#> 13: 3 2020-04-20 12:00:00 40.05954 2 b 2
#> 14: 3 2020-04-21 12:00:00 38.18144 2 b 3
#> 15: 3 2020-04-22 09:30:00 40.57302 2 b 4
#> 16: 3 2020-04-24 23:00:00 38.88574 3 b 1
#> 17: 3 2020-04-23 17:30:00 40.24923 2 b 5
#> 18: 4 2020-03-01 08:00:00 41.94982 1 a 1
#> 19: 4 2020-03-02 08:00:00 38.15622 1 a 2
#> 20: 4 2020-03-03 08:00:00 40.62016 1 a 3
#> 21: 4 2020-03-15 16:45:00 41.59517 2 a 1
#> 22: 4 2020-03-16 08:00:00 37.11528 2 a 2
#> 23: 5 2020-05-05 13:45:00 39.84038 1 a 1
#> 24: 5 2020-05-06 08:00:00 36.57844 1 a 2
#> 25: 5 2020-05-07 11:00:00 37.45677 1 a 3
#> id obs_time temp_celsius stay_id location stay_day
Created on 2020-07-16 by the reprex package (v0.3.0)
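As a follow-up to the plotting goal mentioned in the question, here is a hedged ggplot2 sketch of the "time zero" plot; res and hours_since_start are my own names, and res is assumed to hold the merged table produced by the chain above:
library(ggplot2)
# time since the first observation of each stay, in hours
res[, hours_since_start := as.numeric(difftime(obs_time, min(obs_time), units = "hours")),
    by = .(id, stay_id)]
ggplot(res, aes(hours_since_start, temp_celsius,
                group = interaction(id, stay_id), colour = location)) +
  geom_line() +
  labs(x = "Hours since first observation of the stay", y = "Temperature (°C)")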

Related

How can I create a day number variable in R based on dates?

I want to create a variable with the number of the day on which a participant took a survey (first day, second day, third day, etc.).
The issue is that there are participants who took the survey after midnight.
For example, this is what it looks like:
Id  date
1   08/03/2020 08:17
1   08/03/2020 12:01
1   08/04/2020 15:08
1   08/04/2020 22:16
2   07/03/2020 08:10
2   07/03/2020 12:03
2   07/04/2020 15:07
2   07/05/2020 00:16
3   08/22/2020 09:17
3   08/23/2020 11:04
3   08/24/2020 00:01
4   10/03/2020 08:37
4   10/03/2020 11:13
4   10/04/2020 15:20
4   10/04/2020 23:05
This is what I want:
Id  date              day
1   08/03/2020 08:17  1
1   08/03/2020 12:01  1
1   08/04/2020 15:08  2
1   08/04/2020 22:16  2
2   07/03/2020 08:10  1
2   07/03/2020 12:03  1
2   07/04/2020 15:07  2
2   07/05/2020 00:16  2
3   08/22/2020 09:17  1
3   08/23/2020 11:04  2
3   08/24/2020 00:01  2
4   10/03/2020 08:37  1
4   10/03/2020 11:13  1
4   10/04/2020 15:20  2
4   10/04/2020 23:05  2
How can I create the day variable so that participants who took the survey after midnight are still counted as belonging to the previous day?
I tried the code here, but I have issues with participants taking surveys after midnight.
Please check the below code
code
library(dplyr)
library(tidyr)

data2 <- data %>%
  mutate(date2 = as.Date(date, format = "%m/%d/%Y %H:%M")) %>%
  group_by(id) %>%
  mutate(row = row_number(),
         date3 = as.Date(ifelse(row == 1, date2, NA), origin = "1970-01-01")) %>%
  fill(date3) %>%
  ungroup() %>%
  mutate(diff = as.numeric(date2 - date3 + 1)) %>%
  select(-date2, -date3, -row)
output
#> id date diff
#> 1 1 08/03/2020 08:17 1
#> 2 1 08/03/2020 12:01 1
#> 3 1 08/04/2020 15:08 2
#> 4 1 08/04/2020 22:16 2
#> 5 2 07/03/2020 08:10 1
#> 6 2 07/03/2020 12:03 1
#> 7 2 07/04/2020 15:07 2
#> 8 2 07/05/2020 00:16 3
Here is one approach that explicitly shows the dates considered. First, make sure your date is in POSIXct format, as suggested in the comments (if not done already). Then, if the hour is less than 2 (midnight to 2 AM), subtract 1 from the date so that survey_date reflects the day before; if the hour is not less than 2, just keep the date. The timezone argument tz is set to "" to avoid confusion or uncertainty. Finally, after grouping by Id, subtract the first survey_date from each survey_date to get the number of days since the first survey. You can use as.numeric to make this column numeric if desired.
Note: if you just want to number consecutive days on which the survey was taken (and ignore gaps in days between surveys), you can substitute the last line with:
mutate(day = cumsum(survey_date != lag(survey_date, default = first(survey_date))) + 1)
This will increase day by 1 for every new survey_date found for a given Id.
library(tidyverse)
library(lubridate)

df %>%
  mutate(date = as.POSIXct(date, format = "%m/%d/%Y %H:%M", tz = "")) %>%
  mutate(survey_date = if_else(hour(date) < 2,
                               as.Date(date, format = "%Y-%m-%d", tz = "") - 1,
                               as.Date(date, format = "%Y-%m-%d", tz = ""))) %>%
  group_by(Id) %>%
  mutate(day = survey_date - first(survey_date) + 1)
Output
Id date survey_date day
<int> <dttm> <date> <drtn>
1 1 2020-08-03 08:17:00 2020-08-03 1 days
2 1 2020-08-03 12:01:00 2020-08-03 1 days
3 1 2020-08-04 15:08:00 2020-08-04 2 days
4 1 2020-08-04 22:16:00 2020-08-04 2 days
5 2 2020-07-03 08:10:00 2020-07-03 1 days
6 2 2020-07-03 12:03:00 2020-07-03 1 days
7 2 2020-07-04 15:07:00 2020-07-04 2 days
8 2 2020-07-05 00:16:00 2020-07-04 2 days
9 3 2020-08-22 09:17:00 2020-08-22 1 days
10 3 2020-08-23 11:04:00 2020-08-23 2 days
11 3 2020-08-24 00:01:00 2020-08-23 2 days
12 4 2020-10-03 08:37:00 2020-10-03 1 days
13 4 2020-10-03 11:13:00 2020-10-03 1 days
14 4 2020-10-04 15:20:00 2020-10-04 2 days
15 4 2020-10-04 23:05:00 2020-10-04 2 days
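If a plain numeric day is preferred (as mentioned above), one more line does it; a minimal sketch, with res standing in for the grouped result above:
res %>% mutate(day = as.numeric(day))   # turn the difftime (drtn) column into plain numbers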

R data manipulation - Rank condition in data.table / dplyr

In R I am conducting analyses on df1 but I also need to pull data from the more detailed records / observations in df2 and attach to df1 based on certain conditions.
This is sample data comparable to my own:
df1 <- data.frame(id=c(1,2,3,3,3,4,4,5),
location=c("a", "a" , "a", "b" , "b" , "a", "a" ,"a" ),
actiontime=c("2020-03-10" , "2020-02-17" , "2020-04-22" , "2020-04-19" , "2020-04-20" , "2020-04-22" , "2020-03-02" , "2020-05-07" ) )
df2 <- data.frame(id=c(1,1,1, 2,2,2, 3,3,3,3,3,3,3,3,3,3,3, 4,4,4,4,4, 5,5,5) ,
observation=c( "2020-03-09 01:00" , "2020-03-09 10:00" , "2020-03-10 05:00", "2020-02-15 08:00" , "2020-02-16 09:00" , "2020-02-17 08:00", "2020-04-16 14:30", "2020-04-16 07:30" , "2020-04-17 15:00" , "2020-04-25 07:20" , "2020-04-18 10:00" , "2020-04-19 10:30", "2020-04-20 12:00", "2020-04-21 12:00" , "2020-04-22 09:30" , "2020-04-24 23:00", "2020-04-23 17:30", "2020-03-01 08:00" , "2020-03-02 08:00" , "2020-03-03 08:00" , "2020-03-15 16:45" , "2020-03-16 08:00" , "2020-05-05 13:45" , "2020-05-06 08:00" , "2020-05-07 11:00") ,
var1=round(runif(25, min=10, max=60),0) ,
var2=c("Red" , "Blue" , "Yellow" , NA , "Yellow" , "Blue" , "Red" , "Yellow" , NA , NA , "Yellow" , NA , NA , NA , NA , NA , "Blue", NA , "Blue" , "Yellow" , NA , "Blue" , "Yellow" , "Red" , "Blue") )
For example, how can I do the following procedures (preferably with data.table, but if someone would also like to demonstrate with dplyr, that is also nice):
Q1. If I define the rank order Blue > Red > Yellow, how can I get the highest-ranked color in df2$var2 among the observations with the same id (if any), attached as a new variable to the respective id in df1?
Q2. In addition to the rank as in Q1, how do I add the condition to only select var2 if the observation happens a day before actiontime in df1?
Q3. And to learn even more: how can the data that was pulled out of df2 and joined to df1 in Q1 instead be added to the record with the earliest observation for each id in df2, meaning just working on df2 without involving df1 (and the join)?
The output for Q3 would be something like this:
id observation var1 var2 color
1 1 2020-03-09 01:00 37 Red Blue
2 1 2020-03-09 10:00 35 Blue <NA>
3 1 2020-03-10 05:00 27 Yellow <NA>
4 2 2020-02-15 08:00 21 <NA> Yellow
5 2 2020-02-16 09:00 37 Yellow <NA>
6 2 2020-02-17 08:00 38 Blue <NA>
7 3 2020-04-16 14:30 56 Red <NA>
8 3 2020-04-16 07:30 35 Yellow Red
9 3 2020-04-17 15:00 40 <NA> <NA>
10 3 2020-04-25 07:20 20 <NA> <NA>
11 3 2020-04-18 10:00 49 <NA> <NA>
12 3 2020-04-19 10:30 58 <NA> <NA>
13 3 2020-04-20 12:00 37 <NA> <NA>
14 3 2020-04-21 12:00 25 <NA> <NA>
15 3 2020-04-22 09:30 16 <NA> <NA>
16 3 2020-04-24 23:00 52 <NA> <NA>
17 3 2020-04-23 17:30 46 Blue <NA>
18 4 2020-03-01 08:00 16 <NA> Blue
19 4 2020-03-02 08:00 14 Blue <NA>
20 4 2020-03-03 08:00 21 Yellow <NA>
21 4 2020-03-15 16:45 52 <NA> <NA>
22 4 2020-03-16 08:00 40 Blue <NA>
23 5 2020-05-05 13:45 13 Yellow Red
24 5 2020-05-06 08:00 12 Red <NA>
25 5 2020-05-07 11:00 11 Blue <NA>
There are three questions in one; I will try to answer them one by one.
Question 1
If I understand correctly, the OP wants to identify the highest ranked color in var2 per id and wants to copy the color to a new column in df1 for the matching ids.
This can be solved by turning var2 into an ordered factor, aggregating df2 by id, and adding the result to df1 by an update join:
library(data.table)
setDT(df1)
setDT(df2)
df2[, var2 := ordered(var2, levels = c("Blue", "Red", "Yellow", NA), exclude = NULL)]
str(df2)
Classes ‘data.table’ and 'data.frame': 25 obs. of 5 variables:
$ id : num 1 1 1 2 2 2 3 3 3 3 ...
$ observation: chr "2020-03-09 01:00" "2020-03-09 10:00" "2020-03-10 05:00" "2020-02-15 08:00" ...
$ var1 : num 15 58 12 35 11 25 24 54 14 15 ...
$ var2 : Ord.factor w/ 4 levels "Blue"<"Red"<"Yellow"<..: 2 1 3 4 3 1 2 3 4 4 ...
$ action_day : IDate, format: "2020-03-10" "2020-03-10" "2020-03-11" "2020-02-16" ...
- attr(*, ".internal.selfref")=<externalptr>
So, we can find the highest ranked color per id by using min()
df2[, min(var2, na.rm = TRUE), by = id]
id V1
1: 1 Blue
2: 2 Blue
3: 3 Blue
4: 4 Blue
5: 5 Blue
which is rather trivial because all id groups include Blue in var2.
This can be appended to df1 by an update join
df1[df2[, min(var2, na.rm = TRUE), by = id], on = .(id), color := V1][]
id location actiontime color
1: 1 a 2020-03-10 Blue
2: 2 a 2020-02-17 Blue
3: 3 a 2020-04-22 Blue
4: 3 b 2020-04-19 Blue
5: 3 b 2020-04-20 Blue
6: 4 a 2020-04-22 Blue
7: 4 a 2020-03-02 Blue
8: 5 a 2020-05-07 Blue
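Since the question also welcomed a dplyr demonstration, here is a hedged dplyr sketch of the same Q1 logic (assuming df1 and df2 as originally defined in the question; best_color is my own name):
library(dplyr)
# rank Blue > Red > Yellow via an ordered factor, drop NAs, take the "smallest" (highest-ranked) per id
best_color <- df2 %>%
  mutate(var2 = ordered(var2, levels = c("Blue", "Red", "Yellow"))) %>%
  filter(!is.na(var2)) %>%
  group_by(id) %>%
  summarise(color = as.character(min(var2)), .groups = "drop")
df1 %>% left_join(best_color, by = "id")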
Question 2
If I understand correctly, the OP wants to filter df2 so that only those rows are kept where the date of the observation in df2 is exactly one day before an actiontime in df1 (for the same id). This intermediate result is then processed in the same way as df2 in Question 1, above.
The filtering is accomplished by a join operation, but it requires coercing the character date actiontime and the character date-time observation, respectively, to a date type for the date calculation.
df1[, actiontime := as.IDate(actiontime)]
df2[, action_day := as.IDate(observation) + 1L]
keep_df2_rows <- df2[df1, on = .(id, action_day = actiontime), nomatch = NULL, which = TRUE]
keep_df2_rows
[1] 1 2 5 14 11 12 18 24
keep_df2_rows contains the row numbers of those rows of df2 which fulfil the condition that the observation happened exactly one day before an actiontime in df1 (for the same id).
Now, we can use the code of question 1 but use keep_df2_rows to filter df2:
df1[df2[keep_df2_rows, min(var2, na.rm = TRUE), by = id]
, on = .(id), color := V1][]
id location actiontime color
1: 1 a 2020-03-10 Blue
2: 2 a 2020-02-17 Yellow
3: 3 a 2020-04-22 Yellow
4: 3 b 2020-04-19 Yellow
5: 3 b 2020-04-20 Yellow
6: 4 a 2020-04-22 <NA>
7: 4 a 2020-03-02 <NA>
8: 5 a 2020-05-07 Red
Question 3
If I understand correctly, the final goal of the OP is to add the color column to df2 instead of df1, with the additional requirement that only the row with the earliest observation within an id is to be updated.
This can be accomplished by an update join with a look-up table lut which contains the colors by id (as described above) and the earliest observation by id:
library(data.table)
setDT(df2)[, var2 := ordered(var2, levels = c("Blue", "Red", "Yellow"))]
setDT(df1)[, actiontime := as.IDate(actiontime)]
df2[, action_day := as.IDate(observation) + 1L]
keep_df2_rows <- df2[df1, on = .(id, action_day = actiontime), nomatch = NULL, which = TRUE]
agg1 <- df2[keep_df2_rows][!is.na(var2), min(var2), by = id]
agg2 <- df2[, .(observation = min(observation)), by = id]
lut <- merge(agg1, agg2, by = "id")
df2[lut, on = .(id, observation), color := as.character(V1)][]
id observation var1 var2 action_day color
1: 1 2020-03-09 01:00 23 Red 2020-03-10 Blue
2: 1 2020-03-09 10:00 29 Blue 2020-03-10 <NA>
3: 1 2020-03-10 05:00 39 Yellow 2020-03-11 <NA>
4: 2 2020-02-15 08:00 55 <NA> 2020-02-16 Yellow
5: 2 2020-02-16 09:00 20 Yellow 2020-02-17 <NA>
6: 2 2020-02-17 08:00 55 Blue 2020-02-18 <NA>
7: 3 2020-04-16 14:30 57 Red 2020-04-17 <NA>
8: 3 2020-04-16 07:30 43 Yellow 2020-04-17 Yellow
9: 3 2020-04-17 15:00 41 <NA> 2020-04-18 <NA>
10: 3 2020-04-25 07:20 13 <NA> 2020-04-26 <NA>
11: 3 2020-04-18 10:00 20 Yellow 2020-04-19 <NA>
12: 3 2020-04-19 10:30 19 <NA> 2020-04-20 <NA>
13: 3 2020-04-20 12:00 44 <NA> 2020-04-21 <NA>
14: 3 2020-04-21 12:00 29 <NA> 2020-04-22 <NA>
15: 3 2020-04-22 09:30 48 <NA> 2020-04-23 <NA>
16: 3 2020-04-24 23:00 35 <NA> 2020-04-25 <NA>
17: 3 2020-04-23 17:30 46 Blue 2020-04-24 <NA>
18: 4 2020-03-01 08:00 60 <NA> 2020-03-02 <NA>
19: 4 2020-03-02 08:00 29 Blue 2020-03-03 <NA>
20: 4 2020-03-03 08:00 49 Yellow 2020-03-04 <NA>
21: 4 2020-03-15 16:45 57 <NA> 2020-03-16 <NA>
22: 4 2020-03-16 08:00 21 Blue 2020-03-17 <NA>
23: 5 2020-05-05 13:45 43 Yellow 2020-05-06 Red
24: 5 2020-05-06 08:00 16 Red 2020-05-07 <NA>
25: 5 2020-05-07 11:00 23 Blue 2020-05-08 <NA>
id observation var1 var2 action_day color
Note that the result differs from the example table posted by the OP because OP's definition of df2 is different to the example table.
Also note that I had to modify the computation of agg1 because of an unexpected behaviour of min(var2, na.rm = TRUE) when an id group consists only of NA. (To reproduce the issue, try min(ordered(NA), na.rm = TRUE) vs min(ordered(NA)))
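As a small illustration of that safer pattern (my own toy data, assuming data.table is loaded as above): filtering out the NAs per group before calling min() means an all-NA group simply contributes no row, so the all-NA case never reaches min():
toy <- data.table(id = c(1, 1, 2),
                  v  = ordered(c("Red", NA, NA), levels = c("Blue", "Red", "Yellow")))
toy[!is.na(v), min(v), by = id]
##    id  V1
## 1:  1 Red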

Indicator variable for dates within 7 day range of each other in R

I am working with electronic health records data and would like to create an indicator variable called "episode" that joins antibiotic medications that occur within 7 days of each other. Below is a mock dataset and the output that I would like. I program in R.
df2=data.frame(
id = c(01,01,01,01,01,02,02,03,04),
date = c("2015-01-01 11:00",
"2015-01-06 13:29",
"2015-01-10 12:46",
"2015-01-25 14:45",
"2015-02-15 13:30",
"2015-01-01 10:00",
"2015-05-05 15:20",
"2015-01-01 15:19",
"2015-08-01 13:15"),
abx = c("AMPICILLIN",
"ERYTHROMYCIN",
"NEOMYCIN",
"AMPICILLIN",
"VANCOMYCIN",
"VANCOMYCIN",
"NEOMYCIN",
"PENICILLIN",
"ERYTHROMYCIN"));
df2
Output desired
id date abx episode
1 2015-01-01 11:00 AMPICILLIN 1
1 2015-01-06 13:29 ERYTHROMYCIN 1
1 2015-01-10 12:46 NEOMYCIN 1
1 2015-01-25 14:45 AMPICILLIN 2
1 2015-02-15 13:30 VANCOMYCIN 3
2 2015-01-01 10:00 VANCOMYCIN 1
2 2015-05-05 15:20 NEOMYCIN 1
3 2015-01-01 15:19 PENICILLIN 1
4 2015-08-01 13:15 ERYTHROMYCIN 1
Use ave like this:
grpno <- function(x) cumsum(c(TRUE, diff(x) >=7 ))
transform(df2, episode = ave(as.numeric(as.Date(date)), id, FUN = grpno))
giving:
id date abx episode
1 1 2015-01-01 11:00 AMPICILLIN 1
2 1 2015-01-06 13:29 ERYTHROMYCIN 1
3 1 2015-01-10 12:46 NEOMYCIN 1
4 1 2015-01-25 14:45 AMPICILLIN 2
5 1 2015-02-15 13:30 VANCOMYCIN 3
6 2 2015-01-01 10:00 VANCOMYCIN 1
7 2 2015-05-05 15:20 NEOMYCIN 2
8 3 2015-01-01 15:19 PENICILLIN 1
9 4 2015-08-01 13:15 ERYTHROMYCIN 1
or with dplyr and grpno from above:
df2 %>%
  group_by(id) %>%
  mutate(episode = date %>% as.Date %>% as.numeric %>% grpno) %>%
  ungroup()
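To see how grpno assigns episode numbers, here is a quick check on the id 1 dates from the question (a sketch; grpno as defined above starts a new group whenever the gap from the previous date is 7 days or more):
d <- as.numeric(as.Date(c("2015-01-01", "2015-01-06", "2015-01-10",
                          "2015-01-25", "2015-02-15")))
grpno(d)
## [1] 1 1 1 2 3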

Summations by conditions on another row dealing with time

I am looking to run a cumulative sum at every row for values that occur in two columns before and after that point. So in this case I have the volume of two incident types at every given minute over two days. I want to create a column which adds up, for each row, all the incidents of one type that occurred before it and all the incidents of the other type that occurred after it. SUMIF from Excel comes to mind, but I'm not sure how to port that over to R.
EDIT: added set.seed and easier numbers
I have the following data set:
set.seed(42)
master_min =
setDT(
data.frame(master_min = seq(
from=as.POSIXct("2016-1-1 0:00", tz="America/New_York"),
to=as.POSIXct("2016-1-2 23:00", tz="America/New_York"),
by="min"
))
)
incident1= round(runif(2821, min=0, max=10))
incident2= round(runif(2821, min=0, max=10))
master_min = head(cbind(master_min, incident1, incident2), 5)
How do I essentially compute the following logic:
For each row, sum all the incident1 values that occurred before that row's timestamp and all the incident2 values that occurred after that row's timestamp? It would be great to get a data.table solution, or dplyr if not, as I am working with a large dataset. Below is a before and after for the data:
BEFORE:
master_min incident1 incident2
1: 2016-01-01 00:00:00 9 6
2: 2016-01-01 00:01:00 9 5
3: 2016-01-01 00:02:00 3 5
4: 2016-01-01 00:03:00 8 6
5: 2016-01-01 00:04:00 6 9
AFTER THE CALCULATION:
master_min incident1 incident2 new_column
1: 2016-01-01 00:00:00 9 6 25
2: 2016-01-01 00:01:00 9 5 29
3: 2016-01-01 00:02:00 3 5 33
4: 2016-01-01 00:03:00 8 6 30
5: 2016-01-01 00:04:00 6 9 29
If I understand correctly:
# Cumsum of incident1, without current row:
master_min$sum1 <- cumsum(master_min$incident1) - master_min$incident1
# Reverse cumsum of incident2, without current row:
master_min$sum2 <- rev(cumsum(rev(master_min$incident2))) - master_min$incident2
# Your new column:
master_min$new_column <- master_min$sum1 + master_min$sum2
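The same exclusive-of-current-row logic can also be written as a single data.table assignment; a sketch, assuming master_min is still the data.table built above:
master_min[, new_column := (cumsum(incident1) - incident1) +
                            (rev(cumsum(rev(incident2))) - incident2)]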
Update:
The following two lines can do the job
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
I rewrote the example a bit to show a more comprehensive structure:
library(data.table)
master_min <-
setDT(
data.frame(master_min = seq(
from=as.POSIXct("2016-1-1 0:00", tz="America/New_York"),
to=as.POSIXct("2016-1-1 0:09", tz="America/New_York"),
by="min"
))
)
set.seed(2)
incident1= as.integer(runif(10, min=0, max=10))
incident2= as.integer(runif(10, min=0, max=10))
master_min = cbind(master_min, incident1, incident2)
Now master_min looks like this
> master_min
master_min incident1 incident2
1: 2016-01-01 00:00:00 1 5
2: 2016-01-01 00:01:00 7 2
3: 2016-01-01 00:02:00 5 7
4: 2016-01-01 00:03:00 1 1
5: 2016-01-01 00:04:00 9 4
6: 2016-01-01 00:05:00 9 8
7: 2016-01-01 00:06:00 1 9
8: 2016-01-01 00:07:00 8 2
9: 2016-01-01 00:08:00 4 4
10: 2016-01-01 00:09:00 5 0
Apply transformations
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
Results
> master_min
master_min incident1 incident2 sum1 sum2
1: 2016-01-01 00:00:00 1 5 1 37
2: 2016-01-01 00:01:00 7 2 8 35
3: 2016-01-01 00:02:00 5 7 13 28
4: 2016-01-01 00:03:00 1 1 14 27
5: 2016-01-01 00:04:00 9 4 23 23
6: 2016-01-01 00:05:00 9 8 32 15
7: 2016-01-01 00:06:00 1 9 33 6
8: 2016-01-01 00:07:00 8 2 41 4
9: 2016-01-01 00:08:00 4 4 45 0
10: 2016-01-01 00:09:00 5 0 50 0

R: Count by id, number of occurrences in a predefined time interval

I want to compute a column that counts the number of occurrences, looking backward over a predefined time interval (e.g. 2 days), for a particular ID.
I have the following data structure (see code below) in R and want to compute the column countLast2d automatically:
userID <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3)
datetime <-c("2015-07-02 13:20:00", "2015-07-03 13:20:00", "2015-07-04 01:20:00",
"2015-07-10 01:20:00", "2015-07-23 01:20:00", "2015-07-23 06:08:00", "2015-07-24 06:08:00",
"2015-09-02 09:01:00", "2015-08-19 11:41:00", "2015-08-19 14:38:00", "2015-08-19 17:36:00",
"2015-08-19 20:33:00", "2015-08-19 23:30:00", "2015-08-19 23:46:00", "2015-08-19 05:19:00",
"2015-09-13 17:02:00", "2015-10-01 00:32:00", "2015-10-01 00:50:00")
The outcome should take on these values:
countLast2d <- c(0,1,2,0,0,1,2,0,0,1,0,0,0,1,0,0,0,1)
df <- data.frame(userID, countLast2d, datetime)
df$datetime = as.POSIXct(strptime(df$datetime, format = "%Y-%m-%d %H:%M:%S"))
In Excel, I would use the following formula:
=countifs([datecolumn],"<"&[date cell in that row],[datecolumn],"<"&[date cell in that row]-2,[idcolumn],[id cell in that row])
(So for example [C2]=+COUNTIFS($B:$B,"<"&$B2,$B:$B,">="&$B2-2,$A:$A,$A2), if Column A contains the id and column B the date)
I already asked that question once before (https://stackoverflow.com/questions/30998596/r-count-number-of-occurences-by-id-in-the-last-48h) but didn't include an example in my question. So sorry for asking again.
Here's a solution:
df <- data.frame(userID=c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3),datetime=as.POSIXct(c('2015-07-02 13:20:00','2015-07-03 13:20:00','2015-07-04 01:20:00','2015-07-10 01:20:00','2015-07-23 01:20:00','2015-07-23 06:08:00','2015-07-24 06:08:00','2015-09-02 09:01:00','2015-08-19 11:41:00','2015-08-19 14:38:00','2015-08-19 17:36:00','2015-08-19 20:33:00','2015-08-19 23:30:00','2015-08-19 23:46:00','2015-08-19 05:19:00','2015-09-13 17:02:00','2015-10-01 00:32:00','2015-10-01 00:50:00')));
window <- as.difftime(2,units='days');
df$countLast2d <- sapply(1:nrow(df),function(r) sum(df$userID==df$userID[r] & df$datetime<df$datetime[r] & df$datetime>=df$datetime[r]-window));
df;
## userID datetime countLast2d
## 1 1 2015-07-02 13:20:00 0
## 2 1 2015-07-03 13:20:00 1
## 3 1 2015-07-04 01:20:00 2
## 4 1 2015-07-10 01:20:00 0
## 5 1 2015-07-23 01:20:00 0
## 6 1 2015-07-23 06:08:00 1
## 7 1 2015-07-24 06:08:00 2
## 8 1 2015-09-02 09:01:00 0
## 9 2 2015-08-19 11:41:00 1
## 10 2 2015-08-19 14:38:00 2
## 11 2 2015-08-19 17:36:00 3
## 12 2 2015-08-19 20:33:00 4
## 13 2 2015-08-19 23:30:00 5
## 14 2 2015-08-19 23:46:00 6
## 15 2 2015-08-19 05:19:00 0
## 16 3 2015-09-13 17:02:00 0
## 17 3 2015-10-01 00:32:00 0
## 18 3 2015-10-01 00:50:00 1
Note that this differs from your expected output because your expected output is incorrect for userID==2.
This solution will work regardless of the ordering of df, which is essential for your example df because it is unordered (or at least not perfectly ordered) for userID==2.
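As a quick manual check of that userID==2 discrepancy (a sketch using df and window from above), the count for the 2015-08-19 11:41 row is indeed 1, because the 05:19 observation from the same morning falls inside the 2-day window:
with(df, sum(userID == 2 &
             datetime <  as.POSIXct("2015-08-19 11:41:00") &
             datetime >= as.POSIXct("2015-08-19 11:41:00") - window))
## [1] 1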
Edit: Here's a possibility, using by() to group by userID and only comparing each element against lower-indexed elements, under the assumption that only those elements can be in the lookback window:
df2 <- df[order(df$userID,df$datetime),];
df2$countLast2d <- do.call(c,by(df2$datetime,df$userID,function(x) c(0,sapply(2:length(x),function(i) sum(x[1:(i-1)]>=x[i]-window)))));
df2;
## userID datetime countLast2d
## 1 1 2015-07-02 13:20:00 0
## 2 1 2015-07-03 13:20:00 1
## 3 1 2015-07-04 01:20:00 2
## 4 1 2015-07-10 01:20:00 0
## 5 1 2015-07-23 01:20:00 0
## 6 1 2015-07-23 06:08:00 1
## 7 1 2015-07-24 06:08:00 2
## 8 1 2015-09-02 09:01:00 0
## 15 2 2015-08-19 05:19:00 0
## 9 2 2015-08-19 11:41:00 1
## 10 2 2015-08-19 14:38:00 2
## 11 2 2015-08-19 17:36:00 3
## 12 2 2015-08-19 20:33:00 4
## 13 2 2015-08-19 23:30:00 5
## 14 2 2015-08-19 23:46:00 6
## 16 3 2015-09-13 17:02:00 0
## 17 3 2015-10-01 00:32:00 0
## 18 3 2015-10-01 00:50:00 1
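For completeness, an equivalent dplyr sketch of the same grouped, per-row count (not part of the original answer; assumes df and window from above and simply overwrites countLast2d):
library(dplyr)
library(purrr)
df %>%
  group_by(userID) %>%
  mutate(countLast2d = map_int(seq_along(datetime), function(i)
    sum(datetime < datetime[i] & datetime >= datetime[i] - window))) %>%
  ungroup()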
