Indicator variable for dates within 7 day range of each other in R

I am working with electronic health records data and would like to create an indicator variable called "episode" that joins antibiotic medications that occur within 7 days of each other. Below is a mock dataset and the output that I would like. I program in R.
df2=data.frame(
id = c(01,01,01,01,01,02,02,03,04),
date = c("2015-01-01 11:00",
"2015-01-06 13:29",
"2015-01-10 12:46",
"2015-01-25 14:45",
"2015-02-15 13:30",
"2015-01-01 10:00",
"2015-05-05 15:20",
"2015-01-01 15:19",
"2015-08-01 13:15"),
abx = c("AMPICILLIN",
"ERYTHROMYCIN",
"NEOMYCIN",
"AMPICILLIN",
"VANCOMYCIN",
"VANCOMYCIN",
"NEOMYCIN",
"PENICILLIN",
"ERYTHROMYCIN"));
df2
Output desired
id date abx episode
1 2015-01-01 11:00 AMPICILLIN 1
1 2015-01-06 13:29 ERYTHROMYCIN 1
1 2015-01-10 12:46 NEOMYCIN 1
1 2015-01-25 14:45 AMPICILLIN 2
1 2015-02-15 13:30 VANCOMYCIN 3
2 2015-01-01 10:00 VANCOMYCIN 1
2 2015-05-05 15:20 NEOMYCIN 1
3 2015-01-01 15:19 PENICILLIN 1
4 2015-08-01 13:15 ERYTHROMYCIN 1

Use ave like this:
grpno <- function(x) cumsum(c(TRUE, diff(x) >=7 ))
transform(df2, episode = ave(as.numeric(as.Date(date)), id, FUN = grpno))
giving:
id date abx episode
1 1 2015-01-01 11:00 AMPICILLIN 1
2 1 2015-01-06 13:29 ERYTHROMYCIN 1
3 1 2015-01-10 12:46 NEOMYCIN 1
4 1 2015-01-25 14:45 AMPICILLIN 2
5 1 2015-02-15 13:30 VANCOMYCIN 3
6 2 2015-01-01 10:00 VANCOMYCIN 1
7 2 2015-05-05 15:20 NEOMYCIN 2
8 3 2015-01-01 15:19 PENICILLIN 1
9 4 2015-08-01 13:15 ERYTHROMYCIN 1
or with dplyr and grpno from above:
library(dplyr)
df2 %>%
group_by(id) %>%
mutate(episode = date %>% as.Date %>% as.numeric %>% grpno) %>%
ungroup
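To see why grpno produces episode numbers, here is a minimal base R walk-through for id 1 of the mock data. The intermediate names gaps and starts_new are only illustrative; they are not part of the answer above.

```r
# Day numbers (days since 1970-01-01) for id 1 in the mock data
x <- as.numeric(as.Date(c("2015-01-01", "2015-01-06", "2015-01-10",
                          "2015-01-25", "2015-02-15")))

gaps <- diff(x)                    # days between consecutive rows: 5 4 15 21
starts_new <- c(TRUE, gaps >= 7)   # the first row always opens an episode
episode <- cumsum(starts_new)      # running count of opened episodes

print(episode)                     # 1 1 1 2 3
```

Two things to keep in mind: each gap is measured against the previous row, not against the start of the episode, so a chain of closely spaced prescriptions keeps extending the same episode; and as.Date drops the time of day, so gaps are counted in whole calendar days. If a gap of exactly 7 days should stay in the same episode, use > 7 instead of >= 7.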

Related

How can I create a day number variable in R based on dates?

I want to create a variable with the number of the day a participant took a survey (first day, second day, third day, etc.)
The issue is that there are participants that took the survey after midnight.
For example, this is what it looks like:
Id date
1 08/03/2020 08:17
1 08/03/2020 12:01
1 08/04/2020 15:08
1 08/04/2020 22:16
2 07/03/2020 08:10
2 07/03/2020 12:03
2 07/04/2020 15:07
2 07/05/2020 00:16
3 08/22/2020 09:17
3 08/23/2020 11:04
3 08/24/2020 00:01
4 10/03/2020 08:37
4 10/03/2020 11:13
4 10/04/2020 15:20
4 10/04/2020 23:05
This is what I want:
Id date day
1 08/03/2020 08:17 1
1 08/03/2020 12:01 1
1 08/04/2020 15:08 2
1 08/04/2020 22:16 2
2 07/03/2020 08:10 1
2 07/03/2020 12:03 1
2 07/04/2020 15:07 2
2 07/05/2020 00:16 2
3 08/22/2020 09:17 1
3 08/23/2020 11:04 2
3 08/24/2020 00:01 2
4 10/03/2020 08:37 1
4 10/03/2020 11:13 1
4 10/04/2020 15:20 2
4 10/04/2020 23:05 2
How can I create the day variable so that participants who took the survey shortly after midnight are still counted as belonging to the previous day?
I tried the codes here. But I have issues with participants taking surveys after midnight.
Please check the below code
code
library(dplyr)
library(tidyr) # for fill()
data2 <- data %>%
mutate(date2 = as.Date(date, format = "%m/%d/%Y %H:%M")) %>%
group_by(id) %>%
mutate(row = row_number(),
date3 = as.Date(ifelse(row == 1, date2, NA), origin = "1970-01-01")) %>%
fill(date3) %>%
ungroup() %>%
mutate(diff = as.numeric(date2 - date3 + 1)) %>%
select(-date2, -date3, -row)
output
#> id date diff
#> 1 1 08/03/2020 08:17 1
#> 2 1 08/03/2020 12:01 1
#> 3 1 08/04/2020 15:08 2
#> 4 1 08/04/2020 22:16 2
#> 5 2 07/03/2020 08:10 1
#> 6 2 07/03/2020 12:03 1
#> 7 2 07/04/2020 15:07 2
#> 8 2 07/05/2020 00:16 3
Here is one approach that makes the assigned survey dates explicit. First, make sure your date is in POSIXct format, as suggested in the comments (if not done already). Then, if the hour is less than 2 (midnight to 2 AM), subtract 1 from the date so that survey_date reflects the day before; otherwise, keep the date as is. The tz argument is set to "" (the current local time zone) in both as.POSIXct and as.Date so that the date is taken in the same zone the timestamp was parsed in; as.Date on a POSIXct value otherwise defaults to UTC. Finally, after grouping by Id, subtract the first survey_date from each survey_date and add 1 to get the day number relative to the first survey. You can use as.numeric to make this column numeric if desired.
Note: if you just want to count the distinct days on which the survey was taken (ignoring gaps between survey days), you can substitute the following for the last line:
mutate(day = cumsum(survey_date != lag(survey_date, default = first(survey_date))) + 1)
This will increase day by 1 every new survey_date found for a given Id.
library(tidyverse)
library(lubridate)
df %>%
mutate(date = as.POSIXct(date, format = "%m/%d/%Y %H:%M", tz = "")) %>%
mutate(survey_date = if_else(hour(date) < 2,
as.Date(date, format = "%Y-%m-%d", tz = "") - 1,
as.Date(date, format = "%Y-%m-%d", tz = ""))) %>%
group_by(Id) %>%
mutate(day = survey_date - first(survey_date) + 1)
Output
Id date survey_date day
<int> <dttm> <date> <drtn>
1 1 2020-08-03 08:17:00 2020-08-03 1 days
2 1 2020-08-03 12:01:00 2020-08-03 1 days
3 1 2020-08-04 15:08:00 2020-08-04 2 days
4 1 2020-08-04 22:16:00 2020-08-04 2 days
5 2 2020-07-03 08:10:00 2020-07-03 1 days
6 2 2020-07-03 12:03:00 2020-07-03 1 days
7 2 2020-07-04 15:07:00 2020-07-04 2 days
8 2 2020-07-05 00:16:00 2020-07-04 2 days
9 3 2020-08-22 09:17:00 2020-08-22 1 days
10 3 2020-08-23 11:04:00 2020-08-23 2 days
11 3 2020-08-24 00:01:00 2020-08-23 2 days
12 4 2020-10-03 08:37:00 2020-10-03 1 days
13 4 2020-10-03 11:13:00 2020-10-03 1 days
14 4 2020-10-04 15:20:00 2020-10-04 2 days
15 4 2020-10-04 23:05:00 2020-10-04 2 days
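As a variation on the same idea, the timestamp itself can be shifted back by the cutoff before taking the date, which avoids the if_else. This is a minimal base R sketch, not the answer's code: cutoff_hours is a hypothetical name, the 2-hour cutoff mirrors the answer, and tz = "UTC" is an assumption made only so the example gives the same result on any machine.

```r
# Hypothetical cutoff: surveys taken before 02:00 count for the previous day
cutoff_hours <- 2

# Timestamps for Id 2 from the question (time zone fixed to UTC as an assumption)
ts <- as.POSIXct(c("07/03/2020 08:10", "07/03/2020 12:03",
                   "07/04/2020 15:07", "07/05/2020 00:16"),
                 format = "%m/%d/%Y %H:%M", tz = "UTC")

# Shift each timestamp back by the cutoff, then take the calendar date
survey_date <- as.Date(ts - cutoff_hours * 3600, tz = "UTC")

# Day number relative to the first survey date
day <- as.numeric(survey_date - survey_date[1]) + 1

print(data.frame(ts, survey_date, day))
```

The 00:16 survey on 07/05 lands on 07/04 after the shift, so it gets day 2 like the afternoon survey before it. For several participants, the same two lines would go inside a grouped operation such as group_by(Id) with mutate.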

Filter data based on subgroups R

In reality it's much more complex, but let's say my data looks like this:
df <- data.frame(
id = c(1,1,1,2,2,2,2,3,3,3),
event = c(0,0,0,1,1,1,1,0,0,0),
day = c(1,3,3,1,6,6,7,1,4,6),
time = c("2016-10-25 14:00:00", "2016-10-27 12:00:15", "2016-10-27 15:30:00",
"2016-10-23 11:00:00", "2016-10-28 08:00:15", "2016-10-28 23:00:00", "2016-10-29 12:00:00",
"2016-10-24 15:00:00", "2016-10-27 15:00:15", "2016-10-29 16:00:00"))
df$time <- as.POSIXct(df$time)
Output:
id event day time
1 1 0 1 2016-10-25 14:00:00
2 1 0 3 2016-10-27 12:00:15
3 1 0 3 2016-10-27 15:30:00
4 2 1 1 2016-10-23 11:00:00
5 2 1 6 2016-10-28 08:00:15
6 2 1 6 2016-10-28 23:00:00
7 2 1 7 2016-10-29 12:00:00
8 3 0 1 2016-10-24 15:00:00
9 3 0 4 2016-10-27 15:00:15
10 3 0 6 2016-10-29 16:00:00
What I need to do:
If event is 0, I want to keep only the last 24 hours per id.
If event is 1, I want to keep the 6th day.
I know how to keep the last 24 hours in general:
library(dplyr)
library(lubridate)
last_twentyfour_hours <- df %>%
group_by(id) %>%
filter(time > last(time) - hours(24))
But how do I filter differently for each group?
Thank you very much in advance!
Grouped by 'id' and 'event', apply filter with an if/else: if 0 is in 'event', use the OP's condition; otherwise, return the rows where 'day' is 6.
library(dplyr)
library(lubridate)
df %>%
group_by(id, event) %>%
filter(if(0 %in% event) time > last(time) - hours(24) else
day == 6) %>%
ungroup
-output
# A tibble: 5 × 4
id event day time
<dbl> <dbl> <dbl> <dttm>
1 1 0 3 2016-10-27 12:00:15
2 1 0 3 2016-10-27 15:30:00
3 2 1 6 2016-10-28 08:00:15
4 2 1 6 2016-10-28 23:00:00
5 3 0 6 2016-10-29 16:00:00
We could use the & and | operator:
df %>%
group_by(id) %>%
filter(event == 0 & time > last(time) - hours(24) |
event == 1 & day==6)
id event day time
<dbl> <dbl> <dbl> <dttm>
1 1 0 3 2016-10-27 12:00:15
2 1 0 3 2016-10-27 15:30:00
3 2 1 6 2016-10-28 08:00:15
4 2 1 6 2016-10-28 23:00:00
5 3 0 6 2016-10-29 16:00:00
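For comparison, the row-wise condition from the second answer can be sketched in base R as a logical mask. This is an illustration under assumptions rather than either answer's code: it uses max per id where the answers use last(time), which is the same only when rows are sorted by time within id, and tz = "UTC" is fixed so the example is reproducible.

```r
df <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  event = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 0),
  day   = c(1, 3, 3, 1, 6, 6, 7, 1, 4, 6),
  time  = as.POSIXct(c("2016-10-25 14:00:00", "2016-10-27 12:00:15", "2016-10-27 15:30:00",
                       "2016-10-23 11:00:00", "2016-10-28 08:00:15", "2016-10-28 23:00:00",
                       "2016-10-29 12:00:00", "2016-10-24 15:00:00", "2016-10-27 15:00:15",
                       "2016-10-29 16:00:00"), tz = "UTC"))

# Latest timestamp per id, as seconds since the epoch, repeated on every row
secs <- as.numeric(df$time)
last_secs <- ave(secs, df$id, FUN = max)

# Row-wise rule: last 24 hours for event 0, day 6 for event 1
keep <- (df$event == 0 & secs > last_secs - 24 * 3600) |
        (df$event == 1 & df$day == 6)

print(df[keep, ])   # rows 2, 3, 5, 6 and 10
```

Because event is constant within each id in this data, the per-group if/else and the row-wise mask select the same rows. If event could vary within an id, the two approaches would no longer be interchangeable.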

How to merge/join dataframes in R conditionally to dynamic time intervals

In R I have two data frames representing covid-19 patients and I would like to merge them into one table to be able to perform the desired analyses.
df1 contains metadata of each hospital stay
df2 contains observational data for different timepoints during each stay, usually one per day, though some stays have more than one per day
Some cases have multiple stays, and I find it difficult to merge these datasets so that each observation is assigned to the metadata of its respective stay, because the only information linking an observation to a stay is the dates.
Sample data can be generated with this code
df1 <- data.frame(id=c(1,2,3,3,3,4,4,5), in_date=c("2020-03-09", "2020-02-15" , "2020-04-16" , "2020-04-19", "2020-04-24", "2020-03-01" , "2020-03-15" , "2020-05-05") , location=c("a", "a" , "a", "b" , "b" , "a", "a" ,"a" ) )
df2 <- data.frame(id=c(1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,5,5,5) , obs_time=c(
"2020-03-09 01:00" , "2020-03-09 10:00" , "2020-03-10 05:00",
"2020-02-15 08:00" , "2020-02-16 09:00" , "2020-02-17 08:00",
"2020-04-16 14:30", "2020-04-16 07:30" , "2020-04-17 15:00" , "2020-04-25 07:20" ,
"2020-04-18 10:00" , "2020-04-19 10:30", "2020-04-20 12:00", "2020-04-21 12:00" ,
"2020-04-22 09:30" , "2020-04-24 23:00", "2020-04-23 17:30",
"2020-03-01 08:00" , "2020-03-02 08:00" , "2020-03-03 08:00" , "2020-03-15 16:45" ,
"2020-03-16 08:00" , "2020-05-05 13:45" , "2020-05-06 08:00" , "2020-05-07 11:00") ,
temp_celsius=runif(25, min=35.8, max=42.0) )
The lubridate functions ymd_hm and ymd were used to convert the factors with dates into POSIX date variables.
Be aware that the data is not completely sorted. Case id 3 has 3 stays, but they are consecutive with no days in between, and there are 2 observations on the first day. Case 4 has two stays with days in between.
When merging the two data frames, I need to assign the observations to their stays so that I can make plots whose starting point (time zero) is the time the patient came in for that stay. For example, I want to plot the development of temperature along the x axis for all cases by location group, where the first obs_time is time zero.
Although I have found other related threads, none of them covers the same problem.
The desired output would be like this:
id obs_time temp_celsius stay_id stay_day location
1 1 2020-03-09 01:00 40.53805 1 1 a
2 1 2020-03-09 10:00 37.54832 1 1 a
3 1 2020-03-10 05:00 38.78600 1 2 a
4 2 2020-02-15 08:00 36.19048 1 1 a
5 2 2020-02-16 09:00 37.74323 1 2 a
6 2 2020-02-17 08:00 41.83050 1 3 a
7 3 2020-04-16 14:30 39.82978 1 1 a
8 3 2020-04-16 07:30 39.84554 1 1 a
9 3 2020-04-17 15:00 38.31164 1 2 a
10 3 2020-04-25 07:20 36.37992 3 2 b
11 3 2020-04-18 10:00 38.65261 1 3 a
12 3 2020-04-19 10:30 38.94991 2 1 b
13 3 2020-04-20 12:00 36.84384 2 2 b
14 3 2020-04-21 12:00 35.81786 2 3 b
15 3 2020-04-22 09:30 39.20979 2 4 b
16 3 2020-04-24 23:00 41.39876 3 1 b
17 3 2020-04-23 17:30 37.68251 2 5 b
18 4 2020-03-01 08:00 41.55690 1 1 a
19 4 2020-03-02 08:00 38.53060 1 2 a
20 4 2020-03-03 08:00 39.99385 1 3 a
21 4 2020-03-15 16:45 38.29500 2 1 a
22 4 2020-03-16 08:00 41.20947 2 2 a
23 5 2020-05-05 13:45 36.43556 1 1 a
24 5 2020-05-06 08:00 41.06712 1 2 a
25 5 2020-05-07 11:00 36.76612 1 3 a
I hope someone can help me with this issue.
I think this just about covers it. Rolling joins from data.table and a little manipulation should get you there.
set.seed(1)
library(data.table)
df1 <- data.frame(id=c(1,2,3,3,3,4,4,5),
in_date=c("2020-03-09", "2020-02-15" , "2020-04-16" , "2020-04-19", "2020-04-24", "2020-03-01" , "2020-03-15" , "2020-05-05") ,
location=c("a", "a" , "a", "b" , "b" , "a", "a" ,"a" ) )
df2 <- data.frame(id=c(1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,5,5,5) , obs_time=c(
"2020-03-09 01:00" , "2020-03-09 10:00" , "2020-03-10 05:00",
"2020-02-15 08:00" , "2020-02-16 09:00" , "2020-02-17 08:00",
"2020-04-16 14:30", "2020-04-16 07:30" , "2020-04-17 15:00" , "2020-04-25 07:20" ,
"2020-04-18 10:00" , "2020-04-19 10:30", "2020-04-20 12:00", "2020-04-21 12:00" ,
"2020-04-22 09:30" , "2020-04-24 23:00", "2020-04-23 17:30",
"2020-03-01 08:00" , "2020-03-02 08:00" , "2020-03-03 08:00" , "2020-03-15 16:45" ,
"2020-03-16 08:00" , "2020-05-05 13:45" , "2020-05-06 08:00" , "2020-05-07 11:00") ,
temp_celsius=runif(25, min=35.8, max=42.0))
setDT(df1)
setDT(df2)
df1[, c('in_date') := .(lubridate::ymd(in_date))]
df1[, stay_id := 1:.N, by = id]
df2[, obs_time := lubridate::ymd_hm(obs_time)]
df2[, obs_date := as.Date(obs_time)]
df1[df2, .(id, in_date, obs_time, temp_celsius, stay_id, location), on = c('id', 'in_date' = 'obs_date'), roll = Inf][
, stay_day := as.numeric(difftime(as.Date(obs_time), min(as.Date(in_date)), units = 'day')) + 1, by = .(id, stay_id)][, in_date := NULL][]
#> id obs_time temp_celsius stay_id location stay_day
#> 1: 1 2020-03-09 01:00:00 37.44615 1 a 1
#> 2: 1 2020-03-09 10:00:00 38.10717 1 a 1
#> 3: 1 2020-03-10 05:00:00 39.35169 1 a 2
#> 4: 2 2020-02-15 08:00:00 41.43089 1 a 1
#> 5: 2 2020-02-16 09:00:00 37.05043 1 a 2
#> 6: 2 2020-02-17 08:00:00 41.37002 1 a 3
#> 7: 3 2020-04-16 14:30:00 41.65699 1 a 1
#> 8: 3 2020-04-16 07:30:00 39.89695 1 a 1
#> 9: 3 2020-04-17 15:00:00 39.70051 1 a 2
#> 10: 3 2020-04-25 07:20:00 36.18307 3 b 2
#> 11: 3 2020-04-18 10:00:00 37.07704 1 a 3
#> 12: 3 2020-04-19 10:30:00 36.89465 2 b 1
#> 13: 3 2020-04-20 12:00:00 40.05954 2 b 2
#> 14: 3 2020-04-21 12:00:00 38.18144 2 b 3
#> 15: 3 2020-04-22 09:30:00 40.57302 2 b 4
#> 16: 3 2020-04-24 23:00:00 38.88574 3 b 1
#> 17: 3 2020-04-23 17:30:00 40.24923 2 b 5
#> 18: 4 2020-03-01 08:00:00 41.94982 1 a 1
#> 19: 4 2020-03-02 08:00:00 38.15622 1 a 2
#> 20: 4 2020-03-03 08:00:00 40.62016 1 a 3
#> 21: 4 2020-03-15 16:45:00 41.59517 2 a 1
#> 22: 4 2020-03-16 08:00:00 37.11528 2 a 2
#> 23: 5 2020-05-05 13:45:00 39.84038 1 a 1
#> 24: 5 2020-05-06 08:00:00 36.57844 1 a 2
#> 25: 5 2020-05-07 11:00:00 37.45677 1 a 3
#> id obs_time temp_celsius stay_id location stay_day
Created on 2020-07-16 by the reprex package (v0.3.0)
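The idea behind the rolling join (match each observation to the most recent stay that started on or before it) can be illustrated in base R with findInterval. This is only a sketch of the matching rule for a single patient, using a subset of the dates for id 3; it is not how data.table implements roll = Inf.

```r
# Admission dates for id 3, sorted ascending (findInterval requires sorted breaks)
in_date <- as.Date(c("2020-04-16", "2020-04-19", "2020-04-24"))

# A few observation dates for id 3, in arbitrary order
obs_date <- as.Date(c("2020-04-16", "2020-04-25", "2020-04-18",
                      "2020-04-19", "2020-04-24", "2020-04-23"))

# Index of the last admission date that is <= each observation date
stay_id <- findInterval(as.numeric(obs_date), as.numeric(in_date))

# Day within the stay, counting the admission date as day 1
stay_day <- as.numeric(obs_date - in_date[stay_id]) + 1

print(data.frame(obs_date, stay_id, stay_day))
```

Here stay_day is counted from the admission date, whereas the answer above counts from the earliest observation date within each stay; the two agree whenever the first observation falls on the admission day. An observation dated before the first admission would get index 0 from findInterval and would need separate handling.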

Time intervals from data across multiple rows

I have a data structure similar to the one below:
# A tibble: 5 x 4
group task start end
<chr> <dbl> <chr> <chr>
1 a 1 01:00 01:30
2 a 2 02:00 02:25
3 b 3 01:05 01:40
4 b 4 01:50 02:30
5 a 5 03:00 03:30
Basically, I need to compute the time difference between the end of the previous task and the start of the next one within each group, with tasks taken in chronological order.
Desired output:
# A tibble: 5 x 7
group last_task last_end next_task next_start next_end interval
<chr> <dbl> <chr> <dbl> <chr> <chr> <chr>
1 a NA NA 1 01:00 01:30 NA
2 a 1 01:30 2 02:00 02:25 00:30
3 b NA NA 3 01:05 01:40 NA
4 b 3 01:40 4 01:50 02:30 00:10
5 a 2 02:25 5 03:00 03:30 00:35
Here is an approach with lead and lag from dplyr.
The output differs from your expected output, but I believe it matches your request in words because of grouping.
I use lubridate to parse the times, since they are stored as text (factors here) rather than date-times. It will fail for tasks that cross midnight.
library(dplyr)
library(lubridate)
data %>%
group_by(group) %>%
arrange(task) %>%
mutate(last_task = lag(task),
last_end = lag(end),
next_task = lead(task),
next_start = lead(start),
interval = ymd_hm(paste(today(),start,sep = " ")) - ymd_hm(paste(today(),lag(end),sep = " ")))
# A tibble: 5 x 9
group task start end last_task last_end next_task next_start interval
<fct> <int> <fct> <fct> <int> <fct> <int> <fct> <drtn>
1 a 1 01:00 01:30 NA NA 2 02:00 NA mins
2 a 2 02:00 02:25 1 01:30 5 03:00 30 mins
3 b 3 01:05 01:40 NA NA 4 01:50 NA mins
4 b 4 01:50 02:30 3 01:40 NA NA 10 mins
5 a 5 03:00 03:30 2 02:25 NA NA 35 mins
If you're set on the interval format, we can hack that together:
data %>%
group_by(group) %>%
arrange(task) %>%
mutate(last_task = lag(task),
last_end = lag(end),
next_task = lead(task),
next_start = lead(start),
interval = ymd_hm(paste(today(),start,sep = " ")) - ymd_hm(paste(today(),lag(end),sep = " ")),
interval = ifelse(is.na(interval),NA,paste(hour(as.period(interval)),minute(as.period(interval)),sep = ":")))
# A tibble: 5 x 9
group task start end last_task last_end next_task next_start interval
<fct> <int> <fct> <fct> <int> <fct> <int> <fct> <chr>
1 a 1 01:00 01:30 NA NA 2 02:00 NA
2 a 2 02:00 02:25 1 01:30 5 03:00 0:30
3 b 3 01:05 01:40 NA NA 4 01:50 NA
4 b 4 01:50 02:30 3 01:40 NA NA 0:10
5 a 5 03:00 03:30 2 02:25 NA NA 0:35
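If the zero-padded 00:30 style from the desired output matters, one option is to work in whole minutes and format with sprintf. This is a base R sketch under assumptions: to_minutes is a hypothetical helper, the inputs are character vectors in HH:MM form (convert factors with as.character first), and missing values are not handled.

```r
# Hypothetical helper: convert "HH:MM" strings to minutes since midnight
to_minutes <- function(hhmm) {
  parts <- strsplit(hhmm, ":", fixed = TRUE)
  vapply(parts, function(p) as.integer(p[1]) * 60L + as.integer(p[2]), integer(1))
}

# Previous end and next start for the three non-missing rows in the example
last_end   <- c("01:30", "01:40", "02:25")
next_start <- c("02:00", "01:50", "03:00")

gap <- to_minutes(next_start) - to_minutes(last_end)        # 30 10 35
interval <- sprintf("%02d:%02d", gap %/% 60L, gap %% 60L)   # zero-padded HH:MM

print(interval)   # "00:30" "00:10" "00:35"
```

Like the lubridate version, this assumes both times fall on the same day; a task that crosses midnight would give a negative gap.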

Calculate person years by subtracting start date from end date

Suppose I have a dataframe that looks like this:
id start_date death_date
1 2011-05-20 2014-12-11
2 2014-08-01 2016-01-05
3 2005-01-02 2015-10-20
4 2015-06-30 2016-02-14
5 2014-07-01 2014-09-03
I want to create a new column that contains the difference between death_date and start_date in months UNLESS start_date is before 2014-05-31. If start_date < 2014-05-31, then I want the new column to be the difference between death_date and 2014-05-31 in months.
Code to create sample dataframe:
id <- c(1:5)
start_date <- c(as.Date("2011-05-20"), as.Date("2014-08-01"),
as.Date("2005-01-02"), as.Date("2015-06-30"),
as.Date("2014-07-01"))
death_date <- c(as.Date("2014-12-11"), as.Date("2016-01-05"),
as.Date("2015-10-20"), as.Date("2016-02-14"),
as.Date("2014-09-03"))
example_dates <- data.frame(id, start_date, death_date)
Try this (months are approximated as 30 days):
example_dates$new_col <- round(ifelse(example_dates$start_date<as.Date("2014-05-31"),
example_dates$death_date-as.Date("2014-05-31"), example_dates$death_date-example_dates$start_date)/30, 2)
# id start_date death_date new_col
# 1 1 2011-05-20 2014-12-11 6.47
# 2 2 2014-08-01 2016-01-05 17.40
# 3 3 2005-01-02 2015-10-20 16.90
# 4 4 2015-06-30 2016-02-14 7.63
# 5 5 2014-07-01 2014-09-03 2.13
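The same rule can also be written without ifelse by first clamping the start date to the cutoff with pmax. This is a sketch of an equivalent formulation, not the answer's code; the names cutoff, effective_start and months_30 are illustrative, and a month is still approximated as 30 days.

```r
cutoff <- as.Date("2014-05-31")

start_date <- as.Date(c("2011-05-20", "2014-08-01", "2005-01-02",
                        "2015-06-30", "2014-07-01"))
death_date <- as.Date(c("2014-12-11", "2016-01-05", "2015-10-20",
                        "2016-02-14", "2014-09-03"))

# Use the later of start_date and the cutoff as the effective start
effective_start <- pmax(start_date, cutoff)

# Difference in days, converted to approximate 30-day months
months_30 <- round(as.numeric(death_date - effective_start) / 30, 2)

print(months_30)   # 6.47 17.40 16.90 7.63 2.13
```

Clamping keeps the calculation as a single Date subtraction, so the result stays a difftime until as.numeric is applied explicitly.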

Resources