How can I create a day number variable in R based on dates? - r

I want to create a variable with the number of the day a participant took a survey (first day, second day, third day, etc.).
The issue is that there are participants that took the survey after midnight.
For example, this is what it looks like:
Id  date
1   08/03/2020 08:17
1   08/03/2020 12:01
1   08/04/2020 15:08
1   08/04/2020 22:16
2   07/03/2020 08:10
2   07/03/2020 12:03
2   07/04/2020 15:07
2   07/05/2020 00:16
3   08/22/2020 09:17
3   08/23/2020 11:04
3   08/24/2020 00:01
4   10/03/2020 08:37
4   10/03/2020 11:13
4   10/04/2020 15:20
4   10/04/2020 23:05
This is what I want:
Id  date              day
1   08/03/2020 08:17  1
1   08/03/2020 12:01  1
1   08/04/2020 15:08  2
1   08/04/2020 22:16  2
2   07/03/2020 08:10  1
2   07/03/2020 12:03  1
2   07/04/2020 15:07  2
2   07/05/2020 00:16  2
3   08/22/2020 09:17  1
3   08/23/2020 11:04  2
3   08/24/2020 00:01  2
4   10/03/2020 08:37  1
4   10/03/2020 11:13  1
4   10/04/2020 15:20  2
4   10/04/2020 23:05  2
How can I create the day variable so that participants who took the survey after midnight are still counted as belonging to the previous day?
I tried the code here, but I have issues with participants taking surveys after midnight.

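Please check the code below.
code
library(dplyr)
library(tidyr)
data2 <- data %>%
  # parse the timestamp and keep only the calendar date
  mutate(date2 = as.Date(date, format = "%m/%d/%Y %H:%M")) %>%
  group_by(id) %>%
  # date3 holds each participant's first survey date, filled down the group
  mutate(row = row_number(),
         date3 = as.Date(ifelse(row == 1, date2, NA), origin = "1970-01-01")) %>%
  fill(date3) %>%
  ungroup() %>%
  # day number = days elapsed since the first survey date, plus 1
  mutate(diff = as.numeric(date2 - date3 + 1)) %>%
  select(-date2, -date3, -row)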
output
#> id date diff
#> 1 1 08/03/2020 08:17 1
#> 2 1 08/03/2020 12:01 1
#> 3 1 08/04/2020 15:08 2
#> 4 1 08/04/2020 22:16 2
#> 5 2 07/03/2020 08:10 1
#> 6 2 07/03/2020 12:03 1
#> 7 2 07/04/2020 15:07 2
#> 8 2 07/05/2020 00:16 3

Here is one approach that explicitly shows the dates considered. First, make sure your date is in POSIXct format, as suggested in the comments (if not done already). Then, if the hour is less than 2 (midnight to 2 AM), subtract 1 from the date so that survey_date reflects the day before; otherwise, keep the date as is. The time zone argument tz is set to "" (the current time zone) in both the as.POSIXct and as.Date calls, so the date is extracted in the same time zone the timestamp was parsed in. Finally, after grouping by Id, subtract the first survey_date from each survey_date and add 1 to get the day number relative to the first survey. You can use as.numeric to make this column numeric if desired.
Note: if you want to just note consecutive days taken the survey (and ignore gaps in days between surveys) you can substitute for the last line:
mutate(day = cumsum(survey_date != lag(survey_date, default = first(survey_date))) + 1)
This will increase day by 1 every new survey_date found for a given Id.
library(tidyverse)
library(lubridate)
df %>%
  mutate(date = as.POSIXct(date, format = "%m/%d/%Y %H:%M", tz = "")) %>%
  mutate(survey_date = if_else(hour(date) < 2,
                               as.Date(date, format = "%Y-%m-%d", tz = "") - 1,
                               as.Date(date, format = "%Y-%m-%d", tz = ""))) %>%
  group_by(Id) %>%
  mutate(day = survey_date - first(survey_date) + 1)
Output
Id date survey_date day
<int> <dttm> <date> <drtn>
1 1 2020-08-03 08:17:00 2020-08-03 1 days
2 1 2020-08-03 12:01:00 2020-08-03 1 days
3 1 2020-08-04 15:08:00 2020-08-04 2 days
4 1 2020-08-04 22:16:00 2020-08-04 2 days
5 2 2020-07-03 08:10:00 2020-07-03 1 days
6 2 2020-07-03 12:03:00 2020-07-03 1 days
7 2 2020-07-04 15:07:00 2020-07-04 2 days
8 2 2020-07-05 00:16:00 2020-07-04 2 days
9 3 2020-08-22 09:17:00 2020-08-22 1 days
10 3 2020-08-23 11:04:00 2020-08-23 2 days
11 3 2020-08-24 00:01:00 2020-08-23 2 days
12 4 2020-10-03 08:37:00 2020-10-03 1 days
13 4 2020-10-03 11:13:00 2020-10-03 1 days
14 4 2020-10-04 15:20:00 2020-10-04 2 days
15 4 2020-10-04 23:05:00 2020-10-04 2 days
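The cutoff rule can also be expressed by shifting every timestamp back by the cutoff before taking the date, which avoids the if_else branch. The following is a minimal base R sketch rather than the answer's own code: the value of cutoff_hours, the fixed UTC time zone, and the sample rows (participant 2 from the question plus a hypothetical participant 9 with a gap between surveys) are assumptions made for illustration.

```r
# Assumed sample data: Id 2 is taken from the question, Id 9 is hypothetical
df <- data.frame(
  Id   = c(2, 2, 2, 2, 9, 9),
  date = c("07/03/2020 08:10", "07/03/2020 12:03",
           "07/04/2020 15:07", "07/05/2020 00:16",
           "09/01/2020 09:00", "09/04/2020 10:00")
)

cutoff_hours <- 2  # assumption: responses before 02:00 belong to the previous day

# Parse in a fixed time zone so the result does not depend on the machine
dt <- as.POSIXct(df$date, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Moving the clock back by the cutoff pushes early-morning responses
# onto the previous calendar date
df$survey_date <- as.Date(dt - cutoff_hours * 3600, tz = "UTC")
sd_num <- as.numeric(df$survey_date)

# Day number relative to each participant's first survey date (gaps are kept)
df$day <- sd_num - ave(sd_num, df$Id, FUN = min) + 1

# Variant that ignores gaps: rank of each distinct survey date within the Id
df$day_consecutive <- ave(sd_num, df$Id,
                          FUN = function(x) match(x, sort(unique(x))))

print(df)
```

For the hypothetical participant 9, day jumps from 1 to 4 while day_consecutive goes from 1 to 2, which is the difference between the two variants described in the note above.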

Related

Given a series of dates and a birth day, is there a way to obtain the age at every date entry along with a final age using the lubridate package?

I have a database of information pertaining to individuals observed over time. I would like to find a way to obtain the age of these individuals whenever a record was taken. Assuming that BIRTH corresponds to an age of 0, I would like to obtain the age, in either days or months, at each later visit. It would also be helpful to obtain a final age (in either days or months) for each individual (not included in the code). For example, for ID A, the final age would be 10 months. I would like to use the lubridate package, as its built-in date features make it easier to work with dates. Any help with this is much appreciated.
date<-c("2000-01-01","2000-01-14","2000-01-25","2000-02-12","2000-02-27","2000-06-05","2000-10-30",
"2001-02-04","2001-06-15","2001-12-26","2002-05-22","2002-06-04",
"2000-01-08","2000-07-11","2000-08-18","2000-11-27")
ID<-c("A","A","A","A","A","A","A",
"B","B","B","B","B",
"C","C","C","C")
status<-c("BIRTH","ETC","ETC","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC")
df1<-data.frame(date,ID,status)
print(df1)
date ID status
1 2000-01-01 A BIRTH
2 2000-01-14 A ETC
3 2000-01-25 A ETC
4 2000-02-12 A ETC
5 2000-02-27 A ETC
6 2000-06-05 A ETC
7 2000-10-30 A ETC
8 2001-02-04 B BIRTH
9 2001-06-15 B ETC
10 2001-12-26 B ETC
11 2002-05-22 B ETC
12 2002-06-04 B ETC
13 2000-01-08 C BIRTH
14 2000-07-11 C ETC
15 2000-08-18 C ETC
16 2000-11-27 C ETC
date.new<-c("2000-01-01","2000-01-14","2000-01-25","2000-02-12","2000-02-27","2000-06-05","2000-10-30",
"2001-02-04","2001-06-15","2001-12-26","2002-05-22","2002-06-04",
"2000-01-08","2000-07-11","2000-08-18","2000-11-27")
ID.new<-c("A","A","A","A","A","A","A",
"B","B","B","B","B",
"C","C","C","C")
status.new<-c("BIRTH","ETC","ETC","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC")
age<-c(0,1,1,2,2,6,10,
0,4,10,15,16,
0,6,7,10)
df2<-data.frame(date.new,ID.new,status.new,age)
print(df2)
date.new ID.new status.new age
1 2000-01-01 A BIRTH 0
2 2000-01-14 A ETC 1
3 2000-01-25 A ETC 1
4 2000-02-12 A ETC 2
5 2000-02-27 A ETC 2
6 2000-06-05 A ETC 6
7 2000-10-30 A ETC 10
8 2001-02-04 B BIRTH 0
9 2001-06-15 B ETC 4
10 2001-12-26 B ETC 10
11 2002-05-22 B ETC 15
12 2002-06-04 B ETC 16
13 2000-01-08 C BIRTH 0
14 2000-07-11 C ETC 6
15 2000-08-18 C ETC 7
16 2000-11-27 C ETC 10
For calculations related to age in years or months, I'd like to encourage you to try the clock package rather than lubridate. lubridate is a great package, but produces some unexpected results with these kinds of calculations if you aren't 100% sure of what you are doing. In clock, the function to do this is date_count_between(). Notice that one of the results is different between clock and lubridate here:
library(clock)
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
date = c("2000-01-01","2000-01-14",
"2000-01-25","2000-02-12","2000-02-27","2000-06-05",
"2000-10-30","2001-02-04","2001-06-15","2001-12-26",
"2002-05-22","2002-06-04","2000-01-08","2000-07-11",
"2000-08-18","2000-11-27"),
ID = c("A","A","A","A","A","A",
"A","B","B","B","B","B","C","C","C","C"),
status = c("BIRTH","ETC","ETC","ETC",
"ETC","ETC","ETC","BIRTH","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC")
)
df %>%
mutate(date = date_parse(date)) %>%
group_by(ID) %>%
mutate(birth_date = date[status == "BIRTH"]) %>%
ungroup() %>%
mutate(
age_clock = date_count_between(birth_date, date, "month"),
age_lubridate = as.period(date - birth_date) %/% months(1))
#> # A tibble: 16 × 6
#> date ID status birth_date age_clock age_lubridate
#> <date> <chr> <chr> <date> <int> <dbl>
#> 1 2000-01-01 A BIRTH 2000-01-01 0 0
#> 2 2000-01-14 A ETC 2000-01-01 0 0
#> 3 2000-01-25 A ETC 2000-01-01 0 0
#> 4 2000-02-12 A ETC 2000-01-01 1 1
#> 5 2000-02-27 A ETC 2000-01-01 1 1
#> 6 2000-06-05 A ETC 2000-01-01 5 5
#> 7 2000-10-30 A ETC 2000-01-01 9 9
#> 8 2001-02-04 B BIRTH 2001-02-04 0 0
#> 9 2001-06-15 B ETC 2001-02-04 4 4
#> 10 2001-12-26 B ETC 2001-02-04 10 10
#> 11 2002-05-22 B ETC 2001-02-04 15 15
#> 12 2002-06-04 B ETC 2001-02-04 16 15
#> 13 2000-01-08 C BIRTH 2000-01-08 0 0
#> 14 2000-07-11 C ETC 2000-01-08 6 6
#> 15 2000-08-18 C ETC 2000-01-08 7 7
#> 16 2000-11-27 C ETC 2000-01-08 10 10
clock says that 2001-02-04 to 2002-06-04 is 16 months, while the lubridate method here only says it is 15 months. This has to do with the fact that the lubridate calculation uses the length of an average month, which doesn't always accurately reflect how we think about months.
Consider this simple example. I think most people would agree that a child born on 2020-02-28 is "1 month and 1 day" old on 2020-03-29. But lubridate shows 0 months!
library(clock)
library(lubridate, warn.conflicts = FALSE)
# "1 month and 1 day apart"
feb <- as.Date("2020-02-28")
mar <- as.Date("2020-03-29")
# As expected when thinking about age in months
date_count_between(feb, mar, "month")
#> [1] 1
# Not expected
as.period(mar - feb) %/% months(1)
#> [1] 0
secs_in_day <- 86400
secs_in_month <- as.numeric(months(1))
secs_in_month / secs_in_day
#> [1] 30.4375
# Less than 30.4375 days, so not 1 month
mar - feb
#> Time difference of 30 days
The issue is that lubridate uses the length of an average month in the computation, which is 30.4375 days. But there are only 30 days between these two dates, so it isn't considered a full month.
clock, on the other hand, uses the day component of the starting date to determine if a "full month" has passed or not. In other words, because we have passed the 28th of March, clock decides that 1 month has passed, which is consistent with how we generally think about age.
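To make the contrast concrete, the two counting rules can be sketched in base R. This is a simplified illustration of the ideas described above, not the actual implementation of either package: whole_months and avg_months are hypothetical helper names, and whole_months ignores edge cases such as start dates on the 29th to 31st of a month.

```r
# Hypothetical helper: count completed months using the day-of-month of the
# start date as the threshold (the idea attributed to clock above)
whole_months <- function(start, end) {
  s <- as.POSIXlt(start)
  e <- as.POSIXlt(end)
  n <- (e$year - s$year) * 12 + (e$mon - s$mon)
  # one month fewer if the end date has not yet reached the start day-of-month
  n - (e$mday < s$mday)
}

# Hypothetical helper: integer-divide the elapsed days by an average month
# of 30.4375 days (the arithmetic described for the lubridate calculation)
avg_months <- function(start, end) {
  as.numeric(difftime(end, start, units = "days")) %/% 30.4375
}

feb <- as.Date("2020-02-28")
mar <- as.Date("2020-03-29")
whole_months(feb, mar)  # 1: the 28th has been passed
avg_months(feb, mar)    # 0: 30 days is less than 30.4375

birth <- as.Date("2001-02-04")
visit <- as.Date("2002-06-04")
whole_months(birth, visit)  # 16
avg_months(birth, visit)    # 15: 485 days / 30.4375 is about 15.93
```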
Using dplyr and lubridate, we can do the following. We first turn the date column into a date. Then we group by ID, find the birth date and calculate the number of months since that date via some lubridate magic (see How do I use the lubridate package to calculate the number of months between two date vectors where one of the vectors has NA values?).
library(dplyr)
library(lubridate)
df1 %>%
mutate(date = as_date(date)) %>%
group_by(ID) %>%
mutate(birth_date = date[status == "BIRTH"],
age = as.period(date - birth_date) %/% months(1)) %>%
ungroup()
Which gives:
date ID status birth_date age
<date> <fct> <fct> <date> <dbl>
1 2000-01-01 A BIRTH 2000-01-01 0
2 2000-01-14 A ETC 2000-01-01 0
3 2000-01-25 A ETC 2000-01-01 0
4 2000-02-12 A ETC 2000-01-01 1
5 2000-02-27 A ETC 2000-01-01 1
6 2000-06-05 A ETC 2000-01-01 5
7 2000-10-30 A ETC 2000-01-01 9
8 2001-02-04 B BIRTH 2001-02-04 0
9 2001-06-15 B ETC 2001-02-04 4
10 2001-12-26 B ETC 2001-02-04 10
11 2002-05-22 B ETC 2001-02-04 15
12 2002-06-04 B ETC 2001-02-04 15
13 2000-01-08 C BIRTH 2000-01-08 0
14 2000-07-11 C ETC 2000-01-08 6
15 2000-08-18 C ETC 2000-01-08 7
16 2000-11-27 C ETC 2000-01-08 10
Which is your expected output except for some rounding differences. See my comment on your question.

Filter data based on subgroups R

In reality it's much more complex, but let's say my data looks like this:
df <- data.frame(
id = c(1,1,1,2,2,2,2,3,3,3),
event = c(0,0,0,1,1,1,1,0,0,0),
day = c(1,3,3,1,6,6,7,1,4,6),
time = c("2016-10-25 14:00:00", "2016-10-27 12:00:15", "2016-10-27 15:30:00",
"2016-10-23 11:00:00", "2016-10-28 08:00:15", "2016-10-28 23:00:00", "2016-10-29 12:00:00",
"2016-10-24 15:00:00", "2016-10-27 15:00:15", "2016-10-29 16:00:00"))
df$time <- as.POSIXct(df$time)
Output:
id event day time
1 1 0 1 2016-10-25 14:00:00
2 1 0 3 2016-10-27 12:00:15
3 1 0 3 2016-10-27 15:30:00
4 2 1 1 2016-10-23 11:00:00
5 2 1 6 2016-10-28 08:00:15
6 2 1 6 2016-10-28 23:00:00
7 2 1 7 2016-10-29 12:00:00
8 3 0 1 2016-10-24 15:00:00
9 3 0 4 2016-10-27 15:00:15
10 3 0 6 2016-10-29 16:00:00
What I need to do:
If event is 0, I want to keep only the last 24 hours per id.
If event is 1, I want to keep the 6th day.
I know how to keep the last 24 hours in general:
library(dplyr)
library(lubridate)
last_twentyfour_hours <- df %>%
  group_by(id) %>%
  filter(time > last(time) - hours(24))
But how do I filter differently for each group?
Thank you very much in advance!
Group by 'id' and 'event', then filter with if/else: if 0 is in 'event', use the OP's condition; otherwise, return the rows where 'day' is 6.
library(dplyr)
library(lubridate)
df %>%
  group_by(id, event) %>%
  filter(if (0 %in% event) time > last(time) - hours(24) else day == 6) %>%
  ungroup
Output:
# A tibble: 5 × 4
id event day time
<dbl> <dbl> <dbl> <dttm>
1 1 0 3 2016-10-27 12:00:15
2 1 0 3 2016-10-27 15:30:00
3 2 1 6 2016-10-28 08:00:15
4 2 1 6 2016-10-28 23:00:00
5 3 0 6 2016-10-29 16:00:00
We could use the & and | operators:
df %>%
  group_by(id) %>%
  filter((event == 0 & time > last(time) - hours(24)) |
         (event == 1 & day == 6))
id event day time
<dbl> <dbl> <dbl> <dttm>
1 1 0 3 2016-10-27 12:00:15
2 1 0 3 2016-10-27 15:30:00
3 2 1 6 2016-10-28 08:00:15
4 2 1 6 2016-10-28 23:00:00
5 3 0 6 2016-10-29 16:00:00
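The same row-wise condition can be built without dplyr by first broadcasting each id's last timestamp onto its rows. This is a base R sketch under a few assumptions: the times are parsed in UTC, the latest time per id is taken with max (equivalent to last() here only because the rows are already in time order), and the names last_time and keep are made up for illustration.

```r
df <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  event = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 0),
  day   = c(1, 3, 3, 1, 6, 6, 7, 1, 4, 6),
  time  = as.POSIXct(c(
    "2016-10-25 14:00:00", "2016-10-27 12:00:15", "2016-10-27 15:30:00",
    "2016-10-23 11:00:00", "2016-10-28 08:00:15", "2016-10-28 23:00:00",
    "2016-10-29 12:00:00", "2016-10-24 15:00:00", "2016-10-27 15:00:15",
    "2016-10-29 16:00:00"), tz = "UTC")
)

# Latest timestamp per id (in seconds), repeated on every row of that id
time_num  <- as.numeric(df$time)
last_time <- ave(time_num, df$id, FUN = max)

# event 0: keep rows within the last 24 hours of the id;
# event 1: keep rows where day is 6
keep <- (df$event == 0 & time_num > last_time - 24 * 3600) |
        (df$event == 1 & df$day == 6)

df[keep, ]
```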

Time intervals from data across multiple rows

I have a data structure similar to the one below:
# A tibble: 5 x 4
group task start end
<chr> <dbl> <chr> <chr>
1 a 1 01:00 01:30
2 a 2 02:00 02:25
3 b 3 01:05 01:40
4 b 4 01:50 02:30
5 a 5 03:00 03:30
Basically, I need to compute the time difference between the end of the previous task and the start of the next one within each group: the tasks must be taken in chronological order and belong to the same group.
Desired output:
# A tibble: 5 x 7
group last_task last_end next_task next_start next_end interval
<chr> <dbl> <chr> <dbl> <chr> <chr> <chr>
1 a NA NA 1 01:00 01:30 NA
2 a 1 01:30 2 02:00 02:25 00:30
3 b NA NA 3 01:05 01:40 NA
4 b 3 01:40 4 01:50 02:30 00:10
5 a 2 02:25 5 03:00 03:30 00:35
Here is an approach with lead and lag from dplyr.
The output differs from your expected output, but I believe it matches your request in words because of grouping.
I use lubridate to parse the times, since they are stored as factors rather than date-times. Because every time is pasted onto today's date, this will fail for tasks that cross midnight.
library(dplyr)
library(lubridate)
data %>%
  group_by(group) %>%
  arrange(task) %>%
  mutate(last_task = lag(task),
         last_end = lag(end),
         next_task = lead(task),
         next_start = lead(start),
         interval = ymd_hm(paste(today(), start, sep = " ")) -
           ymd_hm(paste(today(), lag(end), sep = " ")))
# A tibble: 5 x 9
group task start end last_task last_end next_task next_start interval
<fct> <int> <fct> <fct> <int> <fct> <int> <fct> <drtn>
1 a 1 01:00 01:30 NA NA 2 02:00 NA mins
2 a 2 02:00 02:25 1 01:30 5 03:00 30 mins
3 b 3 01:05 01:40 NA NA 4 01:50 NA mins
4 b 4 01:50 02:30 3 01:40 NA NA 10 mins
5 a 5 03:00 03:30 2 02:25 NA NA 35 mins
If you're set on the interval format, we can hack that together:
data %>%
  group_by(group) %>%
  arrange(task) %>%
  mutate(last_task = lag(task),
         last_end = lag(end),
         next_task = lead(task),
         next_start = lead(start),
         interval = ymd_hm(paste(today(), start, sep = " ")) -
           ymd_hm(paste(today(), lag(end), sep = " ")),
         interval = ifelse(is.na(interval), NA,
                           paste(hour(as.period(interval)),
                                 minute(as.period(interval)), sep = ":")))
# A tibble: 5 x 9
group task start end last_task last_end next_task next_start interval
<fct> <int> <fct> <fct> <int> <fct> <int> <fct> <chr>
1 a 1 01:00 01:30 NA NA 2 02:00 NA
2 a 2 02:00 02:25 1 01:30 5 03:00 0:30
3 b 3 01:05 01:40 NA NA 4 01:50 NA
4 b 4 01:50 02:30 3 01:40 NA NA 0:10
5 a 5 03:00 03:30 2 02:25 NA NA 0:35
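If a zero-padded HH:MM string such as 00:30 is wanted, one option is to work in minutes and format the result with sprintf. The sketch below is base R only and makes several assumptions: the times are plain character strings, the rows are already in task order, no task crosses midnight, and to_minutes is a hypothetical helper written for this example.

```r
data <- data.frame(
  group = c("a", "a", "b", "b", "a"),
  task  = 1:5,
  start = c("01:00", "02:00", "01:05", "01:50", "03:00"),
  end   = c("01:30", "02:25", "01:40", "02:30", "03:30"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert "HH:MM" strings to minutes since midnight
to_minutes <- function(hhmm) {
  parts <- strsplit(hhmm, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
         numeric(1))
}

start_min <- to_minutes(data$start)
end_min   <- to_minutes(data$end)

# End time of the previous task in the same group (NA for the first task)
prev_end <- ave(end_min, data$group, FUN = function(x) c(NA, head(x, -1)))

# Gap in minutes, formatted as zero-padded HH:MM where defined
gap <- start_min - prev_end
ok  <- !is.na(gap)
data$interval <- NA_character_
data$interval[ok] <- sprintf("%02d:%02d",
                             as.integer(gap[ok] %/% 60),
                             as.integer(gap[ok] %% 60))

data
```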

How to calculate number of hours from a fixed start point that varies among levels of a variable

The dataframe df1 summarizes detections of different individuals (ID) through time (Datetime). As a short example:
library(lubridate)
df1<- data.frame(ID= c(1,2,1,2,1,2,1,2,1,2),
Datetime= ymd_hms(c("2016-08-21 00:00:00","2016-08-24 08:00:00","2016-08-23 12:00:00","2016-08-29 03:00:00","2016-08-27 23:00:00","2016-09-02 02:00:00","2016-09-01 12:00:00","2016-09-09 04:00:00","2016-09-01 12:00:00","2016-09-10 12:00:00")))
> df1
ID Datetime
1 1 2016-08-21 00:00:00
2 2 2016-08-24 08:00:00
3 1 2016-08-23 12:00:00
4 2 2016-08-29 03:00:00
5 1 2016-08-27 23:00:00
6 2 2016-09-02 02:00:00
7 1 2016-09-01 12:00:00
8 2 2016-09-09 04:00:00
9 1 2016-09-01 12:00:00
10 2 2016-09-10 12:00:00
I want to calculate for each row, the number of hours (Hours_since_begining) since the first time that the individual was detected.
I would expect something like that (It can contain some mistakes since I did the calculations by hand):
> df1
ID Datetime Hours_since_begining
1 1 2016-08-21 00:00:00 0
2 2 2016-08-24 08:00:00 0
3 1 2016-08-23 12:00:00 60 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-23 12:00:00"
4 2 2016-08-29 03:00:00 115
5 1 2016-08-27 23:00:00 167 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-27 23:00:00"
6 2 2016-09-02 02:00:00 210
7 1 2016-09-01 12:00:00 276
8 2 2016-09-09 04:00:00 380
9 1 2016-09-01 12:00:00 276
10 2 2016-09-10 12:00:00 412
Does anyone know how to do it?
Thanks in advance!
You can do this:
library(tidyverse)
# first get min datetime by ID
min_datetime_id <- df1 %>%
  group_by(ID) %>%
  summarise(min_datetime = min(Datetime))
# join with df1 (explicitly on ID) and compute time difference
df1 <- df1 %>%
  left_join(min_datetime_id, by = "ID") %>%
  mutate(Hours_since_beginning = as.numeric(difftime(Datetime, min_datetime, units = "hours")))
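The join is not strictly required, since the per-ID minimum can be broadcast back onto the rows directly. Here is a base R sketch of that idea using ave(); parsing the timestamps in UTC and the intermediate names secs and first_seen are assumptions made for this example.

```r
df1 <- data.frame(
  ID = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
  Datetime = as.POSIXct(c(
    "2016-08-21 00:00:00", "2016-08-24 08:00:00", "2016-08-23 12:00:00",
    "2016-08-29 03:00:00", "2016-08-27 23:00:00", "2016-09-02 02:00:00",
    "2016-09-01 12:00:00", "2016-09-09 04:00:00", "2016-09-01 12:00:00",
    "2016-09-10 12:00:00"), tz = "UTC")
)

# Seconds since the epoch, so that ave() works on a plain numeric vector
secs <- as.numeric(df1$Datetime)

# First detection per ID, repeated on every row of that ID
first_seen <- ave(secs, df1$ID, FUN = min)

# Elapsed time since the first detection, converted from seconds to hours
df1$Hours_since_beginning <- (secs - first_seen) / 3600

df1
```

For this sample the result agrees with the hand-computed values in the question.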

Indicator variable for dates within 7 day range of each other in R

I am working with electronic health records data and would like to create an indicator variable called "episode" that joins antibiotic medications that occur within 7 days of each other. Below is a mock dataset and the output that I would like. I program in R.
df2=data.frame(
id = c(01,01,01,01,01,02,02,03,04),
date = c("2015-01-01 11:00",
"2015-01-06 13:29",
"2015-01-10 12:46",
"2015-01-25 14:45",
"2015-02-15 13:30",
"2015-01-01 10:00",
"2015-05-05 15:20",
"2015-01-01 15:19",
"2015-08-01 13:15"),
abx = c("AMPICILLIN",
"ERYTHROMYCIN",
"NEOMYCIN",
"AMPICILLIN",
"VANCOMYCIN",
"VANCOMYCIN",
"NEOMYCIN",
"PENICILLIN",
"ERYTHROMYCIN"));
df2
Output desired
id date abx episode
1 2015-01-01 11:00 AMPICILLIN 1
1 2015-01-06 13:29 ERYTHROMYCIN 1
1 2015-01-10 12:46 NEOMYCIN 1
1 2015-01-25 14:45 AMPICILLIN 2
1 2015-02-15 13:30 VANCOMYCIN 3
2 2015-01-01 10:00 VANCOMYCIN 1
2 2015-05-05 15:20 NEOMYCIN 1
3 2015-01-01 15:19 PENICILLIN 1
4 2015-08-01 13:15 ERYTHROMYCIN 1
Use ave like this:
grpno <- function(x) cumsum(c(TRUE, diff(x) >=7 ))
transform(df2, episode = ave(as.numeric(as.Date(date)), id, FUN = grpno))
giving:
id date abx episode
1 1 2015-01-01 11:00 AMPICILLIN 1
2 1 2015-01-06 13:29 ERYTHROMYCIN 1
3 1 2015-01-10 12:46 NEOMYCIN 1
4 1 2015-01-25 14:45 AMPICILLIN 2
5 1 2015-02-15 13:30 VANCOMYCIN 3
6 2 2015-01-01 10:00 VANCOMYCIN 1
7 2 2015-05-05 15:20 NEOMYCIN 2
8 3 2015-01-01 15:19 PENICILLIN 1
9 4 2015-08-01 13:15 ERYTHROMYCIN 1
or with dplyr and grpno from above:
library(dplyr)
df2 %>%
  group_by(id) %>%
  mutate(episode = date %>% as.Date %>% as.numeric %>% grpno) %>%
  ungroup
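To see why grpno produces episode numbers, it helps to look at its intermediate values for a single id. The sketch below walks through the five dates of id 1 from the mock data; the variable names are chosen for illustration only.

```r
grpno <- function(x) cumsum(c(TRUE, diff(x) >= 7))

# Dates for id 1, converted to day counts (as.Date drops the time of day)
dates <- as.Date(c("2015-01-01 11:00", "2015-01-06 13:29", "2015-01-10 12:46",
                   "2015-01-25 14:45", "2015-02-15 13:30"))
days <- as.numeric(dates)

gaps <- diff(days)                  # days between consecutive records
new_episode <- c(TRUE, gaps >= 7)   # the first record always opens an episode
episode <- cumsum(new_episode)      # running count of episode starts

gaps     # 5 4 15 21
episode  # 1 1 1 2 3
```

Note that each record is compared with the one immediately before it, not with the first record of the episode, so a chain of records spaced less than 7 days apart stays in one episode even if the chain as a whole spans more than 7 days.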

Resources