I am working on a project where I have to only include patients who had lab tests ordered at least 12 hours apart, and to keep the timestamp of each included lab test. The issue is that many patients get several labs done within the 12 hour window, but the client has asked to not include those tests. I have made it this far:
#Create dummy dataset
df = data.frame(
"Encounter" = c(rep("12345", times=16), rep("67890", times = 5)),
"Timestamp" = c("01/06/2022 04:00:00", "01/07/2022 08:00:00",
"01/08/2022 00:00:00", "01/08/2022 04:00:00",
"01/08/2022 08:00:00", "01/08/2022 20:00:00",
"01/09/2022 04:00:00", "01/09/2022 08:00:00",
"01/09/2022 20:00:00", "01/09/2022 23:26:00",
"01/10/2022 00:00:00", "01/10/2022 08:00:00",
"01/10/2022 20:00:00", "01/11/2022 00:00:00",
"01/11/2022 20:00:00", "01/12/2022 04:00:00",
"11/10/2021 11:00:00", "11/10/2021 12:00:00",
"11/10/2021 13:00:00", "11/10/2021 14:00:00",
"11/11/2021 00:00:00"))
#Convert timestamp to POSIXct (strptime() returns POSIXlt, which older dplyr versions reject)
df$Timestamp <- as.POSIXct(df$Timestamp, format="%m/%d/%Y %H:%M:%S")
#Calculate time (in hours) between each previous timestamp by Encounter
library(dplyr)
df <- df %>%
group_by(Encounter) %>%
arrange(Encounter, Timestamp) %>%
mutate(difftime(Timestamp, lag(Timestamp), units="hours"))
I can't seem to figure out what to do next. It seems like I need to calculate a rolling 12-hours that then resets to 0 once a row hits 12 hours, but I'm not sure how to go about it. Below is my ideal result:
df$Keep.Row <- c(1,1,1,0,0,1,0,1,1,0,0,1,1,0,1,0,1,0,0,0,1)
There is absolutely nothing elegant about this, but I believe it gives you what you’re looking for. I use a temporary variable to store the “rolling” sum before it’s reset once the hours between is 12 or greater.
library(tidyverse)
df <- df %>%
group_by(Encounter) %>%
arrange(Encounter, Timestamp) %>%
mutate(time_diff = difftime(Timestamp, lag(Timestamp), units="hours")) %>%
replace_na(list(time_diff = 0)) %>%
mutate(temp = ifelse(time_diff < 12 & lag(time_diff) >= 12, time_diff, lag(time_diff) + time_diff),
temp = ifelse(is.na(temp), 0, temp),
hours_between = ifelse(time_diff >= 12, time_diff,
ifelse(time_diff < 12 & lag(time_diff) >= 12, time_diff, lag(temp) + time_diff)),
keep = ifelse(hours_between >= 12 | is.na(hours_between), 1, 0)) %>%
select(-temp)
Created on 2022-01-27 by the reprex package (v2.0.1)
Here is an alternative option using accumulate. Here, you can use your differences, and once they reach the threshold of 12 hours, reset by just using the diff value (starting over) instead of continuing the cumulative sum. To include the first time for each Encounter, you can either make that diff 12 hours, or add a separate mutate and check where Timestamp == first(Timestamp) and in those cases set keep to 1.
library(tidyverse)
thresh <- 12
df %>%
group_by(Encounter) %>%
arrange(Encounter, Timestamp) %>%
mutate(diff = difftime(Timestamp, lag(Timestamp, default = first(Timestamp) - (thresh * 60 * 60)), units = "hours"),
keep = +(accumulate(diff, ~if_else(.x >= thresh, .y, .x + .y)) >= thresh))
Output
Encounter Timestamp diff keep
<chr> <dttm> <drtn> <int>
1 12345 2022-01-06 04:00:00 12.0000000 hours 1
2 12345 2022-01-07 08:00:00 28.0000000 hours 1
3 12345 2022-01-08 00:00:00 16.0000000 hours 1
4 12345 2022-01-08 04:00:00 4.0000000 hours 0
5 12345 2022-01-08 08:00:00 4.0000000 hours 0
6 12345 2022-01-08 20:00:00 12.0000000 hours 1
7 12345 2022-01-09 04:00:00 8.0000000 hours 0
8 12345 2022-01-09 08:00:00 4.0000000 hours 1
9 12345 2022-01-09 20:00:00 12.0000000 hours 1
10 12345 2022-01-09 23:26:00 3.4333333 hours 0
11 12345 2022-01-10 00:00:00 0.5666667 hours 0
12 12345 2022-01-10 08:00:00 8.0000000 hours 1
13 12345 2022-01-10 20:00:00 12.0000000 hours 1
14 12345 2022-01-11 00:00:00 4.0000000 hours 0
15 12345 2022-01-11 20:00:00 20.0000000 hours 1
16 12345 2022-01-12 04:00:00 8.0000000 hours 0
17 67890 2021-11-10 11:00:00 12.0000000 hours 1
18 67890 2021-11-10 12:00:00 1.0000000 hours 0
19 67890 2021-11-10 13:00:00 1.0000000 hours 0
20 67890 2021-11-10 14:00:00 1.0000000 hours 0
21 67890 2021-11-11 00:00:00 10.0000000 hours 1
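The reset logic described above can also be written as a plain loop, which may be easier to check by hand. This is only a sketch in base R: keep_spaced is a hypothetical helper name (not from any package), and it assumes the timestamps passed in belong to one Encounter and are already sorted.

```r
# Hypothetical helper: flag each timestamp that falls at least `thresh_hours`
# after the most recently kept timestamp. Assumes `times` is sorted.
keep_spaced <- function(times, thresh_hours = 12) {
  keep <- integer(length(times))
  last_kept <- NULL
  for (i in seq_along(times)) {
    gap_ok <- is.null(last_kept) ||
      as.numeric(difftime(times[i], last_kept, units = "hours")) >= thresh_hours
    if (gap_ok) {
      keep[i] <- 1L          # keep this test
      last_kept <- times[i]  # and restart the 12-hour clock from it
    }
  }
  keep
}

# Timestamps of Encounter 67890 from the dummy data (UTC is an assumption)
ts <- as.POSIXct(c("2021-11-10 11:00:00", "2021-11-10 12:00:00",
                   "2021-11-10 13:00:00", "2021-11-10 14:00:00",
                   "2021-11-11 00:00:00"), tz = "UTC")
keep_spaced(ts)
```

Inside the grouped pipeline, this could be applied per Encounter with mutate(keep = keep_spaced(Timestamp)).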
Probably missing something, but wouldn't this work:
library(dplyr)
df %>%
group_by(Encounter) %>%
arrange(Encounter, Timestamp) %>%
mutate(time_dif = difftime(Timestamp, lag(Timestamp), units="hours")) %>%
filter(time_dif > 12)
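A quick check on the second encounter (67890) may help show how filtering on the consecutive difference behaves. This sketch uses base R only, with the hours since the first test typed in by hand from the dummy data. The gaps between neighbouring tests are 1, 1, 1 and 10 hours, so none passes the `> 12` test, and the first row is dropped too because its lag is NA.

```r
# Hours since the first test for Encounter 67890
# (11:00, 12:00, 13:00, 14:00, then 00:00 the next day)
hours <- c(0, 1, 2, 3, 13)

# Difference to the previous row; NA for the first row, as lag() would give
consecutive <- c(NA, diff(hours))

# Rows that survive a filter on consecutive > 12 (filter() drops NA results)
kept_by_filter <- which(!is.na(consecutive) & consecutive > 12)

# Rows the question asks for: the first test, then each test that is
# at least 12 hours after the last kept one
wanted <- c(1, 5)

kept_by_filter
wanted
```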
Related
I have a time series, that spans almost 20 years with a resolution of 15 min.
I want to extract only hourly values (00:00:00, 01:00:00, and so on...) and plot the resulting time series.
The df looks like this:
3 columns: date, time, and discharge
How would you approach this?
A reproducible example would be good for this kind of question. Here is my code; I hope it helps:
library(dplyr)
#creating dummy data
df <- data.frame(time = seq(as.POSIXct("2018-01-01 00:00:00"), as.POSIXct("2018-01-01 23:59:59"), by = "15 min"), variable = runif(96, 0, 1))
Example output (first 5 rows only):
time variable
1 2018-01-01 00:00:00 0.331546992
2 2018-01-01 00:15:00 0.407269290
3 2018-01-01 00:30:00 0.635367577
4 2018-01-01 00:45:00 0.808612045
5 2018-01-01 01:00:00 0.258801201
df %>% filter(format(time, "%M:%S") == "00:00")
output:
1 2018-01-01 00:00:00 0.76198532
2 2018-01-01 01:00:00 0.01304103
3 2018-01-01 02:00:00 0.10729465
4 2018-01-01 03:00:00 0.74534184
5 2018-01-01 04:00:00 0.25942667
library(ggplot2)
df %>% filter(format(time, "%M:%S") == "00:00") %>% ggplot(aes(x = time, y = variable)) + geom_line()
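Since the question mentions separate date and time columns, here is a rough base R sketch of the same idea for that layout. The column names (date, time, discharge), the string formats and the UTC time zone are all assumptions, as the original data was not shown.

```r
# Assumed layout: character columns `date` and `time`, plus `discharge`
stamps <- seq(as.POSIXct("2018-01-01 00:00:00", tz = "UTC"),
              by = "15 min", length.out = 96)
flow <- data.frame(
  date = format(stamps, "%Y-%m-%d"),
  time = format(stamps, "%H:%M:%S"),
  discharge = seq_along(stamps)  # placeholder values
)

# Combine the two columns into a single POSIXct column
flow$datetime <- as.POSIXct(paste(flow$date, flow$time),
                            format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Keep only the rows that fall exactly on the hour
lt <- as.POSIXlt(flow$datetime)
hourly <- flow[lt$min == 0 & lt$sec == 0, ]
nrow(hourly)
```

With one day of 15-minute data, this leaves 24 rows, one per hour.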
I have a start and end date for individuals, and I need to determine whether the time elapsed from start to end is within 2 days
or is 3 or more days. These dates are assigned to record IDs. How can I filter the records that ended within 2 days (of the start date)
and the ones that ended after 3 days or later?
Record_id <- c("2245","6728","5122","9287")
Start <- c("2021-01-13 CST" ,"2021-01-21 CST" ,"2021-01-17 CST","2021-01-13 CST")
End <- c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST","2021-01-25 15:00:00 CST")
I tried using
elapsed.time <- DF$start %--% DF$End
time.duration <- as.duration(elapsed.time)
but I am getting an error because the End date contains an hour. Thank you.
Here's a dplyr pipe that will include both constraints (2 and 3 days):
df %>%
mutate(across(Start:End, as.POSIXct)) %>%
mutate(d = difftime(End, Start, units = "days")) %>%
filter(!between(difftime(End, Start, units = "days"), 2, 3))
# # A tibble: 4 x 4
# Record_id Start End d
# <chr> <dttm> <dttm> <drtn>
# 1 2245 2021-01-13 00:00:00 2021-01-21 18:00:00 8.750000 days
# 2 6728 2021-01-21 00:00:00 2021-01-22 16:00:00 1.666667 days
# 3 5122 2021-01-17 00:00:00 2021-01-22 13:00:00 5.541667 days
# 4 9287 2021-01-13 00:00:00 2021-01-25 15:00:00 12.625000 days
I included mutate(d= so that we can see what the actual differences are. If you were looking to remove those, then use filter(between(..)) (no !).
In the case of the data you provided, all observations are less than 2 or more than 3 days. I'll expand this range so that we can see it in effect:
df %>%
mutate(across(Start:End, as.POSIXct)) %>%
mutate(d = difftime(End, Start, units = "days")) %>%
filter(!between(difftime(End, Start, units = "days"), 1, 6))
# # A tibble: 2 x 4
# Record_id Start End d
# <chr> <dttm> <dttm> <drtn>
# 1 2245 2021-01-13 00:00:00 2021-01-21 18:00:00 8.750 days
# 2 9287 2021-01-13 00:00:00 2021-01-25 15:00:00 12.625 days
Data
df <- structure(list(Record_id = c("2245", "6728", "5122", "9287"), Start = c("2021-01-13 CST", "2021-01-21 CST", "2021-01-17 CST", "2021-01-13 CST"), End = c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST", "2021-01-25 15:00:00 CST")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
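If the goal is to label each record rather than drop rows, the same difftime() result can be turned into a category. This is a minimal base R sketch, not the code above: the bucket labels are made up for illustration, the " CST" suffix is left off the strings, and UTC is used as an assumed time zone.

```r
record_id <- c("2245", "6728", "5122", "9287")
start <- as.POSIXct(c("2021-01-13", "2021-01-21", "2021-01-17", "2021-01-13"),
                    tz = "UTC")
end <- as.POSIXct(c("2021-01-21 18:00:00", "2021-01-22 16:00:00",
                    "2021-01-22 13:00:00", "2021-01-25 15:00:00"),
                  tz = "UTC")

# Elapsed time in days as a plain number
days <- as.numeric(difftime(end, start, units = "days"))

# Illustrative labels: within 2 days, 3 or more days, or in between
bucket <- ifelse(days <= 2, "within 2 days",
                 ifelse(days >= 3, "3 or more days", "between 2 and 3 days"))

data.frame(record_id, days, bucket)
```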
I just converted the character to a date time with lubridate and then subtracted the dates. What you'll get back are days. I then filter for dates that are within 2 days.
Record_id<- c("2245","6728","5122","9287")
Start<-c("2021-01-13 CST" ,"2021-01-21 CST" ,"2021-01-17 CST","2021-01-13 CST")
End<-c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST","2021-01-25 15:00:00 CST")
df <- dplyr::tibble(x = Record_id, y = Start, z = End)
df %>%
dplyr::mutate_at(vars(y:z), ~ lubridate::as_datetime(.)) %>%
dplyr::mutate(diff = as.numeric(difftime(z, y, units = "days"))) %>% # fix the units; z - y picks them automatically
dplyr::filter(diff <= 2 )
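One thing worth checking when subtracting date-times directly: the result is a difftime whose units are chosen automatically, so the number is not always in days. A small base R sketch with made-up timestamps:

```r
a <- as.POSIXct("2021-01-21 00:00:00", tz = "UTC")
b <- as.POSIXct("2021-01-21 16:00:00", tz = "UTC")

auto  <- b - a                           # units picked automatically
fixed <- difftime(b, a, units = "days")  # units stated explicitly

units(auto)        # "hours", because the gap is under one day
as.numeric(auto)   # 16
as.numeric(fixed)  # 16 hours expressed in days
```

Comparing as.numeric(auto) against a threshold of 2 would silently treat 16 hours as 16, which is why stating units = "days" is the safer choice for this kind of filter.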
I have a dataset that contains start and end time stamps, as well as a performance percentage. I'd like to calculate group statistics over hourly blocks, e.g. "the average performance for the midnight hour was x%."
My question is if there is a more efficient way to do this than a series of ifelse() statements.
# some sample data
pre.starting <- data.frame(starting = format(seq.POSIXt(from =
as.POSIXct(Sys.Date()), to = as.POSIXct(Sys.Date()+1), by = "5 min"),
"%H:%M", tz="GMT"))
pre.ending <- data.frame(ending = pre.starting[seq(1, nrow(pre.starting),
2), ])
ending2 <- pre.ending[-c(1), ]
starting2 <- data.frame(pre.starting = pre.starting[!(pre.starting$starting
%in% pre.ending$ending),])
dataset <- data.frame(starting = starting2
, ending = ending2
, perct = rnorm(nrow(starting2), 0.5, 0.2))
For example, I could create hour blocks with code along the lines of the following:
dataset2 <- dataset %>%
mutate(hour = ifelse(starting >= 00:00 & ending < 01:00, 12
, ifelse(starting >= 01:00 & ending < 02:00, 1
, ifelse(starting >= 02:00 & ending < 03:00, 13)))
) %>%
group_by(hour) %>%
summarise(mean.perct = mean(perct, na.rm=T))
Is there a way to make this code more efficient, or improve beyond ifelse()?
We can convert the timestamps to POSIXct, use cut() to bin the ending time into hourly intervals, and then take the mean for each hour.
library(dplyr)
dataset %>%
mutate_at(vars(pre.starting, ending), as.POSIXct, format = "%H:%M") %>%
group_by(ending_hour = cut(ending, breaks = "1 hour")) %>%
summarise(mean.perct = mean(perct, na.rm = TRUE))
# ending_hour mean.perct
# <fct> <dbl>
# 1 2019-09-30 00:00:00 0.540
# 2 2019-09-30 01:00:00 0.450
# 3 2019-09-30 02:00:00 0.612
# 4 2019-09-30 03:00:00 0.470
# 5 2019-09-30 04:00:00 0.564
# 6 2019-09-30 05:00:00 0.437
# 7 2019-09-30 06:00:00 0.413
# 8 2019-09-30 07:00:00 0.397
# 9 2019-09-30 08:00:00 0.492
#10 2019-09-30 09:00:00 0.613
# … with 14 more rows
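For times stored as "HH:MM" strings, a lighter alternative is to slice out the hour as text and average by it. This is a sketch of a different technique from the cut() approach above (string slicing plus tapply() in base R), with a few made-up values standing in for the sample data.

```r
# Made-up ending times and performance values
ending <- c("00:10", "00:40", "01:20", "01:50", "02:30")
perct  <- c(0.4, 0.6, 0.5, 0.7, 0.3)

# The first two characters of "HH:MM" are the hour block
hour <- substr(ending, 1, 2)

# Mean performance per hour block, returned as a named vector
mean_by_hour <- tapply(perct, hour, mean)
mean_by_hour
```

This relies on the times being zero-padded 24-hour strings; anything else would need parsing into POSIXct first.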
Assume we have an interval spanning several days (interval "A" in Figure below).
library(lubridate)
int <- interval("2018-01-01 22:00:00", "2018-01-04 10:00:00")
In hours, I get
as.period(int, unit = "hours")
"60H 0M 0S"
Now, I want to subtract all non-working-hours, here 16:00-08:00 (greyed out) in that interval, i.e. only keep the blue parts (08:00-16:00) and, again, calculate the remaining hours (see "B" in Figure below), which would be 8 + 8 + 2 = 18 hours.
One approach would be to create a list of intervals I want to keep which span the entire interval and then calculate intersections. (The code below could, of course, be setup programmatically using floor/ceiling/seq functions etc.)
int_keep <- list(
interval("2018-01-01 08:00:00", "2018-01-01 16:00:00"),
interval("2018-01-02 08:00:00", "2018-01-02 16:00:00"),
interval("2018-01-03 08:00:00", "2018-01-03 16:00:00"),
interval("2018-01-04 08:00:00", "2018-01-04 16:00:00"),
interval("2018-01-05 08:00:00", "2018-01-05 16:00:00")
)
l <- lapply(int_keep, function(x) intersect(x, int))
mns <- sapply(l, as.numeric) # returns seconds
sum(mns, na.rm = T) / 60 / 60 # sum of intersections in hours
[1] 18
While this works, it appears utterly clumsy to me. What would be a less tedious way to do this?
df <- data.frame(DateTime=seq.POSIXt(as.POSIXct("2018-01-01 22:00:00"), as.POSIXct("2018-01-04 10:00:00"), by = "1 hour"))
head(df)
#DateTime
#1 2018-01-01 22:00:00
#2 2018-01-01 23:00:00
#3 2018-01-02 00:00:00
#4 2018-01-02 01:00:00
#5 2018-01-02 02:00:00
#6 2018-01-02 03:00:00
#you want the hours worked between A and B
A <-format(strptime("8:00:00", "%H:%M:%S"),"%H:%M:%S")
B <-format(strptime("16:00:00", "%H:%M:%S"),"%H:%M:%S")
#a simple ifelse statement to assign a value of 1 to column "value" if the time is after 08:00 and up to 16:00, or a 0 if it's not:
df$value<-ifelse((format(df$DateTime,"%H:%M:%S")>A & format(df$DateTime,"%H:%M:%S")<=B),1,0)
tail(df)
#DateTime value
#56 2018-01-04 05:00:00 0
#57 2018-01-04 06:00:00 0
#58 2018-01-04 07:00:00 0
#59 2018-01-04 08:00:00 0
#60 2018-01-04 09:00:00 1
#61 2018-01-04 10:00:00 1
#now summing the value column will give you the total hours worked:
TotalHoursWorked<-sum(df$value)
TotalHoursWorked
#[1] 18
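Counting hourly rows works when the interval starts and ends on the hour. For arbitrary endpoints, one option is to compute the overlap with each day's working window directly. The following is a sketch under assumptions: working_hours is a hypothetical helper name, the window is the same every day, and everything is in UTC so that daylight saving changes do not come into play.

```r
# Hypothetical helper: hours of [start, end] that fall inside the daily
# window from `open` to `close` (given as hours after midnight)
working_hours <- function(start, end, open = 8, close = 16) {
  days <- seq(as.Date(start), as.Date(end), by = "day")
  midnight  <- as.POSIXct(paste(days, "00:00:00"), tz = "UTC")
  win_open  <- as.numeric(midnight) + open * 3600
  win_close <- as.numeric(midnight) + close * 3600
  # Overlap in seconds of each day's window with the interval (negative = none)
  overlap <- pmin(win_close, as.numeric(end)) - pmax(win_open, as.numeric(start))
  sum(pmax(overlap, 0)) / 3600
}

start <- as.POSIXct("2018-01-01 22:00:00", tz = "UTC")
end   <- as.POSIXct("2018-01-04 10:00:00", tz = "UTC")
working_hours(start, end)
```

For the interval in the question this gives 0 + 8 + 8 + 2 = 18 hours, matching the result above.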
I am having some problems using the lag function in dplyr. This is my dataset.
ID <- c(100, 100, 100, 200, 200, 300, 300)
daytime <- c("2010-12-21 06:00:00", "2010-12-21 09:00:00", "2010-12-21 13:00:00 ", "2010-12-23 23:00:00", "2010-12-24 02:00:00", "2010-12-25 19:00:00", "2010-12-31 08:00:00")
lagfirstvisit <- c(0, 0, 2, 0, 1, 0, 0)
table <- cbind(ID, daytime, lagfirstvisit)
table <- as.data.frame(table)
table$daytime <- as.POSIXct(table$daytime)
My aim is to generate a new column with the lag of variable daytime by the number as indicated in the lagfirstvisit column. i.e. If lagfirstvisit == 2, I would want the lag2 daytime value of the particular ID. If lagfirstvisit == 0, it would mean to keep the observation row's original daytime value.
My expected result is as follows:
ID <- c(100, 100, 100, 200, 200, 300, 300)
daytime <- c("2010-12-21 06:00:00", "2010-12-21 09:00:00", "2010-12-21 13:00:00 ", "2010-12-23 23:00:00", "2010-12-24 02:00:00", "2010-12-25 19:00:00", "2010-12-31 08:00:00")
lagfirstvisit <- c(0, 0, 2, 0, 1, 0, 0)
result <- c("2010-12-21 06:00:00", "2010-12-21 09:00:00", "2010-12-21 06:00:00", "2010-12-23 23:00:00", "2010-12-23 23:00:00", "2010-12-25 19:00:00", "2010-12-31 08:00:00")
table.results <- cbind(ID, daytime, lagfirstvisit, result)
Currently, the code I am using is:
table <- table %>%
group_by(ID) %>%
mutate(result = lag(as.POSIXct(daytime, format="%m/%d/%Y %H:%M:%S", tz= "UTC"), n = as.integer(lagfirstvisit)))
However, I get the error:
Error in mutate_impl(.data, dots) :
Evaluation error: n must be a non-negative integer scalar, not integer of length 3.
Does anyone know how I can resolve this problem? Thank you very much!
table.results %>%
group_by(ID) %>%
mutate(
result2=mapply(`[`, list(day), row_number() - lagfirstvisit)
)
# A tibble: 7 x 5
# Groups: ID [3]
ID day lagfirstvisit result result2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 100. 21. 0. 21. 21.
2 100. 22. 0. 22. 22.
3 100. 23. 2. 21. 21.
4 200. 12. 0. 12. 12.
5 200. 13. 1. 12. 12.
6 300. 19. 0. 19. 19.
7 300. 22. 0. 22. 22.
table%>%
mutate_all(~as.numeric(as.character(.x)))%>%#First ensure all columns are numeric
mutate(result=day[1:n()-lagfirstvisit])# you can also use row_number() instead of 1:n()
ID day lagfirstvisit result
1 100 21 0 21
2 100 22 0 22
3 100 23 2 21
4 200 12 0 12
5 200 13 1 12
6 300 19 0 19
7 300 22 0 22
Caution: Avoid using built-in function names as variable names. For example, you should not use the name table, since it is a function in base R.
EDIT:
With the new data, the procedure remains the same, as long as the lagfirstvisit is numeric:
table%>%
mutate(result=daytime[1:n()-as.numeric(as.character(lagfirstvisit))])
ID daytime lagfirstvisit result
1 100 2010-12-21 06:00:00 0 2010-12-21 06:00:00
2 100 2010-12-21 09:00:00 0 2010-12-21 09:00:00
3 100 2010-12-21 13:00:00 2 2010-12-21 06:00:00
4 200 2010-12-23 23:00:00 0 2010-12-23 23:00:00
5 200 2010-12-24 02:00:00 1 2010-12-23 23:00:00
6 300 2010-12-25 19:00:00 0 2010-12-25 19:00:00
7 300 2010-12-31 08:00:00 0 2010-12-31 08:00:00
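The indexing trick above (row position minus the lag) can be spelled out in base R with a guard so that a lag never reaches into a different ID. This is a sketch only: it assumes the rows are sorted by ID and then by time, uses UTC as an assumed time zone, and returns NA wherever the lag would cross into another ID.

```r
ID <- c(100, 100, 100, 200, 200, 300, 300)
daytime <- as.POSIXct(c("2010-12-21 06:00:00", "2010-12-21 09:00:00",
                        "2010-12-21 13:00:00", "2010-12-23 23:00:00",
                        "2010-12-24 02:00:00", "2010-12-25 19:00:00",
                        "2010-12-31 08:00:00"), tz = "UTC")
lagfirstvisit <- c(0, 0, 2, 0, 1, 0, 0)

# Row to look up for each observation: its own position minus its lag
idx <- seq_along(ID) - lagfirstvisit

# Only accept the lookup if it stays within the same ID
same_id <- idx >= 1 & ID[pmax(idx, 1)] == ID

# Index the vector directly; NA indices give NA date-times
result <- daytime[ifelse(same_id, idx, NA)]
result
```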
I think this is a little cleaner than the current answers:
table %>%
group_by(ID, lagfirstvisit) %>%
mutate(result = dplyr::lag(daytime, n = lagfirstvisit[1])) %>%
ungroup()
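Since the data is also grouped by lagfirstvisit, every value within a group is the same, so taking the first one works. Note that the lag is then computed within each (ID, lagfirstvisit) group, so it can only reach rows that share the same lag value; a row that is alone in its group gets NA.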