I'm working with EHR data and trying to identify the time when an individual identified by ID has had at least 2 unique events of type "A" and at least 1 unique event of type "B" within a 6hr time range. The order of events does not matter - the 6hr range can start with either type "A" or "B". I would like to create a new data frame that contains the timestamp when the individual meets the criteria.
The data looks like this:
library(lubridate)
d <- data.frame(ID = c("Z001","Z001","Z001","Z001","Z001","Z001"),
                event = c("TEMP","HR","TEMP","RR","LACTATE","INR"),
                eventType = c("A","A","A","A","B","B"),
                eventDTS = as_datetime(c("2022-06-01T02:00:00Z","2022-06-01T02:00:00Z",
                                         "2022-06-01T02:05:00Z","2022-06-01T02:01:00Z",
                                         "2022-06-01T03:00:00Z","2022-06-01T03:45:00Z")),
                stringsAsFactors = FALSE)
  ID   event eventType            eventDTS
Z001    TEMP         A 2022-06-01 02:00:00
Z001      HR         A 2022-06-01 02:00:00
Z001      RR         A 2022-06-01 02:01:00
Z001    TEMP         A 2022-06-01 02:05:00
Z001 LACTATE         B 2022-06-01 03:00:00
Z001     INR         B 2022-06-01 03:45:00
For this individual the output should look like this:
  ID         lastQualDTS
Z001 2022-06-01 03:00:00
So far I've been able to create a count of events within the 6hr range for each event, using the post "R to create a tally of previous events within a sliding window time period". But I can't figure out how to actually identify the timestamp of interest.
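For illustration, here is a minimal sketch of one possible approach, assuming dplyr and lubridate (the `qualifies` helper column is just an illustrative name): test the trailing 6-hour window ending at each event, then take the earliest qualifying timestamp per ID. Because the criteria are first satisfied the moment the last needed event arrives, checking trailing windows at event times is sufficient even though a qualifying window may begin with either type.

library(dplyr)
library(lubridate)

qual <- d %>%
  group_by(ID) %>%
  arrange(eventDTS, .by_group = TRUE) %>%
  mutate(qualifies = sapply(eventDTS, function(t) {
    # events in the 6-hour window ending at this event
    w <- eventDTS > t - hours(6) & eventDTS <= t
    # >= 2 unique type-A events and >= 1 unique type-B event in the window
    n_distinct(event[w & eventType == "A"]) >= 2 &&
      n_distinct(event[w & eventType == "B"]) >= 1
  })) %>%
  filter(qualifies) %>%
  summarise(lastQualDTS = min(eventDTS))

On the sample data this returns 2022-06-01 03:00:00 for Z001. The per-event window scan is quadratic within each ID, so it suits moderate volumes; a rolling-window tool such as slider could replace the sapply() for large EHR extracts.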
I am seeing missing timeseries data corresponding to the GMT time change for summer (I guess this might also happen for winter). My query has two parts:
1. How to generate the missing timestamps from the code below. The table is in xts format.
2. How to filter the records by time in a way that will include the missing timestamps once generated. This is only a sample dataset. Thanks.
start <- as.POSIXct("2022-03-27 00:58:00")
interval <- 2
end <- as.POSIXct("2022-03-27 03:00:00")
missing_timestamp <- data.frame(TIMESTAMP = seq(from=start, by=interval*60, to=end))
head(missing_timestamp)
TIMESTAMP
1 2022-03-27 00:58:00
2 2022-03-27 02:00:00
3 2022-03-27 02:02:00
4 2022-03-27 02:04:00
5 2022-03-27 02:06:00
6 2022-03-27 02:08:00
Update:
Related to the second query: when the code below is executed to filter times between 00:00 and 02:06, all rows are returned rather than the first four records.
a <- missing_timestamp %>% filter(TIMESTAMP > ymd_hms("2022-03-27 00:00:00") & TIMESTAMP < ymd_hms("2022-03-27 02:06:00"))
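The filtering problem is most likely a timezone mismatch rather than missing rows: as.POSIXct() above parsed in the session's local timezone, while ymd_hms() parses to UTC by default, so the filter boundaries are shifted by the UTC offset. A minimal sketch of the fix, assuming dplyr and lubridate:

library(dplyr)
library(lubridate)

# parse the filter boundaries in the same timezone as TIMESTAMP
a <- missing_timestamp %>%
  filter(TIMESTAMP > ymd_hms("2022-03-27 00:00:00", tz = Sys.timezone()) &
         TIMESTAMP < ymd_hms("2022-03-27 02:06:00", tz = Sys.timezone()))

As for the first part: the 01:00-01:59 local timestamps genuinely do not exist on 2022-03-27 in a DST-observing timezone, which is why seq() skips them. Generating the sequence in UTC, e.g. with as.POSIXct("2022-03-27 00:58:00", tz = "UTC"), yields the uninterrupted 2-minute grid.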
I am working in R and trying to understand the best way to join data frames when one of them is very large.
I have a data frame which is not excruciatingly large but also not small (~80K observations of 8 variables, 144 MB). I need to match observations from this data frame to observations from another smaller data frame on the basis of a date range. Specifically, I have:
events.df <- data.frame(individual=c('A','B','C','A','B','C'),
event=c(1,1,1,2,2,2),
time=as.POSIXct(c('2014-01-01 08:00:00','2014-01-05 13:00:00','2014-01-10 07:00:00','2014-05-01 01:00:00','2014-06-01 12:00:00','2014-08-01 10:00:00'),format="%Y-%m-%d %H:%M:%S"))
trips.df <- data.frame(individual=c('A','B','C'),trip=c('x1A','CA1B','XX78'),
trip_start = as.POSIXct(c('2014-01-01 06:00:00','2014-01-04 03:00:00','2014-01-08 12:00:00'),format="%Y-%m-%d %H:%M:%S"),
trip_end=as.POSIXct(c('2014-01-03 06:00:00','2014-01-06 03:00:00','2014-01-11 12:00:00'),format="%Y-%m-%d %H:%M:%S"))
In my case events.df contains around 80,000 unique events and I am looking to match them to events from the trips.df data frame, which has around 200 unique trips. Each trip has a unique trip identifier ('trip'). I would like to match based on whether the event took place during the date range defining a trip.
First, I have tried fuzzy_inner_join from the fuzzyjoin library. It works great in principle:
fuzzy_inner_join(events.df, trips.df,
                 by = c('individual'='individual', 'time'='trip_start', 'time'='trip_end'),
                 match_fun = list(`==`, `>=`, `<=`))
individual.x event time individual.y trip trip_start trip_end
1 A 1 2014-01-01 08:00:00 A x1A 2014-01-01 06:00:00 2014-01-03 06:00:00
2 B 1 2014-01-05 13:00:00 B CA1B 2014-01-04 03:00:00 2014-01-06 03:00:00
3 C 1 2014-01-10 07:00:00 C XX78 2014-01-08 12:00:00 2014-01-11 12:00:00
but runs out of memory when I try to apply it to the larger data frames.
Here is a second solution I cobbled together:
library(data.table)  # for rbindlist()
trip.match <- function(tripid){
  individual <- trips.df$individual[trips.df$trip==tripid]
  start <- trips.df$trip_start[trips.df$trip==tripid]
  end <- trips.df$trip_end[trips.df$trip==tripid]
  tmp <- events.df[events.df$individual==individual &
                   events.df$time >= start &
                   events.df$time <= end, ]
  tmp$trip <- tripid
  return(tmp)
}
result <- data.frame(rbindlist(lapply(unique(trips.df$trip), trip.match)))
This solution also breaks down because the list object returned by lapply is 25GB and the attempt to cast this list to a data frame also exhausts the available memory.
I have been able to do what I need to do using a for loop. Basically, I append a column onto events.df and loop through the unique trip identifiers and populate the new column in events.df accordingly:
events.df$trip <- NA
for(i in unique(trips.df$trip)){
individual <- trips.df$individual[trips.df$trip==i]
start <- min(trips.df$trip_start[trips.df$trip==i])
end <- max(trips.df$trip_end[trips.df$trip==i])
events.df$trip[events.df$individual==individual & events.df$time >= start & events.df$time <= end] <- i
}
> events.df
individual event time trip
1 A 1 2014-01-01 08:00:00 x1A
2 B 1 2014-01-05 13:00:00 CA1B
3 C 1 2014-01-10 07:00:00 XX78
4 A 2 2014-05-01 01:00:00 <NA>
5 B 2 2014-06-01 12:00:00 <NA>
6 C 2 2014-08-01 10:00:00 <NA>
My question is this: I'm not a very advanced R programmer, so I expect there is a more memory-efficient way to accomplish what I'm trying to do. Is there?
Try creating a table that expands the trip ranges by hour and then merge it with the events. Here is an example (using data.table, since it outperforms data.frame for larger datasets):
library('data.table')
tripsV <- unique(trips.df$trip)
tripExpand <- function(t){
dateV <- seq(trips.df$trip_start[trips.df$trip == t],
trips.df$trip_end[trips.df$trip == t],
by = 'hour')
data.table(trip = t, time = dateV)
}
trips.dt <- rbindlist(lapply(tripsV, tripExpand))
merge(events.df,
trips.dt,
by = 'time')
Output:
time individual event trip
1 2014-01-01 08:00:00 A 1 x1A
2 2014-01-05 13:00:00 B 1 CA1B
3 2014-01-10 07:00:00 C 1 XX78
So you are basically translating the trip table into a trip-hour long-form panel dataset, which makes for easy merging with the event dataset. I haven't benchmarked it against your current method, but my hunch is that it will be more memory- and CPU-efficient.
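An alternative worth considering (not benchmarked here, and assuming data.table >= 1.9.8 for non-equi joins): a single update join avoids materializing the trip-hour panel entirely and also matches on individual:

library(data.table)
setDT(events.df)
setDT(trips.df)

# non-equi update join: for each trip, stamp its id onto the events of the
# same individual whose time falls inside [trip_start, trip_end]
events.df[trips.df,
          on = .(individual, time >= trip_start, time <= trip_end),
          trip := i.trip]

Unmatched events keep trip = NA, matching the output of the for-loop approach above.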
Consider splitting your data with data.table's split method, running each subset through fuzzy_inner_join, and then calling rbindlist to bind all data frame elements together into a single output.
library(data.table)
library(fuzzyjoin)
# split.data.table() dispatches when the object is a data.table
df_list <- split(as.data.table(events.df), by = "individual")
fuzzy_list <- lapply(df_list, function(sub.df) {
fuzzy_inner_join(sub.df, trips.df,
by = c('individual'='individual', 'time'='trip_start', 'time'='trip_end'),
match_fun = list(`==`,`>=`,`<=`)
)
})
# REMOVE TEMP OBJECT AND CALL GARBAGE COLLECTOR
rm(df_list); gc()
final_df <- rbindlist(fuzzy_list)
# REMOVE TEMP OBJECT AND CALL GARBAGE COLLECTOR
rm(fuzzy_list); gc()
I have two data frames: one containing time periods marked with unique character IDs, and another containing events with another set of unique IDs associated with them.
Period DF (code):
periodID <- c("P_UID_00", "P_UID_01", "P_UDI_02", "P_UID_03")
periodStart <- as.POSIXct(c("2016/02/10 19:00", "2016/02/11 19:00",
"2016/02/12 19:00", "2016/02/13 19:00"))
periodEnd <- as.POSIXct(c("2016/02/10 21:00", "2016/02/11 21:00",
"2016/02/12 21:00", "2016/02/13 21:00"))
periodDF <- data.frame(periodID, periodStart, periodEnd)
Period DF:
periodID periodStart periodEnd
1 P_UID_00 2016-02-10 19:00:00 2016-02-10 21:00:00
2 P_UID_01 2016-02-11 19:00:00 2016-02-11 21:00:00
3 P_UDI_02 2016-02-12 19:00:00 2016-02-12 21:00:00
4 P_UID_03 2016-02-13 19:00:00 2016-02-13 21:00:00
Event DF (code):
eventID <- c("E_UID_00", "E_UID_01", "E_UDI_02", "E_UID_03")
eventTime <- as.POSIXct(c("2016/02/09 19:55:01", "2016/02/11 19:12:01",
"2016/02/11 20:22:01", "2016/02/15 19:00:01"))
eventDF <- data.frame(eventID, eventTime)
Event DF:
eventID eventTime
1 E_UID_00 2016-02-09 19:55:01
2 E_UID_01 2016-02-11 19:12:01
3 E_UDI_02 2016-02-11 20:22:01
4 E_UID_03 2016-02-15 19:00:01
I want to map the event times in the second DF to the time periods in the first DF in order to match the ID of each event to the ID of a period. Essentially, the result table I want to see should look like:
eventID periodID
1 E_UID_00 NA
2 NA P_UID_00
3 E_UID_01 P_UID_01
4 E_UDI_02 P_UID_01
5 NA P_UID_02
6 NA P_UID_03
7 E_UID_03 NA
I suppose this can be achieved by using lubridate to transform the start and end columns in the first DF to intervals and then using some form of apply with an instant %within% interval combination, but I am not really familiar with lubridate and did not manage to produce working code.
Additional considerations:
- periods are completely arbitrary and can last from seconds to years
- periods never overlap, so this is not an issue
- more than one event could be associated with a time period
- it is possible for the DFs to contain events and time periods that cannot be associated
- the solution must not include loops
- it does not have to be solved with lubridate; in fact, a solution in base R would be even more welcome.
I actually managed to come up with code that produces exactly what I wanted using lubridate. So if anyone knows how to do this in base R, or simply a better way than the one suggested below, sharing it would be greatly appreciated!
First off, the start and end times in the period DF should be converted to lubridate intervals:
intervalsP <- interval(periodStart, periodEnd)
Step 2: A function should be created for checking if an instant is located within a list of intervals. The only reason I have created a separate function is to be able to use it with apply:
PeriodAssign <- function(x, y){
# x - instants
# y - intervals
variable1 <- mapply(`%within%`, x, y)
if (length(y[variable1]) != 0) {
as.character(y[variable1])
} else {
NA
}
}
NOTE: I had to coerce the intervals to character because otherwise the apply function coerced them to their lengths in seconds, which made them useless for matching purposes - i.e. all four intervals in this example are the same length.
Step 3: The function can then be used on the event DF, and both DFs can then be merged to produce the DF I was looking for:
eventDF$intervals <- lapply(eventTime, PeriodAssign, intervalsP)
periodDF$intervals <- as.character(intervalsP)
mergedDF <- merge(periodDF, eventDF, by = "intervals")
presentableDF <- mergedDF[, c(2, 5)]
# adding in the unmatched periods and events
tDF1 <- data.frame(periodDF[!(periodDF$periodID %in% presentableDF$periodID), 1], NA)
colnames(tDF1) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF1)
tDF2 <- data.frame(NA, eventDF[!(eventDF$eventID %in% presentableDF$eventID), 1])
colnames(tDF2) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF2)
presentableDF <- presentableDF[order(presentableDF[,1]),]
The resulting DF looks like:
> presentableDF
periodID eventID
3 P_UID_00 <NA>
1 P_UID_01 E_UID_01
2 P_UID_01 E_UDI_02
4 P_UID_02 <NA>
5 P_UID_03 <NA>
6 <NA> E_UID_00
7 <NA> E_UID_03
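Since a base R alternative was explicitly requested: given the stated guarantees (periods sorted by start and never overlapping), a loop-free sketch with findInterval() could look like the following; `idx` and `ok` are just illustrative names:

# last period starting at or before each event (0 if none)
idx <- findInterval(eventDF$eventTime, periodDF$periodStart)
# keep the match only if the event also precedes that period's end
ok  <- idx > 0 & eventDF$eventTime <= periodDF$periodEnd[pmax(idx, 1)]
eventDF$periodID <- ifelse(ok, as.character(periodDF$periodID[pmax(idx, 1)]), NA)

This assigns each event its containing period's ID (or NA); the unmatched periods can then be appended with the same rbind step used above.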
I have a DF containing time stamps and the type of corresponding "event" (two types of events), and another one with the timestamps and types of event acknowledgements.
I'm trying to calculate the time between an event and the time at which it is acknowledged.
Acknowledgements can be received immediately or later (Tack >= Tevent), or never. If there is no corresponding acknowledgement, I'd like the time difference to show NA.
Example:
dfEvent
Time Event
00:00:01 A
00:01:00 B
00:05:00 A
00:09:00 B
dfAcknowledgement
Time Event
00:00:02 A
00:05:10 A
00:09:05 B
RESULT
Time Event Delay
00:00:01 A 00:00:01
00:01:00 B NA
00:05:00 A 00:00:10
00:09:00 B 00:00:05
I've been looking at data.table, but I don't just want to merge tables: I want to calculate the difference between the values of two "nearby" rows.
I was also thinking of using which() to find the index of the event corresponding to an acknowledgement, but once I've got the index, how do I subtract it from the corresponding row without a for loop?
Define the inputs dfEvent and dfAcknowledgement, then convert them to data.tables with appropriate keys, also computing NextTime. Then perform a rolling join of the data.tables, re-sort by Time, and keep only Time, Event, and the computed Delay.
library(data.table)
library(chron)
# test data from question
dfEvent <- data.frame(Time = c("00:00:01", "00:01:00", "00:05:00", "00:09:00"),
Event = c("A", "B"), stringsAsFactors = FALSE)
dfAcknowledgement <- data.frame(Time = c("00:00:02", "00:05:10", "00:09:05"),
Event = c("A", "A", "B"), stringsAsFactors = FALSE)
# convert to data.table (also compute NextTime column)
dtEvent <- with(dfEvent, data.table(Event = Event, Time = times(Time),
NextTime = times(c(tail(Time, -1), NA)),
key = "Event,Time"))
dtAcknowledgement <- with(dfAcknowledgement, data.table(Event = Event,
Time = times(Time), TimeAck = times(Time),
key = "Event,Time"))
# real work done here - perform rolling join, re-sort & compute desired columns
roll <- dtAcknowledgement[dtEvent,,roll=-Inf]
setkey(setkey(roll, NULL), Time, Event)
roll[, list(Time, Event, Delay = replace(TimeAck, TimeAck>=NextTime, NA)-Time)]
This gives:
Time Event Delay
1: 00:00:01 A 00:00:01
2: 00:01:00 B <NA>
3: 00:05:00 A 00:00:10
4: 00:09:00 B 00:00:05
Let me begin by saying this question pertains to R (the statistical programming language), but I'm open to straightforward suggestions for other environments.
The goal is to merge outcomes from data frame (df) A to sub-elements in df B. This is a one-to-many relationship but, here's the twist: once the records are matched by keys, they also have to match over a specific frame of time given by a start time and duration.
For example, a few records in df A:
OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal
And from df B:
OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00
The desired outcome from the merge would be:
OBS ID Time Outcome
1 01 10:12:10 Normal
3 02 10:12:45 Weird
Desired result: data frame B with outcomes merged in from A. Notice observations 2 and 4 were dropped because, although they matched IDs on records in A, they did not fall within any of the time intervals given.
Question
Is it possible to perform this sort of operation in R and how would you get started? If not, can you suggest an alternative tool?
Set up data
First set up the input data frames. We create two versions: A and B use character columns for the times, while At and Bt use the chron package's "times" class (which has the advantage over "character" that one can add and subtract times):
LinesA <- "OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal"
LinesB <- "OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00"
A <- At <- read.table(textConnection(LinesA), header = TRUE,
colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE,
colClasses = c("numeric", rep("character", 2)))
# in At and Bt convert times columns to "times" class
library(chron)
At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)
sqldf with times class
Now we can perform the calculation using the sqldf package. We use method="raw" (which does not assign classes to the output), so we must assign the "times" class to the output "Time" column ourselves:
library(sqldf)
out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration",
method = "raw")
out$Time <- times(as.numeric(out$Time))
The result is:
> out
OBS ID Time Outcome
1 1 01 10:12:10 Normal
2 3 02 10:12:45 Weird
With the development version of sqldf this can be done without method="raw", and the "Time" column will automatically be set to the "times" class by sqldf's class assignment heuristic:
library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration")
sqldf with character class
It's actually possible to avoid the "times" class by performing all time calculations in SQLite on character strings, employing SQLite's strftime function. The SQL statement is unfortunately a bit more involved:
sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
where strftime('%s', Time) - strftime('%s', StartTime)
between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")
EDIT:
A series of edits which fixed grammar, added additional approaches and fixed/improved the read.table statements.
EDIT:
Simplified/improved final sqldf statement.
Here is an example:
# first, merge by ID
z <- merge(A[, -1], B, by = "ID")
# convert string to POSIX time
z <- transform(z,
s_t = as.numeric(strptime(as.character(z$StartTime), "%H:%M:%S")),
dur = as.numeric(strptime(as.character(z$Duration), "%H:%M:%S")) -
as.numeric(strptime("00:00:00", "%H:%M:%S")),
tim = as.numeric(strptime(as.character(z$Time), "%H:%M:%S")))
# subset by time range
subset(z, s_t < tim & tim < s_t + dur)
The output:
ID StartTime Duration Outcome OBS Time s_t dur tim
1 1 10:12:06 00:00:10 Normal 1 10:12:10 1321665126 10 1321665130
2 1 10:12:06 00:00:10 Normal 2 10:12:15 1321665126 10 1321665135
7 2 10:12:30 00:00:30 Weird 3 10:12:45 1321665150 30 1321665165
OBS #2 looks to be in the range. Does that make sense?
Merge the two data.frames together with merge(). Then subset() the resulting data.frame with the condition time >= startTime & time <= startTime + Duration, or whatever rules make sense to you.
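For concreteness, a minimal sketch of that recipe, reusing the chron-based At and Bt built in the earlier answer (so that StartTime + Duration is meaningful arithmetic):

# both inputs have an OBS column, so merge() suffixes them .x and .y
z <- merge(At, Bt, by = "ID")
subset(z, Time >= StartTime & Time <= StartTime + Duration,
       select = c(OBS.y, ID, Time, Outcome))

This keeps only B's observations that fall inside a matching interval, reproducing the desired two-row result.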