Rolling join using data.table with a constraint in R

I am trying to join two data.tables using a rolling join. I have looked at various answers, including here, but unfortunately have been unable to locate one that helps in this case. I am borrowing the same example from the link posted.
My first dataset is web-session data for two users, 1 and 2:
user web_date_time
1 29-Oct-2016 6:10:03 PM
1 29-Oct-2016 7:34:17 PM
1 30-Oct-2016 2:08:03 PM
1 30-Oct-2016 3:55:12 PM
2 31-Oct-2016 11:32:12 AM
2 31-Oct-2016 2:59:56 PM
2 01-Nov-2016 12:49:44 PM
My second dataset holds the purchase timestamps:
user purchase_date_time
1 29-Oct-2016 6:10:00 PM
1 29-Oct-2016 6:11:00 PM
2 31-Oct-2016 11:35:12 AM
2 31-Oct-2016 2:50:00 PM
My desired output identifies which web session led to each purchase, subject to a constraint: the web session must come after the previous purchase. The desired output is as follows (for every purchase, an additional column "websession_led_purchase" is created):
user purchase_date_time websession_led_purchase
1 29-Oct-2016 6:10:00 PM NA
1 29-Oct-2016 6:11:00 PM 29-Oct-2016 6:10:03 PM
2 31-Oct-2016 11:35:12 AM 31-Oct-2016 11:32:12 AM
2 31-Oct-2016 2:50:00 PM NA
The first NA arises because there is no web session before that purchase; the second NA arises because, for user 2's second purchase, there is no web session after the previous purchase (and before this one).
I tried the rolling join dt2[dt1, roll = Inf], but I get "31-Oct-2016 11:32:12 AM" for the fourth row of the desired output, which is incorrect.
Let me know your advice.

The rolling join is behaving as expected.
The documentation says:
+Inf (or TRUE) rolls the prevailing value in x forward. It is also known as last observation carried forward (LOCF).
That means the last observation can be carried forward and joined to many records. That is exactly what happens in the 4th row, where 2016-10-31 11:32:12 is copied and mapped to the next record as well (2016-10-31 14:50:00).
A simple way to fix this is to compare the lagged value of websession_led_purchase with the current row; if the two are the same, set the current row's value to NA. This ensures each web session is carried forward only once.
library(lubridate)
library(data.table)
setDT(DT1)
setDT(DT2)
# parse the timestamps and keep a common join column, date_time
DT1[, ':='(date_time = dmy_hms(web_date_time),
           web_date_time = dmy_hms(web_date_time))]
DT2[, ':='(date_time = dmy_hms(purchase_date_time),
           purchase_date_time = dmy_hms(purchase_date_time))]
setkey(DT1, user, date_time)
setkey(DT2, user, date_time)
# rolling join, then blank out any web session already used for the previous purchase
DT1[DT2, roll = Inf][, .(user, purchase_date_time,
    websession_led_purchase = as.POSIXct(ifelse(!is.na(shift(web_date_time)) &
        web_date_time == shift(web_date_time), NA, web_date_time),
        origin = "1970-01-01"))]
# user purchase_date_time websession_led_purchase
# 1: 1 2016-10-29 18:10:00 <NA>
# 2: 1 2016-10-29 18:11:00 2016-10-29 18:10:03
# 3: 2 2016-10-31 11:35:12 2016-10-31 11:32:12
# 4: 2 2016-10-31 14:50:00 <NA>
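For what it's worth, the constraint can also be encoded directly with a non-equi join instead of a rolling join plus deduplication. The sketch below is an alternative of my own, not the answer's method; it assumes DT1/DT2 already have the POSIXct date_time columns created above, and the prev_purchase and p_time helper columns are introduced here for illustration:
# each purchase's previous purchase time per user (NA for the first)
DT2[, prev_purchase := shift(date_time), by = user]
# a very early sentinel so first purchases still participate in the join
DT2[is.na(prev_purchase), prev_purchase := as.POSIXct("1900-01-01", tz = "UTC")]
DT2[, p_time := date_time]  # copy, because the join conditions consume the columns
# for each purchase, take the latest web session strictly inside
# (previous purchase, purchase time)
DT1[DT2,
    on = .(user, date_time > prev_purchase, date_time < p_time),
    .(user, purchase_date_time, websession_led_purchase = x.web_date_time),
    mult = "last"]
Purchases with no qualifying web session come back as NA automatically, so no post-hoc blanking is needed.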

Related

Identify event that meets criteria within time range in R

I'm working with EHR data and trying to identify the time when an individual identified by ID has had at least 2 unique events of type "A" and at least 1 unique event of type "B" within a 6hr time range. The order of events does not matter - the 6hr range can start with either type "A" or "B". I would like to create a new data frame that contains the timestamp when the individual meets the criteria.
The data looks like this:
library(lubridate)  # for as_datetime()
d <- data.frame(ID = c("Z001","Z001","Z001","Z001","Z001","Z001"),
                event = c("TEMP","HR","TEMP","RR","LACTATE","INR"),
                eventType = c("A","A","A","A","B","B"),
                eventDTS = as_datetime(c("2022-06-01T02:00:00Z","2022-06-01T02:00:00Z",
                                         "2022-06-01T02:05:00Z","2022-06-01T02:01:00Z",
                                         "2022-06-01T03:00:00Z","2022-06-01T03:45:00Z")),
                stringsAsFactors = FALSE)
ID    event   eventType eventDTS
Z001  TEMP    A         2022-06-01 02:00:00
Z001  HR      A         2022-06-01 02:00:00
Z001  RR      A         2022-06-01 02:01:00
Z001  TEMP    A         2022-06-01 02:05:00
Z001  LACTATE B         2022-06-01 03:00:00
Z001  INR     B         2022-06-01 03:45:00
For this individual the output should look like this:
ID    lastQualDTS
Z001  2022-06-01 03:00:00
So far I've been able to create a count of events within the 6hr range for each event using this post: "R to create a tally of previous events within a sliding window time period". But I can't figure out how to actually identify the timestamp of interest.
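No answer is recorded here, but one brute-force sketch (my own illustration, assuming dplyr, lubridate, and the d frame above): for each event time t, look back over the trailing 6-hour window and test whether it holds at least 2 unique type-A events and at least 1 type-B event, then keep the first qualifying timestamp, which matches the desired output above.
library(dplyr)
library(lubridate)

d %>%
  group_by(ID) %>%
  arrange(eventDTS, .by_group = TRUE) %>%
  # for every event time t, test the trailing 6-hour window (t - 6h, t]
  mutate(meets = sapply(seq_along(eventDTS), function(i) {
    t <- eventDTS[i]
    w <- eventDTS > t - hours(6) & eventDTS <= t
    n_distinct(event[w & eventType == "A"]) >= 2 &&
      n_distinct(event[w & eventType == "B"]) >= 1
  })) %>%
  filter(meets) %>%
  summarise(lastQualDTS = first(eventDTS))
#> ID: Z001, lastQualDTS: 2022-06-01 03:00:00
This is O(n^2) per ID, so for large EHR extracts a windowed join would scale better, but it states the criteria directly.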

Split a variable into two columns (tried colsplit)

I have one variable called Date which is of the format 03/08/2015 09:00:00 AM.
I want to split the column into two columns, namely Date and Time. I used the colsplit() command as follows:
> colsplit(crime$Date,"",names = c("Date","Time"))[1:5,]
Date Time
1 0 3/18/2015 07:44:00 PM
2 0 3/18/2015 11:00:00 PM
3 0 3/18/2015 10:45:00 PM
4 0 3/18/2015 10:30:00 PM
5 0 3/18/2015 09:00:00 PM
But it's not quite as expected. The Date variable has 0 and the Time variable has the other values. How do I rectify this?
Also, when I try to include these variables in the crime data set, the column names are date.date and date.time.
The pattern argument of colsplit() is a regular expression, and the empty pattern "" matches after the first character, which is why Date ends up holding just a 0; supplying a space as the pattern (colsplit(crime$Date, " ", names = c("Date", "Time"))) fixes that. You can also use tidyr::separate to divide up character columns:
library(tidyr)
# extra = "merge" keeps "09:00:00 AM" together in the time column
data %>% separate(date_time, c("date", "time"), sep = " ", extra = "merge", remove = TRUE)
For your purposes, just use substr:
# characters 12-22 keep the AM/PM suffix, so times stay unambiguous
data.frame(Date = substr(crime$Date, 1, 10), Time = substr(crime$Date, 12, 22))
But I have to agree with the first comment about not changing this to two columns.
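If you do split, a more robust sketch (my own suggestion, not from the original answers, and locale permitting for %p) is to parse the column to POSIXct once and format the pieces back out, which also converts the 12-hour clock to 24-hour:
# parse "03/08/2015 09:00:00 AM" (%I = 12-hour clock, %p = AM/PM)
dt <- as.POSIXct(crime$Date, format = "%m/%d/%Y %I:%M:%S %p")
crime$Date <- format(dt, "%m/%d/%Y")   # "03/08/2015"
crime$Time <- format(dt, "%H:%M:%S")   # "21:00:00" for 09:00:00 PM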

Subset data by time interval if I have data for every hour in the interval

I have a data frame that looks like this:
X id mat.1 mat.2 mat.3 times
1 1 1 Anne 1495206060 18.5639404 2017-05-19 11:01:00
2 2 1 Anne 1495209660 9.0160321 2017-05-19 12:01:00
3 3 1 Anne 1495211460 37.6559161 2017-05-19 12:31:00
4 4 1 Anne 1495213260 31.1218856 2017-05-19 13:01:00
....
164 164 1 Anne 1497825060 4.8098351 2017-06-18 18:31:00
165 165 1 Anne 1497826860 15.0678781 2017-06-18 19:01:00
166 166 1 Anne 1497828660 4.7636241 2017-06-18 19:31:00
What I would like is to subset the data set by time interval (all data between 11 AM and 4 PM), but only for days that have data points for at least each hour (11 AM, 12, 1, 2, 3, 4 PM). Ultimately I want to sum the values of mat.3 per time interval (11 AM to 4 PM) per day.
I tried:
sub.1 <- subset(t,format(times,'%H')>='11' & format(times,'%H')<='16')
but this returns all the data with times between 11 AM and 4 PM, even when a given day only has data for, e.g., 12 and 1 PM.
I only want the subset from days where I have data for each hour from 11 AM to 4 PM. Any ideas what I can try?
A complement to @Henry Navarro's answer, solving an additional problem mentioned in the question.
If I understand properly, another concern of the question is to find the dates for which there are data points for at least each hour of the given interval within the day. A possible way, following the style of @Henry Navarro's solution, is as follows:
library(lubridate)
your_data$hour_only <- as.numeric(format(your_data$times, format = "%H"))
your_data$days <- ymd(format(your_data$times, "%Y-%m-%d"))
your_data_by_days_list <- split(x = your_data, f = your_data$days)
# the interval is narrowed for demonstration purposes
hours_intervals <- 11:13
# flag each day for which every hour of the interval is present
all_hours_flags <- data.frame(
  days = ymd(names(your_data_by_days_list)),
  all_hours_present = sapply(your_data_by_days_list,
                             function(Z) all(hours_intervals %in% Z$hour_only)),
  row.names = NULL)
your_data <- merge(your_data, all_hours_flags, by = "days")
There is now a column "all_hours_present" indicating whether the data for the corresponding day contain at least one value for each hour in hours_intervals, and you can use it to subset your data:
subset(your_data, all_hours_present)
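To finish the question's last step (summing mat.3 per qualifying day), a small sketch building on the columns created above; this is my own addition, not part of the original answer:
# keep qualifying days, restrict to the 11:00-16:00 window, then sum per day
ok <- subset(your_data, all_hours_present & hour_only >= 11 & hour_only <= 16)
aggregate(mat.3 ~ days, data = ok, FUN = sum)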
Try to create a new variable in your data frame with only the hour:
your_data$hour <- format(your_data$times, format = "%H:%M:%S")
Then, using this new variable, do the following:
# auxiliary variable flagging your interval of time
# (the conditions must be combined with a vectorized AND, not ||)
your_data$aux_var <- ifelse(your_data$hour >= "11:00:00" & your_data$hour <= "16:00:00", 1, 0)
The next step is to filter your data where aux_var == 1:
your_data[which(your_data$aux_var == 1), ]

Difference between two dates from two consecutive rows in two different columns

I have a Hive table with millions of records.
The input is of the following type:
rowid  starttime            endtime              line  status
1      2007-07-19 00:05:00  2007-07-19 00:23:00  l1    s1
2      2007-07-20 00:00:10  2007-07-20 00:22:00  l1    s2
3      2007-07-19 00:00:00  2007-07-19 00:11:00  l2    s2
What I want to do is: first, order the table by starttime within each line group. Then find the difference between each row's starttime and the previous row's endtime. If the difference is more than 5 minutes, add a new row in between, in a new table, with status misstime.
For input rows 1 and 2 the gap is more than 5 minutes, so I first create a row for the 19th that completes that day with the missing time, and then add one more row for the 20th, as below.
output:
rowid  starttime            endtime              line  status
1      2007-07-19 00:05:00  2007-07-19 00:23:00  l1    s1
2      2007-07-19 00:23:01  2007-07-19 00:00:00  l1    misstime
3      2007-07-20 00:00:01  2007-07-20 00:00:09  l1    misstime
4      2007-07-20 00:00:10  2007-07-20 00:22:00  l1    s2
3      2007-07-19 00:00:00  2007-07-19 00:11:00  l2    s2
Can anyone help me achieve this directly in hue - hive ?
Unix script will also do.
Thanks in advance.
The solution template is:
1. Use the LAG() function to get the previous row's endtime per line.
2. For each row, calculate the difference between the current starttime and the previous endtime.
3. Filter rows with a difference of more than 5 minutes.
4. Transform the dataset into the required output.
Example:
insert into yourtable
select
  s.rowid,
  s.starttime,
  s.endtime
  --calculate your status here, etc, etc
from
( select rowid, starttime, endtime, line,
         lag(endtime) over(partition by line order by starttime) prev_endtime
    from yourtable ) s
where (unix_timestamp(s.starttime) - unix_timestamp(s.prev_endtime))/60 > 5 --gap > 5 min

Merge Records Over Time Interval

Let me begin by saying this question pertains to R (the statistical programming language), but I'm open to straightforward suggestions for other environments.
The goal is to merge outcomes from dataframe (df) A to sub-elements in df B. This is a one to many relationship but, here's the twist, once the records are matched by keys they also have to match over a specific frame of time given by a start time and duration.
For example, a few records in df A:
OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal
And from df B:
OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00
The desired outcome from the merge would be:
OBS ID Time Outcome
1 01 10:12:10 Normal
3 02 10:12:45 Weird
Desired result: dataframe B with outcomes merged in from A. Notice observations 2 and 4 were dropped because although they matched IDs on records in A they did not fall within any of the time intervals given.
Question
Is it possible to perform this sort of operation in R and how would you get started? If not, can you suggest an alternative tool?
Set up data
First set up the input data frames. We create two versions of the data frames: A and B just use character columns for the times and At and Bt use the chron package "times" class for the times (which has the advantage over "character" class that one can add and subtract them):
LinesA <- "OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal"
LinesB <- "OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00"
A <- At <- read.table(textConnection(LinesA), header = TRUE,
colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE,
colClasses = c("numeric", rep("character", 2)))
# in At and Bt convert times columns to "times" class
library(chron)
At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)
sqldf with times class
Now we can perform the calculation using the sqldf package. We use method = "raw" (which does not assign classes to the output), so we must assign the "times" class to the output "Time" column ourselves:
library(sqldf)
out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration",
method = "raw")
out$Time <- times(as.numeric(out$Time))
The result is:
> out
OBS ID Time Outcome
1 1 01 10:12:10 Normal
2 3 02 10:12:45 Weird
With the development version of sqldf this can be done without using method="raw" and the "Time" column will automatically be set to "times" class by the sqldf class assignment heuristic:
library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration")
sqldf with character class
It's actually possible to avoid the "times" class by performing all time calculations in SQLite directly on character strings, using SQLite's strftime function. The SQL statement is unfortunately a bit more involved:
sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
where strftime('%s', Time) - strftime('%s', StartTime)
between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")
Here is an example:
# first, merge by ID
z <- merge(A[, -1], B, by = "ID")
# convert string to POSIX time
z <- transform(z,
s_t = as.numeric(strptime(as.character(z$StartTime), "%H:%M:%S")),
dur = as.numeric(strptime(as.character(z$Duration), "%H:%M:%S")) -
as.numeric(strptime("00:00:00", "%H:%M:%S")),
tim = as.numeric(strptime(as.character(z$Time), "%H:%M:%S")))
# subset by time range
subset(z, s_t < tim & tim < s_t + dur)
the output:
ID StartTime Duration Outcome OBS Time s_t dur tim
1 1 10:12:06 00:00:10 Normal 1 10:12:10 1321665126 10 1321665130
2 1 10:12:06 00:00:10 Normal 2 10:12:15 1321665126 10 1321665135
7 2 10:12:30 00:00:30 Weird 3 10:12:45 1321665150 30 1321665165
OBS #2 looks to be in the range. Does it make sense?
Merge the two data.frames together with merge(). Then subset() the resulting data.frame with the condition time >= startTime & time <= startTime + Duration or whatever rules make sense to you.
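A minimal sketch of this last suggestion (my own illustration, assuming the At/Bt frames with chron "times" columns built earlier, so the times can be compared and added directly):
# merge on ID, then keep rows whose Time falls inside the window;
# OBS.y is B's observation number after the merge renames the clash
m <- merge(At, Bt, by = "ID")
subset(m, Time >= StartTime & Time <= StartTime + Duration,
       select = c(OBS.y, ID, Time, Outcome))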
