I have two columns
starttime endtime
2019-11-05 18:02:04 2019-11-05 00:02:04
2019-08-02 20:18:00 2019-08-02 01:10:00
2019-12-07 17:28:00 2019-12-07 18:00:00
I am trying to find the difference in time between starttime and endtime
mutate(col = difftime(endtime,starttime,units = "hours"))
but I am getting negative hours, which makes no sense. I need it to be endtime - starttime, because anything else would mess things up for the dataframe that I have. I believe that 0.533 is right. I got
col
-18
-19
0.533
We can increment the endtime by 1 day (86400 seconds) if starttime > endtime and then use difftime
library(dplyr)
df %>%
mutate(endtime = if_else(starttime > endtime, endtime + 86400, endtime),
col = difftime(endtime,starttime,units = "hours"))
# starttime endtime col
#1 2019-11-05 18:02:04 2019-11-06 00:02:04 6.0000000 hours
#2 2019-08-02 20:18:00 2019-08-03 01:10:00 4.8666667 hours
#3 2019-12-07 17:28:00 2019-12-07 18:00:00 0.5333333 hours
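An alternative that leaves endtime untouched is to wrap negative differences around by 24 hours with the modulo operator. This is only a sketch: it assumes every interval is shorter than 24 hours, and `raw` and `hours` are names chosen here for illustration.

```r
# Same values as the df in this answer, rebuilt so the example is self-contained
df <- data.frame(
  starttime = as.POSIXct(c(1572976924, 1564777080, 1575739680),
                         origin = "1970-01-01", tz = "UTC"),
  endtime   = as.POSIXct(c(1572912124, 1564708200, 1575741600),
                         origin = "1970-01-01", tz = "UTC")
)

# Plain numeric difference in hours (negative when endtime falls on the next day)
raw <- as.numeric(difftime(df$endtime, df$starttime, units = "hours"))

# In R, %% with a positive divisor returns a non-negative result,
# so a difference of -18 hours becomes 6 hours
hours <- raw %% 24
hours
```

Unlike the if_else approach, this returns a plain numeric vector rather than a difftime, which may or may not be what you want downstream.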
data
df <- structure(list(starttime = structure(c(1572976924, 1564777080,
1575739680), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
endtime = structure(c(1572912124, 1564708200, 1575741600), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -3L), class = "data.frame")
Related
I have two data frames with timestamps (in as.POSIXct, format="%Y-%m-%d %H:%M:%S") as below.
df_ID1
ID DATETIME TIMEDIFF EV
A 2019-03-26 06:13:00 2019-03-26 00:13:00 1
B 2019-04-03 08:00:00 2019-04-03 02:00:00 1
B 2019-04-04 12:35:00 2019-04-04 06:35:00 1
df_ID0
ID DATETIME
A 2019-03-26 00:02:00
A 2019-03-26 04:55:00
A 2019-03-26 11:22:00
B 2019-04-02 20:43:00
B 2019-04-04 11:03:00
B 2019-04-06 03:12:00
I want to compare each DATETIME in df_ID1 with the DATETIME values in df_ID0 that have the same ID, and pick the one that is "smaller than but closest to" the DATETIME in df_ID1.
For each matched pair, I want to further compare the TIMEDIFF in df_ID1 with the matched DATETIME in df_ID0: if the TIMEDIFF in df_ID1 is greater than the matched DATETIME in df_ID0, change EV from 1 to 4 in df_ID1.
My desired result is
df_ID1
ID DATETIME TIMEDIFF EV
A 2019-03-26 06:13:00 2019-03-26 00:13:00 1
B 2019-04-03 08:00:00 2019-04-03 02:00:00 4
B 2019-04-04 12:35:00 2019-04-04 06:35:00 1
I've checked how to compare timestamps and calculate the time difference, also how to change values based on criteria...
But I cannot find anything that selects the "smaller than but closest to" timestamp, and I cannot figure out how to apply all this logic either.
Any help would be appreciated!
You can do this with a for loop, keeping in mind that if your actual data is very large, the loop overhead will hurt performance.
# Here df_1 and df_0 stand for the question's df_ID1 and df_ID0
for(i in seq_len(nrow(df_1))){
  sub <- subset(df_0, ID == df_1$ID[i]) # filter on ID
  earlier <- sub$DATETIME[sub$DATETIME < df_1$DATETIME[i]] # keep only DATETIMEs less than the current one
  if(length(earlier) == 0) next # no earlier timestamp for this ID: leave EV unchanged
  df_0_dt <- max(earlier) # the max of those is "less than but closest to"
  if(df_0_dt < df_1$TIMEDIFF[i]){ # final condition
    df_1[i, "EV"] <- 4
  }
}
df_1
# A tibble: 3 x 4
ID DATETIME TIMEDIFF EV
<chr> <dttm> <dttm> <dbl>
1 A 2019-03-26 06:13:00 2019-03-26 00:13:00 1
2 B 2019-04-03 08:00:00 2019-04-03 02:00:00 4
3 B 2019-04-04 12:35:00 2019-04-04 06:35:00 1
One option using nested mapply, is to first split df_ID1 and df_ID0 based on ID. Calculate the difference in time between each value in df_ID1 with that of df_ID0 of same ID. Get the index of "smaller than but closest to" and store it in inds and change the value to 4 if the value of corresponding TIMEDIFF column is greater than the matched DATETIME value.
df_ID1$EV[unlist(mapply(function(x, y) {
mapply(function(p, q) {
vals = as.numeric(difftime(p, y$DATETIME))
inds = which(vals == min(vals[vals > 0]))
q > y$DATETIME[inds]
}, x$DATETIME, x$TIMEDIFF)
}, split(df_ID1, df_ID1$ID), split(df_ID0, df_ID0$ID)))] <- 4
df_ID1
# ID DATETIME TIMEDIFF EV
#1 A 2019-03-26 06:13:00 2019-03-26 00:13:00 1
#2 B 2019-04-03 08:00:00 2019-04-03 02:00:00 4
#3 B 2019-04-04 12:35:00 2019-04-04 06:35:00 1
data
df_ID0 <- structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L),
.Label = c("A",
"B"), class = "factor"), DATETIME = structure(c(1553529720, 1553547300,
1553570520, 1554208980, 1554346980, 1554491520), class = c("POSIXct",
"POSIXt"), tzone = "")), row.names = c(NA, -6L), class = "data.frame")
df_ID1 <- structure(list(ID = structure(c(1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), DATETIME = structure(c(1553551980, 1554249600,
1554352500), class = c("POSIXct", "POSIXt"), tzone = ""), TIMEDIFF =
structure(c(1553530380,
1554228000, 1554330900), class = c("POSIXct", "POSIXt"), tzone = ""),
EV = c(1, 1, 1)), row.names = c(NA, -3L), class = "data.frame")
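For larger data, the "smaller than but closest to" lookup can also be done per ID with findInterval, which works on sorted numeric vectors. The following is a sketch rather than a drop-in replacement: it rebuilds the sample data with character IDs and a fixed UTC time zone (both assumptions made here), and `prev` and `flag` are illustrative names.

```r
df_ID0 <- data.frame(
  ID = c("A", "A", "A", "B", "B", "B"),
  DATETIME = as.POSIXct(c("2019-03-26 00:02:00", "2019-03-26 04:55:00",
                          "2019-03-26 11:22:00", "2019-04-02 20:43:00",
                          "2019-04-04 11:03:00", "2019-04-06 03:12:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
df_ID1 <- data.frame(
  ID = c("A", "B", "B"),
  DATETIME = as.POSIXct(c("2019-03-26 06:13:00", "2019-04-03 08:00:00",
                          "2019-04-04 12:35:00"), tz = "UTC"),
  TIMEDIFF = as.POSIXct(c("2019-03-26 00:13:00", "2019-04-03 02:00:00",
                          "2019-04-04 06:35:00"), tz = "UTC"),
  EV = c(1, 1, 1),
  stringsAsFactors = FALSE
)

t0 <- as.numeric(df_ID0$DATETIME)
t1 <- as.numeric(df_ID1$DATETIME)

# prev[i] = latest df_ID0 timestamp strictly before df_ID1$DATETIME[i], same ID
prev <- rep(NA_real_, nrow(df_ID1))
for (id in unique(df_ID1$ID)) {
  rows <- which(df_ID1$ID == id)
  v <- sort(t0[df_ID0$ID == id])
  if (length(v) == 0) next
  # left.open = TRUE counts the elements of v strictly less than each value
  idx <- findInterval(t1[rows], v, left.open = TRUE)
  prev[rows[idx > 0]] <- v[idx[idx > 0]]
}

# Rows without any earlier timestamp stay NA and are left unchanged
flag <- !is.na(prev) & as.numeric(df_ID1$TIMEDIFF) > prev
df_ID1$EV[flag] <- 4
df_ID1
```

The loop here runs once per ID rather than once per row, and rows with no earlier timestamp keep their original EV.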
I have 3 data frames, df1 = a time interval, df2 = list of IDs, df3 = list of IDs with associated date.
df1 <- structure(list(season = structure(c(2L, 1L), .Label = c("summer",
"winter"), class = "factor"), mindate = structure(c(1420088400,
1433131200), class = c("POSIXct", "POSIXt")), maxdate = structure(c(1433131140,
1448945940), class = c("POSIXct", "POSIXt")), diff = structure(c(150.957638888889,
183.040972222222), units = "days", class = "difftime")), .Names = c("season",
"mindate", "maxdate", "diff"), row.names = c(NA, -2L), class = "data.frame")
df2 <- structure(list(ID = c(23796, 23796, 23796)), .Names = "ID", row.names = c(NA,
-3L), class = "data.frame")
df3 <- structure(list(ID = c("23796", "123456", "12134"), time = structure(c(1420909920,
1444504500, 1444504500), class = c("POSIXct", "POSIXt"), tzone = "US/Eastern")), .Names = c("ID",
"time"), row.names = c(NA, -3L), class = "data.frame")
The code should check whether df2$ID == df3$ID. If true, and if df3$time >= df1$mindate and df3$time <= df1$maxdate, then the result is df1$maxdate - df3$time, otherwise df1$maxdate - df1$mindate. I tried using the ifelse function. This works when I manually specify specific cells, but that is not what I want, as I have many more rows (of uneven number) in each of the dfs.
df1$result <- ifelse(df2[1,1] == df3[1,1] & df3[1,2] >= df1$mindate & df3[1,2] <= df1$maxdate,
difftime(df1$maxdate,df3[1,2],units="days"),
difftime(df1$maxdate,df1$mindate,units="days"))
EDIT: The desired output is (when removing last row of df2):
season mindate maxdate diff result
1 winter 2015-01-01 2015-05-31 23:59:00 150.9576 days 141.9576
2 summer 2015-06-01 2015-11-30 23:59:00 183.0410 days 183.0410
Any ideas? I don't see how I could merge dfs to make them of the same length. Note that df2 can be of any row length and not affect the code. Issues arise when df1 and df3 differ in # of rows.
The > and < are vectorized:
transform(df1,result=ifelse(df3$ID%in%df2$ID & df3$time>mindate & df3$time <maxdate, difftime(maxdate,df3$time),difftime(maxdate,mindate)))
season mindate maxdate diff result
1 winter 2014-12-31 21:00:00 2015-05-31 20:59:00 150.9576 days 141.9576
2 summer 2015-05-31 21:00:00 2015-11-30 20:59:00 183.0410 days 183.0410
You can also use the %between% operator from the data.table library:
library(data.table)
transform(df1,result=ifelse(df3$ID%in%df2$ID&df3$time%between%df1[2:3],
difftime(maxdate,df3$time),difftime(maxdate,mindate)))
season mindate maxdate diff result
1 winter 2014-12-31 21:00:00 2015-05-31 20:59:00 150.9576 days 141.9576
2 summer 2015-05-31 21:00:00 2015-11-30 20:59:00 183.0410 days 183.0410
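Both versions above compare df3 and df1 row by row, so they depend on how the rows happen to line up. If df1 and df3 have different numbers of rows, one option is to loop over the seasons and look up the matching times for each one. This is a sketch under assumptions that are not in the question: the data is rebuilt from the same epoch values with a fixed UTC time zone, df2 has its last row removed, and when several df3 rows fall inside a season the earliest one is used. `ok_ids`, `cand`, `hits` and `from` are illustrative names.

```r
df1 <- data.frame(
  season  = c("winter", "summer"),
  mindate = as.POSIXct(c(1420088400, 1433131200), origin = "1970-01-01", tz = "UTC"),
  maxdate = as.POSIXct(c(1433131140, 1448945940), origin = "1970-01-01", tz = "UTC"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(ID = c(23796, 23796))
df3 <- data.frame(
  ID   = c("23796", "123456", "12134"),
  time = as.POSIXct(c(1420909920, 1444504500, 1444504500),
                    origin = "1970-01-01", tz = "UTC"),
  stringsAsFactors = FALSE
)

# Times in df3 whose ID also appears in df2 (IDs compared as character)
ok_ids <- as.character(unique(df2$ID))
cand <- df3$time[df3$ID %in% ok_ids]

# One result per season, regardless of how many rows df2 and df3 have
df1$result <- vapply(seq_len(nrow(df1)), function(i) {
  hits <- cand[cand >= df1$mindate[i] & cand <= df1$maxdate[i]]
  # Assumption: with several hits, measure from the earliest one
  from <- if (length(hits) > 0) min(hits) else df1$mindate[i]
  as.numeric(difftime(df1$maxdate[i], from, units = "days"))
}, numeric(1))
df1
```

Passing units = "days" explicitly keeps both branches in the same unit, which difftime's automatic unit selection does not guarantee.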
I have a large dataset containing multiple groups of IDs with Start & Stop datetimes. What I'm trying to do is identify, within each group, where a subgroup occurred. A subgroup within a group is when two IDs overlap in their START & END datetime columns. Below is a script to create a sample dataset in R for one group. What I want to do is, within each group, create a column called "Grp" that groups those subgroups with overlapping START & END datetimes.
What I have...
structure(list(ID = c(1,2,3,4), START = structure(c(1490904000, 1490918400,
1508363100, 1508379300), tzone = "UTC", class = c("POSIXct",
"POSIXt")), END = structure(c(1492050600, 1492247700,
1509062400, 1509031800), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), class = "data.frame", row.names = c(NA, -4L), .Names = c("ID","START",
"END"))
What I want is...
structure(list(ID = c(1,2,3,4), START = structure(c(1490904000, 1508379300,
1508363100, 1490918400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), END = structure(c(1492050600, 1509031800,
1509062400, 1492247700), tzone = "UTC", class = c("POSIXct",
"POSIXt")), Grp = c(1,2,2,1)), class = "data.frame", row.names = c(NA, -4L), .Names = c("ID","START",
"END","Grp"))
I've tried using lubridate's interval, and finding an overlap that way, but no luck. Any help would be greatly appreciated.
After sorting by START, the condition for a new group is that the END of the previous row is less than the START of the next row:
head(df1$END, -1) < tail(df1$START,-1)
df1 <- structure(list(ID = c(1,2,3,4), START = structure(c(1490904000, 1490918400,
1508363100, 1508379300), tzone = "UTC", class = c("POSIXct",
"POSIXt")), END = structure(c(1492050600, 1492247700,
1509062400, 1509031800), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), class = "data.frame", row.names = c(NA, -4L), .Names = c("ID","START",
"END"))
df1
ID START END
1 1 2017-03-30 20:00:00 2017-04-13 02:30:00
2 2 2017-03-31 00:00:00 2017-04-15 09:15:00
3 3 2017-10-18 21:45:00 2017-10-27 00:00:00
4 4 2017-10-19 02:15:00 2017-10-26 15:30:00
df1a <- df1[ order(df1$START), ]
df1a$grp <- cumsum( c( 1, head(df1a$END, -1) < tail(df1a$START,-1) ))
df1a
#---------------
ID START END grp
1 1 2017-03-30 20:00:00 2017-04-13 02:30:00 1
2 2 2017-03-31 00:00:00 2017-04-15 09:15:00 1
3 3 2017-10-18 21:45:00 2017-10-27 00:00:00 2
4 4 2017-10-19 02:15:00 2017-10-26 15:30:00 2
Here's a function that answers the first part of my response to the comment below:
grp_overlaps <- function(endings, beginnings){
cumsum(c( 1, head(endings, -1) < tail(beginnings, -1) )) }
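One caveat with comparing each row only to the row right before it: an interval that sits entirely inside a longer, earlier one can end before the next row starts, even though the longer interval still covers that row. A sketch that compares each START against the running maximum of END (via cummax) instead is shown here; `grp_overlaps_cummax` and the toy numbers are made up for illustration, and the function sorts internally and returns groups in the original row order.

```r
grp_overlaps_cummax <- function(starts, ends) {
  o <- order(starts)
  s <- as.numeric(starts[o])
  e <- as.numeric(ends[o])
  # Latest END seen so far, for every row except the last
  prev_max_end <- head(cummax(e), -1)
  # A new group starts when a START lies beyond everything seen so far
  grp_sorted <- cumsum(c(1, tail(s, -1) > prev_max_end))
  grp <- numeric(length(o))
  grp[o] <- grp_sorted  # put the group ids back in the original row order
  grp
}

# Toy intervals: (1,10) contains (2,3) and (5,6); (20,21) stands alone
starts <- c(5, 1, 20, 2)
ends   <- c(6, 10, 21, 3)
grp <- grp_overlaps_cummax(starts, ends)
grp
```

With the row-to-row comparison, (5,6) would open a new group because the previous row (2,3) ends before it starts, although (1,10) still covers it.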
I have a dataframe consisting of:
two columns with start and end timestamps (POSIXct class) of various projects
another timestamp (POSIXct class) column showing events which occurred within the start and end timeframe
a project id column
Projects have multiple events naturally.
Projid Event BEGIN_DT END_DT
1 04/12/2013 09:00:00 04/12/2013 08:12:00 04/14/2013 20:14:00
1 04/13/2013 15:16:24 04/12/2013 08:12:00 04/14/2013 20:14:00
2 06/06/2012 18:00:00 06/06/2012 13:54:32 08/06/2012 23:59:43
2 06/07/2012 22:54:32 06/06/2012 13:54:32 08/06/2012 23:59:43
I would like to add a field showing for each event the 60 min time bucket it belongs to (as in first hour or second hour or n-th hour of the project etc...). How could this be done?
How about the following using a floor on the difftime in hours:
# Your sample data
df <- structure(list(
Projid = c(1L, 1L, 2L, 2L),
Event = structure(c(1365721200, 1365830184, 1338969600, 1339073672), class = c("POSIXct", "POSIXt"), tzone = ""),
BEGIN_DT = structure(c(1365718320, 1365718320, 1338954872, 1338954872), class = c("POSIXct", "POSIXt"), tzone = ""),
END_DT = structure(c(1365934440, 1365934440, 1344261583, 1344261583), class = c("POSIXct", "POSIXt"), tzone = "")),
.Names = c("Projid", "Event", "BEGIN_DT", "END_DT"), row.names = c(NA, -4L), class = "data.frame");
# Add hour bin
df$hourBin <- floor(difftime(df$Event, df$BEGIN_DT, units = "hours")) + 1;
df;
#Projid Event BEGIN_DT END_DT hourBin
#1 1 2013-04-12 09:00:00 2013-04-12 08:12:00 2013-04-14 20:14:00 1 hours
#2 1 2013-04-13 15:16:24 2013-04-12 08:12:00 2013-04-14 20:14:00 32 hours
#3 2 2012-06-06 18:00:00 2012-06-06 13:54:32 2012-08-06 23:59:43 5 hours
#4 2 2012-06-07 22:54:32 2012-06-06 13:54:32 2012-08-06 23:59:43 34 hours
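Note that hourBin above is still a difftime, which is why it prints with "hours" attached. If a plain integer bucket is preferred, one way is integer division on the elapsed seconds. A minimal sketch, using the two Projid 1 events from the question with an assumed UTC time zone; `begin`, `events`, `secs` and `hour_bin` are names chosen here.

```r
begin  <- as.POSIXct("2013-04-12 08:12:00", tz = "UTC")
events <- as.POSIXct(c("2013-04-12 09:00:00", "2013-04-13 15:16:24"), tz = "UTC")

# Elapsed seconds since the project start, as a plain number
secs <- as.numeric(difftime(events, begin, units = "secs"))

# %/% is integer division: 0-3599 s falls in bucket 1, 3600-7199 s in bucket 2, ...
hour_bin <- as.integer(secs %/% 3600) + 1L
hour_bin
```

An event that falls before its BEGIN_DT would get a bucket of 0 or below, which can serve as a quick sanity check on the data.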
I have been looking around but I still couldn't find a way to subset my dataframe by time, here is the sample data:
Duration End Date Start Date
228 2013-01-03 09:10:00 2013-01-03 09:06:00
1675 2013-01-04 17:34:00 2013-01-04 17:06:00
393 2013-01-04 17:54:00 2013-01-04 17:48:00
426 2013-01-04 11:10:00 2013-01-04 11:03:00
827 2013-01-01 16:13:00 2013-01-01 15:59:00
780 2013-01-01 16:13:00 2013-01-01 16:00:00
The End Date and Start Date are in POSIXct format, and here is what I have tried in order to keep only the times between 8:00 and 9:30.
tm1 <- as.POSIXct("08:00", format = "%H:%M")
tm2 <- as.POSIXct("09:30", format = "%H:%M")
df.time <- with(df, df[format('Start Date', '%H:%M')>= tm1 & format('End Date', '%H:%M')< tm2, ])
but this returns an error. I have also tried this, but it didn't work either.
df.time <- subset(df, format('Start Date', '%H:%M') >= '8:00' & format('End Date', '%H:%M') < '9:30'))
Can anybody tell me what I am doing wrong? Thanks
Assuming that the start and end dates are always the same and only the times differ, and that you want those rows for which the time starts at or after 8:00 and ends before 9:30, convert the date/time values to character strings of the form HH:MM and compare. The strings must be zero-padded ("08:00", not "8:00") for the comparison to sort correctly:
subset(DF, format(`Start Date`, "%H:%M") >= "08:00" &
format(`End Date`, "%H:%M") < "09:30")
giving:
Duration End Date Start Date
1 228 2013-01-03 09:10:00 2013-01-03 09:06:00
Note: We used the following for DF. (Next time please use dput to provide your data in reproducible form.)
DF <- structure(list(Duration = c(228L, 1675L, 393L, 426L, 827L, 780L
), `End Date` = structure(c(1357222200, 1357338840, 1357340040,
1357315800, 1357074780, 1357074780), class = c("POSIXct", "POSIXt"
), tzone = ""), `Start Date` = structure(c(1357221960, 1357337160,
1357339680, 1357315380, 1357073940, 1357074000), class = c("POSIXct",
"POSIXt"), tzone = "")), .Names = c("Duration", "End Date", "Start Date"
), row.names = c(NA, -6L), class = "data.frame")
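If comparing strings feels fragile, the same filter can be written numerically by converting each time to minutes since midnight. This is a sketch only: the sample rows are rebuilt with an assumed UTC time zone so the result does not depend on the local setting, and `mins_of_day` and `res` are names invented here.

```r
DF <- data.frame(
  Duration = c(228L, 1675L, 393L, 426L, 827L, 780L),
  `End Date` = as.POSIXct(c("2013-01-03 09:10:00", "2013-01-04 17:34:00",
                            "2013-01-04 17:54:00", "2013-01-04 11:10:00",
                            "2013-01-01 16:13:00", "2013-01-01 16:13:00"), tz = "UTC"),
  `Start Date` = as.POSIXct(c("2013-01-03 09:06:00", "2013-01-04 17:06:00",
                              "2013-01-04 17:48:00", "2013-01-04 11:03:00",
                              "2013-01-01 15:59:00", "2013-01-01 16:00:00"), tz = "UTC"),
  check.names = FALSE  # keep the column names with spaces
)

# Minutes since midnight, in the time zone stored on the POSIXct vector
mins_of_day <- function(x) {
  as.integer(format(x, "%H")) * 60L + as.integer(format(x, "%M"))
}

# 8:00 is minute 480 and 9:30 is minute 570
res <- DF[mins_of_day(DF$`Start Date`) >= 8 * 60 &
          mins_of_day(DF$`End Date`) < 9 * 60 + 30, ]
res
```

As with the string version, this looks only at the time of day, so an interval that crosses midnight would need separate handling.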