Not even sure if I've described the problem accurately in the title, but here goes.
Suppose I have the following data.table/data.frame:
library(data.table)
library(lubridate)
DT <- data.table(begin = c("2019-06-01 09:00:00","2019-06-01 09:00:00", "2019-06-01 09:00:00",
"2019-06-01 09:00:00", "2016-06-01 09:00:00","2016-06-01 09:00:00"),
end = c("2019-06-03 14:00:00", "2019-06-03 14:00:00", "2019-06-03 14:00:00",
"2019-06-02 05:00:00", "2019-06-02 05:00:00", "2016-06-01 23:15:00"),
person = c("A", "A","A", "B", "B", "C"))
begin end person
1: 2019-06-01 09:00:00 2019-06-03 14:00:00 A
2: 2019-06-01 09:00:00 2019-06-03 14:00:00 A
3: 2019-06-01 09:00:00 2019-06-03 14:00:00 A
4: 2019-06-01 09:00:00 2019-06-02 05:00:00 B
5: 2016-06-01 09:00:00 2019-06-02 05:00:00 B
6: 2016-06-01 09:00:00 2016-06-01 23:15:00 C
This is essentially a dataset summarizing the time stamps of when a period began and ended for each person. Each shift's row is repeated once for every calendar date the shift spans. For example, person A has three entries for the same "shift" because it spans three distinct dates: 06-01, 06-02, and 06-03. Some shifts, however, begin and end within the same day.
What I want is to update the begin and end dates of the above dataset, so that I can see what time each shift began and ended at the day level. So the dataset should look like:
begin end person
1: 2019-06-01 09:00:00 2019-06-02 00:00:00 A
2: 2019-06-02 00:00:00 2019-06-03 00:00:00 A
3: 2019-06-03 00:00:00 2019-06-03 14:00:00 A
4: 2019-06-01 09:00:00 2019-06-02 00:00:00 B
5: 2016-06-02 00:00:00 2019-06-02 05:00:00 B
6: 2016-06-01 09:00:00 2016-06-01 23:15:00 C
Any help would be greatly appreciated!
First, fixing the dates (I also corrected row 5, whose shift began in 2016 but ended in 2019, which seems unlikely to be intended):
DT[, c("begin", "end"):=lapply(.SD, as.POSIXct), .SDcols=c("begin", "end")]
## we get this
DT <- as.data.table(structure(list(begin = structure(c(1559394000, 1559394000, 1559394000, 1559394000, 1559394000, 1464786000), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(1559584800, 1559584800, 1559584800, 1559466000, 1559466000, 1464837300), class = c("POSIXct", "POSIXt"), tzone = ""), person = c("A", "A", "A", "B", "B", "C")), row.names = c(NA, -6L), class = c("data.table", "data.frame")))
Second, we create a helper function that splits one begin/end pair at each intervening midnight:
func <- function(st, en) {
  # every midnight touched by the shift
  midns <- lubridate::ceiling_date(seq(st, en, by = "day"), unit = "day")
  # keep the midnights strictly inside (st, en), then add the endpoints
  times <- unique(sort(c(midns[ st < midns & midns < en ], st, en)))
  # pair consecutive times into day-level begin/end rows
  data.table(begin = times[-length(times)], end = times[-1])
}
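As a quick sanity check, calling func on person A's shift alone reproduces the three day-level rows (times parsed in the session's timezone):
func(as.POSIXct("2019-06-01 09:00:00"), as.POSIXct("2019-06-03 14:00:00"))
#                  begin                 end
# 1: 2019-06-01 09:00:00 2019-06-02 00:00:00
# 2: 2019-06-02 00:00:00 2019-06-03 00:00:00
# 3: 2019-06-03 00:00:00 2019-06-03 14:00:00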
Lastly, we apply it, using by = .(person) to preserve that column in the output. I use unique(DT) since we do not need (or even want) the duplicated rows for each shift/day:
unique(DT)[, rbindlist(Map(func, begin, end)), by = .(person)]
# person begin end
# <char> <POSc> <POSc>
# 1: A 2019-06-01 09:00:00 2019-06-02 00:00:00
# 2: A 2019-06-02 00:00:00 2019-06-03 00:00:00
# 3: A 2019-06-03 00:00:00 2019-06-03 14:00:00
# 4: B 2019-06-01 09:00:00 2019-06-02 00:00:00
# 5: B 2019-06-02 00:00:00 2019-06-02 05:00:00
# 6: C 2016-06-01 09:00:00 2016-06-01 23:15:00
Assuming you had a typo for row 5 person B (begin 2019 not 2016):
library(data.table)
library(lubridate)
DT <- data.table(begin = c("2019-06-01 09:00:00", "2019-06-01 09:00:00", "2019-06-01 09:00:00",
                           "2019-06-01 09:00:00", "2019-06-01 09:00:00", "2016-06-01 09:00:00"),
                 end = c("2019-06-03 14:00:00", "2019-06-03 14:00:00", "2019-06-03 14:00:00",
                         "2019-06-02 05:00:00", "2019-06-02 05:00:00", "2016-06-01 23:15:00"),
                 person = c("A", "A", "A", "B", "B", "C"))
DT[, `:=`(min = as.numeric(difftime(end, begin, units = "mins")),
          days = as.numeric(as_date(end) - as_date(begin) + 1))][, min_day := min/days]
unique(DT)
begin end person min days min_day
1: 2019-06-01 09:00:00 2019-06-03 14:00:00 A 3180 3 1060
2: 2019-06-01 09:00:00 2019-06-02 05:00:00 B 1200 2 600
3: 2016-06-01 09:00:00 2016-06-01 23:15:00 C 855 1 855
I am trying to count the number of events occurring within each interval, for each level of my factor (mystations).
Below is an MWE:
library(lubridate)
myintervals <- c(dmy_hms(
"01/01/2000 08:00:00",
"25/02/2000 09:00:00",
"01/03/2000 10:00:00",
"30/04/2000 11:00:00",
"01/05/2000 12:00:00",
"30/06/2000 13:00:00",
"01/07/2000 14:00:00",
"30/08/2000 15:00:00",
"01/09/2000 16:00:00",
"30/10/2000 17:00:00"))
mystations <- c("A","B","C","A","B","C","A","B","C","D")
mydata <- data.frame(myintervals,mystations)
           myintervals mystations
1  2000-01-01 08:00:00          A
2  2000-02-25 09:00:00          B
3  2000-03-01 10:00:00          C
4  2000-04-30 11:00:00          A
5  2000-05-01 12:00:00          B
6  2000-06-30 13:00:00          C
7  2000-07-01 14:00:00          A
8  2000-08-30 15:00:00          B
9  2000-09-01 16:00:00          C
10 2000-10-30 17:00:00          D
Here I create the detections:
date.time <- c(dmy_hms(
"31/12/1999 08:00:00",
"24/02/2000 09:00:00",
"25/02/2000 08:00:00",
"26/02/2000 10:00:00",
"27/02/2000 11:00:00",
"01/03/2000 10:00:00",
"10/03/2000 22:00:00",
"20/03/2000 23:00:00",
"01/04/2000 10:00:00",
"20/04/2000 20:00:00",
"25/04/2000 08:00:00",
"30/04/2000 10:00:00",
"01/05/2000 12:00:00",
"10/05/2000 20:00:00",
"20/05/2000 08:00:00",
"30/06/2000 13:00:00",
"10/07/2000 10:00:00",
"20/07/2000 20:00:00",
"30/08/2000 15:00:00",
"01/09/2000 16:00:00"))
mydetections <- data.frame(date.time=date.time,mystations=mystations)
             date.time mystations
1  1999-12-31 08:00:00          A
2  2000-02-24 09:00:00          B
3  2000-02-25 08:00:00          C
4  2000-02-26 10:00:00          A
5  2000-02-27 11:00:00          B
6  2000-03-01 10:00:00          C
7  2000-03-10 22:00:00          A
8  2000-03-20 23:00:00          B
9  2000-04-01 10:00:00          C
10 2000-04-20 20:00:00          D
11 2000-04-25 08:00:00          A
12 2000-04-30 10:00:00          B
13 2000-05-01 12:00:00          C
14 2000-05-10 20:00:00          A
15 2000-05-20 08:00:00          B
16 2000-06-30 13:00:00          C
17 2000-07-10 10:00:00          A
18 2000-07-20 20:00:00          B
19 2000-08-30 15:00:00          C
20 2000-09-01 16:00:00          D
The origins for each interval are here:
myorigins <- data.frame(
  myintervals = dmy_hms("01/01/1970 00:00:00", "01/04/1970 00:00:00",
                        "01/08/1970 00:00:00", "01/12/1970 00:00:00"),
  mystations = unique(mydata$mystations))
The expected output is this:
myintervals mystation value
1 1970-01-01 00:00:00 UTC--2000-01-01 08:00:00 UTC A 1
2 2000-01-01 08:00:00 UTC--2000-04-30 11:00:00 UTC A 3
3 2000-04-30 11:00:00 UTC--2000-07-01 14:00:00 UTC A 1
4 1970-04-01 00:00:00 UTC--2000-02-25 09:00:00 UTC B 1
5 2000-02-25 09:00:00 UTC--2000-05-01 12:00:00 UTC B 3
6 2000-05-01 12:00:00 UTC--2000-08-30 15:00:00 UTC B 2
7 1970-08-01 00:00:00 UTC--2000-03-01 10:00:00 UTC C 2
8 2000-03-01 10:00:00 UTC--2000-06-30 13:00:00 UTC C 3
9 2000-06-30 13:00:00 UTC--2000-09-01 16:00:00 UTC C 1
10 1970-12-01 00:00:00 UTC--2000-10-30 17:00:00 UTC D 1
What I was able to achieve so far is this:
# line by line
library(dplyr)   # for arrange() and group_split()
library(tibble)  # for add_row()
mydata <- add_row(mydata, myorigins)
mydata <- arrange(mydata,mystations,myintervals)
DF <- group_split(mydata,mystations)
Y <- lapply(seq_along(DF), function(x) as.data.frame(DF[[x]]))
names(Y) <- c(unique(mydata$mystations))
list2env(Y, envir = .GlobalEnv)
#splitting the detections
DFD <- group_split(mydetections,mystations)
Z <- lapply(seq_along(DFD), function(x) as.data.frame(DFD[[x]]))
names(Z) <- c(paste(unique(mydata$mystations),"det",sep=""))
list2env(Z, envir = .GlobalEnv)
I believe now it is time to "only" construct the intervals for each dataframe, like this:
Aint <- int_diff(A$myintervals)
and "checking" which detection falls in which interval with this:
myresA <- Adet$date.time%within%Aint
Clearly, I would like to avoid "manually" constructing the intervals for each of the per-station data frames (A, B, C, D).
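In other words, the pattern I want to automate over all stations is roughly this (a sketch with the same semantics as the Aint/myresA lines above, assuming the split lists Y and Z are in the same station order):
# map the two manual steps over every station at once
myres <- Map(function(stn, det) det$date.time %within% int_diff(stn$myintervals),
             Y, Z)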
As always, I would greatly appreciate any help or tips for getting the desired output. I apologize for the initial confusion in the post.
Here are a couple of options to consider - hope this may be helpful.
Using tidyverse you can add your myorigins to mydata, then after sorting with arrange make time intervals (start-end).
You can use fuzzy_left_join to add the events table, matching on mystations and where the date.time falls between the interval start and end.
Then, after grouping, you can count the number of rows. You will get something close to your result, depending on how you want to handle edge cases.
library(tidyverse)
library(fuzzyjoin)
library(lubridate)
bind_rows(mydata, myorigins) %>%
arrange(myintervals) %>%
group_by(mystations) %>%
transmute(start = myintervals, end = lead(myintervals)) %>%
filter(!is.na(end)) %>%
fuzzy_left_join(
mydetections,
by = c("mystations", "start" = "date.time", "end" = "date.time"),
match_fun = c(`==`, `<`, `>=`)
) %>%
group_by(start, end, mystations.x) %>%
summarise(count = n()) %>%
arrange(mystations.x)
Output
start end mystations.x count
<dttm> <dttm> <chr> <int>
1 1970-01-01 00:00:00 2000-01-01 08:00:00 A 1
2 2000-01-01 08:00:00 2000-04-30 11:00:00 A 3
3 2000-04-30 11:00:00 2000-07-01 14:00:00 A 1
4 1970-04-01 00:00:00 2000-02-25 09:00:00 B 1
5 2000-02-25 09:00:00 2000-05-01 12:00:00 B 3
6 2000-05-01 12:00:00 2000-08-30 15:00:00 B 2
7 1970-08-01 00:00:00 2000-03-01 10:00:00 C 2
8 2000-03-01 10:00:00 2000-06-30 13:00:00 C 3
9 2000-06-30 13:00:00 2000-09-01 16:00:00 C 1
10 1970-12-01 00:00:00 2000-10-30 17:00:00 D 2
An alternative to consider is using data.table, which would be faster. One function that may be helpful here is foverlaps, to find the overlap between the event dates and the date ranges.
library(data.table)
dt <- rbind(myorigins, mydata)
setDT(dt)
dt[, c("start", "end") := list(myintervals, lead(myintervals)), by = mystations]
dt <- na.omit(dt, "end")
setDT(mydetections)
mydetections[, date.time.copy := date.time]  # foverlaps needs a start/end range in both tables, so duplicate the timestamp
setkey(mydetections, mystations, date.time, date.time.copy)
dt_ovlp <- foverlaps(dt,
mydetections,
by.x = c("mystations", "start", "end"),
by.y = c("mystations", "date.time", "date.time.copy"))
dt_ovlp[ , .(value = .N), by = c("mystations", "start", "end")][order(mystations, start)]
I have a dataframe in which each row is the working hours of an employee defined by a start and a stop time:
DF
EmployeeNum       Start_datetime         End_datetime
123 2012-02-01 07:30:00 2012-02-01 17:45:00
342 2012-02-01 08:00:00 2012-02-01 17:45:00
876 2012-02-01 10:45:00 2012-02-01 18:45:00
I'd like to find the number of employees working during each hour on each day in a timespan:
Date Hour NumberofEmployeesWorking
2012-02-01 00:00 ? (number of employees working between 00:00 and 00:59)
2012-02-01 01:00 ?
2012-02-01 02:00 ?
2012-02-01 03:00 ?
2012-02-01 04:00 ?
2012-02-01 05:00 ?
2012-02-01 06:00 ?
How do I put my working hours into bins like this?
Your data, in a more consumable format, plus one extra row that spans midnight (for illustration). I changed the timestamp format to include a "T" here to make reading easier; otherwise the embedded space makes it less trivial to parse with read.table(text='...'). (You can skip this since you already have your real data.)
x <- read.table(text='EmployeeNum Start_datetime End_datetime
123 2012-02-01T07:30:00 2012-02-01T17:45:00
342 2012-02-01T08:00:00 2012-02-01T17:45:00
876 2012-02-01T10:45:00 2012-02-01T18:45:00
877 2012-02-01T22:45:00 2012-02-02T05:45:00',
header=TRUE, stringsAsFactors=FALSE)
In case you haven't done it with your own data, convert all times to POSIXt, otherwise skip this, too.
x[c('Start_datetime','End_datetime')] <- lapply(x[c('Start_datetime','End_datetime')],
as.POSIXct, format='%Y-%m-%dT%H:%M:%S')
We need to generate a sequence of hourly timestamps:
startdate <- trunc(min(x$Start_datetime), units = "hours")
enddate <- round(max(x$End_datetime), units = "hours")
c(startdate, enddate)
# [1] "2012-02-01 07:00:00 PST" "2012-02-02 06:00:00 PST"
timestamps <- seq(startdate, enddate, by = "hour")
head(timestamps)
# [1] "2012-02-01 07:00:00 PST" "2012-02-01 08:00:00 PST" "2012-02-01 09:00:00 PST"
# [4] "2012-02-01 10:00:00 PST" "2012-02-01 11:00:00 PST" "2012-02-01 12:00:00 PST"
(Assumption: all end timestamps are after their start timestamps ...)
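If you want to guard that assumption explicitly, a quick check beforehand:
stopifnot(all(x$Start_datetime < x$End_datetime))  # every shift must end after it starts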
Now it's just a matter of tallying:
counts <- mapply(function(st,en) sum(st <= x$End_datetime & x$Start_datetime <= en),
timestamps[-length(timestamps)], timestamps[-1])
data.frame(
start = timestamps[ -length(timestamps) ],
count = counts
)
# start count
# 1 2012-02-01 07:00:00 2
# 2 2012-02-01 08:00:00 2
# 3 2012-02-01 09:00:00 2
# 4 2012-02-01 10:00:00 3
# 5 2012-02-01 11:00:00 3
# 6 2012-02-01 12:00:00 3
# 7 2012-02-01 13:00:00 3
# 8 2012-02-01 14:00:00 3
# 9 2012-02-01 15:00:00 3
# 10 2012-02-01 16:00:00 3
# 11 2012-02-01 17:00:00 3
# 12 2012-02-01 18:00:00 1
# 13 2012-02-01 19:00:00 0
# 14 2012-02-01 20:00:00 0
# 15 2012-02-01 21:00:00 0
# 16 2012-02-01 22:00:00 1
# 17 2012-02-01 23:00:00 1
# 18 2012-02-02 00:00:00 1
# 19 2012-02-02 01:00:00 1
# 20 2012-02-02 02:00:00 1
# 21 2012-02-02 03:00:00 1
# 22 2012-02-02 04:00:00 1
# 23 2012-02-02 05:00:00 1
I did not see @r2evans's answer before posting; I came up with this independently, though it looks similar. I posted it here in case it is helpful. Feel free to accept @r2evans's answer.
Data:
df1 <- read.table(text="EmployeeNum Start_datetime End_datetime
123 '2012-02-01 07:30:00' '2012-02-01 17:45:00'
342 '2012-02-01 08:00:00' '2012-02-01 17:45:00'
876 '2012-02-01 10:45:00' '2012-02-01 18:45:00'", header = TRUE )
df1 <- within(df1, Start_datetime <- as.POSIXct( Start_datetime))
df1 <- within(df1, End_datetime <- as.POSIXct( End_datetime))
Code:
Create an hourly datetime sequence for each employee, then count the number of rows by Start_datetime.
Note that this code assumes the original data is separated into single days. If your data mixes multiple days, the IDateTime() function from the data.table package can split the date from the time, so you can group by day while building the datetime sequence (see the sketch after the output below).
library('data.table')
setDT(df1) # assign data.table class by reference
df2 <- df1[, Map( f = function(x, y) seq( from = trunc(x, "hour"),
to = round(y, "hour"),
by = "1 hour" ),
x = Start_datetime, y = End_datetime ),
by = EmployeeNum ]
setnames(df2, "V1", "Start_datetime")  # rename the sequence column created by the Map step
Output:
df2[, .N, by = .( Start_datetime, End_datetime = Start_datetime + 3599 ) ]
# Start_datetime End_datetime N
# 1: 2012-02-01 07:00:00 2012-02-01 07:59:59 1
# 2: 2012-02-01 08:00:00 2012-02-01 08:59:59 2
# 3: 2012-02-01 09:00:00 2012-02-01 09:59:59 2
# 4: 2012-02-01 10:00:00 2012-02-01 10:59:59 3
# 5: 2012-02-01 11:00:00 2012-02-01 11:59:59 3
# 6: 2012-02-01 12:00:00 2012-02-01 12:59:59 3
# 7: 2012-02-01 13:00:00 2012-02-01 13:59:59 3
# 8: 2012-02-01 14:00:00 2012-02-01 14:59:59 3
# 9: 2012-02-01 15:00:00 2012-02-01 15:59:59 3
# 10: 2012-02-01 16:00:00 2012-02-01 16:59:59 3
# 11: 2012-02-01 17:00:00 2012-02-01 17:59:59 3
# 12: 2012-02-01 18:00:00 2012-02-01 18:59:59 3
# 13: 2012-02-01 19:00:00 2012-02-01 19:59:59 1
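On the multi-day note above, a minimal sketch of the IDateTime() idea (idate and itime are the two columns IDateTime() returns):
# split each timestamp into a date part and a time part,
# so the sequence step can additionally be grouped by day
df1[, c("idate", "itime") := IDateTime(Start_datetime)]
df1[, .N, by = idate]  # e.g. how many shifts start on each day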
Graph:
binwidth = 3600: the value indicates 1 hour = 60 min * 60 sec = 3600 seconds.
library('ggplot2')
ggplot( data = df2,
mapping = aes( x = Start_datetime ) ) +
geom_histogram(binwidth = 3600, color = "red", fill = "white" ) +
scale_x_datetime( date_breaks = "1 hour", date_labels = "%H:%M" ) +
ylab("Number of Employees") +
xlab( "Working Hours: 2012-02-01" ) +
theme( axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank(),
panel.background = element_rect( fill = "white", color = "black") )
Thank you both for your answers. I came up with a solution which is pretty similar to yours, but I was wondering if you could have a look and let me know what you think of it.
I started a new empty dataframe, then made two nested loops to look at each start and end time in each row and generate a sequence of hours in between. Then I append each hour in the sequence to the new empty dataframe. This way, I can simply do a count later.
staffDetailHours <- data.frame("personnelNum"=integer(0),
"workDate"=character(0),
"Hour"=integer(0))
for (i in 1:dim(DF)[1]){
hoursList <- seq(as.POSIXlt(DF[i,]$START)$hour,
as.POSIXlt(DF[i,]$END)$hour)
for (j in 1:length(hoursList)) {
staffDetailHours[nrow(staffDetailHours)+1,] = list(
DF[i,]$EmployeeNum,
DF[i,]$Date,
hoursList[j]
)
}
}
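One thing I noticed: appending one row at a time copies the whole data.frame on every iteration, so a variant that collects the rows in a list first (same DF columns as above) should scale better:
rows <- lapply(seq_len(nrow(DF)), function(i) {
  hrs <- seq(as.POSIXlt(DF$START[i])$hour, as.POSIXlt(DF$END[i])$hour)
  data.frame(personnelNum = DF$EmployeeNum[i],
             workDate = DF$Date[i],
             Hour = hrs)
})
staffDetailHours <- do.call(rbind, rows)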
I have a sample dataset of the trajectory of one bike. My objective is to figure out, on average, the amount of time that lapses in between visits to station B.
So far, I have been able to simply order the dataset with:
test[order(test$starttime, decreasing = FALSE),]
and find the row index of where start_station and end_station equal B.
which(test$start_station == 'B')
which(test$end_station == 'B')
The next part is where I run into trouble. In order to calculate the time that lapses in between when the bike is at Station B, we must take the difftime() between where start_station = "B" (the bike leaves) and the next occurring record where end_station = "B", even if that record happens to be in the same row (see row 6).
Using the dataset below, we know that the bike spent 510 minutes between 7:30:00 and 16:00:00 outside of Station B, 30 minutes between 18:00:00 and 18:30:00 outside of Station B, and 210 minutes between 19:00:00 and 22:30:00 outside of Station B, which averages to 250 minutes.
How would one reproduce this output in R using difftime()?
> test
bikeid start_station starttime end_station endtime
1 1 A 2017-09-25 01:00:00 B 2017-09-25 01:30:00
2 1 B 2017-09-25 07:30:00 C 2017-09-25 08:00:00
3 1 C 2017-09-25 10:00:00 A 2017-09-25 10:30:00
4 1 A 2017-09-25 13:00:00 C 2017-09-25 13:30:00
5 1 C 2017-09-25 15:30:00 B 2017-09-25 16:00:00
6 1 B 2017-09-25 18:00:00 B 2017-09-25 18:30:00
7 1 B 2017-09-25 19:00:00 A 2017-09-25 19:30:00
8 1 А 2017-09-25 20:00:00 C 2017-09-25 20:30:00
9 1 C 2017-09-25 22:00:00 B 2017-09-25 22:30:00
10 1 B 2017-09-25 23:00:00 C 2017-09-25 23:30:00
Here is the sample data:
> dput(test)
structure(list(bikeid = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), start_station = c("A",
"B", "C", "A", "C", "B", "B", "А", "C", "B"), starttime = structure(c(1506315600,
1506339000, 1506348000, 1506358800, 1506367800, 1506376800, 1506380400,
1506384000, 1506391200, 1506394800), class = c("POSIXct", "POSIXt"
), tzone = ""), end_station = c("B", "C", "A", "C", "B", "B",
"A", "C", "B", "C"), endtime = structure(c(1506317400, 1506340800,
1506349800, 1506360600, 1506369600, 1506378600, 1506382200, 1506385800,
1506393000, 1506396600), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("bikeid",
"start_station", "starttime", "end_station", "endtime"), row.names = c(NA,
-10L), class = "data.frame")
This will calculate the differences as asked, in the order they occur, but does not append them to the data.frame (df1 here is the test data from the question):
lapply(df1$starttime[df1$start_station == "B"],
       function(x, et) difftime(et[x < et][1], x, units = "mins"),
       et = df1$endtime[df1$end_station == "B"])
[[1]]
Time difference of 510 mins
[[2]]
Time difference of 30 mins
[[3]]
Time difference of 210 mins
[[4]]
Time difference of NA mins
To calculate the average time:
v1 <- sapply(df1$starttime[df1$start_station == "B"],
             function(x, et) difftime(et[x < et][1], x, units = "mins"),
             et = df1$endtime[df1$end_station == "B"])
mean(v1, na.rm = TRUE)
[1] 250
Another possibility:
library(data.table)
d <- setDT(test)[ , {
start = starttime[start_station == "B"]
end = endtime[end_station == "B"]
.(start = start, end = end, duration = difftime(end, start, units = "min"))
}
, by = .(trip = cumsum(start_station == "B"))]
d
# trip start end duration
# 1: 0 <NA> 2017-09-25 01:30:00 NA mins
# 2: 1 2017-09-25 07:30:00 2017-09-25 16:00:00 510 mins
# 3: 2 2017-09-25 18:00:00 2017-09-25 18:30:00 30 mins
# 4: 3 2017-09-25 19:00:00 2017-09-25 22:30:00 210 mins
# 5: 4 2017-09-25 23:00:00 <NA> NA mins
d[ , mean(duration, na.rm = TRUE)]
# Time difference of 250 mins
# or
d[ , mean(as.integer(duration), na.rm = TRUE)]
# [1] 250
The data is grouped by a counter which increases by 1 each time a bike starts from "B" (by = cumsum(start_station == "B")).
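To see that grouping counter on its own:
cumsum(test$start_station == "B")
# [1] 0 1 1 1 1 2 3 3 3 4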
I have a sample dataset which tracks the trajectory of a bike to different stations. My objective is to find the intervals that the bike remains at a particular station with difftime(), in this case, station B.
> test
bikeid start_station starttime end_station endtime
1 1 A 2017-09-25 01:00:00 B 2017-09-25 01:30:00
2 1 B 2017-09-25 07:30:00 C 2017-09-25 08:00:00
3 1 C 2017-09-25 10:00:00 A 2017-09-25 10:30:00
4 1 A 2017-09-25 13:00:00 C 2017-09-25 13:30:00
5 1 C 2017-09-25 15:30:00 B 2017-09-25 16:00:00
6 1 B 2017-09-25 18:00:00 B 2017-09-25 18:30:00
7 1 B 2017-09-25 19:00:00 A 2017-09-25 19:30:00
8 1 А 2017-09-25 20:00:00 B 2017-09-25 20:30:00
9 1 C 2017-09-25 22:00:00 C 2017-09-25 22:30:00
10 1 B 2017-09-25 23:00:00 C 2017-09-25 23:30:00
Sometimes, the bikes do not start at the same station that they ended, and these cases should be ignored. In the above dataset, we can see that 360 minutes lapsed between 01:30:00 and 07:30:00, that 120 minutes lapsed between 16:00:00 and 18:00:00, and that 30 minutes lapsed between 18:30:00 and 19:00:00. Row 8 and 10 are ignored because the bike does not start at the same station at which it ended. Therefore, the output vector should be:
[1] 360 120 30
The following code does not produce the desired output:
sapply(test$starttime[test$end_station == "B"],
       function(x, et) difftime(et[x < et][1], x, units = "mins"),
       et = test$endtime[test$start_station == "B"])
How would one take into account the next row and calculate difftime() only when the end_station and the start_station in the following row are equal? Using lead() from dplyr? Any suggestion would be appreciated.
Here is the sample data:
> dput(test)
structure(list(bikeid = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), start_station = c("A",
"B", "C", "A", "C", "B", "B", "А", "C", "B"), starttime = structure(c(1506315600,
1506339000, 1506348000, 1506358800, 1506367800, 1506376800, 1506380400,
1506384000, 1506391200, 1506394800), class = c("POSIXct", "POSIXt"
), tzone = ""), end_station = c("B", "C", "A", "C", "B", "B",
"A", "B", "C", "C"), endtime = structure(c(1506317400, 1506340800,
1506349800, 1506360600, 1506369600, 1506378600, 1506382200, 1506385800,
1506393000, 1506396600), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("bikeid",
"start_station", "starttime", "end_station", "endtime"), row.names = c(NA,
-10L), class = "data.frame")
Reshaping as suggested last time...
library(data.table)
mtest = melt(setDT(test), id="bikeid",
meas = patterns("_station", "time"),
variable.name = "event",
value.name = c("station", "time"))
mtest[.(factor(1:2), c("start", "end")), on=.(event), event := i.V2]
setkey(mtest, bikeid, time)
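For orientation, the long format now has one row per arrive/depart event, ordered by time:
head(mtest)
#    bikeid  event station                time
# 1:      1  start       A 2017-09-25 01:00:00
# 2:      1    end       B 2017-09-25 01:30:00
# 3:      1  start       B 2017-09-25 07:30:00
# 4:      1    end       C 2017-09-25 08:00:00
# 5:      1  start       C 2017-09-25 10:00:00
# 6:      1    end       A 2017-09-25 10:30:00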
Then back to wide for the spells while the bike is idle...
# drop the initial departure and the final arrival, then pair each
# arrival with the departure that follows it (one row per idle spell)
idleDT = dcast(mtest[-c(1,.N)][, g := rep(1:.N, each=2, length.out=.N)],
               g ~ rowid(g), value.var=c("station", "time"))
g station_1 station_2 time_1 time_2
1: 1 B B 2017-09-25 01:30:00 2017-09-25 07:30:00
2: 2 C C 2017-09-25 08:00:00 2017-09-25 10:00:00
3: 3 A A 2017-09-25 10:30:00 2017-09-25 13:00:00
4: 4 C C 2017-09-25 13:30:00 2017-09-25 15:30:00
5: 5 B B 2017-09-25 16:00:00 2017-09-25 18:00:00
6: 6 B B 2017-09-25 18:30:00 2017-09-25 19:00:00
7: 7 A <U+0410> 2017-09-25 19:30:00 2017-09-25 20:00:00
8: 8 B C 2017-09-25 20:30:00 2017-09-25 22:00:00
9: 9 C B 2017-09-25 22:30:00 2017-09-25 23:00:00
Then join or filter and compute...
idleDT[.("B", "B"), on=.(station_1, station_2), time_2 - time_1 ]
Time differences in mins
[1] 360 120 30
Comment
I should probably explain why I prefer the long-format mtest over the OP's test, even though I go right back to wide format for the analysis (thanks @Henrik)...
Stations could/should arguably be a factor, and if you have it split over two columns in the core data, it's a burden to ensure that both factors have the same levels.
The data is presumably recorded in terms of events (like "the bike left" and "the bike arrived"), not in terms of trips. If someone steals the bike or it gets lost, for example, the endtime and end_station should logically be missing, but this is easier to keep track of in long format, in my opinion.
The measured data could even have two "the bike arrived" events in a row; even though that doesn't logically make sense, anything that can go wrong with data will go wrong, in my experience. If this happened, you'd have a hard time figuring out how to record it in wide format in terms of trips.
Generally, I'm just applying my (perhaps overzealous or wrong) understanding of tidy data, riffing on Hadley's complaint in the link about data layouts where "[c]olumn headers are values, not variable names."
A dplyr solution (df here is the test data from the question):
library(dplyr)
df %>%
mutate(lag_end_station = lag(end_station),
lag_end_time = lag(endtime)) %>%
filter(start_station == "B" & lag_end_station == "B") %>%
transmute(interval = difftime(starttime, lag_end_time))
Result:
interval
1 360 mins
2 120 mins
3 30 mins
The problem:
I have two dataframes that I would like to merge depending on the date/time of one dataframe being in the interval of the other dataframe.
traffic: Date and Time (POSIXct), Frequency
mydata: Interval, Sum of Frequency
I would now like to check whether the POSIXct time from traffic falls within an interval of mydata, and if it does, add that row's frequency to the "Sum of Frequency" column in mydata.
The two problems, that I encountered:
1. The traffic data frame has significantly more rows than mydata. I don't know how to tell R to loop through every observation in traffic to check it against one row in mydata.
2. There can be more than one observation fitting in the frequency interval of mydata. I want R to add up all frequencies of the different traffic observations to get a total score of frequencies. Also, the intervals are overlapping.
Here is the data:
DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00", "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)
traffic <- data.frame(DateTime, Frequency)
library(lubridate)
DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00", "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00", "2015-08-01 14:15:00")
DateTime1 <- as.POSIXct(DateTime1)
DateTime2 <- as.POSIXct(DateTime2)
mydata <- data.frame(DateTime1, DateTime2)
mydata$Interval <- as.interval(DateTime1, DateTime2)
mydata$SumFrequency <- NA
The expected outcome should be something like this:
mydata$SumFrequency <- c(24, 2, 2)
head(mydata)
I tried int_overlaps from package lubridate.
Any tips on how to solve this are highly appreciated!
A short solution with foverlaps from the data.table package:
mydata <- data.table(DateTime1, DateTime2, key = c("DateTime1", "DateTime2"))
traffic <- data.table(start = DateTime, end = DateTime, Frequency, key = c("start","end"))  # zero-width intervals: each event is its own start and end
foverlaps(traffic, mydata, type="within", nomatch=0L)[, .(sumFreq = sum(Frequency)),
by = .(DateTime1, DateTime2)]
which gives:
DateTime1 DateTime2 sumFreq
1: 2014-11-01 04:00:00 2014-11-01 04:15:00 24
2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2
A data.table approach, using %between% to filter the traffic dataset on time:
setDT(traffic)
setDT(mydata)
mydata[,SumFrequency := as.numeric(SumFrequency)] # coerce logical to numeric for next step.
mydata[,SumFrequency := sum( traffic[ DateTime %between% c(DateTime1, DateTime2), Frequency] ), by=1:nrow(mydata)]
which give:
DateTime1 DateTime2 Interval SumFrequency
1: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 24
2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 2
If there are a lot of rows in mydata, it could be better to create an index column and use it in the by clause:
mydata[, idx := .I]
mydata[, SumFrequency := sum( traffic[DateTime %between% c(DateTime1, DateTime2),Frequency] ),by=idx]
And this gives:
DateTime1 DateTime2 Interval SumFrequency idx
1: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 24 1
2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2 2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 2 3
I see two solutions:
With data.frame and plyr
You could do it using the %within% function in lubridate, either with a for-loop or with plyr looping functions like dlply.
DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00", "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)
traffic <- data.frame(DateTime, Frequency)
library(lubridate)
DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00", "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00", "2015-08-01 14:15:00")
DateTime1 <- as.POSIXct(DateTime1)
DateTime2 <- as.POSIXct(DateTime2)
mydata <- data.frame(DateTime1, DateTime2)
mydata$Interval <- as.interval(DateTime1, DateTime2)
library(plyr)
# Create a group-by variable
mydata$NumInt <- 1:nrow(mydata)
mydata$SumFrequency <- dlply(mydata, .(NumInt),
function(row){
sum(
traffic[traffic$DateTime %within% row$Interval, "Frequency"]
)
})
mydata
#> DateTime1 DateTime2
#> 1 2014-11-01 04:00:00 2014-11-01 04:15:00
#> 2 2015-08-01 04:03:00 2015-08-01 04:13:00
#> 3 2015-08-01 14:00:00 2015-08-01 14:15:00
#> Interval NumInt SumFrequency
#> 1 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 1 24
#> 2 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2 2
#> 3 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 3 2
With data.table and the function foverlaps
data.table has implemented a function for overlapping joins that you could use in your case with a little trick.
This function is foverlaps (I use data.table 1.9.6 below; see How to perform join over date ranges using data.table? and this presentation).
Notice that you do not need to create an interval with lubridate.
DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00", "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)
traffic <- data.table(DateTime, Frequency)
library(lubridate)
DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00", "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00", "2015-08-01 14:15:00")
mydata <- data.table(DateTime1 = as.POSIXct(DateTime1), DateTime2 = as.POSIXct(DateTime2))
# Use function `foverlaps` for overlapping joins
# Here's the trick : create a dummy variable to artificially have an interval
traffic[, dummy:=DateTime]
setkey(mydata, DateTime1, DateTime2)
# do the join
mydata2 <- foverlaps(traffic, mydata, by.x=c("DateTime", "dummy"), type ="within", nomatch=0L)[, dummy := NULL][]
mydata2
#> DateTime1 DateTime2 DateTime Frequency
#> 1: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 1
#> 2: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:03:00 2
#> 3: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:06:00 3
#> 4: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:08:00 5
#> 5: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:10:00 12
#> 6: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:12:00 1
#> 7: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:13:00 2
#> 8: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:15:00 1
#> 9: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:13:00 1
# summarise with a sum by grouping by each line of mydata
setkeyv(mydata2, key(mydata))
mydata2[mydata, .(SumFrequency = sum(Frequency)), by = .EACHI]
#> DateTime1 DateTime2 SumFrequency
#> 1: 2014-11-01 04:00:00 2014-11-01 04:15:00 24
#> 2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2
#> 3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2
As far as point 2 is concerned, you can use aggregate, for instance:
aggData <- aggregate(Frequency ~ format(DateTime, "%Y-%m-%d %H:%M"), data = traffic, sum)
This sums all frequencies in minute intervals.
And for point 1, wouldn't a merge work?
merge(x = mydata, y = aggData, by = "DateTime", all.x = TRUE)
The outer merge is explained here
Using a for-loop, we could do something like this:
for(i in 1:nrow(mydata)) {
mydata$SumFrequency[i] <- sum(traffic$Frequency[traffic$DateTime %within% mydata$Interval[i]])
}
> mydata
# DateTime1 DateTime2 Interval SumFrequency
#1 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 24
#2 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2
#3 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 2