Referencing data in leading rows in R

I have a sample dataset which tracks the trajectory of a bike between stations. My objective is to find, with difftime(), the intervals during which the bike remains at a particular station, in this case station B.
> test
bikeid start_station starttime end_station endtime
1 1 A 2017-09-25 01:00:00 B 2017-09-25 01:30:00
2 1 B 2017-09-25 07:30:00 C 2017-09-25 08:00:00
3 1 C 2017-09-25 10:00:00 A 2017-09-25 10:30:00
4 1 A 2017-09-25 13:00:00 C 2017-09-25 13:30:00
5 1 C 2017-09-25 15:30:00 B 2017-09-25 16:00:00
6 1 B 2017-09-25 18:00:00 B 2017-09-25 18:30:00
7 1 B 2017-09-25 19:00:00 A 2017-09-25 19:30:00
8 1 А 2017-09-25 20:00:00 B 2017-09-25 20:30:00
9 1 C 2017-09-25 22:00:00 C 2017-09-25 22:30:00
10 1 B 2017-09-25 23:00:00 C 2017-09-25 23:30:00
Sometimes the bike does not start from the same station at which it last ended, and these cases should be ignored. In the above dataset, we can see that 360 minutes elapsed between 01:30:00 and 07:30:00, that 120 minutes elapsed between 16:00:00 and 18:00:00, and that 30 minutes elapsed between 18:30:00 and 19:00:00. Rows 8 and 10 are ignored because the bike does not start at the same station at which it ended. Therefore, the output vector should be:
[1] 360 120 30
The following code does not produce the desired output:
sapply(test$starttime[test$end_station == "B"],
       function(x, et) difftime(et[x < et][1], x, units = "mins"),
       et = test$endtime[test$start_station == "B"])
How would one take the next row into account and calculate difftime() only when the end_station and the start_station in the following row are equal? Using lead() from dplyr? Any suggestion would be appreciated.
Here is the sample data:
> dput(test)
structure(list(bikeid = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), start_station = c("A",
"B", "C", "A", "C", "B", "B", "А", "C", "B"), starttime = structure(c(1506315600,
1506339000, 1506348000, 1506358800, 1506367800, 1506376800, 1506380400,
1506384000, 1506391200, 1506394800), class = c("POSIXct", "POSIXt"
), tzone = ""), end_station = c("B", "C", "A", "C", "B", "B",
"A", "B", "C", "C"), endtime = structure(c(1506317400, 1506340800,
1506349800, 1506360600, 1506369600, 1506378600, 1506382200, 1506385800,
1506393000, 1506396600), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("bikeid",
"start_station", "starttime", "end_station", "endtime"), row.names = c(NA,
-10L), class = "data.frame")

Reshaping as suggested last time...
library(data.table)
mtest = melt(setDT(test), id = "bikeid",
             meas = patterns("_station", "time"),
             variable.name = "event",
             value.name = c("station", "time"))
# relabel the event factor (1/2) as "start"/"end" with an update join
mtest[.(factor(1:2), c("start", "end")), on = .(event), event := i.V2]
setkey(mtest, bikeid, time)
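For orientation, the long table should look roughly like this (first rows shown): one row per departure or arrival event, sorted chronologically within bike:
mtest
#     bikeid event station                time
#  1:      1 start       A 2017-09-25 01:00:00
#  2:      1   end       B 2017-09-25 01:30:00
#  3:      1 start       B 2017-09-25 07:30:00
#  4:      1   end       C 2017-09-25 08:00:00
#  ...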
Then back to wide for the spells while the bike is idle...
# drop the first departure and last arrival, then pair consecutive events into idle spells
idleDT = dcast(mtest[-c(1, .N)][, g := rep(1:.N, each = 2, length.out = .N)],
               g ~ rowid(g), value.var = c("station", "time"))
g station_1 station_2 time_1 time_2
1: 1 B B 2017-09-25 01:30:00 2017-09-25 07:30:00
2: 2 C C 2017-09-25 08:00:00 2017-09-25 10:00:00
3: 3 A A 2017-09-25 10:30:00 2017-09-25 13:00:00
4: 4 C C 2017-09-25 13:30:00 2017-09-25 15:30:00
5: 5 B B 2017-09-25 16:00:00 2017-09-25 18:00:00
6: 6 B B 2017-09-25 18:30:00 2017-09-25 19:00:00
7: 7 A <U+0410> 2017-09-25 19:30:00 2017-09-25 20:00:00
8: 8 B C 2017-09-25 20:30:00 2017-09-25 22:00:00
9: 9 C B 2017-09-25 22:30:00 2017-09-25 23:00:00
Then join or filter and compute...
idleDT[.("B", "B"), on=.(station_1, station_2), time_2 - time_1 ]
Time differences in mins
[1] 360 120 30
Comment
I should probably explain why I prefer the long-format mtest over the OP's test, even though I go right back to wide format for the analysis (thanks @Henrik)...
Stations could/should arguably be a factor, and if you have it split over two columns in the core data, it's a burden to ensure that both factors have the same levels.
The data is presumably recorded in terms of events (like "the bike left" and "the bike arrived"), not in terms of trips. If someone steals the bike or it gets lost, for example, the endtime and end_station should logically be missing, but this is easier to keep track of in long format, in my opinion.
The measured data could even have two "the bike arrived" events in a row; even though that doesn't logically make sense, anything that can go wrong with data will go wrong, in my experience. If this happened, you'd have a hard time figuring out how to record it in wide format in terms of trips.
Generally, I'm just applying my (perhaps overzealous or wrong) understanding of tidy data, riffing on Hadley's complaint in the link about data layouts where "[c]olumn headers are values, not variable names."

A dplyr solution:
library(dplyr)
test %>%
  mutate(lag_end_station = lag(end_station),
         lag_end_time = lag(endtime)) %>%
  filter(start_station == "B" & lag_end_station == "B") %>%
  transmute(interval = difftime(starttime, lag_end_time, units = "mins"))
Result:
interval
1 360 mins
2 120 mins
3 30 mins
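Since the question asked about lead(): the same logic can be anchored on rows that end at B, checking the following row's start instead. A sketch that should produce the same intervals:
test %>%
  mutate(next_start_station = lead(start_station),
         next_starttime = lead(starttime)) %>%
  filter(end_station == "B" & next_start_station == "B") %>%
  transmute(interval = difftime(next_starttime, endtime, units = "mins"))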

Related

Updating Dates and Date Intervals in R

Not even sure if I've described the problem accurately in the title, but here goes.
Suppose I have the following data.table/data.frame:
library(data.table)
library(lubridate)
DT <- data.table(begin = c("2019-06-01 09:00:00","2019-06-01 09:00:00", "2019-06-01 09:00:00",
"2019-06-01 09:00:00", "2016-06-01 09:00:00","2016-06-01 09:00:00"),
end = c("2019-06-03 14:00:00", "2019-06-03 14:00:00", "2019-06-03 14:00:00",
"2019-06-02 05:00:00", "2019-06-02 05:00:00", "2016-06-01 23:15:00"),
person = c("A", "A","A", "B", "B", "C"))
begin end person
1: 2019-06-01 09:00:00 2019-06-03 14:00:00 A
2: 2019-06-01 09:00:00 2019-06-03 14:00:00 A
3: 2019-06-01 09:00:00 2019-06-03 14:00:00 A
4: 2019-06-01 09:00:00 2019-06-02 05:00:00 B
5: 2016-06-01 09:00:00 2019-06-02 05:00:00 B
6: 2016-06-01 09:00:00 2016-06-01 23:15:00 C
This is essentially a dataset summarizing the time stamps of when a period began and ended for each person. The rows are repeated for each person according to the number of days the time period spans. For example, person A has three entries for the same "shift" because their shift spans three distinct dates: 06-01, 06-02, and 06-03. Some shifts begin and end within the same day.
What I want is to update the begin and end dates of the above dataset, so that I can see what time each shift began and ended at the day level. So the dataset should look like:
begin end person
1: 2019-06-01 09:00:00 2019-06-02 00:00:00 A
2: 2019-06-02 00:00:00 2019-06-03 00:00:00 A
3: 2019-06-03 00:00:00 2019-06-03 14:00:00 A
4: 2019-06-01 09:00:00 2019-06-02 00:00:00 B
5: 2016-06-02 00:00:00 2019-06-02 05:00:00 B
6: 2016-06-01 09:00:00 2016-06-01 23:15:00 C
Any help would be greatly appreciated!
First, fixing the dates (and I already fixed row 5, which started in 2016 and ran through to 2019; that seems unlikely):
DT[, c("begin", "end"):=lapply(.SD, as.POSIXct), .SDcols=c("begin", "end")]
## we get this
DT <- as.data.table(structure(list(begin = structure(c(1559394000, 1559394000, 1559394000, 1559394000, 1559394000, 1464786000), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(1559584800, 1559584800, 1559584800, 1559466000, 1559466000, 1464837300), class = c("POSIXct", "POSIXt"), tzone = ""), person = c("A", "A", "A", "B", "B", "C")), row.names = c(NA, -6L), class = c("data.table", "data.frame")))
Second, we then create this function
func <- function(st, en) {
  # the midnights strictly between st and en
  midns <- lubridate::ceiling_date(seq(st, en, by = "day"), unit = "day")
  times <- unique(sort(c(midns[st < midns & midns < en], st, en)))
  # consecutive pairs of break points become the per-day rows
  data.table(begin = times[-length(times)], end = times[-1])
}
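As a quick sanity check, calling func on person A's shift alone should split it at the two intervening midnights:
func(as.POSIXct("2019-06-01 09:00:00"), as.POSIXct("2019-06-03 14:00:00"))
#                  begin                 end
# 1: 2019-06-01 09:00:00 2019-06-02 00:00:00
# 2: 2019-06-02 00:00:00 2019-06-03 00:00:00
# 3: 2019-06-03 00:00:00 2019-06-03 14:00:00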
Lastly, we use it, using by = .(person) to preserve that column in the output. I use unique(DT) since we do not need (or even want) the duplicates for each shift/day:
unique(DT)[, rbindlist(Map(func, begin, end)), by = .(person)]
# person begin end
# <char> <POSc> <POSc>
# 1: A 2019-06-01 09:00:00 2019-06-02 00:00:00
# 2: A 2019-06-02 00:00:00 2019-06-03 00:00:00
# 3: A 2019-06-03 00:00:00 2019-06-03 14:00:00
# 4: B 2019-06-01 09:00:00 2019-06-02 00:00:00
# 5: B 2019-06-02 00:00:00 2019-06-02 05:00:00
# 6: C 2016-06-01 09:00:00 2016-06-01 23:15:00
Assuming you had a typo for row 5 person B (begin 2019 not 2016):
library(data.table)
library(lubridate)
DT <- data.table(begin = c("2019-06-01 09:00:00", "2019-06-01 09:00:00", "2019-06-01 09:00:00",
                           "2019-06-01 09:00:00", "2019-06-01 09:00:00", "2016-06-01 09:00:00"),
                 end = c("2019-06-03 14:00:00", "2019-06-03 14:00:00", "2019-06-03 14:00:00",
                         "2019-06-02 05:00:00", "2019-06-02 05:00:00", "2016-06-01 23:15:00"),
                 person = c("A", "A", "A", "B", "B", "C"))

DT[, `:=`(min = as.numeric(difftime(end, begin, units = "mins")),
          days = as.numeric(as_date(end) - as_date(begin) + 1))][, min_day := min / days]

unique(DT)
begin end person min days min_day
1: 2019-06-01 09:00:00 2019-06-03 14:00:00 A 3180 3 1060
2: 2019-06-01 09:00:00 2019-06-02 05:00:00 B 1200 2 600
3: 2016-06-01 09:00:00 2016-06-01 23:15:00 C 855 1 855

Add column in dataframe based on 3 columns from another dataframe using R

I have 2 dataframes which are as follows:
Dataframe 1: traffic_df which is hourly data.
Date_Time            Traffic
2020-03-09 06:00:00  10
2020-03-09 07:00:00  20
2020-03-10 07:00:00  20
2020-03-24 08:00:00  15
Dataframe 2: Alert.level
Start             End               Alert.level
10/03/2020 13:30  23/03/2020 13:30  2
23/03/2020 13:30  25/03/2020 23:59  3
I want to add a 3rd column to traffic_df which is the associated Alert.level if the Date_Time falls within the Start and End Date_Time of the Alert.level df so that the resulting dataframe will look like this:
Dataframe 1: traffic_df
Date_Time            Traffic  Alert.Level
2020-03-09 06:00:00  10
2020-03-09 07:00:00  20
2020-03-10 07:00:00  20       2
2020-03-24 08:00:00  15       3
Is there any way to do this without having to make a matching hourly dataframe and then using a join?
I'm thinking of somehow using the map function?
Code to produce the df:
traffic_df <- structure(list(Date_Time = c("2020-03-09 06:00:00", "2020-03-09 07:00:00",
                                           "2020-03-10 07:00:00", "2020-03-24 08:00:00"),
                             Traffic = c(10L, 20L, 20L, 15L)),
                        row.names = c(NA, -4L), class = "data.frame")
Alert.Level = data.frame(Start = c("10/03/2020 13:30", "23/03/2020 13:30"),
                         End = c("23/03/2020 13:30", "25/03/2020 23:59"),
                         Alert.level = c(2, 3))
You may try the fuzzyjoin package.
Data
library(dplyr)
library(lubridate)
traffic_df <- structure(list(Date_Time = c("2020-03-09 06:00:00", "2020-03-09 07:00:00",
                                           "2020-03-10 07:00:00", "2020-03-24 08:00:00"),
                             Traffic = c(10L, 20L, 20L, 15L)),
                        row.names = c(NA, -4L), class = "data.frame") %>%
  mutate(Date_Time = ymd_hms(Date_Time))
Alert.Level = data.frame(Start = c("10/03/2020 13:30", "23/03/2020 13:30"),
                         End = c("23/03/2020 13:30", "25/03/2020 23:59"),
                         Alert.level = c(2, 3)) %>%
  mutate(Start = dmy_hm(Start),  # the input has no seconds, so dmy_hm(), not dmy_hms()
         End = dmy_hm(End))
Code
library(fuzzyjoin)
traffic_df %>%
  fuzzy_left_join(Alert.Level,
                  match_fun = list(`>=`, `<=`),
                  by = list(x = c("Date_Time", "Date_Time"),
                            y = c("Start", "End"))) %>%
  select(-Start, -End)
Output
In contrast to your expected output above, row three is not matched, because 07:00 is before the 13:30 start of the first alert interval on that day.
Date_Time Traffic Alert.level
1 2020-03-09 06:00:00 10 NA
2 2020-03-09 07:00:00 20 NA
3 2020-03-10 07:00:00 20 NA
4 2020-03-24 08:00:00 15 3
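If the expected output above (where row three picks up level 2) is really what's wanted, the comparison has to happen at date granularity. A hedged sketch; the Date, StartDate, and EndDate helper columns are made up here:
traffic_df %>%
  mutate(Date = as.Date(Date_Time)) %>%
  fuzzy_left_join(Alert.Level %>%
                    mutate(StartDate = as.Date(Start),
                           EndDate = as.Date(End)),
                  match_fun = list(`>=`, `<=`),
                  by = c("Date" = "StartDate",
                         "Date" = "EndDate")) %>%
  select(Date_Time, Traffic, Alert.level)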
Here is a solution using sqldf. Note that I renamed the data.frame to have an underscore for convenience with SQL.
library(sqldf)
Alert_level <- Alert.Level
sqldf("SELECT * FROM traffic_df
LEFT JOIN Alert_level
ON traffic_df.Date_Time BETWEEN Alert_level.Start AND Alert_level.End")
Output
Date_Time Traffic Start End Alert.level
1 2020-03-09 06:00:00 10 <NA> <NA> NA
2 2020-03-09 07:00:00 20 <NA> <NA> NA
3 2020-03-10 07:00:00 20 <NA> <NA> NA
4 2020-03-24 08:00:00 15 2020-03-23 13:30:00 2020-03-25 23:59:00 3
I like outer approaches in such cases. First, define a vectorized function FUN that checks whether a specific x falls within an interval y. Put it in outer, which iterates each Date_Time over each Start/End interval of Alert.Level. This gives a matrix o that indicates which interval, if any, applies to each row (I use unname to avoid confusion). Then, in traffic_df, we create an NA column alert_lv (it should just have a different name than "Alert.Level"), subset it where rowSums(o) is positive, and put in the corresponding levels of Alert.Level.
FUN <- Vectorize(function(x, y) x >= y[1] & x < y[2])
(o <- unname(outer(traffic_df$Date_Time, Alert.Level[-3], FUN)))
#       [,1]  [,2]
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,]  TRUE FALSE
# [4,] FALSE  TRUE
w <- unlist(apply(o, 1, which))
traffic_df <- within(traffic_df, {
  alert_lv <- NA
  alert_lv[rowSums(o) > 0] <- Alert.Level[w, 3]
})
traffic_df
# Date_Time Traffic alert_lv
# 1 2020-03-09 06:00:00 10 NA
# 2 2020-03-09 07:00:00 20 NA
# 3 2020-03-10 07:00:00 20 2
# 4 2020-03-24 08:00:00 15 3
Note: to use this solution you first need the usual POSIX date-time classes, so you should first do the following (wrapping the strptime() calls in as.POSIXct() would avoid POSIXlt list columns in the data.frame):
traffic_df$Date_Time <- as.POSIXct(traffic_df$Date_Time)
Alert.Level[1:2] <- lapply(Alert.Level[1:2], strptime, format='%d/%m/%Y %H:%M')

Convert start time and total duration to elapsed time per hour

I have data on start time ('startTime', a date-time variable, POSIXct) and duration in minutes ('duration_minutes'):
df <- data.frame(id = c(1, 2, 3),
                 startTime = as.POSIXct(c("2018-01-01 12:15:31",
                                          "2018-01-02 23:43:00",
                                          "2018-01-03 11:00:11")),
                 duration_minutes = c(315, 120, 45))
I want to convert the start time and duration to elapsed time per hour, for each hour, from the hour of the start time to the last hour at the end of the duration:
df_result <- data.frame(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3),
                        startTime = c("2018-01-01 12:15:31", "2018-01-01 13:00:00",
                                      "2018-01-01 14:00:00", "2018-01-01 15:00:00",
                                      "2018-01-01 16:00:00", "2018-01-01 17:00:00",
                                      "2018-01-02 23:43:00", "2018-01-03 00:00:00",
                                      "2018-01-03 01:00:00", "2018-01-03 11:00:11"),
                        duration_minutes = c(44.48, 60, 60, 60, 60, 30.5, 17, 60, 43, 45))
Please advise on a possible solution.
Another possibility:
library(data.table)
library(lubridate)
setDT(df)
df[ , ceil_start := ceiling_date(start, "hour", change_on_boundary = TRUE)]
df[ , {
  if (difftime(ceil_start, start, units = "min") > dur) {
    .SD[, .(start, dur)]
  } else {
    end <- start + dur * 60
    time <- c(start,
              seq(from = ceil_start,
                  to = floor_date(end, "hour"),
                  by = "hour"),
              end)
    .(start = head(time, -1), dur = `units<-`(diff(time), "mins"))
  }
}, by = id]
# id start dur
# 1: 1 2018-01-01 12:15:31 44.48333 mins
# 2: 1 2018-01-01 13:00:00 60.00000 mins
# 3: 1 2018-01-01 14:00:00 60.00000 mins
# 4: 1 2018-01-01 15:00:00 60.00000 mins
# 5: 1 2018-01-01 16:00:00 60.00000 mins
# 6: 1 2018-01-01 17:00:00 30.51667 mins
# 7: 2 2018-01-02 23:43:00 17.00000 mins
# 8: 2 2018-01-03 00:00:00 60.00000 mins
# 9: 2 2018-01-03 01:00:00 43.00000 mins
# 10: 3 2018-01-03 11:00:11 45.00000 mins
# 11: 4 2018-01-03 11:35:00 25.00000 mins
# 12: 4 2018-01-03 12:00:00 10.00000 mins
# 13: 5 2018-01-03 00:00:00 60.00000 mins
# 14: 5 2018-01-03 01:00:00 0.00000 mins
Explanation
Convert data.frame to data.table (setDT). Round up start times to nearest hour (ceiling_date(start, "hour", ...). Use change_on_boundary = TRUE for easier handling of times without minutes and seconds (not in the data, but tested).
To handle cases when the end time (start + duration) is in the same hour as the start time (e.g. id = 3), check if the difference between the rounded time and the start time is larger than the duration (if(difftime(ceil_start, start, units = "min") > dur)). If so, just select the start and duration columns (.SD[ , .(start, dur)]).
For other cases (else), calculate the end time: end <- start + dur * 60. Create a sequence from the up-rounded start time ('ceil_start') to the down-rounded end time, with an hourly increment (seq(from = ceil_start, to = floor_date(end, "hour"), by = "hour")). Concatenate with the 'start' and 'end' times. Return all times except the last (head(time, -1)) and calculate the difference between time steps in minutes (`units<-`(diff(time), "mins")).
For times with H:M:S = 00:00:00 and duration is a multiple of 60 min, like id = 5, the current solution gives a row with a duration of 0 minutes for the last hour. While waiting for a more elegant solution, a quick and dirty way is just to delete such rows with duration = 0.
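A quick sketch of that cleanup, assuming the result of the j-expression above has been stored in a variable res (hypothetical name):
# drop the spurious zero-length last-hour rows
res <- res[dur > 0]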
Data
Please note that I have added a case not included in original data, id = 4 (see also my comment above) and id = 5.
df <- data.frame(id = 1:5,
                 start = as.POSIXct(c("2018-01-01 12:15:31",
                                      "2018-01-02 23:43:00",
                                      "2018-01-03 11:00:11",
                                      "2018-01-03 11:35:00",
                                      "2018-01-03 00:00:00")),
                 dur = c(315, 120, 45, 35, 60))
Try this:
library(data.table)
library(lubridate)
library(magrittr)
df <-
  setDT(df)[, start_ceiling := ceiling_date(startTime, "hour", change_on_boundary = TRUE)] %>%
  .[, `:=`(
      reps = ifelse(
        startTime + (duration_minutes * 60) <= start_ceiling,
        1,
        pmax(2, floor(duration_minutes / 60) + 1)
      ),
      initial_diff = as.numeric(difftime(start_ceiling[1], startTime[1], units = "mins"))
    ), by = id] %>%
  .[, df[df[, rep(.I, reps)]]] %>%
  .[, startTime := pmax(startTime, floor_date(startTime, "hour") + hours(0:(.N - 1))), by = id] %>%
  .[reps > 1, duration_minutes := c(initial_diff[.N],
                                    rep(60, reps[.N] - 2),
                                    (duration_minutes[.N] - initial_diff[.N]) %% 60), by = id] %>%
  .[!(duration_minutes == 0 & reps > 1), ] %>%
  .[, c("reps", "start_ceiling", "initial_diff") := NULL]
I've tested this with all the scenarios we've gathered so far, and this is the output:
id startTime duration_minutes
1: 1 2018-01-01 12:15:31 44.48333
2: 1 2018-01-01 13:00:00 60.00000
3: 1 2018-01-01 14:00:00 60.00000
4: 1 2018-01-01 15:00:00 60.00000
5: 1 2018-01-01 16:00:00 60.00000
6: 1 2018-01-01 17:00:00 30.51667
7: 2 2018-01-02 23:43:00 17.00000
8: 2 2018-01-03 00:00:00 60.00000
9: 2 2018-01-03 01:00:00 43.00000
10: 3 2018-01-03 11:00:11 45.00000
11: 4 2018-01-04 10:00:00 60.00000
12: 4 2018-01-04 11:00:00 5.00000
13: 5 2018-01-05 00:00:00 60.00000
14: 6 2018-01-06 11:35:00 25.00000
15: 6 2018-01-06 12:00:00 10.00000
16: 7 2018-01-07 00:00:00 60.00000
17: 7 2018-01-07 01:00:00 60.00000
Data used:
df <- data.frame(
id = c(1, 2, 3, 4, 5, 6, 7),
startTime = as.POSIXct(
c(
"2018-01-01 12:15:31",
"2018-01-02 23:43:00",
"2018-01-03 11:00:11",
"2018-01-04 10:00:00",
"2018-01-05 00:00:00",
"2018-01-06 11:35:00",
"2018-01-07 00:00:00"
)
),
duration_minutes = c(315, 120, 45, 65, 60, 35, 120)
)
df
id startTime duration_minutes
1 1 2018-01-01 12:15:31 315
2 2 2018-01-02 23:43:00 120
3 3 2018-01-03 11:00:11 45
4 4 2018-01-04 10:00:00 65
5 5 2018-01-05 00:00:00 60
6 6 2018-01-06 11:35:00 35
7 7 2018-01-07 00:00:00 120

In R, how do I create a time histogram of intervals defined by a start and stop time for each entry?

I have a dataframe in which each row is the working hours of an employee defined by a start and a stop time:
> DF
  EmployeeNum      Start_datetime        End_datetime
1         123 2012-02-01 07:30:00 2012-02-01 17:45:00
2         342 2012-02-01 08:00:00 2012-02-01 17:45:00
3         876 2012-02-01 10:45:00 2012-02-01 18:45:00
I'd like to find the number of employees working during each hour on each day in a timespan:
Date Hour NumberofEmployeesWorking
2012-02-01 00:00 ? (number of employees working between 00:00 and 00:59)
2012-02-01 01:00 ?
2012-02-01 02:00 ?
2012-02-01 03:00 ?
2012-02-01 04:00 ?
2012-02-01 05:00 ?
2012-02-01 06:00 ?
How do I put my working hours into bins like this?
Your data, in a more consumable format, plus one row spanning midnight (for example). I changed the format to include a "T" here to make consumption easier; otherwise the embedded space makes it less trivial to read with read.table(text='...'). (You can skip this since you already have your real data.)
x <- read.table(text = 'EmployeeNum Start_datetime End_datetime
123 2012-02-01T07:30:00 2012-02-01T17:45:00
342 2012-02-01T08:00:00 2012-02-01T17:45:00
876 2012-02-01T10:45:00 2012-02-01T18:45:00
877 2012-02-01T22:45:00 2012-02-02T05:45:00',
                header = TRUE, stringsAsFactors = FALSE)
In case you haven't done it with your own data, convert all times to POSIXt, otherwise skip this, too.
x[c('Start_datetime', 'End_datetime')] <- lapply(x[c('Start_datetime', 'End_datetime')],
                                                 as.POSIXct, format = '%Y-%m-%dT%H:%M:%S')
We need to generate a sequence of hourly timestamps:
startdate <- trunc(min(x$Start_datetime), units = "hours")
enddate <- round(max(x$End_datetime), units = "hours")
c(startdate, enddate)
# [1] "2012-02-01 07:00:00 PST" "2012-02-02 06:00:00 PST"
timestamps <- seq(startdate, enddate, by = "hour")
head(timestamps)
# [1] "2012-02-01 07:00:00 PST" "2012-02-01 08:00:00 PST" "2012-02-01 09:00:00 PST"
# [4] "2012-02-01 10:00:00 PST" "2012-02-01 11:00:00 PST" "2012-02-01 12:00:00 PST"
(Assumption: all end timestamps are after their start timestamps ...)
Now it's just a matter of tallying:
counts <- mapply(function(st, en) sum(st <= x$End_datetime & x$Start_datetime <= en),
                 timestamps[-length(timestamps)], timestamps[-1])
data.frame(
  start = timestamps[-length(timestamps)],
  count = counts
)
# start count
# 1 2012-02-01 07:00:00 2
# 2 2012-02-01 08:00:00 2
# 3 2012-02-01 09:00:00 2
# 4 2012-02-01 10:00:00 3
# 5 2012-02-01 11:00:00 3
# 6 2012-02-01 12:00:00 3
# 7 2012-02-01 13:00:00 3
# 8 2012-02-01 14:00:00 3
# 9 2012-02-01 15:00:00 3
# 10 2012-02-01 16:00:00 3
# 11 2012-02-01 17:00:00 3
# 12 2012-02-01 18:00:00 1
# 13 2012-02-01 19:00:00 0
# 14 2012-02-01 20:00:00 0
# 15 2012-02-01 21:00:00 0
# 16 2012-02-01 22:00:00 1
# 17 2012-02-01 23:00:00 1
# 18 2012-02-02 00:00:00 1
# 19 2012-02-02 01:00:00 1
# 20 2012-02-02 02:00:00 1
# 21 2012-02-02 03:00:00 1
# 22 2012-02-02 04:00:00 1
# 23 2012-02-02 05:00:00 1
I did not see @r2evans's answer before posting. I came up with this independently, though it looks similar. I post it here in case it is helpful. Feel free to accept @r2evans's answer.
Data:
df1 <- read.table(text="EmployeeNum Start_datetime End_datetime
123 '2012-02-01 07:30:00' '2012-02-01 17:45:00'
342 '2012-02-01 08:00:00' '2012-02-01 17:45:00'
876 '2012-02-01 10:45:00' '2012-02-01 18:45:00'", header = TRUE )
df1 <- within(df1, Start_datetime <- as.POSIXct( Start_datetime))
df1 <- within(df1, End_datetime <- as.POSIXct( End_datetime))
Code:
Find the hourly datetime sequence for each employee and count the number of employees by Start_datetime.
Note that this code assumes you have already separated the original data into single days before applying it. If your data has multiple days mixed together, the IDateTime() function from the data.table package makes it possible to separate the days from the times and group by them while building the datetime sequence, as sketched below.
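A minimal sketch of that IDateTime() idea (with a hypothetical column ts), splitting a POSIXct vector into date and time-of-day parts that can be grouped on:
library(data.table)
dt <- data.table(ts = as.POSIXct(c("2012-02-01 07:30:00", "2012-02-02 08:00:00")))
dt[, c("idate", "itime") := IDateTime(ts)]  # adds an IDate and an ITime column
dt[, .N, by = idate]                        # e.g. events per day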
library('data.table')
setDT(df1)  # assign data.table class by reference
df2 <- df1[, Map(f = function(x, y) seq(from = trunc(x, "hour"),
                                        to = round(y, "hour"),
                                        by = "1 hour"),
                 x = Start_datetime, y = End_datetime),
           by = EmployeeNum]
colnames(df2)[colnames(df2) == "V1"] <- "Start_datetime" # for some reason I can't assign the column name properly during the column creation step.
Output:
df2[, .N, by = .( Start_datetime, End_datetime = Start_datetime + 3599 ) ]
# Start_datetime End_datetime N
# 1: 2012-02-01 07:00:00 2012-02-01 07:59:59 1
# 2: 2012-02-01 08:00:00 2012-02-01 08:59:59 2
# 3: 2012-02-01 09:00:00 2012-02-01 09:59:59 2
# 4: 2012-02-01 10:00:00 2012-02-01 10:59:59 3
# 5: 2012-02-01 11:00:00 2012-02-01 11:59:59 3
# 6: 2012-02-01 12:00:00 2012-02-01 12:59:59 3
# 7: 2012-02-01 13:00:00 2012-02-01 13:59:59 3
# 8: 2012-02-01 14:00:00 2012-02-01 14:59:59 3
# 9: 2012-02-01 15:00:00 2012-02-01 15:59:59 3
# 10: 2012-02-01 16:00:00 2012-02-01 16:59:59 3
# 11: 2012-02-01 17:00:00 2012-02-01 17:59:59 3
# 12: 2012-02-01 18:00:00 2012-02-01 18:59:59 3
# 13: 2012-02-01 19:00:00 2012-02-01 19:59:59 1
Graph:
binwidth = 3600: the value indicates 1 hour = 60 min * 60 sec = 3600 seconds.
library('ggplot2')
ggplot( data = df2,
mapping = aes( x = Start_datetime ) ) +
geom_histogram(binwidth = 3600, color = "red", fill = "white" ) +
scale_x_datetime( date_breaks = "1 hour", date_labels = "%H:%M" ) +
ylab("Number of Employees") +
xlab( "Working Hours: 2012-02-01" ) +
theme( axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank(),
panel.background = element_rect( fill = "white", color = "black") )
Thank you both for your answers. I came up with a solution which is pretty similar to yours, but I was wondering if you could have a look and let me know what you think of it.
I started with a new empty dataframe, and then made two nested loops to look at each start and end time in each row and generate a sequence of the hours in between. Then I appended each hour in the sequence to the new empty dataframe. This way, I can simply do a count later.
staffDetailHours <- data.frame("personnelNum" = integer(0),
                               "workDate" = character(0),
                               "Hour" = integer(0))
for (i in 1:dim(DF)[1]) {
  hoursList <- seq(as.POSIXlt(DF[i, ]$START)$hour,
                   as.POSIXlt(DF[i, ]$END)$hour)
  for (j in 1:length(hoursList)) {
    staffDetailHours[nrow(staffDetailHours) + 1, ] = list(
      DF[i, ]$EmployeeNum,
      DF[i, ]$Date,
      hoursList[j]
    )
  }
}
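One observation on this approach: growing a data.frame one row at a time copies it on every iteration, so it gets slow for large inputs. A hedged, loop-free sketch of the same idea, using the same column names and assuming the START, END, and Date columns from your real data:
staffDetailHours <- do.call(rbind, Map(function(emp, d, st, en) {
  data.frame(personnelNum = emp,
             workDate = d,
             Hour = seq(as.POSIXlt(st)$hour, as.POSIXlt(en)$hour))
}, DF$EmployeeNum, DF$Date, DF$START, DF$END))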

Index of next occurring record

I have a sample dataset of the trajectory of one bike. My objective is to figure out, on average, the amount of time that elapses between visits to station B.
So far, I have been able to simply order the dataset with:
test[order(test$starttime, decreasing = FALSE),]
and find the row index of where start_station and end_station equal B.
which(test$start_station == 'B')
which(test$end_station == 'B')
The next part is where I run into trouble. In order to calculate the time that elapses between visits to station B, we must take the difftime() between a record where start_station = "B" (the bike leaves) and the next occurring record where end_station = "B", even if both happen to be in the same row (see row 6).
Using the dataset below, we know that the bike spent 510 minutes between 7:30:00 and 16:00:00 outside of Station B, 30 minutes between 18:00:00 and 18:30:00 outside of Station B, and 210 minutes between 19:00:00 and 22:30:00 outside of Station B, which averages to 250 minutes.
How would one reproduce this output in R using difftime()?
> test
bikeid start_station starttime end_station endtime
1 1 A 2017-09-25 01:00:00 B 2017-09-25 01:30:00
2 1 B 2017-09-25 07:30:00 C 2017-09-25 08:00:00
3 1 C 2017-09-25 10:00:00 A 2017-09-25 10:30:00
4 1 A 2017-09-25 13:00:00 C 2017-09-25 13:30:00
5 1 C 2017-09-25 15:30:00 B 2017-09-25 16:00:00
6 1 B 2017-09-25 18:00:00 B 2017-09-25 18:30:00
7 1 B 2017-09-25 19:00:00 A 2017-09-25 19:30:00
8 1 А 2017-09-25 20:00:00 C 2017-09-25 20:30:00
9 1 C 2017-09-25 22:00:00 B 2017-09-25 22:30:00
10 1 B 2017-09-25 23:00:00 C 2017-09-25 23:30:00
Here is the sample data:
> dput(test)
structure(list(bikeid = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), start_station = c("A",
"B", "C", "A", "C", "B", "B", "А", "C", "B"), starttime = structure(c(1506315600,
1506339000, 1506348000, 1506358800, 1506367800, 1506376800, 1506380400,
1506384000, 1506391200, 1506394800), class = c("POSIXct", "POSIXt"
), tzone = ""), end_station = c("B", "C", "A", "C", "B", "B",
"A", "C", "B", "C"), endtime = structure(c(1506317400, 1506340800,
1506349800, 1506360600, 1506369600, 1506378600, 1506382200, 1506385800,
1506393000, 1506396600), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("bikeid",
"start_station", "starttime", "end_station", "endtime"), row.names = c(NA,
-10L), class = "data.frame")
This will calculate the differences as asked, in the order they occur, but does not append them to the data.frame:
lapply(df1$starttime[df1$start_station == "B"],
       function(x, et) difftime(et[x < et][1], x, units = "mins"),
       et = df1$endtime[df1$end_station == "B"])
[[1]]
Time difference of 510 mins
[[2]]
Time difference of 30 mins
[[3]]
Time difference of 210 mins
[[4]]
Time difference of NA mins
To calculate the average time:
v1 <- sapply(df1$starttime[df1$start_station == "B"],
             function(x, et) difftime(et[x < et][1], x, units = "mins"),
             et = df1$endtime[df1$end_station == "B"])
mean(v1, na.rm = TRUE)
[1] 250
Another possibility:
library(data.table)
d <- setDT(test)[, {
  start = starttime[start_station == "B"]
  end = endtime[end_station == "B"]
  .(start = start, end = end, duration = difftime(end, start, units = "min"))
}, by = .(trip = cumsum(start_station == "B"))]
d
# trip start end duration
# 1: 0 <NA> 2017-09-25 01:30:00 NA mins
# 2: 1 2017-09-25 07:30:00 2017-09-25 16:00:00 510 mins
# 3: 2 2017-09-25 18:00:00 2017-09-25 18:30:00 30 mins
# 4: 3 2017-09-25 19:00:00 2017-09-25 22:30:00 210 mins
# 5: 4 2017-09-25 23:00:00 <NA> NA mins
d[ , mean(duration, na.rm = TRUE)]
# Time difference of 250 mins
# or
d[ , mean(as.integer(duration), na.rm = TRUE)]
# [1] 250
The data is grouped by a counter which increases by 1 each time the bike starts from "B" (by = .(trip = cumsum(start_station == "B"))).
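To see the counter itself: it bumps at each departure from "B", so all rows up to and including the next "B" departure share a group.
cumsum(test$start_station == "B")
# [1] 0 1 1 1 1 2 3 3 3 4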
