I have 2 dataframes which are as follows:
Dataframe 1: traffic_df, which is hourly data.

Date_Time            Traffic
2020-03-09 06:00:00  10
2020-03-09 07:00:00  20
2020-03-10 07:00:00  20
2020-03-24 08:00:00  15
Dataframe 2: Alert.level

Start             End               Alert.level
10/03/2020 13:30  23/03/2020 13:30  2
23/03/2020 13:30  25/03/2020 23:59  3
I want to add a 3rd column to traffic_df which is the associated Alert.level if the Date_Time falls within the Start and End Date_Time of the Alert.level df so that the resulting dataframe will look like this:
Dataframe 1: traffic_df

Date_Time            Traffic  Alert.Level
2020-03-09 06:00:00  10
2020-03-09 07:00:00  20
2020-03-10 07:00:00  20       2
2020-03-24 08:00:00  15       3
Is there any way to do this without having to make a matching hourly dataframe and then using a join?
I'm thinking of somehow using the map function.
Code to produce the df:
traffic_df <- structure(list(Date_Time = c("2020-03-09 06:00:00", "2020-03-09 07:00:00", "2020-03-10 07:00:00",
"2020-03-24 08:00:00"), Traffic = c(10L, 20L, 20L, 15L)),
row.names = c(NA, -4L), class = "data.frame")
Alert.Level = data.frame(Start = c("10/03/2020 13:30", "23/03/2020 13:30"),
End = c("23/03/2020 13:30", "25/03/2020 23:59"),
Alert.level = c(2, 3))
You may try the fuzzyjoin package.
Data
library(dplyr)      # provides %>% and mutate()
library(lubridate)
traffic_df <- structure(list(Date_Time = c("2020-03-09 06:00:00", "2020-03-09 07:00:00", "2020-03-10 07:00:00",
"2020-03-24 08:00:00"), Traffic = c(10L, 20L, 20L, 15L)),
row.names = c(NA, -4L), class = "data.frame") %>%
mutate(Date_Time = ymd_hms(Date_Time))
Alert.Level = data.frame(Start = c("10/03/2020 13:30", "23/03/2020 13:30"),
End = c("23/03/2020 13:30", "25/03/2020 23:59"),
Alert.level = c(2, 3)) %>%
mutate(Start = dmy_hm(Start),   # the strings have no seconds, so dmy_hm() rather than dmy_hms()
       End = dmy_hm(End))
Code
library(fuzzyjoin)
traffic_df %>%
fuzzy_left_join(Alert.Level,
match_fun = list(`>=`, `<=`),
by = list(x = c("Date_Time",
"Date_Time"),
y = c("Start",
"End"))) %>%
select(-Start, -End)
Output
In contrast to your expected output above, row three is not matched, because 7:00 o'clock is before the starting time of 13:30.
Date_Time Traffic Alert.level
1 2020-03-09 06:00:00 10 NA
2 2020-03-09 07:00:00 20 NA
3 2020-03-10 07:00:00 20 NA
4 2020-03-24 08:00:00 15 3
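For comparison, the same interval check can be sketched in base R without any join package. This is only an illustration, not part of the answer above: lookup_level is a hypothetical helper, the time zone is assumed to be UTC, and intervals are treated as half-open (Start inclusive, End exclusive) so that a timestamp equal to 23/03/2020 13:30 cannot match two alert levels.

```r
# Data from the question, parsed to POSIXct (UTC is an assumption)
traffic_df <- data.frame(
  Date_Time = as.POSIXct(c("2020-03-09 06:00:00", "2020-03-09 07:00:00",
                           "2020-03-10 07:00:00", "2020-03-24 08:00:00"),
                         tz = "UTC"),
  Traffic = c(10L, 20L, 20L, 15L))

Alert.Level <- data.frame(
  Start = as.POSIXct(c("10/03/2020 13:30", "23/03/2020 13:30"),
                     format = "%d/%m/%Y %H:%M", tz = "UTC"),
  End   = as.POSIXct(c("23/03/2020 13:30", "25/03/2020 23:59"),
                     format = "%d/%m/%Y %H:%M", tz = "UTC"),
  Alert.level = c(2, 3))

# Hypothetical helper: level of the first interval containing t, or NA
lookup_level <- function(t) {
  hit <- which(t >= Alert.Level$Start & t < Alert.Level$End)
  if (length(hit) == 0) NA_real_ else Alert.Level$Alert.level[hit[1]]
}

# Loop over row indices so the POSIXct class is kept for each element
traffic_df$Alert.level <- vapply(
  seq_len(nrow(traffic_df)),
  function(i) lookup_level(traffic_df$Date_Time[i]),
  numeric(1))

traffic_df
```

With this data only the last row gets a level (3), which agrees with the fuzzyjoin result: 2020-03-10 07:00 is before the first interval starts at 13:30.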
Here is a solution using sqldf. Note that I renamed the data.frame to have an underscore for convenience with SQL.
library(sqldf)
Alert_level <- Alert.Level  # the data frame created above is Alert.Level (capital L)
sqldf("SELECT * FROM traffic_df
LEFT JOIN Alert_level
ON traffic_df.Date_Time BETWEEN Alert_level.Start AND Alert_level.End")
Output
Date_Time Traffic Start End Alert.level
1 2020-03-09 06:00:00 10 <NA> <NA> NA
2 2020-03-09 07:00:00 20 <NA> <NA> NA
3 2020-03-10 07:00:00 20 <NA> <NA> NA
4 2020-03-24 08:00:00 15 2020-03-23 13:30:00 2020-03-25 23:59:00 3
I like outer approaches in such cases. First, define a Vectorized function FUN that checks whether a given x lies within an interval y. Pass it to outer, which compares each Date_Time with each start/end interval of Alert.Level. This gives a matrix o that shows which interval applies (I use unname to avoid confusion). Then, in traffic_df, we create an NA column alert_lv (it just needs a name other than "Alert.Level"), subset it to the rows with at least one match, and fill in the corresponding levels from Alert.Level.
FUN <- Vectorize(function(x, y) x >= y[1] & x < y[2])
(o <- unname(outer(traffic_df$Date_Time, Alert.Level[-3], FUN)))
# [,1] [,2] [,3] [,4]
# [1,] FALSE FALSE TRUE FALSE
# [2,] FALSE FALSE FALSE TRUE
w <- unlist(apply(o, 1, which))
traffic_df <- within(traffic_df, {
alert_lv <- NA
alert_lv[rowSums(o) > 0] <- Alert.Level[w, 3]
})
traffic_df
# Date_Time Traffic alert_lv
# 1 2020-03-09 06:00:00 10 NA
# 2 2020-03-09 07:00:00 20 NA
# 3 2020-03-10 07:00:00 20 2
# 4 2020-03-24 08:00:00 15 3
Note: To use this solution you first need the usual 'POSIXct' formats, so first you should do
traffic_df$Date_Time <- as.POSIXct(traffic_df$Date_Time)
Alert.Level[1:2] <- lapply(Alert.Level[1:2], strptime, format='%d/%m/%Y %H:%M')
I have a dataframe df1 that summarizes different observations of an individual ID over time, rounded to fixed 45-minute intervals starting at 00:00:00 (00:00:00, 00:45:00, etc.). As an example:
df1<- data.frame(DateTime45=c("2017-07-09 00:00:00","2017-07-09 00:45:00","2017-07-09 02:15:00","2017-07-09 03:45:00"),
ID=c("A","A","A","A"),
VariableX=c(0,2,0,4))
df1
DateTime45 ID VariableX
1 2017-07-09 00:00:00 A 0
2 2017-07-09 00:45:00 A 2
3 2017-07-09 02:15:00 A 0
4 2017-07-09 03:45:00 A 4
I have another dataframe df2 with other info (vedba) about this individual, also over time, but in this case not aligned to 45-minute intervals. As an example:
df2<- data.frame(DateTime= c("2017-07-08 23:40:57.245","2017-07-08 23:58:12.945","2017-07-09 00:01:00.345","2017-07-09 00:07:12.845","2017-07-09 00:28:34.845","2017-07-09 00:31:46.567","2017-07-09 00:53:21.345","2017-07-09 01:01:34.545","2017-07-09 01:09:12.246","2017-07-09 01:23:12.321","2017-07-09 01:34:26.687","2017-07-09 01:57:08.687","2017-07-09 02:05:23.789","2017-07-09 02:32:24.789","2017-07-09 02:42:34.536","2017-07-09 02:59:00.098","2017-07-09 03:03:01.434","2017-07-09 03:11:38.987","2017-07-09 03:23:31.345","2017-07-09 03:28:21.345","2017-07-09 03:42:53.345"),
ID=c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),
vedba=c(1.87,2.3,0.3,0.67,1.3,2.1,3.6,0.1,0.8,1.3,2.4,1.5,1.23,2.02,1.89,0.78,1.11,2.13,1.20,0.34,0.94))
df2$DateTime<- as.POSIXct(df2$DateTime, format="%Y-%m-%d %H:%M:%OS",tz="UTC")
df2
DateTime ID vedba
1 2017-07-08 23:40:57.244 A 1.87
2 2017-07-08 23:58:12.944 A 2.30
3 2017-07-09 00:01:00.345 A 0.30
4 2017-07-09 00:07:12.845 A 0.67
. . . . .
. . . . .
For each row in df1, I want to calculate the mean vedba using values from df2. The key is that, for each time in df1, the window extends from 22 minutes 30 seconds before to 22 minutes 30 seconds after it (that is, df1$DateTime45 is the central value of the range). For instance, the time range for df1[1,1] (2017-07-09 00:00:00) is between 2017-07-08 23:37:30 and 2017-07-09 00:22:30.
In this example, I would expect to get this:
df3
DateTime45 ID VariableX meanVedba n_vedba
1 2017-07-09 00:00:00 A 0 1.2850000 4
2 2017-07-09 00:45:00 A 2 1.7750000 4
3 2017-07-09 02:15:00 A 0 1.5833333 3
4 2017-07-09 03:45:00 A 4 0.8266667 3
*Note: I include an n_vedba variable to check if the code is taking the right number of rows from df2.
My try was this code:
setDT(df1)[, DateTime45 := ymd_hms(DateTime45)]
setDT(df2)[, dt_floor := round_date(ymd_hms(DateTime), unit = "45 mins")]
df3<- df2[df1, .(meanVedba = mean(vedba),
n_vedba=.N),
on = .(ID, dt_floor = DateTime45), by = .EACHI]
df3
ID dt_floor meanVedba n_vedba
1: A 2017-07-09 00:00:00 0.4850000 2
2: A 2017-07-09 00:45:00 2.3333333 3
3: A 2017-07-09 02:15:00 NA 0
4: A 2017-07-09 03:45:00 0.8266667 3
However, as you can see, I don't get what I would expect.
Does anyone know why and how to change the code in order to accomplish what I want?
Extra comment
When I have hour-intervals instead of 45-minutes intervals the code I showed works.
I create the dataframes
df1<- data.frame(DateTime=c("2017-07-09 00:00:00","2017-07-09 01:00:00","2017-07-09 02:00:00","2017-07-09 03:00:00","2017-07-09 04:00:00"),
ID=c("A","A","A","A","A"),
VariableX=c(0,2,0,4,7))
df1$DateTime<- as.POSIXct(df1$DateTime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
df1
DateTime ID VariableX
1 2017-07-09 00:00:00 A 0
2 2017-07-09 01:00:00 A 2
3 2017-07-09 02:00:00 A 0
4 2017-07-09 03:00:00 A 4
5 2017-07-09 04:00:00 A 7
df2<- data.frame(DateTime= c("2017-07-08 23:40:57.245","2017-07-08 23:58:12.945","2017-07-09 00:01:00.345","2017-07-09 00:07:12.845","2017-07-09 00:28:34.845","2017-07-09 00:31:46.567","2017-07-09 00:53:21.345","2017-07-09 01:01:34.545","2017-07-09 01:09:12.246","2017-07-09 01:23:12.321","2017-07-09 01:34:26.687","2017-07-09 01:57:08.687","2017-07-09 02:05:23.789","2017-07-09 02:32:24.789","2017-07-09 02:42:34.536","2017-07-09 02:59:00.098","2017-07-09 03:03:01.434","2017-07-09 03:11:38.987","2017-07-09 03:23:31.345","2017-07-09 03:28:21.345","2017-07-09 03:42:53.345"),
ID=c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),
vedba=c(1.87,2.3,0.3,0.67,1.3,2.1,3.6,0.1,0.8,1.3,2.4,1.5,1.23,2.02,1.89,0.78,1.11,2.13,1.20,0.34,0.94))
df2$DateTime<- as.POSIXct(df2$DateTime, format="%Y-%m-%d %H:%M:%OS",tz="UTC")
df2
DateTime ID vedba dt_floor
1: 2017-07-08 23:40:57 A 1.87 2017-07-09 00:00:00
2: 2017-07-08 23:58:12 A 2.30 2017-07-09 00:00:00
3: 2017-07-09 00:01:00 A 0.30 2017-07-09 00:00:00
4: 2017-07-09 00:07:12 A 0.67 2017-07-09 00:00:00
. . . . .
. . . . .
I calculate vedba for hourly-bin intervals
setDT(df1)[, DateTime45 := ymd_hms(DateTime)]
setDT(df2)[, dt_floor := round_date(ymd_hms(DateTime), unit = "hour")]
df3<- df2[df1, .(meanVedba = mean(vedba),
n_vedba=.N),
on = .(ID, dt_floor = DateTime), by = .EACHI]
df3
ID dt_floor meanVedba n_vedba
1: A 2017-07-09 00:00:00 1.288000 5
2: A 2017-07-09 01:00:00 1.580000 5
3: A 2017-07-09 02:00:00 1.710000 3
4: A 2017-07-09 03:00:00 1.352857 7
5: A 2017-07-09 04:00:00 0.940000 1
You need a non-equi join:
library(data.table)
library(lubridate)
df1<- data.frame(DateTime=c("2017-07-09 00:00:00","2017-07-09 00:45:00","2017-07-09 02:15:00","2017-07-09 03:45:00"),
ID=c("A","A","A","A"),
VariableX=c(0,2,0,4))
df1$DateTime<- as.POSIXct(df1$DateTime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
df2<- data.frame(DateTime= c("2017-07-08 23:40:57.245","2017-07-08 23:58:12.945","2017-07-09 00:01:00.345","2017-07-09 00:07:12.845","2017-07-09 00:28:34.845","2017-07-09 00:31:46.567","2017-07-09 00:53:21.345","2017-07-09 01:01:34.545","2017-07-09 01:09:12.246","2017-07-09 01:23:12.321","2017-07-09 01:34:26.687","2017-07-09 01:57:08.687","2017-07-09 02:05:23.789","2017-07-09 02:32:24.789","2017-07-09 02:42:34.536","2017-07-09 02:59:00.098","2017-07-09 03:03:01.434","2017-07-09 03:11:38.987","2017-07-09 03:23:31.345","2017-07-09 03:28:21.345","2017-07-09 03:42:53.345"),
ID=c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),
vedba=c(1.87,2.3,0.3,0.67,1.3,2.1,3.6,0.1,0.8,1.3,2.4,1.5,1.23,2.02,1.89,0.78,1.11,2.13,1.20,0.34,0.94))
df2$DateTime<- as.POSIXct(df2$DateTime, format="%Y-%m-%d %H:%M:%OS",tz="UTC")
setDT(df1)
setDT(df2)
df1[, date_lo := DateTime - minutes(22) - seconds(30)]
df1[, date_hi := DateTime + minutes(22) + seconds(30)]
df2[df1, .(mean = mean(vedba),
           N = .N), on = .(ID, DateTime <= date_hi, DateTime >= date_lo), by = .EACHI]
ID DateTime DateTime mean N
1: A 2017-07-09 00:22:30 2017-07-08 23:37:30 1.2850000 4
2: A 2017-07-09 01:07:30 2017-07-09 00:22:30 1.7750000 4
3: A 2017-07-09 02:37:30 2017-07-09 01:52:30 1.5833333 3
4: A 2017-07-09 04:07:30 2017-07-09 03:22:30 0.8266667 3
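As a quick sanity check of the first row, the window logic can be reproduced by hand in base R. This is only a sketch: it uses the first six rows of df2 from the question with the fractional seconds dropped for brevity, and the names centre, half_window and in_window are made up for the illustration.

```r
# First six observations of df2 (fractional seconds omitted as a simplification)
times <- as.POSIXct(c("2017-07-08 23:40:57", "2017-07-08 23:58:12",
                      "2017-07-09 00:01:00", "2017-07-09 00:07:12",
                      "2017-07-09 00:28:34", "2017-07-09 00:31:46"),
                    tz = "UTC")
vedba <- c(1.87, 2.3, 0.3, 0.67, 1.3, 2.1)

# Window centred on the first DateTime45 value, 22.5 minutes either side
centre      <- as.POSIXct("2017-07-09 00:00:00", tz = "UTC")
half_window <- 22.5 * 60   # seconds

in_window  <- times >= centre - half_window & times <= centre + half_window
n_vedba    <- sum(in_window)
mean_vedba <- mean(vedba[in_window])

c(n = n_vedba, mean = mean_vedba)
```

Four observations fall inside 23:37:30 to 00:22:30 and their mean is 1.285, matching row 1 of the join output.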
Well, I thought of working around it differently. First I switched your POSIXct for POSIXlt and applied it to both df1 and df2 (instead of just df2).
So I ran this:
df1$DateTime45<- as.POSIXlt(df1$DateTime45, format="%Y-%m-%d %H:%M:%OS",tz="UTC")
df2$DateTime<- as.POSIXlt(df2$DateTime, format="%Y-%m-%d %H:%M:%OS",tz="UTC")
Then I decided to go for conditions: since you have times, you can check whether the absolute difference between each df2 time and each df1 time is smaller than 22.5 minutes.
I did it with 2 nested for loops:
for (i in 1:length(df1$DateTime45)){
for (n in 1:length(df2$DateTime)){
df2$DateTime[abs((df1$DateTime45[i] - df2$DateTime[n])) < seconds_to_period(seconds(22.5*60))][n] <- df1$DateTime45[i]
}
}
Basically, so far I have overwritten (converted) all df2 dates into the relevant df1 ones. So be mindful that if you want to keep your original df2 dates and times, you should initially run this on a duplicate of df2.
Now finally we can calculate the mean vedba and join it to df1, again using a simple for loop:
means <- list()
for (i in 1:length(df1$DateTime45)){
means[[i]] <- mean(df2[df1$DateTime45[i]==df2$DateTime,]$vedba)
}
df1<- cbind(df1,means = unlist(means))
rm(means)
Now printing df1 gives us:
DateTime45 ID VariableX means
1 2017-07-09 00:00:00 A 0 1.2850000
2 2017-07-09 00:45:00 A 2 1.7750000
3 2017-07-09 02:15:00 A 0 1.5833333
4 2017-07-09 03:45:00 A 4 0.8266667
I have data on start time ('startTime', a date-time variable, POSIXct) and duration in minutes ('duration_minutes'):
df <- data.frame(id = c(1, 2, 3),
startTime = as.POSIXct(c("2018-01-01 12:15:31",
"2018-01-02 23:43:00",
"2018-01-03 11:00:11")),
duration_minutes = c(315, 120, 45))
I want to convert the start time and duration to elapsed time per hour, for each hour, from the hour of the start time to the last hour at the end of the duration:
df_result <- data.frame(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3),
startTime = c("2018-01-01 12:15:31","2018-01-01 13:00:00",
"2018-01-01 14:00:00","2018-01-01 15:00:00",
"2018-01-01 16:00:00","2018-01-01 17:00:00",
"2018-01-02 23:43:00","2018-01-03 00:00:00",
"2018-01-03 01:00:00",
"2018-01-03 11:00:11"),
duration_minutes = c(44.48, 60, 60, 60, 60, 30.5, 17, 60, 43, 45))
Please advise on a possible solution.
Another possibility:
library(data.table)
library(lubridate)
setDT(df)
df[ , ceil_start := ceiling_date(start, "hour", change_on_boundary = TRUE)]
df[ , {
if(difftime(ceil_start, start, units = "min") > dur) {
.SD[ , .(start, dur)]
} else {
end <- start + dur * 60
time <- c(start,
seq(from = ceil_start,
to = floor_date(end, "hour"),
by = "hour"),
end)
.(start = head(time, -1), dur = `units<-`(diff(time), "mins"))
}
},
by = id]
# id start dur
# 1: 1 2018-01-01 12:15:31 44.48333 mins
# 2: 1 2018-01-01 13:00:00 60.00000 mins
# 3: 1 2018-01-01 14:00:00 60.00000 mins
# 4: 1 2018-01-01 15:00:00 60.00000 mins
# 5: 1 2018-01-01 16:00:00 60.00000 mins
# 6: 1 2018-01-01 17:00:00 30.51667 mins
# 7: 2 2018-01-02 23:43:00 17.00000 mins
# 8: 2 2018-01-03 00:00:00 60.00000 mins
# 9: 2 2018-01-03 01:00:00 43.00000 mins
# 10: 3 2018-01-03 11:00:11 45.00000 mins
# 11: 4 2018-01-03 11:35:00 25.00000 mins
# 12: 4 2018-01-03 12:00:00 10.00000 mins
# 13: 5 2018-01-03 00:00:00 60.00000 mins
# 14: 5 2018-01-03 01:00:00 0.00000 mins
Explanation
Convert the data.frame to a data.table (setDT). Round start times up to the next hour (ceiling_date(start, "hour", ...)). Use change_on_boundary = TRUE for easier handling of times without minutes and seconds (not in the original data, but tested).
To handle cases where the end time (start + duration) falls in the same hour as the start time (e.g. id = 3), check whether the difference between the rounded-up time and the start time is larger than the duration (if(difftime(ceil_start, start, units = "min") > dur)). If so, just select the start and duration columns (.SD[ , .(start, dur)]).
For the other cases (else), calculate the end time: end <- start + dur * 60. Create a sequence from the rounded-up start time ('ceil_start') to the rounded-down end time, with an hourly increment (seq(from = ceil_start, to = floor_date(end, "hour"), by = "hour")). Concatenate it with the 'start' and 'end' times. Return all times except the last (head(time, -1)) and calculate the difference between time steps in minutes (`units<-`(diff(time), "mins")).
For times with H:M:S = 00:00:00 and duration is a multiple of 60 min, like id = 5, the current solution gives a row with a duration of 0 minutes for the last hour. While waiting for a more elegant solution, a quick and dirty way is just to delete such rows with duration = 0.
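The splitting step can also be sketched for a single interval in base R. This is not the data.table code above but a simplified illustration: split_by_hour is a hypothetical helper, times are assumed to be in UTC, and hour boundaries equal to the end time are dropped so that no zero-minute row is produced.

```r
# Hypothetical helper: split one (start, duration) pair at full-hour boundaries
split_by_hour <- function(start, dur_min) {
  end <- start + dur_min * 60
  # First full hour strictly after the start time
  first_boundary <- as.POSIXct(trunc(start, units = "hours")) + 3600
  boundaries <- if (first_boundary < end) {
    b <- seq(first_boundary, end, by = "hour")
    b[b < end]            # drop a boundary that coincides with the end
  } else {
    start[0]              # zero-length POSIXct: interval stays within one hour
  }
  cuts <- c(start, boundaries, end)
  data.frame(start = head(cuts, -1),
             dur   = as.numeric(diff(cuts), units = "mins"))
}

# id = 2 from the question: 23:43:00 plus 120 minutes
res <- split_by_hour(as.POSIXct("2018-01-02 23:43:00", tz = "UTC"), 120)
res

# id = 5 from the answer's data: starts on the hour, lasts exactly 60 minutes
res_full_hour <- split_by_hour(as.POSIXct("2018-01-03 00:00:00", tz = "UTC"), 60)
```

For id = 2 this gives durations of 17, 60 and 43 minutes, and for the full-hour case a single 60-minute row.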
Data
Please note that I have added two cases not included in the original data, id = 4 (see also my comment above) and id = 5.
df <- data.frame(id = 1:5,
start = as.POSIXct(c("2018-01-01 12:15:31",
"2018-01-02 23:43:00",
"2018-01-03 11:00:11",
"2018-01-03 11:35:00",
"2018-01-03 00:00:00")),
dur = c(315, 120, 45, 35, 60))
Try this:
library(data.table)
library(lubridate)
library(magrittr)
df <-
setDT(df)[, start_ceiling := ceiling_date(startTime, "hour", change_on_boundary = TRUE)] %>%
.[, `:=` (
reps = ifelse(
startTime + (duration_minutes * 60) <= start_ceiling, 1, pmax(2, floor(duration_minutes / 60) + 1)
),
initial_diff = as.numeric(difftime(start_ceiling[1], startTime[1], units = "mins"))
), by = id] %>%
.[, df[df[, rep(.I, reps)]]] %>%
.[, startTime := pmax(startTime, floor_date(startTime, "hour") + hours(0:(.N - 1))), by = id] %>%
.[reps > 1, duration_minutes := c(initial_diff[.N],
rep(60, reps[.N] - 2),
(duration_minutes[.N] - initial_diff[.N]) %% 60), by = id] %>%
.[!(duration_minutes == 0 & reps > 1), ] %>%
.[, c("reps", "start_ceiling", "initial_diff") := NULL]
I've tested this with all the scenarios we've gathered so far, and this is the output:
id startTime duration_minutes
1: 1 2018-01-01 12:15:31 44.48333
2: 1 2018-01-01 13:00:00 60.00000
3: 1 2018-01-01 14:00:00 60.00000
4: 1 2018-01-01 15:00:00 60.00000
5: 1 2018-01-01 16:00:00 60.00000
6: 1 2018-01-01 17:00:00 30.51667
7: 2 2018-01-02 23:43:00 17.00000
8: 2 2018-01-03 00:00:00 60.00000
9: 2 2018-01-03 01:00:00 43.00000
10: 3 2018-01-03 11:00:11 45.00000
11: 4 2018-01-04 10:00:00 60.00000
12: 4 2018-01-04 11:00:00 5.00000
13: 5 2018-01-05 00:00:00 60.00000
14: 6 2018-01-06 11:35:00 25.00000
15: 6 2018-01-06 12:00:00 10.00000
16: 7 2018-01-07 00:00:00 60.00000
17: 7 2018-01-07 01:00:00 60.00000
Data used:
df <- data.frame(
id = c(1, 2, 3, 4, 5, 6, 7),
startTime = as.POSIXct(
c(
"2018-01-01 12:15:31",
"2018-01-02 23:43:00",
"2018-01-03 11:00:11",
"2018-01-04 10:00:00",
"2018-01-05 00:00:00",
"2018-01-06 11:35:00",
"2018-01-07 00:00:00"
)
),
duration_minutes = c(315, 120, 45, 65, 60, 35, 120)
)
df
id startTime duration_minutes
1 1 2018-01-01 12:15:31 315
2 2 2018-01-02 23:43:00 120
3 3 2018-01-03 11:00:11 45
4 4 2018-01-04 10:00:00 65
5 5 2018-01-05 00:00:00 60
6 6 2018-01-06 11:35:00 35
7 7 2018-01-07 00:00:00 120
I am trying to manipulate the timestamp variable as follows: if the start time of the following activity is before the end time of the previous activity, then set both the start and end time of the previous activity to 1 second before the start of the following activity.
Additional notes:
An activity can be repeated within the same work, e.g. activity "A".
Some individual activities have identical start and end times, others different ones. I've done this intentionally; you can ignore it.
workID workActivityID activity status timestamp timestampDesired
1 1 A start 2018-01-01 09:55:01 2018-01-01 09:54:05
1 1 A end 2018-01-01 09:55:01 2018-01-01 09:54:05
1 2 B start 2018-01-01 09:54:06 2018-01-01 09:54:06
1 2 B end 2018-01-01 09:56:22 2018-01-01 09:56:22
1 3 C start 2018-01-01 09:57:22 2018-01-01 09:57:22
1 3 C end 2018-01-01 09:57:22 2018-01-01 09:57:22
1 4 A start 2018-02-02 08:35:00 2018-02-02 08:35:00
1 4 A end 2018-02-02 08:35:00 2018-02-02 08:35:00
2 1 A start 2018-02-02 08:13:55 2018-02-02 08:14:01
2 1 A end 2018-02-02 08:14:20 2018-02-02 08:14:01
2 2 B start 2018-02-02 08:14:02 2018-02-02 08:14:02
2 2 B end 2018-02-02 08:14:50 2018-02-02 08:14:50
2 3 C start 2018-02-02 10:00:00 2018-02-02 10:00:00
2 3 C end 2018-02-02 10:00:00 2018-02-02 10:00:00
2 4 A start 2018-02-02 10:22:00 2018-02-02 10:22:00
2 4 A end 2018-02-02 10:24:00 2018-02-02 10:24:00
Data:
library(lubridate)
df <-
data.frame(
workID = rep(c(1,2), each=8),
workActivityID = rep(c(1,2,3,4), each=2, times=2),
activity = rep(c("A","B","C","A"), each=2, times=2),
startEnd = rep(c("start", "end"), times=8),
timestamp = ymd_hms(c("2018-01-01 09:55:01", "2018-01-01 09:55:01", "2018-01-01 09:54:06", "2018-01-01 09:56:22", "2018-01-01 09:57:22", "2018-01-01 09:57:22", "2018-02-02 08:35:00","2018-02-02 08:35:00",
"2018-02-02 08:13:55", "2018-02-02 08:14:20", "2018-02-02 08:14:02", "2018-02-02 08:14:50", "2018-02-02 10:00:00", "2018-02-02 10:00:00", "2018-02-02 10:22:00", "2018-02-02 10:24:00")),
timestampDesired = ymd_hms(c("2018-01-01 09:54:05", "2018-01-01 09:54:05", "2018-01-01 09:54:06", "2018-01-01 09:56:22", "2018-01-01 09:57:22", "2018-01-01 09:57:22", "2018-02-02 08:35:00", "2018-02-02 08:35:00",
"2018-02-02 08:14:01", "2018-02-02 08:14:01", "2018-02-02 08:14:02", "2018-02-02 08:14:50", "2018-02-02 10:00:00", "2018-02-02 10:00:00", "2018-02-02 10:22:00", "2018-02-02 10:24:00")))
A possible solution uses tidyr::spread and tidyr::gather. The idea is to move start and end into the same row so that the comparison and the change (if needed) become easier. Once the modification is done, convert the data back to long format.
library(tidyverse)
df %>% select(-timestampDesired) %>%
spread(startEnd, timestamp) %>%
group_by(workID) %>%
mutate(start = as.POSIXct(ifelse(!is.na(lead(start)) & lead(start) < end,
lead(start) - 1, start), origin = "1970-01-01 00:00:00" )) %>%
mutate(end = as.POSIXct(ifelse(!is.na(lead(start)) & lead(start) < end,
lead(start) - 1, end), origin = "1970-01-01 00:00:00" )) %>%
ungroup() %>%
gather("startEnd", "timestamp", c("start","end")) %>%
arrange(workID, workActivityID, desc(startEnd)) %>%
as.data.frame()
# workID workActivityID activity startEnd timestamp
# 1 1 1 A start 2018-01-01 09:54:05
# 2 1 1 A end 2018-01-01 09:54:05
# 3 1 2 B start 2018-01-01 09:54:06
# 4 1 2 B end 2018-01-01 09:56:22
# 5 1 3 C start 2018-01-01 09:57:22
# 6 1 3 C end 2018-01-01 09:57:22
# 7 1 4 A start 2018-02-02 08:35:00
# 8 1 4 A end 2018-02-02 08:35:00
# 9 2 1 A start 2018-02-02 08:14:01
# 10 2 1 A end 2018-02-02 08:14:01
# 11 2 2 B start 2018-02-02 08:14:02
# 12 2 2 B end 2018-02-02 08:14:50
# 13 2 3 C start 2018-02-02 10:00:00
# 14 2 3 C end 2018-02-02 10:00:00
# 15 2 4 A start 2018-02-02 10:22:00
# 16 2 4 A end 2018-02-02 10:24:00
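The core comparison (next start versus current end) can be isolated in a few lines of base R. The sketch below uses only the four activities of workID 2, already in wide form; the variable names are made up for the illustration and UTC is assumed.

```r
# Start and end times of the four activities of workID 2 (wide form)
start <- as.POSIXct(c("2018-02-02 08:13:55", "2018-02-02 08:14:02",
                      "2018-02-02 10:00:00", "2018-02-02 10:22:00"), tz = "UTC")
end   <- as.POSIXct(c("2018-02-02 08:14:20", "2018-02-02 08:14:50",
                      "2018-02-02 10:00:00", "2018-02-02 10:24:00"), tz = "UTC")

# Start of the following activity; the last activity has none, so it gets NA
next_start <- start[seq_along(start) + 1L]

# An activity overlaps if the following one starts before it ends
overlap <- !is.na(next_start) & next_start < end

# Set both start and end to one second before the following start
new_start <- start
new_end   <- end
new_start[overlap] <- next_start[overlap] - 1
new_end[overlap]   <- next_start[overlap] - 1

data.frame(new_start, new_end)
```

Only the first activity overlaps, so its start and end both become 08:14:01, as in the desired output. For the full data the same steps would have to be run per workID.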
Just posting a data.table solution. Explanation inline
#cast into a wide format
wideDT <- dcast.data.table(DT, ... ~ startEnd, value.var="timestamp")
#lead the start time vector and compare start time and amend start and end time if required
wideDT[, c("newstart", "newend") := {
x <- shift(start, type="lead", fill=max(end))
list(newstart=as.POSIXct(ifelse(x < end, x - 1, start), origin="1970-01-01"),
newend=as.POSIXct(ifelse(x < end, x - 1, end), origin="1970-01-01"))
}, by=.(workID)]
#get OP's desired output
wideDT[.(workID, workActivityID, activity),
list(startend=c("start", "end"),
timestamp=c(start, end),
timestampDesired=c(newstart, newend)), by=.EACHI]
data:
library(data.table)
DT <- data.table(
workID = rep(c(1,2), each=8),
workActivityID = rep(c(1,2,3,4), each=2, times=2),
activity = rep(c("A","B","C","A"), each=2, times=2),
startEnd = rep(c("start", "end"), times=8),
timestamp = as.POSIXct(c("2018-01-01 09:55:01", "2018-01-01 09:55:01", "2018-01-01 09:54:06", "2018-01-01 09:56:22", "2018-01-01 09:57:22", "2018-01-01 09:57:22", "2018-02-02 08:35:00","2018-02-02 08:35:00",
"2018-02-02 08:13:55", "2018-02-02 08:14:20", "2018-02-02 08:14:02", "2018-02-02 08:14:50", "2018-02-02 10:00:00", "2018-02-02 10:00:00", "2018-02-02 10:22:00", "2018-02-02 10:24:00")))
I have a dataframe in which each row is the working hours of an employee defined by a start and a stop time:
DF:
EmployeeNum  Start_datetime       End_datetime
123          2012-02-01 07:30:00  2012-02-01 17:45:00
342          2012-02-01 08:00:00  2012-02-01 17:45:00
876          2012-02-01 10:45:00  2012-02-01 18:45:00
I'd like to find the number of employees working during each hour on each day in a timespan:
Date Hour NumberofEmployeesWorking
2012-02-01 00:00 ? (number of employees working between 00:00 and 00:59)
2012-02-01 01:00 ?
2012-02-01 02:00 ?
2012-02-01 03:00 ?
2012-02-01 04:00 ?
2012-02-01 05:00 ?
2012-02-01 06:00 ?
How do I put my working hours into bins like this?
Your data, in a more consumable format, plus one row to span midnight (for example). I changed the format to include a "T" here, to make consumption easier, otherwise the middle space makes it less trivial to do it with read.table(text='...'). (You can skip this since you already have your real data.)
x <- read.table(text='EmployeeNum Start_datetime End_datetime
123 2012-02-01T07:30:00 2012-02-01T17:45:00
342 2012-02-01T08:00:00 2012-02-01T17:45:00
876 2012-02-01T10:45:00 2012-02-01T18:45:00
877 2012-02-01T22:45:00 2012-02-02T05:45:00',
header=TRUE, stringsAsFactors=FALSE)
In case you haven't done it with your own data, convert all times to POSIXt, otherwise skip this, too.
x[c('Start_datetime','End_datetime')] <- lapply(x[c('Start_datetime','End_datetime')],
as.POSIXct, format='%Y-%m-%dT%H:%M:%S')
We need to generate a sequence of hourly timestamps:
startdate <- trunc(min(x$Start_datetime), units = "hours")
enddate <- round(max(x$End_datetime), units = "hours")
c(startdate, enddate)
# [1] "2012-02-01 07:00:00 PST" "2012-02-02 06:00:00 PST"
timestamps <- seq(startdate, enddate, by = "hour")
head(timestamps)
# [1] "2012-02-01 07:00:00 PST" "2012-02-01 08:00:00 PST" "2012-02-01 09:00:00 PST"
# [4] "2012-02-01 10:00:00 PST" "2012-02-01 11:00:00 PST" "2012-02-01 12:00:00 PST"
(Assumption: all end timestamps are after their start timestamps ...)
Now it's just a matter of tallying:
counts <- mapply(function(st,en) sum(st <= x$End_datetime & x$Start_datetime <= en),
timestamps[-length(timestamps)], timestamps[-1])
data.frame(
start = timestamps[ -length(timestamps) ],
count = counts
)
# start count
# 1 2012-02-01 07:00:00 2
# 2 2012-02-01 08:00:00 2
# 3 2012-02-01 09:00:00 2
# 4 2012-02-01 10:00:00 3
# 5 2012-02-01 11:00:00 3
# 6 2012-02-01 12:00:00 3
# 7 2012-02-01 13:00:00 3
# 8 2012-02-01 14:00:00 3
# 9 2012-02-01 15:00:00 3
# 10 2012-02-01 16:00:00 3
# 11 2012-02-01 17:00:00 3
# 12 2012-02-01 18:00:00 1
# 13 2012-02-01 19:00:00 0
# 14 2012-02-01 20:00:00 0
# 15 2012-02-01 21:00:00 0
# 16 2012-02-01 22:00:00 1
# 17 2012-02-01 23:00:00 1
# 18 2012-02-02 00:00:00 1
# 19 2012-02-02 01:00:00 1
# 20 2012-02-02 02:00:00 1
# 21 2012-02-02 03:00:00 1
# 22 2012-02-02 04:00:00 1
# 23 2012-02-02 05:00:00 1
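Note that the tally above uses <= on both sides, so a shift that starts exactly on an hour boundary (employee 342 at 08:00) is also counted in the preceding bin. If you would rather treat each bin as half-open, from the hour up to but not including the next hour, a variant with strict inequalities could look like the sketch below. It uses only the three employees from the question, assumes UTC, and the names emp and hour_starts are made up for the illustration.

```r
# The three shifts from the question (UTC is an assumption)
emp <- data.frame(
  EmployeeNum = c(123, 342, 876),
  Start_datetime = as.POSIXct(c("2012-02-01 07:30:00", "2012-02-01 08:00:00",
                                "2012-02-01 10:45:00"), tz = "UTC"),
  End_datetime   = as.POSIXct(c("2012-02-01 17:45:00", "2012-02-01 17:45:00",
                                "2012-02-01 18:45:00"), tz = "UTC"))

# Start of each hourly bin, 07:00 through 18:00
hour_starts <- seq(as.POSIXct("2012-02-01 07:00:00", tz = "UTC"),
                   as.POSIXct("2012-02-01 18:00:00", tz = "UTC"),
                   by = "hour")

# A shift is counted in a bin if it starts before the bin ends
# and ends after the bin starts
counts <- vapply(seq_along(hour_starts), function(i) {
  st <- hour_starts[i]
  en <- st + 3600
  sum(emp$Start_datetime < en & emp$End_datetime > st)
}, integer(1))

data.frame(start = hour_starts, count = counts)
```

With strict inequalities the 07:00 bin contains one employee instead of two; the remaining bins for these three shifts are unchanged.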
I did not see @r2evans's answer before posting. I came up with this independently, though it looks similar. I am posting it here in case it is helpful. Feel free to accept @r2evans's answer.
Data:
df1 <- read.table(text="EmployeeNum Start_datetime End_datetime
123 '2012-02-01 07:30:00' '2012-02-01 17:45:00'
342 '2012-02-01 08:00:00' '2012-02-01 17:45:00'
876 '2012-02-01 10:45:00' '2012-02-01 18:45:00'", header = TRUE )
df1 <- within(df1, Start_datetime <- as.POSIXct( Start_datetime))
df1 <- within(df1, End_datetime <- as.POSIXct( End_datetime))
Code:
Find datetime sequence by 1 hour for each employee and count the number by Start_datetime.
Also, with this code, it is assumed that you separate original data by each single day and then apply the following code. If your data has multiple days mixed in it, with IDateTime() function from data.table package, it is possible to separate days from time and group by them while making the datetime sequence.
library('data.table')
setDT(df1) # assign data.table class by reference
df2 <- df1[, Map( f = function(x, y) seq( from = trunc(x, "hour"),
to = round(y, "hour"),
by = "1 hour" ),
x = Start_datetime, y = End_datetime ),
by = EmployeeNum ]
colnames(df2)[ colnames(df2) == "V1" ] <- "Start_datetime" # for some reason I can't assign column name properly during the column creation step.
Output:
df2[, .N, by = .( Start_datetime, End_datetime = Start_datetime + 3599 ) ]
# Start_datetime End_datetime N
# 1: 2012-02-01 07:00:00 2012-02-01 07:59:59 1
# 2: 2012-02-01 08:00:00 2012-02-01 08:59:59 2
# 3: 2012-02-01 09:00:00 2012-02-01 09:59:59 2
# 4: 2012-02-01 10:00:00 2012-02-01 10:59:59 3
# 5: 2012-02-01 11:00:00 2012-02-01 11:59:59 3
# 6: 2012-02-01 12:00:00 2012-02-01 12:59:59 3
# 7: 2012-02-01 13:00:00 2012-02-01 13:59:59 3
# 8: 2012-02-01 14:00:00 2012-02-01 14:59:59 3
# 9: 2012-02-01 15:00:00 2012-02-01 15:59:59 3
# 10: 2012-02-01 16:00:00 2012-02-01 16:59:59 3
# 11: 2012-02-01 17:00:00 2012-02-01 17:59:59 3
# 12: 2012-02-01 18:00:00 2012-02-01 18:59:59 3
# 13: 2012-02-01 19:00:00 2012-02-01 19:59:59 1
Graph:
binwidth = 3600: the value is one hour expressed in seconds (60 min * 60 sec = 3600 seconds).
library('ggplot2')
ggplot( data = df2,
mapping = aes( x = Start_datetime ) ) +
geom_histogram(binwidth = 3600, color = "red", fill = "white" ) +
scale_x_datetime( date_breaks = "1 hour", date_labels = "%H:%M" ) +
ylab("Number of Employees") +
xlab( "Working Hours: 2012-02-01" ) +
theme( axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank(),
panel.background = element_rect( fill = "white", color = "black") )
Thank you both for your answers. I came up with a solution which is pretty similar to yours, but I was wondering if you could have a look and let me know what you think of it.
I started with a new empty dataframe, and then made two nested loops to look at the start and end time in each row and generate the sequence of hours in between. Then I append each hour in the sequence to the new dataframe. This way, I can simply do a count later.
staffDetailHours <- data.frame("personnelNum"=integer(0),
"workDate"=character(0),
"Hour"=integer(0))
for (i in 1:dim(DF)[1]){
hoursList <- seq(as.POSIXlt(DF[i,]$START)$hour,
as.POSIXlt(DF[i,]$END)$hour)
for (j in 1:length(hoursList)) {
staffDetailHours[nrow(staffDetailHours)+1,] = list(
DF[i,]$EmployeeNum,
DF[i,]$Date,
hoursList[j]
)
}
}