I have a raw data frame that looks like this:
test
id class time
1 1 start 2019-06-20 00:00:00
2 1 end 2019-06-20 00:05:00
3 1 start 2019-06-20 00:10:00
4 1 end 2019-06-20 00:15:00
5 2 end 2019-06-20 00:20:00
6 2 start 2019-06-20 00:25:00
7 2 end 2019-06-20 00:30:00
8 2 start 2019-06-20 00:35:00
9 3 end 2019-06-20 00:40:00
10 3 start 2019-06-20 00:45:00
11 3 end 2019-06-20 00:50:00
12 3 start 2019-06-20 00:55:00
My goal is to map the values to an output table, keeping, for each id, only the cases where a start is immediately followed by an end in time order. The output would therefore look like:
output
id start end
1 1 2019-06-20 00:00:00 2019-06-20 00:05:00
2 1 2019-06-20 00:10:00 2019-06-20 00:15:00
3 2 2019-06-20 00:25:00 2019-06-20 00:30:00
4 3 2019-06-20 00:45:00 2019-06-20 00:50:00
I have tried with the dplyr package:
test %>% group_by(id) %>% arrange(time) %>% starts_with("start")
Error in starts_with(., "start") : is_string(match) is not TRUE
starts_with() always throws an error. I would like to avoid writing a for loop, because I am sure this can be handled by a few chained operations. Any ideas for a workaround in dplyr or data.table?
One possible approach:
test[, {
  # indices of "start" rows whose immediate successor is an "end"
  si <- which(class == "start" & shift(class, -1L) == "end")
  .(start = time[si], end = time[si + 1L])
}, by = .(id)]
output:
   id               start                 end
1:  1 2019-06-20 00:00:00 2019-06-20 00:05:00
2:  1 2019-06-20 00:10:00 2019-06-20 00:15:00
3:  2 2019-06-20 00:25:00 2019-06-20 00:30:00
4:  3 2019-06-20 00:45:00 2019-06-20 00:50:00
data:
library(data.table)
test <- fread("id,class,time
1,start,2019-06-20 00:00:00
1,end,2019-06-20 00:05:00
1,start,2019-06-20 00:10:00
1,end,2019-06-20 00:15:00
2,end,2019-06-20 00:20:00
2,start,2019-06-20 00:25:00
2,end,2019-06-20 00:30:00
2,start,2019-06-20 00:35:00
3,end,2019-06-20 00:40:00
3,start,2019-06-20 00:45:00
3,end,2019-06-20 00:50:00
3,start,2019-06-20 00:55:00")
I usually use cumsum() in these cases:
library(dplyr)
library(tidyr)
test %>%
  group_by(id) %>%
  arrange(time, .by_group = TRUE) %>% # sort within groups (note the .by_group arg)
  mutate(flag = cumsum(class == "start")) %>% # each "start" opens a new flag group
  group_by(id, flag) %>%
  filter(n() == 2L) %>% # keep only complete start/end pairs
  ungroup() %>%
  spread(class, time) %>%
  select(-flag)
Using dplyr and tidyr, we can first filter the rows which follow the "start"/"end" pattern, create groups of 2 rows, and spread to wide format.
library(dplyr)
library(tidyr)
test %>%
  group_by(id) %>%
  filter(class == "start" & lead(class) == "end" |
         class == "end" & lag(class) == "start") %>%
  group_by(group = gl(n() / 2, 2)) %>%
  spread(class, time) %>%
  ungroup() %>%
  select(-group) %>%
  select(id, start, end)
# id start end
# <int> <dttm> <dttm>
#1 1 2019-06-20 00:00:00 2019-06-20 00:05:00
#2 1 2019-06-20 00:10:00 2019-06-20 00:15:00
#3 2 2019-06-20 00:25:00 2019-06-20 00:30:00
#4 3 2019-06-20 00:45:00 2019-06-20 00:50:00
You can keep each start row plus the end immediately after it (if any), then use dcast to switch from long to wide form:
test[, if (.N >= 2) head(.SD, 2),
  by = .(g = rleid(id, cumsum(class == "start"))), .SDcols = names(test)][,
  dcast(.SD, id + g ~ factor(class, levels = c("start", "end")), value.var = "time")
]
id g start end
1: 1 1 2019-06-20 00:00:00 2019-06-20 00:05:00
2: 1 2 2019-06-20 00:10:00 2019-06-20 00:15:00
3: 2 4 2019-06-20 00:25:00 2019-06-20 00:30:00
4: 3 7 2019-06-20 00:45:00 2019-06-20 00:50:00
rleid and cumsum are used to find the start/end sequences, and factor is needed to tell dcast the column order.
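To see the grouping that builds, here is a quick sketch (using the test data from the first answer's data block):
test[, .(id, class, g = rleid(id, cumsum(class == "start")))]
# g runs from 1 to 8 here; only groups 1, 2, 4 and 7 contain a start followed
# immediately by an end, which is why those g values appear in the output above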
Side note: This is essentially the same as @cheetahfly's answer (I didn't realize when I posted): since the cumsum is increasing, it is sufficient to group by id + cumsum, and there's no need for rleid (which is for tracking runs of values). The only difference is that my approach would keep a run like start, end, end, while the other answer would filter it out with the n() == 2 check.
Related
I am having trouble converting a time range in a column to readable dates in R. How would I go about converting this?
[1] "05:30P -08:00P" "07:00A -09:35A" "08:00A -10:30A" "08:55P -11:00P" "06:00P -06:30P"
c("05:30P -08:00P", "07:00A -09:35A", "08:00A -10:30A", "08:55P -11:00P",
"06:00P -06:30P")
If we want to convert to Datetime, an option is to split at the "-" into two columns and then use as.POSIXct to do the conversion:
library(stringr)
library(dplyr)
library(tidyr)
str_replace_all(str1, "([AP])", "\\1M") %>%
  tibble(str1 = .) %>%
  separate(str1, into = c('start', 'end'), sep = "\\s*-") %>%
  mutate(across(c(start, end), ~ as.POSIXct(., format = '%I:%M %p')))
# A tibble: 5 x 2
# start end
# <dttm> <dttm>
#1 2020-08-19 17:30:00 2020-08-19 20:00:00
#2 2020-08-19 07:00:00 2020-08-19 09:35:00
#3 2020-08-19 08:00:00 2020-08-19 10:30:00
#4 2020-08-19 20:55:00 2020-08-19 23:00:00
#5 2020-08-19 18:00:00 2020-08-19 18:30:00
Or using lubridate
library(lubridate)
str_replace_all(str1, "([AP])", "\\1M") %>%
  tibble(str1 = .) %>%
  separate(str1, into = c('start', 'end'), sep = "\\s*-") %>%
  mutate(across(c(start, end), ~ parse_date_time(., 'IMp')))
data
str1 <- c("05:30P -08:00P", "07:00A -09:35A", "08:00A -10:30A", "08:55P -11:00P",
"06:00P -06:30P")
Base R attempt using strcapture to separate the timestamps into two parts:
dr <- c("05:30P -08:00P", "07:00A -09:35A", "08:00A -10:30A", "08:55P -11:00P",
"06:00P -06:30P")
tms <- strcapture(r"((\d+:\d+[AP])[- ]+(\d+:\d+[AP]))", dr, proto=list(start="",end=""))
tms[] <- lapply(tms, function(x) as.POSIXct(paste0(x, "M"), format="%I:%M%p", tz="UTC"))
# start end
#1 2020-08-20 17:30:00 2020-08-20 20:00:00
#2 2020-08-20 07:00:00 2020-08-20 09:35:00
#3 2020-08-20 08:00:00 2020-08-20 10:30:00
#4 2020-08-20 20:55:00 2020-08-20 23:00:00
#5 2020-08-20 18:00:00 2020-08-20 18:30:00
I have a time series (xts) of rain gage data, and I would like to sum all the rain amounts between the beginning and end time points from a list, then make a new data frame of StormNumber and TotalRain over that time.
> head(RainGage)
Rain_mm
2019-07-01 00:00:00 0
2019-07-01 00:15:00 0
2019-07-01 00:30:00 0
2019-07-01 00:45:00 0
2019-07-01 01:00:00 0
2019-07-01 01:15:00 0
head(StormTimes)
StormNumber RainStartTime RainEndTime
1 1 2019-07-21 20:00:00 2019-07-22 04:45:00
2 2 2019-07-22 11:30:00 2019-07-22 23:45:00
3 3 2019-07-11 09:15:00 2019-07-11 19:00:00
4 4 2019-05-29 17:00:00 2019-05-29 20:45:00
5 5 2019-06-27 14:30:00 2019-06-27 17:15:00
6 6 2019-07-11 06:15:00 2019-07-11 09:00:00
I have this code that I got from the SO community when I was trying to do something similar in the past (extracting data rather than summing it). However, I have no idea how it works, so I am struggling to adapt it to this situation.
# builds "start/end" range strings, uses xts range subsetting to pull each
# storm's rows, then row-binds the extracted chunks
do.call(rbind, Map(function(x, y) RainGage[paste(x, y, sep="/")],
                   StormTimes$RainStartTime, StormTimes$RainEndTime))
In this case I would suggest writing your own function and then using apply to achieve what you want, for example:
dates <- c('2019-07-01 00:00:00', '2019-07-01 00:15:00',
'2019-07-01 00:30:00', '2019-07-01 00:45:00',
'2019-07-01 01:00:00', '2019-07-01 01:15:00')
dates <- as.POSIXct(strptime(dates, '%Y-%m-%d %H:%M:%S'))
mm <- c(0, 10, 10, 20, 0, 0)
rain <- data.frame(dates, mm)
number <- c(1,2)
start <- c('2019-07-01 00:00:00','2019-07-01 00:18:00')
start <- as.POSIXct(strptime(start, '%Y-%m-%d %H:%M:%S'))
end <- c('2019-07-01 00:17:00','2019-07-01 01:20:00')
end <- as.POSIXct(strptime(end, '%Y-%m-%d %H:%M:%S'))
storms <- data.frame(number, start, end)
# Sum of rain for one storm; apply() passes each row as a character vector
f <- function(x) {
  # get the starting and ending moments (columns 2 and 3 of storms)
  start <- as.POSIXct(x[2])
  end <- as.POSIXct(x[3])
  # sum all rain recorded at or after `start` and before `end`
  sum(rain[rain$dates >= start & rain$dates < end, 'mm'])
}
# Apply function to each row of the dataframe
storms$rain <- apply(storms, 1, f)
print(storms)
This yields:
number start end rain
1 1 2019-07-01 00:00:00 2019-07-01 00:17:00 10
2 2 2019-07-01 00:18:00 2019-07-01 01:20:00 30
So a column rain in storms now holds the sum of rain$mm, which is what you're after.
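For the actual rain-gage data, a minimal sketch of the same idea that reuses the Map()/range-subsetting idiom from the question (assuming RainGage is an xts object and StormTimes is the data frame shown above):
library(xts)
# "start/end" strings select a time range from an xts object; sum() totals it
StormTimes$TotalRain <- mapply(
  function(s, e) sum(RainGage[paste(s, e, sep = "/")]),
  StormTimes$RainStartTime, StormTimes$RainEndTime
)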
Hope that helps you out!
Given the dataframe below
class timestamp
1 A 2019-02-14 15:00:29
2 A 2019-01-27 17:59:53
3 A 2019-01-27 18:00:00
4 B 2019-02-02 18:00:00
5 C 2019-03-08 16:00:37
observations 2 and 3 point to the same event. How do I remove rows belonging to the same class if another timestamp within 2 minutes already exists?
Desired output:
class timestamp
1 A 2019-02-14 15:00:00
2 A 2019-01-27 18:00:00
3 B 2019-02-02 18:00:00
4 C 2019-03-08 16:00:00
round( , c("mins")) can be used to get rid of the seconds component, but if the timestamps are too far apart, some samples will be rounded to different minutes, still leaving distinct timestamps.
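For instance (an illustrative pair of stamps, 65 seconds apart, that still round to different minutes):
round(as.POSIXct(c("2019-01-27 17:58:55", "2019-01-27 18:00:00")), units = "mins")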
EDIT
I think I over-complicated the problem in my first attempt. What should work for your case is to round times to 2-minute intervals, which we can do using round_date from lubridate.
library(lubridate)
library(dplyr)
df %>%
  mutate(timestamp = round_date(as.POSIXct(timestamp), unit = "2 minutes")) %>%
  group_by(class) %>%
  filter(!duplicated(timestamp))
# class timestamp
# <chr> <dttm>
#1 A 2019-02-14 15:00:00
#2 A 2019-01-27 18:00:00
#3 B 2019-02-02 18:00:00
#4 C 2019-03-08 16:00:00
Original Attempt
We can first convert the timestamp to a POSIXct object, then arrange rows by class and timestamp, use cut to divide them into "2 min" intervals, and then remove duplicates.
library(dplyr)
df %>%
  mutate(timestamp = as.POSIXct(timestamp)) %>%
  arrange(class, timestamp) %>%
  group_by(class) %>%
  filter(!duplicated(as.numeric(cut(timestamp, breaks = "2 mins")), fromLast = TRUE))
# class timestamp
# <chr> <dttm>
#1 A 2019-01-27 18:00:00
#2 A 2019-02-14 15:00:29
#3 B 2019-02-02 18:00:00
#4 C 2019-03-08 16:00:37
Here, I haven't changed or rounded the timestamp column and have kept it as is, but it would be simple to round it if you use cut in mutate; a sketch of that variant follows. Also, if you want to keep the first entry (like 2019-01-27 17:59:53), remove the fromLast = TRUE argument.
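A minimal sketch of that rounding-in-mutate variant (note that cut() floors each stamp to the start of its 2-minute bin rather than rounding):
df %>%
  mutate(timestamp = as.POSIXct(as.character(cut(as.POSIXct(timestamp), breaks = "2 mins")))) %>%
  group_by(class) %>%
  filter(!duplicated(timestamp))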
I have a dataset with periods
active <- data.table(
  id = c(1, 1, 2, 3),
  beg = as.POSIXct(c("2018-01-01 01:10:00", "2018-01-01 01:50:00",
                     "2018-01-01 01:50:00", "2018-01-01 01:50:00")),
  end = as.POSIXct(c("2018-01-01 01:20:00", "2018-01-01 02:00:00",
                     "2018-01-01 02:00:00", "2018-01-01 02:00:00")))
> active
id beg end
1: 1 2018-01-01 01:10:00 2018-01-01 01:20:00
2: 1 2018-01-01 01:50:00 2018-01-01 02:00:00
3: 2 2018-01-01 01:50:00 2018-01-01 02:00:00
4: 3 2018-01-01 01:50:00 2018-01-01 02:00:00
during which an id was active. I would like to aggregate across ids and determine for every point in
time <- data.table(time = seq(from = min(active$beg), to = max(active$end), by = "mins"))
the number of IDs that are inactive and the average number of minutes until they get active. That is, ideally, the table looks like
>ans
time inactive av.time
1: 2018-01-01 01:10:00 2 30
2: 2018-01-01 01:11:00 2 29
...
50: 2018-01-01 02:00:00 0 0
I believe this can be done using data.table but I cannot figure out the syntax to get the time differences.
Using dplyr, we can join by a dummy variable to create the Cartesian product of time and active. The definitions of inactive and av.time might not be exactly what you're looking for, but it should get you started. If your data is very large, I agree that data.table will be a better way of handling this.
library(tidyverse)
time %>%
  mutate(dummy = TRUE) %>%
  #join by the dummy variable to get the Cartesian product
  inner_join(active %>% mutate(dummy = TRUE), by = c("dummy" = "dummy")) %>%
  select(-dummy) %>%
  #define what makes an id inactive and the time until it becomes active
  mutate(inactive = time < beg | time > end,
         TimeUntilActive = ifelse(beg > time, difftime(beg, time, units = "mins"), NA)) %>%
  #group by time and summarise
  group_by(time) %>%
  summarise(inactive = sum(inactive),
            av.time = mean(TimeUntilActive, na.rm = TRUE))
# A tibble: 51 x 3
time inactive av.time
<dttm> <int> <dbl>
1 2018-01-01 01:10:00 3 40
2 2018-01-01 01:11:00 3 39
3 2018-01-01 01:12:00 3 38
4 2018-01-01 01:13:00 3 37
5 2018-01-01 01:14:00 3 36
6 2018-01-01 01:15:00 3 35
7 2018-01-01 01:16:00 3 34
8 2018-01-01 01:17:00 3 33
9 2018-01-01 01:18:00 3 32
10 2018-01-01 01:19:00 3 31
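And since the question mentioned data.table: a rough sketch of the same dummy-join idea there (assuming active and time as defined in the question, with time's column named time):
library(data.table)
# cross-join every timestamp with every period via a dummy key
# (:= adds a dummy column to both tables by reference)
grid <- active[, dummy := 1][time[, dummy := 1], on = "dummy", allow.cartesian = TRUE]
grid[, .(inactive = sum(time < beg | time > end),
         av.time  = mean(ifelse(beg > time,
                                as.numeric(difftime(beg, time, units = "mins")),
                                NA_real_), na.rm = TRUE)),
     by = time]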
I have a dataframe in which each row is the working hours of an employee defined by a start and a stop time:
DF
EmployeeNum Start_datetime End_datetime
123 2012-02-01 07:30:00 2012-02-01 17:45:00
342 2012-02-01 08:00:00 2012-02-01 17:45:00
876 2012-02-01 10:45:00 2012-02-01 18:45:00
I'd like to find the number of employees working during each hour on each day in a timespan:
Date Hour NumberofEmployeesWorking
2012-02-01 00:00 ? (number of employees working between 00:00 and 00:59)
2012-02-01 01:00 ?
2012-02-01 02:00 ?
2012-02-01 03:00 ?
2012-02-01 04:00 ?
2012-02-01 05:00 ?
2012-02-01 06:00 ?
How do I put my working hours into bins like this?
Your data, in a more consumable format, plus one row that spans midnight (for example). I changed the format to include a "T" here to make consumption easier; otherwise the middle space makes it less trivial to read with read.table(text='...'). (You can skip this since you already have your real data.)
x <- read.table(text='EmployeeNum Start_datetime End_datetime
123 2012-02-01T07:30:00 2012-02-01T17:45:00
342 2012-02-01T08:00:00 2012-02-01T17:45:00
876 2012-02-01T10:45:00 2012-02-01T18:45:00
877 2012-02-01T22:45:00 2012-02-02T05:45:00',
header=TRUE, stringsAsFactors=FALSE)
In case you haven't done it with your own data, convert all times to POSIXt, otherwise skip this, too.
x[c('Start_datetime','End_datetime')] <- lapply(x[c('Start_datetime','End_datetime')],
as.POSIXct, format='%Y-%m-%dT%H:%M:%S')
We need to generate a sequence of hourly timestamps:
startdate <- trunc(min(x$Start_datetime), units = "hours")
enddate <- round(max(x$End_datetime), units = "hours")
c(startdate, enddate)
# [1] "2012-02-01 07:00:00 PST" "2012-02-02 06:00:00 PST"
timestamps <- seq(startdate, enddate, by = "hour")
head(timestamps)
# [1] "2012-02-01 07:00:00 PST" "2012-02-01 08:00:00 PST" "2012-02-01 09:00:00 PST"
# [4] "2012-02-01 10:00:00 PST" "2012-02-01 11:00:00 PST" "2012-02-01 12:00:00 PST"
(Assumption: all end timestamps are after their start timestamps ...)
Now it's just a matter of tallying:
counts <- mapply(function(st,en) sum(st <= x$End_datetime & x$Start_datetime <= en),
timestamps[-length(timestamps)], timestamps[-1])
data.frame(
start = timestamps[ -length(timestamps) ],
count = counts
)
# start count
# 1 2012-02-01 07:00:00 2
# 2 2012-02-01 08:00:00 2
# 3 2012-02-01 09:00:00 2
# 4 2012-02-01 10:00:00 3
# 5 2012-02-01 11:00:00 3
# 6 2012-02-01 12:00:00 3
# 7 2012-02-01 13:00:00 3
# 8 2012-02-01 14:00:00 3
# 9 2012-02-01 15:00:00 3
# 10 2012-02-01 16:00:00 3
# 11 2012-02-01 17:00:00 3
# 12 2012-02-01 18:00:00 1
# 13 2012-02-01 19:00:00 0
# 14 2012-02-01 20:00:00 0
# 15 2012-02-01 21:00:00 0
# 16 2012-02-01 22:00:00 1
# 17 2012-02-01 23:00:00 1
# 18 2012-02-02 00:00:00 1
# 19 2012-02-02 01:00:00 1
# 20 2012-02-02 02:00:00 1
# 21 2012-02-02 03:00:00 1
# 22 2012-02-02 04:00:00 1
# 23 2012-02-02 05:00:00 1
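If you want the Date / Hour presentation sketched in the question, a small optional step (reusing timestamps and counts from above):
data.frame(
  Date = format(timestamps[-length(timestamps)], "%Y-%m-%d"),
  Hour = format(timestamps[-length(timestamps)], "%H:%M"),
  NumberofEmployeesWorking = counts
)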
I did not see @r2evans's answer before posting; I came up with this independently, though it looks similar. I'm posting it here in case it is helpful. Feel free to accept @r2evans's answer.
Data:
df1 <- read.table(text="EmployeeNum Start_datetime End_datetime
123 '2012-02-01 07:30:00' '2012-02-01 17:45:00'
342 '2012-02-01 08:00:00' '2012-02-01 17:45:00'
876 '2012-02-01 10:45:00' '2012-02-01 18:45:00'", header = TRUE )
df1 <- within(df1, Start_datetime <- as.POSIXct( Start_datetime))
df1 <- within(df1, End_datetime <- as.POSIXct( End_datetime))
Code:
Create a datetime sequence in 1-hour steps for each employee, then count the number of employees by Start_datetime.
Also, this code assumes your data covers a single day, so separate the original data by day before applying it. If your data has multiple days mixed in, the IDateTime() function from the data.table package can separate days from times and group by them while making the datetime sequence; a quick illustration follows.
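A quick illustration of that IDateTime() split (illustrative timestamp only):
library(data.table)
IDateTime(as.POSIXct("2012-02-01 07:30:00"))
#         idate    itime
# 1: 2012-02-01 07:30:00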
library('data.table')
setDT(df1) # assign data.table class by reference
df2 <- df1[, Map( f = function(x, y) seq( from = trunc(x, "hour"),
to = round(y, "hour"),
by = "1 hour" ),
x = Start_datetime, y = End_datetime ),
by = EmployeeNum ]
setnames(df2, "V1", "Start_datetime") # rename the sequence column that Map() left as V1
Output:
df2[, .N, by = .( Start_datetime, End_datetime = Start_datetime + 3599 ) ]
# Start_datetime End_datetime N
# 1: 2012-02-01 07:00:00 2012-02-01 07:59:59 1
# 2: 2012-02-01 08:00:00 2012-02-01 08:59:59 2
# 3: 2012-02-01 09:00:00 2012-02-01 09:59:59 2
# 4: 2012-02-01 10:00:00 2012-02-01 10:59:59 3
# 5: 2012-02-01 11:00:00 2012-02-01 11:59:59 3
# 6: 2012-02-01 12:00:00 2012-02-01 12:59:59 3
# 7: 2012-02-01 13:00:00 2012-02-01 13:59:59 3
# 8: 2012-02-01 14:00:00 2012-02-01 14:59:59 3
# 9: 2012-02-01 15:00:00 2012-02-01 15:59:59 3
# 10: 2012-02-01 16:00:00 2012-02-01 16:59:59 3
# 11: 2012-02-01 17:00:00 2012-02-01 17:59:59 3
# 12: 2012-02-01 18:00:00 2012-02-01 18:59:59 3
# 13: 2012-02-01 19:00:00 2012-02-01 19:59:59 1
Graph:
binwidth = 3600: the value indicates 1 hour = 60 min * 60 sec = 3600 seconds
library('ggplot2')
ggplot( data = df2,
mapping = aes( x = Start_datetime ) ) +
geom_histogram(binwidth = 3600, color = "red", fill = "white" ) +
scale_x_datetime( date_breaks = "1 hour", date_labels = "%H:%M" ) +
ylab("Number of Employees") +
xlab( "Working Hours: 2012-02-01" ) +
theme( axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank(),
panel.background = element_rect( fill = "white", color = "black") )
Thank you both for your answers. I came up with a solution which is pretty similar to yours, but I was wondering if you could have a look and let me know what you think of it.
I started a new empty dataframe, then made two nested loops to look at each start and end time in each row and generate a sequence of hours in between. Then I appended each hour in the sequence to the new empty dataframe. This way, I can simply do a count later.
staffDetailHours <- data.frame("personnelNum"=integer(0),
"workDate"=character(0),
"Hour"=integer(0))
for (i in 1:dim(DF)[1]) {
  # hours spanned by row i's shift (assumes START and END fall on the same day)
  hoursList <- seq(as.POSIXlt(DF[i,]$START)$hour,
                   as.POSIXlt(DF[i,]$END)$hour)
  # append one row per hour worked
  for (j in 1:length(hoursList)) {
    staffDetailHours[nrow(staffDetailHours)+1,] = list(
      DF[i,]$EmployeeNum,
      DF[i,]$Date,
      hoursList[j]
    )
  }
}
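For comparison, a loop-free sketch of the same idea that avoids growing the data frame one row at a time (assuming the same DF columns used in the loop above):
staffDetailHours <- do.call(rbind, lapply(seq_len(nrow(DF)), function(i) {
  hrs <- seq(as.POSIXlt(DF$START[i])$hour, as.POSIXlt(DF$END[i])$hour)
  data.frame(personnelNum = DF$EmployeeNum[i],
             workDate = DF$Date[i],
             Hour = hrs)
}))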