I have a dataframe grouped by bikeid and sorted by time. If type repeats consecutively, I want to keep only the earliest time. In the case below, I want to remove lines 17, 19, 33, 39 and 41.
Subtracting the value from the previous row by group is the next step; that will get me what I need once the duplicates are removed.
bikeid type time
1 1004 repair_time 2019-04-04 14:07:00
3 1004 red_time 2019-04-19 00:54:56
8 1004 repair_time 2019-04-19 12:47:00
10 1004 red_time 2019-04-19 16:45:18
15 1004 repair_time 2019-04-20 04:42:00
17 1004 repair_time 2019-04-20 05:29:00
19 1004 repair_time 2019-04-28 07:33:00
27 1010 repair_time 2019-04-20 10:05:00
29 1010 red_time 2019-04-22 20:51:21
33 1010 red_time 2019-04-23 11:02:34
37 1010 repair_time 2019-04-24 17:20:00
39 1010 repair_time 2019-04-24 18:30:00
41 1010 repair_time 2019-04-24 18:42:00
The final result should look like this:
bikeid type time
1 1004 repair_time 2019-04-04 14:07:00
3 1004 red_time 2019-04-19 00:54:56
8 1004 repair_time 2019-04-19 12:47:00
10 1004 red_time 2019-04-19 16:45:18
15 1004 repair_time 2019-04-20 04:42:00
27 1010 repair_time 2019-04-20 10:05:00
29 1010 red_time 2019-04-22 20:51:21
37 1010 repair_time 2019-04-24 17:20:00
One option is to use rleid (from data.table) to create a grouping variable along with the second column, and slice the first observation of each group. Here the time column is already arranged, so we don't have to do any ordering:
library(dplyr)
library(data.table)
df1 %>%
  group_by(V2, grp = rleid(V3)) %>%
  slice(1) %>%
  ungroup %>%
  select(-grp)
# A tibble: 8 x 4
# V1 V2 V3 V4
# <int> <int> <chr> <chr>
#1 1 1004 repair_time 2019-04-04 14:07:00
#2 3 1004 red_time 2019-04-19 00:54:56
#3 8 1004 repair_time 2019-04-19 12:47:00
#4 10 1004 red_time 2019-04-19 16:45:18
#5 15 1004 repair_time 2019-04-20 04:42:00
#6 27 1010 repair_time 2019-04-20 10:05:00
#7 29 1010 red_time 2019-04-22 20:51:21
#8 37 1010 repair_time 2019-04-24 17:20:00
Or use a data.table method: convert the 'data.frame' to 'data.table' (setDT(df1)), group by 'V2' and the rleid of 'V3', get the row index (.I) of the first observation in each group, extract it ($V1), and use it to subset the rows of the dataset:
library(data.table)
setDT(df1)[df1[, .I[1], .(V2, rleid(V3))]$V1]
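Unpacked, the inner query computes the global row number (.I) of the first row in each (V2, rleid(V3)) group; here is the same one-liner split into two steps (equivalent, just easier to read):
library(data.table)
setDT(df1)
# row index of the first observation in each (V2, rleid(V3)) group
idx <- df1[, .I[1], by = .(V2, rleid(V3))]$V1
df1[idx]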
data
df1 <- structure(list(V1 = c(1L, 3L, 8L, 10L, 15L, 17L, 19L, 27L, 29L,
33L, 37L, 39L, 41L), V2 = c(1004L, 1004L, 1004L, 1004L, 1004L,
1004L, 1004L, 1010L, 1010L, 1010L, 1010L, 1010L, 1010L), V3 = c("repair_time",
"red_time", "repair_time", "red_time", "repair_time", "repair_time",
"repair_time", "repair_time", "red_time", "red_time", "repair_time",
"repair_time", "repair_time"), V4 = c("2019-04-04 14:07:00",
"2019-04-19 00:54:56", "2019-04-19 12:47:00", "2019-04-19 16:45:18",
"2019-04-20 04:42:00", "2019-04-20 05:29:00", "2019-04-28 07:33:00",
"2019-04-20 10:05:00", "2019-04-22 20:51:21", "2019-04-23 11:02:34",
"2019-04-24 17:20:00", "2019-04-24 18:30:00", "2019-04-24 18:42:00"
)), class = "data.frame", row.names = c(NA, -13L))
Another option is to use lag to check whether the status is the same as in the previous row. As akrun notes, this works because the data is already sorted by time:
library(dplyr)
df %>%
  group_by(bikeid) %>%
  mutate(repeated = status == lag(status)) %>%
  # Need the is.na() check as the first element of each group is NA
  # due to the lag
  filter(!repeated | is.na(repeated))
Data setup code:
txt = "1 1004 repair_time 2019-04-04 14:07:00
3 1004 red_time 2019-04-19 00:54:56
8 1004 repair_time 2019-04-19 12:47:00
10 1004 red_time 2019-04-19 16:45:18
15 1004 repair_time 2019-04-20 04:42:00
17 1004 repair_time 2019-04-20 05:29:00
19 1004 repair_time 2019-04-28 07:33:00
27 1010 repair_time 2019-04-20 10:05:00
29 1010 red_time 2019-04-22 20:51:21
33 1010 red_time 2019-04-23 11:02:34
37 1010 repair_time 2019-04-24 17:20:00
39 1010 repair_time 2019-04-24 18:30:00
41 1010 repair_time 2019-04-24 18:42:00"
df = read.table(text = txt, header = FALSE)
colnames(df) = c("row", "bikeid", "status", "date", "time")
df$date = as.POSIXct(paste(df$date, df$time))
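With the duplicates removed, the "subtract value from previous row by group" step mentioned at the top can also be done with lag(). A minimal sketch building on the dplyr answer above (the time_since_prev column name is mine):
library(dplyr)
df %>%
  group_by(bikeid) %>%
  mutate(repeated = status == lag(status)) %>%
  filter(!repeated | is.na(repeated)) %>%
  # gap to the previous kept event within each bike;
  # NA for the first row of each group
  mutate(time_since_prev = date - lag(date)) %>%
  ungroup() %>%
  select(-repeated)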
I have some data that I extract from an Elasticsearch system that shows employees' availability over the course of a date range, broken down into hourly slots.
Associates are never available for 24 hours per day, but I want to display the data across 24 hourly slots, with 0s populating the cells where no data are present.
My thought is that I need to create a blank data frame, insert the results into it, and then populate the rest with 0s, but I would really like to know if there is a better way.
Note that the initial data frame is not always the same size, because different days return different numbers of hourly slots (seven hour-long slots, three, 12 and so on).
Also note that where there aren't any hits/results from the query results, the hourly slots in between do not appear as there are no associated data (see between 18:00 and 21:00).
At present, the whole data frame looks like this:
hour associate_count minutes_covered
<dttm> <int> <dbl>
1 2018-08-06 10:00:00 2 37
2 2018-08-06 11:00:00 2 60
3 2018-08-06 12:00:00 2 42
4 2018-08-06 13:00:00 1 56
5 2018-08-06 14:00:00 2 60
6 2018-08-06 15:00:00 2 60
7 2018-08-06 16:00:00 2 60
8 2018-08-06 17:00:00 1 52
9 2018-08-06 18:00:00 1 0 # NOTE THAT THERE IS A 3-HOUR GAP HERE UNTIL THE NEXT HIT
10 2018-08-06 21:00:00 1 10
The data behind the data frame:
df <- structure(list(hour = structure(c(1533546000, 1533549600, 1533553200,
1533556800, 1533560400, 1533564000, 1533567600, 1533571200, 1533574800
), class = c("POSIXct", "POSIXt"), tzone = "Europe/London"),
associate_count = c(2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L),
minutes_covered = c(37, 60, 42, 56, 60, 60, 60, 52, 0)), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
How can I get the data to look like this?
hour associate_count minutes_covered
1 2018-08-06 00:00 0 0
2 2018-08-06 01:00 0 0
3 2018-08-06 02:00 0 0
4 2018-08-06 03:00 0 0
5 2018-08-06 04:00 0 0
6 2018-08-06 05:00 0 0
7 2018-08-06 06:00 0 0
8 2018-08-06 07:00 0 0
9 2018-08-06 08:00 0 0
10 2018-08-06 09:00 0 0
11 2018-08-06 10:00 2 37
12 2018-08-06 11:00 2 60
13 2018-08-06 12:00 2 42
14 2018-08-06 13:00 1 56
15 2018-08-06 14:00 2 60
16 2018-08-06 15:00 2 60
17 2018-08-06 16:00 2 60
18 2018-08-06 17:00 1 52
19 2018-08-06 18:00 1 0
20 2018-08-06 19:00 0 0
21 2018-08-06 20:00 0 0
22 2018-08-06 21:00 1 10
23 2018-08-06 22:00 0 0
24 2018-08-06 23:00 0 0
You can use tidyr::complete for this. It also lets you specify additional variables to expand the data frame by, if needed.
library(tidyverse)
library(lubridate)
my_complete <- function(df, start_date, end_date) {
  start_hour <- str_c(start_date, " 00:00:00") %>% ymd_hms()
  end_hour <- str_c(end_date, " 00:00:00") %>% ymd_hms()
  df %>%
    complete(
      hour = seq(from = start_hour, to = end_hour, by = "hour"),
      fill = list(associate_count = 0L, minutes_covered = 0)
    )
}
my_complete(df, "2018-08-06", "2018-08-07")
#> # A tibble: 25 x 3
#> hour associate_count minutes_covered
#> <dttm> <int> <dbl>
#> 1 2018-08-06 00:00:00 0 0
#> 2 2018-08-06 01:00:00 0 0
#> 3 2018-08-06 02:00:00 0 0
#> 4 2018-08-06 03:00:00 0 0
#> 5 2018-08-06 04:00:00 0 0
#> 6 2018-08-06 05:00:00 0 0
#> 7 2018-08-06 06:00:00 0 0
#> 8 2018-08-06 07:00:00 0 0
#> 9 2018-08-06 08:00:00 0 0
#> 10 2018-08-06 09:00:00 2 37
#> # ... with 15 more rows
Created on 2018-08-13 by the reprex package (v0.2.0).
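One subtlety worth flagging (my note, not part of the original answer): ymd_hms() parses in UTC by default, while the sample data is in tz Europe/London, which is why the filled output above shows the 2/37 row at 09:00 rather than 10:00. Passing the data's time zone through keeps the labels aligned; a sketch with a hypothetical tz argument added to the helper:
my_complete_tz <- function(df, start_date, end_date, tz = "Europe/London") {
  start_hour <- ymd_hms(str_c(start_date, " 00:00:00"), tz = tz)
  end_hour <- ymd_hms(str_c(end_date, " 00:00:00"), tz = tz)
  df %>%
    complete(
      hour = seq(from = start_hour, to = end_hour, by = "hour"),
      fill = list(associate_count = 0L, minutes_covered = 0)
    )
}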
# create a sequence of hours for your day
# (the data is tz = "Europe/London", so build the sequence in the same zone)
allhours <- data.frame(hour = seq(from = as.POSIXct("2018-08-06 00:00", tz = "Europe/London"),
                                  to = as.POSIXct("2018-08-06 23:00", tz = "Europe/London"),
                                  by = "hours"))
# merge that sequence with your data (all = TRUE is important here)
res <- merge(df, allhours, by = "hour", all = TRUE)
# convert NAs to zeros
res[is.na(res$associate_count), "associate_count"] <- 0
res[is.na(res$minutes_covered), "minutes_covered"] <- 0
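As an aside (an alternative, not part of the original answer), the two NA-to-zero lines can be collapsed into a single call with tidyr::replace_na:
library(tidyr)
res <- replace_na(res, list(associate_count = 0, minutes_covered = 0))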
I'm not 100% sure what the expected output should be, but we can start with something like this:
library(dplyr)
library(lubridate)
new_df <- data.frame(hour = seq(ymd_hms('2018-08-06 00:00:00'),
                                ymd_hms('2018-08-06 23:00:00'), by = '1 hour'))
Now we can join with the old data frame
new_df %>% left_join(df)
Joining, by = "hour"
hour associate_count minutes_covered
1 2018-08-06 00:00:00 NA NA
2 2018-08-06 01:00:00 NA NA
3 2018-08-06 02:00:00 NA NA
4 2018-08-06 03:00:00 NA NA
5 2018-08-06 04:00:00 NA NA
6 2018-08-06 05:00:00 NA NA
7 2018-08-06 06:00:00 NA NA
8 2018-08-06 07:00:00 NA NA
9 2018-08-06 08:00:00 NA NA
10 2018-08-06 09:00:00 2 37
11 2018-08-06 10:00:00 2 60
12 2018-08-06 11:00:00 2 42
13 2018-08-06 12:00:00 1 56
14 2018-08-06 13:00:00 2 60
15 2018-08-06 14:00:00 2 60
16 2018-08-06 15:00:00 2 60
17 2018-08-06 16:00:00 1 52
18 2018-08-06 17:00:00 1 0
19 2018-08-06 18:00:00 NA NA
20 2018-08-06 19:00:00 NA NA
21 2018-08-06 20:00:00 NA NA
22 2018-08-06 21:00:00 NA NA
23 2018-08-06 22:00:00 NA NA
24 2018-08-06 23:00:00 NA NA
If you absolutely must get rid of the NAs and need them to be zero, you can add another pipe step like this: %>% mutate_at(c(2:3), funs(replace(., is.na(.), 0)))
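Note that funs() has since been deprecated in dplyr (0.8+); a sketch of the same NA replacement with the newer across() syntax:
new_df %>%
  left_join(df, by = "hour") %>%
  mutate(across(c(associate_count, minutes_covered),
                ~ replace(.x, is.na(.x), 0)))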
I'm new to R and to solving this kind of problem, so I'm not sure how certain functionality is achieved in particular instances.
I have a dataframe as such:
library(data.table)
df <- data.frame(DATETIME = seq(from = as.POSIXct('2014-01-01 00:00', tz = "GMT"),
                                to = as.POSIXct('2014-01-01 06:00', tz = "GMT"),
                                by = '15 mins'),
                 Price = c(23,22,23,24,27,31,33,34,31,26,24,23,19,18,19,19,23,25,26,26,27,30,26,25,24),
                 TroughPriceFlag = c(0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0))
df <- data.table(df)
df
DATETIME Price TroughPriceFlag
1: 2014-01-01 00:00:00 23 0
2: 2014-01-01 00:15:00 22 1
3: 2014-01-01 00:30:00 23 0
4: 2014-01-01 00:45:00 24 0
5: 2014-01-01 01:00:00 27 0
6: 2014-01-01 01:15:00 31 0
7: 2014-01-01 01:30:00 33 0
8: 2014-01-01 01:45:00 34 0
9: 2014-01-01 02:00:00 31 0
10: 2014-01-01 02:15:00 26 0
11: 2014-01-01 02:30:00 24 0
12: 2014-01-01 02:45:00 23 0
13: 2014-01-01 03:00:00 19 0
14: 2014-01-01 03:15:00 18 1
15: 2014-01-01 03:30:00 19 0
16: 2014-01-01 03:45:00 19 0
17: 2014-01-01 04:00:00 23 0
18: 2014-01-01 04:15:00 25 0
19: 2014-01-01 04:30:00 26 0
20: 2014-01-01 04:45:00 26 0
21: 2014-01-01 05:00:00 27 0
22: 2014-01-01 05:15:00 30 0
23: 2014-01-01 05:30:00 26 0
24: 2014-01-01 05:45:00 25 0
25: 2014-01-01 06:00:00 24 0
What I wish to do is two things:
(1) From where we observe a TroughPrice, flag the first instance where the price has risen by 10 or more dollars. That is, find the first instance where deltaPrice >= 10 since the trough price.
As an example: from the trough price of 22 (row 2), in the next interval the price increases to 23, a change of 1 dollar, so no flag. From the same trough price of 22 (again row 2, since we always work with reference to the trough in question), two intervals later the price is 24 dollars, an increase of 2 dollars since the trough, so again no flag. However, 5 intervals after the trough the price has risen to 33 dollars, an increase of 11 dollars and the first time the price has increased by 10 dollars or more. Thus the flag is 1.
(2) Determine the number of 15 minute periods which have passed between the trough price and the first instance the price has risen by 10 or more dollars.
The resulting dataframe should look like this:
DATETIME Price TroughPriceFlag FirstOver10CentsFlag CountPeriods
1 2014-01-01 00:00:00 23 0 0 NA
2 2014-01-01 00:15:00 22 1 0 5
3 2014-01-01 00:30:00 23 0 0 NA
4 2014-01-01 00:45:00 24 0 0 NA
5 2014-01-01 01:00:00 27 0 0 NA
6 2014-01-01 01:15:00 31 0 0 NA
7 2014-01-01 01:30:00 33 0 1 NA
8 2014-01-01 01:45:00 34 0 0 NA
9 2014-01-01 02:00:00 31 0 0 NA
10 2014-01-01 02:15:00 26 0 0 NA
11 2014-01-01 02:30:00 24 0 0 NA
12 2014-01-01 02:45:00 23 0 0 NA
13 2014-01-01 03:00:00 19 0 0 NA
14 2014-01-01 03:15:00 18 1 0 8
15 2014-01-01 03:30:00 19 0 0 NA
16 2014-01-01 03:45:00 19 0 0 NA
17 2014-01-01 04:00:00 23 0 0 NA
18 2014-01-01 04:15:00 25 0 0 NA
19 2014-01-01 04:30:00 26 0 0 NA
20 2014-01-01 04:45:00 26 0 0 NA
21 2014-01-01 05:00:00 27 0 0 NA
22 2014-01-01 05:15:00 30 0 1 NA
23 2014-01-01 05:30:00 26 0 0 NA
24 2014-01-01 05:45:00 25 0 0 NA
25 2014-01-01 06:00:00 24 0 0 NA
I'm not really sure where to start, since the time gaps can be quite large and I've only used indexing in the context of a few steps forward/backward. Please help!
Thanks in advance
You can chain operations with the data.table package; the idea is to group by the cumulative sum of TroughPriceFlag:
library(data.table)
# pmatch() matches TRUE at most once (duplicates.ok = FALSE), so only the
# first row in each trough group where the price has risen by 10 or more
# dollars gets col1 = 1; which(col1 == 1)[1] gives NA for groups without
# such a row instead of a zero-length result
df[, col1 := pmatch(Price - Price[1] >= 10, TRUE, nomatch = 0),
   by = cumsum(TroughPriceFlag)][
  , count := which(col1 == 1)[1] - 1L, by = cumsum(TroughPriceFlag)][
  TroughPriceFlag == 0, count := NA]
#> df
# DATETIME Price TroughPriceFlag col1 count
# 1: 2014-01-01 00:00:00 23 0 0 NA
# 2: 2014-01-01 00:15:00 22 1 0 5
# 3: 2014-01-01 00:30:00 23 0 0 NA
# 4: 2014-01-01 00:45:00 24 0 0 NA
# 5: 2014-01-01 01:00:00 27 0 0 NA
# 6: 2014-01-01 01:15:00 31 0 0 NA
# 7: 2014-01-01 01:30:00 33 0 1 NA
# 8: 2014-01-01 01:45:00 34 0 0 NA
# 9: 2014-01-01 02:00:00 31 0 0 NA
#10: 2014-01-01 02:15:00 26 0 0 NA
#11: 2014-01-01 02:30:00 24 0 0 NA
#12: 2014-01-01 02:45:00 23 0 0 NA
#13: 2014-01-01 03:00:00 19 0 0 NA
#14: 2014-01-01 03:15:00 18 1 0 8
#15: 2014-01-01 03:30:00 19 0 0 NA
#16: 2014-01-01 03:45:00 19 0 0 NA
#17: 2014-01-01 04:00:00 23 0 0 NA
#18: 2014-01-01 04:15:00 25 0 0 NA
#19: 2014-01-01 04:30:00 26 0 0 NA
#20: 2014-01-01 04:45:00 26 0 0 NA
#21: 2014-01-01 05:00:00 27 0 0 NA
#22: 2014-01-01 05:15:00 30 0 1 NA
#23: 2014-01-01 05:30:00 26 0 0 NA
#24: 2014-01-01 05:45:00 25 0 0 NA
#25: 2014-01-01 06:00:00 24 0 0 NA
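The pmatch trick is worth a second look: because pmatch() refuses to match the same table entry twice, only the first TRUE in each group gets a 1. A minimal illustration:
pmatch(c(FALSE, TRUE, TRUE, FALSE), TRUE, nomatch = 0)
# [1] 0 1 0 0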
I'm just trying to figure out how to do a conditional join on two data.tables.
I've written a sqldf conditional join to give me the circuits whose start or finish times are within the other's start/finish times.
sqldf("select dt2.start, dt2.finish, dt2.counts, dt1.id, dt1.circuit
from dt2
left join dt1 on (
(dt2.start >= dt1.start and dt2.start < dt1.finish) or
(dt2.finish >= dt1.start and dt2.finish < dt1.finish)
)")
This gives me the correct result, but it's too slow for my large-ish data set.
What's the data.table way to do this without a vector scan?
Here's my data:
dt1 <- data.table(structure(list(circuit = structure(c(2L, 1L, 2L, 1L, 2L, 3L,
1L, 1L, 2L), .Label = c("a", "b", "c"), class = "factor"), start = structure(c(1393621200,
1393627920, 1393628400, 1393631520, 1393650300, 1393646400, 1393656000,
1393668000, 1393666200), class = c("POSIXct", "POSIXt"), tzone = ""),
end = structure(c(1393626600, 1393631519, 1393639200, 1393632000,
1393660500, 1393673400, 1393667999, 1393671600, 1393677000
), class = c("POSIXct", "POSIXt"), tzone = ""), id = structure(1:9, .Label = c("1001",
"1002", "1003", "1004", "1005", "1006", "1007", "1008", "1009"
), class = "factor")), .Names = c("circuit", "start", "end",
"id"), class = "data.frame", row.names = c(NA, -9L)))
dt2 <- data.table(structure(list(start = structure(c(1393621200, 1393624800, 1393626600,
1393627919, 1393628399, 1393632000, 1393639200, 1393646399, 1393650299,
1393655999, 1393660500, 1393666199, 1393671600, 1393673400), class = c("POSIXct",
"POSIXt"), tzone = ""), end = structure(c(1393624799, 1393626600,
1393627919, 1393628399, 1393632000, 1393639200, 1393646399, 1393650299,
1393655999, 1393660500, 1393666199, 1393671600, 1393673400, 1393677000
), class = c("POSIXct", "POSIXt"), tzone = ""), seconds = c(3599L,
1800L, 1319L, 480L, 3601L, 7200L, 7199L, 3900L, 5700L, 4501L,
5699L, 5401L, 1800L, 3600L), counts = c(1L, 1L, 0L, 1L, 2L, 1L,
0L, 1L, 2L, 3L, 2L, 3L, 2L, 1L)), .Names = c("start", "end",
"seconds", "counts"), row.names = c(1L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 13L, 14L, 15L), class = "data.frame"))
Using non-equi joins:
ans = dt1[dt2, on = .(start <= end, end > start),
          .(i.start, i.end, counts, id, circuit,
            cndn = i.start < x.start & i.end >= x.end),
          allow.cartesian = TRUE
         ][!cndn %in% TRUE]
The condition start <= end, end >= start (note the >= in both cases) would check whether two intervals overlap at all. The open interval on one side is accomplished by the end > start part (> instead of >=). But it still also picks up intervals of this type:
dt1: start=================end
dt2: start--------------------------------end ## start < start, end > end
and
dt1: start=================end
dt2: start----------end ## end == end
The cndn column is there to check for and remove these cases. Hopefully those cases aren't numerous, so we don't materialise too many unwanted rows.
PS: the solution here is still not as straightforward as I'd like, because it requires an OR operation. It is possible to do two conditional joins and then bind them together, though; see the sketch after the next paragraph.
Perhaps at some point, we'll have to think about the feasibility of extending joins to these kinds of operations in a more straightforward manner.
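A sketch of that two-join idea (my reading of the PS, not code from the answer): run one non-equi join per side of the OR in the sqldf query, then bind and de-duplicate. This uses inner-join semantics; recovering the NA rows of the LEFT JOIN would take an extra anti-join:
library(data.table)
# dt2$start falls inside [dt1$start, dt1$end)
j1 <- dt1[dt2, on = .(start <= start, end > start), nomatch = 0L,
          .(start = i.start, end = i.end, counts, id, circuit)]
# dt2$end falls inside [dt1$start, dt1$end)
j2 <- dt1[dt2, on = .(start <= end, end > end), nomatch = 0L,
          .(start = i.start, end = i.end, counts, id, circuit)]
# bind the two sides of the OR and drop rows matched by both
ans <- unique(rbind(j1, j2))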
No idea if this performs faster, but here's a shot at a data.table method. I reshape dt1 and use findInterval to identify where the times in dt2 line up with the times in dt1.
Using dt1 and dt2 as defined in the question:
# > dt1
# circuit start end id
# 1: b 2014-02-28 16:00:00 2014-02-28 17:30:00 1001
# 2: a 2014-02-28 17:52:00 2014-02-28 18:51:59 1002
# 3: b 2014-02-28 18:00:00 2014-02-28 21:00:00 1003
# 4: a 2014-02-28 18:52:00 2014-02-28 19:00:00 1004
# 5: b 2014-03-01 00:05:00 2014-03-01 02:55:00 1005
# 6: c 2014-02-28 23:00:00 2014-03-01 06:30:00 1006
# 7: a 2014-03-01 01:40:00 2014-03-01 04:59:59 1007
# 8: a 2014-03-01 05:00:00 2014-03-01 06:00:00 1008
# 9: b 2014-03-01 04:30:00 2014-03-01 07:30:00 1009
# > dt2
# start end seconds counts
# 1: 2014-02-28 16:00:00 2014-02-28 16:59:59 3599 1
# 2: 2014-02-28 17:00:00 2014-02-28 17:30:00 1800 1
# 3: 2014-02-28 17:30:00 2014-02-28 17:51:59 1319 0
# 4: 2014-02-28 17:51:59 2014-02-28 17:59:59 480 1
# 5: 2014-02-28 17:59:59 2014-02-28 19:00:00 3601 2
# 6: 2014-02-28 19:00:00 2014-02-28 21:00:00 7200 1
# 7: 2014-02-28 21:00:00 2014-02-28 22:59:59 7199 0
# 8: 2014-02-28 22:59:59 2014-03-01 00:04:59 3900 1
# 9: 2014-03-01 00:04:59 2014-03-01 01:39:59 5700 2
# 10: 2014-03-01 01:39:59 2014-03-01 02:55:00 4501 3
# 11: 2014-03-01 02:55:00 2014-03-01 04:29:59 5699 2
# 12: 2014-03-01 04:29:59 2014-03-01 06:00:00 5401 3
# 13: 2014-03-01 06:00:00 2014-03-01 06:30:00 1800 2
# 14: 2014-03-01 06:30:00 2014-03-01 07:30:00 3600 1
## reshapes dt1 from wide to long
## puts start and end times into one column and sorts by time
## this is so that you can use findInterval later
dt3 <- dt1[,list(time = c(start,end)), by = "circuit,id"][order(time)]
dt3[,ntvl := seq_len(nrow(dt3))]
# circuit id time ntvl
# 1: b 1001 2014-02-28 16:00:00 1
# 2: b 1001 2014-02-28 17:30:00 2
# 3: a 1002 2014-02-28 17:52:00 3
# 4: b 1003 2014-02-28 18:00:00 4
# 5: a 1002 2014-02-28 18:51:59 5
# 6: a 1004 2014-02-28 18:52:00 6
# 7: a 1004 2014-02-28 19:00:00 7
# 8: b 1003 2014-02-28 21:00:00 8
# 9: c 1006 2014-02-28 23:00:00 9
# 10: b 1005 2014-03-01 00:05:00 10
# 11: a 1007 2014-03-01 01:40:00 11
# 12: b 1005 2014-03-01 02:55:00 12
# 13: b 1009 2014-03-01 04:30:00 13
# 14: a 1007 2014-03-01 04:59:59 14
# 15: a 1008 2014-03-01 05:00:00 15
# 16: a 1008 2014-03-01 06:00:00 16
# 17: c 1006 2014-03-01 06:30:00 17
# 18: b 1009 2014-03-01 07:30:00 18
## map interval to id
dt4 <- dt3[, list(ntvl = seq(from = min(ntvl), to = max(ntvl) - 1, by = 1)), by = "circuit,id"]
setkey(dt4, ntvl)
# circuit id ntvl
# 1: b 1001 1
# 2: a 1002 3
# 3: a 1002 4
# 4: b 1003 4
# 5: b 1003 5
# 6: b 1003 6
# 7: a 1004 6
# 8: b 1003 7
# 9: c 1006 9
# 10: c 1006 10
# 11: b 1005 10
# 12: c 1006 11
# 13: b 1005 11
# 14: a 1007 11
# 15: c 1006 12
# 16: a 1007 12
# 17: c 1006 13
# 18: a 1007 13
# 19: b 1009 13
# 20: c 1006 14
# 21: b 1009 14
# 22: c 1006 15
# 23: b 1009 15
# 24: a 1008 15
# 25: c 1006 16
# 26: b 1009 16
# 27: b 1009 17
# circuit id ntvl
## finds intervals in dt2
dt2[,`:=`(ntvl_start = findInterval(start, dt3[["time"]], rightmost.closed = FALSE),
ntvl_end = findInterval(end, dt3[["time"]], rightmost.closed = FALSE))]
# start end seconds counts ntvl_start ntvl_end
# 1: 2014-02-28 16:00:00 2014-02-28 16:59:59 3599 1 1 1
# 2: 2014-02-28 17:00:00 2014-02-28 17:30:00 1800 1 1 2
# 3: 2014-02-28 17:30:00 2014-02-28 17:51:59 1319 0 2 2
# 4: 2014-02-28 17:51:59 2014-02-28 17:59:59 480 1 2 3
# 5: 2014-02-28 17:59:59 2014-02-28 19:00:00 3601 2 3 7
# 6: 2014-02-28 19:00:00 2014-02-28 21:00:00 7200 1 7 8
# 7: 2014-02-28 21:00:00 2014-02-28 22:59:59 7199 0 8 8
# 8: 2014-02-28 22:59:59 2014-03-01 00:04:59 3900 1 8 9
# 9: 2014-03-01 00:04:59 2014-03-01 01:39:59 5700 2 9 10
# 10: 2014-03-01 01:39:59 2014-03-01 02:55:00 4501 3 10 12
# 11: 2014-03-01 02:55:00 2014-03-01 04:29:59 5699 2 12 12
# 12: 2014-03-01 04:29:59 2014-03-01 06:00:00 5401 3 12 16
# 13: 2014-03-01 06:00:00 2014-03-01 06:30:00 1800 2 16 17
# 14: 2014-03-01 06:30:00 2014-03-01 07:30:00 3600 1 17 18
## joins, by start time, then by end time
## the commented out lines may be a better alternative
## if there are many NA values
setkey(dt2, ntvl_start)
dt_ans_start <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = NA]
# dt_ans_start <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = 0]
# dt_ans_start_na <- dt2[!dt4]
setkey(dt2, ntvl_end)
dt_ans_end <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = NA]
# dt_ans_end <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = 0]
# dt_ans_end_na <- dt2[!dt4]
## bring them all together and remove duplicates
dt_ans <- unique(rbind(dt_ans_start, dt_ans_end), by = c("start", "id"))
dt_ans <- dt_ans[!(is.na(id) & counts > 0)]
dt_ans[,ntvl := NULL]
setkey(dt_ans,start)
# start end counts id circuit
# 1: 2014-02-28 16:00:00 2014-02-28 16:59:59 1 1001 b
# 2: 2014-02-28 17:00:00 2014-02-28 17:30:00 1 1001 b
# 3: 2014-02-28 17:30:00 2014-02-28 17:51:59 0 NA NA
# 4: 2014-02-28 17:51:59 2014-02-28 17:59:59 1 1002 a
# 5: 2014-02-28 17:59:59 2014-02-28 19:00:00 2 1002 a
# 6: 2014-02-28 17:59:59 2014-02-28 19:00:00 2 1003 b
# 7: 2014-02-28 19:00:00 2014-02-28 21:00:00 1 1003 b
# 8: 2014-02-28 21:00:00 2014-02-28 22:59:59 0 NA NA
# 9: 2014-02-28 22:59:59 2014-03-01 00:04:59 1 1006 c
# 10: 2014-03-01 00:04:59 2014-03-01 01:39:59 2 1006 c
# 11: 2014-03-01 00:04:59 2014-03-01 01:39:59 2 1005 b
# 12: 2014-03-01 01:39:59 2014-03-01 02:55:00 3 1006 c
# 13: 2014-03-01 01:39:59 2014-03-01 02:55:00 3 1005 b
# 14: 2014-03-01 01:39:59 2014-03-01 02:55:00 3 1007 a
# 15: 2014-03-01 02:55:00 2014-03-01 04:29:59 2 1006 c
# 16: 2014-03-01 02:55:00 2014-03-01 04:29:59 2 1007 a
# 17: 2014-03-01 04:29:59 2014-03-01 06:00:00 3 1006 c
# 18: 2014-03-01 04:29:59 2014-03-01 06:00:00 3 1007 a
# 19: 2014-03-01 04:29:59 2014-03-01 06:00:00 3 1009 b
# 20: 2014-03-01 06:00:00 2014-03-01 06:30:00 2 1006 c
# 21: 2014-03-01 06:00:00 2014-03-01 06:30:00 2 1009 b
# 22: 2014-03-01 06:30:00 2014-03-01 07:30:00 1 1009 b
# start end counts id circuit
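For completeness (my addition, not from either answer): data.table also ships foverlaps(), which is purpose-built for interval overlap joins. A minimal sketch; note foverlaps() treats intervals as closed on both ends, so rows touching exactly at a boundary can differ from the half-open sqldf condition:
library(data.table)
# foverlaps() requires the interval columns to be the last two key columns of y
setkey(dt1, start, end)
ov <- foverlaps(dt2, dt1, by.x = c("start", "end"), type = "any", nomatch = NA)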