I want to write an R function with some optional params. It should subset some data by two core params, and then I want the option to pass additional constraints. E.g.
filter_func <- function(start_datetime, end_datetime, user=*, type=*){
  as.data.frame(subset(df, format(df$datetime, "%Y-%m-%d %H:%M:%S") > start_datetime &
                           format(df$datetime, "%Y-%m-%d %H:%M:%S") < end_datetime &
                           df$user == user &
                           df$type == type))
}
So... if I pass a param it constrains the result on that column (user or type), but if I don't, it uses the wildcard and gets everything in the column?
I've seen examples here that use %in% or grepl(), but those seem more aimed at cases where you have part of a string and want the rest... like new_york gets both new_york_city and new_york_state... I don't want to get any values that don't exactly match the param!
edit: now with examples
So... ideally go from something like this...
start | end | user | type |
-----------------|------------------|------|------|
2017-01-01 11:00 | 2017-01-01 20:00 | usr1 | typ1 |
2017-01-01 12:00 | 2017-01-01 19:00 | usr2 | typ2 |
2017-01-01 02:00 | 2017-01-01 03:00 | usr2 | typ1 |
2017-03-01 01:00 | 2017-03-01 09:00 | usr1 | typ2 |
2017-04-01 05:00 | 2017-04-01 07:00 | usr3 | typ4 |
2017-05-01 01:00 | 2017-05-01 08:00 | usr2 | typ5 |
use my function filter_func(2017-01-01 00:00, 2017-01-01 23:59)
gets me:
start | end | user | type |
-----------------|------------------|------|------|
2017-01-01 11:00 | 2017-01-01 20:00 | usr1 | typ1 |
2017-01-01 12:00 | 2017-01-01 19:00 | usr2 | typ2 |
2017-01-01 02:00 | 2017-01-01 03:00 | usr2 | typ1 |
but if I add a param, filter_func(2017-01-01 00:00, 2017-01-01 23:59, usr2)
start | end | user | type |
-----------------|------------------|------|------|
2017-01-01 12:00 | 2017-01-01 19:00 | usr2 | typ2 |
2017-01-01 02:00 | 2017-01-01 03:00 | usr2 | typ1 |
or even filter_func(2017-01-01 00:00, 2017-01-01 23:59, usr2, typ2)
start | end | user | type |
-----------------|------------------|------|------|
2017-01-01 12:00 | 2017-01-01 19:00 | usr2 | typ2 |
First, a few notes:
- [ is safer for programmatic use than subset.
- You don't need format, which turns datetime objects into strings; you need as.POSIXct or the like, which parses strings into datetimes. You could do this in the function, but you should do it before the function, as you'll always want your datetimes parsed, and there's no point in doing it repeatedly.
- You can update a version of the data.frame internal to the function in several steps, which allows you to use control flow like if. You'll still need to check whether the variable exists. Two options:
  - Use missing, which is built for checking whether function parameters exist.
  - Supply a default value of NULL and use is.null.
- You'll need to pass in quoted strings or parsed datetimes; the < and > operators will try to coerce objects that don't match to the same class (see the one-line check after this list).
- I added a parameter to pass in the data.frame first, which gives the function broader use, but this is not necessary.
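As a quick illustration of the coercion point (a made-up one-liner, not part of the original example):
# the character operand is coerced with as.POSIXct before the comparison
as.POSIXct("2017-01-01 12:00") > "2017-01-01 00:00"
#> [1] TRUE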
Altogether, then,
df <- data.frame(start = c("2017-01-01 11:00", "2017-01-01 12:00", "2017-01-01 02:00",
"2017-03-01 01:00", "2017-04-01 05:00", "2017-05-01 01:00"),
end = c("2017-01-01 20:00", "2017-01-01 19:00", "2017-01-01 03:00",
"2017-03-01 09:00", "2017-04-01 07:00", "2017-05-01 08:00"),
user = c("usr1", "usr2", "usr2", "usr1", "usr3", "usr2"),
type = c( "typ1", "typ2", "typ1", "typ2", "typ4", "typ5"))
# parse in two steps if you like, e.g. df$start <- as.POSIXct(df$start)
df[1:2] <- lapply(df[1:2], as.POSIXct)
filter_func <- function(x, start_time, end_time, usr, typ = NULL){
x <- x[x$start > start_time & x$end < end_time, ]
if (!missing(usr)) {
x <- x[x$user %in% usr, ]
}
if (!is.null(typ)) {
x <- x[x$type %in% typ, ]
}
x
}
and test it:
str(df)
#> 'data.frame': 6 obs. of 4 variables:
#> $ start: POSIXct, format: "2017-01-01 11:00:00" "2017-01-01 12:00:00" ...
#> $ end : POSIXct, format: "2017-01-01 20:00:00" "2017-01-01 19:00:00" ...
#> $ user : Factor w/ 3 levels "usr1","usr2",..: 1 2 2 1 3 2
#> $ type : Factor w/ 4 levels "typ1","typ2",..: 1 2 1 2 3 4
filter_func(df, as.POSIXct('2017-01-01 00:00'), as.POSIXct('2017-01-01 23:59'))
#> start end user type
#> 1 2017-01-01 11:00:00 2017-01-01 20:00:00 usr1 typ1
#> 2 2017-01-01 12:00:00 2017-01-01 19:00:00 usr2 typ2
#> 3 2017-01-01 02:00:00 2017-01-01 03:00:00 usr2 typ1
filter_func(df, '2017-01-01 00:00', '2017-01-01 23:59')
#> start end user type
#> 1 2017-01-01 11:00:00 2017-01-01 20:00:00 usr1 typ1
#> 2 2017-01-01 12:00:00 2017-01-01 19:00:00 usr2 typ2
#> 3 2017-01-01 02:00:00 2017-01-01 03:00:00 usr2 typ1
filter_func(df, '2017-01-01 00:00', '2017-01-01 23:59', 'usr2')
#> start end user type
#> 2 2017-01-01 12:00:00 2017-01-01 19:00:00 usr2 typ2
#> 3 2017-01-01 02:00:00 2017-01-01 03:00:00 usr2 typ1
filter_func(df, '2017-01-01 00:00', '2017-01-01 23:59', 'usr2', 'typ2')
#> start end user type
#> 2 2017-01-01 12:00:00 2017-01-01 19:00:00 usr2 typ2
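Regarding the %in% worry in the question: %in% tests exact set membership (it is built on match), not partial matching, so it won't pull in new_york_city for new_york. A quick check, reusing the question's example strings:
c("new_york_city", "new_york", "new_york_state") %in% "new_york"
#> [1] FALSE  TRUE FALSE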
You can use grepl() for pattern matching, with a match-everything default ('.*' rather than the glob-style '*', which is not a valid regular expression on its own):
filter_func <- function(start_datetime, end_datetime, user_='.*', type_='.*'){
  subset(df, as.POSIXlt(df$start) > as.POSIXlt(start_datetime) &
             as.POSIXlt(df$end) < as.POSIXlt(end_datetime) &
             grepl(user_, df$user) &
             grepl(type_, df$type))
}
filter_func(start_datetime='2017-01-01 00:00', end_datetime='2017-01-01 23:59')
# start end user type
#1 2017-01-01 11:00 2017-01-01 20:00 usr1 typ1
#2 2017-01-01 12:00 2017-01-01 19:00 usr2 typ2
#3 2017-01-01 02:00 2017-01-01 03:00 usr2 typ1
filter_func(start_datetime='2017-01-01 00:00', end_datetime='2017-01-01 23:59', user_='usr2')
# start end user type
#2 2017-01-01 12:00 2017-01-01 19:00 usr2 typ2
#3 2017-01-01 02:00 2017-01-01 03:00 usr2 typ1
filter_func(start_datetime='2017-01-01 00:00', end_datetime='2017-01-01 23:59', user_='usr2', type_='typ2')
# start end user type
#2 2017-01-01 12:00 2017-01-01 19:00 usr2 typ2
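One caveat: grepl() does substring/pattern matching, so user_='usr2' would also match a hypothetical value like 'usr22'. To guarantee exact matches (which is what the question asks for), you could anchor the patterns, e.g.:
# anchored: 'usr2' then only matches the whole string 'usr2',
# while the '.*' default becomes '^.*$' and still matches everything
grepl(paste0('^', user_, '$'), df$user) &
grepl(paste0('^', type_, '$'), df$type)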
I am trying to count the number of events that occur within each interval, for each level of my factor (mystations).
Below is a MWE:
library(lubridate)
myintervals <- c(dmy_hms(
"01/01/2000 08:00:00",
"25/02/2000 09:00:00",
"01/03/2000 10:00:00",
"30/04/2000 11:00:00",
"01/05/2000 12:00:00",
"30/06/2000 13:00:00",
"01/07/2000 14:00:00",
"30/08/2000 15:00:00",
"01/09/2000 16:00:00",
"30/10/2000 17:00:00"))
mystations <- c("A","B","C","A","B","C","A","B","C","D")
mydata <- data.frame(myintervals,mystations)
           myintervals mystations
1  2000-01-01 08:00:00          A
2  2000-02-25 09:00:00          B
3  2000-03-01 10:00:00          C
4  2000-04-30 11:00:00          A
5  2000-05-01 12:00:00          B
6  2000-06-30 13:00:00          C
7  2000-07-01 14:00:00          A
8  2000-08-30 15:00:00          B
9  2000-09-01 16:00:00          C
10 2000-10-30 17:00:00          D
Here I am creating the detections:
date.time <- c(dmy_hms(
"31/12/1999 08:00:00",
"24/02/2000 09:00:00",
"25/02/2000 08:00:00",
"26/02/2000 10:00:00",
"27/02/2000 11:00:00",
"01/03/2000 10:00:00",
"10/03/2000 22:00:00",
"20/03/2000 23:00:00",
"01/04/2000 10:00:00",
"20/04/2000 20:00:00",
"25/04/2000 08:00:00",
"30/04/2000 10:00:00",
"01/05/2000 12:00:00",
"10/05/2000 20:00:00",
"20/05/2000 08:00:00",
"30/06/2000 13:00:00",
"10/07/2000 10:00:00",
"20/07/2000 20:00:00",
"30/08/2000 15:00:00",
"01/09/2000 16:00:00"))
mydetections <- data.frame(date.time=date.time,mystations=mystations)
             date.time mystations
1  1999-12-31 08:00:00          A
2  2000-02-24 09:00:00          B
3  2000-02-25 08:00:00          C
4  2000-02-26 10:00:00          A
5  2000-02-27 11:00:00          B
6  2000-03-01 10:00:00          C
7  2000-03-10 22:00:00          A
8  2000-03-20 23:00:00          B
9  2000-04-01 10:00:00          C
10 2000-04-20 20:00:00          D
11 2000-04-25 08:00:00          A
12 2000-04-30 10:00:00          B
13 2000-05-01 12:00:00          C
14 2000-05-10 20:00:00          A
15 2000-05-20 08:00:00          B
16 2000-06-30 13:00:00          C
17 2000-07-10 10:00:00          A
18 2000-07-20 20:00:00          B
19 2000-08-30 15:00:00          C
20 2000-09-01 16:00:00          D
The origins for each interval are here:
myorigins <- data.frame(
  myintervals = dmy_hms("01/01/1970 00:00:00", "01/04/1970 00:00:00",
                        "01/08/1970 00:00:00", "01/12/1970 00:00:00"),
  mystations = unique(mydata$mystations))
The expected output is this:
myintervals mystation value
1 1970-01-01 00:00:00 UTC--2000-01-01 08:00:00 UTC A 1
2 2000-01-01 08:00:00 UTC--2000-04-30 11:00:00 UTC A 3
3 2000-04-30 11:00:00 UTC--2000-07-01 14:00:00 UTC A 1
4 1970-04-01 00:00:00 UTC--2000-02-25 09:00:00 UTC B 1
5 2000-02-25 09:00:00 UTC--2000-05-01 12:00:00 UTC B 3
6 2000-05-01 12:00:00 UTC--2000-08-30 15:00:00 UTC B 2
7 1970-08-01 00:00:00 UTC--2000-03-01 10:00:00 UTC C 2
8 2000-03-01 10:00:00 UTC--2000-06-30 13:00:00 UTC C 3
9 2000-06-30 13:00:00 UTC--2000-09-01 16:00:00 UTC C 1
10 1970-12-01 00:00:00 UTC--2000-10-30 17:00:00 UTC D 1
What I was able to achieve so far is this:
#line by line
library(tidyverse)   # for bind_rows, arrange, group_split
mydata <- bind_rows(mydata, myorigins)   # add_row expects name-value pairs, so bind_rows is used
mydata <- arrange(mydata,mystations,myintervals)
DF <- group_split(mydata,mystations)
Y <- lapply(seq_along(DF), function(x) as.data.frame(DF[[x]]))
names(Y) <- c(unique(mydata$mystations))
list2env(Y, envir = .GlobalEnv)
#splitting the detections
DFD <- group_split(mydetections,mystations)
Z <- lapply(seq_along(DFD), function(x) as.data.frame(DFD[[x]]))
names(Z) <- c(paste(unique(mydata$mystations),"det",sep=""))
list2env(Z, envir = .GlobalEnv)
I believe it is now time to "only" construct the intervals for each data frame, like this:
Aint <- int_diff(A$myintervals)
and "checking" which detection falls in which interval with this:
myresA <- Adet$date.time%within%Aint
Clearly, I would like to avoid "manually" constructing the intervals for each of the per-station data frames.
As always, I would greatly appreciate any help or tips for getting the desired output. I apologize for the initial confusion in the post.
Here are a couple of options to consider - hope this may be helpful.
Using the tidyverse, you can add your myorigins to mydata, then, after sorting with arrange, make time intervals (start and end).
You can use fuzzy_left_join to add the events table, matching on mystations and on whether date.time falls between the interval start and end.
Then, after grouping, you can count the number of rows. You will get something close to your result, depending on how you want to handle edge cases.
library(tidyverse)
library(fuzzyjoin)
library(lubridate)
bind_rows(mydata, myorigins) %>%
arrange(myintervals) %>%
group_by(mystations) %>%
transmute(start = myintervals, end = lead(myintervals)) %>%
filter(!is.na(end)) %>%
fuzzy_left_join(
mydetections,
by = c("mystations", "start" = "date.time", "end" = "date.time"),
match_fun = c(`==`, `<`, `>=`)
) %>%
group_by(start, end, mystations.x) %>%
summarise(count = n()) %>%
arrange(mystations.x)
Output
start end mystations.x count
<dttm> <dttm> <chr> <int>
1 1970-01-01 00:00:00 2000-01-01 08:00:00 A 1
2 2000-01-01 08:00:00 2000-04-30 11:00:00 A 3
3 2000-04-30 11:00:00 2000-07-01 14:00:00 A 1
4 1970-04-01 00:00:00 2000-02-25 09:00:00 B 1
5 2000-02-25 09:00:00 2000-05-01 12:00:00 B 3
6 2000-05-01 12:00:00 2000-08-30 15:00:00 B 2
7 1970-08-01 00:00:00 2000-03-01 10:00:00 C 2
8 2000-03-01 10:00:00 2000-06-30 13:00:00 C 3
9 2000-06-30 13:00:00 2000-09-01 16:00:00 C 1
10 1970-12-01 00:00:00 2000-10-30 17:00:00 D 2
An alternative to consider is data.table, which is typically faster. One function that may be helpful here is foverlaps, which finds overlaps between the event dates and the date ranges.
library(data.table)
dt <- rbind(myorigins, mydata)
setDT(dt)
dt[, c("start", "end") := list(myintervals, lead(myintervals)), by = mystations]
dt <- na.omit(dt, "end")
setDT(mydetections)
mydetections[,date.time.copy := copy(date.time)]
setkey(mydetections, mystations, date.time, date.time.copy)
dt_ovlp <- foverlaps(dt,
mydetections,
by.x = c("mystations", "start", "end"),
by.y = c("mystations", "date.time", "date.time.copy"))
dt_ovlp[ , .(value = .N), by = c("mystations", "start", "end")][order(mystations, start)]
The data table looks like the following:
ID DATE
1 2020-12-31 10:10:00
2 2020-12-31 20:30:00
3 2020-12-31 20:50:00
4 2021-01-02 17:10:00
5 2021-01-02 17:20:00
6 2021-01-02 17:30:00
7 2021-01-03 23:10:00
..
And I would like to query only the last entry per hour per day, with results like:
ID DATE
1 2020-12-31 10:10:00
3 2020-12-31 20:50:00
6 2021-01-02 17:30:00
7 2021-01-03 23:10:00
..
I tried looking for hourly queries and found the following:
strftime('%H', " + DATE + ", '+1 hours')
However, I am not sure how to use it properly (e.g. with GROUP BY? And then how to ensure it takes the latest entry of the hour), so it would be great to have some help here!
You can do it with ROW_NUMBER() window function:
SELECT ID, DATE
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY strftime('%Y%m%d%H', DATE) ORDER BY DATE DESC) rn
FROM tablename
)
WHERE rn = 1
ORDER BY ID
Instead of strftime('%Y%m%d%H', DATE) you could also use substr(DATE, 1, 13).
For versions of SQLite prior to 3.25.0, which do not support window functions, you can do it with NOT EXISTS:
SELECT t1.*
FROM tablename t1
WHERE NOT EXISTS (
SELECT 1
FROM tablename t2
WHERE strftime('%Y%m%d%H', t2.DATE) = strftime('%Y%m%d%H', t1.DATE)
AND t2.DATE > t1.DATE
)
Results:
ID | DATE
---|--------------------
 1 | 2020-12-31 10:10:00
 3 | 2020-12-31 20:50:00
 6 | 2021-01-02 17:30:00
 7 | 2021-01-03 23:10:00
I would like some help... The clinic has several doctors, and each one has specific consultation hours. Example: 07:00 to 12:00, 12:00 to 17:00, 09:00 to 15:00... What is the SQL statement to display only the records whose start_time to end_time range contains the current time?
fields:
start_time | end_time
07:00:00 | 12:30:00
09:00:00 | 15:00:00
12:30:00 | 17:00:00
07:00:00 | 17:00:00
That is, in the morning, display only the records that fall within 07:00:00 to 12:30:00 based on the current time. If it's the afternoon, show only the records that fall within 12:30:00 to 17:00:00.
Thanks.
My question is about time series data.
Suppose I have one file, named P1, with columns Time.Stamp and Value. The data table is given below:
Time.Stamp        Value
01/01/2017 19:08     12
01/01/2017 19:08     24
01/01/2017 19:08     45
01/01/2017 19:08     56
01/01/2017 19:08     78
01/01/2017 19:08     76
01/01/2017 19:08     34
01/01/2017 19:09     65
01/01/2017 19:09     87
I have another separate file, named P2, which has two columns, "Transaction from" and "transaction to":
Transaction from  transaction to
01/01/2017 19:00  01/01/2017 19:15
01/01/2017 19:15  01/01/2017 19:30
02/01/2017 08:45  02/01/2017 09:00
02/01/2017 09:00  02/01/2017 09:15
02/01/2017 09:15  02/01/2017 09:30
02/01/2017 09:30  02/01/2017 09:45
03/01/2017 18:00  03/01/2017 18:15
03/01/2017 18:15  03/01/2017 18:30
03/01/2017 23:45  04/01/2017 00:00
04/01/2017 00:15  04/01/2017 00:30
04/01/2017 01:45  04/01/2017 02:00
Now I want to find, in R, which Time.Stamp values from file P1 fall within a "Transaction from" to "transaction to" duration from file P2. If any Time.Stamp is in the range given by those two columns of P2, then the Value associated with that Time.Stamp should be aggregated. The columns of P1 and P2 are not of equal length; P1 is much longer than P2.
It would be very helpful if anyone could find a solution in R.
This is a possible duplicate of How to perform join over date ranges using data.table? Assuming that P1 & P2 are data frames and the dates are POSIXct to begin with, here is the lifesaver join provided by data.table:
library(data.table)
setDT(P1)
setDT(P2)
P1[ , dummy := Time.Stamp]
setkey(P2, Transaction.from, transaction.to)
dt <- foverlaps(
P1,
P2,
by.x = c("Time.Stamp", "dummy"),
# mult = "first"/mult = "first" will only choose first/last match
nomatch = 0L
)[ , dummy := NULL]
# you can run ?data.table::foverlaps for the documentation
Please refer to this great blog post for a step-by-step explanation and other possible answers.
After this point you can simply:
library(dplyr)
dt %>%
group_by(Transaction.from) %>%
mutate(total = sum(Value))
Please note that this solution may seem long for the simple aggregation you asked. However, it will come very handy if you need to merge the data frames and conduct more complex analysis.
First, convert all dates with as.POSIXct(x, format = "%d/%m/%Y %H:%M"). Then check whether each element of p1$Time.Stamp falls in any period from p2[,1] to p2[,2] with the following function, then aggregate:
isitthere<- function(x,from=p2$`Transaction from`,to=p2$`transaction to`){
any(x >=from & x<= to)
}
Apply the function to all p1$Time.Stamp:
index<-sapply(p1$Time.Stamp, isitthere,from=p2$`Transaction from`,to=p2$`transaction to`)
index
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Now aggregate:
sum(p1$Value[index])
[1] 477
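If you instead want one aggregate per P2 interval, a minimal sketch along the same lines (assuming the same parsed columns; the total column name is mine):
# sum the P1 values that fall inside each P2 interval, row by row
p2$total <- sapply(seq_len(nrow(p2)), function(i) {
  sum(p1$Value[p1$Time.Stamp >= p2$`Transaction from`[i] &
               p1$Time.Stamp <= p2$`transaction to`[i]])
})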
I am not clear about what is to be aggregated by what, but assuming that DF1 and DF2 are as defined in the Note at the end, the following will, for each row in DF2, look up zero or more rows in DF1 and then sum Value over the rows having the same Transaction.from and Transaction.to.
library(sqldf)
sqldf("select [Transaction.from], [Transaction.to], sum(Value) as Value
from DF2
left join DF1 on [Time.Stamp] between [Transaction.from] and [Transaction.to]
group by [Transaction.from], [Transaction.to]")
giving:
Transaction.from Transaction.to Value
1 2017-01-01 19:00:00 2017-01-01 19:15:00 477
2 2017-01-01 19:15:00 2017-01-01 19:30:00 NA
3 2017-02-01 08:45:00 2017-02-01 09:00:00 NA
4 2017-02-01 09:00:00 2017-02-01 09:15:00 NA
5 2017-02-01 09:15:00 2017-02-01 09:30:00 NA
6 2017-02-01 09:30:00 2017-02-01 09:45:00 NA
7 2017-03-01 18:00:00 2017-03-01 18:15:00 NA
8 2017-03-01 18:15:00 2017-03-01 18:30:00 NA
9 2017-03-01 23:45:00 2017-04-01 00:00:00 NA
10 2017-04-01 00:15:00 2017-04-01 00:30:00 NA
11 2017-04-01 01:45:00 2017-04-01 02:00:00 NA
Note
Lines1 <- "
Time.Stamp,Value
01/01/2017 19:08,12
01/01/2017 19:08,24
01/01/2017 19:08,45
01/01/2017 19:08,56
01/01/2017 19:08,78
01/01/2017 19:08,76
01/01/2017 19:08,34
01/01/2017 19:09,65
01/01/2017 19:09,87
"
DF1 <- read.csv(text = Lines1)
fmt <- "%m/%d/%Y %H:%M"
DF1 <- transform(DF1, Time.Stamp = as.POSIXct(Time.Stamp, format = fmt))
Lines2 <- "
Transaction.from,Transaction.to
01/01/2017 19:00,01/01/2017 19:15
01/01/2017 19:15,01/01/2017 19:30
02/01/2017 08:45,02/01/2017 09:00
02/01/2017 09:00,02/01/2017 09:15
02/01/2017 09:15,02/01/2017 09:30
02/01/2017 09:30,02/01/2017 09:45
03/01/2017 18:00,03/01/2017 18:15
03/01/2017 18:15,03/01/2017 18:30
03/01/2017 23:45,04/01/2017 00:00
04/01/2017 00:15,04/01/2017 00:30
04/01/2017 01:45,04/01/2017 02:00
"
DF2 <- read.csv(text = Lines2)
DF2 <- transform(DF2, Transaction.from = as.POSIXct(Transaction.from, format = fmt),
Transaction.to = as.POSIXct(Transaction.to, format = fmt))
I am still learning R and am having trouble trying to merge two data sets from two different data.tables, matching within a time interval. For example, given table1_schedule and table2_watch:
table1_schedule
Channel Program program_Date start_time
HBO Mov A 1/1/2018 21:00
HBO Mov B 1/1/2018 23:00
HBO Mov C 1/1/2018 23:59
NatGeo Doc A 1/1/2018 11:00
NatGeo Doc B 1/1/2018 11:30
NatGeo Doc C 1/1/2018 12:00
NatGeo Doc D 1/1/2018 14:00
table2_watch
Person Channel program_Date start_time end_time
Name A NatGeo 1/1/2018 11:00 12:00
Name B NatGeo 1/1/2018 12:30 14:00
Name B HBO 1/1/2018 21:30 22:00
Name B HBO 1/1/2018 22:30 23:30
The goal is to merge in the programs that run between the start_time and end_time of the table2_watch table, adding the programs watched by the person during each time interval. For example:
The wanted output
Person Channel program_Date start_time end_time Prog1 Prog2 Prog3
Name A NatGeo 1/1/2018 11:00 12:00 Doc A Doc B Doc C
Name B NatGeo 1/1/2018 12:30 14:00 Doc C Doc D -NA-
Name B HBO 1/1/2018 21:30 22:00 Mov A -NA- -NA-
Name B HBO 1/1/2018 22:30 23:30 Mov A Mov B -NA-
Is there a way to do this in the simplest and most efficient way, such as using dplyr or any other R commands best suited to this type of problem? Also, only add the next program as watched if the watching goes more than 10 minutes into it. Thanks
Here is a data.table solution where we can make use of foverlaps.
I'm showing every step with a short comment, to hopefully help with understanding.
library(data.table)
# Convert date & time to POSIXct
# Note that foverlap requires a start and end date, so we create an end date
# from the next start date per channel using shift for df1
setDT(df1)[, `:=`(
  time1 = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
  time2 = as.POSIXct(paste(program_Date, shift(start_time, 1, type = "lead", fill = start_time[.N])),
                     format = "%d/%m/%Y %H:%M")), by = Channel]
setDT(df2)[, `:=`(
start = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
end = as.POSIXct(paste(program_Date, end_time), format = "%d/%m/%Y %H:%M"))]
# Remove unnecessary columns in preparation for final output
df1[, `:=`(program_Date = NULL, start_time = NULL)]
df2[, `:=`(program_Date = NULL, start_time = NULL, end_time = NULL)]
# Join on channel and overlapping intervals
# Once joined, remove time1 and time2
setkey(df1, Channel, time1, time2)
dt <- foverlaps(df2, df1, by.x = c("Channel", "start", "end"), nomatch = 0L)
dt[, `:=`(time1 = NULL, time2 = NULL)]
# Spread long to wide
dt[, idx := paste0("Prog",1:.N), by = c("Channel", "Person", "start")]
dcast(dt, Channel + Person + start + end ~ idx, value.var = "Program")[order(Person, start)]
# Channel Person start end Prog1 Prog2 Prog3
#1: NatGeo Name A 2018-01-01 11:00:00 2018-01-01 12:00:00 Doc A Doc B Doc C
#2: NatGeo Name B 2018-01-01 12:30:00 2018-01-01 14:00:00 Doc C Doc D NA
#3: HBO Name B 2018-01-01 21:30:00 2018-01-01 22:00:00 Mov A NA NA
#4: HBO Name B 2018-01-01 22:30:00 2018-01-01 23:30:00 Mov A Mov B NA
Sample data
df1 <- read.table(text =
"Channel Program program_Date start_time
HBO 'Mov A' 1/1/2018 21:00
HBO 'Mov B' 1/1/2018 23:00
HBO 'Mov C' 1/1/2018 23:59
NatGeo 'Doc A' 1/1/2018 11:00
NatGeo 'Doc B' 1/1/2018 11:30
NatGeo 'Doc C' 1/1/2018 12:00
NatGeo 'Doc D' 1/1/2018 14:00", header = T)
df2 <- read.table(text =
"Person Channel program_Date start_time end_time
'Name A' NatGeo 1/1/2018 11:00 12:00
'Name B' NatGeo 1/1/2018 12:30 14:00
'Name B' HBO 1/1/2018 21:30 22:00
'Name B' HBO 1/1/2018 22:30 23:30", header = T)
Here is how I would go about doing this. Note that I renamed some of your stuff.
> cat schedule
Channel Program Date StartTime
HBO Mov A 1/1/2018 21:00
HBO Mov B 1/1/2018 23:00
HBO Mov C 1/1/2018 23:59
NatGeo Doc A 1/1/2018 11:00
NatGeo Doc B 1/1/2018 11:30
NatGeo Doc C 1/1/2018 12:00
NatGeo Doc D 1/1/2018 14:00
> cat watch
Person Channel Date StartTime EndTime
Name A NatGeo 1/1/2018 11:00 12:00
Name B NatGeo 1/1/2018 12:30 14:00
Name B HBO 1/1/2018 21:30 22:00
Name B HBO 1/1/2018 22:30 23:30
Now, make sure we read these correctly using readr. In other words, specify the correct formats for the dates and the times.
library(dplyr)
library(readr)
library(lubridate)
schedule <- read_table("schedule",
col_types=cols_only(Channel=col_character(),
Program=col_character(),
Date=col_date("%d/%m/%Y"),
StartTime=col_time("%H:%M")))
watch <- read_table("watch",
col_types=cols_only(Person=col_character(),
Channel=col_character(),
Date=col_date("%d/%m/%Y"),
StartTime=col_time("%H:%M"),
EndTime=col_time("%H:%M")))
Next, we convert all dates and times to datetimes and add an ending datetime to the schedule.
schedule <- schedule %>%
mutate(StartDateTime=ymd_hms(paste(Date, StartTime))) %>%
group_by(Channel) %>%
mutate(EndDateTime=lead(StartDateTime, default=as_datetime(Inf))) %>%
ungroup() %>%
select(Channel, Program, StartDateTime, EndDateTime)
watch <- watch %>%
mutate(StartDateTime=ymd_hms(paste(Date, StartTime))) %>%
mutate(EndDateTime=ymd_hms(paste(Date, EndTime))) %>%
select(Person, Channel, StartDateTime, EndDateTime)
We can perform a join and check whether the watch and schedule intervals overlap (you can modify this to accommodate your 10-minute comment, I believe, although I did not fully understand what you meant; see the sketch after the output below).
watch %>%
inner_join(schedule,
by=c("Channel" = "Channel"),
suffix=c(".Watch", ".Schedule")) %>%
filter(int_overlaps(interval(StartDateTime.Watch, EndDateTime.Watch),
interval(StartDateTime.Schedule, EndDateTime.Schedule))) %>%
select(Person, Channel, Program, StartDateTime.Watch, EndDateTime.Watch) %>%
rename_at(.vars=vars(ends_with(".Watch")),
.funs=funs(sub("\\.Watch$", "", .)))
# A tibble: 8 x 5
Person Channel Program StartDateTime EndDateTime
<chr> <chr> <chr> <dttm> <dttm>
1 Name A NatGeo Doc A 2018-01-01 11:00:00 2018-01-01 12:00:00
2 Name A NatGeo Doc B 2018-01-01 11:00:00 2018-01-01 12:00:00
3 Name A NatGeo Doc C 2018-01-01 11:00:00 2018-01-01 12:00:00
4 Name B NatGeo Doc C 2018-01-01 12:30:00 2018-01-01 14:00:00
5 Name B NatGeo Doc D 2018-01-01 12:30:00 2018-01-01 14:00:00
6 Name B HBO Mov A 2018-01-01 21:30:00 2018-01-01 22:00:00
7 Name B HBO Mov A 2018-01-01 22:30:00 2018-01-01 23:30:00
8 Name B HBO Mov B 2018-01-01 22:30:00 2018-01-01 23:30:00
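As for the 10-minute rule, one possible reading (my sketch, not part of the original answer) is to require more than 10 minutes of overlap instead of any overlap, replacing the int_overlaps() filter step with a minimum-overlap condition:
# overlap length = min(ends) - max(starts); keep the row only if over 10 minutes
filter(as.numeric(pmin(EndDateTime.Watch, EndDateTime.Schedule) -
                  pmax(StartDateTime.Watch, StartDateTime.Schedule),
                  units = "mins") > 10)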
To get the desired output, you would have to group by everything except Program and "explode" the resulting groups into multiple columns. However, I am not sure that is a good idea, so I did not do it; a sketch follows in case you want it.
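For completeness, a sketch of that "explode" step, assuming the joined tibble above has been stored in a variable (watched is my name, not the original's):
library(tidyr)
watched %>%
  group_by(Person, Channel, StartDateTime, EndDateTime) %>%
  mutate(idx = paste0("Prog", row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = idx, values_from = Program)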