I am still learning R and am having trouble merging two data sets from two different data.tables, matching rows within a time interval. For example, given table1_schedule and table2_watch:
table1_schedule
Channel Program program_Date start_time
HBO Mov A 1/1/2018 21:00
HBO Mov B 1/1/2018 23:00
HBO Mov C 1/1/2018 23:59
NatGeo Doc A 1/1/2018 11:00
NatGeo Doc B 1/1/2018 11:30
NatGeo Doc C 1/1/2018 12:00
NatGeo Doc D 1/1/2018 14:00
table2_watch
Person Channel program_Date start_time end_time
Name A NatGeo 1/1/2018 11:00 12:00
Name B NatGeo 1/1/2018 12:30 14:00
Name B HBO 1/1/2018 21:30 22:00
Name B HBO 1/1/2018 22:30 23:30
The goal is to merge in the programs that run between the "start_time" and "end_time" of each row of table2_watch, adding the programs the person watched during that interval as columns. For example:
The wanted output:
Person Channel program_Date start_time end_time Prog1 Prog2 Prog3
Name A NatGeo 1/1/2018 11:00 12:00 Doc A Doc B Doc C
Name B NatGeo 1/1/2018 12:30 14:00 Doc C Doc D -NA-
Name B HBO 1/1/2018 21:30 22:00 Mov A -NA- -NA-
Name B HBO 1/1/2018 22:30 23:30 Mov A Mov B -NA-
Is there a simple and efficient way to do this, for example with dplyr or whatever other R commands suit this type of problem best? Also, a program should only be added to the interval if the person watched it for more than 10 minutes before moving on to the next program. Thanks
Here is a data.table solution where we can make use of foverlaps.
I'm showing every step with a short comment, to hopefully help with understanding.
library(data.table)
# Convert date & time to POSIXct
# Note that foverlaps requires a start and an end date, so we create an end date
# from the next start date per channel using shift for df1
setDT(df1)[, `:=`(
time1 = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
time2 = as.POSIXct(paste(program_Date, shift(start_time, 1, type = "lead", fill = start_time[.N])), format = "%d/%m/%Y %H:%M")), by = Channel]
setDT(df2)[, `:=`(
start = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
end = as.POSIXct(paste(program_Date, end_time), format = "%d/%m/%Y %H:%M"))]
# Remove unnecessary columns in preparation for final output
df1[, `:=`(program_Date = NULL, start_time = NULL)]
df2[, `:=`(program_Date = NULL, start_time = NULL, end_time = NULL)]
# Join on channel and overlapping intervals
# Once joined, remove time1 and time2
setkey(df1, Channel, time1, time2)
dt <- foverlaps(df2, df1, by.x = c("Channel", "start", "end"), nomatch = 0L)
dt[, `:=`(time1 = NULL, time2 = NULL)]
# Spread long to wide
dt[, idx := paste0("Prog",1:.N), by = c("Channel", "Person", "start")]
dcast(dt, Channel + Person + start + end ~ idx, value.var = "Program")[order(Person, start)]
# Channel Person start end Prog1 Prog2 Prog3
#1: NatGeo Name A 2018-01-01 11:00:00 2018-01-01 12:00:00 Doc A Doc B Doc C
#2: NatGeo Name B 2018-01-01 12:30:00 2018-01-01 14:00:00 Doc C Doc D NA
#3: HBO Name B 2018-01-01 21:30:00 2018-01-01 22:00:00 Mov A NA NA
#4: HBO Name B 2018-01-01 22:30:00 2018-01-01 23:30:00 Mov A Mov B NA
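The 10-minute condition from the question is not handled above. Here is a minimal sketch of one way to add it, assuming the condition means a program only counts if the watched overlap with its scheduled slot exceeds 10 minutes; it must run right after the foverlaps() call, while time1 and time2 still exist:
# overlap of the watched window [start, end] with the program slot [time1, time2]
dt[, overlap_mins := as.numeric(difftime(pmin(end, time2), pmax(start, time1),
                                         units = "mins"))]
dt <- dt[overlap_mins > 10]  # keep programs watched for more than 10 minutes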
Sample data
df1 <- read.table(text =
"Channel Program program_Date start_time
HBO 'Mov A' 1/1/2018 21:00
HBO 'Mov B' 1/1/2018 23:00
HBO 'Mov C' 1/1/2018 23:59
NatGeo 'Doc A' 1/1/2018 11:00
NatGeo 'Doc B' 1/1/2018 11:30
NatGeo 'Doc C' 1/1/2018 12:00
NatGeo 'Doc D' 1/1/2018 14:00", header = T)
df2 <- read.table(text =
"Person Channel program_Date start_time end_time
'Name A' NatGeo 1/1/2018 11:00 12:00
'Name B' NatGeo 1/1/2018 12:30 14:00
'Name B' HBO 1/1/2018 21:30 22:00
'Name B' HBO 1/1/2018 22:30 23:30", header = T)
Here is how I would go about doing this. Note that I renamed some of your stuff.
> cat schedule
Channel Program Date StartTime
HBO Mov A 1/1/2018 21:00
HBO Mov B 1/1/2018 23:00
HBO Mov C 1/1/2018 23:59
NatGeo Doc A 1/1/2018 11:00
NatGeo Doc B 1/1/2018 11:30
NatGeo Doc C 1/1/2018 12:00
NatGeo Doc D 1/1/2018 14:00
> cat watch
Person Channel Date StartTime EndTime
Name A NatGeo 1/1/2018 11:00 12:00
Name B NatGeo 1/1/2018 12:30 14:00
Name B HBO 1/1/2018 21:30 22:00
Name B HBO 1/1/2018 22:30 23:30
Now, make sure we read these correctly using readr. In other words, specify the correct formats for the dates and the times.
library(dplyr)
library(readr)
library(lubridate)
schedule <- read_table("schedule",
col_types=cols_only(Channel=col_character(),
Program=col_character(),
Date=col_date("%d/%m/%Y"),
StartTime=col_time("%H:%M")))
watch <- read_table("watch",
col_types=cols_only(Person=col_character(),
Channel=col_character(),
Date=col_date("%d/%m/%Y"),
StartTime=col_time("%H:%M"),
EndTime=col_time("%H:%M")))
Next, we convert all dates and times to datetimes and add an ending datetime to the schedule.
schedule <- schedule %>%
mutate(StartDateTime=ymd_hms(paste(Date, StartTime))) %>%
group_by(Channel) %>%
mutate(EndDateTime=lead(StartDateTime, default=as_datetime(Inf))) %>%
ungroup() %>%
select(Channel, Program, StartDateTime, EndDateTime)
watch <- watch %>%
mutate(StartDateTime=ymd_hms(paste(Date, StartTime))) %>%
mutate(EndDateTime=ymd_hms(paste(Date, EndTime))) %>%
select(Person, Channel, StartDateTime, EndDateTime)
We can perform a join and check whether the watch and schedule intervals overlap (I believe you can modify this to accommodate your 10-minute comment, although I did not fully understand what you meant; a sketch follows the output below).
watch %>%
inner_join(schedule,
by=c("Channel" = "Channel"),
suffix=c(".Watch", ".Schedule")) %>%
filter(int_overlaps(interval(StartDateTime.Watch, EndDateTime.Watch),
interval(StartDateTime.Schedule, EndDateTime.Schedule))) %>%
select(Person, Channel, Program, StartDateTime.Watch, EndDateTime.Watch) %>%
rename_at(.vars=vars(ends_with(".Watch")),
.funs=funs(sub("\\.Watch$", "", .)))
# A tibble: 8 x 5
Person Channel Program StartDateTime EndDateTime
<chr> <chr> <chr> <dttm> <dttm>
1 Name A NatGeo Doc A 2018-01-01 11:00:00 2018-01-01 12:00:00
2 Name A NatGeo Doc B 2018-01-01 11:00:00 2018-01-01 12:00:00
3 Name A NatGeo Doc C 2018-01-01 11:00:00 2018-01-01 12:00:00
4 Name B NatGeo Doc C 2018-01-01 12:30:00 2018-01-01 14:00:00
5 Name B NatGeo Doc D 2018-01-01 12:30:00 2018-01-01 14:00:00
6 Name B HBO Mov A 2018-01-01 21:30:00 2018-01-01 22:00:00
7 Name B HBO Mov A 2018-01-01 22:30:00 2018-01-01 23:30:00
8 Name B HBO Mov B 2018-01-01 22:30:00 2018-01-01 23:30:00
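A minimal sketch of the 10-minute modification, under the assumption that it means a program only counts when the overlap between the watch window and the program slot exceeds 10 minutes; this filter would go right after the int_overlaps() filter:
filter(as.numeric(difftime(pmin(EndDateTime.Watch, EndDateTime.Schedule),
                           pmax(StartDateTime.Watch, StartDateTime.Schedule),
                           units = "mins")) > 10)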
To get the desired output, you would have to group by everything except Program and "explode" the resulting groups into multiple columns. I am not sure that is a good idea, so I did not include it in the pipeline, but a rough sketch follows.
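If you do want the wide Prog1/Prog2/... format, this tidyr sketch should get you there; `joined` is a hypothetical name for the result of the pipeline above:
library(tidyr)
joined %>%
  group_by(Person, Channel, StartDateTime, EndDateTime) %>%
  mutate(idx = paste0("Prog", row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = idx, values_from = Program)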
Related
I have two datasets.
Dataset X looks as follows. It contains 30-min intervals of the trading day of some stock index, which opens at 09:30 and closes at 15:00 for DJ, but at 16:00 for DX. So the closing time may vary by Ticker.
Date Ticker end_time start_time
1997-10-06 DJ 10:00 09:30
1997-10-06 DJ 10:30 10:00
1997-10-06 DJ 11:00 10:30
1997-10-06 DJ 11:30 11:00
1997-10-08 DJ 09:30 15:00
1997-10-08 DJ 10:00 09:30
1997-10-06 DX 10:00 09:30
1997-10-06 DX 10:30 10:00
1997-10-06 DX 11:00 10:30
1997-10-06 DX 11:30 11:00
1997-10-07 DX 14:30 14:00
1997-10-07 DX 15:00 14:30
1997-10-07 DX 15:30 15:00
1997-10-07 DX 16:00 15:30
1997-10-08 DX 09:30 16:00
1997-10-08 DX 10:00 09:30
Dataset Y looks as follows:
Date Time Event
1997-10-06 09:30 Event1
1997-10-06 10:30 Event2
1997-10-07 22:00 Event3
1997-10-08 09:00 Event4
1997-10-08 09:30 Event5
1997-10-08 09:30 Event6
My aim is to link events in Y to X based on whether the event date-time occurs within the start/end time interval. My expected output is something like this (dataset Z):
Date Ticker end_time start_time Event
1997-10-06 DJ 10:00 09:30 Event1
1997-10-06 DJ 10:30 10:00 NA
1997-10-06 DJ 11:00 10:30 Event2
1997-10-06 DJ 11:30 11:00 NA
1997-10-08 DJ 09:30 15:00 Event3,Event4
1997-10-08 DJ 10:00 09:30 Event5,Event6
1997-10-06 DX 10:00 09:30 Event1
1997-10-06 DX 10:30 10:00 NA
1997-10-06 DX 11:00 10:30 Event2
1997-10-06 DX 11:30 11:00 NA
1997-10-07 DX 14:30 14:00 NA
1997-10-07 DX 15:00 14:30 NA
1997-10-07 DX 15:30 15:00 NA
1997-10-07 DX 16:00 15:30 NA
1997-10-08 DX 09:30 16:00 Event3, Event4
1997-10-08 DX 10:00 09:30 Event5,Event6
It is thus possible for multiple events to happen within one interval; is it possible to store those together in the "Event" column? It is also possible that an event occurs after a market closes; such an event should be stored in the first interval that begins after the event. How can I obtain this expected output? I have been thinking about it for a while now, but I have no clue where to start.
Edit: X contains 400k 30-min intervals. Y contains 40k events.
There are lots of ways of approaching the problem, here's just one suggestion.
I'm using these datasets:
x <- read.table(text = "Date,Ticker,end_time,start_time
06/10/1997,DJ,10:00,09:30
06/10/1997,DJ,10:30,10:00
06/10/1997,DJ,11:00,10:30
06/10/1997,DJ,11:30,11:00
08/10/1997,DJ,09:30,15:00
08/10/1997,DJ,10:00,09:30
06/10/1997,DX,10:00,09:30
06/10/1997,DX,10:30,10:00
06/10/1997,DX,11:00,10:30
06/10/1997,DX,11:30,11:00
07/10/1997,DX,14:30,14:00
07/10/1997,DX,15:00,14:30
07/10/1997,DX,15:30,15:00
07/10/1997,DX,16:00,15:30
08/10/1997,DX,09:30,16:00
08/10/1997,DX,10:00,09:30
08/10/1997,DX,10:00,09:30", sep =",", header = TRUE, stringsAsFactors =
FALSE)
y <- read.table(text = "Date,Time,Event
06/10/1997,09:30,Event1
06/10/1997,10:30,Event2
07/10/1997,22:00,Event3
08/10/1997,09:00,Event4
08/10/1997,09:30,Event5
08/10/1997,09:30,Event6
", sep =",", header = TRUE, stringsAsFactors = FALSE)
I would start by concatenating and formatting the dates and times so they can be used in functions to check whether an event occurred in that window. Assuming you have two data frames called x and y in the structure described above:
y$date_time <- strptime(paste(y$Time,y$Date),format="%H:%M %d/%m/%Y")
x$start_time_date <- strptime(paste(x$start_time,x$Date),format="%H:%M %d/%m/%Y")
x$end_time_date <- strptime(paste(x$end_time,x$Date),format="%H:%M %d/%m/%Y")
For the periods that cross over midnight, this approach produces start date-times that are after the end date-times. We can fix those by instead using the date from the previous entry in the data frame, assuming the data is always in chronological order and there are no missing periods. (If you have control over the dataset as it is compiled, it might be easier to record full start and end date-times in the first place.) This is a bit of a hack!:
#check which entries cross over a date
overnight_idx <- which(x$end_time_date < x$start_time_date)
#replace start date with that of preceding entry in the data frame
x[overnight_idx, 'start_time_date'] <-
as.POSIXct(paste(x[overnight_idx, 'start_time'],
x[overnight_idx - 1, 'Date']),format="%H:%M %d/%m/%Y",
origin = "1970-01-01")
Now we can write a function that, for a given row of the data frame x, extracts any events listed in y that occurred within that row's interval, and then does a little formatting to get the result in the format you described.
checkEvent <- function(x_row){
y2 <- y[y$date_time>=x_row['start_time_date'] &
y$date_time<x_row['end_time_date'], 'Event']
if(length(y2)==0){
y2 <- NA
} else if(length(y2)>1){
y2 <- paste(y2,collapse = ' ')
}
return(y2)
}
Then we can just apply that to x
x$Event <- apply(x,1,checkEvent)
which will produce the following (ignoring the columns we created above so it fits on the screen):
> x[,c('Date','Ticker','end_time','start_time','Event')]
Date Ticker end_time start_time Event
1 06/10/1997 DJ 10:00 09:30 Event1
2 06/10/1997 DJ 10:30 10:00 <NA>
3 06/10/1997 DJ 11:00 10:30 Event2
4 06/10/1997 DJ 11:30 11:00 <NA>
5 08/10/1997 DJ 09:30 15:00 Event3 Event4
6 08/10/1997 DJ 10:00 09:30 Event5 Event6
7 06/10/1997 DX 10:00 09:30 Event1
8 06/10/1997 DX 10:30 10:00 <NA>
9 06/10/1997 DX 11:00 10:30 Event2
10 06/10/1997 DX 11:30 11:00 <NA>
11 07/10/1997 DX 14:30 14:00 <NA>
12 07/10/1997 DX 15:00 14:30 <NA>
13 07/10/1997 DX 15:30 15:00 <NA>
14 07/10/1997 DX 16:00 15:30 <NA>
15 08/10/1997 DX 09:30 16:00 Event3 Event4
16 08/10/1997 DX 10:00 09:30 Event5 Event6
17 08/10/1997 DX 10:00 09:30 Event5 Event6
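Given the sizes mentioned in the edit (400k intervals, 40k events), the row-wise apply() can get slow, since it scans all of y for every row of x. A rough sketch of a faster alternative using a data.table non-equi join, assuming the date-time columns built above (note that data.table needs POSIXct, not the POSIXlt that strptime() returns):
library(data.table)
# convert POSIXlt columns to POSIXct so data.table can handle them
x$start_time_date <- as.POSIXct(x$start_time_date)
x$end_time_date <- as.POSIXct(x$end_time_date)
y$date_time <- as.POSIXct(y$date_time)
setDT(x); setDT(y)
# for each interval of x, collect the matching events of y in a single join
x[, Event := y[x, on = .(date_time >= start_time_date, date_time < end_time_date),
               by = .EACHI,
               .(Event = if (all(is.na(Event))) NA_character_
                         else paste(Event, collapse = " "))]$Event]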
I have two tables: the first with columns ID, Start_Date_Time, and End_Date_Time; the second with columns ID, Day, StartTime, and EndTime.
ID Start_Date_Time End_Date_Time
1 ABC123 2019-01-05 16:00:00 2019-01-07 20:00:00
2 XYZ123 2019-01-06 05:00:00 2019-01-13 05:00:00
3 XYZ456 2019-01-08 19:00:00 2019-01-13 12:00:00
And
ID Day StartTime EndTime
1 ABC123 Saturday 13:00 18:00
2 XYZ123 Sunday 0:00 6:00
3 XYZ456 Tuesday 0:00 12:00
I need a result column in the first table which captures the number of hours between Start_Date_Time and End_Date_Time that fall within the condition in the second table. In this case the result should be:
ID Start_Date End_Date Timeline_Hours
ABC123 01/05/2019 16:00 01/07/2019 20:00 2
XYZ123 01/06/2019 5:00 01/13/2019 5:00 6
XYZ456 01/08/2019 19:00 01/13/2019 12:00 0
For the first record, ABC123, the number of hours within the Start_Date_Time and End_Date_Time that satisfy the condition is 2 hours.
Reason: the range starts on Saturday 16:00 (4 PM) and ends on Monday 20:00 (8 PM); the condition in the second table says Saturday 13:00 to 18:00, so the overlap is 2 hours (from 16:00 to 18:00).
Similarly, the second one spans more than a week: the overlap for the first week is 1 hour (from 5:00 to 6:00) and for the second week it is 5 hours (from 0:00 to 5:00), for a total of 6 hours.
For the third one there is no overlap, so 0 hours.
Can this be done in R?
Thanks
Nagaraj
df1 <- structure(list(ID = c("ABC123", "XYZ123", "XYZ456"), Start_Date_Time = structure(c(1546675200,
1546722000, 1546945200), class = c("POSIXct", "POSIXt"), tzone = ""),
End_Date_Time = structure(c(1546862400, 1547326800, 1547352000
), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c(NA,
-3L), class = "data.frame")
df2 <- structure(list(ID = c("ABC123", "XYZ123", "XYZ456"), Day = c("Saturday",
"Sunday", "Tuesday"), StartTime = c("13:00", "0:00", "0:00"),
EndTime = c("18:00", "6:00", "12:00")), row.names = c(NA,
-3L), class = "data.frame")
An option using data.table:
library(data.table)
setDT(df1)
setDT(df2)
fmt <- "%Y-%m-%d %H:%M"
# Step 1: list every calendar date spanned by df1 and keep those whose
# weekday has a matching rule in df2
DT <- df1[, {
x <- seq(min(as.IDate(Start_Date_Time)), max(as.IDate(End_Date_Time)), by="1 day")
.(Date=x, Day=weekdays(x))
}][
df2, on=.(Day), nomatch=0L]
# Step 2: expand each matching date into hourly end-points within the
# StartTime-EndTime window (the end-point at hour h represents the hour (h-1, h])
hoursDT <- DT[, .(ID, END_HR=seq.POSIXt(as.POSIXct(paste(Date, StartTime), format=fmt) + 60*60,
as.POSIXct(paste(Date, EndTime), format=fmt),
by="1 hour")),
seq_len(nrow(DT))]
# Step 3: count the hourly end-points falling inside each interval of df1
# using a non-equi join; each end-point contributes one full hour of overlap
df1[, Timeline_Hours :=
hoursDT[.SD, on=.(ID, END_HR>Start_Date_Time, END_HR<=End_Date_Time), by=.EACHI, .N]$N
]
output for df1:
ID Start_Date_Time End_Date_Time Timeline_Hours
1: ABC123 2019-01-05 16:00:00 2019-01-07 20:00:00 2
2: XYZ123 2019-01-06 05:00:00 2019-01-13 05:00:00 6
3: XYZ456 2019-01-08 19:00:00 2019-01-13 12:00:00 0
I have a file which has messages between customers and agents, but these messages are not grouped by conversation, i.e. there is no unique conversation id. Luckily, the original message is included in each following reply to that message. The original message is in the 'text' column. This is easily explained by the example below:
actionDateTime text response postTime
2019-01-01 12:00 Hi N/A 2019-01-01 12:00
2019-01-01 12:01 Hi Hello! 2019-01-01 12:00
2019-01-01 12:02 Hi How can I help? 2019-01-01 12:00
.
.
.
2019-01-02 12:00 Hi there N/A 2019-01-01 12:00
2019-01-02 12:01 Hi there Morning 2019-01-01 12:00
2019-01-02 12:02 Hi there How can I help? 2019-01-01 12:00
So I tried the code below to group them, but it isn't working.
df %>%
group_by(text, postTime) %>%
mutate(convID = row_number()) %>%
ungroup()
This does output a file with convID, but not the way I want; in fact, I don't understand how it's numbering. I believe that's because I'm using two variables in group_by. However, using only one will not work, as two different people can message at the same time, or two different messages can look alike (e.g. a lot of people start with just 'Hi').
When I tried grouping only by 'text', it still gives me numbers within a conversation rather than a unique ID. Again, explained below.
What I get
text response postTime convID
Hi N/A 2019-01-01 12:00 1
Hi Hello! 2019-01-01 12:00 2
Hi How can I help? 2019-01-01 12:00 3
.
.
.
Hi there N/A 2019-01-01 12:00 1
Hi there Morning 2019-01-01 12:00 2
Hi there How can I help? 2019-01-01 12:00 3
What I want:
text response postTime convID
Hi N/A 2019-01-01 12:00 1
Hi Hello! 2019-01-01 12:00 1
Hi How can I help? 2019-01-01 12:00 1
.
.
.
Hi there N/A 2019-01-01 12:00 2
Hi there Morning 2019-01-01 12:00 2
Hi there How can I help? 2019-01-01 12:00 2
Any help?
We may need group_indices
library(dplyr)
df %>%
mutate(convID = group_indices(., text, postTime))
# actionDateTime text response postTime convID
#1 2019-01-01 12:00 Hi N/A 2019-01-01 12:00 1
#2 2019-01-01 12:01 Hi Hello! 2019-01-01 12:00 1
#3 2019-01-01 12:02 Hi How can I help? 2019-01-01 12:00 1
#4 2019-01-02 12:00 Hi there N/A 2019-01-01 12:00 2
#5 2019-01-02 12:01 Hi there Morning 2019-01-01 12:00 2
#6 2019-01-02 12:02 Hi there How can I help? 2019-01-01 12:00 2
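Side note: group_indices() was deprecated in dplyr 1.0.0; in current dplyr the same result can (as far as I know) be obtained with cur_group_id() inside a grouped mutate:
df %>%
  group_by(text, postTime) %>%
  mutate(convID = cur_group_id()) %>%
  ungroup()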
data
df <- structure(list(actionDateTime = c("2019-01-01 12:00", "2019-01-01 12:01",
"2019-01-01 12:02", "2019-01-02 12:00", "2019-01-02 12:01", "2019-01-02 12:02"
), text = c("Hi", "Hi", "Hi", "Hi there", "Hi there", "Hi there"
), response = c("N/A", "Hello!", "How can I help?", "N/A", "Morning",
"How can I help?"), postTime = c("2019-01-01 12:00", "2019-01-01 12:00",
"2019-01-01 12:00", "2019-01-01 12:00", "2019-01-01 12:00", "2019-01-01 12:00"
)), class = "data.frame", row.names = c(NA, -6L))
I have a table that looks like this;
user_id timestamp
aa 2018-01-01 12:01 UTC
ab 2018-01-01 05:01 UTC
bb 2018-06-01 09:01 UTC
bc 2018-03-03 23:01 UTC
cc 2018-01-02 11:01 UTC
I have another table that has every week in 2018.
week_id week_start week_end
1 2018-01-01 2018-01-07
2 2018-01-08 2018-01-15
3 2018-01-16 2018-01-23
4 2018-01-23 2018-01-30
... ... ...
Assume the week_start is a Monday and week_end is a Sunday.
I'd like to do two things. I'd first like to join the week_id to the first table and then I'd like to assign a day to each of the timestamps. My output would look like this:
user_id timestamp week_id day_of_week
aa 2018-01-01 12:01 UTC 1 Monday
ab 2018-01-02 05:01 UTC 1 Tuesday
bb 2018-01-13 09:01 UTC 2 Friday
bc 2018-01-28 23:01 UTC 4 Friday
cc 2018-01-06 11:01 UTC 1 Saturday
In Excel I could easily do this with a vlookup. My main interest is to learn how to join tables in cases like this. For that reason, I won't accept answers that use the weekday function.
Here are both of the tables in a more accessible format.
user_id <- c("aa", "ab", "bb", "bc", "cc")
timestamp <- c("2018-01-01 12:01", "2018-01-01 05:01", "2018-06-01 09:01", "2018-03-03 23:01", "2018-01-02 11:01")
week_id <- seq(1,52)
week_start <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), 7)
week_end <- week_start + 6
week_start <- week_start[1:52]
week_end <- week_end[1:52]
table1 <- data.frame(user_id, timestamp)
table2 <- data.frame(week_id, week_start, week_end)
Using SQL one can join two tables on a range like this. This seems the most elegant solution, expressing our intent directly, but we also provide some alternatives further below.
library(sqldf)
DF1$date <- as.Date(DF1$timestamp)
sqldf("select *
from DF1 a
left join DF2 b on date between week_start and week_end")
giving:
user_id timestamp date week_id week_start week_end
1 aa 2018-01-01 12:01:00 2018-01-01 1 2018-01-01 2018-01-07
2 ab 2018-01-01 05:01:00 2018-01-01 1 2018-01-01 2018-01-07
3 bb 2018-06-01 09:01:00 2018-06-01 NA <NA> <NA>
4 bc 2018-03-03 23:01:00 2018-03-04 NA <NA> <NA>
5 cc 2018-01-02 11:01:00 2018-01-02 1 2018-01-01 2018-01-07
dplyr
In a comment the poster asked whether it could be done in dplyr. It can't be done directly, since dplyr does not support complex joins, but a workaround is to do a full cross join of the two data frames, which gives rise to an nrow(DF1) * nrow(DF2) intermediate result, and then filter it down. dplyr does not directly support cross joins either, but we can simulate one by doing a full join on an identical dummy constant column appended to both data frames. Since we actually need a right join here to add back the unmatched rows, we do a final right_join with the original DF1 data frame. Obviously this is entirely impractical for sufficiently large inputs, but for the small input here we can do it. If it were known that there is a match in DF2 for every row in DF1, the right_join at the end could be omitted.
DF1 %>%
mutate(date = as.Date(timestamp), dummy = 1) %>%
full_join(DF2 %>% mutate(dummy = 1)) %>%
filter(date >= week_start & date <= week_end) %>%
select(-dummy) %>%
right_join(DF1)
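Side note: since this answer was written, dplyr (1.1.0 and later) has gained native inequality joins via join_by(), which makes the cross-join workaround unnecessary. A sketch, assuming a recent dplyr:
DF1 %>%
  mutate(date = as.Date(timestamp)) %>%
  left_join(DF2, by = join_by(date >= week_start, date <= week_end))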
R Base
findix finds the index in DF2 corresponding to a date d. We then sapply it over the dates corresponding to rows of DF1 and put DF1 and the corresponding DF2 row together.
findix <- function(d) c(which(d >= DF2$week_start & d <= DF2$week_end), NA)[1]
cbind(DF1, DF2[sapply(as.Date(DF1$timestamp), findix), ])
Note
The input data in reproducible form used is:
Lines1 <- "user_id,timestamp
aa,2018-01-01 12:01 UTC
ab,2018-01-01 05:01 UTC
bb,2018-06-01 09:01 UTC
bc,2018-03-03 23:01 UTC
cc,2018-01-02 11:01 UTC"
# the timestamps contain spaces, so the fields are comma-separated directly
# rather than split on whitespace
DF1 <- read.csv(text = Lines1, strip.white = TRUE)
DF1$timestamp <- as.POSIXct(DF1$timestamp)
Lines2 <- "week_id week_start week_end
1 2018-01-01 2018-01-07
2 2018-01-08 2018-01-15
3 2018-01-16 2018-01-23
4 2018-01-23 2018-01-30"
DF2 <- read.table(text = Lines2, header = TRUE)
DF2$week_start <- as.Date(DF2$week_start)
DF2$week_end <- as.Date(DF2$week_end)
This is a case for the fuzzyjoin package. With the match_fun argument we can specify a matching condition for each pair of join columns, in this case table1$date >= table2$week_start and table1$date <= table2$week_end.
library(fuzzyjoin)
library(lubridate)
table1$date <- as.Date(table1$timestamp)
fuzzy_left_join(table1, table2,
by = c("date" = "week_start", "date" = "week_end"),
match_fun = list(`>=`, `<=`)) %>%
mutate(day_of_week = wday(date, label = TRUE)) %>%
select(user_id, timestamp, week_id, day_of_week)
user_id timestamp week_id day_of_week
1 aa 2018-01-01 12:01 1 Mo
2 ab 2018-01-01 05:01 1 Mo
3 bb 2018-06-01 09:01 22 Fr
4 bc 2018-03-03 23:01 9 Sa
5 cc 2018-01-02 11:01 1 Di
I'm also a smartass because I didn't use the weekday function but wday from the lubridate package. (The day labels in the output come from my non-English locale, e.g. 'Di' for Tuesday.)
I have a list of lectures for a university course stored in a data frame. This is a large, complex table with over 1000 rows. I have used simple times in the example, but the real data holds date-times in the format %d %b %Y %H:%M; I think I should be able to extrapolate to the more complex usage.
essentially:
ModuleCode1 ModuleName Lecturer StartTime EndTime Course
11A Hist1 Bob 10:30 12:30 Hist
13A Hist2 Bob 14:30 15:30 Hist
13C Hist3 Steve 11:45 12:45 Hist
15B Hist4 Bob 09:40 10:40 Hist
17B Hist5 Bob 14:00 15:00 Hist
I am trying to create an output data frame which determines which modules clash in the timetable and at which times. For example:
ModuleCode1 StartTime EndTime ModuleCode2 StartTime EndTime
11A 10:30 12:30 15B 09:40 10:40
11A 10:30 12:30 13C 11:45 12:45
13A 14:30 15:30 17B 14:00 15:00
There are a multitude of questions on date-time overlaps, but the ones I can find either work with two data frames or I can't understand them. I have come across the lubridate and IRanges packages, but cannot work out this specific implementation with date-times in a single data frame. This seems like something that would be generally useful and most likely has a simple implementation I am missing. Grateful for any help.
Here is an sqldf solution. The intervals do NOT overlap iff a.StartTime > b.EndTime or a.EndTime < b.StartTime, so they do overlap exactly when the negation of this statement is true, hence:
library(sqldf)
sqldf("select a.ModuleCode1, a.StartTime, a.EndTime, b.ModuleCode1, b.StartTime, b.EndTime
from DF a join DF b on a.ModuleCode1 < b.ModuleCode1 and
a.StartTime <= b.EndTime and
a.EndTime >= b.StartTime")
giving:
ModuleCode1 StartTime EndTime ModuleCode1 StartTime EndTime
1 11A 10:30 12:30 13C 11:45 12:45
2 11A 10:30 12:30 15B 09:40 10:40
3 13A 14:30 15:30 17B 14:00 15:00
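For completeness, a rough base-R sketch of the same self-join logic, using merge() for the cross join; it assumes the time columns are plain character (zero-padded "HH:MM" strings compare correctly as text):
# ensure plain character columns (read.table in R < 4.0 returns factors)
DF[c("ModuleCode1", "StartTime", "EndTime")] <-
  lapply(DF[c("ModuleCode1", "StartTime", "EndTime")], as.character)
cross <- merge(DF, DF, by = NULL, suffixes = c(".a", ".b"))
subset(cross,
       ModuleCode1.a < ModuleCode1.b &
       StartTime.a <= EndTime.b &
       EndTime.a >= StartTime.b,
       select = c(ModuleCode1.a, StartTime.a, EndTime.a,
                  ModuleCode1.b, StartTime.b, EndTime.b))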
Note: The input in reproducible form is:
Lines <- "ModuleCode1 ModuleName Lecturer StartTime EndTime Course
11A Hist1 Bob 10:30 12:30 Hist
13A Hist2 Bob 14:30 15:30 Hist
13C Hist3 Steve 11:45 12:45 Hist
15B Hist4 Bob 09:40 10:40 Hist
17B Hist5 Bob 14:00 15:00 Hist"
DF <- read.table(text = Lines, header = TRUE)