Listing pairwise overlaps of date-time elements in R

I have a list of lectures for a university course stored in a data frame. It is a large, complex table with over 1000 rows. I have used simple times in the example, but the real data are date-times in the format %d %b %Y %H:%M. I think I should be able to extrapolate to the more complex usage.
Essentially:
ModuleCode1 ModuleName Lecturer StartTime EndTime Course
11A Hist1 Bob 10:30 12:30 Hist
13A Hist2 Bob 14:30 15:30 Hist
13C Hist3 Steve 11:45 12:45 Hist
15B Hist4 Bob 09:40 10:40 Hist
17B Hist5 Bob 14:00 15:00 Hist
I am trying to create an output data frame which determines which modules clash in the timetable and at which times. For example:
ModuleCode1 StartTime EndTime ModuleCode2 StartTime EndTime
11A 10:30 12:30 15B 09:40 10:40
11A 10:30 12:30 13C 11:45 12:45
13A 14:30 15:30 17B 14:00 15:00
There are a multitude of questions on date-time overlaps, but the ones that I can find either work with two data frames or are beyond my understanding. I have come across the lubridate and IRanges packages, but cannot work out this specific implementation with date-times in a single data frame. It seems like something that would be generally useful and most likely has a simple implementation that I am missing. Grateful for any help.

Here is an sqldf solution. The intervals do NOT overlap iff a.StartTime > b.EndTime or a.EndTime < b.StartTime so they do overlap exactly when the negation of this statement is true, hence:
library(sqldf)
sqldf("select a.ModuleCode1, a.StartTime, a.EndTime, b.ModuleCode1, b.StartTime, b.EndTime
from DF a join DF b on a.ModuleCode1 < b.ModuleCode1 and
a.StartTime <= b.EndTime and
a.EndTime >= b.StartTime")
giving:
ModuleCode1 StartTime EndTime ModuleCode1 StartTime EndTime
1 11A 10:30 12:30 13C 11:45 12:45
2 11A 10:30 12:30 15B 09:40 10:40
3 13A 14:30 15:30 17B 14:00 15:00
Note: The input in reproducible form is:
Lines <- "ModuleCode1 ModuleName Lecturer StartTime EndTime Course
11A Hist1 Bob 10:30 12:30 Hist
13A Hist2 Bob 14:30 15:30 Hist
13C Hist3 Steve 11:45 12:45 Hist
15B Hist4 Bob 09:40 10:40 Hist
17B Hist5 Bob 14:00 15:00 Hist"
DF <- read.table(text = Lines, header = TRUE)
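For comparison, here is a minimal base R sketch of the same overlap test, in case sqldf is not available. The helper `to_minutes()` and the output column names are my own invention for illustration; the overlap condition is the one from the SQL above (inclusive, so lectures that merely touch count as a clash).

```r
DF <- data.frame(
  ModuleCode1 = c("11A", "13A", "13C", "15B", "17B"),
  StartTime   = c("10:30", "14:30", "11:45", "09:40", "14:00"),
  EndTime     = c("12:30", "15:30", "12:45", "10:40", "15:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert "HH:MM" strings to minutes since midnight
to_minutes <- function(hhmm) {
  parts <- strsplit(hhmm, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

s <- to_minutes(DF$StartTime)
e <- to_minutes(DF$EndTime)

# Every unordered pair of rows exactly once
pairs <- t(combn(nrow(DF), 2))
i <- pairs[, 1]
j <- pairs[, 2]

# Two intervals overlap when each one starts no later than the other ends
hit <- s[i] <= e[j] & e[i] >= s[j]

clashes <- data.frame(
  ModuleCode1 = DF$ModuleCode1[i[hit]],
  StartTime1  = DF$StartTime[i[hit]],
  EndTime1    = DF$EndTime[i[hit]],
  ModuleCode2 = DF$ModuleCode1[j[hit]],
  StartTime2  = DF$StartTime[j[hit]],
  EndTime2    = DF$EndTime[j[hit]],
  stringsAsFactors = FALSE
)
print(clashes)
```

With real date-times, `to_minutes()` could presumably be swapped for `as.numeric(as.POSIXct(x, format = "%d %b %Y %H:%M", tz = "UTC"))`. For about 1000 rows, `combn()` builds roughly 500,000 pairs, which should still be manageable.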


Merge two datasets based on time interval in R

I have two datasets.
Dataset X looks as follows. It contains 30-minute intervals of the trading day of some stock index, which opens at 9:30 and closes at 15:00 for DJ but at 16:00 for DX, so the closing time may vary by Ticker.
Date Ticker end_time start_time
1997-10-06 DJ 10:00 09:30
1997-10-06 DJ 10:30 10:00
1997-10-06 DJ 11:00 10:30
1997-10-06 DJ 11:30 11:00
1997-10-08 DJ 09:30 15:00
1997-10-08 DJ 10:00 09:30
1997-10-06 DX 10:00 09:30
1997-10-06 DX 10:30 10:00
1997-10-06 DX 11:00 10:30
1997-10-06 DX 11:30 11:00
1997-10-07 DX 14:30 14:00
1997-10-07 DX 15:00 14:30
1997-10-07 DX 15:30 15:00
1997-10-07 DX 16:00 15:30
1997-10-08 DX 09:30 16:00
1997-10-08 DX 10:00 09:30
Dataset Y looks as follows:
Date Time Event
1997-10-06 09:30 Event1
1997-10-06 10:30 Event2
1997-10-07 22:00 Event3
1997-10-08 09:00 Event4
1997-10-08 09:30 Event5
1997-10-08 09:30 Event6
My aim is to link events in Y to X based on whether the event date-time occurs within the start/end time interval. My expected output is something like this (dataset Z):
Date Ticker end_time start_time Event
1997-10-06 DJ 10:00 09:30 Event1
1997-10-06 DJ 10:30 10:00 NA
1997-10-06 DJ 11:00 10:30 Event2
1997-10-06 DJ 11:30 11:00 NA
1997-10-08 DJ 09:30 15:00 Event3,Event4
1997-10-08 DJ 10:00 09:30 Event5,Event6
1997-10-06 DX 10:00 09:30 Event1
1997-10-06 DX 10:30 10:00 NA
1997-10-06 DX 11:00 10:30 Event2
1997-10-06 DX 11:30 11:00 NA
1997-10-07 DX 14:30 14:00 NA
1997-10-07 DX 15:00 14:30 NA
1997-10-07 DX 15:30 15:00 NA
1997-10-07 DX 16:00 15:30 NA
1997-10-08 DX 09:30 16:00 Event3,Event4
1997-10-08 DX 10:00 09:30 Event5,Event6
It is thus possible for multiple events to happen within one interval. Is it possible to store all of them in the column "Event"? It is also possible that an event occurs after a market closes, in which case it should be stored in the first interval that occurs after the event. How can I obtain this expected output? I have been thinking for a while now, but I have no clue where to start.
Edit: X contains 400k 30-min intervals. Y contains 40k events.
There are lots of ways of approaching the problem; here's just one suggestion.
I'm using these datasets:
x <- read.table(text = "Date,Ticker,end_time,start_time
06/10/1997,DJ,10:00,09:30
06/10/1997,DJ,10:30,10:00
06/10/1997,DJ,11:00,10:30
06/10/1997,DJ,11:30,11:00
08/10/1997,DJ,09:30,15:00
08/10/1997,DJ,10:00,09:30
06/10/1997,DX,10:00,09:30
06/10/1997,DX,10:30,10:00
06/10/1997,DX,11:00,10:30
06/10/1997,DX,11:30,11:00
07/10/1997,DX,14:30,14:00
07/10/1997,DX,15:00,14:30
07/10/1997,DX,15:30,15:00
07/10/1997,DX,16:00,15:30
08/10/1997,DX,09:30,16:00
08/10/1997,DX,10:00,09:30
08/10/1997,DX,10:00,09:30", sep =",", header = TRUE, stringsAsFactors =
FALSE)
y <- read.table(text = "Date,Time,Event
06/10/1997,09:30,Event1
06/10/1997,10:30,Event2
07/10/1997,22:00,Event3
08/10/1997,09:00,Event4
08/10/1997,09:30,Event5
08/10/1997,09:30,Event6
", sep =",", header = TRUE, stringsAsFactors = FALSE)
I would start by concatenating and formatting the dates and times so they can be used in functions to check whether an event occurred in that window. Assuming you have two data frames called x and y in the structure described above:
y$date_time <- strptime(paste(y$Time,y$Date),format="%H:%M %d/%m/%Y")
x$start_time_date <- strptime(paste(x$start_time,x$Date),format="%H:%M %d/%m/%Y")
x$end_time_date <- strptime(paste(x$end_time,x$Date),format="%H:%M %d/%m/%Y")
If you have control over how the dataset is compiled, it would be easier to record the start and end date-times in this combined form from the outset. As it stands, periods that cross over a date get start date-times that are after their end date-times, because both are built from the same Date. We can fix those by using the date from the previous entry in the data frame instead, assuming the rows are always in chronological order and there are no missing periods. This is a bit of a hack!:
#check which entries cross over a date
overnight_idx <- which(x$end_time_date < x$start_time_date)
#replace start date with that of preceding entry in the data frame
x[overnight_idx, 'start_time_date'] <-
as.POSIXct(paste(x[overnight_idx, 'start_time'],
x[overnight_idx - 1, 'Date']),format="%H:%M %d/%m/%Y",
origin = "1970-01-01")
Now we can write a function that for a given row in the data frame x will extract any events that have occurred listed in y, and then do a little bit of formatting to get it in the format you described.
checkEvent <- function(x_row){
y2 <- y[y$date_time>=x_row['start_time_date'] &
y$date_time<x_row['end_time_date'], 'Event']
if(length(y2)==0){
y2 <- NA
} else if(length(y2)>1){
y2 <- paste(y2,collapse = ' ')
}
return(y2)
}
Then we can just apply that to x
x$Event <- apply(x,1,checkEvent)
which will produce the following (ignoring the columns we created above so it fits on the screen):
> x[,c('Date','Ticker','end_time','start_time','Event')]
Date Ticker end_time start_time Event
1 06/10/1997 DJ 10:00 09:30 Event1
2 06/10/1997 DJ 10:30 10:00 <NA>
3 06/10/1997 DJ 11:00 10:30 Event2
4 06/10/1997 DJ 11:30 11:00 <NA>
5 08/10/1997 DJ 09:30 15:00 Event3 Event4
6 08/10/1997 DJ 10:00 09:30 Event5 Event6
7 06/10/1997 DX 10:00 09:30 Event1
8 06/10/1997 DX 10:30 10:00 <NA>
9 06/10/1997 DX 11:00 10:30 Event2
10 06/10/1997 DX 11:30 11:00 <NA>
11 07/10/1997 DX 14:30 14:00 <NA>
12 07/10/1997 DX 15:00 14:30 <NA>
13 07/10/1997 DX 15:30 15:00 <NA>
14 07/10/1997 DX 16:00 15:30 <NA>
15 08/10/1997 DX 09:30 16:00 Event3 Event4
16 08/10/1997 DX 10:00 09:30 Event5 Event6
17 08/10/1997 DX 10:00 09:30 Event5 Event6
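Since X has 400k intervals, the row-by-row apply() may be slow. A possible vectorised alternative is sketched below for a single ticker, assuming the interval end times are sorted ascending. Every event goes to the first interval that ends strictly after it, which also covers events that happen after the market closes. The object names and the cut-down sample data are made up for illustration.

```r
# Interval end times for one ticker, sorted ascending (hypothetical sample)
ends <- as.POSIXct(c("1997-10-06 10:00", "1997-10-06 10:30",
                     "1997-10-06 11:00", "1997-10-08 09:30",
                     "1997-10-08 10:00"), tz = "UTC")

events <- data.frame(
  time  = as.POSIXct(c("1997-10-06 09:30", "1997-10-06 10:30",
                       "1997-10-07 22:00", "1997-10-08 09:00",
                       "1997-10-08 09:30", "1997-10-08 09:30"), tz = "UTC"),
  Event = paste0("Event", 1:6),
  stringsAsFactors = FALSE
)

# findInterval() counts how many end times are <= each event time,
# so adding 1 gives the first interval ending strictly after the event
idx <- findInterval(as.numeric(events$time), as.numeric(ends)) + 1
idx[idx > length(ends)] <- NA   # events after the last interval stay unmatched

# Collect the event labels per interval; intervals without events get NA
grouped <- split(events$Event, factor(idx, levels = seq_along(ends)))
labels <- vapply(
  grouped,
  function(ev) if (length(ev)) paste(ev, collapse = ",") else NA_character_,
  character(1)
)
print(unname(labels))
```

For the full data this would have to be run once per Ticker (for example via split() on the Ticker column), since each ticker has its own set of intervals.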

Unpredictable results using cut() function in R to convert dates to 15 minute intervals

OK, this is making me crazy.
I have several datasets with time values that need to be rolled up into 15 minute intervals.
I found a solution here that works beautifully on one dataset. But on the next one I try to do I'm getting weird results. I have a column with character data representing dates:
BeginTime
-------------------------------
1 1/3/19 1:50 PM
2 1/3/19 1:30 PM
3 1/3/19 4:56 PM
4 1/4/19 11:23 AM
5 1/6/19 7:45 PM
6 1/7/19 10:15 PM
7 1/8/19 12:02 PM
8 1/9/19 10:43 PM
And I'm using the following code (which is exactly what I used on the other dataset except for the names)
df$by15 = cut(mdy_hm(df$BeginTime), breaks="15 min")
but what I get is:
BeginTime by15
-------------------------------------------------------
1 1/3/19 1:50 PM 2019-01-03 13:36:00
2 1/3/19 1:30 PM 2019-01-03 13:21:00
3 1/3/19 4:56 PM 2019-01-03 16:51:00
4 1/4/19 11:23 AM 2019-01-04 11:21:00
5 1/6/19 7:45 PM 2019-01-06 19:36:00
6 1/7/19 10:15 PM 2019-01-07 22:06:00
7 1/8/19 12:02 PM 2019-01-08 11:51:00
8 1/9/19 10:43 PM 2019-01-09 22:36:00
9 1/10/19 11:25 AM 2019-01-10 11:21:00
Any suggestions on why I'm getting such random times instead of the 15-minute intervals I'm looking for? Like I said, this worked fine on the other data set.
You can use the lubridate::round_date() function, which rounds your date-time data to the nearest 15 minutes, as follows:
library(lubridate) # To handle datetime data
library(dplyr) # For data manipulation
# Creating dataframe
df <-
data.frame(
BeginTime = c("1/3/19 1:50 PM", "1/3/19 1:30 PM", "1/3/19 4:56 PM",
"1/4/19 11:23 AM", "1/6/19 7:45 PM", "1/7/19 10:15 PM",
"1/8/19 12:02 PM", "1/9/19 10:43 PM")
)
df %>%
# First we parse the strings into date-times; the input is month/day/year
mutate(by15 = parse_date_time(BeginTime, '%m/%d/%y %I:%M %p'),
# Then we round to the nearest 15-minute mark (nearest, not always down)
by15 = round_date(by15, "15 mins"))
#
# BeginTime by15
# 1/3/19 1:50 PM 2019-01-03 13:45:00
# 1/3/19 1:30 PM 2019-01-03 13:30:00
# 1/3/19 4:56 PM 2019-01-03 17:00:00
# 1/4/19 11:23 AM 2019-01-04 11:30:00
# 1/6/19 7:45 PM 2019-01-06 19:45:00
# 1/7/19 10:15 PM 2019-01-07 22:15:00
# 1/8/19 12:02 PM 2019-01-08 12:00:00
# 1/9/19 10:43 PM 2019-01-09 22:45:00
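As for why cut() looked random: as far as I can tell, cut() with breaks = "15 min" anchors the bins at the earliest timestamp in the data (with seconds dropped) rather than at the top of the hour. That would explain boundaries such as :06, :21, :36 and :51 on one dataset and clean quarter hours on another. If you want bins that always start on the quarter hour and always round down, a base R sketch is below. It assumes the times are in UTC, or in a zone whose offset is a multiple of 15 minutes.

```r
# Sample timestamps (assumed to be UTC for this sketch)
t <- as.POSIXct(c("2019-01-03 13:50", "2019-01-03 16:56", "2019-01-04 11:23"),
                tz = "UTC")

bin_secs <- 15 * 60   # bin width in seconds

# Integer-divide the seconds since the epoch by the bin width, then scale back
floored <- as.POSIXct(floor(as.numeric(t) / bin_secs) * bin_secs,
                      origin = "1970-01-01", tz = "UTC")

print(format(floored, "%Y-%m-%d %H:%M"))
```

lubridate::floor_date() with a 15-minute unit should give the same kind of result if you would rather stay within lubridate.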

How to parse year from a date in r [duplicate]

I have a data set of 53,000 dates and I want to extract only the year from the date variable.
Do you know how I can do this?
My data are as follows:
OPN_DT_TM
18/07/2003 10:55
12/06/2004 6:00
9/06/2007 12:20
29/06/2001 16:00
6/06/2000 7:55
27/11/2006 10:15
17/11/2001 17:00
12/05/2004 22:00
16/04/2005 22:00
18/03/2005 8:40
13/06/2006 11:10
30/07/2006 12:00
16/07/2002 6:10
16/07/2002 7:15
3/09/2004 6:00
9/11/2004 15:20
25/08/2005 14:15
24/11/2001 19:10
15/04/2002 6:30
20/06/2002 6:30
17/03/2003 7:00
15/01/2005 13:00
23/03/2007 1:00
21/01/2001 10:30
,,,
This can be achieved by converting the entries into Date format and extracting the year, for instance like this:
> format(as.Date("15/01/2005 13:00", format="%d/%m/%Y %H:%M"),"%Y")
[1] "2005"
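The same idea applied to a whole column might look like the sketch below. The vector `opn` stands in for the OPN_DT_TM column, and the conversion to integer is my own addition in case numeric years are needed.

```r
# A few values in the same format as OPN_DT_TM
opn <- c("18/07/2003 10:55", "12/06/2004 6:00", "9/06/2007 12:20")

# Parse day/month/year (the time part is read but dropped by as.Date)
dates <- as.Date(opn, format = "%d/%m/%Y %H:%M")

# Pull out the four-digit year and convert it to an integer
years <- as.integer(format(dates, "%Y"))
print(years)
```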
For more detail on dates and times in R, see the help pages ?strptime and ?DateTimeClasses.

standard deviation of specific row numbers and put the value in another row & column in R

I have the following data:
Date Value Std.Dev
11/30/2015 10:00 0
11/30/2015 10:30 -0.002400962
11/30/2015 11:00 -0.004819286
11/30/2015 11:30 -0.000805477
11/30/2015 12:00 -0.001612904
11/30/2015 12:30 -0.003233633
11/30/2015 13:00 0.000809389
11/30/2015 13:30 0.005647453
11/30/2015 14:00 -0.002416433
11/30/2015 14:30 -0.006472515
11/30/2015 15:00 -0.002438035
11/30/2015 15:30 0
11/30/2015 16:30 -0.000814001
12/1/2015 9:00 0.006493529 0.002931114
12/1/2015 9:30 -0.001619434 0.003657839
12/1/2015 10:00 -0.003246756 0.00363798
12/1/2015 10:30 -0.002442004 0.003519869
12/1/2015 11:00 0.000814664 0.003551266
12/1/2015 11:30 -0.001629992 0.00357286
12/1/2015 12:00 0.000815328 0.003504601
12/1/2015 12:30 -1.11022E-16 0.003504796
12/1/2015 13:00 -0.000815328 0.002981979
Std.Dev should start at row 14, because the first standard deviation is calculated on the previous day's values: the standard deviation for row 14 is calculated on rows 1 to 13 of Value, and so on. So Std.Dev_at_row_number_15 = STDEV(Value2:Value14).
Std.Dev_at_row_number_16 = STDEV(Value3:Value15). And so on....
Can you please suggest a function for this kind of calculation in R? In Excel it is very easy, and something similar in R would be very helpful.
Thanks.
Pardon my English. Please let me know in the comments if you want more details or an example.
Definitely not the most efficient way, but maybe sufficient for you (with x denoting your data frame):
for(counter in 14:nrow(x)){
x[counter,3] <- sd(x[(counter-13):(counter-1),2])
}
But again, that's definitely not the most efficient way.
For a data.frame, df, you can get this as follows with sapply:
df$st.dev <- c(rep(NA, 13), sapply(13:(nrow(df)-1), function(i) sd(df$Value[(i-12):i])))
sapply will run through the selected rows and the function that follows will repeatedly calculate the standard deviations for the selected rows. I prepend NAs to this output so that it can be added to the data.frame.
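If you need this for other window sizes, the same calculation can be wrapped in a small function. The sketch below uses base R's embed() to build all windows at once. The function name `roll_sd_prev` and the toy vector are made up for illustration, and it assumes the vector has at least k elements.

```r
# Standard deviation of the k values *before* each position (NA for the first k)
roll_sd_prev <- function(v, k) {
  stopifnot(length(v) >= k)
  windows <- embed(v, k)        # each row holds k consecutive values of v
  sds <- apply(windows, 1, sd)  # sds[j] covers v[j], ..., v[j + k - 1]
  # Shift by one so each position only sees the values before it
  c(rep(NA_real_, k), head(sds, -1))
}

v <- c(1, 2, 4, 7, 11, 16)
res <- roll_sd_prev(v, 3)
print(res)
```

For the data in the question this would presumably be called as `df$st.dev <- roll_sd_prev(df$Value, 13)`.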
data
I cheated a little in reading in the data, but it doesn't affect the result.
df <- read.table(header=T, text="Date Time Value
11/30/2015 10:00 0
11/30/2015 10:30 -0.002400962
11/30/2015 11:00 -0.004819286
11/30/2015 11:30 -0.000805477
11/30/2015 12:00 -0.001612904
11/30/2015 12:30 -0.003233633
11/30/2015 13:00 0.000809389
11/30/2015 13:30 0.005647453
11/30/2015 14:00 -0.002416433
11/30/2015 14:30 -0.006472515
11/30/2015 15:00 -0.002438035
11/30/2015 15:30 0
11/30/2015 16:30 -0.000814001
12/1/2015 9:00 0.006493529
12/1/2015 9:30 -0.001619434
12/1/2015 10:00 -0.003246756
12/1/2015 10:30 -0.002442004
12/1/2015 11:00 0.000814664
12/1/2015 11:30 -0.001629992
12/1/2015 12:00 0.000815328
12/1/2015 12:30 -1.11022E-16
12/1/2015 13:00 -0.000815328", as.is=TRUE)

R Programming: How to arrange the ticks values in datetime plot in ggplot [duplicate]

I have these data,
Arrival Date
7:50 Apr-19
7:45 Apr-20
7:30 Apr-23
7:30 Apr-24
7:55 Apr-25
7:20 Apr-26
7:30 Apr-27
7:50 Apr-28
8:00 Apr-30
7:45 May-2
8:30 May-3
8:06 May-4
8:25 May-7
7:35 May-8
7:45 May-9
8:02 May-10
7:53 May-11
8:39 May-14
8:14 May-15
8:08 May-16
8:27 May-17
8:20 May-18
12:00 Apr-19
12:00 Apr-20
12:00 Apr-23
12:00 Apr-24
12:00 Apr-25
12:00 Apr-26
12:00 Apr-27
12:00 Apr-28
11:50 Apr-30
12:00 May-2
11:45 May-3
11:50 May-4
12:00 May-7
11:50 May-8
11:55 May-9
12:10 May-10
11:53 May-11
11:54 May-14
11:40 May-15
11:54 May-16
11:45 May-17
12:00 May-18
And I want to plot it using ggplot,
This is what I did,
OJT <- read.csv(file = "Data.csv", header = TRUE)
qplot(Date,Arrival, data = OJT, xlab = expression(bold("Date")), ylab = expression(bold("Time"))) + theme_bw() + opts(axis.text.x=theme_text(angle=90)) +geom_point(size = 2, colour = "black", fill = "red", pch = 21)
And here is the output
As you can see, the times and dates are not in order. I want the time to run from 7:00 am to 12:20 pm, and the date from April 19 to May 18. I tried using
as.Date(strptime(OJT$Date,"%m-%dT"))
But I still don't get the right plot.
And I can't find similar problems on the internet.
Any ideas to help me solve this?
Thanks
I will try a different approach with some wrangling in lubridate. Target plot:
The code, including your data:
library("ggplot2")
library("lubridate")
df <- read.table(text = "Arrival Date
7:50 Apr-19
7:45 Apr-20
7:30 Apr-23
7:30 Apr-24
7:55 Apr-25
7:20 Apr-26
7:30 Apr-27
7:50 Apr-28
8:00 Apr-30
7:45 May-2
8:30 May-3
8:06 May-4
8:25 May-7
7:35 May-8
7:45 May-9
8:02 May-10
7:53 May-11
8:39 May-14
8:14 May-15
8:08 May-16
8:27 May-17
8:20 May-18
12:00 Apr-19
12:00 Apr-20
12:00 Apr-23
12:00 Apr-24
12:00 Apr-25
12:00 Apr-26
12:00 Apr-27
12:00 Apr-28
11:50 Apr-30
12:00 May-2
11:45 May-3
11:50 May-4
12:00 May-7
11:50 May-8
11:55 May-9
12:10 May-10
11:53 May-11
11:54 May-14
11:40 May-15
11:54 May-16
11:45 May-17
12:00 May-18", header=TRUE)
df$Date <- paste('2012-',df$Date, sep='')
df$Full <- paste(df$Date, df$Arrival, sep=' ')
df$Full <- ymd_hm(df$Full)
df$decimal.hour <- hour(df$Full) + minute(df$Full)/60
p <- ggplot(df, aes(x=Full, y=decimal.hour)) +
geom_point()
p
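If you would rather not depend on lubridate for the wrangling, the two derived columns can also be built in base R, roughly as sketched below. The year 2012 is an assumption (as in the code above), the vector names are made up, and %b relies on an English locale for the month abbreviations.

```r
# A few rows of the data, as character vectors
arrival <- c("7:50", "12:00", "8:30")
day     <- c("Apr-19", "Apr-19", "May-3")

# Hour plus minutes as a fraction of an hour, e.g. "8:30" becomes 8.5
hm <- strsplit(arrival, ":", fixed = TRUE)
decimal_hour <- vapply(hm, function(p) as.numeric(p[1]) + as.numeric(p[2]) / 60,
                       numeric(1))

# Prepend the assumed year and parse the abbreviated month name
date <- as.Date(paste0("2012-", day), format = "%Y-%b-%d")

print(data.frame(date, decimal_hour))
```

Plotting `date` rather than the full date-time on the x axis should line up the morning and noon points for the same day vertically.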
# Make some data in your kind of format.
# Assumption: dummySeries() comes from the timeSeries package.
library(timeSeries)
library(ggplot2)
tS <- dummySeries()
a<-rownames(tS)
x<-c(a,a)
y<-1:24
dat<-as.data.frame(cbind(x,y))
# Get it in the format for the plot
v<-paste(dat$x,dat$y, sep=" ")
v2<-as.POSIXct(strptime(v, "%Y-%m-%d %H",tz="GMT"))
v3<-sort(v2)
hrs<-strftime(v2,"%H")
days<-strftime(v2,"%Y-%m-%d")
final<-data.frame(cbind(days,hrs))
qplot(days,hrs,data=final) + geom_point()
# Oof... I bet this can be done much more cleanly. I know little about
# time series data.
