Merge two datasets based on time interval in R - r

I have two datasets.
Dataset X looks as follows. It contains 30-minute intervals of the trading day for some stock indices. Trading opens at 09:30 and closes at 15:00 for DJ but at 16:00 for DX, so the closing time varies by Ticker.
Date Ticker end_time start_time
1997-10-06 DJ 10:00 09:30
1997-10-06 DJ 10:30 10:00
1997-10-06 DJ 11:00 10:30
1997-10-06 DJ 11:30 11:00
1997-10-08 DJ 09:30 15:00
1997-10-08 DJ 10:00 09:30
1997-10-06 DX 10:00 09:30
1997-10-06 DX 10:30 10:00
1997-10-06 DX 11:00 10:30
1997-10-06 DX 11:30 11:00
1997-10-07 DX 14:30 14:00
1997-10-07 DX 15:00 14:30
1997-10-07 DX 15:30 15:00
1997-10-07 DX 16:00 15:30
1997-10-08 DX 09:30 16:00
1997-10-08 DX 10:00 09:30
Dataset Y looks as follows:
Date Time Event
1997-10-06 09:30 Event1
1997-10-06 10:30 Event2
1997-10-07 22:00 Event3
1997-10-08 09:00 Event4
1997-10-08 09:30 Event5
1997-10-08 09:30 Event6
My aim is to link events in Y to X based on whether the event date-time falls within the start/end time interval. My expected output looks something like this (dataset Z):
Date Ticker end_time start_time Event
1997-10-06 DJ 10:00 09:30 Event1
1997-10-06 DJ 10:30 10:00 NA
1997-10-06 DJ 11:00 10:30 Event2
1997-10-06 DJ 11:30 11:00 NA
1997-10-08 DJ 09:30 15:00 Event3,Event4
1997-10-08 DJ 10:00 09:30 Event5,Event6
1997-10-06 DX 10:00 09:30 Event1
1997-10-06 DX 10:30 10:00 NA
1997-10-06 DX 11:00 10:30 Event2
1997-10-06 DX 11:30 11:00 NA
1997-10-07 DX 14:30 14:00 NA
1997-10-07 DX 15:00 14:30 NA
1997-10-07 DX 15:30 15:00 NA
1997-10-07 DX 16:00 15:30 NA
1997-10-08 DX 09:30 16:00 Event3,Event4
1997-10-08 DX 10:00 09:30 Event5,Event6
So multiple events can happen within one interval. Is it possible to store all of them in the column "Event"? An event can also occur after the market closes, in which case it should be stored in the first interval that occurs after the event. How can I obtain this expected output? I have been thinking about this for a while now, but I have no clue where to start.
Edit: X contains 400k 30-min intervals. Y contains 40k events.

There are lots of ways of approaching the problem; here's just one suggestion.
I'm using these datasets:
x <- read.table(text = "Date,Ticker,end_time,start_time
06/10/1997,DJ,10:00,09:30
06/10/1997,DJ,10:30,10:00
06/10/1997,DJ,11:00,10:30
06/10/1997,DJ,11:30,11:00
08/10/1997,DJ,09:30,15:00
08/10/1997,DJ,10:00,09:30
06/10/1997,DX,10:00,09:30
06/10/1997,DX,10:30,10:00
06/10/1997,DX,11:00,10:30
06/10/1997,DX,11:30,11:00
07/10/1997,DX,14:30,14:00
07/10/1997,DX,15:00,14:30
07/10/1997,DX,15:30,15:00
07/10/1997,DX,16:00,15:30
08/10/1997,DX,09:30,16:00
08/10/1997,DX,10:00,09:30
08/10/1997,DX,10:00,09:30", sep =",", header = TRUE, stringsAsFactors =
FALSE)
y <- read.table(text = "Date,Time,Event
06/10/1997,09:30,Event1
06/10/1997,10:30,Event2
07/10/1997,22:00,Event3
08/10/1997,09:00,Event4
08/10/1997,09:30,Event5
08/10/1997,09:30,Event6
", sep =",", header = TRUE, stringsAsFactors = FALSE)
I would start by concatenating and formatting the dates and times so they can be used in functions to check whether an event occurred in that window. Assuming you have two data frames called x and y in the structure described above:
y$date_time <- strptime(paste(y$Time,y$Date),format="%H:%M %d/%m/%Y")
x$start_time_date <- strptime(paste(x$start_time,x$Date),format="%H:%M %d/%m/%Y")
x$end_time_date <- strptime(paste(x$end_time,x$Date),format="%H:%M %d/%m/%Y")
If you have control over how the dataset is compiled, it might be easier to record the start and end as full date-times in the first place. Built the way shown here, the periods that cross over a date end up with a start date-time that is after the end date-time. We can fix those by using the date from the previous entry in the data frame instead, assuming the rows are always in chronological order and there are no missing periods. This is a bit of a hack!
#check which entries cross over a date
overnight_idx <- which(x$end_time_date < x$start_time_date)
#replace start date with that of preceding entry in the data frame
x[overnight_idx, 'start_time_date'] <-
as.POSIXct(paste(x[overnight_idx, 'start_time'],
x[overnight_idx - 1, 'Date']),format="%H:%M %d/%m/%Y",
origin = "1970-01-01")
Now we can write a function that, for a given row in the data frame x, extracts any events listed in y that occurred within that row's interval, and then does a little formatting to get the result into the format you described.
checkEvent <- function(x_row){
  y2 <- y[y$date_time>=x_row['start_time_date'] &
          y$date_time<x_row['end_time_date'], 'Event']
  if(length(y2)==0){
    y2 <- NA
  } else if(length(y2)>1){
    y2 <- paste(y2,collapse = ' ')
  }
  return(y2)
}
Then we can just apply that to x
x$Event <- apply(x,1,checkEvent)
which will produce the following (ignoring the columns we created above so it fits on the screen):
> x[,c('Date','Ticker','end_time','start_time','Event')]
Date Ticker end_time start_time Event
1 06/10/1997 DJ 10:00 09:30 Event1
2 06/10/1997 DJ 10:30 10:00 <NA>
3 06/10/1997 DJ 11:00 10:30 Event2
4 06/10/1997 DJ 11:30 11:00 <NA>
5 08/10/1997 DJ 09:30 15:00 Event3 Event4
6 08/10/1997 DJ 10:00 09:30 Event5 Event6
7 06/10/1997 DX 10:00 09:30 Event1
8 06/10/1997 DX 10:30 10:00 <NA>
9 06/10/1997 DX 11:00 10:30 Event2
10 06/10/1997 DX 11:30 11:00 <NA>
11 07/10/1997 DX 14:30 14:00 <NA>
12 07/10/1997 DX 15:00 14:30 <NA>
13 07/10/1997 DX 15:30 15:00 <NA>
14 07/10/1997 DX 16:00 15:30 <NA>
15 08/10/1997 DX 09:30 16:00 Event3 Event4
16 08/10/1997 DX 10:00 09:30 Event5 Event6
17 08/10/1997 DX 10:00 09:30 Event5 Event6
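With 400k intervals and 40k events, the row-by-row apply() above may get slow. Here is a minimal sketch of a different technique, findInterval() on sorted interval end times, rather than the approach in the answer. It assumes the intervals of each ticker are contiguous, so every event belongs to the first interval that ends strictly after it (which also covers the overnight case). The toy data and the names to_ct, end_ct, ct and slot are made up for illustration.

```r
# Toy data in the same shape as x and y above (dates as dd/mm/yyyy).
x <- data.frame(
  Date       = c("06/10/1997", "06/10/1997", "08/10/1997", "08/10/1997"),
  Ticker     = "DJ",
  end_time   = c("10:00", "10:30", "09:30", "10:00"),
  start_time = c("09:30", "10:00", "15:00", "09:30"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  Date  = c("06/10/1997", "07/10/1997", "08/10/1997", "08/10/1997", "08/10/1997"),
  Time  = c("09:30", "22:00", "09:00", "09:30", "09:30"),
  Event = c("Event1", "Event3", "Event4", "Event5", "Event6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse date + time into POSIXct in a fixed time zone.
to_ct <- function(date, time) {
  as.POSIXct(paste(date, time), format = "%d/%m/%Y %H:%M", tz = "UTC")
}
x$end_ct <- to_ct(x$Date, x$end_time)
y$ct     <- to_ct(y$Date, y$Time)

x$Event <- NA_character_
for (tk in unique(x$Ticker)) {
  rows <- which(x$Ticker == tk)
  rows <- rows[order(x$end_ct[rows])]        # findInterval needs sorted breaks
  ends <- as.numeric(x$end_ct[rows])
  # (number of interval ends <= event time) + 1 is the position of the
  # first interval that ends strictly after the event.
  slot <- findInterval(as.numeric(y$ct), ends) + 1
  ok   <- slot <= length(ends)               # ignore events after the last interval
  ev   <- tapply(y$Event[ok], slot[ok], paste, collapse = ",")
  x$Event[rows[as.integer(names(ev))]] <- as.vector(ev)
}

x[, c("Date", "Ticker", "end_time", "start_time", "Event")]
```

Note that under this assumption an event before the very first interval of a ticker also lands in that first interval; if that is not wanted, an extra check against the interval start would be needed.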

Related

How to grepl search for the max and min timings in a string?

I have a dataset with a column containing the opening and closing times of various stores.
The timings are in string format Opening time - Closing time,
eg: 17:00 - 21:00 | 11:30 - 14:30 | 11:30 - 14:30
I want to extract the minimum opening time within the above string, i.e. 11:30, and the maximum closing time, i.e. 21:00. How do I do that using R?
DPUT:
structure(list(head.timings_remapping.Opening.And.Closing.Time..40. = c("15:30 - 21:30",
"12:00 - 00:00", "11:00 - 15:00 | 16:30 - 20:45", "12:00 - 22:30",
"17:00 - 21:30", "17:00 - 21:30", "16:30 - 00:00", "16:00 - 21:15",
"16:30 - 20:30", "17:00 - 20:00", "16:00 - 23:30", "16:30 - 21:30",
"17:00 - 22:00", "17:00 - 22:00", "17:00 - 21:30", "17:00 - 21:30",
"16:00 - 00:00", "16:30 - 23:59", "11:30 - 22:30", "11:30 - 23:59",
"17:00 - 20:30", "07:30 - 12:50", "16:15 - 23:00", "09:00 - 21:00",
"10:00 - 21:00", "11:00 - 22:00", "07:00 - 12:00 | 07:00 - 13:30 | 12:00 - 13:30",
"07:00 - 13:00 | 10:00 - 15:00", "10:00 - 02:00", "00:00 - 23:59",
"00:00 - 23:59", "11:00 - 20:00", "11:00 - 20:00", NA, "12:00 - 03:30 | 11:00 - 00:00",
"05:30 - 15:00", "07:00 - 16:00", "08:30 - 13:30", "17:00 - 21:00 | 11:30 - 14:30 | 11:30 - 14:30",
"12:00 - 01:00")), class = "data.frame", row.names = c(NA, -40L
))
The final output will have two columns "Opening time" and "Closing time"
Does this work:
library(dplyr)
library(tidyr)
df %>%
separate(col = head.timings_remapping.Opening.And.Closing.Time..40., into = c('Open_Close','A'), sep = '\\|') %>%
separate(col = Open_Close, into = c('Opening Time','Closing Time'), sep = ' - ') %>%
mutate(`Opening Time` = trimws(`Opening Time`), `Closing Time` = trimws(`Closing Time`)) %>% select(-A)
Opening Time Closing Time
1 15:30 21:30
2 12:00 00:00
3 11:00 15:00
4 12:00 22:30
5 17:00 21:30
6 17:00 21:30
7 16:30 00:00
8 16:00 21:15
9 16:30 20:30
10 17:00 20:00
11 16:00 23:30
12 16:30 21:30
13 17:00 22:00
14 17:00 22:00
15 17:00 21:30
16 17:00 21:30
17 16:00 00:00
18 16:30 23:59
19 11:30 22:30
20 11:30 23:59
21 17:00 20:30
22 07:30 12:50
23 16:15 23:00
24 09:00 21:00
25 10:00 21:00
26 11:00 22:00
27 07:00 12:00
28 07:00 13:00
29 10:00 02:00
30 00:00 23:59
31 00:00 23:59
32 11:00 20:00
33 11:00 20:00
34 <NA> <NA>
35 12:00 03:30
36 05:30 15:00
37 07:00 16:00
38 08:30 13:30
39 17:00 21:00
40 12:00 01:00
Using the dplyr and tidyr libraries you can do:
library(dplyr)
library(tidyr)
#Rename the long column name to something smaller
names(df)[1] <- 'Time'
df %>%
#Create a row index
mutate(row = row_number()) %>%
#Split the data in different rows on '|'
separate_rows(Time, sep = '\\s*\\|\\s*') %>%
#split the data on '-'
separate(Time, c("Opening_Time", "Closing_time"), sep = '\\s*-\\s*') %>%
#Change the time to POSIXct format
mutate(across(c(Opening_Time, Closing_time), as.POSIXct, format = '%H:%M')) %>%
#For each row
group_by(row) %>%
#Get minimum opening time and maximum closing time
#and change into required format
summarise(Opening_Time = format(min(Opening_Time), "%H:%M"),
Closing_time = format(max(Closing_time), "%H:%M")) %>%
#Drop row column
select(-row)
This returns
# Opening_Time Closing_time
# <chr> <chr>
# 1 15:30 21:30
# 2 12:00 00:00
# 3 11:00 20:45
# 4 12:00 22:30
# 5 17:00 21:30
# 6 17:00 21:30
# 7 16:30 00:00
# 8 16:00 21:15
# 9 16:30 20:30
#10 17:00 20:00
# … with 30 more rows

Transforming data into xts format

I have some data, and the Date column includes the time too. I am trying to get this data into xts format. I have tried below, but I get an error. Can anyone see anything wrong with this code? TIA
Date Open High Low Close
1 2017.01.30 07:00 1.25735 1.25761 1.25680 1.25698
2 2017.01.30 08:00 1.25697 1.25702 1.25615 1.25619
3 2017.01.30 09:00 1.25618 1.25669 1.25512 1.25533
4 2017.01.30 10:00 1.25536 1.25571 1.25093 1.25105
5 2017.01.30 11:00 1.25104 1.25301 1.25093 1.25262
6 2017.01.30 12:00 1.25260 1.25479 1.25229 1.25361
7 2017.01.30 13:00 1.25362 1.25417 1.25096 1.25177
8 2017.01.30 14:00 1.25177 1.25219 1.24900 1.25071
9 2017.01.30 15:00 1.25070 1.25307 1.24991 1.25238
10 2017.01.30 16:00 1.25238 1.25358 1.25075 1.25159
df = read.table(file = "GBPUSD60.csv", sep="," , header = TRUE)
dates = as.character(df$Date)
df$Date = NULL
Sept17 = xts(df, as.POSIXct(dates, format="%Y-%m-%d %H:%M"))
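The question does not show the error message, so this is only a guess at the cause: the data uses dots in the date (2017.01.30) while the format string expects dashes, so as.POSIXct() returns NA for every row, and an index of NAs is not usable for xts(). A small base R sketch with two made-up date strings in the same shape as the data (the names bad and good are for illustration only):

```r
dates <- c("2017.01.30 07:00", "2017.01.30 08:00")

# Format from the question: expects dashes, so nothing parses.
bad <- as.POSIXct(dates, format = "%Y-%m-%d %H:%M", tz = "UTC")

# Format that matches the dots in the data.
good <- as.POSIXct(dates, format = "%Y.%m.%d %H:%M", tz = "UTC")

print(bad)
print(good)
```

If this is indeed the problem, passing the second format in the xts() call should help. The snippet in the question also needs library(xts) to be loaded first.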

Listing pairwise overlaps of Date time elements in R

I have a list of Lectures for a university course stored in a data-frame. This is a large complex table with over 1000 rows. I have used simple time in the example, but this is actually date time in the format %d %b %Y %H:%M. I think I should be able to extrapolate to the more complex usage.
essentially:
ModuleCode1 ModuleName Lecturer StartTime EndTime Course
11A Hist1 Bob 10:30 12:30 Hist
13A Hist2 Bob 14:30 15:30 Hist
13C Hist3 Steve 11:45 12:45 Hist
15B Hist4 Bob 09:40 10:40 Hist
17B Hist5 Bob 14:00 15:00 Hist
I am trying to create an output data frame which determines which modules clash in the timetable and at which times. For example:
ModuleCode1 StartTime EndTime ModuleCode2 StartTime EndTime
11A 10:30 12:30 15B 09:40 10:40
11A 10:30 12:30 13C 11:45 12:45
13A 14:30 15:30 17B 14:00 15:00
There are a multitude of questions on date-time overlaps, but the ones that I can find seem to either work with 2 data frames, or I can't understand them. I have come across the lubridate and IRanges packages, but cannot work out this specific implementation with date-times in a single data frame. It seems like something that would be generally useful and most likely has a simple implementation that I am missing. Grateful for any help.
Here is an sqldf solution. The intervals do NOT overlap iff a.StartTime > b.EndTime or a.EndTime < b.StartTime so they do overlap exactly when the negation of this statement is true, hence:
library(sqldf)
sqldf("select a.ModuleCode1, a.StartTime, a.EndTime, b.ModuleCode1, b.StartTime, b.EndTime
from DF a join DF b on a.ModuleCode1 < b.ModuleCode1 and
a.StartTime <= b.EndTime and
a.EndTime >= b.StartTime")
giving:
ModuleCode1 StartTime EndTime ModuleCode1 StartTime EndTime
1 11A 10:30 12:30 13C 11:45 12:45
2 11A 10:30 12:30 15B 09:40 10:40
3 13A 14:30 15:30 17B 14:00 15:00
Note: The input in reproducible form is:
Lines <- "ModuleCode1 ModuleName Lecturer StartTime EndTime Course
11A Hist1 Bob 10:30 12:30 Hist
13A Hist2 Bob 14:30 15:30 Hist
13C Hist3 Steve 11:45 12:45 Hist
15B Hist4 Bob 09:40 10:40 Hist
17B Hist5 Bob 14:00 15:00 Hist"
DF <- read.table(text = Lines, header = TRUE)
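For anyone who prefers not to go through SQL, the same overlap condition can be sketched in base R with a Cartesian self-join via merge(by = NULL). This is an illustration of the idea, not a replacement for the sqldf answer; the names cols, a, b, pairs and clash are made up. It compares the times as zero-padded "HH:MM" strings, as in the example; real date-times would need converting with as.POSIXct() first.

```r
DF <- read.table(text = "ModuleCode1 ModuleName Lecturer StartTime EndTime Course
11A Hist1 Bob 10:30 12:30 Hist
13A Hist2 Bob 14:30 15:30 Hist
13C Hist3 Steve 11:45 12:45 Hist
15B Hist4 Bob 09:40 10:40 Hist
17B Hist5 Bob 14:00 15:00 Hist", header = TRUE, stringsAsFactors = FALSE)

cols <- c("ModuleCode1", "StartTime", "EndTime")

# Two copies of the table with distinct column names.
a <- setNames(DF[, cols], paste0(cols, ".a"))
b <- setNames(DF[, cols], paste0(cols, ".b"))

# by = NULL gives every combination of a row from a with a row from b.
pairs <- merge(a, b, by = NULL)

# Keep each unordered pair once, and only if the intervals overlap:
# not (a starts after b ends, or a ends before b starts).
clash <- subset(pairs,
                ModuleCode1.a < ModuleCode1.b &
                StartTime.a <= EndTime.b &
                EndTime.a >= StartTime.b)
clash
```

The Cartesian product grows with the square of the number of rows, so with 1000+ lectures this builds about a million candidate pairs, which is still manageable.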

standard deviation of specific row numbers and put the value in another row & column in R

I have following data:
Date Value Std.Dev
11/30/2015 10:00 0
11/30/2015 10:30 -0.002400962
11/30/2015 11:00 -0.004819286
11/30/2015 11:30 -0.000805477
11/30/2015 12:00 -0.001612904
11/30/2015 12:30 -0.003233633
11/30/2015 13:00 0.000809389
11/30/2015 13:30 0.005647453
11/30/2015 14:00 -0.002416433
11/30/2015 14:30 -0.006472515
11/30/2015 15:00 -0.002438035
11/30/2015 15:30 0
11/30/2015 16:30 -0.000814001
12/1/2015 9:00 0.006493529 0.002931114
12/1/2015 9:30 -0.001619434 0.003657839
12/1/2015 10:00 -0.003246756 0.00363798
12/1/2015 10:30 -0.002442004 0.003519869
12/1/2015 11:00 0.000814664 0.003551266
12/1/2015 11:30 -0.001629992 0.00357286
12/1/2015 12:00 0.000815328 0.003504601
12/1/2015 12:30 -1.11022E-16 0.003504796
12/1/2015 13:00 -0.000815328 0.002981979
Std.Dev should be calculated starting from row 14, because the first standard deviation is computed on the previous day's values. The standard deviation for row 14 is calculated on rows 1 to 13 of Value, and the window then moves down one row at a time. So Std.Dev_at_row_number_15 = STDEV(Value2:Value14).
Std.Dev_at_row_number_16 = STDEV(Value3:Value15). And so on....
Can you please suggest a function for this kind of calculation in R? In Excel it is very easy, so something similar in R would be very helpful.
Thanks.
Pardon my English. Please let me know in the comments if you want more details or an example.
Definitely not the most efficient way, but maybe sufficient for you (with x denoting your data frame):
for(counter in 14:nrow(x)){
x[counter,3] <- sd(x[(counter-13):(counter-1),2])
}
But again, that's definitely not the most efficient way.
For a data.frame, df, you can get this as follows with sapply:
df$st.dev <- c(rep(NA, 13), sapply(13:(nrow(df)-1), function(i) sd(df$Value[(i-12):i])))
sapply will run through the selected rows and the function that follows will repeatedly calculate the standard deviations for the selected rows. I prepend NAs to this output so that it can be added to the data.frame.
data
I cheated a little in reading in the data, but it doesn't affect the result.
df <- read.table(header=T, text="Date Time Value
11/30/2015 10:00 0
11/30/2015 10:30 -0.002400962
11/30/2015 11:00 -0.004819286
11/30/2015 11:30 -0.000805477
11/30/2015 12:00 -0.001612904
11/30/2015 12:30 -0.003233633
11/30/2015 13:00 0.000809389
11/30/2015 13:30 0.005647453
11/30/2015 14:00 -0.002416433
11/30/2015 14:30 -0.006472515
11/30/2015 15:00 -0.002438035
11/30/2015 15:30 0
11/30/2015 16:30 -0.000814001
12/1/2015 9:00 0.006493529
12/1/2015 9:30 -0.001619434
12/1/2015 10:00 -0.003246756
12/1/2015 10:30 -0.002442004
12/1/2015 11:00 0.000814664
12/1/2015 11:30 -0.001629992
12/1/2015 12:00 0.000815328
12/1/2015 12:30 -1.11022E-16
12/1/2015 13:00 -0.000815328", as.is=TRUE)
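To make the windowing in both answers easier to check by hand, here is a small sketch that wraps the same calculation in a function and runs it on a short made-up vector with a window of 3 instead of 13. The name lagged_sd and its arguments are hypothetical.

```r
# Hypothetical helper: for each position i, the standard deviation of the
# `width` values immediately before i (positions i - width to i - 1).
lagged_sd <- function(v, width) {
  n   <- length(v)
  out <- rep(NA_real_, n)                 # the first `width` positions stay NA
  if (n > width) {
    idx <- (width + 1):n
    out[idx] <- vapply(idx, function(i) sd(v[(i - width):(i - 1)]), numeric(1))
  }
  out
}

v <- c(1, 2, 3, 4, 5, 7)
lagged_sd(v, 3)
# Positions 4, 5 and 6 use (1,2,3), (2,3,4) and (3,4,5), each with sd = 1.
```

With the data frame above, df$st.dev <- lagged_sd(df$Value, 13) should fill the column in the same way as the sapply version.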

data handling outlier with conditional in R

I have 2 data frames (data by hour and data by day).
I want to mark the hourly outliers with a condition: if the hourly PH lies within that day's range (standard1 to standard2) it is OK, otherwise it is an outlier.
Example:
PH at 11-09-13 10:00 (hourly) = 49.14068
Compare with the range for 11-09-13, which is 20 to 40
49.14068 > 40 => Outlier
I want to run this comparison automatically in R.
I searched for this question but found no results.
Please help!
Data by Hour
DateTime PH
11-09-13 10:00 49.14068
11-09-13 11:00 52.53494167
11-09-13 12:00 24.8525
11-09-13 13:00 8.56055
11-09-13 14:00 23.77944167
11-09-13 15:00 25.13243333
11-09-13 16:00 35.2913
11-09-13 17:00 20.58211667
11-09-13 18:00 18.605975
11-09-13 19:00 59.16179167
11-09-13 20:00 72.06908333
11-09-13 21:00 43.47536667
11-09-13 22:00 44.73696667
11-09-13 23:00 38.7266
12-09-13 0:00 41.12040833
12-09-13 1:00 33.67845833
12-09-13 2:00 38.49083333
12-09-13 3:00 46.20168333
12-09-13 4:00 40.03630833
12-09-13 5:00 41.10841667
12-09-13 6:00 43.753475
12-09-13 7:00 45.077675
12-09-13 8:00 57.53141667
12-09-13 9:00 45.17694167
12-09-13 10:00 41.106525
12-09-13 11:00 30.08048333
12-09-13 12:00 24.70255833
12-09-13 13:00 15.60813333
12-09-13 14:00 14.09729167
........ n day(24h/day)
Data by Day (aggregated from Data by Hour)
DateTime standard1 standard2
11-09-13 20 40
12-09-13 12 50
13-09-13 16 30
....... n day
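One possible way to do this, as a minimal sketch: look up each hourly row's day in the daily table with match() and compare PH against that day's two limits. The few rows below are copied from the tables above; the names hourly, daily, Day, idx and Status are made up, and treating the limits themselves as OK (inclusive bounds) is an assumption.

```r
hourly <- data.frame(
  DateTime = c("11-09-13 10:00", "11-09-13 12:00", "11-09-13 13:00",
               "12-09-13 0:00", "12-09-13 8:00"),
  PH       = c(49.14068, 24.8525, 8.56055, 41.12040833, 57.53141667),
  stringsAsFactors = FALSE
)
daily <- data.frame(
  DateTime  = c("11-09-13", "12-09-13"),
  standard1 = c(20, 12),
  standard2 = c(40, 50),
  stringsAsFactors = FALSE
)

# The day is the text before the first space in the hourly DateTime.
hourly$Day <- sub(" .*$", "", hourly$DateTime)

# Row of the daily table that belongs to each hourly observation.
idx <- match(hourly$Day, daily$DateTime)

# OK if PH lies within [standard1, standard2] for that day, else Outlier.
hourly$Status <- ifelse(hourly$PH >= daily$standard1[idx] &
                        hourly$PH <= daily$standard2[idx],
                        "OK", "Outlier")
hourly
```

Hours whose day is missing from the daily table get NA in Status rather than a label.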
