Merge two datasets based on time interval in R
I have two datasets.
Dataset X looks as follows. It contains the 30-minute intervals of the trading day for some stock indices. Trading opens at 09:30 and closes at 15:00 for DJ, but at 16:00 for DX, so the closing time may vary by Ticker.
Date Ticker end_time start_time
1997-10-06 DJ 10:00 09:30
1997-10-06 DJ 10:30 10:00
1997-10-06 DJ 11:00 10:30
1997-10-06 DJ 11:30 11:00
1997-10-08 DJ 09:30 15:00
1997-10-08 DJ 10:00 09:30
1997-10-06 DX 10:00 09:30
1997-10-06 DX 10:30 10:00
1997-10-06 DX 11:00 10:30
1997-10-06 DX 11:30 11:00
1997-10-07 DX 14:30 14:00
1997-10-07 DX 15:00 14:30
1997-10-07 DX 15:30 15:00
1997-10-07 DX 16:00 15:30
1997-10-08 DX 09:30 16:00
1997-10-08 DX 10:00 09:30
Dataset Y looks as follows:
Date Time Event
1997-10-06 09:30 Event1
1997-10-06 10:30 Event2
1997-10-07 22:00 Event3
1997-10-08 09:00 Event4
1997-10-08 09:30 Event5
1997-10-08 09:30 Event6
My aim is to link events in Y to X based on whether the event date-time falls within the start/end time interval. My expected output is something like this (dataset Z):
Date Ticker end_time start_time Event
1997-10-06 DJ 10:00 09:30 Event1
1997-10-06 DJ 10:30 10:00 NA
1997-10-06 DJ 11:00 10:30 Event2
1997-10-06 DJ 11:30 11:00 NA
1997-10-08 DJ 09:30 15:00 Event3,Event4
1997-10-08 DJ 10:00 09:30 Event5,Event6
1997-10-06 DX 10:00 09:30 Event1
1997-10-06 DX 10:30 10:00 NA
1997-10-06 DX 11:00 10:30 Event2
1997-10-06 DX 11:30 11:00 NA
1997-10-07 DX 14:30 14:00 NA
1997-10-07 DX 15:00 14:30 NA
1997-10-07 DX 15:30 15:00 NA
1997-10-07 DX 16:00 15:30 NA
1997-10-08 DX 09:30 16:00 Event3, Event4
1997-10-08 DX 10:00 09:30 Event5,Event6
It is thus possible for multiple events to happen within one interval. Is it possible to store those together in the column "Event"? It is also possible that an event occurs after a market closes, in which case it should be stored in the first interval that occurs after the event. How can I obtain this expected output? I have been thinking about it for a while now, but I have no clue where to start.
Edit: X contains 400k 30-min intervals. Y contains 40k events.
There are lots of ways of approaching the problem; here's just one suggestion.
I'm using these datasets:
x <- read.table(text = "Date,Ticker,end_time,start_time
06/10/1997,DJ,10:00,09:30
06/10/1997,DJ,10:30,10:00
06/10/1997,DJ,11:00,10:30
06/10/1997,DJ,11:30,11:00
08/10/1997,DJ,09:30,15:00
08/10/1997,DJ,10:00,09:30
06/10/1997,DX,10:00,09:30
06/10/1997,DX,10:30,10:00
06/10/1997,DX,11:00,10:30
06/10/1997,DX,11:30,11:00
07/10/1997,DX,14:30,14:00
07/10/1997,DX,15:00,14:30
07/10/1997,DX,15:30,15:00
07/10/1997,DX,16:00,15:30
08/10/1997,DX,09:30,16:00
08/10/1997,DX,10:00,09:30
08/10/1997,DX,10:00,09:30", sep =",", header = TRUE, stringsAsFactors =
FALSE)
y <- read.table(text = "Date,Time,Event
06/10/1997,09:30,Event1
06/10/1997,10:30,Event2
07/10/1997,22:00,Event3
08/10/1997,09:00,Event4
08/10/1997,09:30,Event5
08/10/1997,09:30,Event6
", sep =",", header = TRUE, stringsAsFactors = FALSE)
I would start by concatenating and formatting the dates and times so they can be used in functions to check whether an event occurred in that window. Assuming you have two data frames called x and y in the structure described above:
y$date_time <- strptime(paste(y$Time,y$Date),format="%H:%M %d/%m/%Y")
x$start_time_date <- strptime(paste(x$start_time,x$Date),format="%H:%M %d/%m/%Y")
x$end_time_date <- strptime(paste(x$end_time,x$Date),format="%H:%M %d/%m/%Y")
If you have control over how the dataset is compiled, it might be easier to record the start and end date-times in this way from the outset. For the periods that cross over a date, building the date-times as above produces start date-times that fall after the end date-times, because the previous day's start time is paired with the current row's date. We can fix those by instead using the date from the previous entry in the data frame, assuming the rows are always in chronological order within each ticker and there are no missing periods. This is a bit of a hack!
# check which entries cross over a date
overnight_idx <- which(x$end_time_date < x$start_time_date)
# replace the start date with that of the preceding entry in the data frame
x[overnight_idx, 'start_time_date'] <-
  as.POSIXct(paste(x[overnight_idx, 'start_time'],
                   x[overnight_idx - 1, 'Date']),
             format = "%H:%M %d/%m/%Y",
             origin = "1970-01-01")
Now we can write a function that for a given row in the data frame x will extract any events that have occurred listed in y, and then do a little bit of formatting to get it in the format you described.
checkEvent <- function(x_row){
  # events whose date-time falls in [start, end) of this row
  y2 <- y[y$date_time >= x_row['start_time_date'] &
          y$date_time < x_row['end_time_date'], 'Event']
  if(length(y2) == 0){
    y2 <- NA
  } else if(length(y2) > 1){
    y2 <- paste(y2, collapse = ' ')
  }
  return(y2)
}
Then we can just apply that to x
x$Event <- apply(x,1,checkEvent)
which will produce the following (ignoring the columns we created above so it fits on the screen):
> x[,c('Date','Ticker','end_time','start_time','Event')]
Date Ticker end_time start_time Event
1 06/10/1997 DJ 10:00 09:30 Event1
2 06/10/1997 DJ 10:30 10:00 <NA>
3 06/10/1997 DJ 11:00 10:30 Event2
4 06/10/1997 DJ 11:30 11:00 <NA>
5 08/10/1997 DJ 09:30 15:00 Event3 Event4
6 08/10/1997 DJ 10:00 09:30 Event5 Event6
7 06/10/1997 DX 10:00 09:30 Event1
8 06/10/1997 DX 10:30 10:00 <NA>
9 06/10/1997 DX 11:00 10:30 Event2
10 06/10/1997 DX 11:30 11:00 <NA>
11 07/10/1997 DX 14:30 14:00 <NA>
12 07/10/1997 DX 15:00 14:30 <NA>
13 07/10/1997 DX 15:30 15:00 <NA>
14 07/10/1997 DX 16:00 15:30 <NA>
15 08/10/1997 DX 09:30 16:00 Event3 Event4
16 08/10/1997 DX 10:00 09:30 Event5 Event6
17 08/10/1997 DX 10:00 09:30 Event5 Event6
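With 400k intervals and 40k events, the row-by-row apply above may be slow. As a rough, vectorised alternative (not the method used above), one could rely on findInterval from base R. The sketch below assumes that the intervals of each ticker are contiguous, so that every event belongs to the first interval whose end time lies after the event; events earlier than the first interval would then also land in that first interval. The helper names to_ct and assign_events are made up for illustration, and the data is the DJ part of the example only.

```r
# Hypothetical helper: build POSIXct date-times from the dd/mm/yyyy and HH:MM columns
to_ct <- function(date, time) {
  as.POSIXct(paste(date, time), format = "%d/%m/%Y %H:%M", tz = "UTC")
}

x <- data.frame(
  Date       = c("06/10/1997", "06/10/1997", "06/10/1997",
                 "06/10/1997", "08/10/1997", "08/10/1997"),
  Ticker     = "DJ",
  end_time   = c("10:00", "10:30", "11:00", "11:30", "09:30", "10:00"),
  start_time = c("09:30", "10:00", "10:30", "11:00", "15:00", "09:30"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  Date  = c("06/10/1997", "06/10/1997", "07/10/1997",
            "08/10/1997", "08/10/1997", "08/10/1997"),
  Time  = c("09:30", "10:30", "22:00", "09:00", "09:30", "09:30"),
  Event = paste0("Event", 1:6),
  stringsAsFactors = FALSE
)

x$end_dt <- to_ct(x$Date, x$end_time)
y$dt     <- to_ct(y$Date, y$Time)
y        <- y[order(y$dt), ]

# Hypothetical helper: attach events to the intervals of ONE ticker
assign_events <- function(x_one, y) {
  x_one <- x_one[order(x_one$end_dt), ]
  ends  <- as.numeric(x_one$end_dt)
  # findInterval counts the end times <= event time, so +1 is the
  # first interval that ends strictly after the event
  idx  <- findInterval(as.numeric(y$dt), ends) + 1L
  keep <- idx <= length(ends)          # drop events after the last interval
  x_one$Event <- NA_character_
  if (any(keep)) {
    ev <- tapply(y$Event[keep], idx[keep], paste, collapse = ",")
    x_one$Event[as.integer(names(ev))] <- as.vector(ev)
  }
  x_one
}

# Every ticker sees the same events, so process the tickers one by one
z <- do.call(rbind, lapply(split(x, x$Ticker), assign_events, y = y))
z[, c("Date", "Ticker", "end_time", "start_time", "Event")]
```

Because only the end times are used, the overnight start dates do not need to be patched first. If the intervals have gaps in which events must be discarded, this shortcut does not apply as written.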
Related
How to grepl search for the max and min timings in a string?
I have a dataset with a column containing the opening and closing times of various stores. The timings are in the string format Opening time - Closing time, e.g.:
17:00 - 21:00 | 11:30 - 14:30 | 11:30 - 14:30
I want to extract the minimum opening time within the above string, i.e. 11:30, and the maximum closing time, i.e. 21:00. How do I do that using R?
DPUT:
structure(list(head.timings_remapping.Opening.And.Closing.Time..40. = c("15:30 - 21:30",
"12:00 - 00:00", "11:00 - 15:00 | 16:30 - 20:45", "12:00 - 22:30",
"17:00 - 21:30", "17:00 - 21:30", "16:30 - 00:00", "16:00 - 21:15",
"16:30 - 20:30", "17:00 - 20:00", "16:00 - 23:30", "16:30 - 21:30",
"17:00 - 22:00", "17:00 - 22:00", "17:00 - 21:30", "17:00 - 21:30",
"16:00 - 00:00", "16:30 - 23:59", "11:30 - 22:30", "11:30 - 23:59",
"17:00 - 20:30", "07:30 - 12:50", "16:15 - 23:00", "09:00 - 21:00",
"10:00 - 21:00", "11:00 - 22:00", "07:00 - 12:00 | 07:00 - 13:30 | 12:00 - 13:30",
"07:00 - 13:00 | 10:00 - 15:00", "10:00 - 02:00", "00:00 - 23:59",
"00:00 - 23:59", "11:00 - 20:00", "11:00 - 20:00", NA, "12:00 - 03:30 | 11:00 - 00:00",
"05:30 - 15:00", "07:00 - 16:00", "08:30 - 13:30", "17:00 - 21:00 | 11:30 - 14:30 | 11:30 - 14:30",
"12:00 - 01:00")), class = "data.frame", row.names = c(NA, -40L))
The final output will have two columns, "Opening time" and "Closing time".
Does this work:
library(dplyr)
library(tidyr)
df %>%
  separate(col = head.timings_remapping.Opening.And.Closing.Time..40.,
           into = c('Open_Close','A'), sep = '\\|') %>%
  separate(col = Open_Close, into = c('Opening Time','Closing Time'), sep = ' - ') %>%
  mutate(`Opening Time` = trimws(`Opening Time`),
         `Closing Time` = trimws(`Closing Time`)) %>%
  select(-A)
   Opening Time Closing Time
1         15:30        21:30
2         12:00        00:00
3         11:00        15:00
4         12:00        22:30
5         17:00        21:30
6         17:00        21:30
7         16:30        00:00
8         16:00        21:15
9         16:30        20:30
10        17:00        20:00
11        16:00        23:30
12        16:30        21:30
13        17:00        22:00
14        17:00        22:00
15        17:00        21:30
16        17:00        21:30
17        16:00        00:00
18        16:30        23:59
19        11:30        22:30
20        11:30        23:59
21        17:00        20:30
22        07:30        12:50
23        16:15        23:00
24        09:00        21:00
25        10:00        21:00
26        11:00        22:00
27        07:00        12:00
28        07:00        13:00
29        10:00        02:00
30        00:00        23:59
31        00:00        23:59
32        11:00        20:00
33        11:00        20:00
34         <NA>         <NA>
35        12:00        03:30
36        05:30        15:00
37        07:00        16:00
38        08:30        13:30
39        17:00        21:00
40        12:00        01:00
Using the dplyr and tidyr libraries you can do:
library(dplyr)
library(tidyr)

# Rename the long column name to something shorter
names(df)[1] <- 'Time'

df %>%
  # Create a row index
  mutate(row = row_number()) %>%
  # Split the data into different rows on '|'
  separate_rows(Time, sep = '\\s*\\|\\s*') %>%
  # Split the data on '-'
  separate(Time, c("Opening_Time", "Closing_time"), sep = '\\s*-\\s*') %>%
  # Change the times to POSIXct format
  mutate(across(c(Opening_Time, Closing_time),
                ~ as.POSIXct(.x, format = '%H:%M'))) %>%
  # For each original row
  group_by(row) %>%
  # Get the minimum opening time and maximum closing time
  # and change them back into the required format
  summarise(Opening_Time = format(min(Opening_Time), "%H:%M"),
            Closing_time = format(max(Closing_time), "%H:%M")) %>%
  # Drop the row column
  select(-row)

This returns
#   Opening_Time Closing_time
#   <chr>        <chr>
# 1 15:30        21:30
# 2 12:00        00:00
# 3 11:00        20:45
# 4 12:00        22:30
# 5 17:00        21:30
# 6 17:00        21:30
# 7 16:30        00:00
# 8 16:00        21:15
# 9 16:30        20:30
#10 17:00        20:00
# … with 30 more rows
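For comparison, the same idea can be sketched in base R only. This is an illustrative alternative, not taken from the answers above: it assumes every time is zero-padded HH:MM (so plain string comparison orders them correctly) and, like the pipeline above, it treats a closing time after midnight such as 02:00 as an early time. The function name extract_range and the three sample strings are chosen just for this example.

```r
timings <- c("17:00 - 21:00 | 11:30 - 14:30 | 11:30 - 14:30",
             "15:30 - 21:30",
             NA)

# Hypothetical helper: earliest opening and latest closing time in one string
extract_range <- function(s) {
  if (is.na(s)) {
    return(c(Opening_Time = NA_character_, Closing_Time = NA_character_))
  }
  # Pull out every HH:MM token; they alternate open, close, open, close, ...
  hm     <- regmatches(s, gregexpr("[0-9]{2}:[0-9]{2}", s))[[1]]
  opens  <- hm[seq(1, length(hm), by = 2)]
  closes <- hm[seq(2, length(hm), by = 2)]
  c(Opening_Time = min(opens), Closing_Time = max(closes))
}

res <- as.data.frame(t(sapply(timings, extract_range, USE.NAMES = FALSE)),
                     stringsAsFactors = FALSE)
res
```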
Transforming data into xts format
I have some data, and the Date column includes the time too. I am trying to get this data into xts format. I have tried the code below, but I get an error. Can anyone see anything wrong with this code? TIA
               Date    Open    High     Low   Close
1  2017.01.30 07:00 1.25735 1.25761 1.25680 1.25698
2  2017.01.30 08:00 1.25697 1.25702 1.25615 1.25619
3  2017.01.30 09:00 1.25618 1.25669 1.25512 1.25533
4  2017.01.30 10:00 1.25536 1.25571 1.25093 1.25105
5  2017.01.30 11:00 1.25104 1.25301 1.25093 1.25262
6  2017.01.30 12:00 1.25260 1.25479 1.25229 1.25361
7  2017.01.30 13:00 1.25362 1.25417 1.25096 1.25177
8  2017.01.30 14:00 1.25177 1.25219 1.24900 1.25071
9  2017.01.30 15:00 1.25070 1.25307 1.24991 1.25238
10 2017.01.30 16:00 1.25238 1.25358 1.25075 1.25159
df = read.table(file = "GBPUSD60.csv", sep="," , header = TRUE)
dates = as.character(df$Date)
df$Date = NULL
Sept17 = xts(df, as.POSIXct(dates, format="%Y-%m-%d %H:%M"))
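The question does not quote the error, but one likely cause is that the format string uses dashes while the dates shown use dots, so as.POSIXct returns NA for every row and xts then has no valid index. A minimal sketch under that assumption, using two rows from the table above (the names dates, prices, idx and bad are only for illustration):

```r
dates  <- c("2017.01.30 07:00", "2017.01.30 08:00")
prices <- data.frame(Open  = c(1.25735, 1.25697),
                     Close = c(1.25698, 1.25619))

# Format string matching the dots used in the data
idx <- as.POSIXct(dates, format = "%Y.%m.%d %H:%M", tz = "UTC")

# The format from the question does not match, so every value is NA
bad <- as.POSIXct(dates, format = "%Y-%m-%d %H:%M", tz = "UTC")

# Build the xts object only if the xts package is installed
if (requireNamespace("xts", quietly = TRUE)) {
  Sept17 <- xts::xts(as.matrix(prices), order.by = idx)
}
```

The time zone is set to UTC here only to keep the example reproducible; the appropriate zone depends on the data source.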
Listing pairwise overlaps of Date time elements in R
I have a list of lectures for a university course stored in a data frame. This is a large, complex table with over 1000 rows. I have used simple times in the example, but the real data is date-time in the format %d %b %Y %H:%M. I think I should be able to extrapolate to the more complex usage. Essentially:
ModuleCode1 ModuleName Lecturer StartTime EndTime Course
11A         Hist1      Bob      10:30     12:30   Hist
13A         Hist2      Bob      14:30     15:30   Hist
13C         Hist3      Steve    11:45     12:45   Hist
15B         Hist4      Bob      09:40     10:40   Hist
17B         Hist5      Bob      14:00     15:00   Hist
I am trying to create an output data frame which determines which modules clash in the timetable and at which times. For example:
ModuleCode1 StartTime EndTime ModuleCode2 StartTime EndTime
11A         10:30     12:30   15B         09:40     10:40
11A         10:30     12:30   13C         11:45     12:45
13A         10:30     12:30   17B         14:00     15:00
There are a multitude of questions on date-time overlaps, but the ones that I can find seem to either work with two data frames, or I can't understand them. I have come across the lubridate and IRanges packages, but cannot work out this specific implementation with date-times in a single data frame. It seems like something that would be generally useful and that most likely has a simple implementation I am missing. Grateful for any help.
Here is an sqldf solution. The intervals do NOT overlap iff a.StartTime > b.EndTime or a.EndTime < b.StartTime, so they do overlap exactly when the negation of this statement is true, hence:
library(sqldf)
sqldf("select a.ModuleCode1, a.StartTime, a.EndTime,
              b.ModuleCode1, b.StartTime, b.EndTime
       from DF a join DF b
       on a.ModuleCode1 < b.ModuleCode1
          and a.StartTime <= b.EndTime
          and a.EndTime >= b.StartTime")
giving:
  ModuleCode1 StartTime EndTime ModuleCode1 StartTime EndTime
1         11A     10:30   12:30         13C     11:45   12:45
2         11A     10:30   12:30         15B     09:40   10:40
3         13A     14:30   15:30         17B     14:00   15:00
Note: The input in reproducible form is:
Lines <- "ModuleCode1 ModuleName Lecturer StartTime EndTime Course
11A Hist1 Bob 10:30 12:30 Hist
13A Hist2 Bob 14:30 15:30 Hist
13C Hist3 Steve 11:45 12:45 Hist
15B Hist4 Bob 09:40 10:40 Hist
17B Hist5 Bob 14:00 15:00 Hist"
DF <- read.table(text = Lines, header = TRUE)
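The same overlap condition can also be sketched without sqldf, using a cross join from base R's merge. This is an illustrative alternative rather than part of the answer above. It compares the times as zero-padded HH:MM strings, as the query does, and it builds all n × n pairs, so for the roughly 1000 rows mentioned in the question it creates about a million candidate rows before filtering.

```r
DF <- data.frame(
  ModuleCode1 = c("11A", "13A", "13C", "15B", "17B"),
  StartTime   = c("10:30", "14:30", "11:45", "09:40", "14:00"),
  EndTime     = c("12:30", "15:30", "12:45", "10:40", "15:00"),
  stringsAsFactors = FALSE
)

# by = NULL gives the Cartesian product; suffixes distinguish the two copies
pairs <- merge(DF, DF, by = NULL, suffixes = c(".a", ".b"))

# Keep each unordered pair once and apply the overlap test
clashes <- subset(pairs,
                  ModuleCode1.a < ModuleCode1.b &
                    StartTime.a <= EndTime.b &
                    EndTime.a >= StartTime.b)
clashes
```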
standard deviation of specific row numbers and put the value in another row & column in R
I have the following data:
Date              Value          Std.Dev
11/30/2015 10:00  0
11/30/2015 10:30  -0.002400962
11/30/2015 11:00  -0.004819286
11/30/2015 11:30  -0.000805477
11/30/2015 12:00  -0.001612904
11/30/2015 12:30  -0.003233633
11/30/2015 13:00  0.000809389
11/30/2015 13:30  0.005647453
11/30/2015 14:00  -0.002416433
11/30/2015 14:30  -0.006472515
11/30/2015 15:00  -0.002438035
11/30/2015 15:30  0
11/30/2015 16:30  -0.000814001
12/1/2015 9:00    0.006493529    0.002931114
12/1/2015 9:30    -0.001619434   0.003657839
12/1/2015 10:00   -0.003246756   0.00363798
12/1/2015 10:30   -0.002442004   0.003519869
12/1/2015 11:00   0.000814664    0.003551266
12/1/2015 11:30   -0.001629992   0.00357286
12/1/2015 12:00   0.000815328    0.003504601
12/1/2015 12:30   -1.11022E-16   0.003504796
12/1/2015 13:00   -0.000815328   0.002981979
The Std.Dev calculation should start from row 14, because the first standard deviation is calculated on the previous day's values. The standard deviation for row 14 is calculated on rows 1 to 13 of Value, and so on: Std.Dev at row 15 = STDEV(Value2:Value14), Std.Dev at row 16 = STDEV(Value3:Value15), etc. Can you please suggest a function for this kind of calculation in R? In Excel it is very easy, so something similar in R would be very helpful. Thanks. Please let me know in the comments if you want more details or an example.
Definitely not the most efficient way, but maybe sufficient for you (with x denoting your data frame):
for(counter in 14:nrow(x)){
  x[counter,3] <- sd(x[(counter-13):(counter-1),2])
}
But again, that's definitely not the most efficient way.
For a data.frame, df, you can get this as follows with sapply:
df$st.dev <- c(rep(NA, 13),
               sapply(13:(nrow(df)-1), function(i) sd(df$Value[(i-12):i])))
sapply will run through the selected rows, and the function that follows will repeatedly calculate the standard deviations for the selected rows. I prepend NAs to this output so that it can be added to the data.frame.
data
I cheated a little in reading in the data, but it doesn't affect the result.
df <- read.table(header=T, text="Date Time Value
11/30/2015 10:00 0
11/30/2015 10:30 -0.002400962
11/30/2015 11:00 -0.004819286
11/30/2015 11:30 -0.000805477
11/30/2015 12:00 -0.001612904
11/30/2015 12:30 -0.003233633
11/30/2015 13:00 0.000809389
11/30/2015 13:30 0.005647453
11/30/2015 14:00 -0.002416433
11/30/2015 14:30 -0.006472515
11/30/2015 15:00 -0.002438035
11/30/2015 15:30 0
11/30/2015 16:30 -0.000814001
12/1/2015 9:00 0.006493529
12/1/2015 9:30 -0.001619434
12/1/2015 10:00 -0.003246756
12/1/2015 10:30 -0.002442004
12/1/2015 11:00 0.000814664
12/1/2015 11:30 -0.001629992
12/1/2015 12:00 0.000815328
12/1/2015 12:30 -1.11022E-16
12/1/2015 13:00 -0.000815328", as.is=TRUE, row)
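The trailing-window logic can also be wrapped in a small reusable function. The sketch below is one possible base R variant built on embed, which lays out every run of k consecutive values as a matrix row. The name rolling_sd_prev and the toy vector are made up for illustration; for the data above the call would use k = 13.

```r
# Hypothetical helper: sd of the k values BEFORE each position, NA where
# fewer than k previous values exist
rolling_sd_prev <- function(v, k) {
  n   <- length(v)
  out <- rep(NA_real_, n)
  if (n > k) {
    w   <- embed(v, k)        # row j holds v[j], ..., v[j + k - 1] (reversed)
    sds <- apply(w, 1, sd)    # sd does not depend on the order within a row
    out[(k + 1):n] <- sds[seq_len(n - k)]
  }
  out
}

v   <- c(1, 2, 4, 7, 11)
out <- rolling_sd_prev(v, k = 3)
out
```

Here out[4] is the standard deviation of the first three values and out[5] that of values two to four, mirroring how row 14 uses rows 1 to 13 in the question.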
data handling outlier with conditional in R
I have two data frames (data by hour and data by day). I want to mark outlier points in the hourly data using a condition: if the hourly PH lies within (standard1 - standard2) for that day it is OK, otherwise it is an outlier. For example, the PH at 11-09-13 10:00 is 49.14068; compared with the 11-09-13 range of 20-40, 49.14068 > 40, so it is an outlier. I want to run this comparison automatically in R. I searched for this question but found no result, so please help!
Data by Hour
DateTime        PH
11-09-13 10:00  49.14068
11-09-13 11:00  52.53494167
11-09-13 12:00  24.8525
11-09-13 13:00  8.56055
11-09-13 14:00  23.77944167
11-09-13 15:00  25.13243333
11-09-13 16:00  35.2913
11-09-13 17:00  20.58211667
11-09-13 18:00  18.605975
11-09-13 19:00  59.16179167
11-09-13 20:00  72.06908333
11-09-13 21:00  43.47536667
11-09-13 22:00  44.73696667
11-09-13 23:00  38.7266
12-09-13 0:00   41.12040833
12-09-13 1:00   33.67845833
12-09-13 2:00   38.49083333
12-09-13 3:00   46.20168333
12-09-13 4:00   40.03630833
12-09-13 5:00   41.10841667
12-09-13 6:00   43.753475
12-09-13 7:00   45.077675
12-09-13 8:00   57.53141667
12-09-13 9:00   45.17694167
12-09-13 10:00  41.106525
12-09-13 11:00  30.08048333
12-09-13 12:00  24.70255833
12-09-13 13:00  15.60813333
12-09-13 14:00  14.09729167
........ n day (24h/day)
Data by Day (aggregated from Data by Hour)
DateTime  standard1  standard2
11-09-13  20         40
12-09-13  12         50
13-09-13  16         30
....... n day
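One way this comparison could be automated is to attach each day's limits to the hourly rows with merge and then flag the values outside the range. The sketch below uses a handful of rows from the tables above. The column names Day and Status are invented for illustration, and treating values exactly equal to standard1 or standard2 as OK is an assumption, since the question does not say how the boundaries should be handled.

```r
hourly <- data.frame(
  DateTime = c("11-09-13 10:00", "11-09-13 12:00",
               "12-09-13 0:00", "12-09-13 8:00"),
  PH       = c(49.14068, 24.8525, 41.12040833, 57.53141667),
  stringsAsFactors = FALSE
)
daily <- data.frame(
  Day       = c("11-09-13", "12-09-13"),
  standard1 = c(20, 12),
  standard2 = c(40, 50),
  stringsAsFactors = FALSE
)

# The day is the part of DateTime before the first space
hourly$Day <- sub(" .*$", "", hourly$DateTime)

# Attach the daily limits to every hourly row (rows may be reordered by merge)
m <- merge(hourly, daily, by = "Day", all.x = TRUE)

# Inclusive bounds are an assumption
m$Status <- ifelse(m$PH >= m$standard1 & m$PH <= m$standard2, "OK", "Outlier")
m[, c("DateTime", "PH", "standard1", "standard2", "Status")]
```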