Obtaining or subsetting the first 5 minutes of each day of data from an xts - r

I would like to subset the first 5 minutes of each day from minutely time series data. However, the first 5 minutes do not occur at the same time each day, so something like xtsobj["T09:00/T09:05"] would not work: sometimes the data starts at 9:20am or some other arbitrary time in the morning instead of 9am.
So far, I have been able to subset out the first minute for each day using a function like:
k <- diff(index(xtsobj))> 10000
xtsobj[c(1, which(k)+1)]
i.e. finding gaps in the data that are larger than 10000 seconds. Going from that to the first 5 minutes of each day is proving more difficult because the data is not always evenly spaced: between the first and fifth minute there could be anywhere from 2 to 5 rows, so using something like:
xtsobj[c(1, which(k)+6)]
and then binding the results together
is not always accurate. I was hoping that a function like 'first' could be used, but wasn't sure how to do this for multiple days, perhaps this might be the optimal solution. Is there a better way of obtaining this information?
Many thanks to the Stack Overflow community in advance.

split(xtsobj, "days") will create a list with an xts object for each day.
Then you can apply head to each day to take its first five rows
lapply(split(xtsobj, "days"), head, 5)
or more generally
lapply(split(xtsobj, "days"), function(x) {
x[1:5, ]
})
Finally, you can rbind the days back together if you want.
do.call(rbind, lapply(split(xtsobj, "days"), function(x) x[1:5, ]))
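Since the question says the rows are not evenly spaced, the first five rows of a day are not always the first five minutes. A minimal sketch of a time-based variant follows, assuming "first five minutes" means everything strictly before the day's first timestamp plus 300 seconds. The helper name first_minutes and the sample data are made up for illustration.

```r
library(xts)

# Hypothetical irregular minute data for two days that start at different times
idx <- as.POSIXct(c("2024-01-02 09:00:00", "2024-01-02 09:02:00",
                    "2024-01-02 09:04:00", "2024-01-02 09:05:00",
                    "2024-01-02 09:07:00",
                    "2024-01-03 09:20:00", "2024-01-03 09:21:00",
                    "2024-01-03 09:24:00", "2024-01-03 09:26:00"),
                  tz = "UTC")
xtsobj <- xts(seq_along(idx), order.by = idx)

# Keep the rows of one day's xts object that fall within `minutes`
# of that day's first observation (the cutoff itself is excluded)
first_minutes <- function(x, minutes = 5) {
  cutoff <- index(x)[1] + minutes * 60
  x[index(x) < cutoff]
}

# Apply per day, then bind the pieces back into one xts object
result <- do.call(rbind, lapply(split(xtsobj, "days"), first_minutes))
```

With this sample data each day contributes three rows, because the fourth observation of each day falls at or after the five-minute cutoff. Use <= in the helper if the cutoff should be included.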

What about using the lubridate package? First find the starting point of each day, which according to you changes more or less randomly, and then use the function minutes.
So it would be something like:
five_minutes_after = starting_point_each_day + minutes(5)
Then you can use the usual xts subsetting, doing something like:
five_min_period = paste(starting_point_each_day,five_minutes_after,sep='/')
xtsobj[five_min_period]
Edit:
@Joshua
I think this works, look at this example:
library(xts)
library(lubridate)
x <- xts(cumsum(rnorm(20, 0, 0.1)), Sys.time() - seq(60,1200,60))
starting_point_each_day= index(x[1])
five_minutes_after = index(x[1]) + minutes(5)
five_min_period = paste(starting_point_each_day,five_minutes_after,sep='/')
x[five_min_period]
In my previous example I made a mistake, I put the five_min_period between quotes.
Was that what you were pointing out Joshua? Also maybe the starting point is not necessary, just:
until5min=paste('/',five_minutes_after,sep="")
x[until5min]

Related

Data frames and datetimes [duplicate]

I have a dataset that I’m working with and I’m trying to change the format of my time column. The current format reads like this, example: “2022-05-23 23:06:58”, I’m trying to change this to only show me the hour times and erase the dates.
Other info: I want to make this change within my data frame, not just random times. I want to change over 100,000 rows so I need a function or solution that will do so. Tidyverse, Lubridate, Format, etc. Thank you guys.
Edit: There was one thing I may not have articulated fully: I want to keep the exact time and nothing else, so '23:48:07' is what I'm looking for, not just the hour. I need it so I can eventually compute the time elapsed between two columns.
Try this.
For the first question, here is the code to extract the time of day:
your_time <- format(as.POSIXct(your_time), format = "%H:%M:%S")
# gives "23:06:58" for "2022-05-23 23:06:58"
Since you want to apply this to a large dataset, use the following:
library(dplyr)
large_df %>%
  mutate(Hour = format(as.POSIXct(Datetime), format = "%H:%M:%S"))
where large_df is your large dataset of over 100,000 records. mutate adds a new column, named Hour here, holding the result (the full time of day as a string, despite the name), and Datetime is the date-time column in large_df.
Is the time as a string ok? Because then you can use substr to extract the hour and minutes like so:
time <- c("2022-05-23 23:02:58", "2022-05-23 13:52:58", "2022-05-23 03:31:58", "2022-05-23 09:09:58")
n <- nchar(time)
hour <- substr(time, n - 7, n - 3)
Just replace time with your 100,000-row time column.
library(data.table)
hour("2022-05-23 23:06:58") # 23
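If the end goal is the time elapsed between two columns, it may be easier to do the subtraction on the full date-times and only format the time of day for display, since strings like "23:48:07" cannot be subtracted directly. A small sketch, assuming two hypothetical columns named started and ended and a fixed UTC time zone:

```r
# Hypothetical data frame with two date-time columns stored as strings
df <- data.frame(
  started = c("2022-05-23 23:06:58", "2022-05-24 08:15:00"),
  ended   = c("2022-05-23 23:48:07", "2022-05-24 09:00:30")
)

# Parse once; an explicit time zone keeps the result reproducible
started <- as.POSIXct(df$started, tz = "UTC")
ended   <- as.POSIXct(df$ended, tz = "UTC")

# Elapsed time in minutes, computed on the full date-times
df$elapsed_mins <- as.numeric(difftime(ended, started, units = "mins"))

# Time of day as a string, for display only
df$end_time <- format(ended, "%H:%M:%S")
```

Working on the full date-times also gives the right answer when an interval crosses midnight, which time-of-day strings alone cannot tell you.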

Cutting a time elapsed variable into manageable things

flight_time
11:42:00
19:37:06
18:11:17
I am having trouble working with the time played variable in the dataset. I can't seem to figure out how to get R to treat this value as a numeric.
Apologies if this has been asked before.
EDIT:
Okay well given the stuff posted below I've realised there's a few things I didn't know/check before.
First of all this is a factor variable. I read through the lubridate package documentation, and since I want to perform arithmetic operations (if this is the right terminology) I believe the duration function is the correct one.
However, looking at the examples, I am not entirely sure what the syntax is for applying this to a whole column in a large(ish) data frame. I have 4.5k observations, and I'm not sure exactly how to apply this. I don't need an excessive amount of granularity; ideally even hours and minutes are fine.
So I'm thinking I would want my code to look like:
conversion from factor variable to character string > conversion from character string to duration/as.numeric.
Try this code:
# dummy data with factors (stringsAsFactors defaults to FALSE from R 4.0.0)
df <- data.frame(flight_time=c("11:42:00","19:37:06","18:11:17"),
                 stringsAsFactors = TRUE)
# add Seconds column: split each "HH:MM:SS" and weight the parts
df$Seconds <-
  sapply(as.character(df$flight_time), function(i)
    sum(as.numeric(unlist(strsplit(i,":"))) * c(60^2,60,1)))
# result
df
#   flight_time Seconds
# 1    11:42:00   42120
# 2    19:37:06   70626
# 3    18:11:17   65477
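For a column of a few thousand rows, the same conversion can also be written without a per-element function. This is only a sketch of one alternative, assuming every value has exactly three colon-separated fields. It splits all strings at once and uses a matrix product for the weighting.

```r
# Same dummy values as above, stored as a factor
flight_time <- factor(c("11:42:00", "19:37:06", "18:11:17"))

# Split every "HH:MM:SS" string and stack the pieces into a 3-column matrix
parts <- do.call(rbind, strsplit(as.character(flight_time), ":", fixed = TRUE))
storage.mode(parts) <- "double"

# Weight hours, minutes and seconds, then collapse to a plain numeric vector
seconds <- as.vector(parts %*% c(3600, 60, 1))

# Decimal hours, if that granularity is enough
hours <- seconds / 3600
```

This does not cap the hour field at 24, so elapsed times longer than a day should still convert.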

R - For loop over large list of elements

I have split my large data set by date like so to create a large list of several elements:
days <- split(df, df$Date)
My data has columns including time of sunrise, sunset etc. for each day. I now want to use a for loop to do further work on each day separately like this:
for(i in 1:length(days)){
sunrisetime <- as.character(df$Sunrise[1])
# Further similar work (using time of sunrise & sunset for each date to split
# into daytime hours and nighttime hours)
}
My question is about the df$Sunrise on the second line - I don't think this is the right code to use when trying to access the sunrise time of each day on the days list. I have tried all sorts of variations but am an R newbie so must just be hitting the wrong terms.
Thanks in advance.
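sunrisetime <- rep(NA_character_, length(days))
for (i in seq_along(days)) {
  # days[[i]] is the data frame for the i-th date; take its first sunrise value
  sunrisetime[i] <- as.character(days[[i]]$Sunrise[1])
}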
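The point is that each element of the list returned by split is itself a data frame, so it is indexed with days[[i]], not through the original df. A loop-free sketch of the same idea, using made-up sample data (the Date, Sunrise and Value columns here are assumptions):

```r
# Hypothetical data: two rows per date, sunrise repeated within each date
df <- data.frame(
  Date    = c("2020-06-01", "2020-06-01", "2020-06-02", "2020-06-02"),
  Sunrise = c("04:48", "04:48", "04:47", "04:47"),
  Value   = 1:4
)

# One data frame per date
days <- split(df, df$Date)

# For each day's data frame, pull the first sunrise value;
# vapply guarantees a character vector with one entry per day
sunrise_by_day <- vapply(days, function(d) as.character(d$Sunrise[1]),
                         character(1))
```

The result is named by date, so sunrise_by_day[["2020-06-02"]] looks up a single day.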

Creating a specific sequence of date/times in R

I want to create a single column with a sequence of date/time increasing every hour for one year or one month (for example). I was using a code like this to generate this sequence:
start.date<-"2012-01-15"
start.time<-"00:00:00"
interval<-60 # 60 minutes
increment.mins<-interval*60
x<-paste(start.date,start.time)
for(i in 1:365){
print(strptime(x, "%Y-%m-%d %H:%M:%S")+i*increment.mins)
}
However, I am not sure how to specify the range of the sequence of dates and hours. Also, I have been having problems dealing with the first hour "00:00:00"? Not sure what is the best way to specify the length of the date/time sequence for a month, year, etc? Any suggestion will be appreciated.
I would strongly recommend using the POSIXct data type. That way you can use seq without any problems and use the data however you want.
start <- as.POSIXct("2012-01-15")
interval <- 60
end <- start + as.difftime(1, units="days")
seq(from=start, by=interval*60, to=end)
Now you can do whatever you want with your vector of timestamps.
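To cover a month or a year instead of a single day, seq can also compute the end point, and by accepts unit names such as "hour". A minimal sketch, assuming a fixed UTC time zone so that daylight saving changes do not affect the count:

```r
# Start of the sequence; an explicit time zone avoids DST surprises
start <- as.POSIXct("2012-01-15 00:00:00", tz = "UTC")

# One calendar month after the start (use by = "year" for one year)
one_month_later <- seq(start, by = "month", length.out = 2)[2]

# Hourly timestamps from start to end, both ends included
hourly <- seq(from = start, to = one_month_later, by = "hour")

# An explicit format always shows the time, including midnight
head(format(hourly, "%Y-%m-%d %H:%M:%S"), 3)
```

On the "00:00:00" problem: as far as I know, printing drops the time part when every value shown is exactly midnight, so the first hour may only look missing. Formatting explicitly as above sidesteps that.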
Try this. mondate is very clever about advancing by a month. For example, it will advance the last day of Jan to the last day of Feb, whereas other date/time classes tend to overshoot into Mar. chron does not use time zones, so you can't get the time zone bugs that you can get using POSIXct. Here x is from the question.
library(chron)
library(mondate)
start.time.num <- as.numeric(as.chron(x))
# +1 means one month. Use +12 if you want one year.
end.time.num <- as.numeric(as.chron(paste(mondate(x)+1, start.time)))
# 1/24 means one hour. Change as needed.
hours <- as.chron(seq(start.time.num, end.time.num, 1/24))

XTS: split FX intraday bar data by trading days

I want to apply a function to 20 trading days worth of hourly FX data (as one example amongst many).
I started off with rollapply(data,width=20*24,FUN=FUN,by=24). That seemed to be working well, I could even assert I always got 480 bars passed in... until I realized that wasn't what I wanted. The start and end time of those 480 bars was drifting over the years, due to changes in daylight savings, and market holidays.
So, what I want is a function that treats a day as from 22:00 to 22:00 of each day we have data for. (21:00 to 21:00 in N.Y. summertime - my data timezone is UTC, and daystart is defined at 5pm ET)
So, I made my own rollapply function with this at its core:
ep=endpoints(data,on=on,k=k)
sp=ep[1:(length(ep)-width)]+1
ep=ep[(width+1):length(ep)]
xx <- lapply(1:length(ep), function(ix) FUN(.subset_xts(data,sp[ix]:ep[ix]),...) )
I then called this with on="days", k=1 and width=20.
This has two problems:
Days is in days, not trading days! So, instead of typically 4 weeks of data, I get just under 3 weeks of data.
The cutoff is midnight UTC. I cannot work out how to change it to use the 22:00 (or 21:00) cutoff.
UPDATE: Problem 1 above is wrong! The XTS endpoints function does work in trading days, not calendar days. The reason I thought otherwise is the timezone issue made it look like a 6-day trading week: Sun to Fri. Once the timezone problem was fixed (see my self-answer), using width=20 and on="days" does indeed give me 4 weeks of data.
(The typically there is important: when there is a trading holiday during those 4 weeks I expect to receive 4 weeks 1 day's worth of data, i.e. always exactly 20 trading days.)
I started working on a function to cut the data into weeks, thinking I could then cut them into five 24hr chunks, but this feels like the wrong approach, and surely someone has invented this wheel before me?
Here is how to get the daybreak right:
x2=x
index(x2)=index(x2)+(7*3600)
indexTZ(x2)='America/New_York'
I.e. just setting the timezone puts the daybreak at 17:00; we want it to be at 24:00, so add 7 hours on first.
With help from:
time zones in POSIXct and xts, converting from GMT in R
Here is the full function:
rollapply_chunks.FX.xts=function(data,width,FUN,...,on="days",k=1){
data <- try.xts(data)
x2 <- data
index(x2) <- index(x2)+(7*3600)
indexTZ(x2) <- 'America/New_York'
ep <- endpoints(x2,on=on,k=k) #The end point of each calendar day (when on="days").
#Each entry points to the final bar of the day. ep[1]==0.
if(length(ep)<2){
stop("Cannot divide data up")
}else if(length(ep)==2){ #Can only fit one chunk in.
sp <- 1;ep <- ep[-1]
}else{
sp <- ep[1:(length(ep)-width)]+1
ep <- ep[(width+1):length(ep)]
}
xx <- lapply(1:length(ep), function(ix) FUN(.subset_xts(data,sp[ix]:ep[ix]),...) )
xx <- do.call(rbind,xx) #Join them up as one big matrix/data.frame.
tt <- index(data)[ep] #Implicit align="right". Use sp for align="left"
res <- xts(xx, tt)
return (res)
}
You can see we use the modified index to split up the original data. (If R uses copy-on-write under the covers, then the only extra memory requirement should be for a copy of the index, not of the data.)
(Legal bit: please consider it licensed under MIT, but explicit permission given to use in the GPL-2 XTS package if that is desired.)
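To see the daybreak shift in isolation, here is a small sketch on made-up hourly UTC bars in January, when 17:00 New York time is 22:00 UTC. It uses tzone<- in place of indexTZ<-, since recent xts versions mark indexTZ as deprecated; the data and variable names are assumptions for illustration.

```r
library(xts)

# Hypothetical hourly bars, three full UTC days in winter
idx <- seq(as.POSIXct("2013-01-07 00:00:00", tz = "UTC"),
           by = "hour", length.out = 72)
x <- xts(seq_along(idx), order.by = idx)

# Shift a copy of the index by 7 hours and view it in New York time,
# so that 17:00 ET lines up with midnight on the shifted clock
x2 <- x
index(x2) <- index(x2) + 7 * 3600
tzone(x2) <- "America/New_York"

# Day boundaries are found on the shifted copy...
ep <- endpoints(x2, on = "days")

# ...and then used as row positions in the original, unshifted data
last_bars <- index(x)[ep[-1]]
format(last_bars, "%Y-%m-%d %H:%M", tz = "UTC")
```

Each complete trading day should end on the bar just before the 22:00 UTC boundary. The final entry is simply the last bar in the sample, because the data stops partway through a day.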
