I have a stock dataset in xts format with one year of data at a minute-by-minute level.
I need to calculate the returns, but only for periods on the same date.
What would be the most efficient way to calculate the returns while avoiding the return between yesterday's last observation and today's first observation?
In the same vein, if one wants to calculate 15-minute returns, how does one avoid calculating returns for periods that do not fall within the same day?
I am including a toy dataset with hourly periods (since the original dataset is at minute level):
library(xts)
time_index <- seq(from = as.POSIXct("2021-01-01 07:00"), to = as.POSIXct("2021-02-28 18:00"), by = "hour")
set.seed(1)
value <- 100 + rnorm(n = length(time_index))
eventdata <- xts(value, order.by = time_index)
So, how would one calculate, say, three-hour returns for intraday periods only?
It is probably easier to compute all the returns and then drop the overnight ones. For 15-minute returns, you can create a time sequence at those intervals and then run the computation on the reduced data.
library(xts)
library(lubridate)
library(TTR)
time_index <- seq(
  from = as.POSIXct("2021-01-01 07:00"),
  to = as.POSIXct("2021-02-28 18:00"),
  by = "hour"
)
set.seed(1)
value <- 100 + rnorm(n = length(time_index))
eventdata <- xts(value, order.by = time_index)
# compute hourly returns, then drop each day's first return
# (the overnight return; note that startof() is unexported from xts)
res <- TTR::ROC(eventdata, type = "discrete")
res2 <- res[-xts:::startof(res, by = "days")]
> head(res2)
[,1]
2021-01-01 08:00:00 0.00815204
2021-01-01 09:00:00 -0.01017404
2021-01-01 10:00:00 0.02451394
2021-01-01 11:00:00 -0.01245897
2021-01-01 12:00:00 -0.01146199
2021-01-01 13:00:00 0.01318717
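If you prefer to avoid the unexported helper, a minimal sketch using the exported xts::endpoints() achieves the same thing:
# endpoints() returns c(0, last-of-day-1, ..., last-of-day-n),
# so each day's first observation sits one position after an endpoint
ep <- endpoints(res, on = "days")
res2_alt <- res[-(ep[-length(ep)] + 1)]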
# compute 15-minute returns
time_index2 <- seq(
from = as.POSIXct("2021-01-01 07:00"),
to = as.POSIXct("2021-01-01 18:00"),
by = "min"
)
length(time_index2)
# every 15th minute; length.out = 661 overshoots 18:00, but the subsetting
# below keeps only timestamps that actually exist in testasset. Note that
# ymd_hms() parses in UTC while as.POSIXct() above used the local timezone,
# which likely explains why the output below starts at 08:00.
time_index3 <- seq(from = ymd_hms('2021-01-01 07:00:00'),
                   by = '15 min', length.out = 661)
testasset <- xts(rnorm(661, sd = 0.03), order.by = time_index2)
res3 <- TTR::ROC(testasset[time_index3], type = "discrete")
> head(res3)
[,1]
2021-01-01 08:00:00 NA
2021-01-01 08:15:00 -0.6516079
2021-01-01 08:30:00 -5.7101543
2021-01-01 08:45:00 -0.5411609
2021-01-01 09:00:00 -2.2945892
2021-01-01 09:15:00 -2.3038205
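Alternatively, here is a minimal sketch that filters on the xts index itself instead of building a second sequence, which also sidesteps the timezone mismatch noted above (idx15 is a name introduced here):
# keep only timestamps on a 15-minute boundary
idx15 <- index(testasset)[format(index(testasset), "%M") %in% c("00", "15", "30", "45")]
res3_alt <- TTR::ROC(testasset[idx15], type = "discrete")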
I want to generate the same time period across several days, e.g. from 09:30:00 to 16:00:00 every day, and I know that
dates <- seq(as.POSIXct("2000-01-01 9:00", tz = 'UTC'), as.POSIXct("2000-04-9 16:00", tz = 'UTC'), by = 300)
can give me the time series observed every 5 minutes during all 24 hours of 100 days. But what I want is only 09:30:00 to 16:00:00 on each of the 100 days.
Thanks in advance
Here is one way. We can create a date sequence covering every day, build a list holding each day's five-minute sequence, and finally combine that list. final_seq is the final output.
date_seq <- seq(as.Date("2000-01-01"), as.Date("2000-04-09"), by = 1)
hour_seq <- lapply(date_seq, function(x){
temp_date <- as.character(x)
temp_seq <- seq(as.POSIXct(paste(temp_date, "09:30"), tz = "UTC"),
as.POSIXct(paste(temp_date, "16:00"), tz = "UTC"),
by = 300)
})
final_seq <- do.call("c", hour_seq)
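One caveat: c() on POSIXct objects can drop the timezone attribute (at least in older R versions), so you may want to restore it explicitly:
# restore the timezone attribute in case c() dropped it
attr(final_seq, "tzone") <- "UTC"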
An option using tidyr::crossing() (which I love) and the lubridate package:
library(tidyr)
library(dplyr)
library(lubridate)
# 0:99 (not seq(1:100)) so the 100 days start on 2000-01-01
crossing(c1 = paste(dmy("01/01/2000") + 0:99, "09:30"),
         c2 = seq(0, 390, 5)) %>%
  mutate(time_series = ymd_hm(c1) + minutes(c2)) %>%
  pull(time_series)
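For comparison, a base-R sketch of the same Cartesian-product idea (days and mins are names introduced here for illustration):
days <- dmy("01/01/2000") + 0:99
mins <- seq(0, 390, 5)
# repeat each day's 09:30 start for every offset, then add the offsets
time_series <- rep(ymd_hm(paste(days, "09:30")), each = length(mins)) +
  minutes(rep(mins, times = length(days)))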
I made up a data frame. How do I calculate the squared difference/error in hourly TMP and DW for 1/1 through 1/9 against 1/10? I need, for each day from 1/1 to 1/9, the sum of squared differences between hours 1 to 24 of that day and the corresponding hours of 1/10.
The output should look like
Date SETmp SEDW
2012/1/1 X1 Y1
......
2012/1/9 X9 Y9
Data:
set.seed(1)
dataset <- data.frame(Date = seq(from = as.POSIXct("2012-1-1 0:00", tz = "UTC"),
to = as.POSIXct("2012-1-10 23:00", tz = "UTC"),
by="hour"),
TMP = rnorm(240),
DW = rnorm(240))
If I understand your question correctly, we can get there using the by and merge functions:
# add day and hour columns (for subsetting and merge)
dataset$day <- lubridate::day(dataset$Date)
dataset$hour <- lubridate::hour(dataset$Date)
# split data apart
data_ten <- subset(dataset, day == 10)
data_one_to_nine <- subset(dataset, day != 10)
# for each date, merge to data_ten using hours
# then calculate sum of squared differences
do.call('rbind.data.frame',
by(data_one_to_nine, data_one_to_nine$day, function(d){
xm <- merge(d, data_ten, by = 'hour')
data.frame(
'Date' = unique(as.Date(d$Date)),
'SE_TMP' = sum((xm$TMP.x - xm$TMP.y)^2),
'SE_DW' = sum((xm$DW.x - xm$DW.y)^2),
stringsAsFactors = FALSE
)
})
)
Date SE_TMP SE_DW
1 2012-01-01 59.33207 63.41261
2 2012-01-02 42.04597 58.90700
3 2012-01-03 66.15492 51.81897
4 2012-01-04 31.83438 40.68851
5 2012-01-05 30.26666 59.30694
6 2012-01-06 45.05186 55.39751
7 2012-01-07 61.93305 39.76287
8 2012-01-08 37.08246 47.81958
9 2012-01-09 58.54562 64.79331
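For what it's worth, a minimal dplyr sketch of the same computation (assuming the day and hour columns added above):
library(dplyr)
ref <- subset(dataset, day == 10)
dataset %>%
  filter(day != 10) %>%
  group_by(Date = as.Date(Date)) %>%
  summarise(SE_TMP = sum((TMP - ref$TMP[match(hour, ref$hour)])^2),
            SE_DW  = sum((DW  - ref$DW[match(hour, ref$hour)])^2))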
I have a dataframe of datetimes, like so:
library(lubridate)
date_seq <- seq.POSIXt(ymd_hm('2016-04-01 0:00'), ymd_hm('2016-04-30 23:30'), by = '30 mins')
datetimes <- data.frame(datetime = date_seq)
I've also got a dataframe containing opening times that specify a range of days over which the opening times apply and an hour range over which the store is open for the days in the date range, like so:
opening_times <- data.frame(from_date = c('2016-03-01', '2016-04-15'),
till_date = c('2016-04-15', '2016-05-20'),
from_time = c('11:00', '10:30'),
till_time = c('22:00', '23:00'))
What I would like is to mark in datetimes those rows which are inside the opening hours. That is, I want a column that is TRUE whenever the datetime in the row is within both from_date and till_date and within from_time and till_time.
If the dataset isn't too big, I'd recommend creating a new dataset from opening_times -
opening_times$from_date = as.Date(opening_times$from_date, '%Y-%m-%d')
opening_times$till_date = as.Date(opening_times$till_date, '%Y-%m-%d')
opening_times2 = do.call(
  rbind,
  lapply(
    seq(nrow(opening_times)),
    function (rownumber) {
      data.frame(
        # name the date column explicitly so the later merge by date is clear
        date = seq.Date(
          from = opening_times[rownumber, 'from_date'],
          to = opening_times[rownumber, 'till_date'],
          by = 1
        ),
        from_time = opening_times[rownumber, 'from_time'],
        till_time = opening_times[rownumber, 'till_time']
      )
    }
  )
)
and then merging it with datetimes by date and checking whether the time falls between the two values, as sketched below.
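A minimal sketch of that merge-and-check step, assuming the date column named above (merged and tod are names introduced here):
datetimes$date <- as.Date(datetimes$datetime)
merged <- merge(datetimes, opening_times2, by = 'date', all.x = TRUE)
# HH:MM strings are zero-padded, so lexical comparison works
tod <- format(merged$datetime, '%H:%M')
merged$open <- !is.na(merged$from_time) &
  tod >= as.character(merged$from_time) &
  tod <= as.character(merged$till_time)
# note: a date covered by two ranges will produce duplicate rows; aggregate if needed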
lubridate has a %within% function for checking whether a time falls within a lubridate::interval, which makes this easy once you create a vector of intervals:
# make a sequence of days in each set from opening_times
open_intervals <- apply(opening_times, 1, function(x){
  seq.Date(ymd(x[1]), ymd(x[2]), by = 'day')
})
# turn each date into a lubridate::interval object with the appropriate times
open_intervals <- mapply(function(dates, from, to){
interval(ymd_hm(paste(dates, from)), ymd_hm(paste(dates, to)))
}, open_intervals, opening_times$from_time, opening_times$till_time)
# combine list items into one vector of intervals
open_intervals <- do.call(c, open_intervals)
# use lubridate::%within% to check if each datetime is in any open interval
datetimes$open <- sapply(datetimes$datetime, function(x){
any(x %within% open_intervals)
})
datetimes[20:26,]
# datetime open
# 20 2016-04-01 09:30:00 FALSE
# 21 2016-04-01 10:00:00 FALSE
# 22 2016-04-01 10:30:00 FALSE
# 23 2016-04-01 11:00:00 TRUE
# 24 2016-04-01 11:30:00 TRUE
# 25 2016-04-01 12:00:00 TRUE
# 26 2016-04-01 12:30:00 TRUE
Edit
If you have exactly two sets of hours, you can condense the whole thing into a (somewhat huge) ifelse:
datetimes$open <- ifelse(as.Date(datetimes$datetime) %within%
interval(opening_times$from_date[1],
opening_times$till_date[1]),
hm(format(datetimes$datetime, '%H:%M')) >= hm(opening_times$from_time)[1] &
hm(format(datetimes$datetime, '%H:%M')) <= hm(opening_times$till_time)[1],
hm(format(datetimes$datetime, '%H:%M')) >= hm(opening_times$from_time)[2] &
hm(format(datetimes$datetime, '%H:%M')) <= hm(opening_times$till_time)[2])
or
datetimes$open <- ifelse(as.Date(datetimes$datetime) %within%
interval(opening_times$from_date[1],
opening_times$till_date[1]),
datetimes$datetime %within%
interval(ymd_hm(paste(as.Date(datetimes$datetime), opening_times$from_time[1])),
ymd_hm(paste(as.Date(datetimes$datetime), opening_times$till_time[1]))),
datetimes$datetime %within%
interval(ymd_hm(paste(as.Date(datetimes$datetime), opening_times$from_time[2])),
ymd_hm(paste(as.Date(datetimes$datetime), opening_times$till_time[2]))))
Here is an example of a data subset from a .csv file. There are three columns with no header. The first column is the date/time, the second column is load [kW], and the third column is 1 = weekday, 0 = weekend/holiday.
9/9/2010 3:00 153.94 1
9/9/2010 3:15 148.46 1
I would like to write R code that selects the first and second columns within the time range 10:00 to 20:00 for all weekdays (third column = 1) within the month of September, and I do not know the best and most efficient way to code this.
dt <- read.csv("file", header = F, sep = ",")
#Select a column with weekday designation = 1, weekend or holiday = 0
y <- data.frame(dt[,3])
#Select a column with timestamps and loads
x <- data.frame(dt[,1:2])
t <- data.frame(dt[,1])
#convert timestamps into readable format
s <- strptime("9/1/2010 0:00", format="%m/%d/%Y %H:%M")
e <- strptime("9/30/2010 23:45", format="%m/%d/%Y %H:%M")
range <- seq(s,e, by = "min")
df <- data.frame(range)
The OP asks for the "best and efficient way to code" this without showing any "inefficient code", so @Justin is right.
It seems that the OP is new to R (and it's officially the summer of love), so I'll give it a try; here is a solution (not sure about efficiency...):
index <- c("9/9/2010 19:00", "9/9/2010 21:15", "10/9/2010 11:00", "3/10/2010 10:30")
index <- as.POSIXct(index, format = "%d/%m/%Y %H:%M")
set.seed(1)
Data <- data.frame(Date = index, load = rnorm(4, mean = 120, sd = 10), weeks = c(0, 1, 1, 1))
## Data
## Date load weeks
## 1 2010-09-09 19:00:00 113.74 0
## 2 2010-09-09 21:15:00 121.84 1
## 3 2010-09-10 11:00:00 111.64 1
## 4 2010-10-03 10:30:00 135.95 1
cond <- expression(format(Date, "%H:%M") < "20:00" &
format(Date, "%H:%M") > "10:00" &
weeks == 1 &
format(Date, "%m") == "09")
subset(Data, eval(cond))
## Date load weeks
## 3 2010-09-10 11:00:00 111.64 1
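The same filter can also be written directly, without wrapping it in expression()/eval():
subset(Data, format(Date, "%H:%M") > "10:00" &
             format(Date, "%H:%M") < "20:00" &
             weeks == 1 &
             format(Date, "%m") == "09")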
I have a dataset of average wind speeds per hour for multiple years. I would like to create an 'average year', in which for each hour the average wind speed for that hour over the multiple years is calculated. How can I do this without looping endlessly through the dataset?
Ideally, I would like to loop through the data just once, extracting for each row the right month, day, and hour, and adding the wind speed from that row to the right row of a data frame that gathers the aggregates for each month, day, and hour. Is it possible to do this without extracting the month, day, and hour and then looping over the complete average-year data frame to find the right row?
Some example data:
data.multipleyears <- data.frame(
DATETIME = c("2001-01-01 01:00:00", "2001-05-03 09:00:00", "2007-01-01 01:00:00", "2008-02-29 12:00:00"),
Windspeed = c(10, 5, 8, 3)
)
Which I would like to aggregate in a dataframe like this:
average.year <- data.frame(
  DATETIME = c("01-01 00:00:00", "01-01 01:00:00", ..., "12-31 23:00:00"),
  Aggregate.Windspeed = c(100, 80, ...)
)
From there, I can go on calculating the averages, etc. I have probably overlooked some command, but what would be the right syntax for something like this (in pseudocode):
for (i in 1:nrow(data.multipleyears)) {
  average.year$Aggregate.Windspeed[
    where average.year$DATETIME(month, day, hour) == data.multipleyears$DATETIME[i](month, day, hour)
  ] <- average.year$Aggregate.Windspeed + data.multipleyears$Windspeed[i]
}
Or something like that. Help is appreciated!
I predict that ddply and the plyr package are going to be your best friends :). I created a 30-year dataset with hourly random wind speeds between 1 and 10 m/s:
begin_date = as.POSIXlt("1990-01-01", tz = "GMT")
# 30 year dataset
dat = data.frame(dt = begin_date + (0:(24*30*365)) * (3600))
dat = within(dat, {
speed = runif(length(dt), 1, 10)
unique_day = strftime(dt, "%d-%m")
})
> head(dat)
dt unique_day speed
1 1990-01-01 00:00:00 01-01 7.054124
2 1990-01-01 01:00:00 01-01 2.202591
3 1990-01-01 02:00:00 01-01 4.111633
4 1990-01-01 03:00:00 01-01 2.687808
5 1990-01-01 04:00:00 01-01 8.643168
6 1990-01-01 05:00:00 01-01 5.499421
To calculate the daily normals (the 30-year averages; 'climate normals' is a term much used in meteorology) over this 30-year period:
library(plyr)
res = ddply(dat, .(unique_day),
summarise, mean_speed = mean(speed), .progress = "text")
> head(res)
unique_day mean_speed
1 01-01 5.314061
2 01-02 5.677753
3 01-03 5.395054
4 01-04 5.236488
5 01-05 5.436896
6 01-06 5.544966
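For reference, the same 30-year daily means can also be computed with base R's aggregate(), no plyr required:
# mean speed per unique day-of-year label
res_base <- aggregate(speed ~ unique_day, data = dat, FUN = mean)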
This takes just a few seconds on my humble two-core AMD, so I suspect a single pass through the data is not needed. Several of these ddply calls for different aggregations (month, season, etc.) can be run separately.
You can use substr to extract the part of the date you want, and then use tapply or ddply to aggregate the data.
tapply(
data.multipleyears$Windspeed,
substr( data.multipleyears$DATETIME, 6, 19),
mean
)
# 01-01 01:00:00 02-29 12:00:00 05-03 09:00:00
# 9 3 5
library(plyr)
ddply(
data.multipleyears,
.(when=substr(DATETIME, 6, 19)),
summarize,
Windspeed=mean(Windspeed)
)
# when Windspeed
# 1 01-01 01:00:00 9
# 2 02-29 12:00:00 3
# 3 05-03 09:00:00 5
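In both versions, substr(..., 6, 19) keeps the month-day and time part of the timestamp string:
substr("2001-01-01 01:00:00", 6, 19)
# "01-01 01:00:00"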
This is a pretty old post, but I wanted to add: I guess timeAverage() in the openair package can also be used. The manual describes more options for the timeAverage function.