Group Time Series OHLC Data by chosen period in R

Several functions in the xts and zoo packages aggregate financial OHLC(V) data from lower to higher granularities, as does the newcomer tibbletime::to_period, which performs the same task for a tibble. All of them, however, share the same limitation: when aggregating by, say, one hour, they use round clock times as the start and end points of the intervals, i.e. the boundaries are 8 AM, 9 AM, 10 AM, ... If I have data with 15 min candles, how can I aggregate the OHLC(V) columns into 1 H intervals that are not aligned to those round times?
Time <- seq(from = as.POSIXct("2018-12-28 12:00:00"), to = as.POSIXct("2019-01-02 13:30:00"), by = 900)
Price_Data <- tibble::tibble(Time = Time,
Open = 100 + rnorm(n = length(Time)),
High = 100 + rnorm(n = length(Time)),
Low = 100 + rnorm(n = length(Time)),
Close = 100 + rnorm(n = length(Time)),
Volume = rpois(n = length(Time), lambda = 5000))
tail(Price_Data)
1 2019-01-02 12:15:00 99.7 5074
2 2019-01-02 12:30:00 99.9 4925
3 2019-01-02 12:45:00 101. 5070
4 2019-01-02 13:00:00 98.6 4919
5 2019-01-02 13:15:00 98.6 4925
6 2019-01-02 13:30:00 99.5 5046
How can I aggregate the above tibble to 30M, 1H, 2H and 4H, so that the groups will be of the desired length? For example, the last group when aggregating by 1H would take the 4 candles from 12:45:00 to 13:30:00, by 2H the candles from 11:45:00, ...
I have tried
purrr::map(c("30 M","1 H","2 H","4 H")), function(Period) Price_Data %>%
na.omit() %>% tibbletime::tbl_time(., index = Time) %>%
tibbletime::collapse_by(Period, side = "end", clean = T) %>%
dplyr::group_by(Time) %>%
dplyr::mutate(Open = dplyr::first(Open),
High = max(High),
Low = min(Low),
Close = dplyr::last(Close),
Volume = sum(Volume)) %>%
dplyr::slice(n = n()) %>% dplyr::ungroup())
with various combinations of parameters, but nothing produces the desired result. Grouping by the number of candles in a given interval does not help either, as real-world data has gaps.
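No answer is reproduced here, but one way to sketch this is to anchor the interval grid on the last candle and assign each row to a group by how many whole periods it lies before that anchor; because the groups are defined by elapsed time rather than by counting candles, gaps in the data do not shift the grid. The helper name aggregate_ohlcv below is made up for illustration and assumes the Price_Data tibble from above:
library(dplyr)

# Sketch only (not from the question or any posted answer).
# `minutes` is the target bar size in minutes (30, 60, 120, 240).
# The grid is anchored on max(Time), so the last group always ends on the
# final candle instead of on a round clock time.
aggregate_ohlcv <- function(data, minutes) {
  anchor <- max(data$Time)
  data %>%
    arrange(Time) %>%
    mutate(Group = floor(as.numeric(difftime(anchor, Time, units = "mins")) / minutes)) %>%
    group_by(Group) %>%
    summarise(Time = dplyr::last(Time),
              Open = dplyr::first(Open),
              High = max(High),
              Low = min(Low),
              Close = dplyr::last(Close),
              Volume = sum(Volume)) %>%
    arrange(Time) %>%
    select(-Group)
}

purrr::map(c(30, 60, 120, 240), ~ aggregate_ohlcv(Price_Data, .x))
With a 1 H bar size this puts the four candles from 12:45:00 to 13:30:00 into the last group, and with 2 H the candles from 11:45:00 onwards, as described in the question.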

Related

Efficient Group Variable to Note When Values Fall Between Two Times

I have a dataset that contains start and end time stamps, as well as a performance percentage. I'd like to calculate group statistics over hourly blocks, e.g. "the average performance for the midnight hour was x%."
My question is whether there is a more efficient way to do this than a series of ifelse() statements.
# some sample data
pre.starting <- data.frame(starting = format(seq.POSIXt(from =
as.POSIXct(Sys.Date()), to = as.POSIXct(Sys.Date()+1), by = "5 min"),
"%H:%M", tz="GMT"))
pre.ending <- data.frame(ending = pre.starting[seq(1, nrow(pre.starting),
2), ])
ending2 <- pre.ending[-c(1), ]
starting2 <- data.frame(pre.starting = pre.starting[!(pre.starting$starting
%in% pre.ending$ending),])
dataset <- data.frame(starting = starting2
, ending = ending2
, perct = rnorm(nrow(starting2), 0.5, 0.2))
For example, I could create hour blocks with code along the lines of the following:
dataset2 <- dataset %>%
mutate(hour = ifelse(starting >= 00:00 & ending < 01:00, 12
, ifelse(starting >= 01:00 & ending < 02:00, 1
, ifelse(starting >= 02:00 & ending < 03:00, 13)))
) %>%
group_by(hour) %>%
summarise(mean.perct = mean(perct, na.rm=T))
Is there a way to make this code more efficient, or improve beyond ifelse()?
We can use cut() to bin the ending time into hourly intervals after converting the timestamps to POSIXct, and then take the mean for each hour.
library(dplyr)
dataset %>%
mutate_at(vars(pre.starting, ending), as.POSIXct, format = "%H:%M") %>%
group_by(ending_hour = cut(ending, breaks = "1 hour")) %>%
summarise(mean.perct = mean(perct, na.rm = TRUE))
# ending_hour mean.perct
# <fct> <dbl>
# 1 2019-09-30 00:00:00 0.540
# 2 2019-09-30 01:00:00 0.450
# 3 2019-09-30 02:00:00 0.612
# 4 2019-09-30 03:00:00 0.470
# 5 2019-09-30 04:00:00 0.564
# 6 2019-09-30 05:00:00 0.437
# 7 2019-09-30 06:00:00 0.413
# 8 2019-09-30 07:00:00 0.397
# 9 2019-09-30 08:00:00 0.492
#10 2019-09-30 09:00:00 0.613
# … with 14 more rows
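As a side note (not part of the answer above), lubridate::floor_date() is another common way to bin timestamps to the hour; a minimal sketch on the same dataset:
library(dplyr)
library(lubridate)

dataset %>%
  mutate(ending = as.POSIXct(ending, format = "%H:%M")) %>%       # parse the "HH:MM" strings
  group_by(ending_hour = floor_date(ending, unit = "hour")) %>%   # truncate to the hour
  summarise(mean.perct = mean(perct, na.rm = TRUE))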

How to filter a data set and calculate a new variable faster in R?

I have a data set with values every minute and I want to calculate the average value for every hour. I have tried using group_by(), filter() and summarise() from the dplyr package to reduce the data to hourly values. When I use only these functions I am able to get the mean value for every hour, but only per month, and I want it for each day.
> head(DF)
datetime pw cu year m d hr min
1 2017-08-18 14:56:00 0.0630341 1.94065 2017 8 18 14 53
2 2017-08-18 14:57:00 0.0604653 1.86771 2017 8 18 14 57
3 2017-08-18 14:58:00 0.0601318 1.86596 2017 8 18 14 58
4 2017-08-18 14:59:00 0.0599276 1.83761 2017 8 18 14 59
5 2017-08-18 15:00:00 0.0598998 1.84177 2017 8 18 15 0
I had to use a for loop to reduce my table; I wrote the following to do it:
datetime <- c()
eg_bf <- c()
# start must be initialised to the first hour boundary before the loop
for (i in 1:8760) {
  hour <- start + 3600
  DF_hour <- DF %>%   # summarise into a temporary object so DF is not overwritten
    filter(datetime >= start & datetime < hour) %>%
    summarise(eg = mean(pw))
  datetime <- append(datetime, start)
  eg_bf <- append(eg_bf, DF_hour$eg)
  start <- hour
}
new_DF <- data.frame(datetime, eg_bf)
So I was able to get my new data set with the mean value for every hour of the year.
datetime eg_bf
1 2018-01-01 00:00:00 0.025
2 2018-01-01 01:00:00 0.003
3 2018-01-01 02:00:00 0.002
4 2018-01-01 03:00:00 0.010
5 2018-01-01 04:00:00 0.015
The problem I'm facing is that It takes a lot of time to do it. The idea is to add this calculation to a shiny UI, so every time I make a change it must make the changes faster. Any idea how to improve this calculation?
You can try this: use make_datetime() from the lubridate package to make a new date_time column from the year, month, day and hour columns of your dataset, then group and summarise on the new column.
library(dplyr)
library(lubridate)
df %>%
mutate(date_time = make_datetime(year, m, d, hr)) %>%
group_by(date_time) %>%
summarise(eg_bf = mean(pw))
@Adam Gruer's answer provides a nice solution for the date variable that should solve your question. The calculation of the mean per hour does work with just dplyr, though:
df %>%
group_by(year, m, d, hr) %>%
summarise(test = mean(pw))
# A tibble: 2 x 5
# Groups: year, m, d [?]
year m d hr test
<int> <int> <int> <int> <dbl>
1 2017 8 18 14 0.0609
2 2017 8 18 15 0.0599
You said in your question:
When I use only these functions I am able to get the mean value for every hour but only every month and I want it for each day.
What did you do differently?
Even if you've found your answer, I believe this is worth mentioning:
If you're working with a lot of data and speed is an issue, then you might want to see if you can use data.table instead of dplyr.
You can see with a simple benchmark how much faster data.table is:
library(dplyr)
library(lubridate)
library(data.table)
library(microbenchmark)
set.seed(123)
# dummy data, one year, one entry per minute
# first as data frame
DF <- data.frame(datetime = seq(as.POSIXct("2018-01-01 00:00:00"),
as.POSIXct("2019-01-02 00:00:00"), 60),
pw = runif(527041)) %>%
mutate(year = year(datetime), m=month(datetime),
d=day(datetime), hour = hour(datetime))
# save it as a data.table
dt <- as.data.table(DF)
# transformation with dplyr
f_dplyr <- function(){
DF %>%
group_by(year, m, d, hour) %>%
summarize(eg_bf = mean(pw))
}
# transformation with data.table
f_datatable <- function() {
dt[, mean(pw), by=.(year, m, d, hour)]
}
# benchmarking
microbenchmark(f_dplyr(), f_datatable())
#
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# f_dplyr() 41.240235 44.075019 46.85497 45.64998 47.95968 76.73714 100 b
# f_datatable() 9.081295 9.712694 12.53998 10.55697 11.33933 41.85217 100 a
Check out this post, which covers the comparison in depth: data.table vs dplyr: can one do something well the other can't or does poorly?
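Not from either answer, but combining the two ideas, you can also let data.table group directly on an hour-truncated timestamp (using the DF built in the benchmark above), which avoids carrying the year/m/d/hour helper columns:
library(data.table)
library(lubridate)

dt <- as.data.table(DF)   # DF from the benchmark above: datetime and pw (plus helper columns)
dt[, .(eg_bf = mean(pw)), by = .(hour = floor_date(datetime, "hour"))]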
As I understand it, you have a data frame of 365 * 24 * 60 rows. The code below returns the result instantly. The outcome is mean(pw) grouped by every hour of the year.
remove(list = ls())
library(dplyr)
library(lubridate)
library(purrr)
library(tibble)
date_time <- seq.POSIXt(
as.POSIXct("2018-01-01"),
as.POSIXct("2019-01-01"),
by = "1 min"
)
n <- length(date_time)
data <- tibble(
date_time = date_time,
pw = runif(n),
cu = runif(n),
ye = year(date_time),
mo = month(date_time),
da = day(date_time),
hr = hour(date_time)
)
grouped <- data %>%
group_by(
ye, mo, da, hr
) %>%
summarise(
mean_pw = mean(pw)
)

Rolling Max/Min/Sum for time series over last x Mins interval

I have a financial time series data.frame with microsecond precision:
timestamp price volume
2017-08-29 08:00:00.345678 99.1 10
2017-08-29 08:00:00.674566 98.2 5
....
2017-08-29 16:00:00.111234 97.0 3
2017-08-29 16:00:01.445678 96.5 5
In total: around 100k records per day.
I saw a couple of functions where I can specify the width of the rolling windows, e.g. k = 10. But the k is expressed as a number of observations and not minutes.
I need to calculate a running/rolling Max and Min of the Price series and a running/rolling Sum of the Volume series like this: starting with a timestamp exactly 5 minutes after the beginning of the time series, for every following timestamp look back over a 5-minute interval and calculate the rolling statistics.
How can I calculate this efficiently?
Your data
I wasn't able to capture milliseconds (but the solution should still work)
library(lubridate)
df <- data.frame(timestamp = ymd_hms("2017-08-29 08:00:00.345678", "2017-08-29 08:00:00.674566", "2017-08-29 16:00:00.111234", "2017-08-29 16:00:01.445678"),
price=c(99.1, 98.2, 97.0, 96.5),
volume=c(10,5,3,5))
purrr and dplyr solution
library(purrr)
library(dplyr)
timeinterval <- 5*60 # 5 minute
Filter df for observations within time interval, save as list
mdf <- map(1:nrow(df), ~df[df$timestamp >= df[.x,]$timestamp & df$timestamp < df[.x,]$timestamp+timeinterval,])
Summarise for each data.frame in list
statdf <- map_df(mdf, ~.x %>%
summarise(timestamp = head(timestamp,1),
max.price = max(price),
max.volume = max(volume),
sum.price = sum(price),
sum.volume = sum(volume),
min.price = min(price),
min.volume = min(volume)))
Output
timestamp max.price max.volume sum.price sum.volume
1 2017-08-29 08:00:00 99.1 10 197.3 15
2 2017-08-29 08:00:00 98.2 5 98.2 5
3 2017-08-29 16:00:00 97.0 5 193.5 8
4 2017-08-29 16:00:01 96.5 5 96.5 5
min.price min.volume
1 98.2 5
2 98.2 5
3 96.5 3
4 96.5 5
As I was looking for a backward calculation (start with a timestamp and look 5 minutes back), I slightly modified the great solution by @CPak as follows:
mdf <- map(1:nrow(df), ~df[df$timestamp <= df[.x,]$timestamp & df$timestamp > df[.x,]$timestamp - timeinterval,])
statdf <- map_df(mdf, ~.x %>%
summarise(timestamp_to = tail(timestamp,1),
timestamp_from = head(timestamp,1),
max.price = max(price),
min.price = min(price),
sum.volume = sum(volume),
records = n()))
In addition, I added records = n() to see how many records have been used in the intervals.
One caveat: the code takes 10 mins on mdf and another 6 mins for statdf on a dataset with 100K+ records.
Any ideas how to optimize it? Thank you!
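Not an answer from the thread, but one way to avoid building the intermediate list of data frames is a time-indexed rolling window. The sketch below assumes the slider package (not used in the original answers) and the df defined above; note that its window is closed on both ends ([t - 5 min, t]), which differs slightly from the strict > bound used in the modified map() call.
library(dplyr)
library(lubridate)
library(slider)   # rolling windows indexed by time rather than by row count

df_roll <- df %>%
  arrange(timestamp) %>%   # slide_index_*() requires an ascending index
  mutate(max.price  = slide_index_dbl(price,  timestamp, max, .before = minutes(5)),
         min.price  = slide_index_dbl(price,  timestamp, min, .before = minutes(5)),
         sum.volume = slide_index_dbl(volume, timestamp, sum, .before = minutes(5)))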

Rearrangement of time series data

I am not good at R and not sure how to rearrange and subset time series data. Sorry if this question sounds stupid.
I have time series data of sea tides with four values per day (with missing values as well): two values for high tide and two for low tide. The time and date are given in the same column but in different rows. Now I want to subset the data to daytime only (from 7:00 AM to 7:00 PM), not night. Then I want the data arranged in just three columns: i) Date, ii) Time and iii) Tide. For Tide, I only need the minimum and maximum values. Here is an example of the data and of the desired arrangement; for each date, the data is arranged in three rows, as in the example.
1/1/2011 Low High Low NA
Time 2:58 AM 9:38 AM 5:19 PM NA
Tide 1.2 m 2.2 m 0.6 m NA
1/2/2011 High Low High Low
Time 2:07 AM 4:22 AM 10:19 AM 6:07 PM
Tide 1.4 m 1.3 m 2.3 m 0.4 m
Date Time Tide
1/1/2011 17:19 0.6
1/1/2011 9:38 2.2
1/2/2011 2:07 1.4
1/2/2011 18:07 0.4
The input, DF is assumed to be as in the Note below.
g, the grouping vector, has one element per row of DF and is equal to c(1, 1, 1, 2, 2, 2, ...). Alternate ways to compute g would be n <- nrow(DF); g <- gl(n, 3, n) or n <- nrow(DF); g <- rep(1:n, each = 3, length.out = n).
We then use by to split DF into groups and apply the indicated anonymous function to each group as defined by g.
The anonymous function combines the date and the times in the current group to create the date/times dt making use of the fact that the common date is x[1,1] and the times prior to being cleaned up are in x[2,-1].
Using dt and the tides in x[3, -1] (prior to being cleaned up), it computes each of the three columns, arranging them into a data frame. There is also a commented-out line which removes NA values; uncomment it if you want that. The data frame obtained so far is then subset to the 7 am to 7 pm time period, reduced to the two rows holding the min and max tide, and sorted by time.
Finally do.call("rbind", ...) puts the groups together into one overall data frame.
No packages are used.
g <- cumsum(grepl("\\d", DF$V1))
Long <- do.call("rbind", by(DF, g, function(x) {
dt <- as.POSIXct(paste(x[1,1], as.matrix(x[2, -1])), format = "%m/%d/%Y %I:%M %p")
X <- data.frame(Date = as.Date(dt),
Time = format(dt, "%H:%M"),
Tide = as.numeric(sub("m", "", as.matrix(x[3, -1]))),
stringsAsFactors = FALSE)
# X <- na.omit(X)
X <- subset(X, Time >= "07:00" & Time <= "19:00")
X <- X[c(which.min(X$Tide), which.max(X$Tide)), ]
X[order(X$Time), ]
}))
giving the following -- note that the third row in the question's output is not between 7am and 7pm so the output here necessarily differs.
> Long
Date Time Tide
1.2 2011-01-01 09:38 2.2
1.3 2011-01-01 17:19 0.6
2.3 2011-01-02 10:19 2.3
2.4 2011-01-02 18:07 0.4
Note: The input DF is assumed to be as follows in reproducible form:
Lines <- "1/1/2011,Low,High,Low,NA
Time,2:58 AM,9:38 AM,5:19 PM,NA
Tide,1.2 m,2.2 m,0.6 m,NA
1/2/2011,High,Low,High,Low
Time,2:07 AM,4:22 AM,10:19 AM,6:07 PM
Tide,1.4 m,1.3 m,2.3 m,0.4 m"
DF <- read.table(text = Lines, sep = ",", as.is = TRUE)
If the list is not too long, this endeavour would be simpler to do in a spreadsheet simply by mapping cells and filtering. But one way to do it in R with zoo and tidyverse is the following:
Assuming that the original data frame has its columns named C1:C5:
C1 C2 C3 C4 C5
<chr> <chr> <chr> <chr> <chr>
1 1/1/2010 Low High Low <NA>
2 Time 2:58 AM 9:38 AM 5:19 PM <NA>
3 Tide 1.2 2.2 0.6 <NA>
4 1/2/2011 High Low High Low
5 Time 2:07 AM 4:22 AM 10:19 AM 6:07 PM
6 Tide 1.4 1.3 2.3 0.4
library(dplyr)
library(tidyr)
library(lubridate)
library(zoo)

DF <- DF %>%
  mutate(Date = as.Date(gsub("Tide|Time", "", C1), format = "%d/%m/%Y"))
DF <- DF %>%
  mutate(Date = na.locf(DF$Date, na.rm = TRUE),
         C1 = gsub("[[:digit:]]|\\/", "", C1),
         Type = if_else(nchar(C1) == 0, "TideType", C1)) %>%
  select(Date, Type, C2:C5) %>%
  gather(oColumn, Value, -c(Date, Type)) %>%
  spread(key = Type, value = Value) %>%
  select(Date, Time, Tide) %>%
  filter(complete.cases(.))
DF <- DF %>%
  mutate(Time = ymd_hm(paste(DF$Date, DF$Time, sep = " ")),
         Tide = as.numeric(Tide))
DF <- DF %>%
  mutate(DayNight = (DF$Time) %within%
           interval(as.POSIXlt(DF$Date) + (7*60*60), as.POSIXlt(DF$Date) + (19*60*60))) %>%
  filter(DayNight == TRUE) %>%
  select(-DayNight) %>%
  group_by(Date) %>%
  filter(Tide == max(Tide) | Tide == min(Tide))
DF
Source: local data frame [4 x 3]
Groups: Date [2]
Date Time Tide
<date> <dttm> <dbl>
1 2010-01-01 2010-01-01 09:38:00 2.2
2 2010-01-01 2010-01-01 17:19:00 0.6
3 2011-02-01 2011-02-01 10:19:00 2.3
4 2011-02-01 2011-02-01 18:07:00 0.4
Note that "Date" is a Date type of Object and "Time" is a Posixct type of Date-Time Object. You might want to convert "Time" into a vector of minutes.

R get value at minimum and maximum time of day

I have some data that I need to analyse easily. I want to create a graph of the average usage per day of a week. The data is in a data.table with the following structure:
time value
2014-10-22 23:59:54 7433033.0
2014-10-23 00:00:12 7433034.0
2014-10-23 00:00:31 7433035.0
2014-10-23 00:00:49 7433036.0
...
2014-10-23 23:59:21 7443032.0
2014-10-23 23:59:40 7443033.0
2014-10-23 23:59:59 7443034.0
2014-10-24 00:00:19 7443035.0
Since the value is cumulative, I would need the maximum value of a day, minus the minimum value of that day, and then average all the values with the same days.
I already know how to get the day of the week (using as.POSIXlt and $wday). So how can I get the daily difference? Once I have the data in a structure like:
dayOfWeek value
0 10
1 20
2 50
I should be able to find the mean myself using some functions.
Here is a sample:
library(data.table)
data <- fread("http://pastebin.com/raw.php?i=GXGiCAiu", header=T)
#get the difference per day
#create average per day of week
There are many ways to do this with R. You can use ave() from base R, or the data.table or dplyr packages. These solutions all add the summaries as columns of your data.
data
df <- data.frame(dayOfWeek = c(0L, 0L, 1L, 1L, 2L),
value = c(10L, 5L, 20L, 60L, 50L))
base r
df$min <- ave(df$value, df$dayOfWeek, FUN = min)
df$max <- ave(df$value, df$dayOfWeek, FUN = max)
data.table
require(data.table)
setDT(df)[, ":="(min = min(value), max = max(value)), by = dayOfWeek][]
dplyr
require(dplyr)
df %>% group_by(dayOfWeek) %>% mutate(min = min(value), max = max(value))
If you just want the summaries, you can also use the following:
# base
aggregate(value~dayOfWeek, df, FUN = min)
aggregate(value~dayOfWeek, df, FUN = max)
# data.table
setDT(df)[, list(min = min(value), max = max(value)), by = dayOfWeek]
# dplyr
df %>% group_by(dayOfWeek) %>% summarise(min(value), max(value))
This is actually a trickier problem than it seemed at first glance. I think you need two separate aggregations, one to aggregate the cumulative usage values within each calendar day by taking the difference of the range, and then a second to aggregate the per-calendar-day usage values by weekday. You can extract the weekday with weekdays(), calculate the daily difference with diff() on the range(), calculate the mean with mean(), and aggregate with aggregate():
set.seed(1);
N <- as.integer(60*60*24/19*14);
df <- data.frame(time=seq(as.POSIXct('2014-10-23 00:00:12',tz='UTC'),by=19,length.out=N)+rnorm(N,0,0.5), value=seq(7433034,by=1,length.out=N)+rnorm(N,0,0.5) );
head(df);
## time value
## 1 2014-10-23 00:00:11 7433034
## 2 2014-10-23 00:00:31 7433035
## 3 2014-10-23 00:00:49 7433036
## 4 2014-10-23 00:01:09 7433037
## 5 2014-10-23 00:01:28 7433039
## 6 2014-10-23 00:01:46 7433039
tail(df);
## time value
## 63658 2014-11-05 23:58:14 7496691
## 63659 2014-11-05 23:58:33 7496692
## 63660 2014-11-05 23:58:51 7496693
## 63661 2014-11-05 23:59:11 7496694
## 63662 2014-11-05 23:59:31 7496695
## 63663 2014-11-05 23:59:49 7496697
df2 <- aggregate(value~date,cbind(df,date=as.Date(df$time)),function(x) diff(range(x)));
df2;
## date value
## 1 2014-10-23 4547.581
## 2 2014-10-24 4546.679
## 3 2014-10-25 4546.410
## 4 2014-10-26 4545.726
## 5 2014-10-27 4546.602
## 6 2014-10-28 4545.194
## 7 2014-10-29 4546.136
## 8 2014-10-30 4546.454
## 9 2014-10-31 4545.712
## 10 2014-11-01 4546.901
## 11 2014-11-02 4544.684
## 12 2014-11-03 4546.378
## 13 2014-11-04 4547.061
## 14 2014-11-05 4547.082
df3 <- aggregate(value~dayOfWeek,cbind(df2,dayOfWeek=weekdays(df2$date)),mean);
df3;
## dayOfWeek value
## 1 Friday 4546.196
## 2 Monday 4546.490
## 3 Saturday 4546.656
## 4 Sunday 4545.205
## 5 Thursday 4547.018
## 6 Tuesday 4546.128
## 7 Wednesday 4546.609
Came across this looking for something else. I think you were looking for the difference and mean per Monday, Tuesday, etc. Sticking with data.table allows a quick all-in-one call to get both the mean and the difference per day of the week.
library(data.table)
data <- fread("http://pastebin.com/raw.php?i=GXGiCAiu", header=T)
data_summary <- data[,list(mean = mean(value),
diff = max(value)-min(value)),
by = list(date = format(as.POSIXct(time), format = "%A"))]
This gives an output of 7 rows and three columns.
date mean diff
1: Thursday 7470107 166966
2: Friday 7445945 6119
3: Saturday 7550000 100000
4: Sunday 7550000 100000
5: Monday 7550000 100000
6: Tuesday 7550000 100000
7: Wednesday 7550000 100000
