I have two data sets of currency price data (High, Low, Open and Close) on two different timeframes (1 hour and 5 minutes).
I have a price target for each row of the 1-hour data, and a target direction for the trade (either 1 or -1).
For trades with a direction of '1', I'm trying to find the first point on the 5-minute chart where the "low" is <= the target level. Conversely, for trades with a direction of '-1', I want to find the first point on the 5-minute chart where the "high" is >= the target level.
I've put the code below to demonstrate what I'm after.
The problem I'm having in practice is that I'm looking at time periods spanning 10 - 20 years, which makes the joins very slow.
I've set the example below to be from 2016 to 2022 so it's not too slow on my computer, but if you extend it out to more than 10 years it starts to really become a pain.
This may be something that I have to live with, but I'm looking for guidance on two things:
1. Is there a faster / more efficient way of achieving what I've outlined below?
2. You can see that I've separated the buy and sell trades into two joins at the end. Is there a way to combine them into one single join? It's not really that important, as it doesn't take up much space in my code, but I'm interested for educational purposes.
Thanks heaps in advance,
Phil
# Load required packages
library(data.table)
library(dplyr)
# Define timeframes for the data
Start <- as.POSIXct("2016-01-01 00:00:00")
End <- as.POSIXct("2022-01-01 23:55:00")
Hours <- floor(as.numeric(difftime(End,Start,units = "hours"))) + 1
Minutes <- floor(as.numeric(difftime(End,Start,units = "mins")) / 5) + 1
# Create the Hourly data
set.seed(123)
hourly_prices <- data.table(
  datetime  = seq(Start, End, by = "hour"),
  open      = rnorm(Hours, mean = 100, sd = 1),
  high      = rnorm(Hours, mean = 101, sd = 1),
  low       = rnorm(Hours, mean = 99, sd = 1),
  close     = rnorm(Hours, mean = 100, sd = 1),
  Direction = sample(c(1, -1), Hours, replace = TRUE)) %>%
  .[, Target_price := ifelse(Direction == -1,
                             rnorm(.N, mean = 104, sd = 1),
                             rnorm(.N, mean = 97, sd = 1))]
# Create the 5-minute data
set.seed(456)
minute_prices <- data.table(
  datetime = seq(Start, End, by = "5 min"),
  open     = rnorm(Minutes, mean = 100, sd = 1),
  high     = rnorm(Minutes, mean = 101, sd = 1),
  low      = rnorm(Minutes, mean = 99, sd = 1),
  close    = rnorm(Minutes, mean = 100, sd = 1),
  Position = seq_len(Minutes))
# Join the two data.tables to find the first point at which price passes the target levels
hourly_prices[(Direction == 1),Location := minute_prices[.SD, on = .(datetime > datetime, low <= Target_price),mult = "first",x.Position]]
hourly_prices[(Direction == -1),Location := minute_prices[.SD, on = .(datetime > datetime, high >= Target_price),mult = "first",x.Position]]
One simple ~3x speed-up would be to subset the minute_prices data.table to include only those rows that are the cumulative min/max for the current hour (minus 1 second, since the join is datetime > datetime):
dtM <- copy(minute_prices)
dtH <- copy(hourly_prices)
system.time({
dtH[(Direction == 1),
    Location := dtM[dtM[, low == cummin(low), as.integer(datetime - 1) %/% 3600L][[2]]
                    ][.SD, on = .(datetime > datetime, low <= Target_price),
                      mult = "first", x.Position]]
dtH[(Direction == -1),
    Location := dtM[dtM[, high == cummax(high), as.integer(datetime - 1) %/% 3600L][[2]]
                    ][.SD, on = .(datetime > datetime, high >= Target_price),
                      mult = "first", x.Position]]
})
#> user system elapsed
#> 6.77 0.01 6.81
system.time({
hourly_prices[(Direction == 1),Location := minute_prices[.SD, on = .(datetime > datetime, low <= Target_price),mult = "first",x.Position]]
hourly_prices[(Direction == -1),Location := minute_prices[.SD, on = .(datetime > datetime, high >= Target_price),mult = "first",x.Position]]
})
#> user system elapsed
#> 22.97 0.00 23.04
identical(dtH, hourly_prices)
#> [1] TRUE
I am trying to simulate the transmission of viruses in a population using the function ode from the deSolve package. The basis of my model is an SIR model, and I posted a much simpler demo of my model here, which consists of only three states: S (susceptible), I (infectious) and R (recovered). Each state is represented by an m*n matrix in my code, since I have m age groups and n subpopulations in my population.
The problem is: during the simulation period, there will be several vaccination activities that transfer people from state S to state I. Each vaccination activity is characterized by a begin date, an end date, a coverage rate and a duration. What I want is that once the time t falls within the begin and end dates of a vaccination activity, the code calculates the effective vaccination rate (also an m*n matrix, based on the coverage rate and duration) and multiplies it with S (an m*n matrix) to get the matrix of people moved to state I. Right now, I am using if() to decide whether time t is between a begin date and an end date:
# initialize the matrix of effective vaccination rates
irrate_matrix = matrix(data = rep(0, m*n), nrow = m, ncol = n)
for (i in 1:length(tbegin)) {
  if (t >= tbegin[i] & t <= tend[i]) {
    for (j in 1:n) {
      irrate_matrix[, j] = -log(1 - covir[(j-1)*length(tbegin) + i]) / duration[i]
    }
  }
}
Here, irrate_matrix is the m*n effective vaccination rate matrix, m = 2 is the number of age groups, n = 2 is the number of subpopulations, tbegin = c(5, 20, 35) is the begin date of 3 vaccination activities, tend = c(8, 23, 38) is the end date of 3 vaccination activities, covir = c(0.35, 0.25, 0.25, 0.225, 0.18, 0.13) is the coverage rate of each vaccination for each subpopulation (e.g., covir[1] = 0.35 is the coverage rate of the first vaccination for subpopulation1, while covir[4] = 0.225 is the coverage rate of the first vaccination for subpopulation2) and duration = c(4, 4, 4) is the duration of each vaccination (in days).
After calculating irrate_matrix, I plug it into the derivatives, so I have:
dS = as.matrix(b*N) - as.matrix(irrate_matrix*S) - as.matrix(mu*S)
dI = as.matrix(irrate_matrix*S) - as.matrix(gammaS*I) - as.matrix(mu*I)
dR = as.matrix(gammaS*I) - as.matrix(mu*R)
I want to do a simulation from day 0 to day 50, by 1-day step, thus:
times = seq(0, 50, 1)
The current issue with my code is this: every time t comes close to a tbegin[i] or tend[i], the simulation becomes much slower, because the solver iterates at that time point for many more rounds than at any other time point. For example, once t reaches tbegin[1] = 5, the model iterates at time point 5 for many rounds. I attached screenshots of the printed iterations (screenshot1 and screenshot2). I found that this is why my bigger model now takes a very long time to run.
I have tried the "events" mechanism of deSolve mentioned by tpetzoldt in the Stack Overflow question "change the value of a parameter as a function of time". However, I found it inconvenient to change a matrix of parameters, and to change it for every vaccination activity.
I am looking for solutions regarding:
1. How can I make irrate_matrix non-zero when there is a vaccination activity and zero when there is none? (It has to be calculated for each vaccination.)
2. At the same time, how can I make the code run faster by avoiding many solver iterations at each tbegin[i] or tend[i]? (I think I should not use if(), but I do not know what to do in my case.)
3. If I need to use "forcings" or "events", could you please also tell me how to have multiple forcings/events in the model? Right now, I already use an "events" data frame in my bigger model to introduce the virus into the population:
virusevents = data.frame(var = "I1", time = 2, value = 1, method = "add")
Any good idea is welcome, and code examples are much appreciated! Thank you in advance!
For reference, I post the whole demo here:
library(deSolve)
##################################
###(1) define the sir function####
##################################
sir_basic <- function (t, x, params) {
  # retrieve initial states
  S = matrix(data = x[(0*m*n+1):(1*m*n)], nrow = m, ncol = n)
  I = matrix(data = x[(1*m*n+1):(2*m*n)], nrow = m, ncol = n)
  R = matrix(data = x[(2*m*n+1):(3*m*n)], nrow = m, ncol = n)
  with(as.list(params), {
    N = as.matrix(S + I + R)
    # print out current iteration
    print(paste0("Total population at time ", t, " is ", sum(N)))
    # calculate irrate_matrix by checking time t
    irrate_matrix = matrix(data = rep(0, m*n), nrow = m, ncol = n)
    for (i in 1:length(tbegin)) {
      if (t >= tbegin[i] & t <= tend[i]) {
        for (j in 1:n) {
          irrate_matrix[, j] = -log(1 - covir[(j-1)*length(tbegin) + i]) / duration[i]
        }
      }
    }
    # derivatives
    dS = as.matrix(b*N) - as.matrix(irrate_matrix*S) - as.matrix(mu*S)
    dI = as.matrix(irrate_matrix*S) - as.matrix(gammaS*I) - as.matrix(mu*I)
    dR = as.matrix(gammaS*I) - as.matrix(mu*R)
    derivatives <- c(dS, dI, dR)
    list(derivatives)
  })
}
##################################
###(2) characterize parameters####
##################################
m = 2 # the number of age groups
n = 2 # the number of sub-populations
tbegin = c(5, 20, 35) # begin dates
tend = c(8, 23, 38) # end dates
duration = c(4, 4, 4) # duration
covir = c(0.35, 0.25, 0.25, 0.225, 0.18, 0.13) # coverage rates
b = 0.0006 # daily birth rate
mu = 0.0006 # daily death rate
gammaS = 0.05 # transition rate from I to R
parameters = c(m = m, n = n,
tbegin = tbegin, tend = tend, duration = duration, covir = covir,
b = b, mu = mu, gammaS = gammaS)
##################################
#######(3) initial states ########
##################################
inits = c(
S = c(20000, 40000, 10000, 20000),
I = rep(0, m*n),
R = rep(0, m*n)
)
##################################
#######(4) run simulations########
##################################
times = seq(0, 50, 1)
traj <- ode(func = sir_basic,
y = inits,
parms = parameters,
times = times)
plot(traj)
Element-wise operations are the same for matrices and vectors, so the as.matrix conversions are redundant, as no true matrix multiplication is used. Same with the rep: the zero is recycled anyway.
This alone reduces CPU time to roughly 50% of the original. In contrast, using an external forcing with approxTime instead of the inner if and for loops made the model slower (not shown).
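A quick illustration of both points (a minimal standalone sketch, not part of the model code):
m2 <- matrix(1:4, nrow = 2)
v2 <- 1:4
all(0.05 * m2 == 0.05 * v2)                          # TRUE: element-wise math is identical for matrix and vector
identical(as.matrix(0.05 * m2), 0.05 * m2)           # TRUE: the as.matrix() wrapper adds nothing here
identical(matrix(0, 2, 2), matrix(rep(0, 4), 2, 2))  # TRUE: the single 0 is recycled anyway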
Simplified code
sir_basic2 <- function (t, x, params) {
  # retrieve initial states
  S = x[(0*m*n+1):(1*m*n)]
  I = x[(1*m*n+1):(2*m*n)]
  R = x[(2*m*n+1):(3*m*n)]
  with(as.list(params), {
    N = S + I + R
    # print out current iteration
    #print(paste0("Total population at time ", t, " is ", sum(N)))
    # calculate irrate_matrix by checking time t
    irrate_matrix = matrix(data = 0, nrow = m, ncol = n)
    for (i in 1:length(tbegin)) {
      if (t >= tbegin[i] & t <= tend[i]) {
        for (j in 1:n) {
          irrate_matrix[, j] = -log(1 - covir[(j-1) * length(tbegin) + i]) / duration[i]
        }
      }
    }
    # derivatives
    dS = b*N - irrate_matrix*S - mu*S
    dI = irrate_matrix*S - gammaS*I - mu*I
    dR = gammaS*I - mu*R
    list(c(dS, dI, dR))
  })
}
Benchmark
Each model version is run 10 times. Model sir_basic is the original implementation, with the print line disabled for a fair comparison.
system.time(
for(i in 1:10)
traj <- ode(func = sir_basic,
y = inits,
parms = parameters,
times = times)
)
system.time(
for(i in 1:10)
traj2 <- ode(func = sir_basic2,
y = inits,
parms = parameters,
times = times)
)
plot(traj, traj2)
summary(traj - traj2)
I observed another considerable speedup when using method = "adams" instead of the default lsoda solver, but this may differ for your full model.
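For reference, a minimal sketch of switching the solver (same objects as above; whether "adams" is appropriate depends on how stiff the full model is, so compare the results):
traj3 <- ode(func = sir_basic2, y = inits, parms = parameters,
             times = times, method = "adams")
plot(traj2, traj3)
summary(traj2 - traj3)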
Example data:
library(data.table)
set.seed(1)
DT <- data.table(panelID = sample(10,10),
some_NA = sample(0:5, 6),
some_NA_factor = sample(0:5, 6),
Group = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
Time = rep(seq(as.Date("2010-01-03"), length=20, by="1 month") - 1,5),
wt = 15*round(runif(100)/10,2),
Income = round(rnorm(10,-5,5),2),
Happiness = sample(10,10),
Sex = round(rnorm(10,0.75,0.3),2),
Age = sample(100,100),
Height= 150*round(rnorm(10,0.75,0.3),2))
# ERRORS
DT[1:5, 11] <- 0   # introduce errors in column 11 (Height)
I have some errors in the Height registration of my data. Since the data is panel data, I should be able to deduce the actual Height from the other observations. To automate this process, I thought of replacing a value with the median if the value is more than 50 cm away from the median:
setDT(DT)[abs(median(Height, na.rm = TRUE) - Height) > 50 , newheight := median(Height, na.rm = TRUE), by=panelID]
However, the by argument is somehow not working, leading to the following result:
> table(DT$newheight)
0 27 165
4 10 10
Could anyone tell me what I am doing wrong?
Your problem is in the order of operations. First, the filter (i.e. the bit before the comma) is being applied to the whole table, so the median calculated here is the median of the whole Height column. Then, the bit after the comma is being applied within each group, to the filtered data. So you're actually using the median of the outliers, rather than the median of the whole group (which I assume is what you intended).
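You can see the difference directly (a small check, not needed for the fix):
DT[, median(Height, na.rm = TRUE)]                 # one global median: this is what the filter used
DT[, median(Height, na.rm = TRUE), by = panelID]   # per-group medians: presumably what was intended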
One fix could be to do it in stages:
setDT(DT)
DT[, median.height:= median(Height, na.rm = TRUE), by='panelID']
DT[abs(Height - median.height) > 50, newheight:= median.height]
Or, it's possible you intended this instead, which fills newheight for every row:
DT[, newheight:= fifelse(abs(Height - median.height) > 50, median.height, Height)]
Try:
setDT(DT)
DT[ , newheight := ifelse(abs(median(Height, na.rm = TRUE) - Height) > 50, median(Height, na.rm = TRUE), Height), by=.(panelID)]
I want to turn my time series code for a single time series into an automated version that can be applied to multiple time series (my data consists of monthly time series).
My general approach for one time series was to remove the seasonal component and take first differences to achieve stationarity. Then I use auto.arima to get the ARIMA parameters. I use these parameters to build my ARIMA model with my original time series data. Then I forecast, compare against the 4 months of actual data (which I held out beforehand), and calculate the RMSE.
As I cannot use my actual data, I just generate a random time series and test set as an example - of course the outcome does not make much sense.
library('forecast')
set.seed(123)
# create random time series and 4 months testing data
ts <- ts(runif(26, min = 50, max = 3000), start = c(2017,01), end = c(2019,02), frequency = 12)
test.data <- runif(4, min = 50, max = 3000)
# Decompose
comp.ts = decompose(ts)
# subtract the seasonal component
ts2 <- ts - comp.ts$seasonal
ts2 <- diff(ts2, differences=1)
auto.arima(ts2, trace = T, seasonal = TRUE,ic = 'aicc', max.p = 10,max.q = 10,max.P = 10,max.Q = 10,max.d = 10, stepwise = F)
# Use auto.arima outcome as input
my.arima <- Arima(ts2, order=c(0,0,0),seasonal = list(order = c(0,1,0), period = 12),method="ML", include.drift = F)
# Forecast and calculate RMSE
data.forecast <- forecast(my.arima, h=4, level=c(99.5))
my.difference <- test.data - data.forecast$mean
my.rmse <- (sum(sqrt(my.difference^2)))/length(my.difference)  # note: sqrt(x^2) = abs(x), so this is actually the mean absolute error rather than the RMSE
As my actual data set contains over 500 time series, I need to automate the whole process. Unfortunately, I have not used R for time series so far, so I'm having trouble coming up with an automated process.
Let's assume 4 random time series with 4 random test sets. How could I set up an automated process for these (which I could also use for my actual 500+ time series) that does exactly the same thing as above?
ts1 <- ts(runif(26, min = 50, max = 3000), start = c(2017,01), end = c(2019,02), frequency = 12)
ts2 <- ts(runif(26, min = 50, max = 3000), start = c(2017,01), end = c(2019,02), frequency = 12)
ts3 <- ts(runif(26, min = 50, max = 3000), start = c(2017,01), end = c(2019,02), frequency = 12)
ts4 <- ts(runif(26, min = 50, max = 3000), start = c(2017,01), end = c(2019,02), frequency = 12)
test.data1 <- runif(4, min = 50, max = 3000)
test.data2 <- runif(4, min = 50, max = 3000)
test.data3 <- runif(4, min = 50, max = 3000)
test.data4 <- runif(4, min = 50, max = 3000)
Thanks for the help!
Just put your workflow into a function.
serialArima <- function(ts, test.data) {
  library(forecast)
  # Decompose
  comp.ts = decompose(ts)
  # subtract the seasonal component
  ts2 <- ts - comp.ts$seasonal
  ts2 <- diff(ts2, differences = 1)
  auto.arima(ts2, trace = T, seasonal = TRUE, ic = 'aicc', max.p = 0, max.q = 0, max.P = 0,
             max.Q = 0, max.d = 0, stepwise = F)
  # Use the auto.arima outcome as input
  my.arima <- Arima(ts2, order = c(0, 0, 0),
                    seasonal = list(order = c(0, 1, 0), period = 2),
                    method = "ML", include.drift = F)
  # Forecast and calculate RMSE
  data.forecast <- forecast(my.arima, h = 4, level = c(99.5))
  my.difference <- test.data - data.forecast$mean
  my.rmse <- (sum(sqrt(my.difference^2)))/length(my.difference)
  return(list(data.forecast = data.forecast, my.difference = my.difference, my.rmse = my.rmse))
}
Single application
serialArima(ts, test.data)
# ARIMA(0,0,0) with zero mean : 82.45803
# ARIMA(0,0,0) with non-zero mean : 88.13593
#
#
#
# Best model: ARIMA(0,0,0) with zero mean
#
# $data.forecast
# Point Forecast Lo 99.5 Hi 99.5
# 2020.00 -349.1424 -2595.762 1897.477
# 2020.50 772.6014 -1474.018 3019.221
# 2021.00 -349.1424 -3526.342 2828.057
# 2021.50 772.6014 -2404.598 3949.801
#
# $my.difference
# Time Series:
# Start = c(2020, 1)
# End = c(2021, 2)
# Frequency = 2
# [1] 1497.2446 840.4139 2979.4553 993.5614
#
# $my.rmse
# [1] 1577.669
Multiple applications
Map(serialArima, list(ts1, ts2, ts3, ts4),
list(test.data1, test.data2, test.data3, test.data4))
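The result is a list with one element per series, from which the individual pieces can be pulled out afterwards, e.g. (storing the call above in a hypothetical object results):
results <- Map(serialArima, list(ts1, ts2, ts3, ts4),
               list(test.data1, test.data2, test.data3, test.data4))
sapply(results, function(res) res$my.rmse)   # one RMSE per series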
In a data frame, I am trying to look for data points that are more than (threshold * s.d.) away from the mean. The dimensions of the data frame are as follows:
[1] 4032 4
To find the data points for the above condition, I did:
library(zoo)  # rollapply() comes from zoo
df$mean = rollapply(df$value, width = 2, FUN = mean, align = "right", fill = "extend")
df$sd = rollapply(df$value, width = 2, FUN = sd, align = "right", fill = "extend")
After the above, head(df) looks like:
            timestamp  value   mean        sd
2007-03-14 1393577520 37.718 38.088 0.5232590
2007-03-15 1393577220 38.458 38.088 0.5232590
2007-03-16 1393576920 37.912 38.185 0.3860803
2007-03-17 1393576620 40.352 39.132 1.7253405
2007-03-18 1393576320 38.474 39.413 1.3279465
2007-03-19 1393576020 39.878 39.176 0.9927779
To find the datapoints:
anomaly = df[df$value > abs((threshold*df$sd + df$mean) |
(df$mean - threshold*df$sd)),]
Is the above the correct way to find data points that are more than (threshold * s.d.) away from the mean? The reason I am suspicious is that the dim of anomaly is the same as that of df.
This will do it
# creating some dummy data
m <- matrix(runif(16128,-1,1), ncol = 4)
tresh <- .004+1
m[which(abs(m-mean(m)) > tresh*sd(m), arr.ind = T)]
Here m denotes your matrix (or your column value, depending on which one you take the mean/sd of) and tresh your threshold.
Update: Here are the first couple of entries of my result:
dat <- df$value[which(abs(df$value-mean(df$value)) > tresh*sd(df$value))]
head(dat)
[1] 51.846 48.568 44.986 49.108 53.404 46.314
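If you want to keep the rolling mean/sd columns already computed in the question, the row-wise comparison could instead be written like this (a sketch, assuming threshold is a single number):
threshold <- 3   # hypothetical value
anomaly <- df[abs(df$value - df$mean) > threshold * df$sd, ]
dim(anomaly)     # should now be smaller than dim(df)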
I am trying to formulate a somewhat complicated conditional expression using two tables ("CS8_2007_2009_M&F.csv" and "CourbeDeCroissance_M&F.csv").
The first table contains around 60 000 individuals (nocs8), each with a value for weight ("weight") and age ("agegestationnel").
The second table gives, for each age ("GA"), the corresponding 3rd, 5th and 10th percentiles of weight ("3%", "5%" and "10%" respectively).
I'd like to code each individual (nocs8) according to its age ("GA") and the corresponding weight percentiles ("3%", "5%" and "10%").
Here is my script:
mydata=fread("CS8_2007_2009_M&F.csv",
colClasses = c(rep("character", 5),
rep("numeric", 5 ),
"character",
rep("numeric", 7 ),
rep("character", 9), "numeric"))
setkey(mydata,nocs8)
weight = fread("CourbeDeCroissance_M&F.csv")
setkey(weight, GA)
# normal weight
mydata[, quant := 0]
# weight < 10%
mydata[, quant := if (weight[GA == agegestationnel, `10%`] > mydata[[weight]]) 1, by = 1:nrow(mydata)]
# weight < 5%
mydata[, quant := if (weight[GA == agegestationnel, `5%`] > mydata[[weight]]) 1, by = 1:nrow(mydata)]
# weight < 3%
mydata[, quant := if (weight[GA == agegestationnel, `3%`] > mydata[[weight]]) 1, by = 1:nrow(mydata)]
And I got this message error:
« Error in weight["GA" == agegestationnel, "10%"] :
incorrect number of dimensions
»
I'm wondering if it is due to the large size of my sample (nocs8 = 60 000) or the large number of conditions I'm testing (23 GA x 3 percentiles = 46)? If so, what can I do?
I finally found the answer:
mydata <- read.csv("file1.csv", sep=";")
weight <- read.csv("file2.csv", sep=";")
data_merge <- merge(mydata, weight, by.x=14, by.y=1, all.x=TRUE)
data_merge$categ = NA
data_merge[!is.na(data_merge$weight) & !is.na(data_merge$X10.) & (data_merge$weight > data_merge$X10.), "categ"] = "Normal"
data_merge[!is.na(data_merge$weight) & !is.na(data_merge$X10.) & (data_merge$weight < data_merge$X10.), "categ"] = "low"
data_merge[!is.na(data_merge$weight) & !is.na(data_merge$X5.) & (data_merge$weight < data_merge$X5.), "categ"] = "very low"
data_merge[!is.na(data_merge$weight) & !is.na(data_merge$X3.) & (data_merge$weight < data_merge$X3.), "categ"] = "Extremely low"
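For completeness, the same classification could probably be written as one data.table step after the merge (a sketch, assuming the merged columns are named weight, X10., X5. and X3. as above, and a data.table version that provides fcase; the conditions are checked from most to least severe to mirror the overwriting order above):
library(data.table)
setDT(data_merge)
data_merge[, categ := fcase(
  !is.na(weight) & !is.na(X3.)  & weight < X3.,  "Extremely low",
  !is.na(weight) & !is.na(X5.)  & weight < X5.,  "very low",
  !is.na(weight) & !is.na(X10.) & weight < X10., "low",
  !is.na(weight) & !is.na(X10.) & weight > X10., "Normal",
  default = NA_character_
)]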