Calculating the difference of a column compared to a specific reference row

I have a data frame with data for every minute of every weekday during the year, and I want to calculate the difference relative to a specific reference row each day (08:30:00 in this example; Data1 is the column I want the difference for). Usually I would use diff and lag, but those only give the difference to the n-th previous row, not to one specific reference row.
As the entire data set has about 1 million entries, I think using lag and diff in a recursive function (where I could use the condition check to find the starting line and then walk forward) would be too time consuming. Another idea I had is building a second data frame with only the reference line for each day (which would contain only line 3 in this sample) and then joining it with the original data frame as a new column containing the starting value. Then I could easily calculate the difference between the two columns.
Date Time Data1 Diff
1 2022-01-03 08:28:00 4778.14 0
2 2022-01-03 08:29:00 4784.23 0
3 2022-01-03 08:30:00 4785.15 0
4 2022-01-03 08:31:00 4785.01 -0.14
5 2022-01-03 08:32:00 4787.83 2.68
6 2022-01-03 08:33:00 4788.80 3.65

You can index Data1 at the row where Time is "08:30:00" as follows (match() returns the first matching row). This assumes Time is character.
dat$diff <- dat$Data1 - dat$Data1[[match("08:30:00", dat$Time)]]
dat
Date Time Data1 Diff diff
1 2022-01-03 08:28:00 4778.14 0.00 -7.01
2 2022-01-03 08:29:00 4784.23 0.00 -0.92
3 2022-01-03 08:30:00 4785.15 0.00 0.00
4 2022-01-03 08:31:00 4785.01 -0.14 -0.14
5 2022-01-03 08:32:00 4787.83 2.68 2.68
6 2022-01-03 08:33:00 4788.80 3.65 3.65
For data with multiple dates, you can do the same operation for each day using dplyr::group_by():
library(dplyr)
dat %>%
  group_by(Date) %>%
  mutate(diff = Data1 - Data1[[match("08:30:00", Time)]]) %>%
  ungroup()
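For reference, here is a minimal sketch of the join-based approach described in the question: build a one-row-per-day reference table at 08:30:00, join it back by Date, and subtract. The ref_value column name is illustrative, not from the original.
library(dplyr)

ref <- dat %>%
  filter(Time == "08:30:00") %>%
  select(Date, ref_value = Data1)

dat %>%
  left_join(ref, by = "Date") %>%
  mutate(diff = Data1 - ref_value)
On large data this does a single join instead of a per-group match(), which can be faster, though both approaches should scale fine to a million rows.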

How to combine 2 rows into 1

I'm trying to reshape a data frame but I'm totally lost on how to proceed:
> test
# Time Entry Order Size Price S / L T / P Profit Balance
1 1 2017-01-11 00:00:00 buy 1 0.16 1.05403 1.0449 1.07838 NA NA
2 3 2017-01-24 16:00:00 s/l 1 0.16 1.04490 1.0449 1.07838 -97.28 9902.72
As you can see, we have 2 (or more) records for one order ID. What I want to do is combine those 2 rows into one by adding several new columns: Exit (that's where the "s/l" entry of the second observation should go) and Exit Price (where the Price of the second entry should go), and replace the NAs of the first entry with the data from the second one in the Profit and Balance columns.
By the way, the original name of the Entry column is "Type", but I already changed that, which is why having the exit reason of the trade in a column called "Entry" doesn't make much sense. So far I've only thought of extracting the data into several vectors and then doing a mutate on the first entry and dropping the second one, but I'm quite sure there's a better way. Also, that stone-age approach would be useless when applied to the whole data frame.
If possible, I'd like to stick to the tidyverse to do this, just for ease of replication. Thank you in advance for your suggestions!
I ended up sorting it out! My solution was to split the data frame in two, reshape each half as needed, and then full-join them. Here's the initial data frame:
> head(backtest_table, n = 10)
# Time Type Order Size Price S / L T / P Profit Balance
1 1 2017.01.11 00:00 buy 1 0.16 1.05403 1.04490 1.07838 NA NA
2 2 2017.01.19 00:00 buy 2 0.16 1.05376 1.04480 1.07764 NA NA
3 3 2017.01.24 16:00 s/l 1 0.16 1.04490 1.04490 1.07838 -97.28 9902.72
4 4 2017.01.24 16:00 s/l 2 0.16 1.04480 1.04480 1.07764 -95.48 9807.24
5 5 2017.02.09 00:00 buy 3 0.15 1.05218 1.04265 1.07758 NA NA
6 6 2017.03.03 16:00 t/p 3 0.15 1.07758 1.04265 1.07758 251.75 10058.99
7 7 2017.03.29 00:00 buy 4 0.15 1.08826 1.07859 1.11405 NA NA
8 8 2017.04.04 00:00 close 4 0.15 1.08416 1.07859 1.11405 -41.24 10017.75
9 9 2017.04.04 00:00 sell 5 0.15 1.08416 1.09421 1.05737 NA NA
10 10 2017.04.07 00:00 sell 6 0.15 1.08250 1.09199 1.05718 NA NA
Here's the code I used to modify everything:
# Re-format data
library(dplyr)
library(lubridate)

# Separate entries and exits
entries <- backtest_table %>% filter(Type %in% c("buy", "sell"))
exits   <- backtest_table %>% filter(!Type %in% c("buy", "sell"))

# Reshape entries: drop the row index and the empty Profit/Balance columns
entries <- entries[-c(1, 9, 10)]
colnames(entries) <- c("Entry time", "Entry type", "Order", "Entry volume",
                       "Entry price", "Entry SL", "Entry TP")
entries$`Entry time` <- entries$`Entry time` %>% ymd_hm()
entries$`Entry type` <- as.factor(entries$`Entry type`)

# Reshape exits: drop the row index
exits <- exits[-1]
colnames(exits) <- c("Exit time", "Exit type", "Order", "Exit volume",
                     "Exit price", "Exit SL", "Exit TP", "Profit", "Balance")
exits$`Exit time` <- exits$`Exit time` %>% ymd_hm()
exits$`Exit type` <- as.factor(exits$`Exit type`)

# Join the re-shaped halves by order ID
test <- full_join(entries, exits, by = "Order")
And here's the output of that:
> head(test, n = 10)
Entry time Entry type Order Entry volume Entry price Entry SL Entry TP Exit time
1 2017-01-11 buy 1 0.16 1.05403 1.04490 1.07838 2017-01-24 16:00:00
2 2017-01-19 buy 2 0.16 1.05376 1.04480 1.07764 2017-01-24 16:00:00
3 2017-02-09 buy 3 0.15 1.05218 1.04265 1.07758 2017-03-03 16:00:00
4 2017-03-29 buy 4 0.15 1.08826 1.07859 1.11405 2017-04-04 00:00:00
5 2017-04-04 sell 5 0.15 1.08416 1.09421 1.05737 2017-05-26 10:00:00
6 2017-04-07 sell 6 0.15 1.08250 1.09199 1.05718 2017-05-01 09:20:00
7 2017-04-19 sell 7 0.15 1.07334 1.08309 1.04733 2017-04-25 10:00:00
8 2017-05-05 sell 8 0.14 1.07769 1.08773 1.05093 2017-05-29 14:00:00
9 2017-05-24 sell 9 0.14 1.06673 1.07749 1.03803 2017-06-22 18:00:00
10 2017-06-14 sell 10 0.14 1.04362 1.05439 1.01489 2017-06-15 06:40:00
Exit type Exit volume Exit price Exit SL Exit TP Profit Balance
1 s/l 0.16 1.04490 1.04490 1.07838 -97.28 9902.72
2 s/l 0.16 1.04480 1.04480 1.07764 -95.48 9807.24
3 t/p 0.15 1.07758 1.04265 1.07758 251.75 10058.99
4 close 0.15 1.08416 1.07859 1.11405 -41.24 10017.75
5 t/p 0.15 1.05737 1.09421 1.05737 265.58 10091.18
6 s/l 0.15 1.09199 1.09199 1.05718 -94.79 9825.60
7 s/l 0.15 1.08309 1.08309 1.04733 -97.36 9920.39
8 t/p 0.14 1.05093 1.08773 1.05093 247.61 10338.79
9 t/p 0.14 1.03803 1.07749 1.03803 265.59 10504.05
10 s/l 0.14 1.05439 1.05439 1.01489 -100.33 10238.46
And that combined the entry observations (which had NAs in the last columns) with the observations where each trade was closed, populating those columns with the actual result and the new account balance!
If someone has suggestions on how to improve this, please let me know!
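As one possible improvement (a sketch, not the original solution): tidyr::pivot_wider() can do the split-and-join in one step by tagging each row as an entry or exit. The Side helper column is illustrative, the sketch assumes exactly one entry and one exit row per Order, and values_from lists only a subset of columns for brevity.
library(dplyr)
library(tidyr)

backtest_table %>%
  mutate(Side = if_else(Type %in% c("buy", "sell"), "Entry", "Exit")) %>%
  pivot_wider(id_cols     = Order,
              names_from  = Side,
              values_from = c(Time, Type, Price, Profit, Balance),
              names_glue  = "{Side} {.value}")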

Calculating an hour average between 12 and 1 AM with a per-minute dataframe: dplyr

I have a per-minute time series spanning a number of years.
I need to compute the following value for each minute data point:
q <- (Fn - Fd)/Fn
where Fn is the average F value at night between 12 and 1 AM, and Fd is just the minute data point.
Now obviously Fn changes each day, so one approach would be to calculate Fn using a dplyr function, and I would need to create a loop of some kind or re-organise my data frame...
dummy data:
# string of dates for one month
datetime <- seq(
  from = as.POSIXct("2012-1-1 0:00:00", tz = "UTC"),
  to   = as.POSIXct("2012-2-1 0:00:00", tz = "UTC"),
  by   = "min"
)
# variable F
F <- runif(44641, min = 0, max = 2)
# dataframe (note: cbind() coerces POSIXct to numeric, hence the conversion below)
df <- as.data.frame(cbind(datetime, F))
library(lubridate)
# make sure it's in "POSIXct" "POSIXt" format
df$datetime <- as_datetime(df$datetime)
Or a less elegant way might be to get Fn on its own, filtering between those times using dplyr first - I think it will be something like this:
Fn <- df %>%
  filter(between(as.numeric(format(datetime, "%H")), 0, 1)) %>%
  group_by(hour = format(datetime, "%Y-%m-%d %H:")) %>%
  summarise(value = mean(df$F))
But I am not sure my syntax is correct there? Am I calculating the mean F between 12 and 1 AM per day?
Then I could just write the average Fn value for each day back to every minute row of my data frame and do the simple calculation to get q.
Thanks in advance for advice here.
Maybe something like this?
library(dplyr)
library(lubridate)

df %>%
  group_by(Date = as.Date(datetime)) %>%
  mutate(F_mean = mean(F[hour(datetime) == 0]),
         value  = (F_mean - F) / F_mean) %>%
  ungroup() %>%
  select(-F_mean, -Date)
# datetime F value
# <dttm> <dbl> <dbl>
# 1 2012-01-01 00:00:00 1.97 -0.902
# 2 2012-01-01 00:01:00 0.194 0.813
# 3 2012-01-01 00:02:00 1.52 -0.467
# 4 2012-01-01 00:03:00 1.66 -0.599
# 5 2012-01-01 00:04:00 0.765 0.262
# 6 2012-01-01 00:05:00 1.31 -0.267
# 7 2012-01-01 00:06:00 1.62 -0.565
# 8 2012-01-01 00:07:00 0.642 0.380
# 9 2012-01-01 00:08:00 1.62 -0.560
#10 2012-01-01 00:09:00 1.68 -0.621
# ... with 44,631 more rows
We first group_by each date, take the mean of F for the 0th hour (values between 00:00 and 00:59) of each day, and calculate value using the formula given. (Note that in the attempt above, summarise(value = mean(df$F)) refers to the whole df$F column rather than the grouped values, so it would return the overall mean for every group; use plain F inside the pipeline.)
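For comparison, a minimal base R sketch of the same per-day calculation (assuming the df built above), using ave() to recycle each day's 00:00-00:59 mean across all rows of that day:
# TRUE for rows in the 0th hour (00:00-00:59)
is_night <- as.integer(format(df$datetime, "%H")) == 0
day_id   <- as.Date(df$datetime)

# per-day night mean, recycled to every row of that day
Fn <- ave(ifelse(is_night, df$F, NA), day_id,
          FUN = function(x) mean(x, na.rm = TRUE))
df$value <- (Fn - df$F) / Fn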

Remove incomplete days / retain complete days

I have data from field instruments where values for 7 different parameters are measured and recorded every 15 minutes. The data set extends over many years. Sometimes the instruments fail or are taken off-line for preventive maintenance, giving incomplete days in the record. In post-processing the data, I would like to remove those incomplete days (or, stated alternatively, retain only the complete days).
An abbreviated example of what the data might look like:
Date Temp
2012-02-01 00:01:00 18.5
2012-02-01 00:16:00 18.4
2012-02-01 00:31:00 18.6
.
.
.
2012-02-01 23:31:00 19.0
2012-02-01 23:46:00 18.9
2012-02-02 00:01:00 19.0
2012-02-02 00:16:00 19.0
2012-02-03 00:01:00 17.0
2012-02-03 00:16:00 17.1
2012-02-03 00:31:00 17.0
.
.
.
2012-02-03 23:31:00 18.0
2012-02-03 23:46:00 18.2
So 2012-02-01 and 2012-02-03 are complete days and I'd like to remove 2012-02-02 as it is an incomplete day.
The approach:
1. Convert dates to days
2. Count the number of observations per day
3. Retain only those days with the maximum number of observations
The code:
library(dplyr)
library(lubridate)

dataset %>%
  mutate(Day = floor_date(Date, unit = "day")) %>%
  group_by(Day) %>%
  mutate(nObservation = n()) %>%
  ungroup() %>%  # ungroup so max() is taken across all days, not within each day
  filter(nObservation == max(nObservation))
An alternative using rle(), assuming a complete day has 96 fifteen-minute readings (rle() needs a plain vector, so the timestamps are converted to day strings first):
day       <- as.character(as.Date(df$Date))
Date.rle  <- rle(day)
Date.good <- Date.rle$values[Date.rle$lengths == 96]
df        <- df[day %in% Date.good, ]
Here is one base R method that should work:
# create a day variable
df$day <- as.Date(df$Date, format="%Y-%m-%d")
# calculate the number of observations per day
df$obsCnt <- ave(df$Temp, df$day, FUN = length)
# subset data: keep only complete days (96 observations)
dfNew <- df[df$obsCnt == 96, ]
I put the threshold at 96 observations a day (a full day of 15-minute readings), but it is easily adjusted.
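If the expected number of readings per day is known up front (96 for 15-minute data), a compact dplyr sketch of the same idea filters on n() directly:
library(dplyr)
library(lubridate)

dataset %>%
  group_by(Day = floor_date(Date, unit = "day")) %>%
  filter(n() == 96) %>%   # keep only days with a full set of readings
  ungroup()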

Summarize R data frame based on a date range in a second data frame

I have two data frames, one that includes data by day, and one that includes data by irregular time multi-day intervals. For example:
A data frame precip_range with precipitation data by irregular time intervals:
start_date <- as.Date(c("2010-11-01", "2010-11-04", "2010-11-10"))
end_date   <- as.Date(c("2010-11-03", "2010-11-09", "2010-11-12"))
precipitation <- c(12, 8, 14)
precip_range <- data.frame(start_date, end_date, precipitation)
And a data frame precip_daily with daily precipitation data:
day <- as.Date(c("2010-11-01", "2010-11-02", "2010-11-03", "2010-11-04", "2010-11-05",
                 "2010-11-06", "2010-11-07", "2010-11-08", "2010-11-09", "2010-11-10",
                 "2010-11-11", "2010-11-12"))
precip <- c(3, 1, 2, 1, 0.25, 1, 3, 0.33, 0.75, 0.5, 1, 2)
precip_daily <- data.frame(day, precip)
In this example, precip_daily represents daily precipitation estimated by a model and precip_range represents measured cumulative precipitation for specific date ranges. I am trying to compare modeled to measured data, which requires synchronizing the time periods.
So, I want to summarize the precip column in precip_daily (count of observations and sum of precip) over the date ranges between start_date and end_date in precip_range. Any thoughts on the best way to do this?
You can use the start_dates from precip_range as breaks to cut() to group your daily values. For example:
rng <- cut(precip_daily$day,
           breaks = c(precip_range$start_date, max(precip_range$end_date)),
           include.lowest = TRUE)
Here we cut the daily values using the start dates in the range data.frame, making sure to include the lowest value and to stop at the largest end value. If we merge that with the daily values we see
cbind(precip_daily, rng)
# day precip rng
# 1 2010-11-01 3.00 2010-11-01
# 2 2010-11-02 1.00 2010-11-01
# 3 2010-11-03 2.00 2010-11-01
# 4 2010-11-04 1.00 2010-11-04
# 5 2010-11-05 0.25 2010-11-04
# 6 2010-11-06 1.00 2010-11-04
# 7 2010-11-07 3.00 2010-11-04
# 8 2010-11-08 0.33 2010-11-04
# 9 2010-11-09 0.75 2010-11-04
# 10 2010-11-10 0.50 2010-11-10
# 11 2010-11-11 1.00 2010-11-10
# 12 2010-11-12 2.00 2010-11-10
which shows that the values have been grouped. Then we can do
aggregate(cbind(count=1, sum=precip_daily$precip)~rng, FUN=sum)
# rng count sum
# 1 2010-11-01 3 6.00
# 2 2010-11-04 6 6.33
# 3 2010-11-10 3 3.50
This gives the count and total for each of those ranges (each range labeled with its start date).
Alternatively:
library(zoo)
library(data.table)
temp <- merge(precip_daily, precip_range,
              by.x = "day", by.y = "start_date", all.x = TRUE)
temp$end_date <- na.locf(temp$end_date)  # carry each range's end date forward
setDT(temp)[, list(Sum = sum(precip), Count = .N), by = end_date]
## end_date Sum Count
## 1: 2010-11-03 6.00 3
## 2: 2010-11-09 6.33 6
## 3: 2010-11-12 3.50 3
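Another option (a sketch, not taken from the answers above): a data.table non-equi join aggregates the daily values against each range directly, avoiding the intermediate merge and na.locf() fill:
library(data.table)
setDT(precip_daily)
setDT(precip_range)

# for each row of precip_range, count and sum the matching daily rows
precip_daily[precip_range,
             on = .(day >= start_date, day <= end_date),
             .(Count = .N, Sum = sum(precip)),
             by = .EACHI]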

Make Quarterly Time Series in Julia

In Julia, we can create a Time Array with the following code:
d = [date(1980,1,1):date(2015,1,1)];
t = TimeArray(d,rand(length(d)),["test"])
This would give us daily data. What about getting quarterly or yearly time series?
Simply use the optional step capability of Base.range in combination with the type Dates.Period:
julia> [Date(1980,1,1):Month(3):Date(2015,1,1)]
141-element Array{Date{ISOCalendar},1}:
1980-01-01
1980-04-01
1980-07-01
1980-10-01
1981-01-01
1981-04-01
...
And change the step as necessary
julia> [Date(1980,1,1):Year(1):Date(2015,1,1)]
36-element Array{Date{ISOCalendar},1}:
1980-01-01
1981-01-01
1982-01-01
...
0.3.x vs 0.4.x
In version 0.3.x the Dates module is provided by the Dates package, whereas in version 0.4.x the Dates module is built in. Another (current) subtle difference is that Year and Month must be accessed as Dates.Year and Dates.Month in version 0.4.x.
I know this question is a bit old, but it's worth adding that there is another time series package called Temporal* that has this functionality available.
Here's some example usage:
using Temporal, Base.Dates
date_array = collect(today()-Day(365):Day(1):today())
random_walk = cumsum(randn(length(date_array))) + 100.0
Construct the time series object (type TS). The last argument gives the column names; if not supplied, default column names are autogenerated.
ts_data = TS(random_walk, date_array, :RandomWalk)
# Index RandomWalk
# 2016-08-24 99.8769
# 2016-08-25 99.1643
# 2016-08-26 98.8918
# 2016-08-27 97.7265
# 2016-08-28 97.9675
# 2016-08-29 97.7151
# 2016-08-30 97.0279
# ⋮
# 2017-08-17 81.2998
# 2017-08-18 82.0658
# 2017-08-19 82.1941
# 2017-08-20 81.9021
# 2017-08-21 81.8163
# 2017-08-22 81.5406
# 2017-08-23 81.2229
# 2017-08-24 79.2867
Get the last observation of every quarter (similar logic exists for weeks, months, and years using eow, eom, and eoy respectively):
eoq(ts_data) # get the last observation at every quarter
# 4x1 Temporal.TS{Float64,Date}: 2016-09-30 to 2017-06-30
# Index RandomWalk
# 2016-09-30 88.5629
# 2016-12-31 82.1014
# 2017-03-31 84.9065
# 2017-06-30 92.1997
You can also use functions to aggregate data by the same kinds of periods given above.
collapse(ts_data, eoq, fun=mean) # get the average value every quarter
# 4x1 Temporal.TS{Float64,Date}: 2016-09-30 to 2017-06-30
# Index RandomWalk
# 2016-09-30 92.5282
# 2016-12-31 86.8291
# 2017-03-31 89.1391
# 2017-06-30 90.3982
* (Disclaimer: I'm the package author.)
Quarterly isn't supported yet, but other time periods such as week, month, and year are. There is a method called collapse that converts a TimeArray to a larger time frame.
d = [Date(1980,1,1):Date(2015,1,1)];
t = TimeArray(d,rand(length(d)),["test"])
c = collapse(t, last, period=year)
Returns the following
36x1 TimeArray{Float64,1} 1980-12-31 to 2015-01-01
test
1980-12-31 | 0.94
1981-12-31 | 0.37
1982-12-31 | 0.12
1983-12-31 | 0.64
⋮
2012-12-31 | 0.43
2013-12-31 | 0.81
2014-12-31 | 0.88
2015-01-01 | 0.55
Also, note that date has been deprecated in favor of Date, as an updated package now provides the date/time functions.
