Identify Min & Max Numeric Value within Date/Datetime range repeatedly - r

I am completely new to R so this is proving too complex to handle for me right now, so any help is much appreciated.
I am analysing price action data for BTC. I have 1 minute candles from 2019-09-08 19:13:00 to 2022-03-15 00:22:00 with the variables of open, high, low, close price as well as volume in BTC & USD and trade count for each of those minutes. Data source is https://www.cryptodatadownload.com/data/binance/ for anyone interested.
I cleaned up & correctly formatted the data and now want to analyse when BTC price made a low & high for various date & time ranges, for example:
At what time of day, in 30-minute increments, did BTC make the low of the week?
Here is what I believe I need to do:
I need to tell R that 30 minutes is a range and identify the lowest value of "Low" and the highest value of "High" within it, then do the same with a day as the range, and again with a week as the range.
Then I'd need to mark these values; the best method I can think of is creating a new TRUE/FALSE column like so:
btcusdt_binance_fut_1min$pa.low.of.week.30min
btcusdt_binance_fut_1min$pa.high.of.week.30min
Every minute row that falls within the 30-minute block containing that week's low or high would be marked TRUE, and every other minute within that week would be marked FALSE.
I looked at lubridate's interval() function, but as far as I can tell I would have to define each year, month, week, day and 30-minute interval individually with a start and end time, which is obviously not feasible. I believe I run into the same problem with the subset() function.
Another option seems to be the seq() and seq.POSIXt() functions, as well as the range() function, but I haven't found a way to make them work for this.
Here is all my code and I am using this data set: https://www.cryptodatadownload.com/cdd/BTCUSDT_Binance_futures_data_minute.csv
library(readr)
library(lubridate)
library(tidyverse)
library(plyr)
library(dplyr)
# IMPORT CSV FILE AS DATA SET
# Name data set & choose import file
# Skip = 1 for skipping first row of CSV
btcusdt_binance_fut_1min <-
  read.csv(
    file.choose(),
    skip = 1,
    header = TRUE,
    sep = ","
  )
# CLEAN UP & REORGANISE DATA
# Remove unix & symbol column
btcusdt_binance_fut_1min$unix = NULL
btcusdt_binance_fut_1min$symbol = NULL
# Rename date column to datetime
colnames(btcusdt_binance_fut_1min)[colnames(btcusdt_binance_fut_1min) == "date"] <-
  "datetime"
# Convert datetime column to POSIXct format
btcusdt_binance_fut_1min$datetime <-
  as_datetime(btcusdt_binance_fut_1min$datetime, tz = "UTC")
# Create variable column for each time element
btcusdt_binance_fut_1min$year    <- year(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$month   <- month(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$week    <- isoweek(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$weekday <- wday(btcusdt_binance_fut_1min$datetime,
                                         label = TRUE, abbr = FALSE)
btcusdt_binance_fut_1min$hour    <- hour(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$minute  <- minute(btcusdt_binance_fut_1min$datetime)
# Reorder columns
btcusdt_binance_fut_1min <-
  btcusdt_binance_fut_1min[, c(1, 9, 10, 11, 12, 13, 14, 4, 3, 2, 5, 6, 7, 8)]

Using data.table we can do the following:
library(data.table)
btcusdt_binance_fut_1min <- data.table(
  datetime = seq.POSIXt(as.POSIXct("2022-01-01 0:00"), as.POSIXct("2022-01-01 2:59"), by = "1 min"))
btcusdt_binance_fut_1min[, group := format(as.POSIXct(cut(datetime, breaks = "30 min")), "%H:%M")]
The cut() function "floors" each datetime to the nearest half hour below it. The format() and as.POSIXct() calls are only there to strip the date part so the same half hours can easily be compared across days, but if you prefer to keep a full datetime you can remove them.
After this the next steps are pretty straightforward:
btcusdt_binance_fut_1min[, .(High = max(High), Low = min(Low)), by = .(group)]
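To get from these per-bucket extremes back to the TRUE/FALSE columns described in the question, one possible sketch (using toy data, with the column names pa.low.of.week.30min / pa.high.of.week.30min copied from the question; the real table would keep its existing columns) is to find, per ISO week, the 30-minute bucket that contains the weekly low or high and flag every minute row inside that bucket:
library(data.table)
library(lubridate)
set.seed(42)
# toy stand-in for the 1-minute candles
dt <- data.table(
  datetime = seq.POSIXt(as.POSIXct("2022-01-03 00:00", tz = "UTC"),
                        as.POSIXct("2022-01-09 23:59", tz = "UTC"), by = "1 min")
)
dt[, Low  := 40000 + cumsum(rnorm(.N))]
dt[, High := Low + runif(.N, 1, 10)]
# label every row with its 30-minute bucket and its ISO week
dt[, bucket := as.character(cut(datetime, breaks = "30 min"))]
dt[, week   := isoweek(datetime)]
# per week, flag every minute that sits in the bucket containing the weekly low/high
dt[, pa.low.of.week.30min  := bucket == bucket[which.min(Low)],  by = week]
dt[, pa.high.of.week.30min := bucket == bucket[which.max(High)], by = week]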

Related

Moving average varying window

I have an unbalanced panel, in which I have certain observations (variable x) per ID and month. I am trying to calculate a 6-month-rolling average of x, but only every March. I know that with zoo, I can calculate the average every single time, but I think that is computationally expensive. I have a very large panel, so it would be better to define an index first and pass it to the function. Also, my panel is imbalanced, so sometimes I have all 6 past values at a given March, and sometimes I do not. If there is a minimum of 3 values available, I would still like to compute the average.
Here is some sample code and my solution so far:
library(data.table)
set.seed(1)
time <- rep(seq(as.Date("2010-02-01"), length = 42, by = "1 month") - 1, 2)
IDs <- rep(letters[1:2], each = length(time))
DT <- data.table(time = time,
                 ID = IDs,
                 ind = rep(1:(2 * length(time))),
                 row = 1:(2 * length(time)),
                 x = sample(2 * length(time)))
DT
DT <- DT[!ind %in% c(11,12,26)]
DT
library(zoo)
DT[, movavg := if (length(x) >= 3) {
  rollapply(x, 6, sum, na.rm = FALSE, align = "right", fill = NA)
} else {
  rep(NA, length(x))
}, by = ID]
DT
The target is simply to show, for each March, the corresponding moving average over the past 6 observations. I don't mind whether the original panel is kept and results appear only in the March rows, or whether only the March values are extracted and nothing else is shown.
My code works, but it does the calculation for every row/month. What I want is for it to run only at a defined index. The issue is that, because the panel is unbalanced, the distance between Marches is not always the same: it can be 12 months from one year to the next, but only 10 months when 2 observations happen to be missing. Can rollapply still be used? Any hints for data.table or dplyr are highly appreciated.
If this code from the question gives what you want
DT[, movavg := if (length(x) >= 3) {
  rollapply(x, 6, sum, na.rm = FALSE, align = "right", fill = NA)
} else {
  rep(NA, length(x))
}, by = ID]
then the first of the two alternatives below (rollsumr from zoo) ran 2.8x faster and gave the same result, and the second (frollsum from data.table) ran 4.8x faster.
DT[, movavg := rollsumr(x, 6, fill = NA), by = ID]
DT[, movavg := frollsum(x, 6), by = ID]
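Both of these still compute a value for every row. If you only need the Marches, one possible sketch (my own, not from the answer above; it assumes the 6-month window means the March observation plus the five calendar months before it, and it uses mean() because the question asks for an average even though the posted code sums) is a non-equi join:
library(data.table)
library(lubridate)
# one window per March row and ID
march <- DT[month(time) == 3, .(ID, end = time, start = time %m-% months(5))]
# collect the rows of DT whose time falls inside each window, then aggregate per window
win <- DT[march, on = .(ID, time >= start, time <= end),
          .(n = sum(!is.na(x)), avg = mean(x, na.rm = TRUE)),
          by = .EACHI]
# require at least 3 observations in the window, as in the question
march[, movavg := ifelse(win$n >= 3, win$avg, NA_real_)]
march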

R calculating time differences in a (layered) long dataset

I've been struggling with a bit of timestamp data (haven't had to work with dates much until now, and it shows). Hope you can help out.
I'm working with data from a website showing for each customer (ID) their respective visits and the timestamp for those visits. It's grouped in the sense that one customer might have multiple visits/timestamps.
The df is structured as follows, in a long format:
df <- data.frame("Customer" = c(1, 1, 1, 2, 3, 3),
"Visit" =c(1, 2, 3, 1, 1, 2), # e.g. customer ID #1 has visited the site three times.
"Timestamp" = c("2019-12-31 12:13:25", "2019-12-31 16:13:25", "2020-01-05 10:13:25", "2019-11-12 15:18:42", "2019-11-13 19:22:35", "2019-12-10 19:43:55"))
Note: In the real dataset the timestamp isn't a factor but some other haggard character-type abomination which I should probably first try to convert into a POSIXct format somehow.
What I would like to do here is to create a df that displays per customer their average time between visits (let's say in minutes, or hours). Visitors with only a single visit (e.g., second customer in my example) could be filtered out in advance or should display a 0. My final goal is to visualize that distribution, and possibly calculate a grand mean across all customers.
Because the number of visits can vary drastically (e.g. one or 256 visits) I can't just use a 'wide' version of the dataset where a fixed number of visits are the columns which I could then subtract and average.
I'm at a bit of a loss how to best approach this type of problem, thanks a bunch!
Using dplyr (Timestamp first has to be an actual date-time, not a character or factor, for the subtraction to work):
df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  mutate(Difference = Timestamp - lag(Timestamp)) %>%
  summarise(MeanGap = mean(Difference, na.rm = TRUE))
Due to the grouping, the first Difference value for any customer is NA (including customers with only one visit), so those rows are dropped by the mean thanks to na.rm = TRUE.
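If you want the gaps in specific units (the question mentions minutes or hours), a small variant of the same idea, sketched here as an assumption rather than part of the answer, is to compute them with difftime() and set the units explicitly:
library(dplyr)
df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  mutate(GapMin = as.numeric(difftime(Timestamp, lag(Timestamp), units = "mins"))) %>%
  summarise(MeanGapMin = mean(GapMin, na.rm = TRUE))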
Using base R (no extra packages):
sort the data, ordering by customer Id, then by timestamp.
calculate the time difference between consecutive rows (using the diff() function), grouping by customer id (tapply() does the grouping).
find the average
squish that into a data.frame.
# 1 sort the data
df$Timestamp <- as.POSIXct(df$Timestamp)
# not debugged
df <- df[order(df$Customer, df$Timestamp),]
# 2 apply a diff.
# if you want to force the time units to seconds, convert
# the timestamp to numeric first.
# without conversion
diffs <- tapply(df$Timestamp, df$Customer, diff)
# ======OR======
# convert to seconds
diffs <- tapply(as.numeric(df$Timestamp), df$Customer, diff)
# 3 find the averages
diffs.mean <- lapply(diffs, mean)
# 4 squish that into a data.frame
diffs.df <- data.frame(do.call(rbind, diffs.mean))
diffs.df$Customer <- names(diffs.mean)
# 4a tidy up the data.frame names
names(diffs.df)[1] <- "Avg_Interval"
diffs.df
You haven't shown your timestamp strings, but when you need to wrangle them, the lubridate package is your friend.
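For example (a hedged illustration, since the real timestamp strings aren't shown), parse_date_time() can try several candidate formats at once:
library(lubridate)
# hypothetical messy inputs, purely for illustration; list the formats you expect in orders
raw <- c("2019-12-31 12:13:25", "31/12/2019 16:13")
parse_date_time(raw, orders = c("Ymd HMS", "dmy HM"), tz = "UTC")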

Filter specificaly with hour/minutes/seconds - R

I have this df:
> Time
[1] "02:15:00" "02:30:00" "02:45:00" "03:00:00" "03:15:00" "03:30:00"
I want to delete all the time values before 3:00:00. However, I need to express the cutoff in the form hour = 3, minutes = 0, seconds = 0, like:
df <- df[df$Time < a_function(hour=3, minutes=0, seconds=0) ,]
I want to know how I can do this with time values, the way I can with year, month, and day.
Why do you want to do it that way? Just asking for context. Also have you looked at lubridate::hms()? It takes a time object and converts to periods of time.
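A hedged sketch of what that comment suggests (assuming Time is a character vector like the one shown): parse both sides with hms() and compare them as seconds:
library(lubridate)
df <- data.frame(Time = c("02:15:00", "02:30:00", "02:45:00",
                          "03:00:00", "03:15:00", "03:30:00"))
# hms() parses "HH:MM:SS" strings into periods; period_to_seconds() makes them plain numbers
cutoff <- period_to_seconds(hms("03:00:00"))
df_kept <- df[period_to_seconds(hms(df$Time)) >= cutoff, , drop = FALSE]
df_kept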

How to group time series data by arbitrary dates in R?

I have a data.frame like the following:
df <- data.frame(
  DateTime = seq(ISOdate(2015, 1, 1, 0), by = 15 * 60, length.out = 35040),
  kWh = abs(rnorm(35040, mean = 550, sd = 50))
)
and a vector such as:
dates <- as.Date(c("2015-01-15", "2015-02-17", "2015-03-14", "2015-04-16",
"2015-05-16", "2015-06-18", "2015-07-15", "2015-08-15",
"2015-09-16", "2015-10-13", "2015-11-17", "2015-12-17"))
What I want to do is add a column to df that indicates what accounting period each entry is attributed to. For example, every entry from the beginning of the data through the last entry on 2015-01-14 would be given a value of 201501 because they are attributed to the January 2015 accounting period. Likewise, every value from 2015-01-15 to the last value on 2015-02-16 would be given a value of 201502.
I was hoping that there would be a solution using lubridate as I'd rather not convert to an xts or zoo based object. Performance is also somewhat important as I will have to do this for a couple hundred such data sets.
I figured out the answer myself; I didn't realize cut() also works with POSIXct objects.
df$Period <- cut(df$DateTime, breaks = as.POSIXct(dates),
                 labels = 201502:201512)
It's important to convert the dates to POSIXct, because otherwise cut() throws an error saying that the breaks are not formatted correctly.
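Note that with these breaks, everything before 2015-01-15 and everything on or after 2015-12-17 ends up as NA. A possible variant (a sketch; the edge labels 201501 and 201601 are my assumption for what those leftover periods should be called) is to pad the breaks with the start and end of the data:
breaks <- c(min(df$DateTime), as.POSIXct(dates), max(df$DateTime) + 1)
df$Period <- cut(df$DateTime, breaks = breaks,
                 labels = c(201501, 201502:201512, 201601))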

Replace values in an xts object according to some events on specific dates in R

I have two signal series and a data series as below.
library(xts)
BuyDates <- seq(as.Date("2013/1/1"), as.Date("2013/3/1"), by = "5 days")
SellDates <- seq(as.Date("2013/1/1"), as.Date("2013/3/1"), by = "7 days")
data <- xts(rnorm(32, 100, 3), seq(as.Date("2013/1/1"), as.Date("2013/2/1"), by = "days"))
What I want is: on the dates on which data gets a buy signal from BuyDates, the value of data should be replaced by 1, and on SellDates it should be -1. On the remaining days in the sequence, the 1 or -1 should be carried forward until the opposite signal appears, and for the days before the first signal the value should be replaced with NA.
kindly help
You can subset the data as usual:
data <- xts(rep(NA, 32), seq(as.Date("2013/1/1"), as.Date("2013/2/1"), by = "days"))
data[BuyDates] <- 1
data[SellDates] <- -1
Then you can carry forward the non-NA values using na.locf.
na.locf(data)
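Put together, a sketch might look like this (na.locf() comes from zoo, which xts attaches; I restrict the signal dates to the series' own index, and na.rm = FALSE keeps the leading NAs for the days before the first signal):
library(xts)  # attaches zoo, which provides na.locf()
BuyDates <- seq(as.Date("2013/1/1"), as.Date("2013/3/1"), by = "5 days")
SellDates <- seq(as.Date("2013/1/1"), as.Date("2013/3/1"), by = "7 days")
# start from an all-NA series on the same index as the data
signal <- xts(rep(NA_real_, 32),
              seq(as.Date("2013/1/1"), as.Date("2013/2/1"), by = "days"))
# assign signals only on dates that actually exist in the index
signal[index(signal) %in% BuyDates] <- 1
signal[index(signal) %in% SellDates] <- -1
# carry the last signal forward; days before the first signal stay NA
signal <- na.locf(signal, na.rm = FALSE)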
