How to group time series data by arbitrary dates in R?

I have a data.frame like the following:
df <- data.frame(
DateTime = seq(ISOdate(2015, 1, 1, 0), by = 15 * 60, length.out = 35040),
kWh = abs(rnorm(35040, mean = 550, sd = 50))
)
and a vector such as:
dates <- as.Date(c("2015-01-15", "2015-02-17", "2015-03-14", "2015-04-16",
"2015-05-16", "2015-06-18", "2015-07-15", "2015-08-15",
"2015-09-16", "2015-10-13", "2015-11-17", "2015-12-17"))
What I want to do is add a column to df that indicates which accounting period each entry is attributed to. For example, every entry from the beginning of the data through the last entry on 2015-01-14 would be given a value of 201501, because it is attributed to the January 2015 accounting period. Likewise, every value from 2015-01-15 through the last value on 2015-02-16 would be given a value of 201502.
I was hoping that there would be a solution using lubridate as I'd rather not convert to an xts or zoo based object. Performance is also somewhat important as I will have to do this for a couple hundred such data sets.

I figured out the answer: I didn't realize that cut also works with POSIXct objects.
df$Period <- cut(df$DateTime, breaks = as.POSIXct(dates),
labels = 201502:201512)
It's important to convert the dates to POSIXct objects because otherwise cut throws an error saying that the breaks are not formatted correctly.
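Note that with only the twelve billing dates as breaks, rows before 2015-01-15 (and after 2015-12-17) fall outside every interval and become NA. A minimal sketch that also covers the first and last periods, assuming the data runs through year-end as in the example (the boundary breaks and the 201601 label are my own additions):

breaks <- c(min(df$DateTime), as.POSIXct(dates), max(df$DateTime) + 1)
df$Period <- cut(df$DateTime, breaks = breaks, labels = c(201501:201512, 201601))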

Related

Date Formatting in Time Series Codes

I have a .csv file that looks like this:
Date       Time   Demand
01-Jan-05  6:30   6
01-Jan-05  6:45   3
...
23-Jan-05  21:45  0
23-Jan-05  22:00  1
The days are broken into 15-minute increments from 6:30 to 22:00.
Now I am trying to build a time series from this, but I am a little lost on the notation.
I have the following so far:
library(tidyverse)
library(forecast)
library(zoo)
tp <- read.csv(".csv")
tp.ts <- ts(tp$DEMAND, start = c(), end = c(), frequency = 63)
The frequency I am after is an entire day, which I believe makes the number 63.***
However, I am unsure as to how to notate the dates in c().
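(As a quick check of the 63 figure, my own arithmetic rather than the thread's: 6:30 to 22:00 spans 15.5 hours, and 15.5 * 4 + 1 = 63 quarter-hour stamps counting both endpoints.)

length(seq(from = as.POSIXct("2005-01-01 06:30"),
           to   = as.POSIXct("2005-01-01 22:00"), by = "15 min"))
#> [1] 63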
***Edit
If the frequency is meant to be observations per unit of time, and I am trying to observe just (Demand) by the 15-minute time slots (Time) in each day (Date), maybe my frequency is 1?
***Edit 2
So I think I am struggling with doing the time series because I have a Date column (which is characters) and a Time column.
Since I need the data for Demand at the given hours on the dates, maybe I need to convert the dates to be used in ts() and combine the Date and Time data into a new column?
If I do this, I am assuming this should give me the times I need (6:30 to 22:00) but with the addition of having the date?
However, the data is to be used to predict the Demand for the rest of the month. So maybe the Date is an important variable if the day of the week impacts Demand?
We assume you are starting with tp, shown reproducibly in the Note at the end. A complete cycle of 24 * 4 = 96 points should be represented by one unit of time internally. The chron class does that, so read the data in as a zoo series z with a chron time index and then convert it to ts, giving ts_ser; or possibly leave it as a zoo series, depending on what you are going to do next.
library(zoo)
library(chron)

# Parse "01-Jan-05" + "6:30" into a single chron date-time
to_chron <- function(date, time) as.chron(paste(date, time), "%d-%b-%y %H:%M")

# zoo series with a chron index; 4 * 24 = 96 quarter-hour points per day
z <- read.zoo(tp, index = 1:2, FUN = to_chron, frequency = 4 * 24)
ts_ser <- as.ts(z)
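A quick sanity check of the result (my own verification, not part of the original answer): with one chron day as the unit of time, the ts should report 96 observations per cycle.

frequency(ts_ser)
#> [1] 96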
Note
tp <- structure(list(Date = c("01-Jan-05", "01-Jan-05"), Time = c("6:30",
"6:45"), Demand = c(6L, 3L)), row.names = 1:2, class = "data.frame")

Identify Min & Max Numeric Value within Date/Datetime range repeatedly

I am completely new to R, so this is proving too complex for me to handle right now; any help is much appreciated.
I am analysing price action data for BTC. I have 1 minute candles from 2019-09-08 19:13:00 to 2022-03-15 00:22:00 with the variables of open, high, low, close price as well as volume in BTC & USD and trade count for each of those minutes. Data source is https://www.cryptodatadownload.com/data/binance/ for anyone interested.
I cleaned up & correctly formatted the data and now want to analyse when BTC price made a low & high for various date & time ranges, for example:
What time of day, in 30-minute increments, did BTC make its low for the week?
Here is what I believe I need to do:
I need to tell R to treat 30 minutes as a range and identify the lowest and highest values of the "Low" and "High" variables within it, then do the same with a day as a range, and again with a week as a range.
Then I'd need to mark these values; the best method I can think of would be to create a new TRUE/FALSE column like so:
btcusdt_binance_fut_1min$pa.low.of.week.30min
btcusdt_binance_fut_1min$pa.high.of.week.30min
Every minute row that falls within that 30-minute low or high would be marked TRUE, and every other minute within that week would be marked FALSE.
I looked at lubridate's interval() function, but as far as I can tell the problem is that I'd need to define each year, month, week, day, and 30-minute interval individually with start and end times, which is obviously not feasible. I believe I run into the same problem with the subset() function.
Another option seems to be the seq() and seq.POSIXt() functions, as well as the range() function, but I haven't found a way to make them work.
Here is all my code and I am using this data set: https://www.cryptodatadownload.com/cdd/BTCUSDT_Binance_futures_data_minute.csv
library(readr)
library(lubridate)
library(tidyverse)
library(plyr)
library(dplyr)
# IMPORT CSV FILE AS DATA SET
# Name data set & choose import file
# Skip = 1 for skipping first row of CSV
btcusdt_binance_fut_1min <-
read.csv(
file.choose(),
skip = 1,
header = T,
sep = ","
)
# CLEAN UP & REORGANISE DATA
# Remove unix & symbol column
btcusdt_binance_fut_1min$unix = NULL
btcusdt_binance_fut_1min$symbol = NULL
# Rename date column to datetime
colnames(btcusdt_binance_fut_1min)[colnames(btcusdt_binance_fut_1min) == "date"] <-
"datetime"
# Convert datetime column to POSIXct format
btcusdt_binance_fut_1min$datetime <-
as_datetime(btcusdt_binance_fut_1min$datetime, tz = "UTC")
# Create variable column for each time element
btcusdt_binance_fut_1min$year <-
year(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$month <-
month(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$week <-
isoweek(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$weekday <-
wday(btcusdt_binance_fut_1min$datetime,
label = TRUE,
abbr = FALSE)
btcusdt_binance_fut_1min$hour <-
hour(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$minute <-
minute(btcusdt_binance_fut_1min$datetime)
# Reorder columns
btcusdt_binance_fut_1min <-
btcusdt_binance_fut_1min[, c(1, 9, 10, 11, 12, 13, 14, 4, 3, 2, 5, 6, 7, 8)]
Using data.table we can do the following:
library(data.table)
btcusdt_binance_fut_1min <- data.table(
  datetime = seq.POSIXt(as.POSIXct("2022-01-01 0:00"),
                        as.POSIXct("2022-01-01 2:59"), by = "1 min")
)
btcusdt_binance_fut_1min[, group := format(as.POSIXct(cut(datetime, breaks = "30 min")), "%H:%M")]
The cut function will "floor" each datetime to its nearest, smaller half hour. The format and as.POSIXct calls are just there to remove the date part, allowing easy comparison of the same half hours across different dates; if you prefer to keep it a datetime, you can remove them.
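For instance, a small sketch of that flooring behaviour (times invented; note that cut anchors its breaks at the first observation, so this floors cleanly when the series starts on a round boundary, as minute bars normally do):

x <- as.POSIXct(c("2022-01-01 00:00", "2022-01-01 00:29",
                  "2022-01-01 00:30", "2022-01-01 00:59"))
format(as.POSIXct(cut(x, breaks = "30 min")), "%H:%M")
#> [1] "00:00" "00:00" "00:30" "00:30"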
After this the next steps are pretty straightforward:
btcusdt_binance_fut_1min[, .(High = max(High), Low = min(Low)), by=.(group)]
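From there, a hedged sketch of the TRUE/FALSE flagging the question describes (my own continuation; it assumes the real dataset with its High and Low columns plus the year and week columns created in the question's code):

library(data.table)
dt <- as.data.table(btcusdt_binance_fut_1min)  # assumed to hold the real 1-minute data
dt[, group := format(as.POSIXct(cut(datetime, breaks = "30 min")), "%H:%M")]

# Flag every row whose 30-minute bucket contains the weekly extreme
dt[, pa.low.of.week.30min  := group == group[which.min(Low)],  by = .(year, week)]
dt[, pa.high.of.week.30min := group == group[which.max(High)], by = .(year, week)]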

How to create intervals of 1 hour

How do I create hourly timestamps for every date?
So, for example, from 00:00 until 23:59; one of the resulting values could be 10:00. I read on the internet that a loop could work, but we couldn't make it fit.
Data sample:
df = data.frame( id = c(1, 2, 3, 4), Date = c(2021-04-18, 2021-04-19, 2021-04-21
07:07:08.000, 2021-04-22))
A few points:
The input shown in the question is not valid R syntax, so we assume what we have is the data frame shown reproducibly in the Note at the end.
The question did not describe the specific output desired, so we will assume that what is wanted is a POSIXct vector of hourly values: in (1) below we assume it runs from the first hour of the minimum date to the last hour of the maximum date, in the current time zone; in (2) below we assume that we only want hourly sequences for the dates in df, also in the current time zone.
We assume that any times in the input should be dropped.
We assume that the id column of the input should be ignored.
No packages are used.
1) This calculates hour 0 of the first date and hour 0 of the day after the last date, giving rng. The as.Date call takes the Date part, range extracts the smallest and largest dates into a vector of two components, and adding 0:1 leaves the first date as is while converting the second to the day after the last date. The format call ensures that the Dates are converted to POSIXct in the current time zone rather than UTC. We then create an hourly sequence between the two and use head to drop the last value, since it would fall on the day after the input's last date.
rng <- as.POSIXct(format(range(as.Date(df$Date)) + 0:1))
head(seq(rng[1], rng[2], "hour"), -1)
2) Another possibility is to paste together each date with each hour from 0 to 23 and then convert that to POSIXct. This will give the same result if the input dates are sequential; otherwise, it will give the hours only for those dates provided.
with(expand.grid(Date = as.Date(df$Date), hour = paste0(0:23, ":00:00")),
sort(as.POSIXct(paste(Date, hour))))
Note
df <- data.frame( id = c(1, 2, 3, 4),
Date = c("2021-04-18", "2021-04-19", "2021-04-21 07:07:08.000", "2021-04-22"))

Define different timeseries for different columns

I have a dataframe where some of the columns start later than the others. Please find a reproducible example below.
set.seed(354)
df <- data.frame(Product_Id = rep(1:100, each = 50),
Date = seq(from = as.Date("2014/1/1"),
to = as.Date("2018/2/1"),
by = "month"),
Sales = rnorm(100, mean = 50, sd= 20))
df <- df[-c(251:256, 301:312, 2551:2562, 2651:2662, 2751:2762), ]
library(zoo)
z <- read.zoo(df, index = "Date", split = "Product_Id", FUN = as.yearmon)
tt <- as.ts(z)
Now, for this dataframe, I want to define columns 6, 7, 52, 54, and 56 as time series starting from different dates than the rest of the dataframe. Suppose the data begins in Jan 2000; column 6 will begin in July 2000, column 7 in Jan 2001, and so on. How should I proceed to do this?
Later, I want to perform a forecast on this dataset. Any inputs on this? Should I consider each column as a separate dataframe and do the forecasting, or can I convert each column to a different time series object that starts from the first non-NA value?
Now, for this dataframe, I want to define columns 6, 7, 52, 54, and 56 as time series starting from different dates than the rest of the dataframe. Suppose the data begins in Jan 2000; column 6 will begin in July 2000, column 7 in Jan 2001, and so on. How should I proceed to do this?
There is, AFAIK, no way to do this in R in a time series matrix. And if each column started at a different date, then (since each column has the same number of entries) each column would also need to end at a different date. Is this really what you need? A collection of time series that all happen to be of the same length (so they can fit into a matrix), but that start and end with offsets? I struggle to understand where something like this would be useful, outside a kind of forecasting competition.
If you really need this, then I would recommend you put your time series into a list structure. Then each one can start and end at any date, and they can be the same or different lengths. Take inspiration from Mcomp::M3.
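A minimal sketch of such a list (lengths and start dates invented for illustration):

series_list <- list(
  ts(rnorm(50), start = c(2014, 1), frequency = 12),  # full-length series
  ts(rnorm(44), start = c(2014, 7), frequency = 12),  # starts July 2014
  ts(rnorm(38), start = c(2015, 1), frequency = 12)   # starts January 2015
)
sapply(series_list, start)  # each series keeps its own start date

Each element can then be modeled independently, much like iterating over Mcomp::M3.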
Later, I want to perform a forecast on this dataset. Any inputs on this? Should I consider each column as a separate dataframe and do the forecasting, or can I convert each column to a different time series object that starts from the first non-NA value?
Since your tt is already a time series object, the simplest way would be simply to iterate over its columns:
library(forecast)
fcst <- matrix(nrow = 10, ncol = ncol(tt))
for (ii in 1:ncol(tt)) fcst[, ii] <- forecast(ets(tt[, ii]), 10)$mean
Note that most modeling functions in forecast will throw a warning and do something reasonable on encountering NA values. Here, e.g.:
1: In ets(tt[, ii]) :
Missing values encountered. Using longest contiguous portion of time series
Of course, you could do something yourself inside the loop, e.g., search for the last NA and start the time series for modeling right after that (but make sure you fail gracefully if the last entry is NA).
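For example, a hedged sketch of that idea (my own variant, untested against your data): drop everything up to and including the last NA, and skip the column entirely if the final observation is NA.

library(forecast)

trim_after_last_na <- function(y) {
  nas <- which(is.na(y))
  if (length(nas) == 0L) return(y)
  if (max(nas) == length(y)) return(NULL)    # fail gracefully: series ends in NA
  window(y, start = time(y)[max(nas) + 1L])  # keep only the clean tail
}

fcst <- matrix(nrow = 10, ncol = ncol(tt))
for (ii in 1:ncol(tt)) {
  y <- trim_after_last_na(tt[, ii])
  if (!is.null(y)) fcst[, ii] <- forecast(ets(y), 10)$mean
}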

Create date index and add to data frame in R

Currently transitioning from Python to R. In Python, you can create a date range with pandas and add it to a data frame like so:
data = pd.read_csv('Data')
dates = pd.date_range('2006-01-01 00:00', periods=2920, freq='3H')
df = pd.DataFrame({'data' : data}, index = dates)
How can I do this in R?
Further, to compare two datasets with different lengths but the same time span, you can resample the lower-frequency dataset to match the length of the higher-frequency one by placing NaNs in the holes, like so:
df2 = pd.read_csv('data2') #3 hour resolution = 2920 points of data
data2 = df2.resample('30Min').asfreq() #30 Min resolution = 17520 points
I guess I'm basically looking for a Pandas package equivalent for R. How can I code these in R?
The following is a way of getting your time series data from a given time interval (3 hours) to another (30 minutes):
Get the data:
starter_df <- data.frame(
  dates = seq(from = as.POSIXct("2006-01-01 00:00"),
              length.out = 2920,
              by = "3 hours"),
  data = rnorm(2920)
)
Get the full sequence in 30-minute intervals and fill in the values from the starter_df data frame, leaving NAs elsewhere:
grid <- seq(from = min(starter_df$dates), to = max(starter_df$dates), by = "30 min")
full_data <- data.frame(dates = grid, data = NA_real_)
full_data[full_data$dates %in% starter_df$dates, ] <- starter_df[starter_df$dates %in% full_data$dates, ]
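Alternatively, a sketch of the same regularisation using zoo, which aligns observations by time index rather than by row position (my own addition, assuming the zoo package is acceptable):

library(zoo)
z3h    <- zoo(starter_df$data, order.by = starter_df$dates)
grid30 <- seq(min(starter_df$dates), max(starter_df$dates), by = "30 min")
z30    <- merge(z3h, zoo(, grid30))  # NA wherever no 3-hour observation exists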
I hope it helps.
