I've been working with R for the last week or so, and this website has helped a lot in understanding the basics.
I am doing a minute-wise forecast for my company.
The data look something like this:
REFEE ENTRY_DATE
1.00 01-01-2011 00:00:00
2.00 01-01-2011 00:01:00
3.00 01-01-2011 00:02:00
4.00 01-01-2011 00:03:00
5.00 01-01-2011 00:04:00
6.00 01-01-2011 00:05:00
7.00 01-01-2011 00:06:00
8.00 01-01-2011 00:07:00
9.00 01-01-2011 00:08:00
10.00 01-01-2011 00:09:00
......and so on for four years, until 2014.
That's roughly more than 133921*12 samples. I have tried all the usual forecasting functions, HoltWinters(), forecast() and all the other forecasting methods....
The problem is that the application hangs every time I try these functions; doesn't R support this much data for forecasting?
Is there any other package that can help me forecast such an enormous amount of data?
This actually is quite a lot of data, at least for R. You could look at ets() in the forecast package. I like recommending this free online forecasting textbook from the same authors.
You could of course think about your data. Do you actually expect dynamics that can only be seen on this level, e.g., sub-hourly patterns? Do you actually need your forecasts on a minute-by-minute basis, e.g., for operational decisions? (From what I know, even short-term electricity forecasting is done in 15 minute buckets - and if you are actually into high frequency trading, you'd likely have shorter time periods.)
If yes, you should probably look into specific methods that can actually model multiple types of seasonality. Electricity load forecasting may be a good point to start, since these people do deal with multiple overlaid seasonal patterns.
If no, you could think about aggregating your data, say to days, then forecasting the aggregates and disaggregating them, e.g., using historical proportions of minutes within days. This would at least make forecasting less of a data problem.
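A rough sketch of that aggregation route (dat is a hypothetical data frame holding the REFEE and ENTRY_DATE columns from the question; ets() is the forecast-package function mentioned above):

library(forecast)

# Aggregate the minute-level values to daily totals, then forecast the daily series
dat$day <- as.Date(dat$ENTRY_DATE)            # assumes ENTRY_DATE is already POSIXct/Date
daily   <- aggregate(REFEE ~ day, data = dat, FUN = sum)

daily_ts <- ts(daily$REFEE, frequency = 7)    # weekly pattern in daily data
fit <- ets(daily_ts)
fc  <- forecast(fit, h = 28)                  # four weeks ahead

Minute-level forecasts could then be recovered by splitting each daily forecast across minutes using historical within-day proportions, as described above.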
For large data sets I would recommend using predict() from base R as opposed to forecast(). While forecast() provides more information (predict() only provides the forecast and standard errors), benchmarking the two functions with rbenchmark suggests predict() is much faster.
Additionally, forecast() drops the century from the dates of its forecasted ts object, which is annoying...
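A rough sketch of that comparison, using a built-in series rather than the poster's data:

library(forecast)
library(rbenchmark)

# Fit once, then time the two prediction functions on the same model
fit <- HoltWinters(AirPassengers)

benchmark(
  predict  = predict(fit, n.ahead = 24, prediction.interval = TRUE),
  forecast = forecast(fit, h = 24),
  replications = 100
)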
As Stephan Kosla stated, having such granular data may be an issue. A speed-up could be found by taking a daily/weekly/monthly average of your data before performing the forecast. You can do this using one of the apply functions, lubridate, and a bit of ingenuity. I've shown an example below of how I would do this:
library(lubridate)

# Create a data frame from the AirPassengers dataset (from base R)
df <- data.frame(data = as.vector(AirPassengers),
                 date = as.Date(time(AirPassengers)),
                 year = year(as.Date(time(AirPassengers))))

# Split by year, take the mean of each year, then map the yearly means
# back onto the rows with unsplit()
average.by.year <- unsplit(lapply(split(df$data, df$year), mean),
                           df$year)
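If a single value per year is wanted instead of the per-row vector that unsplit() returns, base R's aggregate() is one option (using the same df as above):

# One mean per year, returned as a two-column data frame
aggregate(data ~ year, data = df, FUN = mean)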
I am working with financial asset pricing data and I am interested in finding the time difference between the dates of two consecutive operations on the market.
A simple example could be:
first operation -> client X buys stock A on 2020-01-02 09:00:00
second operation -> client X buys stock B on 2020-01-03 09:00:00
Here is my problem:
I am looking for a function that computes the FINANCIAL time difference between these two datetime objects.
Hence, I am not interested in a simple calendar time difference (which can be computed in R using the well-known difftime() function), but in a time difference that considers the financial or trading day, that is (roughly speaking) a day that starts at 9:00 and ends at 18:00.
Therefore, this function should give a result of 9 hours (if the unit of reference is hours) for the simple example above, instead of the 24 hours that the usual calendar time difference would suggest.
A more complex version of this function would also take into account the market holidays and exclude weekends from computation.
Here is an example in R:
d1 <- as.POSIXct("2020-01-02 09:00:00", tz = "UTC")
d2 <- as.POSIXct("2020-01-03 09:00:00", tz = "UTC")
difftime(d2, d1, units = "hours")
This produces a time difference of 24 hours.
However, the expected result would be just 9 hours, since the financial (or market trading) day ends on 2020-01-02 at 18:00 and starts again the day after at 9:00, hence there should be only 9 hours of trading between the two.
I mainly work in R, so I would appreciate advice for that language, but if anyone knows of something similar in other languages that would also be very useful.
Thank you very much for your help and time.
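One possible starting point, sketched under the question's simplifying assumptions (a 09:00-18:00 trading window, weekends and market holidays not handled); trading_hours_between() is a made-up helper, not a function from an existing package:

# Sums, day by day, the overlap between the interval [d1, d2] and an
# assumed 09:00-18:00 trading window (no weekend/holiday handling)
trading_hours_between <- function(d1, d2, open = 9, close = 18) {
  days <- as.list(seq(as.Date(d1), as.Date(d2), by = "day"))
  total <- 0
  for (day in days) {
    win_start <- as.POSIXct(paste(day, sprintf("%02d:00:00", open)),  tz = "UTC")
    win_end   <- as.POSIXct(paste(day, sprintf("%02d:00:00", close)), tz = "UTC")
    overlap <- as.numeric(min(d2, win_end) - max(d1, win_start), units = "hours")
    total <- total + max(0, overlap)
  }
  total
}

d1 <- as.POSIXct("2020-01-02 09:00:00", tz = "UTC")
d2 <- as.POSIXct("2020-01-03 09:00:00", tz = "UTC")
trading_hours_between(d1, d2)  # 9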
I want to forecast the number of customers entering a shop during service hours. I have hourly data for
Monday to Friday
8:00 to 18:00
Thus, I assume my time series is in fact regular, but atypical in the sense that it has 10 hours a day and 5 days a week.
I am able to model this as a regular 24/7 time series by setting non-service hours to zero, but I find this inefficient and also incorrect, because those times are not missing. Rather, they do not exist.
Using the old ts-framework I was able to explicitly specify
myTS <- ts(x, frequency = 10)
However, within the new tsibble/fable framework this is not possible. It detects hourly data and expects 24 hours per day, not 10. Every subsequent function reminds me of implicit gaps in time. Manually overriding the interval attribute works:
> attr(ts, "interval") <- new_interval(hour = 10)
> has_gaps(ts)
# A tibble: 1 x 1
.gaps
<lgl>
1 FALSE
But it has no effect on modelling:
model(ts,
snaive = SNAIVE(customers ~ lag("week")))
I still get the same error message:
1 error encountered for snaive
[1] .data contains implicit gaps in time. You should check your data and convert implicit gaps into explicit missing values using tsibble::fill_gaps() if required.
Any help would be appreciated.
This question actually corresponds to this GitHub issue. As far as I know, there are no R packages that allow users to construct a custom schedule, for example to specify particular intraday periods and days. A couple of packages provide specific calendars (like business dates), but none offers a way to set up intraday periods. tsibble will gain a calendar argument for custom calendars to respect structural missings once such a package becomes available, but currently there is no support for that.
As you stated, it's hourly data, so the data interval should be 1 hour, not 10 hours. The ts() frequency, however, refers to the seasonal period: 10 hours per day for modelling.
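As a rough illustration of that last point in the old ts framework (x here stands for the numeric vector of hourly counts, service hours only, as in the question):

# 10 observations per "day" of service hours, or 50 per week (10 hours x 5 days)
# if the weekly pattern used by SNAIVE(customers ~ lag("week")) is of interest
daily_ts  <- ts(x, frequency = 10)
weekly_ts <- ts(x, frequency = 50)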
times booked_res
11:00 23
13:00 26
15:00 27
17:00 25
19:00 28
21:00 30
So I need to use the ts() function in R to convert this data frame into a time series. The column on the right is the number of people who reserved at each time. How should I approach this? I'm not sure about the arguments, and I don't know if the frequency should be set to 24 (hours in a day) or 10 (11:00 to 21:00) as shown above. Any help appreciated.
First, work out the frequency: your data are sampled every two hours, and the frequency argument of ts() is the number of observations per seasonal cycle, so the 11:00 to 21:00 readings give 6 observations per day. The data you're interested in is in the right-hand column, so you can use just that vector rather than the entire data frame. The code to convert it into a time series is simply:
booked_res <- c(23, 26, 27, 25, 28, 30)
ts(booked_res, frequency = 6)
A simple plot of your data might be coded like this:
plot(ts(booked_res, frequency = 6), ylab = 'Number of people reserved', xlab = 'Time (in days) since start of sampling', main = 'Time series chart of people reservations')
UPDATE:
ts() assumes regularly spaced observations, so with a varying sample rate the series cannot be represented directly as a ts object. In addition, many classical time-series models assume the series is (or has been made) stationary.
This page on Analytics Vidhya provides a nice definition of stationary and non-stationary time series, while this page on R-bloggers lists some resources on analyzing a non-stationary time series.
What is the correct way to deal with datetimes in ggplot?
I have data on several different dates and I would like to facet each date by the same time of day, e.g. between 1:30PM and 1:35PM, and plot the points within this time frame. How can I achieve this?
My data looks like:
datetime col1
2015-01-02 00:00:01 20
... ...
2015-01-02 11:59:59 34
2015-02-19 00:00:03 12
... ...
2015-02-19 11:59:58 27
I often find myself wanting to plot time series with ggplot using datetime objects on the x-axis, but I don't know how to work with the times alone when the dates aren't of interest.
The lubridate package will do the trick. There are commands you could use, specifically floor_date or ceiling_date, to transform your datetime column.
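A rough sketch along those lines (assuming a data frame df with a POSIXct datetime column and a value column col1, as in the sample data above):

library(ggplot2)
library(lubridate)

# Split each timestamp into a date (for faceting) and a time of day in
# seconds since midnight (floor_date() gives the midnight anchor)
df$date <- as_date(df$datetime)
df$secs <- as.numeric(df$datetime - floor_date(df$datetime, "day"), units = "secs")

# Keep only points between 13:30 and 13:35, then facet by date
window <- df[df$secs >= 13.5 * 3600 & df$secs <= 13.5 * 3600 + 300, ]

ggplot(window, aes(x = secs / 3600, y = col1)) +
  geom_point() +
  facet_wrap(~ date) +
  labs(x = "Hour of day")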
I always use the chron package for times. It completely disregards dates and stores your time numerically as a fraction of the day (e.g. 1:30PM is stored as 0.5625, since 13.5 of the 24 hours have passed). That allows you to perform math on times, which is useful for a lot of reasons, including calculating an average time, the time between two points, etc.
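A small illustration of how chron's times class behaves:

library(chron)

t1 <- times("13:30:00")
t2 <- times("09:00:00")

as.numeric(t1)            # 0.5625 -- stored as a fraction of the day (13.5 / 24)
as.numeric(t1 - t2) * 24  # 4.5    -- hours between the two clock times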
For specific help with your plot you'll need to share a sample data frame in an easily copy-able format, and show the code you've tried so far.
This is a question I'd asked previously regarding the chron package, and it also gives an idea of how to share your data/ask a question that's easier for folks to reproduce and therefore answer:
Clear labeling of times class data on horizontal barplot/geom_segment
I am trying to resample a dataset that has a temporal resolution of 5 minutes (source). In order to resample it to a 30-minute temporal resolution I've tried:
library(zoo)
library(xts)

# Combine date and time into a single POSIXct column
SRI_2010$Date_Time <- paste(SRI_2010$Date, SRI_2010$Time, sep = " ")
SRI_2010$Date_Time <- as.character(SRI_2010$Date_Time)
SRI_2010$Date_Time <- as.POSIXct(SRI_2010$Date_Time, format = "%d/%m/%Y %H:%M")

# Creating the zoo object
SRI_2010.zoo <- zoo(SRI_2010, as.POSIXct(SRI_2010$Date_Time))

# Criteria for the resampling
ends2010 <- endpoints(SRI_2010.zoo, 'minutes', 30)
SRI_30m_2010 <- period.apply(SRI_2010.zoo$SRI..W.m2., ends2010, mean)
At the very beginning I was quite satisfied because the code worked, but after a double-check I realised it calculates the mean values at minutes 25 and 55, instead of at minutes 00 and 30, which are the ones I'm interested in.
Example:
> SRI_30m_2010
2010-07-28 04:55:00 2010-07-28 05:25:00
3.80000000 12.06666667
2010-07-28 05:55:00 2010-07-28 06:25:00
19.73333333 28.46666667
2010-07-28 06:55:00 2010-07-28 07:25:00
40.30000000 61.60000000
This small issue is super annoying when I aim to combine different datasets with different temporal resolutions into a common one. Does anyone know how I could sort this issue out?
The "issue" is that endpoints is doing what it was designed to do. It's returning the last timestamp of each period. I recommend you use align.time to move the index timestamp forward to the minutes you're interested in.
s <- align.time(as.xts(SRI_30m_2010), 60*30)
It's also not much of an issue if you're trying to combine multiple series with different resolutions into a single xts object. You can just merge them all, use na.locf() or similar to fill in missing values, then extract the resolution you're interested in. I believe the xts FAQ shows how to do this, and I know I've demonstrated it more than a couple of times in my other answers on Stack Overflow.
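A rough sketch of that merge-and-fill approach (x5 and x30 are hypothetical xts objects holding the 5-minute and 30-minute series):

library(xts)

# Merge the two series on their common time index, carry values forward
# with na.locf(), then keep only the 30-minute timestamps
merged   <- merge(x5, x30)
filled   <- na.locf(merged)
half_hrs <- filled[endpoints(filled, "minutes", 30)]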