On the issue of automatic time series fitting using R

We have to fit about 2000-odd time series every month.
They behave very idiosyncratically: some are ARMA/ARIMA, some are EWMA, some are ARCH/GARCH, with or without seasonality and/or trend (the only thing they have in common is the time series aspect).
One can in theory build an ensemble model and use the AIC or BIC to choose the best-fitting model, but is the community aware of any library that attempts to solve this problem?
Google made me aware of the one below by Rob J Hyndman:
link
But are there any other alternatives?

There are two automatic methods in the forecast package: auto.arima() which will handle automatic modelling using ARIMA models, and ets() which will automatically select the best model from the exponential smoothing family (including trend and seasonality where appropriate). The AIC is used in both cases for model selection. Neither handles ARCH/GARCH models though. The package is described in some detail in this JSS article: http://www.jstatsoft.org/v27/i03
Further to your question:
"When will it be possible to use forecast package functions, especially the ets function, with high dimensional data (weekly data, for example)?"
Probably early next year. The paper is written (see robjhyndman.com/working-papers/complex-seasonality) and we are working on the code now.

Thanks useRs. I have tried the forecast package, including as a composite of arima and ets, but the results were not impressive in terms of AIC or BIC (SBC). So I am now tempted to give each time series its own SVM (support vector machine), because of its better generalization and adaptability, and because it lets me add variables other than lags as well as non-linear kernel functions.
Any premonitions?

Related

Ensemble machine learning model with NNETAR and BRNN

I used the forecast package to forecast the daily time series of a variable Y using its lag values and a time series of an external parameter X. I found the nnetar model (a NARX model) was the best in terms of overall performance. However, I was not able to predict the peaks of the time series well, despite various attempts at parameter tuning.
I then extracted the peak values of Y (those above a threshold; of course this is no longer a regular time series) and the corresponding X values, and tried to fit a regression model (note: not an autoregression model) using various models in the caret package. I found that the prediction of peak values from the brnn (Bayesian regularized neural network) model, using only the X values, is better than that of nnetar, which uses both lag values and X values.
Now my question is: how do I go from here to create ensembles of these two models? That is, whenever the prediction from the brnn regression model (or any other regression model) is better, I want it to replace the nnetar prediction and move forward. I am mostly concerned about the peaks. Is this a commonly used approach?
Instead of trying to pick the one model that would be superior at any given time, it's typically better to average the models, in order to include as many individual views as possible.
In the experiments I've been involved in, where we tried to pick the one model that would outperform based on historical performance, a simple average typically turned out to be as good or better. This is in line with the typical results on this problem: https://otexts.com/fpp2/combinations.html
So, before you go more advanced by trying to pick a specific model based on previous performance, or by using a weighted average, consider doing a simple average of the two models.
If you want to continue with a sort of selection/weighted averaging, try to have a look at the FFORMA package in R: https://github.com/pmontman/fforma
I've not tried the specific package (yet), but have seen promising results in my test using the original m4metalearning package.
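As a rough illustration of the simple average suggested above, here is a minimal sketch. It assumes both models can produce point forecasts over the same horizon; auto.arima() stands in for the second model (the brnn regression from the question is not reproduced), and AirPassengers is only placeholder data.

```r
library(forecast)

set.seed(123)          # nnetar() uses random starting weights
y <- AirPassengers     # placeholder series, not the data from the question
h <- 12

# Two different model families fitted to the same history.
fc_nnet  <- forecast(nnetar(y), h = h)
fc_arima <- forecast(auto.arima(y), h = h)

# Equal-weight combination of the two sets of point forecasts.
fc_avg <- (fc_nnet$mean + fc_arima$mean) / 2
print(fc_avg)
```

Each averaged value lies between the two individual forecasts, so the combination cannot be worse than the worse of the two at any single step.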

R - Forecast multiple time-series (15K Products)

Hi Stack Overflow community.
I have 5 years of weekly price data for more than 15K products (5 x 52 x 15K records). Each product is a univariate time series. The objective is to forecast the price of each product.
I am familiar with univariate time series analysis, in which we can visualize each series, plot its ACF and PACF, and forecast it. But that manual workflow is not feasible here: with 15K different time series I cannot visualize each one, inspect its ACF and PACF, forecast each product separately, and make a tweak or decision on it.
I am looking for some recommendations and directions to solve this multi-series forecasting problem using R (preferable). Any help and support will be appreciated.
Thanks in advance.
I would suggest you use auto.arima from the forecast package.
This way you don't have to search for the right ARIMA model.
auto.arima: Returns best ARIMA model according to either AIC, AICc or BIC value. The function conducts a search over possible models within the order constraints provided.
library(forecast)
fit <- auto.arima(WWWusage)     # search for the best ARIMA orders
plot(forecast(fit, h = 20))     # forecast 20 steps ahead and plot
Instead of WWWusage you could put one of your time series, to fit an ARIMA model.
With forecast you then perform the forecast - in this case 20 time steps ahead (h=20).
auto.arima basically chooses the ARIMA parameters for you (according to an information criterion, AICc by default).
You would have to try it to see whether it is too computationally expensive for you. But in general it is not that uncommon to forecast that many time series.
Another thing to keep in mind is that there may well be some cross-correlation between the time series. So from a forecasting accuracy standpoint it could make sense not to treat this as a univariate forecasting problem.
The setting sounds quite similar to the M5 forecasting competition that was recently held on Kaggle. The goal was to produce point forecasts of the unit sales of various products sold in the USA by Walmart.
So there were a lot of sales time series to forecast. In this case the winner did not do a univariate forecast. Here is a link to a description of the winning solution. Since the setting seems so similar to yours, it probably makes sense to read a little in the Kaggle forum for this challenge; there might even be useful notebooks (code examples) available.
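To make the auto.arima suggestion concrete for many series, here is a minimal sketch of a batch loop. The data are simulated stand-ins, and the names `series_list` and `forecast_one` are hypothetical; seasonal = FALSE is an assumption made only to keep the example fast.

```r
library(forecast)

set.seed(1)
# Hypothetical stand-in for the real data: 5 simulated weekly price series.
n_weeks <- 104
series_list <- lapply(1:5, function(i) {
  ts(50 + cumsum(rnorm(n_weeks)), frequency = 52)
})
names(series_list) <- paste0("product_", 1:5)

# Fit one model per series; a failure on one series should not stop the batch.
forecast_one <- function(y, h = 8) {
  tryCatch(
    as.numeric(forecast(auto.arima(y, seasonal = FALSE), h = h)$mean),
    error = function(e) rep(NA_real_, h)
  )
}

# Result: one row per product, one column per step ahead.
fc_matrix <- t(sapply(series_list, forecast_one))
print(dim(fc_matrix))
```

Because each series is fitted independently, the loop could presumably be spread over several cores (for example with parallel::mclapply) if the run time for 15K series turns out to be a problem.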

Rolling forecast origin cross-validation in R?

Probably a dumb post but here goes:
As someone who has done some econometrics and ML such as random forests and XGBoost, I always make sure to use k-fold cross-validation and/or a train/test split (using caret). But I have a question about implementing rolling forecast origin CV for forecasting models fitted with the ets() function (and arima).
My textbook used the tsCV function when showing some basic forecasts such as seasonal naive, but when it moved on to ets models using ets(), it just called the function with model parameters such as "AAdM" and then used forecast() to make a forecast for 8 periods. I did not see any splitting of the data.
Does the ets() function do this automatically, or has the example probably been constructed to just show the elements of the ets model?
If it's the latter, then I should split the data using rsample. But then the question is, how do I for instance implement rolling CV when forecasting say the next 12 periods? A simple train/test split is easy, but I am not familiar with rolling forecasting origins and the book only briefly mentioned anything about cross validation.
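For reference, ets() fits to whatever series it is given and does no splitting on its own. One way the rolling forecast origin evaluation asked about here might look is sketched below, using the tsCV function mentioned above with a small wrapper around ets(). The data, the 12-step horizon, the "MAM" model and the `initial` value are illustrative assumptions, and `ets_forecast` is a hypothetical helper name.

```r
library(forecast)

# Stand-in series: six years of a built-in monthly dataset.
y <- window(AirPassengers, end = c(1954, 12))

# Forecast function in the form tsCV() expects: a series and a horizon in,
# a forecast object out. The "MAM" specification is only an example.
ets_forecast <- function(x, h) {
  forecast(ets(x, model = "MAM"), h = h)
}

# Refit on each expanding window and record errors for 1 to 12 steps ahead.
# `initial` skips the first 48 observations as forecast origins.
cv_errors <- tsCV(y, ets_forecast, h = 12, initial = 48)

# Root mean squared error for each forecast horizon.
cv_rmse <- sqrt(colMeans(cv_errors^2, na.rm = TRUE))
print(round(cv_rmse, 2))
```

The model is refitted at every origin, so this can be slow on long series; a larger `initial` trades fewer evaluation points for a shorter run time.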

Interpreting ACF and PACF plots for SARIMA model

I'm new to time series and used the monthly ozone concentration data from Rob Hyndman's website to do some forecasting.
After doing a log transformation and differencing at lags 1 and 12 to get rid of the trend and seasonality respectively, I plotted the ACF and PACF (shown in the linked image). Am I on the right track, and how would I interpret this as a SARIMA?
There seems to be a pattern every 11 lags in the PACF plot, which makes me think I should do more differencing (at 11 lags), but doing so gives me a worse plot.
I'd really appreciate any of your help!
EDIT:
I got rid of the differencing at lag 1 and just used lag 12 instead, and this is what I got for the ACF and PACF.
From there, I deduced that: SARIMA(1,0,1)x(1,1,1) (AIC: 520.098)
or SARIMA(1,0,1)x(2,1,1) (AIC: 521.250)
would be a good fit, but auto.arima gave me (3,1,1)x(2,0,0) (AIC: 560.7) normally and (1,1,1)x(2,0,0) (AIC: 558.09) without stepwise and approximation.
I am confused about which model to use, but based on the lowest AIC, SARIMA(1,0,1)x(1,1,1) would be the best? Also, what concerns me is that none of the models pass the Ljung-Box test. Is there any way I can fix this?
It is quite difficult to manually select a model order that will perform well at forecasting a dataset. This is why Rob has built the 'auto.arima' function in his R forecast package, to figure out the model that may perform best based on certain metrics.
When you see a PACF plot with significantly negative lags, that usually means you have over-differenced your data. Try removing the first-order difference and keeping the lag-12 seasonal difference. Then carry on making your best guess.
I'd recommend trying his auto.arima function and passing it a time series object with frequency = 12. He has a good writeup of seasonal ARIMA models here:
https://www.otexts.org/fpp/8/9
If you would like more insight into manually selecting a SARIMA model order, this is a good read:
https://onlinecourses.science.psu.edu/stat510/node/67
In response to your Edit:
I think it would be beneficial to this post if you clarify your objective. Which of the following are you trying to achieve?
1. Find a model whose residuals satisfy the Ljung-Box test
2. Produce the most accurate out-of-sample forecast
3. Manually select lag orders such that the ACF and PACF plots show no significant lags remaining
In my opinion, #2 is the most sought after objective so I'll assume that is your goal. From my experience, #3 produces poor results out of sample. In regards to #1, I am usually not concerned about correlations remaining in the residuals. We know we do not have the true model for this time-series, so I do not feel there's any reason to expect an approximate model that performs well out of sample to not have left something behind in the residuals that is more complex perhaps, or nonlinear etc.
To provide you another SARIMA result, I ran this data through some code I've developed and found the following equation produced the minimal error on a cross-validation period.
Final model is:
SARIMA(0,1,1)(1,1,1)[12] with a constant, fitted to the natural log of the time series.
The errors in the cross validation period are:
MAPE = 16%
MAE = 0.46
RSQR = 74%
Here is the Partial Autocorrelation plot of the residuals for your information.
This is roughly similar in methodology to selecting an equation based on AICc to my understanding, but is ultimately a different approach. Regardless, if your objective is out of sample accuracy, I'd recommend evaluating equations in terms of their out of sample accuracy versus in-sample fit, tests, or plots.
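The out-of-sample comparison recommended above could be sketched roughly as follows. This is not the answerer's own code, which is not shown in the thread. AirPassengers stands in for the ozone data, the last two years are held out as an assumption, and the constant mentioned for the final model is omitted.

```r
library(forecast)

# Stand-in monthly series; the ozone data from the question is not used here.
y <- AirPassengers
train <- window(y, end = c(1958, 12))
test  <- window(y, start = c(1959, 1))

# Candidate orders taken from the thread.
candidates <- list(
  "(1,0,1)(1,1,1)" = list(order = c(1, 0, 1), seasonal = c(1, 1, 1)),
  "(0,1,1)(1,1,1)" = list(order = c(0, 1, 1), seasonal = c(1, 1, 1))
)

# Fit each candidate on the training window and score it on the holdout.
# lambda = 0 fits on the log scale and back-transforms the forecasts.
holdout_rmse <- sapply(candidates, function(spec) {
  fit <- Arima(train, order = spec$order, seasonal = spec$seasonal,
               lambda = 0, method = "ML")
  fc <- forecast(fit, h = length(test))
  accuracy(fc, test)["Test set", "RMSE"]
})
print(holdout_rmse)
```

Unlike AIC, the holdout error can be compared directly between models that use different orders of differencing.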

fourier() vs fourierf() function in R

I'm using the fourier() and fourierf() functions in Rob Hyndman's excellent forecast package in R. Looking to verify whether the same terms are selected and used in fourier() and fourierf(), I plotted a few of the output terms.
Below is the original data using ts.plot(data). There's a frequency of 364 in the time series, FYI.
Below is the plot of the terms using fourier(data,3). Basically, it looks like mirror images of the existing data.
Looking at just the sin1 term of the output, again, we get some variation that shows similar 364-day seasonality in line with the data above.
However, when I plot the results of the Fourier forecast using fourierf(data,3, 410) I see the below data. It appears far more smooth than the terms provided by the original fourier function.
So, I wonder how the results of fourier() and fourierf() are related. Is it possible to just see one consolidated Fourier result, so that you can see the sin or cosine result moving through existing data and then through the forecasting period? If not, how can I confirm that the terms created by fourierf() fit the in-sample data?
I want to use it in an auto.arima or glm function with other external regressors like this:
trainFourier<-fourier(data,3)
trainFourier<-as.data.frame(trainFourier)
trainFourier$exogenous<-exogenousData
arima.object<-auto.arima(data, xreg=trainFourier)
futureFourier<-fourierf(data,3, 410)
fourierForecast<-forecast(arima.object, xreg=futureFourier, h=410)
and want to be completely sure that auto.arima is fitted (using the terms from fourier()) consistently with what I'll put in under xreg for forecast (which has terms from a different function, i.e. fourierf()).
Figured out the problem. I was using both the fda and forecast packages. fda, which is for functional data analysis and regression, has its own fourier() function. If I detach fda, my S1 term from fourier(data,3) looks like this:
which lines up nicely with the Fourier forecast if I use ts.plot(c(trainFourier$S1,futureFourier$S1))
Moral of the story: watch what your packages mask, folks!
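A minimal sketch of one way to guard against this kind of masking is to call the function with an explicit namespace. It assumes a recent version of the forecast package, where fourierf() is deprecated and future terms come from the h argument of fourier(); the series is simulated and the exogenous regressor from the question is left out.

```r
library(forecast)

set.seed(42)
# Simulated series with a 364-observation season, as in the question.
n <- 3 * 364
h <- 410
y <- ts(10 + 3 * sin(2 * pi * seq_len(n) / 364) + rnorm(n), frequency = 364)

# Namespace-qualified calls cannot be masked by another package's fourier().
train_terms  <- forecast::fourier(y, K = 3)
future_terms <- forecast::fourier(y, K = 3, h = h)

# The future terms continue the in-sample ones, with matching column names.
stopifnot(identical(colnames(train_terms), colnames(future_terms)))
s1 <- c(train_terms[, 1], future_terms[, 1])
ts.plot(s1)

# Fit on the in-sample terms, forecast with the future terms.
fit <- auto.arima(y, xreg = train_terms, seasonal = FALSE)
fc  <- forecast(fit, xreg = future_terms)
```

Plotting `s1` should show one smooth sine wave running through the training period and on into the forecast period, which is the consolidated view the question asked for.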
