How do I deal with monthly time series data of 3900+ regions at once - r

I am new to time series analysis and forecasting and have just started working on a time series model.
I know how to handle monthly data for a single series.
But I now have a much larger dataset to work with.
It has monthly time series data for 3900+ regions.
I want to predict the values for the next 12 months using R.
My data looks something like this: https://drive.google.com/file/d/10QvtS55NQ1kIXxeccWxXl0SqqyqYXyoh/view?usp=sharing
I know how to do this for one region using an ARIMA model, but I don't know how to handle this many series at once.
Thanks in advance!

As you are new to the topic, I recommend looking at global models such as xgboost or glmnet instead, where a single model is trained on all series at once.
Local time series approaches that fit one model per series (the "forecast" package, ARIMA, ETS, Prophet and so on) are hard to scale to this many series.
When local models are complex enough to produce accurate forecasts, they take a long time to compute. For example, a forecast with fully tuned local models took about 5 hours for 100 time series (5 years of training data, one year of test data). With global models it was a matter of just 3 minutes.
As I use it myself, I can recommend the modeltime framework, which builds on the tidymodels stack.
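To make the global-model idea concrete, here is a minimal sketch in base R. It is not the modeltime workflow: a plain lm() stands in for xgboost or glmnet, the data are simulated, and the column names (region, month, value) and the helper lag_by_group() are hypothetical. The only point is that one model is fit on lagged features pooled across all regions.

```r
set.seed(42)
n_regions <- 20   # toy size; the question has 3900+ regions
n_months  <- 48

# Simulated long-format data: one row per region and month (hypothetical columns),
# sorted by month within each region
panel <- data.frame(
  region = rep(seq_len(n_regions), each = n_months),
  month  = rep(seq_len(n_months), times = n_regions)
)
panel$value <- 100 + 10 * sin(2 * pi * panel$month / 12) +
  rnorm(nrow(panel), sd = 2)

# Lag a vector by k steps within each group (hypothetical helper)
lag_by_group <- function(x, g, k) {
  ave(x, g, FUN = function(v) c(rep(NA, k), head(v, -k)))
}
panel$lag1  <- lag_by_group(panel$value, panel$region, 1)
panel$lag12 <- lag_by_group(panel$value, panel$region, 12)

# One global model for all regions; lm() drops rows with NA lags by default
fit <- lm(value ~ lag1 + lag12, data = panel)

# One-step-ahead forecast for every region from its most recent observations
next_step <- do.call(rbind, lapply(split(panel, panel$region), function(d) {
  n <- nrow(d)
  data.frame(region = d$region[1], lag1 = d$value[n], lag12 = d$value[n - 11])
}))
next_step$forecast <- predict(fit, newdata = next_step)
head(next_step)
```

For a 12-month horizon, one option is to feed each forecast back in as the next lag1 (recursive forecasting); another is to build the features only from lags of 12 months or more.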

Related

R - Forecast multiple time-series (15K Products)

Hi Stack Overflow community.
I have 5 years of weekly price data for more than 15K products (5 * 15K * 52 records). Each product is a univariate time series. The objective is to forecast the price of each product.
I am familiar with univariate time series analysis, in which we can visualize each series, plot its ACF and PACF, and forecast it. But that approach is not feasible when I have 15K different time series: I cannot visualize each series and its ACF and PACF, forecast each product separately, and make a tweak or decision on each one.
I am looking for some recommendations and directions to solve this multi-series forecasting problem using R (preferable). Any help and support will be appreciated.
Thanks in advance.
I would suggest you use auto.arima from the forecast package.
This way you don't have to search for the right ARIMA model.
auto.arima: Returns best ARIMA model according to either AIC, AICc or BIC value. The function conducts a search over possible models within the order constraints provided.
library(forecast)  # provides auto.arima() and forecast()
fit <- auto.arima(WWWusage)
plot(forecast(fit, h = 20))
Instead of WWWusage you can pass one of your own time series to fit an ARIMA model.
With forecast() you then produce the forecast, in this case 20 time steps ahead (h=20).
auto.arima basically chooses the ARIMA parameters for you according to an information criterion such as AIC, the Akaike information criterion (the default is AICc, its small-sample corrected version).
You would have to try it to see whether it is too computationally expensive for you. But in general it is not that uncommon to forecast that many time series.
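As a rough illustration of what an automatic search does and how it extends to many series, the sketch below uses only base R. It is not auto.arima's actual algorithm: it tries a small fixed grid of ARIMA(p, 1, q) orders with stats::arima(), keeps the lowest-AIC fit, and repeats this for each column. The helper name forecast_one(), the simulated data and the fixed d = 1 are assumptions for illustration.

```r
set.seed(1)
# Simulated data: 5 random-walk series of 60 monthly observations, one per column
series <- replicate(5, cumsum(rnorm(60)))

# Fit a small grid of ARIMA(p, 1, q) models and forecast with the lowest-AIC one
# (hypothetical helper; d is fixed at 1 so the AIC values are comparable)
forecast_one <- function(y, h = 12) {
  best <- NULL
  for (p in 0:2) {
    for (q in 0:2) {
      # Some orders may fail to fit; skip those instead of stopping
      fit <- tryCatch(arima(y, order = c(p, 1, q)), error = function(e) NULL)
      if (!is.null(fit) && is.finite(fit$aic) &&
          (is.null(best) || fit$aic < best$aic)) {
        best <- fit
      }
    }
  }
  as.numeric(predict(best, n.ahead = h)$pred)
}

# One column of forecasts per series: a 12 x 5 matrix
fc <- sapply(seq_len(ncol(series)), function(j) forecast_one(series[, j]))
dim(fc)
```

With the forecast package installed, the grid search can be replaced by auto.arima(y) and the prediction by forecast(fit, h = h)$mean while keeping the same loop structure. Because the series are fit independently, the loop can also be run in parallel.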
Another thing to keep in mind is that there may well be some cross-correlation between the time series. So from a forecast accuracy standpoint, it could make sense not to treat this as a set of independent univariate forecasting problems.
The setting sounds quite similar to the M5 forecasting competition that was recently held on Kaggle. The goal was to produce point forecasts of the unit sales of various products sold in the USA by Walmart.
So there were a lot of sales time series to forecast. In that case the winner did not do a univariate forecast. Here is a link to a description of the winning solution. Since the setting seems so similar to yours, it probably makes sense to read a little in the Kaggle forum for this challenge; there might even be useful notebooks (code examples) available.

How to deal with time series data with many 0's?

I have time series data ranging from 0 to 30 million. It's basically weekly web traffic data. I am working on building a forecasting model with this data, and I want to understand how to deal with this range of values. I tried a Box-Cox transformation with a Prophet model. I am not sure which metrics I could use to evaluate the performance of the model. The data has a lot of 0's, and I can't remove them from the dataset. Is there a better way to deal with the 0's than the Box-Cox transformation? I had issues with the inverse transformation, but I added a small value (0.1) to the data to avoid negative values.
If your series have a lot of periodic zeros, Croston's method is one option. It is basically a forecasting strategy for products with intermittent demand. You can also try exponential smoothing and traditional ARIMA or SARIMA models and clip the negative values in the forecast (depending on your use case).
You can find Croston's method in the forecast package (the croston() function).
Also refer to these links:
https://stats.stackexchange.com/questions/8779/analysis-of-time-series-with-many-zero-values/8782
https://stats.stackexchange.com/questions/373689/forecasting-intermittent-demand-with-zeroes-in-times-series
https://robjhyndman.com/papers/foresight.pdf
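To show what Croston's method does, here is a minimal base R sketch of the classical version: the non-zero values and the gaps between them are each smoothed with simple exponential smoothing, and the forecast is their ratio. The function name croston_sketch(), the default alpha = 0.1 and the initialization with the first observed size and gap are assumptions for illustration; for real work the croston() function in the forecast package is the one to use.

```r
# Classical Croston forecast of the average value per period (hypothetical helper)
croston_sketch <- function(y, alpha = 0.1) {
  nz <- which(y > 0)
  if (length(nz) == 0) return(0)   # no non-zero value observed at all
  sizes     <- y[nz]               # the non-zero values
  intervals <- diff(c(0, nz))      # periods between consecutive non-zero values
  z <- sizes[1]                    # smoothed size
  p <- intervals[1]                # smoothed interval
  for (i in seq_along(sizes)[-1]) {
    z <- alpha * sizes[i] + (1 - alpha) * z
    p <- alpha * intervals[i] + (1 - alpha) * p
  }
  z / p
}

# Toy weekly series: a value of 5 every third week, zeros otherwise
y <- c(0, 0, 5, 0, 0, 5, 0, 0, 5)
croston_sketch(y)
```

In this toy series the smoothed size stays at 5 and the smoothed interval at 3, so the forecast is 5/3 per period. For non-negative input the forecast is non-negative by construction, so no clipping is needed.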

Forecasting Hospital Bed Demand Using Daily Observations

Basically, my task for the next 3 months is to forecast bed demand and a couple of other variables in a hospital's emergency department. The data is 5 years worth of daily observations of these variables. The data is complete with no missing values.
The goal is to improve the prediction accuracy of the current tool, which is an Excel workbook.
I have not taken any time series or optimization courses in college thus far, so imagine my horror when I realised I had no clue how to approach this project and that I would be working entirely alone. I was told that no one in the department has any experience with this and that no one would be able to help me.
I'm using RStudio, but I'm not very proficient since it was self-taught.
From trying out the questions asked on here as well as YouTube tutorials to learn the appropriate syntax and functions, what I have managed to find out is:
1) My data is a time series and I should apply forecasting models to predict future values based on the historical data I have.
2) Daily observations over a long period have weekly and annual seasonality, so I should define the data as a multi-seasonal time series.
I first tried defining my data with ts(), then msts(). One of the answers here mentioned that zoo() would be more appropriate for daily observations, so I tried that too. The forecasting models I've tried are snaive, ets, auto.arima and TBATS.
I would like to present plots of the values/forecasts by day of the week rather than for all 365 days of the year, which is the only output I have been able to plot. I tried using frequency = 365 and 7, with start = c(2014, 1) and end = c(2018, 365), but I haven't had any luck.
I would really appreciate any advice and help I could get from anyone. Thank you!
Without looking at your data: have you tried getting started with some basic ARIMA modeling and seeing what results you get from that? It’s a fairly friendly way to get started with time series forecasting, depending on your data. I was forecasting by the hour, but the frequency can be adjusted to whatever you need to forecast at. As you mentioned, you are looking to change the frequency. Sometimes it’s easier to see a pattern at larger time intervals, so you can aggregate your data to larger intervals.
For example, this converts daily observations to monthly.
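library(xts)
# 'beds' is assumed to be a data frame with one row per day in this date range
dates <- seq(as.Date('2012-01-01'), as.Date('2019-03-31'), by = 'days')
beds$date.formatted <- dates
# Build a daily xts series indexed by date
beds.xts <- xts(x = beds$neds.count, order.by = as.POSIXct(paste(beds$date.formatted)))
# Sum the daily values within each calendar month
end.month <- endpoints(beds.xts, 'months')
beds.month <- period.apply(beds.xts, end.month, sum)
beds.monthly.df <- data.frame(date = index(beds.month), coredata(beds.month))
colnames(beds.monthly.df) <- c('Date', 'Sessions')
# Convert the monthly totals to a ts object with yearly seasonality
beds.monthly <- ts(beds.monthly.df$Sessions, start = c(2012, 1), end = c(2019, 3), frequency = 12)
plot(beds.monthly)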
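The same grouping idea can be applied to the day-of-week view mentioned in the question. A minimal base R sketch, using simulated daily counts (the data, the Monday bump and the variable names are made up for illustration):

```r
set.seed(7)
dates <- seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "day")
wday  <- as.POSIXlt(dates)$wday   # 0 = Sunday, 1 = Monday, ..., 6 = Saturday

# Simulated daily counts with higher demand on Mondays
count <- rpois(length(dates), lambda = 50 + 10 * (wday == 1))

# Average count per day of the week
by_wday <- tapply(count, wday, mean)
names(by_wday) <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
barplot(by_wday, ylab = "Mean daily count")
```

Forecasts can be summarised the same way once each forecast value is paired with its date.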
I’m not sure whether that answers your question, but since you mentioned you are self-taught and starting out, I can share a script to help you get started with an example. It goes through the whole process: checking that you have read your data in as a time series, what time series data is, how to check for non-stationarity and seasonal trends, plots that are useful for this, modeling, prediction, plotting actual vs. predicted values, accuracy, and further issues with the data that could be hindering your model. The video tutorial series is scripted in Python, but you can follow the end-to-end process of forecasting with ARIMA using the equivalent R script for this tutorial: https://code.datasciencedojo.com/rebeccam/tutorials/blob/master/Time%20Series/r_time_series_example.R
https://tutorials.datasciencedojo.com/time-series-python-reading-data/

How to construct dataframe for time series data using ensemble learning methods

I am trying to predict the Bitcoin price at t+5, i.e. 5 minutes ahead, using 11 technical indicators up to time t which can all be calculated from the open, high, low, close and volume values from the Bitcoin time series (see my full data set here). As far as I know, it is not necessary to manipulate the data frame when using algorithms like regression trees, support vector machines or artificial neural networks, but when using ensemble methods like random forests (RF) and Boosting, I heard that it is necessary to re-arrange the data frame in some way, because ensemble methods draw repeated RANDOM samples from the training data, in which case the sequence of the Bitcoin time series will be ruined. So, is there a way to re-arrange the data frame in some way such that the time series will still be in chronological order every time repeated samples are drawn from the training data?
I was provided with an explanation of how to construct the data frame here and possibly here, too, but unfortunately I didn't really understand these explanations, because I didn't see a visual example of the to-be-constructed data frame and wasn't able to identify the relevant line of code. So, if someone could show me how to re-arrange the data frame using an example data frame, I would be very thankful. As an example data frame, you might consider using the built-in airquality data frame in R (I think it contains time series data), the data I provided above, or any other data frame you think is best.
Many thanks!
There is no problem with resampling for ML algorithms. To capture (auto)correlation, just add columns with lagged values of the time series. E.g., in the case of a univariate time series x[t], where t is time in minutes, you add columns x[t - 1], x[t - 2], ..., x[t - n] with lagged values. Each row then carries its own recent history, so the order in which rows are sampled no longer matters. The more lags you add, the more history is taken into account during model training.
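Since the question asked for a visual example of such a data frame, here is a minimal sketch using base R's embed(). The toy series 1:20, the horizon h = 5, the choice of three lags and the column names are assumptions for illustration; with real data, x would be, for example, the close price or one of the indicators.

```r
x <- 1:20        # toy series so the lag structure is easy to read
h <- 5           # forecast horizon: predict x[t + 5]
n_lags <- 3      # use x[t], x[t - 1], x[t - 2] as predictors

# Each row of embed(x, k) holds x[s], x[s - 1], ..., x[s - k + 1]
m <- embed(x, h + n_lags)
lagged <- data.frame(
  target = m[, 1],       # x[t + 5]
  lag0   = m[, h + 1],   # x[t]
  lag1   = m[, h + 2],   # x[t - 1]
  lag2   = m[, h + 3]    # x[t - 2]
)
head(lagged, 3)
```

The first row pairs the target 8 with the predictors 3, 2 and 1. Every row is self-contained in this way, which is why a random forest or boosting model can draw random samples of rows without losing the time structure.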
You can find a very basic working example here: Prediction using neural networks
More advanced stuff with Keras is here: Time series prediction using RNN
However, just for your information, here is a special message from Mr Chollet and Mr Allaire in the above-mentioned article:
NOTE: Markets and machine learning
Some readers are bound to want to take the techniques we’ve introduced
here and try them on the problem of forecasting the future price of
securities on the stock market (or currency exchange rates, and so
on). Markets have very different statistical characteristics than
natural phenomena such as weather patterns. Trying to use machine
learning to beat markets, when you only have access to publicly
available data, is a difficult endeavor, and you’re likely to waste
your time and resources with nothing to show for it.
Always remember that when it comes to markets, past performance is not
a good predictor of future returns – looking in the rear-view mirror
is a bad way to drive. Machine learning, on the other hand, is
applicable to datasets where the past is a good predictor of the
future.

"auto.arima" in SAS?

I used to run ARIMA models in R using "auto.arima" to identify the best ARIMA model for the data. Even without it, it's easy in R to write a function that performs a similar task. However, I have googled for the past few days and can't find a similar procedure in SAS. Does anyone know if there is an "auto.arima" in SAS? Or do I have to write one myself? Thank you!
Edit:
After days of searching online, the closest thing I can find is Automatic Model Selection in the Time Series Forecasting package. However, that function is GUI-based, and one still has to manually select all the different models to test. Does anyone know of a command-line procedure or package to do this? Thank you.
SAS has proc arima, which is part of the SAS/ETS module (licensed separately). You can use either the Enterprise Guide proc arima node for a GUI interface to it, or Solutions->Analysis->Time Series Analysis for a base SAS interface. The base SAS interface is what I usually use; it has the advantage of comparing many models other than just ARIMA for fit.
To check to see if you have the correct license run the following code:
proc setinit;
run;
You should see something like this in the results if you have it licensed:
---SAS/ETS (01 JAN 2020)
SAS HPF (High-Performance Forecasting) is, in my experience, the best on the market for time series forecasting; nothing can beat its accuracy when you are trying to generate forecasts for multiple products.
Run PROC HPFDIAGNOSE followed by PROC HPFENGINE; you will hate auto.arima after using this.
You might want to give PROC FORECAST a try.
I'm working on a similar problem where I have about 6,000 separate time series to forecast so modeling each one individually is out of the question. You can specify a BY variable in PROC FORECAST that lets you forecast many series at once pretty quickly (it ran my moderately large dataset in less than 3 seconds). And if you choose the STEPAR method, it will fit the best autoregressive model it can find to your data.
Here's a good overview of the FORECAST procedure: http://www.okstate.edu/sas/v8/saspdf/ets/chap12.pdf
Still not as awesome as auto.arima in R, but gets the job done.
Good luck!
SAS has high-performance forecasting procedures (PROC HPFDIAGNOSE + PROC HPFENGINE), which not only select the best ARIMA model but can also select the best among ARIMA, ESM, UCM, IDM, combination models, external models, etc. You can either let them automatically pick the best model based on the default selection criterion, or customize the selection criterion. There is a family of procedures to customize everything: PROC HPFDIAGNOSE, PROC HPFENGINE, PROC ARIMASPEC, etc. If you want to do more flexible time series analysis plus coding, you can also use PROC TIMEDATA with all the built-in time series packages, which lets you program whatever you want and also do all the automatic modeling.
As mentioned above, it is the best on the market for time series forecasting, and nothing can beat its accuracy when you are trying to generate forecasts for multiple series. However, it is usually licensed with SAS Forecast Server or SAS Forecast Studio, which are enterprise forecasting solutions with a GUI. That's understandable, since other forecasting solutions built on R and Python that can handle automatic parallelization and automatic forecasting also charge money.
For the cloud computing version, there is also PROC TSMODEL and the Visual Forecasting version, which has both forecast accuracy and computation performance advantages. However, it is also for enterprise use and pricey. After all, it is targeted at markets that require forecasting for thousands or millions of time series.
For free versions, maybe the closest one would be PROC FORECAST.

Resources