Does Driverless AI from H2O.ai support multivariate time series analysis? - driverless-ai

Does Driverless AI support multivariate time series analysis?
I'm trying to do a time series anomaly-forecasting task where I need to forecast spikes in Incident Management ticket counts based on the geography (location) and the type of ticket.

Yes, it does. When modeling a multivariate time series, the time series of interest is the target, while the other time series are used to make predictions. The data format looks exactly like in this example (not specific to Driverless AI), or see the H2O.ai docs for a concrete time series example where the target time series is Weekly_Sales and the other time series variables are Temperature and Fuel_Price.
There are a couple of settings relevant to a multivariate time series setup:
Unavailable at Prediction Time: specify the other time series columns (besides the target column) that will only have lag-related features created from them.
Probability to Create Non-Target Lag Features: specify a probability value for creating non-target lag features (any value between 0 and 1). With multivariate time series, this value may go as high as 0.9 or even 1 if no target lags should be used for predictions.
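To illustrate the data layout and the lag features discussed above, here is a hypothetical Python/pandas sketch (not Driverless AI itself, which builds these features internally): the target column Weekly_Sales sits alongside the other time series columns, and lagged copies of each column become predictors.

```python
import pandas as pd

# Hypothetical weekly data in the shape of the H2O.ai docs example:
# Weekly_Sales is the target; Temperature and Fuel_Price are other time series.
df = pd.DataFrame({
    "Date": pd.date_range("2020-01-05", periods=6, freq="W"),
    "Weekly_Sales": [100, 110, 105, 120, 130, 125],
    "Temperature": [42.0, 38.5, 40.1, 45.3, 50.2, 48.7],
    "Fuel_Price": [2.50, 2.60, 2.55, 2.70, 2.65, 2.80],
})

# Lag features of the kind discussed above: a target lag plus non-target lags
# (the latter are governed by "Probability to Create Non-Target Lag Features").
for col in ["Weekly_Sales", "Temperature", "Fuel_Price"]:
    df[col + "_lag1"] = df[col].shift(1)
```

The first row has no history, so its lag columns are missing values; a real pipeline would drop or impute such rows.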
UPDATE
In the spirit of the question Multivariate vs Multiple time series, here is more information on modeling time series with Driverless AI. It also supports multiple time series (as opposed to the multivariate time series above) using time group columns (TGC). In fact, any time series dataset is automatically parsed to identify such groups (alternatively, the TGC can be specified explicitly by the user). Treating the TGC as categorical, Driverless AI constructs multiple time series - one for each unique combination of values in the TGC.
The following settings let the user refine how the time series model utilizes the TGC:
Consider TGC as Standalone Features
Which TGC Feature Types to Consider as Standalone Features
Always Group by All Time Groups Columns for Creating Lag Features
The TGC feature operates in combination with multivariate time series, so for each group Driverless AI maintains the multivariate time series functionality described above.
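The per-group mechanics can be sketched in pandas (a hypothetical illustration using the asker's ticket scenario, not Driverless AI code): each unique (Geography, Ticket_Type) combination is its own series, and lag features are computed within each group so one group's history never leaks into another's.

```python
import pandas as pd

# Hypothetical ticket data: (Geography, Ticket_Type) play the role of TGC,
# so each unique combination defines its own time series.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02"] * 2),
    "Geography": ["US", "US", "EU", "EU"],
    "Ticket_Type": ["incident"] * 4,
    "Ticket_Count": [10, 12, 3, 5],
})

# Lags computed per group, in the spirit of
# "Always Group by All Time Groups Columns for Creating Lag Features":
df["Ticket_Count_lag1"] = (
    df.sort_values("Date")
      .groupby(["Geography", "Ticket_Type"])["Ticket_Count"]
      .shift(1)
)
```

Each group's first observation gets a missing lag, and the US lag values never mix with the EU ones.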

Related

How to determine the most significant predictors - multivariate forecasting

I would like to create a forecasting model with time series in R. I have a target time series 'Sales' that I would like to forecast. I also have several time series that represent, for example, GDP or advertising spend. Unfortunately I have a lot of independent time series and I don't know how to figure out the most significant ones. Ideally I would identify the most important ones before building the model.
I have already worked with classification problems, where I have always used the Pearson correlation value. This is not possible with time series, right? How can I determine the correlation for time series and use it to find suitable time series that describe my target time series?
I tried to use the corr.test() function in R, but I think that's not right.
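No answer is recorded for this question here, but one common heuristic can be sketched: difference each series first (raw Pearson correlation between trending series is spurious), then rank each predictor by its maximum absolute lagged cross-correlation with the differenced target. A hypothetical Python sketch with synthetic, made-up series names:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_crosscorr(x, y, max_lag=4):
    """Max |corr| between the differenced target y and lagged differenced x."""
    dx, dy = np.diff(x), np.diff(y)
    best = 0.0
    for lag in range(max_lag + 1):
        r = np.corrcoef(dx[: len(dx) - lag], dy[lag:])[0, 1]
        best = max(best, abs(r))
    return best

# Synthetic illustration: 'sales' follows yesterday's 'ads'; 'noise' is unrelated.
ads = np.cumsum(rng.normal(size=200))
noise = np.cumsum(rng.normal(size=200))
sales = np.empty(200)
sales[0] = ads[0]
sales[1:] = ads[:-1] + 0.1 * rng.normal(size=199)
```

Here max_abs_crosscorr(ads, sales) comes out far larger than max_abs_crosscorr(noise, sales), so the score ranks the genuine driver first. This is only a screening heuristic; it ignores joint effects among predictors.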

Detecting seasonality for time series and applying the cross-correlation function

I have a question about R's ccf() function. I have two time series that represent snow water equivalent on the surface and groundwater head under the ground. I want to find out the "propagation" time from the surface to the ground, so I think that using cross correlation between the two time series can help me detect the "lag" time between them.
It seems that the ccf() function is a proper way to determine the lag between two time series. But according to the mathematical concepts of cross correlation, it requires stationarity of the input data, and both of my time series are seasonal, because intuitively we know that snow occurs in winter. Data with seasonality is considered non-stationary, so I think I might need to decompose the data so that it's stationary. I then used both the stl() and decompose() functions to detect whether there is a seasonality pattern, but both of them gave me this error message:
Error in decompose(swefoothill):
time series has no or less than 2 periods
which is pretty self-explanatory: neither time series has a clear seasonality period. But that doesn't mean that my data are not seasonal. So I want to ask, under this circumstance, is it okay for me to perform a ccf() directly on both time series? I did a sample analysis and the correlation factor figure looks like this:
And I'm observing a cycle pattern here; am I doing it wrong? Thanks a lot for your help!
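No answer is recorded here, but the lag-detection idea itself can be sketched: first-difference both series (a simple route to approximate stationarity when decomposition fails), then take the lag at which the cross-correlation peaks. A hypothetical Python sketch on synthetic stand-in data with a known 3-step delay:

```python
import numpy as np

rng = np.random.default_rng(1)

def ccf_lag(x, y, max_lag=10):
    """Lag at which diff(y) correlates best with the lagged diff(x)."""
    dx, dy = np.diff(x), np.diff(y)
    corrs = [np.corrcoef(dx[: len(dx) - lag], dy[lag:])[0, 1]
             for lag in range(max_lag + 1)]
    return int(np.argmax(corrs))

# Synthetic stand-ins: groundwater head 'gw' echoes snow water equivalent
# 'swe' three steps later, plus measurement noise.
swe = np.cumsum(rng.normal(size=300))
gw = np.roll(swe, 3) + 0.2 * rng.normal(size=300)
gw[:3] = swe[:3]
```

On this synthetic pair the estimator recovers the built-in delay of 3. With genuinely seasonal real data, the shared seasonal cycle can still dominate the CCF even after differencing, which is consistent with the cyclic pattern the asker observed.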

Time series benchmarking/reconciliation and revisions - are there methods that minimise revisions?

I am using the tempdisagg R package for benchmarking quarterly time series to annual time series from different (more trusted) sources (by temporally disaggregating the annual data using the quarterly data as indicator series).
The time series are sub series and sum series, and these identities should hold after benchmarking, too. I.e. if
S = A + B - C,
then
predict(td(S,...)) = predict(td(A, ...)) + predict(td(B, ...)) - predict(td(C,...)).
I have tried the Denton-Cholette and the Chow-Lin-maxlog methods.
This is to be carried out regularly, so ideally I would like a disaggregation method that minimises revisions. I have tried removing up to ten years' worth of data from various time series to see if any method outperforms the others in terms of minimising revisions, but it seems that it depends on a combination of time series volatility and method, and I can't reach a conclusion.
It would be possible to use a combination of different methods on the sub series, I guess.
Is there any comprehensive knowledge on benchmarking and revisions?
I have attached some graphs in an attempt to illustrate the problem. Ideally, we would like to see one line that just changes colour according to the various years of data, as in the first two graphs until about 2015. The black lines in the graphs are the raw data.
Your question seems to consist of two independent parts.
You mention that the identity S = A + B - C can be achieved with our tempdisagg R-library by predict(td(S,...)) = predict(td(A, ...)) + predict(td(B, ...)) - predict(td(C,...)).
This is usually not the case. You will have to apply td() to three of the four series and compute the fourth series implicitly (e.g. S = predict(td(A, ...)) + predict(td(B, ...)) - predict(td(C,...))).
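The point can be made concrete with a toy Python illustration. The disaggregate() function below is a naive pro-rata stand-in for td(), not the actual tempdisagg method, and all numbers are made up: deriving the fourth series implicitly preserves the identity in every quarter, while disaggregating S directly generally does not.

```python
# Toy stand-in for td(): pro-rata disaggregation of an annual total across
# quarters according to an indicator series (NOT the tempdisagg algorithm).
def disaggregate(annual_total, indicator):
    s = sum(indicator)
    return [annual_total * q / s for q in indicator]

# Annual values satisfying S = A + B - C
A, B, C = 100.0, 60.0, 40.0
S = A + B - C  # 120.0

# Each series has its own quarterly indicator.
qA = disaggregate(A, [1, 2, 3, 4])
qB = disaggregate(B, [4, 3, 2, 1])
qC = disaggregate(C, [1, 1, 1, 1])

# Computing S implicitly preserves the identity in every quarter:
qS_implicit = [a + b - c for a, b, c in zip(qA, qB, qC)]

# Disaggregating S directly with its own indicator generally does not
# match the implicit series, even though both sum to the annual S:
qS_direct = disaggregate(S, [2, 2, 3, 3])
```

Both quarterly versions of S sum to 120, but only the implicit one satisfies S = A + B - C quarter by quarter.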
To answer your question about the revisions, a reproducible example would be handy. You could create such an example with the example time series in our tempdisagg library, which are accessible by data(tempdisagg).
Since the Chow-Lin method is based on a regression (in your case, on the involved annual time series), the regression parameters will change with every new or revised annual value. As a consequence, all values of the resulting quarterly series will be revised. When applying the Denton method, no parameters have to be estimated, so only the most recent years of the resulting quarterly series are prone to revision. If your focus is on the whole resulting quarterly time series, far fewer quarters are prone to revisions when using the Denton method compared to the Chow-Lin method. If your focus is on the revisions of the most recent quarters/years, it's a different story and I doubt that there is a clear-cut answer.
I hope this helps. Our paper Temporal Disaggregation of Time Series contains an overview of the different temporal disaggregation methods implemented in the tempdisagg library.

How to construct dataframe for time series data using ensemble learning methods

I am trying to predict the Bitcoin price at t+5, i.e. 5 minutes ahead, using 11 technical indicators up to time t, which can all be calculated from the open, high, low, close and volume values of the Bitcoin time series (see my full data set here). As far as I know, it is not necessary to manipulate the data frame when using algorithms like regression trees, support vector machines or artificial neural networks. But when using ensemble methods like random forests (RF) and boosting, I heard that it is necessary to re-arrange the data frame in some way, because ensemble methods draw repeated RANDOM samples from the training data, in which case the sequence of the Bitcoin time series would be ruined. So, is there a way to re-arrange the data frame such that the time series is still in chronological order every time repeated samples are drawn from the training data?
I was provided with an explanation of how to construct the data frame here and possibly here, too, but unfortunately I didn't really understand these explanations, because I didn't see a visual example of the to-be-constructed data frame and because I wasn't able to identify the relevant line of code. So, if someone could show me how to re-arrange the data frame using an example data frame, I would be very thankful. As an example data frame, you might consider using the built-in airquality data frame in R (I think it contains time series data), the data I provided above, or any other data frame you think is best.
Many thanks!
There is no problem with resampling for ML algorithms. To capture (auto)correlation, just add columns with lagged values of the time series. E.g., in the case of a univariate time series x[t], where t is time in minutes, you add columns x[t - 1], x[t - 2], ..., x[t - n] with lagged values. The more lags you add, the more history is accounted for at model training.
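A minimal pandas sketch of that idea (with made-up numbers standing in for the Bitcoin data): after adding lag columns and a shifted target, each row carries its own history, so random resampling by a random forest or boosting model no longer disturbs the time order.

```python
import pandas as pd

# Hypothetical minute-level price series (stand-in for the Bitcoin data).
prices = pd.DataFrame({"close": [10.0, 10.5, 10.2, 10.8, 11.0, 10.9, 11.3]})

n_lags = 3   # how much history each row carries
horizon = 2  # forecast t+2 here; the question uses t+5

# Lagged predictors and the future target as ordinary columns:
for k in range(1, n_lags + 1):
    prices[f"close_lag{k}"] = prices["close"].shift(k)
prices["target"] = prices["close"].shift(-horizon)

# Each remaining row is a self-contained (features, target) example,
# so rows can be sampled in any order without breaking chronology.
supervised = prices.dropna().reset_index(drop=True)
```

Rows without full history or without a known future value are dropped; the same construction applies to the 11 technical indicators, each getting its own lag columns.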
A very basic working example can be found here: Prediction using neural networks
More advanced stuff with Keras is here: Time series prediction using RNN
However, just for your information, here is a special message by Mr Chollet and Mr Allaire from the above-mentioned article ;):
NOTE: Markets and machine learning
Some readers are bound to want to take the techniques we’ve introduced
here and try them on the problem of forecasting the future price of
securities on the stock market (or currency exchange rates, and so
on). Markets have very different statistical characteristics than
natural phenomena such as weather patterns. Trying to use machine
learning to beat markets, when you only have access to publicly
available data, is a difficult endeavor, and you’re likely to waste
your time and resources with nothing to show for it.
Always remember that when it comes to markets, past performance is not
a good predictor of future returns – looking in the rear-view mirror
is a bad way to drive. Machine learning, on the other hand, is
applicable to datasets where the past is a good predictor of the
future.

survival package, right censored data

I need to account for right-censored data in the analysis of my dataset. I am using the survival package; the data describe cancer treatment tactics and when each patient last checked in with my client's clinic.
Is there a suggested method or manipulation to the standard survival package to account for right-censored data?
Our rows are unique individual patients...
Here are our columns that are filled out:
our treatment type (constant)
days since original diagnosis
'censored', which is the number of patients who were last heard from on this day. Hence, we are now uncertain whether they are still alive or dead, seeing as they stopped attending the clinic. They should be removed from the probability estimate at all future points.
# of patients who died on that day (from original diagnosis)
So do you recommend a manipulation of the standard survival package? Or using another package? I have seen survSNP, survPRESMOOTH and survBIVAR, which may perhaps help. I want to avoid recalculating the individual columns/fields and creating new R objects, seeing as this is a small part of a very large dataset.
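No answer is recorded here, but the behaviour the question asks for (censored patients leave the risk set and never re-enter the probability estimate) is exactly what the Kaplan-Meier product-limit estimator does, and it can be computed directly from the aggregated daily counts without expanding to one row per patient. A hypothetical Python sketch with made-up counts (in R one would typically expand the data and use Surv() with survfit() from the survival package):

```python
# Hypothetical aggregated rows: (days since diagnosis, deaths, censored).
daily = [(5, 2, 0), (8, 1, 1), (12, 3, 0), (20, 0, 2), (25, 1, 0)]

def kaplan_meier(rows, n_at_start):
    """Product-limit survival estimate from aggregated event counts.

    Deaths reduce the survival probability; censored patients merely
    shrink the risk set from the next day onward, so they are removed
    from the probability estimate at all future points.
    """
    at_risk, surv = n_at_start, 1.0
    curve = []
    for day, deaths, censored in sorted(rows):
        if deaths:
            surv *= 1 - deaths / at_risk
        at_risk -= deaths + censored
        curve.append((day, surv))
    return curve

curve = kaplan_meier(daily, n_at_start=10)
```

With 10 patients, the estimate steps down to 0.8 after day 5, 0.7 after day 8, and so on; the two patients censored on day 20 never affect a later step's denominator as deaths would.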