Fast Fourier Transform and Clustering of Time Series in R

I'm working on a project to identify the dynamics of sales. Here is what a piece of my database looks like: http://imagizer.imageshack.us/a/img854/1958/zlco.jpg. There are three columns:
Product - the group the product belongs to
Week - time since the product's launch (in weeks), first 26 weeks
Sales_gain - how the product's sales change from week to week
The database contains 3302 observations = 127 time series.
My aim is to cluster the time series into groups that reveal different sales dynamics. Before clustering, I want to apply the Fast Fourier Transform to turn each time series into a vector (taking amplitude etc. into consideration) and then use a distance measure to group the products.
This is the first time I am dealing with FFT and clustering, so I would be grateful if anybody could point out the steps I have to take before/after using the FFT to group the sales dynamics. I want to do all the steps in R, so it would be wonderful if somebody could indicate which procedures to use for each step.
This is what my time series look like now: http://imageshack.com/a/img703/6726/sru7.jpg
Please note that I am relatively new to time series analysis (that's why I cannot put my code here), so any clarity you could provide in R, or any package you could recommend that would accomplish this task efficiently, would be appreciated.
P.S. Instead of FFT I found code for the DWT here -> www.rdatamining.com/examples/time-series-clustering-classification, but I cannot apply it to my database and time series (to get R to analyze a new time series after 26 weeks). Can somebody explain it to me?

You may have too little data for FFT/DWT to make sense. DTW may be better, but I also don't think it makes sense for sales data - why would there be an x-week temporal offset from one location to another? It's not as if the data were captured at unknown starting weeks.
FFT and DWT are good when your data has interesting repetitive patterns and you have A) good temporal resolution (for audio data, e.g. 16000 Hz - I am talking about thousands of data points!) and B) no idea what frequencies to expect. If you know, e.g., that you will have weekly patterns (e.g. no sales on Sundays), then you should filter them out with other methods instead.
DTW (dynamic time warping) is good when you don't know when the events start or how they align. Say you are capturing heart measurements. You cannot expect the hearts of two subjects to beat in synchronization. DTW will try to align the data, and may (or may not) succeed in matching e.g. an anomaly in the heartbeats of two subjects. In theory...
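For intuition, a minimal sketch of a DTW distance in R with the dtw package (the two toy series are invented for illustration):

library(dtw)

a <- sin(seq(0, 2 * pi, length.out = 50))
b <- sin(seq(0, 2 * pi, length.out = 50) - 0.5)  # same shape, phase-shifted

# dtw() warps the time axis to find the best alignment of a and b
# before accumulating the pointwise distances
alignment <- dtw(a, b)
alignment$distance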
Maybe you don't need specialized time-series methods here at all:
A) your data has too low a temporal resolution
B) your data is already perfectly aligned
Maybe all you need is to spend more time preprocessing your data, in particular on normalization, to be able to capture similarity.
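A minimal sketch of that route, assuming the data sit in a data frame df with the columns described above (Product, Week, Sales_gain); reshape() and hclust() are illustrative choices, not the only option:

# one row per product, one column per week (26 columns)
m <- reshape(df[, c("Product", "Week", "Sales_gain")],
             idvar = "Product", timevar = "Week", direction = "wide")
rownames(m) <- m$Product
m <- as.matrix(m[, -1])

# z-score each series so clustering captures the shape of the
# dynamics rather than the absolute sales level
m <- t(scale(t(m)))

# Euclidean distances between the normalized series, then
# hierarchical clustering cut into, say, 5 groups
hc <- hclust(dist(m), method = "ward.D2")
groups <- cutree(hc, k = 5)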

Related

Time series benchmarking/reconciliation and revisions - are there methods that minimise revisions?

I am using the tempdisagg R package for benchmarking quarterly time series to annual time series from different (more trusted) sources (by temporally disaggregating the annual data, using the quarterly data as indicator series).
The time series are sub series and sum series, and these identities should hold after benchmarking, too. I.e. if
S = A + B - C,
then
predict(td(S,...)) = predict(td(A, ...)) + predict(td(B, ...)) - predict(td(C,...)).
I have tried the Denton-Cholette and the Chow-Lin-maxlog methods.
This is to be carried out regularly, so ideally I would like a disaggregation method that minimises revisions. I have tried removing up to ten years' worth of data from various time series to see if any method outperforms the others in terms of minimising revisions, but it seems that it depends on a combination of time series volatility and method, and I can't reach a conclusion.
It would be possible to use a combination of different methods on the sub series, I guess.
Is there any comprehensive knowledge on benchmarking and revisions?
I have attached some graphs in an attempt to illustrate the problem. Ideally, we would like to see one line that just changes colour according to the various years of data, as in the first two graphs until about 2015. The black lines in the graphs are the raw data.
Your question seems to consist of two independent parts.
You mention that the identity S = A + B - C can be achieved with our tempdisagg R package via predict(td(S,...)) = predict(td(A, ...)) + predict(td(B, ...)) - predict(td(C,...)).
This is usually not the case. You will have to apply td() to three of the four series and compute the fourth one implicitly (e.g. S = predict(td(A, ...)) + predict(td(B, ...)) - predict(td(C,...))).
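A minimal sketch of the implicit route, assuming the example series loaded by data(tempdisagg) (sales.a and exports.a annual, exports.q and imports.q quarterly); the pairing of series and indicators is purely illustrative:

library(tempdisagg)
data(tempdisagg)

# disaggregate two annual series explicitly ...
qA <- predict(td(sales.a ~ exports.q, method = "chow-lin-maxlog"))
qB <- predict(td(exports.a ~ imports.q, method = "chow-lin-maxlog"))

# ... and compute the aggregate implicitly, so the identity holds
# by construction instead of relying on td() to preserve it
qS <- qA + qB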
To answer your question about the revisions, a reproducible example would be handy. You could create such an example with the example time series in our tempdisagg library, which are accessible by data(tempdisagg).
Since the Chow-Lin method is based on a regression (in your case, of the involved annual time series), the regression parameters will change with every new or revised annual value. As a consequence, all values of the resulting quarterly series will be revised. When applying the Denton method, no parameters have to be estimated, so only the most recent years of the resulting quarterly series are prone to revision. If your focus is on the whole resulting quarterly time series, far fewer quarters are prone to revisions when using the Denton method compared to the Chow-Lin method. If your focus is on the revisions of the most recent quarters/years, it's a different story, and I doubt that there is a clear-cut answer.
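A rough sketch of such a revision experiment, again with the example series assumed above: re-estimate with the last annual value removed and compare the two quarterly estimates over their common span:

full  <- predict(td(sales.a ~ exports.q, method = "chow-lin-maxlog"))
short <- predict(td(window(sales.a, end = end(sales.a)[1] - 1) ~ exports.q,
                    method = "chow-lin-maxlog"))

# size of the revisions caused by dropping one annual observation
both <- ts.intersect(full, short)
summary(as.numeric(both[, 1] - both[, 2]))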
I hope this helps. Our paper Temporal Disaggregation of Time Series contains an overview of the different temporal disaggregation methods implemented in the tempdisagg library.

How can Keras predict sequences of sales (individually) for 11106 distinct customers, each a series of varying length (anywhere from 1 to 15 periods)?

I am approaching a problem that Keras must offer an excellent solution for, but I am having problems developing an approach (because I am such a neophyte concerning anything about deep learning). I have sales data. It contains 11106 distinct customers, each with their own time series of purchases of varying length (anywhere from 1 to 15 periods).
I want to develop a single model to predict each customer's purchase amount for the next period. I like the idea of an LSTM, but clearly I cannot make one for each customer; even if I tried, there would not be enough data for an LSTM in any case, since the longest individual time series only has 15 periods.
I have used types of Markov chains, clustering, and regression in the past to model this kind of data. I am asking the question here, though, about what type of model in Keras is suited to this type of prediction. A complication is that all customers can be clustered by their overall patterns. Some belong together based on similarity; others do not; e.g., some customers spend with patterns like $100-$100-$100, others like $100-$100-$1000-$10000, and so on.
Can anyone point me to a type of sequential model supported by Keras that might handle this well? Thank you.
I am trying to achieve this in R. I haven't been able to build a model that gives me more than about 0.3 accuracy.
I don't think the main difficulty comes from which model to use so much as from how to frame the problem.
As you mention, WHO is spending the money seems as relevant as their past transactions in knowing how much they will likely spend.
But you cannot train 10k+ models either, one for each customer.
Instead, I would suggest clustering your customer base and fitting one model per cluster, using all the time series of the customers in that cluster combined to train the same model.
This would allow each model to learn the spending pattern of that particular group.
For that you can use an LSTM or another RNN model.
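A minimal sketch of this per-cluster approach with the keras R package; the array shapes and the cluster labels cl are illustrative assumptions, not something from the question:

# x: a (customers x 15 x 1) array of purchase amounts, zero-padded to
# length 15; y: next-period amount per customer; cl: a cluster label per
# customer, e.g. from kmeans() on per-customer summary statistics.
library(keras)

fit_cluster_model <- function(x, y) {
  model <- keras_model_sequential() %>%
    layer_masking(mask_value = 0, input_shape = c(15, 1)) %>%  # skip padding
    layer_lstm(units = 32) %>%
    layer_dense(units = 1)
  model %>% compile(loss = "mse", optimizer = "adam")
  model %>% fit(x, y, epochs = 20, batch_size = 64, verbose = 0)
  model
}

# one model per cluster, trained on all customers in that cluster
models <- lapply(split(seq_len(dim(x)[1]), cl), function(idx) {
  fit_cluster_model(x[idx, , , drop = FALSE], y[idx])
})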
Hi, here's my suggestion (I will edit it later to provide you with more information).
Since it's a sequence problem, you should use RNN-based models: LSTMs, GRUs.

How to sort many time series by how trending each series is

Hi, I am recording data for around 150k items in InfluxDB. I have tried grouping by item id and using some of the functions from the docs, but they don't seem to show "trend".
As there are a lot of series to group by, I am currently performing a query on each series to calculate a value, storing it, and sorting by that.
I have tried using linear regression (the average angle of the line), but it's not quite meant for this, as the X axis values are timestamps, which are on a completely different scale from the Y axis values, so I end up with a near-vertical line. Maybe I can transform the X values into something else?
The other issue I have is that some series have much higher values than others, so one series jumping up by 1000 might be huge (very trending), while it's no big deal for other series whose values are always much higher.
Is there a way I can generate a single value from a series that represents how trending the series is, e.g. it has just jumped up quite a lot compared to normal?
Here is an example of one series that is not trending and one that was trending a couple of days ago. The latter would have a higher trend value than the first:
Thanks!
I think similar problems arise naturally in the stock market and, in general, when detecting outliers.
So there are different ways to proceed. Probably 1) is good enough.
1) It looks like you have a moving average in the graphs. You could just take the difference from the moving average and look at its distribution to pick appropriate thresholds for when to pay attention. It looks like in the first graph you have a perhaps-relevant event. You could place a threshold at, say, two standard deviations of the difference between the real series and the moving average (see the sketch after this answer).
2) De-trend each series. Even 1) could be good enough (I mean just subtracting the moving average of the last X days from the real value of the series), but you could de-trend using more sophisticated ideas. That could need more attention for each case; for instance, you should be careful with seasonality and so on. Perhaps something like the Hodrick-Prescott filter, or along the lines of this: https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/.
Perhaps the idea from 1) is more formally described as Bollinger Bands. They help you know where the time series should be, with some probability.
3) There are more sophisticated ways to identify outliers in time series, as in here: https://towardsdatascience.com/effective-approaches-for-time-series-anomaly-detection-9485b40077f1, or here for a literature review: https://arxiv.org/pdf/2002.04236.pdf
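A minimal sketch of idea 1) in R; trend_score() is an illustrative helper, and the window length is a guess you would tune:

trend_score <- function(x, window = 50) {
  # trailing moving average (right-aligned); NA during the warm-up
  ma  <- stats::filter(x, rep(1 / window, window), sides = 1)
  dev <- x - ma
  # z-score of the latest deviation against its own history:
  # scale-free, so series with very different levels are comparable
  as.numeric(tail(dev, 1)) / sd(dev, na.rm = TRUE)
}

# scores <- sapply(list_of_series, trend_score)
# sort(scores, decreasing = TRUE)  # most "trending" first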

Interpolate a high frequency time series

I have a physical time series covering a range of 2 years of sample data at a 30-minute frequency, but there are multiple wide intervals of lost data, as you can see here:
I tried the function na.interp from the forecast package, with a bad result (shown above):
sapply(dataframeTS[2:10], na.interp)
I'm looking for a more useful method.
UPDATE:
Here is more info about the pattern I want to capture, specifically the raw data. This subsample belongs to May.
You might want to try the **imputeTS** package. It's an R package dedicated to time series missing value imputation.
The na_seadec(), na_seasplit(), and na_kalman() methods might be interesting here.
There are many more algorithm options - you can find a list in this paper about the package.
In this specific case I would try:
na_seasplit(yourData)
or
na_kalman(yourData)
or
na_seadec(yourData)
Be aware that you might need to supply the seasonality information correctly with the time series: you have to create a ts object and set the frequency parameter, as sketched below.
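A minimal sketch, assuming the values sit in column 2 of dataframeTS; with 30-minute data, frequency = 48 encodes a daily seasonal pattern (336 would encode a weekly one):

library(imputeTS)

# 48 half-hour observations per day -> daily seasonality
x <- ts(dataframeTS[[2]], frequency = 48)
x_imp <- na_seadec(x)  # or na_seasplit(x), na_kalman(x)

# visual check of where the values were imputed
ggplot_na_imputations(x, x_imp)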
Still, it might be that it won't work out at all; you will have to try.
(if you can provide the data I'll also give it a try)

How to construct a data frame for time series data when using ensemble learning methods

I am trying to predict the Bitcoin price at t+5, i.e. 5 minutes ahead, using 11 technical indicators up to time t, which can all be calculated from the open, high, low, close and volume values of the Bitcoin time series (see my full data set here). As far as I know, it is not necessary to manipulate the data frame when using algorithms like regression trees, support vector machines or artificial neural networks. But when using ensemble methods like random forests (RF) and boosting, I heard that it is necessary to re-arrange the data frame in some way, because ensemble methods draw repeated RANDOM samples from the training data, in which case the sequence of the Bitcoin time series will be ruined. So, is there a way to re-arrange the data frame such that the time series stays in chronological order every time repeated samples are drawn from the training data?
I was given an explanation of how to construct the data frame here, and possibly here too, but unfortunately I didn't really understand those explanations, because I didn't see a visual example of the to-be-constructed data frame and wasn't able to identify the relevant lines of code. So, if someone could show me how to re-arrange the data frame using an example data frame, I would be very thankful. As an example data frame, you might consider using the built-in airquality data frame in R (I think it contains time series data), the data I provided above, or any other data frame you think is best.
Many thanks!
There is no problem with resampling for ML algorithms. To capture the (auto)correlation, just add columns with lagged values of the time series. E.g. in the case of a univariate time series x[t], where t is time in minutes, you add columns with the lagged values x[t-1], x[t-2], ..., x[t-n]. The more lags you add, the more history is accounted for in model training.
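A minimal sketch of that idea; make_lags() is an illustrative helper, not from any package:

make_lags <- function(x, n_lags) {
  out <- data.frame(y = x)
  for (k in seq_len(n_lags)) {
    out[[paste0("lag", k)]] <- c(rep(NA, k), head(x, -k))
  }
  na.omit(out)  # drop the first n_lags rows, which lack a full history
}

# Each row now carries its own history, so random resampling no longer
# destroys the temporal structure:
# library(randomForest)
# d <- make_lags(close_prices, n_lags = 10)
# fit <- randomForest(y ~ ., data = d)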
You can find a very basic working example here: Prediction using neural networks
More advanced stuff with Keras is here: Time series prediction using RNN
However, just for your information, here is a special message from Mr Chollet and Mr Allaire in the above-mentioned article ;):
NOTE: Markets and machine learning
Some readers are bound to want to take the techniques we've introduced here and try them on the problem of forecasting the future price of securities on the stock market (or currency exchange rates, and so on). Markets have very different statistical characteristics than natural phenomena such as weather patterns. Trying to use machine learning to beat markets, when you only have access to publicly available data, is a difficult endeavor, and you're likely to waste your time and resources with nothing to show for it.
Always remember that when it comes to markets, past performance is not a good predictor of future returns – looking in the rear-view mirror is a bad way to drive. Machine learning, on the other hand, is applicable to datasets where the past is a good predictor of the future.
