I have a question about using RStudio to fit ARIMA models to more than one time series.
For example, I have three clients, each with a time series of 4 periods:
Client 1   Client 2   Client 3
       1          3          7
       2          5          3
       4          3          1
       5          8          9
Now I want to predict the next period after 5/8/9. I know how to use ARIMA to predict the time series one by one, but in practice I have lots of clients and it would take too much time. Could you please show me how to write a loop, or use lapply or something similar, to make this easier?
Also, when picking the ARIMA order, I only know how to use Ident to generate ACF and PACF figures to identify the MA and AR orders, which will not work for a large number of time series - it seems unwise to draw hundreds of figures. Do you have any good advice for choosing the ARIMA order? Thank you!
apply (or lapply) lets you run the same function on every column (if each column is one time series). It is worth learning how to use it; there are plenty of examples on the internet.
If you have a lot of time series and want to avoid manual work, auto.arima should come in handy. In case you're not satisfied with its results:
1. Try finding general rules first. That is, if you know that each of the time series will be seasonal (with the same season length), you know that all of them need seasonal differencing. The same can be said about a long-term trend. These inferences could also be made algorithmically.
2. Whether or not point 1 applies, to decide the best parameter values you will have to code up the logic you use to choose the AR and MA orders manually. The simpler your manual logic, the easier it is to code. One simple rule of thumb is to keep AR + MA = 1: choose AR if the ACF decays gradually, MA if it cuts off rapidly. What counts as rapid and what as gradual will have to be decided by your code as well.
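For illustration, here is a minimal sketch of the lapply + auto.arima approach (it assumes the forecast package is installed; the toy data frame just mirrors the three-client example in the question):

library(forecast)

# toy data: one column per client, as in the question
df <- data.frame(client1 = c(1, 2, 4, 5),
                 client2 = c(3, 5, 3, 8),
                 client3 = c(7, 3, 1, 9))

# fit one ARIMA model per column, then forecast one step ahead
fits  <- lapply(df, function(x) auto.arima(ts(x)))
preds <- sapply(fits, function(fit) as.numeric(forecast(fit, h = 1)$mean))
preds

In practice your series will be much longer than four points; with so few observations auto.arima will usually fall back to a very simple model.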
I am approaching a problem that Keras must offer an excellent solution for, but I am having trouble developing an approach (because I am such a neophyte at anything to do with deep learning). I have sales data. It contains 11106 distinct customers, each with its own time series of purchases of varying length (anywhere from 1 to 15 periods).
I want to develop a single model to predict each customer's purchase amount for the next period. I like the idea of an LSTM, but clearly, I cannot make one for each customer; even if I tried, there would not be enough data for an LSTM in any case---the longest individual time series only has 15 periods.
I have used types of Markov chains, clustering, and regression in the past to model this kind of data. I am asking the question here, though, about what type of model in Keras is suited to this type of prediction. A complication is that all customers can be clustered by their overall patterns. Some belong together based on similarity; others do not; e.g., some customers spend with patterns like $100-$100-$100, others like $100-$100-$1000-$10000, and so on.
Can anyone point me to a type of sequential model supported by Keras that might handle this well? Thank you.
I am trying to achieve this in R. I haven't been able to build a model that gives me more than about 0.3 accuracy.
I don't think the main difficulty is which model to use so much as how to frame the problem.
As you mention, WHO is spending the money seems as relevant as their past transactions for predicting how much they will likely spend.
But you cannot train 10k+ models, one per customer, either.
Instead, I would suggest clustering your customer base and fitting one model per cluster, using the combined time series of all customers in that cluster to train the same model.
This would allow each model to learn the spending pattern of that particular group.
For that you can use an LSTM or another RNN model.
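As a rough sketch of the clustering step (assuming a long data frame called purchases with hypothetical columns customer_id, period and amount, and the dplyr package):

library(dplyr)

# simple per-customer features to cluster on
features <- purchases %>%
  group_by(customer_id) %>%
  summarise(mean_spend = mean(amount),
            sd_spend   = sd(amount),
            n_periods  = n()) %>%
  mutate(sd_spend = ifelse(is.na(sd_spend), 0, sd_spend))

set.seed(1)
km <- kmeans(scale(features[, c("mean_spend", "sd_spend", "n_periods")]),
             centers = 5)                 # the number of clusters is a guess
features$cluster <- km$cluster

# then pool the series of all customers within each cluster and train one
# sequence model (e.g. an LSTM) per cluster on that pooled data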
Hi, here's my suggestion; I will edit it later to provide you with more information.
Since it's a sequence problem, you should use RNN-based models such as LSTMs or GRUs.
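For example, a minimal skeleton with the keras R interface (the shapes are my assumptions: sequences padded with zeros to 15 steps, one feature per step, predicting the next period's amount):

library(keras)

timesteps <- 15
model <- keras_model_sequential() %>%
  layer_masking(mask_value = 0, input_shape = c(timesteps, 1)) %>%  # ignore padded steps
  layer_lstm(units = 32) %>%      # swap in layer_gru(units = 32) for a GRU
  layer_dense(units = 1)

model %>% compile(optimizer = "adam", loss = "mse")
# model %>% fit(x_train, y_train, epochs = 20, batch_size = 64)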
I tried searching for an answer to this question of mine, but I could not find anything.
I want to build a model that predicts barley prices. For that I came up with 11 variables that may have an impact on the prices. What I tried was building a loop that each time picks one extra variable from my pool of variables and tries different combinations of them; the output would be a new VAR model for every extra variable/combination, so in a sense it is a combinatorics exercise. After that, I want to run in-sample and out-of-sample tests for each of the resulting models to decide which one is the most appropriate. Unfortunately, I am not very familiar with loops and I have been told not to use them in R... As I am a beginner in R, my attempts won't help you at all, but if you really need them I am happy to provide them.
Many thanks in advance!
I have historical purchase data for some 10k customers over 3 months, and I want to use it to predict their purchases over the next 3 months. I am using the customer ID as an input variable, as I want xgboost to learn each individual's spending across different categories. Is there a way to tweak the model so that the emphasis is on learning from each individual's purchases? Or is there a better way of framing this problem?
You can use a weight vector, which you can pass via the weight argument in xgboost; it is a vector of length nrow(trainingData). However, this is generally used to penalize mistakes on particular instances (think of sparse data with items that sell only, say, once a month; if you want to learn those sales you need to give more weight to the sales instances, or else all predictions will be zero). You seem to be trying to weight an independent variable, which I don't quite follow.
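A small sketch of passing such per-row weights in the R xgboost package (train_x, train_y and row_weights are hypothetical objects):

library(xgboost)

dtrain <- xgb.DMatrix(data = as.matrix(train_x), label = train_y)
setinfo(dtrain, "weight", row_weights)    # one weight per training row

bst <- xgb.train(params = list(objective = "reg:squarederror"),
                 data = dtrain, nrounds = 100)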
Learning the behavior of the dependent variable (sales in your case) is what a machine learning model does; you should let it do its job. You should not force it to learn from some feature only. For learning purchase behavior, clustering-type unsupervised techniques will be more useful.
To include user-specific behavior, a first step would be to cluster users and identify the under-indexed and over-indexed categories for each user. Then you can create categorical features from these flags.
PS: Some sample data illustrating your problem would help others help you better.
This arrived in XGBoost 1.3.0, released 10 December 2020, under the name feature_weights: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit . I'll edit this answer when I can work through, or find, a tutorial for it.
We have hourly time series data with 2 columns: one is the timestamp and the other is the error rate. We used an H2O deep learning model to learn and predict the future error rate, but it looks like it requires at least 2 features (besides the timestamp) to create the model.
Is there any way H2O can learn from this type of data (time, value), with only one feature, and predict the value for a given future time?
Not in the current release of H2O, but ARIMA models are in development. You can follow the progress here.
Interesting question,
I have read about declaring additional variables that represent previous values of the time series, similar to the regression methodology in ARIMA models. I'm not sure whether this is a valid way to do it, so please correct me if I am wrong.
Following that idea, you could try to extend your dataset to something like this:
t   value(t)   value(t-1)   value(t-2)   value(t-3)   ...
1      10          NA           NA           NA       ...
2      14          10           NA           NA       ...
3      27          14           10           NA       ...
...
After this, value(t) is your response (output neuron) and the others are your predictor variables, each referring to an input neuron.
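A minimal sketch of building these lagged predictors in R with base embed() (the series and the number of lags are made up for illustration; note that embed() simply drops the leading rows that would contain NAs):

value  <- c(10, 14, 27, 31, 42, 40, 55)     # hypothetical series
n_lags <- 3

lagged <- embed(value, n_lags + 1)          # row t: value(t), value(t-1), ..., value(t-3)
colnames(lagged) <- c("value_t", paste0("value_t_minus_", 1:n_lags))
train <- as.data.frame(lagged)

# value_t is the response; the value_t_minus_* columns are the predictors
# that feed the input neurons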
I have tried to use many of the default methods inside H2O with time series data. If you treat the system as a state machine where the state variables are a series of lagged prior states, it's possible, but not entirely effective, as the prior states don't maintain their causal order. One way to alleviate this is to assign weights to each lagged state set based on how far in the past it is, similar to how an EMA gives precedence to more recent data.
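One possible way to implement that recency weighting in R (the decay rate and the use of H2O's weights_column are my assumptions, not something H2O requires):

n      <- nrow(train)                    # lagged-state rows, oldest first
lambda <- 0.97                           # per-period decay, a guess
train$row_weight <- lambda^((n - 1):0)   # most recent row gets weight 1

# h2o.deeplearning(x = predictors, y = "value_t",
#                  training_frame = as.h2o(train),
#                  weights_column = "row_weight")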
If you want to see how easy or effective DL/ML can be for a non-linear time series model, I would start with an easy problem, to validate that the DL approach gives any improvement over a simple one-period ARIMA/GARCH-type process.
I have used this technique with varying success. What I have had success with is taking well-known non-linear time series models and improving their predictive qualities with additional factors, using the handcrafted non-linear model as an input to the DL method. It seems that certain qualities of the overall parameter space that I haven't manually worked out are able to supplement a decent foundation.
The real question at that point is that you have now introduced immense complexity that isn't entirely understood. Is that complexity warranted in the combined setup when the non-linear model already encapsulates about 95% of the information between the two stages?
If I have 2 lists of time intervals:
List 1:
1. 2010-06-06 to 2010-12-12
2. 2010-05-04 to 2010-11-02
3. 2010-02-04 to 2010-10-08
4. 2010-04-01 to 2010-08-02
5. 2010-01-03 to 2010-02-02
and
List 2:
1. 2010-06-08 to 2010-12-14
2. 2010-04-04 to 2010-10-10
3. 2010-02-02 to 2010-12-16
What would be the best way to calculate some sort of correlation or similarity factor between the two lists?
Thanks!
Is that the extent of the data or just a sample to give an idea of the structure you have?
Just a few ideas about how to look at this... My apologies if this is redundant with what you have already tried.
Two basic ideas come to mind for comparing intervals like this: absolute or relative. A relative comparison would ignore absolute time and look for repeating structures or signatures that occur in both groups, but not necessarily at the same time. The absolute version would consider simultaneous events to be relevant, and it wouldn't matter that something happens every week if the occurrences are separated by a year... You can probably make this distinction by knowing something about the origin of the data.
If this is the grand total of the data available for deciding about associations, it will come down to some assumptions about what constitutes "correlation". For instance, if you have a specific model for what is going on - e.g. a time-to-start, time-to-stop (failure) model - you could evaluate the likelihood of observing one sequence given the other. However, without more example data it seems unlikely you'd be able to draw any firm conclusions.
The first intervals in the two groups are nearly identical, so they will contribute strongly to any correlation measure I can think of for the two groups. Under a random model for this set, I would expect many models to flag these two observations as "unlikely" just because of that.
One way to assess "similarity" would be to ask what portion of the time axis is covered (possibly generalized to multiple coverage) and compare the two groups on that basis.
Another possibility is to define a function that adds one for each interval covering any particular day within the overall span of these events. That way you have a step function giving a rudimentary description of multiple events covering the same date. Calculating a correlation between the two groups' functions might give you hints of structural similarity, but again you would need more groups of data to draw any conclusions.
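A rough sketch of that counting idea in R, using the two example lists from the question:

list1 <- data.frame(start = as.Date(c("2010-06-06", "2010-05-04", "2010-02-04",
                                      "2010-04-01", "2010-01-03")),
                    end   = as.Date(c("2010-12-12", "2010-11-02", "2010-10-08",
                                      "2010-08-02", "2010-02-02")))
list2 <- data.frame(start = as.Date(c("2010-06-08", "2010-04-04", "2010-02-02")),
                    end   = as.Date(c("2010-12-14", "2010-10-10", "2010-12-16")))

days <- seq(as.Date("2010-01-01"), as.Date("2010-12-31"), by = "day")

# number of intervals in a list that cover each day
coverage <- function(intervals, days)
  sapply(seq_along(days), function(i)
    sum(days[i] >= intervals$start & days[i] <= intervals$end))

c1 <- coverage(list1, days)
c2 <- coverage(list2, days)
cor(c1, c2)    # crude similarity of the two daily coverage profiles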
Ok that was a little rambling. Good luck with your project!
You may try cross-correlation.
However, you should be aware that you have interval data (start, length), and cross-correlation algorithms assume a functional dependency over time. Whether that applies depends on the semantics of your data, which is not clear from the question.
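For instance, if you first turn each list into a daily series (such as the coverage counts c1 and c2 sketched in the previous answer), a quick look with R's ccf() might be:

cc <- ccf(c1, c2, lag.max = 30, plot = FALSE)
cc$acf[which.max(abs(cc$acf))]    # strongest cross-correlation
cc$lag[which.max(abs(cc$acf))]    # and the lag at which it occurs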
HTH!
A more useful link for your current problem is here.