How to construct a data frame for time series data using ensemble learning methods - R

I am trying to predict the Bitcoin price at t+5, i.e. 5 minutes ahead, using 11 technical indicators up to time t, all of which can be calculated from the open, high, low, close and volume values of the Bitcoin time series (see my full data set here). As far as I know, it is not necessary to manipulate the data frame when using algorithms like regression trees, support vector machines or artificial neural networks, but when using ensemble methods like random forests (RF) and boosting, I have heard that it is necessary to re-arrange the data frame in some way, because ensemble methods draw repeated RANDOM samples from the training data, which would ruin the sequence of the Bitcoin time series. So, is there a way to re-arrange the data frame such that the time series remains in chronological order every time repeated samples are drawn from the training data?
I was provided with an explanation of how to construct the data frame here and possibly here, too, but unfortunately I didn't really understand these explanations, because I didn't see a visual example of the data frame to be constructed and because I wasn't able to identify the relevant line of code. So, if someone could show me how to re-arrange the data frame using an example data frame, I would be very thankful. As the example data frame, you might use the built-in airquality data frame in R (I think it contains time series data), the data I provided above, or any other data frame you think is best.
Many thanks!

There is no problem with resampling for ML algorithms. To capture (auto)correlation, just add columns with lagged values of the time series. E.g., in the case of a univariate time series x[t], where t is time in minutes, you add columns x[t - 1], x[t - 2], ..., x[t - n] with lagged values. The more lags you add, the more history is accounted for during model training.
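To make this concrete, here is a minimal sketch in R of such a lag-feature data frame; the simulated vector close and the helper make_lagged are hypothetical stand-ins for your own 1-minute closing prices:

# Hypothetical sketch: each row carries its own history, so random resampling
# (e.g. by randomForest) no longer breaks the chronology.
make_lagged <- function(x, n_lags = 10, horizon = 5) {
  n <- length(x)
  idx <- (n_lags + 1):(n - horizon)
  out <- data.frame(target = x[idx + horizon])   # price 5 minutes ahead
  for (k in 0:n_lags) {
    out[[paste0("lag_", k)]] <- x[idx - k]       # x[t], x[t-1], ..., x[t-n]
  }
  out
}
set.seed(42)
close <- cumsum(rnorm(1000)) + 100               # stand-in for 1-minute closes
df <- make_lagged(close)
head(df)

Each row is now a self-contained observation, so bagging and boosting can resample rows freely without losing the temporal information.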
You can find a very basic working example here: Prediction using neural networks
More advanced material with Keras is here: Time series prediction using RNN
However, just for your information, here is a special message from Mr Chollet and Mr Allaire from the above-mentioned article:
NOTE: Markets and machine learning
Some readers are bound to want to take the techniques we’ve introduced here and try them on the problem of forecasting the future price of securities on the stock market (or currency exchange rates, and so on). Markets have very different statistical characteristics than natural phenomena such as weather patterns. Trying to use machine learning to beat markets, when you only have access to publicly available data, is a difficult endeavor, and you’re likely to waste your time and resources with nothing to show for it.
Always remember that when it comes to markets, past performance is not a good predictor of future returns – looking in the rear-view mirror is a bad way to drive. Machine learning, on the other hand, is applicable to datasets where the past is a good predictor of the future.

Related

How can Keras predict sequences of sales (individually) of 11106 distinct customers, each a series of varying length (anywhere from 1 to 15 periods)?

I am approaching a problem that Keras must offer an excellent solution for, but I am having problems developing an approach (because I am such a neophyte concerning anything to do with deep learning). I have sales data. It contains 11106 distinct customers, each with its own time series of purchases of varying length (anywhere from 1 to 15 periods).
I want to develop a single model to predict each customer's purchase amount for the next period. I like the idea of an LSTM, but clearly, I cannot make one for each customer; even if I tried, there would not be enough data for an LSTM in any case---the longest individual time series only has 15 periods.
I have used types of Markov chains, clustering, and regression in the past to model this kind of data. I am asking the question here, though, about what type of model in Keras is suited to this type of prediction. A complication is that all customers can be clustered by their overall patterns. Some belong together based on similarity; others do not; e.g., some customers spend with patterns like $100-$100-$100, others like $100-$100-$1000-$10000, and so on.
Can anyone point me to a type of sequential model supported by Keras that might handle this well? Thank you.
I am trying to achieve this in R. I haven't been able to build a model that gives me more than about 0.3 accuracy.
I don't think the main difficulty comes from which model to use as much as from how to frame the problem.
As you mention, "WHO" is spending the money seems as relevant as their past transactions in knowing how much they will likely spend.
But you cannot train 10k+ models either, one for each customer.
Instead, I would suggest clustering your customer base and fitting a model per cluster, using all the time series of the customers in that cluster combined to train the same model.
This would allow each model to learn the spending pattern of that particular group.
For that you can use an LSTM or RNN model (a sketch of the clustering step follows below).
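To illustrate the idea, here is a hedged sketch of the clustering step in R; the data frame sales with columns customer_id, period and amount is a hypothetical stand-in for your data, and the number of clusters is arbitrary:

library(dplyr)
# summarise each customer's spending pattern with simple features
features <- sales %>%
  group_by(customer_id) %>%
  summarise(mean_amount = mean(amount),
            sd_amount   = sd(amount),
            n_periods   = n()) %>%
  mutate(sd_amount = ifelse(is.na(sd_amount), 0, sd_amount))  # single-purchase customers
# scale the features and cluster the customer base into, say, 5 groups
km <- kmeans(scale(features[, c("mean_amount", "sd_amount", "n_periods")]), centers = 5)
features$cluster <- km$cluster
# the series of all customers in one cluster are then pooled to train one model per cluster
cluster_1_ids <- features$customer_id[features$cluster == 1]
cluster_1_sales <- sales %>% filter(customer_id %in% cluster_1_ids)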
Hi, here's my suggestion, and I will edit it later to provide you with more information.
Since it's a sequence problem, you should use RNN-based models: LSTMs, GRUs (see the sketch below).
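As a rough illustration only, a small GRU model in the R keras interface might look like the sketch below; x_train (an array of shape customers x 15 x 1, zero-padded to 15 periods) and y_train are hypothetical:

library(keras)
model <- keras_model_sequential() %>%
  layer_masking(mask_value = 0, input_shape = c(15, 1)) %>%  # skip the zero padding
  layer_gru(units = 32) %>%
  layer_dense(units = 1)                                     # next-period purchase amount
model %>% compile(optimizer = "adam", loss = "mse")
# model %>% fit(x_train, y_train, epochs = 20, batch_size = 64, validation_split = 0.2)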

What are some R packages for dealing with multivariate time series for data sets with multiple observations?

I am trying to figure out how to approach a data problem that includes observations of multiple equipment units' pressure and temperature measures. The measures are available for a few years as daily or nearly daily values.
This seems like a (multivariate) time series problem, and I have found some quality examples. However, because the data set consists of multiple measures taken for each equipment unit, I am a bit stumped on how to proceed. Should I fit a separate time series model for each piece of equipment? This seems intuitively wrong, but I am really not sure which package, or even which approach, I can use to work through this.
I would very much appreciate a recommendation or link to some resources.

Z-score normalizing an R data frame consecutively

I would like to normalize an R data.frame by computing the z-score using the function scale().
However, I am not sure whether this approach is subject to "look-ahead bias", which is a finance term for constructing features from information that would not have been known or available during the period being analyzed.
These are stock returns, and I want to use this data for a "backtest" (a finance term for validation). I want to make sure that each period's z-score only uses data available up to that point, not the mean and standard deviation of the entire series.
Does anyone know how to perform the calculation for this? Or is there a different approach?
You can normalize data or create new features using normalization without worrying about "look-ahead" bias. It's very common.
You just don't use any data that would not have been available in the period being analyzed.
Much like with target encoding or other feature-engineering techniques, you simply create those features on a training subset of your historical data, then validate on a validation split. You may also consider k-fold cross-validation.
If you'd like to augment your question with a reproducible example I can show you.
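In the meantime, here is only a rough sketch of an expanding-window z-score in base R; the simulated vector returns is a hypothetical stand-in for your stock returns in chronological order:

# each period's z-score uses only the history available up to that period
rolling_z <- function(x, min_obs = 20) {
  z <- rep(NA_real_, length(x))
  for (t in seq(min_obs, length(x))) {
    past <- x[1:t]                       # data available up to period t
    z[t] <- (x[t] - mean(past)) / sd(past)
  }
  z
}
set.seed(1)
returns <- rnorm(250, mean = 0, sd = 0.02)   # stand-in for daily stock returns
z <- rolling_z(returns)

Whether the current observation itself belongs in the window is a judgement call; use x[1:(t - 1)] instead if you want a strictly prior window.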

Can I use an RBF neural network to forecast a time series in R?

There is a time series of the number of jobs in manufacturing from 1978 to 2017. I want to use a Radial Basis Function neural network to forecast the number of jobs two years ahead. Is it possible? If it is, could you please write the code in R? Many thanks! I wrote some code here:
install.packages("RSNNS")
library(RSNNS)
data <- read.csv("jobs.csv",header = TRUE)
tsA01 <- ts(data$`A-01`,start = c(1978,2),end = c(2017,1),frequency = 4)
Part of the data is shown in the image below:
Looking at the example of the data, you have a very simple dataset: a response variable (number of jobs) and a single covariate (date). If that is truly the limit of your data, there is no need for a neural network approach. Neural networks and other supervised machine learning approaches are really only necessary when you have many features (i.e., covariates, often denoted p), typically such that p >> n (the number of observations). In this specific case, I would start with a simple linear regression that takes things like month or season into account as covariates. If the regression looks good, you can then make predictions about future time points (a sketch follows below).
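For example, a simple trend-plus-season regression on the quarterly series from the question might look like the sketch below (it assumes the tsA01 object built above; the 8-quarter horizon and quarter sequence are illustrative):

# trend + quarterly seasonality, using the tsA01 series from the question
df_jobs <- data.frame(jobs    = as.numeric(tsA01),
                      trend   = seq_along(tsA01),
                      quarter = factor(cycle(tsA01)))
fit <- lm(jobs ~ trend + quarter, data = df_jobs)
summary(fit)
# forecast the next 8 quarters (two years); the series ends in 2017 Q1
future <- data.frame(trend   = max(df_jobs$trend) + 1:8,
                     quarter = factor(rep(c(2, 3, 4, 1), 2), levels = levels(df_jobs$quarter)))
predict(fit, newdata = future)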
If you do have more complex data than you alluded to in your question, there is a great textbook for machine learning that is available online for free. It includes a number of laboratory chapters written in R to help guide you through various analyses, but I would invest the time in reading about the pros and cons of the various approaches before you decide to use neural networks specifically. You can find the textbook here: http://www-bcf.usc.edu/~gareth/ISL/ (just click "Download the book PDF").

Fast Fourier Transform and Clustering of Time Series

I'm working on a project connected with identifying the dynamics of sales. This is what a piece of my database looks like: http://imagizer.imageshack.us/a/img854/1958/zlco.jpg. There are three columns:
Product - the product group
Week - time since the product launch (in weeks), first 26 weeks
Sales_gain - how the sales of the product change by week
In the database there are 3302 observations = 127 time series
My aim is to cluster the time series into groups that show different dynamics of sales. Before clustering, I want to use the Fast Fourier Transform to turn the time series into vectors, taking amplitude etc. into consideration, and then use a distance algorithm to group the products.
This is my first time dealing with FFT and clustering, so I would be grateful if anybody could point out the steps I have to take before/after using FFT to group the dynamics of sales. I want to do all the steps in R, so it would be wonderful if somebody could tell me which procedures to use for each step.
This is what my time series look like now: http://imageshack.com/a/img703/6726/sru7.jpg
Please note that I am relatively new to time series analysis (that's why I cannot put my code here), so any clarity you could provide in R, or any package you could recommend that would accomplish this task efficiently, would be appreciated.
P.S. Instead of FFT, I found code for the DWT here -> www.rdatamining.com/examples/time-series-clustering-classification, but I cannot use it on my database and time series (it suggests using R to analyze a new time series after 26 weeks). Can somebody explain it to me?
You may have too little data for FFT/DWT to make sense. DTW may be better, but I also don't think it makes sense for sales data - why would there be an x-week temporal offset from one location to another? It's not as if the data were captured at unknown starting weeks.
FFT and DWT are good when your data has interesting repetitive patterns and you have A) a good temporal resolution (for audio data, e.g. 16000 Hz - I am talking about thousands of data points!) and B) no idea of what frequencies to expect. If you know, for example, that you will have weekly patterns (e.g. no sales on Sundays), then you should filter them out with other algorithms instead.
DTW (dynamic time warping) is good when you don't know when events start and how they align. Say you are capturing heart measurements. You cannot expect the hearts of two subjects to beat in synchronization. DTW will try to align this data, and may (or may not) succeed in matching, e.g., an anomaly in the heartbeats of two subjects. In theory...
Maybe you don't need specialized time series methods here at all:
A) your data has too low a temporal resolution
B) your data is already perfectly aligned
Maybe all you need is to spend more time preprocessing your data, in particular on normalization, to be able to capture similarity (a sketch follows below).
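As a hypothetical sketch of that normalization-plus-clustering route in R (assuming a data frame sales with columns Product, Week and Sales_gain and a complete 26-week series per product; the choice of four clusters is arbitrary):

library(tidyr)
# reshape to one row per product, one column per week
wide <- pivot_wider(sales, names_from = Week, values_from = Sales_gain)
mat  <- as.matrix(wide[, -1])
# z-score each product's series so the clustering compares shape, not scale
mat_scaled <- t(scale(t(mat)))
# Euclidean distance + Ward clustering, cut into e.g. 4 groups
hc <- hclust(dist(mat_scaled), method = "ward.D2")
groups <- cutree(hc, k = 4)
table(groups)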
