RNN problems with multi-feature and separate label set (in TensorFlow)

I'm learning to use an RNN to predict a market index, e.g. the S&P 500 (note: that's the S&P index, not 500 different companies). Aside from price-change data I'm also feeding in other inputs such as RSI, MACD and EMA.
I then have three labels, the 1-week, 2-week and 3-week future price changes, which I load from a separate CSV.
Say this is my example data (completely made up):
price change   RSI   MACD   EMA
         0.3   3.2    0.1   0.0
        -0.1   3.1    0.1   0.0
        -1.2   3.8    0.1   0.2
         0.9   2.7    0.1   0.2
         1.3   1.7    0.2   0.2
I then have a separate CSV for the labels
1w future price change %   2w future price change %   3w future price change %
                     1.2                        1.8                       -0.3
                     0.8                        0.2                        1.1
                     0.2                        1.5                        0.7
                     1.2                        1.7                        0.1
                    -0.2                        1.8                       -0.3
My trouble is that I can only find examples that use a single feature and/or use future values of the training data as the labels, whereas I use a separately defined set of data for the labels.
I've cobbled together the code below, but I get a shape feed error on this line:
mse = loss.eval(feed_dict={X: trX, Y: trY})
I suspect the format of my data is wrong, as it's still in the format I used to train a 'normal' feed-forward network. Some reshaping is probably needed, but to be honest I've no idea what format to use because of the multiple features. I might also have defined the model incorrectly(?).
I'd be grateful if someone could help me with this.
I also have an additional question: previously (as you'll see in the code) I would shuffle the data, which is fine for a mini-batch feed-forward NN, but how would that work with an RNN, where I assume you need to present the data in sequential order? Following on from that, let's say I adapted this for individual stocks (rather than a market index); would I need to present the data on a stock-by-stock basis to build up the moving window, rather than on a day-to-day basis? Obviously on a day-to-day basis each row of data would be for a different stock.
Sorry for all the questions, still getting my head around RNNs!
import tensorflow as tf
import numpy as np
import pandas as pd
import datetime
from sklearn.model_selection import train_test_split

# hyperparameters
epochs = 600
batch_size = 128
num_hidden = 100

df = pd.read_csv('C:\\python\\MarketData-Inputs.csv', header=None)
ldf = pd.read_csv('C:\\python\\MarketData-Results.csv', header=None)

# 20% test, shuffle the data, and use random state for like-like comparison between runs
trX, teX, trY, teY = train_test_split(df, ldf, test_size=0.2, shuffle=True, random_state=42)

trX = trX.values.astype('float')
trY = trY.values.astype('float')
teX = teX.values.astype('float')
teY = teY.values.astype('float')

print(trX.shape)
print(trY.shape)
print(teX.shape)
print(teY.shape)

# data params
features_size = len(trX[0])
labels_size = len(trY[0])
step_size = 3

tf.reset_default_graph()

X = tf.placeholder("float", [None, step_size, features_size], name="X")
Y = tf.placeholder("float", [None, labels_size], name="Y")

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=num_hidden, activation=tf.nn.relu)
rnn_outputs, _ = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, num_hidden])
stacked_outputs = tf.layers.dense(stacked_rnn_outputs, labels_size)
outputs = tf.reshape(stacked_outputs, [-1, step_size, labels_size])

with tf.name_scope("loss"):
    loss = tf.reduce_sum(tf.square(outputs - Y))
    training_op = tf.train.AdamOptimizer().minimize(loss)
    tf.summary.scalar("loss", loss)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    for ep in range(epochs):
        sess.run(training_op, feed_dict={X: trX, Y: trY})
        if ep % 100 == 0:
            mse = loss.eval(feed_dict={X: trX, Y: trY})
            print(ep, "\tMSE:", mse)
    y_pred = sess.run(stacked_outputs, feed_dict={X: teX})
    print(y_pred)

Okay, I think I've sussed it. In my example I'm not batching the data, but tf.nn.dynamic_rnn expects a rank-3 input tensor, so I need to add an extra dimension to my input data to give it the form [batch, sequence, feature size], which I can do here:
trX = np.expand_dims( trX.values.astype('float'), axis=0)
trY = trY.values.astype('float')
teX = np.expand_dims( teX.values.astype('float'), axis=0)
teY = teY.values.astype('float')
and my placeholders look like this:
X = tf.placeholder("float", [None, sequence_size, features_size], name="X")
Y = tf.placeholder("float", [None, labels_size], name="Y")
Regarding my question about shuffling data: I think I can now also answer that myself. The sequence order of the data should be preserved, because the order is exactly what an RNN learns from (that's the whole point of them). Instead, the data can be shuffled at the batch level: each batch is a contiguous subset of the entire dataset, and the batches themselves are presented in random order. So if the entire dataset is 1000 time steps and I use a batch size of 100, I can create 10 batches and then present those batches to the RNN in shuffled order, as sketched below.
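To make the batching idea concrete, here is a minimal numpy sketch (just an illustration of the idea above, not part of my actual code; features_2d and labels_2d stand for the raw [time steps, columns] arrays straight from the CSVs, before any expand_dims, and the block length of 100 is arbitrary):
import numpy as np

def contiguous_batches(features_2d, labels_2d, batch_len):
    # Split the time series into contiguous blocks of batch_len time steps and
    # return the blocks in shuffled order; the order *inside* each block is preserved.
    n_batches = features_2d.shape[0] // batch_len
    X = features_2d[:n_batches * batch_len].reshape(n_batches, batch_len, -1)
    y = labels_2d[:n_batches * batch_len].reshape(n_batches, batch_len, -1)
    order = np.random.permutation(n_batches)
    return X[order], y[order]

# e.g. 1000 time steps with batch_len=100 gives 10 shuffled blocks of shape [100, n_features]
X_blocks, y_blocks = contiguous_batches(features_2d, labels_2d, batch_len=100)
X_blocks already has the rank-3 shape [n_batches, batch_len, n_features] that the RNN input expects, so blocks can be fed either one at a time or all together, with the labels coming along in matching blocks.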
Hope this helps someone in future.

Related

Octave generating a random number with known probability

I want to generate a random number within a range and with a given probability in Octave, but I'm not sure how to:
0.5 chance of 1 - 50
0.3 chance of 51 - 80
0.2 chance of 81 - 100
thx
Use randi to generate integers in those ranges, in combination with randsample (from the Statistics package) to apply that bias.
pkg load statistics;
R = randsample([randi(50), randi([51 80]), randi([81 100])], 1, true, ...
               [0.50, 0.3, 0.2]);

In R, how do I retrieve information from an XMeans output

I have a data frame, df, containing the x and y coordinates of a bunch of points. Here's an excerpt:
> tail(df)
x y
1495 0.627174 0.120215
1496 0.616036 0.123623
1497 0.620269 0.122713
1498 0.630231 0.110670
1499 0.611844 0.111593
1500 0.412236 0.933250
I am trying to find out the most appropriate number of clusters. Ultimately the goal is to do this with tens of thousands of these data frames, so the method of choice must be quick and can't be visual. Based on those requirements, it seems like the RWeka package is the way to go.
I managed to successfully load the RWeka package (I had to install the Java SE Runtime on my computer first) and also the XMeans Weka package, and run it:
library("RWeka")                   # requires Java SE Runtime
WPM("refresh-cache")               # Build Weka package metadata cache
WPM("install-package", "XMeans")   # Install XMeans package if not previously installed
weka_ctrl <- Weka_control(         # Create a Weka control object to specify our parameters
    I = 100,                             # max no iterations overall
    M = 100,                             # max no iterations in the kmeans loop
    L = 2,                               # min no clusters
    H = 5,                               # max no clusters
    D = "weka.core.EuclideanDistance",   # distance metric
    C = 0.4, S = 1)
x_means <- XMeans(df, control = weka_ctrl)   # run algorithm on data
This produces exactly the result I want:
XMeans
======
Requested iterations : 100
Iterations performed : 1
Splits prepared : 2
Splits performed : 0
Cutoff factor : 0.4
Percentage of splits accepted
by cutoff factor : 0 %
------
Cutoff factor : 0.4
------
Cluster centers : 2 centers
Cluster 0
0.4197712002617799 0.9346986806282739
Cluster 1
0.616697959239131 0.11564350951086963
Distortion: 30.580934
BIC-Value : 2670.359509
I can assign each point in my data-frame to a cluster by running x_means$class_ids.
However, I would like to have a way of retrieving the coordinates of the cluster centres. I can see them in the output and write them down manually, but if I am to run tens of thousands of these, I need to be able to have a piece of code that saves them into a variable. I can't seem to subset x_means by using square brackets, so I don't know what else to do.
Thank you so much in advance for your help!
The centers do not seem to be directly stored in the structure that is returned. However, since the structure does tell you which cluster each point belongs to, it is easy to compute the centers. Since you do not provide your data, I will illustrate with the built-in iris data.
As you observed, printing out the result shows the centers. We can use this to check the result.
x_means <- XMeans(iris[,1:4], control = weka_ctrl)
x_means
## Output truncated to just the interesting part.
Cluster centers : 2 centers
Cluster 0
6.261999999999998 2.872000000000001 4.906000000000001 1.6760000000000006
Cluster 1
5.005999999999999 3.428000000000001 1.4620000000000002 0.2459999999999999
So here's how to compute that:
colMeans(iris[x_means$class_ids==0,1:4])
Sepal.Length Sepal.Width Petal.Length Petal.Width
6.262 2.872 4.906 1.676
colMeans(iris[x_means$class_ids==1,1:4])
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.006 3.428 1.462 0.246
The results agree.

List in Predict Function

I am learning R and trying to understand one concept in building a model:
The data:
Time Counts
1 0 126.6
2 1 101.8
3 2 71.6
etc...
The model:
Time2 <- Time^2
quadratic.model <-lm(Counts ~ Time + Time2)
The prediction:
timevalues <- seq(0, 30, 0.1)
predictedcounts <- predict(quadratic.model,list(Time=timevalues, Time2=timevalues^2))
I don't understand this part of the above function.
list(Time=timevalues, Time2=timevalues^2)
What exactly is the list doing? Is there a more intuitive way to accomplish the same thing?
The list is specifying what values of Time and Time2 should be used for prediction. If you had different time values (say from a cross validation set) called TimeValuesB, then by setting list(Time = TimeValuesB, Time2 = TimeValuesB^2) you could obtain the predicted output for these new data values.
However, if you just want to obtain the predictions from the original data you can omit the list. So in your case
predictedcounts <- predict(quadratic.model)
should work just fine.

KS test for power law

I'm attempting to fit a power-law distribution to a data set, using the method outlined by Aaron Clauset, Cosma Rohilla Shalizi and M.E.J. Newman in their paper "Power-Law Distributions in Empirical Data".
I've found code to compare to my own, but I'm a bit mystified about where some of it comes from. The story thus far:
To identify a suitable xmin for the power-law fit, we take each possible xmin, fit a power law to the data above it, compute the corresponding exponent (a), then compute the KS statistic (D) between the fit and the observed data, and finally pick the xmin that corresponds to the minimum of D. The KS statistic is computed as follows:
cx <- c(0:(n-1))/n     # n is the sample size for the data >= xmin
cf <- 1-(xmin/z)^a     # the CDF for a power law, where z = x[x >= xmin]
D <- max(abs(cf-cx))
What I don't get is where cx comes from; surely we should be comparing the distance between the empirical distribution and the calculated distribution, something along the lines of:
cx = ecdf(sort(z))
cf <- 1-(xmin/z)^a
D <- max(abs(cf-cx(z)))
I think I'm just missing something very basic, but please do correct me!
The answer is that they are (almost) the same. The easiest way to see this is to generate some data:
z = sort(runif(5, xmin, 10*xmin))
n = length(z)
Then examine the values of the two CDFs
R> (cx1 = c(0:(n-1))/n)
[1] 0.0 0.2 0.4 0.6 0.8
R> (cx2 = ecdf(sort(z)))
[1] 0.2 0.4 0.6 0.8 1.0
Notice that they are almost the same: essentially cx1 gives the CDF for "greater than or equal to", whilst cx2 gives "greater than".
The advantage of the top approach is that it is very efficient and quick to calculate. The disadvantage is that if your data isn't truly continuous, e.g. z = c(1, 1, 2), cx1 is wrong. But then you shouldn't be fitting your data with a continuous distribution if that were the case.

Statistical inefficiency (block-averages)

I have a series of data obtained from a molecular dynamics simulation, so the points are sequential in time and correlated to some extent. I can calculate the mean as the average of the data; I want to estimate the error associated with the mean calculated in this way.
According to this book I need to calculate the "statistical inefficiency", or roughly the correlation time of the data in the series. For this I have to divide the series into blocks of varying length and, for each block length (t_b), calculate the variance of the block averages (v_b). Then, if the variance of the whole series is v_a (that is, v_b when t_b = 1), I have to obtain the limit, as t_b tends to infinity, of (t_b*v_b/v_a), and that limit is the inefficiency s.
Then the error in the mean is sqrt(v_a*s/N), where N is the total number of points. So this means that only one in every s points is effectively uncorrelated.
I assume this can be done with R, and maybe there's some package that does it already, but I'm new to R. Can anyone tell me how to do it? I have already found out how to read the data series and calculate the mean and variance.
A data sample, as requested:
# t(ps) dH/dl(kJ/mol)
0.0000 582.228
0.0100 564.735
0.0200 569.055
0.0300 549.917
0.0400 546.697
0.0500 548.909
0.0600 567.297
0.0700 638.917
0.0800 707.283
0.0900 703.356
0.1000 685.474
0.1100 678.07
0.1200 687.718
0.1300 656.729
0.1400 628.763
0.1500 660.771
0.1600 663.446
0.1700 637.967
0.1800 615.503
0.1900 605.887
0.2000 618.627
0.2100 587.309
0.2200 458.355
0.2300 459.002
0.2400 577.784
0.2500 545.657
0.2600 478.857
0.2700 533.303
0.2800 576.064
0.2900 558.402
0.3000 548.072
... and this goes on until 500 ps. Of course, the data I need to analyze is the second column.
Suppose x is holding the sequence of data (e.g., data from your second column).
v = var(x)
m = mean(x)
n = length(x)
si = c()
for (t in seq(2, 1000)) {
    nblocks = floor(n/t)
    xg = split(x[1:(nblocks*t)], factor(rep(1:nblocks, rep(t, nblocks))))
    v2 = sum((sapply(xg, mean) - m)**2)/nblocks
    # v rather than v1
    si = c(si, t*v2/v)
}
plot(si)
Plotting si this way for some of my own time-series data, you have your lower limit of t_b where the curve of si becomes approximately flat (slope = 0). See http://dx.doi.org/10.1063/1.1638996 as well.
There are a couple of different ways to calculate the statistical inefficiency, or integrated autocorrelation time. The easiest, in R, is with the coda package. It has a function, effectiveSize, which gives you the effective sample size, i.e. the total number of samples divided by the statistical inefficiency. The asymptotic estimator for the standard deviation of the mean is sd(x)/sqrt(effectiveSize(x)).
require('coda')
n_eff = effectiveSize(x)
se_mean = sd(x) / sqrt(n_eff)   # standard error of the mean, as above
Well, it's never too late to contribute to a question, is it?
As I'm doing some molecular simulation myself, I stumbled upon this problem but did not see this thread until now. I found that the method proposed by Allen & Tildesley seems a bit outdated compared to modern error-analysis methods. The rest of the book is good enough to be worth a look, though.
While Sunhwan Jo's answer is correct concerning the block-averages method, for error analysis you can also find other methods such as the jackknife and bootstrap methods (closely related to one another) here: http://www.helsinki.fi/~rummukai/lectures/montecarlo_oulu/lectures/mc_notes5.pdf
In short, with the bootstrap method you make a series of artificial samples by drawing randomly, with replacement, from your data and calculate the value you want on each new sample. I wrote a short piece of Python code to work some data out (numpy is the only import needed):
import numpy

def Bootstrap(data):
    B = 100                      # arbitrary number of artificial samplings
    means = numpy.zeros(B)
    sizeB = data.shape[0] // 4   # arbitrary resample size, proportional to the size of your
                                 # sampling (assuming you pass a one-dimensional numpy array;
                                 # for a multi-column array, index the column you use)
    for n in range(B):
        for i in range(sizeB):
            # draw one observation at random, with replacement
            means[n] = means[n] + data[numpy.random.randint(0, high=data.shape[0])]
        # assuming your desired value is the mean of the values; any calculation is ok
        means[n] = means[n] / sizeB
    es = numpy.std(means, ddof=1)   # spread of the resampled means = statistical error
    return es
I know it can be upgraded but it's a first shot. With your data, I get the following:
Mean = 594.84368
Std = 66.48475
Statistical error = 9.99105
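For reference, a hypothetical call producing numbers like these could look as follows (the file name and column layout are assumptions based on the data sample in the question, not part of my original code):
# hypothetical usage: load a two-column file like the sample in the question
# and bootstrap the dH/dl column (the file name "dhdl.dat" is an assumption)
data = numpy.loadtxt("dhdl.dat", comments="#")
x = data[:, 1]                      # second column: dH/dl values
print("Mean              = %.5f" % numpy.mean(x))
print("Std               = %.5f" % numpy.std(x, ddof=1))
print("Statistical error = %.5f" % Bootstrap(x))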
I hope this helps anyone stumbling across this problem in statistical analysis of data. If I'm wrong about anything (first post, and I'm no mathematician), any correction is welcome.
