How to do reinforcement learning with an LSTM in PyTorch? - recurrent-neural-network

Due to observations not revealing the entire state, I need to do reinforcement with a recurrent neural network so that the network has some sort of memory of what has happened in the past. For simplicity let's assume that we use an LSTM.
Now the in-built PyTorch LSTM requires you to feed it a an input of shape Time x MiniBatch x Input D and it outputs a tensor of shape Time x MiniBatch x Output D.
In reinforcement learning however, to know the input at time t+1, I need to know the output at time t, because I am doing actions in an environment.
So is it possible to use the in-built PyTorch LSTM to do BPTT in a reinforcement learning setting? And if it is, how could I do it?

Maybe you can feed your input sequence in a loop to your LSTM. Something, like this:
h, c = Variable(torch.zeros()), Variable(torch.zeros())
for i in range(T):
input = Variable(...)
_, (h, c) = lstm(input, (h,c))
Every timestep you can use (h,c) and input to evaluate action for instance. As long as you do not break computational graph you can backpropagate as Variables keep all the history.

Related

Calculate output of dynamic fnn without linear algebra

Working on a project which includes evolving the fnn topology during runtime, I want to outsource the nn output calculation to my gpu. Currently using unity (soon switching to just c# libs without unity or smth similar), I use hlsl like compute shaders.
Now since I am very restricted with hlsl's syntax (Arrays/Matrices etc with dynamic index ranges, dot product and matrix functions only working on preexisting types like float2 and those "vectors" having a max length of 4 (float4)), I am looking for a formula with which I can calculate the output of the fnn within a single calculation (obviously still including a loop). That means my shader has the weights, input layer, bias and structure of the nn each as seperate structured buffers. Now I need to find a formula with which I can calculate the output without using dot products or matrix vector multiplications since it's very hard and tortuous to implement those. However I've tried finding a formula for days, hopelessly...
Does anyone have a clue for this? Sadly not so many less recourses on the internet concerning hlsl and this problem.
Thanks a lot!
EDIT:
Okay I got some hints that my actual math related question is not quite clear. Since I am kind of an artist myself, I made a little illustration:
Now, there of course are multiple ways to calculate the output O. I wanna record following:
v3 = v1w1+v2w2
v4 = v1w3+v2w4
v5 = v3w5+v4w6 = w5*(v1w1+v2w2)+w6*(v1w3+v2w4)
This is an easy way to calculate the final output O in one step:
O1 = w5*(v1w1+v2w2)+w6*(v1w3+v2w4).
There are other quite cool ways to calculate the output in one step. Looking at the weights as matrices, e.g.
m1 = [[w1, w2], [w3, w4]] and m2 = [[w5, w6]].
This way we can calculate O as
L1 = m1 * I and O = m2 * L1
or in one step
O = m2*(m1*I)
which is the more elegant way. I would prefer it this way but I cant do it with matrix-vector multiplications or any other linear algebra since I am restricted by my programming tools, so I have to stay with this shape:
O1 = w5*(v1w1+v2w2)+w6*(v1w3+v2w4).
Now this would be really really easy if I had a neural network with fixed topology. However, since it evolves during runtime I have to find a function/formula with which I can calculate O independent from the topology. All information I have is ONE list of ALL weights (w=[w1, w2, w3, w4, w5, w6]), a list of inputs (i=[v1, v2]) and the structure of the nn (s=[2, 2, 1] - 2 input nodes, one hidden layer with 2 nodes, and one output node).
However I cant think of an algorithm that calculates the output with the given information efficiently.

How to improve igraph single_paths calculation efficiency

I'm using the function all_simple_paths from the igraph R package: (1) to generate the list of all simple paths in networks (object List_paths_Mp); and (2) to calculate the total number of simple paths (object n_paths).
I'm using the function in the form:
pathsMp <- unlist(lapply(V(graphMp), function(x)all_simple_paths(graphMp, from = x)), recursive =FALSE)
List_paths_Mp <- lapply(1:length(pathsMp), function(x)as_ids(pathsMp[[x]]))
n_paths<-length(List_paths_Mp)
Where:
Mp is a square matrix with either 1 or 0 values, and graphMp is the igraph graph objected obtained through the function graph_from_adjacency_matrix.
The function does what I need, but with the increase in the number of variables and interactions the processing time to identify and store the different single paths in the network grows too much and it takes very long to get the results.
In particular, using a network with 11 variables and 60 interactions, there is a total of 146338 possible simple paths. And this already takes a long time to compute. Using a bigger network, with 13 variables and 91 interactions, causes the program to take even longer times to process (after 2 hours the function still didn't run its course, and when called to stop it crashed R).
Is there a way to increase the efficiency of the task (i.e. to get results in a faster way)? Has anyone ever encountered a similar problem and found a solution? And, I know, I could use a CPU with higher processing power, but the point is to have the function to run efficiently (as much as possible) in a normal personal computer.
Edit: here I do the calculations from the graph object, but if someone else has any idea of doing the same from the adjacency matrix, I would welcome it too!

doubts regarding batch size and time steps in RNN

In Tensorflow's tutorial of RNN: https://www.tensorflow.org/tutorials/recurrent
. It mentions two parameters: batch size and time steps. I am confused by the concepts. In my opinion, RNN introduces batch is because the fact that the to-train sequence can be very long such that backpropagation cannot compute that long(exploding/vanishing gradients). So we divide the long to-train sequence into shorter sequences, each of which is a mini-batch and whose size is called "batch size". Am I right here?
Regarding time steps, RNN consists of only a cell (LSTM or GRU cell, or other cell) and this cell is sequential. We can understand the sequential concept by unrolling it. But unrolling a sequential cell is a concept, not real which means we do not implement it in unroll way. Suppose the to-train sequence is a text corpus. Then we feed one word each time to the RNN cell and then update the weights. So why do we have time steps here? Combining my understanding of the above "batch size", I am even more confused. Do we feed the cell one word or multiple words (batch size)?
Batch size pertains to the amount of training samples to consider at a time for updating your network weights. So, in a feedforward network, let's say you want to update your network weights based on computing your gradients from one word at a time, your batch_size = 1.
As the gradients are computed from a single sample, this is computationally very cheap. On the other hand, it is also very erratic training.
To understand what happen during the training of such a feedforward network,
I'll refer you to this very nice visual example of single_batch versus mini_batch to single_sample training.
However, you want to understand what happens with your num_steps variable. This is not the same as your batch_size. As you might have noticed, so far I have referred to feedforward networks. In a feedforward network, the output is determined from the network inputs and the input-output relation is mapped by the learned network relations:
hidden_activations(t) = f(input(t))
output(t) = g(hidden_activations(t)) = g(f(input(t)))
After a training pass of size batch_size, the gradient of your loss function with respect to each of the network parameters is computed and your weights updated.
In a recurrent neural network (RNN), however, your network functions a tad differently:
hidden_activations(t) = f(input(t), hidden_activations(t-1))
output(t) = g(hidden_activations(t)) = g(f(input(t), hidden_activations(t-1)))
=g(f(input(t), f(input(t-1), hidden_activations(t-2)))) = g(f(inp(t), f(inp(t-1), ... , f(inp(t=0), hidden_initial_state))))
As you might have surmised from the naming sense, the network retains a memory of its previous state, and the neuron activations are now also dependent on the previous network state and by extension on all states the network ever found itself to be in. Most RNNs employ a forgetfulness factor in order to attach more importance to more recent network states, but that is besides the point of your question.
Then, as you might surmise that it is computationally very, very expensive to calculate the gradients of the loss function with respect to network parameters if you have to consider backpropagation through all states since the creation of your network, there is a neat little trick to speed up your computation: approximate your gradients with a subset of historical network states num_steps.
If this conceptual discussion was not clear enough, you can also take a look at a more mathematical description of the above.
I found this diagram which helped me visualize the data structure.
From the image, 'batch size' is the number of examples of a sequence you want to train your RNN with for that batch. 'Values per timestep' are your inputs.' (in my case, my RNN takes 6 inputs) and finally, your time steps are the 'length', so to speak, of the sequence you're training
I'm also learning about recurrent neural nets and how to prepare batches for one of my projects (and stumbled upon this thread trying to figure it out).
Batching for feedforward and recurrent nets are slightly different and when looking at different forums, terminology for both gets thrown around and it gets really confusing, so visualizing it is extremely helpful.
Hope this helps.
RNN's "batch size" is to speed up computation (as there're multiple lanes in parallel computation units); it's not mini-batch for backpropagation. An easy way to prove this is to play with different batch size values, an RNN cell with batch size=4 might be roughly 4 times faster than that of batch size=1 and their loss are usually very close.
As to RNN's "time steps", let's look into the following code snippets from rnn.py. static_rnn() calls the cell for each input_ at a time and BasicRNNCell::call() implements its forward part logic. In a text prediction case, say batch size=8, we can think input_ here is 8 words from different sentences of in a big text corpus, not 8 consecutive words in a sentence.
In my experience, we decide the value of time steps based on how deep we would like to model in "time" or "sequential dependency". Again, to predict next word in a text corpus with BasicRNNCell, small time steps might work. A large time step size, on the other hand, might suffer gradient exploding problem.
def static_rnn(cell,
inputs,
initial_state=None,
dtype=None,
sequence_length=None,
scope=None):
"""Creates a recurrent neural network specified by RNNCell `cell`.
The simplest form of RNN network generated is:
state = cell.zero_state(...)
outputs = []
for input_ in inputs:
output, state = cell(input_, state)
outputs.append(output)
return (outputs, state)
"""
class BasicRNNCell(_LayerRNNCell):
def call(self, inputs, state):
"""Most basic RNN: output = new_state =
act(W * input + U * state + B).
"""
gate_inputs = math_ops.matmul(
array_ops.concat([inputs, state], 1), self._kernel)
gate_inputs = nn_ops.bias_add(gate_inputs, self._bias)
output = self._activation(gate_inputs)
return output, output
To visualize how these two parameters are related to the data set and weights, Erik Hallström's post is worth reading. From this diagram and above code snippets, it's obviously that RNN's "batch size" will no affect weights (wa, wb, and b) but "time steps" does. So, one could decide RNN's "time steps" based on their problem and network model and RNN's "batch size" based on computation platform and data set.

Tensorflow Graphs : Are Tensorflow graphs DAG? What happens in assignAdd operations in tensor Variables

How is this graph acyclic? assign add op adds x to itself.
import tensorflow as tf
sess = tf.Session()
x = tf.Variable(1300,name="x")
y = tf.Variable(200, name="y")
z = tf.add(x, y,name="z")
b = x.assign_add(z)
init = tf.initialize_all_variables()
writer = tf.train.SummaryWriter("/tmp/logdir", sess.graph)
sess.run(init)
print(sess.run(b))
Clearly there is a bi-directional edge between AssignAdd and X.
Why is X depicted twice as a variable?
As Olivier points out, the graph for your program is a DAG. The graph visualizer takes some liberties when rendering the graph to make it easier to understand. In particular, there are no "bidirectional" edges in the runtime itself, but instead TensorFlow includes "reference edges" for variables, which are like passing a mutable value (like a pointer or a mutable reference) into a C/C++ function, as they allow the recipient to modify the same underlying storage used for the variable.
Note that it is legal for TensorFlow graphs to contain one or more cycles, or even nested cycles. The tf.while_loop() function provides a means of creating structured cycles to represent iterative computations, for which TensorFlow can compute gradients. However, for your use with a simple variable, you do not need a cycle.
Clearly there is a bi-directional edge between AssignAdd and X.
Each operation Assign or AssignAdd has two inputs and no output:
a tf.Variable: the variable to which we assign a value. Here you can see a bidirectional edge because the operation reads the old value from x and then writes back the new value
a tf.Tensor: the value assigned to the variable (or added to the variable)
Why is X depicted twice as a variable?
The variable x appears once in the graph in the big block named X, but is used twice:
in the operation tf.add(x, y): the graph reads the value of x as an input
in the AssignAdd operation: as said before, x is the tf.Variable input of the operation AssignAdd.
Conclusion
The graph is made acyclic because each operation wanting to update the value of x has the variable x as input, but not output. If the operation Assign had a variable as output, it would indeed lead to cycles.

Mathematical library to compare simularities in graphs of data for a high level language (eg. Javascript)?

I'm looking for something that I guess is rather sophisticated and might not exist publicly, but hopefully it does.
I basically have a database with lots of items which all have values (y) that correspond to other values (x). Eg. one of these items might look like:
x | 1 | 2 | 3 | 4 | 5
y | 12 | 14 | 16 | 8 | 6
This is just a a random example. Now, there are thousands of these items all with their own set of x and y values. The range between one x and the x after that one is not fixed and may differ for every item.
What I'm looking for is a library where I can plugin all these sets of Xs and Ys and tell it to return things like the most common item (sets of x and y that follow a compareable curve / progression), and the ability to check whether a certain set is atleast x% compareable with another set.
With compareable I mean the slope of the curve if you would draw a graph of the data. So, not actaully the static values but rather the detection of events, such as a high increase followed by a slow decrease, etc.
Due to my low amount of experience in mathematics I'm not quite sure what I'm looking for is called, and thus have trouble explaining what I need. Hopefully I gave enough pointers for someone to point me into the right direction.
I'm mostly interested in a library for javascript, but if there is no such thing any library would help, maybe I can try to port what I need.
About Markov Cluster(ing) again, of which I happen to be the author, and your application. You mention you are interested in trend similarity between objects. This is typically computed using Pearson correlation. If you use the mcl implementation from http://micans.org/mcl/, you'll also obtain the program 'mcxarray'. This can be used to compute pearson correlations between e.g. rows in a table. It might be useful to you. It is able to handle missing data - in a simplistic approach, it just computes correlations on those indices for which values are available for both. If you have further questions I am happy to answer them -- with the caveat that I usually like to cc replies to the mcl mailing list so that they are archived and available for future reference.
What you're looking for is an implementation of a Markov clustering. It is often used for finding groups of similar sequences. Porting it to Javascript, well... If you're really serious about this analysis, you drop Javascript as soon as possible and move on to R. Javascript is not meant to do this kind of calculations, and it is far too slow for it. R is a statistical package with much implemented. It is also designed specifically for very speedy matrix calculations, and most of the language is vectorized (meaning you don't need for-loops to apply a function over a vector of values, it happens automatically)
For the markov clustering, check http://www.micans.org/mcl/
An example of an implementation : http://www.orthomcl.org/cgi-bin/OrthoMclWeb.cgi
Now you also need to define a "distance" between your sets. As you are interested in the events and not the values, you could give every item an extra attribute being a vector with the differences y[i] - y[i-1] (in R : diff(y) ). The distance between two items can then be calculated as the sum of squared differences between y1[i] and y2[i].
This allows you to construct a distance matrix of your items, and on that one you can call the mcl algorithm. Unless you work on linux, you'll have to port that one.
What you're wanting to do is ANOVA, or ANalysis Of VAriance. If you run the numbers through an ANOVA test, it'll give you information about the dataset that will help you compare one to another. I was unable to locate a Javascript library that would perform ANOVA, but there are plenty of programs that are capable of it. Excel can perform ANOVA from a plugin. R is a stats package that is free and can also perform ANOVA.
Hope this helps.
Something simple is (assuming all the graphs have 5 points, and x = 1,2,3,4,5 always)
Take u1 = the first point of y, ie. y1
Take u2 = y2 - y1
...
Take u5 = y5 - y4
Now consider the vector u as a point in 5-dimensional space. You can use simple clustering algorithms, like k-means.
EDIT: You should not aim for something too complicated as long as you go with javascript. If you want to go with Java, I can suggest something based on PCA (requiring the use of singular value decomposition, which is too complicated to be implemented efficiently in JS).
Basically, it goes like this: Take as previously a (possibly large) linear representation of data, perhaps differences of components of x, of y, absolute values. For instance you could take
u = (x1, x2 - x1, ..., x5 - x4, y1, y2 - y1, ..., y5 - y4)
You compute the vector u for each sample. Call ui the vector u for the ith sample. Now, form the matrix
M_{ij} = dot product of ui and uj
and compute its SVD. Now, the N most significant singular values (ie. those above some "similarity threshold") give you N clusters.
The corresponding columns of the matrix U in the SVD give you an orthonormal family B_k, k = 1..N. The squared ith component of B_k gives you the probability that the ith sample belongs to cluster K.
If it is ok to use java you really should have a look at Weka. It is possible to access all features via java code. Maybe you find a markov clustering, but if not, they hava a lot other clustering algorithem and its really easy to use.

Resources