What's the difference between a greedy decoder RNN and a beam decoder with k=1? - recurrent-neural-network

Given a state vector we can recursively decode a sequence in a greedy manner by generating each output successively, where each prediction is conditioned on the previous output. I read a paper recently that described using beam search during decoding with a beam size of 1 (k=1). If we are only retaining the best output at each step, isn't this just the same thing as greedy decoding, and offers none of the benefits typically provided by beam search?

Finally found an answer: beam size of 1 is the same as greedy search.
From "Abstractive Sentence Summarization with Attentive Recurrent Neural Networks":
"k refers to the size of the beam for generation; k = 1 implies greedy generation."

Related

doubts regarding batch size and time steps in RNN

In Tensorflow's tutorial of RNN: https://www.tensorflow.org/tutorials/recurrent
. It mentions two parameters: batch size and time steps. I am confused by the concepts. In my opinion, RNN introduces batch is because the fact that the to-train sequence can be very long such that backpropagation cannot compute that long(exploding/vanishing gradients). So we divide the long to-train sequence into shorter sequences, each of which is a mini-batch and whose size is called "batch size". Am I right here?
Regarding time steps, RNN consists of only a cell (LSTM or GRU cell, or other cell) and this cell is sequential. We can understand the sequential concept by unrolling it. But unrolling a sequential cell is a concept, not real which means we do not implement it in unroll way. Suppose the to-train sequence is a text corpus. Then we feed one word each time to the RNN cell and then update the weights. So why do we have time steps here? Combining my understanding of the above "batch size", I am even more confused. Do we feed the cell one word or multiple words (batch size)?
Batch size pertains to the amount of training samples to consider at a time for updating your network weights. So, in a feedforward network, let's say you want to update your network weights based on computing your gradients from one word at a time, your batch_size = 1.
As the gradients are computed from a single sample, this is computationally very cheap. On the other hand, it is also very erratic training.
To understand what happen during the training of such a feedforward network,
I'll refer you to this very nice visual example of single_batch versus mini_batch to single_sample training.
However, you want to understand what happens with your num_steps variable. This is not the same as your batch_size. As you might have noticed, so far I have referred to feedforward networks. In a feedforward network, the output is determined from the network inputs and the input-output relation is mapped by the learned network relations:
hidden_activations(t) = f(input(t))
output(t) = g(hidden_activations(t)) = g(f(input(t)))
After a training pass of size batch_size, the gradient of your loss function with respect to each of the network parameters is computed and your weights updated.
In a recurrent neural network (RNN), however, your network functions a tad differently:
hidden_activations(t) = f(input(t), hidden_activations(t-1))
output(t) = g(hidden_activations(t)) = g(f(input(t), hidden_activations(t-1)))
=g(f(input(t), f(input(t-1), hidden_activations(t-2)))) = g(f(inp(t), f(inp(t-1), ... , f(inp(t=0), hidden_initial_state))))
As you might have surmised from the naming sense, the network retains a memory of its previous state, and the neuron activations are now also dependent on the previous network state and by extension on all states the network ever found itself to be in. Most RNNs employ a forgetfulness factor in order to attach more importance to more recent network states, but that is besides the point of your question.
Then, as you might surmise that it is computationally very, very expensive to calculate the gradients of the loss function with respect to network parameters if you have to consider backpropagation through all states since the creation of your network, there is a neat little trick to speed up your computation: approximate your gradients with a subset of historical network states num_steps.
If this conceptual discussion was not clear enough, you can also take a look at a more mathematical description of the above.
I found this diagram which helped me visualize the data structure.
From the image, 'batch size' is the number of examples of a sequence you want to train your RNN with for that batch. 'Values per timestep' are your inputs.' (in my case, my RNN takes 6 inputs) and finally, your time steps are the 'length', so to speak, of the sequence you're training
I'm also learning about recurrent neural nets and how to prepare batches for one of my projects (and stumbled upon this thread trying to figure it out).
Batching for feedforward and recurrent nets are slightly different and when looking at different forums, terminology for both gets thrown around and it gets really confusing, so visualizing it is extremely helpful.
Hope this helps.
RNN's "batch size" is to speed up computation (as there're multiple lanes in parallel computation units); it's not mini-batch for backpropagation. An easy way to prove this is to play with different batch size values, an RNN cell with batch size=4 might be roughly 4 times faster than that of batch size=1 and their loss are usually very close.
As to RNN's "time steps", let's look into the following code snippets from rnn.py. static_rnn() calls the cell for each input_ at a time and BasicRNNCell::call() implements its forward part logic. In a text prediction case, say batch size=8, we can think input_ here is 8 words from different sentences of in a big text corpus, not 8 consecutive words in a sentence.
In my experience, we decide the value of time steps based on how deep we would like to model in "time" or "sequential dependency". Again, to predict next word in a text corpus with BasicRNNCell, small time steps might work. A large time step size, on the other hand, might suffer gradient exploding problem.
def static_rnn(cell,
inputs,
initial_state=None,
dtype=None,
sequence_length=None,
scope=None):
"""Creates a recurrent neural network specified by RNNCell `cell`.
The simplest form of RNN network generated is:
state = cell.zero_state(...)
outputs = []
for input_ in inputs:
output, state = cell(input_, state)
outputs.append(output)
return (outputs, state)
"""
class BasicRNNCell(_LayerRNNCell):
def call(self, inputs, state):
"""Most basic RNN: output = new_state =
act(W * input + U * state + B).
"""
gate_inputs = math_ops.matmul(
array_ops.concat([inputs, state], 1), self._kernel)
gate_inputs = nn_ops.bias_add(gate_inputs, self._bias)
output = self._activation(gate_inputs)
return output, output
To visualize how these two parameters are related to the data set and weights, Erik Hallström's post is worth reading. From this diagram and above code snippets, it's obviously that RNN's "batch size" will no affect weights (wa, wb, and b) but "time steps" does. So, one could decide RNN's "time steps" based on their problem and network model and RNN's "batch size" based on computation platform and data set.

Game Theory with prediction

To impress two (german) professors i try to improve the game theory.
AI in Computergames.
Game Theory:  Intelligence is a well educated proven Answer to an Question.
This means a thoughtfull decision is choosing an act who leads to an optimal result.
Question -> Resolution -> Answer -> Test (Check)
For Example one robot is fighting another robot.
This robot has 3 choices:
-move forward
-hold position
-move backward
The resulting Programm is pretty simple
randomseed = initvalue;
while (one_is_alive)
{
choice = randomselect(options,probability);
do_choice(roboter);
}  
We are using pseudorandomness.
The test for success is simply did he elimate the opponent.
The robots have automatically shooting weapons :
struct weapon
{
range
damage
}
struct life
{
hitpoints
}
Now for some Evolution.
We let 2 robots fight each other and remember the randomseeds.
What is the sign of a succesfull Roboter ?
struct {
ownrandomseed;
list_of_opponentrandomseed; // the array of the beaten opponents.
}
Now the question is how do we choose the right strategy against an opponent ?
 
We assume we have for every possible seed-strategy the optimal anti-strategy.  
Now the only thing we have to do is to observe the numbers from the opponent
and calculate his seed value.Then we could choose the right strategy.
For cracking the random generator we can use the manual method :
http://alumni.cs.ucr.edu/~jsun/random-number.pdf
or the brute Force :
https://jazzy.id.au/2010/09/20/cracking_random_number_generators_part_1.html
It depends on the algorithm used to generate the (pseudo) random numbers. If the pseudo random number generator algorithm is known, you can guess the seed by observing a number of states (robot moves). This is similar to brute force guessing a password, used for encryption, as, some encryption algorithms are known as stream ciphers, and are basically (sometimes exactly), a one time pad that is used to obfuscate the data. Now, lets say that you know the pseudorandom number generator used is a simple lagged fibonacci generator. Then, you know that they are generating each number by calculating x(n) = x(n - 2) + x(n - 3) % 3. Therefore, by observing 3 different robot moves, you will then be able to predict all of the future moves. The seed, is the first 3 numbers supplied that give the sequence you observe. Now, most random number generators are not this simple, some have up to 1024 bit length seeds, and would be impossible for a modern computer to cycle through all of those possibilities in a brute force manner. So basically, what you would need to do, is to find out what PRNG algorithm is used, find out all possible initial seed values, and devise an algorithm to determine the seed the opponent robot is using based upon their actions. Depending on the algorithm, there are ways of guessing the seed faster than testing each and every one. If there is a faster way of guessing such a seed, this means that the PRNG in question is not suitable from cryptographic applications, as it means passwords are easier guessed. AES256 itself has a break, but it still takes theoretically 2 ^ 111 guesses (instead of the brute force 2 ^256 guesses), which means it has been broken, technically, but 2 ^ 111 is still way too many operations for modern computers to process in a meaningful time frame.
if the PRNG was lagged fibonacci (which is never used anymore, I am just giving a simple example) and you observed that the robot did option 0, then, 1, then 2... you would then know that the next thing the robot will do is... 1, since 0 + 1 % 3 = 1. You could also backtrack, and figure out what the initial values were for this PRNG, which represents the seed.

i don't really understand FFT and sample rates

Im really confused over here. I am a ai programmer working on a game that is designed to detect beats in songs and some more. I have no previous knowledge about audio and just reading through whatever material i can find. While i got fft working and stuff I simply don't understand the way samples are transferred to different frequencies. Question 1, what does each frequency stands for. For the algorithm i got. I can transfer for example 1024 samples into 512 outcomes. So are they a description of the strength of each spectrum at the current second? it doesn't really make sense since what i remember is that there are 20,000hz in a 44.1khz audio recording. So how does 512 spectrum samples explain what is happening in that moment? Question 2, from what i read, its a number that represent the sound wave at this moment. However i read that by squaring both left channel and right channel, and add them together and you will get the current power level. Both these seems incoherent to my understanding, and i am really buff led so please explain away.
DFT output
the output is complex representation of phasor (Re,Im,Frequency) of basis function (usually sin wave). First item is DC offset so skip it. All the others are multiples of the same fundamental frequency (sampling rate/N). The output is symmetric (if the input is real only) so use just first half of results. Often power spectrum is used
Amplitude=sqrt(Re^2+Im^2)
which is the amplitude of basis function. If phase is needed then
phase=atan2(Im,Re)
beware DFT results are strongly dependent on the input signal shape,frequency and phase shift to your basis functions. That causes the output to vibrate/oscillate around the correct value and produce wide peaks instead of sharp ones for singular frequencies not to mention aliasing.
frequencies
if you got 44100Hz then the max output frequency is half of it that means the biggest frequency present in data is 22050Hz. The DFFT however does not contain this frequency so if you ignore the mirrored second half of results then:
for 4 samples DFT outputs frequencies are { -,11025 } Hz
for 8 samples frequencies are: { -,5512.5,11025,16537.5 } Hz
The output frequency is linear to its address from start so if you got N=512 samples
do DFFT on it
obtain first N/2=256 results
i-th sample represents frequency f=i*samplerate/N Hz
where i={ 1,...,(N/2)-1} ... skipping i=0
the image shows one of mine utility apps tighted together with
2-channel sound generator (top left)
2-channel oscilloscope (top right)
2-channel spectral analyzer (bottom) ... switched to linear frequency scale to make obvious what I mean in above text
zoom the image to see the settings ... I made it as close to the real devices as I could.
Here DCT and DFT comparison:
Here the DFT output dependency on input signal frequency aliasing by sampling rate
more channels
Summing power of channels is more safe. If you just add the channels then you could miss some data. For example let left channel is playing 1 Khz sin wave and the right exact opposite so if you just sum them then the result is zero but you can hear the sound .... (if you are not exactly in the middle between speakers). If you analyze each channel independently then you need to calculate DFFT for each channel but if you use power sum of channels (or abs sum) then you can obtain the frequencies for all channels at once , of coarse you need to scale the amplitudes ...
[Notes]
Bigger the N nicer the result (less aliasing artifacts and closer to the max frequency). For specific frequencies detection are FIR filter detectors more precise and faster.
Strongly recommend to read DFT and all sublinks there and also this plotting real time Data on (qwt) Oscillocope

Converting Real and Imaginary FFT output to Frequency and Amplitude

I'm designing a real time Audio Analyser to be embedded on a FPGA chip. The finished system will read in a live audio stream and output frequency and amplitude pairs for the X most prevalent frequencies.
I've managed to implement the FFT so far, but it's current output is just the real and imaginary parts for each window, and what I want to know is, how do I convert this into the frequency and amplitude pairs?
I've been doing some reading on the FFT, and I see how they can be turned into a magnitude and phase relationship but I need a format that someone without a knowledge of complex mathematics could read!
Thanks
Thanks for these quick responses!
The output from the FFT I'm getting at the moment is a continuous stream of real and imaginary pairs. I'm not sure whether to break these up into packets of the same size as my input packets (64 values), and treat them as an array, or deal with them individually.
The sample rate, I have no problem with. As I configured the FFT myself, I know that it's running off the global clock of 50MHz. As for the Array Index (if the output is an array of course...), I have no idea.
If we say that the output is a series of One-Dimensional arrays of 64 complex values:
1) How do I find the array index [i]?
2) Will each array return a single frequency part, or a number of them?
Thankyou so much for all your help! I'd be lost without it.
Well, the bad news is, there's no way around needing to understand complex numbers. The good news is, just because they're called complex numbers doesn't mean they're, y'know, complicated. So first, check out the wikipedia page, and for an audio application I'd say, read down to about section 3.2, maybe skipping the section on square roots: http://en.wikipedia.org/wiki/Complex_number
What that's telling you is that if you have a complex number, a + bi, you can picture it as living in the x,y plane at location (a,b). To get the magnitude and phase, all you have to do is find two quantities:
The distance from the origin of the plane, which is the magnitude, and
The angle from the x-axis, which is the phase.
The magnitude is simple enough: sqrt(a^2 + b^2).
The phase is equally simple: atan2(b,a).
The FFT result will give you an array of complex values. The twice the magnitude (square root of sum of the complex components squared) of each array element is an amplitude. Or do a log magnitude if you want a dB scale. The array index will give you the center of the frequency bin with that amplitude. You need to know the sample rate and length to get the frequency of each array element or bin.
f[i] = i * sampleRate / fftLength
for the first half of the array (the other half is just duplicate information in the form of complex conjugates for real audio input).
The frequency of each FFT result bin may be different from any actual spectral frequencies present in the audio signal, due to windowing or so-called spectral leakage. Look up frequency estimation methods for the details.

Explain BFS and DFS in terms of backtracking

Wikipedia about Depth First Search:
Depth-first search (DFS) is an
algorithm for traversing or searching
a tree, tree structure, or graph. One
starts at the root (selecting some
node as the root in the graph case)
and explores as far as possible along
each branch before backtracking.
So what is Breadth First Search?
"an algorithm that choose a starting
node, checks all nodes backtracks,
chooses the shortest path, chose neighbour nodes backtracks,
chose the shortest path, finally
finds the optimal path because of
traversing each path due to continuous
backtracking.
Regex find's pruning -- backtracking?
The term backtracking confuses due to its variety of use. UNIX's find pruning an SO-user explained with backtracking. Regex Buddy uses the term "catastrophic backtracking" if you do not limit the scope of your Regexes. It seems to be a too widely used umbrella-term. So:
How do you define "backtracking" specifically for Graph Theory?
What is "backtracking" in Breadth First Search and Depth First Search?
[Added]
Good definitions about backtracking and examples
The Brute-force method
Stallman's(?) invented term "dependency-directed backtracking"
Backtracking and regex example
Depth First Search definition.
The confusion comes in because backtracking is something that happens during search, but it also refers to a specific problem-solving technique where a lot of backtracking is done. Such programs are called backtrackers.
Picture driving into a neighborhood, always taking the first turn you see (let's assume there are no loops) until you hit a dead end, at which point you drive back to the intersection of the next unvisited street. This the "first" kind of backtracking, and it's roughly equivalent to colloquial usage of the word.
The more specific usage refers to a problem-solving strategy that is similar to depth-first search but backtracks when it realizes that it's not worth continuing down some subtree.
Put another way -- a naive DFS blindly visits each node until it reaches the goal. Yes, it "backtracks" on leaf nodes. But a backtracker also backtracks on useless branches. One example is searching a Boggle board for words. Each tile is surrounded by 8 others, so the tree is huge, and naive DFS can take too long. But when we see a combination like "ZZQ," we can safely stop searching from this point, since adding more letters won't make that a word.
I love these lectures by Prof. Julie Zelenski. She solves 8 queens, a sudoku puzzle, and a number substitution puzzle using backtracking, and everything is nicely animated.
Programming Abstractions, Lecture 10
Programming Abstractions, Lecture 11
A tree is a graph where any two vertices only have one path between them. This eliminates the possibility of cycles. When you're searching a graph, you will usually have some logic to eliminate cycles anyway, so the behavior is the same. Also, with a directed graph, you can't follow edges in the "wrong" direction.
From what I can tell, in the Stallman paper they developed a logic system that doesn't just say "yes" or "no" on a query but actually suggests fixes for incorrect queries by making the smallest number of changes. You can kind of see where the first definition of backtracking might come into play.

Resources