RNN/LSTM library with variable length sequences without bucketing or padding

The problem I am trying to solve is a classification problem with 4 parallel input batches of sequences. To do so, I need 4 RNNs/LSTMs in parallel that merge into a fully connected layer. The issue is that within each parallel batch, the sequences have variable lengths.
I cannot pad to the maximum sequence length because it uses too much RAM; some sequences are really long.
I cannot pad to a reduced length because then the model cannot predict the output: I need the full sequence, since I cannot know in advance where the interesting part of the sequence is.
I cannot use bucketing because if I split a sequence in one batch, I would have to split it the same way for each sequence with the same index in the 3 other batches. As the parallel sequences do not have the same length, the model would end up associating lots of empty sequences with one class or the other.
In theory, an RNN/LSTM should be able to learn sequences of different lengths without any sequence manipulation. Unfortunately, I do not know of an implementation that lets me do so. Does such an RNN/LSTM library exist (in any language)?

Theano can handle variable-length sequences, but TensorFlow cannot. You can test this with Theano and let us know your results.

Related

In terms of memory, what should be done when you need to constantly grow a vector to an unknown upper limit?

Suppose that you are dealing with a potentially infinite amount of data. Suppose further that you do not have this data stored in memory, but can generate individual terms at will. Finally, suppose that you want to do some experiment on this data that will involve checking a large but unknown number of terms in a way that necessitates keeping a great many of them in memory. Toy problems with Recamán's sequence, like "find the minimum number of terms needed in that sequence for the first 25 even numbers to have appeared", are what I have in mind as typical examples.
The obvious solution to this sort of problem would be to write some code like:
list <- c(firstTerm)
while ([not found enough terms yet]) {
  nextTerm <- [whatever generates the next term]
  if ([this term worked]) { list <- c(list, nextTerm) }
}
However, building a big vector like this by adding one new term at a time is your memory's worst nightmare. The alternative that I often see suggested is to pre-allocate a big vector in memory by making the first line of your code something like list<-numeric(10^6), but those solutions suppose that we have some rough idea of how many terms we need to check, which isn't always the case. So what can we do when we are dealing with an ever-growing list of unknown required length?
This is a very popular subject in R; check this answer: https://stackoverflow.com/a/45195098/5442527
Summing up:
Do not use c() to append, because assigning values by index with [ is much faster (it might seem surprising that you can still grow a pre-allocated vector this way). Create an iterator variable before the while loop and increase the index inside the if statement, as in the sketch below.
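A minimal sketch of that pattern (the stopping rule and the "term worked" test are placeholders for whatever your experiment needs):

result <- numeric(1000)                      # initial guess at the size; grows by doubling below
n <- 0                                       # number of slots actually used
while (n < 25) {                             # placeholder stopping rule
  nextTerm <- rnorm(1)                       # placeholder for generating the next term
  if (nextTerm > 0) {                        # placeholder for "this term worked"
    n <- n + 1
    if (n > length(result)) {
      length(result) <- 2 * length(result)   # double when full: amortised constant-time growth
    }
    result[n] <- nextTerm
  }
}
result <- result[seq_len(n)]                 # drop the unused tail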
Normally, as in Python, you do not have to care about this when using append. Even starting with an empty list is not a problem, since the list's reserved memory grows geometrically once you pass certain threshold numbers of elements (see: over-allocating).

Why does NStepLSTM not have a reset_state method?

I first used L.LSTM, and then I found NStepLSTM, which is not covered in the official tutorial document.
https://docs.chainer.org/en/stable/reference/generated/chainer.links.NStepLSTM.html?highlight=Nstep
1. Why does chainer.links.NStepLSTM (or chainer.links.NStepBiLSTM) not have reset_state? How do you reset the state?
2. Do you pass it a list of sequences (each one a chainer.Variable, e.g. one article containing multiple words is one Variable)? Is the purpose of this class to deal with variable-length sequences?
3. Can we use truncated BPTT to save memory with chainer.links.NStepLSTM? How?
1.
NStepLSTM takes a batch of sequences and returns a batch of output sequences, whereas LSTM takes a batch of words. You don't need a for-loop to use NStepLSTM. NStepLSTM uses cuDNN, a library that NVIDIA provides, and is very fast.
NStepLSTM does not keep a state. If you want to chain NStepLSTMs, use the outputs of NStepLSTM. See the seq2seq example: https://github.com/chainer/chainer/blob/master/examples/seq2seq/seq2seq.py
2.
Yes. It takes something like a batch of sequences of embedding vectors created from sentences. You can use sequences with different lengths. See the seq2seq example.
Note that L.NStepLSTM takes a sequence of sentences, while F.NStepLSTM takes transposed sequences; I mean it takes a sequence of batches of words. Actually, L.NStepLSTM calls F.transpose_sequences and F.NStepLSTM in its implementation.
3.
Sorry, that is difficult. As I said, NStepLSTM is a wrapper around cuDNN's RNN library. It does not support truncated BPTT. Of course, you can split the sentences and call NStepLSTM twice.

Memory & Computation Efficient Creation of Array with Repeated Elements

I am trying to find an efficient way to create a new array by repeating each element of an old array a different, specified number of times. I have come up with something that works, using array comprehensions, but it is not very efficient, either in memory or in computation:
LENGTH = 10^6 ## use an integer length so it can be used in ranges and as an index bound
A = collect(1:LENGTH) ## arbitrary values that will be repeated specified numbers of times
NumRepeats = [rand(20:100) for idx = 1:LENGTH] ## arbitrary numbers of times to repeat each value in A
B = vcat([ [A[idx] for n = 1:NumRepeats[idx]] for idx = 1:length(A) ]...)
Ideally, what I would like would be a structure akin to the sparse matrix apparatus that Julia has but that would instead store data efficiently based on the indices where repeated values occur. Barring that, I would at least like an efficient way to create a vector such as B in the example above. I looked into the repeat() function, but as far as I can tell from the documentation and my experimentation with the function, it is just for repeating slices of an array the same number of times for each slice. What is the best way to approach this?
Sounds like you're looking for run-length encoding. There's an RLEVectors.jl package here: https://github.com/phaverty/RLEVectors.jl. Not sure how usable it is. You could also make your own data type fairly easily.
Thanks for trying RLEVectors.jl. Some features and optimizations had been languishing on master without a version bump. It can definitely be mixed with other vectors for element-wise arithmetic. I'll put the linear algebra operations on the feature request list. Any additional feature suggestions would be most welcome.
RLEVectors.jl has a rep function that works like R's and RLEVectors.inverse_ree is like StatsBase.inverse_rle, but it works on run ends rather than lengths.
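For reference, base R's rep() with a times vector does exactly this element-wise repetition, which is the behaviour that RLEVectors.jl's rep mimics (a small base-R illustration with arbitrary values):

A <- 1:5                              # values to repeat
NumRepeats <- c(3, 1, 4, 2, 2)        # how many times to repeat each value
B <- rep(A, times = NumRepeats)       # 1 1 1 2 3 3 3 3 4 4 5 5
length(B) == sum(NumRepeats)          # TRUE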

Counting oligonucleotides and reverse complements

I'm starting to use R and I need some help, if possible. I need to read FASTA files and, for each species, count the frequency of each nucleotide, of dinucleotides, and so on up to words of length 10, as well as the frequency of the reverse complement. I'm using the Biostrings package. Can you help me? Thank you.
The Bioconductor Biostrings manual contains some pretty descriptive methods that match what you are looking for, with attached examples. Otherwise, you could just read in the FASTA file and keep track of how many of each base occurs (if you can't figure out the Biostrings package).
For the frequencies, simply reading from a text file (the FASTA file after removing the sequence names) is also sufficient, as long as you keep count of how many times each oligonucleotide appears.
I'm not exactly sure how you want to measure the reverse-complement frequencies, but if you kept counts for every possible word of size 10 in an array, that array wouldn't be too large (4^10 = 1,048,576 entries), so if you fill the array in a logical way you could fairly easily compare the two directions algorithmically.
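A minimal Biostrings sketch along those lines (the input file name is hypothetical; adjust width for the word sizes you need):

library(Biostrings)

seqs <- readDNAStringSet("species.fasta")             # hypothetical FASTA file; one row per sequence in the matrices below
mono <- oligonucleotideFrequency(seqs, width = 1)     # single-nucleotide counts
di   <- oligonucleotideFrequency(seqs, width = 2)     # dinucleotide counts
w10  <- oligonucleotideFrequency(seqs, width = 10)    # words of length 10 (4^10 columns)
w10rc <- oligonucleotideFrequency(reverseComplement(seqs), width = 10)  # same counts on the reverse complements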

Ordered Permutations

I am looking to generate ordered permutations for large numbers, e.g. 37P10 (permutations of 37 elements taken 10 at a time). I have been using the permn() function from the combinat package for this purpose, but it does not work for more than 10 numbers, and it cannot generate permutations of different sizes such as the one described in the example above.
Further, I am combining these permutations into a matrix using do.call(rbind, ...). Is there any other R package that can be used for this purpose?
What you've asked for simply cannot be done. You're asking to generate and store roughly 1.26e15 (or 4.81e15 with replacement) permutations of 10 numbers. Even if each number were only one byte, you would need about 10 million GB of RAM.
In my LSPM package, I use the function LSPM:::.nPri to generate a specific permutation based on its lexicographically ordered index. There's no way you will be able to iterate over every permutation in a reasonable amount of time, so I would suggest that you take a sample of all possible permutations.
Note that the above code will not work for nPr(37,10) due to precision issues with such a large number, but it should work as a good starting point.
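If a random subset of the permutations is enough, base R can already draw them; a quick sketch (the sample size is arbitrary):

set.seed(1)
n_draws <- 5000                                        # arbitrary number of sampled permutations
perm_sample <- t(replicate(n_draws, sample(37, 10)))   # each row is one ordered permutation of size 10 from 1:37
dim(perm_sample)                                       # 5000 x 10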
It is nearly impossible to generate that many permutations on a normal computer.
A quick calculation shows that 37P10 is 1,264,020,397,516,800. Just to store that many 64-bit integers you would need 1,264,020,397,516,800 × 64 bits, which is about 8.09×10^7 Gb (gigabits), or roughly 10^7 gigabytes. To store the actual permutations (10 numbers each) you would need even more memory, whether in RAM or on disk.
I think the best strategy would be to write a permutation function that creates ordered permutations sequentially, and to do your analysis iteratively without ever generating all possible permutations; see the sketch below.
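A hedged sketch of that idea: a small helper (written here for illustration, not taken from any package) that returns the idx-th size-k ordered permutation of 1:n in lexicographic order, so you can loop over indices and process one permutation at a time. As noted above, double-precision indices stop being exact near 37P10, so very large indices need care.

kth_kperm <- function(n, k, idx) {
  ## Return the idx-th (1-based) ordered permutation of k elements from 1:n,
  ## in lexicographic order, without generating any other permutation.
  idx <- idx - 1                   # switch to 0-based arithmetic
  avail <- seq_len(n)              # values not chosen yet
  out <- integer(k)
  for (i in seq_len(k)) {
    # number of permutations that share the prefix chosen so far
    block <- if (i == k) 1 else prod((n - i):(n - k + 1))
    pos <- idx %/% block + 1
    out[i] <- avail[pos]
    avail <- avail[-pos]
    idx <- idx %% block
  }
  out
}

kth_kperm(5, 3, 1)    # 1 2 3  (first of the 5*4*3 = 60 permutations)
kth_kperm(5, 3, 60)   # 5 4 3  (last one)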
