I know that Java has the class BigInteger, which lets you handle integers in their full representation via Strings. Is there something similar in R? I use integers as indices in my data structure, and I need to keep that representation as exact as possible, so that I do not end up with indices such as "7.897557e+14". Thanks in advance.
I am working in R.
I have a large set of 20-nucleotide DNA sequence strings (~60 million). Currently I just keep them in a matrix of strings.
I need to store them as efficiently as possible in memory, match sequences and count the number of times each sequence appears, and, importantly, be able to associate and store multiple strings to one or more.
Can anyone suggest a formal object class that would be suitable for some (or all) of that functionality?
The problem I am trying to solve is a classification problem with four parallel input batches of sequences. To do so, I need four RNNs/LSTMs in parallel that merge into a fully connected layer. The issue is that, in each parallel batch, the sequences have variable length.
I cannot pad to the maximum sequence length because it uses too much RAM; some sequences are really long.
I cannot pad to a reduced length because the model then cannot predict the output. I need the full sequence, because I cannot know in advance where the interesting part of the sequence is.
I cannot use bucketing because if I split a sequence in one batch, I would have to split each sequence with the same index in the three other batches in the same way. As the parallel sequences do not have the same length, the model would try to associate lots of empty sequences with one class or the other.
In theory an RNN/LSTM should be able to learn from sequences of different lengths without sequence manipulation. Unfortunately I do not know of an implementation that enables me to do so. Does such an RNN/LSTM library exist (in any language)?
Theano can handle variable-length sequences, but Tensorflow cannot. You can test this with Theano and let us know your results.
I am trying to create an array from an xyz data file. The data file is arranged so that the x,y,z of each atom is on a new line, and I want the array to reflect this.
I then want to use this array to find the distance from each atom in the list to all the others.
To do this the array has been copied such that atom1 & atom2 should be identical to the input file.
length is simply the number of atoms in the list.
The write statement: WRITE(20,'(3F12.9)') atom1 actually gives the matrix wanted but when I try to find individual elements they're all wrong!
Any help would be really appreciated!
Thanks guys.
DOUBLE PRECISION, DIMENSION(:,:), ALLOCATABLE :: atom1,atom2
ALLOCATE(atom1(length,3),atom2(length,3))
READ(10,*) ((atom1(i,j), i=1,length), j=1,3)
atom2=atom1
distn=0
distc=0
DO n=1,length
x1=atom1(n,1)
y1=atom1(n,2) !1st atom
z1=atom1(n,3)
DO m=1,length
x2=atom2(m,1)
y2=atom2(m,2) !2nd atom
z2=atom2(m,3)
Your READ statement reads all the x coordinates for all atoms from however many records, then all the y coordinates, then all the z coordinates. That's inconsistent with your description of the input file. You have the nesting of the io-implied-do's in the READ statement around the wrong way - it should be ((atom1(i,j),j=1,3),i=1,length).
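The effect of the two nestings can be mimicked outside Fortran. The following is only a minimal Python sketch, assuming a hypothetical 3-atom file whose values are 1 to 9; the function name fill and its flag are made up for illustration. It consumes the values in file order and stores them with either the first or the second index varying fastest.

```python
# Hypothetical 3-atom input, in the order a list-directed READ consumes it:
# one "x y z" triple per record.
values = [1.0, 2.0, 3.0,   # atom 1
          4.0, 5.0, 6.0,   # atom 2
          7.0, 8.0, 9.0]   # atom 3
length = 3


def fill(values, length, atom_index_fastest):
    """Fill a length x 3 table from a flat value stream, like an implied-do."""
    atom = [[None] * 3 for _ in range(length)]
    stream = iter(values)
    if atom_index_fastest:
        # Mimics ((atom1(i,j), i=1,length), j=1,3): i varies fastest.
        for j in range(3):
            for i in range(length):
                atom[i][j] = next(stream)
    else:
        # Mimics ((atom1(i,j), j=1,3), i=1,length): j varies fastest.
        for i in range(length):
            for j in range(3):
                atom[i][j] = next(stream)
    return atom


wrong = fill(values, length, atom_index_fastest=True)
right = fill(values, length, atom_index_fastest=False)

print(wrong[0])  # first "atom" mixes values taken from different records
print(right[0])  # first atom holds the first record's x, y, z
```

With the original nesting, the row for atom 1 ends up holding values taken from three different positions in the file, which is why individual elements look wrong.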
Similarly, as per the comment, your diagnostic write misled you: you were outputting all the x coordinates, followed by all the y coordinates, and so on. Array element order of a whole array reference varies the first (leftmost) dimension fastest (colloquially known as column major order).
(There are various pitfalls associated with list directed formatting that mean I wouldn't recommend it for production code, except perhaps for input specifically written with knowledge of, and defence against, those pitfalls. One of those pitfalls is that a READ under list directed formatting will pull in as many records as it requires to satisfy the input list. You might have detected the problem earlier if you had been using an explicit format that nominated the number of fields per record.)
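To see why the diagnostic output looked right, here is a rough Python sketch of array element order, again assuming a hypothetical 3-atom file with the values 1 to 9; the helper names column_major and records are invented for illustration. It flattens a table with the first index varying fastest and then groups the values three per record, as a (3F12.9) format would.

```python
def column_major(table):
    """Flatten a 2-D table with the first (row) index varying fastest."""
    rows, cols = len(table), len(table[0])
    return [table[i][j] for j in range(cols) for i in range(rows)]


def records(flat, per_record=3):
    """Group a flat list into output records of per_record values each."""
    return [flat[k:k + per_record] for k in range(0, len(flat), per_record)]


# Table as the original READ would have filled it from a file containing
# "1 2 3", "4 5 6", "7 8 9" (assumed data, one atom per record).
misread = [[1.0, 4.0, 7.0],
           [2.0, 5.0, 8.0],
           [3.0, 6.0, 9.0]]

printed = records(column_major(misread))
for record in printed:
    print(record)    # the output records reproduce the input file layout

print(misread[0])    # yet the row for atom 1 is not the first record
```

Because the write traverses the array in the same order the read filled it, the output reproduces the file even though the rows of the array do not correspond to atoms.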
I am fairly new to R. I have a data file containing a matrix of complex numbers, each of the form 123+123i. When I try to read the data into R using read.table(), it returns strings, which is not what I want. Is there some way to read in a file of complex numbers?
One thing I could do, since the program that generates the matrix is available to me, is modify it to write two real numbers instead of a single complex number, and then combine them into a single complex number after reading them into R. Would this be the canonical way of doing what I want?
See ?read.table, in particular you want to use the colClasses="complex" argument.
Given data in the following format (tag_uri image_uri image_uri image_uri ...), I need to turn them into Hadoop SequenceFile format for further processing by Mahout (e.g. clustering)
http://flickr.com/photos/tags/100commentgroup http://flickr.com/photos/34254318#N06/4019040356 http://flickr.com/photos/46857830#N03/5651576112
http://flickr.com/photos/tags/100faves http://flickr.com/photos/21207178#N07/5441742937
...
Before this I would turn the input into csv (or arff) as follows
http://flickr.com/photos/tags/100commentgroup,http://flickr.com/photos/tags/100faves,...
0,1,...
1,1,...
...
where each row describes one tag. The arff file is then converted into a vector file used by Mahout for further processing. I am trying to skip the arff generation step and generate a SequenceFile instead. If I am not mistaken, to represent my data as a SequenceFile, I would need to store each row of the data with $tag_uri as the key and $image_vector as the value. What is the proper way of doing this (and, if possible, can the tag_uri for each row be included in the SequenceFile somewhere)?
Some references that I found, but not sure if they are relevant:
Writing a SequenceFile
Formatting input matrix for svd matrix factorization (can I store my matrix in this form?)
RandomAccessSparseVector (considering I only list images that are assigned with a given tag instead of all the images in a line, is it possible to represent it using this vector?)
SequenceFile write
SequenceFile explanation
You just need a SequenceFile.Writer, which is explained in your link #4. This lets you write key-value pairs to the file. What the key and value are depends on your use case, of course. It's not at all the same for clustering versus matrix decomposition versus collaborative filtering. There's not one SequenceFile format.
Chances are that the key or value will be a Mahout Vector. The thing that knows how to write a Vector is VectorWritable. This is the class you would use to wrap a Vector and write it with SequenceFile.Writer.
You would need to look at the job that will consume it to make sure you're passing what it expects. For clustering, for example, I think the key is ignored and the value is a Vector.
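Before anything is written, the input lines have to be turned into key-value pairs. The sketch below is a Python illustration of that data-shaping step only, not of the Hadoop or Mahout API. It assumes whitespace-separated lines in the format shown in the question, and the function name to_pairs and the column numbering scheme are made up for the example. Each tag_uri becomes a key and each value is a sparse mapping from image column to 1.0.

```python
# Sample lines copied from the question (tag_uri image_uri image_uri ...).
lines = [
    "http://flickr.com/photos/tags/100commentgroup "
    "http://flickr.com/photos/34254318#N06/4019040356 "
    "http://flickr.com/photos/46857830#N03/5651576112",
    "http://flickr.com/photos/tags/100faves "
    "http://flickr.com/photos/21207178#N07/5441742937",
]


def to_pairs(lines):
    """Return (tag_uri, sparse_vector) pairs plus the image-to-column map."""
    image_index = {}   # image_uri -> column number (hypothetical scheme)
    pairs = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue   # skip blank lines
        tag_uri, image_uris = fields[0], fields[1:]
        sparse = {}
        for uri in image_uris:
            # Assign the next free column the first time an image is seen.
            column = image_index.setdefault(uri, len(image_index))
            sparse[column] = 1.0
        pairs.append((tag_uri, sparse))
    return pairs, image_index


pairs, image_index = to_pairs(lines)
for tag_uri, sparse in pairs:
    print(tag_uri, sparse)
```

In the Java job, each pair would then be handed to SequenceFile.Writer, with the vector wrapped in a VectorWritable as described above. Storing only the non-zero columns mirrors the idea of a sparse vector; whether the consuming job accepts this layout is something to check against that job.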