Good Idea/Bad Idea: Using Qt's QSet on very large dataset? - qt

Is it a bad idea to use QSet to keep track of a very large set of fairly large strings? Each string is 54 characters (108 bytes). The set may contain thousands of entries (I'm not sure on the exact number yet). The QSet will only be used for insertion and membership query.
If it is a bad idea, I'm definitely open to suggestions. My 54 character strings are composed of only 6 different characters (e.g. "AAAAAAAAABBBBBBBBBCCCCCCCCCDDDDDDDDDEEEEEEEEEFFFFFFFFF"). This seems like a good candidate for compression, perhaps? Any other suggestions are welcome.

Realize that by using a built-in set, you're going to have some path-level compression based on the nature of your data. Of course, this depends on the container's implementation.
Look at some information on radix trees, digital search trees, red-black trees, etc. You'll see that you don't need to store each and every string, but rather the patterns. For instance, let's simplify your problem: we have only 3 characters that can appear a maximum of 2 times each, and each string is 6 characters long. Three possible strings are:
AABBCC, AABCBC, and AACBCB
With these examples, we could get away with using a maximum of 6 + 3 + 4 = 13 nodes instead of a full 18 nodes. That's not substantial, but I don't know what you're doing either. As with any type of compression, the more your prefix patterns are reused, the more compression you have.
Edit:
The numbers 13 and 18 come from the path-level compression. For instance, in straight C (for argument/discussion), if I am implementing my string storage class as a wrapper around an array I would probably just have an array of character pointers with each pointer referencing a spot in memory that contains a pattern. In the example I gave above, this would take 18 characters (6 * 3 = 18). Adding on the size of the array (let's say that sizeof(char*) is 4), our array would take 3 * 4 = 12 bytes of storage, so 12 + 18 = 30 bytes total to store our patterns.
If I am instead storing the patterns in a sort of digital search tree, I make a small tradeoff. The nodes in my tree are going to be larger than 1 byte apiece (1 byte for the character in the node plus 4 bytes for the "next" pointer, so 5 bytes apiece). The first pattern we store is AABBCC. This is 6 nodes in the tree. Next is AABCBC. We reuse the path AAB from the first pattern and need only an additional 3 nodes for CBC. The last pattern is AACBCB. We reuse AA, and need 4 new nodes for CBCB. This is a total of 13 nodes * 5 bytes = 65 bytes of storage, which is more than the 30 bytes of the flat array, so the tree does not pay off for this tiny example. However, if you have a lot of long, repeating patterns in the prefix of your data, then you'll see some prefix path-level compression.
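The node count above can be checked with a small sketch. It is written in Python purely for illustration, and it uses nested dicts as a stand-in for the 5-byte nodes described here, so it only reproduces the node count, not the byte sizes. The function name count_trie_nodes is made up for this example.

```python
def count_trie_nodes(words):
    """Insert each word into a prefix tree and count the nodes created.

    Nested dicts stand in for tree nodes (an assumption for illustration);
    the root itself is not counted.
    """
    root = {}
    created = 0
    for word in words:
        node = root
        for ch in word:
            if ch not in node:
                node[ch] = {}   # a new node is needed for this character
                created += 1
            node = node[ch]     # otherwise the existing prefix is reused
    return created


patterns = ["AABBCC", "AABCBC", "AACBCB"]
flat_chars = sum(len(p) for p in patterns)   # characters stored without sharing
shared_nodes = count_trie_nodes(patterns)    # nodes stored with shared prefixes
```

With the three example patterns, flat_chars is 18 and shared_nodes is 13, matching the 6 + 3 + 4 count in the text.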
If this isn't the case for you, I would look into Huffman or LZW compression. This will require you to build a dictionary of patterns that have integer numbers tied to them. When you compress, you build the dictionary and create integer IDs for each pattern in your text. You then replace the patterns in your text with the integer IDs. When uncompressing, you do the opposite. I don't have the time to describe these algorithms in more detail, so you'll need to look them up.
It's a tradeoff in simplicity/time. If your data will allow it, take the shorter method and just use the built-in container. If not, you will need something more tailored to your data.

I don't think you'd have any additional problems using QSet over another sort of container, such as std::set, a map, or a vector. If you are wondering about running out of memory, that probably depends on how many thousands of the strings you need to store, and on whether there is a way to encode them more concisely. (For example, if the characters always occur in the same order but vary in relative lengths, store the length for each character rather than all of the characters.) However, even 50,000 of these strings take only around 5 MB, and 500,000 of them only around 50 MB, not counting storage overhead; that is a moderate amount of memory on modern machines.

QSet does sound like a good idea. It's basically just a hash-table and it can optimize its bucket size dynamically. Perfect.
Another suggestion for compressing the key:
Treat it as a base-6 number string (think A=0, B=1, ... F=5) and convert it into binary (int).
QByteArray ba("112"); // instead of "BBC"
int num = ba.toInt(0, 6 /*base*/); // num == 44
6^3 < 2^8 (216 < 256), so we can represent every 3 chars in your string with a 1 byte int (or char) and make a bytearray of it. That would cut down the size of the key from 54 characters (108 bytes as a QString) to 18 bytes.
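Here is a rough sketch of the whole-key version of this idea, in Python rather than Qt so it can be run without the Qt libraries. The names pack_key and unpack_key and the A=0 ... F=5 assignment are assumptions for illustration; in Qt the packed result would presumably be stored as a QByteArray in the QSet.

```python
ALPHABET = "ABCDEF"  # assumption: A=0, B=1, ... F=5, as suggested above


def pack_key(s):
    """Pack a string over ABCDEF into bytes, three base-6 digits per byte."""
    if len(s) % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    out = bytearray()
    for i in range(0, len(s), 3):
        value = 0
        for ch in s[i:i + 3]:
            value = value * 6 + ALPHABET.index(ch)  # raises ValueError on other chars
        out.append(value)                           # at most 5*36 + 5*6 + 5 = 215
    return bytes(out)


def unpack_key(data):
    """Inverse of pack_key."""
    chars = []
    for b in data:
        chars.append(ALPHABET[b // 36])
        chars.append(ALPHABET[(b // 6) % 6])
        chars.append(ALPHABET[b % 6])
    return "".join(chars)


# Membership test on packed keys instead of the raw strings.
seen = set()
key = "".join(ch * 9 for ch in ALPHABET)  # the 54-character example string
seen.add(pack_key(key))
```

A 54-character key packs into 18 bytes, and pack_key("BBC") yields the single byte 44, the same value as the toInt example above.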

From your earlier comment: "In my strings, there will always be 54 characters, and there will always be 9 of each character. The order is the only thing that changes."
Don't store raw strings then. You could just compress them down to the 6 characters actually used, and then make a QSet of those. A trivial compression would be to number the alphabet {a,b,c,d,e,f}, and if the character set is known beforehand (and only those 6 characters) you could even pack each character into 3 bits, so a whole 54-character string fits in 162 bits (21 bytes).

Related

Memory-wise, is it better to save a formula as a string or an expression (symbol), in Julia

I deal with lots of mathematical expressions in a certain Julia script and would like to know if storing such a formula as a String is ok, or whether using the Symbol data type is better. Thinking about scalability and keeping memory requirements to a minimum. Thanks!
Update: the application involves a machine learning model. Ideally, it should be applicable to big data too, hence the need for scalability.
In a string, each character is stored based on its number of code units, e.g. 1 for ASCII. The same is true for the characters of a Symbol. So that is a wash; do what fits your use best, probably Symbols since you are manipulating expressions.
An expression like :(x + y) is stored as a list of Any, with space allocated according to the sizeof each item in the expression.
In an expression like :(7 + 4 * 9) versus a string like "7 + 4 * 9" there are two conflicting issues. First, 7 is stored as 1 byte in the string, but 8 bytes in the expression since there are 64-bit Ints in play. On the other hand, each space of whitespace takes up 1 byte in the string, but does not use memory in the expression. And a number like 123.123456789 takes up 13 bytes in the string and 8 in the expression (64-bit floats).
I think that, again, this is close to being even, and depends on the specific strings you are parsing. You could, as you work with the program, store both, compare memory usage of the resulting arrays, and drop one type of storage if you feel you should.

Encoding DNA strand in Binary

Hey guys I have the following question:
Suppose we are working with strands of DNA, each strand consisting of
a sequence of 10 nucleotides. Each nucleotide can be any one of four
different types: A, G, T or C. How many bits does it take to encode a
DNA strand?
Here is my approach to it and I want to know if that is correct.
We have 10 spots. Each spot can have 4 different symbols. This means we require 4^10 combinations using our binary digits.
4^10 = 1048576.
We will then find the log base 2 of that. What do you guys think of my approach?
Each nucleotide (aka base-pair) takes two bits (one of four states -> 2 bits of information). 10 base-pairs thus take 20 bits. Reasoning that way is easier than doing the log2(4^10), but gives the same answer.
It would be fewer bits of information if there were any combinations that couldn't appear. e.g. some codons (sequence of three base-pairs) that never appear. But ten independent 2-bit pieces of information sum to 20 bits.
If some sequences appear more frequently than others, and a variable-length representation is viable, then Huffman coding or other compression schemes could save bits most of the time. This might be good in a file-format, but unlikely to be good in-memory when you're working with them.
Densely packing your data into an array of 2bit fields makes it slower to access a single base-pair, but comparing the whole chunk for equality with another chunk is still efficient. (memcmp).
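As a small illustration of the 2-bits-per-nucleotide packing, here is a Python sketch that packs one 10-nucleotide strand into a single integer below 2^20. The particular code assignment (A=0, C=1, G=2, T=3) and the function names are assumptions; any fixed assignment of the four symbols to 0..3 works the same way.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumption: arbitrary but fixed assignment
LETTERS = "ACGT"                           # index i is the letter with code i


def pack_strand(strand):
    """Pack a nucleotide string into an integer, 2 bits per nucleotide."""
    value = 0
    for base in strand:
        value = (value << 2) | CODES[base]
    return value


def unpack_strand(value, length=10):
    """Recover the nucleotide string; the first base sits in the highest bits."""
    return "".join(
        LETTERS[(value >> (2 * i)) & 0b11] for i in reversed(range(length))
    )


packed = pack_strand("ACGTACGTAC")
```

Two strands are equal exactly when their packed integers are equal, so a whole-strand comparison is a single integer compare, while reading one nucleotide needs the shift and mask shown in unpack_strand.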
20 bits is unfortunately just slightly too large for a 16bit integer (which computers are good at). Storing in an array of 32bit zero-extended values wastes a lot of space. On hardware with good unaligned support, storing 24bit zero-extended values is ok: do a 32bit load and mask off the high 8 bits. Storing is even less convenient, though: probably a 16b store and an 8b store, or else load the old value, merge the high 8 bits, and do a 32b store (but that's not atomic).
This is a similar problem for storing codons (groups of three base-pairs that code for an amino acid): 6 bits of information doesn't fill a byte. Only wasting 2 of every 8 bits isn't that bad, though.
Amino-acid sequences (where you don't care about mutations between different codons that still code for the same AA) have about 20 symbols per position, which means a symbol doesn't quite fit into a 4bit nibble.
I used to work for the phylogenetics research group at Dalhousie, so I've sometimes thought about having a look at DNA-sequence software to see if I could improve on how they internally store sequence data. I never got around to it, though. The real CPU intensive work happens in finding a maximum-likelihood evolutionary tree after you've already calculated a matrix of the evolutionary distance between every pair of input sequences. So actual sequence comparison isn't the bottleneck.
Do the maths:
4^10 = (2^2)^10 = 2^20
Answer: 20 bits

Finding similar hashes

I'm trying to find 2 different plain text words that create very similar hashes.
I'm using the hashing method 'whirlpool', but I don't really need my question answered for whirlpool specifically; if you can answer it using md5 or something easier, that's OK.
The similarity I'm looking for is that they contain the same number of each letter (it doesn't matter how much they're jumbled up)
i.e.
plaintext 'test'
hash 1: abbb5 has 1 a , 3 b's , one 5
plaintext 'blahblah'
hash 2: b5bab must have the same, but it doesn't matter what order.
I'm sure I can read up on how they're created and break it down and reverse it, but I am just wondering if what I'm talking about occurs.
I'm wondering because I haven't found a match of what I'm explaining (I created a PoC to run through random words / letters until it produced a similar match), but then again it would take forever doing it the way I was doing it, and I was wondering if anyone with real knowledge of hashes / encryption could help me out.
So you can do it like this:
1. create an empty sorted map
2. create a 64 bit counter (you don't need more than 2^63 inputs, in all probability, since you would be dead before they would be calculated - unless quantum crypto really takes off)
3. use the counter as input, probably easiest to encode it in 8 bytes;
4. use this as input for your hash function;
5. encode output of hash in hex (use ASCII bytes, for speed);
6. sort hex on number / alphabetically (same thing really)
7. check if sorted hex result is a key in the map
8. if it is, show hex result, the old counter from the map & the current counter (and stop)
9. if it isn't, put the sorted hex result in the map, with the counter as value
10. increase counter, goto 3
That's all folks. Results for SHA-1:
011122344667788899999aaaabbbcccddeeeefff for both 320324 and 429678
I don't know why you want to do this for hex, the hashes will be so large that they won't look too much alike. If your alphabet is smaller, your code will run (even) quicker. If you use whole output bytes (i.e. 00 to FF instead of 0 to F) instead of hex, it will take much more time - a quick (non-optimized) test on my machine shows it doesn't finish in minutes and then runs out of memory.
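The counter-and-map search described in this answer could look roughly like the following Python sketch. Several details are assumptions rather than the answerer's actual code: the counter is encoded as 8 big-endian bytes, a plain dict replaces the sorted map, and the hex_len and limit parameters are added so that a shortened hash can be tried quickly. Because of the byte-order assumption, a full-length run is not guaranteed to reproduce the two counters quoted above.

```python
import hashlib


def sorted_hex(counter, hex_len=40):
    """SHA-1 of the 8-byte counter, hex-encoded, truncated, digits sorted."""
    digest = hashlib.sha1(counter.to_bytes(8, "big")).hexdigest()
    return "".join(sorted(digest[:hex_len]))


def find_jumbled_match(hex_len=40, limit=1_000_000):
    """Return (sorted_hex, old_counter, new_counter) for the first pair of
    counters whose hashes use the same hex digits, or None if none is found
    below limit."""
    seen = {}  # sorted hex -> first counter that produced it
    for counter in range(limit):
        key = sorted_hex(counter, hex_len)
        if key in seen:
            return key, seen[key], counter
        seen[key] = counter
    return None


# Quick demonstration on the first 6 hex digits only; the full 40 digits
# need hundreds of thousands of iterations and a correspondingly large map.
demo = find_jumbled_match(hex_len=6, limit=100_000)
```

Shortening the compared prefix mirrors the remark about smaller alphabets: the fewer distinct sorted results there are, the sooner two inputs land on the same one.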

Dictionary Training For Different Language

I am working on a messaging system and got the idea of storing the messages in each inbox independently. While working on that idea I asked myself why not compress the messages. So I am looking for a good way to obtain dictionaries in different languages.
Since the messages are closely related to everyday conversation (social bla bla) I need a good source and method for that.
I need some text for it, such as millions of emails, books etc. I would like to create a Huffman tree out of it with the ability to inline and represent each message as a string within this Huffman tree, so that decoding would be fast enough.
The languages I want to use are all over the place. Since newspapers and the like might not be sufficient, I need other sources.
[Update]
As I continue my research, I noticed that I actually create two dictionaries out of Wikipedia. The first is a dictionary containing typical characters with a certain probability. I also noticed that the special characters I used for each language seem to have an even distribution among Latin-based languages (well, actually Latin is just one member of the language family), and even Russian tends to have the same distribution despite its quite different alphabet.
I also noticed that in around 15 to 18% of cases a special character (like ' ', ',', ';') follows another special character. Beyond that, the 8 most frequent words cover 10% of all words, the next 16 words cover 9%, and so on; by around 128 (160 words) you reach a coverage of 80% of all words. Storing the next 256 or more words is therefore pointless for this analysis. This leaves me with three dictionaries (characters, words, special characters) per language of around 2 to 5 KB each (I use a special format with prefix compression) that save me between 30% and 60% of the characters. Since Java stores each character in 16 bits, the overall reduction is even larger: about 1:5 to 1:10 once the characters that still have to be inserted are also compressed with a Huffman tree.
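For readers unfamiliar with the Huffman step, here is a minimal Python sketch of building a character-level code from observed frequencies. It is not the poster's dictionary format: the function names are invented for this example, and it derives the frequencies from the sample text itself instead of from a trained per-language dictionary.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a prefix code (symbol -> bit string) from character frequencies."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:                       # degenerate case: a single symbol
        return {sym: "0" for sym in freq}
    # Heap entries: (weight, tie_breaker, partial code table for that subtree).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, codes1 = heapq.heappop(heap)  # the two lightest subtrees
        n2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


def huffman_encode(text, codes):
    return "".join(codes[ch] for ch in text)


def huffman_decode(bits, codes):
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:               # prefix-free, so the first hit is right
            out.append(inverse[current])
            current = ""
    return "".join(out)


sample = "hello world, hello huffman"
table = huffman_codes(sample)
encoded = huffman_encode(sample, table)
```

In a messaging system the table would presumably be built once per language from a large corpus and shipped with the decoder, so that individual messages do not need to carry it.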
Using such a system, and compressing numbers as variable-length integers, one produces a byte array that can be used for string matching and loads faster. Checking whether a message contains a word is also faster than character-by-character comparison, since complete words can be checked without having to tokenize or recognize words first.
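The post does not say which variable-length integer format is used. One common choice, assumed here purely for illustration, stores 7 bits per byte with the high bit as a "more bytes follow" flag, so small dictionary indexes take a single byte. A Python sketch with made-up function names:

```python
def encode_varint(n):
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one integer starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# Word indexes below 128 cost one byte each; larger ones grow as needed.
stream = b"".join(encode_varint(i) for i in (5, 127, 128, 300))
```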
It solved the problem of supporting string keys, since I can just transform the string in each language, and the result is a set of keys I can use for lookups.

What is the name for encoding/encrypting with noise padding?

I want code to render n bits with n + x bits, non-sequentially. I'd Google it but my Google-fu isn't working because I don't know the term for it.
For example, the input value in the first column (2 bits) might be encoded as any of the output values in the comma-delimited second column (4 bits) below:
0 1,2,7,9
1 3,8,12,13
2 0,4,6,11
3 5,10,14,15
My goal is to take a list of integer IDs, and transform them in a way they can still be used for persistent URLs, but that can't be iterated/enumerated sequentially, and where a client cannot determine programmatically if a URL in a search result set has been visited previously without visiting it again.
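To make the example mapping concrete, here is a small Python sketch of the table as written in the question: each 2-bit input has four possible 4-bit outputs, one of which is picked at random when encoding, and decoding is a reverse lookup. The names NOISE_TABLE, noisy_encode and noisy_decode are invented for this illustration.

```python
import secrets

# The table from the question: 2-bit value -> its four possible 4-bit codes.
NOISE_TABLE = {
    0: (1, 2, 7, 9),
    1: (3, 8, 12, 13),
    2: (0, 4, 6, 11),
    3: (5, 10, 14, 15),
}

# Reverse lookup: every 4-bit code belongs to exactly one input value.
REVERSE_TABLE = {
    code: value for value, codes in NOISE_TABLE.items() for code in codes
}


def noisy_encode(value):
    """Pick one of the four codes for this value at random."""
    return secrets.choice(NOISE_TABLE[value])


def noisy_decode(code):
    """Map a 4-bit code back to its 2-bit value."""
    return REVERSE_TABLE[code]
```

Since the sixteen codes 0..15 are split into four disjoint groups, every code decodes to exactly one value, while the same value can appear as four different outputs.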
I would term this process "encoding". You'll see something similar done to permit the use of communications channels that have special symbols that are not permitted in data. Examples: uuencoding and base64 encoding.
That said, you still need to ensure that there is only one correct decoding (which, at first blush, your table appears to do), and accept the increase in the size of the output (in the case above, the output is double the size of the input, bit for bit).
I think you'd be better off encrypting the number with a cheap cypher + a constant secret key stored on your server(s), adding a random character or four at the end, and a cheap checksum, and simply reject any responses that don't have a valid checksum.
<encrypt(secret)>
<integer>+<random nonsense>
</encrypt>
+
<checksum()>
<integer>+<random nonsense>
</checksum>
Then decrypt the first part (remember, cheap == fast), validate the decrypted text using the checksum, throw away the random nonsense, and use the integer you stored.
There are probably some cryptographic no-no's here, but let's face it, the cost of this algorithm being broken is a touch on the low side.
