How can a file be compressed to less than its entropy?

I created a file that contains 100,000 numbers that were drawn uniformly (with probability 1/8) from the set {1,2,3,4,5,6,7,8}.
When I look at the size of this file on my hard disk it is 293 KB (kilobytes), which makes sense because one needs 3 bits to "identify" a number between 1 and 8, and 3 × 100,000 = 300 KB.
Next I compress the file using WinZip and find that it is reduced to only 57 KB! How can this be, since I expect that the random-number generator I used for my draws is, for all practical purposes, ideal? That means the sequence should be truly random, and the size of the file should therefore be given by its entropy (which is 300 KB)?

I am afraid you are confused about certain concepts.
3 bits times 100,000 gives you 300,000 bits, and there are 8 bits to the byte, which corresponds to roughly 37.5 KB. That's a far cry from 300 KB.
(And in any case, if you were to create "a file that contains 100,000 numbers", there is no magic fairy sitting on your hard disk, who will figure out the min & max range of your numbers, and store them in the file using the smallest number of bits necessary to represent them all.)
So, it is very important to get it out of the way that 300 KB has absolutely nothing to do with the entropy of 100,000 single-digit numbers.
You told us absolutely nothing about how you created that file, so its file format is a mystery, but we can make some simple calculations and guesses: 293 KB times 1024 is roughly 300,000, so what you have is a 300,000-byte file. Which means that you are writing 3 bytes per number. Which means that you have written these numbers as text, in a text file: either each digit followed by a comma and a space, or each digit followed by a carriage return and a line feed, or something similar.
Text file formats are extremely wasteful in terms of storage space.
So, yes, this is a highly compressible file consisting mostly of identical bytes, and even the bytes that are not identical (the digits) all map to just 3 bits each, so it is no wonder that the entire file gets compressed so well.
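To make the arithmetic concrete, here is a minimal C++ sketch (the file names and RNG seed are arbitrary choices) that writes the same 100,000 draws twice: once as CRLF-terminated text (3 bytes per number, 300,000 bytes) and once packed at 3 bits per symbol (37,500 bytes). A zip of the text file can approach the packed size plus container overhead, which is consistent with the 57 KB observed.

#include <cstdint>
#include <fstream>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);                       // arbitrary seed
    std::uniform_int_distribution<int> draw(1, 8);

    std::vector<int> draws(100000);
    for (int& x : draws) x = draw(rng);

    // Text encoding: "5\r\n" is 3 bytes per number -> 300,000 bytes.
    std::ofstream text("draws.txt", std::ios::binary);
    for (int x : draws) text << x << "\r\n";

    // Packed encoding: 3 bits per symbol (values 0..7) -> 37,500 bytes.
    std::ofstream packed("draws.bin", std::ios::binary);
    uint32_t buf = 0;
    int nbits = 0;
    for (int x : draws) {
        buf = (buf << 3) | uint32_t(x - 1);     // map 1..8 to 0..7
        nbits += 3;
        while (nbits >= 8) {
            nbits -= 8;
            packed.put(char((buf >> nbits) & 0xFF));
        }
    }
    if (nbits > 0) packed.put(char((buf << (8 - nbits)) & 0xFF));
}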
No laws of nature were harmed during the making of this question.

Related

Encoding DNA strand in Binary

Hey guys, I have the following question:
Suppose we are working with strands of DNA, each strand consisting of
a sequence of 10 nucleotides. Each nucleotide can be any one of four
different types: A, G, T or C. How many bits does it take to encode a
DNA strand?
Here is my approach to it and I want to know if that is correct.
We have 10 spots, and each spot can have 4 different symbols. This means our binary digits must be able to distinguish 4^10 combinations.
4^10 = 1048576.
We will then find the log base 2 of that. What do you guys think of my approach?
Each nucleotide (aka base-pair) takes two bits (one of four states -> 2 bits of information). 10 base-pairs thus take 20 bits. Reasoning that way is easier than doing the log2(4^10), but gives the same answer.
It would be fewer bits of information if there were any combinations that couldn't appear, e.g. some codons (sequences of three base-pairs) that never occur. But ten independent 2-bit pieces of information sum to 20 bits.
If some sequences appear more frequently than others, and a variable-length representation is viable, then Huffman coding or other compression schemes could save bits most of the time. This might be good in a file format, but is unlikely to be good in memory when you're working with them.
Densely packing your data into an array of 2-bit fields makes it slower to access a single base-pair, but comparing a whole chunk for equality with another chunk is still efficient (memcmp).
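A minimal sketch of that 2-bit packing (the mapping A=0, C=1, G=2, T=3 is an arbitrary choice; any fixed 2-bit code works):

#include <cassert>
#include <cstdint>

// Pack a 10-nucleotide strand into the low 20 bits of a uint32_t.
uint32_t pack(const char* strand) {
    uint32_t packed = 0;
    for (int i = 0; i < 10; ++i) {
        uint32_t code;
        switch (strand[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            default:  code = 3; break;  // 'T'
        }
        packed = (packed << 2) | code;
    }
    return packed;
}

// Extract base i (i = 0 is the leftmost) from a packed strand.
char unpackAt(uint32_t packed, int i) {
    static const char bases[4] = {'A', 'C', 'G', 'T'};
    return bases[(packed >> (2 * (9 - i))) & 0x3u];
}

int main() {
    uint32_t s = pack("ACGTACGTAC");
    assert(unpackAt(s, 3) == 'T');
    assert(s == pack("ACGTACGTAC"));  // whole-strand equality is one integer compare
}

Note that the whole-strand equality test really is a single integer comparison, which is the point of packing.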
20 bits is unfortunately just slightly too large for a 16-bit integer (which computers are good at). Storing in an array of 32-bit zero-extended values wastes a lot of space. On hardware with good unaligned support, storing 24-bit zero-extended values is OK (do a 32-bit load and mask the high 8 bits. Storing is even less convenient, though: probably a 16-bit store and an 8-bit store, or else load the old value, merge the high 8 bits, then do a 32-bit store. But that's not atomic.).
This is a similar problem for storing codons (groups of three base-pairs that code for an amino acid): 6 bits of information doesn't fill a byte. Only wasting 2 of every 8 bits isn't that bad, though.
Amino-acid sequences (where you don't care about mutations between different codons that still code for the same AA) have about 20 symbols per position, which means a symbol doesn't quite fit into a 4-bit nibble.
I used to work for the phylogenetics research group at Dalhousie, so I've sometimes thought about having a look at DNA-sequence software to see if I could improve on how they internally store sequence data. I never got around to it, though. The real CPU intensive work happens in finding a maximum-likelihood evolutionary tree after you've already calculated a matrix of the evolutionary distance between every pair of input sequences. So actual sequence comparison isn't the bottleneck.
Do the maths:
4^10 = (2^2)^10 = 2^20
Answer: 20 bits

Understanding the BPP inside DICOM images

I've been working with DICOM files for a few days, using FO-DICOM.
I'm using a set of DICOM files for my tests, and I've been printing the "Photometric Interpretation" and the "Samples Per Pixel" values to get a better understanding of what kind of images I'm working with.
The result was "MONOCHROME2" for the Photometric Interpretation and "1" for Samples Per Pixel.
What I understood by reading part 3 of the standard is that MONOCHROME2 represents a gray scale starting from black for its minimum values.
But what exactly is Samples Per Pixel? I thought it represented the number of bytes (not bits) per pixel (it would be logical to have 8 bits per pixel for a gray scale, right?).
But my problem here is that actually, my images seem to have 32 bpp.
I'm working with 512×512-pixel images, and I converted them into byte arrays, so I was expecting arrays of 512 × 512 = 262,144 bytes.
But I get arrays of 1,048,630 bytes (which is a bit more than 4 × 262,144).
Does someone have an explanation?
EDIT:
Here is some of my data:
PhotometricInterpretation=MONOCHROME2
SamplePerPixel=1
BitsAllocated=16
BitsStored=12
HighBit=11
PixelRepresentation=0
NumberOfFrames=0
The attribute (0028,0002) SamplesPerPixel refers to color images only and tells you the number of planes which are present in the image (e.g. 3 for RGB), so you have
PhotometricInterpretation=RGB
SamplesPerPixel=3
With 8 bits per pixel (I will revisit BPP below). As long as you have PhotometricInterpretation = MONOCHROME1 or MONOCHROME2, you can expect the SamplesPerPixel to be 1 and nothing else.
What you do have to take into consideration is the number of bits per pixel:
BitsAllocated (0028,0100)
BitsStored (0028,0101)
HighBit (0028,0102)
These tell you how many bits are used to encode a pixel value (BitsAllocated) and which of those bits really contain grayscale information (BitsStored, HighBit). HighBit is zero-based and is usually, but not necessarily, BitsStored - 1.
An example to illustrate this: for CT images, it is very common to express gray values in Hounsfield units, which range from -1000 to +3000. These are represented by 12 bits, stored with a 2-byte alignment, so
BitsAllocated (0028,0100) = 16
BitsStored (0028,0101) = 12
HighBit (0028,0102) = 11
Another degree of freedom is PixelRepresentation, which tells you whether the pixel data is encoded unsigned (0) or in two's complement (1). I have seen both for CT images, though signed pixel data is rather unusual for image types other than CT.
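As a sketch of how these three attributes drive pixel decoding (plain C++, not FO-DICOM's API; the attribute values are passed in as plain ints):

#include <cstdint>

// Decode one stored pixel from a 16-bit word, for the common CT layout:
// BitsAllocated = 16, BitsStored = 12, HighBit = 11.
int32_t decodePixel(uint16_t raw, int bitsStored, int highBit,
                    int pixelRepresentation) {
    // Shift so the lowest stored bit sits at bit 0, then mask off the
    // (BitsAllocated - BitsStored) unused bits.
    uint32_t value = (raw >> (highBit + 1 - bitsStored))
                     & ((1u << bitsStored) - 1);
    if (pixelRepresentation == 1) {
        // Two's complement: sign-extend from bit (BitsStored - 1).
        if (value & (1u << (bitsStored - 1)))
            value |= ~((1u << bitsStored) - 1);
    }
    return static_cast<int32_t>(value);
}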
In your example, I would assume that BitsAllocated == 32, or (not very likely) that you have a dataset containing multiple images ('frames'), so NumberOfFrames (0028,0008) > 1. If Number of Frames is absent, you can safely assume you have only one frame.
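A quick sanity check on the buffer length (a sketch that ignores encapsulation and padding; the values are the ones from the question):

#include <cstdint>
#include <iostream>

int main() {
    // Expected raw pixel-data size = Rows * Columns * SamplesPerPixel
    //                                * (BitsAllocated / 8) * NumberOfFrames.
    uint64_t rows = 512, cols = 512, samplesPerPixel = 1;
    uint64_t bitsAllocated = 16, frames = 1;  // NumberOfFrames defaults to 1 when absent

    uint64_t bytes = rows * cols * samplesPerPixel * (bitsAllocated / 8) * frames;
    std::cout << bytes << "\n";  // 524288 -- half of the observed 1,048,630,
                                 // hence the guess of 32 bits allocated or 2 frames
}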
I have over-simplified a bit here, especially about color images but I think this is complicated enough ;-). Basically, DICOM offers any thinkable degree of freedom to encode pixel data and describe the encoding in the header.
I think I have recommended you to have a look at the DCMTK in a recent post. The DicomImage class features a nice interface (getInterData()) which cares about all that stuff and provides the pixel data read from a DICOM file in a normalized format.
[EDIT]: Feel free to post a DICOM dump of your dataset here, I would have a look at it and tell you how to interpret the pixel data.

repetition in encrypted data -- red flag?

I have some base-64 encoded encrypted data and noticed a fair amount of repetition. In an (approximately) 200-character string, a certain base-64 character is repeated up to 7 times, in several separate repeated runs.
Is this a red flag that there is a problem in the encryption? According to my understanding, encrypted data should never show significant repetition, even if the plaintext is entirely uniform (i.e. even if I encrypt 2 GB of nothing but the letter A, there should be no significant repetition in the encrypted version).
According to the binomial distribution, there is about a 2.5% chance that you'd see one character from a set of 64 appear seven times in a series of 200 random characters. That's a small chance, but not negligible. With more information, you might raise your confidence from 97.5% to something very close to 100% … or find that the cipher text really is uniformly distributed.
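For reference, the 2.5% figure is the binomial probability of exactly 7 occurrences; a small sketch of the calculation:

#include <cmath>
#include <cstdio>

// P(X = k) for X ~ Binomial(n, p), computed via log-gamma to avoid overflow.
double binomialPmf(int n, int k, double p) {
    double logC = std::lgamma(n + 1.0) - std::lgamma(k + 1.0)
                - std::lgamma(n - k + 1.0);
    return std::exp(logC + k * std::log(p) + (n - k) * std::log1p(-p));
}

int main() {
    // One particular base-64 character appearing exactly 7 times
    // among 200 uniformly random characters.
    std::printf("%.4f\n", binomialPmf(200, 7, 1.0 / 64.0));  // ~0.0248
}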
You say that the "character is repeated up to 7 times" in several separate repeated runs. That's not enough information to say whether the cipher text has a bias. Instead, tell us the total number of times the character appeared, and the total number of cipher text characters. For example, "it appeared a total of 3125 times in 1000 runs of 200 characters each."
Also, you need to be sure that you are talking about the raw output of a cipher. Cipher text is often encapsulated in an "envelope" like that defined by the Cryptographic Message Syntax. Of course, this enclosing structure will have predictable patterns.
Well, I guess it depends. Repetition in general is a bad thing if it represents the same data.
Considering you are encoding it, have you looked at the data to see if something in it repeats in those counts?
To understand this better, you need to know what kind of encryption is being used.
It could be just a coincidence that they are repeating.
But if the repetition comes from the same data, then it can be a red flag, because frequency counts can then be used to decode it.
What kind of encryption are you using? Home-made or an industry standard?
It depends on how you are encrypting your data.
Base64 encoding a string may count as light obfuscation, but it is NOT encryption. The purpose of Base64 encoding is to allow any sort of binary data to be encoded as a safe ASCII string.

Good Idea/Bad Idea: Using Qt's QSet on very large dataset?

Is it a bad idea to use QSet to keep track of a very large set of fairly large strings? Each string is 54 characters (108 bytes). The set may contain thousands of entries (I'm not sure on the exact number yet). The QSet will only be used for insertion and membership query.
If it is a bad idea, I'm definitely open to suggestions. My 54 character strings are composed of only 6 different characters (e.g. "AAAAAAAAABBBBBBBBBCCCCCCCCCDDDDDDDDDEEEEEEEEEFFFFFFFFF"). This seems like a good candidate for compression, perhaps? Any other suggestions are welcome.
Realize that by using a built-in set, you're going to have some path-level compression based on the nature of your data. Of course, this depends on the container's implementation.
Look at some information on radix trees, digital search trees, red-black trees, etc. You'll see that you don't need to store each and every string, but rather the patterns. For instance, let's simplify your problem: we have only 3 characters that can appear a maximum of 2 times each, and each string is 6 characters long. Three possible strings are:
AABBCC, AABCBC, and AACBCB
With these examples, we could get away with using a maximum of 6 + 3 + 4 = 13 nodes instead of a full 18 nodes. Not substantial, but I don't know what you're doing either. As with any type of compression, the more your prefix patterns are reused, the more compression you get.
Edit:
The numbers 13 and 18 come from the path-level compression. For instance, in straight C (for argument's sake), if I implement my string storage class as a wrapper around an array, I would probably just have an array of character pointers, each pointer referencing a spot in memory that contains a pattern. In the example above, this takes 18 characters (6 × 3 = 18). Adding the size of the array (say sizeof(char*) is 4), the array takes 3 × 4 = 12 bytes, for 12 + 18 = 30 bytes total to store our patterns.
If I instead store the patterns in a sort of digital search tree, I make a small tradeoff. The nodes in my tree are going to be larger than 1 byte apiece: 1 byte for the character and 4 bytes for the "next" pointer, so 5 bytes apiece. The first pattern we store is AABBCC; that is 6 nodes in the tree. Next is AABCBC: we reuse the path AAB from the first pattern and need only an additional 3 nodes for CBC. The last pattern is AACBCB: we reuse AA and need 4 new nodes for CBCB. That is a total of 13 nodes × 5 bytes = 65 bytes of storage, which is actually more than the flat array in this tiny example. However, if you have a lot of long, repeating prefixes in your data, you'll see real savings from the prefix path-level compression.
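Here is a minimal sketch of that node counting, using a naive character trie (container and pointer overhead ignored; it just counts how many nodes get created):

#include <iostream>
#include <map>
#include <memory>
#include <string>

struct Node {
    std::map<char, std::unique_ptr<Node>> next;
};

// Insert a string and return how many new nodes were needed.
int insertCountingNewNodes(Node& root, const std::string& s) {
    Node* cur = &root;
    int created = 0;
    for (char c : s) {
        auto& child = cur->next[c];
        if (!child) { child = std::make_unique<Node>(); ++created; }
        cur = child.get();
    }
    return created;
}

int main() {
    Node root;
    int total = 0;
    for (const char* s : {"AABBCC", "AABCBC", "AACBCB"})
        total += insertCountingNewNodes(root, s);
    std::cout << total << "\n";  // 13: 6 for AABBCC, +3 for AABCBC, +4 for AACBCB
}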
If this isn't the case for you, I would look into Huffman or LZW compression. This will require you to build a dictionary of patterns that have integer numbers tied to them. When you compress, you build the dictionary and create integer id's for each pattern in your text. You then replace the patterns in your text with the integer id's. When uncompressing, you do the opposite. I don't have the time to describe these algorithms in more detail, so you'll need to look them up.
It's a tradeoff in simplicity/time. If your data will allow it, take the shorter method and just use the built-in container. If not, you will need something more tailored to your data.
I don't think you'd have any additional problems using QSet over another sort of container, such as std::set, a map, or a vector. If you are wondering about running out of memory, that probably depends on how many thousands of the strings you need to store, and if there was a way to encode them more concisely. (For example, if the characters always occur in the same order but vary in relative lengths, store the length for each character rather than all of the characters.) However, even 50,000 of these strings is only around 5 MB, and 500,000 of them is only 50 MB to store, discounting storage overhead, which is a moderate amount of memory on modern machines.
QSet does sound like a good idea. It's basically just a hash-table and it can optimize its bucket size dynamically. Perfect.
Another suggestion for compressing the key:
Treat it as a base-6 number string (think A=0, B=1, ... F=5) and convert it into binary (int).
QByteArray ba("112"); // instead of "BBC"
int num = ba.toInt(0, 6 /*base*/); // num == 44
6^3 = 216 < 2^8, so we can represent every 3 chars of your string with a 1-byte int (or char) and make a byte array of it. That would cut the size of the key down from 54 bytes to 18 bytes.
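A sketch of that packing for the whole 54-character key (it assumes the string is exactly 54 characters drawn from A-F and does no validation):

#include <QByteArray>

// Pack a 54-character string over {A..F} into 18 bytes,
// 3 base-6 digits per byte (6^3 = 216 fits in 0..255).
QByteArray packKey(const QByteArray& s) {
    QByteArray out;
    out.reserve(s.size() / 3);
    for (int i = 0; i < s.size(); i += 3) {
        int v = 0;
        for (int j = 0; j < 3; ++j)
            v = v * 6 + (s[i + j] - 'A');  // A=0 ... F=5
        out.append(char(v));
    }
    return out;
}

The packed keys can then go straight into a QSet<QByteArray> for insertion and membership tests.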
From your earlier comment: "In my strings, there will always be 54 characters, and there will always be 9 of each character. The order is the only thing that changes."
Don't store raw strings, then. You could just compress them down to the 6 characters actually used and make a QSet of those. A trivial encoding would be {a,b,c,d,e,f}, and since the character set is known beforehand (and is only those 6 characters) you could even pack things more tightly, e.g. six base-6 symbols into a 16-bit integer (6^6 = 46,656 < 65,536).

Hardware Cache Formulas (Parameter)

The image below was scanned (poorly) from Computer Systems: A Programmer's Perspective. (I apologize to the publisher). This appears on page 489.
Figure 6.26: Summary of cache parameters http://theopensourceu.com/wp-content/uploads/2009/07/Figure-6.26.jpg
I'm having a terribly difficult time understanding some of these calculations. At the moment, what is troubling me is the calculation for M, which is supposed to be the "maximum number of unique memory addresses." What is 2^m supposed to mean? I think m is calculated as log2(M). This seems circular...
For the sake of this post, assume the following in the event you want to draw up an example: 512 sets, 8 blocks per set, 32 words per block, 8 bits per word
Update: All of the answers posted thus far have been helpful, but I still think I'm missing something. cwrea's answer provides the biggest bridge for my understanding. I feel like the answer is on the tip of my mental tongue; I know it is there, but I can't identify it.
Why does M = 2^m but then m = log2(M)?
Perhaps the detail I'm missing is that for a 32-bit machine, we'd assume M = 2^32. Does this single fact allow me to solve for m? m = log2(2^32)? But then this gets me back to 32... I have to be missing something...
m and M are related to each other, not defined in terms of each other. They call M a derived quantity, however, since usually the processor/controller is the limiting factor in terms of the word length it uses.
On a real system they are predefined. If you have a 8-bit processor, it generally can handle 8-bit memory addresses (m = 8). Since you can represent 256 values with 8-bits, you can have a total of 256 memory addresses (M = 2^8 = 256). As you can see we start with the little m due to the processor constraints, but you could always decide you want a memory space of size M, and use that to select a processor that can handle it based on word-size = log2(M).
Now if we take your assumptions for your example,
512 sets, 8 blocks per set, 32 words per block, 8 bits per word
I have to assume this is an 8-bit processor, given the 8-bit words. At that point, the cache you describe (512 × 8 × 32 = 131,072 words) is larger than your address space (256 words) and therefore pretty meaningless.
You might want to check out Computer Architecture Animations & Java applets. I don't recall whether any of the cache ones cover cache structure (usually they focus on behavior), but it is a resource I saved in the past to tutor students in architecture.
Feel free to further refine your question if it still doesn't make sense.
The two equations for M are just a relationship. They are two ways of saying the same thing. They do not indicate causality, though. I think the assumption made by the author is that the number of unique address bits is defined by the CPU designer at the start via requirements. Then the M can vary per implementation.
m is the width in bits of a memory address in your system, e.g. 32 for x86, 64 for x86-64. Block size on x86, for example, is 4K, so b=12. Block size more or less refers to the smallest chunk of data you can read from durable storage -- you read it into memory, work on that copy, then write it back at some later time. I believe tag bits are the upper t bits that are used to look up data cached locally very close to the CPU (not even in RAM). I'm not sure about the set lines part, although I can make plausible guesses that wouldn't be especially reliable.
Circular ... yes, but I think it's just stating that the two variables m and M must obey the equation. M would likely be a given or assumed quantity.
Example 1: If you wanted to use the formulas for a main memory size of M = 4GB (4,294,967,296 bytes), then m would be 32, since M = 2^32, i.e. m = log2(M). That is, it would take 32 bits to address the entire main memory.
Example 2: If your main memory size assumed were smaller, e.g. M = 16MB (16,777,216 bytes), then m would be 24, which is log2(16,777,216).
It seems you're confused by the math rather than the architectural stuff.
2^m ("2 to the m'th power") is 2 * 2... with m 2's. 2^1 = 2, 2^2 = 2 * 2 = 4, 2^3 = 2 * 2 * 2 = 8, and so on. Notably, if you have an m bit binary number, you can only represent 2^m different numbers. (is this obvious? If not, it might help to replace the 2's with 10's and think about decimal digits)
log2(x) ("logarithm base 2 of x") is the inverse function of 2^x. That is, log2(2^x) = x for all x. (This is a definition!)
You need log2(M) bits to represent M different numbers.
Note that if you start with M=2^m and take log2 of both sides, you get log2(M)=m. The table is just being very explicit.
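To tie the symbols together, here is a small sketch computing the derived quantities for the example in the question, using the usual CS:APP relationships (S = 2^s, B = 2^b, t = m - (s + b)); the 32-bit address width is an assumption, not something the question fixes:

#include <cmath>
#include <cstdio>

int main() {
    // Example from the question: 512 sets, 8 lines per set,
    // 32 words per block, 8 bits per word (so 1 byte per word).
    int S = 512;   // number of sets
    int E = 8;     // lines (blocks) per set
    int B = 32;    // bytes per block
    int m = 32;    // address width in bits (assumed; pick your machine's)

    int s = static_cast<int>(std::log2(S));  // set index bits   = 9
    int b = static_cast<int>(std::log2(B));  // block offset bits = 5
    int t = m - (s + b);                     // tag bits          = 18
    long long M = 1LL << m;                  // unique addresses  = 2^m

    std::printf("s=%d b=%d t=%d M=%lld cache=%d bytes\n",
                s, b, t, M, S * E * B);      // cache = 131072 bytes
}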

Resources