Maximizing Stored Information (Entropy?) - information-theory

So I'm not sure if this question belongs here or maybe Math overflow. In any case, my question is about information theory.
Let's say I have a 16 bit word. There are 65,536 unique configurations of 1's and 0's in that number. What each one of those configurations represents is unimportant as depending on your notation (2's complement vs signed magnitude etc.) the same configuration can mean different things.
What I'm wondering is are there any techniques to store more information than that in a 16 bit word?
My original ideas were like odd/even parity or something but then I realized that's already determined by the configuration... i.e. there is no extra information encoded in that. I'm beginning to wonder if no such thing exists.
EDIT For example, let's say some magical computer (thinking quantum or something here) could understand 0,1,a. Then obviously we have 3^16 configurations and can now store more than the numbers [0 - 65,536]. Are there any other properties of a 16 bit word that you can mess with in order to encode extra information in your bit stream?
EDIT2 I am really struggling to put this into words. Right now when I look at a 16 bit word in the computer, the property which conveys information to me the relative ordering of individual 1's and 0's. Is there another property or way of looking at a 16 bit word which would allow more than 2^16 unique "configurations"? (Note it would no longer be a configuration, but 2^16 xxxx's where xxxx is a noun describing an instance of that property). The only thing I can really think of is something like if we looked at the number of 1 to 0 transitions or something rather than whether each bit was actually a 1 or 0? Now transitions does not yield more than 2^16 combinations because it is ultimately solely dependent on the configuration of 1's and 0's. I'm looking for properties that would derive from the configuration of 1's and 0's AND something else thus resulting in MORE than 2^16. Does anyone even know what this would be called if it did exist?
EDIT3 Ok I got it. My question boils down to this: How do we prove that the configuration of 1's and 0's in a word completely defines it? I.E. How do we prove that you need no other information besides the bitmap to show equality between two 16 bit words?
FINAL EDIT
I have an example... If instead of looking at the presence of 1's and 0's we look at transition between bits we can store 2^16 alphabet characters. If the bit to left is the same, treat it as a 1, if it transitions, treat it as a 0. Using the 16 bit word as a circularly linked list type structure where each link represent 0/1 we basically for a 16 bit word out of the transition between bits. That is an exact example of what I was looking for but that results in 2^16, nothing better. I am convinced that you cannot do better and am marking the correct answer =(

The amount of information in a particular configuration of 16 0/1s is determined by the probability of this configuration (this is called self-information). This can be bigger than 16 bits if the configuration is less likely than 1/(2^16), but that means that some other configurations are more likely than 1/(2^16) and so will contain less information than 16 bits.
To take into account all the possible configurations, you have to use the expected value of self-information (called entropy) of individual configurations. This value will reach its maximum when the probabilities of all configurations are equal (that is 1/(2^16)) and then it will be exactly 16 bits.
So the answer is no, you cannot store more than 16 bits of information in 16 0/1s.
See
http://en.wikipedia.org/wiki/Information_theory
http://en.wikipedia.org/wiki/Self-information
EDIT It is important to realize that bit does not stand for 0 or 1, but it is a unit of information, that is -log_2 P(w) where P(w) is the probability of a particular configuration.

You cannot store more than 2 states in one digit of a semiconductor device. You answered it yourself. The only way more information can be fitted into 16 digits is if each digit were to have many possible values.

Related

how to find out only one digit in a int number?

Im new so if this question was already Asked (i didnt find it scrolling through the list of results though) please send me the link.
I got a math quiz and im to lazy to go through all the possibilities so i thought i can find a program instead. I know a bit about programming but not much.
Is it possible (and in what programming language, and how) to read only one digit, e.g at the 3rd Position, in a integer?
And how is an integer actually saved, in a kind of array?
Thanks!
You can get rid of any lower valued digit (the ones and tens if you only want the hundreds) by dividing with rounding/truncation. 1234/100 is 12 in most languages if you are doing integer division.
You can get rid of any higher valued digits by using taking the modulus. 12 % 10 is 2 in many languages; just find out how the modulus is done in yours. I use "modulus" meaning "divide and keep the rest only", i.e. it is the opposite of "divide with rounding"; that which is lost by rounding is the final result of the modulus.
The alternative is however to actually NOT see the input as a number and treat it as text. That way it is often easier to ignore the last 2 characters and all leading characters.

How are the Bits arranged in a QRcode?

I've wondered how QR codes are working, so i did a research and tried to paint my own in an table in word.
On Wikipedia I found this picture
I understand the configuration, but how you actually store a letter doesnt make sense to me.
With the example letter w.
On even rows black is 0 and on odd rows 1.
So the example should give this binary number 01110011 which would be 115 but w is number 32.
So how do I get the right number
I dont know much about this topic but I found you a video where dude explains it. And from what I understood, there are those cells that are read in order of numbers depending on arrow (there are 4 options here and you posted those yourself). So you simply follow those numbers and write 1s and 0s on paper which results in 8bit number. That video has much more detail.
It is also worth pointing out that it is MSB, meaning if we follow your example (you just considering numbers, not colors since you mislabeled it), it has arrow pointing up, meaning you write right/down to up/left which leads to number : 01110011 which has most significant bit at the left which means its 115

Human readable alternative for UUIDs

I am working on a system that makes heavy use of pseudonyms to make privacy-critical data available to researchers. These pseudonyms should have the following properties:
They should not contain any information (e.g. time of creation, relation to other pseudonyms, encoded data, …).
It should be easy to create unique pseudonyms.
They should be human readable. That means they should be easy for humans to compare, copy, and understand when read out aloud.
My first idea was to use UUID4. They are quite good on (1) and (2), but not so much on (3).
An variant is to encode UUIDs with a wider alphabet, resulting in shorter strings (see for example shortuuid). But I am not sure whether this actually improves readability.
Another approach I am currently looking into is a paper from 2005 titled "An optimal code for patient identifiers" which aims to tackle exactly my problem. The algorithm described there creates 8-character pseudonyms with 30 bits of entropy. I would prefer to use a more widely reviewed standard though.
Then there is also the git approach: only display the first few characters of the actual pseudonym. But this would mean that a pseudonym could lose its uniqueness after some time.
So my question is: Is there any widely-used standard for human-readable unique ids?
Not aware of any widely-used standard for this. Here’s a non-widely-used one:
Proquints
https://arxiv.org/html/0901.4016
https://github.com/dsw/proquint
A UUID4 (128 bit) would be converted into 8 proquints. If that’s too much, you can take the last 64 bits of the UUID4 (= just take 64 random bits). This doesn’t make it magically lose uniqueness; only increases the likelihood of collisions, which was non-zero to begin with, and which you can estimate mathematically to decide if it’s still OK for your purposes.
Here you go UUID Readable
Generate Easy to Remember, Readable UUIDs, that are Shakespearean and Grammatically Correct Sentences
This article suggests to use the first few characters from a SHA-256 hash, similarly to what git does. UUIDs are typically based on SHA-1, so this is not all that different. The tradeoff between property (2) and (3) is in the number of characters.
With d being the number of digits, you get 2 ** (4 * d) identifiers in total, but the first collision is expected to happen after 2 ** (2 * d).
The big question is really not about the kind of identifier you use, it is how you handle collisions.

Encoding DNA strand in Binary

Hey guys I have the following question:
Suppose we are working with strands of DNA, each strand consisting of
a sequence of 10 nucleotides. Each nucleotide can be any one of four
different types: A, G, T or C. How many bits does it take to encode a
DNA strand?
Here is my approach to it and I want to know if that is correct.
We have 10 spots. Each spot can have 4 different symbols. This means we require 4^10 combinations using our binary digits.
4^10 = 1048576.
We will then find the log base 2 of that. What do you guys think of my approach?
Each nucleotide (aka base-pair) takes two bits (one of four states -> 2 bits of information). 10 base-pairs thus take 20 bits. Reasoning that way is easier than doing the log2(4^10), but gives the same answer.
It would be fewer bits of information if there were any combinations that couldn't appear. e.g. some codons (sequence of three base-pairs) that never appear. But ten independent 2-bit pieces of information sum to 20 bits.
If some sequences appear more frequently than others, and a variable-length representation is viable, then Huffman coding or other compression schemes could save bits most of the time. This might be good in a file-format, but unlikely to be good in-memory when you're working with them.
Densely packing your data into an array of 2bit fields makes it slower to access a single base-pair, but comparing the whole chunk for equality with another chunk is still efficient. (memcmp).
20 bits is unfortunately just slightly too large for a 16bit integer (which computers are good at). Storing in an array of 32bit zero-extended values wastes a lot of space. On hardware with good unaligned support, storing 24bit zero-extended values is ok (do a 32bit load and mask the high 8 bits. Storing is even less convenient though: probably a 16b store and an 8b store, or else load the old value and merge the high 8, then do a 32b store. But that's not atomic.).
This is a similar problem for storing codons (groups of three base-pairs that code for an amino acid): 6 bits of information doesn't fill a byte. Only wasting 2 of every 8 bits isn't that bad, though.
Amino-acid sequences (where you don't care about mutations between different codons that still code for the same AA) have about 20 symbols per position, which means a symbol doesn't quite fit into a 4bit nibble.
I used to work for the phylogenetics research group at Dalhousie, so I've sometimes thought about having a look at DNA-sequence software to see if I could improve on how they internally store sequence data. I never got around to it, though. The real CPU intensive work happens in finding a maximum-likelihood evolutionary tree after you've already calculated a matrix of the evolutionary distance between every pair of input sequences. So actual sequence comparison isn't the bottleneck.
do the maths:
4^10 = 2^2^10 = 2^20
Answer: 20 bits

What is the cost of deleting a value from a hashtable?

Now I have this question where I was asked the cost of deleting a value from a hash table when we used linear probing while the insertion process.
What I could figure out from reading various stuff on the internet is that it has to do something with the load factor. Though I am not sure, but I read a relation between the load factor and no of probes required and it is No of probes = 1 / (1-LF).
So I believe the cost has to be dependent on the probe sequence. But then another thought ruins everything.
What if the element was inserted in p probes and now I am trying to delete this element. But before this I had already deleted few elements having the same hash code and were a part of insertion in probes less than p.
In this case I reach to a stage where I see a slot empty in the hash table but I am not sure if the element I am trying to delete is already deleted or is at some other location as a result of probing.
I also found that once I delete an element I must mark this slot with some special indicator to inform that it is available, but this doesn't solve my problem of being uncertain about the element which I am willing to delete.
Could anyone please suggest how to find the cost in such cases?
Is the approach going to vary if it is non-linear probing?
The standard approach is "lookup the element, mark as deleted". Marking obviously has O(1) cost, so the total operation cost is the same as just lookup: O(1) expected. It can be as high as O(n) in degenerate cases (e.g. all elements have the same hash). O(1) expected is all we can say theoretically.
About the load factor. The higher the load factor (ratio of number of occupied buckets to the total number), the larger is the expected factor (but this doesn't change the theoretical O cost). Note that in this case load factor includes number of both present in the table elements plus the number of buckets that got marked as deleted previously.
Other probing kinds (e.g. quadratic) don't change the theoretical cost, but may alter the expected constant factor or its variance. If you look at "fallback" sequences, in linear ordering the sequences of different buckets overlap. This means that if for some bucket the sequence is long, for adjacent buckets it will also be long. E.g.: if buckets 4 to 10 are occupied, sequence for bucket #4 is 7 bucket long (4, 5, 6, ..., 10), for #5 it's 6 and so on. For quadratic probing this is not the case.
However, linear probing has the benefit of better memory-cache behavior, since you check memory cells close to each other. In practice, though, for quadratic probing fallback sequences are rarely long enough for this to matter.
Finally, in linear probing case, it is possible to work without deleted mark, but for this you'd have to complicate deleting procedure considerably (still O(1) expected, though, but with much higher constant factor). Whether it is worth it has to be decided with actual profiling; for example, this simplifies inserting somewhat and lookup a bit. For a C++ implementation this would have the downside that erase() would invalidate iterators, though.

Resources