Any theoretical limit to compression? - information-theory

Imagine that you had all the supercomputers in the world at your disposal for the next 10 years. Your task is to compress 10 full-length movies losslessly as much as possible. Another criterion is that a normal computer should be able to decompress them on the fly and should not need to spend much of its hard disk to install the decompressing software.
My question is, how much more compression could you achieve than the best alternatives today? 1%, 5%, 50%? More specifically: is there a theoretical limit to compression, given a fixed dictionary size (if it is called that for video compression as well)?

The limits of compression are dictated by the randomness of the source. Welcome to the study of information theory! See data compression.
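As a rough illustration of "randomness of the source", here is a minimal Python sketch that estimates the empirical entropy of a byte string. It assumes a zeroth-order model (each byte treated as independent), which is only one possible model; real video codecs exploit structure between symbols, so this is not a bound on what they can achieve. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Empirical zeroth-order Shannon entropy, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_bound_bytes(data: bytes) -> float:
    """Size in bytes that an ideal coder for this byte-frequency model would approach."""
    return entropy_bits_per_byte(data) * len(data) / 8
```

A string of identical bytes has zero entropy under this model, while a string in which all 256 byte values are equally frequent has 8 bits per byte and cannot be shrunk by a coder that only looks at byte frequencies.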

There is a theoretical limit: I suggest reading this article on information theory and the pigeonhole principle. It sums up the issue in a way that is very easy to understand.

If you have a fixed catalogue of all the movies you were ever going to compress, you could just send an id for the movie and have the "decompression" look up the data with that index. So compression could be to a fixed size of log2(N) bits, where N is the number of movies.
I suspect the practical lower bound is rather higher than this.
Do you really mean lossless? Most of today's video compression is lossy, I thought.
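The catalogue argument above can be made concrete with a small Python sketch. It assumes the decoder already holds all N movies; `id_bits` is a hypothetical helper that returns the number of bits needed for a fixed-width id, rounded up to a whole bit.

```python
def id_bits(catalogue_size: int) -> int:
    """Bits needed for a fixed-width id that distinguishes catalogue_size items."""
    if catalogue_size < 1:
        raise ValueError("catalogue must contain at least one item")
    # (N - 1).bit_length() equals ceil(log2(N)) for N >= 1, without float rounding.
    return (catalogue_size - 1).bit_length()
```

For a catalogue of 10 movies this gives 4 bits per movie, which shows that the cost has simply moved into the size of the decoder.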

It is important to restate the limits in light of recent developments in information theory, and to state the hypotheses under which each limit is valid.
Information theory uses three fundamental hypotheses:
The information is defined by the entropy function H(X).
The information that identifies the source is known to both the encoder and the decoder.
Only the source and its isomorphisms are considered, which means that we can decode one symbol at a time.
The first limit, the most famous, was defined by Shannon and holds when all three hypotheses are true:
N H(X)
where H(X) is the entropy of the source X and N is the number of symbols in the message.
Second limit: we remove the second hypothesis, so the decoder does not know the source:
N H(X) + source information
Third limit: we also remove the third hypothesis. In this case Set Shaping Theory (SST) is used, a new method in information theory. This theory studies the one-to-one functions f that transform a set of strings into a set of equal size made up of strings of greater length. With this method, we get the following limit:
N2 H(Y) + source information ≈ N H(X)
with f(X) = Y and N2 > N.
In practice, we obtain a compression gain equivalent to the information necessary to describe the source. That information represents the inefficiency of entropy coding.
In this case, however, it is not possible to decode one symbol at a time (the code is not instantaneous); the whole message must be decoded before the original message is obtained.
Important progress has been made in this area: the theory has been applied to a concrete case of data compression in "Practical applications of Set Shaping Theory in Huffman coding".
Another interesting aspect is that the authors shared the code and the function that performs the transform described in the set shaping theory. The file is shared on Matlab file exchange: https://www.mathworks.com/matlabcentral/fileexchange/115590-test-sst-huffman-coding?s_tid=FX_rc1_behav

Difference between shuffle() and rebalance() in Apache Flink

I am working on my bachelor's final project, which compares Apache Spark Streaming and Apache Flink (streaming only), and I have just reached "Physical partitioning" in Flink's documentation. The problem is that the documentation does not explain well how these two transformations work. Directly from the documentation:
shuffle(): Partitions elements randomly according to a uniform distribution.
rebalance(): Partitions elements round-robin, creating equal load per partition. Useful for performance optimisation in the presence of data skew.
Source: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html#physical-partitioning
Both are done automatically, so what I understand is that they both redistribute the data evenly and randomly (shuffle() > uniform distribution & rebalance() > round-robin). From this I deduce that rebalance() distributes the data in a better way ("equal load per partition"), so the tasks have to process the same amount of data, whereas shuffle() may create bigger and smaller partitions. In which cases, then, might you prefer shuffle() over rebalance()?
The only thing that comes to my mind is that rebalance() probably requires some processing time, so in some cases the rebalancing might cost more time than it saves in later transformations.
I have been looking for this and nobody seems to have discussed it, except on a Flink mailing list, and even there they don't explain how shuffle() works.
Thanks to Sneftel, who helped me improve my question by asking things that made me rethink what I wanted to ask, and to Till, who answered my question quite well. :D
As the documentation states, shuffle will randomly distribute the data whereas rebalance will distribute the data in a round robin fashion. The latter is more efficient since you don't have to compute a random number. Moreover, depending on the randomness, you might end up with some kind of not so uniform distribution.
On the other hand, rebalance will always start by sending the first element to the first channel. Thus, if you have only a few elements (fewer elements than subtasks), then only some of the subtasks will receive elements, because you always start by sending the first element to the first subtask. In the streaming case this should eventually not matter, because you usually have an unbounded input stream.
The actual reason why both methods exist is historical. shuffle was introduced first. In order to make the batch and streaming APIs more similar, rebalance was then introduced.
This statement by Flink is misleading:
Useful for performance optimisation in the presence of data skew.
Since it's used to describe rebalance, but not shuffle, it suggests it's the distinguishing factor. My understanding of it was that if some items are slow to process and some fast, the partitioner will use the next free channel to send the item to. But this is not the case; compare the code for rebalance and shuffle. rebalance just sends to the next channel regardless of how busy it is.
// rebalance
nextChannelToSendTo = (nextChannelToSendTo + 1) % numberOfChannels;
// shuffle
nextChannelToSendTo = random.nextInt(numberOfChannels);
The statement can also be understood differently: the "load" doesn't mean actual processing time, just the number of items. If your original partitioning has skew (vastly different numbers of items in the partitions), the operation will assign items to partitions uniformly. However, in this case it applies to both operations.
My conclusion: shuffle and rebalance do the same thing, but rebalance does it slightly more efficiently. The difference is so small that you are unlikely to notice it: java.util.Random can generate 70m random numbers in a single thread on my machine.
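To make the comparison above tangible, here is a small Python simulation of the two channel-selection rules quoted from the Flink source. It is only a sketch, not Flink code: the function names are hypothetical, and the starting channel for the round-robin case is assumed to be the first one, as the earlier answer describes.

```python
import random


def rebalance_counts(num_items: int, num_channels: int) -> list:
    """Round-robin: each item goes to the channel after the previous one."""
    counts = [0] * num_channels
    next_channel = -1  # assumption: the first item lands on channel 0
    for _ in range(num_items):
        next_channel = (next_channel + 1) % num_channels
        counts[next_channel] += 1
    return counts


def shuffle_counts(num_items: int, num_channels: int, seed: int = 0) -> list:
    """Random: each item goes to a uniformly chosen channel."""
    rng = random.Random(seed)
    counts = [0] * num_channels
    for _ in range(num_items):
        counts[rng.randrange(num_channels)] += 1
    return counts
```

With round-robin, the per-channel counts never differ by more than one item, and with fewer items than channels only the first few channels receive anything. With random selection the counts are only equal on average, so small inputs can look noticeably uneven.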

Which codec is best and what should be their parameters value?

I'm a beginner in the field of audio codecs and find it hard to understand how the sampling rate, bit rate and other parameters affect encoding/decoding (the audio format), the audio quality and the file size.
I read that constant bit rate is better than variable bit rate, but how do I know what bit rate would encode the file in as small a size as possible without compromising quality? I'm specifically focusing on audio codecs for now.
I have heard about OPUS, SILK, G.722 and SPEEX, but I don't know which one I should use to get better quality and a smaller file size. Also, what parameters should I set for these codecs so that they work effectively for me?
Can anyone guide me on this?
Thanks in advance
If you think of the original analog music as a sound wave, then converting it to digital means approximating that wave as digital bits. The sampling rate is how many points on that wave you take per unit of time, so the higher the sampling rate, the closer you are to the original sound. A lower sampling rate means higher compression but lower audio quality.
Similarly, the bit rate is effectively how much information you spend encoding each moment of audio (bits per second), so again, a lower bit rate means higher compression but lower audio quality.
Compression algorithms generally use psychoacoustics to try to determine what information can be lost with the least audible difference. In some sections of a track this may be more or less than in others, so using a variable bit rate enables you to achieve higher compression without a 'big' audible drop in quality.
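The relationship between these parameters and file size is simple arithmetic, sketched below in Python. The figures used in the usage note (44.1 kHz, 16-bit, stereo, 128 kbit/s) are illustrative assumptions, not recommendations, and `bit_depth` and `channels` are parameters of uncompressed PCM audio that the explanation above does not name explicitly.

```python
def pcm_size_bytes(sample_rate_hz: int, bit_depth: int, channels: int, seconds: int) -> int:
    """Size of uncompressed PCM audio: samples per second x bits per sample x channels."""
    return sample_rate_hz * bit_depth * channels * seconds // 8


def encoded_size_bytes(bit_rate_kbps: int, seconds: int) -> int:
    """Approximate size of a constant-bit-rate stream (container overhead ignored)."""
    return bit_rate_kbps * 1000 * seconds // 8
```

One minute of 44.1 kHz, 16-bit stereo PCM is 10,584,000 bytes, while one minute at a constant 128 kbit/s is 960,000 bytes, roughly an 11:1 reduction. With a variable bit rate the second figure becomes an average rather than a fixed cost.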
It's well explained here: Link
I don't know the details of those codecs but generally what you should use and what parameters you should pass depends on what you're trying to achieve and for what purpose. For portable use where audio quality might not be paramount you might want to pass lower values to achieve smaller file sizes - for audiophile speakers you probably want to pass the maximum.

I want to convert a sound from Mic to binary and match it from the database

I want to convert sound from the mic to binary and match it against a database (a kind of voice identification program), but I have no idea how to get sound from the mic directly so that I can convert it to binary. Is this possible at all? Please guide me.
See this:
http://www.dotnetspider.com/resources/4967-How-record-voice-from-microphone.aspx
You're not going to be able to identify voices by doing a binary comparison on sound data. The binary of a particular sound will not be identical to an imitation of that sound unless it is literally the same file, because of minor variations in just about everything. You'll need to do some signal processing to do a fuzzy comparison of the data. You can read about signal processing on Wikipedia.
You will probably find it easier to use a third party library to process the sound for you. Something like this might be a good start.
You're looking at two very distinct problems here.
The first is pretty technical: Getting sound from the microphone into a digital waveform. How you do this exactly depends on the OS and API you're using (on Windows, you're probably looking at DirectX audio or, if available, ASIO). Typically, this is how you'd proceed:
Set up a recording buffer for the microphone, with suitable parameters (number of channels, physical input on the sound card, sample rate, bit depth, buffer size)
Start the recording. This usually involves pointing the sound library to a callback function to process the recorded buffer.
In the callback, read the buffer, convert it to a suitable format, and append it to the audio file of your choice. (You could also record to RAM only, but longer recordings may exceed available storage).
Store the recorded audio in a suitable database field (some kind of binary blob)
This is the easy part though; the harder part is matching a chunk of audio data against other chunks. A naïve approach would be to try and find exact matches, but that won't help you much, because the chance that you find one is practically zero - recording equipment, even the best, introduces a bit of random noise, and recording setups vary slightly whether you want to or not, so even if you'd have someone say something twice, perfectly identical, you'd still see differences in the recorded audio.
What you need to do, then, is find certain typical characteristics of the waveform. Things you could look for are:
Overall amplitude shape
Base frequencies
Selected harmonics (formants)
Extracting these is non-trivial and involves pretty severe math; and then you'll have to condense them into some sort of fingerprint, and find a way to compare them with some fuzziness (so that a near-match is good enough, rather than requiring exact matches). Finding the right parameters and comparison algorithms isn't easy, and it takes a lot of tweaking and testing; your best bet is to go find a library that does this for you.
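The idea of "extract a characteristic, then compare with some fuzziness" can be sketched in a few lines of Python. This is a toy under strong assumptions, not a voice identification method: it uses only the overall amplitude shape (an RMS envelope), compares envelopes with cosine similarity, and the `threshold` of 0.99 is an arbitrary illustrative value. All names are hypothetical.

```python
import math


def rms_envelope(samples, frame_size=400):
    """Overall amplitude shape: one RMS value per non-overlapping frame."""
    envelope = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        envelope.append(math.sqrt(sum(s * s for s in frame) / frame_size))
    return envelope


def cosine_similarity(a, b):
    """1.0 for identical direction, lower for different shapes."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def is_fuzzy_match(samples_a, samples_b, threshold=0.99):
    """Near-match on amplitude shape instead of an exact sample comparison."""
    return cosine_similarity(rms_envelope(samples_a), rms_envelope(samples_b)) >= threshold
```

Two recordings of the same decaying tone with slightly different noise are not equal sample by sample, yet their envelopes are nearly identical, so they match. A real system would use richer features such as the base frequencies and formants listed above, which is where a dedicated library earns its keep.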

Transfer files using checksums only?

Would it be possible to transfer large files using only a system of checksums, and then reconstruct the original file by calculations?
Say that you transfer the MD5 checksum of a file and the size of the file. By making a "virtual file" and calculating its checksum, trying every single bit combination, you should eventually "reach" the original file. But on the way you would also get a lot of "collisions" where the checksum also matches.
So we change the first byte of the original file to some specified value, calculate the checksum again, and send this too. If we make the same substitution in the virtual file we can test each "collision" to see if it still matches. This should narrow it down a bit, and we can do this several times.
Of course, the computing power to do this would be enormous. But is it theoretically possible, and how many checksums would you need to transfer something (say 1 MB)? Or would the amount of data needed to transfer the checksums perhaps be almost as large as the file, making it pointless?
The amount of data you need to transfer would almost certainly be the same size as the file. Consider: if you could communicate an n-byte file with n-1 bytes of data, you would have 256^(n-1) possible patterns of data to send, but you are selecting from a space of size 256^n. This means that at most one out of every 256 files would be expressible using this method. This is often referred to as the pigeonhole principle.
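The counting argument can be checked exhaustively at a small scale. The Python sketch below enumerates every 2-byte file and groups them by a 1-byte checksum; `checksum8` (sum of bytes modulo 256) is a deliberately simple stand-in chosen for illustration, not MD5.

```python
from collections import Counter
from itertools import product


def checksum8(data: bytes) -> int:
    """Toy 1-byte checksum: sum of all bytes modulo 256."""
    return sum(data) % 256


def preimage_counts() -> Counter:
    """How many of the 65536 possible 2-byte files map to each checksum value."""
    return Counter(checksum8(bytes(pair)) for pair in product(range(256), repeat=2))
```

Every one of the 256 checksum values is shared by exactly 256 different files, so the checksum alone identifies the file only to within 256 candidates. No choice of 1-byte function can do better on average, because 65536 files have to share 256 values.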
Now, even if that weren't a problem, there's no guarantee that you won't have a collision after any given amount of checksumming. Checksum algorithms are designed to avoid collisions, but for most checksum/hash algorithms there's no strong proof that after X hashes you can guarantee no collisions in an N-byte space.
Finally, hash algorithms, at least, are designed to be hard to reverse, so even if it were possible it would take an impossibly huge amount of CPU power to do so.
That said, for a similar approach, you might be interested in reading about Forward Error Correction codes - they're not at all hash algorithms, but I think you may find them interesting.
What you have here is a problem of information. A checksum is not necessarily unique to a particular set of data; in fact, to be so it would effectively need to have as many bits of information as the source. What it can indicate is that the data received is not the exact data that the checksum was generated from; what it can't do, in most cases, is prove that the data is identical.
In short "no".
To take a hypothetical example, consider a 24 bpp photo with 6 pixels: there are 2^(24 * 6) (2^144) possible combinations of intensities for the colour channels of those six pixels, so if you were to evaluate every possibility you would be guaranteed MD5 collisions (as an MD5 checksum is a 128-bit number).
Short answer: Not in any meaningful form.
Long answer:
Let us assume an arbitrary file file.bin with a 1000-byte size. There are 2^(8*1000) different combinations that could be its actual contents. By sending e.g. a 1000-bit checksum, you still have about 2^(7*1000) colliding alternatives.
By sending a single additional bit, you might be able to cut those down by half... and you would still have 2^6999 collisions. By the time you eliminate the collisions, you will have sent at least 8000 bits, i.e. an amount equal to or greater than the file size.
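The arithmetic in this answer can be written down directly. The Python sketch below assumes an idealised checksum in which every transmitted bit halves the set of candidate files, which is the best case; `remaining_candidates_log2` is a hypothetical helper that works with exponents so the numbers stay readable.

```python
def remaining_candidates_log2(file_bits: int, sent_bits: int) -> int:
    """log2 of the number of files still consistent with the bits sent so far."""
    return max(file_bits - sent_bits, 0)


def bits_needed_for_unique_file(file_bits: int) -> int:
    """Bits that must be sent before only one candidate remains (ideal case)."""
    sent = 0
    while remaining_candidates_log2(file_bits, sent) > 0:
        sent += 1
    return sent
```

For the 1000-byte file, 1000 checksum bits leave 2^7000 candidates, one more bit leaves 2^6999, and the candidate set shrinks to a single file only after 8000 bits, which is the size of the file itself.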
The only way for this to be theoretically possible (Note: I did not say "feasible", let alone "practical") would be if the file did not really contain random data and you could use that knowledge to prune alternatives. In that case you'd be better off using compression, anyway. Content-aware compression algorithms (e.g. FLAC for audio) use a-priori knowledge of the properties of the input data to improve the compression ratio.
I think what you are thinking of is in fact an interesting topic, but you haven't hit upon the right method. If I can try to rephrase your question, you are asking if there is a way to apply a function to some data, transmit the result of the function, and then reconstruct the original data from the terser function result. For a single MD5 checksum the answer is no, but with other functions, provided you are willing to send several function results, it is possible. In general this area of research is called compressed sensing. Sometimes exact reconstruction is possible, but more often it is used as a lossy compression scheme for images and other visual or sound data.

What is the best compression library for very small amounts of data (3-4 kib?)

I am working on a game engine which is loosely descended from Quake 2, adding some things like scripted effects (allowing the server to specify special effects in detail to a client, instead of having only a limited number of hardcoded effects which the client is capable of.) This is a tradeoff of network efficiency for flexibility.
I've hit an interesting barrier. See, the maximum packet size is 2800 bytes, and only one can go out per client per frame.
Here is the script to do a "sparks" effect (could be good for bullet impact sparks, electrical shocks, etc.)
http://pastebin.com/m7acdf519 (If you don't understand it, don't sweat it; it's a custom syntax I made and not relevant to the question I am asking.)
I have done everything possible to shrink the size of that script. I've even reduced the variable names to single letters. But the result is exactly 405 bytes. Meaning you can fit at most 6 of these per frame. I also have in mind a few server-side changes which could shave it down another 12, and a protocol change which might save another 6. Although the savings would vary depending on what script you are working with.
However, of those 387 bytes, I estimate that only 41 would be unique between multiple usages of the effect. In other words, this is a prime candidate for compression.
It just so happens that R1Q2 (a backward-compatible Quake 2 engine with an extended network protocol) has Zlib compression code. I could lift this code, or at least follow it closely as a reference.
But is Zlib necessarily the best choice here? I can think of at least one alternative, LZMA, and there could easily be more.
The requirements:
Must be very fast (must have very small performance hit if run over 100 times a second.)
Must cram as much data as possible into 2800 bytes
Small metadata footprint
GPL compatible
Zlib is looking good, but is there anything better? Keep in mind, none of this code is being merged yet, so there's plenty of room for experimentation.
Thanks,
-Max
EDIT: Thanks to those who have suggested compiling the scripts into bytecode. I should have made this clear-- yes, I am doing this. If you like you can browse the relevant source code on my website, although it's still not "prettied up."
This is the server-side code:
Lua component: http://meliaserlow.dyndns.tv:8000/alienarena/lua_source/lua/scriptedfx.lua
C component: http://meliaserlow.dyndns.tv:8000/alienarena/lua_source/game/g_scriptedfx.c
For the specific example script I posted, this gets a 1172 byte source down to 405 bytes-- still not small enough. (Keep in mind I want to fit as many of these as possible into 2800 bytes!)
EDIT2: There is no guarantee that any given packet will arrive. Each packet is supposed to contain "the state of the world," without relying on info communicated in previous packets. Generally, these scripts will be used to communicate "eye candy." If there's no room for one, it gets dropped from the packet and that's no big deal. But if too many get dropped, things start to look strange visually and this is undesirable.
LZO might be a good candidate for this.
FINAL UPDATE: The two libraries seem about equivalent. Zlib gives about 20% better compression, while LZO's decoding speed is about twice as fast, but the performance hit for either is very small, nearly negligible. That is my final answer. Thanks for all other answers and comments!
UPDATE: After implementing LZO compression and seeing only slightly better performance, it is clear that my own code is to blame for the performance hit (massively increased number of scripted effects possible per packet, thus my effect "interpreter" is getting exercised a lot more.) I would like to humbly apologize for scrambling to shift blame, and I hope there are no hard feelings. I will do some profiling and then maybe I will be able to get some numbers which will be more useful to someone else.
ORIGINAL POST:
OK, I finally got around to writing some code for this. I started out with Zlib, here are the first of my findings.
Zlib's compression is insanely great. It is reliably reducing packets of, say, 8.5 kib down to, say, 750 bytes or less, even when compressing with Z_BEST_SPEED (instead of Z_DEFAULT_COMPRESSION.) The compression time is also pretty good.
However, I had no idea the decompression speed of anything could even possibly be this bad. I don't have actual numbers, but it must be taking 1/8 second per packet at least! (Core2Duo T550 @ 1.83 GHz.) Totally unacceptable.
From what I've heard, LZMA is a tradeoff of worse performance vs. better compression when compared to Zlib. Since Zlib's compression is already overkill and its performance is already incredibly bad, LZMA is off the table sight unseen for now.
If LZO's decompression time is as good as it's claimed to be, then that is what I will be using. I think in the end the server will still be able to send Zlib packets in extreme cases, but clients can be configured to ignore them and that will be the default.
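For readers who want to try the Zlib experiment described in this answer, here is a minimal sketch using Python's standard `zlib` module, which wraps the same library (the engine itself is C, so this only illustrates the idea). `pack_effects` is a hypothetical helper; the 2800-byte limit and `Z_BEST_SPEED` come from the question and the answer, while the greedy "add scripts until the compressed payload no longer fits" strategy is an assumption.

```python
import zlib

MAX_PACKET_BYTES = 2800  # packet limit stated in the question


def pack_effects(scripts, limit=MAX_PACKET_BYTES, level=zlib.Z_BEST_SPEED):
    """Greedily include scripts while the compressed payload still fits in one packet.

    Returns (compressed_payload, number_of_scripts_included).
    """
    included = []
    payload = b""
    for script in scripts:
        candidate = zlib.compress(b"".join(included + [script]), level)
        if len(candidate) > limit:
            break  # this script would overflow the packet, so it is dropped
        included.append(script)
        payload = candidate
    return payload, len(included)


def unpack_effects(payload: bytes) -> bytes:
    """Client side: recover the concatenated script bytes."""
    return zlib.decompress(payload)
```

Because several effects in one packet share most of their bytes, compressing them together lets the later copies be encoded as references to the first one. Recompressing on every step is wasteful and is done here only to keep the sketch short.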
zlib might be a good candidate: the license is very good, it works fast, and its authors say it has very little overhead, and overhead is what makes compressing small amounts of data problematic.
You should look at OpenTNL and adapt some of the techniques it uses, like the concept of Network Strings.
I would be inclined to use the most significant bit of each character, which is currently wasted: by shifting groups of 9 bytes leftwards, you will fit them into 8 bytes.
You could go further and map the characters into a smaller space. Can you get them down to 6 bits (i.e. only having 64 valid characters) by, for example, not allowing capital letters and subtracting 0x20 from each character (so that space becomes value 0)?
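The first suggestion above, dropping the unused top bit of each 7-bit character, can be sketched in Python as follows. This is one possible bit layout (most significant bits first), chosen for illustration; `pack7` and `unpack7` are hypothetical names, and the receiver is assumed to know the original character count.

```python
def pack7(data: bytes) -> bytes:
    """Pack 7-bit characters tightly, 8 characters per 7 bytes."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently in acc
    out = bytearray()
    for b in data:
        if b > 0x7F:
            raise ValueError("input must be 7-bit ASCII")
        acc = (acc << 7) | b
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # pad the last byte with zeros
    return bytes(out)


def unpack7(packed: bytes, count: int) -> bytes:
    """Recover `count` 7-bit characters from data produced by pack7."""
    acc = 0
    nbits = 0
    out = bytearray()
    for b in packed:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7 and len(out) < count:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1
    return bytes(out)
```

With this layout a 405-byte script shrinks to 355 bytes, a fixed 12.5% saving that does not depend on the content. The 6-bit variant would follow the same pattern with a character mapping step in front.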
You could go further by mapping the frequency of each character and building a Huffman-type compression to reduce the average number of bits per character.
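To estimate what a Huffman-type code would buy, the Python sketch below computes the code length of each symbol from its frequency and the resulting total size. It is a textbook construction rather than anything specific to this engine, the function names are hypothetical, and it ignores the cost of telling the client which code table is in use.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict:
    """Map each byte value in data to the length in bits of its Huffman code."""
    freq = Counter(data)
    if len(freq) == 1:
        return {next(iter(freq)): 1}  # a lone symbol still needs one bit
    # Heap entries: (weight, tie_breaker, symbols in this subtree)
    heap = [(count, idx, [sym]) for idx, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1  # every merge adds one bit to the symbols below it
        heapq.heappush(heap, (w1 + w2, tie, syms1 + syms2))
        tie += 1
    return lengths


def huffman_encoded_bits(data: bytes) -> int:
    """Total payload size in bits if data were coded with these lengths."""
    lengths = huffman_code_lengths(data)
    return sum(lengths[b] for b in data)
```

Frequent characters get short codes and rare ones get long codes, so the gain depends entirely on how skewed the character frequencies are. If the table is fixed in advance and shared by server and client, it costs nothing per packet.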
I suspect that there are no algorithms that will save data any better than that, in the general case, as there is essentially no redundancy in the message after the changes that you have already made.
How about sending a binary representation of your script?
So I'm thinking along the lines of an Abstract Syntax Tree with each procedure having an identifier.
This means a performance gain on the clients due to the one-time parsing, and a decrease in size due to removing the method names.
