What is the best compression library for very small amounts of data (3-4 kib?) - networking

I am working on a game engine which is loosely descended from Quake 2, adding some things like scripted effects (allowing the server to specify special effects in detail to a client, instead of having only a limited number of hardcoded effects which the client is capable of.) This is a tradeoff of network efficiency for flexibility.
I've hit an interesting barrier. See, the maximum packet size is 2800 bytes, and only one can go out per client per frame.
Here is the script to do a "sparks" effect (could be good for bullet impact sparks, electrical shocks, etc.)
http://pastebin.com/m7acdf519 (If you don't understand it, don't sweat it; it's a custom syntax I made and not relevant to the question I am asking.)
I have done everything possible to shrink the size of that script. I've even reduced the variable names to single letters. But the result is exactly 405 bytes. Meaning you can fit at most 6 of these per frame. I also have in mind a few server-side changes which could shave it down another 12, and a protocol change which might save another 6. Although the savings would vary depending on what script you are working with.
However, of those 387 bytes, I estimate that only 41 would be unique between multiple usages of the effect. In other words, this is a prime candidate for compression.
It just so happens that R1Q2 (a backward-compatible Quake 2 engine with an extended network protocol) has Zlib compression code. I could lift this code, or at least follow it closely as a reference.
But is Zlib necessarily the best choice here? I can think of at least one alternative, LZMA, and there could easily be more.
The requirements:
Must be very fast (must have very small performance hit if run over 100 times a second.)
Must cram as much data as possible into 2800 bytes
Small metadata footprint
GPL compatible
Zlib is looking good, but is there anything better? Keep in mind, none of this code is being merged yet, so there's plenty of room for experimentation.
Thanks,
-Max
EDIT: Thanks to those who have suggested compiling the scripts into bytecode. I should have made this clear-- yes, I am doing this. If you like you can browse the relevant source code on my website, although it's still not "prettied up."
This is the server-side code:
Lua component: http://meliaserlow.dyndns.tv:8000/alienarena/lua_source/lua/scriptedfx.lua
C component: http://meliaserlow.dyndns.tv:8000/alienarena/lua_source/game/g_scriptedfx.c
For the specific example script I posted, this gets a 1172 byte source down to 405 bytes-- still not small enough. (Keep in mind I want to fit as many of these as possible into 2800 bytes!)
EDIT2: There is no guarantee that any given packet will arrive. Each packet is supposed to contain "the state of the world," without relying on info communicated in previous packets. Generally, these scripts will be used to communicate "eye candy." If there's no room for one, it gets dropped from the packet and that's no big deal. But if too many get dropped, things start to look strange visually and this is undesirable.

LZO might be a good candidate for this.

FINAL UPDATE: The two libraries seem about equivalent. Zlib gives about 20% better compression, while LZO's decoding speed is about twice as fast, but the performance hit for either is very small, nearly negligible. That is my final answer. Thanks for all other answers and comments!
UPDATE: After implementing LZO compression and seeing only sightly better performance, it is clear that my own code is to blame for the performance hit (massively increased number of scripted effects possible per packet, thus my effect "interpreter" is getting exercised a lot more.) I would like to humbly apologize for scrambling to shift blame, and I hope there are no hard feelings. I will do some profiling and then maybe I will be able to get some numbers which will be more useful to someone else.
ORIGINAL POST:
OK, I finally got around to writing some code for this. I started out with Zlib, here are the first of my findings.
Zlib's compression is insanely great. It is reliably reducing packets of, say, 8.5 kib down to, say, 750 bytes or less, even when compressing with Z_BEST_SPEED (instead of Z_DEFAULT_COMPRESSION.) The compression time is also pretty good.
However, I had no idea the decompression speed of anything could even possibly be this bad. I don't have actual numbers, but it must be taking 1/8 second per packet at least! (Core2Duo T550 # 1.83 Ghz.) Totally unacceptable.
From what I've heard, LZMA is a tradeoff of worse performance vs. better compression when compared to Zlib. Since Zlib's compression is already overkill and its performance is already incredibly bad, LZMA is off the table sight unseen for now.
If LZO's decompression time is as good as it's claimed to be, then that is what I will be using. I think in the end the server will still be able to send Zlib packets in extreme cases, but clients can be configured to ignore them and that will be the default.

zlib might be a good candidate - license is very good, works fast and its authors say it has very little overhead and overhead is the thing that makes use for small amounts of data problematic.

you should look at OpenTNL and adapt some of the techniques they use there, like the concept of Network Strings

I would be inclinded to use the most significant bit of each character, which is currently wasted, by shifting groups of 9 bytes leftwards, you will fit into 8 bytes.
You could go further and map the characters into a small space - can you get them down to 6 bits (i.e. only having 64 valid characters) by, for example, not allowing capital letters and subtracting 0x20 from each character ( so that space becomes value 0 )
You could go further by mapping the frequency of each character and make a Huffman type compression to reduce the avarage number bits of each character.
I suspect that there are no algorithms that will save data any better that, in the general case, as there is essentially no redundancy in the message after the changes that you have alrady made.

How about sending a binary representation of your script?
So I'm thinking in the lines of a Abstract Syntax Tree with each procedure having a identifier.
This means preformance gain on the clients due to the one time parsing, and decrease of size due to removing the method names.

Related

What Are The Reasons For Bit Shifting A Float Before Sending It Via A Network

I work with Unity and C# - when making multiplayer games I've been told that when it comes to values like positions that are floats, I should use a bit shift operator on them before sending them and reverse the operation on receive. I have been told this not only allows for larger numbers values and is capable of maintaining floating point precision which may be lost. However, if I do not have to, I do not wish to run this operation every time I receive a packet unless I have to. Though the bottle necks seem to be the actual parsing of the bytes received. Especially without message framing and attempting to move from string to byte array. (But that's another story!)
My question are:
Are these valid reason to undergo the operation? Are they accurate statements?
If not should I be running bit shift ops on my floats?
If I should, what are the real reasons to do it?
Any additional information would be most appreciated.
One of the resourcesI'm referring to:
Main reasons for going back and forth to/from network byte order is to combat endianness caused problems, mainly to ensure each byte of multi byte values (long, int but also floats) is read and written in the way giving the same results regardless of architecture. This issue can be theoretically ignored if you are sure you are exchanging data between systems using the same endianness, but that's rather bad idea from very beginning as you are simply creating unneded technological debt and keep unjustified exceptions in the code ("It all works BUT on the same endianness only. What can go wrong?").
Depending on your app architecture you can rewrite the packet payload/data once you receive it and then use that version further in the code. Also note that you need to encode the data again prior sending it out.

Math - big number from couple of numbers export-able

Let's say I have some numbers, like
5,10,7,8,9,6,2,4,8,5,3,9,78,5,6
I need to send this to another computer, but as the least number of possible bytes. I know what there is a way to do that, I just forgot what it's called and how it works, but generally doing some math with those numbers, getting a big number that, from this number, I'll be able to export the data and get this numbers from this number. Thanks in advance.
EDIT
OK so I need to send this text in UDP but I need it as less bits as possible. I'm sending some options, like firstcolor-secondcolor, let's say I have 15 colors. Every color is just number, from 1 to 199, but maybe there is a better way to send this data? thanks.
No one can say which compression scheme is the best for you. We don't have any information about the numbers. But as a first try, you could just write them into a file and use gzip compression on it. Or bzip2, or 7zip.
And only if all these don't help, you should think about doing the compression yourself.
You also didn't tell us your operating systems (source computer, destination computer) and from where you get the data.
[Update, based on the edit in the question:] So basically you want to send some numbers in the range of 1 to 199. This is pretty close to what a single byte can hold.
If it is ok that you use 8 bits per number (meaning you waste 0.4 bits per number), this is trivial but highly depends on the programming language. Here is how it might look like in Java syntax:
ByteBuffer buf = new ByteBuffer();
buf.add(1);
buf.add(199);
buf.add(78);
buf.add(7);
udpSocket.send(buf.toArray());
Get a compression library (like zlib, for example) and feed your numbers in (as an array of integers, for example). This is compressing your data. That same library should allow you to reverse the process and decompress the data at the other end to get your values back out.
If you want to improve your algorithmic knowledge and your requirements are simple and non-critical I'd recommend having a go at writing your own compression/decompression code. If not, grab some code off the shelf - there are loads of good libraries around.

I want to convert a sound from Mic to binary and match it from the database

I want to convert a sound from Mic to binary and match it from the database(a type of voice identification program but don't getting idea how to get sound from Mic directly so that i can convert it to binary?Also it is possible or not. Please guide me )
See this:
http://www.dotnetspider.com/resources/4967-How-record-voice-from-microphone.aspx
You're not going to be able to identify voices by doing a binary comparison on sound data. The binary of a particular sound will not be identical to an imitation of that sound unless it is literally the same file because of minor variations in just about everything. You'll need to do some signals processing to do a fuzzy comparison of the data. You can read about signal processing on wikipedia.
You will probably find it easier to use a third party library to process the sound for you. Something like this might be a good start.
You're looking at two very distinct problems here.
The first is pretty technical: Getting sound from the microphone into a digital waveform. How you do this exactly depends on the OS and API you're using (on Windows, you're probably looking at DirectX audio or, if available, ASIO). Typically, this is how you'd proceed:
Set up a recording buffer for the microphone, with suitable parameters (number of channels, physical input on the sound card, sample rate, bit depth, buffer size)
Start the recording. This usually involves pointing the sound library to a callback function to process the recorded buffer.
In the callback, read the buffer, convert it to a suitable format, and append it to the audio file of your choice. (You could also record to RAM only, but longer recordings may exceed available storage).
Store the recorded audio in a suitable database field (some kind of binary blob)
This is the easy part though; the harder part is matching a chunk of audio data against other chunks. A naïve approach would be to try and find exact matches, but that won't help you much, because the chance that you find one is practically zero - recording equipment, even the best, introduces a bit of random noise, and recording setups vary slightly whether you want to or not, so even if you'd have someone say something twice, perfectly identical, you'd still see differences in the recorded audio.
What you need to do, then, is find certain typical characteristics of the waveform. Things you could look for are:
Overall amplitude shape
Base frequencies
Selected harmonics (formants)
Extracting these is non-trivial and involves pretty severe math; and then you'll have to condense them into some sort of fingerprint, and find a way to compare them with some fuzziness (so that a near-match is good enough, rather than requiring exact matches). Finding the right parameters and comparison algorithms isn't easy, and it takes a lot of tweaking and testing; your best bet is to go find a library that does this for you.

Transfer files using checksums only?

Would it be possible to transfer large files using only a system of checksums, and then reconstruct the original file by calculations?
Say that you transfer the MD5 checksum of a file and the size of the file. By making a "virtual file" and calculating it's checksum, trying every single bit combination, you should eventually "reach" the original file. But on the way you would also get a lot of "collisions" where the checksum also match.
So we change the first byte of the original file to some specified value, calculate the checksum again, and send this too. If we make the same substitution in the virtual file we can test each "collision" to see if it still matches. This should narrow it down a bit, and we can do this several times.
Of course, the computing power to do this would be enormous. But is it theoretically possible, and how many checksums would you need to transfer something (say 1mb)? Or would perhaps the amount of data needed to transfer the checksums almost as large as the file, making it pointless?
The amount of data you need to transfer would most certainly be the same size as the file. Consider: If you could communicate a n byte file with n-1 bytes of data, that means you've got 256^(n-1) possible patterns of data you may have sent, but are selecting from a space of size 256^n. This means that one out of every 256 files won't be expressible using this method - this is often referred to as the pidegonhole principle.
Now, even if that wasn't a problem, there's no guarentee that you won't have a collision after any given amount of checksumming. Checksum algorithms are designed to avoid collisions, but for most checksum/hash algorithms there's no strong proof that after X hashes you can guarantee no collisions in a N-byte space.
Finally, hash algorithms, at least, are designed to be hard to reverse, so even if it were possible it would take an impossible huge amount of CPU power to do so.
That said, for a similar approach, you might be interested in reading about Forward Error Correction codes - they're not at all hash algorithms, but I think you may find them interesting.
What you have here is a problem of information. A checksum is not necessarily unique to a particular set of data, in fact to be so it would effectively need to have a many bits of information as the source. What it can indicate is that the data received is not the exact data that the checksum was generated from but in most cases it can't prove it.
In short "no".
To take a hypothetical example, consider a 24 bpp photo with 6 pixels -- there are 2^(24 * 6) (2^144) possible combinations of intensities for each colour channel on those six pixels, so you can gaurantee that if you were to evaluate every possibility, you are guaranteed an MD5 collision (as MD5 is a 128 bit number).
Short answer: Not in any meaningfull form.
Long answer:
Let us assume an arbitrary file file.bin with a 1000-byte size. There are 2^(8*1000) different combinations that could be its actual contents. By sending e.g. a 1000-bit checksum,
you still have about 2^(7*1000) colliding alternatives.
By sending a single additional bit, you might be able cut those down by half... and you still have 2^6999 collisions. By the time you eliminate the colisions, you will have sent at least 8000 bits i.e. an amount equal or greater to the file size.
The only way for this to be theoretically possible (Note: I did not say "feasible", let alone "practical") would be if the file did not really contain random data and you could use that knowledge to prune alternatives. In that case you'd be better off using compression, ayway. Content-aware compression algorithms (e.g. FLAC for audio) use a-priori knowledge on the properties of the input data to improve the compression ratio.
I think what you are thinking of is in fact an interesting topic, but you haven't hit upon the right method. If I can try and rephrase your question, you are asking if there is a way to apply a function to some data, transmit the result of the function, and then reconstruct the original data from the terser function result. For a single MD5 checksum the answer is no, but with other functions, provided you are willingly to send several function results, it is possible. In general this area of research is called compressed sensing. Sometimes exact reconstruction is possible, but more often it is used as a lossy compression scheme for images and other visual or sound data.

Caveats of select/poll vs. epoll reactors in Twisted

Everything I've read and experienced ( Tornado based apps ) leads me to believe that ePoll is a natural replacement for Select and Poll based networking, especially with Twisted. Which makes me paranoid, its pretty rare for a better technique or methodology not to come with a price.
Reading a couple dozen comparisons between epoll and alternatives shows that epoll is clearly the champion for speed and scalability, specifically that it scales in a linear fashion which is fantastic. That said, what about processor and memory utilization, is epoll still the champ?
For very small numbers of sockets (varies depending on your hardware, of course, but we're talking about something on the order of 10 or fewer), select can beat epoll in memory usage and runtime speed. Of course, for such small numbers of sockets, both mechanisms are so fast that you don't really care about this difference in the vast majority of cases.
One clarification, though. Both select and epoll scale linearly. A big difference, though, is that the userspace-facing APIs have complexities that are based on different things. The cost of a select call goes roughly with the value of the highest numbered file descriptor you pass it. If you select on a single fd, 100, then that's roughly twice as expensive as selecting on a single fd, 50. Adding more fds below the highest isn't quite free, so it's a little more complicated than this in practice, but this is a good first approximation for most implementations.
The cost of epoll is closer to the number of file descriptors that actually have events on them. If you're monitoring 200 file descriptors, but only 100 of them have events on them, then you're (very roughly) only paying for those 100 active file descriptors. This is where epoll tends to offer one of its major advantages over select. If you have a thousand clients that are mostly idle, then when you use select you're still paying for all one thousand of them. However, with epoll, it's like you've only got a few - you're only paying for the ones that are active at any given time.
All this means that epoll will lead to less CPU usage for most workloads. As far as memory usage goes, it's a bit of a toss up. select does manage to represent all the necessary information in a highly compact way (one bit per file descriptor). And the FD_SETSIZE (typically 1024) limitation on how many file descriptors you can use with select means that you'll never spend more than 128 bytes for each of the three fd sets you can use with select (read, write, exception). Compared to those 384 bytes max, epoll is sort of a pig. Each file descriptor is represented by a multi-byte structure. However, in absolute terms, it's still not going to use much memory. You can represent a huge number of file descriptors in a few dozen kilobytes (roughly 20k per 1000 file descriptors, I think). And you can also throw in the fact that you have to spend all 384 of those bytes with select if you only want to monitor one file descriptor but its value happens to be 1024, wheras with epoll you'd only spend 20 bytes. Still, all these numbers are pretty small, so it doesn't make much difference.
And there's also that other benefit of epoll, which perhaps you're already aware of, that it is not limited to FD_SETSIZE file descriptors. You can use it to monitor as many file descriptors as you have. And if you only have one file descriptor, but its value is greater than FD_SETSIZE, epoll works with that too, but select does not.
Randomly, I've also recently discovered one slight drawback to epoll as compared to select or poll. While none of these three APIs supports normal files (i.e., files on a file system), select and poll present this lack of support as reporting such descriptors as always readable and always writable. This makes them unsuitable for any meaningful kind of non-blocking filesystem I/O, a program which uses select or poll and happens to encounter a file descriptor from the filesystem will at least continue to operate (or if it fails, it won't be because of select or poll), albeit it perhaps not with the best performance.
On the other hand, epoll will fail fast with an error (EPERM, apparently) when asked to monitor such a file descriptor. Strictly speaking, this is hardly incorrect. It's merely signalling its lack of support in an explicit way. Normally I would applaud explicit failure conditions, but this one is undocumented (as far as I can tell) and results in a completely broken application, rather than one which merely operates with potentially degraded performance.
In practice, the only place I've seen this come up is when interacting with stdio. A user might redirect stdin or stdout from/to a normal file. Whereas previously stdin and stdout would have been a pipe -- supported by epoll just fine -- it then becomes a normal file and epoll fails loudly, breaking the application.
In tests at my company, one issue with epoll() came up, thus a single cost compared to select.
When attempting to read from the network with a timeout, creating an epoll_fd ( instead of a FD_SET ), and adding the fd to the epoll_fd, is much more expensive than creating a FD_SET (which is a simple malloc).
As per the previous answer, as the number of FDs in the process becomes large, the cost of select() becomes higher, but in our testing, even with fd values in the 10,000's, select was still a winner. These are cases where there is only one fd that a thread is waiting on, and simply trying to overcome the fact that network read, and network write, doesn't timeout when using a blocking thread model. Of course, blocking thread models are low performance compared to non-blocking reactor systems, but there are occasions where, to integrate with a particular legacy code base, it is required.
This kind of use case is rare in high performance applications, because a reactor model doesn't need to create a new epoll_fd every time. For the model where an epoll_fd is long-lived --- which is clearly preferred for any high performance server design --- epoll is the clear winner in every way.

Resources