Are binary protocols dead? [closed] - tcp

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
It seems there used to be far more binary protocols, a product of the very slow internet speeds of the time (dial-up). Lately I've been seeing everything replaced by HTTP and SOAP/REST/XML.
Why is this?
Are binary protocols really dead or are they just less popular? Why would they be dead or less popular?

You Just Can't Beat the Binary
Binary protocols will always be more space efficient than text protocols. Even as internet speeds drastically increase, so do the amount and complexity of the information we wish to convey.
The text protocols you reference are outstanding in terms of standardization, flexibility and ease of use. However, there will always be applications where the efficiency of binary transport will outweigh those factors.
A great deal of information is binary in nature and will probably never be replaced by a text protocol. Video streaming comes to mind as a clear example.
Even if you compress a text-based protocol (e.g. with GZip), a general purpose compression algorithm will never be as efficient as a binary protocol designed around the specific data stream.
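As a minimal sketch of this trade-off (the sensor-reading format here is purely hypothetical), compare a JSON encoding, its gzipped form, and a purpose-built fixed-width binary layout:

```python
import gzip
import json
import struct

# A hypothetical stream of 1000 (timestamp, value) sensor readings.
readings = [(1600000000 + i, i * 0.5) for i in range(1000)]

# Text protocol: JSON.
text = json.dumps(readings).encode("utf-8")

# General-purpose compression applied to the text protocol.
compressed = gzip.compress(text)

# Purpose-built binary protocol: 4-byte unsigned int + 4-byte float per reading.
binary = b"".join(struct.pack("<If", t, v) for t, v in readings)

print(len(text), len(compressed), len(binary))
```

The binary layout is a fixed 8 bytes per reading; the relative size of the gzipped text will depend on how repetitive the data happens to be, which is exactly the point: the binary format's cost is known by design, the compressor's is not.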
But Sometimes You Don't Have To
The reason you are seeing more text-based protocols is that transmission speeds and data storage capacity have indeed outpaced data sizes for a wide range of applications. We humans find it much easier to work with text protocols, so we designed our ubiquitous XML around a text representation. We certainly could have defined XML as a binary protocol, had we really needed to save every byte, and built common tools to visualize and work with the data.
Then Again, Sometimes You Really Do
Many developers are used to thinking in terms of multi-GB, multi-core computers. Even your typical phone these days puts my first IBM PC-XT to shame. Still, there are platforms, such as embedded devices, that have rather strict limitations on processing power and memory. When dealing with such devices, binary may be a necessity.

A parallel with programming languages is probably very relevant.
While high-level languages are the preferred tools for most programming jobs, and have been made possible (in part) by increases in CPU speed and storage capacity, they haven't removed the need for assembly language.
In a similar fashion, non-binary protocols introduce more abstraction and more extensibility, and are therefore the vehicle of choice, particularly for application-level communication. They too have benefited from increases in bandwidth and storage capacity. Yet at the lower levels it is still impractical to be so wasteful.
Furthermore, unlike programming languages, where there are strong incentives to "take the performance hit" in exchange for added simplicity, speed of development, and so on, the ability to structure communication in layers makes the complexity and "binary-ness" of the lower layers transparent to the application level. For example, so long as the SOAP messages one receives are well-formed, the application doesn't need to know that they were compressed in transit over the wire.

Facebook, Last.fm, and Evernote use the Thrift binary protocol.

I rarely see this discussed, but binary protocols, and block protocols especially, can greatly simplify server architectures.
Many text protocols are implemented in such a way that the parser has no basis on which to infer how much more data is needed before a complete logical unit has been received (XML and JSON can both tell you the minimum bytes needed to finish, but can't provide a meaningful estimate). This means that the parser may have to periodically cede control to the socket-receiving code to retrieve more data. This is fine if your sockets are in blocking mode, not so easy if they're not. It generally means that all parser state has to be kept on the heap, not the stack.
If you have a binary protocol where very early in the receive process you know exactly how many bytes you need to complete the packet, then your receiving operations don't need to be interleaved with your parsing operations. As a consequence, the parser state can be held on the stack, and the parser can execute once per message and run straight through without pausing to receive more bytes.
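A minimal sketch of such length-prefixed framing (the 4-byte big-endian header is an assumption for illustration, not a standard):

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix a payload with its 4-byte big-endian length."""
    return struct.pack(">I", len(payload)) + payload

def read_frames(buffer: bytes):
    """Yield complete payloads from a buffer. A real server would read from
    the socket until exactly header + length bytes have arrived, then hand
    the whole message to the parser in one uninterrupted pass."""
    offset = 0
    while offset + 4 <= len(buffer):
        (length,) = struct.unpack_from(">I", buffer, offset)
        if offset + 4 + length > len(buffer):
            break  # incomplete frame: just wait for more bytes, no parser state to save
        yield buffer[offset + 4 : offset + 4 + length]
        offset += 4 + length

stream = frame(b"hello") + frame(b"world")
print(list(read_frames(stream)))  # [b'hello', b'world']
```

Note that the receive loop never needs to call into a parser mid-message: the header alone says when a complete unit has arrived.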

There will always be a need for binary protocols in some applications, such as very-low-bandwidth communications. But there are huge advantages to text-based protocols. For example, I can use Firebug to easily see exactly what is being sent and received from each HTTP call made by my application. Good luck doing that with a binary protocol :)
Another advantage of text protocols is that even though they are less space efficient than binary, text data compresses very well, so the data may be automatically compressed to get the best of both worlds. See HTTP Compression, for example.

Binary protocols are not dead. It is much more efficient to send binary data in many cases.
WCF supports binary encoding using TCP.
http://msdn.microsoft.com/en-us/library/ms730879.aspx

So far the answers all focus on space and time efficiency. No one has mentioned what I feel is the number one reason for so many text-based protocols: sharing of information. It's the whole point of the Internet, and it's far easier with text-based, human-readable protocols that are also easily processed by machines. You rid yourself of language-dependent, application-specific, platform-biased programming with text data interchange.
Link in whatever XML/JSON/*-parsing library you want to use, find out the structure of the information, and snip out the pieces of data you're interested in.

Some binary protocols I've seen in the wild for Internet applications:
Google Protocol Buffers, which are used for internal communications but also in, for example, Google Chrome bookmark syncing.
Flash AMF, which is used for communication with Flash and Flex applications. Both Flash and Flex are capable of communicating via REST or SOAP; however, the AMF format is much more efficient for Flex, as some benchmarks show.

I'm really glad you raised this question, as non-binary protocols have multiplied in usage many times over since the introduction of XML. Ten years ago, you would see virtually everybody touting their "compliance" with XML-based communications. However, this approach, one of several non-binary approaches, has many deficiencies.
One of the touted values, for example, was readability. But readability matters mainly for debugging, when a human needs to read the transaction. Text protocols are very inefficient compared with binary transfers. This is because XML itself travels as a binary stream that has to be translated by another layer into textual fragments ("tokens"), and then back into binary form for the contained data.
Another value people found was extensibility. But extensibility can easily be maintained if a protocol version number for the binary stream is included at the beginning of the transaction. Instead of sending XML tags, one could send binary indicators. If the version number is unknown, the receiving end can download the "dictionary" for that version. The dictionary could, for example, be an XML file. But downloading the dictionary is a one-time operation, not a cost paid on every single transaction!
So efficiency could be kept together with extensibility, and very easily! There are a good number of "compiled XML" protocols out there which do just that.
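A minimal sketch of such a versioned binary message (the field layout, version number, and sensor-style fields are all hypothetical):

```python
import struct

PROTOCOL_VERSION = 2

def encode(temp_c: float, humidity: int) -> bytes:
    # One version byte, then fields whose meaning is defined by that
    # version's "dictionary" (fetched once, never sent per message).
    return struct.pack("<BfB", PROTOCOL_VERSION, temp_c, humidity)

def decode(msg: bytes):
    version = msg[0]
    if version != PROTOCOL_VERSION:
        # Here a real receiver would download the dictionary for
        # this version once, then continue decoding normally.
        raise ValueError(f"unknown version {version}: fetch its dictionary first")
    _, temp_c, humidity = struct.unpack("<BfB", msg)
    return temp_c, humidity

print(decode(encode(21.5, 40)))  # (21.5, 40)
```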
Last, but not least, I have even heard people say that XML is a good way to bridge little-endian and big-endian binary systems, for example Sun computers vs. Intel computers. But this is incorrect: if both sides can interpret XML (ASCII) correctly, surely both sides can interpret binary correctly, since XML and ASCII are themselves transmitted as binary.
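Indeed, any binary serialization scheme can simply declare a byte order; for example, with Python's struct module:

```python
import struct

value = 0x12345678

big = struct.pack(">I", value)     # network / "Sun" byte order
little = struct.pack("<I", value)  # x86 byte order

print(big.hex(), little.hex())  # 12345678 78563412

# As long as both sides agree on the declared byte order,
# the value round-trips regardless of the host CPU:
assert struct.unpack(">I", big)[0] == value
assert struct.unpack("<I", little)[0] == value
```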
Hope you find this interesting reading!

Binary protocols will continue to live wherever efficiency is required. Mostly, they will live at the lower levels, where hardware implementations are more common than software implementations. Speed isn't the only factor; simplicity of implementation is also important. Making a chip process binary data messages is much easier than parsing text messages.

Surely this depends entirely on the application? There have been two general types of example so far: XML/HTML-related answers, and video/audio. One is designed to be 'shared', as noted by Jonathon, and the other to be efficient in its transfer of data (and without Matrix vision, 'reading' a movie would never be as useful as reading an HTML document).
Ease of debugging is not a reason to choose a text protocol over a 'binary' one; the requirements of the data transfer should dictate that. I work in the Aerospace industry, where the majority of communications are high-speed, predictable data flows like altitude and radio frequencies, so they are assigned bits on a stream and no human-readable wrapper is required. This is also highly efficient to transfer and, other than interference detection, requires no metadata or protocol processing.
So certainly I would say that they are not dead.
I would agree that people's choices are probably affected by the fact that they have to debug them, but the choice will also depend heavily on the reliability, bandwidth, data type, and processing time required (and power available!).

They are not dead because they are the underlying layers of every communication system. Every major communication system's data link and network layers are based on some kind of "binary protocol".
Take the internet, for example: you are probably using Ethernet in your LAN, PPPoE to communicate with your ISP, IP to surf the web, and maybe FTP to download a file. All of these are "binary protocols".
We are seeing this shift towards text-based protocols in the upper layers because they are much easier to develop and understand when compared to "binary protocols", and because most applications don't have strict bandwidth requirements.

It depends on the application...
I think real-time environments (FireWire, USB, fieldbuses...) will always have a need for binary protocols.

Are binary protocols dead?
Two answers:
Let's hope so.
No.
At least a binary protocol is better than XML, which offers all the readability of a binary protocol combined with even less efficiency than a well-designed ASCII protocol.

Eric J's answer pretty much says it, but here's some more food for thought and facts. Note that the stuff below is not about media protocols (videos, images). Some items may be clear to you, but I keep hearing myths every day so here you go ...
There is no difference in expressiveness between a binary protocol and a text protocol. You can transmit the same information with the same reliability.
For every optimal binary protocol, you can design an optimal text protocol that takes only around 15% more space, and that protocol you can type on your keyboard.
In practice (in the practical protocols I see every day), the difference is often even less significant, due to the static nature of many binary protocols.
For example, take a number that can become very large (say, into the 32-bit range) but is usually very small. In binary, people typically model this as four bytes. In text, it's often a printed number followed by a colon, so numbers below ten take two bytes and numbers below 100 take three. (You can of course claim that this binary encoding is bad and that you could use some size bits to make it more space efficient, but that's another thing you have to document, implement on both sides, and be able to troubleshoot when it comes over your wire.)
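To put rough numbers on this, here is a sketch comparing a fixed 4-byte field, a printed decimal with a colon terminator, and the kind of "size bits" scheme alluded to (a Protocol-Buffers-style varint):

```python
import struct

def varint(n: int) -> bytes:
    """Base-128 varint as used by Protocol Buffers: 7 payload bits
    per byte, high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

for n in (7, 99, 300, 2**31 - 1):
    fixed = struct.pack("<I", n)  # always 4 bytes
    text = f"{n}:".encode()       # digits plus a ':' terminator
    print(n, len(fixed), len(text), len(varint(n)))
```

Small numbers take one byte as a varint and two as text, while the fixed field always costs four; the varint wins on space but, as the answer says, is one more thing to document and debug on both ends.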
For example, messages in binary protocols are often framed by length fields and/or terminators, while in text protocols, you just use a CRC.
In practice, the difference is often less significant due to required redundancy.
You want some level of redundancy, no matter whether it's binary or text. Binary protocols often leave no room for error. You have to document every bit you send 100% correctly, and since most of us are human, that rarely happens, and you can't read the stream well enough to draw a safe conclusion about what is correct.
So in summary: binary protocols are theoretically more space and compute efficient, but in practice the difference is often smaller than you think, and the deal is often not worth it. I work in the Internet of Things area and have to deal on a nearly daily basis with custom, badly designed binary protocols which are really hard to troubleshoot, annoying to implement, and not more space efficient. If you don't absolutely need to squeeze the last milliampere out of your battery and count microcontroller cycles (or transmit media), think twice.

Related

Binary Protocol Serialization Frameworks

There are some great libraries out there for deserializing binary formats. I really like the declarative approach taken by Kaitai Struct, and nom's approach using Rust.
However, I am not aware of any good approaches to serialize binary formats.
For example, you often have to write the message length right into the message header, but you don't actually know the exact length at that point, because it depends on many fields that come after the header. And you sometimes also have to deal with padding and alignment, which can be cumbersome.
Do you know any solutions for problems like these?
Please take a look at ASN.1, which solved this problem many years ago and is still widely used in critical infrastructure across many different industries. It is independent of programming language and machine architecture, so you can set up communication whether one peer is using C on a little-endian machine and the other is using Java or C# on a big-endian machine. Structure padding issues are easily handled by good-quality ASN.1 tools. A good list of tools (both free and commercial) is available on the ASN.1 Tools page of the ITU-T ASN.1 Project.
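Absent a full ASN.1 toolchain, the usual workaround for the length-in-header problem is two-pass serialization: build the variable-length body into a buffer first, then write the header once the length is known. A sketch (the message layout and the 4-byte alignment rule are hypothetical, just to show the pattern):

```python
import struct

def serialize_message(msg_type: int, fields: list) -> bytes:
    # Pass 1: build the variable-length body, each field as
    # a 2-byte little-endian length followed by its bytes.
    body = b"".join(struct.pack("<H", len(f)) + f for f in fields)

    # Pad the body to a 4-byte boundary (hypothetical alignment rule).
    body += b"\x00" * ((-len(body)) % 4)

    # Pass 2: now the length is known, so the header can be written.
    header = struct.pack("<BI", msg_type, len(body))
    return header + body

msg = serialize_message(1, [b"abc", b"defgh"])
print(len(msg))  # 17 (5-byte header + 12-byte body)
```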

Why do I have more flash (STM32F103RCT6) than what my datasheet says?

I'm writing a firmware to a STM32F103RCT6 microcontroller that has a flash of 256KB according to the datasheet.
Because of a mistake of mine, I was writing some data at 0x0807F800, which according to the reference manual is the last page of a high-density device. (The reference manual makes no distinction between different sizes of 'high-density devices' in the memory layout.)
The data that I wrote, was being read with no errors, so I did some tests and read/wrote 512KB of random data and compared the files and they matched!
I did some research but couldn't find any similar experiences.
Is that extra flash reliable? Is this some kind of industrial maneuver?
I would not recommend using this extra FLASH memory for anything that matters.
It is not guaranteed to be present on other chips with the same part number. If used in a product that would be a major problem. Even if a sample is successful now, the manufacturer could change the design or processes in the future and take it away.
While it might be perfectly fine on your chip, it could also be prone to corruption if there are weak memory cells.
A common practice in the semiconductor industry is to have several parts that share a common die design. After manufacturing, the dies are tested and sorted. A die might have a defect in a peripheral, so is used as a part that doesn't have that peripheral. Alternatively, it might be perfectly good, but used as a lesser part for business reasons (i.e., supply and demand).
Often, the unused features are disabled by cutting traces, burning fuses, or special programming at the factory, but it's possible that extra features might be left intact if there are no negative effects and they are unlikely to be observed.
If this is only for one-off use or experimentation, and corruption is an acceptable condition, I don't really see a harm in using it.

Methods to discourage reverse engineering of an opencl kernel

I am preparing my OpenCL-accelerated toolkit for release. Currently, I compile my OpenCL kernels into binaries targeted at a particular card.
Are there other ways of discouraging reverse engineering? Right now, I have many OpenCL binaries in my release folder, one for each kernel. Would it be better to splice these binaries into one single binary, or even add them into the host binary and somehow read them in using a special offset?
OpenCL 2.0 and SPIR-V can be used for this, but they are not available on all platforms yet.
Encrypt the binaries. Keep the keys on a server and have clients request them at time of use. Of course, the keys should be encoded too (using a variable value, such as the server's time, perhaps). Then decrypt on the client to use as the binary kernel.
I'm no encryption pro, but I would apply multiple algorithms multiple times to make them harder to crack, if each alone could be cracked within several months (the time until the next version of your GPGPU software, for example). Even a simple homemade algorithm, such as reversing the order of the bits in all data (1st bit goes to the nth position, nth goes to 1st), should make it look hard to entry-level attackers.
Warning: some profiling tools can grab the code at run time, so you could add many (maybe hundreds of) trivial kernels with no performance penalty to hide the real one in a crowded timeline, or disable profiling in the kernel options, or add a deliberate error (some broken events in queues, say) and then restart, so the profiler cannot initiate.
Maybe you could obfuscate the final C99 code so it becomes unreadable to humans. If someone can read that, they hardly need to hack it in the first place.
Maybe most effectively: don't do anything; just secure the copyright on your genuine algorithm and state it in a txt file, so they can look but dare not copy it for money.
If the kernel can be rewritten in an "interpreted" style without a performance penalty, you can send bytecodes from the server to the client, so that when someone checks the profiler, they see only the interpreter code, not the real working algorithm, since it depends on the bytecodes from the server as "data" (a buffer). For example, c=a+b becomes if(..) else if(...) else(case ...) and has no meaning without a data feed.
On top of all this, you could buy time against reverse engineers by picking variable names that trigger their "selective perception" so they can't focus for a while, and develop a meaner version in the meantime. For example, c=a+b becomes bulletsToDevilsEar=breakDevilsLeg+goodGame.

Data Removal Standards

I want to write an application that removes data from a hard drive. Are there any standards that I need to adhere to which will ensure that my software removes at least the bare minimum, or should I just use off the shelf software? If so any advice?
I think any "standard" you may encounter won't be any less science fiction or science mysticism than anything you come up with yourself. Basically, as long as you physically overwrite the data (even just once), there's no commercial forensic service that - even in the face of any amount of money you throw at them - will claim to be able to recover your data.
(Any "overwrite 35 times with rotating bit patterns" advice may have been true for coarsely spaced magnetic tapes in the 1970s, but it is entirely irrelevant for contemporary hard disks).
The far more important problem you have to solve is how to overwrite the data physically. This is essentially impossible through any sort of application or even OS programming; you'll have to find a way to talk to the hardware properly and get a reliable confirmation that the location you intended to write to has indeed been written to, and that the clusters in question haven't been relocated to other parts of the disk that might leak the data.
So in essence this is a very low-level question that will probably have you poring over your hard disk manufacturer's manuals quite a bit if you want a genuine solution.
Please define "data removal". Is this scrubbing in order to make undeletions impossible; or simply deletion of data ?
It is common to write over a file several times with a random bit pattern if one wants to make sure it cannot be recovered. Due to the analog nature of the magnetic bit patterns, it might be possible to recover overwritten data in some circumstances.
A normal file system delete operation, by contrast, will be reversible in most cases. When you delete a file that way, you remove the file allocation table entry, not the data.
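A minimal sketch of the single overwrite-then-delete pass described above (with the same caveat raised elsewhere in this thread: journaling filesystems, SSD wear leveling, and bad-block remapping can keep copies of the data elsewhere, so this is not a guarantee of physical erasure):

```python
import os

def overwrite_and_delete(path: str) -> None:
    """Overwrite a file's contents in place with random bytes,
    flush to the device, then unlink it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        f.write(os.urandom(size))  # single random-pattern pass
        f.flush()
        os.fsync(f.fileno())       # push the write past the OS cache
    os.remove(path)
```

Opening with "r+b" (rather than "w") matters: truncating the file first could let the filesystem allocate fresh blocks, leaving the originals untouched.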
There are standards... see http://en.wikipedia.org/wiki/Data_erasure
You don't give any details, so it is hard to tell whether they apply to your situation. Deleting a file with the OS's built-in file deletion can almost always be reverted. OTOH, formatting a drive (NOT a quick format) is usually OK, except when you deal with sensitive data (data from clients, patients, finance, etc., or security-relevant material); then the above-mentioned standards, which use different amounts/rounds/patterns of overwriting, make it nearly impossible to revert the deletion. In really, really sensitive cases, you first use the best of these methods, then format the drive, then use that method again, and then destroy the drive physically (which means real destruction, not just removing the electronics or similar!).
The best way to avoid all this hassle is to plan ahead and use strong, proven full-disk encryption (with a key NOT stored on the drive's electronics or media!). That way you can simply format the drive (NOT a quick format) and then sell it, for example, since strongly encrypted data looks like "random data" and is (if implemented correctly) absolutely useless without the key(s).

What techniques can you use to encode data on a lossy one-way channel?

Imagine you have a channel of communication that is inherently lossy and one-way. That is, there is some inherent noise that is impossible to remove that causes, say, random bits to be toggled. Also imagine that it is one way - you cannot request retransmission.
But you need to send data over it regardless. What techniques can you use to send numbers and text over that channel?
Is it possible to encode numbers so that even with random bit twiddling they can still be interpreted as values close to the original (lossy transmission)?
Is there a way to send a string of characters (ASCII, say) in a lossless fashion?
This is just for fun. I know you can use morse code or any very low frequency binary communication. I know about parity bits and checksums to detect errors and retrying. I know that you might as well use an analog signal. I'm just curious if there are any interesting computer-sciency techniques to send this stuff over a lossy channel.
Depending on some details that you don't supply about your lossy channel, I would recommend, first using a Gray code to ensure that single-bit errors result in small differences (to cover your desire for loss mitigation in lossy transmission), and then possibly also encoding the resulting stream with some "lossless" (==tries to be loss-less;-) encoding.
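The Gray-code step can be sketched in a few lines:

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code: adjacent integers differ in one bit."""
    return n ^ (n >> 1)

def from_gray(g: int) -> int:
    """Invert the Gray code by folding the shifted value back in."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# A single-bit error on the wire now decodes to a nearby value
# instead of a wildly wrong one:
for i in range(8):
    print(i, format(to_gray(i), "03b"))
assert all(from_gray(to_gray(i)) == i for i in range(1024))
```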
Reed-Solomon and variants thereof are particularly good if your noise episodes are prone to occur in small bursts (several bit mistakes within, say, a single byte), which should interoperate well with Gray coding (since multi-bit mistakes are the killers for the "loss mitigation" aspect of Gray, designed to degrade gracefully for single-bit errors on the wire). That's because R-S is intrinsically a block scheme, and multiple errors within one block are basically the same as a single error in it, from R-S's point of view;-).
R-S is particularly awesome if many of the errors are erasures -- to put it simply, an erasure is a symbol that has most probably been mangled in transmission, BUT for which you DO know the crucial fact that it HAS been mangled. The physical layer, depending on how it's designed, can often have hints about that fact, and if there's a way for it to inform the higher layers, that can be of crucial help. Let me explain erasures a bit...:
Say for a simplified example that a 0 is sent as a level of -1 volt and a 1 is sent as a level of +1 volt (wrt some reference wave), but there's noise (physical noise can often be well-modeled, ask any competent communication engineer;-); depending on the noise model the decoding might be that anything -0.7 V and down is considered a 0 bit, anything +0.7 V and up is considered a 1 bit, anything in-between is considered an erasure, i.e., the higher layer is told that the bit in question was probably mangled in transmission and should therefore be disregarded. (I sometimes give this as one example of my thesis that sometimes abstractions SHOULD "leak" -- in a controlled and architected way: the Martelli corollary to Spolsky's Law of Leaky Abstractions!-).
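That erasure decision can be sketched directly from the voltage example (the ±0.7 V thresholds are the ones assumed above):

```python
ERASURE = None  # marker passed up to the higher-layer decoder

def slice_bit(voltage: float):
    """Hard-decision slicer with an erasure zone, matching the
    -1 V / +1 V signaling example."""
    if voltage <= -0.7:
        return 0
    if voltage >= 0.7:
        return 1
    return ERASURE  # "probably mangled" -- tell the higher layer

samples = [-0.95, 0.88, 0.1, -0.3, 1.02]
print([slice_bit(v) for v in samples])  # [0, 1, None, None, 1]
```

The two None values are exactly the controlled "leak": the physical layer flags which symbols the R-S decoder should treat as erasures rather than trusted bits.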
A R-S code with any given redundancy ratio can be about twice as effective at correcting erasures (errors the decoder is told about) as it can be at correcting otherwise-unknown errors -- it's also possible to mix both aspects, correcting both some erasures AND some otherwise-unknown errors.
As the cherry on top, custom R-S codes can be (reasonably easily) designed and tailored to reduce the probability of uncorrected errors to below any required threshold θ given a precise model of the physical channel's characteristics in terms of both erasures and undetected errors (including both probability and burstiness).
I wouldn't call this whole area a "computer-sciency" one, actually: back when I graduated (MSEE, 30 years ago), I was mostly trying to avoid "CS" stuff in favor of chip design, system design, advanced radio systems, &c -- yet I was taught this stuff (well, the subset that was already within the realm of practical engineering use;-) pretty well.
And, just to confirm that things haven't changed all that much in one generation: my daughter just got her MS in telecom engineering (strictly focusing on advanced radio systems) -- she can't design just about any serious program, algorithm, or data structure (though she did just fine in the mandatory courses on C and Java, there was absolutely no CS depth in those courses, nor elsewhere in her curriculum -- her daily working language is matlab...!-) -- yet she knows more about information and coding theory than I ever learned, and that's before any PhD level study (she's staying for her PhD, but that hasn't yet begun).
So, I claim these fields are more EE-y than CS-y (though of course the boundaries are ever fuzzy -- witness the fact that after a few years designing chips I ended up as a SW guy more or less by accident, and so did a lot of my contemporaries;-).
This question is the subject of coding theory.
Probably one of the better-known methods is to use Hamming code. It might not be the best way of correcting errors on large scales, but it's incredibly simple to understand.
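For the curious, here is a small sketch of Hamming(7,4), which corrects any single flipped bit in a 7-bit codeword:

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming codeword
    (parity bits at positions 1, 2, and 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d0..d3
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]  # positions 1..7
    return sum(b << i for i, b in enumerate(bits))

def hamming74_decode(code: int) -> int:
    """Recompute the parity checks; the failing checks spell out
    the 1-indexed position of the flipped bit (0 = no error)."""
    bits = [(code >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        bits[syndrome - 1] ^= 1  # correct the single-bit error
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)

# Any single flipped bit is corrected:
for nibble in range(16):
    code = hamming74_encode(nibble)
    for flip in range(7):
        assert hamming74_decode(code ^ (1 << flip)) == nibble
```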
There is the redundant encoding used in optical media, which can recover from bit loss.
ECC is also used in hard disks and RAM.
The TCP protocol can handle quite a lot of data loss with retransmissions.
Either Turbo Codes or Low-density parity-checking codes for general data, because these come closest to approaching the Shannon limit - see wikipedia.
You can use Reed-Solomon codes.
See also the Sliding Window Protocol (which is used by TCP).
Although this includes dealing with packets being re-ordered or lost altogether, which was not part of your problem definition.
As Alex Martelli says, there's lots of coding theory in the world, but Reed-Solomon codes are definitely a sweet spot. If you actually want to build something, Jim Plank has written a nice tutorial on Reed-Solomon coding. Plank has a professional interest in coding with a lot of practical expertise to back it up.
I would go with some of these suggestions, combined with sending the same data multiple times. That way you can hope for different errors to be introduced at different points in the stream, and you may be able to infer the desired number much more easily.
