Binary Protocol Serialization Frameworks - binary-serialization

There are some great libraries out there for deserializing binary formats. I really like the declarative approach of Kaitai Struct and the parser-combinator approach of nom, which is written in Rust.
However, I am not aware of any good approaches to serialize binary formats.
For example, you often have the case that you have to write your message length right into the message header, but actually you do not know your exact message length at this point because it depends on many fields which are downstream from the header. And you sometimes also have to deal with padding alignment which can be cumbersome.
Do you know any solutions for problems like these?

Please take a look at ASN.1, which solved this problem many years ago and is still widely used in critical infrastructure in many different industries. It is independent of programming language and machine architecture, so you can set up communication even when one peer is using C on a little-endian machine and the other is using Java or C# on a big-endian machine. Structure padding issues are easily handled by good quality tools for ASN.1. A good list of tools (both free and commercial) is available at the ASN.1 Tools page of the ITU-T ASN.1 Project.
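Independently of ASN.1, the two problems from the question (a length field that is only known after the body is written, and alignment padding) can be handled by reserving the header and patching it afterwards. The following is only a minimal sketch in Python; the header layout, the 4-byte alignment and the names `HEADER`, `pad_to` and `serialize` are assumptions made up for illustration, not part of any real protocol.

```python
import struct

# Hypothetical header: 2-byte message type, 4-byte total message length.
HEADER = struct.Struct(">HI")


def pad_to(buf: bytearray, alignment: int) -> None:
    """Append zero bytes until len(buf) is a multiple of alignment."""
    buf.extend(b"\x00" * (-len(buf) % alignment))


def serialize(msg_type: int, fields: list[bytes]) -> bytes:
    buf = bytearray(HEADER.size)  # reserve room for the header first
    for field in fields:
        # Each field is written as a 2-byte length followed by its bytes.
        buf += struct.pack(">H", len(field)) + field
        pad_to(buf, 4)  # assumed rule: every field ends on a 4-byte boundary
    # The total length is only known now, so patch it into the reserved space.
    HEADER.pack_into(buf, 0, msg_type, len(buf))
    return bytes(buf)


message = serialize(7, [b"abc", b"hello"])
```

The same idea works with a stream if the body is serialized into a temporary buffer first and the header is written once the body size is known.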

Related

What approaches/languages exist for describing network-protocols/packet-structures?

There are simple network protocol structures (e.g. ipv4, tcp, udp, ...) which can be easily described in any language via structs. But there are more difficult structures with optional fields/blocks and dynamic block/field sizes (TLV, LT, etc.) - e.g. ipv6, sctp, PROFINET-IO (decentralized periphery), ...
My question is - How to properly describe the protocol data structure and store that for future use? E.g. generating structures for different languages, or getting all trees (e.g. in ipv6 Wireshark ipv6.opt.pdm.delta_last_recv), or getting all fields for a specific block/extension/option of the protocol.
I hope the description is clear. Thanks.
The ASN1 language was created to solve this and other problems like it. IMHO, the reason that you do not see it used often is that the language got very complex and different factions started to use it in different ways (SNMP MIBs, Crypto X509, etc) which resulted in ASN1 compilers being specialized and not general.
Often, instead of ASN1, you see a C struct definition of the packet or just an RFC packet diagram (you can use the protocol tool to generate one) with some markings (like ...) to indicate variable length.
I guess protobuf technically also qualifies as a language that describes a binary message though I do not believe it is a general language that can describe any message and is meant to be used by other protobuf-enabled applications.
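To make the TLV (type-length-value) case from the question more concrete, here is a small sketch in Python where the option layout is kept as data and the encoder and decoder are driven by it. The option codes, names and formats in `OPTION_SPECS` are hypothetical and do not correspond to any real protocol; a real description language would also need to cover nesting, bit fields and so on.

```python
import struct

# Hypothetical option table: type code -> (name, struct format of the value).
OPTION_SPECS = {
    1: ("mtu", ">H"),
    2: ("timestamp", ">I"),
}


def encode_options(options: dict) -> bytes:
    """Encode {name: value} as a sequence of 1-byte type, 1-byte length, value."""
    by_name = {name: (code, fmt) for code, (name, fmt) in OPTION_SPECS.items()}
    out = bytearray()
    for name, value in options.items():
        code, fmt = by_name[name]
        payload = struct.pack(fmt, value)
        out += struct.pack(">BB", code, len(payload)) + payload
    return bytes(out)


def decode_options(data: bytes) -> dict:
    """Decode a TLV sequence, skipping option types that are not in the table."""
    result, offset = {}, 0
    while offset < len(data):
        code, length = struct.unpack_from(">BB", data, offset)
        offset += 2
        payload = data[offset:offset + length]
        offset += length
        if code in OPTION_SPECS:
            name, fmt = OPTION_SPECS[code]
            (result[name],) = struct.unpack(fmt, payload)
    return result


encoded = encode_options({"mtu": 1500, "timestamp": 123456})
decoded = decode_options(encoded)
```

Because the table is plain data, it could in principle be stored in a file and used to generate structures for other languages, which is the kind of reuse the question asks about.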

Use cases for self-modifying code?

On a Von Neumann architecture, program and data are both stored in memory, so a program can modify itself. Is this useful for a programmer? Could you give some examples?
Metamorphism
One (questionable) use case that comes to my mind is metamorphic computer viruses. These are malicious pieces of software that conceal themselves from signature-based detection by rewriting their own machine code into a semantically equivalent representation that looks different.
Trampolining
Another (more complex, but also more common) use case is trampolining, a technique based on dynamic code generation to solve certain problems with nested function calls.
JIT compilation
The most common usage of dynamic code generation that I can think of is JIT (just-in-time) compilation. Code for modern platforms like .NET or Java is not compiled into native machine code ahead of time, but into some kind of intermediate language (called bytecode). This bytecode is then interpreted when the program is executed (by a virtual machine written for the target architecture). At the same time, a background process checks which parts of the code are executed very often. These parts then have a good chance of being dynamically compiled into native machine language for maximum performance. All this happens during the run time of the program!
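The "interpret first, compile the hot parts later" idea can be imitated in a few lines of Python. This is only an analogy under stated assumptions: the tiny instruction list, `HOT_THRESHOLD` and `HotSpotRunner` are invented for illustration, and the "compiled" form is a generated Python function rather than native machine code.

```python
HOT_THRESHOLD = 3  # hypothetical: compile after this many interpreted runs


def interpret(program, x):
    """Slow path: walk a list of (op, constant) pairs on every call."""
    for op, const in program:
        if op == "add":
            x = x + const
        elif op == "mul":
            x = x * const
        else:
            raise ValueError(f"unknown op {op!r}")
    return x


def compile_program(program):
    """Fast path: generate Python source once and compile it to a function."""
    symbols = {"add": "+", "mul": "*"}
    expr = "x"
    for op, const in program:
        expr = f"({expr} {symbols[op]} {const!r})"
    namespace = {}
    source = f"def compiled(x):\n    return {expr}\n"
    exec(compile(source, "<jit>", "exec"), namespace)
    return namespace["compiled"]


class HotSpotRunner:
    """Interprets a program until it becomes hot, then switches to generated code."""

    def __init__(self, program):
        self.program = program
        self.calls = 0
        self.compiled = None

    def run(self, x):
        if self.compiled is not None:
            return self.compiled(x)
        self.calls += 1
        if self.calls >= HOT_THRESHOLD:
            self.compiled = compile_program(self.program)
        return interpret(self.program, x)


runner = HotSpotRunner([("add", 2), ("mul", 3)])
results = [runner.run(4) for _ in range(5)]
```

The program creates new executable code while it is running, which is the property the answer is describing.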
Security implications
One thing to keep in mind is that the possibility to interpret data as code is useful for exploiting security holes in computer software, which is why the trend in modern hardware and operating systems is to enable and, if possible, even enforce the separation of code and data (also see NX bit and DEP).
I can best answer this by referring you to an answer to a similar (exceptionally well written and answered) question, also on StackOverflow - Homoiconic and "unrestricted" self modifying code + Is lisp really self modifying?. The answer focuses on Lisp, a family of languages known for taking "code is data" to the next level, and explores the uses of that in AI.

Fast interpreted language for memory constrained microcontroller

I'm looking for a fast interpreted language for a microcontroller.
The requirements are:
should be fast (not crucial but would be nice)
should be light on data memory (small overhead <8KB, excludes program variable space)
preferably would be small in program size and the language would be compact
preferably, human readable (for example, BASIC)
Thanks!
Some AVR interpreters:
http://www.cqham.ru/tbcgroup/index_eng.htm
http://www.jcwolfram.de/projekte/avr/chipbasic2/main.php
http://www.jcwolfram.de/projekte/avr/chipbasic8/main.php
http://www.jcwolfram.de/projekte/avr/main.php
http://code.google.com/p/python-on-a-chip/
http://www.avrfreaks.net/index.php?module=Freaks%20Academy&func=viewItem&item_id=688&item_type=project
http://www.avrfreaks.net/index.php?module=Freaks%20Academy&func=viewItem&item_id=626&item_type=project
http://www.avrfreaks.net/index.php?module=Freaks%20Academy&func=viewItem&item_id=460&item_type=project
This is a bit generic: there are many kinds of Microcontrollers, and thanks to technologies like Jazelle, it is possible to run hardware-accelerated Java on Microcontrollers. (if... your microcontroller supports it)
For a generic answer: Forth is commonly referenced. But really, you need to be far more specific with your question.
Micro-controllers come in a vast variety of architectures. There are small 8-bit families, 32-bit families with simple architectures and 32-bit families with MMU support, suitable for running a modern OS. If you don't state which family you are targeted at, it is impossible to answer your question.
Anyway, for 8-bit families the best you can get is a BASIC variant. See Bascom for example. Note that this would be a compiled version of the "interpreted" language. If you actually want to have a runtime or an interpreter that will execute your code, then you most probably need to install an operating system on your microcontroller.
There were a variety of interpreted languages for small micros back in the late 1970's and 1980's. They seem to have mostly fallen out of fashion. I'd like to have a p-code based C compiler for the PIC18 that could coexist nicely with my other C compiler; for much of my code I'd be willing to accept a 100-fold slowdown for a 50% space reduction (so long as I could keep the important stuff in native code). I would think that would be achievable, but I'm not about to implement such a thing from scratch myself.

Are binary protocols dead? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
It seems like there used to be way more binary protocols because of the very slow internet speeds of the time (dialup). I've been seeing everything being replaced by HTTP and SOAP/REST/XML.
Why is this?
Are binary protocols really dead or are they just less popular? Why would they be dead or less popular?
You Just Can't Beat the Binary
Binary protocols will always be more space efficient than text protocols. Even as internet speeds drastically increase, so does the amount and complexity of information we wish to convey.
The text protocols you reference are outstanding in terms of standardization, flexibility and ease of use. However, there will always be applications where the efficiency of binary transport will outweigh those factors.
A great deal of information is binary in nature and will probably never be replaced by a text protocol. Video streaming comes to mind as a clear example.
Even if you compress a text-based protocol (e.g. with GZip), a general purpose compression algorithm will never be as efficient as a binary protocol designed around the specific data stream.
But Sometimes You Don't Have To
The reason you are seeing more text-based protocols is because transmission speeds and data storage capacity have indeed grown fast compared to the data size for a wide range of applications. We humans find it much easier to work with text protocols, so we designed our ubiquitous XML protocol around a text representation. Certainly we could have created XML as a binary protocol, if we really had to save every byte, and built common tools to visualize and work with the data.
Then Again, Sometimes You Really Do
Many developers are used to thinking in terms of multi-GB, multi-core computers. Even your typical phone these days puts my first IBM PC-XT to shame. Still, there are platforms such as embedded devices, that have rather strict limitations on processing power and memory. When dealing with such devices, binary may be a necessity.
A parallel with programming languages is probably very relevant.
While high-level languages are the preferred tools for most programming jobs, and have been made possible (in part) by the increases in CPU speed and storage capacity, they haven't removed the need for assembly language.
In a similar fashion, non-binary protocols introduce more abstraction, more extensibility and are therefore the vehicle of choice particularly for application-level communication. They too have benefited from increases in bandwidth and storage capacity. Yet at lower level it is still impractical to be so wasteful.
Furthermore unlike with programming languages where there are strong incentives to "take the performance hit" in exchange for added simplicity, speed of development etc., the ability to structure communication in layers makes the complexity and "binary-ness" of lower layers rather transparent to the application level. For example so long as the SOAP messages one receives are ok, the application doesn't need to know that these were effectively compressed to transit over the wire.
Facebook, Last.fm, and Evernote use the Thrift binary protocol.
I rarely see this talked about but binary protocols, block protocols especially can greatly simplify the complexity of server architectures.
Many text protocols are implemented in such a way that the parser has no basis upon which to infer how much more data is necessary before a logical unit has been received (XML and JSON parsers can tell you the minimum number of bytes needed to finish, but can't provide meaningful estimates). This means that the parser may have to periodically cede to the socket receiving code to retrieve more data. This is fine if your sockets are in blocking mode, not so easy if they're not. It generally means that all parser state has to be kept on the heap, not the stack.
If you have a binary protocol where very early in the receive process you know exactly how many bytes you need to complete the packet, then your receiving operations don't need to be interleaved with your parsing operations. As a consequence, the parser state can be held on the stack, and the parser can execute once per message and run straight through without pausing to receive more bytes.
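A minimal sketch of this receive-then-parse pattern in Python is shown below. The 4-byte big-endian length prefix and the helper names are assumptions chosen for illustration, and blocking sockets are used to keep it short; with non-blocking sockets the same length prefix tells the receive loop exactly how many bytes are still outstanding.

```python
import socket
import struct

# Hypothetical framing: every message is a 4-byte big-endian length plus payload.
LENGTH_PREFIX = struct.Struct(">I")


def recv_exactly(sock: socket.socket, count: int) -> bytes:
    """Read exactly count bytes, looping because recv may return fewer."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def send_message(sock: socket.socket, payload: bytes) -> None:
    sock.sendall(LENGTH_PREFIX.pack(len(payload)) + payload)


def recv_message(sock: socket.socket) -> bytes:
    """Receive one complete message; parsing can then run straight through."""
    (length,) = LENGTH_PREFIX.unpack(recv_exactly(sock, LENGTH_PREFIX.size))
    return recv_exactly(sock, length)
```

Only after `recv_message` returns does the parser run, so it never has to pause and wait for more bytes.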
There will always be a need for binary protocols in some applications, such as very-low-bandwidth communications. But there are huge advantages to text-based protocols. For example, I can use Firebug to easily see exactly what is being sent and received from each HTTP call made by my application. Good luck doing that with a binary protocol :)
Another advantage of text protocols is that even though they are less space efficient than binary, text data compresses very well, so the data may be automatically compressed to get the best of both worlds. See HTTP Compression, for example.
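A quick way to see the effect is to compress some repetitive text data yourself. The sketch below uses Python's `zlib` module, which implements the DEFLATE compression also used by gzip; the sample records are invented purely for illustration, and the actual ratio depends heavily on the data.

```python
import json
import zlib

# Invented sample data: many similar records, as is typical for API responses.
records = [{"id": i, "status": "ok", "temperature": 20 + i % 5} for i in range(200)]

text = json.dumps(records).encode("utf-8")
compressed = zlib.compress(text)

# Compression is lossless: decompressing gives back the original bytes.
restored = zlib.decompress(compressed)

print(len(text), len(compressed))
```

The repeated key names are exactly the kind of redundancy a general purpose compressor removes well.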
Binary protocols are not dead. It is much more efficient to send binary data in many cases.
WCF supports binary encoding using TCP.
http://msdn.microsoft.com/en-us/library/ms730879.aspx
So far the answers all focus on space and time efficiency. No one has mentioned what I feel is the number one reason for so many text-based protocols: sharing of information. It's the whole point of the Internet and it's far easier to do with text-based, human-readable protocols that are also easily processed by machines. You rid yourself of language dependent, application-specific, platform-biased programming with text data interchange.
Link in whatever XML/JSON/*-parsing library you want to use, find out the structure of the information, and snip out the pieces of data you're interested in.
Some binary protocols I've seen in the wild for Internet applications:
Google Protocol Buffers which are used for internal communications but also on, for example Google Chrome Bookmark Syncing
Flash AMF which is used for communication with Flash and Flex applications. Both Flash and Flex have the capability of communicating via REST or SOAP, however the AMF format is much more efficient for Flex as some benchmarks prove
I'm really glad you have raised this question, as the usage of non-binary protocols has multiplied manyfold since the introduction of XML. Ten years ago, you would see virtually everybody touting their "compliance" with XML-based communications. However, this approach, one of several alternatives to binary protocols, has many deficiencies.
One of the values, for example, was readability. But readability is mainly important for debugging, when humans have to read the transaction. XML-based protocols are very inefficient when compared with binary transfers. This is due to the fact that XML itself is a binary stream that has to be translated by another layer into textual fragments ("tokens"), and then back into binary along with the contained data.
Another value people found was extensibility. But extensibility can be easily maintained if a protocol version number for the binary stream is used at the beginning of the transaction. Instead of sending XML tags, one could send binary indicators. If the version number is an unknown one, then the receiving end can download the "dictionary" of this unknown version. This dictionary could, for example, be an XML file. But downloading the dictionary is a one time operation, instead of every single transaction!
So efficiency could be kept together with extensibility, and very easily! There are a good number of "compiled XML" protocols out there which do just that.
Last, but not least, I have even heard people say that XML is a good way to overcome the difference between little-endian and big-endian binary systems, for example Sun computers vs Intel computers. But this is incorrect: if both sides can accept XML (ASCII) in the right way, surely both sides can accept binary in the right way, as XML and ASCII are also transmitted in binary.
Hope you find this interesting reading!
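The endianness point can be illustrated with Python's `struct` module: as long as the protocol fixes a byte order, both sides produce and read the same bytes regardless of their native byte order. The value used here is arbitrary.

```python
import struct

value = 0x12345678

# The format prefix fixes the byte order explicitly, independent of the host CPU.
big = struct.pack(">I", value)     # bytes 12 34 56 78
little = struct.pack("<I", value)  # bytes 78 56 34 12

# A receiver that knows the agreed order recovers the value on any machine.
(from_big,) = struct.unpack(">I", big)
(from_little,) = struct.unpack("<I", little)
```

What matters is that the byte order is part of the protocol definition rather than an accident of the sender's hardware.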
Binary protocols will continue to live wherever efficiency is required. Mostly, they will live in the lower levels, where hardware implementations are more common than software implementations. Speed isn't the only factor - the simplicity of implementation is also important. Making a chip process binary data messages is much easier than parsing text messages.
Surely this depends entirely on the application? There have been two general types of example so far, xml/html related answers and video/audio. One is designed to be 'shared' as noted by Jonathon and the other to be efficient in its transfer of data (and without Matrix vision, 'reading' a movie would never be useful like reading an HTML document).
Ease of debugging is not a reason to choose a text protocol over a 'binary' one - the requirements of the data transfer should dictate that. I work in the Aerospace industry, where the majority of communications are high-speed, predictable data flows like altitude and radio frequencies, thus they are assigned bits on a stream and no human-readable wrapper is required. It is also highly efficient to transfer and, other than interference detection, requires no meta data or protocol processing.
So certainly I would say that they are not dead.
I would agree that people's choices are probably affected by the fact that they have to debug them, but will also heavily depend on the reliability, bandwidth, data type, and processing time required (and power available!).
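As a rough illustration of "assigning bits on a stream", the sketch below packs two values into one 32-bit word in Python. The field widths and names are hypothetical and are not taken from any real avionics format.

```python
# Hypothetical layout of one 32-bit word: 17 bits altitude, 15 bits frequency index.
ALT_BITS = 17
FREQ_BITS = 15


def pack_sample(altitude_ft: int, freq_index: int) -> bytes:
    """Place both fields at fixed bit positions in a 4-byte big-endian word."""
    if not 0 <= altitude_ft < (1 << ALT_BITS):
        raise ValueError("altitude out of range")
    if not 0 <= freq_index < (1 << FREQ_BITS):
        raise ValueError("frequency index out of range")
    word = (altitude_ft << FREQ_BITS) | freq_index
    return word.to_bytes(4, "big")


def unpack_sample(data: bytes) -> tuple[int, int]:
    """Recover the two fields by shifting and masking."""
    word = int.from_bytes(data, "big")
    return word >> FREQ_BITS, word & ((1 << FREQ_BITS) - 1)


sample = pack_sample(35000, 12345)
```

There are no field names or delimiters on the wire; both ends simply agree on the bit positions in advance.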
They are not dead because they are the underlying layers of every communication system. Every major communication system's data link and network layers are based on some kind of "binary protocol".
Take the internet for example, you are now probably using Ethernet in your LAN, PPPoE to communicate with your ISP, IP to surf the web and maybe FTP to download a file. All of which are "binary protocols".
We are seeing this shift towards text-based protocols in the upper layers because they are much easier to develop and understand when compared to "binary protocols", and because most applications don't have strict bandwidth requirements.
depends on the application...
I think in real-time environments (firewire, usb, field buses...) there will always be a need for binary protocols
Are binary protocols dead?
Two answers:
Let's hope so.
No.
At least a binary protocol is better than XML, which provides all the readability of a binary protocol combined with even less efficiency than a well-designed ASCII protocol.
Eric J's answer pretty much says it, but here's some more food for thought and facts. Note that the stuff below is not about media protocols (videos, images). Some items may be clear to you, but I keep hearing myths every day so here you go ...
There is no difference in expressiveness between a binary protocol and a text protocol. You can transmit the same information with the same reliability.
For every optimum binary protocol, you can design an optimum text protocol that takes just around 15% more space, and that protocol you can type on your keyboard.
In practice (practical protocols I see every day), the difference is often even less significant due to the static nature of many binary protocols.
For example, take a number that can become very large (e.g., in 32 bit range) but is often very small. In binary, people model this usually as four bytes. In text, it's often done as printed number followed by colon. In this case, numbers below ten become two bytes and numbers below 100 three bytes. (You can of course claim that the binary encoding is bad and that you can use some size bits to make it more space efficient, but that's another thing that you have to document, implement on both sides, and be able to troubleshoot when it comes over your wire.)
For example, messages in binary protocols are often framed by length fields and/or terminators, while in text protocols, you just use a CRC.
In practice, the difference is often less significant due to required redundancy.
You want some level of redundancy, no matter if it's binary or text. Binary protocols often leave no room for error. You have to 100% correctly document every bit that you send, and since most of us are humans, that happens rarely and you can't read it well enough to make a safe conclusion what is correct.
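The integer example above can be made concrete by comparing encoded sizes in Python. The fixed four-byte and "printed number followed by colon" encodings follow the description in this answer; the variable-length encoding is a common base-128 varint scheme that I have swapped in as one possible form of the "size bits" idea, and all function names are made up for illustration.

```python
import struct


def encode_fixed(n: int) -> bytes:
    """Static binary encoding: always four bytes."""
    return struct.pack(">I", n)


def encode_text(n: int) -> bytes:
    """Text encoding: printed decimal number followed by a colon."""
    return f"{n}:".encode("ascii")


def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 data bits per byte, high bit set if more bytes follow."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


for n in (7, 42, 300, 4_000_000_000):
    print(n, len(encode_fixed(n)), len(encode_text(n)), len(encode_varint(n)))
```

For small values the text form is shorter than the static four-byte form, while the varint is shorter still at the cost of extra encoding rules on both sides.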
So in summary: Binary protocols are theoretically more space and compute efficient, but the difference is in practice often less than you think and the deal is often not worth it. I am working in the Internet of Things area and have to deal on a nearly daily basis with custom, badly designed binary protocols which are really hard to troubleshoot, annoying to implement and not more space efficient. If you don't need to absolutely tweak the last milliampere out of your battery and calculate with microcontroller cycles (or transmit media), think twice.

lowest level language until asp.net?

It's assembler, right? Can someone please point out the progression that we've had in programming languages from assembler to the days of asp.net, namely the chronological order of languages?
Here's a wiki timeline of all programming languages.
I would include a FTA table, but the list is very robust and extensive.
And also, the lowest language you ever get to is assembly (aside from straight up issuing machine instructions), regardless of what other language is built on top (including ASP.NET). Other languages are really just abstractions on top of assembly. In fact, ASP.NET gets compiled into IL (Intermediate Language) code, which then gets JITed into assembly. Assembly is as close to the metal as you're going to get.
To be pedantic, "assembler" is not actually a language (any more than "compiler" is;-) -- rather, it's a program that takes a source file in "assembly language" and emits binary machine code. The binary machine code can be said to be lower-level than the assembly language, since the latter allows use of some symbols and often includes a macro processing ability as well.
"Below" binary machine code, there may be other levels, known as "microcode" (but there might not be -- the CPU might be implemented entirely in real hardware, without any microprogramming aspect). That might be relevant only if the system's architecture allowed programmers to alter the microcode, especially by adding to it, etc -- there have been machines that did that, but I don't believe any currently commercialized CPU does. So you probably don't have to care about that (and the by-now-esoteric distinctions between vertical and horizontal microcode, etc, etc;-).
Programming languages are just ways to assemble solutions to computing problems.
The argument is "assembled out of what?"
From that point of view, I'd suggest the following evolutionary curve:
Napier's Bones
Babbage's difference engine
Jacquard (card) looms
(Conceptual) Abstract Turing machines/Post Systems/Church's calculus
Relay Computers (Aiken?)
Vacuum tubes as switching elements (Eniac)
Transistor-based computers
Microprogrammed machines
Integrated Circuits
Large Scale Circuits
with "assembler" being the programming language used to put together solutions consisting of instructions for real machines, starting with the vacuum tube systems. (I'm not sure the relay machines actually had assemblers.)
Programming languages are just ways to put together high-level commands that reduce in effect to assembler instructions.
There are two different dimensions to consider here, what I'd call vertical growth (languages build up over time from one generation to the next) and horizontal growth (syntactic improvements and reduction in complexity.)
A good explanation of vertical change is seen here: http://web.sxu.edu/rogers/sys/generations.html
And a nice, yet incomplete, illustration of horizontal change is here: http://oreilly.com/news/graphics/prog_lang_poster.pdf

Resources