Wire Protocol Serialization - binary-serialization

I'm looking for what I'll call, for lack of a better term, a 'binary serializer/deserializer code generator': something that lets you specify the on-the-wire format with arbitrary bit lengths and then generates the C/C++ code to pack/unpack packets in that format. I started down the path of using a struct with bit fields, but after reading this post I'm wondering if there's already something out there that handles all the messy problems. An example data structure I would need to deal with:
struct header {
    unsigned int val1 : 8;
    unsigned int val2 : 24;
    unsigned int val3 : 16;
    unsigned int val4 : 2;
    unsigned int val5 : 3;
    unsigned int val6 : 1;
    unsigned int val7 : 10;
};
The motivation for keeping the fields of the data structure like that is that it makes the programmer's job easier: they set/get fields that match the protocol directly, e.g. val5 might be a meaningful 3-bit flag. Yes, I could just have two 32-bit values for the whole struct and use bit masks and such to keep track of everything, but why should I have to?
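To illustrate what I mean by "bit masks and stuff", here is roughly the hand-rolled code I'd otherwise have to write (a sketch only; I'm assuming the fields are packed MSB-first into a single 64-bit word, which a real protocol spec would pin down):

```cpp
#include <cstdint>

// Mirrors the bit-field struct above; each member holds one protocol field.
struct Header {
    uint32_t val1, val2, val3, val4, val5, val6, val7;
};

// Pack the 64 bits MSB-first: val1 in the top 8 bits, val7 in the low 10.
uint64_t pack(const Header& h) {
    uint64_t w = 0;
    w |= (uint64_t)(h.val1 & 0xFF)     << 56;  // 8 bits
    w |= (uint64_t)(h.val2 & 0xFFFFFF) << 32;  // 24 bits
    w |= (uint64_t)(h.val3 & 0xFFFF)   << 16;  // 16 bits
    w |= (uint64_t)(h.val4 & 0x3)      << 14;  // 2 bits
    w |= (uint64_t)(h.val5 & 0x7)      << 11;  // 3 bits
    w |= (uint64_t)(h.val6 & 0x1)      << 10;  // 1 bit
    w |= (uint64_t)(h.val7 & 0x3FF);           // 10 bits
    return w;
}

// Reverse of pack(): extract each field back out of the wire word.
Header unpack(uint64_t w) {
    Header h;
    h.val1 = (w >> 56) & 0xFF;
    h.val2 = (w >> 32) & 0xFFFFFF;
    h.val3 = (w >> 16) & 0xFFFF;
    h.val4 = (w >> 14) & 0x3;
    h.val5 = (w >> 11) & 0x7;
    h.val6 = (w >> 10) & 0x1;
    h.val7 = w & 0x3FF;
    return h;
}
```

This is exactly the boilerplate I'd like a generator to emit for me.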
I'm aware of things like Google Protocol Buffers and the like, but AFAIK these all focus on the programmer-side data structure and don't allow you to specify specific bit patterns - imagine trying to create client code for low-level protocols where the binary wire format is how the protocol is specified. The closest thing I've found is protlr, which sounds great except it doesn't appear to be FOSS. Other posts on SO point to:
RedBlocks, which appears to be part of a full-blown embedded framework.
PADS, which seems extremely stale and overly complicated for my needs.
binpac, which sounds interesting, but I can't find an example of using it to parse arbitrary bit lengths (e.g. 1-bit, 2-bit, 17-bit fields), nor can I tell whether it also supports serialization, since it seems to be focused on one-way deserialization for intrusion detection.
Is there a FOSS alternative that meets my criteria besides rolling yet another serialization format, or can someone provide an example using one of these references for the structure above?

You might consider ASN.1 for this, using PER (aligned or unaligned). You can use either BIT STRING types constrained to your needed lengths, or INTEGER types with constraints that limit values to the number of bits you would like. Since ASN.1 and its encoding rules are independent of machine architecture and programming language, you don't have to worry about whether your machine is big-endian or little-endian, or whether one end of the communication prefers Java rather than C or C++; a good ASN.1 tool handles all of that for you. You can find out more at the ASN.1 Project page, which has a link to an Introduction to ASN.1 as well as a list of ASN.1 tools (some free, some commercial). The reason I mention UNALIGNED PER is that you can send exactly the number of bits you want across the line, with no padding bits added in between.
For BIT STRINGS, you can even assign names to individual bits that have some meaning to you for your application.

Related

Ada pragma Pack or Alignment attribute for Records?

Having just discovered alignment issues for the first time, I am unsure which method is the best/safest way to deal with them. I have a record which I am serialising to send over a Stream (and vice versa), so it must meet the interface spec and contain no padding.
Given the example record:
type MyRecord is record
   a : Unsigned_8;
   b : Unsigned_32;
end record;
This by default would require 8 bytes, but I am able to remove the padding using two methods:
for MyRecord'Alignment use 1;
or
pragma Pack (MyRecord);
I have found a few questions relating to C examples but haven't been able to find a clear answer on which method is the most appropriate, how to determine which method to use, or whether they are equivalent.
UPDATE
When I tried both on my 'real' code rather than a basic example, I found that the Alignment attribute achieved what I was looking for. pragma Pack reduced the size significantly further; not confirmed, but I assume it packed the many enumerated types I'm using, overriding the 'Size use 8 clause applied to each type.
For Streams you could leave MyRecord without any representation clauses and use the default MyRecord’Write and MyRecord’Read; ARM 13.13.2(9) says
For elementary types, Read reads (and Write writes) the number of stream elements implied by the Stream_Size for the type T; the representation of those stream elements is implementation defined. For composite types, the Write or Read attribute for each component is called in canonical order, which is last dimension varying fastest for an array (unless the convention of the array is Fortran, in which case it is first dimension varying fastest), and positional aggregate order for a record.
One possible disadvantage of the GNAT implementation (and maybe of others) is that the ’Write and ’Read calls each end in a call to the underlying network software. Not a problem (aside from possible inefficiency) normally, but if you’re using TCP_NODELAY (or worse, UDP) this is not the behaviour you’re looking for.
Overloading ’Write leads back to your original problem (but at least it’s confined to the overloading procedure, so the rest of your program can deal with properly aligned data).
I’ve used an in-memory stream for this (especially the UDP case); ’Write to the in-memory stream, then send the Stream_Element_Array to the socket. One example is ColdFrame.Memory_Streams (.ads, .adb).
I think you want the record representation clauses, if you want full control:
for MyRecord'Size use 40;
for MyRecord use record
   a at 0 range 0 .. 7;
   b at 1 range 0 .. 31;
end record;
(or some such, I might have messed up some of the indices here).
NB: edited as per comment by Simon

How to replace MPI_Pack_size if I need to send more than 2GB of data?

I want to send and receive more than 2 GB of data using MPI and I came across a lot of articles like the ones cited below:
http://blogs.cisco.com/performance/can-we-count-on-mpi-to-handle-large-datasets,
http://blogs.cisco.com/performance/new-things-in-mpi-3-mpi_count
talking about changes made starting with MPI 3.0 that allow sending and receiving bigger chunks of data.
Most of the functions now take an MPI_Count parameter instead of an int, but not all of them do.
How can I replace
int MPI_Pack_size(int incount, MPI_Datatype datatype, MPI_Comm comm,
int *size)
in order to get the size of a larger buffer? (because here the size can only be at most 2GB)
The MPI_Pack routines (MPI_Pack, MPI_Unpack, MPI_Pack_size, MPI_Pack_external) are, as you see, unable to support more than 32 bits' worth of data, because the counts and sizes are plain ints (including the int* output parameter). I don't know why the standard did not provide MPI_Pack_x, MPI_Unpack_x, MPI_Pack_size_x, and MPI_Pack_external_x -- presumably an oversight? As Jeff suggests, it might have been omitted because packing multiple gigabytes of data is unlikely to provide much benefit. Still, it breaks orthogonality not to have those...
A quality implementation (I do not know if MPICH is one of those) should return an error about the type being too big, allowing you to pack a smaller amount of data.
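In the meantime, the usual workaround is to split the data into chunks whose element counts each fit in an int and call MPI_Pack/MPI_Pack_size once per chunk. Here is a sketch of just the chunking arithmetic (no actual MPI calls; the function name is illustrative):

```cpp
#include <climits>
#include <cstdint>
#include <vector>

// Split `total` elements into chunk sizes that each fit in a signed int,
// so each chunk can be passed to MPI_Pack/MPI_Pack_size as an `int incount`.
std::vector<int> chunk_counts(int64_t total, int max_per_call = INT_MAX) {
    std::vector<int> chunks;
    while (total > 0) {
        int n = (total > max_per_call) ? max_per_call : (int)total;
        chunks.push_back(n);
        total -= n;
    }
    return chunks;
}
```

You would then accumulate the per-chunk sizes into an int64_t/MPI_Count of your own rather than a single int.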

Arithmetic operation in Assembly

I am learning assembly language. I find that arithmetic in assembly can be either signed or unsigned. The rules are different for the two types of arithmetic, and I find it is the programmer's headache to decide which rules to apply. So a programmer should know beforehand whether the arithmetic involves negative numbers or not; if yes, signed arithmetic rules should be used, otherwise the simpler and easier unsigned arithmetic will do.
The main problem I find with unsigned arithmetic is: what if the result is larger than its storage area? That can easily be solved by using a bigger-than-required storage area for the data, but that consumes extra bytes and increases the size of the data segment. If size is no issue, can't we use this technique freely?
If you are the programmer, you are in control of your data representation within the bounds of the requirements of your software's target domain. This means you need to know well before you actually start touching code what type of data you are going to be dealing with, how it is going to be arranged (in the case of complex data types) and how it is going to be encoded (floating-point/unsigned integer/signed integer, etc.). It is "safest" to use the operations that match the type of the data you're manipulating which, if you've done your design right, you should already know.
It's not that simple. Most arithmetic operations are sign-agnostic: they are neither signed nor unsigned.
The interpretation of the result, which is determined by the program's specification, is what makes them signed or unsigned, not the operation itself. The proper flavor of compare instruction, however, always has to be chosen carefully.
In some CPU architectures there are distinct signed and unsigned divide instructions, but that is about as far as it goes. Most CPUs have two flavors of right-shift instruction: arithmetic, which preserves the high bit, and logical, which replaces it with zero; these serve signed and unsigned handling, respectively.
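To make the point concrete, here is a small C++ sketch: the addition produces the same bit pattern under either interpretation, and only the comparison flavor encodes signedness (function names are my own, for illustration):

```cpp
#include <cstdint>

// Addition is sign-agnostic: both interpretations see the same result bits.
uint8_t add8(uint8_t a, uint8_t b) { return uint8_t(a + b); }

// Only the comparison flavor chosen encodes signedness.
bool greater_unsigned(uint8_t a, uint8_t b) { return a > b; }
bool greater_signed(uint8_t a, uint8_t b) { return int8_t(a) > int8_t(b); }
```

With the byte 0xFF, add8(0xFF, 1) wraps to 0 whether you read 0xFF as 255 or as -1, but the two compares disagree about whether 0xFF is greater than 1.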

What the best ways to use decimals and datetimes with protocol buffers?

I would like to find out the optimum way of storing some common data types that are not included in the list supported by protocol buffers:
datetime (seconds precision)
datetime (milliseconds precision)
decimals with fixed precision
decimals with variable precision
lots of bool values (if you have lots of them, it looks like you'll have 1-2 bytes of overhead for each of them due to their tags)
Also the idea is to map them very easy to corresponding C++/Python/Java data types.
The protobuf design rationale is most likely to keep data type support as "native" as possible, so that it's easy to adopt new languages in the future. I suppose they could provide built-in message types, but where do you draw the line?
My solution was to create two message types:
DateTime
TimeSpan
This is only because I come from a C# background, where these types are taken for granted.
In retrospect, TimeSpan and DateTime may have been overkill, but it was a "cheap" way of avoiding conversion from h/m/s to s and vice versa; that said, it would have been simple to just implement a utility function such as:
int TimeUtility::ToSeconds(int h, int m, int s)
Bklyn pointed out that heap memory is used for nested messages; in some cases this is clearly very valid - we should always be aware of how memory is used. But in other cases it can be of less concern, where we're worried more about ease of implementation (this is the Java/C# philosophy, I suppose).
There's also a small disadvantage to using non-intrinsic types with the protobuf TextFormat::Printer; you cannot specify the format in which it is displayed, so it'll look something like:
my_datetime {
seconds: 10
minutes: 25
hours: 12
}
... which is too verbose for some. That said, it would be harder to read if it were represented in seconds.
To conclude, I'd say:
If you're worried about memory/parsing efficiency, use seconds/milliseconds.
However, if ease of implementation is the objective, use nested messages (DateTime, etc).
Here are some ideas based on my experience with a wire protocol similar to Protocol Buffers.
datetime (seconds precision)
datetime (milliseconds precision)
I think the answer to these two would be the same, you would just typically be dealing with a smaller range of numbers in the case of seconds precision.
Use a sint64/sfixed64 to store the offset in seconds/milliseconds from some well-known epoch like midnight GMT 1/1/1970. This is how Date objects are represented internally in Java. I'm sure there are analogues in Python and C++.
If you need time zone information, pass around your date/times in terms of UTC and model the pertinent time zone as a separate string field. For that, you can use the identifiers from the Olson Zoneinfo database since that has become somewhat standard.
This way you have a canonical representation for date/time, but you can also localize to whatever time zone is pertinent.
decimals with fixed precision
My first thought is to use a string similar to how one constructs Decimal objects from Python's decimal package. I suppose that could be inefficient relative to some numerical representation.
There may be better solutions depending on what domain you're working with. For example, if you're modeling a monetary value, maybe you can get away with using a uint32/64 to communicate the value in cents as opposed to fractional dollar amounts.
There are also some useful suggestions in this thread.
decimals with variable precision
Doesn't Protocol Buffers already support this with float/double scalar types? Maybe I've misunderstood this bullet point.
Anyway, if you had a need to go around those scalar types, you can encode using IEEE-754 to uint32 or uint64 (float vs double respectively). For example, Java allows you to extract the IEEE-754 representation and vice versa from Float/Double objects. There are analogous mechanisms in C++/Python.
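For example, in C++ the IEEE-754 bits of a float can be moved in and out of a uint32 portably with memcpy (C++20 adds std::bit_cast for the same purpose); a sketch:

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float's IEEE-754 bit pattern as a uint32 for the wire.
uint32_t float_to_bits(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);  // memcpy avoids aliasing violations
    return u;
}

// Reverse: rebuild the float from its wire representation.
float bits_to_float(uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

The same pattern with uint64 and double covers the double-precision case.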
lots of bool values (if you have lots of them it looks like you'll have 1-2 bytes overhead for each of them due to their tags)
If you are concerned about wasted bytes on the wire, you could use bit-masking techniques to compress many booleans into a single uint32 or uint64.
Because there isn't first class support in Protocol Buffers, all of these techniques require a bit of a gentlemens' contract between agents. Perhaps using a naming convention on your fields like "_dttm" or "_mask" would help communicate when a given field has additional encoding semantics above and beyond the default behavior of Protocol Buffers.
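The bit-masking idea for booleans is just a handful of shifts and masks; a sketch in C++ (the helper names are my own, not protobuf API):

```cpp
#include <cstdint>

// Pack boolean flag i (0..31) into a single uint32 destined for the wire.
// Both sides must agree on which bit means what (the "gentlemen's contract").
inline void set_flag(uint32_t& mask, int i, bool v) {
    if (v) mask |=  (1u << i);
    else   mask &= ~(1u << i);
}

// Read flag i back out of the received mask.
inline bool get_flag(uint32_t mask, int i) {
    return (mask >> i) & 1u;
}
```

One uint32 field then carries up to 32 booleans for the cost of a single tag.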
Sorry, not a complete answer, but a "me too".
I think this is a great question, one I'd love an answer to myself. The inability to natively describe fundamental types like datetimes and (for financial applications) fixed-point decimals, or to map them to language-specific or user-defined types, is a real killer for me. It's more or less prevented me from being able to use the library, which I otherwise think is fantastic.
Declaring your own "DateTime" or "FixedPoint" message in the proto grammar isn't really a solution, because you'll still need to convert your platform's representation to/from the generated objects manually, which is error prone. Additionally, these nested messages get stored as pointers to heap-allocated objects in C++, which is wildly inefficient when the underlying type is basically just a 64-bit integer.
Specifically, I'd want to be able to write something like this in my proto files:
message Something {
required fixed64 time = 1 [cpp_type="boost::posix_time::ptime"];
required int64 price = 2 [cpp_type="fixed_point<int64_t, 4>"];
...
};
And I would be required to provide whatever glue was necessary to convert these types to/from fixed64 and int64 so that the serialization would work. Maybe through something like adobe::promote?
For datetime with millisecond resolution I used an int64 that has the datetime as YYYYMMDDHHMMSSmmm. This makes it both concise and readable, and surprisingly, will last a very long time.
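The digit-grouping arithmetic for that encoding is straightforward; a C++ sketch (names illustrative, no range validation):

```cpp
#include <cstdint>

// Encode a date/time into the readable int64 form YYYYMMDDHHMMSSmmm.
int64_t encode_dt(int y, int mo, int d, int h, int mi, int s, int ms) {
    int64_t v = y;
    v = v * 100 + mo;      // YYYYMM
    v = v * 100 + d;       // YYYYMMDD
    v = v * 100 + h;       // YYYYMMDDHH
    v = v * 100 + mi;      // YYYYMMDDHHMM
    v = v * 100 + s;       // YYYYMMDDHHMMSS
    return v * 1000 + ms;  // YYYYMMDDHHMMSSmmm
}

// Peel fields back off with division and modulo.
int decode_ms(int64_t v)  { return int(v % 1000); }
int decode_sec(int64_t v) { return int((v / 1000) % 100); }
int decode_min(int64_t v) { return int((v / 100000) % 100); }
```

Note the values sort chronologically as plain integers, which is part of the appeal.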
For decimals, I used byte[], knowing that there's no better representation that won't be lossy.

Packet data structure?

I'm designing a game server and I have never done anything like this before. I was just wondering what a good structure for a packet would be data-wise? I am using TCP if it matters. Here's an example, and what I was considering using as of now:
(each value in brackets is a byte)
[Packet length][Action ID][Number of Parameters]
[Parameter 1 data length as int][Parameter 1 data type][Parameter 1 data (multi byte)]
[Parameter 2 data length as int][Parameter 2 data type][Parameter 2 data (multi byte)]
[Parameter n data length as int][Parameter n data type][Parameter n data (multi byte)]
Like I said, I really have never done anything like this before so what I have above could be complete bull, which is why I'm asking ;). Also, is passing the total packet length even necessary?
Passing the total packet length is a good idea. It might cost two more bytes, but you can peek and wait for the socket to have a full packet available before receiving. That makes the code easier.
Overall, I agree with brazzy: a language-supplied serialization mechanism is preferable to anything self-made.
Other than that (I think you are using a C-ish language without serialization), I would put the packet ID as the first data in the packet structure. IMHO that's something of a convention, because the first data member of a struct is always at position 0, so any struct can be downcast to it, identifying otherwise anonymous data.
Your compiler may or may not produce packed structures, but that way you can allocate a buffer, read the packet in, and then cast to the appropriate structure depending on the first data member. If you are out of luck and it does not produce packed structures, be sure to have a deserialization method for each struct that constructs it from the (obviously non-destination) memory.
Endianness is a factor, particularly in C-like languages. Be sure to make clear that packets always have the same endianness, or that a different endianness can be identified by a signature or something similar. An odd thing that's very cool: C# and .NET seem to always hold data in little-endian convention when you access it as discussed in the post linked here. I found that out when porting such an application to Mono on a Sun machine. Cool, but if you have that setup you should use the serialization facilities of C# anyway.
Other than that, your setup looks very okay!
Start by considering a much simpler basic wrapper: Tag, Length, Value (TLV). Your basic packet will look then like this:
[Tag] [Length] [Value]
Tag is a packet identifier (like your action ID).
Length is the packet length. You may need this to tell whether you have the full packet. It will also let you figure out how long the value portion is.
Value contains the actual data. The format of this can be anything.
In your case above, the value data contains a further series of TLV structures (parameter type, length, value). You don't actually need to send the number of parameters, as you can work it from the data length and walking the data.
As others have said, I would put the packet ID (Tag) first. Unless you have cross-platform concerns, I would consider wrapping your application's serialised object in a TLV and sending it across the wire like that. If you make a mistake or want to change later, you can always create a new tag with a different structure.
See Wikipedia for more details on TLV.
To avoid reinventing the wheel, any serialization protocol will work for on-the-wire data (e.g. XML, JSON), and you might consider looking at BEEP for the basic protocol framework.
BEEP is summed up well in its FAQ document as 'kind of a "best hits" album of the tricks used by experienced application protocol designers since the early 80's.'
There's no reason to make it as complicated as that. I see that you have an action ID, so I suppose there is a fixed number of actions.
For each action, you would define a data structure and put each of those values in it. To send it over the wire, you just allocate a buffer of sum(sizeof(item_i)) bytes for the elements of your structure. So your packet would look like this:
[action ID][item 1 (sizeof(item 1) bytes)][item 2 (sizeof(item 2) bytes)]...[item n (sizeof(item n) bytes)]
The idea is, you already know the size and type of each variable on each side of the connection, so you don't need to send that information.
For strings, you can just throw them in in null-terminated form, and then when you 'know' to look for a string based on your packet type, read until you hit a null.
--
Another option would be to use '\r\n' to delineate your variables. That would require some overhead, and you would have to use text rather than binary values for numbers. But that way you could just use readline to read each variable. Your packets would look like this:
[action ID]
[item 1 (as text)]
...
[item n (as text)]
--
Finally, simply serializing objects and passing them down the wire is a good way to do this too, with the least amount of code to write. Remember that you don't want to prematurely optimize, and that includes network traffic as well. If it turns out you need to squeeze out a little bit more performance later on you can go back and figure out a more efficient mechanism.
And check out Google's protocol buffers, which are supposedly an extremely fast way to serialize data in a platform-neutral way, kind of like a binary XML, but without nested elements. There's also JSON, which is another platform-neutral encoding. Using protocol buffers or JSON would mean you wouldn't have to worry about how to encode the messages yourself.
Do you want the server to support multiple clients written in different languages? If not, it's probably not necessary to specify the structure exactly; instead use whatever facility for serializing data your language offers, simply to reduce the potential for errors.
If you do need the structure to be portable, the above looks OK, though you should specify stuff like endianness and text encoding as well in that case.
