Re: Encoding of primitives for binary serialization

From:

Tom Anderson <twic@urchin.earth.li>

Newsgroups:

comp.lang.java.programmer

Date:

Thu, 9 Apr 2009 22:06:02 +0100

Message-ID:

<alpine.DEB.1.10.0904092119540.27156@urchin.earth.li>

On Thu, 9 Apr 2009, kb wrote:

> I'm implementing binary serialization for primitive data types both in
> java and c++. Also I need to handle serialization/de-serialization
> across java and c++ i.e. serialization from java and de-serialization in
> c++ and vice-versa.
>
> For this I need to decide an encoding for primitive data types which is
> independent of language and platform. Does any one have some idea about
> such an encoding format.

Use the formats used in internet protocols - see pretty much any low-level
RFC for details. The TCP and IP ones would do. Bytes are bytes, 16- and
32-bit numbers are written out byte by byte in 'network byte order', ie
most significant byte first. In java, use Data{Out,In}putStream for that,
and in C, the htons/ntohs and htonl/ntohl functions from arpa/inet.h. For
64-bit numbers, java's writeLong/readLong follow the same byte order; C
has no standard 64-bit counterpart to htonl, so you'd shift the eight
bytes out yourself, most significant first. You can do signed and
unsigned, but be aware that in java, which has no native unsigned integer
types (apart from char), you'll need to use the next bigger type to hold
unsigneds, eg an unsigned short will need an int to hold it.
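
A minimal sketch of the java side of that, in case it helps. The class and
method names (NetworkOrderDemo, encode, decode) are made up for
illustration; only the Data{Out,In}putStream calls are real API. It writes
one 16-bit, one 32-bit and one 64-bit value most significant byte first,
then reads the first two back as unsigned by widening them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class: names are illustrative, not from any library.
class NetworkOrderDemo {

    // Writes a 16-bit, a 32-bit and a 64-bit value, most significant byte first.
    static byte[] encode(int u16, long u32, long i64) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeShort(u16);     // low 16 bits of the int, 2 bytes
            out.writeInt((int) u32); // low 32 bits of the long, 4 bytes
            out.writeLong(i64);      // 8 bytes
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
    }

    // Reads the three values back; the unsigned ones are widened to fit.
    static long[] decode(byte[] wire) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
            int u16 = in.readUnsignedShort();      // unsigned short held in an int
            long u32 = in.readInt() & 0xFFFFFFFFL; // unsigned int held in a long
            long i64 = in.readLong();
            return new long[] { u16, u32, i64 };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = encode(0xFFFE, 0xFFFFFFFDL, 0x0102030405060708L);
        long[] back = decode(wire);
        System.out.println(wire.length + " bytes: " + back[0] + " " + back[1]
                + " " + Long.toHexString(back[2]));
    }
}
```

The mask on readInt() is the important bit: without it, any unsigned
32-bit value above 2^31 - 1 comes back negative.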

Floating-point numbers are harder; you might be better off avoiding them
altogether if possible, but if not, use the IEEE 754 32- and 64-bit
formats. Again, in java the Data*putStreams do that. I'm not aware of
standard functions to do it in C, though - if you're on a machine which
uses 754 natively (x86 does, as does nearly everything else current), you
can just pun the float as a 32-bit int (memcpy is the safe way to do
that) and write that out through htonl. On one that doesn't, like a VAX
or an older IBM mainframe, you'll need to find a machine-specific library
with an encoding function in it.
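
To make the 'pun it as an int' idea concrete, here is a rough java
equivalent. FloatBitsDemo and its method names are invented for this
example; Float.floatToIntBits, Float.intBitsToFloat and ByteBuffer are
standard library calls. It should produce the same bytes as
DataOutputStream's writeFloat and writeDouble, as far as i can tell from
their documentation.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class: names are illustrative, not from any library.
class FloatBitsDemo {

    // IEEE 754 single precision as 4 bytes, most significant byte first.
    static byte[] encodeFloat(float f) {
        int bits = Float.floatToIntBits(f); // the 'pun': float viewed as an int
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16), (byte) (bits >>> 8), (byte) bits
        };
    }

    // Inverse of encodeFloat: reassemble the int, then view it as a float.
    static float decodeFloat(byte[] b) {
        int bits = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
                 | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    // IEEE 754 double precision as 8 bytes; a new ByteBuffer is big-endian by default.
    static byte[] encodeDouble(double d) {
        return ByteBuffer.allocate(8).putDouble(d).array();
    }

    static double decodeDouble(byte[] b) {
        return ByteBuffer.wrap(b).getDouble();
    }

    public static void main(String[] args) {
        byte[] one = encodeFloat(1.0f);
        System.out.printf("%02x %02x %02x %02x%n", one[0], one[1], one[2], one[3]);
        System.out.println(decodeFloat(encodeFloat(-2.5f)));
        System.out.println(decodeDouble(encodeDouble(Math.PI)));
    }
}
```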

Booleans are bytes - false is 0, true is 1.

For characters, you're working in unicode (whether you like it or not!),
and you just have to pick an encoding.

UTF-16 will let you encode every character in the Basic Multilingual
Plane (all the ones you're likely to encounter, anyway) in two bytes
each, and is simple to do; rarer characters outside it take four bytes,
as a surrogate pair. You also have to pick a byte order for it -
big-endian matches the numbers above.

UTF-8 encodes ASCII characters in one byte each, accented latin letters,
greek, cyrillic, hebrew, arabic and a few other scripts in two bytes, and
nearly all others in three bytes (four for anything outside the BMP).
That makes it a good choice if you're mostly handling western text but a
poor one if you might be handling southern and eastern asian scripts. It
has good library support in most languages.

SCSU encodes text in close to a minimal number of bytes (averaging one
per character for alphabetic scripts, two per character for ideographic
ones), but is rather complex (and is really a string rather than a
character encoding); however, there are libraries for doing it in java
and C.
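
If you want to check the sizes for your own text, something like this
sketch will do it. EncodedSizeDemo and its two helper methods are names i
made up; String.getBytes and StandardCharsets are standard. The sample
characters are written as escapes so the source file's own encoding
doesn't matter.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: names are illustrative, not from any library.
class EncodedSizeDemo {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes as big-endian UTF-16, with no byte order mark.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // ASCII letter
            "\u00e9",       // latin small e with acute accent
            "\u044f",       // cyrillic small ya
            "\u4e2d",       // a CJK ideograph
            "\uD834\uDD1E"  // musical G clef, outside the BMP (surrogate pair)
        };
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0))
                    + "  UTF-8: " + utf8Length(s) + "  UTF-16: " + utf16Length(s));
        }
    }
}
```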

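There are various ways you could do strings. The best is probably to write
the string length as an integer, then all the characters one by one, in
whichever encoding you picked. Decide up front whether the length counts
characters or encoded bytes; bytes is easier on the reader, which can then
pull the whole string in with one read. This is different from the
standard formats in both java (writeUTF, which has a two-byte length and
its own modified UTF-8) and C (null-terminated), but easier to implement!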
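
A sketch of the length-then-characters format in java, under two
assumptions of mine that you're free to change: the length is a 32-bit
big-endian count of encoded bytes, and the encoding is UTF-8.
LengthPrefixedString is a made-up name; there's no validation here, so a
real reader would want to reject negative or absurdly large lengths
before allocating.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: names and format choices are illustrative.
class LengthPrefixedString {

    // Writes a 4-byte big-endian byte count, then the UTF-8 bytes.
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(utf8.length);
            out.write(utf8);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
    }

    // Reads the byte count, then exactly that many bytes, and decodes them.
    static String decode(byte[] wire) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
            int length = in.readInt();
            byte[] utf8 = new byte[length];
            in.readFully(utf8); // throws EOFException if the data is truncated
            return new String(utf8, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = encode("h\u00e9llo");
        System.out.println(wire.length + " bytes on the wire");
        System.out.println(decode(wire).equals("h\u00e9llo"));
    }
}
```

The C++ side then only needs ntohl on the first four bytes and a read of
that many bytes into a buffer.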

Alternatively, relax the 'binary' requirement and use JSON.

tom

--
PS I am trying to stab a giant warthog in the arse but it keeps throwing
me off a bridge :( -- Martin Lewis