Re: binary file parsing
On May 5, 12:07 pm, p...@informatimago.com (Pascal J. Bourguignon)
wrote:
James Kanze <james.ka...@gmail.com> writes:
On May 4, 11:16 am, p...@informatimago.com (Pascal J. Bourguignon)
wrote:
Christopher <cp...@austin.rr.com> writes:
[...]
I'd rather use arithmetic operations since
this would avoid the need for reinterpreting bits.
I fully agree with the rest of what you wrote, but I'm
curious about this. If the format is specified in terms
of bits (usually the case), it would seem to me that the
bit operations are more appropriate, in the sense that
they are closer to the specification. IOW, for a 32 bit
unsigned integer: if the format specification says that
the first octet contains the value divided by 16777216,
the second the value divided by 65536, modulo 256, etc.,
then I'd use arithmetic operators. If it says that the
first octet contains the bits 24-31, the second the bits
16-23, etc., I'd probably use bit operations (shifting
and masking).
For integers, I assume it wouldn't make a lot of
difference.
For anything: if the specification says "this byte
corresponds to bits 8-15", I find shifting left 8 more
intuitive (closer to what is written in the specification)
than multiplying by 256.
The compiler probably generates the same code
for n<<8 and for n*256 anyway. You'd just have to be
careful not to leave uninitialized bits in the case where
the word size of the host is bigger than that of the file
(e.g. reading a 32-bit integer on a 36-bit host),
particularly for signed integers, where you may need to do
a sign extension. That's where using subtraction to
convert a two's complement value seems to me more
interesting than just using bit-or.
That's a different issue. You use shifting, etc. to extract
the sign bit and the value bits, because that's how their
location in the input data is specified: by the bit position
of the data. You use arithmetic operations to create the
final value from the extracted fields, because you're now
dealing with mathematical identities.
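Something along these lines, perhaps (just a sketch; the
names are invented, and the masks are only needed on hosts
where char has more than 8 bits):

// Assemble the unsigned field with shifts and masks, because that
// is how the format specifies it: byte 0 holds bits 24-31, byte 1
// holds bits 16-23, and so on.
unsigned long
toUnsigned32(unsigned char const bytes[4])
{
    return ((unsigned long)(bytes[0] & 0xFF) << 24)
         | ((unsigned long)(bytes[1] & 0xFF) << 16)
         | ((unsigned long)(bytes[2] & 0xFF) <<  8)
         |  (unsigned long)(bytes[3] & 0xFF);
}

// Build the signed value arithmetically: when the sign bit is set,
// the two's complement value is field - 2^32, obtained here by
// subtraction rather than by reinterpreting the bit pattern.  (On a
// 1's complement or signed magnitude host with 32 bit longs, -2^31
// itself may not be representable; real code has to decide what to
// do about that.)
long
toSigned32(unsigned char const bytes[4])
{
    unsigned long value = toUnsigned32(bytes);
    if (value & 0x80000000UL) {
        return -(long)(0xFFFFFFFFUL - value) - 1;
    }
    return (long)value;
}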
The difference is obviously much more evident in floating
point: you're not going to be able to extract individual
fields using arithmetic operators on floating point, and
anything you do using bitwise operators to assemble the
fields will be very implementation dependent---for that, you
want the mathematical functions and operators.
But I see that later, after my example:
[...]
Absolutely. The file format will be specified at the bit
or byte level and will have to be processed with bit
operations. But to build the host value, integer or
floating-point, it's safer to use arithmetic operations
such as ldexp() and negation.
You actually agree with me. Bitwise for extracting (and
inserting when writing) the individual fields, arithmetic
for manipulating the values. (It seems reasonable to
consider each field an unsigned int of a specific size.)
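To make the floating point case concrete, here's a rough
sketch for a value stored as an IEEE 754 single in
big-endian order (invented name; it handles zero, denormals
and normalized values, and punts on infinities and NaNs):

#include <math.h>

// Extract the fields with shifts and masks (that is how the format
// defines them), then rebuild the value with ldexp() and negation,
// so nothing depends on the host's own floating point layout.
double
readIeeeSingle(unsigned char const bytes[4])
{
    unsigned long bits = ((unsigned long)(bytes[0] & 0xFF) << 24)
                       | ((unsigned long)(bytes[1] & 0xFF) << 16)
                       | ((unsigned long)(bytes[2] & 0xFF) <<  8)
                       |  (unsigned long)(bytes[3] & 0xFF);
    int           sign     = (int)((bits >> 31) & 0x1);
    int           exponent = (int)((bits >> 23) & 0xFF);
    unsigned long fraction = bits & 0x7FFFFFUL;

    double value;
    if (exponent == 0) {
        // zero or denormal: 0.fraction * 2^-126
        value = ldexp((double)fraction, -126 - 23);
    } else if (exponent == 0xFF) {
        value = 0.0;    // infinity or NaN: not handled in this sketch
    } else {
        // normalized: 1.fraction * 2^(exponent - 127)
        value = ldexp((double)(fraction | 0x800000UL), exponent - 127 - 23);
    }
    return sign ? -value : value;
}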
For example, for a two's complement, 32-bit integer stored
as four 8-bit bytes in big-endian order, I would do:
#include <limits.h>
#if CHAR_BIT < 8
#error "Won't be able to read the bytes"
#endif
(An implementation where that fires isn't a legal C/C++
implementation, though.)
Are you sure? AFAIK, char may be 6 bits; there are
trigraphs to deal with such hosts. But I haven't read
the recent C standards.
C90 required that UCHAR_MAX be at least 255, and that the
representation be pure binary. For an implementation to
conform with less than 8 bits in a char, it would have to
fit 255 in less than 8 bits, using a binary representation.
Which is impossible.
Admitted, it's a pretty indirect way to specify the minimum
size, but it's what C90 used, and it hasn't changed in C++
or C99.
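So on a conforming implementation, a guard against chars
narrower than 8 bits can never fire. If you want a guard at
all, the useful one is against chars *wider* than 8 bits,
where the extra masking matters, something like:

#include <limits.h>
#if UCHAR_MAX != 0xFF
#error "char wider than 8 bits: mask each byte with 0xFF when packing"
#endif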
In the end, the real question is how portable you want
(or need) to be. For a lot of people, it's probably
acceptable to suppose that 1) there is a 32 bit integral
type, using 2's complement, and 2) that conversion from
unsigned to signed int doesn't change the bit pattern.
In such cases, you can make reading an integer
a lot simpler. Similarly, if in addition you can
suppose IEEE floating point, my floating point read can
be made a lot, lot simpler---just put the four bytes in
an array, and memcpy into the double. (Currently, there
are only a few exotic machines which don't have a 32 bit
2's complement integral type. None of the mainframes I
know use IEEE floating point, however.) I would stress,
however, that if you make such simplifying assumptions,
you document them clearly and explicitly.
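Under those documented assumptions, the floating point read
does collapse to something like this (again just a sketch,
with an invented name; it reads the four bytes into a float
and widens):

#include <assert.h>
#include <string.h>

// Quick version under the documented assumptions: the file holds
// IEEE 754 single precision, the host float has the same
// representation, and the four bytes are already in host byte order
// (swap them first if not).  The assert at least catches a size
// mismatch at run time.
double
readIeeeSingleQuick(unsigned char const bytes[4])
{
    assert(sizeof(float) == 4);
    float result;
    memcpy(&result, bytes, sizeof result);
    return result;
}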
Of course, this is a question of specification.
If all that is needed is to be able to save and load
binary data on the same computer, we could just memory
map the file.
Not necessarily. As I said, I've actually experienced the
case where the byte order of a long changed from one version
of the compiler to the next, and on the machine I currently
use, the size of a long depends on a compiler option (and
there are two different system API's, depending on which
compiler option was used).
However, I see one problem in relying on specifications:
they are not formally defined, and cannot be checked
automatically (in general, I have yet to see a real
project using formal specification tools; it seems I'm not
in the right industry for that kind of tool :-( ). So the
problem is that you will clearly document that your
lateral accelerometer outputs 16-bit values; you will
document it and write it all over. But when another team
reuses your module, and embeds it in hardware subject to
lateral accelerations that need more than 16 bits to
express, no compiler will be there to catch the
specification mismatch, and you'll lose costly hardware
and perhaps lives. OK, you've specified and documented the
limits of your module, so nobody can reproach you for
anything; nonetheless, that didn't prevent a catastrophe.
I see you've read the details of the Ariane 5 explosion as
well:-). In practice, this is a real problem. The boss
says that the code will never have to be ported to another
machine, and the week following its delivery, he installs a
new machine, with different characteristics. I've found
that it is often worthwhile being "portable", even if it
isn't part of the specifications. How portable depends on
the extra effort required, but I'd certainly attempt to
handle at least byte order and the size of the basic
integral types. Beyond that, a lot depends---the type of
projects I usually work on might conceivably be run on a
mainframe, so I take mainframe architectures into
consideration (none of the mainframes I know use IEEE, and
some have 36 or 48 bit ints, padding bits in int, and 1's
complement or signed magnitude), but if I were writing, say,
a GUI interface, I probably wouldn't bother: I'd suppose
IEEE floating point, for example, because all current
workstations and PC's use it, and I can't imagine that
changing.
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34