Re: binary file parsing

From: James Kanze <james.kanze@gmail.com>
Newsgroups: comp.lang.c++
Date: Mon, 4 May 2009 02:26:27 -0700 (PDT)
Message-ID: <9f82b259-45ee-4512-a0bc-0e4f33e7c703@r3g2000vbp.googlegroups.com>

On May 4, 8:46 am, Christopher <cp...@austin.rr.com> wrote:

On May 4, 1:21 am, joshuamaur...@gmail.com wrote:

On May 3, 10:39 pm, Christopher <cp...@austin.rr.com> wrote:
Google ios::binary.

Do you realize how many contradicting opinions come up in a
google search for something so basic, 95% of which are
probably wrong?

:-)

Here is what I found in my google search:
1) make a template function that converts an array of bytes
into an integral type (neat but probably overkill)

I'm not sure I see the use of a template here, either. You
generally have at most two or three cases to deal with (2 bytes,
4 bytes, and once in a while, 8 bytes). And the template
metaprogramming necessary is probably more work
and more lines of code than just writing the three functions.
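
For illustration, the three functions might look like the following. This is a minimal sketch, assuming the file stores unsigned integers least significant byte first (little-endian); the function names are hypothetical, and the fixed-width types come from <cstdint>, which may not be available on older compilers.

```cpp
#include <cstdint>

// Each function decodes an unsigned value from raw bytes.
// ASSUMPTION: little-endian layout; reverse the byte order for big-endian.
std::uint16_t decodeU16(const unsigned char* p)
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

std::uint32_t decodeU32(const unsigned char* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) <<  8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

std::uint64_t decodeU64(const unsigned char* p)
{
    // Start from the most significant byte (the last one in the buffer).
    std::uint64_t result = 0;
    for (int i = 7; i >= 0; --i) {
        result = (result << 8) | p[i];
    }
    return result;
}
```

None of these depends on sizeof(int) or on the byte order of the machine running the code; only the layout of the file matters.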

2) use >> and extract to an int, don't worry about it (probably
wrong)

If the format is binary, certainly wrong.

3) read byte by byte and bit shift into an integral result
(probably overly complicated and still depends on size)

It's the only solution I know. And it's not that complicated.

4) let's use fstream.h and do it the deprecated way
(definitely wrong)

And doesn't work any better than with <fstream>.

5) Well first we better check the endianness of the machine....
(I can assume the same OS wrote it that is reading it in my
case)

Endianness doesn't really depend on the OS. I've seen it change
from one version of the compiler to the next. And of course,
you really can't assume that your users are never going to
upgrade the hardware, which, depending on the upgrade path,
could mean a lot of things. It's simpler just to do it right to
begin with.

I bet most of those are either not necessary, overly
complicated, or just wrong.

I think a better source is: http://www.parashift.com/c++-faq-lite/serialization.html#faq-36.6

However, it still fails to explain why we need to avoid the
extraction operator in favor of read and write, although that
tidbit did answer my question.

The extraction operator formats/unformats. For a specific
format. (You can control it somewhat via flags, but it always
handles text.) For binary input, *if* you use istream or
ostream (I wouldn't, for a file that is completely binary), you
want to bypass the iostream formatting, using unformatted input
and output (read/get and write/put), and do your own formatting.

---------------------------
Here is my summation

1) There is no guarantee of the size of an integral type
2) The extraction operator, as I've always dealt with it, is
going to extract a number of bytes from the stream that is
equivalent to the sizeof the data type you are attempting to
extract to; it will try to convert it and, if it fails, will
set the fail bit

No. The extraction operator will read bytes, interpret them as
characters, and convert the resulting string into the type you
want. If the format in the source file is not text based, it
simply won't work.

So, it wouldn't be safe to

int numBytes; // 4 bytes make this value
file >> numBytes;

Because the authoring tool that wrote the file wrote 4 bytes

That is *NOT* enough information. Four bytes doesn't mean
anything. You still have to know the format used.

and depends on you reading 4 bytes, if any implementation of
an int comes along that isn't 4 bytes, your parser is broken.

In general, if you try to read in a format different from the
one that was written, it's not going to work.

Regardless if it is incorrect or not, the fail bit gets set in
my code and I would like to understand why?

Probably, the extractor didn't find what looked like an int.
The extractor skips whitespace, then looks for an optional sign,
followed by one or more digits. In whatever encoding is imbued
into the stream (typically either ISO 8859-1 or UTF-8 in the
environments I work in, but YMMV).
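
A small experiment shows the effect. This is only a sketch; extractsAsInt is a hypothetical helper, and a string stream stands in for the file. The bytes 01 24 01 01 at the start of the file quoted further down contain neither whitespace, a sign, nor a digit, so the extractor gives up immediately.

```cpp
#include <sstream>
#include <string>

// Returns true if formatted extraction of an int succeeds on `bytes`.
bool extractsAsInt(const std::string& bytes, int& value)
{
    std::istringstream in(bytes);
    in >> value;          // skips whitespace, then expects [sign] digits
    return !in.fail();
}
```

With " -42" as input the function returns true and stores -42. With the four bytes 0x01 0x24 0x01 0x01 it returns false, because the first character is not part of a textual number.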

//-----------------------------------------------------------------------------

void PolygonSetParser::ParseFile(const std::string & filepath,
                                 ID3D10Device & device,
                                 InputLayoutManager & inputLayoutManager,
                                 EffectManager & effectManager,
                                 bool generateTangentData)
{
   BaseException exception("Not Set",
                           "void PolygonSetParser::ParseFile(const std::string & filepath)",
                           "PolygonSetParser.cpp");

   // Open the file
   std::ifstream file(filepath.c_str(), std::fstream::in | std::fstream::binary);

   // Check if the file was successfully opened
   if( file.fail() )
   {
      exception.m_msg = std::string("Failed to open file: ") + filepath;
      throw exception;
   }

You also want to imbue the "C" locale, to ensure that no code
translation occurs. (This is the sort of thing that works in
your test programs, because the "C" locale is the default, and
your test programs don't need to change it, but fails in actual
code, because the application has switched to some other
locale.)
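
As a sketch of what that could look like (openBinary is a hypothetical helper, not part of the code under discussion), the locale is imbued before the file is opened, so the filebuf never works with any other locale:

```cpp
#include <fstream>
#include <locale>
#include <string>

// Opens `path` for binary reading with the classic "C" locale imbued.
// Returns true if the file was opened.
bool openBinary(std::ifstream& file, const std::string& path)
{
    file.imbue(std::locale::classic());   // before open(): no code translation
    file.open(path.c_str(), std::ios::in | std::ios::binary);
    return file.is_open();
}
```

std::locale::classic() is the "C" locale regardless of what the application has installed as the global locale.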

   // snip

   // Read in a data identifier
   unsigned char dataID;

   file >> dataID;

Note that I'd use get for this. And pass through an
intermediate int:

    int dataID = file.get() ;
    if ( dataID == EOF ) {
        // bad format...
    }

while( file.good() )

No. ios::good() is generally useless.

If you're reading into an int, as above, you can use:

while ( dataID != EOF )

Otherwise (and more generally):

while ( file )
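
To make the difference concrete: after a read that succeeds but happens to reach end of file, good() is already false, while the stream itself still tests true. A minimal sketch, using formatted extraction from a string stream purely for illustration; ReadStatus and readOneInt are hypothetical names.

```cpp
#include <sstream>
#include <string>

struct ReadStatus {
    bool succeeded;   // what `if ( stream )` reports, i.e. !fail()
    bool good;        // what stream.good() reports
    int  value;
};

// Extracts one int from `text` and records both views of the status.
ReadStatus readOneInt(const std::string& text)
{
    std::istringstream in(text);
    ReadStatus status = { false, false, 0 };
    in >> status.value;
    status.succeeded = !in.fail();
    status.good = in.good();   // false as soon as eofbit is set
    return status;
}
```

For the input "42" the extraction succeeds and yields 42, yet good() is false, because reading the last digit ran into end of file and set eofbit. A loop controlled by good() would discard that last value.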

This is one of the reasons I wouldn't use istream, but would
implement my own ibinstream, or whatever. The ibinstream (which
would still derive from ios) would then define the extraction
operator to handle the format you're reading, setting the
various status bits as appropriate.

   {
      switch( static_cast<int>(dataID) )
      {
         case MATERIAL_START:
         {
            ParseMaterial(file);
            break;
         }

         default:
         {
            std::stringstream msg;
            msg << "Unknown data identifier encountered in file: "
                << filepath << " at streampos " << file.tellg();

            exception.m_msg = msg.str();
            throw exception;
         }
      }

      file >> dataID;
   }
}

//-----------------------------------------------------------------------------

void PolygonSetParser::ParseMaterial(std::ifstream & file)
{
   BaseException exception("Not Set",
                           "void PolygonSetParser::ParseMaterial(std::ifstream & file)",
                           "PolygonSetParser.cpp");

   // Get how many bytes to parse
   long numBytes; // 4 bytes make this value

Not on my machines. Long is usually 64 bits today (although 32,
36 and 48 bits are not unknown, and 32 bits was common in the
past).

Note that this is one of the reasons why you must pass through
explicit serialization. Even if long is 32 bits on your machine
today, it's likely that if your user upgrades, it will be 64
bits. Whereas the format of the data file will not change.

file >> numBytes;

This reads ASCII.

if( !file.good() ) // fails test

And again, this may fail even if the input succeeded. Just use:

if ( ! file )

   {
      exception.m_msg = "Error parsing number of bytes to read for material";
      throw exception;
   }
}

First few bytes of file in hex:
01 24 01 01 00 00 16 43 00 00 16 43 00 00 16 43

So, I am going to try it this way:
int numBytes;
file.read(reinterpret_cast<char *>(&numBytes), 4);

Which looks like it solves my problem of worrying about
reading 4 bytes and the conversion.

Except that it doesn't. It may seem to work in a particular
case, but it doesn't provide a general solution, and may fail
the next time you recompile your code.
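
The difference can be seen by decoding the same four bytes both ways. This is a sketch under assumptions: the file format is taken to be little-endian, and rawCopy, decodeLE and hostIsLittleEndian are names invented for the example.

```cpp
#include <cstdint>
#include <cstring>

// Copies the bytes straight into an integer, as read() into &numBytes does.
// The result depends on the byte order of the machine running the code.
std::uint32_t rawCopy(const unsigned char* p)
{
    std::uint32_t value;
    std::memcpy(&value, p, sizeof value);
    return value;
}

// Builds the value arithmetically from the file's byte order
// (ASSUMED little-endian): same result on every machine.
std::uint32_t decodeLE(const unsigned char* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) <<  8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// True if the host stores the least significant byte first.
bool hostIsLittleEndian()
{
    const std::uint32_t one = 1;
    unsigned char first;
    std::memcpy(&first, &one, 1);
    return first == 1;
}
```

The two functions agree only when the host's byte order happens to match that of the file; decodeLE gives the same answer everywhere.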

I'd still like to know why the extraction operator fails
though, if anyone can explain.

A better question is why you would expect it to work.

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34