Re: Binary file IO: Converting imported sequences of chars to desired type

From: James Kanze <james.kanze@gmail.com>
Newsgroups: comp.lang.c++
Date: Wed, 28 Oct 2009 05:40:12 -0700 (PDT)
Message-ID: <ce3098bd-7d8f-4289-bc72-ce81874aacf6@f16g2000yqm.googlegroups.com>

On Oct 26, 5:55 pm, Rune Allnor <all...@tele.ntnu.no> wrote:

On 26 Oct, 18:06, James Kanze <james.ka...@gmail.com> wrote:

On Oct 25, 7:39 pm, Rune Allnor <all...@tele.ntnu.no> wrote:

(but my data sets have generally been very mixed, with a lot
of non floating point data as well). And binary formatting
can be more or less expensive as well---I'd rather deal with
text than a BER encoded double. And Jorgen said very
explicitly "if you have a choice". Sometimes you don't have
the choice: you have to conform to an already defined
external format, or the profiler says you don't have the
choice.

The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just
try it).

Try it on what machine:-).

Any machine. The problem is to decode text-formatted numbers
to binary.

You're giving concrete figures.

Yep. But as rule-of-thumb. My point is not to be accurate (you
have made a very convincing case why that would be difficult),
but to point out what performance costs and trade-offs are
involved when using text-based file formats.

The problem is that there is no real rule-of-thumb possible.
Machines (and compilers) differ too much today.

In terms of concrete numbers, of course... Using time gave
me values too small to be significant for 10000000 doubles
on the Linux machine (top of the line AMD processor of less
than a year ago); for 100000000 doubles, it was around 85
seconds for text (written in scientific format, with 17
digits precision, each value followed by a new line, total
file size 2.4 GB). For 10000000, it was around 45 seconds
under Windows (file size 250 MB).
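
For readers who want to repeat this kind of measurement, here is a minimal sketch. It is not the code behind the figures above; the function names are hypothetical, and `std::setprecision(17)` is only one reading of "17 digits precision". The binary writer dumps the in-memory representation, so its output is not portable between architectures.

```cpp
#include <chrono>
#include <fstream>
#include <iomanip>
#include <ios>
#include <string>
#include <vector>

// Write each value in scientific notation, one value per line.
// (Hypothetical helper; the precision setting is an assumption.)
void write_text(const std::string& path, const std::vector<double>& data)
{
    std::ofstream out(path);
    out << std::scientific << std::setprecision(17);
    for (double d : data) {
        out << d << '\n';
    }
}

// Dump the in-memory representation of the values.
// Not portable: byte order and floating point format are the host's.
void write_binary(const std::string& path, const std::vector<double>& data)
{
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size() * sizeof(double)));
}

// Wall-clock seconds taken by a callable.
template <typename F>
double seconds_for(F&& f)
{
    const auto start = std::chrono::steady_clock::now();
    f();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Size of a file in bytes, or -1 if it cannot be opened.
long long file_size(const std::string& path)
{
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    return in ? static_cast<long long>(in.tellg()) : -1;
}
```

Something like `seconds_for([&] { write_text("out.txt", data); })` then gives a rough figure for one run. Disk caching, the compiler and the library implementation all influence the result, so a single number should not be read as a general rule.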

I suspect you might either have access to a bit more funky
hardware than most users, or have the skills to fine tune what
you have better than most users. Or both.

The code was written very quickly, with no tricks or anything.
It was tested on off the shelf PC's---one admittedly older than
those most people are using, the other fairly recent. The
compilers in question were the version of g++ installed with
Suse Linux, and the free download version of VC++. I don't
think that there's anything in there that can be considered
"funky" (except maybe that most people professionally concerned
with high input have professional class machines to do it, which
are out of my price range), and I certainly didn't tune
anything.

Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously. By
a factor of much more than 2 (which is what you've mentioned).
Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format,

This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on numbers of significant digits
and other formatting text glyphs. File sizes don't account for the
50-100x difference in time.

There is no 50-100x difference. There's at most a difference of
15x, on the machines I've tested; the difference would probably
be less if I somehow inhibited the effects of disk caching
(because the disk access times would increase);

Again, your assets might not be representative of the
average user's.

Well, I'm not sure there's such a thing as an average user. But
my machines are very off the shelf, and I'd consider VC++ and
g++ very "average" as well, in the sense that they're what an
average user is most likely to see.

Here is a test I wrote in matlab a few years ago, to
demonstrate the problem (WinXP, 2.4GHz, no idea about disk):

I'm afraid it doesn't demonstrate anything to me, because I have
no idea how Matlab works. It might be using unbuffered output
for text, or synchronizing at each double. And in what format?

The script first generates ten million random numbers,
and writes them to file in both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc.

Actually, reading immediately after writing maximizes the
effects of file caches. And on a modern machine, with say 4GB
main memory, a small file like this will be fully cached.

I'll rephrase: Eliminates *variability* due to file caches.

By choosing the best case, which rarely exists in practice.

Whatever happens affects both files in equal amounts. It would
bias results if one file was cached and the other not.

What is cached depends on what the OS can fit in memory. In
other words, the first file you wrote was far more likely to be
cached than the second.

The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes.

If you're dumping raw data, a binary file with 10000000
doubles, on a PC, should be exactly 80 MB.

It was. The file browser I used reported the file size
in KBytes. Multiply the number by 1024 and you get
exactly 80 Mbytes.

Strictly speaking, a KB is exactly 1000 bytes, not 1024:-). But
I know, different programs treat this differently.
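
The size arithmetic can be checked with a small sketch. The helper names are hypothetical, and the 8 bytes per value is an assumption that holds for a raw dump of IEEE 754 doubles on a PC.

```cpp
#include <cstdint>

// Bytes occupied by a raw dump of `count` values.
// The default of 8 bytes per value is an assumption (64-bit double).
constexpr std::uint64_t raw_dump_bytes(std::uint64_t count,
                                       std::uint64_t bytes_per_value = 8)
{
    return count * bytes_per_value;
}

// A byte count expressed in whole units of `unit` bytes (1000 or 1024, say).
constexpr std::uint64_t in_units_of(std::uint64_t bytes, std::uint64_t unit)
{
    return bytes / unit;
}
```

With these, `raw_dump_bytes(10000000)` is 80000000, and `in_units_of(80000000, 1024)` is 78125, which is presumably where a figure of "about 78" comes from when a tool counts in units of 1024 bytes.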

The first few lines in the text file look like
-4.3256481e-001
-1.6655844e+000
1.2533231e-001
2.8767642e-001
(one leading whitespace, one negative sign or whitespace, no
trailing spaces) which is not excessive, either with respect
to the number of significant digits or the number of other
characters.

It's not sufficient with regard to the number of digits.
You won't read back in what you've written.

I know. If that was a constraint, file sizes and read/write
times would increase correspondingly.

It was a constraint. Explicitly. At least in this thread, but
more generally: about the only time it won't be a constraint is
when the files are for human consumption, in which case, I think
you'd agree, binary isn't acceptable.
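
The round trip point can be illustrated with a short sketch, assuming IEEE 754 doubles and a library whose text conversions are correctly rounded. The helper names are hypothetical. Eight significant digits, as in the sample lines quoted above, generally do not reproduce the original double; `std::numeric_limits<double>::max_digits10` significant digits (17 for an IEEE 754 double) do.

```cpp
#include <cstdlib>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Format a double in scientific notation with the given number of
// digits after the decimal point.
std::string format_scientific(double value, int digits_after_point)
{
    std::ostringstream out;
    out << std::scientific << std::setprecision(digits_after_point) << value;
    return out.str();
}

// True if writing the value as text and parsing the text back
// yields exactly the same double.
bool round_trips(double value, int digits_after_point)
{
    const std::string text = format_scientific(value, digits_after_point);
    return std::strtod(text.c_str(), nullptr) == value;
}

// Digits after the point so that the total number of significant digits
// equals max_digits10 (one digit sits before the point).
constexpr int kRoundTripDigits =
    std::numeric_limits<double>::max_digits10 - 1;
```

For example, `round_trips(1.0 / 3.0, 7)` is false, while `round_trips(1.0 / 3.0, kRoundTripDigits)` is true under the assumptions above.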

The timing numbers (both absolute and relative) would be
of similar orders of magnitude if you repeated the test
with C++.

I did, and they aren't. They're actually very different in
two separate C++ environments.

The application I'm working with would need to crunch
through some 10 GBytes of numerical data per hour. Just
reading that amount of data from a text format would
require on the order of
1e10/1.75e8*42s = 2400s = 40 minutes.
There is no point in even considering using a text format
for these kinds of things.

But it must not be doing much processing on the data, just
copying it and maybe a little scaling. My applications do
significant calculations (which I'll admit I don't
understand, but they do take a lot of CPU time). The time
spent writing the results, even in XML, is only a small part
of the total runtime.

The read?

I don't know. It's by some other applications, in other
departments, and I have no idea what they do with the data.

You're probably right, however, that to be accurate, I should do
some comparisons including reading. For various reasons (having
to deal with possible errors, etc.), the CPU overhead when
reading is typically higher than when writing.

But I'm really only disputing your order of magnitude
differences, because they don't correspond with my experience
(nor my measurements). There's definitely more overhead with
text format. The only question is whether that overhead is more
expensive than the cost of the alternatives, and that depends
on what you're doing. Obviously, if you can't afford the
overhead (and I've worked on applications which couldn't), then
you use binary, but my experience is that a lot of people jump
to binary far too soon, because the overhead isn't that critical
that often.

If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply with
some already defined standard, like e.g. IEEE 754.

So that the language couldn't be used on some important
platforms? (Most mainframes still do not use IEEE. Most don't
even use binary: IBM's are base 16, and Unisys's base 8.) And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.
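
The Sparc/PC incompatibility is a matter of byte order: both use IEEE 754 doubles, but store the eight bytes in opposite order. A minimal sketch of what a raw dump looks like when read on a machine of the other byte order (helper names are hypothetical):

```cpp
#include <algorithm>
#include <array>
#include <cstring>

using DoubleBytes = std::array<unsigned char, sizeof(double)>;

// The bytes of a double exactly as they sit in memory on this host.
DoubleBytes raw_bytes(double value)
{
    DoubleBytes bytes;
    std::memcpy(bytes.data(), &value, sizeof value);
    return bytes;
}

// Reinterpret a sequence of bytes as a double in host byte order.
double from_raw_bytes(const DoubleBytes& bytes)
{
    double value;
    std::memcpy(&value, bytes.data(), sizeof value);
    return value;
}

// The same bytes in reverse order, as a host of the opposite
// byte order would have written them.
DoubleBytes byte_swapped(DoubleBytes bytes)
{
    std::reverse(bytes.begin(), bytes.end());
    return bytes;
}
```

Reading `byte_swapped(raw_bytes(1.0))` back with `from_raw_bytes` does not give 1.0; swapping a second time does. This only covers byte order between IEEE 754 machines, not the non-IEEE mainframe formats mentioned above.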

I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many
16-bit encodings. No one says which one should be used. Only
which ones should be available.

The current standard doesn't even say that. It only gives a
minimum list of characters which must be supported. But I'm
not sure what your argument is: you're saying that we should
standardize some binary format more than the text format?

Yep. Some formats, like IEEE 754 (and maybe descendants),
are fairly universal. No matter what the native formats
look like, it ought to suffice to call a standard method
to dump binary data in that format.

To date, neither C nor C++ has made the slightest gesture in the
direction of standardizing any binary formats. There are other
(conflicting) standards which do: XDR, for example, or BER. I
personally think that adding a second set of streams, supporting
XDR, to the standard, would be a good thing, but I've never had
the time to actually write up such a proposal. And a general
binary format is quite complex to specify; it's one thing to say
you want to output a table of double, but to be standardized,
you also have to define what is output when a large mix of types
are streamed, and how much information is necessary about the
initial data in order to read them.

(The big difference, of course, is that while the
standard doesn't specify any encoding, there are a number of
different encodings which are supported on a lot of
different machines. Whereas a raw dump of double doesn't
work even between a PC and a Sparc. Or between an older
Mac, with a Power PC, and a newer one, with an Intel chip.
Upgrade your machine, and you lose your data.)

Exactly. Which is why there ought to be a standardized binary
floating point format that is portable between platforms.

There are several: I've used both XDR and BER in applications in
the past. One of the reasons C++ doesn't address this issue is
that there are several, and C++ doesn't want to choose one over
the others.
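
As an illustration of what such a format involves for a single double: XDR represents a double as a 64-bit IEEE 754 value with the most significant byte first. The sketch below is not a full XDR implementation, and the function names are hypothetical. It assumes the host double is itself a 64-bit IEEE 754 value and that the host uses the same byte order for integers and floating point; a host with a different native format would need a real conversion.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559 && sizeof(double) == 8,
              "this sketch assumes the host double is a 64-bit IEEE 754 value");

using XdrDouble = std::array<unsigned char, 8>;

// Encode a double as 8 bytes, most significant byte first,
// independently of the host's byte order.
XdrDouble encode_xdr_double(double value)
{
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    XdrDouble out;
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<unsigned char>((bits >> (56 - 8 * i)) & 0xFF);
    }
    return out;
}

// Decode 8 bytes, most significant byte first, back into a double.
double decode_xdr_double(const XdrDouble& in)
{
    std::uint64_t bits = 0;
    for (unsigned char byte : in) {
        bits = (bits << 8) | byte;
    }
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Because the bytes are extracted by shifting rather than by copying memory, the encoded form is the same on a PC and on a Sparc. Streaming a mix of types, which is where most of the specification effort goes, is not addressed here.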

--
James Kanze