Re: Binary file IO: Converting imported sequences of chars to desired type
On Oct 25, 7:39 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
On 25 Oct, 18:47, James Kanze <james.ka...@gmail.com> wrote:
On Oct 25, 3:13 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
On 23 Oct, 10:27, James Kanze <james.ka...@gmail.com> wrote:
On Oct 23, 9:07 am, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
[...]
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed. Especially for the maintenance programmer,
who can see at a glance what is being written.
The user might have opinions, though.
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
A lot of time compared to what?
Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.
The only comparison that is relevant is compared to some other
way of doing it.
My experience has always been
that the disk IO is the limiting factor
Disk IO is certainly *a* limiting factor. But not the only
one. In this case it's not even the dominant one.
And that obviously depends on the CPU speed and the disk speed.
Text formatting does take some additional CPU time; if the disk
is slow and the CPU fast, this will be less important than if
the disk is fast and the CPU slow.
See the example below.
Which will only be for one compiler, on one particular CPU, with
one set of compiler options.
(Note that it's very, very difficult to measure these things
accurately, because of things like disk buffering. The order
you run the tests can make a big difference: under Windows, at
least, the first test run always runs considerably faster than
if it is run in some other position, for example.)
(but my data sets have generally been very mixed, with a lot
of non floating point data as well). And binary formatting
can be more or less expensive as well---I'd rather deal with
text than a BER encoded double. And Jorgen said very
explicitly "if you have a choice". Sometimes you don't have
the choice: you have to conform to an already defined
external format, or the profiler says you don't have the
choice.
The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just
try it).
Try it on what machine:-).
Any machine. The problem is to decode text-formatted numbers
to binary.
You're giving concrete figures. "Any machine" doesn't make
sense in such cases: I've seen factors of more than 10 in terms
of disk speed between different hard drives (and if the drive is
remote mounted, over a slow network, the difference can be even
more), and in my time, I've seen at least six or seven orders of
magnitude in speed between CPU's. (I've worked on 8 bit machines
which took on average 10 µs per machine instruction, with no
hardware multiply and divide, much less floating point
instructions.)
The compiler and the library implementation also make a
significant difference. I knocked up a quick test (which isn't
very accurate, because it makes no attempt to take into account
disk caching and such), and tried it on the two machines I have
handy: a very old (2002) laptop under Windows, using VC++, and a
very recent, high performance desktop under Linux, using g++.
Under Windows, the difference between text and binary was a
factor of about 3; under Linux, about 15. Apparently, the
conversion routines in the Microsoft compiler are a lot, lot
better than those in g++. The difference would be larger if I
had a higher speed disk or data bus; it would be significantly
smaller (close to zero, probably) if I synchronized each write.
(A synchronized disk write is about 10 ms, at least on a top of
the line Sun Sparc.)
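Something along these lines gives an idea of what I mean (a
sketch, not the exact code; the file names, the element count and
the use of rand() are arbitrary choices for illustration):

    #include <cstddef>
    #include <cstdlib>
    #include <fstream>
    #include <iomanip>
    #include <string>
    #include <vector>

    int main(int argc, char** argv)
    {
        // "binary" on the command line selects the raw dump;
        // anything else writes text.
        bool binary = argc > 1 && argv[1] == std::string("binary");
        std::vector<double> data(10000000);
        for (std::size_t i = 0; i != data.size(); ++i) {
            data[i] = std::rand() / (RAND_MAX + 1.0);
        }
        if (binary) {
            // Raw dump of the internal representation.
            std::ofstream out("data.bin", std::ios::binary);
            out.write(reinterpret_cast<char const*>(&data[0]),
                      data.size() * sizeof(double));
        } else {
            // Scientific format, 17 digits, one value per line.
            std::ofstream out("data.txt");
            out << std::scientific << std::setprecision(17);
            for (std::size_t i = 0; i != data.size(); ++i) {
                out << data[i] << '\n';
            }
        }
        return 0;
    }

Timing the two invocations with time(1) (or whatever the
equivalent is on your system) and comparing the resulting file
sizes is all the measurement I'm talking about; but keep in mind
the caveat above about caching and the order in which the tests
are run.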
In terms of concrete numbers, of course... Using time gave me
values too small to be significant for 10000000 doubles on the
Linux machine (top of the line AMD processor of less than a year
ago); for 100000000 doubles, it was around 85 seconds for text
(written in scientific format, with 17 digits precision, each
value followed by a new line, total file size 2.4 GB). For
10000000, it was around 45 seconds under Windows (file size 250
MB).
It's interesting to note that the Windows version is clearly IO
dominated. The difference in speed between text and binary is
pretty much the same as the difference in file size.
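(The 2.4 GB, by the way, is about what the format predicts: one
digit before the decimal point, 17 after it, a four character
exponent field, a newline and an occasional sign come to roughly
24 or 25 characters per value, and 1e8 * 24 is 2.4e9 bytes. The
binary equivalent is 8e8 bytes, a factor of three smaller.)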
Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously. By
a factor of much more than 2 (which is what you've mentioned).
Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format,
This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on the number of significant
digits and other formatting characters. File sizes don't account
for the 50-100x difference in time.
There is no 50-100x difference. There's at most a difference of
15x, on the machines I've tested; the difference would probably
be less if I somehow inhibited the effects of disk caching
(because the disk access times would increase); I won't bother
trying it with synchronized writes, however, because that would
go to the opposite extreme, and you'd probably never use
synchronized writes for each double: when they're needed, it's
for each record.
Here is a test I wrote in matlab a few years ago, to
demonstrate the problem (WinXP, 2.4GHz, no idea about disk):
I'm afraid it doesn't demonstrate anything to me, because I have
no idea how Matlab works. It might be using unbuffered output
for text, or synchronizing at each double. And in what format?
The script first generates ten million random numbers,
and writes them to file in both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc.
Actually, reading immediately after writing maximizes the
effects of file caches. And on a modern machine, with say 4GB
main memory, a small file like this will be fully cached.
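If you really want to measure the disk rather than the cache, you
have to get the data out of the cache between the write and the
read. A sketch of one way of doing it on a Linux/POSIX system
(posix_fadvise is only a hint to the kernel, so even this is no
guarantee):

    #include <fcntl.h>
    #include <unistd.h>

    // Force the file's data out to the disk, then ask the kernel to
    // drop the cached pages, so that the next read really has to go
    // to the disk.  Assumes the writer has already closed the file.
    void dropFromCache(char const* path)
    {
        int fd = open(path, O_RDONLY);
        if (fd != -1) {
            fdatasync(fd);
            posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
            close(fd);
        }
    }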
The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes.
If you're dumping raw data, a binary file with 10000000 doubles,
on a PC, should be exactly 80 MB.
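(10000000 doubles at sizeof(double) == 8 comes to exactly
80000000 bytes, or about 76 MB if the size is reported in units
of 1024*1024.)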
The first few lines in the text file look like
-4.3256481e-001
-1.6655844e+000
1.2533231e-001
2.8767642e-001
(one leading whitespace, one negative sign or whitespace, no
trailing spaces), which is not excessive, either with respect
to the number of significant digits or to the number of other
characters.
It's not sufficient with regard to the number of digits. You
won't read back in what you've written.
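A quick way of seeing it (a sketch; it assumes IEEE doubles and
correctly rounded conversions in the library):

    #include <iomanip>
    #include <iostream>
    #include <limits>
    #include <sstream>

    int main()
    {
        double const original = 1.0 / 3.0;

        // Eight significant digits, as in the output above: the value
        // read back is not the value which was written.
        std::ostringstream coarse;
        coarse << std::scientific << std::setprecision(7) << original;
        std::istringstream in1(coarse.str());
        double roundTrip1;
        in1 >> roundTrip1;
        std::cout << (roundTrip1 == original) << '\n';      // 0

        // digits10 + 1 digits after the point, plus the one before it,
        // is 17 significant digits: enough for an IEEE double to
        // survive the round trip.
        std::ostringstream fine;
        fine << std::scientific
             << std::setprecision(std::numeric_limits<double>::digits10 + 1)
             << original;
        std::istringstream in2(fine.str());
        double roundTrip2;
        in2 >> roundTrip2;
        std::cout << (roundTrip2 == original) << '\n';      // 1
        return 0;
    }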
The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.
I did, and they aren't. They're actually very different in two
separate C++ environments.
The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour. Just reading that
amount of data from a text format would require on the order
of
1e10/1.75e8*42s = 2400s = 40 minutes.
There is no point in even considering using a text format for
these kinds of things.
But it must not be doing much processing on the data, just
copying it and maybe a little scaling. My applications do
significant calculations (which I'll admit I don't understand,
but they do take a lot of CPU time). The time spent writing the
results, even in XML, is only a small part of the total runtime.
If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply with
some already defined standard, like e.g. IEEE 754.
So that the language couldn't be used on some important
platforms? (Most mainframes still do not use IEEE. Most don't
even use binary: IBM's are base 16, and Unisys's base 8.) And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.
I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many
16-bit encodings. No one says which one should be used. Only
which ones should be available.
The current standard doesn't even say that. It only gives a
minimum list of characters which must be supported. But I'm not
sure what your argument is: you're saying that we should
standardize some binary format more than the text format?
(The big difference, of course, is that while the standard
doesn't specify any encoding, there are a number of different
encodings which are supported on a lot of different machines.
Whereas a raw dump of double doesn't work even between a PC and
a Sparc. Or between an older Mac, with a Power PC, and a newer
one, with an Intel chip. Upgrade your machine, and you lose
your data.)
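For what it's worth, getting a defined external format isn't
rocket science either; you just have to write it explicitly,
rather than dumping whatever the machine happens to use
internally. A sketch of one possibility, writing IEEE 754
big-endian, ignoring NaNs, infinities, denormals and error
handling, and assuming unsigned long long has at least 64 bits:

    #include <cmath>
    #include <ostream>

    void writeIeeeBigEndian(std::ostream& dest, double value)
    {
        unsigned long long bits = 0;
        if (value != 0.0) {
            bool isNeg = value < 0.0;
            if (isNeg) {
                value = -value;
            }
            int exponent;
            double mantissa = std::frexp(value, &exponent); // in [0.5, 1)
            // IEEE 754 double: 1 sign bit, an 11 bit biased exponent,
            // and 52 explicit mantissa bits with an implicit leading 1.
            unsigned long long mantBits =
                static_cast<unsigned long long>(std::ldexp(mantissa, 53))
                    & 0xFFFFFFFFFFFFFULL;
            unsigned long long expBits =
                static_cast<unsigned long long>(exponent + 1022) & 0x7FF;
            bits = (static_cast<unsigned long long>(isNeg) << 63)
                 | (expBits << 52)
                 | mantBits;
        }
        // Most significant byte first, whatever the host byte order.
        for (int shift = 56; shift >= 0; shift -= 8) {
            dest.put(static_cast<char>((bits >> shift) & 0xFF));
        }
    }

On a machine which already uses IEEE, of course, you can get at
the bits more directly, but the principle is the same: the
external format is defined by the protocol, not by whatever the
compiler happens to do internally.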
--
James Kanze