Re: Binary file IO: Converting imported sequences of chars to desired type
On Oct 29, 2:02 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
On 29 Okt, 11:00, James Kanze <james.ka...@gmail.com> wrote:
...
Compiled with "cl /EHs /O2 timefmt.cc". On my local disk here,
I get:
text: 90 sec.
cooked: 31 sec.
raw: 9 sec.
The last is, of course, not significant, except that it is
very small. (I can't run it on the networked disk, where
any real data would normally go, because it would use too
much network bandwidth, possibly interfering with others.
Suffice it to say that the networked disk is about 5 or more
times slower, so the relative differences would be reduced
by that amount.) I'm not sure what's different in the code
above (or the environment---I suspect that the disk
bandwidth is higher here, since I'm on a professional PC,
and not a "home computer") compared to my tests at home
(under Windows); at home, there was absolutely no difference
in the times for raw and cooked. (Cooked is, of course, XDR
format, at least on a machine like the PC, which uses IEEE
floating point.)
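(The test program itself isn't reproduced here, but roughly, the
three cases have the shape of the sketch below; the function names
are my own invention, not anything from the actual timefmt.cc.)

    #include <cstring>
    #include <ostream>
    #include <vector>

    // Sketch only: the general shape of the three cases, not the
    // code actually measured.  Assumes v is not empty.

    // "text": formatted output, one value per line.
    void writeText(std::ostream& os, std::vector<double> const& v)
    {
        for (std::size_t i = 0; i != v.size(); ++i) {
            os << v[i] << '\n';
        }
    }

    // "raw": dump the native in-memory representation as is.
    void writeRaw(std::ostream& os, std::vector<double> const& v)
    {
        os.write(reinterpret_cast<char const*>(&v[0]),
                 v.size() * sizeof(double));
    }

    // "cooked": XDR-style; assumes IEEE 754 doubles and a 64 bit
    // unsigned long long, and writes the bit pattern MSB first, so
    // the file reads back identically on any such machine.
    void writeCooked(std::ostream& os, std::vector<double> const& v)
    {
        for (std::size_t i = 0; i != v.size(); ++i) {
            unsigned long long bits;
            std::memcpy(&bits, &v[i], sizeof bits);
            for (int shift = 56; shift >= 0; shift -= 8) {
                os.put(static_cast<char>((bits >> shift) & 0xFF));
            }
        }
    }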
Hmm.... so everything was done on your local disc? Which means
one would expect that disk I/O delays are proportional to file
sizes?
More or less. There are also caching effects, which I've not
tried to mask or control, which means that the results should be
taken with a grain of salt. More generally, there are a lot of
variables involved, and I've not made any attempts to control
any of them, which probably explains the differences I'm seeing
from one machine to the next.
If so, the raw/cooked binary formats are a bit confusing.
According to this page,
http://publib.boulder.ibm.com/infocenter/systems//index.jsp?topic=/co...
the XDR data type format uses "the IEEE standard" (I can find
no mention of exactly *which* IEEE standard...) to encode both
single-precision and double-precision floating point numbers.
IF "the IEEE standard" happens to mean "IEEE 754" there is a
chance that an optimizing compiler might deduce that re-coding
numbers on IEEE 754 format to another number on IEEE 754
format essentially is a No-Op.
I'm not sure what you're referring to. My "cooked" format is a
simplified, non-portable implementation of XDR---non portable
because it only works on machines which have 64 bit long longs and
use IEEE floating point.
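(If one wanted those two assumptions spelled out in code, something
along these lines would do; this is just an illustration, not a part
of the implementation being timed.)

    #include <climits>
    #include <limits>

    // The assumptions behind the "cooked" format: IEEE 754 (IEC 559)
    // doubles and a 64 bit long long.  (static_assert is C++11; in
    // older code, any of the usual compile time assertion tricks
    // would do the same job.)
    static_assert(std::numeric_limits<double>::is_iec559,
                  "cooked format assumes IEEE 754 doubles");
    static_assert(sizeof(long long) * CHAR_BIT == 64,
                  "cooked format assumes 64 bit long long");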
Even if XDR uses some other format than IEEE 754, your numbers
show one significant effect:
1) Double-precision XDR is of the same size as double-precision
IEEE 754 (64 bits / number).
2) Handling XDR takes significantly longer than handling native
binary formats.
Again, that depends on the machine. On my tests at home, it
didn't. I've not had the occasion to determine where the
difference lies.
Since you run the test with the same amounts of data on the
same local disk with the same delay factors,
I don't know whether the delay factor is the same. A lot
depends on how the system caches disk accesses. A more
significant test would use synchronized writing, but
synchronized at what point?
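(On a POSIX system, forcing the data out might look something like
the sketch below; where, and how often, to call it is precisely the
open question.)

    #include <cstdio>
    #include <unistd.h>     // fileno, fsync: POSIX, not standard C++

    // Sketch only.  Whether to do this per value, per block, or only
    // once at the end is the "synchronized at what point" problem.
    void flushToDisk(std::FILE* fp)
    {
        std::fflush(fp);        // user space buffer -> system
        fsync(fileno(fp));      // system cache -> physical disk
    }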
this factor of ~4 in the time spent on handling XDR data must
be explained by something other than mere disk IO.
*IF* there is no optimization, *AND* disk accesses cost nothing,
then a factor of about 4 sounds about right.
The obvious suspect is the extra manipulations and recoding of
XDR data. Where native-format binary IO only needs to perform
a memcpy from the file buffer to the destination, the XDR data
first needs to be decoded to an intermediate format, and then
re-encoded to the native binary format before the result can
be piped on to the destination.
The same happens - but on a larger scale - when dealing with
text-based formats:
1) Verify that the next sequence of characters represents a
valid number format
2) Decide how many glyphs need to be considered for decoding
3) Decode text characters to digits
4) Scale according to digit placement in number
5) Repeat for exponent
6) Do the math to compute the number
That's input, not output. Input is significantly harder for
text, since it has to be able to detect errors. For XDR, the
difference between input and output probably isn't significant,
since the only error that you can really detect is an end of
file in the middle of a value.
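(As a sketch of what that means in code, and not anything from the
program actually measured: the raw read is a single block read, the
cooked read can only notice a premature end of file, and the text
read gets full format checking from the standard extractor.)

    #include <cstring>
    #include <istream>
    #include <vector>

    // "raw": the bytes in the file are already the native
    // representation; a block read (essentially a memcpy from the
    // file buffer) is all that is needed.
    bool readRaw(std::istream& is, double* dest, std::size_t n)
    {
        is.read(reinterpret_cast<char*>(dest),
                static_cast<std::streamsize>(n * sizeof(double)));
        return static_cast<std::size_t>(is.gcount())
            == n * sizeof(double);
    }

    // "cooked": reassemble the 64 bit pattern byte by byte, then
    // reinterpret it as a double (assumes IEEE 754 doubles and a 64
    // bit unsigned long long).  The only detectable error is an end
    // of file in the middle of a value.
    bool readCooked(std::istream& is, double& dest)
    {
        unsigned long long bits = 0;
        for (int i = 0; i < 8; ++i) {
            int c = is.get();
            if (c == std::char_traits<char>::eof()) {
                return false;
            }
            bits = (bits << 8) | static_cast<unsigned char>(c);
        }
        std::memcpy(&dest, &bits, sizeof dest);
        return true;
    }

    // "text": operator>> does all of the steps in the list above
    // (validation, digit decoding, scaling, exponent handling), and
    // sets failbit if the characters don't form a valid number.
    bool readText(std::istream& is, std::vector<double>& dest)
    {
        double d;
        while (is >> d) {
            dest.push_back(d);
        }
        return is.eof();    // failbit without eofbit: format error
    }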
True, this takes insignificant amounts of time when compared
to disk IO, but unless you use a multi-threaded system where one
thread reads from disk and another thread converts the formats
while one waits for the next batch of data to arrive from the
disk, one has to do all of this sequentially in addition to
waiting for disk IO.
Nah, I still think that any additional non-trivial handling of
the data will impact IO times, at least in single-threaded
environments.
You can always use asynchronous IO:-). And what if your
implementation of filebuf uses memory mapped files?
The issues are extremely complex, and can't easily be
summarized. About the most you can say is that using text I/O
won't increase the time more than about a factor of 10, and may
increase it significantly less. (I wish I could run the tests
on the drives we usually use---I suspect that the difference
between text and binary would be close to negligible, because of
the significantly lower data transfer rates.)
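(For what it's worth, a POSIX sketch of the memory mapped case, just
to illustrate why the question matters; the function is mine, purely
for illustration.)

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>

    // Sketch only.  Map the whole file and let the system page it in
    // on demand; "reading" then involves no explicit copy at all,
    // which is one more reason measurements like these are hard to
    // interpret.  (The caller should munmap() the region when done.)
    double const* mapDoubles(char const* name, std::size_t& count)
    {
        int fd = open(name, O_RDONLY);
        if (fd < 0) return 0;
        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return 0; }
        void* p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);              // the mapping outlives the descriptor
        if (p == MAP_FAILED) return 0;
        count = static_cast<std::size_t>(st.st_size) / sizeof(double);
        return static_cast<double const*>(p);
    }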
--
James Kanze