Re: std iostreams design question, why not like java stream wrappers?
On Aug 27, 5:07 am, Jerry Coffin <jerryvcof...@yahoo.com> wrote:
> In article <fdb1cf7f-9851-49e1-90a4-7adb771fdad2@o9g2000prg.googlegroups.com>,
> joshuamaur...@gmail.com says...
> > I've always found the C++ std iostreams interface to be
> > convoluted, not well documented, and most non-standard uses
> > to be far out of the reach of novice C++ programmers, and
> > dare I say most competent C++ programmers. (When's the last
> > time you've done anything with facets, locales, etc.?)
> Last week, though I realize I'm somewhat unusual in that respect.
It's interesting that he complains of iostream, and then cites
facets. The base design of iostream (pre-standard, if you
prefer) is actually quite clean (although with lousy naming
conventions and very limited error handling); the facets stuff
ranks as some of the most poorly designed software I've seen,
however, and the way it is integrated into iostream is pretty
bad.
> > I've never been a fan of parallel duplicate class
> > hierarchies. It's a huge design smell to me. The C++
> > standard streams have this design smell. They have ifstream
> > and its associated file streambuf, stringstream and its
> > associated string streambuf, etc. The design smell tends to
> > indicate duplicated code and an overly complex set of
> > classes. If each class appears as a pair, why have two
> > separate hierarchies?
> In this case, there's very little (if any) duplicated code.
> For the most part, what you have is a set of stream buffer
> classes, and a single iostream class (which is, itself,
> derived from istream, ostream, ios_base, and so on -- but we
> probably don't need to get into that part at the moment).
> The other iostream classes (stringstreams, fstreams, etc.) are
> just puny adapter classes. They add a function or two to the
> iostream to let you easily pass suitable parameters from the
> client code to the underlying buffer class, without having to
> create a stream buffer in one step and then attach it to a
> formatter in a separate step.
The important point is that they are just convenience classes;
they make the most frequent cases easier. The separation of
formatting from data sink/source is IMHO an essential concept,
however, and any modern IO design must recognize this.
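To make that two-step version concrete, here is roughly what
ofstream saves you from writing. A minimal sketch only; the file
name is invented:

    #include <fstream>
    #include <ostream>

    int main()
    {
        std::filebuf buf;                    // the data sink
        buf.open("log.txt", std::ios::out);  // invented file name
        std::ostream out(&buf);              // the formatter, attached to it
        out << "value: " << 42 << '\n';      // the usual formatting machinery
    }                                        // ~filebuf closes the file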
> > Also, I've always liked C++ templates as compile-time
> > polymorphism. I would think it natural to do something like
> > create a stream which writes to a file, then put a buffering
> > wrapper over that, then put a formatter over it to change
> > '\n' to the system's native line ending, then put a
> > formatter over that whose constructor takes an encoding (e.g.
> > UTF-8, ASCII, etc.) and whose operator<< functions take your
> > Unicode string and convert it to the encoding passed to the
> > constructor. The current std streams allow you to do this
> > (sort of), but it's much more complicated than it needs
> > to be, and it's done with runtime polymorphism, not the
> > compile-time polymorphism of templates, so it's potentially
> > much slower.
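As a rough sketch, the kind of compile-time layering being
proposed above might look like the following. This is not how the
standard library does it, and all of the names here are invented;
each layer is a template parameterized on the layer below it, so
every call resolves statically:

    #include <cstdio>
    #include <string>

    struct FileSink {
        std::FILE* f;
        void write(char c) { std::fputc(c, f); }
    };

    template <typename Lower>
    struct CrLfFilter {               // translates '\n' to "\r\n", say
        Lower lower;
        void write(char c) {
            if (c == '\n') lower.write('\r');
            lower.write(c);
        }
    };

    template <typename Lower>
    struct Writer {                   // the formatting layer
        Lower lower;
        Writer& operator<<(std::string const& s) {
            for (char c : s) lower.write(c);
            return *this;
        }
    };

    int main()
    {
        Writer<CrLfFilter<FileSink>> out{{{stdout}}};
        out << "hello\n";             // no virtual dispatch in the chain
    }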
> That's actually fairly similar to how iostreams really do
> work. There are a _few_ differences, but most of them
> really are fairly minor.
> First, the designers (mostly rightly, IMO) didn't bother with
> having two classes, one for unbuffered and another for
> buffered access to a stream. Unbuffered access to a stream
> just isn't common enough to justify a separate class for this
> purpose (at least IME).
I'd guess that 99% of the streambufs I write use unbuffered
access. About the only time you want buffering is when you're
going to an external source or sink, like a file (filebuf, or a
custom memorybuf or socketbuf).
The reason it's a single class is far more fundamental. Back in
the 1980s, when the concept was being developed, actually
calling a virtual function for each character really was too
expensive at run time to be acceptable; the public interface to
streambuf is typically implemented as inline functions, and the
virtual call only occurs when there is nothing in the buffer.
Today, given modern machines, I think I'd separate the two, as
in Java. But it's not a big deal; you can more or less ignore
the buffering for output, and some sort of buffering is always
necessary for input anyway, at least if you want to provide a
peek function (named sgetc in streambuf; as I said, the naming
conventions were horrible).
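A minimal unbuffered streambuf of the sort described might look
like this. The class name is invented, and the sink (stderr) is
purely for illustration; with no buffer set up, every character
goes through the virtual overflow():

    #include <cstdio>
    #include <ostream>
    #include <streambuf>

    class StderrBuf : public std::streambuf {
    protected:
        int_type overflow(int_type c) override {
            if (traits_type::eq_int_type(c, traits_type::eof()))
                return traits_type::not_eof(c);  // flush request: nothing to do
            std::fputc(traits_type::to_char_type(c), stderr);
            return c;
        }
    };

    int main()
    {
        StderrBuf buf;
        std::ostream out(&buf);   // all the usual formatting works on top
        out << "pi is about " << 3.14159 << '\n';
    }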
> Formatting is divided between a locale and an iostream. A
> locale contains all the details about things like how to
> format numbers (including what characters to use for digit
> grouping and such). The iostream mostly keeps track of flags
> (e.g. set by manipulators) to decide what locale to use, and
> how to use it.
The fact that some of the flags are permanent, and others not,
can cause some confusion. More generally, one would like some
means of "scoping" use of formatting options, but I can't think
of a good solution. (In the meantime, explicit RAII isn't that
difficult.)
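The explicit RAII is simple enough to sketch. The class name is
invented (Boost ships similar savers in boost/io/ios_state.hpp):
save the flags on entry, restore them on scope exit.

    #include <iostream>

    class FlagsSaver {
        std::ios_base&          stream;
        std::ios_base::fmtflags saved;
    public:
        explicit FlagsSaver(std::ios_base& s)
            : stream(s), saved(s.flags()) {}
        ~FlagsSaver() { stream.flags(saved); }
    };

    int main()
    {
        {
            FlagsSaver guard(std::cout);
            std::cout << std::hex << std::showbase << 255 << '\n';  // 0xff
        }
        std::cout << 255 << '\n';  // decimal again: flags restored on exit
    }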
> > Also, the std streams' internationalization support is at
> > best piss-poor. The existence of locales and their meanings
> > are implementation defined. One cannot rely upon any of the
> > C++ standard locale + facet stuff for a portable program.
> Yes and no. The only piece that's implementation defined is
> exactly which locales will exist (and what name will be given
> to each).
> An awful lot of programs can get by quite nicely with just
> using whatever locale the user wants, and C++ makes it pretty
> easy to access that one -- an empty name gets it for you.
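(Concretely, "an empty name gets it for you" is just the
following; whether the digits actually come out grouped depends
on the user's settings:

    #include <iostream>
    #include <locale>

    int main()
    {
        std::locale user("");          // the user's preferred locale
        std::cout.imbue(user);
        std::cout << 1234567 << '\n';  // grouped per the user's conventions
    }
)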
Yes and no. It's not an iostream problem, but I use UTF-8
internally, and I've had to implement all of my own isalpha,
etc. This really belongs in the standard. (Back when C was
defined, limiting support to single-byte encodings was quite
comprehensible. But even in the 1990s, it was clear that
functions like toupper couldn't provide a character-to-character
mapping, and multibyte encodings were common.)
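The sort of thing one ends up writing by hand, sketched below:
decode a UTF-8 sequence to a code point, then classify the code
point. The function names are invented, and a real isAlpha would
consult the Unicode tables; the ASCII-only test here just shows
the shape of the interface.

    #include <cstdint>

    // Decode one UTF-8 sequence starting at p, advancing p.
    // (No error handling, to keep the sketch short.)
    std::uint32_t decodeUtf8(char const*& p)
    {
        unsigned char c = static_cast<unsigned char>(*p++);
        if (c < 0x80) return c;                     // single byte
        int extra = c >= 0xF0 ? 3 : c >= 0xE0 ? 2 : 1;
        std::uint32_t cp = c & (0x3F >> extra);     // bits from the lead byte
        while (extra-- > 0)
            cp = (cp << 6) | (static_cast<unsigned char>(*p++) & 0x3F);
        return cp;
    }

    bool isAlpha(std::uint32_t cp)
    {
        // Placeholder: the real version keys off the Unicode database.
        return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
    }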
> > It's also entirely convoluted and complex, and doesn't
> > support simple things like changing from one encoding to
> > another.
> I beg your pardon?
The issue is complex, and he's at least partially right. But
part of the problem is inherent: logically, the encoding is a
separate issue from the locale (which is concerned with things
like whether the decimal separator is a dot or a comma), but
practically, at least with single-byte encodings, things like
toupper or isdigit depend on both. If you're dealing with a
stream, the solution I use is to imbue the stream itself with
the locale you're interested in, then (the order is important)
to imbue the streambuf with the correct locale for the
encoding; this is especially true if the encoding isn't known
until part of the file has been read (the usual case, in my
experience). I find this more logical than creating a new
locale on the fly (although in principle, that should also work).
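In code, that order looks like this. (The reason the order
matters: basic_ios::imbue also imbues the streambuf, so imbuing
the stream second would clobber the encoding locale. The
function name is invented, and which locales exist is
implementation-defined.)

    #include <fstream>
    #include <locale>

    void imbueForEncoding(std::wifstream& in,
                          std::locale const& encodingLocale)
    {
        in.imbue(std::locale(""));    // stream: formatting (also hits rdbuf)
        // ... read enough of the file to determine its encoding ...
        in.rdbuf()->pubimbue(encodingLocale);  // streambuf: code conversion
    }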
> > Now, in the standard committee's defense,
> > internationalization is hard (tm). However, I wish they had
> > not tried at all rather than clutter up a good standard
> > library with nearly useless features like locales and
> > facets. Also, seriously, wchar_t's size is implementation
> > defined? Why even bother?
> At the time, the ISO 10646 camp figured wide characters
> required 32 bits. The Unicode camp still thought UCS-2 would
> do the job. Eventually Unicode decided 32 bits was really
> necessary too, but a number of major vendors were still
> thinking in terms of UCS-2 at the time.
> At least they didn't do like Java and decree that wide
> characters were, and would always remain, 16 bits. A C++
> implementation can get things right or wrong, but a Java
> implementation is stuck with being wrong.
:-). In practice, there's nothing "wrong" with the Java
solution (UTF-16). Nor with either of the two widespread C++
solutions. Or with my solution of using UTF-8 and char. What
might be considered "wrong" is imposing one, and only one, on
all code. But I don't know of any system which offers both
UTF-16 and UTF-32; Java imposes one "right" solution, whereas
C++ allows the implementation to choose (guess?) which solution
is right for its customers.
Of course, the real reason why C++ is so open with regard to
what an implementation can do with wchar_t is because C is. And
the reason C is so open is because when C was being normalized,
no one really knew which encodings would end up being the most
widespread; Unicode hadn't really become the standard back
then.
--
James Kanze (GABI Software)             email: james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34