Re: std iostreams design question, why not like java stream wrappers?
On Aug 28, 12:49 am, Jerry Coffin <jerryvcof...@yahoo.com> wrote:
In article <4b281a0e-4957-477f-a3a5-
a67de8159...@b15g2000yqd.googlegroups.com>, james.ka...@gmail.com
says...
On Aug 27, 5:07 am, Jerry Coffin <jerryvcof...@yahoo.com> wrote:
In article <fdb1cf7f-9851-49e1-90a4-7adb771fdad2
[ ... ]
It's also entirely convoluted and complex, and doesn't
support simple things like changing from one encoding to
another.
I beg your pardon?
The issue is complex, and he's at least partially right.
Yes -- I was thinking in terms of the supposed inability to
change from one encoding to another, which most certainly IS
possible. Admittedly (and as you've pointed out) there's a
bit of a problem with changing encoding when buffering is
involved.
I'd note that this is really just an instance of a much larger
problem though. Buffering is intended to decouple the actions
on the two sides of the buffer, and it does that quite well.
In this case, however, we care about the state on the "far"
side of the buffer -- exactly what the buffer is supposed to
hide from us.
Certainly, and I don't know of a system which really supports it
fully. In general, you can pass from a one-to-one (byte)
encoding to something more complex, but once you've started with
something more complex, you can't go back. About the only
difference between C++ and Java here is that C++ documents this
fact. (Or else... IIRC, Java does the encoding after the
buffering, so the problems should be less.)
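For what it's worth, the usual mechanism for selecting the
external encoding is to imbue a locale whose codecvt facet does
the conversion, and to do it before any I/O has taken place. A
minimal sketch (the file name and the locale name "en_US.UTF-8"
are only assumptions---locale names are platform dependent):

    #include <fstream>
    #include <iostream>
    #include <locale>
    #include <stdexcept>
    #include <string>

    int main()
    {
        std::wifstream in("data.txt");      // hypothetical input file

        // The codecvt facet of the imbued locale converts the
        // external byte sequence into wchar_t.
        try {
            in.imbue(std::locale("en_US.UTF-8"));
        } catch (std::runtime_error const&) {
            std::cerr << "no UTF-8 locale available here\n";
            return 1;
        }

        // A filebuf is only required to honor a change of locale
        // before any characters have been read (or, for state
        // dependent encodings, when positioned at the start of
        // the file)---which is exactly the buffering problem
        // mentioned above.
        std::wstring line;
        while (std::getline(in, line)) {
            // process line...
        }
    }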
[ ... ]
At least they didn't do like Java and decree that wide
characters were, and would always remain, 16 bits. A C++
implementation can get things right or wrong, but a Java
implementation is stuck with being wrong.
:-). In practice, there's nothing "wrong" with the Java
solution (UTF-16).
Sort of true -- it's certainly true that UTF-16 (like UTF-8)
is a much "nicer" encoding than things like the old shift-JIS.
At least it's easy to recognize when you're dealing with a
code point that's encoded as two (or more) words.
And it's trivial to resynchronize if you get lost.
At the same time, you do still need to deal with the
possibility that a single logical character will map to more
than one 16-bit item, which keeps most internal processing
from being as clean and simple as you'd like.
But if you're doing any serious text processing, that's true for
UTF-32 as well. \u0071\u0302 is a single character (a q with a
circumflex accent), even though it takes two code points to
represent. And if you're not concerned down to that level,
UTF-16 will usually suffice.
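To make that concrete, a rough sketch of the sort of count I
mean (using the C++0x char32_t for clarity; it's nothing like a
full grapheme segmenter---it only knows about the combining
diacritical marks block, U+0300 to U+036F, and the function
names are just for illustration):

    #include <cstddef>
    #include <iostream>

    // Very rough approximation: a code point in the combining
    // diacritical marks block extends the preceding character.
    // Real segmentation needs the Unicode tables (UAX #29).
    bool isCombining(char32_t cp)
    {
        return cp >= 0x0300 && cp <= 0x036F;
    }

    std::size_t characterCount(char32_t const* s, std::size_t n)
    {
        std::size_t count = 0;
        for (std::size_t i = 0; i != n; ++i) {
            if (!isCombining(s[i]))
                ++count;
        }
        return count;
    }

    int main()
    {
        // 'q' followed by a combining circumflex: two code
        // points, but one character as far as the user is
        // concerned.
        char32_t const q[] = { U'q', 0x0302 };
        std::cout << "code points: " << 2
                  << ", characters: " << characterCount(q, 2) << '\n';
    }

So even with UTF-32, "one unit per character" only holds as long
as you can ignore this level entirely.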
But speaking from experience... Handling multibyte characters
isn't that difficult, and I find UTF-8 the most appropriate for
most of what I do.
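In case it's useful, the kind of thing I mean: practically all
of the "multibyte" logic in UTF-8 is the mapping from the lead
byte to the sequence length. A sketch (the function names are my
own, and there's no validation of the continuation bytes):

    #include <cstddef>

    // Length in bytes of a UTF-8 sequence, determined from its
    // lead byte. Returns 0 for a byte which cannot start a
    // sequence (a continuation byte or an invalid lead byte).
    std::size_t utf8SequenceLength(unsigned char lead)
    {
        if (lead < 0x80) return 1;      // ASCII
        if (lead < 0xC2) return 0;      // continuation or overlong lead
        if (lead < 0xE0) return 2;      // 110xxxxx
        if (lead < 0xF0) return 3;      // 1110xxxx
        if (lead < 0xF5) return 4;      // 11110xxx, up to U+10FFFF
        return 0;                       // invalid
    }

    // Counting code points is then just skipping the right
    // number of bytes; an invalid byte counts as one (bad)
    // character.
    std::size_t utf8CodePointCount(char const* s, std::size_t n)
    {
        std::size_t count = 0;
        std::size_t i = 0;
        while (i < n) {
            std::size_t len
                = utf8SequenceLength(static_cast<unsigned char>(s[i]));
            i += (len == 0 ? 1 : len);
            ++count;
        }
        return count;
    }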
Then again, at least in the Java code I've seen, the internal
code is kept clean and simple -- which is fine until somebody
feeds it the wrong data, and code that's been "working" for
years suddenly fails completely...
Yes and no. Where I live, there are very strong arguments to go
beyond ISO 8859-1---the Euro character, the oe ligature, etc.,
not to mention supporting foreign names. But everything must be
in a Latin script; anything not in a Latin script is "wrong data".
In this case (and it's a frequent one in Europe), whether the
offending code unit is part of a surrogate pair or a CJK character
really doesn't matter---it's wrong data, and must be detected as
such.
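And detecting it doesn't take much code: decode to code points,
then check against whatever whitelist the application actually
needs. A rough sketch (the block ranges here are only an
example---the acceptable set is obviously application specific):

    #include <cstddef>
    #include <string>

    // "Wrong data" check for an application which only handles
    // Latin scripts: accept ASCII, Latin-1 Supplement and the
    // Latin Extended blocks (which cover the oe ligature), plus
    // the Euro sign; everything else---CJK, Cyrillic, stray
    // surrogate values, whatever---is rejected.
    bool isAcceptable(char32_t cp)
    {
        return (cp >= 0x0020 && cp <= 0x007E)   // ASCII
            || cp == 0x0009 || cp == 0x000A     // tab, newline
            || (cp >= 0x00A0 && cp <= 0x00FF)   // Latin-1 Supplement
            || (cp >= 0x0100 && cp <= 0x024F)   // Latin Extended-A/B
            || (cp >= 0x1E00 && cp <= 0x1EFF)   // Latin Ext. Additional
            || cp == 0x20AC;                    // Euro sign
    }

    bool isAcceptable(std::u32string const& s)
    {
        for (std::size_t i = 0; i != s.size(); ++i) {
            if (!isAcceptable(s[i]))
                return false;
        }
        return true;
    }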
Nor with either of the two widespread C++ solutions. Or
with my solution of using UTF-8 and char. What might be
considered "wrong" is imposing one, and only one, on all
code. But I don't know of any system which offers both
UTF-16 and UTF-32; Java imposes one "right" solution,
whereas C++ allows the implementation to choose (guess?)
which solution is right for its customers.
IMO, UTF-16 causes the biggest problem. The problem arises
when the "length" of a string is ambiguous -- the number of
characters differs from the number of units of storage.
But that's just as true with UTF-8 (which I regularly use), and
in a very real sense, with UTF-32 as well (because of combining
diacritical marks).
With UTF-8, those differences are large enough and common
enough that a mistake in this area will cause visible problems
almost immediately.
With UCS-4/UTF-32, there's never a difference, so no problem
ever arises.
With UTF-16, however, there's only rarely a difference -- and
even when there is, it's often small enough that if (for
example) your memory manager rounds up memory allocation
sizes, you can use buggy code almost indefinitely without the
bug becoming apparent. Then (Murphy still being in charge),
exactly when it's most crucial for it to work, the code fails
_completely_, but duplicating the problem is next to
impossible...
OK. I can almost see that point. Almost, because I'm still not
sure from where you're getting the length value for the
allocator. If you have a routine for counting characters that
is intelligent enough to handle surrogates correctly (where two
16-bit code units encode a single character), then it might be
intelligent enough to handle combining diacritical marks
correctly as well, and the same problem will occur with UTF-32.
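Concretely, the surrogate-aware count is only a couple of lines
(a sketch, using the C++0x char16_t, and not diagnosing unpaired
surrogates):

    #include <cstddef>

    // Number of code points in a UTF-16 sequence: count every
    // unit which is not a trailing surrogate (0xDC00-0xDFFF), so
    // each surrogate pair is counted once, at its lead unit.
    std::size_t utf16CodePointCount(char16_t const* s, std::size_t n)
    {
        std::size_t count = 0;
        for (std::size_t i = 0; i != n; ++i) {
            if (s[i] < 0xDC00 || s[i] > 0xDFFF)
                ++count;
        }
        return count;
    }

And the additional test needed to skip combining marks is exactly
the same whether the units are 16 or 32 bits wide, which is why
UTF-32 doesn't really buy you anything at this level.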
Of course, the real reason why C++ is so open with regards
to what an implementation can do with wchar_t is because C
is. And the reason C is so open is because when C was being
normalized, no one really knew what encodings would end up
being the most widespread; Unicode hadn't really become the
standard back then.
At the time, Unicode was still _competing_ with ISO 10646
rather than cooperating with it.
The Unicode Consortium was incorporated in January 1991, and
the C committee adopted wchar_t sometime in the late
1980s---certainly before 1988, when the final committee draft
was voted on. And ISO assigns standard numbers sequentially,
which means that ISO 10646 was adopted after ISO 9899 (the C
standard). At the
time, I think that while it was generally acknowledged that
characters should be more than 8 bits, there was absolutely no
consensus as to what they should be.
I think there's more involved though: C++ (like C) embodies a
general attitude toward allowing (and even embracing)
variation. While I think it has moderated in recent years, for
a while (and still, though to a lesser degree) there was a
certain amount of pride taken in leaving the languages loosely
enough defined that they could be implemented on almost any
machine (past or future), including some for which there was no
realistic hope of anybody actually porting an implementation.
I think this is a good point---an essential point in some ways.
Don't formally standardize until you know what the correct
solution is. Today (2009), I think it's safe to say that the
"correct" solution is to support all of the Uncode encoding
formats (UTF-8, UTF-16 and UTF-32), and let the user choose; if
I were designing a language from scratch today, that's what I'd
do. Today, however, both Java and C++ have existing code to
deal with, which complicates the issues---Java has an additional
problem in that evolutions of the language must still run on the
original JVM. (But Java could define a new character type for
UTF-32 at the language level, using 'int' to implement it at the
JVM level. Except that some knowledge of the class String is
built into the language.)
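To be a little more concrete about "let the user choose":
roughly, one distinct string type per encoding form, with
explicit conversions between them, so that the choice is visible
in the interface and accidental mixing doesn't compile. A sketch
of the shape of such an interface (the C++0x draft already has
char16_t and char32_t, which is most of what's needed at the
type level; the names and conversion functions here are
hypothetical, declarations only):

    #include <string>

    typedef std::string    Utf8String;   // char,     UTF-8  code units
    typedef std::u16string Utf16String;  // char16_t, UTF-16 code units
    typedef std::u32string Utf32String;  // char32_t, UTF-32 code units

    // Explicit, named conversions between the three forms
    // (declarations only---just the shape of the interface).
    Utf16String toUtf16(Utf8String const&);
    Utf32String toUtf32(Utf8String const&);
    Utf8String  toUtf8 (Utf16String const&);
    // ...and so on for the remaining combinations.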
FWIW: I'm not really convinced that we know enough about what is
"correct" even today to dare build it into the language (which
means casting it in stone). For the moment, I think that the
C++ solution is about the most we dare do.
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34