Re: Question about wcout and its underlying C functions

From: "James Kanze" <james.kanze@gmail.com>
Newsgroups: comp.lang.c++.moderated
Date: 18 Dec 2006 11:50:34 -0500
Message-ID: <1166437958.143598.57560@f1g2000cwa.googlegroups.com>

iwongu wrote:

I have a question about std::wcout and its underlying C functions.

According to C99, a stream has an orientation, either byte-oriented
or wide-oriented, which is determined by the first C function used on
that stream. Functions specific to one orientation should not be
applied to a stream that has the other orientation.

But that doesn't apply to C++, where you have two different
stream types, a wide character one, and a narrow character one.
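
For reference, the C99 orientation rule mentioned in the question can be observed from C++ through std::fwide, which only queries the orientation when passed a mode of 0. This is a minimal sketch, assuming a hosted implementation where std::tmpfile() succeeds; the helper names orientation and orientation_after_first_write are made up for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <cwchar>
#include <string>

// Hypothetical helper: reports the orientation of a C stream.
// A mode of 0 only queries; it does not change the orientation.
std::string orientation(std::FILE* f) {
    int o = std::fwide(f, 0);
    return o > 0 ? "wide" : (o < 0 ? "byte" : "none");
}

// Hypothetical helper: opens a fresh temporary stream, performs a single
// narrow or wide write, and reports the orientation before and after.
std::string orientation_after_first_write(bool wide_write) {
    std::FILE* f = std::tmpfile();
    if (f == nullptr) return "unavailable";  // no temporary file on this system
    std::string before = orientation(f);
    if (wide_write) {
        std::fputws(L"x", f);   // first operation is a wide one
    } else {
        std::fputs("x", f);     // first operation is a byte one
    }
    std::string after = orientation(f);
    std::fclose(f);
    return before + "->" + after;
}
```

A fresh stream starts with no orientation, and the first read or write fixes it. None of this says anything about how a given C++ library implements wcout on top of (or independently of) stdout.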

And, the following code is working; (gcc 3.4.3 in solaris 10 x86)

wcout << "abcd" << endl;
wcout << "efgh" << endl; // the first four bytes are Korean
// characters.

In what encoding? The basic character set doesn't include any
Korean characters, so it is implementation defined what the
implementation actually does. On my installation of g++ under
Solaris, it treats them as ISO 8859-1. But I suspect that this
is more or less a random result, depending on other things in my
environment, and not something intentional; my impression is
that g++ just passes them through "as is", and it looks like
8859-1 because the fonts I have active use that encoding (and
LC_CTYPE is set to a locale with this encoding).
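
To illustrate why narrow literals can come through byte for byte: inserting a const char* into a wide stream widens each char individually with the stream's widen(), without decoding multibyte sequences. A minimal sketch, using a std::wostringstream in place of wcout so that the result can be inspected; the helper name widen_via_stream is made up for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: inserts a narrow string into a wide stream and
// returns what the stream actually stored.
std::wstring widen_via_stream(const char* narrow) {
    std::wostringstream out;   // stands in for wcout
    out << narrow;             // each char is widened on its own
    return out.str();
}
```

For plain ASCII this yields the same letters as wide characters. What a byte above 0x7F widens to depends on the ctype facet of the imbued locale, which is where the implementation-defined behavior comes in.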

wcout << "ijkl" << endl;

The result is;

abcd
efgh
ijkl

But the following is not.

wcout << L"abcd" << endl;
wcout << L"efgh" << endl;
wcout << L"ijkl" << endl;

The result is;

abcd

No characters are printed after I try to print Korean.

First question: what is the state of the stream after outputting
the Korean characters? Second, what does a hex dump of the two
string literals (narrow character and wide character) look like?
Also, what locale is imbued in the stream? A hex dump of the
bytes the editor puts into the source file might help, too. (A
quick trial with accented French characters showed that my
installation of g++ simply inserted the ISO 8859-1 character codes
generated by my editor into the narrow character string.
Attempting to use them in a wide character string provoked an
error: "error: converting to execution character set: Illegal
byte sequence". Replacing the characters with their universal
character names worked, however, and the resulting wchar_t
contained the correct Unicode.)
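
The hex dumps asked for above could be produced along these lines. This is a sketch only: hex_dump is a made-up helper, and the universal character name \uD55C (a Hangul syllable, chosen arbitrarily) is used in the usage note so that the source file itself stays plain ASCII.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <type_traits>

// Hypothetical helper: returns the code of every element of a narrow or
// wide string in hexadecimal, separated by spaces.
template <typename CharT>
std::string hex_dump(const std::basic_string<CharT>& s) {
    typedef typename std::make_unsigned<CharT>::type UChar;
    std::ostringstream out;
    out << std::hex;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i != 0) out << ' ';
        // Convert through the unsigned type so that negative char values
        // are not sign-extended when printed.
        out << static_cast<unsigned long>(static_cast<UChar>(s[i]));
    }
    return out.str();
}
```

Calling hex_dump(std::string("\uD55C")) and hex_dump(std::wstring(L"\uD55C")) shows what the compiler actually put into each literal. On implementations where wchar_t holds Unicode code points, the wide case gives the single value d55c; the narrow case depends on the execution character set.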

Not that that will necessarily tell us much. The problem here
is that all handling of such codes is implementation defined
and depends on the locale. Theoretically, all implementation-defined
behavior is supposed to be documented. But I've never
been able to find such documentation. Neither for g++ nor for
any other compiler I've used. And without such documentation,
it's hard to say what is going on.

I assumed that wcout might use byte-function (not wide-) to print
narrow characters even though it is wide character string literal.

I assume that it doesn't use FILE* to begin with, so the issue
shouldn't occur.

So the stdout had byte-orientation and the rest are not
printed because they are wide-oriented function. But this
assumption can not explain the last line; L"ijkl".

I tried more code.

wcout << "abcd" << endl;
wcout << L"efgh" << endl;
wcout << "ijkl" << endl;

The result is;

abcd

..

If the stream cannot convert the character being output to
external encoding (as defined by the imbued locale), output
fails. And once output has failed, no further output takes
place until the error has been explicitly cleared. Without
knowing either the imbued locale or the actual characters being
passed into the stream, it's difficult to say what the implementation
should do with it. (If the imbued locale is "C", I rather
suspect that it should reject all non-ASCII characters as an
error.)
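
The "no further output until the error is cleared" behavior can be reproduced without depending on any particular locale. In this minimal sketch a std::wostringstream stands in for wcout, and the conversion failure is simulated by setting badbit by hand; the function name demo_sticky_error is made up for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical demo: writes three lines to a wide stream, with a simulated
// failure before the second one, and returns what the stream received.
std::wstring demo_sticky_error(bool clear_after_failure) {
    std::wostringstream out;               // stands in for wcout
    out << L"abcd\n";                      // succeeds
    out.setstate(std::ios_base::badbit);   // stand-in for a failed conversion
    out << L"efgh\n";                      // ignored: the stream is in an error state
    if (clear_after_failure) {
        out.clear();                       // explicitly reset the error state
    }
    out << L"ijkl\n";                      // written only if the error was cleared
    return out.str();
}
```

Without the call to clear(), only the first line arrives, which matches the output reported above. With a real wcout, checking wcout.fail() after the suspect line would show whether this is what is happening.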

wcout << "abcd" << endl;
wcout << L"efgh" << endl;
cout << "ijkl" << endl; // <-- changed to cout

The result is;

abcd
ijkl

Why would you expect anything else, given the earlier results?
cout is an entirely different stream than wcout, with its own
buffers, error status, etc. Whatever has taken place earlier in
wcout simply doesn't affect it.
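
A small sketch of that independence, again with string streams standing in for wcout and cout and with the failure simulated by hand; the function name narrow_output_after_wide_failure is made up for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical demo: puts a wide stream into an error state, then writes
// to an unrelated narrow stream and returns what the narrow stream received.
std::string narrow_output_after_wide_failure() {
    std::wostringstream wide;               // stands in for wcout
    std::ostringstream narrow;              // stands in for cout
    wide << L"abcd\n";
    wide.setstate(std::ios_base::badbit);   // simulated failure on the wide stream
    wide << L"efgh\n";                      // dropped by the failed stream
    narrow << "ijkl\n";                     // a different object with its own state
    assert(wide.fail() && narrow.good());
    return narrow.str();
}
```

Each stream object carries its own error flags, so the failure on the wide one leaves the narrow one fully usable.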

Is this a bug in compiler?

I don't think so, although there might be a QoI issue concerning
g++'s implementation of localization.

What's the standard-correct output for this?

Implementation defined and locale dependent. The point of view
of the standard is that implementations may vary here, and that
you should read the implementation documentation to know what it
actually does. If you find the implementation documentation for
this, let me know.

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

--
      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated. First time posters: Do this! ]