Re: wcout, wprintf() only print English

From:

James Kanze <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++

Date:

Sun, 24 Feb 2008 07:05:07 -0800 (PST)

Message-ID:

<8cae2a2d-57d5-4ea7-b50d-4901f5049f30@d5g2000hsc.googlegroups.com>

On Feb 24, 12:20 am, Ioannis Vranos
<ivra...@nospam.no.spamfreemail.gr> wrote:

James Kanze wrote:

You're still not telling us a lot of important information.
What is the actual encoding used in the source file, and what
are the bytes actually output? (FWIW: I think g++, and most
other compilers, just pass the bytes through transparently in a
narrow character string. Which means that your second code will
output whatever your editor put in the source file. If you're
using the same encoding everywhere, it will seem to work.)

Note that there isn't really any portable solution, because so
much depends on things the C++ compiler has no control over.
Run the same code in two different xterms, and it can output two
completely different things; just specify a different font
(option -fn) with a different encoding for one of the xterms.
(And of course, it's pretty much par for the course to see one
thing when you cat to the screen, and something else when you
output the same file to the printer.)

I posted a C95 question about this in c.l.c. (C95 being a subset of
C++03) and I got working C95 code. My last message there:

> Ben Bacarisse wrote:

> You need "%ls". This is very important with wprintf since without it
> %s denotes a multi-byte character sequence. printf("%ls\n", input)
> should also work. You need the w version if you want the multi-byte
> conversion of %s or if the format has to be a wchar_t pointer.

I'd forgotten about that aspect. It's been many, many years
since I last used printf et al. But yes, you'll definitely need
a modifier in any printf specifier.
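
As a minimal sketch of the "%ls" point (the helper names narrow_format
and wide_format are made up for illustration, and I'm assuming a C++11
library so that std::snprintf is available), formatting into a buffer
keeps the terminal out of the picture:

```cpp
#include <cstddef>
#include <cstdio>
#include <cwchar>
#include <string>

// Hypothetical helper: pass a wide string through the narrow printf
// family. "%ls" says the argument is a wchar_t const*; it is converted
// to multibyte characters according to the current locale.
std::string narrow_format(wchar_t const* input)
{
    char buffer[256];
    int n = std::snprintf(buffer, sizeof buffer, "%ls", input);
    if (n < 0 || n >= static_cast<int>(sizeof buffer)) {
        return std::string();   // conversion failed or output truncated
    }
    return std::string(buffer, static_cast<std::size_t>(n));
}

// Hypothetical helper: the same thing with the wide family. Here a
// plain "%s" would mean a multibyte char const*, so "%ls" is still
// needed for a wchar_t const* argument.
std::wstring wide_format(wchar_t const* input)
{
    wchar_t buffer[256];
    int n = std::swprintf(buffer, sizeof buffer / sizeof buffer[0],
                          L"%ls", input);
    if (n < 0) {
        return std::wstring();  // conversion failed or buffer too small
    }
    return std::wstring(buffer, static_cast<std::size_t>(n));
}
```

In the default "C" locale, the narrow version can only be counted on
for plain ASCII; for anything else, the conversion depends on the
locale having been set first.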

Perhaps you may help me understand better.

Well, the main thing you have to understand is that there are
many different players in the game, and that each is doing more or
less what it wants, without considering what the others are
doing.

We have the usual char encoding which is implementation
defined (usually ASCII).

The "usual char encoding" for what? One of the problems is
that different tools have different ideas as to what the "usual
char encoding" should be.

Unless you have to deal with mainframes (where EBCDIC still
rules), you can probably count on whatever encoding is being
used for narrow characters to understand ASCII as a subset
(although I'm not at all sure that this is true for the Asian
languages).

wchar_t is wide character encoding, which is the "largest
character set supported by the system", so I suppose Unicode
under Linux and Windows.

wchar_t is implementation defined, and can be just about
anything. On the systems I know, it's UTF-16 for Windows and
AIX, UTF-32 (I think) under Linux, and some pre-Unicode 32 bit
encoding under Solaris. Except that all it really is is a 16 or
32 bit integral type. (On the usual systems. The standard
doesn't make any requirements, and an implementation which
typedef's it to char is conformant.) How the implementation
interprets it (the encoding) may depend on the locale (and I
think recent versions of Solaris have locales which interpret it
as UTF-32, rather than the pre-Unicode encoding).
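
If you want to see what your own implementation does, something like
the following will tell you the size, although not the encoding. (The
function names are mine, for illustration only.)

```cpp
#include <climits>
#include <cwchar>
#include <limits>

// Number of bits in a wchar_t on this implementation (16 or 32 on the
// usual systems, but the standard does not require either).
int wchar_bits()
{
    return static_cast<int>(sizeof(wchar_t) * CHAR_BIT);
}

// True if a single wchar_t can hold any Unicode code point (up to
// 0x10FFFF). This is necessary for a UTF-32 interpretation, but it
// says nothing about which encoding the implementation really uses.
bool wchar_can_hold_any_code_point()
{
    return static_cast<unsigned long>(
               std::numeric_limits<wchar_t>::max()) >= 0x10FFFFul;
}
```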

What exactly is a multi-byte character?

A character which requires several bytes for its encoding.
Very, very succinctly (Haralambous takes about 60 pages to cover
the issues, so I've obviously got to leave something out):

A character is represented by one or more code points.
Probably, all of the characters we're concerned with here can be
represented by a single code point in Unicode, but that's not
always true. And even characters that can be represented by a
single code point (e.g. an o with a circumflex accent) may be
represented by more than one code point (e.g. latin small letter
O, followed by combining accent circumflex), and will be
represented thusly in some canonical representations. A code
point is a numeric value, e.g. 0x0065 (Latin small letter E, in
Unicode) or 0x0394 (Greek capital letter Delta, in Unicode).
Which leaves open how the numeric value is represented. Unicode
code points require at least 21 bits in order to be represented
numerically, but in fact, Unicode defines a certain number of
"transformation formats", specifying how the code points are to
be formatted. The most frequent are UTF-32 (with 32 bits per
element, and one element per code point, always), UTF-16 (BE or
LE), with 16 bits per element, and one or two elements per code
point (but if all you're concerned with is the Latin and the
Greek alphabets, you can consider that it is always one element
per code point as well), and UTF-8, with 8 bit elements, and one
to four elements per code point.
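
To make the element counts concrete, here is a rough sketch of
encoding a single code point in UTF-8 and in UTF-16. The function
names to_utf8 and to_utf16 are my own, not from any library, and the
code only handles one code point at a time:

```cpp
#include <string>
#include <vector>

// Encode one Unicode code point as UTF-8 (one to four bytes).
// Returns an empty string for surrogates and values above 0x10FFFF.
std::string to_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;         // surrogates are not characters
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one Unicode code point as UTF-16 (one or two 16 bit elements).
// Returns an empty vector for surrogates and values above 0x10FFFF.
std::vector<unsigned short> to_utf16(unsigned long cp)
{
    std::vector<unsigned short> out;
    if (cp < 0x10000) {
        if (cp < 0xD800 || cp > 0xDFFF) {
            out.push_back(static_cast<unsigned short>(cp));
        }
    } else if (cp <= 0x10FFFF) {
        unsigned long v = cp - 0x10000;     // 20 bits, split 10/10
        out.push_back(static_cast<unsigned short>(0xD800 + (v >> 10)));
        out.push_back(static_cast<unsigned short>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

With this, 0x0394 (Greek capital letter Delta) comes out as the two
bytes 0xCE 0x94 in UTF-8, but as a single element in UTF-16.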

In all cases of Unicode where there can be more than one element
per code point, the encoding format is defined in such a way
that you can always tell from a single element whether it is a
complete code point, the first element of a multiple element
code point, or a following element of a multiple element code
point. Thus, in UTF-8, byte values 0-0x7F are single element
code points (corresponding in fact to US ASCII), byte values
0x80-0xBF can only be a trailing byte in a multibyte code point,
0xC2-0xF4 can only be the first byte of a multibyte code point,
and values 0xC0, 0xC1, 0xF5-0xFF never occur, since Unicode
stops at 0x10FFFF. (The original UTF-8 encoding format is
actually capable of handling numeric values up to 0x7FFFFFFF;
such values may use the byte values 0xF5-0xFD for the first
byte.)
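
A small sketch of how this property can be used, assuming the current
definition of UTF-8 (RFC 3629), where code points stop at 0x10FFFF and
the first byte of a multibyte sequence is therefore in the range
0xC2-0xF4. The names classify and code_point_start are invented for
the example:

```cpp
#include <cstddef>
#include <string>

enum Utf8ByteKind { single_byte, lead_byte, trail_byte, invalid_byte };

// Classify one byte of UTF-8 from its value alone.
Utf8ByteKind classify(unsigned char b)
{
    if (b <= 0x7F) return single_byte;              // US ASCII
    if (b <= 0xBF) return trail_byte;               // 0x80-0xBF
    if (b >= 0xC2 && b <= 0xF4) return lead_byte;   // starts a sequence
    return invalid_byte;                            // 0xC0, 0xC1, 0xF5-0xFF
}

// Resynchronize: starting from an arbitrary byte index (which must be
// less than s.size()), back up over trailing bytes to the first byte
// of the code point containing that index.
std::size_t code_point_start(std::string const& s, std::size_t index)
{
    while (index > 0
           && classify(static_cast<unsigned char>(s[index])) == trail_byte) {
        --index;
    }
    return index;
}
```

Note that this does no validation: it supposes that the string really
is well formed UTF-8.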

The important point, of course, being that a single code point
may require more than one byte.

Historically, earlier encodings didn't make such a rigorous
distinction between characters and code points, and tended to
define code points directly in terms of the encoding format,
rather than as a numeric value. Also, most of them didn't have
the characteristic that you could tell immediately from the
value of a byte whether it was a first byte or not; in general,
if you just indexed into a string at any arbitrary byte index,
you had no way of "resynchronizing", i.e. finding the nearest
character boundary. Some of the earlier encodings also depended
on stream state, using codes for shift in and shift out to
specify that the numeric values which followed (until the next
shift in or shift out code) were e.g. in the Greek alphabet,
rather than in the Latin one. (Some early data transmission
codes were only five bits, using shift in and shift out to
change from letters to digits/punctuation and vice versa---and
only supporting one case of letters.)

I have to say that I am talking about C95 here, not C99.

>> return 0;
>> }

>> Under Linux:

>> [john@localhost src]$ ./foobar-cpp
>> Test
>> T
>> [john@localhost src]$

>> [john@localhost src]$ ./foobar-cpp
>> =CE=94=CE=BF=CE=BA=CE=B9=CE=BC=CE=B1=CF=83=CF=84=CE=B9=CE=BA=CF=8C
>> =EF=BF=BD
>> [john@localhost src]$

> The above may not be the only problem. In cases like this,
> you need to say what encoding your terminal is using.

You are somewhat correct on this. My terminal encoding was
UTF-8 and I added Greek(ISO-8859-7).

In general: all a program written in C++ can do is output bytes,
which have some numeric value. We suppose a particular
encoding, etc. in the program, but there's no guarantee that
whoever later reads those bytes supposes the same thing, and
there's not much C++ can do about it.
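
When in doubt, look at the bytes themselves rather than at what the
terminal shows. A minimal sketch (hex_dump is just a name I've chosen
here):

```cpp
#include <cstddef>
#include <string>

// Return the bytes of a string as space separated upper case hex
// values, so that they can be inspected independently of whatever
// encoding the terminal or the font supposes.
std::string hex_dump(std::string const& bytes)
{
    static char const digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) {
            out += ' ';
        }
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

A capital Delta written by a program using UTF-8 will show up as
"CE 94"; one written by a program using ISO-8859-7 will be the single
byte "C4". The bytes are what the program controls; how they are
displayed is not.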

(Since you're under Linux, try starting an xterm with a font
using UTF-8, set the locales correctly for it, and create a file
with Greek characters in the name. Then start a different xterm
with a font using ISO-8859-7, set the locales for that, and do
an ls on the directory where you created the file. As you can
see, even without any C++, there can be problems. And there's
nothing C++ can do about it.)

[...]

BTW, how can we define UTF-8 as the locale?

It depends on the implementation, but the Unix conventions
prescribe something along the lines of
<language>[_<country>][.<encoding>], where the language is 2
letter language code, as per ISO 639-1 (in lower case), the
country is the 2 letter country code, as per ISO 3166, in upper
case, and the encoding is something or other. With the optional
parts defaulting to some system defined value if they're not
specified. For historical reasons, most implementations also
support additional names, like "Greek". And of course,
depending on the machine, any given locale may or may not be
installed---typically, if you do an ls of either
/usr/share/locale or /usr/lib/locale, you'll get a list of
supported locales for the machine in question. (On the version
of Linux I'm running here, UTF-8 is the default, and I can't see
it in the locale names. IIRC from the Solaris machine at work,
however, the UTF-8 locales end in .utf8. Also note that there
may be some additional files in this directory.)
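
As a rough sketch of the naming convention (the helpers
make_locale_name and try_set_locale are hypothetical, and whether a
name like "el_GR.UTF-8" is accepted depends entirely on what is
installed on the machine):

```cpp
#include <clocale>
#include <string>

// Build a locale name along the lines of
// <language>[_<country>][.<encoding>]; the optional parts are left
// out if they are empty.
std::string make_locale_name(std::string const& language,
                             std::string const& country = "",
                             std::string const& encoding = "")
{
    std::string name = language;
    if (!country.empty()) {
        name += "_" + country;
    }
    if (!encoding.empty()) {
        name += "." + encoding;
    }
    return name;
}

// Try to make the named locale the global C locale. setlocale returns
// a null pointer if the implementation does not know the name, in
// which case the current locale is left unchanged.
bool try_set_locale(std::string const& name)
{
    return std::setlocale(LC_ALL, name.c_str()) != nullptr;
}
```

Calling try_set_locale(make_locale_name("el", "GR", "UTF-8")) may well
fail; the only names you can really count on are "C" and "" (the
latter meaning whatever the environment specifies).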

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34