Re: STL, UTF8, and CodeCvt

From: "P.J. Plauger" <pjp@dinkumware.com>
Newsgroups: comp.lang.c++.moderated
Date: Tue, 6 Mar 2007 10:56:40 CST
Message-ID: <_bGdnQZ8StMCHHDYnZ2dnUVZ_hqdnZ2d@giganews.com>

"Lourens Veen" <lourens@rainbowdesert.net> wrote in message
news:d3008$45ec8ce6$8259a2fa$1824@news1.tudelft.nl...

James Kanze wrote:

Lourens Veen wrote:

Eugene Gershnik wrote:

Philip wrote:

Does the STL standard currently encompass any terminology to
accommodate UTF-8 (a sort of combination of narrow and wide) or
is the standards committee considering anything along these
lines?

I am not qualified to comment on the standard but UTF-8 is not a
"combination of narrow and wide". It is precisely one more kind of
narrow character. There is no fundamental difference between
Shift-JIS for example and UTF-8. Both encode a given logical
character as a sequence of one or more byte-sized units.

But there _is_ a fundamental difference between encodings that use
the same size code for each character and encodings that use
variable length codes. I think of a UTF-8 string as a wide (UCS-4
or UTF-32) string stored in a compressed format. That seems to be
more natural than looking at it as a narrow (ASCII) string where
some of the characters are really only partial characters.

Natural or not, it doesn't correspond to reality. UTF-8 is
a multibyte encoding; multibyte encodings have been around for
years (at least 40 years), and were officially recognized by the
first version of the C standard.

It's sometimes confusing, of course, because a lot of the C
functions which deal with characters (e.g. the functions in
<ctype.h>) don't really recognize this fact. Chalk it up to
historical reasons.

Just out of interest, how did the designers of C++ end up designing an
std::string that doesn't work with them, then?

For the same reason we didn't make any special provision for Roman
numerals. std::string works quite nicely with UTF-8 text -- you can
store it, search it, copy it about, etc. in many useful ways. The
class is not aware of the inner structure of UTF-8, however, any more
than it is aware of the inner structure of sentences of text. Or
Roman numerals, for that matter.
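
To illustrate the "store it, search it, copy it about" point, here is a
minimal sketch. The helper name contains_bytes is hypothetical, and the
sketch assumes both arguments hold valid UTF-8. It relies on the fact that
UTF-8 is self-synchronizing: a byte-wise match of one valid UTF-8 sequence
inside another always starts on a character boundary, so the plain
std::string::find is enough.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: reports whether `word` occurs in `text`.
// The comparison is purely byte-wise; std::string knows nothing about
// UTF-8. Both arguments are assumed to contain valid UTF-8.
bool contains_bytes(const std::string& text, const std::string& word) {
    return text.find(word) != std::string::npos;
}
```

Positions returned by find are byte offsets, not character indices, which
is the limitation the rest of the thread is arguing about.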

If a UTF-8 string is to be stored in an array (or std::vector, or
std::basic_string) of char, then it should be possible to store a
UTF-8 character in a char. Which it isn't.

I don't see the relationship. Nor how UTF-8 is different from
any other multibyte character set. std::vector is just an array
of whatever's, with no semantics associated with what it
contains; std::string is not really much more, either (except
that the whatever's have to be PODs). UTF-8 defines its
encoding format as a sequence of bytes, so any container which
can contain a sequence of bytes (char's, in C++ parlance) is
appropriate. None have specific support for UTF-8 encoding, but
then, none have specific support for US ASCII encoding either.

Sorry, I didn't put that right, and the way I wrote it, it indeed
doesn't make any sense. Here's another try.

A string is a sequence of characters. These characters can be chosen
from a very limited set (everything that can be encoded in 7-bit
ASCII) or a very extensive set (Unicode). A class represents a string
if its interface is based on this model.

A sequence of (assume 8-bit) chars models a sequence of characters if
you use ASCII as the representation. A sequence of chars does not
model a sequence of characters if you use UTF-8, although a sequence
of characters can be stored in a sequence of bytes using e.g. UTF-8.

Indeed, std::string will not give you the length of a string in
characters if you use it to store a UTF-8 encoded string, and it
won't give you the character at position n if you use operator[](n).
So, std::string is not a string at all, it is a sequence of chars.

You mean it's not a string as *you'd* define it in this particular
context. And yet millions of programmers have used it as a "string"
of some sort that meets their needs.
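
The gap between the two definitions can be made concrete. The sketch below
is an illustration only: utf8_code_points is a hypothetical helper, not
part of the standard library, and it assumes its input is valid UTF-8. It
counts code points by skipping continuation bytes (those of the form
10xxxxxx), while size() and operator[] keep working in units of char.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// counting every byte that is NOT a continuation byte (10xxxxxx).
// Assumes valid UTF-8; malformed input gives a meaningless count.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "na\xC3\xAFve" (the word "naive" with a diaeresis on the
i), size() reports 6 bytes while the helper reports 5 code points, and
operator[](2) yields the lead byte 0xC3 rather than a whole character.
Note that code points are still not the same thing as user-perceived
characters when combining marks are involved.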

It just happens to be a string too if you use a character encoding
that uses a single char for each character, such as ASCII or one of
the ISO-8859 variants.

Or if you use it as a sequence of bytes, or ....

And yet, according to the C++ standard (according to Eugene, I'm not
very familiar with it), UTF-8 is just another narrow encoding. Well,
if it were, you'd think that std::string would still model a string if
used with it!

And how is it to know that the contents of the string *this time*
are UTF-8, as opposed to JIS, Shift-JIS, EUC, UTF-16LE, etc. etc.?
These are all encodings for character sequences that have been widely
used over the past decade or so.
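
The question of knowing the encoding can be sketched in code. The function
below, looks_like_utf8, is a hypothetical helper written for this
illustration. It checks whether a byte sequence is well-formed UTF-8
(correct lead and continuation bytes, no overlong forms, no surrogates,
nothing above U+10FFFF). A failure rules UTF-8 out, but a success proves
little: plain ASCII text passes, and the same bytes are equally valid in
ISO-8859-1 and many other encodings. The encoding still has to be
supplied from outside the string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `s` is well-formed UTF-8.
// This is a plausibility check, not an encoding detector.
bool looks_like_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;

        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (len > n - i) return false;  // sequence is cut off

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode

        i += len;
    }
    return true;
}
```

For example, the ISO-8859-1 bytes "caf\xE9" and the Shift-JIS bytes
"\x93\xFA" are both rejected, while "caf\xC3\xA9" and any pure ASCII
string are accepted.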

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
