Re: STL, UTF8, and CodeCvt
Lourens Veen wrote:
Eugene Gershnik wrote:
Philip wrote:
Does the STL standard currently encompass any terminology to
accommodate UTF-8 (a sort of combination of narrow and wide) or is
the standards committee considering anything along these lines?
I am not qualified to comment on the standard but UTF-8 is not a
"combination of narrow and wide". It is precisely one more kind of
narrow character. There is no fundamental difference between
Shift-JIS for example and UTF-8. Both encode a given logical
character as a sequence of one or more byte-sized units.
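To make "one or more byte-sized units" concrete, here is a minimal sketch of how a single code point maps to UTF-8 bytes. The name encode_utf8 is hypothetical (it is not a standard library function), and the sketch assumes a C++11 compiler for char32_t.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of the standard library:
// encode one Unicode code point as a sequence of 1 to 4 bytes.
std::string encode_utf8(char32_t cp)
{
    // Assumption: the caller passes a valid scalar value
    // (at most U+10FFFF and not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::string out;
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The result is an ordinary std::string of char; the number of bytes depends on the code point, much as it would in any other multibyte encoding.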
But there _is_ a fundamental difference between encodings that use the
same size code for each character and encodings that use variable
length codes. I think of a UTF-8 string as a wide (UCS-4 or UTF-32)
string stored in a compressed format. That seems to be more natural
than looking at it as a narrow (ASCII) string where some of the
characters are really only partial characters.
Natural or not, it doesn't correspond to reality. UTF-8 is
a multibyte encoding; multibyte encodings have been around for
at least 40 years, and were officially recognized by the
first version of the C standard.
It's sometimes confusing, of course, because a lot of the C
functions which deal with characters (e.g. the functions in
<ctype.h>) don't really recognize this fact. Chalk it up to
historical reasons.
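A small sketch of the mismatch, assuming the program runs in the default "C" locale (no call to setlocale) and that the input is UTF-8. The function name count_alpha_bytes is made up for illustration. std::isalpha sees one byte at a time, so it cannot classify a character that is spread over several bytes.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: count the bytes that std::isalpha accepts.
// The <cctype> functions classify a single unsigned char value,
// so each byte of a multibyte character is tested in isolation.
std::size_t count_alpha_bytes(const std::string& s)
{
    std::size_t n = 0;
    for (char c : s) {
        // The cast matters: passing a negative char is undefined behavior.
        if (std::isalpha(static_cast<unsigned char>(c))) {
            ++n;
        }
    }
    return n;
}
```

For the UTF-8 string "caf\xC3\xA9" (the word "café"), this counts 3 in the "C" locale: the two bytes that make up the final letter are each rejected, although together they encode an alphabetic character.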
If a UTF-8 string is to be stored in an array (or std::vector, or
std::basic_string) of char, then it should be possible to store a
UTF-8 character in a char. Which it isn't.
I don't see the relationship. Nor how UTF-8 is different from
any other multibyte character set. std::vector is just an array
of whatever's, with no semantics associated with what it
contains; std::string is not really much more, either (except
that the whatever's have to be PODs). UTF-8 defines its
encoding format as a sequence of bytes, so any container which
can contain a sequence of bytes (char's, in C++ parlance) is
appropriate. None have specific support for UTF-8 encoding, but
then, none have specific support for US ASCII encoding either.
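As a rough illustration of "a container of bytes with no encoding semantics": std::string::size() reports bytes, and anything that wants characters has to interpret the bytes itself. The sketch below assumes well-formed UTF-8 input and does no validation; count_code_points is a hypothetical name, not a standard function.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the code points in a UTF-8 byte sequence.
// Assumption: the input is well-formed UTF-8. Continuation bytes have
// the bit pattern 10xxxxxx, so every other byte starts a new code point.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

With std::string s = "na\xC3\xAF" "ve"; (the word "naïve"), s.size() is 6 while count_code_points(s) is 5. The string itself neither knows nor cares which of the two numbers the caller wanted.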
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]