Re: STL, UTF8, and CodeCvt
Lourens Veen wrote:
James Kanze wrote:
Lourens Veen wrote:
Eugene Gershnik wrote:
Philip wrote:
Does the STL standard currently encompass any terminology to
accommodate UTF-8 (a sort of combination of narrow and wide) or
is the standards committee considering anything along these
lines?
I am not qualified to comment on the standard, but UTF-8 is not a
"combination of narrow and wide". It is precisely one more kind of
narrow character encoding. There is no fundamental difference between
Shift-JIS, for example, and UTF-8: both encode a given logical
character as a sequence of one or more byte-sized units.
But there _is_ a fundamental difference between encodings that use
the same size code for each character and encodings that use
variable-length codes. I think of a UTF-8 string as a wide (UCS-4
or UTF-32) string stored in a compressed format. That seems more
natural than looking at it as a narrow (ASCII) string where some of
the characters are really only partial characters.
Natural or not, it doesn't correspond to reality. UTF-8 is a
multibyte encoding; multibyte encodings have been around for at
least 40 years, and were officially recognized by the first version
of the C standard.
It's sometimes confusing, of course, because a lot of the C
functions which deal with characters (e.g. the functions in
<ctype.h>) don't really recognize this fact. Chalk it up to
historical reasons.
Just out of interest, how did the designers of C++ end up designing
a std::string that doesn't work with them, then?
What do you mean by "doesn't work"? I use std::string for my
UTF-8 sequences.
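For instance (a small sketch in modern C++, with the UTF-8 encoding
spelled out in escaped bytes), the byte-level operations all behave
sensibly, since appending one valid UTF-8 sequence to another yields
a valid UTF-8 sequence:

    #include <cassert>
    #include <string>

    int main() {
        // U+00E9 (LATIN SMALL LETTER E WITH ACUTE) is the two-byte
        // UTF-8 sequence C3 A9.
        std::string greeting = "caf\xC3\xA9";

        // Concatenation operates on bytes; the result is still
        // well-formed UTF-8.
        greeting += " au lait";

        assert(greeting.size() == 13);   // 13 bytes, but 12 characters
    }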
A more valid question might be why C++ has no real data type which
is an abstraction for text: there is nothing more than a collection
of ad hoc functions (some of which can only be used on C style
arrays or std::vector<char>, but not on std::string), and which only
more or less work.
I suspect that the reason is that even as late as 1998 (or, for
that matter, today), we don't really know what is needed.
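The C functions for multibyte characters illustrate the sort of ad
hoc support I mean. A sketch (assuming the environment provides a
UTF-8 locale, which the standard doesn't guarantee) of decoding a
single multibyte character from a char buffer:

    #include <clocale>
    #include <cstdio>
    #include <cwchar>

    int main() {
        // Assumes a UTF-8 locale is available under this name; this
        // is platform dependent, and not guaranteed by the standard.
        if (!std::setlocale(LC_ALL, "en_US.UTF-8"))
            return 1;

        char const buf[] = "\xC3\xA9";   // UTF-8 for U+00E9 (é)
        wchar_t wc;
        std::mbstate_t state = std::mbstate_t();
        std::size_t n = std::mbrtowc(&wc, buf, sizeof buf - 1, &state);
        std::printf("consumed %u byte(s), code point U+%04lX\n",
                    static_cast<unsigned>(n),
                    static_cast<unsigned long>(wc));
        return 0;
    }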
If a UTF-8 string is to be stored in an array (or std::vector, or
std::basic_string) of char, then it should be possible to store a
UTF-8 character in a char. Which it isn't.
I don't see the relationship, nor how UTF-8 is different from any
other multibyte character set. std::vector is just an array of
whatevers, with no semantics associated with what it contains;
std::string is not really much more, either (except that the
whatevers have to be PODs). UTF-8 defines its encoding format as a
sequence of bytes, so any container which can contain a sequence of
bytes (chars, in C++ parlance) is appropriate. None have specific
support for the UTF-8 encoding, but then, none have specific support
for the US ASCII encoding either.
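Concretely (a sketch; the bytes happen to be UTF-8, but nothing in
the containers knows or cares):

    #include <cassert>
    #include <string>
    #include <vector>

    int main() {
        // "héllo", with é encoded as the two UTF-8 bytes C3 A9.
        char const bytes[] = "h\xC3\xA9llo";

        std::string s(bytes);
        std::vector<char> v(bytes, bytes + sizeof bytes - 1);

        // Both containers hold the same six bytes; what those bytes
        // mean is entirely up to the code which interprets them.
        assert(s.size() == 6 && v.size() == 6);
        assert(std::string(v.begin(), v.end()) == s);
    }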
Sorry, I didn't put that right, and the way I wrote it, it indeed
doesn't make any sense. Here's another try.
A string is a sequence of characters. These characters can be chosen
from a very limited set (everything that can be encoded in 7-bit
ASCII) or a very extensive set (Unicode). A class represents a string
if its interface is based on this model.
That's one definition. (I'd qualify it as a text string.) And
standard C++ has no class which represents a text string, according
to this definition. Neither does C, nor Java. (Nor do a lot of
other languages, most of which I don't know.)
As I said above, I'm not even really certain that we know what
such a class should look like, even today.
A sequence of (assume 8-bit) chars models a sequence of characters if
you use ASCII as the representation.
It can. Provided you interpret it as such in your code.
A sequence of chars does not
model a sequence of characters if you use UTF-8, although a sequence
of characters can be stored in a sequence of bytes using e.g. UTF-8.
It can. Provided you interpret it as such in your code.
std::vector<char> and std::basic_string<char> are simply
containers of char. What you do with the contents is up to you.
C++ gives you very little support for multibyte characters.
Things like ctype<char> don't work with them, for example. But
I don't expect that to change soon, because I don't see any real
consensus with regards to what is required. (If there is a
consensus today, which I doubt, it is more that you shouldn't be
using multibyte characters internally anyway: UTF-8 gets
translated into UTF-32 at the IO interface level.)
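Such a boundary translation might look something like the following
sketch (it handles well-formed input only; a real decoder would also
have to reject overlong forms, surrogates, and out-of-range values,
and in practice one would use an existing facility, such as a
codecvt facet or a library like ICU, rather than rolling one's own):

    #include <cstddef>
    #include <stdexcept>
    #include <string>
    #include <vector>

    std::vector<char32_t> toUtf32(std::string const& utf8) {
        std::vector<char32_t> out;
        std::size_t i = 0;
        while (i < utf8.size()) {
            unsigned char b = static_cast<unsigned char>(utf8[i]);
            char32_t cp = 0;
            int extra = 0;               // continuation bytes to follow
            if      (b < 0x80) { cp = b;        extra = 0; }
            else if (b < 0xC0) { throw std::runtime_error("stray continuation byte"); }
            else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
            else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
            else if (b < 0xF8) { cp = b & 0x07; extra = 3; }
            else               { throw std::runtime_error("invalid lead byte"); }
            if (i + extra >= utf8.size())
                throw std::runtime_error("truncated sequence");
            for (int k = 1; k <= extra; ++k) {
                unsigned char c = static_cast<unsigned char>(utf8[i + k]);
                if ((c & 0xC0) != 0x80)
                    throw std::runtime_error("bad continuation byte");
                cp = (cp << 6) | (c & 0x3F);
            }
            out.push_back(cp);
            i += extra + 1;
        }
        return out;
    }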
Indeed, std::string will not give you the length of a string in
characters if you use it to store a UTF-8 encoded string, and it
won't give you the character at position n if you use operator[](n).
More importantly: std::string will not do anything with
characters. It is a container of char.
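If you want character counts, you have to interpret the bytes
yourself. For UTF-8, for example, you can count the bytes which
aren't continuation bytes (continuation bytes always match the bit
pattern 10xxxxxx); a sketch in modern C++:

    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Counts the code points in a well-formed UTF-8 string by
    // counting the bytes which are not continuation bytes.
    std::size_t codePointCount(std::string const& s) {
        return std::count_if(s.begin(), s.end(), [](char ch) {
            return (static_cast<unsigned char>(ch) & 0xC0) != 0x80;
        });
    }

    int main() {
        std::string s = "caf\xC3\xA9";   // five bytes, four characters
        assert(s.size() == 5);
        assert(codePointCount(s) == 4);
    }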
So std::string is not a string at all; it is a sequence of chars.
It is not a text string. I see no problem with calling the
sequence a string, but as you say, it is not a string of
characters, but rather a string of char (or bytes, if you
prefer). Just as std::basic_string<double> is a string of
doubles. (I'd say that this is almost implicit from the moment
string is a template.)
It
just happens to be a string too if you use a character encoding that
uses a single char for each character, such as ASCII or one of the
ISO-8859 variants.
Even then, it's only a text string if you, as a user, interpret
it as such.
And yet, according to the C++ standard (according to Eugene; I'm not
very familiar with it), UTF-8 is just another narrow encoding. Well,
if it were, you'd think that std::string would still model a string
if used with it!
UTF-8 is just another encoding. Not even necessarily supported
by a C++ implementation.
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]