Re: Poll: Which type would you prefer for UTF-8 string literals in C++0x
On Sep 8, 1:56 pm, Öö Tiib <oot...@hot.ee> wrote:
On Sep 8, 11:52 am, James Kanze <james.ka...@gmail.com> wrote:
[...]
I am somewhat sceptical about the usefulness of UTF-8 bytes
for anything but storing and transporting text. Simple
operations like std::toupper will never work with them
anyway.
"Simple" operations like std::toupper don't work with most
encodings, including char32_t UTF-32. Mainly because things
like "toupper" aren't simple. (The classical example:
toupper('\u00DF') should result in the two character sequence
"SS".) Any effective toupper has to work on a string level, not
on a character level, and generate a new string (since there is
no one-to-one mapping to upper case). And this can be done in UTF-8,
with the correct tools (which maybe should be part of the
standard).
Yes, the standard library should contain correct tools and
not contain incorrect and misleading tools. A function in the
standard library that accepts char* as a character sequence
may play dumb and expect ASCII. If there were a thing like
char8_t with an exact meaning, the old "I thought it was
ASCII" trick could not be pulled.
The standard library does contain "correct" tools, in the sense
that they work according to specification:-). They probably are
a bit misleading, but this could be considered a problem of
documentation (which isn't the role of the standard): it's clear
to me that isupper, for example, is meaningless in Unicode, or
with an alphabet which doesn't have case, or with ideographs, or
in any number of other situations. In general, the basic
functions in <ctype.h>
are only meaningful in "constrained" situations (e.g. to parse
a case insensitive programming language). They generally don't
work well with human languages.
In practice, a lot of applications aren't concerned with
manipulating individual characters anyway; they need to
recognize separators (but often all of the separators will have
single byte codes in UTF-8), and break the text up into
segments, but not much more. (Or so much more that the
difference between UTF-8 and UTF-32 becomes negligible, e.g.
they need to treat the two code point sequence "\u0061\u0302" as
a single character equal to "\u00E2".)
Hmm, but ... a whole lot of apps have to deal with text
entered by the user, or sent by other apps that cannot manage
to specify a binary (or well-formed XML) interface. Such apps
always need operations like capitalizing, case-insensitive
search/compare, and date, time, numeric and money
formatting/parsing, and so on.
Do they? A whole lot of apps don't do any real text
processing at all.
Is this not something that can be called separator searching
or breaking up? Also, it all sounds like the business of
<locale>, and if <locale> cannot pull the weight, then it
should be kicked out of the standard and something that works
should be put in.
The issue isn't simple, and there are historical considerations
which have to be taken into account. The current <locale> does
represent a halfway solution, but I don't think that, even now,
we know exactly what is needed for a full solution (but we're
a lot closer than we were when <locale> was specified).
If you have to use the ICU4C library anyway to get correct
locale-specific comparison, transformation and regular
expression rules, then the standard should stop pretending
that it provides something.
I doubt that even ICU handles all of the cases needed for
correct presentation, although it certainly does a lot more
than anything else I know of.
The standard library doesn't pretend to solve all problems. It
offers a minimal set of functionality for certain limited uses,
IMHO more for historical reasons (and the fact that it is needed
for iostream) than for anything else. If you need more, you
need a third party library (if you can find one which is
adequate), or to implement your own code. Full
internationalization is very, very complex, and rather
difficult.
--
James Kanze