Re: Poll: Which type would you prefer for UTF-8 string literals in C++0x
On Sep 8, 9:20 am, Öö Tiib <oot...@hot.ee> wrote:
On Sep 7, 8:26 pm, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
Öö Tiib <oot...@hot.ee> wrote in news:d8ffa771-0bef-4bc9-b310-
052278883...@m1g2000yqo.googlegroups.com:
I am somewhat sceptical about the usefulness of UTF-8 bytes
for anything but storing and transporting texts. Simple
operations like std::toupper will never work with these
anyway.
The string variables are mostly used just for storing,
concatenating and transporting texts. Splitting on ASCII
delimiters and searching substrings also works fine with
UTF-8. The only problematic operations are related to single
character manipulations, which are quite rare in my
experience.
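A minimal sketch of why splitting on ASCII delimiters is safe on raw
UTF-8 bytes (the function name split_on_ascii is made up for this
illustration, it is not from the thread): every byte of a multibyte
UTF-8 sequence has its high bit set, so it can never be mistaken for
an ASCII delimiter.

```cpp
#include <string>
#include <vector>

// Split a UTF-8 encoded string on a single ASCII delimiter.
// Works byte-wise: bytes of multibyte sequences are all >= 0x80,
// so they never compare equal to an ASCII delimiter.
std::vector<std::string> split_on_ascii(const std::string& utf8, char delim)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = utf8.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(utf8.substr(start));   // last field
            return fields;
        }
        fields.push_back(utf8.substr(start, pos - start));
        start = pos + 1;                            // skip the delimiter
    }
}
```

Substring search works the same way: std::string::find on the bytes
of a valid UTF-8 needle only matches at character boundaries of a
valid UTF-8 haystack.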
I meant these functions in <locale>. Do you really never need them?
template < class charT > bool isspace( charT c, const locale& loc );
template < class charT > bool isprint( charT c, const locale& loc );
template < class charT > bool iscntrl( charT c, const locale& loc );
template < class charT > bool isupper( charT c, const locale& loc );
template < class charT > bool islower( charT c, const locale& loc );
template < class charT > bool isalpha( charT c, const locale& loc );
template < class charT > bool isdigit( charT c, const locale& loc );
template < class charT > bool ispunct( charT c, const locale& loc );
template < class charT > bool isxdigit( charT c, const locale& loc );
template < class charT > bool isalnum( charT c, const locale& loc );
template < class charT > bool isgraph( charT c, const locale& loc );
template < class charT > charT toupper( charT c, const locale& loc );
template < class charT > charT tolower( charT c, const locale& loc );
In most applications, no. And if you're actually dealing with
full Unicode, neither they nor their wide character equivalents
work: even in UTF-32, you may need several code points to
specify a character.
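To illustrate the point about UTF-32 (my own example, not one given
in the thread): the letter "e" with an acute accent can be written
either as the single code point U+00E9 or as "e" followed by
U+0301 COMBINING ACUTE ACCENT. A per-code-point function such as
isalpha or toupper only ever sees one element of the second form.

```cpp
#include <string>

// The same user-perceived character, encoded two ways in UTF-32.
const std::u32string precomposed = U"\u00E9";   // one code point
const std::u32string decomposed  = U"e\u0301";  // base letter + combining mark

// The two forms differ in length and do not compare equal
// unless the text is normalized first.
inline bool same_code_points(const std::u32string& a, const std::u32string& b)
{
    return a == b;
}
```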
The toupper() function seems anything but simple to me.
The standard example from James Kanze is the German ß, which
should go to SS in uppercase.
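The difficulty is that toupper( charT c, const locale& loc ) returns a
single charT, so it cannot produce two characters. A rough sketch of
what a string-level conversion has to do instead, assuming UTF-32
input and handling only ASCII letters plus this one special case (the
name upper_with_eszett is hypothetical):

```cpp
#include <string>

// Hypothetical helper: uppercases ASCII letters and expands
// U+00DF (sharp s) to "SS". All other code points pass through.
std::u32string upper_with_eszett(const std::u32string& s)
{
    std::u32string out;
    for (char32_t c : s) {
        if (c == U'\u00DF') {
            out += U"SS";                                   // one code point becomes two
        } else if (c >= U'a' && c <= U'z') {
            out += static_cast<char32_t>(c - U'a' + U'A');  // plain ASCII case shift
        } else {
            out += c;
        }
    }
    return out;
}
```

Note that the result can be longer than the input, which is exactly
what a character-by-character interface cannot express.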
Simple ... I meant simple for the people who say that here the text
should be capitalized, here in upper case, and here in a bold font.
These sound like presentation issues (bold font is definitely
one). A lot of applications aren't concerned with presentation.
And those that are, and that need to support full Unicode,
generally can't use the above functions anyway, because several
code points may be necessary to specify a character, even in
Unicode.
For them, all three feel like tasks of similar complexity. It is
a reasonable requirement: "I want to search for the text I typed
in case-insensitively", isn't it?
Maybe, but then you have to define exactly what you mean by
"case-insensitive". In Germany, there are two separate
conventions regarding Umlauts ("ä" may compare equal to "a" or
to "ae", depending on the convention), for example, and of
course, "ß" must compare equal to "SS" (or in certain special
cases, to "SZ", it's context dependent).
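As a rough sketch of how the choice of convention changes the result,
one can build a comparison key under each rule and compare the keys.
Everything here is an assumption made for illustration: the names
UmlautRule and comparison_key are invented, only "ä"/"Ä", "ß" and
ASCII letters are handled, and the context-dependent "SZ" case is
ignored.

```cpp
#include <string>

// Which of the two conventions for "ä" to apply (hypothetical names).
enum class UmlautRule { BaseLetter, WithE };

// Build a lowercase comparison key from UTF-32 text.
std::u32string comparison_key(const std::u32string& s, UmlautRule rule)
{
    std::u32string key;
    for (char32_t c : s) {
        switch (c) {
        case U'\u00E4':                 // a-umlaut, lowercase
        case U'\u00C4':                 // a-umlaut, uppercase
            key += U'a';
            if (rule == UmlautRule::WithE)
                key += U'e';
            break;
        case U'\u00DF':                 // sharp s compares equal to "ss"
            key += U"ss";
            break;
        default:
            // Fold ASCII upper case to lower case; keep everything else.
            if (c >= U'A' && c <= U'Z')
                key += static_cast<char32_t>(c - U'A' + U'a');
            else
                key += c;
        }
    }
    return key;
}
```

Two strings are then "equal, ignoring case" exactly when their keys
are equal under the chosen rule, so the same pair of strings can match
under one convention and not under the other.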
If std::toupper() with char32_t still needs special
post-processing for German ß or some other exception, then
okay. If it does not work with char8_t then it should throw,
and not produce rubbish.
That's an interesting proposition; I rather like it.
The simple forms of the functions are useful in many contexts,
where you know that you'll only be treating (or should only be
treating) pure ASCII, for example. They should be supported, if
only for historical reasons. The question is what to do when
something like isalpha is called on something that isn't
a character in the locale specific encoding. The current
specification says to return false (or 0 in the C versions); if
it isn't a character, it isn't an alphabetic character. But
I rather like the idea of throwing an exception: if you pass it
something that isn't a character, then you've probably got the
wrong file, or the wrong data, or whatever. (Alternatively, you
need a function islegal, and then a precondition for the other
functions that islegal returns true.)
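A minimal sketch of the throwing variant, built on a hypothetical
islegal. Neither islegal nor checked_isalpha exists in the standard
library; the choice of "7-bit ASCII only" as the definition of a legal
character is an assumption made here to keep the example small.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical: a char is "legal" here only if it is 7-bit ASCII.
inline bool islegal(char c)
{
    return static_cast<unsigned char>(c) < 0x80;
}

// Like std::isalpha(c, loc), but refuses input that is not a character
// instead of quietly returning false.
inline bool checked_isalpha(char c,
                            const std::locale& loc = std::locale::classic())
{
    if (!islegal(c))
        throw std::invalid_argument("checked_isalpha: not a character");
    return std::isalpha(c, loc);
}
```

With this, a stray lead byte of a UTF-8 sequence such as '\xC3' raises
an exception at the point of misuse, while 'a' and '1' behave as they
do today.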
I would also be in favor of raising an exception or having
a precondition failure if they are called on a locale which uses
a multibyte encoding (like UTF-8), even if the actual character
in question is only a single byte. (And what about calling
islower in an encoding for an alphabet like Arabic, which
doesn't have case?)
--
James Kanze