Re: Poll: Which type would you prefer for UTF-8 string literals in C++0x

From: James Kanze <james.kanze@gmail.com>
Newsgroups: comp.lang.c++
Date: Mon, 6 Sep 2010 03:23:08 -0700 (PDT)
Message-ID: <75cec034-f715-4c60-a989-079758f67350@u6g2000yqh.googlegroups.com>

On Aug 31, 8:39 pm, Paavo Helde <myfirstn...@osa.pri.ee> wrote:

> "Martin B." <0xCDCDC...@gmx.at> wrote in
> news:i5j7gk$8da$1@news.eternal-september.org:
>
> > QUESTION: For the upcoming UTF-8 string literals, which type would you
> > prefer?
> >
> >    a) The current proposal, "array of n const char" is great!
> >    b) "array of n const unsigned char" would be better!
> >       (Because I'm using libxml2 ;-)
> >    c) FCS! Add a distinct char8_t and make u8 literals use that!
>
> Logically, b) would be better of course. However, as there are
> zillions of text interfaces using char and most of them work
> fine with UTF-8, I would vote for a).

The question raises a more general issue: should the encoding be
part of the type? Or, in other words, should UTF-8 strings and
characters (in general) have a different type than e.g. ISO
8859-1 (which in turn should have a different type than ISO
8859-2)?

I think one could argue both ways, but historically, narrow
characters have always been char/char*/basic_string<char>,
regardless of the encoding, and I don't think it would work to
change this now.

> As for c), this is already present - char is different from
> signed char or unsigned char. The problem is that most
> mainstream implementations define plain char as signed. Adding
> a new type would only cause more confusion IMO.

I've usually seen a general convention that text is char, and
that small integers are either signed char or unsigned char. Of
course, using a signed type to represent characters is an
anomaly; allowing it is arguably an error in the initial
specification of C (but making plain char unsigned on a PDP-11
would have had a very significant negative impact on performance).
In some
ways, I'd like to see a requirement that plain char be unsigned,
or even that it be more restricted, only supporting operations
which might make sense on a character (no multiplication, for
example). But practically speaking, it's not going to happen,
and QoI (quality of implementation) considerations will ensure
that all implementations support things like ISO 8859-1 or UTF-8
on plain char: if plain char is signed, they will ensure that
there is a lossless two-way conversion between an int in the
range 0 to UCHAR_MAX (as returned by streambuf::sgetc, for
example) and char. (Note that according to the standard, the
result of converting a value in the range SCHAR_MAX+1 to
UCHAR_MAX to a char is implementation-defined, and, at least
according to the C standard, the conversion may instead raise an
implementation-defined signal. In practice, I wouldn't worry
about it.)

--
James Kanze