Re: UTF8 and std::string

From: "kanze" <kanze@gabi-soft.fr>
Newsgroups: comp.lang.c++.moderated
Date: 14 Jun 2006 18:31:32 -0400
Message-ID: <1150277100.156630.194960@y43g2000cwc.googlegroups.com>

Eugene Gershnik wrote:

Bronek Kozicki wrote:

[...]

UTF-8 has special properties that make it very attractive for
many applications. In particular, it guarantees that no byte of
a multi-byte sequence can be mistaken for a standalone single
byte. Thus with UTF-8 you can still search for English-only
strings (like /, \\ or .) using single-byte algorithms like
strchr().

It also means that you cannot use std::find to search for an é.
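
Both points can be illustrated with a small sketch. The function names and the sample string are made up for illustration, and é (U+00E9, bytes 0xC3 0xA9 in UTF-8) is only one example of a non-ASCII character. A single-byte search for '/' is safe, while a non-ASCII character has to be searched for as a byte sequence:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Sample text "café/thé" as UTF-8 bytes (hypothetical test data).
// Literal concatenation keeps each hex escape from absorbing the next letter.
inline std::string sample() { return "caf" "\xC3\xA9" "/th" "\xC3\xA9"; }

// Byte offset of the first '/' using plain strchr, or npos if absent.
// Safe in UTF-8 because no byte of a multi-byte sequence is below 0x80.
std::size_t find_slash(const std::string& utf8) {
    const char* base = utf8.c_str();
    const char* hit = std::strchr(base, '/');
    return hit ? static_cast<std::size_t>(hit - base) : std::string::npos;
}

// A non-ASCII character must be searched for as its encoded byte sequence.
std::size_t find_sequence(const std::string& utf8, const std::string& encoded) {
    return utf8.find(encoded);
}

// std::find compares one char at a time, so it can only locate a single
// byte, for instance the lead byte 0xC3, which é shares with other characters.
std::size_t find_single_byte(const std::string& utf8, char byte) {
    std::string::const_iterator it = std::find(utf8.begin(), utf8.end(), byte);
    return it == utf8.end() ? std::string::npos
                            : static_cast<std::size_t>(it - utf8.begin());
}
```

Note that find_single_byte with 0xC3 also matches a string containing only à (0xC3 0xA0), which is why a one-element search cannot stand in for a search for é.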

It also can be used (with caution) with std::string, unlike
UTF-16 and UTF-32, for which you will have to invent a
character type and write traits.

Agreed, but in practice, if you are using UTF-8 in std::string,
your strings aren't compatible with the third party libraries
using std::string in their interface. Arguably, you want a
different type, so that the compiler will catch errors.
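
One way to get a different type is a thin wrapper around std::string. This is only a sketch: Utf8String and legacy_length are hypothetical names, and the wrapper does no validation of its contents. The static_asserts show that the compiler rejects implicit conversion in both directions:

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical wrapper: marks a byte string as holding UTF-8.
// It does not check that the bytes are well-formed UTF-8.
class Utf8String {
public:
    explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {}

    // Access to the raw bytes must be requested explicitly.
    const std::string& bytes() const { return bytes_; }

private:
    std::string bytes_;
};

// Stand-in for a third-party function that assumes a single-byte encoding.
std::size_t legacy_length(const std::string& s) { return s.size(); }

// Neither direction converts implicitly, so mixing the two is a compile error.
static_assert(!std::is_convertible<Utf8String, std::string>::value,
              "Utf8String must not decay to std::string");
static_assert(!std::is_convertible<std::string, Utf8String>::value,
              "std::string must not silently become Utf8String");
```

A call such as legacy_length(text.bytes()) still compiles, but the explicit bytes() makes the crossing between the two worlds visible at the call site.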

IMO UTF-8 (and UTF-8 locales) is probably the best way to use
Unicode on Unix. Apparently I am also backed by known experts
http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux

That article only really speaks of external representations.
For which there's not really much choice: for better or for
worse, we live in an 8-bit world -- all modern architectures
have 8 bit bytes, all of the Internet protocols are octet
oriented, etc. And the only 8 bit code which can handle all
languages is UTF-8.

Internally, it depends on the application, and what you are
doing with the strings. For many applications, I think that
UTF-8 is a good choice, even for internal use. For others, I'd
go with UTF-32.
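
Where the difference shows up most is in indexing by character. The sketch below uses hypothetical helper names and assumes well-formed input with no validation; it contrasts the linear scan needed to find the n-th code point in UTF-8 with the direct index available in UTF-32:

```cpp
#include <cstddef>
#include <string>

// Byte offset at which the n-th code point (0-based) starts, or npos.
// Assumes well-formed UTF-8; every call scans from the beginning.
std::size_t utf8_offset_of(const std::string& utf8, std::size_t n) {
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        // Continuation bytes have the bit pattern 10xxxxxx.
        const bool continuation =
            (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80;
        if (!continuation) {
            if (n == 0) return i;
            --n;
        }
    }
    return std::string::npos;
}

// In UTF-32 every code point occupies one element, so indexing is direct.
// The caller must ensure n < utf32.size().
char32_t utf32_at(const std::u32string& utf32, std::size_t n) {
    return utf32[n];
}
```

Both helpers work on code points, not on what a user perceives as a character; combining sequences are outside the scope of this sketch.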

UTF-16 is a good option on platforms that directly support it
like Windows, AIX or Java. UTF-32 is probably not a good
option anywhere ;-)

I can't think of any context where UTF-16 would be my choice.
It seems to have all of the weaknesses of UTF-8 (e.g.
multi-byte), plus a few of its own (byte order in external
files), and no added benefits -- UTF-8 will usually use less
space. Any time you need true random access to characters,
however, UTF-32 is the way to go. The one exception might be if
you could be sure of not having to handle surrogates; if
internationalisation were limited to Europe, for example.
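
The size and surrogate points can be checked with a pair of single-code-point encoders. This is a sketch with hypothetical function names; it assumes the argument is a valid Unicode scalar value (at most 0x10FFFF and not itself a surrogate) and does no error handling:

```cpp
#include <string>

// Encode one code point as UTF-8 (1 to 4 bytes).
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as UTF-16 (1 or 2 sixteen-bit units).
std::u16string to_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        // Code points above U+FFFF need a surrogate pair.
        const char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 | (v >> 10));
        out += static_cast<char16_t>(0xDC00 | (v & 0x3FF));
    }
    return out;
}
```

For U+0041 this gives 1 byte in UTF-8 against 2 in UTF-16, and for U+00E9 both take 2 bytes. U+20AC is a case where UTF-8 is larger (3 bytes against 2), and U+1F600 takes 4 bytes in either encoding, with UTF-16 needing the surrogate pair 0xD83D 0xDE00. Writing those 16-bit units to a file additionally requires choosing a byte order, which the UTF-8 bytes do not.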

--
James Kanze GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
