Re: UTF8 and std::string
Eugene Gershnik wrote:
Bronek Kozicki wrote:
[...]
UTF-8 has special properties that make it very attractive for
many applications. In particular, it guarantees that no byte of
a multi-byte sequence can be mistaken for a standalone
single-byte (ASCII) character. Thus with UTF-8 you can still
search for ASCII-only strings (like /, \\ or .) using
single-byte algorithms like strchr().
It also means that you cannot use std::find to search for an é,
which occupies two bytes in UTF-8.
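
To make both points concrete, here is a minimal sketch. The function
names find_slash and find_e_acute are made up for this illustration,
and é (U+00E9) is taken only as an example of a non-ASCII character.
It assumes the std::string holds well-formed UTF-8 with no embedded
NUL bytes.

```cpp
#include <cstring>
#include <string>

// Index of the first '/' in a UTF-8 string, or npos if there is none.
// A byte-wise search is safe here: every byte of a multi-byte UTF-8
// sequence has its high bit set, so it never equals an ASCII character.
std::string::size_type find_slash(const std::string& utf8)
{
    const char* start = utf8.c_str();
    const char* hit = std::strchr(start, '/');
    return hit ? static_cast<std::string::size_type>(hit - start)
               : std::string::npos;
}

// A non-ASCII character is a sequence of bytes, so a one-char search
// such as std::find or strchr cannot express it. Searching for the
// encoded byte sequence as a substring works instead.
std::string::size_type find_e_acute(const std::string& utf8)
{
    static const std::string e_acute = "\xC3\xA9"; // U+00E9 in UTF-8
    return utf8.find(e_acute);
}
```

For the bytes of "café/menu" in UTF-8, find_slash returns 5 and
find_e_acute returns 3; both are byte offsets, not character counts.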
It also can be used (with caution) with std::string, unlike
UTF-16 and UTF-32, for which you will have to invent a
character type and write traits.
Agreed, but in practice, if you are using UTF-8 in std::string,
your strings aren't compatible with the third-party libraries
that use std::string in their interfaces. Arguably, you want a
different type, so that the compiler will catch errors.
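
One way such a distinct type might look, as a sketch only: Utf8String
and legacy_length are hypothetical names invented for this example,
and the wrapper does not validate that its bytes really are UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical wrapper for bytes that the program treats as UTF-8.
class Utf8String
{
public:
    // explicit: a plain std::string never silently becomes a Utf8String.
    explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {}

    // Getting at the raw bytes has to be requested by name.
    const std::string& bytes() const { return bytes_; }

private:
    std::string bytes_;
};

// Hypothetical stand-in for a third-party function whose std::string
// parameter is expected to hold some other single-byte encoding.
std::size_t legacy_length(const std::string& s) { return s.size(); }

// The compiler rejects an accidental mix of the two types.
static_assert(!std::is_convertible<Utf8String, std::string>::value,
              "Utf8String must not convert implicitly to std::string");
static_assert(!std::is_convertible<std::string, Utf8String>::value,
              "std::string must not convert implicitly to Utf8String");
```

With this, legacy_length(text) fails to compile for a Utf8String text,
while legacy_length(text.bytes()) compiles and makes the hand-over
visible at the call site.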
IMO UTF-8 (and UTF-8 locales) is probably the best way to use
Unicode on Unix. Apparently I am also backed by known experts
http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux
That article only really speaks of external representations.
For which there's not really much choice: for better or for
worse, we live in an 8-bit world -- all modern architectures
have 8-bit bytes, all of the Internet protocols are
octet-oriented, etc. And the only 8-bit code which can handle
all languages is UTF-8.
Internally, it depends on the application, and what you are
doing with the strings. For many applications, I think that
UTF-8 is a good choice, even for internal use. For others, I'd
go with UTF-32.
UTF-16 is a good option on platforms that directly support it
like Windows, AIX or Java. UTF-32 is probably not a good
option anywhere ;-)
I can't think of any context where UTF-16 would be my choice.
It seems to have all of the weaknesses of UTF-8 (e.g.
multi-byte), plus a few of its own (byte order in external
files), and no added benefits -- UTF-8 will usually use less
space. Any time you need true random access to characters,
however, UTF-32 is the way to go. The one exception might be if
you could be sure of not having to handle surrogates; if
internationalisation were limited to Europe, for example.
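
A rough sketch of the difference, with "character" taken to mean a
Unicode code point: the helper names below are invented for this
example, the UTF-8 input is assumed to be well formed, and char32_t
requires C++11.

```cpp
#include <cstddef>
#include <string>

// True for bytes of the form 10xxxxxx, which continue a multi-byte
// UTF-8 sequence rather than start a new code point.
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Byte offset of the n-th code point (0-based) in well-formed UTF-8,
// or npos if the string has fewer than n + 1 code points.
// This is a linear scan: there is no way to jump straight to it.
std::string::size_type offset_of_code_point(const std::string& utf8,
                                            std::size_t n)
{
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        if (!is_continuation(static_cast<unsigned char>(utf8[i]))) {
            if (n == 0) return i;
            --n;
        }
    }
    return std::string::npos;
}

// Number of 16-bit units UTF-16 needs for a code point: anything
// above U+FFFF takes a surrogate pair, so UTF-16 is variable-width too.
inline int utf16_units(char32_t cp) { return cp > 0xFFFF ? 2 : 1; }

// In UTF-32 every code point is one element, so indexing is direct.
inline char32_t code_point_at(const std::u32string& utf32, std::size_t n)
{
    return utf32[n];
}
```

For the UTF-8 bytes of "a", "é", "€", "b" in that order, the euro sign
(code point index 2) starts at byte offset 3, whereas in a
std::u32string it is simply element 2.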
--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34