Re: UTF8 and std::string

From: "kanze" <kanze@gabi-soft.fr>
Newsgroups: comp.lang.c++.moderated
Date: 16 Jun 2006 05:25:34 -0400
Message-ID: <1150443290.778113.267630@c74g2000cwc.googlegroups.com>

Eugene Gershnik wrote:

kanze wrote:

Eugene Gershnik wrote:

Bronek Kozicki wrote:

[...]

UTF-8 has special properties that make it very attractive for
many applications. In particular, it guarantees that no byte of
a multi-byte sequence can be mistaken for a standalone
single-byte character. Thus with UTF-8 you can still search for
ASCII-only strings (like /, \\ or .) using single-byte
algorithms like strchr().
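
A minimal sketch of that property, assuming a NUL-terminated UTF-8 string; the helper name ascii_offset is hypothetical. Because every byte of a multi-byte UTF-8 sequence is 0x80 or above, strchr() can only ever match a real ASCII character:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: byte offset of the first occurrence of an ASCII
// character in a NUL-terminated UTF-8 string, or -1 if it is absent.
std::ptrdiff_t ascii_offset(const char* utf8, char ascii) {
    // Only ASCII targets are safe for a byte-wise search.
    assert(ascii != '\0' && static_cast<unsigned char>(ascii) < 0x80);
    const char* hit = std::strchr(utf8, ascii);
    return hit ? hit - utf8 : -1;
}
```

For example, in the byte string "caf\xC3\xA9/men\xC3\xBC.txt" the '/' is found at byte offset 5, after the two-byte sequence for the accented e, and neither 0xC3 nor 0xA9 can be confused with it.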

It also means that you cannot use std::find to search for an
é.

Neither can you with UTF-32 or anything else since AFAIK é may
be encoded as e followed by the thingy on top or as a single
unit é. ;-)

Not according to Unicode, at least not in correctly formed
Unicode sequences.

But that's not the point. The point is more the opposite:
simplistic solutions like looking for a single character are
just that: simplistic. The fact that you can find certain
characters with such a single character search in UTF-8 is a
marginal advantage, at best.

In any event my point was that in many contexts (system
programming, networking) you almost never look for anything
above 0x7F, even though you have to store it.

For the moment. Although in most of those contexts, you have to
deal with binary data as well, which means that any simple text
handling will fail.

Also note that you *can* use std::find with a filtering
iterator (which is easy to write) sacrificing performance.
Then again nobody uses std::find on strings. You either use
basic_string::find or strstr() and similar. Which both work
fine on é in UTF-8 as long as you pass it as a string and not
a single char.
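
As a rough illustration of the difference between searching for a string and searching for a single char, here is a sketch built on std::string::find. The function name count_occurrences is hypothetical, and both arguments are assumed to be valid UTF-8 in the same normalization form:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts non-overlapping occurrences of a UTF-8
// encoded needle in a UTF-8 encoded haystack, using plain byte search.
std::size_t count_occurrences(const std::string& haystack,
                              const std::string& needle) {
    assert(!needle.empty());
    std::size_t count = 0;
    for (std::size_t pos = haystack.find(needle);
         pos != std::string::npos;
         pos = haystack.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}
```

With the haystack "r\xC3\xA9sum\xC3\xA9" and the needle "\xC3\xA9" (the UTF-8 encoding of U+00E9), the byte search finds both occurrences. Searching the same haystack for the single char '\xE9' finds nothing, since that byte never appears in the UTF-8 encoding.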

Agreed. But then, a lot of other multibyte character sets will
work in that case as well.

I'm not saying that UTF-8 doesn't have any advantages. But the
fundamental reason for using it is much simpler: it's the only
game in town. I know of no other internationally standardized 8
bit code set which covers all languages.

The only thing that doesn't work well with UTF-8 is access at
arbitrary index but I doubt any software except maybe document
editors really needs to do it.
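
To make the cost of indexed access concrete, here is a sketch of locating the n-th code point in a UTF-8 std::string. It assumes well-formed input; the names is_continuation and code_point_offset are hypothetical. The scan is linear because the only way to find a code point is to skip the continuation bytes (10xxxxxx) before it:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Hypothetical helper: byte offset of the code point with 0-based number
// `index`, or std::string::npos if the string has fewer code points.
// Assumes `s` is well-formed UTF-8. Runs in O(s.size()).
std::size_t code_point_offset(const std::string& s, std::size_t index) {
    std::size_t seen = 0;  // code points passed so far
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_continuation(static_cast<unsigned char>(s[i]))) continue;
        assert(seen <= index);  // loop invariant: target not yet passed
        if (seen == index) return i;
        ++seen;
    }
    return std::string::npos;
}
```

Note that this indexes code points, not user-perceived characters; a base letter followed by a combining accent still counts as two.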

I don't know. I know that what I do doesn't need it, but I
don't know too much about what others might be doing.

It can also be used (with caution) with std::string,
unlike UTF-16 and UTF-32 for which you will have to invent
a character type and write traits.

Agreed, but in practice, if you are using UTF-8 in
std::string, your strings aren't compatible with the third
party libraries using std::string in their interface.

That depends on the library. If it only looks for characters
below 0x7F and passes the rest unmodified, I stay compatible.
Most libraries fall into this category. That's why so much Unix
code works perfectly in a UTF-8 locale even though it wasn't
written with it in mind.

Are you kidding? I've not found this to be the case at all.
Most Unix tools are extremely primitive, and line things up in
columns based on byte count (which also imposes fixed width
fonts---rather limiting as well).

Note that the problem is NOT trivial. If everything were UTF-8,
it would be easy to adopt. But everything isn't UTF-8, and we
cannot change the past. The file systems I work on do have
characters like 'é' in them, already encoded in ISO 8859-1. If
you create a filename using UTF-8 in the same directory, ls is
going to have one hell of a time displaying the directory
contents correctly. Except that ls doesn't worry about
displaying them correctly. It just spits them out, and counts
on xterm doing the job correctly. And xterm delegates the job
to a font, which has one specific encoding (which both ls and
xterm ignore).
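
To show why the two encodings cannot simply be mixed, here is a sketch of converting ISO 8859-1 bytes to UTF-8; the function name latin1_to_utf8 is hypothetical. ISO 8859-1 byte values coincide with the first 256 Unicode code points, so every byte of 0x80 or above becomes a two-byte UTF-8 sequence, and the same name has a different byte count in each encoding:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: re-encodes an ISO 8859-1 byte string as UTF-8.
// Bytes below 0x80 are copied as-is; bytes 0x80..0xFF map to the code
// points U+0080..U+00FF and become the two bytes 110000xx 10xxxxxx.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size());
    for (char ch : latin1) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    assert(out.size() >= latin1.size());  // UTF-8 is never shorter here
    return out;
}
```

A three-byte ISO 8859-1 name such as "\xE9t\xE9" becomes five bytes in UTF-8, which is exactly what breaks a tool that lines up columns by byte count.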

This is one case where Windows has the edge on Unix: Windows
imposes a specific encoding for filenames. IMHO, it would have
been better if they had followed the Plan 9 example, and chosen
UTF-8, but anything is better than the Unix solution, where
nothing is defined, every application does whatever it feels
like, and filenames with anything other than basic US ASCII end
up causing a lot of problems.

Arguably, you want a different type, so that the compiler
will catch errors.

Yes. When I want maximum safety I create struct utf8_char
{...}; with the same size and alignment as char. Then I
specialize char_traits, delegating to char_traits<char> and
have typedef basic_string<utf8_char> utf8_string. This creates
a string binary compatible to std::string but with a different
type. It gives me type safety but I am still able to
reinterpret_cast pointers and references between std::string
and utf8_string if I want to. I know it is undefined behavior
but it works extremely well on all compilers I have to deal
with (and I suspect on all compilers in existence).
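
The post elides the body of utf8_char, so the following is only one possible reading of the approach, not the poster's actual code. It assumes a struct wrapping a single char and a char_traits specialization that delegates to char_traits<char> where that is straightforward. The conversion helpers to_utf8_string and to_bytes are hypothetical additions that copy rather than reinterpret_cast, which avoids the undefined behavior mentioned above at the cost of a copy:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <ios>
#include <string>

// Assumed layout: a distinct type with the size and alignment of char.
struct utf8_char { char value; };
static_assert(sizeof(utf8_char) == sizeof(char), "must match char size");
static_assert(alignof(utf8_char) == alignof(char), "must match char alignment");

namespace std {
template <> struct char_traits<utf8_char> {
    using char_type = utf8_char;
    using int_type = int;
    using off_type = streamoff;
    using pos_type = streampos;
    using state_type = mbstate_t;
    static void assign(char_type& d, const char_type& s) noexcept { d = s; }
    static bool eq(char_type a, char_type b) noexcept {
        return char_traits<char>::eq(a.value, b.value);
    }
    static bool lt(char_type a, char_type b) noexcept {
        return char_traits<char>::lt(a.value, b.value);
    }
    static int compare(const char_type* a, const char_type* b, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    static size_t length(const char_type* s) {
        size_t n = 0;
        while (s[n].value != '\0') ++n;
        return n;
    }
    static const char_type* find(const char_type* s, size_t n, const char_type& c) {
        for (size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return nullptr;
    }
    static char_type* move(char_type* d, const char_type* s, size_t n) {
        if (n != 0) memmove(d, s, n * sizeof(char_type));
        return d;
    }
    static char_type* copy(char_type* d, const char_type* s, size_t n) {
        if (n != 0) memcpy(d, s, n * sizeof(char_type));
        return d;
    }
    static char_type* assign(char_type* d, size_t n, char_type c) {
        for (size_t i = 0; i < n; ++i) d[i] = c;
        return d;
    }
    static int_type eof() noexcept { return -1; }
    static int_type not_eof(int_type c) noexcept { return c == eof() ? 0 : c; }
    static char_type to_char_type(int_type c) noexcept {
        return utf8_char{static_cast<char>(c)};
    }
    static int_type to_int_type(char_type c) noexcept {
        return static_cast<unsigned char>(c.value);
    }
    static bool eq_int_type(int_type a, int_type b) noexcept { return a == b; }
};
}  // namespace std

using utf8_string = std::basic_string<utf8_char>;

// Hypothetical helpers: explicit, copying conversions in both directions.
utf8_string to_utf8_string(const std::string& bytes) {
    utf8_string out;
    out.reserve(bytes.size());
    for (char c : bytes) out.push_back(utf8_char{c});
    assert(out.size() == bytes.size());
    return out;
}

std::string to_bytes(const utf8_string& text) {
    std::string out;
    out.reserve(text.size());
    for (utf8_char c : text) out.push_back(c.value);
    return out;
}
```

The point of the distinct type is that passing a utf8_string where a std::string is expected becomes a compile error, so every crossing between the two has to go through an explicit conversion.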

And you doubtlessly have to convert a lot:-). Or do you also
create all of the needed facets in locale?

Still, it doesn't work if the code you're interfacing to is
trying to line data up using character counts, and doesn't
expect multi-byte characters. If, like a lot of software here
in Europe, it assumes ISO 8859-1.

UTF-16 is a good option on platforms that directly support
it like Windows, AIX or Java. UTF-32 is probably not a good
option anywhere ;-)

I can't think of any context where UTF-16 would be my
choice.

Any code written for NT-based Windows for example. The system
pretty much forces you into it.

In the same way that Unix forces you into US ASCII, yes.

All the system APIs (not some but *all*) that deal with
strings accept UTF-16. None of them accept UTF-8 or UTF-32.
There is also no notion of UTF-8 locale. If you select
anything but UTF-16 for your application you will have to
convert everywhere.

They've got to support UTF-8 somewhere. It's the standard
encoding for all of the Internet protocols.

I'd probably treat Windows the same way I treat Unix: I use the
straight 8 bit interface, make sure all of the strings the
system is concerned with are pure US ASCII, and do the rest
myself.

It seems to have all of the weaknesses of UTF-8 (e.g.
multi-byte), plus a few of its own (byte order in external
files), and no added benefits -- UTF-8 will usually use less
space. Any time you need true random access to characters,
however, UTF-32 is the way to go.
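
The "multi-byte" weakness of UTF-16 comes from surrogate pairs: code points above U+FFFF take two 16-bit units. A minimal sketch of the encoding step, assuming a valid Unicode scalar value as input; the function name encode_utf16 is hypothetical:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode scalar value as UTF-16.
// Code points up to U+FFFF take one unit; the rest take a surrogate pair.
std::u16string encode_utf16(char32_t cp) {
    // Precondition: in range and not itself a surrogate code point.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;  // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low
    }
    return out;
}
```

For instance, U+00E9 encodes as the single unit 0x00E9, while U+1D11E encodes as the pair 0xD834 0xDD1E, so indexing a UTF-16 string by unit does not give random access to characters either.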

Well as long as you don't need to look up characters
*yourself* but only get results from libraries that already
understand UTF-16 the problems above disappear. Win32, ICU and
Rosette all use UTF-16 as their base character type (well
Rosette supports UTF-32 too) so it is easier to just use it
everywhere.

At the most fundamental level, to do I18N correctly, strings
have to be dealt with as indivisible units. When you want to
perform some operation you pass the string to a library and
get the results back. No hand written iteration can be
expected to deal with pre-composed vs. composite, different
canonical forms and all the other garbage Unicode brings us.
If a string is an indivisible unit then it doesn't really
matter what this unit is as long as it is what your libraries
expect to see.

So we basically agree:-). All that's missing is the libraries.
(I know, some exist. But all too often, you don't have a
choice.)

--
James Kanze GABI Software
Conseils en informatique orientée objet/
                    Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
