Re: STL, UTF8, and CodeCvt

From:

"James Kanze" <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++.moderated

Date:

Tue, 6 Mar 2007 04:16:47 CST

Message-ID:

<1173174041.164643.252010@n33g2000cwc.googlegroups.com>

Pete Becker wrote:

Eugene Gershnik wrote:

Lourens Veen wrote:

I think of a UTF-8 string as a wide (UCS-4 or UTF-32)
string stored in a compressed format.

Which is precisely the same as any other "MBCS" encodings
people have been using for a long time.

Not quite. With UTF-8 you can always tell from the value of a
byte whether it is part of a multi-byte character. Other
encodings don't have this property, making it much more
difficult to move around (especially backwards) in a string.

I think a lot of other multi-byte encodings do have this
feature. What UTF-8 has that I've not seen elsewhere is the
ability to tell, in addition, whether a given byte is the
first byte of a sequence or one of the following bytes. This
makes operations like counting the number of characters very
simple (just count the bytes where (*p & 0xC0) != 0x80), and
allows guaranteed resynchronization without looking outside the
character.
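
The two operations described above can be sketched roughly as
follows. This is only an illustration, assuming the buffer holds
well-formed UTF-8; the names is_continuation, count_code_points and
sync_to_start are hypothetical and do not come from the thread or
from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for a continuation byte, i.e. one with the bit pattern 10xxxxxx.
// (Hypothetical helper name, chosen for this sketch.)
inline bool is_continuation(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

// Counts characters (code points) by counting every byte that is NOT a
// continuation byte. Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) {
            ++n;
        }
    }
    return n;
}

// Resynchronizes from an arbitrary byte offset: steps backwards until the
// byte at pos is not a continuation byte, and returns that offset. Only
// bytes belonging to the current character are examined (for well-formed
// input, at most three steps back).
std::size_t sync_to_start(const std::string& s, std::size_t pos) {
    assert(pos < s.size());
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return pos;
}
```

For example, with the six-byte string "a\xC3\xA9\xE2\x82\xAC" ('a',
a two-byte character and a three-byte character),
count_code_points sees three non-continuation bytes, and
sync_to_start called with offset 5 walks back to offset 3, the
first byte of the last character.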

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
                    Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
