"James Kanze" <>
Tue, 6 Mar 2007 04:16:47 CST
Pete Becker wrote:

Eugene Gershnik wrote:

Lourens Veen wrote:

I think of a UTF-8 string as a wide (UCS-4 or UTF-32)
string stored in a compressed format.

Which is precisely the same as any other "MBCS" encoding
people have been using for a long time.

Not quite. With UTF-8 you can always tell from the value of a
byte whether it is part of a multi-byte character. Other
encodings don't have this property, making it much more
difficult to move around (especially backwards) in a string.
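
To make the point about moving backwards concrete, here is a minimal sketch in C++. It assumes the string holds well-formed UTF-8; the names is_continuation and prev_char are made up for illustration and are not from any library.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
// Hypothetical helper name, for illustration only.
inline bool is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: given a byte index into well-formed UTF-8, return
// the index of the first byte of the character that precedes it.
// Returns 0 when pos is already 0.
std::size_t prev_char(const std::string& s, std::size_t pos)
{
    if (pos == 0)
        return 0;
    --pos;
    // Skip back over continuation bytes until a lead byte (or ASCII byte).
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos])))
        --pos;
    return pos;
}
```

Each byte is classified from its own value alone, so the backward scan never has to restart from the beginning of the string. With an encoding where a trail byte can have the same value as a single-byte character, that shortcut would not be available.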

I think a lot of other multi-byte encodings do have this
feature. What UTF-8 has that I've not seen elsewhere is the
ability to tell, in addition, whether a given byte is the
first byte of a sequence or one of the following bytes. This
makes operations like counting the number of characters very
simple (just count the bytes where (*p & 0xC0) != 0x80), and
allows guaranteed resynchronization without looking outside the

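The counting rule quoted above, and one way of resynchronizing, might be sketched in C++ as follows. This is only an illustration: it assumes well-formed UTF-8, counts code points rather than user-perceived characters, and the function names count_characters and resync are invented here.

```cpp
#include <cstddef>
#include <string>

// Count code points by counting every byte that is not a continuation
// byte, i.e. every byte where (b & 0xC0) != 0x80.
std::size_t count_characters(const std::string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i != s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// Starting from an arbitrary byte offset, advance to the next byte that
// begins a character. Returns s.size() if no such byte follows.
std::size_t resync(const std::string& s, std::size_t pos)
{
    while (pos < s.size()
           && (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80)
        ++pos;
    return pos;
}
```

For the six-byte string "a\xC3\xA9\xE2\x82\xAC" (three characters of one, two and three bytes), count_characters returns 3, and resync called at offset 4, in the middle of the last character, returns 6, the end of the string.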
