Re: Acquiring UTF-8 string length

From: Ulrich Eckhardt <eckhardt@satorlaser.com>
Newsgroups: microsoft.public.vc.language
Date: Wed, 04 Apr 2007 09:59:35 +0200
Message-ID: <8g2ce4-rj4.ln1@satorlaser.homedns.org>

Tim Roberts wrote:

"Igor Tandetnik" <itandetnik@mvps.org> wrote:

Alexander Nickolov <agnickolov@mvps.org> wrote:

MultiByteToWideChar will tell you how many UTF-16 words
you need to represent the string in UTF-16 - not how many
UNICODE codepoints it contains. Any codepoint above 0xffff
will require a surrogate pair, thus bump the result by one.

Well, the question is, again, what do you need this length for. A length
in Unicode codepoints is largely useless.

Igor, with all due respect, I don't understand the attitude you've shown
in this whole thread. What he's asking is perfectly reasonable. Despite
the fact that his "I<heart>NY" string contains six bytes, if it were
printed to a UTF-8 console it would only occupy four character positions.
Why wouldn't I want a way to get that information?

Sorry, but Igor is right and you are assuming something that isn't true.
The point is that a single glyph/letter/character can require more than one
codepoint, e.g. when you have a letter with an accent. Further, some
letters exist in a precomposed form ("LATIN SMALL LETTER O WITH CIRCUMFLEX",
U+00F4) and a decomposed form ("LATIN SMALL LETTER O" followed by
"COMBINING CIRCUMFLEX ACCENT", U+006F U+0302). One contains one codepoint,
the other two, yet both are displayed identically. They are canonically
equivalent, so they also compare equal in Unicode-aware string comparisons
(comparison after normalization, or collation).
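
To make the one-versus-two codepoints point concrete, here is a minimal
sketch in C++. The function name count_code_points and the two sample
strings are made up for illustration, and the input is assumed to be
well-formed UTF-8 (nothing is validated):

```cpp
#include <cstddef>
#include <string>

// Counts Unicode codepoints in a string assumed to hold well-formed UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte
// starts a new codepoint.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX, precomposed: bytes C3 B4.
const std::string precomposed = "\xC3\xB4";
// U+006F 'o' followed by U+0302 COMBINING CIRCUMFLEX ACCENT: bytes 6F CC 82.
const std::string decomposed = "o\xCC\x82";
```

count_code_points(precomposed) gives 1 and count_code_points(decomposed)
gives 2, although both strings render as the same single letter.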

As far as I know, calling mbstowcs is the only way to do that.

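It might, but if the target type is a 16-bit wchar_t, as on Windows, and
the text contains characters outside the Basic Multilingual Plane, you end
up with surrogate pairs, i.e. you get four wchar_t that possibly only
occupy two characters/letters/glyphs on the screen. Scripts like Thai add
combining marks on top of that, so several wchar_t can form one glyph even
without surrogates.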
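
As a rough illustration of how the wchar_t count can drift away from what
ends up on the screen, the following sketch computes how many 16-bit units
a UTF-8 string needs as UTF-16. The name count_utf16_units is hypothetical,
and the input is again assumed to be well-formed UTF-8:

```cpp
#include <cstddef>
#include <string>

// Number of 16-bit code units needed to hold a string as UTF-16, assuming
// the input is well-formed UTF-8. A lead byte of the form 11110xxx starts
// a four-byte sequence, i.e. a codepoint above U+FFFF, which needs a
// surrogate pair (two units). Every other lead byte maps to one unit.
std::size_t count_utf16_units(const std::string& utf8) {
    std::size_t units = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) == 0x80) {
            continue;  // continuation byte, already accounted for
        }
        units += ((byte & 0xF8) == 0xF0) ? 2 : 1;
    }
    return units;
}
```

For U+1D11E MUSICAL SYMBOL G CLEF ("\xF0\x9D\x84\x9E") this yields two
units for one codepoint. For the Thai sequence U+0E01 U+0E34
("\xE0\xB8\x81\xE0\xB8\xB4") it also yields two units, one per codepoint,
although the vowel sign is drawn above the consonant.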

The real problem here is that it is still unknown what the OP wants to
achieve. If you don't know what length you want to measure, you might as
well always use 42. Only if you know what is required and intended can you
take the correct measure, i.e. codepoints, bytes, chars/wchar_ts or glyphs.
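
To show how far these measures can diverge for one and the same string,
here is a sketch that computes them side by side. Everything in it is an
assumption for illustration: the struct Lengths, the function measure, and
in particular approx_glyphs, which merely skips the combining marks in
U+0300..U+036F. Real glyph or grapheme counting needs Unicode data tables
(for example via a library such as ICU) and is not attempted here. The
input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

struct Lengths {
    std::size_t bytes = 0;          // UTF-8 code units (chars)
    std::size_t code_points = 0;    // Unicode codepoints
    std::size_t utf16_units = 0;    // 16-bit wchar_t units, as on Windows
    std::size_t approx_glyphs = 0;  // codepoints outside U+0300..U+036F
};

// Decodes the string once and collects all four measures.
// No validation: malformed input produces meaningless numbers.
Lengths measure(const std::string& utf8) {
    Lengths r;
    r.bytes = utf8.size();
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 1;
        unsigned long cp = lead;
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        // Fold in the six payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        i += len;
        ++r.code_points;
        r.utf16_units += (cp > 0xFFFF) ? 2 : 1;  // surrogate pair above BMP
        if (cp < 0x0300 || cp > 0x036F) {
            ++r.approx_glyphs;                   // not a combining diacritic
        }
    }
    return r;
}
```

For the "I<heart>NY" string ("I\xE2\x99\xA5NY") this reports 6 bytes,
4 codepoints, 4 UTF-16 units and 4 approximate glyphs. For "o\xCC\x82" it
reports 3 bytes, 2 codepoints, 2 UTF-16 units and 1 approximate glyph.
Which of these numbers is "the length" depends entirely on what it is
needed for.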

Uli