Re: Acquiring UTF-8 string length
"Ulrich Eckhardt" wrote:
count of code points = 6, obtained by strlen()
Wrong. strlen() only returns the number of chars up to the first NUL char.
The number of codepoints is four, plus the terminating NUL.
Just tested: the returned length is six (6) chars. I am looking right at
Microsoft's implementation of the CRT's strlen, and it does no validation or
decoding of code points; it simply reports the number of 1-byte units up to
the terminating NUL.
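#include <string.h>   // strlen
#include <windows.h>
#include <strsafe.h>  // StringCchLengthA

// "I{heart}NY": U+2665 is encoded in UTF-8 as the three bytes E2 99 A5
char str[] = "I\xE2\x99\xA5NY";
// len becomes 6: strlen counts bytes up to the NUL, not code points
size_t len = strlen(str);
// len becomes 6 again: StringCchLengthA also counts chars (bytes)
HRESULT hr = StringCchLengthA(
    str,
    sizeof(str),  // buffer size in chars, including the terminating NUL
    &len);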
For fixed-width character sets (ANSI) this makes sense, but I want a function
that supports variable-length encodings, which include UTF-8 and UTF-16. The
only way I've found to do this is to call CharNextExA in a loop, but I haven't
tested that sufficiently and it doesn't provide all the information I want.
I'll take a look at the ICU library you referenced, since I am not looking
forward to rolling my own encoder/decoder.
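
For what it's worth, counting code points alone should not need a full
decoder. Below is a minimal sketch of the idea, assuming the input is already
well-formed (nothing is validated). The helper names are made up for
illustration and are not part of the CRT, Win32, or ICU. In UTF-8 every byte
of the form 10xxxxxx is a continuation byte, and in UTF-16 every unit in
0xDC00..0xDFFF is a trailing surrogate, so skipping those and counting the
rest gives the number of code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts code points in a NUL-terminated UTF-8 string.
   Assumes valid UTF-8; malformed input gives a meaningless count. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        /* Continuation bytes look like 10xxxxxx; any other byte
           starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

/* Hypothetical helper: counts code points in a 0-terminated UTF-16 string.
   Assumes well-formed surrogate pairs. */
size_t utf16_count_code_points(const uint16_t *s)
{
    size_t count = 0;
    for (; *s != 0; ++s) {
        /* Trailing (low) surrogates are 0xDC00..0xDFFF; they complete
           a pair that was already counted at its leading surrogate. */
        if ((*s & 0xFC00) != 0xDC00)
            ++count;
    }
    return count;
}
```

With the "I{heart}NY" bytes from the test above, utf8_count_code_points
returns 4 where strlen returns 6. This counts code points only, not
user-perceived characters (combining sequences still count separately), which
is one reason a library such as ICU is probably still the better choice.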
Thanks