Re: sizeof peculiarity ?

From: Tim Roberts <timr@probo.com>
Newsgroups: microsoft.public.vc.language
Date: Tue, 27 Jun 2006 22:10:22 -0700
Message-ID: <mi24a25v3a5m4e4t799ln8genh7onebkmo@4ax.com>

cristi <cristi@discussions.microsoft.com> wrote:

Microsoft's compilers have always supported an extension allowing for
2-byte and 4-byte character literals:

unsigned short ab = 'AB'; // hex value 0x4142
unsigned int abcd = 'ABCD'; // hex value 0x41424344

I don't think you can complain about the compiler's behavior in this case.
0001FB94 is not a valid Unicode code point, so there is no way to
determine whether it maps to one or more characters in the current
character set.

That sounds very interesting and useful to me.

Why is \U0001FB94 not a valid Unicode code point?

Because it does not represent any character. Not every value in the
Unicode code space is actually assigned to a character. As near as I can
tell, none of the code points in 1FBxx are defined.

An ANSI string cannot contain Unicode characters. When you embed a Unicode
character in a non-Unicode string, as you have done, the compiler has to
translate that to ANSI in some way, based on the code page currently in use
(I believe). With some code pages, Japanese characters CAN be represented
in an 8-bit string, but only as multibyte sequences. Thus, a single Unicode
escape sequence in an ANSI string might map to more than one byte.

In order for the compiler to know that, it has to know exactly which real
character the Unicode code point represents. \U0001FB94 does not map to
any real character, so there is no criterion the compiler can use to decide
how wide the equivalent 8-bit representation would be. Hence, you get
garbage.
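To illustrate how one code point can expand to several bytes in an 8-bit string, here is a minimal sketch. It uses UTF-8 purely as a stand-in for a multibyte ANSI code page, which is an assumption on my part and NOT what the compiler does: a real code page conversion is table driven (on Windows it would go through something like WideCharToMultiByte), which is exactly why the compiler needs to know which character it is dealing with. The helper name to_utf8 is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
// UTF-8 is only a stand-in here for a multibyte ANSI code page.
std::string to_utf8(std::uint32_t cp) {
    // Reject values outside the code space and the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                    // 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {            // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {          // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                            // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling to_utf8(0x41).size() gives 1, while to_utf8(0x3042).size() (HIRAGANA LETTER A) gives 3. With UTF-8 the width follows from arithmetic alone; with a code page it has to be looked up per character.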

It identifies a
Unicode character outside the BMP, and it is also defined by the C++
standard. Probably Visual C does not consider it a valid one because
wchar_t is only 2 bytes long.

The FORMAT of the escape sequence is defined. The MEANING of escape
sequences that are not part of Unicode 4.0 is not defined.

Also, remember that you were NOT defining a wide character constant. You
defined a NARROW (8-bit) constant, using a Unicode escape sequence. If you
had defined them as Unicode constants, you would have received very
different results. You had:
printf( "6. size : %d\n", sizeof('\U0001FB94') );
if you had tried this:
printf( "6. size : %d\n", sizeof(L'\U0001FB94') );
you would have seen that ALL of the constants were 2 bytes in size (which,
it occurs to me, is incorrect in the 1FB94 case).
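The narrow/wide distinction can be checked directly. The following is a small sketch, assuming the code is compiled as C++ (in C, a character literal has type int, so sizeof('A') equals sizeof(int) instead). The function name print_literal_sizes is hypothetical, and plain 'A' is used rather than the \U escape so that the result does not depend on the code page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: prints the sizes of a narrow and a wide character literal.
void print_literal_sizes() {
    // In C++ a narrow character literal has type char, so its size is 1.
    const std::size_t narrow = sizeof('A');
    // A wide character literal has type wchar_t (2 bytes with Visual C++).
    const std::size_t wide = sizeof(L'A');
    assert(narrow == 1);
    assert(wide == sizeof(wchar_t));
    std::printf("narrow: %u, wide: %u\n",
                static_cast<unsigned>(narrow), static_cast<unsigned>(wide));
}
```

The wide size is whatever sizeof(wchar_t) is on the platform, so the printed number differs between compilers.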

I thought it was OK to use such a universal character name in a wide
string literal. The following piece of code:

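#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* wcslen returns size_t; cast it to match the %d format */
    printf( "length: %d\n", (int)wcslen(L"A\U0001FB94") );
    return 0;
}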

shows a length of 3. I didn't look at the encoding. But, having read
other documents specifying that Win32 is UTF-16 (and seeing that we
can use Japanese/Chinese characters), I thought the compiler encodes
all wide string literals in UTF-16.

Yes. The compiler encodes your string as the three UTF-16 words 0x0041
0xD83E 0xDF94. The D800-DFFF range is reserved for use in UTF-16, to
represent code points that do not fit in 16 bits. The D83E/DF94 pair is
the UTF-16 representation of 0001FB94. In this case, the compiler has done
exactly the right thing: the string L"A\U0001FB94" DOES contain three
16-bit units in UTF-16.
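The surrogate arithmetic is easy to reproduce by hand. Here is a minimal sketch of the standard UTF-16 encoding rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The helper name to_utf16 is hypothetical and not part of any library.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encodes one Unicode code point as UTF-16 code units.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    // Reject values outside the code space and lone surrogates.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // BMP code points are stored as a single 16-bit unit.
        return { static_cast<std::uint16_t>(cp) };
    }
    const std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    const std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    return { high, low };
}
```

Calling to_utf16(0x1FB94) returns two units, and to_utf16(0x41) returns one, which together account for the length of 3 that wcslen reports for L"A\U0001FB94" when wchar_t is 16 bits wide.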

Does all this mean that we cannot use Japanese/Chinese characters (or
characters outside the BMP) in wide string/character literals?

No, no, no!! Remember that you were *NOT* defining a wide character
literal! You were using a Unicode escape sequence in a NARROW character
literal. 'A' is a narrow character literal. L'A' is a wide character
literal.

Further, you were using an undefined character. If you used a Japanese
character outside of BMP that had an encoding in your current code page,
the compiler would properly produce a narrow character literal for it.
--
- Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.