Re: How to find only one invalid char in src buffer with MultiByte

Bill
Sat, 7 Jun 2008 01:20:00 -0700
Hi Igor Tandetnik and Alex Blekhman,

    Thank you for your prompt response. I also got the same results here.

    With MB_PRECOMPOSED the call returns zero in every case. What are
precomposed characters?

    I am working on the client side of a client/server application. We are
adding Unicode support to the application while keeping backward
compatibility. For this I chose the WideCharToMultiByte and
MultiByteToWideChar APIs to process data in UTF-8 or ACP. Is that correct?

    Requirement: my application receives data (void*) from the server through
sockets. I need to identify whether this data was encoded as UTF-8 or ACP. I
am sure the server-side data was encoded as one of the two.

    I also tried the code snippet below, without success. Any clue on how
to achieve this?

code snippet:
    CHAR szData[256] = {0};
    strcpy(szData, "1é2345"); // (é has value 233 or -23)
                // if é is there more than once in szData buffer, it is getting
    INT nDataLen = strlen(szData);

    INT nDesBufferLen = ::MultiByteToWideChar(CP_UTF8,
    if (nDesBufferLen == 0) // Here it should return ZERO
        nDesBufferLen = ::MultiByteToWideChar(CP_ACP,
Thanks & Regards,

"Alex Blekhman" wrote:

"Bill" wrote:

    I am filling a char buffer with characters in the 0-127 range
along with an é character, and then the MultiByteToWideChar API
fails. If I include the é character twice, it succeeds. Please
find the code snippet below, and correct me if I am wrong. My
system settings are United States, English. VC++ 6.0, Windows

CHAR szData[100] = {0};
strcpy(szData, "1é2345");
INT nWideCharBufferLen = MultiByteToWideChar(CP_UTF8,
szData, -1, 0, 0 );

You're getting unpredictable results because you specified the wrong
code page: CP_UTF8. Your string is not a valid UTF-8 sequence. That's
why `MultiByteToWideChar' fails. Garbage in - garbage out. The 'é'
character (Latin small letter E with acute) has the value 0xE9 (or
11101001 in binary). According to the UTF-8 format, a leading byte
with a value in E0-EF (11100000-11101111) must be followed by two
more bytes, each with a value in 0x80-0xBF. In your buffer, 0xE9 is
followed by '2' (0x32), which is outside that range.
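
To make the byte rule above concrete, here is a minimal sketch in
portable C++ that walks a buffer and reports the offset of the first
byte that breaks it. FindFirstInvalidUtf8 is a hypothetical helper
name, not a Windows API, and the checks follow the published UTF-8
byte ranges; it is not a description of what `MultiByteToWideChar'
does internally.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Windows API): returns the offset of the lead
// byte of the first malformed UTF-8 sequence, or std::string::npos if the
// whole buffer is well-formed UTF-8.
std::size_t FindFirstInvalidUtf8(const std::string& data)
{
    std::size_t i = 0;
    while (i < data.size()) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);

        std::size_t trail = 0;              // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF; // range for the first of them

        if (lead <= 0x7F) {
            trail = 0;                      // plain ASCII
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            trail = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            trail = 2;
            if (lead == 0xE0) lo = 0xA0;    // rejects overlong forms
            if (lead == 0xED) hi = 0x9F;    // rejects UTF-16 surrogates
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            trail = 3;
            if (lead == 0xF0) lo = 0x90;    // rejects overlong forms
            if (lead == 0xF4) hi = 0x8F;    // rejects values above U+10FFFF
        } else {
            return i;                       // stray continuation or illegal lead
        }

        if (trail > data.size() - i - 1) {
            return i;                       // sequence cut off by end of buffer
        }
        for (std::size_t k = 1; k <= trail; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            const unsigned char min = (k == 1) ? lo : 0x80;
            const unsigned char max = (k == 1) ? hi : 0xBF;
            if (c < min || c > max) {
                return i;                   // not a valid continuation byte
            }
        }
        i += trail + 1;
    }
    return std::string::npos;
}
```

For the buffer from the question, written as "1\xE9" "2345", this
returns 1, the offset of the 0xE9 byte. For the UTF-8 encoding of the
same text, "1\xC3\xA9" "2345", it returns std::string::npos.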

You should specify the correct code page when you call
`MultiByteToWideChar', for example: CP_ACP.
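
As a rough illustration of why a single-byte code page accepts this
buffer: every byte is a complete character on its own, so there is no
sequence that can be malformed. The sketch below is a hypothetical
stand-in, not what CP_ACP does internally. It assumes ISO-8859-1
semantics, which match Windows-1252 (the usual ANSI code page on a
United States English system) except for bytes 0x80-0x9F.

```cpp
#include <string>

// Hypothetical stand-in for a single-byte code page conversion (not a
// Windows API). Assumes ISO-8859-1: byte value N becomes code point N.
// Windows-1252 agrees for 0x00-0x7F and 0xA0-0xFF but differs in 0x80-0x9F.
std::wstring WidenLatin1(const std::string& data)
{
    std::wstring wide;
    wide.reserve(data.size());
    for (std::string::size_type i = 0; i < data.size(); ++i) {
        // Go through unsigned char so that 0xE9 becomes 233 rather than -23.
        const unsigned char byte = static_cast<unsigned char>(data[i]);
        wide.push_back(static_cast<wchar_t>(byte));
    }
    return wide;
}
```

With the buffer "1\xE9" "2345" the result has six wide characters and
the second one is U+00E9. Because a conversion like this accepts
essentially any byte string, a UTF-8-or-ACP check would likely need to
try strict UTF-8 validation first and fall back to the ANSI code page
only when that fails. Data that is pure 0-127 reads the same either way.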


