Re: How to find only one invalid char in src buffer with MultiByte
"Bill" <Bill@discussions.microsoft.com> wrote in message
news:CEC01EEB-2A84-4BDF-8BE6-2FD284F29160@microsoft.com
MB_PRECOMPOSED gives ZERO for all. What are precomposed characters?
There are two ways to represent, say, 'é' in Unicode: as a single
character U+00E9 (Latin Small Letter E With Acute) or as a combination
U+0065 U+0301 (Latin Small Letter E / Combining Acute Accent).
MB_PRECOMPOSED prefers the first form whenever possible, MB_COMPOSITE
prefers the second form.
Neither flag can be used when converting from UTF-8 (passing either one
with CP_UTF8 makes the call fail with ERROR_INVALID_FLAGS and return
zero), since UTF-8 is already a Unicode encoding: MultiByteToWideChar
just uses whatever form is encoded in the original string.
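
To make the two forms concrete, here is a small sketch. It is in Python rather than Win32 C++ purely for brevity, and it uses Unicode normalization (NFC/NFD) as a stand-in for the precomposed/composite distinction; that is an analogy, not a claim that the MB_ flags implement exactly those normalization forms. The variable names are made up for illustration.

```python
import unicodedata

precomposed = "\u00e9"        # e-acute as one code point
decomposed = "\u0065\u0301"   # plain e followed by combining acute accent

# The two strings render the same but are different code point sequences.
same_code_points = (precomposed == decomposed)

# Normalization converts between the two forms.
to_precomposed = unicodedata.normalize("NFC", decomposed)
to_decomposed = unicodedata.normalize("NFD", precomposed)

# UTF-8 encodes whichever form it is given; decoding returns it unchanged.
utf8_precomposed = precomposed.encode("utf-8")   # 2 bytes
utf8_decomposed = decomposed.encode("utf-8")     # 3 bytes
round_trip = utf8_decomposed.decode("utf-8")     # still the decomposed form

print(same_code_points, utf8_precomposed.hex(), utf8_decomposed.hex())
```

The point is the last three lines: a UTF-8 decoder has no reason to choose a form, because the byte stream already says which one it is.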
I think you should read this article, to clear the confusion:
http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html
Requirement: my application receives data (void*) from server
through sockets. I need to identify whether this data was encoded by
UTF-8 or ACP. I am sure, server side data was encoded by either UTF-8
or ACP.
First, ACP is not a specific code page. It stands for "ANSI Code
Page", and means the system default code page that happens to be
configured on this computer. The user can change it at any time, through
Control Panel | Regional and Language Options | Advanced | Language for
non-Unicode Programs. Naturally, two machines may be configured
differently, meaning that CP_ACP on one is not the same code page as
CP_ACP on the other.
So saying "I received a string from another machine encoded in ACP" is
meaningless. Which ACP do you mean - yours or the server's? And if the
server's, how do you know what code page is configured as ACP there?
Second, it is impossible, in general, to distinguish between UTF-8 and a
legacy code page just by looking at the string. The same sequence of
bytes can be interpreted as a valid UTF-8 string and as a valid (but
different) string in a legacy code page. For example, the sequence of two
bytes C3 A9 is a valid UTF-8 encoding of the character é (U+00E9, Latin
Small Letter E With Acute). At the same time, if interpreted according
to code page 1252 (Windows Western, the one used by the US English version
of Windows by default), it stands for "Ã©" (U+00C3, Latin Capital Letter
A with Tilde / U+00A9, Copyright Sign). And in code page 1251 (Windows
Cyrillic), it stands for "Г©" (U+0413, Cyrillic Capital Letter Ghe /
U+00A9, Copyright Sign). Neither of the interpretations is "more valid"
than any other.
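
You can check the C3 A9 example yourself. A minimal sketch, again in Python for brevity, assuming its "cp1252" and "cp1251" codecs match the Windows code pages of the same numbers for these two bytes:

```python
data = bytes([0xC3, 0xA9])

# The same two bytes decode without error under all three encodings.
as_utf8 = data.decode("utf-8")      # one character
as_cp1252 = data.decode("cp1252")   # two characters
as_cp1251 = data.decode("cp1251")   # two characters

for name, text in (("utf-8", as_utf8), ("cp1252", as_cp1252), ("cp1251", as_cp1251)):
    # Print code points instead of glyphs so the console encoding does not matter.
    print(name, [f"U+{ord(c):04X}" for c in text])
```

None of the three calls raises an error, so a successful conversion tells you nothing about which encoding the sender actually used.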
Whenever you transmit text between two machines, the transmission
protocol should allow them to agree on encoding, either implicitly (e.g.
the protocol specifies that the text is always in UTF-8) or explicitly,
through some form of metadata (e.g. in HTTP the server may send a header
like this: Content-Type: text/html; charset=UTF-8). Your protocol needs
to do that, too.
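
As a rough illustration of the explicit approach, the sketch below prefixes each message with the name of its encoding. The frame layout and the function names pack_text and unpack_text are invented for this example; they are not part of any real protocol, and your own wire format will look different.

```python
import struct

def pack_text(text, charset="utf-8"):
    """Hypothetical frame: 1-byte name length, ASCII charset name, encoded payload."""
    name = charset.encode("ascii")
    return struct.pack("B", len(name)) + name + text.encode(charset)

def unpack_text(frame):
    """Read the charset name from the frame, then decode the payload with it."""
    name_len = frame[0]
    charset = frame[1:1 + name_len].decode("ascii")
    payload = frame[1 + name_len:]
    return payload.decode(charset)

# The payload bytes differ per encoding, but the receiver never has to guess.
frame_utf8 = pack_text("caf\u00e9", "utf-8")
frame_1252 = pack_text("caf\u00e9", "cp1252")
print(unpack_text(frame_utf8) == unpack_text(frame_1252))
```

The simpler alternative is the implicit one: declare that the protocol always carries UTF-8 and convert on both ends.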
--
With best wishes,
Igor Tandetnik
With sufficient thrust, pigs fly just fine. However, this is not
necessarily a good idea. It is hard to be sure where they are going to
land, and it could be dangerous sitting under them as they fly
overhead. -- RFC 1925