Re: decoding character encoding confusion
"Jack" <notaround@dontmail.com> wrote in message
news:L42dnV6ZOr9yo7XVRVnyvQA@pipex.net...
I'm a little confused about how to convert text encodings in a file
downloaded from the Internet to memory (via InternetReadFile()).
I download into a char buffer.
The text is UTF-8 encoded (I think).
If you are sure that your text is UTF-8, I think the first thing to do is to
convert it from UTF-8 to UTF-16 as soon as you receive it.
This is because the Unicode (wide-character) Windows APIs expect UTF-16. So,
UTF-8 is fine for transmitting data, e.g. over the Internet, but UTF-16 is the
better choice for processing *inside* Windows applications.
To convert from UTF-8 to UTF-16, you can use the MultiByteToWideChar API with
the CP_UTF8 code page, and you can read and use some code of mine that I
shared on an MSDN forum:
MSDN Forums -> Visual C++ -> Visual C++ Language -> "Proeblem with some
Unicode chars"
http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=3200146&SiteID=1
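In case the forum link is not reachable, here is a minimal sketch of the usual two-call pattern with MultiByteToWideChar. This is not the code from the forum post; the function name Utf8ToUtf16 is my own choice for illustration, and I assume the downloaded buffer is smaller than 2 GB so that its size fits in an int. It is Windows-only, hence the #ifdef guard.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <stdexcept>
#include <string>

// Hypothetical helper (name chosen for this sketch): converts a UTF-8 byte
// string to UTF-16. Throws std::runtime_error if the input is not valid UTF-8.
// Assumes utf8.size() fits in an int.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    const int utf8Length = static_cast<int>(utf8.size());

    // First call: no output buffer, so the API only reports how many
    // wchar_t units the converted text needs. MB_ERR_INVALID_CHARS makes
    // the call fail on malformed UTF-8 instead of substituting characters.
    const int utf16Length = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8.data(), utf8Length,
        nullptr, 0);
    if (utf16Length == 0)
        throw std::runtime_error("Input is not valid UTF-8");

    // Second call: do the actual conversion into a buffer of that size.
    std::wstring utf16(static_cast<std::size_t>(utf16Length), L'\0');
    const int written = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8.data(), utf8Length,
        &utf16[0], utf16Length);
    if (written == 0)
        throw std::runtime_error("UTF-8 to UTF-16 conversion failed");

    return utf16;
}
#endif
```

You would call it on the bytes you got from InternetReadFile(), e.g. Utf8ToUtf16(std::string(buffer, bytesRead)), and pass the resulting std::wstring to the Unicode version of the edit control API. Passing the explicit length (instead of -1) means the buffer does not have to be NUL-terminated.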
Now, for example, when I display text from the file in an edit control I
get
"&#38;" where "&" is required (I have added the quotes).
Are you sure that your text is UTF-8 and not, for example, ISO 8859-1
(Latin-1)?
ISO 8859-1 (Latin-1)
http://www.utoronto.ca/webdocs/HTMLdocs/NewHTML/iso_table.html
I have sometimes found that text in this encoding uses HTML-style numeric
character references of the form:
&#<decimal code>;
to represent some characters, for example: the "&" symbol has decimal code
38, and so can be represented as
&#38;
So, the first thing that you must be sure about is the encoding of your text
(UTF-8? ISO 8859-1 / Latin-1?).
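If the server does not tell you the charset, one rough way to check is to test whether the bytes are well-formed UTF-8. The sketch below is only a heuristic, and the name LooksLikeUtf8 is made up for this example: pure ASCII passes for both encodings, while Latin-1 text with accented letters will usually (not always) fail the test, because a lone byte such as 0xE9 is not a complete UTF-8 sequence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` is well-formed UTF-8
// (no stray continuation bytes, no overlong forms, no surrogates,
// nothing above U+10FFFF). An empty or pure-ASCII buffer returns true.
bool LooksLikeUtf8(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {          // plain ASCII byte
            ++i;
            continue;
        }

        std::size_t extra = 0;      // continuation bytes expected
        unsigned long cp = 0;       // code point being assembled
        unsigned long minimum = 0;  // smallest value allowed for this length
        if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1; cp = lead & 0x1F; minimum = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; minimum = 0x800;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3; cp = lead & 0x07; minimum = 0x10000;
        } else {
            return false;           // 0x80..0xC1 and 0xF5..0xFF cannot start a sequence
        }

        if (n - i <= extra)
            return false;           // sequence is cut off at the end of the buffer

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;       // not a 10xxxxxx continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minimum || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;           // overlong form, out of range, or surrogate

        i += extra + 1;
    }
    return true;
}
```

If the function returns false, the text is certainly not UTF-8 and you should try another code page (e.g. 28591 for ISO 8859-1) in the conversion. If it returns true and the buffer contains non-ASCII bytes, UTF-8 is a reasonable guess.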
Assuming that you still have these &#<...>; substrings after conversion
(e.g. from UTF-8 to UTF-16), I would parse this text, searching for
occurrences of these &#...; substrings, and convert them to corresponding
characters.
It is not hard. You may also use a regular-expression parser, like
CAtlRegExp:
http://msdn.microsoft.com/en-us/library/k3zs4axe(VS.80).aspx
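If you prefer not to pull in a regular-expression class, the parsing step can also be done with a simple hand-written scan. Here is a minimal sketch, not a complete HTML decoder: the names DecodeDecimalReferences and AppendCodePoint are mine, it works on the text after conversion to std::wstring, and it handles only decimal references like &#38; (named entities such as &amp; and hexadecimal references such as &#x26; are left untouched).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends code point `cp` to `out`. When wchar_t is
// 16 bits wide (as on Windows) and cp is above U+FFFF, a UTF-16 surrogate
// pair is written instead of a single unit.
static void AppendCodePoint(std::wstring& out, unsigned long cp)
{
    if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
        cp -= 0x10000;
        out += static_cast<wchar_t>(0xD800 + (cp >> 10));
        out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
    } else {
        out += static_cast<wchar_t>(cp);
    }
}

// Hypothetical helper: replaces decimal references such as "&#38;" with the
// character they name. Anything that is not a well-formed decimal reference
// is copied through unchanged.
std::wstring DecodeDecimalReferences(const std::wstring& text)
{
    std::wstring result;
    result.reserve(text.size());

    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] == L'&' && i + 1 < text.size() && text[i + 1] == L'#') {
            // Read the digits; stop early once the value is out of range
            // so that very long digit runs cannot overflow.
            std::size_t j = i + 2;
            unsigned long cp = 0;
            while (j < text.size() && text[j] >= L'0' && text[j] <= L'9'
                   && cp <= 0x10FFFF) {
                cp = cp * 10 + static_cast<unsigned long>(text[j] - L'0');
                ++j;
            }

            const bool hasDigits = j > i + 2;
            const bool terminated = j < text.size() && text[j] == L';';
            const bool validCodePoint = cp != 0 && cp <= 0x10FFFF
                && !(cp >= 0xD800 && cp <= 0xDFFF);

            if (hasDigits && terminated && validCodePoint) {
                AppendCodePoint(result, cp);
                i = j + 1;          // skip past the ';'
                continue;
            }
        }
        result += text[i];          // ordinary character, or malformed reference
        ++i;
    }
    return result;
}
```

For example, DecodeDecimalReferences(L"Tom &#38; Jerry") gives L"Tom & Jerry". Do this step after the UTF-8 to UTF-16 conversion, so that the scan runs over characters rather than raw bytes.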
HTH,
Giovanni