Re: decoding character encoding confusion
"Jack" <notaround@dontmail.com> wrote in message
news:1OWdnRiUhZBYwLXVnZ2dneKdnZydnZ2d@pipex.net...
This is in the document:
<meta http-equiv="content-type" charset="UTF-8">
OK, so the server is sending you UTF-8 text. So, the first thing I would do
is convert from UTF-8 to UTF-16 (using MultiByteToWideChar and, if you
want, the code snippet I posted above).
Then, after converting to UTF-16, I would do a post-processing pass, finding
the &#<...>; sequences and substituting them with the proper characters.
Note that David suggested the opposite order (first process &#...; then
convert to UTF-16); I'm not sure about that (and I would need some
experimentation), but the very first thing I would do is the UTF-16
conversion. Then I would post-process this converted UTF-16 data.
It is not hard. You may also use a regular-expression parser, like
CAtlRegExp:
Thanks, the issue for me is not so much whether it is difficult, but that
it is very time critical and if the app can get away with not having to do
it then so much the better.
I have no idea how long it takes to post-process the string, replacing each
&#...; sequence with the proper text.
For speed reasons, I would *not* modify the original string "in place";
instead I would use an out-of-place destination buffer.
That is, I would scan the source string and copy all "normal" characters
from the source to the destination buffer. Whenever a &#...; sequence is
found in the source string, I would #1) convert that sequence (instead of
copying it), and #2) copy the resulting character into the destination
buffer.
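
To make the out-of-place scan above concrete, here is a minimal sketch in
portable C++. It is only an illustration under a few assumptions, not tested
against your data: the function name DecodeNumericEntities is made up for
this example, the input is taken to be already converted to UTF-16 and held
in a std::u16string, both decimal (&#233;) and hexadecimal (&#xE9;) forms are
handled, and anything malformed is simply copied through unchanged.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (name invented for this sketch): decodes numeric
// character references such as &#233; or &#xE9; in UTF-16 text.
// Malformed or out-of-range references are copied through unchanged.
std::u16string DecodeNumericEntities(const std::u16string& src)
{
    std::u16string dst;
    dst.reserve(src.size());  // the decoded text is never longer than the input

    std::size_t i = 0;
    while (i < src.size()) {
        // "Normal" character: anything that does not start "&#" is copied.
        if (src[i] != u'&' || i + 1 >= src.size() || src[i + 1] != u'#') {
            dst.push_back(src[i++]);
            continue;
        }

        // Optional 'x' or 'X' selects hexadecimal digits.
        std::size_t j = i + 2;
        unsigned base = 10;
        if (j < src.size() && (src[j] == u'x' || src[j] == u'X')) {
            base = 16;
            ++j;
        }

        // Accumulate digits into a code point, stopping at the first non-digit.
        unsigned long cp = 0;
        std::size_t digits = 0;
        bool inRange = true;
        for (; j < src.size(); ++j) {
            const char16_t c = src[j];
            unsigned d = 0;
            if (c >= u'0' && c <= u'9') d = c - u'0';
            else if (base == 16 && c >= u'a' && c <= u'f') d = c - u'a' + 10;
            else if (base == 16 && c >= u'A' && c <= u'F') d = c - u'A' + 10;
            else break;
            cp = cp * base + d;
            ++digits;
            if (cp > 0x10FFFF) { inRange = false; break; }
        }

        const bool terminated = j < src.size() && src[j] == u';';
        const bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
        if (!inRange || digits == 0 || !terminated || cp == 0 || surrogate) {
            dst.push_back(src[i++]);  // not a valid reference: keep the '&'
            continue;
        }

        if (cp < 0x10000) {
            dst.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF need a UTF-16 surrogate pair.
            cp -= 0x10000;
            dst.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            dst.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i = j + 1;  // skip past the ';'
    }
    return dst;
}
```

Because the destination is reserved once up front and the source is walked a
single time, the cost should be roughly linear in the input length, though
you would need to measure it in your own app. On Windows, where wchar_t is a
16-bit UTF-16 code unit, the same loop should carry over to std::wstring or
a raw wchar_t buffer.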
Giovanni