Re: Want Input boxes to accept unicode strings on Standard Window

From: "David Ching" <dc@remove-this.dcsoft.com>
Newsgroups: microsoft.public.vc.mfc
Date: Wed, 25 Jul 2007 14:11:33 GMT
Message-ID: <pSIpi.27972$2v1.25586@newssvr14.news.prodigy.net>

"David Wilkinson" <no-reply@effisols.com> wrote in message
news:eJWXQErzHHA.4712@TK2MSFTNGP04.phx.gbl...

David Ching wrote:

Ah, UTF-8. I know you discussed this at length several months ago here,
but to be honest, this is my understanding of it: it is an 8-bit
encoding scheme no different than Ansi (that's how it fits in 8 bits).
Since it is 8 bits, it cannot specify everything an LPWSTR can. Yet it is
somehow supposed to be better than Ansi, not reliant on any codepage.
But if it's only 8 bits, how is that?

And UTF-8 raises the question about UTF-16. Is UTF-16 the same as what
Windows Notepad (in the Save As dialog) calls "Unicode"? Or is Windows'
concept of Unicode and LPWSTR different from UTF-16?

David:

Both UTF-8 and UTF-16 are complete encodings of Unicode. UTF-8 uses up to
four 8-bit code units per code point, and UTF-16 uses up to two 16-bit
code units.

Yes, thanks. For some reason I had thought UTF-8 was SBCS (since it was 8
bits) and not MBCS. Even an Ansi codepage can be MBCS, so UTF-8 and Ansi
are really different schemes for the same idea. Makes sense now! :-)
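
For anyone following along, here is a minimal sketch of where the "up to
four" and "up to two" limits come from. It is not taken from either
poster's code. The function names encode_utf8 and encode_utf16 are made up
for illustration, and it assumes a C++11 compiler (for char16_t and
std::u16string).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return out;
    if (cp < 0x80) {                       // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 11110xxx + three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: encode one code point as UTF-16 (1 or 2 units).
std::u16string encode_utf16(std::uint32_t cp)
{
    std::u16string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return out;
    if (cp < 0x10000) {                    // BMP: one 16-bit unit
        out += static_cast<char16_t>(cp);
    } else {                               // beyond BMP: surrogate pair
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
    return out;
}
```

With this sketch, U+0041 takes one byte in UTF-8, U+20AC (the euro sign)
takes three bytes in UTF-8 but a single UTF-16 unit, and anything above
U+FFFF takes four bytes in UTF-8 and two units in UTF-16.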

When "Windows Unicode" first started out, all code points could be
represented by one 16-bit code unit, but that is no longer the case. Modern
Windows Unicode *is* UTF-16. The Windows ANSI code pages are (I think) all
single-byte or DBCS, so UTF-8, which needs up to four bytes per character,
cannot be used as an ANSI code page (at any rate, it is not the ANSI code
page for any language).

Some say, and I agree, that now that there are surrogate pairs in UTF-16,
it holds no advantage over UTF-8.

Not to offend anyone, but I recently developed a small product in 30
languages. The languages were selected to match the ones where Windows had
a native SKU. UTF-16 was fine for this, we never worried about surrogate
pairs. I had understood that surrogate pairs were only used for a few rare
Han characters in Chinese, and perhaps a couple of other languages, but
they weren't mainstream by any means. How long before UTF-16 *really* does
not work for all practical purposes?
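
One way to check whether a given body of text lets you ignore surrogate
pairs is to scan it for them. The following is only a sketch, with made-up
function names and a C++11 compiler assumed. It treats a high surrogate
followed by a low surrogate as one code point and counts anything else as
one code point per unit.

```cpp
#include <cstddef>
#include <string>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// True if the string contains at least one well-formed surrogate pair,
// i.e. at least one code point outside the Basic Multilingual Plane.
bool has_surrogate_pairs(const std::u16string& s)
{
    for (std::size_t i = 0; i + 1 < s.size(); ++i)
        if (is_high_surrogate(s[i]) && is_low_surrogate(s[i + 1]))
            return true;
    return false;
}

// Number of code points, which can be smaller than s.size().
// Unpaired surrogates are counted as one code point each.
std::size_t count_code_points(const std::u16string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1]))
            ++i;                           // skip the low half of the pair
        ++n;
    }
    return n;
}
```

If has_surrogate_pairs never returns true for the text a product handles,
then size() and the code point count agree, and treating each 16-bit unit
as a character is safe for that text.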

Many Linux systems use UTF-8 as their native encoding, but this will never
happen in Windows.

The way you've explained UTF-8, it has all the disadvantages of MBCS (in
fact it is an MBCS) and is thus very hard to parse. I'm not sure why any
modern OS would want to be built internally on it.
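
To make the parsing question concrete, here is a rough sketch of what
stepping through UTF-8 involves. The function names are invented for
illustration, and the code assumes well-formed input and a C++11 compiler.
It relies on one property of UTF-8 that the DBCS code pages do not all
share: continuation bytes always have the form 10xxxxxx, so they can never
be mistaken for a lead byte or an ASCII character.

```cpp
#include <cstddef>
#include <string>

// Continuation bytes are always 10xxxxxx.
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Sequence length implied by a lead byte, or 0 if the byte cannot start
// a sequence. This sketch does not reject overlong or out-of-range leads.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Count code points by counting every byte that is not a continuation.
std::size_t utf8_count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char b : s)
        if (!is_continuation(b))
            ++n;
    return n;
}

// From any byte offset (pos < s.size()), back up to the first byte of
// the character that contains it.
std::size_t utf8_char_start(const std::string& s, std::size_t pos)
{
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos])))
        --pos;
    return pos;
}
```

Whether this counts as "hard" is a judgment call. It is more work than
indexing a fixed-width array, but it needs no lookup table and no scan from
the start of the string.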

This does not mean that a Windows program cannot use UTF-8 internally. In
fact the whole back end of my application uses UTF-8. XML serialization is
just one of the things this back end does.

I take it STL string is UTF-8 friendly? ;) Seriously, what library do you
use to represent UTF-8 in memory? I understood STL string (often typedef'd
to be tstring) is just a UTF-16 string like CStringW. I did not see any
UTF-8 capable string that is widespread. What are you using?

Thanks,
David