Re: Understanding UNICODE

From: James Kanze <james.kanze@gmail.com>
Newsgroups: comp.lang.c++
Date: Sun, 22 Nov 2009 02:57:00 -0800 (PST)
Message-ID: <2a1ef89e-14dd-4c4d-bb68-2f5f8dae959e@b15g2000yqd.googlegroups.com>
On Nov 21, 1:38 am, Paavo Helde <myfirstn...@osa.pri.ee> wrote:

mathieu <mathieu.malate...@gmail.com> wrote in news:8e3fda98-b476-4d82-
af39-ee2e6229b...@x31g2000yqx.googlegroups.com:

Hi there,

I am trying to understand Unicode in C++, but I fear this is
really something I do not understand. Where can I find good
documentation regarding portability (I am targeting UNIX/gcc
and Win32/cl)? In particular, I'd like to know how I can open a
std::ifstream when the user input is Unicode.

  Does the following line make any sense (I know this is
  not legal)?

  const char alpha[] = "á.dcm";

Is there a way to say, when I share my C++ file, that the
file is encoded in UTF-8?


Not generally, but some implementations may support it. For
example, current Linux distributions use a UTF-8 locale by
default, and that locale is honored by many Linux applications
(but not by the Linux kernel itself, which does not care about
encodings, AFAIK).

Still, for portability I would keep Unicode strings out of the
source code for now and put them into data files instead. The
data files can then be read by programs that know their
encoding.

Since you mentioned Windows, it is worth knowing that Windows
does not support UTF-8 locales.


Really? I've used them under Windows, with no problems.

More accurately: locales do not support UTF-8, or any other
multibyte encoding, in general; they are designed around the
idea that everything internal is single byte, and that large
character sets are handled by wchar_t (which must be at least
21 bits wide to cover all of Unicode). The only place the
encoding comes into play is in the codecvt facet (or in the
single-byte encodings for char: islower will depend on the
encoding, for example, but these do not work for multibyte
encodings). So depending on what you are doing, there are
several alternatives:

 -- Use wchar_t and imbue your input and output with a locale
    which has a UTF-8 codecvt facet (see the first sketch after
    this list). This will work on systems whose wchar_t supports
    Unicode, or at least all of the Unicode characters which are
    of interest to you. (This is the case for Linux, for
    example, and for some locales on Solaris. It's also the case
    for Windows IF you don't need any characters outside of the
    basic multilingual plane; if, for example, you're really
    only concerned with European languages.)

 -- Use char, imbue input and output with the "C" locale, at
    least for the codecvt facet, and open them in binary (which
    means you can't read standard in or write to standard
    error). Do the rest yourself (see the second sketch after
    this list). This is what works best for me most of the time:
    typically, I only need to iterate over the strings, looking
    for specific characters which are all single byte in UTF-8,
    and UTF-8 was designed with support for this as a goal. It
    does mean that you don't have functions like isupper, but if
    you don't need them, this is clearly a good solution.

This means that if the strings in the program are internally
UTF-8, you have to translate them back and forth every time you
call a Windows SDK function. The good news is that Windows
fully supports Unicode, but only in the UTF-16 encoding, through
the wchar_t/*W versions of the SDK functions. Linux, on the
other hand, does not support UTF-16; all the system interface
functions are defined in terms of char only.


Yes. The interface with the system and other software you might
be using must also be considered in your choice.
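
For the Windows case above, something along these lines will
convert a UTF-8 string into the UTF-16 that the *W functions
expect; a rough sketch, with error handling omitted
(WideCharToMultiByte does the reverse conversion):

    #include <windows.h>
    #include <string>
    #include <vector>

    // Convert UTF-8 to the UTF-16 wchar_t strings that the *W
    // SDK functions expect.  Error handling omitted for brevity.
    std::wstring
    toUtf16( std::string const& utf8 )
    {
        if ( utf8.empty() ) {
            return std::wstring();
        }
        int len = MultiByteToWideChar( CP_UTF8, 0, utf8.data(),
                                       static_cast< int >( utf8.size() ),
                                       NULL, 0 );
        std::vector< wchar_t > buffer( len );
        MultiByteToWideChar( CP_UTF8, 0, utf8.data(),
                             static_cast< int >( utf8.size() ),
                             &buffer[ 0 ], len );
        return std::wstring( buffer.begin(), buffer.end() );
    }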

--
James Kanze
