Re: Understanding UNICODE
On Nov 21, 1:38 am, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
mathieu <mathieu.malate...@gmail.com> wrote in news:8e3fda98-b476-4d82-
af39-ee2e6229b...@x31g2000yqx.googlegroups.com:
Hi there,
I am trying to understand UNICODE in C++, but I fear this is
really something I do not understand. Where can I find good
documentation regarding portability (I am targeting UNIX/gcc
and Win32/cl)? In particular, I'd like to know how I can open a
std::ifstream when user input is UNICODE.
Does the following line make any sense (I know this is
not legal) ?
const char alpha[] = "á.dcm";
Is there a way to say, when I share my C++ file, that my
file is in UTF-8 ?
Not generally, but some implementations may support it. For
example, current Linux implementations use UTF-8 encoding as a
default locale, which is supported by many Linux applications
(but not by the Linux kernel itself, which does not care about
the encodings AFAIK).
Still, I would keep Unicode strings out of the source code for
now for portability, and put them into data files instead. The
data files would then be read by programs that know more
about their encoding.
As you mentioned Windows, it would be handy to know that
Windows does not support UTF-8 locales.
Really? I've used them under Windows, with no problems.
More accurately: locales do not support UTF-8, or any multibyte
encodings, in general---they are designed with the idea that
everything internal is single byte, and that large character
sets would be handled by a wchar_t (which must be at least 21
bits to handle full Unicode). The only place the encoding
enters into play is in the codecvt facet (or in the single byte
encodings for char---islower will depend on the encoding, for
example---but these do not work for multibyte encodings). So
depending on what you are doing, there are several alternatives:
-- Use wchar_t and imbue your input and output with a locale
which has a UTF-8 codecvt facet. This will work on systems
which have a wchar_t which supports Unicode, or at least all
of the Unicode characters which are of interest to you.
(This is the case for Linux, for example, and some locales
of Solaris. It's also the case for Windows IF you don't
need any characters outside of the basic multilingual plane;
if, for example, you're really only concerned with European
for example, you're really only concerned with European
languages.)
-- Use char, imbue input and output with the "C" locale, at
least for the codecvt facet, and open them in binary (which
means you can't read standard in or write to standard
error). Do the rest yourself. This is what works best for
me most of the time: typically, I only need to iterate over
the strings, looking for specific characters which are all
single byte in UTF-8, and UTF-8 was designed with support
for this as a goal. It does mean that you don't have
functions like isupper, but if you don't need them, this is
clearly a good solution.
This means that if the strings in the program are internally
UTF-8, then you have to translate them back and forth every
time you call Windows SDK functions. The good news is
that Windows fully supports Unicode, but only in UTF-16
encoding, with the wchar_t/*W versions of the SDK functions.
Linux, on the other hand, does not support UTF-16; all the SDK
interface functions are defined in terms of char only.
Yes. The interface with the system and other software you might
be using must also be considered in your choice.
--
James Kanze