Re: stl, iostream and wchar_t

From: Ulrich Eckhardt <eckhardt@satorlaser.com>
Newsgroups: comp.lang.c++.moderated
Date: Fri, 11 Jul 2008 14:18:05 CST
Message-ID: <4opjk5-kgg.ln1@satorlaser.homedns.org>

kmmx wrote:

> Can anyone tell me what the proper method to write a unicode string to
> a file is, using streams?

No. The first thing you will have to decide is what file format you
want. For Unicode texts, there are in particular the various UTF-x
encodings, which support the full Unicode range of codepoints. I would
suggest UTF-8.
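
To make concrete what UTF-8 does with a code point, here is a minimal sketch
of the encoding rules. The function name encode_utf8 is made up for this
example and does not come from any library; it handles one code point at a
time and simply returns an empty string for values that cannot be encoded.

```cpp
#include <string>

// Encode a single Unicode code point as a UTF-8 byte sequence.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out; // surrogates are not valid code points to encode
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x20AC) yields the three bytes E2 82 AC for the Euro
sign, while plain ASCII characters stay single bytes.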

> some quick sample code:
> typedef std::basic_ofstream<TCHAR, std::char_traits<TCHAR> >
> _tofstream;

First thing here: 'TCHAR' is a volatile type, in that its real type depends
on a macro (_UNICODE). Further, it is not part of C++ but rather part of
the win32 API. The problem with this type is that it changes definition, so
it is already almost impossible to answer your question because the real
type is unknown.
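
Roughly, the mechanism looks like the following sketch. The names
tchar_sketch, TEXT_SKETCH and tstring_sketch are hypothetical stand-ins that I
made up so the snippet compiles anywhere; the real definitions live in the
win32 headers and differ in detail.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for TCHAR and _T(): the character type and the
// literal prefix both switch depending on whether _UNICODE is defined.
#ifdef _UNICODE
typedef wchar_t tchar_sketch;
#define TEXT_SKETCH(s) L##s
#else
typedef char tchar_sketch;
#define TEXT_SKETCH(s) s
#endif

// A string type built on the build-dependent character type.
typedef std::basic_string<tchar_sketch> tstring_sketch;

// Size in bytes of one character of the build-dependent type.
inline std::size_t tchar_sketch_size()
{
    return sizeof(tchar_sketch);
}
```

The same source line, e.g. tstring_sketch s = TEXT_SKETCH("FOOBAR");, therefore
declares a std::string in one build and a std::wstring in another.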

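One more thing: adding underscores doesn't actually increase readability, and
some names with underscores are even reserved for the implementation (for
example, any name beginning with an underscore in the global namespace). I'd
rather not do this.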

> _tofstream s(_T("filename.txt"));

Note: if TCHAR is wchar_t, this uses a non-standard extension, i.e. an
fstream taking a wchar_t string in the constructor. The standard one always
only takes a char string.

> wchar_t* msg = _T("FOOBAR");

This is definitely wrong. A TCHAR is not a wchar_t, or at least not reliably
so. What is wrong with writing either of these:

wchar_t const* msg = L"FOOBAR";
TCHAR const* tmsg = _T("FOOBAR");

Also make it a habit to _NEVER_ use the conversion of a string literal to a
non-const pointer; you may not use the pointer to modify the content
anyway.

> This works fine as long as I only write ANSI characters. As soon as I
> write unicode:
>
> wchar_t* msg = _T("??");
>
> we run into problems.

The correct way is to imbue the stream with the correct codecvt facet, as
the article below partially demonstrates. Note that you should be able to
easily locate a suitable codecvt facet for UTF-8 (e.g. from Boost). An
alternative is to convert strings in memory and then use a normal
char-based string with a non-converting codecvt facet, e.g. from the "C"
locale. Note that you can also use a char-based stream with a codecvt facet
to read or write Unicode files, only that you are limited in what you can
represent internally.
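
As a minimal sketch of the first approach: the facet used here,
std::codecvt_utf8, is an assumption on my part. It was only added in C++11 and
is deprecated since C++17, so with an older or a very new compiler you would
substitute a facet from elsewhere (e.g. Boost). The helper names write_utf8 and
read_bytes are made up for this example.

```cpp
#include <codecvt>   // std::codecvt_utf8 (C++11, deprecated since C++17)
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Write a wide string to 'path' as UTF-8. The stream is imbued with the
// converting facet before the file is opened, so every character written
// passes through the conversion.
bool write_utf8(char const* path, std::wstring const& text)
{
    std::wofstream out;
    // The locale takes ownership of the facet; do not delete it yourself.
    out.imbue(std::locale(std::locale::classic(),
                          new std::codecvt_utf8<wchar_t>));
    out.open(path, std::ios::out | std::ios::binary);
    if (!out)
        return false;
    out << text;
    out.close();
    return !out.fail();
}

// Read the raw bytes back through a plain char stream, i.e. without any
// conversion, to inspect what actually ended up in the file.
std::string read_bytes(char const* path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Writing a string that contains only U+20AC and reading the file back with
read_bytes should give the three bytes E2 82 AC. Note that where wchar_t is
only 16 bits wide, this particular facet cannot represent code points outside
the Basic Multilingual Plane.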

> See
> http://www.codeproject.com/KB/stl/upgradingstlappstounicode.aspx
>
> Okay, so I can use this giant hack mentioned in the article to write a
> unicode string. But that just seems "wrong."

This page is horrible. Firstly, many statements there only apply to win32
systems, which one might still live with, but claims that wchar_t is not a
native type actually shows how little the author knew about C++. Note that
older MS C++ compilers actually implemented wchar_t as a typedef, but that
is against the standard. Secondly, it assumes that you can always convert a
char to a wchar_t, but some encodings use 0x80 for the Euro sign while
Unicode uses 0x20ac. This simple conversion is only correct when the
char-string's charset is Latin 1, otherwise it leads to garbage.
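
To illustrate the Euro sign problem, here is a small sketch. I am assuming
Windows-1252 as the charset that puts the Euro sign at 0x80, and both function
names are made up. The second function is deliberately incomplete: it knows
about that single byte only, whereas a real converter needs the full mapping
table of the source charset.

```cpp
#include <string>

// Naive widening: every byte becomes the code point with the same number.
// This is only correct if the byte string is encoded as Latin-1.
std::wstring widen_naive(std::string const& bytes)
{
    std::wstring out;
    for (std::string::size_type i = 0; i != bytes.size(); ++i)
        out += static_cast<wchar_t>(static_cast<unsigned char>(bytes[i]));
    return out;
}

// Deliberately incomplete charset-aware variant (assumes Windows-1252):
// it only fixes up byte 0x80, which that charset uses for the Euro sign.
std::wstring widen_cp1252_euro_only(std::string const& bytes)
{
    std::wstring out = widen_naive(bytes);
    for (std::wstring::size_type i = 0; i != out.size(); ++i)
        if (out[i] == 0x80)
            out[i] = static_cast<wchar_t>(0x20AC);
    return out;
}
```

With the naive version, the byte 0x80 ends up as U+0080, which is a control
character and not the Euro sign at U+20AC.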

Uli

--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
