Re: STL, UTF8, and CodeCvt
Philip wrote:
I am using MS VC++ 7.1 (Visual Studio 2003) and working in a fully
Unicode application targeted for Far Eastern language support.
The STL stream constructors
The STL doesn't include any streams; it consists only of containers,
iterators and algorithms. The C++ streams come from the IOStreams library
instead. Strictly speaking, neither is itself standard C++, though the C++
standard was heavily influenced by both and in fact incorporated most of
each. When you refer to the "STL standard", I assume you mean the C++
standard.
and open functions require the filename be provided as a narrow (char)
string. Most operating systems now support Unicode paths and filenames
as UTF-8 strings, which can be represented as char strings.
I would like to pass UTF-8 strings to the STL stream constructors and
open functions, in order to support Far Eastern language filenames.
You can't. As pointed out, many operating systems don't support UTF-8, and
the string passed to the constructor has an implementation-defined meaning
anyway. IMHO the best way around this is to simply create a function
void open_file(std::ofstream& out, std::string const& utf_8_filename);
which is then implemented in a platform/compiler-dependent way. Typically,
this would delegate to the normal ofstream::open, or use the generally
present way to create an fstream from a FILE*, or use some other
platform-dependent mechanism.
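A minimal sketch of such a wrapper, under stated assumptions: on Windows it
converts the name to UTF-16 with MultiByteToWideChar and relies on the
non-standard wchar_t overload of open() that later Microsoft library
versions provide (I am assuming your compiler has it; if it doesn't, going
through _wfopen is the alternative). Everywhere else it assumes the
platform takes the UTF-8 bytes as they are. Treat it as an illustration,
not as a finished implementation.

```cpp
#include <fstream>
#include <string>

#ifdef _WIN32
#include <windows.h>
#include <vector>
#endif

// Opens 'out' for the file named by a UTF-8 encoded string.
// The caller checks out.is_open() (or the stream state) afterwards.
void open_file(std::ofstream& out, std::string const& utf_8_filename)
{
#ifdef _WIN32
    // Assumption: the library offers open(wchar_t const*), which is a
    // Microsoft extension and not part of the C++ standard.
    // With -1 as the input length, the returned size includes the
    // terminating null character.
    int const len = MultiByteToWideChar(CP_UTF8, 0,
                                        utf_8_filename.c_str(), -1, NULL, 0);
    if (len <= 0) {
        out.setstate(std::ios_base::failbit);  // name is not convertible
        return;
    }
    std::vector<wchar_t> wide(len);
    MultiByteToWideChar(CP_UTF8, 0,
                        utf_8_filename.c_str(), -1, &wide[0], len);
    out.open(&wide[0]);
#else
    // Assumption: the platform interprets narrow filenames as UTF-8
    // (or as plain bytes), so the string can be passed through unchanged.
    out.open(utf_8_filename.c_str());
#endif
}
```

The calling code stays the same on every platform: declare a
std::ofstream, call open_file(out, name), then test out.is_open().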
The STL standard requires std::codecvt functions to support conversion
to and from Unicode UTF-16 wchar_t and MBCS char.
No, it doesn't. Nothing requires any particular meaning or interpretation
for wchar_t or char. Further, std::codecvt is notoriously bad at handling
internal encodings that use multiple elements per character, like UTF-8,
UTF-16 and, considering combining glyphs like accents, Unicode in general.
The latter in particular is often simply ignored, which sometimes leads to
subtle problems.
However, there is no clearly delineated type which could stand for
UTF-8. I believe unsigned char is taken up (in Windows and VC++
anyway) for MBCS, and typedefs are not real types, only aliases and
thus do not allow a separate specialization.
Again: an encoding is a convention rather than something that is mandated
in any way. That means that 'char' is just as well suited to holding UTF-8
data as 'unsigned char' would be.
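As a small illustration (the helper names to_bytes and from_bytes are
made up for this post), the same UTF-8 data can be moved between char and
unsigned char storage without losing anything, because only the bytes
matter:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Copies the bytes of a char-based string into unsigned char storage.
std::vector<unsigned char> to_bytes(std::string const& utf8)
{
    std::vector<unsigned char> bytes;
    bytes.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size(); ++i)
        bytes.push_back(static_cast<unsigned char>(utf8[i]));
    return bytes;
}

// Copies the bytes back into a char-based string.
std::string from_bytes(std::vector<unsigned char> const& bytes)
{
    if (bytes.empty())
        return std::string();
    // Reading unsigned char objects through a char pointer is allowed,
    // so this copies the bytes without converting their values.
    return std::string(reinterpret_cast<char const*>(&bytes[0]),
                       bytes.size());
}
```

Neither container knows or cares that the content is UTF-8; that
interpretation lives entirely in the code that reads the bytes.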
Uli
--
Sator Laser GmbH
Geschäftsführer: Ronald Boers Steuernummer: 02/858/00757
Amtsgericht Hamburg HR B62 932 USt-Id.Nr.: DE183047360
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]