Re: STL, UTF8, and CodeCvt
"Philip" <Montrowe@Hotmail.com> wrote in message
news:1173013138.633327.153070@t69g2000cwt.googlegroups.com...
> I am using MS VC++ 7.1 (Visual Studio 2003) and working in a fully
> Unicode application targeted for Far Eastern language support.
> The STL stream constructors and open functions require the filename be
> provided as a narrow (char) string. Most operating systems now
> support Unicode paths and filenames as UTF-8 strings, which can be
> represented as char strings.
> I would like to pass UTF-8 strings to the STL stream constructors and
> open functions, in order to support Far Eastern language filenames.
Our upgrade library for VC++ accepts wchar_t strings as filenames,
so you can use UTF-16 names directly. UTF-8 names require a translation
step, but we also provide the translator.
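To make the translation step concrete, here is a minimal hand-written sketch of a UTF-8 to UTF-16 decoder. It is not the translator from the upgrade library mentioned above; the name utf8_to_utf16 is hypothetical, and the sketch assumes the caller wants UTF-16 code units stored in a std::wstring (as on Windows, where wchar_t is 16 bits). It rejects malformed lead and continuation bytes but does not check for overlong encodings.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, for illustration only: decode UTF-8 held in a
// char string into UTF-16 code units stored in a wstring.
// Code points above U+FFFF are always written as surrogate pairs,
// even on platforms where wchar_t is wider than 16 bits.
// Overlong encodings are NOT rejected in this sketch.
std::wstring utf8_to_utf16(const std::string& in)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i++]);
        unsigned long cp;   // code point being assembled
        int extra;          // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        for (; extra > 0; --extra) {
            if (i >= in.size())
                throw std::runtime_error("truncated UTF-8 sequence");
            unsigned char c = static_cast<unsigned char>(in[i++]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");

        if (cp < 0x10000) {
            out += static_cast<wchar_t>(cp);
        } else {
            // Split into a high and a low surrogate.
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

With a stream implementation that accepts wchar_t filenames, the result of utf8_to_utf16(name).c_str() could then be handed to the constructor or to open(); whether such an overload exists depends on the library in use.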
> The STL standard requires std::codecvt functions to support conversion
> to and from Unicode UTF-16 wchar_t and MBCS char.
> However, I cannot find equivalent functions for conversion from UTF-16
> to UTF-8.
See:
http://www.dinkumware.com/manuals/?manual=compleat&page=wstring.html
which is also part of our upgrade library.
> Are there UTF-16/UTF-8 conversions that are already part of or being
> considered for inclusion in the STL standard?
The wstring header described above has been proposed for the
next version of the C++ Standard.
> Is there an Intel/Windows based STL implementation which currently
> provides UTF-16/UTF-8 conversion?
Just our upgrade library, AFAIK.
> More generally, STL currently focuses on char/wchar_t types and refers
> to them more or less consistently as narrow versus wide type (see
> std::ios::widen()).
> Does the STL standard currently encompass any terminology to
> accommodate UTF-8 (a sort of combination of narrow and wide) or is the
> standards committee considering anything along these lines?
The committee is considering several approaches, but hasn't settled
on a given one yet.
> Lastly, the UTF-16/UTF-8 conversions are well-known and relatively
> simple so I have considered writing my own std::codecvt
> specialization.
> However, there is no clearly delineated type which could stand for
> UTF-8. I believe unsigned char is taken up (in Windows and VC++
> anyway) for MBCS, and typedefs are not real types, only aliases and
> thus do not allow a separate specialization.
You can use any of the char types for UTF-8. No need to store
UTF-8 in its own type. But see below.
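As an illustration of holding UTF-8 in an ordinary char string, here is a minimal sketch of the UTF-16 to UTF-8 direction written as a free function rather than a std::codecvt specialization. The name utf16_to_utf8 is hypothetical, and the sketch assumes each wstring element carries one 16-bit UTF-16 code unit.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, for illustration only: encode UTF-16 code units
// (one per wstring element) as UTF-8 stored in a plain char string.
std::string utf16_to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        // Treat each element as a 16-bit code unit.
        unsigned long cp = static_cast<unsigned long>(in[i]) & 0xFFFF;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 >= in.size())
                throw std::runtime_error("unpaired high surrogate");
            unsigned long lo = static_cast<unsigned long>(in[++i]) & 0xFFFF;
            if (lo < 0xDC00 || lo > 0xDFFF)
                throw std::runtime_error("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        }

        if (cp < 0x80) {                      // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {              // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {            // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                              // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result is an ordinary std::string, so no dedicated UTF-8 character type is needed to store or pass it around; the caller just has to remember which encoding the bytes are in.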
> Is there another technique or a language enhancement on the horizon
> which would address this specialization limitation?
That too is being discussed in the C++ committee.
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com