Re: Character set
On Jun 22, 11:30 pm, Amit Kumar <amitkumar.i...@gmail.com> wrote:
Hi Alf, Ron and Andy,
Thanks a lot for your valuable inputs.
[Ron]: Even US Windows is natively a 16-bit UNICODE machine.
[Andy]: If you are writing for Windows only I would advise you to use
wchar_t
The 'W' variants of the Windows APIs take UTF-16 encoded null
terminated strings and the 'A' variants require platform
encoded null terminated strings (and not UTF-8 encoded
strings; AFAIK)
At least on the Windows machines I use, the 8 bit encodings are
ISO 8859-1 (not UTF-8). Note, however, that like Unix, there
are a lot of interfaces which do no more than copy the bytes,
without interpreting them. It may be impossible, for example, to
create a filename in UTF-8, but you can certainly write and read
UTF-8 to and from the file.
The question arises: Can I really use wchar_t to store a
UTF-16 encoded character and std::wstring to store a UTF-16
encoded string?
Stroustrup: "The size of wchar_t is implementation defined and
large enough to hold the largest character set supported by the
implementation's locale."
Since it is not guaranteed that wchar_t is 16 bits, I cannot
simply store a UTF-16 string in std::wstring and call .c_str()
to obtain a UTF16* for a Windows UTF-16 based API.
You almost certainly can if you're under Windows. And code
which calls Windows UTF-16 based APIs isn't going to be portable
elsewhere anyway.
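As a rough sketch of what that looks like in practice: the helper below (appendUtf16 is a name made up for this example, not a Windows or standard function) appends a code point to a std::wstring as UTF-16 code units. It assumes wchar_t has at least 16 bits, which is the case under Windows but is not guaranteed by the standard.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of any Windows or standard API:
// appends one Unicode code point to a std::wstring as UTF-16 code units.
// Assumes wchar_t has at least 16 bits (true under Windows).
void appendUtf16(std::wstring& out, unsigned long codePoint)
{
    assert(codePoint <= 0x10FFFFUL);                       // inside the Unicode range
    assert(codePoint < 0xD800UL || codePoint > 0xDFFFUL);  // not a lone surrogate
    if (codePoint < 0x10000UL) {
        // Basic Multilingual Plane: one code unit.
        out += static_cast<wchar_t>(codePoint);
    } else {
        // Supplementary planes: a high and a low surrogate.
        unsigned long const v = codePoint - 0x10000UL;
        out += static_cast<wchar_t>(0xD800UL + (v >> 10));
        out += static_cast<wchar_t>(0xDC00UL + (v & 0x3FFUL));
    }
}
```

Under Windows, calling .c_str() on the resulting string gives the const wchar_t* that the 'W' functions expect. On a platform with a 32 bit wchar_t, the same code compiles, but the string is then UTF-16 code units stored in wider elements, not the native wide encoding.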
An even more frustrating and annoying thing is that I cannot even
store a UTF-8 string in std::string.
Of course you can. I do it all the time. (Technically, there
is a slight problem if char is 8 bit signed, since the
conversion of an unsigned value, like 0xC3, to signed is
implementation defined, but in practice, no implementation would
dare break this: if the "conversion" doesn't just copy the bits,
the implementation will certainly make char unsigned if it has
only 8 bits.)
Why? Because std::string is std::basic_string<char>, and char
is not guaranteed to be 8 bits (though it is practically
always 8 bits, as pointed out by Ron).
char is guaranteed to be at least 8 bits. If it is more, you
can still store 8 bit values in it. The only possible problem
is signedness, and the conversion of a value in the range
0x80-0xFF to the signed char, and in practice, you're certainly
safe here as well.
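A minimal sketch of the point about signedness, with names invented for the example (byteAt and eAcuteUtf8 are not from any library): the UTF-8 bytes go into the std::string as they are, and the only care needed is to convert to unsigned char when reading a byte back as a number.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte at position i as a value in
// 0..255, whether plain char is signed or unsigned.
unsigned byteAt(std::string const& s, std::size_t i)
{
    assert(i < s.size());
    return static_cast<unsigned char>(s[i]);
}

// U+00E9 (e with acute accent) encoded in UTF-8 is the two bytes
// 0xC3 0xA9.
std::string eAcuteUtf8()
{
    return std::string("\xC3\xA9");
}
```

With this, byteAt(eAcuteUtf8(), 0) is 0xC3 on the usual 8 bit char implementations, even where plain char is signed and s[0] itself would compare as a negative value.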
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34