Re: Upgrade from Windows-1252 to UCS-2
On Jun 20, 12:36 pm, Boris <b...@gtemail.net> wrote:
I'm trying to find out what the steps look like to upgrade a program
(which is used on Windows and Unix) from Windows-1252 (the Windows "ANSI"
code page) to UCS-2. Currently the program reads and writes files encoded
in Windows-1252 but should be able to read files encoded in UCS-2, too.
I think you mean UCS-4 and UTF-16. Old documents talk about UCS-2, but
current Windows (and I assume Linux etc.) supports the full UCS-4
character range, stored as UTF-16 on Windows. This causes no end of
confusion, especially as for most purposes there isn't much
difference. Check that your software manages to handle the treble
clef character (U+1D11E) properly. Let's see how it works here :)
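A rough sketch of what that check might look like. I'm assuming
std::u16string here rather than std::wstring so the result is the same on
Windows and Linux, and count_code_points is just a name I made up for
illustration:

```cpp
#include <cstddef>
#include <string>

// True for the first (high) half of a surrogate pair, 0xD800..0xDBFF.
bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True for the second (low) half of a surrogate pair, 0xDC00..0xDFFF.
bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Counts characters (code points) rather than 16-bit code units.
// Hypothetical helper, not part of any standard library.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // A low surrogate directly after a high surrogate belongs to the
        // character that was already counted.
        if (i > 0 && is_low_surrogate(s[i]) && is_high_surrogate(s[i - 1])) {
            continue;
        }
        ++n;
    }
    return n;
}
```

With the clef written as u"\U0001D11E", size() reports two code units
(D834 DD1E) while count_code_points reports one character. Software that
assumes UCS-2 will treat it as two.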
As I don't want to deal with two character representations in the program
I plan to use UCS-2 internally. I should be able to simply use
std::wstring then? When Windows-1252 encoded files are read I have to
convert the data to UCS-2 though. My understanding is that it depends now
on the implementation of the C++ standard library if and what kind of
conversions are supported? I might need to use a third-party library like
the Dinkum Conversions Library which converts data on the fly or something
like UTF-8 CPP where I can call functions explicitly to convert between
character sets?
Your std::wstring will be in UTF-16 (on Windows, maybe UTF-32 on
Linux). Whether this matters or not depends on what string
manipulation you do. E.g. you can safely substr on a boundary where
you find a certain character, but you cannot safely take the first
twenty wchar_ts from a std::wstring without a chance of cutting a
surrogate pair in half and breaking the string.
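To make the "first twenty" problem concrete, here is a minimal sketch of a
truncation that backs off when the cut would land inside a surrogate pair.
truncate_utf16 is a made-up name, and I'm assuming std::u16string so the
code units are 16 bits everywhere. It only protects surrogate pairs; it
does nothing about combining sequences.

```cpp
#include <cstddef>
#include <string>

// Returns at most max_units code units from the front of s, keeping one
// unit fewer if the cut would fall between the halves of a surrogate pair.
// Hypothetical helper for illustration only.
std::u16string truncate_utf16(const std::u16string& s, std::size_t max_units) {
    if (s.size() <= max_units) {
        return s;  // nothing to cut
    }
    std::size_t cut = max_units;
    if (cut > 0) {
        const char16_t last_kept = s[cut - 1];
        const char16_t first_dropped = s[cut];  // valid: s.size() > cut
        const bool splits_pair =
            last_kept >= 0xD800 && last_kept <= 0xDBFF &&
            first_dropped >= 0xDC00 && first_dropped <= 0xDFFF;
        if (splits_pair) {
            --cut;  // drop the orphaned high surrogate as well
        }
    }
    return s.substr(0, cut);
}
```

For u"ab\U0001D11Ecd", a plain substr(0, 3) would end in a lone D834,
whereas truncate_utf16(s, 3) gives back just "ab".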
After converting everything to UCS-2 and storing it in std::wstring I
suppose I can use the well-known string functions to search, replace,
compare strings (including < and >) etc. Is my understanding correct that
I'm safe to use member functions of std::wstring as long as the character
set used is not multibyte?
Only if you ignore normal forms and collation ordering. std::wstring's
members that return single wchar_ts may only give you half of a
surrogate pair. Searching for characters outside the Basic
Multilingual Plane through the string's single character members
won't be possible, because each of them takes two wchar_ts. UTF-16 is
multibyte.
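One way around the single character members is to search for the whole
code unit sequence instead. A sketch, again assuming std::u16string, with
encode_utf16 and find_code_point as names I invented for the example:

```cpp
#include <cstddef>
#include <string>

// Encodes one code point as one or two UTF-16 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not a surrogate).
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high half
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low half
    }
    return out;
}

// Finds a code point by searching for its full code unit sequence.
// Returns the index of its first code unit, or std::u16string::npos.
std::size_t find_code_point(const std::u16string& text, char32_t cp) {
    return text.find(encode_utf16(cp));
}
```

The returned index is still in code units, not characters, so it can be
fed straight back into substr.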
Last but not least the program needs to save files again. It might make
sense to use UTF-8 here for backward compatibility (as other programs
might be able to read the files more easily if they support only
Windows-1252). Thus I would need another converter to make sure that
std::wstring is encoded in UTF-8 correctly which means I need a
third-party tool again?
Anything I might have missed?
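Converting from UTF-16 to UTF-8 is fairly simple, but don't forget you
HAVE to go through UTF-32: each surrogate pair must first be combined
into one code point, because encoding the two halves separately does
not produce valid UTF-8.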
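Roughly, the two steps could look like the sketch below. The function
names are mine, and replacing unpaired surrogates with U+FFFD is an
assumption; you may prefer to report an error instead.

```cpp
#include <cstddef>
#include <string>

// Step 1: decode UTF-16 code units into UTF-32 code points.
// Unpaired surrogates become U+FFFD (an assumed policy, not the only one).
std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            const char32_t low = in[++i];  // consume the second half
            out.push_back(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00));
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            out.push_back(0xFFFD);  // lone surrogate
        } else {
            out.push_back(u);
        }
    }
    return out;
}

// Step 2: encode each code point as one to four UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Both steps together.
std::string utf16_to_utf8(const std::u16string& in) {
    return utf32_to_utf8(utf16_to_utf32(in));
}
```

The treble clef is a handy test case again: D834 DD1E in UTF-16 should
come out as the four bytes F0 9D 84 9E.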
It's not directly about your situation, but you may find this
interesting as it does discuss some of the issues about encodings and
Unicode.
http://www.kirit.com/Getting%20the%20correct%20Unicode%20path%20within%20an%20ISAPI%20filter
The way you're going about it is a good way to start this sort of
conversion. In the end, for our systems we made our own
std::basic_string-like class that knows it is UTF-16 and alters parts
of the interface accordingly.
Once you start working with Unicode you won't want to go back.
K