Re: isspace
On Jan 31, 9:39 am, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
Paavo Helde <myfirstn...@osa.pri.ee> wrote
in news:Xns9D116950C4paavo256@216.196.109.131:
[...]
Ok, so I think that I will open my file specifying to use UTF-8
encoding, but how can I do it in C++?
You can open it as a narrow stream and read in as binary
UTF-8, or (maybe) you can open it as a wide stream and get
an automatic translation from UTF-8 to wchar_t. The
following example assumes that you have a file test1.utf8
containing valid UTF-8 text. It reads the file in as a wide
stream and prints out the numeric values of all wchar_t
characters.
#include <iostream>
#include <fstream>
#include <locale>
#include <string>
int main() {
    std::wifstream is;
    const std::locale filelocale("en_US.UTF8");
The above line supposes 1) that you're on a Unix platform
(because it uses the Unix conventions for naming locales), and
2) that the "en_US.UTF8" locale has been installed---under that
name. (I've worked on a lot of systems where this was not the
case.)
    is.imbue(filelocale);
    is.open("test1.utf8");
    std::wstring s;
    while (std::getline(is, s)) {
        // Streaming a wchar_t to the narrow std::cout prints its
        // numeric value rather than the character itself.
        for (std::wstring::size_type j = 0; j < s.length(); ++j) {
            std::cout << s[j] << " ";
        }
        std::cout << "\n";
    }
}
(Tested on Linux with a recent gcc; I am not too sure whether
this works on Windows. For one thing, wchar_t in MSVC is too
narrow for real Unicode, so at best one might get UTF-16 as a
result.)
UTF-16 is "real Unicode". Just like UTF-8.
Out of curiosity, I also tested this on Windows with MSVC9,
and as expected it did not work: the locale construction
immediately threw an exception (bad locale name). None of the
variations I tried worked either ("english.UTF8", ".UTF8",
".utf-8", ".65001").
That's because Windows uses different conventions for naming
locales. (Windows Vista and later claim that names conforming
to RFC 4646 are used, see
http://msdn.microsoft.com/en-us/library/dd373814%28VS.85%29.aspx.
Except that RFC 4646 doesn't seem to contain information
concerning the character encoding. I'm guessing that Windows
would use the code page for this---65001 for UTF-8. But I don't
know how it has to be added to the "en-US".)
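Given those differing conventions, one workaround is to try a
small list of candidate locale names and use the first one the
runtime accepts. A rough sketch (the helper name and the
candidate list are my own guesses, not anything standardized):

#include <locale>
#include <stdexcept>

// Return the first UTF-8 locale this runtime accepts, or the
// classic locale if none of the candidate names work. Imbue the
// stream with the result before opening the file.
std::locale
findUtf8Locale()
{
    // Unix-style spellings first, then spellings that some
    // Windows runtimes might conceivably accept.
    char const* const candidates[] = {
        "en_US.UTF-8", "en_US.UTF8", "en-US.65001", ".65001"
    };
    int const n = sizeof(candidates) / sizeof(candidates[0]);
    for (int i = 0; i < n; ++i) {
        try {
            return std::locale(candidates[i]);
        } catch (std::runtime_error const&) {
            // Bad locale name on this system; try the next one.
        }
    }
    // Caller must then cope without automatic UTF-8 conversion.
    return std::locale::classic();
}

The candidate list could just as well be read from a file, which
is more or less the configuration file approach mentioned below.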
Thus, if one wants any portability, it seems the best approach
currently is still to read in binary UTF-8 and perform any
needed conversions by hand.
It should be sufficient to find out how the different locales are
named for each system, and read this information in from some
sort of configuration file.
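If one does end up reading the file as narrow binary UTF-8, the
hand conversion itself is not a lot of code. A rough sketch, with
a function name of my own invention and only minimal validation
(lead bytes are classified, continuation bytes are not checked):

#include <string>
#include <vector>

// Convert a string of UTF-8 bytes into Unicode code points.
// Invalid lead bytes and truncated sequences are silently
// skipped or dropped.
std::vector<unsigned long>
decodeUtf8(std::string const& bytes)
{
    std::vector<unsigned long> result;
    std::string::size_type i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        unsigned long cp;
        int extra;
        if (b < 0x80)                { cp = b;         extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1FU; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0FU; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07U; extra = 3; }
        else { ++i; continue; }                 // invalid lead byte
        if (i + extra >= bytes.size()) break;   // truncated sequence
        for (int k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            cp = (cp << 6) | (c & 0x3FU);
        }
        result.push_back(cp);
        i += extra + 1;
    }
    return result;
}

The bytes themselves can be read with an ordinary std::ifstream
opened with std::ios::binary, e.g. via std::istreambuf_iterator.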
--
James Kanze