Re: isspace

From:

James Kanze <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++

Date:

Sat, 30 Jan 2010 03:57:25 -0800 (PST)

Message-ID:

<86570f49-5fb3-4eb0-8d25-d2c9f33f709b@l30g2000yqb.googlegroups.com>

On Jan 28, 11:54 pm, gervaz <ger...@gmail.com> wrote:

On Jan 28, 11:46 pm, James Kanze <james.ka...@gmail.com> wrote:

On 28 Jan, 21:25, gervaz <ger...@gmail.com> wrote:

On Jan 28, 9:40 pm, Paavo Helde <myfirstn...@osa.pri.ee> wrote:

gervaz <ger...@gmail.com> wrote in news:198ffd0f-8a21-4d23-802f-
cf0f0fee0...@o28g2000yqh.googlegroups.com:

Hi all, is there a C++ function similar to isspace that
can handle wchar_t? Does the regex library handle
wchar_t?

Yes, there is a template function declared in <locale> and
named std::isspace, curiously enough.
There is no regex library in the official C++ standard yet, I
think. The Boost regex library is fully templated and ought
to support wchar_t as well, but I have not tried this.
According to Boost documentation one needs a separate ICU
library for full Unicode support though.
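
For readers on a later compiler: C++11, which postdates this thread, added <regex> with a std::wregex typedef for wchar_t. The following is only a minimal sketch under that assumption; strip_spaces_regex is a hypothetical helper name, and which characters \s matches depends on the locale the regex object picks up.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: remove every run of whitespace from a wide string.
// Assumes a C++11 (or later) standard library that provides <regex>.
std::wstring strip_spaces_regex(const std::wstring& s)
{
    // \s is the whitespace character class; its meaning follows the
    // locale in effect for the regex (the global locale by default).
    static const std::wregex spaces(L"\\s+");
    return std::regex_replace(s, spaces, std::wstring());
}
```

Calling strip_spaces_regex(L"a b\tc\n") should yield L"abc" in the default "C" locale.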

Well, take a look at my snippet:
    std::ifstream infile(argv[1]);
    std::string s;
    while (getline(infile, s))
    {
        s.erase(std::remove_if(s.begin(), s.end(), std::isspace), s.end());
        std::cout << s;
    }
Using <locale> on VC++ 2008, I get an error reporting that
std::isspace expects 2 arguments,

That's because std::isspace requires two arguments, the
character to be tested, and the locale.
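
One way to pass the two-argument form to remove_if is to wrap it in a small function object that carries the locale. This is a sketch only: IsSpaceIn and strip_spaces are hypothetical names, and the classic "C" locale is used as the default purely for illustration.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical functor: binds a locale to the two-argument std::isspace
// from <locale> so that it can be used as a unary predicate.
struct IsSpaceIn
{
    explicit IsSpaceIn(const std::locale& loc) : loc_(loc) {}
    bool operator()(char c) const { return std::isspace(c, loc_); }
    std::locale loc_;
};

// Hypothetical helper: erase every character that loc classifies as
// whitespace. The default locale here is an assumption, not a requirement.
std::string strip_spaces(std::string s,
                         const std::locale& loc = std::locale::classic())
{
    s.erase(std::remove_if(s.begin(), s.end(), IsSpaceIn(loc)), s.end());
    return s;
}
```

Inside the original loop, the erase line would then become s = strip_spaces(s); this still only tests one char at a time.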

and I still don't know whether a file that contains Unicode
characters can be handled correctly.

The functions in <locale> are pretty useless on char, since they
only see a single byte at a time and so cannot classify a
multibyte character. The "approved" solution
is to read into a wstring using wifstream (imbued with the
appropriate locale), and use isspace (again with the
appropriate locale) on the wchar_t in the wstring.
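
A rough sketch of that approach follows. The names IsWideSpace, strip_wide_spaces and strip_file are hypothetical, and the locale passed in is assumed to exist on the system and to carry the right encoding conversion for the file.

```cpp
#include <algorithm>
#include <fstream>
#include <locale>
#include <ostream>
#include <string>

// Hypothetical functor: two-argument std::isspace applied to wchar_t.
struct IsWideSpace
{
    explicit IsWideSpace(const std::locale& loc) : loc_(loc) {}
    bool operator()(wchar_t c) const { return std::isspace(c, loc_); }
    std::locale loc_;
};

// Hypothetical helper: erase everything loc classifies as whitespace.
std::wstring strip_wide_spaces(std::wstring s, const std::locale& loc)
{
    s.erase(std::remove_if(s.begin(), s.end(), IsWideSpace(loc)), s.end());
    return s;
}

// Hypothetical helper: read path line by line through a wifstream imbued
// with loc, and write each line to out with its whitespace removed.
void strip_file(const char* path, const std::locale& loc, std::wostream& out)
{
    std::wifstream infile;
    infile.imbue(loc);   // imbue before opening, so the conversion facet
    infile.open(path);   // of loc is in place for the first read
    std::wstring line;
    while (std::getline(infile, line))
        out << strip_wide_spaces(line, loc);
}
```

A call might look like strip_file(argv[1], std::locale("en_US.UTF-8"), std::wcout), where the locale name is only an example and depends on the system.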

Ok, well, suppose I want to use UTF-8 encoding, how do I
specify it using locale? And where can I find a list of the
possible locale encoding configurations (e.g. if I wanted to
correctly decode a web page by parsing the first bytes and
looking for 'charset')?

There are no standard names for locales -- you'll have to read
your system documentation. Posix defines a standard *format*
for names under Unix systems. But you'll still have to read the
documentation to see what is present, *and* what the default
encoding is, since if UTF-8 is the default, it may not be
present in the name. (Actually, I can't find a definition of
this format in the Posix standard. But it is common to Solaris,
HP-UX, AIX and Linux, at least, and seems to be at least a de
facto standard. The problem is that it doesn't necessarily
represent the default encoding, so UTF-8 might be "en_US.utf8"
or "en_US", the latter only if the default encoding is UTF-8.)
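
Since the names vary, one pragmatic option is to try a few candidates and keep the first one the system accepts. This is a sketch only; first_available_locale is a hypothetical helper, and any candidate names passed to it are guesses to be checked against the system documentation.

```cpp
#include <cstddef>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: return the first locale in candidates that can be
// constructed on this system, or the classic "C" locale if none can.
std::locale first_available_locale(const std::vector<std::string>& candidates)
{
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        try {
            return std::locale(candidates[i].c_str());
        } catch (const std::runtime_error&) {
            // The system does not know this name; try the next one.
        }
    }
    return std::locale::classic();
}
```

A caller might pass "en_US.UTF-8", "en_US.utf8" and "en_US" in that order. Note that a successful construction of "en_US" says nothing about its encoding, so that still has to be verified separately.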

--
James Kanze