Re: isspace
On Jan 28, 11:54 pm, gervaz <ger...@gmail.com> wrote:
On Jan 28, 11:46 pm, James Kanze <james.ka...@gmail.com> wrote:
On 28 Jan, 21:25, gervaz <ger...@gmail.com> wrote:
On Jan 28, 9:40 pm, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
gervaz <ger...@gmail.com> wrote in news:198ffd0f-8a21-4d23-802f-
cf0f0fee0...@o28g2000yqh.googlegroups.com:
Hi all, is there a C++ function similar to isspace that
can handle wchar_t? Does the regex library handle
wchar_t?
Yes, there is a template function declared in <locale> and
named std::isspace, curiously enough.
There is no regex library in the official C++ standard yet, I
think. The Boost regex library is fully templated and ought
to support wchar_t as well, but I have not tried this.
According to the Boost documentation, one needs a separate ICU
library for full Unicode support, though.
Well, take a look at my snippet:
    std::ifstream infile(argv[1]);
    std::string s;
    while (getline(infile, s))
    {
        s.erase(std::remove_if(s.begin(), s.end(), std::isspace), s.end());
        std::cout << s;
    }
Using <locale> on VC++ 2008, I get an error reporting that
std::isspace expects 2 arguments,
That's because std::isspace requires two arguments, the
character to be tested, and the locale.
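
A minimal sketch of the two-argument form, assuming a C++11 compiler
for the lambda (on VC++ 2008 a small function object would be needed
instead). strip_spaces is a hypothetical helper name, not something
from the standard library, and the default argument std::locale() is
just a copy of the current global locale:

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical helper: returns a copy of s with every character removed
// that the locale loc classifies as whitespace.
std::string strip_spaces(std::string s, const std::locale& loc = std::locale())
{
    // remove_if needs a one-argument predicate, so the locale is bound
    // inside a lambda that calls the two-argument std::isspace from <locale>.
    s.erase(std::remove_if(s.begin(), s.end(),
                           [&loc](char c) { return std::isspace(c, loc); }),
            s.end());
    return s;
}
```

Wrapping the call in a lambda also sidesteps the overload ambiguity
between the <locale> template and the one-argument isspace from
<cctype> when the bare name std::isspace is passed to an algorithm.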
and I still don't know whether a file containing Unicode
characters can be handled correctly.
The functions in <locale> are pretty useless when applied to
char, since they only handle single-byte characters. The
"approved" solution is to read into a wstring using wifstream
(imbued with the
appropriate locale), and use isspace (again with the
appropriate locale) on the wchar_t in the wstring.
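
The approach just described might look roughly like the sketch below.
read_without_spaces is a hypothetical name, a C++11 compiler is
assumed, and whether the bytes in the file are actually decoded as
UTF-8 depends entirely on the locale object passed in and on what the
system's locale support provides:

```cpp
#include <algorithm>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: reads the file at path through a wide stream
// imbued with loc and returns its contents with whitespace removed.
// Returns an empty string if the file cannot be opened.
std::wstring read_without_spaces(const char* path, const std::locale& loc)
{
    std::wifstream in(path);
    // Imbue before the first read so that the locale's codecvt facet
    // is used to convert the file's bytes to wchar_t.
    in.imbue(loc);

    std::wstring result;
    std::wstring line;
    while (std::getline(in, line)) {
        // Classify each wchar_t with the same locale used for decoding.
        line.erase(std::remove_if(line.begin(), line.end(),
                                  [&loc](wchar_t c) { return std::isspace(c, loc); }),
                   line.end());
        result += line;
    }
    return result;
}
```

Using one locale for both the stream and the classification keeps the
decoding and the whitespace test consistent with each other.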
OK, well, suppose I want to use UTF-8 encoding: how do I
specify it using a locale? And where can I find a list of the
possible locale encoding configurations (e.g. if I wanted to
correctly decode a web page by just parsing the first bytes
looking for 'charset')?
There are no standard names for locales -- you'll have to read
your system documentation. Posix defines a standard *format*
for names under Unix systems. But you'll still have to read the
documentation to see what is present, *and* what the default
encoding is, since if UTF-8 is the default, it may not be
present in the name. (Actually, I can't find a definition of
this format in the Posix standard. But it is common to Solaris,
HP-UX, AIX and Linux, at least, and seems to be at least a de
facto standard. The problem is that it doesn't necessarily
represent the default encoding, so UTF-8 might be "en_US.utf8"
or "en_US", the latter only if the default encoding is UTF-8.)
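
Since the names are system-dependent, one hedged way to cope is to try
a list of candidate names and fall back to the classic "C" locale when
none is accepted. first_available_locale is a hypothetical helper, and
the sketch relies only on std::locale's constructor throwing
std::runtime_error for a name the system does not recognise:

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns the first locale in names that the
// system accepts, or the classic "C" locale if none of them is known.
std::locale first_available_locale(const std::vector<std::string>& names)
{
    for (const std::string& name : names) {
        try {
            return std::locale(name.c_str());
        } catch (const std::runtime_error&) {
            // This name is not installed on the system; try the next one.
        }
    }
    return std::locale::classic();
}
```

A caller could pass the two spellings mentioned above, "en_US.utf8"
and "en_US", in that order. A successful match on "en_US" still says
nothing about its encoding, so the system documentation has to be
consulted either way.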
--
James Kanze