Re: string encoding in C++

From: James Kanze <james.kanze@gmail.com>
Newsgroups: comp.lang.c++
Date: Sun, 4 Jul 2010 04:18:42 -0700 (PDT)
Message-ID: <5622f1bd-236c-4c88-aab1-17183b4aa8a4@g19g2000yqc.googlegroups.com>

On Jul 2, 1:43 am, Sam <s...@email-scan.com> wrote:

Allen writes:

On Jul 2, 6:03 am, Sam <s...@email-scan.com> wrote:

Allen writes:

Hi, I am porting a C++ program from Win32 to IBM AIX
5.3. There is a file named Measurement.cpp which contains
some strings, for example:

static std::wstring breaker = L"开关";

Measurement.cpp is encoded in UTF-8. The porting
procedure is as follows:
1. Change the encoding of Measurement.cpp to GB18030, as AIX 5.3
requires.
2. Write a helper function named ws2s:
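#include <locale.h>
#include <stdlib.h>
#include <string>
#include <vector>

std::string ws2s(const std::wstring & src) {
   // Select the environment's locale first: MB_CUR_MAX depends on it.
   setlocale(LC_ALL, "");
   // Worst case is MB_CUR_MAX bytes per character (GB18030 can need 4),
   // plus one for the terminating '\0'.
   std::vector<char> buff(MB_CUR_MAX * src.size() + 1, 0);
   const size_t n = wcstombs(&buff[0], src.c_str(), buff.size());
   setlocale(LC_ALL, "C");
   // wcstombs returns (size_t)-1 if a character can't be converted.
   if (n == (size_t)-1)
      return std::string();
   return std::string(&buff[0], n);
}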
3. Output the breaker:
std::cout << ws2s(breaker) << std::endl;

But the output text is not displayed correctly.

Would you please help me? Thank you.

Three possible reasons:

1) When you compile Measurement.cpp, your C++ compiler must
be aware that this module uses GB18030. Check your
compiler's documentation.

2) At runtime, your locale does not match the encoding
used by your display terminal.

3) Your C++ library does not implement the encoding used by
your locale.

How can you set a locale which isn't installed?
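
For what it's worth, setlocale itself tells you whether the request was
honored: it returns a null pointer if the locale isn't available. A
minimal sketch of checking that (selectLocale is a name I've made up for
illustration, not anything from the original program):

```cpp
#include <clocale>
#include <string>

// Hypothetical helper (not from the original post): try to select a
// locale and report the name the C library actually accepted.
// Returns an empty string if the request could not be honored.
std::string selectLocale(const char* requested)
{
    // setlocale returns a null pointer when the locale is unavailable,
    // and leaves the current locale unchanged in that case.
    const char* actual = std::setlocale(LC_ALL, requested);
    return actual ? std::string(actual) : std::string();
}
```

Calling selectLocale("") before the conversion, and logging the result,
shows which locale the environment really selected, or that none could be
set at all.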

You can find out the answer yourself, by printing the
contents of your std::wstring first, as numerical
wchar_t's, and verifying their Unicode values, presuming
that your C++ library puts UTF-16 or UTF-32 into your
wchar_t's; and by printing the contents of your converted
string buffer, as numerical chars, and verifying that their
encoding is correct.
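
A minimal sketch of such a test, assuming nothing beyond the standard
library (dumpWide and dumpNarrow are hypothetical names, chosen here for
illustration):

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: each wchar_t of a wide string as a hex number.
std::string dumpWide(const std::wstring& s)
{
    std::ostringstream out;
    out << std::hex << std::uppercase;
    for (std::size_t i = 0; i != s.size(); ++i) {
        if (i != 0) out << ' ';
        out << static_cast<unsigned long>(s[i]);
    }
    return out.str();
}

// Hypothetical helper: each char of a narrow string as a two-digit
// hex byte.
std::string dumpNarrow(const std::string& s)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (std::size_t i = 0; i != s.size(); ++i) {
        if (i != 0) out << ' ';
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        out << std::setw(2)
            << static_cast<unsigned>(static_cast<unsigned char>(s[i]));
    }
    return out.str();
}
```

Printing dumpWide(breaker) and dumpNarrow(ws2s(breaker)) gives the values
to compare against the Unicode and GB18030 code charts, independently of
what any terminal chooses to display.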


Thank you for the detailed answer.
It is strange that the part of the string read from the XML file by
xerces-c is displayed OK,

Generally, XML parsers expect an XML document to use UTF-8. If an
XML document uses a different encoding, it would specify it in
the <?xml … > processing instruction.

I'd also be curious as to what characters are involved. It's
very frequent to have XML files which don't contain Chinese
characters, or only contain them in CDATA sections. If the
characters he's displaying from Xerces correspond to ASCII, then
it's not surprising that they display correctly.

while the constant breaker part of the string is not correct.
To illustrate, I wrote the following example code:
std::wstring prefix = xercesc-c...getAttributeText(...);
std::wstring breaker = L"开关";
std::wstring name = prefix + breaker;
I output the name into a file, and prefix is correct, but breaker
is not.

In order to begin analysing such a problem, it's necessary to
know 1) what should be output, and 2) what actually is output.
In both cases, the actual numerical values of the bytes, not
what is being displayed by some display engine.

So I don't understand two things.
1. How does the source file encoding relate to a constant string, i.e. L"开关"?

This is implementation defined. Most C++ libraries use UTF-16
or UTF-32.

I think you're being overoptimistic about Unicode use---the
last time I had access to non-Windows machines, Solaris (and Sun
CC) still didn't use Unicode.

Still, I think AIX is UTF-16. And I'm pretty sure that the
compiler doesn't use GB 18030 by default.

(As a general rule, I'd recommend using a Unicode format
internally regardless, and translating to GB 18030, if
necessary, on input and output.)

Another factor is what your compiler thinks is the character
coding of the C++ source.
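
One way to take the source file's encoding out of the picture is to spell
the characters with universal character names, so that the file contains
only ASCII. A sketch, assuming the two characters in the breaker literal
are U+5F00 and U+5173 (which is what the UTF-8 bytes E5 BC 80 E5 85 B3
decode to), and assuming the compiler supports \u escapes in wide
literals:

```cpp
#include <string>

// The same two characters as the literal in Measurement.cpp, written
// as universal character names (U+5F00, U+5173). The source file stays
// pure ASCII, so the compiler's idea of its encoding no longer matters.
static const std::wstring breaker = L"\u5f00\u5173";
```

On an implementation whose wchar_t holds Unicode code points, breaker[0]
and breaker[1] should then be 0x5F00 and 0x5173; if they aren't, the
implementation is using some other wide character encoding, which is
worth knowing in itself.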

2. What encoding does std::wstring use?

The same one.

The same one as what? In practice, std::wstring can probably
handle any encoding which will fit. The compiler ignores the
encoding, except for wide character literals, and the library
will use whatever encoding is specified by the locale it uses in
a given function---which isn't necessarily the same as the one
the compiler used when it interpreted the wide character
literals.

Again: you will find the answer to your questions by printing
out the numerical values of your wide and narrow character
strings, using a test program, instead of guessing as to
what's going on.

I'd use two steps: print out the numerical values in the
program, and dump the numerical values of the bytes in the file.
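
For the second step, something along these lines would do. It's only a
sketch: dumpFileBytes is a hypothetical name, and any hex dump utility
you have on hand would serve just as well.

```cpp
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: return the bytes of a file as space-separated
// two-digit hex values. Returns an empty string if the file can't be
// opened (or is empty).
std::string dumpFileBytes(const char* path)
{
    // Binary mode, so that no newline translation hides what is
    // really in the file.
    std::ifstream in(path, std::ios::binary);
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    bool first = true;
    char c;
    while (in.get(c)) {
        if (!first) out << ' ';
        first = false;
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        out << std::setw(2)
            << static_cast<unsigned>(static_cast<unsigned char>(c));
    }
    return out.str();
}
```

Comparing this dump with the values printed inside the program shows
whether the bytes were already wrong before they were written, or only
look wrong because of whatever is displaying the file.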

--
James Kanze