Re: How to emit UTF-8 from console mode program?

Alberto Ganesh Barbati <>
Thu, 2 Oct 2008 17:40:25 CST
Siegfried Heintze wrote:

The following perl program works when I run it from urxvt-X console on
cygwin-x windows when running on Microsoft Windows XP:

LC_CTYPE=en_US.UTF-8 urxvt-X.exe&
perl -wle "binmode STDOUT, q[:utf8]; print chr() for 0x410 .. 0x430;"

This little one-liner prints the Russian alphabet in Cyrillic. With some
slight modification it will print a lot of other alphabets too,
including Hebrew, Chinese and Japanese.

It does not work with cmd.exe because apparently cmd.exe cannot deal with

Can someone help me translate it into C++? I would not expect it to work
from cmd.exe with C++, but I am hopeful it will work with urxvt-X!

This does not work:

for(int ii = 0x410; ii < 0x430; ++ii) std::wcout << (wchar_t) ii;

I obviously need to tell urxvt-X that I want to use utf-8 but I don't know
how! I suppose UTF-16 would be fine too. I just want to see some Chinese and
Cyrillic glyphs.

I'm afraid it can't be done portably. Assuming you work on Windows
and UTF-16 is good for you, you could use:

   for(int ii = 0x410; ii < 0x430; ++ii)
     std::cout
       << (char)(unsigned char)(ii & 0xff)          // low byte first (UTF-16LE)
       << (char)(unsigned char)((ii >> 8) & 0xff);  // then the high byte

Notice that I used cout and not wcout.

However, there's a big pitfall that you should be aware of! As cout is
opened as a *text* file, if you ever try to output a character such as
U+040A (CYRILLIC CAPITAL LETTER NJE) the output will get corrupted,
because its low byte '\x0a' == '\n' will trigger CR/LF expansion.
Therefore I discourage using this approach at all.
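
To make the pitfall concrete, here is a minimal sketch that splits a
code point into the same two bytes the loop above writes and checks
whether either of them is '\n'. The helper names are made up for this
illustration, and it assumes code points in the BMP only (one 16-bit
unit each).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: the two UTF-16LE bytes of a BMP code point,
// low byte first, in the order the loop above writes them.
std::string utf16leBytes(unsigned u)
{
    assert(u <= 0xFFFF);  // assumption: BMP only, no surrogate pairs
    std::string s;
    s += static_cast<char>(u & 0xff);
    s += static_cast<char>((u >> 8) & 0xff);
    return s;
}

// True if a text-mode stream would see a '\n' byte for this code point.
bool containsNewlineByte(unsigned u)
{
    return utf16leBytes(u).find('\n') != std::string::npos;
}
```

With this, containsNewlineByte(0x040A) is true while
containsNewlineByte(0x0410) is false. The same happens for a code point
whose high byte is 0x0A, such as U+0A05.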

For UTF-8 the situation is slightly better, but you must implement
the algorithm to convert from a Unicode code point to its UTF-8
encoding yourself:

   std::string utf8encode(int u);

   for(int ii = 0x410; ii < 0x430; ++ii)
     std::cout << utf8encode(ii);

I say that it's slightly better because the byte '\n' can occur as
a UTF-8 code unit only when encoding U+000A, so you never trigger CR/LF
expansion inadvertently.
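
One possible way to write utf8encode is sketched below. It assumes the
argument is a valid Unicode scalar value (not a surrogate, and no
larger than U+10FFFF) and only checks that with an assert, so treat it
as a starting point rather than a hardened implementation.

```cpp
#include <cassert>
#include <string>

// Sketch: encode one Unicode code point as 1 to 4 UTF-8 bytes.
std::string utf8encode(int u)
{
    // Assumption: caller passes a valid scalar value.
    assert(u >= 0 && u <= 0x10FFFF && !(u >= 0xD800 && u <= 0xDFFF));
    std::string s;
    if (u < 0x80) {                 // 0xxxxxxx
        s += static_cast<char>(u);
    } else if (u < 0x800) {         // 110xxxxx 10xxxxxx
        s += static_cast<char>(0xC0 | (u >> 6));
        s += static_cast<char>(0x80 | (u & 0x3F));
    } else if (u < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
        s += static_cast<char>(0xE0 | (u >> 12));
        s += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (u & 0x3F));
    } else {                        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        s += static_cast<char>(0xF0 | (u >> 18));
        s += static_cast<char>(0x80 | ((u >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (u & 0x3F));
    }
    return s;
}
```

Every byte of a multi-byte sequence has its top bit set, which is why
0x0A can never appear in one by accident.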

Other options include writing a codecvt<> facet that performs the
wchar_t to UTF-8 conversion (not an easy task!), making a locale object
with it and then imbuing that locale in a wofstream. Imbuing the locale
in cout/wcout wouldn't solve your problem because only file stream
buffers actually use the codecvt facet. The advantage of this approach
is that it is portable.
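
A rough sketch of the facet approach follows. The class and function
names are hypothetical. It handles output only, and it assumes each
wchar_t holds one complete code point, which is not true for characters
outside the BMP on platforms where wchar_t is 16 bits wide. A real
facet would also need do_in and do_length for reading.

```cpp
#include <cassert>
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical output-only facet: each wchar_t is treated as one code point.
class Utf8OutFacet : public std::codecvt<wchar_t, char, std::mbstate_t> {
protected:
    result do_out(std::mbstate_t&,
                  const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next,
                  char* to, char* to_end, char*& to_next) const override
    {
        assert(from <= from_end && to <= to_end);  // sanity-check the ranges
        static const unsigned char lead[] = {0, 0, 0xC0, 0xE0, 0xF0};
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            const unsigned long u = static_cast<unsigned long>(*from_next);
            if (u > 0x10FFFF) return error;
            const int n = u < 0x80 ? 1 : u < 0x800 ? 2 : u < 0x10000 ? 3 : 4;
            if (to_end - to_next < n) return partial;  // not enough room yet
            if (n == 1) {
                *to_next++ = static_cast<char>(u);
            } else {
                // Lead byte, then n-1 continuation bytes of 6 bits each.
                *to_next++ = static_cast<char>(lead[n] | (u >> (6 * (n - 1))));
                for (int i = n - 2; i >= 0; --i)
                    *to_next++ = static_cast<char>(0x80 | ((u >> (6 * i)) & 0x3F));
            }
            ++from_next;
        }
        return ok;
    }
    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 0; }    // variable width
    int do_max_length() const noexcept override { return 4; }
};

// Writes text to the named file as UTF-8 through the facet.
void writeUtf8File(const char* path, const std::wstring& text)
{
    std::wofstream out;
    // Imbue before opening so the file buffer uses the facet from the start.
    // The locale takes ownership of the facet.
    out.imbue(std::locale(std::locale::classic(), new Utf8OutFacet));
    out.open(path);
    out << text;
}
```

For example, writeUtf8File("out.txt", L"\u0410\u0411") should leave the
four bytes D0 90 D0 91 in the file.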



