Re: Best way to handle UTF-8 in C++
On May 8, 10:19 pm, Marek Borowski <marek_remo...@borowski.com> wrote:
On 08-05-2010 16:05, Sam wrote:
> Peter Olcott writes:
I want the exact std::string interface, but, the underlying
representation would be UTF-8. This means that substring
would work on the basis of Unicode CodePoints, instead of
bytes.
The point that you consistently seem to be missing is that
UTF-8 /is/ a byte-oriented representation of Unicode. If
you're asking for something that handles unicode codepoints,
what you're asking has absolutely nothing to do, whatsoever,
with UTF-8, or any other encoding. UTF-8 is just a
byte-oriented encoding of the full Unicode set.
NO. Every other 8-bit encoding has 1 byte per char.
Bullshit. There are any number of multibyte encodings, many of
them older than UTF-8.
UTF-8 is not the same! Have you ever tried what you proposed?
Until we know what Peter wants to do, it's impossible to say
whether std::string can be used "as is", or not.
std::string is perfectly capable of handling UTF-8-encoded
text, as this very news client shows: it runs on a UTF-8
platform, accepts UTF-8-encoded input from the keyboard,
composes a UTF-8-encoded message, and posts it.
Assuming that "gęś" is UTF-8 text, substr(0,2) doesn't produce
"gę" as it should.
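To make the example concrete, here is a minimal sketch, assuming the string is valid UTF-8 (the bytes 67 C4 99 C5 9B: "g" followed by two 2-byte characters). The names advance_code_points and substr_code_points are hypothetical helpers, not standard library functions. std::string::substr counts bytes, so a code-point-based variant has to step over continuation bytes (those of the form 10xxxxxx) itself.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not in the standard library): starting at byte offset
// `start`, advance over `n` code points and return the resulting byte offset.
// Stops at s.size() if the string runs out. Assumes s is valid UTF-8 and that
// `start` is on a code point boundary.
std::size_t advance_code_points(const std::string& s, std::size_t start, std::size_t n)
{
    std::size_t i = start;
    while (i < s.size() && n > 0) {
        ++i;  // step over the lead byte
        // step over continuation bytes, which all look like 10xxxxxx
        while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i;
        --n;
    }
    return i;
}

// Hypothetical code-point-based counterpart of std::string::substr:
// `pos` and `count` are measured in code points, not bytes.
std::string substr_code_points(const std::string& s, std::size_t pos, std::size_t count)
{
    const std::size_t first = advance_code_points(s, 0, pos);
    const std::size_t last = advance_code_points(s, first, count);
    return s.substr(first, last - first);
}
```

With s holding the five bytes above, s.substr(0, 2) returns "g" plus the lone lead byte C4, which is not valid UTF-8 on its own, while substr_code_points(s, 0, 2) returns the three bytes 67 C4 99. Note that every call scans from the start of the string, so this is linear in the offset rather than constant time.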
Should it? (In practice, I've not found much use for
std::string::substr. And something like std::string(s.begin(),
std::search(s.begin(), s.end(), target.begin(), target.end()))
does work as expected. But that's probably linked to my
particular type of applications; I don't think my experience
would hold in an editor, for example.)
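The search-based expression can be wrapped up as follows. This is only a sketch; prefix_before is a hypothetical name. It works on UTF-8 without any special handling because UTF-8 is self-synchronizing: lead bytes and continuation bytes come from disjoint ranges, so a validly encoded target cannot match starting in the middle of another character. Both strings are assumed to be valid UTF-8.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: returns everything in s that precedes the first
// occurrence of target, or all of s if target does not occur.
// Operates purely on bytes.
std::string prefix_before(const std::string& s, const std::string& target)
{
    return std::string(s.begin(),
                       std::search(s.begin(), s.end(), target.begin(), target.end()));
}
```

For the bytes 67 C4 99 C5 9B and a target of C5 9B, the result is 67 C4 99, which ends on a character boundary even though nothing here knows about code points.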
Depending on what your application is doing, std::string and the
standard library might provide all you need. Or you might need
a few additional functions. Or you might be better off
transcoding on input and output (which you probably have to do
anyway) and using UTF-32 internally.
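For the transcoding option, a minimal decoding sketch might look like the following. It assumes a C++11 compiler (for char32_t and std::u32string), and utf8_to_utf32 is a hypothetical name. It rejects bad lead bytes, bad continuation bytes and truncated sequences, but as a simplification it does not reject overlong forms, surrogates, or values above U+10FFFF, which a production decoder should.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into one char32_t per code point.
// Throws std::invalid_argument on structurally malformed input.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp = 0;
        int extra = 0;  // number of continuation bytes still expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        for (; extra > 0; --extra) {
            if (i >= in.size())
                throw std::invalid_argument("truncated UTF-8 sequence");
            const unsigned char c = static_cast<unsigned char>(in[i++]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
    }
    return out;
}
```

Once the text is in a std::u32string, substr(0, 2) operates on code points directly. Code points are still not the same thing as user-perceived characters (combining sequences span several of them), so whether this is enough again depends on the application.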
--
James Kanze