Re: Best way to handle UTF-8 in C++
On May 6, 6:39 pm, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:
"Victor Bazarov" <v.baza...@comcast.invalid> wrote in
messagenews:hruqhc$lo6$1@news.eternal-september.org...
[...]
> I want a string class that works exactly the same way as
> std::string, except implements UTF-8.

I think Victor's point is that std::string does implement UTF-8.
And ISO 8859-1, and EBCDIC, and any other encoding which uses
char (as opposed to UTF-32, for example, which requires 32-bit
entities).
And I think he's only right to a point: in the end, an
std::string doesn't handle characters, it handles small
integers. In a single byte encoding, however, those small
integers are the same as your characters, with one character per
integer. So to advance one character, you can simply use ++ on
an std::string::iterator. UTF-8 does require more. And there's
no support for that "more" in C++ (including, as far as I know,
C++0x---in C++0x, you can have UTF-8 string literals, but you
can't take an std::string::iterator and advance it one UTF-8
character).
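Just to illustrate what that "more" looks like: the length of a
UTF-8 sequence is determined by its lead byte, so advancing one
character means something along these lines (a rough, untested
sketch; the function names are made up, and it assumes the
string holds valid UTF-8, with no check for the end of the
sequence):

    #include <string>

    // Length in bytes of the UTF-8 sequence whose lead byte is c
    // (assumes valid UTF-8, no error handling).
    inline int utf8Size(unsigned char c)
    {
        if (c < 0x80)           return 1;   // single byte (ASCII)
        if ((c & 0xE0) == 0xC0) return 2;
        if ((c & 0xF0) == 0xE0) return 3;
        return 4;                           // lead bytes 0xF0-0xF4
    }

    // Advance iter over one UTF-8 encoded character.
    void advanceOneUtf8(std::string::const_iterator& iter)
    {
        iter += utf8Size(static_cast<unsigned char>(*iter));
    }
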
> This means that the interface can remain the same (all of the
> member functions have the same name and same parameters) but
> the underlying meaning may be different.

It's not that easy. You can't simply implement something like

    utf8_string_iterator&
    utf8_string_iterator::operator++()
    {
        underlying_iter += size(*underlying_iter);
        return *this;
    }

since there might not be enough bytes left in the string pointed
to by underlying_iter.
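A real UTF-8 iterator has to know where the underlying sequence
ends and check before advancing, roughly like this (again
untested and stripped down to just that one issue; a usable
iterator would also need the usual typedefs, comparison
operators, and so on):

    #include <stdexcept>
    #include <string>

    class utf8_string_iterator
    {
    public:
        utf8_string_iterator(std::string::const_iterator current,
                             std::string::const_iterator end)
            : underlying_iter(current), underlying_end(end) {}

        utf8_string_iterator& operator++()
        {
            int n = size(static_cast<unsigned char>(*underlying_iter));
            // Refuse to advance over a truncated final sequence.
            if (underlying_end - underlying_iter < n)
                throw std::runtime_error("truncated UTF-8 sequence");
            underlying_iter += n;
            return *this;
        }

    private:
        static int size(unsigned char c)    // length from the lead byte
        {
            if (c < 0x80)           return 1;
            if ((c & 0xE0) == 0xC0) return 2;
            if ((c & 0xF0) == 0xE0) return 3;
            return 4;
        }

        std::string::const_iterator underlying_iter;
        std::string::const_iterator underlying_end;
    };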
--
James Kanze