Re: Best way to handle UTF-8 in C++
On May 9, 2:01 pm, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:
"James Kanze" <james.ka...@gmail.com> wrote in message
news:9f0e0bdb-04cd-4586-b397-a40d26f76cac@o14g2000yqb.googlegroups.com...
On May 9, 4:14 am, "Peter Olcott" <NoS...@OCR4Screen.com>
wrote:
"Thomas J. Gritzan" <phygon_antis...@gmx.de> wrote in
message news:hs4r9r$bg2$1@newsreader5.netcologne.de...
[...]
Now that I know how to do this myself very easily I won't
bother looking at alternatives. I will be precisely
implementing the subset of the std::string that I need:
operator[]()
With what return type?
operator[]() returns a 32-bit Unicode CodePoint.
In std::string, operator[] returns a reference. To an object
with a defined lifetime. When applied to a non-const string,
you can modify the string through it, e.g.:
s[3] = 'a';
Returns a utf8string, including a sequence of one or more
bytes representing a single UTF-8 character.
Takes either a 32-bit Unicode CodePoint or a utf8string and
returns a utf8string.
operator=()
length() in characters
reserve() in bytes
capacity() in bytes
size() in bytes
resize() in bytes
relational operators
operator>>()
operator<<()
With the exception of length and substr (assuming you want
to use character indexes), these all already work for UTF-8
in std::string.
Right, so I only need to actually implement
operator+=()
substr()
length()
No. operator+=() already works if the right hand side is
a string, and doesn't make sense for UTF-8 otherwise. And you
definitely need to handle operator[]; that's the hard one.
Given that all you apparently need is substr, length and
some sort of indexing, the simplest solution would seem to
be some sort of free functions. In practice, however,
I think you'll find that you also need some sort of
mechanism to support iterators, so that you can use the STL.
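To make the free-function idea concrete, here is a rough sketch of what length and substr might look like on top of a plain std::string. The names utf8_length, utf8_offset and utf8_substr are made up for illustration, and the code assumes the string already holds valid UTF-8; it does no validation at all.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers, not part of any library. Assumes valid UTF-8.
// A byte starts a character unless it is a continuation byte (10xxxxxx).
inline bool is_utf8_lead(unsigned char c)
{
    return (c & 0xC0) != 0x80;
}

// Number of code points in s (s.size() still gives the byte count).
std::size_t utf8_length(const std::string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i != s.size(); ++i)
        if (is_utf8_lead(static_cast<unsigned char>(s[i])))
            ++n;
    return n;
}

// Byte offset of the code point with index pos, or s.size() if pos is
// past the end. This is a linear scan, so indexing is O(n).
std::size_t utf8_offset(const std::string& s, std::size_t pos)
{
    std::size_t i = 0;
    for (; i != s.size(); ++i) {
        if (is_utf8_lead(static_cast<unsigned char>(s[i]))) {
            if (pos == 0)
                break;
            --pos;
        }
    }
    return i;
}

// Substring of at most count code points, starting at code point pos.
std::string utf8_substr(const std::string& s, std::size_t pos, std::size_t count)
{
    std::size_t first = utf8_offset(s, pos);
    std::size_t last = first;
    while (count != 0 && last != s.size()) {
        ++last;  // step over the lead byte
        while (last != s.size()
               && !is_utf8_lead(static_cast<unsigned char>(s[last])))
            ++last;  // step over its continuation bytes
        --count;
    }
    return s.substr(first, last - first);
}
```

Everything else in the list (assignment, concatenation of two strings, size() in bytes, stream I/O) keeps coming from std::string unchanged, which is the attraction of this approach.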
Where things get complicated, of course, is what operator[]
and iterator::operator* should return. (A uint32_t is an
obvious choice. Except that this doesn't allow using these
results as an lvalue.)
There is no way to make it use uint32_t as an lvalue?
I thought that I had a way.
Not in a way that is in any way compatible with the standard.
Some of us consider that a defect in the standard, but that's
the way it is. Basically, you need for &*iter to result in
a T*, or you're going to have problems; the standard actually
guarantees that *iter and operator[] return references, and
I suspect that some code counts on it, e.g.:
typename std::iterator_traits<Iter>::value_type& r = *iter;
(in a template, of course---and it has to work if the template
is instantiated on your string class).
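As an illustration of the problem, here is a minimal sketch of a read-only iterator that decodes on the fly and returns the code point by value. The class utf8_const_iterator and the function count_non_ascii are hypothetical names, and the decoder assumes valid UTF-8 with the iterator positioned on a lead byte. Because operator* yields a temporary rather than a reference, the sketch only claims the input iterator category.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical read-only iterator over the code points of a UTF-8 string.
// Assumes valid UTF-8; no error checking.
class utf8_const_iterator
{
public:
    typedef std::input_iterator_tag iterator_category;
    typedef std::uint32_t           value_type;
    typedef std::ptrdiff_t          difference_type;
    typedef void                    pointer;
    typedef std::uint32_t           reference;  // a value, not a real reference

    explicit utf8_const_iterator(std::string::const_iterator p) : p_(p) {}

    // Decodes the code point at the current position; returns it by value.
    reference operator*() const
    {
        unsigned char lead = static_cast<unsigned char>(*p_);
        int extra = extra_bytes(lead);
        std::uint32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (int i = 1; i <= extra; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3F);
        return cp;
    }

    utf8_const_iterator& operator++()
    {
        p_ += 1 + extra_bytes(static_cast<unsigned char>(*p_));
        return *this;
    }

    utf8_const_iterator operator++(int)
    {
        utf8_const_iterator tmp(*this);
        ++*this;
        return tmp;
    }

    friend bool operator==(const utf8_const_iterator& a,
                           const utf8_const_iterator& b) { return a.p_ == b.p_; }
    friend bool operator!=(const utf8_const_iterator& a,
                           const utf8_const_iterator& b) { return a.p_ != b.p_; }

private:
    // Number of continuation bytes announced by a lead byte.
    static int extra_bytes(unsigned char lead)
    {
        return lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    }

    std::string::const_iterator p_;
};

// A template that copies the value works with this iterator.
template <typename Iter>
std::size_t count_non_ascii(Iter first, Iter last)
{
    std::size_t n = 0;
    for (; first != last; ++first) {
        typename std::iterator_traits<Iter>::value_type v = *first;  // copy: OK
        // typename std::iterator_traits<Iter>::value_type& r = *first;
        // ^ would not compile here: a non-const reference cannot bind to
        //   the temporary returned by operator*.
        if (v > 0x7F)
            ++n;
    }
    return n;
}
```

Any generic code that takes a non-const reference to *iter, or uses &*iter, would fail to instantiate on this class, and there is no way to write through such an iterator.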
--
James Kanze