Re: Fat String Class

From:

Alberto Ganesh Barbati <AlbertoBarbati@libero.it>

Newsgroups:

comp.lang.c++.moderated

Date:

Tue, 8 Jul 2008 05:29:37 CST

Message-ID:

<HzGck.22813$Ca.14899@twister2.libero.it>

Le Chaud Lapin wrote:

While not an expert in the subject, I do have knowledge of foreign
languages, and I know that UNICODE is not a cure-all. There are deep,
fundamental problems with the semantics of strings and their
operations that might or might not have been solved as of 2008.
Perhaps a sensible solution exists in fragmented form in standard
libraries and encodings. Perhaps computational linguists have created
ad-hoc solutions that we are not yet aware of. Whatever the state of
the art, I would like to capture as much as possible now to minimize
regret later. I do know that the locale will be included in each and
every object of my String<>.

If you had ever read the UNICODE book you would know that UNICODE is not
just an encoding but also a collection of data tables and algorithms to
solve a large number of issues related to localization. Yes, I do
believe that UNICODE is a cure-all, because I know that UNICODE is so
much more than what most people think it is.

Though feature richness is a requirement, I am not interested in
operators like:

operator ==
operator !=
operator >
operator <

These are obvious. They already exist in my String<>, and those that
are absent will be shamelessly copied from open-source projects.

They are far from obvious, in my opinion. See below.

Instead I would like to find a terminal or near-terminal
representation of a string object that facilitates sensibility between
international string operations.

UNICODE is your best choice here.

Let's take a concrete example:

String<char> s1 = "exasperation"; // English
String<char> s2 = "exaspération"; // français

In your opinion,

1. Should [s1 == s2] be true?
2. What should be the sort order of s1 versus s2?
3. What should be the difference in representation for s1 and s2?

You are missing the point. It doesn't make any sense to compare an
English string with a French string! A string is a string. If I put an
English word in a French dictionary, the reader expects the dictionary
to be in French order and the English word should collate as-if it were
French. The locale is not attached to each single string, but rather to
the context in which the string is considered.

All string comparison operations therefore require contextual
information to be known, i.e. the locale. Once it is clear which locale
you are working in, the result of those operations is also clear and
can be determined using data tables.

The problem is that those operations should *not* be performed by
calling operator== or operator< because the interface of a binary
operator does not allow passing the contextual information! We could use
a global locale object (as in C setlocale()) but that isn't a good
solution for at least two reasons:

1) you have to be careful with multi-threading
2) if you enter strings as keys in a map or set with one locale and then
change the locale, the map or set is totally screwed up

That's why set and map allow stateful comparison objects! Because you
want to have an English map holding strings, rather than a
(non-locale-aware) map holding English strings.
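A minimal sketch of such a stateful comparison object, assuming
std::string keys and the standard std::collate facet. The names
LocaleLess, LocaleSet and make_set are hypothetical and chosen only for
illustration.

```cpp
#include <locale>
#include <set>
#include <string>

// Hypothetical comparator (not from the original post): the locale is
// state of the comparison object, not of the strings being compared.
struct LocaleLess {
    std::locale loc;

    explicit LocaleLess(const std::locale& l) : loc(l) {}

    bool operator()(const std::string& a, const std::string& b) const {
        // Delegate to the collation rules of the stored locale.
        const std::collate<char>& coll =
            std::use_facet<std::collate<char> >(loc);
        return coll.compare(a.data(), a.data() + a.size(),
                            b.data(), b.data() + b.size()) < 0;
    }
};

typedef std::set<std::string, LocaleLess> LocaleSet;

// Build an empty set whose ordering is fixed by `loc` for its lifetime.
LocaleSet make_set(const std::locale& loc) {
    return LocaleSet(LocaleLess(loc));
}
```

Calling make_set(std::locale::classic()) works everywhere. A call such
as make_set(std::locale("fr_FR.UTF-8")) would give a "French set holding
strings", but only if a locale of that name is installed, since locale
names are implementation-defined. The set keeps its own copy of the
locale, so a later change to the global locale does not disturb its
ordering.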

Note that s1 and s2 might be in a container of my choosing where
comparison operation is operator == for class String (Yes, I know this
is not the way std:: does it).

As I said, a locale-aware comparison should not be performed with ==.

I am not concerned with efficiency in space. If it turns out that
sizeof(String<>) must be == 16 to preempt later grief, so shall it
be.

Storing the locale in String<> objects is useless, because the locale is not
an attribute of the string. Therefore I don't think String<> should ever
need to be larger than a pointer and a size_t, unless you are using a
short-string optimization technique.

And finally, to the polyglots of the group whose native language(s) is
not English, how would you design a string class, with and without
consideration for ASCII?

In my opinion, a string should model a sequence of UNICODE characters. A
UNICODE character can be encoded with one or more code-units according
to some encoding, which ought to be hidden by the interface. The problem
is that most string implementations model a sequence of code-units
instead, thus the encoding "leaks" through the interface. This is bad.
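To illustrate the leak with std::string holding UTF-8, here is a small
sketch. The helper count_code_points and the constant francais are
hypothetical names, and the helper assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-8 string by
// skipping continuation bytes (those of the form 10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// "français" spelled out as UTF-8 bytes; U+00E7 takes two code units.
// The literal is split so that 'a' is not read as part of the \x escape.
const std::string francais = "fran\xC3\xA7" "ais";
```

Here francais.size() reports 9 code units while
count_code_points(francais) reports 8 code points: the interface
exposes the encoding rather than the characters. Code points are
themselves only an approximation of what a reader perceives as a
character, which is another reason to keep the representation hidden.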

The work done by the Python community is exemplary. Python 2.x has two
classes, namely str and unicode. The former is a sequence of bytes and
the latter a sequence of UNICODE characters. The problem is that a
sequence of bytes can also be interpreted as a sequence of characters in
some yet-to-be-specified code-page. A mess. Python 3.x fixes the
problem: the unicode class has been removed, str now models a sequence
of UNICODE characters and a new class bytes models a sequence of bytes,
which cannot be interpreted as a sequence of characters. In order to
interpret a bytes object as a string you need to apply the function
decode() and you get a str object. The encode() function performs the
opposite transformation.
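The same separation can be sketched in C++ (C++11 or later), with
std::string standing in for bytes and std::u32string for a sequence of
UNICODE code points. The functions decode_utf8 and encode_utf8 are
hypothetical names, the encoding is assumed to be UTF-8, and validation
is deliberately minimal: overlong forms and surrogates are not rejected.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: code points -> UTF-8 bytes.
std::string encode_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x110000) {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            throw std::invalid_argument("code point out of range");
        }
    }
    return out;
}

// Hypothetical helper: UTF-8 bytes -> code points.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3;
        } else {
            throw std::invalid_argument("invalid UTF-8 lead byte");
        }
        if (i + extra >= bytes.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

The point of the two distinct types is that code holding a std::string
of bytes cannot treat it as text without going through decode_utf8
first, which forces the encoding to be named at the boundary.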

Back to my idea of what a string should be: the binary operators should
perform non-locale-aware lexicographical comparison, for convenience only.
Any other form of comparison should be provided through the use of
locale objects, whose interface should make them suitable to be used as
comparison function in containers like set and map.
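The standard std::locale already has this shape for std::basic_string:
its operator() compares two strings using the locale's collate facet.
A short sketch follows; CollatedMap, make_collated_map and plain_less
are hypothetical names used only for illustration.

```cpp
#include <locale>
#include <map>
#include <string>

// std::locale::operator() compares two strings using the locale's
// collate facet, so a locale object can serve as a map's ordering.
typedef std::map<std::string, int, std::locale> CollatedMap;

// Hypothetical helper: an empty map ordered by the rules of `loc`.
CollatedMap make_collated_map(const std::locale& loc) {
    return CollatedMap(loc);
}

// The "convenience only" comparison: plain code-unit order, no locale.
bool plain_less(const std::string& a, const std::string& b) {
    return a < b;
}
```

For full UNICODE collation the standard facets are usually not enough,
which is where a library such as ICU comes in.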

Anyway, have a look at http://www.icu-project.org/

HTH,

Ganesh
