Re: Caseless String

From:

"James Kanze" <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++.moderated

Date:

19 Nov 2006 16:26:56 -0500

Message-ID:

<1163961523.089680.42020@h48g2000cwc.googlegroups.com>

Le Chaud Lapin wrote:

Update:

After some fiddling, at least it seems that it is possible to have a
caseless string.

It's certainly possible. Whether it is a good idea or not is
another question. Whether 'I' and 'i' should compare equal, for
example, depends on the locale---do you really want to embed
locale information in each string object, or is it more
reasonable to provide it as a separate argument. (And if you do
embed it, what do you do when comparing two strings with
different locale information?)
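As a minimal sketch of the "separate argument" alternative, assuming narrow strings and a purely per-character comparison: the function name equal_ignore_case is hypothetical, and the only library facility used is the two-argument std::toupper from <locale>.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper (not a standard function): compares two narrow
// strings character by character, ignoring case as defined by loc.
// The locale is passed in by the caller instead of being stored in
// each string object.
inline bool equal_ignore_case(std::string const& lhs,
                              std::string const& rhs,
                              std::locale const& loc = std::locale())
{
    if (lhs.size() != rhs.size()) {
        return false;   // a per-character comparison needs equal lengths
    }
    for (std::size_t i = 0; i != lhs.size(); ++i) {
        // Two-argument std::toupper uses the ctype facet of loc.
        if (std::toupper(lhs[i], loc) != std::toupper(rhs[i], loc)) {
            return false;
        }
    }
    return true;
}
```

With this shape, the question of two strings carrying different locales never arises: the caller names the one locale that governs the comparison.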

I don't have the same kind of warm and fuzzy that I
got when developing, say, Priortized_Associative_Set<>, but it's a
start:

My objective was to do caseless string comparisons:

Caseless_String s1 = "Hello";
String s2 = "HeLLo";
s1 == s2; // true.

I am not sure, but it seems so far that a way to do this is to not
define caseless strings, but a caseless character class:

I'm sceptical. In a caseless comparison, "SS" == "ß", for
example (a two character sequence compares equal to a single
character).

template <typename C> struct Caseless
{
    typedef C Type;
    C c;
    Caseless(C c = 0) : c(c) {}

// operator C & () {return c;}
} ;

Then define, for example,

String<Caseless<wchar_t> > s3 = "World.";
String<wchar_t> s4 = "WORLD.";

s3 == s4; // true

The code for template String<> would be written as it would by anyone
making a string class template, with the exception that most of the
member functions would be templates themselves. More about that in a
moment.

The key for caseless comparisons is to define global operators for
comparisons between a Caseless<> character and any other character.
The rule is simple - whenever a Caseless<> character is involved in a
comparison with any other type of character (including another
Caseless<> character), both get converted using toupper before doing
the comparison:

template <typename C, typename X>
inline bool operator == (const Caseless<C> &c, const X &x)
{return toupper(c.c) == toupper(x);}

template <typename C, typename X>
inline bool operator != (const Caseless<C> &c, const X &x)
{return toupper(c.c) != toupper(x);}

Be careful. That looks like undefined behavior for most
instantiations of Caseless to me. You probably mean:

return std::toupper( c.c, std::locale() )
    == std::toupper( x, std::locale() ) ;

or something along those lines in the function.
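The concern is presumably that the one-argument toupper from <cctype> takes an int whose value must be representable as unsigned char (or be EOF), which a negative char or an arbitrary wchar_t does not guarantee. A minimal sketch of the locale-based form follows. It assumes, as a simplification not in the original, that both operands wrap the same character type C, and it uses the current global locale.

```cpp
#include <locale>

template <typename C>
struct Caseless
{
    typedef C Type;
    C c;
    Caseless(C c = 0) : c(c) {}
};

// Both operands are Caseless<C>.
template <typename C>
inline bool operator==(Caseless<C> const& lhs, Caseless<C> const& rhs)
{
    std::locale const loc;   // copy of the current global locale
    return std::toupper(lhs.c, loc) == std::toupper(rhs.c, loc);
}

// Caseless<C> on the left, plain C on the right.
template <typename C>
inline bool operator==(Caseless<C> const& lhs, C const& rhs)
{
    std::locale const loc;
    return std::toupper(lhs.c, loc) == std::toupper(rhs, loc);
}

// Plain C on the left, Caseless<C> on the right.
template <typename C>
inline bool operator==(C const& lhs, Caseless<C> const& rhs)
{
    return rhs == lhs;
}

// Inequality is the negation of the matching equality above.
template <typename L, typename R>
inline bool caseless_differs(L const& lhs, R const& rhs)   // hypothetical helper
{
    return !(lhs == rhs);
}
```

The two-argument std::toupper is a template on the character type, so it is defined for both char and wchar_t without any conversion through int.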

These two functions are just two of a set of functions defined for when
a Caseless<> character is present as the right operand, the left
operand, or both operands.

Furthermore, I noticed that std::string does not allow copy
construction from a narrow string to a wide string or vice-versa:

std::string s6 = "Hallo";
std::wstring s7 = s6; // Construction from different type not permitted.

Copy construction, no, because it doesn't make sense.
(Technically, it wouldn't be copy construction anyway, but I think
it's clear that you mean a converting constructor, taking a
single parameter.)

Basically, I think that the intent is that there be a single
encoding for wchar_t, with conversion on the fly during input
and output, and numerous different encodings for char, depending
on the locale. The problem with a conversion constructor is
that it must be told the encoding of the narrow string.

Perhaps something along the lines of:

     std::string::string( std::wstring const&,
                          std::locale const& = std::locale() ) ;
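Since constructors cannot be added to std::string from outside, the same idea can be sketched as free functions. The names to_wide and to_narrow are hypothetical. The sketch uses the ctype<wchar_t> facet of the supplied locale, which converts one char to one wchar_t, so it assumes a single-byte narrow encoding; a multibyte encoding would need a codecvt facet instead.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: widen a narrow string using the ctype facet of loc.
inline std::wstring to_wide(std::string const& s,
                            std::locale const& loc = std::locale())
{
    std::wstring result(s.size(), L'\0');
    if (!s.empty()) {
        std::use_facet<std::ctype<wchar_t> >(loc)
            .widen(s.data(), s.data() + s.size(), &result[0]);
    }
    return result;
}

// Hypothetical helper: narrow a wide string; characters with no narrow
// equivalent in loc are replaced by fallback.
inline std::string to_narrow(std::wstring const& s,
                             std::locale const& loc = std::locale(),
                             char fallback = '?')
{
    std::string result(s.size(), '\0');
    if (!s.empty()) {
        std::use_facet<std::ctype<wchar_t> >(loc)
            .narrow(s.data(), s.data() + s.size(), fallback, &result[0]);
    }
    return result;
}
```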

     [...]

I have not yet thought about signed-ness for different character types.
I suspect that there is trouble ahead. Still, this seems better than
the other options.

If you're really starting from scratch, don't allow signed
integral types. It causes no end of headaches.
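One way to enforce such a restriction at compile time, sketched here with C++11 facilities that were not available when this was written: the wrapper name Unsigned_Char is hypothetical. Note that on many platforms plain char and wchar_t are signed, so this check would reject them.

```cpp
#include <type_traits>

// Hypothetical character wrapper that refuses to be instantiated
// with anything other than an unsigned integral type.
template <typename C>
struct Unsigned_Char
{
    static_assert(std::is_integral<C>::value && std::is_unsigned<C>::value,
                  "character type must be an unsigned integral type");
    C c;
    Unsigned_Char(C c = 0) : c(c) {}
};
```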

Note finally, that the Caseless<> template could be useful in its own
right. For example, to check whether a string contains the letter 'z'
or 'Z', without regard for case, one could wrap a lower-case 'z' in a
Caseless<>, then supply the packaged character to a bool contains ()
template member function of the string.
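For comparison, the same check can be sketched against std::string without any wrapper type. The function contains_ignore_case is hypothetical, stands in for the contains() member described above, and takes the locale as an explicit argument.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical free function: true if s contains wanted in either case,
// with case defined by loc.
inline bool contains_ignore_case(std::string const& s,
                                 char wanted,
                                 std::locale const& loc = std::locale())
{
    char const upper = std::toupper(wanted, loc);
    return std::find_if(s.begin(), s.end(),
                        [&](char ch) { return std::toupper(ch, loc) == upper; })
           != s.end();
}
```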

I do have one question:

Caseless<> has only one member, c, whose type is the type of the
character being wrapped. I used Visual C++ to verify that
sizeof(Caseless<char>) == 1. This makes sense. There is only 1 field,
and it has an alignment requirement of 1-byte.

I would like to know if I can rely on this behavior in general.

Officially no. On a word addressed machine, probably not. It's
pretty much a question of the system API, however, so if one
compiler is OK on the system, all will be. In practice, you're
OK at least under Windows, Linux on PC and Solaris on Sparc; I
suspect that you're OK on all, or almost all byte addressed
machines.
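Since the guarantee is not formal, code that depends on it can at least make the assumption explicit, so that a platform where it fails is caught at compile time. A minimal sketch, using C++11 static_assert and repeating the Caseless<> definition from above:

```cpp
template <typename C>
struct Caseless
{
    typedef C Type;
    C c;
    Caseless(C c = 0) : c(c) {}
};

// Compilation fails on any platform where the wrapper adds padding.
static_assert(sizeof(Caseless<char>) == sizeof(char),
              "Caseless<char> is larger than char");
static_assert(sizeof(Caseless<wchar_t>) == sizeof(wchar_t),
              "Caseless<wchar_t> is larger than wchar_t");
```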

--
James Kanze (Gabi Software) email: james.kanze@gmail.com
Conseils en informatique orientée objet/
                    Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
