Re: Help migrating hash_set to c++0x
On Dec 22, 12:59 am, Howard Hinnant <howard.hinn...@gmail.com> wrote:
> On Dec 21, 6:32 pm, James Kanze <james.ka...@gmail.com> wrote:
> > [...]
> > > This is a big win in generic code. For example, storing
> > > pointers or iterators in containers is common practice.
> > You mean an excessively dangerous practice, banned by a lot of
> > coding guidelines. (Iterators and pointers tend to become
> > invalid at inconvenient times.)
> No, I see containers of iterators or pointers into other containers
> used in a safe and useful manner on a regular basis. I see this
> pattern recommended here on this newsgroup with little or no
> objection.
I've not seen it recommended by any serious professional. And
it *is* dangerous: iterators are, by design, ephemeral. They
are more or less the equivalent of pointers to local or member
variables. (As with all "rules", there are some exceptions.
But you won't find them much in application code.)
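
A minimal sketch of the kind of failure behind those guidelines
(the container and the values are purely illustrative):

    #include <vector>

    int main()
    {
        std::vector<int> v(3, 0);
        std::vector<std::vector<int>::iterator> its;
        its.push_back(v.begin());  // store an iterator into v
        v.push_back(42);           // may reallocate: every stored
                                   //   iterator into v is now invalid
        // *its[0] = 1;            // would be undefined behavior
    }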
> > > Making hash<char*> behave differently than hash<int*> would
> > > have even more severe consequences than making vector<bool>
> > > behave differently than vector<int>.
> > Come now. In both cases, all that's needed is a little
> > specialization (or overloading, if it's a question of
> > functions). With the difference that I suspect unordered
> > containers are fairly rare in generic code, whereas vectors
> > aren't.
> I expect unordered containers to become about as popular as
> (multi)map/set, perhaps a little more. Naturally the future is
> always difficult to predict.
I don't doubt that they'll become popular. But I can't imagine
much generic code which deals with hash; if it does, it's not
very generic, since it's limited to unordered containers.
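
For what it's worth, the generic code in question might look
something like this (the function name is illustrative). It behaves
identically for int* and char* only because std::hash<T*> hashes
the pointer value itself:

    #include <unordered_set>
    #include <iterator>
    #include <cstddef>

    // Count distinct pointer values in a range of T*. If
    // std::hash<char*> hashed the characters instead, this template
    // would silently change meaning for char* alone: two distinct
    // buffers holding equal strings would count as one.
    template <class It>
    std::size_t count_distinct(It first, It last)
    {
        typedef typename std::iterator_traits<It>::value_type T;
        std::unordered_set<T> seen(first, last);
        return seen.size();
    }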
> > > If there is a silver lining to the vector<bool> cloud, it is
> > > that it taught us a lesson in generic programming. Best we
> > > not forget that lesson.
> > The problem with vector<bool> isn't just that it's unorthogonal.
> > It's that it doesn't meet any of the requirements for a
> > container. (Arguably, some of the requirements may be
> > overspecification; it's a bit hard that operator[] must return
> > a real reference, for example. But that's a different issue.)
> The problem with vector<bool> is that it has expected semantics
> based on experience with vector<int>, vector<char>, vector<char*>,
> etc.
Expected semantics based on the fact that vector is a sequence.
> And those expectations are not always met. This has more to do
> with the consistency of the client's experience than it does with
> a spec that may or may not be thoroughly read by the majority of
> the client authors. Don't get me wrong: I think a dynamically
> sized array of bits is a very beautiful and useful data structure.
> It just shouldn't have been named vector<bool>, because vector<T>
> for all scalars has a well defined behavior without the
> specialization, and the specialization of vector<bool>,
> implementing a dynamic array of bits, is unable to completely
> mimic the data structure of a dynamic array of bools. /That/ is
> why it was a mistake -- a mismatch of reasonable expectations.
Agreed. Or in more formal terms, the standard defines
a contract for std::vector, then violates it for
std::vector<bool>.
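
To make the broken contract concrete, a minimal sketch:

    #include <vector>

    int main()
    {
        std::vector<int>  vi(2, 0);
        std::vector<bool> vb(2, false);

        int* pi = &vi[0];      // fine: operator[] returns a real int&
        // bool* pb = &vb[0];  // does not compile: operator[] returns
                               //   a proxy object, not a bool&

        auto x = vb[0];        // x is vector<bool>::reference, a proxy
        vb[0] = true;          // ...so x now observes the new value,
                               //   which no other vector<T> would do
    }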
> > If we were designing the language from scratch, I'd totally
> > agree with you. But we're not, and for better or worse, char
> > const* is, in most people's minds (and in a lot of the
> > interfaces we have to deal with), a string.
> For C++0x we are designing std::hash from scratch. And we are also
> designing std::unordered_(multi)map/set from scratch, though
> admittedly gratefully guided by a fair amount of field experience.
> That field experience indicates to me that:
>
> 1. std::hash should treat all pointers uniformly.
>
> 2. For storing containers of strings in a hash container,
> std::string, not char*, is the preferred key. The latter is
> replete with memory ownership issues that the beginner is
> ill-equipped to deal with.
>
> 3. If the client really wants to store null-terminated char*'s,
> dealing with memory ownership himself, then the hash containers
> should allow that. They do: write your own hash function. This
> client must be knowledgeable enough to handle the manual memory
> management, manual equality comparison, and manual hash functions.
> It is all allowed, it just isn't the default.
I'm not convinced that all of the above are that essential, but
I am beginning to think that treating std::hash<char*> as just
another pointer may be OK. If only because you also want it to
work when the pointer is null. And of course, because the cases
where the container will be indexed by C style strings should be
very exceptional, and the coder should expect to have to do
something exceptional in that case. (On the other hand, I'm
sure we both know programmers who think char const* is
a string.)
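
Concretely, such a client would write something along these lines
(the functor names and the djb2 constants are illustrative, not
anything the standard mandates):

    #include <unordered_set>
    #include <cstring>
    #include <cstddef>

    struct cstr_hash {    // hash the characters, not the pointer
        std::size_t operator()(char const* s) const
        {
            std::size_t h = 5381;               // djb2, for example
            for (; *s; ++s)
                h = h * 33 + static_cast<unsigned char>(*s);
            return h;
        }
    };

    struct cstr_eq {      // content equality, not pointer equality
        bool operator()(char const* a, char const* b) const
        {
            return std::strcmp(a, b) == 0;
        }
    };

    // Note: neither functor tolerates a null pointer; that, too,
    // becomes the client's problem.
    std::unordered_set<char const*, cstr_hash, cstr_eq> table;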
> > Anyway, I doubt that it's a real problem. Since the standard
> > doesn't make any quality guarantees (and I don't see how it
> > could), no one will actually use the standard hash function
> > anyway. (It suffers from the same problem rand() does: anything
> > that's good enough to work in all cases will be unnecessarily
> > slow in a number of frequent cases.)
> rand suffered from a different problem: an implicitly recommended
> poor implementation.
That aggravated the problem, I agree. But the problem is, IMHO,
more general: if you're doing any serious work, and you need
random numbers, you need a minimum quality for those random
numbers.

Similarly, if you're using an unordered container, it's probably
because of speed as well -- if not, std::vector and std::find
work very well. And speed likely implies that you'll want
a guaranteed minimum quality for the hash, for the data sets
you expect. And while I think it would be possible to define
a hash with a minimum quality for std::string (std::basic_string
already causes problems, because the hash has to be consistent
with char_traits::eq), I don't know of any generic way of
specifying quality criteria for all possible types; for that
matter, I don't know how you'd specify the quality criteria
other than by specifying an implementation.
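
To make the char_traits::eq constraint concrete: with a
case-insensitive traits class (in the style of GotW #29), the hash
must hash what eq compares, or equal keys would land in different
buckets. A sketch, with illustrative names:

    #include <string>
    #include <cctype>
    #include <cstddef>

    struct ci_traits : std::char_traits<char> {
        static bool eq(char a, char b)
        {
            return std::tolower((unsigned char)a)
                == std::tolower((unsigned char)b);
        }
        static bool lt(char a, char b)
        {
            return std::tolower((unsigned char)a)
                 < std::tolower((unsigned char)b);
        }
        static int compare(char const* a, char const* b, std::size_t n)
        {
            for (; n != 0; --n, ++a, ++b) {
                if (lt(*a, *b)) return -1;
                if (lt(*b, *a)) return  1;
            }
            return 0;
        }
        // (a full traits class would replace find() as well)
    };

    typedef std::basic_string<char, ci_traits> ci_string;

    struct ci_hash {
        // Must agree with ci_traits::eq: hash the lower-cased
        // characters, so that "abc" and "ABC" hash identically.
        std::size_t operator()(ci_string const& s) const
        {
            std::size_t h = 5381;
            for (std::size_t i = 0; i != s.size(); ++i)
                h = h * 33 + std::tolower((unsigned char)s[i]);
            return h;
        }
    };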
> <random> goes a long way to correct that problem ... a very long
> way.
Some would say too far :-). (Over-engineering and all that.
Personally, since it's someone else who did the work, I'm all
for over-engineering.)
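
For reference, the <random> replacement for the rand/modulo idiom
looks roughly like this (the engine and the distribution are
illustrative choices):

    #include <random>
    #include <iostream>

    int main()
    {
        std::mt19937 engine(42);   // Mersenne twister, explicit seed
        std::uniform_int_distribution<int> die(1, 6);  // a fair die
        for (int i = 0; i != 10; ++i)
            std::cout << die(engine) << ' ';
        std::cout << '\n';
    }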
> <random> is one of the most impressive random number libraries
> I've seen for general purpose use. And it took a tremendous effort
> and expense from a few very dedicated individuals to guide it
> through the standardization process (I was not one of those).
>
> But I agree with you that clients should feel free to write their
> own hash functions. And I'm quite glad we didn't try to
> standardize, or even give an example of, a hashing function for
> arrays of char. That would have repeated the problems of rand.
> When prototyping, try the default first. If hashing turns up
> performance problems, then write your own; else don't. The library
> anticipates both cases. I expect some implementations of
> std::hash<std::string> to be better than others, and the market to
> decide which implementation does best in this area. I have no
> doubt that all implementations will do their best to become the
> best. Vendors of std::lib implementations want to please their
> customers. They have high motivation to do so.
Agreed in general, although I suspect that there's one issue
that people will tend to forget: portability of the data. As
long as the unordered containers are entirely within a single
process, or only shared by code using the same implementation
(and those are the only cases the standard should be concerned
with), using the standard hash is fine, even if it is
underspecified. I fear, however, that there are people who will
use it (because it's there) to hash tables on disk, or on other
external media. I don't think it's a problem the standard
should address (other than to specify that the hashing
algorithms are "unspecified"), but teachers should at least
mention the issue.
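
For that use one wants a fully specified algorithm rather than
std::hash. A sketch using 32-bit FNV-1a, whose constants are
published and implementation independent:

    #include <cstdint>
    #include <cstddef>

    // Same output on every platform and library implementation, so
    // it is safe for tables written to disk; std::hash offers no
    // such guarantee.
    std::uint32_t fnv1a(unsigned char const* data, std::size_t n)
    {
        std::uint32_t h = 2166136261u;  // FNV offset basis
        for (std::size_t i = 0; i != n; ++i) {
            h ^= data[i];
            h *= 16777619u;             // FNV prime
        }
        return h;
    }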
--
James Kanze