Re: How to get an insertion hint for an unordered associated container?

From:

James Kanze <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++

Date:

Fri, 27 Nov 2009 04:39:15 -0800 (PST)

Message-ID:

<ca8a4807-60f6-4376-b525-1311f3e36b34@s20g2000yqd.googlegroups.com>

On Nov 27, 4:30 am, Pavel
<pauldontspamt...@removeyourself.dontspam.yahoo> wrote:

James Kanze wrote:

On Nov 24, 5:24 am, Pavel
<pauldontspamt...@removeyourself.dontspam.yahoo> wrote:

James Kanze wrote:

On Nov 21, 8:34 pm, Pavel

[...]

they have to re-compute same information in insert() again (at
least bucket index and hash code). I hope they will use it in
the future though and that API is there to allow optimization,
not just for compatibility..

Explain how? The iterator doesn't contain the hash code in any
way.

Iterator may contain anything that addresses the element, in
particular hash_code (it does so in one version of GCC
hashtable, more precisely, points to the node that contains
hash_code).

That's true, but how would that help? The hint has to be
checked; it's not an absolute. So the insertion code would
still have to calculate the hash code for the element to be
inserted. And having done that, having the iterator will not
make much difference.

It has to contain something to allow navigation (++, --). For
example, it could contain:
1. A handle for constant-time access to the bucket, to be able to find
out where the bucket ends (like an index of the bucket in some
array/vector/deque or a pointer to it),
2. A handle for constant-time access to element in the bucket (another
index or whatever)
3. A handle for constant-time access to the container (to know how to
navigate to the next bucket as you need for ++, --). Again, a pointer or
similar.

If buckets contain pointers to actual objects (which sounds
feasible as the Standard guarantees the references and
pointers to the objects are not invalidated during re-hashes),
the above is quite enough to insert the object "at" given
iterator or at first available space in the bucket pointed to
by the iterator.

You seem to be forgetting that the iterator is only a hint, and
that the insertion has to work correctly even if the hint isn't
appropriate. Which means that the inserter would still have to
verify that the element to be inserted belongs in the bucket
designated by the hint. If it doesn't, it has to find the
correct bucket; if it does, it can use the bucket designated by
the iterator. But finding the correct bucket is O(1), with very
low constant time; using the bucket designated by the iterator
doesn't really gain enough to make the extra tests worthwhile.

Note how this is different for ordered containers. The
implementation can verify in constant time whether the insertion
should be adjacent to the given iterator, where as finding the
correct position from scratch is O(lg n). In a hash table,
everything is O(1), and the only potentially expensive operation
which you'd like to eliminate is recalculating the hash code.
To do this, the implementation would have to consider the hint
as guaranteed, and just insert the object in the bucket obtained
from the iterator. Which wouldn't be conform, and probably
wouldn't work in one of the more frequent uses of the hinted
version of insert: in an insert_iterator. (I suspect that
supporting insert_iterator is a major motivation for having the
hinted inserter in the hashed containers. You can't use
back_insert_iterator, since for obvious reasons, the container
doesn't support push_back, and you do want to be able to use
std::copy to insert elements into the container.)

Most of the hash maps I've written in the past would cache
the last element found, in order to support things like:

     if ( m.contains(x) ) {
         m[x] = 2*m[x] ; // or whatever...
     }

I understand you can optimize the set for the usage pattern
you think is common. This comes at cost even for this use,
however, as the comparision may be much more expensive than
hash-function computation and you guarantee you will have an
extra comparison in m[x] at all times. We know we can do
without (see above) so why should we live with it.

It's a common usage pattern with my hashed containers: since []
has a precondition that the entry exists, you often end up
writing things like the above (and if (m.contains) followed by
use. With the STL idiom, of course, the above would probably be
written:

    Map::iterator elem = m.find(x);
    if ( elem !== m.end() ) {
        *m = 2 * *m ; // or whatever...
    }

The STL avoids multiple look-up, at the cost of IMHO a less
natural syntax. Given my syntax, however, caching will be a
win: you do have the comparison each time, but you at least
avoid the extra calcuation of the hash code.

efficiently. It still required a comparison for each
access, in order to know whether I could use the cached
value, but it did avoid multiple calculations of the hash
code when accessing the same element several times in
succession.

You are making assumptions that may be very true for your
particular problem but I would not use these in a
general-purpose library like STL.

I think it more a question of the idioms associated with the
container, than whether it is general purpose or not. With the
STL, you use find and then the returned iterator; with mine, you
use contain(), and then [].

But the equal_range() definition to return two end()
iterators on "not found" condition is a hog -- I can't
understand why this could not be defined as "return an
empty range" so either of two identical iterators that
could be used as a hint for insertion, consistent with how
it's defined for ordered ass. containers.

What does "consistent with how it's defined for ordered
containers" mean here? The value returned from lower_range
defines where any new element should be inserted, according
to the ordering relationship. In an unordered container,
there is no specific place where the new element should be
inserted. What you're asking for really isn't possible.

Why not? You can have a bucket and an index in the bucket and
the bucket has the index if the last element. If a pointer to
the element stored the bucket (think of the bucket as an
array/vector of pointers although it does not have to be it)
is NULL, insert your element into this spot; otherwise, if the
bucket has free space insert it at the end of the bucket;
otherwise, you have to re-hash (but you would have anyway;
supposedly, you still have "amortized constant" for the
average insertion time)

Several points, but the most important one is what I mentionned
before: the function must work even if the hint is incorrect.
And typically, the bucket doesn't have a fixed length: you don't
rehash because there are more than n elements in a single
bucket; you rehash because size()/bucket_count() passes a
pre-established limit.

With regards to the original poster's problem (he didn't want to
construct a complete object if the entry was already there, and
he didn't want to calculate the hash code multiple times), the
interface could be extended with a additional function,
insert_into_bucket, so that he could get the bucket using
bucket, then iterate over it to see if his target was already
present, and if not use this new function for the insertion.
Whether it's worth it is another question: it's a very
specialized use, and most users would probably be content with
just insert (which only inserts if not present), despite having
to create a new complete object.

What is the benefit of requiring them both be end()? Checking
for "not found" condition costs one iterator comparison either
way.. seems like waste to me.

No benefit, perhaps, but no real harm either. Ordered and
unordered containers are fundamentally different beasts.

If you returned two identical iterators pointing to a free
spot in the correct bucket (say the first free spot), you
would know to insert your element there without any
computations whatsoever.

You still need to calculate the hash code of the object being
inserted, to validate the hint (and in a unique hash table, do
some comparisons to ensure that the element isn't already
present). And once you've done that, there's not much
calculation left anyway.

[...]

The cost of
insertion can be anything -- in case of conflict it may even be some
secondary hash or linear or non-linear search in the overflow area or
similar. The Standard does not define how exactly the overflows are
processed.

It does require buckets, since it exposes this detail at the
interface level. The guarantees with regards to this part of
the interface are rather vague, but I would expect that if
(double)(m.bucket_count()+1)/m.size() < m.max_load_factor(), then:
    size_t b = m.bucket(obj);
    m.insert(obj);
    assert(b == m.bucket(obj));
should hold. If so, that pretty much defines how overflows are
handled.

To summarize my point:

The Standard recognizes the hint may be useful (it is still in
the API for unordered ass. containers)

Does it recognize a utility, other than compatibility of
interface (so e.g. you can instantiation an insert_iterator on
an unordered ass. container)? (I'm not sure I understand the
meaning of "The iterator q is a hint pointing to where the
search should start" in the standard. It doesn't make too much
sense when buckets are used for collision handling, at least to
me.)

and I am ok that GCC STL does not use it now -- it or another
implementation may do it in the future or I can write it
myself and the client code will continue to rely on the
Standard-compliant unordered ass. container while enjoying
the faster implementation.

I'd be interesting in seeing an actual implementation which does
use it somehow.

But, I do have an issue with the requirement to return two
end() iterators in equal_range() on "not found" condition
instead of such a hint. I think it is a defect in the Standard
that limits possible optimizations without good reason.

I would agree that it seems to be a case of overspecification.
Even if I don't see any possible advantages in returning
something else at present, that doesn't mean that they can't
exist, if not now, in the future, and there's no real advantage
in excluding them.

--
James Kanze