Re: How to get an insertion hint for an unordered associated container?

From:

Pavel <pauldontspamtolk@removeyourself.dontspam.yahoo>

Newsgroups:

comp.lang.c++

Date:

Thu, 26 Nov 2009 23:30:57 -0500

Message-ID:

<0080a244$0$8064$c3e8da3@news.astraweb.com>

James Kanze wrote:

On Nov 24, 5:24 am, Pavel
<pauldontspamt...@removeyourself.dontspam.yahoo> wrote:

James Kanze wrote:

On Nov 21, 8:34 pm, Pavel

[...]

rather than find, and using the return value as a
hint---this does cost an extra comparison in your code,
however. The cost of the second look-up in an unordered
container cannot easily be avoided, but it is O(1), and not
O(lg n). The main potentially avoidable cost would be
calculating the hash code twice, but I don't think that the
interfaces currently available have any provision to support
this.

That is actually avoidable in tr1/unordered map (you can
access bucket(const key_type&) etc). However, O(1) (on
average, worst case is linear), is not 1. In my domain area,
spending 2*O(1) instead of O(1) often means "you won, I lost".
So, I am trying to get to O(1) from 2 * O(1). If one assumes
that the lookup algorithm in hash is always like this:

1. Identify the bucket.
2. Linearly search in the bucket for the equal key or free spot or
end-of-bucket.

I'm not sure that the (upcoming) standard requires linear search
in the bucket, but it does require buckets, which are accessible
for some things (but I'm not sure what) from the user interface.
I think that the available functionality is only sufficient for
instrumentation---determining how effective your hash function
actually is.

(which it is in my version of GNU implementation of tr1
unordered_map). I could even say my problem if solved (if
only they used the hint, which they don't as you correctly
suggested). But it is not guaranteed.

The "hint" is in the form of an iterator. I'm not sure how they
could use it.

It might be useful if they provided functions for looking up and
inserting into a given bucket (at your own risk, of course);
you could then call c.bucket(key) (which calculates the hash
code), and do all further operations on the returned bucket.
Probably a bit too dangerous, however.

(Basically, it would mean> calculating the hash value in
user code, and having additional variants of functions like
find and insert which take a pre-calculated hash value.)

IIRC, the unordered containers do have variants of insert which
take a "hint", but this is only present for compatibility with
the existing ordered containers; the hint is not used.

Just noticed that.. Don't understand why they don't use it:

Because it's an iterator, and an iterator doesn't contain any
useful information to use in the case of an unordered container.

they have to re-compute same information in insert() again (at
least bucket index and hash code). I hope they will use it in
the future though and that API is there to allow optimization,
not just for compatibility..

Explain how? The iterator doesn't contain the hash code in any
way.

Iterator may contain anything that addresses the element, in particular
hash_code (it does so in one version of GCC hashtable, more precisely,
points to the node that contains hash_code). It has to contain something
to allow navigation (++, --). For example, it could contain:
1. A handle for constant-time access to the bucket, to be able to find
out where the bucket ends (like an index of the bucket in some
array/vector/deque or a pointer to it),
2. A handle for constant-time access to element in the bucket (another
index or whatever)
3. A handle for constant-time access to the container (to know how to
navigate to the next bucket as you need for ++, --). Again, a pointer or
similar.

If buckets contain pointers to actual objects (which sounds feasible as
the Standard guarantees the references and pointers to the objects are
not invalidated during re-hashes), the above is quite enough to insert
the object "at" given iterator or at first available space in the bucket
pointed to by the iterator.

Most of the hash maps I've written in the past would cache the
last element found, in order to support things like:

     if ( m.contains(x) ) {
         m[x] = 2*m[x] ; // or whatever...
     }

I understand you can optimize the set for the usage pattern you think is
common. This comes at cost even for this use, however, as the
comparision may be much more expensive than hash-function computation
and you guarantee you will have an extra comparison in m[x] at all
times. We know we can do without (see above) so why should we live with it.

efficiently. It still required a comparison for each access, in
order to know whether I could use the cached value, but it did
avoid multiple calculations of the hash code when accessing the
same element several times in succession.

You are making assumptions that may be very true for your particular
problem but I would not use these in a general-purpose library like STL.

But the equal_range() definition to return two end() iterators
on "not found" condition is a hog -- I can't understand why
this could not be defined as "return an empty range" so either
of two identical iterators that could be used as a hint for
insertion, consistent with how it's defined for ordered ass.
containers.

What does "consistent with how it's defined for ordered
containers" mean here? The value returned from lower_range
defines where any new element should be inserted, according to
the ordering relationship. In an unordered container, there is
no specific place where the new element should be inserted.
What you're asking for really isn't possible.

Why not? You can have a bucket and an index in the bucket and the bucket
has the index if the last element. If a pointer to the element stored
the bucket (think of the bucket as an array/vector of pointers although
it does not have to be it) is NULL, insert your element into this spot;
otherwise, if the bucket has free space insert it at the end of the
bucket; otherwise, you have to re-hash (but you would have anyway;
supposedly, you still have "amortized constant" for the average
insertion time)

What is the benefit of requiring them both be end()? Checking
for "not found" condition costs one iterator comparison either
way.. seems like waste to me.

No benefit, perhaps, but no real harm either. Ordered and
unordered containers are fundamentally different beasts.

If you returned two identical iterators pointing to a free spot in the
correct bucket (say the first free spot), you would know to insert your
element there without any computations whatsoever.

If you're concerned about the time necessary to calculate the
hash function, you could use a wrapper for the key, which caches
the hash function once it has been calculated, and uses the
cached value if it is present. Most of the time, I suspect that
it won't make a difference (although for strings, if the keys
can be fairly long, maybe). So you end up with something like:

     class CachedHashString : private std::string
     {
         bool isHashed;
         unsigned hashValue;
     public:
         // Duplicate any needed constructors, setting isHashed
         // to false.

         // using declarations for whatever functions you want
         // to expose (only const!)
         unsigned hash() const
         {
             if ( ! isHashed ) {
                 hashValue = ... ;
                 isHashed = true;
             }
             return hashValue;
         }
     };

and the hash function you use with the container uses the member
hash function.

I know.. but as per above, I do not think this is necessary. The cost of
insertion can be anything -- in case of conflict it may even be some
secondary hash or linear or non-linear search in the overflow area or
similar. The Standard does not define how exactly the overflows are
processed.

To summarize my point:

The Standard recognizes the hint may be useful (it is still in the API
for unordered ass. containers) and I am ok that GCC STL does not use it
now -- it or another implementation may do it in the future or I can
write it myself and the client code will continue to rely on the
Standard-compliant unordered ass. container while enjoying the faster
implementation.

But, I do have an issue with the requirement to return two end()
iterators in equal_range() on "not found" condition instead of such a
hint. I think it is a defect in the Standard that limits possible
optimizations without good reason.

--
James Kanze