Re: How to get an insertion hint for an unordered associated container?

From:

James Kanze <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++

Date:

Mon, 30 Nov 2009 01:36:37 -0800 (PST)

Message-ID:

<0f3d5a19-a8be-4c29-9a72-433586ad3760@r24g2000yqd.googlegroups.com>

On Nov 27, 9:24 pm, Pavel
<pauldontspamt...@removeyourself.dontspam.yahoo> wrote:

James Kanze wrote:

On Nov 27, 4:30 am, Pavel
<pauldontspamt...@removeyourself.dontspam.yahoo> wrote:

James Kanze wrote:

On Nov 24, 5:24 am, Pavel
<pauldontspamt...@removeyourself.dontspam.yahoo> wrote:

James Kanze wrote:

On Nov 21, 8:34 pm, Pavel

      [...]
Most of the hash maps I've written in the past would cache
the last element found, in order to support things like:

      if ( m.contains(x) ) {
          m[x] = 2*m[x] ; // or whatever...
      }

I understand you can optimize the set for the usage pattern
you think is common. This comes at cost even for this use,
however, as the comparision may be much more expensive than
hash-function computation and you guarantee you will have an
extra comparison in m[x] at all times. We know we can do
without (see above) so why should we live with it.

It's a common usage pattern with my hashed containers: since
[] has a precondition that the entry exists,

You mean -- in your container, not STL?

Yes.

I don't like STL's inserting things into maps behind my back
either, e.g.:

std::map<int, int> m;
int i = m[5];
/* it's not exactly intuitive that we have just created a (5, 0) entry
on the map.. */

My earliest versions of a hash table did this as well---they
were based on my experience with AWK. (In fact, I've come to
the conclusion that it's a very bad idea. AWK and C++ are
different languages, and what's the best solution in one isn't
necessarily a good solution in the other.) But even then, I
found myself often using contains to avoid accidentally
inserting a new element. All too often, you need to procede
differently depending on whether the element is present or not.

That's why I mostly use explicit insert, find etc.

Agreed. I've found almost no use for std::map's operator[]
either. I rather think it was an afterthought---that the design
itself is to use find for access, and insert for insertion, and
that operator[] was added with an AWK-like semantic as a
convenience function, for the cases where you need or want an
AWK-like semantic.

If you use find and insert, then of course, there's no point in
caching the last value, since the iterator find returns contains
it. I have a lot of code along the lines of:

    Map::const_iterator elem = map.find(key);
    return elem == map.end()
        ? someDefaultValue
        : elem->second;

With my earlier implementations, this was:

    return map.contains(key)
        ? map[key]
        : someDefaultValue;

(which I still prefer, but the STL usage above is more than
acceptable as well).

[...]

The cost of insertion can be anything -- in case of
conflict it may even be some secondary hash or linear or
non-linear search in the overflow area or similar. The
Standard does not define how exactly the overflows are
processed.

It does require buckets, since it exposes this detail at the
interface level. The guarantees with regards to this part
of the interface are rather vague, but I would expect that
if (double)(m.bucket_count()+1)/m.size()<
m.max_load_factor(), then:
     size_t b = m.bucket(obj);
     m.insert(obj);
     assert(b == m.bucket(obj));

My reading is opposite:

It says (1) "Keys with the same hash code appear in the same
bucket" and (2) "The number of buckets is automatically
increased as elements are added to an unordered associative
container"

So, if there is one bucket now, m.bucket(obj) has to return 0;
if there must be two buckets after after the insertion as per
(2), the inserted element may have gone to the bucket 1 with
other its peers (the bucket may contain elements with more
than one hash code it is just that if two elements have same
hash code they must appear in the same bucket per (1) so if
obj had hashcode 2 and 0th bucket contained all elements with
codes 1 and 2, all twos may need to go to the 1st bucket
during the insertion).

That's why I had the precondition.

As I said, I don't think this is 100% clear in the draft, but
you do have "The number of buckets is automatically increased as
elements are added to an unordered associative container, so
that the average number of elements per bucket is kept below a
bound. Rehashing invalidates iterators, changes ordering between
elements, and changes which buckets elements appear in, but does
not invalidate pointers or references to elements." There seems
(to me, at least) to be a strong presumption that if you don't
rehash, iterators aren't invalidated, and that the container
doesn't rehash unless it needs to, so that you can count on
iterators not being invalidated if you can ensure that the load
factor will not be exceeded. I may be reading too much into
this, since I don't see any practical way of guaranteeing a
sufficient number of buckets beforehand; e.g. something like
reserve in std::vector. Something like:
    map.max_load_factor(0.5F);
    map.rehash();
    map.max_load_factor(0.8F);
would probably work in practice, but I'm not sure that it is
guaranteed. (The argument of max_load_factor is only a hint.
From a QoI point of view, I would expect that it result in a
max_load_factor <= the given value, and I don't understand the
standard not requiring it.) I think some function along the
lines of reserve or reserve_buckets would be in order---if I
know I'm going to insert 100000 elements, it's rather a bother
not to be able to tell the hash table this beforehand, so that
it doesn't have to rehash.

--
James Kanze