Re: How to get an insertion hint for an unordered associated container?
James Kanze wrote:
On Nov 27, 4:30 am, Pavel
<pauldontspamt...@removeyourself.dontspam.yahoo> wrote:
James Kanze wrote:
On Nov 24, 5:24 am, Pavel
<pauldontspamt...@removeyourself.dontspam.yahoo> wrote:
James Kanze wrote:
On Nov 21, 8:34 pm, Pavel
[...]
they have to re-compute same information in insert() again (at
least bucket index and hash code). I hope they will use it in
the future though and that API is there to allow optimization,
not just for compatibility..
Explain how? The iterator doesn't contain the hash code in any
way.
Iterator may contain anything that addresses the element, in
particular hash_code (it does so in one version of GCC
hashtable, more precisely, points to the node that contains
hash_code).
That's true, but how would that help? The hint has to be
checked; it's not an absolute. So the insertion code would
still have to calculate the hash code for the element to be
inserted. And having done that, having the iterator will not
make much difference.
It has to contain something to allow navigation (++, --). For
example, it could contain:
1. A handle for constant-time access to the bucket, to be able to find
out where the bucket ends (like an index of the bucket in some
array/vector/deque or a pointer to it),
2. A handle for constant-time access to element in the bucket (another
index or whatever)
3. A handle for constant-time access to the container (to know how to
navigate to the next bucket as you need for ++, --). Again, a pointer or
similar.
If buckets contain pointers to actual objects (which sounds
feasible as the Standard guarantees the references and
pointers to the objects are not invalidated during re-hashes),
the above is quite enough to insert the object "at" given
iterator or at first available space in the bucket pointed to
by the iterator.
You seem to be forgetting that the iterator is only a hint, and
that the insertion has to work correctly even if the hint isn't
appropriate.
point taken -- the hint-using code must count hash code.
Most of the hash maps I've written in the past would cache
the last element found, in order to support things like:
if ( m.contains(x) ) {
m[x] = 2*m[x] ; // or whatever...
}
I understand you can optimize the set for the usage pattern
you think is common. This comes at cost even for this use,
however, as the comparision may be much more expensive than
hash-function computation and you guarantee you will have an
extra comparison in m[x] at all times. We know we can do
without (see above) so why should we live with it.
It's a common usage pattern with my hashed containers: since []
has a precondition that the entry exists,
You mean -- in your container, not STL? I don't like STL's inserting
things into maps behind my back either, e.g.:
std::map<int, int> m;
int i = m[5];
/* it's not exactly intuitive that we have just created a (5, 0) entry
on the map.. */
That's why I mostly use explicit insert, find etc.
you often end up
writing things like the above (and if (m.contains) followed by
use. With the STL idiom, of course, the above would probably be
written:
Map::iterator elem = m.find(x);
if ( elem !== m.end() ) {
*m = 2 * *m ; // or whatever...
}
The STL avoids multiple look-up, at the cost of IMHO a less
natural syntax. Given my syntax, however, caching will be a
win: you do have the comparison each time, but you at least
avoid the extra calcuation of the hash code.
Sometimes hash code computation is more expensive than comparison and
sometimes it isn't. Depends on how you calculate hashcode: sometimes it
is feasible to only use a part of the whole that is known to be usually
more distinctive.
efficiently. It still required a comparison for each
access, in order to know whether I could use the cached
value, but it did avoid multiple calculations of the hash
code when accessing the same element several times in
succession.
You are making assumptions that may be very true for your
particular problem but I would not use these in a
general-purpose library like STL.
I think it more a question of the idioms associated with the
container, than whether it is general purpose or not. With the
STL, you use find and then the returned iterator; with mine, you
use contain(), and then [].
I meant the assumption that hash code calculation is the most expensive
part of the search. Value comparison can be sometimes more expensive and
walking the bucket may take some time (The bucket of size 5 or 6 is
still traverse-able at O(1) but O becomes too big :-))
Why not? You can have a bucket and an index in the bucket and
the bucket has the index if the last element. If a pointer to
the element stored the bucket (think of the bucket as an
array/vector of pointers although it does not have to be it)
is NULL, insert your element into this spot; otherwise, if the
bucket has free space insert it at the end of the bucket;
otherwise, you have to re-hash (but you would have anyway;
supposedly, you still have "amortized constant" for the
average insertion time)
Several points, but the most important one is what I mentionned
before: the function must work even if the hint is incorrect.
true as admitted above. Still some gain is due.
And typically, the bucket doesn't have a fixed length: you don't
rehash because there are more than n elements in a single
bucket; you rehash because size()/bucket_count() passes a
pre-established limit.
Yes, poing taken.. what I was alluding to a map implementation that was
not stl compliant; The essense does not change though: please replace
"re-hash" with "add more space to the end of the bucket".
With regards to the original poster's problem (he didn't want to
construct a complete object if the entry was already there, and
he didn't want to calculate the hash code multiple times)
I was the original poster BTW :-)
, the
interface could be extended with a additional function,
insert_into_bucket, so that he could get the bucket using
bucket, then iterate over it to see if his target was already
present, and if not use this new function for the insertion.
Whether it's worth it is another question: it's a very
specialized use, and most users would probably be content with
just insert (which only inserts if not present), despite having
to create a new complete object.
That's what I ended up for a while (sigh). But it's quite probably I
will have to return to it as the lookup in this map seems to take the
lion share of CPU time and the nature of the problem is such that 99% of
all inserted values get removed by the end-of-day so I never know
whether the given key is already in or not. I need to measure more to
see what actually takes time within this lookup.
What is the benefit of requiring them both be end()? Checking
for "not found" condition costs one iterator comparison either
way.. seems like waste to me.
No benefit, perhaps, but no real harm either. Ordered and
unordered containers are fundamentally different beasts.
If you returned two identical iterators pointing to a free
spot in the correct bucket (say the first free spot), you
would know to insert your element there without any
computations whatsoever.
You still need to calculate the hash code of the object being
inserted, to validate the hint (and in a unique hash table, do
some comparisons to ensure that the element isn't already
present). And once you've done that, there's not much
calculation left anyway.
As I said above, there is some: comparison and potential navigation in
the bucket. It is usually linear and the unpleasant part is that
reasonable bucket length is only guaranteed on average; wherever
selecting a good hash function is difficult (this is not the case in my
current problem), it is possible to accidentally push some 50% of the
whole key population into one bucket and then spend n/4 iterations on
average for each lookup while the average number of elements per bucket
is still below whatever implementation believes is the reasonable limit.
[...]
The cost of
insertion can be anything -- in case of conflict it may even be some
secondary hash or linear or non-linear search in the overflow area or
similar. The Standard does not define how exactly the overflows are
processed.
It does require buckets, since it exposes this detail at the
interface level. The guarantees with regards to this part of
the interface are rather vague, but I would expect that if
(double)(m.bucket_count()+1)/m.size()< m.max_load_factor(), then:
size_t b = m.bucket(obj);
m.insert(obj);
assert(b == m.bucket(obj));
My reading is opposite:
It says (1) "Keys with the same hash code appear in the same bucket" and
(2) "The number of buckets is automatically increased as elements are
added to an unordered associative container"
So, if there is one bucket now, m.bucket(obj) has to return 0;
if there must be two buckets after after the insertion as per (2), the
inserted element may have gone to the bucket 1 with other its peers (the
bucket may contain elements with more than one hash code it is just that
if two elements have same hash code they must appear in the same bucket
per (1) so if obj had hashcode 2 and 0th bucket contained all elements
with codes 1 and 2, all twos may need to go to the 1st bucket during the
insertion).
should hold. If so, that pretty much defines how overflows are
handled.
To summarize my point:
The Standard recognizes the hint may be useful (it is still in
the API for unordered ass. containers)
Does it recognize a utility, other than compatibility of
interface (so e.g. you can instantiation an insert_iterator on
an unordered ass. container)? (I'm not sure I understand the
meaning of "The iterator q is a hint pointing to where the
search should start" in the standard. It doesn't make too much
sense when buckets are used for collision handling, at least to
me.)
and I am ok that GCC STL does not use it now -- it or another
implementation may do it in the future or I can write it
myself and the client code will continue to rely on the
Standard-compliant unordered ass. container while enjoying
the faster implementation.
I'd be interesting in seeing an actual implementation which does
use it somehow.
I will post a notice here if I make the change in GNU implementation (I
will be legally obliged to contribute it back anyway). But I will only
work on it if I figure out it would it would make a difference for my
particular problem so it's not guaranteed.
-Pavel
But, I do have an issue with the requirement to return two
end() iterators in equal_range() on "not found" condition
instead of such a hint. I think it is a defect in the Standard
that limits possible optimizations without good reason.
I would agree that it seems to be a case of overspecification.
Even if I don't see any possible advantages in returning
something else at present, that doesn't mean that they can't
exist, if not now, in the future, and there's no real advantage
in excluding them.
--
James Kanze