Re: I cannot use the hash_map::operator to read the value in the hash map?
On Aug 6, 1:31 am, tony_in_da...@yahoo.co.uk wrote:
On Aug 2, 8:36 pm, James Kanze <james.ka...@gmail.com> wrote:
Still, the STL version is worse:
std::map< X, int >::const_iterator it = hm1.find( key ) ;
int a = it == hm1.end() ? 42 : it->second ;
The real problem in reading a map (or a hashmap) is how to
handle missing values. Sometimes, an exception is appropriate,
but it's not a good general solution. So you need something out
of band.
Yeah - that is ugly too :-(.
I don't particularly like the interface of any of the standard
containers; if nothing else, push_back() is a horrible name for
append(). In this case, however, there is no right solution:
When I first started using C++, I wrote a hash map based on AWK,
where operator[] also automatically inserts (which makes a lot of sense
in AWK). After a number of years, I found that I was
systematically checking before calling operator[], because I didn't
want automatic insertion. So, on porting the library to a new
client, I modified it to treat a missing entry as an error. As
you might guess, the first time I used the new version, I found
that I had to check for the error, and insert if it occurred.
IMHO, the real solution with regards to std::map is to forget
about operator[] (most of the time, anyway), and write higher level
functions with whatever semantics corresponds to what you need
for the particular application: insert, return a pointer which
can be NULL, raise an exception, return a default value, etc.
But trivial wrapper functions can clean
up usage for these simple cases, and at least it discourages
inefficiency as people are more aware of their ability to use the
iterators for further operations.
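As a sketch of what such trivial wrappers might look like (findOrNull
and findOrDefault are my names, not standard functions; each one fixes
the missing-value policy once instead of repeating the find()/end()
dance at every call site):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical wrapper: return a pointer to the mapped value,
// or NULL if the key is absent.
template <typename Map>
const typename Map::mapped_type*
findOrNull(const Map& m, const typename Map::key_type& key)
{
    typename Map::const_iterator it = m.find(key);
    return it == m.end() ? NULL : &it->second;
}

// Hypothetical wrapper: return the mapped value,
// or a caller-supplied default if the key is absent.
template <typename Map>
typename Map::mapped_type
findOrDefault(const Map& m, const typename Map::key_type& key,
              const typename Map::mapped_type& dflt)
{
    typename Map::const_iterator it = m.find(key);
    return it == m.end() ? dflt : it->second;
}
```

With these, the iterator-based lookup quoted earlier reduces to
something like findOrDefault(hm1, key, 42); wrappers raising an
exception or inserting a default follow the same pattern.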
If you want to be a programmer, you should at least know how to do a
recursive directory search for filenames including *hash*. Really,
kids these days.... ;-P
Given that the correct name is unordered_set, searching for a
file whose name includes *hash* isn't going to help. More
generally, of course, recursive directory search is almost
certainly the wrong way to go about it here.
As you said above "it's likely that your implementation still uses
pre-standard". Because it's non standard, many implementations put
the hash_map header in some strange subdirectory of their "include"
directory. Recursive find / ls -R / dir /s whatever is typically an
effective and quick way to find the header files. Try it for your
compiler and let us know if it doesn't work for you...
Because it's non standard, all of the implementations implement
something slightly different as well. So just finding the
header is not enough. You have to find the documentation, in
order to know how to use it. And the documentation should tell
you what you need to include. If you need to use find or "ls
-R", then the extension isn't adequately documented to risk
using.
Note too that std::map guarantees O(n ln n). unordered_set will
be *typically* O(n), but only if you have a good hashing
function on the key. And good hashing functions are not
necessarily trivial to come by.
This seems bizarre to me. Unordered_set should typically be O(1), as
the insert/search/delete operations are only dependent on the
collisions in the hash bucket (which should typically be 1), and
completely independent of the number of elements in the table ("n").
map "typically" will be ln n for these operations, though after insert
and/or delete it may take extra time to rebalance the tree. James,
how do you arrive at O(n ln n)?
By mistyping:-). I meant O(ln n), obviously. std::map is
guaranteed O(ln n) (where complexity is measured in terms of
number of comparisons---if the allocator is O(n) over the
number of nodes allocated, then you could end up with O(n ln n)
overall.)
One hint for making good hash functions: try to have it so that if one
bit varies in the value being hashed, about half the bits vary in the
hash value.
I wouldn't recommend it for double. Having +0.0 and -0.0 hash
to different values is a NOT a good idea. I wouldn't recommend
it for an unordered_set, either.
This minimises collisions even for nearly-identical
values. One simple and reliable way to do this is to use parts of the
hash value as keys into random data.
You're thinking of the Pearson algorithm, from the CACM, June,
1990, I think. FWIW, there's a somewhat old article at my site with
the results of some measurements I did a while back. I only
considered hashing of strings, where every bit is significant,
as is order, which makes things easier. I've done some
additional tests since then (present in the code for the
benchmark at the site, but not in the document). All in all, I
found that "h[i]=127*h[i-1]+c[i]" seems to result in the fastest
tables for a wide variety of data sets; FNV hashing or CRC
actually do better in terms of distribution, but only very
marginally, and cost more to calculate. (This obviously depends
on the machine. My timings were done on a Sparc, which doesn't
have hardware multiply, so the multiplication by 127---a shift
and a subtraction---is significantly faster than that required
in FNV.) Replacing the 127 by 31 or 33---the first is used by
Java and g++---usually does just as well, but does significantly
poorer for some particular very "dense" data sets. (The "dense"
data set I tested consisted of 8836 two character strings. Not
that dense, when you think about it, since there are a total of
65536 possible two character strings.)
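For reference, a minimal sketch of that multiplicative hash over a
std::string (mult127Hash is my name for it):

```cpp
#include <cassert>
#include <string>

// The simple multiplicative hash discussed above: h = 127*h + c over
// the bytes of the string. On a machine without hardware multiply,
// 127*h is just (h << 7) - h, which is why it beat FNV in the timings.
unsigned long mult127Hash(const std::string& s)
{
    unsigned long h = 0;
    for (std::string::size_type i = 0; i != s.size(); ++i) {
        h = 127u * h + (unsigned char)s[i];
    }
    return h;
}
```

Replacing 127 with 31 or 33 changes only the constant in the loop;
the structure of the function is identical.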
The hashing algorithms I tested can easily be modified for other
string-like structures. The problem is to adapt them (or
anything else) to structures where different bit patterns can
compare equal, such as double or an iteration over an
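One way to adapt such a hash to double, given the +0.0/-0.0 point
raised above, is to normalize the value before hashing its bit
pattern. A sketch (hashDouble is my name; NaN, which never compares
equal to anything, is left aside here):

```cpp
#include <cassert>
#include <cstring>

// Hash a double by its bit pattern, but normalize -0.0 to +0.0 first:
// the two compare equal, so they must hash to the same value.
unsigned long hashDouble(double d)
{
    if (d == 0.0) {     // true for both +0.0 and -0.0
        d = 0.0;
    }
    unsigned char bytes[sizeof(double)];
    std::memcpy(bytes, &d, sizeof(double));
    unsigned long h = 0;
    for (unsigned i = 0; i != sizeof(double); ++i) {
        h = 127u * h + bytes[i];    // same multiplicative hash as for strings
    }
    return h;
}
```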
James Kanze (GABI Software) email:email@example.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34