Re: Hash Code

From:

Tom Anderson <twic@urchin.earth.li>

Newsgroups:

comp.lang.java.programmer

Date:

Tue, 20 May 2008 20:13:11 +0100

Message-ID:

<Pine.LNX.4.64.0805201927380.15168@urchin.earth.li>

On Tue, 20 May 2008, j1mb0jay wrote:

At the moment I am setting the size of the hash table to be the next
prime number which is double the value of the expected number of items
to be added. (Using the Sieve of Atkin prime test)

If you have a good hash function, you don't need to use a prime number for
table size, any size will do.

Each item(Object) that is added has a unique String as a form of ID. I
use the String.hashCode() method on this ID and then Mod this 32bit
integer, so that I then can use this number as an index into the array
of linked lists, that represents the hash table. ( The array of linked
lists is so the hash table can handle collisions ).

I was wondering if instead of using the String.hashCode() method I used
a secure hash, like SHA512 or MD5. Then converted this 512bit / 128bit
number into base10, so that i could parse it as a String into the
constructor of java.math.BigInteger class, then Mod then number by the
size of the hash table so it could be used an index into the array.

Would this mean there would be no collisions

No. Completely avoiding collisions is impossible - if there are more
things in the input domain than in the table, some *must* map to the same
slot.

It would also be very, very slow - cryptographic hashes are engineered to
have properties which are completely unnecessary for use in general
hashing, and do so at the expense of speed.

String's hashCode is a pretty good one - it's a variant of the Bernstein
hash, which is fairly standard. If you want a better one, though,
implement this:

http://burtleburtle.net/bob/hash/doobs.html

There's a catalogue of other hashes at the bottom.

thus keeping the time complexity of the search method down to O(1). In
turn meaning the hash table could be an array of Key Value pairs rather
than a array of linked lists.

There are three ways to avoid collision lists. The first is to use
reprobing instead; you can look that up if you don't know what it is - i
suggesting using quadratic reprobing, as it's simple and fast. The second
is to use a cuckoo hash - again, look it up.

The third, well, i lied when i said avoiding collisions was impossible.
There is kind of a way to do it, but with an important drawback: you have
to choose the hashing function and table size specifically for the data
you want to hash. You have to know the keys before you can build the
table. Also, building the table is typically quite slow. However, once
you've done it, lookups are very fast. The techniques for doing this are
known as perfect hashing - you can look them up too.

tom

--
Gin for the mind, kebabs for the body, sushi for the soul