Re: Hash table performance

From: Tom Anderson <twic@urchin.earth.li>
Newsgroups: comp.lang.java.programmer
Date: Tue, 24 Nov 2009 23:58:33 +0000
Message-ID: <alpine.DEB.1.10.0911242338470.4374@urchin.earth.li>
On Tue, 24 Nov 2009, markspace wrote:

> Tom Anderson wrote:
>
>>>>     long bits = Double.doubleToLongBits( key );
>>>>     int hash = (int)(bits ^ (bits >>> 32));
>>>
>>> provides terrible performance.


>> Interesting. I chose that function because it's what java.lang.Double
>> does, rather than because i thought it'd be good, but i am surprised to
>> hear it's terrible - doubles are quite complicated internally, so would
>> have thought that a parade of natural numbers would give reasonably
>> well-distributed hashes this way (whereas longs wouldn't, of course).
>> How did you conclude it's terrible?


> Writing my own hash table implementation, I noticed that I was getting
> terrible performance with a ton of collisions and everything was heaped
> up in a tiny spot in the table.
>
> Inspecting the hash in hexadecimal, I realized that Jon's data keys --
> the natural counting numbers 1, 2, 3, etc. -- are represented in a
> double as a few bits in the uppermost bits of the double. The lower
> bits are always 0, even after slicing the 64-bit double's bit pattern in
> half and xoring the two halves.
>
> This xoring results in regular hash bit patterns like:
>
> 0x20200000
> 0x40200000
> 0x40600000
> 0x60200000
>
> etc. as the numbers count up
> (bit patterns made up from memory, but you get the idea.)


Ahaaa, yes, of course. Those bits in full:

0.0 0000000000000000000000000000000000000000000000000000000000000000
1.0 0011111111110000000000000000000000000000000000000000000000000000
2.0 0100000000000000000000000000000000000000000000000000000000000000
3.0 0100000000001000000000000000000000000000000000000000000000000000
4.0 0100000000010000000000000000000000000000000000000000000000000000
5.0 0100000000010100000000000000000000000000000000000000000000000000
6.0 0100000000011000000000000000000000000000000000000000000000000000
7.0 0100000000011100000000000000000000000000000000000000000000000000
8.0 0100000000100000000000000000000000000000000000000000000000000000
9.0 0100000000100010000000000000000000000000000000000000000000000000
10.0 0100000000100100000000000000000000000000000000000000000000000000
11.0 0100000000100110000000000000000000000000000000000000000000000000
12.0 0100000000101000000000000000000000000000000000000000000000000000
13.0 0100000000101010000000000000000000000000000000000000000000000000
14.0 0100000000101100000000000000000000000000000000000000000000000000
15.0 0100000000101110000000000000000000000000000000000000000000000000
16.0 0100000000110000000000000000000000000000000000000000000000000000
17.0 0100000000110001000000000000000000000000000000000000000000000000
18.0 0100000000110010000000000000000000000000000000000000000000000000
19.0 0100000000110011000000000000000000000000000000000000000000000000
20.0 0100000000110100000000000000000000000000000000000000000000000000

I'd been thinking that there would be lots of mismatch between base 10 and
base 2 giving us long binary expansions of those numbers, but of course 1
is sort of a common ground in both bases! On the other hand, if we divide
those numbers by 10:

0.0 0000000000000000000000000000000000000000000000000000000000000000
0.1 0011111110111001100110011001100110011001100110011001100110011010
0.2 0011111111001001100110011001100110011001100110011001100110011010
0.3 0011111111010011001100110011001100110011001100110011001100110011
0.4 0011111111011001100110011001100110011001100110011001100110011010
0.5 0011111111100000000000000000000000000000000000000000000000000000
0.6 0011111111100011001100110011001100110011001100110011001100110011
0.7 0011111111100110011001100110011001100110011001100110011001100110
0.8 0011111111101001100110011001100110011001100110011001100110011010
0.9 0011111111101100110011001100110011001100110011001100110011001101
1.0 0011111111110000000000000000000000000000000000000000000000000000
1.1 0011111111110001100110011001100110011001100110011001100110011010
1.2 0011111111110011001100110011001100110011001100110011001100110011
1.3 0011111111110100110011001100110011001100110011001100110011001101
1.4 0011111111110110011001100110011001100110011001100110011001100110
1.5 0011111111111000000000000000000000000000000000000000000000000000
1.6 0011111111111001100110011001100110011001100110011001100110011010
1.7 0011111111111011001100110011001100110011001100110011001100110011
1.8 0011111111111100110011001100110011001100110011001100110011001101
1.9 0011111111111110011001100110011001100110011001100110011001100110
2.0 0100000000000000000000000000000000000000000000000000000000000000

Perhaps not radically better, but at least there are more ones around.
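
(If you want to regenerate these tables and the hashes they end up as, here's a
quick throwaway sketch - the body of hash() is just the function quoted at the
top of the thread; the class and helper names are mine:)

    public class DoubleHashDump {

        // XOR-fold the 64-bit pattern into 32 bits, as the quoted function
        // (and java.lang.Double.hashCode()) does
        static int hash(double key) {
            long bits = Double.doubleToLongBits(key);
            return (int) (bits ^ (bits >>> 32));
        }

        // full 64-bit pattern as a string, left-padded with zeros
        static String bits64(double d) {
            String s = Long.toBinaryString(Double.doubleToLongBits(d));
            StringBuilder sb = new StringBuilder();
            for (int i = s.length(); i < 64; i++) sb.append('0');
            return sb.append(s).toString();
        }

        public static void main(String[] args) {
            for (int i = 0; i <= 20; i++) {
                double d = i;          // the naturals...
                double t = i / 10.0;   // ...and the naturals divided by ten
                System.out.printf("%4.1f  %s  hash=%08x%n", d, bits64(d), hash(d));
                System.out.printf("%4.1f  %s  hash=%08x%n", t, bits64(t), hash(t));
            }
        }
    }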

> i.e., hashes with very few bits different, and all in the uppermost
> bits of the hash. This is exactly the opposite of what you want in a
> good hash, which is lots of randomness in the lower bits of the hash
> code.
>
> So I concluded: absent any other perturbation in the hash, it sucks.


Indeed. I suppose you could argue that natural numbers represented as
doubles are a pathological case - if you have the naturals, why are you
using doubles? - but even so, it's not hard to come up with sequences
which have minimally different hashes:

1.0 0011111111110000000000000000000000000000000000000000000000000000
1.0000000000000002 0011111111110000000000000000000000000000000000000000000000000001
1.0000000000000004 0011111111110000000000000000000000000000000000000000000000000010
1.0000000000000007 0011111111110000000000000000000000000000000000000000000000000011
1.0000000000000009 0011111111110000000000000000000000000000000000000000000000000100
1.000000000000001 0011111111110000000000000000000000000000000000000000000000000101
1.0000000000000013 0011111111110000000000000000000000000000000000000000000000000110
1.0000000000000016 0011111111110000000000000000000000000000000000000000000000000111
1.0000000000000018 0011111111110000000000000000000000000000000000000000000000001000
1.000000000000002 0011111111110000000000000000000000000000000000000000000000001001
1.0000000000000022 0011111111110000000000000000000000000000000000000000000000001010
1.0000000000000024 0011111111110000000000000000000000000000000000000000000000001011
1.0000000000000027 0011111111110000000000000000000000000000000000000000000000001100
1.0000000000000029 0011111111110000000000000000000000000000000000000000000000001101
1.000000000000003 0011111111110000000000000000000000000000000000000000000000001110
1.0000000000000033 0011111111110000000000000000000000000000000000000000000000001111
1.0000000000000036 0011111111110000000000000000000000000000000000000000000000010000
1.0000000000000038 0011111111110000000000000000000000000000000000000000000000010001
1.000000000000004 0011111111110000000000000000000000000000000000000000000000010010
1.0000000000000042 0011111111110000000000000000000000000000000000000000000000010011
1.0000000000000044 0011111111110000000000000000000000000000000000000000000000010100

And, yes, for any hash function, there will be sets of values for which
the hashes are minimally different, or even the same. But they should be
harder to come up with than that!

This still leaves us with the question of what would constitute a good
hash. And perhaps a broader question: whose job is it to turn an object
into a well-distributed hashtable index? Should the object - every object
- undertake to provide a well-distributed hash (and i can't really define
what i mean by that, but it means something not like Double), or is it
up to the hashtable to take whatever hashes it gets and stir them up to
make good indices?
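
For what it's worth, the "hashtable's job" answer is roughly what
java.util.HashMap already does: it runs whatever hashCode() it's given through
a supplemental bit-spreading step before masking off a bucket index. From
memory it's something along these lines in JDK 6 (names mine, and treat it as
a sketch, not gospel):

    // supplemental hash in the style of JDK 6's HashMap.hash(int):
    // fold the higher-order bits downwards so that power-of-two masking,
    // which only ever looks at the low bits, still sees some of them
    static int stir(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // bucket index for a power-of-two-sized table
    static int indexFor(int h, int tableLength) {
        return stir(h) & (tableLength - 1);
    }

With something like that in front of the table, even the all-in-the-top-bits
Double hashes above at least have a chance of landing in different buckets.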

tom

--
A is for Absinthe, for which I now thirst
