Re: hash_map

From:

"kanze" <kanze@gabi-soft.fr>

Newsgroups:

comp.lang.c++.moderated

Date:

4 Oct 2006 08:53:12 -0400

Message-ID:

<1159947568.666405.320100@c28g2000cwb.googlegroups.com>

tat wrote:

Thanks very much for all of your comments about hash_map and
various issues around it.
Can someone suggest a good string hash function for a string
of length ranging from 7 to 13?

Globally, everything I've seen tends to indicate that FNV (see
http://www.isthe.com/chongo/tech/comp/fnv/") is the best general
choice. It involves multiplication by an arbitrary value,
however, and on machines with slow multiplication, using a
Mersenne prime as multiplier may end up faster (since the
multiplication becomes a simple shift and subtract).

I've tested a number of hash codes in the past: there's a write
up of the results at
http://kanze.james.neuf.fr/code/Docs/html/Hashcode.html.

I googled and found a webpage (GeneralHashFunctions) of Arash
Partow (http://www.partow.net) on which there are a number of
hash functions.

Most of his examples seem rather dated. Some are among the
"known bad algorithms" which I didn't bother testing. Still, it
would only require a couple of lines of code to add them to the
test suite. (Since I did the write up, I've added CRC hashing,
for example.) Since one or two of them seem to be widely used,
it is probably worth doing. It would also be worth modifying
the test (which is in the code available at my site) so that I
could get more statistics than simple run-time.

I copied them below. I am not sure which one is good for my
case.

I'd say to use FNV unless it gave performance problems, then try
a variant with a Mersenne prime.

---------------------------------------------- Hash functions
------------------------------
unsigned int RSHash(string str)
{
    unsigned int b = 378551;
    unsigned int a = 63689;
    unsigned int hash = 0;

    for(unsigned int i = 0; i < str.length(); i++)
    {
       hash = hash*a+str[i];
       a = a*b;
    }
    return (hash & 0x7FFFFFFF);
};
/* End Of RS Hash Function */

Regretfully, there's no explination as to why this one might be
good, and no reference to any tests. A priori, calculating the
hash function will be slower than FNV or a Mersenne prime, since
it involves two multiplications, rather than just one. So
unless the distribution is a lot better (and it's hard to be
better than FNV), there's no point.

Note too that it initializes the hash key with 0. This is a
known bad value, although it probably won't affect you. (The
problem is that all keys of all nul characters hash to the same
value, regardless of their length.) It's also trivial to
correct.

unsigned int JSHash(string str)
{
    unsigned int hash = 1315423911;
    for(unsigned int i = 0; i < str.length(); i++)
    {
       hash ^= ((hash << 5) + str[i] + (hash >> 2));
    }
    return (hash & 0x7FFFFFFF);
};
/* End Of JS Hash Function */

I'm sceptical. And it seems to contain additional operations.
Partow's page starts with the very pertinant observation that
hash coding (at least in this context) is related to random
number generation. Linear congruent generators are known good
random number generators, and I haven't seen a hashing algorithm
to date which can compete with them. (There are better random
number generators than linear congruence, but they are all
significantly slower. Which is an issue in hash coding; taking
10 times more time just to gain 1% better distribution results
in slower access, not faster.)

unsigned int PJWHash(string str)
{
    unsigned int BitsInUnignedInt = (unsigned int)(sizeof(unsigned int)
* 8);
    unsigned int ThreeQuarters = (unsigned int)((BitsInUnignedInt *
3) / 4);
    unsigned int OneEighth = (unsigned int)(BitsInUnignedInt /
8);
    unsigned int HighBits = (unsigned int)(0xFFFFFFFF) <<
(BitsInUnignedInt - OneEighth);

In a C++ implementation, of course, all of the above should be
const... and probably static.

    unsigned int hash = 0;
    unsigned int test = 0;

    for(unsigned int i = 0; i < str.length(); i++)
    {
       hash = (hash << OneEighth) + str[i];
       if((test = hash & HighBits) != 0)
       {
          hash = (( hash ^ (test >> ThreeQuarters)) & (~HighBits));
       }
    }
  return (hash & 0x7FFFFFFF);
};
/* End Of P. J. Weinberger Hash Function */

Again, lots of extra tests. For probably no improvement in the
distribution (and maybe some loss).

unsigned int ELFHash(string str)
{
    unsigned int hash = 0;
    unsigned int x = 0;

    for(unsigned int i = 0; i < str.length(); i++)
    {
       hash = (hash << 4) + str[i];
       if((x = hash & 0xF0000000L) != 0)
       {
          hash ^= (x >> 24);
          hash &= ~x;
       }
    }
    return (hash & 0x7FFFFFFF);
};
/* End Of ELF Hash Function */

Ditto. (This is one of the "known bad hashing algorithms" that
I didn't bother testing.)

I suspect that this algorithm was originally developped in
assembler, using a rotate instruction; at least then, it would
be as fast or faster than linear congruence (which was often
avoided in the early days because multiplication was so
slow---see my comments on the DJB algorithm, however).

unsigned int BKDRHash(string str)
{
    unsigned int seed = 131; // 31 131 1313 13131 131313 etc..
    unsigned int hash = 0;

    for(unsigned int i = 0; i < str.length(); i++)
    {
       hash = (hash*seed)+str[i];
    }
    return (hash & 0x7FFFFFFF);
};
/* End Of BKDR Hash Function */

Another linear congruent algorithm. It's probably worth
testing, but values like 131 seem to be chosen just because they
look interesting, and not for any theoretical reasons.

unsigned int SDBMHash(string str)
{
    unsigned int hash = 0;
    for(unsigned int i = 0; i < str.length(); i++)
    {
       hash = str[i] + (hash << 6) + (hash << 16) - hash;
    }
    return (hash & 0x7FFFFFFF);
/* End Of SDBM Hash Function */

This is just:
hash = 4194303 * hash + str[ i ] ;
written in a way as to obfuscate the intent and (probably)
prevent some compiler optimizations. (If multiplication is
faster than shifting, as it is on some machines, then it is a
definite pessimization. And if the version with the shifts is
faster, the compiler should find it.)

Again, a linear congruent generator, with no indication as to
why the particular multiplier was chosen.

unsigned int DJBHash(string str)
{
    unsigned int hash = 5381;
    for(unsigned int i = 0; i < str.length(); i++)
    {
       hash = ((hash << 5) + hash) + str[i];
    }
    return (hash & 0x7FFFFFFF);
};
/* End Of DJB Hash Function */

From Daniel J. Bernstein. Who also mentioned Chris Torek's

version (which subtracts instead of adding the hash) in his
posting. In the original posting in comp.lang.c, Bernstein
mentions that his version basically multiplies by 33, and Chris
Torek's by 31.

This is the same basic philosophy as that motivating the use of
a Mersenne prime (and Chris Torek's version is a Mersenne prime,
and is the hash code use in Java). It's interesting to note
that others had the same idea as I did (about the same time, or
possibly before).

Today, 1) I'd definitly use a larger Mersenne prime---with 31 or
33, dense sets of short strings will tend to cluster--, and 2)
I'd write the multiplication, and leave it up to the compiler to
decide whether shifting or multiplying was faster. (In 1990,
when Bernstein posted his article to comp.lang.c, it wasn't
obvious that the compiler would do this optimization.)

unsigned int APHash(string str)
{
    unsigned int hash = 0;

    for(unsigned int i = 0; i < str.length(); i++)
    {
       if ((i & 1) == 0)
       {
          hash ^=((hash << 7)^str[i]^(hash >> 3));
       }
       else
       {
          hash ^= (~((hash << 11)^str[i]^(hash >> 5)));
       }
    }
    return (hash & 0x7FFFFFFF);
};
/* End Of AP Hash Function */

Again, a test and a complicated expression. For probably no
reasons. (Note too that in this case, there's no way branch
prediction can help, so the test will be particularly
expensive.)

My recommendations are:
-- on a 16 bit machine, use a linear congruent hash function
    with a multiplier of 31,
-- for 32 bits and up, use FNV, and
-- if, and only if, the profiler shows that you are spending an
    inordinate amount of time in the hash function, and the
    machine has slow multiplication, switch to a multiplier of
    127 (which is a shift and a subtraction, when the compiler
    gets through with it).
(Another thing I do is cast the char to unsigned char before
adding it. I don't think it changes the efficiency of the hash
any, but it does mean that my results are the same, regardless
of the signedness of char.)

--
James Kanze GABI Software
Conseils en informatique orient?e objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place S?mard, 78210 St.-Cyr-l'?cole, France, +33 (0)1 30 23 00 34

--
      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated. First time posters: Do this! ]