Re: Benchmarks
On Nov 11, 8:16 pm, user923005 <dcor...@connx.com> wrote:
> On Nov 11, 8:02 am, James Kanze <james.ka...@gmail.com> wrote:
[...]
> > According to the link on your page, this is one I've already
> > tested. And found it to be slower than FNV or my own hash
> > functions. It is, IMHO, a typical example where being
> > overly complicated to match a particular machine doesn't
> > pay. The distribution isn't significantly better than FNV
> > or my own, and in fact it takes more time (on the machines I
> > have access to) to calculate.
> There is an updated version of Bob Jenkins' hash that is faster.
Now he tells me:-). I've just finished adding Paul's algorithm
(and I'll quote the results and comment on them below).
> Another excellent hash is this one:
> http://www.geocities.com/drone115b/Goulburn06.pdf
> If you are hashing big keys, UMAC is marvelous.
> http://fastcrypto.org/umac/2000/perf00.html
Two more to add. Your mention of big keys is significant,
however; see below.
[...]
> I use frog.cpp as my testing harness. Is your hash testing
> harness code available?
It's at my site, in the Test subsystem. While I won't swear to
it, I don't think anything has changed in it since I put it up.
See http://kanze.james.neuf.fr/code-en.html. (The easiest
solution might be to simply download the lot, and browse around
in it. If all you're interested in is the harness, however, it
is in Test, and it only depends on Global---no other subsystems,
and no other components in Test.)
[...]
> > It's interesting to note that on a modern machine, FNV or my
> > own functions do not take any more time than just summing
> > the bytes, but result in a very good distribution.
> FNV is not as good as some of the others. It cascades a bit.
I'm not sure. It seems to give very good results with my test
data.
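Just for reference, the usual 32 bit FNV-1 comes down to
something like this (a quick sketch, not the exact code from my
library; the constants are the published offset basis and
prime):

    #include <stdint.h>
    #include <string>

    uint32_t
    fnvHash( std::string const& key )
    {
        uint32_t result = 2166136261U ;     //  offset basis
        for ( std::string::const_iterator it = key.begin() ;
                it != key.end() ; ++ it ) {
            result *= 16777619U ;           //  FNV prime
            result ^= static_cast< unsigned char >( *it ) ;
        }
        return result ;
    }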
[...]
> Is your Mersenne prime hash based on the Mersenne twister RNG,
> or are you just using big Mersenne primes as a large prime for
> modulus operations with a very long period?
No. It's actually based on linear congruential random number
generation: x[i] = (a * x[i-1] + b) % m, where m is 2^n (not a
good idea for random number generation, but it doesn't seem to
be a problem here), and b is actually the next character, rather
than a constant. (FNV uses more or less the same principle,
except that it xor's in the next character.) The Mersenne prime
part is that a is a Mersenne prime. If you think about it, it's
important for a to be relatively prime to the size of your
table, so a prime number is indicated. And a Mersenne prime has
the advantage that multiplication by it can be done with a
simple shift and subtraction if multiplication is expensive.
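Concretely, for a = 2^7-1, the inner loop comes down to
something like this (a rough sketch, not the exact code from my
library):

    #include <stdint.h>
    #include <string>

    uint32_t
    mersenneHash( std::string const& key )
    {
        uint32_t result = 0 ;
        for ( std::string::const_iterator it = key.begin() ;
                it != key.end() ; ++ it ) {
            //  x[i] = 127 * x[i-1] + c, modulo 2^32 by means
            //  of unsigned arithmetic; 127 * x can also be
            //  written (x << 7) - x if multiplication is
            //  expensive.
            result = 127U * result
                + static_cast< unsigned char >( *it ) ;
        }
        return result ;
    }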
[...]
> > So I think that the word length factor is really a red herring.
> I think it is a good idea to test everything, and then later
> on retest it all because assumptions are based on models that
> can change over time.
Several things are clear, and one is that the behavior will
depend on hardware. Which evolves, so yesterday's answer might
not hold today.
OK, for the results of my tests. Two comments to begin with,
though:
1. I did my tests using std::string as a key, which definitely
favors character by character access (if only for clarity).
I used std::string::const_iterator throughout, reading four
bytes and assembling them into a uint32_t when the algorithms
called for it. In the case of Paul's and Bob's algorithms,
which are based on reading four bytes at a time, I also
implemented a hacked version (marked (opt.), for optimized),
below, which called std::string::data() to get the pointer,
and reinterpret_cast it to uint32_t const* (or uint16_t
const*, in the case of Paul's), to see what difference that
made.
2. Which leads me to the second remark: both Paul's and Bob's
algorithms core dump for certain input on my usual platform
(Sun Sparc), because they don't ensure correct alignment;
incorrect alignment will also cause them to run considerably
slower on an Intel or AMD platform. In my case, the actual
implementations of std::string I use (g++, Sun CC, and once
I get the library back up and running with it, VC++) all
happen (by chance) to ensure that the pointer returned by
data() will be aligned, but I can easily imagine very viable
implementations where this is NOT the case. As such, I
would consider either implementation as unusable except in
certain cases, unless you do as I did above, and reassemble
the bytes (both approaches are sketched below).
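To make the difference concrete, the two ways of fetching the
next 32 bits look more or less like this (again, just a sketch;
the portable version assembles the word little-endian, and the
caller must guarantee that four bytes remain):

    #include <stdint.h>

    //  Portable: assemble the word byte by byte; works
    //  regardless of the alignment of p.
    uint32_t
    nextWord( unsigned char const* p )
    {
        return uint32_t( p[ 0 ] )
            |  uint32_t( p[ 1 ] ) <<  8
            |  uint32_t( p[ 2 ] ) << 16
            |  uint32_t( p[ 3 ] ) << 24 ;
    }

    //  What the (opt.) versions do: undefined behavior unless
    //  p is suitably aligned---it cores on a Sparc, and slows
    //  down on Intel or AMD when the alignment is wrong.
    uint32_t
    nextWordCast( char const* p )
    {
        return *reinterpret_cast< uint32_t const* >( p ) ;
    }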
Anyway, here are the complete results (flattened to plain text
from the HTML tables my scripts generate, which are meant to be
incorporated into pages with a CSS header), for a Linux machine
with an AMD 64 X2 5200+ processor, compiled with g++ 4.2.1, -O3:
                      75URLs  1275URLs  ProgSyms  FrWords  Constructed  Dense
sorted vector          0.227     0.430     0.379    0.976        2.054  0.351
std::map               0.249     0.467     0.511    1.319        2.784  0.465
FNVHash                0.150     0.163     0.147    0.296        0.339  0.106
FNVHash (opt.)         0.149     0.167     0.147    0.295        0.338  0.106
Java hash              0.147     0.168     0.145    0.296        0.329  0.131
Mersenne 2^7-1         0.147     0.167     0.150    0.297        0.375  0.106
Mersenne 2^7-1 (opt.)  0.147     0.166     0.147    0.296        0.375  0.105
Mersenne 2^9-1         0.147     0.167     0.142    0.296        0.471  0.107
Mersenne 2^11-1        0.147     0.167     0.143    0.297        3.613  0.106
Sedgwick               0.181     0.182     0.152    0.299        0.346  0.107
Sobel                  0.163     0.182     0.151    0.300        0.360  0.128
Weinberger             0.246     0.262     0.165    0.313        0.372  0.184
K&R                    0.178     0.195     0.151    0.300        0.329  0.105
Open SDBM              0.165     0.182     0.152    0.301        0.368  0.107
Bernstein              0.147     0.166     0.147    0.296        0.366  0.129
Knuth                  0.134     0.153     0.148    0.296        0.361  0.131
Partow                 0.181     0.197     0.155    0.303        0.360  0.110
Pearson                0.434     0.456     0.224    0.384        0.413  0.153
Jenkins                0.149     0.168     0.157    0.309        0.362  0.147
Jenkins (opt.)         0.135     0.153     0.156    0.307        0.359  0.147
Hsieh                  0.129     0.150     0.156    0.299        0.870  0.140
Hsieh (opt.)           0.121     0.143     0.157    0.296        0.868  0.138
CRC                    0.249     0.270     0.181    0.325        0.381  0.143
CRC (opt.)             0.250     0.270     0.178    0.325        0.383  0.142
The times are in microseconds per access.
There's a lot more that could be said, but at the least, it's
important to describe the data sets:
75URLs: 73 URL's, taken from the history file of my browser on
one particular occasion.
1275URLs: 1259 URL's, from all of the files in my browser's
cache, on the same occasion.
(The numbers in the names of
these two represent the original number of entries, before I
realized that there were duplicates, and eliminated them.)
ProgSyms: The set of all symbols and numeric literals in my code
base (the one posted at my site, I think), 8554 entries.
FrWords: A set of French words, extracted from the sources for
the French ispell tables, 47451 entries. (Note that this is
the only table with anything other than standard ASCII; it's
in ISO 8859-1, with accented characters.)
Constructed: An artificially constructed set of all of the words
from x000000 to x999999 (for exactly one million entries---I
wanted something big, with a lot of close matches).
Dense: Another artificially constructed set, with all two
character strings consisting of printable ASCII characters,
so 94*94 = 8836 entries. (This one was constructed
intentionally to display a theoretical weakness of the Java
hash code, which is in fact my Mersenne prime algorithm
using 2^5-1; see the example just after this list.)
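To spell the weakness out: with a = 2^5-1, a two character
string hashes to 31*c1 + c2 before the modulo, and since 31 is
smaller than the size of the alphabet, distinct strings collide
outright; for example, 31*'a' + 'b' = 31*97 + 98 = 3105 =
31*98 + 67 = 31*'b' + 'C', so "ab" and "bC" hash identically.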
About the only comment I'll make on the results is that with the
exception of the two sets of URL's, most of my strings are
fairly short; I'll write up an AWK script to calculate the
average, median, max and min lengths (although for the last two
sets, they're trivial), but off hand, ProgSyms is the only other
one which contains a significant number of entries with more
than, say, 10 characters. (Some of the URL's are also pretty
short, however.) This alone could account for the differences
between my measurements and Paul's; obviously, for very short
strings, there can be nothing gained by reading four characters
at a time. At any rate, it's precisely the two sets with the
URL's in which Paul's and Bob Jenkins' algorithms come out best.
Which brings me back to what is probably the most important
point: which hash code is best will depend on your data sets,
the underlying hardware and the compiler, and (one thing we've
not mentioned) how the hash table works. (The above
measurements were done using my own AssocArrayOf class, also
present at my site. A class using a different growth strategy
or collision handling may perform differently.) IMHO, the
Mersenne 2^7-1 algorithm has the advantage of being among the
best pretty much everywhere, and is dead simple to implement, so
I'd stick with it in general. But if your profiler says that
hash table accesses are an issue, it's definitely worth trying
some of the others.
--
James Kanze (GABI Software)             email:james.kanze@gmail.com
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34