Re: Really fast checksum?
On Sat, 28 Feb 2009, Spud wrote:
I need to determine very quickly whether a database record has changed or
not. Using JDBC, I select records from a table (possibly large), roll through
them, calculate a checksum on the text of all of the fields, and then check
it against a stored checksum to see if the record has changed.
What's the fastest checksum algorithm which still yields fairly reliable
results?
In the past I've used CRC32, but that only operates on bytes and I'd rather
not convert all the strings I'll get from JDBC into bytes (for performance
reasons).
So convert the chars into bytes yourself, in a way which you know will be
fast.
CRC32 crc;
String s;
for (int i = 0; i < s.length(); ++i) {
char ch = s.charAt(i);
crc.update((ch >> 8) & 0xff);
crc.update(ch & 0xff);
}
Can I just call String.hashCode() for each field and add them all up?
Simply adding the checksums won't detect field being swapped: {"foo",
"bar"} and {"bar", "foo") will have identical hashes. You can fix this by
instead computing a sort of meta-hash over the hashes, eg using a
multiply-and-add scheme like the implementation of String.hashCode to
combine them. But ...
Is that reliable enough that I can be almost certain that if the
contents of the record changes, the sum of the hash codes will change as
well?
No hash function, of any kind, is capable of giving you that guarantee.
Look up the pigeonhole principle if you don't believe me.
The best you can get is a probabilistic assurance: for an ideal n-bit
hash, there's a 1 in 2^n chance that two different records will have the
same hash. CRC32 is 32-bit, so that's a one in four billion chance. If you
put that probability into the mathematics for the birthday paradox [1],
you see that if you have as few as 2^16 = 65536 records, you can expect to
have two with the same hash. CRC32 is adequate for verifying message
integrity in the face of small random errors (it's specifically engineered
to catch the kind of errors you get in communications - single flipped
bits, runs of stuck bits, etc), but it's not really up to the job you want
to put it to.
Or do I have to do something fancy like MD5?
Something fancy would be better. SHA1 is the new MD5, so that's your first
choice. It's a 160-bit hash, so you'd have to get up to 2^80 records
before you have to worry about birthday collisions. It's also engineered
to be more general than CRC32, so it should deal with all kinds of
variations in the records, including ones designed by evil hackers to fool
you. Note that it still can't guarantee that two records are the same,
though, so if the equality of records is a mission-critical matter, then
you're going to have to do full comparisons.
tom
[1] http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem
--
Orange paint menace