Re: Tree Hashing / Equivalence

From:

"James Kanze" <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++.moderated

Date:

Sun, 4 Feb 2007 15:12:29 CST

Message-ID:

<1170598815.035320.67540@a34g2000cwb.googlegroups.com>

Al wrote:

James Kanze wrote:
<snip>

There are two issues here.

The first is to visit the nodes in a canonic order. (Note that
if the order of the children in the vector doesn't affect
equality, it can't be allowed to affect the hash code either.)

The order of the children does matter. To provide greater context, this
is an AST. Actually, it's a little higher level than that, so more like
a semantic resolution tree.

That makes calculating a hash code marginally simpler.

The second is to recursively integrate the hash values of the
subtrees into the hash value of the node. Basically, this can
be handled the same way you hash a string, if 1) the subtree
hash is convertable to unsigned (which should be the case), and
2) either order matters, or you visit the subtrees in a
canonical order. (I've got some templates at my site,
http://kanze.james.neuf.fr/code-en.html. Subsystem Basic,
component Hashing. The elements there are designed for use with
my pre-standard hash tables, but it shouldn't be too difficult
to adapt them to other hash tables.)

It's the "integration" step that I'm worried about. Basically, I don't
want to do something that is mathematically unsound and basically
collides all over the place. The reverse is not as problematic -- if two
tree hashes don't match, when they should have, the worst that will
happen is it will trigger a re-process.

I'm not sure I fully understand. If the two hashes don't
match, you know that the trees are not the same; if they do,
however, you can't make any assumptions about their
equality. They might be the same, and they might not be.
(If you think about it, it can't be otherwise. The data you
are hashing contains a lot more bits than the final hash
code, so inevitably, there will be more than one bit pattern
which must result in the same hash.)

Depending on the quality of the hashing function, you might
be able to make some statistical statements. But the
hashing component I sited isn't designed with that in mind;
it is designed for use in a hash table, where to be useful,
it has to be relatively rapid (and matches will be followed
with equality comparisons anyway). Cryptographic hashs are
designed to reduce to a maximum the probability of a change
resulting in the same hash value, and depending on the
application, it might be acceptable to use an MD5 hash, or
some such (but I'd still be sceptical).

On the other hand, if you're using it for caching, you'll
probably need a much better (and more expensive) hashing
algorithm---something like MD-5 or SHA-1. And even then, you'll
have to accept the fact that two different trees can hash equal.
If the goal is to avoid some expensive work when the trees are
equivalent, I don't think that hashing will do; you need to
check for complete equivalence.

Yes, the goal is to avoid reprocessing a tree if its hash matches an
existing version that has already been processed (or "cached"). So, for
example, say the user merely introduces some insignificant whitespace to
the source file. We still need to re-parse it, but ideally, the hash
will match the existing post-processed version (so it wouldn't even need
to be loaded into memory).

A hash is only a first approximation, and probably not
appropriate here. If the hash values are different, you
know that you have to reprocess, but if they are the same,
you still have to compare for equality. So if you need to
reprocess, the gain from not comparing for equality is
probably minimum, and if you do, you might as well do the
comparison for equality immediately, and save the hashing,
because you have to do it anyway.

I took a look at the library you linked to, and there appear to be some
items I could reuse, such as Gabi::HashAccumulator. Is there a complete
example that uses the Hashing components available?

Not really. They're generally used with Gabi::AssocArray,
but the whole point is just making some convenient functions
available to make it easier for the user to define a hash
code (and to avoid bad hash codes).

Perhaps it makes more sense to flatten the tree BFS style and then hash
the nodes linearly than trying to hash in true nested tree form.

The structure isn't really that relevant, as long as you can
iterate over it.

--
James Kanze (Gabi Software) email: james.kanze@gmail.com
Conseils en informatique orient?e objet/
                    Beratung in objektorientierter Datenverarbeitung
9 place S?mard, 78210 St.-Cyr-l'?cole, France, +33 (0)1 30 23 00 34

--
      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated. First time posters: Do this! ]