Re: find words that contains some specific letters
Giovanni Azua wrote:
> Hi John,
> Please find my comments below.
"John B. Matthews" <nospam@nospam.invalid> wrote
The OP's requirements are not so clear to me. Perhaps the OP
can elucidate whether the algorithm addresses the
problem.
> Now that I read it carefully, to my understanding the Jumble problem is
> exactly what the OP was asking for. I apologize for not completely reading
> the wiki Jumble description the first time. You did a good job matching
> the problem to the OP's description.
"John B. Matthews" <nospam@nospam.invalid> wrote
The code [1] correctly implements the first algorithm described in the
article cited [2]. In particular, it uses the sorted letters of each
word as the key in a HashMap.
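
For reference, that approach amounts to roughly this sketch (my own
illustration, not the actual code from [1]):

import java.util.*;

public class Jumble {
    /** Sorted letters form the anagram key, e.g. "dog" -> "dgo". */
    static String key(String word) {
        char[] letters = word.toLowerCase().toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    public static void main(String[] args) {
        String[] dictionary = { "dog", "god", "cat", "act" };
        Map<String, Set<String>> anagrams = new HashMap<String, Set<String>>();
        for (String word : dictionary) {
            String k = key(word);
            Set<String> matches = anagrams.get(k);
            if (matches == null) {
                matches = new HashSet<String>();
                anagrams.put(k, matches);
            }
            matches.add(word);
        }
        // Unscramble by sorting the jumbled letters and looking them up:
        System.out.println(anagrams.get(key("ogd"))); // [god, dog] in some order
    }
}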
> This is why I think your solution does not solve the Jumble problem:
> Assume you have two sorted-character input words, say:
> String A
> String B
> Assume input A has the following word matches: mA1, mA2
> Assume input B has the following word matches: mB1, mB2, mB3
> Since the String class does not offer a perfect hash function, assume
> too that the following holds: hashCode(A) == hashCode(B). Under this
> scenario the hash table will place all mA* and mB* in the same bucket, i.e.
> hashTable.get(A) == Set of { mA1, mA2, mB1, mB2, mB3 }
> hashTable.get(B) == Set of { mA1, mA2, mB1, mB2, mB3 }
> Therefore your implementation will return wrong values for A, namely the
That is not true. HashMaps and Hashtables do not use hashCode() for equality
checks; they use equals(). String#equals() and Set#equals() are value-based,
so do not worry that you'll get wrong results.
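
A quick demonstration: in Java, "FB" and "Ea" happen to share a hash code
('F'*31 + 'B' == 'E'*31 + 'a' == 2236), yet a HashMap keeps their entries
perfectly distinct:

import java.util.*;

public class CollisionDemo {
    public static void main(String[] args) {
        // Two distinct keys with identical hash codes:
        System.out.println("FB".hashCode() == "Ea".hashCode()); // true
        System.out.println("FB".equals("Ea"));                  // false

        Map<String, Set<String>> map = new HashMap<String, Set<String>>();
        map.put("FB", new HashSet<String>(Arrays.asList("mA1", "mA2")));
        map.put("Ea", new HashSet<String>(Arrays.asList("mB1", "mB2", "mB3")));

        // Same bucket, but equals() keeps the entries separate:
        System.out.println(map.get("FB")); // [mA1, mA2] only
        System.out.println(map.get("Ea")); // [mB1, mB2, mB3] only
    }
}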
> elements that match B, and the same for B too. I would expect
> in a real-life dataset, e.g. the Oxford dictionary's 350k words and phrases,
350K words and phrases (I didn't know you meant phrases, too) seems low.
Let's say there are about 512 KiPhrases of interest, with an average length
of sixteen characters, typical for English.
> to have way more collisions than this, not only from the lack of
A regular HashMap for those data at the default .75f load factor will most
likely be about 475 KiBuckets long; about a quarter of those will be empty,
nearly all of the rest will contain one entry, and a few will have two,
three, or, much less likely, more.
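
For what it's worth, you can sanity-check bucket estimates like that with a
Poisson approximation, assuming the hash scatters keys roughly uniformly (the
entry and bucket counts below are just the figures from this thread, plugged
in as assumptions):

public class BucketStats {
    public static void main(String[] args) {
        double entries = 512 * 1024;       // assumed dictionary size
        double buckets = 475 * 1024;       // assumed table length
        double lambda = entries / buckets; // expected entries per bucket
        double p = Math.exp(-lambda);      // Poisson P(bucket holds 0 entries)
        for (int k = 0; k <= 3; k++) {
            System.out.printf("P(bucket holds %d entries) = %.3f%n", k, p);
            p *= lambda / (k + 1);         // P(k+1) = P(k) * lambda / (k + 1)
        }
    }
}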
> perfection of the String hash function, but also because
> the hash table cannot realistically offer a size that
> corresponds to the number of all the distinct sorted-character input words.
Can it not?
I forget the exact overhead of a String, but let's say 32 bytes each for the
object header and length fields - I'm just too lazy to look it up right now,
and the new 64-bit version of Java just cut its pointer overhead in half, so
I know the numbers can change anyway.
That's 16 MiB for String overhead, another 16 MiB for the Set values, another
16 MiB for Map.Entry instances, and another 16 MiB for the character data in
the Strings. That's 64 MiB, well within the capacity of nearly all JVM
installations. Hell, let's double it to 128 MiB. If RAM is that tight, you
can use clever compression techniques to lower the overhead. Nevertheless,
you can clearly see that a 128 MiB data structure is certainly realistic.
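
Spelled out, under the same rough per-object assumptions:

public class MemoryEstimate {
    public static void main(String[] args) {
        long entries = 512 * 1024;       // ~512 KiPhrases
        long perObject = 32;             // assumed overhead per object, bytes
        long chars = 16 * 2;             // sixteen chars at 2 bytes each

        long strings  = entries * perObject; // String overhead
        long sets     = entries * perObject; // Set values
        long mapEntry = entries * perObject; // Map.Entry instances
        long charData = entries * chars;     // character payload

        long total = strings + sets + mapEntry + charData;
        System.out.println(total / (1024 * 1024) + " MiB"); // 64 MiB
    }
}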
That there are collisions does not affect the accuracy of the match. Hash
codes are used merely as an optimization, to bring searches down near constant
time. Once the hash identifies a bucket, the Map then uses equals() to
distinguish between the k candidates at that bucket. That's why a good hash
code, such as String's, is desirable. It keeps the search down at O(k) rather
than O(m). Since in practice the String hash code is very good, the number of
dictionary Strings can vary widely without materially changing the performance
of HashMap#get(). The chances are very good that no bucket will contain more
than one Map.Entry<String, Set<String>>. A few might contain two or three.
You are correct that the worst case is O(m), or perhaps O(log m) if the Map
implementor keeps the bucket list properly sorted. If the hash really ever
does just happen to land all the keys in one bucket, then, yes, you are up
that infamous creek without the ameliorative paddle. Write back when this
actually happens to you.
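
Even in that pathological case the Map stays correct, merely slow. A toy
sketch (BadKey is my own illustration) with a deliberately degenerate hash:

import java.util.*;

public class WorstCase {
    /** A key whose hash lands every instance in one bucket. */
    static final class BadKey {
        final String s;
        BadKey(String s) { this.s = s; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).s.equals(s);
        }
        @Override public int hashCode() { return 42; } // pathologically constant
    }

    public static void main(String[] args) {
        Map<BadKey, String> map = new HashMap<BadKey, String>();
        map.put(new BadKey("A"), "mA");
        map.put(new BadKey("B"), "mB");
        // Every lookup walks one long bucket - O(m) - but is still correct:
        System.out.println(map.get(new BadKey("A"))); // mA
        System.out.println(map.get(new BadKey("B"))); // mB
    }
}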
--
Lew