Re: Best algorithm/data structure combination for word lookup?

From:

=?ISO-8859-1?Q?Erik_Wikstr=F6m?= <Erik-wikstrom@telia.com>

Newsgroups:

comp.lang.c++

Date:

Fri, 27 Jul 2007 16:59:37 GMT

Message-ID:

<Zvpqi.4817$ZA.2191@newsb.telia.net>

On 2007-07-27 17:59, pj@moodle.org wrote:

Hi,

I 'm currently writing a program that performs transliteration (i.e.,
converts greek text written using the english alphabet to "pure" greek
text using the greek alphabet) as part of my thesis. I have decided to
add the capability to convert words using some sort of lookup
algorithm as a sidekick to the "normal" conversion algorithm, and here
it starts getting interesting. I want to find an algorithm that
satisfies these requirements:

1. Can support a N-to-1 input words to output words relationship
without duplicating storage (contrived example: "133t" and "1337"
should both somehow point to "leet" without having to store the string
"leet" twice)

2. Has very fast negative answers to lookups, as I don't expect lookup
dictionaries to get big (e.g. 1000 words maximum) and therefore most
of the time lookups will turn out to be negative.

When using binary trees/searches the time it takes to find out that the
word is not in the dictionary must be the same as it takes to find that
it isn't, if you think about it you'll see why.

3. Did I mention speed? This algorithm will be used only for small
(e.g. 2-4 letters) words, with which the normal conversion is already
fast. This one will have to be even faster to be useful.

A question, is there much work involved translating letter for letter,
ie. isn't there more or less a 1:1 mapping of the Latin letters to the
Greek? Just curious since you probably wont see any gains if it's that
simple.

Things that I don't mind trading off:

1. Mutability of the associated data structures. They are going to be
built once at program startup and not modified for the duration of the
run.

My current thoughts are:

* As per requirement #1, we should probably associate strings with
integers. These integers will be used as indexes to another structure
that provides the output words (std::vector most likely).

I'd probably use a pointer instead of an int, doubt that there'll be
much difference in speed.

* As per requirement #2, the fastest way to do the lookup would be
either a hash function (that probably means using std::map) or binary
search (that probably means a sorted std::vector).

So we have a couple of different possible implementations:
Impl. A:
Sorted std::vector<std::pair<std::string, int> > for lookups with
binary search and to associate each word with an index into another
container
std::vector<std::string> to store the output words
Impl. B:
std::map<std::string, int> for lookups
std::vector<std::string> for the output words

A sorted std::vector would be faster for lookups than std::map, right?

For small sets the vector could probably be faster but I think the map
will be better as the size increase (but I don't have anything to back
that up with, so measure yourself).

What about using a non-standard std::hash_map? A hash-based
implementation might be faster than binary search for large
dictionaries, but is it worth the trouble for smallish ones (e.g. the
1000 words I mentioned above)?

Unless you plan to use the application in an embedded system or handheld
then std::map/std::vector will probably be fast enough for a dictionary
of that size.

What do you suggest? Is there any other good option that I 'm not
considering?

Don't think it will give you much with such small sets of words but a
radix-tree might be a good choice. It's a tree-structure and the depth
of each node would be no more than the number of letters in the word
(might be less). Another advantage is that it can give you early
negatives, you'd only have to walk as deep as the word has letters in
common with a word that's in the dictionary. The Wikipedia article on
the subject will give you the idea: en.wikipedia.org/wiki/Radix_tree

--
Erik Wikstr?m