Re: Best algorithm/data structure combination for word lookup?

From:

James Kanze <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++

Date:

Sun, 29 Jul 2007 16:11:53 -0000

Message-ID:

<1185725513.849012.64810@l70g2000hse.googlegroups.com>

On Jul 27, 5:59 pm, "p...@moodle.org" <John.Papaioan...@gmail.com>
wrote:

I 'm currently writing a program that performs transliteration (i.e.,
converts greek text written using the english alphabet to "pure" greek
text using the greek alphabet) as part of my thesis. I have decided to
add the capability to convert words using some sort of lookup
algorithm as a sidekick to the "normal" conversion algorithm, and here
it starts getting interesting. I want to find an algorithm that
satisfies these requirements:

1. Can support a N-to-1 input words to output words relationship
without duplicating storage (contrived example: "133t" and "1337"
should both somehow point to "leet" without having to store the string
"leet" twice)

The alternative to storing the string "leet" twice is to use
pointers. And it's far from obvious that a pointer requires
less memory than the string---on my machine, a pointer is 8
bytes, where as I can store "leet" in five bytes.

2. Has very fast negative answers to lookups, as I don't expect lookup
dictionaries to get big (e.g. 1000 words maximum) and therefore most
of the time lookups will turn out to be negative.

For 1000 or so words, std::set is more than adequate. For
100000, of course, you'd have to consider some sort of hashing.

You might also consider some sort of tri.

3. Did I mention speed? This algorithm will be used only for small
(e.g. 2-4 letters) words, with which the normal conversion is already
fast. This one will have to be even faster to be useful.

Things that I don't mind trading off:

1. Mutability of the associated data structures. They are going to be
built once at program startup and not modified for the duration of the
run.

My current thoughts are:

* As per requirement #1, we should probably associate strings with
integers. These integers will be used as indexes to another structure
that provides the output words (std::vector most likely).

Which would make it significantly slower, without necessarily
reducting memory very much.

* As per requirement #2, the fastest way to do the lookup would be
either a hash function (that probably means using std::map) or binary
search (that probably means a sorted std::vector).

If the data is unmutable, and can be known at compile time, the
simplest solution is probably std::lower_bound over a C style
array, statically initialized; if you can set a maximum length
of, say 7, for the strings, this can avoid all pointers
altogether, and be very, very fast. (It also have very good
locality, which can make a significant difference in real life,
even if it doesn't affect the big-O factor.)

Otherwise, std::map (normally a balanced tree, and never a hash
table) can be used.

So we have a couple of different possible implementations:
Impl. A:
Sorted std::vector<std::pair<std::string, int> > for lookups with
binary search and to associate each word with an index into another
container
std::vector<std::string> to store the output words
Impl. B:
std::map<std::string, int> for lookups
std::vector<std::string> for the output words

A sorted std::vector would be faster for lookups than std::map, right?

What do the measurements say on your machine? As I said, if the
data were known at compile time, I'd use a:

    struct Entry
    {
        char from[ 8 ] ;
        char to [ 8 ] ;

        bool operator<( Entry const& other ) const
        {
            return memcmp( from, other.from, 8 ) < 0 ;
        }
    } ;

    Entry const table[] =
    {
        // Entries pre-sorted.
    } ;

and std::lower_bound. A bit brutal, perhaps, but remarkably
efficient, because locality is as good as possible.

What about using a non-standard std::hash_map? A hash-based
implementation might be faster than binary search for large
dictionaries, but is it worth the trouble for smallish ones (e.g. the
1000 words I mentioned above)?

It depends on the efficiency of the hash function. There can be
some gain for as few as a couple of hundred elements, but in
practice, the difference is very, very small, and a hash table
will never has as good locality as something like the above.

--
James Kanze (Gabi Software) email: james.kanze@gmail.com
Conseils en informatique orient=E9e objet/
Beratung in objektorientierter Datenverarbeitung
9 place S=E9mard, 78210 St.-Cyr-l'=C9cole, France, +33 (0)1 30 23 00 34