Re: code review / efficient lookup techniques...
"James Kanze" <james.kanze@gmail.com> wrote in message
news:d6e4e144-d30c-4772-8ca5-aaf70485746e@j4g2000yqe.googlegroups.com...
On Nov 12, 10:30 pm, "James" <n...@spam.invalid> wrote:
"James Kanze" <james.ka...@gmail.com> wrote in message
[...]
Please correct me if I am way off base here, but I think the
complexity of the simple and naive algorithm I posted
should be O(8) for insert and lookup operations.
There is no such thing as O(8). The complexity is O(1).
Yikes! I was erroneously counting each dimension as a separate
access:
return m_table.m_fp
    [g_translator[name[0]]]
    [g_translator[name[1]]]
    [g_translator[name[2]]]
    [g_translator[name[3]]];
Humm... Okay. I count 4 separate accesses into the
g_translator table. I also count a single access into the
m_table.m_fp table. Why would that not be O(5)?
Because there's no such thing as O(5). Big-O ignores any
constant multipliers, so O(5) would be the same thing as O(1).
Okay. So, if an operation takes a fixed number of steps to solve a problem,
you can label it as having O(1) complexity even if the number of steps is a
thousand. Am I correcting my erroneous line of thinking wrt Big-O notation,
or screwing it all up again?
;^)
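(Writing the definition out to convince myself: f(n) is O(g(n)) if
f(n) <= c * g(n) for some constant c and all sufficiently large n. A lookup
that always takes 5 steps, or 1000, satisfies f(n) <= 1000 * 1 for every n,
so it is O(1); the constant just gets folded into c.)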
[...]
Binary search is O(lg n). For small n (less than a couple
of hundred), the difference is not significant. (In the
tests I've done, binary search on a sorted vector or
std::map beats hash coding up to somewhere between 150 and
200 entries, even though hash coding is O(1) and the binary
search or std::map is O(lg n).)
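So for the sorted-vector version, I take it you mean something roughly like
this? (Just my own sketch; the entry struct, the lookup function, and the
fp_command typedef are names I made up, not anything from your code.)

#include <algorithm>
#include <cstring>
#include <vector>

typedef void (*fp_command)(char const*);

struct entry
{
    char m_cmd[5];     // up to 4 characters plus a terminating nul
    fp_command m_fp;
};

// Order entries by command name so the vector can be sorted once up front.
inline bool operator<(entry const& lhs, entry const& rhs)
{
    return std::strcmp(lhs.m_cmd, rhs.m_cmd) < 0;
}

// Binary search over a vector that was sorted after all insertions.
fp_command lookup(std::vector<entry> const& table, char const* name)
{
    entry key;
    std::strncpy(key.m_cmd, name, 4);
    key.m_cmd[4] = '\0';
    std::vector<entry>::const_iterator it =
        std::lower_bound(table.begin(), table.end(), key);
    if (it != table.end() && std::strcmp(it->m_cmd, key.m_cmd) == 0)
        return it->m_fp;
    return 0;   // not found
}

That is, fill the vector, call std::sort(table.begin(), table.end()) once,
and every lookup after that is O(lg n).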
Have you tried using std::map as buckets in a static hash map?
struct my_map
{
    std::map<...> m_buckets[DEPTH];
};
?
What would that buy you? If you've got more than two or three
entries in a bucket, your hash function is no good; change it,
and not the structure of the hash table.
Well, AFAICT, the only thing it would buy you is that the pressure could be
taken off a single std::map. Which one would be better:
1: a single std::map with 8192000 items in it
2: or 8192 std::maps with only 8192 items in each one
This is probably really naive, but I think that distributing the load across
N std::map's should help out when the number of items gets really huge.
I know I must be missing something here.
;^(...
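Concretely, this is roughly what I was picturing (just a sketch; the hash
function and the bucket count are numbers I pulled out of thin air):

#include <cstddef>
#include <map>
#include <string>

typedef void (*fp_command)(char const*);

struct my_map
{
    enum { DEPTH = 8192 };   // number of buckets

    std::map<std::string, fp_command> m_buckets[DEPTH];

    // A crude hash over the characters of the command name.
    static std::size_t hash(std::string const& name)
    {
        std::size_t h = 0;
        for (std::size_t i = 0; i < name.size(); ++i)
            h = h * 31 + static_cast<unsigned char>(name[i]);
        return h % DEPTH;
    }

    void insert(std::string const& name, fp_command fp)
    {
        m_buckets[hash(name)][name] = fp;
    }

    fp_command lookup(std::string const& name) const
    {
        std::map<std::string, fp_command> const& b = m_buckets[hash(name)];
        std::map<std::string, fp_command>::const_iterator it = b.find(name);
        return it != b.end() ? it->second : 0;
    }
};

Each bucket then only has to hold roughly N/DEPTH entries, which is all I
mean by taking the pressure off a single std::map.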
Big O complexity is only significant when the number of
elements becomes large. (How large depends on the
difference in the big O factor. The difference between
constant and quadratic can become significant very quickly.
The difference between constant and logarithmic doesn't become
significant until you really do have a lot of entries.)
I am going to populate the table with a great many entries
(e.g., hundreds of thousands).
In which case, a hash table or a tree might be the most
appropriate solution. If you really have more than a hundred
thousand, your original solution would be fine as well; the
table won't be as sparse as that; after all, there are only
475254 possible entries.
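(If I follow the arithmetic, that is commands of one to four lower-case
letters: 26 + 676 + 17576 + 456976 = 475254 possible names.)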
Yeah. I intend to fill up the command map with many random strings,
assigning them to random functions. Something like:
#include <cctype>
#include <climits>
#include <cstddef>
#include <cstdlib>
#include <exception>

struct random_string
{
    // Fill buf with `size` random lower-case letters.
    static void generate(char* buf, std::size_t size)
    {
        for (std::size_t i = 0; i < size; ++i)
        {
            int c;
            do
            {
                c = std::rand() % (UCHAR_MAX + 1);
            }
            while (! std::isalpha(c));
            buf[i] = static_cast<char>(std::tolower(c));
        }
    }
};

struct random_command
{
    typedef void (*fp_command) (char const*);

    static void command_1(char const* msg) {}
    static void command_2(char const* msg) {}
    static void command_3(char const* msg) {}
    static void command_4(char const* msg) {}
    // and on and on

    // Pick one of the commands above at random.
    static fp_command generate()
    {
        static fp_command const g_cmds[] =
        {
            command_1,
            command_2,
            command_3,
            command_4
            // blah, blah
        };
        return g_cmds[std::rand() % (sizeof(g_cmds) / sizeof(fp_command))];
    }
};
int main()
{
    {
#define DEPTH 400000

        struct recall
        {
            char m_cmd[4];
            random_command::fp_command m_fp;
        };

        recall* r = new recall[DEPTH];

        for (std::size_t i = 0; i < DEPTH; ++i)
        {
            random_string::generate(r[i].m_cmd, 4);
            r[i].m_fp = random_command::generate();
            // add the string r[i].m_cmd with the function r[i].m_fp to the map
        }

        for (std::size_t i = 0; i < DEPTH; ++i)
        {
            random_command::fp_command fp = /* lookup r[i].m_cmd here */ 0;

            // check for correctness
            if (fp != r[i].m_fp)
            {
                throw std::exception();
            }
        }

        delete [] r;
    }

    return 0;
}
Otherwise: the advantage of a dynamically allocated trie is that
it doesn't use anywhere near as much memory if the table really
is sparse, since empty branches are represented by null
pointers.
Indeed!
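This is roughly how I picture a node, for what it's worth (my own sketch,
not the tree from my other post; it assumes the names are already lower-case
letters, and keeps 27 slots to match the translator indexing even though
slot 0 goes unused here):

#include <cstddef>

typedef void (*fp_command)(char const*);

struct trie_node
{
    trie_node* m_child[27];   // letters map to 1..26, as with g_translator
    fp_command m_fp;          // set only on nodes that terminate a command

    trie_node() : m_fp(0)
    {
        for (std::size_t i = 0; i < 27; ++i) m_child[i] = 0;
    }
};

// Branches are allocated only when a command actually uses them; everything
// else stays a null pointer and costs nothing.
inline void trie_insert(trie_node* root, char const* name, fp_command fp)
{
    trie_node* node = root;
    for (char const* p = name; *p; ++p)
    {
        std::size_t i = static_cast<std::size_t>(*p - 'a') + 1;
        if (! node->m_child[i])
            node->m_child[i] = new trie_node;
        node = node->m_child[i];
    }
    node->m_fp = fp;
}

inline fp_command trie_lookup(trie_node const* root, char const* name)
{
    trie_node const* node = root;
    for (char const* p = name; *p; ++p)
    {
        std::size_t i = static_cast<std::size_t>(*p - 'a') + 1;
        node = node->m_child[i];
        if (! node)
            return 0;   // that branch was never allocated
    }
    return node->m_fp;
}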
[...]
Rather than extra braces, I'd just put the processing itself
in a separate function. (But this is perhaps more a
production code consideration. If you don't have command
line options or arguments, configuration files, etc., it's
probably overkill.)
Do you think it's a bad habit of mine to add the extra braces?
A separate function would most certainly work very well.
In your context, no. Professionally, programs tend to be a lot
more complicated, have options, read configuration files, and
things like that. Which means that main is pretty much taken up
by calling the functions to do all this, and doesn't contain any
of the program logic.
Okay. So, I am safe for now...
;^)
[...]
BTW, can you think of a faster way to perform the lookup
process???
Given that the number of used entries in the map is 5 as in
your example, it is quite certain that anything which uses a
compact map (of size about 20 bytes) will be faster than
your solution (using more than 500000 bytes) (of course
assuming there is any kind of memory caching present in the
system).
I fully intend to insert many command strings with random
names assigned to random command functions. I have to turn
this in on Monday. IMVVVHO, the sparse 256-ary array based
tree should work fine for the large data-set.
You mean a 4*27 array, don't you. 4*256 probably won't fit
into memory.
I made a stupid mistake! I was thinking of the translator
array which is 256 elements wide.
Yes, but it's only one dimension.
Humm... I do still think of the 4*27 array as being similar to
a 4-level 27-ary trie. Am I off my rocker, James?!
It's basically the same thing, except that you have
pre-allocated all possible combinations up front.
Thank you for clarifying. I am not nuts after all!
:^)
When "all
possible combinations" is less than 500000, and you count on
several hundred thousand being filled, no problem. But what
happens when your prof decides that for the next exercise,
you're to remove the maximum length of four restriction:-)?
Then the algorithm will be rendered into a totally useless state!
Ouch!
Or
simply increase it to 32---I think you'll agree that
pre-allocating for all possible combinations is not going to
work then; it has to be something dynamically built up.
100% totally agreed. Luckily, I am getting more and more comfortable with
tree data-structures...
If you're looking for the simplest solution, then linear
search is fine. If the number of entries becomes too large,
you can either sort the array and use binary search, or
switch to a trie or a hash table. (I'd probably use a trie,
but I've a certain experience with this sort of code, and it
probably wouldn't take me more than an hour to implement it.
If you've never done anything similar before, it could
easily take a day or two, once you've fully understood the
algorithm.)
I created this tree with this homework problem in mind:
http://groups.google.com/group/comp.programming/browse_frm/thread/b63...
Do you think I should use it instead of the 4D array?
I don't know. I don't like hash tables with a fixed number of
buckets. It's not that hard to grow them (unless there are
iterators into them).
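Just to see if I follow the growing part: with my bucket-of-std::map idea it
would be something like the sketch below (all the names here are mine, not
from this thread); allocate a bigger bucket array, rehash every entry into
it, and swap, which of course invalidates any iterators into the old buckets.

#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef void (*fp_command)(char const*);
typedef std::map<std::string, fp_command> bucket;

struct growable_map
{
    std::vector<bucket> m_buckets;
    std::size_t m_size;

    growable_map() : m_buckets(16), m_size(0) {}

    static std::size_t hash(std::string const& name)
    {
        std::size_t h = 0;
        for (std::size_t i = 0; i < name.size(); ++i)
            h = h * 31 + static_cast<unsigned char>(name[i]);
        return h;
    }

    void insert(std::string const& name, fp_command fp)
    {
        if (m_size >= m_buckets.size() * 3)   // crude load-factor check
            grow();
        m_buckets[hash(name) % m_buckets.size()][name] = fp;
        ++m_size;
    }

    // Double the bucket count and redistribute every existing entry.
    void grow()
    {
        std::vector<bucket> bigger(m_buckets.size() * 2);
        for (std::size_t b = 0; b < m_buckets.size(); ++b)
        {
            for (bucket::const_iterator it = m_buckets[b].begin();
                 it != m_buckets[b].end(); ++it)
            {
                bigger[hash(it->first) % bigger.size()][it->first] = it->second;
            }
        }
        m_buckets.swap(bigger);
    }
};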
I am only using the hash table to take pressure off a single tree. Is this a
boneheaded line of thinking, James?
BTW, I think I am learning more on this group about practical programming
practices than in the classroom!
:^o
thanks everyone!