Re: code review / efficient lookup techniques...
On Nov 12, 10:30 pm, "James" <n...@spam.invalid> wrote:
"James Kanze" <james.ka...@gmail.com> wrote in message
[...]
Please correct me if I am way off base here, but I think the
complexity of the simple and naive algorithm I posted should
be O(8) for insert and lookup operations.
There is no such thing as O(8). The complexity is O(1).
Yikes! I was erroneously counting each dimension as a separate
access:
    return m_table.m_fp
        [g_translator[name[0]]]
        [g_translator[name[1]]]
        [g_translator[name[2]]]
        [g_translator[name[3]]];
Hmm... Okay. I count 4 separate accesses into the
g_translator table, plus a single access into the
m_table.m_fp table. Why would that not be O(5)?
Because there's no such thing as O(5). Big-O ignores any
constant multipliers, so O(5) would be the same thing as O(1).
[...]
Binary search is O(lg n). For small n (less than a couple
of hundred), the difference is not significant. (In the
tests I've done, binary search on a sorted vector or
std::map beats hash coding up to somewhere between 150 and
200 entries, even though hash coding is O(1) and the binary
search or std::map are O(lg n).)
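Just to be concrete, this is the sort of sorted-vector lookup
I mean; Entry, Command and byName are invented for the sketch,
and the vector has to be sorted with the same comparison
before the first lookup:

    #include <algorithm>
    #include <string>
    #include <vector>

    typedef void (*Command)();

    struct Entry
    {
        std::string name;
        Command     fp;
    };

    bool
    byName( Entry const& lhs, Entry const& rhs )
    {
        return lhs.name < rhs.name;
    }

    Command
    find( std::vector<Entry> const& table, std::string const& name )
    {
        Entry key;
        key.name = name;
        key.fp = 0;
        std::vector<Entry>::const_iterator it
            = std::lower_bound( table.begin(), table.end(),
                                key, byName );
        return it != table.end() && it->name == name
            ? it->fp
            : 0;
    }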
Have you tried using std::map as buckets in a static hash map?
    struct my_map
    {
        std::map<...> m_buckets[DEPTH];
    };
?
What would that buy you? If you've got more than two or three
entries in a bucket, your hash function is no good; change it,
and not the structure of the hash table.
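By "change it", I mean use a function which actually mixes all
of the bits of the key into the result; something FNV-like,
along these lines, is usually sufficient (just a sketch; the
constants are the usual 32 bit FNV ones):

    #include <string>

    //  FNV-1a, 32 bit variant: every character affects every
    //  bit of the result, so the entries spread evenly over
    //  the buckets.
    unsigned long
    hash( std::string const& key )
    {
        unsigned long result = 2166136261UL;
        for ( std::string::const_iterator it = key.begin();
                it != key.end(); ++ it ) {
            result ^= static_cast<unsigned char>( *it );
            result *= 16777619UL;
        }
        return result;
    }

The bucket index is then hash( name ) % DEPTH.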
Big O complexity is only significant when the number of
elements becomes large. (How large depends on the
difference in the big O factor. The difference between
constant and quadratic can become significant very quickly.
The difference between constant and logarithmic doesn't
become significant until you really do have a lot of
entries; lg of a million is still only about 20.)
I am going to populate the table with very many entries
(e.g., hundreds of thousands).
In which case, a hash table or a tree might be the most
appropriate solution. If you really have more than a hundred
thousand, your original solution would be fine as well; the
table won't be that sparse; after all, there are only 475254
possible entries (26 + 26^2 + 26^3 + 26^4).
Otherwise: the advantage of a dynamically allocated trie is that
it doesn't use anywhere near as much memory if the table really
is sparse, since empty branches are represented by null
pointers.
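Concretely, something along these lines (a sketch only: the
node layout and the names are invented, and I map 'a'..'z'
directly instead of going through something like your
g_translator):

    #include <string>

    typedef void (*Command)();

    //  One node per prefix actually present; absent branches
    //  are just null pointers, so a sparse table stays small.
    struct TrieNode
    {
        Command   fp;           //  non-null if a name ends here
        TrieNode* child[26];    //  one slot per letter 'a'..'z'

        TrieNode() : fp( 0 )
        {
            for ( int i = 0; i != 26; ++ i ) {
                child[i] = 0;
            }
        }
    };

    Command
    find( TrieNode const* node, std::string const& name )
    {
        for ( std::string::size_type i = 0;
                node != 0 && i != name.size(); ++ i ) {
            node = node->child[ name[i] - 'a' ];
        }
        return node != 0 ? node->fp : 0;
    }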
[...]
Rather than extra braces, I'd just put the processing itself
in a separate function. (But this is perhaps more a
production code consideration. If you don't have command
line options or arguments, configuration files, etc., it's
probably overkill.)
Do you think it's a bad habit of mine to add the extra braces?
A separate function would most certainly work very well.
In your context, no. Professionally, programs tend to be a lot
more complicated, have options, read configuration files, and
things like that. Which means that main is pretty much taken up
by calling the functions to do all this, and doesn't contain any
of the program logic.
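Something on the order of the following, where all of the
names are invented for the illustration:

    //  Hypothetical application classes; only the shape matters.
    class Options
    {
    public:
        Options( int argc, char** argv );
    };
    class Configuration
    {
    public:
        explicit Configuration( Options const& );
    };
    int process( Configuration const& );

    int
    main( int argc, char** argv )
    {
        Options       options( argc, argv );    //  command line
        Configuration config( options );        //  config files, etc.
        return process( config );               //  the real logic
    }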
[...]
BTW, can you think of a faster way to perform the lookup
process???
Given that the number of used entries in the map is 5, as in
your example, it is quite certain that anything which uses a
compact map (of size about 20 bytes) will be faster than
your solution (which uses more than 500000 bytes), assuming,
of course, any kind of memory caching in the system.
I fully intend to insert many command strings, with random
names assigned to random command functions. I have to turn
this in on Monday. IMVVVHO, the sparse 256-ary array based
tree should work fine for the large data set.
You mean a 4*27 array, don't you? 4*256 probably won't fit
into memory.
I made a stupid mistake! I was thinking of the translator
array which is 256 elements wide.
Yes, but it's only one dimension.
Hmm... I do still think of the 4*27 array as being similar to
a 4-level 27-ary trie. Am I off my rocker, James?!
It's basically the same thing, except that you have
pre-allocated all possible combinations up front. When "all
possible combinations" is less than 500000, and you count on
several hundred thousand being filled, no problem. But what
happens when your prof decides that for the next exercise,
you're to remove the maximum-length-of-four restriction:-)?
Or simply to increase it to 32? I think you'll agree that
pre-allocating all possible combinations is not going to
work then; it has to be something dynamically built up.
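With a trie like the one sketched earlier, lifting the length
restriction is trivial, since insertion just allocates the
missing nodes along the path (again a sketch, reusing the
invented TrieNode):

    void
    insert( TrieNode* root, std::string const& name, Command fp )
    {
        TrieNode* node = root;
        for ( std::string::size_type i = 0; i != name.size(); ++ i ) {
            int index = name[i] - 'a';
            if ( node->child[ index ] == 0 ) {
                node->child[ index ] = new TrieNode;
            }
            node = node->child[ index ];
        }
        node->fp = fp;      //  no maximum length anywhere
    }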
If you're looking for the simplest solution, then linear
search is fine. If the number of entries becomes too large,
you can either sort the array and use binary search, or
switch to a trie or a hash table. (I'd probably use a trie,
but I've a certain experience with this sort of code, and it
probably wouldn't take me more than an hour to implement it.
If you've never done anything similar before, it could
easily take a day or two, once you've fully understood the
algorithm.)
I created this tree with this homework problem in mind:
http://groups.google.com/group/comp.programming/browse_frm/thread/b63...
Do you think I should use it instead of the 4D array?
I don't know. I don't like hash tables with a fixed number of
buckets. It's not that hard to grow them (unless there are
iterators into them).
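Growing is just a question of rehashing into a larger bucket
array, something along these lines (a deliberately simplified
sketch; the layout is invented, and hash() could be the FNV
function sketched earlier):

    #include <cstddef>
    #include <list>
    #include <string>
    #include <utility>
    #include <vector>

    unsigned long hash( std::string const& );   //  e.g. FNV, above

    struct GrowingHashMap
    {
        typedef std::pair<std::string, int> Entry;
        std::vector< std::list<Entry> >     buckets;

        explicit GrowingHashMap( std::size_t n ) : buckets( n ) {}

        //  Called when the load factor gets too high.  Note
        //  that this invalidates any iterators into the map.
        void
        grow()
        {
            std::vector< std::list<Entry> > larger(
                2 * buckets.size() );
            for ( std::size_t i = 0; i != buckets.size(); ++ i ) {
                std::list<Entry>::iterator it = buckets[i].begin();
                for ( ; it != buckets[i].end(); ++ it ) {
                    larger[ hash( it->first ) % larger.size() ]
                        .push_back( *it );
                }
            }
            buckets.swap( larger );
        }
    };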
--
James Kanze