Re: code review / efficient lookup techniques...

From:

James Kanze <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++

Date:

Fri, 13 Nov 2009 01:40:50 -0800 (PST)

Message-ID:

<2a34dd7b-d0dc-4194-9aa6-3bacb8934ce9@n35g2000yqm.googlegroups.com>

On Nov 12, 5:01 pm, "James" <n...@spam.invalid> wrote:

"Paavo Helde" <myfirstn...@osa.pri.ee> wrote in message

A tag of 4 characters/bytes immediately suggests that you
could interpret those 4 characters as 32-bit unsigned
integers (unsigned is needed, as signed integers can have
trap values).

So can unsigned. Unsigned char is the only type which isn't
allowed to have trap values.

In theory, at least. But all systems are requires to support
values up to 2^32-1 in an unsigned long, so it doesn't matter.

Do you mean something like:

typedef unsigned int uint32_type;

typedef char assert_s
[
       sizeof(uint32_type) == sizeof(char) * 4
    && sizeof(uint32_type) * CHAR_BIT == 32
    ? 1 : -1
];

typedef char cmd_buffer[4];

union cmd
{
    cmd_buffer buffer;
    uint32_type whole;
};

then decoding 4 chars at once like:

union cmd c;

char const* const string = "command";

c.whole = *((*uint32_type)string);

?

More likely, he was thinking of something like:

    uint_fast32_t i = ((c[0] & 0xFF) << 24)
                    | ((c[1] & 0xFF) << 16)
                    | ((c[2] & 0xFF) << 8)
                    | ((c[3] & 0xFF) );

This is guaranteed to work on any machine.

(This should work for EBCDIC as well as far as both your
source code and the initial data for initializing the map
are in the same codepage.) Thus the mapping is simplified to
integer-to-value. One could hold these in a sorted array and
perform a binary search (or even a linear search in an
unsorted array if the number of entries is small, placing
the more frequent commands in the beginning).

Please correct me if I am way off base here but I think the
complexity for the simple and naive algorithm I posted should
be O(8) complexity for insert and lookup operations.

There is no such thing as O(8). The complexity is O(1). But
the constant overhead is high due to the lack of locality. It's
a relatively poor space/speed trade-off, given that on modern
machines, excess space results in a noticeable slow-down.
Better solutions exist.

This tells me that storing a large amount of items in the
array will have no impact on the complexity whatsoever.

Array access is O(1), regardless of the size of the array. But
that's purely a theoretical figure. In practice, if the array
is very sparsely populated, the program will run slower,
compared to other algorithms.

I cannot exactly claim that for a binary search whose
complexity parameters can be effected by N.

Binary search is O(lg n). For small n (less than a couple of
hundred), the difference is not significant. (In the tests I've
done, binary search on a sorted vector or std::map beats hash
coding up to somewhere between 150 and 200 entries. Even though
hash coding is O(1) and the binary search or std::map are O(ln
n). Big O complexity is only significant when the number of
elements becomes large. (How large depends on the difference in
the big O factor. The difference between constant and quadratic
can become significant very quickly. The difference between
constant an logrithmic does't become significant until you
really do have a lot of entries.

Below are just some stylistic comments...

[...]

int main()
{
{

Why extra braces?

This is probably going to sound retarded, but I personally
like to encapsulate everything in main within an extra scope
in order to ensure that all objects are destructed before I
actually return from main.

That's interesting. Generally (in actual applications---not
necessarily is homework type projects), main will only be a
couple of function calls, maybe one to check the command line
arguments, and one to do the actual work; in the case of a Unix
like filter, you might put the loop over the filenames in main,
but that's about it.

Rather than extra braces, I'd just put the processing itself in
a separate function. (But this is perhaps more a production
code consideration. If you don't have command line options or
arguments, configuration files, etc., it's probably overkill.)

Imagine I wanted to print a final message that occurs after
all messages from the object dtors are printed:

You can't, because you don't know the order the destructors of
static objects will be called:-).

[...]

static cmap<void, char const*> cm_normal(command_unknown);
static cmap<void, char const*> cm_reverse(command_unknown);

Why static?

They blow the stack on a Windows platform.

You can change the default stack size at link time, if that's
all that's bothering you.

BTW, I don't really see a need for dynamic memory allocation
here.

For large tables, it doesn't hurt anything. It costs the same
to allocate int[5] and int[5000000]. In the first case, that
cost has to be amortised over accessing 5 elements, which means
it may be significant. In the second, it's amortised over
accessing 5000000 elements, which means that it's practically
nothing.

[...]

BTW, can you think of a faster way to perform the lookup
process???

Given that the number of used entries in the map is 5 as in
your example, it is quite certain that anything which uses a
compact map (of size about 20 bytes) will be faster than
your solution (using more than 500000 bytes) (of course
assuming there is any kind of memory caching present in the
system).

I fully intend to insert many command strings with random
names assigned to random command functions. I have to turn
this in on Monday. IMVVVHO, the sparse 256-ary array based
tree should work fine for the large data-set.

You mean a 4*27 array, don't you. 4*256 probably won't fit into
memory.

If you're looking for the simplest solution, then linear search
is fine. If the number of entries because too large, you can
either sort the array and use binary search, or switch to a trie
or a hash table. (I'd probably use a trie, but I've a certain
experience with this sort of code, and it probably wouldn't take
me more than an hour to implement it. If you've never done
anything similar before, it could easily take a day or two, once
you've fully understood the algorithm.)

But as this is homework, the correctness of the solution
should be what matters, and the speed should not be an issue
at all (as long as the demo completes before the professor
gets bored ;-).

LOL!

The only reason why I am doing this is so I can get some extra
credit. I am interested in getting a good grade, and plan on
becoming at least a fairly decent programmer. Perhaps I can
even make some $$$ in the future.

If making money is your goal, the commercial side generally pays
a lot better:-). (Bill Gate's original Basic used a linear
search in the name table. It was, in fact, recommended to use
the most frequently used variables at the beginning of the
program, so they would be at the front of the table.)

--
James Kanze