Re: How to quickly search through arrays?

From:

=?ISO-8859-1?Q?Marcel_M=FCller?= <news.5.maazl@spamgourmet.com>

Newsgroups:

comp.lang.c++

Date:

Mon, 22 Feb 2010 20:20:46 +0100

Message-ID:

<4b82d90e$0$6721$9b4e6d93@newsspool2.arcor-online.net>

Steve555 wrote:

Thanks for the comprehensive reply. I've just googled as much as I
could to try and understand before replying.
I really need to learn about hash tables, before I get why I need six
of them!
(I did say I was new to search techniques ;) )

Feel free to use std::hash_map and forget about the internal structure
of hash tables.

The lookup table (library of sequences) never changes. The elements
will never be bigger than 171, so yes I could pack 4 of them into 32
bits.
Are you, then, saying to store each sequence-of-4 in a single
unsigned long, and do some kind of bit search?

   typedef unsigned char MyEntry[4];

will most likely fit your needs.

If you encounter problems because array types are a bit special in C/C++
wrap it in a struct:
   struct MyEntry
   { unsigned char Column[4];
   };

Objects of type MyEntry are now 32 bits.

   struct MyHashKey
   { unsigned char Key[2];
   };

MyHashKey is intended to be used as hash key for two arbitrary columns.

Because MyHashKey is no integral type you must define a hash function to
use it in a hashtable.

   int MyHashFunc(MyHashKey key)
   { // portable version:
     return key[0] | (key[1] << 8);
     // fast non-portable version:
     // return *(short*)key;
   }

As content I would recommend to use something like:

   template <typename Compare>
   typedef MyHashEntry std::set<MyEntry, Compare>;

It will contain the matching MyEntry lines of your data ordered by the
remaining columns, to speed up lookups with one or no wildcards. If you
have two wildcards enumerate over the entire sublist.
The Compare template parameter is needed because Content is sorted by
different columns depending on your first key columns. It is a functor
that provides the comperator for two MyEntries.

Now you need the comparer mentioned above.

   template <int COL1, int COL2>
   bool MyColumnComparer(MyEntry l, MyEntry r)
   { unsigned lkey = (l.Column[COL1] << 8) | l.Column[COL2];
     unsigned rkey = (r.Column[COL1] << 8) | r.Column[COL2];
     return lkey < rkey;
   }

And now you can define your root data structures:

   std::hash_set<MyHashKey,
                 MyHashEntry<MyColumnComparer<3,4> >,
                 MyHashFunc> EntriesBy1234;

The above hashtable instance is intended to store the Entries hashed by
the first two columns and for each distinct values of the first two
colums ordered by the third column. In the same way you could define
further entry points for your searches.

   std::hash_set<MyHashKey,
                 MyHashEntry<MyColumnComparer<2,4> >,
                 MyHashFunc> EntriesBy1324;
   std::hash_set<MyHashKey,
                 MyHashEntry<MyColumnComparer<2,3> >,
                 MyHashFunc> EntriesBy1423;
   std::hash_set<MyHashKey,
                 MyHashEntry<MyColumnComparer<1,4> >,
                 MyHashFunc> EntriesBy2314;
   std::hash_set<MyHashKey,
                 MyHashEntry<MyColumnComparer<1,3> >,
                 MyHashFunc> EntriesBy2413;
   std::hash_set<MyHashKey,
                 MyHashEntry<MyColumnComparer<1,2> >,
                 MyHashFunc> EntriesBy3412;

For lookups with only one wildcard you can choose an arbitrary list with
the required key columns. There is some redundancy in this case.

Note, that all of that is only a rough idea. It may even contain
syntactical errors because I wrote down it quite quickly.
Also note that the data structure is only optimized for lookups not for
the population process. Changes are quite slow because they have to be
applied at different places simultaneously.

Sorry, I don't know how that would work... how would I differentiate
between the ones the exist in the library, and those that don't?

Sorry, I didn't understand the your last question.

Marcel