Re: advice needed on maintaining large collections of unique ids

From: Tom Anderson <twic@urchin.earth.li>
Newsgroups: comp.lang.java.programmer
Date: Tue, 23 Jun 2009 21:06:50 +0100
Message-ID: <alpine.DEB.1.10.0906232045230.14570@urchin.earth.li>
On Tue, 23 Jun 2009, Andrew wrote:

> I am working on a system that stores items in a CMS. Each item is
> indexed by a unique id. The uid can be quite large (255 bytes). We will
> be having several million items in the CMS. Items occasionally get
> deleted from the CMS. I need to write a program that watches the CMS to
> see what new uids appear there. The size of a uid and the sheer number
> of them prevent all the uids from being held in memory at the same time.
> Does anyone have any suggestion as to how I could detect new uids
> please?


The best thing would be to trap additions and deletions as they happen,
rather than detecting them with some big diff-like process afterwards.
Database triggers, hooks into the CMS code, proxies on network
connections, things like that might work.
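
As a rough sketch of the "hooks into the CMS code" option: if you can touch
the save and delete paths, have them call a listener as items come and go.
The interface and names here are made up for illustration:

/* Hypothetical hook: the CMS's save and delete paths call these methods
   as items come and go, so nothing has to be diffed after the fact. */
public interface ItemListener {
    void itemAdded(String uid);
    void itemDeleted(String uid);
}

/* One possible listener: write each event somewhere durable for the
   watcher to pick up later. */
class LoggingItemListener implements ItemListener {
    public void itemAdded(String uid)   { System.out.println("ADD " + uid); }
    public void itemDeleted(String uid) { System.out.println("DEL " + uid); }
}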

Do you just need to detect new items, or removed ones as well? Can you
affect the way the CMS stores things? If the former and yes, then add an
item number to each item, allocated from a counter. Keep track of the
highest number allocated so far. To find all items added since time t,
just look for all items with a number higher than the highest allocated
number at t. If you can't change the CMS code, you might be able to do
this by adding an autoincrementing ID column to the database underneath
it.
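
If you do go the item-number route, the watcher boils down to a single
query. A sketch, assuming a made-up cms_item table with uid and item_no
columns (item_no being the autoincrementing number):

import java.sql.*;
import java.util.*;

public class AddedSinceLastRun {

    /* Returns the uids of items whose item_no is higher than the highest
       number seen on the previous run. The caller should also record the
       new MAX(item_no) as the high-water mark for next time. */
    public static List<String> newUids(Connection db, long lastSeenItemNo)
            throws SQLException {
        List<String> added = new ArrayList<>();
        String sql = "SELECT uid FROM cms_item WHERE item_no > ? ORDER BY item_no";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setLong(1, lastSeenItemNo);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    added.add(rs.getString("uid"));
                }
            }
        }
        return added;
    }
}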

> One obvious all-in-memory solution is to maintain a Map of uids. One
> reads in the old Map, scans the CMS and for each uid writes an entry to
> the Map. This will update any entry already there and create any entry
> that needs to be created. Simple. But it just takes up too much memory.


No, it doesn't. Unless you're really tight on memory, you should be able
to fit all the IDs in memory.
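
For what it's worth, the straightforward in-memory version is only a few
lines. A sketch, where CmsClient and its scanUids() method stand in for
whatever API the CMS really exposes:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class NewUidDetector {

    public static void main(String[] args) throws IOException {
        // Load the uids recorded on the previous scan, if any.
        Set<String> previous = new HashSet<>();
        File snapshot = new File("uids.snapshot");
        if (snapshot.exists()) {
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(snapshot), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) previous.add(line);
            }
        }

        // Scan the CMS, reporting anything we haven't seen before.
        Set<String> current = new HashSet<>();
        for (String uid : new CmsClient().scanUids()) {
            current.add(uid);
            if (!previous.contains(uid)) System.out.println("new uid: " + uid);
        }

        // Persist this scan for next time.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(snapshot), StandardCharsets.UTF_8)) {
            for (String uid : current) out.write(uid + "\n");
        }
    }

    /* Placeholder for whatever the CMS actually provides. */
    static class CmsClient {
        Iterable<String> scanUids() { return Collections.emptyList(); }
    }
}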

If memory is tight, you could look for ways to reduce the size of the ID
set.

Do the IDs have common prefixes? For instance, are they path-like, like
/articles/news/2009/january/26/wales/7539084? If so, you can store them
much more compactly in a trie (probably - it depends on exactly how much
commonality there is).
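
A minimal sketch of the trie idea, assuming the IDs are '/'-separated
paths (class and method names are made up; a production version would
want a more compact node than a HashMap, and interned segment strings):

import java.util.*;

public class PathTrie {

    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean terminal; // true if an ID ends at this node
    }

    private final Node root = new Node();

    /* Adds an ID, returning true if it wasn't already present. Common
       prefixes like /articles/news/2009/... are stored only once. */
    public boolean add(String id) {
        Node n = root;
        for (String segment : id.split("/")) {
            if (segment.isEmpty()) continue; // skip the piece before the leading '/'
            n = n.children.computeIfAbsent(segment, k -> new Node());
        }
        boolean isNew = !n.terminal;
        n.terminal = true;
        return isNew;
    }

    public boolean contains(String id) {
        Node n = root;
        for (String segment : id.split("/")) {
            if (segment.isEmpty()) continue;
            n = n.children.get(segment);
            if (n == null) return false;
        }
        return n.terminal;
    }
}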

Are the IDs text strings? If so, they probably contain much less
information than their length suggests; you could probably compress them
quite a bit - English text typically has ~1.5 bits of information per
character, so a 255-character key could typically be reduced to 48 bytes
(255 x 1.5 bits is roughly 383 bits). A simple Huffman
code, or perhaps a second-order scheme of some sort, would do it. To avoid
having to make two passes to build the structure (one to collect
statistics to construct the code, and one to encode the strings),
construct a code upfront and apply the same code each time - it will
probably be good enough. If you wanted, you could update the code
alongside doing an encoding run - while applying the code to batch n,
you'd be building a code ready to use for n+1.
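
A sketch of the build-the-code-upfront idea. This builds a fixed Huffman
code over bytes from a sample of keys and then encodes each key to a
packed bit array; the class name is made up, and if you need decoding or
unambiguous equality you'd store the bit count alongside the bytes:

import java.nio.charset.StandardCharsets;
import java.util.*;

public class KeyCompressor {

    /* Leaf or internal node of the Huffman tree. */
    private static final class Node implements Comparable<Node> {
        final int symbol;      // byte value 0..255, or -1 for internal nodes
        final long freq;
        final Node left, right;
        Node(int symbol, long freq, Node left, Node right) {
            this.symbol = symbol; this.freq = freq;
            this.left = left; this.right = right;
        }
        public int compareTo(Node o) { return Long.compare(freq, o.freq); }
    }

    private final String[] codes = new String[256]; // bit string per byte value

    /* Build the code once, up front, from a representative sample of keys. */
    public KeyCompressor(Iterable<String> sample) {
        long[] freq = new long[256];
        Arrays.fill(freq, 1); // smoothing, so unseen bytes still get a code
        for (String s : sample)
            for (byte b : s.getBytes(StandardCharsets.UTF_8))
                freq[b & 0xFF]++;

        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (int i = 0; i < 256; i++) pq.add(new Node(i, freq[i], null, null));
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node(-1, a.freq + b.freq, a, b));
        }
        assign(pq.poll(), "");
    }

    private void assign(Node n, String prefix) {
        if (n.symbol >= 0) { codes[n.symbol] = prefix; return; }
        assign(n.left, prefix + '0');
        assign(n.right, prefix + '1');
    }

    /* Encode a key: concatenate the per-byte codes and pack them into bytes. */
    public byte[] encode(String key) {
        StringBuilder bits = new StringBuilder();
        for (byte b : key.getBytes(StandardCharsets.UTF_8))
            bits.append(codes[b & 0xFF]);
        byte[] out = new byte[(bits.length() + 7) / 8];
        for (int i = 0; i < bits.length(); i++)
            if (bits.charAt(i) == '1') out[i / 8] |= (byte) (1 << (7 - (i % 8)));
        return out;
    }
}

To follow the batch-n/batch-n+1 idea, you'd also tally fresh byte
frequencies inside encode() and build a new KeyCompressor from them for
the next run.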

You can probably combine the two approaches if you're clever. Compressed
tries - someone must have invented those (radix trees and Patricia tries
are in that vein).

tom

--
The best way to predict the future is to invent it. -- Alan Kay
