Re: advice needed on maintaining large collections of unique ids
On Thu, 25 Jun 2009, Andrew wrote:
> On 23 June, 13:20, Thomas Pornin <por...@bolet.org> wrote:
>> According to Andrew <marlow.and...@googlemail.com>:
>>> The uid can be quite large (255 bytes).
>
> The uids are actually DOIs (from the world of digital publishing) which
> are ASCII text.
Those are definitely quite compressible. The trouble is that you may have
to use different compression schemes for each registrant. You can strip
off the leading "10.", because that's always the same, and you can convert
the following digit string to some more compact identifier - I would
suggest using a Huffman code here, so that more common registrants get
shorter codes.
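A minimal sketch of that in Python, where the counts are invented for
illustration ("1038" is Nature's registrant code, the others here are
made up):

    import heapq

    def huffman_codes(freqs):
        # Build a Huffman code from {symbol: count}; returns
        # {symbol: bitstring}. The integer tiebreaker stops heapq
        # from ever comparing the dicts.
        heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freqs.items())]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            n1, _, low = heapq.heappop(heap)
            n2, _, high = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in low.items()}
            merged.update({s: "1" + c for s, c in high.items()})
            heapq.heappush(heap, (n1 + n2, next_id, merged))
            next_id += 1
        return heap[0][2]

    # Hypothetical counts of how often each registrant appears:
    counts = {"1038": 50000, "1007": 30000, "1016": 20000, "9999": 10}
    codes = huffman_codes(counts)
    # codes["1038"] comes out as a single bit; the rare "9999"
    # gets the longest code.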
You should then switch to a registrant-specific scheme. For example,
Nature papers have a part which looks like "nature04924", so strip the
"nature" and encode the five digits in 17 bits. Or, since so far all of
the first digits are zero, encode the four nonzero digits in fourteen
bits. News articles and preprints have a different format, so you'll need
a couple of bits to distinguish those, and schemes for encoding their
digits.
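The fixed-width part is trivial bit-twiddling. A sketch (the tag bits
for distinguishing news articles and preprints are left out):

    def encode_nature_article(suffix):
        # "nature04924" -> 4924. Five digits fit in 17 bits
        # (2**17 = 131072 > 99999); since the leading digit has so
        # far always been zero, the last four fit in 14 bits.
        n = int(suffix[len("nature"):])
        assert n < (1 << 14), "leading digit was nonzero after all"
        return n

    def decode_nature_article(n):
        return "nature%05d" % n

    assert decode_nature_article(encode_nature_article("nature04924")) \
        == "nature04924"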
That's likely to involve a huge amount of work, so a simpler approach
would be to use a normal Huffman code, but have a different codebook for
each registrant. It wouldn't take advantage of structure, as in the Nature
example, but it would still take a lot less than eight (or sixteen!) bits
per character.
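The per-registrant codebooks fall straight out of the huffman_codes
function sketched above; the sample suffixes here are invented:

    from collections import Counter

    def codebook_for(suffixes):
        # One character-level Huffman codebook per registrant, built
        # from the characters seen in that registrant's suffixes.
        return huffman_codes(Counter("".join(suffixes)))

    def encode(suffix, book):
        return "".join(book[ch] for ch in suffix)

    nature_book = codebook_for(["nature04924", "nature05206",
                                "news.2009.601"])
    bits = encode("nature04924", nature_book)
    # len(bits) / len("nature04924") comes out around four bits per
    # character - well under the eight (or sixteen) of the raw text.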
tom
--
A playwright is not the best person to talk about his own work for
the simple reason that he is often unaware of what he has written. --
Alan Bennett