On Sun, 18 Nov 2007 18:30:19 -0500, Eric Sosman
<esosman@ieee-dot-org.invalid> wrote:
Chris wrote:
Bonus question for OP: what is the size of the data sets, and how are
they used? Especially, where are they stored?
Multi-terabyte in size, split across multiple machines. On a single
machine, generally not more than a few hundred GB. One or two disks per
machine, SATA, no RAID.
This sounds like it might be DNA sequences.
[...]
Most LZ-based implementations (including DEFLATE) limit codes to
16 bits (I've heard of 32-bit LZ, but I've never seen it).
Compression studies have shown that, on average, a 16-bit code
dictionary fills up after processing about 200KB of input.
If the remainder of the input is characteristically similar to the
part already encoded, the full dictionary will compress the rest of
the input pretty well. But most input varies as it goes, sometimes
rapidly and drastically, so it does make sense to segment the input to
take advantage of the variation.
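For example, something along these lines with java.util.zip (a rough,
untested sketch; the class name, block size, and buffer sizing are just
placeholders):

import java.util.Arrays;
import java.util.zip.Deflater;

public class BlockDeflate {
    static final int BLOCK_SIZE = 256 * 1024;   // 256KB, see below

    // Compress one block independently; reset() discards the previous
    // block's statistics so the model re-adapts to the new data.
    static byte[] compressBlock(Deflater def, byte[] in, int off, int len) {
        def.reset();
        def.setInput(in, off, len);
        def.finish();
        byte[] out = new byte[len + 64];
        int n = 0;
        while (!def.finished()) {
            n += def.deflate(out, n, out.length - n);
            if (n == out.length)                 // grow if incompressible
                out = Arrays.copyOf(out, out.length * 2);
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        byte[] data = new byte[1000 * 1000];     // stand-in for real input
        Deflater def = new Deflater(Deflater.BEST_SPEED);
        for (int off = 0; off < data.length; off += BLOCK_SIZE) {
            int len = Math.min(BLOCK_SIZE, data.length - off);
            byte[] block = compressBlock(def, data, off, len);
            System.out.println(off + " -> " + block.length + " bytes");
        }
        def.end();
    }
}

A side benefit is that each block is a complete stream and is
independently decompressible, which is handy when the data is spread
across machines anyway.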
[...]
If we are talking about DNA sequences, then I would probably go for
256KB: once the base nucleotides and amino-acid sequences are in the
dictionary (and you can guarantee this by preloading them),
compression is typically very good (80+%), so it makes sense not to
worry about it and just pick a conveniently sized buffer to work with.
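java.util.zip supports that preloading directly via
Deflater.setDictionary() / Inflater.setDictionary(). A minimal sketch
(the dictionary bytes below are made up; you'd preload the actual
nucleotide and amino-acid strings you expect to see):

import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictDemo {
    // Illustrative preset dictionary only; use your real common strings.
    static final byte[] DICT = "ACGTACGTACGTACGTGCTAGCTAGCTA".getBytes();

    public static void main(String[] args) throws Exception {
        byte[] input = "ACGTACGTGCTAACGT".getBytes();

        Deflater def = new Deflater();
        def.setDictionary(DICT);        // must precede the first deflate()
        def.setInput(input);
        def.finish();
        byte[] packed = new byte[256];
        int plen = def.deflate(packed);
        def.end();

        Inflater inf = new Inflater();
        inf.setInput(packed, 0, plen);
        byte[] unpacked = new byte[input.length];
        int n = inf.inflate(unpacked);  // returns 0: dictionary needed
        if (n == 0 && inf.needsDictionary()) {
            inf.setDictionary(DICT);    // same dictionary on this side
            n = inf.inflate(unpacked);
        }
        inf.end();
        System.out.println(new String(unpacked, 0, n));
    }
}

Note that the inflating side has to supply the same dictionary, which
it discovers when inflate() first returns zero and needsDictionary()
reports true.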
As for the UTF-8 conversion, I wouldn't bother: DEFLATE will quickly
learn that every other byte is a zero, and will compress them very well.
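Easy enough to verify on a sample of your own data; a quick untested
sketch (the repeated sample string is made up) that compares the
deflated sizes of the UTF-16 and UTF-8 encodings of the same text:

import java.util.Arrays;
import java.util.zip.Deflater;

public class EncodingTest {
    static int deflatedSize(byte[] in) {
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
        def.setInput(in);
        def.finish();
        byte[] out = new byte[in.length + 64];
        int n = 0;
        while (!def.finished()) {
            n += def.deflate(out, n, out.length - n);
            if (n == out.length)
                out = Arrays.copyOf(out, out.length * 2);
        }
        def.end();
        return n;
    }

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++)
            sb.append("ACGTACGTGCTA");          // sample text
        String s = sb.toString();
        System.out.println("UTF-8:  " + deflatedSize(s.getBytes("UTF-8")));
        System.out.println("UTF-16: " + deflatedSize(s.getBytes("UTF-16BE")));
    }
}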