Re: Java text compression
Chris wrote:
Eric Sosman wrote:
Chris wrote:
[...]
1. The problem is fully specified. Text compression is a known
problem, and I just asked if anyone knew of a Java library that had
better tradeoffs than UTF-8 + zip.
2. Text means text, not DNA. Written words.
Elsethread you've explained that the compressed stream
gets read back into a companion program and decompressed there;
this suggests that it doesn't need to be exchanged with "foreign"
programs. In which case, I ask again: Does UTF-8 encoding buy
you enough additional compression to justify its expense? How
bad would things be if you just handed 16-bit chars to the
compressor with no "intelligence" whatsoever?
I'd like to try that. Unfortunately, java.util.zip.Deflater accepts only
byte arrays, not char arrays. I suppose it might be faster to copy the
chars to 2-byte sequences and compress, rather than run the UTF-8
compressor. An extra step, but worth a try.
Might it instead be the removal of a step? You've also
mentioned that the data are "streamed from an external source;"
in what form do they arrive? Unless you're using something
like RMI they probably don't arrive as full-fledged String
objects, but are converted to Strings from some more primitive
form -- like, perhaps, streams of bytes?
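For what it's worth, the experiment discussed above (UTF-8 bytes versus "dumb" 2-byte chars, both fed to java.util.zip.Deflater) could be sketched roughly as follows. The class name, the helper names, and the sample text are invented for illustration; only Deflater and String.getBytes come from the JDK. The timing is crude (no warm-up), so treat the numbers as a hint rather than a benchmark.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical sketch: compare Deflater output for UTF-8 bytes vs. raw 16-bit chars.
class CharCompressionSketch {

    // Deflate a byte array at the default level; return the compressed size in bytes.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end(); // release native resources
        return total;
    }

    // Copy each char into two bytes, high byte first, with no encoding logic at all.
    static byte[] rawChars(String s) {
        byte[] out = new byte[s.length() * 2];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out[2 * i] = (byte) (c >>> 8);
            out[2 * i + 1] = (byte) c;
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample input; substitute text that resembles the real data.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            sb.append("Written words, not DNA: line ").append(i).append('\n');
        }
        String text = sb.toString();

        long t0 = System.nanoTime();
        int utf8Size = deflatedSize(text.getBytes(StandardCharsets.UTF_8));
        long t1 = System.nanoTime();
        int rawSize = deflatedSize(rawChars(text));
        long t2 = System.nanoTime();

        System.out.println("UTF-8 + deflate:     " + utf8Size + " bytes, " + (t1 - t0) / 1000 + " us");
        System.out.println("raw chars + deflate: " + rawSize + " bytes, " + (t2 - t1) / 1000 + " us");
    }
}
```

If the data really do arrive as bytes in the first place, a third variant worth timing is handing those original bytes straight to the compressor, skipping the String conversion entirely.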
5. Asking "how much compression I want" is just stupid.
Well, you asked about compression speed. Other things
being equal, faster compressors compress less well and "looser"
compressors compress faster, so the question of "how much" must
eventually arise when you weigh alternatives.
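To put rough numbers on that speed-versus-ratio tradeoff with the compressor already in hand, something like the sketch below could be tried. Deflater's level constants are part of the JDK; the class name, the helper, and the sample text are made up for illustration, and the timings are unwarmed single runs, so they only suggest a trend.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical sketch: the same input deflated at several compression levels.
class DeflateLevelSketch {

    // Deflate the input at the given level; return the compressed size in bytes.
    static int deflatedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end(); // release native resources
        return total;
    }

    public static void main(String[] args) {
        // Assumed sample input; real text will behave differently.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            sb.append("Show me the merchandise, item ").append(i).append('\n');
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);

        int[] levels = {
            Deflater.NO_COMPRESSION,
            Deflater.BEST_SPEED,
            Deflater.DEFAULT_COMPRESSION,
            Deflater.BEST_COMPRESSION
        };
        for (int level : levels) {
            long start = System.nanoTime();
            int size = deflatedSize(data, level);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println("level " + level + ": " + size + " bytes in " + micros + " us");
        }
    }
}
```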
Of course. It just reminded me of walking into a store and having the clerk
ask "how much do you want to pay?" The right answer is, "show me the
merchandise and I'll figure out what the tradeoffs are on my own".
... and to push the analogy perhaps just a little bit too
far, the kindly clerk dumps the 600-page inventory list in your
lap and walks away. That is, a search is more efficient if the
searcher is an active participant instead of a filter-feeder.
--
Eric Sosman
esosman@ieee-dot-org.invalid