Re: Java text compression

From:
Robert Klemme <shortcutter@googlemail.com>
Newsgroups:
comp.lang.java.programmer
Date:
Sun, 18 Nov 2007 22:36:32 +0100
Message-ID:
<5qbpj4FqraafU1@mid.individual.net>
On 18.11.2007 21:16, Eric Sosman wrote:

Chris wrote:

What's the fastest way to compress/decompress text?


    If you're really interested in "the fastest way" to the
exclusion of all other concerns, then don't compress at all.
Bingo! Problem solved!

    You might be happier with a compression scheme that did
a little better at reducing the size of the data, but now you
can't get a sensible answer until you describe the trade-offs
you're willing to make. For example, if you were offered a
compression scheme that ran ten percent faster than your current
method but emitted fifteen percent more data, would you take it
or reject it?


Bonus question for the OP: how large are the data sets, and how are they
used? In particular, where are they stored?

We're doing that over really large datasets in our app. We're
currently converting char arrays to byte arrays using our own UTF-8
conversion code, and then compressing the bytes using java.util.zip.
The code is pretty old.

I don't like this two-step process, and the profiler shows that this
is a bottleneck in our app.

Is anyone aware of any code that compresses chars directly? Perhaps a
third-party library that does it faster?


    How badly do you need your own idiosyncratic UTF-8 conversion?
If you can use standard methods, consider wrapping the compressed
streams, somewhat like

    Writer w = new OutputStreamWriter(
        new GZIPOutputStream(...));


Minor detail: the encoding is missing, so in this case it should rather be

new OutputStreamWriter(new GZIPOutputStream(...), "UTF-8")

    w.write("Hello, world!");

    BufferedReader r = new BufferedReader(
        new InputStreamReader(
            new GZIPInputStream(...)));
    String s = r.readLine();

    You'll have to make your own assessment of the speeds and the
degree of compression.
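To make that assessment concrete, the stacked-stream idea sketched above can be fleshed out into a complete round trip. This is a minimal, self-contained version (the class name and sample string are mine); it uses try-with-resources and StandardCharsets, which require Java 7 or later -- on older JDKs, pass the charset name "UTF-8" and close the streams in finally blocks instead:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipTextRoundTrip {

    // Compress a String to gzipped UTF-8 bytes using stacked streams:
    // the Writer handles char-to-byte conversion, GZIP handles compression.
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(bytes), StandardCharsets.UTF_8)) {
            w.write(text);
        }
        return bytes.toByteArray();
    }

    // Decompress gzipped UTF-8 bytes back to a String.
    static String decompress(byte[] data) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(data)),
                StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "Hello, world!";
        String restored = decompress(compress(original));
        System.out.println(restored.equals(original)); // true
    }
}
```

Note that this still performs the same two steps (char conversion, then compression) as the OP's hand-rolled code; the win is that the JDK's charset encoder and the zip stream pipeline are well optimized and there is no intermediate byte array.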


This is the solution I was going to suggest as well. If the custom
encoding yields the same results as Java's built-in UTF-8 encoder, I
would immediately switch to this approach of stacked streams. If it
yields slightly different results, I am sure you can plug the custom
encoding into the standard Java io and nio classes with little effort.

If data is to be stored in a BLOB via JDBC you can even extend this
approach to directly stream into the database.

In our particular situation, decompression speed is a lot more
important than compression speed.


    "You'll have to make your own assessment ..."


Decompression is generally much faster than compression. I believe
there is not much difference in decompression speed between a GZIP
stream compressed at the highest level and one compressed at the
lowest. If you dig a bit into compression theory, it's pretty obvious
why: searching for a small compressed representation is significantly
harder than expanding the compressed data back.
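A quick sketch of that asymmetry (class and variable names are mine): compressing the same input with Deflater.BEST_SPEED and Deflater.BEST_COMPRESSION produces streams of different sizes, but the inflater recovers identical bytes from either one and does essentially the same work in both cases:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class LevelComparison {

    // Compress input at the given deflate level (1 = fastest, 9 = smallest).
    static byte[] deflate(byte[] input, int level) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos =
                new DeflaterOutputStream(out, new Deflater(level))) {
            dos.write(input);
        }
        return out.toByteArray();
    }

    // Decompression has no notion of a "level" -- one inflater reads all.
    static byte[] inflate(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InflaterInputStream iis = new InflaterInputStream(
                new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = iis.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("some repetitive text ").append(i % 7);
        }
        byte[] input = sb.toString().getBytes(StandardCharsets.UTF_8);

        byte[] fast  = deflate(input, Deflater.BEST_SPEED);       // level 1
        byte[] small = deflate(input, Deflater.BEST_COMPRESSION); // level 9

        // Both streams decompress to the identical original bytes.
        System.out.println(Arrays.equals(inflate(fast), input));  // true
        System.out.println(Arrays.equals(inflate(small), input)); // true
    }
}
```

So for the OP's workload, where decompression speed dominates, compressing at a higher level costs write-side time but is essentially free on the read side.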

Kind regards

    robert
