Re: Run-length encoding (RLE) of stream segments ...

From:

=?ISO-8859-1?Q?Arne_Vajh=F8j?= <arne@vajhoej.dk>

Newsgroups:

comp.lang.java.programmer

Date:

Sun, 26 Dec 2010 13:22:59 -0500

Message-ID:

<4d1787fe$0$23754$14726298@news.sunsite.dk>

On 26-12-2010 12:23, lbrt chx _ gemale kom wrote:

  RLE of stream segments ...
~
  I know this is not a newsgroup about algorithms, but you may know of a java library or some API doing this.
~
  There are versions of character-based Run length encoding (RLE)
~
  http://en.wikipedia.org/wiki/Run-length_encoding
~
  http://rosettacode.org/wiki/Run-length_encoding
~
  But say instead of basing the encoding on single characters you consider sequences. Let's take as an example the one used and cannibalized on the web:
~
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
~
  RLE is a very simple form of data compression in which runs of data (that is, sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. This is most useful on data that contains many such runs: for example, relatively simple graphic images such as icons, line drawings, and animations. It is not recommended for use with files that don't have many runs as it could potentially double the file size.
  For example, consider a screen containing plain black text on a solid white background. There will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text. Let us take a hypothetical single scan line, with B representing a black pixel and W representing white:
   WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
  If we apply the run-length encoding (RLE) data compression algorithm to the above hypothetical scan line, we get the following:
   12W1B12W3B24W1B14W
  Interpret this as twelve W's, one B, twelve W's, three B's, etc.
  While: WBWBWBWBWBWBWB would be: 1W1B1W1B1W1B1W1B1W1B1W1B1W1B
~
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
~
  The encoding data is actually longer! It could simple be 7[WB]
~
  Also, the longest non overlapping segments should be prioritized over shorter repeated ones:
~
  WBXAWBXWBXQWBPWBQ
~
  Instead of encoding that sequence as (something like ;-)):
~
  {0,4,7,11,14}[WB],{2,6,9}X,{3}A,{10,16}Q,{13}P
~
  It should be encoded like:
~
  {0,4,7}[WBX],{3}A,{10,16}Q,{11,14},WB,{13}P
~
  Any theories on these kinds of encodings? Ideally implemented in Java?

RLE performs so poorly compared to modern compression algorithms, that
I don't think there is much interest in them today.

Why don't you just use regular zip (deflate) algorithm, which is
build into Java.

Arne