Re: Efficient unicode string implementation was: Re: Why No Supplemental Characters In Character Literals?

From:

Arne Vajhøj <arne@vajhoej.dk>

Newsgroups:

comp.lang.java.programmer

Date:

Fri, 04 Feb 2011 18:41:30 -0500

Message-ID:

<4d4c8ea6$0$23758$14726298@news.sunsite.dk>

On 04-02-2011 18:22, Roedy Green wrote:

On Fri, 4 Feb 2011 21:30:57 +0000, Tom Anderson <twic@urchin.earth.li>
wrote, quoted or indirectly quoted someone who said :

I am, however, at a loss to suggest a practical alternative!

What might happen is that strings are nominally 32-bit.

You could probably come up with a very rapid compression scheme,
similar to UTF-8 but with a bit more compression, that could be
applied to strings at garbage collection time if they have not been
referenced since the last GC sweep.

Strings are immutable. This admits some other flavours of
"compression".

If the high three bytes of every character in the string are 0, store
the string UNCOMPRESSED, as a string of bytes. All the indexOf indexing
arithmetic works identically. This behaviour is hidden inside the
JVM. The String class knows nothing about it. It is an implementation
detail of 32-bit strings.

If the high two bytes of every character in the string are 0, store
the string uncompressed as a string of unsigned shorts.

If there are any one bits in the high two bytes of any character,
store the string as a string of unsigned ints.
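
The three storage tiers described above could be sketched roughly as
follows. This is only an illustration in ordinary Java: the proposal is
for the JVM to do this invisibly, and the class name CompactString and
its methods are made up for the example. It picks the narrowest
fixed-width array that can hold every code point, so indexing stays a
plain array lookup in all three cases.

```java
// Hypothetical sketch, not a real JVM or JDK class.
final class CompactString {
    private final Object storage; // byte[], short[] or int[], chosen once
    private final int length;     // length in code points

    CompactString(String s) {
        int[] cps = s.codePoints().toArray();
        length = cps.length;

        // Find the largest code point to decide how wide each slot must be.
        int max = 0;
        for (int cp : cps) {
            max = Math.max(max, cp);
        }

        if (max <= 0xFF) {
            // High three bytes of every character are 0: one byte each.
            byte[] b = new byte[length];
            for (int i = 0; i < length; i++) {
                b[i] = (byte) cps[i];
            }
            storage = b;
        } else if (max <= 0xFFFF) {
            // High two bytes of every character are 0: one short each.
            short[] h = new short[length];
            for (int i = 0; i < length; i++) {
                h[i] = (short) cps[i];
            }
            storage = h;
        } else {
            // At least one character needs more than 16 bits: one int each.
            storage = cps;
        }
    }

    int length() {
        return length;
    }

    int bytesPerChar() {
        if (storage instanceof byte[]) return 1;
        if (storage instanceof short[]) return 2;
        return 4;
    }

    // Constant-time lookup; the mask undoes Java's sign extension.
    int codePointAt(int index) {
        if (storage instanceof byte[]) return ((byte[]) storage)[index] & 0xFF;
        if (storage instanceof short[]) return ((short[]) storage)[index] & 0xFFFF;
        return ((int[]) storage)[index];
    }

    // Same index arithmetic regardless of the storage width.
    int indexOf(int codePoint) {
        for (int i = 0; i < length; i++) {
            if (codePointAt(i) == codePoint) return i;
        }
        return -1;
    }
}
```

Because the width is fixed per string, the index returned by indexOf
means the same thing whichever array type was chosen.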

Strings are what you gobble up your RAM with. If we start supporting
32-bit chars, we have to do something to compensate for the doubling
of RAM use.

Short lived strings would still be 32-bit. They would only be
converted to the other forms if they have been sitting around for a
while. Interned strings would be immediately converted to canonical
form.

indexOf works fine with compression, but substring and charAt become
rather expensive.
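
A rough sketch of why, assuming the compressed form is a variable-width
encoding such as UTF-8 (the thread does not pin down the exact scheme,
and the class Utf8Index and its methods are invented for illustration).
To find the n-th character you have to walk every sequence before it,
so charAt, and the start offset of a substring, cost time proportional
to the index instead of being a single array access. The code assumes
well-formed UTF-8 input and does no validation.

```java
// Hypothetical sketch, assuming valid UTF-8 input; not a JDK class.
final class Utf8Index {

    // Number of bytes in the sequence that starts with this lead byte.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1; // 0xxxxxxx
        if (b < 0xE0) return 2; // 110xxxxx
        if (b < 0xF0) return 3; // 1110xxxx
        return 4;               // 11110xxx
    }

    // Linear scan: every earlier character must be skipped one by one.
    static int byteOffsetOf(byte[] utf8, int index) {
        int offset = 0;
        for (int i = 0; i < index; i++) {
            offset += sequenceLength(utf8[offset]);
        }
        return offset;
    }

    // The charAt equivalent: locate the sequence, then decode it.
    static int codePointAt(byte[] utf8, int index) {
        int offset = byteOffsetOf(utf8, index);
        int len = sequenceLength(utf8[offset]);
        int lead = utf8[offset] & 0xFF;
        if (len == 1) {
            return lead;
        }
        // Keep only the payload bits of the lead byte.
        int cp = lead & (0xFF >> (len + 1));
        for (int i = 1; i < len; i++) {
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (utf8[offset + i] & 0x3F);
        }
        return cp;
    }
}
```

A substring would need byteOffsetOf for both its start and its end
index, which is where the extra cost comes from.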

Arne