Re: String default encoding: UTF-16 or Platform's default charset?
On 12/10/2010 11:12 AM, cs_professional wrote:
> I understand that Java Strings are Unicode (charset), but how are Java
> Strings stored in memory? As UTF-16 encoding or using the platform's
> default charset?
Strings internally are stored as chars, which are unsigned 16-bit integers
representing UTF-16 code units.
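
A quick way to see this (a minimal sketch; the class name Utf16Demo is
mine):

public class Utf16Demo {
    public static void main(String[] args) {
        // U+00E9 (e with acute) fits in a single UTF-16 code unit:
        String e = "\u00E9";
        System.out.println(e.length());                                // 1

        // U+1F600 lies outside the BMP, so it takes a surrogate pair,
        // i.e. two chars:
        String smiley = new String(Character.toChars(0x1F600));
        System.out.println(smiley.length());                           // 2
        System.out.println(smiley.codePointCount(0, smiley.length())); // 1
    }
}

Note that length() counts code units, not code points, which is why the
supplementary character reports a length of 2.
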
> There seems to be conflicting information about this, the official String
> javadoc says platform's default charset:
> http://download.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[])
> "Constructs a new String by decoding the specified array of bytes
> using the platform's default charset."
That constructor is about conversion, not storage: when converting between
bytes and Strings (the String(byte[]) constructor, or getBytes() with no
argument), Java by default uses the platform default charset.
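
For example (a sketch; the output comments assume a CP-1252 machine):

import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) throws Exception {
        System.out.println(Charset.defaultCharset()); // e.g. windows-1252

        String s = "h\u00E9llo";
        byte[] platformBytes = s.getBytes();          // platform default
        byte[] utf8Bytes = s.getBytes("UTF-8");       // explicit charset
        System.out.println(platformBytes.length);     // 5 under CP-1252
        System.out.println(utf8Bytes.length);         // 6: e-acute is two
                                                      // bytes in UTF-8
    }
}

Passing the charset explicitly makes the byte representation predictable
across platforms.
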
> On my windows machine the above calls return Windows-1252 or CP-1252
> (they are the same thing: http://en.wikipedia.org/wiki/Windows-1252).
> So does this mean all Java Strings are encoded and stored in memory in
> this Windows-1252 or CP-1252 format?
It can't be, since you can store, say, a character like 日 (U+65E5) in a
Java String, and that is not a character in CP-1252. On the other hand, if
your default charset is CP-1252, you can't encode that character to bytes
(you'll get ? instead).
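
You can demonstrate the loss directly (a sketch, using U+65E5 as the
unmappable character):

import java.util.Arrays;

public class LossyEncodingDemo {
    public static void main(String[] args) throws Exception {
        String s = "\u65E5";                        // not mappable to CP-1252
        byte[] bytes = s.getBytes("windows-1252");
        System.out.println(Arrays.toString(bytes)); // [63], i.e. '?'
        // The original character is gone; decoding gives back '?':
        System.out.println(new String(bytes, "windows-1252"));
    }
}

getBytes(String) quietly substitutes the charset's replacement byte for
anything it can't map, so the corruption is silent.
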
> Btw, I'm trying to understand this so I know what to expect in a more
> complex i18n Browser-Servlet scenario.
What you have to be concerned about is the translation between byte
arrays (or any input/output that reads/writes bytes, possibly
autoconverting (!) characters) and character arrays (or Strings or other
containers implementing CharSequence).
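
The safe habit is to pin the charset at every such boundary instead of
relying on the default. A sketch (the file name greeting.txt is just for
illustration):

import java.io.*;

public class ExplicitCharsetIo {
    public static void main(String[] args) throws IOException {
        // Encode chars -> bytes with an explicit charset:
        Writer out = new OutputStreamWriter(
                new FileOutputStream("greeting.txt"), "UTF-8");
        out.write("h\u00E9llo");
        out.close();

        // Decode bytes -> chars with the same charset:
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("greeting.txt"), "UTF-8"));
        System.out.println(in.readLine()); // héllo, on any platform
        in.close();
    }
}

The same goes for your servlet scenario: call
request.setCharacterEncoding(...) and set the charset in the response's
Content-Type before reading or writing, so the container doesn't fall back
to its own default.
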
--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth