Re: String default encoding: UTF-16 or Platform's default charset?
On 12/10/2010 11:12 AM, cs_professional wrote:
> I understand that Java Strings are Unicode (charset), but how are Java
> Strings stored in memory? As UTF-16 encoding or using the platform's
> default charset?
Strings internally are stored as chars, which are unsigned 16-bit integers
representing UTF-16 code units.
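
A quick way to see this (a minimal sketch; the class name Utf16Demo is
mine):

public class Utf16Demo {
    public static void main(String[] args) {
        // U+00E9 (e with acute) fits in a single UTF-16 code unit:
        String e = "\u00E9";
        System.out.println(e.length());                                // 1

        // U+1F600 lies outside the BMP, so it takes a surrogate pair,
        // i.e. two chars:
        String smiley = new String(Character.toChars(0x1F600));
        System.out.println(smiley.length());                           // 2
        System.out.println(smiley.codePointCount(0, smiley.length())); // 1
    }
}

Note that length() counts code units, not code points, which is why the
supplementary character reports a length of 2.
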
> There seems to be conflicting information about this, the official String
> javadoc says platform's default charset:
> http://download.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[])
> "Constructs a new String by decoding the specified array of bytes
> using the platform's default charset."
That constructor is about conversion, not storage: when converting between
bytes and Strings (the String(byte[]) constructor, or getBytes() with no
argument), Java by default uses the platform default charset.
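
For example (a sketch; the output comments assume a CP-1252 machine):

import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) throws Exception {
        System.out.println(Charset.defaultCharset()); // e.g. windows-1252

        String s = "h\u00E9llo";
        byte[] platformBytes = s.getBytes();          // platform default
        byte[] utf8Bytes = s.getBytes("UTF-8");       // explicit charset
        System.out.println(platformBytes.length);     // 5 under CP-1252
        System.out.println(utf8Bytes.length);         // 6: e-acute is two
                                                      // bytes in UTF-8
    }
}

Passing the charset explicitly makes the byte representation predictable
across platforms.
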
> On my windows machine the above calls return Windows-1252 or CP-1252
> (they are the same thing: http://en.wikipedia.org/wiki/Windows-1252).
> So does this mean all Java Strings are encoded and stored in memory in
> this Windows-1252 or CP-1252 format?
It can't be, since you can store, say, a character like 日 (U+65E5) in a
Java String, and that is not a character in CP-1252. On the other hand, if
your default charset is CP-1252, you can't encode that character to bytes
(you'll get ? instead).
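
You can demonstrate the loss directly (a sketch, using U+65E5 as the
unmappable character):

import java.util.Arrays;

public class LossyEncodingDemo {
    public static void main(String[] args) throws Exception {
        String s = "\u65E5";                        // not mappable to CP-1252
        byte[] bytes = s.getBytes("windows-1252");
        System.out.println(Arrays.toString(bytes)); // [63], i.e. '?'
        // The original character is gone; decoding gives back '?':
        System.out.println(new String(bytes, "windows-1252"));
    }
}

getBytes(String) quietly substitutes the charset's replacement byte for
anything it can't map, so the corruption is silent.
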
> Btw, I'm trying to understand this so I know what to expect in a more
> complex i18n Browser-Servlet scenario.
What you have to be concerned about is the translation between byte
arrays (or any input/output that reads/writes bytes, possibly
autoconverting (!) characters) and character arrays (or Strings or other
containers implementing CharSequence).
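
The safe habit is to pin the charset at every such boundary instead of
relying on the default. A sketch (the file name greeting.txt is just for
illustration):

import java.io.*;

public class ExplicitCharsetIo {
    public static void main(String[] args) throws IOException {
        // Encode chars -> bytes with an explicit charset:
        Writer out = new OutputStreamWriter(
                new FileOutputStream("greeting.txt"), "UTF-8");
        out.write("h\u00E9llo");
        out.close();

        // Decode bytes -> chars with the same charset:
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("greeting.txt"), "UTF-8"));
        System.out.println(in.readLine()); // héllo, on any platform
        in.close();
    }
}

The same goes for your servlet scenario: call
request.setCharacterEncoding(...) and set the charset in the response's
Content-Type before reading or writing, so the container doesn't fall back
to its own default.
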
--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth