Re: Query:different coding systems

From:
Thomas Fritsch <i.dont.like.spam@invalid.com>
Newsgroups:
comp.lang.java.programmer
Date:
Tue, 08 May 2007 11:44:34 GMT
Message-ID:
<newscache$am0qhj$cdm$1@news.ops.de>
Jack Dowson wrote:

Hello Everybody:
As we all know, FileReader and FileWriter are both character-stream
classes.

Yes!

When I use FileReader to read a text file which combines letters
and Chinese characters encoded in ANSI's ASCII.

No, you don't. Chinese simply cannot be encoded in ASCII. Maybe your
text file is encoded in UTF-8 (see below).

I know that each letter takes one byte of disk space while every
Chinese character occupies two. When that file has been read, what
prints on the monitor screen corresponds exactly with its content!

There is already a misconception on your side:
(1) It is correct that ASCII requires one byte per character, but only
    because ASCII can encode nothing more than the characters from
    0x0000 to 0x007F (into the bytes 0x00 .. 0x7F).
(2) ASCII simply cannot encode the Chinese chars (0x4E00 .. 0x9FFF).
The key is to understand that there is a difference between *byte*
streams (InputStream, OutputStream) and *char* streams (Reader, Writer).
A byte is in range 0x00..0xFF, a char is in range 0x0000..0xFFFF.
Files are always sequences of bytes, but in your Java code you want to
deal with chars. Therefore Java has to do a translation between byte
streams and char streams, which is called "encoding" or "decoding".

Unfortunately there are many different encoding algorithms. "ASCII" is
just one of them; others are "ISO-8859-1", "UTF-16", "UTF-8" and many
more. Some encodings ("UTF-8", "UTF-16") can encode all 65536 possible
char values into bytes. Others can encode only a subset of chars
(ASCII: only chars from 0x0000 to 0x007F; ISO-8859-1: only chars from
0x0000 to 0x00FF). "UTF-16" always encodes 1 char into 2 bytes.
"UTF-8" encodes 1 char into 1, 2 or 3 bytes (depending on the char).
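You can see these different byte counts directly with String.getBytes.
A small sketch (the sample string is my choice, and StandardCharsets
requires Java 7 or later):

```java
import java.nio.charset.StandardCharsets;

public class EncodingLengths {
    public static void main(String[] args) {
        // Latin 'A' plus the Chinese char U+4E2D
        String s = "A\u4E2D";
        // US-ASCII cannot represent U+4E2D; getBytes substitutes '?'
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(ascii.length); // 2 ('A' plus the substitute '?')
        System.out.println(utf8.length);  // 4 (1 byte for 'A', 3 for U+4E2D)
        System.out.println(utf16.length); // 4 (2 bytes per char)
    }
}
```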

You find more info and more links at
<http://mindprod.com/jgloss/encoding.html>

Now, here is my question: How does the JVM identify a one-byte letter
and a two-byte Chinese character?

*You* tell it which encoding algorithm will be used. FileReader itself
does not let you choose an encoding; to choose one, use
InputStreamReader instead:
  Reader fr = new InputStreamReader(
                  new FileInputStream("text.txt"), "UTF-8");
When you write:
  FileReader fr = new FileReader("text.txt");
that actually means
  Reader fr = new InputStreamReader(
                  new FileInputStream("text.txt"),
                  System.getProperty("file.encoding"));
If you choose the wrong encoding (for example: if you choose "UTF-16",
but your input file is actually encoded with "UTF-8"), then your
program will simply produce wrong output.
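To see what "wrong" looks like, here is a small sketch (the character
and charsets are my choice) that decodes the same bytes once with the
right encoding and once with a wrong one:

```java
import java.nio.charset.StandardCharsets;

public class WrongEncodingDemo {
    public static void main(String[] args) {
        // Encode the Chinese char U+4E2D as UTF-8: the 3 bytes E4 B8 AD
        byte[] utf8Bytes = "\u4E2D".getBytes(StandardCharsets.UTF_8);
        // Decoding those bytes as ISO-8859-1 yields 3 wrong chars
        String wrong = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length()); // 3
        // Decoding them as UTF-8 restores the original single char
        String right = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(right.length()); // 1
    }
}
```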

Here is my program demo:
import java.io.*;

class FileReaderDemo {
  public static void main(String[] args) throws Exception {
    FileReader fr = new FileReader("text.txt");
    int ch;
    int chars = 0;
    while ((ch = fr.read()) != -1) {
      System.out.print((char) ch);
      chars++;
    }
    fr.close();
    System.out.println("\nThere are totally " + chars
        + " characters in this file!");
  }
}

And the text.txt is:
This is a test file!
??????????????????

The outcome is:
This is a test file!
??????????????????
There are totally 31 characters in this file!

No, files always contain *bytes*, not *chars*.
Chars only occur within your Java program.
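The bytes-vs-chars distinction can be made concrete by comparing the
file size on disk with the number of chars read back. A minimal sketch
(the use of a temp file and try-with-resources are my choices, not from
the thread):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class BytesVsChars {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".txt");
        f.deleteOnExit();
        // Write a single Chinese char, encoded as UTF-8
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream(f), StandardCharsets.UTF_8)) {
            w.write("\u4E2D");
        }
        // The file contains 3 bytes ...
        System.out.println("bytes on disk: " + f.length()); // 3
        // ... but reading it back yields 1 char
        try (Reader r = new InputStreamReader(
                new FileInputStream(f), StandardCharsets.UTF_8)) {
            int chars = 0;
            while (r.read() != -1) chars++;
            System.out.println("chars read: " + chars); // 1
        }
    }
}
```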

--
Thomas
