Re: Unicode chinese

From: Roedy Green <see_website@mindprod.com.invalid>
Newsgroups: comp.lang.java.programmer
Date: Thu, 30 Aug 2007 03:25:29 GMT
Message-ID: <s2ecd3l4bso0hokqlvumu2v2uml6rmd1d9@4ax.com>

On Wed, 29 Aug 2007 16:22:45 GMT, "Crouchez"
<blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
quoted someone who said :

> b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
> per character?

I suspect your parents punished you for curiosity as a toddler.
EXPERIMENT!

import java.io.UnsupportedEncodingException;
public class Chinese
   {
   /**
    * test harness
    *
    * @param args not used
    */
   public static void main ( String[] args )
      throws UnsupportedEncodingException
   {
      System.out.println( System.getProperty( "file.encoding" ));
      String chinese = "\u4e2d\u5c0f";
      // explicit choice of encoding, UTF-8 supports everything
      // including Chinese.
      byte[] b = chinese.getBytes( "UTF-8" );
      for ( int i=0; i<b.length; i++ )
         {
         System.out.println( Integer.toHexString( 0xff & b[i] ));
         }
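      // prints (the first line is the platform default encoding,
      // which varies by machine; the six bytes that follow do not)
      // Cp1252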
      // e4
      // b8
      // ad
      // e5
      // b0
      // 8f

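      // Why those bytes, three per character?
      // The UTF-8 BOM is ef bb bf, and getBytes does not emit one,
      // so that is not it.
      // see http://mindprod.com/jgloss/utf.html#UTF8ENCODER
      // Code points from 0x800 through 0xffff take 3 bytes each in UTF-8.
      // Both characters fall in that range: 2 chars * 3 bytes = 6.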
   }
   }
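
To see where e4 b8 ad comes from, here is a rough sketch that packs the
bits by hand. The class name Utf8ByHand and the method encode3 are made
up for illustration, and StandardCharsets assumes Java 7 or later. It
only handles the 3-byte range, which is all this example needs.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ByHand
   {
   /**
    * Hand-encode one code point in 0x800..0xffff as three UTF-8 bytes.
    * Hypothetical helper; does not reject surrogates (0xd800..0xdfff).
    *
    * @param cp code point to encode
    * @return the three UTF-8 bytes
    */
   static byte[] encode3 ( int cp )
      {
      if ( cp < 0x800 || cp > 0xffff )
         {
         throw new IllegalArgumentException( "not in the 3-byte range" );
         }
      return new byte[] {
         (byte) ( 0xe0 | ( cp >> 12 ) ),            // 1110xxxx : top 4 bits
         (byte) ( 0x80 | ( ( cp >> 6 ) & 0x3f ) ),  // 10xxxxxx : middle 6 bits
         (byte) ( 0x80 | ( cp & 0x3f ) )            // 10xxxxxx : low 6 bits
      };
      }

   public static void main ( String[] args )
      {
      String chinese = "\u4e2d\u5c0f";
      for ( int i = 0; i < chinese.length(); i++ )
         {
         int cp = chinese.charAt( i );
         byte[] b = encode3( cp );
         System.out.println( String.format( "%04x -> %02x %02x %02x",
                                            cp,
                                            b[ 0 ] & 0xff,
                                            b[ 1 ] & 0xff,
                                            b[ 2 ] & 0xff ) );
         }
      // The byte count depends entirely on the encoding you pick.
      System.out.println( chinese.getBytes( StandardCharsets.UTF_8 ).length );
      System.out.println( chinese.getBytes( StandardCharsets.UTF_16BE ).length );
      // prints
      // 4e2d -> e4 b8 ad
      // 5c0f -> e5 b0 8f
      // 6
      // 4
      }
   }
```

The 2 bytes per character you expected is what UTF-16 gives for these
two characters, not UTF-8.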
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com