Re: change ISO8859-1 to GB2312
On 24/05/2010 15:04, moonhkt wrote:
Our system is a P630.
No. Suppose there are just two charsets in the file: ISO8859-1/GB2312,
to be converted to UTF-8 or EBCDIC.
This is for comparing the different outputs using the UNIX diff command.
Your task can be broken down into three elements:
1) Read ISO-8859-1 encoded text from database.
2) Convert the incorrectly decoded text back into correct Unicode (UTF-16)
3) Convert UTF-16 to UTF-8 (or EBCDIC)
For the first part, your JDBC driver should provide a way to make sure
the correct encoding conversion is performed so that whatever encoding
the database is using is known to the driver and it can convert text to
the UTF-16 encoding used by Java. See your DBMS documentation.
The second part is tricky. Your database thinks the GB2312 data is
ISO-8859-1 (because you lied to it). Now Java is under the same illusion
and has done the arithmetic that would normally convert from ISO-8859-1
to Unicode/UTF-16. This has probably made an unholy mess of the GB2312
data. You have to reverse this. It's late, I'm tired and I just don't
care enough at the moment to think about how this would be done. (later)
I think I would use java.lang.String's methods to convert to byte[]
using ISO-8859-1 conversion then restore to String form using GB2312
conversion. I'm assuming the GB2312 data pretending to be ISO-8859-1 is
in a separate field in a table and hence in a separate
ResultSet.getString() result. If not ... oh dear.
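A minimal sketch of that round trip, under some assumptions: the class and method names (FixMisdecodedGB2312, repair) are made up for illustration, the sample text is simulated rather than read from a database, and it uses java.nio.charset.StandardCharsets, which needs Java 7 or later. It only works if the bytes really were decoded as ISO-8859-1, which maps every byte to exactly one character and so loses nothing. If the driver used something like windows-1252 instead, some bytes may already be gone.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FixMisdecodedGB2312 {

    /** Undo an ISO-8859-1 decode of bytes that were really GB2312. */
    public static String repair(String misdecoded) {
        // Each char in the mangled String is U+0000..U+00FF, so encoding
        // it as ISO-8859-1 gives back the original bytes unchanged.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Now decode those bytes with the charset they were written in.
        return new String(raw, Charset.forName("GB2312"));
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312");
        String expected = "\u4e2d\u6587";
        // Simulate the damage: GB2312 bytes decoded as ISO-8859-1.
        // In real use this String would come from ResultSet.getString().
        String misdecoded =
                new String(expected.getBytes(gb2312), StandardCharsets.ISO_8859_1);
        String repaired = repair(misdecoded);
        System.out.println(repaired.equals(expected));
    }
}
```

In your program the argument to repair() would be the value returned by ResultSet.getString() for the affected column only. Fields that hold genuine ISO-8859-1 text must not go through it.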
The last part is easy - see below. I just output some GB2312 characters
using EUC-CN encoding into an HTML file because my web browser, Firefox,
understands GB2312 - it's a convenient way to check the correctness of
the conversion. You want UTF-8 or EBCDIC not GB2312 but the principle is
the same.
-------------------------------8<------------------------------
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;

public class TestGB2312 {

    public static void main(String[] args) {
        /*
         * Note: The fun characters are specified as Unicode escapes.
         * We later get Java to convert to GB2312 in EUC_CN encoding.
         */
        String data = "<html><head><meta charset=\"gb2312\"></head><body>"
                + "<p>Character set:GB2312</p>" + "<p>Encoding: EUC_CN</p>"
                + "<p>Roman Numerals: \u2160\u2161\u2162\u2163</p>"
                + "<p>Han (Numerals): \u3220\u3221\u3222\u3223</p>"
                + "</body></html>";
        writeFileAsGB2312("GB2312.html", data);
    }

    private static void writeFileAsGB2312(String fileName, String data) {
        PrintWriter pw;
        try {
            pw = new PrintWriter(fileName, "GB2312");
            pw.println(data);
            pw.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
}
-------------------------------8<------------------------------
Where I've got "GB2312" (the Java charset name) and "gb2312" (the HTML
meta charset) you might want "UTF-8" and "utf-8".
See
<http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc.html>
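For what it's worth, a UTF-8 variant might look like the sketch below. The class and method names (WriteUtf8, writeFileAsUtf8) and the output file name are hypothetical, and it uses java.nio.file.Files and StandardCharsets, so it assumes Java 7 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WriteUtf8 {

    /** Encode the String as UTF-8 and write the bytes to the named file. */
    public static void writeFileAsUtf8(String fileName, String data)
            throws IOException {
        // getBytes() does the UTF-16 to UTF-8 conversion.
        byte[] encoded = data.getBytes(StandardCharsets.UTF_8);
        Files.write(Paths.get(fileName), encoded);
    }

    public static void main(String[] args) throws IOException {
        String data = "<html><head><meta charset=\"utf-8\"></head><body>"
                + "<p>Han: \u4e2d\u6587</p>"
                + "</body></html>";
        writeFileAsUtf8("UTF8.html", data);
    }
}
```

For EBCDIC you would need to pick a specific code page and pass its charset name instead. I can't tell from here which one suits your data. Bear in mind that a single-byte EBCDIC code page such as IBM037 has no Chinese characters, so those would be replaced by a substitution character on output.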
I imagine you knew all the above and were hoping for help with the part
which I numbered 2.
--
RGB