Andrew wrote:

I have an example program below that contains weird Icelandic
characters, and a copyright symbol, just for good measure. The code
expresses these as UTF8. They print exactly as you would want/expect
them to. So far so good. But what I want is to be able to go the other
way. I want to take a Unicode string and recreate the escape sequences
for the funny international characters. For example, the single
character E-acute should be expanded to \u00C9 (6 characters). Any
ideas on how to do this please?

public class UTF8Test {
    public UTF8Test() {
    }

    public String getString() {
        StringBuilder builder = new StringBuilder();
        builder.append("Copyright \u00A9 2009\n");
        builder.append("Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me\n");
        builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig");
        return builder.toString();
    }

    public static void main(String[] args) {
        UTF8Test test = new UTF8Test();
        System.out.println(test.getString());
    }
}

FWIW, the reason I want to do this is I need to write strings like
this to a Sybase table where the column is of type varchar. We cannot
make it univarchar (don't ask). So I need to be able to write Unicode
characters without using Unicode chars! I thought that by having them in
this expanded form Java could convert them just like the program above.

The specific question asked can be solved with something like:

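     public static String encode(String s) {
         StringBuilder sb = new StringBuilder();
         for (int i = 0; i < s.length(); i++) {
             char c = s.charAt(i);
             if (c <= 127) {
                 // plain ASCII is copied through unchanged
                 sb.append(c);
             } else {
                 // anything else becomes backslash, 'u' and four hex digits
                 // (toHexString yields lower-case digits, zero-padded here)
                 String hex = Integer.toHexString(c);
                 sb.append("\\u" + "0000".substring(hex.length(), 4) + hex);
             }
         }
         return sb.toString();
     }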

But decoding it will also take some work, because the unescaping in
your code is done by the compiler at compile time, not at runtime. A
string read back from the database will still contain the literal
six-character sequences.
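
For the decoding side, something along these lines might do. It is only a
sketch: the class and method names (EscapeDecoder, decode, isHex) are made up
for illustration, and it assumes every backslash-u followed by four hex digits
in the stored text is one of our escapes.

```java
// Hypothetical helper, not part of the original post: expands the
// six-character escapes back into single characters at runtime.
class EscapeDecoder {
    static String decode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            // An escape is a backslash, a 'u' and exactly four hex digits.
            if (s.startsWith("\\u", i) && i + 6 <= s.length() && isHex(s, i + 2)) {
                sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                // Anything else, including an incomplete escape, is copied as is.
                sb.append(s.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    // True if the four characters starting at 'from' are all hex digits.
    private static boolean isHex(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslash stops the compiler from unescaping the literal,
        // so decode() really sees the six-character sequences.
        System.out.println(decode("\\u00c9g get eti\\u00f0 gler"));
    }
}
```

One caveat: since the encoder does not escape the backslash itself, text that
already contained a literal backslash-u sequence before encoding cannot be
told apart from an escape afterwards.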

And 1 code point -> 6 bytes is not a very efficient encoding.

Assuming your VARCHAR supports byte values 0-255, you should be able
to store your UTF-8 bytes as ISO-8859-1.

A bit messy, but more efficient space-wise and less code.
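
A rough sketch of that idea, assuming Java 7 or later for StandardCharsets and
assuming the column and driver really do round-trip all 256 values unchanged
(I have not tested that against Sybase). The names Latin1Packing, pack and
unpack are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: stores UTF-8 bytes as a string of ISO-8859-1 characters.
class Latin1Packing {
    // Encode as UTF-8, then reinterpret each byte as one ISO-8859-1 character.
    static String pack(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverse: recover the byte behind each character and decode them as UTF-8.
    static String unpack(String stored) {
        return new String(stored.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00C9g get eti\u00F0 gler";
        String stored = pack(original);
        System.out.println(stored.length());                 // prints 18
        System.out.println(unpack(stored).equals(original)); // prints true
    }
}
```

The 16-character sample takes 18 characters this way, against 26 with the
backslash-u form. The stored value will look like garbage to anything that
reads the column without unpacking it.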

Alternatively you could look at Quoted Printable but that
will also have overhead.
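
To give an idea of that overhead, here is a deliberately simplified
Quoted Printable style encoder over the UTF-8 bytes. It is a sketch only: it
leaves out the line-length and trailing-whitespace rules of the real format,
and the class name QuotedPrintableSketch is made up. For real use a library
implementation would be the safer choice.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified encoder: printable ASCII stays as is, every other
// byte (and the '=' sign itself) becomes '=' followed by two hex digits.
class QuotedPrintableSketch {
    private static final String HEX = "0123456789ABCDEF";

    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 32 && v <= 126 && v != '=') {
                sb.append((char) v);
            } else {
                sb.append('=').append(HEX.charAt(v >> 4)).append(HEX.charAt(v & 0x0F));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("\u00C9g a=b")); // prints =C3=89g a=3Db
    }
}
```

A character that needs two UTF-8 bytes costs six characters here, the same as
the backslash-u form, and one that needs three bytes costs nine.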


