Re: ascii char 26

From: bob <bob@coolgroups.com>
Newsgroups: comp.lang.java.programmer
Date: Sun, 11 Sep 2011 19:12:28 -0700 (PDT)
Message-ID: <63554bdb-dab4-43e7-b809-5128fd831f3c@m38g2000vbn.googlegroups.com>

You're right. I messed up, and it was the em dash. It turned into byte 26
after going through 'b = html.getBytes("US-ASCII");'
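For anyone chasing the same thing, here is a rough sketch of how the offending character can be tracked down. The class and method names (NonAsciiReport, describeNonAscii) are made up for illustration. It lists every char above 127 with its index and code point, then prints the bytes that getBytes("US-ASCII") produces. The Javadoc for getBytes(String) leaves the result for unencodable characters unspecified, so the substitution byte can differ between platforms and no particular value is assumed here.

```java
import java.io.UnsupportedEncodingException;

class NonAsciiReport {
    /** Lists every char above 127 in the input as "index:U+XXXX", space separated. */
    static String describeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 127) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(i).append(":U+").append(String.format("%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String sample = "a\u2014b"; // 'a', em dash, 'b'
        System.out.println(describeNonAscii(sample));

        // Print the encoded bytes; the middle one is whatever substitution
        // byte this platform's US-ASCII encoder uses.
        byte[] b = sample.getBytes("US-ASCII");
        for (byte x : b) {
            System.out.print(x + " ");
        }
        System.out.println();
    }
}
```

Running the report on the real input shows exactly which code points need a replacement rule before encoding.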

Here's the new code:

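    // Needs: import java.io.UnsupportedEncodingException;
    public static String convertToAscii(String html) {
        // String.replace does literal replacement; no regex is needed here.
        html = html.replace("\u2019", "'");   // right single quote
        html = html.replace("\u201D", "\"");  // right double quote
        html = html.replace("\u201C", "\"");  // left double quote
        html = html.replace("\u2014", "-");   // em dash

        byte[] b = null;
        try {
            // Note: b is never used below; the method returns the String,
            // so any other non-ASCII characters are still present in it.
            b = html.getBytes("US-ASCII");
        } catch (UnsupportedEncodingException e) {
            // Not expected: US-ASCII is a standard charset.
            e.printStackTrace();
        }
        return html;
    }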

Also, I'm on Android 2.1, so import java.text.Normalizer; doesn't
work.
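Without Normalizer, one fallback is to walk the string and map characters by hand. This is only a sketch: the class and method names (AsciiFolder, foldToAscii) are hypothetical, the mapping table is a small illustrative set rather than a complete one, and turning everything else above 127 into '?' is my own assumption about what is acceptable. It uses nothing beyond String and StringBuilder, so it should not depend on newer Android APIs.

```java
class AsciiFolder {
    /**
     * Maps a few common typographic characters to ASCII equivalents.
     * Any other char above 127 becomes '?' (an assumed fallback), so a
     * character outside the BMP, stored as two chars, becomes "??".
     */
    static String foldToAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u2018': // left single quote
                case '\u2019': // right single quote
                    out.append('\'');
                    break;
                case '\u201C': // left double quote
                case '\u201D': // right double quote
                    out.append('"');
                    break;
                case '\u2013': // en dash
                case '\u2014': // em dash
                case '\u2212': // minus sign
                    out.append('-');
                    break;
                case '\u2026': // horizontal ellipsis
                    out.append("...");
                    break;
                case '\u00A0': // no-break space
                    out.append(' ');
                    break;
                default:
                    out.append(c <= 127 ? c : '?');
            }
        }
        return out.toString();
    }
}
```

Because every output char is at most 127, a later getBytes("US-ASCII") on the result has nothing left to substitute.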

On Sep 11, 4:52 pm, Joshua Cranmer <Pidgeo...@verizon.invalid> wrote:
> On 9/11/2011 4:33 PM, bob wrote:
> > Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?
>
> The US-ASCII encoder only properly encodes characters in the range of
> 0-127, i.e., the characters that are present in ASCII. Any other
> character is replaced with some sort of substitution character; in this
> case, it looks like the charset has chosen to use ^Z as the "I don't
> know what this character is" character (I would have guessed '?'
> instead, but I suppose they decided to go with the less-commonly used
> variant).
>
> My guess is your input is using one of the characters like the minus
> sign, em dash, or perhaps an en dash instead (there may be others),
> which are visually close in appearance to a hyphen but do not share the
> same Unicode codepoint.
>
> --
> Beware of bugs in the above code; I have only proved it correct, not
> tried it. -- Donald E. Knuth