Re: Convert encodings

From: Steven Simpson <ss@domain.invalid>
Newsgroups: comp.lang.java.help
Date: Wed, 03 Feb 2010 22:16:40 +0000
Message-ID: <879p37-044.ln1@news.simpsonst.f2s.com>

On 03/02/10 20:38, The87Boy wrote:

> I am downloading a webpage using the HttpURLConnection, where I get
> the InputStream, but are there any ways I can convert the
> webpage's charset to the client's charset
> I am getting the webpage's charset by using this:
>
> // Get the charset
> String charset = conn.getContentEncoding();

It seems to be a common misunderstanding that this specifies the charset.

<http://java.sun.com/javase/6/docs/api/java/net/URLConnection.html#getContentEncoding%28%29>:

Returns the value of the |content-encoding| header field.

...which is this:

<http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.11>

It's used mainly for end-to-end compression (e.g. gzip). It's something
you might have to deal with, but you'll often find it's not used.
(Plus, if your HTTP request doesn't say you accept it, the server might
decompress the content for you, or simply refuse to send it.)
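
If you do have to deal with it, a minimal sketch might look like the
following. ContentDecoding and decode are names I've made up for
illustration, and only gzip is handled; anything else is rejected rather
than guessed at.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of the JDK.
final class ContentDecoding {
    // Wraps the raw body stream according to the Content-Encoding value.
    // Assumption: only "gzip"/"x-gzip" and the absence of an encoding
    // are supported in this sketch.
    static InputStream decode(InputStream raw, String contentEncoding)
            throws IOException {
        if (contentEncoding == null
                || contentEncoding.equalsIgnoreCase("identity")) {
            return raw;
        }
        if (contentEncoding.equalsIgnoreCase("gzip")
                || contentEncoding.equalsIgnoreCase("x-gzip")) {
            return new GZIPInputStream(raw);
        }
        throw new IOException("Unsupported Content-Encoding: "
                              + contentEncoding);
    }

    public static void main(String[] args) throws IOException {
        // Compress a small body in memory to stand in for a response.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(compressed);
        gz.write("hello".getBytes("UTF-8"));
        gz.close();

        InputStream in = decode(
            new ByteArrayInputStream(compressed.toByteArray()), "gzip");
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        for (int n; (n = in.read(buf)) != -1; ) {
            plain.write(buf, 0, n);
        }
        System.out.println(plain.toString("UTF-8"));
    }
}
```

With a real connection, you would pass conn.getInputStream() and
conn.getContentEncoding() to decode(), having first announced support
with conn.setRequestProperty("Accept-Encoding", "gzip").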

What you want is the Content-Type header field:

<http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.17>

...as returned by getContentType():

<http://java.sun.com/javase/6/docs/api/java/net/URLConnection.html#getContentType%28%29>

...but this returns strings such as:

text/html; charset=ISO-8859-4

...which you then have to parse. The grammar is defined by:

<http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7>
<http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6>

...which says that the format is (informally):

token/token ; token = token ; token = quoted-string ; etc...

I don't think there's anything in the JDK that readily parses that, so
I'd write something like this:

// Writes parameters into props. Returns media-type.
public static String parseField(String line,
                                Map<? super String, ? super String> props);

...and do a simple implementation first, just looking for ';' and '='.
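
A first cut along those lines could be something like this. It is only a
sketch: the class name FieldParser is made up, parameter names are
lower-cased for convenience, and a ';' or '=' inside a quoted-string (or
a backslash escape) will confuse it.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical holder class for the parseField method.
final class FieldParser {
    // Writes parameters into props. Returns media-type.
    // Simplification: splits blindly on ';' and '=', so quoted-strings
    // containing those characters are not handled correctly.
    public static String parseField(String line,
                                    Map<? super String, ? super String> props) {
        // Limit -1 keeps trailing empty parts, so parts[0] always exists.
        String[] parts = line.split(";", -1);
        String mediaType = parts[0].trim();
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i];
            int eq = part.indexOf('=');
            if (eq < 0) {
                continue; // not a name=value pair; skip it
            }
            String name = part.substring(0, eq).trim()
                              .toLowerCase(Locale.ROOT);
            String value = part.substring(eq + 1).trim();
            // Strip one pair of surrounding double quotes, if present.
            if (value.length() >= 2
                    && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            props.put(name, value);
        }
        return mediaType;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<String, String>();
        String type = parseField("text/html; charset=ISO-8859-4", props);
        System.out.println(type + " / " + props.get("charset"));
    }
}
```

Because the map parameter is Map<? super String, ? super String>, a
java.util.Properties object can be passed in as well as a
Map<String, String>.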

If no charset is specified, a default might be implied by the media-type
part, or you might have to assume a default, or you might be able to
detect the charset early (as with XML). If it's HTML, you have the
added joy of assuming it is (say) US-ASCII or ISO-8859-1, parsing the
HTML as long as that charset seems okay or until you bump into the likes of:

<meta http-equiv="Content-type" content="text/html; charset=utf-8">

...parsing the 'content' attribute as above, and cleverly switching to
the new charset.
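
To illustrate the idea only, here is a crude sketch that provisionally
decodes the first chunk of the page as ISO-8859-1 and looks for a
charset inside a <meta> tag. MetaCharsetSniffer and sniff are invented
names, and a regular expression is standing in for a real HTML parser,
so it will be fooled by comments, scripts and the like.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK.
final class MetaCharsetSniffer {
    // Crude match for charset=... somewhere inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the charset name declared by a <meta> tag within the first
    // 'length' bytes of 'prefix', or null if none is found.
    // Assumption: the markup itself is ASCII-compatible, so decoding it
    // as ISO-8859-1 is good enough to find the tag.
    static String sniff(byte[] prefix, int length) {
        String text = new String(prefix, 0, length,
                                 Charset.forName("ISO-8859-1"));
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"Content-type\" "
            + "content=\"text/html; charset=utf-8\"></head></html>")
            .getBytes(Charset.forName("US-ASCII"));
        System.out.println(sniff(page, page.length));
    }
}
```

The caller would read the prefix from a marked BufferedInputStream,
reset it, and then decode the whole stream with whatever sniff()
returned (or with the guessed default if it returned null).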

I have managed to get some JDK classes to do the HTML-charset guessing.
The parser can be told to throw an exception when the <meta> is
encountered, upon which the code resets the undecoded stream, uses the
specified charset instead of the guessed one, and tries again. Yeuch.

The following is from a class that extends HTMLEditorKit.ParserCallback:

    BufferedInputStream rawStream =
        new BufferedInputStream(rawBytes, 8192);
    rawStream.mark(8192);

    // Make sure the buffered stream can't be closed by
    // anything wrapped around it. The ParserDelegator
    // below may try to do that.
    InputStream unclosableBuffer = new Uncloser(rawStream);

    // Decode the bytes as characters using the derived or
    // guessed charset.
    Reader in = new InputStreamReader(unclosableBuffer, charset);

    ParserDelegator deleg = new ParserDelegator();
    try {
        deleg.parse(in, this, httpCharset);
    } catch (ChangedCharSetException ex) {
        // The <meta> with the charset was found.

        // Determine the new charset.
        props.clear();
        Utils.parseField(ex.getCharSetSpec(), props);
        charset = props.getProperty("charset", charset);
        System.err.println(" reset with " + charset + ")");

        // Put the stream back and start decoding with the new charset.
        rawStream.reset();
        in = new InputStreamReader(unclosableBuffer, charset);

        // Reset extracted parameters.
        title = null;
        content = new StringBuilder();
        robotsIndex = robotsFollow = true;
        inTitle = inStyle = inHead = inScript = false;

        // Try again with the new charset.
        deleg.parse(in, this, true);
    }
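
Uncloser is not a JDK class, and its source isn't shown here. Something
like the following would do the job described in the comments; treat it
as a guess at the shape rather than the actual class.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of a wrapper whose close() does nothing, so the wrapped
// stream stays open and can still be reset and re-read.
final class Uncloser extends FilterInputStream {
    Uncloser(InputStream in) {
        super(in);
    }

    @Override
    public void close() {
        // Deliberately ignored: the owner of the wrapped stream
        // remains responsible for closing it.
    }

    public static void main(String[] args) throws IOException {
        BufferedInputStream raw = new BufferedInputStream(
            new ByteArrayInputStream(new byte[] { 1, 2, 3 }), 8192);
        raw.mark(8192);

        InputStream wrapped = new Uncloser(raw);
        wrapped.read();
        wrapped.close(); // has no effect on raw

        raw.reset();     // still works because raw was never closed
        System.out.println(raw.read());
    }
}
```

Without the wrapper, closing the Reader would close the
BufferedInputStream underneath it, and the later reset() would fail.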

<http://java.sun.com/javase/6/docs/api/javax/swing/text/html/HTMLEditorKit.ParserCallback.html>
<http://java.sun.com/javase/6/docs/api/javax/swing/text/html/parser/ParserDelegator.html>

As I recall, there were a number of gaps in the documentation around
there, so those particular classes didn't seem well loved. Maybe
someone else can recommend something more straightforward.

--
ss at comp dot lancs dot ac dot uk