Re: How to slurp/get the content of a URI?
On Sat, 19 Jul 2008, Mark Space wrote:
Mark Space wrote:
Stefan Ram wrote:
ram@zedat.fu-berlin.de (Stefan Ram) writes:
new java.io.InputStreamReader
( httpURLConnection.getInputStream(), "UTF-8" );
A more specific question:
Shouldn't I use the document encoding instead of "UTF-8"?
The default for HTTP is "8859_1" (that's the Java charset name).
There's a special protocol for negotiating a different charset, which
you won't support because your get is too primitive.
The server will either send you 8859-1 if it can, or it'll close the
connection, I think.
My understanding is that the server may, in pretty much any situation,
send whatever charset it likes, as long as it declares it in the
content-type header.
P.S. the openStream() method for URL seems to open the type of connection
you need directly.
BufferedReader bin = null;
URL url = new URL( arg[0] );
bin = new BufferedReader(
    new InputStreamReader( url.openStream() ));
I think. Better check that.
You're absolutely right.
A slightly more correct approach (which might have been expounded
downthread already) would be to use a URLConnection, get the content-type,
parse it to identify a charset, and then use that to configure the
InputStreamReader correctly.
Sadly, and shockingly, there doesn't seem to be anything to parse
content-type headers in the standard library. There is a
javax.mail.internet.ContentType in J2EE, though, and it's not too hard to
write yourself.
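
For what it's worth, here is a rough sketch of that approach, not a definitive implementation. The class name CharsetSniffer and the helpers charsetOf and open are made up for illustration. The parser is hand-rolled and deliberately naive: it splits on ';', so a quoted parameter value that itself contains a semicolon would confuse it. When no usable charset is declared it assumes ISO-8859-1, the HTTP default mentioned upthread. StandardCharsets needs Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetSniffer {

    /**
     * Pulls the charset parameter out of a Content-Type value such as
     * "text/html; charset=UTF-8". Falls back to ISO-8859-1 (an assumption,
     * following the HTTP default discussed in the thread) when the header is
     * missing, has no charset parameter, or names a charset this JVM lacks.
     */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            // parts[0] is the media type; the rest are name=value parameters.
            String[] parts = contentType.split(";");
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                int eq = param.indexOf('=');
                if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                    String value = param.substring(eq + 1).trim();
                    // The value may be quoted: charset="utf-8"
                    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                        value = value.substring(1, value.length() - 1);
                    }
                    try {
                        return Charset.forName(value);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the default.
                        break;
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    /** Opens the URL and wraps its stream in a Reader using the declared charset. */
    static Reader open(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        // getContentType() connects implicitly if the connection is not yet open.
        Charset charset = charsetOf(conn.getContentType());
        return new BufferedReader(new InputStreamReader(conn.getInputStream(), charset));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CharsetSniffer http://example.org/
        try (Reader in = open(new URL(args[0]))) {
            char[] buf = new char[4096];
            StringBuilder text = new StringBuilder();
            for (int n; (n = in.read(buf)) != -1; ) {
                text.append(buf, 0, n);
            }
            System.out.print(text);
        }
    }
}
```

This only looks at the HTTP header. An HTML page can also declare its encoding in a meta element, which this sketch ignores.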
There's also an intriguing getContent() method that sounds like it should
be even closer to what Stefan wanted - it downloads the bytes, then uses
the content-type to convert them into an object. However, it's not
entirely clear exactly what kind of object you're supposed to get, which
makes it more or less useless. In practice, getting HTML text gives you an
InputStream, and getting an image gives you a
java.awt.image.ImageProducer. That's not enormously useful here.
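
If you want to see for yourself what getContent() hands back, a small probe along these lines should do. ContentProbe and contentClassName are names invented for this sketch, and the class reported will depend on the JVM and on the content type the connection reports, so treat the result as an observation rather than a contract.

```java
import java.io.Closeable;
import java.io.IOException;
import java.net.URL;

final class ContentProbe {

    /** Returns the runtime class name of the object getContent() produces for the URL. */
    static String contentClassName(URL url) throws IOException {
        Object content = url.openConnection().getContent();
        if (content == null) {
            return "null";
        }
        try {
            return content.getClass().getName();
        } finally {
            // If we were given a stream, close it so the connection is released.
            if (content instanceof Closeable) {
                ((Closeable) content).close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ContentProbe http://example.org/
        System.out.println(contentClassName(new URL(args[0])));
    }
}
```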
tom
--
Sometimes it takes a madman like Iggy Pop before you can SEE the logic
really working.