Re: How to slurp/get the content of a URI?
Stefan Ram wrote:
Shouldn't I use the document encoding instead of "UTF-8"?
But I will only know this after I have read the response!
(Or, at least part of it.)
So I'm no expert, and I hope I'm not wasting your time by blathering,
but the question is interesting to me so I did a bit of work on it.
Here's what I have so far.
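import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;

static void method4() throws MalformedURLException, IOException {
    String TEST_URL = "http://cnn.com";
    URL url = new URL(TEST_URL);
    URLConnection c = url.openConnection();
    String type = c.getContentType();
    System.out.println("Mime type: " + type );
    if( type == null || type.contains("text") )
    {
        // Note: getContentEncoding() returns the Content-Encoding header
        // (e.g. "gzip"), not the charset. The charset, when present, is
        // a parameter of Content-Type, as in "text/html; charset=UTF-8".
        String enc = null;
        if( type != null )
        {
            for( String part : type.split(";") )
            {
                String p = part.trim();
                if( p.toLowerCase().startsWith("charset=") )
                {
                    enc = p.substring( "charset=".length() )
                           .replace( "\"", "" ).trim();
                }
            }
        }
        System.out.println( "Encoding: " + enc );
        if( enc == null || enc.length() == 0 )
        {
            enc = "ISO-8859-1";
        }
        // Throws UnsupportedEncodingException (an IOException) if
        // Java does not recognize the charset name.
        InputStreamReader inr = new InputStreamReader(
                c.getInputStream(), enc );
        try
        {
            List<CharBuffer> result = new ArrayList<CharBuffer>();
            int charCount = 0; // the Reader yields chars, not bytes
            for( ;; )
            {
                int read;
                CharBuffer cb = CharBuffer.allocate( 4 * 1024 );
                if( ( read = inr.read( cb )) != -1 )
                {
                    charCount += read;
                    cb.flip(); // make the chars just read readable
                    result.add( cb );
                }
                else
                {
                    break;
                }
            }
            System.out.println( "Read: " + charCount );
        }
        finally
        {
            inr.close();
        }
    }
    else // binary
    {
        System.out.println("binary...");
    }
}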
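On the original worry about only knowing the document encoding after reading part of the response: one way around it is to read the bytes first, peek at the head for a declared charset, and decode afterwards. What follows is only a rough sketch under some assumptions. The class name CharsetSniffer, the 1024-byte window and the regular expression are all my own inventions, it only works for ASCII-compatible encodings (so not UTF-16), and the pattern is loose enough that it could be fooled by other text near the top of the document.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Loose match for charset=NAME (HTML meta tag) or encoding="NAME"
    // (XML declaration). Hypothetical pattern, not a full HTML parser.
    private static final Pattern DECLARED = Pattern.compile(
            "(?:charset|encoding)\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared near the start of the document, or the fallback. */
    public static Charset sniff(byte[] head, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so this decode
        // cannot fail; it is used only to search for the ASCII declaration.
        int window = Math.min(head.length, 1024);
        String ascii = new String(head, 0, window, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED.matcher(ascii);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback below.
            }
        }
        return fallback;
    }

    /** Decodes the whole body with the sniffed charset (or the fallback). */
    public static String decode(byte[] body, Charset fallback) {
        return new String(body, sniff(body, fallback));
    }
}
```

A charset given in the Content-Type header should probably still take precedence over anything sniffed this way; the fallback argument is a convenient place to pass it in.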
Some other thoughts:
1. If user input goes into the parameter (query) part of the URL, you
will probably have to run it through URLEncoder first.
2. Don't forget that other protocols besides HTTP exist. The Java API
also handles ftp:, file: and jar: URLs, I believe, so you might get one
of those instead of HTTP. You may wish to check the protocol explicitly
(url.getProtocol()) if you don't set it yourself.
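3. Both the MIME type and the character encoding may be null. HTTP's
default charset for text types is ISO-8859-1. For a missing type there
are also static "guess" methods in URLConnection
(guessContentTypeFromName and guessContentTypeFromStream), though they
guess only the MIME type, not the charset.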
4. If you don't have text, you might have an image. It might be nice to
return an Image in that case. I didn't get that far though.
5. I couldn't find an expandable buffer in java.nio. StringBuilder or
StringWriter seem like a good idea instead. I made my own by stuffing
CharBuffers into a List. The idea is to avoid testing each character
for an end-of-line, which readLine() must do. Hopefully the CharBuffer
is faster, though I haven't measured it.
6. You could also read the data raw (ByteBuffer) and decide what to do
with it later. This might be more in the spirit of a "slurp" operation.
7. I looked for a way to get a channel directly from the URLConnection
and didn't find one. I think this is a defect in the Java API, myself.
Using direct buffers might be a big performance win here. You'd need a
raw socket (a SocketChannel) for that, I guess.
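For points 6 and 7, here is a minimal sketch of a raw "slurp" that reads everything as bytes and leaves decoding for later. It assumes a plain InputStream (such as the one from URLConnection.getInputStream()) and adapts it with Channels.newChannel. The class name RawSlurp is made up for illustration. Note that this only wraps the stream in a channel; it is not a real socket channel, so I would not expect the direct-buffer speedup from it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class RawSlurp {
    /** Reads the stream to its end and returns all bytes, undecoded. */
    public static byte[] slurp(InputStream in) throws IOException {
        // Adapts any InputStream to a channel; closing it closes the stream.
        ReadableByteChannel ch = Channels.newChannel(in);
        ByteBuffer buf = ByteBuffer.allocate(4 * 1024);          // heap buffer
        ByteArrayOutputStream out = new ByteArrayOutputStream(); // grows as needed
        try {
            while (ch.read(buf) != -1) {
                buf.flip(); // switch from filling the buffer to draining it
                out.write(buf.array(),
                          buf.arrayOffset() + buf.position(),
                          buf.remaining());
                buf.clear(); // ready for the next read
            }
        } finally {
            ch.close();
        }
        return out.toByteArray();
    }
}
```

With the method from earlier this would be called as RawSlurp.slurp(c.getInputStream()), and the resulting byte array can then be decoded once the charset is known, or handed to an image decoder if it turns out not to be text.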