Re: How do they do this?

From: Nigel Wade <nmw-news@ion.le.ac.uk>
Newsgroups: comp.lang.java.programmer
Date: Tue, 26 Oct 2010 14:47:36 +0100
Message-ID: <8io4fqFvj8U1@mid.individual.net>

On 26/10/10 01:20, Joe Snodgrass wrote:

On Oct 18, 11:28 am, Nigel Wade <nmw-n...@ion.le.ac.uk> wrote:

On 18/10/10 14:16, Joe Snodgrass wrote:

One of today's most useful and generalized programming applications is
to take the data displayed on someone else's website and reformat it
according to one's own standards, for a new and improved website.

The first thing you need to be aware of is copyright issues. Taking data
from someone else's website and making it available directly, rather
than via accredited and referenced hyperlinks, will almost certainly be
a breach of copyright. Even hot-linking is dubious.

This is how Mark Zuckerberg got his first incarnation of facebook
running, by repurposing the jpegs in Harvard's online student
photobook, and then allowing the other students to type in snide
comments about how bad everybody looked.

News aggregators also do this.

Assuming my computer has already requested a page from the server, what
tool do I use to intercept the content from that page, as it arrives
on my pc?

Normally you wouldn't. Capturing the content being directed to another
application would be fairly complicated. You may be able to do it with
an application such as Wireshark (which is a packet sniffer/traffic
analyser) and get it to save all the traffic in a file for later
analysis and processing.

A better option is to read the web content directly in your application,
by opening the desired URL, then parse the response. You can generally
only do this if the response is HTML in some form. Other responses, such
as JavaScript code, streamed or downloaded content, etc., need other
techniques.

And what is the name of this general technique? (Not
including "hacking" of course.)

I think a generic term is "web-scraping", although content authors may
use other terms.

It's something I've only done once myself, to generate a vCalendar
calendar from a web page containing a fixture list for a sports league.
In this case all I did was to read from a URL by opening a Reader:
java.io.Reader reader;
reader = new InputStreamReader(new URL(args[0]).openStream());

then read from it using:
HtmlParserCallback callback = new HtmlParserCallback();
new ParserDelegator().parse(reader, callback, true);

My HtmlParserCallback was a class which extended
javax.swing.text.html.HTMLEditorKit.ParserCallback
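
For anyone wanting to try the two fragments above as one program, here is
a minimal sketch of how they might fit together. It is not the original
fixture-list code: the class names PageText and TextCallback are made up
for illustration, the callback only collects runs of text, and UTF-8 is
an assumption about the page's encoding.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class PageText {

    /** Hypothetical callback: remembers every run of text the parser reports. */
    public static class TextCallback extends HTMLEditorKit.ParserCallback {
        public final List<String> chunks = new ArrayList<>();

        @Override
        public void handleText(char[] data, int pos) {
            chunks.add(new String(data).trim());
        }
    }

    /** Opens the URL, parses the response as HTML, returns the text runs. */
    public static List<String> readText(URL url) throws IOException {
        TextCallback callback = new TextCallback();
        // Assumption: the page is UTF-8. A real page may declare another charset.
        try (Reader reader =
                 new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
            // Third argument true: ignore any charset directive in the document,
            // since the Reader has already decoded the bytes.
            new ParserDelegator().parse(reader, callback, true);
        }
        return callback.chunks;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageText <url>");
            return;
        }
        System.out.println(readText(URI.create(args[0]).toURL()));
    }
}
```

The try-with-resources block closes the stream even if parsing fails,
which the two-line fragment above leaves to the caller.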

Is this right?

Supposing it's html, I need a string processor, maybe perl, to
intercept the code as it arrives, methodically reading through the raw
html, as strings. As it comes in, the html format would be identical
to what I see when I give my browser the "show source code" command.

My code would have to "dig" its way down to the html that I care
about, skipping everything I don't care about, by finding opening
tags, then discarding everything until the closing tag. Little by
little, it would zero in on the part I want, also discarding non-data
html.

Did I get that right?

No. The ParserDelegator.parse() method handles reading and decoding the
HTML returned from the URL. Whenever it has decoded some element of HTML
it sends it to your code for interpretation, via the callback you
registered with it. Your callback should override certain methods in
HTMLEditorKit.ParserCallback, and the appropriate method will be called
depending on the type of element the parser has detected.

Typically you'd declare your callback to extend
HTMLEditorKit.ParserCallback, and then override whichever methods you
wanted to be able to handle those elements. As the parser detects each
type of HTML element it calls the appropriate callback method in the
HTMLEditorKit.ParserCallback object it was passed. If you override that
method, your code can process the HTML element; if you don't override the
method, the default action takes place (which, AFAIK, is to ignore it).
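
To make the override-what-you-need idea concrete, here is a small sketch
that overrides only handleStartTag and lets every other element fall
through to the inherited methods. The names LinkLister and LinkCallback
are hypothetical, and the HTML is fed from a String so that it runs
without a network connection.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkLister {

    /** Hypothetical callback: only start tags are of interest here. */
    public static class LinkCallback extends HTMLEditorKit.ParserCallback {
        public final List<String> hrefs = new ArrayList<>();

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            // Called for every start tag; pick out <a href="..."> only.
            if (tag == HTML.Tag.A) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    hrefs.add(href.toString());
                }
            }
        }
        // handleText, handleEndTag, etc. are not overridden, so text and
        // end tags reach the inherited methods and are not recorded.
    }

    /** Parses the given HTML and returns the href of each anchor, in order. */
    public static List<String> listLinks(String html) throws IOException {
        LinkCallback callback = new LinkCallback();
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return callback.hrefs;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>See <a href=\"fixtures.html\">fixtures</a>"
                    + " and <a href=\"results.html\">results</a>.</p></body></html>";
        System.out.println(listLinks(html));
    }
}
```

The parser does all the reading and tag matching; the callback never
sees the raw HTML string, only the decoded tag and its attributes.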

There's a simple example of how to use HTMLEditorKit.ParserCallback here:

http://www.java2s.com/Tutorial/Java/0120__Development/UsejavaxswingtexthtmlHTMLEditorKittoparseHTML.htm

Of course, you can write your own parser if you wish, in which case you
would need to do everything you've outlined above.

with callbacks to handle the various bits of HTML I was interested in.

I don't know what a "callback" is. :(

In Java-speak it would be a "listener". It's a method which you register
with some other piece of code. Under certain predefined circumstances
that other piece of code "calls back" to your code via the callback method.
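
Stripped of the HTML parsing, the idea can be sketched in a few lines.
Everything here (CallbackDemo, WordListener, scanWords) is invented for
illustration and is not part of any library: scanWords plays the role of
the "other piece of code", and the listener passed to it is the callback.

```java
import java.util.ArrayList;
import java.util.List;

public class CallbackDemo {

    /** Hypothetical listener: the contract between the two pieces of code. */
    public interface WordListener {
        void wordFound(String word);
    }

    /** The "other piece of code": it decides when to call back. */
    public static void scanWords(String text, WordListener listener) {
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                listener.wordFound(word); // the call back into your code
            }
        }
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        // Register the callback: here it simply stores each word.
        scanWords("call me back", seen::add);
        System.out.println(seen);
    }
}
```

ParserDelegator.parse() follows the same pattern, with the methods of
HTMLEditorKit.ParserCallback in place of wordFound.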

--
Nigel Wade