On 18/10/10 14:16, Joe Snodgrass wrote:
One of today's most useful and generalized programming applications is
to take the data displayed on someone else's website and reformat it
according to one's own standards, for a new and improved website.
The first thing you need to be aware of is copyright issues. Taking data
from someone else's website and making it available directly, rather
than via accredited and referenced hyperlinks, will almost certainly be
a breach of copyright. Even hot-linking is dubious.
This is how Mark Zuckerberg got his first incarnation of facebook
running, by repurposing the jpegs in Harvard's online student
photobook, and then allowing the other students to type in snide
comments about how bad everybody looked.
News aggregators also do this.
Assuming my computer has already requested a page from the server, what
tool do I use to intercept the content from that page, as it arrives
on my pc?
Normally you wouldn't. Capturing the content being directed to another
application would be fairly complicated. You may be able to do it with
an application such as Wireshark (which is a packet sniffer/traffic
analyser) and get it to save all the traffic in a file for later
analysis and processing.
A better option is to read the web content directly in your application,
by opening the desired URL, then parse the response. You can generally
only do this if the response is HTML in some form. Other responses, such
And what is the name of this general technique? (Not
including "hacking" of course.)
I think a generic term is "web-scraping", although content authors may
use other terms.
It's something I've only done once myself, to generate a vCalendar
calendar from a web page containing a fixture list for a sports league.
In this case all I did was to read from a URL by opening a Reader:
reader = new InputStreamReader(new URL(args[0]).openStream());
then read from it using:
HtmlParserCallback callback = new HtmlParserCallback();
new ParserDelegator().parse(reader, callback, true);
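For anyone wanting to try this, here is a minimal sketch of how the pieces might fit together. It is not my original code: the class names FixtureScraper and CellCollector and the method extractCells are hypothetical, and picking out the text of <td> cells is only an assumed example of what a callback could collect. The one thing that is not an assumption is that ParserDelegator.parse() takes an HTMLEditorKit.ParserCallback, so any callback has to extend that class. The sketch parses a small HTML string so that it runs without a network connection.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class FixtureScraper {

    /** Hypothetical callback: collects the text found inside <td> cells. */
    static class CellCollector extends HTMLEditorKit.ParserCallback {
        final List<String> cells = new ArrayList<>();
        private boolean inCell = false;

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.TD) {
                inCell = true;   // start keeping text
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == HTML.Tag.TD) {
                inCell = false;  // discard text until the next <td>
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (inCell) {
                cells.add(new String(data).trim());
            }
        }
    }

    /** Parses HTML from any Reader and returns the text of each table cell. */
    static List<String> extractCells(Reader reader) throws IOException {
        CellCollector callback = new CellCollector();
        // Third argument true = ignore any charset declared inside the document.
        new ParserDelegator().parse(reader, callback, true);
        return callback.cells;
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample input standing in for a real fixture page.
        String html = "<html><body><table><tr><td>Lions</td><td>Tigers</td></tr>"
                + "</table><p>ignored</p></body></html>";
        System.out.println(extractCells(new StringReader(html)));
    }
}
```

To run it against a live page, pass extractCells() a Reader opened on the URL instead of the StringReader. The parser drives the callback as it reads, so nothing needs to be held in memory beyond what the callback chooses to keep.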
My HtmlParserCallback was a class which extended
html, as strings. As it comes in, the html format would be identical
to what I see when I give my browser the "show source code" command.
tags, then discarding everything until the closing tag. Little by