Re: Monitoring a javascript-based web page...

From: John Ersatznom <j.ersatz@nowhere.invalid>
Newsgroups: comp.lang.java.programmer
Date: Sun, 17 Dec 2006 23:54:05 -0500
Message-ID: <em56s8$29c$1@aioe.org>

gfr92y@yahoo.com wrote:

> I would like to "automatically" check a web page for messages that is
> written in javascript and that requires me to sign-in with a username
> and password, and email either the messages or a picture of the
> messages to my email address.
>
> In my utter ignorance, I would think some type of macro or "robot"
> might do this for me.
>
> Can anyone point me in the right direction for such a tool?

If it were me:

* First I'd use a tool like Wireshark to sniff the traffic while logging
in manually.
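* Probably I'd find an HTTP GET with the form fields in the query
string, or an HTTP POST with an application/x-www-form-urlencoded body,
or maybe an HTTPS transaction if I were really lucky (in which case the
sniffer shows only ciphertext, and the field names have to come from the
page's form markup instead).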
* Then I'd figure out how to use HttpURLConnection to make the
connection (SSL if necessary) and send the same form submission from
inside a Java method.
* Then I'd write a method to do so, retrieve the result page, save it or
parse it in some way, and (if need be) send whatever HTTP request logs
out again.
* Sending email likewise: I'd send a test mail to myself at another
server (e.g. from my main account to my gmail) while sniffing the
traffic, and duplicate the protocol (this time at a low level). It
probably consists of contacting a mail host at your ISP (better make
this a replaceable string, e.g. with a GUI input form or at least a
resource bundle) on port 25 and sending stuff like HELO yourhostname,
MAIL FROM:<youraddress>, RCPT TO:<theiraddress>, DATA, then the headers,
a blank line, the body, and a line containing only "." to finish, then
QUIT, or whatever they do nowadays.
* Concoct a method to send the mail, stuffing the body with whatever
data. Image encoding would be a PITA but I could probably cobble that
together too if I had to, and have it generate mail with MIME attachments.
* Googling the protocols involved (likely HTTP or HTTPS and SMTP) for
more information would probably also be in the offing.
* There'd need to be error trapping and recovery, too. Silent failure is
not acceptable as a rule.
* And I'd consider carefully how to make the bot play very nice. For
example, it should retrieve the resource and send one mail once a day or
some such, no more often than a human being doing it manually probably
would. This lowers the chance that the bot will be detected by someone
who dislikes people automating their end of something, and also the
chance that the bot will be a genuine problem by causing excessive
load or bandwidth use. Of course, to look like a human doing the task
it has to ignore robots.txt, which is a faux pas, but I wouldn't
consider it a serious one as long as the bot a) never generates more
traffic for the server it hits than a human browsing the site in Firefox
would, and b) isn't ripping content, the way an archiver or search
indexer does, into a publicly visible place (e.g. Google or the
Wayback Machine). For a bot that logs on somewhere once a day and grabs
a single item for your personal use, these conditions are easily met.
(One way to state the informal rule I came up with is: "If the bot
emulates you or a single assistant doing something by hand, it can
pretend to be a human, as it makes no difference to anyone else anyway.
If it does something only massive automation could ever do, it has to
admit it's a bot.")
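
For the login step, here is a rough sketch of replaying a sniffed form
submission with HttpURLConnection. The class name FormLogin, the field
names, and the URL in the usage note are all made up for illustration;
the real ones have to come from whatever the sniffer shows. It assumes
Java 10 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

class FormLogin {

    /** Encodes fields as an application/x-www-form-urlencoded string. */
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** POSTs the form to the given URL and returns the response body as text. */
    static String post(String url, Map<String, String> fields) throws IOException {
        byte[] body = encodeForm(fields).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true); // we are going to send a request body
            conn.setRequestProperty("Content-Type",
                    "application/x-www-form-urlencoded");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                // Fail loudly rather than silently returning nothing.
                throw new IOException("Unexpected HTTP status " + status);
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

Usage would be something like building a LinkedHashMap with the
hypothetical keys "username" and "password" and calling
FormLogin.post("https://example.invalid/login", fields). A real login
usually hands back a session cookie; installing a java.net.CookieManager
with CookieHandler.setDefault before the first request is one way to
carry it over to the request for the messages page.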
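
For the mail step, a bare-bones sketch of the low-level SMTP dialogue
over a plain socket. The class name TinySmtp and all addresses and
host names are placeholders. It assumes a server that still accepts
unauthenticated plain-text SMTP; many mail hosts nowadays insist on
authentication and TLS, which this does not attempt.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class TinySmtp {

    /** Builds what follows DATA: headers, blank line, body, and the final "." line. */
    static String dataPayload(String from, String to, String subject, String body) {
        StringBuilder sb = new StringBuilder();
        sb.append("From: ").append(from).append("\r\n");
        sb.append("To: ").append(to).append("\r\n");
        sb.append("Subject: ").append(subject).append("\r\n");
        sb.append("\r\n");
        for (String line : body.split("\r?\n", -1)) {
            if (line.startsWith(".")) {
                sb.append('.'); // dot-stuffing, so the line is not taken as end of data
            }
            sb.append(line).append("\r\n");
        }
        sb.append(".\r\n");
        return sb.toString();
    }

    /** Runs one HELO / MAIL FROM / RCPT TO / DATA / QUIT session. */
    static void send(String host, int port, String heloName, String from,
                     String to, String subject, String body) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    socket.getInputStream(), StandardCharsets.US_ASCII));
            OutputStream out = socket.getOutputStream();
            expect(in, "220"); // server greeting
            command(out, in, "HELO " + heloName, "250");
            command(out, in, "MAIL FROM:<" + from + ">", "250");
            command(out, in, "RCPT TO:<" + to + ">", "250");
            command(out, in, "DATA", "354");
            out.write(dataPayload(from, to, subject, body)
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            expect(in, "250"); // message accepted
            command(out, in, "QUIT", "221");
        }
    }

    private static void command(OutputStream out, BufferedReader in,
                                String line, String code) throws IOException {
        out.write((line + "\r\n").getBytes(StandardCharsets.US_ASCII));
        out.flush();
        expect(in, code);
    }

    /** Reads one (possibly multi-line) reply and throws unless it carries the expected code. */
    private static void expect(BufferedReader in, String code) throws IOException {
        String reply;
        do {
            reply = in.readLine();
            if (reply == null) {
                throw new IOException("Connection closed, expected " + code);
            }
            if (!reply.startsWith(code)) {
                throw new IOException("Expected " + code + ", got: " + reply);
            }
        } while (reply.length() > 3 && reply.charAt(3) == '-'); // "250-..." continues
    }
}
```

Every unexpected reply turns into an IOException, so a rejected
message cannot fail silently. The host and port should come from
configuration, not from a hard-coded string.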
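
For the once-a-day and no-silent-failure points, one possible shape
uses ScheduledExecutorService. DailyBot and the job passed in are
hypothetical names; the job would be whatever fetches the page and
sends the mail.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.logging.Level;
import java.util.logging.Logger;

class DailyBot {
    private static final Logger LOG = Logger.getLogger(DailyBot.class.getName());

    /**
     * Wraps a job so that a failure is logged loudly. Without this, an
     * exception thrown by the task would quietly suppress all later runs
     * of scheduleAtFixedRate.
     */
    static Runnable guarded(Runnable job) {
        return () -> {
            try {
                job.run();
            } catch (RuntimeException e) {
                LOG.log(Level.SEVERE, "Daily check failed", e);
            }
        };
    }

    /** Runs the job now and then once every 24 hours, no more often. */
    static ScheduledExecutorService startDaily(Runnable job) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(guarded(job), 0, 24, TimeUnit.HOURS);
        return scheduler;
    }
}
```

Logging is only the minimum here; mailing yourself the stack trace
would fit the same wrapper.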