Wonky HTTP behavior?

From: "Twisted" <twisted0n3@gmail.com>
Newsgroups: comp.lang.java.programmer
Date: 13 Nov 2006 06:46:45 -0800
Message-ID: <1163429205.536857.64140@b28g2000cwb.googlegroups.com>

I'm fiddling with possibly making a custom web browser with a few
extras (auto-retrying broken downloads of pages, images, and files;
one-click ad blocking; one-click referrer spoofing; referrer spoofing
by default when retrieving non-text/foo content types; user-agent
spoofing; etc.) and a few unwanted things ditched (such as Flash,
ActiveX, VBScript, and some JavaScript capabilities). Other notions
include not letting a slow-to-run script interrupt page loading
(typical with ad banner code) and giving offsite include loading a
really short-fuse timeout (again because of synchronous ad loads that
are too slow).

So far, the bare-bones Mosaic-alike (no JavaScript at all, basic normal
behavior with text, HTML, and images only) is sort-of working. The
rendering engine needs a load of work, but then, it would; that's
something I think I can cope with. OTOH, the backend is doing something
strange, as I discovered after a string of 400 Bad Request error page
retrievals.

Wireshark (fork of Ethereal) reveals these dumps of the HTTP headers
from a bog-standard GET request from clicking the first interesting
link at http://www.movingtofreedom.org (the one that displays the rest
of today's blog entry there):

Using Firefox:

Request Method: GET
Request URI: /2006/11/12/a-round-of-gnu-linux-heading-in-to-the-back-nine-part-2/
Request Version: HTTP/1.1
Host: www.movingtofreedom.org\r\n
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.8) Gecko/20061025 Firefox/1.5.0.8\r\n
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\n
Accept-Language: en-us,en;q=0.5\r\n
Accept-Encoding: gzip,deflate\r\n
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n
Keep-Alive: 300\r\n
Connection: keep-alive\r\n
Referer: http://www.movingtofreedom.org/\r\n
Cookie:
comment_author_email_bbd0d17eb26d1ffea56c7ae59736eeff=nobody%40nowhere.net;
__utmz=40810430.1158107358.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none);
__utma=40810430.2073169434.1158107358.1163248340.1163417340.39;
comment_autho

Using the results of tonight's hackathon:

Request Method: GET
Request URI: /2006/11/12/a-round-of-gnu-linux-heading-in-to-the-back-nine-part-2/
Request Version: HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; myuseragent)
Accept-Language: en-us,en;q=0.5\r\n
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n
Keep-Alive: 300\r\n
Referer: http://www.movingtofreedom.org/\r\n
Host: www.movingtofreedom.org\r\n
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2\r\n
Connection: keep-alive\r\n
Content-type: application/x-www-form-urlencoded\r\n

Yuck. WTF is this? The cookie is missing, which is to be expected (I
haven't used the comment form from this thing, partly because it won't
even render forms properly yet), but the Accept* headers are all
jumbled in with the others, the Host: header is near the bottom,
Connection: keep-alive and Keep-Alive: 300 are separated, and the
Referer: isn't near the bottom where it (apparently) belongs. Then
there's that weird "Content-type:" header, which is coming from Christ
alone knows where and is probably the cause of the 400 errors and God
alone knows what subtler problems: a different format of returned page
sometimes? Getting incorrectly identified as a broken bot?

I don't want people using my new browser to get lots of timeouts,
refused connections, and 403 errors because some webmaster mistook it
for a misbehaving bot and told the world to block the user agent. At
least it's only a test user-agent; obviously when it's done the
user-agent will be changed to something a lot less lame. Getting
mistaken for a bad bot might also land my IP range in a block list
somewhere, which would put a crimp in surfing, not to mention in
testing this thing further, and would inconvenience up to 255 other
customers of my ISP at a time as an added bonus.
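To make the comparison of the two captures less of an eyeball job, here is a minimal sketch that diffs two lists of header names: which headers one side has that the other lacks, and whether the shared headers come in the same relative order. The class and method names (HeaderOrderDiff, extra, sameRelativeOrder) are made up for illustration, and the header lists in main are simply typed in from the two dumps above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class HeaderOrderDiff {
    // Lower-case the names so "Content-type" and "Content-Type" compare equal.
    static List<String> normalize(List<String> names) {
        List<String> out = new ArrayList<>();
        for (String n : names) {
            out.add(n.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Headers present in 'actual' but absent from 'reference'.
    static List<String> extra(List<String> reference, List<String> actual) {
        List<String> out = normalize(actual);
        out.removeAll(normalize(reference));
        return out;
    }

    // True if the headers common to both lists appear in the same relative
    // order. Assumes no header name is repeated within a list.
    static boolean sameRelativeOrder(List<String> reference, List<String> actual) {
        List<String> ref = normalize(reference);
        List<String> act = normalize(actual);
        List<String> refCommon = new ArrayList<>(ref);
        refCommon.retainAll(act);
        List<String> actCommon = new ArrayList<>(act);
        actCommon.retainAll(ref);
        return refCommon.equals(actCommon);
    }

    public static void main(String[] args) {
        // Header names in the order they appear in the two captures.
        List<String> firefox = Arrays.asList("Host", "User-Agent", "Accept",
                "Accept-Language", "Accept-Encoding", "Accept-Charset",
                "Keep-Alive", "Connection", "Referer", "Cookie");
        List<String> mine = Arrays.asList("User-Agent", "Accept-Language",
                "Accept-Charset", "Keep-Alive", "Referer", "Host", "Accept",
                "Connection", "Content-type");

        System.out.println("only in mine:    " + extra(firefox, mine));
        System.out.println("only in firefox: " + extra(mine, firefox));
        System.out.println("same order:      " + sameRelativeOrder(firefox, mine));
    }
}
```

This only compares names and order; it says nothing about whether a server actually cares about either.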

The code generating the initial connection is:

uc = (HttpURLConnection)u.url.openConnection();
// Automatically follow whatever redirects we can.
uc.setInstanceFollowRedirects(true);
uc.setRequestProperty("User-Agent", Main.USER_AGENT);
uc.setRequestProperty("Accept-Language", "en-us,en;q=0.5");
uc.setRequestProperty("Accept-Charset",
"ISO-8859-1,utf-8;q=0.7,*;q=0.7");
uc.setRequestProperty("Keep-Alive", "300");
if (u.prevURL != null) {
    uc.setRequestProperty("Referer", u.prevURL);
}
if (Main.COOKIE_ENABLE) {
    String k = getCookiesFor(u.url.getHost());
    if (k != null) uc.setRequestProperty("Cookie", k);
}
uc.connect();

(Here, "u" references an object that encapsulates a resource fetch
request. All of this is in a try block and some other cruft I don't
think is relevant here. And yes, my main class is named
"foo.bar.baz.Main"; and yes maybe that's lame; so sue me.)

As you can see, I'm not supplying "Content-type" anywhere. The things I
*am* supplying are all being put, in order, right after the request
line (the Request Foo fields in the dumps). Host, Accept, and
Connection are added automatically, which is fine with me, but the
Keep-Alive header wasn't being sent until I added it manually.
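One way to separate "headers I set" from "headers the library added" without reaching for Wireshark is to ask the connection what it has been given before connecting. The sketch below uses URLConnection.getRequestProperties(), which has to be called before connect() (it throws IllegalStateException afterwards). The class name and the example.invalid URL are placeholders, and as far as I can tell the map only reflects caller-set properties, so anything the implementation adds at send time (Host, Accept, Connection, or a stray Content-type) would probably not show up here and still needs the sniffer.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.List;
import java.util.Map;

// Hypothetical helper; only the java.net calls are real API.
final class RequestPropertyDump {
    // Builds an unconnected HttpURLConnection, sets a few headers the way the
    // fetch code does, and returns what the connection reports as configured.
    static Map<String, List<String>> configured(String url) throws Exception {
        HttpURLConnection uc =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        uc.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; myuseragent)");
        uc.setRequestProperty("Accept-Language", "en-us,en;q=0.5");
        uc.setRequestProperty("Referer", "http://example.invalid/");
        // No connect() here: openConnection() alone does no network I/O.
        return uc.getRequestProperties();
    }

    public static void main(String[] args) throws Exception {
        // example.invalid is a placeholder host; nothing is fetched.
        Map<String, List<String>> props = configured("http://example.invalid/page");
        for (Map.Entry<String, List<String>> e : props.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

If a header shows up on the wire but not in this map, that at least confirms it is coming from inside the HTTP implementation rather than from the calling code.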

So, three questions:
1. Is the "Content-type" header what's screwing things up and causing
400 errors or worse?
2. How do I get rid of it?
3. Is the header rearrangement (relative to what Firefox outputs) a
likely source of problems?

Regarding number 2, I should note that I already tried these:

uc.setRequestProperty("Content-type", null);
uc.setRequestProperty("Content-type", "");

which produced garbled results (as sniffed with Wireshark) and even
more 400 errors from hosts that didn't give them before.

Regarding number 3, given the rampancy of browser discrimination, I
intend to include user-agent spoofing functionality. If someone spoofs
Firefox and the headers are in the wrong order, might the spoof be
exposed? If I can, I want to include header ordering/characteristics
more generally in "spoof profiles", with at least ones for Firefox and
Internet Exploder; for now I'm simply trying to masquerade as Firefox
as accurately as possible (except for, ironically, the user-agent
header contents themselves) as a proof of concept. Wireshark shows me
that my efforts are falling way short of the bar there so far.
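For what a "spoof profile" with header ordering might look like, here is a rough sketch that keeps the headers in an insertion-ordered map and renders the request head by hand. This is a different technique from the HttpURLConnection code above: it assumes the request text would be written out directly (for instance over a plain java.net.Socket), which is not something the existing code does. All names here (SpoofProfileSketch, firefoxLikeProfile, requestHead) are invented, and the header values are copied from the Firefox capture.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of an ordered "spoof profile".
final class SpoofProfileSketch {
    // LinkedHashMap preserves insertion order, so the headers come out in
    // the order they are put in (here, the order seen in the Firefox dump).
    static Map<String, String> firefoxLikeProfile(String host, String referer) {
        Map<String, String> h = new LinkedHashMap<>();
        h.put("Host", host);
        h.put("User-Agent", "Mozilla/5.0 (compatible; myuseragent)");
        h.put("Accept", "text/xml,application/xml,application/xhtml+xml,"
                + "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5");
        h.put("Accept-Language", "en-us,en;q=0.5");
        // Advertising gzip,deflate means the client must be able to decode them.
        h.put("Accept-Encoding", "gzip,deflate");
        h.put("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
        h.put("Keep-Alive", "300");
        h.put("Connection", "keep-alive");
        if (referer != null) {
            h.put("Referer", referer);
        }
        return h;
    }

    // Renders the request line plus headers, each terminated by CRLF, followed
    // by the blank line that ends the head of an HTTP/1.1 request.
    static String requestHead(String path, Map<String, String> headers) {
        StringBuilder sb = new StringBuilder();
        sb.append("GET ").append(path).append(" HTTP/1.1\r\n");
        for (Map.Entry<String, String> e : headers.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append("\r\n");
        }
        return sb.append("\r\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> profile =
                firefoxLikeProfile("www.movingtofreedom.org", "http://www.movingtofreedom.org/");
        System.out.print(requestHead("/", profile));
    }
}
```

The catch is that going this route means handling the response side (chunking, compression, redirects, keep-alive) without the standard library's help.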

Note: I'm aware that there are Firefox extensions. I'm aware that any
Joe can program one (in theory) and that there are some for spoofing
and disabling some evil scripts and the like. I'm also aware that none
of the latter do precisely what I am looking for, and I am *un*aware of
how to program a Firefox extension. Learning the API and tools would
probably take longer than it took to make this Java user agent, for
which 80% of the work is done for me anyway (protocol implementations
and much HTML parsing and rendering) by the standard library. That
means I'm coding more of a "browser extension" than a "browser" anyway,
using tools I am already familiar with... and of course there's now a
sunk investment of time and effort...