Re: Efficient URL-decoding.

From:

"=?iso-8859-1?q?Kirit_S=E6lensminde?=" <kirit.saelensminde@gmail.com>

Newsgroups:

comp.lang.c++

Date:

17 Jun 2006 20:44:42 -0700

Message-ID:

<1150602282.227489.130670@u72g2000cwu.googlegroups.com>

Peter Jansson wrote:

Hello group,

The following code is an attempt to perform URL-decoding of URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code;

1. If characters after '%' do not represent hexademical number, then
uninitialized value variable 'hexint' used - this is undefined behavior.

2. This code is very inefficient - to many mallocs/string
copyings/text-streams processing for such simple operation as 'convert
to hex chars to integers',

3. Code use iostreams, so it's locale specific

URI encoding and decoding is a really tricky issue, quite apart from
the issues in the code that you have (that others have already helped
with). I think maybe I can shed some light on your third question
though. It's all a little off topic for this forum, but important. I
hope everybody indulges me :-)

The URI is split into several parts, but for HTTP(S) the only parts
that will be encoded are the file specification and the query string
(if present). Note that they are encoded _differently_. They are
seperated by a single question mark (?).

If you are decoding somebody else's URI then stop right now. You don't
know enough about it to be able to work it out. By all means give it a
go, but don't expect that it will make any sense and _don't_ manipulate
it before using it. You will break something.

The file specfication (which is what it looks like you're decoding) can
be in any locale. These days people are tending to drift towards UTF-8,
but it's by no means universal. If it's a URI you've encoded yourself
then you should know which locale you used.

For the query string, the format is described in the HTML specs, _but_
only browsers have to use that format when they create a query string
from a form submission using GET. Any other URI creation can be
different and in fact the W3C recommends a slightly different format
that doesn't cause common entity problems in HTML. The biggest
difference between the file specification and the query string encoding
is the space. In the file spec they are '%20' and in a query string
they are '+' - many people get this wrong (most clearly Technorati who
have wrecked the C++ blogging community through this error). If you're
trying to decode somebody else's query string you can't rely on the
format at all.

Now, the final thing you need to think about is security. You need to
check and double check the URI for all sorts of things. What checks you
need depends on where the URI came from and what you're doing with it.

I have a number of aticles on my web site about these issues:

Problems with IIS due to HTTP.SYS decoding the file specification:
http://www.kirit.com/Getting%20the%20correct%20Unicode%20path%20within%20an%20ISAPI%20filter

Another problem caused in the custom 404 handling:
http://www.kirit.com/Errors%20in%20IIS%27s%20custom%20404%20error%20handling

Another encoding fault leading to problems with redirections:
http://www.kirit.com/Response.Redirect%20and%20encoded%20URIs

And even the W3C doesn't get it right:
http://www.kirit.com/W3C%27s%20CSS%20validation%20service

These articles should give you a feel for the sorts of things that can
go wrong. They link in many places to the relevant RFCs and other specs
and notes. There are some other articles on Unicode that also touch on
other aspecs of these issues. There are also a load of tricky URIs that
you should be able to deal with properly.

So, again, I know this is strictly off-topic for this group. I hope
nobody minds too much though as the issues are pretty important.

K