Re: [LONG] java.net.URI encoding weirdness
On 06/05/14 18:24, markspace wrote:
On 5/6/2014 12:26 AM, Stanimir Stamenkov wrote:
public static void main(String[] args) throws Exception {
URI u = URI.create("http://server1/path"
+ "?param%3D1=value%261&param%3F2=value%232"
^^^ ^^^ ^^^ ^^^
+ "#fragment");
Note how the 'query' is wrong now.
Doesn't each % above indicate an encoded value, or are you referring
to something else? I'm not sure we're not talking at cross purposes here.
The name of the first parameter is "param=1", and its value is "value&1".
The name of the second parameter is "param?2", with the value "value#2".
Because = and & are used to delimit parameters in the query string, the
literals in these parameter names and values have to be encoded by the
user before going into the string.
URI u = URI.create( decode("http://server1/path"
+ "?param%3D1=value%261&param%3F2=value%242"
+ "#fragment"));
System.out.println(u.toASCIIString());
run:
http://server1/path?param=1=value&1&param?2=value$2#fragment
This query string no longer matches the intended input.
Why decode() before passing into create()? The URI class needs to parse
the string before anything gets decoded.
import java.net.URI;
public class URITest4 {
public static void main(String[] args) throws Exception {
URI u = URI.create("http://server1/path"
+ "?param%3D1=value%261&param%3F2=value%232"
+ "#fragment");
System.out.println(u.toASCIIString());
System.out.println(" Query: " + u.getQuery());
System.out.println("Raw query: " + u.getRawQuery());
}
}
% java URITest4
http://server1/path?param%3D1=value%261&param%3F2=value%232#fragment
Query: param=1=value&1&param?2=value#2
Raw query: param%3D1=value%261&param%3F2=value%232
This shows that getQuery() is not useful, as it decodes too soon. The
value must be split at & first, then at =, then the names and values
should be decoded. This is why v is wrong in URITest3.
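
To make the order of operations concrete, here is a minimal sketch that splits the raw query at & first, then at the first =, and only then decodes each piece. The class name RawQuerySplit and its parse() helper are made up for illustration and are not part of java.net.URI. It assumes UTF-8 and Java 10 or later (for the Charset overload of URLDecoder.decode); note that URLDecoder follows form-encoding rules, so it also turns a literal + into a space.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class RawQuerySplit {
    // Hypothetical helper: parse the *raw* query, decoding last.
    static Map<String, String> parse(String rawQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        // 1. Split at '&' while the delimiters are still unambiguous.
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate "a=1&&b=2"
            }
            // 2. Split each pair at the first '='.
            int eq = pair.indexOf('=');
            String rawName = eq < 0 ? pair : pair.substring(0, eq);
            String rawValue = eq < 0 ? "" : pair.substring(eq + 1);
            // 3. Only now decode the name and the value.
            params.put(decode(rawName), decode(rawValue));
        }
        return params;
    }

    static String decode(String s) {
        // Assumption: UTF-8, form-style decoding ('+' becomes a space).
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        URI u = URI.create("http://server1/path"
                + "?param%3D1=value%261&param%3F2=value%232"
                + "#fragment");
        for (Map.Entry<String, String> e : parse(u.getRawQuery()).entrySet()) {
            System.out.println("[" + e.getKey() + "] = [" + e.getValue() + "]");
        }
    }
}
```

Fed the raw query from URITest4, this yields the two intended pairs: "param=1" with "value&1", and "param?2" with "value#2".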
w is wrong in URITest3 because, although getRawQuery()'s correct value
is provided, the URI constructor incorrectly encodes it again.
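
A minimal sketch of that effect, assuming w was built roughly by passing getRawQuery() to the five-argument constructor (the class and method names here are invented for the demonstration). The multi-argument constructors always quote the % character, so every existing escape gets escaped a second time.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class DoubleEncodeDemo {
    // Hypothetical helper: take a URI apart and feed the pieces,
    // including the already-encoded raw query, to the 5-arg constructor.
    static String rebuild(String text) {
        URI u = URI.create(text);
        try {
            URI w = new URI(u.getScheme(), u.getAuthority(), u.getPath(),
                            u.getRawQuery(), u.getFragment());
            return w.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String original = "http://server1/path"
                + "?param%3D1=value%261&param%3F2=value%232"
                + "#fragment";
        System.out.println("Original: " + original);
        // Each '%' in the query comes back as "%25".
        System.out.println(" Rebuilt: " + rebuild(original));
    }
}
```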
I guess the problem stems from java.net.URI only partially parsing some
components. For those components that have no further structure, it's
okay to decode. But the query string has more structure, which must be
parsed before decoding. The same goes for getUserInfo() to some extent,
since : is a special character in it, which the user might want to use
literally.
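
The same split-then-decode order can be sketched for the user-info part, assuming the common user:password layout. UserInfoSplit and its methods are hypothetical names, and the example URI is invented. Since + is an ordinary character here, it is protected before URLDecoder (which would otherwise turn it into a space) is applied; Java 10 or later is assumed for the Charset overload.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class UserInfoSplit {
    // Hypothetical helper: returns { user, password }, decoded last.
    static String[] split(String rawUserInfo) {
        // Split at the first ':' while it is still a pure delimiter.
        int colon = rawUserInfo.indexOf(':');
        String rawUser = colon < 0 ? rawUserInfo : rawUserInfo.substring(0, colon);
        String rawPass = colon < 0 ? "" : rawUserInfo.substring(colon + 1);
        return new String[] { decode(rawUser), decode(rawPass) };
    }

    static String decode(String s) {
        // Keep a literal '+' intact; URLDecoder would make it a space.
        return URLDecoder.decode(s.replace("+", "%2B"), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Invented example: the user name contains ':' and the password '@'.
        URI u = URI.create("http://user%3Aname:pass%40word@server1/path");
        System.out.println("Decoded: " + u.getUserInfo());    // delimiter now ambiguous
        System.out.println("    Raw: " + u.getRawUserInfo());
        String[] parts = split(u.getRawUserInfo());
        System.out.println("   User: " + parts[0]);
        System.out.println("   Pass: " + parts[1]);
    }
}
```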
Correspondingly, the parts which are externally assembled should not be
encoded by multi-arg URI constructors, because the caller will already
have had to do that.
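
Until then, one way to sidestep the re-encoding is for the caller to encode each name and value, assemble the whole string, and hand it to the single-argument constructor (or URI.create), which parses without encoding again. This is only a sketch: EncodedQueryBuilder and build() are invented names, and URLEncoder applies form-encoding rules (a space becomes +), which may or may not suit a given server. Java 10 or later is assumed for the Charset overload.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodedQueryBuilder {
    // Hypothetical helper: nameValuePairs is { name1, value1, name2, value2, ... }.
    static URI build(String base, String fragment, String... nameValuePairs) {
        StringBuilder query = new StringBuilder();
        for (int i = 0; i + 1 < nameValuePairs.length; i += 2) {
            if (query.length() > 0) {
                query.append('&');
            }
            // Encode each piece separately so literal '=' and '&' are escaped
            // but the delimiters added here are not.
            query.append(URLEncoder.encode(nameValuePairs[i], StandardCharsets.UTF_8))
                 .append('=')
                 .append(URLEncoder.encode(nameValuePairs[i + 1], StandardCharsets.UTF_8));
        }
        // Single-argument parsing leaves the existing escapes alone.
        return URI.create(base + "?" + query + "#" + fragment);
    }

    public static void main(String[] args) {
        URI u = build("http://server1/path", "fragment",
                      "param=1", "value&1",
                      "param?2", "value#2");
        System.out.println(u.toASCIIString());
        System.out.println("Raw query: " + u.getRawQuery());
    }
}
```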
--
ss at comp dot lancs dot ac dot uk