Re: Need help with regular expression to parse URLs

From: Tom Anderson <twic@urchin.earth.li>
Newsgroups: comp.lang.java.programmer
Date: Mon, 10 Aug 2009 22:48:25 +0100
Message-ID: <alpine.DEB.1.10.0908102210280.27269@urchin.earth.li>

On Mon, 10 Aug 2009, markspace wrote:

> Neil wrote:
>
>> I wrote this regular expression:
>> ^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?
>>
>> It seems to be working fine for most urls, but it barfed on this one:
>> http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html
>>
>> The matcher gives me 1 group with this value: s/Backpacks
>>
>> I don't understand how that could have happened. I was expecting to get
>> two groups:
>> Stuff/Bags-%26-Luggage
>> Bags-%26-Totes/Backpacks
>>
>> Any ideas what went wrong?

You have two problems.

Firstly, the repeated group as written has no way to admit slashes
*between* pairs of path elements. Expand the repetition by hand (three
times, here):

[^/]+/[^/]+[^/]+/[^/]+[^/]+/[^/]+

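You get the slash between elements in a pair, but not between pairs. The
only way the engine can make that fit your URL is to split path elements
in the middle: 'Stuff/Bags-%26-Luggag', then 'e/Bags-%26-Tote', then
's/Backpacks'. The group keeps only its last iteration, which explains
your result. You need something that expands to: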

[^/]+/[^/]+/[^/]+/[^/]+/[^/]+/[^/]+

Like:

^http://jammconsulting.com/jamm/[^/]+/[^/]+(/[^/]+/[^/]+)*\\.html?
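
To see the difference concretely, here is a minimal sketch that runs the original pattern and the corrected one against the problem URL. The class name RepeatDemo and the helper lastGroup are made up for illustration; the patterns and the URL are the ones from this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatDemo {
    static final String URL =
        "http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";

    // Original pattern: a slash inside each pair, but nothing between pairs.
    static final Pattern ORIGINAL = Pattern.compile(
        "^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?");

    // Corrected pattern: every pair brings its own leading slash.
    static final Pattern FIXED = Pattern.compile(
        "^http://jammconsulting.com/jamm/[^/]+/[^/]+(/[^/]+/[^/]+)*\\.html?");

    // Hypothetical helper: group 1 if the whole input matches, otherwise null.
    static String lastGroup(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(lastGroup(ORIGINAL, URL)); // s/Backpacks
        System.out.println(lastGroup(FIXED, URL));    // /Bags-%26-Totes/Backpacks
    }
}
```

Note that even the corrected pattern reports only one pair; it just stops cutting path elements in half.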

You can get the individual elements with smaller capturing groups (here
making the pair-level group non-capturing):

^http://jammconsulting.com/jamm/[^/]+/[^/]+(?:/([^/]+)/([^/]+))*\\.html?

Secondly, you get one matching group per occurrence of a capturing group
in the *pattern*, not per occurrence of the subpattern in the match. That
is, if the above pair group matches five times, you'll still only get a
single pair of captured groups (the last ones). That, i think, means
there's no way to use a regular expression to do what you want to do here.
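
A small sketch of that behaviour, using the pattern with the two inner capturing groups. LastCaptureDemo and its groups helper are hypothetical names; the point is only that groupCount() is fixed by the pattern, and each group holds whatever its last iteration matched.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastCaptureDemo {
    // Two capturing groups inside a repeated non-capturing group.
    static final Pattern PAIRS = Pattern.compile(
        "^http://jammconsulting.com/jamm/[^/]+/[^/]+(?:/([^/]+)/([^/]+))*\\.html?");

    // Hypothetical helper: the contents of every capturing group after a
    // full match, or an empty array if the input does not match.
    static String[] groups(String input) {
        Matcher m = PAIRS.matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String url = "http://jammconsulting.com/jamm/page/products/Stuff/"
            + "Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
        // The pair group matched twice, but only the second pair survives.
        System.out.println(Arrays.toString(groups(url))); // [Bags-%26-Totes, Backpacks]
    }
}
```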

At least, not directly. What you can do is make a regexp which matches a
single occurrence of a pair of elements, and then use the Matcher's find()
method to loop over all occurrences in the string. Like so:

import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Split {
    public static void main(String... args) throws URISyntaxException {
        // Matches the whole path; group 1 is everything between the two
        // fixed leading elements and the .htm/.html extension.
        Pattern whole = Pattern.compile("^/jamm/[^/]+/[^/]+(.*?)\\.html?$");
        // Matches a single pair of path elements separated by a slash.
        Pattern pair = Pattern.compile("([^/]+)/([^/]+)");
        for (String s : args) {
            URI uri = new URI(s);
            String path = uri.getPath(); // decoded path: %26 becomes &
            Matcher wholeMatch = whole.matcher(path);
            if (wholeMatch.matches()) {
                Matcher pairMatch = pair.matcher(wholeMatch.group(1));
                // Each call to find() moves on to the next pair.
                while (pairMatch.find()) {
                    String first = pairMatch.group(1);
                    String second = pairMatch.group(2);
                    System.out.println(pairMatch.start() + "\t" + first + "\t" + second);
                }
            }
        }
    }
}

Note that rather than matching against the raw URL string, i'm going via
java.net.URI; this saves me having to match the other bits of the URL
explicitly, and also takes care of resolving % escapes.
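
As a rough illustration of the escape handling, the sketch below compares getRawPath() and getPath() on the URL from the original question. UriPathDemo and its two helpers are hypothetical names, and URI.create is used here instead of the URI constructor only to avoid the checked exception.

```java
import java.net.URI;

class UriPathDemo {
    // Path exactly as written in the URL, escapes left alone.
    static String rawPath(String url) {
        return URI.create(url).getRawPath();
    }

    // Path with % escapes decoded.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String url = "http://jammconsulting.com/jamm/page/products/Stuff/"
            + "Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
        System.out.println(rawPath(url));     // still contains %26
        System.out.println(decodedPath(url)); // contains & instead
    }
}
```

One thing to keep in mind: since getPath() decodes every escape, an escaped slash (%2F) inside a path element would come back as a real slash and change how the elements pair up. If that can happen in your URLs, matching against getRawPath() and decoding each element afterwards may be the safer route.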

> I don't understand what the * was in the end of your regex: "*\.html" ?

It's a quantifier on the preceding group - the one which captures the
paired path components like 'Stuff/Bags-%26-Luggage'. It means that there
can be any number of such pairs.
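
A quick sketch of what "any number" means in practice, using the corrected pattern with the slash moved inside the group. QuantifierDemo and the three test URLs are invented for illustration: zero pairs and one pair both match, but a single leftover element does not, because the group only ever consumes elements two at a time.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    static final Pattern FIXED = Pattern.compile(
        "^http://jammconsulting.com/jamm/[^/]+/[^/]+(/[^/]+/[^/]+)*\\.html?");

    // True if the whole URL matches the pattern.
    static boolean matches(String url) {
        return FIXED.matcher(url).matches();
    }

    public static void main(String[] args) {
        String base = "http://jammconsulting.com/jamm/page/products";
        System.out.println(matches(base + ".html"));            // true: zero pairs
        System.out.println(matches(base + "/Stuff/Bags.html")); // true: one pair
        System.out.println(matches(base + "/Stuff.html"));      // false: half a pair
    }
}
```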

tom

--
I do not fear death. I had been dead for billions and billions of years
before I was born. -- Mark Twain