Re: Need help with regular expression to parse URLs
On Mon, 10 Aug 2009, markspace wrote:
Neil wrote:
I wrote this regular expression:
^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?
It seems to be working fine for most urls, but it barfed on this one:
http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html
The matcher gives me 1 group with this value: s/Backpacks
I don't understand how that could have happened. I was expecting to get
two groups:
Stuff/Bags-%26-Luggage
Bags-%26-Totes/Backpacks
Any ideas what went wrong?
You have two problems.
Firstly, the repeated group as written has no way to admit slashes
*between* pairs of path elements. Expand the repetition by hand (three
times, here):
[^/]+/[^/]+[^/]+/[^/]+[^/]+/[^/]+
You get the slash between elements in a pair, but not between pairs. This
explains your results. You need something that expands to:
[^/]+/[^/]+/[^/]+/[^/]+/[^/]+/[^/]+
Like:
^http://jammconsulting.com/jamm/[^/]+/[^/]+(/[^/]+/[^/]+)*\\.html?
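A quick way to check both patterns against the URL from the question is the sketch below. The class name RepeatDemo and the helper lastGroup are made up for illustration; the patterns and the URL are the ones quoted above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatDemo {
    static final String URL =
        "http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";

    // Returns whatever group 1 captured on its last repetition,
    // or null if the pattern does not match (or the group never took part).
    static String lastGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Original pattern: no slash allowed between repetitions of the group.
        String original = "^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?";
        // Fixed pattern: the leading slash is part of the repeated group.
        String fixed = "^http://jammconsulting.com/jamm/[^/]+/[^/]+(/[^/]+/[^/]+)*\\.html?";

        System.out.println(lastGroup(original, URL)); // s/Backpacks
        System.out.println(lastGroup(fixed, URL));    // /Bags-%26-Totes/Backpacks
    }
}
```

With the original pattern the engine can only reach ".html" by backtracking and cutting path elements in the middle: the repetitions end up as "Stuff/Bags-%26-Luggag", "e/Bags-%26-Tote" and "s/Backpacks", and the last of those is what the group reports.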
You can get the individual elements with smaller capturing groups (here
making the pair-level group non-capturing):
^http://jammconsulting.com/jamm/[^/]+/[^/]+(?:/([^/]+)/([^/]+))*\\.html?
Secondly, you get one matching group per occurrence of a capturing group
in the *pattern*, not per occurrence of the subpattern in the match. That
is, if the above pair group matches five times, you'll still only get a
single pair of captured groups (the last ones). That, i think, means
there's no way to use a regular expression to do what you want to do here.
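The effect is easy to see with a small sketch, assuming the pattern with the two inner capturing groups and the URL from the question. LastCaptureDemo and captures are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCaptureDemo {
    static final String URL =
        "http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";

    // Returns the contents of every capturing group after a successful find(),
    // or an empty array if the pattern does not match.
    static String[] captures(String input) {
        Pattern p = Pattern.compile(
            "^http://jammconsulting.com/jamm/[^/]+/[^/]+(?:/([^/]+)/([^/]+))*\\.html?");
        Matcher m = p.matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        // groupCount() is fixed by the pattern (two groups here),
        // no matter how many times the repeated part matched.
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(captures(URL))); // [Bags-%26-Totes, Backpacks]
    }
}
```

The pair group matches twice on this URL, but only the captures from the second repetition survive; "Stuff" and "Bags-%26-Luggage" are gone.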
At least, not directly. What you can do is make a regexp which matches a
single occurrence of a pair of elements, and then use the Matcher's find()
method to loop over all occurrences in the string. Like so:
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Split {
    public static void main(String... args) throws URISyntaxException {
        // Matches the whole path; group 1 is everything after the first two elements.
        Pattern whole = Pattern.compile("^/jamm/[^/]+/[^/]+(.*?)\\.html?$");
        // Matches a single pair of path elements.
        Pattern pair = Pattern.compile("([^/]+)/([^/]+)");
        for (String s : args) {
            URI uri = new URI(s);
            String path = uri.getPath(); // decoded path, without scheme or host
            Matcher wholeMatch = whole.matcher(path);
            if (wholeMatch.matches()) {
                Matcher pairMatch = pair.matcher(wholeMatch.group(1));
                // Each find() call advances to the next pair in the remainder.
                while (pairMatch.find()) {
                    String first = pairMatch.group(1);
                    String second = pairMatch.group(2);
                    System.out.println(pairMatch.start() + "\t" + first + "\t" + second);
                }
            }
        }
    }
}
Note that rather than matching against the raw URL string, i'm going via
java.net.URI; this saves me having to match the other bits of the URL
explicitly, and also takes care of resolving % escapes.
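As a rough illustration of what going via java.net.URI buys you, the sketch below compares getPath() with getRawPath() on the URL from the question. PathDemo and its two helpers are invented names for this example; the URI methods themselves are standard.

```java
import java.net.URI;

public class PathDemo {
    static final String URL =
        "http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";

    // Path with % escapes decoded, scheme and host stripped.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    // Path exactly as written in the URL, escapes left alone.
    static String rawPath(String url) {
        return URI.create(url).getRawPath();
    }

    public static void main(String[] args) {
        System.out.println(decodedPath(URL));
        // /jamm/page/products/Stuff/Bags-&-Luggage/Bags-&-Totes/Backpacks.html
        System.out.println(rawPath(URL));
        // /jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html
    }
}
```

One caveat: getPath() decodes every escape, so an escaped slash (%2F) inside an element would come out as a real slash and be treated as a separator by the pair pattern. If that could occur in your URLs, matching against getRawPath() and decoding each element afterwards would be the safer order.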
I don't understand what the * at the end of your regex was for: "*\.html"?
It's a quantifier on the preceding group - the one which captures the
paired path components like 'Stuff/Bags-%26-Luggage'. It means that there
can be any number of such pairs.
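"Any number" includes zero. A small sketch, using the corrected pattern with the slash inside the repeated group; ZeroPairsDemo, lastPair and the two shortened URLs are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroPairsDemo {
    static final Pattern FIXED = Pattern.compile(
        "^http://jammconsulting.com/jamm/[^/]+/[^/]+(/[^/]+/[^/]+)*\\.html?");

    // Returns the last pair captured by the repeated group,
    // null if the URL matched with zero pairs, or "no match" otherwise.
    static String lastPair(String url) {
        Matcher m = FIXED.matcher(url);
        return m.find() ? m.group(1) : "no match";
    }

    public static void main(String[] args) {
        // No pairs after the first two elements: still a match, group is null.
        System.out.println(lastPair("http://jammconsulting.com/jamm/page/products.html"));     // null
        // One pair after the first two elements.
        System.out.println(lastPair("http://jammconsulting.com/jamm/page/products/a/b.html")); // /a/b
    }
}
```

If at least one pair is required, + in place of * would express that.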
tom
--
I do not fear death. I had been dead for billions and billions of years
before I was born. -- Mark Twain