Re: Keeping the split token in a Java regular expression

From: Martin Gregorie <martin@address-in-sig.invalid>
Newsgroups: comp.lang.java.programmer
Date: Tue, 27 Mar 2012 21:57:34 +0000 (UTC)
Message-ID: <jktd4e$kef$1@localhost.localdomain>

On Tue, 27 Mar 2012 01:17:26 +0000, Martin Gregorie wrote:

   It's rather late here, so I'll leave this as an exercise for anybody
   who feels keen. If nobody has touched it by mid morning tomorrow I
   may see if it works.

I put together the following this morning. Hopefully it's enough of an SSCE
to pass muster.

As promised, I first implemented a two-pass splitter (the 'classico'
method): it's ugly all right, even though it does the trick.

Then I swiped Stefan's code (the 'patternista' method), tweaked it
slightly and used it to drive both his regex and mine. The only other
change it needed was to parameterise Matcher.group(), because Stefan's
regex uses the whole match (group 0) while mine uses only the first
capture group (group 1), which lets it discard the comma separators.
This was one of my design aims: to output exactly the same strings as
the classico() method does.
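
To make the group(0) versus group(1) difference concrete, here is a minimal
sketch. The class name GroupDemo and the helper firstGroup() are invented
for this illustration and are not part of the Splitter class that follows;
the short input string is also made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo
{
   // Hypothetical helper: returns group g of the first match of regex
   // in input, or throws if there is no match at all.
   public static String firstGroup(String regex, int g, String input)
   {
      Matcher m = Pattern.compile(regex).matcher(input);
      if (!m.find())
         throw new IllegalArgumentException("no match");
      return m.group(g);
   }

   public static void main(String[] args)
   {
      String in = "Fri 7:30 PM, Sat 1 AM";
      String regex = "(.*?[AP]M),?";

      // Group 0 is the whole match, so the optional comma is included.
      System.out.println("[" + firstGroup(regex, 0, in) + "]"); // [Fri 7:30 PM,]

      // Group 1 is only what the parentheses captured, so the comma is gone.
      System.out.println("[" + firstGroup(regex, 1, in) + "]"); // [Fri 7:30 PM]
   }
}
```

Passing the group number in as a parameter is what lets one method serve
both styles of regex.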

==========================================================================
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Splitter
{
   public static ArrayList<String> classico(String in)
   {
      // Pass 1: split on "PM, " or a bare "PM", then put back the " PM"
      // that split() consumed.
      String[] sList = in.split("PM, +|PM");
      for (int i=0; i<sList.length; i++)
         sList[i] = sList[i].trim() + " PM";

      ArrayList<String> aList = new ArrayList<String>();
      for (String s : sList)
      {
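         // Pass 2: split each piece on "AM, " or a bare "AM" and put the
         // " AM" back on every piece except the last.
         String[] sp = s.split("AM, +|AM");
         for (int j=0; j < sp.length - 1; j++)
            aList.add(sp[j].trim() + " AM");

         aList.add(sp[sp.length - 1]); // The last element always ends
                                        // with PM: pass 1 appended it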
      }

      return aList;
   }

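   // p: the regex; g: the group to keep (0 = whole match,
   // 1 = first capture group); in: the string to split.
   public static ArrayList<String> patternista(String p, int g, String in)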
   {
      Pattern pattern = Pattern.compile(p, Pattern.CASE_INSENSITIVE);
      Matcher matcher = pattern.matcher(in);
      ArrayList<String> aList = new ArrayList<String>();
      while(matcher.find())
      {
         String s = matcher.group(g);
         aList.add(s.trim());
      }

      return aList;
   }

   public static void showResult(String source,
                                 String method,
                                 ArrayList<String> s)
   {
      System.out.println(String.format("\n'%s' ==> '%s'",
                                       source,
                                       method));
      for (int i = 0; i < s.size(); i++)
         System.out.println(String.format("%2d: %s", i, s.get(i)));
   }

   public static void main(String[] args)
   {
      String SOURCE = "Fri 7:30 PM, Sat 1, 3 and 5 AM, Sun 2:30 PM";
      String martin = "(.*?[AP]M),?";
      String stefan = ".*?(?:am|pm),?";

      ArrayList<String> s;
      s = classico(SOURCE);
      showResult(SOURCE, "classico", s);
      s = patternista(martin, 1, SOURCE);
      showResult(SOURCE, martin, s);
      s = patternista(stefan, 0, SOURCE);
      showResult(SOURCE, stefan, s);
   }
}
==========================================================================
'Fri 7:30 PM, Sat 1, 3 and 5 AM, Sun 2:30 PM' ==> 'classico'
0: Fri 7:30 PM
1: Sat 1, 3 and 5 AM
2: Sun 2:30 PM

'Fri 7:30 PM, Sat 1, 3 and 5 AM, Sun 2:30 PM' ==> '(.*?[AP]M),?'
0: Fri 7:30 PM
1: Sat 1, 3 and 5 AM
2: Sun 2:30 PM

'Fri 7:30 PM, Sat 1, 3 and 5 AM, Sun 2:30 PM' ==> '.*?(?:am|pm),?'
0: Fri 7:30 PM,
1: Sat 1, 3 and 5 AM,
2: Sun 2:30 PM
==========================================================================

As you can see, once I'd swapped greedy matches for non-greedy ones in my
regex (the second test run), both regexes do the job and, to my mind, use
much more elegant code than the two-pass classico approach, but of course
YMMV.
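
The effect of the greedy to non-greedy swap can be sketched as follows.
The greedy pattern "(.*[AP]M),?" is an assumption made for illustration
(the earlier greedy version of the regex is not shown in this post), and
the class GreedyDemo and its matches() helper are likewise invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo
{
   // Hypothetical helper: collects group 1 of every match, trimmed.
   public static List<String> matches(String regex, String in)
   {
      List<String> out = new ArrayList<String>();
      Matcher m = Pattern.compile(regex).matcher(in);
      while (m.find())
         out.add(m.group(1).trim());
      return out;
   }

   public static void main(String[] args)
   {
      String in = "Fri 7:30 PM, Sat 1, 3 and 5 AM, Sun 2:30 PM";

      // Greedy .* grabs the whole line and backtracks only as far as
      // the LAST "PM", so everything comes back as a single element.
      System.out.println(matches("(.*[AP]M),?", in).size());  // 1

      // Reluctant .*? stops at the FIRST "AM" or "PM" it can reach,
      // so each find() yields one entry.
      System.out.println(matches("(.*?[AP]M),?", in).size()); // 3
   }
}
```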

--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |