Re: iterating over sub-matches using std::tr1::regex?

From: DomoChan@gmail.com
Newsgroups: comp.lang.c++
Date: Wed, 13 Aug 2008 11:47:29 -0700 (PDT)
Message-ID: <483a31a2-1c8f-4c44-8653-dd7c46d3ffa6@x35g2000hsb.googlegroups.com>

On Aug 13, 2:46 am, "Alf P. Steinbach" <al...@start.no> wrote:

* DomoC...@gmail.com:

On Aug 12, 11:13 pm, "Alf P. Steinbach" <al...@start.no> wrote:

* DomoC...@gmail.com:

Given a repeated group expression
([abc])+
and the input
cab
the match contains nested subgroups, which 'rad software regular
expression tester' displays as
Match 'cab'
   - Group 1
       - c at pos 0 length 1
       - a at pos 1 length 1
       - b at pos 2 length 1
I'd like to use the regex classes found in std::tr1 to iterate over all
the matches in Group 1.
I'm using regex_search to fill an smatch object. I need to go one more
step to iterate over the matches found in Group 1. Can anyone tell me
what I need to do to iterate over the sub-matches?
I've tried the following, but it doesn't seem to work:
// note: initialResults is successfully filled with a single group match
regex_search( "cab", initialResults, "([abc])+" );
for ( size_t ii = 1; ii < initialResults.size(); ++ii )
{
      ssub_match groupResults;
      // note: groupResults.matched is false. groupResults.first is NULL,
      // as is groupResults.second
      groupResults.compare( initialResults[ ii ] );
}
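
A note on why the loop above has so little to iterate over: in the ECMAScript grammar used by default, a capture group inside a repetition keeps only its last iteration, so the match holds one entry for group 1 rather than one per character. The following is a minimal sketch of that behaviour, assuming the C++11 `std::regex` names in place of `std::tr1::regex`; the function name `last_group_capture` is made up for illustration.

```cpp
#include <regex>
#include <string>

// Returns the text held by capture group 1 after searching `text`
// with ([abc])+, or an empty string if nothing matches.
// (Hypothetical helper, not part of the original post.)
std::string last_group_capture(const std::string& text)
{
    static const std::regex pattern("([abc])+");
    std::smatch m;
    if (!std::regex_search(text, m, pattern))
        return std::string();

    // m[0] is the whole match; m[1] holds only the final
    // iteration of the repeated group, not every iteration.
    return m[1].str();
}
```

Calling `last_group_capture("cab")` yields the single character matched by the final repetition, which is why indexing the smatch cannot recover 'c' and 'a'.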

Not sure exactly what you're talking about, but if I understand it correctly you
want all possible matches of a single character from a specific set of chars.

Then why not use ([abc]).

All possible matches of ([abc])+, if I read it correctly as 1 or more successive
characters drawn from the set {a, b, c}, for a string of length N consisting
of those characters only, well that's N + (N-1) + ... + 1 = (N^2 + N)/2
matches, and surely you don't want that, or do you?

Please don't quote signatures.

You seem to clearly understand the expression, but perhaps I didn't use
an accurate expression to explain my situation.

If I changed my input string to "cab bat mac", the results would then
contain:

Match 'cab'
   - Group 1
       - c at pos 0 length 1
       - a at pos 1 length 1
       - b at pos 2 length 1
Match 'ba'
   - Group 1
       - b at pos 4 length 1
       - a at pos 5 length 1
Match 'ac'
   - Group 1
       - a at pos 9 length 1
       - c at pos 10 length 1

So 'cab', 'ba', and 'ac' are stored in initialResults, and I can
iterate over those easily using a for loop and the 'smatch'
indexer.

Can you? I don't see how, if you're using the code shown earlier. Didn't work
for me.

However, I'm interested in the individual results within each
group, so from the first match 'cab' I want to be able to iterate over
that group and read [0] = 'c', [1] = 'a', [2] = 'b'. So, that's what
I'm trying to use 'ssub_match' for, but I'm sure I'm not using it
correctly.

Let me know if I'm still vague.

No, it seems pretty clear.

I reproduced the output shown above by using a sregex_iterator to iterate over
the matches for "([abc])+", and an inner loop with sregex_iterator to iterate
over the "([abc])" matches in each match (as suggested in my previous reply). It
seems there is also capture functionality that can do this more directly, but
requires recompilation of the regex library with certain switches, and affects
efficiency in general, i.e. not just when it's used. I didn't try that.

Since this might be a school homework assignment, or an exercise you're doing in
order to learn from the experience of doing it, I'm not enclosing the code, but
yes, with this simple expression it's not only possible but simple, as
described, and I'm too lazy to think about whether a more complex expression
might present problems. ;-) I did use some time on it though: building the regex
library (never used) and checking the docs. But well used time, learned some!

Cheers, & hth.,

- Alf

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Can you? I don't see how, if you're using the code shown earlier. Didn't work
for me.

yes... you can. see
"http://en.wikipedia.org/wiki/C%2B%2B0x#Regular_expressions"

(as suggested in my previous reply)

which reply was that?

I reproduced the output ... using sregex_iterator

This is not an assignment, unless you consider an assignment to myself,
in which case I hold no rules against cheating : ) Kidding aside, this
is just syntax, not really a logic issue, and I'm waaay past getting any
personal gratification from personal experience due to the amount of
hair I've lost over this issue. At any rate, I'm writing a simple xml
parser. See...

Cmn_XmlReader::Cmn_XmlReader( string xml )
{
    Cmn_String::StringToList( xml, m_original, "\r\n", true );
    Cmn_String::StringToList( xml, m_workingCopy, "\r\n", true );

    m_desc[Header] = "Header";
    m_desc[SplitTag] = "SplitTag";
    m_desc[CombinedTag] = "CombinedTag";
    m_desc[CloseTag] = "CloseTag";
    m_desc[OpenTag] = "OpenTag";

    m_regexDefs[Header] = "(<[\\?].+[\\?]>){1}";
    m_regexDefs[SplitTag] = "<(\\w+)\\s*(\\w+=['\"].+?['\"]\\s*)*\\s*>(.+?)</\\1>";
    m_regexDefs[CombinedTag] = "<(\\w+)\\s*(\\w+=['\"].+?['\"]\\s*)*\\s*/>";
    m_regexDefs[CloseTag] = "</(\\w+)>";
    m_regexDefs[OpenTag] = "<(\\w+)\\s*(\\w+=['\"].+?['\"]\\s*)*\\s*>";

    m_patternDefs[Header] = new regex( m_regexDefs[Header] );
    m_patternDefs[SplitTag] = new regex( m_regexDefs[SplitTag] );
    m_patternDefs[CombinedTag] = new regex( m_regexDefs[CombinedTag] );
    m_patternDefs[CloseTag] = new regex( m_regexDefs[CloseTag] );
    m_patternDefs[OpenTag] = new regex( m_regexDefs[OpenTag] );

    ValidateHeader();
}
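
In the tag patterns above, the attribute group sits inside a `*` repetition, so a single match exposes only the last attribute. One way around that without repeated-capture support is a two-step scan: capture the whole attribute run as one group, then walk it with a second regex. The sketch below assumes C++11 `std::regex`; the function name `attributes_of` and both patterns are simplified stand-ins invented for illustration, not the m_regexDefs patterns from the constructor.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Collects (name, value) pairs from one opening tag such as
// <item id="7" kind='x'>. Hypothetical helper with simplified patterns.
std::vector<std::pair<std::string, std::string>>
attributes_of(const std::string& tag)
{
    std::vector<std::pair<std::string, std::string>> result;

    // Step 1: group 2 captures the entire run of attributes as one string.
    static const std::regex open_tag(
        "<(\\w+)((?:\\s+\\w+=(?:\"[^\"]*\"|'[^']*'))*)\\s*>");
    std::smatch tag_match;
    if (!std::regex_match(tag, tag_match, open_tag))
        return result;

    // Step 2: a second regex visits each attribute inside that run.
    static const std::regex attribute("(\\w+)=(?:\"([^\"]*)\"|'([^']*)')");
    const std::string run = tag_match[2].str();
    for (std::sregex_iterator it(run.begin(), run.end(), attribute), end;
         it != end; ++it)
    {
        const std::smatch& m = *it;
        // Only one of groups 2 and 3 participates, depending on quote style.
        result.emplace_back(m[1].str(),
                            m[2].matched ? m[2].str() : m[3].str());
    }
    return result;
}
```

The result is an ordinary vector, so the attributes can be indexed as [0], [1], and so on, which is the access pattern asked for earlier in the thread.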

It seems that boost makes it more obvious how to access its repeated
captures, via smatch.captures(i)[j], which doesn't exist in tr1.

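// captures() and match_extra are only available when Boost.Regex is
// built with BOOST_REGEX_MATCH_EXTRA defined.
#include <boost/regex.hpp>
#include <iostream>
#include <string>

void print_captures(const std::string& regx, const std::string& text)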
{
    boost::regex e(regx);
    boost::smatch what;
    std::cout << "Expression: \"" << regx << "\"\n";
    std::cout << "Text: \"" << text << "\"\n";
    if(boost::regex_match(text, what, e, boost::match_extra))
    {
        unsigned i, j;
        std::cout << "** Match found **\n Sub-Expressions:\n";
        for(i = 0; i < what.size(); ++i)
            std::cout << " $" << i << " = \"" << what[i] << "\"\n";
        std::cout << " Captures:\n";
        for(i = 0; i < what.size(); ++i)
        {
            std::cout << " $" << i << " = {";
            for(j = 0; j < what.captures(i).size(); ++j)
            {
                if(j)
                    std::cout << ", ";
                else
                    std::cout << " ";
                std::cout << "\"" << what.captures(i)[j] << "\"";
            }
            std::cout << " }\n";
        }
    }
    else
    {
        std::cout << "** No Match found **\n";
    }
}

To make matters more difficult, intellisense has not worked for any tr1
objects, so viewing methods and properties involves browsing the lengthy
and cluttered regex header file or waiting until I start up the debugger
to see what's what.

So, if anyone knows how to access repeated subgroups, please divulge
your knowledge and make the forums a better place ^_^
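
For reference, the nested-iterator approach described earlier in the thread (an outer sregex_iterator over "([abc])+" and an inner one over "([abc])" within each match) might look roughly like the sketch below. It assumes C++11 `std::regex` rather than `std::tr1::regex`, and the names `GroupMatch` and `split_matches` are invented for illustration; it is not the code that was withheld above.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// One outer match plus the pieces produced by each repetition of the group.
struct GroupMatch {
    std::string text;                 // the whole match, e.g. "cab"
    std::size_t position;             // offset of the match in the input
    std::vector<std::string> pieces;  // one entry per repetition
};

std::vector<GroupMatch> split_matches(const std::string& input)
{
    static const std::regex outer("([abc])+");
    static const std::regex inner("([abc])");
    std::vector<GroupMatch> result;

    for (std::sregex_iterator o(input.begin(), input.end(), outer), end;
         o != end; ++o)
    {
        GroupMatch g;
        g.text = o->str();
        g.position = static_cast<std::size_t>(o->position());

        // Re-scan only the range covered by this outer match.
        const std::ssub_match& whole = (*o)[0];
        for (std::sregex_iterator i(whole.first, whole.second, inner);
             i != end; ++i)
        {
            g.pieces.push_back(i->str(1));
        }
        result.push_back(g);
    }
    return result;
}
```

For the input "cab bat mac" this gives three entries whose pieces can be read by index, at the cost of matching each range twice. With a more complex group the inner pattern has to be written so that it splits the outer match the same way the repetition did, which is not guaranteed in general.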