Re: Splitting a string into an array of words
Daniel T. wrote:
In article <Jfwvg.10246$2v.1690@newssvr25.news.prodigy.net>,
Mark P <usenet@fall2005REMOVE.fastmailCAPS.fm> wrote:
Here's a little tokenizer fcn I've used before. Not necessarily the
most elegant or compact way to do this (and criticisms are welcomed):
Well since criticisms are welcomed... :-)
// Populates "out" with delimited substrings of "in".
int tokenize (const string& in, vector<string>& out, const char* delims)
{
    string::size_type wordStart = 0; // current word start position
    string::size_type wordEnd = 0;   // last word end position

    while (true)
    {
        wordStart = in.find_first_not_of(delims, wordEnd);
        if (wordStart == in.npos)
            break;
        wordEnd = in.find_first_of(delims, wordStart);
        if (wordEnd == in.npos)
            wordEnd = in.size();
        out.push_back(in.substr(wordStart, wordEnd - wordStart));
    }
    return out.size();
}
From least important to most important:
1) The "while (true)" plus "break" combination is not a style I prefer.
Fair enough-- I'm not a fan either, but see my comment to item 4.
2) Returning out.size() isn't very useful, since the caller can find out
what out.size() equals without the function's help.
True. In my case, I pulled this function out of some actual code where
the return value is sometimes used as a check. E.g., when parsing a
particular file format, I expect a certain number of tokens per line.
It saves the calling function a line of code by having the size of out
returned automatically (and of course this fcn is called in multiple
places).
3) It only works for vectors; I'd write something that works for deques
and lists as well.
Agreed, I very much prefer your templated approach that takes any Output
Iterator. In my case, using a known type allowed me to return the
container size (cf. item 2), but this is just my own particular
situation and at times excessive code parsimony.
4) A cyclomatic complexity of 4 seems a tad excessive for what is
supposed to be such a simple job. You can drop that to 3 by removing
the unnecessary "if (wordEnd == in.npos)" logic. Heeding item (1)
above can reduce the complexity to 2.
Here's how I would write it:
template <typename OutIt>
void tokenize( const string& str, OutIt os, const string& delims = " " )
{
    string::size_type start = str.find_first_not_of( delims );
    while ( start != string::npos ) {
        string::size_type end = str.find_first_of( delims, start );
        *os++ = str.substr( start, end - start );
        start = str.find_first_not_of( delims, end );
    }
}
Looks good. In my case it was a bit more complicated because I also
have an additional parameter for a comment character. When a comment
character is encountered at the beginning of a token, that token is
discarded and the loop breaks. (So in my original implementation there
were multiple exit points from the loop, although I hastily trimmed
these before posting my code, thereby leaving some unattractive vestiges.)
In any event, I appreciate your comments and don't mean to simply make
excuses and argue all of your points. The only significant hitch to my
adopting your cleaner implementation is that I really do need support
for the comment-character break. Luckily this is just part of a little
file parser I use for testing, so I don't stress too much about these
details, but feel free to propose a svelte implementation that supports
a comment char. :)
Mark