Re: Tokenizer Function (plus rant on strtok documentation)

From: "Robbie Hatley" <bogus.address@no.spam>
Newsgroups: comp.lang.c++
Date: Tue, 11 Jul 2006 06:31:59 GMT
Message-ID: <zBHsg.129441$dW3.67625@newssvr21.news.prodigy.com>

"jmoy" <jmoy.matecon@gmail.com> wrote:

> strtok is one of the weird functions that maintain internal state, so
> that you cannot tokenize two strings in an interleaved manner or use it
> in a multithreaded program. POSIX offers a strtok_r which is somewhat
> saner.

Ah, sort of like the code my ex-boss left me to maintain after he
got fired. Hundreds of global variables, which he used to pass
data from function to function, like a dumbass. Of course, since
the program is a complex Windows app with timers and interrupts,
the data often gets overwritten on its way from one place to
another. ::sigh:: Global variables are the work of Sauron.

> I guess tying the tokenizer to vector<string> is not a good idea.

It does limit the user to a std::vector<std::string>, yes. However,
that construct is pretty good for this app. I find it hard to
think of cases which couldn't use that to hold a bunch of tokens.

> If it took an output iterator it could be used with any container
> or even with things like ostream_iterators.

Provided that the output container was big enough. If you start
with an empty container and try writing to it using output
iterators, you'll get an "illegal memory access" or "general
protection fault" or some such thing. So you'd have to make sure
that the container was huge. I don't like that approach.
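
To be fair, that failure needs a plain iterator into existing
storage (say, v.begin() on an empty vector). An insert iterator such
as std::back_inserter calls push_back() on every assignment, so the
container grows as needed. A minimal sketch, assuming the quoted
output-iterator tokenize is used as-is; split_to_vector, demo and
the sample string are made-up names for illustration only:

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// The quoted output-iterator tokenizer, unchanged apart from
// spelling out std:: instead of "using namespace std".
template <class OIter>
void tokenize(const std::string& str, const std::string& delim, OIter oi)
{
    typedef std::string::size_type Sz;
    Sz begin = 0;
    while (begin < str.size()) {
        Sz end = str.find_first_of(delim, begin);
        *oi++ = str.substr(begin, end - begin);
        begin = str.find_first_not_of(delim, end);
    }
}

// Hypothetical helper: tokenize into a vector that starts out empty.
// std::back_inserter turns each "*oi++ = x" into words.push_back(x),
// so nothing is ever written past the end of the container.
std::vector<std::string> split_to_vector(const std::string& s,
                                         const std::string& delim)
{
    std::vector<std::string> words;
    tokenize(s, delim, std::back_inserter(words));
    return words;
}

// Hypothetical self-check on a made-up input.
void demo()
{
    std::vector<std::string> words = split_to_vector("one two  three", " ");
    assert(words.size() == 3);
    assert(words[2] == "three");
}
```

That takes care of the overrun, though not of strings that start
with a delimiter.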

> #include <string>
> using namespace std;
> template <class OIter> void tokenize( const string &str,
>                                         const string &delim,
>                                         OIter oi)
> {
>         typedef string::size_type Sz;
>
>         Sz begin=0;
>         while(begin<str.size()){
>                 Sz end=str.find_first_of(delim,begin);
>                 *oi++=str.substr(begin,end-begin);
>                 begin=str.find_first_not_of(delim,end);
>         }
> }
>
> I use find_first_not_of in order to be compatible with strtok's
> behaviour of treating multiple adjacent delimiters as a single
> delimiter. I have not measured the performance of this version against
> the strtok version.

Alluring in its simplicity, yes. But it has two major bugs:

1. Memory corruption danger if used to write to a small container.
2. You don't take into account the fact that the string might START
   with one or more delimiters.

Maybe something like THIS might be better:

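#include <string>
// using namespace std; // Ewww.
template <class Container>
void
tokenize
   (
      const std::string & str,
      const std::string & delim,
      Container & C
   )
{
   typedef std::string::size_type Sz;
   Sz begin = 0;
   Sz end = 0;
   while (begin < str.size())
   {
      // Skip any delimiters in front of the next token.
      begin = str.find_first_not_of (delim, begin);
      // Nothing but delimiters left (e.g. trailing ones): we're done.
      if (begin == std::string::npos) break;
      end = str.find_first_of (delim, begin);
      // If end is npos, substr() just takes the rest of the string.
      C.push_back(str.substr(begin, end-begin));
      // Resume after this token; npos here ends the loop.
      begin = end;
   }
}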

I haven't tested that, but I think something like that would work
better. It does require that the container for the tokens have
the push_back() method defined. Other than that, it's pretty
generic.

Note that to take care of the "starts with delimiters" case,
I simply moved your "first_not_of" up to the top of the loop.
That should work nicely.
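
A quick way to check the leading- and trailing-delimiter cases is
something like the sketch below. It repeats the function so that it
compiles on its own, with begin advanced to end after each token and
a bail-out when find_first_not_of returns npos (without those two,
a loop of this shape either never terminates or hands npos to
substr()). The demo function and the sample strings are made up for
illustration:

```cpp
#include <cassert>
#include <list>
#include <string>
#include <vector>

// Container-based tokenizer: works with anything that has push_back().
template <class Container>
void tokenize(const std::string& str, const std::string& delim, Container& C)
{
    typedef std::string::size_type Sz;
    Sz begin = 0;
    while (begin < str.size()) {
        // Skip delimiters in front of the next token.
        begin = str.find_first_not_of(delim, begin);
        if (begin == std::string::npos) break;   // only delimiters left
        Sz end = str.find_first_of(delim, begin);
        C.push_back(str.substr(begin, end - begin));
        begin = end;                              // npos ends the loop
    }
}

// Hypothetical self-check on made-up inputs.
void demo()
{
    // Leading, doubled, and trailing delimiters are all skipped.
    std::vector<std::string> v;
    tokenize(",,alpha,beta,,gamma,", ",", v);
    assert(v.size() == 3);
    assert(v[0] == "alpha" && v[1] == "beta" && v[2] == "gamma");

    // A string of nothing but delimiters yields no tokens at all.
    std::list<std::string> l;
    tokenize("   ", " ", l);
    assert(l.empty());
}
```

The same call works for std::vector, std::list and std::deque, since
push_back() is all it asks of the container.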

--
Cheers,
Robbie Hatley
East Tustin, CA, USA
lone wolf intj at pac bell dot net
(put "[usenet]" in subject to bypass spam filter)
http://home.pacbell.net/earnur/