Re: remove certain characters from a string

From:

Kai-Uwe Bux <jkherciueh@gmx.net>

Newsgroups:

comp.lang.c++

Date:

Mon, 26 May 2008 07:43:28 -0400

Message-ID:

<483aa261$0$25950$6e1ede2f@read.cnntp.org>

Daniel T. wrote:

Jerry Coffin <jcoffin@taeus.com> wrote:

jason.cipriani@gmail.com says...

[ ... ]

In-place, single pass through string, no unnecessary copies or moves:

void remove_chars (const string &bad, string &str) {
  string::iterator s, d;
  for (s = str.begin(), d = s; s != str.end(); ++ s)
if (bad.find(*s) == string::npos)
*(d ++) = *s;
  str.resize(d - str.begin());
}

While this is strictly linear in terms of the manipulation of the target
string, it's still O(N*N) with respect to searches of the string holding
characters to remove.

If your list of characters to remove is very long, you might want to use
a bitmap indexed by the characters to determine whether a particular
character is allowed or not:

bool bitmap[UCHAR_MAX] = { false };

while (char *pos=bad; *pos; ++pos)
bitmap[*pos] = true;

Then instead of:

if (bad.find(*s) == string::npos)

you'd use something like:

if (!bitmap[*s])

giving constant complexity for each search. If you have a large
character set, that could waste a lot of storage though. If that's a
problem, you could sort the "bad" string, and use a binary search
instead. For a list of only 5 characters (or so) that would be a waste
of time and effort -- but if (for example) its a few hundred characters
long, the time saved could be considerable.

Good comment! I didn't think of that angle. The nice thing about using
the remove_copy_if algorithm and a functor (the solution I presented
yesterday) is that the time and effort to convert to a binary search is
minimal...

struct contains_any_of : unary_function< char, bool >
{
   string str;
   contains_any_of( const string& s ): str( s ) { }
   bool operator()( char c ) const {
      return find( str.begin(), str.end(), c ) != str.end();
   }
};

   str.erase( remove_if( str.begin(), str.end(),
      contains_any_of( str2 ) ), str.end() );

Changing it simply requires a small chanage in the functor:

struct contains_any_of : unary_function< char, bool >
{
   set<char> str;
   contains_any_of( const string& s ): str( s.begin(), s.end() ) { }
   bool operator()( char c ) const {
      return str.count( c ) == 1;
   }
};

You are right, much more efficient.

Hm, this might well be a case where logarithmic is slower than linear for
all practically relevant cases. Due to its tree-like implementation,
std::set is not very local in memory and can cause cache misses. For a
logarithmic solution, I would instead consider something based on
std::vector, e.g., like so:


struct matches_any_of : std::unary_function< char, bool > {

  std::vector<char> the_str;

  template < typename Iterator >
  void assign ( Iterator from, Iterator to ) {
    the_str.assign( from, to );
    std::sort( the_str.begin(), the_str.end() );
    the_str.erase( std::unique( the_str.begin(), the_str.end() ),
                   the_str.end() );
  }

  template < typename Iterator >
  matches_any_of ( Iterator from, Iterator to ) {
    assign( from, to );
  }

  matches_any_of ( char const * str ) {
    assign( str, str+std::strlen( str ) );
  }

  matches_any_of ( std::string const & str ) {
    assign( str.begin(), str.end() );
  }

  bool operator() ( char chr ) const {
    return( std::binary_search
            ( the_str.begin(), the_str.end(), chr ) );
  }

};

Of course, one should measure before betting on any of the solutions to be
the most efficient.

Best

Kai-Uwe Bux