Re: remove certain characters from a string
Daniel T. wrote:
Jerry Coffin <jcoffin@taeus.com> wrote:
jason.cipriani@gmail.com says...
[ ... ]
In-place, single pass through string, no unnecessary copies or moves:
void remove_chars (const string &bad, string &str) {
string::iterator s, d;
for (s = str.begin(), d = s; s != str.end(); ++ s)
if (bad.find(*s) == string::npos)
*(d ++) = *s;
str.resize(d - str.begin());
}
While this is strictly linear in terms of the manipulation of the target
string, it's still O(N*N) with respect to searches of the string holding
characters to remove.
If your list of characters to remove is very long, you might want to use
a bitmap indexed by the characters to determine whether a particular
character is allowed or not:
bool bitmap[UCHAR_MAX] = { false };
while (char *pos=bad; *pos; ++pos)
bitmap[*pos] = true;
Then instead of:
if (bad.find(*s) == string::npos)
you'd use something like:
if (!bitmap[*s])
giving constant complexity for each search. If you have a large
character set, that could waste a lot of storage though. If that's a
problem, you could sort the "bad" string, and use a binary search
instead. For a list of only 5 characters (or so) that would be a waste
of time and effort -- but if (for example) its a few hundred characters
long, the time saved could be considerable.
Good comment! I didn't think of that angle. The nice thing about using
the remove_copy_if algorithm and a functor (the solution I presented
yesterday) is that the time and effort to convert to a binary search is
minimal...
struct contains_any_of : unary_function< char, bool >
{
string str;
contains_any_of( const string& s ): str( s ) { }
bool operator()( char c ) const {
return find( str.begin(), str.end(), c ) != str.end();
}
};
str.erase( remove_if( str.begin(), str.end(),
contains_any_of( str2 ) ), str.end() );
Changing it simply requires a small chanage in the functor:
struct contains_any_of : unary_function< char, bool >
{
set<char> str;
contains_any_of( const string& s ): str( s.begin(), s.end() ) { }
bool operator()( char c ) const {
return str.count( c ) == 1;
}
};
You are right, much more efficient.
Hm, this might well be a case where logarithmic is slower than linear for
all practically relevant cases. Due to its tree-like implementation,
std::set is not very local in memory and can cause cache misses. For a
logarithmic solution, I would instead consider something based on
std::vector, e.g., like so:
struct matches_any_of : std::unary_function< char, bool > {
std::vector<char> the_str;
template < typename Iterator >
void assign ( Iterator from, Iterator to ) {
the_str.assign( from, to );
std::sort( the_str.begin(), the_str.end() );
the_str.erase( std::unique( the_str.begin(), the_str.end() ),
the_str.end() );
}
template < typename Iterator >
matches_any_of ( Iterator from, Iterator to ) {
assign( from, to );
}
matches_any_of ( char const * str ) {
assign( str, str+std::strlen( str ) );
}
matches_any_of ( std::string const & str ) {
assign( str.begin(), str.end() );
}
bool operator() ( char chr ) const {
return( std::binary_search
( the_str.begin(), the_str.end(), chr ) );
}
};
Of course, one should measure before betting on any of the solutions to be
the most efficient.
Best
Kai-Uwe Bux