Re: __gnu_cxx::hash_set efficient union
On 2007-08-16 17:58, mark.dufour@gmail.com wrote:
hi all,
as part of an 'implicitly statically typed Python' to C++ compiler
called Shedskin I'm working on (http://mark.dufour.googlepages.com),
I'm using __gnu_cxx::hash_set to implement Python sets. unfortunately
I can't get the performance to quite match that of set. for example,
this is how I thought set union would be reasonably fast:
template<class T> set<T> *set<T>::_union(set<T> *s) {
set<T> *c = new set<T>();
c->units = units;
it2 = s->units.end();
for(it1 = s->units.begin(); it1 != it2; it1++)
c->units.insert(*it1);
return c;
}
(some context:
units is of type __gnu_cxx::hash_set<T, hashfunc<T>, hasheq<T>,
gc_allocator<T> > // Boehm GC
template<class T> class hashfunc
{
public: int operator()(T t) const { return hasher<T>(t); }
};
template<> int hasher(int a) { return a; }
template<class T> class hasheq {
public: int operator()(T t, T v) const { return __eq(t,v); }
};
template<> int __eq(int a, int b) { return a == b; }
)
..but for a simple union of two large sets of ints, this approach is
about three times slower than under the Python interpreter. just
copying the left argument (c->units = units;) is slower than the
complete union in Python..
am I doing something really silly here, or is hash_set just not that
efficient? perhaps the new unordered_set will be much more
performant..? (btw, will unordered_set have union etc. members?)
here is a link to the blog posting that triggered this, with some
Python code that becomes much slower after compilation:
http://pyinsci.blogspot.com/2007/08/set-implementation-performance.html
any help is much appreciated! suggestions on how to improve
performance of the implementations for the other Python builtins are
also very much welcome.
I don't know how Python's sets works, but what's wrong with std::set and
std::set_union()? I can see no reason to use a hash when dealing with
ints, since they can easily be compared as they are, without having to
compute their hashes first.
Also note that for performance std::vector is often best, so put the
ints in the two sets into two vectors and sort them (if they are not
sorted during insertion). Then create a new vector large enough to
contain all the ints in both sets and use std::set_union() to get the union.
--
Erik Wikstr?m