Re: Good ole gnu::hash_map, I'm impressed
James Kanze wrote:
On Jul 16, 10:53 pm, Mirco Wahab <wa...@chemie.uni-halle.de> wrote:
Q1: Does anybody else (besides me) like to "hash something"?
How do you do that?
It depends. You might like to have a look at my "Hashing.hh"
header (in the code at kanze.james.neuf.fr/code-en.html -- the
Hashing component is in the Basic section). Or for a discussion
and some benchmarks,
http://kanze.james.neuf.fr/code/Docs/html/Hashcode.html. (That
article is a little out of date now, as I've tried quite a few
more hashing algorithms. But the final conclusions still hold,
more or less.)
Ah, thanks for the links. I'll work through them. I see you
used relatively small working sets. (I considered my 14 MB
setup "small" ;-)
I'd like to try your implementation for comparison, but I
don't know which files are really necessary. Do you
have a .zip of the hash stuff?
Q2: Which "future" can be expected regarding "hashing"?
There will be an std::unordered_set and std::unordered_map in
the next version of the standard, implemented using hash tables,
and there will be standard hash functions for most of the common
types. (I wonder, however. Is the quality of the hashing
function going to be guaranteed?)
We'll see - if and when some usable implementations show up. In the meantime,
the old hash_map seems to be "good enough" for my kind of stuff.
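For what it's worth, newer GCC releases already ship a TR1 flavour of
these containers, so the interface can be tried today. A rough, untested
sketch (assuming <tr1/unordered_map> is available; the names and the toy
word counts are made up):
[tr1 sketch]==>
// Untested sketch: the TR1 container that std::unordered_map is
// expected to be modelled on; requires a libstdc++ that provides
// <tr1/unordered_map> (e.g. recent GCC 4.x).
#include <tr1/unordered_map>
#include <string>
#include <iostream>

int main()
{
    typedef std::tr1::unordered_map<std::string, int> Freq;
    Freq freq;                        // hashes std::string out of the box
    ++freq["foo"];
    ++freq["bar"];
    ++freq["foo"];
    std::cout << "foo: " << freq["foo"] << " times, "
              << freq.bucket_count() << " buckets\n";
    return 0;
}
<==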
I did additional tests regarding the *reading* speed from the map.
The whole problem is now:
1) read a big text into memory (14 MB here)
2) tokenize it (with a simple regex; this appears to be fast enough)
3) put the tokens (words) into a hash and/or increment their frequencies
4) sort the hash keys (the words) by their frequencies into a vector
5) report the two highest and the lowest frequency
Now I have added steps 4 and 5. The tree-based std::map falls further
behind (as expected); the ext/hash_map keeps its margin.
std::map      (1-5)   0m8.227s real
Perl          (1-5)   0m4.732s real
ext/hash_map  (1-5)   0m4.465s real
Maybe I didn't find the optimal way to copy the hash keys into the
vector (I'll add the sources at the end). From "visual inspection" of
the test runs, the array handling (copying from hash to vector)
appears to be very efficient in Perl.
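One alternative I haven't timed yet (a rough, untested sketch): copy the
complete (word, count) pairs into the vector and sort on the count member,
so the comparison never has to look anything up in the hash again:
[pair-sort sketch]==>
// Untested sketch: sort (word, count) pairs directly instead of
// sorting keys and looking the counts up again in the comparator.
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;

struct ByCount {                       // ascending by frequency
    bool operator()(const WordCount& a, const WordCount& b) const {
        return a.second < b.second;
    }
};

// works for std::map, hash_map and unordered_map alike, e.g.
//   std::vector<WordCount> byfreq = sorted_by_count(hash);
template <class Map>
std::vector<WordCount> sorted_by_count(const Map& hash)
{
    std::vector<WordCount> v(hash.begin(), hash.end());
    std::sort(v.begin(), v.end(), ByCount());
    return v;
}
<==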
Furthermore, with the key-sorting approach in the appended source I ran
into the problem of how to access the hash values from inside the sort
functor. The only solution that (IMHO) doesn't involve enormous
complexity is to make the hash a module-level global. How can that be
cured?
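Maybe something like the following would do (a rough, untested sketch;
ExtHashSort and StdHash refer to the appended source): give the comparison
functor a reference to the map, so the map itself can stay local to main():
[functor sketch]==>
// Untested sketch: the functor carries a reference to the map
// instead of relying on a module global.
struct ExtHashSort {
    const StdHash& hash_;
    explicit ExtHashSort(const StdHash& h) : hash_(h) {}
    bool operator()(const std::string& a, const std::string& b) const {
        // every key handed to sort() came out of the map, so find() succeeds
        return hash_.find(a)->second < hash_.find(b)->second;
    }
};
// ... then in main(), with a local StdHash hash:
//   std::sort(keys.begin(), keys.end(), ExtHashSort(hash));
<==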
Regards
M.
Addendum:
[perl source] ==>
my $fn = 'fulltext.txt';
print "start slurping\n";
open my $fh, '<', $fn or die "$fn - $!";
my $data; { local $/; $data = <$fh> }
my %hash;
print "start hashing\n";
++$hash{$1} while $data =~ /(\w\w*)/g;
print "start sorting (ascending, for frequencies)\n";
my @keys = sort { $hash{$a} <=> $hash{$b} } keys %hash;
print "done, $fn (" . int(length($data)/1024) . " KB) has "
. (scalar keys %hash) . " different words\n";
print "infrequent: $keys[0] = $hash{$keys[0]} times\n"
. "very often: $keys[-2] = $hash{$keys[-2]} times\n"
. "most often: $keys[-1] = $hash{$keys[-1]} times\n"
<==
[hash_map source]==>
#include <boost/regex.hpp>
#include <algorithm>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>    // for the keys vector in main()
// define this to use the tree-based std::map
#ifdef USE_STD_MAP
#include <map>
typedef std::map<std::string, int> StdHash;
#else
#if defined (_MSC_VER)
#include <hash_map>
typedef stdext::hash_map<std::string, int> StdHash;
#else
#include <ext/hash_map>
// allow the gnu hash_map to work on std::string, see
// gcc.gnu.org/ml/libstdc++/2002-04/msg00107.html
namespace __gnu_cxx {
    template<> struct hash< std::string > {
        size_t operator()(const std::string& s) const {
            return hash< const char* >()( s.c_str() );
        }
    };
}
typedef __gnu_cxx::hash_map<std::string, int> StdHash;
#endif
#endif
char *slurp(const char *fname, size_t* len);
size_t word_freq(const char *block, size_t len, StdHash& hash);
// *** ouch, make it a module global? ***
StdHash hash;
// *** how do we better compare on the external hash? ***
struct ExtHashSort {    // comparison functor for sort()
    bool operator()(const std::string& a, const std::string& b) const {
        return hash[a] < hash[b];
    }
};
int main()
{
    using namespace std;
    size_t len, nwords;
    const char *fn = "fulltext.txt";               // about 14 MB

    cout << "start slurping" << endl;
    char *block = slurp(fn, &len);                 // read file into memory
    // StdHash hash; no more!

    cout << "start hashing" << endl;
    nwords = word_freq(block, len, hash);          // put the words into the hash and count them
    delete [] block;                               // no longer needed

    cout << "done, " << fn << " (" << len/1024
         << " KB) has " << nwords << " different words" << endl;

    vector<string> keys;                           // copy keys to vector
    keys.reserve(nwords);
    cout << "start sorting (ascending, by frequency)" << endl;
    StdHash::const_iterator p, end;
    for (p = hash.begin(), end = hash.end(); p != end; ++p)
        keys.push_back(p->first);
    sort(keys.begin(), keys.end(), ExtHashSort()); // sort keys by their frequencies

    cout << "infrequent: " << keys[0] << " = " << hash[keys[0]] << " times\n"
         << "very often: " << keys[nwords-2] << " = " << hash[keys[nwords-2]] << " times\n"
         << "most often: " << keys[nwords-1] << " = " << hash[keys[nwords-1]] << " times\n";
    return 0;
}
char *slurp(const char *fname, size_t* len)
{
    std::ifstream fh(fname, std::ios::binary);  // binary, so tellg()/read() agree on the size
    fh.seekg(0, std::ios::end);                 // go to EOF
    *len = fh.tellg();                          // read file pointer
    fh.seekg(0, std::ios::beg);                 // back to pos 0
    char* data = new char [*len+1];
    fh.read(data, *len);                        // slurp the file
    data[*len] = '\0';                          // terminate the buffer
    return data;
}
size_t word_freq(const char *block, size_t len, StdHash& hash)
{
    using namespace boost;
    match_flag_type flags = match_default;
    static regex r("\\w\\w*");                  // a "word": one or more \w characters
    cmatch match;
    const char *from = block, *to = block + len;
    while (regex_search(from, to, match, r, flags)) {
        hash[ std::string(match[0].first, match[0].second) ]++;
        from = match[0].second;                 // continue behind the last match
    }
    return hash.size();                         // number of distinct words
}
<==