Re: Good ole gnu::hash_map, I'm impressed

From:

Mirco Wahab <wahab@chemie.uni-halle.de>

Newsgroups:

comp.lang.c++

Date:

Thu, 17 Jul 2008 14:07:49 +0200

Message-ID:

<g5ncmk$9k$1@nserver.hrz.tu-freiberg.de>

James Kanze wrote:

On Jul 16, 10:53 pm, Mirco Wahab <wa...@chemie.uni-halle.de> wrote:

Q1: Does anybody else (besides me) like to "hash something"?
How do you do that?

It depends. You might like to have a look at my "Hashing.hh"
header (in the code at kanze.james.neuf.fr/code-en.html---the
Hashing component is in the Basic section). Or for a discussion
and some benchmarks,
http://kanze.james.neuf.fr/code/Docs/html/Hashcode.html. (That
article is a little out of date now, as I've tried quite a few
more hashing algorithms. But the final conclusions still hold,
more or less.)

Ah, thanks for the links. I'll work through it. I see, you
took relatively small working sets. (I considered my 14MB
setup "small" ;-)

I'd try to use your implementation in comparision but
don't know which files are really necessary. Do you
have a .zip of the hash stuff?

Q2: Which "future" can be expected regarding "hashing"?

There will be an std::unordered_set and std::unordered_map in
the next version of the standard, implemented using hash tables,
and there will be standard hash functions for most of the common
types. (I wonder, however. Is the quality of the hashing
function going to be guaranteed?)

We'll see - if some usable implementations show up. In the mean time,
the old hash_map seems to be "good enough" for my kind of stuff.
I did additional tests regarding the *reading* speed from the map.

The whole problem would be now:

1) read a big text to memory (14 MB here)
2) tokenize it (by simple regex, this seems to be very fast or fast enough)
3) put the tokens (words) into a hash and/or increment their frequencies
4) sort the hash keys (the words) according to their frequencies into a vector
5) report highest (2) and lowest (1) frequencies

Now I added 4 and 5. The tree-based std::map falls further behind
(as expected). The ext/hash_map keeps its margin.

  std::map (1-5) 0m8.227s real
  Perl (1-5) 0m4.732s real
  ext/hash_map (1-5) 0m4.465s real

Maybe I didn't find the optimal solution for copying the hash keys to the
vector (I'll add the source at the end).

From "visual inspection" of the test runs, it
can be seen that the array handling (copying
from hash to vector) is very efficient in Perl.

Furthermore, I run into the problem of how-to access the hash values
from a sort function. The only solution that (imho) doesn't involve
enormous complexity, just puts the hash module-global. How to cure that?

Regards

M.

Addendum:

[perl source] ==>
my $fn = 'fulltext.txt';
print "start slurping\n";
open my $fh, '<', $fn or die "$fn - $!";
my $data; { local $/; $data = <$fh> }

my %hash;
print "start hashing\n";
++$hash{$1} while $data =~ /(\w\w*)/g;

print "start sorting (ascending, for frequencies)\n";
my @keys = sort { $hash{$a} <=> $hash{$b} } keys %hash;

print "done, $fn (" . int(length($data)/1024) . " KB) has "
     . (scalar keys %hash) . " different words\n";

print "infrequent: $keys[0] = $hash{$keys[0]} times\n"
     . "very often: $keys[-2] = $hash{$keys[-2]} times\n"
     . "most often: $keys[-1] = $hash{$keys[-1]} times\n"
<==

[hash_map source]==>
#include <boost/regex.hpp>
#include <algorithm>
#include <iostream>
#include <fstream>
#include <string>

// define this to use the tree-based std::map
#ifdef USE_STD_MAP
   #include <map>
   typedef std::map<std::string, int> StdHash;
#else
   #if defined (_MSC_VER)
     #include <hash_map>
     typedef stdext::hash_map<std::string, int> StdHash;
   #else
     #include <ext/hash_map>
     namespace __gnu_cxx {
        template<> struct hash< std::string > {
           size_t operator()(const std::string& s) const {
              return hash< const char* >()( s.c_str() );
           } // gcc.gnu.org/ml/libstdc++/2002-04/msg00107.html
        }; // allow the gnu hash_map to work on std::string
     }
     typedef __gnu_cxx::hash_map<std::string, int> StdHash;
   #endif
#endif

char *slurp(const char *fname, size_t* len);
size_t word_freq(const char *block, size_t len, StdHash& hash);

// *** ouch, make it a module global? ***
StdHash hash;
// *** how do we better compare on the external hash? ***
struct ExtHashSort { // comparison functor for sort()
    bool operator()(const std::string& a, const std::string& b) const {
       return hash[a] < hash[b];
    }
};

  int main()
{
  using namespace std;
  size_t len, nwords;

  const char *fn = "fulltext.txt"; // about 14 MB
  cout << "start slurping" << endl;
  char *block = slurp(fn, &len); // read file into memory

  // StdHash hash; no more!
  cout << "start hashing" << endl;
  nwords = word_freq(block, len, hash); // put words into a hash
  delete [] block; // no longer needed

  cout << "done, " << fn << " (" << len/1024
       << "KB) has " << nwords << " different words" << endl;

  vector<string> keys;
  keys.reserve(nwords);

  cout << "sorting out the longest and shortest words" << endl;
  StdHash::const_iterator p, end; // copy keys to vector
  for(p=hash.begin(),end=hash.end(); p!=end; ++p) keys.push_back(p->first);
  sort(keys.begin(), keys.end(), ExtHashSort()); // sort by hashed number value

  cout << "infrequent:" << keys[0] << "=" << hash[keys[0]] << " times\n"
       << "very often:" << keys[nwords-2] << "=" << hash[keys[nwords-2]] << " times\n"
       << "most often:" << keys[nwords-1] << "=" << hash[keys[nwords-1]] << " times\n";

  return 0;
}

  char *slurp(const char *fname, size_t* len)
{
  std::ifstream fh(fname); // open
  fh.seekg(0, std::ios::end); // get to EOF
  *len = fh.tellg(); // read file pointer
  fh.seekg(0, std::ios::beg); // back to pos 0
  char* data = new char [*len+1];
  fh.read(data, *len); // slurp the file
  return data;
}

  size_t word_freq(const char *block, size_t len, StdHash& hash)
{
  using namespace boost;
  match_flag_type flags = match_default;
  static regex r("\\w\\w*");
  cmatch match;

  const char *from=block, *to=block+len;
  while( regex_search(from, to, match, r, flags) ) {
     hash[ std::string(match[0].first, match[0].second) ]++;
     from = match[0].second;
  }
  return hash.size();
}
<==