Re: high traffic/availability application and gnu_cxx::hash_map problem - better to use tr1/unordered_map et al?

From:

=?ISO-8859-1?Q?Marcel_M=FCller?= <news.5.maazl@spamgourmet.org>

Newsgroups:

comp.lang.c++

Date:

Wed, 05 Jan 2011 13:06:19 +0100

Message-ID:

<4d245ed2$0$6987$9b4e6d93@newsspool4.arcor-online.net>

Alex Wanderleit wrote:

When the second hash is ready (i.e. contains only the required items),
the main hash is cleared ("clear") and the second hash is assigned
to the main hash, then the second hash is cleared - all in one go:

hash.clear();
hash = hashTemp;
hashTemp.clear();

You should avoid copying large container. Especially not within the
mutex where the main lookup table is locked.

Use the swap method that almost all STL like containers provide. instead.

This works once after 24 hours - but: every second day the main
process gets slower and slower short after these assignments,

What says the memory footprint of your application? Do you leak memory
somewhere?

From an strace I can observe that the assignments above are performed
(even though this takes about 15 seconds to complete). A subsequent
config file open and read (that normally lasts less than 0.001
seconds)
now takes 7 seconds (!). And this slowness also applies to
all socket interactions (read/write) and other system calls (time,
getpeername, ...).

Looks like memory pollution. Maybe the internal memory pool of the
runtime is highly fragmented. In this case you should use a custom
allocator for each hash instance, that allocates from a private, hash
instance specific memory pool. Preferably requested in large chunks
directly from the operating system. Once the hash gets replaced, simply
destroy the old hash and the entire memory pool it uses. This will clean
up gracefully.

Currently I suspect that the gnu_cxx::hash_map performs
some action behind the scenes that triggers this behaviour.

Maybe. But maybe your code causes small leaks of memory that perforate
the virtual memory space. This can lead to highly inefficient memory
access and can completely eliminate the efficiency of any cache memory
in your computer, causing a slow down of some orders of magnitude.

There is no *alloc system call in the strace visible - so
what could cause the delay? (Is the gnu_cxx::hash_map doing
something nasty in another thread not monitored by strace?)

May be some kind of rehashing takes place?

Or excessive swapping occurs?

Maybe. But your operating system should tell you that.

Is there a way to find out what causes the observed behaviour?

Track your memory usage.

As far as I can see from the documentation, it is not possible
to control internal behaviour of rehashing or memory control in
general with __gnu_cxx::hash_map (though I perform a "reserve" of
10.000.000 buckets at startup):

What would not be that bad, if you actually know the approximate size.
Especially in conjunction with the swap instead of the assignment, you
could allocate the second hash, that becomes the first one after the
next swap, pretty accurate.

This is actually annoying, since the process consumes more and
more memory (1.9GB at startup) over time until a certain
"gnu_cxx::hash_map satisfaction"-level is reached (up to 2.6GB and
more -
which should still be within the limit of a 32bit process)

No. Usually the limit is 2GB private memory for each process plus 2GB
shared memory for /all/ 32 Bit processes. Your memory is allocated from
the private arena.
Most operating systems provide a kernel setting to enlarge the private
arena at the expense of less shared memory. But this is a system wide
setting and may cause other serious impact.

So everything is clear. Memory is your problem.

You should strongly reduce the memory footprint of your application at
least by a factor of 2. Or choose the easy way and switch to 64 bit
environment (hardware and software).

- but you can't
influence that behaviour - it is not possible to grab the required
memory in advance - even not with "reserve".

You can allocate a memory pool in advance where the custom allocator of
your hash map takes the storage from. However keep in mind that most of
your storage might not be located in the hashmap but rather in the
string IDs. This is the first point, where I would start optimization.

- 1. Store your IDs in a binary, compressed representation rather than
in a readable string. E.g. do not use hex numbers.

- 2. Check your IDs. Are they all of similar and limited length?
=> In this case use a fixed char array as key parameter of your hash.

- 3. Check if your IDs follow some regular pattern. E.g. all IDs start
with a limited set of of values. In this case you should split your IDs
into pieces of fixed length and create a tree of lookup tables instead
of one large one. This has the benefit of not storing common parts of
the keys for each entry redundantly some million times like a relational
database. Of course, the nested lookup tables should not contain the
common parts of the keys anymore. They derive it from their location in
the object tree.

- 4. Maybe use an adaptive structure that uses simple, fixed lookup
tables for the first parts of the keys and hash tables only in deeper
elements. E.g. use an array of 256 entries which contains the subtables
for any key that starts with the byte matching the array index. This is
highly effective because this byte of the key is no longer stored at all
into memory. You can apply this recursively as long as the the
fillfactor keeps high enough. I.e. as long as it is likely that many of
the entries in the lookup table do exist. The nested lookup objects
should only be created if at leas one entry uses them.
Of course, the exponential growth of this structures will limit the
usage to only a few levels. But for these few levels it is extremely fast.
If your keys are well distributed, you can also collect several bytes
into one lookup table with e.g. 65536 entries. This is even more
effective because you save the pointers for the nested tables.
The break even is quite easy to calculate: If you have 10 mio. entries
an use a 2 byte lookup table with 65536 entries you save 20 MB for the
first two bytes of the keys and need only 256kB for the lookup table.
OK, the overhead of the nested hash maps will add approximately another
one or two MB, but no more. If your keys are not that well distributed
it will be even more effective, because some of the nested hash maps are
not created at all. This is usually the major indication for such a design.

- 5. Align the nesting levels of your data structures at the logical
structure of your IDs. E.g. if you want to make a DNS service, the dots
in the domain name could be used. The top level structure contains one
entry for each TLD. The nested structures are for the subdomains a.s.f.
Since the value distribution in each sub map is limited and mainly
independant form other submaps (except for 'www', which should be
handled special) there is almost no redundancy in memory.
You mentioned a concationation of your IDs. This suggests that there is
a structure in the keys.

If you optimize your code well, it should be possible to process 10
million entries with a memory footprint of at most 300MB! Maybe even less.

Also the clear should enable the hash_map implementation to
start over with the already allocated memory - there shouldn't
be a "swiss cheese" memory allocation.

The clear is a design flaw anyway.

Should <tr1/unordered_map> be tried instead with a newer version
of gcc? Would there be more "control" over memory consumption or
rehashing?

This will most likely not solve your problem.

Marcel