Re: Memory allocation failure in map container

From:
James Kanze <james.kanze@gmail.com>
Newsgroups:
comp.lang.c++
Date:
Sat, 8 Jan 2011 07:10:48 -0800 (PST)
Message-ID:
<1e2177f9-faee-4dbf-9934-ffb64fd012d7@e20g2000vbn.googlegroups.com>
On Jan 5, 7:51 pm, Paavo Helde <myfirstn...@osa.pri.ee> wrote:

James Kanze <james.ka...@gmail.com> wrote in news:5b17c708-0ca7-40aa-9dfd-4c85505a0a70@p8g2000vbs.googlegroups.com:

On Jan 4, 5:31 pm, Paavo Helde <myfirstn...@osa.pri.ee> wrote:

Saeed Amrollahi <amrollahi.sa...@gmail.com> wrote in
news:fe8f1a4b-1315-4755-a30a-3566a52bd...@i17g2000vbq.googlegroups.com:

I am working on an application to manage the Tehran Securities
Exchange organization. In this program, I try to manipulate
a lot of institutional investors (about 1,500,000). The
investors are a struct like this:

struct Investor {
  int ID;
  std::wstring NID;
  std::wstring Name;
  std::wstring RegNum;
  std::wstring RegDate;
  std::wstring RegProvince;
  std::wstring RegCity;
  std::wstring Type;
  std::wstring HQAddr;
  // the deblanked or squeezed info
  std::wstring sq_Name;
  std::wstring sq_RegNum;
  std::wstring sq_RegProvince;
  std::wstring sq_RegCity;
};


Size of this struct is 584 bytes with my compiler. 1500000*584 is
ca. 835 MB. You are holding them twice (both in the vector and in
the map), which makes it ca. 1.7 GB. This brings you already quite
close to the maximum limit for 32-bit applications (2 GB on Windows
by default), hence the memory problems are to be expected.
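
For concreteness, a quick back-of-the-envelope check of those
numbers; the 584 is simply the sizeof(Investor) reported above, so
treat it as an assumption about one particular compiler:

#include <cstddef>
#include <iostream>

int main()
{
    std::size_t const count = 1500000;     // investors
    std::size_t const per   = 584;         // sizeof(Investor) above
    double const mb = 1024.0 * 1024.0;
    std::cout << "one copy:   " << count * per / mb     << " MB\n"; // ~835
    std::cout << "two copies: " << 2 * count * per / mb << " MB\n"; // ~1671
    return 0;
}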


Independently of sizeof(Investor), all those strings are likely
to require additional memory (unless they are empty, or unless
the compiler uses the small string optimization, and they are
small enough---but things like names, provinces and cities
generally aren't). This could easily double the necessary
memory (with small string optimization) or multiply it by 5 to
10, or more (without small string optimization, but then,
sizeof(Investor) would probably be around 64).


You are right of course, I was assuming most of these strings
are short and fit in the small string optimization buffer.


What are usual values for the small string cutoff? I seem to
remember 8 being used somewhere, but if sizeof this structure is
584, and wchar_t is 2 bytes (Windows), then I'd guess more
likely 16 (or 8 if wchar_t is 4 bytes, as is the case on most
Unix systems). With 16, you're almost certainly right: most of
these fields would fit into 16 characters (supposing the field
names mean what they seem to mean, and judging from the names I
see on maps, etc.). That does change things.
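
A quick, implementation-specific way to probe this (nothing here is
guaranteed by the standard; on implementations with the small string
optimization, capacity() of a default-constructed string usually
reports the size of the internal buffer):

#include <iostream>
#include <string>

int main()
{
    std::wstring w;
    std::cout << "sizeof(std::wstring): " << sizeof(std::wstring) << '\n';
    std::cout << "default capacity():   " << w.capacity() << '\n';
    return 0;
}

On VC++, the capacity comes out as 7 (a 16 byte buffer, wchar_t
being 2 bytes there); g++'s reference counted implementation has no
small string buffer at all, and prints 0.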

If he used UTF-8 encoded
std::string instead of std::wstring, this assumption might be true more
often (assuming the texts are mostly ASCII of course).


He said that this was for the Tehran securities exchange. The
local language in Tehran uses the Farsi alphabet, a modified
(or extended) form of the Arabic alphabet. In UTF-8, these are
all two byte characters, in the range D8 80--DB BF. If wchar_t
uses UTF-16, he gains absolutely nothing. (If it uses UTF-32,
on the other hand, he divides his memory use by 2. Maybe.)

As to whether it fits into the small string optimization or not,
all depends on how that is implemented. If the optimization
limit depends on the number of code points (char or wchar_t, and
the sizeof you mention would seem to suggest that), then he will
miss the small string optimization more often using UTF-8 than
using UTF-16 or UTF-32. (If the limit depends on the number of
bytes, then UTF-16 and UTF-8 are roughly equal, supposing the
text is mostly written in Farsi.)
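
For what it's worth, the relative sizes are easy to check with the
new C++0x string literals (if your compiler already supports them);
the sample word is just "Tehran" written in the Farsi alphabet:

#include <iostream>

int main()
{
    char const     utf8[]  = u8"تهران";  // five Arabic-block code points
    char16_t const utf16[] = u"تهران";
    std::cout << "UTF-8 bytes:       " << sizeof(utf8) - 1 << '\n'; // 10
    std::cout << "UTF-16 code units: "
              << sizeof(utf16) / sizeof(utf16[0]) - 1 << '\n';      // 5
    return 0;
}

Ten bytes either way: two bytes per character in UTF-8, one 16 bit
unit per character in UTF-16.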

The correct solution would be to redesign the application so that you
don't need to load all the data in the memory at once. Normally
database applications are happy to load only the data they need at
the moment.


It's hard to say what the correct solution is without more
context, but there's a very good chance you're right, at least
if the actual data is stored in a relational database.

Another solution would be to use a 64-bit machine, OS and
compilation; this lifts the limits a bit, but it does not help if
your database happens to grow 100 times.


If he's near the limits on a 32-bit machine, a 64-bit machine
should multiply the number he can handle by significantly more
than 100. Moving to 64 bits is probably the simplest
solution, *IF* the program really needs to keep all of the data
in (virtual) memory.


The address space limits would be gone on a 64-bit machine, but the
new limits will be set by the physical RAM and disk space present.
A normal desktop PC would have e.g. 8 GB of RAM, which is only 4
times larger than the 2 GB limit for Windows. The disks are maybe
200 GB or something; if one (ab)uses all of that for the pagefile,
then one could theoretically hold ca. 100 times more data in process
memory than a 32-bit process can (assuming one has enough patience
to wait for such a process to complete anything!).


Terabyte disks and larger are readily available, and not
expensive. On a 64-bit system, he should easily be able to get
rid of the bad_alloc. In return, though, there would be so much
swapping that the system becomes unusable.

Yet another solution is to reduce the size of
Investor, for example by putting all the strings together in a
single std::wstring, separated by some delimiter like a zero
character. This would make finding, accessing and changing the
individual fields slower and more cumbersome, of course.


I'm not sure that that would gain that much.


There is a lot of space overhead associated with each std::wstring.
In my implementation, it seems the size of std::wstring is 32 bytes
in 32-bit compilations, of which only 16 bytes are used for the
small string buffer.


That seems like a lot, although... It's not too difficult to
implement the small string optimization with an overhead of only
two pointers. On the other hand, any quality implementation
will have a debugging option, which will require extra pointers
to keep a list of iterators, etc. From what I've seen, I have
the impression that VC++ tries to make it possible to link
debugging and non-debugging versions (although at least in VC8,
this doesn't actually work), and so uses the extra space even
when debugging is turned off.

For example, if all the strings are very short and fit in the small
string buffers, then one can easily reduce the overall memory
consumption by half by packing them all into a single string. If
the strings are large, then this would not help, of course.


If the strings are large, you would only have one allocation,
instead of many. And each allocation has significant overhead
as well (maybe 16 bytes), not to mention the small string
buffer which isn't used.

No, you're right. Using a single string with a separator will
represent a significant gain (unless the strings are really,
really long, which I doubt is the case). Regardless of the
implementation of std::basic_string, and whether the strings fit
into the small string buffer or not.
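
A minimal sketch of what that packing might look like; the names
are purely illustrative, and a real accessor would validate the
index:

#include <cstddef>
#include <string>

struct PackedInvestor
{
    int          ID;
    std::wstring fields;   // NID '\0' Name '\0' RegNum '\0' ...

    // Extract the n-th field; assumes n is less than the field count.
    std::wstring field(std::size_t n) const
    {
        std::size_t begin = 0;
        while (n-- > 0)
            begin = fields.find(L'\0', begin) + 1;
        return fields.substr(begin, fields.find(L'\0', begin) - begin);
    }
};

One int plus one wstring per Investor, instead of thirteen wstrings.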

What is probable,
on the other hand, is that there are a lot of duplicates in the
fields with province and city; using a string pool for these
could help. And it may be possible to generate the sq_...
values algorithmically from the non sq_ value; in that case,
they could be eliminated completely from the structure. But
none of these optimizations will last for long, if the table
significantly increases in size.


Right


Even without many duplicates: using string pooling would allow
putting all of the actual character data in a single string,
which, as you correctly point out, is a win in itself. Each
string in his structure would have type PooledString, which
would consist of two size_t (indexes to the begin and end of the
string in the pool). On a 32 bit machine, that's 8 bytes,
instead of 584; there's still the overhead for the pool, but it
wouldn't surprise me if this halved the total memory use.
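
Something along these lines; the names are mine, and a production
pool would presumably also intern duplicates (the provinces and
cities) through a map from the string to its existing entry:

#include <cstddef>
#include <string>

struct PooledString        // 8 bytes on a 32 bit machine
{
    std::size_t begin;
    std::size_t end;
};

class StringPool
{
    std::wstring data;     // all of the character data, back to back
public:
    PooledString add(std::wstring const& s)
    {
        PooledString p = { data.size(), data.size() + s.size() };
        data += s;
        return p;
    }
    std::wstring get(PooledString p) const
    {
        return data.substr(p.begin, p.end - p.begin);
    }
};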

--
James Kanze
