Re: How to find out memory occupied by vector, set, or map?
On May 20, 4:58 am, Lambda <stephenh...@gmail.com> wrote:
I'm trying to develop several interesting components
of a simple search engine follow some text books.
A book introduces some ways to compress dictionary,
allow it to stay on main memory.
In the dictionary, is a list of millions of words
with related data of several bytes.
Word / Term Freq(int) / Pointer to disk file
Hello 100
One method is:
concatenate all the words into one long contiguous string
and an array of 4-byte character pointers is used to access.
A classical string pool. It's not a solution to your problem,
just a very special way to represent strings.
But I think STL map is a suitable data structure for this
problem. I don't know the detailed internal implementation of
map. I think it's implemented as a binary search tree. Every
node occupy just enough memory. (I'm not sure) I also need the
search feature of BST.
Every node is dynamically allocated, and contains several
pointers in addition to the raw data, so it can be fairly
expensive in terms of memory use. On the other hand, for
something small, like a couple of million elements, why not.
Note that access time is O(lg n), and not constant. For more
than a couple of thousand elements, a hash table will be faster
(provided you use a good hash function).
So I'd like to know if it is a need to build the data
structure myself. How can I know the whole memory occupied by
a map, as well as other data structures such as vector, set.
(Or memory occupied by a map element).
You can't really, since there are so many hidden overheads. I'd
just go ahead and implement using std::map, but carefully
encapsulating it so that I could replace it if it did cause a
problem later.
BTW, should I use std::string or char[] to store the millions
of words?
std::string, definitely.
How much is the string overhead?
Again, it depends on the implementation. A lot of
implementations today use a small string optimization which
avoids dynamic allocation for strings shorter than some cut-off
size; if your words are English, there's a good chance that most
of them will fall under this limit.
All sorts of optimizations are possible, once you have the
application working, and can see if they're necessary.
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orient=E9e objet/
Beratung in objektorientierter Datenverarbeitung
9 place S=E9mard, 78210 St.-Cyr-l'=C9cole, France, +33 (0)1 30 23 00 34