Re: STL map or hash map using struct as data and find it

From:

James Kanze <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++

Date:

Mon, 31 Dec 2007 06:52:14 -0800 (PST)

Message-ID:

<781d78ae-4f01-491a-94fb-6c431c46add8@j20g2000hsi.googlegroups.com>

On Dec 31, 1:51 pm, kl <kjell.ll...@gmail.com> wrote:

On 31 Dec, 13:21, James Kanze <james.ka...@gmail.com> wrote:

On Dec 31, 11:48 am, kl <kjell.ll...@gmail.com> wrote:

[...]

It is easy to use vector, linked list but I would like to try
to use map or hash map for more effecient way to look up or at
least test it out to compare some.

A sorted vector and lower_bound is likely to be faster than
std::map, since it will make roughly the same number of
comparisons, and will have a lot better locality. If speed
is really an issue, of course, the solution in this
particular case is probably to use a trie starting with the
high order byte. My guess is that a single table lookup
(using direct indexation) will suffice 99% of the time, or
more.

Why I would like to have some speed up is because the file has
around 80'000 lines

But how often are you doing a look-up. Still, 80000 lines is
probably enough to want to use something more efficient than a
linear search.

[...]

//Initilize the map when and call this function during start up
void initilizeMap()
{
               //read data from file and insert it into a map
        IPCOUNTRY g_structIP;
        FILE *fp=NULL;
        fp=fopen("ipcountry.dat", "rb");
        if(fp!=NULL)
        {
                while( !feof( fp ) )
                {
                        if(fread(&g_structIP, sizeof(g_structIP), 1, f=

p)!=0)

{
mIPcountry[g_structIP.startIP] = g=

_structIP;

                        }
                }
        }
        fclose(fp);
}

The above code will not work at all, for several reasons.

First, of course, you haven't told us the format of the file
you're reading, so it's difficult to say how you really should
read it, but fread is only a solution if you're reading the data
into a buffer of raw bytes, which you then parse manually.
(Most likely, you're dealing with a file in CSV format, unless
you're reading directly from a data base connection. So some
parsing will probably be necessary anyway.)

Well basicly the format of file is the IPCOUNTRY struct.

Which is? The format of the IPCOUNTRY struct will vary
enormously between compilers, even on the same platform, and can
change from one release of the compiler to the next, or even
with different compiler options. When you say that the format
is basically the same as that of the IPCOUNTRY struct, you're
basically saying that you don't know the format, and that you
don't care, because you're not going to ever reread the data at
a later date, when the program may have been recompiled with a
more recent version of the compiler, or with different options.

I have created a seperate converter function from CSV file
format to STRUCT format (this is never done on the retail
application) this is also have to do with speed when reading
from the file.

As long as you ensure that both this separate program and your
application are compiled with the same version of the compiler,
using the same options. It still sounds error prone to me (but
I can imagine cases where it is justified).

Maybe not future proof and user friendly but imagine if
everytime you start up an application and it gonna parse
80'000+ lines of column seperated lines with strtok or
similar... maybe not a big deal with todays computer. I just
thought this was a neat and lightweight solution at that point
of time.

Do you know of many programs that don't parse a couple of
thousand lines on start up?

I guess it depends on the context. I work on large scale
servers. They're started at most once a day---most of the
projects I've worked on are started once every three or four
years, or more. Whether the start up lasts 1 second, or 5 or
even 30, isn't a big deal.

Still, if you have to parse it, you have to parse it. Doing so
in a separate process is going to take (slightly) more time than
doing so in your own process. What might be useful is if you
only have to update the data (from the CVS file) once a week or
less, but you restart the individual application a lot. In that
case, I might use a memory dump with a header containing the
last update (from the CVS file) date and a checksum based on the
compiler/compiler options. (For the latter, on a Unix system,
I'd simply use the output of /dev/random to initialize an array
of about 16 bytes, regenerating the data each time I relinked.
So anytime I install a newly relinked version of the program, it
will go to the CVS data the first time.) And I'd still put the
parser for the CVS data in the same binary image, with something
like:

    if ( binaryDumpPresent() && binaryDumpOKToUse() ) {
        initializeFromBinaryDump() ;
    } else {
        initializeFromCVSFile() ;
        createBinaryDumpFromData() ;
    }

In this case, of course:

-- The actual data elements should be POD's, with only basic
    types (integral types, perhaps floating point types as well,
    although there's no need here, but no classes or pointers).

-- If you're using a sorted vector, you can do this with a
    single read/write, no need to read or write each element
    separately. (In fact, I'd use two reads/writes: one for the
    header, and one for the rest of the data.)

Depending on portability constraints, I might even use mmap (or
its equivalent under Windows); I've done this once (with a file
that took something like five minutes to parse on my old Sun
Sparc, resulted in a memory image of something like 10-15
Megabytes, and in which in 90% of the runs, I never accessed
beyond the first 128 entries---out of more than a
million---anyway).

Although formally not guaranteed... If you define the header
something like:

    struct IpCountryMapHeader
    {
        unsigned char magic[ 16 ] ;
        time_t saved ;
        size_t entryCount ;
    } ;

and each element as:

    struct IpCountryMapEntry
    {
        uint32_t start ;
        uint32_t end ;
        char countryName[ maxCountryNameLength + 1 ] ;
    } ;

will doubtlessly work. (In theory, you could run afoul of
alignment considerations if both size_t and time_t are smaller
than uint32_t. In practice, if size_t is less than 32 bits,
your table won't fit into memory, and time_t will never be less
than 32 bits anyway.)

Under Unix, a simple script like:

    (
        echo 'unsigned char referenceMagic[] = {' ;
        od -tx1 -N 16 /dev/random |
            tr '[:lower:]' '[:upper:]' |
            sed -e '2,$d'
                -e 's:^[0-9A-Z]* ::'
                -e 's:[0-9A-F][0-9A-F]:0x&,:g' ;
        echo '} ;'
    ) > referenceMagic.cc

in your makefile can be used to regenerate the magic every time
you relink; under Windows, you may have to write a simple C/C++
program to do something similar, using some sort of hash value
of the machines IP address, your current process id (a hex dump
of the HANDLE?), the current time, etc., to seed a quality
random generator, and output from it. (I use scripts like this
all the time in my makefiles, but in this case, if the OS
doesn't support /dev/random, or something similar, the script
won't work, even if you have a full Unix tool kit installed.)

(Also, FWIW: most of the tools which generate your CSV file will
generate closed intervals. I'd convert this to the classical
Unix half open interval when I read the file, and use that from
then on.)

Note too that std::lower_bound will work fine on a C style
array, either loaded with ifstream::read() (on a file opened in
binary mode) or mmapped. Just call it with startAddress,
startAddress + header.entryCount as the first two arguments.

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orient=E9e objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place S=E9mard, 78210 St.-Cyr-l'=C9cole, France, +33 (0)1 30 23 00 34