Re: Merging a list of similar items

From:

"Jim Langston" <tazmaster@rocketmail.com>

Newsgroups:

comp.lang.c++

Date:

Wed, 13 Jun 2007 02:17:05 -0700

Message-ID:

<CCObi.90$471.22@newsfe02.lga>

"Henrik Goldman" <henrik_goldman@mail.tele.dk> wrote in message
news:466f96f6$0$52097$edfadb0f@dread11.news.tele.dk...

Hello,

I have a dataset which consist of a string username and string hostname as
a key and then an integer representing a count as the matching "second"
value in a pair.

So far I've used std::list to store this information by inserting objects
containing all 3 pieces of information. However now some users has
datasets in the range of 12000 records and my list seems to be taking like
forever to complete inserting and finding.
For this reason I'm looking for ways to speed this up.

Here is a bit of code from the existing solution (which works but is too
slow):

iterator Find(const string &strUsername, const string &strHostname)
{
   iterator it;

   for (it = begin(); it != end(); it++)
   {
     if (StrNoCaseCompare(it->m_strUsername, strUsername) &&
StrNoCaseCompare(it->m_strHostname, strHostname))
       return it;
   }

   return end();
}

void Insert(const string &strUsername, const string &strHostname, int
nCount)
{
   CDailyUsage DU;

   iterator it = Find(strUsername, strHostname);

   if (it == end())
   {
     DU.m_strUsername = strUsername;
     DU.m_strHostname = strHostname;
     DU.m_nCount = nCount;

     PushBack(DU);
   }
   else
   {
     it->m_nCount += nCount;
   }
}

I'm thinking about replacing it with map<pair<username,hostname>, ncount>
but I don't know how much this will give me.
Are anyone up for good ideas?

A std::map should give you a whole lot. A list is unindexed and for every
search you are iteratirng though every possible record. A map is indexed (I
believe they use binary trees or better) and searches are a lot faster, a
WHOLE lot faster. The difficulty comes in your strNoCase compare, but you
can get around that by making the key a class with an operator<..

The way you are using this you probably want a set anyway since your data is
in your class/structure.

Output of following program showing your type of functions for List, Map and
Set is:

*** List ***
Inserting group 0 100 unique entries to list: 94ns
Inserting group 1 100 unique entries to list: 266ns
Inserting group 2 100 unique entries to list: 343ns
Inserting group 3 100 unique entries to list: 469ns
Inserting group 4 100 unique entries to list: 609ns
Inserting group 5 100 unique entries to list: 719ns
Inserting group 6 100 unique entries to list: 860ns
Inserting group 7 100 unique entries to list: 1000ns
Inserting group 8 100 unique entries to list: 1171ns
Inserting group 9 100 unique entries to list: 1250ns
Search for non existant entry in List: 16ns
List size:1000

*** Map ***
Inserting group 0 100 unique entries to map: 47ns
Inserting group 1 100 unique entries to map: 62ns
Inserting group 2 100 unique entries to map: 63ns
Inserting group 3 100 unique entries to map: 78ns
Inserting group 4 100 unique entries to map: 78ns
Inserting group 5 100 unique entries to map: 94ns
Inserting group 6 100 unique entries to map: 94ns
Inserting group 7 100 unique entries to map: 94ns
Inserting group 8 100 unique entries to map: 78ns
Inserting group 9 100 unique entries to map: 94ns
Search for non existant entry in Map: 0.485ns
Map size:1000

*** Set ***
Inserting group 0 100 unique entries to set: 46ns
Inserting group 1 100 unique entries to set: 79ns
Inserting group 2 100 unique entries to set: 78ns
Inserting group 3 100 unique entries to set: 78ns
Inserting group 4 100 unique entries to set: 78ns
Inserting group 5 100 unique entries to set: 78ns
Inserting group 6 100 unique entries to set: 93ns
Inserting group 7 100 unique entries to set: 94ns
Inserting group 8 100 unique entries to set: 94ns
Inserting group 9 100 unique entries to set: 94ns
Search for non existant entry in Set: 0.531ns
Set size:1000

The main thing to notice is the search times. For your list it's a linear
search, takes longer and longer the more entries you have. For map and set,
they use indexes, so QUITE a bit faster. Here is program: (Please note
that followoing program is not optmized, I just tried to follow the way you
were doing things with maps and sets also).

#include <iostream>
#include <string>
#include <ctime>
#include <cstring>
#include <utility>
#include <sstream>

#include <list>
#include <map>
#include <set>

std::string Lowercase( const std::string& MixedString )
{
    std::string ReturnValue = "";
    for ( std::string::size_type i = 0; i < MixedString.size(); ++i )
        ReturnValue += static_cast<char>( tolower( MixedString[i] ) );
    return ReturnValue;
}

bool StrNoCaseCompare( const std::string& Str1, const std::string& Str2 )
{
    if ( Lowercase( Str1 ) == Lowercase( Str2 ) )
        return true;

    return false;
}

class UserInfo
{
public:
    UserInfo( const std::string& Username, const std::string& Hostname ):
Username( Username ), Hostname( Hostname ) {}
    std::string Username;
    std::string Hostname;
    int count; // Only used in set
};

bool operator<( const UserInfo& lhs, const UserInfo& rhs )
{
    // case insensitve compare
    // use your fastest method, this one just thrown together
    std::string Userlhs = Lowercase( lhs.Username );
    std::string Hostlhs = Lowercase( lhs.Hostname );
    std::string Userrhs = Lowercase( rhs.Username );
    std::string Hostrhs = Lowercase( rhs.Hostname );

    if ( Userlhs == Userrhs )
        if ( Hostlhs < Hostrhs )
            return true;
        else
            return false;
    else if ( Userlhs < Userrhs )
        return true;
    else
        return false;
}

typedef std::list<std::pair<UserInfo, int> > DataList;
typedef std::map<UserInfo, int> DataMap;
typedef std::set<UserInfo> DataSet;

DataList::iterator FindInList(const std::string &strUsername, const
std::string &strHostname, DataList& Data)
{
    DataList::iterator it;

    for (DataList::iterator it = Data.begin(); it != Data.end(); ++it)
    {
        if (StrNoCaseCompare(it->first.Username, strUsername) &&
            StrNoCaseCompare(it->first.Hostname, strHostname))
            return it;
    }

    return Data.end();
}

void InsertToList(const std::string &strUsername, const std::string
&strHostname, int nCount, DataList& Data)
{
    DataList::iterator it = FindInList(strUsername, strHostname, Data);

    if (it == Data.end())
    {
        Data.push_back(std::make_pair( UserInfo( strUsername, strHostname ),
nCount ));
    }
    else
    {
        it->second += nCount;
    }
}

DataMap::iterator FindInMap(const std::string &strUsername, const
std::string &strHostname, DataMap& Data)
{
    return Data.find( UserInfo( strUsername, strHostname ));
}

void InsertToMap(const std::string &strUsername, const std::string
&strHostname, int nCount, DataMap& Data)
{
    UserInfo Entry( strUsername, strHostname );
    DataMap::iterator it = Data.find( Entry );

    if (it == Data.end())
    {
        Data.insert(std::make_pair( Entry, nCount ));
    }
    else
    {
        it->second += nCount;
    }
}

DataSet::iterator FindInSet(const std::string &strUsername, const
std::string &strHostname, DataSet& Data)
{
    return Data.find( UserInfo( strUsername, strHostname ));
}

void InsertToSet(const std::string &strUsername, const std::string
&strHostname, int nCount, DataSet& Data)
{
    UserInfo Entry( strUsername, strHostname );
    Entry.count = nCount;

    DataSet::iterator it = Data.find( Entry );

    if (it == Data.end())
    {
        Data.insert(Entry);
    }
    else
    {
        it->count += nCount;
    }
}

std::string StrmConvert( int Number )
{
    std::stringstream Temp;
    Temp << Number;
    std::string Output;
    Temp >> Output;
    return Output;
}

int main ()
{
    const int Entries = 100;

    DataList ListData;
    DataMap MapData;
    DataSet SetData;

    clock_t Start;
    clock_t End;

    // Insert to list
    std::cout << "*** List ***\n";
    for ( int Group = 0; Group < 10; ++Group )
    {
        std::string GroupPrefix = StrmConvert( Group );
        Start = clock();
        for ( int i = 0; i < Entries; ++i )
        {
            InsertToList( GroupPrefix + StrmConvert( i ), GroupPrefix +
StrmConvert( i ), i, ListData );
        }
        End = clock();
        std::cout << "Inserting group " << Group << " " << Entries << "
unique entries to list: " << End - Start << "ns" << "\n";
    }
    Start = clock();
    FindInList( "Bogus", "Bogus", ListData );
    End = clock();
    std::cout << "Search for non existant entry in List: " << End - Start <<
"ns\n";

    std::cout << "List size:" << ListData.size() << "\n";

    // Insert to map
    std::cout << "\n*** Map ***\n";
    for ( int Group = 0; Group < 10; ++Group )
    {
        std::string GroupPrefix = StrmConvert( Group );
        Start = clock();
        for ( int i = 0; i < Entries; ++i )
        {
            InsertToMap( GroupPrefix + StrmConvert( i ), GroupPrefix +
StrmConvert( i ), i, MapData );
        }
        End = clock();
        std::cout << "Inserting group " << Group << " " << Entries << "
unique entries to map: " << End - Start << "ns" << "\n";
    }
    Start = clock();
    // Map find is so fast one search reports 0ns
    for ( int i = 0; i < 1000; ++i )
        FindInMap( "Bogus", "Bogus", MapData );
    End = clock();
    std::cout << "Search for non existant entry in Map: " <<
static_cast<double>( End - Start ) / 1000.0 << "ns\n";
    std::cout << "Map size:" << MapData.size() << "\n";

    // Insert to set
    std::cout << "\n*** Set ***\n";
    for ( int Group = 0; Group < 10; ++Group )
    {
        std::string GroupPrefix = StrmConvert( Group );
        Start = clock();
        for ( int i = 0; i < Entries; ++i )
        {
            InsertToSet( GroupPrefix + StrmConvert( i ), GroupPrefix +
StrmConvert( i ), i, SetData );
        }
        End = clock();
        std::cout << "Inserting group " << Group << " " << Entries << "
unique entries to set: " << End - Start << "ns" << "\n";
    }
    Start = clock();
    // Set find is so fast one search reports 0ns
    for ( int i = 0; i < 1000; ++i )
        FindInSet( "Bogus", "Bogus", SetData );
    End = clock();
    std::cout << "Search for non existant entry in Set: " <<
static_cast<double>( End - Start ) / 1000.0 << "ns\n";

    std::cout << "Set size:" << SetData.size() << "\n";

}