You say you load all data from a text file. What is the format of your
text file? You may consider changing the format a bit to optimize
things. For example, your inner loop above does a userID ->
MyUserClass * map lookup on every one of its 500,000 iterations, for
every user. This is a bottleneck. You could save a lookup in a
500,000-entry map by simply changing SongRating to this:
struct SongRating {
    MyUserClass *user;
    double rating;
};
Now you never have to look up the user ID in the map. However, this
means that when loading the text file, you have to load the user data
first, and then fill these structures as you load the song rating
data, doing the user map lookup there. Depending on how many times you
run through your huge loop below, doing the lookup (for every song and
rating) *once* when you load the file, rather than more than once,
will save you a lot of time.
Then, you can optimize your text file format to eliminate user map
lookups even on load. For example, say your text file is (for
example):
<songid> <song info...>
<songid> <song info...>
<songid> <song info...>
...
begin user <userid> <user info...>
<songid> <rating>
<songid> <rating>
<songid> <rating>
<songid> <rating>
...
end user
You are now explicitly associating SongRatings with users in your text
file. This eliminates any user map lookups on load at all. Loading is
simply this:
1. Load all song information into songID -> song map.
2. Load all user information, for each user:
- Create MyUserClass and load user info into it.
- Create entry in userID -> user map (if you even need to).
- Create the SongRatings, and you already know exactly which
MyUserClass they're associated with!
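The loading steps above can be sketched as follows. This is a
minimal, hypothetical implementation: the names (MySongClass,
MyUserClass, SongRating, loadData) and the exact parsing are
assumptions for illustration, not your actual code.

```cpp
// Hypothetical loader for the grouped text format above. All names
// here (MySongClass, MyUserClass, SongRating, loadData) and the exact
// parsing are illustrative assumptions, not your actual code.
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct MySongClass {
    std::string info;
};

struct SongRating {
    MySongClass *song;  // resolved once, at load time
    double rating;
};

struct MyUserClass {
    std::string info;
    std::vector<SongRating> ratings;
    int getNumberOfSongRatings () const { return (int)ratings.size(); }
};

// Returns false if the file is malformed.
bool loadData (std::istream &in,
               std::map<long, MySongClass *> &songs,
               std::map<long, MyUserClass *> &users)
{
    std::string line;
    MyUserClass *current = 0;  // user currently being loaded, if any
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string first;
        if (!(ss >> first))
            continue;                    // skip blank lines
        if (first == "begin") {          // "begin user <userid> <user info...>"
            std::string keyword;
            long userID;
            if (!(ss >> keyword >> userID))
                return false;
            current = new MyUserClass;
            std::getline(ss, current->info);
            users[userID] = current;     // only if you even need this map
        } else if (first == "end") {     // "end user"
            current = 0;
        } else {                         // line starts with <songid>
            std::istringstream idss(first);
            long songID;
            if (!(idss >> songID))
                return false;
            if (current == 0) {          // "<songid> <song info...>"
                MySongClass *song = new MySongClass;
                std::getline(ss, song->info);
                songs[songID] = song;
            } else {                     // "<songid> <rating>"
                SongRating r;
                r.song = songs[songID];  // the one lookup left, at load only
                if (!(ss >> r.rating))
                    return false;
                current->ratings.push_back(r);
            }
        }
    }
    return true;
}
```

The only lookup left is songs[songID], done exactly once per rating
and only at load time; nothing after loading ever touches a map
keyed by ID.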
The key is that grouping song ratings by user in the data file means
you always know precisely which user a song rating is associated with
as you load it, without having to look that user up by ID.
That's the biggest optimization I can think of given how you say you
are using the data. It eliminates all userID -> MyUserClass lookups
entirely.
That is only one suggestion. Read on!
There is another possible optimization you can make. It appears that
you are examining every single rating for every single song, and
counting the total number of ratings made by a given user where
CheckUserSuitable(user) is true.
Well, CheckUserSuitable depends only on the user. It does not depend
on the song or the rating. So why iterate through all the songs and
ratings in an outer loop? You say the data never changes, so simply
count the number of SongRatings each user is associated with on load,
store that count in MyUserClass, and *all* you need to do to compute
'count' is iterate through your users and sum up all the counts for
users where CheckUserSuitable is true:
class MyUserClass {
public:
    int getNumberOfSongRatings () const;
};
You can store the users in a map, or even a vector, if it turns out
you don't need to actually do the userID -> MyUserClass lookups any
more; here I'll use a map:
map<long, MyUserClass *> users;
map<long, MyUserClass *>::const_iterator user;
int count = 0;

for (user = users.begin(); user != users.end(); ++ user)
    if (CheckUserSuitable((*user).second))
        count += (*user).second->getNumberOfSongRatings();
However, note that if you're counting the number of song ratings each
user has on load, you still may have to do some map lookups on load.
Well, if you change your text file format to the example I gave above,
counting song ratings per user becomes trivial -- since song ratings
are grouped per user, you do not need to do any map lookups at all to
count song ratings as you load them and store that count in
MyUserClass.
In general, it sounds like you have a huge amount of data, but you're
not storing it in a way that is optimal for what you're actually doing
with it. You need to look at which operations are the most time
consuming, and then figure out how to get rid of them. Looking up
users in the user map is time consuming; to get rid of it, you can
store the MyUserClass * instead of the user ID in SongRating, and
modify your text data format accordingly to make that cheap on load.
Iterating over every song and rating when CheckUserSuitable depends on
neither is also time consuming; you can do some "preprocessing" on
load by counting the number of song ratings per user, which eliminates
those loops from the picture.
There are plenty of other easy optimizations you can make depending on
how you are accessing the data. For example, if you believe you will
frequently be looking up users with the same ID, then cache, say, the
last 10 userID -> MyUserClass* entries, and search your cache, hitting
the map only if the cache misses.
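A minimal sketch of that cache idea, assuming the userID ->
MyUserClass * map from above. The cache size (10), the round-robin
eviction, and using -1 as the "empty slot" marker (i.e. assuming user
IDs are nonnegative) are all arbitrary choices for illustration:

```cpp
// Hypothetical small lookup cache in front of the user map. The cache
// size (10), round-robin eviction, and -1 as the "empty slot" marker
// (assumes nonnegative user IDs) are arbitrary illustrative choices.
#include <map>

struct MyUserClass {
    // user info...
};

class UserCache {
public:
    UserCache (std::map<long, MyUserClass *> &m) : users(m), next(0) {
        for (int i = 0; i < Size; ++ i)
            ids[i] = -1;                 // mark every slot empty
    }
    // Returns the user, or 0 if the ID is unknown.
    MyUserClass * find (long userID) {
        for (int i = 0; i < Size; ++ i)  // search the cache first
            if (ids[i] == userID)
                return entries[i];
        // Cache miss: hit the real map.
        std::map<long, MyUserClass *>::const_iterator it = users.find(userID);
        if (it == users.end())
            return 0;
        ids[next] = userID;              // remember this entry,
        entries[next] = (*it).second;    // overwriting the oldest slot
        next = (next + 1) % Size;
        return (*it).second;
    }
private:
    static const int Size = 10;
    std::map<long, MyUserClass *> &users;
    long ids[Size];
    MyUserClass *entries[Size];
    int next;
};
```

A linear scan over 10 entries is far cheaper than a tree walk through
a 500,000-entry map, so this pays off whenever repeated IDs are common.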
Storing pointers to iterators or actual data structures will also gain
you a lot of performance over storing IDs and looking them up in the
maps every time. You know the user ID -> MyUserClass mapping will
never change, so why bother dealing with the user ID at all? Same goes
for song IDs.
All that being said, another reasonable possibility is to ditch the
idea of actually dealing with this data on your own and start using a
proper SQL database to store the data in. There are many excellent
free database servers, such as MySQL or Microsoft's SQL Server
Express. Using something like MySQL's C API, or even SQLAPI++ (which
is not free, although it's an excellent API; Google for it), makes
development a snap. Those are just some examples.
You have a lot of data, and you're doing the type of access that SQL
is designed for. Database servers are designed *specifically* for this
type of task, and all popular ones are mature enough to be very well
optimized for any kind of access like that.
For example, I created some test data in 3 tables:
users:
    userid
    username
songs:
    songid
    songname
ratings:
    userid
    songid
    rating
I created an index on ratings.userid and ratings.songid. I generated
500,000 users, 5000 songs, and 400 random ratings for each song
(random user + rating). There are 2,000,000 entries in the ratings
table. Using a query like this, for example:
SELECT userid, count(*) FROM ratings GROUP BY userid
With MS SQL 2005, I was able to retrieve a list of *every* single user
that had rated a song, as well as the number of songs each user had
rated, in about 450 milliseconds (30 ms on server, 420 ms on client
side). 450 ms may seem like a lot, but consider: if you store all your
data in the database rather than a text file, that's 450 ms *total*,
from start to finish. Time spent loading data is eliminated.
Another example of what you can do: I can retrieve the name and
average rating for every song whose name contains the number
'5' (which, in my test case, is 1355 songs), sorted by highest average
rating:
SELECT (SELECT songname FROM songs WHERE songid = ratings.songid) AS ...
P.S. The server and the client were on the same machine; a Fujitsu
Lifebook with 1.6GHz Intel Core 2 Duo, 2 GB RAM, running Windows XP.