You say you load all data from a text file. What is the format of your
text file? You may consider changing the format a bit to optimize
things. For example, your inner loop above does a userID ->
MyUserClass * map lookup on every one of its 500,000 iterations, for
every user. This is a bottleneck. You could save a lookup in a
500,000-entry map by simply changing SongRating to this:
struct SongRating {
    MyUserClass *user;
    double rating;
};
Now you never have to look up the user ID in the map. However, this
means that when loading the text file, you have to load the user data
first, and then fill these structures as you load the song rating
data, doing the user map lookup there. Depending on how many times you
run through your huge loop below, doing the lookup (for every song and
rating) *once* when you load the file, rather than more than once,
will save you a lot of time.
Then, you can optimize your text file format to eliminate user map
lookups even on load. For example, say your text file is (for
example):
<songid> <song info...>
<songid> <song info...>
<songid> <song info...>
...
begin user <userid> <user info...>
<songid> <rating>
<songid> <rating>
<songid> <rating>
<songid> <rating>
...
end user
You are now explicitly associating SongRatings with users in your text
file. This eliminates any user map lookups on load at all. Loading is
simply this:
1. Load all song information into songID -> song map.
2. Load all user information, for each user:
- Create MyUserClass and load user info into it.
- Create entry in userID -> user map (if you even need to).
- Create the SongRatings, and you already know exactly which
MyUserClass they're associated with!
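The loading steps above can be sketched as follows. This is a
minimal, hypothetical implementation: the names (MySongClass,
MyUserClass, SongRating, loadData) and the exact parsing are
assumptions for illustration, not your actual code.

```cpp
// Hypothetical loader for the grouped text format above. All names
// here (MySongClass, MyUserClass, SongRating, loadData) and the exact
// parsing are illustrative assumptions, not your actual code.
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct MySongClass {
    std::string info;
};

struct SongRating {
    MySongClass *song;  // resolved once, at load time
    double rating;
};

struct MyUserClass {
    std::string info;
    std::vector<SongRating> ratings;
    int getNumberOfSongRatings () const { return (int)ratings.size(); }
};

// Returns false if the file is malformed.
bool loadData (std::istream &in,
               std::map<long, MySongClass *> &songs,
               std::map<long, MyUserClass *> &users)
{
    std::string line;
    MyUserClass *current = 0;  // user currently being loaded, if any
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string first;
        if (!(ss >> first))
            continue;                    // skip blank lines
        if (first == "begin") {          // "begin user <userid> <user info...>"
            std::string keyword;
            long userID;
            if (!(ss >> keyword >> userID))
                return false;
            current = new MyUserClass;
            std::getline(ss, current->info);
            users[userID] = current;     // only if you even need this map
        } else if (first == "end") {     // "end user"
            current = 0;
        } else {                         // line starts with <songid>
            std::istringstream idss(first);
            long songID;
            if (!(idss >> songID))
                return false;
            if (current == 0) {          // "<songid> <song info...>"
                MySongClass *song = new MySongClass;
                std::getline(ss, song->info);
                songs[songID] = song;
            } else {                     // "<songid> <rating>"
                SongRating r;
                r.song = songs[songID];  // the one lookup left, at load only
                if (!(ss >> r.rating))
                    return false;
                current->ratings.push_back(r);
            }
        }
    }
    return true;
}
```

The only lookup left is songs[songID], done exactly once per rating
and only at load time; nothing after loading ever touches a map
keyed by ID.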
The key is that grouping song ratings by user in the data file means
you always know precisely which user a song rating is associated with
as you load it, without having to look that user up by ID.
That's the biggest optimization I can think of given how you say you
are using the data. It eliminates all userID -> MyUserClass lookups
entirely.
That is only one suggestion. Read on!
There is another possible optimization you can make. It appears that
you are examining every single rating for every single song, and
counting the total number of ratings made by a given user where
CheckUserSuitable(user) is true.
Well, CheckUserSuitable depends only on the user. It does not depend
on the song or the rating. So why iterate through all the songs and
ratings in an outer loop? You say the data never changes, so simply
count the number of SongRatings each user is associated with on load,
store that count in MyUserClass, and *all* you need to do to compute
'count' is iterate through your users and sum up all the counts for
users where CheckUserSuitable is true:
class MyUserClass {
public:
    int getNumberOfSongRatings () const;
};
You can store the users in a map, or even a vector, if it turns out
you don't need to actually do the userID -> MyUserClass lookups any
more; here I'll use a map:
map<long, MyUserClass *> users;
map<long, MyUserClass *>::const_iterator user;
int count = 0;

for (user = users.begin(); user != users.end(); ++ user)
    if (CheckUserSuitable((*user).second))
        count += (*user).second->getNumberOfSongRatings();
However, note that if you're counting the number of song ratings each
user has on load, you still may have to do some map lookups on load.
Well, if you change your text file format to the example I gave above,
counting song ratings per user becomes trivial -- since song ratings
are grouped per user, you do not need to do any map lookups at all to
count song ratings as you load them and store that count in
MyUserClass.
In general, it sounds like you have a huge amount of data, but you're
not storing it in a way that is optimal for what you're actually doing
with it. You need to look at which operations are the most time
consuming, and then figure out how to get rid of them. Looking up
users in the user map is time consuming; to get rid of it, you can
store the MyUserClass * instead of the user ID in SongRating, and
modify your text data format accordingly to make that cheap on load.
Iterating over every song and rating when CheckUserSuitable depends on
neither is also time consuming; you can do some "preprocessing" on
load by counting the number of song ratings per user, which eliminates
those loops from the picture.
There are plenty of other easy optimizations you can make depending on
how you are accessing the data. For example, if you believe you will
frequently be looking up users with the same ID, then cache, say, the
last 10 userID -> MyUserClass* entries, and search your cache, hitting
the map only if the cache misses.
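A minimal sketch of that cache idea, assuming the userID ->
MyUserClass * map from above. The cache size (10), the round-robin
eviction, and using -1 as the "empty slot" marker (i.e. assuming user
IDs are nonnegative) are all arbitrary choices for illustration:

```cpp
// Hypothetical small lookup cache in front of the user map. The cache
// size (10), round-robin eviction, and -1 as the "empty slot" marker
// (assumes nonnegative user IDs) are arbitrary illustrative choices.
#include <map>

struct MyUserClass {
    // user info...
};

class UserCache {
public:
    UserCache (std::map<long, MyUserClass *> &m) : users(m), next(0) {
        for (int i = 0; i < Size; ++ i)
            ids[i] = -1;                 // mark every slot empty
    }
    // Returns the user, or 0 if the ID is unknown.
    MyUserClass * find (long userID) {
        for (int i = 0; i < Size; ++ i)  // search the cache first
            if (ids[i] == userID)
                return entries[i];
        // Cache miss: hit the real map.
        std::map<long, MyUserClass *>::const_iterator it = users.find(userID);
        if (it == users.end())
            return 0;
        ids[next] = userID;              // remember this entry,
        entries[next] = (*it).second;    // overwriting the oldest slot
        next = (next + 1) % Size;
        return (*it).second;
    }
private:
    static const int Size = 10;
    std::map<long, MyUserClass *> &users;
    long ids[Size];
    MyUserClass *entries[Size];
    int next;
};
```

A linear scan over 10 entries is far cheaper than a tree walk through
a 500,000-entry map, so this pays off whenever repeated IDs are common.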
Storing pointers to iterators or actual data structures will also gain
you a lot of performance over storing IDs and looking them up in the
maps every time. You know the user ID -> MyUserClass mapping will
never change, so why bother dealing with the user ID at all? Same goes
for song IDs.
All that being said, another reasonable possibility is to ditch the
idea of actually dealing with this data on your own and start using a
proper SQL database to store the data in. There are many excellent
free database servers, such as MySQL or Microsoft's SQL Server
Express. Using something like MySQL's C API, or even SQLAPI++ (which
is not free, although it's an excellent API; Google for it), makes
development a snap. Those are just some examples.
You have a lot of data, and you're doing the type of access that SQL
is designed for. Database servers are designed *specifically* for this
type of task, and all popular ones are mature enough to be very well
optimized for any kind of access like that.
For example, I created some test data in 3 tables:
users:
    userid
    username
songs:
    songid
    songname
ratings:
    userid
    songid
    rating
I created an index on ratings.userid and ratings.songid. I generated
500,000 users, 5000 songs, and 400 random ratings for each song
(random user + rating). There are 2,000,000 entries in the ratings
table. Using a query like this, for example:
SELECT userid, count(*) FROM ratings GROUP BY userid
With MS SQL 2005, I was able to retrieve a list of *every* single user
that had rated a song, as well as the number of songs each user had
rated, in about 450 milliseconds (30 ms on server, 420 ms on client
side). 450 ms may seem like a lot, but consider: if you store all your
data in the database rather than a text file, that's 450 ms *total*,
from start to finish. Time spent loading data is eliminated.
Another example of what you can do: I can retrieve the name and
average rating for every song whose name contains the number
'5' (which, in my test case, is 1355 songs), sorted by highest average
rating:
SELECT (SELECT songname FROM songs WHERE songid = ratings.songid) AS ...
P.S. The server and the client were on the same machine; a Fujitsu
Lifebook with 1.6GHz Intel Core 2 Duo, 2 GB RAM, running Windows XP.