Re: How to compute statistics on a log efficiently?
On Nov 30, 3:40 am, dhruv <dhruvb...@gmail.com> wrote:
On Nov 28, 8:33 pm, feifei <dongfei...@gmail.com> wrote:
Suppose there is a log containing 1 billion records, each of which is a
query phrase. How do I select the hottest query phrase (the one with the
highest frequency) from the log file efficiently?
With the Linux shell, I often write a command like this:
sort log | uniq -c | sort -nr | head -n 10
If I use a map in C++, I can write it easily as:
#include <algorithm>
#include <fstream>
#include <map>
#include <string>
#include <utility>
#include <vector>

using namespace std;

// Order (phrase, count) pairs by descending count.
bool sort_aid(const pair<string, int>& left, const pair<string, int>& right)
{
    return left.second > right.second;
}

int main()
{
    map<string, int> mapLine;
    ifstream log("log");
    string line;
    while (getline(log, line))   // count every occurrence of each phrase
        mapLine[line] += 1;

    vector<pair<string, int> > vecResult;
    for (map<string, int>::iterator it = mapLine.begin(); it != mapLine.end(); ++it)
        vecResult.push_back(make_pair(it->first, it->second));
    sort(vecResult.begin(), vecResult.end(), sort_aid);
}
=======================
Is there a more efficient solution?
You can try these and see if they perform better on your data set.
1. Just add the line vecResult.reserve(mapLine.size()); after the
declaration of vecResult, so the vector is allocated once instead of
growing repeatedly.
2. Create a vector with all the elements from the file (using reserve(),
of course), sort on the first value (the string), then perform the
aggregation: go through the elements and bump up a count while the
following elements are the same. Remove the repeated elements
(std::unique) and sort again, this time on the second value (the count).
You have the highest-repeating element at the front (or back), depending
on whether you compare with > or <. There is a sketch of this after the
list below.
I feel 2 is better because if you have 10^9 entries, then std::map<>
has an overhead of 3 pointers per node; at 8 bytes per pointer that is
(10^9)*24 bytes, about 24 GB. However, if you have many repeats (say the
top element accounts for 30% of the entries), then the map approach would
be better: the map stores each distinct string only once, and a query
string is typically considerably longer than 24 bytes.
3. For even larger data sets, you may want to try Hadoop.
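Here is a rough sketch of 2, assuming the whole log fits in memory as a
vector<string>. The names aggregate and by_count_desc are just mine for
illustration, and it folds the std::unique step into the aggregation pass:

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using namespace std;

// Descending order on the aggregated count (the second member).
bool by_count_desc(const pair<string, size_t>& a,
                   const pair<string, size_t>& b)
{
    return a.second > b.second;
}

// lines holds every record from the log, one string per record.
vector<pair<string, size_t> > aggregate(vector<string>& lines)
{
    sort(lines.begin(), lines.end());   // equal phrases become adjacent
    vector<pair<string, size_t> > counts;
    for (vector<string>::iterator i = lines.begin(); i != lines.end(); ) {
        // upper_bound on the sorted range finds the end of this run
        vector<string>::iterator j = upper_bound(i, lines.end(), *i);
        counts.push_back(make_pair(*i, static_cast<size_t>(j - i)));
        i = j;
    }
    sort(counts.begin(), counts.end(), by_count_desc);
    return counts;   // hottest phrase is counts.front().first
}

Calling aggregate() on the loaded vector returns the (phrase, count)
pairs hottest-first, trading the map's per-node pointer overhead for one
extra O(n log n) sort.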
Two more possibilities:
4. After copying all the data to the vector (not aggregated, duplicates
not removed), you repeatedly use partition() on the data until one of the
sides of the partition is of size 1 (or, if you know the highest-repeating
query is going to occur say 1000 times, of size 1000 or less). Just pick
any value from there, and you have your answer. You don't even need to
bother with a full sort, which IMHO would be much faster if you have that
kind of memory on your system.
5. Alternatively, you can reduce the partition size repeatedly until it
reaches, say, 20,000 entries and then sort that chunk. This saves you
from choosing a pivot to partition on at the small sizes: once the data
gets smaller, it may become increasingly difficult to choose a pivot
that divides it into approximate halves, so this approach may just be
faster! A sketch combining 4 and 5 follows below.
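Here is a hedged sketch of how 4 and 5 might fit together; hottest() and
the 20,000 cutoff are illustrative choices of mine, not anything the
above pins down. Each step three-way-partitions the segment around a
pivot (two std::partition calls), counts the block equal to the pivot,
and recurses into the rest; small segments fall back to sort-and-scan,
and segments no larger than the best count so far are pruned:

#include <algorithm>
#include <functional>
#include <string>
#include <vector>

using namespace std;

typedef vector<string>::iterator It;

// Track the most frequent phrase in [first, last); best and bestCount
// carry the best result seen so far across recursive calls.
void hottest(It first, It last, string& best, size_t& bestCount)
{
    while (static_cast<size_t>(last - first) > bestCount) {
        if (last - first < 20000) {   // small enough: sort and scan runs
            sort(first, last);
            for (It i = first; i != last; ) {
                It j = upper_bound(i, last, *i);
                if (static_cast<size_t>(j - i) > bestCount) {
                    bestCount = j - i;
                    best = *i;
                }
                i = j;
            }
            return;
        }
        string pivot = *(first + (last - first) / 2);
        // Three-way partition: [ <pivot | ==pivot | >pivot ]
        It mid = partition(first, last, bind2nd(less<string>(), pivot));
        It eq  = partition(mid, last, bind2nd(less_equal<string>(), pivot));
        if (static_cast<size_t>(eq - mid) > bestCount) {
            bestCount = eq - mid;   // the pivot's whole run sits in [mid, eq)
            best = pivot;
        }
        hottest(first, mid, best, bestCount);   // recurse on the left side...
        first = eq;                             // ...and loop on the right
    }
}

Call it as hottest(data.begin(), data.end(), best, bestCount) with
bestCount starting at 0; the answer is left in best. bind2nd() and
less_equal<> come from <functional> and keep this plain C++98.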
Regards,
-Dhruv.