Re: How to statistic the log effectively?
On Nov 28, 8:33 pm, feifei <dongfei...@gmail.com> wrote:
Suppose there is a log containing 1 billion records, each of which is a
query phrase. How can I select the hottest query phrase (the one with the
highest frequency) from the log file efficiently?
With the Linux shell, I often write the command like this:
sort log | uniq -c | sort -nr | head -n 10
if using map in C++, I can write it easily as:

#include <algorithm>
#include <fstream>
#include <map>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// sort by count, descending
bool sort_aid(const pair<string,int>& left, const pair<string,int>& right)
{
    return left.second > right.second;
}

int main()
{
    map<string,int> mapLine;
    ifstream log("log");
    string line;
    while (getline(log, line))
        mapLine[line] += 1;

    vector<pair<string,int> > vecResult;
    for (map<string,int>::iterator it = mapLine.begin();
         it != mapLine.end(); ++it)
        vecResult.push_back(make_pair(it->first, it->second));

    std::sort(vecResult.begin(), vecResult.end(), sort_aid);
    // vecResult.front() now holds the hottest query phrase
}
=======================
Is there a more effective solution?
You can try these and see if they perform better on your data set.
1. Just add a line: vecResult.reserve(mapLine.size()); after the
declaration of vecResult, so that push_back never has to reallocate.
2. Create a vector with all the elements from the file (use reserve()
of course), sort on the 1st value (the string), then perform the
aggregation: go through the elements and bump up the count while the
following elements are the same. Remove the repeated elements
(std::unique) and sort again on the 2nd value (the count). You have the
highest-repeating element at the front (or back) depending on whether
the comparator uses > or <.
I feel 2 is better because if you have 10^9 distinct entries, then
std::map<> has an overhead of roughly 3 pointers per node, which on a
64-bit system is (10^9)*24 bytes, i.e. about 24 GB, on top of the strings
themselves. However, if you have many repeats (say the top element occurs
30% of the time), then the map approach would be better, since it stores
each distinct query only once and a query string is considerably longer
than 24 bytes.
3. For even larger data sets, you may want to try Hadoop.
Regards,
-Dhruv.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]