Re: Finding most frequent element in std::vector?

From:

Joshua Maurice <joshuamaurice@gmail.com>

Newsgroups:

comp.lang.c++

Date:

Thu, 25 Mar 2010 04:02:13 -0700 (PDT)

Message-ID:

<16fb074c-de02-4459-a600-43eb806afdb9@v34g2000prm.googlegroups.com>

On Mar 25, 2:59 am, Joshua Maurice <joshuamaur...@gmail.com> wrote:

On Mar 24, 11:47 pm, Steve555 <foursh...@btinternet.com> wrote:

On 24 Mar, 14:56, Michael Doubez <michael.dou...@free.fr> wrote:

On 24 mar, 15:12, Steve555 <foursh...@btinternet.com> wrote:

On 24 Mar, 12:19, Bart van Ingen Schenau <b...@ingen.ddns.info> wro=

te:

On Mar 24, 9:11 am, Steve555 <foursh...@btinternet.com> wrote:

Given a vector, I'm trying to find the most frequently occurrin=

element, and return it, along with a count of its frequency.

Although I've got a way of doing it below, this gets called mil=

lions

of times in an inner loop and I need to speed it up if possible=

I'm looping through each element of the vector, and using std::=

count

to count it's frequency:

[snip]

Thanks for the many suggestions, using various ideas from each
(especially using freqArray) I've been able to reduce the time for =

million runs from 1.91 seconds down to 0.69, so a huge improvement.
(I used the code directly in the end, rather than in a separate
function, but both times reflect this)

// assuming vec already exists
long *ptr, *freqArray = new long[201];
long maxElement, maxFreq;
for(long l=0;l<1000000;l++)
{
  //clear the freqArray (using a ptr was quicker than []
  ptr = freqArray;
  for(long j=0; j<=200; j++) *ptr++ = 0;
  maxFreq = 0;
  for(long i=0; i<vec.size(); i++)
  {
      ++freqArray[vec[i]];
      if(freqArray[vec[i]] > maxFreq)
      {
        maxElement = vec[i];
        maxFreq = freqArray[vec[i]];
      }
  }

}

P. Lepin: In this case, map worked out slower, probably the look=

time.

I guess it is the locality and the dynamic allocation.

Michael: Sorry don't know what you mean by "quadratic" in this
context.

Quadratic means that the complexity is proportional to the square of
the number of elements.

It is as worse as it can get except for exponential or NP-complete
complexity.

And at 49 1/4, I'm a bit old for homework ;-)

Never too late.

[snip]

Out of shear curiosity could you try the following algorithm:

long FindMostFrequentInVector(vector<long> *inVec, long *maxElement)
{
  vector<long>::iterator it = inVec->begin();
  vector<long>::iterator end = inVec->end();

  // sort inVec - best would be for it to be already sorted
  std::sort(it,end);

  long maxCount = 0;

  // assert( !inVec->empty() );
  long count = 1;
  long element = *it;
  while( ++it != end )
  {
    if( *it == element )
    { // count current element
      ++count;
    }
    else
    { // new element value
      if( count > maxCount )
      {
        maxCount = count;
        *maxElement = element;
      }
      element = *it;
      count = 1;
    }
  }

  return maxCount;

}

--
Michael

Thanks Michael. Running your algorithm (explicitly rather than in a
separate function) gave a time of .51 (compared to original .69 for 1
million loops). I've found (on my system at least) that iterators
carry some overhead compared to simple array indexing. Hence, using an
array buffer to keep track of the counts instead, along with memset()
to quickly clear it, brought the time down to .22.

What's also quite interesting about all the different methods
suggested is the amount of variation from run to run. The ones where I
use iterators or stl algorithms are very consistent (if a little
slower), whereas the best solution (using a freqArray and memset() )
can vary by as much as 60 ms (~25%). That said, it's a bit of an
artificial situation, loading a program, setting timers, looping a
million times and exiting... maybe there's some system crap causing
the variability. Anyway, not a problem, just a curiosity.
(I'm compiling in Xcode on OSX 10.6.2 on a 2.8ghz Intel Mac, timimg
done in microseconds, rounded)

First, you are in one of the rare situations where std::vector is
legitimately significantly slower than built-in arrays.
vector<int> x(10);
will initialize all of its 10 elements to 0. A built-in array
int x[10];
will not. This is usually not a concern, but in your case it might be.
Zero initializing all the bins of the bin sort might be quite costly.

Second, if you're using microsoft visual studios compiler, consider
turning off checked iterators. In release mode, VS includes some extra
run time checking in its STL iterators.
http://msdn.microsoft.com/en-us/library/aa985965%28VS.80%29.aspx

Otherwise, with a halfway decent optimizing compiler with its
optimizations turned on, there should be no noticeable difference
between built-in arrays and std::vectors.

I would suggest a basic bin sort, or some variation:
- Create a built-in array of bins for the bin sort. (Do not initialize
it. We're trying to save this cost.)
- For each element in the input vector, zero the corresponding bin.
- Loop over the input vector, keeping track of the "best thus far"
value and "number of times seen" for the "best thus far" value. For
each element, increment the corresponding bin, and then update if
necessary the "best thus far" counters.

This has 2 loops of 10 iterations each which is almost certainly
faster than the algorithm with a loop of 200 iterations. However,
std::memset may be significantly faster than a loop of 200 iterations,
so it might be faster to memset all 200 bins to 0 instead of looping
an additional time up front to zero out the 1-10 "will be used" bins.

We traverse the input vector in increasing address order, which caches
tend to like. However, we are hitting the bins in a random order,
which isn't too cache friendly, though I would guess that it's still
faster than our alternatives. Possible alternatives: You might be able
to use a hash table for the bins, though I suspect the extra overhead
of "hashing" would outweigh any benefits, and there's always those
pesky problems of getting a good hash function and the potential
really slow worst case. You might try having a map of pairs of (value,
frequency), or more specifically a sorted array of pair<int, int>. I
suspect that the caching effects with the small input size would
easily favor the sorted built-in array over "Big O"-superior std::map.
A linear search on the sorted array might even outperform a binary
search on the sorted array due to its small size. Hell, it might even
be faster to not sort at all and just do a linear lookup in the bins
for each of the 10 elements of the input vector.

Note specifically that many of the alternatives presented fly in the
face of conventional Big O analysis. However, you specifically said
that your algorithm has these fixed \small\ values, and Big O only
applies when the size is "large enough". Note however that this is not
an excuse to code yourself into a corner if at all possible, so if and
when the range of values becomes "any float", or the input size
becomes "several million", you are not screwed.

As always, the basic "guideline" of optimization is to first write
code "sanely", then measure to find the slow spots, then optimize /
hack the slow spots, measuring all alternatives. I've listed a lot of
alternatives, and I really don't know which would perform the best.
I'd guess that it depends a lot on the hardware and distribution of
input values.

Entertainingly enough, on my windows desktop, my guesses were spot on
(from my quickie testing). I coded up quickie implementations for
each, and barring mistakes on my part, it seems that your best bet for
the common windows desktop is a simple bin sort with two passes over
the input data, the first to zero out the "to be used" bins, and the
second pass to compute the most frequent element (or one of the best
if there's a tie). I'm sure there's plenty of variables I'm not
controlling for, but I think it's a somewhat fair test of the OP's use
case.

In case I made a mistake somewhere, here my results and code, results
first, when run with standard optimization options, including checked
iterators off:

starting tests
get_most_occuring_element_linear_map_lookup 1 seconds, 296
milliseconds
get_most_occuring_element_binary_map_lookup 2 seconds, 922
milliseconds
get_most_occuring_element_bin_sort_two_pass 782 milliseconds
get_most_occuring_element_bin_sort_with_memset 2 seconds, 31
milliseconds

//start code:

#include <windows.h>

#include <algorithm>
#include <iostream>
#include <sstream>
#include <vector>

#include <ctime>
#include <cstdlib>
using namespace std;

typedef unsigned long long my_time_t;

my_time_t convert(FILETIME const & fileTime)
{ //TODO type punning
    union { DWORD x[2]; my_time_t y; };
    x[0] = fileTime.dwLowDateTime;
    x[1] = fileTime.dwHighDateTime;
    y *= (my_time_t)100;
    return y;
}

my_time_t jGetCurrentTime()
{ SYSTEMTIME currentTime_SYSTEMTIME;
    GetSystemTime(&currentTime_SYSTEMTIME);
    FILETIME currentTime_FILETIME;
    if (SystemTimeToFileTime(&currentTime_SYSTEMTIME,
&currentTime_FILETIME) == 0)
        abort();
    return convert(currentTime_FILETIME);
}

struct HumanUnderstandableTimeDuration
{ int hours; int minutes; int seconds; int milliseconds; int
nanoseconds; };

HumanUnderstandableTimeDuration
jToHumanUnderstandableTimeDuration(my_time_t x)
{ HumanUnderstandableTimeDuration result;

    result.nanoseconds = x % ((my_time_t)1000 * (my_time_t)1000);

    result.milliseconds = x / ((my_time_t)1000 * (my_time_t)1000);
    result.nanoseconds = x % ((my_time_t)1000 * (my_time_t)1000);

    result.seconds = result.milliseconds / (my_time_t)1000;
    result.milliseconds = result.milliseconds % (my_time_t)1000;

    result.minutes = result.seconds / (my_time_t)60;
    result.seconds = result.seconds % (my_time_t)60;

    result.hours = result.minutes / (my_time_t)60;
    result.minutes = result.minutes % (my_time_t)60;

    return result;
}

string stringify(HumanUnderstandableTimeDuration x)
{
    bool needsHours = x.hours;
    bool needsMinutes = x.minutes || ((x.hours) &&
(x.seconds || x.milliseconds || x.nanoseconds));
    bool needsSeconds = x.seconds || ((x.hours || x.minutes)
&& (x.milliseconds || x.nanoseconds));
    bool needsMilliSeconds = x.milliseconds || ((x.hours || x.minutes
|| x.seconds) && (x.nanoseconds));
    bool needsNanoSeconds = x.nanoseconds;

    stringstream ss;
    bool needComma = false;
    if (needsHours)
    { ss << x.hours << " hours";
        needComma = true;
    }
    if (needsMinutes)
    { if (needComma)
            ss << ", ";
        ss << x.minutes << " minutes";
        needComma = true;
    }
    if (needsSeconds)
    { if (needComma)
            ss << ", ";
        ss << x.seconds << " seconds";
        needComma = true;
    }
    if (needsMilliSeconds)
    { if (needComma)
            ss << ", ";
        ss << x.milliseconds << " milliseconds";
        needComma = true;
    }
    if (needsNanoSeconds)
    { if (needComma)
            ss << ", ";
        ss << x.nanoseconds << " nanoseconds";
    }
    if (!needsHours && !needsMinutes && !needsSeconds && !
needsMilliSeconds && !needsNanoSeconds)
        ss << "~0";
    return ss.str();
}

struct bin_t
{ int label, frequency;
  bin_t() {}
  bin_t(int label_) : label(label_) {}
  bool operator< (bin_t y) const { return label < y.label; }
  bool operator== (bin_t y) const { return label == y.label; }
};

struct get_most_occuring_element_linear_map_lookup
{ inline int operator() (vector<int> const& x)
    { bin_t bins[10];
        bin_t* bins_start = bins;
        bin_t* bins_end = bins;

        int best_thus_far = bins[0].label = x[0];
        int best_thus_far_frequency = bins[0].frequency = 1;
        for (int i=1; i<10; ++i)
        { int ele = x[i];
            bin_t* insert_location = find(bins_start, bins_end,
bin_t(ele));
            if (insert_location == bins_end)
            { insert_location->label = ele;
                insert_location->frequency = 0;
                ++bins_end;
            }
            if (++ insert_location->frequency >
best_thus_far_frequency)
            { best_thus_far = ele;
                best_thus_far_frequency = insert_location->frequency;
            }
        }
        return best_thus_far;
    }
};

struct get_most_occuring_element_binary_map_lookup
{ inline int operator() (vector<int> const& x)
    { bin_t bins[10];
        bin_t* bins_start = bins;
        bin_t* bins_end = bins;

        int best_thus_far = bins[0].label = x[0];
        int best_thus_far_frequency = bins[0].frequency = 1;
        for (int i=1; i<10; ++i)
        { int ele = x[i];
            bin_t* insert_location = lower_bound(bins_start, bins_end,
bin_t(ele));

            if (insert_location == bins_end)
            { insert_location->label = ele;
                insert_location->frequency = 0;
                ++bins_end;
            }else if (insert_location->label != ele)
            { //make room for insert
                for (bin_t * b = bins_end;;)
                { if (b == insert_location)
                        break;
                    --b;
                    b[1] = b[0];
                }
                insert_location->label = ele;
                insert_location->frequency = 0;
                ++bins_end;
            }
            if (++ insert_location->frequency >
best_thus_far_frequency)
            { best_thus_far = ele;
                best_thus_far_frequency = insert_location->frequency;
            }
        }
        return best_thus_far;
    }
};

struct get_most_occuring_element_bin_sort_two_pass
{ inline int operator() (vector<int> const& x)
    { int bins[200];

        for (int i=1; i<10; ++i)
            bins[x[i]] = 0;

        int best_thus_far = x[0];
        int best_thus_far_frequency = bins[x[0]] = 1;

        for (int i=1; i<10; ++i)
        { int ele = x[i];
            if (++ bins[ele] > best_thus_far_frequency)
            { best_thus_far = ele;
                best_thus_far_frequency = bins[ele];
            }
        }
        return best_thus_far;
    }
};

struct get_most_occuring_element_bin_sort_with_memset
{ inline int operator() (vector<int> const& x)
    { int bins[200];
        memset(bins, 0, 200 * sizeof(int));

        int best_thus_far = x[0];
        int best_thus_far_frequency = bins[x[0]] = 1;

        for (int i=1; i<10; ++i)
        { int ele = x[i];
            if (++ bins[ele] > best_thus_far_frequency)
            { best_thus_far = ele;
                best_thus_far_frequency = bins[ele];
            }
        }
        return best_thus_far;
    }
};

template <typename sorter_t>
void do_test(char const* test_name, int& result, vector<vector<int> >
const& test_data)
{
    sorter_t sorter;

    //to "warm up" the instruction cache
    result += sorter(test_data[0]);

    int const num_runs = 1000;

    my_time_t startTime = jGetCurrentTime();
    for (int runs = 0; runs < num_runs; ++runs)
        for (int i=0; i<test_data.size(); ++i)
            result += sorter(test_data[i]);
    my_time_t endTime = jGetCurrentTime();
    HumanUnderstandableTimeDuration diff =
jToHumanUnderstandableTimeDuration(endTime - startTime);
    cout << test_name << " " << stringify(diff) << endl;
}

int main()
{
    int result = 0;
    vector<vector<int> > test_data;
    for (int i=0; i<10000; ++i)
    { test_data.push_back(vector<int>());
        for (int j=0; j<10; ++j)
            test_data.back().push_back(rand() % 200);
    }

    /*
    (not entirely correct) sanity checks that the algorithms are doing
the right thing.
    test_data[0][0] = test_data[0][1];
    test_data[1][8] = test_data[1][9];
    for (int i=0; i<10; ++i)
        cout <<
get_most_occuring_element_linear_map_lookup(test_data[i]) << " ";
    cout << endl;
    for (int i=0; i<10; ++i)
        cout <<
get_most_occuring_element_binary_map_lookup(test_data[i]) << " ";
    cout << endl;
    for (int i=0; i<10; ++i)
        cout <<
get_most_occuring_element_bin_sort_two_pass(test_data[i]) << " ";
    cout << endl;
    */

    cout << "starting tests" << endl;

do_test<get_most_occuring_element_linear_map_lookup>("get_most_occuring_ele=
ment_linear_map_lookup",
result, test_data);

do_test<get_most_occuring_element_binary_map_lookup>("get_most_occuring_ele=
ment_binary_map_lookup",
result, test_data);

do_test<get_most_occuring_element_bin_sort_two_pass>("get_most_occuring_ele=
ment_bin_sort_two_pass",
result, test_data);

do_test<get_most_occuring_element_bin_sort_with_memset>("get_most_occuring_=
element_bin_sort_with_memset",
result, test_data);
    return result;
}