For Loops and Lambda in "Lies, Damned Lies, and Microbenchmarks"

From:

Damien Kick <dkixk@earthlink.net>

Newsgroups:

comp.lang.c++.moderated

Date:

Fri, 12 Oct 2007 03:13:57 CST

Message-ID:

<13grad62umf5aa2@corp.supernews.com>

Having encountered a number of haters of C++ template programming in
general and Boost libraries in particular, I decided to start
experimenting with a few code snippets. A particular target of ire is
the subject of writing for loops by hand versus using algorithms. Those
who are opposed to the use of algorithms and function objects
frequently speak of the "unnecessary complexity and overhead" of using
algorithms and having to write functions or function objects
out-of-line. I had mentioned that Boost's lambda library allowed for
writing in-line function objects but the general thought was that there
would be too much overhead with something like this. So I thought I
would experiment with a few microbenchmarks. Pulled out of the air, for
no very good reason (and I know that this is often the death of whether
or not a benchmark is even worth consideration, but once more into the
breach, as it were), all variations used the following:

class Foo {
    int bar_;
    double baz_;
public:
    Foo(int const bar, double const baz) : bar_(bar), baz_(baz) { }
    int bar() const { return this->bar_; }
    double baz() const { return this->baz_; }
};

typedef boost::uniform_int<> uniform_int_type;
typedef boost::variate_generator<boost::mt19937&,uniform_int_type>
        generator_type;

class Make_foo {
    generator_type numerator_;
    generator_type denominator_;
public:
    Make_foo(generator_type const& numerator,
             generator_type const& denominator)
        : numerator_(numerator), denominator_(denominator)
        { }
    Foo operator () ()
        { return Foo(numerator_(),
                     (double(numerator_()) / double(denominator_()))); }
};

std::size_t const useless_microbenchmark_data_size = 1000000;

int main()
{
    boost::mt19937 rng;
    uniform_int_type zero_to_ninety_nine(0, 99);
    uniform_int_type one_to_one_hundred(1, 100);
    generator_type numerator(rng, zero_to_ninety_nine);
    generator_type denominator(rng, one_to_one_hundred);
    Make_foo make_foo(numerator, denominator);

    volatile ACE_Time_Value start_time =
        ACE_High_Res_Timer::gettimeofday_hr();

    // ... various for, for_each, etc. loops go here ...

    volatile ACE_Time_Value end_time =
        ACE_High_Res_Timer::gettimeofday_hr();
    volatile ACE_Time_Value duration =
        const_cast<ACE_Time_Value const&>(end_time) -
        const_cast<ACE_Time_Value const&>(start_time);

    std::cout << "duration.sec() = "
              << const_cast<ACE_Time_Value const&>(duration).sec()
              << "\nduration.usec() = "
              << const_cast<ACE_Time_Value const&>(duration).usec()
              << '\n';
    std::cout << "qux.size() = " << qux.size() << '\n';
    std::cout << "value = " << value << '\n';
}

What Foo? Just so that I'd have an excuse to write a lambda function
expression to use the baz but ignore the bar. Why make baz a double?
<shrug> Because double is a type frequently used by those with whom I've
been conversing. Why not? Why use the Boost random number library?
Just because. Why use ACE_Time_Value? It was just handy. I printed
out the value to show that the same pseudo-random numbers were adding to
the same value, i.e. to show that the arithmetic worked the same in all
cases. Not an interesting observation, I know.

The "simple" (duff.cc) variation was:

    std::vector<Foo> qux;
    std::size_t const qux_size = useless_microbenchmark_data_size;
    qux.reserve(qux_size);
    for (std::size_t i = 0; i < qux_size; ++i) {
        qux.push_back(make_foo());
    }
    double value = 0.0;
    for (std::vector<Foo>::const_iterator pos = qux.begin();
         pos != qux.end(); ++pos)
    {
        value += pos->baz();
    }

The "for_each" (fudd.cc) variation was:

    std::vector<Foo> qux;
    std::size_t const qux_size = useless_microbenchmark_data_size;
    qux.reserve(qux_size);
    std::generate_n(std::back_inserter(qux), qux_size, make_foo);
    double value = 0.0;
    std::for_each(qux.begin(), qux.end(), value += bind(&Foo::baz, _1));

I did this with the following:

Damien-Kicks-Computer:~/tmp dkick$ g++ --version
powerpc-apple-darwin8-g++-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build
5341)
[...]

Without optimization, the for loop performed better. I had expected
this. I was very interested to see that the for_each version seemed to
perform better than the for loop version when optimization was -O3. I
had expected it to be about identical with the for loop version.

Damien-Kicks-Computer:~/tmp dkick$ g++ -I $ACE_ROOT -I /sw/include
duff.cc -o duff -lACE
Damien-Kicks-Computer:~/tmp dkick$ g++ -I $ACE_ROOT -I /sw/include
fudd.cc -o fudd -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 9
duration.usec() = 372856
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 9
duration.usec() = 394740
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 9
duration.usec() = 348901
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 10
duration.usec() = 245389
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 10
duration.usec() = 199316
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 11
duration.usec() = 788393
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ g++ -O1 -I $ACE_ROOT -I /sw/include
duff.cc -o duff -lACE
Damien-Kicks-Computer:~/tmp dkick$ g++ -O1 -I $ACE_ROOT -I /sw/include
fudd.cc -o fudd -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 865765
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 822372
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 869480
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 823178
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 864894
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 867775
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ g++ -O2 -I $ACE_ROOT -I /sw/include
duff.cc -o duff -lACE
Damien-Kicks-Computer:~/tmp dkick$ g++ -O2 -I $ACE_ROOT -I /sw/include
fudd.cc -o fudd -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 846027
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 796215
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 792571
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 876464
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 867439
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 843071
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ g++ -O3 -I $ACE_ROOT -I /sw/include
duff.cc -o duff -lACE
Damien-Kicks-Computer:~/tmp dkick$ g++ -O3 -I $ACE_ROOT -I /sw/include
fudd.cc -o fudd -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 686661
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 674884
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 719897
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 566605
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 596694
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 598898
qux.size() = 1000000
value = 2.57784e+06

And then it occurred to me to try a variation with "accumulate"
(algorithms.cc):

    std::vector<Foo> qux;
    std::size_t const qux_size = useless_microbenchmark_data_size;
    qux.reserve(useless_microbenchmark_data_size);
    std::generate_n(std::back_inserter(qux), qux_size, make_foo);
    double const value =
        std::accumulate(qux.begin(), qux.end(), double(0),
                        _1 + bind(&Foo::baz, _2));

I was interested to see that things seemed to get worse at -O2 but that
with -O3, it seemed to be the best performing variation.

Damien-Kicks-Computer:~/tmp dkick$ g++ -I $ACE_ROOT -I /sw/include
algorithms.cc -o algorithms -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 10
duration.usec() = 241283
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 10
duration.usec() = 236267
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 10
duration.usec() = 255965
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ g++ -O1 -I $ACE_ROOT -I /sw/include
algorithms.cc -o algorithms -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 1
duration.usec() = 940784
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 1
duration.usec() = 929024
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 1
duration.usec() = 905576
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ g++ -O2 -I $ACE_ROOT -I /sw/include
algorithms.cc -o algorithms -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 2
duration.usec() = 201833
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 1
duration.usec() = 939748
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 2
duration.usec() = 276619
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ g++ -O3 -I $ACE_ROOT -I /sw/include
algorithms.cc -o algorithms -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 1
duration.usec() = 524692
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 1
duration.usec() = 587146
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 1
duration.usec() = 576647
qux.size() = 1000000
value = 2.57784e+06

So what I'm wondering is ... just how flawed is my methodology, code,
etc.? Are these numbers of any use, even within the already flawed
realm of microbenchmarks?

--
      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated. First time posters: Do this! ]