Re: C++ FAQ

From:

Keith H Duggar <duggar@alum.mit.edu>

Newsgroups:

comp.lang.c++.moderated

Date:

Thu, 16 Jul 2009 01:48:09 CST

Message-ID:

<86058662-2dfd-4be1-9052-093f28fd5083@t13g2000yqt.googlegroups.com>

Jerry Coffin <jerryvcof...@yahoo.com> wrote:

Keith H Duggar wrote:

Can you provide the code you used in your "bit of testing"? Also
can you reconcile your speculations with these results? I've put
the assembler code .s files generated by g++ for both the tested
platforms in the same directory to help refine your assumptions.

Sure:

Thanks for the effort. Unfortunately, I'm sorry to say that your
results are fundamentally flawed and useless, for the purpose of
comparing these functions at least. The critical problem is that
you did not disable inlining when you compiled the code.

When profiling toy programs, such as this, where the entire test
code (including multiple versions of very similar functions with
simple driver loops and stress functions) resides in one and the
same translation unit, inlining triggers destructive (for timing
purposes) optimizations that convolve the code beyond repair and
dash any hopes we have of measuring the algorithms of interest.

One can see the problems by looking at the assembly code. In the
particular case of your code some examples are 1) since the time
function loops limit times for a *constant* radius the optimizer
can lift code that only depends on radius (such as the if (r==0)
check you added to your do/while) outside of the limit test loop
2) since the stress function, f, is so simple specifically a few
adds, the optimizer can remove the adds for the f(x,y) call when
either x or y is zero (for example if one tries to fix your loop
by adding an f(0,0) call to the if (r==0) it is simply optimized
away) 3) the optimizer's heuristics can cause different inlining
decisions for equivalent code in different functions 4) code can
be shared across functions causing jumps across hundreds or more
bytes of code memory 5) etc.

Finally, there is another problem with your timings. You iterate
time about a thousand times and yet the total clock_t is roughly
only a 1000 or so. This means that the CLOCKS_PER_SEC on systems
you tested is too small (perhaps 100). Thus the entire time loop
runs in single clock tick or less on average. This can introduce
a systematic bias in the timings that depends on code structure.
This is another reason to use profilers as they address this.

I added the full set of functions and changed the numbering (and
descriptions) to match my earlier post. I also fixed some issues
with using clock() on systems with low resolution ticks and also
added sample normalization code to make sure large radius values
do not dominate the time sums. The clock() time results averaged
over eight runs (without inlining!) gives

================================================================
relative times - lower is faster
radius range [08,32) and [08,1024)
radius calls 1000 x 8 runs
----------------------------------------------------------------
Intel Core Duo, g++ 4.0.1, clock() timing
.................................................................
circle0 100% 100% : Kuzmin
circle5 100% 98% : Duggar modified Kuzmin removing uc jump

circle4 102% 100% : Glassborow triple duplicate forever

circle1 119% 108% : Duggar while
circle6 109% 108% : Duggar while + goto to remove f call

circle2 109% 107% : Steinbach/Glassborow forever
circle3 153% 122% : Coffin do/while (broken for r==0)
----------------------------------------------------------------
PPC G4, g++ 3.3, clock() timing (may be inaccurate; slow clock)
.................................................................
circle0 100% 100% : Kuzmin
circle5 102% 101% : Duggar modified Kuzmin removing uc jump

circle4 100% 106% : Glassborow triple duplicate forever

circle1 108% 105% : Duggar while
circle6 105% 105% : Duggar while + goto to remove f call

circle2 105% 113% : Steinbach/Glassborow forever
circle3 106% 102% : Coffin do/while (broken for r==0)
================================================================

Giving the nearly same ordering as my profiler measurements. The
narrowing of the gaps results from the classic problems of using
coarse clock() calls. For one it cannot distinguish between self
and children time so the times now include timer loop iteration,
f body time, etc.

As for reconciling your results, given the real facts of the
situation, I see no reason to even attempt to do so. What you
tested was NOT what either of us actually posted!

What ever are you talking about?? Here is the code you posted:

template <class F>
inline void circle_octant(int r, F &f) {
    int x = r;
    int y = 0;
    int odd = r&1;
    int e = (-r/2)-odd;
    do {
       f(x,y);
       e += y++ + odd;
       if (e >= 0)
          e -= --x;
       odd ^= 1;
    } while (y < x);
    f(x,y);
}

and here is the code take from the file I posted

void circle3 ( int r )
{
    int x = r;
    int y = 0;
    int odd = r&1;
    int e = (-r/2)-odd;
    do {
       f(x,y);
       e += y++ + odd;
       if (e >= 0)
          e -= --x;
       odd ^= 1;
    } while (y < x);
    f(x,y);
}

and that is identical except for the function header! As pointed
out before inlining must be disabled to obtain accurate results.
And, surely you realize the extra f argument, and especially the
missing template keyword are trivial differences in this context
here of comparing the function body performance? Although before
you were hung up on the template vs non-template when you wrote:

ancient 6502/6510/65xx series -- but I'm fairly sure there's no such
thing as a reasonably modern C++ compiler (e.g. that includes
templates) for such targets, so that pretty much eliminates your code
from consideration for that target.

which of course was just a trivial nit-pick so maybe that's it?

As for Francis' duplicate version, it is also identical to what
he posted except for necessary bug fixes! And his first compact
while version is only cosmetically different from the Steinbach
version posted two years ago (ex while(true) vs for(;;)). Since
I already had Steinbach's version I just left as is.

So this:

What you tested was NOT what either of us actually posted!

has no basis that I can see.

Since I prefer to have some built-in timing code, so you don't need
profilers and such to get results, I reworked things a bit more to

That's fine. We can use either. Doesn't Visual C++ have profiler
tools that might be decent? However, as mentioned earlier, there
are a number issues to consider when using clock() and your code
did not address them.

As to your new code, it doesn't compile because it tries to bind
non-const references to temporaries in the lines such as

time1 += timer(circle0<noshow>(), radius, limit, noshow1);

I fixed this by making the operator() const and changing time to
take them by const &. I also changed the function names to match
my initial file (to avoid future confusion such as the confusion
you have about which version is who's), changed the descriptions
because they are wrong, and added the variants that are actually
mine (circle1, 5, and 6). This file is located at

http://www.duggar.org/pub/code/circle/coffin.tgz

Keith's original: 1120
Jerry's do/while: 961
Keith's "improved": 1159
Francis's while loop: 1196

You have confused the owners. Your time3/circle2 is NOT "Keith's
improved" anything. It is the Steinbach/Glassborow forever loop,
despite cosmetic differences between the two. timer4/circle3 is
Francis' *second* triple duplicate forever loop. Finally, time1
circle0 is Kuzmin's original, not mine. Please be more careful.

Strangely, adding the test for r==0 to my code actually makes it run
a bit faster (even though I'm never giving it a radius of 0, so that

That should have been a red flag that something was amiss. See
the comments about inlining at the begging of this post.

branch is never taken -- apparently that test lets the oompiler
simplify some other part of the code, though I haven't looked at the
assembly output to figure out what). In terms of relative speed, this
doesn't make a whole lot of difference though:

It makes a big difference on Intel. On the PPC the compiler gets
the static branch prediction right reducing the cost to zero for
the radius > 0 cases.

I won't bore you more with pages of results from other compilers and
platforms, but the summary is pretty simple: my code is consistently
the fastest. Francis's code is the least consistent, varying all the
way from last place to a close second place.

If we only look at the best set of options for each piece of code, on
each platform, the results are more consistent: my code is the
fastest, Francis's is a close second place, and your two attempts are
nearly tied for last place, though the original is probably slightly
faster than your modified...

Why do you keep referring to Kuzmin's algorithm as my attempt?
You have included none of my attempts (circle1, 5, 6 in the web
file) in your test code. Perhaps if we keep the numbering fixed
to match my file we will have less confusion.

Anyhow, the results were bogus because of inlining convolutions.
Once that is turned off the timings with your code (after fixes)
are consistent with the profiling results posted previously. The
Kuzmin and modified Kuzmin (1,5) outperform all others excepting
Glassborow duplicate (4) in certain scenarios.

The new Coffin do/while is really falling behind here. Probably
because of the new extra branch (that wasn't actually tested in
your trials because the inlining allowed the optimizers to lift
the check out of the time loop into the outer radius loops) and
it still produces the wrong f call set. As said in other posts,
it should call f(r,0) when r == 0. If does fine on PPC; but who
knows how it will do once it actually gives the correct f calls.

KHD

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]