Re: Run-time efficiency: Branch prediction

From:

"James Kanze" <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++.moderated

Date:

Sun, 25 Mar 2007 14:42:48 CST

Message-ID:

<1174850481.837743.76060@o5g2000hsb.googlegroups.com>

On Mar 25, 6:10 pm, "Rune Allnor" <all...@tele.ntnu.no> wrote:

On 25 Mar, 16:10, "James Kanze" <james.ka...@gmail.com> wrote:

On Mar 25, 10:54 am, "Rune Allnor" <all...@tele.ntnu.no> wrote:

On 25 Mar, 01:30, "Daniel Kr?gler" <daniel.krueg...@googlemail.com>
wrote:

Rune Allnor schrieb:

So the simple answer is that one should write naive
code and let the compiler earn its salary?

The question is: What do you mean with "writing naive"
code?

Code that describes the algorithm's general functionality
but which does not exploit any additional knowledge about
CPU functionality or problem domain specifics.

I don't think that you can always ignore the problem domain
specifics. If you have to sort, for example, knowing that there
will never be more than 10 elements, or that there will never be
less than a million, would definitly affect the choice of
algorithm.

A few monts ago I implemented an algorithm wich was based on
searching for medians in a frame which slid long the data
series. Once you have done the job in one N-length frame,
the oldest elemnt in the frame is deleted and the next in
line in the data series is include, and the job is done
again.

The guy who provided me the algorithm sorted each frame
independently. I took advatage of the fact that elemnts
were deleted and appended to already sorted sequences.

The run-times of our implementations were very different.

Certainly. That's the sort of domain specific optimization
which makes sense.

In general, I tend to ignore anything but big-O differences when
writing code. I especially ignore CPU issues, since these vary
from one CPU to the next, even within the same architecture.

Variant 1, scaling by division, was demonstrated to
be very much more expensive than variant 2, scaling
by multiplication, what run-time was concerned.

With what compiler, on what machine? There's no difference on
most of the machines I've used, including machines of 20 years
ago, and most compilers have no trouble optimizing this when
there is a difference.

This would have been Borland Turbo Pascal 5.0 and 6.0 and
Turbo Assembler on PCs, running i 386 + 387 FPU around 1990.

The argument was that a large number of divisions took more
time to compute than a similar number of multiplications, due
to low-level implementation details in the CPUs/FPUs.
We were taught to actively code variant 2 in order to
avoid overhead in our codes.

Anyone who taught that, even 20 years ago, is simply
incompetent.

When I bought TPascal 6.0 + TASM 2.0(?) in 1990, an ASM quick
reference guide where the clock cycles required for the mult
and div instructions, was included. I don't remember the exact
numbers, but the general impression was the the warning was
correct: Divisions took significantly longer than mults.

For early 8086's, a division did take about twice as long as a
multiplication. And both were more than an order of magnitude
slower than addition or subtraction. But even back then,
compilers were able to choose the best solutions. And of
course, most of the time, you weren't doing enough
multiplications or divisions for it to make a difference.

And of course, early 8086's are older than 20 years. To give
some concrete cases:

-- In the late 1970's, I worked on the Perkin Elmer (nee
    Interdata) 8/32. A mini-computer, but whose Fortran
    compiler would definitly convert division to multiplication,
    when it would make a difference.

-- In the late 1908's (which is 20 years ago), I programmed for
    the NSC 32000, a microprocessor where multiplication and
    division were both a single clock---and integer
    multiplication was faster than shifting.

Now, such systems certainly weren't universal at that time, but
still... It would be foolish to optimize before knowing the
details, since they certainly existed.

Even 20 years ago, we knew the dangers of
premature optimization.

Ah, the curse of Hoare's dictum truncated... People
usually quote quote the first part. The full message
is something like "premature optimization should be
avoided _in_99%_of_the_cases_".

Actually, the quote I've seen was '97%'. Formally, "premature
optimization" is always wrong, since if it wasn't wrong, it
wouldn't be premature. In practice, most of the time, the small
differences don't matter, and if your code is well designed,
it's easy to fix even the big differences later, when you know
that it's necessary. My experience has been that programs where
the programmer worried about such problems before having
actually written the code and profiled end up running slower
anytime it matters. And being incredibly fragile and difficult
to maintain. The most important thing to do for optimization is
encapsulation, so that you can easily change something when you
know where the problem is.

The median processing I implemented crtainly falls
inside the one remaining percent, as does a lot of
the algorithms I work with.

I don't consider using the correct algorithm "premature
optimization". I believe, in fact, that I said that my choice
of a sorting algorithm would depend on application specific
knowledge concerning how many elements I had to sort.

Worrying about the details of the code the compiler might
generate is premature optimization, until you've actually
measured.

--
James Kanze (Gabi Software) email: james.kanze@gmail.com
Conseils en informatique orient?e objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place S?mard, 78210 St.-Cyr-l'?cole, France, +33 (0)1 30 23 00 34

--
      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated. First time posters: Do this! ]