Re: Designing Numerical Computation in C++

From:

"=?iso-8859-1?q?Erik_Wikstr=F6m?=" <eriwik@student.chalmers.se>

Newsgroups:

comp.lang.c++

Date:

26 Mar 2007 23:40:30 -0700

Message-ID:

<1174977630.028311.67180@n76g2000hsh.googlegroups.com>

On 26 Mar, 22:45, "Phocas" <kjcam...@gmail.com> wrote:

Hello everyone.

I've been using C++ for many years now, and I am starting to work on
my first large project which has many different components. Some of
these components do extremely performance critical things like
scientific computations and ray-tracing. These are done for many
iterations so the code should run as fast as possible. On the other
hand, I also want the code to make sense to people, be easy to
understand and modify, and to change internally if a higher
performance solution comes along. This has been very challenging so
far. In this post I talk about a particular problem I've been having,
in the hope that those who've worked with C++ in a professional
setting might share some good advice about what they would do in
similar situations.

In particular I have tried to implement a vector class, which models n-
dimensional Cartesian space vectors. Let me start off by saying that
I can't really envision the application right now needing anything
other than 2, 3, or 4 dimensional vectors. I thought of 3 different
ways to make this class:

1.) Don't do anything, use Blitz++, tvmet, uBLAS, etc. Unfortunately,
the expression template techniques which make these go so much faster
than normal C++ on statements like "v = 2*f - dot(f, v)*g + 3/lengh(h)
* k" make them go significantly slower in statements like "v = 2*f"
because of the extra constructor invocation. Even tvmet shows a 50%
penalty when performing y = a*x + b for very small vectors. This can
cause a 50% performance hit because there are so few instructions to
begin with, that adding some extra in tight loop kills performance.
In ray tracing, I think stuff like this really matters because you
measure in millions of rays per second. My solution is to not use
these classes. In the documentation, I write: "No methods contain
Vector as the return type. That means vec.method(...) cannot evaluate
to another Vector. This is because doing so creates a copy (i.e.,
runs the constructor). If this is what you need to do, you should
_see_ the cost of constructing a new object to ensure you really need
it. If you do not want to change "this", manually create a copy and
call an operation on the copy that will change "this". The assignment
operators +=, -=, etc. are implemented because they change "this".
Operators like +, -, etc. are not implemented because they should
treat both operands as const and return a new Vector." Already this
is a big hit to usability and creating a "friendly" library for future
developers, but I really think this is worth it.

2.) The absolute guaranteed fastest solution, I also really dislike.
This is to create Vector2f, Vector2d, Vector3f, etc. for the different
dimensions and floating point precisions. Unfortunately this ensures
that adding new code, changing code, etc., must be changed in many
different places. However, there are several advantages to this
method: if I were to implement something like x86 SEE parallelized
versions of vector additions, subtractions, etc., they must be done in
assembly, in completely different ways for single and double precision
floating pointing. So they are easier to specialize if the classes
are seperate anyway. Plus I doubt anyone will want to change this
extremely simplistic classes anyway. Still, the solution is very
"unelegant".

3.) The way I've actually done it now, with templates and template
partial specializations. There is a base class
VectorBase<NumericType, Size> which implements many of the general
functionality like +=, -=, dot, length, etc. in terms of for loops
whose size is known at compile time. There are specialized versions
of Vector, which inherit most of their functionality from VectorBase
(I used this technique because member partial specialization is not
possible), but can also add special extra operations like the cross
product for 3 dimensional vectors only, and special constructors which
use the known size of the vectors. Mostly all they do is implement
constructors which call the VectorBase constructors, which the C++0x
forwarding constructor syntax should clean up nicely, when it comes
out.

So I was really happy with (3), which dramatically minimizes the code
compared to (2) but should go as fast since the loops should get
unrolled. But the compiler I am using (Microsoft Visual C++) won't
unroll the loops, even when they're size 2. This actually results in
a factor of 2 performance difference when you're doing just a few
floating point multiplies and adds (which is roughly like what is used
in a k-d tree traversal in raytracing, although I used the dot product
to test it). And there is no way to force them to unroll that I can
find, except with Intel's compilers which I do not have access to.
Basically, this frustrating exercise has just shown me the cynical
thing is true: you really do need to sacrifice clarity for
performance. I will probably end up implementing 6 high performance
vector classes, and 6 high performance matrix classes, all with
basically the same code. At least I can make SSE versions more
cleanly.

What bothers me is all the time I spent on this. I'm sure the more
experienced software developers out there would look at situations
like this and had some intution about the "best" thing to do. I, on
the other hand, have a directory full of dead implementations of the
same thing. My biggest problem is probably worrying too much over
details, and not knowing what its important. Does anyone have any
advice, if its actually possible to give advice this general?

Been working on a similar thing for some time and I too choose to
implement my own version of vectors and matrices using templates, much
for the same reasons. I was wondering what compiler settings you are
using and if you have asked about it at MS groups like
microsoft.public.vc or microsoft.public.dotnet.languages.vc (I don't
know if my loops get unrolled or not). Perhaps there are some pragmas
that can help?

Another way to improve speed (but code cleanliness will take a hit) is
to use OpenMP, if you use VS2005 (non-express) just look it up in the
help, I've personally seen that it can improve performance with very
little effort.

--
Erik Wikstr=F6m