Re: Can I avoid the use of arrays in this situation or do I have to use them?

From:
terminator <farid.mehrabi@gmail.com>
Newsgroups:
comp.lang.c++
Date:
Sun, 25 Nov 2007 05:35:14 -0800 (PST)
Message-ID:
<cc0036ce-7789-41be-b843-5661c6b0a271@e10g2000prf.googlegroups.com>
On Nov 24, 11:14 pm, mike3 <mike4...@yahoo.com> wrote:

On Nov 24, 1:03 am, "Alf P. Steinbach" <al...@start.no> wrote:

* mike3:

On Nov 23, 3:02 pm, "Alf P. Steinbach" <al...@start.no> wrote:
<snip>

Why do you have the length arguments?

A std::vector has a length, accessible via size() and resize().


I thought resize() is to do just that, resize the vector,
not get its length.


It sets the length. Plus allocates if necessary, and initializes
elements that become accessible.
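
(In other words, assuming Digit is an unsigned integer type:)

    std::vector<Digit> v;   // size() == 0
    v.resize(8);            // size() == 8; the new elements are zero
    v[7] = 1;               // fine, the element exists now
    v.resize(4);            // size() == 4; capacity typically stays put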


Oh, so then size() gets the length, resize() resets the length.

Also, I have the length arguments
so I can use lengths different from those of the vector
if one needs to do that for some reason.


Yeah, but /do/ you need that?


Well, not for the multiplication, so maybe I can get rid of it there.
But for the addition and subtraction, I need variable lengths _and_
variable starting offsets so I can manipulate pieces of the digit
strings (to allow for the bit/word shift used when adding floating
point).

Without needing it, the passed lengths are just one more thing that can
go wrong.


Is it possible, though, that the std::vector may expand/shrink on me
and I'd have to update the "length" field in my bignums accordingly?
Or that I'd have to be prepared to handle operands of differing
lengths going into the routine? That would be bad for the
addition/subtraction, as every operation counts in terms of speed,
and coding things up to handle varying lengths when in the floating
point arithmetic all passed lengths of digit slices _should_ be the
same is a waste of performance AND a silly way to code something!
Why put in what you don't use?

{
       /* Set up temporary digit buffer */
       Digit tmpbuf[2*MAX_PRECISION]; // use array instead of vector to
                                      // maximize performance. Vector
                                      // takes forever to constantly
                                      // de/reallocate.

Have you measured this?


Sure did. Took ~34 secs, for 100,000,000 calls, with all
compiler optimizations turned on.

I just reduced the routine to this:

FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
                         const std::vector<Digit> &vec1,
                         const std::vector<Digit> &vec2,
                         int length1, int length2)
{
     std::vector<Digit> tmpbuf(length1 + length2);
     return(FG3DError(FG3D_SUCCESS));
}

and just ran it over and over 100,000,000 times to
see how long that allocate/free takes and I was
stunned to find it wastes 34 seconds, more time than
the entire multiplication loop!
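
(For reference, a timing harness for that might look roughly like
this; Digit and the routine are as defined in the thread, and the
operand length of 8 is just an arbitrary choice:)

    #include <cstdio>
    #include <ctime>
    #include <vector>

    int main()
    {
        std::vector<Digit> vec0, vec1(8), vec2(8);
        std::clock_t start = std::clock();
        for (long i = 0; i < 100000000L; ++i)
            FG3DDigitVector_Mul(vec0, vec1, vec2, 8, 8);
        double secs = double(std::clock() - start) / CLOCKS_PER_SEC;
        std::printf("%f seconds\n", secs);
    }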


Well, it comes down to usage pattern.

If the out-parameter is generally an empty vector or of less capacity
than required for the result, then there will be an allocation anyway in
most cases. And then the most efficient will generally be to declare
the temporary as a vector, and swap at the end.
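
(For illustration, a minimal sketch of that vector-temporary-plus-swap
pattern, reusing the Digit/TwoDigit/DIGIT_SIZE/MAX_DIGIT/FG3DError
names from earlier in the thread; the inner loop just mirrors the
multiply code already posted:)

    FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
                                  const std::vector<Digit> &vec1,
                                  const std::vector<Digit> &vec2)
    {
        std::vector<Digit> tmpbuf(vec1.size() + vec2.size());
        for (std::size_t i = 0; i < vec1.size(); ++i)
        {
            TwoDigit carry = 0;
            for (std::size_t j = 0; j < vec2.size(); ++j)
            {
                TwoDigit tmp = TwoDigit(vec1[i]) * vec2[j]
                             + tmpbuf[i+j] + carry;
                carry = tmp >> DIGIT_SIZE;
                tmpbuf[i+j] = Digit(tmp & MAX_DIGIT);
            }
            tmpbuf[i + vec2.size()] = Digit(carry);
        }
        vec0.swap(tmpbuf);   // O(1); vec0 is left untouched if anything
                             // above throws, which is the exception-safety point
        return FG3DError(FG3D_SUCCESS);
    }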


But the simple act of declaring the vector (as I showed above) takes
mammoth amounts of time -- unreasonable amounts, in fact.

On the other hand, if the out-parameter is generally of sufficient
capacity, then from a local point of view the most efficient will
generally be to reuse that capacity, i.e. to use a raw array for the
temporary or to build the result right into the out-parameter.


This is the case I have. These low-level digit routines
are used as building blocks to build a floating point
bignum package. The out-parameter _is_ of sufficient
capacity, as it's a temporary buffer _inside the floating
point routine_. So I suppose I could scrap the
buffer in the mul routine above entirely. But that of
course raises the question of how to handle the buffer
in the floating point routine. How could I do that?
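
(One way to do that, sketched with made-up names -- FloatNum,
FloatWorkspace and FG3DDigitVector_Mul_Into are all hypothetical: the
floating point layer owns a scratch buffer that is allocated once and
handed to the digit routines, so nothing in the hot path allocates:)

    #include <vector>

    struct FloatNum
    {
        std::vector<Digit> mantissa;   // fixed precision, set at construction
        int exponent;
        int sign;
    };

    struct FloatWorkspace
    {
        explicit FloatWorkspace(std::size_t precision)
            : product(2 * precision) {}
        std::vector<Digit> product;    // holds the 2x-width product
    };

    void FloatMul(FloatNum &r, const FloatNum &a, const FloatNum &b,
                  FloatWorkspace &ws)
    {
        // The digit routine writes straight into ws.product, whose size
        // is already right, so no allocation happens inside the hot loop.
        FG3DDigitVector_Mul_Into(ws.product, a.mantissa, b.mantissa);
        // ... then truncate ws.product to r.mantissa's width and round ...
    }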

However, reuse of capacity runs the risk of "leaking" memory in the
sense that the capacity is never lowered, only increasing.


In the floating point library, the precision/"capacity"
of all operands should be fixed, not increasing _or_
decreasing, because if this happens too frequently (as in
more than almost never) it's going to cut into the
performance of the fractal calculation.

So measuring for actual usage, typical usage pattern, is critical.


But what this shows is the mere act of creating the temporary buffer
consumes ludicrous amounts of time.
This is an application in which every last bit of speed
I can get counts. There are calculations that need
to be done involving these operations -- lots of 'em --
for many different data points (and they're all complex
numbers -- don't even get me started on evaluating
transcendental functions). This has ramifications for
the floating point library which needs temporary buffers
to avoid munging operands (in the case of addition/
subtraction with its bit shift) or to give the
correct result (in the case of multiplication which
requires the upper bits of the 2x-width product).

[snip]

Ensure class invariants in operations so that you don't have to pepper
the code with checks all over the place.

In other words,

   class Bignum
   {
   private:
       std::vector<Digit> myDigits;
   public:
       // Whatever, all operations assume a max length
       // and ensure that max length.
   };


The problem is I need to be able to work on pieces of the
digit string as well as the entire string with some
operations (addition, subtraction).


Sounds like slices.

The standard library isn't very well equipped when it comes to slices:
you may have to build that yourself.


How? Any ideas on how to design it for maximum performance?

There is std::valarray, which I think supports slices, but that's a
not-quite-finished piece of the library. If it works, great. If not...
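
(For what it's worth, the slice support looks like this; whether it is
fast enough on your compiler is exactly the "if it works" question:)

    #include <valarray>

    void demo()   // Digit as defined elsewhere in the thread
    {
        std::valarray<Digit> digits(16);
        // a view of 5 elements starting at index 3, stride 1
        std::valarray<Digit> piece = digits[std::slice(3, 5, 1)];
        // slices can also be written through:
        digits[std::slice(3, 5, 1)] += piece;
    }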


So does this mean I should just scrap the vector<>, use
arrays, and just be careful with the length/offset parameters I pass?

And if the operations
ensure the max length, doesn't that require those if()
checks in them to make sure the length parameter passed
does not exceed the limit?


No, the point is to not pass separate length parameters: whatever
object you're passing knows its own length.


But for maximum speed, I'd rather demand all operands
to, say, the addition/subtraction routines have the same
length, as that's all that's going to occur in my floating
point arithmetic. Having all that redundancy not only
wastes performance but just doesn't feel good in my
opinion.

But again, it sounds as if you're dealing with slices.

If so, then Bignum (or whatever) needs to support slices,
efficiently. In the operation that creates a logical slice there will
be a length check. But that will then be the only one, centralized.


Does such an operation cost less than the addition/
subtraction of the digits & the bit shifting combined,
even when repeated billions of times as it is in the
fractal generation?

There are 4 possible cases that can occur:

Floating Point Digit Add/Sub Scenarios:

R = A +/- B

MSD --- LSD
RRRRRRRRRRR
AAAAAAAAAAA
      BBBBB (we need 2 slices of A, 1 of B, 2 of R)

RRRRRRRRRRR
AAAAAAAAAAA
    BBBB (we need 3 slices of A, 1 of B, 3 of R)

RRRRRRRRRRR
AAAAAAA
     BBBBBB (we need 2 slices of A, 2 of B, 3 of R)

RRRRRRRRRRR
AAAAA
       BBBB (we need 1 slice of A, 1 of B, 3 of R)

Any one of these cases may occur, although case 1
will be the most frequent due to the numbers involved
having equal precision. That requires 5 slice requests.
With around 5 slice-requiring floating point operations
per fractal iteration, that's 25 slice requests per
iteration. One fractal image may average 12,000
iterations per point, so over all the points in an image
that adds up to over 92 billion such requests.

<snip>

Well, the ordinary arithmetic promotions were designed to handle this
kind of stuff.

It's enough that one operand of an operation is of type TwoDigit in
order to force the other operand to be promoted.


But doesn't that require a cast? Also, carry is of type
TwoDigit, so shouldn't it promote everything else?

This is one case where I'd forsake the newfangled casts for clarity,

   TwoDigit tmp = TwoDigit(*v1i)*(*v2i) + tmpbuf[i+j] + carry;

"TwoDigit(*v1i)" is a syntax that was introduced in C++ to make it look
like a constructor call, but is defined for a primitive type as
equivalent to a C style cast "((TwoDigit)*v1i)". I.e. in some other
context it could invoke reinterpret_cast or const_cast or even cast to
inaccessible base. I don't know why it wasn't defined as static_cast.
Probably an oversight or misunderstanding: they bungled it.


Thanks, I'll do that instead. Although, what about carry?
That's a TwoDigit; wouldn't it automatically do the
promotion, so one would not even need the cast
around (*v1i)? Or does one operand of each binary operator need to be
TwoDigit (in which case putting it at the beginning ensures all the
successive operations receive at least one TwoDigit)?
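
For what it's worth, carry on its own isn't enough: it only joins in
after (*v1i)*(*v2i) has already been evaluated, and without a TwoDigit
operand that multiplication is carried out at the width Digit promotes
to, which can throw away the high half of the product. Putting the
cast at the beginning does, however, guarantee that every later + in
the chain sees a TwoDigit operand. For instance, assuming a 32-bit
Digit and a 64-bit TwoDigit:

    Digit    a = 0xFFFFFFFF, b = 2;
    TwoDigit carry = 0;

    TwoDigit bad  = a * b + carry;           // a*b evaluated in 32 bits:
                                             // the high bits are already gone
                                             // by the time carry is added
    TwoDigit good = TwoDigit(a) * b + carry; // multiplication done in 64 bits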

             carry = tmp>>DIGIT_SIZE;
             tmp &= MAX_DIGIT;
             tmpbuf[i+j] = tmp;
          }
          tmpbuf[i+length2] = carry;
       }
       /* Copy result back */
       vec0.reserve(length1 + length2);
       std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());

Don't modify out-arguments directly.

Create the result in a local variable, then at the end swap().

For exception safety.


The result is already being built in the temp buffer, that's the
point. And does this mean one cannot resize the output vector if
it cannot hold the full size product? Or will swap() do that
automatically? Is swap() slower than the above?


You can always resize. Whether swap is more or less efficient depends
on the context, in effect on the usage pattern for the calls of this
function. However, given your measurements I think what I wrote above
is too categorical: you'll need to try out both approaches (array temp +
resize + copy, or vector temp + swap) in the actual typical usage,
measuring.


Well in my usage, it might be possible to entirely avoid the in-place
multiplication in the tight inner fractal loops, at least for the
Mandelbrot set (that takes 1 complex multiplication, equivalent to 3
real multiplications). However, for floating point multiplication one
still needs a temporary buffer to hold the 2x-width product of the
mantissas, which is then truncated to 1x-width and rounded. So that
simply moves the problem outside the multiplication routine I gave and
into that of its caller.

Happily there's not much difference regarding the rest of the code.


Testing the following convinced me that you have to pay a time
overhead somewhere: you cannot put a large number of large objects on
the stack, and if you use dynamic allocation (as vector does) then
allocation and deallocation cost you some time. In-place calculation
is the only case that does not demand dynamic allocation, and only in
that one case can a normal array placed on the stack help. Putting
large arrays on the stack is not normally a good idea.

#include <vector>
#include <algorithm>
using namespace std;

const int sz = 1000000;   // element count used for the test (value assumed)

void arr3(){
    int *arr = new int[sz];      // dynamic array
    fill(arr, arr + sz, 0);
    delete[] arr;                // note: delete[], not delete
}

void vec4(){
    vector<int> v;
    v.reserve(sz);               // allocates capacity, adds no elements
}

void vec5(){
    vector<int> v;
    v.reserve(sz);
    v.insert(v.begin(), sz, 0);  // now actually adds sz zeroes
}

void vec6(){
    vector<int> v;
    v.insert(v.begin(), sz, 0);
}

void vec7(){
    vector<int> v;
    v.resize(sz);                // adds sz value-initialized elements
}

PS:
reserve does not add elements to the vector, it just increases the
current capacity, so use insert or resize to actually add elements.
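
(This also applies to the vec0.reserve() + std::copy() at the end of
the multiplication routine quoted above: copying into a vector whose
size() is still smaller than the range is undefined behaviour even if
the capacity has been reserved. A tiny illustration:)

    #include <algorithm>
    #include <vector>

    void demo(const Digit *src)   // src points at >= 100 digits
    {
        std::vector<Digit> v;
        v.reserve(100);  // capacity() >= 100, but size() is still 0
        // std::copy(src, src + 100, v.begin());  // WRONG: writes past the end

        v.resize(100);   // now size() == 100
        std::copy(src, src + 100, v.begin());     // fine
    }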

regards,
FM.
