Re: Fast return of two longs

From:

Pavel <dot_com_yahoo@paultolk_reverse.yourself>

Newsgroups:

comp.lang.c++

Date:

Tue, 21 Jul 2009 02:04:56 GMT

Message-ID:

<cn99m.301$646.257@nwrddc01.gnilink.net>

tni wrote:

Pavel wrote:

tni wrote:

Pavel wrote:

I did something similar recently. What I found out was that, on
modern Intel platforms at least, the time required by
multiplications themselves clearly dominated.

Modern Intel CPUs are quite good for 32 -> 64 bit multiplications.
For something like:

uint32_t v1, v2;
uint64_t result = uint64_t(v1) * uint64_t(v2);

both GCC 3/4 and Visual Studio 2005/2008 generate a single 'mul'

I am not sure what instruction you refer to. Do you mean IMUL?

No, I mean 'mul' (unsigned multiply).

instruction (that will take 1-5 clock cycles on a core2), so any

1-5 clock cycles (actually 1-3 for ) is a throughput of these
instructions

The throughput is 1/cycle on an Intel Core 2, with a latency of 5 clock
cycles (I'm not talking about a P4 which has horrible latencies for
pretty much everything). Multiply is fully pipelined.

(that is, how fast you can give the next back-to-back IMUL instruction
; however, the latency is at least 10 cycles (depending on the core
model).

No, it doesn't, assuming a 'long' is 32-bit, and a 'long long' 64-bit on
IA-32.

-------------------------------------------------
#include "pstdint.h"
#include <iostream>

__declspec(noinline)
uint64_t mul(uint32_t a, uint32_t b) {
    return uint64_t(a) * uint64_t(b);
}

int main(int argc, char* argv[]) {
    uint32_t a = std::rand();
    uint32_t b = std::rand();
    std::cout << mul(a, b) << std::endl;
    return 0;
}
--------------------------------------------------

Visual Studio 2005, built for IA-32, disassembled:

uint64_t mul(uint32_t a, uint32_t b) {
    return uint64_t(a) * uint64_t(b);
00401000 mul eax,dword ptr [esp+4]
}
00401004 ret

The result is 64 bits, in eax:edx

\\

On x64, the 64-bit x 64-bit -> 128-bit multiply is also a single
instruction (but you have to use a non-standard compiler intrinsic).

------------------------------------------------
#include "pstdint.h"
#include <iostream>
#include <intrin.h>

__declspec(noinline)
void mul128(int64_t a, int64_t b, int64_t& res1, int64_t& res2) {
    res1 = _mul128(a, b, &res2);
}
------------------------------------------------

Visual Studio 2005, built for x64:

void mul128(int64_t a, int64_t b, int64_t& res1, int64_t& res2) {
    res1 = _mul128(a, b, &res2);
0000000140001000 mov rax,rdx
0000000140001003 imul rcx
0000000140001006 mov qword ptr [r9],rdx
0000000140001009 mov qword ptr [r8],rax
}
000000014000100C ret

The result is 128 bits, qword ptr [r9] : qword ptr [r8].

\\

> OP's problem will require 4 IMUL instructions (unless
> he resorts to assembler which does not seem to be in the cards)

As per above, on IA-32/x64, the compiler does the job - no need for
assembly.

> and intermediate ADDs may make next IMUL wait till the end of ADD.

Then you are doing something wrong, e.g. look at:
http://www.mips.com/media/files/white-papers/64%20bit%20architecture%20speeds%20RSA4x.pdf

for an algorithm. The 4 required multiplies are independent and can run
in parallel on current IA-32/x64 CPUs (non-x64 P4's don't have a fully
pipelined multiply, they can only do 0.2 muls/clock throughput and have
high-latency shifts which doesn't help either, since those will be in
the dependency chain).

> That will still take 10 cycles though (actually, up to 14, depending
> on a particular model of the processor architecture) on Intel.

Only the P4 has latencies that high, Pentium Pro, P2, P3, PM, Core
Solo/Duo, Core 2, Core i7 all have fast multiplies; AMD Athlon, Athlon
64, Opteron and Phenom are fast as well.

For MUL, "INTEL? 64 AND IA-32 PROCESSOR ARCHITECTURES" gives 10 cycles
latency for the model they refer to as used for Core 2 (From C.3.1: "The
column represented by 0xF3n also applies to Intel
processors with CPUID signature 0xF4n and 0xF6n".."?n? indicates it
applies to any value in the stepping encoding".."Intel Core Solo and
Intel Core Duo processors are represented by 06_0EH") See the bottom of
the table C-13 there, the lowest latency for MUL is shown there for the
model 0xF3H and it says 10 cycles. And my testing attested for every of
these 10 cycles, unfortunately :(.

-Pavel