Pavel wrote:
tni wrote:
Pavel wrote:
I did something similar recently. What I found out was that, on
modern Intel platforms at least, the time required by
multiplications themselves clearly dominated.
Modern Intel CPUs are quite good at 32 -> 64 bit multiplications.
For something like:
uint32_t v1, v2;
uint64_t result = uint64_t(v1) * uint64_t(v2);
both GCC 3/4 and Visual Studio 2005/2008 generate a single 'mul'
I am not sure what instruction you refer to. Do you mean IMUL?
No, I mean 'mul' (unsigned multiply).
instruction (that will take 1-5 clock cycles on a core2), so any
1-5 clock cycles (actually 1-3 for ) is the throughput of these
instructions
The throughput is 1/cycle on an Intel Core 2, with a latency of 5 clock
cycles (I'm not talking about a P4 which has horrible latencies for
pretty much everything). Multiply is fully pipelined.
(that is, how fast you can issue the next back-to-back IMUL instruction);
however, the latency is at least 10 cycles (depending on the core model).
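If you want to separate the two yourself, a quick sketch like the one below
does the job (timed with the __rdtsc intrinsic from <intrin.h>; the constant
and iteration count are arbitrary, and the numbers are only approximate -
the compiler may even narrow the multiply when only the low half is kept):
------------------------------------------------
#include "pstdint.h"
#include <intrin.h>
#include <iostream>

int main() {
    const int N = 100000000;

    // Dependent chain: each multiply consumes the previous result,
    // so the time per iteration approaches the multiply *latency*.
    uint32_t x = 12345;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < N; ++i)
        x = uint32_t(uint64_t(x) * 2654435761u);
    uint64_t t1 = __rdtsc();

    // Four independent chains: the multiplies can overlap in the
    // pipeline, so the time per multiply approaches the *throughput*.
    uint32_t a = 1, b = 2, c = 3, d = 4;
    uint64_t t2 = __rdtsc();
    for (int i = 0; i < N; ++i) {
        a = uint32_t(uint64_t(a) * 2654435761u);
        b = uint32_t(uint64_t(b) * 2654435761u);
        c = uint32_t(uint64_t(c) * 2654435761u);
        d = uint32_t(uint64_t(d) * 2654435761u);
    }
    uint64_t t3 = __rdtsc();

    std::cout << "dependent:   " << double(t1 - t0) / N << " cycles/mul\n";
    std::cout << "independent: " << double(t3 - t2) / (4.0 * N) << " cycles/mul\n";
    std::cout << (x ^ a ^ b ^ c ^ d) << std::endl;  // keep the optimizer honest
    return 0;
}
------------------------------------------------
On a Core 2 the dependent loop should come out several times slower per
multiply than the independent one (roughly the latency vs. roughly 1/cycle).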
No, it doesn't, assuming a 'long' is 32 bits and a 'long long' is 64 bits on
IA-32.
-------------------------------------------------
#include "pstdint.h"
#include <iostream>
__declspec(noinline)
uint64_t mul(uint32_t a, uint32_t b) {
return uint64_t(a) * uint64_t(b);
}
int main(int argc, char* argv[]) {
uint32_t a = std::rand();
uint32_t b = std::rand();
std::cout << mul(a, b) << std::endl;
return 0;
}
--------------------------------------------------
Visual Studio 2005, built for IA-32, disassembled:
uint64_t mul(uint32_t a, uint32_t b) {
return uint64_t(a) * uint64_t(b);
00401000 mul eax,dword ptr [esp+4]
}
00401004 ret
The result is 64 bits, in edx:eax.
On x64, the 64-bit x 64-bit -> 128-bit multiply is also a single
instruction (but you have to use a non-standard compiler intrinsic).
------------------------------------------------
#include "pstdint.h"
#include <iostream>
#include <intrin.h>
__declspec(noinline)
void mul128(int64_t a, int64_t b, int64_t& res1, int64_t& res2) {
    // _mul128 returns the low 64 bits of the signed product and
    // writes the high 64 bits through its last argument.
    res1 = _mul128(a, b, &res2);
}
------------------------------------------------
Visual Studio 2005, built for x64:
void mul128(int64_t a, int64_t b, int64_t& res1, int64_t& res2) {
res1 = _mul128(a, b, &res2);
0000000140001000 mov rax,rdx
0000000140001003 imul rcx
0000000140001006 mov qword ptr [r9],rdx
0000000140001009 mov qword ptr [r8],rax
}
000000014000100C ret
The result is 128 bits, qword ptr [r9] : qword ptr [r8].
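For completeness, a standalone usage example could look like this (x64 build;
the test values are arbitrary, just for illustration):
------------------------------------------------
#include "pstdint.h"
#include <intrin.h>
#include <iostream>

int main() {
    int64_t lo = 0, hi = 0;
    int64_t a = 0x123456789ABCDEFLL;    // arbitrary test values
    int64_t b = 0x0FEDCBA987654321LL;
    lo = _mul128(a, b, &hi);            // low 64 bits returned, high 64 via pointer
    // hi:lo now holds the full signed 128-bit product of a and b.
    std::cout << std::hex << hi << ":" << lo << std::endl;
    return 0;
}
------------------------------------------------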
> OP's problem will require 4 IMUL instructions (unless
> he resorts to assembler, which does not seem to be in the cards)
As per above, on IA-32/x64, the compiler does the job - no need for
assembly.
> and intermediate ADDs may make next IMUL wait till the end of ADD.
Then you are doing something wrong, e.g. look at:
http://www.mips.com/media/files/white-papers/64%20bit%20architecture%20speeds%20RSA4x.pdf
for an algorithm. The 4 required multiplies are independent and can run
in parallel on current IA-32/x64 CPUs (non-x64 P4's don't have a fully
pipelined multiply; they can only do 0.2 muls/clock throughput and have
high-latency shifts, which don't help either, since those will be in the
dependency chain).
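If you don't want the compiler intrinsic, the same idea is easy to write out
by hand. A rough portable sketch (unsigned, not the exact algorithm from the
whitepaper, and the name umul128_portable is just mine):
------------------------------------------------
#include "pstdint.h"

// Unsigned 64 x 64 -> 128 multiply built from four 32 x 32 -> 64 products.
void umul128_portable(uint64_t a, uint64_t b, uint64_t& lo, uint64_t& hi) {
    uint32_t a_lo = uint32_t(a), a_hi = uint32_t(a >> 32);
    uint32_t b_lo = uint32_t(b), b_hi = uint32_t(b >> 32);

    // The four partial products are independent of each other.
    uint64_t p0 = uint64_t(a_lo) * b_lo;   // contributes to bits   0..63
    uint64_t p1 = uint64_t(a_lo) * b_hi;   // contributes to bits  32..95
    uint64_t p2 = uint64_t(a_hi) * b_lo;   // contributes to bits  32..95
    uint64_t p3 = uint64_t(a_hi) * b_hi;   // contributes to bits 64..127

    // Combine them, propagating the carry out of the middle 64 bits.
    uint64_t mid = (p0 >> 32) + uint32_t(p1) + uint32_t(p2);
    lo = (mid << 32) | uint32_t(p0);
    hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
------------------------------------------------
The four products p0..p3 depend only on the inputs, so an out-of-order core
can overlap them; only the final additions form a short serial chain.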
> That will still take 10 cycles though (actually, up to 14, depending
> on a particular model of the processor architecture) on Intel.
Only the P4 has latencies that high; Pentium Pro, P2, P3, PM, Core
Solo/Duo, Core 2, and Core i7 all have fast multiplies; AMD Athlon, Athlon
64, Opteron, and Phenom are fast as well.
> model 0xF3H and it says 10 cycles. And my testing attested to every one of
> those 10 cycles, unfortunately :(.