concept that it's probably the least efficient way to do so. Even
the presumbalby space inefficent:
t = a; a = b; t = b;
might not actually allocate any memory for the temporary
value, it may in fact just be an transient register that
may have been allocated anyway.
LD R1, a
LD R2, b
ST a, R2
ST b, R1
where the XOR debacle
LD R1, a
LD R2, b
XOR R1,R1,R2
XOR R2,R2,R1
XOR R1,R1,R2
ST a, R1
ST b, R2
even old x86 machines support some 'exg' opcode that swaps
two registers in just one instruction ,so a good compiler
can handle triple assignment much better than you
mentioned in case both ints have register storage.The xor
trick has less chance of such optimization, but it is a
joyfull solution to a programming delima(I like it more
than add/subtract solution).
I seem to recall some 8086 compilers actually recognizing
the classical swap idiom (with the temporary) and generating
the exch instruction to implement it. Modern x86 compilers
don't, however, probably because on more recent x86
processors, xchg has an implied lock prefix, which acts as a
memory fence (which in turn means that the instruction is
considerably slower than it would be otherwise).
I just ran some quick benchmarks on an Intel based Linux
machine here (using g++ 4.1.0, -O3), using the following
"swappers":
struct SwapperClassic
{
void operator()( int& a, int& b )
{
int tmp = a ;
a = b ;
b = tmp ;
}
} ;
struct SwapperXor
{
void operator()( int& a, int& b )
{
a ^= b ;
b ^= a ;
a ^= b ;
}
} ;
struct SwapperAsm
{
void operator()( int& a, int& b )
{
asm ( "movl %[a], %%eax\n xchgl %%eax,%[b] \n movl %%eax,%
[a]"
: [a] "+m" (a), [b] "+m" (b) : : "%eax" ) ;
}
} ;
On this particular machine (not sure of its actual spec's,
which processor or the clock frequency), I got:
ns per machine memory
iter. instr. accesses
SwapperClassic: 1.7 5 5
SwapperXor: 2.5 9 7
SwapperAsm: 84.4 4 4
Tests run on 500 million iterations. In this case, the actual
function was invoked within a virtual member function, and
swapped two member variables. And the last two columns do not
include the standard function prefix or postfix.
A quick glance at the generated assembler showed that none of
the three versions used any local variables.
And as you can see, the cost of the memory fence due to the
implicit lock prefix on the xchgl instruction is very, very
high. The generated code uses one less instruction, and has
one less memory access, but requires roughly 50 times more
time to run.