Re: Swap two integers without using temporary variable

From: James Kanze <james.kanze@gmail.com>
Newsgroups: comp.lang.c++
Date: Fri, 30 Nov 2007 01:37:06 -0800 (PST)
Message-ID: <d9c84f7d-d61e-4ae2-a402-8079ba394be1@s12g2000prg.googlegroups.com>

On Nov 29, 3:54 pm, terminator <farid.mehr...@gmail.com> wrote:

On Nov 28, 1:03 pm, James Kanze <james.ka...@gmail.com> wrote:

On Nov 27, 4:21 pm, terminator <farid.mehr...@gmail.com> wrote:

On Nov 26, 4:02 pm, Ron Natalie <r...@spamcop.net> wrote:

The people who think the XOR behavior is a neat idea totally lose the
concept that it's probably the least efficient way to do so. Even
the presumably space-inefficient:
        t = a; a = b; b = t;
might not actually allocate any memory for the temporary
value, it may in fact just be a transient register that
may have been allocated anyway.
        LD R1, a
        LD R2, b
        ST a, R2
        ST b, R1
where the XOR debacle
        LD R1, a
        LD R2, b
        XOR R1,R1,R2
        XOR R2,R2,R1
        XOR R1,R1,R2
        ST a, R1
        ST b, R2

Even old x86 machines support some 'exg' opcode that swaps
two registers in just one instruction, so a good compiler
can handle triple assignment much better than you
mentioned in case both ints have register storage. The xor
trick has less chance of such optimization, but it is a
joyful solution to a programming dilemma (I like it more
than the add/subtract solution).

I seem to recall some 8086 compilers actually recognizing
the classical swap idiom (with the temporary) and generating
the xchg instruction to implement it. Modern x86 compilers
don't, however, probably because on more recent x86
processors, xchg with a memory operand has an implied lock
prefix, which acts as a memory fence (which in turn means that
the instruction is considerably slower than it would be
otherwise).

I just ran some quick benchmarks on an Intel based Linux
machine here (using g++ 4.1.0, -O3), using the following
"swappers":
    struct SwapperClassic
    {
        void operator()( int& a, int& b )
        {
            int tmp = a ;
            a = b ;
            b = tmp ;
        }
    } ;

    struct SwapperXor
    {
        void operator()( int& a, int& b )
        {
            a ^= b ;
            b ^= a ;
            a ^= b ;
        }
    } ;

    struct SwapperAsm
    {
        void operator()( int& a, int& b )
        {
            asm ( "movl %[a], %%eax\n"
                  "xchgl %%eax, %[b]\n"
                  "movl %%eax, %[a]"
                  : [a] "+m" (a), [b] "+m" (b) : : "%eax" ) ;
        }
    } ;

On this particular machine (I'm not sure of its actual specs,
which processor or the clock frequency), I got:

                       ns per    machine    memory
                        iter.    instr.     accesses

    SwapperClassic:      1.7        5          5
    SwapperXor:          2.5        9          7
    SwapperAsm:         84.4        4          4

Tests run on 500 million iterations. In this case, the actual
function was invoked within a virtual member function, and
swapped two member variables. And the last two columns do not
include the standard function prefix or postfix.
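
For anyone wanting to try something similar, here is a minimal sketch of
such a harness. It is not the code used for the numbers above: the names
TestBase, Test and nsPerIteration are hypothetical, and it uses
std::chrono, which assumes a C++11 compiler (newer than the g++ 4.1.0
mentioned). An optimizer may devirtualize the call or collapse the loop,
so the generated assembler should be checked before trusting any timing.

```cpp
#include <chrono>

// Same classic swapper as in the post.
struct SwapperClassic
{
    void operator()( int& a, int& b )
    {
        int tmp = a ;
        a = b ;
        b = tmp ;
    }
} ;

// Same XOR swapper as in the post.
struct SwapperXor
{
    void operator()( int& a, int& b )
    {
        a ^= b ;
        b ^= a ;
        a ^= b ;
    }
} ;

// Hypothetical base class: the swap is reached through a virtual call.
struct TestBase
{
    virtual ~TestBase() {}
    virtual void step() = 0 ;
} ;

// Hypothetical test class: swaps two member variables with the given swapper.
template< typename Swapper >
struct Test : TestBase
{
    int first ;
    int second ;

    Test() : first( 1 ), second( 2 ) {}

    virtual void step()
    {
        Swapper()( first, second ) ;
    }
} ;

// Runs step() the requested number of times and returns the average
// wall-clock time per iteration, in nanoseconds.
double nsPerIteration( TestBase& test, long iterations )
{
    typedef std::chrono::steady_clock Clock ;
    Clock::time_point start = Clock::now() ;
    for ( long i = 0 ; i < iterations ; ++ i ) {
        test.step() ;
    }
    std::chrono::duration< double, std::nano > elapsed = Clock::now() - start ;
    return elapsed.count() / iterations ;
}
```

Calling nsPerIteration on a Test< SwapperClassic > and on a
Test< SwapperXor > with 500000000 iterations would roughly mirror the
runs described above; SwapperAsm can be plugged in the same way when
compiling with g++ for x86.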

A quick glance at the generated assembler showed that none of
the three versions used any local variables.

And as you can see, the cost of the memory fence due to the
implicit lock prefix on the xchgl instruction is very, very
high. The generated code uses one less instruction, and has
one less memory access, but requires roughly 50 times more
time to run.

I wrote that **if both ints have register storage class**.
Exchanging to memory **has to lock** the bus (I would be
surprised otherwise), for it is the simplest facility to handle
a semaphore.

There is no such thing as "register storage class" in C++. If
you have to read both int's into a register, then the xchg
instruction is just an extra, unneeded instruction (on most
machines, anyway). And there is no real reason why xchg should
lock the bus, any more than incr or decr or any other
instruction which might require two (or more) memory accesses,
and even less reason why it should implement a memory fence.
You already have separate instructions/prefixes for those
things, which can be used in the cases where they are needed.
(The original 8086, and at least through the 80386, did not lock
the bus, nor does the normal exchange instruction on any other
architecture I've used.) And for most multi-thread algorithms,
xchg isn't enough; you need a cas instruction (or some other
combinations of instructions: g++ uses "lock add" for its atomic
increment, for example---which can't be implemented using xchg).
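
To make the difference between an exchange and a compare-and-swap
concrete, here is a small sketch using std::atomic. This assumes C++11,
which postdates this thread, and the function names are purely
illustrative. On x86, compilers typically implement these operations with
lock-prefixed instructions, but that is an implementation detail worth
checking in the generated assembler rather than something the sketch
guarantees.

```cpp
#include <atomic>

// Atomic increment built from a compare-and-swap loop: the store only
// succeeds if the counter still holds the value we read.
int incrementWithCas( std::atomic< int >& counter )
{
    int expected = counter.load() ;
    // On failure, compare_exchange_weak reloads 'expected' with the
    // current value, so the loop simply retries with fresh data.
    while ( ! counter.compare_exchange_weak( expected, expected + 1 ) ) {
    }
    return expected + 1 ;
}

// Atomic increment as a single read-modify-write operation.
// fetch_add returns the old value, so add 1 to get the new one.
int incrementWithFetchAdd( std::atomic< int >& counter )
{
    return counter.fetch_add( 1 ) + 1 ;
}

// Plain atomic exchange: stores the new value unconditionally and
// returns the old one. The new value cannot depend on the old value,
// which is why an exchange alone cannot express an increment.
int replaceValue( std::atomic< int >& value, int desired )
{
    return value.exchange( desired ) ;
}
```

The first two functions return the incremented value; the third shows
what an exchange can do on its own, namely hand back the previous
contents while overwriting them.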

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34