Re: C++ Threads, what's the status quo?
Le Chaud Lapin wrote:
Pete Becker wrote:
Le Chaud Lapin wrote:
The mutex is simpler but slower:
    Mutex counter_mutex;        // global

    void increment_counter ()
    {
        counter_mutex.acquire();
        ++count;
        counter_mutex.release();
    }
The increment has no visible side effects, so the compiler is
not obliged to sandwich it between the acquire and the release.
How do you ensure that the compiler won't move it before the
acquire or after the release?
I would take whatever recourse is available in all situations like
these. If I may answer with an example...
Consider a single-threaded program that has a function which
aims a missile at an adjacent country and launches it. Aiming
is achieved by incrementing a memory-mapped register; launching
is achieved by setting a different memory-mapped register to
"true":
    static unsigned int &register_azimuth
        = *reinterpret_cast<unsigned int *>(0x12345678);
    static bool &register_launch
        = *reinterpret_cast<bool *>(0x12345680);
    void aim_and_launch_missile ()
    {
        register_azimuth += 75;     // degrees
        register_launch = true;
    }
Technically, IIUC, the compiler has the right to reorder these two
statements, since they are visibly independent, thus launching the
missile at a potentially friendly neighboring country and subsequently
pointing the launch pad at the enemy.
Not just technically. It wouldn't be at all surprising if the
compiler did. And no embedded programmer with any experience
would write something like that.
C++ does have something (inherited from C) designed to support
memory-mapped IO: volatile. The exact semantics are
implementation defined, because obviously, memory-mapped IO
pretty much depends on the platform you're running the program
on. The intent, however, is clear here (even if neither Sun CC
nor g++ makes any strong guarantees about it). An embedded
programmer would declare his references volatile, which means
that accesses through those references (but not accesses to
anything else) are ordered, and he would also read the
documentation of the compiler very carefully, to ensure that it
interpreted volatile in a way that was compatible with his
needs. (Although to tell the truth, as a long-time embedded
programmer, I would write such critical parts in assembler.
Just to be sure.)
Note that volatile only applies to accesses through a
volatile-qualified lvalue expression. Just declaring one of the
references volatile isn't sufficient, since the compiler could
still move the other. Note too that an object being constructed
is never volatile (just as it is never const); volatile only
takes effect after construction.
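To make that concrete, here is a minimal sketch of how the
missile example above might look with volatile (same
hypothetical register addresses; what it actually buys you
still depends on how your compiler implements volatile):

    static volatile unsigned int &register_azimuth
        = *reinterpret_cast<volatile unsigned int *>(0x12345678);
    static volatile bool &register_launch
        = *reinterpret_cast<volatile bool *>(0x12345680);

    void aim_and_launch_missile ()
    {
        // Both accesses go through volatile-qualified lvalues,
        // so the compiler may not reorder them with respect to
        // one another (whatever it may do with everything else).
        register_azimuth += 75;     // degrees
        register_launch = true;
    }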
Another example: a programmer writes a piece of code, which he
thinks is portable, to get a rough idea of how fast the CPU can
perform 1,000,000 global memory references:
    #include <iostream>
    using namespace std;

    static int x;

    int main ()
    {
        Instant before = now();     // Instant/now(): some unspecified timer
        unsigned int const iterations = 1000000;
        unsigned int count = iterations;
        while (count--)
            x = 0;
        Instant after = now();
        cout << iterations << " memory references in "
             << (after - before) << " seconds." << endl;
        return 0;
    }
On some compilers, he finds that the while loop is not performed
until just before the return 0. On other compilers, he finds
that the while loop has been omitted completely, as it is
obviously superfluous, with the result that the machine appears
to count to 1,000,000 in almost zero seconds.
I think that almost all current compilers fall into the latter
category. G++, Sun CC and VC++ certainly do: the code generated
by g++ (for Sparc) for the loop is:
        sethi   %hi(x), %g1
        mov     %o0, %l1
        st      %g0, [%g1+%lo(x)]
No loop at all. Sun CC gives:
        sethi   %hi(0xf4000),%g2
        sethi   %hi(x),%g3
        st      %g0,[%g3+%lo(x)]
        add     %g2,575,%i0
        or      %g0,%o0,%i1
        orcc    %g0,%i0,%g0
.L900000116:
        add     %i0,-1,%i0
        bne     .L900000116
        orcc    %g0,%i0,%g0
The loop is still there, but the assignment to x has been moved
out of it. And for what it's worth, VC++ (on a Windows PC, this
time) generates:
        mov     DWORD PTR _x, 0
(and mixes it in with the storing of the results of the first
call to time()). Again, no loop.
I know that when I designed my benchmark framework, I had to
jump through hoops to ensure that the compiler didn't optimize
the benchmark loop away completely. (And I know that some
really sophisticated compilers can still do it.) And that all
of the standard benchmarks (Whetstone, Dhrystone, etc.) go to
great lengths to output a value which depends on all of the
loops actually being executed.
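One such hoop, just as an illustration (a generic trick, not
necessarily what my framework or those benchmarks actually do):
funnel the work through a volatile sink, so that the compiler
cannot prove the loop has no observable effect:

    #include <iostream>

    static volatile int sink;   // every store to sink is observable

    int main ()
    {
        unsigned int count = 1000000;
        while (count--)
            sink = 0;           // 1,000,000 stores the compiler
                                // may not elide
        std::cout << sink << std::endl;
        return 0;
    }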
(I might add that on any modern machine, 1,000,000 global memory
references will take considerably less than a second. In fact,
on most modern machines, even if the compiler didn't optimize,
the hardware would; the code in the loop would get around to the
second write before the first was finished, the hardware would
notice that they both had the same effect, and suppress one of
them. Unless, of course, you inserted the necessary fence or
membar instructions.)
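For what it's worth, one way to insert such a barrier from C++
with g++ (just an illustration, not part of the code above) is
the __sync_synchronize() builtin, which acts as both a compiler
and a hardware barrier on targets that support it:

    static volatile int x;

    void two_separate_stores ()
    {
        x = 0;
        __sync_synchronize();   // full barrier between the two stores
        x = 0;
    }

Other compilers provide their own equivalents.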
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]