Re: Am I or Alexandrescu wrong about singletons?

From:

James Kanze <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++.moderated

Date:

Mon, 29 Mar 2010 16:53:44 CST

Message-ID:

<288ce9ed-4773-4dbf-bec8-b2e7953c7755@g10g2000yqh.googlegroups.com>

On Mar 28, 10:05 pm, George Neuner <gneun...@comcast.net> wrote:

On Thu, 25 Mar 2010 17:31:25 CST, James Kanze <james.ka...@gmail.com>
wrote:

On Mar 25, 7:10 pm, George Neuner <gneun...@comcast.net> wrote:

On Thu, 25 Mar 2010 00:20:43 CST, Andy Venikov

[...]

As you noted, 'volatile' does not guarantee that an OoO CPU will
execute the stores in program order ...

Arguably, the original intent was that it should. But it
doesn't, and of course, the ordering guarantee only applies to
variables actually declared volatile.

"volatile" is quite old ... I'm pretty sure the "intent" was defined
before there were OoO CPUs (in de facto use if not in standard
document). Regardless, "volatile" only constrains the behavior of the
*compiler*.

More or less. Volatile requires the compiler to generate code
which conforms to what the documentation says it does. It
requires all accesses to take place after the preceding sequence
point, and the results of those accesses to be stable before the
following sequence point. But it leaves it up to the
implementation to define what is meant by "access", and most
take a very, very liberal view of it.

for that you need to add a write fence between them. However,
neither 'volatile' nor write fence guarantees that any written
value will be flushed all the way to memory - depending on
other factors - cache snooping by another CPU/core, cache
write back policies and/or delays, the span to the next use of
the variable, etc. - the value may only reach to some level of
cache before the variable is referenced again. The value may
never reach memory at all.

If that's the case, then the fence instruction is seriously
broken. The whole purpose of a fence instruction is to
guarantee that another CPU (with another thread) can see the
changes.

The purpose of the fence is to sequence memory accesses.

For a much more rigorous definition of "access" than that used
by the C++ standard.

All the fence does is create a checkpoint in the instruction
sequence at which relevant load or store instructions
dispatched prior to dispatch of the fence instruction will
have completed execution.

That's not true for the two architectures whose documentation
I've studied, Intel and Sparc. To quote the Intel documentation
of MFENCE:

     Performs a serializing operation on all load and store
     instructions that were issued prior the MFENCE
     instruction. This serializing operation guarantees that
     every load and store instruction that precedes in
     program order the MFENCE instruction is globally visible
     before any load or store instruction that follows the
     MFENCE instruction is globally visible.

Note the "globally visible". Both Intel and Sparc guarantee
strong ordering within a single core (i.e. a single thread);
mfence or membar (Sparc) are only necessary if the memory will
also be "accessed" from a separate unit: a thread running on a
different core, or memory mapped IO.
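
The "globally visible" guarantee can be sketched with the classic
store-buffering test. This is only an illustration, assuming a compiler
with the C++11 <atomic> and <thread> headers; all of the names are
hypothetical, and std::atomic_thread_fence(std::memory_order_seq_cst) is
used as a portable stand-in for mfence or membar (it typically compiles
to one of them, or to an equivalent locked instruction). Each thread
stores to its own variable, issues a full fence, then loads the other
variable. With the fences in place, at least one thread must see the
other's store.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names throughout; a litmus-test sketch, not code from the thread.
struct StoreBufferResult {
    int r1;
    int r2;
};

// Runs one round of the "store buffering" test: each thread stores to its
// own variable, issues a full fence, then loads the other thread's variable.
StoreBufferResult run_store_buffer_round() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    StoreBufferResult result{-1, -1};

    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        // Full fence: the store above must be globally visible
        // before the load below is performed.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        result.r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        result.r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return result;
}

// Returns true if any round saw both loads yield 0, the outcome the
// fences are there to forbid.
bool saw_forbidden_outcome(int rounds) {
    for (int i = 0; i < rounds; ++i) {
        StoreBufferResult r = run_store_buffer_round();
        if (r.r1 == 0 && r.r2 == 0) return true;
    }
    return false;
}
```

Without the two fences, r1 == 0 and r2 == 0 is a legal outcome even on
Intel, since each store can still be sitting in its core's write
pipeline when the other core's load executes.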

There may be separate load and store fence instructions and/or
they may be combined in a so-called "full fence" instruction.

However, in a memory hierarchy with caching, a store
instruction does not guarantee a write to memory but only that
one or more write cycles is executed on the core's memory
connection bus.

On Intel and Sparc architectures, a store instruction doesn't
even guarantee that. All it guarantees is that the necessary
information is somehow passed to the write pipeline. What
happens after that is anybody's guess.

Where that write goes is up to the cache/memory controller and
the policies of the particular cache levels involved. For
example, many CPUs have write-thru primary caches while higher
levels are write-back with delay (an arrangement that allows
snooping of either the primary or secondary cache with
identical results).

For another thread (or core or CPU) to perceive a change a
value must be propagated into shared memory. For all
multi-core processors I am aware of, the first shared level of
memory is cache - not main memory. Cores on the same die
snoop each other's primary caches and share higher level
caches. Cores on separate dies in the same package share
cache at the secondary or tertiary level.

And on more advanced architectures, there are cores which don't
share any cache. All of which is irrelevant, since simply
issuing a store instruction doesn't even guarantee a write to
the highest level cache, and a membar or a fence instruction
guarantees access all the way down to the main, shared memory.

[...]

The reason volatile doesn't work with memory-mapped
peripherals is because the compilers don't issue the
necessary fence or membar instruction, even if a variable is
volatile.

It still wouldn't matter if they did. Let's take a simple case of one
thread and two memory mapped registers:

  volatile unsigned *regA = 0x...;
  volatile unsigned *regB = 0x...;
  unsigned oldval, retval;

    *regA = SOME_OP;
    *regA = SOME_OP;

    oldval = *regB;
    do {
        retval = *regB;
    } while ( retval == oldval );

Let's suppose that writing a value twice to regA initiates
some operation that returns a value in regB. Will the above
code work?

Not on a Sparc. Probably not on an Intel, but I'm less sure.
It wouldn't surprise me if Intel did allow certain segments to
be configured with an implicit fence around each access, and if
the memory mapped IO were in such a segment, it would work.

No. The processor will execute both writes, but the cache
will combine them so the device will see only a single write.
The cache needs to be flushed between writes to regA.

Again, the cache is really irrelevant here. The combining will
already occur in the write pipeline.
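
A minimal sketch of what issuing the fence by hand might look like for
the two-register example above. The helper names write_reg, read_reg
and issue_twice are hypothetical, and std::atomic_thread_fence (C++11)
is only a stand-in: the standard defines it for ordering between
threads, not for devices, so real memory mapped IO would need whatever
barrier the platform documents (membar on Sparc, for instance).

```cpp
#include <atomic>

// Hypothetical helper: store to a register, then issue a full fence so the
// store is not left in the write pipeline to be merged with the next one.
// ASSUMPTION: the fence below stands in for the platform's real I/O barrier.
inline void write_reg(volatile unsigned* reg, unsigned value) {
    *reg = value;  // volatile: the compiler must emit this store
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Hypothetical helper: fence first, so earlier stores are ordered before
// the read of the register.
inline unsigned read_reg(volatile unsigned* reg) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return *reg;
}

// Writes the same command twice with a fence after each write, then reads
// the register back.
unsigned issue_twice(volatile unsigned* reg_a, unsigned op) {
    write_reg(reg_a, op);
    write_reg(reg_a, op);
    return read_reg(reg_a);
}
```

The point of the wrapper is only that volatile and the fence do
different jobs: volatile keeps the compiler from dropping or merging the
two stores, and the fence is what addresses the hardware.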

[...]

The upshot is this:
- "volatile" is required for any CPU.

I'm afraid that doesn't follow from anything you've said.
Particularly because the volatile is largely a no-op on most
current compilers---it inhibits compiler optimizations, but the
generated code does nothing to prevent the reordering that
occurs at the hardware level.

- fences are required for an OoO CPU.

By OoO, I presume you mean "out of order". That's not the only
source of the problems.

- cache control is required for a write-back cache between
CPU and main memory.

The cache is largely irrelevant on Sparc or Intel. The
processor architectures are designed in a way to make it
irrelevant. All of the problems would be there even in the
absence of caching. They're determined by the implementation of
the write and read pipelines.
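
To make the distinction concrete, here is a sketch of publishing data
to another thread, assuming a compiler with the C++11 <atomic> and
<thread> headers; payload, ready, producer, consumer and run_once are
hypothetical names. A volatile bool flag would keep the compiler from
reordering or eliding the flag accesses, but would add no barrier
instruction. The release store and acquire load below are what tell the
implementation to emit whatever the hardware needs.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names; a sketch of release/acquire publication.
int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the data

void producer() {
    payload = 42;                                  // plain store
    ready.store(true, std::memory_order_release);  // earlier stores may not move past this
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();                 // spin until the flag is published
    }
    return payload;                                // ordered after the acquire load
}

// Runs one producer and one consumer and returns what the consumer read.
int run_once() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread reader([&] { seen = consumer(); });
    std::thread writer(producer);
    writer.join();
    reader.join();
    return seen;
}
```

With this pairing, run_once() returns 42 on any conforming
implementation; how that is achieved (a membar, an mfence, or nothing
extra at all on a given processor) is left to the compiler.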

--
James Kanze
