Re: Am I or Alexandrescu wrong about singletons?

George Neuner <>
Sun, 28 Mar 2010 15:05:33 CST
On Thu, 25 Mar 2010 17:31:25 CST, James Kanze <>

On Mar 25, 7:10 pm, George Neuner <> wrote:

On Thu, 25 Mar 2010 00:20:43 CST, Andy Venikov


As you noted, 'volatile' does not guarantee that an OoO CPU will
execute the stores in program order ...

Arguably, the original intent was that it should. But it
doesn't, and of course, the ordering guarantee only applies to
variables actually declared volatile.

"volatile" is quite old ... I'm pretty sure the "intent" was defined
before there were OoO CPUs (in de facto use if not in standard
document). Regardless, "volatile" only constrains the behavior of the

for that you need to add a write fence between them. However,
neither 'volatile' nor write fence guarantees that any written
value will be flushed all the way to memory - depending on
other factors - cache snooping by another CPU/core, cache
write back policies and/or delays, the span to the next use of
the variable, etc. - the value may only reach to some level of
cache before the variable is referenced again. The value may
never reach memory at all.

If that's the case, then the fence instruction is seriously
broken. The whole purpose of a fence instruction is to
guarantee that another CPU (with another thread) can see the
stores.

The purpose of the fence is to sequence memory accesses. All the
fence does is create a checkpoint in the instruction sequence at which
relevant load or store instructions dispatched prior to dispatch of
the fence instruction will have completed execution. There may be
separate load and store fence instructions and/or they may be combined
in a so-called "full fence" instruction.

However, in a memory hierarchy with caching, a store instruction does
not guarantee a write to memory but only that one or more write cycles
are executed on the core's memory connection bus. Where that write
goes is up to the cache/memory controller and the policies of the
particular cache levels involved. For example, many CPUs have
write-thru primary caches while higher levels are write-back with
delay (an arrangement that allows snooping of either the primary or
secondary cache with identical results).

For another thread (or core or CPU) to perceive a change a value must
be propagated into shared memory. For all multi-core processors I am
aware of, the first shared level of memory is cache - not main memory.
Cores on the same die snoop each other's primary caches and share
higher level caches. Cores on separate dies in the same package share
cache at the secondary or tertiary level.

The same holds true for all separate-CPU shared-memory multiprocessors
I am aware of ... they are connected so that they can snoop each
other's caches at some level, or an additional level of shared cache
is placed between the CPUs and memory, or both.

(Of course, the other thread also needs a fence.)

Not necessarily.

OoO execution and cache behavior are the reasons 'volatile'
doesn't work as intended for many systems even in
single-threaded use with memory-mapped peripherals.

The reason volatile doesn't work with memory-mapped peripherals
is because the compilers don't issue the necessary fence or
membar instruction, even if a variable is volatile.

It still wouldn't matter if they did. Let's take a simple case of one
thread and two memory-mapped registers:

  volatile unsigned *regA = (volatile unsigned *)0x...;
  volatile unsigned *regB = (volatile unsigned *)0x...;
  unsigned oldval, retval;

    *regA = SOME_OP;
    *regA = SOME_OP;

    oldval = *regB;
    do {
       retval = *regB;
    } while ( retval == oldval );

Let's suppose that writing a value twice to regA initiates some
operation that returns a value in regB. Will the above code work?

No. The processor will execute both writes, but the cache will
combine them so the device will see only a single write. The cache
needs to be flushed between writes to regA.

Ok, let's assume there is a flush API and add some flushes:

    *regA = SOME_OP;
  FLUSH *regA;
    *regA = SOME_OP;
  FLUSH *regA;

    oldval = *regB;
    do {
       retval = *regB;
    } while ( retval == oldval );

Does this now work?

Maybe. It will work if the flush operation includes a fence,
otherwise you can't know whether the write has occurred before the
cache line is flushed.

Ok, let's assume there is a fence API and add fences:

    *regA = SOME_OP;
  FENCE;
  FLUSH *regA;
    *regA = SOME_OP;
  FENCE;
  FLUSH *regA;

    oldval = *regB;
    do {
       retval = *regB;
    } while ( retval == oldval );

Does this now work?

Yes. Now I am guaranteed that the first value will be written all the
way to memory (and to my device) before the second value is written.

Now the question is whether a cache flush includes a fence operation
(or vice versa)? The answer is "it depends". On many architectures,
the ISA has no cache control instructions - the cache controller is
mapped to reserved memory addresses or I/O ports. Some cache
controllers permit only programming replacement policy and do not
allow programs to manipulate the entries. Some controllers flush
everything rather than allowing individual lines to be flushed. It
varies from one system to the next.

If there is a language level API for cache control or for fencing, it
may or may not include the other operation depending on the whim of
the developer.

The upshot is this:
  - "volatile" is required for any CPU.
  - fences are required for an OoO CPU.
  - cache control is required for a write-back cache between
    CPU and main memory.
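Putting the three requirements together, a minimal sketch might look
like the following. STORE_FENCE and FLUSH_LINE are stand-in names I'm
assuming for this illustration - the real spellings are entirely
platform-specific, and on a cache-coherent host the flush can be a
no-op:

```cpp
#include <atomic>

// Stand-in names: on a real platform these would map to a fence
// instruction and a cache-control instruction (or a memory-mapped
// cache-controller register), respectively.
#define STORE_FENCE()  std::atomic_thread_fence(std::memory_order_release)
#define FLUSH_LINE(p)  ((void)(p))  // no-op here; real write-back caches need a real flush

static unsigned fake_regA = 0;  // stands in for the memory-mapped device register

// 'volatile' keeps the compiler from merging, reordering, or eliding
// the two stores - the first of the three requirements above.
volatile unsigned *regA = &fake_regA;

void start_operation(unsigned op) {
    *regA = op;        // first write to the device register
    STORE_FENCE();     // the store must complete before the line is flushed
    FLUSH_LINE(regA);  // push the line past any write-back cache
    *regA = op;        // second write, same discipline
    STORE_FENCE();
    FLUSH_LINE(regA);
}
```

Each requirement covers a different layer: volatile constrains the
compiler, the fence constrains the OoO core, and the flush constrains
the cache hierarchy; none of the three substitutes for the others.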


James Kanze


      [ See for info about ]
      [ comp.lang.c++.moderated. First time posters: Do this! ]
