Re: Am I or Alexandrescu wrong about singletons?

From:

James Kanze <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++.moderated

Date:

Tue, 30 Mar 2010 16:39:33 CST

Message-ID:

<1bc69c38-68de-48ce-9812-5471379ae307@u31g2000yqb.googlegroups.com>

On Mar 30, 12:05 pm, George Neuner <gneun...@comcast.net> wrote:

On Mon, 29 Mar 2010 16:53:44 CST, James Kanze <james.ka...@gmail.com>
wrote:

On Mar 28, 10:05 pm, George Neuner <gneun...@comcast.net> wrote:

[...]

The purpose of the fence is to sequence memory accesses.

For a much more rigorous definition of "access" than that used
by the C++ standard.

Not exactly. I agree that the processor's memory model
guarantees stronger ordering than that of the C++ standard (or
almost any language standard, for that matter), but you are
attributing semantics to "fence" that aren't there.

I'm not attributing anything. I'm just quoting the
documentation.

All the fence does is create a checkpoint in the instruction
sequence at which relevant load or store instructions
dispatched prior to dispatch of the fence instruction will
have completed execution.

That's not true for the two architectures whose documentation
I've studied, Intel and Sparc.

Then you'd better go back and study 8)

I've quoted Intel. Regretfully, the Sparc site was down when I
tried to access it, but I've studied their documentation fairly
intensely, and it basically guarantees the same thing.

To quote the Intel documentation of MFENCE:

   Performs a serializing operation on all load and store
   instructions that were issued prior the MFENCE
   instruction. This serializing operation guarantees that
   every load and store instruction that precedes in
   program order the MFENCE instruction is globally visible
   before any load or store instruction that follows the
   MFENCE instruction is globally visible.

Now look at LFENCE, SFENCE and CLFLUSH and think about why they
are provided separately. Also look at PREFETCH and see what
it says about fences.

There are many types of fences. Obviously, in any given case,
you should use the one which provides the guarantees you need.

Intel provides MFENCE as a heavyweight combination of LFENCE,
SFENCE and CLFLUSH. MFENCE does propagate to memory *because*
it flushes the cache. However the primitive, SFENCE, ensures
propagation of writes only to L2 cache.

So what use is it, then?

Sparc has no single instruction that both fences and flushes
the cache. MEMBAR ensures propagation only to L2 cache. A
separate FLUSH instruction is necessary to ensure propagation
to memory.

That's not what it says in the Sparc Architecture Specification.
(Levels of cache are never mentioned; the architecture allows
an implementation with any number of levels.)

Sparc also does not have separate load and store fences, but
it offers two variants of MEMBAR which provide differing
consistency guarantees.

There is only one Membar instruction, with a 4 bit mask to
control the barriers: LOADLOAD, LOADSTORE, STORELOAD and
STORESTORE. (There are some other bits to control other
functionality, but they are irrelevant with regards to memory
synchronization in a multithreaded environment.)
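The ordering kinds named in that mask can be illustrated in portable code. What follows is only a sketch, and it assumes a C++11 or later compiler with <atomic>. A release fence roughly corresponds to LOADSTORE plus STORESTORE, and an acquire fence to LOADLOAD plus LOADSTORE; that mapping is an approximation, not something the Sparc documentation states. The function name pass_message and its variables are hypothetical.

```cpp
#include <atomic>
#include <thread>

// Hypothetical message-passing sketch: `payload` is an ordinary int,
// `ready` is the flag that publishes it to the other thread.
inline int pass_message(int value) {
    int payload = 0;
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread producer([&] {
        payload = value;
        // Keeps the write to payload ordered before the later store
        // to ready (roughly a STORESTORE | LOADSTORE barrier).
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Keeps the load of ready ordered before the later read of
        // payload (roughly a LOADLOAD | LOADSTORE barrier).
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Neither fence covers the STORELOAD case; that one needs a full (sequentially consistent) fence.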

Note the "globally visible". Both Intel and Sparc guarantee
strong ordering within a single core (i.e. a single thread);
mfence or membar (Sparc) are only necessary if the memory will
also be "accessed" from a separate unit: a thread running on a
different core, or memory mapped IO.
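The case where a full fence matters can be shown with the classic store-then-load test between two threads. This is a minimal sketch, assuming a C++11 or later compiler; the names StoreLoadResult and run_store_load_once are hypothetical. On x86 a compiler typically implements the sequentially consistent fence with MFENCE or a locked instruction, but that is a property of the implementation, not a guarantee of the language.

```cpp
#include <atomic>
#include <thread>

// Hypothetical litmus test: each thread stores to its own variable,
// then loads the variable written by the other thread.
struct StoreLoadResult {
    int r1;
    int r2;
};

inline StoreLoadResult run_store_load_once() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r1 = -1;
    int r2 = -1;

    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);
        // Full fence: the store above must be visible to the other
        // thread before the load below is performed.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_relaxed);
    });

    std::thread b([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    });

    a.join();
    b.join();
    return {r1, r2};
}
```

With both fences in place, the outcome r1 == 0 and r2 == 0 is forbidden. Remove the fences and that outcome becomes possible, even on x86, because each store can still be sitting in the write pipeline when the following load executes.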

Again, you're attributing semantics that aren't there.

I just quoted the documentation. What part of "globally
visible" don't you understand?

For a store to be "globally visible" means that the value must
be visible from outside the core. This requires the value be
in *some* externally visible memory - not *main* memory in
particular. For both x86 and Sparc, this means L2 cache - the
first level that can be snooped off-chip.

That's an original definition of "global".

For a load "globally visible" means that the value is present
at all levels of the memory hierarchy and cannot be seen
differently by an external observer. This simply follows from
the normal operation of the read pipeline - the value is
written into all levels of cache (more or less) at the same
time it is loaded into the core register.

Note also that some CPUs can prefetch data in ways that bypass
externally visible levels of cache. Sparc and x86 (at least
since Pentium III) do not permit this.

Sparc certainly does allow it (at least according to the Sparc
Architecture Specification), and I believe some of the newer
Intel do as well.

However, in a memory hierarchy with caching, a store
instruction does not guarantee a write to memory but only that
one or more write cycles is executed on the core's memory
connection bus.

On Intel and Sparc architectures, a store instruction doesn't
even guarantee that. All it guarantees is that the necessary
information is somehow passed to the write pipeline. What
happens after that is anybody's guess.

No. On both of those architectures a store instruction will
eventually cause the value to be written out of the core
(except maybe if a hardware exception occurs).

Not on a Sparc. At least not according to the Sparc
Architecture Specification. Practically speaking I doubt that
this is guaranteed for any modern architecture, given the
performance implications.

Additionally the source register may be renamed or the stored
value may be forwarded within the core to rendezvous with a
subsequent read of the same location already in the pipeline
... but these internal flow optimizations don't affect the
externally visible operation of the store instruction.

As long as there is only a single store instruction to a given
location, that store will eventually percolate out to the main
memory. If there are several, it's quite possible that some of
them will never appear outside the processor.

For another thread (or core or CPU) to perceive a change, a
value must be propagated into shared memory. For all
multi-core processors I am aware of, the first shared level of
memory is cache - not main memory. Cores on the same die
snoop each other's primary caches and share higher level
caches. Cores on separate dies in the same package share
cache at the secondary or tertiary level.

And on more advanced architectures, there are cores which
don't share any cache. All of which is irrelevant, since
simply issuing a store instruction doesn't even guarantee a
write to the highest level cache, and a membar or a fence
instruction guarantees access all the way down to the main,
shared memory.

Sorry, but no. Even the architectures we've discussed here, x86 and
Sparc, do not satisfy your statement.

I quoted the specification from Intel for the x86. The Sparc
site was down, and my copy of the Sparc Architecture
Specification is on a machine in France, so I'm sorry, I can't
quote it here. But I do know what it says. And a membar
instruction does guarantee strong ordering.

There might be architectures I'm unaware of which can elide an
off-core write entirely by rendezvous forwarding and register
renaming, but you haven't named one. I would consider eliding
the store to be a dangerous interpretation of memory semantics
and I suspect I would not be alone.

Dangerous or not, no processor can afford to neglect this
important optimization opportunity. And it causes no problems
in single threaded programs, nor in multithreaded programs which
use proper synchronization methods.

I'm not familiar with any cached architecture for which
fencing alone guarantees that a store writes all the way to
main memory - I know some that don't even have/need fencing
because their on-chip caches are write-through.

I just pointed one out. By quoting the manufacturer's
specifications for the mfence instruction. If I were on my Unix
machine in Paris, I could equally quote similar text for the
Sparc.

The upshot is this:
- "volatile" is required for any CPU.

I'm afraid that doesn't follow from anything you've said.
Particularly because the volatile is largely a no-op on most
current compilers---it inhibits compiler optimizations, but the
generated code does nothing to prevent the reordering that
occurs at the hardware level.

"volatile" is required because the compiler must not reorder
or optimize away the loads or stores.

Which loads and stores? The presence of a fence (or inline
assembler, or specific system or library calls) guarantees that
the compiler cannot reorder around it. And whether the compiler
reorders or suppresses elsewhere is irrelevant, since the
hardware can do it regardless of the code the compiler
generates.

- fences are required for an OoO CPU.

By OoO, I presume you mean "out of order". That's not the only
source of the problems.

OoO is not the *only* source of the problem. The compiler has
little control over hardware reordering ... fences are blunt
instruments that impact all loads or stores ... not just those
to language level "volatiles".

Agreed. Volatile has different semantics (at least that was the
intent). See Herb Sutter's comments elsewhere in this thread.

- cache control is required for a write-back cache between
CPU and main memory.

The cache is largely irrelevant on Sparc or Intel. The
processor architectures are designed in a way to make it
irrelevant. All of the problems would be there even in the
absence of caching. They're determined by the implementation of
the write and read pipelines.

That's a naive point of view. For a cached processor, the
operation of the cache and its impact on real programs is
*never* "irrelevant".

I was speaking solely in the context of threading. The
operation of the cache is very relevant with regards to
performance, for example.

--
James Kanze
