Re: Am I or Alexandrescu wrong about singletons?

From: James Kanze <james.kanze@gmail.com>
Newsgroups: comp.lang.c++.moderated
Date: Wed, 24 Mar 2010 05:40:48 CST
Message-ID: <369f77bc-0993-4799-b77f-ca6ae821289a@g11g2000yqe.googlegroups.com>
On Mar 20, 8:22 am, Tony Jorgenson <tonytinker2...@yahoo.com> wrote:

      [...]

I understand that volatile does not guarantee that the order
of memory writes performed by one thread is seen in the same
order by another thread doing memory reads of the same
locations. I do understand the need for memory barriers
(mutexes, atomic variables, etc) to guarantee order, but there
are still 2 questions that have never been completely
answered, at least to my satisfaction, in all of the
discussion I have read on this group (and the non-moderated
group) on these issues.

First of all, I believe that volatile is supposed to guarantee the
following:

Volatile forces the compiler to generate code that performs
actual memory reads and writes rather than caching values in
processor registers. In other words, I believe that there is a
one-to-one correspondence between volatile variable reads and
writes in the source code and actual memory read and write
instructions executed by the generated code. Is this correct?


Sort of. The standard uses a lot of weasel words (for good
reasons) with regards to volatile, and in particular, leaves it
up to the implementation to define exactly what it means by
"access". Still, it's hard to imagine an interpretation that
doesn't imply a machine instruction which loads or stores.
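
As a concrete (if simplified) illustration of that correspondence,
consider a busy-wait loop on a volatile flag (the names here are
mine, not Tony's):

     volatile int flag = 0;

     void waitForFlag()
     {
         // Because flag is volatile, the compiler must emit a real
         // load on every iteration; without volatile, it could hoist
         // the read out of the loop and spin on a copy cached in a
         // register.
         while (flag == 0) {
         }
     }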

Of course, on modern machines, a store instruction doesn't
necessarily result in a write to physical memory; you typically
need additional instructions to ensure that. And on the
compilers I know (g++, Sun CC and VC++), volatile doesn't cause
them to be generated. (My most concrete experience is with Sun
CC on a Sparc, where volatile doesn't ensure that memory mapped
I/O works correctly.)
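
So if you really need the store to become globally visible (or to
reach a memory mapped device) before continuing, you have to ask
for the barrier yourself. A minimal sketch, assuming g++ and its
__sync_synchronize() builtin (the exact mechanism is compiler and
platform specific, and the register name is invented):

     volatile int controlRegister;   // hypothetical memory mapped register

     void writeRegister(int value)
     {
         controlRegister = value;    // volatile: a store instruction is emitted
         __sync_synchronize();       // explicit full barrier; volatile by
                                     // itself doesn't make the compiler
                                     // emit anything like this
     }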

Question 1:
My first question is with regard to using volatile instead of
memory barriers in some restricted multi-threaded cases. If my
above statements are correct, is it possible to use _only_
volatile with no memory barriers to signal between threads in
a reliable way if only a single word (perhaps a single byte)
is written by one thread and read by another?


No. Storing a byte (at the machine code level) on one processor
or core doesn't mean that the results of the store will be seen
on another processor. Modern processors reorder memory writes
in hardware, so given the sequence:

     volatile int a = 0, b = 0; // suppose int atomic

     void f()
     {
         a = 1;
         b = 1;
     }

another thread may still see b == 1 and a == 0.
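
The reading side makes that concrete. Continuing the same example
(the reader function is mine, for illustration):

     // Reader thread: may legitimately observe b == 1 while a is
     // still 0, because the hardware is free to reorder the stores
     // (or the loads), volatile notwithstanding.
     bool sawReordering()
     {
         int bb = b;     // real loads, in this order, because of volatile
         int aa = a;
         return bb == 1 && aa == 0;  // can be true on a weakly ordered machine
     }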

Question 1a:
First of all, please correct me if I am wrong, but I believe
volatile _must_always_ work as described above on any single
core CPU. One CPU means one cache (or one hierarchy of caches)
meaning one view of actual memory through the cache(s) that
the CPU sees, regardless of which thread is running. Is this
much correct for any CPU in existence? If not please mention a
situation where this is not true (for single core).


The standard doesn't make any guarantees, but all of the
processor architectures I know do guarantee coherence within a
single core.

The real question here is rather: who has a single core machine
anymore? The last Sparc I worked on had 32 cores, and I got it
because it was deemed too slow for production work (where we had
128 cores). And even my small laptop is a dual core.

Question 1b:
Secondly, the only way I could see this not working on a
multi-core CPU, with individual caches for each core, is if a
memory write performed by one CPU is allowed to never be
updated in the caches of other CPU cores. Is this possible?
Are there any multi-core CPUs that allow this? Doesn't the
MESI protocol guarantee that eventually memory cached in one
CPU core is seen by all others? I know that there may be
delays in the propagation from one CPU cache to the others,
but doesn't it eventually have to be propagated? Can it be
delayed indefinitely due to activity in the cores involved?


The problem occurs upstream of the cache. Modern processors
access memory through a pipeline, and optimize the accesses in
hardware, reading and writing a cache line at a time. So if
you read a, then b, but the hardware finds that b is already in
the read pipeline (because you've recently accessed something
near it), then the hardware won't issue a new bus access for b;
it will simply use the value already in the pipeline, which may
be older than the value of a, if the hardware does have to go to
memory for a.

All processors have instructions to force ordering: fence on an
Intel (and IIRC, a lock prefix creates an implicit fence),
membar on a Sparc. But the compilers I know don't issue these
instructions in case of volatile access. So the hardware still
remains free to do the optimizations that volatile forbids the
compiler to make.
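
To actually get the ordering in the f() above, the fence has to
appear explicitly. A sketch for x86 with gcc style inline assembly
(the helper function is mine):

     inline void fullFence()
     {
         asm volatile("mfence" ::: "memory");   // hardware fence; the "memory"
                                                // clobber also stops compiler
                                                // reordering across it
     }

     void fWithFence()
     {
         a = 1;
         fullFence();    // the store to a becomes globally visible before...
         b = 1;          // ...the store to b
     }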

Question 2:
My second question is with regard to if volatile is necessary
for multi-threaded code in addition to memory barriers. I know
that it has been stated that volatile is not necessary in this
case, and I do believe this, but I don't completely understand
why. The issue as I see it is that using memory barriers,
perhaps through use of mutex OS calls, does not in itself
prevent the compiler from generating code that caches
non-volatile variable writes in registers.


Whether it prevents it or not is implementation defined. As
soon as you start doing this, you're formally in undefined
behavior as far as C or C++ are concerned. Posix and Windows,
however, make additional guarantees, and if the compiler is
Posix compliant or Windows compliant, you're safe with regards
to code movement across any of the APIs which forbid it.

If you're using things like inline assembler, or functions
written in assembler, you'll have to check your compiler
documentation, but in practice, the compiler will assume that
the inline code modifies all visible variables (and so ensure
that they are correctly written and read with regards to it)
unless it has some means to know better, and those means will
also allow it to take a possible fence or membar instruction
into account.

I have heard it written in this group that posix, for example,
supports additional guarantees that make mutex lock/unlock
(for example) sufficient for correct inter-thread
communication through memory without the use of volatile. I
believe I read here once (from James Kanze I believe) that
"volatile is neither sufficient nor necessary for proper
multi-threaded code" (quote from memory). This seems to imply
that posix is in cahoots with the compiler to make sure that
this works.


Posix imposes additional constraints on C compilers, in addition
to what the C standard does. Technically, Posix doesn't know
that C++ exists (and vice versa); practically, C++ compilers do
claim Posix compliance, and extrapolate the C guarantees in a
logical fashion. (Given that they generally concern basic types
like int, this really isn't too difficult.)

I've seen less formal specification with regards to Windows (and
heaven knows, I'm looking, now that I'm working in an almost
exclusively Windows environment). But practically speaking,
VC++ behaves under Windows like Posix compliant compilers under
Posix, and you won't find any other compiler breaking things
that work with VC++.

If you add mutex locks and unlocks (I know RAII, so please
don't derail my question) around some variable reads and
writes, how do the mutex calls force the compiler to generate
actual memory reads and writes in the generated code rather
than register reads and writes?


That's the problem of the compiler implementor. Posix
(explicitly) and Windows (implicitly, at least) say that it has
to work, so it's up to the compiler implementor to make it work.
(In practice, most won't look into a function for which they
don't have the source code, and won't move code across a
function whose semantics they don't know.)
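
A minimal sketch of the usual situation, using the standard Posix
calls (the other names are mine):

     #include <pthread.h>

     int sharedCount = 0;            // deliberately not volatile
     pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

     void increment()
     {
         pthread_mutex_lock(&mutex);    // Posix: synchronizes memory; the
                                        // compiler can't see into the call,
                                        // so it can't keep sharedCount in a
                                        // register across it
         ++sharedCount;
         pthread_mutex_unlock(&mutex);  // likewise on the way out
     }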

I understand that compilation optimization affects these
issues, but if I optimize the hell out of my code, how do
posix calls (or any other OS threading calls) force the
compiler to do the right thing? My only conjecture is that
this is just an accident of the fact that the compiler can't
really know what the mutex calls do and therefore the compiler
must make sure that all globally accessible variables are
pushed to memory (if they are in registers) in case _any_
called function might access them. Is this what makes it work?


In practice, in a lot of cases, yes:-). It's an easy and safe
solution for the implementor, and it really doesn't affect
optimization that much---critical zones which include system
calls or other functions for which the compiler doesn't have the
source code aren't that common. In theory, however, a compiler
could know the list of system requests which guarantee memory
synchronization, and disassemble the object files of any
functions for which it didn't have the sources, to see if they
made any such requests. I just don't know of any compilers
which do this.

If not, then how do mutex calls guarantee the compiler doesn't
cache data in registers? Because this would surely make the
mutexes worthless without volatile (which I know from
experience they are not).


The system API says that they have to work. It's up to the
compiler implementor to ensure that they do. Most adopt the
simple solution: I don't know what this function does, so I'll
assume the worst. But at least in theory, more elaborate
strategies are possible.

--
James Kanze

--
      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated. First time posters: Do this! ]
