Re: Am I or Alexandrescu wrong about singletons?

From: Andy Venikov <swojchelowek@gmail.com>
Newsgroups: comp.lang.c++.moderated
Date: Tue, 30 Mar 2010 05:03:11 CST
Message-ID: <hori4b$ihk$1@news.eternal-september.org>

Herb Sutter wrote:

Please remember this: Standard ISO C/C++ volatile is useless for
multithreaded programming. No argument otherwise holds water; at best
the code may appear to work on some compilers/platforms, including all
attempted counterexamples I've seen on this thread.

You have enormous clout with C++ professionals, including myself, so
before permanently agreeing to such an all-encompassing statement, allow
me to step back a little and see what is at the core of this argument.
Maybe we're arguing the same point. Or maybe I'm missing something big,
in which case I'll be doubly glad to have been shown my wrong
assumptions.

I understand that volatile was never supposed to be of any help for
multithreaded programming. I don't expect it to issue any memory fences
nor make any guarantees whatsoever about anything thread-related...
Yet, on all the compilers I know of (gcc, mingw, MSVC, LLVM, Intel) it
produces just the code I need for my multithreaded programs. And I
really don't see how it wouldn't, given a common-sense understanding of
what it should do in single-threaded programs. And I'm pretty sure that
this is not going to change in the foreseeable future.

So my use of volatile may not be standard-portable, but it sure is
real-life portable.

Here's the point of view I'm coming from.
Imagine that someone needs to implement a library that provides certain
multithreading (multiprogramming) tools like atomic access,
synchronization primitives and some lock-free algorithms that will be
used by other developers so that they wouldn't have to worry about
things like volatile. (Now that boost.atomic is almost out, I'll happily
use it. But Helge Bahmann (the author of the library) didn't have such a
luxury, so to make his higher-level APIs work he had to internally
resort to low-level tools like volatiles where appropriate.)

So, with the above said, here's a concrete example of how I'd use
volatile without access to a ready-made library. Let's take Maged
Michael's lock-free queue ("Simple, Fast, and Practical Non-Blocking and
Blocking Concurrent Queue Algorithms", Maged Michael & Michael Scott;
1996). It uses a technique similar to DCL to verify the validity of a
read. Look into its dequeue() method.
I'll provide the pseudo code here:

dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
D1:  loop                          # Keep trying until Dequeue is done
D2:    head = Q->Head              # Read Head
D3:    tail = Q->Tail              # Read Tail
D4:    next = head->next           # Read Head.ptr->next
D5:    if head == Q->Head          # Are head, tail, and next consistent?
D6:      if head.ptr == tail.ptr   # Is queue empty or Tail falling behind?
D7:        if next.ptr == NULL     # Is queue empty?
D8:          return FALSE          # Queue is empty, couldn't dequeue
D9:        endif
             # Tail is falling behind. Try to advance it
D10:       CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
D11:     else                      # No need to deal with Tail
             # Read value before CAS, otherwise another dequeue might
             # free the next node
D12:       *pvalue = next.ptr->value
             # Try to swing Head to the next node
D13:       if CAS(&Q->Head, head, <next.ptr, head.count+1>)
D14:         break                 # Dequeue is done. Exit loop
D15:       endif
D16:     endif
D17:   endif
D18: endloop
D19: free(head.ptr)                # It is safe now to free the old dummy node
D20: return TRUE                   # Queue was not empty, dequeue succeeded

Look at line D5: it needs to check whether Q->Head is still the same as
what we read from it before. Otherwise there are two ways correctness
can break: 1) the element pointed to by Q->Head could be re-inserted
into the queue with NULL in its "next" field, and then dequeue would
return "empty" when in reality the queue was never empty at any given
moment; or 2) the first element could be removed after we've read
Q->Head and before we've read next, so there could be garbage in
head->next by the time we read it, and we'd try to access garbage on
line D12.

This piece of pseudo code could be naively translated to the following
C++ code:

while (true)
{
    Node * localHead = head_;
    Node * localTail = tail_;
    Node * localNext = localHead->next;
    if (localHead == head_)
    {
        ...
    }
}

But it wouldn't work, for the obvious reasons.
One needs to insert memory fences in the right places.
Memory fences are highly platform-specific, so one would define macros
for them that expand to different instructions on different platforms.
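
As a rough sketch of what such macros might look like (the macro names
are the ones used in this post, but the bodies are my assumptions, not
taken from any particular library): on x86 the load-load barrier expands
to nothing, and elsewhere it falls back to the GCC/Clang builtin
__sync_synchronize(), which is a full barrier and therefore stronger
than strictly needed. loadBothInOrder() is a hypothetical helper that
only shows where the macro would sit.

```cpp
// Macro names follow the post; the bodies are assumptions of this sketch.
#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
    // x86 does not reorder loads with other loads, so no instruction is
    // emitted. Note that this also places no constraint on the compiler.
    #define LoadLoadBarrier() ((void)0)
#elif defined(__GNUC__)
    // Conservative stand-in for other CPUs: GCC/Clang full-barrier builtin.
    #define LoadLoadBarrier() __sync_synchronize()
#else
    #error "Define LoadLoadBarrier() for this compiler/platform"
#endif

// Per the post, this expands to nothing on the systems the author knows of.
#define DataDependencyBarrier() ((void)0)

// Hypothetical helper: two loads with a load-load barrier between them.
inline int loadBothInOrder(const int * first, const int * second)
{
    int a = *first;      // first load
    LoadLoadBarrier();   // hardware must not let the second load pass the first
    int b = *second;     // second load
    return a + b;
}
```

A real library would pick the cheapest adequate instruction per CPU
instead of a full barrier; the fallback here only keeps the sketch
short.
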
Here's the code with memory fences inserted:

while (true)
{
    Node * localHead = head_;
    Node * localTail = tail_;
    DataDependencyBarrier(); //All the systems that I know of will do
                             //this sort of barrier automatically, so
                             //this macro will expand to nothing
    Node * localNext = localHead->next;
    LoadLoadBarrier(); //on x86 this will expand to nothing
    if (localHead == head_)
    {
        ....
    }
}

This is much better, but it still has problems: first, on x86, the
LoadLoadBarrier() will expand to nothing and there will be no indication
to the compiler not to re-order different loads; and second (and I think
this is the crux of my argument), an optimizing compiler will dispose
of the "if" statement even in the face of memory barriers. No matter how
many or what type of memory barriers you insert, the compiler will be
allowed to omit the if statement. The ONLY way to force the compiler
(any compiler for that matter) to generate it is to declare head_ as
volatile.

Here's the final code:
struct Node
{
    <unspecified> data;
    Node volatile * next;
};
Node volatile * volatile head_;
Node volatile * volatile tail_;

dequeue()
{
    while (true)
    {
        Node volatile * localHead = head_;
        Node volatile * localTail = tail_;
        DataDependencyBarrier();
        Node volatile * localNext = localHead->next;

        if (localHead == head_)
        {
            ...
        }
        .....
    }
}

Now this code will produce the intended correct object code on all the
compilers I've listed above and on at least these CPUs: x86, itanium,
mips, PowerPC (assuming that all the MemoryBarriers have been defined
for all the platforms). And without any modifications to the above code.
How's that for portability?
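
To isolate the compiler-side point, here is a minimal compilable sketch
of just the read / re-read step. It is my own illustration, not the full
queue: the barrier macros are left out, "data" is an int purely for the
example, and readConsistentNext() is a hypothetical name. Because head_
is volatile, the compiler has to emit both loads of head_ and cannot
fold the comparison to "true". Nothing here says anything about what the
hardware does.

```cpp
// Hypothetical stand-in for the post's Node; `data` is an int only
// for the sake of a compilable example.
struct Node
{
    int data;
    Node volatile * next;
};

// Shared head pointer, declared volatile as in the post.
Node volatile * volatile head_ = nullptr;

// Reads head_, then head->next, then re-reads head_ to validate the pair.
// Both accesses to head_ are volatile loads, so the compiler must perform
// each of them. Precondition (assumption of this sketch): head_ is not null.
bool readConsistentNext(Node volatile *& outNext)
{
    Node volatile * localHead = head_;            // first volatile load
    Node volatile * localNext = localHead->next;  // volatile load of the link
    outNext = localNext;
    return localHead == head_;                    // second volatile load
}
```

In a single thread the function always returns true; the interesting
case is another thread changing head_ between the two loads, which this
sketch does not exercise.
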

I think my fault was that in my previous posts I was pushing more
heavily on volatile's ability to tell the compiler not to reorder the
instructions it generates (which is still useful) rather than
emphasizing the fact that I want volatile to tell the compiler not to
optimize away certain instructions. The reordering problem could be
circumvented by using inline asm statements (and then again, on x86,
LoadLoadBarrier would expand to nothing, so we'd be forced to use a
bogus inline asm statement - I'd rather choose to use volatile), but I
don't see how the optimizing-away problem could be circumvented without
the use of volatile.

Now, after writing all this, I realize that I could've used a simpler
example - a simple Peterson's algorithm for two threads wouldn't work
without the use of volatile: the "turn" variable is assigned the same
value that it is later compared against, so the compiler will omit the
"turn == x" part of the condition.
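
Here is a sketch of that Peterson's example, under the assumptions of
this post rather than any guarantee from the ISO standard. The names
interested, turn, lock, unlock and FullBarrier are mine; FullBarrier()
is a hypothetical macro mapped to the GCC/Clang builtin
__sync_synchronize(). Note how lock() stores "other" into turn and then
compares turn against "other": with a non-volatile turn the compiler
could treat that comparison as always true.

```cpp
#if defined(__GNUC__)
    // Hypothetical macro name; __sync_synchronize() is the GCC/Clang
    // full memory barrier builtin.
    #define FullBarrier() __sync_synchronize()
#else
    #error "Define FullBarrier() for this compiler/platform"
#endif

// Shared state, declared volatile as the post argues it must be.
volatile bool interested[2] = {false, false};
volatile int turn = 0;

// Acquire the lock for thread `self` (0 or 1).
void lock(int self)
{
    int other = 1 - self;
    interested[self] = true;   // announce interest
    turn = other;              // give priority to the other thread
    FullBarrier();             // keep the stores above ahead of the loads below
    // `turn` was just assigned `other`; only because it is volatile must
    // the compiler re-load it here instead of assuming the test is true.
    while (interested[other] && turn == other)
    {
        // spin until the other thread leaves or yields its turn
    }
    FullBarrier();             // keep the critical section after the wait
}

// Release the lock for thread `self`.
void unlock(int self)
{
    FullBarrier();             // keep the critical section before the release
    interested[self] = false;
}
```

A thread would bracket its critical section with lock(0) / unlock(0)
while the other thread uses index 1.
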

<snip>

I hope this clears matters - I'm sorry if I wasn't clear before.

---
Herb Sutter (herbsutter.wordpress.com) (www.gotw.ca)

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)

Andy.
