Re: Am I or Alexandrescu wrong about singletons?

From: Joshua Maurice <joshuamaurice@gmail.com>
Newsgroups: comp.lang.c++.moderated
Date: Mon, 5 Apr 2010 18:18:26 CST
Message-ID: <f256b4a5-74b6-46a5-825f-20a2b115abaa@q15g2000yqj.googlegroups.com>
On Mar 31, 3:35 pm, Andy Venikov <swojchelo...@gmail.com> wrote:

James Kanze wrote:

<snip>

I'm not sure I follow. Basically, the fence guarantees that the
hardware can't do specific optimizations. The same
optimizations that the software can't do in the case of
volatile. If you think you need volatile, then you certainly
need a fence. (And if you have the fence, you no longer need
the volatile.)


Ah, finally I think I see where you are coming from. You think that if
you have the fence you no longer need a volatile.

I think you assume too much about how a fence is really implemented.
Since the standard says nothing about fences, you have to rely on a
library that provides them, and if you don't have such a library you'll
have to implement one yourself. A reasonable way to implement a barrier
would be to use macros that, depending on the platform you run on,
expand to inline assembly containing the right instruction. In that
case the inline asm will make sure that the compiler won't reorder the
emitted instructions, but it won't make sure that the optimizer doesn't
throw away some needed instructions.
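
A sketch of what such a macro layer might look like (the non-x86
mnemonics here are illustrative assumptions, not taken from any
particular library):

//Hypothetical per-platform barrier macros; instruction choices are
//illustrative only.
#if defined(__x86_64__) || defined(__i386__)
  //x86 already keeps loads ordered with respect to other loads, so a
  //compiler-only barrier would usually do; mfence shown for symmetry.
  #define LoadLoadBarrier() asm volatile ("mfence")
#elif defined(__powerpc__) || defined(__powerpc64__)
  #define LoadLoadBarrier() asm volatile ("lwsync")
#elif defined(__arm__)
  #define LoadLoadBarrier() asm volatile ("dmb")
#else
  #error "No LoadLoadBarrier() defined for this platform"
#endif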

For example, following my post where I described Maged Michael's
algorithm, here's how the relevant excerpt would look without volatiles:

//x86-related defines:
#define LoadLoadBarrier() asm volatile ("mfence")

//Common code
#include <cstdio>

struct Node
{
    Node * pNext;
};

Node * head_;

void f()
{
     Node * pLocalHead = head_;
     Node * pLocalNext = pLocalHead->pNext;

     LoadLoadBarrier();

     if (pLocalHead == head_)
     {
         printf("pNext = %p\n", pLocalNext);
     }

}

Just to make you happy I defined LoadLoadBarrier as a full mfence
instruction, even though on x86 there is no need for a barrier here,
even on a multicore/multiprocessor.

And here's how gcc 4.3.2 on Linux/x86-64 generated object code:

0000000000400630 <_Z1fv>:
  400630:   0f ae f0                mfence
  400633:   48 8b 05 fe 09 20 00    mov    0x2009fe(%rip),%rax   # 601038 <head_>
  40063a:   bf 5c 07 40 00          mov    $0x40075c,%edi
  40063f:   48 8b 30                mov    (%rax),%rsi
  400642:   31 c0                   xor    %eax,%eax
  400644:   e9 bf fe ff ff          jmpq   400508 <printf@plt>
  400649:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)

As you can see, it uselessly put mfence right at the beginning of
function f() and threw away the second read of head_ and the whole if
statement altogether.
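
In source terms, the optimizer effectively reduced f() to something
like this (my own reading of the disassembly above, with a made-up
function name, not actual compiler output):

//Roughly equivalent source for the generated code: the barrier is
//hoisted to the top, head_ is loaded exactly once, and the if is
//assumed true.
void f_as_compiled()
{
    LoadLoadBarrier();                    // the lone mfence
    Node * pLocalNext = head_->pNext;     // single load of head_, then pNext
    printf("pNext = %p\n", pLocalNext);   // emitted as a tail call to printf
}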

Naively, you could say that we could put a "memory" clobber in the
inline assembly clobber list, like this:
#define LoadLoadBarrier() asm volatile ("mfence" : : : "memory")

This will work, but it is huge overkill: after it, the compiler will
need to re-read all variables, even unrelated ones. And when f() gets
inlined, that becomes a big performance hit.

Volatile saves the day nicely and beautifully, albeit not portably in
the standards sense. But as I said elsewhere, this will work on most
compilers and hardware. Of course I'd need to test it on the
compiler/hardware combination the client is going to run it on, but
such is the peril of trying to provide a portable interface with a
non-portable implementation. So far I haven't found a single
combination that wouldn't correctly compile the code with volatiles.
And of course I'll gladly embrace C++0x atomic<>... when it becomes
available. Right now, though, I'm slowly migrating to boost::atomic
(which, again, internally HAS TO use volatiles and IS using them).
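
For concreteness, a sketch of the volatile variant being described,
assuming it is the pointer head_ itself that is declared volatile (my
reading, not a verbatim quote from the earlier post):

//Same excerpt with a volatile-qualified head_; the volatile forces the
//compiler to perform a real second load of head_ after the barrier
//instead of reusing the value already held in a register.
Node * volatile head_;

void f()
{
    Node * pLocalHead = head_;            // first load
    Node * pLocalNext = pLocalHead->pNext;

    LoadLoadBarrier();                    // still plain "mfence", no "memory" clobber

    if (pLocalHead == head_)              // volatile keeps this re-read alive
    {
        printf("pNext = %p\n", pLocalNext);
    }
}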


I was hoping someone more knowledgeable would reply, but I guess it's
up to me. My experience has been mostly limited to POSIX, WIN32, and
Java threading, so what I'm about to say I say without the highest
level of confidence.

I understand your desire to use volatile in this way, and I think it's
a reasonable use case and desire. You are assuming that all relevant
shared state between the two threads is accessible only through head_.
You issue the hardware-only fence instruction with "asm volatile" to
make sure the compiler does not remove it. You do not add the "memory"
clobber, so the compiler does not understand what the instruction does.
Finally, you use "volatile" to make sure that the compiler will emit
another load instruction after the fence in the compiled machine code.

My replies are thus:

First, you don't want the "memory" clobber because you know that the
(relevant) shared state is only accessible through head_. If the
compiler loads in data-dependency order and the volatile read really
produces a load after the hardware fence, then everything should work
out. I'm not sure you could find such a guarantee published for any
compiler, though. I have a hard time thinking of a compiler that would
not do this, but I am not sure, and I would not rely upon it without
checking the compiled output.

This is partly an argument over the definition of "fence". One might
say that a portable fence applies to all memory, not just to a single
load specified by the coder and all data-dependent loads. That
"general" definition of fence demands the "memory" clobber (absent a
less draconian clobber). However, this is irrelevant to the discussion
of volatile and threading.

Finally, it comes down to whether volatile will force a load
instruction to be emitted by the compiler. At the very least, this
seems to be within the intent and spirit of volatile, and I would
hazard a guess that all compilers would emit a load (but not
necessarily anything more [not that anything more would be needed in
this case], and ignoring volatile bugs, which greatly abound across
compilers). However, you're already at the assembly level. Would it be
that much harder to use an "asm volatile" to do the load instead of a
volatile-qualified load? You're already doing assembly hackery, so at
least use something which is "more guaranteed" to work, like an "asm
volatile" load, and not volatile, which was never intended to be a
useful threading primitive. Perhaps supply a primitive like
DataDependencyFenceForSingleLoad? I don't know enough about hardware to
even hazard a guess whether such a thing is portably efficient. I do
know enough to say that what you're doing is not portable as is.
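
For illustration, here is a hedged sketch of what such an "asm
volatile" load could look like with GCC-style inline assembly on
x86-64 (ReloadHead is a made-up name, not an established primitive):

//Hypothetical: do the reload of head_ inside the asm itself, so the
//compiler can neither elide nor cache it, without a "memory" clobber
//and without declaring anything volatile.
static inline Node * ReloadHead(Node * const * addr)
{
    Node * result;
    asm volatile ("movq %1, %0" : "=r" (result) : "m" (*addr));
    return result;
}

//Usage inside f():  if (pLocalHead == ReloadHead(&head_)) { ... }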

All in all though, this does not change the fact that volatile is not
a useful, correct, portable threading primitive. All you've
demonstrated is that volatile in conjunction with assembly (not
portable) can be a useful, correct, non-portable threading primitive,
though I would argue that the code has poor "style" and should not be
using volatile.
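
(For comparison, the C++0x atomic<> mentioned above would let the same
excerpt be written portably, roughly like this; a sketch only, a rough
transliteration rather than a reviewed lock-free algorithm:)

#include <atomic>
#include <cstdio>

struct Node { Node * pNext; };

std::atomic<Node *> head_;

void f()
{
    Node * pLocalHead = head_.load(std::memory_order_relaxed);
    Node * pLocalNext = pLocalHead->pNext;

    // Plays the role of LoadLoadBarrier(): an acquire fence keeps the
    // loads above from being reordered with the re-read of head_ below.
    std::atomic_thread_fence(std::memory_order_acquire);

    if (pLocalHead == head_.load(std::memory_order_relaxed))
    {
        printf("pNext = %p\n", pLocalNext);
    }
}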

--
      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated. First time posters: Do this! ]
