Re: Threading issue in next standard

From: "kanze" <kanze@gabi-soft.fr>
Newsgroups: comp.std.c++
Date: Wed, 6 Sep 2006 11:26:19 CST
Message-ID: <1157532880.157500.119000@i3g2000cwc.googlegroups.com>

Alan McKenney wrote:

kanze wrote:

wkaras@yahoo.com wrote:

.... typically, what is needed in a multithreaded
context is an explicitly induced point in the program where all
previous writes (regardless of the target type) become visible
(before any of the following writes become visible).

.... What you usually need in multithreading is a
sequencing guarantee---that (all) preceding writes will become
visible to all observers before any of the following writes.
(Note that this generally implies some sequencing actions on the
part of the observers as well.)

Does it need to be "all writes"?

Not really. That was a major simplification on my part. The
important point is that there is a set of writes all of which
must become visible before any of the writes in a second set,
although the order within each set is not important. To do this
with volatile (assuming strong volatile semantics) would require
declaring all of the objects in both sets volatile, creating a
total ordering of all of the writes, which is not necessary and
which imposes an extreme performance penalty. (Note too that
when I speak of the ordering of the writes, I really mean the
order that is seen by all observers.)

Perhaps I've been corrupted by my years doing
supercomputing, but when I think of parallel processing
(and I think of multithreading as parallel processing), I
envision the model system as one with a bunch of
CPUs (possibly with local memory)
with a vast network between the CPUs and the (shared)
memory. When a CPU updates a (shared) variable,
the update slowly "percolates" out through the network.

In this situation, waiting for "all writes" from all CPUs to
be done would require all CPUs to stop and wait for the memory
network to become quiescent. Since this would happen every
time any CPU requests a "wait for all writes", it would cause
an O(number of processors) performance hit.

Yes, but only at one very specific point in time.

In fact, it's a little bit more subtle. The "writing" processor
uses a primitive to ensure that it "exports" all of the
preceding writes before it "exports" any following writes; on
many modern processors, a store A instruction, followed by a
store B instruction, may result in B being "written" before A
unless special steps are taken. And the "reading" processors
use a primitive to ensure that all following reads access
"later" values than all previous reads.

Consider a simple example: p is a pointer, initialized with
null:

    processor A             processor B

    p = new C;              if ( p != NULL )
                                p->someFunctionInC() ;

This doesn't work, of course, because there is no ordering
between the writes in the constructor of C, and the write to p;
the inversion of the ordering may occur when actually writing to
global memory, in processor A, or when reading from global
memory, in processor B.

In this case, there is no way to make volatile at an object
level work, no matter how strong it is made, because volatile
doesn't engage until the constructor has finished.
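
For illustration, here is roughly how the two sides look with the
atomic operations that were later standardized in C++11. This is only
a sketch under assumptions: the body of C, the value 42, the polling
loop in processor B (the original has a single if), and the helper
runPublicationDemo are all invented for the example. The release store
is the writer-side primitive described above, and the acquire load is
the reader-side one.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the class C in the example above.
struct C {
    int value;
    C() : value(42) {}
    int someFunctionInC() const { return value; }
};

// Shared pointer, initially null, as in the example.
std::atomic<C*> p{nullptr};

// "Processor A": construct the object, then publish it.
// The release store keeps the constructor's writes from becoming
// visible after the write to p.
void processorA() {
    p.store(new C, std::memory_order_release);
}

// "Processor B": poll until the pointer is visible, then use it.
// The acquire load keeps the reads inside someFunctionInC() from
// being satisfied before the read of p.
int processorB() {
    C* local;
    while ((local = p.load(std::memory_order_acquire)) == nullptr) {
        std::this_thread::yield();
    }
    return local->someFunctionInC();
}

// Runs both sides once and returns what processor B observed.
int runPublicationDemo() {
    p.store(nullptr, std::memory_order_relaxed);
    int seen = 0;
    std::thread b([&seen] { seen = processorB(); });
    std::thread a(processorA);
    a.join();
    b.join();
    delete p.exchange(nullptr);  // clean up the published object
    return seen;
}
```

Note that only p itself is atomic; the members of C are ordinary
objects, which is the point: the ordering is attached to one write and
one read, not to every object involved.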

I don't know about anyone else, but when I use
mutexes, I always associate each mutex with a set
of (shared) variables that it controls, so what I would
want is to be assured that all writes to the variables
controlled by this mutex were visible to my
thread before the mutex was considered locked.

At least under Posix, you have this. It's one of the Posix
guarantees concerning pthread_mutex_lock (and all of the other
pthread synchronization requests). I think (hope?) it is a
foregone conclusion that any standard mutexes, regardless of the
syntax finally adopted, will adopt these guarantees. If you
wrap the two operations above in a mutex lock, there is no
problem.
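
A minimal sketch of the locked version, using std::mutex (the mutex
that C++11 eventually standardized) rather than pthread_mutex_lock
directly. The names guardedP, pMutex, writerWithLock, readerWithLock
and runMutexDemo, and the body of C, are assumptions made up for the
example. The pointer is a plain pointer; the visibility guarantee
comes entirely from the lock.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the class C in the example above.
struct C {
    int value = 7;
    int someFunctionInC() const { return value; }
};

C* guardedP = nullptr;   // plain pointer, only touched with pMutex held
std::mutex pMutex;

// "Processor A": the construction and the assignment both happen
// inside the critical section.
void writerWithLock() {
    std::lock_guard<std::mutex> lock(pMutex);
    guardedP = new C;
}

// "Processor B": returns true and fills in result if the pointer was
// already set when the lock was acquired.
bool readerWithLock(int& result) {
    std::lock_guard<std::mutex> lock(pMutex);
    if (guardedP != nullptr) {
        result = guardedP->someFunctionInC();
        return true;
    }
    return false;
}

// Runs both sides and returns what the reader finally observed.
int runMutexDemo() {
    int seen = 0;
    std::thread b([&seen] {
        while (!readerWithLock(seen)) {
            std::this_thread::yield();
        }
    });
    std::thread a(writerWithLock);
    a.join();
    b.join();
    delete guardedP;     // both threads are done; no lock needed
    guardedP = nullptr;
    return seen;
}
```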

The interest in atomic operations, observable, and such, is for
lock free algorithms. On a Sun Sparc, for example, I don't need
a lock to make the above work; just inserting a few membar
instructions in critical places is sufficient.
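
I can't show the Sparc instructions portably, but the same "barrier in
a critical place" style can be sketched with std::atomic_thread_fence
from C++11. This is an analogy, not a claim about which membar variant
a given compiler emits. The names payload, ready, fenceWriter,
fenceReader and runFenceDemo, and the value 123, are invented for the
example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};   // flag written after the payload

// Writer: the release fence sits between the two writes, so the
// payload cannot become visible after the flag.
void fenceWriter() {
    payload = 123;
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

// Reader: the acquire fence sits between the two reads, so the
// payload is not read "earlier" than the flag.
int fenceReader() {
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

// Runs both sides once and returns what the reader observed.
int runFenceDemo() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = 0;
    std::thread r([&seen] { seen = fenceReader(); });
    std::thread w(fenceWriter);
    w.join();
    r.join();
    return seen;
}
```

No lock is taken anywhere; the two fences carry the whole ordering
requirement.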

In other words, for me, synchronization always
applies to an object or set of objects, not to the
universe of objects.

At the design level, you are certainly correct. In practice,
the synchronization is done by means of a system request
(pthread_mutex_lock, for example) which doesn't know what the
set of objects is, and synchronizes everything. At least on a
Sparc (the architecture I know best), this memory
synchronization is done by means of a machine instruction
membar, and this instruction synchronizes everything.

If I may invent some ill-advised syntax, I'd want something
like

synchronizable group_a { int a; std::string b; MyClass c; };

lock_group( group_a );
a += 1;
unlock_group( group_a );

I don't think it would buy much on most modern processors.
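
For what it's worth, the grouping itself can be approximated in a
library without new syntax. The sketch below is an assumption-laden
illustration, not a proposal: Synchronizable, withLock, GroupA and
incrementA are invented names, and the MyClass member is left out
because it is not defined in the post. The lock underneath still
synchronizes all of memory, not just the members of the group.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

// The variables that belong to the group (hypothetical).
struct GroupA {
    int a = 0;
    std::string b;
};

// Ties one mutex to one set of variables; the data can only be
// reached through withLock, so it is never touched unlocked.
template <typename Group>
class Synchronizable {
public:
    template <typename F>
    auto withLock(F&& f) -> decltype(f(std::declval<Group&>())) {
        std::lock_guard<std::mutex> lock(mutex_);
        return std::forward<F>(f)(data_);
    }

private:
    std::mutex mutex_;
    Group data_;
};

Synchronizable<GroupA> group_a;

// Equivalent of: lock_group( group_a ); a += 1; unlock_group( group_a );
// Returns the value of a after the increment.
int incrementA() {
    return group_a.withLock([](GroupA& g) -> int {
        g.a += 1;
        return g.a;
    });
}
```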

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
