Re: question re. usage of "static" within static member functions of a class
"Jerry Coffin" <jerryvcoffin@yahoo.com> wrote in message
news:MPG.2515d484bbea6a1b9897c7@news.sunsite.dk...
In article <edee09a7-fbc2-41fd-84b4-
dcdae859b12e@a21g2000yqc.googlegroups.com>, james.kanze@gmail.com
says...
[ ... using a memory barrier ]
In practice, it's
generally not worth it, since the additional assembler generally
does more or less what the outer mutex (which you're trying to
avoid) does, and costs about the same in run time.
I have to disagree with both of these. First, a memory barrier is
quite a bit different from a mutex. Consider (for example) a store
fence. It simply says that stores from all previous instructions must
complete before any stores from subsequent instructions (and a read
barrier does the same, but for reads). It's basically equivalent to a
sequence point, but for real hardware instead of a conceptual model.
As far as cost goes: a mutex normally uses kernel data, so virtually
every operation requires a switch from user mode to kernel mode and
back.
Well, a `CRITICAL_SECTION' is a mutex. A HANDLE returned from
`CreateMutex()' is also a mutex. Therefore, a mutex implementation does not
always need to enter the kernel. IMVHO, Windows made a poor choice in naming
its intra-process mutex. IMO, a "critical section" does not refer to the
synchronization primitive, but to the actual locked portion of the caller's
code.
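To illustrate the point, here's a bare-bones sketch of a fast-pathed
intra-process mutex (my own illustration, NOT the actual `CRITICAL_SECTION'
internals; the field names and the semaphore fallback are made up). The
uncontended path is a single interlocked operation in user space, and the
kernel is only touched under contention:
________________________________________________________________
#include <windows.h>

/* Sketch of a fast-pathed intra-process mutex. Not the real
   CRITICAL_SECTION layout -- just the general shape of the technique. */
struct fast_mutex
{
    volatile LONG state; /* 0 = free, 1 = locked, 2 = locked w/ waiters */
    HANDLE waitset;      /* kernel semaphore, used only under contention */
};

void fast_mutex_init(fast_mutex* m)
{
    m->state = 0;
    m->waitset = CreateSemaphore(NULL, 0, MAXLONG, NULL);
}

void fast_mutex_lock(fast_mutex* m)
{
    /* fast path: stays entirely in user mode when uncontended */
    if (InterlockedCompareExchange(&m->state, 1, 0) == 0)
        return;
    /* slow path: mark the lock contended and block in the kernel */
    while (InterlockedExchange(&m->state, 2) != 0)
        WaitForSingleObject(m->waitset, INFINITE);
}

void fast_mutex_unlock(fast_mutex* m)
{
    /* wake one waiter only if the slow path was ever taken */
    if (InterlockedExchange(&m->state, 0) == 2)
        ReleaseSemaphore(m->waitset, 1, NULL);
}
________________________________________________________________
It can over-signal the semaphore under contention, which costs a spurious
wakeup or two, not correctness.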
The cost for that will (of course) vary between systems, but is
almost always fairly high (figure a few thousand CPU cycles as a
reasonable minimum).
A memory barrier will typically just prevent combining a subsequent
write with a previous one.
A #StoreLoad memory barrier can cripple performance by forcing all previous
stores to be performed before any subsequent load can be committed. This
defeats the purpose of caching and pipelining:
http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce
As long as there's room in the write queue
for both pieces of data, there's no cost at all.
There is a cost when you start getting into load-to-store and store-to-load
ordering constraints. For instance, the following store membar:
MEMBAR #StoreStore
is MUCH less expensive than a version which adds a store-to-load
ordering constraint:
MEMBAR #StoreLoad | #StoreStore
You need store-to-load ordering in the acquisition portion of a traditional
mutex. This is why even user-space fast-pathed mutexes can have some nasty
overheads even in the non-contended case.
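To make the acquisition constraint concrete, here's a tiny Peterson-style
lock for two threads (just a sketch of mine; the names are made up, and
`MemoryBarrier()' is used as a stand-in full fence for the #StoreLoad on
Windows/x86):
________________________________________________________________
#include <windows.h>

/* Two-thread Peterson lock, purely to show where the orderings go. */
static volatile LONG g_flag[2] = { 0, 0 };
static volatile LONG g_turn = 0;

void peterson_lock(int self)
{
    int other = 1 - self;
    g_flag[self] = 1;
    g_turn = other;
    /* Acquisition needs store-to-load ordering: the two stores above must
       be globally visible before the load of the other thread's flag can
       be trusted (MEMBAR #StoreLoad | #StoreStore on SPARC RMO). */
    MemoryBarrier();
    while (g_flag[other] && g_turn == other)
        ; /* spin */
}

void peterson_unlock(int self)
{
    /* The release side only needs #LoadStore | #StoreStore ordering, which
       ordinary x86 stores already give you -- no costly #StoreLoad here. */
    g_flag[self] = 0;
}
________________________________________________________________
The expensive fence sits on the lock side, which is exactly the
non-contended cost I'm referring to.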
In the (normally
rare) case that the CPU's write queue is full, a subsequent write has
to wait for a previous write to complete to create an empty spot in
the write queue. Even in this worst case, it's generally going to be
around an order of magnitude faster than a switch to kernel mode and
back.
The problem is that C++ (up through the 2003 standard) simply
lacks memory barriers. Double-checked locking is one example
of code that _needs_ a memory barrier to work correctly -- but
it's only one example of many.
It can be made to work with thread local storage as well,
without memory barriers.
Well, yes -- poorly stated on my part. It requires _some_ sort of
explicit support for threading that's missing from the current and
previous versions of C++, but memory barriers aren't the only
possible one.
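FWIW, the thread-local variant goes roughly like this (a sketch of the
general idea only; `widget', `t_cache', etc. are names I made up, and the
`CRITICAL_SECTION' is assumed to be initialized before any threads start):
________________________________________________________________
#include <windows.h>

class widget { /* ... the singleton type ... */ };

static widget* g_instance = 0;                 /* shared; guarded by g_lock */
static CRITICAL_SECTION g_lock;                /* InitializeCriticalSection()
                                                  at startup (assumed) */
static __declspec(thread) widget* t_cache = 0; /* per-thread cached pointer */

widget* instance()
{
    /* Plain thread-local read: no lock and no barrier once this thread
       has already seen the object. */
    if (t_cache == 0)
    {
        EnterCriticalSection(&g_lock);   /* taken at most once per thread */
        if (g_instance == 0)
            g_instance = new widget;     /* first thread constructs it */
        t_cache = g_instance;            /* cache for this thread only */
        LeaveCriticalSection(&g_lock);
    }
    return t_cache;
}
________________________________________________________________
The mutex supplies all the ordering, and each thread pays for it exactly
once.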
Yes. The "problem" with DCLP is in fact just a symptom of a
larger problem, of people not understanding what is and is not
guaranteed (and to a lesser degree, of people not really
understanding the costs---acquiring a non-contested mutex is
really very, very cheap, and usually not worth trying to avoid).
At least under Windows, this does not fit my experience. Of course,
Windows has its own cure (sort of) for the problem -- rather than
using a mutex (with its switch to/from kernel mode) you'd usually use
a critical section instead.
Entering a critical section that's not in
use really is very fast.
Not in all cases... Try a test with something like the following crude
pseudo-code:
________________________________________________________________
#include <windows.h>
#include <stdlib.h>

struct data
{
    int array[128];
};

static struct data g_data = { { 0 } };

/* Each reader owns its own private critical section, so the lock is never
   contended; the only thing measured is the lock/unlock fast path itself.
   `iterations' is the per-thread count you would report per second. */
void locked_reader_threads()
{
    unsigned iterations;
    CRITICAL_SECTION mutex;
    InitializeCriticalSection(&mutex);
    for (iterations = 0 ;; ++iterations)
    {
        EnterCriticalSection(&mutex);
        LeaveCriticalSection(&mutex);
        for (int i = 0; i < 128; ++i)
        {
            int x = g_data.array[i];
            if (x) abort();
        }
    }
    DeleteCriticalSection(&mutex); /* unreachable in this crude test */
}

void naked_reader_threads()
{
    unsigned iterations;
    for (iterations = 0 ;; ++iterations)
    {
        for (int i = 0; i < 128; ++i)
        {
            int x = g_data.array[i];
            if (x) abort();
        }
    }
}
________________________________________________________________
See how many iterations each reader thread can perform per second under
worst-case load for prolonged periods of time (e.g., sustained high-intensity
bursts of traffic on a database server or something). The
`locked_reader_threads()' should perform fewer reads per second per thread.
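A minimal driver (my own quick sketch, not a rigorous benchmark harness)
could just spawn a handful of readers with `CreateThread()' and let them
run; reporting the per-thread iteration counts is left out here:
________________________________________________________________
#include <windows.h>

static DWORD WINAPI locked_entry(LPVOID unused)
{
    (void)unused;
    locked_reader_threads(); /* never returns in this crude test */
    return 0;
}

int main(void)
{
    HANDLE threads[4];
    for (int i = 0; i < 4; ++i)
        threads[i] = CreateThread(NULL, 0, locked_entry, NULL, 0, NULL);
    Sleep(30000);  /* let the readers run for 30 seconds */
    return 0;      /* process exit tears everything down */
}
________________________________________________________________
Swap `locked_entry' for one calling `naked_reader_threads()' to compare.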
Then again, a critical section basically is itself just a double-
checked lock (including the necessary memory barriers).
AFAICT, `CRITICAL_SECTION's are intra-process fast-pathed adaptive mutexes.
Could inter-process mutexes share the same properties? I think so.
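The adaptive part is even tunable up front; IIRC, something like the
following spins in user mode before falling back to a kernel wait (4000 is
the spin count MSDN mentions for the process heap lock):
________________________________________________________________
CRITICAL_SECTION cs;
InitializeCriticalSectionAndSpinCount(&cs, 4000);
________________________________________________________________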
They have two
big limitations: first, unlike a normal mutex, they only work between
threads in a single process. Second, they can be quite slow when/if
there's a great deal of contention for the critical section.
WRT slow, are you referring to the old implementation of `CRITICAL_SECTION's
which used to hand off ownership? IIRC, MS changed the implementation to
allow a thread to sneak in and acquire the mutex, which increases
performance considerably. However, unlike handing off ownership, it's not
fair and can be subject to "starvation".