Re: question re. usage of "static" within static member functions of a class
"Jerry Coffin" <jerryvcoffin@yahoo.com> wrote in message
news:MPG.2517f1748d65d5c29897cb@news.sunsite.dk...
In article "Chris M. Thomasson" <h8hrgk$29h9$1@news.ett.com.ua>,
no@spam.invalid says...
[...]
A #StoreLoad memory barrier can cripple performance by forcing the
previous stores to be performed before any subsequent load can be
committed. This defeats the purpose of caching and pipelining:
That's not (usually) entirely true. A typical processor maintains
coherency between the cache and the memory -- i.e. anything that's
been written to the cache is considered visible throughout the
system.
The memory barrier normally affects the CPU's write queue between the
CPU proper, and the cache (specifically, the first level of cache
that isn't write-through). If the writes involved happened to be to
areas of memory that are uncachable, then you'd see a really serious
performance hit. Otherwise, while there will be some performance hit,
it'll typically be _fairly_ small (the write queue is rarely more
than about 10 deep, and can typically write one entry something like
every clock or two).
Of course, if you have a cache that doesn't do snooping to ensure
that everything written to the cache is immediately visible to the
rest of the system, things get uglier in a hurry...
FWIW, I know that a `#StoreLoad' membar tanks performance in certain
algorithms. Take plain SMR (Safe Memory Reclamation) hazard pointers. The
algorithm requires a `#StoreLoad' on every store into a hazard pointer. This
really shows up when one is iterating through a linked data-structure. I
have included a crude little program at the end of this message which
attempts to demonstrate the performance difference between a membar-based
SMR and a naked version.
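Just to make the per-node membar concrete, the publish-then-validate step a
plain SMR reader has to perform on every link it follows looks something like
the sketch below. The `g_hazard' slot, `MAX_THREADS' and the `smr_acquire'
name are made up for illustration; `struct node' and the `MFENCE()' macro are
the ones from the program at the end of this message:
______________________________________________________________________
#define MAX_THREADS 8

/* hypothetical per-thread hazard pointer slots */
static struct node* volatile g_hazard[MAX_THREADS];

struct node*
smr_acquire(struct node* volatile* link, unsigned tid)
{
    struct node* n;
    do
    {
        n = *link;            /* candidate node */
        g_hazard[tid] = n;    /* publish the hazard pointer */
        MFENCE();             /* #StoreLoad: the store above must be
                                 visible before the reload below */
    } while (n != *link);     /* retry if the link changed under us */
    return n;
}
______________________________________________________________________
That `MFENCE()' has to be paid on every single node of the traversal, which
is exactly the overhead `reader_plain_smr()' below simulates.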
Then again, a critical section basically is itself just a double-
checked lock (including the necessary memory barriers).
AFAICT, `CRITICAL_SECTION's are intra-process fast-pathed adaptive
mutexes.
Could inter-process mutexes share the same properties? I think so.
Yes, it could. It would require putting the data used for the fast-
path lock into memory that's visible to all processes involved.
Indeed.
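Something along these lines would do the trick; totally hypothetical names
here, the point is just that the fast-path word lives in a shared mapping
(e.g. `shm_open()' + `mmap()') so every process can CAS on it, while the slow
path falls back to a `PTHREAD_PROCESS_SHARED' mutex/condvar pair:
______________________________________________________________________
#include <pthread.h>

/* hypothetical inter-process lock; the whole struct is assumed to live
   in memory mapped into every participating process */
struct ipc_lock
{
    volatile long fast;     /* 0 = free, 1 = held; the fast-path word */
    pthread_mutex_t slow;   /* process-shared fallback */
    pthread_cond_t wake;    /* process-shared fallback */
};

void
ipc_lock_init(struct ipc_lock* l)
{
    pthread_mutexattr_t ma;
    pthread_condattr_t ca;
    l->fast = 0;
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&l->slow, &ma);
    pthread_mutexattr_destroy(&ma);
    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&l->wake, &ca);
    pthread_condattr_destroy(&ca);
}

void
ipc_lock_acquire(struct ipc_lock* l)
{
    /* uncontended fast path: one atomic RMW on the shared word */
    if (__sync_lock_test_and_set(&l->fast, 1) == 0) return;
    /* contended slow path: block until the word can be taken */
    pthread_mutex_lock(&l->slow);
    while (__sync_lock_test_and_set(&l->fast, 1) != 0)
    {
        pthread_cond_wait(&l->wake, &l->slow);
    }
    pthread_mutex_unlock(&l->slow);
}

void
ipc_lock_release(struct ipc_lock* l)
{
    __sync_lock_release(&l->fast);     /* store 0 with release semantics */
    pthread_mutex_lock(&l->slow);
    pthread_cond_broadcast(&l->wake);  /* wake any slow-path waiters */
    pthread_mutex_unlock(&l->slow);
}
______________________________________________________________________
A real implementation would keep a waiter count next to the fast-path word so
the release only takes the slow path when somebody is actually blocked, which
is basically what `CRITICAL_SECTION' does with its `LockCount'.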
[...]
WRT slow, are you referring to the old implementation of
`CRITICAL_SECTION's which used to hand off ownership? IIRC, MS
changed the implementation to allow for a thread to sneak in and
acquire the mutex, which increases performance considerably.
However, unlike handing off ownership, it's not fair and can be
subject to "starvation".
Well, I haven't looked at it again recently to check, so things could
have changed since then. What I'm talking about, however, was where
EnterCriticalSection would basically do a spin-lock, repeatedly
trying to acquire the mutex, and only after several attempts would it
give up and let another thread run.
Yeah. `CRITICAL_SECTION's are adaptive. However, what about:
SetCriticalSectionSpinCount(&mutex, 0);
?
lol... ;^)
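FWIW, the spin count can also be set up front. Something like this, where the
4000 is just an arbitrary example value:
______________________________________________________________________
#include <windows.h>

static CRITICAL_SECTION g_cs;

void
setup(void)
{
    /* adaptive: spin up to 4000 times before falling back to a wait */
    InitializeCriticalSectionAndSpinCount(&g_cs, 4000);

    /* or turn the spinning off entirely, as in the call above */
    SetCriticalSectionSpinCount(&g_cs, 0);
}
______________________________________________________________________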
Here is a crappy little test program for GCC and ia-32:
______________________________________________________________________
#include <time.h>
#include <stdio.h>
#include <pthread.h>

#define MFENCE() __asm__ __volatile__ ("MFENCE" : : : "memory")

#define RUNS 2
#define DEPTH 256
#define ITERS 1000000
#define READERS 4

struct node
{
    struct node* volatile next;
};

static struct node g_nodes[DEPTH];
static struct node* volatile g_head = NULL;
static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

/* build a linked list through the static node array */
void
node_init(void)
{
    unsigned i;
    for (i = 0; i < DEPTH; ++i)
    {
        g_nodes[i].next = g_head;
        g_head = g_nodes + i;
    }
}

/* iterate the list with no memory barriers at all */
void*
reader_naked(void* state)
{
    unsigned i;
    for (i = 0; i < ITERS; ++i)
    {
        struct node* n = g_head;
        while (n)
        {
            n = n->next;
        }
    }
    return 0;
}

/* iterate the list with the per-node #StoreLoad overhead of plain SMR */
void*
reader_plain_smr(void* state)
{
    unsigned i;
    for (i = 0; i < ITERS; ++i)
    {
        struct node* n = g_head;
        MFENCE();
        while (n)
        {
            n = n->next;
            MFENCE();
        }
    }
    return 0;
}

/* iterate the list while holding a plain pthread mutex */
void*
reader_locked(void* state)
{
    unsigned i;
    for (i = 0; i < ITERS; ++i)
    {
        struct node* n;
        pthread_mutex_lock(&g_mutex);
        n = g_head;
        while (n)
        {
            n = n->next;
        }
        pthread_mutex_unlock(&g_mutex);
    }
    return 0;
}

int
main(void)
{
    unsigned i, r;
    pthread_t tid[READERS];
    clock_t start;
    double end;
    node_init();
    for (r = 0; r < RUNS; ++r)
    {
        printf("executing test run %u of %u...\n",
               r + 1,
               (unsigned)RUNS);

        /* Naked */
        start = clock();
        for (i = 0; i < READERS; ++i)
        {
            pthread_create(tid + i, NULL, reader_naked, NULL);
        }
        for (i = 0; i < READERS; ++i)
        {
            pthread_join(tid[i], NULL);
        }
        end = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("reader_naked - %.3f ms / %lu reads per-second/per-thread\n",
               end * 1000.0,
               (unsigned long int)(ITERS / ((end) ? end : 1)));

        /* Plain SMR */
        start = clock();
        for (i = 0; i < READERS; ++i)
        {
            pthread_create(tid + i, NULL, reader_plain_smr, NULL);
        }
        for (i = 0; i < READERS; ++i)
        {
            pthread_join(tid[i], NULL);
        }
        end = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("reader_plain_smr - %.3f ms / %lu reads per-second/per-thread\n",
               end * 1000.0,
               (unsigned long int)(ITERS / ((end) ? end : 1)));

        /* Locked */
        start = clock();
        for (i = 0; i < READERS; ++i)
        {
            pthread_create(tid + i, NULL, reader_locked, NULL);
        }
        for (i = 0; i < READERS; ++i)
        {
            pthread_join(tid[i], NULL);
        }
        end = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("reader_locked - %.3f ms / %lu reads per-second/per-thread\n\n",
               end * 1000.0,
               (unsigned long int)(ITERS / ((end) ? end : 1)));
    }
    puts("\n\ncompleted!");
    return 0;
}
______________________________________________________________________
Here is the output I get on my old P4 3.06GHz HyperThreaded processor:
______________________________________________________________________
executing test run 1 of 2...
reader_naked - 546.000 ms / 1831501 reads per-second/per-thread
reader_plain_smr - 18313.000 ms / 54606 reads per-second/per-thread
reader_locked - 3484.000 ms / 287026 reads per-second/per-thread
executing test run 2 of 2...
reader_naked - 563.000 ms / 1776198 reads per-second/per-thread
reader_plain_smr - 18312.000 ms / 54608 reads per-second/per-thread
reader_locked - 3547.000 ms / 281928 reads per-second/per-thread
completed!
______________________________________________________________________
The only difference between `reader_naked()' and `reader_plain_smr()' is
that the latter has the same memory barrier overhead that a plain SMR
implementation would have. I can say that the `MFENCE' instruction on IA-32
can really devastate performance. I have seen too many examples and
measurements. BTW, you will get similar performance numbers on SPARC or
PowerPC. The locked version beats the plain SMR version because, blocking
aside for a moment, it amortizes memory-barrier usage across the depth of
the iteration: it pays for roughly two atomic operations per traversal
instead of one `MFENCE' per node...
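To show what I mean by amortization, here is a hypothetical reader in the
style of the test program above that announces itself once per traversal
instead of once per node; this is closer to an epoch/quiescent-state scheme
than to plain SMR, and the `g_active' flag is made up for the example:
______________________________________________________________________
/* hypothetical per-thread "I am reading" flags */
static volatile int g_active[READERS];

void*
reader_amortized(void* state)
{
    unsigned tid = (unsigned)(long)state;  /* thread index via the arg */
    unsigned i;
    for (i = 0; i < ITERS; ++i)
    {
        struct node* n;
        g_active[tid] = 1;  /* enter the read-side region */
        MFENCE();           /* one #StoreLoad for the whole traversal */
        n = g_head;
        while (n)
        {
            n = n->next;    /* no per-node membar needed here */
        }
        g_active[tid] = 0;  /* quiescent again; writers may reclaim */
    }
    return 0;
}
______________________________________________________________________
One membar per 256-node traversal instead of 256 membars is where the
difference in the numbers above comes from; the mutex version gets the same
effect from its two atomic operations per traversal, it just adds blocking
on top.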