Re: question re. usage of "static" within static member functions of a class

From:
"Chris M. Thomasson" <no@spam.invalid>
Newsgroups:
comp.lang.c++
Date:
Mon, 14 Sep 2009 04:10:25 -0700
Message-ID:
<h8l88j$kq8$1@news.ett.com.ua>
"Jerry Coffin" <jerryvcoffin@yahoo.com> wrote in message
news:MPG.2516c74b876eda129897ca@news.sunsite.dk...

In article <8c8edcc3-d7f4-4890-9f43-c05db50bb41b@
37g2000yqm.googlegroups.com>, james.kanze@gmail.com says...

On Sep 13, 1:01 am, Jerry Coffin <jerryvcof...@yahoo.com> wrote:

In article <edee09a7-fbc2-41fd-84b4-
dcdae859b...@a21g2000yqc.googlegroups.com>,
james.ka...@gmail.com says...

[ ... using a memory barrier ]

In practice, it's
generally not worth it, since the additional assembler generally
does more or less what the outer mutex (which you're trying to
avoid) does, and costs about the same in run time.


I have to disagree with both of these.


You're arguing against actual measurements made on a Sun Sparc,
under Solaris.


The degree of similarity (or lack thereof) between a memory barrier
and a mutex 1) is entirely platform independent, and 2) isn't really
open to measurement.

Based on what you've said, it comes down to this: the platforms with
which you're familiar include a double-checked lock in their
implementation of a mutex (as long as under Windows you treat
"mutex" as meaning "critical section").

Going back to your original statement, that there's little point in
using double-checked locking, I'd certainly agree that when the
system builds a double-checked lock into its mutex (or what you use
as a mutex anyway), then there's little or no gain from duplicating
that in your own code.


FWIW, there is a fairly big difference between the fast path of
double-checked locking and the fast path of a mutex. The check of a DCL
algorithm does not involve any atomic RMW operations. However, Peterson's
algorithm aside for a moment, the check of a mutex does use an atomic RMW.
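A minimal sketch of that contrast (my own construction, not code from this
thread), using C11 atomics to stand in for the Win32/SPARC primitives: an
uncontended mutex acquire still pays for an atomic RMW, while the DCL check
is a single load.

```c
/* Sketch, assuming C11 <stdatomic.h>: contrasting the two fast paths. */
#include <stdatomic.h>
#include <stddef.h>

static atomic_int lock_word;         /* 0 = free, 1 = held */
static _Atomic(void *) g_instance;   /* pointer published by DCL */

/* Mutex-style fast path: an atomic exchange (RMW) on every acquire. */
int try_acquire(void)
{
    return atomic_exchange_explicit(&lock_word, 1,
                                    memory_order_acquire) == 0;
}

/* DCL-style check: a plain acquire load -- no RMW, nothing is written. */
void *dcl_check(void)
{
    return atomic_load_explicit(&g_instance, memory_order_acquire);
}
```

On most hardware the exchange implies a locked bus/cache transaction even
when the lock is free; the load does not.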

[ ... ]

If there's a lot of contention, any locking mechanism will be
expensive.


Yes, but in Windows if there's a lot of contention, a critical
section is a lot _more_ expensive than a mutex.


I am not exactly sure why it would be a "lot _more_ expensive". I can see a
certain advantage, but it should not make that much of a difference. I will
post (*) a quick and dirty example program at the end of this message which
attempts to show the performance difference between a `CRITICAL_SECTION', a
`HANDLE' mutex, and no lock at all...

Between processes... The Posix mutex works between
processes, with no kernel code if there is no contention. On
the other hand (compared to Windows), it doesn't use an
identifier; the mutex object itself (pthread_mutex_t) must be in
memory mapped to both processes.


In other words, as I pointed out above, it's doing a double-checked
lock. It manipulates some data directly (with appropriate memory
barriers), and iff that fails, uses the real mutex that involves a
switch to kernel mode (and allows the kernel to schedule another
thread).


The difference is that DCL does not need to manipulate/mutate data in order
to skip mutex acquisition. It only needs to perform a data-dependent load,
which is a plain naked load on basically every system out there except DEC
Alpha.
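Here is the whole pattern spelled out (my own sketch, not code from the
thread), with C11 atomics plus a POSIX mutex; `memory_order_consume' would
model the data-dependent load more precisely, but compilers promote it to
acquire anyway. The fast path is the one load at the top; only first-time
callers ever touch the mutex.

```c
/* Sketch of double-checked locking, assuming C11 atomics and pthreads. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdlib.h>

static _Atomic(int *) g_obj;
static pthread_mutex_t g_mtx = PTHREAD_MUTEX_INITIALIZER;

int *get_instance(void)
{
    /* Fast path: a data-dependent load -- no mutation, no atomic RMW. */
    int *p = atomic_load_explicit(&g_obj, memory_order_acquire);

    if (p == NULL)
    {
        /* Slow path: taken only until the object has been published. */
        pthread_mutex_lock(&g_mtx);
        p = atomic_load_explicit(&g_obj, memory_order_relaxed);
        if (p == NULL)
        {
            p = malloc(sizeof *p);
            *p = 42; /* stand-in for real one-time initialization */
            atomic_store_explicit(&g_obj, p, memory_order_release);
        }
        pthread_mutex_unlock(&g_mtx);
    }
    return p;
}
```

The release store pairs with the acquire load, so a reader that sees the
pointer also sees the initialized object.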

(*)
________________________________________________________________________
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <process.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>

#define READERS 4
#define ITERS 999999
#define L2DEPTH 64
#define L2CACHE 64
/* #define USE_CRITICAL_SECTION */
/* #define NO_LOCKS */

#define ALIGN(p, a) \
    ((((DWORD_PTR)(p)) + ((DWORD_PTR)(a)) - 1U) & \
    ((DWORD_PTR)(-((LONG_PTR)(a)))))

#define ALIGN_PTR(p, a) ((void*)ALIGN(p, a))

#define ALIGN_BUFSIZE(s, a) \
    (((DWORD_PTR)(s)) + ((DWORD_PTR)(a)) - 1U)

#define ALIGN_CHECK(p, a) \
    (! (((DWORD_PTR)(p)) % ((DWORD_PTR)(a))))

#if ! defined (NO_LOCKS)
# if defined (USE_CRITICAL_SECTION)
        typedef CRITICAL_SECTION mutex_type;

# define mutex_create(s) \
           InitializeCriticalSection((s))

# define mutex_destroy(s) \
            DeleteCriticalSection((s))

# define mutex_lock(s) \
            EnterCriticalSection((s))

# define mutex_unlock(s) \
            LeaveCriticalSection((s))

# else
        typedef HANDLE mutex_type;

# define mutex_create(s) \
            (*(s) = CreateMutex(NULL, FALSE, NULL))

# define mutex_destroy(s) \
            CloseHandle(*(s))

# define mutex_lock(s) \
           WaitForSingleObject(*(s), INFINITE)

# define mutex_unlock(s) \
           ReleaseMutex(*(s))
# endif

#else
    typedef int mutex_type;

# define mutex_create(s) ((void)0)
# define mutex_destroy(s) ((void)0)
# define mutex_lock(s) ((void)0)
# define mutex_unlock(s) ((void)0)

#endif

struct l2cache
{
    char buffer[L2CACHE];
};

struct data
{
    struct l2cache array[L2DEPTH];
};

struct global
{
    mutex_type mutex;
    char l2pad2_1[L2CACHE - sizeof(mutex_type)];
    struct data data;
    LONG finish_ref;
    HANDLE finished;
    char l2pad2_2[L2CACHE - (sizeof(LONG) + sizeof(HANDLE))];
};

typedef char compile_time_assert
[
    ! (sizeof(struct global) % L2CACHE) &&
    ! (sizeof(struct data) % L2CACHE) &&
    sizeof(struct data) / L2CACHE == L2DEPTH
    ? 1 : -1
];

static char g_raw_buffer
[
    ALIGN_BUFSIZE(sizeof(struct global), L2CACHE)
] = { '\0' };

static struct global* g_global = NULL;

unsigned WINAPI
reader_thread(void* state)
{
    unsigned i;
    struct l2cache cmp = { { '\0' } };

    for (i = 0; i < ITERS; ++i)
    {
        unsigned d;

        mutex_lock(&g_global->mutex);

        for (d = 0; d < L2DEPTH; ++d)
        {
            if (memcmp(g_global->data.array + d,
                       &cmp,
                       sizeof(cmp)))
            {
                abort();
            }
        }

        mutex_unlock(&g_global->mutex);
    }

    if (! InterlockedDecrement(&g_global->finish_ref))
    {
        SetEvent(g_global->finished);
    }

    return 0;
}

int main(void)
{
    size_t i;
    unsigned id;
    double end;
    clock_t start;
    HANDLE tid[READERS];
    unsigned long int iter_avg_per_thread, iter_avg_total;

    g_global = ALIGN_PTR(g_raw_buffer, L2CACHE);

    assert(ALIGN_CHECK(g_global, L2CACHE));

    g_global->finished = CreateEvent(NULL, FALSE, FALSE, NULL);
    g_global->finish_ref = READERS;

    mutex_create(&g_global->mutex);

    for (i = 0; i < READERS; ++i)
    {
        tid[i] = (HANDLE)_beginthreadex(NULL,
                                        0,
                                        reader_thread,
                                        NULL,
                                        CREATE_SUSPENDED,
                                        &id);
    }

    start = clock();

    for (i = 0; i < READERS; ++i)
    {
        ResumeThread(tid[i]);
    }

    WaitForSingleObject(g_global->finished, INFINITE);

    end = ((double)(clock() - start)) / CLOCKS_PER_SEC;

    if (end)
    {
        iter_avg_per_thread =
            (unsigned long int)(ITERS / end);

        iter_avg_total =
            (unsigned long int)((ITERS * READERS) / end);
    }

    else
    {
        iter_avg_per_thread = ITERS;
        iter_avg_total = ITERS * READERS;
    }

    for (i = 0; i < READERS; ++i)
    {
        WaitForSingleObject(tid[i], INFINITE);
        CloseHandle(tid[i]);
    }

    mutex_destroy(&g_global->mutex);

    CloseHandle(g_global->finished);

    printf("Threads: %u\n"
           "Time: %.3f ms\n"
           "Total Iterations Per-Second: %lu\n"
           "Iterations Per-Second/Per-Thread: %lu\n",
           READERS,
           end * 1000.0,
           iter_avg_total,
           iter_avg_per_thread);

    return 0;
}

________________________________________________________________________
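As an aside, the `ALIGN'/`ALIGN_PTR' macros in the program above are the
usual round-up trick: for a power-of-two alignment `a', `(p + a - 1) & -a'
rounds `p' up to the next multiple of `a'. A portable restatement (mine,
with `uintptr_t' replacing the Win32 `DWORD_PTR'/`LONG_PTR' pair):

```c
/* Sketch: round p up to the next multiple of a power-of-two alignment a. */
#include <stdint.h>

static uintptr_t align_up(uintptr_t p, uintptr_t a)
{
    /* -a in two's complement is the mask that clears the low bits. */
    return (p + a - 1u) & (uintptr_t)(-(intptr_t)a);
}
```

For example, `align_up(13, 8)' yields 16, and a value already aligned is
returned unchanged.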

You define `NO_LOCKS' for a lock-free version, define
`USE_CRITICAL_SECTION' for a version using critical sections, and leave
both undefined for the kernel mutex version. Here is the output I get for
the `CRITICAL_SECTION' version:

Threads: 4
Time: 28297.000 ms
Total Iterations Per-Second: 141357
Iterations Per-Second/Per-Thread: 35339

Here is the kernel mutex:

Threads: 4
Time: 28078.000 ms
Total Iterations Per-Second: 142460
Iterations Per-Second/Per-Thread: 35615

Here is lock-free:

Threads: 4
Time: 11515.000 ms
Total Iterations Per-Second: 347372
Iterations Per-Second/Per-Thread: 86843

This is on XP with an old P4 3.06GHz HyperThreaded processor. Anyway, I
cannot see that much of a difference between the two mutex-based versions.
However, I can see a major difference in the lock-free one! This does not
really prove anything, but it's a little bit interesting.

;^)
