Re: STL Queue and multithreading
"James Kanze" <james.kanze@gmail.com> wrote in message
news:f6311a58-68da-425e-865d-5f46b72bd297@v12g2000vbh.googlegroups.com...
On Nov 11, 11:51 am, "Chris M. Thomasson" <cris...@charter.net> wrote:
"James Kanze" <james.ka...@gmail.com> wrote in message
news:b7531a6f-7c92-482c-a24b-342515d918e6@e20g2000vbn.googlegroups.com...
On Nov 10, 10:26 pm, "Chris M. Thomasson" <cris...@charter.net> wrote:
[...]
FWIW, mutex lock/unlock usually costs 2 atomic RMW operations
and 2 memory barriers, one of which may very well be
a #StoreLoad. All of that can be fairly expensive...
Compared to what?
Well, compared to efficient synchronization algorithms based on plain atomic
loads and stores, for instance. Various single-producer/single-consumer
algorithms come to mind; it's basically impossible for a mutex to outperform
them... Period.
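FWIW, here is a rough sketch of the kind of thing I mean, in C++0x-style
atomics (the names and the power-of-two size requirement are mine, just for
illustration). Notice that each side performs exactly one atomic load and one
atomic store per operation; no RMW anywhere:
___________________________________________
#include <atomic>
#include <cstddef>

// bounded SPSC ring; exactly one producer thread and one
// consumer thread, no atomic RMW operations at all
template<typename T, std::size_t N>
class spsc_ring
{
    static_assert((N & (N - 1)) == 0, "N must be a power of two");

    T m_buf[N];
    std::atomic<std::size_t> m_head; // written by the producer only
    std::atomic<std::size_t> m_tail; // written by the consumer only

public:
    spsc_ring() : m_head(0), m_tail(0) {}

    bool push(T const& v) // producer thread only
    {
        std::size_t h = m_head.load(std::memory_order_relaxed);
        if (h - m_tail.load(std::memory_order_acquire) == N)
            return false; // full
        m_buf[h & (N - 1)] = v;
        m_head.store(h + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& v) // consumer thread only
    {
        std::size_t t = m_tail.load(std::memory_order_relaxed);
        if (m_head.load(std::memory_order_acquire) == t)
            return false; // empty
        v = m_buf[t & (N - 1)];
        m_tail.store(t + 1, std::memory_order_release);
        return true;
    }
};
___________________________________________
On x86, every one of those loads and stores compiles down to a plain MOV; no
LOCK'ed instructions and no MFENCE anywhere on the fast path.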
Certainly. But when is synchronization a bottleneck (except in
cases of contention)?
IMVHO, shared writes to a single location are a "bottleneck" by default; they
simply cannot scale...
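A quick way to see it (this toy benchmark is mine; the absolute numbers will
vary by machine): have N threads hammer one shared counter, then have each
thread hammer its own cache-line-padded counter, and compare. The shared line
gets dragged across the interconnect on every increment:
___________________________________________
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long> g_shared(0);

struct alignas(64) padded_counter // one counter per cache line
{
    std::atomic<long> n;
    padded_counter() : n(0) {}
};

padded_counter g_local[64]; // assumes nthreads <= 64

double bench(bool use_shared, unsigned nthreads, long iters)
{
    std::vector<std::thread> pool;
    auto t0 = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < nthreads; ++t)
    {
        pool.emplace_back([=]
        {
            for (long i = 0; i < iters; ++i)
            {
                if (use_shared)
                    // every thread fights over one cache line
                    g_shared.fetch_add(1, std::memory_order_relaxed);
                else
                    // each thread owns its own cache line
                    g_local[t].n.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool)
        th.join();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count()
        / (double(iters) * nthreads);
}

int main()
{
    std::printf("shared line: %.1f ns/increment\n", bench(true, 4, 10000000));
    std::printf("padded line: %.1f ns/increment\n", bench(false, 4, 10000000));
    return 0;
}
___________________________________________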
(And why two atomic operations? I think
I could implement a Mutex lock with a single CAS.)
Nearly all of the mutex implementations I have checked out/disassembled have
two atomic RMW operations: one for locking, and one for unlocking.
Oops. Yes, I misread you. I was just thinking of the lock.
You have to do the same thing when unlocking, of course.
Indeed. ;^)
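To make the count concrete, here is a bare-bones spin mutex in C++0x-style
atomics (a toy of mine; no fairness, no backoff). The lock side is your single
RMW; the point is that a production mutex usually pays a second one on the way
out:
___________________________________________
#include <atomic>

class spin_mutex
{
    std::atomic<bool> m_locked;

public:
    spin_mutex() : m_locked(false) {}

    void lock()
    {
        // atomic RMW #1, with an acquire barrier
        while (m_locked.exchange(true, std::memory_order_acquire))
            ; // spin; real code would back off / yield
    }

    void unlock()
    {
        // a toy lock can get away with a plain release store, but a
        // real (e.g. futex-based) mutex typically does atomic RMW #2
        // here, to find out whether a waiter needs to be woken
        m_locked.store(false, std::memory_order_release);
    }
};
___________________________________________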
The memory barriers are expensive compared to any single
instruction, but in most programs, you shouldn't require the
lock very often. And they aren't expensive, or even
significant, compared to any other important calculations.
The `#StoreLoad' style memory barrier is fairly expensive. It forces a strict
ordering: the executing processor has to drain its store buffer before any
subsequent load can complete, stalling the pipeline...
I know, but if you're invoking it that often, you probably have
a problem of granularity. (But I'm sure that there are
exceptions.)
Well, no matter how fine-grained the data-structure is, a thread will always
need to execute the `#StoreLoad' memory barrier just to acquire the lock.
Think about it for a moment: that barrier is executed regardless of whether
there is any contention or not...
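Here is a trivial way to measure it (my toy benchmark; the absolute numbers
depend on the machine): a single thread locking and unlocking a completely
uncontended mutex still pays the full price on every single iteration:
___________________________________________
#include <chrono>
#include <cstdio>
#include <mutex>

int main()
{
    std::mutex mtx;
    long sink = 0;
    long const iters = 10000000;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
    {
        // uncontended, yet the lock's atomic RMW (and the implied
        // #StoreLoad on many architectures) runs every single time
        std::lock_guard<std::mutex> guard(mtx);
        ++sink;
    }
    auto t1 = std::chrono::steady_clock::now();

    std::printf("%.1f ns per lock/unlock pair (sink=%ld)\n",
        std::chrono::duration<double, std::nano>(t1 - t0).count() / iters,
        sink);
    return 0;
}
___________________________________________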
It's not a question of the ratio of time: you can probably
synchronize twice as fast as a mutex for some uses.
^^^^^^^^^^^^^^
More like an order of magnitude... Indeed, I have seen clever non-blocking
synchronization simply blow a nice lock-based solution completely out of the
damn water; bloody mess!
Ouch. :^/
But the
fact remains that the total time in both cases is an
insignificant part of the time of most applications. If I'm
passing a message in a queue, who cares whether it takes 750ns,
or 1.5us, if the calculations behind it take hundreds of
microseconds?
Fair point. Well, all of that time does add up. Take a data-structure
protected by a read/write mutex versus one protected by a mechanism that
allows readers to execute concurrently with writers. Okay, let's focus on the
reads. Many rw-mutex implementations I have seen use 2 atomic RMWs and 2
memory barriers for a read_lock/unlock pair. The "exotic" reader/writer
solution, on the other hand, does not require any atomic RMWs or any explicit
memory barriers. Here is the scenario:
10 reader threads receive search queries and execute them:
___________________________________________
void reader_thread()
{
    for (;;)
    {
        msg* m = wait_for_message();
        data_structure.search(m);
    }
}
___________________________________________
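And the read side of such an "exotic" scheme can be as simple as this
(an RCU-flavored sketch with names of my own choosing; how the old snapshots
eventually get reclaimed is a separate problem, deferred reclamation, a proxy
collector, whatever, and is deliberately not shown):
___________________________________________
#include <atomic>

struct msg;      // as in the reader loop above
struct snapshot; // some immutable search structure

// writers publish a new snapshot with a release store / CAS
extern std::atomic<snapshot*> g_current;

void search_readside(msg* m)
{
    // one plain atomic load per query; the acquire costs no extra
    // instruction on x86 or SPARC TSO, so the read path executes
    // zero atomic RMWs and zero explicit membars
    snapshot* s = g_current.load(std::memory_order_acquire);

    // ... run the query in *m against *s ...
    (void)m;
    (void)s;
}
___________________________________________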
Let's say that all of the reader threads happen to get 10,000 search requests
each. Well, in the lock-based version, that would be 20,000 atomic RMWs and
20,000 memory barriers per-thread. In the non-blocking reader version...
Well, let's just say that there would be zero atomic RMWs and zero explicit
membars!
Keep in mind that the lock-based version would execute 200,000 atomic RMWs
and 200,000 membars in _total_ to process 100,000 queries!!!! ;^/
:^o
Do the numbers make sense to you James? What do you think?