Re: multithreading.
Chris Thomasson wrote:
"Jon Harrop" <usenet@jdh30.plus.com> wrote in message
news:13uvuuv2uqoqt8a@corp.supernews.com...
A data structure implementor using reference counting must be aware of
the overhead of reference counting and work around it by amortizing
reference counts whenever possible. You yourself only just described this
in the context of performance.
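For instance, a count can be taken once at a boundary and raw
"borrowed" pointers used inside it. A minimal hedged C++ sketch of
that idea (every name below is hypothetical):
________________________________________________________
#include <memory>

struct Node { int value; };

// Borrowing callee: takes a raw pointer, so it never touches the
// reference count.
static int process(const Node* n) { return n->value; }

// One increment when the shared_ptr is copied in and one decrement
// when it leaves scope; the hot loop does no count traffic at all.
static long amortized(std::shared_ptr<Node> head) {
  long sum = 0;
  const Node* borrowed = head.get();
  for (int i = 0; i < 1000000; ++i)
    sum += process(borrowed);
  return sum;
}

int main() {
  std::shared_ptr<Node> head = std::make_shared<Node>(Node{42});
  return amortized(head) > 0 ? 0 : 1; // head's count: 1 -> 2 -> 1
}
________________________________________________________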
A data-structure implementer should always be aware of how he is going to
safely reclaim memory in the presence of multi-threading.
Yes.
Garbage collection does not get you a free lunch here.
Your example before was a linked list. With a GC, these are implemented
completely naively because the GC takes care of everything (a free lunch).
Look at the OCaml and F# standard libraries, for example.
Memory overhead? No, a reference count releases the object when the
reference drops to zero which is more accurate than a traditional GC
could ever be.
That is commonly claimed but actually wrong.
Please show me a 100% accurate GC; good luck finding one...
That is irrelevant. Your argument in favor of reference counting was
completely fallacious.
You're incorrect. You need to open your eyes here:
I have proven you wrong by providing a worked counter example. There is no
point in discussing this further.
You claimed that reference counting is "more accurate than a
traditional GC could ever be". Consider:
{
    Bar bar;
    f(bar);
    g();
}
Reference counting will keep "bar" alive until its reference count
happens to be zeroed when it drops out of scope even though it is not
reachable during the call to "g()". Real garbage collectors can collect
"bar" during the call to "g" because it is no longer reachable.
So GCs can clearly collect sooner than reference counters.
Now you're just trolling. Why are you thinking in terms of scope here? Did
you know that C++ allows the programmer to do something like:
{
    {
        Bar bar;
        f(bar);
    }
    g();
}
See? We can go in circles forever here. You're posting nonsense.
This neither proves nor disproves your claim. However, it is relevant to the
previous point because, again, GC provides a free lunch here by collecting
earlier without forcing you to rewrite your code as reference counting did.
Scope can keep reference counts above zero and values alive
unnecessarily when a real GC can collect them, because they are
unreferenced even though they are still in scope.
It will only collect anything when the GC decides to run!
Which may well be before the scope happens to end.
Probably not.
I just tested this in OCaml and F# and both collect the value before the
scope ends.
Anyway, traditional GC is NOT about performance; it's about helping
programmers who do not know how, or do not want, to manage their memory
manually.
By the same logic: C++ is for programmers who don't know, or want, to
write assembler manually.
You're comparing creating efficient manual memory-management schemes with
programming in assembly language?
Yes. Both are only necessary in very specific circumstances.
Are you talking about the extra word of space per-object?
Yes.
How is that different than GC meta-data?
Reference counts consume a lot more space.
Oh boy. Here we go again. I ask you: What reference counting algorithm?
You make blanket statements that are completely false!
I have described OCaml's zero overhead in this context. Show me a reference
counter that uses no space for its reference counts.
Failing that, show me a benchmark where any reference-counted program uses
less memory than my GC'd equivalent.
BTW, there are refcounting
algorithms that do not need any per-object meta-data. For instance,
vZOOM keeps this information on a per-thread basis. There is GC
meta-data that keeps track of objects. There has to be. Think about it.
Yes. For example, OCaml hides two bits inside each pointer totalling zero
overhead. In contrast, a reference counting system is likely to add a
machine word, bloating each and every value by 8 bytes unnecessarily on
modern hardware.
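To make that concrete, here is a hedged illustration of the per-object
cost (this assumes an intrusive, word-sized count, which per-thread
schemes such as vZOOM avoid):
________________________________________________________
#include <atomic>
#include <cstdio>

// Intrusively counted list node: the count is a whole extra machine
// word of bookkeeping on top of the link and the payload.
struct RcNode {
  std::atomic<long> refs; // 8 bytes on typical 64-bit targets
  RcNode* next;
  double value;
};

// The same node as a tracing collector sees it; OCaml-style tag bits
// hide in existing words and cost no extra space per value.
struct GcNode {
  GcNode* next;
  double value;
};

int main() {
  std::printf("refcounted: %zu bytes, traced: %zu bytes\n",
              sizeof(RcNode), sizeof(GcNode)); // e.g. 24 vs 16
  return 0;
}
________________________________________________________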
There are some counting algorithms that use pointer stealing. Anyway,
you're making false assumptions based on blanket statements. Drill down on
some specific reference counting algorithms, please.
I am referring to all reference counting algorithms that require any space
for their counts, i.e. all of them.
Let's do a single-threaded benchmark first.
That's no good! Multi-threading is the way things are now. You don't get a
free lunch anymore.
Very well. Let's do it in parallel.
What collection-intensive tasks?
Symbolic rewriting would make a good benchmark. Here is one:
http://www.lambdassociates.org/studies/study10.htm
Try implementing this using reference counting and we can see how it
compares (probably on a trickier rewrite). Replace the
arbitrary-precision ints with doubles.
I need to take a look at this, but are you sure that this even needs
memory management?
Yes: the task rewrites expressions trees. There is a very similar C++
program here:
http://www.codecodex.com/wiki/index.php?title=Derivative
You could use that as a basis and add reference counting to it.
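To be concrete, here is a minimal hedged C++ sketch of that idea (not
the codecodex program itself; every name below is hypothetical): an
expression tree whose nodes are reclaimed by shared_ptr reference
counts, so unused subexpressions die as soon as their last reference
is dropped:
________________________________________________________
#include <iostream>
#include <memory>

// Expression trees reclaimed by shared_ptr reference counting.
struct Expr;
typedef std::shared_ptr<const Expr> P;

struct Expr {
  enum Kind { Var, Num, Add, Mul } kind;
  double val; // payload for Num
  P a, b;     // children for Add/Mul
  Expr(Kind k, double v, P x, P y) : kind(k), val(v), a(x), b(y) {}
};

static P var()         { return std::make_shared<Expr>(Expr::Var, 0.0, P(), P()); }
static P num(double v) { return std::make_shared<Expr>(Expr::Num, v, P(), P()); }
static P add(P x, P y) { return std::make_shared<Expr>(Expr::Add, 0.0, x, y); }
static P mul(P x, P y) { return std::make_shared<Expr>(Expr::Mul, 0.0, x, y); }

// Symbolic derivative with respect to the single variable x.
static P d(const P& e) {
  switch (e->kind) {
    case Expr::Var: return num(1.0);
    case Expr::Num: return num(0.0);
    case Expr::Add: return add(d(e->a), d(e->b));
    default:        // product rule: (uv)' = u'v + uv'
      return add(mul(d(e->a), e->b), mul(e->a, d(e->b)));
  }
}

int main() {
  P e = mul(var(), add(var(), num(2.0))); // x * (x + 2)
  P de = d(e);                            // build the derivative tree
  std::cout << "derivative built, root kind = " << de->kind << "\n";
  // Dropping e and de zeroes the root counts and both trees are
  // freed immediately, node by node, with no tracing collector.
  return 0;
}
________________________________________________________
Note that destroying a deep tree through chained shared_ptr
destructors recurses, so a serious rewriter would need an explicit
iterative release; that is exactly the kind of detail the GC'd
versions never have to think about.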
Could I just reserve a large block of memory and carve data-structures out
of it and use caching?
You will run out of memory very quickly. This program spends all of its time
in allocation and collection.
I don't think this needs GC or reference counting.
Unless you detect unused subexpressions and deallocate them you will run out
of memory.
I know I can compete with, and most likely beat, a traditional GC with the
handcrafted implementations that C++ gives me the freedom to work with.
I admire your faith but I would like to see some evidence to back up such
claims because they run contrary to common wisdom accumulated over the
past
few decades.
Here is a stupid contrived example:
________________________________________________________
#include <cstdlib>

struct object {
  object* cache_next;
  bool cached;
  /* [...] */
};

#define OBJ_DEPTH 100000

static object g_obj_buf[OBJ_DEPTH];
static object* g_obj_cache = NULL;

// Thread the static buffer onto the free list and mark its entries
// as cacheable so object_push() never passes them to free().
void object_prime() {
  for (int i = 0; i < OBJ_DEPTH; ++i) {
    g_obj_buf[i].cached = true;
    g_obj_buf[i].cache_next = g_obj_cache;
    g_obj_cache = &g_obj_buf[i];
  }
}

// Pop from the cache, falling back to malloc when it is empty.
object* object_pop() {
  object* obj = g_obj_cache;
  if (obj) {
    g_obj_cache = obj->cache_next;
  } else if ((obj = (object*)std::malloc(sizeof(*obj))) != NULL) {
    obj->cached = false;
  }
  return obj;
}

// Return a cacheable object to the free list; free() the rest.
void object_push(object* obj) {
  if (obj->cached) {
    obj->cache_next = g_obj_cache;
    g_obj_cache = obj;
  } else {
    std::free(obj);
  }
}

void foo() {
  for (;;) {
    object* obj = object_pop();
    object_push(obj);
  }
}
________________________________________________________
This is a fast object cache that will likely perform better than doing it
the GC way, where nobody wants to manage their own memory:
________________________________________________________
void foo() {
  for (;;) {
    // Allocate and drop on the floor; the collector reclaims it.
    object* obj = gc_malloc(sizeof(*obj));
  }
}
________________________________________________________
Yes, manual memory management is more work, and if you don't want to do
that, well, GC can be a lifesaver indeed.
We need this program to perform an irreducible task that I can code up in a
GC'd language to compare performance.
Not incremental? Drill down on that one some more please.
You actually just described the problem: reference counting releases
values when their reference drops to zero.
Yeah. This is NOT true with a traditional GC.
Absolutely. GCs do not have to wait that long: they can collect sooner.
A garbage collector generally cannot be as accurate as a reference count.
Your contrived scope example is nonsense.
I have proven your original statement wrong. Adding "generally" is better
but there is no evidence to support it.
We can at least test this on the symbolic rewriter by measuring the memory
consumption.
At the end of a scope, many reference counts can be zeroed at the same
time and the ensuing collections can stall the program for an
unacceptable amount of time. This can also be extremely difficult to
work around without simply resorting to a real GC.
Stall? Are you sure you know how Proxy GC works? There is no stall.
There is no contrived introspection of thread stacks. There is no
collect-the-world. There is no mark-and-sweep. There is no thread
suspension. Etc., etc.
You've just made a series of irrelevant references here.
How are the implementation details of a GC irrelevant to a discussion on
GC?
Those aren't implementation details of a GC or, if they were supposed to be,
they are decades out of date. Mark and sweep has been incremental for over
three decades.
Of course. I fully expect you to fail on the above benchmark but am
willing to give you the benefit of the doubt.
Any benchmark on GC these days has to be able to use multiple processes or
threads. Ahh... Here is something we can do! Okay, we can create a
multi-threaded daemon that serves factory requests to multiple processes.
This would use multi-threading, multi-processing, and shared memory:
I want to implement the wait-free factory as a multi-threaded daemon
process with which multiple producers can register the path and name of a
shared library; concurrent consumer processes can then look the library up
by name, dynamically link against it, and call a "common" instance
function (i.e., a defined common API). I guess you could think of it as a
highly concurrent per-computer COM daemon. The factory consumer threads
can use the lock-free reader pattern, and the producer threads can use a
form of mutual exclusion, including, but not limited to, locks (see the
sketch after this description).
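For the consumer side, here is a minimal, single-process C++ sketch of
the lock-free reader idea (every name is hypothetical and this is not
vZOOM code; it needs C++20 for std::atomic on shared_ptr, and whether
that load is genuinely lock-free is implementation-defined, so this
only approximates the wait-free design):
________________________________________________________
#include <atomic>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// name -> shared library path, published as an immutable snapshot.
typedef std::map<std::string, std::string> Registry;

class Factory {
  std::atomic<std::shared_ptr<const Registry>> snap_;
  std::mutex writers_; // producers serialize behind this

public:
  Factory() : snap_(std::make_shared<const Registry>()) {}

  // Consumer side: one atomic load, no mutex on the read path.
  std::string lookup(const std::string& name) const {
    std::shared_ptr<const Registry> s =
        snap_.load(std::memory_order_acquire);
    Registry::const_iterator it = s->find(name);
    return it == s->end() ? std::string() : it->second;
  }

  // Producer side: copy the snapshot, modify it, publish it.
  void register_library(const std::string& name,
                        const std::string& path) {
    std::lock_guard<std::mutex> g(writers_);
    std::shared_ptr<Registry> next =
        std::make_shared<Registry>(*snap_.load());
    (*next)[name] = path;
    snap_.store(next, std::memory_order_release);
  }
};

int main() {
  Factory f;
  f.register_library("codec", "/usr/lib/libcodec.so");
  return f.lookup("codec").empty(); // 0 on success
}
________________________________________________________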
Or we can do an efficient observer pattern:
I want the wait-free observer to be a multi-threaded daemon that allows
multiple threads to create/register delegates and message types, allows
consumer threads to register with those delegates and receive their
messages, and lets producer threads create messages and then signal the
delegates that manage those messages to multicast them to their
registered consumers.
Now those programs will definitely test a GC to the limits. BTW, do you
know of a GC that can collect across multiple processes working with
shared memory?
Not multi-process, no. Either multi-threaded with shared memory or
multi-process with message passing.
Which one do you want to do? The multi-thread/process factory or the
multi-thread observer?
I can't see how to make an irreducible task with a correct answer out of
these problem descriptions. I can try to port your C++ code, but my
translations will be open to accusations of cheating in the absence of a
well-defined problem to solve.
That would be a specific kind of cache. Caches do not inherently have
anything to do with allocation.
A cache is an aid to allocation.
The cache on a harddrive is not an "aid to allocation", for example.
Are you sure you know what a cache is? A cache keeps objects that have no
references around to optimize further allocations.
Apparently what you wanted to ask me was how I would implement an
allocation cache in a garbage collected language.
The answer depends upon the language. If you look at OCaml, for example,
there is no need to because the run-time has already automated this.
Then pick a language that does do that.
Allocations are already cached by the minor (or first generation) heap of
most GCs. There is some sense in caching allocations on .NET because it
incurs a lock. You could do that in F# by preallocating an array of values.
However, I'm not sure that would be beneficial. I think allocating an array
of objects would incur multiple locks and allocating an array of structs
would incur a single lock but an indirection at every access (you cannot
pass structs by reference). I'd have to check this out though.
Do you know about the Solaris slab allocator?
No.
You should take a look at it.
Will do.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u