Re: multithreading.
"Stephen J. Bevan" <stephen@dino.dnsalias.com> wrote in message
news:871w44nxni.fsf@dnsalias.com...
"Chris Thomasson" <cristom@comcast.net> writes:
"Stephen J. Bevan" <stephen@dino.dnsalias.com> wrote in message
Using malloc(3) directly causes the time for the reference counting C
code to be 30 seconds i.e. 6x slower than O'Caml. So I used a simple
object pooling scheme. Although there are 50000011 allocations, only
14 of them result in a call to malloc(3). The same pool is used in
the linear code so while it may be possible to improve the pool, the
fact that the linear code is 8x quicker than the reference counting
version shows that the (naive) reference counting is the place to look
for performance improvements.
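A simple fixed-size free-list pool along those lines might look like the
following sketch (the pool code itself was not posted, so the names and
block size here are made up):
_____________________________________________________________
/* Hypothetical sketch of a free-list pool: objects are carved out of
   large malloc'd blocks and recycled through a free list, so almost
   no allocation touches malloc(3). */
#include <stdlib.h>

#define POOL_BLOCK_OBJS 4096

struct pool_node {
    struct pool_node* next;
};

struct pool {
    struct pool_node* free_list; /* recycled objects */
    size_t obj_size;             /* fixed object size: must be at least
                                    sizeof(struct pool_node) and a multiple
                                    of the required alignment */
};

void* pool_alloc(struct pool* p) {
    struct pool_node* n = p->free_list;
    if (n != NULL) {
        p->free_list = n->next;
        return n;
    }
    /* Free list empty: one malloc(3) serves POOL_BLOCK_OBJS allocations. */
    {
        unsigned char* block = malloc(p->obj_size * POOL_BLOCK_OBJS);
        size_t i;
        if (block == NULL) return NULL;
        for (i = 1; i < POOL_BLOCK_OBJS; ++i) {
            struct pool_node* f = (struct pool_node*)(block + i * p->obj_size);
            f->next = p->free_list;
            p->free_list = f;
        }
        return block; /* object 0 of the fresh block */
    }
}

void pool_free(struct pool* p, void* obj) {
    struct pool_node* n = obj;
    n->next = p->free_list;
    p->free_list = n;
}
_____________________________________________________________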
The above and the fact that your code is C++ and I'm using C means
that there is little incentive to convert your C++ to C to try it.
Here is a crude C version of my single-threaded region allocator:
There were two reasons for not using your C++ allocator, the fact that
it was in C++ was the second reason. Having a C version does nothing
to address the first reason: the linear version is 8x quicker than the
reference counting version using the _same_ allocator. Thus improving
the allocator will not make the reference counting version faster than
the linear version. It _may_ make the reference counting version
faster than O'Caml. However, looking at your C allocator code it does
observably more work per allocation than the pooling allocator I'm
using so it isn't going to be faster.
What type of pool allocator are you using? AFAICT, a pool allocator is
analogous to a region allocator in that allocation and reclamation can be
as simple as:
_____________________________________________________________
#include <stddef.h>

#define BUF_SIZE 8192

typedef struct region {
    unsigned char buf[BUF_SIZE];
    size_t offset;
} region;

void* region_malloc(
    region* const _this,
    size_t size
) {
    size_t const offset = _this->offset;
    /* adjust size for proper alignment */
    size = (size + sizeof(void*) - 1) & ~(sizeof(void*) - 1);
    if (size + offset <= BUF_SIZE) {
        _this->offset += size;
        return _this->buf + offset;
    }
    return NULL;
}

void region_reset(
    region* const _this
) {
    _this->offset = 0;
}
_____________________________________________________________
Now you can do stuff like:
void thread(void) {
    int o, i1, i2;
    region region[2] = { { { '\0' }, 0 }, { { '\0' }, 0 } };
    for (o = 1 ;; ++o) {
        /* outer-region object: lives until the outer reset below */
        region_malloc(region, (o % 128) + 1);
        for (i1 = 1; i1 < (o % 512) + 2; ++i1) {
            /* many inner-region objects... */
            for (i2 = 1; i2 < (o % 32) + 2; ++i2) {
                region_malloc(region + 1, (i2 % 16) + 1);
            }
            /* ...freed in bulk by a single reset */
            region_reset(region + 1);
        }
        region_reset(region);
    }
}
The ratio of mallocs to frees does not need to be 1:1. Instead, it can be
N mallocs to 1 reset. The granularity can be improved by cleverly
partitioning objects across multiple regions, as the two regions above show.
Talk is cheap, so I actually tested it :-
$ time ./simplify speed
(0x804b148/0x804e008/1/8192)::allocator_sys_expand
(0x804b148/0x8052028/2/16384)::allocator_sys_expand
(0x804b148/0x804e008/2/8192)::allocator_sys_promote
real 0m17.530s
user 0m17.307s
sys 0m0.005s
$
This is one second slower than my original code.
Did you free each object individually, or did you take advantage of the
allocator_reset procedure? The region allocator I posted does not perform
all that well when you call allocator_release. It's best to create multiple
allocators and partition them into zones, and to only call reset on a zone
when it's in a persistent quiescent state.
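Here is a minimal sketch of what I mean, assuming a per-zone count of live
objects (the zone names and layout are made up for illustration; they are
not from the allocator I posted):
_____________________________________________________________
/* Hypothetical sketch: partition allocations across zones and only
   reset a zone once every object in it has been released, i.e., the
   zone has reached a quiescent state.  Builds on the region type
   defined above. */
#include <stddef.h>

struct zone {
    region r;    /* backing region for this zone's objects */
    size_t live; /* number of objects still in use */
};

void* zone_malloc(struct zone* z, size_t size) {
    void* obj = region_malloc(&z->r, size);
    if (obj != NULL) ++z->live;
    return obj;
}

/* Releasing an object just decrements the live count; the memory is
   reclaimed in bulk when the last object in the zone goes away. */
void zone_release(struct zone* z) {
    if (--z->live == 0) {
        region_reset(&z->r);
    }
}
_____________________________________________________________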
Not a huge slowdown,
so as code goes it is fine. However, it clearly doesn't make things
faster and so just reinforces my first reason for not using your
C++ code: allocation is not the bottleneck, (naive) reference counting is.
I clearly misunderstood you; sorry about that nonsense. I was under the
impression that the allocator was the bottleneck. Hmm, can you post the
test code you're using? Perhaps it contains some reference counting patterns
that are compatible with fine/medium-grain partitioned reference counting.
Think of zone-based object reclamation. You can amortize batches of
reference increments/decrements at so-called checkpoints that only need to
be executed on a periodic/episodic basis.
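For instance, something along these lines, where the log type, its size,
and the function names are all made up for illustration:
_____________________________________________________________
/* Hypothetical sketch of amortized reference counting: instead of
   adjusting an object's count on every pointer operation, log the
   decrements and settle them in one pass at a periodic checkpoint. */
#include <stddef.h>

#define LOG_SIZE 1024

struct object {
    long refs;
    /* ... payload ... */
};

struct defer_log {
    struct object* dec[LOG_SIZE]; /* pending decrements */
    size_t count;
};

/* Called at a periodic/episodic checkpoint; one pass settles a whole
   batch of decrements. */
void checkpoint(struct defer_log* log) {
    size_t i;
    for (i = 0; i < log->count; ++i) {
        struct object* obj = log->dec[i];
        if (--obj->refs == 0) {
            /* reclaim obj, e.g., hand it back to the pool/region */
        }
    }
    log->count = 0;
}

void deferred_release(struct defer_log* log, struct object* obj) {
    if (log->count == LOG_SIZE) {
        checkpoint(log); /* log full: settle early */
    }
    log->dec[log->count++] = obj;
}
_____________________________________________________________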
Also, if the benchmark is single-threaded you could apply more and more
elaborate amortizations to existing refcounting techniques. For example, an
increment and a decrement of the same object that both land between two
checkpoints cancel out and never need to touch the count at all.
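Again just a sketch with made-up names, building on the defer_log idea
above, but with an increment log added:
_____________________________________________________________
/* Hypothetical sketch: coalesce matched inc/dec pairs in the logs so
   a short-lived reference costs no counter updates at all.  Reuses
   LOG_SIZE and struct object from the previous sketch.  The O(n^2)
   matching scan is fine for a sketch; a hash table would do better. */
#include <stddef.h>

struct coalesce_log {
    struct object* inc[LOG_SIZE];
    struct object* dec[LOG_SIZE];
    size_t n_inc, n_dec;
};

void coalesced_checkpoint(struct coalesce_log* log) {
    size_t i, j;
    /* cancel matched pairs before touching any counts */
    for (i = 0; i < log->n_dec; ++i) {
        for (j = 0; j < log->n_inc; ++j) {
            if (log->inc[j] == log->dec[i]) {
                log->inc[j] = NULL;
                log->dec[i] = NULL;
                break;
            }
        }
    }
    /* apply whatever survived the coalescing */
    for (j = 0; j < log->n_inc; ++j)
        if (log->inc[j] != NULL) ++log->inc[j]->refs;
    for (i = 0; i < log->n_dec; ++i)
        if (log->dec[i] != NULL && --log->dec[i]->refs == 0) {
            /* reclaim */
        }
    log->n_inc = log->n_dec = 0;
}
_____________________________________________________________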