Re: Bignums, object grind, and garbage collection

From:

John Ersatznom <j.ersatz@nowhere.invalid>

Newsgroups:

comp.lang.java.programmer

Date:

Sat, 16 Dec 2006 17:38:57 -0500

Message-ID:

<em1se6$bp$1@aioe.org>

Lew wrote:

The GC is fast indeed for multiplicities of small objects with short
lifespans. They are quickly discarded and quickly reclaimed.

It is also tunable, and different algorithms are invokable.

I can't expect the end-users to do lots of tuning as part of the
installation, I'm afraid. Unless there's a way to include such tuning in
the distribution, and have the tuning work under different use patterns
by different users, and have it not impact anything else on the
deployment system, including other Java stuff that may live there.

Object creation time may be a tad difficult, but I have read that Java's
creation times are faster than most people believe. I have never heard
that Sun's JVMs re-use objects in some sort of creation pool, but that
doesn't mean that they don't.

It still seems to me that given these two choices, the former must
invariably be faster:

x.add(y) changing the fields in x, whose old value is no longer needed
and where the add can be implemented using either no temporaries, or
reused temporaries.

Foo z = x.add(y) where in addition to the fields in z having to be set,
which is the same work done above, but also:
* a new Foo has to be allocated
* The disused x has to be garbage collected later
(Here the add impmementation ends with something like "return new
Foo(args)".)

However little amortized overhead the additional allocation of z and the
collecting of x takes, it isn't zero and it adds up.

The memory usage is clearly also higher.

Do not overlook the possible influence of JIT compilation to optimize
away apparent inefficiencies in your source, especially under load over
time.

If you can convince me that the Hotspot server VM JIT is smart enough to
change Foo z = x.add(y) under the hood into effectively x.add(y) where x
is simply changed in content and renamed as z. If x is a local variable
and the VM can *prove* that x gets the sole reference to its object, and
that reference never leaves the function (x being a temporary in a
calculation), then that might be possible. Of course, that depends on
several things: probably the object x refers to had to be allocated in
the same method using x, rather than coming in in a parameter; the class
must be provably not keeping an internal reference, such as for a cache...

Anything involving billions of anything will be slow.

True. And it will be more sensitive to inefficiencies. A single
calculation taking a fraction of a second can be 30% slower and no-one
cares, so long as it's isolated and the throughput bottleneck remains
user interaction or I/O. A simulation running for hours taking 130 hours
instead of 100 is taking more than an extra *day*, which is definitely
noticeable! Also, anything involving billions of anything is potentially
going to crash with OOME. (Most VMs including Sun's also get unstable
when memory gets scarce, and abnormal termination of the VM is IME
frequent when memory is tight and especially if the app has already
recovered from OOME, and quite rare otherwise at least for Sun VMs.)
If we have a sequential calculation, such as a simulation with
successive states, ideally we only need a process size around twice the
size of the state plus a little -- to hold the existing state, the next
state being built with reference to the existing state, and some
temporaries. At the end of the cycle, the references to next state and
existing state can simply be swapped, and the new state overwriting the
two-iterations-ago state as it gets built in turn, as well as the
temporaries overwritten.

Use immutable objects and lots of "new" each cycle, however, and the
picture changes. The size grows with every iteration. Then if you're
lucky the gc begins to use a significant fraction of the CPU and the
size stabilizes, with old state data being deallocated at the same rate
as new state data is allocated. The process is somewhat larger and
(maybe only slightly) slower than the picture painted above, but it's
stable and it's pretty internally rigorous, as there's no danger
something is overwritten whose old value is still being used, creating
errors that creep into the calculations quietly.

If you're unlucky the gc can't keep up with the sheer rate of object
allocation demand and either the gc dominates the cpu time as the
simulation is forced to slow down until it only wants new objects at the
maximum rate the gc/allocation code of the VM can produce them or the gc
can't cope and OOME gets thrown. Then the simulation crashes, or it
tries to save or recover only to risk the VM crashing.

You are probably stuck with having to implement prototypes of more than
one approach, and actually profile and measure them. You could start
with the LPGL libraries nebulous pointed out, compared with Java's
standard implementations.

Actually they were BSD licensed, but that's even more liberal from the
look of things.

You clearly know your algorithms. The right algorithms will have so much
more effect than putative difficulties with GC or object creation times.

The right algorithms combined with the sleekest memory management will
be better than either by itself. :)

You can also throw hardware at the problem, like multiway /
multithreaded multiprocessors.

The right algorithms, the sleekest memory management, and the fastest
hardware will be better than any two by themselves. ;)

I know it's not crucial to heavily optimize quick, event-driven bits of
code; only stuff that, in total, takes a user-perceptible amount of
time. But the runs may take hours between the user initiating one and
getting a result. A 3-ms routine taking one millisecond longer to
respond to a keystroke is not noticeable. A 3-hour job slowing to take 4
hours is.