Re: Serious concurrency problems on fast systems

From:
Robert Klemme <shortcutter@googlemail.com>
Newsgroups:
comp.lang.java.programmer
Date:
Wed, 09 Jun 2010 19:30:33 +0200
Message-ID:
<87a1dvF4rrU1@mid.individual.net>
On 09.06.2010 08:06, Kevin McMurtrie wrote:

In article<86mc28Fn90U1@mid.individual.net>,
  Robert Klemme<shortcutter@googlemail.com> wrote:

On 02.06.2010 07:45, Kevin McMurtrie wrote:

In article<4c048acd$0$22090$742ec2ed@news.sonic.net>,
   Kevin McMurtrie<mcmurtrie@pixelmemory.us> wrote:

I've been assisting in load testing some new high performance servers
running Tomcat 6 and Java 1.6.0_20. It appears that the JVM or Linux is
suspending threads for time-slicing in very unfortunate locations. For
example, a thread might suspend in Hashtable.get(Object) after a call to
getProperty(String) on the system properties. It's a synchronized
global so a few hundred threads might pile up until the lock holder
resumes. Odds are that those hundreds of threads won't finish before
another one stops to time slice again. The performance hit has a ton of
hysteresis so the server doesn't recover until it has a lower load than
before the backlog started.

The brute force fix is of course to eliminate calls to shared
synchronized objects. All of the easy stuff has been done. Some
operations aren't well suited to simple CAS. Bottlenecks that are part
of well established Java APIs are time consuming to fix/avoid.

Is there JVM or Linux tuning that will change the behavior of thread
time slicing or preemption? I checked the JDK 6 options page but didn't
find anything that appears to be applicable.


To clarify a bit, this isn't hammering a shared resource. I'm talking
about 100 to 800 synchronizations on a shared object per second for a
duration of 10 to 1000 nanoseconds. Yes, nanoseconds. That shouldn't
cause a complete collapse of concurrency.


It's the nature of locking issues. Up to a particular point it works
pretty well and then locking delays explode because of the positive
feedback.

If you have "a few hundred threads" accessing a single shared lock with
a frequency of 800Hz then you have a design issue - whether you call it
"hammering" or not. It's simply not scalable and if it doesn't break
now it likely breaks with the next step of load increasing.

My older 4-core Mac Xeon can have 64 threads call getProperty(String)
on a shared Properties instance 2 million times each in only 21 real
seconds. That's one call every 164 ns. It's not as good as
ConcurrentHashMap (one call per 0.30 ns) but it's no collapse.


Well, then stick with the old CPU. :-) It's not uncommon that moving to
newer hardware with increased processing resources uncovers issues like
this.
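
If you want to compare both machines directly, a micro benchmark along
the lines you describe could look roughly like this (only a sketch - the
thread and call counts are the ones you quoted, everything else is made
up, and JIT effects can skew such tight loops):

import java.util.Properties;
import java.util.concurrent.CountDownLatch;

public class PropertyBench {
    public static void main(String[] args) throws InterruptedException {
        final Properties props = new Properties();
        props.setProperty("some.key", "some.value");

        final int threads = 64;
        final int callsPerThread = 2000000;
        final CountDownLatch done = new CountDownLatch(threads);

        final long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < callsPerThread; j++) {
                        props.getProperty("some.key");
                    }
                    done.countDown();
                }
            }).start();
        }
        done.await();
        final long elapsed = System.nanoTime() - start;
        // 64 threads * 2 million calls each; prints average ns per call
        System.out.println("ns/call: "
                + (double) elapsed / ((double) threads * callsPerThread));
    }
}

Running the same class on the old and the new box would at least tell
you whether the collapse is reproducible outside the full server.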

Many of the basic Sun Java classes are synchronized. Eliminating all
shared synchronized objects without making a mess of 3rd party library
integration is no easy task.


It would certainly help the discussion if you pointed out which exact
classes and methods you are referring to. I would readily agree that
Sun did a few things wrong initially in the std lib (Vector) which they
partly fixed later. But I am not inclined to believe in a massive (i.e.
affecting many areas) concurrency problem in the std lib.

If they synchronize, they do it for good reason - and you simply need to
limit the number of threads that try to access such a resource. A globally
synchronized, frequently accessed resource in a system with several
hundred threads is a design problem - not necessarily in the
implementation of the resource itself, but rather in its usage.

Next up is looking at the Linux scheduler version and the HotSpot
spinlock timeout. Maybe the two don't mesh and a thread is very likely
to enter a semaphore right as its quantum runs out.


Btw, as far as I can see you haven't yet disclosed how you found out at
which points the threads are suspended. I'm still curious to learn how
you did it - might be a valuable addition to my toolbox.


I have tools based on java.lang.management that will trace thread
contention.


Which tools?

 Thread dumps from QUIT signals show it too. The threads
aren't permanently stuck, they're just passing through 100000 times
slower than normal.


I am not sure I understand how you found out with these tools that
threads are suspended "for time-slicing in very unfortunate locations".
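
The only thing that comes to my mind is ThreadMXBean's contention
monitoring; a rough sketch of what I would try, running inside the
server itself (written from the javadocs, so only an assumption about
what your tools do):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Dumps per-thread lock contention counters of the JVM it runs in. */
public class ContentionDump {
    private static final ThreadMXBean MX = ManagementFactory.getThreadMXBean();

    static {
        if (MX.isThreadContentionMonitoringSupported()) {
            // blocked/waited times are only accumulated from this point on
            MX.setThreadContentionMonitoringEnabled(true);
        }
    }

    public static void dump() {
        // stack depth 5 is enough to see that e.g. Hashtable.get() is the culprit
        for (ThreadInfo info : MX.getThreadInfo(MX.getAllThreadIds(), 5)) {
            if (info == null) {
                continue; // thread terminated in the meantime
            }
            System.out.println(info.getThreadName()
                    + ": blocked " + info.getBlockedCount() + " times, "
                    + info.getBlockedTime() + " ms total, currently waiting for "
                    + info.getLockName());
        }
    }
}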

The problem with staying on the old system is that Oracle bought
Sun and some unpleasant changes are coming. Mac OS X is only suited for
development machines.


Which changes do you expect?

Problem areas:

java.util.Properties - Removed from in-house code but still everywhere
else for everything. Used a lot by Sun and 3rd party code. Only
performs poorly on Linux.


Even if not shared across threads?
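
And if it is shared - the system properties are always a single global
instance - one conceivable workaround is to snapshot them once into a
map without a global lock and only read from that copy. A sketch (the
class name is made up, and it assumes the values don't change after
startup):

import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

public final class PropertySnapshot {
    // Copy of the system properties, taken once at class load time; reads
    // never touch the globally synchronized Hashtable behind
    // System.getProperties().
    private static final Map<String, String> SNAPSHOT = copy(System.getProperties());

    private static Map<String, String> copy(Properties props) {
        Map<String, String> map = new ConcurrentHashMap<String, String>();
        synchronized (props) {
            for (String name : props.stringPropertyNames()) {
                map.put(name, props.getProperty(name));
            }
        }
        return map;
    }

    public static String get(String name) {
        return SNAPSHOT.get(name);
    }
}

Properties set later are of course not visible through the snapshot,
which may or may not be acceptable in your case.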

org.springframework.context.support.ReloadableResourceBundleMessageSource
  - Single-threaded methods down in the bowels of Spring. Only performs
poorly on Linux.

Log4J - Always sucks and needs to be replaced. In the meantime,
removing logging calls except when critical.


Hm, so far we haven't had issues with Log4J unless it was used for
excessive logging (i.e. running production in DEBUG, which is not really
its intended use). As long as you log into a single sink, any
concurrently used logging solution has good potential for contention. :-)
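
Before ripping it out, the usual mitigation is to make sure expensive
log statements are guarded so the message isn't even built when the
level is off (the class and message here are of course made up):

import org.apache.log4j.Logger;

public class OrderService {
    private static final Logger LOG = Logger.getLogger(OrderService.class);

    void process(String orderId) {
        // isDebugEnabled() is a cheap, unsynchronized level check; the guard
        // avoids building the message string when DEBUG is off. Log4J checks
        // the level again inside debug() before calling any appenders.
        if (LOG.isDebugEnabled()) {
            LOG.debug("processing order " + orderId);
        }
        // ... actual work ...
    }
}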

Pools, caches, and resource managers - In-house code that is expected to
run 100 - 300 times per second. Has no dependencies during
synchronization. Has been carefully tuned to be capable of millions of
calls per second on 2, 4, and 8 core hardware. They only stall on
high-end Linux boxes.


Since your high end box has more cores (does it?) and is generally
faster it will sooner exhibit bottlenecks via the cascading effect Lew
described earlier. Although I would readily concede that JVMs and Java
standard libraries do have bugs I am generally more inclined to believe
in a design level solution. For example: if you have a global
connection pool and all threads share it, increasing the number of
threads will at some point lead to contention. In that case you might
have to group threads with a fixed max group size and have a pool per
group. We did a similar thing with ThreadPoolExecutor: we created
several ThreadPoolExecutors and at enqueue time picked one of them round
robin via AtomicInteger.incrementAndGet(). This limits the number of
threads competing for any single queue's locks - roughly as in the
sketch below.
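
A sketch from memory (not our actual implementation; stripe and pool
sizes are arbitrary):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class StripedExecutor {
    private final ExecutorService[] pools;
    private final AtomicInteger ticket = new AtomicInteger();

    public StripedExecutor(int stripes, int threadsPerStripe) {
        pools = new ExecutorService[stripes];
        for (int i = 0; i < stripes; i++) {
            pools[i] = Executors.newFixedThreadPool(threadsPerStripe);
        }
    }

    public void execute(Runnable task) {
        // round robin across the pools: only 1/stripes of all submitters
        // ever compete for the same work queue's lock
        int idx = (ticket.incrementAndGet() & Integer.MAX_VALUE) % pools.length;
        pools[idx].execute(task);
    }
}

I.e. new StripedExecutor(8, 4) instead of one pool with 32 threads.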

Kind regards

    robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
