Steve Brecher wrote:
> > Or -- to put it another way -- the CPU usage reported by Task Manager
> > is misleading. It suggests that 50% of your available horse-power is
> > unused. My bet would be that it's more like 5% -- if not actually
> > zero.
>
> I'm curious about why that would be, but as implied above it's rather idle
> curiosity.

Well, the generally reported figure is in that ball-park.
As for explaining it, I should first warn you that I'm not especially
knowledgeable about hardware/chip design, and I'm also relying on a (possibly
faulty) memory, so take all of the following with the usual pinch of salt, and
verify (or refute) it for yourself before depending on it.
That said, my understanding is that, although the Intel HT stuff duplicates
enough registers to allow two independent execution streams, it does /not/
duplicate the ALUs, or the instruction decode pipeline. So the actual
processing power is shared between the two threads, or the two "CPU"s running
them. That means that the HT architecture only provides a benefit when one
thread is stalled on a cache read, or otherwise has nothing in its instruction
pipeline, /and/ the other thread /does/ have all the data and decoded
instructions necessary to proceed. Since the two threads are competing for the
cache space (and in any case most programs spend a lot of time stalled one way
or another), that doesn't happen all that often.
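
If you want to check that for yourself rather than take my word for it, a crude
measurement is easy enough. The following is just a sketch of mine (not tested
on any particular chip), and it assumes a Linux box on which logical CPUs 0 and
1 are the two halves of the same physical core -- check /proc/cpuinfo or lscpu
before trusting the numbers. It times a purely ALU-bound loop run by one
thread, and then by two threads pinned to the two sibling "CPU"s. If those were
two real CPUs you'd see close to a 2x gain in throughput; with one set of ALUs
shared between them you usually see far less.

/* htalu.c -- rough sketch: one ALU-bound thread vs two pinned to the
 * logical CPUs of one physical core.  Assumes Linux and that CPUs 0 and 1
 * are HT siblings; adjust to taste.  Build: gcc -O2 htalu.c -lpthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 400000000ULL

static void *spin(void *arg)
{
    long cpu = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);                 /* pin this thread to one logical CPU */
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);

    /* Dependent integer work: keeps the shared ALUs busy, rarely misses cache. */
    volatile uint64_t x = 1;
    for (uint64_t i = 0; i < ITERS; i++)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    return NULL;
}

static double run(int nthreads)
{
    pthread_t t[2];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, spin, (void *)i);   /* CPUs 0 and 1 */
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    double one = run(1), two = run(2);
    /* Two real CPUs would give a throughput ratio near 2.0; an HT pair
     * sharing one execution core typically gives much less. */
    printf("1 thread: %.2fs  2 threads, same core: %.2fs  throughput: %.2fx\n",
           one, two, 2.0 * one / two);
    return 0;
}
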
There /are/ programs which benefit usefully from HT, but the general experience
seems to be that they are not common. The ideal case (I think) would be when
the two threads were executing the same (fairly small) section of code and the same (not
too big) section of data (so the instruction pipeline and cache would serve
both as well as the same circuitry could serve either one); and the mix of data
accesses is such that the interval between stalls for a cache refill is
approximately equal to the time taken for a cache refill. The less the actual
mix of instructions seen by each CPU resembles that case, the more the whole
system will degrade towards acting like one CPU time-sliced between the two
threads.
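
To put some rough numbers on that balance between compute bursts and refill
stalls, here's a toy model of my own -- the cycle counts are invented purely
for illustration, and it ignores the cache competition I mentioned, which in
real life can eat the gain:

/* Toy model: a thread computes for C cycles, then stalls S cycles waiting
 * for a cache refill.  Alone it keeps the core's issue slots busy C/(C+S)
 * of the time; two such threads sharing one pipeline can fill at most
 * min(1, 2C/(C+S)).  When S is roughly equal to C, HT can approach a 2x
 * gain; when stalls are rare, there is little slack for the second thread. */
#include <stdio.h>

int main(void)
{
    const double cases[][2] = {      /* {compute cycles, stall cycles} */
        {200, 200}, {200, 20}, {1000, 100}, {20, 200}
    };

    for (unsigned i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        double C = cases[i][0], S = cases[i][1];
        double one = C / (C + S);
        double two = 2 * C / (C + S);
        if (two > 1.0)
            two = 1.0;
        printf("C=%4.0f S=%4.0f  one thread %3.0f%% busy, "
               "two threads %3.0f%% busy, HT gain %.2fx\n",
               C, S, 100 * one, 100 * two, two / one);
    }
    return 0;
}
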
Note that, in the worst case, the cache behaviour of the two threads executing
at the same time may be worse than it would be if the same two threads were
time-sliced at coarse intervals by the OS but had the whole of the cache
available to each thread at a time.
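
That worst case is easy enough to provoke on purpose. Another sketch of mine,
with the same Linux and CPU-numbering assumptions as before: give each of two
threads a working set that fits the shared cache on its own but not alongside
the other's (the 300 KB figure below is a guess against a 512 KB L2 -- tune
WSET to your own cache), then compare running the threads one after the other
with running them at the same time on the two logical CPUs of one core.

/* htcache.c -- sketch: two threads whose combined working sets overflow
 * the cache that one HT core's logical CPUs share.  Assumes Linux, with
 * CPUs 0 and 1 as HT siblings.  Build: gcc -O2 htcache.c -lpthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define WSET   (300 * 1024)             /* bytes per thread: tune to your L2 */
#define PASSES 20000

static char buf[2][WSET];

static void *scan(void *arg)
{
    long id = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(id, &set);                  /* thread 0 -> CPU 0, thread 1 -> CPU 1 */
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);

    volatile char sink = 0;
    for (int p = 0; p < PASSES; p++)
        for (size_t i = 0; i < WSET; i += 64)   /* touch one byte per cache line */
            sink += buf[id][i];
    return NULL;
}

static double elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    pthread_t t[2];
    struct timespec a, b, c;

    /* One after the other: each thread gets the whole cache to itself. */
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long id = 0; id < 2; id++) {
        pthread_create(&t[0], NULL, scan, (void *)id);
        pthread_join(t[0], NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &b);

    /* At the same time: the two working sets fight over the shared cache. */
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, scan, (void *)id);
    for (int id = 0; id < 2; id++)
        pthread_join(t[id], NULL);
    clock_gettime(CLOCK_MONOTONIC, &c);

    printf("serial: %.2fs   concurrent on one HT core: %.2fs\n",
           elapsed(a, b), elapsed(b, c));
    return 0;
}
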
So, in short, whether HT pays off for a given job comes down to two questions:
1. When run as a single thread, does the job leave wasted cycles on the
execution units? If a single thread already fills almost all the instruction
issue opportunities there is no gain.
2. Can the two threads share all the caches, branch predictors etc.? That only
works out if both threads are happy with the same cache contents or if they
don't need much cache -- which is also why two unrelated jobs tend to fare worse
than two threads in the same job.