Re: Saturating message queue?

From:

"L.Allan" <l.allan@worldnet.att.net>

Newsgroups:

microsoft.public.vc.mfc

Date:

Fri, 4 Jan 2008 11:19:27 -0700

Message-ID:

<emHHZ5vTIHA.1168@TK2MSFTNGP02.phx.gbl>

Oooops ... forehead, meet palm <g>

I had kind of suspected recursion ... especially since it was happening with
the release build.

I reworked the logic, and it seems to work better. Thanks for pointing out
the flaws (red face in the mirror).

I bumped the BB_LOOP_COUNTER to 100,000 to see larger elapsed times, and
check that recursion was (probably) not happening.

My impression is that the Frame and Viewer "conversing" with PostMessage
would be much less likely to encounter recursion.

I didn't do the multi-modality distribution, but I think I understand how
that would take into account time-slicing, interruptions from other
processes, and the o/s ... I just bumped the counter to big enough to get
closer to a second or so elapsed and ran it several times at
ABOVE_NORMAL_PRIORITY_CLASS.

Overall, it seemed that SendMessage was about 5x to 10x faster than
PostMessage (see comments about HiResTimer below)

Using function calls instead of the message system ... such as
   loop 100,000 times in CMainFrame pView->CallPassToken(token)
   CMyView::CallPassToken does the following ... pFrame->CallFrame(token)
   CMainFrame::CallFrame is a placeholder to avoid recursion
seemed to be about 100x to 500x faster (if I did it correctly ... big if
<g>)

Do the numbers above seem more or less typical and expected as far as the
magnitude of the difference? Or indicative of flawed code?

Also, the code does use QueryPerformanceCounter with the capability of
setting the priority, such as ABOVE_NORMAL_PRIORITY_CLASS. HiResTimer just
converts the 64-bit (stop - start) to a double. The results of
double elapsed = HrtTimerElapsedMillis();
would be something like 0.01234 for the small values in this code.

// HiResTimer.h
void HrtInit()
{
   processHandle = GetCurrentProcess();
   prevPriorityClass = GetPriorityClass(processHandle);
   QueryPerformanceFrequency(&freq);
}
void HrtStart() { QueryPerformanceCounter(&start); }
void HrtSetPriority(DWORD priority
                     = ABOVE_NORMAL_PRIORITY_CLASS)
{
   prevPriorityClass = GetPriorityClass(processHandle);
   SetPriorityClass(processHandle, priority);
}
void HrtResetPriority(void) { SetPriorityClass(processHandle,
                                                                       prevPriorityClass);
}
double HrtElapsedMillis()
{
   QueryPerformanceCounter(&stop);
   elapsedCounts = (stop.QuadPart - start.QuadPart);
   elapsedMillis = ((elapsedCounts * 1000.0) / freq.QuadPart);
   return elapsedMillis;
}

An aside: I believe we both subscribe to the DK notion of "optimization is
at the root of most software evil". The point of what I'm doing is related
to an attempt to get a quad-core to scale nicely for a app that *should*
benefit from multi-threading (if I knew what I was doing).

So far, I'm seeing only about 60% speedup instead of 400% (a task taking
10ms with one file and one thread is taking 25ms with four files to
accomplish ... instead of 11ms to 12ms.). I am the first to acknowledge that
my expertise with threading is even lower than my messaging expertise. I've
attempted to implement the previous suggestions in an earlier newsgroup post
to replace WFMO with user-interface threads and messaging ... but the degree
of speedup is still about 60%. So I'm trying to "drill down" to better
understand what is happening and what I'm doing incorrectly.

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in message
news:jkjsn35rel8mvfudalu3tj8o5va2on3fh3@4ax.com...

See below...
On Fri, 4 Jan 2008 04:30:12 -0700, "L.Allan" <l.allan@worldnet.att.net>
wrote:

I was interested in timing how long it took for a parent and child thread
to
exchange a series of window messages. I thought I would start with the
Frame
and View in the main thread of a doc/view sdi app for simplicity.

I set up a menu item to initiate a loop. Then the Frame and View exchange
a
token until the token is incremented to a certain value. However, it hangs
if I use a token value of more than 20. Am I doing something wrong? Is
"SendMessageToDescendants" an inappropriate choice? Is the message queue
"saturating"? ... which would seem odd since the app is using
"SendMessage"
rather than "PostMessage"?

*****
Actually, there are two phenomena here: recursive calls and the fact that
the TRACE buffer
can overflow and lose TRACE messages. So you might be seeing saturation
of the TRACE
buffer and not of your app

The code below doesn't "exchange tokens", what it does is generate a set
of
mutually-recursive function calls; the fact that the messaging system is
involved in doing
this is of no consequence.

OnTimeMessaging->OnBbFrameToView->OnBbViewToFrame->OnBbFrameToView->OnBbViewToFrame->...

indefinitely, or at least until you run out of stack space. Your
experiment is flawed.
These are all synchronous calls. The recur on each other until they fall
off the end of
the world, or the token limit is hit, whichever comes first. And in any
case, because
your stack depth is increasing unnaturally, your experiment is being
perturbed by
gratuitous page faults as the stack extends. So whatever data this
experiment generates,
assuming it was getting resolution to something meaningful (which 1ms is
not), will not
reflect the reality of SendMessage performance.

You should do a single SendMessage and timestamp it, and then run 100,
1,000, and 10,000
examples of a single SendMessage, compute mean and standard deviation, and
look for
modality (if you have a multi-modal distribution, it means you are seeing
the timeslicing
happening). Use QueryPerformanceCounter, and ignore the concept of ms as
a useful unit
for measuring this experiment. Microseconds are probably too coarse as
well, but hundreds
of nanoseconds is probably getting down to something that is measurable.
I'd be
disinclined to trust data with resolution greater then tens of
nanoseconds.

LONGLONG sumx = 0;
LONGLONG sumx2 = 0;

for(int i = 0; i < n; i++)
   {
    LARGE_INTEGER start;
    start = QueryPerformanceCounter();
    SendMessage(UWM_SOME_MESSAGE);
    LARGE_INTEGER end = QueryPerformanceCounter();
    LONGLONG dt = (end.QuadPart - start.QuadPart);
    sumx += dt;
    sumx2 += dt * dt;
   }

LONGLONG variance = (n * sumx2 - (sumx*sumx)) / (n * (n - 1);

double sd = sqrt((double)variance);
double mean = (double)sumx / (double)n;

LRESULT OnSomeMessage(WPARAM, LPARAM)
   {
    return 0;
   }

To do standard deviation, you need three variables as shown (this formula
can be obtained
in nearly any book on introductory statistics; for example, on page 198 of
the 1962
edition of John Freund's text Mathematical Statisics). The only
indication this gives of
modality, however, is that if there is a wide variance you might have a
bimodal
distribution. Or a single-modal distribution with a lot of noise.

joe

*****

// MainFrame handles menu command to initiate
void CBbMainFrame::OnTimeMessaging()
{
SendMessageToDescendants(UWM_BB_FRAME_TO_VIEW, 1, 0);
}

// View handles message and checks token value
// 20 or less is ok, hangs with 21
LRESULT CBbView::OnBbFrameToView(WPARAM token, LPARAM)
{
TRACE1("View:FrameToView %d\n", token);
if (token == 1) {
   HiResTimerStart(); // Uses QueryPerformanceCounter
}
if (token < 20) {
   m_pFrame->SendMessage(UWM_BB_VIEW_TO_FRAME, token, 0);
}
else {
   double elapsed = HiResTimerElapsedMillis();
   CString csMsg;
   csMsg.Format("20 loops Token: %d elapsed: %.3f\n", token, elapsed);
   AfxMessageBox(csMsg);
}
return 0;
}

// Frame handles message, increments token, and sends another message
LRESULT CBbMainFrame::OnBbViewToFrame(WPARAM token, LPARAM)
{
TRACE1("Frame:ViewToFrame Token: %d\n", token);
SendMessageToDescendants(UWM_BB_FRAME_TO_VIEW, (token+1), 0);
return 0;
}

****
This results in a series of recursive calls; you could also be running out
of stack space.

What is "HiResTimerElapsedMillis"? QueryPerformanceCounter rounded to ms?
THat's a
REALLY bad idea, because 20 loops could take under 1ms. 1ms is 1000us,
1,000,000ns, and a
pipelined superscalar architecture can dispatch 2 instructions per clock
cycle (350ps)
meaning each instruction can complete in 0.175ns, so 1ms is enough time
to, under ideal
conditions, execute 57,000,000 instructions. For instruction timings, ms
is so coarse as
to be almost unusable.
*****

Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm