Re: Can extra processing threads help in this case?

From: Hector Santos <sant9442@nospam.gmail.com>
Newsgroups: microsoft.public.vc.mfc
Date: Sun, 21 Mar 2010 20:07:01 -0400
Message-ID: <#SsWXOVyKHA.5292@TK2MSFTNGP06.phx.gbl>


Peter Olcott wrote:

> I have an application that uses enormous amounts of RAM in a
> very memory bandwidth intensive way. I recently upgraded my
> hardware to a machine with 600% faster RAM and 32-fold more
> L3 cache. This L3 cache is also twice as fast as the prior
> machines cache. When I benchmarked my application across the
> two machines, I gained an 800% improvement in wall clock
> time. The new machines CPU is only 11% faster than the prior
> machine. Both processes were tested on a single CPU.
>
> I am thinking that all of the above would tend to show that
> my process is very memory bandwidth intensive, and thus
> could not benefit from multiple threads on the same machine
> because the bottleneck is memory bandwidth rather than CPU
> cycles. Is this analysis correct?

As stated numerous times, your thinking is wrong. I don't fault you,
because you don't have the experience here, but you should not be
ignoring what EXPERTS are telling you - especially if you have never
written multi-threaded applications.

The attached C/C++ simulation (testpeter2t.cpp) illustrates how your
single main thread process with a HUGE redundant memory access
requirement is not optimized for a multi-core/processor machine, nor
for any kind of scalability and performance efficiency.

Compile the attached application.

TestPeter2T.CPP will allow you to test:

   Test #1 - a single main thread process
   Test #2 - a multi-threaded (2 threads) process.

To run the single thread process, just run the EXE with no switches:

Here is TEST #1

V:\wc5beta> testpeter2t

- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
---------------------------------------
Time: 12297 | Elapsed: 0 | Len: 0
---------------------------------------
Total Client Time: 12297

The source code is set to allocate a DWORD array with a total memory
block of 1.4 GB. I have a 2GB XP Dual Core Intel box, so the single
thread should show about 50% CPU.

Now this single process test provides the natural quantum scenario
with a ProcessData() function:

void ProcessData()
{
    KIND num;
    for(int r = 0; r < repeat; r++)
       for (DWORD i=0; i < size; i++)
          num = data[i];
}

By natural quantum, I mean there are NO "man-made" interrupts, sleeps or
yields. The OS will preempt the thread naturally, as it can at every quantum.
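
One caveat with a read-only loop like this: num is never used, so an optimizing compiler is free to drop the loads altogether. Here is a minimal sketch of a variant that keeps the reads observable by folding them into a checksum. SumData is a hypothetical name, not part of the attached file, and std::vector stands in for the raw DWORD array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical variant of ProcessData(): every element is read and folded
// into a checksum that is returned, so the loads cannot simply be dropped.
std::uint64_t SumData(const std::vector<std::uint32_t>& data, int repeat)
{
    assert(repeat >= 0);
    std::uint64_t sum = 0;
    for (int r = 0; r < repeat; r++)
        for (std::size_t i = 0; i < data.size(); i++)
            sum += data[i];
    return sum; // the caller should print or otherwise use this value
}
```

The caller has to use the returned value (print it, for example), otherwise the compiler may still decide the work is dead.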

If you ran TWO single-process instances like so:

   start testpeter2T
   start testpeter2T

On my machine, this seriously degraded BOTH processes because of the
HUGE virtual memory and paging requirements. The page faults were
really HIGH and it just never completed. I didn't wish to wait because
it was TOO obvious that it was not optimized for multiple instances.
The memory load requirements were too high here.
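
The arithmetic behind that is simple enough to sketch. The two helper names below are hypothetical and exist only to spell out the comparison: each process instance loads its own copy of the data, while threads in one process share a single copy.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed when each of `instances` processes loads its own copy.
std::uint64_t FootprintSeparateProcesses(std::uint64_t dataBytes, unsigned instances)
{
    assert(instances > 0);
    return dataBytes * instances;
}

// Bytes needed when all threads of one process share a single copy.
std::uint64_t FootprintSharedThreads(std::uint64_t dataBytes, unsigned threads)
{
    assert(threads > 0);
    return dataBytes; // independent of the thread count
}
```

With the 1431655764 bytes reported above, two instances need 2863311528 bytes (about 2.67 GB), which does not fit in a 2GB box, while any number of threads still needs only the one 1431655764-byte block.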

Now comes test #2 with threads. Run the EXE with the /t switch and
this will start TWO threads. Here are the results:

- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
* Wait For Thread Completion
* Done
---------------------------------------
0 | Time: 13500 | Elapsed: 0 | Len: 0
1 | Time: 13016 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 26516

BEHOLD!! Scalability using a SHARED MEMORY ACCESS threaded design.
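
For anyone without a Windows compiler handy, the same shared-memory idea can be sketched in portable C++. This is not the attached program: std::thread is swapped in for CreateThread, std::chrono for GetTickCount, and RunSharedReaders is a hypothetical name. All threads scan the same read-only buffer, and nothing is copied per thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Starts numThreads readers over ONE shared buffer and returns the
// per-thread elapsed time in milliseconds.
std::vector<long long> RunSharedReaders(const std::vector<std::uint32_t>& shared,
                                        unsigned numThreads, int repeat)
{
    assert(numThreads > 0);
    std::vector<long long> elapsedMs(numThreads, 0);
    std::vector<std::uint64_t> sums(numThreads, 0); // keeps the reads observable
    std::vector<std::thread> threads;

    for (unsigned t = 0; t < numThreads; t++) {
        threads.emplace_back([&shared, &elapsedMs, &sums, t, repeat] {
            auto start = std::chrono::steady_clock::now();
            std::uint64_t sum = 0;
            for (int r = 0; r < repeat; r++)
                for (std::size_t i = 0; i < shared.size(); i++)
                    sum += shared[i];
            sums[t] = sum; // each thread writes only its own slot
            elapsedMs[t] = std::chrono::duration_cast<std::chrono::milliseconds>(
                               std::chrono::steady_clock::now() - start).count();
        });
    }
    for (auto& th : threads) th.join(); // wait for all readers to finish
    return elapsedMs;
}
```

Since the buffer is only read, no locking is needed. Once threads start writing to shared data, that no longer holds.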

I am going to recompile the code for 4 threads by changing:

#define NUM_THREADS 4 // # of threads

Let's try it:

V:\wc5beta>testpeter2t /t
- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
- Resuming thread# 2 [000007D8] in 334 msecs.
- Resuming thread# 3 [000007D4] in 500 msecs.
* Wait For Thread Completion
* Done
---------------------------------------
0 | Time: 26078 | Elapsed: 0 | Len: 0
1 | Time: 25250 | Elapsed: 0 | Len: 0
2 | Time: 25250 | Elapsed: 0 | Len: 0
3 | Time: 24906 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 101484

So the summary so far (GetTickCount() reports milliseconds, so these
are roughly seconds per thread):

    1 thread - 12 sec
    2 threads - 13 sec
    4 threads - 25 sec

This is where you begin to look at various designs to improve things.
There are many ideas, but it requires a look at your actual work load.
We didn't use a MEMORY MAPPED FILE and that MIGHT help. I should try
that, but let's try a 3-thread run:

#define NUM_THREADS 3 // # of threads

and recompile, run testpeter2t /t

- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
- Resuming thread# 2 [000007D8] in 334 msecs.
* Wait For Thread Completion
* Done
---------------------------------------
0 | Time: 19453 | Elapsed: 0 | Len: 0
1 | Time: 13890 | Elapsed: 0 | Len: 0
2 | Time: 18688 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 52031

How interesting!! Note how one thread got a near best-case result.

You can actually normalize all this and probably come up with a
formula to guesstimate what the performance will be with a given number
of requests. But this is where WORKER POOLS and IOCP come into play, and
if you are using NUMA, the Windows NUMA API will help there too!
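
One simple way to normalize the runs, as a rough sketch: N threads complete N units of work in about the time the slowest thread takes, so compare that against N single-thread runs done back to back. RelativeThroughput is a hypothetical helper, and it assumes the wall clock time is close to the slowest thread's time, which ignores the staggered resume delays.

```cpp
#include <cassert>

// Throughput relative to the single-thread run: numThreads units of work
// finish in slowestThreadMs, versus one unit in singleThreadMs.
double RelativeThroughput(double singleThreadMs, unsigned numThreads,
                          double slowestThreadMs)
{
    assert(slowestThreadMs > 0.0);
    return (singleThreadMs * numThreads) / slowestThreadMs;
}
```

Plugging in the timings above (12297 ms single thread) gives about 1.82 for 2 threads (slowest 13500 ms), about 1.90 for 3 threads (slowest 19453 ms), and about 1.89 for 4 threads (slowest 26078 ms). On a dual core box the ceiling is 2, so the threaded runs land close to it.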

All in all, Peter, this proves that multiple threads using shared
memory are FAR superior, and that your idea that your application cannot
be redesigned for a multi-core/processor machine is misconceived.

I am willing to bet this simulator is far more stressful than your own
DFA/OCR application in its work load. ProcessData() here is doing NO
WORK at all other than accessing memory. You will not be doing this, so
the ODDS are very high you will run much more efficiently than this
simulator.

I want to hear you say "Oh My!" <g>

--
HLS


//*************************************************************
// File: TestPeter2T.cpp
//
// Example Large Memory Application to illustrate how
// a multi-thread huge shared data process is superior over
// running multiple process instances with redundant huge
// data loading.
//
//*************************************************************

#include <stdio.h>
#include <stdlib.h>   // rand()
#include <windows.h>
#include <string.h>
#include <conio.h>

//------------------------------------------------------
// Parameters to play with
//------------------------------------------------------

#define KIND DWORD // array element type
#define NUM_THREADS 2 // # of threads
#define repeat 10 // data access repeats

DWORD size = MAXLONG/6; // ~1.4GB
KIND *data = NULL;

//------------------------------------------------------
// Functions to simulate application work load
// The process data function simply reads the
// memory.
//------------------------------------------------------

void AllocateData() { data = new KIND[size]; }
void DeallocateData() { delete[] data; } // new[] must be paired with delete[]
void ProcessData()
{
   KIND num;
   for(int r = 0; r < repeat; r++)
      for (DWORD i=0; i < size; i++)
         num = data[i];
}

//------------------------------------------------------
// Thread data to keep timing stats
//------------------------------------------------------

typedef struct _tagTThreadData {
   DWORD index;
   DWORD dwStartTime;
   DWORD dwEndTime;
   DWORD dwLength;
   DWORD dwElapsed;
} TThreadData;

TThreadData ThreadData[NUM_THREADS] = {0};

//----------------------------------------------------------------
// Client Thread
//----------------------------------------------------------------

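DWORD WINAPI ClientThread(LPVOID param)
{
    // Thread entry point; this signature matches LPTHREAD_START_ROUTINE.
    TThreadData *td = (TThreadData *)param;
    td->dwStartTime = GetTickCount();
    ProcessData();
    td->dwEndTime = GetTickCount();
    return 0;
}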

//----------------------------------------------------------------
// Starts the Thread version of this test
//----------------------------------------------------------------

void DoThreads()
{

    ZeroMemory(&ThreadData,sizeof(ThreadData));

    HANDLE hThreads[NUM_THREADS];
    DWORD tid;
    int i;

    //--------------------------------------------------------
    _cprintf("* Starting threads\n");
    //--------------------------------------------------------

    for(i=0;i < NUM_THREADS;i++){
        printf("- Creating thread %d\n", i);
        ThreadData[i].index = i;
        hThreads[i] = CreateThread(
                      NULL,
                      0,
                      (LPTHREAD_START_ROUTINE) ClientThread,
                      (void *)&ThreadData[i],
                      CREATE_SUSPENDED,
                      &tid);
    }

    //--------------------------------------------------------
    _cprintf("* Resuming threads\n");
    //--------------------------------------------------------

    for(i=0;i < NUM_THREADS;i++) {
         int msecs = (rand() % 1000);
         printf("- Resuming thread# %d [%p] in %d msecs.\n",i,(void *)hThreads[i],msecs);
         Sleep(msecs);
         ResumeThread(hThreads[i]);
    }

    //--------------------------------------------------------
    _cprintf("* Wait For Thread Completion\n");
    //--------------------------------------------------------

    while (WaitForMultipleObjects(NUM_THREADS, hThreads, TRUE, 100) == WAIT_TIMEOUT) {
      if (_kbhit() && _getch() == 27) {
         break;
      }
    }

    //--------------------------------------------------------
    _cprintf("* Done\n");
    //--------------------------------------------------------

    printf("---------------------------------------\n");
    DWORD dwTime = 0;
    for (i = 0; i < NUM_THREADS; i++) {
       TThreadData dt = ThreadData[i];
       dwTime += dt.dwEndTime-dt.dwStartTime;
       printf("%-3d | Time: %-6d | Elapsed: %-5d | Len: %-5d\n",
                  i,
                  dt.dwEndTime-dt.dwStartTime,
                  dt.dwElapsed,
                  dt.dwLength);
    }
    printf("---------------------------------------\n");
    printf("Total Time: %d\n",dwTime);
}

//----------------------------------------------------------------
// Starts the process (main thread) version of this test
//----------------------------------------------------------------

void DoSingle()
{
    TThreadData dt;
    ZeroMemory(&dt,sizeof(dt));

    ClientThread(&dt);

    printf("---------------------------------------\n");
    DWORD dwTime = dt.dwEndTime-dt.dwStartTime;
    printf("Time: %-6d | Elapsed: %-5d | Len: %-5d\n",
               dt.dwEndTime-dt.dwStartTime,
               dt.dwElapsed,
               dt.dwLength
               );
    printf("---------------------------------------\n");
    printf("Total Client Time: %d\n",dwTime);

}
//----------------------------------------------------------------
// Main Thread
//----------------------------------------------------------------

int main(int argc, char *argv[])
{

    bool bThreads = false;
    for (int i=1; i < argc; i++) {
      if ((argv[i][0] == '-') || (argv[i][0] == '/')){
         if (!_stricmp(argv[i]+1, "t")) bThreads = true;
      }
    }

    // Widen before multiplying so the byte count cannot overflow.
    unsigned __int64 msize  = (unsigned __int64)size * sizeof(KIND);
    unsigned __int64 msizek = msize / 1024;

    printf("- size : %u\n",size);
    printf("- memory : %I64u (%I64uK)\n",msize, msizek);
    printf("- repeat : %d\n",repeat);

    AllocateData();

    if (bThreads) {
       DoThreads();
    } else {
       DoSingle();
    }

    DeallocateData();

}
