Re: Can extra processing threads help in this case?

From:

Hector Santos <sant9442@nospam.gmail.com>

Newsgroups:

microsoft.public.vc.mfc

Date:

Sun, 21 Mar 2010 21:30:22 -0400

Message-ID:

<Oepx78VyKHA.3536@TK2MSFTNGP06.phx.gbl>

This is a multi-part message in MIME format.
--------------010103010301060503070509
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Attached "version 2" of Testpeter2t.cpp, with command line help and
more options to play with different scenarios without recompiling.

    testpeter2t /?

    testpeter2t [options]

    /t - start 2 threads to test
    /t:n - start N threads to test
    /s:# - # of DWORDs in array, default creates 1.4GB bytes
    /r:# - repeat memory reader loop # times (10)

    No switches will start a single main thread process test

    Example: start 8 threads with ~390 MB array

      Testpeter2t /t:8 /s:100000000

Example result on a DUAL CORE 2GB Windows XP

- size : 100000000
- memory : 400000000 (390625K)
- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
- Creating thread 4
- Creating thread 5
- Creating thread 6
- Creating thread 7
* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
- Resuming thread# 2 [000007D8] in 334 msecs.
- Resuming thread# 3 [000007D4] in 500 msecs.
- Resuming thread# 4 [000007D0] in 169 msecs.
- Resuming thread# 5 [000007CC] in 724 msecs.
- Resuming thread# 6 [000007C8] in 478 msecs.
- Resuming thread# 7 [000007C4] in 358 msecs.
* Wait For Thread Completion
* Done
---------------------------------------
0 | Time: 10687 | Elapsed: 0 | Len: 0
1 | Time: 11157 | Elapsed: 0 | Len: 0
2 | Time: 11922 | Elapsed: 0 | Len: 0
3 | Time: 11984 | Elapsed: 0 | Len: 0
4 | Time: 12125 | Elapsed: 0 | Len: 0
5 | Time: 12000 | Elapsed: 0 | Len: 0
6 | Time: 11438 | Elapsed: 0 | Len: 0
7 | Time: 11313 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 92626

--
HLS

Hector Santos wrote:

Peter Olcott wrote:

I have an application that uses enormous amounts of RAM in a
very memory bandwidth intensive way. I recently upgraded my
hardware to a machine with 600% faster RAM and 32-fold more
L3 cache. This L3 cache is also twice as fast as the prior
machines cache. When I benchmarked my application across the
two machines, I gained an 800% improvement in wall clock
time. The new machines CPU is only 11% faster than the prior
machine. Both processes were tested on a single CPU.

I am thinking that all of the above would tend to show that
my process is very memory bandwidth intensive, and thus
could not benefit from multiple threads on the same machine
because the bottleneck is memory bandwidth rather than CPU
cycles. Is this analysis correct?

As stated numerous times, your thinking is wrong. But I don't fault you
because you don't have the experience here, but you should not be
ignoring what EXPERTS are telling you - especially if you never written
multi-threaded applications.

Attached C/C++ simulation (testpeter2t.cpp) illustrates how your single
main thread process with a HUGE redundant memory access requirement is
not optimized for a multi-core/processor machine and for any kind of
scalability and performance efficiency.

Compile the attach application.

TestPeter2T.CPP will allow you to test:

  Test #1 - a single main thread process
  Test #2 - a multi-threads (2) process.

To run the single thread process, just run the EXE with no switches:

Here is TEST #1

V:\wc5beta> testpeter2t

- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
---------------------------------------
Time: 12297 | Elapsed: 0 | Len: 0
---------------------------------------
Total Client Time: 12297

The source code is set to allocate DWORD array with a total memory block
of 1.4 GB. I have a 2GB XP Dual Core Intel box. It should 50% CPU.

Now this single process test provides the natural quantum scenario with
a processdata() function:

void ProcessData()
{
   KIND num;
   for(int r = 0; r < repeat; r++)
      for (DWORD i=0; i < size; i++)
         num = data[i];
}

By natural quantum, there is NO "man-made" interupts, sleeps or yields.
The OS will preempt this as naturally it can do it every quantum.

If you ran TWO single process installs like so:

  start testpeter2T
  start testpeter2T

On my machine it is seriously degraded BOTH process because the HUGE
virtual memory and paging requirements. The page faults were really
HIGH and it just never completed and I didn't wish to wait because it
was TOO obviously was not optimized for multiple instances. The memory
load requirements was too high here.

Now comes test #2 with threads, run the EXE with the /t switch and this
will start TWO threads and here are the results:

- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
* Wait For Thread Completion
* Done
---------------------------------------
0 | Time: 13500 | Elapsed: 0 | Len: 0
1 | Time: 13016 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 26516

BEHOLD!! Scalability using a SHARED MEMORY ACCESS threaded design.

I am going to recompile the code for 4 threads by changing:

#define NUM_THREADS 4 // # of threads

Lets try it:

V:\wc5beta>testpeter2t /t
- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
- Resuming thread# 2 [000007D8] in 334 msecs.
- Resuming thread# 3 [000007D4] in 500 msecs.
* Wait For Thread Completion
* Done
---------------------------------------
0 | Time: 26078 | Elapsed: 0 | Len: 0
1 | Time: 25250 | Elapsed: 0 | Len: 0
2 | Time: 25250 | Elapsed: 0 | Len: 0
3 | Time: 24906 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 101484

So the summary so far:

   1 thread - 12 ms
   2 threads - 13 ms
   4 threads - 25 ms

This is where you begin to look at various designs to improve things.
There are many ideas but it requires a look at your actual work load.
We didn't use a MEMORY MAP FILE and that MIGHT help. I should try that,
but lets try a 3 threads run:

#define NUM_THREADS 3 // # of threads

and recompile, run testpeter2t /t

- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
- Resuming thread# 2 [000007D8] in 334 msecs.
* Wait For Thread Completion
* Done
---------------------------------------
0 | Time: 19453 | Elapsed: 0 | Len: 0
1 | Time: 13890 | Elapsed: 0 | Len: 0
2 | Time: 18688 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 52031

How interesting!! To see how one thread got a near best case result.

You can actually normalize all this can probably come how with a formula
to guessimate what the performance with be with requests. But this is
where WORKER POOLS and IOCP come into play and if you are using NUMA,
the Windows NUMA API will help there too!

All in all peter, this proves how multithreads, using shared memory is
FAR superior then your misconceived idea that your application can not
be resigned for multi-core/processor machine.

I am willing to bet this simulator is for more stressful than your own
DFA/OCR application in its work load. ProcessData() here is don't NO
WORK at all but accessing memory. You will not be doing this, so the
ODDS are very high you will run much more efficiently than this simulator.

I want to hear you say "Oh My!" <g>

--------------010103010301060503070509
Content-Type: text/plain;
name="testpeter2t.cpp"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="testpeter2t.cpp"

//*************************************************************
// File: TestPeter2T.cpp
//
// Example Large Memory Applicatication to illustrate how
// a multi-thread huge shared data process is superior over
// running multiple process instances with redundant huge
// data loading.
//
// By Hector Santos (sant9442@gmail.com)
//
//*************************************************************

#include <stdio.h>
#include <windows.h>
#include <string.h>
#include <conio.h>

//------------------------------------------------------
// Parameters to play with
//------------------------------------------------------

#define KIND DWORD // array element type
#define MAX_THREADS 64 // # of threads
DWORD nRepeat = 10; // data access repeats
DWORD nTotalThreads = 2; // # of threads
DWORD size = MAXLONG/6; // ~1.4GB
KIND *data = NULL;

//------------------------------------------------------
// Functions to simulate application work load
// The process data function simply reads the
// memory.
//------------------------------------------------------

void AllocateData() { data = new KIND[size]; }
void DeallocateData() { delete data; }
void ProcessData()
{
   KIND num;
   for(int r = 0; r < nRepeat; r++) {
      for (DWORD i=0; i < size; i++) {
         num = data[i];
      }
   }
}

//------------------------------------------------------
// Thread data to keep timing stats
//------------------------------------------------------

typedef struct _tagTThreadData {
   DWORD index;
   DWORD dwStartTime;
   DWORD dwEndTime;
   DWORD dwLength;
   DWORD dwElapsed;
} TThreadData;

TThreadData ThreadData[MAX_THREADS] = {0};

//----------------------------------------------------------------
// Client Thread
//----------------------------------------------------------------

void WINAPI ClientThread(TThreadData *data)
{
    data->dwStartTime = GetTickCount();
    ProcessData();
    data->dwEndTime = GetTickCount();
    return;
}

//----------------------------------------------------------------
// Starts the Thread version of this test
//----------------------------------------------------------------

void DoThreads()
{

    ZeroMemory(&ThreadData,sizeof(ThreadData));

    HANDLE hThreads[MAX_THREADS] = {0};
    DWORD tid;
    DWORD i;

    //--------------------------------------------------------
    _cprintf("* Starting threads\n");
    //--------------------------------------------------------

    for(i=0;i < nTotalThreads;i++){
        printf("- Creating thread %d\n", i);
        ThreadData[i].index = i;
        hThreads[i] = CreateThread(
                      NULL,
                      0,
                      (LPTHREAD_START_ROUTINE) ClientThread,
                      (void *)&ThreadData[i],
                      CREATE_SUSPENDED,
                      &tid);
    }

    //--------------------------------------------------------
    _cprintf("* Resuming threads\n");
    //--------------------------------------------------------

    for(i=0;i < nTotalThreads;i++) {
         int msecs = (rand() % 1000);
         printf("- Resuming thread# %d [%08X] in %d msecs.\n",i,hThreads[i],msecs);
         Sleep(msecs);
         ResumeThread(hThreads[i]);
    }

    //--------------------------------------------------------
    _cprintf("* Wait For Thread Completion\n");
    //--------------------------------------------------------

    while (WaitForMultipleObjects(nTotalThreads, hThreads, TRUE, 100) == WAIT_TIMEOUT) {
      if (_kbhit() && _getch() == 27) {
         break;
      }
    }

    //--------------------------------------------------------
    _cprintf("* Done\n");
    //--------------------------------------------------------

    printf("---------------------------------------\n");
    DWORD dwTime = 0;
    for (i = 0; i < nTotalThreads; i++) {
       TThreadData dt = ThreadData[i];
       dwTime += dt.dwEndTime-dt.dwStartTime;
       printf("%-3d | Time: %-6d | Elapsed: %-5d | Len: %-5d\n",
                  i,
                  dt.dwEndTime-dt.dwStartTime,
                  dt.dwElapsed,
                  dt.dwLength);
    }
    printf("---------------------------------------\n");
    printf("Total Time: %d\n",dwTime);
}

//----------------------------------------------------------------
// Starts the process (main thread) version of this test
//----------------------------------------------------------------

void DoSingle()
{
    TThreadData dt;
    ZeroMemory(&dt,sizeof(dt));

    ClientThread(&dt);

    printf("---------------------------------------\n");
    DWORD dwTime = dt.dwEndTime-dt.dwStartTime;
    printf("%Time: %-6d | Elapsed: %-5d | Len: %-5d\n",
               dt.dwEndTime-dt.dwStartTime,
               dt.dwElapsed,
               dt.dwLength
               );
    printf("---------------------------------------\n");
    printf("Total Client Time: %d\n",dwTime);

}
//----------------------------------------------------------------
// Main Thread
//----------------------------------------------------------------

void help()
{
   printf("testpeter2t [options]\n\n");
   printf("/t - start 2 threads to test\n");
   printf("/t:n - start N threads to test\n");
   printf("/s:# - # of DWORDs in array, default creates 1.4GB bytes\n");
   printf("/r:# - repeat memory reader loop # times (10)\n");
   printf("\nNo switches will start a single main thread process test\n");
   printf("\nExample: start 8 threads with ~390 MB array\n");
   printf("\n Testpeter2t /t:8 /s:100000000\n");
   return;
}
void main(char argc, char *argv[])
{

    bool bThreads = false;
    for (int i=1; i < argc; i++) {
      if ((argv[i][0] == '-') || (argv[i][0] == '/')){
         if (!_stricmp(argv[i]+1, "?")) {
             help();
             return;
         }
         if (!_stricmp(argv[i]+1, "t")) {
             bThreads = true;
             continue;
         }
         if (!_strnicmp(argv[i]+1, "t:",2)) {
             bThreads = true;
             nTotalThreads = atoi(argv[i]+3);
             if (nTotalThreads == 0 && nTotalThreads >= MAX_THREADS) {
               printf("! incorrect threads count, 2 to 63\n");
             }
             continue;
         }
         if (!_strnicmp(argv[i]+1, "s:",2)) {
             bThreads = true;
             size = atoi(argv[i]+3);
             if (size == 0 && size > MAXLONG/6) {
               printf("! incorrect size count\n");
             }
             continue;
         }
         if (!_strnicmp(argv[i]+1, "r:",2)) {
             bThreads = true;
             nRepeat = atoi(argv[i]+3);
             if (nRepeat == 0 && nRepeat >= 10) {
               printf("! incorrect repeat count, 1 to 10\n");
             }
             continue;
         }
      }
    }

    _int64 msize = size*sizeof(KIND);
    _int64 msizek = size*sizeof(KIND) / 1024;

    printf("- size : %d\n",size);
    printf("- memory : %I64u (%I64uK)\n",msize, msizek);
    printf("- repeat : %d\n",nRepeat);

    AllocateData();

    if (bThreads) {
       DoThreads();
    } else {
       DoSingle();
    }

    DeallocateData();

}

--------------010103010301060503070509--