Re: Can extra processing threads help in this case?

From:
"Peter Olcott" <NoSpam@OCR4Screen.com>
Newsgroups:
microsoft.public.vc.mfc
Date:
Tue, 6 Apr 2010 16:51:26 -0500
Message-ID:
<7OSdnWx97IB9MSbWnZ2dnUVZ_g2dnZ2d@giganews.com>
"Hector Santos" <sant9442@nospam.gmail.com> wrote in message
news:%23QIKZaa1KHA.220@TK2MSFTNGP06.phx.gbl...

Peter Olcott wrote:

I would envision only using anything as heavy weight as
SQLite for just the financial aspect of the transaction.


SQLITE is not "heavy weight," it's lightweight and only
good for single-accessor applications. It is very popular
for application configurations or user records, but only
THEY have access and no one else.

You can handle multiple access, but at the expense of
speed. The SQLITE people make no bones about that.
SQLITE works because its target market doesn't have any
sort of critical speed requirement and can afford the
latency of DATAFILE sharing.

SQLITE uses what is called a Reader/Writer Lock technique,
very common in synchronizing a common resource among
threads.


Compared to a simple file, even SQLite is too heavy for the
transaction log, because SQL has no concept of a record
number that maps to a file offset. This means one has to
maintain an index just to keep the records in append order.
Also, even if you have the closest thing SQL has to a record
number, you can't use it as a file byte offset for a seek.

   You can have many readers, but only one writer.
   If readers are active, the writer must wait until there
   are no more readers.
   If writers are active, the reader must wait until there
   are no more writers.

If you use OOP with a class-based ReaderWriter lock, it
makes the programming easier:

  Get
  {
     CReader LOCK()
     get record
  }

  Put
  {
     CWriter LOCK()
     put record
  }

The nice thing is that when you lose local scope, the
destructor of the reader/writer lock will
release/decrement the lock references.
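
For what it's worth, here is a minimal sketch of that scoped-lock
idea on Windows using the slim reader/writer lock (SRWLOCK, Vista
and later). The CRWLock/CReadLock/CWriteLock names are mine, made
up for the example:

#include <windows.h>

// Shared reader/writer lock wrapping the Win32 SRWLOCK.
class CRWLock {
public:
    CRWLock() { InitializeSRWLock(&m_lock); }
    void LockRead()    { AcquireSRWLockShared(&m_lock); }
    void UnlockRead()  { ReleaseSRWLockShared(&m_lock); }
    void LockWrite()   { AcquireSRWLockExclusive(&m_lock); }
    void UnlockWrite() { ReleaseSRWLockExclusive(&m_lock); }
private:
    SRWLOCK m_lock;
};

// Hold shared (read) access for the lifetime of the scope.
class CReadLock {
public:
    explicit CReadLock(CRWLock &rw) : m_rw(rw) { m_rw.LockRead(); }
    ~CReadLock() { m_rw.UnlockRead(); }   // released on scope exit
private:
    CRWLock &m_rw;
};

// Hold exclusive (write) access for the lifetime of the scope.
class CWriteLock {
public:
    explicit CWriteLock(CRWLock &rw) : m_rw(rw) { m_rw.LockWrite(); }
    ~CWriteLock() { m_rw.UnlockWrite(); }
private:
    CRWLock &m_rw;
};

CRWLock g_lock;
void GetRecord() { CReadLock lock(g_lock);  /* get record */ }
void PutRecord() { CWriteLock lock(g_lock); /* put record */ }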

Now in Windows, thread synchronization is generally done
using what are called Kernel Objects. They are SEMAPHORES;
a MUTEX is a special type of semaphore.

For unix, I am very rusty here, but it MIGHT still use the
old-school method, which was also used in DOS, using what I
call "File Semaphores." In other words, a FILE is used
to signify a LOCK.

So one process will create a temporary file:

        process-id.LCK

and the other processes will wait on that file
disappearing; only the OWNER (creator of the lock) can
release/delete it.
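
A minimal sketch of that file-semaphore idea on Windows (the same
pattern works on unix with open() and O_CREAT|O_EXCL); the
AcquireLockFile/ReleaseLockFile names are mine:

#include <windows.h>

// Acquire a lock by atomically creating the .LCK file.
// CREATE_NEW fails if the file already exists, so only one
// process can win; the others wait for it to disappear.
HANDLE AcquireLockFile(const char *szLockName)
{
    for (;;) {
        HANDLE h = CreateFileA(szLockName,
                               GENERIC_WRITE,
                               0,                  // no sharing
                               NULL,
                               CREATE_NEW,         // fail if it exists
                               FILE_ATTRIBUTE_NORMAL |
                               FILE_FLAG_DELETE_ON_CLOSE,
                               NULL);
        if (h != INVALID_HANDLE_VALUE) return h;   // we own the lock
        if (GetLastError() != ERROR_FILE_EXISTS)
            return INVALID_HANDLE_VALUE;           // real error
        Sleep(50);   // another process owns it; poll until deleted
    }
}

// Closing the handle releases the lock: FILE_FLAG_DELETE_ON_CLOSE
// removes the file even if the owner crashes.
void ReleaseLockFile(HANDLE h) { CloseHandle(h); }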

As I understood it, pthreads was an add-on technology
and library to allow unix-based applications to begin
using threads. I can't tell you the details, but as I
always understood it, they all - WINDOWS and UNIX - are
conceptually the same when it comes to common resource
sharing models. In other words, you look for the same kinds
of things in both.

The queue of HTTP requests would use a lighter-weight
simple file.


For you, you can use a single log file or individual *.REQ
files, which might be better/easier using a File
Notification event concept. Can't tell you about *nix, but
for Windows:

      FindFirstChangeNotification()
      ReadDirectoryChangesW()

The former might be available under *nix since it's the
older idea. The latter was introduced in NT 3.51, so it's
available for all NT-based OSes. It is usually used with
IOCP designs for scalability and performance.
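
For reference, a bare-bones synchronous ReadDirectoryChangesW()
loop looks roughly like this (my sketch, error handling trimmed);
watching a spool folder for new *.REQ files would key off
FILE_ACTION_ADDED:

#include <windows.h>
#include <stdio.h>

void WatchFolder(const wchar_t *szDir)
{
    // A directory handle requires FILE_FLAG_BACKUP_SEMANTICS.
    HANDLE hDir = CreateFileW(szDir, FILE_LIST_DIRECTORY,
                    FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                    NULL, OPEN_EXISTING,
                    FILE_FLAG_BACKUP_SEMANTICS, NULL);
    if (hDir == INVALID_HANDLE_VALUE) return;

    DWORD buf[2048];   // buffer must be DWORD-aligned
    DWORD cb;
    while (ReadDirectoryChangesW(hDir, buf, sizeof(buf),
                                 FALSE,                        // no subtree
                                 FILE_NOTIFY_CHANGE_FILE_NAME, // adds/renames
                                 &cb, NULL, NULL))             // synchronous
    {
        FILE_NOTIFY_INFORMATION *p = (FILE_NOTIFY_INFORMATION *)buf;
        for (;;) {
            if (p->Action == FILE_ACTION_ADDED)
                wprintf(L"new file: %.*s\n",
                        (int)(p->FileNameLength / sizeof(WCHAR)),
                        p->FileName);
            if (!p->NextEntryOffset) break;
            p = (FILE_NOTIFY_INFORMATION *)((BYTE *)p + p->NextEntryOffset);
        }
    }
    CloseHandle(hDir);
}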

In fact, one can use ReadDirectoryChangesW() along with
Interlocked Singly Linked Lists:

http://msdn.microsoft.com/en-us/library/ms684121(v=VS.85).aspx

to give you a highly optimized, high-performance atomic
FIFO concept. However, there is a note I see for 64-bit
operations.
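
A minimal sketch of that SList API (my example, not from the MSDN
page): the 64-bit note is that entries must be allocated on a
MEMORY_ALLOCATION_ALIGNMENT boundary; note also that push/pop
order is strictly LIFO, i.e. a stack:

#include <windows.h>
#include <malloc.h>   // _aligned_malloc / _aligned_free

// Each entry must begin with SLIST_ENTRY and be allocated on a
// MEMORY_ALLOCATION_ALIGNMENT boundary (16 bytes on x64).
struct TRequestItem {
    SLIST_ENTRY ItemEntry;   // must be the first member
    int         RequestId;   // illustrative payload
};

int main()
{
    __declspec(align(MEMORY_ALLOCATION_ALIGNMENT)) SLIST_HEADER head;
    InitializeSListHead(&head);

    // Producer: push an item atomically (lock-free).
    TRequestItem *it = (TRequestItem *)
        _aligned_malloc(sizeof(TRequestItem), MEMORY_ALLOCATION_ALIGNMENT);
    it->RequestId = 1;
    InterlockedPushEntrySList(&head, &it->ItemEntry);

    // Consumer: pop atomically; returns NULL when the list is empty.
    SLIST_ENTRY *e = InterlockedPopEntrySList(&head);
    if (e) {
        TRequestItem *got = (TRequestItem *)e;  // ItemEntry is first
        _aligned_free(got);
    }
    return 0;
}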

I would use some sort of IPC to inform the OCR that a
request is available, to eliminate the need for a polled
interface. The OCR process would retrieve its jobs from
this simple file.


See above.

According to the Unix/Linux docs, multiple threads can
append to this file without causing corruption.


So does Windows. However, there could be a dependency on
the storage device and file drivers.

In general, as long as you open for append, write and
close, and don't leave it open, and don't do any file stat
reads or seeking on your own, it works very nicely:


I need to have the file opened for append by one process and
opened for read/write by another process; can't I just keep
it open? If I do close it like you suggest, will its being
open in one process prevent it from being opened by another?

It seems like one process could append and another one could
read/write without interfering with each other.

   FILE *fv = fopen("request.log","at");   // open for append
   if (fv) {
       fprintf(fv,"%s\n",whatever);
       fclose(fv);
   }

However, if you really wanted a guarantee, then you can
use a critical section, a named kernel object (named so
it can be shared among processes), or the sharing-mode
open file functions with a READ-ONLY sharing attribute.
Using CreateFile(), it would look like this:


It would be simpler to bypass the need for this and simply
delegate writing the transaction log file to a single
thread. Also, if the OS already guarantees that append is
atomic, why slow things down unnecessarily?

BOOL AppendRequest(const TYourData &data)
{
  HANDLE h = INVALID_HANDLE_VALUE;
  DWORD maxTime = GetTickCount() + 20*1000;  // 20 seconds max wait
  while (1)
  {
    h = CreateFile("request.log",
                    GENERIC_WRITE,
                    FILE_SHARE_READ,
                    NULL,
                    OPEN_ALWAYS,
                    FILE_ATTRIBUTE_NORMAL,
                    NULL);
    if (h != INVALID_HANDLE_VALUE) break;    // We got a good handle
    int err = GetLastError();
    if (err != ERROR_ACCESS_DENIED &&        // 5
        err != ERROR_SHARING_VIOLATION) {    // 32
       return FALSE;
    }
    if (GetTickCount() > maxTime) {
       SetLastError(err);  // make sure the error is preserved
       return FALSE;
    }
    _cprintf("- waiting: %d ms left\n", maxTime - GetTickCount());
    Sleep(50);
  }
  SetFilePointer(h, 0, NULL, FILE_END);  // append: position at end

  DWORD dw = 0;
  if (!WriteFile(h, (void *)&data, sizeof(data), &dw, NULL)) {
       // something unexpected happened
       CloseHandle(h);
       return FALSE;
  }

  CloseHandle(h);
  return TRUE;
}
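
For the consuming side of that same file, a matching sketch (my
addition, assuming the same fixed-size TYourData records): open
with FILE_SHARE_WRITE so the appender is never blocked, and read
the next record at a remembered offset:

// Read the record at *offset and advance *offset past it.
BOOL ReadRequestAt(LONGLONG *offset, TYourData &data)
{
  HANDLE h = CreateFile("request.log",
                        GENERIC_READ,
                        FILE_SHARE_READ | FILE_SHARE_WRITE, // let writers in
                        NULL,
                        OPEN_EXISTING,
                        FILE_ATTRIBUTE_NORMAL,
                        NULL);
  if (h == INVALID_HANDLE_VALUE) return FALSE;

  LARGE_INTEGER li;
  li.QuadPart = *offset;
  if (!SetFilePointerEx(h, li, NULL, FILE_BEGIN)) {
      CloseHandle(h);
      return FALSE;
  }
  DWORD dw = 0;
  BOOL ok = ReadFile(h, &data, sizeof(data), &dw, NULL) && dw == sizeof(data);
  if (ok) *offset += sizeof(data);   // next record on the next call
  CloseHandle(h);
  return ok;
}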

If this is not the case, then a single thread could be
invoked through some sort of FIFO, which in Unix/Linux is
implemented as a named pipe, with each of the web server
threads writing to the FIFO.


If that is all *nix has to offer, historically, using
named pipes can be unreliable, especially under multiple
threads.


There are several different types of IPC; I chose the named
pipe because it is inherently a FIFO queue.
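
On Linux that is mkfifo(); a minimal sketch (my example, the
/tmp/ocr.fifo path and event struct are made up), relying on the
fact that writes of up to PIPE_BUF bytes to a FIFO are atomic,
so records from concurrent web-server threads will not interleave:

#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

struct event { long txn_id; };   // illustrative fixed-size record

// Writer (web server thread): announce that txn_id is ready.
void notify_request(long txn_id)
{
    struct event ev = { txn_id };
    mkfifo("/tmp/ocr.fifo", 0666);             // no-op if it exists
    int fd = open("/tmp/ocr.fifo", O_WRONLY);  // blocks for a reader
    if (fd >= 0) {
        write(fd, &ev, sizeof ev);
        close(fd);
    }
}

// Reader (OCR process): block until an event arrives.
int wait_request(long *txn_id)
{
    struct event ev;
    int fd = open("/tmp/ocr.fifo", O_RDONLY);
    ssize_t n = (fd >= 0) ? read(fd, &ev, sizeof ev) : -1;
    if (fd >= 0) close(fd);
    if (n != (ssize_t)sizeof ev) return 0;
    *txn_id = ev.txn_id;
    return 1;
}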

But since you continue to mix up your engineering designs,
you need to get that straight - processes vs. threads - and
that decision will decide what to use.


The web server will be a process with one thread per HTTP
request. The OCR will be a process with at least one thread.
I may have multiple threads for differing priorities and
have the higher priority thread preempt the lower ones, such
that only one thread is running at a time.

Let's say you listen and ultimately design a multi-thread-ready
EXE, and you also want to allow multiple EXEs to run, either on
the same machine or on another machine, and want to keep this
dumb FIFO design for your OCR; then by definition you need a
FILE-BASED sharing system.


The purpose of the named pipe is to report the event that
the transaction log has a new transaction available for
processing. I am also envisioning that another named pipe
will report the event that processing is completed on one
HTTP request.

While there are methods to do cross-machine MESSAGING, like
named pipes, it is still fundamentally based on a file
concept behind the scenes; they are just "special
files".


The processes are on the same machine. Apparently this
"file" is not a "disk" file; everything occurs in memory.

You need to trust my 30 years of designing servers with
HUGE IPC requirements. You can write your OWN "messaging
queue" with ideas based on the above AppendRequest(); just
change the file name to some shared resource location:

     \\SERVER_MACHINE\SharedFolder\request.log

and you got your Intra and Inter Process communications,
Local, Remote, Multi-threads, etc.!

Of course, you could use a shared SQL database with tables
like the above to do the same thing.


More overhead.

Your goal as a good "Software Engineer" is to outline the
functional requirements and also use BLACK BOX
interfacing. You could just outline this using an
abstract OOP class:

class CRequestHandlerAbstract {
public:
    virtual bool Append(const TYourData &yd) = 0;
    virtual bool GetNext(TYourData &yd) = 0;
    virtual bool SetFileName(const char *sz) { sfn = sz; return true; }

    struct TYourData {
       ..fields...
    };
protected:
    virtual bool OpenFile() = 0;
    virtual bool CloseFile() = 0;
    string sfn;
};

and that is all you basically need to know. The
implementation of this abstract class will be for the
specific method and OS you will be using. What doesn't
change is your Web server and OCR. It will use the
abstract methods as the interface points.
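
For instance (my sketch, reusing the AppendRequest() and
ReadRequestAt() examples above and assuming TYourData is the same
record type throughout), a Win32 implementation could be as thin as:

class CWin32RequestHandler : public CRequestHandlerAbstract {
public:
    CWin32RequestHandler() : m_readOffset(0) {}
    virtual bool Append(const TYourData &yd)
        { return AppendRequest(yd) != FALSE; }               // sketch above
    virtual bool GetNext(TYourData &yd)
        { return ReadRequestAt(&m_readOffset, yd) != FALSE; } // sketch above
protected:
    // Handles are opened and closed per call in the sketches,
    // so these are no-ops here.
    virtual bool OpenFile()  { return true; }
    virtual bool CloseFile() { return true; }
    LONGLONG m_readOffset;   // next unread record in request.log
};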


At this early stage of my learning process I also need to
get physical so that I better understand what kinds of
things are feasible, and the degree of difficulty in
implementing the various approaches.

Yes that is the sort of system that I have been
envisioning. I still have to have SQL to map the email
address login ID to customer number.


That will depend on how you wish to define your customer
number. If it's purely numeric and serial, i.e., starting at
1, then you can define in your SQL database table schema an
auto-increment id field, which the SQL engine will
auto-increment for you when you first create the user
account with the INSERT command.


Yes, that is the idea.

Example: a table "CUSTOMERS" is created in the database:

CREATE TABLE customers (
  id int auto_increment,
  Name text,
  Email Text,
  Password text
)

When you create the account, the insert will look like
this:

INSERT INTO customers values
    (NULL,'Peter','pete@abc.com','some_hash_value')

By using the NULL for the first ID field, SQL will
automatically use the next ID number.

In general, a typical SQL table layout uses auto-increment
ID fields as the primary or secondary key for each table;
that allows you to avoid duplicating data. So you can have a
SESSIONS table for currently logged-in users:

CREATE TABLE sessions (
  id int auto_increment,  <<--- view it as your transaction session id
  cid int,
  StartTime DateTime,
  EndTime DateTime,
  ..
  ..
)

where the link is Customers.id = Sessions.cid.

WARNING:

One thing to remember is that DBAs (Database Admins) value
their work and are highly paid. Do not argue or dispute
with them as you


I did non-SQL database programming for a decade.

normally do. They most certainly will not have the patience
shown to you here. SQL setup is a HIGHLY complex subject,
and it can be easy if you keep it simple. Don't get LOST
in optimization until the need arises, but using
common-sense table designs should be a no-brainer up front.
Also, while there is a standard "SQL language," there are
differences between SQL engines; like the above CREATE
statements, things are generally slightly different for
different SQL engines. So I advise you to use common SQL
data types and avoid special definitions unless you have
made the final decision to stick with one vendor's SQL
engine.

Yours is a standard design; all you will need at a minimum
for tables are:

  customers   customer table
                     auto-increment primary key: cid

  products    customer products, limits, etc. table
                     auto-increment primary key: pid
                     secondary key: cid

                     This would be a one-to-many table:

                     customers.cid <---o products.cid

                     select * from customers, products
                         where customers.cid = products.cid

                     You can use a JOIN here too, which a DBA
                     will tell you to do, but the above is the
                     BASIC concept.

  sessions    sessions management table
                     can serve as a session history log as well

                     auto-increment primary key: sid
                     secondary key: cid

  requests    your "FIFO"
                     auto-increment primary key: rid
                     secondary key: cid
                     secondary key: sid

Some DBAs might suggest combining tables, using or not
using indices or secondary keys, etc. There is no
real answer, and it highly depends on the SQL engine when
it comes to optimization. So DON'T get lost in it. You can
ALWAYS create indices if need be.


I already know about third normal form and canonical
synthesis. It's all probably moot on this simple little
database. The only index value will be the user email address.

I have been envisioning the primary means of IPC as a
single binary file with fixed-length records. I have also
envisioned how to easily split this binary file so that
it does not grow too large - for example, automatically
split it every day and archive the older portion.


Well, to do that you have no choice but to implement your
own file-sharing class as shown above. The concept is
basically a Log Rotator.
You can now update the CRequestHandlerAbstract class with
one more method requirement:


I am not sure that is true. One process appends to the
file. Another process uses pread() and pwrite() to read and
write the file. These are supposed to be guaranteed to be
atomic, which I take to mean that the OS forces them to
occur sequentially.
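
For what it's worth, the POSIX pieces being described look like
this (a sketch; O_APPEND makes the seek-to-end-plus-write one
atomic step, and pread()/pwrite() work at an absolute offset
without touching any shared file position - that is the extent of
the guarantee, it does not serialize arbitrary readers against
writers):

#include <fcntl.h>
#include <unistd.h>

// Process A: append fixed-size records. With O_APPEND the kernel
// positions to end-of-file and writes as a single step, so two
// appenders cannot overwrite each other's records.
int append_record(const void *rec, size_t len)
{
    int fd = open("transactions.log", O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0) return -1;
    ssize_t n = write(fd, rec, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

// Process B: rewrite record number recno in place. pwrite()
// combines the seek and the transfer, so it never disturbs
// another thread's file offset.
int rewrite_record(const void *rec, size_t len, long recno)
{
    int fd = open("transactions.log", O_RDWR);
    if (fd < 0) return -1;
    off_t off = (off_t)recno * (off_t)len;   // fixed-length records
    int ok = pwrite(fd, rec, len, off) == (ssize_t)len;
    close(fd);
    return ok ? 0 : -1;
}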

class CRequestHandlerAbstract {
public:
    virtual bool Append(const TYourData &yd) = 0;
    virtual bool GetNext(TYourData &yd) = 0;
    virtual bool SetFileName(const char *sz) { sfn = sz; return true; }

    virtual bool RotateLog() = 0; // << NEW REQUIREMENT

    struct TYourData {
       ..fields...
    };
protected:
    virtual bool OpenFile() = 0;
    virtual bool CloseFile() = 0;
    string sfn;
};

But you can also achieve rotation if you use a special
file-naming nomenclature; this is called Log Periods. It
could be based on today's date.

     "request-{yyyymmdd}.log"

That will guarantee a daily log, or do it for other periods:

     "request-{yyyy-mm}.log"     monthly
     "request-{yyyy-ww}.log"     week number
     "request-{yyyymmddhh}.log"  hourly

and so on; and you can also couple it with a size limit.

This can be handled by adding LogPeriod, FileNameFormat, and
MaxLogSize variables which OpenFile() can use:

class CRequestHandlerAbstract {
public:
    virtual bool Append(const TYourData &yd) = 0;
    virtual bool GetNext(TYourData &yd) = 0;
    virtual bool SetFileName(const char *sz) { sfn = sz; return true; }

    virtual bool RotateLog() = 0; // << NEW REQUIREMENT

    struct TYourData {
       ..fields...
    };
protected:
    virtual bool OpenFile() = 0;
    virtual bool CloseFile() = 0;
    string sfn;

public:
    int LogPeriod;          // none, hourly, daily, weekly, monthly...
    int MaxLogSize;
    CString FileNameFormat;
};

and by using a template idea for the file name you can use
string replacements very easily.

    SYSTEMTIME st;
    GetSystemTime(&st);

    CString logfn = FileNameFormat;
    // Int2Str() = simple int-to-string helper
    if (logfn.Find("yyyy") >= 0) logfn.Replace("yyyy", Int2Str(st.wYear));
    if (logfn.Find("mm") >= 0)   logfn.Replace("mm", Int2Str(st.wMonth));
    ... etc ...

    if (MaxLogSize > 0) {
       // GetFileSizeByName() = helper returning a file's size or -1
       DWORD fs = GetFileSizeByName(logfn, NULL);
       if (fs != (DWORD)-1 && fs >= (DWORD)MaxLogSize) {
           // Rename file with a unique serial number appended:
           // "request-yyyymm-1.log"
           // "request-yyyymm-2.log"
           // etc.
           // finding the highest #.

           RenameFileWithASerialNumberAppended(logfn);
       }
    }

etc.

--
HLS
