Re: MFC updates and enhancements

From:

"Tom Serface" <tom.nospam@camaswood.com>

Newsgroups:

microsoft.public.vc.mfc

Date:

Mon, 3 Dec 2007 10:25:48 -0800

Message-ID:

<4C825FF2-96F4-4EFE-B9B7-D073B4B28B51@microsoft.com>

I don't think anyone would argue over 1.5 seconds unless it had to happen a
lot. I can't imagine why there would be any noticeable difference between C
and C++ in this case (except the C code would be very difficult to read). C
and C++ ultimately end up using the same functions to read the file and
there is very little layering overhead on either. I think Python is an
interpreted scripting language, but even so that time seems very long. Did you
try it using something like Perl?

Tom

"Giovanni Dicanio" <giovanni.dicanio@invalid.it> wrote in message
news:ey90gddNIHA.1208@TK2MSFTNGP03.phx.gbl...

"Joseph M. Newcomer" <newcomer@flounder.com> ha scritto nel messaggio
news:g45uk39imme3inivutij0mbmafnp461hpe@4ax.com...

The more fanatic still think C is a barely acceptable language for
efficiency reasons, but as I said, with only one exception I've sold them
on MFC.

With modern hardware, powerful CPUs, large amounts of RAM, etc., I think
that C has no big efficiency advantage over C++...

Last week, a cousin of mine who is pursuing a PhD in biochemistry asked
me to help him.
The problem was the following: he has two text files.
File #1, which we could call the "data" file, stores one record per text
line.
Record fields are separated by a space (' '), and the first field is the
record "key".

Then there is a second file, which we could call the "valid keys" file,
which stores a list of valid keys, one key per line.

The task is to write a program which takes as input the data file name and
the key file name, and prints only those lines of the data file whose
record keys are present in the valid-keys file.
The entire line (entire record) from file #1 is printed, not only the key.

e.g.

-- File #1 - data file -- (each line is a record, record fields are
separated by space)
ZINCO003 34 10 3.4 zinco03.mol ...more data...
ZINCO039 10 -3 2.34 ...
ZINCO394 -3 4 32.3 ...
...

-- File #2 - key list --
ZINCO039
ZINCO394

Given the above files as input, the program should print just:

ZINCO039 10 -3 2.34 ...
ZINCO394 -3 4 32.3 ...

Because the valid keys are ZINCO039 and ZINCO394 (read from file #2), and
these are the keys of the second and third lines in the data file (file #1).

You can use the program like so:

ExtractData data_file key_file >result.txt

and result.txt stores only the interesting records from data_file.

He tried developing it in Python, because developing in Python is very
fast. It worked for small files, but it took 3-4 *hours* to process a
"real world" input, where "real world" means a text data file of 80-100 MB
(or more), which can contain 70,000-80,000 records, and a key file of
about 800 KB.
So he asked me to write the same program over the weekend and in my spare
time, but in C, optimized for speed (3-4 hours for a single run was a huge
waste of time for him).

I developed this program first in C, then in C++, and then in C#, because
I was very interested in speed comparisons.

The C and C++ versions have the *same* speed! You can't tell which
version is C and which version is C++ from the execution time, because
their execution times are the same (maybe a few milliseconds of difference,
but a human being can't see that :)
For some real-world files, my cousin's version in Python took 3-4 hours,
whereas my C and C++ versions took just *one second*!
He was amazed :)

(BTW: I did not use specific Win32 features like memory mapped files or
other Win32 APIs, because my cousin is a big Linux fan and user, and so he
asked for 100% standard C and C++ code. Just a rebuild from Windows/Visual
Studio to Linux/GCC and g++, and the program ran fine on Linux, too.)

The C# version took a little longer, maybe 1.5 or 2 seconds, but that
execution time is of course fine as well (nothing like Python's
3-4 hours :)

Now consider the quality of code and programmer's productivity in these
three scenarios: the C code has IMHO worse quality than the C++ code, and
it's harder to maintain.
The C++ code is more robust: e.g. I used STL classes like std::vector
and std::string in the C++ version, while in the C version I used raw C
arrays, also for strings. Using raw C arrays is harder, requires more
attention, can cause more bugs, is more fragile, etc.

The C# code quality was also a bit better than C++. However, porting from
C++ to C# was trivial (while porting from C to C++ was more work, or
better: a complete rewrite :)

I don't know if my cousin's original Python version was so slow because of
Python, or because of some wrong design decisions of his... For example,
my cousin just used a linear search for keys, and did not read all the
valid keys from the file into memory, whereas I preferred to:

1) read all keys from file #2 into an array in memory
2) sort the valid keys array
3) for each line read from data file
3.1 - extract key from record (line)
3.2 - if the key is a valid key (i.e. it is in the valid keys array of
point #2), print the entire line

In 3.2 I could use a *binary* search (O(log(N))), because the keys array
was sorted. And O(log(N)) gives an important speed gain when you have lots
of items (not 100-200 :) but many more!).

The bottom line is that, in this particular context, there is no speed gain
from using C instead of C++, and even C# has great performance here.
...while the productivity of the programmer, and the quality of the code and
its robustness, are different in these three scenarios.

Of course, I think that people with a lot of experience who also programmed
mainframes (where I think there were few hardware resources available, and
there was no Python or other high-level languages), like Joe and others,
could write code faster than mine (with more advanced tricks).

Giovanni