Re: inconsistencies when compiling
Jerry Coffin wrote:
In article <6aec2eb4-8ab4-4e50-b94e-3194bf4433ed@e23g2000prf.googlegroups.com>,
james.kanze@gmail.com says...
[ ... ]
Certainly. You *can* hide all of the program in a few fancy >>
or << operators. Something like "aligneq" (see
http://kanze.james.neuf.fr/code-en.html, then navigate through
the sources in the Exec branch), where main() basically just
does:
[ ... copy in and copy out ]
I'm not at all excited about this structure. Rather the contrary,
I think it really does obfuscate what the program does -- if main
looks like it just copies the data in and then copies it back
out, that's generally what it should do.
That's more or less what I was saying. Although in this case,
one could argue that that's all the program does---reads the
data in and then copies it back out. Of course, the output
format is not exactly the same as the input format (but only
white space is changed). But I'm still not totally convinced
that it's a good idea.
My point isn't to hide the guts of the program into fancy >>
and << operators -- rather, it's to isolate the physical
representation of the data into a few specific routines, and
let the rest of the program work with the data in a purely
logical representation.
At the same time, if it's doing processing, I think that
processing should be apparent, and to the extent reasonable
the _type_ of processing should be apparent from how it's being
called. That's why I pointed out the use of accumulate vs.
transform. It's also part of why I stay away from for_each
most of the time -- for_each doesn't even give a clue about
what sort of final result to expect from the processing.
I understand, sort of. In the case of the code I referred to,
the only "processing" is reformatting. And formatting is
traditionally the role of <<. On the other hand, I'd argue that
in most cases, parsing multiple fields is more than what one
expects from >>. To tell the truth, I just don't know. (If I
were doing it again, I'd probably read the lines into an array,
breaking each line down into fields, then extract the necessary
global information from the array, and copy out, possibly using
transform with the global data in the functional object.)
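Something like this, say. (A sketch only: the real aligneq does
considerably more, and the Realign class here is invented for the
example---it just lines up the '=' signs.)

#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

class Realign
{
    std::string::size_type myWidth ;
public:
    explicit Realign( std::string::size_type width )
        : myWidth( width ) {}
    std::string operator()( std::string const& line ) const
    {
        std::string::size_type eq = line.find( '=' ) ;
        if ( eq == std::string::npos || eq == 0 ) {
            return line ;
        }
        std::string::size_type lhsEnd
            = line.find_last_not_of( " \t", eq - 1 ) + 1 ;
        std::string::size_type rhsBegin
            = line.find_first_not_of( " \t", eq + 1 ) ;
        std::string result( line.substr( 0, lhsEnd ) ) ;
        result.append( myWidth - lhsEnd + 1, ' ' ) ;
        result += '=' ;
        if ( rhsBegin != std::string::npos ) {
            result += ' ' ;
            result += line.substr( rhsBegin ) ;
        }
        return result ;
    }
} ;

int main()
{
    //  Copy in...
    std::vector< std::string > lines ;
    std::string line ;
    while ( std::getline( std::cin, line ) ) {
        lines.push_back( line ) ;
    }
    //  Extract the global information (widest left-hand side)...
    std::string::size_type width = 0 ;
    for ( std::vector< std::string >::const_iterator iter
            = lines.begin() ; iter != lines.end() ; ++ iter ) {
        std::string::size_type eq = iter->find( '=' ) ;
        if ( eq != std::string::npos && eq != 0 ) {
            width = std::max( width,
                iter->find_last_not_of( " \t", eq - 1 ) + 1 ) ;
        }
    }
    //  ...and copy out, with the global data in the functor.
    std::transform( lines.begin(), lines.end(),
                    std::ostream_iterator< std::string >(
                        std::cout, "\n" ),
                    Realign( width ) ) ;
    return 0 ;
}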
[ ... ]
One thing at a time, though. And except for special cases, I'm
not sure that this isn't obfuscation. To tell the truth, I'm
not even sure that it isn't obfuscation here. But it was fun to
write, and I've found it quite easy to modify, adding additional
options as time goes on. But I'm not sure that it's a good
general solution---it works well here because the output is a
direct line by line mapping of the input. (And even here, the
Line class collects a lot of additional data during input, which
is used in output.)
I haven't looked through the code (yet) but your description sounds to
me like it really is obfuscation. To the extent possible, operator>>
should be devoted to reading in data and converting it from a physical
to a logical representation. Of course, it needs to deal with errors and
such, but it generally should NOT do processing beyond that.
All it does in addition, in this particular case, is collect
information.
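For what it's worth, an operator>> limited in the way you
describe---convert one line's physical representation into the
logical one, handle errors, and nothing else---might look
something like this (a sketch; the Record type and its fields are
invented for the example):

#include <istream>
#include <sstream>
#include <string>

struct Record
{
    std::string name ;
    int         count ;
    double      value ;
} ;

std::istream&
operator>>( std::istream& in, Record& dest )
{
    std::string line ;
    if ( std::getline( in, line ) ) {
        std::istringstream fields( line ) ;
        Record tmp ;
        if ( fields >> tmp.name >> tmp.count >> tmp.value ) {
            dest = tmp ;        //  commit only on success
        } else {
            in.setstate( std::ios::failbit ) ;
        }
    }
    return in ;
}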
The code has sort of "grown" over time, of course, with more and
more added features---it also uses static members in which to
collect the data, which is a sure sign that it wasn't well
designed. But it started out as a quick hack, to solve one
small problem, and then like most "throw away programs", got
reused and reused, each time with another feature being tacked
on. It probably (certainly) needs a major rewrite (as do one or
two other tools in that directory), but I never seem to find the
time.
That can lead to a problem: istream_iterator is basically
purely sequential, and in some cases you don't want to operate
on everything in sequence. An obvious example is parsing log
files for records of a particular type (or a few particular
types). For a job like this, I'd consider using a
boost::filter_iterator.
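Something along those lines, presumably (a sketch, with an
invented LogRecord type---one record per line, the type
determined by the line's prefix):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <boost/iterator/filter_iterator.hpp>

struct LogRecord
{
    enum Type { info, error } type ;
    std::string               text ;
} ;

std::istream&
operator>>( std::istream& in, LogRecord& dest )
{
    std::string line ;
    if ( std::getline( in, line ) ) {
        dest.type = line.compare( 0, 6, "ERROR:" ) == 0
            ? LogRecord::error
            : LogRecord::info ;
        dest.text = line ;
    }
    return in ;
}

std::ostream&
operator<<( std::ostream& out, LogRecord const& rec )
{
    return out << rec.text ;
}

struct IsError
{
    bool operator()( LogRecord const& rec ) const
    {
        return rec.type == LogRecord::error ;
    }
} ;

void
copyErrors( std::istream& in, std::ostream& out )
{
    typedef std::istream_iterator< LogRecord >    In ;
    typedef boost::filter_iterator< IsError, In > Filtered ;
    std::copy( Filtered( IsError(), In( in ), In() ),
               Filtered( IsError(), In(), In() ),
               std::ostream_iterator< LogRecord >( out, "\n" ) ) ;
}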
Another case is when only part of the file has a specific
format; I encounter this a lot. And of course, every text file
format should allow empty lines and comments. The filter_iterator
would handle the first, and a filtering streambuf can usually be
used effectively for the second---or both---but you still do
want to be able to include the line number in case of an error.
When all of this is considered, I find that istream_iterator can
actually be used rather rarely. Most often, the loop will look
something like:
while ( std::getline( in, line ) ) {
    ++ lineNumber ;
    if ( Gabi::trim( line ) != "" ) {
        if ( ! parseLine( line ) ) {
            std::cerr << progName << ": [" << filename << ':'
                      << lineNumber << "]: syntax error"
                      << std::endl ;
        } else {
            processData( ... ) ;
        }
    }
}
I can imagine ways of handling both the error message and the
line number in operator>> (converting an erroneous line into an
empty line, so that filter_iterator will skip it), but they
really are obfuscation (use of ios::iword(), for example, to
track the line number). Somehow, it just doesn't seem natural.
Whereas the above seems like the standard processing idiom for
a text file.
Similarly for output. I almost never use ostream_iterator,
because much of the time, I'm doing something like:
int elementsInLine = 0 ;
for ( C::const_iterator iter = c.begin() ;
        iter != c.end() ; ++ iter ) {
    if ( elementsInLine == 0 ) {
        out << start of line...
    } else {
        out << element separator...
    }
    out << *iter ;
    ++ elementsInLine ;
    if ( elementsInLine == maxElementsInLine ) {
        out << '\n' ;   //  or other end of line data...
        elementsInLine = 0 ;
    }
}
if ( elementsInLine != 0 ) {
    out << '\n' ;
}
Depending on the case, there may also be a test for iter ==
c.begin() in the loop, or I'll use a while, put the
incrementation in the loop, and test for iter == c.end() after
it, so that I don't get an extra separator. (But a lot of the
time, when I'm outputting text, it's C++ code, typically table
initializers, so an extra separator at the end doesn't matter.)
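For example, the while variant, which avoids the extra separator
entirely (a sketch):

#include <ostream>

template< typename Iterator >
void
writeList( std::ostream& out, Iterator begin, Iterator end )
{
    if ( begin != end ) {
        out << *begin ;
        ++ begin ;
        while ( begin != end ) {
            out << ", " << *begin ; //  separator only between
            ++ begin ;
        }
    }
    out << '\n' ;
}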
At least from my viewpoint, the idea is not to obfuscate the
program by hiding all or most of the processing in operators
<< and >>. Rather, it's to make the program more transparent
by providing a clear division of responsibility between input
conversion, filtering, processing, possible further filtering,
and output.
The problem is that in most people's eyes, I suspect, breaking
text up into fields is not just "input conversion".
That's why I said before: if the program design would logically
lead you to have a class containing the data in the line, then
fine. If it doesn't, I wouldn't force the issue, and create a
class just so that I could use this idiom; that seems to be the
tail wagging the dog to me.
That's also part of why I rarely use std::for_each -- it does
nothing to even give the reader a clue about what sort of
results I'm expecting to produce from this collection of data.
*IF* the functional object has a good, logical name, I don't
have any problem with it. Again, the problem occurs when you
are forced to create a class type solely to use for_each.
In practice, of course, transform and accumulate are probably
better choices most of the time. But there too: how far do you
push the idiom? My CRC, MD5 and SHA classes are "accumulators",
to be used with std::accumulate. Again, it was fun, and the
idiom looks cool, but is it really more readable? (There's also
the problem that the "accumulator" in std::accumulate gets
copied around an awful lot. In the case of CRC, that's not too
much of a problem---the accumulator may be a class type, but
it's only 16 or 32 bits in size. In the case of MD5 and the SHA
classes, however, it has a very noticeable impact on
performance---accumulating a single char is usually only a write
and an index manipulation, but you end up copying something like
48 bytes, twice.)
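To give the flavor (a much simplified sketch, not my actual
class---the state here is a single 32 bit value, so the copying
std::accumulate does is cheap; for MD5 or SHA, it isn't):

#include <numeric>
#include <string>

class Crc32
{
    unsigned long myCrc ;       //  assumed at least 32 bits
public:
    Crc32() : myCrc( 0xFFFFFFFFUL ) {}
    //  std::accumulate's default operation is +, so the
    //  accumulator absorbs one byte per "addition".
    Crc32 operator+( unsigned char ch ) const
    {
        Crc32 result( *this ) ;
        result.myCrc ^= ch ;
        for ( int i = 0 ; i != 8 ; ++ i ) {
            result.myCrc = ( result.myCrc >> 1 )
                ^ ( ( result.myCrc & 1 ) != 0
                        ? 0xEDB88320UL : 0UL ) ;
        }
        return result ;
    }
    unsigned long value() const
    {
        return myCrc ^ 0xFFFFFFFFUL ;
    }
} ;

unsigned long
crcOf( std::string const& text )
{
    return std::accumulate( text.begin(), text.end(),
                            Crc32() ).value() ;
}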
If I use std::transform, that gives a good idea that each
record that's processed will produce a record of output. By
contrast, if I use std::accumulate, they can expect that I'm
producing some sort of summary about the data set as a whole.
And what do you use when each output record is derived from
several input records:-)?
The use of standard vs. filter iterators contributes as well
-- I could put the filtering part into the functor that does
the processing, but that hides the intent from the reader. My
ultimate intent is for main() (or whatever function) to give a
clear, concise summary of what's being done.
Agreed, but what is "clear" often depends on the reader's
expectations.
[ ... ]
I think it depends somewhat on the context. If it makes sense
for the parsed data to be a single class, then I'll go this way;
if it doesn't, then I probably won't. The choice of whether
there is a ParsedData class or not is made at a higher level,
according to the design of the application, and I rather think
introducing it only to be able to use istream_iterator is a bit
of obfuscation.
I definitely wouldn't introduce it _only_ to allow the use of
istream_iterator. At the same time, I'd have to wonder about the design
of a file format if it grouped fields together onto a line, but those
fields really weren't related.
The problem is often that file formats are designed around
pragmatic considerations, not just program design
considerations. If nothing else, you want to (must?) allow empty
lines and comments---it's not rare to have to support
continuation lines as well. Depending on the case, it may be more or less
difficult to handle these, and still be able to output the
correct line number in case of error. Or it may have to deal
with two different formats (e.g. a Windows .ini file).
The question is always: how far do you take things? I've
experimented with using filtering streambufs to read .ini
files: the first level strips comments (if they're supported)
and empty lines, handles continuation lines, and tracks the
line number (since nothing downstream sees all of the '\n'
characters); the second is used to read the attribute-value
pairs in a section: it's inserted after the [...] line has been
read, and declares EOF when it sees a line starting with a '['.
So you can use an istream_iterator< AttributeValuePair > and
std::copy to read an entire section. But it was just that, an
experiment. I feel that the result really was obfuscation,
and while I'm familiar enough with the idioms to be able to
follow it, I don't really expect that to be the case in general.
I suspect that for most readers, it would be total obfuscation,
and I now use a classical "while ( getline(...) )" loop, with a
couple of if's in the loop, depending on what regular expression
matches.
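Roughly this (a sketch, using boost::regex; the handler
functions are invented, and the real code distinguishes more
cases):

#include <iostream>
#include <string>
#include <boost/regex.hpp>

void handleSection( std::string const& name ) ;
void handlePair( std::string const& key,
                 std::string const& value ) ;

void
readIni( std::istream& in, std::string const& filename )
{
    static boost::regex const
        section( "\\s*\\[([^\\]]+)\\]\\s*" ) ;
    static boost::regex const
        pair( "\\s*([^=\\s][^=]*?)\\s*=\\s*(.*?)\\s*" ) ;
    static boost::regex const
        blank( "\\s*(;.*)?" ) ;
    std::string line ;
    int lineNumber = 0 ;
    boost::smatch match ;
    while ( std::getline( in, line ) ) {
        ++ lineNumber ;
        if ( boost::regex_match( line, match, section ) ) {
            handleSection( match[ 1 ] ) ;
        } else if ( boost::regex_match( line, match, pair ) ) {
            handlePair( match[ 1 ], match[ 2 ] ) ;
        } else if ( ! boost::regex_match( line, blank ) ) {
            std::cerr << filename << ':' << lineNumber
                      << ": syntax error" << std::endl ;
        }
    }
}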
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34