Re: assertions: does it matter that they are disabled in production?
David Abrahams wrote:
on Sun Aug 17 2008, Andrei Alexandrescu
<SeeWebsiteForEmail-AT-erdani.org> wrote:
David Abrahams wrote:
there's no reason to think that code far from the point
where a bug is detected is in a better position to correct the problem.
Actually there may well be. Intuitively, by and large, the deeper you are on
the call stack, the more state you are dealing with. In main, at the top
level, you have no state other than the globals and main's own locals.
That's the exact opposite of my intuition. The program's globals (or the
space in a very shallow stack frame -- it amounts to almost the same
thing) lead to all its other state.
In a well-written program, the
leaves of the call tree are typically general-purpose functions that
make very few assumptions about the whole program and deal only with the
data accessible via their parameters, while further up the stack you
have functions dealing with larger abstractions such as documents,
graphs, etc., whose state encompasses the state being operated on at the
leaves.
There is a misunderstanding at work somewhere around here. The amount of
state accessible to any given function is not the focus here. It's just
the amount of state, period. This is because a failed assertion could
indicate out-of-bounds access, a stray pointer being followed, or the like,
so you can't count on visibility to protect data for you. Of course we
could define a "well-written" program as one that has no such problems, but
then one could just as well define a well-written program as one that does
not need assertions in the first place.
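To illustrate with a deliberately tiny sketch (the function is hypothetical,
not from any real program): a leaf that "sees" only its parameters can still
trample state it has no legitimate visibility into.

  #include <cassert>
  #include <cstddef>

  // Hypothetical leaf function: its visible state is just its parameters,
  // yet an out-of-bounds write corrupts whatever happens to live next to
  // the buffer -- visibility does not protect that data.
  void fill(int* buf, std::size_t len, std::size_t count) {
      assert(count <= len && "precondition violated");
      for (std::size_t i = 0; i < count; ++i)
          buf[i] = 0;  // with count > len and asserts disabled, this tramples neighbors
  }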
Anyhow, even as far as visibility is concerned, it is minimal in main(),
then it increases as the worker functions take over, and then indeed it
decreases as specialized functions are called. So we could say that falling
back to main(), or to a function close to it on the call stack, reduces the
amount of state the program is holding, and therefore the risk of that
state being corrupted.
The less state there is, the less risk of corruption there is.
Of course, there is always the risk that the global heap, the stack,
the global data section, or the code section gets corrupted, but
statistically it is reasonable to think that if you have a large
amount of state that is corrupt at some unpredictable point, reducing
that state by a lot will reduce your risk of corruption as well.
Agreed, but from my point of view, you've got it upside-down.
It's not upside-down even as far as visibility of state is concerned. Few
programs make all or most of their state visible from main() or from
globals. So as you walk the call stack from main() toward the leaves,
you'll see that accessible state first grows, then shrinks.
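A contrived sketch of that shape (hypothetical names, nothing from a real
program): main() holds one local, the worker sees a whole Document, the leaf
sees only a string.

  #include <cstddef>
  #include <string>
  #include <vector>

  struct Document {
      std::string text;
      std::vector<std::size_t> stats;
  };

  // Leaf: deals only with data reachable through its parameter.
  std::size_t countSpaces(const std::string& s) {
      std::size_t n = 0;
      for (char c : s)
          if (c == ' ') ++n;
      return n;
  }

  // Worker: sees the larger abstraction, which encompasses what the leaf sees.
  void analyze(Document& doc) {
      doc.stats.push_back(countSpaces(doc.text));
  }

  int main() {
      Document doc{"hello brave new world", {}};  // main's only local
      analyze(doc);
  }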
By that reasoning, the farther the point of correction from the point
of assertion, the more credible the correction.
Sorry, I just can't get it.
See above.
Throwing exceptions when preconditions have been violated is *usually*
just a way of avoiding responsibility for the hard decision about
whether to terminate the program.
And IMHO that's not a bad thing. Getting back to the batch system I
described in another post, in main() I catch any failed assertion and
proceed to opening and processing a new file. I would have been less
happy if the machine learning library I use decided it's best to abort
the program on my behalf.
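To sketch the shape of it (the exception type and function names here are
hypothetical; the real library differs in the details):

  #include <iostream>
  #include <stdexcept>
  #include <string>
  #include <vector>

  // Hypothetical exception thrown when an internal assertion fails.
  struct AssertionFailed : std::runtime_error {
      using std::runtime_error::runtime_error;
  };

  // Stand-in for the per-file work; all of its state lives below this call.
  void processFile(const std::string& path) {
      // ... open the file, run the learner, write predictions ...
      if (path.empty())
          throw AssertionFailed("empty path");
  }

  int main() {
      std::vector<std::string> files = {"a.dat", "b.dat", ""};
      for (const auto& f : files) {
          try {
              processFile(f);
          } catch (const AssertionFailed& e) {
              // Discard this file's (possibly corrupted) state and move on.
              std::cerr << "assertion failed on '" << f << "': " << e.what() << "\n";
          }
      }
  }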
I am not an expert in the domain of long-running computations, but those
experts I've talked to use checkpointing, as suggested elsewhere in this
thread. I didn't find your argument for continuing to be persuasive,
although I could be missing something -- you left out a lot of detail.
What do the results mean once your state has been corrupted? Are they
even reproducible?
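In its simplest form, checkpointing amounts to something like the following
(a hypothetical sketch, not code from any system discussed here): record how
far the run got, so a restarted run can resume from the last good point
instead of from scratch.

  #include <cstddef>
  #include <fstream>
  #include <string>

  // Record the index of the last successfully processed input.
  void saveCheckpoint(const std::string& path, std::size_t lastDone) {
      std::ofstream out(path, std::ios::trunc);
      out << lastDone << '\n';
  }

  // Returns 0 if no checkpoint has been written yet.
  std::size_t loadCheckpoint(const std::string& path) {
      std::ifstream in(path);
      std::size_t lastDone = 0;
      in >> lastDone;
      return lastDone;
  }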
Good suggestions could go on forever, as long as the amount of work
involved is not an issue. I myself would rather work on fixing the cause
of the assertion instead of adding checkpointing.
I left out a lot of detail to focus on the main thrust, which is that
sometimes it's good for an assert to throw a catchable exception. Yes, the
results are meaningful even after my state has been corrupted, simply
because I happen to get rid of all the corrupted state.
At the highest level, a machine learning system must predict a label (e.g.
word, syllable, part of speech) from some features (e.g. sound samples,
written text). Rinse, lather, repeat millions of times. So the system eats
numbers and utters predictions. You have the true labels in a separate file
for comparison. If the label space is sufficiently large (e.g. beyond 5-6
labels), it is virtually impossible for a corrupt system to systematically
make better guesses than a carefully set-up learner. The madman blabbering
incoherently may, with a stretch of imagination, answer something true or
deep or interesting once in a while, but cannot *systematically* do so
millions of times.
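To make the comparison concrete (a hypothetical sketch of the evaluation
step, not the actual system): accuracy against the true labels is what
exposes a corrupted run.

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  // Fraction of predictions that match the true labels. Over millions of
  // examples with a large label space, a corrupted predictor cannot
  // systematically keep this number high.
  double accuracy(const std::vector<int>& predicted,
                  const std::vector<int>& truth) {
      const std::size_t n = std::min(predicted.size(), truth.size());
      std::size_t hits = 0;
      for (std::size_t i = 0; i < n; ++i)
          if (predicted[i] == truth[i]) ++hits;
      return n ? static_cast<double>(hits) / n : 0.0;
  }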
That said, I'm not absolutist about this; I did write "usually." There
are systems where the price of catastrophic failure that might ensue
from continuing with broken state is low enough that robust measures
like redundant systems or checkpointing aren't worth implementing, but
where it would be better to continue than abort. Toys might fall into
that category. There are also systems where you can have reasonable
confidence that broken invariants are confined to a limited area of the
system (being behind a process boundary is a very strong example).
However, if you find yourself making that argument to yourself about
code under your control, it's a good sign you don't really understand
what that code is doing, and you should rewrite it so that you are able
to better reason about its behavior.
Between calling my dissertation a toy and the social stigma of not
understanding what it is doing, I guess I'll choose the former :o).
<snip description of possible use of shell script to establish process
boundary> ...
Besides, all of this is going on a cluster, where it's somewhat easier
and more efficient to run one straight native binary instead of a
shell script that in turn loads a binary multiple times.
I don't want to presume, but what you're doing, from an architectural
point-of-view, doesn't sound that different from what many other
researchers do with long-running computations. So it seems to me that
there must be plenty of other experience in this area. Did you consult
precedent before deciding how to handle these situations?
There are 3-4 other labs in the country that do Automatic Speech
Recognition, and there is sharing of such knowledge and code among them.
Andrei