Saturday, September 22, 2007

Fail Early (But Not Often)

Civic-minded citizens have long been reminded to "Vote Early". A corrupt Chicago politician extended it to "Vote Early And Often." Failures should be handled as early as possible, but shouldn't (under most circumstances) happen very often.

My primary principle is "The earlier the failure, the less severe the consequences." For example, if I didn't check the return code on a file open, the code would merrily continue on and try to write to the bogus file descriptor returned from the "file open". Most current operating systems give you a bad handle error, but earlier ones would just return junk at best and might do something grotesque to the file system at worst. Also, the code has probably already gone past the point where it can put up a coherent error message that includes the filename and system error code. This leads to my second principle about failure handling, which is: "The better the error message, the less tech support hates my guts." One annoying error message that I've had to deal with (as an end user) is "The [someLibrary.DLL] is in use by another program." Which program? Do I have to go through my Task Manager, or use "ps" on *nix, trying to kill one program at a time (not a good idea anyway!) to find out which is the offender?

Most contemporary programming languages have some sort of exception mechanism to help me write code to handle failures. In a language like C++ where I can throw just about anything, I create a superclass designed to handle failures (subclassed as appropriate) combined with a macro that catches all possible types that any code in my executable might throw. As long as I put my catch blocks in the right places (e.g., at the top of a thread or in top-level event handlers) I can return errors without bad side effects. Java, on the other hand, will only allow me to throw descendants of Throwable, which removes the need for the "glue" class and macro that I described for C++, but adds the complexity of having to deal with multiple subclasses of Throwable. There's no such thing as a free lunch.

What if I am working in a language that supports exceptions, but I am using some API that doesn't? For example, suppose I needed to use fopen (the C standard library routine) while working in C++. In this case I always check the value returned from that call (as well as from all other file system calls). If any of those calls reports an error, I convert it into a coherent exception and throw that exception.

To move further down the evolutionary ladder of languages, what happens when I am working in a language (such as shell script) where there is no exception mechanism at all? Here I check for errors after every line of shell script that can possibly fail ("assume the worst"). I keep the code readable by putting the status check off to the right of the "meaningful" code. This means that when I read the code I can concentrate on what it should be doing in most cases, rather than the edge cases of failure. For example, a script can contain a shell function that displays an error message, the name of the script, and how to get instructions for running the script.
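A minimal sketch of that layout follows. The names fail and backup_file, the "-h" option, and the use of "||" as the status check are all assumptions made for illustration, not the script originally described.

```shell
#!/bin/sh
# Hypothetical helper: report the failure, name the script, and point the
# user at the instructions, then stop with a non-zero status.
# The "-h" option is an assumption about how the script shows its help.
fail() {
    echo "ERROR: $1" >&2
    echo "Script: $0" >&2
    echo "Run '$0 -h' for instructions." >&2
    exit 1
}

# Hypothetical task: copy a file into a backup directory and make the copy
# read-only. The status checks sit off to the right of the real work.
backup_file() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir"                        || fail "cannot create $dest_dir"
    cp "$src" "$dest_dir/"                      || fail "cannot copy $src to $dest_dir"
    chmod a-w "$dest_dir/$(basename "$src")"    || fail "cannot protect the copy of $src"
}
```

Calling backup_file notes.txt /tmp/backup either completes all three steps or stops at the first one that fails. Reading down the left-hand column shows only the normal path; the failure handling stays out of the way on the right.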

Note that the error might not be caused by inappropriate usage, but this at least gives the person who ran the script something to see before reading the code in the script.

In the spirit of continuous improvement, I've tried moving the status test down into the shell function, but I discovered (at least on Cygwin) that making the function call changes the global status variable ($?). I even briefly considered doing something like this, where the _chk_ function runs each line of script and then checks the error code:


_chk_ 'first line of script'
_chk_ 'second line of script'

I rejected this, however, because (a) the noise level is too high, and (b) I had serious concerns about quoting problems within each line of shell script code.
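To make the quoting concern concrete, here is a rough sketch of what such a function could look like. It assumes _chk_ would hand its argument to eval; the demo directory and messages are hypothetical.

```shell
#!/bin/sh
# Hypothetical _chk_: run one line of script via eval, then test its status.
_chk_() {
    eval "$1"
    rc=$?    # capture the status before any other command overwrites it
    if [ "$rc" -ne 0 ]; then
        echo "ERROR: '$1' failed with status $rc" >&2
        exit "$rc"
    fi
}

# A plain command reads tolerably well...
_chk_ 'mkdir -p /tmp/chk_demo'
# ...but a line that has quotes of its own needs a second layer of escaping.
_chk_ 'echo "it'\''s done" > /tmp/chk_demo/log.txt'
```

The second call is the kind of line that worried me: the real command is buried under escaping that exists only to get it through the wrapper.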

Sometimes you have to know when to stop refactoring.