Tuesday, December 21, 2004

Learning From Accidents and a Terrorist Attack

The always insightful Dan Bricklin (co-inventor of the spreadsheet with Bob Frankston) recently wrote an essay analyzing major accidents (e.g., Three Mile Island) and the catastrophic terrorist attack on the WTC. Read the whole thing. But if you don't have time, the summary is excellent advice for anyone who creates systems upon which the population depends, and for software developers especially.

I want to point out one bullet item specifically: the first one in the list below, on instrumentation. Bricklin advocates instrumenting subsystems and components. This ensures that, in addition to errors, any noteworthy events are logged and surfaced to an appropriate level, which allows administrators or monitoring processes to react to changes in the system.

I made a similar point a while back when arguing that developers should strive to use return codes rather than exceptions. The reason? With return codes we can instrument our code, whether it succeeds or fails, in its home venue (the method or function where the event occurs). I know of no easy way to force exceptions into this model.
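
A minimal sketch of the idea, in Python, might look like the following. Everything named here (`Status`, `events`, `record`, `parse_quantity`, `reserve_stock`) is hypothetical and invented for illustration; the in-memory `events` list merely stands in for whatever log or monitoring sink a real system would use.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical return codes for this sketch."""
    OK = 0
    NOT_FOUND = 1
    INVALID = 2


# Assumption: an in-memory list stands in for a real log or monitor.
events = []


def record(component, status, detail=""):
    """Note an outcome, success or failure, and hand the status back."""
    events.append((component, status, detail))
    return status


def parse_quantity(text):
    """Return (status, value) rather than raising on bad input."""
    if not text.strip().isdecimal():
        return record("parse_quantity", Status.INVALID, repr(text)), None
    return record("parse_quantity", Status.OK), int(text)


def reserve_stock(inventory, item, text):
    """Reserve stock; every exit path is instrumented right here."""
    status, quantity = parse_quantity(text)
    if status is not Status.OK:
        return record("reserve_stock", status, "bad quantity")
    if item not in inventory:
        return record("reserve_stock", Status.NOT_FOUND, item)
    inventory[item] -= quantity
    return record("reserve_stock", Status.OK, item)
```

Each function reports its own outcome at the point where it happens, so a successful call leaves as much of a trail as a failed one. A caller checks the returned `Status` and can inspect `events` to see what every component along the way did.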

I am paranoid. I want full instrumentation. It sounds like Dan Bricklin does too.

There are principles that may be gleaned by looking at Normal Accident Theory and the 9/11 Commission Report that are helpful for software development.

This essay covers a wide range of topics. It introduces "Normal Accident Theory", looks at some of the aspects of a major terrorist attack, and proposes some areas for design that are suggested by the results of that attack. The original goal, though, was to come up with some principles that could be applied to making software that fits with the long-term needs of society. Here are some of those principles:

Instrument the sub-systems and components so that failures can be detected and so that behavior can be monitored when there are changes. There is a need to know "what is going on".

Examine failures and share what is found with others so that there is learning.

Try to keep sub-systems loosely coupled, the interfaces understandable, and the intermediate steps comprehensible.

Allow for, and anticipate, improvisation. The design of instrumentation and the coupling of sub-systems can make improvisation easier or harder.

Those who deal with changes may not be the ones for whom the designers planned nor who were pre-trained to deal with those changes. This affects the design of instrumentation, coupling, and documentation.

Generic, "global" resources help and should be able to be used as part of instrumentation and improvisation.

Dan Bricklin: Learning From Accidents and a Terrorist Attack
