Safety Critical Devices
Safer Systems Through Better User Interfaces appeared in Embedded
Systems Programming magazine, and probably summarises my best
writing on the topic of safety.
The World Wide Web Virtual Library page on Safety-Critical Systems
gives a huge range of links to other safety-related resources.
The Ariane 5 explosion as seen by a software engineer is an interesting
insight into the loss of an unmanned rocket, including a discussion
of the formal methods used, and of the unchecked float-to-integer
conversion that caused the explosion.
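The flight software was written in Ada, but the failure mode translates
directly to any language: an unguarded conversion from a wide float to a
narrow integer. Here is a minimal C sketch; the function name and the
40,000 test value are illustrative, not taken from the flight code.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative only: a 64-bit float (e.g. a horizontal velocity)
     * narrowed to a 16-bit signed integer. An out-of-range conversion
     * raised an unhandled exception on Ariane 5; in C it is undefined
     * behavior. Checking the range first turns the overflow into an
     * explicit, recoverable fault. */
    int narrow_to_int16(double value, int16_t *out)
    {
        if (value > (double)INT16_MAX || value < (double)INT16_MIN)
            return -1;              /* out of range: report a fault */
        *out = (int16_t)value;      /* safe: value is in range */
        return 0;
    }

    int main(void)
    {
        int16_t v;
        if (narrow_to_int16(40000.0, &v) != 0)
            printf("value out of range: fault handled, not fatal\n");
        return 0;
    }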
A more detailed discussion
of that incident is given in the ARIANE
5 Flight 501 Failure Report by the Inquiry Board.
Another Space Story.
This insightful article discusses some of the people factors when
a team has to produce ultra-reliable software. Burning the midnight
oil might get a product out fast, but the slow burn is more important
when the product has to be perfect. Read http://www.fastcompany.com/online/06/writestuff.html
for a description of the kind of people you will find writing
software for the space shuttle.
If you are using Commercial
Off-The-Shelf (COTS) software, such as an RTOS or graphics library,
in medical devices that are to be submitted to the FDA, then you
may be interested in a document the FDA has released called Guidance
for Off-The-Shelf Software Use in Medical Devices.
The Therac-25 Medical Device
The Therac-25 was a
cancer irradiation device whose faulty operation led to a number
of deaths. One of the safety features in the original design was
that all of the settings for the device had to be entered through
a terminal as well as on a control panel. This was seen as redundant
by users of a prototype. It was "redundant" in the best sense
of the word, but this was not appreciated by the users, who assumed
that the safety of the equipment was beyond doubt. The design
was changed before release so that the settings could be entered
on the terminal alone. Once the settings were accepted by hitting
the return key, the user was asked to confirm that the settings
were those that were actually required. This confirmation was
performed by pressing the return key again. This extra step was
considered a replacement for the original second interface.
Unfortunately, users
soon learned to press the return key twice in succession, since
they knew that they would always be asked for confirmation. The
two presses, similar to a double-click on a mouse, became a single
action in the mind of the user, and no actual review of the settings
was performed. Due to a bug in the software, some of the settings
were occasionally not properly recorded. The bug was a race
condition that arose because proper resource locking of the shared
data was not used. Since the cross-check of having the settings
entered twice had been removed, the fault went undetected. This
was a case where the design was altered to favor usability, but
the safety of the device was compromised.
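As a rough illustration of that class of bug, here is a C sketch using a
POSIX mutex so the settings are always read and written as a consistent
pair. All the names below are invented for the example; this is not the
Therac-25 code, which predated such conventions.

    #include <pthread.h>

    /* Treatment settings shared between the operator-input task and
     * the task that applies them. Without the lock, a reader can see
     * a half-updated structure: mode changed but dose not yet, or
     * vice versa. The mutex makes each read and write of the pair
     * atomic. */
    struct settings {
        int mode;       /* e.g. electron vs X-ray */
        int dose;       /* requested dose */
    };

    static struct settings current;
    static pthread_mutex_t settings_lock = PTHREAD_MUTEX_INITIALIZER;

    void set_settings(int mode, int dose)
    {
        pthread_mutex_lock(&settings_lock);
        current.mode = mode;    /* both fields updated under one lock, */
        current.dose = dose;    /* so readers never see a mixed state  */
        pthread_mutex_unlock(&settings_lock);
    }

    struct settings get_settings(void)
    {
        pthread_mutex_lock(&settings_lock);
        struct settings copy = current;   /* consistent snapshot */
        pthread_mutex_unlock(&settings_lock);
        return copy;
    }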
It is fair to say that
if the rest of the design had been sound then removing the second
set of inputs would not have been significant, but the whole point
of having a safety infrastructure in place is to allow for the
times when something does go wrong.
Another point to note
from this example is that the later design was also more susceptible
to the simple user error of entering a wrong value. If the user
has to enter the value twice on two different interfaces, the
chance of the same wrong value being entered both times is slim.
The software would have detected the mismatch and not applied
either set of settings.
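A minimal sketch of such a cross-check, assuming two independently
entered values for the same setting; the function and its parameters
are hypothetical.

    #include <stdio.h>

    /* Hypothetical cross-check: a setting is applied only when the
     * values entered on two independent inputs agree, so a slip on
     * one input surfaces as a rejected mismatch rather than a
     * mis-treatment. */
    int accept_setting(double panel_value, double terminal_value)
    {
        if (panel_value != terminal_value) {
            printf("mismatch: %g vs %g - setting rejected\n",
                   panel_value, terminal_value);
            return -1;          /* operator must re-enter both */
        }
        /* ... apply the agreed value ... */
        return 0;
    }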
It is often the case that safety measures serve the dual purpose
of protecting against both device error and user error. In intensive
care medical ventilators the pressure rise in the patient's lung
is a function of the volume of the lung and the volume of gas added.
There is a pressure valve which opens at a fixed pressure limit.
Once the valve is open, an alarm sounds and the patient is exposed
to room air pressure in the fail-safe state. This protects the
patient against an electronic or software fault which might cause
too large a volume to be delivered. It also protects the patient
from a user setting 3.0 liters rather than the intended 0.3 liters.
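A software-side counterpart to that mechanical valve would be a
plausibility check on the entered volume. This sketch uses invented
limits; real bounds depend on the patient and the device.

    #include <stdio.h>

    #define VOLUME_MIN_LITERS 0.05  /* assumed plausibility bounds, */
    #define VOLUME_MAX_LITERS 1.50  /* invented for illustration    */

    /* Reject implausible tidal volumes before delivery, so a typo
     * such as 3.0 liters for 0.3 liters never reaches the patient. */
    int set_tidal_volume(double liters)
    {
        if (liters < VOLUME_MIN_LITERS || liters > VOLUME_MAX_LITERS) {
            printf("volume %.2f L outside plausible range - rejected\n",
                   liters);
            return -1;
        }
        /* ... accept and apply the setting ... */
        return 0;
    }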
This brief description touches on only one aspect of the failures
of the Therac-25. See An Investigation of the Therac-25 Accidents
by Nancy Leveson and Clark S. Turner for a full description of
the accidents and their causes.
Aviation Safety
The
Aviation Safety Reporting System Home Page contains a huge amount
of information related to aviation safety in the United States.
Their newsletter, DirectLine, summaries many of the findings. The
March 1997
issue provides, sometimes amusing, accounts of safety hazards caused
by uncooperative or unfortunate passengers. The NASA Aviation Operations
Branch site contains a number of aviation safety related publications
including "OOPS,
IT DIDN'T ARM." - A CASE STUDY OF TWO AUTOMATION SURPRISES
Comp.Risks
The newsgroup comp.risks
carries a moderated discussion of safety- and risk-related issues.
Areas as diverse as aviation safety, invasion of privacy, computer
viruses and fraud are discussed in a reasonably technically sophisticated
manner. Crucial reading if the systems you are working on may
cause risk to life, money or the fabric of society. Peter Neumann
is the group moderator, and the author of 'Computer-Related Risks'.
Peter's homepage
is also a mine of information on this area.