Human performance

Nov 10

A new way to crash a plane – mode confusion

Computerized ‘fly-by-wire’ commercial aircraft have created a new type of pilot error: mode confusion.

From Lectures in Aviation Safety, by G.F. Marsters, Ph.D., P.Eng.:

The first full authority fly-by-wire civil aircraft, introduced into commercial service by Airbus Industrie, has precipitated myriad new safety issues, primarily related to the way in which the pilot interacts with the aircraft. As new, much larger, and much more fully automated aircraft are planned, these human factors safety issues take on increased urgency. As a result of these complex interactions between the human operator and the machine a new accident “cause” denoted “mode confusion” has emerged, as it becomes apparent that pilots often do not fully understand the logic systems that were introduced to reduce pilot workload and propensity for error. Since about two thirds of all accidents are attributable to human error, reducing the opportunity for such errors is an essential step. Regardless of the level of complexity, the pilot must still ultimately control the aircraft, and to do that, it is essential to have a full appreciation for the way in which the automated systems function and interact. There have been several recent accidents where it is clear that the pilots were unsure of what the aircraft systems were doing, and as a result, took incorrect corrective actions. The accident at Nagoya, Japan is an example of this. In this case, the pilots engaged in actions that contradicted the logic of the autopilot. Not realizing the effects of their efforts, they were unable to take the correct actions, and ultimately lost control of the aircraft. This was a clear case of “mode confusion”.

Mode confusion has caused many accidents in the aviation, nuclear and petrochemical industries.

A similar everyday experience is drivers’ reaction to antilock braking. Ever hit the brakes on a slippery stretch and felt a moment of panic when the brake pedal began chattering? Some drivers unfamiliar with the feel of ABS react by taking their foot off the brake, thinking they’ve pushed too hard and broken something. This very simple form of mode confusion has been the cause of many traffic accidents.

To prevent mode confusion:

  • System users need to have a greater understanding of the logic and automation in the system, and how it can behave in off-nominal situations. (For example, drivers should get used to their antilock brakes.)
  • System designers need to ensure that system interfaces provide cues that notify or remind users about automated actions and states.
  • Importantly, interfaces should not provide feedback that could be misinterpreted during critical moments. (The pulsing of the ABS should not be transmitted back to the driver through the brake pedal, as the notification that ABS is engaged is not useful to the driver.)
  • There should be defence-in-depth within the automation to identify and respond to user errors made in reaction to the automation itself.

Mode confusion will become a greater challenge as more and more systems (even in daily life) develop greater levels of complexity, automation and intelligence. Mode confusion can also occur when complex, automated systems interact in unexpected ways.

The greatest risk arises when personnel are required to perform both active tasks (doing something to the system to control it) and supervisory tasks (monitoring the system’s automated behaviour). System designers and users must be aware of this risk when dealing with system components that are both automated and autonomous (and thus able to interact in ways that create greater complexity).

Dec 08

Out-of-the-box learning

In 1993 I wrote a book for Enterprise Planning Systems in Ottawa to show manufacturing companies how to use EPS’s software to improve business and industrial processes and optimize manufacturing performance. This seemed at first like a very dry topic, until a co-worker loaned me a novel called “The Goal”, by Eliyahu Goldratt.

The Goal was actually a textbook on systems management and business process engineering, written as paperback fiction. It told the story of a company executive suddenly faced with the imminent failure of his business … unless he could transform his manufacturing operation and achieve profitability. There were jobs at stake, a career on the line, and from what I remember even a failing marriage. The protagonist’s arc as he fought through problems, failed, experimented and finally succeeded actually served as the backbone for a stealthy course in systems engineering.

The use of the fiction genre to deliver training was both bizarre and effective. I read the 300-odd pages in a weekend, and at the end of it I didn’t just know more about systems science than I ever wanted to, I actually, intuitively understood it.

Unfortunately my own book was less inspiring. My brief was to produce a manual: a how-to guide that let people figure out very quickly how to do specific tasks using EPS’s software. It wasn’t flowery, and it wasn’t complicated. It became, however, one of the pieces I’m most proud of in my own portfolio. It was simple without being simplistic, balanced yet comprehensive, and profoundly effective. I think part of what allowed me to get such a good result was the intuitive understanding of the domain I gained from Goldratt’s work.

The lesson for me was that effective knowledge transfer doesn’t always come in conventionally packaged forms. Anything that gets the learner to engage with the knowledge-set is worth trying, especially if that engagement happens through a range of types of cognition (including emotional ones).

A more recent example of out-of-the-box learning is the stage play Charlie Victor Romeo (see its Wikipedia entry). CVR dramatizes a series of aviation accidents by re-enacting the conversations pilots had in the cockpit during those accidents. (The scripts are taken directly from transcripts of the accident cockpit voice recorder, or CVR, tapes.) The play is emotionally compelling, but it’s also an incredibly effective training tool that provides a crash course in human performance and crew resource management. (CVR has, in fact, been used by the US military for pilot training.)

Learners need to be captured and inspired — if not by the content itself, then by the medium used to deliver it.

Aug 08

The man who saved the world

On September 26, 1983, a lowly lieutenant colonel in the Soviet Air Defence Forces single-handedly prevented the accidental launch of World War III. Stanislav Petrov’s story first emerged in the 1990s after the fall of the Soviet Union, but it’s only recently received any real attention. (Wired wrote an article on him last fall.) Petrov quite literally saved the world.

I won’t get into the whole story. (Wikipedia does it justice.) In a nutshell, Petrov was on duty when an alert came in from the Soviet satellite-based launch detection system indicating a US first strike had been launched. Petrov ignored protocol and decided that the alert had to be in error despite the fact that alert confidence was high. Petrov chose not to report the alert up the chain of command to the Soviet leadership, who at that time were in such a state of paranoia that they expected a first strike and were likely to launch on warning. It’s felt by many that this is about the closest we ever came to nuclear holocaust.

This is interesting from a human performance perspective because Petrov’s decision was the result of knowledge-based performance when he should have been in a rule-based mode. Instead of simply picking up the phone as he was trained and obligated to do, he stopped, thought, acted (or in this case didn’t) and reviewed. In effect, he was following STAR (Stop, Think, Act, Review). Petrov put aside protocol and applied a ‘questioning attitude’. He analyzed the situation and determined he was ‘out of procedure’ (OOP) based on the premise that a first strike was unlikely to take the form of a single missile. This initiative (and, for a Soviet missile command officer ignoring nuclear combat protocol, this courage) allowed the time needed to review the situation and determine that the alert was in error. (In fact, it had been caused by sunlight reflecting off clouds.)

The issue of personnel operating in a knowledge-based mode when they should be in a rule-based mode is a complex one. This example clearly demonstrates the value of allowing a task performer to use knowledge-based decision-making to recognize when a rule-based mode is no longer appropriate. However, operating in a knowledge-based mode can also introduce its own risks.

The knowledge requirements for a worker operating in knowledge-based mode are much higher than those needed to operate in rule-based mode. If a worker without that knowledge is allowed to step out of rule-based mode (that is, to improvise), the results can be unexpected. This is a problem if the organizational system and processes have been designed to allow less knowledgeable workers to perform tasks (the monkey grinder approach).

Different industries have tackled this issue in different ways.

In the traditional telecoms industry in most of the western world, craftspeople working on systems in rule- or skill-based mode were almost universally overqualified. A craftsperson was expected to recognize when they were OOP and was further empowered to switch to a knowledge-based troubleshooting or problem-solving mode as soon as it became necessary.

This was possible in part because of the incredible redundancy built into modern switching systems and in part because of the availability of a highly trained, skilled and disciplined workforce.

This approach was necessary because of the time-sensitivity of telecommunications problems — you can’t just turn off a switch and walk away, because people still keep trying to make calls — and because of the level of complexity in modern telecommunications systems. (By the early 1990s, a DMS switching station had more than ten million lines of code and an enormous range of failure modes.) There’s no such thing as rendering a phone system safe. The people pulling line cards and responding to switch alarms had to be able to quickly characterize and respond to complex problems, both within procedures in rule-based mode, and outside procedures in knowledge-based mode.

In contrast, in the nuclear power industry workers are very strongly discouraged from moving to a knowledge-based performance mode without substantial oversight and process overheads. In keeping with this culture, many performers doing rule-based work on nuclear systems lack the knowledge needed to work in a knowledge-based mode on those same systems.

In other words, the worker may know what switches to throw and what warning lights to look for, but may have little or no understanding of the underlying system they are manipulating.

The worker relies on the rigor and conservative design of the procedure to identify when the work is OOP. Once OOP, work stops altogether and another worker (typically a system specialist with extensive systems knowledge) is brought in to continue in a knowledge-based mode.

This approach works because the systems are robust and workers typically stay within procedure. It also works because the system is designed to fail safe and to be easily rendered safe (removing the urgency of moving to a knowledge-based mode), and because the technology itself is relatively primitive. Workers are unlikely to encounter complex or ambiguous failure modes that cannot be easily recognized as OOP. (However, when they do, the results can be profound, as happened at Three Mile Island.)

This approach is necessary because of the potential for extreme outcomes if workers move into a knowledge-based mode without the necessary knowledge. As importantly, it is necessary because of the required regulatory oversight within the nuclear industry. It’s essential that workers stay within procedure because procedures are binding not only within the job function or organization, but also with the regulator. Procedures are crafted with enormous care, requiring the approval of system specialists, management and in many cases regulator representatives. Going out of procedure is thus exceptional. Within the nuclear power industry, procedural adherence is paramount.

Coming back to the example of Petrov, a duty officer at a nuclear command centre — at least, within the Soviet command system — didn’t need to have an understanding of nuclear combat strategies because they weren’t expected to make those types of evaluations. Petrov’s job was to respond to launch alerts from the automated monitoring systems and report those alerts to the Soviet leadership. Hear a bell, read data off a screen, and pick up a phone.

More than that, the requirements regarding procedural adherence in a nuclear combat environment would have been far more extreme than even those of the nuclear power industry. The expectation to stay within procedure — to follow orders — is absolute. The potential outcomes are almost unimaginable. Both the demand for compliance and the need to avoid an error, once Petrov recognized the likelihood of that error, are truly exceptional. Petrov had reason to fear the repercussions of ignoring that protocol — especially that protocol — in Soviet Russia. He could also have been wrong and facilitated a US first strike on his country.

In this case, it isn’t obvious whether Petrov’s understanding of the situation was simply fortuitous, or whether there was some unrecognized defence-in-depth within the system (because, perhaps, duty officers performing that ‘monkey grinder’ role were intentionally overqualified).

The lesson in terms of human performance is that there can be enormous value in making sure that even workers expected to perform only in rule-based mode have the knowledge and understanding needed to recognize and react to OOP conditions. Systems and procedures aren’t always rigorous enough, and there are times, despite our best efforts, when the only thing preventing catastrophe is knowledge, understanding and confidence.

I’ll leave you with this last thought … where were you on September 26, 1983?