Fault Tolerance Versus Fault Containment

When you design a fly-by-wire system (FBW) and hook it up to a central nervous system and then plug in a brain, it’s prudent to ensure that brain failure or dementia will not cause system “death”, paralysis or Parkinson’s Disease.

The crew of 9M-MRG, a MAS 777-200 departing Perth, Western Australia, on Aug. 1, 2005, didn’t know what to think. The book had failed them.

The designers of their fault tolerant and fault-contained flight control system had failed to consider a particular latent failure mode. When the aircraft ran amok at Flight Level 410 and flight control took on a mind of its own, the crew had no checklist to turn to.

The designers had considered this failure mode “impossible”. However, despite ongoing flight control upsets and largely thanks to a non-mandatory intermediating tertiary system, sufficient control was regained to return to Perth. How wild was their ride and why?

During their climb through flight level 380, the crew noted a LOW AIRSPEED advisory on the aircraft’s Engine Indication and Crew Alerting System (EICAS). Simultaneously, the aircraft’s slip/skid indication on the Primary Flight Display (PFD) deflected to full right.

The PFD airspeed display showed incongruously that the aircraft was approaching both the overspeed limit and the stalling speed. The aircraft pitched 18° nose-up and climbed at 10,560ft/min to approximately FL410 and the indicated airspeed decreased from 270 kts to 158 kts.

The stall warning and stick shaker devices also intervened. The excursions continued, with snap accelerations as large as minus 2.3g and plus 3.1g achieved over the space of 0.5 secs.

This went on for some minutes until the pilot achieved a semblance of control. A 4.4MB zipped animation is available at tinyurl.com/yuecud (requiring a free player from tinyurl.com/2ef2bo).

On the ground, the flight data recorder (FDR), cockpit voice recorder and the air data inertial reference unit (ADIRU) were removed for downloading. The FDR recorded unusual accelerations around all three axes. (The ADIRU’s internal history showed that one of its six accelerometers had failed at the time of the occurrence but that another accelerometer had quietly failed, and been excluded from contention, back in June 2001.)

That was the one that would come back to haunt, but there had been other transparently silent failures within the 7 fault containment areas of the ADIRU. Processor #2 failed in Nov. 2004 and gyro #1 on May 30, 2005, but overall it was a case of “Move along, nothing to see here”. Redundancy was still intact. In such a system, failures are both expected and allowed.

The degree of redundancy, fault tolerance and fault containment built into the ADIRU was such that a process of graceful degradation was permitted that would be transparent to the flight-crew. “What the mind don’t see, the heart don’t grieve over” is the philosophy.

However, all faults are faithfully logged internally. The logs aren’t analyzed until particular maintenance messages (MM’s) are generated or the crew sees an ADIRU Status Light on the EICAS (which they did, just before the upset). All it was ever meant to mean was “ADIRU is faulted below normal certification requirements. The next ADIRU failure can cause it to shut down”. That proved not to be the case.

Because of the fail-safe stature accorded the system, such a status message would have normally required maintenance action to replace the ADIRU within three days of the message. However, up to that point the software hierarchy, based on internal system redundancy, had not considered the degraded condition of the ADIRU sufficient to generate either an MM or the EICAS status message.

The status message was generated only when the second accelerometer failed passing FL380. Because the designers had counted on a standard inerted (i.e., “fail to zero”) style of unserviceability, neither they nor the programmers had countenanced a failure in which the sensor’s output voltage went erratically high instead.

Due to a flawed algorithm, the Aug. 1, 2005 failure of the #6 accelerometer also allowed the June 2001 rejected #5 accelerometer back into the game. Freddy Krueger was back. His spastically generated outputs created the impromptu roller-coaster ride. It could have been worse, except for a lucky coincidence: 9M-MRG’s SAARU was up and running and had a say in things.

On August 9, 2005, the aircraft manufacturer issued a Multi Operators Message to all B777 operators, recommending that they not dispatch an aircraft with an inoperative secondary attitude air data reference unit (SAARU), which had previously been permitted under the conditions of the Master Minimum Equipment List.

This was accomplished so quickly, in 7 days, because the SAARU had been the intermediating tertiary system mentioned earlier, and luckily it was doing its duty. The SAARU provides an independent back-up source of attitude, heading and air data, but it also serves another vital role.

Thanks to the early design inclusion of a mid-value select (MVS) modality within the primary flight computer (PFC), the SAARU’s output was available to be averaged into (and thus mitigate) the #5 accelerometer’s wild radicalism. The MVS function had been included in the primary flight computer’s initial design to moderate the effect of any anomalous outputs from the ADIRU.

Analysis and testing during initial development indicated that these theorized outputs could not occur, and the MVS function was deemed no longer necessary. However, luckily a default decision was made by the aircraft manufacturer to retain the MVS function in the PFC (easier to leave it in than to take it out).
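To see why a mid-value select can tame one wild input, here is a minimal sketch in Python. It assumes three redundant readings of the same quantity. The function name, the source labels and the numbers are all hypothetical, and nothing here is taken from the actual PFC software. As the name suggests, a mid-value select picks the middle of the three values rather than blending them, so a single runaway source cannot drag the result along with it.

```python
def mid_value_select(a: float, b: float, c: float) -> float:
    """Return the middle of three redundant readings.

    One arbitrarily wrong input can never be the chosen value unless
    a second input is wrong in the same direction.
    """
    return sorted((a, b, c))[1]


def simple_average(a: float, b: float, c: float) -> float:
    """Plain mean, shown only for comparison with the mid-value select."""
    return (a + b + c) / 3.0


# Hypothetical vertical-acceleration readings in g (illustrative values only).
runaway_source = 3.1    # a source fed by an erratic accelerometer
healthy_source = 1.02   # a second, healthy source
backup_source = 1.0     # an independent back-up source

selected = mid_value_select(runaway_source, healthy_source, backup_source)
averaged = simple_average(runaway_source, healthy_source, backup_source)

print(f"mid-value select: {selected:.2f} g")
print(f"simple average:   {averaged:.2f} g")
```

In this toy case the mid-value select returns the healthy 1.02 g reading, while a simple average would have been pulled well above it by the runaway source.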

The ADIRU’s software had been tested and certified to the standard required at the time of certification. However, under that standard, known as DO-178B, that testing was limited to the original specification and requirements of the component. Whatever’s not specified just doesn’t get either tested or accommodated (see FMEA comment later).

The crew faced the combined effects of a hardware defect as accentuated by a software anomaly, for which there was no procedure. Investigators found that the software had allowed inputs from a disaffiliated faulty accelerometer to be processed by the ADIRU and used by the primary flight computer, autopilot and other aircraft systems. “Other aircraft systems?” We’ve mentioned the wild ride at FL410, but what were the other symptoms later experienced by the crew — and why?

After disconnecting the autopilot and lowering the nose while the First Officer punched out a Mayday call, the captain experienced a further series of pitch-ups and zoom climbs, with the auto-throttles adding thrust despite being cancelled and even though the thrust levers were being manually returned to idle.

Later, on descent passing FL200, although the PFD read-out was normal, engagement of either left or right autopilots would cause a rapid roll and a nose pitch-down. In spite of the pilot continually pressing the autothrottle disconnect switches on the thrust levers, the auto-throttles would also intervene.

Because of the duff speed values being output, below 3000ft the crew received spurious windshear alerts and, on the landing roll-out, manual foot-braking would not deselect the pre-selected AutoBrake 3 setting.

The autothrottle glitch was a “gotcha” that came packaged with Freddy’s reappearance. Each engine has an auto-throttle ARM switch on the glareshield and an engage push-button on the center panel. As long as the system is armed, the autothrottle can kick in and respond to what is seen as an inappropriate speed, in this case the invalid outputs from the failed ADIRU.

All features of the ADIRU navigation software releases had been checked, but none of the tests had factored in exactly the elements of the occurrence: an accelerometer failure resulting in high value output, followed by an On/Off power cycle, followed by a second large-magnitude accelerometer failure, while maintaining the large value on the first accelerometer.

The increased use of automation to manage internal hardware failures was designed into the 777 to reduce the workload of the flight crew, by reducing the number of checklists that required attention in the event of a non-normal situation. But when this particular hardware failure occurred, combined with the software anomaly, the crew was faced with an unexpectedly complex situation that had never been foreseen.

The status of the failed #5 accelerometer unit had been recorded in the on-board maintenance computer memory since June 2001, but that memory was not being checked by the ADIRU’s Fault Detection and Isolation (FDI) software during the start-up initialization sequence. That ominous software error had not been detected during the original certification of the ADIRU and was present in all versions of the software.
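The start-up oversight described above can be pictured with a deliberately simplified Python sketch. It is not the ADIRU’s actual logic, which this article does not detail. The class, its fields and the `consult_history` flag are hypothetical names invented for illustration. The only assumption is that a failure is recorded in memory that survives a power cycle, while the working exclusion list is rebuilt at each start-up.

```python
from dataclasses import dataclass, field


@dataclass
class FaultIsolation:
    """Toy fault-isolation state for a set of redundant sensors."""

    sensor_ids: tuple
    # Hypothetical non-volatile record: survives power cycles.
    persistent_failed: set = field(default_factory=set)
    # Working exclusion list: rebuilt at every start-up.
    excluded: set = field(default_factory=set)

    def mark_failed(self, sensor_id: int) -> None:
        """Record a failure both persistently and in the working list."""
        self.persistent_failed.add(sensor_id)
        self.excluded.add(sensor_id)

    def power_cycle(self, consult_history: bool) -> None:
        """Start-up initialization.

        With consult_history=False the persistent record is ignored,
        so a previously rejected sensor silently becomes usable again.
        """
        if consult_history:
            self.excluded = set(self.persistent_failed)
        else:
            self.excluded = set()

    def usable(self) -> list:
        """Sensors currently eligible to contribute to the output."""
        return [s for s in self.sensor_ids if s not in self.excluded]


fdi = FaultIsolation(sensor_ids=(1, 2, 3, 4, 5, 6))
fdi.mark_failed(5)                       # earlier failure, logged persistently

fdi.power_cycle(consult_history=False)   # start-up that skips the log
print("history ignored:  ", fdi.usable())

fdi.power_cycle(consult_history=True)    # start-up that checks the log
print("history consulted:", fdi.usable())
```

In the sketch, the start-up that skips the stored record puts sensor 5 back among the usable set, while the start-up that reads it keeps sensor 5 out. The real sequence was more involved, since the rejected accelerometer only re-entered the computation when the second one failed.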

Due diligence was overdue. Until Failure Modes and Effects Analysis (FMEA) of hardware and firmware is properly integrated with software development and its periodic changes, similar alarming automation failures will continue to crop up. And continue to do so in flight control systems? That’s FBW heresy.

Software validation and verification isn’t the same as FMEA. You can put the purest jet fuel in a serviceable aircraft, but without the fuel-pump lubricant and anti-icing additives, you’re still halfway to a hiccup. For the Airbus version of a similar software versus hardware redundancy glitch, see the G-VATL incident in the April 4, 2005 issue of ASW at tinyurl.com/299kdf.

All other Boeing types have a checklist for “unreliable airspeed”. However, because the ADIRU was designed specifically to eliminate that type of malfunction, no such checklist or procedure was provided for 777 crews.

If multiple erroneous ADIRU or source failures occur, then (we’re told) the EICAS message NAV AIR DATA SYS would show up. The nature of the failure in this case was such that it couldn’t trigger that message. The captain is left with a gremlin’s homily (see tinyurl.com/2cp77f). Yet despite the startle factor, rapid attitude changes, “g” and objectionable auto-throttle behavior, the MAS crew kept on top of the fact that, dud automation be damned, attitude is everything.