A Modern Icarus — the short story of Ariane 5 Flight 501

On June 4, 1996 in Kourou, French Guiana, the maiden flight 501 of the Ariane 5 rocket ended almost as soon as it began. About 37 seconds after the initial launch sequence (30 seconds after takeoff), at an altitude of 4000 meters, the rocket deviated 90 degrees from its intended flight path due to a software failure, experienced severe aerodynamic stress tearing its boosters from the main stage, and thereby triggered a controlled self-destruction that culminated in the spacecraft exploding in a fireball of liquid hydrogen.

The European Space Agency (ESA) had ambitions to take a leadership role in the commercial space business and surpass Japan, Russia, and the USA. The Ariane 4 (A4) had been in service for more than 20 years and boasted an excellent record of more than 100 successful launches with no failures. The new Ariane 5 (A5) rocket would carry larger satellite payloads than earlier versions, and flight 501 was carrying a payload of four satellites intended for researching the Earth’s magnetosphere. ESA had spent 10 years and $7 billion developing the A5, and flight 501 itself cost $370 million. The success of the Ariane 4 and ESA budget pressures resulted in the reuse of A4 software by the A5 program team including its navigation system and flight path optimization libraries.

ESA organized an inquiry board immediately after the crash to investigate the disaster and using flight data, optical observations (IR camera, film), inspection of recovered material, and review of the software code, the board identified the following sequence of events that resulted in the crash.

  • At T + 37 seconds, the Internal Reference System (SRI) that measured the rocket’s spatial attitude (altitude, motion, and position) sent incorrect data to the Flight Control System instead of the actual flight data because an arithmetic overflow occurred inside its alignment function when a 64-bit floating point number for the Horizontal Basis variable (BH) could not be cast and converted to a signed 16-bit integer. The SRI in the A5 was reused as a black box from the A4. Furthermore, the BH value was higher than expected because the early part of the A5 trajectory differed from the A4 and resulted in higher horizontal velocity values (five times as much). This error did not occur earlier in the flight because the vehicle speed was lower and the calculated values were small enough to fit into the program’s data types. Since there was no exception handling in the Ada code for this alignment function, the operand error bubbled up and the SRI component entered a failed state and returned a diagnostic value, intended for debugging purposes only.
  • The backup SRI, identical in hardware and software to the active SRI, could not be activated because it had failed for the same reason.
  • At T + 38 seconds, the On-Board Computer that executed the flight program then commanded course corrections as a result of the incorrect SRI data, and the rocket’s nozzles were deflected changing the flight path at an extreme angle.
  • At T + 39 seconds, high aerodynamic forces led to separation of the boosters from the main stage and triggered the self-destruct subsystem.
Ariane 5 Flight 501 @ T+ 39 seconds

The inquiry board further analyzed the SRI software and overall A5 program and arrived at several conclusions:

  • The SRI alignment function was used to perform ground-based alignment of the inertial platform prior to lift-off (around T-3 seconds) and once the rocket took off the alignment function would not serve a purpose. It was left running in the A5 for the first 50 seconds as a “special feature” in case the system needed to be restarted in the event of a brief pause in the countdown before lift off; such resets could take hours in the A4 and this would speed up the process.
  • The SRI code had been analyzed for exceptions by the A4 team, and seven variables were deemed at risk of operand error. Since a maximum SRI CPU utilization target of 80% had been set for the A4, only four variables were protected and three were left unprotected, including the BH variable. The original reason given for this decision was that “they were physically limited or there was a large margin of safety.” However, the CPU constraint only applied to the A4 — not the A5, and the SRI code was never re-analyzed by the A5 team using realistic A5 input data.
  • The operand error in the SRI was not sufficient for system failure. The specification and design of the SRI exception handling mechanism also contributed to the failure. In the event of any exception, the system specification stated the failure should be indicated on the data bus, the failure context should stored in an EEPROM memory, and finally the SRI processor should be shut down. The reason behind this acute action was the engineering culture within the Ariane program focused on hardware failure instead of software failures since the former occurred more often than the latter. A rational approach to handling random hardware failures is to shutdown active systems and switch to the backup. However, a better approach in this software fault scenario would have been to provide best-effort estimates of the required altitude, position, and velocity. The inquiry board wrote that “software should be assumed faulty until applying the currently accepted best practice methods can demonstrate that it is correct.”
  • While there was some unit testing and integration testing with A4 data, neither end-to-end, integration testing with hardware and software nor test simulations with realistic data from the A5 trajectory data were ever performed. Post-501 flight simulations running the SRI software in the context of actual 501 trajectory data reproduced the chain of events leading to the SRI failure.

read original article here