Software that depends heavily on numeric computation is highly susceptible to subtle errors. This susceptibility stems from the imperfect way we are forced to model real-number arithmetic on a computer. Here, we will examine two relatively simple numeric errors that led to the failure of complex software, resulting in the loss of many millions of dollars in one case and perhaps contributing to the loss of lives in another.
On the morning of June 4, 1996, a French Ariane 5 rocket, carrying a European Space Agency (ESA) satellite, was scheduled for its first launch from French Guiana. About 37 seconds into its flight, the rocket veered off its flight path, broke up, and exploded. A board of inquiry was immediately appointed by the ESA and CNES (Centre National d'Études Spatiales).
The investigation examined telemetry data received on the ground through 42 seconds after liftoff, trajectory data from radar stations, observations from infrared cameras, and recovered debris. The origin of the failure was soon narrowed to the flight control system. Fortunately, the two primary computer-controlled inertial reference subsystems were both recovered, and after further investigation the source of the failure was determined to be a software error within these units.
Specifically, the error was traced to software controlling the alignment of the rocket's strap-down inertial platform. An integer overflow error occurred when the program attempted to convert a 64-bit floating-point number to a 16-bit integer. The floating-point number, which measured a quantity related to the horizontal velocity of the platform, was simply too large to be represented as a 16-bit integer. The Ariane 5 software had been derived from the software for the previous-generation Ariane 4 rocket. The Ariane 4 had a different initial trajectory that produced smaller horizontal velocity values. Hence, the larger values recorded during the Ariane 5 flight were outside the range that the software was designed to handle.
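To make the overflow concrete, the following is a minimal Python sketch of a narrowing conversion from a floating-point value to a 16-bit signed integer. It is an illustration only, not the flight software: the function names `to_int16_checked` and `to_int16_saturating` and the sample readings are hypothetical, chosen simply to show one value that fits and one that does not.

```python
INT16_MIN = -2**15       # -32768, smallest 16-bit signed integer
INT16_MAX = 2**15 - 1    #  32767, largest 16-bit signed integer


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    truncated = int(value)  # drop the fractional part
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return truncated


def to_int16_saturating(value: float) -> int:
    """Alternative policy: clamp to the representable range instead of failing."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


if __name__ == "__main__":
    # Hypothetical readings, not actual flight data.
    for reading in (20000.0, 40000.0):
        try:
            print(reading, "->", to_int16_checked(reading))
        except OverflowError as exc:
            print("conversion failed:", exc)
            print("clamped value would be", to_int16_saturating(reading))
```

The checked version fails loudly on an out-of-range value, while the saturating version trades accuracy for continued operation. Which policy is appropriate depends on the system, and either choice has to be made deliberately for the range of values the software will actually see.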
The overflow error caused the computer in the primary inertial reference system to shut down and attempt to switch to the backup (redundant) system. Unfortunately, the redundant system had experienced exactly the same fault and had already shut itself down when the primary computer attempted to transfer control to it.
Had this problem been identified in preflight software testing, it could have been corrected very easily. Such is the case for many software errors. The smallest of problems can cause a program to crash or shut down. This is one of the reasons why it is so much harder to engineer reliable software than to engineer reliable bridges and other complex physical structures. An analogy might be that the failure of one very small bolt in a bridge could cause the entire structure to collapse: possible, perhaps, but highly unlikely. Yet the failure of "small bolts" often causes software systems to collapse. Our next example provides further demonstration of this fact.
The American public was delighted to hear the Pentagon's estimates that approximately 90 percent of all Iraqi Scud missiles were being intercepted by the Patriot antimissile defense system during the Persian Gulf War in 1991. In the months following the end of the war, however, rumors about the Patriot's ineffectiveness began to surface. One critical failure of the system had already been observed during the war when a Scud missile hit a U.S. military barracks in Dhahran, Saudi Arabia, killing 28 U.S. soldiers.
A careful analysis after the war ended revealed that the earlier estimates of very high Patriot hit rates had been hastily constructed on the basis of insufficient data and that they were, in fact, inaccurate. An article in the February 15, 1992, issue of the New Scientist reported on findings by MIT professor Ted Postol, who reexamined the Patriot's war record at the request of a Congressional committee. Postol's basic conclusion was that Patriot missiles missed many of the Iraqi missiles that the United States thought had been shot down during the Gulf War, and that deploying the Patriot antimissile defense system did not reduce damage during Iraq's missile attacks on Israel and Saudi Arabia. One reason cited was that Iraq's modified Scud missile, called the Al-Husayn, was difficult to hit because it was so unstable that it broke into pieces when it reentered the atmosphere, creating a confusing barrage of debris. Although the debate about its effectiveness continues, most observers believe the Patriot hit rate was closer to 10 percent than to 90 percent.
In late March 1992, the U.S. General Accounting Office delivered its report to Congress on the Patriot's problems. It identified a software error in numeric calculations as the primary cause of the failure of the Patriot system. The report's own language provides a very succinct description of the problem, and so we quote a portion of it here.
The [system's] prediction of where the Scud will next appear is a function of the Scud's known velocity and the time of the last radar detection. Velocity is a real number that can be expressed as a whole number and a decimal (e.g., 3750.2563 . . . miles per hour). Time is kept continuously by the system's internal clock in tenths of seconds but is expressed as an integer or whole number (e.g., 32, 33, 34 . . .). The longer the system has been running, the larger the number representing time. To predict where the Scud will next appear, both time and velocity must be expressed as real numbers. Because of the way the Patriot computer performs its calculations and the fact that its registers are only 24 bits long, the conversion of time from an integer to a real number cannot be any more precise than 24 bits. This conversion results in a loss of precision causing a less accurate time calculation. The effect of this inaccuracy on the [system's] calculation is directly proportional to the target's velocity and the length that the system has been running. Consequently, performing the conversion after the Patriot computer system has been running continuously for extended periods causes the [system's estimated Scud position] to shift away from the center of the target, making it less likely that the target will be successfully intercepted. [Excerpted from Report to the Chairman, Subcommittee on Investigations and Oversight, Committee on Science, Space, and Technology, House of Representatives: Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia, March 1992.]
The error occurred when translating between integer and floating-point number formats. It could become quite significant when the system ran for long periods without being reset. For example, after 100 hours of continuous operation, the error in the estimated position of the target is almost 1/3 of a mile. We should note that the Patriot software error had been discovered before the Dhahran barracks attack, and a software "patch" had been devised and shipped. Unfortunately, it arrived in Saudi Arabia the day after the attack.
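The arithmetic behind this drift can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a reconstruction of the Patriot code: it assumes the clock's tick count is converted to seconds by multiplying by a value of 1/10 that has been truncated to 23 binary fractional digits (a common way of modeling the 24-bit register limit mentioned in the report), and it borrows the report's example velocity of roughly 3750 miles per hour. The function names are hypothetical.

```python
from fractions import Fraction

FRACTION_BITS = 23        # assumption: fractional bits kept for the constant 1/10
TICKS_PER_HOUR = 36_000   # the clock counts tenths of a second


def chopped_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """Return 1/10 truncated (not rounded) to `bits` binary fractional digits."""
    scale = 2 ** bits
    return Fraction(int(Fraction(1, 10) * scale), scale)


def clock_drift_seconds(hours: float, bits: int = FRACTION_BITS) -> float:
    """Accumulated timing error after `hours` of continuous operation."""
    ticks = int(hours * TICKS_PER_HOUR)
    per_tick_error = Fraction(1, 10) - chopped_tenth(bits)  # exact, via Fraction
    return float(ticks * per_tick_error)


def position_error_miles(hours: float, speed_mph: float,
                         bits: int = FRACTION_BITS) -> float:
    """Distance the target travels during the accumulated timing error."""
    return speed_mph * clock_drift_seconds(hours, bits) / 3600.0


if __name__ == "__main__":
    for hours in (1, 8, 100):
        drift = clock_drift_seconds(hours)
        miss = position_error_miles(hours, 3750.0)  # example velocity from the report
        print(f"{hours:>3} h: clock off by {drift:.4f} s, position off by {miss:.3f} mi")
```

Under these assumptions, the per-tick error is tiny (about one ten-millionth of a second), but it is multiplied by 3.6 million ticks over 100 hours, giving a clock error of roughly a third of a second and a position error on the order of a third of a mile. Because the error grows linearly with uptime, it is negligible shortly after a reset.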