April 21, 2020

The Pentium Bug, Revisited

The jokes still circulate on the Internet:

Q: How many Pentium designers does it take to screw in a light bulb?
A: 1.99995827903, but that's close enough for nontechnical people.
A: They aren't certain. Every time they do the math, they get a different answer.

At Intel, quality is job 0.999999998.

Q: What do you get when you cross a Pentium PC with a research grant?
A: A mad scientist.

The incident that triggered this barrage of ridicule occurred in the fall of 1994, a few months after Intel had introduced its Pentium microprocessor. The furor started with an e-mail message from Thomas R. Nicely, a mathematician at Lynchburg College in Virginia, who was doing computations related to the distribution of prime numbers. Nicely pointed out that the chip gave incorrect answers to certain floating-point (decimal) division calculations. Other users quickly confirmed the problem and identified additional examples in which errors occurred.

Here's one of the more dramatic instances:
x = 4195835, y = 3145727, and z = x − (x/y)*y.

Computed exactly, the answer, z, should be zero. The Pentium chip gave 256 as the answer.
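
Anyone can reproduce the check on a machine with a correct divider. The short Python sketch below (just a verification of the arithmetic, nothing Pentium-specific) evaluates the expression in ordinary floating point and again in exact rational arithmetic:

    from fractions import Fraction

    x, y = 4195835, 3145727

    # On a correct floating-point divider this is zero, apart from ordinary
    # rounding error; a flawed Pentium returned 256 here.
    print(x - (x / y) * y)

    # Exact rational arithmetic confirms that the true value of z is 0.
    print(x - Fraction(x, y) * y)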

Typically, the flawed chip generated a slightly inaccurate answer for only a few pairs of numbers. For example, dividing 5505001 by 294911 produced 18.66600093 instead of 18.66665197. Multiplying (or dividing) either number of the pair by any power of 2 also gave an incorrect answer. Intel claimed that only one in nine billion division operations would exhibit reduced precision.
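
Both claims are easy to check on a correct machine. The sketch below (ordinary Python, not anything specific to the flawed hardware) computes the quotient exactly and then rescales one operand by a power of 2; the scaling changes only the exponent of the result, leaving the significand bits that feed the division hardware unchanged, which is why the error survived such scaling.

    from fractions import Fraction

    q = Fraction(5505001, 294911)
    print(float(q))                    # 18.66665197... (the correct quotient)

    # Scaling either operand by a power of 2 changes only the exponent of the
    # result; undoing the scaling reproduces exactly the same digits.
    print(float(Fraction(5505001 * 2**10, 294911)) / 2**10)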

So, the probability of randomly coming across the affected numbers was very small. But computations don't necessarily involve random selections of numbers, and such errors can matter in scientific and engineering applications and even in some spreadsheet calculations.

The fault itself occurred because five entries were omitted from a table of 1,066 values (part of the chip's circuitry) used by a speedy algorithm known as SRT division, after the initials of the three researchers who independently introduced the scheme more than 30 years ago. The five cells should have contained the constant +2, but because the cells were empty, the processor fetched a zero instead. When that happened, the calculation was thrown off, yielding a number slightly less accurate than the correct answer.

At risk were the binary divisors beginning with 1.0001, 1.0100, 1.0111, 1.1010, and 1.1101. These particular bit patterns had to be followed by a string of six 1s to boost the probability of error.
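
Those patterns make the at-risk divisors easy to screen for. The helper below is my own illustration, based only on the bit patterns quoted above (the names and the ten-bit screen are assumptions, not Intel's published errata): it extracts the leading significand bits of an IEEE 754 double and compares them against the five risky prefixes. Both divisors from the earlier examples match.

    import struct

    # The five risky prefixes: the four bits after the leading "1." followed by six 1s.
    RISKY = {p + "111111" for p in ("0001", "0100", "0111", "1010", "1101")}

    def significand_bits(x, n=10):
        """Return the first n fraction bits of a positive double's significand."""
        raw = struct.unpack(">Q", struct.pack(">d", float(x)))[0]
        frac = raw & ((1 << 52) - 1)        # the 52-bit fraction field of IEEE 754
        return format(frac, "052b")[:n]

    def divisor_at_risk(y):
        return significand_bits(y) in RISKY

    print(divisor_at_risk(3145727))   # True
    print(divisor_at_risk(294911))    # True
    print(divisor_at_risk(63))        # False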

Intel corrected the problem and a variety of other glitches, and the Pentium chip (in all its variants) is now the workhorse of the desktop information age. However, that's not really the end of the story.

"It turns out the mathematics is more interesting than people think," says Alan Edelman, an applied mathematician at the Massachusetts Institute of Technology. "Most people link [the Pentium bug] to a simple arithmetic mistake, but it is far more subtle."

Edelman describes some of these nuances in a paper, "The Mathematics of the Pentium Division Bug," published in the March 1997 SIAM Review. "The bug in the Pentium was an easy mistake to make, and a difficult one to catch," he says.

The radix-4 SRT algorithm used on the Pentium chip represents a variation of the long-division method we learned in school. Suppose we want to divide 63 into 819762: 63 goes into 81 once, leaving a remainder of 18; then 9 is appended to the remainder to produce 189, and 63 divides into this number 3 times, leaving a remainder of zero. So far, the quotient is 13. The same procedure is repeated until the computation is complete.
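
Here is a short sketch of those schoolbook steps in Python (purely an illustration of the prose above; the helper name is mine). Each pass brings down one digit, records how many times the divisor fits, and carries the remainder forward:

    def long_division(dividend, divisor):
        """Schoolbook long division, one decimal digit of the quotient per step."""
        digits, remainder = [], 0
        for ch in str(dividend):
            remainder = remainder * 10 + int(ch)   # bring down the next digit
            q = remainder // divisor               # how many times the divisor fits
            remainder -= q * divisor
            digits.append(str(q))
        return int("".join(digits)), remainder

    print(long_division(819762, 63))   # (13012, 6)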

The SRT algorithm does essentially the same thing, except that it works with the numbers -2, -1, 0, 1, and 2 instead of 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9—that is, in base 4 instead of base 10. Its advantage in a computer is that each step adds another 2-bit digit to the quotient, speeding up the rate at which division can be performed.
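
A bare-bones digit recurrence along these lines can be sketched in a few lines of Python. This is only a software caricature of radix-4 SRT, assuming a normalized divisor and using a full comparison in place of the hardware's lookup table indexed by a few leading bits of the remainder and divisor (the table whose five missing +2 entries caused the trouble):

    def radix4_divide(x, d, steps=30):
        """Approximate x/d one radix-4 digit (two bits of quotient) at a time."""
        q = 0.0       # quotient accumulated so far
        r = x         # partial remainder
        w = 1.0       # weight of the next quotient digit (4 ** -k)
        for _ in range(steps):
            # Pick a digit in {-2, -1, 0, 1, 2}.  Real SRT hardware reads this
            # digit from a small table instead of doing a full comparison.
            digit = max(-2, min(2, round(r / d)))
            q += digit * w
            r = 4 * (r - digit * d)   # update the remainder and shift left two bits
            w /= 4
        return q

    # Normalize the operands from the earlier example into [1, 2), as the hardware
    # does, then undo the scaling to recover 4195835 / 3145727.
    print(2 * radix4_divide(4195835 / 2**22, 3145727 / 2**21))   # ~1.3338204491
    print(4195835 / 3145727)                                     # for comparison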

The Pentium chip's table, implemented as a triangular array, allowed it to look up values required for doing the computation. "When the algorithm was implemented in silicon, five entries on the boundary of this triangular portion were dropped," says Cleve Moler of MathWorks, Inc., who worked out a method for circumventing the problem. "These missing entries . . . effectively mean that, for certain combinations of bits developed during the division process, the chip makes an error while updating the remainder."

Edelman's analysis reveals that a buggy value can be reached only from the entry below it in the chip's table. "This entry just below the bad entry plays the role of a foothold," he says. "The existence of this foothold is a subtle phenomenon that may not have been readily guessed from general properties of SRT division.

"I believe that its existence has surprised everybody," Edelman continues. "Any thought that each table entry is somehow equally likely to be reached is very wrong."

In effect, there are only two viable routes to a buggy table entry. Even then, it is quite possible to reach the foothold entry yet not hit the buggy entry on the next step. So, errors in the final output would occur only under very specific circumstances.

"Naively, one might have thought that the bad table entries would be hit uniformly at random," Edelman observes. "If this had been the case, the bug surely would have been noticed earlier and would have had the potential to cause more significant damage in every instance it went unnoticed."

In 1997, the Internet was buzzing about chip errors again, this time with reports of a bug involving floating-point arithmetic in the new Pentium II microprocessor. Floating-point numbers are stored inside the microprocessor as strings of 80 bits, whereas integers are stored in 16 or 32 bits. Sometimes a floating-point number converted to an integer doesn't fit the smaller format, and the microprocessor is supposed to provide an "overflow" warning. Apparently, the Pentium II chip failed to do so in many cases.

Interestingly, the launch failure of the first test flight of the Ariane 5 rocket, which exploded 37 seconds after liftoff on June 4, 1996, occurred because of a software error that resulted from converting a 64-bit floating-point number to a 16-bit integer. The value of the floating-point number happened to be larger than a 16-bit integer could represent. The overflow wasn't handled properly, and the guidance computer shut down, sending out diagnostic data. The flight computer read that data as genuine flight data, swung the rocket's nozzles accordingly, and the vehicle broke apart and exploded.
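
The missing guard is easy to state. Here is a minimal sketch in Python (emphatically not the Ariane flight code, which was written in Ada) of a range check before narrowing a 64-bit float to a signed 16-bit integer:

    INT16_MIN, INT16_MAX = -(2 ** 15), 2 ** 15 - 1

    def to_int16(value: float) -> int:
        """Convert a float to a signed 16-bit integer, refusing values that don't fit."""
        truncated = int(value)                    # drop the fractional part
        if not INT16_MIN <= truncated <= INT16_MAX:
            raise OverflowError(f"{value} does not fit in 16 bits")
        return truncated

    print(to_int16(123.7))       # 123

    try:
        to_int16(65536.0)        # too large -- the case left unhandled on Ariane 5
    except OverflowError as err:
        print(err)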

It doesn't take much!

Originally posted May 12, 1997

References:

Cipra, Barry. 1995. How number theory got the best of the Pentium chip. Science 267(Jan. 13):175.

Coe, Tim, et al. 1995. Computational aspects of the Pentium affair. IEEE Computational Science & Engineering (Spring):18-30.

Fisher, Lawrence M. 1997. Flaw reported in new Intel chip. New York Times (May 5). Available at http://www.nytimes.com/library/cyber/week/050697intel-chip-flaw.html

Geppert, Linda. 1995. Biology 101 on the Internet: Dissecting the Pentium bug. IEEE Spectrum (February):16-17.

Moler, Cleve. 1995. A tale of two numbers. SIAM News 28(January):1.

Nuseibeh, Bashar. 1997. Ariane 5: Who Dunnit? IEEE Software 14(May/June):15-16.
