
Author here if anyone has Pentium questions :-)

My Mastodon thread about the bug was on HN a few weeks ago, so this might seem familiar, but now I've finished a detailed blog post. The previous HN post has a bunch of comments: https://news.ycombinator.com/item?id=42391079






In my view, this $475M was perhaps the best marketing money Intel ever spent. Because of the bug and recall, everyone, including people outside tech, knew about Intel. People coming from the 486 were expecting a 586 or 686 and then suddenly got "Pentium"; the bug and recall built a reputation and goodwill that later carried over to the Pentium MMX.

Nah, Intel already did a big Pentium marketing blitz with the bunny people before this bug.

Bunny people were part of the MMX and PII marketing.

Great article and analysis as always, thanks! Somewhat crazy to remember that a (as you argue) minor CPU erratum made worldwide headlines. There are so many worse ones out there (like the ones you mention from Intel, but from others as well) that are completely forgotten.

For the Pentium, I'm curious about the FPU value stack (or whatever the correct term is) rework they did. It's been a long time, but didn't they do some kind of early "register renaming" thing that you had to manage manually with careful FXCH instructions?


Yes, internally fxch is a register rename—_and_ fxch can go in the V-pipe and takes only one cycle (Pentium has two pipes, U and V).

IIRC fadd and fmul were both 3/1 (three cycles latency, one cycle throughput), so you'd start an operation, use the free fxch to get something else to the top, and then do two other operations while you were waiting for the operation to finish. That way, you could get long strings of FPU operations at effectively 1 op/cycle if you planned things well.
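To make the 3/1 arithmetic concrete, here is a minimal Python sketch. It is a toy in-order model, not a cycle-accurate Pentium simulator: it assumes one FPU op issues per cycle, every op has the 3-cycle latency quoted above, and FXCH costs nothing. The function name and register names are made up for illustration.

```python
LATENCY = 3  # cycles until a result can be used, per the 3/1 figure above


def finish_time(ops):
    """Return the cycle at which the last result is ready.

    ops is a list of (dest, src_a, src_b) register names. Ops issue in
    order, at most one per cycle, and stall until both sources are ready.
    """
    ready = {}   # register name -> cycle when its value becomes available
    cycle = 0    # earliest cycle the next op may issue
    for dest, a, b in ops:
        start = max(cycle, ready.get(a, 0), ready.get(b, 0))
        ready[dest] = start + LATENCY
        cycle = start + 1  # in-order: the next op issues no earlier than this
    return max(ready.values())


# One running sum: every add waits for the previous one.
chain = [("s", "s", "x0"), ("s", "s", "x1"), ("s", "s", "x2")]
# Three independent accumulators: a new add can start every cycle.
interleaved = [("a", "a", "x0"), ("b", "b", "x1"), ("c", "c", "x2")]

print(finish_time(chain), finish_time(interleaved))
```

In this model the dependent chain finishes at cycle 9 and the interleaved version at cycle 5, which is the gap that careful FXCH scheduling was meant to close.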

IIRC, MSVC did a pretty good job of it, too. GCC didn't, really (and thus Pentium GCC was born).


FMUL could only be issued every other cycle, which made scheduling even more annoying. Doing something like a matrix-vector multiplication was a messy game of FADD/FMUL/FXCH hot potato since for every operation one of the arguments had to be the top of the stack, so the TOS was constantly being replaced.

Compilers got pretty good at optimizing straight line math but were not as good at cases where variables needed to be kept in the stack during a loop, like a running sum. You had to get the order of exchanges just right to preserve stack order across loop iterations. The compilers at the time often had to spill to memory or use multiple FXCHs at the end of the loop.


> FMUL could only be issued every other cycle, which made scheduling even more annoying.

Huh, are you sure? Do you have any documentation that clarifies the rules for this? I was under the impression that something like `FMUL st, st(2) ; FXCH st(1) ; FMUL st, st(2)` would kick off two muls in two cycles, with no stall.


Agner Fog's manuals are clear on this. Only the last of FMUL's 3 cycles can overlap with another FMUL.

You can immediately overlap with a FADD.
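A small Python sketch of that issue rule, taking the numbers in this thread at face value: an FMUL may start no sooner than two cycles after the previous FMUL, while an FADD can start on the very next cycle. The function name is hypothetical, and the model ignores data dependencies and U/V pairing entirely.

```python
def issue_cycles(ops):
    """Return the issue cycle of each op in an in-order stream.

    ops is a list of "fadd" / "fmul" strings. Assumed rule (from the
    comments above): back-to-back FMULs must be at least 2 cycles apart;
    everything else can issue one cycle after the previous op.
    """
    cycles = []
    next_free = 0      # earliest cycle the next op may issue
    last_fmul = None   # issue cycle of the most recent FMUL, if any
    for op in ops:
        cycle = next_free
        if op == "fmul":
            if last_fmul is not None:
                cycle = max(cycle, last_fmul + 2)
            last_fmul = cycle
        cycles.append(cycle)
        next_free = cycle + 1
    return cycles


print(issue_cycles(["fmul", "fmul", "fmul"]))          # FMULs alone
print(issue_cycles(["fmul", "fadd", "fmul", "fadd"]))  # FADDs fill the gaps
```

Under these assumptions, a stream of only FMULs issues every other cycle, while alternating FMUL and FADD keeps one op issuing per cycle.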


AFAIK, the FPU was a stack calculator. So you pushed things on and ran calculations on the stack. https://en.wikibooks.org/wiki/X86_Assembly/Floating_Point

It's only a stack machine in front, really. Behind-the-scenes, it's probably just eight registers (the stack is a fixed size, it doesn't spill to memory or anything).

Definitely was 8 regs: https://intranetssn.github.io/www.ssn.net/twiki/pub/CseIntra... also where you'd see 'long double'
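For anyone who wants to picture "a stack in front, eight registers behind", here is a toy Python model. The TOP pointer that wraps modulo 8 matches how the x87 stack is addressed; the `remap` table that lets FXCH swap names instead of data is only my assumption about how a rename could work, not a description of Intel's actual circuitry. The class and method names are invented, and real hardware flags a stack fault rather than raising an exception.

```python
class X87Stack:
    """Toy model of the x87 register stack (illustrative only)."""

    def __init__(self):
        self.phys = [0.0] * 8        # eight physical registers
        self.remap = list(range(8))  # stack slot -> physical register (assumed)
        self.top = 0                 # TOP pointer, wraps modulo 8
        self.depth = 0               # number of live entries

    def _slot(self, i):
        return (self.top + i) % 8

    def st(self, i):
        """Read ST(i)."""
        return self.phys[self.remap[self._slot(i)]]

    def fld(self, value):
        """Push: decrement TOP, then write the new ST(0)."""
        if self.depth == 8:
            raise OverflowError("x87 stack overflow")
        self.top = (self.top - 1) % 8
        self.phys[self.remap[self.top]] = value
        self.depth += 1

    def fxch(self, i):
        """Exchange ST(0) and ST(i) by swapping names, not data."""
        a, b = self._slot(0), self._slot(i)
        self.remap[a], self.remap[b] = self.remap[b], self.remap[a]

    def faddp(self):
        """ST(1) = ST(1) + ST(0), then pop."""
        total = self.st(1) + self.st(0)
        self.phys[self.remap[self._slot(1)]] = total
        self.top = (self.top + 1) % 8
        self.depth -= 1


s = X87Stack()
for v in (1.0, 2.0, 3.0):
    s.fld(v)
before = list(s.phys)
s.fxch(2)                  # ST(0) and ST(2) trade places
moved_data = s.phys != before
print(s.st(0), s.st(2), moved_data)
```

The point of the sketch is that `fxch` never touches `phys`, which is why an exchange can be cheap.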

> The bug is presumably in the Pentium's voluminous microcode. The microcode is too complex for me to analyze, so don't expect a detailed blog post on this subject.

How hard is it to "dump" the microcode into a bitstream? Could it be done programmatically from high-resolution die photographs? Of course, I appreciate that's probably the easy part in comparison to reverse engineering what the bitstream means.

> By carefully examining the PLA under a microscope

Do you do this stuff at home? What kind of equipment do you have in your lab? How did you develop the skills to do all this?


Dumping the microcode into a bitstream can be done in an automated way if you have clear, high-resolution die photos. There are programs to generate ROM bitstreams from photos. Part of the problem is removing all the layers of metal to expose the transistors. My process isn't great, so the pictures aren't as clear as I'd like. But yes, the hard part is figuring out what the microcode bitstream means. Intel's patents explained a lot about the 8086 microcode structure, but Intel revealed much less about later processors.

I do this stuff at home. I have an AmScope metallurgical microscope; a metallurgical microscope shines light down through the lens, rather than shining the light from underneath like a biological microscope. Thus, the metallurgical microscope works for opaque chips. The Pentium is reaching the limits of my microscope, since the feature size is about the wavelength of light. I don't have any training in this; I learned through reading and experimentation.


One tidbit to add about scopes: some biological scopes do use "epi" illumination like metallurgical scopes. It's commonly used on high-end scopes, in combination with laser illumination and fluorescence. They are much more complicated and require much better alignment than a regular transillumination scope.

I suppose you might be able to get slightly better resolution using a shorter wavelength, but at that point it takes a lot of technical skill, controlled environmental conditions, time, and money. Just getting to the point you've reached (and knowing what the limitations are) can be satisfying in itself.
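As a back-of-the-envelope check on the wavelength point, the usual Abbe estimate for the smallest resolvable spacing is wavelength / (2 × numerical aperture). The numbers below are my own illustrative assumptions (green and violet light, a dry objective with NA 0.9), not the specs of the scope discussed above.

```python
def abbe_limit_nm(wavelength_nm, numerical_aperture):
    """Abbe diffraction limit: smallest resolvable spacing, in nanometers."""
    return wavelength_nm / (2 * numerical_aperture)


# Assumed example values, chosen only for illustration.
green = abbe_limit_nm(550, 0.9)
violet = abbe_limit_nm(405, 0.9)
print(round(green), round(violet))
```

Going from green to violet light only buys a modest improvement in this estimate, which fits the "slightly better resolution" caveat.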


I was about to ask if the explanation of floating point numbers was using Avogadro's number on purpose, but then I realized the other number was Planck's constant.

Yes, I wanted to use meaningful floating point examples instead of random numbers. You get a gold star for noticing :-)
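If you want to poke at those two constants yourself, here is a short Python sketch that splits a number into sign, exponent, and significand. It assumes the IEEE 754 double format (the article's own examples may use a different width), handles normal numbers only, and the function name is made up.

```python
import struct


def decode_double(x):
    """Split a normal IEEE 754 double into (sign, exponent, significand)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = ((bits >> 52) & 0x7FF) - 1023   # remove the bias
    fraction = bits & ((1 << 52) - 1)
    significand = 1 + fraction / 2**52          # implicit leading 1
    return sign, exponent, significand


avogadro = 6.02214076e23   # per mole
planck = 6.62607015e-34    # joule-seconds

for value in (avogadro, planck):
    sign, exponent, significand = decode_double(value)
    print(value, "=", significand, "* 2 **", exponent)
```

Multiplying the significand by 2 raised to the exponent gives back the original value exactly, since both steps are exact in binary.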

Thank you very much for this detailed article.

I never realised this is how floating point division can be implemented. Actually funny how I didn't realise that multiple integer division steps are required to implement floating point division :-)

In hindsight one could wonder why the unused parts of the lookup table were not filled with 2 and -2 in the first place.
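A rough Python sketch of the "repeated division steps" idea: radix-4 division that produces one quotient digit between -2 and 2 per step. This is a toy under loud assumptions: it picks each digit by exact rounding rather than by a lookup table like the Pentium's, it uses exact fractions instead of hardware adders, and the function name is hypothetical.

```python
from fractions import Fraction


def srt_radix4_divide(x, d, steps=16):
    """Approximate x / d for x, d in [1, 2) using digits in {-2, ..., 2}.

    Returns (quotient, digits). Digit selection here is exact rounding,
    a stand-in for the hardware lookup table (an assumption of this sketch).
    """
    x, d = Fraction(x), Fraction(d)
    p = x / 4            # pre-scaled partial remainder, so |p| <= 2/3 * d
    q = Fraction(0)
    digits = []
    for i in range(1, steps + 1):
        p *= 4                                    # shift by one radix-4 digit
        digit = max(-2, min(2, round(p / d)))     # clamp to the digit set
        p -= digit * d                            # subtract digit * divisor
        q += Fraction(digit, 4 ** i)
        digits.append(digit)
    return 4 * q, digits                          # undo the pre-scaling


quotient, digits = srt_radix4_divide(Fraction(3, 2), Fraction(5, 4))
print(float(quotient), digits[:6])
```

Each pass through the loop is a shift, a digit choice, and a subtraction, so the fractional part of a floating-point quotient really is built out of integer-style steps. Negative digits let a later step correct an earlier overshoot.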


Tour de force, truly. Amazing work!


