
Yes, internally fxch is a register rename—_and_ fxch can go in the V-pipe and takes only one cycle (Pentium has two pipes, U and V).

IIRC fadd and fmul were both 3/1 (three cycles latency, one cycle throughput), so you'd start an operation, use the free fxch to get something else to the top, and then do two other operations while you were waiting for the operation to finish. That way, you could get long strings of FPU operations at effectively 1 op/cycle if you planned things well.
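To make the 3/1 arithmetic concrete, here is a toy in-order issue model. It is a sketch, not a cycle-accurate Pentium simulator: it assumes the 3-cycle latency recalled above, treats FXCH as free, and ignores pairing rules entirely. The function name `finish_time` and the value labels are made up for illustration.

```python
LATENCY = 3  # cycles of result latency; assumed from the "3/1" figure above


def finish_time(ops):
    """Return the cycle at which the last result becomes available.

    ops is a list of (dest, src) pairs. Each op reads dest and src and
    writes dest. At most one op issues per cycle, in program order, and
    an op waits until both of its inputs are ready.
    """
    ready = {}  # value name -> cycle at which it is available
    cycle = 0   # next free issue slot
    for dest, src in ops:
        start = max(cycle, ready.get(dest, 0), ready.get(src, 0))
        ready[dest] = start + LATENCY
        cycle = start + 1
    return max(ready.values())


# One accumulator: every add waits on the previous one.
dependent = [("a", "x1"), ("a", "x2"), ("a", "x3")]

# Three accumulators rotated to the top of the stack by (free) FXCHs.
interleaved = [("a", "x1"), ("b", "x2"), ("c", "x3"),
               ("a", "x4"), ("b", "x5"), ("c", "x6")]

print(finish_time(dependent))    # 3 ops finish at cycle 9
print(finish_time(interleaved))  # 6 ops finish at cycle 8
```

In this model the interleaved version issues one op per cycle and only pays the latency once at the end, which is the "effectively 1 op/cycle" case described above.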

IIRC, MSVC did a pretty good job of it, too. GCC didn't, really (and thus Pentium GCC was born).






FMUL could only be issued every other cycle, which made scheduling even more annoying. Doing something like a matrix-vector multiplication was a messy game of FADD/FMUL/FXCH hot potato since for every operation one of the arguments had to be the top of the stack, so the TOS was constantly being replaced.

Compilers got pretty good at optimizing straight line math but were not as good at cases where variables needed to be kept in the stack during a loop, like a running sum. You had to get the order of exchanges just right to preserve stack order across loop iterations. The compilers at the time often had to spill to memory or use multiple FXCHs at the end of the loop.
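The loop problem can be illustrated with a toy model of the register stack that tracks only where each value sits. This is a simplification: it models FXCH alone (no pushes or pops from loads and stores), and the helper names `layout_after` and `is_loop_stable` are hypothetical.

```python
def layout_after(stack, exchanges):
    """Apply FXCH st(i) for each i in exchanges and return the new layout.

    stack[0] is st(0), stack[1] is st(1), and so on.
    """
    s = list(stack)
    for i in exchanges:
        s[0], s[i] = s[i], s[0]  # FXCH swaps st(0) with st(i)
    return s


def is_loop_stable(stack, exchanges):
    """True if the loop body leaves every value in its original slot."""
    return layout_after(stack, exchanges) == list(stack)


start = ["sum", "x", "y"]

# Two exchanges used for scheduling leave the running sum in st(1).
print(layout_after(start, [1, 2]))          # ['y', 'sum', 'x']
print(is_loop_stable(start, [1, 2]))        # False

# Undoing them costs extra FXCHs at the end of the body.
print(is_loop_stable(start, [1, 2, 2, 1]))  # True
```

If the layout at the bottom of the loop differs from the layout at the top, the next iteration addresses the wrong slots, so the body either needs an exchange order that happens to come back around or extra fix-up exchanges like the ones above.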


> FMUL could only be issued every other cycle, which made scheduling even more annoying.

Huh, are you sure? Do you have any documentation that clarifies the rules for this? I was under the impression that something like `FMUL st, st(2) ; FXCH st(1) ; FMUL st, st(2)` would kick off two muls in two cycles, with no stall.


Agner Fog's manuals are clear on this. Only the last of FMUL's 3 cycles can overlap with another FMUL.

You can immediately overlap with a FADD.
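A rough sketch of that issue rule, assuming independent operations and nothing else competing for the pipe: an FMUL cannot issue in the cycle right after another FMUL, while an FADD can follow immediately. The function `issue_cycles` is a made-up helper, not anything from the manuals.

```python
def issue_cycles(ops):
    """Return the issue cycle of each op in a list of "fadd"/"fmul" strings.

    Assumed rule: one op per cycle, except that an FMUL must issue at
    least 2 cycles after the previous FMUL.
    """
    cycles = []
    cycle = 0          # next free issue slot
    last_fmul = None   # issue cycle of the most recent FMUL, if any
    for op in ops:
        if op == "fmul" and last_fmul is not None and cycle - last_fmul < 2:
            cycle = last_fmul + 2  # wait out the multiplier's busy cycle
        cycles.append(cycle)
        if op == "fmul":
            last_fmul = cycle
        cycle += 1
    return cycles


print(issue_cycles(["fmul", "fmul", "fmul"]))          # [0, 2, 4]
print(issue_cycles(["fmul", "fadd", "fmul", "fadd"]))  # [0, 1, 2, 3]
```

Under this rule, back-to-back multiplies run at one per two cycles, but alternating FMUL with FADD fills the gap and gets back to one op per cycle.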



