> There is a reason that even the obsolete x87 floating point stack still runs at near optimal speed.
That's because SSE / AVX are faster than x87 floating point instructions. So modern CPUs just microcode-translate the x87 instructions into SSE / AVX micro-ops under the hood.
They do not translate x87 to SSE/AVX under the hood. x87 is goofy enough (not just the extra precision; the status word needs to be renamed too) that it has dedicated hardware. As a result, there's a separate register file that stores x87/MMX state (and the AVX-512 k mask registers).
I was going to say that there are no SSE/AVX micro-ops, and that x87, SSE, AVX, and AVX-512 all just get translated to the same internal format that implements the superset of all the specific instruction behaviours. But looking at the instruction tables, for example for Ice Lake, you can see that the legacy FADD is converted to exactly one uop that runs on port 5, while ADDSS is also one uop but can execute on either port 0 or port 1. So it seems that at least Ice Lake still has x87-specific uops.
You can also see that something like the legacy FCOS is definitely microcoded instead, as it expands to hundreds of uops. This has been the case for at least two decades.
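To make the port argument concrete, here's a rough sketch of what those bindings imply for peak issue rate. It uses only the two port assignments quoted above for Ice Lake (not re-verified here) and assumes each port accepts one uop per cycle; it ignores latency, dependency chains, and front-end limits. The names `PORTS`, `peak_per_cycle`, and `min_cycles` are made up for illustration.

```python
# Port bindings as quoted in the comment above (Ice Lake); treat as an assumption.
PORTS = {
    "FADD": (5,),     # legacy x87 add: one uop, port 5 only
    "ADDSS": (0, 1),  # scalar SSE add: one uop, port 0 or port 1
}


def peak_per_cycle(mnemonic: str) -> int:
    """Upper bound on independent single-uop instructions issued per cycle,
    assuming every eligible port takes one uop per cycle."""
    return len(PORTS[mnemonic])


def min_cycles(mnemonic: str, count: int) -> int:
    """Lower bound on cycles needed to issue `count` independent instructions."""
    ports = len(PORTS[mnemonic])
    return -(-count // ports)  # ceiling division


if __name__ == "__main__":
    for name in PORTS:
        print(name, peak_per_cycle(name), min_cycles(name, 1000))
```

Under those assumptions, a stream of independent ADDSS can issue at up to twice the rate of FADD, which is what you'd expect if the two don't share the same execution units.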