So Epictronics recently looked at the 386SX, the version with the 16-bit external bus, which was slower than the 286 at the same clock. What changed between that and this? Was the major difference the double clock hit on fetch? Or did it also have a shorter prefetch queue, like the 8088?
The 386SX was slower than a 286 at the same clock only for legacy 16-bit programs, and only for those that did not use a floating-point coprocessor: the 80387SX coprocessors available for the 386SX were much faster at the same clock frequency than the 80287 available for the 286.
Moreover, there was only a short period when the 286 and the 386SX overlapped in clock frequency. In later years the 286 could be found only at 12 MHz or 16 MHz, while the 386SX was available at 25 MHz or 33 MHz, so the 386SX was noticeably faster at running any program.
Rewriting or recompiling a program as a 32-bit executable could gain a lot of performance, but it is true that in the early years of the 386DX and 386SX most users were still running 16-bit MS-DOS applications.
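To illustrate one source of that 32-bit gain, here is a rough sketch in Python (the function names are made up for illustration, and it counts arithmetic steps only, not real cycle timings). With only 16-bit registers, a 32-bit addition has to be split into a low-half add and a high-half add with carry, which is the ADD/ADC pair on x86. 32-bit code does the same work in a single operation.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def add32_with_16bit_ops(a, b):
    """Add two 32-bit values using only 16-bit-wide steps (hypothetical helper).

    Returns (result, number_of_add_steps).
    """
    # Step 1: add the low 16-bit halves (like ADD on the low words).
    lo = (a & MASK16) + (b & MASK16)
    carry = lo >> 16  # carry out of the low half, 0 or 1

    # Step 2: add the high 16-bit halves plus the carry (like ADC).
    hi = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry

    # Recombine the halves; any carry out of the high half is discarded,
    # as it would be in a 32-bit wraparound.
    result = ((hi & MASK16) << 16) | (lo & MASK16)
    return result, 2

def add32_with_32bit_op(a, b):
    """Add two 32-bit values in one full-width step (hypothetical helper)."""
    return (a + b) & MASK32, 1

if __name__ == "__main__":
    print(add32_with_16bit_ops(0x0001FFFF, 0x00000001))
    print(add32_with_32bit_op(0x0001FFFF, 0x00000001))
```

Both functions return the same sum, but the 16-bit version needs two dependent steps. The same doubling applies to 32-bit loads, stores, compares and pointer arithmetic in 16-bit code, which is part of why recompiling for 32 bits helped programs that worked on large integers or large data.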