I'm skeptical as well. The primary reason, IMO, is the software. How do you easily reconfigure your FPGA to efficiently run whatever computationally intensive and/or specialized algorithm you have?
It is doable. I've seen it during my Computer Engineering courses 14 years ago.
Basically you analyze the code for candidates, select a candidate, upload your custom hardware design, run your operation on the hardware, and repeat.
The difficult part is that uploading your hardware design to the FPGA takes on the order of tenths of a second, which is ages compared to the nanoseconds and microseconds your CPU works in.
So your specific operation must be worthwhile to upload.
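To put rough numbers on "worthwhile", here is a minimal back-of-the-envelope sketch. The function names and the timings in the usage note are made up for illustration, not measurements from any real device; it assumes the only extra cost of offloading is a fixed reconfiguration time.

```python
import math


def offload_pays_off(calls, cpu_time_per_call_s, fpga_time_per_call_s, reconfig_time_s):
    """Return True if reconfiguring and running on the FPGA beats staying on the CPU."""
    cpu_total = calls * cpu_time_per_call_s
    # The FPGA path pays the (assumed fixed) reconfiguration cost once up front.
    fpga_total = reconfig_time_s + calls * fpga_time_per_call_s
    return fpga_total < cpu_total


def break_even_calls(cpu_time_per_call_s, fpga_time_per_call_s, reconfig_time_s):
    """Smallest number of calls at which offloading wins, or None if it never does."""
    saving_per_call = cpu_time_per_call_s - fpga_time_per_call_s
    if saving_per_call <= 0:
        return None  # the FPGA version is not faster per call, so it never pays off
    return math.floor(reconfig_time_s / saving_per_call) + 1
```

With hypothetical numbers (10 microseconds per call on the CPU, 1 microsecond on the FPGA, 0.2 seconds to load the design), `break_even_calls(10e-6, 1e-6, 0.2)` comes out in the tens of thousands of calls, so anything you only run a few hundred times is better left on the CPU.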
A bit of FPGA on your CPU makes it more flexible. For example, you could set a profile such as 'crypto' or 'video' to add some specific hardware acceleration to your general-purpose CPU.
Imagine your CPU being able to switch your embedded GPU into another CPU core.
Let's say the current Zen 2 had an FPGA onboard. AMD could sell you an upgraded design with AV1 support for a few dollars. Most people aren't going to buy a new CPU on the basis of a video decoder, but they'll buy an upgrade to the chip that auto "installs" itself. That's a sale AMD otherwise wouldn't have made.
Also, for the way most modern CPUs are used: how do you task switch? If the hardware is large enough, you can deploy multiple configurations at a time, but does software support that? Is it possible to have relocatable configurations?
In theory, you could even page out code, but I guess that would be slow. Also, paging in would probably be challenging because the logic units aren’t uniform (if only because not all of them are connected to external wires).
This could work with a client-server model: if there are enough free cells and I/O available on the FPGA, it could install the configuration, and then any application could communicate with it concurrently, maybe with some basic auth.
But from what I understand of FPGAs, fragmentation would be a serious issue. You may have the free cells and I/O you need to implement some circuit, but if they’re dispersed over your FPGA or even connected, but in the wrong shape for the circuit you’re building, that’s useless.
An enormous crossbar could solve that, but I would think that would be way too costly, if practically possible at all.
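A toy illustration of the fragmentation point, as a sketch only: it pretends the FPGA is a plain grid of identical cells and that a circuit needs a free rectangle, which is a big simplification of real place-and-route. The names `count_free` and `find_placement` are invented for this example.

```python
def count_free(grid):
    """Count free cells (0 = free, 1 = occupied)."""
    return sum(row.count(0) for row in grid)


def find_placement(grid, height, width):
    """Return (row, col) of the first free height x width rectangle, or None."""
    rows, cols = len(grid), len(grid[0])
    for r in range(rows - height + 1):
        for c in range(cols - width + 1):
            if all(grid[r + dr][c + dc] == 0
                   for dr in range(height) for dc in range(width)):
                return (r, c)
    return None


# Eight free cells scattered in a checkerboard pattern.
fragmented = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]

# The same eight free cells, but contiguous.
compact = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
```

Both grids have the same number of free cells, but a circuit needing a 2x2 block only fits in `compact`; `find_placement(fragmented, 2, 2)` finds nothing.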
Even GPUs multitask all the time, even though it's less obvious. Cooperative multitasking in this context means setting up and executing different shaders/kernels. The overhead involved in this is quite manageable.
Repurposing FPGAs to different tasks means loading a new bitstream into the device every time. So it is much more efficient to grant exclusive access to each user of the device for long stretches of time. The proper pattern for that is more like a job queue.
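A minimal sketch of what that job queue pattern might look like, assuming each job names the bitstream it needs and that loading a bitstream has a fixed cost. Everything here (`run_queue`, `group_by_bitstream`, the job tuples) is hypothetical and only simulates the bookkeeping; it does not talk to real hardware.

```python
def run_queue(jobs, reconfig_cost_s):
    """Run jobs in order with exclusive access; return (reconfig count, total seconds).

    Each job is a (bitstream_name, run_time_s) tuple.
    """
    loaded = None
    reconfigs = 0
    total_s = 0.0
    for bitstream, run_time_s in jobs:
        if bitstream != loaded:
            # Only pay the load cost when the required design changes.
            loaded = bitstream
            reconfigs += 1
            total_s += reconfig_cost_s
        total_s += run_time_s
    return reconfigs, total_s


def group_by_bitstream(jobs):
    """Reorder jobs so those sharing a bitstream run back to back.

    Groups keep the order in which each bitstream first appeared, and jobs
    keep their relative order within a group (sorted() is stable).
    """
    first_seen = {}
    for bitstream, _ in jobs:
        first_seen.setdefault(bitstream, len(first_seen))
    return sorted(jobs, key=lambda job: first_seen[job[0]])


jobs = [("crypto", 1.0), ("video", 2.0), ("crypto", 1.0), ("video", 2.0)]
interleaved = run_queue(jobs, reconfig_cost_s=0.5)
grouped = run_queue(group_by_bitstream(jobs), reconfig_cost_s=0.5)
```

Running the interleaved list reloads the device for every job, while the grouped order reloads once per bitstream. The catch is that grouping trades fairness for fewer reloads, so a real scheduler would probably cap how long one bitstream can hold the device.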
I believe there is some amount of support in OpenCL for FPGAs. If only we could get companies to properly support OpenCL, we'd have a nice software interface to pretty much any kind of compute resource on a machine.
You're not wrong, but I expect they'd make the various models similar enough (at least within a given CPU generation) that you could use mostly precompiled artifacts instead of rerouting everything from scratch.
I've always been pretty skeptical of their approach though. To be usable, they'd need excellent tooling to support the feature, and if there's one thing that existing FPGA software isn't, it's "excellent".
Getting FPGAs to perform well is often an art more than a science ("hey guys, let's try a different seed to see if we get better timings") so the idea that non-hardware people would start to routinely generate FPGA bitstreams for their projects is so implausible that it's almost comical to me.
Maybe one day we'll have a GCC/LLVM for FPGAs and it'll be a different story.
Beyond the GCC/LLVM, you also really need a standard library. Nobody is talking about that. Today, if you want a std::map on an FPGA, you have to either pay $100k or build it yourself. That's untenable.