It sounds like the NCSU guys are using the CPU as a prefetcher to speed up GPU kernel execution, not using the GPU to speed up normal CPU programs as the ExtremeTech article implies.
The CPU parses the GPU kernel and creates a prefetcher program that contains the load instructions of the GPU kernel. This prefetcher runs on the CPU, but slightly ahead of kernel execution on the GPU. This warms up the caches, so that when the GPU executes a load instruction, the data is already there.
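To make the run-ahead idea concrete, here is a toy simulation, not the researchers' implementation. It assumes a small shared LRU cache and replays a made-up stream of load addresses in a single thread. The names `LRUCache`, `kernel_loads`, `run` and the `run_ahead` parameter are all hypothetical, chosen only for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache of line addresses, standing in for a shared cache (assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def touch(self, line):
        """Access a line. Returns True on a hit; on a miss the line is filled."""
        hit = line in self.lines
        if hit:
            self.lines.move_to_end(line)
        else:
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict the least recently used line
        return hit


def kernel_loads(n, stride=64):
    """Hypothetical address stream: the loads extracted from a GPU kernel."""
    return [i * stride for i in range(n)]


def run(loads, cache_lines, run_ahead):
    """Replay the loads and count how many 'GPU' loads hit in the cache.

    The 'CPU prefetcher' touches each address run_ahead steps before the
    'GPU' needs it. run_ahead=0 disables prefetching entirely.
    """
    cache = LRUCache(cache_lines)
    if run_ahead > 0:
        for addr in loads[:run_ahead]:  # the prefetcher's head start
            cache.touch(addr)
    gpu_hits = 0
    for step, addr in enumerate(loads):
        ahead = step + run_ahead
        if run_ahead > 0 and ahead < len(loads):
            cache.touch(loads[ahead])  # CPU warms the cache
        if cache.touch(addr):  # GPU executes its load
            gpu_hits += 1
    return gpu_hits


loads = kernel_loads(1000)
print("no prefetch:      ", run(loads, cache_lines=16, run_ahead=0))
print("slightly ahead:   ", run(loads, cache_lines=16, run_ahead=4))
print("too far ahead:    ", run(loads, cache_lines=16, run_ahead=32))
```

In this model the "slightly ahead" part matters: if the prefetcher runs further ahead than the cache can hold, the lines are evicted before the GPU gets to them and the benefit disappears.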
Yes, you are right. In fact we don't have to infer this. The researchers state it directly in their abstract, "...a novel approach to utilize the CPU resource to facilitate the execution of GPGPU programs..."
It sounds like the NCSU guys are using the CPU as a prefetcher to speed up GPU kernel execution, not using the GPU to speed up normal CPU programs as the ExtremeTech article implies.
The article says the same thing you are -- that the CPU is used as a prefetcher for the GPU; read the 3rd paragraph:
To achieve the 20% boost, the researchers reduce the CPU to a fetch/decode unit, and the GPU becomes the primary computation unit. This works out well because CPUs are generally very strong at fetching data from memory, and GPUs are essentially just monstrous floating point units. In practice, this means the CPU is focused on working out what data the GPU needs (pre-fetching), the GPU’s pipes stay full, and a 20% performance boost arises.
This is only tangentially related, but with a title like that I was expecting a brainless regurgitation of a press release, or some kind of extrapolation from a paper that wasn't claiming that meaning at all.
Instead, I see a news article with a clear description, caveats and constraints clearly listed, and a portion of how this relates to the parent company. It's a shame that I find this surprising.
The fact that it's only a 20% increase makes it sound promising. Normally press releases will boast about "100x" increases in speed when they switch to using the GPU. And you can get that sort of increase for highly parallel tasks with low memory pressure, Bitcoin mining for example. But the low 20% speedup implies that they're doing this for general purpose computing.
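For a rough sense of scale, the arithmetic behind those two numbers can be sketched as below. This is a back-of-the-envelope illustration using Amdahl's law, not anything taken from the paper. The function names are made up for the example.

```python
def runtime_fraction_saved(speedup):
    """Fraction of the original runtime removed by a given overall speedup."""
    return 1.0 - 1.0 / speedup


def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is made s times faster."""
    return 1.0 / ((1.0 - p) + p / s)


# A 20% speedup (1.2x) removes about one sixth of the runtime.
print(round(runtime_fraction_saved(1.2), 3))

# A 100x speedup removes 99% of the runtime, so at least 99% of the
# original program must be in the part that gets accelerated.
print(round(runtime_fraction_saved(100.0), 3))

# Example: accelerating half of the runtime by 2x gives only about 1.33x overall.
print(round(amdahl_speedup(0.5, 2.0), 2))
```

So a "100x" headline is only reachable when nearly all of the work is parallel, while a 20% gain is plausible for code where only a modest share of the time (memory stalls, in this case) can be hidden.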
That's hilarious. Using a whole CPU for prefetching data because of poor shared bus performance for both the CPUs and GPUs (?!). Instead of such a crazy "software solution", I would rather reserve a portion of the CPU's L2 or L3 cache (e.g. 1MB of a 3MB L2/L3 cache) for the GPU itself, and reduce bus saturation with DMA transfers (e.g. the way the SPE units of the Cell CPU work).
However, what they did was demonstrate a novel way of making use of two different processing cores that exist on the chip (namely using both the CPU and an integrated GPU) to improve the performance of their benchmark - which certainly is both interesting and news.
Of course, a proof of concept is a long long way from it being of practical benefit!
A very, very long way, I would guess. A 20% performance gain is nice, but having to power a GPU to get it is not. I would expect that adding a second CPU instead of that GPU will almost always give you more than that 20% performance gain, with less heat and for less money.
It depends on the application. If the application, as the article puts it, "pushes polygons around", then I imagine the APU concept may have the advantage.
Though, as previously noted, this APU concept is highly dependent on tailored software (compilers, etc.) and AMD has been banking their strategy on the fact that these critical pieces will take advantage of the APU.
I think the NCSU research (co-sponsored by AMD) is a move in the right direction for determining whether these APUs are an effective solution when compared to the multi-CPU architectures.
GPUs are pretty specialized and aren't really good for general purpose computing. Branching and cache coherency are much easier on the CPU than on the GPU. I doubt that any of the advertised gains would be realized by normal users.
I hope this has something to do with the HSAIL virtual ISA.
For example, general purpose code in C gets compiled to HSAIL, and then the CPU makes intelligent decisions about which parts of the code to JIT-compile for the CPU and which for the GPU ...
Hopefully we can get this into drivers sooner rather than later. AMD has already been working with Microsoft to get a large performance gain out of Bulldozer chips in Windows 8 simply by changing the way threads are prioritized.