The article being from 2011 is perhaps why it can be as long as it is without mentioning "Coarse-grained reconfigurable arrays", or CGRAs, which, at least as of 2019 when I learned about them, seemed to occupy a good middle ground between conventional CPUs and FPGAs.
The idea is that, instead of being a bunch of gates like an FPGA, the components of the CGRA are at the scale of an ALU, or maybe an on-silicon network switch, with a single CGRA having different parts that are optimized for e.g. numerics, IO, encryption, caching, etc., which you can knit together into the processor you need.
That's maybe where this idea went?
Here's a more recent link covering similar ground: https://semiengineering.com/specialization-vs-generalization...
It's worth noting that what you are describing is basically an FPGA nowadays.
FPGAs don't have "gates" as the basic building blocks.
Instead you have "logic cells", each composed of a fixed-size LUT (lookup table, often with 4 or 6 inputs), one or two flip-flops, and a multiplexer that chooses between the stored value and the new LUT value. They also sometimes contain basic ALU components like adders or multipliers. Those logic cells are then usually grouped together to form logic blocks, which might have some amount of local memory/cache available. These blocks are the smallest "discrete" component of an FPGA and are configured as a whole block, with configurations determined at synthesis time.
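To make the logic cell description concrete, here is a toy Python model of one cell: a LUT, one flip-flop, and the output mux. It is only a sketch. The names `LogicCell` and `truth_table` are made up for illustration, and real cells have carry chains and other features that are not modeled here.

```python
from dataclasses import dataclass


def truth_table(fn, num_inputs):
    """Pack the outputs of fn for every input pattern into one integer."""
    bits = 0
    for index in range(1 << num_inputs):
        inputs = [(index >> position) & 1 for position in range(num_inputs)]
        if fn(*inputs):
            bits |= 1 << index
    return bits


@dataclass
class LogicCell:
    """Hypothetical model of one logic cell: LUT + flip-flop + output mux."""
    lut_bits: int             # bit i holds the LUT output for input pattern i
    num_inputs: int = 4
    registered: bool = False  # mux select: True = flip-flop, False = raw LUT
    ff: int = 0               # current flip-flop state

    def lut(self, inputs):
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of LUT inputs")
        index = 0
        for position, bit in enumerate(inputs):
            index |= (bit & 1) << position
        return (self.lut_bits >> index) & 1

    def output(self, inputs):
        # The multiplexer: stored value or freshly computed LUT value.
        return self.ff if self.registered else self.lut(inputs)

    def clock(self, inputs):
        # On a clock edge the flip-flop captures the LUT output.
        self.ff = self.lut(inputs)


# "Synthesis" here is just picking the truth table: a 4-input XOR.
xor4 = truth_table(lambda a, b, c, d: a ^ b ^ c ^ d, 4)
cell = LogicCell(xor4)
```

The point is that the cell's function is whatever truth table gets loaded into it, so the same hardware can act as an XOR, an AND, or any other 4-input function.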
On top of this you have memory blocks and other "hard IP" like DSP slices, etc., distributed around the IC for these logic blocks to take advantage of.
And then finally you have larger hard IP that a given chip only has a few of. These include your PLLs (phase-locked loops) or other analog clock multiplier hardware (to allow you to run multiple clock domains on a single FPGA), your encryption and encoding/decoding accelerators, dedicated protocol hard IP (Ethernet, PCIe, etc.), and hardware that is directly attached to the IO (ADCs, DACs, pullup/pulldown resistor configuration, etc.). Increasingly, chips also include full-blown hard IP CPUs and GPUs that can interact directly with the FPGA fabric.
No, the previous poster has not described an FPGA, but an FPGA-like device that contains much more complex fixed-function blocks than the small DSP multipliers available in existing FPGAs.
With 18-bit integer multipliers or the like you cannot compete in energy efficiency with the arithmetic execution units of a GPU.
The so-called CGRAs are an attempt to revive the idea of reconfigurable dataflow processors, with the hope of combining the advantages of FPGAs and GPUs in a single device.
Xilinx FPGAs already contain a "dataflow processor" called "AI Engine", and it tends to deliver more TOPS/TFLOPS than the programmable fabric and DSPs combined.
That isn't even everything. Nowadays Xilinx FPGAs come with "AI Engine tiles". Some FPGAs come with 400 tiles. Each of those tiles is a C-programmable processor with its own local scratchpad memory. So an FPGA is probably the most heterogeneous type of chip ever invented.
The claims about improved energy efficiency (due to the elimination of the instruction fetching and decoding and of the register files) can be correct only when such a CGRA is not used as a general-purpose CPU, but as an accelerator used to implement various iterative algorithms, i.e. when its dataflow compiler could be used as a replacement for something like CUDA.
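A rough way to picture the "configure once, then stream data through" model is the Python sketch below. It only imitates the programming model and says nothing about energy. The graph format, the name `SAXPY_GRAPH`, and the `stream` helper are all invented for illustration and do not correspond to any real CGRA toolchain.

```python
import operator

# Hypothetical configuration: node name -> (operation, names of its inputs).
# Entries are listed in dependency order, standing in for a placed-and-routed
# fabric that is set up once before any data arrives.
SAXPY_GRAPH = {
    "scaled": (operator.mul, ("a", "x")),
    "result": (operator.add, ("scaled", "y")),
}


def stream(graph, output, **inputs):
    """Push input streams through a fixed graph, one output per element."""
    names = list(inputs)
    for values in zip(*(inputs[name] for name in names)):
        wires = dict(zip(names, values))
        # Every element flows through the same fixed wiring; there is no
        # per-element choice of which operation to run next.
        for node, (op, sources) in graph.items():
            wires[node] = op(*(wires[source] for source in sources))
        yield wires[output]


results = list(stream(SAXPY_GRAPH, "result",
                      a=[2, 2, 2], x=[1, 2, 3], y=[10, 20, 30]))
```

This is the shape of workload where the claim applies: an iterative kernel whose structure is fixed up front, which is also roughly what a CUDA kernel expresses.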
An FPGA would have the same energy efficiency advantage for algorithms without much numeric computation, but it is not competitive with a GPU or a CGRA for most numeric computations, except DSP, because it includes only small fixed-point multipliers and adders, which are not as efficient as big vector floating-point fused-multiply-add execution units.
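One illustration of the cost of small multipliers: a wide product has to be stitched together from several narrow ones plus adders. The Python sketch below uses unsigned 16-bit pieces purely as an assumption to keep the arithmetic simple (real DSP slices differ in width and signedness), and the function names are made up.

```python
def mul_narrow(a, b, width=16):
    """Stand-in for one small hardware multiplier with `width`-bit operands."""
    limit = 1 << width
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operand does not fit the narrow multiplier")
    return a * b


def mul_wide(a, b, width=16):
    """Multiply two unsigned 2*width-bit values using four narrow multiplies."""
    mask = (1 << width) - 1
    a_lo, a_hi = a & mask, a >> width
    b_lo, b_hi = b & mask, b >> width
    # Schoolbook decomposition: four partial products, then shifts and adds.
    low = mul_narrow(a_lo, b_lo, width)
    cross = mul_narrow(a_lo, b_hi, width) + mul_narrow(a_hi, b_lo, width)
    high = mul_narrow(a_hi, b_hi, width)
    return low + (cross << width) + (high << (2 * width))
```

So one 32x32 multiply already occupies four narrow multipliers and an adder tree, before any floating-point normalization or rounding logic is added on top.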
> The idea is that, instead of being a bunch of gates like an FPGA, the components of the CGRA are at the scale of an ALU, or maybe an on-silicon network switch, with a single CGRA having different parts that are optimized for e.g. numerics, IO, encryption, caching, etc., which you can knit together into the processor you need.
What the other guy said downthread, but seriously.
Xilinx FPGAs today do have LUTs (the 4-input or 6-input gate-like structures). But they also have VLIW + SIMD cores with L1 connected on powerful interconnect.
So a CGRA is probably "just" a modern Xilinx "AI Engine" FPGA.
-------------------
Major FPGAs for the last 10+ years have all had hardware multipliers as well. Multiplication is just one of those ASIC / hardware units that LUTs cannot emulate very well. Depending on your definition of "Coarse-grained reconfigurable arrays", you might want to look into DSP-slices and other "ALU-like" subunits of FPGAs of the last decade.
2011 "ancient"? Author alive! Amazing world we live in - deep witchcraft probably involved here. Perhaps upload?
Seriously though. There are plenty of useful readings that came out before the past 7 days. And digging them out and bringing them to our attention is useful.
https://semiengineering.com/specialization-vs-generalization...