If you think GPUs are too power hungry then you're in for a shock with FPGAs. Switching in FPGAs is incredibly power hungry since they run on much larger processes.
FWIW most modern FPGAs use discrete DSPs anyway so you're not really getting the flexibility at that level.
The processes used for FPGAs are VERY competitive with GPUs. Stratix 10 is at 14nm, for example. Stratix V was built on a 28nm process, and that was at least 5 years ago - on par with or ahead of NVidia.
You can fuse more operations into the DSPs on an FPGA, and/or you can perform fewer operations per FLOP. One example is avoiding rounding and packing/unpacking between stages when building deep floating-point pipelines.
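Roughly what I mean, as a software analogy only (this isn't FPGA code, and the function names are mine): rounding back to float after every operation vs. keeping a wider intermediate and rounding once at the end. An FPGA pipeline gets a similar win by skipping the normalize/round/pack stage between fused DSP operations.

    #include <stdio.h>

    float dot_round_each_step(const float *a, const float *b, int n) {
        float acc = 0.0f;                     /* rounded to float after every add */
        for (int i = 0; i < n; i++)
            acc += a[i] * b[i];
        return acc;
    }

    float dot_round_once(const float *a, const float *b, int n) {
        double acc = 0.0;                     /* wider intermediate, no per-step rounding */
        for (int i = 0; i < n; i++)
            acc += (double)a[i] * (double)b[i];
        return (float)acc;                    /* single rounding at the end */
    }

    int main(void) {
        float a[] = {1e8f, 1.0f, -1e8f, 1.0f};
        float b[] = {1.0f, 1.0f, 1.0f, 1.0f};
        /* prints "1 vs 2": per-step rounding loses one of the additions */
        printf("%g vs %g\n", dot_round_each_step(a, b, 4), dot_round_once(a, b, 4));
        return 0;
    }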
A Stratix V board will not consume more than 60W when used as a PCI Express card. The requirement for a Xeon Phi is at least 250W.
With a difference of ~200W, that's ~4.5 kWh per day, or ~1600 kWh and $160 in hard cash per year (at US average rates). Very probably more - getting rid of the heat produced, etc.
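The back-of-the-envelope math, assuming roughly $0.10/kWh (approximate US average rate):

    #include <stdio.h>

    int main(void) {
        double delta_w  = 250.0 - 60.0;              /* ~190 W difference */
        double kwh_day  = delta_w * 24.0 / 1000.0;   /* ~4.6 kWh per day */
        double kwh_year = kwh_day * 365.0;           /* ~1660 kWh per year */
        double usd_year = kwh_year * 0.10;           /* assumed ~$0.10/kWh */
        printf("%.1f kWh/day, %.0f kWh/year, ~$%.0f/year\n",
               kwh_day, kwh_year, usd_year);
        return 0;
    }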
> Switching in FPGAs is incredibly power hungry since they run on much larger processes.
I don't think the comment about process is really true. From what I can tell, Xilinx is only a few months behind the biggest SoC makers in terms of process adoption, and is shipping 14nm parts currently. Not sure about Altera, but they are on Intel's process, which is a bit ahead of the competitors anyway.
In terms of switching power, you definitely pay a penalty to have the reconfigurability in hardware, but on the other hand you don't have all the unused logic that you would on a GPU. I'd guess the comparative efficiency depends on the specific problem and specific implementation, but I don't have any numbers to back that up.
It's a couple of things; process is a large part. You're also dealing with 4-LUTs instead of raw transistors, so you pay in both switching power and leakage, since you can't get the same logic-to-transistor density that's available on ASICs.
Also there's a ton of SRAM for the 4-LUT configuration so you're paying leakage costs there as well.
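As a toy software model of the 4-LUT I'm describing (not hardware, just for illustration): the logic function is a 16-entry truth table held in SRAM config bits, indexed by the four inputs, and those bits plus the routing muxes sit there leaking whether or not the design exercises them.

    #include <stdint.h>
    #include <stdio.h>

    static int lut4(uint16_t config, int a, int b, int c, int d) {
        int idx = (a << 3) | (b << 2) | (c << 1) | d;   /* 4 inputs -> index 0..15 */
        return (config >> idx) & 1;                     /* read the stored SRAM bit */
    }

    int main(void) {
        /* A 2-input AND (c and d unused) still occupies a whole LUT:
           output is 1 only when a = b = 1, i.e. table indices 12..15. */
        uint16_t and_ab = 0xF000;
        printf("%d %d %d\n", lut4(and_ab, 0, 1, 0, 0),
                             lut4(and_ab, 1, 0, 1, 1),
                             lut4(and_ab, 1, 1, 0, 0));   /* prints 0 0 1 */
        return 0;
    }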
NVidia managed to get it right only about a year and a half ago. Before that their gates leaked power all over the place.
The LUTs on Stratix are 6-input, 2-output, with specialized adders; they aren't at all the 4-LUTs you're describing here.
All in all, there are places where FPGAs can beat ASICs. One example is complex algorithms like, say, ticker correlations. These are done using dedicated memory and logic (so they aren't all that CPU friendly - caches aren't enough), and they change often enough to make the use of an ASIC moot.
Another example is parsing network traffic (deep packet inspection). The algorithms in this field use memory in interesting ways: compute a lot of different statistics for a packet, then compute the KL divergence between a reference model and your result to determine the actual packet type - histograms built in a random-access manner and then scanned linearly, all in parallel. GPUs and/or CPUs just do not have that functionality.
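A sketch of that classification step in plain C (names, the smoothing epsilon, and the uniform reference model are made up for illustration; on the FPGA, many such histograms live in on-chip RAM and are built and scanned in parallel):

    #include <math.h>
    #include <stdio.h>

    enum { BINS = 256 };

    static void packet_histogram(const unsigned char *pkt, int len, double hist[BINS]) {
        for (int i = 0; i < BINS; i++) hist[i] = 0.0;
        for (int i = 0; i < len; i++) hist[pkt[i]] += 1.0;      /* random-access writes */
        for (int i = 0; i < BINS; i++) hist[i] /= (double)len;  /* normalize to a distribution */
    }

    static double kl_divergence(const double p[BINS], const double q[BINS]) {
        double d = 0.0;
        for (int i = 0; i < BINS; i++) {
            double pi = p[i] + 1e-9;   /* epsilon keeps log() finite */
            double qi = q[i] + 1e-9;
            d += pi * log(pi / qi);    /* linear scan over the histogram */
        }
        return d;
    }

    int main(void) {
        unsigned char pkt[] = "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n";
        double p[BINS], q[BINS];
        packet_histogram(pkt, (int)sizeof pkt - 1, p);
        for (int i = 0; i < BINS; i++) q[i] = 1.0 / BINS;   /* stand-in reference model */
        printf("KL(packet || reference) = %f\n", kl_divergence(p, q));
        return 0;
    }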
The Arria 10 (previous high-end Altera series) was at 20nm. The new Stratix 10 is at 14nm. UltraScale+ is in the 14-20nm range, I think, and Xilinx got there first.
(I don't know if you can publicly get Stratix 10 devkits yet, but you can get an Arria at least.)
The unused logic part isn't exactly true. The way FPGAs are built doesn't allow unused sections to be completely shut off. Instead of dark silicon it's more like grey silicon: the unused parts of the chip still draw substantial power, unlike an ASIC, where those unused portions simply wouldn't exist.
Not my personal/professional experience. Even with heavy DSP usage on nodes larger than 14nm you can still create designs with power consumption below 30 W, since you can control the frequency and use low-power design techniques. There are vendors like Microsemi/Lattice that specialize in low-power FPGAs, where you can do even better.
So there might be specific products that hit certain market segments (FWIW I really like Lattice's offerings). It's just that, watt for watt, GPUs should be more efficient since they can use the latest process and don't have to carry around the 4-LUTs + SRAM.
On the DSP side, you're using an ASIC DSP (you can't change the width, for instance) anyway on most modern FPGAs, so you're comparing ASIC to ASIC at that point.
Designing an ASIC with low power in mind will always beat an FPGA, no question there. But the design cost is prohibitive for many applications (not to mention you lose the flexibility to iterate on/patch your design at little to no cost). Compared to GPUs, though, I'm fairly certain you can do much better power-wise going with an FPGA.
You have finer control over your frequency, over which sections to power down, how often to switch, etc.
Of course you can get better price/GFLOP with GPUs, plus quicker time to market.