> Only FPGAs have deterministic behavior, so they still have bright future
GPUs are perfectly deterministic. Run the same code twice and you get precisely the same results, bit for bit. For now, anyway; the industry is actually moving in the opposite direction. Networks don't need such strict determinism, and small errors in the computations make no difference, which we can demonstrate quite reliably. So when GPUs do become less accurate, it will be a big step forward. But for now, they are totally deterministic.
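To make the determinism claim concrete, here's a minimal sketch (assuming CuPy and an NVIDIA GPU; the matmul is just a stand-in for a real network layer): run the same computation twice on the same inputs and compare the outputs bit for bit.

```python
# Minimal sketch of run-to-run determinism: same inputs, same device,
# same library version -> identical bits out. Assumes CuPy and a CUDA GPU;
# the matmul is a stand-in for a real network layer.
import cupy as cp

cp.random.seed(0)
a = cp.random.rand(1024, 1024, dtype=cp.float32)
b = cp.random.rand(1024, 1024, dtype=cp.float32)

first = cp.matmul(a, b)
second = cp.matmul(a, b)

# Bitwise equality, not just "close enough".
print(bool(cp.array_equal(first, second)))  # expected: True
```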
That's the theory, anyway. In practice, FPGAs are far more error-prone and far less deterministic than GPUs. If, for example, your timing is too aggressive, you get subtle instabilities and bugs that are obscenely difficult to track down.
> Tools is other topic, but they are perfect for this particular task
Tools for FPGAs are a total and unmitigated disaster. Altera and Xilinx have spent decades shooting themselves in the foot by making their tools absurdly hard to use. The tools are all closed source, they're awkward and poorly documented, you file a bug and hear back around the next ice age, and they cost a lot of money. For the millions that Altera/Xilinx must make from software licensing, they're killing their far more valuable hardware business.
> FPGA debugging is easy when you can simulate all your code before going to hardware
When you said this, I felt a draft and a dark cold settled in. You can't be serious? Plenty of designs simulate fine but fail on real hardware. And even simulation failures aren't easy to track down. Have you actually deployed any deep learning on FPGAs? Because I have.
> Hardware debugging with integrated logic analyzer is really easy
You either haven't used GPUs, so you don't see how trivially pleasant writing code is by comparison, or you've never used FPGAs in an environment where something actually had to work. I've literally had to use oscilloscopes to debug FPGAs; that's not uncommon. And the integrated logic analyzers disturb your design: bugs can go away the moment you put one in!
> For hardcore projects you can take 3rd part logic analyzer that spits every event of the system over 10G Ethernet interface.
The fact that you said "every event of the system" convinces me you've never done this. There is no such thing.
> Portability is the code depends solely on the author.
It is not! To squeeze performance out of an FPGA you have to adapt your code to the particular resources of that FPGA (the logic blocks, the IP cores, and so on). The difference between FPGAs from different generations, and between FPGAs at different price points within the same generation, is huge. If you don't carefully align your code with the resources on your FPGA, you squander most of the gains. My GPU-based code? I write it once and I'm good (see the sketch below). I've seen people upgrade to far more expensive FPGAs, get really sad when nothing improves, and then spend months changing their code to get the performance back up.
Also: code that runs on one FPGA and is totally correct may be buggy and unstable on another FPGA. It's all fun!
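To show what "write it once" means on the GPU side, here's a minimal sketch (assuming CuPy and any NVIDIA GPU): nothing in it is tied to a particular device generation, because the library dispatches to kernels tuned for whatever hardware it finds.

```python
# The same source runs unchanged on any CUDA GPU, old or new; cuBLAS picks
# architecture-specific kernels under the hood. Assumes CuPy is installed.
import cupy as cp

print("compute capability:", cp.cuda.Device(0).compute_capability)

a = cp.random.rand(2048, 2048, dtype=cp.float32)
b = cp.random.rand(2048, 2048, dtype=cp.float32)
c = a @ b  # no per-device rewrite needed to keep this fast
print(c.dtype, c.shape)
```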
> GPUs are nice and powerful, but they are big, require separate computer and lots of power.
They don't need a separate computer, and they are no bigger than an FPGA. [We've had things like the Intel Neural Compute Stick for a while now](https://software.intel.com/en-us/movidius-ncs). You just plug it into the USB port of your Raspberry Pi and go. Google has the same thing, and NVIDIA has something similar.
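For a sense of how little ceremony that involves, here's a rough sketch of driving such a stick from Python, assuming OpenVINO's runtime and its MYRIAD plugin; the model path and input shape are placeholders, not a real network.

```python
# Rough sketch: inference on a Neural Compute Stick plugged into a Raspberry Pi.
# Assumes OpenVINO's Python runtime and MYRIAD plugin; "model.xml" and the
# 1x3x224x224 input are placeholders for an already-converted model.
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")            # IR files from the model converter
compiled = core.compile_model(model, "MYRIAD")  # target the USB stick

frame = np.zeros((1, 3, 224, 224), dtype=np.float32)  # stand-in for a camera frame
request = compiled.create_infer_request()
results = request.infer({0: frame})
print(results[compiled.output(0)].shape)
```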
That's another reason why this paper is junk. They compare against a server GPU tuned for maximum performance at the expense of power. They should also compare against the smaller embedded options in the space.