This is a terrible paper, and the authors didn't employ even the most basic scientific standards. They search over network architectures to find the one that hits their performance target on the FPGA, but they don't run the same search to find the best-performing network for the GPU. Of course the FPGA wins.
We have known for years now that deep networks are extremely compressible once they're trained. You can drop the vast majority of the weights and still keep almost all of the accuracy. Likewise you can drop the precision of the weights from floats to ints, to int8, and even to binary, and still perform well.
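To make that concrete, here's a minimal sketch of post-training weight quantization in plain NumPy (a hypothetical weight matrix, not anything from the paper):

```python
import numpy as np

# Hypothetical trained weight matrix standing in for one layer.
w = np.random.randn(256, 256).astype(np.float32)

# Symmetric linear quantization: float32 -> int8.
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize to measure the damage; the reconstruction error is tiny
# relative to the weight magnitudes, which is why accuracy barely moves.
w_hat = w_int8.astype(np.float32) * scale
print("max abs error:", np.abs(w - w_hat).max())

# Going all the way to binary keeps only the sign plus a single scale factor.
w_bin = np.sign(w) * np.abs(w).mean()
```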
This paper is a joke and deserves to be rejected anywhere it's submitted. They optimize the architecture of the FPGA network but not the GPU network. They don't apply the standard weight-pruning methods, and they don't compare against them. They also pick one GPU network essentially at random; plenty of object detectors are faster than YOLO, and people have explored the speed-accuracy tradeoff closely.
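For what it's worth, the "standard methods" are not exotic: plain magnitude pruning followed by fine-tuning already gets you most of the way. A rough, illustrative sketch (NumPy; the 90% sparsity level is arbitrary):

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` of them are gone."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(w), k - 1, axis=None)[k - 1]
    return np.where(np.abs(w) > threshold, w, 0.0).astype(w.dtype)

w = np.random.randn(512, 512).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.9)  # drop ~90% of the weights
print("nonzero fraction:", np.count_nonzero(w_pruned) / w_pruned.size)
```

In practice you prune gradually and fine-tune in between; the frameworks ship this out of the box (e.g. torch.nn.utils.prune).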
About FPGAs in general: there are good reasons why FPGAs have been "just over the horizon" for a long time. The toolchains suck and they're mostly very closed. Debugging is very hard. They cost 20x or so as much as GPUs. Your code becomes very specific to one particular FPGA, so upgrading is a total pain. And more. There are plenty of better solutions out there, like TPUs.
Only FPGAs have deterministic behavior, so they still have a bright future. Tools are another topic, but they are perfect for this particular task. It makes no sense to compare my two favorite IDEs, Qt and Vivado: their purposes are very different. FPGA debugging is easy when you can simulate all your code before going to hardware. Hardware debugging with an integrated logic analyzer is really easy. For hardcore projects you can use a third-party logic analyzer that streams every event of the system over a 10G Ethernet interface. Portability of the code depends solely on the author.
Edit: GPUs are nice and powerful, but they are big, require a separate computer, and draw a lot of power.
> Only FPGAs have deterministic behavior, so they still have a bright future
GPUs are perfectly deterministic: run the same code twice and you get precisely the same results. For now, anyway; the world is moving in the opposite direction. Networks don't need such strict determinism, and small errors in the computations make no difference. We can even show this quite reliably. So when GPUs trade some precision away, it will be a big step forward. But for now, they are totally deterministic.
That's the theory anyway. In practice, FPGAs are far more error-prone and far less deterministic than GPUs. If, for example, your timing constraints are too aggressive, you get subtle instabilities and bugs that are obscenely difficult to track down.
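If you want to sanity-check the GPU side of that claim yourself, a quick (hedged) PyTorch snippet like this is enough; it assumes a CUDA device is available:

```python
import torch

# Same weights, same input, two forward passes: on the same GPU, driver,
# and library versions the outputs come back bitwise identical.
torch.manual_seed(0)
model = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    out1 = model(x)
    out2 = model(x)

print(torch.equal(out1, out2))  # expect: True
```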
> Tools are another topic, but they are perfect for this particular task
Tools for FPGAs are a total and unmitigated disaster. Altera and Xilinx have been intentionally shooting themselves in the foot for decades by making their tools absurdly hard to use. The tools are all closed source, awkward, and poorly documented; you file a bug and hear back sometime around the next ice age; and they cost a lot of money. For the millions Altera and Xilinx must make from software licensing, they're killing their far more valuable hardware business.
> FPGA debugging is easy when you can simulate all your code before going to hardware
When you said this, I felt a draft and a dark cold settled in. You can't be serious? Plenty of designs simulate fine but fail on real hardware. And even simulation failures aren't easy to track down. Have you actually deployed any deep learning on FPGAs? Because I have.
> Hardware debugging with an integrated logic analyzer is really easy
You either haven't used GPUs, so you don't see how trivially pleasant writing code for them is by comparison, or you've never used FPGAs in an environment where something had to work. I've literally had to use oscilloscopes to debug FPGAs; that's not uncommon. And integrated logic analyzers disturb your design: bugs can go away the moment you add them.
> For hardcore projects you can use a third-party logic analyzer that streams every event of the system over a 10G Ethernet interface.
The fact that you said "every event of the system" convinces me you've never done this. There is no such thing.
> Portability of the code depends solely on the author.
It is not! To squeeze performance out of an FPGA you need to adapt your code to the particular resources on that FPGA (the logic blocks, the IP cores, and so on). The difference between FPGAs from different generations, and between FPGAs at different price points within the same generation, is huge. If you don't carefully align your code with the resources on your FPGA, you squander most of the gains. My GPU-based code? I write it once, and I'm good. I've seen people upgrade to far more expensive FPGAs, get really sad when nothing improved, and then spend months on end changing their code to get the higher performance.
Also: code that runs on one FPGA and is totally correct may be buggy and unstable on another FPGA. It's all fun!
> GPUs are nice and powerful, but they are big, require a separate computer, and draw a lot of power.
They don't need a separate computer and they are no bigger than an FPGA. [We've had things like the Intel Neural Compute Stick for a while now](https://software.intel.com/en-us/movidius-ncs). You just plug it into the USB port of your Raspberry Pi and go. Google has the same thing, and NVIDIA has something similar.
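For reference, running a model on the Neural Compute Stick is a few lines through OpenVINO. A hedged sketch (the model paths are placeholders and the Python API names have shifted between releases, so treat this as the general shape rather than gospel):

```python
import numpy as np
from openvino.inference_engine import IECore  # OpenVINO's Python inference API

ie = IECore()
# "model.xml"/"model.bin" are placeholders for a network converted to IR format.
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="MYRIAD")  # the NCS device

input_name = next(iter(net.input_info))
image = np.zeros((1, 3, 224, 224), dtype=np.float32)  # dummy input tensor
result = exec_net.infer(inputs={input_name: image})
```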
That's another reason why this paper is junk. They compare against a server GPU tuned for maximum performance at the expense of power. They should also compare against the smaller embedded options in the space.
The question in 2019 is not whether the FPGA outperforms a GPU, but whether it outperforms the various TPU-like accelerators: Google's Edge TPU, Apple's Neural Engine, Huawei's NPU, Bitmain's chips, and so on. And the answer to that question is very likely "no".