From long experience working on regular expressions, I'd say at least some of the FPGA acceleration projects rely on comparison against fairly daft software implementations to yield a speedup. I haven't worked on these particular problems, but I think there's a tendency to regard these systems as magic.
Remember that the restructuring needed to make things work well on an FPGA (regularization, finding lots of independent parallel work to do, removing branches, etc.) also works really well in software. One of the best things for my high-performance software craft on CPU was spending some time with GPGPU programming; I imagine the discipline of working with FPGAs would be similar.
I've seen a lot of FPGA work go by where the competing software implementation should have been pipelined, unrolled, and generally made less stupid. So unless the software implementation has an independent force keeping it honest (e.g. it's a production system being used elsewhere), be careful. Also be careful of the tendency of FPGA papers to find One Weird Case where the software does badly and benchmark mainly on that.
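To make that concrete, here's a toy sketch (purely illustrative, not taken from any of the papers I'm grumbling about) of the same byte-classification kernel written naively versus restructured the way you'd restructure it for an FPGA or GPU port: table-driven, branch-free, with independent partial sums a compiler can unroll and vectorize.

```cpp
// Hypothetical illustration: the same byte-classification kernel written
// naively (a data-dependent branch per byte) and restructured the
// "accelerator-friendly" way (table lookup, no branches, independent
// accumulators). The second shape is also the CPU-friendly shape.
#include <array>
#include <cstddef>
#include <cstdint>

// Naive baseline: one branchy comparison per byte.
size_t count_digits_naive(const uint8_t* buf, size_t len) {
    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        if (buf[i] >= '0' && buf[i] <= '9') ++n;
    }
    return n;
}

// Restructured: branch-free table lookup plus independent partial sums,
// so the compiler is free to unroll and vectorize.
size_t count_digits_fast(const uint8_t* buf, size_t len) {
    std::array<uint8_t, 256> is_digit{};          // zero-initialized
    for (int c = '0'; c <= '9'; ++c) is_digit[c] = 1;

    size_t acc[4] = {0, 0, 0, 0};
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        acc[0] += is_digit[buf[i + 0]];
        acc[1] += is_digit[buf[i + 1]];
        acc[2] += is_digit[buf[i + 2]];
        acc[3] += is_digit[buf[i + 3]];
    }
    for (; i < len; ++i) acc[0] += is_digit[buf[i]];
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

The point isn't this particular kernel; it's that the rewrite you'd do anyway for the accelerator usually pays off on the CPU first, which is exactly what a fair baseline should reflect.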
I work in the FPGA industry and I've supervised university projects, seen lots of research, hired PhD grads and so on. I've got to agree that one of the most frustrating parts of FPGA research is that it's almost uniformly done in comparison to the most laughable software implementations.
FPGAs in industry are used for a very small number of specific applications: smart NICs, the early stages of wireless networks (5G, while the standards are still being hammered out), military (where you need high performance with little regard for cost), and embedded/professional video (where the custom I/O is essential).
Generally, unless you're doing something that fits those applications well, the FPGA will not look good, and the same mistakes are made in research time after time. For data-centre work they're twice as bad. The four really glaring ones are always:
* Quoting performance without taking into account the time to get the data onto the FPGA, generally via a PCI-E link that kills any chance of winning vs. the CPU (see the sketch after this list).
* Assuming performance scales linearly as you fill up the FPGA (a full FPGA can't clock as fast as a 10%-full one without significant effort).
* Profiling only the part of the problem, or the subset of the data, that your code performs well on, and not reporting how it carries over to corner cases that CPUs would obviously handle well.
* Comparing against some noddy s/w solution when you've literally spent the last 3 years of your PhD optimizing the FPGA solution, having done no background reading to see what the state-of-the-art s/w does.
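On the first bullet, here's a sketch of what honest end-to-end accounting looks like. All names and numbers below are made up for illustration; nothing here is measured from a real system.

```cpp
// Illustrative-only benchmark skeleton: count host->device copy, kernel
// time, and device->host copy as one pipeline stage when quoting throughput.
// All structs, fields, and figures here are hypothetical placeholders.
#include <cstdio>

struct Timings { double h2d_s, kernel_s, d2h_s; };  // seconds

double end_to_end_gbps(double bytes_processed, const Timings& t) {
    const double total_s = t.h2d_s + t.kernel_s + t.d2h_s;  // not kernel_s alone
    return bytes_processed * 8.0 / total_s / 1e9;
}

int main() {
    // Made-up numbers: 1 GiB of input, a "fast" kernel, but PCIe copies dominate.
    const Timings t{0.35, 0.10, 0.05};
    const double bytes = 1024.0 * 1024.0 * 1024.0;
    std::printf("kernel-only:  %.2f Gb/s\n", bytes * 8.0 / t.kernel_s / 1e9);
    std::printf("end-to-end:   %.2f Gb/s\n", end_to_end_gbps(bytes, t));
    return 0;
}
```

The kernel-only figure is the one that ends up in the abstract; the end-to-end figure is the one the CPU actually has to beat.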
It just destroys a load of the research we see. The good applications are far less exciting, but MS Catapult is a great example: the reason it's competitive is that they're using the custom I/O of the FPGA to move data around really quickly; it's almost like a custom smart NIC.
Thanks for the detailed reply. My post may have seemed like partly-informed sour grapes, but your information fits in well with what I've seen.
In a number of the applications I've seen, the other killers are that on top of the transfer costs to the device that you mentioned, you also:
1. Have to get information back from the device - and in regular expression matching this might be 1 match in 1000 or 1 match in 5 if you're unlucky, and
2. Have to have a lot of parallelism to hit peak performance, yielding great throughput but so-so latency. At Sensory Networks, during our hardware stage, we had a "2 Gbps regex accelerator" (hah) that didn't even hit that modest number on a single stream - it actually required 14 or so streams each running at 142 Mbps.
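Rough arithmetic for point 2, using just the figures from the anecdote above (illustrative only):

```cpp
// Back-of-the-envelope: an accelerator quoted at ~2 Gbps aggregate that only
// gets there with many concurrent streams gives each individual stream far
// less, so single-job latency suffers. Numbers mirror the anecdote above.
#include <cstdio>

int main() {
    const double per_stream_mbps = 142.0;
    const int    streams         = 14;
    const double aggregate_gbps  = per_stream_mbps * streams / 1000.0;  // ~1.99

    // Time to push a 1 MB job through one stream vs. at the headline rate.
    const double job_bits         = 1.0e6 * 8.0;
    const double single_stream_ms = job_bits / (per_stream_mbps * 1e6) * 1e3;  // ~56 ms
    const double headline_ms      = job_bits / (aggregate_gbps * 1e9) * 1e3;   // ~4 ms

    std::printf("aggregate: %.2f Gbps\n", aggregate_gbps);
    std::printf("1 MB on one stream: %.1f ms (vs %.1f ms at the headline rate)\n",
                single_stream_ms, headline_ms);
    return 0;
}
```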
Many of the same sins are repeated for GPGPU.
The other thing that I notice is that the "noddy s/w solution" sometimes is the only thing out there. I looked at some accelerator work on Random Forest inference (not training) and - wow - all the RF implementations are naive. There are a lot of s/w tasks out there that no-one has bothered to optimize with any effort at all.
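For what it's worth, the kind of "less naive" layout I have in mind looks something like this: trees flattened into contiguous node arrays and traversed iteratively, with no per-node allocation or virtual calls. This is a sketch of the general technique, not any particular library's implementation.

```cpp
// Hypothetical sketch of a non-naive random-forest inference kernel:
// all trees packed into one contiguous node array, iterative traversal.
// Not drawn from any specific library.
#include <cstdint>
#include <vector>

struct Node {
    uint16_t feature;    // feature index tested at this node
    float    threshold;  // go left if x[feature] < threshold
    int32_t  left;       // index of left child, or -1 if this is a leaf
    int32_t  right;      // index of right child (ignored for leaves)
    float    value;      // prediction stored at leaves
};

struct Forest {
    std::vector<Node>    nodes;  // every tree's nodes, packed contiguously
    std::vector<int32_t> roots;  // index of each tree's root node
};

// Averages the leaf values reached in each tree (regression-style output).
float predict(const Forest& f, const float* x) {
    float sum = 0.0f;
    for (int32_t root : f.roots) {
        int32_t i = root;
        while (f.nodes[i].left != -1) {
            const Node& n = f.nodes[i];
            i = (x[n.feature] < n.threshold) ? n.left : n.right;
        }
        sum += f.nodes[i].value;
    }
    return sum / static_cast<float>(f.roots.size());
}
```

Compared with a pointer-per-node, object-per-tree implementation, this kind of layout is friendlier to the cache and to batching over samples, and it's the sort of baseline an accelerator paper ought to be measured against.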
However, when your adviser says "make a GPGPU/FPGA thesis", I think a smart PhD just goes and does that rather than sinking 6 months into building a really great s/w comparison. :-)
They present four papers in total that shed light on developments and deployments of FPGAs in data centers:
Project Catapult (Bing/Microsoft):
First paper:
> [...] provides insights into the development process of FPGA-based systems. The target application is accelerating the Bing web search engine. [...] The paper shows how such a system can improve the throughput of document ranking or reduce the tail latency for such operations by 29 percent.
Second paper:
> The web-search accelerator was based on a unit of 48 machines, a result of the decision to use a torus network to connect the FPGAs to each other. Not only is the cabling of such units cumbersome, but it also limits how many FPGAs can talk to each other and requires routing to be provided in each FPGA, complex procedures to achieve fault tolerance, etc. [...] Hence, the second paper describes the solution being deployed in Azure: the FPGA is placed between the NIC (network interface controller) of the host and the actual network, as well as having a PCI connection to the host.
The other two papers debate whether FPGA-based accelerators could instead be implemented as ASICs or other dedicated hardware. To do so, they discuss how FPGAs can be used in MySQL with an SSD+FPGA storage engine.
We also now have FPGA-accelerated ResNet-50 as a service on Azure, with more models in the pipeline. (I work on the Azure Machine Learning side of this stuff.)
https://www.usenix.org/sites/default/files/conference/protec...
There are also efforts to implement SQL on an FPGA:
https://www.nextplatform.com/2016/08/24/baidu-takes-fpga-app...