Genomics Marks the Next Sequence for FPGA's

dekhn · on Nov 2, 2015

Every time somebody has attempted to do this (for exmaple BLAST using custom ASICs or FPGAs), the next Intel chip- its general purpose capabilities, not any special stuff- has gotten fast enough and cheap enough to make custom hardware irrelevant. I don't see that changing any time soon- we're not actually blocked on any heavily CPU-bound problems in the genomics area right now, except perhaps machine learning on variants, and that's handled by GPUs, which are commodity hardware.

pjc50 · on Nov 2, 2015

Indeed. FPGAs only win if you have lots of integer operations that are truly parallel but require a moderate amount of memory bandwidth. GPUs win for floating point and CPUs tend to have far more cache than you can fit in an FPGA.

dekhn · on Nov 2, 2015

THat's beside my point too- the community doesn't have time to chase whatever the custom-solution-du-jour is this year. General purpose Intel processors are always a better investment because (IMHO) there isn't really any true CPU bottleneck at this point. Data storage, memory bandwidth, network bandwidth are the big three issues.

ethbro · on Nov 2, 2015

"Doesn't have time"... yet*

*This equation changes if suddenly the pace of semiconductor process miracles starts slowing down

dekhn · on Nov 3, 2015

Well, except that FPGAs and ASICs use the same process as general purpose chips, so they're susceptible to the same problems.

anonmeow · on Nov 2, 2015

The Moore's law has ended (doubling time has become longer than 2 years, and looks like 10nm-7nm will be the last manufacturing process for a long time). From now on we will see more special-purpose hardware.

dekhn · on Nov 3, 2015

special purpose hardware follows the same rules as Moore's law.

Last I checked, number of cores attached to large memory, on Intel chips, is still going up, and that's what affects throughput on embarassingly parallel jobs, which is what these are.

n00b101 · on Nov 2, 2015

> Every time somebody has attempted to do this ... the next Intel chip- its general purpose capabilities, not any special stuff- has gotten fast enough and cheap enough to make custom hardware irrelevant. I don't see that changing any time soon ...

This is a very interesting phenomenon. It was certainly true from around 1968 to 2005, that custom chips could not survive since the next Intel chip made them irrelevant. But I think that, not only is this going to be changing soon, in fact it has already started to change.

There are credible claims of BLAST (and other life science codes) being ~10x faster on GPUs than CPUs. [1] GPUs are specialized processors (with associated non-standard programming models and compilers) that have overtaken general purpose CPUs in many areas. This has been possible because, for the past 10 years, the "next Intel chip" has NOT been faster (single core clock speed) or exponentially cheaper (cost per core). Intel has been unable to make advances on these fronts due to fundamental engineering factors, such as the breakdown of Dennard scaling around 2005. [2]

GPUs (and FPGAs, and to some extent even conventional CPUs) have been able to continue advancing since 2005 because Moore's Law has been holding (even though the related Dennard scaling law has broken down). However, now we are seeing that even Moore's law is starting to break down. This is evidenced by the increasing delays in the Intel roadmap, with the most recent delay being 10nm process being pushed out to 2017. [3] It is now in question whether silicon process will ever even reach 7nm, and I don't think anyone is willing to bet that Moore's Law can continue on silicon beyond 7nm.

It is at the end of Moore's Law where custom chips become very interesting. This has been covered in 2013, in a presentation by Robert Colwell, who is the Director of DARPA's Microsystems Technology Office. [4] Colwell's thesis is that the End of Moore's law will revive specialized chip design. I find that prospect to be very exciting, not only for chip designers but of course also for software developers (especially compiler and programming language designers, as specialized chips are going to require specialized compilers, programming models and languages).

Of course there is also the other possible future path, within the 10nm to 7nm time frame (i.e. within the next 5 years) that Intel and others will find a way to extend Moore's Law, possibly by finding a viable alternative to silicon substrate. That would also be extremely exciting but somehow I don't think the future will be that simple (DARPA seems to take the view that this simple isn't going to happen, i.e. the Moore's Law exponential must logically end).

[1] https://www.nvidia.com/object/bio_info_life_sciences.html

[2] https://en.wikipedia.org/wiki/Dennard_scaling#Recent_breakdo...

[3] http://www.anandtech.com/show/9447/intel-10nm-and-kaby-lake

[4] http://www.hotchips.org/wp-content/uploads/hc_archives/hc25/...

dekhn · on Nov 2, 2015

I think you're missing the point that no important biological sequence problems are CPU-bound, they're IO-bound. Further, throughput in bio problems tends to be embarassingly parallel- so Intel putting more cores into chips with NUMA memory is a huge win. Next, when Intel puts more cores in a cheap, everybody's bio codes, which are embarassingly parallel, run faster because you just added more embarassingly parallel capacity. Nobody had to rewrite codes- in the GPU area, you have to rewrite codes every 2 years to compete with Intel throughput.

Moore's law is irrelevant here. It's about the total cost of doing science, and moving sequence analysis to GPUs hasn't really decreased the cost significantly. Note the first paper you linked to is from 2007- neither CPU nor GPU from that time period is relevant today. All the links I see for "GPU HMMER" point to a few marketing pages on the Nvidia site.

Whether BLAST is 10X faster on GPUs (its not) is irrelevant. It's that there aren't interesting problems to be solved by speeding up these kinds of calcuations in terms of single problem latency- what matters is throughput- aligning billions of reads in a short time- and those problems tend to be disk IO bound, not CPU bound.

I have no problem using GPUs- those are relatively easy to program now and we've raised a generation of grad students who can write codes to those platforms. They've proved their way.

It's ASIC and FPGAs which aren't competitive in this area.

thesz · on Nov 3, 2015

https://www.altera.com/content/dam/altera-www/global/en_US/p...

"We show that our implementation achieves 5.5x and 5.25x better performance per watt ratios than GPU and CPU implementations, respectively."

This is from Altera and they probably skewed to FPGA side. But still, FPGA can be very competitive if you consider power budget.

Xeon Phi costs $3900: http://www.amazon.com/Intel-Xeon-Phi-7120P-Coprocessor/dp/B0... It consumes up to 300W.

The average kWh in USA is $0.12: http://www.npr.org/sections/money/2011/10/27/141766341/the-p...

So Xeon Phi consumes up to $36 of electricity per hour, $864 per day and electricity start to dominate as quick as in 5 days.

Typical FPGA consume about 10 times as less power as Xeon Phi.

matt_d · on Nov 2, 2015

> I have no problem using GPUs- those are relatively easy to program now and we've raised a generation of grad students who can write codes to those platforms. They've proved their way.

> It's ASIC and FPGAs which aren't competitive in this area.

Out of curiosity, I'm wondering, what about the solutions directly attacking this problem -- i.e., ease-of-programmability & time-to-market?

For instance, I'm thinking of the Altera Software Development Kit (SDK) for OpenCL (AOCL) here -- I don't suppose this would be necessarily worse than "easy to program GPGUs", especially when targeting embarrassingly parallel problems (so, any overheads due to OpenCL model <-> FPGAs impedance mismatch, present due to OpenCL admittedly being originally designed for a very different hardware, could be in fact minimized here)?

In particular, the OpenCL examples don't look particularly complex (speaking as someone with GPGPU background): https://www.altera.com/support/support-resources/design-exam...

In addition, the capabilities present that allow to optimize-around loop-carried dependencies (a _huge_ problem for GPGPUs) like the pipeline parallelism made use of in the HPC examples (like the stateful PRNG; and which makes sense due to the specific nature of FPGA hardware -- more on that in a moment), seem to make this a more attractive platform for a significant set of number-crunching workloads.

This may very well be the right-tool-for-the-right-job decision. There are some very different trade-offs present regarding the kinds of parallelism natural to GPUs vs. FPGAs (admittedly it would be more precise to say "SPMD" instead of "SIMD" in the following, but I don't think it takes away from the key point): "The key difference between kernel execution on GPUs versus FPGAs is how parallelism is handled. GPUs are “single-instruction, multiple-data” (SIMD) devices – groups of processing elements perform the same operation on their own individual work-items. On the other hand, FPGAs exploit pipeline parallelism – different stages of the instructions are applied to different work-items concurrently."

Source: https://www.altera.com/en_US/pdfs/literature/wp/wp-201406-ac...

I don't believe that either kind is universally/strictly "better" than another, so it's all about the use cases -- at least that's how I think about it, perhaps I'm missing some other trade-offs?

Regarding the I/O-bound problems: Isn't this another reason for the attractiveness of high-performance FPGAs -- like, say, Stratix, compared to GPUs? What I'm thinking of is that you can have plenty (relative to GPUs) of very high performance (here, relative to both GPUs -- as well as high-end CPUs) SRAM caches, e.g., QDRII+ SRAM: http://www.cypress.com/products/sync-sram

For instance, one example would be the QDRII+ SRAM options in the block diagram here: http://www.alteraboards.com/product/s5-pcie-hq/

Myself, I'm still unconvinced about the best choice w.r.t. the consistent performance/price ratio maximization (both the device as well as the programmers costs) -- both high-end FPGAs as well as high-end GPUs seem rather on the expensive side, either way (well, and very high-end CPUs too, for that matter).

Completely independently of the above: I'm wondering, what do you think are the reasons for Intel investing in the partnership with Altera and developing its Xeon+FPGA hybrid hardware? I presume there must be something to it, it's a potentially large amount of resources to dedicate for a hardware project.

epistasis · on Nov 2, 2015

The problem that the FPGAs are trying to solve here, sequence alignment, is the most CPU intensive part of a bioinformatics pipeline, but it still takes very few computation cycles per byte of input/output. It's basically bound by the time to seek to random spots in DRAM, after input/output is fully saturated, which isn't hard to do with 40Gbps network cards.

There's little to gain from making this far faster, either, as the computation costs are still approximately rounding error compared to the cost of the data generation.

It's fun to think about, but there's little to win from this all at the moment. Once sequencing costs drop another 50-100x, it will become more practical.

noipv4 · on Nov 2, 2015

Convey computers used to sell 4U servers with lots of Virtex-6 FPGAs to accelerate short read mapping/alignment, and assembly. They got bought over by Micron, probably to help them port Bioinformatics applications to Micron's Automaton computing chips.

amelius · on Nov 2, 2015

I would like to see some numbers. To begin with, how many FPGAs can replace 1 Intel CPU, in terms of compute power? And what does an FPGA solution cost in dollars, versus a CPU-based solution (including motherboard, memory, powersupply, etc.)?

bravo22 · on Nov 2, 2015

a customized pipeline for a specific solution will always outperform a CPU, which is designed to be pretty good at a lot of things but not great at one single thing.

As for price... that depends on how much they want to charge for it ;)

amelius · on Nov 2, 2015

I'm not sure that a customized pipeline is always faster than a CPU. I guess that in a lot of cases, memory access will be the bottleneck.

Regarding the price, I'm talking about raw hardware costs (because it shows what profit can be made by making customized solutions).

elij · on Nov 2, 2015

Accelerated reference alignment (via something like an FPGA) is definitely a way forward as the problem is now relatively well understood.

Would be interesting to see where things go with respect to de-novo and graph alignment.

dharma1 · on Nov 2, 2015

has anyone used opencl to program FPGA's? How was it? What's the perf like?

gjulianm · on Nov 2, 2015

I haven't dived deep in it, but we tried an OpenCL program in a FPGA and we couldn't even make it work: OpenCL support is at version 1.0 (no vectors, no barriers) without double precision numbers. And, from what I've heard, the performance is not that good.

arca_vorago · on Nov 2, 2015

"We can analyze RNA, agricultural biology, different pipelines for cancer research—all of these have widely varying pipelines and some are just nuanced, but they all require something different to be loaded into the FPGA before that run.”"

Therein lies the rub. I was a sysadmin for a biotech company doing this kind of stuff, and in the end we evaluated fpga/asic and just couldn't justify the costs, not just hardware costs but time and barrier to entry costs, such as having to hire more programmers. I also kept a close eye on GPUblast, but it just wasn't keeping up with official, and due to pipeline specialties, mostly revolving blasting against all kinds of data, it didn't fit the bill. That's the main issue, that people doing blasting and related often have very specialized pipelines and different types of data and would almost be required to spend a lot of money to get such a system to work for them. It much less about the speed of the computation than it is about moving the data around. When you start dealing with 100Tb's of output a day...

What we did end up doing was two things, one, we wrote worker system that basically distributed the computation to almost every computer on the network, and two, I figured out that due to how embarrassingly parallel the computation was, AMD's newer opteron chips actually outperformed Intel's, and I build some beast servers, and took computation down from average of 1~3 days to ~4hrs. (think 64 actual core's via quad cpu motherboards... truly some of my favorite servers I have ever built.) Bottom line is that IO is the main bottleneck, so even things like ROCKS clusters or other distributed systems don't work as well as many of the big data people like to think.

We also had a very different than normal pipeline mostly revolving around which data to blast against and the methods we used, which were considered the core IP/proprietary, but the real IP was in the analysis.

As FPGA's, Asics, and GPU's make it easier to get the data, you still have to understand it and analyze it, and that is much harder than people understand IMHO.

I left for personal reasons, and I'm still under a non-compete, but I am very thankful for the opportunities I had to learn the bioinformatic's side under the eccentric genius I did, and perhaps one day I will turn some of my knowledge into a system that is useful for those wanting to do this kind of work. Also, I will never forget it as the first place I ever got asked to build a petabyte system...

For example, IO limitations were mainly centered around accessing large amounts of data that wouldn't fit in 128/256Gb of RAM setups, which ended up usually on disks. Even on ZFS/btrfs raid0 ssd's, interface limitations were being hit daily..(sata, backplanes, etc) and the recent PCIE-SSD revolution would really be able to make a big dent in that particular issue.