How many 32-bit RISC CPUs fit in an FPGA? Now vs. 1995? (xilinx.com)
179 points by luu on March 27, 2014 | 114 comments



So, about 15 years ago I took a college course on microprocessor design, and our final project was to implement a simple microprocessor on an FPGA.

At the time, it seemed obvious that, over the next decade, FPGAs would work their way into general-purpose computing, so that (for example) Photoshop filters would simply reconfigure the FPGA to run blazingly fast. Likewise with games, or video codecs, or 3D rendering, or whatever else was processor-intensive.

But that clearly hasn't happened. Instead, GPUs took off as the main computational supplement to CPUs.

Does anyone here have any insight as to why? Is there a technological reason why FPGAs never turned into general-purpose hardware standard on every desktop and laptop? Has it been a chicken-and-egg problem? A standardization problem? Or something else? Do FPGAs still have potential for general-purpose consumer computing? Or are they going to be forever relegated to special-purpose roles?


Moore's law is quite amazing.

I believe there are 4 reasons why GPUs, rather than FPGAs, have taken off in conventional computing:

1. Last I checked, the FPGA vendors will not open up their toolchains, and do not even document the bitstream formats. They will claim NDA, proprietary, etc. This has the massive side effect that you are stuck with their bloated, slow, crappy toolchains. If this were open, I guarantee hackers would be inventing all kinds of interesting ways to convert their software into FPGA bits.

2. FPGAs are VERY hard to write and debug. You have to write your design in an HDL (either VHDL or Verilog), and you have to prototype the design in a software simulator first (and of course these tools are either quite pricey or, if free, usually limited or hard to use). Only then can you synthesize the design and download it into the FPGA to run it.

The next problem is debugging your design. The entire internal state of the FPGA is only accessible through slow scan, unless you dedicate a portion of your design to "monitors", which tap the traffic and store their values into internal RAMs. So you may have to respin the design just to get more monitors to debug where the issue is.

3. FPGA compilation is SLOW. When I used them professionally a few years ago, a Virtex-5 could take multiple hours to resynthesize/place & route a medium-sized design. I believe the Virtex-7 they are advertising could take over a day to respin if you change your design.

4. Most new machines already have built-in graphics with a GPU that can be used as a general-purpose GPU. No one ships FPGAs in any conventional computer.


(2) can probably be addressed with OpenCL -- Altera seems to be working on an SDK[1] that allows you to write C code which, as I understand it, would compile to an image that you could then program onto your FPGA (or you could just compile for execution on a processor). So fortunately, no Verilog or VHDL necessary.
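
To give a rough feel for what the host side of that flow looks like, here is a minimal vector-add sketch using pyopencl against a standard OpenCL runtime. The kernel and names are my own illustration, not Altera's SDK; with their FPGA flow the kernel would be compiled offline into an FPGA image and loaded as a prebuilt binary rather than built at runtime as it is here.

    import numpy as np
    import pyopencl as cl

    # Trivial data-parallel kernel: c[i] = a[i] + b[i]
    kernel_src = """
    __kernel void vadd(__global const float *a,
                       __global const float *b,
                       __global float *c) {
        int i = get_global_id(0);
        c[i] = a[i] + b[i];
    }
    """

    ctx = cl.create_some_context()      # pick any available OpenCL device
    queue = cl.CommandQueue(ctx)

    a = np.random.rand(1024).astype(np.float32)
    b = np.random.rand(1024).astype(np.float32)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    # On a CPU/GPU this builds at runtime; an FPGA flow would load a
    # precompiled image at this point instead.
    prg = cl.Program(ctx, kernel_src).build()
    prg.vadd(queue, a.shape, None, a_buf, b_buf, c_buf)

    c = np.empty_like(a)
    cl.enqueue_copy(queue, c, c_buf)
    assert np.allclose(c, a + b)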

(3) is another issue, but I don't think the consumer would necessarily need to worry about compilation. The developer would just include the compiled programming files for different FPGAs in the application.

If you mean that it'll be slow on the developer's side, that's definitely a valid point. I'm sure, however, that you'll see FPGA manufacturers start to move toward remote compilations so that you're not necessarily limited by the hardware you have in-house.

[1] http://www.altera.com/products/software/opencl/opencl-index....


More about (3):

Altera calls it "LogicLock," Xilinx has a different term, but the idea is that you don't need to re-synthesize the entire FPGA for every change. In fact, you may not want to. If you are tweaking a certain region, you're usually happier if place & route doesn't reroute your logic in a way that displaces a net in another block so that the timing is now off in that other block.

For an FPGA, timing is how you measure performance, and getting the best timing can take quite a bit of work. Being able to lock it down once you've got it right is a big plus.


Xilinx's ISE calls it "SmartGuide".

It's kind of a mixed bag. It's worked okay for me if changes are truly minor, but if there are large changes to the logic it doesn't seem to be very good about "forgetting" what it learned from the previous pass. Three or four times this week I've had a design fail to make timing with SmartGuide, but work when doing P&R from scratch.


(2) is called High-Level Synthesis. While a great idea, and quite practical, it does not relieve you of the burden of understanding the FPGA. Generally you have to make your C code fit a very rigid format that compiles to a pipeline or similar - the advantage is that you can test it as C code, not that you can take off-the-shelf code and run it on an FPGA. There would be no point to that - a hard processor will always be faster at running arbitrary C code.

In addition, most of these tools compile to HDL, so they only add to compilation time.


Regarding (3), remote compilation: Altera is lifting the hardware burden off developers with remote compilation in the cloud. See cloud.altera.com for Altera's tool extension.


Interesting points. However, it's a bit hard to complain about it being slow (#3) considering the enormous complexity involved, no? I remember compiling things on a 486 (when I got into Linux) being pretty damn slow as well for anything substantial.

I don't mean to excuse the manufacturers, but at the same time they seem to be selling into a pretty small market and it's not clear to me that opening things up will magically lead to a big expansion in chip sales that will negate the competitive risk of being the first to open up. If you have time, I'd like to learn more about this since you seem to have a lot of experience with this technology.


I am certainly not expecting compiling HDL to be the same as compiling software, as yes, it is drastically more complicated. However, compile times are growing more than linearly with respect to the number of gates (or logic blocks) in each FPGA, whereas software compile times grow roughly linearly as programs get larger. You also have the other dimension of getting your block to meet timing, so it can run on the FPGA at a guaranteed frequency. This is definitely a non-trivial problem, but one that I believe would be better solved by hackers if there were a "GCC" for Verilog FPGA synthesis.

In my opinion there is very little for the manufacturers to gain by keeping their bitstream formats proprietary and undocumented. I don't think there is a competitive advantage, as all the manufacturers are pretty much doing the same thing. And their FPGA block diagrams are already open and documented (you can see how many flip flops, clocks, and muxes are in each logic cell, how the routing works, and where the memory cells and other units are).


I have only passing familiarity with FPGAs, so perhaps you can excuse my ignorance.

I was under the impression that FPGA vendors often license functional blocks (like PCIe SERDES) to FPGA users. Might it be that part of the purpose of obscuring the bitstream format is to make it more difficult for customers to use those functional blocks without paying the toll?


The bitstream format, at least for Xilinx, is not that obscure. It was actually documented in an old application note.


Take a look at VTR (formerly VPR): http://code.google.com/p/vtr-verilog-to-routing/. It's an academically developed tool for doing FPGA place and route. At the end of the day, you'll still need to use the proprietary tools to convert to the appropriate bitstream, but this is an open source solution for the "heavy lifting" portion. However, last I checked, the solutions produced by VPR aren't as good as the commercial tools.


> However, last I checked the solutions produced by VPR aren't as good as the commercial tools.

Well, that's no surprise: FPGA vendors spend a lot of manpower on improving their place & route software. If you wanted to build something competitive, you'd need a lot of money plus access to proprietary, non-public information.


Bingo. FPGA routing is NP-hard; compiling software is generally in P.


If you take the travelling salesman problem, you can dramatically simplify the problem by constraining the salesman to visit all of the cities within the same state sequentially.

Similarly, you can reduce the complexity of routing calculations by applying some constraints. You will potentially lose the possibility of an optimal solution, but you will gain a far faster compilation time. As always with engineering, it's a trade-off.
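
To make the analogy concrete, here's a toy Python sketch (entirely hypothetical cities and regions): a greedy nearest-neighbour tour over all points versus one constrained to finish each "state" before moving on. The constrained tour may come out longer, but each sub-problem is much smaller - the same quality-for-runtime trade-off a constrained place & route makes.

    import random

    random.seed(0)
    # 40 cities, each tagged with one of 4 "states" (regions)
    cities = [(random.random(), random.random(), random.randrange(4))
              for _ in range(40)]

    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

    def tour_length(tour):
        return sum(dist(a, b) for a, b in zip(tour, tour[1:]))

    def nearest_neighbour(points):
        # Greedy tour: always hop to the closest unvisited city.
        points = points[:]
        tour = [points.pop(0)]
        while points:
            nxt = min(points, key=lambda p: dist(tour[-1], p))
            points.remove(nxt)
            tour.append(nxt)
        return tour

    # Unconstrained: every step searches all remaining cities.
    free_tour = nearest_neighbour(cities)

    # Constrained: solve each state separately, then concatenate.
    constrained_tour = []
    for state in range(4):
        constrained_tour += nearest_neighbour([c for c in cities if c[2] == state])

    print(tour_length(free_tour), tour_length(constrained_tour))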


Yep, global vs. local routers are kinda like stay-within-the-state.

One of my favorite ideas on this is space-filling curves: http://www2.isye.gatech.edu/~jjb/mow/mow.pdf
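
In that spirit, here's a tiny sketch of a space-filling-curve ordering (a Z-order/Morton curve rather than the curve used in the paper): interleaving the coordinate bits gives a 1-D key, and sorting points by that key tends to keep spatially close points adjacent in the resulting order, which is a cheap heuristic for a visiting/routing order.

    def morton_key(x, y, bits=16):
        # Interleave the bits of x and y into a single Z-order index.
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i)
            key |= ((y >> i) & 1) << (2 * i + 1)
        return key

    points = [(3, 7), (100, 100), (5, 6), (98, 97), (4, 8)]
    # Nearby points end up next to each other in the sorted order.
    print(sorted(points, key=lambda p: morton_key(*p)))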


> a bit hard to complain about it being slow (#3) considering the enormous complexity involved, no?

It is if the original premise was to make Photoshop filters fast. A GPU can make my Photoshop filters fast now; an FPGA implementation can make them fast 8 to 24 hours from now.


Not if you upload precompiled bitstreams into the FPGA.


Laying out components on a chip and routing non-overlapping edges (i.e. wires) is called "orthogonal edge routing". Graph drawing algorithms don't get much attention outside their niche (oddly, to me at least), but this is one area where they have profound importance.


I've made a note of the term "orthogonal edge routing", hopefully for eventual incorporation into my own software (http://www.nitrogenlogic.com/docs/palace/). Thanks.


"You have to write your design in an HDL language"

What???? There is a very mature set of tools for converting MATLAB to HDL.

http://www.mathworks.com/products/hdl-coder/

http://www.mathworks.com/products/hdl-verifier/

http://www.mathworks.com/products/filterhdl/

I'm sure there are similar tools for other languages. It requires you to program in a slightly different way (certain operations aren't optimal for FPGAs), but is extremely user-friendly.

At the company I work for, all FPGA programming is done in MATLAB. Unfortunately I work in a different department so I can't give you any technical details, but from what I understand, no one has written things directly to HDL in years.


    Some people, when confronted with a problem, think
    "I know, I'll use MATLAB." Now they have two problems.
(Apologies to jwz.)


It depends on what you are doing. There are tools in the Xilinx tool set enabling you to design filters or other things in MATLAB Simulink, which are very convenient. But if you want to write a processor, however small, you cannot really do that well. Because the hard part is getting a good design with proper timing and synchronization between different parts. I think it's harder to get such a good design using MATLAB/C to HDL tools than it is designing directly in HDL.


The GPU vendors don't exactly have open toolchains either. OpenCL is not open in the sense that we get to write assembler for the GPU... it is only a little better.

But GPUs got shipped in volume. I think they were just cheaper for the performance level.


Not sure how relevant it is, but isn't PTX available through LLVM?


LLVM has an open-source PTX backend, and newer versions of the official CUDA compiler use LLVM to generate PTX internally, but PTX is a device-independent intermediate layer, and the PTX-to-SASS compiler is closed-source.


2. HLS tools exist[0][1][2] to convert C to HDL for FPGA programming, and their results are used in production designs. As @jangray says below, debug tools are sold by commercial vendors as a value-added capability for production teams; it's not a "freemium" market.

[0] http://en.wikipedia.org/wiki/High-level_synthesis [1] http://www.xilinx.com/products/design-tools/vivado/integrati... [2] http://www.synopsys.com/Systems/BlockDesign/HLS/Pages/defaul...


I agree with your reasons.

Also, GPUs might be a better match for the kinds of codes people care about. The world didn't need arbitrary bit-level computations except in rare cases; it needed insane memory bandwidth and high floating point throughput (exactly what a Photoshop filter would need). The generality of most FPGAs means they're not great for standard circuits that can be optimized. Maybe this means the FPGA market might see some success with a different trade-off of flexibility vs. fixed hardware. The rise of FPGAs like the Zynq with dedicated processors or distributed RAM and DSP units is already happening.


1. is interesting. I always wondered if a completely open FPGA vendor would have any chance in the market.


I'm reminded of Viva from Starbridge, which was supposed to make FPGA "programs" easier to build and debug using large generic blocks.

I have no idea what happened to them, but I suppose the problem was harder than they believed or at least claimed.


It got bought by Data I/O and renamed Azido. I've used it briefly and it made me beg to go back to Verilog.


Wow. That really says something (to me anyway), begging to go back to Verilog.

There are a lot of pain points in the HDLs, but it seems like Verilog has more than the others.

I saw someone working on a Clojure HDL; I think it might have compiled down to or emitted Verilog. I thought it was more confusing than the HDLs to begin with, but depending on one's background it might make more sense.


Probably only 4. matters.


I'm of the opinion that #2-#4 flow from #1. Fix #1 and the rest will eventually go away, so #1 is the one that matters.

My reasoning for each:

#2. More open tools would allow alternative programming models. For example, gcc already has a VHDL front end. Why not a gcc back end for an FPGA? That would open the door to more familiar languages.

#3. More open tools and specifications would allow programmers to start optimising and rethinking the FPGA compilation process, potentially leading to radical reductions in run times.

#4. People won't want FPGAs in their machines until they are easy to use. Solving #1 (and consequently #2 and #3) will make FPGAs easier to use, increasing demand and prompting manufacturers to consider including programmable logic in their machines. Granted, it's a chicken-and-egg situation between adoption and better tools, but opening the tools and specifications up could break the cycle.


#2: Intermediate representations for hardware are vastly different from intermediate representations for software. That's because the execution model for hardware is vastly different from the execution model for software. You'd need a Sufficiently Smart Compiler(TM) to convert from the latter to the former and get even the slightest amount of efficiency (in general – I'm not talking about specialized DSP-filter-to-HDL tools).

#3: No. The information needed for synthesis (HDL to netlist), mapping, and placing is publicly available. These topics are actively researched, yet so far no truly usable open source tool has emerged. Routing tools, though, are not possible, because the necessary information is not available.

#4: Sure, solving #1-3 would make FPGAs easier to use, but #2 and #3 don't follow from #1 even if #1 was satisfied.


Regarding your comment on an "FPGA backend" for GCC, you have to understand that simulating a VHDL design (which is what that front end implements) is a drastically simpler task than synthesizing an FPGA image. Logic optimization, place and route, timing analysis--these are things entirely out of the scope of the GCC project, and the details differ significantly between FPGA vendors and between an individual vendor's products. It just isn't a realistic goal.


Agreed, that it is outside the current scope of gcc. Given that gcc is able to handle a simulation, that would indicate that gcc's intermediate representation is able to capture the semantics of a VHDL netlist? That's where I'm starting from.

Assuming the above, I'm thinking of a project, independent of gcc, that takes gcc's intermediate representation and does all the FPGA specific tasks that you mention. Yes, it would be a huge project, comparable in scope to gcc itself, and even that might be an underestimate. It could start small, to make it realistic, then incrementally expand its scope, just like linux and gcc did. Eventually, the FPGA vendors might have to choose between participation or losing customers? It might be able to exploit some of gcc's backend infrastructure in the FPGA process, but who knows?

> It just isn't a realistic goal.

Or it's a red rag to a bull, to the right person. :-)


> #3. More open tools and specifications would allow programmers to start optimising and rethinking the FPGA compilation process, potentially leading to radical reductions in run times.

Including using an FPGA to accelerate that process -- probably possible since from what I know there is a lot of parallelism involved in synthesising logic.


> there is a lot of parallelism involved in synthesising logic.

But not the kind of parallelism that's fast on an FPGA.


I can answer this, as an electrical engineer.

FPGAs are horrendously inefficient, space- and power-wise, for many tasks. Much of the core is taken up by the programmable routing between different components, and generally most components (such as LUTs and RAM) will not be fully utilized.

Yes, the Virtex-7 (a VERY EXPENSIVE FPGA) can hold 1000 very simple 32-bit cores.... but a top-end NVIDIA GPU has 2688 CUDA cores. While not entirely independent, these CUDA cores have far deeper pipelines and far superior ALUs to the one in the article. If your software fits the programming model, GPUs will handily beat an FPGA.

FPGAs are great for prototyping ASICs, and for cases where timing is of critical importance - try implementing a VGA video generator on a CPU. Basically everywhere an FPGA would excel, an ASIC excels more, but FPGAs are great for low-volume and/or specialty hardware where a GPU is not good at accelerating the task at hand.


> the Virtex-7 (a VERY EXPENSIVE FPGA) can hold 1000 very simple 32-bit cores.... but a top-end NVIDIA GPU has 2688 CUDA cores.

But the point of an FPGA is not to build cores inside it; the OP article is just an exercise. Application-specific logic built on an FPGA might be much more efficient than generic CUDA cores.

Also, it is a given that an ASIC will be more power-efficient than an FPGA, but the FPGA will be generic and hence more cost-efficient.


Agree with your points that FPGAs are great for low-volume prototyping/one-offs, interfacing, and when timing is of critical importance.


A contributing factor is that the Partition-Place-Route (PPR) software, that is necessary to map a design onto the FPGA die, is proprietary and controlled very tightly by the FPGA vendors, as are the specifications of the raw dies and the meaning of each bit in the corresponding bitstream.

Consequently, FPGAs don't lend themselves to an uncontrolled explosion of innovation. Compare with GPUs and CPUs where open source compilers are available and anyone can have a crack at innovating, at whatever level they choose.

As an example, if you want to dynamically generate programming for a Xilinx FPGA, you need to incorporate Xilinx's binary only PPR program somewhere into the flow. That acts as a ball and chain around the leg of reconfigurable computing, impeding its development and adoption.


One guess would be that GPUs address a wider market segment (everybody needs a display + hardware to drive it). More people buying GPUs, lower prices, more $$$ for R&D, repeat.

Another likely issue with FPGAs is that for everything but raw compute they still need support circuitry (physical I/O ports, memory interfaces, etc) that are different for every app but not really "reconfigurable" in the same sense as the FPGA fabric.

Finally, I'd wonder if reconfiguration time is part of the problem - until relatively recently, reconfiguring an FPGA was all-or-nothing and could take multiple milliseconds. Not a big deal when configuring a device once on boot, but serious headwind when trying to context-switch between different jobs that need FPGA assistance.


I think GPUs took off because the specific problems they solve were quite well-defined and enough people in the videogame/3D graphics market agreed about which protocols to use. Even for things like Photoshop and other general-purpose graphics software, GPU support seems to be a bit hit and miss.

I actually have 2 FPGA-based devices that I use daily - polyphonic analog synthesizers, to be specific. Manufacturers are tight-lipped about how they're using the FPGAs, but it appears to be for ultra-rapid reconfiguration of analog circuit topologies without the load time delays that result from a traditional microcontroller > D/A converter arrangement. There are also field-programmable analog arrays on the market, but I have yet to see one in a commercial product and it seems like they have some way to go before being economical for audio synthesis applications.

I love FPGAs although I don't know much about how to get started with programming them. There is an ASIC coming out of patent in a year or so which I'd like to re-implement in an FPGA package, and I've thought about reaching out to the original architect who lives nearby and is a friendly fellow. I'm not sure how feasible this is, though.


I'd second the DE2 suggestion. It's a cheap, but pretty versatile board. It has lots of companion demos and tutorials, as well as a ton of resources online from other starters asking "How the heck do I do <task> on my DE2???" on forums and such.

Quartus II Web Edition is Altera's free IDE, and it comes with pretty much everything you need: a big suite of libraries (most of which are also free to use), a graphical entry environment, an HDL editor, the full synthesis toolchain, and integrated Eclipse tools for writing embedded C/C++/ASM code. Other than that all you'd need to get is ModelSim-altera, the free version of the standard simulation environment.

Altera also has some pretty comprehensive (i) free online training, (ii) IP block, i.e. library, documentation, and (iii) complete reference designs. I'd recommend checking out all three, especially (i), since at the beginning it's easy to get tunnel vision just learning VHDL or whatever, and then realize that you're somewhat clueless as to how to actually get things done in any useful capacity. There's a dizzying amount of jargon and proprietary bullshit, so it's useful to just have someone tell you what everything means, and How It's Done(tm).

All that said, if that ASIC is anything complicated it might be a bit of a big project to jump into. That and it's possible that it would require more resources than an entry level board will supply.


To get started with FPGAs just download one of the free IDEs and have a play. You can buy dev boards for under $100 to get some feedback that what you are doing really works.


Can you recommend some boards? Whenever I looked into it, I got overwhelmed with all the options and I was not quite sure what to look for.


The DE1-SoC board is $199 or $150 with a student discount. http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=E...

It has an ARM processor on the same chip as the FPGA, which I've found to be incredibly useful.

If you don't want the ARM processor, the regular DE1 is somewhat cheaper: $150 or $125 w/ student discount.

http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=E...


Yeah, the SoC boards are nice. So much of what you want to do on an FPGA these days requires the use of a processor anyhow, you might as well have an actual hard block instead of a NIOS II or whatever. We use DE1s at my company as the FPGA analog (pun not intended) of an Arduino Uno, for when we want to do a quick test of some part of a design in hardware, or want to stick the device into something but don't want to tie up a high-end unit for something trivial.

Be aware that the DE1 has pretty limited resources for some things. For example, the on-chip memory can get tapped out pretty quickly when doing embedded applications, and it has less I/O and less RAM available off-chip.

Also note that the DE1-SoC actually uses a Cyclone V device, whereas the original DE-1 uses a Cyclone II, a nearly discontinued chip that is a big step down.


Intrigued by FPGAs' capabilities when checking out my classmate's profile http://orangesorter.com/, I started looking into some development kits to learn from.

I found great documentation and economical boards starting at $55 available here: http://www.xess.com/store/fpga-boards/

BTW, I also found Scala based hardware construction language Chisel very interesting https://chisel.eecs.berkeley.edu/.


The "Parallella" hybrid board is awesome, but unfortunately has been saturated by orders, probably due to being $99. It contains a Xilinx dual ARM CPU with FPGA plus a custom Epiphany 16-core processor. http://parallella.org/


The Altera DE2 is great to start off with, and it's also heavily discounted for University students.


The Basys 2 is a little old, but it's very simple and easy.


Which synthesizers?


An Elektron Analog Four and a DSI Tempest (possibly about to be replaced with another Elektron box). I can dig up pictures of the PCBs if that's helpful but I don't have the knowledge for detailed analysis of how it works.

Thanks for the other replies on dev boards etc.


Anytime an algorithm is important enough to be of interest to hundreds of millions of users, it is worth the investment to etch its primitives into silicon. And if there are new and exciting computations that will only be in hardware next year, CPUs are powerful enough that software can tide us over until the silicon arrives.


Pretty much the same answer from a CPU researcher formerly at Intel, in a reddit AMA.

"I like FPGAs, but I doubt they will ever become widely deployed. They pay ~20x overhead, so any algorithm that is a good fit for them becomes a new instruction in the next CPU generation. The reprogrammability is only a feature in highly constrained (i.e. niche) environments."

http://www.reddit.com/r/IAmA/comments/1yj77b/as_requested_i_...


Especially when the price of entry to ASIC land is relatively low by industrial standards - http://electronics.stackexchange.com/questions/7042/how-much...

E.g. for something very simple you can do it for thousands or tens of thousands... but in that price range it's probably still more economical to implement it in software or do it discretely with SMT, unless you are very sure about the existence of a market.


It's not cheap. If you want to compete with CPUs/FPGAs/GPUs, you'll need to use the latest manufacturing processes and do complex designs, and now we're talking about more than a 100 million USD investment.


Not unless you're trying to compete with Intel. Consider the various bitcoin ASIC miners: beat all other technologies for solving that problem for an investment of a few million.

However, algorithms are so rarely the stumbling block for applications; data storage, management and communications are. The major exception is graphics, hence the GPU.


A bitcoin miner isn't a general processing unit. It does something specific that's very, very easy to optimize and implement, and you don't use the latest manufacturing process but tools like eASIC, hence the small investment.


I don't think he was suggesting that developing a general-purpose CPU that is competitive with Intel's offering costs only 10k. He said that getting your ASIC design, i.e. application-specific integrated circuit, etched into silicon is cheaper than it used to be, and that's true. I know guys who implemented things like novel power converters on an IC as their master's projects, using only the budget of, well, master's projects.

The money is a huge barrier to entry for hobbyist types, but if we're talking about commercial stuff, it'll probably cost you a lot more to pay for the engineers who are competent enough to implement something that will actually work than it will to do the fabrication.


Hence my comment about it being cheap for very simple applications. If you want to compete with market leading vendors then yeah of course it's going to be hideously expensive.


Somewhat tangential, but the open hardware laptop that bunny and xobs are designing has a built-in FPGA, so you aren't the only person who thinks it would be nifty to have one. Personally, I think it'd be lots of fun to hack with and would open up a lot of interesting projects (perhaps a port of the 2048 game to an FPGA??).

http://www.kosagi.com/w/index.php?title=Novena_Main_Page


They don't have that FPGA to offload the CPU; I think it's rather to interface with external components. GPUs don't come with external IO, and for a lot of high-speed buses you are just dead in the water without an FPGA to handle the delicate timing.

That is really the unique benefit of FPGAs: the implicit parallelism makes them a really great fit for the kind of high-speed bit banging that would be impossible to get right on a normal CPU (not to mention very difficult to program in the first place).


How do you virtualize an FPGA? A GPU can be virtualized and shared between several processes. It takes waaay too long to flash an FPGA to implement time-division multiplexing on it. All you can do is "slice up" the FPGA and give different processes different portions, and even that is a security vulnerability waiting to happen unless the FPGA is designed to keep these "blocks" separately contained.

How do you balance the need for high-performance communication between the FPGA and the rest of the computer with the inherent inability to trust whatever configuration has been loaded into the FPGA? Software security is hard enough without infinitely reconfigurable devices lurking inside our machines.

How does a possible FPGA configuration work optimally across a wide variety of FPGAs (which can have various "built-in", non-reconfigurable components) and sizes that can only be determined at "flashtime"?

What exactly do regular users need an FPGA for that isn't already handled by dedicated silicon with much greater performance and much less power usage than an FPGA would?

EDIT: I'd like an FPGA card for my PC just for running FPGA "emulators," because you can recreate a SNES or whatever in a tiny fraction of the gates that you need to make a CPU capable of emulating it with cycle accuracy at full speed. I seriously doubt there's much demand for that, though: if you're going to go to the trouble of buying special hardware for emulation, you might as well just buy the console and a flash cart.


I would imagine that the FPGA would not be shared. Much like only one app can be full-screen on your computer, only one process could use the FPGA. The OS wouldn't let a second process use it if a first process already was.

As for regular users, personally I'd like it for video codecs. I'd like it for video transcoding. I'd like it for faster MP3 encoding. I'd like it for Photoshop filters. I'd like it for speech recognition. I'm sure I could think of things that other users would like to use it for, like 3D rendering. I don't know of dedicated silicon that does ANY of these things, except specifically for H.264 decoding. That's it. And the thing is, for all the items I've listed, it's totally fine that my computer is only ever doing one of them at a time. These are all "regular user" uses, whereas virtualization is not needed very much for regular users.


Most of the uses that you mention involve processing large-ish amounts of data, and that's one of the drawbacks to GPUs. They're great as long as everything fits into the amount of RAM on the card. Otherwise you're limited by the PCI bus and you're constantly shuttling data back and forth from main memory. It's frustrating when you only have 6GB of RAM on the GPU and 96 or more attached to the CPU. They're just not that good at data processing.


MacBooks supposedly contain FPGA chips (Google around, but it's mentioned for example here [1]; go to "Step 13"). I was trying to look into it a while back but I don't think I ever got a definite answer as to what they are used for.

[1] http://www.ifixit.com/Teardown/MacBook+Pro+15-Inch+Unibody+M...


FPGAs aren't particularly hard to find in the PC ecosystem, especially on first-gen parts or custom jobs like Apple machines. I've seen some SSD and RAID controllers ship as FPGAs, and some sound cards bundling them for DSP use. They're usually not reprogrammable.


1) FPGAs are not easy to develop for, and totally different from the skillset of a Photoshop developer. It is not as easy as dreaming up a design and pressing a button; if you are anywhere close to pushing the envelope it takes expertise & time.

2) GPUs, being purpose-built and mass-marketed, are both much cheaper and much faster. Think of the Ford Fiesta ST vs. a Jeep Wrangler. The Jeep is simple & more reconfigurable, yet is slower and more expensive - for exactly those reasons!

3) GPUs and CPUs complement each other well. The only gap is heavily parallel branching code - GPUs are bad at if-statements, CPUs are bad at heavy parallelism. But a branch predictor is the key to if-statements, and branch predictors are hard.

FPGAs are fundamentally a prototyping tool. Can you think of examples of prototyping tools that eventually broke into the main market, replacing the incumbent product?


>FPGAs are fundamentally a prototyping tool. Can you think of examples of prototyping tools that eventually broke into the main market, replacing the incumbent product?

Python


> FPGAs are fundamentally a prototyping tool.

FPGAs can be used for prototyping ASICs, but that's by far not their major use-case.


What is it, then? I know they are used alongside PLDs, but I thought PLDs still dominated that use-case.


They are essentially low-volume ASIC replacements. Some applications:

- digital signal processing (DSP)

- network equipment (modern high-end FPGAs are able to function as 100G Ethernet switches and routers, just connect them to some PHYs and fast DRAM)

- systems on chip (SoC): processor, fast DSP, all kinds of IOs, memory controller – all on a single chip

- realtime video stream processing

For more examples, see http://www.xilinx.com/applications/index.htm


In general, GPUs are the same or better in floating-point throughput per given area of silicon. On top of that, FPGAs add other expenses due to their smaller market.

Maybe FPGAs are competitive in integer math, but that's a pretty small niche, not big enough for a coprocessor.


Very few common user tasks can be solved more efficiently with an FPGA than with a CPU or GPU.

A CPU gives you speed in computation. Except for a custom ASIC, nothing can match a modern CPU in single-thread speed.

GPUs give you parallelism. You can get thousands of decent cores on cheap boards today.

IMHO FPGAs' main advantages today are bandwidth and auditability. But neither is very important in most applications.


I also had to implement a simple microprocessor on an FPGA, about 10 years ago.

I think one of the main reasons is that programming an FPGA is not easy; the tools needed for FPGA development are all proprietary, and every FPGA has a different way to program it.


Maybe in the next 5 years, when Moore's law may run out http://en.wikipedia.org/wiki/5_nanometer (unless Intel et al. have some extraordinary technology up their sleeves), we will finally see the advent of FPGAs?

Without the annual growth of transistor density, the second-best avenue to gain performance will probably be specialization, and reconfigurable specialized hardware looks more attractive than fixed hardware.


Given everyone would have different models, the compilation process (including placing and routing) would absolutely destroy things. I've seen it take 1+ hours on a fairly fast machine. The problem with connecting N logic blocks is that you would need on the order of N^2 buses to connect them all directly to one another. Luckily they aren't wired that way, but that's why the placing/routing optimization problem takes so long to compute now that transistor counts are in the billions.


GPUs were "there" and there was a logical path towards increasing their usefulness. It was partly an accident that they're not more programmable.

Technology sucks :(


Didn't Cyrix CPUs have some FPGA in them, that would be dynamically assigned to any repetitive code that the CPU was busy doing?


I think a big reason is the chicken-and-egg problem. GPUs gained widespread adoption because of their use for gaming. It's only relatively recently that they've been exploited for more general compute purposes. There hasn't really been a "killer app" for FPGAs that warrants many people having one.


Speed. Reconfiguring an FPGA is a very slow process, and FPGAs can't match the clock speeds of conventional ASICs. That may change in the near future with FPGA-like devices built using memristors, which potentially could be reconfigurable at the speed of writing RAM and run at the same clock speeds as any ASIC.


Actually, modern FPGA are "capable of dynamically reconfiguring at multi-GHz rates": https://www.tabula.com/technology/technology.php


That's marketing hype. If you reconfigure just one gate it can be extremely fast, but typical FPGAs take a significant fraction of a whole second to reconfigure. Though they are getting faster over time.


Did you even read how their chips work?


Not a direct answer, but it's not just GPUs: on an ARM "System on a Chip", you can get all sorts of components: audio, hardware codecs, networking, radio, and low-power companion cores (big.LITTLE).

So, your vision of general-computation-in-silicon has occurred, just static, not dynamic.


I wonder how many people would have bet on GPUs in the late 90s. Even high-end 3D chipset makers (3Dlabs, Evans & Sutherland) were ousted quite brutally from the GPU market by small 'gaming' companies. And now they invade server-side computations... quite funny.


> Does anyone here have any insight as to why?

1. They are already there in commodity HW.

2. GPU vendors make money with their hardware; they need not extort developers for a development environment like FPGA vendors think they have to do. Therefore, development for GPUs has a lower barrier of entry.


I have very limited knowledge of the subject, but I had an impression that GPUs started to catch up on programmable capabilities with shaders. As far as I know, shaders reprogram the rendering pipeline, so is it a kind of specialized HW programming?


FPGAs are still more costly compared to ASICs in large volume. Some of these high-end FPGAs can run $1K to $10K, assuming you even have access to them, given their limited availability.


Most applications don't really need the flexibility of an FPGA. Just having massively parallel multiply-accumulate turns out to cover huge amounts of digital signal processing and machine learning applications.

A long time ago computers sometimes came with DSPs dedicated to Photoshop-like programs. The GPU is a logical extension of that. The FPGA is something entirely different.


They were used in the late 90s, and companies such as Cray had C-to-FPGA compilers. These were implemented as C-style calls that the compiler transformed into FPGA logic.

The coding was hard and performed below expectation. The only customers that benefited were military/intelligence agencies who were using outdated techniques such as filtering in the Fourier basis.

The problem was data transfer, compounded by inflexible code generation. The data still needed to travel from RAM, which meant that even without computation you couldn't get over a 50% increase in performance. It would need to be sent back for more complex processing.


Hi, Jan here.

pdq, the last time I built this design, with more fully elaborated processors (control units + multiplier FUs) it took three hours and 16 GB physical RAM on a Core i7-4960HQ rMBP.


Do you know if the process parallelizes well? If so, this seems like something that high-end temporary AWS instances could help quite a bit with.


With the Xilinx ISE toolset I am currently using (which Xilinx is deprecating in favor of the new Vivado toolset) it parallelizes/multithreads poorly. I understand that the place and route algorithm is based upon simulated annealing, in which you make small random perturbations to the current layout configuration, measure whether it is better or worse, and sometimes retain the new configuration, and sometimes roll back. This gradually evolves the system to a configuration which maximizes some objective function, avoiding getting stuck in a local maximum. It has traditionally been a challenge to parallelize this sequential algorithm through design partitioning because of placement and routing interactions between the partitions.
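
(For the curious, here is a heavily simplified Python sketch of that kind of annealing-based placement, on a made-up netlist; real tools also juggle routing congestion, timing, legal sites, and many cell types.)

    import math, random

    random.seed(1)
    n_cells, grid = 64, 8
    # Hypothetical netlist: 128 two-pin nets between random cells.
    nets = [(random.randrange(n_cells), random.randrange(n_cells)) for _ in range(128)]
    place = list(range(n_cells))               # place[cell] = site on an 8x8 grid
    xy = lambda s: (s % grid, s // grid)

    def wirelength(place):
        # Total Manhattan length of all nets; the objective to minimize.
        return sum(abs(xy(place[a])[0] - xy(place[b])[0]) +
                   abs(xy(place[a])[1] - xy(place[b])[1]) for a, b in nets)

    cost, T = wirelength(place), 10.0
    while T > 0.01:
        for _ in range(200):
            a, b = random.sample(range(n_cells), 2)
            place[a], place[b] = place[b], place[a]      # small random perturbation
            new = wirelength(place)
            if new < cost or random.random() < math.exp((cost - new) / T):
                cost = new                                # keep it, sometimes even if worse
            else:
                place[a], place[b] = place[b], place[a]   # roll back
        T *= 0.9                                          # cool the temperature

    print("final wirelength:", cost)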

In some flows you can do a coarse floorplan of your design and route the submodules separately and then stitch them together. I imagine this is how the very largest devices are implemented in manageable design iteration times.

I don't usually worry about that, though. Since my design is just so many replicated tiles, I tend to do design iterations of 4- or 16-processor elements to test the impact on clock period / timing slack. That usually takes 2-3 minutes per design spin. Only once in a while do I place and route the whole chip to confirm some change doesn't impact timing closure.


Considering that the synthesis is an NP-hard problem (I think so?), how much better is your 1995 design in a 1995 FPGA synthesized with a 2014 CPU compared with a 1995 CPU? Is there a huge difference?


"how much better" -- I didn't understand, sorry.

Even with 18 years of x86 performance advances, it takes much longer to PAR the large FPGAs now than it did back in the day.


I believe the problem here is the Xilinx tools. They basically suck. The latest Quartus from Altera can build a decent-sized (about 10,000 LUTs) design in about 10 minutes on my i5 M2520 notebook.

I'm currently synthesizing a 45nm ARM Cortex-M0 design using the Cadence Encounter flow. The complete process (RTL Compiler + place + route) takes only 5 minutes!


I use both vendors' toolsets, have found occasional bugs in both, have my regrets, but all-in-all they are quite comparable, and quite remarkable for what they enable.

In both tools most of my design spins take <3 minutes.

If you were building an ASIC you'd pay $$$,$$$ for such tools. The economics of FPGAs are such that the tools are either free or $,$$$. Both are reasonable and accessible to an enthusiast/practicing EE, respectively.

I am so grateful to Ross Freeman, inventor of FPGAs, and all the engineers that followed in his footsteps, for democratizing access to state of the art high performance digital logic. For $100 or so you can get a 28 nm device filled with 10,000s of LUTs and hundreds of RAM blocks and build whatever you can imagine. Amazing.


Indeed, I always wonder why such a (relatively) exotic and low-volume technology is also very low-cost.

3 minutes is very fast... one of my projects takes about 30 minutes for ~30K LUTs on a modern Core i7; I used too many registers.

Developing and debugging it is basically torture.


I did expect that. But given the old FPGA: would you have gotten significantly better results if you had had an i7 back in 1995? I'm not talking about getting it done faster, but getting a better result.


Fab tech has improved at an astonishing rate. The Willamette core (Pentium 4) from 2000 fit 42 million transistors in 217mm². Eight years later, Silverthorne (Atom) fit two cores and 47 million transistors in 25mm². That's nine Atom CPUs in the same space as one Pentium 4.

Today's quad-core Haswell is made of 1.4 billion transistors crammed into 177mm². That's almost 8 million transistors per square millimeter.
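
A quick sanity check of those figures, using the die sizes quoted above:

    print(42e6 / 217)    # Willamette, 2000:    ~0.19M transistors / mm^2
    print(47e6 / 25)     # Silverthorne, 2008:  ~1.9M transistors / mm^2
    print(1.4e9 / 177)   # Haswell quad, 2013:  ~7.9M transistors / mm^2
    print(217 / 25)      # ~8.7 Atom dies fit in one Pentium 4 die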


It seems like we're going to run into the limits of that by the end of the decade or soon after though. I've made quite a few submissions about this but nobody ever seems to read them :) I really wonder whether we are making advances with parallelism and other technologies fast enough to offset the oncoming barriers to shrinkage and speed.

My naive best guess is that when CPUs stop getting much faster, the next wave of innovation will be on bus speeds. I'm not a chip guy, so perhaps my view of the problem is overblown, but I'm very interested in learning more about this from others.


Yes, Moore's law is in deep trouble, but there are a few paths for innovation in that area: bus speeds, imprecise but faster and lower-power computation (which works for some things), much more efficient ways to build on-chip memory (which takes 50% of chip area), 3D manufacturing techniques to decrease cost, affordable ways to make customizable microprocessors, and moving from silicon to other materials to achieve much lower power consumption.

So on average we might be able to see a 50x improvement in the next 30 years (according to a DARPA manager).


> when CPUs stop getting much faster

If by faster you mean clock speed, they haven't been getting faster for a while now. Memory and IO speeds do lag behind, but we have caching to solve the former and SSDs to solve the latter. One pressing issue now is getting the power consumption down. This is especially important for laptops and mobile phones. FPGAs are pretty good at using less power, but you have to sacrifice a lot of performance and programmability.


Ah sorry, I meant in terms of execution speed not clock, and was thinking mainly of shrinking die sizes > more transistors > more operations per cycle. I apologize for the vagueness.


What practical applications does this have? Could we see something like Python's Theano? The latter is a library which is capable of turning symbolic representations of linear algebra into parallelized and optimized code for the CPU or GPU.

I think that these FPGAs, when put into consumer computers, will be more like "data centers on a chip" rather than processing cores. In a simple example they could run map/reduce type operations, colocating storage and computing silicon.


That's not really what FPGAs are good for. Unless you are using a really high-end FPGA (read: >$10,000/unit), the raw compute power is not going to beat a decent gaming GPU. The benefit of FPGAs is that you get very precise timing control, which makes them very good for software-defined IO and other latency-sensitive hard real-time applications.


I'm sure the future will be massively multi processor. I just wish it would come faster.

Although I'm quite sure those sorts of amazing designs are already well used by the NSA. I'm sure hardware research could be the real breakthrough in cryptanalysis.


I wonder why Xilinx chose to use the j32 CPU in this example when they have better designs like the MicroBlaze, which is about the same size but faster and able to boot Linux.

EDIT: Oh I see. MicroBlaze is proprietary and didn't exist back then.


Xilinx didn't choose anything; rather, they simply linked to a blog (mine). These cores are more austere, smaller, and simpler than MicroBlaze.

http://www.fpgacpu.org/log/sep00.html#000919



