FPGAs Have the Wrong Abstraction for Computing (cornell.edu)
253 points by matt_d on June 23, 2019 | 135 comments



"Vendors keep bitstream formats secret, so Verilog is as low in the abstraction hierarchy as you can go. The problem with Verilog as an ISA is that it is too far removed from the hardware."

I wonder if the author is aware that the bitstream formats for both Lattice iCE40 series and Xilinx Virtex 7 series FPGAs have now been reverse engineered, and there is a complete open source toolchain that can be used for these. So Verilog is no longer as low as you can go.

Efforts of this type are also underway for other parts and there is a growing movement in this direction - see talks from Clifford Wolf at recent CCC events.


Except these are not mainstream efforts supported by vendors, which means that they're effectively off limits to serious developers at large engineering firms, who are the typical customers of FPGA hardware.


Yes, exactly like Linux, Apache, Docker or your favorite language were off-limits to serious developers in large engineering firms at some point in time. Things are evolving, so we'll see how it goes.


Lattice is inching towards kinda sorta cheering them on, even though they won't openly admit it yet. They've been an underdog after Xilinx and Altera/Intel for a long time, but have gained a ton of popularity on hobbyist dev boards since Yosys showed up (search twitter for "tinyfpga" "icebreaker" "ecp5" etc. if you want examples).

The hobbyists playing with these boards at home are now more familiar with the parts and the tools, and there's also enough data comparing synthesis results between the open-source and proprietary tools with non-trivial designs that it's getting easier to predict performance (and be confident that the fitter won't crash).

Anecdotally, a lot of these hobbyists are now pushing to use Lattice parts at work, where appropriate.


Two thoughts:

* Most of the algorithms that we want to work with in this domain are doing arithmetic operations on ints and floats. This isn't super difficult to do in an RTL, but it's like implementing C++ objects in assembler. You can do it, but you need to think harder than you should.

* FPGAs make you worry about timing. This is a massive shift in thinking for software people. It's also not a value-add; I don't want to care about timing. And it enforces chip-wide dependencies (you can have separate clock domains, but not many of them).

If you simplify the model to "pipelines of arithmetic ops" and then provide an abstraction that eliminates timing (e.g. all ops run in a fixed number of clock cycles and the compiler automatically pipelines them where necessary) then I think you'd have something usable. But this is basically a GPU with a lot of SRAM. Such a constrained problem would run extremely well on any modern GPU or SIMD machine, without the power and cost and obscurity constraints of FPGAs.
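For what it's worth, here is roughly what that "fixed number of clock cycles" abstraction looks like when you write it by hand today. Just a sketch; the module and signal names are made up:

  // Hand-written fixed-latency multiply-add: the result appears exactly
  // two clocks after the operands, so a scheduling compiler could treat
  // it as an opaque 2-cycle op and pipeline around it.
  module mul_add_pipe #(parameter W = 16) (
    input                clk,
    input      [W-1:0]   a, b, c,
    output reg [2*W:0]   y          // a*b + c
  );
    reg [2*W-1:0] prod;
    reg [W-1:0]   c_d;              // delay c to line up with prod
    always @(posedge clk) begin
      prod <= a * b;                // stage 1
      c_d  <= c;
      y    <= prod + c_d;           // stage 2
    end
  endmodule

The point is that the latency is a property you have to manage by hand (note the delayed copy of c); a compiler that owned the abstraction could insert those registers for you.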


I disagree that worrying about timing is not a value add, at least for real-time [1] applications. Worrying about timing is absolutely crucial for applications like servo loops. Making timing guarantees seems a lot more straightforward with an FPGA than with an RTOS.

Controls folks and software people tend to have very different conceptions of what “real-time” means. I’m talking about a loop that must execute once every (say) microsecond exactly or things start physically breaking.


Haven’t touched it in a while, but I recall working with BSV in college to be quite different from normal Verilog. Removed a lot of the annoyances with Verilog and put all the functionality behind a Haskell front end. No need to worry about timing, etc. because each component was designed to follow some “ready bit” based interface.
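For anyone who hasn't seen it: the "ready bit" discipline is basically a ready/valid handshake on every interface. A hand-written sketch of one such stage in plain Verilog (illustrative only, not what Bluespec actually emits):

  // One ready/valid pipeline stage: data advances only when the consumer
  // can take it, which is the interface discipline BSV enforces for you.
  module rv_stage #(parameter W = 8) (
    input              clk, rst,
    input  [W-1:0]     in_data,
    input              in_valid,
    output             in_ready,
    output reg [W-1:0] out_data,
    output reg         out_valid,
    input              out_ready
  );
    assign in_ready = !out_valid || out_ready;  // empty, or draining this cycle
    always @(posedge clk) begin
      if (rst)           out_valid <= 1'b0;
      else if (in_ready) out_valid <= in_valid;
      if (in_ready && in_valid) out_data <= in_data;
    end
  endmodule

In Bluespec you never write this; the tool generates the handshake from the rule/method semantics.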


Yup, wrote a toy processor in Bluespec. It was wonderfully easy to do, and typing was pretty handy too.


I feel like this is the approach that Mathworks [1] (Matlab/Simulink -> FPGA) and Intel [2] (OpenCL -> FPGA) use.

[1] https://www.mathworks.com/solutions/fpga-asic-soc-developmen...

[2] https://www.intel.com/content/www/us/en/software/programmabl...


The big difference for FPGAs versus GPUs is that you pay a power cost for that flexibility (both in SRAM and LUTs). Heck, even most modern FPGAs include fixed-function blocks for multiply operations and the like.

Unless you're doing something with strong timing constraints and you need it to be very wide, they just don't make sense, before you even get to the HDL question.


Right, and this is why I believe that FPGAs aren't competitive with GPUs for computationally-bound tasks.

FPGAs are weak here because (a) routing costs power, (b) Most ALUs need to be implemented from LUTs, (c) high-order data flow needs to consider timing of all dependencies, manually, and getting this wrong is very wasteful of resources.

GPUs have an apparent weakness where their ALUs are overly general for most tasks. If you're doing primarily INT8 tasks then all of that floating point hardware goes to waste, and it appears that an FPGA has an advantage (INT8 is cheap and predictable there.)

However, ALU volume isn't the constraint on GPUs -- it's memory bandwidth. Every prosumer GPU made in the last decade has had an excess of ALUs and a shortage of memory ops per second.

This is where we're seeing specialized TPUs, deep learning accelerators and image processing accelerators that provide the right ops paired with the right types of memory in the right places. They're seeing large performance and power improvements over GPUs, at the cost of being further specialized.

FPGAs have an advantage for low-latency applications and where predictable timing (nanoseconds) is required, but I don't see them displacing GPUs for compute-bound tasks.


The weakness of GPUs is that the minimum amount of work is incredibly high for acceleration to make sense. If you have a batch with fewer than 10000 units of work then you shouldn't even think of writing a GPU kernel. GPUs will never be used for stream/network processing workloads because no one wants to wait for a buffer to fill up. An FPGA could in theory offer the same throughput but with a much lower latency.


Volta and later architectures share the ALUs with the floating point units, so nothing really goes to waste. Same with the tensor cores.


I DO want access to timing as most of what I want to use an FPGA to implement directly relates to timing. Think: real-time audio.


Sure. Then why not use a DSP? Then you've got lower price, easier programming and can still achieve cycle-accurate timing if you desire.


The DSP is great for the computation side of audio, but unless that DSP has every bus you want to interface with, you also need an FPGA for I/O routing to PCI, Ethernet, ADCs and DACs, etc.


Is this really the case? The timing the parent is talking about is fractions of a cycle, maybe five orders of magnitude above audible frequencies.

How could you take advantage of exposed timing?


1) multiplexed audio. Sure, your frame clock may only be 44.1 kHz, but the bit clock on a mere 8 channels of this will be 11.2896 MHz (see the sketch after this list), to say nothing of oversampling, to say nothing of processing

2) low latency processing of the above may require in-situ pipelining in which the pass-through buffer itself IS the processing buffer, etc. imagine being able to eat the pot you boil your pasta in.

3) why WOULDN'T any well-designed elegant system have every single tick as a function of that bit clock? It makes no sense to deliberately place spanners in your own path, particularly as pertains to jitter etc. It's not like you are going to pause your wavelet transform and check your email in process...
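To make the arithmetic in (1) concrete: 8 channels x 32 bits x 44.1 kHz = 11.2896 MHz, and every counter in the design hangs off that bit clock. A sketch of the frame bookkeeping (the names are mine):

  // TDM frame bookkeeping for 8 channels of 32-bit slots: everything is
  // derived from the 11.2896 MHz bit clock.
  module tdm_frame (
    input            bclk,        // 11.2896 MHz bit clock
    input            rst,
    output reg [4:0] bit_idx,     // 0..31 within a slot
    output reg [2:0] slot,        // 0..7, which channel
    output           frame_sync   // one pulse per 44.1 kHz frame
  );
    assign frame_sync = (slot == 3'd0) && (bit_idx == 5'd0);
    always @(posedge bclk) begin
      if (rst) begin
        bit_idx <= 5'd0;
        slot    <= 3'd0;
      end else begin
        bit_idx <= bit_idx + 5'd1;          // wraps 31 -> 0
        if (bit_idx == 5'd31)
          slot <= slot + 3'd1;              // wraps 7 -> 0
      end
    end
  endmodule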


Point being: control. Precise control over timing is required for deterministic temporal activities.

Removing precise control over timing from the language stack one uses to program FPGAs is removing a desirable feature for many of their use cases.

If one is interested in glossing over all this abstraction, why is one wishing to use an FPGA at all?

I will reverse the question and say: "in which scenarios is someone hoping to avoid addressing precise timing constructs in FPGA programming?"

Obviously I'm not referring to clock propagation delay or quantum entanglement etc LOL I mean the intentional macro stuff wrt "timing"


Parent was referring to clock propagation delay (resulting in multiple clock domains on the same chip), hence my confusion.

If you're talking about higher level timing, nobody is arguing for taking that away from you.


I agree. I've seen SystemVerilog used for simulating such propagation delays, but it's not exactly a language-abstractable concept yet (precisely because it involves the actual hardware gate implementation): such black-box simulations estimate average propagation delay as a fixed function of results obtained from testing specific functional blocks, in specific devices, specific gate counts away from I/O pins, etc. It would be highly desirable to have HDL-level modeling of this stuff in the abstract sense, although I confess to not knowing how that could possibly work, given the above.


> I don't want to care about timing

If you don't care about timing at all, that suggests you don't really care if it's fast- in which case, why are you using an FPGA?


Timing here refers to a design consideration, e.g. how fast the clock can go before a logic bit at the output of one unit gets misinterpreted as the opposite level on the input of another unit. That depends on the layout of the design, how many inputs are connected to that output, etc.

They don't want to care about timing. Not that they don't care about how fast it runs.
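A toy example of the distinction, assuming nothing about any particular device: both modules below compute the same sum, but the second breaks the adder chain with registers, so it closes timing at a higher clock at the cost of one cycle of latency. That trade-off is the "caring about timing" part.

  // One long combinational path: three chained adds in a single cycle.
  module sum4_slow (input clk, input [31:0] a, b, c, d, output reg [33:0] y);
    always @(posedge clk) y <= a + b + c + d;
  endmodule

  // Same function, pipelined: each cycle only does one add's worth of logic.
  module sum4_fast (input clk, input [31:0] a, b, c, d, output reg [33:0] y);
    reg [32:0] ab, cd;
    always @(posedge clk) begin
      ab <= a + b;
      cd <= c + d;
      y  <= ab + cd;
    end
  endmodule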


Is it just me or did it sound like the author was complaining the whole time?

"FPGAs aren't evolved like modern processors with standardized programming models, so we must throw out the current model but I have no idea what is better."

Having worked with FPGAs, I can understand his complaints about the toolchain: it absolutely sucks and is mostly closed source. This technology is just like microprocessors in its infancy: it's evolving.

FPGAs have made crazy progress in the last decade and are getting to the point where it's now affordable for hobbyists and consumers to work with them, instead of just aerospace & defense contractors with massive hardware budgets.

> The problem with Verilog as an ISA is that it is too far removed from the hardware. The abstraction gap between RTL and FPGA hardware is enormous: it traditionally contains at least synthesis, technology mapping, and place & route—each of which is a complex, slow process. As a result, the compile/edit/run cycle for RTL programming on FPGAs takes hours or days and, worse still, it’s unpredictable: the deep stack of toolchain stages can obscure the way that changes in RTL will affect the design’s performance and energy characteristics.

Yes, RTL varies significantly from the actual hardware of the device (LUTs, memory, peripherals, etc.), but from a design standpoint, I don't see anything else that would make more sense to work with. FPGAs are significant BECAUSE of the fact that you get to build and design at that level. Let's not forget that ISAs are a higher-level abstraction of RTL...


If I understand your argument correctly, you are saying that complaining is not justified when FPGAs have made crazy progress. Can you explain why making crazy progress means that we could continue using the same abstractions, or that the abstractions are right?

GPUs made crazy progress, but they changed because there was a better way to make even crazier progress.


> As a result, the compile/edit/run cycle for RTL programming on FPGAs takes hours or days and, worse still, it’s unpredictable: the deep stack of toolchain stages can obscure the way that changes in RTL will affect the design’s performance and energy characteristics.

That's all true, but one should really not forget what FPGAs are intended for in the first place: the design of integrated circuits. And hours or days is so much better than months for even the simplest custom IC.


There are a few efforts to modernize the toolchain. InAccel is making some great efforts, like integration with Spark and K8s: https://www.inaccel.com/fpgas-goes-serverless-on-kubernetes/

Disclaimer: I am an investor in InAccel.


I have designed ASICs and FPGAs for nearly 30 years, and seen the evolution of this technology first hand. To say that FPGAs have the wrong abstraction is to not understand what an FPGA is and what it is intended to accomplish.

Transistors are abstracted into logic gates. Logic gates are abstracted into higher-order digital functions like flip-flops, muxes, etc. It is the mapping of algorithms/functions onto gates that is the essence of digital design. This is difficult work that would be impossible at today's scales (5-billion+ transistors) without synthesis tools and HDLs. And, given that an ASIC mask set costs 1MM+ for a modern geometry, it needs to be done right the first time (or at least the 2nd). Furthermore, the mapping to gates needs to be efficient, throwing more gates at a problem increases area, heat, and power, all of which need to be minimized in most contexts.
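To put a tiny example on that layering: here is the same 2:1 mux written structurally out of gate primitives and behaviorally at RTL. Synthesis is the step that takes you from the second form back down to a technology-specific version of the first (a sketch, nothing vendor-specific):

  // Gate level: the mux spelled out with Verilog gate primitives.
  module mux2_gates (input a, b, sel, output y);
    wire nsel, t0, t1;
    not g0 (nsel, sel);
    and g1 (t0, a, nsel);
    and g2 (t1, b, sel);
    or  g3 (y, t0, t1);
  endmodule

  // RTL: the same mux as a one-line behavioral description.
  module mux2_rtl (input a, b, sel, output y);
    assign y = sel ? b : a;
  endmodule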

My first job out of college was designing 386 motherboards. Back then we were still using discrete 74xx ICs for most digital functions. The boards were huge. PLDs allowed better integration and were cost effective since a single device could implement many different functions, and reduced board area and power consumption. CPLDs moved this further along.

FPGAs grew out of PLD/CPLDs and allowed a significantly higher level of integration and board area reduction. They offered a way to reduce the cost of a system without requiring the investment and expertise required for an ASIC. But, FPGAs themselves are an ASIC, implemented with the same technology as any other ASIC. So, FPGAs are a compromise; the LUTs, routing, etc are all a mechanism to make a programmable ASIC. Compared to an ASIC, however, FPGAs require more power and can implement less capability for a given die size. But, they allow a faster and lower cost development cycle. To bring this back around, the LUTs and routing mechanisms are functions that have been mapped to gates. To use an FPGA, algorithms still need to be mapped onto the LUTs and this is largely the same process as mapping to gates.

This article was pointless, even the author acknowledges: "I don’t know what abstraction should replace RTL for computational FPGAs." And, "Practically, replacing Verilog may be impossible as long as the FPGA vendors keep their lower-level abstractions secret and their sub-RTL toolchains proprietary." As I have argued above, knowing the FPGA vendors lower-level abstractions won't make the problem any better. The hard work is mapping onto gates/LUTs. And that analogy is wrong: "GPU : GPGPU :: FPGA : " An FPGA is the most general purpose hardware available.

The best FPGA/ASIC abstraction we have today is a CPU/GPU.


> As I have argued above, knowing the FPGA vendors lower-level abstractions won't make the problem any better.

Where did you argue that? Why is it reasonable to expect that proprietary synthesis tools are going to be better than an open source one? That definitely was not the case, long term, with proprietary C compilers of yesteryears. LLVM is the future, and mostly because of LLVM-IR. So ASTs are well optimized in a general format, why shouldn't digital logic circuits be similar? Yes, actually mapping this to xlnx (etc.) primitives is going to be different for each vendor... In the same sense that mapping LLVM-IR to aarch64 and amd64 is going to be different. So what? That doesn't mean that all is lost.

> The hard work is mapping onto gates/LUTs.

I think it's reasonable to expect that things like FIRRTL have the potential to outperform the synthesis tools that exist currently. The closer the representation gets to a pure graph theory problem, the better chance we have at reasoning about it.

The author makes a good point about Verilog being the current interface. Look at how FIRRTL has to be transpiled back to Verilog to be piped into synthesis tools. That's madness, and it's very opaque, and there's a lot of information lost that we just have to trust the tools to recover. Verilog is a lossy format, and that's the takeaway from this article for me, and you haven't addressed that point at all.


I argued it in the second quote: "The hard work is mapping onto gates/LUTs." It doesn't matter whether it's gates or FPGA LUTs, the work is the same.

I never argued anything regarding proprietary vs open source tools. I love open source tools and appreciate that projects such as FIRRTL, SymbiFlow, Yosys, Chisel, Clash, etc. have amazing potential. Having access to the FPGA vendors' low-level abstractions enables the broader use and development of these tools, which is important. My point was only that gates/LUTs are the fundamental building blocks of all digital computing. They are not easily abstractable, and to say that FPGAs have the wrong abstraction is not the best way to look at the problem. FPGAs aren't going to fundamentally change; they aren't going to evolve into a "GPFPGA" (to answer the author's analogy). But tools can always be improved to make FPGA design more accessible.


> Why is it reasonable to expect that proprietary synthesis tools are going to be better than an open source one? That definitely was not the case, long term, with proprietary C compilers of yesteryears. LLVM is the future, and mostly because of LLVM-IR.

Lol citation needed. Last I checked the performance oriented commercial closed source C/C++ compilers still outperform Clang and LLVM. And so does gcc for most cases for that matter.


Clue: Gcc is Free Software too.

New languages typically are not implemented with Gcc, though. So, long term, LLVM probably wins.


I don’t follow.

Though gcc still trumps clang in some areas, neither gcc nor llvm beats commercial compilers like ICC in a wide range of workloads. Obviously there is still a market for ICC and AOCC.

The original point was that open sourcing would necessarily lead to better performing tooling. And again to that I maintain... citation needed.


As an autodidact with a focus on analogue circuits and experience in microcontroller stuff, the thing that always threw me off about FPGAs is that there was never a real oh-don't-mind-me-I-am-just-looking kind of FPGA environment.

When I started with MCUs I started with an arduino. The thing it did for me was to give me a feeling when to use a microcontroller and when to use something else entirely.

Of course the level of control I had with an Arduino was far from optimal, but it worked out of the box and guided me into the subject (a bit like a children's bicycle: neither fast nor special, but it helps the learner avoid pain and frustration).

I wished I had this kind of thing in an affordable fpga way. Simple enough to get me hooked, with examples and good sane defaults etc.

This is what mainstream means: idiots like me who didn’t get a formal education on the subject but want to try things out.


Here you go:

Cheap FPGA boards for educational purposes: https://store.digilentinc.com/fpga-for-beginners/

The software is free: https://www.xilinx.com/products/design-tools/ise-design-suit...

The hard part is several semesters worth of textbooks to go through that cover digital logic (try Mano's "Digital Design: With an Introduction to the Verilog HDL" to start with) through computer architecture in order to know what to do with the board.


A much, much larger and better $113 FPGA development board with a free 1-year license: https://www.microsemi.com/existing-parts/parts/139680 You never pay the $159 list price. Besides, the FPGA alone would cost you >$350 retail. It arrives with RISC-V preprogrammed and 'hello world' type demos for wifi and usb.


A really exotic and complex board to start with. No community support (think of Digilent or Terasic). I also guess there are no examples. I visited a seminar about RISC-V and Microsemi's presentation was really weak on this topic. I do not recommend this for getting started; only for experienced users looking for pain. A cheap and affordable board is, for example, the Max1000 from Arrow.


Examples: RISC-V softcore, ADC, wifi, 12.5 Gbps SerDes(!), tic-tac-toe, console echo. https://github.com/Future-Electronics-Design-Center/Avalanch... https://github.com/RISCV-on-Microsemi-FPGA/PolarFire-Eval-Ki...

Community support is indeed just beginning, but the RISC-V community and the HiFive1/SiFive community support these PolarFire FPGAs. Even 50 RISC-V softcores fit on this FPGA.


After the one year, the software costs $995 per year ("The board includes a 1 Year Libero Design Software Gold License worth $995!"). Not really a good deal for a beginner.


You buy a new board every year for $113 to extend the license. My point remains: it's by far the most powerful FPGA for $113 and good for beginners.

Just a happy user, not affiliated with Microchip/Microsemi


There are also some boards listed here: https://symbiflow.github.io/#boards

If you want something even cheaper look for some ICE40 boards, like up5k MDP, or tinyfpga.

A few interesting things to try out: https://www.cl.cam.ac.uk/teaching/1112/ECAD+Arch/background/... https://www.cl.cam.ac.uk/teaching/1112/ECAD+Arch/files/Thack...

An intro to verilog: http://zipcpu.com/tutorial/


Leaving aside the fact that open source toolchains exist for various Lattice FPGAs (the iCE40 and ECP5, etc.),

About the simplest environment for doing what you just described is http://papilio.cc/

While using the existing Xilinx Webpack tools for actual synthesis, place and route, etc., the Papilio Design-IDE will LITERALLY let you add peripherals to a virtual Arduino like appendages! It takes advantage of a number of community projects like the Wishbone bus, and achieves a nearly drag-and-drop level of visual design.

Once you have loaded your custom Arduino chip onto the Papilio board's FPGA, you can program it with a modified version of the Arduino IDE!!!!

One of their virtual chips you can start with IS the Arduino ATmega328! Another is the ZPU-ino, an implementation of the Zylin ZPU (a 32-bit MCU) by Alvie Boy that allows you to program this much more powerful device, ALSO via the Arduino IDE!


There are a bunch of FPGA project breakout boards on Crowd Supply, at generally low prices, cheap enough to try one at a time until you find one that clicks for you.


I highly recommend PYNQ. It comes with a Jupyter interface through an ethernet port that lets you interactively program and execute. Of course, actually designing the RTL overlay is the hard part, and you'll need to get comfortable with Vivado to do it properly.

http://www.pynq.io/


Yes! Part of what an abstraction/API should do is map the difficulties of the actual physical/electrical problem into the language. Arguing that the language makes things too hard is missing the point... the difficulty of generically mapping gates and routes and clocks is real, and it has been exposed to you as LUTs and flip-flops and wires and clocks, in something that looks like a programming language, with a compiler that still sometimes fails to make implementations fit into the FPGA structure.

That some difficulties are hidden doesn't mean they are easy. Unless you have at least a proposed solution (language, compiler, architecture, ASIC) that lets people solve similar problems to the FPGA tool chain, it's just complaining.


Decade of FPGA work here; thought the same thing.

Fundamental misunderstanding of FPGAs, presents no alternatives. Zero worth article.


Question: What is the simplest publicly available/open design for a 386 Motherboard? (74xx IC's are preferable, no matter how large it would make the entire board... I envision myself building an ancient 386 Motherboard in the future, to teach myself aspects of computer engineering...)


I've done three ASICs using only schematics in the 1980s (back when 2.5 um was the bomb). I've done ASICs, FPGAs, but mostly full custom designs since then. And this article is wrong-headed on many levels.

Verilog is an event-driven modeling language. It easily describes large collections of processes where a given process is triggered for reevaluation any time one of its inputs changes. That is what it was designed for back in the 80s. Using it for automatic synthesis of logic came later.

If your mental model is that Verilog is an ISA then things will be very confusing. Programming language semantics and ISAs are two different things.

Yes, the Verilog language is pretty ugly. But it allows low level modeling at the primitive level, even if that primitive is a single logic gate. And like any programming language, hierarchy is used to build useful abstractions for the particular problem domain.


Seems like the author is arguing that when FPGAs are used for "computation" and not for prototyping or replacing ASICs for low volume production, there should be an ISA to abstract the gory details like in GPUs/CPUs, and Verilog is not an adequate abstraction in that regard because it's more like a programming language.

Now, I may misinterpret the author's argument, but I don't understand the obsession over ISAs. It's a bit short sighted IMHO and, I agree with you, HDLs may not be elegant or pretty but they do their job just fine, ie. low level modeling at the primitive level. They are not programming languages.

I also don't see how vendor locking fits with the rest of the narrative. Whatever intermediate abstraction the vendors offer (the author gives no examples which makes the post read more like a rant) it still won't expose the inner workings which (unlike CPUs/GPUs) are crucial for truly open FPGA development.

Now, should we strive for a better alternative, an evolution of FPGAs that benefits from an "ISA" type of abstraction? I'm not smart enough to visualize it, but sure, why not. An example would be nice, though.


If Verilog is your target, compile times will be too slow.

This is better:

"We describe a hybrid area estimation technique which uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing...We show that estimates average 4.8% error for logic resources, 6.1% error for runtimes, and are 279 to 6533 times faster than a commercial high-level synthesis tool."

https://ieeexplore.ieee.org/document/7551387

And based on this representation, they've built a high-level language for FPGA accelerators, available to download:

https://spatial-lang.org/


The article is not about using Verilog to model ASICs. He even says himself it's fine for that.

He wants to compute with FPGAs. So, if you're not interested in computing with FPGAs, you won't agree with him. Full Stop.


OK, we've added "for computing" to the title above.


The premise of the post seems to be that Verilog RTL is the lowest form of abstraction in which you can program an FPGA, and that everything lower is guarded by proprietary FPGA tools.

That is simply not true.

You can manually instantiate FPGA primitives (LUTs/BRAMs/DSPs) with all major FPGAs, and if you’re truly desperate you can add placement attributes to place these primitives exactly where you want them.

That’s as close to the metal as I can imagine (other than specifying the actual routing network), and from there one could build up any abstraction level one desires.
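For example, on Xilinx parts you can write something like this (a sketch; the INIT value and site name are made up, and the exact attribute spellings vary by vendor and toolchain):

  // Instantiate a 6-input LUT primitive directly and pin it to a site,
  // bypassing inference entirely.
  module and6_placed (input [5:0] a, output y);
    (* LOC = "SLICE_X10Y42", BEL = "A6LUT" *)
    LUT6 #(
      .INIT(64'h8000_0000_0000_0000)   // O = 1 only when I5..I0 are all 1
    ) u_lut (
      .O (y),
      .I0(a[0]), .I1(a[1]), .I2(a[2]),
      .I3(a[3]), .I4(a[4]), .I5(a[5])
    );
  endmodule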


That's a bit of a grey area. Some FPGA primitives work with inference only and finding those templates is hard. Some FPGA primitives have modules you can instantiate, but again finding all legal parameterizations will take you forever (it's never properly documented) and the simulation models rarely match the hardware behaviour so it's difficult to verify.
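Concretely, the inference-template game looks like this: write the RAM in exactly the blessed shape and the tools map it onto a block RAM; add, say, an asynchronous read and you silently get distributed RAM instead. A sketch of the usual shape (not any vendor's official template):

  // Synchronous single-port RAM in the shape synthesis maps to block RAM.
  module spram #(parameter W = 18, parameter D = 1024) (
    input                   clk,
    input                   we,
    input  [$clog2(D)-1:0]  addr,
    input  [W-1:0]          din,
    output reg [W-1:0]      dout
  );
    reg [W-1:0] mem [0:D-1];
    always @(posedge clk) begin
      if (we) mem[addr] <= din;
      dout <= mem[addr];             // registered read is the key part
    end
  endmodule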


I think the heart of what the article is getting at is represented well by the following quote:

"To let GPUs blossom into the data-parallel accelerators they are today, people had to reframe the concept of what a GPU takes as input. We used to think of a GPU taking in an exotic, intensely domain specific description of a visual effect. We unlocked their true potential by realizing that GPUs execute programs."

Up until the late 2000s, there was a lot of wandering-in-the-wilderness going on with respect to multicore processing, especially for data-intensive applications like signal processing. What really made the GPU solution accelerate (no pun intended!) was the recognition and then real-world application (CUDA & OpenCL) of a programming paradigm that would best utilize the inherent capabilities of the architecture.

I have no idea if those languages have gotten any better in the last few years, but anything past a matrix-multiply unroll was some real "here be dragons" stuff. But: you could take these kernels and then add sufficient abstraction on top of them until they were actually usable by mere humans (in a BLAS flavor or even higher). And even better if you can add in the memory management abstraction as well.

Point being: still not there for FPGA computation, though there was some hope at one time that OpenCL would lead us down a path to decent heterogeneous computing. Until there's some real breakthroughs in this area though, the best computation patterns that are going to map out using these techniques are the things we're already targeting to either CPUs or GPUs.


I'm annoyed this post ended with an open question. I was hoping it might at least have some ideas, as I do agree that Verilog is a horrible abstraction layer.

However, I don't think an ISA for FPGAs can exist, at least not one that allows for quick synthesis.

Sure, you could drop down to a layer where everything is described as LUTs, FFs and routing; but you still need to run "place and route" before you can execute it, and that's the expensive part of synthesis.


IMO, if what you want is to accelerate computation, what you need is basically an FPGA with higher-level building blocks. Instead of an array of lookup tables, you want an array of what are essentially ALUs and registers. The abstraction should be something more akin to a dataflow graph, where you can route data from one unit to another.

Yes, I know FPGAs already have adders, multipliers and registers inside them. My point is, they should have more of that. For computation, the focus needs to be on having as many of these useful high-level building blocks as possible, and making the routing of data between these blocks as easy as possible. For instance, routing 32-wide or 64-wide buses between these units, instead of individual wires.

The "problem" with FPGAs is that they are trying to be able to emulate any ASIC. If what you want is a general-purpose computational accelerator, you need hardware that is a bit more high-level and more specialized. More tailored for that specific purpose.


More ALUs reminds me of the GreenArrays chips: hundreds of really tiny cores arranged on a grid, each one running a small program. Pretty weird stuff. And the canonical development tool is a very weird colorForth environment.


I saw a live presentation about greenarrays by its creator and wasn't impressed. His presentation was a long list of cool assembly-level tricks he had come up with to program his own chip, it seemed like he was trying to show off his skills to the audience. The chip has a very arcane design that is impractical to use, hence why it saw very little adoption.


I only saw a friend of mine hacking that stuff once. Or at least he tried to. I don't actually know if he ever got to a point where he had something interesting running. I know that I would never put up with that color forth environment, though.

Even though the GreenArrays design may be quite flawed, there might be use cases where the array-of-small-cores idea could be put to use. I'm also reminded a bit of the XMOS multicore microcontrollers for some reason, although they are quite different. But the core link can work even between microcontroller packages, allowing the creation of decently sized grids. But I can't think of a good use case for a large grid of those. This is more about networking microcontrollers in an embedded context.


The cadence hardware emulators are like that. Arrays of tiny processors that only do bitwise ops. They look like FPGAs to programmers though.


Agree. At the risk of sounding like a nihilist, such articles that ask more questions than they answer rarely raise any questions of insight, and this has the effect of making people forget the value of the whole article, given that the articles often end with the "open questions" section.

IMHO. YMMV.


It's a bit disappointing, but this is a blog post, not a research paper, so maybe the problem is with our expectations?


With a title like "FPGAs Have the Wrong Abstraction", it rather suggests that the author has something better in mind.


FPGAs have the wrong architecture. Routing fabrics are a premature optimization. They need to be a 2D array of 4:4 lookup tables: one bit in from each cartesian neighbor, a latch, and one bit out to each of them. It's Turing complete. You alternate the latch clocks, thus preventing all timing issues. The delays are predictable. You can route almost trivially. You can route around defects. You can prove that programs will work as designed.

See my rants for the last decade about "Bitgrid" if you want to know more.
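For the curious, here is my sketch of what one such cell might look like in Verilog (my reading of the idea, not anything canonical): four one-bit neighbor inputs, four independent 16-entry truth tables, and a register per output, updated on alternating phases.

  // One "bitgrid" cell: 4 neighbor inputs index 4 independent 16-bit LUTs,
  // and all 4 outputs are registered. phase_en implements the alternating
  // (checkerboard) update scheme.
  module bitgrid_cell (
    input         clk,
    input         phase_en,                // asserted on this cell's phase
    input  [63:0] cfg,                     // 4 x 16-bit truth tables
    input         in_n, in_e, in_s, in_w,
    output reg    out_n, out_e, out_s, out_w
  );
    wire [3:0] idx = {in_n, in_e, in_s, in_w};
    always @(posedge clk) begin
      if (phase_en) begin
        out_n <= cfg[ 0 + idx];
        out_e <= cfg[16 + idx];
        out_s <= cfg[32 + idx];
        out_w <= cfg[48 + idx];
      end
    end
  endmodule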


Already been done, IIRC. Algotronix had a similar nearest-neighbor architecture FPGA in the form of the CAL1024 back in 1989†.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.152... (See p. 16)

https://www.algotronix.com/company/Photo%20Album.html

That architecture never really was particularly successful. I vaguely seem to recall that the issue was that not having long routing lines meant that huge swaths of the grid had to be wasted to do routing of signals.

† As an aside, Xilinx acquired Algotronix and evolved their design into the XC6200, which was notable for being one of the earliest run-time reconfigurable FPGAs


If you have a symmetric grid, you can quickly rotate, flip, and otherwise transform the program. Since all the outputs are independent, you can have computation in a cell that has results flowing into the chip, and back out to the edge at the same time. Delays are simply a matter of counting how many cells a result passes, and thus a good signal to feed to an optimizing algorithm.


The general version of the thing you're describing is a 2d systolic array. Designs like this already map reasonably well to standard FPGAs.

The problem with the 4:4 LUT you described above is that you almost always want the compute cell part of your design to have inputs and outputs that are wider than a single bit. For example, if you were designing an FIR filter as a systolic array, you might want one filter coefficient to come from the top of each cell, and your signal to pass through the cells from left to right. A realistic design might have a 16 bit signal and 10 bit coefficients.

This would require 160 LUTs per cell just to size the signals correctly. You'd need a lot more in practice, because in addition to fitting your compute cell logic, state machine logic, and arithmetic logic, you also need to waste extra LUTs to route the arithmetic carry chains and state machine signals across this LUT grid.

And even if you manage to do all that, you're still left with a bit-serial design that needs to be clocked an order of magnitude faster than a design with wider inter-cell transfer sizes to get similar performance.

When you start thinking about ways to get around some of these limitations while keeping designs with high "spatial locality" efficient, you get something that looks a lot like a modern FPGA.
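For concreteness, here is roughly what one tap of the systolic FIR described above looks like in RTL, with the widths from the example (16-bit samples, 10-bit coefficients). Just a sketch:

  // One systolic FIR tap: the sample marches right through two registers,
  // the partial sum accumulates, and the coefficient is held locally.
  module fir_tap (
    input                    clk,
    input  signed [15:0]     x_in,      // sample from the cell on the left
    input  signed [9:0]      coeff,     // this tap's coefficient
    input  signed [31:0]     acc_in,    // running sum from the left
    output reg signed [15:0] x_out,     // sample passed to the next cell
    output reg signed [31:0] acc_out
  );
    reg signed [15:0] x_d;              // extra sample delay keeps taps aligned
    always @(posedge clk) begin
      x_d     <= x_in;
      x_out   <= x_d;
      acc_out <= acc_in + x_d * coeff;  // the part a DSP slice would absorb
    end
  endmodule

On a conventional FPGA the multiply-add maps onto a DSP block and the rest onto fabric registers; on a bit-level grid every one of those buses has to be built out of single-bit cells.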


If you had a systolic array that was generic and route-less, you could very easily compile (at run time) the filter coefficients directly into the logic, updating as you go.

I'd feed inputs from the left, passing them through towards the right, then reflect them back in a manner that would keep timing right. Sums would accumulate and appear as outputs on the left.

If you clocked the whole array with a 2 phase clock, you'd get an answer out every cycle, once all the pipelines were filled.


This is not a new idea, but doing so would make FPGAs harder to fab.


In case anyone else was wondering: FPGA stands for field-programmable gate array[0]. Frustratingly, this is mentioned nowhere in the article, not even after posing the question "What is an FPGA?" right at the beginning.

[0] https://en.m.wikipedia.org/wiki/Field-programmable_gate_arra...


I worked on something like this at a startup, it was very interesting work, but it is hard to convince people that it has value.

The biggest issue is that standard processors (i.e. x86_64) continue to get faster and better, and in a straight line they're already much faster than an FPGA. Secondarily, software written for an FPGA ends up being specific to that FPGA, mostly because of the number of LUTs per chip and the routing.

So there are two things going on: one is that you need something very parallel in order to have a compelling argument for implementing it on an FPGA, and secondly, in a couple of years, the C implementation of the same thing ends up being faster because processors have improved, the number of cores has increased, and caches have gotten better on your standard processor. So for your code to improve, you have to rework it entirely, possibly to the extent of rethinking the algorithm you've used. (FPGAs advance in the same way, but a new compiler doesn't speed up your code, because you wrote it very close to the metal.)

So in reality, the issue isn't that FPGAs have the wrong abstraction, it's that they have practically no abstraction, at least when it comes to having a real compiler that will optimize your code in a meaningful way for the chip that you are using, the way something like GCC would. Even if you are writing verilog or vhdl, you still need to consider the number of LUTs you have, how the placement will work out based on the size of your different modules (and of course, clock timings & pipeline stalls). You get some help with that stuff from the compiler, but when you then upgrade to a bigger chip, there are diminishing returns in the help that it provides. It is really like you are building your own arbitrary CPU. In that regard, it is difficult to find good people.

None of the commercial attempts at high level languages have been successful, largely because they suck. You're stuck writing C code when you'd rather write something that actually considers the advantages that an FPGA has (the extremely parallel processing of bytes). Implementors need to think more in terms of parallel graphs and tree structures that coalesce than they do loops. From that perspective, you need actual language primitives that actually match with the advantages of the platform (which verilog and vhdl do, but in a low-level, ham-fisted way). So it's an incredibly tough problem to tackle.


This is completely incorrect. There are many serial applications that run significantly faster in an FPGA even at one-fifth of the clock of a CPU. The reason is that you can dedicate all of the resources to performing that one task instead of sharing them and using instructions that are not optimized for it. There is a reason signal processing is still predominantly done in ASICs, FPGAs, and DSPs.


I think you misunderstood. I said very little about the relative performance.

I know quite well that the FPGA can be faster. The point is you have to work 10-50x as hard to get there, if you want to do anything interesting. Arbitrary FPGA application code is not a thing.

Practically speaking, by the time in your development cycle that you’ve built your custom hardware and software, got it through emissions and are ready to sell it commercially, there’s a new intel chip that is competitive with what you’re doing, using C code. (It’s an FPGA so you have to deliver custom hardware with it). FPGAs have advanced as well in this time, but you’re not using that, you’re using a 2-4 year old one.

We were picking apart tcp packets and processing the data inside of them. What we built worked, and it was faster than a CPU, but we had a lot of clever non-standard custom compilers to do it.

It was really fun, challenging and interesting work, but really hard to justify in a practical sense.

Things like signal processing are a good application mostly because they’re well understood and the code follows the same basic pattern. Novel applications are much more difficult.


There is no abstraction for anything unless you distribute it? What is the abstraction for AV1 decoding on CPUs that don't support it? Implementing it in a software library?

The only problem with FPGAs is that they have even more interesting side channel problems, but FPGAs will be more space efficient than half a dozen ASIC implementations on a chip, or even a dozen cores to replace the implementations.


Reading this article and thinking to myself that I wrote this 13 years ago: http://fpgacomputing.blogspot.com/2006/05/methods-for-reconf...


Somewhat related: There's now an Open Source toolchain[0] for FPGAs.

[0] https://symbiflow.github.io/


Which doesn't handle DSP and BRAM primitives properly, it can't even do packed Verilog arrays, there's no VHDL capability AT ALL, and it supports only extremely primitive Lattice parts (and "eventually" will support the Xilinx 7 series).

Great for blinkenlights. Useless for work.


> Useless for work

... yet?

It's an open source project and it is trying to achieve what multiple vendors building moats could not.

Maybe it is not production grade but maybe all it needs is time and contributions.


It's a hobbyist thing. Useless in a corporate setting. I can summon the vendor's field application engineers when I have a problem. Guess what I can do when I have problems with an open source toolchain and the project's deadline is coming. Xilinx and Altera will not support 3rd party tools, due to this liability issue, for sure. Time and contributions will not help this venture; the problems are not technical.


People used to say that about regular compilers.

Yes, when you are targeting an ASIC and a manufacturing pipeline, you need more support. But not everybody is. In fact, 7+ billion aren't. There is no shame in shipping a low-volume FPGA product, or even in targeting an existing FPGA demo board. Or even a commercial SOC or PC, if it does the job.


"People used to say that about compilers"

Yes. Very talented engineers are paid large sums of money to work on open source compilers when the company's main concern is NOT compilers - Red Hat wants the compilers stable on Linux, and helps out. Same for Apple. Etc., etc.

"Open Source" is just somebody else footing the bill because it IS NOT the main business concern - compilers aren't your product differentiator, but you need them to work on your systems, you pay talented engineers to work on them.

This dynamic DOES NOT HOLD for FPGAs. There is no "second source" of Xilinx FPGAs. Nobody will pitch in to help Xilinx because the only people who employ an army of talented people with the direct niche skills needed are their competitors.

F/OSS software is not great once you get past small, single-use tools. KiCad became barely usable in version 5, and it's still a decade behind what we had commercially a decade ago - and there are probably what, 500 PCB designers for every ASIC designer? 50 for every FPGA designer?

Xilinx and Intel fund programs at schools to develop talent they can use to further develop their systems and chips. That's remarkably NARROW.

In short: No, this tool will NEVER achieve critical mass, unless Lattice decides to support it directly - and then it still won't work on Xilinx chips. It is an absolute dead end, because the economics that helped the success story of GCC will not apply.


There seems to be commercial support available for Yosys/nextpnr: https://www.symbioticeda.com/fpga-design


There is no synthesis engine for VHDL in it that I'm aware of. It will never handle encrypted IP, so you can't just buy a core if you don't have a certain expertise in house.

It's... Useless in most commercial settings. The Symbiyosys tools in a commercial setting are only useful for formal verification. And they have some extreme limitations.


> There is no synthesis engine for VHDL in it that I'm aware of.

The answer is one simple Google search away: http://www.clifford.at/yosys/cmd_verific.html

The commercial version of Symbiyosys uses Verific to parse and elaborate designs. The Verific front-end supports Verilog, SystemVerilog and VHDL.

When it comes to formal verification, the Verific front-end and the Yosys Verific backend support a large set of the formal verification primitives of SystemVerilog. (See also the above link.)



It is also much less unpleasant to work with than the extremely awful vendor toolchains.


And yet, whenever given the choice with my hobby projects, I always drop the open source available FPGAs and toolchains for Altera or Xilinx.

The moment any kind of debugging or timing analysis is required, the open source chain currently falls flat.

Things will improve over time, but I think some of it is painted a bit too rosy.


> The abstraction gap between RTL and FPGA hardware is enormous: it traditionally contains at least synthesis, technology mapping, and place & route—each of which is a complex, slow process. As a result, the compile/edit/run cycle for RTL programming on FPGAs takes hours or days and, worse still, it’s unpredictable: the deep stack of toolchain stages can obscure the way that changes in RTL will affect the design’s performance and energy characteristics.

This is basically what pushes me away from FPGAs, even though there is now icestorm so I can avoid Windows. As an outsider I totally agree with the author: I love the idea of using FPGAs as accelerators, but it's way easier to do GPGPU right now.


Problem with Verilog and VHDL is widely known. This is why I proposed[1] to create a unified LLVM-like framework to prevent many projects from reinventing the wheel, and be able to use each other's results.

[1] https://github.com/SymbiFlow/ideas/issues/19


Are you aware that Intel's OpenCL compiler for FPGA uses actual LLVM?


Yes, but LLVM is not very suitable in general for FPGA/ASIC design. FIRRTL[1][2] looks much more complete and better suited among the various alternatives.

[1] https://bar.eecs.berkeley.edu/projects/firrtl.html

[2] https://github.com/freechipsproject/firrtl


This is a useless post: bunch of complaining about the author’s own ignorance of FPGA and computational fabrics, willful misunderstanding of the differences between FPGAs, GPUs, ASICs, and lastly, not even an attempted proposal for a way forward past the author’s inabilities. Why did I just waste time reading this crap?


I’m not quite sure I understand your discontent.

You assert the author displays a “willful misunderstanding of the differences between FPGAs, GPUs, and ASICs.” What differences did the author misrepresent? And is it fair to call it “willful”?

From what I can tell, the ideas should be taken at a 10,000 ft view: Verilog as an interface to (computational) FPGAs is not good enough because it's inaccessible to domain scientists in other fields (where CUDA and similar are). The way I read the post, the author is framing a research question: "how do we design a new abstraction for FPGAs to do for them what we've done for GPUs?"


> Even RTL experts probably don’t believe that Verilog is a productive way to do mainstream FPGA development. It won’t propel programmable logic into the mainstream. RTL design may seem friendly and familiar to veteran hardware hackers, but the productivity gap with software languages is immeasurable.

The author seems to view "mainstream" as meaning being as easy as CPU or GPGPU programming. I don't think it makes much sense trying to accomplish this on FPGAs; you're better off using CPUs, GPUs, or making something domain-specific like TPUs. The benefit of FPGAs is that they allow you to build your own architecture, and define data movement in a way that is specific to your application, at the cost of increased development effort. The complexity encountered in doing RTL arises from the inherent complexity in using FPGAs effectively.

There is a case to be made for something in between a CPU and an FPGA that allows easier development, but gives you some ability to control your data movement to get higher performance. Processor meshes, like what Xilinx is including with their upcoming Versal chips, might be a good solution to this (though in typical FPGA vendor fashion this too is locked behind proprietary tooling).


> I don't think it makes much sense trying to accomplish this on FPGAs; you're better off using CPUs, GPUs, or making something domain-specific like TPUs.

I dunno, we have a lot of libraries and frameworks that make it their purpose to efficiently run code on either a CPU or GPU (or TPU, if we're talking about things like TensorFlow); an FPGA is conceptually just an extension of that, except, of course, that the cost of on-the-fly FPGA recompilation usually outweighs any performance benefit you'd get from using the FPGA-based hardware accelerator, so the only reasonable uses are one-and-done layouts with occasional patches/updates. If we could get an FPGA gate pipeline that was as efficient as the modern shader pipeline, I think what we could use them for could expand greatly - imagine a tracing JIT that could dynamically load hot sequential code into an FPGA to speed it up, the same way we can load matrix ops onto a GPU today.

With where FPGA toolchains are today, it just doesn't work _because_ the compilation process is so bad and slow, and nobody except the engineers within Xilinx and Altera even have a reasonable shot at making it better since both the compiler and the compilation target are closed. Part of it is simply that these compilation passes aren't really configurable (beyond expressing layout and pipelining preference) - if I could get near instant bitstream generation in exchange for 30% extra space used, I'd use that setting for all but my final builds, and strongly consider invoking it at runtime.


Hm. Are you aware of [1] https://www.microsoft.com/en-us/research/project/emips/ & [2] https://blog.netbsd.org/tnf/entry/support_for_microsoft_emip... ?

Especially the two papers which come up when you search like this? [3] https://duckduckgo.com/?q=microsoft+extensible+mips

MIPS-to-Verilog: Hardware Compilation for the eMIPS Processor. Karl Meier, Alessandro Forin. Microsoft Research Technical Report MSR-TR-2007-128, September 2007.

[4] https://www.microsoft.com/en-us/research/wp-content/uploads/...

and

Extensible Microprocessor Without Interlocked Pipeline Stages (eMIPS), the Reconfigurable Microprocessor. Richard Neil Pittman. Master's thesis, Texas A&M University, May 2007.

[5] https://oaktrust.library.tamu.edu/bitstream/handle/1969.1/59...


I've been thinking about this for so long! I believe Intel did research on it with possible results. Are you interested in pursuing it?


GPUs have become immensely popular and helpful in many real-world use cases. This happened in a fraction of the time that FPGAs have had to find a useful niche in the market.

As a mostly outsider to the FPGA world, I conclude one of two is true:

1: FPGA performance gain is not meaningfully better than GPU

2: FPGA are so difficult to make efficient use of that the performance gain is seldom realized

Whichever the case may be, GPGPU has significantly advanced the HPC space and FPGAs have not. I don't see that ever changing unless one of the underlying variables changes. 2 years ago I saw a talk at a big-name conference about FPGA-accelerated SQL-like queries; it seemed cool and was talked up as the next big thing in the data science space. I've never heard of it since, meanwhile we now have fully SQL-compliant GPGPU solutions. I'm not hung up on SQL or HPC, but it looks like FPGAs never found anything they're truly great at while GPGPU has - across multiple domains.


Acceleration is only part of what FPGAs do, but it sounds like you imply it is their only purpose; it is far from it. How are GPUs used in chip prototyping? Or in safety-critical designs? Can a GPU implement a fast proprietary interface or protocol parsing?


I think that both GPGPU and FPGA are areas of development and exploration with different types of focus on the hardware and software. Both are important and will lead to other forms (e.g. TPU NNs). I can easily see FPGA SQL queries working well and being further optimized into a more specific QPU since the scale of the problem pays for development of efficiency gains.


Compilers can get quite good at optimization problems. The trick is finding an abstraction that hides details in just the right way that compilation can result in something about as good as hand optimized code in 1/10th the developer time.


If you could come up with a better abstraction, it would probably be better to build a new genre of computation device around that abstraction, rather than sticking with FPGAs.


Good luck getting anybody to build them. FPGAs suck, but exist. Existence is worth more than a thousand good architectures.

Similarly, GPUs. When enough people abused them, they started evolving toward what the abusers actually needed.

A decent abstraction would lead to FPGAs more useful for things other than emulating ASICs. But it has to come first.


The entire article can be summarized in this paragraph:

  The problem with Verilog as an ISA is that it is too far removed from the hardware. The abstraction gap between RTL and FPGA hardware is enormous: it traditionally contains at least synthesis, technology mapping, and place & route—each of which is a complex, slow process. 
But there is nothing below a netlist that is still mappable to different types of FPGAs (since different FPGAs do not even necessarily need to have LUTs at all, much less the same LUT types!) so I fail to see the use of it.

A GPU ISA changes very little between generations of the GPU, however a rather small change in the structure of the FPGA usually implies a completely new bitstream. So even if there was bitstream documentation it would be very FPGA-specific, unlike an ISA.

P&R does not look that much different from compiling, in the sense that you can spend as much effort as you want on it in order to produce a better or worse result for your machine.


Readable version of the quoted paragraph:

The problem with Verilog as an ISA is that it is too far removed from the hardware. The abstraction gap between RTL and FPGA hardware is enormous: it traditionally contains at least synthesis, technology mapping, and place & route—each of which is a complex, slow process.

(Don't use code formatting for quotes.)


The solution that emerges is not just a new hardware abstraction for the current generation of FPGAs. It would be similar to the abuse of GPUs in the early 2000s.

You must design a completely new type of computational FPGA (CFPGA?) that can have a good generic hardware abstraction. It might be something a little above the logic blocks and hardware blocks and routing.


This has happened (is happening). Intel's latest FPGAs have potentially thousands of floating-point multiply/add blocks, and Xilinx has a "many cores + on-chip network" design on its way.


People are already making FPGAs. Good luck persuading them to make something else for you.


This is what is called a CPLD. It never attracted much interest.


> But there is nothing below a netlist that is still mappable to different types of FPGAs

But that's the problem - HDLs like Verilog and VHDL are not simple netlist descriptions, but instead a weirdly abstracted-away behavioural definition of a circuit that synthesizes into a netlist using tons of (sometimes automagical) heuristics.


They are not "simple netlist descriptions", but they can be used like one. This is usually what one understands when using the words "RTL" and "Verilog" interchangeably, because when you start with the automagical stuff, you'll find out most P&R tools _cannot_ actually accept those.


That was an interesting read, and Adrian has done a lot of interesting work, but I can't agree with the theme that FPGAs are just another compute engine with a wonky ISA.

I've met a number of software people who are now developing FPGA designs, and hardware people who are in the same place (even looking to hire US Persons with those skills for SDR work :-) and invariably the software folks get a brain cramp because it "looks" like code but doesn't "work" like code.

I have often wondered how close this experience is to that of English speakers listening to a conversation about a software design and implementation and becoming frustrated because they think they understand each word but aren't understanding any of the semantics of the conversation.

I'm one of those people who essentially double-majored in EE and CS[1], which is sort of like growing up in a house where one parent speaks one language and the other speaks another. In those situations you tend to naturally accept that one discipline (or language) doesn't overlap cleanly with the other.

As a result, I disagree with the author that an FPGA is just a computation engine with a "wonky" ISA. Thinking about it that way is somewhat limiting. That said, some of the things that give FPGAs their programmability might be useful additions to GPUs; in particular, redefining the internal data paths of the GPU to allocate more "bits" and change the dynamic range for different operators might allow some interesting things to be done.

HDLs themselves are interesting problems, because you have a target (the switching matrix and logic elements of an FPGA) and you have text. Designing a way to express the capabilities of one in the other is very hard. So hard, in fact, that there are circuits you can construct in the FPGA that you cannot express directly in an HDL, and there are things you can write in an HDL that cannot be synthesized into a set of linked logic elements and a clock. On the surface this might seem like saying "Well yeah, you can write things in assembly that the compiler won't generate," but it goes deeper than that. From programming what the pins on the package do to changing clock networks to get timing closure on complex designs, an FPGA is not a stored-program computation device unless you configure it as one.

[1] Computer Engineering as a major wasn't a "thing" yet so I ended up taking all the EE major classes and nearly all the CS major classes (I missed out on some of the logic oriented math classes of CS because of the physics and materials science requirements on the EE side).


Instead of complaining about the quality of the article, I’m thinking about the open question at the end. What would be a higher level abstraction for FPGA development?

Is it possible to create a language around FSMs? Most hardware seems to have two parts: the actual logic that implements some functionality, and then an FSM that implements the control logic. The FSM may also have a lot of implicit/assumed state (like a counter for some timeout). Maybe a higher-level language could expose these design patterns in a nicer way and hide all the messy low-level details (like sequential/combinational logic, connecting ports and wires, matching signal widths, etc.).
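
As a rough sketch of the boilerplate such a language could hide (signal names are invented and the enclosing module/ports are omitted), even a tiny FSM with a timeout currently needs a state register, separate next-state logic, and an explicit counter:

  // Sequential part: state register plus the "implicit" timeout counter.
  localparam IDLE = 2'd0, BUSY = 2'd1, DONE = 2'd2;
  reg [1:0]  state, next_state;
  reg [15:0] timeout_cnt;

  always @(posedge clk) begin
    if (rst) begin
      state       <= IDLE;
      timeout_cnt <= 16'd0;
    end else begin
      state       <= next_state;
      timeout_cnt <= (state == BUSY) ? timeout_cnt + 1'b1 : 16'd0;
    end
  end

  // Combinational part: next-state logic.
  always @(*) begin
    next_state = state;
    case (state)
      IDLE: if (start)               next_state = BUSY;
      BUSY: if (ack || &timeout_cnt) next_state = DONE;
      DONE:                          next_state = IDLE;
    endcase
  end

A higher-level FSM language could, in principle, generate all of this from a transition table plus a declared timeout.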



Right, well, the problem with all articles of this kind is that the authors are doing the equivalent of trying to force a square peg into a round hole.

You do not PROGRAM an FPGA in the software engineering sense of the word, you CONFIGURE it.

You do not use a programming language to create your configuration bitstream; you use a HARDWARE DESCRIPTION LANGUAGE.

You describe the hardware you want and how it is to be interconnected. You do this either explicitly (by literally writing code that wires specific resources) or by inference (using idioms, if you will, that you know result in specific hardware within the device).
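
For example (a sketch with invented names, enclosing module omitted), this is the sort of idiom most FPGA synthesis tools recognize and map to a block RAM, even though the code never names a RAM primitive:

  reg [7:0] mem [0:1023];      // typically inferred as block RAM
  reg [7:0] dout;

  always @(posedge clk) begin
    if (we)
      mem[addr] <= din;
    dout <= mem[addr];         // synchronous read is the key part of the idiom
  end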

Thinking of Verilog or VHDL as software programming languages is wrong and can only lead to frustration. FPGAs are still very much the domain of hardware engineering. Well, at least if what we are after is efficient, high-performance results.

If, as a software engineer, you want to take advantage of an FPGA to accelerate processing, you should work with a capable FPGA hardware engineer to create a device with an “API” (using the term loosely) that exposes the desired functionality. I’ve done just that more times than I can remember: hanging a large FPGA off something like a small 8-bit 8051-derivative processor, letting the micro access powerful computing resources in real time through means easily accessible with simple C functions.

If you use the right tool for the job and do it correctly it can be blissful; try to force a paradigm that does not match reality and all you get is frustration.


The author is obviously deeply aware of all this. It is exactly what he is complaining about.

Why should he have to hire you to do something he could do himself, given better tooling?

It is the same problem business people had, needing to explain their problem to a systems analyst and get programmers to code it. Then spreadsheets came along, and 99% of business programming just totally disappeared into them.

Maybe you can do it better than he could. Doesn't matter, there's only one of you, and what he could do would be good enough. Just like spreadsheets.


Your perspective on this is flawed. The comparison to spreadsheets isn’t applicable here. Not even close.

That said, I do not expect you to understand my perspective if you are not a hardware engineer. This isn’t a put-down, just reality.

There have been many attempts to make FPGAs fit into a nice software engineering paradigm. I can’t think of one, a single one, that compares well to what a hardware engineer can do by treating it like hardware.

A simple example I can give you comes from my own work going back about 15 years. Part of my design needed a very high performance multiphase FIR filter. The tools, even with all possible optimizations enabled, could not get the performance we needed out of this chip. If we wanted to stay with that approach the only option was to go up to a larger and faster FPGA as well as up one speed grade. That would have cost a bundle.

Instead, I hand-placed, wired, and routed the filters. As a result I was able to get 2.5 times the performance the compiler could get. We were able to stay with the smaller chip and squeezed even more out of it during the five-year product life cycle.
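
To give a feel for what hand placement looks like, here is an illustrative sketch only (Xilinx-style relative-location attributes on flip-flop primitives; the names, coordinates, and structure are invented, not the actual filter described above):

  // Pin individual tap registers to chosen relative locations instead of
  // letting the placer decide.
  (* RLOC = "X0Y0" *) FDRE tap0_q (.Q(q0), .C(clk), .CE(1'b1), .R(1'b0), .D(d0));
  (* RLOC = "X0Y1" *) FDRE tap1_q (.Q(q1), .C(clk), .CE(1'b1), .R(1'b0), .D(d1));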

FPGA’s are hardware, not software.


If you built the exact same product today, almost the cheapest FPGA you can buy would be fantastically overprovisioned. So, today, all your extra effort then would be superfluous, and you could treat the design like software, using defaults.

Programming computers used to be like hardware; where you stored a variable on the drum mattered because you couldn't afford to wait a whole rotation to fetch it again. Now most code is just code, and Python is often fast enough for production.

I know the difference. When I take packets from a 40Gbps link, I'm programming hardware. I'm counting cycles. Most problems are not like that, and don't deserve your attention. That doesn't mean they shouldn't be done, it just means somebody else should do them, using less specialized tools.


And that’s where you are wrong. Who builds the same product they were building 15 or 20 years ago? It never gets simpler unless what you are working on is trivial. Technology has always been about pushing the limits.

You never have enough chip resources, speed and computing power to spin the next step up in a design, much less keep up with the state of the art in any non-trivial domain a decade or two later.

Put a different way: if you are using FPGAs correctly they will never be like software, because that would be wasteful of resources and performance. If it’s easy and “like software,” it can probably be done in software with an off-the-shelf CPU/GPU.


Not true. Many, many people have so much computing power to spare they program in Python. In some cases Python isn't up to the job, and C++ on a microcontroller isn't quite, either, but a cheap FPGA could do it easily.

What is true is that they don't hire you for those projects. People who have harder problems and enough money to bring you to bear do. But you are far from representative.

Are they using FPGAs correctly? Who cares? They don't.


It would sure help your case if you provided representative use cases. Look through Xilinx application notes and the reality of FPGAs is very clear. It’s fine to discuss these things academically, but as a practitioner and businessman, the demarcation lines are very clear, both on technical and financial grounds.


Just because something "has been done before" doesn't mean it's not useful now, the context has changed. Electric cars were once the default, and then internal combustion took over... and now they are back... but the were definitely "done before".

Bulk computation is what we need. The maximum use of all of the transistors in a chip, at the minimum necessary clock speed and voltage to get the job done. Delays don't matter at all if you get a result each clock cycle.


You are right that the FPGA vendors have so far not provided the right abstraction for computing.

That's why at InAccel we developed an FPGA resource manager that lets you instantiate and deploy FPGAs the same way you invoke typical software functions. The FPGA manager takes care of the scheduling, resource management, and configuration of the FPGAs from a pool/marketplace of hardware accelerators.

That way it is easier than ever to use FPGAs the same way you use optimized libraries for CPUs/GPUs.

And we have a free community edition. More info at https://www.inaccel.com/


Has anybody here read "FPGAs for Software Programmers" (https://www.amazon.com/FPGAs-Software-Programmers-Dirk-Koch/...) ?

Any inputs/thoughts on how appropriate it might be for a Software Engineer to learn and program FPGAs?


There's a reason why GPUs are the de facto standard for increased performance.


Which vendors, if any, don't keep bitstreams a secret?


What I find interesting about people who pick fights with FPGA programming is that they entirely focus on the things that FPGA doesn't do. If you can solve your problem on a CPU then it will practically never be better to do it on an FPGA. If you want to make FPGAs better, you need to figure out how to make what we already do on an FPGA easier.

I think part of the problem here for example is

>That is, Verilog is to an FPGA as an ISA is to a CPU.

No! Not at all! For a start, I can literally write Verilog that won't work on any FPGA ever:

  always @(posedge multiplier_result[8]) begin $display("This is nuts!"); end

Verilog is not an FPGA language; Verilog is a hardware language of which vendors implement a subset on any given FPGA.

Here would be my advice to someone who wants a better abstraction for FPGAs: stop relating them to other things that behave very differently. Timing is important, placement is important, mapping is important. If your abstraction doesn't include these elements it is fundamentally flawed.


I agree with the first part of your comment but you lost me there:

>Timing is important, placement is important, mapping is important.

It's true, but Verilog doesn't usually expose that. You let the tool figure out placement and routing and do the timing analysis. If you want to mess with that yourself, you usually have to use tool-specific methods, for instance to partition your design and do partial routing. I don't think you're arguing otherwise, but then I'm not sure what you're getting at.

I don't directly do FPGA development myself, but I do work with FPGA/ASIC developers (I do the software side of things), and boy, I do not envy them. It makes me realize how lucky we are in the software world to have GCC/LLVM and all the other open-source tools. We don't have to use unstable, buggy, obscure programs that are effectively unusable for anything remotely advanced unless you have a support contract with the vendor to help you figure out what to do when something isn't working as intended.


The toolchain is indeed rubbish. One of the ironies of the world: while software engineers tend not to understand how FPGAs actually work and which bits are important, the tools to compile for FPGAs are generally written by hardware engineers who don't know how to write good software tools.


You say that because you're not enlightened enough to find the abstract beauty behind 10k lines of spaghetti TCL script.


The part you quoted is a part of a thought experiment. The author is explicitly saying that Verilog is _not_ an ISA:

> By way of contradiction, let’s imagine what it would look like if RTL were playing each of these roles well.

The point of this section was to point out that the people who use FPGAs as accelerators (a la MSFT Catapult) use them with the ISA abstraction: they don’t want to care about the timing, they just want to get the dataflow graph to accelerate the computation.

Disclaimer: I work with the author of the article on acceleration architectures and languages.


Regarding the end of the article:

> A new category of hardware that beats FPGAs at their own game could bring with it a fresh abstraction hierarchy. The new software stack should dispense with FPGAs’ circuit emulation legacy and, with it, their RTL abstraction.

Are you or the author working on something along these lines, and if so, can you give an example? To tell you the truth, I was expecting a proposal and felt kind of "robbed" when I reached the end and no alternative was given.


Right, but people who use FPGAs as accelerators a la MSFT Catapult went to Altera and asked them specifically to write and ship RTL to enable them to do so. They use it with an ISA abstraction because they bought so many chips that Altera did a big chunk of the design work for them. That's a valid approach, but I wouldn't say that's really doing FPGA design.


That's not talking about an FPGA any more, more like a hypothetical FPPU.



