Assembler for Nvidia Maxwell architecture (github.com/nervanasystems)
132 points by luu on May 3, 2015 | 10 comments



I once found the ISA[1] for the CUDA GPUs, and always kind of assumed Nvidia provided you with an assembler. However, given that GPGPU programming is nowhere near my line of work, I never had the chance to check my assumption.

It's amazing that he took the time to make an assembler. I'm also left wondering how much performance he can gain from his tool compared to the Nvidia-supplied toolchain.

[1] http://docs.nvidia.com/cuda/parallel-thread-execution/


PTX is a virtual ISA that is not the same as the machine code that runs on the device.

In his Introduction[1] wiki page, he says that he studied sgemm implementations and concluded that NVidia is not using PTX but an assembler for the real ISA, which is not distributed to developers. He claims that his sgemm implementation is almost 5% faster than NVidia's and faster than anything that can be done in PTX.

[1] https://github.com/NervanaSystems/maxas/wiki/Introduction
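For the curious, you can see both layers with NVIDIA's own tools: nvcc can emit the virtual-ISA form (PTX), while cuobjdump disassembles the actual machine code (SASS) that an assembler like maxas targets. A rough sketch, assuming a CUDA toolkit is installed and a hypothetical kernel.cu:

```shell
# Emit PTX, the virtual ISA, for a Maxwell-class target (sm_52).
nvcc -arch=sm_52 -ptx kernel.cu -o kernel.ptx

# Compile the same kernel down to an actual device binary...
nvcc -arch=sm_52 -cubin kernel.cu -o kernel.cubin

# ...and disassemble that binary to SASS, the real machine ISA.
cuobjdump -sass kernel.cubin
```

The SASS output is what you'd compare against maxas-generated code; there is no officially supported path going the other way (assembling SASS), which is the gap this project fills.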


Looking at some of the details in https://github.com/NervanaSystems/maxas/wiki/Control-Codes reminds me somewhat of the Itanium: a very wide architecture that is capable of high throughput for specialised applications, but requires a lot of software-level support to even work correctly (e.g. explicitly tracking instruction dependencies). The fact that it's not well documented is another similarity.

It would be great if nVidia released more documentation, because chances are developers could squeeze even more performance out of their hardware that way.


From what I understand, currently almost all GPU architectures* are some variant of VLIW, with some "tricks" to get the processing cores to work with high efficiency (that is, use all the cores available for a given task).

A few years ago Michael Abrash wrote a wonderful article[1] on Intel's Larrabee project, which ultimately resulted not in a graphics card, as he intended, but in a "processing card"[2][3]. Of course, anything you read by Abrash on that topic will be more than interesting.

* Intel HD GPUs may or may not be VLIW; I have never been able to find detailed specs of their architecture.

[1] http://www.drdobbs.com/parallel/a-first-look-at-the-larrabee...

[2] http://en.wikipedia.org/wiki/Larrabee_%28microarchitecture%2...

[3] http://en.wikipedia.org/wiki/Xeon_Phi


I was about to say "not really, the only current VLIW GPUs are ARM's Mali" but after reading more of the documentation I guess you're kind of right... Maxwell requires explicitly encoding when to dual-issue instructions, which is basically the defining feature of VLIW.

That said, VLIW is designed/intended for architectures that are highly superscalar within a single thread of execution. GPUs have eschewed that model in favor of simply executing more threads. So a better (simplistic) model for viewing modern GPUs is "AVX-1024 with massive hyperthreading".
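As a toy illustration of that model (entirely my own sketch, not real hardware behavior): in SIMT, a "warp" of lanes executes one instruction in lockstep, like slots of a very wide vector register, and branch divergence is handled by masking lanes off rather than giving each lane its own program counter.

```python
# Hypothetical toy model of SIMT lockstep execution; names are invented
# for illustration and don't correspond to any real API.

WARP_SIZE = 32

def warp_execute(op, lanes, mask=None):
    """Apply the same scalar op to all active lanes in lockstep."""
    if mask is None:
        mask = [True] * len(lanes)
    # Inactive lanes idle; they don't take a different branch.
    return [op(x) if active else x for x, active in zip(lanes, mask)]

# Each lane holds one element of a very wide "vector register".
data = list(range(WARP_SIZE))

# Uniform execution: every lane runs the same multiply.
doubled = warp_execute(lambda x: x * 2, data)

# Divergence: an 'if x % 2 == 0' branch becomes a per-lane mask.
mask = [x % 2 == 0 for x in data]
result = warp_execute(lambda x: x + 100, doubled, mask)
```

The "massive hyperthreading" part is then just the hardware keeping many such warps resident and swapping between them to hide memory latency, instead of extracting parallelism within one thread the way VLIW does.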


And with that cue, you gave me room to place another article by Michael Abrash on the Larrabee architecture (which is pretty much the model you described): http://www.drdobbs.com/parallel/rasterization-on-larrabee/21...

This thread has been wonderful for reading all these amazing things about different architectures, hardware and software, but I should get back to work. I leave with a lot more to read next weekend! :)


Yes, I was aware PTX is a virtual ISA. I had missed the 4.8% number on the Introduction when I first read it.


There are some benchmarks of a neural network toolkit built on top of this: https://github.com/soumith/convnet-benchmarks

Compare NVIDIA's own cuDNN R2 versus NervanaSys-16 and NervanaSys-32. Pretty impressive!

I've also tried out his GEMM implementation on a GTX 980. Seems like it can be up to twice as fast as the one from cuBLAS for some matrix sizes.


In case anyone is wondering, "Maxwell" is Nvidia's current shipping GPU microarchitecture: https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%2...


Good work.



