Render a neural network into CUDA/HIP code (github.com/facebookincubator)
177 points by fzliu on June 2, 2023 | 65 comments



AITemplate's original designer here. We quit Meta in January and started HippoML (https://hippoml.com/). We just disclosed our new engine's performance on LLMs: https://blog.hippoml.com/large-language-model-inference-from... On an Apple M2 Max, our new engine's encode/decode is 13.8X/2.4X faster than llama.cpp.


Really doesn't surprise me that much. Llama.cpp seems like an OK first pass, but I assume there's a lot of time left on the table in terms of graph optimizations and using the memory hierarchy properly.


It also doesn't use Apple GPUs at all. It's 100% CPU inference, with some CUDA/OpenCL (but no Metal and no zero-copy) offload at the moment.


It is actually non-trivial to get the GPU to run fast, especially on an SoC with a strong CPU like the M2.


GPU programming in general is definitely not trivial, as I can confirm while struggling to learn WebGPU right now.

But it really depends on the problem: simple math operations on lots of data are usually indeed trivially faster, which is mostly what AI is - math on matrices.

Or for example, I just implemented a simple 2D raycast solver in wgsl, and as a first project it is totally not optimized - but even on my old laptop with a crappy integrated GPU and a (relatively) fast CPU, I can now do 10000 raycasts per frame easily, while the CPU (wasm!) struggles with 500.

The raw power of the GPU is really awesome. But every step is hard and debugging is a mess, which is why only a handful of people seem to be doing it. Now would probably be a good time to get into it though, as I think GPU compute has just gotten started and will get big.


I've been out of the space for a long time, and it's possible you know these already, but here are a few tricks that can help:

* Radix sort is your friend. Fun fact, O(n log n) is not the fastest a sort can run, it's the fastest a comparison-based sort can run. Radix sort runs in O(N) time, and in fact parallelizes extremely well on GPUs. Extremely. They are great at it. And there are in-place radix sorts too, just a bit slower (same asymptotic performance tho).

* "Insert an element into this collection" style steps can be replaced by a sort and a prefix-sum operation. If you know the offset of the first element with key K, and you know the offset of the first element with key J, you know the offset and size of that "collection" for K within a flat array ("size(K) = offset(J) - offset(K)"). Both of these run super fast in parallel and if you can tweak your problem around to be some kind of sorting operation that usually produces good speedups like this. Easiest way to get a speedup from everything I've heard, if your program works fast from a lookup table it's a good approach.

* Recomputing (or duplicate computation) is often much faster than storing intermediate results. "Procedural generation" is interesting because you can re-compute the generation step on demand. Random123 is also very nice compared to a (cuRAND) Mersenne Twister/etc - why are you, a believer in the cryptographic maxim that hash(key, msg) is uncorrelated with hash(key, msg+1), still storing RNG state? Being able to play back arbitrary parts of a datastream at will is incredibly powerful: you can fast-forward and rewind through the data previously used to interact with an item, as long as you know the epoch of the interaction you want for a particular key. And because computation is cheaper than memory capacity and memory bandwidth, it's practically free in program-time terms to just do some math. This is a form of data compression and performance enhancement.

* Generally you must understand the idea of divergence and memory coalescing/alignment to keep those lanes executing. And it is highly preferable to use sorts and prefix scans and butterfly operations (reduction, etc) even within a warp, because traditional "mutex/atomic" paradigms don't work well with 100k threads. But this is just the programming idioms of this particular platform, I am sure LISP is similar too in terms of "oh that's how you do that" once you're accustomed.

("This thread would like to write 3 items to an output array, can I get an alignment into a datablock the warp group is going to allocate from a global counter via atomic_add and write to" is another common task element in many parallel workloads. That is the same prefix-sum idiom, or a butterfly reduce, just at a warp level. And then when all the warps have finished, you can sort this unsorted output by a key array to align it, and start processing it again...)

* Cuda Unbound (CUB) provides reference implementations of all of these operations, and it handles portability across CUDA generations. Having good primitives makes a big difference, most of the workload isn't business logic, it's library calls, so using a library that just does it right gives you a lot of bang for buck. Don't write it yourself.

* Texture maps aren't just for graphics, they are a black box that lets the GPU perform 2D and 3D coalescing and some interpolation.

* Intelligent use of constant memory is another one, and probably the use of CPU memory too. If a value will be seldom accessed, you can probably stuff it into host memory and just accept the slowdown. Or you can store only epochs on the GPU and recompute intermediate values as needed. Try to ensure that all threads in a warp make the same choice (host access vs recomputing), though.

* Raytracing is of course impervious to all of this (so don't worry too much that you can't magically hammer a speedup out of it, nobody really can). You can accelerate the raycasting and intersection testing (and AMD and NVIDIA and Intel all do this differently), but as a general matter rays are completely random and uncoalesced and divergent; they bypass almost all of the ideal GPGPU "hot paths". Ray sorting/shader execution reordering is something that needs hardware assistance, and Intel and NVIDIA both have hardware along these lines. Intel's idea of making a facility for async future/promise dispatch for sparse tasks (and then sorting the shaders to get good coalescing/etc) is really neat, and they've said it's going to come to GPGPU. https://youtu.be/SA1yvWs3lHU?t=289

* You can, however, use your rays more efficiently. And that's an area of active focus for everyone. And I think more efficient use of TAAU samples is probably where raster is going too.
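To make the sort + prefix-sum point concrete, here's a rough sketch using Thrust (which dispatches to CUB's radix sort underneath). The function name and the assumption that every key in [0, num_keys) occurs at least once are just for illustration:

    // Replace "insert item into per-key collection" with sort + prefix sum.
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <thrust/scan.h>
    #include <thrust/iterator/constant_iterator.h>

    // items[i] belongs to bucket keys[i].
    void bucket_by_key(thrust::device_vector<int>& keys,
                       thrust::device_vector<float>& items,
                       int num_keys)
    {
        // 1. Radix sort pairs by key.
        thrust::sort_by_key(keys.begin(), keys.end(), items.begin());

        // 2. Count how many items landed in each bucket.
        thrust::device_vector<int> unique_keys(num_keys), counts(num_keys);
        thrust::reduce_by_key(keys.begin(), keys.end(),
                              thrust::constant_iterator<int>(1),
                              unique_keys.begin(), counts.begin());

        // 3. Exclusive prefix sum of the counts gives each bucket's start:
        //    size(K) = offset(next key) - offset(K), as described above.
        thrust::device_vector<int> offsets(num_keys);
        thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());

        // items is now a flat array of contiguous per-key "collections",
        // addressed by offsets[] and counts[], with no per-element inserts.
    }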


Personally I would lean away from radix sort once the problem size gets too large, on Turing they made the per-SM L1 cache coherent and on Ampere they made L2 cache significantly larger so getting maximum throughput requires more actively designing for cache-friendly access patterns. The access pattern of radix sort will be very strided, so you get low cache utility for those longer distance swaps. If your problem fits into shared memory or even registers (for in-warp shuffles) then definitely, radix sort is super efficient!

Memory coalescing probably got a little more lenient than when you were programming. If any group of threads in a warp is accessing a 128B size-and-aligned region of memory, that perfectly matches the 128B line size between L1TEX cache and L2 cache and you get coalesced accesses. Even 32B size-and-aligned (e.g. each pair of threads accesses two halves of a 32B region) isn't all that bad, since it aligns with the 32B L2 cache line size to global memory. You run the risk of getting poor load/store utility if you don't access the other 96B in that L1TEX line somewhere in the same block, though.
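As a concrete (hypothetical) illustration of the 128B case: a warp in which each thread loads one float4 touches exactly four size-and-aligned 128B lines, so every byte pulled in gets used:

    // Each of the 32 threads in a warp loads one 16B float4; consecutive
    // threads hit consecutive 16B words, so the warp covers four 128B
    // size-and-aligned lines with no wasted sectors.
    __global__ void copy_coalesced(const float4* __restrict__ in,
                                   float4* __restrict__ out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }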

Constant memory is less important than it used to be, since constant memory goes through the texture memory, and the L1 cache was unified and combined with the texture cache. There’s still an LDG.CONSTANT SASS instruction on Turing, though, so there’s still a difference in the hardware that escapes me right now. IIRC you can write a MOV instruction with the source operand in constant memory (rather than LDG.CONSTANT) which doesn’t put load on the load-store unit (LSU). Could be useful if you’re getting warp stalls due to memory pipe throttling.
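For reference, a minimal constant-memory sketch (hypothetical kernel, just to show the shape): the point is that a warp-uniform read like this is eligible for the constant/broadcast path rather than a per-lane gather through the LSU:

    __constant__ float coeffs[16];   // filled via cudaMemcpyToSymbol on the host

    __global__ void scale(const float* in, float* out, int n, int which)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // All threads in the warp read the same constant address (uniform in
        // 'which'), so it can be broadcast instead of issued as a gather.
        if (i < n) out[i] = in[i] * coeffs[which & 15];
    }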


Interesting, love it!

> on Turing they made the per-SM L1 cache coherent and on Ampere they made L2 cache significantly larger

Yeah, Turing/Ampere are more than people give them credit for. They are quite a bit more powerful in compute in general; iirc the core is really Volta-descended, not Pascal.

> so getting maximum throughput requires more actively designing for cache-friendly access patterns. The access pattern of radix sort will be very strided, so you get low cache utility for those longer distance swaps. If your problem fits into shared memory or even registers (for in-warp shuffles) then definitely, radix sort is super efficient!

Does dumping warp-sorted data into the buffer help? Eg can each warp or each block sort their output so what they're dumping in can be "galloped", like a partially-sorted mergesort or something? Finding those boundaries between output cells is (ironically) another prefix-scan lol.

Is it just that you need a different sort, or does sorting just not work that well now?

Are prefix-scans still good at all?

Interesting, anything in particular to read about it from Anandtech or Chips+Cheese or what? Or should I just read the whitepapers? (/sigh, reading primary sources, my only weakness)

(I really wish there was an Agner Fog for GPUs!)

> IIRC you can write a MOV instruction with the source operand in constant memory (rather than LDG.CONSTANT) which doesn’t put load on the load-store unit (LSU).

hackerman.jpg


> Interesting, anything in particular to read about it from Anandtech or Chips+Cheese or what? Or should I just read the whitepapers? (/sigh, reading primary sources, my only weakness)

I haven’t had great luck with consulting non-primary-sources, most of my knowledge comes from reading NVIDIA blog posts and GTC presentations as they become relevant. Lately I’ve been working with CUTLASS and reading through that documentation — maybe start with their presentations and work back through their references? I’ve learned a lot by reading the architecture tuning guides from NVIDIA, too.

> Does dumping warp-sorted data into the buffer help? Eg can each warp or each block sort their output so what they're dumping in can be "galloped", like a partially-sorted mergesort or something? Finding those boundaries between output cells is (ironically) another prefix-scan lol.

> Is it just that you need a different sort, or does sorting just not work that well now?

I’m not sure, honestly! All I know is that recently I’ve been looking at radix sort kernels with lower-than-expected memory throughput and low cache hit rates :)

> Are prefix-scans still good at all?

The CUB linear-time prefix scan kernels seem to be fantastic still, they operate basically at the speed of DRAM with really high compute utilization. When I’ve seen lower-than-expected performance with these kernels, it’s because of an issue with an inefficient transform being made as part of the input/output iterator ranges, or because of some local memory usage due to an indexing operation in a local variable that couldn’t be fully inlined into registers.
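For anyone who hasn't used them, the device-wide CUB scans follow the usual two-phase CUB call pattern (sketch with made-up pointer and function names):

    #include <cub/cub.cuh>

    void exclusive_prefix_sum(const int* d_in, int* d_out, int num_items)
    {
        void*  d_temp     = nullptr;
        size_t temp_bytes = 0;
        // First call: only computes the required temporary storage size.
        cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_in, d_out, num_items);
        cudaMalloc(&d_temp, temp_bytes);
        // Second call: performs the actual scan.
        cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_in, d_out, num_items);
        cudaFree(d_temp);
    }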


zero-copy with mmap was added to llama.cpp, but the way it was implemented sparked controversy.


I think GP meant zero-copy communication with the GPU, eg. through `newBufferWithBytesNoCopy` [0], which is only possible with unified memory architectures, eg. integrated GPUs.

The mmap change was just about mapping the model files in memory instead of copying them, which has less overhead.

[0]: https://developer.apple.com/documentation/metal/mtldevice/14...


> I think GP meant zero-copy communication with the GPU, eg. through `newBufferWithBytesNoCopy` [0], which is only possible with unified memory architectures, eg. integrated GPUs.

It can still be beneficial on discrete GPUs because it avoids a copy inside the driver.
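The CUDA-side analogue, for comparison (a sketch; the names are made up): pinned host memory mapped into the device address space, so kernels read it directly instead of staging a copy:

    #include <cuda_runtime.h>

    float* map_weights_zero_copy(float*& h_weights, size_t bytes)
    {
        float* d_weights = nullptr;
        cudaHostAlloc((void**)&h_weights, bytes, cudaHostAllocMapped); // pinned + mapped
        cudaHostGetDevicePointer((void**)&d_weights, h_weights, 0);    // device-visible alias
        // Kernels can dereference d_weights directly, with no cudaMemcpy;
        // on a unified-memory SoC this is essentially free, while on a
        // discrete GPU the reads cross PCIe but skip the driver staging copy.
        return d_weights;
    }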


Yeah this is precisely what I meant. I think it is possible in Vulkan too.

I brought it up because shuffling weights around takes lots of time, takes up more RAM, and saps memory bandwidth the IGP and CPU desperately need.


Very interesting.

Is 8bit/4bit support in the works? Will it work with bitsandbytes out of the box? Speedy inference is great, but in practice many users are running the biggest ~4-bit LLM that will fit into their RAM/VRAM pool these days. This is why llama.cpp is so good: it's (AFAIK) the only implementation that will split a 4-bit quantized model so easily.


Yes. We support 1-bit through 16-bit models out of the box, for a variety of models.


What is your planned business model here?


We will disclose more details very soon.


Will this have some similarities to what Mojo is trying to solve?


Mojo is trying to create a new language to solve the problem, and it is specialized for CPU. We are taking a more pragmatic approach to solving the GPU AI computation problem.


For bandwidth-bound problems like large language models, you could also solve them with properly written CPU kernels (Mojo) and use of the AMX accelerators for the compute-intensive parts.

I'd be more interested if they had a GPU port of Stable Diffusion, which is really compute-intensive. That's where the GPU has a major advantage over CPU, on lower-end chips like M1/M2 and M1/M2 Pro.

> specialized for CPU

Mojo's SIMD execution model should map directly to GPUs. Instead of writing shaders thinking about a single GPU thread, you access the assembly ISA directly, thinking about an entire warp/simdgroup. That's how I think when writing SIMD-group matrix kernels anyway.
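In CUDA terms, that style looks roughly like this (a minimal sketch, nothing Mojo-specific): the code reasons about a whole warp at once via shuffle intrinsics rather than about one scalar thread:

    // Butterfly/tree reduction across one warp: after the loop, lane 0 holds
    // the sum of all 32 lanes' values, with no shared memory or atomics.
    __inline__ __device__ float warp_sum(float v)
    {
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        return v;
    }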


Mojo is not at all specialised for CPU. It sits on top of MLIR and excellent support for all major accelerators is planned.


Would it work with instructor-xl or similar which is designed for embeddings and retrieval? On device for privacy is key.


Yes


Do you know how its speed compares to exllama, specifically with an Nvidia GPU, by chance?


We haven't compared yet.


Any idea how hippo, AI Template and TVM compare in performance?


Hippo is faster than AITemplate, and supports more generative models. We haven't compared vs TVM, but in absolute tokens/s, Hippo is able to run LLaMA decoding on an M2 Max at the performance level of datacenter GPUs (running other software).


The difference in bandwidth between the M2 Max and datacenter GPUs isn't that much (less than a factor of 5). The difference in compute is much, much larger. If you only have fast GEMV kernels, and not fast GEMM kernels, you're basically locked into an inference engine that can only run GPT-style transformers efficiently. It can *technically* support all the models, but at what ALU utilization?


Thanks, I've added myself to the waitlist. Please let us know when this can be tried!


How does Hippo compare with TensorRT?


Did Facebook invest in this? Is that why it's under facebookincubator?


We developed AITemplate mainly for Meta's focus at that time, e.g. ads/ranking needs. HippoML is a startup we are building for generative AI. HippoML is not using AITemplate.


Also here are some other interesting projects in the ML compilation space:

- Apache TVM (mlc-llm is a good demo)

- Hidet (a torch.compile backend)

- Alibaba BladeDISC

- Nvidia TensorRT (a classic, but much less of a nightmare to install now)

- Torch MLIR (SHARK has some demos/implementations)


And of course, Chris Lattner’s Modular AI https://www.modular.com/


Yes. I look forward to a future where we don't have to wrestle between software and hardware. AI just runs, whether on CPU, GPU, TPU, QPU, or whatever architecture you want to design. Time to move on and focus on more important problems, like solving climate change.


I just ran a 512x512 Stable Diffusion benchmark with this yesterday

Pytorch Eager Mode with some optimizations: ~6it/s

Pytorch Inductor (torch.compile with dynamic=True): ~7it/s

AITemplate: ~9it/s

All of them support changing settings and such, albeit with some work in progress bugs/caveats.

That is 512x512 on a 2060, so I would expect the gains to be bigger on newer GPUs with more overhead to take advantage of.


Did you try TensorRT?


Not yet. TRT diffusion has been an enormous pain in the past, so I have kinda avoided it, but Nvidia just recently contributed an img2img pipeline in HF diffusers.


The latency improvements are impressive but the ability to run models beyond their typical memory limitations is way cooler.


CUDA: NVIDIA GPU 'framework'

HIP: AMD GPU 'framework'

This takes neural networks defined in Python and converts them to C++ code calling CUDA/HIP for maximum inference speed.


I like this humility: "AITemplate is co-created by Meta engineers: Bing Xu, Ying Zhang, Hao Lu, Yang Chen, and Terry Chen, with major contributions coming from more talented engineers."


aliens


Interesting that AMD GPUs seem to be 1st class citizens here. Consumer class gear is much cheaper per unit of VRAM by the looks of it


The thing with AMD is that the selection of GPUs that can actually run your GPGPU code is typically much more limited than with Nvidia, so things work on far fewer consumer GPUs. Here it's limited to "CDNA2 (MI-210/250) GPUs". Those are priced comparably to Nvidia's A100, $10k+ per card.


At first glance I thought maybe it's like tinygrad, but it looks like it has more ops than tinygrad - though most map to underlying hardware-provided ops?

I wonder how well tinygrad's approach will work out. Op fusion sounds easy: just walk a graph, pattern match it, and lower to hardware-provided ops?

Anyway, if anyone wants to understand the philosophy behind tinygrad, this file is a great start: https://github.com/geohot/tinygrad/blob/master/docs/abstract...


antinucleon lol

LLaMA is a memory-bound AI model, where the dominant factor in execution time is how fast the processor transfers weights from RAM to registers. llama.cpp uses a misaligned memory pattern that's painful to the RAM I/O interface and requires many CPU instructions to re-align. It also has access to 1/2 the bandwidth of the GPU on Apple's highest-end chips, making the GPU *theoretically* 2x faster without algorithm changes.
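Rough back-of-envelope for why bandwidth is the whole game during decode (the numbers below are illustrative assumptions, not measurements):

    #include <cstdio>

    int main()
    {
        // A ~7B-parameter model at 4 bits is roughly 3.5 GB of weights, and
        // every decoded token streams essentially all of them once.
        const double weight_bytes = 3.5e9;
        const double cpu_bw = 100e9;   // assumed CPU-visible bandwidth, B/s
        const double gpu_bw = 200e9;   // assumed GPU-visible bandwidth, B/s (~2x)
        printf("CPU decode ceiling: ~%.0f tok/s\n", cpu_bw / weight_bytes);
        printf("GPU decode ceiling: ~%.0f tok/s\n", gpu_bw / weight_bytes);
        return 0;
    }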

Funny how you announced it the exact day after I open-sourced a high-bandwidth 4-bit GEMV kernel for LLaMA. If anyone wants to see how to achieve 180+ GB/s on the M1/M2 Max, you can reproduce the methodology here:

https://github.com/ggerganov/llama.cpp/pull/1642#issuecommen...

I later explained the technique for writing very optimized kernels for a specific quantization format. *It's very unlikely my code was copied verbatim*, but it was probably used as inspiration and/or a reference. I also disclosed a high-precision means of measuring bandwidth utilization, which is critical for designing such GPU kernels.


To give more credit, I know your company was working on your own optimizations, which are prior art. It's possible that you made a 180 GB/s shader on your own (quite slow compared to my 319 GB/s). Or that the 319 GB/s was used, but the self-attention bottleneck was non-negligible.

However, for whatever reason, when Georgi Gerganov started work on the Metal backend, you made a product announcement almost a day later. That seems like a non-coincidence, and there must be some logical explanation.


ELI5 what this means?

I am losing my bibliography, etymology and vocabulary with every single AI advancement article.

Where learn AI vocab, please?

-

I need an FN AI teacher to just give me daily updates on AI and verbiage, models, etc...

Hey AI - if you're so smart, build a podcast that teaches me about yourself and how to be a better meat parent whom made you.///\\\\


Starting with a trained PyTorch model, it builds optimized C++ binaries for running inference (not training) on Nvidia and AMD GPUs. Various optimizations are mentioned, so presumably models run faster than just running them via regular PyTorch.


I'm curious why that is called "rendering" rather than "compiling". Is the code boilerplate and just a change in the NN's representation?


How much faster is it?


Depends on the model and GPU. Here is an example of almost 2x on a 3060 for StableDiffusion: https://www.youtube.com/watch?v=_6BsUijOWoM


Very much not an expert here but what I understand is that most deep learning frameworks (PyTorch, Tensorflow, etc.) have some overhead associated with them just being on the graphics card. This takes PyTorch code and removes the overhead by translating the network into a "native" language for the card (CUDA for NVIDIA).

What I'm not sure about is what "HIP" is in this context.

The way I'm reading this is it's the difference between running code in an interpreter vs. on the bare metal (for the GPU)


HIP is AMD's open-source re-implementation of CUDA libraries.


HIP is AMD's contribution to the open source community to overcome Nvidia's CUDA software moat

You write in the HIP C++ language and run the same code on either NVIDIA or AMD platforms. This way you get cross-platform code and are not stuck with Nvidia.

Use the HIPify tool to automatically convert existing sources from CUDA to HIP.

It's been around for many years - but the fact that so many people still don't know about it speaks to the sad state of AMD's communication.

https://docs.amd.com/bundle/HIP-Programming-Guide-v5.3/page/...
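To give a feel for it, the HIP side is essentially a 1:1 rename of the CUDA runtime API with the same kernel syntax (illustrative snippet, not from any particular project):

    #include <hip/hip_runtime.h>

    __global__ void saxpy(int n, float a, const float* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // hipMalloc/hipMemcpy/hipFree mirror cudaMalloc/cudaMemcpy/cudaFree,
    // and kernels launch with the same <<<grid, block>>> syntax under hipcc.
    void run(int n, float a, const float* hx, float* hy)
    {
        float *dx, *dy;
        hipMalloc((void**)&dx, n * sizeof(float));
        hipMalloc((void**)&dy, n * sizeof(float));
        hipMemcpy(dx, hx, n * sizeof(float), hipMemcpyHostToDevice);
        hipMemcpy(dy, hy, n * sizeof(float), hipMemcpyHostToDevice);
        saxpy<<<(n + 255) / 256, 256>>>(n, a, dx, dy);
        hipMemcpy(hy, dy, n * sizeof(float), hipMemcpyDeviceToHost);
        hipFree(dx); hipFree(dy);
    }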


HIP is a pathetic CUDA API clone. Gratuitous renames don't do it any good and are more representative of NIH than anything else (sadly).

They should have shipped a proper header set instead of hipify.


It doesn't really help you understand what they are, but for completeness: CUDA is an acronym for "Compute Unified Device Architecture", while HIP is "Heterogeneous-compute Interface for Portability".


What is an FN AI?


An autonomous Belgian firearm.


Soon. :(


A 'fUCKIn' ai'


I don't see any comparisons with torch.compile. Kind of unfair to compare it to eager mode.


Would like to see updated benchmarks supporting the claimed performance gains in this project. They currently show torch 1.12, which is rather weak compared to the latest 2.0 with torch.compile().


Anyone seen details on if this can handle splitting a model across GPUs?


It reminds me of Theano.



