For those interested in learning more about this, Chris Lattner (project lead on MLIR and Swift for TensorFlow) and I co-taught two lessons covering these topics. They're lessons 13 and 14 here:
https://course.fast.ai/videos/?lesson=13
Meh. I haven't been able to get any speedup out of XLA, and it wasn't for lack of trying. I will take TF seriously when it's at least as fast as PyTorch, and frugal enough with GPU RAM to process batches of the same size as PyTorch. Right now, PyTorch simply blows it out of the water, and it doesn't have any advanced compiler backend or anything. Just lots of good old-fashioned performance-engineering elbow grease, nothing fancy.
In short, you don't want to spend weeks on hand-optimized kernels every time you want to implement a new type of layer for a large-scale network, or target experimental hardware.
As someone who has spent weeks on hand-optimized kernels, I've yet to see a compiler based system that comes anywhere near my kernels. I've tried them all, because writing and optimizing assembly on multiple platforms is hard AF.
The highest-performance approach (on CPU) seems to be JIT assembly, as used in Intel OpenVINO (see the sketch below). On Intel, that beats the living Jesus out of everything else, including my work. On ARM it's a free-for-all, with no clear leader, especially in quantized inference. On GPU, whatever it is NVIDIA is doing is the right thing to do.
I'm skeptical that XLA/MLIR/TVM-like approaches can come close to, let alone exceed, the performance of hand-tuned kernels, for the same reason hand-tuned assembly beats the shit out of what the compiler generates most of the time. I've yet to see it happen in practice. And you only need a few of those kernels, strategically placed where most of the computation happens, per the Pareto principle. For something like TF, Google has the resources to get that done. It just chooses not to, to sell you more GPU-hours.
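For anyone curious what "JIT assembly" means in concrete terms, here is a minimal, hypothetical sketch using Xbyak, the JIT assembler that oneDNN (which OpenVINO's CPU plugin builds on) uses to emit its kernels. The kernel below is a toy, not anything from OpenVINO; the point is only the workflow of emitting machine code at runtime for the exact ISA you detect, then calling it like an ordinary function.

    // Toy JIT'd kernel with Xbyak: emit x86-64 machine code at runtime, then call it.
    // Assumes the System V x86-64 ABI (Linux/macOS): integer args in edi, esi; result in eax.
    #include <xbyak/xbyak.h>
    #include <cstdio>

    struct AddKernel : Xbyak::CodeGenerator {
        AddKernel() {
            mov(eax, edi);  // eax = first argument
            add(eax, esi);  // eax += second argument
            ret();          // return eax
        }
    };

    int main() {
        AddKernel k;                                 // machine code is emitted here
        auto add2 = k.getCode<int (*)(int, int)>();  // view the code buffer as a function
        std::printf("%d\n", add2(40, 2));            // prints 42
    }

A real inference kernel would emit vectorized loops specialized to the tensor shapes, cache sizes, and ISA extensions detected at startup, which is roughly what makes the JIT approach hard to match with ahead-of-time code.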
Same here, I guess it's a complex function of hardware and what you're trying to do. Personally I ended up writing my own autodiffing tensor library in C++, because all existing solutions had abysmal performance on my problem (lots of local updates in large tensors). The speedup is >50x compared to TF, PyTorch, Julia, and JAX.
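For readers wondering what "writing your own autodiffing tensor library" involves at the core, here is a minimal, hypothetical sketch of a tape-based reverse-mode autodiff over scalars in C++. It is not the commenter's library; a real one would operate on tensors and specialize the handful of ops (e.g. local in-place updates) that the general frameworks dispatch slowly, which is presumably where that kind of speedup comes from.

    // Minimal tape-based reverse-mode autodiff over scalars (sketch only).
    #include <cstdio>
    #include <functional>
    #include <vector>

    struct Tape {
        std::vector<double> val, grad;                    // value and adjoint per node
        std::vector<std::function<void()>> backward_ops;  // recorded in forward order

        int leaf(double v) { val.push_back(v); grad.push_back(0.0); return (int)val.size() - 1; }

        int add(int a, int b) {
            int out = leaf(val[a] + val[b]);
            backward_ops.push_back([this, a, b, out] { grad[a] += grad[out]; grad[b] += grad[out]; });
            return out;
        }
        int mul(int a, int b) {
            int out = leaf(val[a] * val[b]);
            backward_ops.push_back([this, a, b, out] { grad[a] += grad[out] * val[b];
                                                       grad[b] += grad[out] * val[a]; });
            return out;
        }
        void backward(int out) {  // replay the tape in reverse to accumulate gradients
            grad[out] = 1.0;
            for (auto it = backward_ops.rbegin(); it != backward_ops.rend(); ++it) (*it)();
        }
    };

    int main() {
        Tape t;
        int x = t.leaf(3.0), y = t.leaf(4.0);
        int z = t.add(t.mul(x, x), t.mul(x, y));  // z = x*x + x*y
        t.backward(z);
        std::printf("dz/dx = %g, dz/dy = %g\n", t.grad[x], t.grad[y]);  // 10 and 3
    }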
> Personally I ended up writing my own autodiffing tensor library in C++
But this won't be a general library, right? You must have only included a certain subset of the functions in TF or PyTorch or whatever. Autodiffing is also included in certain proprietary libraries, like the ones from NAG. But I doubt it's possible to achieve a 50x speedup without compromising on functionality.
Of course it's a herculean task to write a library with that many features. But I don't think that's the issue; it's more that the devs of TF can't possibly optimize for every use case. For me, I knew what kind of ops I needed, so I could focus on getting those as fast as possible.
Does that mean you have efficient "CRUD" operators? This interests me because I'm building a relational language and toyed with the idea of tensors as the "table" structure, but dropped it because of updates...
Would you be able to sketch what makes PyTorch better than TF? And maybe why Google is sticking with TF -- it's not like Google lacks the technical ability to do / copy what PyTorch does.
Perhaps Google could copy PyTorch. But TensorFlow has a lot of overhead, and for both political and technical reasons there's no easy path from TensorFlow to PyTorch.
You could just as easily ask: Why is Google sticking with {Hangouts/(Allo,Duo)/Angular} instead of doing/copying {Zoom/WhatsApp/React}? It's not like Google lacks the technical ability.
It's basically imperative, GPU-accelerated NumPy with autograd and really nice libraries. You can be productive in it in a day (much less if you already know NumPy), and if something goes wrong, it tells you what went wrong. It's also easily twice as fast as TF on GPU, trivial to use on multi-GPU setups, has an easy-to-use dataset interface, and lets you run batches that are twice as large (which means faster training).
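To make the "imperative NumPy with autograd" point concrete, here is a tiny sketch using PyTorch's C++ frontend (libtorch); the Python API follows the same eager, define-by-run model, where ops run immediately and gradients fall out of an ordinary backward() call.

    // Eager autograd with libtorch: ops execute immediately, the graph is
    // recorded on the fly, and gradients are just more tensors.
    #include <torch/torch.h>
    #include <iostream>

    int main() {
        auto x = torch::randn({3, 3}, torch::requires_grad());
        auto w = torch::randn({3, 3}, torch::requires_grad());

        auto loss = x.matmul(w).pow(2).sum();  // ordinary imperative code
        loss.backward();                       // reverse-mode autodiff

        std::cout << w.grad() << std::endl;    // d(loss)/dw, a 3x3 tensor
    }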
Google certainly has _technical_ capacity to do what PyTorch does, but not the _organizational_ capacity to scrap TF and start over. So they're trying to half-ass it with TF 2.0. It still sucks though.
Most of the transistors in a modern computation environment are just sitting there waiting to be touched by an instruction (far, far more transistors in RAM than in the CPUs, overall)... computing prematurely optimized on the wrong architecture... there's still a lot of room to grow, speed-wise.
> Most of the transistors in a modern computation environment are just sitting there waiting to be touched by an instruction
True, but it's important to note that if all the transistors in a modern CPU switched on at once, it would quickly overheat. This is the "power wall": we can squeeze more transistors onto a die than we can actually turn on at one time, due to electrical and thermal constraints.
> far, far more transistors in RAM than in the CPUs, overall
Also true, and this is an active area of research. Many people have tried various approaches to performing computation in DDR and other memory technologies. In the past, people were trying to use DDR to run automata. These days there seems to be a lot of focus on processor-in-memory technologies; it turns out memristors can actually be used for computation, effectively turning the entire memory array into a hugely wide SIMD RISC processor. Here is some recent work presented on this subject:
> there's still a lot of room to grow, speed wise.
Yes and no. Single-threaded performance is close to tapping out. Production processes can only shrink so far before physics starts getting in the way. Pipelines and speculation can only get so deep (and they broaden the attack surface for security vulnerabilities). Performance growth for massively parallel workloads is continuing along at a healthy clip, and will probably continue to do so for quite some time. Of course the trouble is that end-user desktop software is generally not massively parallel.
Actually, we can significantly boost single-threaded raw speed. We can't do the same for the memory wall, however, because the approach is based on MOS current-mode logic (MCML).
We can build 20 GHz CPUs now with passive cooling, but they don't beat cutting-edge CMOS cores on memory-hard single-threaded workloads. They do reach 2-cycle add and roughly 3-cycle multiply latency, though. I hope someone just plops down a RISC-V core with that kind of design, paired with explicit preloading into a tiny cache that gets 2- or 3-cycle load latency into registers.
I'm sure some computations would do well on that sort of very fast, shallow-pipeline core: highly sequential stuff like SAT/SMT solvers and other inherently divide-and-conquer algorithms.
You're changing the effective throughput whether you do it by upping the clock rate or deepening the pipeline. Using half the data per cycle at twice the clock rate moves the same bytes per second, so it causes the same memory pressure.
> which you can use with like explicit preloading into a tiny cache
That will kill it. As soon as you put it on the compiler designers or programmers to do something special to realize performance benefits, you're going to lose to architectures that don't.
Sure, compiler writers and programmers will optimize for your architecture... if it's popular and widely used. So you have a chicken-and-egg problem where you need to get adoption in the first place by running existing workloads faster.
> We can build 20 GHz CPUs now with passive cooling,
Citation? Like for real, that's cool and I'd like to read about it!
See the related HN discussion from earlier today about differentiable programming with Swift for TensorFlow and its corresponding changes to the Swift language: https://news.ycombinator.com/item?id=20890149
Julia would have been the proper answer given the target community.
Naturally with Lattner on board, it is Swift all the way, even though its support is a WIP on Linux and nonexistent on Windows.
As for Kotlin/Native, it is very immature, with memory semantics incompatible with the other Kotlin variants, and to be honest I don't see Kotlin ever taking off outside Android.