I've wondered for a while: is there a space for a (new?) language which invisibly maximises performance, whatever hardware it is run on?
As in, every instruction, from a simple loop of calculations onward, is designed behind the scenes so that it intelligently maximises usage of every available CPU core in parallel, and also farms everything possible out to the GPU?
Has this been done? Is it possible?
There's definitely a space for it. It may even be possible. But if you consider the long history of Lisp discussions (flamewars?) about "a sufficiently smart compiler" and comparisons to C, or maybe Java vs C++, it seems unlikely. At the very least, it would be very, very difficult.
There are little bits of research on algorithm replacement. Like, have the compiler detect that you're trying to sort, and generate the code for quicksort or timsort. It works, kinda. But there are a lot of ways to hide a sort in code, and the compiler can't readily find them all.
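To make that concrete, such a pass has to look at a hand-rolled loop nest like the one below, prove that the whole thing really is a sort, and only then substitute a better algorithm. This is just an illustrative C++ sketch of the before/after, not code from any real compiler:

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // A hand-rolled O(n^2) sort. An algorithm-replacement pass would have to
    // prove this whole loop nest is "a sort" before it could touch it.
    void hand_rolled_sort(std::vector<int>& v) {
        for (std::size_t i = 0; i + 1 < v.size(); ++i)
            for (std::size_t j = 0; j + 1 < v.size() - i; ++j)
                if (v[j] > v[j + 1])
                    std::swap(v[j], v[j + 1]);
    }

    // What the pass would ideally emit instead: the library's O(n log n) sort.
    void what_the_pass_would_emit(std::vector<int>& v) {
        std::sort(v.begin(), v.end());
    }

Compilers already do this kind of idiom recognition for much simpler patterns (memset/memcpy-style loops, popcount loops), but recognising every disguised sort is a far harder problem.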
Not for mixed CPU/GPU, but there is the concept of a superoptimizer, which basically brute-forces its way to the most efficient correct code. It isn't practical for anything beyond very, very short program snippets (and existing superoptimizers are usually CPU-only, though there's nothing fundamental stopping one from targeting the GPU as well).
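To get a feel for why it doesn't scale, here is a toy brute-force search in that spirit: enumerate every straight-line program up to a few instructions over a tiny op set, and keep the first (shortest) one that agrees with a reference function on a set of test inputs. The op set and the target function are made up for illustration, and unlike a real superoptimizer this only tests candidates instead of proving equivalence:

    // toy_superopt.cpp -- a toy brute-force "superoptimizer" sketch.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    using u32 = std::uint32_t;

    enum Op { ADD, SUB, AND, OR, XOR };
    const char* op_name[] = {"add", "sub", "and", "or", "xor"};

    struct Insn { Op op; int a, b; };   // a, b index earlier values; -1 means the constant 1

    // The function we want to re-derive: clear the lowest set bit.
    u32 reference(u32 x) { return x & (x - 1); }

    u32 operand(const std::vector<u32>& regs, int idx) { return idx < 0 ? 1u : regs[idx]; }

    u32 apply(Op op, u32 a, u32 b) {
        switch (op) {
            case ADD: return a + b;
            case SUB: return a - b;
            case AND: return a & b;
            case OR:  return a | b;
            case XOR: return a ^ b;
        }
        return 0;
    }

    // Does this candidate compute reference() on every test input?
    bool matches(const std::vector<Insn>& prog, const std::vector<u32>& tests) {
        for (u32 x : tests) {
            std::vector<u32> regs = {x};                 // r0 holds the input
            for (const Insn& in : prog)
                regs.push_back(apply(in.op, operand(regs, in.a), operand(regs, in.b)));
            if (regs.back() != reference(x)) return false;
        }
        return true;
    }

    // Depth-first enumeration of every program up to max_len instructions.
    bool search(std::vector<Insn>& prog, int max_len, const std::vector<u32>& tests) {
        if (!prog.empty() && matches(prog, tests)) return true;
        if ((int)prog.size() == max_len) return false;
        int nregs = 1 + (int)prog.size();                // r0 plus one result per instruction
        for (int op = ADD; op <= XOR; ++op)
            for (int a = -1; a < nregs; ++a)
                for (int b = -1; b < nregs; ++b) {
                    prog.push_back({(Op)op, a, b});
                    if (search(prog, max_len, tests)) return true;
                    prog.pop_back();
                }
        return false;
    }

    int main() {
        std::vector<u32> tests = {0, 1, 2, 3, 7, 12, 96, 255, 1u << 31, 0xdeadbeef};
        for (int len = 1; len <= 3; ++len) {             // iterative deepening: shortest first
            std::vector<Insn> prog;
            if (search(prog, len, tests)) {
                std::printf("found a %d-instruction program (operand -1 is the constant 1):\n",
                            (int)prog.size());
                for (std::size_t i = 0; i < prog.size(); ++i)
                    std::printf("  r%zu = %s %d, %d\n",
                                i + 1, op_name[prog[i].op], prog[i].a, prog[i].b);
                return 0;
            }
        }
        std::printf("no program found within the length bound\n");
        return 0;
    }

Even with five ops and a couple of registers the search space explodes combinatorially, which is why real superoptimizers (Massalin's original, STOKE, souper) only target short instruction sequences and add aggressive pruning plus solver-based verification on top.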
I'm not sure that's even possible in principle; consider the various anti-performance algorithms of proof-of-waste systems, where every step is data-dependent on the previous one and the table of intermediate results required may be made arbitrarily big.
It's a bit like "design a zip algorithm which can compress any file".
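For a concrete picture of that structure, here's a toy loop in that spirit (the mixing function is made up, not a real hash): each iteration needs the result of the previous one, and it also reads back into a table whose size you can crank up arbitrarily, so no compiler can legally run the iterations concurrently or elide the memory traffic:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Made-up 64-bit mixing step, purely for illustration (not a real hash).
    std::uint64_t mix(std::uint64_t x) {
        x ^= x >> 33;
        x *= 0xff51afd7ed558ccdULL;   // multiplier borrowed from common 64-bit mixers
        x ^= x >> 29;
        return x;
    }

    std::uint64_t waste(std::uint64_t seed, std::size_t n) {
        std::vector<std::uint64_t> table(n);
        std::uint64_t x = seed;
        for (std::size_t i = 0; i < n; ++i) {
            table[i] = x;
            // Iteration i cannot start before iteration i-1 has finished, and the
            // "random" lookups into earlier entries keep the whole table live.
            x = mix(x ^ table[x % (i + 1)]);
        }
        return x;
    }

    int main() {
        std::printf("%llx\n", (unsigned long long)waste(42, std::size_t(1) << 20));
        return 0;
    }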
I don’t see why such a “proof of waste” algorithm would be an obstacle to such an optimizer existing. Wouldn’t it just be that for such computational problems, the optimal implementation would still be rather costly? That doesn’t mean the optimizer failed. If it made the program as efficient as possible, for the computational task it implements, then the optimizer has done its job.
I'd imagine it wouldn't be very difficult to build language constructs that denote when high parallelism is desirable, and let the compiler deal with that information as it sees fit.
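We're partway there already. With OpenMP, for instance, you annotate a loop as parallelizable and the compiler plus runtime decide how many cores actually run it; a minimal sketch (standard OpenMP, build with -fopenmp):

    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> xs(10'000'000, 1.5);
        double sum = 0.0;

        // The pragma only says "this loop may be split across threads";
        // how many threads actually run is decided by the runtime.
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < (long)xs.size(); ++i)
            sum += xs[i] * xs[i];

        std::printf("sum = %f\n", sum);
        return 0;
    }

Newer OpenMP versions even have target directives for offloading the same kind of loop to a GPU, though compiler support for that varies a lot.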
I'm not sure that's a good idea at the moment; we should start by making development with vector instructions more approachable. The code should look more or less the same as code that works with plain u64s.
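Something close to that already exists in C++ as std::experimental::simd (the Parallelism TS v2, shipped in recent libstdc++, with a std::simd successor planned for C++26). A rough sketch, assuming that header is available; the point is that the vector body reads almost exactly like scalar u64 code:

    // build with: g++ -O2 -march=native simd_scale.cpp
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <experimental/simd>
    #include <vector>

    namespace stdx = std::experimental;
    using u64v = stdx::native_simd<std::uint64_t>;   // as many lanes as the target CPU offers

    // Multiply every element by 3 and add 1, a full vector register at a time.
    void scale(std::vector<std::uint64_t>& v) {
        std::size_t i = 0;
        for (; i + u64v::size() <= v.size(); i += u64v::size()) {
            u64v x;
            x.copy_from(&v[i], stdx::element_aligned);   // vector load
            x = x * 3 + 1;                               // vector math, scalar-looking syntax
            x.copy_to(&v[i], stdx::element_aligned);     // vector store
        }
        for (; i < v.size(); ++i)                        // scalar tail for the leftovers
            v[i] = v[i] * 3 + 1;
    }

    int main() {
        std::vector<std::uint64_t> v = {1, 2, 3, 4, 5, 6, 7, 8, 9};
        scale(v);
        for (auto x : v) std::printf("%llu ", (unsigned long long)x);
        std::printf("\n");
        return 0;
    }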