I meant to say that they optimized deeply for known and popular use cases, and that it doesn't take an ungodly amount of expertise to do better when the way you express your problem, its dimensions, or whatever else falls outside what they covered (edit to add: in short, if your use case doesn't fit).
I also meant to say that the domain is full of low-hanging fruit if your problem falls outside what NVIDIA optimized deeply. An intern may beat the cuXXX libraries with a little work, and you can work your way up to max perf, yes, with serious effort.
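To make that concrete, here's a minimal sketch of the kind of case I mean, with a made-up shape (all names and sizes are hypothetical, not anything from cuBLAS): a big batch of tiny 8x8 matvecs with the bias and ReLU fused in. A generic path tuned for large GEMMs leaves a lot on the table here; one boring hand-written kernel does everything in a single launch.

    // Toy sketch, not anything shipped by NVIDIA: a big batch of tiny
    // 8x8 matvecs with the bias and ReLU fused in. Shapes this small and
    // this odd are where a generic large-GEMM path loses: one hand-written
    // kernel does everything in one launch, with no intermediate buffers
    // between the matmul and the epilogue.
    __global__ void batched_tiny_matvec_relu(
        const float* __restrict__ A,     // [batch][8][8], row-major
        const float* __restrict__ x,     // [batch][8]
        const float* __restrict__ bias,  // [batch][8]
        float* __restrict__ y,           // [batch][8]
        int batch)
    {
        int b   = blockIdx.x * blockDim.y + threadIdx.y;  // which tiny problem
        int row = threadIdx.x;                            // which output row, 0..7
        if (b >= batch) return;

        const float* Ab = A + b * 64;
        const float* xb = x + b * 8;

        float acc = bias[b * 8 + row];
        #pragma unroll
        for (int k = 0; k < 8; ++k)
            acc += Ab[row * 8 + k] * xb[k];

        y[b * 8 + row] = fmaxf(acc, 0.0f);  // fused epilogue, no second pass
    }

    // Launch: 8 rows per problem, 32 problems per block.
    // dim3 block(8, 32);
    // batched_tiny_matvec_relu<<<(batch + 31) / 32, block>>>(A, x, bias, y, batch);

Nothing clever in there, and that's the point: for a shape the library wasn't tuned for, fusing the epilogue and picking a sane launch geometry is already most of the win.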
There are probably thousands of man-hours sunk into BLAS on Intel hardware, and anyone who has seriously tried to write AVX2/AVX512 code knows it's hard to reach actual max perf on all problems. Yet I don't read 'only Intel experts can write efficient code'. It's no more true for CUDA than for the other parallel or memory-weird architectures I've worked on. Yes, it's different, but getting max perf has always been hard on any modern hardware.
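To make 'hard to reach max perf' concrete, here's the textbook first hurdle in miniature (my sketch, not MKL's code): a dot product. One accumulator means one serial FMA dependency chain, so the loop is latency-bound; getting anywhere near peak already means splitting into independent accumulators by hand, and that's before alignment, remainder handling, or cache blocking.

    #include <immintrin.h>
    #include <cstddef>

    // Reduce a __m256 of 8 floats to a single float.
    static float hsum(__m256 v) {
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(v),
                              _mm256_extractf128_ps(v, 1));
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }

    // Naive: one accumulator = one serial FMA dependency chain. Each FMA
    // waits several cycles on the previous one, so you get a fraction of
    // the two FMAs per cycle the core can actually issue.
    float dot_naive(const float* a, const float* b, size_t n) {  // n % 8 == 0
        __m256 acc = _mm256_setzero_ps();
        for (size_t i = 0; i < n; i += 8)
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                  _mm256_loadu_ps(b + i), acc);
        return hsum(acc);
    }

    // Nearer peak: four independent chains keep the FMA units fed.
    float dot_unrolled(const float* a, const float* b, size_t n) {  // n % 32 == 0
        __m256 c0 = _mm256_setzero_ps(), c1 = c0, c2 = c0, c3 = c0;
        for (size_t i = 0; i < n; i += 32) {
            c0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      c0);
            c1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  c1);
            c2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), c2);
            c3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), c3);
        }
        return hsum(_mm256_add_ps(_mm256_add_ps(c0, c1), _mm256_add_ps(c2, c3)));
    }

And even the unrolled version is nowhere near what MKL does once data stops fitting in cache. The gap between 'compiles and vectorizes' and 'peak' is real on CPUs too; nobody concludes from it that mortals can't write AVX.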
As for the gulf between acceptable and good, the problem is similar here too: people stop when they've reached their goal or feel they can scale more efficiently by other means. I really don't see the difference from heavily optimized x86 stuff. We keep seeing new ways to improve AVX512 code and new places to apply it (JSON parsing, UTF-8 validation...), and AVX512 has been out for a while too. There hasn't been a free lunch there for a long, long time.
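For the curious, the JSON/UTF line refers to work in the vein of simdjson and Lemire's UTF-8 validation. A toy taste of the trick (my code, not theirs): AVX-512BW classifies 64 bytes per instruction, so the all-ASCII fast path of a UTF-8 validator is one mask test per 64 bytes of input.

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // The fast path behind SIMD UTF-8 validation, in miniature (this is
    // a toy, not simdjson's code): bytes with the high bit clear are
    // plain ASCII, and _mm512_movepi8_mask grabs the top bit of all 64
    // bytes at once. Real validators only fall back to a table-driven
    // SIMD check of continuation bytes when this mask says they must.
    // Compile with AVX-512BW enabled (e.g. -mavx512bw).
    bool all_ascii(const uint8_t* p, size_t n) {  // n % 64 == 0 to keep it short
        for (size_t i = 0; i < n; i += 64) {
            __m512i block = _mm512_loadu_si512(p + i);
            if (_mm512_movepi8_mask(block) != 0)
                return false;  // some byte has its high bit set
        }
        return true;
    }

That kind of application kept turning up years after the silicon shipped, which is exactly my point: finding it took ordinary, patient optimization work, not membership in some priesthood.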