When I write CUDA kernels, I optimize very specifically for NVIDIA GPU features:
* I have 1024 threads per block
* I have 48KB of shared memory per block
* I have 32 threads per warp and need to make sure that global-to-shared memory reads are coalesced.
* Think SIMD and avoid branching as much as possible
My kernels usually follow a specific pattern:
(1) Read global memory into shared memory, making sure that if thread i reads memory[n], then thread i+1 reads memory[n+1].
(2) __syncthreads().
(3) Do the computation in the most thread-balanced way possible (sketched below).
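Here is a minimal sketch of that pattern. The kernel name, the TILE constant, and the toy three-point sum are placeholders of mine, not from any real codebase:

    #define TILE 1024  // one element per thread at 1024 threads/block (4KB of the 48KB shared limit)

    // Launch: three_point_sum<<<(n + TILE - 1) / TILE, TILE>>>(in, out, n);
    __global__ void three_point_sum(const float* in, float* out, int n)
    {
        __shared__ float tile[TILE];
        int gid = blockIdx.x * TILE + threadIdx.x;

        // (1) Coalesced load: thread i reads in[base+i] and thread i+1 reads
        // in[base+i+1], so each warp's 32 loads collapse into one transaction.
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;

        // (2) Wait until the whole tile is visible to every thread in the block.
        __syncthreads();

        // (3) Every thread does identical work from shared memory; the clamped
        // indices avoid divergent branches at the tile edges.
        int l = (threadIdx.x == 0)        ? 0        : threadIdx.x - 1;
        int r = (threadIdx.x == TILE - 1) ? TILE - 1 : threadIdx.x + 1;
        if (gid < n) out[gid] = tile[l] + tile[threadIdx.x] + tile[r];
    }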
This very specific pattern doesn't really work elsewhere. In fact, optimizing in this fashion and then porting to C++ or another platform loses the specific optimization, while programming in a more general way loses all the things that make the program fast. Anyway, I'm definitely going to look this over more.
AMD's "HCC" is C++ with GPU features added on top. Unlike CUDA, which extends the language itself, HCC implements all the relevant features in C++ templates alone.
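For flavor, a GPU vector add in HCC's "HC" API looks roughly like this (adapted from memory of the ROCm hcc samples, so treat the exact names as approximate); the dispatch is an ordinary templated function call, with no special launch syntax:

    #include <hc.hpp>
    #include <vector>

    int main() {
        std::vector<int> a(1024, 1), b(1024, 2), c(1024, 0);
        hc::array_view<int, 1> av_a(1024, a), av_b(1024, b), av_c(1024, c);

        // The "kernel launch" is just a template function taking a lambda;
        // the [[hc]] attribute marks the lambda as GPU code.
        hc::parallel_for_each(av_c.get_extent(), [=](hc::index<1> i) [[hc]] {
            av_c[i] = av_a[i] + av_b[i];
        });

        av_c.synchronize();  // copy results back to the host vector
        return 0;
    }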
If you simply use 64 threads per "warp" (AMD schedules in 64-wide wavefronts) and 32KB of LDS (AMD's name for shared memory) per block, you can write high-performance code that is portable between AMD and NVIDIA GPUs.
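A sketch of what that lowest common denominator looks like in practice (the constants and the toy reduction are my own choices): a block size that is a multiple of 64 divides evenly into both NVIDIA's 32-wide warps and AMD's 64-wide wavefronts, and the shared-memory footprint stays under AMD's 32KB LDS limit (and therefore NVIDIA's 48KB limit too):

    #define BLOCK 256  // multiple of 64 and <= 1024: valid on both vendors

    __global__ void block_sum(const float* in, float* out, int n)
    {
        __shared__ float tile[BLOCK];  // 1KB, far below the 32KB LDS budget
        int gid = blockIdx.x * BLOCK + threadIdx.x;
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
        __syncthreads();

        // Tree reduction synchronized only with __syncthreads(), never with
        // warp-level tricks, so it is correct for 32- and 64-wide SIMD alike.
        for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
    }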
AMD seems like it's a bit behind with regard to GPGPU adoption. But AMD's hardware tends to be a good bit cheaper: you can get HBM2 models at ~$2000 from AMD (the Radeon Pro WX 9100, which is Vega architecture).
Although... as they say... hardware is cheap. I'm sure most datacenters will prefer the $8000 NVIDIA V100 instead, because more people are using that hardware. In particular, it's easier to get started with a V100 thanks to AWS and other cloud-compute offerings.
There is no OpenCL in the story there. AMD created HIP as a direct copy of CUDA. It then gets compiled to work on AMD hardware, or can be translated to CUDA.
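To show what "direct copy" means in practice, here is a minimal HIP vector add (my own toy example); apart from the cuda-to-hip renames it is character-for-character the CUDA program, and hipcc even accepts the <<<>>> launch syntax:

    #include <hip/hip_runtime.h>

    __global__ void vadd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        hipMalloc(&a, n * sizeof(float));  // cudaMalloc in the CUDA version
        hipMalloc(&b, n * sizeof(float));
        hipMalloc(&c, n * sizeof(float));
        vadd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // same launch syntax
        hipDeviceSynchronize();            // cudaDeviceSynchronize
        hipFree(a); hipFree(b); hipFree(c);
        return 0;
    }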
AMD also supports OpenCL (which I prefer to both CUDA and HIP), but it's not connected to HIP.
The fact that HIP exists is why we chose CUDA for our program 8 years ago. OpenCL and CUDA were the choices. I thought about how I would design such a GPU language and then looked at CUDA and OpenCL; what I dreamt up matched CUDA exactly. I foresaw that one day it would no longer matter what you wrote in CUDA, because it could be easily translated to any type of GPU hardware. In the past few years, the Portland Group made a CUDA-to-x86 compiler, and now there's HIP.
Google won out over Oracle in the end, but it took a long time (from 2012 until 2016, four years of court cases), with some courts finding that API structures were copyrightable and others finding that they were not, or that reimplementation was fair use. I guess we now have a precedent thanks to this, but it could still be an issue? IANAL, so I don't know.
Somebody speculated that Oracle used the lawsuit to establish firm boundaries around what can be considered copyright infringement, with the intent to implement Amazon/GCE APIs for drop-in compatibility with their own cloud offering. So they tried all kinds of ridiculous stuff to see what works and what doesn't, in order to properly cover their backs while "stealing" other APIs.
Given that Google helped sink Sun by ripping them off, and that Sun wasn't in a position to pay for lawyers to apply the same medicine they had applied to Microsoft....
AMD is working on a cuDNN compatibility layer IIRC (MIOpen), and its ROCm group has created a CUDA transpiler (to intermediary HIP, then to AMD binary via HCC).
Edit: OK, I was reading the docs and I think I got it: hipDNN is a wrapper. Once finished, a tool will search and replace calls (from cuDNN to hipDNN), and hipDNN itself will in turn call MIOpen. Not sure if that's right; I would appreciate it if someone who knows more could confirm.
Yup! I spent one very LONG weekend trying to get TensorFlow running on my AMD GPU. Managed to finally get it running, but it was slower than using my CPU....
The combination of TensorFlow and Codeplay's ComputeCpp might already work. At least they have added SYCL support to the linear algebra library (Eigen).
I think they also have some TensorFlow-specific SYCL kernels; I'm not sure it all goes through Eigen. Plus they're at TF 1.6, whereas AMD's HIP version seems to be stuck on TF 1.0. AMD's method of maintaining a pure fork and running hipify on every upstream commit doesn't seem very sustainable...
From Codeplay's blogs, it seems they ported the tensor part of Eigen to SYCL. The matrix part is not supported under SYCL.
TensorFlow does have an option to compile models to custom code (XLA); that part might not use Eigen.
"HIP is not intended to be a drop-in replacement for CUDA, and developers should expect to do some manual coding and performance tuning work to complete the port."
It's a great start, but I'm sure a lot of cases are not yet handled (like inline asm instructions?).
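For example (my own illustration), inline PTX is exactly the kind of thing a mechanical port can't translate; the AMD side needs a hand-written equivalent behind a platform check (the macro name has varied across HIP versions):

    // NVIDIA-only: reads the %laneid special register via inline PTX.
    // No transpiler can turn this into AMD GCN automatically.
    __device__ int lane_id() {
    #ifdef __HIP_PLATFORM_HCC__
        return threadIdx.x % 64;   // hand-ported fallback for 64-wide wavefronts
    #else
        int lane;
        asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
        return lane;
    #endif
    }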